Imports in Java from 2013 to 2016: Winners and losers

Felipe Hoffa
3 min readJun 29, 2016

--

With all of GitHub open source contents inside BigQuery, we can now go into the actual contents on each file. For example, let’s find out the most popular Java imports in 2016:

    SELECT line, COUNT(*) c
FROM (
SELECT SPLIT(content, '\n') line
FROM [fh-bigquery:github_extracts.contents_java_2016]
HAVING REGEXP_MATCH(line, '^import')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 300
Most popular imports in 2016 .java files

Meanwhile, in 2010:

Most popular imports in 2010 .java files

Similar set, but we can see some subtle differences: For example, junit.framework.TestCase was popular in 2010, while org.junit.Test is the one for 2016.

In the following query, I try to spot the major winners and losers between 2013 and 2016. For that, I’ll look into the top 300 imports and see what percent of the imports for that year represent. Then I’ll compare these percentage through the years and find the biggest winners in popularity.

SELECT COALESCE(a.line,b.line) line, a.c imports_2013, b.c imports_2016, ROUND(100*a.ratio,2) ratio_2013, ROUND(100*b.ratio, 2) ratio_2016, ROUND(100*(b.ratio-a.ratio)/(b.ratio+a.ratio),2) win
FROM (
SELECT *, RATIO_TO_REPORT(c) OVER() ratio
FROM (
SELECT line, COUNT(*) c
FROM (
SELECT SPLIT(content, '\n') line
FROM [fh-bigquery:github_extracts.contents_java_2013]
HAVING REGEXP_MATCH(line, '^import')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 300
)
) a
OUTER JOIN EACH (
SELECT *, RATIO_TO_REPORT(c) OVER() ratio
FROM (
SELECT line, COUNT(*) c
FROM (
SELECT SPLIT(content, '\n') line
FROM [fh-bigquery:github_extracts.contents_java_2016]
HAVING REGEXP_MATCH(line, '^import')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 300
)
) b
ON a.line=b.line
ORDER BY win DESC, b.c DESC, a.c DESC

Top winners:

Biggest raise in imports from 2013 to 2016

Interesting winners! javax.inject.Inject, Android classes, the Nullable annotation, Mockito, …

I did an OUTER JOIN so we could find the 2016 imports that were not part of the top 300 in 2013:

Top imports in 2016 that were not top 300 in 2013

Hello java.util.Optional and java.util.Objects!

Now, if you want to get results like these you’ll need to find a way to date each file in the BigQuery repository. I did it by grouping all identical files (given their contents, as found on the contents table), and finding the first commit that added this file to github. This heuristic could be improved, but in the meantime, this was my code to date and extract the contents_java_201* tables:

SELECT id, size, content, binary, copies, sample_repo_name, sample_path, sample_commit,
min_commiter_timestamp, min_author_timestamp, min_committer_time, min_author_time, max_commiter_timestamp, max_author_timestamp, max_committer_time, max_author_time, avg_commiter_timestamp, avg_author_timestamp, avg_committer_time, avg_author_time
FROM [fh-bigquery:github_extracts.contents_%s] a
JOIN (
SELECT sha1, sample_commit,
min_commiter_timestamp, min_author_timestamp, min_committer_time, min_author_time, max_commiter_timestamp, max_author_timestamp, max_committer_time, max_author_time, avg_commiter_timestamp, avg_author_timestamp, avg_committer_time, avg_author_time
FROM [fh-bigquery:github_extracts.file_ages]
WHERE YEAR(min_commiter_timestamp)=%s
) b
ON a.id=b.sha1

UPDATE: @AbrahamMarin was asking about ‘static’ imports:

SELECT rank, line, c 
FROM (
SELECT ROW_NUMBER() OVER(ORDER BY c DESC) rank, line, c,
FROM (
SELECT line, COUNT(*) c
FROM (
SELECT REGEXP_EXTRACT((SPLIT(content, '\n')), '(.*;)') line
FROM [fh-bigquery:github_extracts.contents_java_2016]
HAVING REGEXP_MATCH(line, '^import')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 300
)
)
WHERE line CONTAINS 'static'

Find more GitHub on BigQuery content at https://medium.com/@hoffa/b3576fd2b150

--

--

Felipe Hoffa
Felipe Hoffa

Written by Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.

No responses yet