Imports in Java from 2013 to 2016: Winners and losers
With all of GitHub open source contents inside BigQuery, we can now go into the actual contents on each file. For example, let’s find out the most popular Java imports in 2016:
SELECT line, COUNT(*) c
FROM (
SELECT SPLIT(content, '\n') line
FROM [fh-bigquery:github_extracts.contents_java_2016]
HAVING REGEXP_MATCH(line, '^import')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 300
Meanwhile, in 2010:
Similar set, but we can see some subtle differences: For example, junit.framework.TestCase was popular in 2010, while org.junit.Test is the one for 2016.
In the following query, I try to spot the major winners and losers between 2013 and 2016. For that, I’ll look into the top 300 imports and see what percent of the imports for that year represent. Then I’ll compare these percentage through the years and find the biggest winners in popularity.
SELECT COALESCE(a.line,b.line) line, a.c imports_2013, b.c imports_2016, ROUND(100*a.ratio,2) ratio_2013, ROUND(100*b.ratio, 2) ratio_2016, ROUND(100*(b.ratio-a.ratio)/(b.ratio+a.ratio),2) win
FROM (
SELECT *, RATIO_TO_REPORT(c) OVER() ratio
FROM (
SELECT line, COUNT(*) c
FROM (
SELECT SPLIT(content, '\n') line
FROM [fh-bigquery:github_extracts.contents_java_2013]
HAVING REGEXP_MATCH(line, '^import')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 300
)
) a
OUTER JOIN EACH (
SELECT *, RATIO_TO_REPORT(c) OVER() ratio
FROM (
SELECT line, COUNT(*) c
FROM (
SELECT SPLIT(content, '\n') line
FROM [fh-bigquery:github_extracts.contents_java_2016]
HAVING REGEXP_MATCH(line, '^import')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 300
)
) b
ON a.line=b.line
ORDER BY win DESC, b.c DESC, a.c DESC
Top winners:
Interesting winners! javax.inject.Inject, Android classes, the Nullable annotation, Mockito, …
I did an OUTER JOIN so we could find the 2016 imports that were not part of the top 300 in 2013:
Hello java.util.Optional and java.util.Objects!
Now, if you want to get results like these you’ll need to find a way to date each file in the BigQuery repository. I did it by grouping all identical files (given their contents, as found on the contents table), and finding the first commit that added this file to github. This heuristic could be improved, but in the meantime, this was my code to date and extract the contents_java_201* tables:
SELECT id, size, content, binary, copies, sample_repo_name, sample_path, sample_commit,
min_commiter_timestamp, min_author_timestamp, min_committer_time, min_author_time, max_commiter_timestamp, max_author_timestamp, max_committer_time, max_author_time, avg_commiter_timestamp, avg_author_timestamp, avg_committer_time, avg_author_time
FROM [fh-bigquery:github_extracts.contents_%s] a
JOIN (
SELECT sha1, sample_commit,
min_commiter_timestamp, min_author_timestamp, min_committer_time, min_author_time, max_commiter_timestamp, max_author_timestamp, max_committer_time, max_author_time, avg_commiter_timestamp, avg_author_timestamp, avg_committer_time, avg_author_time
FROM [fh-bigquery:github_extracts.file_ages]
WHERE YEAR(min_commiter_timestamp)=%s
) b
ON a.id=b.sha1
UPDATE: @AbrahamMarin was asking about ‘static’ imports:
SELECT rank, line, c
FROM (
SELECT ROW_NUMBER() OVER(ORDER BY c DESC) rank, line, c,
FROM (
SELECT line, COUNT(*) c
FROM (
SELECT REGEXP_EXTRACT((SPLIT(content, '\n')), '(.*;)') line
FROM [fh-bigquery:github_extracts.contents_java_2016]
HAVING REGEXP_MATCH(line, '^import')
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 300
)
)
WHERE line CONTAINS 'static'
Find more GitHub on BigQuery content at https://medium.com/@hoffa/b3576fd2b150