All the open source code in GitHub now shared within BigQuery: Analyze all the code!
Jun 29, 2016
All the open source code in GitHub is now available in BigQuery. Go ahead, analyze it all. In this post you’ll find the related resources I know of so far:
Update: I know I said all — but it’s not all. I’m updating the answers to these and other questions at github.com/fhoffa/analyzing_github.
What the pipeline mirrors:
- Only projects that have a clear open source license.
- Forks and non-notable projects are not included.
- Nevertheless, the result represents terabytes of code.
Official sources:
In-depth analysis
- Read Francesc’s step-by-step guide to analyze Go code. Use these patterns for any other language too :).
- Run a full JavaScript static code analyzer within a SQL query: Running JSHint inside BigQuery.
- Java imports: Most used Java imports, from 2013 to 2016.
- Top Angular directives.
- Tabs or spaces (the holy wars).
- SQL commas — leading or trailing?
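To give a feel for what a style analysis such as tabs vs. spaces has to decide for each file, here is a minimal Python sketch. It is not the query from the linked post: the function names, the majority-vote rule, and the sample strings are all assumptions made for illustration.

```python
from collections import Counter


def indentation_style(source: str) -> str:
    """Classify one file by majority vote over its indented lines.

    Hypothetical helper: a line votes 'tabs' if it starts with a tab and
    'spaces' if it starts with a space. Unindented lines do not vote.
    """
    tabs = 0
    spaces = 0
    for line in source.splitlines():
        if line.startswith("\t"):
            tabs += 1
        elif line.startswith(" "):
            spaces += 1
    if tabs == spaces:
        # No indented lines at all, or an exact tie between the two styles.
        return "none" if tabs == 0 else "mixed"
    return "tabs" if tabs > spaces else "spaces"


def tally(files):
    """Count how many files fall into each indentation style."""
    return Counter(indentation_style(text) for text in files)


# Made-up file contents standing in for rows of a contents table.
SAMPLE_FILES = [
    "def f():\n\treturn 1\n",
    "def g():\n    return 2\n",
    "x = 1\n",
]

if __name__ == "__main__":
    for style, count in sorted(tally(SAMPLE_FILES).items()):
        print(style, count)
```

The same decision can be written as a SQL expression over file contents; the Python version only makes the rule explicit.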
I’m waiting for your contributions — I will add them here:
- One hour after the dataset announcement, @thomasdarimont was able to find all the Java projects that declare a certain dependency.
- Lakshmanan V “Popular Java projects on GitHub that could use some help” (analyzed using BigQuery and Dataflow).
- Guillaume Laforge “What can we learn from million lines of Groovy code on Github?”.
- Filippo Valsorda “Analyzing Go Vendoring with BigQuery”.
- The Go project uses BigQuery stats to guide design decisions, more than once.
- David Gageot analyzes 281,212 Docker projects.
- Kan Nishida uses R to cluster R packages.
- Aja Hammerly compares most popular gems according to Rubygems.org download data vs GitHub gem calls.
- Sergey Abakumoff looks at the most popular npm packages and trending keywords. Justin Beckwith performs a similar analysis. Sergey follows up with a deeper assessment on why almost empty packages duplicate all over GitHub. Sergey Abakumoff also analyzes Angular vs React messages.
- Brent Shaffer analyzes PHP code and libraries — also test coverage for different languages.
- A full run down by Egor Zhuk, “Yet another analysis of Github data with Google BigQuery”.
- John-David Dalton informs the travis-ci team on the counts for Node versions tested.
- Alex Zhitnitsky reviews 779,236 Java logging statements from 1,313 GitHub repositories to answer “ERROR, WARN or FATAL?”.
- Florin Badita “Naming conventions in Python import statements”. Then “Naming conventions in Python def function()”.
- Guillaume Laforge “Analyzing half a million Gradle build files”. Then, in 2017, “Gradle vs Maven and Gradle in Kotlin or Groovy”.
- @anvaka “analyzed ~2TB of code to build an index of the most common words in programming languages”. Cool visualizations, full code on GitHub, and a lot of comments on reddit.
- Sergey Abakumoff comes back, linking code to Stack Overflow.
- Gareth Rushgrove finds all kinds of metrics for Puppet.
- Justine Tunney tells us how Googlers used BigQuery and GitHub to patch thousands of vulnerable projects (HN).
- Walker Harrison found the top imports in Jupyter (.ipynb) notebooks.
- Jake McCrary went for the top Clojure libraries.
- Sebastian Baltes went searching for Stack Overflow code that shows up in GitHub projects.
- Steren Giannini found all the constant regular expressions in Go — to improve Go’s regex capabilities (article).
- Matt Warren analyses C# code on GitHub with BigQuery.
- Michał Janaszek “State of npm scripts” (queries).
A series of posts by Robert Kozikowski:
- Advanced GitHub search with BigQuery.
- Top emacs packages used in GitHub repos.
- Visualizing relationships between python packages.
Tips
- Don’t analyze the main [bigquery-public-data:github_repos.contents] table — at 1.5 TB, it will instantly consume your monthly free terabyte. Use instead the official [bigquery-public-data:github_repos.sample_contents] extract (~23 GB), or one of the full language tables I left at [fh-bigquery:github_extracts.contents_*].
- How about doing a JOIN between this new dataset and the GitHub Archive to find the most starred files and their patterns? Sample code soon, but see how I played with GitHub stars and Hacker News previously.
- I’m pretty excited about getting author and committer timezones. We’ll be able to perform some regional analysis here.
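The arithmetic behind the first tip can be sketched in a few lines of Python. The table sizes and the one-terabyte monthly allowance come from the tip itself. The conversion of 1 TB to 1000 GB and the helper names are assumptions for illustration, and the figures treat every query as a scan of the whole table.

```python
# Monthly free allowance from the tip above, in GB.
# Assumption: 1 TB is treated as 1000 GB for this estimate.
FREE_TIER_GB = 1000

# Approximate table sizes quoted in the tip, in GB.
TABLE_SIZES_GB = {
    "bigquery-public-data:github_repos.contents": 1500,
    "bigquery-public-data:github_repos.sample_contents": 23,
}


def full_scans_within_free_tier(table: str) -> int:
    """How many whole-table scans fit inside the monthly free allowance."""
    return FREE_TIER_GB // TABLE_SIZES_GB[table]


def free_tier_share(table: str) -> float:
    """Fraction of the monthly allowance that one whole-table scan uses."""
    return TABLE_SIZES_GB[table] / FREE_TIER_GB


if __name__ == "__main__":
    for name in TABLE_SIZES_GB:
        scans = full_scans_within_free_tier(name)
        share = free_tier_share(name)
        print(f"{name}: {scans} full scans, {share:.1%} of the allowance each")
```

Under these assumptions a single scan of the main contents table already exceeds the allowance, while the sample table can be scanned in full dozens of times. BigQuery bills by the bytes in the columns a query reads, so queries that touch only a few columns cost less than these worst-case numbers.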
Visualizations
- Google Data Studio 360 dashboard (previous post about Data Studio).
More resources
- Podcast: Will Curran, Arfon Smith, and I talk about the details of this announcement and more on The Changelog #209.
- GitHub Archive, monitoring GitHub since 2011.
Press
Social media
Stay curious! And find me on Twitter at @felipehoffa.