All the open source code in GitHub now shared within BigQuery: Analyze all the code!

Jun 29, 2016

All the open source code in GitHub is now available in BigQuery. Go ahead, analyze it all. In this post you’ll find the related resources I know of so far:

Update: I know I said all — but it’s not all. I’m updating the answers to these and other questions at github.com/fhoffa/analyzing_github.

The pipeline mirrors code from:

Projects that have a clear open source license.
Forks and/or un-notable projects are not included.
Nevertheless, it represents terabytes of code.

Official sources:

In depth analysis

Read Francesc’s step-by-step guide to analyzing Go code. Use these patterns for any other language too :).
Run a full JavaScript static code analyzer within a SQL query: Running JSHint inside BigQuery.
Java imports: Most used Java imports, from 2013 to 2016.
Top Angular directives.
Tabs or spaces (the holy wars).
SQL commas — leading or trailing?

I’m waiting for your contributions — I will add them here:

One hour after the dataset announcement, @thomasdarimont was able to find all the Java projects that declare a certain dependency.
Lakshmanan V “Popular Java projects on GitHub that could use some help” (analyzed using BigQuery and Dataflow).
Guillaume Laforge “What can we learn from million lines of Groovy code on Github?”.
Filippo Valsorda “Analyzing Go Vendoring with BigQuery”.
The Go project uses BigQuery stats to guide design decisions, more than once.
David Gageot analyzes 281,212 Docker projects.
Kan Nishida uses R to cluster R packages.
Aja Hammerly compares most popular gems according to Rubygems.org download data vs GitHub gem calls.
Sergey Abakumoff looks at the most popular npm packages and trending keywords. Justin Beckwith performs a similar analysis. Sergey follows up with a deeper assessment on why almost empty packages duplicate all over GitHub. Sergey Abakumoff also analyzes Angular vs React messages.
Brent Shaffer analyzes PHP code and libraries — also test coverage for different languages.
A full rundown by Egor Zhuk, “Yet another analysis of Github data with Google BigQuery”.
John-David Dalton informs the travis-ci team on the counts for Node versions tested.
Alex Zhitnitsky reviews 779,236 Java Logging Statements, 1,313 GitHub Repositories to determine “ERROR, WARN or FATAL”?
Florin Badita “Naming conventions in Python import statements”. Then “Naming conventions in Python def function()”.
Guillaume Laforge “Analyzing half a million Gradle build files”, and in 2017 “Gradle vs Maven and Gradle in Kotlin or Groovy”.
@anvaka “analyzed ~2TB of code to build an index of the most common words in programming languages”. Cool visualizations, full code on GitHub, and a lot of comments on reddit.
Sergey Abakumoff comes back, linking code to StackOverflow.
Gareth Rushgrove finds all kinds of metrics for Puppet.
Justine Tunney tells us how Googlers used BigQuery and GitHub to patch thousands of vulnerable projects (HN).
Walker Harrison found the top imports in Jupyter (.ipynb) notebooks.
Jake McCrary went for the top Clojure libraries.
Sebastian Baltes went searching for Stack Overflow code that shows up in GitHub projects.
Steren Giannini found all the constant regular expressions in Go — to improve Go’s regex capabilities (article).
Matt Warren analyzes C# code on GitHub with BigQuery.
Michał Janaszek “State of npm scripts” (queries).

A series of posts by Robert Kozikowski:

Advanced GitHub search with BigQuery.
Top emacs packages used in GitHub repos.
Visualizing relationships between python packages.

Tips

Don’t analyze the main [bigquery-public-data:github_repos.contents] table — at 1.5 TB, it will instantly consume your monthly free terabyte. Use instead the official [bigquery-public-data:github_repos.sample_contents] extract (~23 GB), or one of the full language tables I left at [fh-bigquery:github_extracts.contents_*].
How about doing a JOIN between this new dataset and the GitHub Archive to find the most starred files and their patterns? Sample code soon, but see how I played with GitHub stars and Hacker News previously.
I’m pretty excited about getting author and committer timezones. We’ll be able to perform some regional analysis here.
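
To make the first tip above concrete, here is a minimal Python sketch of the arithmetic behind it. It takes the approximate table sizes quoted in the tip, treats the monthly free terabyte as 1024 GB (an assumption), and counts how many complete scans of each table fit in that quota. It also converts the bracketed legacy SQL table names used in this post into the backtick form that standard SQL expects. The helper names to_standard_sql and full_scans_within_quota are hypothetical, made up for this illustration; they are not part of any BigQuery client library.

```python
import re

# Approximate sizes quoted in the tip above, in GB (not measured here).
TABLE_SIZES_GB = {
    "[bigquery-public-data:github_repos.contents]": 1.5 * 1024,
    "[bigquery-public-data:github_repos.sample_contents]": 23,
}

# Assumption: the "monthly free terabyte" is treated as 1024 GB.
FREE_QUOTA_GB = 1024

# Legacy SQL table references look like [project:dataset.table].
_LEGACY_REF = re.compile(
    r"^\[(?P<project>[^:\]]+):(?P<dataset>[^.\]]+)\.(?P<table>[^\]]+)\]$"
)


def to_standard_sql(legacy_ref: str) -> str:
    """Rewrite [project:dataset.table] as `project.dataset.table`."""
    match = _LEGACY_REF.match(legacy_ref)
    if match is None:
        raise ValueError(f"not a legacy table reference: {legacy_ref!r}")
    return "`{project}.{dataset}.{table}`".format(**match.groupdict())


def full_scans_within_quota(table_size_gb: float,
                            quota_gb: float = FREE_QUOTA_GB) -> int:
    """Count complete table scans that fit in the quota (worst case)."""
    return int(quota_gb // table_size_gb)


if __name__ == "__main__":
    for ref, size_gb in TABLE_SIZES_GB.items():
        print(to_standard_sql(ref), full_scans_within_quota(size_gb))
```

Treat the scan counts as a worst case: BigQuery bills on-demand queries by the columns a query actually reads, so a query that skips the content column should process far less than the full table size.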

Visualizations

Google Data Studio 360 dashboard (previous post about Data Studio).

More resources

Podcast: Will Curran, Arfon Smith, and I talk about the details of this announcement and more on The Changelog #209.
GitHub Archive, monitoring GitHub since 2011.

Press

Venture Beat.

Social media

Stay curious! And find me on Twitter at @felipehoffa.

Written by Felipe Hoffa
