All the open source code in GitHub now shared within BigQuery: Analyze all the code!

4 min read · Jun 29, 2016

All the open source code in GitHub is now available in BigQuery. Go ahead, analyze it all. In this post you’ll find the related resources I know of so far:

Update: I know I said all — but it’s not all. I’m updating the answers to these and other questions at github.com/fhoffa/analyzing_github.

What the pipeline mirrors:

  • Only projects that have a clear open source license.
  • Forks and un-notable projects are not included.
  • Even so, it adds up to terabytes of code.

Official sources:

In-depth analysis

I’m waiting for your contributions — I will add them here:

A series of posts by Robert Kozikowski:

Tips

  • Don’t analyze the main [bigquery-public-data:github_repos.contents] table: at 1.5 TB, it will instantly consume your monthly free terabyte. Instead, use the official [bigquery-public-data:github_repos.sample_contents] extract (~23 GB), or one of the full language tables I left at [fh-bigquery:github_extracts.contents_*].
  • How about doing a JOIN between this new dataset and the GitHub Archive to find the most starred files and their patterns? Sample code soon, but see how I played with GitHub stars and Hacker News previously.
  • I’m pretty excited about getting author and committer timezones. We’ll be able to perform some regional analysis here.
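
As a rough illustration of the first tip, the sketch below compares how much of the free monthly terabyte a full scan of each table would use, and builds a small legacy SQL query against the sample extract. The helper names (`quota_fraction`, `count_matches_query`), the 1024 GB-per-TB conversion, and the assumption that the table exposes a `content` column are mine, not something stated in this post. Treat the figures as upper bounds, since BigQuery bills for the columns a query reads rather than for the whole table.

```python
# Table sizes and the 1 TB monthly free quota are the figures quoted in the tips.
FREE_QUOTA_GB = 1024  # assumption: 1 TB counted as 1024 GB

TABLE_SIZES_GB = {
    "bigquery-public-data:github_repos.contents": 1.5 * 1024,
    "bigquery-public-data:github_repos.sample_contents": 23,
}


def quota_fraction(scanned_gb, quota_gb=FREE_QUOTA_GB):
    """Return the fraction of the monthly free quota used by scanning `scanned_gb`."""
    return scanned_gb / quota_gb


def count_matches_query(table, needle):
    """Build a legacy SQL query counting files whose content contains `needle`.

    Hypothetical helper: it assumes a `content` column and does no escaping,
    so it rejects needles that contain a single quote.
    """
    if "'" in needle:
        raise ValueError("needle must not contain a single quote")
    return (
        "SELECT COUNT(*) AS files "
        "FROM [{}] "
        "WHERE content CONTAINS '{}'"
    ).format(table, needle)


if __name__ == "__main__":
    # Show the worst-case share of the free quota for a full scan of each table.
    for table, size_gb in TABLE_SIZES_GB.items():
        print("{}: {:.1%} of the free quota".format(table, quota_fraction(size_gb)))

    # Print a query that targets the small sample extract instead of the full table.
    print(count_matches_query(
        "bigquery-public-data:github_repos.sample_contents", "TODO"))
```

Pasting the printed query into the BigQuery console with legacy SQL enabled should let you check the estimated bytes before running it.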

Visualizations


Written by Felipe Hoffa

Developer Advocate around the data world. Ex-Google BigQuery, Ex-Snowflake, Always Felipe
