400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Tabs or spaces. We are going to parse a billion files across 14 programming languages to decide which one comes out on top.

Felipe Hoffa
3 min readAug 30, 2016

The rules:

  • Data source: GitHub files stored in BigQuery.
  • Stars matter: We’ll only consider the top 400,000 repositories, ranked by the number of stars they received on GitHub during the period Jan-May 2016.
  • No small files: Files need to have at least 10 lines that start with a space or a tab.
  • No duplicates: Duplicate files only have one vote, regardless of how many repos they live in.
  • One vote per file: Some files mix tabs and spaces. Each file votes for whichever one starts more of its lines (see the sketch after this list).
  • Top languages: We’ll look into files with the extensions (.java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc, .go).
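
To make the “one vote per file” rule concrete, here is a minimal sketch in the same legacy BigQuery SQL used below. It runs against a hypothetical [my_dataset.file_lines] table with one row per (file_id, line) pair; the table name and columns are made up purely for illustration:

SELECT file_id,
  SUM(LEFT(line, 1)=' ') space_lines,   # lines starting with a space
  SUM(LEFT(line, 1)='\t') tab_lines,    # lines starting with a tab
  IF(SUM(LEFT(line, 1)=' ')>SUM(LEFT(line, 1)='\t'), 'space', 'tab') best
FROM [my_dataset.file_lines]  # hypothetical: one row per (file_id, line)
GROUP BY file_id
HAVING space_lines + tab_lines >= 10  # the "no small files" rule

Note that ties go to tabs here, just as they do in the full query further down.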

Numbers

[Chart: spaces vs. tabs for each file extension, produced from the query results below]

How-to

I used the existing [bigquery-public-data:github_repos.sample_files] table, which lists the files of the top 400,000 repositories. From there I extracted the contents of every file with one of the top language extensions:

SELECT a.id id, size, content, binary, copies, sample_repo_name, sample_path
FROM (
  SELECT id, FIRST(path) sample_path, FIRST(repo_name) sample_repo_name
  FROM [bigquery-public-data:github_repos.sample_files]
  WHERE REGEXP_EXTRACT(path, r'\.([^\.]*)$') IN ('java','h','js','c','php','html','cs','json','py','cpp','xml','rb','cc','go')
  GROUP BY id
) a
JOIN [bigquery-public-data:github_repos.contents] b
ON a.id = b.id
864.6s elapsed, 1.60 TB processed

That query took a relatively long time, since it involved joining a 190-million-row table with a 70-million-row one, scanning over 1.6 terabytes of contents along the way. But don’t worry about having to run it: I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].
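
The GROUP BY id in the inner query is what enforces the “no duplicates” rule: each unique file id keeps a single sample path and repo name. If you want to double-check that the extract really holds one row per file, a quick count works (a sketch; EXACT_COUNT_DISTINCT is legacy SQL’s exact version of the approximate COUNT(DISTINCT)):

SELECT COUNT(*) total_rows, EXACT_COUNT_DISTINCT(id) unique_ids
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
# the two numbers should match if every file id appears exactly once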

In this [contents] extract each unique file is represented only once; its copies column counts how many times that file shows up across repositories. To see the total number of files and bytes it represents:

SELECT SUM(copies) total_files, SUM(copies*size) total_size
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
1 billion files, 14 terabytes of code
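
You can slice that total any way you like. For example, a per-extension breakdown (my own variation, not part of the original analysis) looks like:

SELECT REGEXP_EXTRACT(sample_path, r'\.([^\.]*)$') ext,
  SUM(copies) total_files, SUM(copies*size) total_size
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
GROUP BY ext
ORDER BY total_files DESC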

Then it was time to run the ranking according to the previously established rules:

SELECT ext, tabs, spaces, countext, LOG((spaces+1)/(tabs+1)) lratio
FROM (
  SELECT REGEXP_EXTRACT(sample_path, r'\.([^\.]*)$') ext,
         SUM(best='tab') tabs, SUM(best='space') spaces,
         COUNT(*) countext
  FROM (
    SELECT sample_path, sample_repo_name,
           IF(SUM(line=' ')>SUM(line='\t'), 'space', 'tab') WITHIN RECORD best,
           COUNT(line) WITHIN RECORD c
    FROM (
      SELECT LEFT(SPLIT(content, '\n'), 1) line, sample_path, sample_repo_name
      FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
      HAVING REGEXP_MATCH(line, r'[ \t]')
    )
    HAVING c>10  # at least 10 lines that start with a space or tab
  )
  GROUP BY ext
)
ORDER BY countext DESC
LIMIT 100
16.0s elapsed, 133 GB processed
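
A note on how the query works: SPLIT(content, '\n') turns each file into a repeated field of lines, LEFT(line, 1) keeps just each line’s first character, and the WITHIN RECORD aggregations tally those characters per file, so every file casts exactly one vote. The lratio column is a smoothed log ratio: positive means spaces win for that extension, negative means tabs win, and the +1 terms keep LOG() defined when one side gets zero votes. LOG() here is the natural logarithm, so for example:

SELECT LOG((100+1)/(10+1)) lratio  # ≈ 2.2, i.e. spaces roughly 10x more common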

Analyzing each line of 133 GB of code in 16 seconds? That’s why I love BigQuery.

Update: Press and social media

Reddit, Hacker News, Slashdot

Gizmodo, Business Insider, ADTMag, i-programmer

Gizmodo.es, Business Insider.pl, Le blog du Modérateur, Tecnoblog (pt), Biglobe (jp), MyDrivers (cn), Genk.vn.

More?

Want more stories? Check out my Medium posts, follow me on Twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery: every month you get a full terabyte of analysis for free.

