400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.

The rules:

Numbers

How-to

SELECT a.id id, size, content, binary, copies, sample_repo_name , sample_path
FROM (
SELECT id, FIRST(path) sample_path, FIRST(repo_name) sample_repo_name
FROM [bigquery-public-data:github_repos.sample_files]
WHERE REGEXP_EXTRACT(path, r'\.([^\.]*)$') IN ('java','h','js','c','php','html','cs','json','py','cpp','xml','rb','cc','go')
GROUP BY id
) a
JOIN [bigquery-public-data:github_repos.contents] b
ON a.id=b.id
864.6s elapsed, 1.60 TB processed
SELECT SUM(copies) total_files, SUM(copies*size) total_size
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
1 billion files, 14 terabytes of code
SELECT ext, tabs, spaces, countext, LOG((spaces+1)/(tabs+1)) lratio
FROM (
SELECT REGEXP_EXTRACT(sample_path, r'\.([^\.]*)$') ext,
SUM(best='tab') tabs, SUM(best='space') spaces,
COUNT(*) countext
FROM (
SELECT sample_path, sample_repo_name, IF(SUM(line=' ')>SUM(line='\t'), 'space', 'tab') WITHIN RECORD best,
COUNT(line) WITHIN RECORD c
FROM (
SELECT LEFT(SPLIT(content, '\n'), 1) line, sample_path, sample_repo_name
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
HAVING REGEXP_MATCH(line, r'[ \t]')
)
HAVING c>10 # at least 10 lines that start with space or tab
)
GROUP BY ext
)
ORDER BY countext DESC
LIMIT 100
16.0s elapsed, 133 GB processed

Update: Press and social media

More?

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.