What countries have more open source developers per capita than the US?
We are going to use GitHub Archive, GHTorrent, population statistics and BigQuery to find the answer.
This post was inspired by Silver Keskkula’s cool post analyzing GitHub Archive with BigQuery and asking “I jumped into this hoping to learn something about the changing geography of git pushes for Teleport, but it looks like the Github Archive guys have not made the location field on push history available […] but wouldn’t it be cool?”. I hope you like the answer Silver!
Top countries by number of pushes Aug 2016
It’s no surprise to see the US as the top country with the number of most pushes into GitHub, but it would be more interesting to normalize this number by country population. Let’s do that next.
country pushes
---------------- ---------
United States 1709062
Hungary 428786
Germany 319787
United Kingdom 304752
China 250708
France 151529
Canada 143815
India 109823
Japan 108934
Brazil 104802
Russia 101331
Australia 90174
Netherlands 75198
Spain 60553
Switzerland 53652
Ukraine 51811
Sweden 51584
Italy 46350
Poland 44280
Estonia 36515
Taiwan 35856
Belgium 32486
Greece 30541
Norway 29659
Mexico 27956
Top countries by number of pushes / population
The US now shows up in 6th place. Hungary, Estonia, Switzerland, Norway, and New Zealand had more pushes per capita than the US. But what if we look at the pushes by unique users instead? That, on the next section.
country pushes pop pushes/pop * 1000
---------------- --------- ----------- -------------------
Hungary 428786 9982000 42.955920660
Estonia 36515 1291170 28.280551750
Switzerland 53652 7581000 7.077166601
Norway 29659 5009150 5.920964635
New Zealand 24077 4252277 5.662142894
United States 1709062 310232863 5.508965051
Luxembourg 2693 497538 5.412651898
Sweden 51584 9555893 5.398134952
Iceland 1645 308910 5.325175617
United Kingdom 304752 62348447 4.887884377
Denmark 24939 5484000 4.547592998
Netherlands 75198 16645000 4.517753079
Canada 143815 33679000 4.270168354
Finland 22097 5244000 4.213768116
Australia 90174 21515754 4.191068554
Germany 319787 81802257 3.909268665
Ireland 14956 4622917 3.235186788
Belgium 32486 10403000 3.122753052
Austria 24734 8205000 3.014503352
Greece 30541 11000000 2.776454545
Czech Republic 25370 10476000 2.421725850
Latvia 5216 2217969 2.351701038
France 151529 64768389 2.339551784
Slovenia 4331 2007000 2.157947185
Israel 14189 7353985 1.106882867
Top countries by number of unique ids / population
Finally we discover that Iceland is the country with more GitHub users per capita contributing to GitHub during the last month. The US goes even further down the ranking, with NZ, Sweden, Switzerland, Norway, Finland, Denmark, Netherlands, Canada, the UK, Australia, Luxembourg, Estonia, and Ireland showing a larger presence of developers per capita.
country population uniques uniques/pop * 10000
---------------- ------------ --------- ---------------------
Iceland 308910 149 4.823411350
New Zealand 4252277 1370 3.221803283
Sweden 9555893 3014 3.154074664
Switzerland 7581000 2322 3.062920459
Norway 5009150 1509 3.012487149
Finland 5244000 1412 2.692601068
Denmark 5484000 1472 2.684172137
Netherlands 16645000 4277 2.569540403
Canada 33679000 7865 2.335283114
United Kingdom 62348447 14361 2.303345262
Australia 21515754 4896 2.275541912
Luxembourg 497538 111 2.230985372
Estonia 1291170 282 2.184065615
Ireland 4622917 1009 2.182604620
United States 310232863 65703 2.117860737
Belgium 10403000 1615 1.552436797
Austria 8205000 1230 1.499085923
Germany 81802257 11770 1.438835606
Czech Republic 10476000 1401 1.337342497
Slovenia 2007000 260 1.295465869
France 64768389 7852 1.212319794
Israel 7353985 814 1.106882867
Lithuania 2944459 298 1.012070469
Bulgaria 7148785 706 0.987580407
Latvia 2217969 219 0.987389814
How-to
Step 1: Find the data
GitHub Archive has a full timeline of GitHub’s events, but it doesn’t show us where the users are. The GitHub on BigQuery dataset snapshots the contents of GitHub, but doesn’t have this data either (though we can see timezones). Turns out GHTorrent has this data, and we just need to load it into BigQuery to get started.
bq load --max_bad_records 10000 --replace \
fh-bigquery:ghtorrent.users \
gs://mybucket/ghtorrent/users.csv \
id,login,company,created_at,type,fake,deleted,long,lat,country_code,state,city,location(33s)
The slowest part was downloading and uncompressing GHTorrent, but as I shared the September snapshot, you can skip this step.
Note that only ~9% of GitHub users have a country code on GHTorrent.
SELECT SUM(country_code!='\\N')/COUNT(*)
FROM [fh-bigquery:ghtorrent.users] b0.0916
For normalization purposes, I’ll only look at countries with a population larger than 300,000.
Licenses
Note that we are using 2 datasets with an CC SA license, hence our results will need to respect its terms:
- GHTorrent: http://ghtorrent.org/faq.html
- GeoNames: http://download.geonames.org/export/dump/readme.txt
Step 2: Count the number of pull events during August by country:
SELECT country_code, COUNT(*) pushes,
FROM [githubarchive:month.201608] a
JOIN [fh-bigquery:ghtorrent.users] b
# http://ghtorrent.org/faq.html
ON a.actor.login=b.login
WHERE country_code != '\\N'
AND a.type='PushEvent'
GROUP BY 1
ORDER BY 2 DESC3.2s elapsed, 832 MB processed
Step 3: Bring in population statistics to normalize
SELECT c.country, COUNT(*) pushes, FIRST(c.population) population, 1000*COUNT(*)/FIRST(c.population) ratio_pushes
FROM [githubarchive:month.201608] a
JOIN [fh-bigquery:ghtorrent.users] b
# http://ghtorrent.org/faq.html
ON a.actor.login=b.login
JOIN (
SELECT LOWER(iso) iso, population, country
FROM [gdelt-bq:extra.countryinfo]
WHERE population > 300000
) c
# http://download.geonames.org/export/dump/readme.txt
ON c.iso=b.country_code
WHERE country_code != '\\N'
AND a.type='PushEvent'
GROUP BY 1
ORDER BY ratio_pushes DESC
LIMIT 3003.3s elapsed, 832 MB processed
Step 4: Get the number of unique GitHub users per country too
SELECT c.country, COUNT(*) pushes, FIRST(c.population) population, 1000*COUNT(*)/FIRST(c.population) ratio_pushes, 10000*COUNT(DISTINCT login)/FIRST(c.population) ratio_unique_login
FROM [githubarchive:month.201608] a
JOIN [fh-bigquery:ghtorrent.users] b
# http://ghtorrent.org/faq.html
ON a.actor.login=b.login
JOIN (
SELECT LOWER(iso) iso, population, country
FROM [gdelt-bq:extra.countryinfo]
WHERE population > 300000
) c
# http://download.geonames.org/export/dump/readme.txt
ON c.iso=b.country_code
WHERE country_code != '\\N'
AND a.type='PushEvent'
GROUP BY 1
ORDER BY ratio_unique_login DESC
LIMIT 3002.9s elapsed, 832 MB processed
Interactive charts
Shared on a Google Sheet.
More?
Want more stories? Check my medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.