Leading with commas — ugly or efficient? An investigation over 320 GB of SQL code
Winning arguments with data: Let’s analyze 320 Gigabytes of open source SQL code to determine if we should use trailing or leading commas. Popularity is not enough — can we determine which style leads to success?
Important update: I left Google and joined Snowflake in 2020 — so I’m unable to keep my older posts updated. If you want to try Snowflake, join us — I’m having a lot of fun ❄️.
Switching from BigQuery’s custom SQL to standard SQL:2011 has been great — except when not: I’m not allowed to leave trailing commas anymore. They are really practical when adding, removing, and commenting lines (Go even enforces them). To keep my productivity up, I was forced to switch to a controversial style: Now I open my enumeration lines with commas.
Leading with commas is not a popular style — but I’ll argue that it’s a productive one. Can I prove this? Here I analyze 320 GB of SQL code from 80,000 GitHub repositories to support this choice.
Would you prefer SQL code formatted like this:
# trailing commas
or like this:
# leading commas
Let’s bring the data
Luckily we all have access to terabytes of open source code ready to be analyzed in BigQuery. There is a huge table with all of the available extracted code — and your first step should be extracting only the code you are interested in, before performing further analysis.
To get all of the code inside “.sql” files out into a new table:
SELECT a.id id, size, content, binary, copies,
, ANY_VALUE(path) sample_path
, ANY_VALUE(repo_name) sample_repo_name
FROM `bigquery-public-data.github_repos.files` a
WHERE PATH LIKE '%.sql'
GROUP BY 1
JOIN `bigquery-public-data.github_repos.contents` b
This gives us a table representing a total of 2 million SQL files out of GitHub — more than 320 GB of code!
I left an improved public copy of this table in BigQuery —so you can go straight to the next step.
First approach: Count number of lines, files, repos
My first approach using data backfires:
- There are 50 million SQL lines that end with a comma, while only 1.4 million begin with one.
- There are 329,308 unique SQL files with trailing commas, while only 18,312 have leading ones.
- There are at least 72,506 repositories with SQL files containing trailing commas, while only 3,360 repositories with leading ones.
- Summary: Leading commas lose the popularity race. There are way less projects and files using them.
Side note: There are also 5,647 SQL lines in 1,029 files that have commas at the beginning and the end. At least 360 repositories allow these monstrosities.
Second approach: Incorporate success metrics
Losing the popularity race is not the end of the road. We need to ask ourselves: “Which projects are more successful?”
How do we define success in GitHub? Number of stars? Number of stars last year? Number of stars this year? Number of active users? General activity levels? How about all of that!
- Each repository gets one vote, regardless of number of lines or files.
- Files present in many repositories get assigned to the one with most stars.
- Repositories will get assigned to one of four groups: Only leading or trailing commas, no commas at the start of end of a line, or a mix of both styles.
These are my results:
What we see here:
- Only a small percentage of projects enforce leading commas, many more allow a mix of styles, and the majority only trailing commas. Most projects (at least 69,665 repos) show only trailing commas at the end of SQL lines. 571 repos show only leading commas, and 2,847 use a mix of both styles.
- Projects that allow a mix of styles show the most success. In average mixed style projects got 29.37 stars, 11.73 people involved, and >150 events during this year. In comparison SQL comma trailing only projects have 13.06 stars, 5.49 people involved, and less than 50 events this year (so far).
- Projects that enforce leading commas don’t show as much success as the mixed ones, but still seem to do better than the only trailing ones.
- The trend is stable throughout the years — comparing the total number of stars, the stars during 2016, and the stars, people, and activity during this ongoing 2017.
#standardSQLWITH comma_lines_per_files AS (
SELECT sample_repo_name, sample_stars_2016, sample_stars
, REGEXP_CONTAINS(line, r',\s*$') has_trailing
, REGEXP_CONTAINS(line, r'^\s*,') has_leading
, UNNEST(SPLIT(content, '\n')) line
WHERE line LIKE '%,%'
), stats_per_repo AS (
, MAX(has_leading) has_leading
, MAX(has_trailing) has_trailing
, ANY_VALUE(line) sample_line
, ANY_VALUE(sample_stars) stars
, ANY_VALUE(sample_stars_2016) stars_2016
, (SELECT COUNT(DISTINCT actor.id) FROM `githubarchive.month.2017*` WHERE sample_repo_name = repo.name AND type='WatchEvent') stars_2017
, (SELECT COUNT(DISTINCT actor.id) FROM `githubarchive.month.2017*` WHERE sample_repo_name = repo.name) actors_2017
, (SELECT COUNT(*) FROM `githubarchive.month.2017*` WHERE sample_repo_name = repo.name) activity_2017
GROUP BY sample_repo_name
)SELECT COUNT(DISTINCT sample_repo_name) repos
, IF(has_trailing, IF(has_leading, 'both', 'trailing'), IF(has_leading, 'leading', 'none')) commas
, ROUND(AVG(stars), 2) avg_stars
, ROUND(AVG(stars_2016), 2) stars_2016
, ROUND(AVG(stars_2017), 2) stars_2017
, ROUND(AVG(actors_2017), 2) actors_2017
, ROUND(AVG(activity_2017), 2) activity_2017
, STRING_AGG(sample_repo_name ORDER BY stars DESC LIMIT 3) top_repos
GROUP BY commas
ORDER BY reposrepos commas stars stars17 actors17 activity17 top_repos
571 leading 22.99 7.36 10.04 39.89 drone/drone,aspnetboilerplate/aspnetboilerplate,HazyResearch/deepdiv2847 both true 29.37 6.44 11.73 156.63 apache/spark,begriffs/postgrest,mybatis/mybatis-3
5933 none false 20.05 4.8 7.57 54.43 ajaxorg/ace,zulip/zulip,fivethirtyeight/data
69665 trailing false 13.06 3.22 5.49 43.68 Microsoft/vscode,rails/rails,kubernetes/kubernetes
- Dan Nguyen asks about SQL formatting on Twitter.
- Elliott’s first data attempt to answer.
- Reddit discusses early results.
- Go forces a trailing comma.
- More SQL analysis over the GitHub repos in BigQuery.
From a PostgreSQL proposal to allow trailing commas:
In my experience, it’s quite uncommon to mess with the first item in the list, which mostly eliminates the issue. A missing leading comma is also MUCH easier to spot than a missing trailing comma.
And from reddit, on Go enforcing the trailing comma:
“And I hate that JSON doesn’t allow a trailing comma at the last item of a list.”
“It also makes reordering things sooooo nice, but has the unintended consequences of hating CSV in other languages.”
“When i first started with Go, I hated practically every stance it had on syntax and workflow. Little stuff like this drove me nuts, tripped me up, and frustrated me to no end. But after doing it for a while, stuff like this turns into the things i like about go”
“I do enjoy this little feature. It costs me time in other languages, like today with Octave, when adding and deleting items.
There are so many little things like this that make Go a pleasure to use”
“This little trailing comma feature was the first thing that made me realise that Go was designed by people who’ve had a lot of experience. Moving things around without a trailing comma causes pain in my fingers (JSON, arrrggghhh!), but in Go I can just select a few lines, command :sort (I’m a vim guy) and voilà!”
Some coders like it hot — but most prefer colder climates
Previously we found some of the major concentration of open source coders around pretty cold places (Iceland, Sweden…
Counting uniques faster in BigQuery with HyperLogLog++
As a data exploration task, counting unique users is usually slow and resource intensive. This is because your database…