Big data stories in seconds: Hacker News and BigQuery
Author’s note: I left Google in 2020, hence these posts haven’t been updated to the latest features. But I’m still having fun with Hacker News’ data — this time with Snowflake.
After having a lot of fun with reddit’s data on BigQuery (collected by @jasonbaumgart, see the announcement and Max Woolf’s how-to), it was time to play with another forum that attracts a lot of attention: Hacker News.
Firebase hosts the official Hacker News API, and my friends at Firebase (Jenny Tong, @JamesTamplin) helped me obtain a dump of all Hacker News stories and comments since 2007. With this data in BigQuery, it was time to start querying.
Let’s start by visualizing Hacker News growth 2007–2015:
It’s interesting to see how growth has been stagnant since 2012. Why? Not sure. In the meantime, I left sample code for this and other visualizations in an IPython/Jupyter notebook. Also make sure to read the comments on the Hacker News on BigQuery announcement post (thanks, Max Woolf).
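If you want a feel for what sits behind a growth chart like this one, it boils down to counting stories per period. Here is a minimal sketch in plain Python, not the notebook's actual code. It assumes each story carries a Unix timestamp in seconds (the Hacker News API exposes one as `time`; whether the dump keeps that exact layout is my assumption), and `stories_per_year` and the sample values are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def stories_per_year(timestamps):
    """Count stories per calendar year (UTC) from Unix timestamps in seconds."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).year for ts in timestamps
    )
    # Sort by year so the result is ready to plot left to right.
    return dict(sorted(counts.items()))


# Made-up timestamps, only to show the shape of the input.
sample = [1171929600, 1199145600, 1199232000, 1420070400]
print(stories_per_year(sample))
```

Swapping `.year` for a `(year, month)` tuple would give the finer-grained buckets you would want for a 2007–2015 line chart.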
Other visualizations in said notebook include the best times to post on Hacker News to get more than, let’s say, 30 votes:
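One way to think about "best time to post" is the share of stories submitted in each hour that end up above the threshold. The sketch below is my own illustration of that idea, not the query from the notebook. It assumes stories arrive as `(unix_timestamp, score)` pairs and buckets them by UTC hour; the function name `hit_rate_by_hour`, the input layout, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hit_rate_by_hour(stories, min_score=30):
    """Map each UTC posting hour to the fraction of its stories scoring above min_score."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ts, score in stories:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        totals[hour] += 1
        if score > min_score:  # strictly more than the threshold
            hits[hour] += 1
    # Only hours that actually have stories appear in the result.
    return {hour: hits[hour] / totals[hour] for hour in sorted(totals)}


# Made-up (timestamp, score) rows, only to show the shape of the input.
sample = [(1420070400, 45), (1420071000, 3), (1420113600, 31), (1420115000, 30)]
print(hit_rate_by_hour(sample))
```

A fraction is used instead of a raw count so that busy hours don't win just because more stories get posted then. Bucketing by day of week as well would be a small change to the key.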
The most fun part of having a dataset in BigQuery is the ability to start combining it with others. For example, GitHub. When a project gets posted to the Hacker News homepage it generates a lot of attention. Can we measure this?
But the fun doesn’t end there! My latest experiment is looking at the story of Bitcoin through Hacker News and reddit:
A how-to for combining the 3 datasets (another /r/bigquery post).
The best part? It only takes seconds to answer your questions once you find this data in BigQuery. If you’ve never done it, find out how — you’ll be running queries like these in less than 5 minutes from now.