Hacker News on BigQuery: Now with daily updates — So what are the top domains?
We published a copy of all Hacker News contents in BigQuery back in 2015. It was time for an update, and even better than that, how about daily updates? In this post let’s look at HN’s favorite sources, and at changes in the best times to post.
Author’s note: I left Google in 2020, hence these posts haven’t been updated to the latest features. But I’m still having fun with Hacker News’ data — this time with Snowflake.
Thanks to the BigQuery Public Datasets Program, we now update Hacker News in BigQuery daily. Nothing better than having fresh data to analyze every day. To celebrate the occasion, I want to look at the top domains that Hacker News uses as sources. I recently did something similar for reddit, which was full of interesting surprises.
So what were the top domains shared on Hacker News during 2017?
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') domain, COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY domain ORDER BY c DESC LIMIT 10
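If you're curious what that `'//([^/]*)/?'` pattern actually captures, here's the same regex ported to Python purely for illustration. BigQuery's REGEXP_EXTRACT returns the text matched by the first capturing group, which is what `group(1)` gives us here:

```python
import re

# Same pattern the query's REGEXP_EXTRACT uses: capture everything
# between the scheme's '//' and the next '/'.
DOMAIN_RE = re.compile(r'//([^/]*)/?')

def extract_domain(url):
    """Return the captured domain, or None if the URL has no '//'."""
    m = DOMAIN_RE.search(url)
    return m.group(1) if m else None

print(extract_domain('https://github.com/user/repo'))   # github.com
print(extract_domain('http://www.nytimes.com/2017/x'))  # www.nytimes.com
```

Note this keeps the `www.` prefix, so `www.nytimes.com` and `nytimes.com` count as different domains — a caveat for all the rankings below.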
That’s interesting, but not what most users see. Let’s rank by the number of posts that scored more than 40 points — the ones that make it to the front page:
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1 ORDER BY 3 DESC LIMIT 10
Certainly Hacker News likes content hosted on sites like github.com and the NYTimes. But some of those ratios look abysmal. Which domains have the best chance of getting more than 40 upvotes?
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
, ROUND(100*COUNTIF(score>40)/COUNT(*),2) chances_of_homepage
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1
HAVING count>30
ORDER BY 4 DESC LIMIT 20
Oh, that’s cool! As a Googler I’m also proud to see that 3 of our blogs are in the top 10 of Hacker News-worthy content. And a shoutout to Gwern Branwen, the only individual author in the top 20.
Let’s look at all the domains with more than 300 posts already this year:
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
, ROUND(100*COUNTIF(score>40)/COUNT(*),2) chances_of_homepage
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1
HAVING count>300
ORDER BY 4 DESC
Lessons learned:
- Mainstream sites like washingtonpost.com, bbc.com, bloomberg.com, and nytimes.com have the greatest chance of producing front-page-worthy content: all above 10%.
- TechCrunch continues to be one of HN’s favorite tech sources (>9%).
- For self-hosting, the best platform seems to be github.com (>7%). Meanwhile, posts on hosted platforms like medium.com and hackernoon.com (both on the Medium platform) have a <3% chance of reaching the front page. So should I move this post out of Medium and into a GitHub project for HN to read it?
- Wikipedia articles and YouTube videos have a low chance of reaching the front page (~2%).
- Don’t use link shorteners (goo.gl, youtu.be): they show a 0% chance of making the front page.
Looks consistent with the 2016 results:
And who remembers these top domains from 2008 and 2009?
Best time to submit for a front page chance
Many want to know: what’s the best time to submit a story to HN for a front-page chance (see Max Woolf’s analysis)? Turns out it varies week by week. Plotted here, for each week during the last 4 years, is the hour with the highest chance of making the front page:
Turns out there was a clear pattern for years: Don’t submit in the middle of the California day. That changed during 2016 — was there an algorithm or behavioral change driving this? Only the admins know.
But what I do know is that Saturdays behave very differently from Mondays. I’ll let you play with that:
#standardSQL
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY week ORDER BY chances_of_homepage DESC) rn
FROM (
SELECT TIMESTAMP_TRUNC(timestamp, WEEK) week
, EXTRACT(HOUR FROM timestamp AT TIME ZONE 'US/Pacific') hour
, ROUND(100*COUNTIF(score>40)/COUNT(*),4) chances_of_homepage
, COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE type='story' AND EXTRACT(YEAR FROM timestamp)>2012
AND EXTRACT(DAYOFWEEK FROM timestamp)=2
GROUP BY 1,2
)
)
WHERE rn=1 ORDER BY 1,2
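The inner query buckets Monday submissions by week and Pacific hour, and the ROW_NUMBER() window keeps only each week’s best hour. In BigQuery’s EXTRACT, DAYOFWEEK runs from 1 (Sunday) to 7 (Saturday), so to compare Saturdays you only need to change that one filter:

```sql
-- DAYOFWEEK in BigQuery: 1 = Sunday … 7 = Saturday.
-- Swap the Monday filter (=2) for Saturdays:
AND EXTRACT(DAYOFWEEK FROM timestamp)=7
```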
Want more?
Want more stories? Check my Medium, follow me on Twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.