Hacker News on BigQuery: Now with daily updates — So what are the top domains?
We published a copy of all Hacker News contents in BigQuery back in 2015. It was time for an update, and even better than that, how about daily updates? In this post let’s look at HN’s favorite sources, and at changes in the best times to post.
Author’s note: I left Google in 2020, hence these posts haven’t been updated to the latest features. But I’m still having fun with Hacker News’ data — this time with Snowflake.
Thanks to the BigQuery Public Datasets Program, we now update Hacker News in BigQuery daily. Nothing better than having fresh data to analyze every day. To celebrate the occasion, I want to look at the top domains that Hacker News uses as sources. I recently did something similar for reddit, which was full of interesting surprises.
So what were the top domains shared on Hacker News during 2017?
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') domain, COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY domain ORDER BY c DESC LIMIT 10
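If you're curious what that `'//([^/]*)/?'` pattern actually captures, here's the same regex ported to Python purely for illustration. BigQuery's REGEXP_EXTRACT returns the text matched by the first capturing group, which is what `group(1)` gives us here:

```python
import re

# Same pattern the query's REGEXP_EXTRACT uses: capture everything
# between the scheme's '//' and the next '/'.
DOMAIN_RE = re.compile(r'//([^/]*)/?')

def extract_domain(url):
    """Return the captured domain, or None if the URL has no '//'."""
    m = DOMAIN_RE.search(url)
    return m.group(1) if m else None

print(extract_domain('https://github.com/user/repo'))   # github.com
print(extract_domain('http://www.nytimes.com/2017/x'))  # www.nytimes.com
```

Note this keeps the `www.` prefix, so `www.nytimes.com` and `nytimes.com` count as different domains — a caveat for all the rankings below.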
That’s interesting, but not what most users see. Let’s rank by the number of posts that scored more than 40 points — the ones that make it to the front page:
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1 ORDER BY 3 DESC LIMIT 10
Certainly Hacker News likes content hosted on sites like github.com and the NYTimes. But some of those ratios look abysmal. Which domains have the best chance of getting more than 40 upvotes?
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
, ROUND(100*COUNTIF(score>40)/COUNT(*),2) chances_of_homepage
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1
HAVING count>30
ORDER BY 4 DESC LIMIT 20
Oh, that’s cool! As a Googler I’m also proud to see that 3 of our blogs are in the top 10 of Hacker News-worthy content. And a shoutout to Gwern Branwen, the only individual author in the top 20.
Let’s look at all the domains with more than 300 posts already this year:
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
, ROUND(100*COUNTIF(score>40)/COUNT(*),2) chances_of_homepage
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1
HAVING count>300
ORDER BY 4 DESC
Lessons learned:
- Mainstream sites like washingtonpost.com, bbc.com, bloomberg.com, and nytimes.com have the greatest chance of producing front-page-worthy content: all above 10%.
- TechCrunch continues to be one of HN’s favorite tech sources (>9%).
- For self-hosting, the best platform seems to be github.com (>7%). Meanwhile, posts on hosted platforms like medium.com and hackernoon.com (both on the Medium platform) have a <3% chance of reaching the front page. So should I move this post out of Medium and into a GitHub project for HN to read it?
- Wikipedia articles and YouTube videos have a low chance of reaching the front page (~2%).
- Don’t use link shorteners (goo.gl, youtu.be): they show a 0% chance of making the front page.
Looks consistent with the 2016 results:
And who remembers these top domains from 2008 and 2009?
Best time to submit for a front page chance
Many want to know: what’s the best time to submit a story to HN for a front-page chance (see Max Woolf’s analysis)? Turns out it varies week by week. Plotted here, for each week during the last 4 years, is the hour with the highest chance of making the front page:
Turns out there was a clear pattern for years: Don’t submit in the middle of the California day. That changed during 2016 — was there an algorithm or behavioral change driving this? Only the admins know.
But what I do know is that Saturdays behave very differently from Mondays. I’ll let you play with that:
#standardSQL
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY week ORDER BY chances_of_homepage DESC) rn
FROM (
SELECT TIMESTAMP_TRUNC(timestamp, WEEK) week
, EXTRACT(HOUR FROM timestamp AT TIME ZONE 'US/Pacific') hour
, ROUND(100*COUNTIF(score>40)/COUNT(*),4) chances_of_homepage
, COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE type='story' AND EXTRACT(YEAR FROM timestamp)>2012
AND EXTRACT(DAYOFWEEK FROM timestamp)=2
GROUP BY 1,2
)
)
WHERE rn=1 ORDER BY 1,2
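The inner query buckets Monday submissions by week and Pacific hour, and the ROW_NUMBER() window keeps only each week’s best hour. In BigQuery’s EXTRACT, DAYOFWEEK runs from 1 (Sunday) to 7 (Saturday), so to compare Saturdays you only need to change that one filter:

```sql
-- DAYOFWEEK in BigQuery: 1 = Sunday … 7 = Saturday.
-- Swap the Monday filter (=2) for Saturdays:
AND EXTRACT(DAYOFWEEK FROM timestamp)=7
```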
Want more?
Want more stories? Check my Medium, follow me on Twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.