Which subreddits have the most energy - how upvotes translate into pageviews

How much is a reddit upvote worth? We’ll find how much an upvote is worth, measured by imgur views.

Previously, on reddit: I posted a visualization analyzing reddit’s happiness, which got more than 4,300 upvotes on /r/dataisbeautiful. This translated into more than 1.7 million imgur views, so I started wondering: How do upvotes translate into pageviews?

How-to

Step 1: Find the top 40 subreddits

Criteria: Let’s find the subreddits which posted most imgur pictures having a score>200, during August 2016.

Warning: Some of these subs are NSFW. We are only going to analyze them though, so all is good unless you choose to go explore them further.

Top 40 imgur subreddits, August 2016
#Find the 40 top subreddits by imgur posts
SELECT subreddit, COUNT(*) c
FROM [fh-bigquery:reddit_posts.2016_08]
WHERE score > 200
AND url CONTAINS 'imgur'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 40

Step 2: Retrieve # of imgur pageviews for some sample posts

I don’t have the number of pageviews for each imgur url, but I can use imgur’s API to get those numbers. Their API is rate limited, so I will only be able to get a random sample of urls.

A sample of 905 imgur images from the top 40 Aug subreddits: Their upvotes and imgur # of views.
# Choose 12 random imgur posts for each of the 40 top subreddits.
SELECT subreddit, url, score
FROM (
SELECT FIRST(subreddit) subreddit, url, COUNT(*) c, FIRST(RAND()) rand, ROW_NUMBER() OVER(PARTITION BY subreddit ORDER BY rand) rn,
FIRST(score) score
FROM [fh-bigquery:reddit_posts.2016_08]
WHERE score > 200
AND subreddit IN (SELECT subreddit FROM (
SELECT subreddit, COUNT(*) c
FROM [fh-bigquery:reddit_posts.2016_08]
WHERE score > 200
AND url CONTAINS 'imgur'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 40
))
AND url NOT IN (
SELECT url
FROM [fh-bigquery:reddit_posts.2016_07]
WHERE score > 200
AND url CONTAINS 'imgur'
GROUP BY 1
)
AND url CONTAINS 'imgur'
GROUP BY url
HAVING c=1
)
WHERE rn<=12

Some Python to retrieve imgur’s stats with their API:

for x in res:
path = re.search(r'com/([^\.\?]*)?', x['url']).group(1)
if 'a/' in path or 'gallery/' in path:
id = path.split('/')[1]
if id in tot:
continue
tot[id]=x
try:
images = client.get_album_images(id)
views = client.get_album_images(id)[0].views if images else 0
except ImgurClientError as e:
try:
views = client.get_image(id).views
except ImgurClientError as e:
views=0
print x['subreddit'], id
else:
id = path
if id in tot:
continue
tot[id]=x
try:
views = client.get_image(id).views
except ImgurClientError as e:
views=0
print x['subreddit'], path
x['views']=views

Step 3: Chart average by subreddit

Avg views vs avg score (outliers removed, see step 4)

Step 4: discuss

Lowest energy: /r/the_donald averages more than a 1,000 upvotes per imgur submission, but in average these images get less than 21,000 views.

Top energy: /r/celebs, /r/gaming, /r/aww, and /r/interestingasfuck gather a lot of pageviews per upvote. Looking at the individual posts, this seems to be a product of some outliers with a lot of pageviews and few votes. I would remove these outliers (top left of the step 2 viz), as these pageviews probably came from a different source. Note that the query already removes links that where posted to multiple places in August and July, but that filter wasn’t enough. Still, even without those points, subs like /r/interestingasfuck continue to get top views per upvote.

Outliers removed: With the top left outliers removed, the top energy subreddits were /r/gifs, /r/interestingasfuck, /r/pokemongo, /r/aww, /r/WTF, and /r/Celebs.

Before removing outliers

Next steps: Go deeper!

Want more stories? Check my medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

Thanks to /u/stuck_in_the_matrix for sharing and collecting this data! Check pushshift.io for a realtime stream of reddit posts and comments.

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.