The most famous reddit accounts

Let’s analyze 3 billion reddit comments to find the most mentioned reddit users — who are the most famous ones? Interactive dashboard included.

Felipe Hoffa

--

How-to

  • There are more than 3 billion reddit comments stored in BigQuery, ready to be analyzed (thanks Jason Baumgartner!).
  • Instead of counting the number of mentions, we’ll count the number of distinct users mentioning the famous accounts — so the stats are harder to game.
  • You can play with an interactive Data Studio dashboard — find your favorite redditors and share what you your results! I added everyone that has gathered more than 800 redditors mentioning them.

Results

Bots on the top

  • The most famous reddit users are bots: /u/trollabot and /u/user_simulator
  • Both have been mentioned by more than 90,000 redditors each.
  • Both peaked at their introduction month, and never recovered to their initial popularity.

Virtual coins

  • The 3rd most popular reddit account has been mentioned only by 25,000 redditors. Still, /u/dogetipbot’s popularity soared during 2013.
  • In a similar category we can find /u/changetip. It’s the 7th most famous account (>10.2K mentioning).
  • Both bots popularity has gone away though, as other virtual coin bots.

The tragedy of Unidan

  • Everyone loved /u/unidan back in 2013 — until his good fame reversed. The account has accumulated more than 23.9K mentioners.
  • Redditors still keep mentioning Unidan around reddit — perhaps as a cautionary tale. Don’t manipulate votes, kids!
  • Unidan’s alternative account — /u/unidanx — still enjoys some popularity. At least enough to show up in this ranking, with more than 2.9K mentioners.

The most famous reddit CEOs

  • Usually being the CEO of reddit doesn’t make you too famous — unless a scandal breaks out.
  • /u/spez has broken that trend — with a high number of mentions, even in scandal-free months.
  • The first peak of mentions for any of the reddit CEOs happened back in July 2015. People weren’t too happy to see /u/chooter leave.
  • No other reddit CEO has been close to the number of mentioners than /u/spez got in November 2016. That was another controversial month.

They draw

  • /u/awildsketchappeared appeared after /u/shitty_watercolour — but has been more famous since. Still, shitty_watercolour has a wikipedia page, and wildsketch does not.
  • /u/shittywatercolour is not the right way to spell it — but still gets enough mentions to show up.

Top contributors

  • /u/gallowboob submits a lot of reddit front page posts — and people talk a lot about that.
  • Before /u/gallowboob showed up the most famous contributor was /u/karmanaut. His fame has clearly dropped.

Weird storytellers

  • /u/_vargas_ and /u/rogersimon10 show up with strange answers to your /r/askreddit questions. Don’t ask me why, but that gave them fame.
  • People keep mentioning /u/vargas. Misspelling, I guess.

Celebrities

  • /u/presidentobama is famous — but not as reddit-famous as other celebrities that dedicate more of their time to reddit: Wil Wheaton, Snoop Dog, and Gov Schwarzenegger.

More!

Play with the interactive Data Studio dashboard to find more famous redditors. Share your results :)

Query

#standardSQL
CREATE TEMP FUNCTION label_trend(monthlys ARRAY<STRUCT<month TIMESTAMP, c INT64>>) AS ((
SELECT CASE
WHEN c<36 AND recent_peak_corr>0 AND recent_peak_corr>growing_corr AND recent_peak_corr > ABS(comeback_corr) THEN 'recent_peak'
WHEN growing_corr>0 AND (c<36 OR growing_corr>ABS(comeback_corr)) THEN 'growing'
WHEN -1*growing_corr>ABS(comeback_corr) OR c<36 THEN 'declining'
WHEN comeback_corr>0 THEN 'comeback'
ELSE 'peaked'
END
FROM (
(SELECT AS STRUCT
CORR((IFNULL(x.c,0.1)), trends.growing) AS growing_corr
, CORR((IFNULL(x.c,0.1)), trends.comeback) AS comeback_corr
, CORR((IFNULL(x.c,0.1)), trends.recent_peak) AS recent_peak_corr
, COUNT(*) AS c
FROM UNNEST(monthlys) x
JOIN (
SELECT *, ABS(growing-54/2) comeback, -1*ABS(LEAST (0,74/2-growing)) recent_peak
FROM UNNEST(GENERATE_DATE_ARRAY('2013-01-01', '2017-06-01', INTERVAL 1 MONTH)) AS m WITH OFFSET growing
) trends
ON DATE(x.month)=trends.m))
));
WITH extract_users AS (
SELECT subreddit, author, REGEXP_EXTRACT_ALL(LOWER(body), r'/u/[a-zA-Z0-9-_]{3,20}') users, body, TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH) month
FROM `fh-bigquery.reddit_comments.20*`
WHERE _TABLE_SUFFIX >= '13'
)
SELECT * EXCEPT(monthlys), label_trend( monthlys) trend
FROM (
SELECT *
FROM (
SELECT user, month, COUNT(*) c
, APPROX_TOP_COUNT(sub, 1)[OFFSET(0)].value top_sub
, ANY_VALUE(mentioned_by) mentioned_by
, ARRAY_AGG(STRUCT(month, COUNT(*) AS c)) OVER(PARTITION BY user) monthlys
FROM (
SELECT author, user, month, COUNT(*) c
, APPROX_TOP_COUNT(subreddit, 1)[OFFSET(0)].value sub
, COUNT(DISTINCT author) OVER(PARTITION BY user) mentioned_by
FROM extract_users, UNNEST(users) user
GROUP BY 1, 2, 3
)
WHERE mentioned_by>800
GROUP BY 1, 2
)
)
ORDER BY c DESC

Query notes

  • I played with CORR() to cluster by trends. Interesting results, but I still want to play a little more with it.
  • Note the use of SQL UDFs and WITH subqueries.
  • OVER() helps me find the total number of mentions over all the years for each user, and also aggregate series for the correlation experiment.

Basic queries

Extract all username mentions:

WITH extract_users AS (
SELECT subreddit, author, REGEXP_EXTRACT_ALL(LOWER(body), r'/u/[a-zA-Z0-9-_]{3,20}') users, body, TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH) month
FROM `fh-bigquery.reddit_comments.20*`
WHERE _TABLE_SUFFIX >= '13'
)

Normalization: If an account was mentioned multiple times by the same user, it counts only once. Find also what was the top subreddit that user used that month to mention the account. Do it by month, but also do an OVER() to get the total number of mentioners across all months:

  SELECT author, user, month, COUNT(*) c 
, APPROX_TOP_COUNT(subreddit, 1)[OFFSET(0)].value sub
, COUNT(DISTINCT author) OVER(PARTITION BY user) mentioned_by
FROM extract_users, UNNEST(users) user
GROUP BY 1, 2, 3

Then count the number of mentioners each month, find the subreddit that gave the most mentions to that account, and re-use the total number of mentions we got on the previous query. Then keep only the accounts that have received at least 800 mentions by different users:

SELECT user, month, COUNT(*) c
, APPROX_TOP_COUNT(sub, 1)[OFFSET(0)].value top_sub
, ANY_VALUE(mentioned_by) mentioned_by
FROM (
[see previous query]
)
WHERE mentioned_by>800
GROUP BY 1, 2

Bonus

Next steps

Discuss on reddit:

Want more stories? Check my Medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

--

--

Felipe Hoffa
Felipe Hoffa

Written by Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.

Responses (4)