Reddit AmItheAsshole is nicer to women than to men — a SQL proof?
When redditors ask “am I the asshole” while talking about women, they have a higher change of being judged as the asshole. Let’s check out these metrics — with BigQuery, dbt, and Data Studio
Make sure to not take anything I wrote here as the absolute truth. Several people on Twitter noted problems and added corrections to the analysis I offered. Reading this post as originally presented — and the reactions — can be a great way for you to learn as much as I did while reading the responses. You can find many of their unfiltered thoughts by following this Twitter thread.
/r/amItheAsshole has grown to be the 4th most active subreddit — by number of comments. People come to this subreddit to tell their stories, and they ask other redditors “am I the asshole here?”. Turns out most people are judged as “not the asshole”, as seen in this chart:
My tweet with these results got a lot of attention:
Including the question — is reddit nicer to women or to men?
By looking at the title or content of a post, you might have a hard time deciding if “I” is a man or a woman — but it’s pretty easy to count the number of “she/he/her/his/girlfriend/boyfriend” present in the story.
Let’s look at some random posts, and the count for each of these pronouns and gendered words:
We can see that the count of gendered pronouns and words in the example matches who the story is about. These stories are about a male customer, a female girlfriend, a male neighbor, a male son, and a female teenage daughter.
With these numbers, we can now set an arbitrary rule: If there’s more than double the number of male pronouns than female, that post is about a man. We can use the opposite rule to say that the post is about a woman. If the numbers are too close or null, we will call the post “neutral”.
Another rule we can set to simplify the analysis:
- If the judgment is ‘not the a-hole’ or ‘no a-holes here’ then we can say ‘the poster is not an asshole’.
- If the judgment is ‘asshole’ or ‘everyone sucks’ then we can say ‘the poster is an asshole’.
If we aggregate all these posts, we get to the numbers:
When I first presented these results, I was told “these numbers are too close, they could be a statistical error”.
How can we tell the numbers are not merely statistical error? Let’s see the trend month by month — is it stable?
Yes! The trend varies month by month, but there’s a clear higher chance of being an asshole when talking about women than when talking about men. If the small difference was just a statistical fluke, we would expect the trend to jump wildly instead.
And please note these results are very specific, as this tweet notes:
To which I replied
To extract all the /r/AmItheAsshole posts in BigQuery to a new table, you can do:
CREATE TABLE temp.data ASSELECT *
WHERE subreddit = 'AmItheAsshole'
AND _table_suffix > '19_'
Then the gender and judgement for each post can be determined with a query like:
WITH data AS (
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhe\b')) hes
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bshe\b')) shes
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bher\b')) hers
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhis\b')) hiss
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bthey\b')) theys
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bgirlfriend\b')) gfs
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bboyfriend\b')) bfs
WHERE link_flair_text IS NOT NULL
, gendered_data AS (
WHEN males > 2+females*2 THEN 'to_male'
WHEN females > 2+males*2 THEN 'to_female'
SELECT *, hes+shes+hers+hiss+theys+gfs+bfs totalgender, hes+hiss+bfs males, shes+hers+gfs females
SELECT CASE link_flair_text
WHEN 'not the a-hole' THEN 'no asshole'
WHEN 'no a-holes here' THEN 'no asshole'
WHEN 'everyone sucks' THEN 'asshole'
WHEN 'asshole' THEN 'asshole'
WHERE link_flair_text IS NOT NULL
And finally the statistics presented here:
SELECT *, c/total_gender AS perc
SELECT *, SUM(c) OVER(PARTITION BY to_gender, MONTH ) total_gender, SUM(c) OVER(PARTITION BY judgement, MONTH) total_judgement
SELECT to_gender, judgement, CONCAT(to_gender, ': ', judgement) to_gender_judgement, month, COUNT(*) c, ARRAY_AGG(STRUCT(title, selftext) ORDER BY RAND() LIMIT 1)[OFFSET(0)] sample_title
WHERE judgement IS NOT null
AND to_gender != 'neutral'
GROUP BY 1,2,3,4
WHERE c/total_gender > 0.01
AND total_judgement > 10
ORDER BY to_gender, perc DESC
You’ll find plenty of insightful and entertaining replies on the twitter thread for this post:
Feel free to join the discussion (and tell me if I’m wrong?). Remember to be nice to each other — most people are not the asshole anyways.
I only covered until August 2019 as that’s when the current full reddit archive in BigQuery stops — until future expected updates. Check my previous post for more details on collecting live data from pushshift.io. Thanks Jason Baumgartner for the constant supply!
Inequality: How to draw a Lorenz curve with SQL, BigQuery, and Data Studio
The top 0.1% of all Wikipedia pages earn 25% of the pageviews. The bottom 99% only get 42% of all the views. And the…
Interactive: The top 2019 Wikipedia pages
Wikimedia has published their list of most popular 2019 pages— but can we go deeper? Of course, here with BigQuery and…