Reddit AmItheAsshole is nicer to women than to men — a SQL proof?
When redditors ask “am I the asshole” while talking about women, they have a higher change of being judged as the asshole. Let’s check out these metrics — with BigQuery, dbt, and Data Studio
Update 2020–02–20
Make sure to not take anything I wrote here as the absolute truth. Several people on Twitter noted problems and added corrections to the analysis I offered. Reading this post as originally presented — and the reactions — can be a great way for you to learn as much as I did while reading the responses. You can find many of their unfiltered thoughts by following this Twitter thread.
Context
/r/amItheAsshole has grown to be the 4th most active subreddit — by number of comments. People come to this subreddit to tell their stories, and they ask other redditors “am I the asshole here?”. Turns out most people are judged as “not the asshole”, as seen in this chart:
My tweet with these results got a lot of attention:
Including the question — is reddit nicer to women or to men?
Deciding gender
By looking at the title or content of a post, you might have a hard time deciding if “I” is a man or a woman — but it’s pretty easy to count the number of “she/he/her/his/girlfriend/boyfriend” present in the story.
Let’s look at some random posts, and the count for each of these pronouns and gendered words:
We can see that the count of gendered pronouns and words in the example matches who the story is about. These stories are about a male customer, a female girlfriend, a male neighbor, a male son, and a female teenage daughter.
With these numbers, we can now set an arbitrary rule: If there’s more than double the number of male pronouns than female, that post is about a man. We can use the opposite rule to say that the post is about a woman. If the numbers are too close or null, we will call the post “neutral”.
Another rule we can set to simplify the analysis:
- If the judgment is ‘not the a-hole’ or ‘no a-holes here’ then we can say ‘the poster is not an asshole’.
- If the judgment is ‘asshole’ or ‘everyone sucks’ then we can say ‘the poster is an asshole’.
If we aggregate all these posts, we get to the numbers:
When I first presented these results, I was told “these numbers are too close, they could be a statistical error”.
Statistical significance?
How can we tell the numbers are not merely statistical error? Let’s see the trend month by month — is it stable?
Yes! The trend varies month by month, but there’s a clear higher chance of being an asshole when talking about women than when talking about men. If the small difference was just a statistical fluke, we would expect the trend to jump wildly instead.
And please note these results are very specific, as this tweet notes:
To which I replied
How-to
This time I’m using dbt
for the first time, and I left all my code on GitHub. Thanks Claire Carroll for your help getting started with this awesome tool!
To extract all the /r/AmItheAsshole posts in BigQuery to a new table, you can do:
CREATE TABLE temp.data ASSELECT *
FROM `fh-bigquery.reddit_posts.20*'
WHERE subreddit = 'AmItheAsshole'
AND _table_suffix > '19_'
Then the gender and judgement for each post can be determined with a query like:
WITH data AS (
SELECT *
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhe\b')) hes
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bshe\b')) shes
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bher\b')) hers
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhis\b')) hiss
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bthey\b')) theys
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bgirlfriend\b')) gfs
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bboyfriend\b')) bfs
FROM {{ref('aita_posts')}}
WHERE link_flair_text IS NOT NULL
)
, gendered_data AS (
SELECT *
, CASE
WHEN males > 2+females*2 THEN 'to_male'
WHEN females > 2+males*2 THEN 'to_female'
ELSE 'neutral'
END to_gender
FROM (
SELECT *, hes+shes+hers+hiss+theys+gfs+bfs totalgender, hes+hiss+bfs males, shes+hers+gfs females
FROM data
)
)
SELECT CASE link_flair_text
WHEN 'not the a-hole' THEN 'no asshole'
WHEN 'no a-holes here' THEN 'no asshole'
WHEN 'everyone sucks' THEN 'asshole'
WHEN 'asshole' THEN 'asshole'
END judgement
, *
FROM gendered_data
WHERE link_flair_text IS NOT NULL
And finally the statistics presented here:
SELECT *, c/total_gender AS perc
FROM (
SELECT *, SUM(c) OVER(PARTITION BY to_gender, MONTH ) total_gender, SUM(c) OVER(PARTITION BY judgement, MONTH) total_judgement
FROM (
SELECT to_gender, judgement, CONCAT(to_gender, ': ', judgement) to_gender_judgement, month, COUNT(*) c, ARRAY_AGG(STRUCT(title, selftext) ORDER BY RAND() LIMIT 1)[OFFSET(0)] sample_title
FROM {{ref('aita_posts_gendered')}}
WHERE judgement IS NOT null
AND to_gender != 'neutral'
GROUP BY 1,2,3,4
)
)
WHERE c/total_gender > 0.01
AND total_judgement > 10
ORDER BY to_gender, perc DESC
Discussion
You’ll find plenty of insightful and entertaining replies on the twitter thread for this post:
Feel free to join the discussion (and tell me if I’m wrong?). Remember to be nice to each other — most people are not the asshole anyways.
Want more?
I only covered until August 2019 as that’s when the current full reddit archive in BigQuery stops — until future expected updates. Check my previous post for more details on collecting live data from pushshift.io. Thanks Jason Baumgartner for the constant supply!
I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.