Reddit AmItheAsshole is nicer to women than to men — a SQL proof?

Update 2020–02–20

Make sure to not take anything I wrote here as the absolute truth. Several people on Twitter noted problems and added corrections to the analysis I offered. Reading this post as originally presented — and the reactions — can be a great way for you to learn as much as I did while reading the responses. You can find many of their unfiltered thoughts by following this Twitter thread.


/r/amItheAsshole has grown to be the 4th most active subreddit — by number of comments. People come to this subreddit to tell their stories, and they ask other redditors “am I the asshole here?”. Turns out most people are judged as “not the asshole”, as seen in this chart:

Most people are judged to be “not the asshole”

My tweet with these results got a lot of attention:

Including the question — is reddit nicer to women or to men?

Deciding gender

By looking at the title or content of a post, you might have a hard time deciding if “I” is a man or a woman — but it’s pretty easy to count the number of “she/he/her/his/girlfriend/boyfriend” present in the story.

Let’s look at some random posts, and the count for each of these pronouns and gendered words:

Some random posts, and the count of each of these pronouns and gendered words.

We can see that the count of gendered pronouns and words in the example matches who the story is about. These stories are about a male customer, a female girlfriend, a male neighbor, a male son, and a female teenage daughter.

With these numbers, we can now set an arbitrary rule: If there’s more than double the number of male pronouns than female, that post is about a man. We can use the opposite rule to say that the post is about a woman. If the numbers are too close or null, we will call the post “neutral”.

Another rule we can set to simplify the analysis:

  • If the judgment is ‘not the a-hole’ or ‘no a-holes here’ then we can say ‘the poster is not an asshole’.
  • If the judgment is ‘asshole’ or ‘everyone sucks’ then we can say ‘the poster is an asshole’.

If we aggregate all these posts, we get to the numbers:

Most people are not the asshole. But if you talk about women, there’s a higher probability of being one.

When I first presented these results, I was told “these numbers are too close, they could be a statistical error”.

Statistical significance?

How can we tell the numbers are not merely statistical error? Let’s see the trend month by month — is it stable?

Monthly judgements — when the poster talks about women, the poster always has a higher chance of being the asshole.

Yes! The trend varies month by month, but there’s a clear higher chance of being an asshole when talking about women than when talking about men. If the small difference was just a statistical fluke, we would expect the trend to jump wildly instead.

And please note these results are very specific, as this tweet notes:

To which I replied


This time I’m using dbt for the first time, and I left all my code on GitHub. Thanks Claire Carroll for your help getting started with this awesome tool!

To extract all the /r/AmItheAsshole posts in BigQuery to a new table, you can do:

FROM `fh-bigquery.reddit_posts.20*'
WHERE subreddit = 'AmItheAsshole'
AND _table_suffix > '19_'

Then the gender and judgement for each post can be determined with a query like:

WITH data AS (
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhe\b')) hes
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bshe\b')) shes
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bher\b')) hers
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhis\b')) hiss
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bthey\b')) theys
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bgirlfriend\b')) gfs
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bboyfriend\b')) bfs
FROM {{ref('aita_posts')}}
WHERE link_flair_text IS NOT NULL
, gendered_data AS (
WHEN males > 2+females*2 THEN 'to_male'
WHEN females > 2+males*2 THEN 'to_female'
ELSE 'neutral'
END to_gender
SELECT *, hes+shes+hers+hiss+theys+gfs+bfs totalgender, hes+hiss+bfs males, shes+hers+gfs females
FROM data

SELECT CASE link_flair_text
WHEN 'not the a-hole' THEN 'no asshole'
WHEN 'no a-holes here' THEN 'no asshole'
WHEN 'everyone sucks' THEN 'asshole'
WHEN 'asshole' THEN 'asshole'
END judgement
, *
FROM gendered_data
WHERE link_flair_text IS NOT NULL

And finally the statistics presented here:

SELECT *, c/total_gender AS perc
SELECT *, SUM(c) OVER(PARTITION BY to_gender, MONTH ) total_gender, SUM(c) OVER(PARTITION BY judgement, MONTH) total_judgement
SELECT to_gender, judgement, CONCAT(to_gender, ': ', judgement) to_gender_judgement, month, COUNT(*) c, ARRAY_AGG(STRUCT(title, selftext) ORDER BY RAND() LIMIT 1)[OFFSET(0)] sample_title
FROM {{ref('aita_posts_gendered')}}
WHERE judgement IS NOT null
AND to_gender != 'neutral'
GROUP BY 1,2,3,4
WHERE c/total_gender > 0.01
AND total_judgement > 10
ORDER BY to_gender, perc DESC


You’ll find plenty of insightful and entertaining replies on the twitter thread for this post:

Feel free to join the discussion (and tell me if I’m wrong?). Remember to be nice to each other — most people are not the asshole anyways.

Want more?

I only covered until August 2019 as that’s when the current full reddit archive in BigQuery stops — until future expected updates. Check my previous post for more details on collecting live data from Thanks Jason Baumgartner for the constant supply!

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on, and all about BigQuery on




Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

4 questions I got asked at a data science consulting phone interview

7 Practical Ways How GPS Dash Cam Can Help After Accident

… The problem with 3rd party data sharing

Coalition polls for the people with Coalitiewijzer

Internal Content Indexing NLU Service Now Available on the CityFALCON API — CityFALCON Blog

Is your Data Dead and Dirty?

Annotating Bar Charts and Other Matplolib Techniques

Machine Learning Workflow with Dask_ml: Breast Cancer prediction use case

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Felipe Hoffa

Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.

More from Medium

Is Microsoft Access Dead?


Kinesis Data Stream Working and Architecture — Part2

How to Compute a Moving Average in BigQuery Using SQL