Reddit AmItheAsshole is nicer to women than to men — a SQL proof?

Update 2020–02–20

Make sure to not take anything I wrote here as the absolute truth. Several people on Twitter noted problems and added corrections to the analysis I offered. Reading this post as originally presented — and the reactions — can be a great way for you to learn as much as I did while reading the responses. You can find many of their unfiltered thoughts by following this Twitter thread.

Context

/r/amItheAsshole has grown to be the 4th most active subreddit — by number of comments. People come to this subreddit to tell their stories, and they ask other redditors “am I the asshole here?”. Turns out most people are judged as “not the asshole”, as seen in this chart:

Most people are judged to be “not the asshole”

My tweet with these results got a lot of attention:

Including the question — is reddit nicer to women or to men?

Deciding gender

By looking at the title or content of a post, you might have a hard time deciding if “I” is a man or a woman — but it’s pretty easy to count the number of “she/he/her/his/girlfriend/boyfriend” present in the story.

Let’s look at some random posts, and the count for each of these pronouns and gendered words:

Some random posts, and the count of each of these pronouns and gendered words.

We can see that the count of gendered pronouns and words in the example matches who the story is about. These stories are about a male customer, a female girlfriend, a male neighbor, a male son, and a female teenage daughter.

With these numbers, we can now set an arbitrary rule: If there’s more than double the number of male pronouns than female, that post is about a man. We can use the opposite rule to say that the post is about a woman. If the numbers are too close or null, we will call the post “neutral”.

Another rule we can set to simplify the analysis:

  • If the judgment is ‘not the a-hole’ or ‘no a-holes here’ then we can say ‘the poster is not an asshole’.
  • If the judgment is ‘asshole’ or ‘everyone sucks’ then we can say ‘the poster is an asshole’.

If we aggregate all these posts, we get to the numbers:

Most people are not the asshole. But if you talk about women, there’s a higher probability of being one.

When I first presented these results, I was told “these numbers are too close, they could be a statistical error”.

Statistical significance?

How can we tell the numbers are not merely statistical error? Let’s see the trend month by month — is it stable?

Monthly judgements — when the poster talks about women, the poster always has a higher chance of being the asshole.

Yes! The trend varies month by month, but there’s a clear higher chance of being an asshole when talking about women than when talking about men. If the small difference was just a statistical fluke, we would expect the trend to jump wildly instead.

And please note these results are very specific, as this tweet notes:

To which I replied

How-to

This time I’m using dbt for the first time, and I left all my code on GitHub. Thanks Claire Carroll for your help getting started with this awesome tool!

To extract all the /r/AmItheAsshole posts in BigQuery to a new table, you can do:

CREATE TABLE temp.data ASSELECT *  
FROM `fh-bigquery.reddit_posts.20*'
WHERE subreddit = 'AmItheAsshole'
AND _table_suffix > '19_'

Then the gender and judgement for each post can be determined with a query like:

WITH data AS (
SELECT *
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhe\b')) hes
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bshe\b')) shes
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bher\b')) hers
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bhis\b')) hiss
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bthey\b')) theys
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bgirlfriend\b')) gfs
, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(CONCAT(selftext, title), r'(?i)\bboyfriend\b')) bfs
FROM {{ref('aita_posts')}}
WHERE link_flair_text IS NOT NULL
)
, gendered_data AS (
SELECT *
, CASE
WHEN males > 2+females*2 THEN 'to_male'
WHEN females > 2+males*2 THEN 'to_female'
ELSE 'neutral'
END to_gender
FROM (
SELECT *, hes+shes+hers+hiss+theys+gfs+bfs totalgender, hes+hiss+bfs males, shes+hers+gfs females
FROM data
)
)


SELECT CASE link_flair_text
WHEN 'not the a-hole' THEN 'no asshole'
WHEN 'no a-holes here' THEN 'no asshole'
WHEN 'everyone sucks' THEN 'asshole'
WHEN 'asshole' THEN 'asshole'
END judgement
, *
FROM gendered_data
WHERE link_flair_text IS NOT NULL

And finally the statistics presented here:

SELECT *, c/total_gender AS perc
FROM (
SELECT *, SUM(c) OVER(PARTITION BY to_gender, MONTH ) total_gender, SUM(c) OVER(PARTITION BY judgement, MONTH) total_judgement
FROM (
SELECT to_gender, judgement, CONCAT(to_gender, ': ', judgement) to_gender_judgement, month, COUNT(*) c, ARRAY_AGG(STRUCT(title, selftext) ORDER BY RAND() LIMIT 1)[OFFSET(0)] sample_title
FROM {{ref('aita_posts_gendered')}}
WHERE judgement IS NOT null
AND to_gender != 'neutral'
GROUP BY 1,2,3,4
)
)
WHERE c/total_gender > 0.01
AND total_judgement > 10
ORDER BY to_gender, perc DESC

Discussion

You’ll find plenty of insightful and entertaining replies on the twitter thread for this post:

Feel free to join the discussion (and tell me if I’m wrong?). Remember to be nice to each other — most people are not the asshole anyways.

Want more?

I only covered until August 2019 as that’s when the current full reddit archive in BigQuery stops — until future expected updates. Check my previous post for more details on collecting live data from pushshift.io. Thanks Jason Baumgartner for the constant supply!

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.

--

--

--

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Mathematical Guide to Modelling the Distribution of Asset Returns

Why does Greek have 6 “ee” sounds?

This is Why Python Will Stay Among The Top Languages in 2021

How to process and visualize financial data on Google Cloud with Big Query & Data Studio

Shapiro-Wilk test power analysis in R using Monte Carlo.

The Illusion of Making Data-Driven Decisions

A Beginner Tutorial of Great Expectations

Web Scraping Rentals Website Using Beautiful Soup and Pandas

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Felipe Hoffa

Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.

More from Medium

Joining American Community Survey and Google Analytics data in BigQuery and extracting insights

How to Pivot in Google BigQuery

Predicting the Fare on a Billion Taxi Trips with BigQuery

Making Interactive Report Using Google Data Studio and BigQuery Sandbox (Public Dataset…