Monitoring the crisis with reddit and BigQuery: The 2019 Chilean protests
Follow these steps to bring realtime reddit data into BigQuery — then use Data Studio to create interactive dashboards to share with the world.
Reddit /r/chile is the main resource I’m using to follow the 2019 Chilean protests. What started on 10/14 as localized disturbances over a US$0.03 increase in subway fares ended up mobilizing more than 1 million people 11 days later in the biggest protest march Chile has ever seen. How can we visualize these 11 days plus the ongoing activity, using reddit’s comments?
Looking at the # of comments on reddit’s /r/chile, we can quickly see there wasn’t much activity during the first days of the conflict — until 10/18 at 8pm:
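To reproduce this chart yourself, a minimal query over the shared hourly table built in the how-to section below:

-- comments per hour, ready for charting
SELECT hour, comments
FROM `fh-bigquery.reddit_extracts.201910_chile_hour`
ORDER BY hour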
10/18 is when “the entire Metro system was closed after attacks were reported at nearly all its 164 stations”. The top comment on /r/chile at 6pm reads “I am 100% sure that none of this would have happened if the Piñera government was not so harmful communicationally”. The user who wrote this comment — /u/BisonteL — was right, but no one knew how much worse the conflict was going to grow.
24 hours later (10/19, 6pm) the top comment calls out that the first curfew announcement was fake — the president even said so. But later that night, a real curfew was imposed.
Then on 10/20 the violence explodes throughout the country, and the number of comments per hour peaks:
And so on… as the crisis became the new normal, the number of comments per hour decreased — Friday saw the largest march, the weekend showed even less activity, but it then recovered on Monday 10/28.
At 8am the top comment points to a newspaper calling out immigrants as the people behind the subway fires. At 3pm the top comment points out that this report was false. It took a day for the newspaper to retract its mistake:
How-to
I made a public interactive Data Studio dashboard — and also shared the raw data with you.
Loading the data
I’m using pushshift.io’s API to get the latest reddit comments. Calling this URL brings up to 10,000 comments published after a certain date for an arbitrary subreddit:
http://dev.pushshift.io/rc/_search/?q=subreddit:chile%20AND%20created_utc:%3E1572297945&sort=created_utc&size=10000
To load this data into BigQuery, I use jq to transform the Elasticsearch API response into JSON records BigQuery can ingest:
wget "http://dev.pushshift.io/rc/_search/?q=subreddit:chile AND created_utc:>=1572323766 &sort=created_utc&size=10000" -O - \
| jq -c .hits.hits[]._source \
> tobq.jsonwget "http://dev.pushshift.io/rc/_search/?q=subreddit:chile AND updated_utc:>=1571863500 &sort=created_utc&size=10000" -O - \
| jq -c .hits.hits[]._source \
> tobq_updated.json
Then to load these records into the existing raw table:
bq load --source_format=NEWLINE_DELIMITED_JSON \
fh-bigquery:reddit_raw.201910_chile_raw_pushshift tobq.json

bq load --source_format=NEWLINE_DELIMITED_JSON \
fh-bigquery:reddit_raw.201910_chile_raw_pushshift tobq_updated.json
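To sanity-check each load, a quick look at daily counts and the latest comment seen (a minimal sketch):

-- daily comment counts, plus the most recent comment loaded
SELECT DATE(TIMESTAMP_SECONDS(created_utc)) day
  , COUNT(*) comments
  , MAX(TIMESTAMP_SECONDS(created_utc)) latest
FROM `reddit_raw.201910_chile_raw_pushshift`
GROUP BY day
ORDER BY day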
The max created_utc and updated_utc values came out of BigQuery:
SELECT FORMAT(
  """wget "http://dev.pushshift.io/rc/_search/?q=subreddit:chile AND created_utc:>=%i&sort=created_utc&size=10000" -O - | jq -c '.hits.hits[]._source' > tobq.json
wget "http://dev.pushshift.io/rc/_search/?q=subreddit:chile AND updated_utc:>=%i&sort=created_utc&size=10000" -O - | jq -c '.hits.hits[]._source' > tobq_updated.json
"""
  , created_utc, updated_utc)
FROM (
  SELECT MAX(created_utc) created_utc, MAX(updated_utc) updated_utc
  FROM `reddit_raw.201910_chile_raw_pushshift`
)
Then I run a 2-step script in BigQuery. The first query applies some basic transforms and de-duplication over the raw data. The second query aggregates it all and prepares it as much as possible for the interactive dashboard. The de-duplication hinges on one pattern: since the same comment can show up more than once (retrieved again, or edited), keep only the most recently retrieved copy of each id.
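On its own, that pattern looks like this (a minimal sketch, over a hypothetical table with id and retrieved_on columns):

-- keep only the most recently retrieved copy of each id
SELECT ARRAY_AGG(t ORDER BY retrieved_on DESC LIMIT 1)[OFFSET(0)].*
FROM `some_dataset.some_table` t
GROUP BY id

Here is the full script: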
CREATE OR REPLACE TABLE `reddit_extracts.201910_chile`
AS
SELECT *
EXCEPT(edited_body)
REPLACE(COALESCE(edited_body, body) AS body)
FROM (
SELECT TIMESTAMP_SECONDS(x.created_utc) ts
, FORMAT(REGEXP_EXTRACT(x.permalink, '^(.*)/[a-z0-9]*/$')) post
, REGEXP_REPLACE(REGEXP_EXTRACT(x.permalink, '^.*/([^/]*)/[a-z0-9]*/$'), '_', ' ') post_title
, FORMAT('https://reddit.com%s', REGEXP_EXTRACT(x.permalink, '^(.*)/[a-z0-9]*/')) link
, x.*
FROM (
SELECT id, ARRAY_AGG(a ORDER BY retrieved_on DESC LIMIT 1)[OFFSET(0)] x
FROM `reddit_raw.201910_chile_raw_pushshift` a
WHERE TIMESTAMP_SECONDS(a.created_utc) > '2019-10-13'
GROUP BY id
)
)
;

CREATE OR REPLACE TABLE `reddit_extracts.201910_chile_hour`
AS
SELECT * REPLACE(TIMESTAMP(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', hour, 'Chile/Continental')) AS hour)
, REGEXP_REPLACE(REGEXP_EXTRACT(top_post.value, '^.*/([^/]*)$'), '_', ' ') top_post_title
FROM (
SELECT TIMESTAMP_TRUNC(ts, HOUR) hour, COUNT(*) comments
, APPROX_TOP_COUNT(post, 1)[SAFE_OFFSET(0)] top_post
, ARRAY_AGG(IF(body NOT IN ('[deleted]','[removed]') AND score>=1, STRUCT(score, replies, permalink, body), NULL) IGNORE NULLS ORDER BY replies DESC LIMIT 1)[OFFSET(0)] top_comm
FROM (
SELECT *, (SELECT COUNT(*) FROM `reddit_extracts.201910_chile` WHERE parent_id=a.id) replies
FROM `reddit_extracts.201910_chile` a
)
GROUP BY 1
)
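That hourly table is exactly what the Data Studio dashboard reads. For example, to peek at the busiest hours and their top comments (a sketch):

-- busiest hours, with the top post and comment for each
SELECT hour, comments, top_post_title, top_comm.body
FROM `reddit_extracts.201910_chile_hour`
ORDER BY comments DESC
LIMIT 5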
You can find the shared table in BigQuery, under `fh-bigquery.reddit_extracts.201910_chile_hour`.
Some tricks in the query:
- To make the time in Data Studio local to Santiago, Chile:
TIMESTAMP(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', hour, 'Chile/Continental'))
- To find the top post per hour:
APPROX_TOP_COUNT(post, 1)
- Top comment for any hour is the one with the most replies, with a positive score:
ARRAY_AGG(IF(body NOT IN ('[deleted]','[removed]') AND score>=1, STRUCT(score, replies, permalink, body), NULL) IGNORE NULLS ORDER BY replies DESC LIMIT 1)[OFFSET(0)]
- If a comment has been edited, share only the latest copy:
REPLACE(COALESCE(edited_body, body) AS body)
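One more note on the queries above: the per-comment reply count uses a correlated subquery. An equivalent aggregate-and-join version (a sketch) does the same thing and can be easier to reason about:

-- count replies per comment by aggregating once, then joining back
SELECT a.*, COALESCE(r.replies, 0) replies
FROM `reddit_extracts.201910_chile` a
LEFT JOIN (
  SELECT parent_id, COUNT(*) replies
  FROM `reddit_extracts.201910_chile`
  GROUP BY parent_id
) r
ON a.id = r.parent_id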
Next steps
- Check out billions of historic reddit comments shared in BigQuery (thx pushshift.io).
- I’ll keep updating the BigQuery table for the Chile protests as long as there’s interesting activity. Will this be over soon?
- Others did an interesting job analyzing millions of related tweets.
- One of my favorite tweets analyzed one of the president’s speeches: he calls out violence, criminals, god, and empathy — but he never says ‘sorry’. For what it’s worth, the next day he finally said the ‘sorry’ word.
- What about Wikipedia? The following query shows the trends for the president, the protests page, and what a curfew is — in the Spanish Wikipedias:
SELECT datehour, title, views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki LIKE 'es%'
AND title IN ('Sebastián_Piñera', 'Protestas_en_Chile_de_2019', 'Toque_de_queda')
AND datehour > '2019-10-13'
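To smooth those hourly numbers into daily trends, aggregate by date (a sketch):

-- daily page views per article
SELECT DATE(datehour) day, title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki LIKE 'es%'
AND title IN ('Sebastián_Piñera', 'Protestas_en_Chile_de_2019', 'Toque_de_queda')
AND datehour > '2019-10-13'
GROUP BY day, title
ORDER BY day, title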
Want more?
I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.