
Monitoring the crisis with reddit and BigQuery: The 2019 Chilean protests

5 min read · Oct 29, 2019


Follow these steps to bring real-time reddit data into BigQuery — then use Data Studio to create interactive dashboards to share with the world.

Reddit’s /r/chile is the main resource I’m using to follow the 2019 Chilean protests. What started on 10/14 as localized disturbances after a US$0.03 increase in subway fares ended up, 11 days later, mobilizing more than 1 million people in the biggest protest march Chile has ever seen. How can we visualize these 11 days plus the ongoing activity — using reddit’s comments?

Chile’s largest march ever. This picture got to the top of /r/pics. Photo source: jujuydice.com.

Looking at the # of comments on reddit’s /r/chile, we can quickly see there wasn’t much activity during the first days of the conflict — until 10/18 at 8pm:

/r/chile comments per hour — first peak on 10/18 8pm.
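This chart is just a count of comments per hour: once the pipeline described in the how-to below is in place, a minimal query over the shared hourly extract reproduces it:

SELECT hour, comments
FROM `reddit_extracts.201910_chile_hour`
ORDER BY hour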

10/18 is when “the entire Metro system was closed after attacks were reported at nearly all its 164 stations”. The top comment on /r/chile at 6pm reads “I am 100% sure that none of this would have happened if the Piñera government was not so harmful communicationally”. The user who wrote this comment — /u/BisonteL — was right, but no one knew how much worse the conflict would get.

Top comments on 10/18, before the big crisis exploded nationally.

24 hours later (10/19 6pm) the top comment calls out that the first curfew announcement was fake — the president even said so. But later that night, a real curfew was imposed.

Top comments on 10/19, when a curfew is imposed.

Then on 10/20 the violence explodes throughout the country, and the number of comments per hour peaks:

And so on… as the crisis became the new normal, the number of comments per hour decreased — Friday saw the largest march, the weekend showed even less activity, and then things picked back up on Monday 10/28.

At 8am the top comment points to a newspaper calling out immigrants as the people behind the subway fires. At 3pm the top comment points out that this report was false. It took a day for the newspaper to retract its mistake:

Monday after a slowing weekend — comments per hour on /r/chile start peaking again — and calling out misleading news.

How-to

I made a public interactive Data Studio dashboard — and also shared the raw data with you:

Interactive Data Studio dashboard following /r/chile top comments per hour.

Loading the data

I’m using pushshift.io’s API to get the latest reddit comments. Calling this URL brings up to 10,000 comments published after a certain date (a Unix epoch timestamp, in seconds) for an arbitrary subreddit:

http://dev.pushshift.io/rc/_search/?q=subreddit:chile%20AND%20created_utc:%3E1572297945&sort=created_utc&size=10000

To load this data into BigQuery, I use jq to transform the ElasticSearch API response into JSON records that BigQuery can ingest:

wget "http://dev.pushshift.io/rc/_search/?q=subreddit:chile AND created_utc:>=1572323766 &sort=created_utc&size=10000" -O - \
| jq -c .hits.hits[]._source \
> tobq.json
wget "http://dev.pushshift.io/rc/_search/?q=subreddit:chile AND updated_utc:>=1571863500 &sort=created_utc&size=10000" -O - \
| jq -c .hits.hits[]._source \
> tobq_updated.json

Then to load these records into the existing raw table:

# Append both files to the existing raw table:
bq load --source_format=NEWLINE_DELIMITED_JSON \
  fh-bigquery:reddit_raw.201910_chile_raw_pushshift tobq.json
bq load --source_format=NEWLINE_DELIMITED_JSON \
  fh-bigquery:reddit_raw.201910_chile_raw_pushshift tobq_updated.json
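After each load, a quick sanity check over the raw table shows how many comments are in and how fresh they are (a minimal sketch along these lines):

SELECT COUNT(*) comments_loaded
, TIMESTAMP_SECONDS(MAX(created_utc)) latest_comment
FROM `reddit_raw.201910_chile_raw_pushshift`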

The max created_utc and updated_utc for the next incremental pull come out of BigQuery — this query prints the next pair of wget commands with the timestamps already filled in:

SELECT FORMAT(
  """wget "http://dev.pushshift.io/rc/_search/?q=subreddit:chile AND created_utc:>=%i &sort=created_utc&size=10000" -O - | jq -c .hits.hits[]._source > tobq.json
wget "http://dev.pushshift.io/rc/_search/?q=subreddit:chile AND updated_utc:>=%i &sort=created_utc&size=10000" -O - | jq -c .hits.hits[]._source > tobq_updated.json
"""
  , created_utc, updated_utc)
FROM (
  SELECT MAX(created_utc) created_utc, MAX(updated_utc) updated_utc
  FROM `reddit_raw.201910_chile_raw_pushshift`
)
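Each run of that query prints a fresh pair of wget commands, so keeping the table current is a simple loop: run the query, paste and run the commands, load the two files with bq load, repeat.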

Then I run a 2-step script in BigQuery. The first query applies some basic transforms and de-duplication over the raw data. The second aggregates it all and prepares it as much as possible for the interactive dashboard:

CREATE OR REPLACE TABLE `reddit_extracts.201910_chile`
AS
-- Step 1: de-duplicate the raw rows and apply basic transforms.
SELECT *
  EXCEPT(edited_body)
  -- if a comment was edited, keep only the latest text
  REPLACE(COALESCE(edited_body, body) AS body)
FROM (
  SELECT TIMESTAMP_SECONDS(x.created_utc) ts
    , REGEXP_EXTRACT(x.permalink, '^(.*)/[a-z0-9]*/$') post
    , REGEXP_REPLACE(REGEXP_EXTRACT(x.permalink, '^.*/([^/]*)/[a-z0-9]*/$'), '_', ' ') post_title
    , FORMAT('https://reddit.com%s', REGEXP_EXTRACT(x.permalink, '^(.*)/[a-z0-9]*/')) link
    , x.*
  FROM (
    -- keep only the most recently retrieved copy of each comment id
    SELECT id, ARRAY_AGG(a ORDER BY retrieved_on DESC LIMIT 1)[OFFSET(0)] x
    FROM `reddit_raw.201910_chile_raw_pushshift` a
    WHERE TIMESTAMP_SECONDS(a.created_utc) > '2019-10-13'
    GROUP BY id
  )
)
;
CREATE OR REPLACE TABLE `reddit_extracts.201910_chile_hour`
AS
-- Step 2: aggregate per hour for the dashboard.
SELECT * REPLACE(TIMESTAMP(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', hour, 'Chile/Continental')) AS hour)
  , REGEXP_REPLACE(REGEXP_EXTRACT(top_post.value, '^.*/([^/]*)$'), '_', ' ') top_post_title
FROM (
  SELECT TIMESTAMP_TRUNC(ts, HOUR) hour, COUNT(*) comments
    -- the most commented post of each hour
    , APPROX_TOP_COUNT(post, 1)[SAFE_OFFSET(0)] top_post
    -- top comment: most replies, positive score, not deleted or removed
    , ARRAY_AGG(IF(body NOT IN ('[deleted]','[removed]') AND score>=1, STRUCT(score, replies, permalink, body), NULL) IGNORE NULLS ORDER BY replies DESC LIMIT 1)[OFFSET(0)] top_comm
  FROM (
    SELECT *, (SELECT COUNT(*) FROM `reddit_extracts.201910_chile` WHERE parent_id=a.id) replies
    FROM `reddit_extracts.201910_chile` a
  )
  GROUP BY 1
)

You can find the shared table in BigQuery, in the reddit_extracts dataset of the fh-bigquery project.

Some tricks in the query:

  • To make the time in Data Studio local to Santiago, Chile:
TIMESTAMP(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', hour, 'Chile/Continental'))
  • To find the top post per hour:
APPROX_TOP_COUNT(post, 1)
  • The top comment for any hour is the one with the most replies, with a positive score:
ARRAY_AGG(IF(body NOT IN ('[deleted]','[removed]') AND score>=1, STRUCT(score, replies, permalink, body), NULL) IGNORE NULLS ORDER BY replies DESC LIMIT 1)[OFFSET(0)]
  • If a comment has been edited, share only the latest copy:
REPLACE(COALESCE(edited_body, body) AS body)
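One more pattern worth spelling out is the de-duplication in the first step. Incremental loads can retrieve the same comment more than once (edited comments arrive again with updated text), so the query keeps only the most recently retrieved copy per id. A minimal sketch with made-up inline data:

-- toy data standing in for the raw pushshift table
WITH raw AS (
  SELECT 'c1' id, 1 retrieved_on, 'first version' body
  UNION ALL SELECT 'c1', 2, 'edited version'
  UNION ALL SELECT 'c2', 1, 'untouched'
)
SELECT id, ARRAY_AGG(a ORDER BY retrieved_on DESC LIMIT 1)[OFFSET(0)].body body
FROM raw a
GROUP BY id
-- c1 keeps 'edited version'; c2 keeps 'untouched'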

Next steps

  • Check out the billions of historic reddit comments shared in BigQuery.
  • I’ll keep updating the BigQuery table for the Chile protests as long as there’s interesting activity. Will this be over soon?
  • Others did an interesting job analyzing millions of related tweets.
  • One of my favorite tweets analyzed one of the president’s speeches: he calls out violence, criminals, god, and empathy — but he never says ‘sorry’. For what it’s worth, the next day he finally said the ‘sorry’ word.
  • What about Wikipedia? The following query shows the trends for the president, the protests page, and what a curfew is — in the Spanish Wikipedias:

SELECT datehour, title, views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki LIKE 'es%'
AND title IN ('Sebastián_Piñera', 'Protestas_en_Chile_de_2019', 'Toque_de_queda')
AND datehour > '2019-10-13'

Spanish Wikipedia pageviews per hour. Times in UTC. The largest peak is people wondering what the declared curfew is.
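Those times are in UTC; to see them in Santiago local time instead, the same FORMAT_TIMESTAMP trick from the dashboard query applies (a sketch over the same pageviews table):

SELECT TIMESTAMP(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', datehour, 'Chile/Continental')) datehour_scl
, title, views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki LIKE 'es%'
AND title IN ('Sebastián_Piñera', 'Protestas_en_Chile_de_2019', 'Toque_de_queda')
AND datehour > '2019-10-13'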

Want more?

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.

