Sitemap

Interactive: The top 2019 Wikipedia pages

has published their list of most popular 2019 pages— but can we go deeper? Of course, here with BigQuery and Data Studio. And play with the interactive dashboard.

--

Finding the top 10 Wikipedia pages by # of views is interesting, and

already did the job and published their numbers:

Most popular Wikipedia pages according to Wikimedia [dashboard with links]

And how did these pages trend throughout the year? Check out this interactive dashboard:

Interactive Data Studio dashboard — top 2019 Wikipedia pages

The top 2019 movies

Top 2019 movies

In a year dominated by movies, Avengers got the top #1 overall. What’s interesting here is to look at the peak days — for example Once Upon a Time in Hollywood got more views throughout the months than The Irishman, but the latter had a way higher peak when published (December 2). Once Upon a Time barely shows up within the other giants, but is steadily accumulating a large number of views.

Top 2019 series

Top 2019 series

Series also got a lot of attention. What’s notable here:

  • People came back every week to check out the latest episode of Game of Thrones.
  • While the series Chernobyl attracted a lot of attention, it brought a lot more attention to the page detailing the real disaster behind.
  • The 3rd season of Stranger Things generated interest, but not as much as the other series in this list.
  • The Mandalorian and The Witcher captured the world’s attention at the end of 2019.

Interesting people

Comparing interesting people
  • Ted Bundy was the top person on Wikipedia during 2019 — powered by the release of a Netflix series and then a movie. People seem fascinated by this character, making his page the #2 overall for 2019.
  • Meanwhile, Donald Trump page only raised to #15 of 2019. It’s interesting to see that this page got 5% of its yearly pageviews on December 19th, day after impeachment.
  • And Boris Johnson page raised to #48 — mainly powered by 2 days: when he became prime minister, and the December elections.

How-to

Loading the data

I already published a post on how to load the Wikipedia pageviews into BigQuery. And now we are working to bring this dataset into the official Google Cloud Public Datasets project (thanks

— check the code).

Finding the top pages

To get the top pages for 2019, a simple query will do:

SELECT title, SUM(views) total_views
FROM `bigquery-public-data.wikipedia.pageviews_2019`
WHERE wiki IN ('en', 'en.m')
AND datehour>='2019-01-01'
GROUP BY title
ORDER BY 2 DESC
LIMIT 10

This works — but we can make this query way faster than 40s if we filter out noise — for example, any hour with less than 4 pageviews, and default pages:

SELECT title, SUM(views) total_views
FROM `bigquery-public-data.wikipedia.pageviews_2019`
WHERE wiki IN ('en', 'en.m')
AND title NOT IN ('Main_Page','-','Wikipedia')
AND title NOT LIKE 'File%'
AND title NOT LIKE 'Special:%'
AND title NOT LIKE 'Portal:%'
AND datehour>='2019-01-01'
AND views>3
GROUP BY title
ORDER BY 2 DESC
LIMIT 10

Cool — that brings the query from 40s to 16s. But brings out the question — why is Simple_Mail_Transfer_Protocol a top page? Wikimedia’s Andrew G. West recommends filtering out any page that has most of its pageviews coming only from mobile or desktop to filter out spam.

Extracting daily pageviews for visualizing

And then we want not only the top pages, but also the daily pageviews to build a chart in Data Studio. So we can write a query like this:

CREATE OR REPLACE TABLE `wikipedia_extracts.2019_top_en_daily_views`
AS
WITH data AS (
SELECT * FROM
# fixing `bigquery-public-data.wikipedia.pageviews_2019`
`fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki IN ('en', 'en.m')
AND title NOT IN ('Main_Page','-','Wikipedia')
AND title NOT LIKE 'File%'
AND title NOT LIKE 'Special:%'
AND title NOT LIKE 'Portal:%'
AND views > 3
AND datehour > '2000-01-01'
)
, pivot_desktop_mobile AS (
SELECT title, DATE(datehour) date
, SUM(IF(wiki='en', views, null)) desktop
, SUM(IF(wiki='en.m', views, null)) mobile
FROM data
GROUP BY title, date
)
, top_year AS (
SELECT *, total_mobile/(total_desktop+total_mobile) ratio_mobile
FROM (
SELECT title, SUM(desktop+mobile) total_views
, SUM(desktop) total_desktop
, SUM(mobile) total_mobile
, ARRAY_AGG(STRUCT(date, desktop+mobile AS views)) arr
, ARRAY_AGG(date ORDER BY desktop+mobile DESC LIMIT 1)[OFFSET(0)] top_day
FROM pivot_desktop_mobile
WHERE mobile/(desktop+mobile) BETWEEN 0.05 AND 0.95
GROUP BY title
ORDER BY total_views DESC
LIMIT 200
)
)
SELECT title, date, views
, STRUCT(
top_day, total_views, total_desktop, total_mobile
, total_mobile/(total_mobile+total_desktop) AS ratio_mobile
) AS stats
FROM top_year, UNNEST(arr)
913GB of data processed in 32 seconds. Impressive, but careful: this query will eat your monthly free TB.

That query stores the top 200 daily pageviews in a table we can connect to Data Studio plus BI Engine for a fast interactive user experience — as seen above.

Cost savings

Going over 913GB of data in 32 seconds is impressive! But it can quickly generate a huge bill. Instead of running these queries, you should first extract the data you are interested in to a new table — and then your queries will be magnitudes cheaper. Check out my Stack Overflow answer to go through the steps that transform these terabyte queries into only 2.5GB queries.

First I’ll create a table with the daily extracts:

CREATE TABLE `wikipedia_extracts.2019_en_month_views` 
AS
WITH data AS (
SELECT *
FROM # fixing `bigquery-public-data.wikipedia.pageviews_2019`
`fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki IN ('en', 'en.m')
AND title NOT IN ('Main_Page','-','Wikipedia')
AND title NOT LIKE 'File%'
AND title NOT LIKE 'Special:%'
AND title NOT LIKE 'Portal:%'
AND views > 3
AND datehour > '2000-01-01'
)
, pivot_desktop_mobile AS (
SELECT title, DATE(datehour) day
, SUM(IF(wiki='en', views, null)) desktop
, SUM(IF(wiki='en.m', views, null)) mobile
, SUM(views) views
FROM data
GROUP BY title, day
)
SELECT *
FROM pivot_desktop_mobile
# 1 min 23 sec elapsed, 913 GB processed

Now you can write queries over this table — for example, to get the top monthly pages:

WITH data AS (
SELECT DATE_TRUNC(day, MONTH) month, title, SUM(views) views
FROM `fh-bigquery.wikipedia_extracts.2019_en_month_views`
WHERE mobile/(desktop+mobile) BETWEEN 0.05 AND 0.95
GROUP BY 1,2
)
SELECT FORMAT_DATE('2019-%m', month)
, ARRAY_AGG(STRUCT(title, views) ORDER BY views DESC LIMIT 5 ) top
FROM data
GROUP BY month
ORDER BY month
# 21.2 sec elapsed, 17.3 GB processed

And this query only processed 17GB, instead of 913GB!

Top 5 Wikipedia pages for each 2019 month

When working with large tables, remember to extract data first, and write all your fun queries over that extract instead.

How are the views of these pages distributed?

The top 7.2% of Wikipedia pages earn 87% of all the monthly views:

The top 7.2% of Wikipedia pages earn 87% of all the monthly views.
WITH wiki_prefixes AS (SELECT ['File:', 'Talk:', 'Template_talk:', 'Wikipedia:', 'Category:', 'User_talk:', 'Page:', 'Template:', 'Category_talk:' , 'User:', 'Author:', 'Portal:', 'Wikipedia_talk:', 'Portal_talk:', 'File_talk:', 'Draft:', 'Help:', 'Draft_talk:', 'en:', 'Book_talk:', 'Module:', 'MOS:', 'Special:', 'Book:'] x)SELECT fhoffa.x.int(POW(10, fhoffa.x.int(LOG10(views)))) views_min
, fhoffa.x.int(POW(10, fhoffa.x.int(1+LOG10(views)))) views_max
, COUNT(*) pages, SUM(viewS) total_views
, STRING_AGG(title ORDER BY views DESC LIMIT 3) sample_titles
FROM `fh-bigquery.wikipedia_extracts.201912_en_totals`
WHERE title NOT IN ('-', 'Main_Page')
AND (
title NOT LIKE '%:%'
OR REGEXP_EXTRACT(title, '[^:]*:') NOT IN UNNEST((SELECT(x) FROM wiki_prefixes))
)
GROUP BY 1,2
ORDER BY 1
# 7.2 sec elapsed, 805.8 MB processed)

Next steps

It’s your turn to play with BigQuery — you get a free terabyte of queries every month, no credit card needed.

How about looking for the top 5 Wikipedia pages for each month in your own language? We only did English here!

Did you know that the top 0.1% of Wikipedia pages get 25% of the views? The bottom 80% get only 4%.

Want more?

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.

--

--

No responses yet