Visualizing the DNC vs RNC conventions with Wikipedia+Wikidata+BigQuery
The Democratic and Republican 2016 conventions were wildly different in many aspects, including speakers, their notability, and diversity. How can we visualize this?
First I need the list of speakers from each convention— but then, how do I quickly compute their fame and other characteristics? Finding the list of speakers is easy, just go to Wikipedia and copy paste the 2 relevant pages:
Step 1: Find the list of speakers
- https://en.wikipedia.org/wiki/2016_Republican_National_Convention
- https://en.wikipedia.org/wiki/2016_Democratic_National_Convention
Step 2: Write Python code to extract Wikipedia links
Copy pasting the Wikipedia pages source, I can get a list like this:
**U.S. Representative [[Ted Lieu]] of [[California]]
**General [[John R. Allen]], [[U.S. Marine Corps]] (retired)
**Captain [[Florent Groberg]] (retired), recipient of the [[Medal of Honor]]
**[[Chloe Grace Moretz]], actress
**U.S. Representative [[Xavier Becerra]] of [[California]]
**U.S. Senator [[Sherrod Brown]] of [[Ohio]]
**[[Chelsea Clinton]], daughter of Bill and Hillary Clinton (headliner)
And then some Python code to extract only the links:
democrat=[{'order':x,'title':y,'convention':'democrat'} for x,y in enumerate([
urllib.quote(x.split('|')[0].replace(' ','_'), '()')
for x in re.findall(r'\[\[([^\]]*)\]\]', text)
if '[' not in x
])]
Resulting in:
[{'convention': 'democrat', 'order': 0, 'title': 'Marcia_Fudge'},
{'convention': 'democrat', 'order': 1, 'title': 'Ohio'},
{'convention': 'democrat', 'order': 2, 'title': 'Mayor_of_Denver'},
{'convention': 'democrat', 'order': 3, 'title': 'Wellington_Webb'},
{'convention': 'democrat', 'order': 4, 'title': 'Colorado'},
{'convention': 'democrat', 'order': 5, 'title': 'Diane_Russell'},
{'convention': 'democrat', 'order': 6, 'title': 'Maine'},
{...}]
Ouch — I got all the links, but not everyone there is a person. How can I differentiate who are the actual people mentioned?
Step 3: Use Wikidata to know who is actually is a person
I have a copy of Wikidata in BigQuery:
So now I only need to load the list of links to BigQuery:
sample_schema = bq.Schema.from_data(republican)sample_table = bq.Table(
"fh-bigquery:public_dump.convention3").create(schema = sample_schema, overwrite = True)sample_table.insert_data(republican)
sample_table.insert_data(democrat)
And then do a JOIN with Wikidata to get only people:
SELECT title, convention, order, en_description description, en_label
FROM [fh-bigquery:public_dump.convention3] a
JOIN [fh-bigquery:wikidata.latest_enesjafrde_v1_a] b
ON a.title=b.en_wiki
OMIT RECORD IF NOT SOME(instance_of.numeric_id=5
Nice! I have a list of the speakers, their respective conventions, and their relative order of appearance. Now how do I find about their relative fame?
Step 4: Measure fame by Wikipedia pageviews
I also have all of the Wikipedia pageviews in BigQuery, so a quick join (plus some cleanups) will get me the data I want (including gender this time):
SELECT a.title, convention, order, description,
gender, en_label, SUM(requests) requests_june, COUNT(*) hours
FROM (
SELECT title, convention, order, description, gender, en_label
FROM (
SELECT title, convention, order, en_description description, en_label,
FIRST(gender.numeric_id) WITHIN RECORD gender
FROM (
SELECT convention, order, REPLACE(title, '_Jr', ',_Jr') title
FROM [fh-bigquery:public_dump.convention3]
) a
JOIN [fh-bigquery:wikidata.latest_enesjafrde_v1_a] b
ON a.title=b.en_wiki
OMIT RECORD IF NOT SOME(instance_of.numeric_id=5)
) GROUP BY 1,2,3,4,5,6
) a
JOIN [fh-bigquery:wikipedia.pagecounts_201606] b
ON a.title=b.title
WHERE language='en'
GROUP BY 1,2,3,4,5,6
ORDER BY convention, order
Warning: The Wikipedia pageviews tables are big. Extract and summarize only the English Wikipedia pageviews first to make your free monthly quota last.
Neat! Now with all data in hand we can create a quick visualization.
Step 5: Visualize
I’m going to use Python to create a quick HTML visualization:
html='<div>'
for x in results:
x['color']={6581072:'orange', 1052281:'salmon'}.get(x['gender'], 'green')
x['size']=int((x['requests_june']-78.0)/(1159082-78)*100+12)
if x['convention']=='democrat':
html+=(u'''<span title="{description} ({requests_june} views)" style="font-size:10px" ><a
href="https://en.wikipedia.org/wiki/{a_title}" style="text-decoration: none;color:{color};font-size:{size}px">{en_label}</a>
</span> - '''.format(**x))
html+='</div>'
And the final results:
Follow me at @felipehoffa and other BigQuery news on reddit.com/r/bigquery.