10 top tips: Unleash your BigQuery superpowers
Ta ta taaam! In this post I want to share the 10 top tips to unleash your BigQuery superpowers.
If you know SQL — you’re already a hero. You have the power to transform data into insights. How awesome is it to save the day when someone in need comes to you to reveal the magic numbers they can then paste into their business proposals? And in your free time you can roam around your data lakes, looking for patterns that no one else has seen before, and then bring them back for everyone to marvel at.
Yes, if you know SQL, you’re already a hero. And with BigQuery, you can become a super-hero: You can run queries faster than anyone else. You’re not afraid of running full table scans. You’ve made your datasets highly available, without fear of maintenance periods. Indexes? Where we are going we don’t need indexes! Nor vacuums! And you already know that — because you’re already a BigQuery super-hero.
If you’re a BigQuery user, you’re already a superhero. But superheroes sometimes don’t know about, or haven’t learned to use, all the superpowers they have. And this session is about that: What are the top 10 BigQuery superpowers you have — that you might not have discovered yet? Time travel? Teleportation? Super-speed? X-ray vision? Traveling the multiverse? Let’s get started.
#1 The power of data
Let’s say your favorite person has been trapped by an evil force. And it will only release them if you answer this simple riddle: Who were the top superheroes on Wikipedia the first week of February 2018?
Oh no! Who knows the answer to that question! Where will you get a log of all the Wikipedia pageviews? How can you tell which pages are super-heroes? How long will it take to collect all of this data, and comb through it all? Who can save us? And remember, you only have 10 minutes left!
Well, I can answer that question, and it will only take me 10 seconds. This is how:
SELECT SUM(views) views, title
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
JOIN (
SELECT DISTINCT en_wiki
FROM `fh-bigquery.wikidata.wikidata_latest_20190822`
WHERE EXISTS (SELECT * FROM UNNEST(instance_of) WHERE numeric_id=188784) # 188784 is the Wikidata numeric id for 'superhero'
AND en_wiki IS NOT null
) b
ON a.title=b.en_wiki
AND a.wiki='en'
AND DATE(a.datehour) BETWEEN '2018-02-01' AND '2018-02-07'
GROUP BY title
ORDER BY views DESC
LIMIT 10
There it is — in less than 10 seconds, I found all the superheroes on the English Wikipedia, and how many pageviews each one got in whatever period you want to test for. And these are the top 10, for the first week of February 2018:
There you are — at that time, the top superhero was Black Panther. But you might need to know more about each superhero to set your friend free. Well, a query along these lines will do.
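Here is a sketch of that query (the en_label and en_description columns are assumptions about the Wikidata table's schema, so adjust them to whatever fields you need):
SELECT SUM(a.views) views, b.en_wiki title,
  ANY_VALUE(b.en_label) label, ANY_VALUE(b.en_description) description
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
JOIN (
  SELECT en_wiki, en_label, en_description # assumed column names
  FROM `fh-bigquery.wikidata.wikidata_latest_20190822`
  WHERE EXISTS (SELECT * FROM UNNEST(instance_of) WHERE numeric_id=188784)
  AND en_wiki IS NOT NULL
) b
ON a.title=b.en_wiki
AND a.wiki='en'
AND DATE(a.datehour) BETWEEN '2018-02-01' AND '2018-02-07'
GROUP BY title
ORDER BY views DESC
LIMIT 10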
And how cool is this? Yes — I have the power of data — and you have it too. I’ve shared all the Wikipedia pageviews, and my latest Wikidata experiments — so now you can copy paste these queries, modify them, and save your friends.
#2 The power of teleportation
Oh, so you want to see the tables where I have the Wikipedia pageviews and Wikidata? Let’s jump to the BigQuery web UI.
Did you know that you can autocomplete your queries while typing them? Just press tab while writing your queries. Or you can run a sub-query by selecting it and pressing cmd-e.
And teleportation? We can jump straight to our tables by holding cmd and clicking on them. Now we can see that the Wikipedia 2018 pageviews table has more than 2 TB of data, and the Wikidata table I used has facts for more than 46 million entities. And we just joined them to get the results we wanted.
Also, while looking at the schema, you can click on the fields, and that will auto-populate your query. Ta-da!
Find a list of the shortcuts by clicking on the box at the top left.
#3 The power of miniaturization
Did I just say that the pageviews table has more than 2 TB of data? That’s a lot! Remember you have 1 free TB of queries every month, so going through 2 TB in one query means you will be out of the free quota pretty quickly. So how much data did I just consume? Let me run that first query again, without hitting the cache.
4.6 sec elapsed, 9.8 GB processed
How is that possible? I just joined a 2 TB table with a 750 GB one. Even with partitioning, one week of Wikipedia pageviews is 2 TB / 52 weeks, roughly 38.5 GB. So even with daily partitioning, I’m somehow querying less data than that.
Well, it turns out these tables are clustered by the wiki (language) and title columns, so make sure to always use those filters when going through the Wikipedia logs.
SELECT SUM(views) views, title
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE a.wiki='en'
AND DATE(a.datehour) BETWEEN '2018-02-01' AND '2018-02-07'
AND title='Carol_Danvers'
GROUP BY title
# 2.5 sec elapsed, 2.4 GB processed
And that’s how you miniaturize your queries!
And it even works with geo clusters.
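For example, with a table clustered on a GEOGRAPHY column (the table and column names below are made up for illustration), a query that filters down to a small area only scans the blocks near it:
SELECT name, point
FROM `my-project.geo.places_clustered` # hypothetical table, clustered by the point column
WHERE ST_DWITHIN(point, ST_GEOGPOINT(-122.41, 37.77), 10000) # only ~10 km around San Francisco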
#4 The power of X-ray vision
Remember that the annotated results took more time to process? Why? Well, with our X-ray vision powers, we can see what BigQuery did in the background. Let’s look at the query history and the execution details tab.
And those are all the steps BigQuery had to go through to run our query. Now, if this is a little hard to read, we have some alternatives. For example, the legacy BigQuery web UI has more compact results:
Here we can see that the slowest operations were the computations while reading the 56-million-row table twice.
So if I want to improve the performance of my query, I have to focus on that. Well, let’s do it — if I change the two occurrences of:
FROM `fh-bigquery.wikidata.subclasses_20190822`
to:
FROM `fh-bigquery.wikidata.wikidata_latest_20190822`
Now my query runs in half the time! And the slowest part has moved elsewhere:
Which is this JOIN now:
It even shows us that it’s looking for all the super-heroes between ‘3-D Man’ and ‘Zor-El’… yes, it’s going through the whole alphabet.
For an even deeper view, check out the BigQuery Query plan visualizer by Stephan Meyn:
#5 The power of Materialization
It’s really cool to have these tables in BigQuery. But how did I load them?
You can see that I periodically bring new raw files into GCS (Google Cloud Storage), and then I read them raw into BigQuery.
In the case of the Wikipedia pageviews, I do all the CSV parsing inside BigQuery, as there are many edge cases, and I need to solve some of them case by case.
Then I periodically materialize these into my partitioned and clustered tables.
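Here is a minimal sketch of that materialization step (the table names and the raw line format are illustrative, not my exact pipeline):
CREATE TABLE `my-project.wiki.pageviews_partitioned` # hypothetical destination table
PARTITION BY DATE(datehour)
CLUSTER BY wiki, title
AS
SELECT
  datehour,
  SPLIT(line, ' ')[SAFE_OFFSET(0)] AS wiki, # parse the raw space-separated line with SQL
  SPLIT(line, ' ')[SAFE_OFFSET(1)] AS title,
  SAFE_CAST(SPLIT(line, ' ')[SAFE_OFFSET(2)] AS INT64) AS views
FROM `my-project.wiki.pageviews_raw` # one row per raw line loaded from GCS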
In the case of Wikidata, they have some complicated JSON — so I read each JSON row raw into BigQuery. I could parse it with SQL, but that’s not enough. And that brings us to our next super power.
#6 Navigating the multiverse
So we live in this SQL universe. It’s an incredible place to manipulate and understand data, but each universe has its limitations and its rules. What if we could jump to a different universe, with different rules and powers, and manage to connect both universes, somehow? What if we could jump into the… JavaScript universe?
First, let’s talk about the UDFs — User Defined Functions:
[Go through SQL UDFs and how to share them, like fhoffa.x.random_int() or fhoffa.x.median(), then introduce the power of natural language processing by downloading a random JS library from the Internet and using it from within BigQuery.]
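As a quick sketch, a shareable SQL UDF can be as simple as this (not the exact definition of fhoffa.x.median(), just the idea):
CREATE TEMP FUNCTION median(arr ANY TYPE) AS ((
  # approximate median: the middle element of a 2-quantile split
  SELECT APPROX_QUANTILES(x, 2)[OFFSET(1)] FROM UNNEST(arr) x
));

SELECT median([1, 2, 3, 4, 5]) AS m # 3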
And thus I can now take each row of Wikidata JSON and parse it inside BigQuery, using whatever JavaScript logic I want, and then materialize the results into a table.
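Here is a minimal sketch of that pattern (the raw table name is hypothetical, and the function simply pulls the English label out of each Wikidata JSON row):
CREATE TEMP FUNCTION en_label(json STRING)
RETURNS STRING
LANGUAGE js AS """
  // any JavaScript you want can go here
  try {
    var entity = JSON.parse(json);
    return entity.labels && entity.labels.en ? entity.labels.en.value : null;
  } catch (e) {
    return null;
  }
""";

SELECT en_label(raw_json) label
FROM `my-project.wikidata.raw_json_rows` # hypothetical table with one raw JSON string per row
LIMIT 10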
#7 Time travel
Let’s take this table. It’s a beautiful table, with a couple thousand rows. But not everyone is happy — turns out someone wants to delete half of its rows — randomly.
How would our super-enemy pull this off?
DELETE FROM `fh-bigquery.temp.peaceful_universe`
WHERE fhoffa.x.random_int(0,10) < 5
Oh no. Half of the rows of our peaceful universe are gone. Randomly. How is that even fair? How will we ever recover from this?
“Mr Hoffa, I don’t feel so good” — one row.
5 days later
We learned how to move forward without these rows, but we still miss them. If only there was a way to travel back in time and bring them back.
Yes we can.
Instead of:
SELECT COUNT(*)
FROM `fh-bigquery.temp.peaceful_universe`
we can write:
SELECT COUNT(*)
FROM `fh-bigquery.temp.peaceful_universe`
FOR SYSTEM_TIME AS OF
TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -5 MINUTE)
And to replace the table with the past one:
CREATE OR REPLACE TABLE `fh-bigquery.temp.peaceful_universe`
AS
SELECT *
FROM `fh-bigquery.temp.peaceful_universe`
FOR SYSTEM_TIME AS OF
TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -5 MINUTE)
Warning: CREATE OR REPLACE TABLE deletes the table history, so write the results elsewhere. But soon this warning won’t be necessary.
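So a safer pattern, until then, is to first materialize the recovered rows into a different table (the _recovered name below is just an example):
CREATE TABLE `fh-bigquery.temp.peaceful_universe_recovered` # a new table, so the original keeps its history
AS
SELECT *
FROM `fh-bigquery.temp.peaceful_universe`
FOR SYSTEM_TIME AS OF TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -5 MINUTE)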
#8 The power of super-speed
#9 Invulnerability
Our most annoying enemy: division by zero.
From the BigQuery docs:
SAFE. prefix

Syntax: SAFE.function_name()

Description: If you begin a function with the SAFE. prefix, it will return NULL instead of an error. Operators, such as + and =, do not support the SAFE. prefix. To prevent errors from a division operation, use SAFE_DIVIDE. Some operators, such as IN, ARRAY, and UNNEST, resemble functions, but do not support the SAFE. prefix. The CAST and EXTRACT functions also do not support the SAFE. prefix. To prevent errors from casting, use SAFE_CAST.
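A quick example of what this buys you: each of these returns NULL instead of killing the whole query.
SELECT
  SAFE_DIVIDE(10, 0) division,            # NULL instead of a divide-by-zero error
  SAFE.LOG(10, -3) logarithm,             # NULL instead of an invalid-argument error
  SAFE_CAST('superhero' AS INT64) casted  # NULL instead of a cast error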
#10 The power of self-control
All super-heroes struggle when they first discover their super-powers. Having super strength is cool, but you can break a lot of things if you’re not careful. Having super-speed is fun — but only if you also learn how to brake. And if you’re accidentally exposed to gamma rays, you might never be able to control your powers once you get angry. Don’t get angry!
Look mom, I can query 5 petabytes of data in 3 minutes!
That’s super cool, until you remember that querying one petabyte is one thousand times more expensive than querying one terabyte. And you only have one free terabyte every month. If you have not entered a credit card, don’t worry — you will still get your free terabyte every month. But if you want to go further, you now need to be aware of your budget and set up cost controls.
Look at the docs:
And if that’s not clear enough, I wrote an illustrated guide on Stack Overflow:
Remember — with great power comes great responsibility. Turn on your cost controls.
And there is a lot more. How about the power to predict the future? Check out Lak Lakshmanan’s talk and more of the awesome resources we have. And that brings me to superpower #11:
#11 The power of community
No super-hero should stand alone. Join our reddit community, where we share tips and news. Come to Stack Overflow for answers, and to help new super-heroes learn the ropes. We can all learn from each other. And follow me and my friends on Twitter — we have a lot to share.
BigQuery heroes, assemble:
With that, a million thanks to all the heroes that helped me put this list together. Ping me, and I’ll tag you here.
And there’s so much more! I didn’t mention INFORMATION_SCHEMA, nested and repeated fields, load truncate, joining with MySQL/Cloud SQL, spreadsheets, GIS, saved queries, shared queries, exploring with Data Studio, BI Engine, multi regions, copying datasets, automated data transfers, and so much more.
Data heroes, let’s leave it here for today. I’ll be back! And remember, stay curious.