Loading MySQL backup files into BigQuery — straight out of Cloud SQL

Reading MySQL backup files into BigQuery is not always easy. Now you can load these backups into a Cloud SQL MySQL instance — and then have BigQuery read straight out of MySQL.

The problem

Let’s say you need to load some existing MySQL backups into BigQuery — like the MySQL dumps that GHTorrent publishes. But when you try to load them into BigQuery, you find that it’s not straightforward (the notes at the end of this post explain why).

The solution: Cloud SQL as a BigQuery federated source

Now that BigQuery can read tables straight out of Cloud SQL instances (MySQL and PostgreSQL), we can just load our MySQL backups into Cloud SQL and let BigQuery query them from there.

Step by step

Set up a MySQL instance in Cloud SQL

It’s pretty straightforward to create a MySQL instance:
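For example, with the gcloud CLI (the instance name, tier, and region below are placeholders; adjust them to your needs):

```shell
# Create a MySQL 5.7 instance. Remember the region you pick:
# the BigQuery connection later on must live in a matching region.
gcloud sql instances create ghtorrent \
  --database-version=MYSQL_5_7 \
  --tier=db-n1-standard-2 \
  --storage-size=100GB \
  --region=us-central1
```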

Then create a user for BigQuery to connect:
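One way to do this with gcloud (user name and password here are placeholders; keep the password handy, BigQuery will need it later):

```shell
# Create the user BigQuery will connect as.
gcloud sql users create bigquery \
  --instance=ghtorrent \
  --password='a-strong-password'
```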

Create a database too:
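A sketch, with the database name as a placeholder:

```shell
# Create the database that will hold the GHTorrent tables.
gcloud sql databases create ghtorrent --instance=ghtorrent
```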

You can connect to it:
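gcloud can open a mysql session for you, temporarily allowlisting your IP:

```shell
# Opens an interactive mysql prompt against the new instance.
gcloud sql connect ghtorrent --user=root
```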

And set up your tables — in this case, taking the script out of GHTorrent’s schema.sql:
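A sketch, assuming you’ve downloaded GHTorrent’s schema.sql to your machine (`<INSTANCE_IP>` is a placeholder for the instance’s public IP):

```shell
# Option 1: from the mysql prompt opened by `gcloud sql connect`:
#   USE ghtorrent;
#   SOURCE schema.sql;
# Option 2: non-interactively, if your client IP is authorized on the instance:
mysql --host=<INSTANCE_IP> --user=root -p ghtorrent < schema.sql
```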

Note the particular MySQL config line that sets ALLOW_INVALID_DATES; it will create some trouble for us later.

Load your backups into Cloud SQL

Once you’ve created a database and the tables are set up, reading an existing file from GCS into Cloud SQL is a one-step operation:
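GHTorrent publishes its dumps as CSV files; assuming you’ve copied one into a GCS bucket, a single gcloud command loads it into the matching table (bucket, file, and table names below are placeholders):

```shell
# Import a CSV file from Cloud Storage straight into a MySQL table.
gcloud sql import csv ghtorrent \
  gs://my-bucket/users.csv \
  --database=ghtorrent \
  --table=users
```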

Connect BigQuery to Cloud SQL

In the new BigQuery web UI you can create a connection to the Cloud SQL instance we just created.

Make sure to create this connection in a matching region.
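You can also create the connection from the command line with bq (project, instance, connection ID, and credentials below are placeholders):

```shell
# Create a BigQuery connection resource pointing at the Cloud SQL instance.
# The --location ("us" here) must match the instance's region.
bq mk --connection \
  --connection_type='CLOUD_SQL' \
  --properties='{"instanceId":"my-project:us-central1:ghtorrent","database":"ghtorrent","type":"MYSQL"}' \
  --connection_credential='{"username":"bigquery","password":"a-strong-password"}' \
  --project_id=my-project \
  --location=us \
  ghtorrent_connection
```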

Now just read into BigQuery

Everything is in its place — let’s bring the data in:
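The straightforward attempt looks like this (project, connection ID, and table name are placeholders; it assumes a ghtorrent dataset already exists in BigQuery):

```shell
# Materialize a MySQL table into BigQuery through the federated connection.
bq query --use_legacy_sql=false '
CREATE TABLE ghtorrent.users AS
SELECT *
FROM EXTERNAL_QUERY("my-project.us.ghtorrent_connection",
                    "SELECT * FROM users;")
'
```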

But then the query fails with an error.

Turns out the MySQL table has some invalid timestamps, so it won’t be as straightforward. How do we fix this?

Reading into BigQuery with some SQL magic

The query that fixes the import process:
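A reconstruction of that query, based on the explanation below (project, connection ID, and table name are placeholders):

```shell
# The inner SELECT runs in MySQL; the outer EXCEPT/REPLACE runs in BigQuery.
# \" is an escaped quote inside the BigQuery string literal; MySQL treats
# double-quoted strings as string literals by default.
bq query --use_legacy_sql=false '
CREATE TABLE ghtorrent.users AS
SELECT * EXCEPT(x) REPLACE(x AS created_at)
FROM EXTERNAL_QUERY("my-project.us.ghtorrent_connection",
  "SELECT *, IF(created_at=\"0000-00-00 00:00:00\", null, created_at) AS x FROM users;")
'
```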

How does this work?

  • IF(created_at='0000-00-00 00:00:00', null, created_at) AS x is executed inside MySQL. When it finds an invalid timestamp, it replaces it with null.
  • SELECT * EXCEPT(x) REPLACE(x AS created_at) is executed inside BigQuery. REPLACE() replaces the faulty created_at column with its fixed version, and EXCEPT() makes sure we don’t import that temporary column.

Implementation notes

  • The final query took 6 minutes and processed 2.6GB. While testing you can add a LIMIT within the MySQL query to reduce these values substantially (around 3 seconds and 1KB in my tests).
  • Loading this table into BigQuery didn’t treat \N as null. You can fix that in the same way we fixed created_at invalid timestamps.
  • Note that we didn’t need to set up service accounts or firewall rules anywhere. This is all handled transparently between Cloud SQL and BigQuery.
  • gcloud makes it easy to load data into MySQL tables, but you still need to manually create those tables as shown above.
  • You can stop and restart your Cloud SQL instances as needed (to save costs).
  • I love that BigQuery is able to figure out the schema automatically from a MySQL SELECT *.

On loading MySQL backups straight into BigQuery

Why wasn’t it straightforward to load the CSV files produced by MySQL into BigQuery? It turns out MySQL’s default CSV export is not 100% CSV compliant, because of how it escapes quote marks inside quoted fields.
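To illustrate the difference (a sketch of the two conventions, not MySQL’s literal output):

```shell
# MySQL-style escaping: an embedded double quote becomes \"
printf '1,"say \\"hi\\""\n'
# RFC 4180-style escaping (what BigQuery's CSV loader expects): "" instead
printf '1,"say ""hi"""\n'
```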

This has come up on Stack Overflow repeatedly.

There’s an open issue requesting BigQuery’s support for MySQL’s default CSV export choices — vote and follow it for updates. Meanwhile, this post showed how BigQuery can read these tables straight out of MySQL.

Alternative: Parse with a UDF

You could also ingest the rows “raw” and parse the CSV inside BigQuery with a persistent UDF. See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83.

Want more?

Imagine the possibilities: now we can have a live Cloud SQL instance replicating a remote MySQL database, giving BigQuery a live view of your transactional database. For more on this, stay tuned.

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Tweet me @felipehoffa, my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.
