
Loading MySQL backup files into BigQuery — straight out of Cloud SQL

Reading MySQL backup files into BigQuery is not always easy. Now you can load these backups into a Cloud SQL MySQL instance — and then have BigQuery read straight out of MySQL.

6 min read · Sep 9, 2019

The problem

Let’s say you need to load some existing MySQL backups into BigQuery — like the MySQL dumps that GHTorrent publishes. But when you try to load them into BigQuery, you find that it’s not straightforward:

bq load --autodetect temp.users1 gs://fh/users.csv

BigQuery error in load operation: Error processing job: Error while reading data,
error message: CSV table encountered too many errors, giving up.
Failure details:
- gs://fh/users.csv: Error while reading data, error message: Error detected
  while parsing row starting at position: 1558192614. Error: Data between close
  double quote (") and field separator.

The solution: Cloud SQL as a BigQuery federated source

Now that BigQuery can read tables straight out of Cloud SQL instances (MySQL and PostgreSQL), we can just load our MySQL backups into Cloud SQL, and then:

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
'fh-bigquery.US.mycloudsql'
, 'SELECT * FROM users;')
# This statement created a new table named fh-bigquery:temp.users.

Step by step

Set up a MySQL instance in Cloud SQL

It’s pretty straightforward to create a MySQL instance:

https://console.cloud.google.com/sql/create-instance-mysql
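
If you prefer the command line, a sketch with gcloud should also work (the tier and region here are just example values; pick whatever fits your project):

gcloud sql instances create mycloudsql \
  --database-version=MYSQL_5_7 \
  --tier=db-n1-standard-1 \
  --region=us-central1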

Then create a user for BigQuery to connect:

Creating a user for BigQuery

Create a database too:

Creating the ghtorrent database
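
Both of these steps can also be scripted with gcloud (a sketch; the user name and password are placeholders):

gcloud sql users create bigquery --instance=mycloudsql --password=YOUR_PASSWORD
gcloud sql databases create ghtorrent --instance=mycloudsql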

You can connect to it:

gcloud sql connect mycloudsql --user=root --quiet

And set up your tables — in this case, taking the script out of GHTorrent’s schema.sql:

SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL,ALLOW_INVALID_DATES';
SET @OLD_TIME_ZONE=@@session.time_zone;
DROP SCHEMA IF EXISTS `ghtorrent` ;
CREATE SCHEMA IF NOT EXISTS `ghtorrent` DEFAULT CHARACTER SET utf8 ;
USE `ghtorrent` ;
-- -----------------------------------------------------
-- Table `ghtorrent`.`users`
-- -----------------------------------------------------
DROP TABLE IF EXISTS `ghtorrent`.`users` ;
CREATE TABLE IF NOT EXISTS `ghtorrent`.`users` (
`id` INT(11) NOT NULL AUTO_INCREMENT COMMENT '',
`login` VARCHAR(255) NOT NULL COMMENT '',
`company` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
`created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '',
`type` VARCHAR(255) NOT NULL DEFAULT 'USR' COMMENT '',
`fake` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`deleted` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`long` DECIMAL(11,8) COMMENT '',
`lat` DECIMAL(10,8) COMMENT '',
`country_code` CHAR(3) COMMENT '',
`state` VARCHAR(255) COMMENT '',
`city` VARCHAR(255) COMMENT '',
`location` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
PRIMARY KEY (`id`) COMMENT '')
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;

Note the SQL_MODE line that includes ALLOW_INVALID_DATES: it will create some trouble for us later.

Load your backups into Cloud SQL

Once you’ve created a database and set up the tables, reading an existing file from GCS into Cloud SQL is a one-step operation:

gcloud sql import csv \
mycloudsql \
gs://fh-bigquery/ghtorrent/mysql-2019-02-01/users.csv \
--database=ghtorrent --table=users;

Connect BigQuery to Cloud SQL

In the new BigQuery web UI you can create a connection to the Cloud SQL instance we just created:

Connecting BigQuery to our CloudSQL instance

Make sure to create this connection in a matching region.
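
If you’d rather script this step, the bq command-line tool can create the same kind of connection (a rough sketch; the property values are placeholders for your own project, region, and instance, and the exact flags may have changed since this was written):

bq mk --connection --location=US --project_id=fh-bigquery \
  --connection_type=CLOUD_SQL \
  --properties='{"instanceId":"fh-bigquery:us-central1:mycloudsql","database":"ghtorrent","type":"MYSQL"}' \
  --connection_credential='{"username":"bigquery","password":"YOUR_PASSWORD"}' \
  mycloudsql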

Now just read into BigQuery

Everything is in its place — let’s bring the data in:

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
'fh-bigquery.US.mycloudsql'
, 'SELECT * FROM users;')

But then:

Errors while reading from MySQL

Ok, now we need to deal with this error message:

Error while reading data, error message: Failed to convert column created_at of 867-th row from MySQL to BigQuery. Error: Returns non-TIMESTAMP value: 0000-00-00 00:00:00, TIMESTAMP must be in format: 'YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]][time zone]', and range between 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999. You can use MySQL formatting functions to convert TIMESTAMP to BigQuery format.

Turns out the MySQL table has some invalid timestamps — so it won’t be as straightforward. How do we fix this?

Reading into BigQuery with some SQL magic

The query that fixes the import process:

CREATE TABLE temp.users AS
SELECT * EXCEPT(x) REPLACE(x AS created_at)
FROM EXTERNAL_QUERY(
'fh-bigquery.US.mycloudsql'
, 'SELECT *, IF(created_at="0000-00-00 00:00:00",null,created_at) x FROM users')
# Query complete (5 min 58 sec elapsed, 2.6 GB processed)

How does this work?

  • IF(created_at="0000-00-00 00:00:00", null, created_at) AS x is executed inside MySQL. When it finds an invalid timestamp, it replaces it with null.
  • SELECT * EXCEPT(x) REPLACE(x AS created_at) is executed inside BigQuery. REPLACE() swaps the faulty created_at column for its fixed version, and EXCEPT() makes sure we don’t import the temporary x column (see the minimal sketch below).
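
As a minimal illustration of the same pattern, here is a self-contained sketch (the table t and its columns a and b are hypothetical, with negative values of b standing in for the bad timestamps):

SELECT * EXCEPT(x) REPLACE(x AS b)      -- outer query: drop the helper column, swap in the fix
FROM (
  SELECT a, b, IF(b < 0, NULL, b) AS x  -- inner query: compute the fixed value as x
  FROM t
)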

Implementation notes

  • The final query took 6 minutes and processed 2.6GB. While testing you can add a LIMIT within the MySQL query to reduce these values substantially (3 seconds and 1KB in the picture’s example).
  • Loading this table into BigQuery didn’t treat \N as null. You can fix that in the same way we fixed the invalid created_at timestamps (see the sketch after this list).
  • Note that we didn’t need to set up service accounts or firewall rules anywhere. This is all handled transparently between Cloud SQL and BigQuery.
  • gcloud makes it easy to load data into MySQL tables, but you still need to manually create those tables as shown above.
  • You can stop and restart your Cloud SQL instances as needed (to save costs).
  • I love that BigQuery is able to figure out the schema automatically out of a MySQL SELECT *.
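
For instance, if the literal two-character string \N ended up stored in a string column such as company (a hypothetical example), a sketch of the same REPLACE trick, this time fixing the value on the BigQuery side, could look like:

CREATE TABLE temp.users_fixed AS
SELECT * REPLACE(NULLIF(company, r'\N') AS company)  -- r'\N' is a BigQuery raw string: backslash + N
FROM EXTERNAL_QUERY(
'fh-bigquery.US.mycloudsql'
, 'SELECT * FROM users;')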

On loading MySQL backups straight into BigQuery

Why wasn’t it straightforward to load the .CSV files produced by MySQL into BigQuery? It turns out MySQL’s default CSV export is not 100% CSV compliant, due to how it escapes quote marks inside quoted fields.

This has come up on Stack Overflow more than once, and there’s an open issue requesting BigQuery support for MySQL’s default CSV export choices; vote and follow it for updates. Meanwhile, the approach in this post lets BigQuery read these tables straight out of MySQL.
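
To make the incompatibility concrete, here is a hand-made illustration (not an actual GHTorrent row) of how a field containing a quote mark comes out in each dialect:

1,"he said ""hi"" today"     <- RFC 4180 CSV: the inner quote is doubled
1,"he said \"hi\" today"     <- MySQL's default export: the inner quote is backslash-escaped

BigQuery’s CSV parser expects the first form, which is why the second one triggers the "Data between close double quote and field separator" error we saw above.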

Alternative: Parse with a UDF

You could also ingest the rows “raw” and parse the CSV inside BigQuery with a UDF. See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83:

SELECT csv.cols[SAFE_OFFSET(0)] a
, csv.cols[SAFE_OFFSET(1)] b
, csv.cols[SAFE_OFFSET(2)] c
, csv.cols[SAFE_OFFSET(3)] d
, csv.cols[SAFE_OFFSET(4)] e
, csv.cols[SAFE_OFFSET(5)] f
FROM (
  SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
)
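
To get the rows in “raw” in the first place, one option (a sketch, not taken from the post linked above; the tab delimiter is an assumption that only works if that character never appears in the data) is to load each line into a single string column:

bq load --source_format=CSV --field_delimiter='\t' --quote='' \
  temp.users_raw gs://fh/users.csv line:STRING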

Want more?

Imagine the possibilities: now you could have a Cloud SQL instance replicating a remote MySQL database, giving BigQuery a live view of your transactional data. For more on this, stay tuned.

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Find me on Twitter at @felipehoffa, my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.
