Loading MySQL backup files into BigQuery — straight out of Cloud SQL
Reading MySQL backup files into BigQuery is not always easy. Now you can load these backups into a Cloud SQL MySQL instance — and then have BigQuery read straight out of MySQL.
Let’s say you need to load some existing MySQL backups into BigQuery — like the MySQL dumps that GHTorrent publishes. But when you try to load them into BigQuery, you find that it’s not straightforward:
bq load --autodetect temp.users1 gs://fh/users.csv BigQuery error in load operation: Error processing job:Error while reading data, error message: CSV table encountered too many errors, giving up.Failure details:
- gs://fh/users.csv: Error while reading data, error message: Error detected while parsing row starting at position: 1558192614. Error: Data between close double quote (") and field separator.
The solution: Cloud SQL as a BigQuery federated source
Now that BigQuery can read tables straight out from Cloud SQL instances (MySQL and PostgreSQL) we can just load our MySQL backups into Cloud SQL, and then:
CREATE TABLE temp.users ASSELECT * FROM EXTERNAL_QUERY(
, 'SELECT * FROM users;')# This statement created a new table named fh-bigquery:temp.users.
Step by step
Setup a MySQL instance in Cloud SQL
It’s pretty straightforward to create a MySQL instance:
Then create a user for BigQuery to connect:
Create a database too:
You can connect to it:
gcloud sql connect mycloudsql --user=root --quiet
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL,ALLOW_INVALID_DATES';
DROP SCHEMA IF EXISTS `ghtorrent` ;
CREATE SCHEMA IF NOT EXISTS `ghtorrent` DEFAULT CHARACTER SET utf8 ;
USE `ghtorrent` ;
-- Table `ghtorrent`.`users`
DROP TABLE IF EXISTS `ghtorrent`.`users` ;
CREATE TABLE IF NOT EXISTS `ghtorrent`.`users` (
`id` INT(11) NOT NULL AUTO_INCREMENT COMMENT '',
`login` VARCHAR(255) NOT NULL COMMENT '',
`company` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
`created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '',
`type` VARCHAR(255) NOT NULL DEFAULT 'USR' COMMENT '',
`fake` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`deleted` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`long` DECIMAL(11,8) COMMENT '',
`lat` DECIMAL(10,8) COMMENT '',
`country_code` CHAR(3) COMMENT '',
`state` VARCHAR(255) COMMENT '',
`city` VARCHAR(255) COMMENT '',
`location` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
PRIMARY KEY (`id`) COMMENT '')
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;
Note that particular MySQL config line that says
ALLOW_INVALID_DATES, which will create some trouble for us later.
Load your backups into CloudSQL
Once you’ve created a database and the tables are set up, reading an existing file from GCS into CloudSQL is a one step operation:
gcloud sql import csv \
Connect BigQuery to CloudSQL
In the new BigQuery web UI you can create a connection to the CloudSQL instance we just created:
Make sure to create this connection in a matching region.
Now just read into BigQuery
Everything is in it’s place — let’s bring the data in:
CREATE TABLE temp.users ASSELECT * FROM EXTERNAL_QUERY(
, 'SELECT * FROM users;')
Ok, now we need to deal with this error message:
Error while reading data, error message: Failed to convert column created_at of 867-th row from MySQL to BigQuery. Error: Returns non-TIMESTAMP value: 0000–00–00 00:00:00, TIMESTAMP must be in format: ‘YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]][time zone]’, and range between 0001–01–01 00:00:00 to 9999–12–31 23:59:59.999999. You can use MySQL formatting functions to convert TIMESTAMP to BigQuery format.
Turns out the MySQL table has some invalid timestamps — so it won’t be as straightforward. How do we fix this?
Reading into BigQuery with some SQL magic
The query that fixes the import process:
CREATE TABLE temp.users AS
SELECT * EXCEPT(x) REPLACE(x AS created_at)
, 'SELECT *, IF(created_at="0000-00-00 00:00:00",null,created_at) x FROM users')# Query complete (5 min 58 sec elapsed, 2.6 GB processed)
How does this work?
IF(created_at=”0000–00–00 00:00:00",null,created_at) AS xis executed inside MySQL. When it finds an invalid timestamp, it replaces it with
SELECT * EXCEPT(x) REPLACE(x AS created_at)is executed inside BigQuery.
REPLACE()replaces the faulty
created_atcolumn with its fix, and
EXCEPT()makes sure we don’t import that temporary column.
- The final query took 6 minutes and processed 2.6GB. While testing you can add a
LIMITwithin the MySQL query to reduce these values substantially (3 seconds and 1KB in the picture’s example).
- Loading this table into BigQuery didn’t treat
\Nas null. You can fix that in the same way we fixed
- Note that we didn’t need to set up service accounts or firewall rules anywhere. This is all handled transparently between Cloud SQL and BigQuery.
gcloudmakes it easy to load data into MySQL tables, but you still need to manually create those tables as shown above.
- You can stop and restart your Cloud SQL instances as needed (to save costs).
- I love that BigQuery is able to figure out the schema automatically out of a MySQL
On loading MySQL backups straight into BigQuery
Why wasn’t it straightforward to load the .CSV files produced by MySQL into BigQuery? Turns out MySQL default CSV export is not 100% CSV compliant (due to how they escape their quoted quote marks).
Stack Overflow has been asking about this:
- Data between quotes and field (9k views), Valid CSV filed import fails with Data between close double quote (“) and field separator: field starts with (1k views), Best Practice to migrate data from MySQL to BigQuery (4k views).
There’s an open issue requesting BigQuery’s support for MySQL’s default CSV export choices — vote and follow it for updates. Meanwhile in this post, we’ll find out how BigQuery can read these tables straight out of MySQL.
Alternative: Parse with a UDF
You could also ingest the rows “raw” and parse the CSV inside BigQuery. See See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83:
SELECT csv.cols[SAFE_OFFSET(0)] a
, csv.cols[SAFE_OFFSET(1)] b
SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
Imagine the possibilities: Now we can have a live Cloud SQL instance replicating a remote MySQL database, for live updates into BigQuery of your transactional database. For more on this, stay tuned.
Optimizing BigQuery: Cluster your tables
BigQuery just announced the ability to cluster tables — which I’ll describe here. If you are looking for massive…