Felipe Hoffa

Sep 9, 2019


Loading MySQL backup files into BigQuery — straight out of Cloud SQL

Reading MySQL backup files into BigQuery is not always easy. Now you can load these backups into a Cloud SQL MySQL instance — and then have BigQuery read straight out of MySQL.

The problem

bq load --autodetect temp.users1 gs://fh/users.csv

BigQuery error in load operation: Error processing job: Error while reading data, error message: CSV table encountered too many errors, giving up. Failure details:
- gs://fh/users.csv: Error while reading data, error message: Error detected while parsing row starting at position: 1558192614. Error: Data between close double quote (") and field separator.

The solution: Cloud SQL as a BigQuery federated source

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users;')
# This statement created a new table named fh-bigquery:temp.users.

Step by step

Set up a MySQL instance in Cloud SQL

https://console.cloud.google.com/sql/create-instance-mysql
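
If you'd rather script this step than click through the console, a gcloud sketch along these lines creates the instance (the version, tier, and region here are my assumptions, not from the post):

# A minimal sketch; adjust version, tier, and region to taste.
gcloud sql instances create mycloudsql \
  --database-version=MYSQL_5_7 \
  --tier=db-n1-standard-1 \
  --region=us-central1
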
Creating a user for BigQuery
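
In SQL terms, that user-creation step can look like this sketch (the user name and password are placeholders, not values from the post):

-- Hypothetical credentials; grant only the read access BigQuery needs.
CREATE USER 'bqreader'@'%' IDENTIFIED BY 'choose-a-strong-password';
GRANT SELECT ON ghtorrent.* TO 'bqreader'@'%';
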
Creating the ghtorrent database
gcloud sql connect mycloudsql --user=root --quiet
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL,ALLOW_INVALID_DATES';
SET @OLD_TIME_ZONE=@@session.time_zone;
DROP SCHEMA IF EXISTS `ghtorrent` ;
CREATE SCHEMA IF NOT EXISTS `ghtorrent` DEFAULT CHARACTER SET utf8 ;
USE `ghtorrent` ;
-- -----------------------------------------------------
-- Table `ghtorrent`.`users`
-- -----------------------------------------------------
DROP TABLE IF EXISTS `ghtorrent`.`users` ;
CREATE TABLE IF NOT EXISTS `ghtorrent`.`users` (
`id` INT(11) NOT NULL AUTO_INCREMENT COMMENT '',
`login` VARCHAR(255) NOT NULL COMMENT '',
`company` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
`created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '',
`type` VARCHAR(255) NOT NULL DEFAULT 'USR' COMMENT '',
`fake` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`deleted` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`long` DECIMAL(11,8) COMMENT '',
`lat` DECIMAL(10,8) COMMENT '',
`country_code` CHAR(3) COMMENT '',
`state` VARCHAR(255) COMMENT '',
`city` VARCHAR(255) COMMENT '',
`location` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
PRIMARY KEY (`id`) COMMENT '')
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;

Load your backups into Cloud SQL

gcloud sql import csv \
mycloudsql \
gs://fh-bigquery/ghtorrent/mysql-2019-02-01/users.csv \
--database=ghtorrent --table=users;
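
Once the import finishes, a quick sanity check from the mysql session (gcloud sql connect, as above) confirms the rows arrived:

USE ghtorrent;
SELECT COUNT(*) FROM users;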

Connect BigQuery to Cloud SQL

Connecting BigQuery to our Cloud SQL instance
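
The console flow in the screenshot is the easiest path. As a command-line alternative, a bq mk --connection invocation along these lines can create the same connection (the instance path, region, and credentials below are placeholders, not values from the post):

# A sketch using the BigQuery connections CLI; all values are placeholders.
bq mk --connection --display_name='mycloudsql' \
  --connection_type='CLOUD_SQL' \
  --properties='{"instanceId":"fh-bigquery:us-central1:mycloudsql","database":"ghtorrent","type":"MYSQL"}' \
  --connection_credential='{"username":"bqreader","password":"choose-a-strong-password"}' \
  --project_id=fh-bigquery --location=US mycloudsql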

Now just read into BigQuery

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users;')
Errors while reading from MySQL
Error while reading data, error message: Failed to convert column created_at of 867-th row from MySQL to BigQuery. Error: Returns non-TIMESTAMP value: 0000-00-00 00:00:00, TIMESTAMP must be in format: 'YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]][time zone]', and range between 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999. You can use MySQL formatting functions to convert TIMESTAMP to BigQuery format.

Reading into BigQuery with some SQL magic

CREATE TABLE temp.users AS
SELECT * EXCEPT(x) REPLACE(x AS created_at)
FROM EXTERNAL_QUERY(
  'fh-bigquery.US.ghtorrent'
  , 'SELECT *, IF(created_at="0000-00-00 00:00:00",null,created_at) x FROM users')
# Query complete (5 min 58 sec elapsed, 2.6 GB processed)
  • IF(created_at="0000-00-00 00:00:00",null,created_at) AS x is executed inside MySQL. When it finds an invalid timestamp, it replaces it with null.
  • SELECT * EXCEPT(x) REPLACE(x AS created_at) is executed inside BigQuery. REPLACE() replaces the faulty created_at column with its fix, and EXCEPT() makes sure we don’t import that temporary column.

Implementation notes

  • The final query took 6 minutes and processed 2.6 GB. While testing, you can add a LIMIT within the MySQL query to reduce these values substantially (down to about 3 seconds and 1 KB in my tests).
  • Loading this table into BigQuery didn’t treat \N as null. You can fix that the same way we fixed the invalid created_at timestamps (see the sketch after this list).
  • Note that we didn’t need to set up service accounts or firewall rules anywhere. This is all handled transparently between Cloud SQL and BigQuery.
  • gcloud makes it easy to load data into MySQL tables, but you still need to manually create those tables as shown above.
  • You can stop and restart your Cloud SQL instances as needed (to save costs).
  • I love that BigQuery is able to figure out the schema automatically from a MySQL SELECT *.
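
For example, the \N fix mentioned in the notes above can ride along with the timestamp fix. A sketch, assuming the literal two-character sequence \N landed in the company column (the raw string r'...' keeps BigQuery from consuming the backslashes before MySQL sees them):

CREATE TABLE temp.users AS
SELECT * EXCEPT(x, y) REPLACE(x AS created_at, y AS company)
FROM EXTERNAL_QUERY(
  'fh-bigquery.US.ghtorrent'
  , r'SELECT *, IF(created_at="0000-00-00 00:00:00", null, created_at) x, IF(company="\\N", null, company) y FROM users')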

On loading MySQL backups straight into BigQuery

Alternative: Parse with a UDF

SELECT csv.cols[SAFE_OFFSET(0)] a
  , csv.cols[SAFE_OFFSET(1)] b
  , csv.cols[SAFE_OFFSET(2)] c
  , csv.cols[SAFE_OFFSET(3)] d
  , csv.cols[SAFE_OFFSET(4)] e
  , csv.cols[SAFE_OFFSET(5)] f
FROM (
  SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
)
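
fhoffa.x.parse_csv is a shared persistent UDF; its source isn't shown in this post. A roughly equivalent temporary JavaScript UDF could look like this sketch, which handles commas inside quotes and backslash-escaped quotes, and returns a struct with a cols array to match the usage above (my approximation, not the real implementation):

-- A minimal sketch, not the actual fhoffa.x.parse_csv source.
CREATE TEMP FUNCTION parse_csv(line STRING)
RETURNS STRUCT<cols ARRAY<STRING>>
LANGUAGE js AS '''
  var cols = [], cur = '', inQuotes = false;
  for (var i = 0; i < line.length; i++) {
    var c = line[i];
    if (c == '\\\\' && i + 1 < line.length) { cur += line[++i]; }  // keep escaped char as-is
    else if (c == '"') { inQuotes = !inQuotes; }                   // toggle quoted state
    else if (c == ',' && !inQuotes) { cols.push(cur); cur = ''; }  // end of field
    else { cur += c; }
  }
  cols.push(cur);
  return {cols: cols};
''';

SELECT parse_csv('a,"b, with comma",c').cols[SAFE_OFFSET(1)];  -- returns "b, with comma"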

Want more?