BigQuery beyond SQL and JS: Running C and Rust code at scale with wasm

Update 2019: Now you can run these functions with #standardSQL

— — — — —

It all started here:

As you can see, we were all impressed:

3 seconds to add 2 numbers isn’t impressive at all, but a couple of hours later we were able to run this same function at scale:

1 row in 3 seconds is not impressive. 5 billion rows in 56 seconds is. Let’s see how we went from A to B:

BigQuery strengths: Throughput, not latency

If you expect results in less than one second, BigQuery is not the right tool: it will rarely return results that fast.

This is because BigQuery was built for throughput, not for latency. A race car will always be faster than a heavyweight truck, except when you ask the race car to move 3 tons of cargo from one place to another. That’s when you bring BigQuery into your life: not for sub-second operations, but to analyze tons of data effortlessly.

Nevertheless, scaling this WebAssembly function up in BigQuery wasn’t as easy as it might seem.

BigQuery UDF tricks: For slow UDFs, batch the requests

In Francesc’s initial UDF, most of the time is spent initializing WebAssembly: every new row passed to the UDF pays the full initialization cost again.

To solve this problem we can batch the requests to the UDF: we group rows into arbitrary big chunks, so the UDF initializes once per chunk and then processes every row in it without re-initializing.
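The effect can be sketched in plain JavaScript. This is a toy stand-in: `expensiveInit` is a hypothetical placeholder for the WebAssembly setup cost, not BigQuery’s actual runtime:

```javascript
// Toy sketch: count how many times the expensive initialization runs
// under each strategy. initCount stands in for the WebAssembly setup cost.
let initCount = 0;
function expensiveInit() {
  initCount++;              // pretend this is the slow WASM setup
  return (a, b) => a + b;   // the function the module would export
}

const rows = [[1, 2], [3, 4], [5, 6]];

// Per-row UDF calls: the setup runs once for every row.
initCount = 0;
rows.forEach(([x, y]) => expensiveInit()(x, y));
console.log(initCount); // 3 initializations for 3 rows

// Batched UDF call: the setup runs once for the whole chunk.
initCount = 0;
const sum = expensiveInit();
rows.forEach(([x, y]) => sum(x, y));
console.log(initCount); // 1 initialization for all 3 rows
```

With 5 billion rows, that difference between once-per-row and once-per-chunk is the whole ballgame.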

This means wrapping a source like this:

SELECT requests FROM [fh-bigquery:wikipedia.pagecounts_201205]

into random groups, with code like this:

SELECT FLOOR(RAND()*100000) group, NEST(requests) as x
FROM (
  SELECT requests FROM [fh-bigquery:wikipedia.pagecounts_201205])
GROUP BY group

For a table of 5 billion rows like this one, this code creates ~100,000 groups of ~50,000 elements each. We can pass these groups to the UDF, to get to our desired performance.
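A quick sanity check of that arithmetic, simulated at a smaller scale (1 million rows into 100 groups rather than 5 billion into 100,000):

```javascript
// Simulate FLOOR(RAND()*G) grouping: N rows assigned to G random groups.
const N = 1_000_000;  // rows (scaled down from 5 billion)
const G = 100;        // groups (scaled down from 100,000)
const counts = new Array(G).fill(0);
for (let i = 0; i < N; i++) {
  counts[Math.floor(Math.random() * G)]++;
}
// Every row lands in exactly one group, so the average chunk size is N/G.
const avg = counts.reduce((a, b) => a + b, 0) / G;
console.log(avg); // 10000 rows per group on average
```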


Code

Adding 2 numbers in 3 seconds

SELECT * FROM
js(
  (SELECT 1 x, 2 y)
  , x, y
  , "[{name:'s', type: 'float'}]",
  "function (row, emit) {
    const memory = new WebAssembly.Memory({ initial: 256, maximum: 256 });
    const env = {
      'abortStackOverflow': _ => { throw new Error('overflow'); },
      'table': new WebAssembly.Table({ initial: 0, maximum: 0, element: 'anyfunc' }),
      'tableBase': 0,
      'memory': memory,
      'memoryBase': 1024,
      'STACKTOP': 0,
      'STACK_MAX': memory.buffer.byteLength,
    };
    const imports = { env };
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 139, 128, 128, 128, 0, 2, 96, 1, 127, 0, 96, 2, 127, 127, 1, 127, 2, 254, 128, 128, 128, 0, 7, 3, 101, 110, 118, 8, 83, 84, 65, 67, 75, 84, 79, 80, 3, 127, 0, 3, 101, 110, 118, 9, 83, 84, 65, 67, 75, 95, 77, 65, 88, 3, 127, 0, 3, 101, 110, 118, 18, 97, 98, 111, 114, 116, 83, 116, 97, 99, 107, 79, 118, 101, 114, 102, 108, 111, 119, 0, 0, 3, 101, 110, 118, 6, 109, 101, 109, 111, 114, 121, 2, 1, 128, 2, 128, 2, 3, 101, 110, 118, 5, 116, 97, 98, 108, 101, 1, 112, 1, 0, 0, 3, 101, 110, 118, 10, 109, 101, 109, 111, 114, 121, 66, 97, 115, 101, 3, 127, 0, 3, 101, 110, 118, 9, 116, 97, 98, 108, 101, 66, 97, 115, 101, 3, 127, 0, 3, 130, 128, 128, 128, 0, 1, 1, 6, 147, 128, 128, 128, 0, 3, 127, 1, 35, 0, 11, 127, 1, 35, 1, 11, 125, 1, 67, 0, 0, 0, 0, 11, 7, 136, 128, 128, 128, 0, 1, 4, 95, 115, 117, 109, 0, 1, 9, 129, 128, 128, 128, 0, 0, 10, 196, 128, 128, 128, 0, 1, 190, 128, 128, 128, 0, 1, 7, 127, 2, 64, 35, 4, 33, 8, 35, 4, 65, 16, 106, 36, 4, 35, 4, 35, 5, 78, 4, 64, 65, 16, 16, 0, 11, 32, 0, 33, 2, 32, 1, 33, 3, 32, 2, 33, 4, 32, 3, 33, 5, 32, 4, 32, 5, 106, 33, 6, 32, 8, 36, 4, 32, 6, 15, 0, 11, 0, 11]);
    WebAssembly.instantiate(bytes, imports).then(wa => {
      const exports = wa.instance.exports;
      const sum = exports._sum;
      emit({s: sum(row.x, row.y)});
    });
  }"
)
# Query complete (3.0s elapsed, 0 B processed)
# 3.0
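The same UDF body can be exercised outside BigQuery in plain Node.js. The bytes below are a minimal hand-assembled add module (an assumption for illustration: it is not the Emscripten-compiled module above, so it needs no `env` imports):

```javascript
// Standalone sketch of the UDF body in Node.js. The bytes encode a
// minimal WebAssembly module exporting sum(i32, i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,        // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,  // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                                 // function section
  0x07, 0x07, 0x01, 0x03, 0x73, 0x75, 0x6d, 0x00, 0x00,  // export "sum"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b // body: i32.add
]);

// Mirror the UDF: emit the result once the module is ready.
const row = { x: 1, y: 2 };
WebAssembly.instantiate(bytes).then(wa => {
  const sum = wa.instance.exports.sum;
  console.log({ s: sum(row.x, row.y) }); // { s: 3 }
});
```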

Duplicating a number 5 billion times in 48 seconds

SELECT SUM(s)
FROM
js((
  SELECT FLOOR(RAND()*100000) group, NEST(requests) as x
  FROM (
    SELECT requests, content_size
    FROM [fh-bigquery:wikipedia.pagecounts_201205]
  )
  GROUP BY group)
  , group, x
  , "[{name:'s', type: 'float'}]",
  "function (row, emit) {
    const memory = new WebAssembly.Memory({ initial: 256, maximum: 256 });
    const env = {
      'abortStackOverflow': _ => { throw new Error('overflow'); },
      'table': new WebAssembly.Table({ initial: 0, maximum: 0, element: 'anyfunc' }),
      'tableBase': 0,
      'memory': memory,
      'memoryBase': 1024,
      'STACKTOP': 0,
      'STACK_MAX': memory.buffer.byteLength,
    };
    const imports = { env };
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 139, 128, 128, 128, 0, 2, 96, 1, 127, 0, 96, 2, 127, 127, 1, 127, 2, 254, 128, 128, 128, 0, 7, 3, 101, 110, 118, 8, 83, 84, 65, 67, 75, 84, 79, 80, 3, 127, 0, 3, 101, 110, 118, 9, 83, 84, 65, 67, 75, 95, 77, 65, 88, 3, 127, 0, 3, 101, 110, 118, 18, 97, 98, 111, 114, 116, 83, 116, 97, 99, 107, 79, 118, 101, 114, 102, 108, 111, 119, 0, 0, 3, 101, 110, 118, 6, 109, 101, 109, 111, 114, 121, 2, 1, 128, 2, 128, 2, 3, 101, 110, 118, 5, 116, 97, 98, 108, 101, 1, 112, 1, 0, 0, 3, 101, 110, 118, 10, 109, 101, 109, 111, 114, 121, 66, 97, 115, 101, 3, 127, 0, 3, 101, 110, 118, 9, 116, 97, 98, 108, 101, 66, 97, 115, 101, 3, 127, 0, 3, 130, 128, 128, 128, 0, 1, 1, 6, 147, 128, 128, 128, 0, 3, 127, 1, 35, 0, 11, 127, 1, 35, 1, 11, 125, 1, 67, 0, 0, 0, 0, 11, 7, 136, 128, 128, 128, 0, 1, 4, 95, 115, 117, 109, 0, 1, 9, 129, 128, 128, 128, 0, 0, 10, 196, 128, 128, 128, 0, 1, 190, 128, 128, 128, 0, 1, 7, 127, 2, 64, 35, 4, 33, 8, 35, 4, 65, 16, 106, 36, 4, 35, 4, 35, 5, 78, 4, 64, 65, 16, 16, 0, 11, 32, 0, 33, 2, 32, 1, 33, 3, 32, 2, 33, 4, 32, 3, 33, 5, 32, 4, 32, 5, 106, 33, 6, 32, 8, 36, 4, 32, 6, 15, 0, 11, 0, 11]);
    WebAssembly.instantiate(bytes, imports).then(wa => {
      const exports = wa.instance.exports;
      const sum = exports._sum;
      for (var i = 0, len = row.x.length; i < len; i++) {
        emit({s: sum(row.x[i], row.x[i])});
      }
    });
  }"
)

Step by step

See Francesc’s post: https://blog.sourced.tech/post/calling-c-functions-from-bigquery/.

FAQ

Why not #standardSQL

BigQuery traditionally used its own variant of SQL, which we are using here. But lately BigQuery has taken a huge step toward adopting standard SQL. Nevertheless, we had to use legacy SQL here, as UDFs in the new model don’t have a way to handle asynchronous returns. Stay tuned.
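For context on why this limitation is not fundamental: the asynchronous return above is only needed because `WebAssembly.instantiate` returns a Promise. Small modules can also be compiled and instantiated synchronously, which is one way a synchronous UDF body could work. A minimal sketch under that assumption, using a hand-assembled add module rather than the Emscripten output above:

```javascript
// Synchronous instantiation: no Promise, so the result can be returned
// directly from a function body instead of emitted from a .then() callback.
// The bytes are a minimal module exporting sum(i32, i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,        // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,  // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                                 // function section
  0x07, 0x07, 0x01, 0x03, 0x73, 0x75, 0x6d, 0x00, 0x00,  // export "sum"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b // body: i32.add
]);

function sumWasm(x, y) {
  // In a real UDF you would hoist this setup out of the per-row path,
  // for the same batching reasons discussed earlier.
  const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
  return instance.exports.sum(x, y); // plain return, no emit, no Promise
}

console.log(sumWasm(2, 3)); // 5
```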

UPDATE 2019: Now it runs on #standardSQL

Next steps

Will Francesc be able to run the source{d} C library inside BigQuery? Stay tuned!

— — — — —

Felipe Hoffa: Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.
