BigQuery beyond SQL and JS: Running C and Rust code at scale with wasm

Update 2019: Now you can run these functions with #standardSQL

— — — — —

It all started here:

As you can see we were all impressed:

3 seconds to add 2 numbers isn’t impressive at all — but a couple hours later, we were able to run this same function at scale:

1 row in 3 seconds is not impressive. 5 billion rows in 56 seconds is. Let’s see how we went from a to b:

BigQuery strengths: Throughput, not latency

If you need results in less than one second, BigQuery is not the right tool: it will rarely return results that quickly.

This is because BigQuery was built for throughput, not for latency. A race car will always be faster than a heavyweight truck, except when you ask the race car to move 3 tons of cargo from one place to another. That's when you bring BigQuery into your life: not for sub-second operations, but to analyze tons of data effortlessly.

Nevertheless, getting this WebAssembly function to run at scale in BigQuery wasn't so easy.

BigQuery UDF tricks: For slow UDFs, batch the requests

In Francesc’s initial UDF, most of the time is spent initializing WebAssembly: each time we pass a new row to the UDF, it spends a long time initializing the function all over again.

To solve this problem we can batch the requests to the UDF: we group rows into big arbitrary chunks, so the UDF initializes once per chunk and then processes every row in it without re-initializing.

This means wrapping a source like this:

SELECT requests FROM [fh-bigquery:wikipedia.pagecounts_201205]

into random groups, with code like this:

SELECT FLOOR(RAND()*100000) group, NEST(requests) as x
FROM (
  SELECT requests FROM [fh-bigquery:wikipedia.pagecounts_201205])
GROUP BY group

For a table of 5 billion rows like this one, this code creates ~100,000 groups of ~50,000 elements each. We can pass these groups to the UDF, to get to our desired performance.
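To see why batching pays off, here is a minimal JavaScript sketch (hypothetical, outside BigQuery) that counts how many times an expensive initialization step runs per row versus per chunk:

```javascript
// Stand-in for an expensive setup step, such as WebAssembly.instantiate().
let inits = 0;
function init() {
  inits++;
  return (a, b) => a + b;
}

const rows = [[1, 2], [3, 4], [5, 6]]; // one small "chunk" of rows

// Unbatched UDF: initialization runs once per row.
inits = 0;
const unbatched = rows.map(([a, b]) => init()(a, b));
const initsPerRow = inits; // one init per row

// Batched UDF: initialization runs once, amortized across the whole chunk.
inits = 0;
const sum = init();
const batched = rows.map(([a, b]) => sum(a, b));
const initsPerChunk = inits; // a single init for all rows
```

With ~50,000 rows per chunk, the same amortization turns 50,000 initializations into one.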


Code

Adding 2 numbers in 3 seconds

SELECT * FROM
js(
  (SELECT 1 x, 2 y)
  , x, y
  , "[{name:'s', type: 'float'}]",
  "function (row, emit) {
    const memory = new WebAssembly.Memory({ initial: 256, maximum: 256 });
    const env = {
      'abortStackOverflow': _ => { throw new Error('overflow'); },
      'table': new WebAssembly.Table({ initial: 0, maximum: 0, element: 'anyfunc' }),
      'tableBase': 0,
      'memory': memory,
      'memoryBase': 1024,
      'STACKTOP': 0,
      'STACK_MAX': memory.buffer.byteLength,
    };
    const imports = { env };
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 139, 128, 128, 128, 0, 2, 96, 1, 127, 0, 96, 2, 127, 127, 1, 127, 2, 254, 128, 128, 128, 0, 7, 3, 101, 110, 118, 8, 83, 84, 65, 67, 75, 84, 79, 80, 3, 127, 0, 3, 101, 110, 118, 9, 83, 84, 65, 67, 75, 95, 77, 65, 88, 3, 127, 0, 3, 101, 110, 118, 18, 97, 98, 111, 114, 116, 83, 116, 97, 99, 107, 79, 118, 101, 114, 102, 108, 111, 119, 0, 0, 3, 101, 110, 118, 6, 109, 101, 109, 111, 114, 121, 2, 1, 128, 2, 128, 2, 3, 101, 110, 118, 5, 116, 97, 98, 108, 101, 1, 112, 1, 0, 0, 3, 101, 110, 118, 10, 109, 101, 109, 111, 114, 121, 66, 97, 115, 101, 3, 127, 0, 3, 101, 110, 118, 9, 116, 97, 98, 108, 101, 66, 97, 115, 101, 3, 127, 0, 3, 130, 128, 128, 128, 0, 1, 1, 6, 147, 128, 128, 128, 0, 3, 127, 1, 35, 0, 11, 127, 1, 35, 1, 11, 125, 1, 67, 0, 0, 0, 0, 11, 7, 136, 128, 128, 128, 0, 1, 4, 95, 115, 117, 109, 0, 1, 9, 129, 128, 128, 128, 0, 0, 10, 196, 128, 128, 128, 0, 1, 190, 128, 128, 128, 0, 1, 7, 127, 2, 64, 35, 4, 33, 8, 35, 4, 65, 16, 106, 36, 4, 35, 4, 35, 5, 78, 4, 64, 65, 16, 16, 0, 11, 32, 0, 33, 2, 32, 1, 33, 3, 32, 2, 33, 4, 32, 3, 33, 5, 32, 4, 32, 5, 106, 33, 6, 32, 8, 36, 4, 32, 6, 15, 0, 11, 0, 11]);
    WebAssembly.instantiate(bytes, imports).then(wa => {
      const exports = wa.instance.exports;
      const sum = exports._sum;
      emit({s: sum(row.x, row.y)});
    });
  }"
)
# Query complete (3.0s elapsed, 0 B processed)
# 3.0

Duplicating a number 5 billion times in 48 seconds

SELECT SUM(s)
FROM
js((
  SELECT FLOOR(RAND()*100000) group, NEST(requests) as x
  FROM (
    SELECT requests, content_size
    FROM [fh-bigquery:wikipedia.pagecounts_201205]
  )
  GROUP BY group)
  , group, x
  , "[{name:'s', type: 'float'}]",
  "function (row, emit) {
    const memory = new WebAssembly.Memory({ initial: 256, maximum: 256 });
    const env = {
      'abortStackOverflow': _ => { throw new Error('overflow'); },
      'table': new WebAssembly.Table({ initial: 0, maximum: 0, element: 'anyfunc' }),
      'tableBase': 0,
      'memory': memory,
      'memoryBase': 1024,
      'STACKTOP': 0,
      'STACK_MAX': memory.buffer.byteLength,
    };
    const imports = { env };
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 139, 128, 128, 128, 0, 2, 96, 1, 127, 0, 96, 2, 127, 127, 1, 127, 2, 254, 128, 128, 128, 0, 7, 3, 101, 110, 118, 8, 83, 84, 65, 67, 75, 84, 79, 80, 3, 127, 0, 3, 101, 110, 118, 9, 83, 84, 65, 67, 75, 95, 77, 65, 88, 3, 127, 0, 3, 101, 110, 118, 18, 97, 98, 111, 114, 116, 83, 116, 97, 99, 107, 79, 118, 101, 114, 102, 108, 111, 119, 0, 0, 3, 101, 110, 118, 6, 109, 101, 109, 111, 114, 121, 2, 1, 128, 2, 128, 2, 3, 101, 110, 118, 5, 116, 97, 98, 108, 101, 1, 112, 1, 0, 0, 3, 101, 110, 118, 10, 109, 101, 109, 111, 114, 121, 66, 97, 115, 101, 3, 127, 0, 3, 101, 110, 118, 9, 116, 97, 98, 108, 101, 66, 97, 115, 101, 3, 127, 0, 3, 130, 128, 128, 128, 0, 1, 1, 6, 147, 128, 128, 128, 0, 3, 127, 1, 35, 0, 11, 127, 1, 35, 1, 11, 125, 1, 67, 0, 0, 0, 0, 11, 7, 136, 128, 128, 128, 0, 1, 4, 95, 115, 117, 109, 0, 1, 9, 129, 128, 128, 128, 0, 0, 10, 196, 128, 128, 128, 0, 1, 190, 128, 128, 128, 0, 1, 7, 127, 2, 64, 35, 4, 33, 8, 35, 4, 65, 16, 106, 36, 4, 35, 4, 35, 5, 78, 4, 64, 65, 16, 16, 0, 11, 32, 0, 33, 2, 32, 1, 33, 3, 32, 2, 33, 4, 32, 3, 33, 5, 32, 4, 32, 5, 106, 33, 6, 32, 8, 36, 4, 32, 6, 15, 0, 11, 0, 11]);
    WebAssembly.instantiate(bytes, imports).then(wa => {
      const exports = wa.instance.exports;
      const sum = exports._sum;
      for (var i = 0, len = row.x.length; i < len; i++) {
        emit({s: sum(row.x[i], row.x[i])});
      }
    });
  }"
)
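The batched pattern above can be tried outside BigQuery in plain Node.js. This is a sketch under assumptions: the bytes below are a minimal hand-assembled WASM module exporting `add` (not the article's Emscripten output, which needs the `env` imports shown above), and the chunk is an ordinary array standing in for `row.x`:

```javascript
// Minimal hand-written WASM module exporting add(i32, i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                               // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: i32.add
]);

// Instantiate ONCE per chunk, then process every row in the chunk.
const { add } = new WebAssembly.Instance(new WebAssembly.Module(bytes)).exports;

const chunk = [10, 20, 30];                // stands in for row.x in the UDF
const doubled = chunk.map(v => add(v, v)); // what emit({s: ...}) would produce per row
console.log(doubled);                      // [ 20, 40, 60 ]
```

The synchronous `WebAssembly.Module`/`Instance` constructors are used here for simplicity; the UDF uses the promise-based `WebAssembly.instantiate` and emits from inside `.then()`.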

Step by step

See Francesc’s step-by-step write-up: https://blog.sourced.tech/post/calling-c-functions-from-bigquery/.

FAQ

Why not #standardSQL

BigQuery traditionally used its own variant of SQL, which is what we are using here. Lately BigQuery has taken a huge step toward adopting standard SQL. Nevertheless, we had to use legacy SQL here, as UDFs in the new model don’t have a way to handle asynchronous returns. Stay tuned.

UPDATE 2019: Now it runs on #standardSQL

Next steps

Will Francesc be able to run the source{d} C library inside BigQuery? Stay tuned!


Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.
