BigQuery beyond SQL and JS: Running C and Rust code at scale with wasm

Update 2019: Now you can run these functions with #standardSQL

— — — — —

It all started here:

As you can see, we were all impressed:

3 seconds to add 2 numbers isn’t impressive at all, but a couple of hours later we were able to run this same function at scale:

1 row in 3 seconds is not impressive. 5 billion rows in 56 seconds is. Let’s see how we went from A to B:

BigQuery strengths: Throughput, not latency

If you expect results in less than one second, BigQuery is not the right tool: it will rarely return results that fast.

This is because BigQuery was built for throughput, not for latency. A race car will always be faster than a heavyweight truck, except when you ask the race car to move 3 tons of cargo from one place to another. That’s when you bring BigQuery into your life: not for sub-second operations, but to analyze tons of data effortlessly.

Nevertheless, scaling this WebAssembly function up in BigQuery wasn’t as easy as it might seem.

BigQuery UDF tricks: For slow UDFs, batch the requests

In Francesc’s initial UDF, most of the time is spent initializing WebAssembly: every new row passed to the UDF pays the full initialization cost again.

To solve this problem we can batch the requests to the UDF: we group rows into arbitrary big chunks, so the UDF initializes once per chunk and then processes every row in it without re-initializing.
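The effect can be sketched in plain JavaScript. This is a toy stand-in: `expensiveInit` is a hypothetical placeholder for the WebAssembly setup cost, not BigQuery’s actual runtime:

```javascript
// Toy sketch: count how many times the expensive initialization runs
// under each strategy. initCount stands in for the WebAssembly setup cost.
let initCount = 0;
function expensiveInit() {
  initCount++;              // pretend this is the slow WASM setup
  return (a, b) => a + b;   // the function the module would export
}

const rows = [[1, 2], [3, 4], [5, 6]];

// Per-row UDF calls: the setup runs once for every row.
initCount = 0;
rows.forEach(([x, y]) => expensiveInit()(x, y));
console.log(initCount); // 3 initializations for 3 rows

// Batched UDF call: the setup runs once for the whole chunk.
initCount = 0;
const sum = expensiveInit();
rows.forEach(([x, y]) => sum(x, y));
console.log(initCount); // 1 initialization for all 3 rows
```

With 5 billion rows, that difference between once-per-row and once-per-chunk is the whole ballgame.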

This means wrapping a source like this:

SELECT requests FROM [fh-bigquery:wikipedia.pagecounts_201205]

into random groups, with code like this:

SELECT FLOOR(RAND()*100000) group, NEST(requests) as x
FROM (
  SELECT requests FROM [fh-bigquery:wikipedia.pagecounts_201205])
GROUP BY group

For a table of 5 billion rows like this one, this code creates ~100,000 groups of ~50,000 elements each. We can pass these groups to the UDF, to get to our desired performance.
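A quick sanity check of that arithmetic, simulated at a smaller scale (1 million rows into 100 groups rather than 5 billion into 100,000):

```javascript
// Simulate FLOOR(RAND()*G) grouping: N rows assigned to G random groups.
const N = 1_000_000;  // rows (scaled down from 5 billion)
const G = 100;        // groups (scaled down from 100,000)
const counts = new Array(G).fill(0);
for (let i = 0; i < N; i++) {
  counts[Math.floor(Math.random() * G)]++;
}
// Every row lands in exactly one group, so the average chunk size is N/G.
const avg = counts.reduce((a, b) => a + b, 0) / G;
console.log(avg); // 10000 rows per group on average
```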


Code

Adding 2 numbers in 3 seconds

SELECT * FROM
js(
  (SELECT 1 x, 2 y)
  , x, y
  , "[{name:'s', type: 'float'}]",
  "function (row, emit) {
    const memory = new WebAssembly.Memory({ initial: 256, maximum: 256 });
    const env = {
      'abortStackOverflow': _ => { throw new Error('overflow'); },
      'table': new WebAssembly.Table({ initial: 0, maximum: 0, element: 'anyfunc' }),
      'tableBase': 0,
      'memory': memory,
      'memoryBase': 1024,
      'STACKTOP': 0,
      'STACK_MAX': memory.buffer.byteLength,
    };
    const imports = { env };
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 139, 128, 128, 128, 0, 2, 96, 1, 127, 0, 96, 2, 127, 127, 1, 127, 2, 254, 128, 128, 128, 0, 7, 3, 101, 110, 118, 8, 83, 84, 65, 67, 75, 84, 79, 80, 3, 127, 0, 3, 101, 110, 118, 9, 83, 84, 65, 67, 75, 95, 77, 65, 88, 3, 127, 0, 3, 101, 110, 118, 18, 97, 98, 111, 114, 116, 83, 116, 97, 99, 107, 79, 118, 101, 114, 102, 108, 111, 119, 0, 0, 3, 101, 110, 118, 6, 109, 101, 109, 111, 114, 121, 2, 1, 128, 2, 128, 2, 3, 101, 110, 118, 5, 116, 97, 98, 108, 101, 1, 112, 1, 0, 0, 3, 101, 110, 118, 10, 109, 101, 109, 111, 114, 121, 66, 97, 115, 101, 3, 127, 0, 3, 101, 110, 118, 9, 116, 97, 98, 108, 101, 66, 97, 115, 101, 3, 127, 0, 3, 130, 128, 128, 128, 0, 1, 1, 6, 147, 128, 128, 128, 0, 3, 127, 1, 35, 0, 11, 127, 1, 35, 1, 11, 125, 1, 67, 0, 0, 0, 0, 11, 7, 136, 128, 128, 128, 0, 1, 4, 95, 115, 117, 109, 0, 1, 9, 129, 128, 128, 128, 0, 0, 10, 196, 128, 128, 128, 0, 1, 190, 128, 128, 128, 0, 1, 7, 127, 2, 64, 35, 4, 33, 8, 35, 4, 65, 16, 106, 36, 4, 35, 4, 35, 5, 78, 4, 64, 65, 16, 16, 0, 11, 32, 0, 33, 2, 32, 1, 33, 3, 32, 2, 33, 4, 32, 3, 33, 5, 32, 4, 32, 5, 106, 33, 6, 32, 8, 36, 4, 32, 6, 15, 0, 11, 0, 11]);
    WebAssembly.instantiate(bytes, imports).then(wa => {
      const exports = wa.instance.exports;
      const sum = exports._sum;
      emit({s: sum(row.x, row.y)});
    });
  }"
)
# Query complete (3.0s elapsed, 0 B processed)
# 3.0
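The same UDF body can be exercised outside BigQuery in plain Node.js. The bytes below are a minimal hand-assembled add module (an assumption for illustration: it is not the Emscripten-compiled module above, so it needs no `env` imports):

```javascript
// Standalone sketch of the UDF body in Node.js. The bytes encode a
// minimal WebAssembly module exporting sum(i32, i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,        // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,  // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                                 // function section
  0x07, 0x07, 0x01, 0x03, 0x73, 0x75, 0x6d, 0x00, 0x00,  // export "sum"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b // body: i32.add
]);

// Mirror the UDF: emit the result once the module is ready.
const row = { x: 1, y: 2 };
WebAssembly.instantiate(bytes).then(wa => {
  const sum = wa.instance.exports.sum;
  console.log({ s: sum(row.x, row.y) }); // { s: 3 }
});
```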

Duplicating a number 5 billion times in 48 seconds

SELECT SUM(s)
FROM
js((
  SELECT FLOOR(RAND()*100000) group, NEST(requests) as x
  FROM (
    SELECT requests, content_size
    FROM [fh-bigquery:wikipedia.pagecounts_201205]
  )
  GROUP BY group)
  , group, x
  , "[{name:'s', type: 'float'}]",
  "function (row, emit) {
    const memory = new WebAssembly.Memory({ initial: 256, maximum: 256 });
    const env = {
      'abortStackOverflow': _ => { throw new Error('overflow'); },
      'table': new WebAssembly.Table({ initial: 0, maximum: 0, element: 'anyfunc' }),
      'tableBase': 0,
      'memory': memory,
      'memoryBase': 1024,
      'STACKTOP': 0,
      'STACK_MAX': memory.buffer.byteLength,
    };
    const imports = { env };
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 139, 128, 128, 128, 0, 2, 96, 1, 127, 0, 96, 2, 127, 127, 1, 127, 2, 254, 128, 128, 128, 0, 7, 3, 101, 110, 118, 8, 83, 84, 65, 67, 75, 84, 79, 80, 3, 127, 0, 3, 101, 110, 118, 9, 83, 84, 65, 67, 75, 95, 77, 65, 88, 3, 127, 0, 3, 101, 110, 118, 18, 97, 98, 111, 114, 116, 83, 116, 97, 99, 107, 79, 118, 101, 114, 102, 108, 111, 119, 0, 0, 3, 101, 110, 118, 6, 109, 101, 109, 111, 114, 121, 2, 1, 128, 2, 128, 2, 3, 101, 110, 118, 5, 116, 97, 98, 108, 101, 1, 112, 1, 0, 0, 3, 101, 110, 118, 10, 109, 101, 109, 111, 114, 121, 66, 97, 115, 101, 3, 127, 0, 3, 101, 110, 118, 9, 116, 97, 98, 108, 101, 66, 97, 115, 101, 3, 127, 0, 3, 130, 128, 128, 128, 0, 1, 1, 6, 147, 128, 128, 128, 0, 3, 127, 1, 35, 0, 11, 127, 1, 35, 1, 11, 125, 1, 67, 0, 0, 0, 0, 11, 7, 136, 128, 128, 128, 0, 1, 4, 95, 115, 117, 109, 0, 1, 9, 129, 128, 128, 128, 0, 0, 10, 196, 128, 128, 128, 0, 1, 190, 128, 128, 128, 0, 1, 7, 127, 2, 64, 35, 4, 33, 8, 35, 4, 65, 16, 106, 36, 4, 35, 4, 35, 5, 78, 4, 64, 65, 16, 16, 0, 11, 32, 0, 33, 2, 32, 1, 33, 3, 32, 2, 33, 4, 32, 3, 33, 5, 32, 4, 32, 5, 106, 33, 6, 32, 8, 36, 4, 32, 6, 15, 0, 11, 0, 11]);
    WebAssembly.instantiate(bytes, imports).then(wa => {
      const exports = wa.instance.exports;
      const sum = exports._sum;
      for (var i = 0, len = row.x.length; i < len; i++) {
        emit({s: sum(row.x[i], row.x[i])});
      }
    });
  }"
)

Step by step

See Francesc’s post: https://blog.sourced.tech/post/calling-c-functions-from-bigquery/.

FAQ

Why not #standardSQL

BigQuery traditionally used its own variant of SQL, which we are using here. But lately BigQuery has taken a huge step toward adopting standard SQL. Nevertheless, we had to use legacy SQL here, as UDFs in the new model don’t have a way to handle asynchronous returns. Stay tuned.
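For context on why this limitation is not fundamental: the asynchronous return above is only needed because `WebAssembly.instantiate` returns a Promise. Small modules can also be compiled and instantiated synchronously, which is one way a synchronous UDF body could work. A minimal sketch under that assumption, using a hand-assembled add module rather than the Emscripten output above:

```javascript
// Synchronous instantiation: no Promise, so the result can be returned
// directly from a function body instead of emitted from a .then() callback.
// The bytes are a minimal module exporting sum(i32, i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,        // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,  // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                                 // function section
  0x07, 0x07, 0x01, 0x03, 0x73, 0x75, 0x6d, 0x00, 0x00,  // export "sum"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b // body: i32.add
]);

function sumWasm(x, y) {
  // In a real UDF you would hoist this setup out of the per-row path,
  // for the same batching reasons discussed earlier.
  const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
  return instance.exports.sum(x, y); // plain return, no emit, no Promise
}

console.log(sumWasm(2, 3)); // 5
```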

UPDATE 2019: Now it runs on #standardSQL

Next steps

Will Francesc be able to run the source{d} C library inside BigQuery? Stay tuned!

— — — — —

Felipe Hoffa: Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.
