Creating a new function that can parse numbers from text

New in BigQuery: Persistent UDFs

User defined functions are a powerful way to extend BigQuery, but until now it has been a drag having to copy paste them around. Not anymore! Let’s play with this new feature and share some useful ones — including NLP inside BigQuery.

6 min readJul 2, 2019

With persistent UDF support in BigQuery, now you can:

Create user-defined SQL and JavaScript functions.
Reference these functions across queries and in logical views.
Create org-wide libraries of business logic within shared datasets.

For example, now anyone can call this number parsing function:

SELECT fhoffa.x.parse_number('one hundred fifty seven')
  , fhoffa.x.parse_number('three point 5')
  , fhoffa.x.parse_number('2 hundred')
  , fhoffa.x.parse_number('minus 8')
  , fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')157, 3.5, 200, -8, 5000325.01

How cool is that? See how to do this below.

2019–08 update: Official community repo

Check out GoogleCloudPlatform/bigquery-utils/ on GitHub for a library of common UDFs:

GoogleCloudPlatform/bigquery-utils

This directory contains community contributed user-defined functions to extend BigQuery for more specialized usage…

github.com

And Javier de la Torre’s repo jatorre/bigquery-jslibs:

jatorre/bigquery-jslibs

BigQuery allows you to create UDF functions using JS. In order to do so you have to upload the JS library into a Google…

github.com

Some cool functions I’m releasing now

Radians

Based on a Stack Overflow question:

SELECT fhoffa.x.radians(180) is_this_pi3.141592653589793

To create this function I did:

CREATE OR REPLACE FUNCTION `x.radians`(x ANY TYPE) AS (ACOS(-1) * x / 180);

“Better” cast to integer

My own int() makes me happier than the native CAST(x AS INT64):

SELECT fhoffa.x.int(1) int1
  , fhoffa.x.int(2.5) int2
  , fhoffa.x.int('7') int3
  , fhoffa.x.int('7.8') int41, 2, 7, 7

To create this function I did:

CREATE OR REPLACE FUNCTION `x.int`(v ANY TYPE) AS (CAST(FLOOR(CAST(v AS FLOAT64)) AS INT64));

Note that CAST(x AS INT64) rounds the number, but I wish it just truncated it. I don’t have to deal with that behavior anymore, since my own int() takes care of it. Which helps with:

Random integer in a range

Due to the ROUND effects of CAST, getting a random integer isn’t straightforward with the BQ native functions. But now it is, thanks to my own random_int():

SELECT fhoffa.x.random_int(0,10) randint, COUNT(*) c
FROM UNNEST(GENERATE_ARRAY(1,1000))
GROUP BY 1
ORDER BY 1

Well distributed random integers between 0 and 10

To create this function I did:

CREATE OR REPLACE FUNCTION `x.random_int`(min ANY TYPE, max ANY TYPE) AS (fhoffa.x.int(min + RAND()*(max-min)));

Note that I’m using my own previously defined int() to cast appropriately!

Medians

To get the median of an array of numbers:

SELECT fhoffa.x.median([1,1,1,2,3,4,5,100,1000]) median_1
  , fhoffa.x.median([1,2,3]) median_2
  , fhoffa.x.median([1,2,3,4]) median_33.0, 2.0, 2.5

To use with real data:

SELECT year
  , fhoffa.x.median(ARRAY_AGG(weight_pounds)) as median_weight
FROM `bigquery-public-data.samples.natality`
GROUP BY 1
ORDER BY 1

Median weight for babies throughout the years

To create this function Elliott did:

CREATE OR REPLACE FUNCTION `fhoffa.x.median`(arr ANY TYPE) AS ((
  SELECT IF (
    MOD(ARRAY_LENGTH(arr), 2) = 0,
    (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
    arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
  )
  FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));

Fun with JavaScript

Now it’s easier than ever to encapsulate JavaScript functions and its dependencies. For example, I downloaded the latest NLP compromise library to create the following functions.

Parse numbers from text

The opening example:

SELECT fhoffa.x.parse_number('one hundred fifty seven')
  , fhoffa.x.parse_number('three point 5')
  , fhoffa.x.parse_number('2 hundred')
  , fhoffa.x.parse_number('minus 8')
  , fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')157, 3.5, 200, -8, 5000325.01

To create this function I did:

CREATE OR REPLACE FUNCTION x.nlp_compromise_number(str STRING)
RETURNS NUMERIC LANGUAGE js AS '''
   return nlp(str).values(0).toNumber().out()
'''
OPTIONS (
  library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

Note that I’m using a JS library I downloaded straight from the internet — BigQuery is able to work with it seamlessly. But I had to share the copy I stored in GCS with you for this to work:

gsutil acl ch -u AllUsers:R gs://fh-bigquery/js/*

NLP — getting names out of text

Extracting names out of text:

SELECT fhoffa.x.nlp_compromise_people(
  "hello, I'm Felipe Hoffa and I work with Elliott Brossard - who thinks Jordan Tigani will like this post?"
) names

Names NLP compromise found on the above text

Let’s find the top names mentioned on reddit /r/movies last February:

SELECT name, COUNT(*) c
FROM (
  SELECT fhoffa.x.nlp_compromise_people(title) names
  FROM `fh-bigquery.reddit_posts.2019_02`
  WHERE subreddit = 'movies'
), UNNEST(names) name
WHERE name LIKE '% %'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

NLP compromise might not be perfect, but I like it anyways

To create this function I did:

CREATE FUNCTION x.nlp_compromise_people(str STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
   return nlp(str).people().out('topk').map(x=>x.normal)
'''
OPTIONS (
  library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

Fuzzy matching

Soundex

Based on Brian Suk’s post: fhoffa.x.soundex()

To create this function I did:

CREATE OR REPLACE FUNCTION `fhoffa.x.soundex`(s STRING)
RETURNS STRING LANGUAGE js AS """
// https://gist.github.com/shawndumas/1262659
     var a = s.toLowerCase().split(''),
         f = a.shift(),
         r = '',
         codes = {
             a: '', e: '', i: '', o: '', u: '',
             b: 1, f: 1, p: 1, v: 1,
             c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
             d: 3, t: 3,
             l: 4,
             m: 5, n: 5,
             r: 6
         };
 
     r = f +
         a
         .map(function (v, i, a) { return codes[v] })
         .filter(function (v, i, a) {
             return ((i === 0) ? v !== codes[f] : v !== a[i - 1]);
         })
         .join('');
 
     return (r + '000').slice(0, 4).toUpperCase();
""";

Levenshtein distance

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
 , fhoffa.x.levenshtein('googgle', 'goggles')
 , fhoffa.x.levenshtein('is this the', 'Is This The')

Uses fuzzball.js:

CREATE OR REPLACE FUNCTION fhoffa.x.levenshtein(a string, b string)
RETURNS INT64
LANGUAGE js AS """
  return fuzzball.distance(a,b);
"""
OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js");

Fuzzy choose one

SELECT fhoffa.x.fuzzy_extract_one('jony' 
  , (SELECT ARRAY_AGG(name) 
   FROM `fh-bigquery.popular_names.gender_probabilities`) 
  # , ['john', 'johnny', 'jonathan', 'jonas']
)johnny

Also with fuzzball:

CREATE OR REPLACE FUNCTION `fhoffa.x.fuzzy_extract_one`(a STRING, b ARRAY<STRING>) RETURNS STRING LANGUAGE js
OPTIONS (library=["gs://fh-bigquery/js/fuzzball.umd.min.js"]) AS """
  return fuzzball.extract(a,b)[0][0];
""";

Parse CSV

SELECT csv.cols[SAFE_OFFSET(0)] a
  , csv.cols[SAFE_OFFSET(1)] b
  ,csv.cols[SAFE_OFFSET(2)] c
  ,csv.cols[SAFE_OFFSET(3)] d
  ,csv.cols[SAFE_OFFSET(4)] e
  ,csv.cols[SAFE_OFFSET(5)] f
FROM (
  SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"')  csv
)

Source:

CREATE OR REPLACE FUNCTION fhoffa.x.parse_csv(str STRING)
RETURNS STRUCT<cols ARRAY<STRING>, meta STRING> LANGUAGE js AS '''
  // SELECT fhoffa.x.csv_parse('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"')  csv
  parsed = Papa.parse(str, {'quoteChar':'"', delimiter:','})
  return {cols: parsed.data[0], meta: JSON.stringify(parsed.meta)}
'''
OPTIONS (library="gs://fh-bigquery/js/papaparse.js");

WASM

SELECT fhoffa.x.sample_wasm_udf([2,3,4])

See https://stackoverflow.com/a/60102266/132438

Notes

Persistent UDFs are in beta today.
UDFs are case-sensitive. I’m asking the team if we could relax this. Tell me if you’d like this too.
How to version and source these functions? Share your ideas, and stay tuned.
BigQuery can run wasm, so you could write these functions in any programming language that compiles to it (pending an async JS issue Myles Borins has been working to fix).
You can create persistent UDFs within the BigQuery sandbox without a credit card. They will be persisted indefinitely (beyond the default 60 day storage for tables in the same dataset).
To make my UDFs usable by anyone, I shared the dataset containing them with allAuthenticatedUsers

Share a dataset containing UDFs with allAuthenticatedUsers

Next steps

Create and share your own UDFs! Post them on reddit.com/r/bigquery if you’d like more people to find them.

Thanks to Elliott Brossard for leading the development of this feature and helping me out with this post.

Want more stories? Check my Medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

When will Stack Overflow reply: How to predict with BigQuery

When you finally post a question on Stack Overflow, a long wait starts. Will someone answer your question? Will the…

towardsdatascience.com

Python PyPI stats in BigQuery: Reclustered

Let’s dig into raw logs of Python installation activities. In this post we’ll see how to measure a package’s…

towardsdatascience.com

New in BigQuery: Persistent UDFs

User defined functions are a powerful way to extend BigQuery, but until now it has been a drag having to copy paste them around. Not anymore! Let’s play with this new feature and share some useful ones — including NLP inside BigQuery.

2019–08 update: Official community repo

GoogleCloudPlatform/bigquery-utils

This directory contains community contributed user-defined functions to extend BigQuery for more specialized usage…

jatorre/bigquery-jslibs

BigQuery allows you to create UDF functions using JS. In order to do so you have to upload the JS library into a Google…

Some cool functions I’m releasing now

Radians

“Better” cast to integer

Random integer in a range

Medians

Fun with JavaScript

Parse numbers from text

NLP — getting names out of text

Fuzzy matching

Soundex

Levenshtein distance

Fuzzy choose one

Parse CSV

WASM

Notes

Next steps

When will Stack Overflow reply: How to predict with BigQuery

When you finally post a question on Stack Overflow, a long wait starts. Will someone answer your question? Will the…

Python PyPI stats in BigQuery: Reclustered

Let’s dig into raw logs of Python installation activities. In this post we’ll see how to measure a package’s…

Written by Felipe Hoffa

Responses (7)