Creating a new function that can parse numbers from text

New in BigQuery: Persistent UDFs

User defined functions are a powerful way to extend BigQuery, but until now it has been a drag having to copy paste them around. Not anymore! Let’s play with this new feature and share some useful ones — including NLP inside BigQuery.

6 min readJul 2, 2019

--

With persistent UDF support in BigQuery, now you can:

  • Create user-defined SQL and JavaScript functions.
  • Reference these functions across queries and in logical views.
  • Create org-wide libraries of business logic within shared datasets.

For example, now anyone can call this number parsing function:

SELECT fhoffa.x.parse_number('one hundred fifty seven')
, fhoffa.x.parse_number('three point 5')
, fhoffa.x.parse_number('2 hundred')
, fhoffa.x.parse_number('minus 8')
, fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')
157, 3.5, 200, -8, 5000325.01

How cool is that? See how to do this below.

2019–08 update: Official community repo

Check out GoogleCloudPlatform/bigquery-utils/ on GitHub for a library of common UDFs:

And Javier de la Torre’s repo jatorre/bigquery-jslibs:

Some cool functions I’m releasing now

Radians

Based on a Stack Overflow question:

SELECT fhoffa.x.radians(180) is_this_pi3.141592653589793

To create this function I did:

CREATE OR REPLACE FUNCTION `x.radians`(x ANY TYPE) AS (ACOS(-1) * x / 180);

“Better” cast to integer

My own int() makes me happier than the native CAST(x AS INT64):

SELECT fhoffa.x.int(1) int1
, fhoffa.x.int(2.5) int2
, fhoffa.x.int('7') int3
, fhoffa.x.int('7.8') int4
1, 2, 7, 7

To create this function I did:

CREATE OR REPLACE FUNCTION `x.int`(v ANY TYPE) AS (CAST(FLOOR(CAST(v AS FLOAT64)) AS INT64));

Note that CAST(x AS INT64) rounds the number, but I wish it just truncated it. I don’t have to deal with that behavior anymore, since my own int() takes care of it. Which helps with:

Random integer in a range

Due to the ROUND effects of CAST, getting a random integer isn’t straightforward with the BQ native functions. But now it is, thanks to my own random_int():

SELECT fhoffa.x.random_int(0,10) randint, COUNT(*) c
FROM UNNEST(GENERATE_ARRAY(1,1000))
GROUP BY 1
ORDER BY 1
Well distributed random integers between 0 and 10

To create this function I did:

CREATE OR REPLACE FUNCTION `x.random_int`(min ANY TYPE, max ANY TYPE) AS (fhoffa.x.int(min + RAND()*(max-min)));

Note that I’m using my own previously defined int() to cast appropriately!

Medians

To get the median of an array of numbers:

SELECT fhoffa.x.median([1,1,1,2,3,4,5,100,1000]) median_1
, fhoffa.x.median([1,2,3]) median_2
, fhoffa.x.median([1,2,3,4]) median_3
3.0, 2.0, 2.5

To use with real data:

SELECT year
, fhoffa.x.median(ARRAY_AGG(weight_pounds)) as median_weight
FROM `bigquery-public-data.samples.natality`
GROUP BY 1
ORDER BY 1
Median weight for babies throughout the years

To create this function Elliott did:

CREATE OR REPLACE FUNCTION `fhoffa.x.median`(arr ANY TYPE) AS ((
SELECT IF (
MOD(ARRAY_LENGTH(arr), 2) = 0,
(arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));

Fun with JavaScript

Now it’s easier than ever to encapsulate JavaScript functions and its dependencies. For example, I downloaded the latest NLP compromise library to create the following functions.

Parse numbers from text

The opening example:

SELECT fhoffa.x.parse_number('one hundred fifty seven')
, fhoffa.x.parse_number('three point 5')
, fhoffa.x.parse_number('2 hundred')
, fhoffa.x.parse_number('minus 8')
, fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')
157, 3.5, 200, -8, 5000325.01

To create this function I did:

CREATE OR REPLACE FUNCTION x.nlp_compromise_number(str STRING)
RETURNS NUMERIC LANGUAGE js AS '''
return nlp(str).values(0).toNumber().out()
'''
OPTIONS (
library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

Note that I’m using a JS library I downloaded straight from the internet — BigQuery is able to work with it seamlessly. But I had to share the copy I stored in GCS with you for this to work:

gsutil acl ch -u AllUsers:R gs://fh-bigquery/js/*

NLP — getting names out of text

Extracting names out of text:

SELECT fhoffa.x.nlp_compromise_people(
"hello, I'm Felipe Hoffa and I work with Elliott Brossard - who thinks Jordan Tigani will like this post?"
) names
Names NLP compromise found on the above text

Let’s find the top names mentioned on reddit /r/movies last February:

SELECT name, COUNT(*) c
FROM (
SELECT fhoffa.x.nlp_compromise_people(title) names
FROM `fh-bigquery.reddit_posts.2019_02`
WHERE subreddit = 'movies'
), UNNEST(names) name
WHERE name LIKE '% %'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
NLP compromise might not be perfect, but I like it anyways

To create this function I did:

CREATE FUNCTION x.nlp_compromise_people(str STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
return nlp(str).people().out('topk').map(x=>x.normal)
'''
OPTIONS (
library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

Fuzzy matching

Soundex

Based on Brian Suk’s post: fhoffa.x.soundex()

To create this function I did:

CREATE OR REPLACE FUNCTION `fhoffa.x.soundex`(s STRING)
RETURNS STRING LANGUAGE js AS """
// https://gist.github.com/shawndumas/1262659
var a = s.toLowerCase().split(''),
f = a.shift(),
r = '',
codes = {
a: '', e: '', i: '', o: '', u: '',
b: 1, f: 1, p: 1, v: 1,
c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
d: 3, t: 3,
l: 4,
m: 5, n: 5,
r: 6
};

r = f +
a
.map(function (v, i, a) { return codes[v] })
.filter(function (v, i, a) {
return ((i === 0) ? v !== codes[f] : v !== a[i - 1]);
})
.join('');

return (r + '000').slice(0, 4).toUpperCase();
""";

Levenshtein distance

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')

Uses fuzzball.js:

CREATE OR REPLACE FUNCTION fhoffa.x.levenshtein(a string, b string)
RETURNS INT64
LANGUAGE js AS """
return fuzzball.distance(a,b);
"""
OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js");

Fuzzy choose one

SELECT fhoffa.x.fuzzy_extract_one('jony' 
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
# , ['john', 'johnny', 'jonathan', 'jonas']
)
johnny

Also with fuzzball:

CREATE OR REPLACE FUNCTION `fhoffa.x.fuzzy_extract_one`(a STRING, b ARRAY<STRING>) RETURNS STRING LANGUAGE js
OPTIONS (library=["gs://fh-bigquery/js/fuzzball.umd.min.js"]) AS """
return fuzzball.extract(a,b)[0][0];
""";

Parse CSV

SELECT csv.cols[SAFE_OFFSET(0)] a
, csv.cols[SAFE_OFFSET(1)] b
,csv.cols[SAFE_OFFSET(2)] c
,csv.cols[SAFE_OFFSET(3)] d
,csv.cols[SAFE_OFFSET(4)] e
,csv.cols[SAFE_OFFSET(5)] f
FROM (
SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
)

Source:

CREATE OR REPLACE FUNCTION fhoffa.x.parse_csv(str STRING)
RETURNS STRUCT<cols ARRAY<STRING>, meta STRING> LANGUAGE js AS '''
// SELECT fhoffa.x.csv_parse('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
parsed = Papa.parse(str, {'quoteChar':'"', delimiter:','})
return {cols: parsed.data[0], meta: JSON.stringify(parsed.meta)}
'''
OPTIONS (library="gs://fh-bigquery/js/papaparse.js");

WASM

SELECT fhoffa.x.sample_wasm_udf([2,3,4])

See https://stackoverflow.com/a/60102266/132438

Notes

  • Persistent UDFs are in beta today.
  • UDFs are case-sensitive. I’m asking the team if we could relax this. Tell me if you’d like this too.
  • How to version and source these functions? Share your ideas, and stay tuned.
  • BigQuery can run wasm, so you could write these functions in any programming language that compiles to it (pending an async JS issue Myles Borins has been working to fix).
  • You can create persistent UDFs within the BigQuery sandbox without a credit card. They will be persisted indefinitely (beyond the default 60 day storage for tables in the same dataset).
  • To make my UDFs usable by anyone, I shared the dataset containing them with allAuthenticatedUsers
Share a dataset containing UDFs with allAuthenticatedUsers

Next steps

Create and share your own UDFs! Post them on reddit.com/r/bigquery if you’d like more people to find them.

Thanks to Elliott Brossard for leading the development of this feature and helping me out with this post.

Want more stories? Check my Medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

--

--

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.