Creating a new function that can parse numbers from text

New in BigQuery: Persistent UDFs

User defined functions are a powerful way to extend BigQuery, but until now it has been a drag having to copy paste them around. Not anymore! Let’s play with this new feature and share some useful ones — including NLP inside BigQuery.

  • Create user-defined SQL and JavaScript functions.
  • Reference these functions across queries and in logical views.
  • Create org-wide libraries of business logic within shared datasets.
SELECT fhoffa.x.parse_number('one hundred fifty seven')
, fhoffa.x.parse_number('three point 5')
, fhoffa.x.parse_number('2 hundred')
, fhoffa.x.parse_number('minus 8')
, fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')
157, 3.5, 200, -8, 5000325.01

2019–08 update: Official community repo

Check out GoogleCloudPlatform/bigquery-utils/ on GitHub for a library of common UDFs:

Some cool functions I’m releasing now

Radians

Based on a Stack Overflow question:

SELECT fhoffa.x.radians(180) is_this_pi3.141592653589793
CREATE OR REPLACE FUNCTION `x.radians`(x ANY TYPE) AS (ACOS(-1) * x / 180);

“Better” cast to integer

My own int() makes me happier than the native CAST(x AS INT64):

SELECT fhoffa.x.int(1) int1
, fhoffa.x.int(2.5) int2
, fhoffa.x.int('7') int3
, fhoffa.x.int('7.8') int4
1, 2, 7, 7
CREATE OR REPLACE FUNCTION `x.int`(v ANY TYPE) AS (CAST(FLOOR(CAST(v AS FLOAT64)) AS INT64));

Random integer in a range

Due to the ROUND effects of CAST, getting a random integer isn’t straightforward with the BQ native functions. But now it is, thanks to my own random_int():

SELECT fhoffa.x.random_int(0,10) randint, COUNT(*) c
FROM UNNEST(GENERATE_ARRAY(1,1000))
GROUP BY 1
ORDER BY 1
Well distributed random integers between 0 and 10
CREATE OR REPLACE FUNCTION `x.random_int`(min ANY TYPE, max ANY TYPE) AS (fhoffa.x.int(min + RAND()*(max-min)));

Medians

To get the median of an array of numbers:

SELECT fhoffa.x.median([1,1,1,2,3,4,5,100,1000]) median_1
, fhoffa.x.median([1,2,3]) median_2
, fhoffa.x.median([1,2,3,4]) median_3
3.0, 2.0, 2.5
SELECT year
, fhoffa.x.median(ARRAY_AGG(weight_pounds)) as median_weight
FROM `bigquery-public-data.samples.natality`
GROUP BY 1
ORDER BY 1
Median weight for babies throughout the years
CREATE OR REPLACE FUNCTION `fhoffa.x.median`(arr ANY TYPE) AS ((
SELECT IF (
MOD(ARRAY_LENGTH(arr), 2) = 0,
(arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));

Fun with JavaScript

Now it’s easier than ever to encapsulate JavaScript functions and its dependencies. For example, I downloaded the latest NLP compromise library to create the following functions.

Parse numbers from text

The opening example:

SELECT fhoffa.x.parse_number('one hundred fifty seven')
, fhoffa.x.parse_number('three point 5')
, fhoffa.x.parse_number('2 hundred')
, fhoffa.x.parse_number('minus 8')
, fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')
157, 3.5, 200, -8, 5000325.01
CREATE OR REPLACE FUNCTION x.nlp_compromise_number(str STRING)
RETURNS NUMERIC LANGUAGE js AS '''
return nlp(str).values(0).toNumber().out()
'''
OPTIONS (
library="gs://fh-bigquery/js/compromise.min.11.14.0.js");
gsutil acl ch -u AllUsers:R gs://fh-bigquery/js/*

NLP — getting names out of text

Extracting names out of text:

SELECT fhoffa.x.nlp_compromise_people(
"hello, I'm Felipe Hoffa and I work with Elliott Brossard - who thinks Jordan Tigani will like this post?"
) names
Names NLP compromise found on the above text
SELECT name, COUNT(*) c
FROM (
SELECT fhoffa.x.nlp_compromise_people(title) names
FROM `fh-bigquery.reddit_posts.2019_02`
WHERE subreddit = 'movies'
), UNNEST(names) name
WHERE name LIKE '% %'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
NLP compromise might not be perfect, but I like it anyways
CREATE FUNCTION x.nlp_compromise_people(str STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
return nlp(str).people().out('topk').map(x=>x.normal)
'''
OPTIONS (
library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

Fuzzy matching

Soundex

Based on Brian Suk’s post: fhoffa.x.soundex()

CREATE OR REPLACE FUNCTION `fhoffa.x.soundex`(s STRING)
RETURNS STRING LANGUAGE js AS """
// https://gist.github.com/shawndumas/1262659
var a = s.toLowerCase().split(''),
f = a.shift(),
r = '',
codes = {
a: '', e: '', i: '', o: '', u: '',
b: 1, f: 1, p: 1, v: 1,
c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
d: 3, t: 3,
l: 4,
m: 5, n: 5,
r: 6
};

r = f +
a
.map(function (v, i, a) { return codes[v] })
.filter(function (v, i, a) {
return ((i === 0) ? v !== codes[f] : v !== a[i - 1]);
})
.join('');

return (r + '000').slice(0, 4).toUpperCase();
""";

Levenshtein distance

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
CREATE OR REPLACE FUNCTION fhoffa.x.levenshtein(a string, b string)
RETURNS INT64
LANGUAGE js AS """
return fuzzball.distance(a,b);
"""
OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js");

Fuzzy choose one

SELECT fhoffa.x.fuzzy_extract_one('jony' 
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
# , ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
CREATE OR REPLACE FUNCTION `fhoffa.x.fuzzy_extract_one`(a STRING, b ARRAY<STRING>) RETURNS STRING LANGUAGE js
OPTIONS (library=["gs://fh-bigquery/js/fuzzball.umd.min.js"]) AS """
return fuzzball.extract(a,b)[0][0];
""";

Parse CSV

SELECT csv.cols[SAFE_OFFSET(0)] a
, csv.cols[SAFE_OFFSET(1)] b
,csv.cols[SAFE_OFFSET(2)] c
,csv.cols[SAFE_OFFSET(3)] d
,csv.cols[SAFE_OFFSET(4)] e
,csv.cols[SAFE_OFFSET(5)] f
FROM (
SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
)
CREATE OR REPLACE FUNCTION fhoffa.x.parse_csv(str STRING)
RETURNS STRUCT<cols ARRAY<STRING>, meta STRING> LANGUAGE js AS '''
// SELECT fhoffa.x.csv_parse('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
parsed = Papa.parse(str, {'quoteChar':'"', delimiter:','})
return {cols: parsed.data[0], meta: JSON.stringify(parsed.meta)}
'''
OPTIONS (library="gs://fh-bigquery/js/papaparse.js");

WASM

SELECT fhoffa.x.sample_wasm_udf([2,3,4])

Notes

  • Persistent UDFs are in beta today.
  • UDFs are case-sensitive. I’m asking the team if we could relax this. Tell me if you’d like this too.
  • How to version and source these functions? Share your ideas, and stay tuned.
  • BigQuery can run wasm, so you could write these functions in any programming language that compiles to it (pending an async JS issue Myles Borins has been working to fix).
  • You can create persistent UDFs within the BigQuery sandbox without a credit card. They will be persisted indefinitely (beyond the default 60 day storage for tables in the same dataset).
  • To make my UDFs usable by anyone, I shared the dataset containing them with allAuthenticatedUsers
Share a dataset containing UDFs with allAuthenticatedUsers

Next steps

Create and share your own UDFs! Post them on reddit.com/r/bigquery if you’d like more people to find them.

--

--

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.