Easy pivot() in BigQuery, in one step
Introducing the easiest way to get a pivot done in BigQuery. Pivoting a table has long been one of the most frequently asked open questions about BigQuery, and now, thanks to dynamic SQL and stored procedures, we can finally make it easy for everyone.
Important update: I left Google and joined Snowflake in 2020 — so I’m unable to keep my older posts updated. If you want to try Snowflake, join us — I’m having a lot of fun ❄️.
Quick pivots
Before going any further, let me show you how to call fhoffa.x.pivot() to easily pivot any table, creating a new one with the results:
CALL fhoffa.x.pivot(
'bigquery-public-data.iowa_liquor_sales.sales' # source table
, 'fh-bigquery.temp.test' # destination table
, ['date'] # row_ids
, 'store_number' # pivot_col_name
, 'sale_dollars' # pivot_col_value
, 30 # max_columns
, 'SUM' # aggregation
, '' # optional_limit
);
That’s it. Let’s see what’s happening here. The original table has one row per sale (invoice), including the date of the transaction, the store number, and the sale amount:
Then if someone requests a transformation like "I want to see the total amount sold per day per store, with each store in its own column", that's a pivot. The code above will create a new table that looks exactly as requested:
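As a rough sketch of the transformation (the store numbers and amounts below are made up for illustration, not actual Iowa sales data):

```sql
-- Before: one row per invoice
-- date        store_number  sale_dollars
-- 2020-01-02  2633          120.50
-- 2020-01-02  4829           35.10
-- 2020-01-02  2633           60.00
--
-- After pivoting: one row per date, one column per store
-- (the procedure prefixes each generated column with e_)
-- date        e_2633  e_4829
-- 2020-01-02  180.50   35.10
```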
Let's review each line of the above code:

- CALL fhoffa.x.pivot(): The first thing you might notice is the keyword CALL. That's how you run a stored procedure like this pivot(). I shared it in my dataset fhoffa.x, so you can call it at any moment too. The following lines are the parameters for that stored procedure. Note that most of these parameters are strings:
- 'bigquery-public-data.iowa_liquor_sales.sales': the table we want to pivot. In this case, a public table sharing the liquor sales in Iowa.
- 'fh-bigquery.temp.test': the stored procedure will create (or overwrite) this table with the pivoted results.
- ['date']: an array with the list of dimensions we want to summarize on each row. Here it's just date, because all we want is a summary per day.
- 'store_number': the name of the column holding the values we want to transform into columns.
- 'sale_dollars': the name of the column holding the values we want to summarize for each new column.
- 30: a limit on the number of columns to be created. Maybe we want to include only the top 30 stores? This is useful to avoid creating thousands of non-meaningful columns.
- 'SUM': the aggregation strategy for the values. Usually ANY_VALUE is the easiest one to use, but here we really want the SUM per day.
- '': looks empty, but you could write 'LIMIT 10' if you wanted to create only 10 rows (useful when testing the pivots, for faster materializations).
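Under the hood, the procedure writes and runs a query shaped like the following. This is a hand-written sketch: the real column list is chosen by APPROX_TOP_COUNT, and the store numbers below are placeholders, not actual results:

```sql
-- Sketch of the SQL the call above generates and runs:
CREATE OR REPLACE TABLE `fh-bigquery.temp.test` AS
SELECT
  date
  , SUM(IF(store_number = "2633", sale_dollars, null)) e_2633  -- one column
  , SUM(IF(store_number = "4829", sale_dollars, null)) e_4829  -- per top store
  -- ... up to max_columns (30) of these
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY 1
ORDER BY 1
```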
Let’s see some variations of this with different source tables and requests:
What if we want the summary per month?
If instead of summarizing by the raw date we want a summary per month, then we can just use the row_ids array to ask for this transformation:
CALL fhoffa.x.pivot(
'bigquery-public-data.iowa_liquor_sales.sales' # source table
, 'fh-bigquery.temp.test_pivot' # destination table
, ['DATE_TRUNC(date, MONTH) AS month'] # row_ids
, 'store_number' # pivot_col_name
, 'sale_dollars' # pivot_col_value
, 30 # max_columns
, 'SUM' # aggregation
, '' # optional_limit
);
What if we have multiple dimensions to summarize for?
In this case I want a summary for each id and date. Now it makes sense to have an array with more than one element:
CALL fhoffa.x.pivot(
'bigquery-public-data.ghcn_d.ghcnd_2019' # source table
, 'fh-bigquery.temp.test_pivotted' # destination table
, ['id', 'date'] # row_ids
, 'element' # pivot_col_name
, 'value' # pivot_col_value
, 30 # max_columns
, 'AVG' # aggregation
, 'LIMIT 10' # optional_limit
);
Note that I changed the source table here: this is NOAA's GHCND daily weather dataset. With pivot() I'm getting a column for each sensor. Also, instead of using SUM() here I use AVG(), because that's what makes more sense for weather data.
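With this call, each GHCN element code becomes its own column. PRCP, TMAX and TMIN are standard GHCN element codes; the row values below are illustrative, not actual readings:

```sql
-- Sketch of the resulting schema:
-- id           date        e_PRCP  e_TMAX  e_TMIN  ...
-- USW00094728  2019-07-01     0.0   289.0   211.0
```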
What if we have a more complex case?
In this case I want to transform the name of each column — as the values in the original table include more data than I want:
CREATE OR REPLACE VIEW `fh-bigquery.temp.a` AS (
SELECT * EXCEPT(SensorName), REGEXP_REPLACE(SensorName, r'.*/', '') SensorName
FROM `data-sensing-lab.io_sensor_data.moscone_io13`
);

CALL fhoffa.x.pivot(
'fh-bigquery.temp.a' # source table
, 'fh-bigquery.temp.delete_pivotted' # destination table
, ['MoteName', 'TIMESTAMP_TRUNC(Timestamp, HOUR) AS hour'] # row_ids
, 'SensorName' # pivot_col_name
, 'Data' # pivot_col_value
, 8 # max_columns
, 'AVG' # aggregation
, 'LIMIT 10' # optional_limit
);
The solution was to create a VIEW with a pre-transformation of the data. To clean up the names of the new columns, REGEXP_REPLACE in the VIEW gets rid of the noise, and then pivot() can use the view as a source for the transformations.
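For instance, assuming a raw sensor name that looks like a path (the exact value here is a made-up example), the view's REGEXP_REPLACE keeps only the part after the last slash:

```sql
SELECT REGEXP_REPLACE('mtc/field/humidity', r'.*/', '') AS SensorName
-- 'humidity': the greedy .*/ matches everything up to the last slash
```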
Number of Stack Overflow questions per year, with each column one of the top 10 tags?
CREATE VIEW `fh-bigquery.temp.stack_unnest_tags`
AS
SELECT *
FROM `bigquery-public-data.stackoverflow.posts_questions`, UNNEST(SPLIT(tags,'|')) tag;

CALL fhoffa.x.pivot(
'fh-bigquery.temp.stack_unnest_tags' # source table
, 'fh-bigquery.temp.stack_pivot' # destination table
, ['EXTRACT(YEAR FROM creation_date) AS year'] # row_ids
, 'tag' # pivot_col_name
, 'view_count' # pivot_col_value
, 10 # max_columns
, 'COUNT' # aggregation
, '' # optional_limit
);
- First, create a view with an UNNEST() for the tags of each question.
- The aggregation is just COUNT.
If you import the results into Google Sheets, you can easily visualize the trends:
How-to
Creating the shared stored procedure pivot()
CREATE OR REPLACE FUNCTION `fhoffa.x.normalize_col_name`(col_name STRING) AS (
  REGEXP_REPLACE(col_name, r'[/+#|]', '_')
);

CREATE OR REPLACE PROCEDURE `fhoffa.x.pivot`(
table_name STRING
, destination_table STRING
, row_ids ARRAY<STRING>
, pivot_col_name STRING
, pivot_col_value STRING
, max_columns INT64
, aggregation STRING
, optional_limit STRING
)
BEGIN
DECLARE pivotter STRING;
EXECUTE IMMEDIATE (
"SELECT STRING_AGG(' "||aggregation
||"""(IF('||@pivot_col_name||'="'||x.value||'", '||@pivot_col_value||', null)) e_'||fhoffa.x.normalize_col_name(x.value))
FROM UNNEST((
SELECT APPROX_TOP_COUNT("""||pivot_col_name||", @max_columns) FROM `"||table_name||"`)) x"
) INTO pivotter
USING pivot_col_name AS pivot_col_name, pivot_col_value AS pivot_col_value, max_columns AS max_columns;

EXECUTE IMMEDIATE (
'CREATE OR REPLACE TABLE `'||destination_table
||'` AS SELECT '
||(SELECT STRING_AGG(x) FROM UNNEST(row_ids) x)
||', '||pivotter
||' FROM `'||table_name||'` GROUP BY '
|| (SELECT STRING_AGG(''||(i+1)) FROM UNNEST(row_ids) WITH OFFSET i)||' ORDER BY '
|| (SELECT STRING_AGG(''||(i+1)) FROM UNNEST(row_ids) WITH OFFSET i)
||' '||optional_limit
);
END;
The biggest new trick enabling these pivots: EXECUTE IMMEDIATE.

Now that BigQuery has support for EXECUTE IMMEDIATE, we can generate strings inside BigQuery and then use BigQuery to run them. In this case the code is generating two strings:
- First it looks for a list of all the values used to generate new columns. Note that APPROX_TOP_COUNT will quickly calculate the most common values. This is our first string, which also includes the aggregation strategy for each pivoted row.
- The second generated string is the query that will CREATE OR REPLACE our destination table. This string uses the previously generated string listing all the columns and the aggregation strategy.
- An auxiliary function `fhoffa.x.normalize_col_name()`: this is the first iteration of a helper that makes sure the name of each generated column is acceptable to BigQuery.
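If you haven't used EXECUTE IMMEDIATE before, here's a minimal, self-contained sketch of the two pieces the procedure relies on (the sample inputs are mine, not from the examples above):

```sql
-- 1) EXECUTE IMMEDIATE runs a dynamically built query; INTO captures
--    its result, and USING passes values the query reads as @params.
DECLARE result STRING;
EXECUTE IMMEDIATE "SELECT UPPER(@word)" INTO result USING 'pivot' AS word;
-- result is now 'PIVOT'

-- 2) The helper sanitizes would-be column names by replacing
--    the characters / + # and | with underscores:
SELECT fhoffa.x.normalize_col_name('a/b#c');  -- 'a_b_c'
SELECT fhoffa.x.normalize_col_name('c++');    -- 'c__'
```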
Next steps
- While this is a fun solution for PIVOT() with stored procedures and dynamic SQL, BigQuery could also have native support for it. If you are interested, upvote and follow this issue on the BigQuery issue tracker.
- I will move this procedure to the common repository bqutil once we make sure we have the correct API for it. Any notes before we do so?
- Check out the previously released UNPIVOT():
- A first look at Dynamic SQL and EXECUTE IMMEDIATE by Lak Lakshmanan.
- Join the Data Show, for a live explanation on how this works.
Thanks to
- Lak Lakshmanan, Eric Schmidt, Alok Pattani for reviewing this post.
- Mikhail Berlyant for answering countless Stack Overflow questions, including the one whose code inspired this solution for writing SQL to get pivots done in BigQuery.
Want more?
I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.