Easy pivot() in BigQuery, in one step

Introducing the easiest way to get a pivot done in BigQuery. This has been one of the most recurrent BigQuery open questions. And now, thanks to dynamic SQL and stored procedures, we can finally make it easy for everyone.

Quick pivots

Before going any further, let me show you how to call fhoffa.x.pivot() to easily pivot any table, creating a new one with the results:

CALL fhoffa.x.pivot(
'bigquery-public-data.iowa_liquor_sales.sales' # source table
, 'fh-bigquery.temp.test' # destination table
, ['date'] # row_ids
, 'store_number' # pivot_col_name
, 'sale_dollars' # pivot_col_value
, 30 # max_columns
, 'SUM' # aggregation
, '' # optional_limit
);

That’s it. Let’s see what’s happening here. The original table has one row per sale (invoice), including the date of the transaction, the store number, and the sale amount:

Original table to be pivoted

Then if someone requests a transformation like “I want to see the total amount sold per day per store, with each store in its own column” — that’s a pivot. The code above will create a new table that looks exactly like requested:

Pivoted table, with one store per column

Let’s review each line of the above code:

  • CALL fhoffa.x.pivot(): The first thing you might notice is the keyword CALL. That’s how you run a stored procedure like this pivot(). I shared it in my dataset fhoffa.x, so you can call it at any moment too. The following lines are the parameters for that stored procedure. Note that most of these parameters strings:
  • ‘bigquery-public-data.iowa_liquor_sales.sales’: Denotes the table we want to pivot. In this case, a public table sharing the liquor sales in Iowa.
  • ‘fh-bigquery.temp.test’: The stored procedure will create (or overwrite) this table, with the pivoted results.
  • [‘date’]: An array with the list of the dimensions we want to summarize for on each row. Here it’s just date, because all we want is a summary per day.
  • ‘store_number’: This is the column name for the column with the values we want to transform into columns.
  • ‘sale_dollars’: This is the column name for the column with the values we want to summarize for each new column.
  • 30: A number to limit the number of columns to be created. Maybe we want to include only the top 30 stores? This is useful to avoid creating thousands of non-meaningful columns.
  • ‘SUM’: The aggregation strategy you want to use for the values. Usually ANY_VALUE should be the easiest one to use — but here we really want the SUM per day.
  • ‘’: Looks empty, but you could write LIMIT 10, if you wanted to only create 10 rows (useful when testing the pivots, for faster materializations).

Let’s see some variations of this with different source tables and requests:

If instead of summarizing by the raw date we want a summary per month, then we can just use the row ids array to ask for this transformation:

CALL fhoffa.x.pivot(
'bigquery-public-data.iowa_liquor_sales.sales' # source table
, 'fh-bigquery.temp.test_pivot' # destination table
, ['DATE_TRUNC(date, MONTH) AS month'] # row_ids
, 'store_number' # pivot_col_name
, 'sale_dollars' # pivot_col_value
, 30 # max_columns
, 'SUM' # aggregation
, '' # optional_limit
);

In this case I want a summary for each id and date. Now it makes sense to have an array of more than one element:

CALL fhoffa.x.pivot(
'bigquery-public-data.ghcn_d.ghcnd_2019' # source table
, 'fh-bigquery.temp.test_pivotted' # destination table
, ['id', 'date'] # row_ids
, 'element' # pivot_col_name
, 'value' # pivot_col_value
, 30 # max_columns
, 'AVG' # aggregation
, 'LIMIT 10' # optional_limit
);

Note that I changed the source table here — this is NOAA’s GHCND daily weather. With pivot() I’m getting a column for each sensor. Also instead of using SUM() here I use AVG() — because it’s what makes more sense for weather data.

In this case I want to transform the name of each column — as the values in the original table include more data than I want:

CREATE OR REPLACE VIEW `fh-bigquery.temp.a` AS (
SELECT * EXCEPT(SensorName), REGEXP_REPLACE(SensorName, r'.*/', '') SensorName
FROM `data-sensing-lab.io_sensor_data.moscone_io13`
);
CALL fhoffa.x.pivot(
'fh-bigquery.temp.a'
, 'fh-bigquery.temp.delete_pivotted' # destination table
, ['MoteName', 'TIMESTAMP_TRUNC(Timestamp, HOUR) AS hour'] # row_ids
, 'SensorName' # pivot_col_name
, 'Data' # pivot_col_value
, 8 # max_columns
, 'AVG' # aggregation
, 'LIMIT 10' # optional_limit
);

The solution was to create a VIEW with a pre-transformation of the data. To clean up the name of the new columnsREGEXP_REPLACE in the VIEW gets rid of the noise — and then pivot() can use the view as a source for the transformations.

CREATE VIEW `fh-bigquery.temp.stack_unnest_tags`
AS
SELECT *
FROM `bigquery-public-data.stackoverflow.posts_questions`, UNNEST(SPLIT(tags,'|')) tag;
CALL fhoffa.x.pivot(
'fh-bigquery.temp.stack_unnest_tags' # source table
, 'fh-bigquery.temp.stack_pivot' # destination table
, ['EXTRACT(YEAR FROM creation_date) AS year'] # row_ids
, 'tag' # pivot_col_name
, 'view_count' # pivot_col_value
, 10 # max_columns
, 'COUNT' # aggregation
, '' # optional_limit
);
  • First, create a view with an UNNEST() for the tags of each question.
  • The aggregation is just COUNT.

If you import the results to sheets, then you can easily visualize the trends:

Visualizing the # of Stack Overflow questions per year on Sheets

How-to

CREATE OR REPLACE FUNCTION 
`fhoffa.x.normalize_col_name`(col_name STRING) AS (
REGEXP_REPLACE(col_name,r'[/+#|]', '_'
)
CREATE OR REPLACE PROCEDURE `fhoffa.x.pivot`(
table_name STRING
, destination_table STRING
, row_ids ARRAY<STRING>
, pivot_col_name STRING
, pivot_col_value STRING
, max_columns INT64
, aggregation STRING
, optional_limit STRING
)
BEGIN
DECLARE pivotter STRING; EXECUTE IMMEDIATE (
"SELECT STRING_AGG(' "||aggregation
||"""(IF('||@pivot_col_name||'="'||x.value||'", '||@pivot_col_value||', null)) e_'||fhoffa.x.normalize_col_name(x.value))
FROM UNNEST((
SELECT APPROX_TOP_COUNT("""||pivot_col_name||", @max_columns) FROM `"||table_name||"`)) x"
) INTO pivotter
USING pivot_col_name AS pivot_col_name, pivot_col_value AS pivot_col_value, max_columns AS max_columns;
EXECUTE IMMEDIATE (
'CREATE OR REPLACE TABLE `'||destination_table
||'` AS SELECT '
||(SELECT STRING_AGG(x) FROM UNNEST(row_ids) x)
||', '||pivotter
||' FROM `'||table_name||'` GROUP BY '
|| (SELECT STRING_AGG(''||(i+1)) FROM UNNEST(row_ids) WITH OFFSET i)||' ORDER BY '
|| (SELECT STRING_AGG(''||(i+1)) FROM UNNEST(row_ids) WITH OFFSET i)
||' '||optional_limit
);
END;

The biggest new trick to have these pivots: EXECUTE IMMEDIATE.

Now that BigQuery has support for EXECUTE IMMEDIATE, we can generate strings inside BigQuery and then use BigQuery to run them. In this case the code is generating two strings:

  • First it looks for a list of all the values used to generate new columns. Note that APPROX_TOP_COUNT will quickly calculate what are the most common values. This is our first string, that also includes the aggregating strategy for each pivoted row.
  • The second generated string is the query that will CREATE OR REPLACE our destination table. This string uses the previously generated string listing all the columns and aggregating strategy.
  • An auxiliary function `fhoffa.x.normalize_col_name()`: This is the first iteration of a helper that makes sure that the name of each generated column is acceptable to BigQuery.

Next steps

Want more?

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.