
Google Analytics 4 in Snowflake: SQL Cookbook

How to export GA4 from BigQuery into Snowflake, and translate the cookbook sample queries.

Nov 15, 2021

2024 update: Official connector

https://www.youtube.com/watch?v=yiBZLnCJnzE

2021 Article

Google Analytics 4 exports event data at the individual user level into BigQuery for free. You might then be wondering “How do I move this data into Snowflake, to get all the benefits of the Data Cloud?” and “Is there a quick way to translate the sample BigQuery queries into Snowflake SQL?”. Answering those questions is the goal of this post.

Watch on YouTube

Export GA4 from BigQuery to Snowflake

Export from BigQuery to GCS

Let’s start by exporting the GA4 sample e-commerce dataset.

Snowflake can read files directly from GCS — even if your Snowflake is running on AWS. To bring data from BigQuery to Snowflake you only need to ask BigQuery to export these tables into GCS:

Exporting GA4 sample data in the BigQuery web UI

To export manually, create a bucket in GCS to receive the exported tables. Exporting in Parquet format with snappy compression works well:

bq extract --destination_format=PARQUET --compression=SNAPPY bigquery-public-data:ga4_obfuscated_sample_ecommerce.events_20201202  gs://your-bucket/yourprefix/ga4sample-20201202-*

Prepare your Snowflake account to read from GCS

Setting up Snowflake to read securely from Google Cloud Storage is straightforward: follow the steps in the Snowflake documentation for configuring an integration for Google Cloud Storage.

Create a table in Snowflake, read the exported Parquet files

list @fh_gcp_stage; -- check files exist

create or replace table ga4_variant(v variant);

copy into ga4_variant
from @fh_gcp_stage/yourprefix/
pattern='yourprefix/ga4sample-.*'
file_format = (type='PARQUET');

Note that the script above will load all files in that specific folder, while skipping the ones it has already loaded into that specific table. See the LOAD_UNCERTAIN_FILES option for more details.

Automate with Snowpipe

You can repeat this process for each new day with new data, but you could also set up Snowflake to automatically load each new file that shows up in GCS — check the docs for Snowpipe.

Getting ready to query

Semi-structured data in Snowflake with VARIANT

The data is now ready to query — we created a table with a single column of the VARIANT type. In Snowflake this type natively understands and optimizes semi-structured data, without the need to define a schema:

select distinct v:device.category
, v:traffic_source.medium
, v:user_ltv.currency
, v:geo.country
from ga4_variant

Not only was the query easy to write, it also ran in only 147 ms. Even better, the new Snowflake web UI automatically produces a summary of the results.

Even though VARIANTs in Snowflake are powerful, we might want to create a view over them to make further querying easier.

Creating a view for the raw GA4 data

This view gives aliases to the VARIANT columns, defines types, and parses timestamps and dates:

create or replace view ga4 as
select to_timestamp(v:event_timestamp::int, 6) event_timestamp
, v:event_dimensions event_dimensions
, to_date(v:event_date::string, 'yyyymmdd') event_date
, v:event_name::string event_name
, v:user_id::string user_id
, v:privacy_info privacy_info
, v:stream_id::int stream_id
, v:event_server_timestamp_offset::int event_server_timestamp_offset
, parse_ga4_objarray(v:user_properties) user_properties
, v:device device
, v:platform::string platform
, parse_ga4_objarray(v:event_params) event_params
, v:event_previous_timestamp::int event_previous_timestamp
, v:geo geo
, v:traffic_source traffic_source
, v:ecommerce ecommerce
, v:items items
, v:event_bundle_sequence_id::int event_bundle_sequence_id
, v:user_ltv user_ltv
, v:event_value_in_usd::float event_value_in_usd
, v:user_pseudo_id::string user_pseudo_id
, to_timestamp(v:user_first_touch_timestamp::int, 6) user_first_touch_timestamp
, v:app_info app_info
from ga4_variant
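The GA4 export stores event_timestamp as microseconds since the Unix epoch, which is why the view calls to_timestamp with a scale of 6. A quick sanity check of that conversion, expressed in JavaScript (the sample value is made up for illustration):

```javascript
// GA4 event_timestamp is microseconds since the Unix epoch;
// to_timestamp(v, 6) in Snowflake performs this same conversion.
const eventTimestampMicros = 1606780800000000; // sample: 2020-12-01 00:00:00 UTC

// JavaScript Dates take milliseconds, so divide by 1000.
const ts = new Date(eventTimestampMicros / 1000);
console.log(ts.toISOString()); // 2020-12-01T00:00:00.000Z
```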

Using this view our previous query changes to:

select distinct device:category
, traffic_source:medium
, user_ltv:currency
, geo:country
from ga4;

You might notice that each of these specific columns still has VARIANTs inside, which you navigate in the same way as before.

But there’s another element of this view worth a deeper look: the parse_ga4_objarray() UDF.

Saner key-value schemas with a UDF

GA4 has columns with key-value mappings. BigQuery stores these as repeated records with typed value fields, a verbose schema that makes queries over them unnecessarily complicated.

For example, this is a query that unnests the key-values in BigQuery:

SELECT event_timestamp,
(
SELECT COALESCE(value.int_value, value.float_value, value.double_value)
FROM UNNEST(event_params)
WHERE key = 'value'
) AS event_value
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE event_name = 'purchase'
AND _TABLE_SUFFIX BETWEEN '20201201' AND '20201202';

This is how the same query looks in Snowflake, after we set up the view with the UDF:

SELECT event_timestamp, event_params:value event_value
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

To make this happen our UDF takes the key-values in the BigQuery schema, and brings them back to a simple object schema:

create or replace function parse_ga4_objarray(V variant)
returns variant
language javascript
AS
$$
var result = {};
for (const x of V) {
result[x.key] = x.value ? Object.values(x.value)[0] : null;
}
return result
$$;
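To see what this UDF does outside Snowflake, here is the same logic as plain JavaScript, applied to a hypothetical event_params array in the BigQuery export shape (the key names and values below are made-up samples):

```javascript
// Same logic as the parse_ga4_objarray UDF body, runnable in Node.
function parseGa4ObjArray(V) {
  var result = {};
  for (const x of V) {
    // Each entry looks like {key: '...', value: {string_value: '...'}};
    // take the first (and only) populated typed value.
    result[x.key] = x.value ? Object.values(x.value)[0] : null;
  }
  return result;
}

// Hypothetical sample: event_params as exported by BigQuery.
const eventParams = [
  { key: 'value', value: { double_value: 64.3 } },
  { key: 'currency', value: { string_value: 'USD' } },
];

console.log(parseGa4ObjArray(eventParams));
// { value: 64.3, currency: 'USD' }
```

The repeated key/typed-value records collapse into one simple object, which is what makes `event_params:value` possible in the queries below.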

2022-03-15 update: Improved UDF

Thanks to a reader for this improved UDF:

create or replace function parse_ga4_objarray_fixed(V variant)
returns variant
language javascript
AS
$$
var result = {};
for (const x of V) {
if (x.value){
for (const [key, value] of Object.entries(x.value)) {
if (key != 'set_timestamp_micros') {
result[x.key] = value;
}
}
} else {
result[x.key] = null;
}
}
return result
$$;
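The difference matters because some GA4 values carry a set_timestamp_micros field next to the actual typed value, and the original UDF's Object.values()[0] could pick the timestamp instead. A plain-JavaScript sketch of the fixed logic, with a hypothetical user_properties entry where the timestamp comes first:

```javascript
// Fixed logic: skip the set_timestamp_micros field so the real
// typed value wins, regardless of field order.
function parseGa4ObjArrayFixed(V) {
  var result = {};
  for (const x of V) {
    if (x.value) {
      for (const [key, value] of Object.entries(x.value)) {
        if (key != 'set_timestamp_micros') {
          result[x.key] = value;
        }
      }
    } else {
      result[x.key] = null;
    }
  }
  return result;
}

// Hypothetical entry where set_timestamp_micros precedes the value,
// which would trip up the original Object.values()[0] approach.
const userProps = [
  { key: 'plan', value: { set_timestamp_micros: 1606780800000000, string_value: 'premium' } },
];

console.log(parseGa4ObjArrayFixed(userProps).plan); // premium
```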

You can continue to use this view seamlessly, or you can materialize its results if you have huge datasets that you want to optimize for faster queries.

Translating the sample queries: Basic

Now let’s translate the sample queries from BigQuery into Snowflake. Note that we won’t use a different table for each day; instead we have a single table with multiple dates.

Let’s start with the sample basic queries:

Example: Query a specific date range for selected events

WITH
UserInfo AS (
SELECT
user_pseudo_id,
MAX(IFF(event_name IN ('first_visit', 'first_open'), 1, 0)) AS is_new_user
-- IFF instead of IF
FROM ga4
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
GROUP BY 1
)
SELECT
COUNT(*) AS user_count,
SUM(is_new_user) AS new_user_count
FROM UserInfo;

Example: Average number of transactions per purchaser

SELECT COUNT(*) / COUNT(DISTINCT user_pseudo_id) AS avg_transaction_per_purchaser
FROM ga4
WHERE event_name IN ('in_app_purchase', 'purchase')
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'

Example: Query values for a specific event name

SELECT event_timestamp, event_params:value::float event_value
-- much easier with the object parser in the view
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

Example: Query total value for a specific event name

SELECT SUM(event_params:value) event_value
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
;

Example: Top 10 items added to cart by most users

SELECT x.value:item_id::string item_id
, x.value:item_name::string item_name
, COUNT(DISTINCT user_pseudo_id) AS user_count
FROM ga4, table(flatten(items)) x
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
AND event_name IN ('add_to_cart')
GROUP BY 1, 2
ORDER BY user_count DESC
LIMIT 10;

Example: Average number of pageviews by purchaser type

WITH
UserInfo AS (
SELECT
user_pseudo_id,
COUNT_IF(event_name = 'page_view') AS page_view_count,
-- COUNT_IF instead of COUNTIF
COUNT_IF(event_name IN ('in_app_purchase', 'purchase')) AS purchase_event_count
FROM ga4
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
GROUP BY 1
)
SELECT
(purchase_event_count > 0) AS purchaser,
COUNT(*) AS user_count,
SUM(page_view_count) AS total_page_views,
SUM(page_view_count) / COUNT(*) AS avg_page_views
FROM UserInfo
GROUP BY 1;

Example: Sequence of pageviews

SELECT
user_pseudo_id,
event_timestamp,
event_params:ga_session_id AS ga_session_id,
event_params:page_location::string AS page_location,
event_params:page_title::string AS page_title
FROM ga4
WHERE event_name = 'page_view'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
ORDER BY user_pseudo_id, ga_session_id, event_timestamp ASC;

Translating the sample queries: Advanced

Let’s continue with the advanced sample queries:

Example: Products purchased by customers who purchased a specific product

WITH
Params AS (
-- Replace with selected item_name or item_id.
SELECT 'Google Navy Speckled Tee' AS selected_product
),
PurchaseEvents AS (
SELECT user_pseudo_id, items
FROM ga4
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
AND event_name = 'purchase'
),
ProductABuyers AS (
SELECT DISTINCT user_pseudo_id
FROM Params, PurchaseEvents
, table(flatten(items)) items
-- flatten instead of unnest
WHERE items.value:item_name = selected_product
-- item.item_id can be used instead of items.item_name.
)
SELECT items.value:item_name::string AS item_name
, SUM(items.value:quantity) AS item_quantity
FROM
Params
, PurchaseEvents
, table(flatten(items)) AS items
WHERE
user_pseudo_id IN (SELECT user_pseudo_id FROM ProductABuyers)
-- item.item_id can be used instead of items.item_name
AND items.value:item_name != selected_product
GROUP BY 1
ORDER BY item_quantity DESC;

Example: Average amount of money spent per purchase session by user

SELECT user_pseudo_id
, COUNT(DISTINCT(event_params:ga_session_id)) AS session_count
, AVG((event_params:value)) AS avg_spend_per_session_by_user
-- much simpler with our simplified schema
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
GROUP BY 1;

Example: Get the latest ga_session_id and ga_session_number for specific users during the last 4 days

The results of this query will depend on how you want to manage timezones:

-- Replace timezone. List at https://en.wikipedia.org/wiki/List_of_tz_database_time_zones.
set REPORTING_TIMEZONE = 'America/Los_Angeles';
-- Replace list of user_pseudo_id's with ones you want to query.
set USER_PSEUDO_ID_LIST = '1005326.4012506369,1009730.0442567259,1029388.2450501039';
// TODO: Incorporate timezone adjustments to BQ sample
CREATE FUNCTION GetDateSuffix(date_shift int, timezone STRING)
returns string
as
$$
(SELECT '') --FORMAT_DATE('%Y%m%d', DATE_ADD(CURRENT_DATE(timezone), INTERVAL date_shift DAY)))
$$;
SELECT DISTINCT
user_pseudo_id,
FIRST_VALUE(event_params:ga_session_id) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS ga_session_id,
FIRST_VALUE(event_params:ga_session_number) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS ga_session_number
FROM ga4
WHERE array_contains(user_pseudo_id::variant, split($USER_PSEUDO_ID_LIST, ','))
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
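The split plus array_contains filter works like this JavaScript equivalent: split the comma-separated session variable into a list, then check membership (IDs taken from the sample list above):

```javascript
// Mirror of split($USER_PSEUDO_ID_LIST, ',') + array_contains(...).
const USER_PSEUDO_ID_LIST = '1005326.4012506369,1009730.0442567259,1029388.2450501039';
const idList = USER_PSEUDO_ID_LIST.split(',');

console.log(idList.includes('1009730.0442567259')); // true
console.log(idList.includes('9999.1'));             // false
```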

Next steps

I’ve been working with a Developer Advocate at Google for Google Analytics to write this post. We’d love to keep working together to help you — let us know if you have any further ideas and/or needs.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. Check reddit.com/r/snowflake for the most interesting Snowflake news.

Written by Felipe Hoffa

Developer Advocate around the data world. Ex-Google BigQuery, Ex-Snowflake, Always Felipe
