How Snowflake makes the dbt Python models shine
Now that dbt has made Python a seamless option within your SQL pipelines, it’s super cool to see how they made this possible with Snowflake. That’s what this post is about. Are you ready to try this out?
What is dbt
Before we start, let me explain dbt — especially for data engineers who haven't worked with it yet: “dbt” is a tool that helps us manage all the SQL transformations we apply to the data in our databases.
For example, let’s say we have a table with all our customer orders, and we want to keep a count of how many of these orders happened on a holiday. We could solve this with 2 queries in SQL:
- Add a column `is_holiday` to the table `sf100_orders`, by joining it to a table with `all_holidays`. Call the resulting table `sf100_orders_annotated`.
- Count the rows that happen to have `is_holiday=true` on the table `sf100_orders_annotated`.
These 2 steps combine the source data, as visualized by dbt's lineage graph.
We can use dbt to write these 2 transformations as “dbt models”, which are files that contain SQL and a little bit of dbt configuration (when necessary):
```sql
-- sf100_orders_annotated.sql
{{ config(materialized='table') }}

select a.*, b.date is not null is_holiday
from {{ref('sf100_orders')}} a
left join {{ref('all_holidays')}} b
  on a.o_orderdate = b.date
```

```sql
-- count_holidays.sql
{{ config(materialized='table') }}

select count(*) total_days
  , count_if(is_holiday) holiday_count
  , holiday_count/total_days ratio
from {{ref('sf100_orders_annotated')}}
```
These 2 files look like typical SQL — with a twist. Instead of having references to tables, dbt allows us to refer to other “models” with annotations like `{{ref('sf100_orders_annotated')}}`. This is how dbt determines the lineage and dependencies between all the transformations our data goes through.
That’s the gist of dbt: Write SQL transformations in a well-organized way, to keep your data processes tested, documented, and versioned.
What’s new in dbt is that we can now write these transformations not only in SQL, but also in Python — while keeping everything else working as expected.
Our first dbt Python model
To show off how easy it is to incorporate Python within a dbt pipeline, let’s change the above `sf100_orders_annotated.sql` into `sf100_orders_annotated.py`:
```python
# sf100_orders_annotated.py
import snowflake.snowpark.functions as F

def model(dbt, session):
    df_sf100 = dbt.ref('sf100_orders')
    df_holidays = dbt.ref('all_holidays')

    joined_df = df_sf100.join(
        df_holidays,
        df_sf100['o_orderdate'] == df_holidays['date'],
        'left'
    )

    joined_df = joined_df.withColumn(
        'is_holiday',
        F.col('date').isNotNull()
    )

    result_df = joined_df.select(df_sf100['*'], 'is_holiday')

    return result_df
```
What’s interesting in this code:
- We were able to replace a SQL dbt model with a model written in pure Python.
- dbt still takes care of connecting to all dependencies by using references to them, like in `dbt.ref('sf100_orders')` and `dbt.ref('all_holidays')`.
- It doesn’t matter if these dependencies are SQL or Python dbt models. They are just a source of data for our current model.
- All we need to return at the end of this model is a dataframe. Then dbt takes care of materializing the dataframe into a table that other models can use — just like it did with the equivalent SQL model.
- To produce the desired join between the tables, we used familiar PySpark-style dataframe manipulation functions.
- But this is not PySpark — these are Snowpark Python DataFrames, which bring us great advantages (that we will review next).
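By the way, you don’t need dbt to experiment with these Snowpark dataframe operations. Here is a minimal standalone sketch of the same join, assuming you have Snowpark for Python installed and that the `sf100_orders` and `all_holidays` tables exist in your schema (the connection parameters are placeholders):

```python
# Standalone sketch of the same join, outside of dbt.
# Assumes Snowpark for Python is installed and the two source tables exist.
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

# Placeholder connection parameters: replace with your own account details.
session = Session.builder.configs({
    "account": "<your_account>",
    "user": "<your_user>",
    "password": "<your_password>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}).create()

df_orders = session.table("sf100_orders")
df_holidays = session.table("all_holidays")

# Left join on the order date, then flag the rows that matched a holiday.
joined = df_orders.join(
    df_holidays,
    df_orders["o_orderdate"] == df_holidays["date"],
    "left",
)
annotated = joined.with_column("is_holiday", F.col("date").is_not_null())

annotated.select(df_orders["*"], "is_holiday").show()  # preview a few rows
```

Once the logic looks right interactively, moving it into a dbt Python model is mostly a matter of wrapping it in `def model(dbt, session):` and swapping `session.table()` for `dbt.ref()`.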
Benchmarking this dbt Python model on Snowflake
The `orders` table has 150 million rows (straight out of TPC-H-SF100), and we are joining it with a `holidays` table that has 458 rows. The join should be fast, and then most of the time will be spent re-materializing the table into a new table with the same 150 million rows, now with the new `is_holiday` column.
With the classic dbt SQL model: 15 seconds, to join the 2 tables and re-materialize the large one.
With the dbt Python model: 15 seconds (+9s of Python code initialization).
This is a nice result: It means Snowflake was able to parallelize the Python code to transform and materialize the resulting table at virtually the same speed as pure SQL.
How the dbt Python models work on Snowflake
What dbt does to make the Python models work in Snowflake is pretty cool: It takes your model code and pushes it (mostly untouched) into Snowflake, wrapped in a Snowflake Python stored procedure. This wrapping code takes care of setting up the right libraries and materializing the resulting dataframe when dbt executes the stored procedure.
You can see exactly how dbt pulls this off by looking into your Snowflake logs.
Setting this up is what took dbt and Snowflake around 9 seconds — but then executing the join defined in the Python code took only 15 seconds, while re-materializing the 150M rows. What’s the secret sauce?
This is the “magic” that the Snowpark Python DataFrames can pull off: All the dataframe manipulation that we coded in Python gets translated by Snowflake into a SQL query. You can find the executed query in your Snowflake logs too.
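If you’d rather not dig through the logs, Snowpark can also show you the SQL it generates before running anything. A minimal sketch, assuming an existing Snowpark `session` like in the earlier standalone example:

```python
# Inspect the SQL that Snowpark generates for a dataframe, without executing it.
# Assumes `session` is an existing Snowpark Session (see the earlier sketch).
import snowflake.snowpark.functions as F

df_orders = session.table("sf100_orders")
df_holidays = session.table("all_holidays")

df = (
    df_orders
    .join(df_holidays, df_orders["o_orderdate"] == df_holidays["date"], "left")
    .with_column("is_holiday", F.col("date").is_not_null())
)

# `queries` lists the SQL statement(s) Snowpark will send to Snowflake.
for sql in df.queries["queries"]:
    print(sql)

df.explain()  # prints the simplified query plan Snowflake would use
```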
This is really cool:
- Your Python experts can keep writing Python, even within dbt pipelines.
- Your Python code runs inside Snowflake, as dbt takes care of setting this up.
- Snowpark translates the Python data frame manipulations into highly scalable and fast SQL code.
But then you might be wondering: What if you want to run custom Python code that cannot be translated into SQL? Will that be fast too?
The answer is yes:
A dbt Python model with code that can’t be translated into SQL
In the next example I want to produce the same results as the previous Python model, but with more custom Python code — instead of getting the holidays out of a table join, we are going to use the Python `holidays` library:
```python
# sf100_orders_annotated.py
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
import holidays

def model(dbt, session):
    dbt.config(
        packages = ["holidays"]
    )

    us_holidays = holidays.US(years=range(1990, 2030))

    @F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
    def is_holiday(x):
        return x in us_holidays

    df = dbt.ref('sf100_orders')
    df = df.withColumn("IS_HOLIDAY", is_holiday("O_ORDERDATE"))
    return df
```
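As a side note, you can sanity-check the `holidays` lookup locally before deploying the model. This quick sketch only needs `pip install holidays` and no Snowflake connection:

```python
# Quick local check of the holiday lookup used in the model above.
import datetime
import holidays

us_holidays = holidays.US(years=range(1990, 2030))

print(datetime.date(1998, 7, 4) in us_holidays)   # True: Independence Day
print(datetime.date(1998, 7, 6) in us_holidays)   # False: a regular Monday
```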
In this model we are using a Python library that’s already available in Snowflake thanks to Anaconda: `holidays`.
What’s interesting about it is that we are taking the 150M-row `dbt.ref('sf100_orders')` table and adding a column to it that contains the result of applying the Python function `is_holiday("O_ORDERDATE")`. This is a function that we defined right there, with a particular decorator: `@F.udf()`.
What the `@F.udf()` decorator does is interesting: It tells Snowflake to convert this custom Python function into a Snowflake Python UDF. Then Snowflake can transform the dataframe operations into scalable SQL code that happens to call the just-created Python UDF, as you can see in the Snowflake logs.
Benchmarking the call of a Python UDF
Running this transformation took 36 seconds — not too bad for re-materializing 150 million rows and calling an arbitrary Python UDF 150 million times.
Behind the scenes, the Python code executed 150 million times, taking 324 seconds in total. Our query finished much faster than that, because Snowflake took care of parallelizing these calls.
Notes on performance
Python will be slower than pure SQL — but how much slower will depend a lot on your code. Small optimizations can have a huge impact.
For example, compare this UDF implementation that initializes the Holidays dict outside the UDF:
```python
us_holidays = holidays.US(years=range(1990, 2030))

@F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
def is_holiday(x):
    return x in us_holidays
```
Versus initializing the Holidays dict inside the UDF:
```python
@F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
def is_holiday(x):
    us_holidays = holidays.US(years=range(1990, 2030))
    return x in us_holidays
```
The first example took 38 seconds to run on a Large warehouse over the 150M rows.
The second example took 3 hours and 38 minutes on the same Large warehouse over the same 150M rows (1 hour 52 minutes on an X-Large warehouse).
That’s a huge difference. Moving the initialization outside our UDF saves us a lot of time: with it inside, each UDF call takes 5.54ms. That may not sound like much, until you consider that 5.54ms times 150 million calls adds up to 9.6 days of CPU time.
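The back-of-the-envelope math behind that claim, in case you want to check it:

```python
# Back-of-the-envelope check of the numbers quoted above.
calls = 150_000_000     # one UDF call per row
per_call_ms = 5.54      # measured cost per call with the init inside the UDF

total_seconds = calls * per_call_ms / 1000
print(total_seconds / (60 * 60 * 24))   # ~9.6 days of single-threaded CPU time
```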
The good news, as seen above: If you take care of the initialization outside the UDF, the Snowpark `@F.udf()` decorator takes care of moving these values from the surrounding context into the UDF.
The possibilities are endless
Now you can take these building blocks and unleash your imagination: What could you do in your dbt pipelines if you could write arbitrary Python transformations at any step?
- dbt takes care of pushing your Python code to Snowflake
- Snowflake takes care of parallelizing the data frame transformations
- Snowflake and Anaconda have made it easy for you to use tons of Python libraries
- You can bring your own Python libraries
Next steps
- Try the dbt Python models in your Snowflake account, and share your results.
- Participate in the dbt community to shape the future of the dbt Python models.
- Read the dbt Python model docs, and the Snowflake Snowpark for Python library docs.
- Share your findings with the dbt community on the dbt Slack channels #dbt-core-python-models and #db-snowflake.
- Try some dbt+Snowflake quickstarts like “Data Engineering with Snowpark Python and dbt” and “Leverage dbt Cloud to Generate ML ready pipelines using Snowpark Python”.
- Compare the simplicity, power, and performance of the dbt Python models on Snowflake — versus the setup that dbt had to pull off to run Python models on other platforms.
- Check the previous posts by Jeremiah Hansen and Eda Johnson, “Data Engineering with Snowpark Python and dbt” and “A First Look at the dbt Python Models with Snowpark”. Also phData’s “How to Use dbt With Snowpark Python to Implement Sentiment Analysis” and “How to Build a Python Model in dbt with Snowflake”.
- For ML, dbt has an in-depth guide “Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake”, and tropos published “Time-Series Forecasting With Python For Snowpark And Dbt Labs”.
Want more?
- Try this out with a Snowflake free trial account — you only need an email address to get started.
- Try dbt Cloud — which I used as a cool web IDE integrated with dbt and Snowflake to develop the examples in this post (it does a lot more than that too, but that’s a story for another post).
I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
Oh, and I’m on threads.net/fhoffa now too :)