Snowflake tables on Cloudflare R2 for global data lakes

When replicating data to multiple regions, cloud egress costs can add up quickly. So this is exciting: Snowflake can now use Cloudflare R2 as external storage for tables, with cheap storage and zero egress fees. Here’s how.


Update May 2023: This is now an official partnership between Cloudflare and Snowflake.

---

It all started this Sunday on reddit.com/r/snowflake with a request from /u/chilsana: “Anyone using CloudFlare R2 as S3 Compatible Storage?”

James Malone replied with an invitation to try it — and that’s exactly what I did on Monday. Here are the steps to get it all working together.

What’s exciting here is Cloudflare R2’s pricing model for distributing data across regions and clouds. Let me quote:

  • Generous free tier: Each month you can enjoy 10GB of stored data; 1,000,000 Class A operations; and 10,000,000 Class B operations.
  • No egress fees: Migrate volumes of data to multiple databases and analytics solutions as needed, dramatically reducing storage costs
  • Storage beyond 10GB: $0.015 / GB-month

As Corey Quinn notes in “The Compelling Economics of Cloudflare R2”, comparing a month of S3 versus R2 costs:

- 1 GB object in S3, 1 download: 11.3¢ (2.3¢ storage, 9¢ transfer)
- 1 GB object in R2, 1000 downloads: 1.5¢ for the month (just storage)
- 1 TB object in S3, 1 download: $113
- 1 TB object in R2, 1000 downloads: $15
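
To sanity-check those numbers (assuming S3 standard pricing of roughly 2.3¢ per GB-month for storage and 9¢ per GB for egress, versus R2 at 1.5¢ per GB-month with free egress): 1,000 GB on S3 comes to about $23 of storage plus $90 of egress for a single full download, roughly $113 in total, while the same 1,000 GB on R2 costs about $15 for the month no matter how many times the data gets downloaded.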

Once we get this integration working, you’ll be able to export your tables to Cloudflare R2 with SQL in Snowflake, like this:

copy into @my_cloudflare_stage/name_the_external_table
from my_own_table
file_format = (type=parquet);

And query them — from a Snowflake account in any region, without worrying about egress costs (other than the one-time charge for moving the data out of your Snowflake account’s cloud into Cloudflare):

select avg($1:_COL_1), min($1:_COL_0), max($1:_COL_0), count(*) c
from @my_cloudflare_stage/name_the_external_table
(file_format=>'my_parquet_format');

On the Snowflake side, check the docs for egress costs, and see what your cloud provider might charge.

Let’s get started

Create a Cloudflare R2 account

  • First sign up to get your own account, and verify your email address.
  • Cloudflare will then ask for a credit card or PayPal to start using R2.
  • Once your account is ready, you’ll be able to create your first R2 bucket.
  • Create an API token — give it edit permissions so Snowflake can write to your bucket. Don’t worry too much if you lose these secrets — you can always create new API tokens later.
  • And that’s it. Note down your account ID, bucket name, and token credentials so we can bring them to Snowflake.

Back to Snowflake

Check the Snowflake docs on “Working With Amazon S3-compatible Storage” (in Public Preview as I write this). Then:

  • Ask Snowflake support to allow access from your account to your new Cloudflare R2 endpoint your_account_id.r2.cloudflarestorage.com (as described in the docs).
  • Create an external stage with the details we’ve collected so far:
CREATE OR REPLACE STAGE my_cloudflare_stage
URL = 's3compat://your_bucket/'
ENDPOINT = 'your_account_id.r2.cloudflarestorage.com'
REGION = 'wnam'
CREDENTIALS = (AWS_KEY_ID = 'your_key' AWS_SECRET_KEY = 'your_secret');
  • Note above: I used REGION = 'wnam' as apparently that’s the closest Cloudflare region to my Snowflake account. You might need to experiment with this value in case of flakiness.
  • That’s it. We are ready. We can export a table to our Cloudflare R2 bucket:
copy into @my_cloudflare_stage/name_the_external_table
from my_own_table
file_format = (type=parquet);
  • And we can list the files:
list @my_cloudflare_stage;
  • If you want to read the files with a SELECT, you’ll need a Parquet file format defined:
CREATE OR REPLACE FILE FORMAT my_parquet_format
TYPE = PARQUET
COMPRESSION = SNAPPY;
  • We are ready for our first SELECT:
select avg($1:_COL_1), min($1:_COL_0), max($1:_COL_0), count(*) c
from @my_cloudflare_stage/name_the_external_table
(file_format=>'my_parquet_format');
A query that goes over all Stack Overflow users — staged in Parquet files on Cloudflare R2
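
If you’d rather query by table name than by stage path, you could also try wrapping the staged files in an external table. Here’s a minimal sketch, assuming external tables work with S3-compatible stages on your account (my_r2_users is a name I made up, and auto refresh stays off because R2 doesn’t send the bucket event notifications automatic refresh relies on):

-- Wrap the staged Parquet files in an external table
-- (no column list, so everything lands in the VALUE variant column):
create or replace external table my_r2_users
location = @my_cloudflare_stage/name_the_external_table
auto_refresh = false
file_format = (format_name = 'my_parquet_format');

-- Refresh the file metadata manually whenever new files land:
alter external table my_r2_users refresh;

select count(*) from my_r2_users;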

Notes

James Malone suggested starting by testing the Cloudflare S3 API with the Snowflake S3 compatibility test suite — but I got two types of errors:

  • Most tests failed with “Version id not match a read after write”.
  • Then getBucketLocation failed either with “Must be one of: wnam, enam, weur, eeur, apac, auto (Service: Amazon S3; Status Code: 400; Error Code: InvalidRegionName)” or with “expected: <wnam> but was: <WNAM>”. So I could set the region as wnam or WNAM — and either choice would trip one of the tests.
  • Nevertheless, once I connected it to Snowflake, it worked.
  • This is all Public Preview right now — so your results might vary.
  • Be careful with sensitive data — instead of using R2, consider Snowflake replication across regions, secure views, and secure UDFs (see the sketch after this list). Read more in “Introduction to Secure Data Sharing”.
  • Could Snowflake support Cloudflare natively for the Data Marketplace? Add it to the wishlist, and let your account executive know.
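
To illustrate the secure view idea, here’s a minimal sketch (the table and column names are made up): expose only a curated slice of the data, so the view’s definition and the underlying table stay hidden from whoever you share it with.

-- Share curated, non-sensitive columns through a secure view
-- instead of copying the raw data out to R2:
create or replace secure view shared_users as
select id, display_name, reputation
from my_own_table;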

Next steps

  • Now it’s your turn to try this — please report back with your results.
  • Note that I could have used HEADER = TRUE when exporting the Parquet files — that would make it easier to get the column names back. To make this post shorter, I’ll leave that to you; there’s a sketch of what it might look like after this list.
  • Test generating read tokens on Cloudflare — so people around the world can access your tables without egress costs. Their free tier should cover “3–4 requests per second for free each month”.
  • Can we do Change-Data-Capture (CDC) with Snowpipe and R2? Try it out, report results, or stay tuned.
  • What else could you try with the S3-compatible storage layer in Snowflake? Brad Culberson suggested Backblaze B2. Or something crazier, like a Snowflake external table on a Raspberry Pi with MinIO.
  • For the best query performance, copy the tables back into Snowflake native storage, and look forward to the Iceberg Tables currently in development.
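
For example, here’s a quick sketch of the HEADER = TRUE idea (the target path and the column names in the SELECT are made up for illustration): with the header, the unloaded Parquet files keep the original column names, so you can query them by name instead of _COL_0, _COL_1, and so on.

-- Unload with HEADER = TRUE so the Parquet files keep the real column names:
copy into @my_cloudflare_stage/users_with_headers
from my_own_table
file_format = (type=parquet)
header = true;

-- Then query the staged files by column name:
select $1:ID, $1:DISPLAY_NAME
from @my_cloudflare_stage/users_with_headers
(file_format=>'my_parquet_format');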

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
