From BigQuery Director to Snowflake user — with Dan Delorey, VP of Data at SoFi

After an incredible 13-year career, Dan Delorey left Google to become the VP of Data at SoFi. It’s a huge switch, and fortunately for us, Dan shared his story with the Data Engineering Podcast. Here I want to share my notes, starting with Dan as a Google Software Engineer who became an Engineering Director for BigQuery, and then moved into a VP role at a regulated financial company.

Felipe Hoffa
10 min read · Nov 2, 2022
Dan Delorey — photo courtesy of Dan Delorey

This is a fascinating conversation: it starts with a deep focus on query execution and data storage, and then moves into business problems with data at SoFi (like “finding data, making sure data is reliable, monitoring SLAs about data delivery, answering business user questions”).

In many ways this story mirrors my own path — as I also started at Google as a Software Engineer, then moved to the BigQuery team (as a Developer Advocate), and then left Google to join Snowflake (as a Developer Advocate). I love seeing through Dan how my own focus changed from “faster queries” to “business value”.

But let’s start at the beginning, with a key piece of technology that brought SQL back to life at Google: Dremel.

Before BigQuery: Dremel

The story begins with Dan as a graduate student analyzing data, and then joining Google as a Software Engineer analyzing ads data. A couple of years later he joined the team building Dremel, the internal tool that later became BigQuery.

This was in 2010, when most data at Google was analyzed using MapReduce (the inspiration for Hadoop). MapReduce wasn’t easy to use though — and the Dremel team “was trying to solve the problem of boilerplate code, long startup times, difficulty in chaining steps together, difficulty in writing the programs”. Their solution was to use SQL instead, and the details were published in the 2010 Dremel paper (that won the 2020 VLDB Test of Time award).

The 2010 Dremel paper

The big innovations in Dremel were:

  • Using SQL.
  • Separation of compute and storage.
  • Columnar storage with nested/repeated data.
  • Running queries in less than a minute.
  • No pre-computing of stats or metadata.
  • Later on — “shuffled” joins.

This changed the paradigm and stimulated more people to ask more questions. As Dan says:

“Data just engenders more questions, it really opened up people’s eyes to the possibilities of looking at all these interesting analyses I could do that now require no work for me to preprocess the data, or do a multiphase orchestration with all sorts of different transformations. I can just express it as a common SQL statement using the join operator I’m used to, and under the hood, Dremel takes care of scheduling all of that, building the query plan. So I think that was one way in which it changed the paradigm, where people were able to expect to join everything. And then it went to the next level when we rolled out BigQuery. And now you see the same thing evolving with the Snowflake Marketplace, but the idea that the entire world’s data analysis system could be one global system where all the data was joinable from its original place at rest, so there was no need to copy or get stale redundant versions of your data anywhere. I could just share my table with you and you would be able to query it.”
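
To make that paradigm shift concrete, here is a minimal sketch, in BigQuery-style Standard SQL with made-up table and column names, of the kind of query Dremel made routine: nested/repeated data queried in place and joined against another table, with no preprocessing or multiphase orchestration.

```sql
-- Hypothetical tables: `ads.impressions` has a repeated/nested column `events`,
-- and `ads.campaigns` lives elsewhere but can be joined in place.
SELECT
  c.campaign_name,
  COUNT(*) AS clicks
FROM `ads.impressions` AS i,
     UNNEST(i.events) AS e            -- flatten the repeated/nested field on the fly
JOIN `ads.campaigns` AS c
  ON c.campaign_id = i.campaign_id    -- the engine plans the "shuffled" join itself
WHERE e.type = 'click'
GROUP BY c.campaign_name
ORDER BY clicks DESC;
```

The engine schedules the scan, the flatten, and the join; the analyst only writes the SQL.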

In many ways Dremel enabled a gigantic Data Lake inside Google — with Dan growing from Software Engineer to Tech Lead. Then Dan joined the BigQuery team as a manager.

BigQuery: From Data Lake to Data Warehouse

The BigQuery team was started to package Dremel so that the outside world could use it. In 2015 I was already their Developer Advocate, traveling the world showing off how quickly people could analyze their data using it. That’s when Dan switched to the newly merged team.

Dan was first a manager for the teams overseeing BigQuery’s storage and querying. He then became a director, with responsibilities that later included “Enterprise Data Warehouse migration”, where he cared about the needs of customers who wanted to migrate into BigQuery.

An interesting realization for Dan as he met these potential customers was that they not only wanted fast queries, they also needed strong governance: how do you ensure a healthy relationship between data producers and data consumers? As Dan says:

“I think in the early days, you could accurately describe what we built with Dremel as a data lake, or at least an early version of a data lake. And then when we started to roll out BigQuery, we started by trying to keep that same data lake paradigm. And we discovered after a few years that it wasn’t working for organizations, and we had to become much more like a data warehouse in BigQuery, moving away from the data lake paradigm. And I’ll explain what I mean by that. In a data lake, I think the governance and the agreements between data producers and data consumers are hard to maintain. It lowers the barrier for entry for me as a consumer to just be able to query any data at rest as long as I can get access to that data. But it also leads to me potentially accidentally taking dependencies on things that the data producer has no interest in guaranteeing. And so schemas can change out from underneath, data can disappear, there can be low quality or unreliable data, and I have no way of knowing that. And so I think there is now a move back toward the data warehouse paradigm with things like Snowflake and BigQuery trying to keep all the advantages of federated data and all of that, but making it clear which data I actually need to be reliable because it’s going into my regulatory filings or it’s being shown back to my end users or something like that.”
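
One common way to express that producer/consumer contract in SQL, shown here as a hedged sketch with invented schema, view, and role names (not SoFi’s or BigQuery’s actual setup), is to keep raw tables private and grant consumers access only to a curated view whose columns the producer commits to keeping stable:

```sql
-- The producer keeps raw data private and exposes only a stable, documented view.
CREATE SCHEMA IF NOT EXISTS loans_raw;          -- private: no consumer grants here
CREATE SCHEMA IF NOT EXISTS loans_published;

CREATE OR REPLACE VIEW loans_published.applications AS
SELECT application_id, created_at, status       -- only the columns the producer guarantees
FROM loans_raw.applications_landing;

-- Consumers read the published view, never the raw schema underneath.
GRANT USAGE  ON SCHEMA loans_published              TO ROLE analyst_role;
GRANT SELECT ON VIEW   loans_published.applications TO ROLE analyst_role;
```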

In 2021 Dan left Google and joined SoFi as its VP of Data, reuniting with Jim Caputo (who started the BigQuery team and is now the VP of Engineering for the Lending Division at SoFi).

VP of Data at a regulated financial company

In 2021 Dan switched from being a creator of tools to a consumer of tools. This gave him new perspectives on what data customers want and need, and led him to adopt concepts like Data Mesh.

SoFi is a bank, under heavy audit and regulatory requirements. Data not only needs to be distributed, but also provided with guarantees. Dan and his team at SoFi evaluated different tools to build their data infrastructure, and ended up choosing Snowflake.

This is how they organize their data into four zones:

The raw zone

“The organizational structure that we’ve landed on is to have our engineering teams responsible for the production and ingestion of their data all the way into what we call the cleansed area in our data warehouse. We conceptualize our data warehouse in four zones. The first is raw. That’s where the ingested data lands. And there are really no guarantees. You can think of this like the data lake component. It’s private by default. The data may be schemaless, meaning they’re just dumping JSON blobs into variant types in Snowflake, and then cleaning it up as the first phase in their ELT process. But our goal is for all analytic data, or all potentially interesting analytic data, to land in the raw zone in Snowflake. […] For us, I want to shut down the multiple tools — multiple sources of truth — problem. And so we are putting it all inside Snowflake, inside private schemas, so that people cannot accidentally be getting to each other’s raw data.”
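
As a rough illustration of that raw landing pattern in Snowflake (schema, table, and stage names here are hypothetical, not SoFi’s):

```sql
-- Raw zone: a private schema where schemaless JSON lands in a VARIANT column.
CREATE SCHEMA IF NOT EXISTS raw_lending;         -- private by default, no broad grants

CREATE TABLE IF NOT EXISTS raw_lending.events_landing (
  payload VARIANT                                -- the raw JSON blob, no schema imposed yet
);

-- Each JSON document from the (hypothetical) stage becomes one VARIANT row.
COPY INTO raw_lending.events_landing
FROM @raw_lending.events_stage
FILE_FORMAT = (TYPE = 'JSON');

-- The first step of the ELT process can then pull typed fields out of the VARIANT:
SELECT payload:loan_id::STRING AS loan_id,
       payload:amount::NUMBER  AS amount
FROM raw_lending.events_landing;
```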

The cleansed zone

“The second zone we call cleansed. Cleansed is where you impose a schema. You do whatever cleaning is necessary: deduplication, standardization of data types and data content, probably some introduction of synthetic join keys to support the next phase in the process. And the cleansed layer is where we expect there to be a contract between the data producers and the data consumers. Meaning if I put a field in there, I’m not going to just pull that field away without some automated testing being able to catch me. For example, we don’t give any group direct DDL or DML access to their cleansed schema. The only way you can update the schemas or the data inside your cleansed schema is via some automated process, which requires GitLab. It requires our CI/CD pipelines to run. And so your consumers always have the ability to detect if you’re changing something.”
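
A hedged sketch of what that cleansed contract could look like in Snowflake SQL; the names are invented, and at SoFi changes like this would only run through the GitLab CI/CD pipeline rather than being typed by hand:

```sql
-- Cleansed zone: impose a schema, standardize types, deduplicate, add a synthetic key.
CREATE SCHEMA IF NOT EXISTS cleansed_lending;

CREATE OR REPLACE TABLE cleansed_lending.loan_events AS
SELECT
  HASH(payload:loan_id, payload:event_ts)   AS loan_event_key,   -- synthetic join key
  payload:loan_id::STRING                   AS loan_id,
  payload:event_type::STRING                AS event_type,
  payload:amount::NUMBER(18, 2)             AS amount,
  payload:event_ts::TIMESTAMP_NTZ           AS event_ts
FROM raw_lending.events_landing
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY payload:loan_id, payload:event_type, payload:event_ts
  ORDER BY payload:event_ts DESC) = 1;      -- drop duplicate events

-- Consumers can read, but no group gets direct DDL/DML on the cleansed schema;
-- schema and data changes flow only through the automated pipeline's role.
GRANT SELECT ON ALL TABLES IN SCHEMA cleansed_lending TO ROLE analyst_role;
```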

The modeled zone

“The next zone for us, on top of cleansed, we call modeled. It’s where you build your data models. There will be some amount of joins and aggregation. There will be all sorts of flavors in there, from star or snowflake schemas, fact and dimension tables, to just flat, broad tables. And we’re still doing it in a distributed way. There we do have a central core data warehouse, which is the most important piece. But then around it we have what are called Team Data Marts, where individual groups for the different vertical business units we have can be building their own data models.”
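
Purely as an illustration (these tables are invented, not SoFi’s), a modeled-zone star schema might pair a fact table with a dimension table, whether in the core data warehouse or in a team data mart:

```sql
-- Modeled zone: a small star schema built on top of the cleansed layer.
CREATE SCHEMA IF NOT EXISTS modeled_lending;

CREATE OR REPLACE TABLE modeled_lending.dim_product AS
SELECT DISTINCT product_id, product_name, business_unit
FROM cleansed_lending.products;               -- hypothetical cleansed table

CREATE OR REPLACE TABLE modeled_lending.fct_loan_events AS
SELECT
  e.loan_event_key,
  e.loan_id,
  l.product_id,                               -- foreign key into dim_product
  e.event_type,
  e.amount,
  e.event_ts
FROM cleansed_lending.loan_events AS e
JOIN cleansed_lending.loans       AS l        -- hypothetical cleansed table
  ON l.loan_id = e.loan_id;
```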

The summarized zone

“And then the last zone we have, on top of all of that, we call summarized. That’s where you build your reporting layer. So those should be the tables that are built for optimizing some reports. In general, we think the pipeline will flow like this: data science can prototype reports directly off the modeled schema. But then if they find the SQL query for the report they want, they’ll probably turn that into the equivalent of a materialized view or base tables with aggregates or something that allows them to do very simple reporting. And we’d like to pull most of the business logic out of our business intelligence tool, which is Tableau, and keep it all in the data warehouse.”
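
And a sketch of the summarized zone (again with invented names): the report’s SQL becomes an aggregate table, “the equivalent of a materialized view”, that Tableau can point at directly, keeping the business logic in the warehouse:

```sql
-- Summarized zone: a report-shaped aggregate that the BI tool reads directly.
CREATE SCHEMA IF NOT EXISTS summarized_lending;

CREATE OR REPLACE TABLE summarized_lending.daily_loan_volume AS
SELECT
  p.business_unit,
  DATE_TRUNC('day', f.event_ts) AS event_day,
  COUNT(*)                      AS loan_events,
  SUM(f.amount)                 AS total_amount
FROM modeled_lending.fct_loan_events AS f
JOIN modeled_lending.dim_product     AS p
  ON p.product_id = f.product_id
GROUP BY p.business_unit, DATE_TRUNC('day', f.event_ts);

-- Tableau then reads summarized_lending.daily_loan_volume instead of embedding
-- this aggregation logic inside the BI tool.
```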

Further thoughts

The podcast conversation then delves into other interesting topics, like:

  • Responsible handling of PII, while enabling discovery.
  • How Dan, while on the BigQuery team, was trying to build “a unified platform that everyone could just leverage”, and how he now doesn’t think that’s possible: “Everybody needs bespoke solutions, everybody’s making different tradeoffs.”
  • Enabling the Data Economy, with Data Marketplaces.

Why Snowflake

I’m in awe of what Dan and the BigQuery team have built at Google, as I’ve said in many of my past posts. Meanwhile, in my new life, I love how the Snowflake team has built a product with strong governance features and a tight fit for enterprise needs.

I asked Dan to review this post and to share his thoughts on why SoFi chose Snowflake. His words:

“Thanks for taking the time to listen to the episode and translate my thoughts for your audience, Felipe. You do have a way with words, and most of what you’ve captured here is very accurate.

The one place I add some additional detail is in your last section, “Why Snowflake?” While I do agree with you that the governance features on Snowflake are strong and the Enterprise fit is tight, those weren’t actually the deciding criteria for us. What it really came down to for us was institutional momentum and scale. Snowflake has an excellent sales organization who was already deeply engaged with SoFi before I arrived. There was already a lot of buy in at all levels of the organization for using Snowflake. Changing that would have taken significant effort which, given my next point, I decided would not be an effort well spent.

The reason it wasn’t worth the effort to try to push for BigQuery rather than Snowflake was because at our scale the technical differences between the two approaches were not going to matter. Either would serve our purposes well for the next few years, and we could clearly see the investment Snowflake was continuing to put into building the future. In the end picking one and embarking on our journey to rebuild our data infrastructure and migrate from the legacy was more important than stressing over irrelevant differences.

Given how far we’ve come in a year and a half, I think we made the right decision for our situation.”

Thanks Dan — for sharing all your accomplishments and insights with us!

Moving forward: Data Mesh at SoFi

SoFi keeps moving forward on this road, and Luke Lin (Director of Product) just published a post on SoFi’s “Data Mesh Journey”. Check it out to go deeper into their “hybrid data mesh” (with a small illustrative sketch after the list):

  • Data infrastructure and platform are centralized
  • Data engineering is centralized
  • Data ingestion to Snowflake is distributed to product/engineering teams
  • Data science and business intelligence are distributed across Finance, Risk, and product/engineering teams
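
One way to picture that split in Snowflake terms, as a purely illustrative sketch with invented role, warehouse, and schema names (not SoFi’s actual objects): the central platform team provisions the shared infrastructure, while each product/engineering team gets its own schemas and role for ingestion.

```sql
-- Centralized platform: the data platform team provisions shared infrastructure.
CREATE WAREHOUSE IF NOT EXISTS ingest_wh WITH WAREHOUSE_SIZE = 'XSMALL';
CREATE DATABASE  IF NOT EXISTS analytics;
CREATE ROLE      IF NOT EXISTS cicd_pipeline_role;   -- the only writer of cleansed data

-- Distributed ingestion: each product/engineering team owns its raw schema.
CREATE ROLE   IF NOT EXISTS lending_ingest_role;
CREATE SCHEMA IF NOT EXISTS analytics.raw_lending;
CREATE SCHEMA IF NOT EXISTS analytics.cleansed_lending;

GRANT USAGE ON WAREHOUSE ingest_wh               TO ROLE lending_ingest_role;
GRANT USAGE ON DATABASE  analytics               TO ROLE lending_ingest_role;
GRANT ALL   ON SCHEMA analytics.raw_lending      TO ROLE lending_ingest_role;
-- The cleansed schema stays writable only via the automated CI/CD role, per the contract above.
GRANT ALL   ON SCHEMA analytics.cleansed_lending TO ROLE cicd_pipeline_role;
```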

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
