For some of you, that headline might not mean much. That's ok, and this one might not be for you. I'm mixing in some technical posts every once in a while.
This is the first of the series.
If you're reading this, I'm going to assume you know what DBT is. You don't need to know what SQLMesh is, or have an opinion on Semantic Data Fabric, but the more familiar you are with the world of data warehouses and data modeling, the better.
This is going to be a side-by-side comparison of 3 tools:
DBT (https://www.getdbt.com/)
Semantic Data Fabric (https://www.sdf.com/)
SQLMesh (https://sqlmesh.com/)
Some guidelines:
I have no direct affiliation with any of these tools other than being a user
We’re using the same raw data set for each tool
The schema of the Play-by-Play data can be found below:
All the code (and some other stuff) can be found in this repo:
While we could easily add Snowflake or BigQuery into the mix for any of these tools, we’re going to use DuckDB so we don’t have to deal with cloud resources, associated costs, and additional configuration
We’re keeping things pretty simple, and we’re doing that on purpose
Simple project setup, simple table definitions, simple outputs
While most of these tools offer non-SQL modeling options (i.e., Python), we're sticking to SQL for this walkthrough
Simple also means we're not testing or comparing every feature of these tools. There is just too much to discuss across all 3 pieces of software
The tools outlined here are going to be evaluated based on a few categories:
Compute and Utilities
Speed of Development
Quality of Life
There is no score. There is no winner. This is only a comparison that you can hopefully use to decide which tool fits best with your workflow.
The Setup
I exported all the play-by-play data to some Parquet files stored in S3. This data was lightly cleaned in a pre-processing step that took the raw Play-by-Play data from the Python package linked above, stripped out malformed characters, and dropped some columns. You can find the simple pipeline on GitHub.
We're using DuckDB on a local machine to access this data, and each of the three tools mentioned above is configured to use DuckDB to access it.
You won’t find the DuckDB “instance” in the repo (it’s gitignored), but the pipeline will look for a local DB and create one if it doesn’t exist.
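For those unfamiliar with DuckDB, pulling Parquet from S3 into a local table is about as simple as it gets. Here's a minimal sketch - the bucket path is a placeholder rather than the repo's actual location, and you'd still need S3 credentials configured:

-- Load the httpfs extension so DuckDB can read directly from S3
install httpfs;
load httpfs;

-- Create the raw play-by-play table if it doesn't already exist
create table if not exists pbp_all as
select *
from read_parquet('s3://<your-bucket>/pbp/*.parquet');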
The (large) schema of our raw data can be found in the repo; just know that each row of data represents a single play, which lets us rewind or fast-forward through the course of a single game to a certain game state.
In our case, we're using this play-by-play data to do some simple aggregations.
We’ll be replicating the same tables in each tool, using native patterns for that tool.
If you’re a Kimball die-hard, married to Data Vault, or an Inmon purist, you might be a bit annoyed at what you see. These data models aren’t 100% by-the-book; they’re only meant to illustrate how different tools use different patterns.
And, as a result, produce different developer experiences.
DBT
There are plenty of tutorials about setting up a DBT project, how to do X with DBT, and much more online. I don’t want to focus on that.
Instead, we’re going to talk about the experience of using DBT.
Compute and Utilities
The decision to use DuckDB was driven by DBT, even though the other tools support a "cost-less" local developer experience; I simply do not want to pay for every execution of this project. The other tools in this write-up let you entirely bypass the compute costs of designing and tweaking a data warehouse. Unfortunately, DBT is not such a tool.
There is effectively no way to validate the outputs of your DBT models without running them. This means that a single model change - even one as simple as casting a value to another type - results in an incremental cost.
You also cannot evaluate your tests without incurring some compute overhead.
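To make that concrete, even a hypothetical singular test like the one below is just a SELECT that DBT compiles and executes against your warehouse - any returned rows count as failures, and every `dbt test` run is another billed query:

-- tests/assert_game_scores_non_negative.sql (illustrative only, not in the repo)
-- DBT treats any rows returned by this query as test failures
select *
from {{ ref('base_games') }}
where total_home_score < 0
   or total_away_score < 0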
The local development experience is solid enough - there are a lot of useful extensions in the ecosystem, baked-in utilities to construct a graph of model dependencies, and a well-documented project configuration. But, it comes at a price.
Take the relatively simple `base_games.sql` model below. We're basically just joining the play-by-play table to itself and doing some minor aggregations to create a single record of the "final" outcome of each game.
with plays_home as (
select
game_id,
home_team as team,
count(distinct play_id) as total_plays
from {{ ref('pbp_all') }}
where posteam = home_team
{{ group_by(2) }}
),
plays_away as (
select
game_id,
away_team as team,
count(distinct play_id) as total_plays
from {{ ref('pbp_all') }}
where posteam = away_team
{{ group_by(2) }}
),
plays as (
select
plays_home.game_id,
plays_home.team as home_team,
plays_home.total_plays as total_plays_home,
plays_away.team as away_team,
plays_away.total_plays as total_plays_away
from plays_away
inner join plays_home
on plays_away.game_id = plays_home.game_id
)
select
distinct plays.game_id,
week,
season_type,
plays.home_team,
plays.away_team,
md5(plays.home_team) as home_team_masked,
md5(plays.away_team) as away_team_masked,
total_away_score,
total_home_score,
total_plays_home,
total_plays_away,
case
when total_away_score > total_home_score then plays.away_team
else plays.home_team
end as victor
from {{ ref('pbp_all') }}
inner join plays
on pbp_all.game_id = plays.game_id
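As an aside, the `{{ group_by(2) }}` calls above come from a macro. I won't reproduce the project's exact implementation here, but as a rough sketch, dbt_utils ships a similar helper that simply expands to `group by 1, 2, ..., n`:

-- A hedged sketch of a group_by(n) macro, modeled on the dbt_utils helper
{% macro group_by(n) %}
  group by
  {% for i in range(1, n + 1) %}
    {{ i }}{% if not loop.last %},{% endif %}
  {% endfor %}
{% endmacro %}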
To validate the output of this model, we would run it on our local machine, which would connect to a database, potentially start up a server/warehouse, and execute that logic in our sandbox schema (as is DBT convention).
If we deploy this model to staging, run some tests on it, and then promote it to production, we've now incurred compute charges on at least four separate occasions.
For our small `base_games` table, that is relatively insignificant, but when you multiply it across hundreds or thousands of production models, the costs compound quickly.
Speed of Development
This is easily where DBT shines - once you know the patterns and the configuration framework, you can introduce new models as quickly as you can throw together some SQL.
But, this is a double-edged sword.
Anyone who works in the space knows this. It’s easy to introduce new models, but that often leads to messy projects, duplicate logic, and database objects that persist long after their utility (or accuracy) has expired.
I’ve designed, redesigned or contributed to well over a dozen DBT projects. You can make things happen quickly, but that flexibility is also what contributes to many of the anti-patterns that have become prevalent in the data ecosystem.
Quality of Life
There is a big ecosystem surrounding DBT - packages, SaaS vendors, blogs, whatever else.
When it comes to quickly covering ground in a warehouse project, you’ll probably be able to move very fast with this tool. That is of course assuming you have the appropriate supporting tools and permissions.
In fact, many of the quality-of-life components of DBT are enabled by surrounding tools. Especially the third-party packages.
So, while we all have a few ideas about what could be improved with DBT (name-spacing, for one), the quality of life when using it is quite high. The challenges come when building at scale - duplicate logic, inconsistent definitions, poor input data.
Some of those pieces are out of the reach of DBT entirely; others are not.
In my experience, anyone who says that DBT doesn’t enable them to move quickly is doing something wrong or blocked by organizational constraints.
A quick comment on lineage - DBT kind of shows lineage, but only at a high level; that is, table to table. The tool's rise to popularity almost certainly helped spawn a suite of supplemental tools, or at least helped them rise to prominence. This feature - or rather, the lack of it - is a drawback of DBT at scale.
SDF
SDF stands for Semantic Data Fabric, a newcomer product with roots at Meta. I need to caveat everything I mention below because SDF is still in beta; as of this post, some features are not yet available.
Of the tools in this post, this one feels most different. And there are some clear reasons why.
First off, it's written in Rust, which means you install a binary. There is nothing to `pip install`, no package extensions to include, and no virtual environment to maintain.
This lends itself nicely to development speed, since you don't have to battle with package dependencies, but it also means the tool might feel foreign to many less experienced engineers and analysts.
Python is undoubtedly the language of choice in business analytics (sorry, R), and you’re bound to find candidates who are much more familiar with Python than Rust. That said, you’re not writing any Rust code when you’re using SDF.
You just… write SQL.
No macros, no ref statements, no jinja templating (Again: They’re in Beta).
But, just like DBT, there is a fair amount of YAML-based configuration. I did not notice a significant difference in the amount of YAML between setting up DBT and SDF, but like most things in software, it depends.
Most of the YAML that SDF relies on is not written manually - it's generated when you run SDF. In fact, for our small NFL project, the system-generated YAML exceeds 6,600 lines.
YAML - the workhorse of modern data tools.
Compute and Utilities
Here’s the thing - what SDF might lack in convenience utilities it makes up for in development experience.
You can “run” your production warehouse from your local machine, complete with propagating changes throughout downstream models without incurring a cent of compute cost.
This is different from both the DBT approach and the SQLMesh approach (see below), which require you to run a SQL command against your database or execute a query against it without generating a materialization, respectively.
SDF essentially keeps a cache of your current project definition in a hidden folder on your machine. Change a column or datatype? You can apply those changes to your local project state without incurring compute cost.
You might think of it as something like `terraform plan`, if you're familiar with IaC patterns.
That DBT model I shared above? Here's what `base_games.sql` looks like in SDF:
with plays_home as (
select
game_id,
home_team,
count(distinct play_id) as total_plays
from pbp_all
where posteam = home_team
group by 1, 2
),
plays_away as (
select
game_id,
away_team,
count(distinct play_id) as total_plays
from pbp_all
where posteam = away_team
group by 1, 2
),
plays as (
select
plays_home.game_id,
plays_home.home_team,
plays_home.total_plays as total_plays_home,
plays_away.away_team,
plays_away.total_plays as total_plays_away
from plays_away
inner join plays_home
on plays_away.game_id = plays_home.game_id
)
select
distinct plays.game_id,
week,
season_type,
plays.home_team,
plays.away_team,
md5(plays.home_team) as home_team_masked,
md5(plays.away_team) as away_team_masked,
total_away_score,
total_home_score,
total_plays_home,
total_plays_away,
case
when total_away_score > total_home_score then plays.away_team
else plays.home_team
end as victor
from pbp_all
inner join plays
on pbp_all.game_id = plays.game_id
;
Looks pretty familiar, right?
It’s just SQL without the `{{ ref() }}` and `{{ group_by() }}` macros.
Some of you might not like this - you lose the convenience utilities we all know with DBT. Others might look at this and declare that SQL-based data models are the way things are supposed to be. I’ll let you decide.
I did find myself pleasantly surprised by how clean working with SDF felt.
But, I also found myself mulling over the fact that if I were to do something complex in SDF, I would be writing much more code.
If I were to translate some DBT projects I’ve seen into SDF models, I know it would be a challenge.
However, that speaks more to the state of data modeling at many businesses today than it does to SDF.
That thought alone might be just one reason why a tool like SDF is a strong choice for companies with strong domain-driven data models. After all, there is clearly a reason behind the naming choice of “semantic data fabric” (see: what is data fabric).
The SDF team did build in some nice utilities and tooling to help wrangle the inevitable sprawl of data models, though. `Classifiers` provide an SDF-native mechanism to apply categorical tagging to your columns and their child columns throughout your project. By pairing `Classifiers` and `Function` definitions, you can reclassify data points from sensitive to non-sensitive as needed, and have that change propagate across all relevant models.
For example, we’re naively hashing the team names in order to simulate what would normally be done through some data masking mechanism. In our case, we’re just applying an md5 hash to a string.
So, something like `Dallas Cowboys` becomes `5a40b7cccc2a230895881416edda4622`.
--- workspace.sdf.yml
classifier:
  name: MASKER
  labels:
    - name: hash_me
    - name: it_was_hashed

--- src/functions.sdf.yml
function:
  name: md5
  reclassify:
    - from: MASKER.hash_me
      to: MASKER.it_was_hashed
--- base_games.sql
select
distinct plays.game_id,
week,
season_type,
plays.home_team,
plays.away_team,
md5(plays.home_team) as home_team_masked,
md5(plays.away_team) as away_team_masked,
<... more query stuff happening here ...>
And when running SDF, we see that we’ve now reclassified our data from something needing to be hashed (aka hash_me) to a clean value (now flagged as it_was_hashed).
This reclassification follows our home_team and away_team values to every model in which this data is used.
Schema nfl_sdf.pub.base_game
+------------------+-----------+-------------+----------------------+
| column_name | data_type | is_nullable | classifier |
+------------------+-----------+-------------+----------------------+
| game_id | varchar | true | |
| week | int | true | |
| season_type | varchar | true | |
| home_team | varchar | true | MASKER.hash_me |
| away_team | varchar | true | MASKER.hash_me |
| home_team_masked | varchar | true | MASKER.it_was_hashed |
| away_team_masked | varchar | true | MASKER.it_was_hashed |
| total_away_score | double | true | |
| total_home_score | double | true | |
| total_plays_home | bigint | true | |
| total_plays_away | bigint | true | |
| victor | varchar | true | MASKER.hash_me |
+------------------+-----------+-------------+----------------------+
Speed of Development
I’ve never set up an SDF project before. It’s only been publicly available for a few weeks.
And while setting this thing up had a few hiccups, it was nothing beyond what you’d expect for a tool in Beta. First attempts at standing up a new tool can take some getting used to. For what it’s worth, SQLMesh also took me for a little ride.
I will give the team credit, though; they responded to my Slack messages and clarifying questions QUICKLY, even to the point of sending me resources multiple times over.
It’s clear that this approach is different from DBT.
It’s perhaps less about the convenience methods and more about building fast, stable software. In SDF’s case, that means fast run times, building a native graph of dependencies without the need for Python and jinja wrappers, and evaluating changes against your local project cache.
Writing and running models with SDF gave me the sense I could feel the underlying Rust framework - a markedly different experience than working with a Python-based tool like DBT.
If you’re not familiar with using statically typed languages with strong type-checking, it might be the right time for you to try something new.
Quality of Life
There are some things about SDF that stand out as nice quality-of-life additions. The feedback cycle of making changes locally and "running" them against your project is great, and small details - like syntax errors that point directly at your bad code - are super useful (thanks, Rust compiler). This beats the nondescript errors you often find in many databases, hands down.
error: Ambiguous column 'home_team' found, available are 'pbp_all.play_id', 'pbp_all.game_id', 'pbp_all.old_game_id', 'pbp_all.home_team', 'pbp_all.away_team', ...
--> src/base/base_game.sql:42:4
While it might not yet have the robust ecosystem that DBT has developed, I get the impression that right now, your ability to use SDF effectively depends mostly on your ability to effectively model data using SQL.
In the future, I’d expect that Polars, PySpark, and other tools can be plugged into the mix. Similarly, some of that YAML configuration will likely be supplemented by Python or Rust.
Finally, when it comes to lineage, SDF excels. Simple as that. Between the query parsing and Classifiers, SDF makes it really easy to find where a column originates, and how it may be modified throughout your project.
I obviously only scratched the surface of SDF, but so far, I like what I’ve seen.
SQLMesh
I was intrigued when SQLMesh was released; to me, it was a meaningful step away from the DBT paradigm.
After reading through the docs and setting up the project, I initially found myself thinking of SQLMesh as a new tool built on top of DBT patterns.
But that’s not entirely correct.
Opinionated tools are generally a good thing (even if you don't agree with their opinions). In the case of SQLMesh, there is no shortage of opinions:
Pipeline definitions should be in SQL, rather than configured via YAML
Python is an option, too
`SELECT` expressions are expected to follow a certain convention
unique column names (this is a given)
explicit types
explicit aliases are recommended
Inline models configurations provide metadata, as well as control (orchestration)
“Backfilling” is a core concept within the SQLMesh workflow
If you look at these requirements independent of other tools, they make a lot of sense. Even things like explicit types and explicit aliases are generally just good practice.
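As a rough illustration of those SELECT conventions - using columns from our play-by-play data rather than an actual model from the repo - every column gets an explicit cast and an explicit alias:

select
  game_id::text as game_id,
  week::int as week,
  total_home_score::double as total_home_score,
  total_away_score::double as total_away_score
from raw.pbp_all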
But when you put DBT and SQLMesh side-by-side, you begin to get the sense that these choices were very intentional.
YAML is everywhere in DBT; from docs to project configs to tests to packages. You can’t escape it. That’s not the case with SQLMesh, and it’s nice to see a modeling tool that is lighter on the markup.
This does mean properties that would typically be defined in YAML are now pushed down to individual models, though.
There are at least 14 different model-level configuration options for "full-refresh" models alone; incremental models add another 3. And, interestingly, those are independent of the `model_defaults` section in the project-level `config.yaml` or `config.py` file.
Of course, many of these properties are optional, but it does add overhead to the project configuration. For a larger-scale project, these are probably quite useful. For smaller deployments, it feels excessive.
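For a sense of what that looks like, here's a hedged sketch of a MODEL block with a handful of those inline properties filled in - the `cron`, `owner`, and `description` values are illustrative, not taken from the repo:

MODEL (
  name base_sqlmesh.base_game,
  kind FULL,
  cron '@daily',
  grain game_id,
  owner analytics,
  description 'One row per game with final scores and play counts'
);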
Where some tools require that you write SQL, SQLMesh almost asks you to swap some out for Python. I can’t speak to the friendliness of the API, but this certainly fits the toolset of today’s data teams.
Compute and Utilities
Much like SDF, SQLMesh keeps a local cache of your project. But it's used differently - specifically for Plans, which you can think of as a project diff.
Environments and Plans add additional complexity to SQLMesh projects.
Under the hood, there is a robust state machine tracking changes to your models and their dependencies. The "state" of your project - aka a Plan - gets applied to an Environment. This gets increasingly complex as the scope of your changes is determined.
I am all for tracking changes and enabling teams to understand how their tables have changed over time. But SQLMesh seems to take this to the extreme:
Every time a model is changed as part of a plan, a new variant of this model gets created behind the scenes […]. In turn, each model variant's data is stored in a separate physical table.
Each model variant gets its own physical table while environments only contain references to these tables.
This sounds like an interesting feature that enables teams to rewind model changes or fast-forward to a designated model state. But, I’m curious how teams would use this in a real-life production setting outside CI/CD.
Given that databases like Snowflake enable Time-Travel - also something commonly used in CI/CD - I’m curious when one would be used over the other. If you’re not using Snowflake, then maybe a feature like this has more utility. Regardless, this part of SQLMesh adds significant complexity.
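For comparison, Snowflake's Time Travel is a query-time clause rather than a modeling concept - something like the following, assuming a `base_games` table lives in Snowflake:

-- Query the table as it looked one hour ago (Snowflake Time Travel)
select *
from base_games at(offset => -3600);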
One other thing about SQLMesh that stands out is the Virtual Update feature. I initially equated it to something like React's Virtual DOM - a representation of the DOM (Document Object Model) used to determine what needs to change on a webpage as a user interacts with it - but the docs indicate that "the process of promoting a change to production is reduced to reference swapping". So, it's not like the Virtual DOM at all.
In fact, it appears more like an internal mechanism for blue-green deployments.
The more I read through the docs for SQLMesh, the more it appeared that many features of the tool were inspired by existing paradigms in other tools - DBT, Airflow, various pipeline vendors.
Testing within SQLMesh is done through two mechanisms - Tests and Audits. SQLMesh Tests effectively act as unit tests, checking that some logical output is produced as expected. Audits, on the other hand, let you attach assertions to your models; these are more akin to data quality checks. Much like DBT's tests, they are SQL-based assertions designed to return no rows.
The distinction between the two makes sense, and follows the established DBT pattern of separating data tests from custom logic tests.
SQLMesh’s implementation feels more robust, possibly because of the wide range of Audits available by default. These Audits seem to blend the default testing patterns of DBT with those available in packages like DBT-Expectations, Calogica’s testing toolkit inspired by Great Expectations.
It is unclear if there are other mechanisms to test Python-based models (beyond Python-native testing utilities, of course), given that Tests and Audits are SQL-based.
You can see that an Audit is included in the SQLMesh `base_game.sql` model:
MODEL (
name base_sqlmesh.base_game,
kind FULL,
grain game_id,
audits [assert_game_scores_positive],
);
with plays_home as (
select
game_id,
home_team,
count(distinct play_id) as total_plays
from raw.pbp_all
where posteam = home_team
group by 1, 2
),
plays_away as (
select
game_id,
away_team,
count(distinct play_id) as total_plays
from raw.pbp_all
where posteam = away_team
group by 1, 2
),
plays as (
select
plays_home.game_id,
plays_home.home_team,
plays_home.total_plays as total_plays_home,
plays_away.away_team,
plays_away.total_plays as total_plays_away
from plays_away
inner join plays_home
on plays_away.game_id = plays_home.game_id
)
select
distinct plays.game_id,
week,
season_type,
plays.home_team,
plays.away_team,
md5(plays.home_team) as home_team_masked,
md5(plays.away_team) as away_team_masked,
total_away_score,
total_home_score,
total_plays_home,
total_plays_away,
case
when total_away_score > total_home_score then plays.away_team
else plays.home_team
end as victor
from raw.pbp_all
inner join plays
on pbp_all.game_id = plays.game_id
;
And of course, our Audit, `assert_game_scores_positive.sql`:
AUDIT (
name assert_game_scores_positive,
);
SELECT *
FROM @this_model
WHERE total_away_score < 0 OR total_home_score < 0
;
Speed of Development
I found setting up my first working SQLMesh project to be as challenging as the SDF project. It wasn't necessarily complicated, just unfamiliar.
SQLMesh's Environments and Plans are a good example of this. Conceptually, they are not difficult to grok, but the way they interact with table changes, Virtual Updates, and "Backfilling" takes some getting used to.
SQLMesh lets you spin up models as quickly as any other tool, but making use of its advanced features requires following the tool's specific paradigms.
For many, these will be new concepts.
Quality of Life
One small but impactful thing SQLMesh provides is the ability to configure your project in different ways - declaratively with YAML or dynamically with Python. The `config.yml` file you need to set up your connections and project specs can instead be a `config.py` file, enabling more complex pathing logic, loading environment variables, and more.
This project uses a `config.py` in order to access a DuckDB instance located at the project root. Using YAML would mean a hard-coded path, effectively making the project useless across machines.
SQLMesh feels like a lot of software you find in the Python-based data tool category - there is the quick-and-dirty way to get things working, and then there is the idiomatic approach.
The idiomatic approach always unlocks a new toolkit, but it can be challenging to embrace.
I encourage everyone to look at SQLMesh Macros as an example of something that may be very powerful when done right, but requires a foreign syntax to get there.
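For a small taste of that syntax, here's a hedged sketch of an incremental model using SQLMesh's built-in macro variables - the `game_date` column is assumed here for illustration and isn't part of the models above:

MODEL (
  name base_sqlmesh.plays_by_day,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column game_date
  )
);

select
  game_id,
  play_id,
  game_date
from raw.pbp_all
-- @start_ds and @end_ds are predefined macro variables that SQLMesh fills in per backfill interval
where game_date between @start_ds and @end_ds
;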
Final Considerations
DBT, SDF and SQLMesh solve the same problem in different ways.
You can't say that one is better than the others, mostly because each has pieces that support different workflows. And each tool has so many features that it's difficult to say you should only use X in some situations and only Y in others.
I stand by my earlier comment that using the idiomatic approach (read: tool-specific) will unlock new methods to solve your problem, but it’s not always straightforward to get to this point.
Especially when you or your team is accustomed to solving that problem in a specific way.
Perhaps a wildly under-appreciated fact is that these 3 tools can only take you so far. Speed of development and ease of use do not make up for data-producing systems that lack governance and well-structured access patterns.
Data modeling as a practice existed long before DBT, cloud warehouses, and compute credits. Some companies have apparently forgotten this, or worse yet, never knew it in the first place.
At the end of the day, you’re not going to go wrong by using DBT, SDF or SQLMesh - in fact, choosing one is almost certainly better than not using any of them. Yes, there are tradeoffs in choosing one vs the others, but that’s the nature of software.
Some things should not factor into your decisions, mostly because they are table stakes:
Speed of getting started
You can get a fully functional project running in well under an hour no matter which tool you’re using
Support for specialized systems and platforms
If you’re using such a specialized toolset that none of these tools fit what you need, you’re probably looking into the wrong category of solutions altogether
Deployments
The options for deploying data tools like these are virtually ENDLESS. They're all going to support a paid deployment mechanism in one way or another, but prudent data teams know you don't need to rely on a vendor for this. In fact, the most cost-conscious teams will purposely look elsewhere.
Instead, your decision should come down to:
Incremental cost of delivery
How well the tool fits into your existing systems and development workflow
Whether you can support this tool within your workflow for a long time
Do they have 100% feature parity? No. But chances are they will in short order. This category - which saw new entrants to the market even as I was writing this post - is on a path toward feature commoditization.
Over the next 12-18 months, I expect most tools in the space to spend significant resources attempting to further differentiate. We'll see how it plays out.
Think I missed the mark, or want to share your thoughts? Give me a shout on LinkedIn or Twitter.