Hiding in Plain Sight: The High Cost of a Data Team
The math doesn't add up for many companies.
What is the cost of doing business?
There is no single answer to that question. For some, it’s paid ads; for others, the salaries behind content marketing or the dollars they shell out on SaaS tools. We tend to look at receipts as the “cost” of things - how much is a new Salesforce license, how long has this Ad Set been running on Facebook, is this new tool actually worth the $30,000 upfront annual contract price tag?
This view of “cost” is accurate from an accounting perspective, but it is too narrow a view when it comes to the productivity cost of a team.
Technical roles - engineers, data scientists, analysts, or anyone who is expected to write code - have an inherently higher cost than many other roles. If you’re working in sales, perhaps your total compensation is up there, but you’re offsetting your own price tag with the deals you bring in. This is fairly straightforward.
What is not obvious, though, is that for some of these technical roles to even do their job, you’re looking at tens of thousands of dollars in overhead. And that is on the conservative side.
Here are the average base salaries for the most common data team roles by market, according to Indeed:
| Role | New York, NY | San Francisco, CA | Austin, TX |
| -------------- | ------------ | ----------------- | ---------- |
| Data Analyst | $84,376 | $95,171 | $75,288 |
| Data Scientist | $135,280 | $140,487 | $119,553 |
| Data Engineer | $138,676 | $147,978 | $120,578 |
You don’t need to read into the fact that Analytics Engineer is not included; it’s too nascent a title for Indeed to report.
These salaries are not obscene, but don’t forget they are base only and skew lower than many actual salaries, since these are averages across all levels of experience. These roles are unlikely to get large cash bonuses, and far more likely to receive equity top-ups, so the total cash compensation is fairly accurate here.
But, let’s not forget - this is just to show up.
Tack on a minimum 10% for payroll and insurance, as well as whatever miscellaneous perks you choose, and we’re now looking at a sizable chunk of change.
A “standard” data team in today’s startup ecosystem is anywhere from 3-8% of headcount. For a Series A business of 30ish employees, we’re looking at roughly 1-2 people. By the time you’re at Series B or C, you’re north of 150 employees and probably have a team of around 5.
That’s a combined annual base payroll of roughly $260,000 to $650,000 if we use a conservative $130,000 per team member.
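For concreteness, here is a quick back-of-the-envelope sketch of that math, combining the headcounts above with the 10% overhead from earlier (the figures are rough assumptions, not quotes):

```python
# Back-of-the-envelope payroll math using the rough figures above.
base_salary = 130_000        # conservative average base per team member
overhead_rate = 0.10         # minimum add-on for payroll taxes, insurance, misc. perks

for headcount in (2, 5):     # Series A vs. Series B/C team sizes
    base_payroll = headcount * base_salary
    fully_loaded = base_payroll * (1 + overhead_rate)
    print(f"{headcount} people: ${base_payroll:,.0f} base, ~${fully_loaded:,.0f} fully loaded")
```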
For a company that is lucky to be hitting $15M+ in revenue at Series A and $30M+ in revenue by Series C, these numbers are meaningful.
The cost of software, tools and systems
There is no escaping the price tag associated with salaried employees; those salaries are supposed to be a fair exchange for work performed, anyway. But these employees - charged with “making the business data-driven” or “uncovering insights” - need tools to do their jobs.
(Side note: Don’t use those phrases when hiring. They are complete corporate fluff. But, we digress)
Many of those tools are expensive by default, and are designed to make it easy for your teams to inadvertently drive your monthly cost higher. Go figure.
The Warehouse and Pipelines
Let’s focus on Snowflake here, since it is arguably the fastest growing option in the cloud warehouse space. Sure, Amazon’s Redshift has an edge in market share (22.16% vs 19.73% as of 2022), but talk to anyone in the data space and they’re likely moving off of Redshift to some newer option.
I work with Snowflake a lot. I like the tool. But I’ve found plenty of companies who made the move to Snowflake and started off just fine, only to wake up one quarter and realize their annual spend has topped $100k. In some cases I’ve seen Series C startups spending well north of $300k annually with minimal monetization efforts tied to that infrastructure. There are plenty of customers paying Snowflake $1M+ a year, as well. I know of multiple who went through layoffs in 2022.
But, Snowflake and other cloud warehouses are useless without data in them. And their utility decreases as the “freshness” of that data wanes. Stale data = less useful for the business. Coincidentally for Snowflake, stale data also means a lower ARR.
As a business you can do only a few things.
Option 1 - point some data engineers at the problem and let them build pipelines for you. This may take weeks, or it may take months depending on your maturity as a business and technology org. This is also an expensive undertaking, both in terms of absolute dollars and opportunity costs for the team.
Option 2 is the saner choice - pay for more tools that make it easier to get data from your applications (mobile, web, whatever) into Snowflake. This, too, has a price, but it’s well worth the increase in your team’s velocity.
So, you adopt a pipeline vendor. Fivetran, Airbyte, or something else. They have their own costs, often obscured by the fact that they charge based on some bespoke “credit” methodology. Maybe 1 credit is 100MB of data, maybe it is 400K monthly updated rows, maybe it is something else altogether.
You start small. You want to transfer some of your production application data from Postgres to Snowflake. You decide to stick with only a handful of tables at first. Things are stable... $2,500 per month seems reasonable, right?
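To make the “credit” abstraction concrete, here is a minimal sketch of how a usage-based pipeline bill scales with volume. The credit definition and price below are invented for illustration; real vendors publish their own (usually tiered) pricing, so plug in your contract’s numbers:

```python
# Hypothetical credit-based pipeline pricing; the definition and price here
# are placeholders, not any vendor's actual rate card.
ROWS_PER_CREDIT = 400_000     # e.g. "1 credit = 400K monthly updated rows"
PRICE_PER_CREDIT = 2.00       # assumed dollars per credit

def monthly_bill(monthly_updated_rows: int) -> float:
    credits = monthly_updated_rows / ROWS_PER_CREDIT
    return credits * PRICE_PER_CREDIT

for rows in (5_000_000, 50_000_000, 500_000_000):
    print(f"{rows:>11,} updated rows/month -> ${monthly_bill(rows):>8,.2f}")
```

With these made-up rates, half a billion updated rows a month lands right at that $2,500 figure, and row counts only ever grow.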
By this point we’ve passed the $300k mark and are inching closer to a $500k total cost for a single team to do its job. And that is on the conservative side with a small team.
I don’t care how much money you’ve raised. That is a lot of upfront investment for something that your exec team has been told will improve the business.
Will it, though? That depends.
The hidden drivers of cost
Before we get into the softer side of business/data alignment and how it rears its ugly head in terms of cost, let’s talk about how you go from a “reasonable” operating cost for data activities to something that scares your CFO.
It has a name: usage-based billing.
You might hear people call it consumption-based billing, too. This business model has become the de facto standard for many, if not most, notable data vendors today. And it happens to be the biggest driver of unexpected costs for data teams and their finance counterparts.
The economics are pretty simple - you pay for what you use. The “gotcha”, however, is that the vendors commonly used by data teams, especially within the data warehouse and data pipeline categories, are designed to drive higher and higher usage.
It’s a genius business model for the vendor, but a potential black hole for your cash as a customer. At first, the time-to-value is great - your team configures Fivetran and next thing you know, data shows up in your Snowflake instance.
You can talk your way through the initial usage bills, since a lot of your initial data is loaded for free. But over time, scope creep sets in, and your once-$500/month usage has now passed $1,000/month. With some inevitable misalignment with the business, a cool $2,000+/month is not far behind.
You started with a small scope of only “the essential” tables from production. Then the BizOps team asks for a slight twist on an existing analysis; this requires new tables. You set up Fivetran to transfer more tables. This increases your usage.
You eventually realize that the Engineering team didn’t include a nice updated_at field in some of the new tables you need, so you’re going to have to do full-table loads until they deploy a change.
Congratulations, you got got.
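To put rough numbers on that last gotcha, here is a sketch of how much more data moves when there is no reliable updated_at to filter on. The table size and sync frequency are assumptions; the point is the ratio:

```python
# Illustrative comparison of incremental vs. full-table syncs.
# All numbers are assumptions; only the ratio matters.
TABLE_ROWS = 10_000_000       # rows in the table
CHANGED_PER_DAY = 50_000      # rows actually modified each day
SYNCS_PER_DAY = 24            # hourly syncs
DAYS = 30

# Incremental: only rows with updated_at > last_sync move on each run.
incremental = CHANGED_PER_DAY * DAYS

# Full-table: with no change marker, every sync re-reads the whole table.
full_table = TABLE_ROWS * SYNCS_PER_DAY * DAYS

print(f"incremental: {incremental:,} rows moved per month")
print(f"full-table : {full_table:,} rows moved per month")
print(f"ratio      : {full_table / incremental:,.0f}x")
```

However your vendor chooses to count usage, some multiple of that extra volume shows up on the bill, and so does the warehouse compute that processes the loads.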
As a product-led motion, this is the goal. A customer signs up, sees immediate value, and comes back for more. But what’s missing on the customer side of the equation is knowledge. Too many teams - many of whom are young and inexperienced, or following rote “best practices” shared by other influential data vendors - wind up inflating their SaaS vendor bills because it’s so freaking easy to do.
They literally do not know how to make this workflow happen in other ways.
You almost can’t blame the teams doing the implementation here - they’re doing what they’re supposed to be doing. What they’ve been told to do.
But this is the pattern that leads from a reasonable spend for business and data analytics infrastructure to something that balloons significantly over time.
And this is business-as-usual for many companies today. Really makes you wonder.
The Compute Challenge
Ok, by this point it’s pretty clear how usage-based billing can get you in a hole you never anticipated, but we have only scratched the surface. Compute is the other part of the puzzle.
In Snowflake, you pay for storage (more often than not a relatively small part of the total bill) and for compute. Compute is basically just the time some server is running and executing the commands you tell it.
The Modern Data Stack lives off of compute credits. Every piece of the stack - from loading data, to processing that data, to running tests, to introducing new code via pull-request, to ad-hoc queries, to dashboards, to understanding lineage - requires compute.
Some of those activities incur higher compute costs than others, but the point remains. They all need a server to run some commands. And doing that costs money.
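To see how quickly that adds up, here is a rough warehouse-compute sketch. The per-size credit rates are the commonly published ones for standard Snowflake warehouses; the dollar price per credit varies by edition, region, and contract, so the $3.00 below is an assumption:

```python
# Rough warehouse compute math; the credit price is an assumption, check your contract.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8}   # commonly published per-size rates
PRICE_PER_CREDIT = 3.00                                # assumed dollars per credit

def monthly_compute_cost(size: str, busy_hours_per_day: float, days: int = 30) -> float:
    credits = CREDITS_PER_HOUR[size] * busy_hours_per_day * days
    return credits * PRICE_PER_CREDIT

# A Medium warehouse kept busy 8 hours a day by loads, dbt runs, dashboards,
# and ad-hoc queries:
print(f"${monthly_compute_cost('M', 8):,.0f} per month")   # $2,880 with these assumptions
```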
You might get pushback from your Looker account rep that they cache query results and try to minimize the need for querying the database, but the fact remains you’re still paying > $100k in platform fees and generating a larger cloud warehouse bill every time someone at your company hits refresh.
Don’t get me wrong. It’s not wrong to use Looker, or any of the tools in the categories I mentioned. But it’s important that you know the “hidden” costs associated with the tools your teams adopt.
The kicker to our earlier Fivetran debacle? You get hit with the one-two punch of a Fivetran bill for loading data, and Snowflake dings you for the compute to process the loading.
This is one thing you can categorize as “the cost of doing business”. There is no way around it.
The Alignment Problem
The final nail in the coffin for your warehouse and pipeline costs has little to do with the absolute cost of the technology, and more to do with the business problems that depend on their outputs.
Analytics as a function is intended to improve business operations in some regard. It doesn’t matter if you’re working on business intelligence questions, product analytics questions, or building a predictive model of some sort. If you’re working on some problem set within a business, and you call yourself a data or analytics person, your job is ultimately to help the business perform better.
But, that’s really hard to do.
Plenty of companies wind up over-indexing on solving problems that really don’t matter, or changing focus so rapidly that whatever effort went into one problem is lost on the next. This is one part of how you accumulate tech debt, but that’s another topic.
Misalignment between teams does not just look like frustrated coworkers and incorrect deliverables. It also translates to increased costs, and in our case, those costs are borne by the data team.
Remember that BizOps request from earlier? We can use any team in their place. Without alignment in terms of analytical focus, metric definitions, or even problem areas, your data team will be eating up compute credits in an effort to catch up with the needs of the rest of the org.
You won’t hear many teams or companies talk about this, but misalignment is an implicit driver of cost. It requires work and rework. And it often happens over and over.
So, what’s the alternative?
We’ve successfully hit a theoretical run rate of somewhere between $500K and $1M for a single team to do its job. Pretty wild.
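For the curious, here is one hypothetical way that tally could break down. Every line item is an assumption pieced together from the figures discussed above; swap in your own numbers:

```python
# Hypothetical annual cost ranges, roughly echoing the figures discussed above.
annual_costs = {
    "data team payroll (2-5 people, fully loaded)": (286_000, 715_000),
    "cloud warehouse compute + storage":            (100_000, 250_000),
    "pipeline vendor":                              ( 24_000,  48_000),
    "BI / visualization platform":                  (100_000, 120_000),
}

low = sum(lo for lo, _ in annual_costs.values())
high = sum(hi for _, hi in annual_costs.values())
print(f"~${low:,} to ~${high:,} per year")   # ~$510,000 to ~$1,133,000
```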
Frankly, I don’t see how many companies can justify this kind of spend. Especially when many early data hires are young and inexperienced. Much of the cost incurred with today’s data stacks accumulates over time by way of design decisions and tooling selection.
An easy way to trim off a sizable chunk of your future warehouse and pipeline bills?
Bring in someone who’s done it before.
Not as a full-time hire; just to stand things up so your infrastructure cost does not scale faster than the value you’re generating from it.
We do it all the time at Purview Labs.