The Business of Open Source

The open-core business model in the world of Data is not as simple as it seems

Oct 02, 2023

If a data tool is open source but no one talks about it, does it even matter?

At first glance, open source software sounds like a utopian ideal - useful products packaged up for anyone to use. Of course, they require a base level of technical acumen, but if you’re already interested in using open source tooling, you probably have some chops.

Recently, there has been a strong push for open-core businesses - companies whose core offering comes in two flavors: a do-it-yourself version (open source) and a managed-for-you version (hosted).

Is your team too small to afford a data engineer or someone else to run your infrastructure? Pay us for this hot new managed service and we’ll handle it for you. It’s that easy.


There is a certain mystique about open source and open-core businesses. They appear to be exemplars of a one-for-all and all-for-one mindset where the people who build the product aren’t just shilling it for profit. They are here to help the average man or woman.

But that’s a naive and simplified view of something far more complex.

The boom in data tools over the last 5 - 10 years has made this exceedingly obvious. Never before has the GitHub Stars to VC-backed startup pipeline been stronger.

But, does the idea of open-core even make sense for businesses in today’s market? The answer depends on who you talk to.

Either way, let’s take a look at the concept of open-source and what it means in the world of data.

What’s Open Source?

Fundamentally, Open Source (OSS) is a means of releasing software that allows the end user to distribute, extend, or otherwise alter that software for their own use. It’s the backbone of software today (not just data-related tooling), and you’d be hard pressed to find an engineer who has not worked with a variety of open source tools. I’d venture to say finding such a person would be impossible.

Almost every popular programming language is open-source. In fact, nearly every language has an ecosystem dedicated to the distribution of OSS packages:

javascript —> npm
python —> pip
rust —> cargo
java —> maven/ gradle
… there are many other languages and package managers, but you get the idea

Are you reading this post on a Firefox browser? That’s open source. You can introduce changes to the Firefox project if you really want to.

Chrome is not open source. But, a lot of the stuff under-the-hood, namely the Chromium web browser that powers all instances of Chrome, is an open-source project.

Confused? Welcome to the world of modern software. There’s layers to this game.

There is often a difference between the OSS version of a piece of software, and the commercially available version of it. The vast majority of Chrome’s features come from the open source Chromium browser, but there are some which are only supported by Chrome - not Chromium.

You could install Chromium instead, but you’ll lose some functionality. For instance, you’ll need to manually install every update.

It does get more complicated. Google offers Chrome free of charge, but under a proprietary license; so even though you aren’t paying for the features Google adds on top of Chromium, Chrome itself is a closed-source project. For our purposes, we don’t need to focus on licensing, though it does have a meaningful impact at scale.

OSS typically follows this pattern - a free, community supported version of a product, and a commercialized cousin with some additional, often very useful, features. It’s like the current version of you and that other version of you that hits the gym, strictly eats whole foods, and doesn’t endlessly scroll social media before going to bed.

This works really well when multi-national mega-corps like Google are incubating the product and issuing free licenses for its use. But many of the OSS tools you’re likely to use in the data world fall under the category of open-core, where a managed version is available if you’re willing to pay the price.

They also operate at a very different scale of business.

The Economics of Open Source

To the average person in another industry, this model probably makes no sense. The incentives on both sides appear misaligned.

Build software that anyone can use.
Let them use it for free.
Pay your own people to maintain it.
Users can contribute to any open project available in the wild.

This model breaks the mold of being compensated for providing value, at least with regard to financial compensation. Often, resources are dedicated to projects that generate no revenue, or in many cases, average people are contributing to projects that are assets of a company with which they are in no way officially affiliated.

That second scenario is a really interesting positive side-effect of open-core businesses: your features are sometimes built by your own users. It is literally free labor and free product development. Quite amazing.

By design, OSS projects rely on contributions (of code) by the public. The alternative path is that the company spends its own resources to develop and maintain a software package that others can use for free.

Can you name an industry other than tech where this would work? Non-profits don’t count.

The closest analogue may be a loss-leader in e-commerce or retail, but even that is not an entirely apt comparison.

Don’t get me wrong - OSS is fundamental to modern software.

In the early days of OSS, this approach offered an alternative to lock-in with tools built by large tech companies. In the mid-1980s, Postgres - one of the most popular database options today - was born (in part) as a response to the marketing aggression and growing market dominance of Oracle and its relational database.

There is something incredible about choosing to build a free, widely-available alternative to a technology just because you don’t like the incumbent.

But OSS has grown significantly since then.

As engineers and developers have collectively gravitated towards platforms like GitHub, the cycle time between creating some software and others finding that software has shortened. There is no doubt platforms like GitHub have contributed to the rise in OSS - more engineers can contribute to these projects, and it’s easier than ever for OSS projects to be found.

And this has led to a partial change in what OSS represents.

MIT Technology Review and ThoughtSpot suggest that OSS might be as much an effort in corporate branding as it is a popular business model. The modern OSS community has built an ecosystem where many projects aspire to be spun off into venture-backed businesses, especially in data.

In many cases, there is nothing wrong with this. The initial contributors set off to solve a problem, grew distribution, identified a market, and created a business.

At face value, that’s great. It’s the exact model that you find with digital marketing gurus and Twitter/ LinkedIn influencers: build an audience first.

But the reality of the “OSS to VC-backed company” pipeline might not be as sexy as it seems.

Open Source and Open Core

The open-core SaaS model is pretty simple:

Build a thing that solves a problem
Get people to use it
Get some portion of those users to pay you to manage the thing for them

That’s fundamentally the OSS to product-business pipeline. Maybe you throw some professional services in there, or you start pricing by usage (everyone’s favorite pricing model). The formula is well-defined, with many software products fitting this mold.

Take a small slice of popular data tool vendors, and you’re likely to find more than a handful of open source/ open-core products - dbt, Dagster, Astronomer, Grafana, Snowplow, Starburst. There are plenty more.

Even big names like Snowflake use open source technologies as fundamental components of their offering. Databricks’ own website states their Data Lakehouse product “is underpinned by widely adopted open source projects Apache Spark™, Delta Lake and MLflow”.

This is a new kind of leverage.

It’s a distribution network, with free labor, that becomes a business.

It’s the repackaging of well-distributed, widely available software into new products.

Sounds great, right?

All of the companies mentioned above are venture-backed, and all of them - save Snowflake - are currently privately held.

With the exception of Grafana, which is probably in a somewhat favorable position due to Francisco Partners’ recent acquisition of New Relic and the never-ending chatter about Datadog’s outrageous pricing model, you’d be hard-pressed to find a contender for a future public offering.

Yet the pipeline remains strong. Curiously so, if you ask me.

Point-Solutions and Systems-of-Record

OSS exists on a spectrum. On one end you have tiny single-use packages - some might even call them toys. On the other, you have large-scale enterprise technologies which historically have grown out of big tech companies.

The distinction may seem obvious with the examples above; they are clearly extremes. JavaScript’s left-pad fiasco is internet lore at this point, while technologies like Hadoop and Spark are some of the most popular and impactful OSS data tools released in the past two decades. But it still illustrates the broader point - where an OSS project lands on the spectrum matters.

Hadoop and Spark were incubated at Yahoo! and AMPLab of UC Berkeley, respectively. They are fundamental components of technology for thousands of businesses and technical organizations.

These two tools have spawned multiple commercially viable businesses, and have been co-opted by platform providers like AWS.

Will the typical OSS data tool available today reach this scale? It’s unlikely in my opinion, and that’s mainly a function of where these new tools sit on the spectrum of available OSS.

On one end, you have single-use-case tooling and utilities for “end users” (aka analysts and other reasonably technical roles). They solve specific needs, often tied to existing workflows within the business. On the other, you have databases, infrastructure-as-code, and system observability. These are wide-reaching and flexible tools which (in some cases) act as the system-of-record.

Tech businesses can survive without point solutions; they cannot function without systems-of-record.

Over the last 5-10 years, and in the data space specifically, it seems that fewer lower-level technologies than one-off utilities and point-solution packages have followed the OSS-to-product-business formula. The exception is likely Voltron Data, from the team behind PyArrow (and Pandas, back in the day).

And even if that is not the case, there is no denying that point-solutions have been marketing to practitioners aggressively in an attempt to drum up FOMO. You can make a case that tools like dbt, Dagster, Prefect, and many others fall into this camp.

dbt, itself a point solution on our OSS spectrum, illustrates another unique phenomenon for this market. While individuals may have gripes about its implementation details - namespacing of models, excessive YAML configurations, lack of governance - dbt Labs has been able to plug its own product gaps reasonably well with many of the surrounding tools in the ecosystem.

Is this a product moat?

You could make that case, though a more realistic assessment is “sort of.” These complementary tools are, after all, separate businesses and separate concerns. And down the line, they’ll likely be fighting for mind-share and market-share.

It’s tough to say how defensible this positioning is over the long term. If your product encourages users to actively supplement its feature set with additional tooling, the market will figure out how to do this efficiently. It’s just one of the second order effects of the open-core model.

Second Order Effects

Some vendors - for instance, ClickHouse - have apparently reduced their OSS focus in favor of proprietary, closed-source feature development. Their CTO even weighed in publicly on the matter:

It's good to have a small, limited number of modifications exclusive to ClickHouse Cloud, but only those that do not compromise the features or operation in self-managed usages, but in the same way, are crucial and distinguishing for the Cloud.

This shouldn’t be surprising. There is a rich history of large tech platforms like AWS releasing managed offerings of these open source options. AWS, specifically, is comically guilty of taking OSS projects and releasing managed versions of them - Managed OpenSearch (Elasticsearch), Managed Airflow, Managed Flink, Managed Kafka (MSK), the list goes on. That is only a small subset.

You may have gathered that AWS itself has a less than favorable relationship with the broader OSS community - they are arguably the tech giant with the least notable contributions to OSS packages. Or, at least those packages they can’t readily commercialize.

All this is to say the business of open source software is tricky, and the path to commercialization is often tied to a brand.

And the branding aspect of OSS cannot be overstated, especially for products that exist closer to the point-solution end of our spectrum.

Apache Airflow - a very popular and equally hated orchestration tool - is now primarily maintained by employees of Astronomer, a company whose primary offer is a SaaS for managed Airflow deployments. They are not the exception here.

This relationship between OSS tool and well-branded corporate sponsor has spawned an entirely new category of role in tech - Developer Relations.

Your Friends, DevRel

In other industries, it’s called Community Building or Community Marketing. At one point, they were known as “Evangelists”. But today, it seems tech companies prefer to use different terminology in order to hide the fact that they are, after all, selling another product.

I am sure that many businesses have seen both improved traction and improved brand awareness from DevRel efforts, but don’t be fooled. These roles are about driving the bottom line, and that rings especially true for your VC-backed point solutions. Until the recent boom in open-core businesses, DevRel was primarily associated with only a few companies, many of which are quite large - Apple, Google, New Relic, Twilio.

Today, this function has become commonplace among small to mid-sized tech companies building developer tools. If you work in data, I guarantee you’ve come across some blog posts, YouTube videos, or Slack channels where the brand’s Developer Advocate or Head of Developer Experience is the person leading the charge.

Software Re-bundling

The OSS-induced spread of point solutions has led to software sprawl at many companies. Pair that with the end of the ZIRP era, and it’s likely that many businesses are going to be shortening their list of accounts payable; for many, it’s already happening.

Businesses can push back against software sprawl by cutting licenses and opting not to renew contracts. But how do vendors, especially those who may have strong ties to these open-core tools, play this game?

Re-bundling.

This is a positive externality, at least with regard to the broader market. The average consumer (a business adopting OSS tooling) benefits from another organization packaging up various tools into one offer; it’s an efficiency gain for the end customer, and the re-bundler is able to play labor arbitrage by packaging up OSS tools.

It is an all-around fantastic model.

We are already seeing this happen in the data space with companies like Datacoves, Paradime.io, and Y42. There are others in the market, as well, including some service-heavy players like 5X Data.

Other single-platform tools like Mozart Data and Keboola also fit this mold.

Almost all of these businesses use dbt Core, the open-source version of dbt Labs’ product. You have to admire when independent businesses make open-core software a transparent component within their own offering.

These re-bundled offers are a win for everyone involved. The open-core business gains contributors to - and users of - their OSS tooling, the end customers get a friendlier experience thanks to another company handling all the tedious leg-work, and the re-bundlers fill a market need by reducing customer-facing complexity.

Community Alternatives

The final side-effect of open-core businesses is the creation of 100% open solutions. It builds on the re-bundling and re-packaging of tools mentioned above.

Think of it like knowledge sharing; a callback to the true spirit of OSS.

Projects like MDS In a Box, a 100% open-source codebase for running the “Modern Data Stack”, embody this. They include all the features and components of an analytics toolkit, but remain accessible to everyone.

Similarly, we’ve seen releases of fully open-source playbooks on how to mimic dbt Cloud behavior without incurring additional cost related to a recent pricing change from dbt Labs’ hosted cloud service.

Are these options as robust as a fully-managed service? Hard to say, but that does not discount the fact that the option remains.

The very tooling that started as OSS and evolved into venture-backed open-core businesses is now being independently co-opted by other open-source projects. All to illustrate that you don’t need to pay for a managed service.

Perhaps we’ve come full circle.

It’s clear that as long as OSS tools are around, there will always be an alternative to the paid service.

So, Where Does This Leave Us?

The open source path is attractive in some cases, and for good reason. For personal and professional recognition, it’s hard to beat.

But as a means to build a business, that path is not as friendly.

Building a product business is already difficult. Transitioning from an open-source positioning appears to be even more difficult. Not only are you developing the technology, but you’re also managing the perception and opinion of the broader community - the very people helping you develop your product.

The current focus on GitHub Stars feels akin to the Dot-Com era of measuring eyeballs on a webpage. No mention of profit, of building sustainable revenue streams. Just get users; growth-at-all-costs.

Is open-core the best way to build a business in the data space?

Unlikely, when you consider the wide range of tools that fall into the world of “data”.

Is open-core the best way to leverage the broader community, both in terms of distribution and product development?

Yes.

But, it would be naive to think that the open-core path is a guaranteed win. If anything, past experience and recent market headwinds show that scale matters, product category matters, and the open-core model introduces competition in some very clever and unexpected ways.

Think I missed the mark, or want to share your thoughts?

Find me on LinkedIn or Twitter.

The Data Jargon Newsletter
