Bring Your Own … Blob Storage?

Thoughts on why blob storage, and specifically S3, is a perfect match for BYOC software.

Jon Morehouse

Founder & CEO

13 min read

Over the past few years we have entered the golden age of S3, which has evolved from (just) a blob store to powering an entirely new set of infrastructure companies and products.

Here at Nuon, we believe that S3 is the de facto choice for any infrastructure company designing software that runs in customer cloud accounts, and we have already seen companies benefit from this model.

You can now use S3 to design systems that are stateless, customer-native, and portable across many different clouds, customer configurations, and networks. Customers can own and control their data and have a portable layer that moves with them across any cloud account.

History of S3

S3 launched in 2006 with the simple premise of giving customers reliable, scalable object storage. Since the advent of the cloud, S3 has been a backbone - an easy, effectively infinitely scalable way to store data.

If you’ve written any Python in the last 15 years and needed to push or pull data, odds are you’ve seen some code along these lines:

import boto3
s3 = boto3.client("s3")

# Write data to S3
s3.put_object(Bucket="my-bucket", Key="data.txt", Body=b"Hello, S3!")

# Read data from S3
response = s3.get_object(Bucket="my-bucket", Key="data.txt")
data = response['Body'].read()

Long relegated to storing image blobs, powering archival storage, and backing data science workflows (looking at you, pandas enthusiasts circa 2015), why is S3 suddenly in the spotlight? After all, hasn’t it been storing billions and billions of blobs for almost two decades now?

It used to be that the only way to tell who was using S3 was to see whose uptime suffered during a major S3 outage. Now companies are touting S3-native architectures, using S3 as an interface to customer data, and building an entire ecosystem on top of it. It no longer takes an AWS outage to see just how much of the internet depends on AWS, and on S3 in particular.

So, when did S3 go from “just another way to store bytes” to the coolest kid on the block?

S3 Is No Longer Just Plain Blob Storage

The simplicity of S3 has long appealed to systems folks. The idea of (effectively) 100% durable storage that never goes down and comes with read-after-write consistency? Sign me up.

Fun fact: Nuon started as yet another platform-as-a-service in 2020, powered by … none other than S3 as its primary datastore. That’s a story for another day.

As developers, we long wished for more from S3 - better performance, conditional writes, versioning, and more. For years, folks designed around the gaps at the application layer. As that functionality has been added natively, it has unlocked new capabilities while keeping the same guarantee of near-infinite storage durability.

Versioning

It’s hard to imagine today’s S3 without versioning, but in fact, for the first few years S3 did not have native versioning built in.

If you were writing code that used S3 back then, you had to bring your own versioning. This usually meant doing some type of copy-on-write in code:

from datetime import datetime, timezone

def update_with_version(bucket, key, new_data):
    timestamp = datetime.now(timezone.utc).isoformat()

    # Back up the existing object before overwriting it
    s3.copy_object(
        Bucket=bucket,
        Key=f"{key}.backup.{timestamp}",
        CopySource={"Bucket": bucket, "Key": key},
    )

    # Write the new version over the original key
    s3.put_object(Bucket=bucket, Key=key, Body=new_data)

Or embedding timestamps into your key structure, plus a small index, to make lookups easier:

import json

def update_with_version(bucket, key, new_data):
    timestamp = datetime.now(timezone.utc).isoformat()

    # Write the new version to a timestamped key
    s3.put_object(Bucket=bucket, Key=f"{key}.{timestamp}", Body=new_data)

    # Read the old versions index, append the new timestamp, and rewrite it
    response = s3.get_object(Bucket=bucket, Key=f"{key}.versions")
    versions = json.loads(response["Body"].read())
    versions.append(timestamp)
    s3.put_object(Bucket=bucket, Key=f"{key}.versions", Body=json.dumps(versions))

As you can see, without versioning built in, you’d have to manage not only reading and writing data, but also storing metadata separately to denote when something was updated, or keeping old versions around (for rollback and more).

Thankfully, many of us never had to deal with an S3 without versioning in our professional lives. Versioning launched in 2010, so if you never experienced life without it, ignorance is bliss.
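
For contrast, here is roughly what the same flow looks like with native versioning - a minimal sketch using standard boto3 calls, reusing the s3 client from the first snippet (the bucket name is a placeholder):

# Enable versioning once on the bucket (hypothetical bucket name)
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Every put_object now creates a new version automatically
s3.put_object(Bucket="my-bucket", Key="data.txt", Body=b"v2")

# List versions and read an older one back
versions = s3.list_object_versions(Bucket="my-bucket", Prefix="data.txt")
old = versions["Versions"][-1]
response = s3.get_object(Bucket="my-bucket", Key="data.txt", VersionId=old["VersionId"])

No sidecar index, no copy-on-write - S3 keeps the history for you.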

Now onto some more recent S3 features that have been on our wishlist…

Conditional Writes

Imagine your system writes data to S3 and can have multiple writers at the same time. How do you guarantee that the correct data is recorded, and that two processes don’t accidentally write over each other? Before conditional writes, this had to be handled in application code (usually by introducing some metadata store or locking system on top of S3).

A way to perform conditional writes was long sought after: in other words, a way to say “only update this key if this other condition is true.”

Conditional writes allow multiple systems to consume, write, and update data in the same bucket without worrying about key collisions. You can treat an S3 write as holding a mutex of sorts, rejecting the write when two systems update the same key at once.

For instance, let’s say we want to write a key to S3 only if it doesn’t already exist. Conditional writes let us do the following:

from botocore.exceptions import ClientError

def acquire_deploy_lock(s3_client, bucket, writer_id):
    try:
        s3_client.put_object(
            Bucket=bucket,
            Key="deploy-lock",
            Body=json.dumps({"locked_by": writer_id, "time": datetime.now(timezone.utc).isoformat()}),
            IfNoneMatch="*",  # Only succeeds if the object doesn't already exist
        )
        return True
    except ClientError:
        # Another writer got there first
        return False

Now, let’s say we have a system that does a read before writing. For instance, we want to read a blob and then update fields in it. If someone else writes that same blob between the time we read, update and rewrite it, we have a consistency issue, right?

Conditional writes let us reject the write if the key was updated in between. In this case, we read the object and reject our own write if the ETag changed midway through:

response = s3_client.get_object(Bucket=bucket, Key=key)
current_etag = response["ETag"]
config = json.loads(response["Body"].read())
config["example"] = "updated"

try:
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(config),
        ContentType="application/json",
        IfMatch=current_etag,  # Reject the write if the object changed since we read it
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "PreconditionFailed":
        print("Object was modified between reading and writing, rejecting!")

If you are reading this and are familiar with S3’s consistency challenges, you might also note that strong read-after-write consistency (covering overwrites and listings) did not come to S3 until the end of 2020!

Some other blob storage systems did introduce read-after-write consistency before this.

Conditional writes unlock the ability to build distributed, stateful systems on S3 without needing another metadata or locking store on top. From a simplicity perspective, this is a huge deal: developers can build truly portable systems on top of S3 for, effectively, the first time.

S3 Express One Zone and VPC endpoints

S3 reads and writes have historically come with latencies measured in tens to hundreds of milliseconds, so systems designed around S3 had to build their guarantees around that.

After all, that durability and consistency has to come from somewhere, right?

I spend a lot of time talking about what happens inside a customer’s cloud account, so I would be remiss not to talk about the benefits of S3 being a primitive offered by AWS.

By leveraging AWS-native and VPC-native infrastructure, AWS users can pick and choose the characteristics of their blob store:

  • Want intra-VPC networking only? You can use S3 VPC endpoints and lock out any external access to your data (see the sketch below).
  • Care about single-region, low-latency access? S3 Express One Zone gives you up to 10x performance in a single zone (think ~10ms reads and writes).
  • Care about consistency and replication? S3 Replication gives you strong replication guarantees.

It’s all just S3, so anything that uses the standard S3 APIs (more on that later) can choose its tradeoffs - or, in many cases, let customers choose theirs. Phew! That’s a lot of customization and optionality, all wrapped behind the simple (ok, maybe not so simple) premise of blob storage.
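
As a quick sketch of the VPC-only option above: assuming an existing VPC and route table (the IDs below are placeholders), a gateway endpoint keeps S3 traffic on the AWS network:

ec2 = boto3.client("ec2")

# Create a gateway endpoint so S3 traffic never leaves the AWS network
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder route table ID
)

From there, a bucket policy that checks the aws:SourceVpce condition can deny any request that doesn’t arrive through that endpoint.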

Native Querying

I’ve interacted with a number of systems over the years that either had to implement a metadata store on top of S3, or resort to clever hacks around the fact that S3 never had good primitives for querying.

While tools like Athena have chipped away at this over the years, it was not a well-solved problem until recently.

Raise your hand if you have ever seen a bucket key that embeds its metadata and query dimensions into its path:

s3://my-bucket/system=abc/tenant=def/partition=ghi

If you can list objects fast enough, or build deterministic paths using consistent metadata, you can get a whole lot more out of a plain ol' blob storage system after all.
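
A minimal sketch of that “query by path” trick, assuming the partitioned layout above and the s3 client from earlier:

# "Query" all objects for one tenant by listing a deterministic prefix
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="system=abc/tenant=def/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"], obj["LastModified"])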

You might be saying, “But Jon, that’s not a query language!” Yes, I know. As I mentioned before, the allure of infinitely scalable, (almost) always-up, durable storage will drive us systems folks to great lengths to design around it - all the way to inventing our own metadata systems and pseudo-query languages.

Analytics and streaming systems have standardized around S3 as the replacement for Hadoop and various other JVM-based distributed storage systems. Apache Iceberg has driven a new set of query engines that can work directly with S3 data.

More on this later. For now, let’s talk about how AWS S3 has become the default standard for blob storage, and about some of my favorite companies that are using blob storage to radically simplify, optimize, and innovate against the incumbents in their respective spaces.

The Standard API for Blob Storage

Ok, so we’ve talked about how S3 has shipped these long-awaited improvements over the years, and the systems they enable on top of it. Now let’s talk about the boring part: standards. In this case, that boring stability at the storage layer opens doors for new startups, an expanding standard, and more S3 goodness - whether in a vendor’s or a customer’s cloud account.

History of the S3 spec

Once upon a time, compatibility between cloud systems like S3 was a language level problem.

Tools like Go Cloud were (and continue to be in some cases) attempts at giving developers an abstraction over similar cloud primitives from different providers.

Today, I would argue that S3 is that cloud primitive, and its API has evolved into as much a standard as an interface.

In fact, you can now use S3 compatible APIs to talk to many different S3 implementations.

What does a blob storage standard enable?

S3’s API being both stable enough and portable enough to work across clouds and environments is good for everyone. It enables startups to build new systems around these APIs, and allows competitors to create alternative implementations and expand the standard.

As an ecosystem, it also means stability. The more building blocks we have, the better. While the cloud is not completely standardized yet - infrastructure building blocks like S3 create the abstractions we need to get most of the way there.

Just as importantly, this stability enables software to be written once and run in any customer’s environment. As a developer, you can build stateless systems that leverage S3 and run your software in essentially any cloud, or on-premises.
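
In practice, that portability often comes down to a few lines of configuration. A hedged sketch: the endpoint and bucket come from the environment (the variable names here are made up), so the same code runs against AWS S3, an S3-compatible store, or whatever bucket the customer provides:

import os
import boto3

# Only the configuration changes per environment; the code stays the same.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("BLOB_ENDPOINT_URL"),  # None falls back to AWS S3
    region_name=os.environ.get("BLOB_REGION", "us-east-1"),
)
bucket = os.environ["BLOB_BUCKET"]

s3.put_object(Bucket=bucket, Key="healthcheck", Body=b"ok")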

Rethinking Critical Infrastructure around (modern) S3

We’ve talked a lot about all the things that S3 enables, and how it’s a standard. Let’s talk about some of my favorite startups and how they are leveraging S3 to rethink critical infrastructure.

Using S3, these companies have built faster, cheaper, and more innovative infrastructure that incumbents can’t compete with.

These systems are more portable than ever - they can run in vendor accounts (i.e., cloud), or in customer accounts (Bring Your Own Cloud) seamlessly, and have operational characteristics that make them portable across not only clouds, but across network configurations!

Warpstream, RedPanda, Chroma, and Turbopuffer

Warpstream and Turbopuffer are recent startups that have leveraged modern S3 to build alternative implementations of incumbent products. In Warpstream’s case, they are Kafka-compatible and enable customers to run a 10x cheaper Kafka alternative in their own account. Because of its S3 model, Warpstream is operationally simpler than the incumbents and lets customers control the storage layer behind their streams.

RedPanda offers streaming in your own cloud account, and an agentic data-plane to connect to customer data, with topics powered by blob storage.

Chroma is another startup using S3; it is building vector search with full-text and metadata filtering.

Turbopuffer is also leveraging S3 to build serverless vector and full-text search.

SlateDB

SlateDB is an embedded database backed by durable object storage. Using SlateDB, you can embed a datastore in your application and get strong consistency and durability guarantees, all leveraging S3. You can build realtime systems that store data in either your own S3 bucket or a customer-provided bucket, enabling both new architectures and integrations with customer storage.

Kafka and the Iceberg ecosystem

The data gravity of S3 is pushing more data infrastructure tooling to integrate with it. While S3 doesn’t have a proper query language, common formats like Iceberg and Parquet have enabled query systems to be built on top of it. Now it’s not only possible to query S3 data; combined with something like S3 Express One Zone storage, you can approach real-time lakehouse architectures, all on top of the S3 API.
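
As a rough illustration (not tied to any particular vendor), here’s what querying Parquet data straight out of S3 can look like, assuming a recent pyarrow build with S3 support and a hypothetical bucket of Parquet files:

import pyarrow.parquet as pq
import pyarrow.compute as pc

# Read a Parquet file directly from S3 (credentials come from the environment)
table = pq.read_table("s3://my-bucket/events/2025/01/data.parquet")

# Filter in-process, without standing up a database
recent = table.filter(pc.field("tenant") == "def")
print(recent.num_rows)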

Expanding The S3 Ecosystem

So, we’ve discussed S3 - every cloud offers it, and it’s got a standard protocol. What happens next? Companies can implement the S3 protocol as a service, or design on top of it (or in Tigris’s case, both!).

Branching

Tigris is an S3-compatible storage provider: not only is it API-compatible with S3, it lets you reuse most of the same code you already have that talks to S3.

From Tigris’s Getting Started Guide:

import boto3
from botocore.client import Config

# Create the S3 service client, pointed at Tigris
svc = boto3.client(
    's3',
    endpoint_url='https://t3.storage.dev',
    config=Config(s3={'addressing_style': 'virtual'}),
)

# List buckets
response = svc.list_buckets()

They have leveraged the stability of S3’s API not only to build a cheaper, faster storage alternative, but also to expand the protocol with new features.

Tigris recently launched bucket forking - a new primitive that gives you semantics similar to blob versioning, but for an entire bucket. Forks enable you to create an entirely new version of a bucket, make changes, and merge back in, much like a git-based model.

This is the value of the S3 ecosystem: it’s not only a way to build applications and consume customer data; because it is a standard, new features can be built on top of its primitives.

Ecosystems evolve when the underlying infrastructure is stable, reliable and enables core primitives that can be orchestrated.

Streams

S2 is a new startup offering an idiomatic way to move data between “streams,” backed by S3. Instead of deploying a queue system or managing different flows of data, you can write to lightweight, S3-backed streams using a standard API.

They are taking the S3 API and building on top of it, to create an entirely new infrastructure primitive.

BYO … Bucket?

The emergence of S3 as a storage standard, and the interoperability between different S3 providers, means that developers can design systems with one target in mind - S3 - and then run them in many different environments with different customer configurations.

With S3’s durability guarantees and consistency, and the growing ecosystem around it, many companies are betting their blobs (ok, I swear I’m done with the puns) on the S3 API as the best way to deliver robust, easy-to-operate infrastructure into customer environments.

Diagram: Traditional SaaS with customer-managed blob storage.

Bring Your Own Bucket

The ubiquity of S3 and its standard API has enabled companies like BitDrift to offer a bring-your-own-bucket model. Customers can own their data, or connect existing data sources to a vendor-managed cloud SaaS, and the SaaS will read from and write to the customer’s bucket.

This gives customers data ownership, control over access, and operational control over how the bucket is configured, while giving the vendor a consistent API to build on top of.
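
Under the hood, this is commonly wired up with cross-account access: the customer grants the vendor a role scoped to their bucket, and the vendor assumes it. A minimal sketch of that pattern; the role ARN, external ID, and bucket name are placeholders the customer would provide:

import boto3

sts = boto3.client("sts")

# Assume the role the customer created for us, scoped to their bucket
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/vendor-access",  # placeholder
    RoleSessionName="vendor-reader",
    ExternalId="customer-supplied-external-id",              # placeholder
)["Credentials"]

customer_s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Read and write directly against the customer-owned bucket
customer_s3.put_object(Bucket="customer-bucket", Key="exports/report.json", Body=b"{}")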

Diagram: Vendor-deployed BYOC application connected to blob storage in the customer's environment.

Bring Your Own Cloud

With gravity pulling data inside a customer’s network, S3 is perfectly positioned for customers who want to securely expose data across that network. Private endpoints and access points make it possible to control access to S3 data and give internal VPC traffic custom gateways.

Vendors who build systems using the S3 API as a data layer can reap the benefits of durability, consistency and performance, while building simple to operate systems that can run natively in a customer’s cloud account.
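
For instance (a sketch with placeholder account and VPC IDs), an S3 Access Point restricted to the customer’s VPC gives an in-account workload a scoped, network-local path to the data:

s3control = boto3.client("s3control")

# Create an access point that only accepts requests from inside the customer's VPC
s3control.create_access_point(
    AccountId="111122223333",                              # placeholder customer account
    Name="vendor-data-ap",
    Bucket="customer-bucket",
    VpcConfiguration={"VpcId": "vpc-0123456789abcdef0"},   # placeholder VPC ID
)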

Conclusion

I believe that BYOC and S3 are going to keep unlocking new architectures. As the S3 API expands and inevitably adds more latency tiers, consistency modes, and query options (and the ecosystem grows around it), more and more systems will be designed around it from first principles.

At the same time, BYOC is growing thanks to AI innovation and enterprise demands for sovereign data, better security, ownership, and integration. We expect most new BYOC infrastructure to be built on blob storage.

Together, this creates a flywheel: as more software moves into customer accounts, more customer data lives in S3. The data gravity of customer-owned blob storage, not just S3 in general, will generate ever more demand for blob-storage-native BYOC software.

Needless to say, we’re bullish on the combination of blob storage and Bring Your Own Cloud software ❤️
