Bring Your Own … Blob Storage?

Thoughts on why blob storage, and specifically S3, is a perfect match for BYOC software.

Jon Morehouse

Founder & CEO

13 min read

Over the past few years we have entered the golden age of S3, which has evolved from (just) a blob store to powering an entirely new set of infrastructure companies and products.

Here at Nuon, we believe that S3 is the de facto choice for any infrastructure company designing software that runs in customer cloud accounts, and we have already seen companies benefit from this model.

You can now use S3 to design systems that are stateless, customer-native, and portable across many different clouds, customer configurations, and networks. Customers can own and control their data and have a portable layer that moves with them across any cloud account.

History of S3

S3 launched in 2006 with the simple premise of giving customers reliable, scalable object storage. Since the advent of the cloud, S3 has been a backbone - an easy, effectively infinitely scalable way to store data.

If you’ve written any Python in the last 15 years and needed to push or pull data, odds are you’ve seen some code along these lines:

import boto3
s3 = boto3.client("s3")

# Write data to S3
s3.put_object(Bucket="my-bucket", Key="data.txt", Body=b"Hello, S3!")

# Read data from S3
response = s3.get_object(Bucket="my-bucket", Key="data.txt")
data = response['Body'].read()

Long relegated to storing image blobs, powering archival storage, and backing data science workflows (looking at you, pandas enthusiasts circa 2015), why is S3 suddenly in the spotlight? After all, hasn’t it been storing billions and billions of blobs for almost two decades now?

It used to be that the only way to tell who was using S3 was to see whose uptime suffered during a major S3 outage. Now companies are touting S3-native architectures, using S3 as an interface to customer data, and building an entire ecosystem on top of it. It no longer takes an AWS outage to see just how much of the internet depends on AWS, and on S3 in particular.

So, when did S3 go from “just another way to store bytes” to the coolest kid on the block?

S3 Is No Longer Just Plain Blob Storage

The simplicity of S3 has long appealed to systems folks. The idea of (effectively) 100% durable storage that never goes down and comes with read-after-write consistency? Sign me up.

Fun fact: Nuon started as yet another platform-as-a-service in 2020, powered by … none other than S3 as its primary datastore. That’s a story for another day.

As developers, we long wished for more from S3 - better performance, conditional writes, versioning, and more. For years, folks designed around the gaps at the application layer. As that functionality has been added natively, it has unlocked new capabilities while keeping the same guarantee of near-infinite storage durability.

Versioning

It’s hard to imagine today’s S3 without versioning, but in fact, for the first few years S3 did not have native versioning built in.

If you were writing code that used S3 back then, you had to bring your own versioning. This usually meant doing some type of copy-on-write in code:

from datetime import datetime, timezone

def update_with_version(bucket, key, new_data):
    timestamp = datetime.now(timezone.utc).isoformat()

    # Back up the existing object before overwriting it
    s3.copy_object(
        Bucket=bucket,
        Key=f"{key}.backup.{timestamp}",
        CopySource={"Bucket": bucket, "Key": key},
    )

    # Write the new version over the original key
    s3.put_object(Bucket=bucket, Key=key, Body=new_data)

Or embedding timestamps into your key structure, plus a small index, to make lookups easier:

import json

def update_with_version(bucket, key, new_data):
    timestamp = datetime.now(timezone.utc).isoformat()

    # Write the new version to a timestamped key
    s3.put_object(Bucket=bucket, Key=f"{key}.{timestamp}", Body=new_data)

    # Read the old versions index, append the new timestamp, and rewrite it
    response = s3.get_object(Bucket=bucket, Key=f"{key}.versions")
    versions = json.loads(response["Body"].read())
    versions.append(timestamp)
    s3.put_object(Bucket=bucket, Key=f"{key}.versions", Body=json.dumps(versions))

As you can see, without versioning built in, you’d have to manage not only reading and writing data, but also storing metadata separately to denote when something was updated, or keeping old versions around (for rollback and more).

Thankfully, many of us never had to deal with an S3 without versioning in our professional lives. Versioning launched in 2010, so if you never experienced life without it, ignorance is bliss.
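
For contrast, here is roughly what the same flow looks like with native versioning - a minimal sketch using standard boto3 calls, reusing the s3 client from the first snippet (the bucket name is a placeholder):

# Enable versioning once on the bucket (hypothetical bucket name)
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Every put_object now creates a new version automatically
s3.put_object(Bucket="my-bucket", Key="data.txt", Body=b"v2")

# List versions and read an older one back
versions = s3.list_object_versions(Bucket="my-bucket", Prefix="data.txt")
old = versions["Versions"][-1]
response = s3.get_object(Bucket="my-bucket", Key="data.txt", VersionId=old["VersionId"])

No sidecar index, no copy-on-write - S3 keeps the history for you.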

Now onto some more recent S3 features that have been on our wishlist…

Conditional Writes

Imagine your system writes data to S3 and can have multiple writers at the same time. How do you guarantee that the correct data is recorded, and that two processes don’t accidentally write over each other? Before conditional writes, this had to be handled in application code (usually by introducing some metadata store or locking system on top of S3).

A way to perform conditional writes was long sought after: in other words, a way to say “only update this key if this other condition is true.”

Conditional writes allow multiple systems to consume, write, and update data in the same bucket without worrying about key collisions. You can treat an S3 write as holding a mutex of sorts, rejecting the write when two systems update the same key at once.

For instance, let’s say we want to write a key to S3 only if it doesn’t already exist. Conditional writes let us do the following:

from botocore.exceptions import ClientError

def acquire_deploy_lock(s3_client, bucket, writer_id):
    try:
        s3_client.put_object(
            Bucket=bucket,
            Key="deploy-lock",
            Body=json.dumps({"locked_by": writer_id, "time": datetime.now(timezone.utc).isoformat()}),
            IfNoneMatch="*",  # Only succeeds if the object doesn't already exist
        )
        return True
    except ClientError:
        # Another writer got there first
        return False

Now, let’s say we have a system that does a read before writing. For instance, we want to read a blob and then update fields in it. If someone else writes that same blob between the time we read, update and rewrite it, we have a consistency issue, right?

Conditional writes let us reject the write if the key was updated in between. In this case, we read the object and reject our own write if the ETag changed midway through:

response = s3_client.get_object(Bucket=bucket, Key=key)
current_etag = response["ETag"]
config = json.loads(response["Body"].read())
config["example"] = "updated"

try:
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(config),
        ContentType="application/json",
        IfMatch=current_etag,  # Reject the write if the object changed since we read it
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "PreconditionFailed":
        print("Object was modified between reading and writing, rejecting!")

If you are reading this and are familiar with S3’s consistency challenges, you might also note that strong read-after-write consistency (covering overwrites and listings) did not come to S3 until the end of 2020!

Some other blob storage systems did introduce read-after-write consistency before this.

Conditional writes unlock the ability to build distributed, stateful systems on S3 without needing another metadata or locking store on top. From a simplicity perspective, this is a huge deal: developers can build truly portable systems on top of S3 for, effectively, the first time.

S3 Express One Zone and VPC endpoints

S3 reads and writes have historically come with latencies measured in tens to hundreds of milliseconds, so systems designed around S3 had to build their guarantees around that.

After all, that durability and consistency has to come from somewhere, right?

I spend a lot of time talking about what happens inside a customer’s cloud account, so I would be remiss not to talk about the benefits of S3 being a primitive offered by AWS.

By leveraging AWS-native and VPC-native infrastructure, AWS users can pick and choose the characteristics of their blob store:

  • Want intra-VPC networking only? You can use S3 VPC endpoints and lock out any external access to your data (see the sketch below).
  • Care about single-region, low-latency access? S3 Express One Zone gives you up to 10x performance in a single zone (think ~10ms reads and writes).
  • Care about consistency and replication? S3 Replication gives you strong replication guarantees.

It’s all just S3, so anything that uses the standard S3 APIs (more on that later) can choose its tradeoffs - or, in many cases, let customers choose theirs. Phew! That’s a lot of customization and optionality, all wrapped behind the simple (ok, maybe not so simple) premise of blob storage.
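
As a quick sketch of the VPC-only option above: assuming an existing VPC and route table (the IDs below are placeholders), a gateway endpoint keeps S3 traffic on the AWS network:

ec2 = boto3.client("ec2")

# Create a gateway endpoint so S3 traffic never leaves the AWS network
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder route table ID
)

From there, a bucket policy that checks the aws:SourceVpce condition can deny any request that doesn’t arrive through that endpoint.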

Native Querying

I’ve interacted with a number of systems over the years that either had to implement a metadata store on top of S3, or resort to clever hacks around the fact that S3 never had good primitives for querying.

While tools like Athena have chipped away at this over the years, it was not a well-solved problem until recently.

Raise your hand if you have ever seen a bucket key that embeds its metadata and query dimensions into its path:

s3://my-bucket/system=abc/tenant=def/partition=ghi

If you can list objects fast enough, or build deterministic paths using consistent metadata, you can get a whole lot more out of a plain ol' blob storage system after all.
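
A minimal sketch of that “query by path” trick, assuming the partitioned layout above and the s3 client from earlier:

# "Query" all objects for one tenant by listing a deterministic prefix
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="system=abc/tenant=def/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"], obj["LastModified"])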

You might be saying, “But Jon, that’s not a query language!” Yes, I know. As I mentioned before, the allure of infinitely scalable, (almost) always-up, durable storage will drive us systems folks to great lengths to design around it - all the way to inventing our own metadata systems and pseudo-query languages.

Analytics and streaming systems have standardized around S3 as the replacement for Hadoop and various other JVM-based distributed storage systems. Apache Iceberg has driven a new set of query engines that can work directly with S3 data.

More on this later. For now, let’s talk about how AWS S3 has become the default standard for blob storage, and about some of my favorite companies that are using blob storage to radically simplify, optimize, and innovate against the incumbents in their respective spaces.

The Standard API for Blob Storage

Ok, so we’ve talked about how S3 has shipped these long-awaited improvements over the years, and the systems they enable on top of it. Now let’s talk about the boring part: standards. In this case, that boring stability at the storage layer opens doors for new startups, an expanding standard, and more S3 goodness - whether in a vendor’s or a customer’s cloud account.

History of the S3 spec

Once upon a time, compatibility between cloud systems like S3 was a language level problem.

Tools like Go Cloud were (and continue to be in some cases) attempts at giving developers an abstraction over similar cloud primitives from different providers.

Today, I would argue that S3 is that cloud primitive, and its API has evolved into as much a standard as an interface.

In fact, you can now use S3 compatible APIs to talk to many different S3 implementations.

What does a blob storage standard enable?

S3’s API being both stable enough and portable enough to work across clouds and environments is good for everyone. It enables startups to build new systems around these APIs, and allows competitors to create alternative implementations and expand the standard.

As an ecosystem, it also means stability. The more building blocks we have, the better. While the cloud is not completely standardized yet - infrastructure building blocks like S3 create the abstractions we need to get most of the way there.

Just as importantly, this stability enables software to be written once and run in any customer’s environment. As a developer, you can build stateless systems that leverage S3 and run your software in essentially any cloud, or on-premises.
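
In practice, that portability often comes down to a few lines of configuration. A hedged sketch: the endpoint and bucket come from the environment (the variable names here are made up), so the same code runs against AWS S3, an S3-compatible store, or whatever bucket the customer provides:

import os
import boto3

# Only the configuration changes per environment; the code stays the same.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("BLOB_ENDPOINT_URL"),  # None falls back to AWS S3
    region_name=os.environ.get("BLOB_REGION", "us-east-1"),
)
bucket = os.environ["BLOB_BUCKET"]

s3.put_object(Bucket=bucket, Key="healthcheck", Body=b"ok")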

Rethinking Critical Infrastructure around (modern) S3

We’ve talked a lot about all the things that S3 enables, and how it’s a standard. Let’s talk about some of my favorite startups and how they are leveraging S3 to rethink critical infrastructure.

Using S3, these companies have built faster, cheaper, and more innovative infrastructure that incumbents can’t compete with.

These systems are more portable than ever - they can run in vendor accounts (i.e., cloud), or in customer accounts (Bring Your Own Cloud) seamlessly, and have operational characteristics that make them portable across not only clouds, but across network configurations!

Warpstream, RedPanda, Chroma, and Turbopuffer

Warpstream and Turbopuffer are recent startups that have leveraged modern S3 to build alternative implementations of incumbent products. In Warpstream’s case, they are Kafka-compatible and enable customers to run a 10x cheaper Kafka alternative in their own account. Because of its S3 model, Warpstream is operationally simpler than the incumbents and lets customers control the storage layer behind their streams.

RedPanda offers streaming in your own cloud account, and an agentic data-plane to connect to customer data, with topics powered by blob storage.

Chroma is another startup using S3; it is building vector search with full-text and metadata filtering.

Turbopuffer is also leveraging S3 to build serverless vector and full-text search.

SlateDB

SlateDB is an embedded database backed by durable object storage. Using SlateDB, you can embed a datastore in your application and get strong consistency and durability guarantees, all leveraging S3. You can build realtime systems that store data in either your own S3 bucket or a customer-provided bucket, enabling both new architectures and integrations with customer storage.

Kafka and the Iceberg ecosystem

The data gravity of S3 is pushing more data infrastructure tooling to integrate with it. While S3 doesn’t have a proper query language, common formats like Iceberg and Parquet have enabled query systems to be built on top of it. Now it’s not only possible to query S3 data; combined with something like S3 Express One Zone storage, you can approach real-time lakehouse architectures, all on top of the S3 API.
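
As a rough illustration (not tied to any particular vendor), here’s what querying Parquet data straight out of S3 can look like, assuming a recent pyarrow build with S3 support and a hypothetical bucket of Parquet files:

import pyarrow.parquet as pq
import pyarrow.compute as pc

# Read a Parquet file directly from S3 (credentials come from the environment)
table = pq.read_table("s3://my-bucket/events/2025/01/data.parquet")

# Filter in-process, without standing up a database
recent = table.filter(pc.field("tenant") == "def")
print(recent.num_rows)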

Expanding The S3 Ecosystem

So, we’ve discussed S3 - every cloud offers it, and it’s got a standard protocol. What happens next? Companies can implement the S3 protocol as a service, or design on top of it (or in Tigris’s case, both!).

Branching

Tigris is an S3-compatible storage provider: not only is it API-compatible with S3, it lets you reuse most of the same code you already have that talks to S3.

From Tigris’s Getting Started Guide:

import boto3
from botocore.client import Config

# Create the S3 service client, pointed at Tigris
svc = boto3.client(
    's3',
    endpoint_url='https://t3.storage.dev',
    config=Config(s3={'addressing_style': 'virtual'}),
)

# List buckets
response = svc.list_buckets()

They have leveraged the stability of S3’s API not only to build a cheaper, faster storage alternative, but also to expand the protocol with new features.

Tigris recently launched bucket forking - a new primitive that gives you semantics similar to blob versioning, but for an entire bucket. Forks enable you to create an entirely new version of a bucket, make changes, and merge back in, much like a git-based model.

This is the value of the S3 ecosystem: it’s not only a way to build applications and consume customer data; because it is a standard, new features can be built on top of its primitives.

Ecosystems evolve when the underlying infrastructure is stable, reliable and enables core primitives that can be orchestrated.

Streams

S2 is a new startup offering an idiomatic way to move data between “streams,” backed by S3. Instead of deploying a queue system or managing different flows of data, you can write to lightweight, S3-backed streams using a standard API.

They are taking the S3 API and building on top of it, to create an entirely new infrastructure primitive.

BYO … Bucket?

The emergence of S3 as a storage standard, and the interoperability between different S3 providers, means that developers can design systems with one target in mind - S3 - and then run them in many different environments with different customer configurations.

With S3’s durability guarantees and consistency, and the growing ecosystem around it, many companies are betting their blobs (ok, I swear I’m done with the puns) on the S3 API as the best way to deliver robust, easy-to-operate infrastructure into customer environments.

Diagram: Traditional SaaS with customer-managed blob storage.

Bring Your Own Bucket

The ubiquity of S3 and its standard API has enabled companies like BitDrift to offer a bring-your-own-bucket model. Customers can own their data, or connect existing data sources to a vendor-managed cloud SaaS, and the SaaS will read from and write to the customer’s bucket.

This gives customers data ownership, control over access, and operational control over how the bucket is configured, while giving the vendor a consistent API to build on top of.
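
Under the hood, this is commonly wired up with cross-account access: the customer grants the vendor a role scoped to their bucket, and the vendor assumes it. A minimal sketch of that pattern; the role ARN, external ID, and bucket name are placeholders the customer would provide:

import boto3

sts = boto3.client("sts")

# Assume the role the customer created for us, scoped to their bucket
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/vendor-access",  # placeholder
    RoleSessionName="vendor-reader",
    ExternalId="customer-supplied-external-id",              # placeholder
)["Credentials"]

customer_s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Read and write directly against the customer-owned bucket
customer_s3.put_object(Bucket="customer-bucket", Key="exports/report.json", Body=b"{}")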

Diagram: Vendor-deployed BYOC application connected to blob storage in the customer's environment.

Bring Your Own Cloud

With gravity pulling data inside a customer’s network, S3 is perfectly positioned for customers who want to securely expose data across that network. Private endpoints and access points make it possible to control access to S3 data and give internal VPC traffic custom gateways.

Vendors who build systems using the S3 API as a data layer can reap the benefits of durability, consistency and performance, while building simple to operate systems that can run natively in a customer’s cloud account.
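
For instance (a sketch with placeholder account and VPC IDs), an S3 Access Point restricted to the customer’s VPC gives an in-account workload a scoped, network-local path to the data:

s3control = boto3.client("s3control")

# Create an access point that only accepts requests from inside the customer's VPC
s3control.create_access_point(
    AccountId="111122223333",                              # placeholder customer account
    Name="vendor-data-ap",
    Bucket="customer-bucket",
    VpcConfiguration={"VpcId": "vpc-0123456789abcdef0"},   # placeholder VPC ID
)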

Conclusion

I believe that BYOC and S3 are going to keep unlocking new architectures. As the S3 API expands and inevitably adds more latency tiers, consistency modes, and query options (and the ecosystem grows around it), more and more systems will be designed around it from first principles.

At the same time, BYOC is growing thanks to AI innovation and enterprise demands for sovereign data, better security, ownership, and integration. We expect most new BYOC infrastructure to be built on blob storage.

Together, this creates a flywheel: as more software moves into customer accounts, more customer data lives in S3. The data gravity of customer-owned blob storage, not just S3 in general, will generate ever more demand for blob-storage-native BYOC software.

Needless to say, we’re bullish on the combination of blob storage and Bring Your Own Cloud software ❤️
