From pipeline to database: how tiered storage is reshaping the role of Kafka
For years, Apache Kafka has been the de facto standard for high-throughput data pipelines, but a recent evolution is fundamentally changing its role in the enterprise: tiered storage. This innovation, which decouples expensive, high-performance "hot" storage from cost-effective "cold" storage, is turning Kafka from a temporary message bus into a viable, long-term system of record.
To explore the implications of this shift, we sat down with Anatoly Zelenin, a seasoned Kafka trainer and author of a German-language book on Apache Kafka, on our Talking Event-Driven podcast. We discussed everything from the nuances of events in finance and IoT to the critical importance of a new approach to data retention and recovery.
The spectrum of events: from state to behavior
Before diving into storage, it's crucial to understand what we're storing. Anatoly highlights a spectrum of data types that often flow through Kafka, each with increasing semantic richness:
- State: The simplest form is a full snapshot of an entity (e.g., the complete customer profile). It's easy to consume (just take the latest message), but you lose all context about why the state changed.
- Deltas: A more efficient approach is to send only what has changed (e.g., customer address updated). This provides more context than a full state snapshot but still lacks the business intent.
- Business Events: The most powerful form is capturing the underlying business behavior (e.g., CustomerRelocated). This is an immutable fact, rich with semantic meaning, that describes what actually happened.
A mature event-driven architecture leans heavily towards business events, as they provide the unambiguous context needed for building intelligent, decoupled systems.
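To make the distinction concrete, here is a minimal sketch of how the same underlying change might be modeled in each style. The record names and fields are illustrative, not taken from the episode:

```java
// Hypothetical payload shapes for the same customer move, sketched as Java records.

// 1. State: the full snapshot. Easy to consume, but the "why" is lost.
record CustomerState(String customerId, String name, String street, String city) {}

// 2. Delta: only the changed fields. Smaller, but still no business intent.
record CustomerAddressChanged(String customerId, String newStreet, String newCity) {}

// 3. Business event: an immutable fact that captures the behavior itself.
record CustomerRelocated(String customerId, String fromCity, String toCity,
                         java.time.Instant occurredAt) {}
```

Note how only the third shape tells a downstream consumer what actually happened; the other two force it to guess.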
The game-changer: tiered storage and infinite retention
Historically, one of Kafka's biggest limitations as a long-term store has been cost. Because compute and storage were tightly coupled, retaining large volumes of data for long periods was prohibitively expensive. You paid a premium for high-performance disks and the compute power to run the brokers, even for data that was rarely accessed.
Tiered storage (delivered via KIP-405, available as an early-access feature since Apache Kafka 3.6 and declared production-ready in 3.9) shatters this limitation.
- Hot Data (recent events) remains on the fast, local disks of the brokers for low-latency access.
- Cold Data (older events) is automatically offloaded to cheaper object storage like AWS S3 or Google Cloud Storage.
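As a rough sketch of what this looks like in practice, the topic-level settings below (config names per KIP-405) opt a hypothetical orders topic into tiered storage. This assumes the brokers already run a version with tiered storage enabled (e.g., remote.log.storage.system.enable=true) and a remote storage plugin for your object store configured:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTieredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic orders = new NewTopic("orders", 6, (short) 3)
                .configs(Map.of(
                    // Opt this topic into tiered storage.
                    "remote.storage.enable", "true",
                    // Keep only the last day on the brokers' local disks (the "hot" tier).
                    "local.retention.ms", "86400000",
                    // Keep the full history (here roughly ten years) overall;
                    // everything beyond local retention lives in the "cold" tier.
                    "retention.ms", "315360000000"
                ));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```

The key point is that retention and local disk usage are now two separate dials: local retention stays short and cheap, while overall retention can stretch to years.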
"I believe this changes everything about the calculation of Kafka," Anatoly states. "Before tiered storage, Kafka is just a very expensive storage system. With tiered storage, you can... put the old data to AWS S3... and don't pay the compute part of it."
This makes the vision of "turning the database inside out," as Martin Kleppmann famously described it, a practical reality. Kafka is no longer just a pipeline to a database; it can now serve as the central, durable log of business facts, with traditional databases acting as specialized projections of that log.
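To illustrate the "databases as projections" idea, here is a minimal sketch (topic name and string serialization are assumptions) of a consumer that replays a topic from offset zero and maintains a latest-value-per-key view. With tiered storage, that replay transparently pulls older segments back from object storage:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CustomerProjection {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "customer-projection");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<String, String> latestByCustomer = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Start from the very beginning of the log, including remote segments.
            List<TopicPartition> partitions = consumer.partitionsFor("customer-events").stream()
                .map(p -> new TopicPartition(p.topic(), p.partition()))
                .toList();
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The projection here is just "latest value per key"; a real one
                    // might write to Postgres, Elasticsearch, or another specialized store.
                    latestByCustomer.put(record.key(), record.value());
                }
            }
        }
    }
}
```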
The unsolved problem: disaster recovery and the schema registry
While tiered storage simplifies long-term retention, it also changes the conversation around disaster recovery. Backing up the "cold" data in object storage is straightforward, but what about the "hot" data that still lives exclusively on the brokers? More critically, what about the metadata that gives your events meaning?
Anatoly points to a significant weak spot in many Kafka ecosystems: the schema registry.
"I'm not a big fan of the current approaches of schema management in Kafka," he admits. "And there's one big reason for that... the schema IDs. The schema IDs are just numbers that are incremented. There's no semantic meaning."
This creates a dangerous tight coupling. The binary payload of your Kafka messages contains a numeric ID that maps to a schema definition in a separate system (the registry). If you lose the schema topic or have to migrate your registry, those IDs can change, potentially rendering all your historical data unreadable. A simple "oops" moment, like deleting a topic in the wrong environment, can become a catastrophic, business-ending event.
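The coupling is easy to see in the common Schema Registry wire format, where every serialized message starts with a magic byte followed by a four-byte numeric schema ID. The sketch below (assuming that format) shows how little the payload itself tells you about its schema:

```java
import java.nio.ByteBuffer;

public class SchemaIdPeek {

    // Extracts the schema ID from a message serialized with the common
    // Schema Registry wire format: 1 magic byte, then a 4-byte schema ID,
    // then the encoded payload.
    static int schemaId(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        byte magic = buf.get();
        if (magic != 0) {
            throw new IllegalArgumentException("Unknown magic byte: " + magic);
        }
        // This integer is the only link between the bytes on disk and the schema
        // stored in the registry. If the registry loses or renumbers its IDs,
        // the payload gives you no way to recover the schema.
        return buf.getInt();
    }
}
```

In other words, backing up your topics without backing up (and preserving the IDs of) your schemas only gives you the illusion of a recovery plan.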
Conclusion: a new era for Kafka requires a new mindset
Kafka is evolving beyond a simple messaging system. With the advent of tiered storage, it is becoming a foundational data platform capable of serving as the immutable log for the entire enterprise. This powerful shift requires a corresponding evolution in our thinking.
We must move from viewing Kafka as a transient pipeline to treating it as a critical, long-term data asset. This means embracing a more mature approach to event modeling, investing in robust disaster recovery strategies, and paying close attention to critical dependencies like the schema registry. The technology has arrived; now it's up to us to build architectures and operational practices to match.
Ready to architect a modern, resilient data platform with Kafka at its core? We can help you navigate the complexities of tiered storage, disaster recovery, and schema management.
