Navigating the future of your data mesh with events
With the rise of machine learning and AI, the importance of data within an organisation is being emphasised yet again. More and more businesses are starting to use data as the main driver for their decision-making, even if they are not planning on going down the AI or ML route. In this ever-changing world, where data is both plentiful and essential, an overall data management strategy is indispensable.
One of those strategies is data mesh, a sociotechnical concept for managing data. In this blog, we’ll discuss the current state of data management and its problems. Rather than give a complete overview of data mesh, we’ll cover the solutions it brings to the table and why events are a natural fit in those solutions.
Data as a product
Until recently, the dominant idea behind data management was to have a centralised data warehouse or lake. This was combined with a centralised data team responsible for extracting, transforming, and using the data for analytical purposes.
However, this centralised approach, where a single team had to access data sources it did not own and pile the resulting unstructured data into lakes and warehouses, turned out to be detrimental to the overall data quality.
Because these data teams depend on the operational teams, the technical connections to the source systems turned out to be brittle and hard to manage. The issues are not only technical, though. Because of its operational nature, this data evolves independently of its analytical usage and without any kind of agreed-upon contract.
To tackle the pressing data needs of today and tomorrow, a new approach is needed. Microservices have decentralised applications by putting well-defined domain boundaries around them, and data mesh lets us do the same for data. The main driver behind this new sociotechnical view on data management is decentralisation: creating domain-specific data teams and shifting the view of data from by-product to actual product.
In a data mesh, data products are composed and managed by a team within a domain in such a way that teams from other domains can consume and enrich them. Just like operational APIs, data products should be easily accessible and well-documented. Within larger organisations, it might be necessary to set up a federated governance structure overseeing and aligning the different data products.
Data mesh defines three types of data products:
Source-aligned: reflects the original source without any transformation or aggregation applied.
Aggregation-aligned: aggregated based on source-aligned data sets or a combination of multiple data sources.
Consumer-aligned: tailored to the needs of your consumer.
Data products leverage operational data, transforming it into facts for analytical purposes. Facts are things that happened in the past at a certain point in time and cannot be undone. There is a clear relationship between facts and events in an event-driven system, since events are also immutable and timestamped.
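To make that parallel concrete, here is a minimal sketch of an event modelled as an immutable, timestamped fact. The event name and its fields are hypothetical:

```java
import java.time.Instant;

// A hypothetical domain event: an immutable, timestamped fact.
// Once an OrderPlaced event exists it cannot be changed or undone,
// only followed by newer events (e.g. an OrderCancelled event).
public record OrderPlaced(
        String orderId,
        String customerId,
        double totalAmount,
        Instant occurredAt   // when the fact happened, not when it was processed
) {}
```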
Putting events at the origin of your data products
Operational data often has only one state or view of the world: the one currently needed to do its job. State changes are simply modelled as create, update, or delete operations on the data, typically erasing the previous state.
It is possible to derive facts from these state changes using tools like change data capture, but this is not without risk. Turning the data into facts can lead to misinterpretations, so excellent domain knowledge is crucial in these cases. Wouldn't it be easier if the world were fact-driven instead?
As it turns out, the concept of a fact in analytical data aligns excellently with events in an event-driven architecture. As we’ve said above, events are like facts: both are things that happened at a certain point in time and are thus immutable. So, wouldn't it be great if we had a continuous stream of events available that we can aggregate, join, and enrich?
In contrast to static data sources such as CSV or Parquet files, or batched approaches that use ETL tools to pull from REST APIs or SQL databases, event streams offer a real-time view of your data.
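As an illustration, even a plain Kafka consumer already gives you that continuous, real-time view: events are processed as they arrive rather than in a nightly batch. The broker address, topic name, and consumer group below are hypothetical:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderEventsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "order-analytics");          // hypothetical consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders.order-placed"));  // hypothetical topic
            while (true) {
                // Events arrive continuously instead of in periodic batches.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("fact at %d: %s%n", record.timestamp(), record.value());
                }
            }
        }
    }
}
```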
Apache Kafka
Even though data mesh as a concept is technology-agnostic, we'll present Apache Kafka as our technology of choice for implementing event streams.
Event streams are a natural fit for building data pipelines. In Apache Kafka, an event stream is abstracted as a topic: an immutable, append-only log of timestamped events. A topic can be consumed and reconsumed by multiple applications from different teams independently. To support the three kinds of data products (source-aligned, aggregation-aligned, and consumer-aligned), Kafka, through its Streams API, lets you aggregate and join event streams, creating new, enriched streams along the way.
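As a sketch of what an aggregation-aligned product could look like (the topic names, keys, and application id are hypothetical), a Kafka Streams application can turn a source-aligned stream of order events, keyed by customer id, into a continuously updated count of orders per customer:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrdersPerCustomerAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-per-customer");  // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Source-aligned stream: raw order events, keyed by customer id (hypothetical topic).
        KStream<String, String> orders = builder.stream("orders.order-placed");

        // Aggregation-aligned view: a continuously updated order count per customer.
        KTable<String, Long> ordersPerCustomer = orders
                .groupByKey()
                .count();

        // Publish the aggregate as a new, enriched stream for other domains to consume.
        ordersPerCustomer.toStream()
                .to("orders.orders-per-customer", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```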
Kafka also offers a wide variety of integrations with existing systems through its Kafka Connect framework. These integrations are called connectors, and they work in both directions: source connectors extract data from external systems (much like ETL tools), while sink connectors feed data into other systems.
There are a lot of connectors available for a wide variety of use cases. Got an ancient database that is still heavily used, and you don't want to get rid of it just yet? Chances are high that there's a connector available. Use Salesforce and want to extract data from it? There's a connector readily available for you.
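A connector is usually nothing more than configuration. The sketch below assumes Kafka Connect with the Confluent JDBC source connector pointed at a hypothetical legacy orders database; the connection details, table, and topic prefix are made up:

```properties
# Sketch of a Kafka Connect source connector configuration (standalone .properties form).
name=legacy-orders-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://legacy-db:5432/orders
connection.user=connect
connection.password=change-me
# Poll the orders table and detect new or updated rows via these columns.
table.whitelist=orders
mode=timestamp+incrementing
timestamp.column.name=updated_at
incrementing.column.name=id
# Resulting events land on the topic "legacy.orders".
topic.prefix=legacy.
```

Each new or updated row then becomes an event on that topic, ready to be refined into a proper source-aligned data product.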
To ensure data quality and set up documentation in one go, you can introduce schemas for your event streams. A schema defines and safeguards the contents of your events, so your consumers know what to expect and can easily evolve along with the data product.
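For example, with Avro and a schema registry (one common choice alongside Kafka, though not the only one), the hypothetical OrderPlaced event from earlier could be described by a schema like this sketch:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class OrderPlacedSchema {
    public static void main(String[] args) {
        // Sketch of a schema for the hypothetical OrderPlaced event. Registered in a
        // schema registry, it documents the event and lets serialisers reject messages
        // that break the agreed-upon contract.
        Schema orderPlaced = SchemaBuilder.record("OrderPlaced")
                .namespace("com.example.orders")  // hypothetical namespace
                .fields()
                .requiredString("orderId")
                .requiredString("customerId")
                .requiredDouble("totalAmount")
                .requiredLong("occurredAt")       // epoch millis of when the fact happened
                .endRecord();

        System.out.println(orderPlaced.toString(true)); // prints the .avsc JSON representation
    }
}
```

With compatibility checks in place, the schema can then evolve over time, for example by adding fields with default values, without breaking existing consumers.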
A lot of companies are adopting a more event-driven approach for their application landscape, and for good reason. This requires an event broker such as Apache Kafka to be available and managed, so reusing that broker for your data products makes sense. The technology will be shared and understood by different teams, and having common ground between operational and data teams is the key to success.
In short, Apache Kafka's topic abstraction, combined with its connectors and schemas, makes it a perfect candidate for your data mesh.
Do you have any remaining questions about your data mesh or other aspects of working with events? At Cymo, we specialise in Event-Driven Architecture (EDA) implementations and support. Be sure to contact us, and we’ll gladly help you out!

Written by Jonas Geiregat