Schema evolution in practice


There is a lot of discussion about schema evolution. But what if you are starting a new project and need an easy way to find out which compatibility type is the best choice for a specific topic?

In this blog post, we give you clear guidelines that all team members can easily understand. This guarantees that schema evolution is not an afterthought in your team: every member applies the same criteria and arrives at the same choice. Based on three different topic types, you can already make a decision that you will (most likely) not regret later.

What kind of project are we talking about?

Firstly, we will use the principles of Domain-Driven Design. After defining the bounded contexts for all business domains in your organisation, you can map microservices or existing products to these bounded contexts and make sure there is a master system for each domain entity. These master systems send out events for the data in the domains they are responsible for.

As a result, a master system needs to be able to introduce small schema updates to its events at its own pace, while giving consuming applications the time they need to implement the new version. This is similar to introducing a new version of an API that remains available alongside its older version.

Why did we choose Kafka and Avro for this blog post? Well, Kafka is simply the most popular choice for an event hub at the moment, and it is often used together with Avro. The same principles apply to other serialization/deserialization methods: the essence is to evolve the contract your data exposes to potential consumers. Without such a contract, there is chaos.

Where do you define your compatibility types?

Compatibility types describe sets of guarantees between schemas, thereby declaring whether one schema can read a record written by another. They are not explicitly declared in the core Avro project, but in a Schema Registry such as Confluent Schema Registry, Apicurio, etc.

You can set the compatibility type on a global level. By default, Confluent Schema Registry sets the compatibility type to BACKWARD.

However, you can also set the compatibility type for each event type by linking it to a subject and using the correct naming strategy. To make things a little easier, we recommend using the same compatibility type for all event types on one topic.
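As an illustration, here is a minimal sketch of setting a compatibility type per subject with Confluent's Java Schema Registry client. The registry URL and the subject name are placeholders for your own setup; with TopicRecordNameStrategy (discussed below) the subject combines the topic name and the record name.

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class CompatibilityConfig {
    public static void main(String[] args) throws Exception {
        // Placeholder registry URL; cache up to 100 schema versions locally.
        SchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // With TopicRecordNameStrategy the subject is "<topic>-<record name>",
        // so each event type on the topic gets its own compatibility setting.
        client.updateCompatibility("orders-com.example.OrderPlaced", "FULL_TRANSITIVE");

        // Read the setting back to verify it was applied.
        System.out.println(client.getCompatibility("orders-com.example.OrderPlaced"));
    }
}
```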

Compatibility types

You can link a subject to a topic by using TopicNameStrategy, but we only recommend this for topics that hold state, with one event for each key. Usually these are compacted and contain things like "Upsert" events.

Using this strategy means you have to use the same schema on one topic, which is less convenient for business events: you need to create one schema that covers all event types that can be produced on that topic, which is achievable by using a "union" in Avro.

It’s better to design different events for the same domain entity and group them in one topic to guarantee ordering.

That is why we suggest using TopicRecordNameStrategy for the subject naming. It ensures that both the topic name and the event type determine the subject, so the schema can be different for each event type. The downside is that vendor tooling such as ksqlDB and Imply currently can't handle more than one schema per topic. It's important to know the limitations of the streaming tooling you want to use: such tools promise to be easy to adopt, but they struggle with edge cases.
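To show what this looks like in practice, here is a rough sketch of a producer opting into TopicRecordNameStrategy; the bootstrap servers, registry URL and topic name are placeholders, and the commented-out send only marks where your own events would be produced.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class EventProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");              // placeholder

        // Register value schemas under "<topic>-<record name>" so every event
        // type on the topic can evolve its own schema.
        props.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // producer.send(new ProducerRecord<>("orders", orderId, orderPlacedEvent));
        }
    }
}
```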

Which three rules can you use to define the compatibility type?

1. Use compatibility type FULL_TRANSITIVE for all public topics with infinite retention

Public topics with infinite retention contain business events that express the business behaviour of your organisation. They are useful for triggering other business processes, future migrations, testing new modules, digital twins, analysis, etc.

As a producer of events in our own bounded context, we want the freedom to introduce a new version of the schema first. However, we also want to be able to replay events from the start. That’s why you need the strictest compatibility type for these topics: FULL_TRANSITIVE.

If you need to add a lot of fields, they will have to be defined as optional (with default values), which is not ideal. After a while, you will need to introduce a new event structure on a new topic that everybody has to migrate to. This is a downside of using schema evolution.
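To make this concrete, here is a small sketch using Avro's SchemaCompatibility helper; the OrderPlaced record and its fields are invented for the example. Because the added field has a default value, the check succeeds in both directions, which is exactly what FULL compatibility requires.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class FullCompatibilityCheck {
    public static void main(String[] args) {
        Schema v1 = new Schema.Parser().parse("""
                {"type":"record","name":"OrderPlaced","fields":[
                  {"name":"orderId","type":"string"}
                ]}""");

        // v2 adds an optional field with a default: old readers simply ignore it,
        // and new readers fall back to the default when reading old records.
        Schema v2 = new Schema.Parser().parse("""
                {"type":"record","name":"OrderPlaced","fields":[
                  {"name":"orderId","type":"string"},
                  {"name":"channel","type":["null","string"],"default":null}
                ]}""");

        // FULL compatibility means both directions must be compatible.
        System.out.println(SchemaCompatibility
                .checkReaderWriterCompatibility(v1, v2).getType()); // COMPATIBLE
        System.out.println(SchemaCompatibility
                .checkReaderWriterCompatibility(v2, v1).getType()); // COMPATIBLE
    }
}
```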

If you want to avoid such a migration and still have high retention, consider using upcaster chains. We strongly recommend this talk by A. Evers if you are interested in doing so.

Full transitive

2. Use compatibility type FORWARD_TRANSITIVE for public topics without infinite retention

Introducing a retention period means these events will be deleted in the future, either because they can be reproduced from another datastore or because they are only relevant for a limited amount of time. Replaying the history of the events is also limited to the retention period, so there is less risk of conflicting schemas. For these topics, the FORWARD_TRANSITIVE compatibility type can usually be used: it gives you more flexibility for updates to the schema, and you can still introduce a new version as a producer without having to impact consumers first.

Forward transitive

3. Use compatibility type BACKWARD_TRANSITIVE for internal topics

For internal topics, we suggest using BACKWARD_TRANSITIVE as the compatibility type.

This gives you more flexibility in your own bounded context. The events are internal, which means they are only read by your own applications. It is not a problem to guarantee that your own consumers use the new schema version before, or at the same time as, the producer does.

Backward transitive

What is even better than schema evolution?

Of course, it’s always better not to have to update a schema at all. So, what can you do to reduce the chance that you will have to change event schemas later?

  • Define schemas for each event type, so a schema only evolves if that specific event needs a change.

  • Use multiple smaller event types to reduce the impact on consumer logic when an update occurs. Consumers are often not interested in all event types; they only process the event types they need data from or the ones that trigger new business processes.

  • Create separate state topics. These topics can be used by consumers to construct read models. The topics can be compacted and processing can be idempotent, which makes it easier to reproduce the events with a new schema (see the sketch after this list).

  • Think about your own domain and add all the relevant data for a business event up front. This avoids having to add (optional) fields later.

  • Consider the granularity. This is hard to get right, so spend some time thinking about it. More on that here: https://www.andrewharcourt.com/articles/on-the-granularity-of-events.

  • Data structures don’t change frequently, so changing the structure of a field will also not occur regularly. EDA is not point-to-point: it obliges you to produce data that might be useful in the future. Multiple consumers can be added, events can be replayed, and you might need more data for analysis.
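For the state topics mentioned in the list above, a minimal sketch of creating a compacted topic with the Kafka AdminClient could look like this; the broker address, topic name, partition count and replication factor are placeholders.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateStateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Compacted topic: Kafka keeps the latest event per key, so consumers
            // can rebuild their read models from it at any time.
            NewTopic stateTopic = new NewTopic("customer-state", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Set.of(stateTopic)).all().get();
        }
    }
}
```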

An organisation can learn a lot from its data, especially from data in motion. If you need help with structuring your data, feel free to reach out!

Written by Kris Van Vlaenderen