A Strategy for dealing with Breaking Schema
Break Stuff
On Kafka, schemas define the contracts for business events, state events, and internal application events. Schemas define what structure the events must have; this is enforced by the code. Consumers also use the schemas to know the exact structure of events. A schema registry facilitates defining what kind of schema evolution is allowed between different schema versions.
These schema evolution rules are used to validate changes to a registered schema. However, introducing a breaking change cannot always be prevented. It’s just one of those days…
Intervention is needed to let producers and consumers evolve gracefully towards the new schema.
We at Cymo present a strategy on how such a breaking schema change can be introduced in a controlled manner.
Don’t Stop Me Now
The end goal is for the new, breaking schema to be used by all parties involved (i.e., the producer as well as all consumers) within an acceptable time frame.
The producer and each consumer should be able to migrate to the new schema according to their own individual schedule, reducing impact and interdependencies. As such, we aim for minimal operational impact on all ends, if any. All parties are having a good time…
The producer owner is responsible for ensuring that messages keep being produced properly during the migration, and should define a migration grace period, preferably agreed upon with all consumer owners for a smooth transition. After this period, the migration solution is either decommissioned, or its ownership (read “responsibility”) is shifted to the consumer who needs it the longest.
Show Me The Way
1. Consider a producer P1 producing events on a topic T1 following schema S1, and consumers C1 and C2 consuming those events from that topic.
Figure 1: A producer P1 produces messages on topic T1 following schema S1. Consumers C1 and C2 consume those messages.
2. A new topic T2 is introduced; this topic follows the breaking schema S2. Initially, that topic T2 is empty.
Figure 2: A new topic T2 is introduced.
3. If replayability of all events is desired (i.e., the history of existing records must be maintained), a dedicated migration component is introduced. It reproduces all existing records from topic T1 on topic T2 by transforming them into messages that follow schema S2.
The same keys are to be used so that, provided topic T2 has the same number of partitions as topic T1 and the same partitioner is used, messages from one specific partition of topic T1 are put on the same partition of topic T2. This, in turn, ensures that the original message ordering is preserved. Moreover, the reproduced messages are to retain the original timestamp in order to preserve the semantics of the messages. Note that producer P1 can meanwhile still produce new messages that may get consumed by existing consumers, i.e., there is no downtime of any of the existing components. Any message produced by producer P1 will be picked up by the migration component and will be reproduced on topic T2.
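As a rough illustration of this step, the sketch below models the reproduction in plain Python, without a Kafka client. Everything in it is an assumption made for the example: the `Record` type, the hypothetical breaking change (a single `name` field in S1 split into `first_name` and `last_name` in S2), and `partition_for`, which uses CRC32 only as a stand-in for the producer's real partitioner.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    key: str
    value: dict
    timestamp_ms: int


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for the producer's partitioner (assumption: any stable hash
    # of the key; CRC32 is used here purely for illustration).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_s1_to_s2(value: dict) -> dict:
    # Hypothetical breaking change: S1 "name" becomes S2 "first_name"/"last_name".
    first, _, last = value["name"].partition(" ")
    return {"first_name": first, "last_name": last}


def reproduce(record: Record) -> Record:
    # Key and timestamp are carried over unchanged; only the value is mapped.
    return Record(record.key, map_s1_to_s2(record.value), record.timestamp_ms)


def migrate(t1: Dict[int, List[Record]], num_partitions: int) -> Dict[int, List[Record]]:
    # Read every T1 partition in order and produce the mapped record on T2.
    # T2 is assumed to have the same number of partitions as T1.
    t2: Dict[int, List[Record]] = {p: [] for p in range(num_partitions)}
    for p in sorted(t1):
        for record in t1[p]:
            out = reproduce(record)
            t2[partition_for(out.key, num_partitions)].append(out)
    return t2
```

Because the key and the partition count are unchanged, each record lands on the partition it came from, so the per-partition order on T2 mirrors that of T1.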
Figure 3: A dedicated migration component is introduced.
Figure 4: The migration component reproduces the messages from topic T1 on topic T2.
4. An updated implementation of producer P1 is deployed. This updated implementation no longer uses schema S1 nor topic T1, but will produce messages on topic T2 using schema S2 instead.
Figure 5: Producer P1 gets updated and no longer produces on topic T1.
5. Producer P1 can only produce new messages on topic T2 as soon as the migration component has reproduced all messages from topic T1 on topic T2.
This sequential dependency is necessary in order to guarantee that the message ordering is maintained between the original messages and any new message that the producer might produce.
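One way to check this precondition is to compare, per partition of topic T1, the end offset with the offset committed by the migration component. The sketch below assumes the old version of producer P1 has already been stopped (so the end offsets no longer move) and that both sets of offsets have been fetched from the cluster beforehand; the function names are hypothetical.

```python
from typing import Dict


def migration_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    # Lag per T1 partition: records written but not yet reproduced on T2.
    # A partition without a committed offset is treated as not yet started.
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


def safe_to_switch_producer(end_offsets: Dict[int, int], committed: Dict[int, int]) -> bool:
    # The updated producer may start writing to T2 only when no partition lags.
    return all(lag <= 0 for lag in migration_lag(end_offsets, committed).values())
```

For example, `safe_to_switch_producer({0: 10, 1: 5}, {0: 10, 1: 3})` is false, because partition 1 still has two records to reproduce.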
Figure 6: Producer P1 produces new messages on topic T2.
6. The dedicated migration component has reproduced all existing events in topic T1 on topic T2 (mapping from schema S1 to schema S2 while doing so); its “mode of operation” is then switched: it no longer reproduces from topic T1 to topic T2. Instead, it will from now on consume messages from topic T2 and reproduce them on topic T1.
It is crucial that these two opposing processes never run simultaneously, which is why they are considered together, as one responsibility of a single component. As a single component, it can ensure that it never runs in both directions at the same time. Moreover, once the reproduction from topic T1 to topic T2 is complete, the migration component must never reproduce from topic T1 to topic T2 again. Hence, this migration component should be set up in such a way that this scenario cannot occur.
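A minimal sketch of such a one-way switch is shown below. The `Mode` and `MigrationComponent` names are hypothetical, and the sketch assumes the current mode is stored durably somewhere and passed in on startup, so that a restart cannot silently fall back to the T1-to-T2 direction.

```python
from enum import Enum
from typing import Tuple


class Mode(Enum):
    FORWARD = "T1->T2"  # initial backfill from the old topic to the new one
    REVERSE = "T2->T1"  # afterwards: new messages are mirrored back to T1


class MigrationComponent:
    def __init__(self, persisted_mode: Mode = Mode.FORWARD) -> None:
        # Assumption: persisted_mode is loaded from durable storage on startup.
        self._mode = persisted_mode

    @property
    def mode(self) -> Mode:
        return self._mode

    def route(self) -> Tuple[str, str]:
        # Exactly one (source, target) pair is active at any time.
        return ("T1", "T2") if self._mode is Mode.FORWARD else ("T2", "T1")

    def switch(self, target: Mode) -> None:
        # The switch is one-way: once in REVERSE, FORWARD is refused.
        if self._mode is Mode.REVERSE and target is Mode.FORWARD:
            raise RuntimeError("reproducing from T1 to T2 again is not allowed")
        self._mode = target
```

Since a single object holds a single mode, both directions cannot be active together, and any attempt to return to the forward direction fails loudly rather than duplicating messages.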
Figure 7: The migration component reproduces new messages from topic T2 on topic T1.
7. As of now, all messages produced on topic T2 by producer P1 are reproduced with identical semantics on topic T1 by the migration component.
Consequently, consumers can start migrating to topic T2 as they see fit in light of their planning and workload. A migrated consumer uses the timestamp of the last message it consumed from topic T1 to determine where to start consuming messages from topic T2.
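The timestamp-based handover can be sketched as a lookup per partition. The function below is hypothetical and works on a plain list of timestamps instead of a real topic; it assumes the timestamps within a partition of topic T2 are non-decreasing. Messages that share the timestamp of the last consumed message are a corner case: the default skips them, while `at_least_once=True` re-reads them, which suits an idempotent consumer.

```python
from bisect import bisect_left, bisect_right
from typing import List


def resume_index(t2_timestamps: List[int], last_seen_ts: int, at_least_once: bool = False) -> int:
    # Position in a T2 partition from which a migrated consumer resumes.
    # at_least_once=False: skip everything up to and including last_seen_ts.
    # at_least_once=True: restart at the first record with that timestamp,
    # accepting duplicates instead of risking a skipped record.
    if at_least_once:
        return bisect_left(t2_timestamps, last_seen_ts)
    return bisect_right(t2_timestamps, last_seen_ts)
```

With a real cluster, the Java consumer's `offsetsForTimes` offers a comparable lookup: it returns the earliest offset whose timestamp is greater than or equal to the given one, which corresponds to the `at_least_once=True` variant here.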
Figure 8: Consumer C2 migrates and consumes messages from topic T2.
As shown above, consumer C2 has been migrated earlier than consumer C1. Both consumers will still be able to consume messages produced by producer P1.
8. All consumers can migrate at their own respective pace to stop consuming from topic T1, and consume from topic T2 instead.
Figure 9: Consumer C1 migrates and consumes messages from topic T2.
9. All consumers have migrated, so the migration component and topic T1 can be fully decommissioned.
Figure 10: Topic T1 and the migration component are decommissioned.
Note: If one or more consumers have not migrated within the agreed time frame, i.e. topic T1 is still being consumed, the producer (the owner of topic T1) may wish to hand over ownership of topic T1 and the migration component to that consumer (or one of those consumers). A formal process or agreement between the different stakeholders may be required to set this up.
Another approach to shifting ownership of the migration logic could be to have consumer owners incorporate the mapping logic of the migration component into their respective consumer components. Clearly, this is a smoother approach if the mapping logic is easily embedded into the consumer, which typically depends on the technologies used in the mapping logic and the consumer. In this approach, consumer owners can reduce the impact on their planning while their consumer is already consuming the new topic T2, using the new schema S2.
Parallel Universe
A similar strategy can be applied in other scenarios, e.g. when renaming a topic.
I swear it’s everywhere, it’s everything …
Conversations with the devil
You may have sensed that many parties can be involved in the migration of a system. Sometimes, parties do not know each other and only know about the API (i.e., topics and schemas) they are producing to or consuming from. Any migration should be done as carefully as possible, by having as little impact as possible on other parties involved. As the party that is introducing the breaking change, you want to avoid the surprise of coming out of the bushes with a red dot…
At Cymo, we believe that communication and alignment between all parties will help to understand all parties’ concerns, which in turn will smooth the migration of any breaking change.
Written by Kristof Van de Voorde