How to delete immutable data with crypto-shredding
We can use crypto-shredding to delete immutable data from a system. With this technique, we encrypt the data so that it cannot practically be recovered without the right key. Deleting the key then has the same effect as deleting the data. This is also useful to protect sensitive information from unauthorized access. In this tutorial, we will discuss how we can do this. Let’s get started!
Why deleting data is important...
We often run into a problem when working with event-based systems and immutable data: the data is difficult to delete, yet we still have to delete it to comply with laws like the GDPR in Europe. Most of the time, the challenge is deleting the Personally Identifiable Information (PII) without compromising the rest of our systems.
Restreaming events with a filter
We can delete the PII by restreaming the entire topic or datastream with a filter. This works, but it’s a convoluted process and takes a lot of effort. This solution also introduces a major problem to event-based systems.
In event-based systems, the sequence of the events determines their meaning. In other words: if we remove an event from the chain, we risk changing the meaning of the events. Let’s look at an example using our cat Felix to illustrate this:
1. TrayPresented
```json
{
  "trayName": "yummy"
}
```
2. FoodPlacedOnTray
```json
{
  "trayName": "yummy",
  "foodItem": "chicken"
}
```
3. AnimalJumpedOnTray
```json
{
  "trayName": "yummy",
  "animalType": "Cat"
}
```
4. CatChasedOffTray
```json
{
  "trayName": "yummy",
  "cat": {
    "name": "Felix"
  }
}
```
5. TrayPlacedInOven
```json
{
  "trayName": "yummy",
  "ovenName": "HotHole"
}
```
If Felix exercised his rights to remove his PII and be forgotten (we’re aware that the GDPR does not cover animals, but just bear with us for this), we would have to delete the fourth event.
In this case, we'd get an entirely different story than the original one. In the original sequence, we'd get delicious (but potentially food-unsafe) roast chicken, along with a slightly grumpy cat. With this new sequence, we’d get a cooked chicken with a side of brutally murdered, crispy cat.
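To make the problem concrete, here is a minimal sketch that replays the five events twice: once in full, and once restreamed with a filter that drops the fourth event. The `replay` function is a hypothetical projection invented for this illustration; it is not part of any real system.

```python
# The five events from the example, in order: (event type, payload).
EVENTS = [
    ("TrayPresented", {"trayName": "yummy"}),
    ("FoodPlacedOnTray", {"trayName": "yummy", "foodItem": "chicken"}),
    ("AnimalJumpedOnTray", {"trayName": "yummy", "animalType": "Cat"}),
    ("CatChasedOffTray", {"trayName": "yummy", "cat": {"name": "Felix"}}),
    ("TrayPlacedInOven", {"trayName": "yummy", "ovenName": "HotHole"}),
]


def replay(events):
    """Fold the events into what ends up in the oven (hypothetical projection)."""
    on_tray = []    # what is currently on the tray
    in_oven = None  # stays None until the tray is placed in the oven
    for event_type, payload in events:
        if event_type == "FoodPlacedOnTray":
            on_tray.append(payload["foodItem"])
        elif event_type == "AnimalJumpedOnTray":
            on_tray.append(payload["animalType"].lower())
        elif event_type == "CatChasedOffTray":
            on_tray.remove("cat")
        elif event_type == "TrayPlacedInOven":
            in_oven = list(on_tray)
    return in_oven


original = replay(EVENTS)

# Restream with a filter that drops the event containing Felix's name.
filtered = replay([e for e in EVENTS if e[0] != "CatChasedOffTray"])

print(original)  # ['chicken']
print(filtered)  # ['chicken', 'cat']
```

The events that remain are unchanged, yet the state derived from them is different. That is the risk of removing events from the chain.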
Thankfully, there are better ways to handle this. Enter crypto-shredding!
What is crypto-shredding?
So, what is crypto-shredding? Yet another one of those virtual currencies? Not at all!
The name derives from the way this technique works. We essentially use a symmetric key encryption algorithm to encrypt (part of) the PII. “Symmetric” means that we can use the same key to encrypt and decrypt the data, just like a normal lock.
If we use crypto-shredding, the encrypted data is rendered unusable without a matching key. We won’t have to change anything about the event, what it describes, or the sequence it’s a part of. Just make sure to store the key and the encrypted info in separate places!
To access the data, we can simply combine the key with the encrypted data to decrypt it. To forget or delete the data, for example following a GDPR request, we only need to throw away the key.
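The encrypt, decrypt, and forget cycle can be sketched in a few lines. Note the assumptions: to stay within the Python standard library, this sketch uses a toy SHA-256 keystream cipher as a stand-in for a real symmetric cipher. It is for illustration only and is not suitable for production; a real system would use a vetted authenticated cipher such as AES-GCM from a maintained library. The names `key_store`, `protect`, `reveal`, and `forget` are hypothetical.

```python
import hashlib
import secrets
from typing import Dict, Optional


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and nonce (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy symmetric encryption: random nonce followed by plaintext XOR keystream."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    """Inverse of encrypt: the same key regenerates the same keystream."""
    nonce, ciphertext = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


# Keys live apart from the encrypted data (here: a separate in-memory dict).
key_store: Dict[str, bytes] = {}


def protect(key_id: str, value: str) -> bytes:
    """Encrypt a value with the key for key_id, creating the key if needed."""
    key = key_store.setdefault(key_id, secrets.token_bytes(32))
    return encrypt(key, value.encode("utf-8"))


def reveal(key_id: str, blob: bytes) -> Optional[str]:
    """Decrypt a value, or return None when the key has been shredded."""
    key = key_store.get(key_id)
    if key is None:
        return None
    return decrypt(key, blob).decode("utf-8")


def forget(key_id: str) -> None:
    """Crypto-shred: throw away the key; the stored ciphertext is untouched."""
    key_store.pop(key_id, None)


stored = protect("uuid1", "Felix")  # this is what goes into the immutable event
before = reveal("uuid1", stored)    # "Felix"
forget("uuid1")                     # e.g. after a GDPR request
after = reveal("uuid1", stored)     # None: the data can no longer be read
print(before, after)
```

The immutable record (`stored`) never changes. Only the key store is mutable, which is why the keys and the encrypted data must live in separate places.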
Crypto-shredding is also much more flexible. We can choose to encrypt specific fields, entire events, and everything in between. Or, we can encrypt different parts of the information in the same event using different keys. For example, we could assign each cat their own key.
Bonus: by controlling access to the keys, we can also control who can access the sensitive information while not compromising the event opacity. In other words, people can work on events without knowing the data in them. This is especially useful if we want to create an audit trail or establish a chain of custody.
How to implement crypto-shredding?
Now that we have demonstrated the usefulness of crypto-shredding, let’s take a closer look at how to implement it using a five-step process:
1. Establish which information is sensitive. This step is critical: if we don't know what needs to be protected, we can't properly protect it.
2. Determine the required access contexts and granularity. We can have various groups of sensitive data in an event with different contexts. For example, let’s say we have an employee data set that contains both personal information (like first and last names) and other information (like wages). Some use cases may require one kind of information without needing to know the other, so we can consider them as different contexts and encrypt them using different keys.
3. Establish an event schema. We can do this by wrapping the sensitive information with a data structure. Try to make this as precise as possible, since we don’t want to lose the protection that schemas give us. At a minimum, we should include a reference to the key and the encrypted data. We recommend also specifying the field’s name, along with its data type.
4. Implement the producer logic. Before publishing, we should encrypt the sensitive data with a randomly generated key for each context. Assign a reference ID to each key, and publish it in the data record. We can use various methods to store the keys:
   - An easy Kafka pattern is to store them in a separate compacted topic for each context. This makes them easy to join as part of streaming logic, and removing a key will be as simple as publishing a tombstone record.
   - If we need more control over the keys or have to deal with other compliance requirements, we can also keep the keys in a separate location that can be accessed through APIs or other means.
   - Finally, we can also consider using more complex patterns with primary and secondary keys or other setups. After all, we have the full range of cryptographic design patterns at our disposal.
5. Consume the data. We can use the encryption key to decrypt the sensitive information we need, and use design patterns to make the keys otherwise inaccessible. Consider standardising the chosen approach company-wide, as this prevents a lot of redundant effort.
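The compacted-topic pattern from the producer step can be simulated without a broker. The sketch below models one key topic per access context as an append-only list, where a record with a `None` value acts as a tombstone. The context names (`identity`, `payroll`), the key ID `employee-42`, and all function names are assumptions made for this illustration; no real Kafka client is involved.

```python
import secrets
from typing import Dict, List, Optional, Tuple

# A record is (key_id, key_bytes); key_bytes of None is a tombstone.
Record = Tuple[str, Optional[bytes]]


def compact(log: List[Record]) -> Dict[str, bytes]:
    """Keep only the latest value per key_id and drop tombstoned keys."""
    latest: Dict[str, bytes] = {}
    for key_id, value in log:
        if value is None:
            latest.pop(key_id, None)  # tombstone: the key is gone
        else:
            latest[key_id] = value
    return latest


# One simulated key topic per access context.
key_topics: Dict[str, List[Record]] = {"identity": [], "payroll": []}


def publish_key(context: str, key_id: str) -> bytes:
    """Generate a random key and append it to the context's key topic."""
    key = secrets.token_bytes(32)
    key_topics[context].append((key_id, key))
    return key


def publish_tombstone(context: str, key_id: str) -> None:
    """Append a tombstone so the key disappears from the compacted view."""
    key_topics[context].append((key_id, None))


publish_key("identity", "employee-42")
publish_key("payroll", "employee-42")

# Shred only the identity key; data in the payroll context stays readable.
publish_tombstone("identity", "employee-42")

identity_keys = compact(key_topics["identity"])
payroll_keys = compact(key_topics["payroll"])
print("employee-42" in identity_keys)  # False
print("employee-42" in payroll_keys)   # True
```

One caveat about the real thing: Kafka compacts logs asynchronously, so the old key record can remain on disk for some time after the tombstone is published. If a deletion deadline applies, the topic's compaction and retention settings need to be tuned accordingly.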
Example
If we reuse the event chain from our previous example, the fourth step where Felix is identified would look like this with encryption:
```json
{
  "trayName": "yummy",
  "cat": {
    "sensitive_data": {
      "key_id": "uuid1",
      "field": "name",
      "type": "string",
      "data": "vclV7MmS5TVz7ooumAr8KQ=="
    }
  }
}
```
The matching key record, stored separately from the event, has an ID (uuid1) and a value (0123456789123456). This assumes that we stored the cipher used together with the key.
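On the consumer side, the wrapper around the sensitive field can be parsed into a typed structure before any decryption takes place. The sketch below does only that. The `SensitiveData` class is a hypothetical name, and treating the `data` field as base64 is an assumption based on how the value looks. Decryption is left out on purpose, since the article does not say which cipher was used.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SensitiveData:
    """Typed view of the wrapper around an encrypted field."""
    key_id: str  # reference to the key, which is stored elsewhere
    field: str   # name of the original field
    type: str    # data type of the original value
    data: bytes  # ciphertext, assumed to be base64-encoded in the event

    @classmethod
    def from_dict(cls, raw: dict) -> "SensitiveData":
        missing = {"key_id", "field", "type", "data"} - raw.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return cls(raw["key_id"], raw["field"], raw["type"],
                   base64.b64decode(raw["data"], validate=True))


event = json.loads("""
{
  "trayName": "yummy",
  "cat": {
    "sensitive_data": {
      "key_id": "uuid1",
      "field": "name",
      "type": "string",
      "data": "vclV7MmS5TVz7ooumAr8KQ=="
    }
  }
}
""")

envelope = SensitiveData.from_dict(event["cat"]["sensitive_data"])
print(envelope.key_id, envelope.field, len(envelope.data))
```

Validating the wrapper first keeps the schema guarantees intact. A consumer can reject malformed events early, and it can still process the event by `key_id` and `field` without ever holding the key.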
We’ve also written a small demo on GitHub, if you would like to see a more elaborate example of crypto-shredding. You can find it here.
For those wondering: thanks to crypto-shredding, Felix was saved from the oven and is in great condition. :)
Written by Bryan De Smaele