Migrating Kafka topics without downtime
Each kafka topic defines the number of partitions and replication factors when it’s created. However, once a topic is created, the partition count cannot be changed without affecting the ordering guarantees of the kafka partitions since kafka uses the following formula to calculate which partition a record should go to:
partition_id = partition_key % number of partitions
Kafka partitions are the gateway to concurrency and scalability. Having too many partitions cause a management overhead to manage the partitions, the sync between the replicas and choosing the leader, etc on the control plane and having too little partitions can bottleneck the concurrency metrics of a consumer group as there can be only one consumer per consumer group reading from a single partition.
I didn’t find any direct tool which can do this without downtime. If you find something from established sources, then this article is irrelevant.
Versioning topics#
We will version each topic by adding a version number suffixed to the original topic name. Every time we need to migrate the topic to another, we increase the version number suffixed to the topic name.
For example, if the original topic name is myTopic
and we want to migrate the topic, the versioning that will happen is myTopic.v0
-> myTopic.v1
. Since myTopic
didn’t have a suffix, by default, v0
is added to it. If we want to migrate again, then it will transition from myTopic.v1
-> myTopic.v2
.
Overall steps to migrate a kafka topic#
- Create a new topic suffixed with a increased version compared to the previous topic
- Inform your publisher that a newer version of the topic has been created
- Point your publisher to the new topic with the latest prefix
- Drain the old topic by the respective consumer groups listening on the same topic
- Once the older topic is drained, point the consumers to the new topic
How do we inform the publisher that a newer version of the topic has been created?#
We use Kafka itself to store all the versions of the topic. We can call this topic _meta_versions
. Whenever a newer version is required, push the version to the _meta_versions
topic.
In every publisher, run an internal consumer in a consumer group specific to the publisher only that reads from the _meta_versions
topic. This internal consumer shouldn’t be confused with the actual consumer which developers use to consume the actual messages. The internal consumer can read all the versions from the _meta_versions
topic till there are no more messages left in the topic for a certain duration of time. Set the topic version to that version in the publisher.
Even if there are multiple versions in the topic before the publisher starts, we need to ensure that at every instance, the publisher is writing only to the latest version in the topic.
Synchronization between publishers and its internal consumer, publishing to the _meta_versions
topic and configured retention period for the topic differs based on the use case.
How does the consumer group behave when a newer version of the topic has been created?#
The publisher shifts immediately to the newer version and starts publishing there. To maintain the ordering guarantees, the consumer has to drain the partitions in the older topic before moving on to the newer version.
How can the consumer detect that it has drained, or in other words, read all the messages in the kafka topic?
There are multiple options possible here.
First option is to run the consumer polling on a topic till no messages are received till the timeout is exceeded. However, this may also result in bugs where the timeout is breached because of an intermittent network issue.
The second option is to fetch the highest offset of the partition of the topic. There are 2 types of offsets, watermark offsets and end of log offsets. Use the watermark offset and ensure that the consumers consume up to the offset.
We can check the offset till which the consumer has consumed for a partition in a topic. Comparing the consumer’s offset and the topic partition’s offset can be a good indication that the consumer has drained the topic.
Once the draining is complete, the consumer can move on to the newer version of the topic.
I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!