There is a plethora of Message Brokers and Queuing systems, each with its own set of nuances that come from how it is designed to work. For someone like me who dabbles in varied technologies, it gets really hard to remember some key details (and you know that the devil is in the details). This is a cheat-sheet of sorts that I prepared for quick reference, reflection, introspection and analysis of any system design that employs Kafka in its Architecture.
Whether you are a Developer, an Architect or an Engineering Leader, this should help you get your understanding right about how Kafka works and the components involved in its ecosystem.
Some Key Pointers
- Kafka provides mechanisms (idempotent producers and transactions) to achieve exactly-once delivery semantics, ensuring that messages are processed only once (see the exactly-once producer sketch after this list).
- Apache ZooKeeper is used for cluster management, metadata storage, leader election etc. Starting with Kafka v2.8.0, Kafka can run without Apache ZooKeeper by using a built-in Kafka Controller quorum (KRaft mode) for these tasks.
- Kafka brokers are individual Kafka server instances that store and serve data.
- Each Kafka Broker can host multiple Topics.
- Each Topic can have multiple Partitions.
- Data in Kafka is organized into Topics, which act as logical channels for data streams.
- Each Topic is divided into Partitions, which allow data to be distributed and parallelized across multiple Brokers. Partitions are the unit of parallelism in Kafka.
- The number of Partitions for a Topic is decided at the time of Topic creation.
- You can increase the number of partitions for an existing Kafka topic dynamically, using command-line tools like `kafka-topics.sh` or programmatically via the `AdminClient` API (see the sketch after this list). Increasing partitions triggers a consumer group rebalance, where partitions are redistributed among the consumers in the group.
- You cannot decrease the number of partitions for a topic without deleting and recreating the topic (which results in data loss).
- It is a common best practice to over-partition initially, based on future growth estimates, or to plan a data migration strategy to a new topic in case the current partition count becomes insufficient.
- Message = Key + Value + Timestamp.
- A message's Key is hashed, and that hash decides which Partition the message is written to (see the partitioner sketch after this list).
- A message Key can be null. Null-key messages are spread across Partitions (older clients round-robin; clients since v2.4 use a "sticky" partitioner that fills a batch for one partition before moving to the next), which balances the load across all available partitions but does not guarantee message ordering.
- Typically, Consumers split the workload amongst themselves based on Partitions. For this reason it is important that the "Number of Consumers" in a group is NO GREATER THAN the "Number of Partitions": any Consumers beyond the Partition count will sit idle.
- Kafka maintains an offset per Partition for each consumer group, to keep track of the last consumed message. Consumers can specify the offset from which they want to start consuming, and whether they want all retained messages or only new ones (see the consumer sketch after this list).
- Kafka supports message compression to reduce network and storage overhead.
- Common compression codecs include GZIP, Snappy, LZ4 and ZSTD.
- Message compression is not enabled by default.
- Note that message compression trades reduced network bandwidth and storage for increased compute on both the producing and consuming side. Know what you trade in your decision making.
- Message compression can be done:
    - at Producer-side configuration, or
    - at Topic-level configuration, or
    - at Broker-level configuration.
- When a message is compressed at the Producer side, it stays compressed in the Topic and is automatically decompressed at the Consumer side by the Kafka client library (the codec used is recorded in each record batch, so the Consumer knows how to decompress it). This is a good option when you are thinking of message compression (see the compression sketch after this list).
- Rebalancing is the process of redistributing partitions among consumers in a consumer group (see the rebalance-listener sketch after this list) when:
    - Consumers join or leave the group,
    - Topics/partitions are added or removed, or
    - the subscription pattern changes.
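
Below are a few short sketches for the pointers above. They all assume a broker reachable at `localhost:9092`; topic names, keys, group IDs and transactional IDs are hypothetical.

First, exactly-once semantics on the producer side rests on idempotence plus transactions. A minimal sketch (note that full end-to-end exactly-once processing also requires downstream consumers to read with `isolation.level=read_committed`):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotence de-duplicates broker-side retries within a producer session.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // A transactional.id (hypothetical here) extends the guarantee across producer restarts.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-tx-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("orders", "order-42", "created")); // hypothetical topic/key/value
            producer.commitTransaction(); // messages become visible atomically
        }
    }
}
```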
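
Increasing the partition count of an existing Topic can also be done programmatically. A sketch using the Java `AdminClient` (topic name and target count are made up; remember this is one-way, you cannot shrink back):

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Raise "orders" to 12 partitions; this fails if the topic already has 12 or more.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(12))).all().get();
        }
    }
}
```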
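
On how a Key maps to a Partition: for non-null keys the default partitioner hashes the serialized key bytes with murmur2 and takes the result modulo the partition count. A sketch of that rule, reusing Kafka's own `Utils` helpers (and assuming the key was serialized as UTF-8 bytes, as `StringSerializer` does):

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyPartitioning {
    // partition = positive(murmur2(keyBytes)) % numPartitions
    static int partitionForKey(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always lands on the same partition for a fixed partition count,
        // which is why adding partitions later can break key-based ordering assumptions.
        System.out.println(partitionForKey("order-42", 6));
        System.out.println(partitionForKey("order-42", 6)); // same output as above
    }
}
```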
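
On the consumer side, `auto.offset.reset` controls where a group starts when it has no committed offset: `earliest` replays all retained messages, while `latest` (the default) delivers only new ones. A minimal sketch with a hypothetical group and topic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-replay"); // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        r.partition(), r.offset(), r.key(), r.value());
            }
        }
    }
}
```

Alternatively, `seek(...)` / `seekToBeginning(...)` let a consumer jump to an explicit offset after partitions are assigned.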
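
Producer-side compression is a one-line config. A sketch noting the three levels at which `compression.type` can be set:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class CompressionSettings {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Producer level: batches are compressed before they leave the client.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // Topic level: set compression.type=lz4 in the topic's configuration.
        // Broker level: set compression.type in server.properties; its default
        // value "producer" keeps whatever codec the producer used.
        System.out.println(props);
    }
}
```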
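
Finally, you can observe (and react to) rebalances by attaching a `ConsumerRebalanceListener` when subscribing. A minimal sketch:

```java
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

public class LoggingRebalanceListener implements ConsumerRebalanceListener {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before ownership moves away: commit offsets / flush state here.
        System.out.println("Revoked: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called once the new assignment is final.
        System.out.println("Assigned: " + partitions);
    }
}
```

Wire it in with `consumer.subscribe(List.of("orders"), new LoggingRebalanceListener());` (topic name hypothetical).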
