Kafka System Overview - A Cheatsheet


There is a plethora of Message Brokers and Queuing systems, each with its own set of nuances that come from how it is designed to work. For someone like me who dabbles in varied technologies, it gets really hard to remember some key details (and you know that the devil is in the details). This is a cheatsheet of sorts that I prepared for quick reference when reflecting on or analysing a system design that employs Kafka in its Architecture.

Whether you are a Developer, an Architect or an Engineering Leader, this should help you get your understanding right about how Kafka works and the components involved in its ecosystem.

Some Key Pointers

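  1. Kafka provides mechanisms (idempotent Producers and transactions) to achieve exactly-once semantics, so that the effect of a message is applied only once even when a Producer retries. The guarantee holds within Kafka itself, for example in read-process-write pipelines between Topics; side effects in external systems still need their own idempotency.
  2. Apache ZooKeeper has traditionally been used for cluster management, metadata storage, controller election etc. Starting with Kafka v2.8.0, Kafka can be run without ZooKeeper in KRaft mode, where a quorum of Kafka Controllers takes over these tasks. KRaft was early access in v2.8.0, was declared production-ready in v3.3, and ZooKeeper support was removed altogether in v4.0.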
  3. Kafka brokers are individual Kafka server instances that store and serve data.
  4. Data in Kafka is organized into Topics, which act as logical channels for data streams. Each Topic is divided into Partitions, which allow data to be distributed and parallelized across multiple Brokers. Partitions are the unit of parallelism in Kafka.
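  5. A Message (also called a Record) is composed of a Key, a Value and a Timestamp, plus optional Headers.
  6. A message's key is hashed, and the hash value decides which partition the message is put into. Messages with the same key therefore land in the same partition, as long as the partition count stays the same. Messages without a key are spread across partitions by the Producer.
  7. It is important to remember that the number of Partitions for a Topic can be increased after the Topic is created, but never decreased. Increasing it changes the key-to-partition mapping for new messages, which can break per-key ordering, so it pays to plan the partition count up front.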
  8. Typically, Consumers in the same Consumer Group split the workload amongst themselves based on Partitions, with each Partition assigned to exactly one Consumer in the group. For this reason it is important that the "Number of Consumers" in a group is NO GREATER THAN the "Number of Partitions". Any Consumers beyond the count of Partitions will sit idle.
  9. Kafka maintains a committed offset for each Consumer Group, per Partition, to keep track of the last consumed message. Consumers can seek to a specific offset from which they want to start consuming messages. When a group has no committed offset yet, the auto.offset.reset setting decides whether it gets all the retained messages (earliest) or only the new messages (latest).
  10. Kafka supports message compression to reduce network and storage overhead. Common compression codecs include GZIP, Snappy, LZ4 and Zstandard (zstd). Message compression is not enabled by default.
  11. Message compression can be configured on the Producer, at the Topic level or at the Broker level (the setting is named compression.type in all three places).
  12. Note that message compression implies reduced network bandwidth and storage usage, but increased CPU usage. Know what you trade in your decision making.
  13. When a message is compressed at Producer-side, it remains compressed in the Topic (provided the Topic/Broker compression setting is left at its default value of producer) and is automatically decompressed at Consumer-side by the Kafka client library, if the library supports the compression codec that was used. This is a good option when you are thinking of message compression (opinionated - please consult your Architect/Tech-head before you take this call!).
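
To make points 6 and 7 concrete, here is a minimal Python sketch of key-based partitioning. It is only an illustration: the function name pick_partition and the sample keys are made up, and zlib.crc32 is a stand-in hash (Kafka's Java client uses a murmur2 hash for this step). The idea is the same though: hash the key, then take the result modulo the number of partitions.

```python
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition: hash the key, then take it modulo the partition count."""
    # Stand-in hash for illustration; Kafka's Java client uses murmur2, not CRC32.
    return zlib.crc32(key) % num_partitions

keys = [b"user-1", b"user-2", b"user-3", b"user-4"]  # hypothetical message keys

# With a fixed partition count, the same key always maps to the same partition.
before = {k: pick_partition(k, 3) for k in keys}

# With a different partition count the modulo changes, so a key may land elsewhere.
after = {k: pick_partition(k, 6) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print("3 partitions:", before)
print("6 partitions:", after)
print("keys that changed partition:", moved)
```

Any key listed in the last line would have its newer messages written to a different partition than its older ones, which is why changing the partition count of a keyed Topic deserves some thought.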
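
Point 8 (idle Consumers) can be sketched in a few lines of Python. This is a simplified stand-in that deals partitions out one at a time; it is not Kafka's actual assignment logic, and the function and consumer names are hypothetical. It only shows why a fourth Consumer has nothing to do on a three-partition Topic.

```python
def assign_partitions(partitions, consumers):
    """Deal partitions out to consumers one at a time (simplified round-robin)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Each partition goes to exactly one consumer in the group.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2]                 # a Topic with 3 partitions
consumers = ["c1", "c2", "c3", "c4"]   # a group with 4 consumers

assignment = assign_partitions(partitions, consumers)
idle = [c for c, owned in assignment.items() if not owned]

print(assignment)  # {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
print(idle)        # ['c4']
```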
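
The offset behaviour in point 9 is easier to picture with a toy model. The class below is an in-memory stand-in for a single Partition, written only for illustration: PartitionLog and its methods are hypothetical names, not part of any Kafka client. The earliest/latest values mirror the two common choices for where a group with no committed offset starts reading.

```python
class PartitionLog:
    """Toy in-memory model of one partition: an append-only list of messages."""

    def __init__(self):
        self.messages = []
        self.committed = {}  # group id -> next offset to read

    def append(self, value):
        self.messages.append(value)
        return len(self.messages) - 1  # offset of the message just written

    def poll(self, group, reset="latest"):
        if group in self.committed:
            start = self.committed[group]      # resume where the group left off
        elif reset == "earliest":
            start = 0                          # replay everything still in the log
        else:
            start = len(self.messages)         # only messages written from now on
        batch = self.messages[start:]
        self.committed[group] = len(self.messages)  # commit after reading
        return batch

log = PartitionLog()
log.append("a")
log.append("b")

print(log.poll("replay", reset="earliest"))  # ['a', 'b']
print(log.poll("live"))                      # []
log.append("c")
print(log.poll("live"))                      # ['c']
print(log.poll("replay"))                    # ['c']
```

Note that the two groups track their positions independently, which is what lets several applications read the same Topic at their own pace.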
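
The trade-off in point 12 can be felt with nothing more than Python's standard library. The sketch below compresses a made-up batch of JSON records with gzip and reports the size before and after, along with the time spent. It does not talk to Kafka at all; the record shape is an assumption, and real savings depend heavily on your payloads and the codec you pick.

```python
import gzip
import json
import time

# Hypothetical payload: repetitive JSON, which tends to compress well.
records = [{"user_id": i, "event": "page_view", "path": "/home"} for i in range(1000)]
batch = json.dumps(records).encode("utf-8")

start = time.perf_counter()
compressed = gzip.compress(batch)          # CPU is spent here (producer side)
elapsed = time.perf_counter() - start

restored = gzip.decompress(compressed)     # and here (consumer side)

print("original bytes:  ", len(batch))
print("compressed bytes:", len(compressed))
print("compress time:   ", f"{elapsed:.6f}s")
print("round trip ok:   ", restored == batch)
```

Fewer bytes travel over the network and sit on disk, while both ends pay some CPU for it. Measuring with your own message shapes is the only reliable way to decide.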