Getting Started With Apache Kafka

Haydar Can Kubilay Gümüş
Published in Logiwa Tech · Jan 28, 2022 · 7 min read

Kafka has become one of the most popular messaging systems. If we look at Google Trends to compare Apache Kafka with other similar technologies, we can see the public interest in it: the popularity of Apache Kafka has increased immensely in the last 5 years, in contrast with RabbitMQ's popularity, which has remained roughly constant.

These trends made me interested in Kafka and pushed me to dive deeper. I have been asking myself, "Why is this technology so popular?", and there are so many resources on the subject, such as articles, courses, etc. Why do we even need Apache Kafka? What is it? :)

So, the purpose of this article is to give you the fundamentals, cover some of Kafka's basic concepts, and prepare solid ground for going deeper. You could also call this article Kafka 101 :)

WHAT IS APACHE KAFKA?

I can't think of any introductory article that skips this question, right? :)

So, in computing, a common term for transferring data is messaging, and that is how Apache Kafka would describe itself: as a messaging system.

It is also known as a distributed streaming platform that is fast, fault-tolerant, reliable, durable, and scalable. Yes, this description of Kafka is really charming, but how can Kafka have all of the aforementioned properties? Which components are responsible for them? Let us dive into Kafka's core concepts.

CORE CONCEPTS

The Log:

A log is perhaps the simplest possible storage abstraction: an append-only, totally ordered sequence of records, ordered by time.

Records are only appended to the end of the log, and each one is assigned a unique sequential number. Reading proceeds from left to right, so records (events) on the left are older than those on the right. By the way, although we are all familiar with logging, this is a different kind of log: the main difference is that "application logging" is meant for humans to read, while the data logs we are talking about here are built for programmatic access. Logs are really important to understand for Kafka, because this is where all messages are stored.
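To make the abstraction concrete, here is a toy sketch in Java. It is purely illustrative (Kafka's real storage engine works with segment files on disk, not an in-memory list), but it captures the two rules: append only at the end, and give every record a sequential number.

```java
import java.util.ArrayList;
import java.util.List;

// A toy append-only log: records can only be added at the end,
// and each record gets a unique, ever-increasing sequence number.
public class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();

    // Appending returns the new record's sequence number.
    public long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reading is positional: smaller numbers mean older records.
    public String read(long sequenceNumber) {
        return records.get((int) sequenceNumber);
    }
}
```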

This blog post by Jay Kreps from LinkedIn explains logs and their purposes comprehensively.

Partitions:

You can think of each partition as a log. Messages are appended to the end of a partition, and Kafka guarantees the order of records within a single partition. So what happens if the producer sends messages to multiple partitions of a topic? In that scenario, the messages cannot be ordered sequentially across the topic as a whole. There are strategies for producing messages to a single partition or to multiple partitions; I will explain them briefly in the "Consumers and Producers" section.

Messages are written to a partition of a specific topic and copied to replicas on other machines. So if a machine (broker) goes down for any reason, the messages will not be lost. That is why Kafka is known as fault-tolerant and reliable.

Each message in a partition has a unique sequential id called the "offset". When consumers read messages, it is really important for them to know where they left off, and offsets are the marker that keeps consumers from losing their place in the sequence. I will give more detail about offsets in the "Consumers and Producers" section.

Topics:

Topics are collections of related messages; you can think of a topic as a sequence of events. Producers send messages to specific topics, which are configured at the code level, and consumers subscribe to topics to receive those messages.

Developers define topics, and "theoretically" there can be an unlimited number of them, but in practice there is always a restriction :)
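As a quick illustration, topics can also be created programmatically with the Kafka Java client's AdminClient. Here is a minimal sketch; the topic name "orders", the partition and replica counts, and the broker address are all placeholder assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic named "orders" with 3 partitions,
            // each partition replicated to 2 brokers.
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```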

Consumers And Producers:

Consumers and producers are the client sides of Kafka. Producers publish new messages to topics; they are also called publishers in other messaging systems.

As mentioned in the "Partitions" section, there are strategies for producing messages to partitions. If you don't send a key with your messages, the default partitioner distributes them across all partitions (classically round-robin; newer clients use a "sticky" variant for efficiency). If you send a key with your messages, the producer will always send messages with the same key to the same partition. It is also possible to send messages to a specific partition by configuring it at the code level. I am not going to dive deeper into these strategies for now because it would be outside the scope of this article; I will cover producing concepts in the next articles, so stay tuned :)
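Here is a minimal producer sketch with the official Kafka Java client to show the difference; the topic name "orders", the key "customer-42", and the broker address are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the default partitioner picks the partition.
            producer.send(new ProducerRecord<>("orders", "order created"));

            // With a key: every record keyed "customer-42" lands on the
            // same partition, so its relative order is preserved.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order shipped"));
        }
    }
}
```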

Consumers read messages; they are called subscribers in other similar systems. Consumers subscribe to one or more topics, and each consumer belongs to a consumer group identified by a group id. The consumers in a consumer group are typically just the same application image deployed more than once.

As I told you before, offsets are the unique ids of messages in the log. Consumers keep track of their offsets in memory and periodically ask Kafka for new messages starting from that offset; if there is a message with a greater offset than the consumer's current one, the consumer fetches it and updates its offset. Everything is cool so far, but there is a good question to ask right here: what if a consumer is unable to respond? If offsets lived only in the consumer's memory, they would be lost when it goes down. Here Kafka comes to our aid one more time: it keeps a topic that stores the committed offsets for each topic per group of consumers (groupId), mysteriously named "__consumer_offsets". :)
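A minimal consumer sketch with the Java client pulls these pieces together; the group id "order-processors" and topic "orders" are placeholders. Note the commit call, which is what writes the consumer's progress back to "__consumer_offsets":

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // the consumer group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");  // we commit manually below
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Fetch records newer than the offsets this consumer has already read.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Persist progress to __consumer_offsets so a restarted
                // consumer can continue where this one left off.
                consumer.commitSync();
            }
        }
    }
}
```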

A partition of a topic can be consumed by only one consumer within a consumer group. It follows that the number of partitions of a topic is the maximum number of active consumers in the same consumer group. If you create more consumers than a topic has partitions, the extra consumers are simply left idle during consumer rebalancing.

For example, suppose "consumer a" is getting data from partition 0 and partition 2, while "consumer b" is getting data from partition 1 and partition 3. When a new consumer is added, Kafka rebalances the group: now "consumer a" gets data from partition 0 and partition 3, "consumer b" gets data only from partition 1, and the newly added "consumer c" starts getting data from partition 2. When you remove a consumer from the consumer group, Kafka handles rebalancing again.
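If you want to watch this happen, the Java client lets you attach a ConsumerRebalanceListener when subscribing. A small sketch (the topic name "orders" is a placeholder) that just logs every reassignment:

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLoggingExample {
    // Subscribe and log every time Kafka reassigns partitions in the group.
    public static void subscribeWithLogging(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Rebalance: partitions taken away: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Rebalance: partitions now assigned: " + partitions);
            }
        });
    }
}
```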

Brokers and Cluster:

A broker is a single Kafka server; multiple brokers make up a Kafka cluster. Each broker has its own disk, and a Kafka process runs on each of them.

Brokers are responsible for receiving messages from producers. A broker assigns offsets to messages and stores them on disk, which is where the "durable" keyword comes into play: because Kafka persists messages to disk, the data on a broker is not lost if it goes down.

Brokers also serve the messages to consumers. Depending on the machine's hardware (CPU, memory, etc.), a single broker can easily handle millions of messages per second and host thousands of partitions. By the way, you will not see a single broker in real-world systems; partitions always have replicas on other brokers, called followers. Messages are sent to the leader, which actually handles the data transfer; when the leader gets the data, it feeds the follower replicas. You can configure how many replicas you will have at the code level, and the followers that keep up with the leader are called ISRs (In-Sync Replicas). That's why Kafka is reliable: your messages are copied to other brokers too, so when the leader broker goes down, a new leader is elected (with Zookeeper's help) from the brokers that hold the same data, and consumers can continue without data loss.
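On the producer side, the standard acks setting controls how many replicas must confirm a write before it counts as successful. A small illustrative fragment, wrapped in a helper so it compiles on its own (the broker address is a placeholder):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DurabilityConfigExample {
    public static Properties durableProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // "all": the leader acknowledges a write only after every
        // in-sync replica (ISR) has persisted the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Alternatives: "1" waits for the leader only, "0" doesn't wait at all.
        return props;
    }
}
```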

Zookeeper:

Zookeeper is a mandatory centralized service for Apache Kafka. It keeps all of Kafka's metadata; it basically keeps the cluster's secrets: cluster membership, broker information, topic configuration.

Its other important task is leader election in the cluster. For instance, if a leader broker goes down, a new leader is chosen from the existing ones.

After I had learned more about Zookeeper, I asked myself why the Kafka cluster doesn't include this functionality itself, and then I read that the Kafka team is working on moving Zookeeper's responsibilities directly into Kafka (the KRaft effort, KIP-500), so there will no longer be a need for a Zookeeper ensemble outside the cluster. We may not see Zookeeper ensembles at all in the future :)

HOW IS KAFKA SO FAST?

Before we finish, I would like to sum up why Kafka is so fast:

  1. It optimizes throughput by batching reads and writes (see the sketch after this list).
  2. Partitioning a topic lets writes scale horizontally.
  3. Reads scale horizontally as well, thanks to consumer groups.
  4. Data is only ever appended to the log, and reading is just as simple; both operations have O(1) complexity.
  5. The zero-copy approach, basically, asks the kernel to move data directly to the response socket rather than moving it through the application.
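To give point 1 some substance, the Java producer exposes two well-known batching knobs, batch.size and linger.ms. A small sketch, where the broker address and the particular values are illustrative assumptions, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingConfigExample {
    public static Properties batchingProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Collect up to 64 KB of records per partition into a single request...
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
        // ...and wait up to 10 ms for a batch to fill before sending it.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        return props;
    }
}
```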

CONCLUSION

In this article, I wanted to talk about what Apache Kafka is and explain all of its core concepts, to build a good understanding of the components before we go deeper.

Thank you for reading, and feel free to reach out.
