Apache Kafka: A Comprehensive Guide

At its core, Apache Kafka is an open-source distributed event streaming platform that allows you to publish, subscribe to, store, and process streams of records (messages) in real time.

Aug 27, 2023

Photo by Google DeepMind on Unsplash

In this article, we’ll delve into the workings of Apache Kafka, a distributed event streaming platform that has become the backbone of many large-scale systems. Originally developed at LinkedIn and later open-sourced under the Apache Software Foundation, Kafka is designed to handle real-time data feeds with high throughput, fault tolerance, and scalability.

We’ll cover a few key concepts and terms, and work through them with examples.

What is Apache Kafka?

If you are an experienced software engineer who has worked on projects that operate at high scale, you have probably heard of Kafka. It is used by over 80% of the Fortune 100, across virtually every industry.

Let’s first understand what it is. At its core, Apache Kafka is an open-source distributed event streaming platform that allows you to publish, subscribe to, store, and process streams of records (messages) in real time. It can be used as either a message queue or a stream processing system, and it is often described as a “distributed commit log” or a “publish-subscribe messaging system.”

It is designed to handle humongous volumes of data with minimal latency, making it ideal for building real-time data pipelines and streaming applications.

Real World Example

You might have heard of Cricbuzz, a platform that provides live cricket scores and real-time statistics on cricket matches. As passionate cricket fans in India, we often can't watch the game because of work. So, we rely on Cricbuzz to stay updated. Whenever a player is bowled out or hits a six or four, every detail is instantly updated on the platform. Here's what it looks like:

Cricbuzz Updates

At the very top you have the different matches, followed by a brief summary of what is happening at a high level. If you want detailed information, you can click on a match and you’ll see the following page:

Cricbuzz Detailed Updates

Let’s say we are asked to build something similar to Cricbuzz. How are we going to do it?

We can think of each ball being bowled as an event, right? When someone hits a six, it's an event in time. If someone gets injured, that's another event. Every point in time can be considered an event. But what should we do with these events? And how can we make use of them?

Let’s say that whenever such an event occurs, we first create a message (or record) out of it. Here is how it might look:

{
    "eventId": "IndVsEng:123",
    "bowler": "Ben",
    "batsman": "Virat",
    "outcome": "Hits Six",
    "over": 5,
    "ballsBowled": 3,
    "timestamp": "26 Aug 2024, 17:00"
}

We then place this record on a queue. We call the server or process responsible for putting these events on the queue the Producer. If we are producing these events, there must also be servers that read them off the queue, perform computations, and update the website. We call those servers the Consumers.

Kafka Basic HLD
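
To make this concrete, here is a minimal sketch of such a producer using Kafka’s Java client. The broker address, the topic name (cricket-events), and the serialized payload are assumptions made purely for illustration; Kafka calls the queue a topic, a term we’ll introduce properly later.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ScoreEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The event we built above, serialized as a JSON string.
        String event = "{\"eventId\":\"IndVsEng:123\",\"batsman\":\"Virat\",\"outcome\":\"Hits Six\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish the event to the hypothetical "cricket-events" topic (our queue).
            producer.send(new ProducerRecord<>("cricket-events", event));
        }
    }
}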

Everything works well as long as only a couple of matches are being played simultaneously, which is usually the case. This means that at any given time, we’ll only have events for those few matches. The volume of these events will be manageable, so our consumer won’t get overwhelmed.

However, let’s consider a hypothetical scenario where every country has regional cricket teams that also want to participate. Instead of just India competing, we could have teams like the Kolkata Knight Riders (from Kolkata) or Royal Challengers Bangalore (RCB), among others. And this is just from India; let's assume England has regional teams like the London Lions, etc. This would significantly increase the total number of matches running simultaneously, especially if we aim to conclude the World Cup in a week or a month.

Now, imagine there are 500 matches happening simultaneously. This would drastically increase the number of events being published to the queue, and our single server hosting the queue would struggle to keep up. Similarly, the consumer would be overwhelmed by the sheer volume of events, causing the website to feel sluggish. In reality, if RCB is all out, but our website still shows that they are playing, it would create a poor experience for cricket fans, especially RCB fans.

To address this issue, we’ll need to scale the system by adding more servers to distribute the queue. Why distribute the queue? The queue can only hold as much data as its disk can support. If we exceed that limit, we’ll need to distribute the queue across multiple servers.

Let’s say we use two queues. The next question is how to maintain the order of events. Let me illustrate the issue with an example. Suppose at 15:00, batsman Virat hits a six, and on the very next ball, around 15:02, he gets bowled. These are two events:

Event 1: 15:00 → Virat hits a six
Event 2: 15:02 → Virat is bowled

Now, since we’re using two different queues, there are two possibilities: either both events are ingested into the same queue, or one event goes to the first queue while the second event goes to the other queue.

Scaling Queues

If the events for the IND vs ENG match end up in different queues, it's possible that the consumer might process them out of order. For example, it could consume the event where Virat got bowled before the event where he hit a six. This could lead to confusion and frustration. To avoid this, we need a solution that ensures events are processed in the correct sequence. But how do we achieve this?

If both events end up in the same queue, the problem of processing events out of order won’t arise. Therefore, we need a logical way to ensure that events belonging to the same match go to the same queue. This is one of the key concepts behind Kafka: instead of distributing messages round-robin, producers can use a user-defined distribution strategy. In our case, the distribution strategy will be based on the match the events belong to; for example, all events for the IND vs ENG match will end up in the same queue.
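
In Kafka, the usual way to achieve this is to give every record a key and let the default partitioner hash that key, so all records with the same key land on the same partition. Reusing the producer from the sketch above, and treating sixEventJson and bowledEventJson as hypothetical JSON strings for the two events, it might look like this:

// The record key is the match id. The default partitioner hashes the key,
// so every event for "IndVsEng" lands on the same partition and stays in order.
producer.send(new ProducerRecord<>("cricket-events", "IndVsEng", sixEventJson));
producer.send(new ProducerRecord<>("cricket-events", "IndVsEng", bowledEventJson));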

With that in place, our queues will look like the following:

Distribution Strategy

Now that we’ve addressed the issue of the queue becoming overfilled and running out of disk space, we still need to solve the consumer problem. Since the total number of events being ingested has increased significantly, we still have only one consumer handling all the work. Should we simply scale the consumer?

Scaling Consumers

If we simply scale the consumer, how can we ensure that each event is processed only once? For example, what if both consumers read that Virat hit a six? This could result in the score being incorrectly updated to 12. How can we solve this problem?

We can group consumers together into Consumer Groups in Kafka. With consumer groups, each event is guaranteed to be processed by only one consumer within the group.

Consumer Groups

More formally, a consumer group in Apache Kafka is a concept that allows multiple consumers to work together to read data from Kafka efficiently and in a coordinated manner. Within a consumer group, no two consumers will consume the same record (or message).
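
As a rough sketch, a consumer group is simply the same consumer program started multiple times with the same group.id; Kafka then assigns each partition to exactly one of the running instances. The broker address, topic, and group name below are illustrative assumptions:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ScoreboardConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "scoreboard-updaters");     // every instance shares this group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("cricket-events"));
            while (true) {
                // Each partition is assigned to only one consumer in the group,
                // so a given event is processed by exactly one running instance.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}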

Let’s say we gain popularity and our method of showing events on the website becomes well-known. People around the world start asking us to show statistics for football (or soccer) as well. How would we incorporate those changes?

One solution would be to send the football events through our existing setup: if Ronaldo scores a goal, our website should show that. But don’t you think this would make things a little messy and challenging to scale? Also, if something goes wrong, both the football and cricket events would be affected.

We want some sort of segregation of events, and that is where topics come into the picture. A topic in Kafka is a category or feed name to which records are sent by producers. Think of topics as the fundamental unit of organization for events.

Each event is associated with a topic, and consumers can subscribe to specific topics. This way, we are not mingling the two kinds of events together. The following figure should help you visualize it better.

Topics
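
Creating the two topics could be done with Kafka’s AdminClient (or the kafka-topics command-line tool). Here is a minimal sketch; the topic names, partition counts, and replication factor are chosen purely for illustration:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Arrays;
import java.util.Properties;

public class CreateSportTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // One topic per sport keeps the event streams segregated.
            NewTopic cricket  = new NewTopic("cricket-events", 3, (short) 2);  // name, partitions, replication
            NewTopic football = new NewTopic("football-events", 3, (short) 2);
            admin.createTopics(Arrays.asList(cricket, football)).all().get();
        }
    }
}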

Terminologies

In this section, we’ll go over the terminology used in Kafka and its original paper.

Topic: In Kafka, a topic is a stream of records categorized under a specific name. For example, we have a cricket topic that contains all the events related to cricket; similarly, we have a football topic that accumulates all the events related to football.

Partition: Topics in Kafka are divided into partitions, which allows for parallelism. In the example above, you saw that we pushed all the events related to IND vs ENG to one queue instead of sending them in a round-robin fashion. Those two queues within a single topic are what Kafka calls partitions. There is an important caveat here: ordering strictly matters in our use case, and Kafka provides strong ordering guarantees at the partition level, not across partitions. We already ran into this when one IND vs ENG event went to one queue and the other event went to the other queue. So, if we want ordering, we must send those events to a single partition. Refer to the following figure for partitions:

Partitions

Producer/Consumer: These are easy to understand: a producer in Kafka is responsible for publishing data to topics, whereas a consumer reads data from topics.

Consumer Group: A collection of consumers that work together to read data from a topic, such that each partition is consumed by only one consumer within the group.

Offset: This is something we haven’t discussed yet. Think of the offset as the position of an event in the queue: the first event has offset 0, the next has offset 1, and so on. It is a unique identifier for each record (or message) within a partition, indicating its position. Each consumer keeps track of the offset it has consumed up to and periodically commits this state, so that if it dies and a new consumer takes its place, the new consumer doesn’t start from the beginning but from the point where the previous one left off.

Offsets
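
As a rough sketch of how a consumer records this progress with the Java client, auto-commit can be turned off and offsets committed explicitly once a batch has been processed. The lines below are the only changes relative to the consumer sketch shown earlier:

// In the consumer Properties from the earlier sketch:
props.put("enable.auto.commit", "false"); // we will commit offsets ourselves

// Inside the poll loop, after a batch of records has been processed:
consumer.commitSync(); // persist the consumed offsets; a replacement consumer resumes from here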

Broker: A broker is a server that stores the data and serves clients (producers and consumers). The more brokers you have, the more data you can store and the more clients you can serve.

Cluster: A cluster is made up of a collection of brokers working together.

How Does Kafka Work?

Let’s walk through the entire process from data ingestion to data consumption:

Step 1: Data Ingestion by Producers

Producers (clients), such as a Java client, create records and send them to a Kafka topic. They can choose the partition within the topic to which they want to send the data, often based on a key (e.g., userId, matchId, etc.) to ensure that all related data ends up in the same partition, thus maintaining the ordering guarantee.

Kafka brokers receive this data and store it in the appropriate partition. Each record within a partition is assigned a unique offset, which is an incremental number used to keep track of the record’s position in the queue.

Step 2: Data Storage and Replication

Once the data is received, Kafka stores it in the topic partitions. Kafka ensures durability and fault tolerance through replication. Each partition has a designated leader broker, which handles all read and write operations. Other brokers in the cluster store replicas of the partition, acting as followers.

If a broker fails, one of the follower brokers is automatically promoted to leader, ensuring that the data remains available and that producers and consumers can continue working without interruption.
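
On the producer side, this replication is usually paired with the acks setting, so a write is acknowledged only after the leader and its in-sync replicas have it. As a sketch, the only addition to the producer configuration from the earlier example would be:

// Wait for the leader and all in-sync replicas before considering a send successful.
props.put("acks", "all");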

Step 3: Data Consumption by Consumers

Consumers subscribe to one or more topics to read data. Each consumer group coordinates so that each partition is consumed by only one consumer in the group at any given time; otherwise, we’ll have race conditions as explained in the real-world example. Ensuring this allows Kafka to process data in parallel.

Consumers read records from the partitions sequentially, based on the offset. One of the most important things to note is that Kafka doesn’t automatically delete records after they are consumed, which allows multiple consumer groups to read the same data independently.

Step 4: Data Retention and Cleanup

Kafka retains records (or messages) in a topic according to a configured retention policy, which can be based on time (e.g., for 7 days) or size (e.g., up to 100 GB of data per topic). After the retention period, Kafka automatically deletes old records.

It also offers log compaction, which retains only the most recent record for each key within a partition. This is useful in cases where only the most recent updates matter.
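
Retention and compaction are ordinary topic-level configurations. As a sketch, here is how the AdminClient example from earlier might set a seven-day, size-bounded retention policy at topic creation time; the exact values are illustrative, and retention.bytes applies per partition:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        Map<String, String> configs = Map.of(
                "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),  // keep records for 7 days
                "retention.bytes", String.valueOf(1024L * 1024 * 1024));   // or until ~1 GB per partition
        // For a "latest value per key" topic, "cleanup.policy" could be set to "compact" instead.

        NewTopic topic = new NewTopic("cricket-events", 3, (short) 2).configs(configs);
        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}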

Kafka Use Cases

Real-Time Analytics: Kafka is used to stream data from sources like IoT devices, web logs, and financial transactions for real-time processing and analytics.

Log Aggregation: Kafka collects and stores logs from multiple services and systems in a centralized place, enabling easier monitoring and troubleshooting.

Event Sourcing: Kafka can be used to store all changes to the application state as a sequence of events, which can then be replayed to reconstruct the current state.

Conclusion

Kafka works as a distributed messaging system that enables real-time data streaming and processing. We explored its importance through a real-world example and walked through its core terminology to strengthen our understanding.
