Basics of Apache Kafka
GitHub repo code and its explanation: the repo uses Spring Boot, Docker, Docker Compose, Kafka, and ZooKeeper to practically show a producer publishing messages and consumers consuming them from Kafka brokers. You can use the above links after you are clear on the Kafka basics.
Let’s start the main article with the basics of Kafka:
Apache Kafka is a publish-subscribe based, highly scalable, available, and fault-tolerant messaging system. It plays a significant role in the data-streaming landscape wherever we need to process, reprocess, analyze, and handle real-time data.
Kafka Cluster
A Kafka cluster is a system that consists of several brokers, topics, and partitions. The primary goal of a cluster is to distribute workloads between servers to provide scalability, availability, and fault tolerance.
Brokers
A Kafka cluster comprises one or more brokers. Brokers are the servers/nodes in a cluster. In practice you run multiple brokers for your business use case so that topics, their partitions, and replicas of those partitions can be spread across machines; this is what provides scalability, availability, and fault tolerance. A single broker does not necessarily hold your complete data, but it does know about all the other brokers, partitions, and topics in the cluster. A small sketch of how a client discovers the cluster is shown below.
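Because every broker knows the full cluster metadata, a client only needs one or two brokers as an initial contact point. A minimal sketch using the plain Java kafka-clients library (the broker addresses are assumptions, not from the article):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import java.util.Properties;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // A subset of brokers is enough: any one of them returns
        // metadata about the whole cluster. Addresses are hypothetical.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            System.out.println("Brokers in cluster: " + cluster.nodes().get());
            System.out.println("Controller: " + cluster.controller().get());
        }
    }
}
```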
Topic
A producer publishes messages/streams to a topic and a consumer consumes from the topic. A topic is used to store and publish a specific type of data, and there can be any number of topics in Kafka. Each topic has a name that is unique across the entire Kafka cluster. Topics are partitioned and replicated across multiple brokers.
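A topic is usually created with a partition count and replication factor chosen up front. A hedged sketch with the Java AdminClient (the topic name and sizing here are illustrative choices, not from the article):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // "orders" with 3 partitions, each replicated on 2 brokers.
            NewTopic orders = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(List.of(orders)).all().get();
            System.out.println("Topic created: " + orders.name());
        }
    }
}
```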
Partition
Splitting a topic into several parts is called partitioning, i.e. a topic is divided into one or more partitions, enabling producer and consumer load to be scaled out. The messages of one topic can therefore live on multiple nodes. Keeping only a single instance of a topic eventually creates problems as message volume grows. If you have more data in a topic than can fit on a single node, you must increase the number of partitions. The order of offset values is guaranteed within a partition only, not across partitions.
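Note that the partition count can only be increased, never decreased, since existing data would otherwise have to be re-shuffled. A sketch of growing the hypothetical "orders" topic from above with the Java AdminClient:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;
import java.util.Map;
import java.util.Properties;

public class GrowTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // Increase "orders" from 3 to 6 partitions. Existing messages
            // stay where they are; only new messages use the new layout,
            // so per-key ordering effectively resets at this point.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(6)))
                 .all().get();
        }
    }
}
```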
Consumer Offset value
Offsets are assigned to each message in a partition to keep track of the messages in the partitions of a topic. Keeping track of offsets is critical, as a consumer wants to receive messages in the same order as they were published by the producers. Every partition has its own offsets. The committed consumer offset allows processing to continue from where it last left off if the streaming application stops for any reason. In modern Kafka versions this offset tracking is done in an internal Kafka topic (__consumer_offsets); only very old versions stored offsets in ZooKeeper. Consumers in a consumer group are each assigned particular partitions. When you have two partitions for a single topic, a message is published to only one of the partitions, not both; if you want the data to exist on multiple nodes for availability, use the concept of replicas. Only one consumer in a consumer group can be assigned to consume messages from a given partition, hence only one consumer in a consumer group reads a particular message.
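One way to see offsets at work is to commit them manually and read back the committed position. A sketch against the hypothetical "orders" topic, using the plain Java consumer (group id and addresses are assumptions):

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.Set;

public class OffsetDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("group.id", "offset-demo");             // assumed group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");
        props.put("enable.auto.commit", "false");         // commit manually

        TopicPartition tp = new TopicPartition("orders", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(tp));
            consumer.poll(Duration.ofSeconds(1));  // fetch a batch
            consumer.commitSync();                 // persist our position

            // The committed offset is where this group resumes after a restart.
            OffsetAndMetadata committed = consumer.committed(Set.of(tp)).get(tp);
            System.out.println("Committed offset: " + committed);
        }
    }
}
```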
Replicas
Kafka replication means keeping multiple copies of data spread across multiple brokers/servers. This provides high availability if any broker in the Kafka cluster goes down. For each partition, one broker is elected leader (the cluster controller, coordinated through ZooKeeper, handles this election); all messages for that partition are sent to the leader, and the follower replicas on other brokers continuously fetch from it so that they stay in sync with the latest messages.
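You can inspect which broker leads each partition and where its replicas live. A sketch with the Java AdminClient (the "orders" topic, as before, is hypothetical):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.List;
import java.util.Properties;

public class ReplicaInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                                         .all().get().get("orders");
            for (TopicPartitionInfo p : desc.partitions()) {
                // Each partition has one leader and a set of replicas;
                // "isr" is the subset of replicas that is fully caught up.
                System.out.printf("partition %d leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```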
Check this for more on replicas: https://medium.com/@_amanarora/replication-in-kafka-58b39e91b64e
ZooKeeper
ZooKeeper acts as a coordination service for any stateful activity across your Kafka cluster. It is used to manage and coordinate Kafka brokers, topics, partitions, and replicas. This service notifies the cluster of any broker event change, such as a broker being added or removed, and it keeps track of the leaders for the different partitions.
Producer
A producer is responsible for publishing or writing messages to a topic. A producer sends every message to one topic partition; that partition's leader then replicates the message to all its replicas for higher availability. Producers automatically work out which partition and broker a given message should be written to, so the user does not need to specify either. The producer learns this from cluster metadata fetched from the brokers it first connects to.
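A minimal producer sketch in plain Java (topic, key, and address are hypothetical; a Spring Boot app would typically use spring-kafka's KafkaTemplate instead, but the mechanics are the same):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key is hashed to pick the partition, so all messages with
            // the same key land on the same partition (and stay in order).
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "order-42", "ORDER_CREATED");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.printf("sent to partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```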
Consumer and Consumer Group
A consumer is responsible for reading data from a topic partition. If there are more consumers than partitions, some of the consumers will sit in an inactive state. The group coordinator (a broker; in very old Kafka versions this coordination went through ZooKeeper) assigns every partition to exactly one consumer in a consumer group, so that each message is read only once within the group. A single consumer may read from multiple partitions, but every partition is assigned to only one consumer in a group, i.e. one partition cannot be shared by multiple consumers of the same group. Consumer groups have a group id, and all consumers in a group join by specifying that same group id.
Kafka consumers pull data from Kafka brokers; it is a pull-based model, as the sketch below shows.
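A minimal consumer sketch in plain Java (group id, topic, and address are hypothetical). Starting several copies of this process with the same group.id makes Kafka spread the topic's partitions across them:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("group.id", "order-service");           // shared by the group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // poll() is the "pull": the consumer asks the broker for data.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}
```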
We are good. Hope this was helpful to get started with Kafka. Please feel free to drop a comment or reach out anywhere you'd like: LinkedIn, or email me at vivek.sinless@gmail.com