FoxuTech

What is Kafka and How its work and its Use cases

What is Kafka

What is Kafka?

Apache Kafka is messaging system built to scale for big data. Similar to Apache ActiveMQ or RabbitMq, Kafka enables applications built on different platforms to communicate via asynchronous message passing. But Kafka differs from these more traditional messaging systems in key ways:

Kafka’s architecture

Before we explore Kafka’s architecture, you should know its basic terminology:

Architecture of a Kafka message system

Kafka’s architecture is very simple, which can result in better performance and throughput in some systems. Every topic in Kafka is like a simple log file. When a producer publishes a message, the Kafka server appends it to the end of the log file for its given topic. The server also assigns an offset, which is a number used to permanently identify each message. As the number of messages grows, the value of each offset increases; for example, if the producer publishes three messages the first one might get an offset of 1, the second an offset of 2, and the third an offset of 3.

When the Kafka consumer first starts, it will send a pull request to the server, asking to retrieve any messages for a particular topic with an offset value higher than 0. The server will check the log file for that topic and return the three new messages. The consumer will process the messages, then send a request for messages with an offset higher than 3, and so on.

In Kafka, the client is responsible for remembering the offset count and retrieving messages. The Kafka server doesn’t track or manage message consumption. By default, a Kafka server will keep a message for seven days. A background thread in the server checks and deletes messages that are seven days or older. A consumer can access messages as long as they are on the server. It can read a message multiple times, and even read messages in reverse order of receipt. But if the consumer fails to retrieve the message before the seven days are up, it will miss that message.

Kafka: Streaming Architecture

Kafka is used most often for streaming data in real-time into other systems. Kafka is a middle layer to decouple your real-time data pipelines. Kafka core is not good for direct computations such as data aggregations or CEP. Kafka streaming, which is part of the Kafka ecosystem, provides the ability to do real-time analytics. Kafka can be used to feed fast lane systems (real-time and operational data systems) like Storm, Flink, Spark streaming, and your services and CEP systems. Kafka is also used to stream data for batch data analysis. Kafka feeds Hadoop. It streams data into your big data platform or into RDBMS, Cassandra, Spark, or even S3 for some future data analysis. These data stores often support data analysis, reporting, data science crunching, compliance auditing, and backups.

            Kafka streaming architecture diagram referred from dzone.com

What Kafka doesn’t do

Kafka can be very fast because it presents the log data structure as a first-class citizen. It’s not a traditional message broker with lots of bells and whistles.

Because of those differences from traditional messaging brokers, Kafka can make optimizations.

How Kafka supports microservices

As powerful and popular as Kafka is for big data ingestion, the “log” data structure has interesting implications for applications built around the Internet of Things, microservices, and cloud-native architectures in general. Domain-driven design concepts like CQRS and event sourcing are powerful mechanisms for implementing scalable microservices, and Kafka can provide the backing store for these concepts. Event sourcing applications that generate a lot of events can be difficult to implement with traditional databases, and an additional feature in Kafka called “log compaction” can preserve events for the lifetime of the app. Basically, with log compaction, instead of discarding the log at preconfigured time intervals (7 days, 30 days, etc.), Kafka can keep the entire set of recent events around for all the keys in the set. This helps make the application very loosely coupled, because it can lose or discard logs and just restore the domain state from a log of preserved events.

How does Kafka compare to traditional messaging competitors?

Just as the evolution of the database from RDBMS to specialized stores has led to efficient technology for the problems that need it, messaging systems have evolved from the “one size fits all” message queues to more nuanced implementations (or assumptions) for certain classes of problems. Both Kafka and traditional messaging have their place.

Traditional message brokers allow you to keep consumers fairly simple in terms of reliable messaging guarantees. The broker (JMS, AMQP, or whatever) tracks what messages have been acknowledged by the consumer and can help a lot when order processing guarantees are required and messages must not be missed. Traditional brokers typically implement multiple protocols (e.g., Apache ActiveMQ implements AMQP, MQTT, STOMP, and others) to be used as a bridge for components that use different protocols. Additional functionalities such as message TTLs, non-persistent messaging, request-response messaging, correlation ID selectors, etc. are all perfectly valid messaging use cases where Kafka would not be a good fit.

More: Configure Elastic Load Balancing with SSL 

Should you use Kafka?

The answer will always depend on what your use case is. Kafka fits a class of problem that a lot of web-scale companies and enterprises have, but just as the traditional message broker is not a one size fits all, neither is Kafka. If you’re looking to build a set of resilient data services and applications, Kafka can serve as the source of truth by collecting and keeping all of the “facts” or “events” for a system.

Use Cases

Here is a description of a few of the popular use cases for Apache Kafka®. For an overview of a number of these areas in action, see this blog post.

Metrics

Every service at LinkedIn includes at least a Kafka producer, since this is how metrics are propagated. This includes technical metrics (query throughput, latency, etc.) as well as business metrics (ad impressions, ad clicks, etc.). These events are emitted at the rate of one per metric per minute per host, and thus represent aggregated information (count, average, max, percentiles, etc.) that get loaded into a time-series database. These metrics can thus be almost immediately visualized in dashboards, and their trend over time can be examined. Threshold alerts are also fired from this data.

Tracking

This is similar to business metrics, except that it is typically more complex than just a simple numerical value going into a time-series database. This includes, for example, page view events, which contain many dimensions (member ID, page key, etc.). These events are emitted at the rate of one per triggering event, and can thus be much higher volume for each given type of events than metrics. These end up being ingested into Hadoop (via Gobblin) for offline batch processing and into Samza for nearline stream processing.

Logs

Services can get their logs ingested into Kafka in order to load them into ELK, for monitoring purposes.

Meta-Infrastructure

Besides the above use cases, which are very common across the industry, Kafka is also used as a building block in almost every other piece of infrastructure at LinkedIn:

Stream processing

As mentioned above, Samza is the stream processing framework used at LinkedIn. It leverages Kafka in a variety of ways:

Use cases fulfilled by Samza are very diverse: pushing notifications and emails to members, site speed analysis, ad click-through rate computation, fraud detection, standardization machine learning pipelines, and many more.

Inter-process messaging

Some use cases require inter-process messaging, and Kafka can be used for this. These events are transient in nature and may not need to be persisted into a long-term storage system, contrarily to the previously mentioned use cases, which usually result in long-term storage.

This Question was raised about What are some of the use cases of Apache Kafka? on Quora, and answered by Felix GV, Engineer

Exit mobile version