Top Apache Kafka Interview Questions (2024)

Most frequently asked Apache Kafka Interview Questions


  1. What is Apache Kafka?
  2. What are the capabilities of Apache Kafka?
  3. What are the use cases of Apache Kafka?
  4. What are the Kafka core APIs?
  5. Explain the role of ZooKeeper in Kafka?
  6. Explain the difference between Apache Kafka and Apache Storm?
  7. Explain the difference between Apache Kafka and ActiveMQ?
  8. What are Apache Kafka topic name limitations?
  9. How can you get exactly-once Kafka messaging during data production?
  10. What is the best way to make Znodes and how to remove them?
  11. In Kafka, what is the aim of ISR?


What is Apache Kafka?

It is a distributed streaming platform that provides three key capabilities:
  • Publish and subscribe to streams of records, comparable to a message queue or an enterprise messaging system.
  • Store streams of records in a fault-tolerant, durable way.
  • Process streams of records as they arrive.


What are the capabilities of Apache Kafka?

  • High throughput: Deliver messages at network-limited throughput, using a cluster of machines with latencies as low as 2ms.
  • Permanent storage: Store streams of data safely in a distributed, durable, fault-tolerant cluster.
  • Scalable: Production clusters can handle thousands of brokers, trillions of messages per day, petabytes of data, and hundreds of thousands of partitions. Storage and processing can be expanded and contracted elastically.
  • High availability: Clusters can be stretched efficiently across availability zones, or independent clusters can be connected across geographic regions.
  • Large ecosystem of open source tools: Take advantage of a diverse set of community-developed tools.
  • Client libraries: Read, write, and process streams of events in a variety of programming languages.


What are the use cases of Apache Kafka?

  • Kafka is frequently used for operational data monitoring. This involves aggregating statistics from distributed applications into centralized feeds of operational data.
  • Event sourcing is a type of application design in which state changes are recorded chronologically. Kafka is an appropriate backend for this type of application because it supports very large stored log data.
  • Creating a real-time streaming data pipeline that consistently gets data between systems or applications.
  • Creating a real-time streaming application that transforms streams of data.
  • Kafka works well as a replacement for a more traditional message broker. Kafka is a good choice for large-scale message processing applications because it has better throughput, built-in partitioning, replication, and fault tolerance.


What are the Kafka core APIs?

  • Producer API: Enables an application to publish a stream of records to one or more topics (a minimal producer and consumer sketch follows this list).
    You can use the following Maven dependency to use the producer:
    <dependency> 
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.8.0</version>
    </dependency> 
    
  • Consumer API: Permits an application to subscribe to one or more topics and process the stream of records produced to them (see the sketch after this list).
    You can use the following Maven dependency to use the consumer:
    <dependency> 
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.8.0</version>
    </dependency> 
    
  • Streams API: Permits the application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more topics, effectively transforming input streams into output streams (a Kafka Streams sketch also follows this list).
    You can use the following Maven dependency to use Kafka Streams:
    <dependency> 
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>2.8.0</version>
    </dependency>
    
    You can use the kafka-streams-scala package if you're using Scala.
    You can use the following Maven dependency to use the Kafka Streams DSL for Scala with Scala 2.13:
    <dependency> 
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams-scala_2.13</artifactId>
        <version>2.8.0</version>
    </dependency> 
    
  • Connect API: Developers can use the Connect API to build connectors that continually pull data from a source system into Kafka, or push data from Kafka into a sink system. Many Connect users won't need to use this API directly; instead, they can use pre-built connectors that don't require any coding.
  • Admin API: Topics, brokers, ACLs, and other Kafka objects can be managed and inspected using the Admin API.
    You can use the following Maven dependency to use the Admin API:
    <dependency> 
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.8.0</version>
    </dependency> 
    
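As an illustration of the Producer and Consumer APIs, here is a minimal sketch that publishes one record and then reads it back. It assumes a broker running on localhost:9092 and a hypothetical topic named demo-topic; it is only a sketch, not production code.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerConsumerSketch {
        public static void main(String[] args) {
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Publish one record to the hypothetical "demo-topic"
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("demo-topic", "key", "hello kafka"));
            }

            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "demo-group");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("auto.offset.reset", "earliest");

            // Subscribe to the topic and poll once for any available records
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(Collections.singletonList("demo-topic"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }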

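Similarly, a minimal Kafka Streams sketch that reads from one topic, transforms each value, and writes to another. The topic names input-topic and output-topic, the application id, and the broker address are assumptions for illustration.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class UppercaseStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // hypothetical application id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

            StreamsBuilder builder = new StreamsBuilder();
            // Read from "input-topic", upper-case each value, write to "output-topic"
            builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
                   .mapValues(value -> value.toUpperCase())
                   .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }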

Explain the role of ZooKeeper in Kafka?

Apache Kafka is a distributed system and uses ZooKeeper for coordination: ZooKeeper's primary function is to coordinate the nodes in the cluster. Because offsets are periodically committed to ZooKeeper, Kafka can also recover from previously committed offsets if any node fails.

What's the difference between Apache Kafka and Apache Storm?

  • Apache Kafka. It is a distributed messaging system and robust queue that can handle large volumes of data and pass messages from one endpoint to another. It is based on topics and partitions and uses the publish-subscribe paradigm. Kafka uses ZooKeeper to share and save state between brokers. Put simply, Kafka moves messages from one node to another.
  • Apache Storm. It is a real-time message processing system. Storm is not a queue; it is a distributed real-time processing system, which means it can perform all kinds of real-time data manipulations in parallel. Storm is a computation unit.
The common flow of these tools: real-time system --> Kafka --> Storm --> NoSQL --> BI (optional)

Explain the difference between Apache Kafka and ActiveMQ?

  • Apache Kafka. It is a distributed messaging system with good scaling capability. Applications can use it to process and reprocess streaming data on disk. Kafka's storage format is more efficient: each message has an average overhead of about 9 bytes. In Kafka, a slow consumer does not affect other consumers. Kafka is a pull-based messaging system: consumers pull messages from the broker at their own pace and consume only the messages intended for them. The performance of queues and topics does not degrade as more consumers are added.
  • ActiveMQ. ActiveMQ is a traditional messaging system. It is a general-purpose message broker that supports several protocols, including AMQP, STOMP, and MQTT. It also supports Enterprise Integration Patterns and more complex message routing. In a Service Oriented Architecture, it is mostly used for integration between applications/services. In ActiveMQ, each message has an overhead of about 144 bytes. ActiveMQ is a push-based messaging system: producers send messages to brokers, and brokers push messages to all consumers. It is the producer's responsibility to ensure that the message is delivered. The performance of queues and topics diminishes as more consumers are added.

What are Apache Kafka topic name limitations?

  • The maximum length of a topic name is 255 characters.
  • Letters, digits, '.' (dot), '_' (underscore), and '-' (hyphen) can be used.
  • The maximum name length was reduced from 255 to 249 characters in Kafka 0.10.
  • Due to limitations in metric names, topics containing a period ('.') or underscore ('_') may collide. To avoid problems, it is recommended to use one or the other, but not both.
  • '.' and '_' are both permitted characters, but they should be treated as one and the same.
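Based on the rules above, here is a small validation sketch. It is only an illustration of the constraints listed here, not the actual validation code Kafka uses; the class and method names are hypothetical.

    import java.util.regex.Pattern;

    public class TopicNameCheck {
        // Legal characters and the 249-character limit described above
        private static final Pattern LEGAL = Pattern.compile("[a-zA-Z0-9._-]{1,249}");

        static boolean isValidTopicName(String name) {
            if (name == null || name.equals(".") || name.equals("..")) {
                return false; // "." and ".." are reserved names
            }
            if (name.contains(".") && name.contains("_")) {
                // legal, but risky: '.' and '_' collide in metric names
                System.out.println("WARNING: topic name mixes '.' and '_'");
            }
            return LEGAL.matcher(name).matches();
        }

        public static void main(String[] args) {
            System.out.println(isValidTopicName("orders.v1")); // true
            System.out.println(isValidTopicName("orders v1")); // false: space is not allowed
        }
    }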

How can you get exactly-once Kafka messaging during data production?

To get exactly-once messaging from Kafka during data production, you must avoid duplication during data production and eliminate duplicates during data consumption. Here are two methods for obtaining exactly-once semantics during data production:
  • Use a single writer per partition, and every time you get a network error, check the last message in that partition to see whether your last write succeeded.
  • Include a primary key (UUID or similar) in the message and de-duplicate on the consumer.
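A minimal sketch of the second approach, assuming a broker on localhost:9092 and a hypothetical "orders" topic. The producer attaches a UUID key; the consumer skips keys it has already processed (a real system would keep the processed keys in a durable store rather than in memory).

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;
    import java.util.UUID;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DedupByKeySketch {
        public static void main(String[] args) {
            // Producer side: attach a unique id as the record key so duplicates can be detected later
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("acks", "all"); // wait for the full ISR before considering a write successful

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                String dedupKey = UUID.randomUUID().toString();
                producer.send(new ProducerRecord<>("orders", dedupKey, "order payload"));
            }

            // Consumer side: skip records whose key has already been processed
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "orders-dedup");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Set<String> processedKeys = new HashSet<>(); // in production, use a durable store
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(Collections.singletonList("orders"));
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                    if (processedKeys.add(record.key())) {
                        System.out.println("processing " + record.value());
                    } // else: duplicate, already processed
                }
            }
        }
    }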

What is the best way to make Znodes and how to remove them?

Znodes are created within a given path.
Syntax: create /path/data
Flags can be used to specify whether the created znode is persistent, ephemeral, or sequential:
  • create -e /path/data creates an ephemeral znode.
  • create -s /path/data creates a sequential znode.
To remove a specified znode and all its children: rmr /path
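The commands above are for the ZooKeeper CLI (zkCli.sh). For illustration, here is a minimal sketch doing the same thing from Java with the ZooKeeper client API; it assumes a ZooKeeper server on localhost:2181, and the paths are hypothetical.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeSketch {
        public static void main(String[] args) throws Exception {
            // For brevity, this sketch does not wait for the connection event
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });

            // Create a persistent znode, then an ephemeral child
            zk.create("/demo", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/demo/session", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // delete() is not recursive (unlike rmr), so remove the child before the parent
            zk.delete("/demo/session", -1);
            zk.delete("/demo", -1);
            zk.close();
        }
    }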

In Kafka, what is the aim of ISR?

ISR stands for in-sync replicas: all replicated partitions that are fully caught up with the leader. A replica must catch up to the leader within a configurable amount of time, which is 10 seconds by default. If a follower does not catch up with the leader within this time, the leader drops the follower from its ISR, and writes continue on the remaining replicas in the ISR. If the follower comes back, it first truncates its log to its last checkpoint and then catches up on all messages received after that point. The leader adds it back to the ISR only once the follower has fully caught up.
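Replication (and therefore the ISR) is configured per topic. As a hedged sketch using the Admin API described earlier, the following creates a topic with a replication factor of 3, assuming a cluster of at least three brokers reachable at localhost:9092; the topic name and settings are illustrative only.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateReplicatedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 3: each partition has a leader plus followers in the ISR
                NewTopic topic = new NewTopic("demo-replicated", 3, (short) 3)
                        .configs(Collections.singletonMap("min.insync.replicas", "2")); // writes require at least 2 in-sync replicas
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }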