Understanding Kafka Retention

Ganesh Ramasubramanian
4 min read · Sep 1, 2019


What is Kafka Retention?

Kafka retention gives you control over the size of topic logs so that they do not outgrow the available disk. Retention can be configured based on the size of the logs or based on a configured duration.

The same retention can be applied across all Kafka topics as a broker-level default, or it can be overridden per topic; depending on the nature of a topic, we can set its retention accordingly.

For example, topics used for logging can be given a different retention than topics used for communication between services.
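As a minimal sketch, assuming a broker reachable at localhost:9092 and a hypothetical topic named app-logs (both purely illustrative), a per-topic override could be applied with the kafka-configs tool shipped with recent Kafka releases:

    # Keep app-logs data for 1 day instead of the broker-wide default
    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
      --entity-type topics --entity-name app-logs \
      --alter --add-config retention.ms=86400000

Here retention.ms is the topic-level counterpart of the broker setting log.retention.ms described below.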

Apache Kafka provides two types of retention policies.

Time-Based Retention:

Once the configured retention time has been reached for a segment, it is marked for deletion or compaction, depending on the configured cleanup policy. The default retention period for segments is 7 days.

Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:

log.retention.ms, log.retention.minutes and log.retention.hours: the number of milliseconds/minutes/hours to keep a log segment before deleting it. As described in the Apache Kafka documentation, these parameters are evaluated in the order listed above: if log.retention.ms is set, Kafka ignores the values set for log.retention.minutes and log.retention.hours, and so on. If set to -1, no time limit is applied.
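A minimal sketch of time-based retention in the broker's server.properties (the value is illustrative; remember that log.retention.ms, if present, wins over the minutes and hours variants):

    # server.properties
    # Keep log segments for 3 days (259200000 ms); overrides
    # log.retention.minutes and log.retention.hours if those are also set
    log.retention.ms=259200000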

Please note that from Kafka 2.0.0, KIP-186 increases the default offset retention time from 1 day to 7 days. This makes it less likely to “lose” offsets in an application that commits infrequently. It also increases the active set of offsets and therefore can increase memory usage on the broker. Note that the console consumer currently enables offset commit by default and can be the source of a large number of offsets which this change will now preserve for 7 days instead of 1. You can preserve the existing behavior by setting the broker config offsets.retention.minutes to 1440.
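That override is a one-line broker setting:

    # server.properties
    # Revert committed-offset retention to the pre-2.0.0 default of 1 day
    offsets.retention.minutes=1440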

Size-Based Retention:

In this policy, we configure the maximum size of the log data structure for a topic partition via log.retention.bytes. Once the log reaches this size, Kafka starts removing segments from its tail, i.e. the oldest segments first. This policy is less popular because it does not provide good visibility into when messages expire. However, it comes in handy when we need to bound the size of a log due to limited disk space. The related parameters are listed here, with a configuration sketch after the list:

  1. log.segment.bytes: the maximum size of a single log segment file. Every log in Kafka is stored within a partition of a topic, and each partition is further divided into segments. A segment is thus an ordered collection of messages.
  2. log.retention.check.interval.ms: the frequency in milliseconds with which the log cleaner checks whether any log is eligible for deletion.
  3. log.segment.delete.delay.ms: the amount of time to wait before deleting a file from the file system.
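As a sketch of what size-based retention might look like in server.properties (all values illustrative):

    # server.properties
    # Cap each partition's log at roughly 1 GiB; oldest segments go first
    log.retention.bytes=1073741824
    # Roll a new segment file every 256 MiB
    log.segment.bytes=268435456
    # Scan for deletable segments every 5 minutes
    log.retention.check.interval.ms=300000
    # Wait 1 minute before physically deleting a removed segment file
    log.segment.delete.delay.ms=60000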

Cleanup policies:

In Kafka, unlike other messaging systems, messages on a topic are not immediately removed after they are consumed. Instead, each topic's configuration determines how much space the topic is permitted to use and how that space is managed.

The concept of making data expire is called cleanup. It is a topic-level configuration, and it is important for keeping log segments from growing in size without bound.

How the cleanup policies work is well illustrated in the Apache Kafka documentation.
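Since cleanup is a topic-level setting, it can be switched per topic with the kafka-configs tool. A minimal sketch, assuming a broker at localhost:9092 and a hypothetical topic named user-events (valid values are delete, compact, or both combined as compact,delete):

    # Use compaction instead of time/size-based deletion for this topic
    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
      --entity-type topics --entity-name user-events \
      --alter --add-config cleanup.policy=compact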

Log Compaction:

This retention policy can be set per-topic, so a single cluster can have some topics where retention is enforced by size or time and other topics where retention is enforced by compaction.

This functionality is inspired by one of LinkedIn's oldest and most successful pieces of infrastructure: a database change-log caching service called Databus. Unlike most log-structured storage systems, Kafka is built for subscription and organizes data for fast linear reads and writes.

However, retaining the complete log uses more and more space as time goes by, and replay takes longer and longer. Hence, Kafka supports a different type of retention. Instead of simply throwing away the old log, Kafka can remove obsolete records, i.e. records whose primary key has a more recent update. By doing this, we can still guarantee that the log contains a complete backup of the source system, but we can no longer recreate all previous states of the source system, only the more recent ones. Kafka calls this feature log compaction.
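To make this concrete, here is a sketch of creating a compacted topic at the command line. The topic name and tuning values are illustrative; min.cleanable.dirty.ratio and delete.retention.ms are optional knobs controlling how eagerly the cleaner runs and how long tombstones for deleted keys are kept:

    # Create a topic whose log is compacted rather than truncated
    bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
      --topic user-profiles --partitions 3 --replication-factor 1 \
      --config cleanup.policy=compact \
      --config min.cleanable.dirty.ratio=0.5 \
      --config delete.retention.ms=86400000

With this policy, replaying the topic from the beginning yields at least the latest value for every key, which is exactly the "complete backup" guarantee described above.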
