Handling Poison Pills — Dead Letter Queues with Spring Kafka

A poison pill is a message your consumer cannot process — bad schema, missing field, downstream service unavailable, or a bug in your handler. Without a dead letter strategy, the consumer stalls at the same offset indefinitely, blocking every message behind it. The partition is effectively dead.

Spring Kafka’s DefaultErrorHandler with DeadLetterPublishingRecoverer gives you configurable retry, backoff, and automatic routing of exhausted messages to a dead letter topic — without writing retry infrastructure yourself.

The default problem

By default, a Spring Kafka @KafkaListener that throws an exception will retry that message according to the container’s error handler. Without configuration, this means infinite retries at the same offset. A single bad message will halt a partition permanently.

The goal is to: retry transiently-failing messages with backoff, route permanently-failing messages to a DLT for inspection and recovery, and keep the main consumer moving.

Setting up DefaultErrorHandler with a DLT

The core configuration:

@Configuration
public class KafkaErrorHandlerConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> kafkaTemplate) {
        var recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);

        var backOff = new ExponentialBackOffWithMaxRetries(3);
        backOff.setInitialInterval(1_000);
        backOff.setMultiplier(2.0);
        backOff.setMaxInterval(10_000);

        return new DefaultErrorHandler(recoverer, backOff);
    }
}

And wire it into the listener container factory:

@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> kafkaListenerContainerFactory(
        ConsumerFactory<String, Object> consumerFactory,
        DefaultErrorHandler errorHandler) {

    var factory = new ConcurrentKafkaListenerContainerFactory<String, Object>();
    factory.setConsumerFactory(consumerFactory);
    factory.setCommonErrorHandler(errorHandler);
    return factory;
}

With this config: Spring retries the message 3 times with exponential backoff (1s, 2s, 4s). After the third failure the DeadLetterPublishingRecoverer publishes the message to the DLT and commits the offset. The main consumer moves on.

DLT naming convention

By default, DeadLetterPublishingRecoverer routes to {original-topic}.DLT. A message that fails on order-events goes to order-events.DLT. The DLT message includes the original headers plus additional headers appended by Spring Kafka:

Header	Value
`kafka_dlt-original-topic`	Source topic name
`kafka_dlt-original-partition`	Partition the message came from
`kafka_dlt-original-offset`	Offset of the poison message
`kafka_dlt-original-timestamp`	Original message timestamp
`kafka_dlt-exception-fqcn`	Fully qualified exception class
`kafka_dlt-exception-message`	Exception message

These headers make it straightforward to diagnose, replay, or discard DLT messages with full context.

Consuming from the DLT

Listen to the DLT with @DltHandler on a method in the same @KafkaListener class, or as a separate listener:

@Component
@KafkaListener(topics = "order-events", groupId = "order-processor")
public class OrderEventListener {

    private final OrderService orderService;
    private final AlertService alertService;

    @KafkaHandler
    public void handle(OrderPlaced event) {
        orderService.process(event);
    }

    @DltHandler
    public void handleDlt(OrderPlaced event, @Header KafkaHeaders.RECEIVED_TOPIC String topic) {
        log.error("Order event sent to DLT — topic={} orderId={}", topic, event.orderId());
        alertService.notifyDltMessage(event.orderId(), topic);
    }
}

@DltHandler receives the deserialized message and all Kafka headers. Use it to alert, persist the failed message to a table for manual review, or publish a compensation event.

Non-retryable exceptions

Some failures should go straight to the DLT — no retry. A JsonParseException means the payload is corrupt; retrying it 3 times wastes 17 seconds and the outcome is certain. Declare non-retryable exceptions on the error handler:

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> kafkaTemplate) {
    var recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
    var backOff = new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1_000);
    backOff.setMultiplier(2.0);

    var handler = new DefaultErrorHandler(recoverer, backOff);

    handler.addNotRetryableExceptions(
        JsonParseException.class,
        MessageConversionException.class,
        MethodArgumentTypeMismatchException.class
    );

    return handler;
}

Non-retryable exceptions bypass backoff entirely — the message is published to the DLT immediately and the offset is committed.

Custom DLT routing by exception type

The default single-DLT approach works for most cases. For systems with multiple failure modes requiring different handling — some DLT messages auto-replayable, some requiring human review — route to different topics by exception type:

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> kafkaTemplate) {
    var recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate,
        (record, ex) -> {
            if (ex.getCause() instanceof ValidationException) {
                return new TopicPartition(record.topic() + ".invalid", record.partition());
            }
            if (ex.getCause() instanceof DownstreamUnavailableException) {
                return new TopicPartition(record.topic() + ".retry-later", record.partition());
            }
            return new TopicPartition(record.topic() + ".DLT", record.partition());
        }
    );

    return new DefaultErrorHandler(recoverer, new ExponentialBackOffWithMaxRetries(3));
}

The destination function receives the original ConsumerRecord and the exception. Routing to the same partition as the source (using record.partition()) preserves ordering within the DLT.

Monitoring DLT depth

DLT lag is a production signal. A growing DLT means either a recurring bug in your consumer or a downstream service has been unavailable long enough for messages to exhaust retries. Expose it as a metric:

@Component
public class DltLagMonitor {

    private final KafkaAdmin kafkaAdmin;
    private final MeterRegistry meterRegistry;

    @Scheduled(fixedDelay = 30_000)
    public void checkDltLag() {
        // Consumer group lag on *.DLT topics → Gauge metric
        // Alert when lag exceeds threshold
    }
}

In practice, a Prometheus alert on kafka_consumer_fetch_manager_records_lag_max scoped to your DLT consumer group is sufficient — no custom code needed if you’re already scraping Micrometer’s Kafka binder metrics.

Replaying DLT messages

When the bug is fixed and you want to replay DLT messages, the cleanest approach is to reset the DLT consumer group offset to the earliest position and restart the consumer. Spring Boot’s Kafka admin tools or the kafka-consumer-groups.sh CLI both work.

For programmatic replay in situations where you need selective re-processing:

@Service
public class DltReplayService {

    private final KafkaTemplate<String, Object> kafkaTemplate;

    public void replayFromDlt(ConsumerRecord<String, Object> dltRecord) {
        var originalTopic = new String(dltRecord.headers()
            .lastHeader("kafka_dlt-original-topic").value());

        kafkaTemplate.send(originalTopic, dltRecord.key(), dltRecord.value());
    }
}

Strip the DLT headers before republishing if your consumer might otherwise treat them as signals.

ProTips

Set a finite retention on DLT topics: DLT topics can accumulate indefinitely. Set retention.ms to match your SLA for resolving production issues — 7 days is a reasonable default for most systems.

Distinguish retryable from non-retryable at the domain level: A ValidationException (bad data, never recoverable) and a ServiceUnavailableException (transient, retryable) should have different handling. Define a hierarchy at the domain level rather than catching checked exceptions from third-party libraries.

Don’t retry on deserialization errors: If the message can’t be deserialized, it will always fail. Add ErrorHandlingDeserializer to your consumer factory to route deserialization failures to the DLT without even reaching the listener method.

Test DLT routing with EmbeddedKafka: Spring’s @EmbeddedKafka supports the full error handler pipeline. Write tests that publish a bad message and assert it appears on the DLT — it’s the only way to catch routing configuration bugs before production.

If you’re building Kafka consumers that need to handle failures safely and want to discuss your error handling strategy, get in touch.