
The Transactional Outbox Pattern with Kafka and MongoDB

How to guarantee at-least-once event delivery from a Spring Boot service to Kafka — implementing the Transactional Outbox Pattern with MongoDB, a polling publisher, and an idempotent consumer.

The single most dangerous line of code in an event-driven system is this:

@Transactional
public void processClaimDecision(ClaimDecision decision) {
    claimRepository.save(buildClaimState(decision));   // writes to MongoDB
    kafkaTemplate.send("claim.decisions", decision);   // publishes to Kafka
}

It looks correct. It’s not. If the Kafka publish fails after the database write commits, your state is updated but no downstream service knows about it. If the Kafka publish succeeds but the transaction rolls back, you’ve published a fact that never happened. Either way, your system is inconsistent in a way that’s difficult to detect and painful to recover from.

This is the dual-write problem, and the Transactional Outbox Pattern is the standard solution. I’ve applied it on the DWP Digital benefit claims platform — a Spring Boot / Kafka / MongoDB system where losing a claim decision event could directly harm a claimant.

What the Outbox Pattern Is

The insight is simple: you can’t atomically commit to two separate systems (MongoDB and Kafka), but you can atomically commit to one. So instead of publishing to Kafka directly, you write your event into an outbox collection in the same database, inside the same transaction. A separate polling publisher then reads the outbox and publishes to Kafka independently.

┌─────────────────────────────────────────────────┐
│  @Transactional                                 │
│                                                 │
│  claimRepository.save(claim)           ─┐       │
│  outboxRepository.save(outboxEvent)    ─┘ atomic│
└─────────────────────────────────────────────────┘
                     │
          (same MongoDB transaction)
                     │
         ┌───────────▼────────────┐
         │  Outbox Collection     │
         │  { id, topic, payload, │
         │    status: PENDING }   │
         └───────────┬────────────┘
                     │
          (polling publisher, every N ms)
                     │
         ┌───────────▼────────────┐
         │  Kafka Topic           │
         └────────────────────────┘

The key property: if publishing to Kafka fails, the outbox record remains PENDING and will be retried. If publishing succeeds but the publisher crashes before marking the record PUBLISHED, it gets retried — which means consumers must handle at-least-once delivery. That’s a solvable problem; lost events are not.

The Outbox Document

@Document(collection = "outbox_events")
@Getter
@Setter
public class OutboxEvent {

    @Id
    private String id;
    private String aggregateId;
    private String aggregateType;
    private String topic;
    private String eventType;
    private String payload;
    private OutboxStatus status;
    private Instant createdAt;
    private Instant processedAt;
    private int retryCount;

    public enum OutboxStatus { PENDING, PUBLISHED, FAILED }

    public static OutboxEvent of(String aggregateId,
                                  String aggregateType,
                                  String topic,
                                  String eventType,
                                  String payload) {
        OutboxEvent e = new OutboxEvent();
        e.id = UUID.randomUUID().toString();
        e.aggregateId = aggregateId;
        e.aggregateType = aggregateType;
        e.topic = topic;
        e.eventType = eventType;
        e.payload = payload;
        e.status = OutboxStatus.PENDING;
        e.createdAt = Instant.now();
        e.retryCount = 0;
        return e;
    }
}

Writing to the Outbox Atomically

In MongoDB with Spring Data, multi-document transactions require a replica set and a MongoTransactionManager bean registered in the application context (Spring Boot does not auto-configure one for MongoDB). On Atlas or a properly configured self-hosted cluster, @Transactional then works as expected:

@Service
@RequiredArgsConstructor
public class ClaimDecisionService {

    private final ClaimRepository claimRepository;
    private final OutboxRepository outboxRepository;
    private final ObjectMapper objectMapper;

    @Transactional
    public void processDecision(ClaimDecision decision) throws JsonProcessingException {
        BenefitClaim claim = claimRepository.findById(decision.claimId())
            .orElseThrow(() -> new ClaimNotFoundException(decision.claimId()));

        claim.applyDecision(decision);
        claimRepository.save(claim);

        OutboxEvent event = OutboxEvent.of(
            claim.getId(),
            "BenefitClaim",
            "claim.decisions",
            "ClaimDecisionApplied",
            objectMapper.writeValueAsString(decision)
        );
        outboxRepository.save(event);
    }
}

The claimRepository.save() and outboxRepository.save() are in the same transaction. Either both persist or neither does. Kafka is not involved at this point.
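One MongoDB-specific caveat worth a code sketch: Spring Boot auto-configures a MongoDatabaseFactory but not a transaction manager for MongoDB, so @Transactional cannot open a MongoDB transaction until one is registered. A minimal configuration, assuming standard Spring Data MongoDB:

```java
@Configuration
public class MongoTransactionConfig {

    // Without this bean, @Transactional has no MongoDB transaction
    // manager to delegate to and the two saves are not atomic.
    @Bean
    public MongoTransactionManager transactionManager(MongoDatabaseFactory dbFactory) {
        return new MongoTransactionManager(dbFactory);
    }
}
```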

The Polling Publisher

A scheduled component polls the outbox and publishes pending events:

@Component
@RequiredArgsConstructor
@Slf4j
public class OutboxPublisher {

    private final OutboxRepository outboxRepository;
    private final KafkaTemplate<String, String> kafkaTemplate;

    @Scheduled(fixedDelay = 500) // poll every 500ms
    public void publishPending() {
        List<OutboxEvent> pending = outboxRepository
            .findTop50ByStatusOrderByCreatedAtAsc(OutboxStatus.PENDING);

        for (OutboxEvent event : pending) {
            ProducerRecord<String, String> record = new ProducerRecord<>(
                event.getTopic(), event.getAggregateId(), event.getPayload());
            // Carry the outbox id as a header so consumers can deduplicate
            record.headers().add("eventId", event.getId().getBytes(StandardCharsets.UTF_8));
            try {
                kafkaTemplate.send(record)
                    .whenComplete((result, ex) -> {
                        if (ex != null) {
                            handlePublishFailure(event, ex);
                        } else {
                            markPublished(event);
                        }
                    });
            } catch (Exception ex) {
                handlePublishFailure(event, ex);
            }
        }
    }

    private void markPublished(OutboxEvent event) {
        event.setStatus(OutboxStatus.PUBLISHED);
        event.setProcessedAt(Instant.now());
        outboxRepository.save(event);
    }

    private void handlePublishFailure(OutboxEvent event, Throwable ex) {
        log.error("Failed to publish outbox event {}: {}", event.getId(), ex.getMessage());
        event.setRetryCount(event.getRetryCount() + 1);
        if (event.getRetryCount() >= 5) {
            event.setStatus(OutboxStatus.FAILED);
            log.error("Outbox event {} permanently failed after {} retries", event.getId(), event.getRetryCount());
        }
        outboxRepository.save(event);
    }
}

A few design choices here worth noting. The findTop50 cap prevents the publisher from being overwhelmed if a backlog builds up. Using aggregateId as the Kafka message key ensures all events for a given claim land on the same partition. One caveat: because sends complete asynchronously and failures are only retried on a later poll, a retried event can arrive after newer events for the same claim, so if strict per-claim ordering matters, stop publishing the batch at the first failure. The FAILED status parks permanently broken events for manual investigation rather than retrying forever.
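The retry handling above retries a failed event on every subsequent poll until the five-attempt cap. If you would rather have failing events back off instead of being retried every 500 ms, a small helper can derive the next attempt time from retryCount. A sketch (OutboxBackoff and its constants are illustrative choices, not part of the service above):

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative helper: exponential backoff for outbox retries.
// Delay doubles per attempt (500ms, 1s, 2s, ...) capped at 5 minutes.
class OutboxBackoff {

    private static final Duration BASE = Duration.ofMillis(500);
    private static final Duration MAX = Duration.ofMinutes(5);

    static Duration delayFor(int retryCount) {
        // Cap the shift so the bit-shift cannot overflow for large counts
        long millis = BASE.toMillis() << Math.min(retryCount, 20);
        return millis > MAX.toMillis() ? MAX : Duration.ofMillis(millis);
    }

    static Instant nextAttemptAt(Instant lastAttempt, int retryCount) {
        return lastAttempt.plus(delayFor(retryCount));
    }
}
```

The publisher would then skip events whose next attempt time is still in the future when it polls.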

The Repository Queries

public interface OutboxRepository extends MongoRepository<OutboxEvent, String> {

    List<OutboxEvent> findTop50ByStatusOrderByCreatedAtAsc(OutboxStatus status);

    long deleteByStatusAndCreatedAtBefore(OutboxStatus status, Instant cutoff);
}

Add an index on status and createdAt — the polling publisher will run this query every 500ms in production:

@Document(collection = "outbox_events")
@CompoundIndex(name = "status_created", def = "{'status': 1, 'createdAt': 1}")
public class OutboxEvent { ... }
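One gotcha: since Spring Data MongoDB 3.0, annotation-driven index creation is disabled by default, so the @CompoundIndex above is ignored unless you opt in (or create the index in a migration tool, which many teams prefer in production):

```yaml
spring:
  data:
    mongodb:
      auto-index-creation: true
```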

Making Consumers Idempotent

Because the publisher can retry, consumers must handle receiving the same event more than once. The pattern is a processed-events store keyed on event ID:

@Component
@RequiredArgsConstructor
@Slf4j
public class ClaimDecisionConsumer {

    private final ClaimProjectionRepository projectionRepository;
    private final ProcessedEventRepository processedEventRepository;
    private final ObjectMapper objectMapper;
    @KafkaListener(topics = "claim.decisions", groupId = "projection-service")
    public void consume(
            @Payload String payload,
            @Header(KafkaHeaders.RECEIVED_KEY) String key,
            @Header("eventId") String eventId) throws JsonProcessingException {

        if (processedEventRepository.existsById(eventId)) {
            log.debug("Skipping already-processed event {}", eventId);
            return;
        }

        ClaimDecision decision = objectMapper.readValue(payload, ClaimDecision.class);
        projectionRepository.upsert(buildProjection(decision));
        processedEventRepository.save(new ProcessedEvent(eventId, Instant.now()));
    }
}

Storing the eventId as your deduplication key means retries are safe regardless of how many times the publisher delivers the same message.
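One subtlety: existsById followed by save is check-then-act, not atomic. Kafka's one-consumer-per-partition model makes concurrent duplicates unlikely, but a rebalance mid-processing can still produce them. The robust fix is to make the marker write itself the gate: in MongoDB, an insert keyed on the unique _id, treating a DuplicateKeyException as "already processed". The in-memory stand-in below (ProcessedEvents is an illustrative name, not from the service above) shows the shape:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for an insert-if-absent dedup gate. With MongoDB the
// same shape is an insert against the unique _id, catching
// DuplicateKeyException to mean "another delivery got here first".
class ProcessedEvents {

    private final ConcurrentMap<String, Long> seen = new ConcurrentHashMap<>();

    // Returns true for exactly one caller per eventId, atomically.
    boolean markIfFirst(String eventId, long processedAtMillis) {
        return seen.putIfAbsent(eventId, processedAtMillis) == null;
    }
}
```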

Should You Use CDC Instead?

Change Data Capture (CDC) via Debezium is an alternative to the polling publisher. Rather than polling, Debezium tails the MongoDB oplog and emits events directly from database changes. This gives sub-second latency versus the polling interval, and removes the need for a publisher component entirely.

The trade-off: Debezium adds operational complexity. You need Kafka Connect running, connector configuration to manage, and a reliable oplog setup. For a single service in a DWP-scale deployment, the polling publisher is simpler to reason about, easier to test, and entirely sufficient. Where you have dozens of services all needing outbox behaviour, CDC starts to earn its keep.
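For reference, a Debezium MongoDB connector wired to the outbox event router looks roughly like this. Property names vary across Debezium versions and the database/collection names here are placeholders, so treat it as a sketch rather than a copy-paste config:

```json
{
  "name": "claims-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
    "mongodb.connection.string": "mongodb://mongo1:27017/?replicaSet=rs0",
    "collection.include.list": "claims.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.connector.mongodb.transforms.outbox.MongoEventRouter",
    "transforms.outbox.route.by.field": "topic"
  }
}
```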

Cleaning Up Processed Events

Outbox events and processed event records accumulate. A nightly cleanup task keeps collections manageable:

@Scheduled(cron = "0 0 2 * * *") // 2am daily (Spring cron: second minute hour day month weekday)
public void cleanup() {
    Instant cutoff = Instant.now().minus(7, ChronoUnit.DAYS);
    outboxRepository.deleteByStatusAndCreatedAtBefore(OutboxStatus.PUBLISHED, cutoff);
    processedEventRepository.deleteByProcessedAtBefore(cutoff);
}

Keep FAILED records for manual review — don’t delete them automatically.
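An alternative to the scheduled delete, on MongoDB 3.2+, is a TTL index with a partial filter so that only PUBLISHED documents expire and FAILED records survive for review. Spring Data's index annotations don't express partial TTL indexes, so this one would be created directly, for example in mongosh:

```javascript
// Expire PUBLISHED outbox events 7 days after processing,
// leaving PENDING and FAILED documents untouched
db.outbox_events.createIndex(
  { processedAt: 1 },
  { expireAfterSeconds: 604800, partialFilterExpression: { status: "PUBLISHED" } }
)
```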

What This Buys You

The Outbox Pattern trades a small amount of code complexity for a fundamental reliability guarantee: every state change that should produce an event will produce that event, exactly once from the consumer’s perspective. In a claims processing system, a benefit payment platform, or any system where downstream services depend on event completeness, that’s not an optimisation — it’s a correctness requirement.

The polling publisher approach described here is production-ready for moderate throughput. At very high event rates (thousands per second), CDC becomes worth the operational cost. For most enterprise Spring Boot services, the pattern above is the right starting point.

If you’re building event-driven systems at scale and want an engineer who’s applied these patterns in production, get in touch.