
Saga Pattern for Distributed Transactions in Spring Boot

Implementing the Saga pattern for distributed transactions in Spring Boot with Kafka — choreography vs orchestration, compensating transactions, failure handling, and saga state persistence.

Every microservices architecture reaches the same problem eventually. A business operation has to span multiple services — each with its own database, its own failure modes, its own retry budget — and you need the end state to be consistent even when one step fails halfway through. A two-phase commit would solve it, but two-phase commit across independent microservices doesn’t scale and introduces the kind of tight runtime coupling that defeats the entire purpose of having separate services.

The Saga pattern is the standard answer. I’ve built saga implementations at DWP Digital, where the claimant onboarding journey for the JSA platform spans identity verification, eligibility assessment, and payment setup — three bounded contexts, three separate databases, and a business requirement that partial failures leave no orphaned records. This is what a production Saga implementation looks like in Java with Spring Boot and Kafka.

Choreography vs Orchestration

There are two ways to implement a Saga:

Choreography — each service listens for domain events and reacts autonomously. A ClaimCreated event causes the eligibility service to run a check; an EligibilityConfirmed event causes the payment service to schedule a payment. No central coordinator. Services are loosely coupled, but the overall flow is implicit — it lives nowhere in the codebase and is only legible by reading every consumer.

Orchestration — a dedicated orchestrator sends commands to services and listens for their replies. The entire flow is visible in one component. More coupling to the orchestrator, but the state machine is explicit, debuggable, and restartable.

For anything beyond a simple two-step saga, orchestration is the better choice. When a saga fails at step four of six and enters compensation, you want that logic in one place. Choreography for the same scenario means distributing the compensation logic across every participating service and hoping they all handle every failure combination correctly.
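Both styles move the same messages; what differs is who decides the next step. The orchestrator code below correlates every reply through a saga ID, which can be modelled with flat envelopes like these (a simplified sketch; the concrete command classes used later, such as VerifyIdentityCommand, carry extra payload on top of this shape):

```java
// Correlation envelopes for an orchestrated saga. Every message carries the
// sagaId so a reply can be mapped back to exactly one saga instance.
record SagaEvent(String sagaId, String type) {}

record SagaCommand(String sagaId, String claimantId, String type) {}
```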

Modelling Saga State

Every saga instance needs explicit state. Model it as an enum and persist it:

public enum ClaimOnboardingState {
    STARTED,
    IDENTITY_VERIFIED,
    ELIGIBILITY_CONFIRMED,
    PAYMENT_SCHEDULED,
    COMPLETED,

    // Compensation states — entered on failure, work backwards
    COMPENSATING_PAYMENT,
    COMPENSATING_ELIGIBILITY,
    COMPENSATING_IDENTITY,
    FAILED
}

@Document(collection = "saga_instances")
@Data
public class ClaimOnboardingSaga {

    @Id
    private String sagaId;
    private String claimantId;
    private ClaimOnboardingState state;
    private String failureReason;
    private Instant startedAt;
    private Instant updatedAt;

    public static ClaimOnboardingSaga create(String claimantId) {
        ClaimOnboardingSaga saga = new ClaimOnboardingSaga();
        saga.sagaId     = UUID.randomUUID().toString();
        saga.claimantId = claimantId;
        saga.state      = ClaimOnboardingState.STARTED;
        saga.startedAt  = Instant.now();
        saga.updatedAt  = Instant.now();
        return saga;
    }

    public void transitionTo(ClaimOnboardingState next) {
        this.state     = next;
        this.updatedAt = Instant.now();
    }
}

Persist saga state to MongoDB or PostgreSQL. This is not optional — it is your crash-recovery mechanism. If the orchestrator pod restarts mid-saga, it must be able to reload every in-flight saga and continue from its last persisted state.
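On startup, recovery amounts to loading every saga document that is not in a terminal state and resuming it. A sketch of the selection logic, as pure Java (the enum is repeated so the sketch compiles on its own; in production the filter would be a repository query):

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Repeated from the enum above so this sketch compiles on its own.
enum ClaimOnboardingState {
    STARTED, IDENTITY_VERIFIED, ELIGIBILITY_CONFIRMED, PAYMENT_SCHEDULED, COMPLETED,
    COMPENSATING_PAYMENT, COMPENSATING_ELIGIBILITY, COMPENSATING_IDENTITY, FAILED
}

class SagaRecovery {

    // COMPLETED and FAILED are the only terminal states; everything else is in flight.
    private static final Set<ClaimOnboardingState> TERMINAL =
        EnumSet.of(ClaimOnboardingState.COMPLETED, ClaimOnboardingState.FAILED);

    static boolean isInFlight(ClaimOnboardingState state) {
        return !TERMINAL.contains(state);
    }

    // On restart, only in-flight sagas need to be resumed from their last
    // persisted state; terminal ones are skipped.
    static List<ClaimOnboardingState> resumable(List<ClaimOnboardingState> persisted) {
        return persisted.stream().filter(SagaRecovery::isInFlight).toList();
    }
}
```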

The Orchestrator

@Service
@RequiredArgsConstructor
@Slf4j
public class ClaimOnboardingOrchestrator {

    private final SagaRepository sagaRepository;
    private final KafkaTemplate<String, SagaCommand> kafka;

    public void start(String claimantId, ClaimantDetails details) {
        ClaimOnboardingSaga saga = ClaimOnboardingSaga.create(claimantId);
        sagaRepository.save(saga);

        kafka.send("identity.commands", claimantId,
            new VerifyIdentityCommand(saga.getSagaId(), claimantId, details));

        log.info("Saga {} started for claimant {}", saga.getSagaId(), claimantId);
    }

    @KafkaListener(topics = "identity.events", groupId = "saga-orchestrator")
    public void onIdentityEvent(SagaEvent event) {
        ClaimOnboardingSaga saga = load(event.sagaId());
        switch (event.type()) {
            case "IdentityVerified"       -> handleIdentityVerified(saga);
            case "IdentityCheckFailed"    -> compensate(saga, "Identity check failed");
        }
    }

    @KafkaListener(topics = "eligibility.events", groupId = "saga-orchestrator")
    public void onEligibilityEvent(SagaEvent event) {
        ClaimOnboardingSaga saga = load(event.sagaId());
        switch (event.type()) {
            case "EligibilityConfirmed"   -> handleEligibilityConfirmed(saga);
            case "EligibilityRejected"    -> compensate(saga, "Eligibility rejected");
        }
    }

    @KafkaListener(topics = "payments.events", groupId = "saga-orchestrator")
    public void onPaymentEvent(SagaEvent event) {
        ClaimOnboardingSaga saga = load(event.sagaId());
        switch (event.type()) {
            case "PaymentScheduled"       -> completeSaga(saga);
            case "PaymentSetupFailed"     -> compensate(saga, "Payment setup failed");
        }
    }

    private void handleIdentityVerified(ClaimOnboardingSaga saga) {
        saga.transitionTo(ClaimOnboardingState.IDENTITY_VERIFIED);
        sagaRepository.save(saga);

        kafka.send("eligibility.commands", saga.getClaimantId(),
            new CheckEligibilityCommand(saga.getSagaId(), saga.getClaimantId()));
    }

    private void handleEligibilityConfirmed(ClaimOnboardingSaga saga) {
        saga.transitionTo(ClaimOnboardingState.ELIGIBILITY_CONFIRMED);
        sagaRepository.save(saga);

        kafka.send("payments.commands", saga.getClaimantId(),
            new SchedulePaymentCommand(saga.getSagaId(), saga.getClaimantId()));
    }

    private void completeSaga(ClaimOnboardingSaga saga) {
        saga.transitionTo(ClaimOnboardingState.COMPLETED);
        sagaRepository.save(saga);
        log.info("Saga {} completed for claimant {}", saga.getSagaId(), saga.getClaimantId());
    }

    private ClaimOnboardingSaga load(String sagaId) {
        return sagaRepository.findBySagaId(sagaId)
            .orElseThrow(() -> new IllegalStateException("Saga not found: " + sagaId));
    }
}

The orchestrator is stateless — all state lives in the repository. Multiple instances can run concurrently without coordination, because every event carries a sagaId that maps it to exactly one saga instance. If two events for the same saga could ever be processed at once, guard the saga document with optimistic locking (a version field) so the slower writer fails and retries rather than silently overwriting state.

Compensating Transactions

Each forward step must have a compensating action that undoes it. Compensating actions must be idempotent — you may need to retry them if the downstream service times out during compensation:

// In the identity service — handles both forward and compensating commands
@KafkaListener(topics = "identity.commands", groupId = "identity-service")
public void onIdentityCommand(SagaCommand command) {
    switch (command.type()) {
        case "VerifyIdentity"   -> handleVerifyIdentity(command);
        case "RevokeIdentity"   -> handleRevokeIdentity(command);  // compensation
    }
}

private void handleRevokeIdentity(SagaCommand command) {
    // Idempotent: safe to call multiple times — finds and marks, or does nothing
    identityRepository.findByClaimantId(command.claimantId())
        .ifPresent(record -> {
            record.markRevoked("saga-compensation:" + command.sagaId());
            identityRepository.save(record);
        });

    // Always publish success — if it didn't exist, the goal is already achieved
    kafka.send("identity.events", command.claimantId(),
        new SagaEvent(command.sagaId(), "IdentityRevoked"));
}

The compensation handler publishes a success reply even if the record was never created. This is correct: the goal of compensation is to ensure the resource is absent. If it was never created, it is already absent.
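Beyond find-and-mark handlers, idempotency can also be enforced generically by recording which (sagaId, command type) pairs a consumer has already processed. A sketch, with an in-memory set standing in for what would in practice be a unique-keyed table in the service's own database, written in the same transaction as the handler:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Generic dedupe guard: a consumer records each (sagaId, commandType) pair it
// has handled and skips replays. Illustrative sketch; a real implementation
// would persist the keys alongside the handler's own writes.
class CommandDedupe {

    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true the first time a command is seen, false on any replay.
    boolean firstDelivery(String sagaId, String commandType) {
        return processed.add(sagaId + ":" + commandType);
    }
}
```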

Initiating Compensation

The orchestrator’s compensation logic works backwards through the state machine:

private void compensate(ClaimOnboardingSaga saga, String reason) {
    log.warn("Saga {} compensating from state {} — {}",
        saga.getSagaId(), saga.getState(), reason);
    saga.setFailureReason(reason);

    switch (saga.getState()) {
        case PAYMENT_SCHEDULED -> {
            // Undo the furthest step first: cancel the scheduled payment
            // before unwinding eligibility and identity.
            saga.transitionTo(ClaimOnboardingState.COMPENSATING_PAYMENT);
            kafka.send("payments.commands", saga.getClaimantId(),
                new CancelPaymentCommand(saga.getSagaId(), saga.getClaimantId()));
        }
        case ELIGIBILITY_CONFIRMED -> {
            saga.transitionTo(ClaimOnboardingState.COMPENSATING_ELIGIBILITY);
            kafka.send("eligibility.commands", saga.getClaimantId(),
                new RevokeEligibilityCommand(saga.getSagaId(), saga.getClaimantId()));
        }
        case IDENTITY_VERIFIED -> {
            saga.transitionTo(ClaimOnboardingState.COMPENSATING_IDENTITY);
            kafka.send("identity.commands", saga.getClaimantId(),
                new RevokeIdentityCommand(saga.getSagaId(), saga.getClaimantId()));
        }
        default -> {
            saga.transitionTo(ClaimOnboardingState.FAILED);
            log.error("Saga {} failed at state {} — nothing to compensate",
                saga.getSagaId(), saga.getState());
        }
    }
    sagaRepository.save(saga);
}
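What the snippet above starts, the revoke replies must finish: the orchestrator also consumes the compensation events and keeps stepping backwards until it reaches FAILED, sending the next revoke command at each step. The backward walk is just another transition function, sketched here as pure logic (event names are illustrative; the enum is repeated so the sketch compiles on its own):

```java
// States repeated from the enum above so this sketch compiles on its own.
enum ClaimOnboardingState {
    STARTED, IDENTITY_VERIFIED, ELIGIBILITY_CONFIRMED, PAYMENT_SCHEDULED, COMPLETED,
    COMPENSATING_PAYMENT, COMPENSATING_ELIGIBILITY, COMPENSATING_IDENTITY, FAILED
}

class CompensationFlow {

    // Pure backward transition function for the compensation chain. Each revoke
    // reply moves the saga one step further back; IdentityRevoked is the final
    // undo, after which the saga settles in the terminal FAILED state. On each
    // transition the orchestrator would also send the next revoke command,
    // mirroring compensate(). Unexpected events leave the state unchanged.
    static ClaimOnboardingState nextCompensationState(ClaimOnboardingState current,
                                                      String eventType) {
        return switch (current) {
            case COMPENSATING_PAYMENT -> "PaymentCancelled".equals(eventType)
                ? ClaimOnboardingState.COMPENSATING_ELIGIBILITY : current;
            case COMPENSATING_ELIGIBILITY -> "EligibilityRevoked".equals(eventType)
                ? ClaimOnboardingState.COMPENSATING_IDENTITY : current;
            case COMPENSATING_IDENTITY -> "IdentityRevoked".equals(eventType)
                ? ClaimOnboardingState.FAILED : current;
            default -> current;
        };
    }
}
```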

Timeout Handling

A saga step may never respond — the downstream service might be down, the consumer might have crashed before publishing a reply. Without explicit timeout handling, a saga sits in a non-terminal state indefinitely.

@Scheduled(fixedDelay = 60_000)
public void detectStalledSagas() {
    Instant cutoff = Instant.now().minus(Duration.ofMinutes(10));

    List<ClaimOnboardingSaga> stalled = sagaRepository
        .findByStateNotInAndUpdatedAtBefore(
            List.of(ClaimOnboardingState.COMPLETED, ClaimOnboardingState.FAILED),
            cutoff
        );

    for (ClaimOnboardingSaga saga : stalled) {
        log.warn("Saga {} stalled in state {} for >10 minutes — triggering compensation",
            saga.getSagaId(), saga.getState());
        compensate(saga, "Step timeout");
    }
}

This is where many first-pass saga implementations fall short. A stalled saga that never times out is a data consistency hole. Build the timeout scheduler from day one, not as a later enhancement.

Observability

Log every state transition with the saga ID and the business identifier — in the DWP context, that means the claimantId and the saga state on every event. Build a simple dashboard or periodic query over your saga collection that counts in-flight sagas by state.
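The shape of that periodic query, sketched in plain Java over loaded state names (in production this would be a MongoDB aggregation or a SQL GROUP BY rather than an in-memory stream):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class SagaHealth {

    // Health snapshot: count saga instances per persisted state name. Terminal
    // states dominating is healthy; anything lingering in a non-terminal state
    // is the operational signal.
    static Map<String, Long> countByState(List<String> states) {
        return states.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```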

A healthy onboarding flow should have no sagas older than a few minutes in non-terminal states. Any deviation is an operational signal worth investigating before it becomes a support ticket.

The Saga pattern adds real orchestration complexity. But in a multi-service system where business operations must leave consistent state, the alternative is either silent partial failures or distributed locks — both worse outcomes. Sagas make the complexity explicit and put it in one place.

If you’re designing distributed transaction flows across Spring Boot microservices and need the implementation to be correct under failure, get in touch.