Resilience Patterns — Circuit Breaker, Bulkhead, and Timeout in Distributed Systems

A distributed system will fail in ways that a monolith cannot. Network partitions, slow downstream services, and cascading failures are not edge cases — they are the normal operating environment. The resilience patterns are not about preventing failures; they are about containing them. Each pattern solves a specific failure mode, and understanding which failure you are guarding against tells you which pattern to apply.

Timeout: the baseline

The simplest and most important resilience primitive is the timeout. Any call to an external service must have a maximum duration after which it is abandoned. Without timeouts, a slow dependency causes your threads to accumulate waiting for responses that may never arrive — thread exhaustion follows, then complete service failure.

// Correct: bounded call
HttpRequest request = HttpRequest.newBuilder()
    .uri(uri)
    .timeout(Duration.ofSeconds(3))
    .build();

// Wrong: unbounded call
HttpRequest request = HttpRequest.newBuilder()
    .uri(uri)
    .build();   // blocks indefinitely on a dropped connection

Set timeouts based on your SLA, not on how fast the dependency usually responds. If your endpoint must respond in 5 seconds, the downstream call must complete in 3 — leaving margin for your own processing and network transit.
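The budget arithmetic above can be written down explicitly. The following is a minimal sketch, not a prescribed method: the `TimeoutBudget` class and the 1.5 s / 0.5 s split between processing and network transit are assumptions chosen to reproduce the 5-second example.

```java
import java.time.Duration;

/** Hypothetical helper: derives a downstream timeout from an endpoint's latency budget. */
final class TimeoutBudget {

    /**
     * Returns the time left for the downstream call after reserving
     * time for local processing and network transit.
     */
    static Duration downstreamTimeout(Duration endpointSla,
                                      Duration ownProcessing,
                                      Duration networkMargin) {
        Duration remaining = endpointSla.minus(ownProcessing).minus(networkMargin);
        if (remaining.isNegative() || remaining.isZero()) {
            throw new IllegalArgumentException("No budget left for the downstream call");
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Assumed split: 5 s SLA, 1.5 s own processing, 0.5 s network transit.
        Duration timeout = downstreamTimeout(
                Duration.ofSeconds(5), Duration.ofMillis(1500), Duration.ofMillis(500));
        System.out.println(timeout); // prints PT3S
    }
}
```

Deriving the value from the SLA keeps the timeout tied to what callers were promised, rather than to the dependency's typical latency.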

Circuit breaker: stopping the cascade

When a downstream service is degraded, calling it on every request wastes threads and adds latency to every user request. The circuit breaker tracks failure rates and opens the circuit when they exceed a threshold — subsequent calls fail immediately without hitting the dependency at all.

Closed (normal) → failure or slow-call rate exceeds threshold → Open (fail fast)
Open → wait duration elapses → Half-Open (permit a limited number of probe calls)
Half-Open → probe calls succeed → Closed
Half-Open → probe calls fail → Open

The three-state machine is the essential structure. Implementation in Resilience4j:

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)              // open after 50% failures
    .slowCallRateThreshold(80)             // also open on 80% slow calls
    .slowCallDurationThreshold(Duration.ofSeconds(2))
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(10)                 // measure over last 10 calls
    .minimumNumberOfCalls(5)               // need at least 5 calls to compute rate
    .build();

CircuitBreaker cb = CircuitBreaker.of("betfair-api", config);

Supplier<MarketBook> decoratedCall = CircuitBreaker.decorateSupplier(cb,
    () -> betfairClient.getMarketBook(marketId));

try {
    return decoratedCall.get();
} catch (CallNotPermittedException e) {
    // Circuit is open: serve cached data or fail with a degraded response
    return cachedBook.orElseThrow(() -> new ServiceUnavailableException("Market data unavailable"));
}

The slow call threshold is as important as the failure rate threshold. A service that returns slowly but never errors will still exhaust your thread pool — the circuit breaker should open on both.
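To make the state transitions concrete, here is a minimal sketch of a three-state breaker in plain Java. It is a teaching aid, not how Resilience4j is implemented: the `SimpleCircuitBreaker` class is hypothetical, it counts consecutive failures instead of measuring a failure rate over a sliding window, it ignores slow calls, and it does not limit how many probes run concurrently while half-open.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative three-state breaker; a teaching sketch, not Resilience4j's implementation. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the circuit
    private final long openMillis;      // time to stay open before allowing a probe
    private final LongSupplier clock;   // injectable time source, in milliseconds

    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Current state; an open breaker becomes half-open once the wait has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> supplier) {
        if (state() == State.OPEN) {
            // Fail fast without touching the dependency.
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = supplier.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        if (state == State.HALF_OPEN) {
            state = State.CLOSED; // a successful probe closes the circuit
        }
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;   // failed probe, or too many failures while closed
            openedAt = clock.getAsLong();
        }
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`; injecting it here only makes the transitions easy to exercise in a test.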

Bulkhead: isolating failure domains

A bulkhead limits concurrent calls to a dependency. Without one, a slow dependency can consume all available threads. Capping the number of in-flight calls per downstream service prevents any single dependency from taking down the whole application. Resilience4j's default Bulkhead enforces the cap with a semaphore on the caller's threads:

BulkheadConfig config = BulkheadConfig.custom()
    .maxConcurrentCalls(10)         // max simultaneous calls
    .maxWaitDuration(Duration.ofMillis(100))   // max time a caller blocks waiting for a permit
    .build();

Bulkhead bulkhead = Bulkhead.of("betfair-orders", config);

Supplier<OrderResult> decorated = Bulkhead.decorateSupplier(bulkhead,
    () -> betfairClient.placeOrder(request));

try {
    OrderResult result = decorated.get();
} catch (BulkheadFullException e) {
    log.warn("Order placement bulkhead full — dropping request");
    throw new OrderCapacityExceededException();
}
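Conceptually, a semaphore bulkhead is little more than a counting semaphore around the call. The sketch below shows the idea in plain Java; `SemaphoreBulkhead` is a hypothetical class written for illustration, and it signals rejection with `IllegalStateException` rather than a dedicated exception type.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Illustrative semaphore bulkhead; names are hypothetical, not Resilience4j's classes. */
final class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls, true); // fair: longest waiter first
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> supplier) {
        boolean acquired;
        try {
            // Block for at most maxWaitMillis; a value of 0 means reject immediately.
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return supplier.get();
        } finally {
            permits.release(); // always return the permit, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because the call runs on the caller's thread, this style of bulkhead caps concurrency but cannot interrupt a call that is already stuck; that still requires a timeout.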

For thread pool isolation (more resource-intensive but stronger isolation), use ThreadPoolBulkhead, which runs calls on a dedicated pool and returns a CompletionStage:

ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
    .maxThreadPoolSize(5)
    .coreThreadPoolSize(3)
    .queueCapacity(10)
    .build();

Combining patterns

The patterns compose. A resilient call typically layers a circuit breaker and a bulkhead with retry and a fallback:

Retry retry = Retry.of("betfair-retry",
    RetryConfig.custom()
        .maxAttempts(3)
        .waitDuration(Duration.ofMillis(500))
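        // Rejections from a full bulkhead or an open circuit are not worth retrying
        .retryOnException(e -> !(e instanceof BulkheadFullException
            || e instanceof CallNotPermittedException))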
        .build());

Supplier<MarketBook> resilientCall = Decorators
    .ofSupplier(() -> betfairClient.getMarketBook(marketId))
    .withCircuitBreaker(circuitBreaker)
    .withBulkhead(bulkhead)
    .withRetry(retry)
    .withFallback(List.of(Exception.class),
        e -> cachedMarketBook.orElseThrow())
    .decorate();

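The order matters. Each with... call wraps everything added before it, so the last decorator in the chain is the outermost. In the chain above, a request passes through the fallback, then retry, then the bulkhead, then the circuit breaker before it reaches the client. Every retry attempt therefore reacquires a bulkhead permit and is recorded by the circuit breaker, and the fallback runs only after retries are exhausted. Do not retry when the circuit breaker is open: the breaker rejects those calls with CallNotPermittedException before they reach the dependency, so retrying only burns the wait duration and delays the fallback. The retry predicate should exclude CallNotPermittedException as well as BulkheadFullException.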
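The wrapping order is easy to get backwards, so it can help to trace it. The sketch below uses plain function composition rather than Resilience4j; the `DecorationOrder` class and its `layer` helper are hypothetical, and each layer only records its name where a real decorator would apply its policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Illustrative only: shows that each added layer wraps the layers added before it. */
final class DecorationOrder {

    /** Wraps a supplier so that entering the named layer is recorded in the trace. */
    static <T> Supplier<T> layer(String name, List<String> trace, Supplier<T> inner) {
        return () -> {
            trace.add(name);
            return inner.get();
        };
    }

    /** Builds the layers in the same order as the builder chain and runs one call. */
    static List<String> traceOfCall() {
        List<String> trace = new ArrayList<>();
        Supplier<String> call = () -> {
            trace.add("remote call");
            return "ok";
        };
        call = layer("circuit breaker", trace, call); // added first, so innermost
        call = layer("bulkhead", trace, call);
        call = layer("retry", trace, call);
        call = layer("fallback", trace, call);        // added last, so outermost
        call.get();
        return trace;
    }

    public static void main(String[] args) {
        // prints [fallback, retry, bulkhead, circuit breaker, remote call]
        System.out.println(traceOfCall());
    }
}
```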

When each pattern applies

Timeout: Always. No call to an external service should be unbounded.

Circuit breaker: When the dependency is expected to recover, for example after a temporary overload, a network blip, or a deployment. If the dependency stays down, the breaker keeps cycling between open and half-open and callers keep getting fast failures or fallbacks; it contains the outage but does not fix it. Appropriate for synchronous calls to external APIs and databases.

Bulkhead: When multiple dependencies share a thread pool and you need to prevent one slow dependency from starving the others. Critical in microservice architectures where a single service calls several downstream services.

Retry: For transient errors — network blips, 503 responses, lock contention. Do not retry non-idempotent operations (order placement) unless you have idempotency keys. Do not retry when the failure rate is high — you will amplify load on an already-struggling service.
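The retry rules above reduce to two checks: is the error transient, and are attempts left? Here is a minimal sketch in plain Java, assuming exponential backoff and a caller-supplied predicate. The `SimpleRetry` class is hypothetical, and deciding which exceptions count as transient (and whether the operation is idempotent) remains the caller's responsibility.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Illustrative retry helper; a sketch, not a replacement for a resilience library. */
final class SimpleRetry {

    /**
     * Calls the supplier up to maxAttempts times. Only exceptions accepted by
     * isTransient are retried; the delay doubles after each failed attempt.
     */
    static <T> T retry(Supplier<T> call, int maxAttempts, long baseDelayMillis,
                       Predicate<RuntimeException> isTransient) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                // Give up on non-transient errors and when attempts are exhausted.
                if (!isTransient.test(e) || attempt == maxAttempts) {
                    throw e;
                }
                sleep(baseDelayMillis << (attempt - 1)); // base, 2x base, 4x base, ...
            }
        }
        throw new IllegalStateException("unreachable");
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

For a non-idempotent operation such as order placement, the safest predicate is one that always returns false unless the request carries an idempotency key.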

Observability

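Every resilience event must be observable. With the resilience4j-micrometer module bound to a meter registry (the Spring Boot starter does this automatically), Resilience4j exposes metrics such as the following. Exact names depend on the Resilience4j version and the registry's naming convention:

resilience4j.circuitbreaker.state{name="betfair-api", state="open"} 0.0 (one gauge per state; 1 marks the current state)
resilience4j.bulkhead.available.concurrent.calls{name="betfair-orders"}
resilience4j.retry.calls{name="betfair-retry", kind="successful_with_retry"}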

Alert on circuit breaker state transitions and on retry exhaustion. These events represent real failures in dependencies — they need human attention.

Resilience patterns are not an afterthought. Design them into service boundaries before writing business logic, and your system will behave predictably under the failures that will inevitably occur.
