Circuit Breaker Pattern
A circuit breaker is the difference between a 5-minute blip and a 2-hour cascading outage. Without one, every thread blocks waiting for a slow downstream, exhausting the thread pool, taking down the entire service — and then the next service upstream does the same thing. Your entire platform collapses like dominoes because one database connection pool in a payment provider hit its limit.
💡 One-liner for interviews
"A circuit breaker monitors downstream failures and short-circuits requests when a service is unhealthy — failing fast with a fallback instead of waiting and exhausting resources."
What Is a Circuit Breaker?
Think about the electrical circuit breaker in your house. When a short circuit occurs, the breaker trips open to prevent the wires from catching fire. You don't keep pushing current through a broken circuit — that makes things worse. Once the problem is fixed, you flip the breaker back on.
The software circuit breaker does exactly the same thing. It wraps calls to an external service and monitors failures. When failures exceed a threshold, it "trips open" — immediately returning a fallback response without making the network call at all. After a cooldown period, it lets a few trial requests through to test if the service has recovered.
The core insight: Slow is worse than down. A service that returns errors in 5ms is easy to handle. A service that hangs for 30 seconds before timing out destroys your entire system because it holds threads hostage.
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
stateDiagram-v2
[*] --> CLOSED
CLOSED --> OPEN: Failure rate exceeds threshold
OPEN --> HALF_OPEN: Wait duration elapses
HALF_OPEN --> CLOSED: Trial calls succeed
HALF_OPEN --> OPEN: Trial calls fail
CLOSED: All requests pass through
CLOSED: Failures counted in sliding window
CLOSED: Normal operation
OPEN: All requests rejected immediately
OPEN: Fallback returned (no network call)
OPEN: Timer counting down
HALF_OPEN: Limited trial requests allowed
HALF_OPEN: Testing if downstream recovered
HALF_OPEN: Deciding next state
The Three States — Deep Dive
CLOSED State (Normal Operation)
Every request passes through to the downstream service. The circuit breaker silently records outcomes in a sliding window. As long as the failure rate stays below the configured threshold, nothing happens — the breaker is invisible.
What counts as failure? Exceptions, timeouts, HTTP 5xx responses — whatever you configure. Business exceptions (like "item not found") typically should NOT count as circuit breaker failures.
OPEN State (Failing Fast)
The circuit has tripped. Every incoming request is immediately rejected without making a network call. The fallback response is returned in microseconds instead of waiting 30 seconds for a timeout. This is the protective state — it gives the downstream service breathing room to recover instead of hammering it with requests it can't handle.
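Key detail: The wait timer starts when the circuit opens. Once waitDurationInOpenState elapses, the circuit moves to HALF_OPEN. In Resilience4j that move happens on the next call by default; set automaticTransitionFromOpenToHalfOpenEnabled: true to have it happen on a timer even when no call arrives.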
HALF_OPEN State (Testing Recovery)
The circuit cautiously allows a configured number of trial requests through to test if the downstream service has recovered. If the failure rate of those trial calls stays below the configured threshold, the circuit closes (back to normal). If it reaches the threshold, the circuit reopens and the wait timer restarts.
This is the most critical state for tuning. Too few trial calls = slow recovery. Too many trial calls = you overwhelm a service that just came back up.
| State | Requests | Latency | Resource Usage | Purpose |
|---|---|---|---|---|
| CLOSED | Pass through | Normal (network call) | Normal | Monitoring |
| OPEN | Rejected immediately | Microseconds (no network) | Minimal | Protection |
| HALF_OPEN | Limited trial | Normal (network call) | Minimal | Recovery testing |
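The three states can be sketched as a small state machine in plain Java. This is a minimal illustration, not Resilience4j's implementation: the class name SimpleCircuitBreaker and its constructor parameters are hypothetical, it evaluates calls in fixed batches rather than a true sliding window, and it does not cap concurrent trial calls in HALF_OPEN. The clock is injected so the transitions can be exercised without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative three-state breaker; not how Resilience4j is implemented. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;           // calls per evaluation batch while CLOSED
    private final double failureThreshold;  // 0.5 means "open at 50% failures"
    private final long waitMillis;          // how long to stay OPEN before probing
    private final int trialCalls;           // calls evaluated while HALF_OPEN
    private final LongSupplier clockMillis; // injectable clock (assumption, for testing)

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureThreshold, long waitMillis,
                         int trialCalls, LongSupplier clockMillis) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.waitMillis = waitMillis;
        this.trialCalls = trialCalls;
        this.clockMillis = clockMillis;
    }

    /** Current state; an OPEN breaker moves to HALF_OPEN once the wait has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clockMillis.getAsLong() - openedAt >= waitMillis) {
            transitionTo(State.HALF_OPEN);
        }
        return state;
    }

    /** Runs the action unless the breaker is OPEN; failures and rejections get the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // fail fast: the downstream is never called
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private synchronized void record(boolean failed) {
        if (state == State.OPEN) {
            return; // late result from a call admitted before the breaker tripped
        }
        calls++;
        if (failed) {
            failures++;
        }
        int needed = (state == State.HALF_OPEN) ? trialCalls : windowSize;
        if (calls < needed) {
            return; // not enough data yet
        }
        boolean tooManyFailures = (double) failures / calls >= failureThreshold;
        transitionTo(tooManyFailures ? State.OPEN : State.CLOSED);
    }

    private void transitionTo(State next) {
        state = next;
        calls = 0;
        failures = 0;
        if (next == State.OPEN) {
            openedAt = clockMillis.getAsLong();
        }
    }
}
```

Passing `System::currentTimeMillis` as the clock gives wall-clock behavior; a test can pass a counter it advances by hand.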
🎯 Interview Tip
When explaining the states, emphasize WHY each exists:
CLOSED = "trust but verify" (monitoring)
OPEN = "protect the system" (circuit is broken, stop trying)
HALF_OPEN = "cautious optimism" (is it safe to reconnect?)
Why Circuit Breakers — The Cascading Failure Problem
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
U["User"] --> A["Order Service<br/>200 threads"]
A -->|"Each waiting 30s"| B["Payment Service<br/>100 threads"]
B -->|"Each waiting 30s"| C["Bank API<br/>DOWN"]
C -.->|"Timeouts propagate UP"| B
B -.->|"Thread pool exhausted"| A
A -.->|"504 Gateway Timeout"| U
style C fill:#FFCDD2,stroke:#C62828,color:#000
style B fill:#FFF3E0,stroke:#E65100,color:#000
style A fill:#FFF3E0,stroke:#E65100,color:#000
Without a circuit breaker, here's what happens:
- Bank API goes down (or gets slow — same effect)
- Payment Service makes calls to Bank API. Each thread blocks for 30 seconds waiting for a timeout
- Payment Service has 100 threads. After 100 concurrent orders, all threads are blocked waiting
- Payment Service thread pool is exhausted. It can't handle any requests — even for non-bank operations
- Order Service calls Payment Service. Same thing — threads start blocking
- Order Service thread pool exhausts. Now the entire checkout flow is dead
- API Gateway starts queuing. Health checks fail. Load balancer removes nodes
- Total outage. One slow dependency killed everything.
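The arithmetic behind the exhaustion steps above follows Little's law: threads in use equal the arrival rate multiplied by how long each request holds a thread. The sketch below works through it for the 100-thread pool and 30-second timeout from the example. The 200 ms healthy latency is taken from the text, while the 50 requests/second traffic figure and the class name PoolSaturation are assumptions made purely for illustration.

```java
/** Back-of-the-envelope pool maths; the traffic numbers are illustrative assumptions. */
final class PoolSaturation {

    /** Little's law: threads in use = arrival rate (req/s) x time a request holds a thread (s). */
    static double threadsInUse(double requestsPerSecond, double holdSeconds) {
        return requestsPerSecond * holdSeconds;
    }

    /** Highest arrival rate the pool can sustain before every thread is busy. */
    static double maxSustainableRate(int poolSize, double holdSeconds) {
        return poolSize / holdSeconds;
    }

    /** Seconds until the pool is full if every new request hangs. */
    static double secondsUntilExhausted(int poolSize, double requestsPerSecond) {
        return poolSize / requestsPerSecond;
    }

    public static void main(String[] args) {
        System.out.printf("healthy (200 ms per call): up to %.1f req/s%n",
                maxSustainableRate(100, 0.2));
        System.out.printf("hanging (30 s per call): up to %.1f req/s%n",
                maxSustainableRate(100, 30));
        System.out.printf("at 50 req/s the pool is full after %.1f s%n",
                secondsUntilExhausted(100, 50));
    }
}
```

With healthy 200 ms calls the pool sustains 100 / 0.2 = 500 requests per second; once calls hang for 30 seconds it sustains only 100 / 30, roughly 3.3 per second, so at the assumed 50 requests per second every thread is taken within about 2 seconds.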
⚠️ What breaks
The insidious part: the Bank API might not even be "down." It might just be slow — responding in 25 seconds instead of 200ms. Slow dependencies are MORE dangerous than fast-failing ones because they hold resources hostage.
With a circuit breaker:
- Bank API gets slow
- Payment Service circuit breaker detects rising failure/timeout rate
- After threshold exceeded → circuit OPENS
- All subsequent requests to Bank API immediately get a fallback response (0ms, no thread blocked)
- Payment Service stays healthy for all other operations
- Order Service stays healthy
- Users see "Payment processing delayed, we'll charge you shortly" instead of a 504 error
- Bank API recovers → HALF_OPEN → trial requests succeed → circuit CLOSES
- 5-minute blip instead of 2-hour outage
🔥 Production War Story
A major e-commerce platform had no circuit breakers on their recommendation service. During Black Friday, the recommendation engine's database hit connection limits. Without a breaker, every product page waited 30s for recommendations before rendering. Page load times went from 200ms to 30+ seconds. Cart abandonment spiked 90%. The fix took 4 hours to deploy because the deployment pipeline itself was affected. Revenue loss: $12M. The recommendation service wasn't even critical — they could have shown "Popular Items" as a fallback.
Resilience4j — The Modern Standard
Why Resilience4j, Not Hystrix?
Netflix Hystrix entered maintenance mode in November 2018. Netflix moved to adaptive concurrency limits internally and stopped investing in Hystrix. If you're starting a new project with Hystrix in 2024+, you're making a mistake.
| Feature | Hystrix | Resilience4j |
|---|---|---|
| Status | Maintenance mode (no active development since 2018) | Actively maintained |
| Design | Monolithic (all-or-nothing) | Modular (pick what you need) |
| Java Version | Java 6+ (old patterns) | Java 8+ for 1.x, Java 17+ for 2.x (functional, CompletableFuture) |
| Dependencies | Archaius, RxJava, many transitive | Lightweight core (Vavr in 1.x, removed in 2.x) |
| Spring Boot | Spring Cloud Netflix (deprecated) | Spring Cloud Circuit Breaker (official) |
| Metrics | HystrixDashboard (custom) | Micrometer (Prometheus, Datadog, etc.) |
| Thread Model | Dedicated thread pools per command | Semaphore-based (lighter) or thread pool |
| Sliding Window | Fixed 10-second buckets | COUNT_BASED or TIME_BASED (configurable) |
| Configuration | Code-based mostly | YAML/properties + programmatic |
Resilience4j Module Architecture
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
subgraph Core["Resilience4j Core Modules"]
CB["CircuitBreaker<br/>Failure detection & fast-fail"]
RT["Retry<br/>Transient failure recovery"]
BH["Bulkhead<br/>Concurrency isolation"]
RL["RateLimiter<br/>Throughput control"]
TL["TimeLimiter<br/>Timeout management"]
CH["Cache<br/>Result caching"]
end
subgraph Integration["Spring Boot Integration"]
ANN["@CircuitBreaker<br/>@Retry<br/>@Bulkhead<br/>@RateLimiter"]
ACT["Actuator Health<br/>Prometheus Metrics"]
CFG["application.yml<br/>Configuration"]
end
CB --> ANN
RT --> ANN
BH --> ANN
RL --> ANN
ANN --> ACT
CFG --> ANN
style Core fill:#E3F2FD,stroke:#1565C0,color:#000
style Integration fill:#E8F5E9,stroke:#2E7D32,color:#000
Dependencies (Spring Boot)
<!-- Circuit Breaker + Spring Cloud integration -->
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-circuitbreaker-resilience4j</artifactId>
</dependency>
<!-- AOP support for annotations -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-aop</artifactId>
</dependency>
<!-- Health endpoints and metrics -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Prometheus metrics export (production) -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Configuration Deep Dive
Every configuration parameter exists for a reason. Understanding WHY each one matters is the difference between a circuit breaker that works and one that either never trips or flaps constantly.
Core Parameters Explained
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        # --- Failure Detection ---
        failureRateThreshold: 50 # Open when 50% of calls fail
        slowCallRateThreshold: 80 # Open when 80% of calls are slow
        slowCallDurationThreshold: 3s # What counts as "slow"
        # --- Sliding Window ---
        slidingWindowType: COUNT_BASED # COUNT_BASED or TIME_BASED
        slidingWindowSize: 20 # Last 20 calls (or 20 seconds)
        minimumNumberOfCalls: 10 # Don't evaluate until 10 calls
        # --- State Transitions ---
        waitDurationInOpenState: 30s # How long to stay OPEN
        permittedNumberOfCallsInHalfOpenState: 5 # Trial calls in HALF_OPEN
        automaticTransitionFromOpenToHalfOpenEnabled: true
        # --- What Counts as Failure ---
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.BusinessNotFoundException # 404 is not a failure
          - com.example.ValidationException # Bad input is not downstream failure
Parameter-by-Parameter Breakdown
| Parameter | What It Does | Too Low | Too High | Recommended |
|---|---|---|---|---|
| failureRateThreshold | % failures to trip circuit | Trips on minor blips (flapping) | Never trips, cascading failure | About 30% for critical paths, 50-80% for non-critical |
| slowCallRateThreshold | % slow calls to trip | Trips during normal variance | Misses gradual degradation | 80-100% |
| slowCallDurationThreshold | What "slow" means | Normal calls trigger it | Misses actual slowness | 2-5x your p99 latency |
| slidingWindowSize | How many calls to evaluate | Not enough data, noisy | Slow to react to failures | 10-20 (count) or 10-60s (time) |
| minimumNumberOfCalls | Min calls before evaluating | 1 failure = 100% rate = trip | Slow initial protection | 5-10 |
| waitDurationInOpenState | How long to stay OPEN | Premature recovery attempts | Slow recovery after fix | 10-60s depending on service |
| permittedNumberOfCallsInHalfOpenState | Trial calls for recovery | Not enough signal | Overwhelm recovering service | 3-10 |
💡 One-liner for interviews
"minimumNumberOfCalls prevents the circuit from opening on the very first failure — you need a statistically meaningful sample before deciding a service is unhealthy."
Sliding Window Types
COUNT_BASED
Evaluates the last N calls regardless of when they happened. If slidingWindowSize: 10, it tracks the outcomes of the last 10 calls and calculates failure rate from those.
When to use: Consistent traffic. Services called at a predictable rate. Most common choice.
Gotcha: If traffic is bursty, a burst of 10 failures in 100ms opens the circuit even if the service was fine before and after.
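A count-based window can be pictured as a ring buffer over the last N outcomes. The sketch below is an assumption-laden illustration rather than Resilience4j's code: CountWindow is a hypothetical name, and returning -1 while fewer than minimumCalls outcomes have been seen is simply this sketch's way of signalling "not enough data yet".

```java
/** Ring buffer over the last N call outcomes; illustrative only. */
final class CountWindow {
    private final boolean[] outcomes; // true = failure
    private final int minimumCalls;
    private int recorded;             // outcomes currently held, capped at the buffer size
    private int next;                 // slot the next outcome will overwrite
    private int failures;             // failures currently inside the buffer

    CountWindow(int size, int minimumCalls) {
        this.outcomes = new boolean[size];
        this.minimumCalls = minimumCalls;
    }

    void record(boolean failed) {
        // Once the buffer is full, the slot being overwritten holds the oldest outcome.
        if (recorded == outcomes.length && outcomes[next]) {
            failures--;
        }
        outcomes[next] = failed;
        if (failed) {
            failures++;
        }
        next = (next + 1) % outcomes.length;
        if (recorded < outcomes.length) {
            recorded++;
        }
    }

    /** Failure rate in percent, or -1 while fewer than minimumCalls outcomes were seen. */
    double failureRate() {
        if (recorded < minimumCalls) {
            return -1;
        }
        return 100.0 * failures / recorded;
    }
}
```

Because only the order of calls matters, ten failures arriving within 100 ms fill a size-10 buffer exactly as ten failures spread over an hour would, which is the bursty-traffic gotcha described above.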
TIME_BASED
Evaluates all calls within the last N seconds. If slidingWindowSize: 60, it looks at every call made in the last 60 seconds.
When to use: Variable traffic. Services where call volume fluctuates significantly (batch jobs, event-driven).
Gotcha: During low traffic, you might not hit minimumNumberOfCalls for a long time, delaying protection.
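For comparison, a time-based window keeps whatever calls fall inside the last N seconds. The sketch below stores every call with its timestamp, which is a simplification chosen for readability and not necessarily how Resilience4j aggregates internally. TimeWindow and its methods are hypothetical names, and the caller supplies the current time so the behavior is easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Keeps the outcomes of calls made within the last windowMillis; illustrative only. */
final class TimeWindow {
    // Each entry is {timestampMillis, failed ? 1 : 0}, oldest first.
    private final Deque<long[]> calls = new ArrayDeque<>();
    private final long windowMillis;
    private final int minimumCalls;
    private int failures;

    TimeWindow(long windowMillis, int minimumCalls) {
        this.windowMillis = windowMillis;
        this.minimumCalls = minimumCalls;
    }

    void record(long nowMillis, boolean failed) {
        evictExpired(nowMillis);
        calls.addLast(new long[] {nowMillis, failed ? 1 : 0});
        if (failed) {
            failures++;
        }
    }

    /** Failure rate in percent, or -1 while the window holds fewer than minimumCalls calls. */
    double failureRate(long nowMillis) {
        evictExpired(nowMillis);
        if (calls.size() < minimumCalls) {
            return -1;
        }
        return 100.0 * failures / calls.size();
    }

    private void evictExpired(long nowMillis) {
        while (!calls.isEmpty() && nowMillis - calls.peekFirst()[0] >= windowMillis) {
            if (calls.pollFirst()[1] == 1) {
                failures--;
            }
        }
    }
}
```

During quiet periods old calls age out and the window may drop below minimumCalls, at which point this sketch reports -1 and no decision can be made; that is the delayed-protection gotcha in miniature.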
// Programmatic configuration showing both types
CircuitBreakerConfig countBased = CircuitBreakerConfig.custom()
.slidingWindowType(SlidingWindowType.COUNT_BASED)
.slidingWindowSize(20) // Last 20 calls
.minimumNumberOfCalls(10) // Need at least 10 before evaluating
.failureRateThreshold(50)
.build();
CircuitBreakerConfig timeBased = CircuitBreakerConfig.custom()
.slidingWindowType(SlidingWindowType.TIME_BASED)
.slidingWindowSize(60) // Last 60 seconds
.minimumNumberOfCalls(20) // Need at least 20 calls in that window
.failureRateThreshold(50)
.build();
❓ Counter-questions
Q: "When would you choose TIME_BASED over COUNT_BASED?"
A: When traffic is highly variable. Example: a batch processing service that gets 1000 calls/minute during batch runs but 2 calls/minute otherwise. COUNT_BASED with size 10 would evaluate 5 minutes of "normal" traffic as one window during quiet periods — stale data. TIME_BASED with 60s always looks at "the last minute" regardless of volume, giving fresher signal during quiet periods.
Fallback Strategies
A circuit breaker without a good fallback is only half the solution. The fallback determines what users actually experience during an outage.
Strategy 1: Default/Static Value
Return a safe default. Simplest approach.
private ProductResponse getProductFallback(String productId, Throwable t) {
return ProductResponse.builder()
.id(productId)
.name("Product")
.price(BigDecimal.ZERO) // Don't show price if pricing service is down
.available(false) // Safer to say unavailable than to sell what you can't fulfill
.message("Some details temporarily unavailable")
.build();
}
Strategy 2: Cached Data
Return the last known good response. Best for data that doesn't change frequently.
@Service
public class ProductService {
private final Cache<String, ProductResponse> productCache;
@CircuitBreaker(name = "catalogService", fallbackMethod = "catalogFallback")
public ProductResponse getProduct(String productId) {
ProductResponse response = catalogClient.getProduct(productId);
productCache.put(productId, response); // Cache on success
return response;
}
private ProductResponse catalogFallback(String productId, Throwable t) {
ProductResponse cached = productCache.getIfPresent(productId);
if (cached != null) {
cached.setStale(true); // Flag as potentially stale
return cached;
}
// No cache available — return minimal response
return ProductResponse.unavailable(productId);
}
}
Strategy 3: Degraded Service
Call a simpler/cheaper alternative service.
private List<Product> recommendationsFallback(String userId, Throwable t) {
// ML recommendation engine is down
// Fall back to simple "most popular" which is a cheap DB query
return popularProductsService.getTopProducts(10);
}
Strategy 4: Queue for Later
Accept the request and process it asynchronously when the service recovers.
private OrderResponse paymentFallback(OrderRequest request, Throwable t) {
// Payment service is down — don't reject the order!
// Queue it for processing when service recovers
paymentRetryQueue.send(PaymentMessage.from(request));
return OrderResponse.builder()
.status("ACCEPTED_PENDING_PAYMENT")
.message("Order accepted! Payment will be processed shortly.")
.orderId(UUID.randomUUID().toString())
.build();
}
Strategy 5: Feature Toggle
Disable the feature entirely and hide it from the UI.
private FeatureResponse featureFallback(String userId, Throwable t) {
// Tell the frontend to hide this feature section entirely
return FeatureResponse.disabled();
}
🔥 Production War Story
An airline booking system had a circuit breaker on their seat map service — but the fallback threw a NullPointerException because nobody tested it. When the seat map service went down during peak booking season, the circuit breaker opened correctly... and then the fallback crashed the entire booking page. Always test your fallbacks under realistic conditions.
Circuit Breaker + Retry — Order Matters!
This is one of the most misunderstood aspects of resilience patterns. The decoration order determines behavior.
Correct: Retry INSIDE Circuit Breaker
The circuit breaker wraps the retry. If all retries fail, that counts as ONE failure for the circuit breaker. Three retries failing = one circuit breaker failure count.
Wrong: Retry OUTSIDE Circuit Breaker
If the circuit is OPEN, the retry will retry the rejection itself — pointless retries against a circuit that won't let anything through.
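The effect of the two orderings on the breaker's failure count can be seen with plain `java.util.function.Supplier` decorators. This is a sketch, not Resilience4j code: FailureCounter merely stands in for the breaker's failure recording (it never opens), retry has no wait between attempts, and all names are made up for illustration.

```java
import java.util.function.Supplier;

/** Compares how many failures a breaker would record under each wrapping order. */
final class DecorationOrder {

    /** Stand-in for a breaker's bookkeeping: counts failures that pass through it. */
    static final class FailureCounter {
        int recordedFailures;

        <T> Supplier<T> decorate(Supplier<T> inner) {
            return () -> {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    recordedFailures++;
                    throw e;
                }
            };
        }
    }

    /** Retries up to maxAttempts times with no wait between attempts. */
    static <T> Supplier<T> retry(int maxAttempts, Supplier<T> inner) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    /** Returns {failures seen with breaker outside retry, failures seen with breaker inside retry}. */
    static int[] recordedFailures(int maxAttempts) {
        Supplier<String> failing = () -> {
            throw new IllegalStateException("downstream down");
        };

        FailureCounter breakerOutside = new FailureCounter();
        runQuietly(breakerOutside.decorate(retry(maxAttempts, failing)));

        FailureCounter breakerInside = new FailureCounter();
        runQuietly(retry(maxAttempts, breakerInside.decorate(failing)));

        return new int[] {breakerOutside.recordedFailures, breakerInside.recordedFailures};
    }

    private static void runQuietly(Supplier<String> call) {
        try {
            call.get();
        } catch (RuntimeException ignored) {
            // The outcome is read from the counters, not from the exception.
        }
    }
}
```

`DecorationOrder.recordedFailures(3)` yields `{1, 3}`: when the breaker wraps the retry, three failed attempts reach it as a single failure, and when the retry wraps the breaker, each attempt is recorded separately.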
Resilience4j Decoration Order
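With the Spring annotations, Resilience4j applies the aspects in this default order (outermost to innermost):
Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead → your method
The order is a default rather than a hard rule; it can be changed through the aspect-order properties (for example resilience4j.circuitbreaker.circuitBreakerAspectOrder) or by chaining decorators programmatically.
This means: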
- Retry wraps everything — if the circuit breaker rejects (CallNotPermittedException), retry can optionally retry after wait
- CircuitBreaker checks if the call is allowed before proceeding
- RateLimiter controls throughput
- TimeLimiter enforces timeout on the actual call
- Bulkhead limits concurrency for the actual execution
@Retry(name = "paymentService")
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
@RateLimiter(name = "paymentService")
@TimeLimiter(name = "paymentService")
@Bulkhead(name = "paymentService")
public CompletableFuture<PaymentResponse> processPayment(PaymentRequest request) {
return CompletableFuture.supplyAsync(() -> paymentClient.charge(request));
}
resilience4j:
  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException # Don't retry when circuit is OPEN
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
⚠️ What breaks
If you configure retry to also retry CallNotPermittedException, you'll retry 3 times against an OPEN circuit — wasting time for the same rejection. Always add CallNotPermittedException to ignoreExceptions in your retry config unless you intentionally want to wait for the circuit to transition to HALF_OPEN.
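To see what the retry settings above cost in added latency, the wait schedule can be worked out by hand. The sketch assumes the wait before retry n is waitDuration × multiplier^(n-1) and that maxAttempts counts the initial call, so three attempts mean two waits. BackoffSchedule is a hypothetical helper, not part of Resilience4j.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Computes the waits between attempts under exponential backoff; illustrative only. */
final class BackoffSchedule {

    /** Waits inserted between attempts; maxAttempts includes the first call. */
    static List<Duration> waits(Duration initial, double multiplier, int maxAttempts) {
        List<Duration> result = new ArrayList<>();
        double millis = initial.toMillis();
        for (int retry = 1; retry < maxAttempts; retry++) {
            result.add(Duration.ofMillis(Math.round(millis)));
            millis *= multiplier; // each wait is multiplier times the previous one
        }
        return result;
    }

    public static void main(String[] args) {
        // Mirrors maxAttempts: 3, waitDuration: 1s, exponentialBackoffMultiplier: 2
        System.out.println(waits(Duration.ofSeconds(1), 2.0, 3));
    }
}
```

With the values from the config this gives waits of 1 s and 2 s, so a request that fails all three attempts spends 3 seconds waiting on top of the time taken by the three calls themselves.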
Bulkhead Pattern — Resource Isolation
The bulkhead pattern comes from ship design. Ships have watertight compartments (bulkheads) so that if one compartment floods, the others stay dry and the ship stays afloat. In software, this means isolating resources so one slow service can't consume all available threads/connections.
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
subgraph Application["Application (200 total threads)"]
subgraph BH1["Bulkhead: Payment<br/>Max 20 threads"]
P1["Thread 1"]
P2["Thread 2"]
P3["...Thread 20"]
end
subgraph BH2["Bulkhead: Inventory<br/>Max 30 threads"]
I1["Thread 1"]
I2["Thread 2"]
I3["...Thread 30"]
end
subgraph BH3["Bulkhead: Recommendations<br/>Max 10 threads"]
R1["Thread 1"]
R2["Thread 2"]
R3["...Thread 10"]
end
subgraph Shared["Remaining: 140 threads<br/>Other operations"]
S1["..."]
end
end
style BH1 fill:#FFEBEE,stroke:#C62828,color:#000
style BH2 fill:#E3F2FD,stroke:#1565C0,color:#000
style BH3 fill:#FFF3E0,stroke:#E65100,color:#000
style Shared fill:#E8F5E9,stroke:#2E7D32,color:#000
Semaphore Bulkhead
Uses a semaphore to limit concurrent calls. Runs on the caller's thread. Lightweight, no thread pool overhead.
resilience4j:
  bulkhead:
    instances:
      inventoryService:
        maxConcurrentCalls: 20 # Max 20 parallel calls
        maxWaitDuration: 200ms # Wait max 200ms for a permit, then reject
@Bulkhead(name = "inventoryService", fallbackMethod = "bulkheadFallback")
@CircuitBreaker(name = "inventoryService", fallbackMethod = "circuitBreakerFallback")
public InventoryResponse checkInventory(String sku) {
return inventoryClient.check(sku);
}
private InventoryResponse bulkheadFallback(String sku, BulkheadFullException e) {
log.warn("Bulkhead full for inventory check: {}", sku);
return InventoryResponse.unknown(sku); // Can't confirm stock right now
}
When to use: Most cases. Lower overhead, simpler mental model.
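The mechanism behind a semaphore bulkhead fits in a few lines of plain JDK code. The sketch below is an illustration under stated assumptions, not the Resilience4j implementation: SemaphoreBulkhead, execute and whenFull are hypothetical names, and rejection is modelled as a fallback supplier rather than a BulkheadFullException.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Limits concurrent calls with a semaphore; the work runs on the caller's thread. */
final class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call, Supplier<T> whenFull) {
        boolean acquired;
        try {
            // Wait at most maxWaitMillis for a free slot.
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) {
            return whenFull.get(); // bulkhead full: reject instead of piling up
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always free the slot, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because the call runs on the caller's thread, a slow downstream still occupies that thread for the duration of the call; the semaphore only caps how many callers can be stuck there at once.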
Thread Pool Bulkhead
Uses a separate thread pool. The caller's thread is not blocked — the call executes on the bulkhead's thread pool and returns a CompletableFuture.
resilience4j:
  thread-pool-bulkhead:
    instances:
      paymentService:
        maxThreadPoolSize: 10 # Max 10 threads
        coreThreadPoolSize: 5 # Keep 5 alive
        queueCapacity: 25 # Queue 25 more before rejecting
        keepAliveDuration: 100ms
@Bulkhead(name = "paymentService", type = Bulkhead.Type.THREADPOOL)
public CompletableFuture<PaymentResponse> processPayment(PaymentRequest req) {
return CompletableFuture.supplyAsync(() -> paymentClient.charge(req));
}
When to use: When you want true thread isolation (caller thread never blocks). When the downstream call is very slow and you don't want to hold the request-handling thread.
| Aspect | Semaphore | Thread Pool |
|---|---|---|
| Thread usage | Caller's thread | Dedicated pool |
| Overhead | Low | Higher (thread context switch) |
| Return type | Synchronous | CompletableFuture |
| Timeout support | Need TimeLimiter | Built-in via Future.get() |
| Best for | Fast calls, reactive | Slow calls, blocking I/O |
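The thread-pool variant can be approximated with a bounded `ThreadPoolExecutor` from the JDK. This is a rough sketch rather than Resilience4j's ThreadPoolBulkhead: the class name, the fixed 100 ms keep-alive, and the choice to report rejection through an exceptionally completed future are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Runs calls on a dedicated, bounded pool so callers never lend their own thread. */
final class ThreadPoolBulkheadSketch implements AutoCloseable {
    private final ThreadPoolExecutor pool;

    ThreadPoolBulkheadSketch(int coreThreads, int maxThreads, int queueCapacity) {
        this.pool = new ThreadPoolExecutor(
                coreThreads, maxThreads,
                100, TimeUnit.MILLISECONDS,               // idle time before extra threads exit
                new ArrayBlockingQueue<>(queueCapacity),  // bounded queue: no unbounded backlog
                new ThreadPoolExecutor.AbortPolicy());    // reject when threads and queue are full
    }

    <T> CompletableFuture<T> submit(Supplier<T> call) {
        try {
            return CompletableFuture.supplyAsync(call, pool);
        } catch (RejectedExecutionException e) {
            // Pool and queue are full: fail immediately instead of blocking the caller.
            CompletableFuture<T> rejected = new CompletableFuture<>();
            rejected.completeExceptionally(e);
            return rejected;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

With one thread and a queue of one, a third concurrent submission is rejected straight away while the first two still complete, which is the isolation property the table above attributes to the dedicated pool.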
💡 One-liner for interviews
"Semaphore bulkhead limits concurrency on the caller's thread. Thread pool bulkhead provides true isolation — a slow service can't block the caller's thread at all because execution happens on a separate pool."
Rate Limiter
Controls how many calls are made to a service in a given time period. Protects downstream services from being overwhelmed (different from circuit breaker which reacts to failures — rate limiter prevents them proactively).
How It Works Internally
Resilience4j's rate limiter works much like a token bucket that is refilled once per cycle:
- At the start of each limitRefreshPeriod, the available permits are reset to limitForPeriod
- Each call consumes one permit
- If no permits are available, the call waits up to timeoutDuration for the next refresh
- If the timeout expires before a permit is granted, the call is rejected
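The steps above can be sketched as a small cycle-based limiter. This is an illustration only: CycleRateLimiter is a hypothetical class, the clock is injected for testability, and the sketch never waits, which corresponds to a timeoutDuration of zero.

```java
import java.util.function.LongSupplier;

/** Grants up to limitForPeriod permits per period; never waits. Illustrative only. */
final class CycleRateLimiter {
    private final int limitForPeriod;
    private final long periodNanos;
    private final LongSupplier nanoClock; // injectable clock (assumption, for testing)
    private long currentCycle;
    private int permitsLeft;

    CycleRateLimiter(int limitForPeriod, long periodNanos, LongSupplier nanoClock) {
        this.limitForPeriod = limitForPeriod;
        this.periodNanos = periodNanos;
        this.nanoClock = nanoClock;
        this.currentCycle = nanoClock.getAsLong() / periodNanos;
        this.permitsLeft = limitForPeriod;
    }

    /** Returns true if a permit was available in the current period. */
    synchronized boolean tryAcquire() {
        long cycle = nanoClock.getAsLong() / periodNanos;
        if (cycle != currentCycle) { // a new period has started: refresh the permits
            currentCycle = cycle;
            permitsLeft = limitForPeriod;
        }
        if (permitsLeft == 0) {
            return false;            // limit reached for this period
        }
        permitsLeft--;
        return true;
    }
}
```

Passing `System::nanoTime` as the clock gives real-time behavior. A version that honors a non-zero timeout would block until the next period boundary or the timeout, whichever comes first.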
resilience4j:
  ratelimiter:
    instances:
      externalApiService:
        limitForPeriod: 50 # 50 calls allowed per period
        limitRefreshPeriod: 1s # Period resets every 1 second
        timeoutDuration: 500ms # Wait max 500ms for permission
        registerHealthIndicator: true
        eventConsumerBufferSize: 100
@RateLimiter(name = "externalApiService", fallbackMethod = "rateLimitFallback")
public ExchangeRate getExchangeRate(String currency) {
return externalRateApi.getRate(currency);
}
private ExchangeRate rateLimitFallback(String currency, RequestNotPermitted e) {
log.warn("Rate limit hit for exchange rate API");
return exchangeRateCache.getLastKnown(currency); // Use cached rate
}
🎯 Interview Tip
Rate limiter vs. circuit breaker:
- Rate limiter = proactive. "Don't send more than 50 requests/second regardless of success/failure"
- Circuit breaker = reactive. "The service is failing, stop sending requests until it recovers"
They complement each other. Use both.
TimeLimiter — Timeout Management
Cancels a call if it exceeds a deadline. Without this, a hanging service holds your thread forever (or until the HTTP client timeout, which is often too generous).
resilience4j:
  timelimiter:
    instances:
      inventoryService:
        timeoutDuration: 3s # Give up after 3 seconds
        cancelRunningFuture: true # Cancel the future on timeout (a CompletableFuture's worker thread is not interrupted)
@TimeLimiter(name = "inventoryService")
@CircuitBreaker(name = "inventoryService", fallbackMethod = "fallback")
public CompletableFuture<InventoryResponse> checkInventoryAsync(String sku) {
return CompletableFuture.supplyAsync(() -> {
// If this takes > 3s, TimeLimiter cancels it
return inventoryClient.check(sku);
});
}
public CompletableFuture<InventoryResponse> fallback(String sku, TimeoutException e) {
return CompletableFuture.completedFuture(
InventoryResponse.builder()
.sku(sku)
.status("TIMEOUT")
.message("Inventory check timed out, assuming available")
.build()
);
}
Important: TimeLimiter requires a CompletableFuture return type. For synchronous calls, rely on your HTTP client's own timeouts instead (for RestTemplate, set them on the underlying request factory, e.g. SimpleClientHttpRequestFactory.setReadTimeout()).
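The same deadline idea can be tried out with the JDK alone. The sketch below uses `CompletableFuture.orTimeout` (Java 9+) and is an analogy to TimeLimiter, not its implementation; TimeoutSketch and callWithTimeout are hypothetical names. Note that the underlying task keeps running after the deadline; only the caller stops waiting.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Bounds how long a caller waits for an asynchronous result; illustrative only. */
final class TimeoutSketch {

    static <T> T callWithTimeout(Supplier<T> call, long timeoutMillis, T fallback) {
        CompletableFuture<T> future = CompletableFuture
                .supplyAsync(call)                                // runs on the common pool
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS); // fails the future at the deadline
        try {
            return future.join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return fallback; // deadline passed: stop waiting and degrade
            }
            throw e;             // any other failure is propagated unchanged
        }
    }
}
```

A call such as `callWithTimeout(() -> inventoryClient.check(sku), 3000, cachedValue)` would mirror the 3-second limit configured above, with `cachedValue` standing in for whatever fallback the caller has available.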
Spring Boot Integration — Complete Implementation
Annotation-Based (Recommended for Most Cases)
@Service
@Slf4j
public class OrderService {
private final PaymentClient paymentClient;
private final InventoryClient inventoryClient;
private final RecommendationClient recommendationClient;
private final OrderRepository orderRepository;
private final PaymentRetryQueue retryQueue;
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
@Retry(name = "paymentService")
public PaymentResult processPayment(Order order) {
log.info("Processing payment for order: {}", order.getId());
PaymentResponse response = paymentClient.charge(
ChargeRequest.builder()
.orderId(order.getId())
.amount(order.getTotal())
.currency(order.getCurrency())
.idempotencyKey(order.getPaymentIdempotencyKey()) // Critical for retries!
.build()
);
order.setPaymentId(response.getTransactionId());
order.setStatus(OrderStatus.PAID);
orderRepository.save(order);
return PaymentResult.success(response.getTransactionId());
}
private PaymentResult paymentFallback(Order order, Throwable t) {
log.warn("Payment circuit breaker fallback for order {}: {}",
order.getId(), t.getMessage());
// Queue for async processing — don't lose the sale!
retryQueue.enqueue(order);
order.setStatus(OrderStatus.PAYMENT_PENDING);
orderRepository.save(order);
return PaymentResult.pending("Payment will be processed shortly");
}
@CircuitBreaker(name = "inventoryService", fallbackMethod = "inventoryFallback")
@Bulkhead(name = "inventoryService")
public boolean checkInventory(String sku, int quantity) {
return inventoryClient.isAvailable(sku, quantity);
}
private boolean inventoryFallback(String sku, int quantity, Throwable t) {
log.warn("Inventory check fallback for SKU {}: {}", sku, t.getMessage());
// Optimistic: allow the order, handle stock issues later
// This is a business decision — discuss with product team
return true;
}
@CircuitBreaker(name = "recommendationService", fallbackMethod = "recommendationFallback")
@TimeLimiter(name = "recommendationService")
@Bulkhead(name = "recommendationService")
public CompletableFuture<List<Product>> getRecommendations(String userId) {
return CompletableFuture.supplyAsync(() ->
recommendationClient.getPersonalized(userId, 10)
);
}
private CompletableFuture<List<Product>> recommendationFallback(String userId, Throwable t) {
// Recommendations are non-critical — show popular items instead
return CompletableFuture.completedFuture(
recommendationClient.getPopularItems(10)
);
}
}
Full application.yml Configuration
spring:
  application:
    name: order-service
management:
  health:
    circuitbreakers:
      enabled: true
  endpoints:
    web:
      exposure:
        include: health, circuitbreakers, circuitbreakerevents, metrics, prometheus
  endpoint:
    health:
      show-details: always
  # Spring Boot 2.x key; on Spring Boot 3.x use management.prometheus.metrics.export.enabled
  metrics:
    export:
      prometheus:
        enabled: true
resilience4j:
  circuitbreaker:
    configs:
      # Shared defaults
      default:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        registerHealthIndicator: true
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.BusinessException
    instances:
      # Critical path — aggressive protection
      paymentService:
        baseConfig: default
        slidingWindowSize: 10
        failureRateThreshold: 30 # Trip at 30% for payment (critical)
        waitDurationInOpenState: 60s # Give payment provider more time to recover
        slowCallRateThreshold: 50
        slowCallDurationThreshold: 5s
      # Medium criticality
      inventoryService:
        baseConfig: default
        failureRateThreshold: 50
        waitDurationInOpenState: 20s
      # Low criticality — aggressive fallback is fine
      recommendationService:
        baseConfig: default
        failureRateThreshold: 70 # More tolerant — it's just recommendations
        waitDurationInOpenState: 10s # Recover quickly
        slidingWindowSize: 5 # Smaller window — trip faster
  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException
    instances:
      paymentService:
        baseConfig: default
        maxAttempts: 2 # Payment: fewer retries (idempotency concerns)
        waitDuration: 1s
      inventoryService:
        baseConfig: default
        maxAttempts: 3
  bulkhead:
    instances:
      inventoryService:
        maxConcurrentCalls: 30
        maxWaitDuration: 200ms
      recommendationService:
        maxConcurrentCalls: 10
        maxWaitDuration: 100ms
  timelimiter:
    instances:
      recommendationService:
        timeoutDuration: 2s
        cancelRunningFuture: true
  ratelimiter:
    instances:
      paymentService:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
        timeoutDuration: 0ms # Reject immediately if rate exceeded
Programmatic Configuration (When You Need Dynamic Control)
@Configuration
public class ResilienceConfig {
@Bean
public CircuitBreakerRegistry circuitBreakerRegistry() {
CircuitBreakerConfig paymentConfig = CircuitBreakerConfig.custom()
.slidingWindowType(SlidingWindowType.COUNT_BASED)
.slidingWindowSize(10)
.minimumNumberOfCalls(5)
.failureRateThreshold(30)
.slowCallRateThreshold(50)
.slowCallDurationThreshold(Duration.ofSeconds(5))
.waitDurationInOpenState(Duration.ofSeconds(60))
.permittedNumberOfCallsInHalfOpenState(3)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.recordExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(BusinessException.class)
.build();
CircuitBreakerConfig inventoryConfig = CircuitBreakerConfig.custom()
.slidingWindowType(SlidingWindowType.COUNT_BASED)
.slidingWindowSize(20)
.minimumNumberOfCalls(10)
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(20))
.permittedNumberOfCallsInHalfOpenState(5)
.build();
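// Note: this registers named configurations, not breaker instances.
// Create each instance with registry.circuitBreaker(name, configName).
return CircuitBreakerRegistry.of(Map.of(
"paymentService", paymentConfig,
"inventoryService", inventoryConfig
));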
}
@Bean
public Customizer<Resilience4JCircuitBreakerFactory> circuitBreakerCustomizer() {
return factory -> {
factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
.circuitBreakerConfig(CircuitBreakerConfig.ofDefaults())
.timeLimiterConfig(TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(3))
.build())
.build());
};
}
}
Production Configuration Tuning
Fast API (< 100ms p99) — Real-time user-facing
# Example: Product catalog, search, user profiles
fastApiService:
  slidingWindowSize: 20
  minimumNumberOfCalls: 10
  failureRateThreshold: 40
  slowCallRateThreshold: 60
  slowCallDurationThreshold: 500ms # 5x normal latency = slow
  waitDurationInOpenState: 15s # Recover quickly
  permittedNumberOfCallsInHalfOpenState: 5
Slow API (1-5s p99) — Batch/processing
# Example: Report generation, ML inference, payment processing
slowApiService:
  slidingWindowType: TIME_BASED
  slidingWindowSize: 60 # 60-second window
  minimumNumberOfCalls: 5
  failureRateThreshold: 50
  slowCallRateThreshold: 80
  slowCallDurationThreshold: 10s # These are expected to be slow
  waitDurationInOpenState: 60s # Give more time to recover
  permittedNumberOfCallsInHalfOpenState: 3
Critical Service (Payment, Auth)
# Lower threshold: trip early, protect revenue
criticalService:
  slidingWindowSize: 10
  minimumNumberOfCalls: 5
  failureRateThreshold: 30          # Trip at 30%: can't risk payment failures
  slowCallRateThreshold: 50
  waitDurationInOpenState: 60s      # Long cooldown: don't overwhelm recovering service
  permittedNumberOfCallsInHalfOpenState: 2  # Very cautious recovery
Nice-to-Have Service (Recommendations, Analytics)
# Higher threshold: more tolerant, aggressive fallback is acceptable
niceToHaveService:
  slidingWindowSize: 5
  minimumNumberOfCalls: 3
  failureRateThreshold: 80          # Very tolerant: partial degradation is fine
  waitDurationInOpenState: 5s       # Short cooldown: try again quickly
  permittedNumberOfCallsInHalfOpenState: 5
🎯 Interview Tip
When asked "How would you configure a circuit breaker?" — always ask back: "What's the criticality of the downstream service and what's the acceptable degradation?" Payment failing = revenue loss (aggressive protection). Recommendations failing = slightly worse UX (tolerant config, good fallback).
Monitoring and Observability
Actuator Endpoints
# Check all circuit breaker states
curl localhost:8080/actuator/health | jq '.components.circuitBreakers'
# Response:
{
"status": "UP",
"details": {
"paymentService": {
"status": "UP",
"details": {
"state": "CLOSED",
"failureRate": "10.0%",
"slowCallRate": "5.0%",
"bufferedCalls": 20,
"failedCalls": 2,
"slowCalls": 1,
"notPermittedCalls": 0
}
},
"inventoryService": {
"status": "CIRCUIT_OPEN",
"details": {
"state": "OPEN",
"failureRate": "60.0%",
"bufferedCalls": 10,
"failedCalls": 6,
"notPermittedCalls": 147
}
}
}
}
Event Listeners (Custom Alerting)
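@Component
@Slf4j
public class CircuitBreakerEventListener {
// AlertingService and MetricsService are application-specific beans (assumed here, not part of Resilience4j)
public CircuitBreakerEventListener(CircuitBreakerRegistry registry,
AlertingService alertingService,
MetricsService metricsService) {
// Only breakers that already exist in the registry at construction time get a listener
registry.getAllCircuitBreakers().forEach(cb -> {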
cb.getEventPublisher()
.onStateTransition(event -> {
log.warn("Circuit breaker '{}' transitioned: {} -> {}",
event.getCircuitBreakerName(),
event.getStateTransition().getFromState(),
event.getStateTransition().getToState());
if (event.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
alertingService.sendAlert(
AlertLevel.HIGH,
"Circuit breaker OPENED: " + event.getCircuitBreakerName()
);
}
})
.onError(event -> {
log.debug("Circuit breaker '{}' recorded error: {}",
event.getCircuitBreakerName(),
event.getThrowable().getMessage());
})
.onCallNotPermitted(event -> {
metricsService.incrementCounter(
"circuit_breaker_rejected_total",
"name", event.getCircuitBreakerName()
);
});
});
}
}
Prometheus Metrics
Resilience4j automatically exposes metrics via Micrometer:
# Circuit breaker state: one gauge per "state" label (closed, open, half_open, ...); value 1 = active, 0 = inactive
resilience4j_circuitbreaker_state{name="paymentService", state="open"}
# Failure rate
resilience4j_circuitbreaker_failure_rate{name="paymentService"}
# Call outcomes
rate(resilience4j_circuitbreaker_calls_seconds_count{name="paymentService",kind="successful"}[5m])
rate(resilience4j_circuitbreaker_calls_seconds_count{name="paymentService",kind="failed"}[5m])
# Rejected calls (circuit open)
rate(resilience4j_circuitbreaker_not_permitted_calls_total{name="paymentService"}[5m])
# Alert rule: circuit breaker has been OPEN for > 5 minutes
# expr: resilience4j_circuitbreaker_state{state="open"} == 1
# for: 5m
# Query the resulting alert once the rule is loaded:
ALERTS{alertname="CircuitBreakerOpen"} == 1
Grafana Dashboard Queries
# Panel 1: Circuit breaker state timeline
resilience4j_circuitbreaker_state{name=~"$service"}
# Panel 2: Request success/failure rate
sum(rate(resilience4j_circuitbreaker_calls_seconds_count{kind="successful"}[1m])) by (name)
sum(rate(resilience4j_circuitbreaker_calls_seconds_count{kind="failed"}[1m])) by (name)
# Panel 3: Rejected calls (indicates open circuit)
sum(rate(resilience4j_circuitbreaker_not_permitted_calls_total[1m])) by (name)
# Panel 4: Call duration (p99)
histogram_quantile(0.99, rate(resilience4j_circuitbreaker_calls_seconds_bucket[5m]))
Testing Circuit Breakers
Unit Testing State Transitions
@Test
void shouldOpenCircuitWhenFailureRateExceeded() {
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.slidingWindowSize(5)
.minimumNumberOfCalls(5)
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(10))
.build();
CircuitBreaker circuitBreaker = CircuitBreaker.of("test", config);
// Record 2 successes and 3 failures (5 calls, 60% failure rate > 50% threshold)
circuitBreaker.onSuccess(0, TimeUnit.MILLISECONDS);
circuitBreaker.onSuccess(0, TimeUnit.MILLISECONDS);
circuitBreaker.onError(0, TimeUnit.MILLISECONDS, new IOException());
circuitBreaker.onError(0, TimeUnit.MILLISECONDS, new IOException());
circuitBreaker.onError(0, TimeUnit.MILLISECONDS, new IOException());
// Circuit should be OPEN
assertThat(circuitBreaker.getState()).isEqualTo(CircuitBreaker.State.OPEN);
// Calls should be rejected
CheckedRunnable decorated = CircuitBreaker.decorateCheckedRunnable(
circuitBreaker, () -> { /* doesn't matter */ });
assertThatThrownBy(decorated::run)
.isInstanceOf(CallNotPermittedException.class);
}
@Test
void shouldTransitionToHalfOpenAfterWaitDuration() {
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.slidingWindowSize(5)
.minimumNumberOfCalls(5)
.failureRateThreshold(50)
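.waitDurationInOpenState(Duration.ofSeconds(1))
.automaticTransitionFromOpenToHalfOpenEnabled(true) // otherwise OPEN -> HALF_OPEN only happens on the next call attempt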
.permittedNumberOfCallsInHalfOpenState(2)
.build();
CircuitBreaker circuitBreaker = CircuitBreaker.of("test", config);
// Force circuit OPEN
circuitBreaker.transitionToOpenState();
assertThat(circuitBreaker.getState()).isEqualTo(CircuitBreaker.State.OPEN);
// Wait for transition
await().atMost(Duration.ofSeconds(2))
.until(() -> circuitBreaker.getState() == CircuitBreaker.State.HALF_OPEN);
// Record successful trial calls
circuitBreaker.onSuccess(0, TimeUnit.MILLISECONDS);
circuitBreaker.onSuccess(0, TimeUnit.MILLISECONDS);
// Should transition back to CLOSED
assertThat(circuitBreaker.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
}
Integration Testing with WireMock
@SpringBootTest
@AutoConfigureWireMock(port = 0)
class PaymentServiceCircuitBreakerIT {
@Autowired
private PaymentService paymentService;
@Autowired
private CircuitBreakerRegistry registry;
@BeforeEach
void resetCircuitBreaker() {
registry.circuitBreaker("paymentService").reset();
}
@Test
void shouldOpenCircuitAfterConsecutiveFailures() {
// Stub payment service to return 500
stubFor(post("/api/payments/charge")
.willReturn(serverError().withBody("Internal Server Error")));
// Make enough calls to trip the circuit
for (int i = 0; i < 10; i++) {
try {
paymentService.processPayment(testOrder());
} catch (Exception ignored) {}
}
// Verify circuit is OPEN
CircuitBreaker cb = registry.circuitBreaker("paymentService");
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
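// Count the requests that reached the downstream before the circuit opened
// (fewer than 10 if the circuit trips as soon as minimumNumberOfCalls is reached, more if retries are active)
int callsBeforeFallback = findAll(postRequestedFor(urlEqualTo("/api/payments/charge"))).size();
// Verify fallback is returned (not an exception)
PaymentResult result = paymentService.processPayment(testOrder());
assertThat(result.getStatus()).isEqualTo("PENDING");
// Verify the rejected call never reached the downstream
verify(exactly(callsBeforeFallback), postRequestedFor(urlEqualTo("/api/payments/charge")));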
}
@Test
void shouldRecoverWhenServiceComesBack() {
// First: service is down
stubFor(post("/api/payments/charge")
.willReturn(serverError()));
// Trip the circuit
IntStream.range(0, 10).forEach(i -> {
try { paymentService.processPayment(testOrder()); } catch (Exception ignored) {}
});
// Now: service is back up
stubFor(post("/api/payments/charge")
.willReturn(okJson("{\"transactionId\": \"txn_123\", \"status\": \"SUCCESS\"}")));
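// Wait for HALF_OPEN transition (assumes the test profile sets waitDurationInOpenState
// below 35s and automaticTransitionFromOpenToHalfOpenEnabled: true)
CircuitBreaker cb = registry.circuitBreaker("paymentService");
await().atMost(Duration.ofSeconds(35))
.until(() -> cb.getState() == CircuitBreaker.State.HALF_OPEN);
// The circuit closes only after all permitted trial calls have completed successfully
int trialCalls = cb.getCircuitBreakerConfig().getPermittedNumberOfCallsInHalfOpenState();
for (int i = 0; i < trialCalls; i++) {
PaymentResult result = paymentService.processPayment(testOrder());
assertThat(result.getStatus()).isEqualTo("SUCCESS");
}
await().atMost(Duration.ofSeconds(5))
.until(() -> cb.getState() == CircuitBreaker.State.CLOSED);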
}
}
Chaos Engineering
// Chaos Monkey for Spring Boot — randomly inject failures
@Configuration
@Profile("chaos")
public class ChaosConfig {
@Bean
public ChaosMonkeySettings chaosMonkeySettings() {
return ChaosMonkeySettings.builder()
.assaultProperties(AssaultProperties.builder()
.latencyActive(true)
.latencyRangeStart(3000) // Add 3-10s latency
.latencyRangeEnd(10000)
.exceptionsActive(true)
.level(5) // Attack every 5th call
.build())
.build();
}
}
Common Mistakes and Anti-Patterns
Mistake 1: Sliding Window Too Small (Flapping)
# BAD: Window of 3, so one bad request = 33% failure rate
badConfig:
  slidingWindowSize: 3
  minimumNumberOfCalls: 1   # Even worse: evaluates after 1 call!
  failureRateThreshold: 50
Problem: Circuit flaps open/closed constantly. One timeout trips it, one success resets it. Your monitoring is full of noise. Users experience inconsistent behavior.
Fix: Use a window of at least 10 and minimumNumberOfCalls of at least 5.
Mistake 2: No Fallback Defined
// BAD: No fallback — exception propagates to user as 500
@CircuitBreaker(name = "paymentService")
public PaymentResult charge(Order order) {
return paymentClient.charge(order);
}
// When circuit opens: CallNotPermittedException → 500 Internal Server Error → bad UX
Fix: Always define a fallback. Even if the fallback is "return a friendly error message."
Mistake 3: Retrying Non-Idempotent Operations
// DANGEROUS: Retry on payment without idempotency key
@Retry(name = "paymentService") // maxAttempts: 3 comes from the retry config; the annotation has no maxAttempts attribute
public void chargeCustomer(Order order) {
paymentClient.charge(order.getAmount()); // No idempotency key!
// If first call succeeds but response is lost (network issue),
// retry charges the customer AGAIN. Double-charge!
}
Fix: Always include an idempotency key for mutating operations:
@Retry(name = "paymentService") // maxAttempts: 3 comes from the retry config; the annotation has no maxAttempts attribute
public void chargeCustomer(Order order) {
paymentClient.charge(
order.getAmount(),
order.getIdempotencyKey() // Payment provider deduplicates by this key
);
}
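The key only helps if the receiving side remembers it. Here is a minimal in-memory sketch of what deduplication by idempotency key might look like on the provider side. `IdempotentChargeProcessor` and its methods are hypothetical names for illustration, not any real payment provider's API, and a real implementation would persist keys rather than hold them in memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the first request for a key performs the charge,
// any repeat with the same key gets the stored result back.
class IdempotentChargeProcessor {
    private final Map<String, String> resultsByKey = new ConcurrentHashMap<>();
    private final AtomicInteger chargesExecuted = new AtomicInteger();

    // Returns a transaction id; amountCents is unused in this sketch because
    // no real charge is made.
    String charge(String idempotencyKey, long amountCents) {
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> {
            int n = chargesExecuted.incrementAndGet(); // the real charge would happen here
            return "txn_" + n;
        });
    }

    // Number of charges actually executed (repeats are not counted).
    int chargesExecuted() {
        return chargesExecuted.get();
    }
}
```

A retried call with the same key returns the original transaction id and leaves `chargesExecuted()` unchanged, which is the property that makes the retry safe.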
Mistake 4: Circuit Breaker on Every Exception
# BAD: Records ALL exceptions as failures
badConfig:
  recordExceptions:
    - java.lang.Exception   # This includes validation errors, 404s, etc.
Problem: A user passing invalid input (400 Bad Request) trips the circuit breaker. The service is healthy, but your circuit opened because of bad user input.
Fix: Only record infrastructure failures:
recordExceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
- org.springframework.web.client.HttpServerErrorException # 5xx only
ignoreExceptions:
- org.springframework.web.client.HttpClientErrorException # 4xx — client's fault
- com.example.ValidationException
Mistake 5: Same Config for Every Service
Every downstream service has different characteristics. Using identical configs means some services never trip (threshold too high) while others flap constantly (threshold too low).
⚠️ What breaks
Using waitDurationInOpenState: 5s for a database that takes 2 minutes to failover means you'll keep trying every 5 seconds, sending trial requests to a service that won't be ready for 2 minutes — potentially slowing its recovery.
Mistake 6: Circuit Breaker Without Timeout
// BAD: No timeout — thread hangs for 30s even with circuit breaker
@CircuitBreaker(name = "service")
public Response callService() {
return httpClient.get("/api/data"); // HttpClient timeout: 30s default
// Circuit breaker only helps AFTER this fails —
// but thread was blocked 30s waiting for this timeout!
}
Fix: Always pair circuit breaker with aggressive timeouts:
// Set HTTP client timeout much lower than default
@Bean
public WebClient webClient() {
return WebClient.builder()
.clientConnector(new ReactorClientHttpConnector(
HttpClient.create()
.responseTimeout(Duration.ofSeconds(3)) // 3s, not 30s!
))
.build();
}
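If you are not on WebClient, the same idea can be sketched with the JDK's built-in `java.net.http.HttpClient` (Java 11+). The class name `TimeoutExample` and the 1s/3s values are assumptions for illustration; pick values from your own latency data.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

class TimeoutExample {
    // Connect timeout: bounds how long establishing the connection may take.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(1))
                .build();
    }

    // Request timeout: bounds how long the caller waits for the response.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();
    }
}
```

Setting both matters: the connect timeout covers an unreachable host, while the request timeout covers a host that accepts the connection and then stalls.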
Hystrix to Resilience4j Migration
Why Hystrix Is Dead
- Netflix stopped development in November 2018
- Last release: 1.5.18 (no security patches)
- Spring Cloud Netflix Hystrix support was removed in the Spring Cloud 2020.0 release train
- Internally, Netflix moved to adaptive concurrency limits (not Hystrix)
- Java 6-era design: heavy, many dependencies, thread pool per command
Migration Mapping
| Hystrix | Resilience4j | Notes |
|---|---|---|
| @HystrixCommand | @CircuitBreaker | Annotation-based, similar |
| HystrixCommand.Setter | CircuitBreakerConfig.custom() | Builder pattern |
| commandProperties | application.yml config | Externalized in R4j |
| Thread pool per command | Semaphore bulkhead (default) | Lighter by default |
| @HystrixProperty | YAML properties | Easier to change without redeploy |
| HystrixDashboard | Actuator + Prometheus + Grafana | Standard observability stack |
| Turbine aggregation | Prometheus federation | Industry standard |
| getFallback() | fallbackMethod = "..." | Named method reference |
Migration Example
// BEFORE: Hystrix
@HystrixCommand(
fallbackMethod = "getDefaultInventory",
commandProperties = {
@HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),
@HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
@HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "30000"),
@HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000")
},
threadPoolProperties = {
@HystrixProperty(name = "coreSize", value = "10"),
@HystrixProperty(name = "maxQueueSize", value = "20")
}
)
public InventoryResponse getInventory(String sku) {
return inventoryClient.check(sku);
}
public InventoryResponse getDefaultInventory(String sku) {
return InventoryResponse.unknown(sku);
}
// AFTER: Resilience4j
@CircuitBreaker(name = "inventoryService", fallbackMethod = "getDefaultInventory")
@Bulkhead(name = "inventoryService")
@TimeLimiter(name = "inventoryService")
public CompletableFuture<InventoryResponse> getInventory(String sku) {
return CompletableFuture.supplyAsync(() -> inventoryClient.check(sku));
}
public CompletableFuture<InventoryResponse> getDefaultInventory(String sku, Throwable t) {
return CompletableFuture.completedFuture(InventoryResponse.unknown(sku));
}
# application.yml: config externalized!
resilience4j:
  circuitbreaker:
    instances:
      inventoryService:
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
  bulkhead:
    instances:
      inventoryService:
        maxConcurrentCalls: 10
  timelimiter:
    instances:
      inventoryService:
        timeoutDuration: 3s
Key Migration Differences
- Fallback method signature changes: Resilience4j fallback must include a Throwable parameter
- Thread pool isolation is not default: Use @Bulkhead(type = THREADPOOL) explicitly
- Timeout is separate: Use @TimeLimiter instead of built-in Hystrix timeout
- Configuration is externalized: YAML instead of annotations, so you can change it without redeploy
- No Hystrix Dashboard: Use Actuator + Prometheus + Grafana (better, industry standard)
Real-World Architecture: E-Commerce Example
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
subgraph Client["Client Layer"]
WEB["Web App"]
MOB["Mobile App"]
end
subgraph Gateway["API Gateway"]
GW["Spring Cloud Gateway<br/>Rate Limiter: 1000 req/s"]
end
subgraph Services["Microservices"]
OS["Order Service"]
PS["Payment Service<br/>CB: 30% threshold<br/>Retry: 2 attempts"]
IS["Inventory Service<br/>CB: 50% threshold<br/>Bulkhead: 30 concurrent"]
RS["Recommendation Service<br/>CB: 70% threshold<br/>TimeLimiter: 2s"]
NS["Notification Service<br/>CB: 80% threshold<br/>Fire-and-forget"]
SS["Shipping Service<br/>CB: 50% threshold<br/>Retry: 3 attempts"]
end
subgraph External["External Services"]
BANK["Bank API"]
CARRIER["Shipping Carriers"]
ML["ML Engine"]
end
WEB --> GW
MOB --> GW
GW --> OS
OS --> PS
OS --> IS
OS --> RS
OS --> NS
OS --> SS
PS --> BANK
SS --> CARRIER
RS --> ML
style PS fill:#FFEBEE,stroke:#C62828,color:#000
style IS fill:#E3F2FD,stroke:#1565C0,color:#000
style RS fill:#FFF3E0,stroke:#E65100,color:#000
style NS fill:#E8F5E9,stroke:#2E7D32,color:#000
Resilience strategy per service criticality:
| Service | Criticality | CB Threshold | Fallback Strategy | Why |
|---|---|---|---|---|
| Payment | Critical | 30% | Queue for retry | Revenue — don't lose sales |
| Inventory | High | 50% | Optimistic (allow order) | Better to oversell than lose sale |
| Shipping | High | 50% | Show "calculating" + async | Can get rates later |
| Recommendations | Low | 70% | Popular items | Nice-to-have, not blocking |
| Notifications | Low | 80% | Silent fail + retry queue | User doesn't see this |
Interview Questions and Answers
Q1: What is the Circuit Breaker pattern and why is it needed in microservices?
A circuit breaker monitors calls to a downstream service and tracks failures in a sliding window. When the failure rate exceeds a configured threshold, it "trips open" — immediately rejecting all subsequent calls with a fallback response instead of making the network call. This prevents cascading failures where one slow/failed service exhausts thread pools upstream, bringing down the entire system.
Key insight to mention: "Slow is worse than down" — a service returning errors fast is easy to handle. A service hanging for 30 seconds holds threads hostage and kills the caller.
Q2: Explain the three states and their transitions.
CLOSED: Normal operation. All calls pass through. Outcomes tracked in a sliding window. If failure rate exceeds threshold → transitions to OPEN.
OPEN: Protective state. All calls immediately rejected (CallNotPermittedException). Fallback returned. Timer running. After waitDuration → transitions to HALF_OPEN.
HALF_OPEN: Recovery testing. Limited number of trial calls allowed through. If they succeed above threshold → CLOSED. If they fail → back to OPEN.
Bonus: Mention automaticTransitionFromOpenToHalfOpenEnabled — if false, the transition only happens on the next call attempt (lazy), not proactively.
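To make the transitions concrete, here is a deliberately simplified, single-threaded sketch of the three states in plain Java. It is not Resilience4j's implementation: `MiniCircuitBreaker` and its parameters are made-up names, time is passed in explicitly, the OPEN to HALF_OPEN transition is lazy (it happens when the next call asks for permission), and any failed trial call reopens the circuit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;          // number of recent calls evaluated
    private final double failureThreshold; // 0.5 means 50%
    private final long waitMillis;         // minimum time spent OPEN
    private final int trialCalls;          // calls permitted in HALF_OPEN

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;
    private long openedAt;
    private int trialPermits;
    private int trialSuccesses;

    MiniCircuitBreaker(int windowSize, double failureThreshold, long waitMillis, int trialCalls) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.waitMillis = waitMillis;
        this.trialCalls = trialCalls;
    }

    State state() {
        return state;
    }

    // Ask before every call; false means "fail fast and use the fallback".
    boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= waitMillis) {
            state = State.HALF_OPEN; // lazy transition on the next call attempt
            trialPermits = 0;
            trialSuccesses = 0;
        }
        if (state == State.HALF_OPEN) {
            if (trialPermits >= trialCalls) return false; // trial budget used up
            trialPermits++;
            return true;
        }
        return state == State.CLOSED;
    }

    // Report the outcome of a call that was permitted.
    void onResult(boolean success, long nowMillis) {
        if (state == State.HALF_OPEN) {
            if (!success) {
                open(nowMillis);               // trial failed: back to OPEN
            } else if (++trialSuccesses >= trialCalls) {
                state = State.CLOSED;          // all trials succeeded
                window.clear();
            }
        } else if (state == State.CLOSED) {
            window.addLast(!success);
            if (window.size() > windowSize) window.removeFirst();
            if (window.size() == windowSize && failureRate() >= failureThreshold) open(nowMillis);
        }
        // Results that arrive while OPEN (late in-flight calls) are ignored in this sketch.
    }

    private double failureRate() {
        long failures = window.stream().filter(failed -> failed).count();
        return (double) failures / window.size();
    }

    private void open(long nowMillis) {
        state = State.OPEN;
        openedAt = nowMillis;
    }
}
```

With `new MiniCircuitBreaker(4, 0.5, 1000, 2)`, two failures in a full window of four open the circuit, `tryAcquire` rejects calls for the next second, and two successful trial calls close it again.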
Q3: COUNT_BASED vs TIME_BASED sliding window — when would you choose each?
COUNT_BASED (last N calls): Best for consistent traffic. Simple mental model: "if 5 of the last 10 calls failed, trip." The window never empties during traffic pauses, but old outcomes also stay in it until new calls push them out.
TIME_BASED (last N seconds): Best for variable traffic. During burst periods, it evaluates many calls. During quiet periods, old failures age out naturally. Better for services with highly variable call rates (batch processing, event-driven).
Gotcha with TIME_BASED: During very low traffic, you might not reach minimumNumberOfCalls in the window, delaying failure detection.
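The difference in how the two windows treat old outcomes can be sketched in a few lines of plain Java. `CountWindow` and `TimeWindow` are illustrative names, and the time-based version stores one entry per call for simplicity, which is an assumption of this sketch rather than a description of how Resilience4j stores its data.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Keeps the last N outcomes, however old they are.
class CountWindow {
    private final int size;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure

    CountWindow(int size) {
        this.size = size;
    }

    void record(boolean failure) {
        outcomes.addLast(failure);
        if (outcomes.size() > size) outcomes.removeFirst();
    }

    double failureRatePercent() {
        if (outcomes.isEmpty()) return 0.0;
        long failures = outcomes.stream().filter(failed -> failed).count();
        return 100.0 * failures / outcomes.size();
    }
}

// Keeps only outcomes recorded within the last windowMillis.
class TimeWindow {
    private final long windowMillis;
    private final Deque<long[]> outcomes = new ArrayDeque<>(); // {timestampMillis, 1 if failure else 0}

    TimeWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    void record(long nowMillis, boolean failure) {
        outcomes.addLast(new long[] {nowMillis, failure ? 1 : 0});
    }

    double failureRatePercent(long nowMillis) {
        // Drop everything that has aged out of the window.
        while (!outcomes.isEmpty() && outcomes.peekFirst()[0] <= nowMillis - windowMillis) {
            outcomes.removeFirst();
        }
        if (outcomes.isEmpty()) return 0.0;
        long failures = outcomes.stream().filter(o -> o[1] == 1).count();
        return 100.0 * failures / outcomes.size();
    }
}
```

Feed both windows the same five failures and five successes, then stop the traffic: the count-based window still reports the old failure rate a minute later, while the 60-second time-based window has emptied.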
Q4: How do Retry and Circuit Breaker work together? What's the correct order?
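The two can be nested in either order, and the order changes what the circuit breaker counts. With the circuit breaker wrapping the retry (Retry is INNER, CircuitBreaker is OUTER in decoration order), execution is CircuitBreaker → Retry → Actual Call.
In that arrangement, if all retries fail, that counts as ONE failure for the circuit breaker. So 3 retries failing = 1 CB failure count.
Resilience4j's default annotation order is Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead (outer to inner), so with plain annotations Retry is the OUTER decorator and the circuit breaker records every attempt separately. To put the circuit breaker outside, change the aspect order (circuitBreakerAspectOrder lower than retryAspectOrder) or decorate programmatically.
Critical configuration when Retry is outer: Add CallNotPermittedException to retry's ignoreExceptions. You don't want to retry when the circuit is OPEN (it will just reject again).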
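How the nesting order changes the failure count can be shown with two stand-in decorators in plain Java. This is a sketch under stated assumptions: `OrderingDemo`, `countingBreaker` and `retrying` are made-up helpers that only mimic the counting behavior, not Resilience4j classes, and the downstream call always fails.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class OrderingDemo {
    // Stand-in for a circuit breaker: counts every failed outcome it observes.
    static <T> Callable<T> countingBreaker(Callable<T> inner, AtomicInteger failures) {
        return () -> {
            try {
                return inner.call();
            } catch (Exception e) {
                failures.incrementAndGet();
                throw e;
            }
        };
    }

    // Stand-in for retry: up to maxAttempts (>= 1) tries, rethrows the last failure.
    static <T> Callable<T> retrying(Callable<T> inner, int maxAttempts) {
        return () -> {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.call();
                } catch (Exception e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    // Runs one logical call against a downstream that always fails and
    // returns how many failures the breaker stand-in recorded.
    static int failuresSeenByBreaker(boolean breakerOutside, int maxAttempts) {
        AtomicInteger failures = new AtomicInteger();
        Callable<String> downstream = () -> {
            throw new IOException("downstream unavailable");
        };
        Callable<String> decorated;
        if (breakerOutside) {
            decorated = countingBreaker(retrying(downstream, maxAttempts), failures);
        } else {
            decorated = retrying(countingBreaker(downstream, failures), maxAttempts);
        }
        try {
            decorated.call();
        } catch (Exception expected) {
            // every attempt fails in this demo
        }
        return failures.get();
    }
}
```

`failuresSeenByBreaker(true, 3)` yields 1 and `failuresSeenByBreaker(false, 3)` yields 3, so the same outage fills a sliding window three times faster when retry sits outside the breaker.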
Q5: What is the Bulkhead pattern and how does it differ from Circuit Breaker?
Bulkhead isolates resources so one slow service can't consume all available threads. Like watertight compartments in a ship — if one floods, others stay dry.
Circuit Breaker reacts to failure rate — trips when things are broken. Bulkhead limits concurrency — prevents resource starvation regardless of success/failure.
Example: Payment service bulkhead allows max 20 concurrent calls. Even if payment is slow, only 20 threads are affected. Assuming a 200-thread request pool (the Tomcat default), the other 180 threads keep serving inventory checks, product pages, etc.
Two types: Semaphore (lightweight, caller's thread) and Thread Pool (true isolation, dedicated threads).
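The semaphore variant is small enough to sketch with the JDK alone. `SemaphoreBulkhead` below is an illustrative class, not the Resilience4j one: it runs the call on the caller's thread and returns a fallback immediately when no permit is free, whereas a real bulkhead can also wait for a configurable time.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free, otherwise returns the fallback without waiting.
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always hand the permit back, even on exceptions
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Note that the limit applies whether calls succeed or fail, which is the difference from a circuit breaker: the bulkhead caps how many threads one dependency can hold at a time.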
Q6: Why was Hystrix deprecated and how is Resilience4j different?
Hystrix entered maintenance mode in 2018. Netflix internally moved to adaptive concurrency limits. Key issues: monolithic design, Java 6 era patterns, heavy dependencies (Archaius, RxJava), required dedicated thread pool per command.
Resilience4j differences: modular (use only what you need), minimal dependencies (core), Java 8+ functional design, semaphore-based by default (lighter), externalized config (YAML), integrates with standard observability (Micrometer/Prometheus instead of custom dashboard).
Q7: How would you configure a circuit breaker for a critical payment service vs. a nice-to-have recommendation service?
Payment (critical): Low failure threshold (30%), longer wait duration (60s — give payment provider time to recover), fewer trial calls (2-3 — cautious recovery), queue-based fallback (don't lose the sale), mandatory idempotency keys for retries.
Recommendations (nice-to-have): High failure threshold (70% — partial failures acceptable), short wait duration (5-10s — try again quickly), more trial calls (5-10 — recover aggressively), fallback to "popular items" (good enough UX), no retry needed (stale recommendations are fine).
Key insight: The circuit breaker config reflects a business decision about acceptable degradation, not just a technical one.
Q8: What are the most common circuit breaker anti-patterns?
- Sliding window too small: 3 calls means one failure = 33% rate. Circuit flaps open/closed constantly.
- No fallback: Circuit opens, CallNotPermittedException propagates as 500 to user. Defeats the purpose.
- Retrying non-idempotent ops: Payment retry without idempotency key = double charge.
- Recording business exceptions: 404 "user not found" trips circuit. Only record infrastructure failures (IOException, TimeoutException, 5xx).
- No timeout: Thread blocks 30s waiting for timeout, THEN circuit breaker counts it. By then, damage is done.
- Same config everywhere: Database failover takes 2 minutes, but waitDuration is 5s. You'll send trial requests every 5s to a DB that won't be ready for 120s.
Q9: How do you test circuit breakers in production?
- Unit tests: Force state transitions using circuitBreaker.transitionToOpenState(). Verify fallback is called, verify call counts.
- Integration tests: Use WireMock to simulate downstream failures (500s, timeouts). Verify circuit opens after threshold, closes after recovery.
- Chaos Engineering: Chaos Monkey for Spring Boot, randomly injecting latency/exceptions. Verify system stays healthy during partial failures.
- Monitoring: Actuator endpoints for state. Prometheus metrics for failure rates. Alerts when circuit opens (Grafana/PagerDuty).
- Manual testing: circuitBreaker.transitionToOpenState() via admin endpoint (protected). Verify fallback works in staging.
Q10: How does a circuit breaker help with the thundering herd problem?
When a downstream service recovers after an outage, all upstream services simultaneously send requests — overwhelming it again (thundering herd). The circuit breaker's HALF_OPEN state prevents this:
- Circuit is OPEN: all traffic blocked
- Transitions to HALF_OPEN: only permittedNumberOfCallsInHalfOpenState (e.g., 3) trial requests go through
- If those succeed → circuit CLOSES and traffic gradually resumes
- The downstream only gets 3 requests initially, not thousands
Additional protection: Add jitter to waitDurationInOpenState across service instances so they don't all transition to HALF_OPEN simultaneously.
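The jitter itself is simple arithmetic. Below is a plain-Java sketch; `JitteredWait.withJitter` is a hypothetical helper, and the 30s base with a 0.2 factor is only an example. Recent Resilience4j versions also let you supply a wait interval function for the open state, so check your version's documentation before hand-rolling this.

```java
import java.time.Duration;
import java.util.Random;

class JitteredWait {
    // Returns a duration drawn uniformly from [base - base*factor, base + base*factor].
    static Duration withJitter(Duration base, double factor, Random random) {
        long baseMillis = base.toMillis();
        long spread = Math.round(baseMillis * factor);
        // nextDouble() is in [0, 1), so the offset lands in [-spread, +spread].
        long offset = (long) (random.nextDouble() * (2 * spread + 1)) - spread;
        return Duration.ofMillis(baseMillis + offset);
    }
}
```

With a 30s base and a factor of 0.2, each instance waits somewhere between 24s and 36s, so their trial requests arrive spread over a 12-second span instead of all at once.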
Q11: Explain the difference between client-side and server-side circuit breakers.
Client-side (what we've discussed): The calling service has the circuit breaker. It decides when to stop calling the downstream. Each caller makes independent decisions. Standard pattern with Resilience4j.
Server-side (API Gateway level): The gateway/load balancer has the circuit breaker. It removes unhealthy instances from the pool. Shared decision — all clients benefit. Example: Spring Cloud Gateway with circuit breaker filter, Envoy proxy circuit breaking.
Best practice: Use both. Client-side for service-specific fallbacks. Server-side (gateway) for global protection and centralized monitoring.
Q12: How would you handle circuit breaker state in a distributed environment with multiple instances?
Each service instance maintains its OWN circuit breaker state (not shared). This is by design:
- No shared state needed — each instance independently detects failures from its perspective
- Network partition resilience — if instance A can't reach downstream but instance B can, they should have different circuit states
- Simpler — no distributed consensus needed for circuit state
Consideration: If you have 10 instances with slidingWindowSize: 10 each, the downstream sees 100 calls collectively. Size your circuit breaker knowing your instance count.
Alternative: For cross-instance circuit breaking, use a service mesh (Istio/Envoy) which tracks aggregate health centrally.
Quick Reference: Complete Configuration Template
# Production-ready Resilience4j configuration template
resilience4j:
  circuitbreaker:
    configs:
      default:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 3s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        registerHealthIndicator: true
        eventConsumerBufferSize: 100
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.BusinessException
      critical:
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 30
        waitDurationInOpenState: 60s
        permittedNumberOfCallsInHalfOpenState: 2
      non-critical:
        slidingWindowSize: 5
        minimumNumberOfCalls: 3
        failureRateThreshold: 70
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 5
    instances:
      paymentService:
        baseConfig: critical
      inventoryService:
        baseConfig: default
      recommendationService:
        baseConfig: non-critical
  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException
    instances:
      paymentService:
        baseConfig: default
        maxAttempts: 2
      inventoryService:
        baseConfig: default
  bulkhead:
    configs:
      default:
        maxConcurrentCalls: 25
        maxWaitDuration: 200ms
    instances:
      paymentService:
        maxConcurrentCalls: 15
        maxWaitDuration: 500ms
      inventoryService:
        baseConfig: default
      recommendationService:
        maxConcurrentCalls: 10
        maxWaitDuration: 100ms
  timelimiter:
    configs:
      default:
        timeoutDuration: 3s
        cancelRunningFuture: true
    instances:
      paymentService:
        timeoutDuration: 5s
      recommendationService:
        timeoutDuration: 2s
  ratelimiter:
    instances:
      paymentService:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
        timeoutDuration: 0ms
      externalApi:
        limitForPeriod: 50
        limitRefreshPeriod: 1s
        timeoutDuration: 500ms
Decision Flowchart: When to Use What
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
START["Downstream call failing?"] -->|"Yes"| Q1{"Transient or persistent?"}
START -->|"Need to control load"| RL["Use Rate Limiter"]
START -->|"Need resource isolation"| BH["Use Bulkhead"]
Q1 -->|"Transient (network blip)"| RETRY["Use Retry<br/>+ exponential backoff"]
Q1 -->|"Persistent (service down)"| CB["Use Circuit Breaker<br/>+ fallback"]
Q1 -->|"Not sure"| BOTH["Use Retry INSIDE Circuit Breaker"]
RETRY --> Q2{"Call takes too long?"}
CB --> Q2
BOTH --> Q2
Q2 -->|"Yes"| TL["Add TimeLimiter"]
Q2 -->|"No"| DONE["Done"]
TL --> DONE
style CB fill:#FFEBEE,stroke:#C62828,color:#000
style RETRY fill:#E3F2FD,stroke:#1565C0,color:#000
style RL fill:#FFF3E0,stroke:#E65100,color:#000
style BH fill:#E8F5E9,stroke:#2E7D32,color:#000
style TL fill:#F3E5F5,stroke:#6A1B9A,color:#000
| Pattern | Protects Against | Mechanism | Use When |
|---|---|---|---|
| Circuit Breaker | Cascading failures | Fail fast after threshold | Service consistently failing |
| Retry | Transient failures | Try again with backoff | Network blips, brief timeouts |
| Bulkhead | Resource exhaustion | Limit concurrency | Prevent one service consuming all threads |
| Rate Limiter | Overloading downstream | Control request rate | External APIs with quotas |
| TimeLimiter | Thread blocking | Enforce deadline | Prevent indefinite waits |
Related Topics
- API Gateway — Circuit breakers at the gateway level
- Service Discovery — Detecting healthy instances
- Observability — Monitoring circuit breaker states
- Inter-Service Communication — Where failures occur