Blog #49: Surviving a Black Box API – When You Have to Live with Instability
Analyzing the decision to implement Resilience Patterns when integrating unreliable third-party APIs.
I once worked on a project integrating a local partner's mapping and positioning system. Team of 4, around 50,000 visits per day. The biggest problem wasn't in our code but in the partner's API: it was an extremely temperamental black box. Sometimes it returned 500 errors, sometimes it slowed to 30-second responses, and sometimes it simply... disappeared without a trace.
The Problem: Surviving in Chaos
Our system relied 100% on this API to display shippers' locations. Every time the partner's API "sneezed," our application caught a "cold": users saw error screens, shippers didn't receive orders, and our support line was flooded with complaints.
We couldn't fix the partner's code. The question was: How could our Frontend survive and still maintain a minimum user experience while the foundation (the API) was shaking?
Options Considered
We weighed two strategies:
Option 1: Retry Logic (Persistence Strategy)
- Solution: If an API call fails, automatically retry after 1s, then 2s, then 5s – an "Exponential Backoff" strategy.
- Pros: Handles transient errors caused by network congestion. Easy to set up with libraries like axios-retry.
- Cons: If the partner's API is truly down (a long outage), retrying endlessly only wastes resources and adds load on both sides.
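The retry idea can be sketched in a few lines. This is an illustrative helper, not the axios-retry API; the function name and delay list are assumptions for the example:

```javascript
// Retry a failing async call with increasing delays (backoff).
// delaysMs is the wait before each retry; values here are illustrative.
async function retryWithBackoff(fetchFn, delaysMs = [1000, 2000, 5000]) {
  let lastError;
  for (let attempt = 0; attempt <= delaysMs.length; attempt++) {
    try {
      return await fetchFn(); // Success on any attempt ends the loop.
    } catch (error) {
      lastError = error;
      if (attempt < delaysMs.length) {
        // Wait before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, delaysMs[attempt]));
      }
    }
  }
  // All attempts exhausted: surface the last error to the caller.
  throw lastError;
}
```

Note the weakness called out above: if every attempt fails, the caller still waits through the full delay schedule before seeing an error.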
Option 2: Circuit Breaker & Fallback (Disconnection Strategy)
- Solution: If the API fails more than 5 times consecutively within 1 minute, immediately "break the circuit" (stop calling that API for the next 5 minutes). Instead, display old data (cache) or a friendly notification: "The positioning service is undergoing maintenance; we will use your last known location."
- Pros: Protects our system from freezing along with the partner. Provides a "Graceful degradation" experience for users.
- Cons: Managing the closed/open circuit states adds non-trivial logic on the Frontend.
Final Decision and Analysis
I decided to combine both but prioritized Option 2.
// Pseudo-code of a simple circuit-breaker guard
const fetchDataWithResilience = async () => {
  if (circuitBreaker.isOpen()) {
    return getCachedData(); // Return old data if API is down
  }
  try {
    const data = await apiClient.get('/partner/location');
    circuitBreaker.recordSuccess();
    return data;
  } catch (error) {
    circuitBreaker.recordFailure();
    throw error;
  }
};
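For reference, here is one way the `circuitBreaker` object above could be implemented with the thresholds from Option 2 (5 consecutive failures within 1 minute opens the circuit for 5 minutes). The class name and parameters are a hypothetical sketch, not our production code:

```javascript
// Minimal circuit breaker: opens after `failureThreshold` failures inside
// `failureWindowMs`, then stays open for `openDurationMs` before allowing
// a trial request again. Defaults mirror the thresholds described above.
class CircuitBreaker {
  constructor({
    failureThreshold = 5,
    failureWindowMs = 60_000,   // 1 minute
    openDurationMs = 300_000,   // 5 minutes
  } = {}) {
    this.failureThreshold = failureThreshold;
    this.failureWindowMs = failureWindowMs;
    this.openDurationMs = openDurationMs;
    this.failures = [];   // timestamps of recent failures
    this.openedAt = null; // when the circuit last opened, or null if closed
  }

  isOpen() {
    if (this.openedAt === null) return false;
    if (Date.now() - this.openedAt >= this.openDurationMs) {
      // Cool-down elapsed: close the circuit and allow a trial request.
      this.openedAt = null;
      this.failures = [];
      return false;
    }
    return true;
  }

  recordSuccess() {
    this.failures = []; // Any success resets the failure streak.
  }

  recordFailure() {
    const now = Date.now();
    // Only count failures inside the sliding window.
    this.failures = this.failures.filter((t) => now - t < this.failureWindowMs);
    this.failures.push(now);
    if (this.failures.length >= this.failureThreshold) {
      this.openedAt = now; // Trip the breaker.
    }
  }
}
```

A production version would also distinguish a "half-open" state that lets a single probe request through after the cool-down, but the sketch above captures the core state machine.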
Impact on Performance: Responsiveness improved markedly whenever the partner API went down. Instead of waiting 30 seconds for a timeout before reporting an error, our system reported the error or served cached data within roughly 100 ms.
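Failing fast also means not letting a single hung request hold the UI for 30 seconds. A client-side timeout can be sketched with AbortController; the function name and the 5-second budget here are illustrative assumptions, not our actual configuration:

```javascript
// Abort a fetch that takes longer than timeoutMs, so a hung partner API
// fails fast instead of blocking the UI until the browser's own timeout.
async function fetchWithTimeout(url, timeoutMs = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // fetch rejects with an AbortError once controller.abort() fires.
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // Always clear the timer, success or failure.
  }
}
```

The rejected promise then feeds the circuit breaker's `recordFailure()` like any other error.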
Impact on Maintainability: The code became more "defensive." We had to manage an additional cache layer (LocalStorage or IndexedDB) to guarantee backup data was always available.
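The cache layer mentioned here could look something like this minimal localStorage-backed sketch. The factory name, storage key, and TTL are illustrative (an IndexedDB version would follow the same shape); the storage object is injectable so the logic can be tested outside a browser:

```javascript
// Small cache for the fallback path: stores the last known payload with a
// timestamp, and flags entries older than maxAgeMs as stale-but-usable.
function createLocationCache(storage = globalThis.localStorage, key = 'partner-location-cache') {
  return {
    save(data) {
      storage.setItem(key, JSON.stringify({ data, savedAt: Date.now() }));
    },
    load(maxAgeMs = 10 * 60_000) {
      const raw = storage.getItem(key);
      if (raw === null) return null; // Nothing cached yet.
      const { data, savedAt } = JSON.parse(raw);
      // Stale entries still serve as the "last known location" fallback.
      return { data, isStale: Date.now() - savedAt > maxAgeMs };
    },
  };
}
```

On every successful API response we would call `save()`, so `getCachedData()` in the circuit-breaker path always has something to return.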
Impact on Team: Juniors learned a hard lesson about "never fully trusting anything outside your own system."
Self-Reflection: Was it Over-engineering?
I asked myself: Why not let the Backend handle this for the Frontend? In reality, our Backend would also freeze without a similar mechanism. The Frontend proactively handling circuit breaking saved the server from thousands of hopeless requests, spared users' phone batteries, and kept the UI in a controllable state at all times.
If I were starting over, would I choose differently? No. In the modern web world, where we integrate dozens of Microservices and 3rd-party APIs, Resilience is not an option; it is survival.
Notes on building fortresses in the middle of a storm.
Series • Part 49 of 50