When the System Goes Silent - Lessons on Post-mortem and Blame-free Culture
A small bug in Service A crashes the entire payment flow. How do you learn from failure without turning it into a blame game?
Failure is an inevitable part of distributed systems. What matters is not who did something wrong, but what the system lacked that allowed that error to occur.
The First Incident: A Dark Day
It was a Friday afternoon (of course!). A team member updated a library in the Shipping Service. The change seemed harmless, but the new library version had a memory leak that made the service respond very slowly.
Because we didn't have a Circuit Breaker yet (as discussed in post 2), the Order Service kept waiting indefinitely on the Shipping Service. Then the Payment Service hung in turn while waiting for the Order Service.
Within 15 minutes, the entire system went "silent"—not a single request succeeded. The dashboard was solid red.
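A circuit breaker would have broken that chain: after a few consecutive failures, the Order Service would fail fast instead of tying up threads. Here is a minimal sketch in Python; the class name, thresholds, and error message are illustrative, not our actual implementation:

```python
import time

class CircuitBreaker:
    """Fail fast once a downstream dependency keeps erroring out."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds before allowing a retry
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: reject immediately instead of waiting.
                raise RuntimeError("circuit open: downstream service unavailable")
            # Reset window elapsed: allow one trial call (half-open state).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                           # any success closes the circuit
        return result
```

Wrapped around the Shipping Service call, this returns an immediate error after a handful of failures, so the Order Service can degrade gracefully instead of dragging the Payment Service down with it.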
1. Reaction: Don't Find a "Culprit"
In moments of panic, the first question usually asked is: "Who just deployed what?" This is a dangerous reflex. If you punish the person who made the mistake, next time they will hide their errors, and that is where the real disaster begins.
We adopted a Blame-free Post-mortem culture:
- Focus on the process: Why didn't CI/CD catch this error? Why didn't monitoring alert us sooner?
- Focus on the system: Where were we missing a Circuit Breaker mechanism?
2. Writing the Post-mortem: Looking in the Mirror
After the system was back online, we sat down to write a Post-mortem document consisting of:
- Incident Summary: What happened? (Users couldn't pay for 30 minutes).
- Timeline: 14:00 deploy, 14:10 errors began, 14:15 alert received, 14:30 successful rollback.
- Root Cause: We used the "5 Whys" technique to dig past the surface trigger (the library update) to the underlying gaps.
- Corrective Actions: Immediately install Circuit Breakers, add RAM alerts for each service.
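As one concrete shape for that last corrective action, a memory alert can be expressed as an alerting rule. The article doesn't name a monitoring stack, so this is an assumption: a Prometheus-style rule with a placeholder metric, service label, and threshold that you would adapt to your own setup:

```yaml
groups:
  - name: memory-alerts
    rules:
      - alert: HighMemoryUsage
        # container_memory_working_set_bytes comes from cAdvisor; swap in
        # whatever memory metric your stack exposes. 1.5e9 bytes (~1.5 GB)
        # is a placeholder threshold.
        expr: container_memory_working_set_bytes{container="shipping-service"} > 1.5e9
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Shipping Service memory above 1.5 GB for 5 minutes"
```

The `for: 5m` clause matters: it suppresses alerts on short-lived spikes, so the page only fires on sustained growth, which is exactly the signature of a leak like ours.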
3. The Price of Maturity
That post-mortem meeting was our most effective meeting to date. It completely changed how we wrote code thereafter:
- Always assume other services can die at any time (Design for failure).
- Always have a timeout for every internal API call.
- Consider writing logs and metrics as being just as important as writing features.
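The second rule above, a timeout on every internal call, can be sketched with Python's standard `concurrent.futures` module. The helper name and the two-second default are illustrative choices, not a prescribed API:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Shared pool for outbound internal calls; size it to your traffic.
_executor = ThreadPoolExecutor(max_workers=8)

def call_with_timeout(func, *args, timeout_s=2.0, **kwargs):
    """Run an internal service call, but never wait longer than timeout_s."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        future.cancel()  # best effort; a worker thread may still be running
        raise
```

With this in place, a slow Shipping Service surfaces as a `TimeoutError` in the caller after two seconds, which the caller can handle, instead of silently stalling the whole request chain.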
Lessons for Managers/Leaders
If you are leading a small team doing Microservices:
- Protect your team: When an incident occurs, take responsibility in front of upper management, and give the team a safe space to fix the error.
- Turn mistakes into assets: A good Post-mortem document is worth more than ten textbooks because it is a hard-earned lesson from the system you are actually building.
Conclusion: The Journey Never Ends
Microservices are not a destination; they are a discipline-building journey. That journey forces us to mature not just in coding skill, but also in systems thinking and team culture.
Through these 13 articles, I hope you've gained a realistic (and "scarred") perspective on building microservices for a small team. It is full of trade-offs, but the results—a flexible, scalable system and a cohesive team—are well worth it.
I wish you strength and resilience on your path to conquering distributed code!
Series • Part 13 of 20