When the System Goes Silent - Lessons on Post-mortem and Blame-free Culture

A small bug in Service A crashes the entire payment flow. How do you learn from failure without turning it into a blame game?

Truong Pham, Software Engineer
Published March 30, 2024
Stack: microservice · culture · management · incident-response

Failure is an inevitable part of distributed systems. What matters is not who did something wrong, but what the system lacked that allowed that error to occur.


The First Incident: A Dark Day

It was a Friday afternoon (of course!). A team member updated a library in the Shipping Service. It sounded harmless, but that library caused a memory leak, making the service respond very slowly.

Because we didn't have a Circuit Breaker yet (as discussed in post 2), the Order Service kept waiting on the slow Shipping Service. The Payment Service then hung in turn while waiting for the Order Service. Within 15 minutes, the entire system went "silent": not a single request succeeded. The dashboard was solid red.
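To make the missing piece concrete, here is a minimal sketch of a circuit breaker. It is not the implementation we ended up with; the class name, thresholds, and injectable clock are assumptions chosen for illustration, and the sketch is not thread-safe. It only counts calls that raise, so it needs to be paired with a timeout on the wrapped call; otherwise a slow dependency never registers as a failure.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without contacting the dependency."""


class CircuitBreaker:
    # Hypothetical defaults: open after 3 consecutive failures, retry after 30 s.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of queueing behind an unhealthy dependency.
                raise CircuitOpenError("dependency is marked unhealthy")
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

With something like this wrapped around the Order Service's calls to the Shipping Service, the Order Service would have started failing fast after a few timed-out calls instead of tying up its own capacity, which in turn gives the Payment Service a quick error to handle rather than a hang.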

1. Reaction: Don't Find a "Culprit"

In moments of panic, the first question usually asked is: "Who just deployed what?" This is a dangerous reflex. If you punish the person who made a mistake, next time they'll hide their errors, and that is where the real disaster lies.

We adopted a Blame-free Post-mortem culture:

  • Focus on Process: Why didn't CI/CD detect this error? Why didn't monitoring alert us sooner?
  • Focus on System: Where were we missing a Circuit Breaker mechanism?

2. Writing the Post-mortem: Looking in the Mirror

After the system was back online, we sat down to write a Post-mortem document consisting of:

  1. Incident Summary: What happened? (Users couldn't pay for 30 minutes).
  2. Timeline: 14:00 deploy, 14:10 errors began, 14:15 alert received, 14:30 successful rollback.
  3. Root Cause: Use the "5 Whys" technique to dig deep.
  4. Corrective Actions: Add Circuit Breakers immediately and set up memory-usage alerts for each service.
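The four sections above can also be captured as a small data structure, which makes it easy to compute how long detection and mitigation took. This is only a sketch: the class, field names, and helper are made up for illustration, and the "whys" are my paraphrase of the incident narrative rather than the text of our actual document. The clock times are the ones from the timeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List


def at(hhmm: str) -> datetime:
    # Only clock times are known, so no calendar date is set here
    # (strptime's default date cancels out when two times are subtracted).
    return datetime.strptime(hhmm, "%H:%M")


@dataclass
class PostMortem:
    summary: str
    timeline: Dict[str, datetime]
    whys: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        """From the first errors to the first alert."""
        return self.timeline["alert"] - self.timeline["errors_began"]

    def time_to_mitigate(self) -> timedelta:
        """From the first alert to the successful rollback."""
        return self.timeline["rollback"] - self.timeline["alert"]


pm = PostMortem(
    summary="Users couldn't pay for 30 minutes.",
    timeline={
        "deploy": at("14:00"),
        "errors_began": at("14:10"),
        "alert": at("14:15"),
        "rollback": at("14:30"),
    },
    whys=[
        "Why couldn't users pay? The Payment Service hung waiting on the Order Service.",
        "Why did the Order Service hang? It waited on the Shipping Service with no Circuit Breaker.",
        "Why was the Shipping Service slow? It had a memory leak.",
        "Why was there a memory leak? A library update introduced it.",
        "Why did that reach production unnoticed? Open question for CI/CD and monitoring.",
    ],
    corrective_actions=[
        "Add Circuit Breakers",
        "Add memory-usage alerts for each service",
    ],
)

print(pm.time_to_detect())    # 0:05:00
print(pm.time_to_mitigate())  # 0:15:00
```

Tracking these two durations across incidents gives the team a number to improve, which keeps the conversation on the system rather than on a person.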

3. The Price of Maturity

That post-mortem meeting was our most effective meeting to date. It completely changed how we wrote code thereafter:

  • Always assume other services can die at any time (Design for failure).
  • Always have a timeout for every internal API call.
  • Treat writing logs and metrics as just as important as writing features.
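As a rough illustration of the last two habits, the sketch below wraps an internal HTTP call so that it always carries a timeout and always leaves a log line and a counter behind. The function name, the 2-second default, the metric names, and the in-memory `Counter` (a stand-in for a real metrics client) are all assumptions for this example; only `urllib.request.urlopen` and its `timeout` parameter come from the standard library.

```python
import logging
import time
import urllib.request
from collections import Counter

log = logging.getLogger("internal_calls")
metrics = Counter()  # stand-in for a real metrics client (assumption)


def call_internal(url, timeout_s=2.0, opener=urllib.request.urlopen):
    """Call an internal endpoint with a mandatory timeout, recording the outcome."""
    start = time.monotonic()
    try:
        # The timeout bounds blocking network operations such as connect and read.
        with opener(url, timeout=timeout_s) as resp:
            body = resp.read()
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        metrics["internal_call.failure"] += 1
        log.warning(
            "call to %s failed after %.2fs: %s",
            url, time.monotonic() - start, exc,
        )
        raise
    metrics["internal_call.success"] += 1
    log.info("call to %s took %.2fs", url, time.monotonic() - start)
    return body
```

The `opener` parameter exists only so the function can be exercised without a network; in a test you can pass a fake that returns canned bytes or raises `TimeoutError`. Because a timed-out call surfaces as an exception, it is also something a circuit breaker can count.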

Lessons for Managers/Leaders

If you are leading a small team doing Microservices:

  • Protect your team: When an incident occurs, be the one who takes responsibility in front of management, and give the team a safe space to fix the error.
  • Turn mistakes into assets: A good Post-mortem document is worth more than ten textbooks because it is a hard-earned lesson from the system you are actually building.

Conclusion: The Journey Never Ends

Microservices are not a destination; they are a journey of building discipline. They force us to mature not just in coding skills, but also in system thinking and team culture.

Through these 13 articles, I hope you've gained a realistic (and "scarred") perspective on building microservices for a small team. It is full of trade-offs, but the results (a flexible, scalable system and a cohesive team) are well worth it.

I wish you strength and resilience on your path to conquering distributed code!

Series • Part 13 of 20

Microservice Journey: Lessons & Trade-offs

  08. Database Separation - A Painful but Necessary 'Divorce'
  09. BFF (Backend for Frontend) - The Savior of UX
  10. Idempotency - Why Every API Should Be 'Stubborn'?
  11. Infra as Code (IaC) - Automate or Die in a Mountain of Config
  12. The Pain Named Cloud Bill - Cost Optimization for 'Broke Teams'
  13. When the System Goes Silent - Lessons on Post-mortem and Blame-free Culture (this article)
  14. Contract Testing - When Integration Testing Becomes a Burden
  15. AI-Service Integration - Bringing LLM/RAG into Microservice Architecture
  16. Feature Flags - Deploy and Release are No Longer One
  17. Secret Management - Don't Let API Keys Wander Around
  18. Macroservice - When We Decide to 'Merge' Services Back

Written by Truong Pham

Software Engineer passionate about building high-performance systems and meaningful experiences.
