Reliability & Fault Tolerance in System Design: Keeping Your System Alive When Everything Goes Wrong

DevOps engineer & developer passionate about building scalable, reliable systems. I design and automate pipelines, manage cloud infrastructure, and ensure deployments run smoothly. Turning complex workflows into seamless operations is my craft.

Picture this:
You’re watching the final over of a World Cup cricket match 🏏, millions of fans glued to the app, and… suddenly the app crashes.
The entire country rages. Memes flood Twitter. Your company’s brand takes a beating.

What went wrong?
👉 The system wasn’t reliable enough.

In system design, reliability & fault tolerance are about building systems that keep running — even when things break (because things will break).


1. ⚡ What Is Reliability?

Reliability = the probability that your system works as expected over time.

  • A reliable system delivers correct responses, consistently.

  • Often tracked in practice through availability, i.e. uptime (e.g., 99.9% availability = “three nines”).

👉 Example: Banking apps must be ultra-reliable — you don’t want your balance disappearing due to a server crash.


2. 🛡️ What Is Fault Tolerance?

Fault Tolerance = the system’s ability to keep working even when parts fail.

  • Failures are inevitable: servers crash, networks partition, disks die.

  • A fault-tolerant system absorbs the failure gracefully and keeps serving users.

👉 Example: If one Netflix data center fails, users are redirected to another — and you never even notice.


3. 🧩 Techniques to Achieve Reliability & Fault Tolerance

🔹 Replication

  • Keep multiple copies of data/services.

  • Primary-Replica (Master-Slave) → Writes go to the primary, reads can be served from replicas.

  • Multi-Master → Multiple servers accept writes → conflict resolution needed.

  • Protects durability and availability: if one server crashes, another copy still holds the data.
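
To make the primary-replica idea concrete, here is a minimal in-memory sketch. The class `PrimaryReplicaStore` and its methods are hypothetical names invented for illustration, and it assumes synchronous replication (every write is copied to all replicas before returning), which real databases often relax.

```python
from typing import Dict, List, Optional


class PrimaryReplicaStore:
    """Toy primary-replica store: writes hit the primary, reads hit replicas."""

    def __init__(self, replica_count: int = 2) -> None:
        self.primary: Dict[str, str] = {}
        self.replicas: List[Dict[str, str]] = [{} for _ in range(replica_count)]
        self._next = 0  # round-robin cursor used to spread reads

    def write(self, key: str, value: str) -> None:
        # All writes go through the primary, then are copied to every replica.
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value

    def read(self, key: str) -> Optional[str]:
        # Serve reads from replicas in round-robin order; fall back to the
        # primary if no replicas are left.
        if not self.replicas:
            return self.primary.get(key)
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.get(key)

    def promote_replica(self, index: int = 0) -> None:
        # Simulate losing the primary: one replica becomes the new primary.
        self.primary = self.replicas.pop(index)


store = PrimaryReplicaStore(replica_count=2)
store.write("balance", "100")
store.promote_replica()          # the old primary "crashed"
print(store.primary["balance"])  # the data survived on the promoted replica
```

Because replication here is synchronous, the promoted replica has every write. With asynchronous replication, a replica may lag behind and recent writes can be lost on promotion.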


🔹 Redundancy

  • N+1 servers → Provision one more server than the N needed to carry the load, so losing any single server doesn’t reduce capacity.

  • Active-Passive → One server active, another on standby.

  • Active-Active → Multiple servers active simultaneously.

👉 Think of airplanes → two engines instead of one.
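
A rough way to see why redundancy helps is to compute the combined availability of several copies. The sketch below assumes the copies fail independently and that any single healthy copy is enough to serve traffic. Both are simplifying assumptions, and `combined_availability` is a hypothetical helper name.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant servers when any one of them suffices.

    Assumes failures are independent, which is rarely exactly true in practice
    (shared power, shared network, shared bugs).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


for n in (1, 2, 3):
    print(n, "copies:", combined_availability(0.99, n))
```

Under these assumptions, two servers that are each 99% available give roughly 99.99% together, which is why a second engine buys so much.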


🔹 Failover Mechanisms

  • Automatic switch from failed component to backup.

  • Example: DNS failover → if the primary server goes down, traffic is routed to the backup.
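
The same idea can be sketched at the client level. Note this is application-side failover, not DNS failover: the caller tries endpoints in priority order and moves on when one fails. The function `call_with_failover` and the `primary`/`backup` stand-ins are hypothetical.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(endpoints: Sequence[Callable[[], str]]) -> str:
    """Try each endpoint in priority order; return the first successful response."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return endpoint()
        except ConnectionError as exc:
            # Remember the failure and fall through to the next backup.
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error


def primary() -> str:
    raise ConnectionError("primary is down")  # simulated outage


def backup() -> str:
    return "served by backup"


print(call_with_failover([primary, backup]))
```

Only `ConnectionError` triggers the switch here. A real client would also need timeouts, since a server that hangs is often worse than one that refuses connections.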


🔹 Health Checks & Monitoring

  • Continuously check if servers/services are alive.

  • Tools: Prometheus, Grafana, CloudWatch.

  • Automatic removal of unhealthy servers from load balancers.
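
As a sketch of how a load balancer might drop unhealthy servers, the class below counts consecutive failed checks per server. `HealthCheckedPool` and its threshold are hypothetical, and the probe itself (for example an HTTP request to a health endpoint) is left out so the logic stays visible.

```python
from typing import Dict, Iterable, List


class HealthCheckedPool:
    """Tracks consecutive failed health checks and hides unhealthy servers."""

    def __init__(self, servers: Iterable[str], failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self._failures: Dict[str, int] = {server: 0 for server in servers}

    def record_check(self, server: str, healthy: bool) -> None:
        # One passing check resets the counter; failures accumulate.
        if healthy:
            self._failures[server] = 0
        else:
            self._failures[server] += 1

    def in_rotation(self) -> List[str]:
        # Only servers below the failure threshold receive traffic.
        return [
            server
            for server, failures in self._failures.items()
            if failures < self.failure_threshold
        ]


pool = HealthCheckedPool(["app-1", "app-2"], failure_threshold=2)
pool.record_check("app-2", healthy=False)
pool.record_check("app-2", healthy=False)
print(pool.in_rotation())  # app-2 has been taken out of rotation
```

Requiring several consecutive failures before removal avoids evicting a server over a single dropped packet, at the cost of reacting a little more slowly.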


🔹 Circuit Breakers

  • Like fuses in your home.

  • If a service is failing repeatedly, stop calling it to avoid cascading failures.

  • Example: Netflix OSS Hystrix (a famous circuit breaker library, now in maintenance mode; Resilience4j is a commonly used alternative).
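
A circuit breaker can be sketched in a few lines. This is a simplified illustration, not how Hystrix is implemented: the class name, the thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of piling load on a failing service.
                raise CircuitOpenError("circuit is open; skipping call")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage would look like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is whatever remote call you want to protect. Callers catch `CircuitOpenError` and fall back to a default response.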


🔹 Idempotency

  • Retrying requests should not cause duplicate effects.

  • Example: Payment APIs → pressing “Pay” twice should not charge twice.
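
One common way to get this behaviour is an idempotency key: the client sends the same key with every retry, and the server replays the stored result instead of charging again. The sketch below uses a hypothetical `PaymentService` with an in-memory dictionary. A real service would need a shared, durable store and an atomic check-and-insert.

```python
from typing import Dict, List, Tuple


class PaymentService:
    """Toy payment service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}        # idempotency key -> stored receipt
        self.charges: List[Tuple[str, int]] = []   # every charge actually made

    def pay(self, idempotency_key: str, account: str, amount_cents: int) -> dict:
        # A repeated key means a retry: return the original receipt, charge nothing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        receipt = {
            "charge_id": len(self.charges),
            "account": account,
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
first = service.pay("order-42", "alice", 1999)
retry = service.pay("order-42", "alice", 1999)  # user pressed "Pay" twice
print(first == retry, len(service.charges))
```

The key is generated once per logical payment (here the made-up value `"order-42"`), not once per HTTP request, so retries and double clicks collapse into a single charge.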


🔹 Quorum & Consensus Protocols

  • In distributed systems, use quorums (majority votes) to agree on data.

  • Algorithms: Paxos, Raft, ZAB.

  • Preserves consistency during failures, as long as a majority of nodes can still talk to each other.
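
The arithmetic behind quorums is small enough to sketch. The helpers below are hypothetical names. `majority` and `tolerated_failures` reflect the majority rule used by protocols like Raft, while `quorums_overlap` is the read/write overlap rule (W + R > N) used by quorum-replicated stores.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return n - majority(n)


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum shares at least one node with every write quorum."""
    return w + r > n


for n in (3, 4, 5):
    print(n, "nodes: majority", majority(n), "tolerates", tolerated_failures(n))
```

Notice that 4 nodes tolerate no more failures than 3, which is why clusters are usually sized with an odd number of nodes.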


🔹 Graceful Degradation

  • When parts of the system fail, offer a “reduced” service instead of total failure.

  • Example: If recommendations fail, YouTube still plays the video.
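
In code, graceful degradation often comes down to wrapping the optional dependency in a fallback. The sketch below is loosely modelled on the video example. `build_watch_page`, `fetch_recommendations`, and the page fields are hypothetical.

```python
from typing import Callable, List


def fetch_recommendations(video_id: str) -> List[str]:
    # Stand-in for a remote call; here it always fails to simulate an outage.
    raise TimeoutError("recommendation service timed out")


def build_watch_page(
    video_id: str,
    recommender: Callable[[str], List[str]] = fetch_recommendations,
) -> dict:
    # The core feature (playback) does not depend on the optional one.
    page = {"video_id": video_id, "stream_url": f"/stream/{video_id}"}
    try:
        page["recommendations"] = recommender(video_id)
    except Exception:
        # Optional feature failed: serve the page without it and flag the degradation.
        page["recommendations"] = []
        page["degraded"] = True
    return page


print(build_watch_page("abc123"))
```

The important design choice is deciding up front which features are essential and which are optional, so a failure in the second group never blocks the first.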


🔹 Chaos Engineering

  • Netflix’s Chaos Monkey randomly kills servers in production to test resilience.

  • Goal: Proactively find weaknesses before real disasters.
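
You can experiment with the same idea at a much smaller scale by injecting faults into a function call. To be clear, this is not how Chaos Monkey works (it terminates whole instances). It is just a hypothetical `with_chaos` wrapper for trying out failure handling in tests.

```python
import random
from typing import Any, Callable


def with_chaos(
    func: Callable[..., Any],
    failure_rate: float,
    rng: random.Random,
) -> Callable[..., Any]:
    """Wrap func so that a fraction of calls fail with an injected error."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # rng.random() returns a float in [0.0, 1.0), so a rate of 1.0 always
        # fails and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    return wrapper


def get_score() -> str:
    return "IND 287/4"


flaky_get_score = with_chaos(get_score, failure_rate=0.3, rng=random.Random(7))
failures = 0
for _ in range(100):
    try:
        flaky_get_score()
    except ConnectionError:
        failures += 1
print("injected failures:", failures)
```

Passing in a seeded `random.Random` keeps the experiment repeatable, which makes it easier to check that retries, failover, and fallbacks behave the way you expect.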


4. 📊 Reliability Metrics (a.k.a. The “Nines”)

Reliability is often expressed as availability percentages:

  • 99% → “two nines” → ~3.65 days downtime/year.

  • 99.9% → “three nines” → ~8.76 hours downtime/year.

  • 99.99% → “four nines” → ~52 minutes downtime/year.

  • 99.999% → “five nines” → ~5 minutes downtime/year.

👉 The higher the nines, the more complex and expensive the system.
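
The downtime figures follow directly from the percentage. A minimal sketch, assuming a 365-day year and a hypothetical helper name `downtime_minutes_per_year`:

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime in minutes over a 365-day year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1 - availability_percent / 100) * minutes_per_year


for nines in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(nines)
    print(f"{nines}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is a handy way to sanity-check any availability target you are given.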


5. ⚖️ Trade-Offs

  • More redundancy = more cost.

  • More replicas = more consistency challenges.

  • More monitoring = more noise if misconfigured.

In interviews, mention:
👉 “There’s always a cost vs. reliability trade-off in fault tolerance; you can’t design for infinite reliability.”


6. 📌 Real-World Examples

  • Netflix → Multi-region failover, Chaos Monkey testing.

  • Google Search → Distributed across thousands of servers, auto-failover.

  • Banking Systems → Active-active clusters, strong consistency, audit trails.


7. 🏁 Closing Thoughts

Reliability & fault tolerance are about planning for failure, not avoiding it.

Key takeaways:

  1. Failures are inevitable. Design like they’ll happen tomorrow.

  2. Use replication, redundancy, and failover as your base tools.

  3. Add monitoring, circuit breakers, and graceful degradation for resilience.

  4. Remember the availability vs. cost trade-off.

🔑 One-liner to drop in interviews:
“Reliable systems don’t avoid failure — they absorb it and keep going.”


✨ Fun closer:
“Think of fault tolerance like parachutes — you don’t notice them until you really, really need them.” 🪂😆