Part of series: System Design Roadmap
Week 4 Day 4: Availability & Reliability - The Nines
We want our system to be “Up” all the time. But what does that mean?
1. Reliability vs Availability
- Reliability: “Does the system do the right thing?” (Bug free, accurate data).
- Availability: “Is the system accessible?” (Not 502/503 errors).
2. Measuring Availability (The Nines)
- 99% (2 Nines): Down 3.65 days/year. (Okay for hobby).
- 99.9% (3 Nines): Down 8.7 hours/year. (Good for business).
- 99.99% (4 Nines): Down 52 mins/year. (Enterprise).
- 99.999% (5 Nines): Down 5 mins/year. (Critical Infrastructure like AWS S3).
3. Achieving HA (High Availability)
The only way to achieve HA is Redundancy.
- No SPOF: Eliminate Single Points of Failure.
- Active-Passive: Main server runs. Hot Standby waits. If Main dies, Standby takes over.
- Active-Active: Both servers run. If one dies, usage sends to the other.
4. Mean Time (Metrics)
- MTBF (Mean Time Between Failures): How long it runs before crashing. (Target: High).
- MTTR (Mean Time To Recovery): How fast you fix it. (Target: Low).
- Availability = MTBF / (MTBF + MTTR).
Lesson: You can improve availability by fixing things faster (Automated restarts), not just preventing crashes.
Tomorrow: Mini Project. We implement a Quorum-based store! 🗳️