Part of series: System Design Roadmap
Week 7 Day 3: Logging & Monitoring - Eyes on the System
When a user says “It’s not working”, how do you debug it? You can’t SSH into every production server and read files by hand. You need centrally collected data.
1. Logging (The “What”)
Records individual events.
- “User 123 logged in”.
- “Error: DB timeout on query X”.
Tools:
- ELK Stack: Elasticsearch (Store), Logstash (Ingest), Kibana (Visualize).
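- Structured Logging: Log JSON, not free-form text:
{ "level": "error", "userId": 123, "msg": "DB fail" }
Each field (level, userId) becomes a key you can search and filter on, instead of grepping through sentences.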
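To make structured logging concrete, here is a minimal sketch using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the logger name are hypothetical names chosen for illustration; a real setup would more likely use an existing JSON logging library and ship the lines to something like Logstash.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (illustrative only)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        # This sketch only knows about one extra field, "userId".
        if hasattr(record, "userId"):
            payload["userId"] = record.userId
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("app.structured")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Write to an in-memory buffer so the output is easy to inspect.
buffer = io.StringIO()
log = build_logger(buffer)
log.error("DB fail", extra={"userId": 123})

# Because the line is JSON, it parses straight back into fields.
record = json.loads(buffer.getvalue().strip())
```

Since every line is a JSON object, a log store can index `level` and `userId` as fields, which is what makes queries like "all errors for user 123" cheap.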
2. Monitoring (The “How”)
Records aggregated metrics over time.
- “CPU usage is 80%”.
- “Requests per second is 500”.
- “P99 Latency is 200ms”.
Tools:
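- Prometheus: Scrapes (pulls) metrics over HTTP from an endpoint your app exposes and stores them as time series.
- Grafana: Dashboards and graphs built on top of those metrics.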
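To show what "aggregated" means, here is a small sketch that turns raw per-request latencies into the kind of numbers listed above. The `percentile` and `summarize` functions and the sample data are made up for illustration. It uses the simple nearest-rank method over raw samples; monitoring systems typically estimate percentiles from pre-aggregated histogram buckets instead, so treat this as the idea rather than how any particular tool computes it.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms, window_seconds):
    """Collapse individual request latencies into a few aggregate metrics."""
    return {
        "requests_per_second": len(latencies_ms) / window_seconds,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Hypothetical window: 1000 requests in 2 seconds, 980 fast and 20 slow.
latencies = [20] * 980 + [200] * 20
summary = summarize(latencies, window_seconds=2)
```

With this sample data the median stays at 20ms while the P99 is 200ms, which is why tail percentiles are tracked separately: an average or median would hide the slow 2% of requests.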
3. Tracing (The “Where”)
In a microservices architecture, a single request may pass through 10 services.
Distributed Tracing (Jaeger / OpenTelemetry) assigns a TraceID to the request and passes it along to every service the request touches. You can see the full waterfall:
- API Gateway (10ms) -> Auth Svc (50ms) -> DB (200ms).
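The waterfall above can be sketched as a toy model of TraceID propagation. This is not the Jaeger or OpenTelemetry API: `Trace`, `Span`, and the three service functions are hypothetical, and the durations are hard-coded from the example rather than measured. The point is only that one ID is created at the edge and handed to every downstream call.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work done by one service on behalf of a request."""
    trace_id: str
    service: str
    duration_ms: int


@dataclass
class Trace:
    """Holds the shared TraceID and the spans recorded under it."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, service: str, duration_ms: int) -> None:
        self.spans.append(Span(self.trace_id, service, duration_ms))

    def waterfall(self) -> str:
        return " -> ".join(f"{s.service} ({s.duration_ms}ms)" for s in self.spans)


# Each "service" receives the trace context and passes it downstream.
def db(trace: Trace) -> None:
    trace.record("DB", 200)


def auth_svc(trace: Trace) -> None:
    trace.record("Auth Svc", 50)
    db(trace)


def api_gateway(trace: Trace) -> None:
    trace.record("API Gateway", 10)
    auth_svc(trace)


trace = Trace()  # the TraceID is generated once, at the edge
api_gateway(trace)
print(trace.waterfall())
```

In a real system the services run in separate processes, so the TraceID travels in request headers instead of a function argument, and each service reports its spans to a collector that stitches them together by ID.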
Tomorrow: What happens when the CPU hits 99%? Alerts. 🚨