CodeHive
Part of series: System Design Roadmap

Week 7 Day 3: Logging & Monitoring - Eyes on the System


When a user says “It’s not working”, how do you debug it? You can’t SSH into every production server and grep through files by hand. You need the data collected centrally.

1. Logging (The “What”)

Records individual events.

  • “User 123 logged in”.
  • “Error: DB timeout on query X”.

Tools:

  • ELK Stack: Elasticsearch (Store), Logstash (Ingest), Kibana (Visualize).
  • Structured Logging: Log JSON, not plain text. { "level": "error", "userId": 123, "msg": "DB fail" }. Fields like userId become directly searchable and filterable.
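A minimal sketch of structured logging with Python's standard `logging` module. The `JsonFormatter` class, the `make_logger` helper, and the convention of passing fields through `extra={"fields": ...}` are illustrative choices for this example, not part of any particular logging library or of the ELK Stack itself. Each record becomes one JSON object per line, which a log shipper such as Logstash could then ingest.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "logger": record.name,
        }
        # Merge structured fields passed as extra={"fields": {...}} (our own convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream or sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace, so repeated calls don't duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("app")
    log.info("User logged in", extra={"fields": {"userId": 123}})
    log.error("DB fail", extra={"fields": {"userId": 123, "query": "X"}})
```

The second call prints `{"level": "error", "msg": "DB fail", "logger": "app", "userId": 123, "query": "X"}`. Because every field is a JSON key, a query like “all errors for userId 123” needs no regex parsing.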

2. Monitoring (The “How”)

Records aggregated metrics over time.

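  • “CPU usage is 80%”.
  • “Requests per second is 500”.
  • “P99 latency is 200ms” (99% of requests finish in 200ms or less).

Tools:

  • Prometheus: Periodically scrapes (pulls) metrics from an HTTP endpoint your app exposes.
  • Grafana: Dashboards that chart those metrics over time.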
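To make “aggregated over time” concrete, here is a small sketch that keeps a sliding window of request latencies and derives requests per second and a nearest-rank P99 from it. The `RollingMetrics` class is hypothetical and is not how Prometheus works internally: there, apps typically expose counters and histograms, and rates and percentiles are computed at query time (for example with PromQL's `rate()` and `histogram_quantile()`).

```python
import math
from collections import deque


class RollingMetrics:
    """Hypothetical sliding-window aggregator for request latencies."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, latency_ms) pairs, oldest first

    def _evict(self, now: float) -> None:
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_s:
            self.samples.popleft()

    def record(self, ts: float, latency_ms: float) -> None:
        self.samples.append((ts, latency_ms))
        self._evict(ts)

    def requests_per_second(self, now: float) -> float:
        self._evict(now)
        return len(self.samples) / self.window_s

    def percentile(self, p: float, now: float) -> float:
        """Nearest-rank percentile (0 < p <= 100) of latencies in the window."""
        self._evict(now)
        if not self.samples:
            return 0.0
        latencies = sorted(lat for _, lat in self.samples)
        rank = max(1, math.ceil(p * len(latencies) / 100))
        return latencies[rank - 1]


if __name__ == "__main__":
    m = RollingMetrics(window_s=10.0)
    for i in range(100):  # 100 requests over ~10 s, latencies 1..100 ms
        m.record(ts=i * 0.1, latency_ms=float(i + 1))
    print(m.requests_per_second(now=9.9), m.percentile(99, now=9.9))
```

This prints `10.0 99.0`. Storing every sample is fine for a sketch but grows with traffic; real metrics systems usually bucket values into histograms instead.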

3. Tracing (The “Where”)

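In a microservices architecture, a single request may pass through 10 services. Distributed tracing (e.g. Jaeger or OpenTelemetry) assigns the request a TraceID at the entry point and propagates it to every downstream call; each service records a timed span under that ID. You can then see the full waterfall: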

  • API Gateway (10ms) -> Auth Svc (50ms) -> DB (200ms).

Tomorrow: What happens when the CPU hits 99%? Alerts. 🚨

