Monitoring is the eyes and ears of your infrastructure. Without it, you're operating blind. Based on my experience setting up monitoring for production clusters at Badr Interactive, here are the stack, dashboards, and alerting strategy I rely on.
The Stack
- Prometheus: metrics collection and alert rule evaluation
- Grafana: dashboards and visualization
- Alertmanager: alert routing and notification
- Node Exporter: host-level metrics (CPU, memory, disk, network)
- kube-state-metrics: Kubernetes object state (deployments, pods, nodes)
Prometheus Setup
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
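As written, the `kubernetes-pods` job tries to scrape every pod the service discovery returns. A common refinement is to keep only pods that opt in through annotations. The sketch below assumes the widely used `prometheus.io/scrape`, `prometheus.io/path`, and `prometheus.io/port` annotations, which are a convention rather than anything built into Prometheus, so adjust the names if your cluster uses something else.

```yaml
# Sketch: replaces the 'kubernetes-pods' job under scrape_configs.
# The prometheus.io/* annotation names are an assumed convention.
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Use a custom metrics path if the pod declares one
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'
      # Rewrite the scrape address to use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Carry namespace and pod name over as regular labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

A pod then opts in by setting `prometheus.io/scrape: "true"` in its metadata annotations, which keeps the target list limited to workloads that actually expose metrics.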
Essential Dashboards
1. RED Metrics (Rate, Errors, Duration)
These are the most important metrics for any service. For each one I always track:
- Request rate per endpoint
- Error rate (5xx, 4xx)
- Response time distribution (p50, p95, p99)
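The three RED signals above can be precomputed as Prometheus recording rules so dashboards stay fast. This is a minimal sketch, assuming the service exposes a counter named `http_requests_total` with `endpoint` and `status` labels and a histogram named `http_request_duration_seconds`. Those metric and label names are hypothetical and depend on how your application is instrumented.

```yaml
# red_rules.yml (sketch; metric and label names are assumptions)
groups:
  - name: red_metrics
    rules:
      # Rate: requests per second, per endpoint
      - record: endpoint:http_requests:rate5m
        expr: |
          sum by (endpoint) (rate(http_requests_total[5m]))
      # Errors: fraction of requests that returned a 5xx status
      - record: endpoint:http_errors:ratio_rate5m
        expr: |
          sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (endpoint) (rate(http_requests_total[5m]))
      # Duration: p95 latency estimated from histogram buckets
      - record: endpoint:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Swapping `0.95` for `0.5` or `0.99` gives the other percentiles, and changing the `status` matcher to `"4.."` tracks client errors separately.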
2. Infrastructure Health
- CPU, Memory, Disk usage per node
- Network I/O
- Pod status & restarts
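The infrastructure signals above map directly onto metrics from Node Exporter and kube-state-metrics. The alerting rules below are a sketch: the metric names are the ones those exporters publish, but the thresholds, `for` durations, and `severity` values are illustrative assumptions to tune for your own cluster.

```yaml
# infra_rules.yml (sketch; thresholds and durations are assumptions)
groups:
  - name: infrastructure_health
    rules:
      # CPU: share of non-idle CPU time per node
      - alert: NodeHighCpu
        expr: |
          100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 10m
        labels:
          severity: warning
      # Memory: share of memory no longer available
      - alert: NodeHighMemory
        expr: |
          100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90
        for: 10m
        labels:
          severity: warning
      # Disk: share of filesystem space in use, ignoring ephemeral mounts
      - alert: NodeDiskAlmostFull
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 85
        for: 15m
        labels:
          severity: warning
      # Pods: containers that restarted several times in the last hour
      - alert: PodRestartingFrequently
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning
```

Prometheus loads a file like this through the `rule_files` list in `prometheus.yml`, and the `for` clause keeps short spikes from firing alerts.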
3. Business Metrics
Application-specific metrics such as active users, orders, and transactions. These are what stakeholders care about most.
Alerting Strategy
Alerts must be actionable and not noisy:
- Page: PagerDuty for incidents that need an immediate on-call response
- Warning: Slack/Email for issues to investigate
- Info: Dashboard annotations for trend analysis
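One way to express these tiers is an Alertmanager routing tree keyed on a `severity` label. The sketch below assumes alerts are labeled `severity: page`, `warning`, or `info`. The receiver names, Slack channel, and credential values are placeholders, not values from a real setup.

```yaml
# alertmanager.yml (sketch; receiver names and credentials are placeholders)
route:
  receiver: slack-warnings          # fallback for anything unmatched
  group_by: ['alertname', 'namespace']
  group_wait: 30s                   # batch alerts that fire together
  group_interval: 5m
  repeat_interval: 4h               # limit repeat notifications
  routes:
    - matchers:
        - severity="page"
      receiver: pagerduty-oncall
    - matchers:
        - severity="warning"
      receiver: slack-warnings
    - matchers:
        - severity="info"
      receiver: none                # no notification; view on dashboards
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_WITH_PAGERDUTY_INTEGRATION_KEY
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK
        channel: '#alerts'
  - name: none                      # receiver with no integrations
```

Grouping and a long `repeat_interval` do most of the work of keeping alerts quiet, while the `info` route keeps low-priority alerts out of chat entirely.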
Results
With proper monitoring, Mean Time To Detect (MTTD) dropped from 45 minutes to 3 minutes, and Mean Time To Resolve (MTTR) decreased by 60%.