Infrastructure Monitoring with Prometheus and Grafana

Monitoring is the eyes and ears of your infrastructure; without it, you're operating blind. Based on my experience setting up monitoring for production clusters at Badr Interactive, here are the stack, the dashboards, and the alerting strategy I rely on.

The Stack

Prometheus: Metrics collection & alert rule evaluation
Grafana: Dashboards & visualization
Alertmanager: Alert routing & notification
Node Exporter: Host-level metrics (CPU, memory, disk, network)
kube-state-metrics: State of Kubernetes objects (deployments, pods, nodes)

Prometheus Setup

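# prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting and recording rules are evaluated

scrape_configs:
  # One target per cluster node (the address defaults to the kubelet port)
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node

  # One target per declared container port of every pod
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod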
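
The config above only covers scraping. To have Prometheus evaluate rules and hand firing alerts to Alertmanager, the same file usually needs two more sections. This is a minimal sketch: the rules path and the `alertmanager:9093` address are assumptions (9093 is Alertmanager's default port), so adjust both to your deployment.

```yaml
# prometheus.yml (continued)
rule_files:
  # Assumed location of the recording and alerting rule files
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        # Assumed service name; replace with your Alertmanager address
        - targets: ['alertmanager:9093']
```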

Essential Dashboards

1. RED Metrics (Rate, Errors, Duration)

These are the most important metrics for every service. I always track:

Request rate per endpoint
Error rate (5xx, 4xx)
Response time distribution (p50, p95, p99)
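
One possible way to back these three panels is a set of Prometheus recording rules. The sketch below assumes the service exposes a counter named `http_requests_total` with `handler` and `code` labels and a histogram named `http_request_duration_seconds`; those names are illustrative, so substitute whatever your instrumentation library actually emits.

```yaml
# rules/red.yml (hypothetical file name)
groups:
  - name: red-metrics
    rules:
      # Rate: requests per second, per endpoint
      - record: job_handler:http_requests:rate5m
        expr: sum by (job, handler) (rate(http_requests_total[5m]))

      # Errors: share of requests answered with a 5xx status
      # (change the matcher to "4.." to track client errors separately)
      - record: job:http_requests_5xx:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Duration: p95 latency computed from histogram buckets
      - record: job:http_request_duration_seconds:p95_rate5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

Swapping `0.95` for `0.5` or `0.99` gives the other percentiles. Precomputing these keeps Grafana queries cheap when a dashboard covers many services.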

2. Infrastructure Health

CPU, memory, and disk usage per node
Network I/O
Pod status & restarts
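
As a rough sketch, the panels above can be driven by the following recording rules. The metric names are the ones exposed by recent versions of Node Exporter and kube-state-metrics; the rule names, the 5-minute and 1-hour windows, and the filesystem and device filters are my assumptions and may need tuning for your nodes.

```yaml
# rules/infrastructure.yml (hypothetical file name)
groups:
  - name: infrastructure-health
    rules:
      # Fraction of CPU time spent non-idle, per node
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Fraction of memory in use, per node
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

      # Fraction of each filesystem in use (tmpfs and overlay mounts excluded)
      - record: instance:node_filesystem_utilisation:ratio
        expr: |
          1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}

      # Network I/O in bytes per second, loopback excluded
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))
      - record: instance:node_network_transmit_bytes:rate5m
        expr: sum by (instance) (rate(node_network_transmit_bytes_total{device!="lo"}[5m]))

      # Container restarts per pod over the last hour
      - record: namespace_pod:kube_pod_container_status_restarts:increase1h
        expr: sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))
```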

3. Business Metrics

Application-specific metrics such as active users, orders, and transactions. These are what stakeholders care about most.

Alerting Strategy

Alerts must be actionable and not noisy. I route them by severity into three tiers:

Page: PagerDuty for incidents that need the on-call engineer immediately
Warning: Slack/Email for issues to investigate
Info: Dashboard annotations for trend analysis
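
A minimal sketch of how the first two tiers could be wired up, assuming every alert rule carries a `severity` label set to `page` or `warning`. The alert name, the 5% threshold, the 10-minute `for` window, the `http_requests_total` metric, the receiver names, and the Slack channel are all illustrative, and the bracketed values are placeholders for your own credentials.

```yaml
# rules/alerts.yml (hypothetical file name): an alert tagged with its tier
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m          # must hold for 10 minutes, which filters short spikes
        labels:
          severity: page
        annotations:
          summary: 'More than 5% of requests to {{ $labels.job }} are failing'
```

```yaml
# alertmanager.yml: route by the severity label
route:
  receiver: slack-warnings            # fallback when no child route matches
  group_by: ['alertname', 'job']
  routes:
    - matchers:
        - severity="page"
      receiver: pagerduty-oncall
    - matchers:
        - severity="warning"
      receiver: slack-warnings

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
  - name: slack-warnings
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#alerts'
```

The info tier does not go through Alertmanager in this sketch; those events are recorded as annotations on the Grafana dashboards instead.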

Results

With proper monitoring, Mean Time To Detect (MTTD) dropped from 45 minutes to 3 minutes, and Mean Time To Resolve (MTTR) decreased by 60%.