
Infrastructure Monitoring with Prometheus and Grafana

Monitoring is the eyes and ears of infrastructure. Without proper monitoring, you're operating blind. Based on my experience setting up monitoring for production clusters at Badr Interactive, here's a comprehensive guide.

The Stack

  • Prometheus — Metrics collection & alerting
  • Grafana — Dashboard & visualization
  • Alertmanager — Alert routing & notification
  • Node Exporter — Server metrics
  • kube-state-metrics — Kubernetes metrics

Prometheus Setup

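# prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting and recording rules are evaluated

scrape_configs:
  # Discover every node in the cluster through the Kubernetes API
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
  # Discover every pod (one target per declared container port)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod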
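
With `role: pod` alone, Prometheus tries to scrape every pod it discovers. A common refinement is to let pods opt in. The sketch below assumes pods are marked with a `prometheus.io/scrape: "true"` annotation, which is a widely used convention rather than something the config above requires:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true" (assumed convention)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy the namespace and pod name onto every scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The `namespace` and `pod` labels make it possible to filter dashboards by workload later on.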

Essential Dashboards

1. RED Metrics (Rate, Errors, Duration)

These are the most important metrics for every service. I always track:

  • Request rate per endpoint
  • Error rate (5xx, 4xx)
  • Response time distribution (p50, p95, p99)
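
The three RED signals can be precomputed as recording rules so dashboards stay fast. This is a minimal sketch, not the exact rules from my clusters. The metric names `http_requests_total` and `http_request_duration_seconds_bucket`, and the `endpoint` and `status` labels, are assumptions that depend on how your services are instrumented:

```yaml
# red-rules.yml (hypothetical file name, loaded through rule_files in prometheus.yml)
groups:
  - name: red-metrics
    rules:
      # Rate: requests per second, per endpoint
      - record: job_endpoint:http_requests:rate5m
        expr: sum by (job, endpoint) (rate(http_requests_total[5m]))
      # Errors: share of requests answered with a 5xx status
      - record: job:http_requests_5xx:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Duration: 95th percentile latency computed from histogram buckets
      - record: job:http_request_duration_seconds:p95_rate5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

Changing `0.95` to `0.5` or `0.99` gives the p50 and p99 series, and swapping the status matcher to `"4.."` gives the 4xx ratio.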

2. Infrastructure Health

  • CPU, Memory, Disk usage per node
  • Network I/O
  • Pod status & restarts
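
The panels above map to a handful of queries over Node Exporter and kube-state-metrics series. Here is a sketch written as recording rules. The rule names and the filesystem and device filters are illustrative choices, so adjust them to your nodes:

```yaml
groups:
  - name: infrastructure-health
    rules:
      # CPU: share of time each node spends not idle
      - record: instance:node_cpu_utilisation:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      # Memory: share of memory in use on each node
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
      # Disk: share of space used on each filesystem
      - record: instance_mountpoint:node_filesystem_usage:ratio
        expr: |
          1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
      # Network I/O: receive and transmit throughput in bytes per second
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))
      - record: instance:node_network_transmit_bytes:rate5m
        expr: sum by (instance) (rate(node_network_transmit_bytes_total{device!="lo"}[5m]))
      # Pod restarts over the last hour, from kube-state-metrics
      - record: namespace_pod:kube_pod_container_restarts:increase1h
        expr: sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))
```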

3. Business Metrics

Application-specific metrics: active users, orders, transactions — these are what stakeholders care about most.

Alerting Strategy

Alerts must be actionable and not noisy:

  • Page: PagerDuty for incidents that need the on-call engineer
  • Warning: Slack/Email for issues to investigate
  • Info: Dashboard annotations for trend analysis
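
These three tiers can be expressed as Alertmanager routes. The sketch below assumes every alerting rule sets a `severity` label to `page`, `warning`, or `info`. The receiver names, the Slack channel, and the placeholder credentials are hypothetical:

```yaml
# alertmanager.yml
route:
  receiver: slack-warnings            # fallback when no child route matches
  group_by: ['alertname', 'namespace']
  group_wait: 30s                     # wait briefly so related alerts arrive as one notification
  group_interval: 5m
  repeat_interval: 4h                 # re-notify for still-firing alerts at most every 4 hours
  routes:
    - matchers:
        - 'severity="page"'
      receiver: pagerduty-oncall
    - matchers:
        - 'severity="warning"'
      receiver: slack-warnings
    - matchers:
        - 'severity="info"'
      receiver: dashboard-only

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_PAGERDUTY_INTEGRATION_KEY'
  - name: slack-warnings
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'
        channel: '#alerts'
  # No integrations configured: info alerts send no notification
  - name: dashboard-only
```

Grouping and `repeat_interval` do most of the work of keeping alerts from getting noisy. Info alerts still show up in Prometheus, so a dashboard can surface them, for example through a Grafana annotation query on the `ALERTS` series.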

Results

With proper monitoring, Mean Time To Detect (MTTD) dropped from 45 minutes to 3 minutes, and Mean Time To Resolve (MTTR) decreased by 60%.