Prometheus + Alertmanager: Build Production-Ready Monitoring with Custom Alert Rules

Your servers are on fire and you find out from a customer tweet. Sound familiar? That's what happens when you rely on uptime checks instead of real monitoring.

Prometheus is the monitoring system that powers most of the cloud-native world. Paired with Alertmanager, it becomes a complete alerting pipeline that catches problems before your customers do. Here's how to set it up properly.

The Architecture

Prometheus scrapes metrics from your services at regular intervals and stores them as time series data. Alertmanager handles the alerts: routing them to the right channels, grouping related issues, and preventing alert storms.

The flow looks like this: Your apps expose metrics → Prometheus scrapes and evaluates rules → Alertmanager routes notifications → You get paged on Slack/PagerDuty.
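
In practice, "expose metrics" means serving a /metrics endpoint in a format Prometheus understands. Here's a minimal sketch using the official Go client library, assuming a Go service; the metric name, label, and port are illustrative and chosen to match the example scrape config later in this post:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counter of handled HTTP requests, labeled by status code.
// promauto registers it with the default registry automatically.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests handled, by status code.",
	},
	[]string{"status"},
)

func main() {
	// Prometheus scrapes this endpoint on every scrape_interval.
	http.Handle("/metrics", promhttp.Handler())

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues("200").Inc()
		w.Write([]byte("ok"))
	})

	// Port 8080 matches the 'your-app' target in the scrape config below.
	http.ListenAndServe(":8080", nil)
}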

Prometheus Setup

Here's a production-ready Docker Compose configuration:

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "172.17.0.1:9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: always

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "172.17.0.1:9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: always

volumes:
  prometheus_data:
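
Because --web.enable-lifecycle is set, Prometheus exposes a /-/reload endpoint, so after editing any of the mounted config or rule files you can apply the changes without restarting the container:

curl -X POST http://172.17.0.1:9090/-/reload

The address matches the host port binding above; adjust it if you publish the port differently.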

Your prometheus.yml configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'your-app'
    static_configs:
      - targets: ['your-app:8080']
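
The node job assumes a node-exporter container reachable on the same Compose network, which the Compose file above doesn't include. A minimal sketch of a service you could add under services: (mounting the host root read-only so it reports host-level stats rather than the container's):

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'
    restart: always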

Writing Alert Rules

Alert rules are where Prometheus gets powerful. Create alert_rules.yml:

groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"

The for clause prevents flapping: the condition must stay true for that entire duration before the alert fires. Until then the alert shows as pending in the Prometheus UI, and it resets if the condition clears.
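
A syntax error in a rules file will keep Prometheus from loading it, so validate before reloading. The promtool binary ships inside the Prometheus image:

docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
docker exec prometheus promtool check rules /etc/prometheus/alert_rules.yml

Both commands exit non-zero on errors, which also makes them easy to run in CI.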

Alertmanager Configuration

Here's alertmanager.yml with Slack and PagerDuty routing:

global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        severity: critical

This routes critical alerts to PagerDuty (which pages your on-call engineer) while warnings go to Slack. Note that routing_key expects a PagerDuty Events API v2 integration key, matching the pagerduty_url set in the global block.
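
You can validate the file with amtool (bundled in the Alertmanager image) and push a synthetic alert at the v2 API to confirm it lands in the right channel; the label values here are arbitrary test data:

docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

curl -X POST http://172.17.0.1:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"RoutingTest","severity":"warning"},"annotations":{"summary":"Test alert, please ignore"}}]'

With the routing tree above, the warning severity should show up in #alerts; change it to critical to exercise the PagerDuty path.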

PromQL Queries That Matter

Some queries you'll use constantly:

Request rate per second:

rate(http_requests_total[5m])

95th percentile latency:

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Error rate percentage:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

Memory usage percentage:

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
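
These queries slot straight into alert rules. For example, here's a sketch of an error-rate alert you could add to the infrastructure group in alert_rules.yml; it assumes your app labels requests with an HTTP status code, and the 5% threshold is a placeholder to tune:

      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "5xx responses are above 5% of all requests"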

Recording Rules for Performance

If you're querying the same expensive expressions repeatedly (especially in dashboards), use recording rules. Add to your rules file:

groups:
  - name: recording_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by(job) (rate(http_requests_total[5m]))

      - record: job:http_latency:p95
        expr: histogram_quantile(0.95, sum by(job, le) (rate(http_request_duration_seconds_bucket[5m])))

Prometheus evaluates these on its regular schedule and stores the results as new time series, so dashboards query the cheap pre-computed series instead of re-running the expensive expression on every refresh.
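
Recorded series also keep alert expressions cheap and readable. For example, you could alert on the p95 recording rule above; the 0.5-second threshold is an assumption, so tune it to your own latency targets:

      - alert: HighP95Latency
        expr: job:http_latency:p95 > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms for {{ $labels.job }}"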

Troubleshooting

Alerts not firing: Check the Prometheus UI at /alerts. If your rule shows "pending" but never fires, the for duration might be too long or the condition is flapping.

No metrics from targets: Verify targets show "UP" in the Prometheus UI under /targets. Common issues: wrong port, firewall blocking scrapes, or metrics endpoint not exposed.

Alert storms: Tune group_wait and group_interval in Alertmanager. Grouping related alerts prevents notification spam.

High cardinality killing performance: Avoid labels with unbounded values (like user IDs or request IDs). Each unique label combination creates a new time series.
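
To see which metrics are responsible, this query ranks metric names by how many series they hold (it scans every series, so run it ad hoc rather than on a dashboard); the TSDB Status page under Status in the Prometheus UI shows similar numbers:

topk(10, count by (__name__)({__name__=~".+"}))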

Deploy on Elestio

Setting up Prometheus and Alertmanager takes time. Elestio offers managed Prometheus with:

  • Pre-configured Alertmanager integration
  • Grafana dashboards included
  • Automated backups and updates
  • Starting at ~$16/month

You get production-ready monitoring without the ops overhead.

What's Next

Once Prometheus is running, connect Grafana for visualization and add exporters for your specific services. PostgreSQL, Redis, Nginx, and most databases have dedicated exporters that expose detailed metrics. The PromQL skills you build here transfer directly to Grafana queries.

The combination of Prometheus metrics, Alertmanager notifications, and Grafana dashboards gives you complete observability without paying Datadog prices. Most teams save thousands per year while getting better control over their monitoring stack.

Stop finding out about outages from Twitter. Set up real monitoring.

Thanks for reading!