Prometheus + Alertmanager: Build Production-Ready Monitoring with Custom Alert Rules
Your servers are on fire and you find out from a customer tweet. Sound familiar? That's what happens when you rely on uptime checks instead of real monitoring.
Prometheus is the monitoring system that powers most of the cloud-native world. Paired with Alertmanager, it becomes a complete alerting pipeline that catches problems before your customers do. Here's how to set it up properly.
The Architecture
Prometheus scrapes metrics from your services at regular intervals and stores them as time series data. Alertmanager handles the alerts: routing them to the right channels, grouping related issues, and preventing alert storms.
The flow looks like this: Your apps expose metrics → Prometheus scrapes and evaluates rules → Alertmanager routes notifications → You get paged on Slack/PagerDuty.
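Any service that serves a plain-text metrics endpoint in the Prometheus exposition format can be scraped. As a rough illustration (the metric name and labels here are hypothetical, they depend on your instrumentation), a scrape response looks like this:

# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3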
Prometheus Setup
Here's a production-ready Docker Compose configuration:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "172.17.0.1:9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: always

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "172.17.0.1:9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: always

volumes:
  prometheus_data:
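One note: the prometheus.yml below scrapes a node-exporter:9100 target, but the Compose file above doesn't define that service. A minimal sketch you could append under services: (assuming the stock prom/node-exporter image; Prometheus reaches it by service name over the default Compose network, so no host port mapping is needed):

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    # Reports the container's view of the system; mount /proc, /sys and /
    # read-only if you need fully accurate host-level metrics.
    restart: always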
Your prometheus.yml configuration:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'your-app'
    static_configs:
      - targets: ['your-app:8080']
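If you want extra dimensions on a scrape job, static_configs can attach labels that are added to every scraped series and therefore flow into your alerts and routing. A small sketch extending the your-app job (the env label and its value are just examples):

  - job_name: 'your-app'
    static_configs:
      - targets: ['your-app:8080']
        labels:
          env: 'production'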
Writing Alert Rules
Alert rules are where Prometheus gets powerful. Create alert_rules.yml:
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
The for clause prevents flapping: an alert only fires after its condition has been continuously true for that duration, and if the condition clears while the alert is pending, the timer resets.
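Application-level alerts follow the same pattern. Assuming your app exports an http_requests_total counter with a status label (that depends on your instrumentation), a sketch of an error-rate alert you could add to the same rules group:

      - alert: HighErrorRate
        expr: (sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"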
Alertmanager Configuration
Here's alertmanager.yml with Slack and PagerDuty routing:
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: critical
This routes critical alerts to PagerDuty (which pages your on-call engineer) while warnings go to Slack.
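Alertmanager can also suppress lower-severity notifications while a related critical alert is firing, via inhibition rules. A sketch you could add to alertmanager.yml (the equal labels assume your alerts carry alertname and instance labels, as the rules above do):

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']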
PromQL Queries That Matter
Some queries you'll use constantly:
Request rate per second:
rate(http_requests_total[5m])
95th percentile latency:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Error rate percentage:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Memory usage percentage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Recording Rules for Performance
If you're querying the same expensive expressions repeatedly (especially in dashboards), use recording rules. Add to your rules file:
groups:
  - name: recording_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by(job) (rate(http_requests_total[5m]))

      - record: job:http_latency:p95
        expr: histogram_quantile(0.95, sum by(job, le) (rate(http_request_duration_seconds_bucket[5m])))
Prometheus evaluates these expressions on its evaluation_interval and stores the results as new time series, so dashboards query the cheap pre-computed series instead of re-running the expensive expression on every refresh.
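Recorded series behave like any other metric, so you can alert on them too. A sketch of an alert rule on the pre-computed p95 (place it under a rules: group; the 500 ms threshold is illustrative):

      - alert: HighP95Latency
        expr: job:http_latency:p95 > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms for {{ $labels.job }}"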
Troubleshooting
Alerts not firing: Check the Prometheus UI at /alerts. If your rule shows "pending" but never fires, the for duration might be too long or the condition is flapping.
No metrics from targets: Verify targets show "UP" in the Prometheus UI under /targets. Common issues: wrong port, firewall blocking scrapes, or metrics endpoint not exposed.
Alert storms: Tune group_wait and group_interval in Alertmanager. Grouping related alerts prevents notification spam.
High cardinality killing performance: Avoid labels with unbounded values (like user IDs or request IDs). Each unique label combination creates a new time series.
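To see which metrics contribute the most series, a query like this counts series per metric name and keeps the top ten (it is itself expensive on large instances, so run it sparingly):

topk(10, count by (__name__)({__name__=~".+"}))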
Deploy on Elestio
Setting up Prometheus and Alertmanager takes time. Elestio offers managed Prometheus with:
- Pre-configured Alertmanager integration
- Grafana dashboards included
- Automated backups and updates
- Starting at ~$16/month
You get production-ready monitoring without the ops overhead.
What's Next
Once Prometheus is running, connect Grafana for visualization and add exporters for your specific services. PostgreSQL, Redis, Nginx, and most databases have dedicated exporters that expose detailed metrics. The PromQL skills you build here transfer directly to Grafana queries.
The combination of Prometheus metrics, Alertmanager notifications, and Grafana dashboards gives you solid metrics-based observability without paying Datadog prices. Many teams save thousands per year while keeping full control over their monitoring stack.
Stop finding out about outages from Twitter. Set up real monitoring.
Thanks for reading!