InfluxDB + Telegraf: Build a Complete Metrics Pipeline for Your Infrastructure

If you're still using Prometheus for everything, you're probably overcomplicating your life. Don't get me wrong—Prometheus is fantastic for certain workloads. But when you need to store millions of metrics per second with flexible retention policies and actually query them without pulling your hair out? That's where InfluxDB shines.

I've been running InfluxDB + Telegraf stacks for infrastructure monitoring for years now, and the setup is embarrassingly simple once you know the patterns. Let me walk you through building a complete metrics pipeline that'll have you wondering why you didn't do this sooner.

What We're Building

Here's the architecture:

  • Telegraf agents run on your servers, collecting CPU, memory, disk, network, and application metrics
  • InfluxDB stores everything with automatic compression and configurable retention
  • Grafana (optional) visualizes the data

The beauty? Telegraf has 300+ input plugins. You can monitor literally anything—Docker containers, PostgreSQL queries, Redis stats, Kubernetes pods, custom application metrics via StatsD. All feeding into one central time-series database.
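As an example of that plugin model, accepting custom application metrics over StatsD is just one more input block in your Telegraf config (a sketch; the plugin and its options are standard, 8125 is the conventional StatsD port):

```toml
# Listen for StatsD metrics pushed by your applications
[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"   # standard StatsD port
  metric_separator = "."
```

Any StatsD client can then fire counters and timers at it; for a quick smoke test, something like `echo "deploys:1|c" | nc -u -w1 127.0.0.1 8125` works on most systems.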

Prerequisites

  • A running InfluxDB instance (we'll use InfluxDB on Elestio)
  • Target servers where you'll install Telegraf
  • Basic familiarity with YAML/TOML configuration

Step 1: Configure InfluxDB

Once your InfluxDB instance is running, you need to create a bucket and generate an API token. In InfluxDB 2.x, everything is organized around organizations and buckets.

# Open a shell in your InfluxDB container (the influx CLI ships inside it)
docker exec -it influxdb bash

# Create a bucket for your metrics
influx bucket create \
  --name infrastructure \
  --retention 30d \
  --org your-org

# Create an API token for Telegraf
# (--read-bucket/--write-bucket expect the bucket ID, not the name;
# look it up with `influx bucket list`)
influx auth create \
  --org your-org \
  --write-bucket <bucket-id> \
  --read-bucket <bucket-id> \
  --description "Telegraf agent token"

Save that token somewhere safe—you'll need it for every Telegraf agent.
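Before rolling Telegraf out everywhere, it's worth sanity-checking the token with a single write through InfluxDB's v2 HTTP API (a sketch; the hostname and token are placeholders for your own values, and the payload is one point in line protocol):

```shell
# Write one test point; a successful write returns HTTP 204 No Content
curl -i -XPOST "https://your-influxdb.elest.io:8086/api/v2/write?org=your-org&bucket=infrastructure&precision=s" \
  -H "Authorization: Token YOUR_TOKEN_HERE" \
  --data-raw "smoke_test,source=laptop value=1 $(date +%s)"
```

If you get a 401 or 404 back, fix the token or bucket name now—it's much easier to debug here than through a fleet of Telegraf agents.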

Step 2: Install and Configure Telegraf

On each server you want to monitor, install Telegraf:

# Ubuntu/Debian (apt-key is deprecated, so install the key into trusted.gpg.d instead)
curl -sL https://repos.influxdata.com/influxdata-archive_compat.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null
echo "deb https://repos.influxdata.com/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
sudo apt update && sudo apt install telegraf

# Or via Docker
docker run -d --name telegraf \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v $(pwd)/telegraf.conf:/etc/telegraf/telegraf.conf:ro \
  telegraf

Now the fun part—configuration. Here's a production-ready telegraf.conf:

[global_tags]
  environment = "production"
  datacenter = "us-east-1"

[agent]
  interval = "10s"
  round_interval = true
  flush_interval = "10s"
  hostname = ""  # Auto-detect

# OUTPUT: Send to InfluxDB
[[outputs.influxdb_v2]]
  urls = ["https://your-influxdb.elest.io:8086"]
  token = "$INFLUX_TOKEN"
  organization = "your-org"
  bucket = "infrastructure"

# INPUT: System metrics
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false

[[inputs.mem]]
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "overlay"]

[[inputs.diskio]]
[[inputs.net]]
[[inputs.system]]
[[inputs.processes]]

# INPUT: Docker containers (if applicable)
[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  container_names = []
  perdevice = true

Start Telegraf:

sudo systemctl enable telegraf
sudo systemctl start telegraf
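Notice the config reads the token from $INFLUX_TOKEN rather than hard-coding it. On Debian/Ubuntu the packaged systemd unit loads /etc/default/telegraf as an environment file, so one way to supply the token (a sketch; check your unit file if the path differs on your distro):

```shell
# Make the token available to the telegraf service, then restart it
echo 'INFLUX_TOKEN=YOUR_TOKEN_HERE' | sudo tee -a /etc/default/telegraf
sudo systemctl restart telegraf
```

Keeping the token out of telegraf.conf means you can commit the config to version control without leaking credentials.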

Step 3: Query Your Metrics with Flux

InfluxDB 2.x uses Flux—a functional query language that's actually pleasant to work with once you get the hang of it.

Here's how to query the last hour of CPU usage:

from(bucket: "infrastructure")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> filter(fn: (r) => r._field == "usage_idle")
  |> filter(fn: (r) => r.cpu == "cpu-total")
  |> aggregateWindow(every: 1m, fn: mean)

Want to find servers with high memory usage?

from(bucket: "infrastructure")
  |> range(start: -15m)
  |> filter(fn: (r) => r._measurement == "mem")
  |> filter(fn: (r) => r._field == "used_percent")
  |> last()
  |> filter(fn: (r) => r._value > 80)
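Flux also makes "top N" questions easy. For instance, ranking hosts by current memory usage builds on the same query—group() merges every host's series into one table so sort() and limit() apply across all of them:

```flux
from(bucket: "infrastructure")
  |> range(start: -15m)
  |> filter(fn: (r) => r._measurement == "mem")
  |> filter(fn: (r) => r._field == "used_percent")
  |> last()
  |> group()
  |> sort(columns: ["_value"], desc: true)
  |> limit(n: 5)
```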

Step 4: Set Up Retention and Downsampling

Here's the part everyone messes up. You don't need to keep raw 10-second metrics forever. Set up a task to downsample old data:

option task = {name: "downsample_cpu", every: 1h}

from(bucket: "infrastructure")
  |> range(start: -2h, stop: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> aggregateWindow(every: 5m, fn: mean)
  |> to(bucket: "infrastructure_downsampled", org: "your-org")

Create multiple retention policies:

  • Raw data: 7 days
  • 5-minute aggregates: 30 days
  • 1-hour aggregates: 1 year
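The to() destination bucket isn't created automatically, so set up the tiers up front with the same influx CLI from Step 1 (a sketch; bucket names for the aggregate tiers are my own, and bucket update takes the bucket ID):

```shell
# Shrink the raw bucket from Step 1 from 30d to 7d
influx bucket update --id <bucket-id> --retention 7d

# Buckets for the aggregated tiers
influx bucket create --name infrastructure_downsampled --retention 30d --org your-org
influx bucket create --name infrastructure_hourly --retention 365d --org your-org
```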

Troubleshooting

Telegraf not sending data?

telegraf --config /etc/telegraf/telegraf.conf --test

This runs a single collection cycle and shows you exactly what would be sent.

High cardinality warnings?
Check your tags. If you're tagging with user IDs or request IDs, stop. Tags should have low cardinality (server names, environments, regions—not UUIDs).
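The reason tags matter so much: InfluxDB creates one series per unique combination of tag values, so the series count is roughly the product of each tag's distinct-value count. A quick back-of-the-envelope check (the tag counts here are hypothetical):

```python
from math import prod

def series_estimate(tag_cardinalities):
    """Rough series count: product of distinct values per tag key."""
    return prod(tag_cardinalities.values())

# Sane tagging: 50 hosts x 3 regions x 2 environments
good = series_estimate({"host": 50, "region": 3, "env": 2})

# Same tags plus a user_id tag with 100k distinct values
bad = series_estimate({"host": 50, "region": 3, "env": 2, "user_id": 100_000})

print(good)  # 300 series -- fine
print(bad)   # 30000000 series -- a cardinality explosion
```

One innocent-looking tag turns 300 series into 30 million, which is exactly the kind of blow-up those warnings are about.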

Query timeouts?
Tighten your time bounds. Flux requires a range(), but range(start: 0) or a needlessly wide window forces InfluxDB to scan the entire bucket. Query only the window you actually need.

Why Self-Host?

Quick cost comparison: Datadog charges ~$15/host/month for infrastructure monitoring. Running InfluxDB on Elestio? About $20-30/month total, regardless of how many hosts you monitor. At 10 servers, you're saving $120/month. At 50 servers, that's $720/month.

Plus, your metrics never leave your infrastructure. For companies with compliance requirements, that's not optional—it's mandatory.

Wrapping Up

The InfluxDB + Telegraf combo is one of those "why isn't everyone doing this?" solutions. It's simple to set up, scales beautifully, and the Flux query language actually makes complex time-series analysis approachable.

Start with basic system metrics. Once you see how easy it is, you'll be monitoring everything—application latencies, database query times, business metrics. The 300+ Telegraf plugins are there when you need them.

Thanks for reading! See you in the next one.