
Monitoring with Prometheus and Grafana

The problem: I had 6 services running and no idea if any of them were actually up. I found out my Nextcloud was down when my wife couldn't access her photos — 3 days after it crashed. Same week, a disk on my NAS hit 100% and corrupted a database. No alerts. No dashboards. Nothing. So I set up Prometheus and Grafana. Took an afternoon. Now I get a Telegram ping when disk usage crosses 80%, and I can see CPU/RAM/network for every host on one screen.

Key takeaways:
  • Import dashboard ID 1860 from grafana.com instead of building your own — you'll learn PromQL faster by reading working queries than by writing them from zero.
  • Set up Alertmanager on day one, not "later." If nobody gets notified, you're just making pretty graphs.

The Stack Explained

  • Prometheus - Pulls metrics from your services on a schedule and stores them as time-series data.
  • Grafana - The dashboard layer. Connects to Prometheus and turns numbers into graphs.
  • Exporters - Small agents you run on each host. They translate system stats into a format Prometheus understands.
  • Alertmanager - Receives firing alerts from Prometheus and routes them to email, Slack, Telegram, whatever.

Docker Compose Setup

version: '3'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme

  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

This goes in prometheus.yml in the same directory as your compose file:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']

  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']
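
One catch with the docker job: the Docker daemon doesn't serve metrics on 9323 unless you enable that yourself. A minimal sketch of /etc/docker/daemon.json, assuming you're okay binding the endpoint to all interfaces (older Docker releases also want "experimental": true; restart the daemon after editing):

{
  "metrics-addr": "0.0.0.0:9323"
}

On Linux, host.docker.internal also isn't resolvable inside containers by default; adding extra_hosts: ["host.docker.internal:host-gateway"] to the prometheus service covers that, or just drop this job until you need it.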

Start Everything

docker compose up -d

Give it a minute on the first run while the images pull. Then check:

Prometheus UI: http://localhost:9090
Grafana: http://localhost:3000
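
Worth a quick sanity check that Prometheus can actually reach its targets before going further. Status → Targets in the Prometheus UI shows this, or you can ask the HTTP API directly (this assumes you have jq installed):

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

Every job should report health: up. Anything showing down at this point is usually a typo in prometheus.yml, or the docker job if you skipped the daemon metrics step above.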

Add Prometheus to Grafana

  1. Log into Grafana with admin/changeme
  2. Go to Connections → Data sources → Add data source (older Grafana versions call it Configuration → Data Sources)
  3. Pick Prometheus from the list
  4. For the URL, enter http://prometheus:9090 (not localhost — containers talk to each other by service name)
  5. Hit Save & Test. You should see a green checkmark.
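
If you rebuild this stack often (or just hate clicking through UIs), Grafana can also pick up the data source automatically at startup via its provisioning mechanism. A sketch, assuming you save this as grafana-datasource.yml next to your compose file and mount it at /etc/grafana/provisioning/datasources/prometheus.yml inside the grafana container:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

The matching volume line would be - ./grafana-datasource.yml:/etc/grafana/provisioning/datasources/prometheus.yml. The manual route is fine for a one-off setup; provisioning pays off when you tear things down and rebuild.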

Import Dashboards

Building dashboards from scratch is a waste of time when you're starting out. Hundreds of good ones already exist on grafana.com. Here's how to grab one:

  1. Dashboards → Import
  2. Type in a dashboard ID (1860 is the classic Node Exporter Full)
  3. Point it at your Prometheus data source
  4. Import. Done.

Dashboard IDs I actually use:

  • 1860 - Node Exporter Full
  • 893 - Docker and System Monitoring
  • 13946 - Container metrics

PromQL Basics

PromQL is Prometheus's query language. It's weird at first — the syntax doesn't look like SQL or anything else you've used. But these four queries cover 90% of what a homelab needs:

# Current CPU usage per host
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk space used
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Request rate
rate(http_requests_total[5m])
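
One more that's worth knowing, given how this whole project started: Prometheus records an up metric for every scrape target, so the quickest "is anything down right now?" check is just:

# Targets that failed their last scrape
up == 0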

Alerting

Alerts are the entire point. Without them you're just looking at graphs after something already broke. Create an alert_rules.yml file:

# alert_rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

Alertmanager Setup

Alertmanager is a separate container. Add this block to your docker-compose.yml:

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

Then tell it where to send notifications. This goes in alertmanager.yml:

route:
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: 'user'
        auth_password: 'password'
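
Two wiring steps that neither file above covers: Prometheus has to be told to load the alert rules, and it has to know where Alertmanager lives. A sketch, assuming you mount the rules file into the Prometheus container (add - ./alert_rules.yml:/etc/prometheus/alert_rules.yml to its volumes) and then extend prometheus.yml with:

rule_files:
  - /etc/prometheus/alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Restart the Prometheus container after that. The rules show up under Status → Rules, and anything currently firing appears on the Alerts page of the Prometheus UI.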

Common Exporters

  • node_exporter - The big one. CPU, memory, disk, network. Install this first on every host.
  • cadvisor - Container-level stats. Pairs well with node_exporter if you run Docker.
  • blackbox_exporter - Pings URLs, checks DNS, tests TCP ports. Good for "is my site up?" checks (scrape config sketch after this list).
  • postgres_exporter - Pulls query stats and connection counts from PostgreSQL
  • mysqld_exporter - Same idea but for MySQL/MariaDB
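
blackbox_exporter is the odd one out configuration-wise: Prometheus scrapes the exporter itself and passes the real target as a URL parameter, which takes a small relabeling dance in the scrape config. A sketch, assuming the exporter runs as a service named blackbox_exporter on its default port 9115, you use the stock http_2xx module, and https://nextcloud.example.com stands in for whatever you actually want probed:

  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://nextcloud.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115

The probe_success metric it produces is the one to alert on for "my site is down."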

Retention and Storage

By default Prometheus only keeps 15 days of data. That's not enough for spotting long-term trends. Add these flags to your Prometheus command:

--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=10GB
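
In the compose file above, these land in the prometheus service's command block:

  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=30d'
    - '--storage.tsdb.retention.size=10GB'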

If you need months or years of history, look into Thanos or Cortex. For a homelab, 30-90 days in plain Prometheus is usually fine.

Best Practices

  • node_exporter first, everything else second. It gives you CPU, RAM, disk, and network in one binary.
  • Dashboard 1860 on grafana.com. Import it. Tweak it later. Don't start from a blank canvas.
  • If you don't configure alerts, you're just collecting data nobody looks at. Set up Alertmanager the same day you set up Prometheus.
  • Bump retention to 30d minimum. The default 15 days disappears fast when you're trying to figure out why your server was slow last week.

My stack:
  • Prometheus — 8 hosts, 15s scrape interval
  • Grafana — dashboard 1860 + a custom one for Docker containers
  • Alertmanager — routes to Telegram bot
  • node_exporter on every host, cadvisor on the Docker boxes
  • Resource usage: ~800MB RAM, ~2GB disk per month of retention

What I'd change next time

  • Alertmanager from the start. I ran Prometheus for two weeks before adding alerts. During those two weeks I still missed a disk filling up. The dashboards existed. I just didn't look at them. Should have set up Telegram notifications on day one.
  • Blackbox exporter sooner. I was monitoring system metrics but not whether my actual services were responding. Adding HTTP endpoint checks with blackbox_exporter caught a hung Nextcloud process that looked healthy in every other metric.
  • Separate Grafana credentials. I left the default admin/changeme password for a month. Not great. Should have changed it immediately or set up OAuth through my reverse proxy.
