
Monitoring with Prometheus and Grafana

The problem: I had 6 services running and no idea if any of them were actually up. I found out my Nextcloud was down when my wife couldn't access her photos — 3 days after it crashed. Same week, a disk on my NAS hit 100% and corrupted a database. No alerts. No dashboards. Nothing. So I set up Prometheus and Grafana. Took an afternoon. Now I get a Telegram ping when disk usage crosses 80%, and I can see CPU/RAM/network for every host on one screen.

Key takeaways:
  • Import dashboard ID 1860 from grafana.com instead of building your own — you'll learn PromQL faster by reading working queries than by writing them from zero.
  • Set up Alertmanager on day one, not "later." If nobody gets notified, you're just making pretty graphs.

The Stack Explained

  • Prometheus - Pulls metrics from your services on a schedule and stores them as time-series data.
  • Grafana - The dashboard layer. Connects to Prometheus and turns numbers into graphs.
  • Exporters - Small agents you run on each host. They translate system stats into a format Prometheus understands.
  • Alertmanager - Receives firing alerts from Prometheus and routes them to email, Slack, Telegram, whatever.

Docker Compose Setup

version: '3'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme

  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

This goes in prometheus.yml in the same directory as your compose file:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']

  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']
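
One catch with the docker job: the Docker daemon doesn't serve metrics on 9323 unless you enable that yourself. A minimal sketch of /etc/docker/daemon.json, assuming you're okay binding the endpoint to all interfaces (older Docker releases also want "experimental": true; restart the daemon after editing):

{
  "metrics-addr": "0.0.0.0:9323"
}

On Linux, host.docker.internal also isn't resolvable inside containers by default; adding extra_hosts: ["host.docker.internal:host-gateway"] to the prometheus service covers that, or just drop this job until you need it.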

Start Everything

docker compose up -d

Give it a minute on the first run while the images pull. Then check:

Prometheus UI: http://localhost:9090
Grafana: http://localhost:3000
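
Worth a quick sanity check that Prometheus can actually reach its targets before going further. Status → Targets in the Prometheus UI shows this, or you can ask the HTTP API directly (this assumes you have jq installed):

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

Every job should report health: up. Anything showing down at this point is usually a typo in prometheus.yml, or the docker job if you skipped the daemon metrics step above.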

Add Prometheus to Grafana

  1. Log into Grafana with admin/changeme
  2. Go to Connections → Data sources → Add data source (older Grafana versions call it Configuration → Data Sources)
  3. Pick Prometheus from the list
  4. For the URL, enter http://prometheus:9090 (not localhost — containers talk to each other by service name)
  5. Hit Save & Test. You should see a green checkmark.
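
If you rebuild this stack often (or just hate clicking through UIs), Grafana can also pick up the data source automatically at startup via its provisioning mechanism. A sketch, assuming you save this as grafana-datasource.yml next to your compose file and mount it at /etc/grafana/provisioning/datasources/prometheus.yml inside the grafana container:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

The matching volume line would be - ./grafana-datasource.yml:/etc/grafana/provisioning/datasources/prometheus.yml. The manual route is fine for a one-off setup; provisioning pays off when you tear things down and rebuild.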

Import Dashboards

Building dashboards from scratch is a waste of time when you're starting out. Hundreds of good ones already exist on grafana.com. Here's how to grab one:

  1. Dashboards → Import
  2. Type in a dashboard ID (1860 is the classic Node Exporter Full)
  3. Point it at your Prometheus data source
  4. Import. Done.

Dashboard IDs I actually use:

  • 1860 - Node Exporter Full
  • 893 - Docker and System Monitoring
  • 13946 - Container metrics

PromQL Basics

PromQL is Prometheus's query language. It's weird at first — the syntax doesn't look like SQL or anything else you've used. But these four queries cover 90% of what a homelab needs:

# Current CPU usage per host
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk space used
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Request rate
rate(http_requests_total[5m])
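
One more that's worth knowing, given how this whole project started: Prometheus records an up metric for every scrape target, so the quickest "is anything down right now?" check is just:

# Targets that failed their last scrape
up == 0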

Alerting

Alerts are the entire point. Without them you're just looking at graphs after something already broke. Create an alert_rules.yml file:

# alert_rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

Alertmanager Setup

Alertmanager is a separate container. Add this block to your docker-compose.yml:

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

Then tell it where to send notifications. This goes in alertmanager.yml:

route:
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: 'user'
        auth_password: 'password'
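
Two wiring steps that neither file above covers: Prometheus has to be told to load the alert rules, and it has to know where Alertmanager lives. A sketch, assuming you mount the rules file into the Prometheus container (add - ./alert_rules.yml:/etc/prometheus/alert_rules.yml to its volumes) and then extend prometheus.yml with:

rule_files:
  - /etc/prometheus/alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Restart the Prometheus container after that. The rules show up under Status → Rules, and anything currently firing appears on the Alerts page of the Prometheus UI.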

Common Exporters

  • node_exporter - The big one. CPU, memory, disk, network. Install this first on every host.
  • cadvisor - Container-level stats. Pairs well with node_exporter if you run Docker.
  • blackbox_exporter - Pings URLs, checks DNS, tests TCP ports. Good for "is my site up?" checks (scrape config sketch after this list).
  • postgres_exporter - Pulls query stats and connection counts from PostgreSQL
  • mysqld_exporter - Same idea but for MySQL/MariaDB
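
blackbox_exporter is the odd one out configuration-wise: Prometheus scrapes the exporter itself and passes the real target as a URL parameter, which takes a small relabeling dance in the scrape config. A sketch, assuming the exporter runs as a service named blackbox_exporter on its default port 9115, you use the stock http_2xx module, and https://nextcloud.example.com stands in for whatever you actually want probed:

  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://nextcloud.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115

The probe_success metric it produces is the one to alert on for "my site is down."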

Retention and Storage

By default Prometheus only keeps 15 days of data. That's not enough for spotting long-term trends. Add these flags to your Prometheus command:

--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=10GB
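
In the compose file above, these land in the prometheus service's command block:

  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=30d'
    - '--storage.tsdb.retention.size=10GB'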

If you need months or years of history, look into Thanos or Cortex. For a homelab, 30-90 days in plain Prometheus is usually fine.

Best Practices

  • node_exporter first, everything else second. It gives you CPU, RAM, disk, and network in one binary.
  • Dashboard 1860 on grafana.com. Import it. Tweak it later. Don't start from a blank canvas.
  • If you don't configure alerts, you're just collecting data nobody looks at. Set up Alertmanager the same day you set up Prometheus.
  • Bump retention to 30d minimum. The default 15 days disappears fast when you're trying to figure out why your server was slow last week.

My stack:
  • Prometheus — 8 hosts, 15s scrape interval
  • Grafana — dashboard 1860 + a custom one for Docker containers
  • Alertmanager — routes to Telegram bot
  • node_exporter on every host, cadvisor on the Docker boxes
  • Resource usage: ~800MB RAM, ~2GB disk per month of retention

What I'd change next time

  • Alertmanager from the start. I ran Prometheus for two weeks before adding alerts. During those two weeks I still missed a disk filling up. The dashboards existed. I just didn't look at them. Should have set up Telegram notifications on day one.
  • Blackbox exporter sooner. I was monitoring system metrics but not whether my actual services were responding. Adding HTTP endpoint checks with blackbox_exporter caught a hung Nextcloud process that looked healthy in every other metric.
  • Separate Grafana credentials. I left the default admin/changeme password for a month. Not great. Should have changed it immediately or set up OAuth through my reverse proxy.
