
Monitoring (Prometheus + Loki + Grafana)

Stack: stack/itop/ · Host: phil-app · Updated: 2026-03-01

Full observability stack: metrics (Prometheus), logs (Loki + Alloy), dashboards (Grafana), alerting (Alertmanager → Matrix), uptime monitoring (Uptime Kuma), and container metrics (cAdvisor).

Overview

Alerts route to Matrix via matrix-alertmanager. Uptime Kuma provides push heartbeat monitors for backups. The Pushgateway collects borgmatic metrics. The blackbox exporter probes HTTPS, SMTP, and IMAP endpoints.

Architecture

Containers (stack/itop/)

  • Prometheus — scrapes all exporters; 90-day retention; --query.max-concurrency=10
  • Loki — log storage; 14-day retention; TSDB v13; 30 GiB XFS quota
  • Grafana — dashboards at grafana.philipp.info:8993
  • Alloy — log collector on both servers (replaces Promtail/Grafana Agent)
  • Alertmanager — routes alerts to Matrix (matrix-alertmanager sidecar)
  • Uptime Kuma — push heartbeat monitors; accessible at 10.42.10.4:3001 (WireGuard)
  • cAdvisor — container metrics (uses containerd socket under userns-remap)
  • Blackbox exporter — HTTPS/SMTP/IMAP probes
  • Pushgateway — receives borgmatic metric pushes; persists to volume
  • node-exporter — host metrics (both servers via Alloy)

Resource limits (tuned Feb 2026):

  • Loki: 6G / 2.0 CPU; querier.max_concurrent: 16, max_outstanding: 256
  • Prometheus: 4G; --storage.tsdb.retention.time=90d
  • cAdvisor: 2G (was 1G; OOM-killed at ~680 MB)
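As a sketch, these limits map onto Compose settings along the following lines; service names and exact keys here are assumptions, not quoted from the actual stack files:

```yaml
# Hypothetical excerpt of stack/itop/docker-compose.yml showing the tuned limits.
services:
  loki:
    mem_limit: 6g
    cpus: 2.0
  prometheus:
    mem_limit: 4g
    command:
      - --storage.tsdb.retention.time=90d
      - --query.max-concurrency=10
  cadvisor:
    mem_limit: 2g   # raised from 1g after OOM kills at ~680 MB
```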

Loki Label Schema

All log sources use a unified label set (target: ~80-100 active streams):

| Label | Meaning | Example values |
| --- | --- | --- |
| server | Host (external label) | phil-app, phil-db |
| app | Application | friendica, nextcloud, mariadb, host |
| component | Subtype within app | php-fpm, reverse-proxy, journald, slowlog |
| tier | Category | app, infra, db, obs, mail |
| job | Source type | docker-web, docker-app, systemd-journal |
| project | Compose project | friendica, matrix, itop |
| level | Severity (pipeline-derived) | error, warn, info, debug |

Labels intentionally dropped: env, team, service, container, hostname, filename, config, systemd_unit, unit, slice, source, format.

service_name is auto-injected by Alloy's OTEL layer and kept (Grafana Logs default grouping).

Prometheus Alert Rules

| Rule file | Alerts |
| --- | --- |
| uptime.rules.yml | UptimeFailed (status==0), BackupMissedHeartbeat (push status==2, for 30m), InstanceDown |
| cpu.rules.yml | HostHighCpuLoad, HostCpuHighIowait |
| memory.rules.yml | HostOutOfMemory, HostOomKillDetected, HostMemoryUnderPressure |
| filesystem.rules.yml | DiskSpaceFree{20,10}Percent, XFSQuota{20,10}Percent, DeviceError, ReadOnly |
| io.rules.yml | HostUnusualDiskIo, HighWriteLatency |
| traefik.rules.yml | TraefikServiceDown |
| phpfpm.rules.yml | Capacity, AllWorkersBusy, SlowRequests, ListenQueue, MaxChildren |
| mysql.rules.yml | MysqlDown, Restarted, TooManyConnections, HighThreadsRunning |
| friendica.rules.yml | FriendicaExporterUnavailable, WorkerStale, WorkerBacklogMonotonicIncrease |
| containers.rules.yml | ContainerRestartLoop (>3 in 30min), ContainerOOMKilled |
| borgmatic.rules.yml | BorgmaticMetricsMissing (absent metric for 30m) |
| certificates.rules.yml | TLSCertExpiry{Warning,Critical} (<14d/<3d) |
| smartctl.rules.yml | SmartCriticalWarning, MediaErrors (nvme0n1/nvme1n1 excluded) |
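As an illustration of the rule style, the certificate-expiry pair can be expressed against the blackbox exporter's probe_ssl_earliest_cert_expiry metric. Thresholds match the table above; the hold durations and label values are assumptions, and the actual certificates.rules.yml may differ:

```yaml
groups:
  - name: certificates
    rules:
      - alert: TLSCertExpiryWarning
        # fires when the soonest-expiring cert in the chain is <14 days out
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels: {severity: warning}
      - alert: TLSCertExpiryCritical
        # <3 days out
        expr: probe_ssl_earliest_cert_expiry - time() < 3 * 24 * 3600
        for: 15m
        labels: {severity: critical}
```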

Loki alert rules (config/loki/rules/): backupFailed (phil-db borg-backup systemd), borgmaticBackupFailed, SshLoginFailed.

Blackbox Exporter

Probes:

  • HTTPS: opensocial.at, friendica.me, cloud.philipp.info, git.opensocial.at, mail.philipp.info
  • SMTP: mailcow-postfix:587 (internal alias, smtp_banner module, preferred_ip_protocol: ip4)
  • IMAP: mailcow-dovecot:993 (internal alias, imap_tls module, tls_config.server_name: mail.philipp.info)

SMTP/IMAP probes target internal container aliases (not the public IP) because Shorewall blocks Docker→Host traffic on non-443 ports. HTTP probes resolve via CoreDNS (172.21.0.53) so requests route through Traefik correctly.
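A minimal sketch of what the smtp_banner and imap_tls modules might look like in the blackbox exporter config; the SMTP query_response script is illustrative, only the options named above are taken from this document:

```yaml
modules:
  smtp_banner:
    prober: tcp
    tcp:
      preferred_ip_protocol: ip4
      query_response:
        - expect: "^220 "        # SMTP greeting banner
        - send: "QUIT"
  imap_tls:
    prober: tcp
    tcp:
      tls: true
      tls_config:
        # SNI must match the cert even though the target is an internal alias
        server_name: mail.philipp.info
```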

Loki Storage

  • Volume itop_loki with 30 GiB XFS quota (project ID 42)
  • Schema: TSDB v13 (migrated from boltdb-shipper v11, Feb 2026)
  • Retention: 14 days (limits_config.retention_period: 336h)
  • Compactor: runs every 5m
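These settings correspond to a Loki config fragment roughly like the following; the schema boundary date and object store are illustrative assumptions (newer Loki releases additionally require a delete store for compactor-driven retention):

```yaml
schema_config:
  configs:
    - from: "2026-02-01"      # TSDB migration boundary (illustrative date)
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
limits_config:
  retention_period: 336h      # 14 days
compactor:
  retention_enabled: true
  compaction_interval: 5m
```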

Pushgateway

  • Borgmatic pushes borgmatic_last_backup_timestamp_seconds + borgmatic_repo_size_bytes after each create run
  • Persistence: --persistence.file=/data/metrics.db (volume itop_pushgateway) — survives container restarts
  • BorgmaticMetricsMissing fires when all borgmatic metrics are absent for >30m (signals pushgateway restart, not backup failure)
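For reference, a push in the Pushgateway text format looks roughly like this; the repo label and size value are illustrative placeholders, not taken from the actual borgmatic hook:

```shell
# Build the metric payload borgmatic would push after a `create` run.
payload=$(printf 'borgmatic_last_backup_timestamp_seconds{repo="main"} %s\nborgmatic_repo_size_bytes{repo="main"} 1234567\n' "$(date +%s)")
echo "$payload"
# Actual push (requires a reachable pushgateway):
# echo "$payload" | curl --silent --data-binary @- http://pushgateway:9091/metrics/job/borgmatic
```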

Operations

Backup

The itop stack data (Prometheus TSDB, Grafana dashboards, Alertmanager state, Pushgateway persistence) is backed up by Borgmatic. The loki volume is intentionally NOT backed up (logs are ephemeral; 14-day retention is sufficient; backup of 30 GiB log data is not justified).

Prometheus TSDB can be rebuilt from scrape targets if lost. Grafana dashboard JSON is committed to git (stack/itop/config/grafana/provisioning/).

Reload Prometheus rules

cd /opt/docker/itop
sudo docker compose exec prometheus kill -HUP 1

Loki quota full — manual cleanup

sudo docker compose stop loki
sudo find /var/lib/docker/165536.165536/volumes/loki/_data/chunks/ -type f -mtime +14 -delete
sudo docker compose up -d loki

Alert Tuning for Backup Windows

Several alerts are tuned to avoid false positives during nightly backup operations (mariabackup + borg on phil-db, ~10 min duration):

| Alert | for duration | Rate window | Rationale |
| --- | --- | --- | --- |
| HostUnusualDiskIo | 15m | 5m | NVMe I/O saturation during mariabackup is transient |
| HostCpuHighIowait | 15m | 5m | iowait spikes from mariabackup + borg compression |
| DiskSpaceFree20Percent | 30m | n/a | /var on phil-db fills temporarily during mariabackup staging |
| SmartCriticalWarning | 1m | n/a | nvme0n1/nvme1n1 excluded (known EOL warning) |
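For example, the disk-I/O rule combines the 5m rate window with a 15m hold so that a ~10 min backup cannot complete a full alert cycle. The threshold here is an assumed placeholder, not the stack's actual value:

```yaml
- alert: HostUnusualDiskIo
  # 5m window smooths short mariabackup bursts; for: 15m outlasts the ~10 min backup
  expr: rate(node_disk_io_time_seconds_total[5m]) > 0.5
  for: 15m
  labels: {severity: warning}
```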

Pitfalls

phil-db Alloy cannot reach Loki via HTTPS (port 443)

Shorewall on phil-app allows port 443 only from the net zone, not from the wg (WireGuard) zone, so Alloy on phil-db is blocked when it connects to port 443 over WireGuard.

Fix: phil-db uses http://loki:3100 (direct Loki API, port 3100 allowed from wg zone). /etc/hosts maps loki to 10.42.10.4. WireGuard provides encryption, so HTTP is sufficient.

Loki label cardinality — auto-generated labels

Alloy and Docker auto-generate labels (hostname, filename, config) that create streams without being used in queries. Strip via stage.label_drop in all pipelines.

Alloy self-telemetry logs as service_name=unknown_service

Without logging { write_to = [...] }, Alloy's internal logs land in Loki with only server + service_name=unknown_service.

Fix: Route internal logs through loki.process "alloy_internal" (adds app=alloy, component=agent, tier=obs).

Alloy v1.13 duplicates all log entries as service_name=unknown_service

Alloy v1.13.x's loki.write OTEL compatibility layer creates a duplicate stream for every log entry (same timestamp and content, but carrying only the server label plus service_name=unknown_service). Rate: ~143 entries/sec (~500 MB/day). This is a pre-existing upstream Alloy bug. Dashboards are unaffected because their queries always filter by app/component/job. No workaround; monitor for an upstream fix.

cAdvisor container discovery broken with Docker userns-remap

With userns-remap, cAdvisor's Docker factory returns blank DockerVersion and discovers 0 containers.

Fix (2026-02-28): mount both sockets and pass the matching flags:

  1. Mount /run/containerd/containerd.sock (for discovery)
  2. Mount /var/run/docker.sock (for label enrichment)
  3. Pass --containerd=/run/containerd/containerd.sock --containerd-namespace=moby (dash, not underscore)
  4. Pass --docker=unix:///var/run/docker.sock --docker_only=true --store_container_labels=true
  5. Raise the memory limit to 2G (was 1G; OOM at ~680 MB with 100+ containers)

Pitfalls: --containerd_namespace (underscore) = crash; --allowlisted_container_labels invalid in v0.55.1; without Docker socket, no container labels.
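Putting the fix together, the cAdvisor service definition would look roughly like this sketch; the image tag and read-only mounts are assumptions beyond what the fix steps state:

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.55.1
  mem_limit: 2g
  command:
    - --containerd=/run/containerd/containerd.sock
    - --containerd-namespace=moby        # dash, not underscore: underscore crashes
    - --docker=unix:///var/run/docker.sock
    - --docker_only=true
    - --store_container_labels=true
  volumes:
    - /run/containerd/containerd.sock:/run/containerd/containerd.sock:ro  # discovery
    - /var/run/docker.sock:/var/run/docker.sock:ro                        # label enrichment
```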

cgroup OOM kills invisible in dmesg and node_vmstat_oom_kill

Docker container OOM kills are at the cgroup level — not the global kernel OOM killer. They appear in journalctl -k | grep -iE 'oom|memory cgroup', not in dmesg. HostOomKillDetected (based on node_vmstat_oom_kill) will NOT fire for container OOM kills.

node-exporter: --collector.disable-defaults silently drops vmstat

When using --collector.disable-defaults, --collector.vmstat must be listed explicitly. Without it, node_vmstat_oom_kill is absent.
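Sketch of the pattern; apart from the vmstat line, the collector set shown here is an example, not the stack's actual flag list:

```yaml
command:
  - --collector.disable-defaults
  - --collector.cpu
  - --collector.meminfo
  - --collector.filesystem
  - --collector.diskstats
  - --collector.vmstat     # required, or node_vmstat_oom_kill disappears
```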

ContainerOOMKilled alert has no container labels

container_oom_events_total from cAdvisor only has instance and job labels under userns-remap — no container name. When the alert fires, check:

sudo journalctl -k --since '30 min ago' | grep -iE 'oom|out of memory|memory cgroup'

Uptime-Kuma push monitors use status=2 (PENDING) for missed heartbeats

Push monitors go UP (1) → PENDING (2) when the heartbeat window expires, and only reach DOWN (0) after the first scheduled check past the deadline, so the UptimeFailed rule (status == 0) misses most missed heartbeats.

Fix: Separate BackupMissedHeartbeat rule targeting monitor_status{monitor_type="push"} == 2 with for: 30m.
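The resulting rule is small; this sketch assumes the monitor_status metric and monitor_type label exposed by Uptime Kuma's Prometheus endpoint, with an assumed severity label:

```yaml
- alert: BackupMissedHeartbeat
  # status 2 = PENDING: heartbeat window expired but not yet marked DOWN
  expr: monitor_status{monitor_type="push"} == 2
  for: 30m
  labels: {severity: critical}
```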

Grafana noDataState: Alerting fires falsely when pushgateway restarts

Setting noDataState: Alerting on Grafana-managed rules causes them to fire immediately when the underlying metric disappears (e.g., after pushgateway restart) — even if the last backup was only hours ago.

Fix: Use Prometheus rules instead of Grafana-managed rules for backup staleness. In Prometheus, absent series produce no alert by default; use absent() with for: 30m explicitly.
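A sketch of the absent()-based staleness rule; annotation wording is illustrative:

```yaml
- alert: BorgmaticMetricsMissing
  # absent() returns 1 only when no series with this name exists at all
  expr: absent(borgmatic_last_backup_timestamp_seconds)
  for: 30m
  annotations:
    summary: "No borgmatic metrics for 30m (likely a pushgateway restart, not a failed backup)"
```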

Pushgateway wipes all metrics on restart (without persistence)

Default Pushgateway stores metrics in memory only. Restart → metrics gone → gap until next backup run.

Fix: --persistence.file=/data/metrics.db + named volume itop_pushgateway.

Blackbox HTTP probes — Docker DNS vs CoreDNS

Docker's embedded DNS (127.0.0.11) intercepts queries for Docker-known hostnames before forwarding to configured DNS. Setting dns: 172.21.0.53 alone does NOT help — Docker DNS answers first.

Fix: Pin hostnames to Traefik ingress IP via extra_hosts in docker-compose: mail.philipp.info:172.21.0.100. Or use CoreDNS-resolved hostnames for probes.
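In docker-compose terms the pin is a one-liner on the probing container; the service name here is an assumption:

```yaml
blackbox-exporter:
  extra_hosts:
    # bypass Docker's embedded DNS: pin the probe hostname to the Traefik ingress IP
    - "mail.philipp.info:172.21.0.100"
```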

Blackbox probe target hostnames must match Traefik router rules

Using git.philipp.info as a blackbox target fails if Traefik's router is configured for Host('git.opensocial.at') — Traefik returns TLS SNI error "unrecognized name". Always use the hostname from the Traefik router rule label.

MariaDB log_queries_not_using_indexes floods Loki

log_queries_not_using_indexes=1 logs every query without an index regardless of query time — including fast full scans of small tables. Generated ~21k log lines/sec. Set to 0 on phil-db.

Loki fills XFS quota

When Loki retention is misconfigured or the compactor is stopped, log storage grows past the 30 GiB XFS quota. XFSQuotaFree10Percent fires. Manual cleanup required (see Operations above). Key config requirements: compactor.retention_enabled: true, limits_config.retention_period set to 14d.

SmartCriticalWarning false positive (phil-db NVMe)

Both NVMe SSDs on phil-db report SMART Critical Warning 0x04 (volatile memory backup degraded) — expected at EOL per Hetzner. Alert rule excludes nvme0n1 and nvme1n1. Remove exclusion if disks are replaced.

Prometheus drops out-of-order samples from cAdvisor

~1500 "Error on ingesting out-of-order samples" per scrape — known cAdvisor issue after container restarts. No data loss for alerts. Not critical.