
Monitoring (Prometheus + Loki + Grafana)

Stack: stack/itop/ · Host: phil-app · Updated: 2026-03-01

Full observability stack: metrics (Prometheus), logs (Loki + Alloy), dashboards (Grafana), alerting (Alertmanager → Matrix), uptime monitoring (Uptime Kuma), and container metrics (cAdvisor).

Overview

Alerts route to Matrix via matrix-alertmanager. Uptime Kuma provides push heartbeat monitors for backups. The Pushgateway collects borgmatic metrics. The blackbox exporter probes HTTPS, SMTP, and IMAP endpoints.

Architecture

Containers (stack/itop/)

  • Prometheus — scrapes all exporters; 90-day retention; --query.max-concurrency=10
  • Loki — log storage; 14-day retention; TSDB v13; 30 GiB XFS quota
  • Grafana — dashboards at grafana.philipp.info:8993
  • Alloy — log collector on both servers (replaces Promtail/Grafana Agent)
  • Alertmanager — routes alerts to Matrix (matrix-alertmanager sidecar)
  • Uptime Kuma — push heartbeat monitors; accessible at 10.42.10.4:3001 (WireGuard)
  • cAdvisor — container metrics (uses containerd socket under userns-remap)
  • Blackbox exporter — HTTPS/SMTP/IMAP probes
  • Pushgateway — receives borgmatic metric pushes; persists to volume
  • node-exporter — host metrics (both servers via Alloy)

Resource limits (tuned Feb 2026):

  • Loki: 6G / 2.0 CPU; querier.max_concurrent: 16, max_outstanding: 256
  • Prometheus: 4G; --storage.tsdb.retention.time=90d
  • cAdvisor: 2G (was 1G; OOM-killed at ~680 MB)
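As a sketch, these limits map onto Compose settings along the following lines; service names and exact keys here are assumptions, not quoted from the actual stack files:

```yaml
# Hypothetical excerpt of stack/itop/docker-compose.yml showing the tuned limits.
services:
  loki:
    mem_limit: 6g
    cpus: 2.0
  prometheus:
    mem_limit: 4g
    command:
      - --storage.tsdb.retention.time=90d
      - --query.max-concurrency=10
  cadvisor:
    mem_limit: 2g   # raised from 1g after OOM kills at ~680 MB
```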

Loki Label Schema

All log sources use a unified label set (target: ~80-100 active streams):

| Label | Meaning | Example values |
| --- | --- | --- |
| server | Host (external label) | phil-app, phil-db |
| app | Application | friendica, nextcloud, mariadb, host |
| component | Subtype within app | php-fpm, reverse-proxy, journald, slowlog |
| tier | Category | app, infra, db, obs, mail |
| job | Source type | docker-web, docker-app, systemd-journal |
| project | Compose project | friendica, matrix, itop |
| level | Severity (pipeline-derived) | error, warn, info, debug |

Labels intentionally dropped: env, team, service, container, hostname, filename, config, systemd_unit, unit, slice, source, format.

service_name is auto-injected by Alloy's OTEL layer and kept (Grafana Logs default grouping).

Prometheus Alert Rules

| Rule file | Alerts |
| --- | --- |
| uptime.rules.yml | UptimeFailed (status==0), BackupMissedHeartbeat (push status==2, for 30m), InstanceDown |
| cpu.rules.yml | HostHighCpuLoad, HostCpuHighIowait |
| memory.rules.yml | HostOutOfMemory, HostOomKillDetected, HostMemoryUnderPressure |
| filesystem.rules.yml | DiskSpaceFree{20,10}Percent, XFSQuota{20,10}Percent, DeviceError, ReadOnly |
| io.rules.yml | HostUnusualDiskIo, HighWriteLatency |
| traefik.rules.yml | TraefikServiceDown |
| phpfpm.rules.yml | Capacity, AllWorkersBusy, SlowRequests, ListenQueue, MaxChildren |
| mysql.rules.yml | MysqlDown, Restarted, TooManyConnections, HighThreadsRunning |
| friendica.rules.yml | FriendicaExporterUnavailable, WorkerStale, WorkerBacklogMonotonicIncrease |
| containers.rules.yml | ContainerRestartLoop (>3 in 30min), ContainerOOMKilled |
| borgmatic.rules.yml | BorgmaticMetricsMissing (absent metric for 30m) |
| certificates.rules.yml | TLSCertExpiry{Warning,Critical} (<14d/<3d) |
| smartctl.rules.yml | SmartCriticalWarning, MediaErrors (nvme0n1/nvme1n1 excluded) |
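As an illustration of the rule style, the certificate-expiry pair can be expressed against the blackbox exporter's probe_ssl_earliest_cert_expiry metric. Thresholds match the table above; the hold durations and label values are assumptions, and the actual certificates.rules.yml may differ:

```yaml
groups:
  - name: certificates
    rules:
      - alert: TLSCertExpiryWarning
        # fires when the soonest-expiring cert in the chain is <14 days out
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels: {severity: warning}
      - alert: TLSCertExpiryCritical
        # <3 days out
        expr: probe_ssl_earliest_cert_expiry - time() < 3 * 24 * 3600
        for: 15m
        labels: {severity: critical}
```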

Loki alert rules (config/loki/rules/): backupFailed (phil-db borg-backup systemd), borgmaticBackupFailed, SshLoginFailed.

Blackbox Exporter

Probes:

  • HTTPS: opensocial.at, friendica.me, cloud.philipp.info, git.opensocial.at, mail.philipp.info
  • SMTP: mailcow-postfix:587 (internal alias, smtp_banner module, preferred_ip_protocol: ip4)
  • IMAP: mailcow-dovecot:993 (internal alias, imap_tls module, tls_config.server_name: mail.philipp.info)

SMTP/IMAP probes target internal container aliases (not the public IP) because Shorewall blocks Docker→Host traffic on non-443 ports. HTTP probes resolve via CoreDNS (172.21.0.53) so requests route through Traefik correctly.
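A minimal sketch of what the smtp_banner and imap_tls modules might look like in the blackbox exporter config; the SMTP query_response script is illustrative, only the options named above are taken from this document:

```yaml
modules:
  smtp_banner:
    prober: tcp
    tcp:
      preferred_ip_protocol: ip4
      query_response:
        - expect: "^220 "        # SMTP greeting banner
        - send: "QUIT"
  imap_tls:
    prober: tcp
    tcp:
      tls: true
      tls_config:
        # SNI must match the cert even though the target is an internal alias
        server_name: mail.philipp.info
```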

Loki Storage

  • Volume itop_loki with 30 GiB XFS quota (project ID 42)
  • Schema: TSDB v13 (migrated from boltdb-shipper v11, Feb 2026)
  • Retention: 14 days (limits_config.retention_period: 336h)
  • Compactor: runs every 5m
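These settings correspond to a Loki config fragment roughly like the following; the schema boundary date and object store are illustrative assumptions (newer Loki releases additionally require a delete store for compactor-driven retention):

```yaml
schema_config:
  configs:
    - from: "2026-02-01"      # TSDB migration boundary (illustrative date)
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
limits_config:
  retention_period: 336h      # 14 days
compactor:
  retention_enabled: true
  compaction_interval: 5m
```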

Pushgateway

  • Borgmatic pushes borgmatic_last_backup_timestamp_seconds + borgmatic_repo_size_bytes after each create run
  • Persistence: --persistence.file=/data/metrics.db (volume itop_pushgateway) — survives container restarts
  • BorgmaticMetricsMissing fires when all borgmatic metrics are absent for >30m (signals pushgateway restart, not backup failure)
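For reference, a push in the Pushgateway text format looks roughly like this; the repo label and size value are illustrative placeholders, not taken from the actual borgmatic hook:

```shell
# Build the metric payload borgmatic would push after a `create` run.
payload=$(printf 'borgmatic_last_backup_timestamp_seconds{repo="main"} %s\nborgmatic_repo_size_bytes{repo="main"} 1234567\n' "$(date +%s)")
echo "$payload"
# Actual push (requires a reachable pushgateway):
# echo "$payload" | curl --silent --data-binary @- http://pushgateway:9091/metrics/job/borgmatic
```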

Operations

Backup

The itop stack data (Prometheus TSDB, Grafana dashboards, Alertmanager state, Pushgateway persistence) is backed up by Borgmatic. The loki volume is intentionally NOT backed up (logs are ephemeral; 14-day retention is sufficient; backup of 30 GiB log data is not justified).

Prometheus TSDB can be rebuilt from scrape targets if lost. Grafana dashboard JSON is committed to git (stack/itop/config/grafana/provisioning/).

Reload Prometheus rules

cd /opt/docker/itop
sudo docker compose exec prometheus kill -HUP 1

Loki quota full — manual cleanup

sudo docker compose stop loki
sudo find /var/lib/docker/165536.165536/volumes/loki/_data/chunks/ -type f -mtime +14 -delete
sudo docker compose up -d loki

Alert Tuning for Backup Windows

Several alerts are tuned to avoid false positives during nightly backup operations (mariabackup + borg on phil-db, ~10 min duration):

| Alert | for duration | Rate window | Rationale |
| --- | --- | --- | --- |
| HostUnusualDiskIo | 15m | 5m | NVMe I/O saturation during mariabackup is transient |
| HostCpuHighIowait | 15m | 5m | iowait spikes from mariabackup + borg compression |
| DiskSpaceFree20Percent | 30m | n/a | /var on phil-db fills temporarily during mariabackup staging |
| SmartCriticalWarning | 1m | n/a | nvme0n1/nvme1n1 excluded (known EOL warning) |
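For example, the disk-I/O rule combines the 5m rate window with a 15m hold so that a ~10 min backup cannot complete a full alert cycle. The threshold here is an assumed placeholder, not the stack's actual value:

```yaml
- alert: HostUnusualDiskIo
  # 5m window smooths short mariabackup bursts; for: 15m outlasts the ~10 min backup
  expr: rate(node_disk_io_time_seconds_total[5m]) > 0.5
  for: 15m
  labels: {severity: warning}
```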

Pitfalls

phil-db Alloy cannot reach Loki via HTTPS (port 443)

Shorewall on phil-app allows port 443 only from the net zone, not from the wg (WireGuard) zone, so Alloy on phil-db is blocked when it connects to port 443 over WireGuard.

Fix: phil-db uses http://loki:3100 (direct Loki API, port 3100 allowed from wg zone). /etc/hosts maps loki to 10.42.10.4. WireGuard provides encryption, so HTTP is sufficient.

Loki label cardinality — auto-generated labels

Alloy and Docker auto-generate labels (hostname, filename, config) that create streams without being used in queries. Strip via stage.label_drop in all pipelines.

Alloy self-telemetry logs as service_name=unknown_service

Without logging { write_to = [...] }, Alloy's internal logs land in Loki with only server + service_name=unknown_service.

Fix: Route internal logs through loki.process "alloy_internal" (adds app=alloy, component=agent, tier=obs).

Alloy v1.13 duplicates all log entries as service_name=unknown_service

Alloy v1.13.x's loki.write OTEL compatibility layer creates a duplicate stream for every log entry (same timestamp and content, but carrying only the server label plus service_name=unknown_service). Rate: ~143 entries/sec (~500 MB/day). This is a pre-existing upstream Alloy bug. Dashboards are unaffected because their queries always filter by app/component/job. No workaround; monitor for an upstream fix.

cAdvisor container discovery broken with Docker userns-remap

With userns-remap, cAdvisor's Docker factory returns blank DockerVersion and discovers 0 containers.

Fix (2026-02-28): mount both sockets and pass the matching flags:

  1. Mount /run/containerd/containerd.sock (for discovery)
  2. Mount /var/run/docker.sock (for label enrichment)
  3. Pass --containerd=/run/containerd/containerd.sock --containerd-namespace=moby (dash, not underscore)
  4. Pass --docker=unix:///var/run/docker.sock --docker_only=true --store_container_labels=true
  5. Raise the memory limit to 2G (was 1G; OOM at ~680 MB with 100+ containers)

Pitfalls: --containerd_namespace (underscore) = crash; --allowlisted_container_labels invalid in v0.55.1; without Docker socket, no container labels.
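Putting the fix together, the cAdvisor service definition would look roughly like this sketch; the image tag and read-only mounts are assumptions beyond what the fix steps state:

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.55.1
  mem_limit: 2g
  command:
    - --containerd=/run/containerd/containerd.sock
    - --containerd-namespace=moby        # dash, not underscore: underscore crashes
    - --docker=unix:///var/run/docker.sock
    - --docker_only=true
    - --store_container_labels=true
  volumes:
    - /run/containerd/containerd.sock:/run/containerd/containerd.sock:ro  # discovery
    - /var/run/docker.sock:/var/run/docker.sock:ro                        # label enrichment
```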

cgroup OOM kills invisible in dmesg and node_vmstat_oom_kill

Docker container OOM kills are at the cgroup level — not the global kernel OOM killer. They appear in journalctl -k | grep -iE 'oom|memory cgroup', not in dmesg. HostOomKillDetected (based on node_vmstat_oom_kill) will NOT fire for container OOM kills.

node-exporter: --collector.disable-defaults silently drops vmstat

When using --collector.disable-defaults, --collector.vmstat must be listed explicitly. Without it, node_vmstat_oom_kill is absent.
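Sketch of the pattern; apart from the vmstat line, the collector set shown here is an example, not the stack's actual flag list:

```yaml
command:
  - --collector.disable-defaults
  - --collector.cpu
  - --collector.meminfo
  - --collector.filesystem
  - --collector.diskstats
  - --collector.vmstat     # required, or node_vmstat_oom_kill disappears
```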

ContainerOOMKilled alert has no container labels

container_oom_events_total from cAdvisor only has instance and job labels under userns-remap — no container name. When the alert fires, check:

sudo journalctl -k --since '30 min ago' | grep -iE 'oom|out of memory|memory cgroup'

Uptime-Kuma push monitors use status=2 (PENDING) for missed heartbeats

Push monitors go UP (1) → PENDING (2) when the heartbeat window expires, and only reach DOWN (0) after the first scheduled check past the deadline, so the UptimeFailed rule (status == 0) misses most missed heartbeats.

Fix: Separate BackupMissedHeartbeat rule targeting monitor_status{monitor_type="push"} == 2 with for: 30m.
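The resulting rule is small; this sketch assumes the monitor_status metric and monitor_type label exposed by Uptime Kuma's Prometheus endpoint, with an assumed severity label:

```yaml
- alert: BackupMissedHeartbeat
  # status 2 = PENDING: heartbeat window expired but not yet marked DOWN
  expr: monitor_status{monitor_type="push"} == 2
  for: 30m
  labels: {severity: critical}
```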

Grafana noDataState: Alerting fires falsely when pushgateway restarts

Setting noDataState: Alerting on Grafana-managed rules causes them to fire immediately when the underlying metric disappears (e.g., after pushgateway restart) — even if the last backup was only hours ago.

Fix: Use Prometheus rules instead of Grafana-managed rules for backup staleness. In Prometheus, absent series produce no alert by default; use absent() with for: 30m explicitly.
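A sketch of the absent()-based staleness rule; annotation wording is illustrative:

```yaml
- alert: BorgmaticMetricsMissing
  # absent() returns 1 only when no series with this name exists at all
  expr: absent(borgmatic_last_backup_timestamp_seconds)
  for: 30m
  annotations:
    summary: "No borgmatic metrics for 30m (likely a pushgateway restart, not a failed backup)"
```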

Pushgateway wipes all metrics on restart (without persistence)

Default Pushgateway stores metrics in memory only. Restart → metrics gone → gap until next backup run.

Fix: --persistence.file=/data/metrics.db + named volume itop_pushgateway.

Blackbox HTTP probes — Docker DNS vs CoreDNS

Docker's embedded DNS (127.0.0.11) intercepts queries for Docker-known hostnames before forwarding to configured DNS. Setting dns: 172.21.0.53 alone does NOT help — Docker DNS answers first.

Fix: Pin hostnames to Traefik ingress IP via extra_hosts in docker-compose: mail.philipp.info:172.21.0.100. Or use CoreDNS-resolved hostnames for probes.
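In docker-compose terms the pin is a one-liner on the probing container; the service name here is an assumption:

```yaml
blackbox-exporter:
  extra_hosts:
    # bypass Docker's embedded DNS: pin the probe hostname to the Traefik ingress IP
    - "mail.philipp.info:172.21.0.100"
```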

Blackbox probe target hostnames must match Traefik router rules

Using git.philipp.info as a blackbox target fails if Traefik's router is configured for Host('git.opensocial.at') — Traefik returns TLS SNI error "unrecognized name". Always use the hostname from the Traefik router rule label.

MariaDB log_queries_not_using_indexes floods Loki

log_queries_not_using_indexes=1 logs every query without an index regardless of query time — including fast full scans of small tables. Generated ~21k log lines/sec. Set to 0 on phil-db.

Loki fills XFS quota

When Loki retention is misconfigured or the compactor is stopped, log storage grows past the 30 GiB XFS quota. XFSQuotaFree10Percent fires. Manual cleanup required (see Operations above). Key config requirements: compactor.retention_enabled: true, limits_config.retention_period set to 14d.

SmartCriticalWarning false positive (phil-db NVMe)

Both NVMe SSDs on phil-db report SMART Critical Warning 0x04 (volatile memory backup degraded) — expected at EOL per Hetzner. Alert rule excludes nvme0n1 and nvme1n1. Remove exclusion if disks are replaced.

Prometheus drops out-of-order samples from cAdvisor

~1500 "Error on ingesting out-of-order samples" per scrape — known cAdvisor issue after container restarts. No data loss for alerts. Not critical.