Backups (Borgmatic + Borg)

Stack: stack/borgmatic/ (phil-app), systemd units (phil-db) · Updated: 2026-03-02

Borg-based backup system covering both servers. phil-app uses Borgmatic (Docker container running 11 jobs); phil-db uses native Borg with systemd units.

Overview

All backups go to Hetzner StorageBox via SSH. Retention: 2 hourly / 7 daily / 4 weekly / 6 monthly.

Architecture

phil-app — Borgmatic (stack/borgmatic/)

Borgmatic runs 11 backup jobs from separate config files in /etc/borgmatic.d/. Each job covers one service stack (friendica_philipp, friendicame, opensocial, mailcow, matrix, gitea, paperless, nextcloud, wordpress, itop, kopano-legacy).

Monitoring (dual channel):

  1. Uptime-Kuma push heartbeat (states: start/finish/fail) — primary signal. When a backup misses its window, the monitor goes to PENDING (status=2). The BackupMissedHeartbeat Prometheus rule catches this (monitor_status{monitor_type="push"} == 2, for 30m) → Alertmanager. Uptime-Kuma's own notification channels are disabled — all alerting runs through Alertmanager.

  2. Pushgateway metric push (curl after create success): borgmatic_last_backup_timestamp_seconds + borgmatic_repo_size_bytes. BorgmaticMetricsMissing fires when all borgmatic metrics are absent for >30m (signals pushgateway restart, not backup failure).
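A minimal sketch of that metric push after a successful create; the pushgateway URL, the per-repo grouping label, and the size value are assumptions, only the metric names come from the setup above:

```shell
REPO=nextcloud                      # hypothetical repo label
PUSHGATEWAY=http://pushgateway:9091 # assumed in-network URL

# Build the exposition-format payload with the two gauges named above.
payload=$(printf 'borgmatic_last_backup_timestamp_seconds %s\nborgmatic_repo_size_bytes %s\n' \
  "$(date +%s)" "520000000000")
echo "$payload"

# Push under a per-repo grouping key so repos do not overwrite each other:
# echo "$payload" | curl --data-binary @- "$PUSHGATEWAY/metrics/job/borgmatic/repo/$REPO"
```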

Memory requirements: The borgmatic container needs 4G for large repos (friendicame: 31M chunks/520 GB, nextcloud: 500+ GB). Smaller repos work within lower limits.

Config flag: borgmatic -v 1 --lock-wait 300 (verbose output plus a 5-minute lock wait). The previous invocation, -v 0 --json, silently swallowed all errors.
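For context, these flags sit on the container's cron line; a hypothetical entry (the schedule and log redirection are assumptions, only the flags come from the config above):

```
0 2 * * * borgmatic -v 1 --lock-wait 300 >> /var/log/borgmatic.log 2>&1
```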

phil-db — Borg (systemd)

  • Units: borg-create.service, borg-prune.service, borg-check.service
  • Script: /opt/borg-backup/backup.sh, env: /etc/borg-backup.env
  • Method: mariabackup for consistent MySQL backup → borg create to Hetzner StorageBox
  • Borg version: 1.4.x (python venv at /usr/local/python-envs/borg-*/)
  • Monitoring: backupFailed Loki alert rule (systemd journal), Uptime Kuma push monitor

Operations

Health Check

# borgmatic — check last run
sudo docker logs borgmatic-borgmatic-1 --tail 100

# phil-db borg — check last run
sudo journalctl -u borg-create.service --since '25 hours ago'

# borgmatic — list repos and last archive
sudo docker exec borgmatic-borgmatic-1 borgmatic list --last 1
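A freshness check against the pushed metric can complement the log checks; the pushgateway URL and the 26-hour (93600 s) threshold are assumptions:

```shell
# Scrape the pushgateway and take the first timestamp sample (assumed URL).
LAST=$(curl -s http://pushgateway:9091/metrics \
  | awk '/^borgmatic_last_backup_timestamp_seconds/ {print int($2); exit}')
LAST=${LAST:-0}               # treat a missing metric as infinitely old

AGE=$(( $(date +%s) - LAST ))
[ "$AGE" -lt 93600 ] || echo "backup stale: ${AGE}s old"
```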

Repairing a Corrupted Borg Repo (phil-db)

If borg prune fails with FileNotFoundError for missing segment files:

# 1. Break stale locks
sudo bash -c 'source /etc/borg-backup.env && export BORG_REPO=$REPOSITORY \
  && export BORG_PASSPHRASE && export BORG_RSH && borg break-lock $BORG_REPO'

# 2. Repair (auto-confirm, takes 10-30 min on remote repo)
sudo bash -c 'source /etc/borg-backup.env && export BORG_REPO=$REPOSITORY \
  && export BORG_PASSPHRASE && export BORG_RSH \
  && export BORG_CHECK_I_KNOW_WHAT_I_AM_DOING=YES \
  && borg check --repair $BORG_REPO'

# 3. Verify prune works again
sudo systemctl start borg-prune.service && sudo journalctl -u borg-prune.service -f

borg check --repair removes references to missing segments. borg create continues working even with missing segments — only prune and check fail.

Breaking a Stale Borgmatic Lock (phil-app)

sudo docker exec borgmatic-borgmatic-1 borgmatic break-lock -c /etc/borgmatic.d/<repo>.yaml
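If several repos hold stale locks at once, a loop over the config directory clears them in one pass (this loop is a convenience sketch, not an existing script):

```shell
for cfg in /etc/borgmatic.d/*.yaml; do
  sudo docker exec borgmatic-borgmatic-1 borgmatic break-lock -c "$cfg" \
    || echo "break-lock failed for $cfg"   # keep going past individual failures
done
```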

Pitfalls

Borgmatic container OOM (SIGKILL)

Large Borg repos (friendicame: 31M chunks/520 GB, nextcloud: 500+ GB) need several GiB RAM for the chunk index. Container memory limit of 1G caused SIGKILL on 4 of 10 backup jobs. Increased to 4G (Feb 2026).

Successful jobs with smaller repos (wordpress, gitea, itop, matrix, paperless, friendica_philipp) work within lower limits.
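In compose terms the fix is a one-line limit bump; a sketch, assuming the service is named borgmatic (the 4g value is the Feb 2026 setting from above):

```yaml
# stack/borgmatic/ compose fragment (sketch; only mem_limit is the point here)
services:
  borgmatic:
    mem_limit: 4g
```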

Borgmatic -v 0 silently swallows errors

-v 0 --json suppresses all error output. When a repo fails (e.g., stale lock), it is invisible in container logs. Changed to -v 1 --lock-wait 300 (Feb 2026).

Stale borg lock blocks all subsequent backup jobs

Borgmatic processes config files sequentially in alphabetical order. If one repo holds a stale lock, that job blocks indefinitely and every repo sorted after it never runs.

Observed: kopano lock blocked matrix, paperless, and nextcloud for 5 days.

Prevention: --lock-wait 300 in crontab gives 5 min grace period before failing. Fix: borgmatic break-lock -c /etc/borgmatic.d/{repo}.yaml

Borg exit code 127

Exit code 127 = command not found. Usually means the script was broken by manual editing. Check for .swp swap files in /opt/borg-backup/ (sign of interrupted vim session). Redeploy via Ansible.
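Two quick sanity checks before redeploying, assuming the script path from above:

```shell
find /opt/borg-backup -name '*.swp' 2>/dev/null     # leftover vim swap files
bash -n /opt/borg-backup/backup.sh 2>/dev/null \
  || echo "script missing or has a syntax error"    # parse without executing
```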

Borg passphrase mismatch (phil-db)

BORG_PASSPHRASE in /etc/borg-backup.env must match the passphrase used during borg init. Test manually: sudo BORG_PASSPHRASE='...' borg list ssh://u153193-sub3@.... Update the Ansible vault variable borg_backup_passphrase if the passphrase was changed manually.

Borg --verify-data --repository-only incompatible in Borg 1.4

These flags are mutually exclusive in Borg 1.4+. Fixed: removed --verify-data, keeping --repository-only with --max-duration for resumable checks.
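A manual run with the corrected flag combination, following the env-sourcing pattern from the repair steps above (the 1800-second cap is an assumption):

```shell
sudo bash -c 'source /etc/borg-backup.env \
  && export BORG_PASSPHRASE BORG_RSH \
  && borg check --repository-only --max-duration 1800 "$REPOSITORY"'
```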

Borg prune fails with missing segments

borg prune crashes with FileNotFoundError: No such file or directory: '/home/backup/data/XX/XXXXX' — remote repo has missing segment files (typically from interrupted SSH writes).

Fix: borg check --repair (see Operations above).

Pushgateway wipes metrics on restart

Default Pushgateway stores metrics in memory. Container restart → borgmatic metrics gone → gap until next nightly run.

Fix: --persistence.file=/data/metrics.db + named volume itop_pushgateway. See services/monitoring.md.
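A compose sketch of the persistent setup; the image reference and service name are assumptions, while the flag and volume name match the fix above:

```yaml
services:
  pushgateway:
    image: prom/pushgateway        # assumed image reference
    command: ["--persistence.file=/data/metrics.db"]
    volumes:
      - itop_pushgateway:/data

volumes:
  itop_pushgateway:
```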

Uptime-Kuma push monitors use status=2 (PENDING) for missed heartbeats

Push monitors go UP (1) → PENDING (2) when the heartbeat window expires, and only hit DOWN (0) after the first check post-deadline. UptimeFailed rule (status == 0) misses this.

Fix: Separate BackupMissedHeartbeat rule: monitor_status{monitor_type="push"} == 2 with for: 30m. See services/monitoring.md.
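The two rules can be sketched as follows; the group name is arbitrary, and the absent() expression is one assumed way to encode "all borgmatic metrics missing":

```yaml
groups:
  - name: backup-alerts            # arbitrary group name
    rules:
      - alert: BackupMissedHeartbeat
        expr: monitor_status{monitor_type="push"} == 2
        for: 30m
      - alert: BorgmaticMetricsMissing
        expr: absent(borgmatic_last_backup_timestamp_seconds)
        for: 30m
```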

Backup I/O triggers false-positive monitoring alerts

During borg-create on phil-db (nightly mariabackup + borg), several alerts fire due to transient I/O load. Alert thresholds tuned to exceed the ~10 min backup window:

Alert                    for   Rate window  Rationale
HostUnusualDiskIo        15m   5m           NVMe I/O saturation during mariabackup
HostCpuHighIowait        15m   5m           iowait spikes from mariabackup + borg compression
DiskSpaceFree20Percent   30m   -            /var on phil-db fills temporarily during mariabackup staging