Backups (Borgmatic + Borg)¶
Stack:
stack/borgmatic/ (phil-app), systemd units (phil-db) · Updated: 2026-03-02
Borg-based backup system covering both servers. phil-app uses Borgmatic (Docker container running 11 jobs); phil-db uses native Borg with systemd units.
Overview¶
All backups go to Hetzner StorageBox via SSH. Retention: 2 hourly / 7 daily / 4 weekly / 6 monthly.
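The retention policy maps onto borgmatic's retention options roughly as follows (a sketch; whether these keys sit at top level or under a `retention:` section depends on the borgmatic version, and the phil-db units pass the equivalent `--keep-*` flags to `borg prune`):

```yaml
# Sketch: retention block for a /etc/borgmatic.d/<job>.yaml
# (key placement varies by borgmatic version)
keep_hourly: 2
keep_daily: 7
keep_weekly: 4
keep_monthly: 6
```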
Architecture¶
phil-app — Borgmatic (stack/borgmatic/)¶
Borgmatic runs 11 backup jobs from separate config files in /etc/borgmatic.d/. Each job covers one service stack (friendica_philipp, friendicame, opensocial, mailcow, matrix, gitea, paperless, nextcloud, wordpress, itop, kopano-legacy).
Monitoring (dual channel):

- Uptime-Kuma push heartbeat (states: start/finish/fail) — primary signal. When a backup misses its window, the monitor goes to PENDING (status=2). The `BackupMissedHeartbeat` Prometheus rule catches this (`monitor_status{monitor_type="push"} == 2` for 30m) → Alertmanager. Uptime-Kuma's own notification channels are disabled — all alerting runs through Alertmanager.
- Pushgateway metric push (curl after `create` success): `borgmatic_last_backup_timestamp_seconds` + `borgmatic_repo_size_bytes`. `BorgmaticMetricsMissing` fires when all borgmatic metrics are absent for >30m (signals a Pushgateway restart, not a backup failure).
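The metric push can be sketched as follows (the payload format is standard Pushgateway text exposition; the URL, job/repo labels, and the way the repo size is obtained are assumptions, not taken from the real hook):

```shell
#!/bin/sh
# Sketch of the post-`create` metric push. Two gauges are assembled in
# Pushgateway text format; labels and URL below are illustrative.
TIMESTAMP=$(date +%s)      # time of the successful backup run
REPO_SIZE=555000000        # bytes; the real hook would derive this from borgmatic/borg info
PAYLOAD="borgmatic_last_backup_timestamp_seconds ${TIMESTAMP}
borgmatic_repo_size_bytes ${REPO_SIZE}
"
printf '%s' "$PAYLOAD"
# In the real hook the payload is then pushed, e.g.:
#   curl --data-binary "$PAYLOAD" http://pushgateway:9091/metrics/job/borgmatic/repo/<name>
```

The trailing newline in the payload matters: Pushgateway rejects pushes whose body does not end with a newline.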
Memory requirements: The borgmatic container needs 4G for large repos (friendicame: 31M chunks/520 GB, nextcloud: 500+ GB). Smaller repos work within lower limits.
Config flags: `borgmatic -v 1 --lock-wait 300` (verbose + 5 min lock wait). `-v 0 --json` was used before and silently swallowed all errors.
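In the crontab this looks roughly like the entry below (schedule, container name, and log path are assumptions; only the flags themselves come from the deployed setup):

```shell
# Hypothetical nightly cron entry on phil-app (schedule and paths assumed)
15 2 * * * docker exec borgmatic-borgmatic-1 borgmatic -v 1 --lock-wait 300 >> /var/log/borgmatic-cron.log 2>&1
```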
phil-db — Borg (systemd)¶
- Units: `borg-create.service`, `borg-prune.service`, `borg-check.service`
- Script: `/opt/borg-backup/backup.sh`, env: `/etc/borg-backup.env`
- Method: `mariabackup` for a consistent MySQL backup → `borg create` to Hetzner StorageBox
- Borg version: 1.4.x (Python venv at `/usr/local/python-envs/borg-*/`)
- Monitoring: `backupFailed` Loki alert rule (systemd journal), Uptime-Kuma push monitor
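A minimal sketch of how the unit, env file, and script fit together (the deployed unit is managed by Ansible; the `Type` and other option details are assumptions):

```ini
# /etc/systemd/system/borg-create.service — sketch, not the deployed unit
[Unit]
Description=Nightly Borg backup (mariabackup + borg create)

[Service]
Type=oneshot
EnvironmentFile=/etc/borg-backup.env
ExecStart=/opt/borg-backup/backup.sh
```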
Operations¶
Health Check¶
# borgmatic — check last run
sudo docker logs borgmatic-borgmatic-1 --tail 100
# phil-db borg — check last run
sudo journalctl -u borg-create.service --since '25 hours ago'
# borgmatic — list repos and last archive
sudo docker exec borgmatic-borgmatic-1 borgmatic list --last 1
Repairing a Corrupted Borg Repo (phil-db)¶
If `borg prune` fails with `FileNotFoundError` for missing segment files:
# 1. Break stale locks
sudo bash -c 'source /etc/borg-backup.env && export BORG_REPO=$REPOSITORY \
&& export BORG_PASSPHRASE && export BORG_RSH && borg break-lock $BORG_REPO'
# 2. Repair (auto-confirm, takes 10-30 min on remote repo)
sudo bash -c 'source /etc/borg-backup.env && export BORG_REPO=$REPOSITORY \
&& export BORG_PASSPHRASE && export BORG_RSH \
&& export BORG_CHECK_I_KNOW_WHAT_I_AM_DOING=YES \
&& borg check --repair $BORG_REPO'
# 3. Verify prune works again
sudo systemctl start borg-prune.service && sudo journalctl -u borg-prune.service -f
`borg check --repair` removes references to missing segments. `borg create` continues working even with missing segments — only `prune` and `check` fail.
Breaking a Stale Borgmatic Lock (phil-app)¶
sudo docker exec borgmatic-borgmatic-1 borgmatic break-lock -c /etc/borgmatic.d/<repo>.yaml
Pitfalls¶
Borgmatic container OOM (SIGKILL)¶
Large Borg repos (friendicame: 31M chunks/520 GB, nextcloud: 500+ GB) need several GiB RAM for the chunk index. Container memory limit of 1G caused SIGKILL on 4 of 10 backup jobs. Increased to 4G (Feb 2026).
Successful jobs with smaller repos (wordpress, gitea, itop, matrix, paperless, friendica_philipp) work within lower limits.
Borgmatic -v 0 silently swallows errors¶
-v 0 --json suppresses all error output. When a repo fails (e.g., stale lock), it is invisible in container logs. Changed to -v 1 --lock-wait 300 (Feb 2026).
Stale borg lock blocks all subsequent backup jobs¶
Borgmatic processes config files sequentially (alphabetical). If one repo has a stale lock, it blocks indefinitely and all repos sorted after it are skipped.
Observed: kopano lock blocked matrix, paperless, and nextcloud for 5 days.
Prevention: --lock-wait 300 in crontab gives 5 min grace period before failing.
Fix: borgmatic break-lock -c /etc/borgmatic.d/{repo}.yaml
Borg exit code 127¶
Exit code 127 = command not found. Usually means the script was broken by manual editing. Check for .swp swap files in /opt/borg-backup/ (sign of interrupted vim session). Redeploy via Ansible.
Borg passphrase mismatch (phil-db)¶
BORG_PASSPHRASE in /etc/borg-backup.env must match the passphrase used during borg init. Test manually: sudo BORG_PASSPHRASE='...' borg list ssh://u153193-sub3@.... Update the Ansible vault variable borg_backup_passphrase if the passphrase was changed manually.
Borg --verify-data --repository-only incompatible in Borg 1.4¶
These flags are mutually exclusive in Borg 1.4+. Fixed: removed --verify-data, keeping --repository-only with --max-duration for resumable checks.
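With `--verify-data` dropped, the resumable check invocation looks roughly like this (the time budget is an assumed value, not the deployed one):

```shell
# Repository-only check with a per-run time budget; an interrupted check
# resumes from its checkpoint on the next invocation. 3600 s is assumed.
borg check --repository-only --max-duration 3600 "$BORG_REPO"
```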
Borg prune fails with missing segments¶
borg prune crashes with FileNotFoundError: No such file or directory: '/home/backup/data/XX/XXXXX' — remote repo has missing segment files (typically from interrupted SSH writes).
Fix: borg check --repair (see Operations above).
Pushgateway wipes metrics on restart¶
Default Pushgateway stores metrics in memory. Container restart → borgmatic metrics gone → gap until next nightly run.
Fix: --persistence.file=/data/metrics.db + named volume itop_pushgateway. See services/monitoring.md.
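A sketch of the corresponding compose service (image tag and mount layout are assumptions; services/monitoring.md has the deployed config):

```yaml
# Sketch: Pushgateway with on-disk persistence (image tag and paths assumed)
pushgateway:
  image: prom/pushgateway:latest
  command:
    - --persistence.file=/data/metrics.db
  volumes:
    - itop_pushgateway:/data
```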
Uptime-Kuma push monitors use status=2 (PENDING) for missed heartbeats¶
Push monitors go UP (1) → PENDING (2) when the heartbeat window expires, and only reach DOWN (0) after the first check past the deadline — so the `UptimeFailed` rule (`status == 0`) misses this state.
Fix: Separate BackupMissedHeartbeat rule: monitor_status{monitor_type="push"} == 2 with for: 30m. See services/monitoring.md.
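The rule could look roughly like this (labels and annotations are illustrative; the deployed rule lives in the monitoring stack, see services/monitoring.md):

```yaml
# Sketch of the BackupMissedHeartbeat alerting rule (labels/annotations illustrative)
- alert: BackupMissedHeartbeat
  expr: monitor_status{monitor_type="push"} == 2
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Backup push heartbeat missed on {{ $labels.monitor_name }}"
```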
Backup I/O triggers false-positive monitoring alerts¶
During borg-create on phil-db (nightly mariabackup + borg), several alerts fire due to transient I/O load. Alert thresholds tuned to exceed the ~10 min backup window:
| Alert | for | Rate window | Rationale |
|---|---|---|---|
| HostUnusualDiskIo | 15m | 5m | NVMe I/O saturation during mariabackup |
| HostCpuHighIowait | 15m | 5m | iowait spikes from mariabackup + borg compression |
| DiskSpaceFree20Percent | 30m | — | /var on phil-db fills temporarily during mariabackup staging |