Backups (Borgmatic + Borg)

Stack: stack/borgmatic/ (phil-app), systemd units (phil-db) · Updated: 2026-03-02

Borg-based backup system covering both servers. phil-app uses Borgmatic (Docker container running 11 jobs); phil-db uses native Borg with systemd units.

Overview

All backups go to Hetzner StorageBox via SSH. Retention: 2 hourly / 7 daily / 4 weekly / 6 monthly.

Architecture

phil-app — Borgmatic (stack/borgmatic/)

Borgmatic runs 11 backup jobs from separate config files in /etc/borgmatic.d/. Each job covers one service stack (friendica_philipp, friendicame, opensocial, mailcow, matrix, gitea, paperless, nextcloud, wordpress, itop, kopano-legacy).

Monitoring (dual channel):

  1. Uptime-Kuma push heartbeat (states: start/finish/fail) — primary signal. When a backup misses its window, the monitor goes to PENDING (status=2). The BackupMissedHeartbeat Prometheus rule catches this (monitor_status{monitor_type="push"} == 2, for 30m) → Alertmanager. Uptime-Kuma's own notification channels are disabled — all alerting runs through Alertmanager.

  2. Pushgateway metric push (curl after create success): borgmatic_last_backup_timestamp_seconds + borgmatic_repo_size_bytes. BorgmaticMetricsMissing fires when all borgmatic metrics are absent for >30m (signals pushgateway restart, not backup failure).
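A minimal sketch of that metric push after a successful create; the pushgateway URL, the per-repo grouping label, and the size value are assumptions, only the metric names come from the setup above:

```shell
REPO=nextcloud                      # hypothetical repo label
PUSHGATEWAY=http://pushgateway:9091 # assumed in-network URL

# Build the exposition-format payload with the two gauges named above.
payload=$(printf 'borgmatic_last_backup_timestamp_seconds %s\nborgmatic_repo_size_bytes %s\n' \
  "$(date +%s)" "520000000000")
echo "$payload"

# Push under a per-repo grouping key so repos do not overwrite each other:
# echo "$payload" | curl --data-binary @- "$PUSHGATEWAY/metrics/job/borgmatic/repo/$REPO"
```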

Memory requirements: The borgmatic container needs 4G for large repos (friendicame: 31M chunks/520 GB, nextcloud: 500+ GB). Smaller repos work within lower limits.

Config flag: borgmatic -v 1 --lock-wait 300 (verbose output plus a 5-minute lock wait). The previous invocation, -v 0 --json, silently swallowed all errors.
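For context, these flags sit on the container's cron line; a hypothetical entry (the schedule and log redirection are assumptions, only the flags come from the config above):

```
0 2 * * * borgmatic -v 1 --lock-wait 300 >> /var/log/borgmatic.log 2>&1
```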

phil-db — Borg (systemd)

  • Units: borg-create.service, borg-prune.service, borg-check.service
  • Script: /opt/borg-backup/backup.sh, env: /etc/borg-backup.env
  • Method: mariabackup for consistent MySQL backup → borg create to Hetzner StorageBox
  • Borg version: 1.4.x (python venv at /usr/local/python-envs/borg-*/)
  • Monitoring: backupFailed Loki alert rule (systemd journal), Uptime Kuma push monitor

Operations

Health Check

# borgmatic — check last run
sudo docker logs borgmatic-borgmatic-1 --tail 100

# phil-db borg — check last run
sudo journalctl -u borg-create.service --since '25 hours ago'

# borgmatic — list repos and last archive
sudo docker exec borgmatic-borgmatic-1 borgmatic list --last 1
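A freshness check against the pushed metric can complement the log checks; the pushgateway URL and the 26-hour (93600 s) threshold are assumptions:

```shell
# Scrape the pushgateway and take the first timestamp sample (assumed URL).
LAST=$(curl -s http://pushgateway:9091/metrics \
  | awk '/^borgmatic_last_backup_timestamp_seconds/ {print int($2); exit}')
LAST=${LAST:-0}               # treat a missing metric as infinitely old

AGE=$(( $(date +%s) - LAST ))
[ "$AGE" -lt 93600 ] || echo "backup stale: ${AGE}s old"
```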

Repairing a Corrupted Borg Repo (phil-db)

If borg prune fails with FileNotFoundError for missing segment files:

# 1. Break stale locks
sudo bash -c 'source /etc/borg-backup.env && export BORG_REPO=$REPOSITORY \
  && export BORG_PASSPHRASE && export BORG_RSH && borg break-lock $BORG_REPO'

# 2. Repair (auto-confirm, takes 10-30 min on remote repo)
sudo bash -c 'source /etc/borg-backup.env && export BORG_REPO=$REPOSITORY \
  && export BORG_PASSPHRASE && export BORG_RSH \
  && export BORG_CHECK_I_KNOW_WHAT_I_AM_DOING=YES \
  && borg check --repair $BORG_REPO'

# 3. Verify prune works again
sudo systemctl start borg-prune.service && sudo journalctl -u borg-prune.service -f

borg check --repair removes references to missing segments. borg create continues working even with missing segments — only prune and check fail.

Breaking a Stale Borgmatic Lock (phil-app)

sudo docker exec borgmatic-borgmatic-1 borgmatic break-lock -c /etc/borgmatic.d/<repo>.yaml
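If several repos hold stale locks at once, a loop over the config directory clears them in one pass (this loop is a convenience sketch, not an existing script):

```shell
for cfg in /etc/borgmatic.d/*.yaml; do
  sudo docker exec borgmatic-borgmatic-1 borgmatic break-lock -c "$cfg" \
    || echo "break-lock failed for $cfg"   # keep going past individual failures
done
```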

Pitfalls

Borgmatic container OOM (SIGKILL)

Large Borg repos (friendicame: 31M chunks/520 GB, nextcloud: 500+ GB) need several GiB RAM for the chunk index. Container memory limit of 1G caused SIGKILL on 4 of 10 backup jobs. Increased to 4G (Feb 2026).

Successful jobs with smaller repos (wordpress, gitea, itop, matrix, paperless, friendica_philipp) work within lower limits.
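In compose terms the fix is a one-line limit bump; a sketch, assuming the service is named borgmatic (the 4g value is the Feb 2026 setting from above):

```yaml
# stack/borgmatic/ compose fragment (sketch; only mem_limit is the point here)
services:
  borgmatic:
    mem_limit: 4g
```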

Borgmatic -v 0 silently swallows errors

-v 0 --json suppresses all error output. When a repo fails (e.g., stale lock), it is invisible in container logs. Changed to -v 1 --lock-wait 300 (Feb 2026).

Stale borg lock blocks all subsequent backup jobs

Borgmatic processes config files sequentially in alphabetical order. If one repo holds a stale lock, that job blocks indefinitely and every repo sorted after it never runs.

Observed: kopano lock blocked matrix, paperless, and nextcloud for 5 days.

Prevention: --lock-wait 300 in crontab gives 5 min grace period before failing. Fix: borgmatic break-lock -c /etc/borgmatic.d/{repo}.yaml

Borg exit code 127

Exit code 127 = command not found. Usually means the script was broken by manual editing. Check for .swp swap files in /opt/borg-backup/ (sign of interrupted vim session). Redeploy via Ansible.
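Two quick sanity checks before redeploying, assuming the script path from above:

```shell
find /opt/borg-backup -name '*.swp' 2>/dev/null     # leftover vim swap files
bash -n /opt/borg-backup/backup.sh 2>/dev/null \
  || echo "script missing or has a syntax error"    # parse without executing
```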

Borg passphrase mismatch (phil-db)

BORG_PASSPHRASE in /etc/borg-backup.env must match the passphrase used during borg init. Test manually: sudo BORG_PASSPHRASE='...' borg list ssh://u153193-sub3@.... Update the Ansible vault variable borg_backup_passphrase if the passphrase was changed manually.

Borg --verify-data --repository-only incompatible in Borg 1.4

These flags are mutually exclusive in Borg 1.4+. Fixed: removed --verify-data, keeping --repository-only with --max-duration for resumable checks.
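A manual run with the corrected flag combination, following the env-sourcing pattern from the repair steps above (the 1800-second cap is an assumption):

```shell
sudo bash -c 'source /etc/borg-backup.env \
  && export BORG_PASSPHRASE BORG_RSH \
  && borg check --repository-only --max-duration 1800 "$REPOSITORY"'
```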

Borg prune fails with missing segments

borg prune crashes with FileNotFoundError: No such file or directory: '/home/backup/data/XX/XXXXX' — remote repo has missing segment files (typically from interrupted SSH writes).

Fix: borg check --repair (see Operations above).

Pushgateway wipes metrics on restart

Default Pushgateway stores metrics in memory. Container restart → borgmatic metrics gone → gap until next nightly run.

Fix: --persistence.file=/data/metrics.db + named volume itop_pushgateway. See services/monitoring.md.
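A compose sketch of the persistent setup; the image reference and service name are assumptions, while the flag and volume name match the fix above:

```yaml
services:
  pushgateway:
    image: prom/pushgateway        # assumed image reference
    command: ["--persistence.file=/data/metrics.db"]
    volumes:
      - itop_pushgateway:/data

volumes:
  itop_pushgateway:
```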

Uptime-Kuma push monitors use status=2 (PENDING) for missed heartbeats

Push monitors go UP (1) → PENDING (2) when the heartbeat window expires, and only hit DOWN (0) after the first check post-deadline. UptimeFailed rule (status == 0) misses this.

Fix: Separate BackupMissedHeartbeat rule: monitor_status{monitor_type="push"} == 2 with for: 30m. See services/monitoring.md.
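The two rules can be sketched as follows; the group name is arbitrary, and the absent() expression is one assumed way to encode "all borgmatic metrics missing":

```yaml
groups:
  - name: backup-alerts            # arbitrary group name
    rules:
      - alert: BackupMissedHeartbeat
        expr: monitor_status{monitor_type="push"} == 2
        for: 30m
      - alert: BorgmaticMetricsMissing
        expr: absent(borgmatic_last_backup_timestamp_seconds)
        for: 30m
```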

Backup I/O triggers false-positive monitoring alerts

During borg-create on phil-db (nightly mariabackup + borg), several alerts fire due to transient I/O load. Alert thresholds tuned to exceed the ~10 min backup window:

Alert                    for   Rate window  Rationale
HostUnusualDiskIo        15m   5m           NVMe I/O saturation during mariabackup
HostCpuHighIowait        15m   5m           iowait spikes from mariabackup + borg compression
DiskSpaceFree20Percent   30m   -            /var on phil-db fills temporarily during mariabackup staging