Operational Runbooks

Global procedures for the philipp.info infrastructure. Service-specific runbooks are in the service docs.

Health Check Procedure

When diagnosing issues, run these checks in parallel:

# On both servers (via SSH, not Ansible — avoids vault issues):
sudo systemctl --failed                    # Failed systemd units
sudo docker ps --filter health=unhealthy   # Unhealthy containers (phil-app)
sudo docker ps --filter status=restarting  # Restart loops (phil-app)
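
The checks above can be run concurrently with a small wrapper. This is a minimal sketch, assuming phil-app and phil-db are reachable as SSH host aliases; the SSH_BIN variable and all function names are illustrative and not part of the existing tooling.

```shell
# Runner used to reach a host; override SSH_BIN for dry runs (assumption).
SSH_BIN="${SSH_BIN:-ssh}"

# Failed systemd units: relevant on both servers.
check_units() {
    "$SSH_BIN" "$1" 'sudo systemctl --failed'
}

# Unhealthy or restart-looping containers: phil-app only.
check_containers() {
    "$SSH_BIN" "$1" 'sudo docker ps --filter health=unhealthy; sudo docker ps --filter status=restarting'
}

# Run every check in the background, wait for all of them, then print
# the captured output one check at a time so it does not interleave.
health_check_all() {
    tmp=$(mktemp -d) || return 1
    check_units phil-app      > "$tmp/1-phil-app-units" 2>&1 &
    check_units phil-db       > "$tmp/2-phil-db-units" 2>&1 &
    check_containers phil-app > "$tmp/3-phil-app-containers" 2>&1 &
    wait
    for f in "$tmp"/*; do
        echo "== ${f##*/} =="
        cat "$f"
    done
    rm -rf "$tmp"
}
```

Calling health_check_all from a workstation prints one labelled section per check.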

Failure Cascades

These services have dependency chains. Fix upstream first:

Step-CA cert expired
  → Keycloak has no valid TLS cert
    → Keycloak OIDC discovery returns 404
      → Synapse (Matrix) fails to load OIDC provider → restart loop
      → Paperless OIDC login broken

Rule: When Keycloak or LDAP are down, check Step-CA certs first.
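
The upstream-first rule can be written down as a lookup. The sketch below only encodes the cascade listed above; the service labels and the function name check_order are illustrative, and the table would need extending as dependencies change.

```shell
# Print the services to check, most-upstream first, for a failing service.
check_order() {
    case "$1" in
        step-ca)           echo "step-ca" ;;
        keycloak|ldap)     echo "step-ca $1" ;;
        synapse|paperless) echo "step-ca keycloak $1" ;;
        *)                 echo "unknown service: $1" >&2; return 1 ;;
    esac
}
```

For example, check_order synapse lists Step-CA and Keycloak before Synapse itself.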

Internal PKI — Step-CA (stack/secure/)

Step-CA issues short-lived certificates (24h) renewed automatically by cert-renewer@.timer systemd units.

  • Step-CA runs as Docker container secure-ca-1 (image: smallstep/step-ca), listening on 127.0.0.1:9000
  • Root CA: /etc/step-ca/certs/root_ca.crt (valid until 2033)
  • CA data volume: ca_files

To issue a new certificate after one has expired, see services/identity.md.
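
Because certificates live for only 24h, a quick expiry check helps confirm whether renewal is keeping up. This is a minimal sketch using openssl x509 -checkend; the function name cert_ok and the OPENSSL_BIN override are assumptions, and the path of any service certificate has to be supplied by the caller.

```shell
# Binary used for the check; overridable for dry runs (assumption).
OPENSSL_BIN="${OPENSSL_BIN:-openssl}"

# Succeeds if the certificate in $1 stays valid for at least $2 more
# seconds (default 0, i.e. "not expired right now").
cert_ok() {
    if "$OPENSSL_BIN" x509 -in "$1" -noout -checkend "${2:-0}" >/dev/null 2>&1; then
        echo "OK: $1"
    else
        # openssl also exits nonzero when the file cannot be read.
        echo "EXPIRED, EXPIRING, OR UNREADABLE: $1"
        return 1
    fi
}
```

For example, cert_ok /etc/step-ca/certs/root_ca.crt checks the root CA, and passing 3600 as a second argument flags a certificate that expires within the next hour. The state of the renewal timers can be listed with systemctl list-timers 'cert-renewer@*'.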

Stale systemd Units

  • dirsrv@dieholzers.service on phil-db: 389-ds LDAP relic (LDAP now runs in Docker on phil-app). Safe to reset-failed.
  • dkms.service on phil-app: package not installed, unit reference is a relic. reset-failed clears it.
  • user@1000.service on phil-db: transient user session failure. reset-failed clears it.
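
Clearing these relics can be limited to the units named above, so that a real failure is never reset by accident. A minimal sketch, meant to run as root on the affected host; the function names and the SYSTEMCTL_BIN override are illustrative.

```shell
# Binary used to talk to systemd; overridable for dry runs (assumption).
SYSTEMCTL_BIN="${SYSTEMCTL_BIN:-systemctl}"

# Units this runbook lists as relics, per host.
stale_units() {
    case "$1" in
        phil-db)  echo "dirsrv@dieholzers.service user@1000.service" ;;
        phil-app) echo "dkms.service" ;;
        *)        return 1 ;;
    esac
}

# Clear the failed state for the known relics of the given host only.
reset_stale_units() {
    units=$(stale_units "$1") || { echo "unknown host: $1" >&2; return 1; }
    for u in $units; do
        "$SYSTEMCTL_BIN" reset-failed "$u"
    done
}
```

Running reset_stale_units phil-db on phil-db resets both units listed for that host and nothing else.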

Docker on phil-app

  • Compose stacks live at /opt/docker/{stack-name}/ on the server
  • Local repo equivalent: stack/{stack-name}/
  • Docker uses userns-remap (UID namespace offset 165536) — file ownership inside volumes uses remapped UIDs
  • Port 9000 on localhost is Step-CA (via Docker port mapping), NOT php-fpm
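
With userns-remap, a UID inside a container appears on the host shifted by the namespace offset. The helper below is a small sketch of that arithmetic; the name host_uid is illustrative, and it assumes the GID range uses the same offset, which should be confirmed in /etc/subgid.

```shell
# Offset of the remapped user namespace, as stated above.
USERNS_OFFSET=165536

# Host-side UID that owns files written by the given in-container UID.
host_uid() {
    echo $((USERNS_OFFSET + $1))
}
```

For example, files written by container UID 1000 appear on the host as owned by the UID that host_uid 1000 prints, which is the value to use with chown on a volume directory.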

For deploy workflow, see operations/deploy.md.

Restart Policies

All long-running services should have restart: unless-stopped. This ensures:

  • Auto-restart after a crash or host reboot
  • The service stays stopped after a manual docker compose stop (for maintenance)

Exceptions (intentionally no restart):

  • Init containers (e.g., kopano_ssl, fix-permissions): run once, then exit
  • kopano_scheduler: restart: "no" by design
  • Woodpecker wp_* containers: ephemeral CI workflow runners
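
The policy can be audited by filtering docker inspect output. This is a minimal sketch: the function name is illustrative, and matching the exceptions by substring assumes the container names contain the service names listed above.

```shell
# Reads "name policy" lines on stdin and prints every container whose
# restart policy is not unless-stopped, skipping the documented exceptions.
restart_policy_violations() {
    while read -r name policy; do
        name=${name#/}    # docker inspect prefixes names with "/"
        case "$name" in
            *kopano_ssl*|*fix-permissions*|*kopano_scheduler*|*wp_*) continue ;;
        esac
        [ "$policy" = "unless-stopped" ] || echo "$name $policy"
    done
}
```

One way to feed it on phil-app is sudo docker ps -aq | xargs -r sudo docker inspect --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' | restart_policy_violations. An empty result means every remaining container follows the policy.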

Container Resource Limits

The Docker daemon config sets global log rotation (25MB × 5 files, compressed, non-blocking). Per-container limits are set via deploy.resources.limits in the compose files.

Policy: All long-running containers should have memory limits at 3-5× current usage. See pitfalls.md for guidance.
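
The 3-5× rule is simple arithmetic. The helper below is a sketch that turns a measured usage in MiB into the suggested range; the function name is illustrative.

```shell
# Print the lower and upper suggested memory limit (MiB) for a container
# currently using $1 MiB, following the 3-5x policy.
mem_limit_range() {
    echo "$(( $1 * 3 )) $(( $1 * 5 ))"
}
```

Current usage can be read with sudo docker stats --no-stream --format '{{.Name}} {{.MemUsage}}'. A container using about 200 MiB would get a limit between the two values printed by mem_limit_range 200.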

Docker Housekeeping

Run periodically to reclaim disk space:

# Check reclaimable space first:
sudo docker system df
# Prune unused images (not just dangling):
sudo docker image prune -a
# Prune dangling volumes (anonymous only):
sudo docker volume prune -f
# Named dangling volumes must be removed individually:
sudo docker volume ls --filter dangling=true
sudo docker volume rm <name>

Automated: the docker-prune systemd timer, deployed by the Ansible docker_host role, runs weekly (Sunday 03:00).

Port Security — Docker Bypasses Shorewall

Docker inserts its own iptables DOCKER chain before Shorewall rules. Any ports: mapping with 0.0.0.0 (the default) is publicly accessible regardless of firewall config.

Rules for port mappings:

  • Services only accessed Docker-internally: no port mapping at all
  • Services accessed from phil-db via WireGuard: bind to WireGuard IP 10.42.10.4 (e.g., Loki 10.42.10.4:3100, Uptime Kuma 10.42.10.4:3001)
  • Services only accessed from the host: bind to 127.0.0.1 (e.g., Prometheus 127.0.0.1:9090, Step-CA 127.0.0.1:9000)
  • Never use 0.0.0.0 or bare port numbers
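
Violations of these rules show up in the Ports column of docker ps. The filter below is a minimal sketch that flags containers published on all interfaces; the function name is illustrative, and the pattern assumes the usual docker ps renderings of wildcard bindings (0.0.0.0:PORT, :::PORT, or [::]:PORT).

```shell
# Reads "name<TAB>ports" lines on stdin and prints the containers that
# publish a port on all interfaces (IPv4 0.0.0.0 or IPv6 ::).
public_bindings() {
    grep -E '(^|[[:space:],])(0\.0\.0\.0|\[::\]|::):[0-9]+->' || true
}
```

It can be fed with sudo docker ps --format '{{.Names}}\t{{.Ports}}' | public_bindings. Any line it prints is a mapping reachable from outside regardless of Shorewall.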

NVMe Health (phil-db)

Both NVMe SSDs are Samsung MZVL21T0HCLR (1TB) in RAID1 (md0 + md1):

  • nvme0n1: SMART Critical Warning 0x04 (per the NVMe specification, bit 2: NVM subsystem reliability degraded)
  • nvme1n1: Same warning

Hetzner confirmed this is expected at EOL and not actionable. RAID1 provides redundancy. The SmartCriticalWarning Prometheus alert excludes both devices. Remove the exclusion if disks are replaced.

Sysctl (both servers)

Managed by Ansible role sysctl_custom. Two files per host:

  • /etc/sysctl.d/50-common.conf: shared baseline (security, IPv6, TCP)
  • /etc/sysctl.d/99-ansible.conf: host-specific overrides

Unmanaged sysctl files can override Ansible settings — see pitfalls.md.
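
Files in /etc/sysctl.d are applied in lexical order, so a file that sorts after 99-ansible.conf wins for any key it sets. The sketch below lists candidates for review; the function name is illustrative, and it only inspects one directory, so /etc/sysctl.conf and other sysctl.d locations still need a manual look.

```shell
# List *.conf files in the given sysctl.d directory that are not one of
# the two Ansible-managed files.
unmanaged_sysctl_files() {
    for f in "${1:-/etc/sysctl.d}"/*.conf; do
        [ -e "$f" ] || continue    # glob matched nothing
        case "${f##*/}" in
            50-common.conf|99-ansible.conf) ;;
            *) echo "$f" ;;
        esac
    done
}
```

Running unmanaged_sysctl_files with no argument on either server prints the files to compare against the Ansible-managed settings.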

Logrotate (phil-db)

Debian can leave /etc/logrotate.d/inetutils-syslogd behind, duplicating entries from /etc/logrotate.d/rsyslog. This causes logrotate to fail silently.

# Diagnose:
logrotate -d /etc/logrotate.conf 2>&1 | grep -i 'error\|duplicate'
# Fix:
rm /etc/logrotate.d/inetutils-syslogd

DNS Resolution in Docker — CoreDNS

  • CoreDNS container: internal-dns on traefik-resolver network at 172.21.0.53
  • Services using dns: [172.21.0.53] in compose rely on CoreDNS for hostname resolution (e.g., phil-db → 10.42.10.3)
  • Testing DNS from the host fails; test from inside a container:

sudo docker run --rm --network traefik-resolver alpine sh -c 'nslookup phil-db 172.21.0.53'