Operational Runbooks¶
Global procedures for the philipp.info infrastructure. Service-specific runbooks are in the service docs.
Health Check Procedure¶
When diagnosing issues, run these checks in parallel:
# On both servers (via SSH, not Ansible — avoids vault issues):
sudo systemctl --failed # Failed systemd units
sudo docker ps --filter health=unhealthy # Unhealthy containers (phil-app)
sudo docker ps --filter status=restarting # Restart loops (phil-app)
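One way to run the checks above in parallel from a single shell is sketched below. The helper name run_checks is hypothetical. It assumes sudo credentials are already cached (for example via sudo -v), so the background jobs do not prompt at the same time.

```shell
#!/bin/sh
# run_checks (hypothetical helper): start every command passed as an argument
# in the background, wait for all of them, then print each command's output
# in argument order so the results do not interleave.
run_checks() (
  tmpdir=$(mktemp -d) || exit 1
  i=0
  for cmd in "$@"; do
    sh -c "$cmd" >"$tmpdir/$i.out" 2>&1 &
    i=$((i + 1))
  done
  wait
  i=0
  for cmd in "$@"; do
    printf '== %s ==\n' "$cmd"
    cat "$tmpdir/$i.out"
    i=$((i + 1))
  done
  rm -rf "$tmpdir"
)

# On a server, the three checks would be passed like this:
# run_checks 'sudo systemctl --failed' \
#            'sudo docker ps --filter health=unhealthy' \
#            'sudo docker ps --filter status=restarting'
```

Buffering each command's output in a temporary file keeps the report readable even when the commands finish in a different order.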
Failure Cascades¶
These services have dependency chains. Fix upstream first:
Step-CA cert expired
→ Keycloak has no valid TLS cert
→ Keycloak OIDC discovery returns 404
→ Synapse (Matrix) fails to load OIDC provider → restart loop
→ Paperless OIDC login broken
Rule: When Keycloak or LDAP is down, check Step-CA certs first.
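The "fix upstream first" rule can be sketched as a small helper that walks the chain in order and names the first failing link. The helper name first_failure, the certificate path and the URL in the usage comment are placeholders, not values from this infrastructure.

```shell
#!/bin/sh
# first_failure (hypothetical helper): take "name=command" pairs in
# upstream-to-downstream order and print the name of the first pair whose
# command fails, or "none" if every command succeeds.
first_failure() {
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    cmd=${pair#*=}     # text after the first "="
    if ! sh -c "$cmd" >/dev/null 2>&1; then
      echo "$name"
      return 1
    fi
  done
  echo "none"
}

# Hypothetical usage; path and URL are placeholders:
# first_failure \
#   'step-ca-cert=openssl x509 -checkend 0 -noout -in /path/to/keycloak.crt' \
#   'keycloak-oidc=curl -fsS https://keycloak.example/realms/REALM/.well-known/openid-configuration'
```

Because the pairs are checked in dependency order, the name printed is the most upstream failure and therefore the one to fix first.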
Internal PKI — Step-CA (stack/secure/)¶
Step-CA issues short-lived certificates (24h) renewed automatically by cert-renewer@.timer systemd units.
- Step-CA runs as Docker container secure-ca-1 (image: smallstep/step-ca), listening on 127.0.0.1:9000
- Root CA: /etc/step-ca/certs/root_ca.crt (valid until 2033)
- CA data volume: ca_files
For issuing a new certificate when expired, see services/identity.md.
Stale systemd Units¶
- dirsrv@dieholzers.service on phil-db: 389-ds LDAP relic (LDAP now runs in Docker on phil-app). Safe to reset-failed.
- dkms.service on phil-app: package not installed, unit reference is a relic. reset-failed clears it.
- user@1000.service on phil-db: transient user session failure. reset-failed clears it.
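Clearing these units can be sketched as follows. The helper names stale_units and reset_stale and the DRY_RUN switch are hypothetical. The unit names are the ones listed above.

```shell
#!/bin/sh
# stale_units (hypothetical helper): print the known-stale units for a host.
stale_units() {
  case "$1" in
    phil-db)  echo "dirsrv@dieholzers.service user@1000.service" ;;
    phil-app) echo "dkms.service" ;;
    *)        echo "unknown host: $1" >&2; return 1 ;;
  esac
}

# reset_stale (hypothetical helper): clear the failed state of each stale unit
# for the given host. With DRY_RUN set to a non-empty value, the commands are
# printed instead of executed.
reset_stale() {
  units=$(stale_units "$1") || return 1
  for unit in $units; do
    ${DRY_RUN:+echo} sudo systemctl reset-failed "$unit"
  done
}

# Preview on phil-db:   DRY_RUN=1; reset_stale phil-db
```

Run it on the matching host. The host argument only selects which units to reset and does not connect anywhere.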
Docker on phil-app¶
- Compose stacks live at /opt/docker/{stack-name}/ on the server
- Local repo equivalent: stack/{stack-name}/
- Docker uses userns-remap (UID namespace offset 165536); file ownership inside volumes uses remapped UIDs
- Port 9000 on localhost is Step-CA (via Docker port mapping), NOT php-fpm
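The UID remapping can be illustrated with a little arithmetic. This sketch assumes the offset is added directly to the container UID. The helper name host_uid is hypothetical.

```shell
#!/bin/sh
# Offset taken from the userns-remap note above.
USERNS_OFFSET=165536

# host_uid (hypothetical helper): translate a UID as seen inside a container
# to the UID that owns the files on the host.
host_uid() {
  echo $((USERNS_OFFSET + $1))
}

# A process running as UID 1000 inside a container:
host_uid 1000   # prints 166536
```

Under that assumption, files written by container root (UID 0) appear on the host as owned by UID 165536.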
For deploy workflow, see operations/deploy.md.
Restart Policies¶
All long-running services should have restart: unless-stopped. This ensures:
- Auto-restart after crash or host reboot
- Stays stopped after manual docker compose stop (for maintenance)
Exceptions (intentionally no restart):
- Init containers (e.g., kopano_ssl, fix-permissions) — run once then exit
- kopano_scheduler — restart: "no" by design
- Woodpecker wp_* containers — ephemeral CI workflow runners
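A rough audit of this policy could look like the sketch below. The function name audit_restart is hypothetical, and the name patterns used for the exceptions are assumptions about how the containers are named. It reads "name policy" lines from standard input.

```shell
#!/bin/sh
# audit_restart (hypothetical helper): read "name policy" lines on stdin and
# print every container whose restart policy is not unless-stopped, skipping
# the documented exceptions.
audit_restart() {
  while read -r name policy; do
    name=${name#/}   # docker inspect prefixes container names with "/"
    case "$name" in
      *kopano_ssl*|*fix-permissions*|*kopano_scheduler*|wp_*) continue ;;
    esac
    if [ "$policy" != "unless-stopped" ]; then
      echo "$name $policy"
    fi
  done
}

# Possible input on phil-app:
# sudo docker inspect --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' \
#   $(sudo docker ps -aq) | audit_restart
```

An empty result means every non-exempt container already uses unless-stopped.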
Container Resource Limits¶
Docker daemon config sets global log rotation (25MB × 5 files, compressed, non-blocking). Per-container resource limits are set via deploy.resources.limits in compose files.
Policy: All long-running containers should have memory limits at 3-5× current usage. See pitfalls.md for guidance.
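The 3-5× rule is plain arithmetic. This sketch assumes the current usage is given as a whole number of MiB. The helper name limit_range is hypothetical.

```shell
#!/bin/sh
# limit_range (hypothetical helper): given current memory usage in MiB,
# print the lower (3x) and upper (5x) bound for the container memory limit.
limit_range() {
  echo "$(( $1 * 3 ))-$(( $1 * 5 )) MiB"
}

# A container currently using about 200 MiB:
limit_range 200   # prints 600-1000 MiB

# Current usage can be read with:
# sudo docker stats --no-stream --format '{{.Name}} {{.MemUsage}}'
```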
Docker Housekeeping¶
Run periodically to reclaim disk space:
# Check reclaimable space first:
sudo docker system df
# Prune unused images (not just dangling):
sudo docker image prune -a
# Prune dangling volumes (anonymous only):
sudo docker volume prune -f
# Named dangling volumes must be removed individually:
sudo docker volume ls --filter dangling=true
sudo docker volume rm <name>
Automated: a docker-prune systemd timer, deployed by the Ansible docker_host role, runs weekly (Sunday 03:00).
Port Security — Docker Bypasses Shorewall¶
Docker inserts its own iptables DOCKER chain before Shorewall rules. Any ports: mapping with 0.0.0.0 (the default) is publicly accessible regardless of firewall config.
Rules for port mappings:
- Services only accessed Docker-internally: no port mapping at all
- Services accessed from phil-db via WireGuard: bind to WireGuard IP 10.42.10.4 (e.g., Loki 10.42.10.4:3100, Uptime Kuma 10.42.10.4:3001)
- Services only accessed from the host: bind to 127.0.0.1 (e.g., Prometheus 127.0.0.1:9090, Step-CA 127.0.0.1:9000)
- Never use 0.0.0.0 or bare port numbers
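To spot mappings that break these rules, a filter along the following lines may help. The function name exposed_ports is hypothetical. It assumes input lines of the form "name ports" as printed by docker ps with a custom format.

```shell
#!/bin/sh
# exposed_ports (hypothetical helper): read "name ports" lines on stdin and
# print those with a mapping bound to all interfaces (0.0.0.0, ::: or [::]).
exposed_ports() {
  grep -E '(^|[ ,])(0\.0\.0\.0|::|\[::\]):[0-9]+->'
}

# Possible input on phil-app:
# sudo docker ps --format '{{.Names}} {{.Ports}}' | exposed_ports
```

Mappings bound to 127.0.0.1 or the WireGuard IP do not match, so an empty result suggests no container publishes a port on all interfaces.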
NVMe Health (phil-db)¶
Both NVMe SSDs are Samsung MZVL21T0HCLR (1TB) in RAID1 (md0 + md1):
- nvme0n1: SMART Critical Warning 0x04 (volatile memory backup degraded)
- nvme1n1: Same warning
Hetzner confirmed this is expected at EOL and not actionable. RAID1 provides redundancy. The SmartCriticalWarning Prometheus alert excludes both devices. Remove the exclusion if disks are replaced.
Sysctl (both servers)¶
Managed by Ansible role sysctl_custom. Two files per host:
- /etc/sysctl.d/50-common.conf — shared baseline (security, IPv6, TCP)
- /etc/sysctl.d/99-ansible.conf — host-specific overrides
Unmanaged sysctl files can override Ansible settings — see pitfalls.md.
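Within one directory, files are applied in lexical order, so the last file that sets a key wins. The sketch below finds that file. The helper name last_setter is hypothetical, and it only looks at a single directory, whereas sysctl --system also reads other locations.

```shell
#!/bin/sh
# last_setter (hypothetical helper): print the file in DIR (default
# /etc/sysctl.d) that sets KEY last in lexical order, i.e. the file whose
# value takes effect within that directory.
last_setter() (
  key=$1
  dir=${2:-/etc/sysctl.d}
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')   # match dots literally
  winner=
  for f in "$dir"/*.conf; do
    [ -f "$f" ] || continue
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$f"; then
      winner=$f
    fi
  done
  [ -n "$winner" ] && echo "$winner"
)

# Example: last_setter net.ipv4.ip_forward
```

If the result is neither 50-common.conf nor 99-ansible.conf, an unmanaged file is overriding the Ansible-managed value for that key.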
Logrotate (phil-db)¶
Debian can leave /etc/logrotate.d/inetutils-syslogd behind, duplicating entries from /etc/logrotate.d/rsyslog. This causes logrotate to fail silently.
# Diagnose (debug mode, changes nothing):
sudo logrotate -d /etc/logrotate.conf 2>&1 | grep -i 'error\|duplicate'
# Fix:
sudo rm /etc/logrotate.d/inetutils-syslogd
DNS Resolution in Docker — CoreDNS¶
- CoreDNS container: internal-dns on traefik-resolver network at 172.21.0.53
- Services using dns: [172.21.0.53] in compose rely on CoreDNS for hostname resolution (e.g., phil-db → 10.42.10.3)
- Testing DNS from the host fails; test from inside a container:
sudo docker run --rm --network traefik-resolver alpine sh -c 'nslookup phil-db 172.21.0.53'