Known Pitfalls

Cross-cutting gotchas that affect multiple services or the infrastructure as a whole.

Service-specific pitfalls have moved to service docs:

- mailcow → services/mail.md
- PHP/Friendica → services/friendica-opensocial.md
- Identity/Certificates → services/identity.md
- Matrix/MAS → services/matrix.md
- Monitoring/Alerts → services/monitoring.md
- Backups/Borgmatic → services/backup.md
- Traefik/DNS/Networking → services/network.md


Networking

host-gateway / host.docker.internal broken with Shorewall

Docker's extra_hosts: host.docker.internal:host-gateway resolves to the docker0 bridge IP (172.255.255.1). Containers on custom bridge networks (192.168.x.x) cannot reach this IP because Shorewall's fw→dock OUTPUT chain only allows RELATED,ESTABLISHED — the SYN-ACK for new connections gets dropped.

Fix: Use the gateway IP of a shared network the container is already on (e.g., host.docker.internal:172.24.0.1 for containers on prometheus-network). The gateway IPs are stable because all networks are statically defined in stack/.config/create-docker-networks.sh. See ADR-003.
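A minimal sketch of looking up that gateway IP instead of copying it by hand. The function names are hypothetical, prometheus-network is the example network from the fix above, and the Go template assumes the network has at least one IPAM config entry.

```shell
#!/usr/bin/env bash

# Print the gateway IP of a Docker network (first IPAM config entry).
network_gateway() {
  docker network inspect --format '{{(index .IPAM.Config 0).Gateway}}' "$1"
}

# Print an extra_hosts value that pins host.docker.internal to that gateway.
extra_hosts_entry() {
  local gw
  gw="$(network_gateway "$1")" || return 1
  [ -n "$gw" ] || return 1
  printf 'host.docker.internal:%s\n' "$gw"
}

# Usage (requires access to the Docker daemon):
#   extra_hosts_entry prometheus-network
```

The printed value can be pasted into the container's extra_hosts list in place of host.docker.internal:host-gateway.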

Forgejo runner step containers need a non-internal network for internet access

traefik-resolver is created with --internal and has no internet egress. Step containers attached only to this network cannot reach Docker Hub, ghcr.io, etc. (ENETUNREACH).

Fix: Set container.network in runner.yaml to compose-forgejo-runner-network. Add the resolver network to the runner container itself for DNS, but step containers use the compose-*-network which has normal internet routing.

Also add --add-host=git.opensocial.at:192.168.208.1 via container.options so step containers can resolve the Forgejo host. Do not use host-gateway.

Testing DNS from the host fails

CoreDNS listens only on the Docker bridge network, so running dig @172.21.0.53 phil-db from the host fails with "connection refused". Test from inside a Docker container on the same network instead:

sudo docker run --rm --network traefik-resolver alpine sh -c 'nslookup phil-db 172.21.0.53'

Port security — Docker bypasses Shorewall

Docker inserts its own iptables DOCKER chain before Shorewall rules. Any ports: mapping bound to 0.0.0.0 (the default) is publicly accessible regardless of firewall config. Never use 0.0.0.0 or a mapping without a host IP (e.g., 9501:9501); bind to a specific address instead (e.g., 127.0.0.1:9501:9501).
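The rule above can be checked mechanically. The following is a rough sketch, not an existing tool: classify_port_mapping is a hypothetical helper that inspects one short-syntax ports: entry and reports whether the host side binds all interfaces. It assumes IPv4-style entries and only special-cases the [::] wildcard for IPv6.

```shell
#!/usr/bin/env bash

# Classify a compose "ports:" short-syntax entry.
# Prints "public" when the host side binds all interfaces, "restricted" otherwise.
classify_port_mapping() {
  local mapping="${1%/*}"            # strip an optional /tcp or /udp suffix
  local fields

  case "$mapping" in
    '[::]:'*) echo "public"; return 0 ;;   # IPv6 wildcard address
  esac

  IFS=':' read -r -a fields <<< "$mapping"
  case "${#fields[@]}" in
    1|2) echo "public" ;;            # "9501" or "9501:9501": no host IP given
    *)
      case "${fields[0]}" in
        0.0.0.0|'') echo "public" ;;
        *) echo "restricted" ;;
      esac
      ;;
  esac
}

# classify_port_mapping '9501:9501'             # prints "public"
# classify_port_mapping '127.0.0.1:9501:9501'   # prints "restricted"
```

Note that "restricted" only means a host IP other than the wildcard was given; whether that address is itself reachable from outside still needs a manual check.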

Docker

Forgejo runner v9: container.extra_hosts field removed — use container.options

Runner config schema changed between v3 and v9. The container.extra_hosts field was removed. Using it causes a Go nil-pointer panic at startup. Move host entries to container.options as --add-host flags.
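Before upgrading, it may help to scan the runner config for the removed key. This is a minimal sketch; has_legacy_extra_hosts is a hypothetical helper, and it matches any extra_hosts: key in the file rather than parsing YAML, so it cannot tell which section the key belongs to.

```shell
#!/usr/bin/env bash

# Print (with line numbers) every extra_hosts: key in the given config file.
# Exit status 0 means at least one was found, 1 means none.
has_legacy_extra_hosts() {
  grep -nE '^[[:space:]]*extra_hosts[[:space:]]*:' "$1"
}

# Usage:
#   if has_legacy_extra_hosts runner.yaml; then
#     echo "move these entries to container.options as --add-host flags" >&2
#   fi
```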

Auto-deploy via SSH from Actions runner: avoid for infrastructure repos

Wiring a Forgejo Actions workflow to SSH into the host and run sudo git pull && docker compose up -d is tempting, but the attack surface is significant: a compromised repo push triggers unrestricted root-equivalent actions on the host. For infrastructure repos, manual SSH deploy after merge review is the right posture.

docker volume prune only removes anonymous volumes

Named dangling volumes (e.g., mail_data, opensocial_storage) are not pruned automatically. List them with docker volume ls --filter dangling=true and remove individually after verifying they're no longer needed.
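A small sketch for separating named dangling volumes from anonymous ones. list_named_dangling_volumes is a hypothetical helper; it assumes anonymous volumes are exactly those whose name is a 64-character lowercase hex ID, and it only lists, never deletes.

```shell
#!/usr/bin/env bash

# List dangling volumes whose name is not a 64-character hex ID.
# Exit status 1 means no named dangling volumes were found.
list_named_dangling_volumes() {
  docker volume ls --quiet --filter dangling=true | grep -Ev '^[0-9a-f]{64}$'
}

# Usage (requires access to the Docker daemon):
#   list_named_dangling_volumes
#   docker volume rm <name>    # only after verifying the volume is unused
```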

Server has uncommitted local changes

The server's /opt/docker/ checkout often has local changes (config tuning, cert renewals, manual edits). A git pull aborts if incoming commits touch locally modified files. Use git stash && git pull && git stash pop, but watch for merge conflicts during the stash pop.

Always commit server changes first

Before pushing compose changes from local, check if /opt/docker/ on the server has uncommitted work. If so, commit + push from the server first, then pull locally, then make your changes on top.
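The pre-push check can be scripted. A minimal sketch, assuming it runs on the server (or via SSH) against the /opt/docker/ checkout; has_uncommitted_changes is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Return 0 if the working tree at $1 has uncommitted or untracked changes.
has_uncommitted_changes() {
  [ -n "$(git -C "$1" status --porcelain)" ]
}

# Usage:
#   if has_uncommitted_changes /opt/docker; then
#     echo "commit and push from the server first" >&2
#   fi
```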

Port 9000 on localhost

This is Step-CA (via Docker port mapping), NOT php-fpm or any other service.

kopano-web-1 (thttpd) memory leak — OOM loop (RESOLVED)

The Kopano web service had a memory leak causing OOM-kill loops (~584 events in 6 hours), which triggered ContainerOOMKilled continuously.

Status: Resolved — Kopano stack fully stopped after mailcow migration (Feb 2026).

System

Unmanaged sysctl files override Ansible settings

Watch for:

- 99-hetzner.conf (Hetzner installimage default; sets rp_filter=1, breaks Docker)
- 70-dirsrv.conf in /usr/lib/sysctl.d/ (389-ds/LDAP relic; overrides vm.swappiness)
- 90-local.conf, 99-docker-host.conf (manual leftovers)

These are cleaned up by the hardening_common role but can reappear after OS reinstall or package upgrades. Verify with sudo sysctl --system 2>&1 | grep -E 'Applying|swappiness|rp_filter'.
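To spot these files before they take effect, a quick scan of the sysctl directories can help. This is only a sketch: find_unmanaged_sysctl_files is a hypothetical helper, the file names are the ones listed above, and the directories to scan are passed as arguments (the usage line assumes /etc/sysctl.d and /usr/lib/sysctl.d).

```shell
#!/usr/bin/env bash

# Print the path of every known unmanaged sysctl file found in the given directories.
find_unmanaged_sysctl_files() {
  local dir name
  for dir in "$@"; do
    for name in 99-hetzner.conf 70-dirsrv.conf 90-local.conf 99-docker-host.conf; do
      if [ -e "$dir/$name" ]; then
        printf '%s\n' "$dir/$name"
      fi
    done
  done
  return 0
}

# Usage:
#   find_unmanaged_sysctl_files /etc/sysctl.d /usr/lib/sysctl.d
```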

Logrotate duplicate entries

Debian can leave /etc/logrotate.d/inetutils-syslogd behind, duplicating entries from /etc/logrotate.d/rsyslog. This causes logrotate to fail silently. Fix: rm /etc/logrotate.d/inetutils-syslogd.

on-failure:3 restart policy often allows too few retries

on-failure:3 gives up after 3 failed starts. If MariaDB is briefly unavailable during a backup, 3 retries may not be enough. Consider on-failure:10 or unless-stopped.

I/O

PHP OOM crash-loop I/O spikes

When a PHP worker's memory_limit is too high (4-8G), Guzzle slowly allocates memory for large HTTP responses. Each allocation generates page faults and disk I/O. When the limit is finally hit, the process crashes and restarts — but the problematic job stays in the queue, causing an infinite crash-loop. Keep memory_limit low (relative to actual task needs) so crashes are fast with minimal I/O impact. See ADR-002.

Container Resource Limits

Docker daemon config (/etc/docker/daemon.json) sets global log rotation (25MB × 5 files, compressed, non-blocking). Per-container resource limits are set via deploy.resources.limits in compose files.

Memory limit policy: All long-running containers should have deploy.resources.limits.memory set. Limits are dimensioned at 3-5× current usage to prevent OOM during load spikes while providing a safety net.
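The 3-5× rule is simple arithmetic. As a hedged illustration, the hypothetical helper below takes a measured usage in MiB and prints the lower and upper bound of the limit range. How the usage is measured is left open.

```shell
#!/usr/bin/env bash

# Print the lower (3x) and upper (5x) memory limit in MiB for a usage given in MiB.
memory_limit_range_mib() {
  local usage_mib="$1"
  printf '%d %d\n' "$((usage_mib * 3))" "$((usage_mib * 5))"
}

# memory_limit_range_mib 680   # prints "2040 3400"
```

A usage of about 680 MiB therefore lands in a range of roughly 2 to 3.3 GiB.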

Monitoring containers to watch:

- Alloy (itop-alloy-1): limit 2 GiB
- Loki (itop-loki-1): limit 6 GiB
- cAdvisor (itop-cadvisor-1): limit 2 GiB (OOM at ~680 MB with 100+ containers)
- Prometheus (itop-prometheus-1): limit 4 GiB
- Keycloak (secure-keycloak-1): limit 2 GiB (Java, runs ~770 MiB)