Two ASUS NUC 15 Pro+ units. 96GB RAM each, 4TB NVMe each. Proxmox cluster. Everything provisioned by Terraform, configured by Ansible, deployed on push to main. No web UI clicks.
The goal was not "another homelab." It was applying production IaC practices to hardware I own, and seeing where the abstraction breaks down.
LXC Over Kubernetes
I ran k8s professionally for years. For a homelab running 15 services -- Jellyfin, Nextcloud, Sonarr/Radarr, Paperless-ngx, Mealie, plus a full Prometheus/Grafana/Loki monitoring stack -- Kubernetes is the wrong tool. The operational overhead of etcd, control planes, and YAML sprawl buys you nothing at this scale.
LXC containers give you the isolation that matters: sub-5-second boot times, per-container snapshots, dedicated resource limits. Without the orchestration tax.
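The snapshot workflow is the part I use most. A hypothetical sketch with Proxmox's `pct` tool (container IDs and snapshot names are illustrative):

```shell
# Per-service snapshot before a risky upgrade
pct snapshot 101 pre-upgrade      # instant, per-container snapshot
pct start 101                     # boots in a few seconds

# If the upgrade breaks something, revert just this one service
pct rollback 101 pre-upgrade
```

Rolling back one container touches nothing else on the node, which is exactly the isolation a shared Docker host never gave me.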
The Pipeline
Terraform provisions containers declaratively. Ansible configures them. A GitHub Actions runner -- itself in a dedicated LXC container on the NUC -- applies changes on merge to main. Clean on paper.
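The Terraform side looks roughly like this. This is a sketch using the community Telmate Proxmox provider; the resource name, attributes, and all values here (node name, template, sizes) are assumptions that depend on your provider version and environment:

```hcl
# Sketch only -- attribute names vary across Proxmox provider versions.
resource "proxmox_lxc" "nextcloud" {
  target_node  = "nuc1"
  hostname     = "nextcloud"
  ostemplate   = "local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst"
  cores        = 2
  memory       = 4096
  swap         = 0
  unprivileged = true

  rootfs {
    storage = "local-zfs"
    size    = "16G"
  }

  network {
    name   = "eth0"
    bridge = "vmbr0"
    ip     = "dhcp"
  }
}
```

Ansible then takes over inside the container: package installs, service config, secrets.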
In practice, every layer fought back.
Where It Broke
Tailscale DNS in LXC is fundamentally broken. The Tailscale daemon rewrites /etc/resolv.conf. The container's network config overwrites it again on restart. DNS dies. The fix is chattr +i /etc/resolv.conf -- making it immutable. It is ugly. It has not failed in six months.
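For concreteness, the whole fix is two steps, run as root inside the container (the nameserver choices here are illustrative; `100.100.100.100` is Tailscale's MagicDNS resolver):

```shell
# Write the resolver config once, by hand
cat > /etc/resolv.conf <<'EOF'
nameserver 100.100.100.100
nameserver 1.1.1.1
EOF

# Make the file immutable -- every later rewrite attempt,
# whether from tailscaled or the container's network scripts, fails
chattr +i /etc/resolv.conf
```

To change DNS later you have to remember to run `chattr -i` first, which is the price of the hack.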
192GB RAM across the cluster is not generous. Fifteen services sit at 65% utilization (125GB), and Jellyfin transcoding spikes to 4GB alone. Without hard memory limits on every container, one spike cascades into OOM kills across the node. I now set explicit ceilings on everything and disable swap entirely.
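Setting the ceilings is a one-liner per container with `pct set` (container IDs and sizes below are illustrative, not my actual allocation):

```shell
# Hard memory ceiling, swap off -- a spike OOM-kills inside the
# container instead of cascading across the node
pct set 101 --memory 8192 --swap 0    # Jellyfin: headroom for transcode spikes
pct set 102 --memory 4096 --swap 0    # Nextcloud
pct set 103 --memory 2048 --swap 0    # Prometheus
```

These land in the container config, so they survive restarts and show up in version-controlled Terraform state if you provision the same values there.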
Backup storms. Running all backups at 2 AM created IO contention that spiked Nextcloud response times past timeout thresholds. Staggering across a 3-hour window fixed it. I only caught it because Prometheus was alerting on p99 latency.
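The stagger itself is just scheduling. A sketch of what the Proxmox backup job config can look like (job names, VMIDs, and storage are illustrative; the `jobs.cfg` format depends on your Proxmox version):

```shell
# /etc/pve/jobs.cfg (illustrative) -- three jobs spread across 3 hours
# instead of one 2 AM stampede
vzdump: backup-media
        schedule 02:00
        vmid 101,102
        storage pbs
        mode snapshot

vzdump: backup-apps
        schedule 03:00
        vmid 103,104
        storage pbs
        mode snapshot

vzdump: backup-infra
        schedule 04:00
        vmid 105,106
        storage pbs
        mode snapshot
```

Same total backup volume, a third of the concurrent IO.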
Proxmox Backup Server does not belong in LXC. PBS needs direct block device access for efficient incremental backups. In an LXC container, it silently degrades to slow full-image reads. I burned a weekend before moving it into a lightweight VM.
What I Got Wrong
Monitoring should be the first service deployed, not the last. Half the problems above would have been caught on day one with Prometheus in place. I deployed it after the fact and immediately found three issues I had been ignoring.
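The alert that caught the backup storm was a p99 latency rule. A hedged sketch (the metric name and threshold are assumptions that depend on your exporter; the `histogram_quantile` pattern is standard PromQL):

```yaml
# Illustrative Prometheus alerting rule
groups:
  - name: homelab
    rules:
      - alert: NextcloudSlowP99
        expr: >
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="nextcloud"}[5m]))
            by (le)) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nextcloud p99 request latency above 5s for 10 minutes"
```

Without a rule like this, the backup storm would have looked like random slowness instead of a nightly pattern.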
Five VLANs on day one was over-engineering. Management, DMZ, internal, IoT, lab -- the firewall rules between them created debugging complexity that was not justified at homelab scale. Two VLANs (trusted, untrusted) would have been sufficient to start. Add segmentation when you have a real threat model, not a theoretical one.
Backup restoration needs a schedule. I can rebuild the entire infrastructure from scratch in under 2 hours. But only because I tested it. The first restoration test revealed my Restic backups to Backblaze B2 were silently skipping a critical Nextcloud database dump.
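A restore drill can be as small as three commands. Sketch only -- the repository name, paths, and the dump location are all illustrative, and this assumes Restic credentials are already in the environment:

```shell
# Pull the latest snapshot into a scratch directory
restic -r b2:homelab-backups:/restic snapshots --latest 1
restic -r b2:homelab-backups:/restic restore latest --target /tmp/restore-drill

# The check that would have caught my silent failure:
# fail loudly if the Nextcloud DB dump is missing or empty
test -s /tmp/restore-drill/var/backups/nextcloud-db.sql \
  || echo "MISSING: nextcloud DB dump"
```

A backup you have never restored is a hope, not a backup.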
The Numbers
Six months in: 99.94% uptime (one planned maintenance window). 65% RAM utilization, 15% CPU average. 70W idle, 170W under load. Monthly cost: $19 for electricity and Backblaze B2.
Equivalent cloud infrastructure runs $200+/month. The hardware paid for itself in four months.
The Actual Lesson
The single biggest improvement was migrating from a monolithic Docker host to dedicated LXC containers. Jellyfin transcoding no longer tanks Nextcloud. Snapshots are per-service. Logs and metrics are scoped. Failures are contained.
The best infrastructure disappears. This setup runs without me thinking about it, which was the entire point of building it.