Jonathan Haas

From Consumer NUC to Production-Grade Homelab: My Journey with Proxmox and Infrastructure as Code

July 18, 2025 · 3 min read

How I transformed two ASUS NUC 15 Pro+ machines into an enterprise-grade homelab using Proxmox, Terraform, Ansible, and 100% Infrastructure as Code

#homelab #proxmox #infrastructure-as-code #devops #automation

Two ASUS NUC 15 Pro+ units. 96GB RAM each, 4TB NVMe each. Proxmox cluster. Everything provisioned by Terraform, configured by Ansible, deployed on push to main. No web UI clicks.

The goal was not "another homelab." It was applying production IaC practices to hardware I own, and seeing where the abstraction breaks down.

LXC Over Kubernetes

I ran k8s professionally for years. For a homelab running 15 services -- Jellyfin, Nextcloud, Sonarr/Radarr, Paperless-ngx, Mealie, plus a full Prometheus/Grafana/Loki monitoring stack -- Kubernetes is the wrong tool. The operational overhead of etcd, control planes, and YAML sprawl buys you nothing at this scale.

LXC containers give you the isolation that matters: sub-5-second boot times, per-container snapshots, dedicated resource limits. Without the orchestration tax.
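Those limits live in a plain text file per container, which is what makes them so easy to manage declaratively. A sketch of what such a config might look like (the container ID and values are illustrative, not my actual settings):

```
# /etc/pve/lxc/110.conf -- illustrative ID and values
arch: amd64
cores: 4
# hard memory ceiling in MB; overruns OOM-kill inside this container, not across the node
memory: 8192
# swap disabled per container
swap: 0
hostname: jellyfin
net0: name=eth0,bridge=vmbr0,ip=dhcp
ostype: debian
rootfs: local-lvm:vm-110-disk-0,size=32G
unprivileged: 1
```

Snapshots and rollbacks then operate on exactly this unit: one service, one config, one filesystem.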

The Pipeline

Terraform provisions containers declaratively. Ansible configures them. A GitHub Actions runner -- itself in a dedicated LXC container on the NUC -- applies changes on merge to main. Clean on paper.
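A minimal sketch of that trigger, assuming a conventional repo layout (the paths, playbook name, and workflow are hypothetical, not my actual files):

```yaml
# .github/workflows/deploy.yml -- illustrative sketch
name: deploy
on:
  push:
    branches: [main]
jobs:
  apply:
    # the runner lives in a dedicated LXC container on the cluster
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Provision containers
        run: |
          terraform -chdir=terraform init -input=false
          terraform -chdir=terraform apply -auto-approve -input=false
      - name: Configure containers
        run: ansible-playbook -i inventory/hosts.yml site.yml
```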

In practice, every layer fought back.

Where It Broke

Tailscale DNS in LXC is fundamentally broken. The Tailscale daemon rewrites /etc/resolv.conf. The container's network config overwrites it again on restart. DNS dies. The fix is chattr +i /etc/resolv.conf -- making it immutable. It is ugly. It has not failed in six months.
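Since Ansible owns container configuration anyway, the fix can be encoded idempotently there. A sketch (the task name and placement are my assumption; the module parameter maps to chattr):

```yaml
# Pin resolv.conf after Tailscale has written it, so restarts cannot clobber it
- name: Make /etc/resolv.conf immutable (chattr +i)
  ansible.builtin.file:
    path: /etc/resolv.conf
    attributes: +i
```

The trade-off is explicit: any intentional DNS change now requires clearing the flag first (`chattr -i`), which is exactly the friction you want.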

192GB of RAM across the cluster is not generous. Fifteen services sit at 65% utilization (about 125GB). Jellyfin transcoding spikes to 4GB alone. Without hard memory limits on every container, one spike cascades into OOM kills across the node. I now set explicit ceilings on everything and disable swap entirely.

Backup storms. Running all backups at 2 AM created IO contention that spiked Nextcloud response times past timeout thresholds. Staggering across a 3-hour window fixed it. I only caught it because Prometheus was alerting on p99 latency.
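The stagger itself can be as boring as cron entries calling vzdump (container IDs, grouping, and times below are illustrative):

```
# /etc/cron.d/vzdump-staggered -- spread backups across a 3-hour window
# instead of one 2 AM thundering herd
0 1 * * * root vzdump 101 102 103 --storage pbs --mode snapshot --quiet 1
0 2 * * * root vzdump 104 105 106 --storage pbs --mode snapshot --quiet 1
0 3 * * * root vzdump 107 108 109 --storage pbs --mode snapshot --quiet 1
```

Grouping IO-heavy services (Nextcloud, Jellyfin) into different windows matters more than the exact times.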

Proxmox Backup Server does not belong in LXC. PBS needs direct block device access for efficient incremental backups. In an LXC container, it silently degrades to slow full-image reads. I burned a weekend before moving it into a lightweight VM.

What I Got Wrong

Monitoring should be the first service deployed, not the last. Half the problems above would have been caught on day one with Prometheus in place. I deployed it after the fact and immediately found three issues I had been ignoring.

Five VLANs on day one was over-engineering. Management, DMZ, internal, IoT, lab -- the firewall rules between them created debugging complexity that was not justified at homelab scale. Two VLANs (trusted, untrusted) would have been sufficient to start. Add segmentation when you have a real threat model, not a theoretical one.

Backup restoration needs a schedule. I can rebuild the entire infrastructure from scratch in under 2 hours. But only because I tested it. The first restoration test revealed my Restic backups to Backblaze B2 were silently skipping a critical Nextcloud database dump.

The Numbers

Six months in: 99.94% uptime (one planned maintenance window). 65% RAM utilization, 15% CPU average. 70W idle, 170W under load. Monthly cost: $19 for electricity and Backblaze B2.

Equivalent cloud infrastructure runs $200+/month. The hardware paid for itself in four months.

The Actual Lesson

The single biggest improvement was migrating from a monolithic Docker host to dedicated LXC containers. Jellyfin transcoding no longer tanks Nextcloud. Snapshots are per-service. Logs and metrics are scoped. Failures are contained.

The best infrastructure disappears. This setup runs without me thinking about it, which was the entire point of building it.

