Two-Node Multi-GPU Inference Cluster

Context

Serving capable local models reliably means living inside hard VRAM and thermal limits — and recovering gracefully when a card wedges. I built the cluster and the tooling to do exactly that.

What I built

A two-node cluster — a Windows workstation (RTX 5090) paired with a Linux agent machine (RTX 3090) — serving local models over llama.cpp / Ollama.
VRAM-bound tuning of KV-cache and context windows to fit capable context lengths inside a hard memory ceiling without wedging the fragile card.
Thermal-aware autonomy that backs off before the abort line, plus gpu_doctor.py — structured GPU diagnostics (nvidia-smi analysis, VRAM mapping, machine-consumable JSON output) for fast fault diagnosis.
A privacy-first network — SSH tunnels and scoped port routing, with the agent machine holding only read-only access to shared services.

Why it matters

It is the hardware substrate the rest of the stack runs on — and a hands-on lesson in reliability under real physical constraints.