← Back to work

Two-Node Multi-GPU Inference Cluster

Personal — Local AI Infrastructure Lab · 2024 — Present

A self-built two-node multi-GPU cluster serving local models with a privacy-first architecture, thermal-aware autonomy, and GPU-wedge diagnosis and recovery.

2 nodes
multi-GPU cluster measured
VRAM-bound
KV-cache / context tuning measured
auto-recover
GPU-wedge diagnosis measured
  • llama.cpp
  • Ollama
  • CUDA 12.x
  • Multi-GPU
  • Linux / SSH
  • KV-cache tuning

Context

Serving capable local models reliably means living inside hard VRAM and thermal limits — and recovering gracefully when a card wedges. I built the cluster and the tooling to do exactly that.

What I built

  • A two-node cluster — a Windows workstation (RTX 5090) paired with a Linux agent machine (RTX 3090) — serving local models over llama.cpp / Ollama.
  • VRAM-bound tuning of KV-cache and context windows to fit capable context lengths inside a hard memory ceiling without wedging the fragile card.
  • Thermal-aware autonomy that backs off before the abort line, plus gpu_doctor.py — structured GPU diagnostics (nvidia-smi analysis, VRAM mapping, machine-consumable JSON output) for fast fault diagnosis.
  • A privacy-first network — SSH tunnels and scoped port routing, with the agent machine holding only read-only access to shared services.

Why it matters

It is the hardware substrate the rest of the stack runs on — and a hands-on lesson in reliability under real physical constraints.