Local LLM Infrastructure
for Agentic AI Coding

Personal project
Ongoing · Mar 2025 – Present
Local GPU-based LLM serving

Run coding agents fast, private, and always available — entirely on local hardware.

Overview

Agentic coding tools are most useful when they are fast, private, and always available. This project builds a self-hosted, GPU-based LLM serving environment so that coding agents can run entirely on local hardware instead of depending on external APIs.

The constraint that makes it interesting is doing this on a single workstation-class machine, where GPU memory is the bottleneck. Most of the work is in serving configuration: fitting larger contexts and keeping latency low without a datacenter behind you.

Approach

  • Deploy and serve open LLMs locally with vLLM on an NVIDIA DGX Spark.
  • Tune serving configurations — KV-cache quantization and prefix caching — to fit larger contexts and improve throughput under constrained GPU memory.
  • Wire the local endpoint into agentic coding workflows as a drop-in replacement for hosted APIs.

Status

Ongoing. I am iterating on serving configuration for stability and latency, and folding the local endpoint into my day-to-day agentic coding setup.