Overview
Agentic coding tools are most useful when they are fast, private, and always available. This project builds a self-hosted, GPU-based LLM serving environment so that coding agents can run entirely on local hardware instead of depending on external APIs.
The constraint that makes it interesting is doing this on a single workstation-class machine, where GPU memory is the bottleneck. Most of the work is in serving configuration: fitting larger contexts and keeping latency low without a datacenter behind you.
Approach
- Deploy and serve open LLMs locally with vLLM on an NVIDIA DGX Spark.
- Tune serving configurations — KV-cache quantization and prefix caching — to fit larger contexts and improve throughput under constrained GPU memory.
- Wire the local endpoint into agentic coding workflows as a drop-in replacement for hosted APIs.
Status
Ongoing. I am iterating on serving configuration for stability and latency, and folding the local endpoint into my day-to-day agentic coding setup.