Local LLM Infrastructure
for Agentic AI Coding

Dong-Won Lee

Personal project

Ongoing · Mar 2025 – Present

Run coding agents fast, private, and always available — entirely on local hardware.

Overview

Agentic coding tools are most useful when they are fast, private, and always available. This project builds a self-hosted, GPU-based LLM serving environment so that coding agents can run entirely on local hardware instead of depending on external APIs.

The constraint that makes it interesting is doing this on a single workstation-class machine, where GPU memory is the bottleneck. Most of the work is in serving configuration: fitting larger contexts and keeping latency low without a datacenter behind you.

Approach

Deploy and serve open LLMs locally with vLLM on an NVIDIA DGX Spark.
Tune serving configurations — KV-cache quantization and prefix caching — to fit larger contexts and improve throughput under constrained GPU memory.
Wire the local endpoint into agentic coding workflows as a drop-in replacement for hosted APIs.

Status

Ongoing. I am iterating on serving configuration for stability and latency, and folding the local endpoint into my day-to-day agentic coding setup.

Local LLM Infrastructurefor Agentic AI Coding

Run coding agents fast, private, and always available — entirely on local hardware.

Overview

Approach

Status

Local LLM Infrastructure
for Agentic AI Coding