Self-Host LLMs with vLLM: A Production-Grade OpenAI-Compatible Server

You do not need OpenAI to run a serious LLM. You do not even need to pay per token. With a single GPU box and vLLM, you can host Llama, Qwen, Mistral, or any of 200+ supported open models behind an API that speaks the OpenAI protocol. Drop it into any tool that already calls /v1/chat/completions and it just works.

The numbers from the vLLM team's original benchmark are not subtle. On LLaMA-7B on an A10G and LLaMA-13B on an A100 40GB, with request lengths sampled from the ShareGPT dataset, vLLM hit 14x to 24x higher throughput than HuggingFace Transformers and 2.2x to 3.5x over HuggingFace Text Generation Inference (TGI). At LMSYS, switching Chatbot Arena to vLLM cut GPU usage in half while serving an average of 30K requests a day, with peaks of 60K.

The trick is PagedAttention. KV cache for a single LLaMA-13B sequence runs up to 1.7GB. Existing systems waste 60 to 80% of that memory to fragmentation and over-reservation because they treat KV cache like one big contiguous tensor. vLLM partitions it into fixed-size blocks the way an OS partitions memory into pages, and waste drops to under 4%. More memory means more concurrent requests in a batch, which means higher throughput per dollar.
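The 1.7GB figure and the waste percentages can be sanity-checked with rough arithmetic. The sketch below assumes the published LLaMA-13B shape (40 layers, hidden size 5120, 2048-token context, FP16) and a 16-token block size. The 500-token request is a made-up example, and a real allocator has overheads this ignores.

```python
# Back-of-the-envelope KV cache math for LLaMA-13B (assumed shape, FP16).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
BYTES_PER_VALUE = 2   # FP16
MAX_SEQ_LEN = 2048
BLOCK_SIZE = 16       # tokens per KV block; assumed for illustration


def kv_bytes_per_token() -> int:
    # One key vector and one value vector per layer.
    return 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE


def contiguous_waste(actual_len: int) -> float:
    """Unused fraction when a full max-length slab is reserved up front."""
    return 1 - actual_len / MAX_SEQ_LEN


def paged_waste(actual_len: int) -> float:
    """Unused fraction with fixed-size blocks: only the last block is partial."""
    blocks = -(-actual_len // BLOCK_SIZE)  # ceiling division
    return 1 - actual_len / (blocks * BLOCK_SIZE)


per_token = kv_bytes_per_token()
print(f"{per_token} bytes per token")                              # 819200
print(f"{per_token * MAX_SEQ_LEN / 1e9:.2f} GB at full context")   # 1.68
print(f"contiguous waste at 500 tokens: {contiguous_waste(500):.0%}")  # 76%
print(f"paged waste at 500 tokens: {paged_waste(500):.1%}")            # 2.3%
```

With blocks, the only slack is the unfilled tail of the last block, which is why the waste stays in the low single digits regardless of how short the request is.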

This guide gets you from zero to a production server you can call with the openai Python SDK or curl.

Prerequisites

Linux (Ubuntu 22.04+ or similar)
Python 3.10 to 3.13
NVIDIA GPU with at least 24GB VRAM for a 7B model in FP16 (or 16GB for INT4/INT8)
CUDA 12.x driver
At least 50GB free disk for model weights
uv installed (pip install uv) — recommended but optional

The vllm PyPI page lists the supported CUDA versions per release. Match your driver to what the wheel was built against.
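A quick preflight script can catch the most common misses before you install anything. This is a minimal sketch using only the standard library; the function names are hypothetical, the thresholds mirror the list above, and it only checks that nvidia-smi is on the PATH, not that the driver version is right.

```python
import shutil
import sys


def python_ok(version=None) -> bool:
    """True if the interpreter is in the 3.10 to 3.13 range listed above."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor <= (3, 13)


def disk_ok(path: str = ".", min_free_gb: int = 50) -> bool:
    """True if the filesystem holding `path` has enough free space for weights."""
    return shutil.disk_usage(path).free >= min_free_gb * 10**9


def nvidia_smi_found() -> bool:
    """True if the NVIDIA driver tool is on PATH (a proxy for an installed driver)."""
    return shutil.which("nvidia-smi") is not None


print("python version ok:", python_ok())
print("50GB free disk:   ", disk_ok())
print("nvidia-smi found: ", nvidia_smi_found())
```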

Step 1: Install vLLM

The fastest path uses uv, which picks the right PyTorch index for your CUDA version automatically:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

If you prefer plain pip:

python -m venv .venv
source .venv/bin/activate
pip install vllm

Smoke test:

python -c "import vllm; print(vllm.__version__)"

The first import triggers a few small kernel compilations. Subsequent imports are fast.

Step 2: Start the OpenAI-Compatible Server

One command starts a server on port 8000 with an OpenAI-compatible /v1/chat/completions endpoint:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --host 0.0.0.0 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

On first run, vLLM downloads the model from HuggingFace (about 16GB for the 8B model in 16-bit precision), then compiles and captures CUDA graphs. That takes 1 to 5 minutes. Later startups reuse the cached weights and typically finish in under 30 seconds.

You can also use the older entrypoint if you have a script pinned to it:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct

Both forms are documented in the vllm serve reference.
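Because startup can take minutes, deploy scripts usually need to wait for the server before sending traffic. The sketch below polls a health endpoint until it answers. It assumes the server exposes GET /health returning HTTP 200 once ready; the function names, timeout, and interval are illustrative choices, not vLLM defaults.

```python
import time
import urllib.request


def server_is_up(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, HTTP error, ...
        return False


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0,
                     probe=server_is_up, sleep=time.sleep) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        sleep(interval_s)
    return False
```

A deploy script would call wait_until_ready("http://localhost:8000") right after launching vllm serve and abort if it returns False. The probe and sleep parameters exist only so the loop can be exercised without a live server.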

Step 3: Call It Like OpenAI

The whole point is that your existing code does not change. The openai SDK works against vLLM with one base URL swap:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require auth by default
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in 3 sentences."}],
    temperature=0.7,
    max_tokens=200,
)
print(resp.choices[0].message.content)

Or with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'

Streaming works the same way. Set stream=True in the SDK or pass "stream": true in the curl body. The server sends Server-Sent Events on the same /v1/chat/completions path OpenAI uses.
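If you consume the stream without the SDK, you have to decode the Server-Sent Events yourself. The sketch below assumes the OpenAI chat streaming format: each event is a line starting with "data:" that carries a JSON chunk with the new text in choices[].delta.content, and the stream ends with "data: [DONE]". The function name and sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # the first chunk may carry only the role
                yield text


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Hello
```

With the openai SDK the same loop is shorter: iterate over the object returned by create(..., stream=True) and read chunk.choices[0].delta.content from each chunk.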

Step 4: Tune for Your Workload

The defaults are sensible. A few flags cover 90% of production tuning:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 16384 \
  --enable-prefix-caching \
  --enable-chunked-prefill

What each one does:

--max-model-len 8192: cap the context window. Smaller values leave more memory for KV cache. Pick the smallest window your app actually needs.
--gpu-memory-utilization 0.92: fraction of VRAM vLLM is allowed to use. Default is 0.9. Push to 0.95 if you are running vLLM alone on the box.
--max-num-seqs 256: max sequences scheduled in one batch. Lower it if you have long contexts or a small GPU; raise it if requests are short.
--max-num-batched-tokens 16384: token budget per scheduler step. The optimization docs recommend more than 8192 for smaller models on big GPUs. Going higher improves throughput and time to first token (TTFT) but can worsen inter-token latency (ITL), because larger prefill batches delay decode steps.
--enable-prefix-caching: reuse KV cache across requests that share a common prefix. Huge win for chat apps where every message starts with the same system prompt. On by default in V1.
--enable-chunked-prefill: split long prefill operations into smaller chunks and interleave them with decode steps. Reduces TTFT variance. On by default in V1.
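To see how --max-model-len and --gpu-memory-utilization trade off against concurrency, it helps to estimate the KV cache budget. The sketch below assumes the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128, 16-bit values), about 16GB of weights, a 24GB GPU, and a guessed 1GB of runtime overhead. The numbers vLLM reports at startup are the ones to trust.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_budget(vram_gb: float, utilization: float, weights_gb: float,
                    overhead_gb: float, bytes_per_token: int) -> int:
    """Rough count of tokens that fit in the memory left after weights."""
    free_bytes = (vram_gb * utilization - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // bytes_per_token))


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
budget = kv_token_budget(vram_gb=24, utilization=0.92, weights_gb=16,
                         overhead_gb=1.0, bytes_per_token=per_token)
print(f"{per_token} bytes per token")        # 131072
print(f"about {budget} tokens of KV cache")  # 38757
for max_len in (4096, 8192, 32768):
    print(f"max-model-len {max_len}: {budget / max_len:.1f} full-length sequences")
```

Under these assumptions a 24GB card holds only a handful of full 8192-token sequences at once, which is why capping the context window is the first lever to pull.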

Optimization Levels

vLLM has 4 compile levels: -O0 through -O3. -O2 is the default and what you want for production. -O0 skips graph capture for the fastest startup (good for CI). -O3 is reserved for more aggressive optimizations and currently behaves the same as -O2. Set with VLLM_COMPILE_CONFIG=... or --compilation-config.

Watch for Preemption

If you see warnings like Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode, the scheduler ran out of KV cache mid-batch and recomputed some sequences. This is correct behavior, not a bug, but it hurts latency. The fix is one of: raise --gpu-memory-utilization, drop --max-num-seqs, drop --max-num-batched-tokens, or shard the model with --tensor-parallel-size 2 across two GPUs.
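To decide whether preemption is frequent enough to act on, you can count the warnings in a captured server log. This is a minimal sketch that assumes the log wording quoted above; the message format may differ between vLLM versions, and the sample lines are made up.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the warning text quoted above and captures the preemption mode.
PREEMPT_RE = re.compile(r"is preempted by PreemptionMode\.(\w+)")


def count_preemptions(log_lines: Iterable[str]) -> Counter:
    """Count preemption warnings per mode (for example RECOMPUTE)."""
    counts = Counter()
    for line in log_lines:
        match = PREEMPT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_log = [
    "Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode",
    "some unrelated log line",
    "Sequence group 7 is preempted by PreemptionMode.RECOMPUTE mode",
]
print(dict(count_preemptions(sample_log)))  # {'RECOMPUTE': 2}
```

To scan a real log, pass an open file handle: count_preemptions(open("vllm.log")), where vllm.log is wherever you redirected the server output.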

The optimization page walks through every flag in detail.

Step 5: Add an API Key (Production)

vLLM does not enforce auth by default. For anything reachable from a network, set a key and validate it at the edge:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --api-key "$MY_SECRET_KEY"

Pass the same key as the OpenAI SDK's api_key parameter. For real production, terminate TLS at a reverse proxy (Caddy, nginx, Traefik) and put the key in an Authorization: Bearer ... check there. vLLM stays inside the private network.
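On the wire, the key travels as a standard bearer token. The sketch below builds such a request with the standard library so you can see exactly what a client sends. The function name is hypothetical, and the key is read from the same MY_SECRET_KEY environment variable used above.

```python
import json
import os
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # same key given to --api-key
        },
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000",
    os.environ.get("MY_SECRET_KEY", "replace-me"),
    "meta-llama/Llama-3.1-8B-Instruct",
    "Hello",
)
print(req.get_method(), req.full_url)
```

Sending it is one more call, urllib.request.urlopen(req). A request without the header, or with the wrong key, should be rejected with HTTP 401.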

Step 6: Run Behind a Reverse Proxy

Caddyfile that terminates TLS and forwards to vLLM:

llm.example.com {
  reverse_proxy localhost:8000 {
    header_up X-Real-IP {remote_host}
  }
}

Restart Caddy. Now https://llm.example.com/v1/chat/completions works and your SDK just needs a different base_url.

Run Multiple Models

vllm serve loads one model per process. If you want to serve several models from the same GPU box, run one vLLM process per model on different ports, or use a router in front. llm-router and LiteLLM both work. For high-traffic setups, run each model on its own GPU with one vLLM instance per GPU.
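At its simplest, a router is a lookup from the requested model name to the vLLM process that serves it. The sketch below shows that idea only; the ports and the second model are hypothetical, and a real router such as LiteLLM adds retries, auth, and load balancing on top.

```python
# Hypothetical layout: one vLLM process per model, each on its own port.
BACKENDS = {
    "meta-llama/Llama-3.1-8B-Instruct": "http://localhost:8000/v1",
    "Qwen/Qwen2.5-7B-Instruct": "http://localhost:8001/v1",
}


def base_url_for(model: str) -> str:
    """Return the base URL of the vLLM process serving `model`."""
    try:
        return BACKENDS[model]
    except KeyError:
        known = ", ".join(sorted(BACKENDS))
        raise KeyError(f"unknown model {model!r}; known models: {known}") from None


print(base_url_for("Qwen/Qwen2.5-7B-Instruct"))  # http://localhost:8001/v1
```

A client can then create its OpenAI client with base_url=base_url_for(model) and leave the rest of the call unchanged.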

If the model does not fit on one GPU, vLLM supports tensor parallelism:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

This shards the model across 4 GPUs. vLLM also supports pipeline parallelism (--pipeline-parallel-size) and a mix of both. The parallelism strategies section of the optimization docs has the full matrix.
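A rough way to pick --tensor-parallel-size is to divide the weight footprint by the number of GPUs and see what is left for KV cache. The sketch below assumes 16-bit weights, 80GB GPUs, and a 0.9 memory fraction, and it ignores activation memory and communication buffers, so treat the result as a lower bound on what you need.

```python
def weights_gb_per_gpu(params_billion: float, dtype_bytes: int,
                       tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU when sharded evenly."""
    # 1e9 parameters times bytes per parameter is GB, so the 1e9 cancels.
    return params_billion * dtype_bytes / tensor_parallel_size


def kv_headroom_gb(params_billion: float, dtype_bytes: int, tp: int,
                   vram_gb: float = 80, utilization: float = 0.9) -> float:
    """Memory per GPU left for KV cache after weights (negative = does not fit)."""
    return vram_gb * utilization - weights_gb_per_gpu(params_billion, dtype_bytes, tp)


for tp in (1, 2, 4, 8):
    per_gpu = weights_gb_per_gpu(70, 2, tp)
    print(f"TP={tp}: {per_gpu:.1f} GB weights per GPU, "
          f"{kv_headroom_gb(70, 2, tp):.1f} GB left for KV cache")
```

Under these assumptions a 70B model technically loads at TP=2 but leaves almost nothing for KV cache, while TP=4 leaves roughly half of each GPU free for it.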

When to Use vLLM vs Alternatives

Pick vLLM when:

You want maximum throughput per GPU on NVIDIA hardware
You need an OpenAI-compatible drop-in for existing client code
You are serving 7B-70B class dense or MoE models

Pick something else when:

Ollama or LM Studio if you just want to chat with a local model on a laptop. They wrap llama.cpp with a friendlier UX but cap out well below vLLM throughput.
llama.cpp if you need CPU-only or Apple Silicon inference and maximum quantization flexibility. vLLM has Apple Silicon support through vllm-metal but it is still maturing.
TensorRT-LLM if you are all-in on NVIDIA and need the absolute lowest latency. vLLM is close behind and much easier to deploy.
TGI if you prefer HuggingFace's official server. vLLM is the faster, more popular option today.
Hosted APIs (OpenAI, Anthropic) if you do not have GPU capacity or are not sure the workload is real. They bill per token and remove ops overhead.

Common Pitfalls

Out of memory on startup. Drop --max-model-len first. A 4096 context uses far less KV cache than 32768. If that is not enough, lower --gpu-memory-utilization to 0.85 and check no other process is using the GPU (nvidia-smi).

Slow first request. The first call after startup can pay one-time costs such as lazy kernel compilation and cache warm-up, so it is slower than the ones that follow. Warm up with one dummy call before counting latency in benchmarks.
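A benchmark harness can make the warm-up explicit so the slow first call never reaches your numbers. This is a minimal sketch; the function name is hypothetical, and call stands for whatever function sends one request to your server.

```python
import statistics
import time
from typing import Callable


def median_latency(call: Callable[[], object], warmup: int = 1, runs: int = 5,
                   clock: Callable[[], float] = time.perf_counter) -> float:
    """Median wall-clock seconds per call, excluding `warmup` untimed calls."""
    for _ in range(warmup):
        call()  # discarded: absorbs one-time startup costs
    samples = []
    for _ in range(runs):
        start = clock()
        call()
        samples.append(clock() - start)
    return statistics.median(samples)
```

For example, median_latency(lambda: client.chat.completions.create(...)) times the SDK call from Step 3. The median is used rather than the mean so a single straggler does not skew the result.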

Model download blocks startup. vLLM pulls from HuggingFace by default. In air-gapped environments, pre-download with huggingface-cli download and point --model at the local path.

Long-tail latency from preemption. See the "Watch for Preemption" section above. The default gpu_memory_utilization=0.9 is conservative. Tune up only after you measure.

Confused about OpenAI client vs raw API. vLLM's /v1/chat/completions matches OpenAI's schema, but support for the newer /v1/responses endpoint (OpenAI's Responses API) depends on your vLLM version. Check the docs for your release before relying on it; /v1/chat/completions and /v1/completions are the safe defaults.

Where to Go Next

vllm serve CLI reference — every flag explained
Optimization and tuning — parallelism, batching, compilation levels
vLLM blog: PagedAttention — the original SOSP 2023 paper results and the memory-sharing tricks that came from it
Speculative decoding in vLLM — EAGLE and other draft-model setups that 2-3x tokens/sec at no quality cost