10.1 Self-Hosting LLMs
Key Concepts: Ollama · vLLM · TGI · Hardware sizing · GPU vs CPU inference
Official Docs: Ollama · vLLM · Text Generation Inference
When to Self-Host
✅ Data privacy requirements — data cannot leave your servers
✅ High volume — self-hosting becomes cheaper above ~10M tokens/day
✅ Custom models — fine-tuned models not available via APIs
✅ Air-gapped environments — no internet access allowed
❌ Low volume — API costs are cheaper when starting out
❌ No ML infrastructure expertise — self-hosting adds operational burden
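The volume threshold in the list above comes down to simple arithmetic: a self-hosted server is a roughly fixed monthly cost, while API spend grows with token volume. Here is a minimal sketch of that break-even calculation. The function name and both prices are hypothetical, chosen only for illustration, and the sketch ignores staff time, electricity, and idle capacity.

```python
def break_even_tokens_per_day(monthly_server_cost_usd: float,
                              api_price_per_million_usd: float,
                              days_per_month: int = 30) -> float:
    """Daily token volume at which API spend equals the fixed server cost."""
    daily_server_cost = monthly_server_cost_usd / days_per_month
    return daily_server_cost / api_price_per_million_usd * 1_000_000


# Hypothetical prices, for illustration only: plug in your own quotes.
threshold = break_even_tokens_per_day(monthly_server_cost_usd=300.0,
                                      api_price_per_million_usd=1.0)
print(f"Break-even at about {threshold / 1e6:.0f}M tokens/day")
```

Below the computed threshold the API is cheaper; above it the fixed server cost is spread over enough tokens to win. A cheaper API price or a more expensive GPU pushes the threshold up.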
Hardware Sizing Guide
| Model Size | Min VRAM (approx., quantized) | Recommended GPU | Use case |
|---|---|---|---|
| 1B–3B | 3 GB | RTX 3060 | Lightweight, fast |
| 7B–8B | 8 GB | RTX 3080 | Best quality/cost |
| 13B | 14 GB | RTX 3090 / A10 | Higher quality |
| 70B (4-bit) | 40 GB | 2× A100 40GB | Production |
| CPU (no GPU) | 16 GB RAM | Any modern CPU | Very slow |
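A rough way to sanity-check any row in a sizing table like this is to multiply the parameter count by the bytes stored per weight, then add headroom for the KV cache and activations. The sketch below assumes a weights-only estimate plus a flat 20% buffer. The function name `estimate_vram_gb` is hypothetical, and real usage also depends on context length and batch size.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int,
                     buffer: float = 0.20) -> float:
    """Weights-only memory estimate plus a safety buffer, in GB."""
    # 1B parameters at 8 bits per weight is about 1 GB.
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + buffer)


# Compare common precisions for a 7B model.
for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit: ~{estimate_vram_gb(7, bits):.1f} GB")
```

Treat the result as a lower bound for planning, not a guarantee that a model will fit.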
Option 1 — Ollama (Easiest)
# Install (Linux; on macOS, download the app from ollama.com instead)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama pull llama3.2
ollama run llama3.2
# OpenAI-compatible API (runs on port 11434)
curl http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Use Ollama with the OpenAI Python client (pip install openai)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the client but not validated by Ollama
)
resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain what RAG is in 3 sentences."}],
)
print(resp.choices[0].message.content)
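If you would rather not install a client library, the same endpoint can be called with the Python standard library alone. This is a minimal sketch, assuming Ollama is listening on the default port used above. The helper names `build_chat_request` and `chat` are hypothetical, and error handling is left out.

```python
import json
import urllib.request

# Same endpoint as the curl example above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(model: str, prompt: str,
                       url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running and the model pulled, `chat("llama3.2", "Hello!")` should return the reply as a string.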
Option 2 — vLLM (Production, High Throughput)
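vLLM uses PagedAttention for efficient KV cache management. In its own benchmarks, the vLLM team reports up to 24× higher throughput than plain HuggingFace Transformers inference; the actual gain depends on the model, hardware, and request mix.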
pip install vllm
# Start server (OpenAI-compatible)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
# Query the server with the OpenAI Python client
from openai import OpenAI

# The client requires a non-empty key; "token" is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the --model value above
    messages=[{"role": "user", "content": "What is vLLM?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
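To get a feel for the KV cache management mentioned at the start of this option, it can help to picture the cache as fixed-size blocks handed out on demand, much like pages in virtual memory. The toy allocator below is only an illustration of that idea, not vLLM's implementation. The class name, method names, and block size are all hypothetical.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # ids of unused blocks
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> block ids
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:    # every owned block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


# A 17-token sequence needs two 16-token blocks, not a max-length reservation.
cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("request-a")
print(len(cache.block_tables["request-a"]), "blocks in use")
```

Because blocks are claimed one at a time and returned as soon as a request finishes, short requests do not tie up memory sized for the longest possible sequence, which leaves room for more concurrent requests.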
Common Mistakes
- Underestimating VRAM: always add a ~20% buffer. A 7B model in float16 needs ~14 GB of VRAM for the weights alone (2 bytes per parameter), not 7 GB. 8-bit quantisation roughly halves this; 4-bit quantisation cuts it to roughly a quarter (~3.5 GB).
- CPU inference in production: CPU inference is typically 10–50× slower than GPU. Use a GPU for anything latency-sensitive.
- No load balancing: a single vLLM instance has limited concurrency. For production traffic, run multiple instances behind a load balancer.
- Not pinning model versions: a bare ollama pull llama3.2 fetches whatever the default tag currently points to. Pin an explicit model tag (for example, llama3.2:3b) for reproducibility.
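As a small illustration of the load-balancing point, requests can be rotated across several server instances in round-robin order. This is a client-side sketch only. The host names are hypothetical, and a production setup would more typically put a reverse proxy or a dedicated load balancer in front of the instances.

```python
import itertools
from typing import Iterator, Sequence


def round_robin(base_urls: Sequence[str]) -> Iterator[str]:
    """Yield the given base URLs in order, repeating forever."""
    if not base_urls:
        raise ValueError("at least one endpoint is required")
    return itertools.cycle(base_urls)


# Hypothetical hosts: replace with the addresses of your own instances.
VLLM_ENDPOINTS = ["http://vllm-0:8000/v1", "http://vllm-1:8000/v1"]

picker = round_robin(VLLM_ENDPOINTS)
for _ in range(3):
    print(next(picker))
```

Each value from `next(picker)` can be passed as `base_url` when constructing the OpenAI client for a request. Note that this sketch does no health checking, so a dead instance would still receive its share of traffic.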
Quick Quiz
Q1. What is the main advantage of vLLM over basic HuggingFace inference?
A1. PagedAttention — vLLM manages the KV cache memory like a virtual memory system, achieving much higher throughput with many concurrent requests.
Q2. Which self-hosting tool is easiest for development and local testing?
A2. Ollama — one install command, one model pull, and it serves an OpenAI-compatible API locally.
Q3. At approximately what scale does self-hosting become cheaper than cloud APIs?
A3. Roughly 10M+ tokens per day, depending on the model and hardware costs.
Student Exercise
Exercise 10.1: Local model server
Install Ollama. Pull llama3.2 and qwen2.5:0.5b. Write a Python script using the OpenAI client that sends the same prompt to both models and compares output quality and response time. Which is faster? Which is more accurate?
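One possible starting scaffold for the timing part of this exercise is shown below. It assumes you supply your own `ask(model, prompt)` function that calls the local server and returns the reply text. The names `timed` and `compare` are hypothetical, and judging output quality is left to you.

```python
import time
from typing import Callable, Iterable


def timed(ask: Callable[[str, str], str], model: str,
          prompt: str) -> tuple[str, float]:
    """Call ask(model, prompt) and return (answer, elapsed seconds)."""
    start = time.perf_counter()
    answer = ask(model, prompt)
    return answer, time.perf_counter() - start


def compare(ask: Callable[[str, str], str], models: Iterable[str],
            prompt: str) -> dict[str, tuple[str, float]]:
    """Send the same prompt to every model and collect answers with timings."""
    return {model: timed(ask, model, prompt) for model in models}
```

A single run mostly measures model load time on the first call, so consider sending each prompt several times and comparing the later runs.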