10.1 Self-Hosting LLMs

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Ollama · vLLM · TGI · Hardware sizing · GPU vs CPU inference

Official Docs: Ollama · vLLM · Text Generation Inference


When to Self-Host

✅ Data privacy requirements — data cannot leave your servers
✅ High volume — as a rough rule of thumb, self-hosting becomes cheaper above ~10M tokens/day, depending on the model and hardware costs
✅ Custom models — fine-tuned models not available via APIs
✅ Air-gapped environments — no internet access allowed

❌ Low volume — API costs are cheaper when starting out
❌ No ML infrastructure expertise — self-hosting adds operational burden
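
The volume threshold above comes down to simple arithmetic: a pay-per-token API scales with usage, while a self-hosted server costs roughly the same each day regardless of traffic. The sketch below shows that calculation. Both prices are hypothetical placeholders chosen for illustration, not quotes from any provider, so substitute your own figures.

```python
def api_cost_per_day(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Daily spend on a pay-per-token API."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


def break_even_tokens_per_day(self_host_usd_per_day: float,
                              usd_per_million_tokens: float) -> float:
    """Daily token volume above which a fixed-cost server beats the API."""
    return self_host_usd_per_day / usd_per_million_tokens * 1_000_000


# Hypothetical prices, for illustration only.
API_USD_PER_MILLION = 3.0     # assumed blended API price per 1M tokens
SELF_HOST_USD_PER_DAY = 30.0  # assumed GPU rental plus operations cost per day

threshold = break_even_tokens_per_day(SELF_HOST_USD_PER_DAY, API_USD_PER_MILLION)
print(f"Break-even: {threshold:,.0f} tokens/day")

for volume in (1_000_000, 20_000_000, 50_000_000):
    api = api_cost_per_day(volume, API_USD_PER_MILLION)
    cheaper = "self-host" if api > SELF_HOST_USD_PER_DAY else "API"
    print(f"{volume:>12,} tokens/day: API ${api:,.2f} "
          f"vs self-host ${SELF_HOST_USD_PER_DAY:,.2f} -> {cheaper}")
```

The model ignores engineering time and idle capacity, both of which tend to push the real break-even point higher.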


Hardware Sizing Guide

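| Model size   | Min VRAM  | Recommended GPU | Use case          |
|--------------|-----------|-----------------|-------------------|
| 1B–3B        | 3 GB      | RTX 3060        | Lightweight, fast |
| 7B–8B        | 8 GB      | RTX 3080        | Best quality/cost |
| 13B          | 14 GB     | RTX 3090 / A10  | Higher quality    |
| 70B (4-bit)  | 40 GB     | 2× A100 40GB    | Production        |
| CPU (no GPU) | 16 GB RAM | Any modern CPU  | Very slow         |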
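
A quick way to sanity-check figures like these is to multiply the parameter count by the bytes stored per weight and add a safety margin. The sketch below does that, using the 20% buffer this page recommends under Common Mistakes. It is a rough estimate only: it counts weights and ignores the KV cache, which grows with context length and batch size. The function name and defaults are illustrative.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.20) -> float:
    """Rough VRAM (in GB) for the model weights plus a fractional buffer.

    Ignores KV cache growth with context length and batch size.
    """
    # 1 billion parameters at 8 bits each is about 1 GB.
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * (1 + overhead)


for name, params, bits in [
    ("7B fp16", 7, 16),
    ("7B 4-bit", 7, 4),
    ("13B 8-bit", 13, 8),
    ("70B 4-bit", 70, 4),
]:
    print(f"{name:>10}: ~{estimate_vram_gb(params, bits):.1f} GB")
```

By this estimate a 7B model only fits in 8 GB when quantised, which suggests the smaller VRAM figures in the table assume quantised weights.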

Option 1 — Ollama (Easiest)

# Install (Linux; on macOS, download the app from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.2
ollama run llama3.2

# OpenAI-compatible API (runs on port 11434)
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Use Ollama with the OpenAI Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the client but not validated by Ollama
)

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain what RAG is in 3 sentences."}],
)
print(resp.choices[0].message.content)
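
If you would rather not install the openai package, the same endpoint can be called with the Python standard library alone. The sketch below rebuilds the curl request from above; the helper names build_chat_request and send are illustrative, and the response parsing assumes the OpenAI-style chat completion shape that the endpoint is documented to mimic.

```python
import json
import urllib.request

# Default local endpoint, taken from the curl example.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(model: str, prompt: str,
                       url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Building the request needs no running server, so it can be inspected offline.
req = build_chat_request("llama3.2", "Hello!")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With Ollama running and the model pulled, print(send(req)) returns the reply text.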

Option 2 — vLLM (Production, High Throughput)

vLLM uses PagedAttention for efficient KV cache management. Its authors report up to 24× higher throughput than Hugging Face Transformers; the actual gain depends on the model, hardware, and workload.

pip install vllm

# Start server (OpenAI-compatible)
# Llama models are gated on Hugging Face: accept the license and log in first
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Query the server with the OpenAI Python client
from openai import OpenAI

# The key is only checked if the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the --model value
    messages=[{"role": "user", "content": "What is vLLM?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
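
To make the PagedAttention idea mentioned at the start of this section more concrete, here is a toy allocator that hands out KV cache space in fixed-size blocks on demand, much as an operating system pages virtual memory. It is a teaching sketch only, not vLLM's implementation: the class name, block size, and methods are all made up for illustration, and no attention computation is involved.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none allocated yet
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("request-a")  # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("request-b")  # 5 tokens occupy 1 block
print(cache.block_tables, "free:", len(cache.free_blocks))

cache.free("request-a")  # blocks become reusable by other requests immediately
print("free after release:", len(cache.free_blocks))
```

Because memory is reserved one block at a time rather than for the maximum possible sequence length up front, more concurrent requests fit in the same VRAM, which is where the throughput gain comes from.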

Common Mistakes

  1. Underestimating VRAM — always add a 20% buffer. A 7B model in float16 needs ~14 GB of VRAM for the weights alone, not 7 GB. 8-bit quantisation halves this; 4-bit quantisation cuts it to roughly a quarter (~3.5 GB).
  2. CPU inference in production — CPU inference is typically 10–50× slower than GPU. Use a GPU for anything latency-sensitive.
  3. No load balancing — a single vLLM instance has limited concurrency. For production traffic, run multiple instances behind a load balancer.
  4. Not pinning model versions — running ollama pull llama3.2 without a tag fetches whatever the default (latest) tag currently points to. Pin an explicit model tag (for example llama3.2:3b) for reproducibility.
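
For mistake 3, the simplest form of load balancing is round-robin: send each new request to the next server in a fixed list. The sketch below shows the idea on the client side. The class name and the three server addresses are hypothetical, and a production setup would more likely use a dedicated load balancer with health checks.

```python
import itertools


class RoundRobinPool:
    """Cycle through a fixed list of backend base URLs."""

    def __init__(self, base_urls):
        urls = list(base_urls)
        if not urls:
            raise ValueError("need at least one backend")
        self._cycle = itertools.cycle(urls)

    def next_url(self) -> str:
        """Return the base URL that should serve the next request."""
        return next(self._cycle)


# Hypothetical addresses of three vLLM servers.
pool = RoundRobinPool([
    "http://10.0.0.1:8000/v1",
    "http://10.0.0.2:8000/v1",
    "http://10.0.0.3:8000/v1",
])

picked = [pool.next_url() for _ in range(5)]
print(picked)  # the fourth pick wraps around to the first server
```

Each call to pool.next_url() can be passed as base_url when constructing the OpenAI client, so consecutive requests land on different instances.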

Quick Quiz

Test Your Understanding

Q1. What is the main advantage of vLLM over basic HuggingFace inference?
A1. PagedAttention — vLLM manages the KV cache memory like a virtual memory system, achieving much higher throughput with many concurrent requests.

Q2. Which self-hosting tool is easiest for development and local testing?
A2. Ollama — one install command, one model pull, and it serves an OpenAI-compatible API locally.

Q3. At approximately what scale does self-hosting become cheaper than cloud APIs?
A3. Roughly 10M+ tokens per day, depending on the model and hardware costs.


Student Exercise

Exercise 11.1 — Local model server
Install Ollama. Pull llama3.2 and qwen2.5:0.5b. Write a Python script using the OpenAI client that sends the same prompt to both models and compares output quality and response time. Which is faster? Which is more accurate?

