10.1 Self-Hosting LLMs

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Ollama · vLLM · TGI · Hardware sizing · GPU vs CPU inference

Official Docs: Ollama · vLLM · Text Generation Inference

When to Self-Host

✅ Data privacy requirements — data cannot leave your servers
✅ High volume — as a rough guide, self-hosting becomes cheaper above ~10M tokens/day, depending on model and hardware costs
✅ Custom models — fine-tuned models not available via APIs
✅ Air-gapped environments — no internet access allowed

❌ Low volume — API costs are cheaper when starting out
❌ No ML infrastructure expertise — self-hosting adds operational burden
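
The ~10M tokens/day threshold depends heavily on prices. Here is a minimal sketch of the break-even arithmetic, assuming a flat hourly GPU cost and a flat per-token API price. Both figures are hypothetical placeholders rather than quotes from any provider, and the calculation ignores staffing, idle capacity, and throughput limits.

```python
def break_even_tokens_per_day(gpu_cost_per_hour: float,
                              api_price_per_million_tokens: float) -> float:
    """Daily token volume at which a GPU running 24h costs the same as the API."""
    gpu_cost_per_day = gpu_cost_per_hour * 24
    return gpu_cost_per_day / api_price_per_million_tokens * 1_000_000


# Hypothetical prices, chosen only to illustrate the calculation.
GPU_COST_PER_HOUR = 1.00       # assumed USD per GPU-hour
API_PRICE_PER_MILLION = 2.40   # assumed USD per 1M tokens

tokens = break_even_tokens_per_day(GPU_COST_PER_HOUR, API_PRICE_PER_MILLION)
print(f"Break-even: {tokens / 1e6:.1f}M tokens/day")
```

Substitute your own hardware and API prices. Below the resulting volume the API is the cheaper option under these assumptions.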

Hardware Sizing Guide

Model Size	Min VRAM (quantised)	Recommended GPU	Use case
1B–3B	3 GB	RTX 3060	Lightweight, fast
7B–8B	8 GB	RTX 3080	Best quality/cost
13B	14 GB	RTX 3090 / A10	Higher quality
70B (4-bit)	40 GB	2× A100 40GB	Production
CPU (no GPU)	16 GB RAM	Any modern CPU	Very slow
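
A rough way to sanity-check figures like these is to multiply parameter count by bytes per parameter and add the 20% buffer recommended under Common Mistakes. The sketch below assumes weights dominate memory use. It does not model the KV cache, which grows with context length and batch size, so treat the results as lower bounds.

```python
def estimate_vram_gb(params_billion: float, bits_per_param: int,
                     buffer: float = 0.20) -> float:
    """Rough VRAM estimate in GB: weight size plus a fractional buffer."""
    # 1B parameters at 8 bits (1 byte) each is about 1 GB of weights.
    weights_gb = params_billion * bits_per_param / 8
    return weights_gb * (1 + buffer)


examples = [
    ("7B float16", 7, 16),
    ("7B 8-bit", 7, 8),
    ("7B 4-bit", 7, 4),
    ("70B 4-bit", 70, 4),
]
for name, params, bits in examples:
    print(f"{name}: ~{estimate_vram_gb(params, bits):.1f} GB")
```

Under these assumptions a 7B model needs about 16.8 GB in float16 but only about 4.2 GB at 4-bit, which is why quantisation usually decides whether a model fits on a consumer GPU.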

Option 1 — Ollama (Easiest)

# Install (Linux; macOS users can download the app from https://ollama.com/download)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.2
ollama run llama3.2

# OpenAI-compatible API (runs on port 11434)
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Use Ollama with the OpenAI Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but not validated
)

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain what RAG is in 3 sentences."}],
)
print(resp.choices[0].message.content)

Option 2 — vLLM (Production, High Throughput)

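vLLM uses PagedAttention to manage the KV cache in fixed-size blocks instead of one contiguous buffer per request, which reduces wasted memory and lets more requests share a GPU. The vLLM project's own benchmarks report up to 24× higher throughput than HuggingFace Transformers inference; actual gains depend on the model, hardware, and workload.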
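
To see why block-based KV cache management helps, the toy model below compares how many token slots are reserved when every request pre-allocates its maximum length with how many are reserved when each request holds only the fixed-size blocks it has filled. This is an illustration of the idea, not vLLM's implementation. The sequence lengths, `MAX_LEN`, and `BLOCK_SIZE` are assumed values.

```python
import math


def contiguous_slots(seq_lens, max_len):
    """Token slots reserved if every request pre-allocates max_len up front."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size):
    """Token slots reserved if each request holds only the blocks it has started."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [100, 500, 30]  # hypothetical current lengths of three requests
MAX_LEN = 2048             # assumed per-request reservation, contiguous scheme
BLOCK_SIZE = 16            # assumed block size, for illustration only

used = sum(seq_lens)
schemes = [
    ("contiguous", contiguous_slots(seq_lens, MAX_LEN)),
    ("paged", paged_slots(seq_lens, BLOCK_SIZE)),
]
for label, reserved in schemes:
    print(f"{label}: {reserved} slots reserved, {used / reserved:.0%} utilised")
```

With these numbers the contiguous scheme reserves 6144 slots for 630 tokens, while the paged scheme reserves 656. The memory freed this way is what allows more concurrent requests per GPU.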

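pip install vllm

# Start an OpenAI-compatible server.
# meta-llama models are gated on Hugging Face: accept the license and log in first.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Query the server with the OpenAI Python client
from openai import OpenAI

# api_key is a placeholder; vLLM only checks it if the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the --model value above
    messages=[{"role": "user", "content": "What is vLLM?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)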

Common Mistakes

Underestimating VRAM — always add 20% buffer. A 7B model in float16 needs ~14GB VRAM for the weights alone (2 bytes per parameter), not 7GB. 8-bit quantisation halves this; 4-bit cuts it to roughly a quarter (~3.5GB).
CPU inference in production — CPU inference is 10–50× slower than GPU. Use GPU for anything latency-sensitive.
No load balancing — a single vLLM instance has limited concurrency. For production traffic, run multiple instances behind a load balancer.
Not pinning model versions — running ollama pull llama3.2 fetches whatever the default tag currently points to. Pin an explicit model tag (for example llama3.2:3b) for reproducibility.
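
As an illustration of the load-balancing point, here is a minimal client-side round-robin sketch that uses only the standard library. It assumes two vLLM instances at hypothetical local addresses. `RoundRobinPool` and `chat` are names made up for this example. In production a dedicated load balancer with health checks would normally sit in front of the servers.

```python
import itertools
import json
import urllib.request


class RoundRobinPool:
    """Cycle through a fixed list of OpenAI-compatible base URLs."""

    def __init__(self, base_urls):
        if not base_urls:
            raise ValueError("at least one base URL is required")
        self._cycle = itertools.cycle(base_urls)

    def next_url(self) -> str:
        return next(self._cycle)


def chat(pool: RoundRobinPool, model: str, prompt: str,
         timeout: float = 60.0) -> str:
    """POST one chat completion to the next server in the pool."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{pool.next_url()}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Hypothetical addresses of two vLLM instances.
pool = RoundRobinPool(["http://localhost:8000/v1", "http://localhost:8001/v1"])
print([pool.next_url() for _ in range(3)])
```

The final line only shows the rotation order and makes no network call. Calling `chat(pool, model_name, prompt)` sends a real request, so it needs the servers to be running.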

Quick Quiz

Test Your Understanding

Q1. What is the main advantage of vLLM over basic HuggingFace inference?
A1. PagedAttention — vLLM manages the KV cache memory like a virtual memory system, achieving much higher throughput with many concurrent requests.

Q2. Which self-hosting tool is easiest for development and local testing?
A2. Ollama — one install command, one model pull, and it serves an OpenAI-compatible API locally.

Q3. At approximately what scale does self-hosting become cheaper than cloud APIs?
A3. Roughly 10M+ tokens per day, depending on the model and hardware costs.

Student Exercise

Exercise 10.1 — Local model server
Install Ollama. Pull llama3.2 and qwen2.5:0.5b. Write a Python script using the OpenAI client that sends the same prompt to both models and compares output quality and response time. Which is faster? Which is more accurate?
