3.3 Running DeepSeek R1 Locally

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Ollama · vLLM · Hardware requirements · Quantization (Q4/Q8)

Official Docs: Ollama DeepSeek Models · DeepSeek GitHub · vLLM Docs

Why Run Models Locally?

Privacy — sensitive data never leaves your machine
Cost — no per-token charges after hardware investment
Offline — works without internet connectivity
Experimentation — explore open-weight models freely

DeepSeek R1 Model Sizes

Model	Parameters	VRAM (Q4)	VRAM (fp16)	Use Case
deepseek-r1:1.5b	1.5B	~1 GB	~3 GB	Laptop / CPU
deepseek-r1:7b	7B	~4.7 GB	~14 GB	Consumer GPU
deepseek-r1:14b	14B	~9 GB	~28 GB	Prosumer GPU
deepseek-r1:32b	32B	~20 GB	~64 GB	High-end GPU
deepseek-r1:70b	70B	~43 GB	~140 GB	Multi-GPU

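These Ollama tags are distilled variants of R1 built on smaller Qwen and Llama base models, not the full 671B-parameter model. The figures cover weights only; actual memory use is higher once the context (KV cache) is allocated. Always check the Ollama model page for the latest quantized variants.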
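The fp16 column in the table follows from simple arithmetic: each parameter takes 2 bytes. The sketch below reproduces that estimate. The function name is hypothetical, and it counts weights only, so treat the result as a lower bound. The Q4 column in the table is larger than a pure 4-bit estimate, presumably because practical 4-bit formats store extra scaling data alongside the weights.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


for tag, size in [("1.5b", 1.5), ("7b", 7), ("14b", 14), ("32b", 32), ("70b", 70)]:
    fp16 = estimate_weight_gb(size, 16)
    four_bit = estimate_weight_gb(size, 4)
    print(f"deepseek-r1:{tag}  fp16 ~{fp16:.0f} GB  pure 4-bit ~{four_bit:.1f} GB")
```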

Option A — Ollama (Recommended for Students)

Ollama is the simplest way to run DeepSeek R1 locally. It downloads pre-quantized weights and serves them through a local API automatically.

# 1. Install Ollama (this script targets Linux; on macOS and Windows use the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull DeepSeek R1 (7B quantized ~4.7 GB)
ollama pull deepseek-r1:7b

# 3. Interactive chat in terminal
ollama run deepseek-r1:7b
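Before calling the API from code, it can help to confirm that the server is reachable. The following is a minimal sketch using only the standard library. It assumes Ollama's default address, http://localhost:11434 (the port used in this section), and the helper name is hypothetical.

```python
import urllib.request


def server_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with a 2xx status."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts all derive from OSError
        return False


if __name__ == "__main__":
    print("Ollama reachable:", server_is_up())
```

If this prints False, starting the server with ollama serve is the first thing to try.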

Call via Python (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",   # required but not validated
)

response = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[
        {"role": "user", "content": "What is the difference between RAG and fine-tuning?"}
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)
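R1-style models usually produce their reasoning before the final answer, commonly wrapped in <think>...</think> tags. Whether those tags appear in message.content depends on the server and its version, so the following is a sketch under that assumption. The helper name split_reasoning and the sample text are hypothetical.

```python
import re

# Non-greedy match so only the first <think> block is captured; DOTALL spans newlines
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = (
    "<think>\nRAG retrieves; fine-tuning changes weights.\n</think>\n\n"
    "RAG adds context at query time."
)
reasoning, answer = split_reasoning(sample)
print(answer)
```

In the example above, passing response.choices[0].message.content to split_reasoning would separate the reasoning from the text you show to a user.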

Option B — vLLM (Production / High Throughput)

vLLM is a high-performance inference server designed for production. It supports continuous batching and PagedAttention for high throughput.

pip install vllm

# Start vLLM server (requires a CUDA GPU); recent vLLM releases also provide the `vllm serve <model>` command
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --port 8000 \
    --dtype auto

# Query the vLLM server from Python (run in a separate terminal once the server is up)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Solve: 3x + 5 = 20"}],
)
print(response.choices[0].message.content)

Understanding Quantization

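Quantization stores model weights at a lower bit-width to save memory. Quality degrades as the bit-width shrinks: the loss is minimal at 8 bits, small at 4 bits, and noticeable at 2 bits. The reduction factors below are nominal; real quantized files are somewhat larger because they also store scaling metadata.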

Format	Bits	Nominal memory reduction vs fp16	Quality loss
fp16	16	baseline	none
Q8	8	~2×	minimal
Q4	4	~4×	small
Q2	2	~8×	noticeable
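To see where the quality loss comes from, here is a toy round-trip through symmetric round-to-nearest quantization with a single shared scale. This is an illustrative sketch, not the scheme Ollama's quantized files use (those are block-wise and more elaborate). The function names and sample weights are made up for the example.

```python
def quantize(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats to signed integer codes using one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.82, -0.41, 0.05, -0.97, 0.33]
for bits in (8, 4, 2):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: codes={codes}  max error={worst:.4f}")
```

Fewer bits means fewer representable values, so each weight lands further from its original value on average. That rounding error is the quality cost listed in the table.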

Student Recommendation

For learning and experimentation, Q4 quantization via Ollama is the best starting point. It lets a capable model such as deepseek-r1:7b (~4.7 GB) run on a laptop GPU, and the 1.5B model runs on CPU alone.

Hardware Checklist

What you have	Recommended model
MacBook M-series (16 GB)	deepseek-r1:7b (Q4)
Windows/Linux + RTX 3080 (10 GB)	deepseek-r1:7b (Q4)
Windows/Linux + RTX 4090 (24 GB)	deepseek-r1:14b (Q4)
CPU only	deepseek-r1:1.5b

Common Mistakes

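Not checking VRAM before pulling — a 14B fp16 model (~28 GB) does not fit on a 10 GB GPU. vLLM will typically fail with an out-of-memory error, while Ollama typically falls back to much slower CPU offloading.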
Forgetting to start the Ollama server — run ollama serve if the API isn't responding.
Using the wrong model name in the API call — the model field must match the exact tag returned by ollama list.
Expecting API parity with OpenAI — local models via Ollama/vLLM don’t support all OpenAI parameters (e.g., response_format may be limited).
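The first mistake can be caught with a quick check before pulling. The sketch below reads total GPU memory from nvidia-smi (NVIDIA GPUs only) and compares it with a model size from the table above. The helper names and the 1.2 headroom factor are assumptions chosen for illustration, not recommended values.

```python
import subprocess


def parse_total_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi's headerless CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_total_vram_mib() -> list[int]:
    """Ask nvidia-smi for total memory per GPU; returns [] if the tool is unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_total_vram_mib(out)


def fits(model_gb: float, vram_mib: int, headroom: float = 1.2) -> bool:
    """True if the model size times a safety margin fits in the given VRAM."""
    vram_gb = vram_mib * 1024 ** 2 / 1e9  # convert MiB to GB
    return model_gb * headroom <= vram_gb


if __name__ == "__main__":
    for index, mib in enumerate(query_total_vram_mib()):
        print(f"GPU {index}: {mib} MiB, deepseek-r1:7b (Q4) fits: {fits(4.7, mib)}")
```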

Quick Quiz

Test Your Understanding

Q1. What does Q4 quantization mean, and what is its main trade-off?
A1. Weights are stored at 4-bit precision (~4× VRAM reduction vs fp16) at the cost of a small quality decrease.

Q2. What port does Ollama expose its OpenAI-compatible API on?
A2. Port 11434 (base URL http://localhost:11434/v1).

Q3. What are two advantages of vLLM over Ollama for production?
A3. Continuous batching and PagedAttention — enabling significantly higher throughput for concurrent requests.

Student Exercise

Exercise 3.3 — Local vs cloud comparison
Run deepseek-r1:7b via Ollama and gpt-4o-mini via the OpenAI API on the same 5 prompts. Compare output quality, latency (seconds to first token), and cost (no per-token charge locally vs. the API charge for each request).
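One way to measure seconds to first token is to stream the response and time the arrival of the first non-empty piece of text. The sketch below assumes the OpenAI Python client with stream=True, as used elsewhere in this section. The helper names are hypothetical, and for R1 the first streamed text may be reasoning rather than the final answer.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def time_to_first_token(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds until the first non-empty chunk arrives, or None if none does."""
    start = clock()
    for text in chunks:
        if text:
            return clock() - start
    return None


def stream_text(client, model: str, prompt: str) -> Iterator[str]:
    """Yield text pieces from a streaming chat completion.

    This is a generator, so the request is only sent when iteration begins,
    which is after time_to_first_token has started its clock.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

With a client configured as in the Ollama example, time_to_first_token(stream_text(client, "deepseek-r1:7b", "your prompt")) returns the latency for one prompt. Averaging over several runs gives a steadier number.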
