3.3 Running DeepSeek R1 Locally
Key Concepts: Ollama · vLLM · Hardware requirements · Quantization (Q4/Q8)
Official Docs: Ollama DeepSeek Models · DeepSeek GitHub · vLLM Docs
Why Run Models Locally?
- Privacy — sensitive data never leaves your machine
- Cost — no per-token charges after hardware investment
- Offline — works without internet connectivity
- Experimentation — explore open-weight models freely
DeepSeek R1 Model Sizes
| Model | Parameters | VRAM (Q4) | VRAM (fp16) | Use Case |
|---|---|---|---|---|
| deepseek-r1:1.5b | 1.5B | ~1 GB | ~3 GB | Laptop / CPU |
| deepseek-r1:7b | 7B | ~4.7 GB | ~14 GB | Consumer GPU |
| deepseek-r1:14b | 14B | ~9 GB | ~28 GB | Prosumer GPU |
| deepseek-r1:32b | 32B | ~20 GB | ~64 GB | High-end GPU |
| deepseek-r1:70b | 70B | ~43 GB | ~140 GB | Multi-GPU |
The tags listed here are distilled variants of R1, not the full model. Always check the Ollama model page for the latest quantized variants.
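The fp16 column follows from simple arithmetic: each parameter takes 16 bits (2 bytes). The sketch below reproduces that estimate. `estimate_weight_memory_gb` is a hypothetical helper, and it counts weights only (no KV cache or runtime overhead), so treat its output as a lower bound.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts taken from the table above.
for tag, params in [("1.5b", 1.5), ("7b", 7), ("14b", 14), ("32b", 32), ("70b", 70)]:
    fp16 = estimate_weight_memory_gb(params, 16)
    flat_4bit = estimate_weight_memory_gb(params, 4)
    print(f"deepseek-r1:{tag}: fp16 ~{fp16:.1f} GB, flat 4-bit ~{flat_4bit:.1f} GB")
```

The Q4 figures in the table are somewhat higher than a flat 4 bits per weight, likely because practical Q4 formats also store scaling factors and keep some tensors at higher precision.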
Option A — Ollama (Recommended for Students)
Ollama is the simplest way to run DeepSeek R1 locally. It handles downloading and serving automatically, and its default model tags ship already quantized.
```bash
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull DeepSeek R1 (7B quantized, ~4.7 GB)
ollama pull deepseek-r1:7b

# 3. Interactive chat in terminal
ollama run deepseek-r1:7b
```
Call via Python (OpenAI-compatible API)
```python
from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but not validated
)

response = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[
        {"role": "user", "content": "What is the difference between RAG and fine-tuning?"}
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)
```
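R1-style models typically emit their reasoning inside `<think>...</think>` tags before the final answer. Whether those tags show up in `message.content` depends on the Ollama version and settings, so treat that as an assumption. A minimal sketch for separating the two parts, using a hypothetical helper `split_reasoning`:

```python
import re

# Non-greedy match so only the first <think> block is captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Hand-written sample standing in for response.choices[0].message.content
sample = "<think>RAG retrieves; fine-tuning updates weights.</think>\nRAG adds context at query time."
reasoning, answer = split_reasoning(sample)
print(answer)
```

Passing the real `message.content` string instead of `sample` gives the answer without the reasoning trace, which is usually what you want to show to end users.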
Option B — vLLM (Production / High Throughput)
vLLM is a high-performance inference server designed for production. It supports continuous batching and PagedAttention for high throughput.
```bash
pip install vllm

# Start vLLM server (requires CUDA GPU)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --port 8000 \
    --dtype auto
```
```python
from openai import OpenAI

# Same client code as with Ollama; only the base URL and model name change
client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Solve: 3x + 5 = 20"}],
)
print(response.choices[0].message.content)
```
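Continuous batching only pays off when several requests are in flight at once. The sketch below shows one way to send prompts concurrently from a single script. `ask_all` and `fake_ask` are hypothetical names, and `fake_ask` stands in for a real request so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def ask_all(prompts: list[str], ask: Callable[[str], str], max_workers: int = 8) -> list[str]:
    """Send prompts concurrently; results come back in the same order as prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))


def fake_ask(prompt: str) -> str:
    # Stand-in for a function that calls client.chat.completions.create(...)
    return f"answer to: {prompt}"


print(ask_all(["Solve: 3x + 5 = 20", "Solve: 2x = 8"], fake_ask))
```

To use it against the vLLM server, replace `fake_ask` with a function that wraps the `client.chat.completions.create` call shown above and returns `response.choices[0].message.content`.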
Understanding Quantization
Quantization reduces the bit-width of model weights to save memory, at the cost of some quality loss that grows as the bit-width shrinks.
| Format | Bits | VRAM savings vs fp16 | Quality loss |
|---|---|---|---|
| fp16 | 16 | baseline | none |
| Q8 | 8 | ~2× | minimal |
| Q4 | 4 | ~4× | small |
| Q2 | 2 | ~8× | noticeable |
For learning and experimentation, Q4 quantization via Ollama is the best starting point. It runs a capable model on a laptop GPU or even CPU.
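To make the trade-off concrete, here is a toy round trip through 4-bit integers. It is a deliberately simplified sketch with one shared scale for all weights. Real Q4 formats work on blocks of weights with per-block scales, so this is not how Ollama or GGUF actually store a model.

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.70, -0.33, 0.12, 0.02]  # made-up example values
quantized, scale = quantize_4bit(weights)
restored = dequantize(quantized, scale)

print("stored integers:", quantized)
print("restored values:", [round(r, 3) for r in restored])
print("max error:", round(max(abs(w - r) for w, r in zip(weights, restored)), 3))
```

Each weight now needs 4 bits instead of 16, and the restored values differ from the originals by at most half a quantization step. That rounding error is the "quality loss" column in the table.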
Hardware Checklist
| What you have | Recommended model |
|---|---|
| MacBook M-series (16 GB) | deepseek-r1:7b (Q4) |
| Windows/Linux + RTX 3080 (10 GB) | deepseek-r1:7b (Q4) |
| Windows/Linux + RTX 4090 (24 GB) | deepseek-r1:14b (Q4) |
| CPU only | deepseek-r1:1.5b |
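The checklist can be approximated with a simple rule: pick the largest model whose Q4 size stays within a fraction of available memory. The 50% fraction below is an assumption chosen because it reproduces the rows above, not an official guideline, and `recommend_model` is a hypothetical helper.

```python
# Approximate Q4 memory needs in GB, copied from the model size table in this section.
Q4_VRAM_GB = {
    "deepseek-r1:1.5b": 1.0,
    "deepseek-r1:7b": 4.7,
    "deepseek-r1:14b": 9.0,
    "deepseek-r1:32b": 20.0,
    "deepseek-r1:70b": 43.0,
}


def recommend_model(memory_gb: float, max_fraction: float = 0.5) -> str:
    """Pick the largest tag whose Q4 size fits within max_fraction of memory.

    Falls back to the smallest tag (the CPU-only row) when nothing fits.
    """
    budget = memory_gb * max_fraction
    fitting = [(need, tag) for tag, need in Q4_VRAM_GB.items() if need <= budget]
    if not fitting:
        return "deepseek-r1:1.5b"
    return max(fitting)[1]


for label, gb in [("MacBook M-series", 16), ("RTX 3080", 10), ("RTX 4090", 24)]:
    print(f"{label} ({gb} GB): {recommend_model(gb)}")
```

Leaving half the memory free gives room for the KV cache, the operating system, and other applications. A larger `max_fraction` is possible on a dedicated GPU if you keep context lengths short.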
Common Mistakes
- Not checking VRAM before pulling: pulling a 14B fp16 model on a 10 GB GPU will fail at runtime.
- Forgetting to start the Ollama server: run `ollama serve` if the API isn't responding.
- Using the wrong model name in the API call: the `model` field must match the exact tag returned by `ollama list`.
- Expecting API parity with OpenAI: local models via Ollama/vLLM don't support all OpenAI parameters (e.g., `response_format` may be limited).
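The server and model-name checks can be automated before making any API call. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally installed models. The helper names are hypothetical, and the default port is assumed.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def installed_tags(tags_payload: dict) -> list[str]:
    """Extract model tags from an /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> Optional[dict]:
    """Return the parsed /api/tags response, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


wanted = "deepseek-r1:7b"
payload = fetch_tags()
if payload is None:
    print("Ollama is not responding; try `ollama serve`.")
elif wanted not in installed_tags(payload):
    print(f"{wanted} is not installed; run `ollama pull {wanted}`.")
else:
    print(f"Server is up and {wanted} is installed.")
```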
Quick Quiz
Q1. What does Q4 quantization mean, and what is its main trade-off?
A1. Weights are stored at 4-bit precision (~4× VRAM reduction vs fp16) at the cost of a small quality decrease.
Q2. What port does Ollama expose its OpenAI-compatible API on?
A2. localhost:11434.
Q3. What are two advantages of vLLM over Ollama for production?
A3. Continuous batching and PagedAttention — enabling significantly higher throughput for concurrent requests.
Student Exercise
Exercise 3.3 — Local vs cloud comparison
Run deepseek-r1:7b via Ollama and gpt-4o-mini via the OpenAI API on the same 5 prompts. Compare output quality, latency (seconds to first token), and cost (no per-token charge locally vs. metered API usage).
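For the latency measurement, one possible approach is to time how long a streaming response takes to deliver its first chunk. The helper below is a sketch with hypothetical names. It accepts any function that starts a stream, and `fake_stream` stands in for a real streaming call so the example runs offline.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_item(start_stream: Callable[[], Iterable]) -> tuple[float, list]:
    """Return (seconds from request start to first item, all items received)."""
    start = time.perf_counter()
    first_at = None
    items = []
    for item in start_stream():  # the request is issued inside the timed region
        if first_at is None:
            first_at = time.perf_counter() - start
        items.append(item)
    return (first_at if first_at is not None else float("nan")), items


def fake_stream() -> Iterator[str]:
    # Stand-in for a streaming API call; sleeps to simulate model latency.
    time.sleep(0.05)
    yield "Hello"
    yield " world"


latency, chunks = time_to_first_item(fake_stream)
print(f"first chunk after {latency:.2f}s, text={''.join(chunks)!r}")
```

With the OpenAI client, `start_stream` could be a lambda that calls `client.chat.completions.create(..., stream=True)`. Each chunk then carries its text in `chunk.choices[0].delta.content`. Run each prompt several times and compare medians, since the first request after loading a model is usually slower.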
Further Reading
- 📘 Ollama Model Library
- 📘 vLLM Documentation
- 📄 DeepSeek-R1 Technical Report
- 📘 GGUF format — llama.cpp wiki