3.3 Running DeepSeek R1 Locally

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Ollama · vLLM · Hardware requirements · Quantization (Q4/Q8)

Official Docs: Ollama DeepSeek Models · DeepSeek GitHub · vLLM Docs


Why Run Models Locally?

  • Privacy — sensitive data never leaves your machine
  • Cost — no per-token charges after hardware investment
  • Offline — works without internet connectivity
  • Experimentation — explore open-weight models freely

DeepSeek R1 Model Sizes

Model              Parameters   VRAM (Q4)   VRAM (fp16)   Use Case
deepseek-r1:1.5b   1.5B         ~1 GB       ~3 GB         Laptop / CPU
deepseek-r1:7b     7B           ~4.7 GB     ~14 GB        Consumer GPU
deepseek-r1:14b    14B          ~9 GB       ~28 GB        Prosumer GPU
deepseek-r1:32b    32B          ~20 GB      ~64 GB        High-end GPU
deepseek-r1:70b    70B          ~43 GB      ~140 GB       Multi-GPU

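These tags are the distilled R1 variants, not the full DeepSeek R1 model. The VRAM figures are approximate. Always check the Ollama model page for the latest quantized variants.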
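The fp16 column above follows from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below reproduces that estimate. The function name `weight_memory_gb` is illustrative, and the result covers weights only (no KV cache or runtime overhead). A nominal 4-bit estimate comes out lower than the Q4 column, likely because practical Q4 formats store extra scaling data and keep some tensors at higher precision.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Parameter counts taken from the table above.
for name, params in [("1.5b", 1.5), ("7b", 7), ("14b", 14), ("32b", 32), ("70b", 70)]:
    fp16 = weight_memory_gb(params, 16)
    nominal_q4 = weight_memory_gb(params, 4)
    print(f"deepseek-r1:{name}: fp16 ~{fp16:.1f} GB, nominal 4-bit ~{nominal_q4:.1f} GB")
```

Treat the output as a lower bound when sizing hardware, and leave room for context length and the runtime itself.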


Option A: Ollama

Ollama is the simplest way to run DeepSeek R1 locally. It downloads pre-quantized model weights and serves them through a local API automatically.

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull DeepSeek R1 (7B quantized ~4.7 GB)
ollama pull deepseek-r1:7b

# 3. Interactive chat in terminal
ollama run deepseek-r1:7b

Call via Python (OpenAI-compatible API)

from openai import OpenAI

# Point the OpenAI client at the local Ollama server.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but not validated
)

response = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[
        {"role": "user", "content": "What is the difference between RAG and fine-tuning?"}
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)
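DeepSeek R1 models typically emit their reasoning inside `<think>...</think>` tags before the final answer. Whether those tags appear in `message.content` depends on the Ollama version and settings, so the following is a minimal sketch under that assumption. The helper name `split_reasoning` is illustrative, not part of any library.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a reply that may contain <think> tags."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Hand-written sample reply, standing in for response.choices[0].message.content.
sample = "<think>RAG retrieves; fine-tuning updates weights.</think>\nRAG adds context at query time."
reasoning, answer = split_reasoning(sample)
print(answer)
```

Separating the two parts makes it easier to show users only the answer while keeping the reasoning for debugging.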

Option B — vLLM (Production / High Throughput)

vLLM is a high-performance inference server designed for production. It supports continuous batching and PagedAttention for high throughput.

# Install vLLM
pip install vllm

# Start the vLLM server (requires a CUDA GPU)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --port 8000 \
    --dtype auto

Call via Python (OpenAI-compatible API)

from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Solve: 3x + 5 = 20"}],
)
print(response.choices[0].message.content)
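Continuous batching only pays off when several requests are in flight at once. The sketch below shows one way to send prompts concurrently from a client. `run_concurrently` and `ask_vllm` are hypothetical helper names, and the worker count of 8 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrently(ask: Callable[[str], str], prompts: list[str], max_workers: int = 8) -> list[str]:
    """Send all prompts from parallel threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))


def ask_vllm(prompt: str) -> str:
    """Hypothetical helper: one chat completion against the local vLLM server."""
    from openai import OpenAI  # imported lazily so the sketch loads without the package

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Usage (needs the vLLM server from above to be running):
#   replies = run_concurrently(ask_vllm, ["Solve: 3x + 5 = 20", "Solve: 2x - 4 = 10"])
```

Comparing the wall-clock time of this call against a sequential loop over the same prompts gives a rough feel for the batching benefit.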

Understanding Quantization

Quantization stores model weights at a lower bit-width to save memory. The lower the bit-width, the larger the loss in output quality.

Format   Bits   VRAM reduction vs fp16   Quality loss
fp16     16     baseline                 none
Q8       8      ~2×                      minimal
Q4       4      ~4×                      small
Q2       2      ~8×                      noticeable

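To see why fewer bits cost quality, the sketch below quantizes a handful of made-up weights and measures the round-trip error. It uses plain symmetric quantization with a single scale for the whole list, which is a deliberate simplification. Real Q4/Q8 formats use more elaborate schemes (for example, separate scales per block of weights).

```python
def quantize(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Symmetric quantization: map floats to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def max_error(weights: list[float], bits: int) -> float:
    """Largest absolute difference between original and restored weights."""
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Made-up example weights, not taken from any real model.
weights = [0.82, -0.41, 0.07, -0.93, 0.55, 0.002]
for bits in (8, 4, 2):
    print(f"{bits}-bit max error: {max_error(weights, bits):.4f}")
```

The error grows as the bit-width shrinks, which mirrors the "Quality loss" column above.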
Student Recommendation

For learning and experimentation, Q4 quantization via Ollama is the best starting point. It runs a capable model on a laptop GPU or even CPU.


Hardware Checklist

What you have                        Recommended model
MacBook M-series (16 GB)             deepseek-r1:7b (Q4)
Windows/Linux + RTX 3080 (10 GB)     deepseek-r1:7b (Q4)
Windows/Linux + RTX 4090 (24 GB)     deepseek-r1:14b (Q4)
CPU only                             deepseek-r1:1.5b
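The checklist can be approximated with a small lookup. This is a sketch, not an official sizing rule: the Q4 sizes come from the model table in this section, and the headroom factor of 1.5 (room for KV cache and runtime overhead) is an assumption chosen so that the GPU rows above are reproduced.

```python
# Approximate Q4 sizes in GB, copied from the model size table in this section.
Q4_SIZE_GB = {
    "deepseek-r1:1.5b": 1.0,
    "deepseek-r1:7b": 4.7,
    "deepseek-r1:14b": 9.0,
    "deepseek-r1:32b": 20.0,
    "deepseek-r1:70b": 43.0,
}

HEADROOM = 1.5  # assumed safety factor for KV cache and runtime overhead


def pick_model(vram_gb: float) -> str:
    """Return the largest tag whose Q4 weights fit with headroom, else the smallest."""
    fitting = [tag for tag, size in Q4_SIZE_GB.items() if size * HEADROOM <= vram_gb]
    return max(fitting, key=Q4_SIZE_GB.get) if fitting else "deepseek-r1:1.5b"


for vram in (0, 10, 24):
    print(f"{vram} GB -> {pick_model(vram)}")
```

On machines with unified memory, such as the MacBook row, the operating system shares the same pool, so pass in the memory actually available to the model instead of the total.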

Common Mistakes

  1. Not checking VRAM before pulling: a 14B fp16 model (~28 GB) does not fit on a 10 GB GPU. Depending on the runtime, expect an out-of-memory error or a very slow fallback to CPU.
  2. Forgetting to start the Ollama server: run ollama serve if the API isn't responding.
  3. Using the wrong model name in the API call: the model field must match the exact tag returned by ollama list.
  4. Expecting API parity with OpenAI: local models via Ollama/vLLM don't support all OpenAI parameters (e.g., response_format may be limited).
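Mistakes 2 and 3 can be caught before sending a chat request. The sketch below asks a running Ollama server for its installed models through the `/api/tags` endpoint and checks the tag. The helper names are illustrative, and the payload shape (a `models` list with a `name` per entry) is assumed from Ollama's API documentation.

```python
import json
import urllib.request


def installed_tags(payload: dict) -> list[str]:
    """Extract model tags from the JSON returned by Ollama's /api/tags endpoint."""
    return [model["name"] for model in payload.get("models", [])]


def fetch_tags(base_url: str = "http://localhost:11434") -> list[str]:
    """Query a running Ollama server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return installed_tags(json.load(resp))


def check_model(wanted: str, available: list[str]) -> bool:
    """True only when the exact tag is installed."""
    return wanted in available


# Offline demonstration with a hand-written payload in the assumed shape.
sample = {"models": [{"name": "deepseek-r1:7b"}, {"name": "deepseek-r1:1.5b"}]}
print(check_model("deepseek-r1:7b", installed_tags(sample)))
print(check_model("deepseek-r1", installed_tags(sample)))
```

With the server running, `check_model("deepseek-r1:7b", fetch_tags())` performs the same check against the real model list, and a connection error points to mistake 2.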

Quick Quiz

Test Your Understanding

Q1. What does Q4 quantization mean, and what is its main trade-off?
A1. Weights are stored at 4-bit precision (~4× VRAM reduction vs fp16) at the cost of a small quality decrease.

Q2. What port does Ollama expose its OpenAI-compatible API on?
A2. Port 11434 (base URL http://localhost:11434/v1).

Q3. What are two advantages of vLLM over Ollama for production?
A3. Continuous batching and PagedAttention — enabling significantly higher throughput for concurrent requests.


Student Exercise

Exercise 3.3 — Local vs cloud comparison
Run deepseek-r1:7b via Ollama and gpt-4o-mini via the OpenAI API on the same 5 prompts. Compare: output quality, latency (seconds to first token), and cost (free vs API cost).
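For the latency part of the exercise, one possible approach is to stream the reply and record when the first piece of text arrives. The sketch below is a starting point under stated assumptions: `time_to_first_token` and `text_pieces` are hypothetical helpers, and the stream is assumed to yield OpenAI-style chunks (as returned when `stream=True` is passed to `chat.completions.create`). A fake stream stands in for a real server so the block runs on its own.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token(start_stream: Callable[[], Iterable[str]]) -> tuple[float, float, str]:
    """Start a stream and consume it; return (ttft_seconds, total_seconds, full_text)."""
    start = time.perf_counter()
    first = None
    parts = []
    for piece in start_stream():  # the request is issued inside the timed region
        if not piece:
            continue
        if first is None:
            first = time.perf_counter() - start
        parts.append(piece)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, "".join(parts)


def text_pieces(stream) -> Iterator[str]:
    """Adapt an OpenAI-style streaming response into plain text pieces."""
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


def fake_stream() -> Iterator[str]:
    """Stand-in for a real model: waits briefly, then yields two pieces."""
    time.sleep(0.05)
    yield "Hel"
    yield "lo"


ttft, total, text = time_to_first_token(fake_stream)
print(f"first token after {ttft:.3f}s, total {total:.3f}s, text={text!r}")
```

Against a real server, pass a callable such as `lambda: text_pieces(client.chat.completions.create(model=..., messages=..., stream=True))`. With R1, the first streamed text may be reasoning, not the final answer.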

