Skip to main content

9.1 When to Fine-Tune vs Prompt

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

9.1 When to Fine-Tune vs Prompt

Key Concepts: Decision framework · Cost/benefit analysis

Official Docs: OpenAI Fine-Tuning Guide · HuggingFace PEFT


The Decision Framework

Fine-tuning is expensive, time-consuming, and not always necessary. Always start with prompting.

Can prompting solve it?

Yes │ No
══════════╫══════════
↓ ↓
Use prompting Is it a style/format/persona problem?

Yes │ No
══════════╫══════════
↓ ↓
Fine-tune Need domain knowledge?

Yes │ No
══════════╫══════════
↓ ↓
RAG Prompt engineer harder

Prompting vs RAG vs Fine-Tuning

ApproachBest forCostSpeed to deploy
PromptingFormat, tone, simple tasks≈ freeMinutes
RAGDomain knowledge, recent dataLowHours
Fine-tuningStyle, persona, consistent formatHighDays-Weeks

When Fine-Tuning Makes Sense

Consistent output format — you always need JSON/XML in a specific schema
Specialised style or persona — e.g., always respond as a friendly nurse
Reduce prompt length — replace long system prompts with baked-in behaviour
Latency-sensitive — shorter prompts mean faster, cheaper inference
High volume — you will make millions of calls, and a fine-tuned small model beats a large model with a long prompt on cost

When Fine-Tuning Does NOT Help

❌ Teaching new knowledge (use RAG instead)
❌ One-off tasks
❌ When prompting already works


Cost-Benefit Analysis

# Rough cost comparison
def compare_approaches(daily_requests: int, tokens_per_request: int):
# Approach 1: GPT-4o with a 500-token system prompt
long_prompt_tokens = tokens_per_request + 500
gpt4o_cost_per_1k = 0.005 # $5/1M input tokens
daily_cost_long = (daily_requests * long_prompt_tokens / 1000) * gpt4o_cost_per_1k

# Approach 2: Fine-tuned GPT-4o-mini (50-token system prompt)
short_prompt_tokens = tokens_per_request + 50
ft_cost_per_1k = 0.0003 # $0.30/1M for fine-tuned mini input
daily_cost_ft = (daily_requests * short_prompt_tokens / 1000) * ft_cost_per_1k

print(f"Daily requests: {daily_requests:,}")
print(f"GPT-4o with long prompt: ${daily_cost_long:.2f}/day")
print(f"Fine-tuned GPT-4o-mini: ${daily_cost_ft:.2f}/day")
print(f"Monthly savings: ${(daily_cost_long - daily_cost_ft) * 30:.2f}")

compare_approaches(daily_requests=10_000, tokens_per_request=200)

Common Mistakes

Common Mistakes
  1. Fine-tuning to add knowledge — fine-tuning does not reliably add facts. The model may hallucinate differently. Use RAG for knowledge.
  2. Skipping prompting experiments — many teams fine-tune prematurely. Try few-shot prompting first; it often achieves 90% of the quality at 1% of the effort.
  3. Too little training data — fewer than 50 high-quality examples rarely produces meaningful improvement. Aim for 100–500+ examples.
  4. Forgetting catastrophic forgetting — fine-tuning can cause the model to lose general capabilities. Always evaluate on a holdout set of general tasks.

Quick Quiz

Test Your Understanding

Q1. A startup wants their chatbot to always respond with structured JSON. Should they use RAG or fine-tuning?
A1. Fine-tuning — this is a format/style problem, not a knowledge problem.

Q2. A medical company wants their chatbot to know the latest drug interactions. Should they fine-tune?
A2. No — RAG is better here. Drug interactions are factual knowledge that changes frequently. Fine-tuning can’t reliably teach new facts.

Q3. What is catastrophic forgetting?
A3. When fine-tuning on a specific task degrades the model’s performance on other general tasks.


Student Exercise

Exercise 9.1 — Justify the approach
For each scenario, decide: Prompting, RAG, or Fine-Tuning? Justify your answer.

  1. A legal firm wants an assistant that always writes in formal legal English
  2. A news site wants answers about articles published today
  3. A customer support bot that should always classify issues as one of 5 fixed categories

Further Reading

Next → 9.2 Dataset Preparation