9.1 When to Fine-Tune vs Prompt
Key Concepts: Decision framework · Cost/benefit analysis
Official Docs: OpenAI Fine-Tuning Guide · HuggingFace PEFT
The Decision Framework
Fine-tuning is expensive, time-consuming, and not always necessary. Always start with prompting.
Can prompting solve it?
├─ Yes → Use prompting
└─ No → Is it a style/format/persona problem?
   ├─ Yes → Fine-tune
   └─ No → Need domain knowledge?
      ├─ Yes → RAG
      └─ No → Prompt engineer harder
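The same decision tree can be written as a small function. This is only a sketch: the names `Approach` and `choose_approach` and the three boolean inputs are hypothetical, and in practice each answer comes from experiments rather than a flag.

```python
from enum import Enum


class Approach(Enum):
    PROMPTING = "Use prompting"
    FINE_TUNE = "Fine-tune"
    RAG = "RAG"
    PROMPT_HARDER = "Prompt engineer harder"


def choose_approach(
    prompting_solves_it: bool,
    is_style_format_persona: bool,
    needs_domain_knowledge: bool,
) -> Approach:
    """Walk the decision tree top to bottom and return the first match."""
    if prompting_solves_it:
        return Approach.PROMPTING
    if is_style_format_persona:
        return Approach.FINE_TUNE
    if needs_domain_knowledge:
        return Approach.RAG
    return Approach.PROMPT_HARDER


# Prompting failed and the problem is about output style: fine-tune.
print(choose_approach(False, True, False).value)
# Prompting failed and the model lacks domain facts: use RAG.
print(choose_approach(False, False, True).value)
```

The order of the checks matters: prompting is tested first, so it wins whenever it is good enough, which matches the advice to always start there.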
Prompting vs RAG vs Fine-Tuning
| Approach | Best for | Cost | Speed to deploy |
|---|---|---|---|
| Prompting | Format, tone, simple tasks | ≈ free | Minutes |
| RAG | Domain knowledge, recent data | Low | Hours |
| Fine-tuning | Style, persona, consistent format | High | Days to weeks |
When Fine-Tuning Makes Sense
✅ Consistent output format — you always need JSON/XML in a specific schema
✅ Specialised style or persona — e.g., always respond as a friendly nurse
✅ Reduce prompt length — replace long system prompts with baked-in behaviour
✅ Latency-sensitive — shorter prompts mean faster, cheaper inference
✅ High volume — you will make millions of calls, and a fine-tuned small model beats a large model with a long prompt on cost
When Fine-Tuning Does NOT Help
❌ Teaching new knowledge (use RAG instead)
❌ One-off tasks
❌ When prompting already works
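For a format problem, one way to judge whether prompting already works is to measure how often the prompted model's output matches the required schema. The sketch below assumes the outputs have already been collected as strings; `format_compliance_rate`, the sample outputs, and the required keys are all made up for illustration.

```python
import json


def format_compliance_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that parse as a JSON object with every required key."""
    if not outputs:
        return 0.0
    compliant = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all (e.g. extra chatter around it)
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            compliant += 1
    return compliant / len(outputs)


# Hypothetical outputs from a prompted model.
sample_outputs = [
    '{"title": "Refund policy", "summary": "Refunds within 30 days."}',
    'Sure! Here is the JSON: {"title": "Shipping", "summary": "3-5 days."}',
    '{"title": "Warranty"}',
    '{"title": "Returns", "summary": "Free returns in store."}',
]
rate = format_compliance_rate(sample_outputs, {"title", "summary"})
print(f"Format compliance: {rate:.0%}")
```

If the rate is already close to 100% with a good prompt, fine-tuning for format is unlikely to pay off. A persistently low rate after several prompt iterations is a stronger signal that fine-tuning is worth considering.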
Cost-Benefit Analysis
# Rough cost comparison (input tokens only; excludes output tokens and training cost).
# Prices are the figures assumed in this section; check current pricing before relying on them.
def compare_approaches(daily_requests: int, tokens_per_request: int) -> None:
    """Print the daily input-token cost of each approach and the monthly difference."""
    # Approach 1: GPT-4o with a 500-token system prompt
    long_prompt_tokens = tokens_per_request + 500
    gpt4o_cost_per_1k = 0.005  # $5/1M input tokens
    daily_cost_long = (daily_requests * long_prompt_tokens / 1000) * gpt4o_cost_per_1k

    # Approach 2: Fine-tuned GPT-4o-mini (50-token system prompt)
    short_prompt_tokens = tokens_per_request + 50
    ft_cost_per_1k = 0.0003  # $0.30/1M for fine-tuned mini input
    daily_cost_ft = (daily_requests * short_prompt_tokens / 1000) * ft_cost_per_1k

    print(f"Daily requests: {daily_requests:,}")
    print(f"GPT-4o with long prompt: ${daily_cost_long:.2f}/day")
    print(f"Fine-tuned GPT-4o-mini: ${daily_cost_ft:.2f}/day")
    print(f"Monthly savings: ${(daily_cost_long - daily_cost_ft) * 30:.2f}")


compare_approaches(daily_requests=10_000, tokens_per_request=200)
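The comparison above covers only per-request cost. Fine-tuning also has a one-time cost (training, data preparation, evaluation), so a break-even estimate is a useful next step. The sketch below is illustrative: `break_even_days` is a hypothetical helper, the $500 one-time cost is an invented figure, and the daily costs ($35.00 and $0.75) are what the comparison computes for 10,000 requests of 200 tokens at the prices assumed in this section.

```python
import math


def break_even_days(
    one_time_cost: float,
    daily_cost_before: float,
    daily_cost_after: float,
) -> float:
    """Days until cumulative daily savings cover the one-time fine-tuning cost."""
    daily_savings = daily_cost_before - daily_cost_after
    if daily_savings <= 0:
        return math.inf  # fine-tuning never pays for itself
    return one_time_cost / daily_savings


# Assumed figures: $500 one-time cost (hypothetical), $35.00/day vs $0.75/day.
days = break_even_days(one_time_cost=500.0, daily_cost_before=35.00, daily_cost_after=0.75)
print(f"Break-even after about {days:.1f} days")
```

At low volume the daily savings shrink and the break-even point moves out quickly, which is why one-off tasks rarely justify fine-tuning.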
Common Mistakes
- Fine-tuning to add knowledge: fine-tuning does not reliably add facts, and the model may simply hallucinate in new ways. Use RAG for knowledge.
- Skipping prompting experiments: many teams fine-tune prematurely. Try few-shot prompting first; it often gets most of the quality for a small fraction of the effort.
- Too little training data: fewer than 50 high-quality examples rarely produce a meaningful improvement. Aim for 100–500+ examples.
- Forgetting catastrophic forgetting: fine-tuning can cause the model to lose general capabilities. Always evaluate on a held-out set of general tasks.
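A simple guard against the last mistake is to score the base and fine-tuned models on the same general tasks and flag any task that got noticeably worse. The sketch below assumes the scores have already been computed by some evaluation harness; `find_regressions`, the task names, the scores, and the 0.02 tolerance are all hypothetical.

```python
def find_regressions(
    base_scores: dict[str, float],
    tuned_scores: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return tasks where the fine-tuned score dropped by more than `tolerance`."""
    regressions: dict[str, float] = {}
    for task, base in base_scores.items():
        tuned = tuned_scores.get(task)
        if tuned is None:
            continue  # task was not evaluated on the fine-tuned model
        drop = base - tuned
        if drop > tolerance:
            regressions[task] = drop
    return regressions


# Made-up scores (higher is better) for illustration only.
base = {"summarisation": 0.81, "arithmetic": 0.74, "json_format": 0.62}
tuned = {"summarisation": 0.80, "arithmetic": 0.61, "json_format": 0.97}

for task, drop in find_regressions(base, tuned).items():
    print(f"{task}: dropped by {drop:.2f}")
```

In this invented example the target task (`json_format`) improves while `arithmetic` regresses, which is the pattern catastrophic forgetting tends to produce.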
Quick Quiz
Q1. A startup wants their chatbot to always respond with structured JSON. Should they use RAG or fine-tuning?
A1. Fine-tuning — this is a format/style problem, not a knowledge problem.
Q2. A medical company wants their chatbot to know the latest drug interactions. Should they fine-tune?
A2. No — RAG is better here. Drug interactions are factual knowledge that changes frequently. Fine-tuning can’t reliably teach new facts.
Q3. What is catastrophic forgetting?
A3. When fine-tuning on a specific task degrades the model’s performance on other general tasks.
Student Exercise
Exercise 9.1 — Justify the approach
For each scenario, decide: Prompting, RAG, or Fine-Tuning? Justify your answer.
- A legal firm wants an assistant that always writes in formal legal English
- A news site wants answers about articles published today
- A customer support bot that should always classify issues as one of 5 fixed categories