9.1 When to Fine-Tune vs Prompt

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

9.1 When to Fine-Tune vs Prompt

Key Concepts: Decision framework · Cost/benefit analysis

Official Docs: OpenAI Fine-Tuning Guide · HuggingFace PEFT

The Decision Framework

Fine-tuning is expensive, time-consuming, and not always necessary. Always start with prompting.

Can prompting solve it?
         │
    Yes  │  No
══════════╫══════════
↓            ↓
Use prompting  Is it a style/format/persona problem?
                     │
                Yes  │  No
          ══════════╫══════════
          ↓            ↓
     Fine-tune    Need domain knowledge?
                       │
                  Yes  │  No
            ══════════╫══════════
            ↓            ↓
           RAG        Prompt engineer harder

Prompting vs RAG vs Fine-Tuning

Approach	Best for	Cost	Speed to deploy
Prompting	Format, tone, simple tasks	≈ free	Minutes
RAG	Domain knowledge, recent data	Low	Hours
Fine-tuning	Style, persona, consistent format	High	Days-Weeks

When Fine-Tuning Makes Sense

✅ Consistent output format — you always need JSON/XML in a specific schema
✅ Specialised style or persona — e.g., always respond as a friendly nurse
✅ Reduce prompt length — replace long system prompts with baked-in behaviour
✅ Latency-sensitive — shorter prompts mean faster, cheaper inference
✅ High volume — you will make millions of calls, and a fine-tuned small model beats a large model with a long prompt on cost

When Fine-Tuning Does NOT Help

❌ Teaching new knowledge (use RAG instead)
❌ One-off tasks
❌ When prompting already works

Cost-Benefit Analysis

# Rough cost comparison
def compare_approaches(daily_requests: int, tokens_per_request: int):
    # Approach 1: GPT-4o with a 500-token system prompt
    long_prompt_tokens = tokens_per_request + 500
    gpt4o_cost_per_1k = 0.005  # $5/1M input tokens
    daily_cost_long = (daily_requests * long_prompt_tokens / 1000) * gpt4o_cost_per_1k
    
    # Approach 2: Fine-tuned GPT-4o-mini (50-token system prompt)
    short_prompt_tokens = tokens_per_request + 50
    ft_cost_per_1k = 0.0003  # $0.30/1M for fine-tuned mini input
    daily_cost_ft = (daily_requests * short_prompt_tokens / 1000) * ft_cost_per_1k
    
    print(f"Daily requests: {daily_requests:,}")
    print(f"GPT-4o with long prompt: ${daily_cost_long:.2f}/day")
    print(f"Fine-tuned GPT-4o-mini:  ${daily_cost_ft:.2f}/day")
    print(f"Monthly savings: ${(daily_cost_long - daily_cost_ft) * 30:.2f}")

compare_approaches(daily_requests=10_000, tokens_per_request=200)

Common Mistakes

Fine-tuning to add knowledge — fine-tuning does not reliably add facts. The model may hallucinate differently. Use RAG for knowledge.
Skipping prompting experiments — many teams fine-tune prematurely. Try few-shot prompting first; it often achieves 90% of the quality at 1% of the effort.
Too little training data — fewer than 50 high-quality examples rarely produces meaningful improvement. Aim for 100–500+ examples.
Forgetting catastrophic forgetting — fine-tuning can cause the model to lose general capabilities. Always evaluate on a holdout set of general tasks.

Quick Quiz

Test Your Understanding

Q1. A startup wants their chatbot to always respond with structured JSON. Should they use RAG or fine-tuning?
A1. Fine-tuning — this is a format/style problem, not a knowledge problem.

Q2. A medical company wants their chatbot to know the latest drug interactions. Should they fine-tune?
A2. No — RAG is better here. Drug interactions are factual knowledge that changes frequently. Fine-tuning can’t reliably teach new facts.

Q3. What is catastrophic forgetting?
A3. When fine-tuning on a specific task degrades the model’s performance on other general tasks.

Student Exercise

Exercise 9.1 — Justify the approach
For each scenario, decide: Prompting, RAG, or Fine-Tuning? Justify your answer.

A legal firm wants an assistant that always writes in formal legal English
A news site wants answers about articles published today
A customer support bot that should always classify issues as one of 5 fixed categories

9.1 When to Fine-Tune vs Prompt

The Decision Framework​

Prompting vs RAG vs Fine-Tuning​

When Fine-Tuning Makes Sense​

When Fine-Tuning Does NOT Help​

Cost-Benefit Analysis​

Common Mistakes​

Quick Quiz​

Student Exercise​

Further Reading​

The Decision Framework

Prompting vs RAG vs Fine-Tuning

When Fine-Tuning Makes Sense

When Fine-Tuning Does NOT Help

Cost-Benefit Analysis

Common Mistakes

Quick Quiz

Student Exercise

Further Reading