9.2 Dataset Preparation

Key Concepts: Instruction-response pairs · Data quality · Alpaca/ShareGPT formatting

Official Docs: HuggingFace Datasets · OpenAI Fine-Tuning Data Format

The Two Standard Formats

Alpaca Format (Instruction Tuning)

Used to teach a model to follow instructions:

{
    "instruction": "Classify the following text as positive, neutral, or negative.",
    "input": "The product arrived on time and worked perfectly.",
    "output": "Positive"
}
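Alpaca records can be mapped onto the chat layout described in the next subsection, which is handy when a training tool only accepts conversations. The following is a minimal sketch, not a standard utility: the function name alpaca_to_messages and the choice to join instruction and input with a blank line are assumptions made for illustration.

```python
from typing import Optional


def alpaca_to_messages(record: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one Alpaca-style record into a chat-style example (hypothetical helper)."""
    # Join the instruction and the optional input into a single user turn.
    # The blank-line separator is an assumption, not part of any specification.
    user_text = record["instruction"]
    if record.get("input"):
        user_text += "\n\n" + record["input"]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}


record = {
    "instruction": "Classify the following text as positive, neutral, or negative.",
    "input": "The product arrived on time and worked perfectly.",
    "output": "Positive",
}
chat_example = alpaca_to_messages(record)
print(chat_example["messages"][0]["content"])
```

Records with an empty input field produce a user turn that contains only the instruction.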

ShareGPT / Chat Format (Conversation Tuning)

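Used to fine-tune on multi-turn conversations. ShareGPT exports typically store turns under a "conversations" key with "from"/"value" fields; the equivalent "messages" layout with "role"/"content" fields, which the OpenAI fine-tuning API accepts, looks like this: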

{
    "messages": [
        {"role": "system", "content": "You are a helpful medical assistant. Always recommend consulting a doctor."},
        {"role": "user", "content": "What is metformin used for?"},
        {"role": "assistant", "content": "Metformin is a first-line medication for type 2 diabetes. It helps lower blood sugar by reducing glucose production in the liver. Always consult your doctor before starting or changing any medication."}
    ]
}
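Datasets published in the original ShareGPT layout usually keep turns under a conversations key with from/value fields rather than messages with role/content. A minimal sketch of converting between the two follows. The helper name sharegpt_to_messages and the ROLE_MAP table are hypothetical, and the speaker labels (human, gpt, system) are the ones commonly seen in such dumps; other datasets may use different labels.

```python
# Speaker labels commonly seen in ShareGPT-style dumps, mapped to chat roles.
# Assumption: your dataset uses these labels; extend the table if it does not.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> dict:
    """Convert a {"conversations": [{"from", "value"}, ...]} record to chat messages."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            # Fail loudly instead of silently dropping an unknown speaker.
            raise ValueError(f"Unknown speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}


sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What is metformin used for?"},
        {"from": "gpt", "value": "Metformin is a first-line medication for type 2 diabetes."},
    ]
}
converted = sharegpt_to_messages(sharegpt_record)
print([m["role"] for m in converted["messages"]])
```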

Creating a Dataset from Scratch

import json
from pathlib import Path

# Your raw examples
examples = [
    {
        "user": "What are the symptoms of dehydration?",
        "assistant": "Common symptoms include thirst, dark urine, dry mouth, fatigue, and dizziness. Severe dehydration requires immediate medical attention."
    },
    {
        "user": "How much water should I drink daily?",
        "assistant": "A general guideline is 8 cups (2 litres) per day, but needs vary by weight, activity level, and climate. Consult your doctor for personalised advice."
    },
]

system_prompt = "You are a helpful medical information assistant. Always advise consulting a healthcare professional."

# Convert to ShareGPT format
chat_dataset = [
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": ex["user"]},
            {"role": "assistant", "content": ex["assistant"]},
        ]
    }
    for ex in examples
]

# Save as JSONL (one JSON object per line, newline-terminated, UTF-8)
Path("train.jsonl").write_text("".join(json.dumps(ex) + "\n" for ex in chat_dataset), encoding="utf-8")
print(f"Saved {len(chat_dataset)} examples")
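After writing a JSONL file it is worth reading it back to confirm that every line parses. The sketch below shows one way to do that with only the standard library. The helpers write_jsonl and read_jsonl are hypothetical names, and the sample record and temporary file exist only for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line, each terminated by a newline."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Parse a JSONL file line by line, reporting the line number of any bad row."""
    records = []
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err.msg})") from err
    return records


sample = [
    {
        "messages": [
            {"role": "user", "content": "Hi"},
            {"role": "assistant", "content": "Hello, how can I help you today?"},
        ]
    }
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    write_jsonl(path, sample)
    loaded = read_jsonl(path)

print(loaded == sample)
```

Because each line is parsed independently, a single malformed row is reported with its line number instead of invalidating the whole file.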

Data Quality Checklist

def validate_example(example: dict) -> list[str]:
    """Return a list of issues found in one chat-format example."""
    issues = []
    messages = example.get("messages", [])

    # Check structure
    if not messages:
        issues.append("No messages")
        return issues

    # Every message needs a role and string content before the other checks can run
    if any("role" not in m or not isinstance(m.get("content"), str) for m in messages):
        issues.append("Message missing 'role' or 'content'")
        return issues

    if messages[0]["role"] != "system":
        issues.append("No system message")

    # Check content quality
    for msg in messages:
        if msg["role"] == "assistant" and len(msg["content"].strip()) < 20:
            issues.append(f"Assistant response too short: '{msg['content']}'")

    # Check for leftover template text (reported once per example)
    if any("[PLACEHOLDER]" in m["content"] or "TODO" in m["content"] for m in messages):
        issues.append("Contains placeholder text")

    return issues

# Validate entire dataset
for i, example in enumerate(chat_dataset):
    issues = validate_example(example)
    if issues:
        print(f"Example {i}: {issues}")
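The checks above look at message content; turn order is another structural property worth verifying. The sketch below assumes a common convention: an optional system message first, then strictly alternating user and assistant turns, ending on an assistant turn. Not every training tool requires this, and the function name check_role_order is hypothetical.

```python
def check_role_order(messages: list[dict]) -> list[str]:
    """Return issues with turn order, assuming user/assistant alternation."""
    issues = []
    roles = [m.get("role") for m in messages]

    # An optional system message is only allowed in the first position.
    turns = roles[1:] if roles and roles[0] == "system" else roles
    if "system" in turns:
        issues.append("System message is not first")
    if not turns:
        issues.append("No user/assistant turns")
        return issues

    # Walk the turns and report the first one that breaks the alternation.
    expected = "user"
    for i, role in enumerate(turns):
        if role != expected:
            issues.append(f"Turn {i}: expected {expected!r}, got {role!r}")
            break
        expected = "assistant" if expected == "user" else "user"

    if turns[-1] != "assistant":
        issues.append("Conversation does not end with an assistant turn")
    return issues


good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the symptoms of dehydration?"},
    {"role": "assistant", "content": "Common symptoms include thirst and dark urine."},
]
bad = [
    {"role": "user", "content": "Hello"},
    {"role": "user", "content": "Anyone there?"},
]
print(check_role_order(good))
print(check_role_order(bad))
```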

Train/Validation Split

import random

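def split_dataset(examples: list, val_ratio: float = 0.1, seed: int = 42) -> tuple[list, list]:
    """Split dataset into training and validation sets without modifying the input."""
    shuffled = list(examples)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    split_idx = int(len(shuffled) * (1 - val_ratio))
    train = shuffled[:split_idx]
    val = shuffled[split_idx:]
    print(f"Train: {len(train)} examples | Validation: {len(val)} examples")
    return train, val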

train_data, val_data = split_dataset(chat_dataset)

# Save both splits (this overwrites the train.jsonl written earlier)
Path("train.jsonl").write_text("".join(json.dumps(ex) + "\n" for ex in train_data), encoding="utf-8")
Path("val.jsonl").write_text("".join(json.dumps(ex) + "\n" for ex in val_data), encoding="utf-8")
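A shuffle-based split reassigns examples whenever the dataset grows, and it can place two copies of the same prompt on opposite sides of the split. One alternative, sketched here as an assumption and not as something the lesson prescribes, is to derive the split from a stable hash of the first user turn. The function name assign_split and the 100-bucket scheme are illustrative choices, and the resulting ratio is only approximate on small datasets.

```python
import hashlib


def assign_split(example: dict, val_percent: int = 10) -> str:
    """Return "val" or "train" from a stable hash of the first user turn.

    Raises StopIteration if the example has no user message.
    """
    first_user = next(m["content"] for m in example["messages"] if m["role"] == "user")
    # Normalise case and whitespace so trivially different copies hash alike.
    key = " ".join(first_user.lower().split())
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "val" if bucket < val_percent else "train"


examples = [
    {"messages": [
        {"role": "user", "content": "What are the symptoms of dehydration?"},
        {"role": "assistant", "content": "Common symptoms include thirst and dark urine."},
    ]},
    {"messages": [
        {"role": "user", "content": "How much water should I drink daily?"},
        {"role": "assistant", "content": "A general guideline is 8 cups per day."},
    ]},
]

splits = {"train": [], "val": []}
for ex in examples:
    splits[assign_split(ex)].append(ex)
print({name: len(rows) for name, rows in splits.items()})
```

Because the assignment depends only on the prompt text, an example keeps its split across runs and across dataset versions.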

Common Mistakes

Low-quality "gold" outputs: the model learns to imitate its training responses, so mediocre assistant responses yield a consistently mediocre fine-tuned model. Prioritise quality over quantity.
Inconsistent style: if some examples are formal and others casual, the model has no single target to learn and its tone will vary unpredictably. Pick one style and apply it everywhere.
No validation set: always hold out about 10% of the data to detect overfitting. If training loss keeps dropping while validation loss rises, the model is overfitting.
Too-similar examples: if all examples closely resemble one another, the model overfits to that narrow distribution. Include diverse inputs.
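The last mistake, too-similar examples, can be checked roughly by comparing prompts pairwise. The sketch below uses difflib from the standard library; the function name find_near_duplicates and the 0.9 threshold are assumptions to tune for your own data, and the pairwise comparison is quadratic, so it only suits small datasets.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(prompts: list[str], threshold: float = 0.9) -> list[tuple[int, int, float]]:
    """Return (index_a, index_b, similarity) for prompt pairs at or above the threshold."""
    # Normalise case and whitespace before comparing.
    normalized = [" ".join(p.lower().split()) for p in prompts]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(normalized), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


prompts = [
    "What are the symptoms of dehydration?",
    "What are the symptoms of  dehydration? ",
    "How much water should I drink daily?",
]
print(find_near_duplicates(prompts))
```

Flagged pairs are candidates for manual review, not automatic deletion: two prompts can be worded alike and still need different answers.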

Quick Quiz

Test Your Understanding

Q1. What is JSONL format?
A1. JSON Lines — one JSON object per line in a text file. Used for fine-tuning datasets because files can be streamed line-by-line.

Q2. What is the difference between Alpaca and ShareGPT formats?
A2. Alpaca uses instruction/input/output fields for single-turn instruction following. The ShareGPT-style chat format uses a list of messages, each with a role and content, for multi-turn conversations.

Q3. How many training examples is typically the minimum for meaningful fine-tuning results?
A3. As a rule of thumb, at least 50–100 high-quality examples, though 300–500 or more are recommended for reliable improvement.

Student Exercise

Exercise 9.2 — Build a dataset
Create a 20-example dataset for a customer support bot for a fictional software company. Use the ShareGPT format. Validate each example programmatically. Save as train.jsonl and val.jsonl with a 90/10 split.
