
9.2 Dataset Preparation

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Instruction-response pairs · Data quality · Alpaca/ShareGPT formatting

Official Docs: HuggingFace Datasets · OpenAI Fine-Tuning Data Format


The Two Standard Formats

Alpaca Format (Instruction Tuning)

Used to teach a model to follow instructions:

{
  "instruction": "Classify the following text as positive, neutral, or negative.",
  "input": "The product arrived on time and worked perfectly.",
  "output": "Positive"
}
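Before training, an Alpaca record is usually flattened into one prompt string. The following is a minimal sketch of that step. The function name `format_alpaca` and the section headers are illustrative assumptions; training frameworks differ in the exact template they apply.

```python
def format_alpaca(record: dict) -> str:
    """Flatten one Alpaca record into a single prompt/response string."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    # The "input" field is optional; skip its section when it is missing or empty.
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)


record = {
    "instruction": "Classify the following text as positive, neutral, or negative.",
    "input": "The product arrived on time and worked perfectly.",
    "output": "Positive",
}
print(format_alpaca(record))
```

Whatever template you choose, apply the same one at inference time so the model sees prompts shaped like its training data.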

ShareGPT / Chat Format (Conversation Tuning)

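Used to fine-tune on multi-turn conversations. The "messages" layout shown here, with "role" and "content" keys, is the one the OpenAI fine-tuning API expects. Files in the original ShareGPT layout carry the same information in a "conversations" list with "from" and "value" keys: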

{
  "messages": [
    {"role": "system", "content": "You are a helpful medical assistant. Always recommend consulting a doctor."},
    {"role": "user", "content": "What is metformin used for?"},
    {"role": "assistant", "content": "Metformin is a first-line medication for type 2 diabetes. It helps lower blood sugar by reducing glucose production in the liver. Always consult your doctor before starting or changing any medication."}
  ]
}
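Datasets published in the original ShareGPT layout typically store turns under a "conversations" key, with "from" and "value" fields and speaker names such as "human" and "gpt". The sketch below converts such a record into the role/content layout shown above. The helper name `sharegpt_to_messages` and the speaker-to-role mapping are assumptions; check them against the dataset you actually have.

```python
# Assumed mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> dict:
    """Convert a {"conversations": [...]} record to a {"messages": [...]} record."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            # Fail loudly instead of guessing a role for an unknown speaker.
            raise ValueError(f"Unknown speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}


sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What is metformin used for?"},
        {"from": "gpt", "value": "Metformin is a first-line medication for type 2 diabetes."},
    ]
}
converted = sharegpt_to_messages(sharegpt_record)
print(converted)
```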

Creating a Dataset from Scratch

import json
from pathlib import Path

# Your raw examples
examples = [
    {
        "user": "What are the symptoms of dehydration?",
        "assistant": "Common symptoms include thirst, dark urine, dry mouth, fatigue, and dizziness. Severe dehydration requires immediate medical attention."
    },
    {
        "user": "How much water should I drink daily?",
        "assistant": "A general guideline is 8 cups (2 litres) per day, but needs vary by weight, activity level, and climate. Consult your doctor for personalised advice."
    },
]

system_prompt = "You are a helpful medical information assistant. Always advise consulting a healthcare professional."

# Convert to the chat ("messages") format
chat_dataset = [
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": ex["user"]},
            {"role": "assistant", "content": ex["assistant"]},
        ]
    }
    for ex in examples
]

# Save as JSONL (one JSON object per line, newline-terminated)
Path("train.jsonl").write_text(
    "\n".join(json.dumps(ex) for ex in chat_dataset) + "\n",
    encoding="utf-8",
)
print(f"Saved {len(chat_dataset)} examples")
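To confirm a saved file is well formed, it helps to read it back one line at a time. The sketch below does this with a generator, so the whole file never has to sit in memory. The helper name `iter_jsonl` and the throwaway demo file are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so it can be fixed by hand.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err


# Round-trip demo on a temporary file with hypothetical records.
records = [
    {"messages": [{"role": "user", "content": "Hi"}]},
    {"messages": [{"role": "user", "content": "Bye"}]},
]
demo_path = Path(tempfile.mkdtemp()) / "demo.jsonl"
demo_path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")

loaded = list(iter_jsonl(demo_path))
print(f"Loaded {len(loaded)} records")
```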

Data Quality Checklist

def validate_example(example: dict) -> list[str]:
    """Return a list of issues with the example."""
    issues = []
    messages = example.get("messages", [])

    # Check structure
    if not messages:
        issues.append("No messages")
        return issues

    if messages[0].get("role") != "system":
        issues.append("No system message")

    for msg in messages:
        content = msg.get("content") or ""

        # Check content quality: very short assistant replies are rarely useful
        if msg.get("role") == "assistant" and len(content.strip()) < 20:
            issues.append(f"Assistant response too short: '{content}'")

        # Check for leftover placeholder text
        if "[PLACEHOLDER]" in content or "TODO" in content:
            issues.append("Contains placeholder text")

    return issues

# Validate entire dataset
for i, example in enumerate(chat_dataset):
    issues = validate_example(example)
    if issues:
        print(f"Example {i}: {issues}")
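The checks above look at each message on its own. Another useful check is whether the turns arrive in a sensible order. The sketch below assumes conversations should alternate user and assistant turns and end on an assistant reply. That convention, and the helper name `check_turn_order`, are assumptions to adapt to your data.

```python
def check_turn_order(messages: list[dict]) -> list[str]:
    """Return issues with the order of user/assistant turns in one conversation."""
    # System messages are left out of the alternation check.
    turns = [m for m in messages if m.get("role") != "system"]
    if not turns:
        return ["No user or assistant turns"]

    issues = []
    for i, msg in enumerate(turns):
        # Even positions should be user turns, odd positions assistant turns.
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            issues.append(f"Turn {i}: expected {expected!r}, got {msg.get('role')!r}")
    if turns[-1].get("role") != "assistant":
        issues.append("Conversation does not end with an assistant message")
    return issues


good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the symptoms of dehydration?"},
    {"role": "assistant", "content": "Common symptoms include thirst and dark urine."},
]
bad = [
    {"role": "user", "content": "Hello"},
    {"role": "user", "content": "Anyone there?"},
]
print(check_turn_order(good))
print(check_turn_order(bad))
```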

Train/Validation Split

import random

def split_dataset(examples: list, val_ratio: float = 0.1, seed: int = 42) -> tuple[list, list]:
    """Split dataset into training and validation sets."""
    # Shuffle a copy with a fixed seed: the caller's list stays untouched
    # and the split is reproducible across runs.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    split_idx = int(len(shuffled) * (1 - val_ratio))
    train = shuffled[:split_idx]
    val = shuffled[split_idx:]
    print(f"Train: {len(train)} examples | Validation: {len(val)} examples")
    return train, val

train_data, val_data = split_dataset(chat_dataset)

# Save both splits (this overwrites any earlier train.jsonl)
Path("train.jsonl").write_text("\n".join(json.dumps(ex) for ex in train_data) + "\n", encoding="utf-8")
Path("val.jsonl").write_text("\n".join(json.dumps(ex) for ex in val_data) + "\n", encoding="utf-8")
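A validation set only measures generalisation if it shares no examples with the training set. The sketch below flags user prompts that appear in both splits after light normalisation. The helper names and the normalisation rule (lowercase, collapsed whitespace) are assumptions, and the sample data is hypothetical.

```python
def first_user_prompt(example: dict) -> str:
    """Return the first user message, lowercased with whitespace collapsed."""
    for msg in example["messages"]:
        if msg["role"] == "user":
            return " ".join(msg["content"].lower().split())
    return ""


def find_overlap(train: list[dict], val: list[dict]) -> set[str]:
    """Return normalised user prompts that occur in both splits."""
    train_prompts = {first_user_prompt(ex) for ex in train}
    val_prompts = {first_user_prompt(ex) for ex in val}
    # Drop the empty string produced by examples without a user turn.
    return (train_prompts & val_prompts) - {""}


train_split = [
    {"messages": [{"role": "user", "content": "How much water should I drink daily?"}]},
    {"messages": [{"role": "user", "content": "What are the symptoms of dehydration?"}]},
]
val_split = [
    {"messages": [{"role": "user", "content": "how much water should I drink  daily?"}]},
]
overlap = find_overlap(train_split, val_split)
print(f"{len(overlap)} prompt(s) appear in both splits: {overlap}")
```

One way to avoid overlap in the first place is to deduplicate the dataset before splitting it.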

Common Mistakes

  1. Low-quality "gold" outputs: if your assistant responses are mediocre, the fine-tuned model will reproduce that mediocrity consistently. Quality matters more than quantity.
  2. Inconsistent style: if some examples are formal and others casual, the model has no single style to learn and its tone will vary unpredictably. Pick one style and apply it everywhere.
  3. No validation set: hold out a portion of the data (10% is a common choice) to detect overfitting. If training loss keeps dropping while validation loss rises, the model is overfitting.
  4. Too-similar examples: if all examples closely resemble one another, the model overfits to that narrow distribution. Include diverse inputs.
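The "too-similar examples" problem can be checked roughly by comparing user prompts pairwise. The sketch below uses `difflib.SequenceMatcher` from the standard library. The 0.9 threshold and the helper name `find_near_duplicates` are arbitrary assumptions, and the sample prompts are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(prompts: list[str], threshold: float = 0.9) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair of prompts at or above the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(prompts), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs


prompts = [
    "How do I reset my password?",
    "How do I reset my password",
    "What are your support hours?",
]
for i, j, ratio in find_near_duplicates(prompts):
    print(f"Prompts {i} and {j} look near-identical (similarity {ratio})")
```

The pairwise comparison is quadratic in the number of prompts. That is fine for a few hundred examples, but larger datasets would need a cheaper method.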

Quick Quiz

Test Your Understanding

Q1. What is JSONL format?
A1. JSON Lines — one JSON object per line in a text file. Used for fine-tuning datasets because files can be streamed line-by-line.

Q2. What is the difference between Alpaca and ShareGPT formats?
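A2. Alpaca uses instruction/input/output fields for single-turn instruction following. The chat format uses a list of messages, each with a role and content, for multi-turn conversations. Files in the original ShareGPT layout name these fields from/value inside a conversations list.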

Q3. How many training examples is typically the minimum for meaningful fine-tuning results?
A3. There is no hard minimum. A common rule of thumb is at least 50–100 high-quality examples, with 300–500 or more recommended for reliable improvement.


Student Exercise

Exercise 9.2 — Build a dataset
Create a 20-example dataset for a customer support bot for a fictional software company. Use the ShareGPT format. Validate each example programmatically. Save as train.jsonl and val.jsonl with a 90/10 split.

