9.4 Fine-Tuning with Unsloth
AI-generated content may contain errors. Always verify against official sources.
9.4 Fine-Tuning with Unsloth
Key Concepts: Unsloth framework · Fine-tune 7B on Google Colab · Hands-on walkthrough
Official Docs: Unsloth · Unsloth GitHub
What is Unsloth?
Unsloth is a fine-tuning framework that makes QLoRA 2× faster and uses ~60% less VRAM than the standard HuggingFace PEFT approach. It achieves this by writing custom CUDA kernels for the most expensive operations.
| HuggingFace + PEFT | Unsloth | |
|---|---|---|
| Training speed | 1× | ~2× faster |
| VRAM usage | baseline | ~40% less |
| Llama 3.1 8B on Colab T4 | OOM | Works |
Installation
# For CUDA 12.1
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
Tip: Unsloth provides a ready-to-use Google Colab notebook for Llama 3.1 8B fine-tuning.
Complete Fine-Tuning Example
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch
# 1. Load model and tokeniser
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
max_seq_length=2048,
load_in_4bit=True, # QLoRA
)
# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0, # Unsloth-optimised: use 0
bias="none",
use_gradient_checkpointing="unsloth", # Unsloth memory optimisation
)
# 3. Load your JSONL dataset
dataset = load_dataset("json", data_files={"train": "train.jsonl"})["train"]
# 4. Format as chat (Llama 3 format)
def format_chat(example):
messages = example["messages"]
text = tokenizer.apply_chat_template(messages, tokenize=False)
return {"text": text}
dataset = dataset.map(format_chat)
# 5. Train
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
args=TrainingArguments(
output_dir="./fine-tuned",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
save_strategy="epoch",
),
)
trainer.train()
# 6. Save model
model.save_pretrained("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")
Export for Inference
Option A — Merge and save as float16 (for HuggingFace inference)
model.save_pretrained_merged("./merged-model", tokenizer, save_method="merged_16bit")
Option B — Export to GGUF (for Ollama/llama.cpp)
# Q4_K_M is the most popular GGUF quantisation
model.save_pretrained_gguf("./gguf-model", tokenizer, quantization_method="q4_k_m")
# Then create an Ollama Modelfile:
# FROM ./gguf-model/model-Q4_K_M.gguf
# PARAMETER temperature 0.7
Common Mistakes
- Learning rate too high —
2e-4is standard for LoRA. Higher rates cause loss spikes and unstable training. - Too many epochs — 3 epochs is usually sufficient for 100–500 examples. More epochs causes overfitting.
- Not using a validation set — always monitor validation loss alongside training loss.
- Forgetting
use_gradient_checkpointing="unsloth"— this is Unsloth’s key memory-saving flag. Without it you lose the VRAM benefits.
Quick Quiz
Q1. What makes Unsloth faster than standard HuggingFace PEFT?
A1. Custom CUDA kernels that rewrite the most computationally expensive operations (attention, cross-entropy) from scratch.
Q2. What does GGUF format enable?
A2. Running fine-tuned models locally with llama.cpp or Ollama, without needing a GPU or the Python ML stack.
Q3. What is the recommended number of training epochs when fine-tuning with 100–500 examples?
A3. 3 epochs. More risks overfitting; fewer risks underfitting.
Student Exercise
Exercise 9.4 — Fine-tune on Google Colab
Using the free Unsloth Colab notebook for Llama 3.1 8B, fine-tune on the Alpaca dataset. Monitor training loss. Export to GGUF and run inference with Ollama.