9.4 Fine-Tuning with Unsloth

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

9.4 Fine-Tuning with Unsloth

Key Concepts: Unsloth framework · Fine-tune 7B on Google Colab · Hands-on walkthrough

Official Docs: Unsloth · Unsloth GitHub

What is Unsloth?

Unsloth is a fine-tuning framework that makes QLoRA 2× faster and uses ~60% less VRAM than the standard HuggingFace PEFT approach. It achieves this by writing custom CUDA kernels for the most expensive operations.

	HuggingFace + PEFT	Unsloth
Training speed	1×	~2× faster
VRAM usage	baseline	~40% less
Llama 3.1 8B on Colab T4	OOM	Works

Installation

# For CUDA 12.1
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

Tip: Unsloth provides a ready-to-use Google Colab notebook for Llama 3.1 8B fine-tuning.

Complete Fine-Tuning Example

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# 1. Load model and tokeniser
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,       # QLoRA
)

# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,          # Unsloth-optimised: use 0
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth memory optimisation
)

# 3. Load your JSONL dataset
dataset = load_dataset("json", data_files={"train": "train.jsonl"})["train"]

# 4. Format as chat (Llama 3 format)
def format_chat(example):
    messages = example["messages"]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

dataset = dataset.map(format_chat)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./fine-tuned",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()

# 6. Save model
model.save_pretrained("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")

Export for Inference

Option A — Merge and save as float16 (for HuggingFace inference)

model.save_pretrained_merged("./merged-model", tokenizer, save_method="merged_16bit")

Option B — Export to GGUF (for Ollama/llama.cpp)

# Q4_K_M is the most popular GGUF quantisation
model.save_pretrained_gguf("./gguf-model", tokenizer, quantization_method="q4_k_m")

# Then create an Ollama Modelfile:
# FROM ./gguf-model/model-Q4_K_M.gguf
# PARAMETER temperature 0.7

Common Mistakes

Learning rate too high — 2e-4 is standard for LoRA. Higher rates cause loss spikes and unstable training.
Too many epochs — 3 epochs is usually sufficient for 100–500 examples. More epochs causes overfitting.
Not using a validation set — always monitor validation loss alongside training loss.
Forgetting use_gradient_checkpointing="unsloth" — this is Unsloth’s key memory-saving flag. Without it you lose the VRAM benefits.

Quick Quiz

Test Your Understanding

Q1. What makes Unsloth faster than standard HuggingFace PEFT?
A1. Custom CUDA kernels that rewrite the most computationally expensive operations (attention, cross-entropy) from scratch.

Q2. What does GGUF format enable?
A2. Running fine-tuned models locally with llama.cpp or Ollama, without needing a GPU or the Python ML stack.

Q3. What is the recommended number of training epochs when fine-tuning with 100–500 examples?
A3. 3 epochs. More risks overfitting; fewer risks underfitting.

Student Exercise

Exercise 9.4 — Fine-tune on Google Colab
Using the free Unsloth Colab notebook for Llama 3.1 8B, fine-tune on the Alpaca dataset. Monitor training loss. Export to GGUF and run inference with Ollama.

9.4 Fine-Tuning with Unsloth

What is Unsloth?​

Installation​

Complete Fine-Tuning Example​

Export for Inference​

Option A — Merge and save as float16 (for HuggingFace inference)​

Option B — Export to GGUF (for Ollama/llama.cpp)​

Common Mistakes​

Quick Quiz​

Student Exercise​

Further Reading​

What is Unsloth?

Installation

Complete Fine-Tuning Example

Export for Inference

Option A — Merge and save as float16 (for HuggingFace inference)

Option B — Export to GGUF (for Ollama/llama.cpp)

Common Mistakes

Quick Quiz

Student Exercise

Further Reading