9.3 LoRA & QLoRA
Key Concepts: Parameter-efficient fine-tuning · Adapter layers · Memory savings
Official Docs: HuggingFace PEFT · QLoRA Paper
Why Not Fine-Tune All Parameters?
A 7B parameter model has ~28GB of weights in float32. Fine-tuning all of them requires:
- Storing the full model
- Storing gradients (same size as model)
- Storing an optimiser state (2× model size for Adam)
➞ ~4× model size in GPU VRAM (weights + gradients + 2× for Adam) just to start training, before counting activations
LoRA solves this by training only a tiny number of new parameters.
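To make the memory estimate above concrete, the sketch below adds up the weights, the gradients, and Adam's two moment buffers for a given parameter count. It assumes everything is held in float32 and ignores activations, so real usage will differ (mixed-precision training changes the bytes per parameter). The function name is illustrative, not part of any library.

```python
def full_finetune_memory_gb(n_params: float, bytes_per_param: int = 4) -> dict:
    """Rough VRAM needed for full fine-tuning with Adam, ignoring activations."""
    weights = n_params * bytes_per_param / 1e9
    gradients = weights        # one gradient value per weight
    adam_state = 2 * weights   # first and second moment estimates
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_state": adam_state,
        "total": weights + gradients + adam_state,
    }


estimate = full_finetune_memory_gb(7e9)  # 7B parameters in float32
for name, gb in estimate.items():
    print(f"{name:>10}: {gb:6.1f} GB")
```

For a 7B model in float32 this gives 28 GB of weights and 112 GB in total, which is why full fine-tuning is out of reach for a single consumer GPU.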
How LoRA Works
LoRA (Low-Rank Adaptation) freezes the original model weights and adds small adapter matrices to attention layers:
Original weight matrix W (d × d) — FROZEN
LoRA adds: W + ΔW = W + A × B
where A is (d × r) and B is (r × d)
and r << d (e.g., r=8 vs d=4096)
- Instead of updating 4096×4096 ≈ 16.8M parameters, LoRA updates 4096×8 + 8×4096 = 65,536 (~65K) parameters
- Reduction: 256× fewer trainable parameters for this matrix
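The arithmetic above can be checked with a few lines of Python. This is a minimal sketch for a single weight matrix; the helper names are made up for illustration, and the shapes follow the convention used above (A is d × r, B is r × d).

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix W."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two adapter matrices A (d_in x r) and B (r x d_out)."""
    return d_in * r + r * d_out


d, r = 4096, 8
full = full_params(d, d)
lora = lora_params(d, d, r)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
# full: 16,777,216  lora: 65,536  reduction: 256x
```

Because the adapter size grows linearly with r, doubling the rank doubles the number of trainable parameters for each adapted matrix.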
LoRA with HuggingFace PEFT
pip install peft transformers datasets trl accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
# Load the base model in 8-bit (needs bitsandbytes and a CUDA GPU)
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # Rank: lower = fewer params (try 4, 8, 16)
    lora_alpha=32,     # Scaling factor (here 4r; 2r is also common)
    lora_dropout=0.1,  # Regularisation
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)
# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output (exact figures may vary with library version):
# trainable params: 851,968 || all params: 1,236,666,368 || trainable%: 0.0689
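The names q_proj and v_proj are specific to Llama-style models, so it helps to list what a given model actually calls its layers before filling in target_modules. The sketch below is one way to do that: it walks `named_modules()` and counts the leaf names of modules whose class name contains "Linear". The helper and the stand-in classes are hypothetical and exist only so the example runs without downloading weights; with a real model you would pass the loaded model object instead.

```python
from collections import Counter


def linear_layer_names(model) -> Counter:
    """Count leaf names of modules whose class name contains 'Linear'."""
    counts = Counter()
    for full_name, module in model.named_modules():
        if "Linear" in type(module).__name__:
            counts[full_name.rsplit(".", 1)[-1]] += 1
    return counts


class Linear:
    """Stand-in for torch.nn.Linear (assumption: keeps the sketch dependency-free)."""


class FakeModel:
    """Stand-in that mimics the (name, module) pairs yielded by named_modules()."""

    def named_modules(self):
        yield "", self
        for i in range(2):
            for leaf in ("q_proj", "k_proj", "v_proj", "o_proj"):
                yield f"model.layers.{i}.self_attn.{leaf}", Linear()


names = linear_layer_names(FakeModel())
print(sorted(names.items()))
```

Matching on the class name is a heuristic: it also catches quantised variants such as Linear8bitLt, but it will miss models like GPT-2, whose attention projections use a Conv1D class.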
QLoRA — LoRA + Quantisation
QLoRA (Quantised LoRA) stores the frozen base model in 4-bit precision, cutting the memory needed for its weights by roughly 4× compared with fp16:
| Method | VRAM for 7B model | Quality |
|---|---|---|
| Full fine-tune | ~56 GB | Best |
| LoRA (fp16) | ~16 GB | Very good |
| QLoRA (4-bit + LoRA) | ~6 GB | Good |
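The figures in the table are rough totals. As a sanity check, the sketch below computes only the memory taken by the base weights at each precision; the remainder of each table entry is presumably taken up by adapters, their optimiser state, activations, and quantisation overhead. The dictionary and function names are illustrative.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the model weights alone, in GB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(7e9, dtype):5.1f} GB")
```

For a 7B model this gives 14 GB of weights in fp16 and 3.5 GB in 4-bit, consistent with the order of magnitude shown for LoRA and QLoRA above.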
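from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store base weights in 4-bit
    bnb_4bit_use_double_quant=True,        # also quantise the quantisation constants
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matrix multiplications
)
# model_name is the same identifier used in the LoRA example
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)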
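Loading the model in 4-bit is only half of QLoRA; the LoRA adapters still have to be attached. The sketch below shows one way the two steps might be combined, reusing the settings from this section. The function name build_qlora_model is hypothetical, the imports are deferred so that defining the function does not require a GPU, and actually calling it needs a CUDA device plus access to the model weights.

```python
def build_qlora_model(model_name: str, r: int = 8, lora_alpha: int = 32):
    """Load a 4-bit base model and wrap it with LoRA adapters (sketch)."""
    # Deferred imports: these libraries are only needed when the function runs.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import (
        LoraConfig,
        TaskType,
        get_peft_model,
        prepare_model_for_kbit_training,
    )

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Freeze the base weights and prepare the quantised model for training
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=r,
        lora_alpha=lora_alpha,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],
    )
    return get_peft_model(model, lora_config)
```

On a machine with a suitable GPU, a call such as build_qlora_model("meta-llama/Llama-3.2-1B") would return a model whose only trainable parameters are the adapters.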
Key Hyperparameters
| Parameter | Typical range | Effect |
|---|---|---|
| r (rank) | 4–32 | Higher r = more capacity, more VRAM |
| lora_alpha | 2r or 4r | Scales the LoRA update |
| lora_dropout | 0.05–0.1 | Regularisation |
| target_modules | q_proj, v_proj | Which attention matrices to adapt |
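The interaction between r and lora_alpha is easier to see with numbers. In the standard LoRA formulation, which PEFT follows by default, the low-rank update is multiplied by lora_alpha / r, so changing the rank while keeping lora_alpha fixed also changes the effective strength of the update. The helper below is an illustrative sketch of that ratio only.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update: lora_alpha / r."""
    return lora_alpha / r


for r in (4, 8, 16, 32):
    fixed = lora_scaling(32, r)      # lora_alpha held at 32
    tied = lora_scaling(2 * r, r)    # lora_alpha tied to the rank (2r)
    print(f"r={r:<2}  alpha=32 -> scale {fixed:.1f}   alpha=2r -> scale {tied:.1f}")
```

Tying lora_alpha to a multiple of r keeps the scale constant across ranks, which makes runs with different r values easier to compare.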
Common Mistakes
- Rank too high: setting r=64 often doesn't improve quality over r=8 but uses more VRAM. Start with r=8.
- Not targeting the right modules: different model architectures use different names. Use model.named_modules() to find the correct names.
- Forgetting to merge adapters before inference: at inference time, merge LoRA weights into the base model for maximum speed with model.merge_and_unload().
- Using QLoRA without a GPU: in practice QLoRA requires a CUDA GPU. bitsandbytes was built around CUDA kernels, and its CPU support is limited at best.
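Merging works because the adapter path and the merged weight are algebraically the same. The toy example below checks this with small integer matrices in pure Python. It assumes a row-vector input (y = x · W), uses the A (d × r) and B (r × d) shapes from earlier in this section, and leaves out the lora_alpha / r scaling for simplicity; it illustrates the idea rather than what merge_and_unload() does internally.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen weight, d x d with d = 3
A = [[1], [2], [0]]                    # adapter, d x r with r = 1
B = [[0, 1, 1]]                        # adapter, r x d
x = [[1, 2, 3]]                        # one input row vector

# Unmerged: base path plus a separate low-rank adapter path
y_adapter = add(matmul(x, W), matmul(matmul(x, A), B))

# Merged: fold A x B into the weight once, then do a single matmul
W_merged = add(W, matmul(A, B))
y_merged = matmul(x, W_merged)

print(y_adapter, y_merged)  # [[10, 7, 10]] [[10, 7, 10]]
```

The outputs are identical, but the merged version performs one matrix multiplication per layer instead of three, which is where the inference speed-up comes from.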
Quick Quiz
Q1. What does LoRA’s rank parameter r control?
A1. The size of the adapter matrices. Lower rank = fewer trainable parameters = less VRAM; higher rank = more expressive adapters.
Q2. What is the key difference between LoRA and QLoRA?
A2. QLoRA additionally quantises the frozen base model to 4-bit, dramatically reducing the VRAM required to load the model.
Q3. After fine-tuning, what should you do before deploying to production for maximum inference speed?
A3. Call model.merge_and_unload() to merge the LoRA adapters into the base model weights. This eliminates the adapter overhead at inference time.
Student Exercise
Exercise 9.3 — Count trainable parameters
Load any HuggingFace model (e.g., gpt2; note that its attention projection is named c_attn rather than q_proj/v_proj). Apply LoRA with r=4, r=8, r=16, and r=32. For each, print the number of trainable parameters. Plot rank vs. trainable parameter count.
Further Reading
- 📄 LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
- 📄 QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
- 📘 HuggingFace PEFT Docs