9.3 LoRA & QLoRA

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Parameter-efficient fine-tuning · Adapter layers · Memory savings

Official Docs: HuggingFace PEFT · QLoRA Paper


Why Not Fine-Tune All Parameters?

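A 7B-parameter model has ~28 GB of weights in float32 (7 billion × 4 bytes). Fine-tuning all of them requires:

  • Storing the full model (1× model size)
  • Storing gradients (same size as the model)
  • Storing optimiser state (2× model size for Adam, which keeps two moment estimates per parameter)

That is ~4× model size in GPU VRAM just to start training, before activations are counted.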
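
To make the arithmetic concrete, here is a minimal sketch that tallies the memory needed for weights, gradients, and Adam state. It assumes everything is kept in float32 and ignores activations and framework overhead; the function name `full_finetune_memory_gb` is illustrative and not part of any library.

```python
def full_finetune_memory_gb(n_params: float, bytes_per_param: int = 4) -> dict:
    """Rough memory tally for full fine-tuning with Adam (activations excluded)."""
    gb = 1e9  # decimal gigabytes, matching the ~28 GB figure for 7B float32 weights
    weights = n_params * bytes_per_param / gb
    gradients = weights        # one gradient value per weight
    adam_state = 2 * weights   # Adam keeps two moment estimates per weight
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_state": adam_state,
        "total": weights + gradients + adam_state,
    }


budget = full_finetune_memory_gb(7e9)
for name, size in budget.items():
    print(f"{name:>10}: {size:6.1f} GB")
```

Mixed-precision training changes the per-parameter byte counts, but the overall picture is the same: gradients and optimiser state, not the weights alone, dominate the budget.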

LoRA avoids most of this cost by training only a small number of new parameters, so gradients and optimiser state are needed only for those parameters.


How LoRA Works

LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable adapter matrices to selected layers, most commonly the attention projections:

Original weight matrix W (d × d): FROZEN

LoRA adds: W + ΔW = W + A × B
where A is (d × r) and B is (r × d)
and r << d (e.g., r=8 vs d=4096)

  • Instead of updating 4096×4096 ≈ 16.8M parameters, LoRA updates 4096×8 + 8×4096 = 65,536 (~65K) parameters
  • Reduction: 256× fewer trainable parameters for this matrix
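
The parameter counts above follow directly from the matrix shapes. The sketch below checks them and also illustrates, with tiny hand-written matrices in plain Python, that applying W + A × B to an input gives the same result as adding a low-rank side path (x·A)·B to the frozen output x·W. The helper names and toy values are illustrative assumptions, not library code.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x d matrix."""
    full = d * d
    lora = d * r + r * d  # A is (d x r), B is (r x d)
    return full, lora


def matmul(X, Y):
    """Plain-Python matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


full, lora = lora_param_counts(d=4096, r=8)
print(full, lora, full // lora)

# Toy example with d=3, r=1 (integer values so the comparison is exact).
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen weight, (d x d)
A = [[1], [2], [0]]                     # trainable, (d x r)
B = [[0, 1, -1]]                        # trainable, (r x d)
x = [[2, 1, 1]]                         # one input row, (1 x d)

merged = matmul(x, add(W, matmul(A, B)))                # x (W + A B)
side_path = add(matmul(x, W), matmul(matmul(x, A), B))  # x W + (x A) B
print(merged == side_path)
```

The side-path form is what makes training cheap: the full d × d product A × B never has to be materialised, and only A and B receive gradients.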

LoRA with HuggingFace PEFT

pip install peft transformers datasets trl accelerate bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Load base model in 8-bit (needs bitsandbytes and a supported GPU)
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
# Freeze the base weights and prepare the quantised model for stable training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # Rank: lower = fewer params (try 4, 8, 16)
    lora_alpha=32,     # Scaling factor (here 4r; 2r is also common)
    lora_dropout=0.1,  # Regularisation
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. For this 1B model with r=8 on
# q_proj and v_proj, expect roughly 0.85M trainable out of ~1.24B (about 0.07%).
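
As a sanity check on what print_trainable_parameters() reports, the adapter count can be estimated by hand: each adapted linear layer of shape (in × out) gains r × (in + out) parameters. The sketch below assumes the published Llama-3.2-1B configuration (16 decoder layers, hidden size 2048, and a 512-wide v_proj output because of grouped-query attention). Verify these values against the model's config.json, and treat the helper name as illustrative.

```python
def lora_trainable_params(
    layer_shapes: dict[str, tuple[int, int]], r: int, n_layers: int
) -> int:
    """Count LoRA parameters: r * (in_features + out_features) per adapted layer."""
    per_block = sum(r * (n_in + n_out) for n_in, n_out in layer_shapes.values())
    return per_block * n_layers


# Assumed (in_features, out_features) shapes for Llama-3.2-1B; check config.json.
assumed_shapes = {
    "q_proj": (2048, 2048),
    "v_proj": (2048, 512),
}

trainable = lora_trainable_params(assumed_shapes, r=8, n_layers=16)
print(f"estimated trainable LoRA params: {trainable:,}")
```

If the number printed by PEFT differs substantially from an estimate like this, the usual causes are a different target_modules list or a different base model than intended.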

QLoRA — LoRA + Quantisation

QLoRA (Quantised LoRA) additionally quantises the frozen base model to 4-bit, which shrinks the base weights by ~4× relative to fp16. Approximate VRAM needs:

| Method | VRAM for 7B model | Quality |
|---|---|---|
| Full fine-tune | ~56 GB | Best |
| LoRA (fp16) | ~16 GB | Very good |
| QLoRA (4-bit + LoRA) | ~6 GB | Good |

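from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # Also quantise the quantisation constants
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.float16,  # Dtype used for matmuls (bfloat16 is a common alternative)
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters as before (reuses lora_config from the LoRA example)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)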
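
The saving from quantisation comes almost entirely from the frozen base weights. The rough sketch below computes weight storage at different precisions for a 7B model. It counts weights only, so real training totals (which also include adapters, optimiser state, and activations) are larger. The function name is illustrative, and the 4-bit figure ignores the small overhead of quantisation constants.

```python
BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "nf4 (4-bit)": 4}


def weight_memory_gb(n_params: float, bits: int) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits / 8 / 1e9


sizes = {name: weight_memory_gb(7e9, bits) for name, bits in BITS_PER_PARAM.items()}
for name, size in sizes.items():
    print(f"{name:>12}: {size:5.1f} GB")
```

Going from float16 to 4-bit cuts weight storage by a factor of four; the end-to-end VRAM saving is smaller because the other components of the training budget are not quantised.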

Key Hyperparameters

| Parameter | Typical range | Effect |
|---|---|---|
| r (rank) | 4–32 | Higher r = more capacity, more VRAM |
| lora_alpha | 2r or 4r | Scales the LoRA update |
| lora_dropout | 0.05–0.1 | Regularisation |
| target_modules | q_proj, v_proj | Which attention matrices to adapt |
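
In the PEFT implementation, the adapter output is multiplied by lora_alpha / r by default, which is why lora_alpha is usually quoted relative to r. A small sketch of how the common choices translate into an effective scale (the helper name is illustrative):

```python
def lora_scale(lora_alpha: float, r: int) -> float:
    """Default LoRA scaling applied to the adapter output: lora_alpha / r."""
    return lora_alpha / r


for r in (4, 8, 16, 32):
    for multiple in (2, 4):
        alpha = multiple * r
        print(f"r={r:<2} lora_alpha={alpha:<3} scale={lora_scale(alpha, r)}")
```

Keeping lora_alpha at a fixed multiple of r keeps the scale constant as the rank changes; holding lora_alpha fixed while raising r shrinks the scale instead.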

Common Mistakes
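  1. Rank too high: setting r=64 often doesn’t improve quality over r=8 but adds trainable parameters, optimiser state, and training time. Start with r=8.
  2. Not targeting the right modules: different model architectures use different layer names (for example, GPT-2 uses c_attn rather than q_proj and v_proj). Use model.named_modules() to find the correct names.
  3. Forgetting to merge adapters before inference: unmerged adapters add an extra low-rank matrix multiply in every adapted layer. For maximum speed, merge the LoRA weights into the base model with model.merge_and_unload().
  4. Using QLoRA without a suitable GPU: QLoRA relies on bitsandbytes, whose 4-bit kernels primarily target CUDA GPUs. Check which backends your installed bitsandbytes version supports before attempting CPU-only training.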
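
For mistake 2, one way to discover candidate target_modules is to collect the final component of every module name. The helper below works on plain name strings so it can be tried without loading a model; the sample names are hypothetical stand-ins for what model.named_modules() yields on a Llama-style model.

```python
from collections import Counter


def leaf_name_counts(module_names):
    """Count the last dotted component of each module name (e.g. 'q_proj')."""
    return Counter(name.rsplit(".", 1)[-1] for name in module_names if name)


# Hypothetical names; with a real model use:
#   names = [name for name, _ in model.named_modules()]
names = [
    "",  # the root module has an empty name
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.self_attn.v_proj",
]

counts = leaf_name_counts(names)
for leaf, count in counts.most_common():
    print(leaf, count)
```

On a real model the listing also contains container modules (such as whole attention blocks), so pick the names that correspond to linear layers.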

Quick Quiz

Test Your Understanding

Q1. What does LoRA’s rank parameter r control?
A1. The size of the adapter matrices. Lower rank = fewer trainable parameters = less VRAM; higher rank = more expressive adapters.

Q2. What is the key difference between LoRA and QLoRA?
A2. QLoRA additionally quantises the frozen base model to 4-bit, dramatically reducing the VRAM required to load the model.

Q3. After fine-tuning, what should you do before deploying to production for maximum inference speed?
A3. Call model.merge_and_unload() to merge the LoRA adapters into the base model weights. This eliminates the adapter overhead at inference time.


Student Exercise

Exercise 9.3 — Count trainable parameters
Load any HuggingFace model (e.g., gpt2; note that GPT-2 names its attention projection c_attn, so use that in target_modules). Apply LoRA with r=4, r=8, r=16, and r=32. For each, print the number of trainable parameters. Plot rank vs. trainable parameter count.

