9.3 LoRA & QLoRA

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.


Key Concepts: Parameter-efficient fine-tuning · Adapter layers · Memory savings

Official Docs: HuggingFace PEFT · QLoRA Paper

Why Not Fine-Tune All Parameters?

A 7B parameter model has ~28GB of weights in float32. Fine-tuning all of them requires:

Storing the full model
Storing gradients (same size as model)
Storing an optimiser state (2× model size for Adam)

➞ ~4× model size in GPU VRAM (weights + gradients + 2× for Adam) just to start training, before counting activations

LoRA solves this by training only a tiny number of new parameters.
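The memory estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch only: the function name `full_finetune_state_gb` is made up for illustration, 1 GB is taken as 1e9 bytes, and activations, buffers and framework overhead are ignored.

```python
def full_finetune_state_gb(n_params, bytes_per_value=4):
    """Rough memory for weights, gradients and Adam state, in GB (1 GB = 1e9 bytes)."""
    weights = n_params * bytes_per_value
    gradients = weights        # one gradient value per weight
    adam_state = 2 * weights   # first and second moment estimates
    total = weights + gradients + adam_state
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "adam_state": adam_state / 1e9,
        "total": total / 1e9,
    }


# 7B parameters stored as float32 (4 bytes each)
estimate = full_finetune_state_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:>10}: {gb:6.1f} GB")
```

Passing `bytes_per_value=2` gives the corresponding figures for 16-bit storage.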

How LoRA Works

LoRA (Low-Rank Adaptation) freezes the original model weights and adds small adapter matrices to attention layers:

Original weight matrix W (d × d) — FROZEN

LoRA adds: W + ΔW = W + A × B
  where A is (d × r) and B is (r × d)
  and r << d  (e.g., r=8 vs d=4096)

Instead of updating 4096×4096 = 16,777,216 (≈16.8M) parameters, LoRA updates 4096×8 + 8×4096 = 65,536 (≈65K) parameters
Reduction: 256× fewer trainable parameters for this matrix
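The same count can be checked for other ranks. The sketch below assumes the shapes given above (A is d_in × r, B is r × d_out); the helper name `lora_param_counts` is hypothetical and not part of any library.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out           # every entry of W is trainable
    lora = d_in * r + r * d_out   # A is (d_in x r), B is (r x d_out)
    return full, lora


for r in (4, 8, 16, 32):
    full, lora = lora_param_counts(4096, 4096, r)
    print(f"r={r:>2}: {lora:>7,} LoRA params vs {full:,} full ({full // lora}x fewer)")
```

The adapter size grows linearly with r, so doubling the rank doubles the trainable parameters for each adapted matrix.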

LoRA with HuggingFace PEFT

pip install peft transformers datasets trl accelerate bitsandbytes

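import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load the base model in half precision; these weights stay frozen
# (load_in_8bit as a direct keyword is deprecated; quantised loading goes
# through BitsAndBytesConfig, as in the QLoRA section)
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)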

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # Rank — lower = fewer params (try 4, 8, 16)
    lora_alpha=32,                # Scaling factor (here 4r; 2r and 4r are common choices)
    lora_dropout=0.1,             # Regularisation
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output for this model (r=8 on q_proj and v_proj):
# trainable params: 851,968 || all params: 1,236,666,368 || trainable%: 0.0689
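The interaction between `lora_alpha` and `r` is easier to see numerically. In the LoRA paper, and in PEFT's default LoRA layers, the low-rank update is multiplied by `lora_alpha / r`. The helper `lora_scaling` below is a hypothetical name used only to illustrate that ratio.

```python
def lora_scaling(lora_alpha, r):
    """Multiplier applied to the low-rank update: lora_alpha / r."""
    return lora_alpha / r


# Same lora_alpha, different ranks: a higher rank lowers the multiplier
for r in (4, 8, 16, 32):
    print(f"r={r:>2}, lora_alpha=32 -> scaling {lora_scaling(32, r):.1f}")
```

With `lora_alpha=32` and `r=8` the multiplier is 4.0. Keeping `lora_alpha` at a fixed multiple of `r` keeps the multiplier constant when the rank changes.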

QLoRA — LoRA + Quantisation

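QLoRA (Quantised LoRA) loads the frozen base model in 4-bit precision and trains LoRA adapters on top. Storing weights in 4 bits instead of 16 cuts the memory needed for the base weights by ~4×. Rough figures: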

Method	Approx. VRAM for 7B model	Quality
Full fine-tune (fp16)	~56 GB	Best
LoRA (fp16)	~16 GB	Very good
QLoRA (4-bit + LoRA)	~6 GB	Good
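The weight-only part of these figures follows directly from the number of bits per parameter. The sketch below is back-of-the-envelope arithmetic with a hypothetical helper name; it counts base weights only and ignores quantisation constants, adapters, activations and optimiser state, which is why the table values are higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for label, bits in [("float32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(7e9, bits):5.1f} GB")
```

For a 7B model this gives 28 GB, 14 GB and 3.5 GB respectively, so moving from fp16 to 4-bit shrinks the base weights by a factor of four.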

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# model_name is reused from the LoRA example above

# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
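To give a feel for what 4-bit storage means, here is a toy block-wise quantiser in plain Python. It is an illustration under simplifying assumptions, not what bitsandbytes does: QLoRA's NF4 format uses a non-uniform set of 16 levels suited to normally distributed weights, whereas this sketch uses 16 evenly spaced levels. The function names are hypothetical.

```python
def quantize_block(values, levels=16):
    """Map a block of floats to integer codes 0..levels-1 using absmax scaling."""
    scale = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    # Rescale [-scale, scale] onto [0, levels - 1] and round to the nearest code
    codes = [round((v / scale + 1) / 2 * (levels - 1)) for v in values]
    return codes, scale


def dequantize_block(codes, scale, levels=16):
    """Recover approximate floats from the codes and the stored scale."""
    return [(c / (levels - 1) * 2 - 1) * scale for c in codes]


block = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, scale, max_error)
```

Each value is stored as one of 16 codes (4 bits) plus one shared scale per block, at the cost of a rounding error bounded by half the spacing between levels.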

Key Hyperparameters

Parameter	Typical range	Effect
`r` (rank)	4–32	Higher r = more capacity, more VRAM
`lora_alpha`	2r or 4r	Scales the LoRA update
`lora_dropout`	0.05–0.1	Regularisation
`target_modules`	q_proj, v_proj	Which attention matrices to adapt
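The names q_proj and v_proj are specific to Llama-style models, so it helps to list what a given model actually calls its layers. The helper below is a hypothetical sketch that works on any iterable of (name, module) pairs, which is the shape `model.named_modules()` yields. The stand-in classes exist only so the example runs without downloading a model.

```python
def linear_leaf_names(named_modules, class_keywords=("Linear",)):
    """Collect the final name component of modules whose class name matches."""
    names = set()
    for name, module in named_modules:
        class_name = type(module).__name__
        if name and any(key in class_name for key in class_keywords):
            names.add(name.rsplit(".", 1)[-1])
    return sorted(names)


# Stand-in classes (assumption: used here instead of real torch modules)
class Linear:
    pass


class LayerNorm:
    pass


fake_modules = [
    ("", object()),
    ("model.layers.0.self_attn.q_proj", Linear()),
    ("model.layers.0.self_attn.v_proj", Linear()),
    ("model.layers.0.input_layernorm", LayerNorm()),
    ("model.layers.1.self_attn.q_proj", Linear()),
]
print(linear_leaf_names(fake_modules))  # ['q_proj', 'v_proj']
```

On a real model the call would be `linear_leaf_names(model.named_modules())`, and the result will usually also include layers you may not want to adapt, such as the output head. GPT-2 implements its projections with a `Conv1D` class (for example `c_attn`), so `class_keywords` would need to include `"Conv1D"` there.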

Common Mistakes

Rank too high: setting r=64 often gives little or no quality gain over r=8 while adding trainable parameters, optimiser state, and VRAM. Start with r=8.
Not targeting the right modules: different model architectures use different layer names. Use model.named_modules() to find the correct names.
Forgetting to merge adapters before inference: for maximum inference speed, merge the LoRA weights into the base model and keep the returned model: model = model.merge_and_unload().
Using QLoRA without a supported GPU: the 4-bit path depends on bitsandbytes, which has historically required a CUDA GPU. Check the bitsandbytes documentation for the backends your installed version supports.
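Why merging is safe can be shown with tiny matrices. The sketch below uses plain Python lists and made-up numbers, follows the W + A × B convention from earlier in this section, and leaves out the lora_alpha / r scaling and dropout for brevity. It is not how PEFT implements merging internally.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]  # frozen base weight (3 x 3)
A = [[0.5], [1.0], [-1.0]]                               # adapter A (3 x 1), rank r = 1
B = [[2.0, 0.0, 1.0]]                                    # adapter B (1 x 3)
x = [[1.0, 2.0, 3.0]]                                    # one input row

# Unmerged: base path plus adapter path (two extra matmuls on every call)
unmerged = add(matmul(x, W), matmul(matmul(x, A), B))

# Merged: fold A x B into W once, then a single matmul per call
W_merged = add(W, matmul(A, B))
merged = matmul(x, W_merged)

print(unmerged, merged)
```

Both paths produce the same output, which is why merging removes the adapter's extra computation without changing the model's predictions.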

Quick Quiz

Test Your Understanding

Q1. What does LoRA’s rank parameter r control?
A1. The size of the adapter matrices. Lower rank = fewer trainable parameters = less VRAM; higher rank = more expressive adapters.

Q2. What is the key difference between LoRA and QLoRA?
A2. QLoRA additionally quantises the frozen base model to 4-bit, dramatically reducing the VRAM required to load the model.

Q3. After fine-tuning, what should you do before deploying to production for maximum inference speed?
A3. Call model = model.merge_and_unload() to merge the LoRA adapters into the base model weights and keep the merged model it returns. This removes the extra adapter computation at inference time.

Student Exercise

Exercise 9.3 — Count trainable parameters
Load any HuggingFace model (e.g., gpt2). Apply LoRA with r=4, r=8, r=16, and r=32. For each, print the number of trainable parameters. Plot rank vs. trainable parameter count.
