4.2 Document Loading & Chunking
AI-generated content may contain errors. Always verify against official sources.
Key Concepts: Chunk size · Overlap · Splitting strategies for structured data
Primary Sources: LangChain — Text Splitters · LlamaIndex — Node Parsers
Why Chunking Matters
Vector search finds the most relevant chunk, not the most relevant document. If chunks are:
- Too large → irrelevant text dilutes the signal
- Too small → context is lost; answers are incomplete
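The dilution effect can be seen in a toy sketch. It assumes naive fixed-width character chunks and uses keyword density as a rough stand-in for vector similarity; `fixed_chunks`, `keyword_density`, and the sample text are illustrative names and data, not part of any library.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def keyword_density(chunk: str, query_terms: set[str]) -> float:
    """Fraction of words in the chunk that match a query term."""
    words = [w.strip(".,").lower() for w in chunk.split()]
    if not words:
        return 0.0
    return sum(w in query_terms for w in words) / len(words)


text = (
    "The warranty lasts two years. "
    "Shipping takes five days. "
    "Returns are accepted within thirty days. "
    "Support is available by email."
)
query_terms = {"warranty", "years"}

# Compare the best-matching chunk at several chunk sizes.
for size in (30, 60, len(text)):
    chunks = fixed_chunks(text, size)
    best = max(chunks, key=lambda c: keyword_density(c, query_terms))
    print(size, len(chunks), round(keyword_density(best, query_terms), 2))
```

With this sample, the best chunk's density drops from 0.4 at 30 characters to 0.1 when the whole text is one chunk: the relevant sentence is still there, but it is surrounded by unrelated text.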
LangChain Document Loaders
pip install langchain langchain-community langchain-text-splitters pypdf beautifulsoup4
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    WebBaseLoader,
    CSVLoader,
)
# Load a PDF
loader = PyPDFLoader("report.pdf")
docs = loader.load() # List[Document]
print(docs[0].page_content[:200])
print(docs[0].metadata) # {"source": "report.pdf", "page": 0}
# Load a web page
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()
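Every loader returns the same shape: a list of records with text in page_content and provenance in metadata. A minimal pure-Python sketch of that shape follows; `SimpleDocument` and `load_text_file` are hypothetical stand-ins for illustration, not LangChain classes.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for the record a loader returns."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read one UTF-8 text file into a single-document list."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo with a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = str(Path(tmp) / "notes.txt")
    Path(demo_path).write_text(
        "First paragraph.\n\nSecond paragraph.", encoding="utf-8"
    )
    demo_docs = load_text_file(demo_path)

print(demo_docs[0].page_content[:16])
print(demo_docs[0].metadata)
```

Because the source is attached at load time, every chunk derived from the document later can inherit it.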
Text Splitting Strategies
1. RecursiveCharacterTextSplitter (Default Choice)
Splits on "\n\n", "\n", " ", and "" in that order, so it tries to keep paragraphs intact first, then lines, then words.
from langchain_text_splitters import RecursiveCharacterTextSplitter  # older releases: langchain.text_splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # max chars per chunk
    chunk_overlap=64,     # overlap to preserve context across boundaries
    length_function=len,
)
chunks = splitter.split_documents(docs)
print(f"{len(chunks)} chunks created")
print(chunks[0].page_content)
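To make the "separators in order" idea concrete, here is a simplified pure-Python sketch. It is not LangChain's implementation: it omits overlap, and it assumes the separator list ends with "" so a hard cut is always available. `recursive_split` is an illustrative name.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text into pieces of at most chunk_size characters,
    preferring the coarsest separator that produces small enough pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces: list[str] = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                pieces.append(part)
        else:
            # Still too long: retry this part with the finer separators.
            pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedily merge neighbours back together while they still fit.
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= chunk_size:
            merged[-1] = merged[-1] + sep + piece
        else:
            merged.append(piece)
    return merged


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.\n\nIota."
sample_chunks = recursive_split(sample, chunk_size=20)
for chunk in sample_chunks:
    print(repr(chunk))
```

The first paragraph fits and stays whole; the second is too long, so only that paragraph falls back to splitting on spaces.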
2. Semantic Chunking
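Splits where the embedding similarity between adjacent sentences drops, so boundaries follow topic shifts. Chunks tend to be more coherent, but the splitter has to embed the text first, which makes it slower and, with a hosted model, adds API cost. This example needs pip install langchain-experimental langchain-openai and an OPENAI_API_KEY environment variable.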
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)
chunks = splitter.split_documents(docs)
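The percentile breakpoint idea can be sketched without any API calls. The version below is a rough approximation, not the library's implementation: a toy bag-of-words vector stands in for a real embedding model, and `embed`, `cosine_distance`, `percentile`, and `semantic_chunks` are illustrative names.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile, pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences: list[str], pct: float = 90) -> list[str]:
    """Start a new chunk wherever the distance between neighbouring
    sentences exceeds the pct-th percentile of all such distances."""
    if len(sentences) < 2:
        return list(sentences)
    vectors = [embed(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats purr when they are happy.",
    "Cats sleep most of the day.",
    "Invoices are due within thirty days.",
    "Late invoices incur a fee.",
]
topic_chunks = semantic_chunks(sentences, pct=90)
for chunk in topic_chunks:
    print(chunk)
```

Here the only distance above the 90th percentile is the jump from cats to invoices, so the four sentences end up in two chunks.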
3. Markdown / Code Splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter  # older releases: langchain.text_splitter
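headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
# Sample input so the snippet runs on its own
markdown_string = "# Guide\n\nIntro.\n\n## Setup\n\nInstall the package."
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_chunks = splitter.split_text(markdown_string)  # List[Document]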
# Each chunk carries header metadata → great for citation
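What "header metadata" means in practice can be shown with a small pure-Python sketch. It is a simplification under stated assumptions (ATX-style headers only, no handling of headers inside fenced code), and `split_by_headers` is an illustrative name rather than a library function.

```python
import re


def split_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Split Markdown at #, ##, ### headers; each section records the
    header path it sits under (keys 'h1', 'h2', 'h3')."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    sections: list[dict] = []
    path: dict[str, str] = {}
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current header path.
        text = "\n".join(body).strip()
        if text:
            sections.append({"content": text, "metadata": dict(path)})
        body.clear()

    for line in markdown.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header invalidates any same-level or deeper headers.
            for key in [k for k in path if int(k[1:]) >= level]:
                del path[key]
            path[f"h{level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


md = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API.\n"
sections = split_by_headers(md)
for section in sections:
    print(section["metadata"], "->", section["content"])
```

Each section knows which headers it sits under, so an answer can cite "Guide > Install" instead of just a file name.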
Chunk Size Guide
| Use Case | chunk_size | chunk_overlap |
|---|---|---|
| FAQ / short answers | 256 | 32 |
| General QA | 512 | 64 |
| Long-form reports | 1024 | 128 |
| Code files | 400 | 50 |
Always add source metadata to every chunk. You’ll need it to cite references in the answer.
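A minimal sketch of how chunk_size, chunk_overlap, and per-chunk metadata fit together, assuming plain fixed-width character windows (simpler than the recursive splitter). `chunk_with_metadata` and the metadata keys `chunk_index` and `start_char` are illustrative choices, not a library convention.

```python
def chunk_with_metadata(
    text: str,
    source: str,
    chunk_size: int = 512,
    chunk_overlap: int = 64,
) -> list[dict]:
    """Fixed-width character windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "content": text[start:start + chunk_size],
            "metadata": {
                "source": source,
                "chunk_index": index,
                "start_char": start,
            },
        })
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


alphabet_chunks = chunk_with_metadata(
    "abcdefghijklmnopqrstuvwxyz", source="alphabet.txt",
    chunk_size=10, chunk_overlap=3,
)
for chunk in alphabet_chunks:
    print(chunk["metadata"]["start_char"], chunk["content"])
```

The last three characters of each window reappear at the start of the next one, and every chunk can be traced back to its source and position.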
Key Takeaways
- Start with RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
- Preserve metadata (source, page, section) for citation and filtering
- For Markdown/code, use structure-aware splitters
- Tune chunk size by measuring retrieval quality (covered in Chapter 8)
Common Mistakes
- Chunk size too large: chunks much longer than about 1,000 characters often include unrelated sentences that dilute the chunk's embedding and hurt retrieval. Start with chunk_size=512, which is measured in characters when length_function=len.
- No chunk overlap: without overlap, a sentence that spans a chunk boundary is split across two chunks, losing context. Use at least 10% overlap.
- Stripping metadata: always preserve source, page, and section in chunk metadata. Without it, you cannot cite sources.
- Re-using old chunks after document updates: if source documents change, delete the old chunks and re-embed. Stale chunks cause incorrect answers.
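One way to catch stale chunks is to store a content hash per source at indexing time and compare it on the next run. The sketch below assumes documents are available as plain strings keyed by source; `content_hash` and `plan_reindex` are hypothetical helper names, and the actual delete and re-embed calls depend on your vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str],
    indexed_hashes: dict[str, str],
) -> dict[str, list[str]]:
    """Compare current documents against hashes stored at indexing time.

    re_embed: new or changed sources (drop their old chunks, then re-chunk
              and re-embed).
    delete:   sources that no longer exist (drop their chunks).
    """
    re_embed = [
        src for src, text in current_docs.items()
        if indexed_hashes.get(src) != content_hash(text)
    ]
    delete = [src for src in indexed_hashes if src not in current_docs]
    return {"re_embed": sorted(re_embed), "delete": sorted(delete)}


indexed = {
    "a.txt": content_hash("old text"),
    "b.txt": content_hash("same"),
    "gone.txt": content_hash("x"),
}
current = {"a.txt": "new text", "b.txt": "same", "c.txt": "brand new"}
plan = plan_reindex(current, indexed)
print(plan)
```

Unchanged sources are skipped, so only the documents that actually changed pay the embedding cost again.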
Quick Quiz
Q1. Why is RecursiveCharacterTextSplitter preferred over simple character splitting?
A1. It tries to split at natural boundaries (paragraphs, sentences, words) in order of preference, preserving semantic coherence within each chunk.
Q2. What does chunk_overlap=64 mean?
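A2. Up to 64 characters from the end of each chunk are repeated at the start of the next chunk, so context carries across chunk boundaries. The unit is characters because length_function=len; it would be tokens only with a token-counting length function.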
Q3. What metadata should always be preserved in chunks?
A3. At minimum: source file/URL, page number (for PDFs), and section title. This enables source citation and filtering during retrieval.
Student Exercise
Exercise 4.2 — Chunk size analysis
Load a 10-page PDF. Create three chunk sets with chunk_size=256, 512, and 1024, setting chunk_overlap to about 10% of chunk_size in each case (for example 26, 51, and 102). For each set, count the chunks and compute the average chunk length in characters. Discuss the trade-offs.