## 4.2 Document Loading & Chunking
**Key Concepts:** Chunk size · Overlap · Splitting strategies for structured data

**Primary Sources:** LangChain — Text Splitters · LlamaIndex — Node Parsers
### Why Chunking Matters
Vector search finds the most relevant chunk, not the most relevant document. If chunks are:
- Too large → irrelevant text dilutes the signal
- Too small → context is lost; answers are incomplete
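To make the trade-off concrete, here is a minimal character-window chunker in plain Python — a sketch of the idea, not the LangChain implementation:

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows; consecutive
    windows share `overlap` characters so boundary context survives."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "Paris is the capital of France. It hosts the Louvre museum."
print(chunk(text, size=30, overlap=0))   # sentences get cut mid-thought
print(chunk(text, size=30, overlap=10))  # overlap repeats boundary text
```

With `overlap=0`, a sentence straddling a boundary is split across two chunks and neither retrieves well; the overlapped version repeats the boundary text so at least one chunk contains the full sentence.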
### LangChain Document Loaders
```bash
pip install langchain langchain-community pypdf
```
```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    WebBaseLoader,
    CSVLoader,
)

# Load a PDF (one Document per page)
loader = PyPDFLoader("report.pdf")
docs = loader.load()  # List[Document]
print(docs[0].page_content[:200])
print(docs[0].metadata)  # {"source": "report.pdf", "page": 0}

# Load a web page
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()
```
### Text Splitting Strategies
#### 1. RecursiveCharacterTextSplitter (Default Choice)
Splits on `"\n\n"`, then `"\n"`, then `" "`, then `""` (in that order) — it tries to keep paragraphs intact, falling back to sentences and words only when a piece is still too large.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # max characters per chunk
    chunk_overlap=64,    # overlap to preserve context across boundaries
    length_function=len,
)
chunks = splitter.split_documents(docs)
print(f"{len(chunks)} chunks created")
print(chunks[0].page_content)
```
#### 2. Semantic Chunking
Splits at semantic boundaries using embedding similarity — better quality, slower.
```python
# Requires: pip install langchain-experimental langchain-openai
# (and an OpenAI API key for the embedding calls)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)
chunks = splitter.split_documents(docs)
```
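The underlying idea is simple: embed consecutive sentences and start a new chunk wherever similarity drops. Here is a toy sketch with hand-made vectors — the function name `semantic_split` and the fixed threshold are illustrative, not the library API (`SemanticChunker` derives its cut-off from a percentile of all similarities):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_split(sentences, embeddings, threshold):
    """Start a new chunk wherever consecutive-sentence similarity
    falls below `threshold`."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sents = ["Cats purr.", "Cats nap all day.", "GDP rose 2%."]
vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]  # toy "embeddings"
print(semantic_split(sents, vecs, threshold=0.5))
# → ['Cats purr. Cats nap all day.', 'GDP rose 2%.']
```

The two cat sentences have similar vectors and stay together; the unrelated GDP sentence triggers a break.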
#### 3. Markdown / Code Splitters
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_chunks = splitter.split_text(markdown_string)
# Each chunk carries header metadata → great for citation
### Chunk Size Guide

| Use Case | chunk_size (chars) | chunk_overlap (chars) |
|---|---|---|
| FAQ / short answers | 256 | 32 |
| General QA | 512 | 64 |
| Long-form reports | 1024 | 128 |
| Code files | 400 | 50 |
> **Tip:** Always add source metadata to every chunk; you'll need it to cite references in the answer.
### Key Takeaways

- Start with `RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)`
- Preserve metadata (`source`, `page`, `section`) for citation and filtering
- For Markdown/code, use structure-aware splitters
- Tune chunk size by measuring retrieval quality (covered in Chapter 8)