4.2 Document Loading & Chunking

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Chunk size · Overlap · Splitting strategies for structured data

Primary Sources: LangChain — Text Splitters · LlamaIndex — Node Parsers


Why Chunking Matters

Vector search finds the most relevant chunk, not the most relevant document. If chunks are:

  • Too large → irrelevant text dilutes the signal
  • Too small → context is lost; answers are incomplete
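To make the dilution effect concrete, here is a minimal sketch that uses bag-of-words cosine similarity as a crude stand-in for an embedding model. The query, the sample sentences, and the helper names (`bow`, `cosine`) are illustrative assumptions, not part of any library. Real embeddings differ in detail, but the same tendency often shows up.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = "refund policy for damaged items"

# A focused chunk: only the sentence that answers the query.
focused = "Our refund policy covers damaged items returned within 30 days."

# An oversized chunk: the same sentence plus unrelated text.
diluted = focused + (
    " The company was founded in 1998. Offices are located in Berlin and Oslo."
    " Shipping partners include several regional carriers."
)

focused_score = cosine(bow(query), bow(focused))
diluted_score = cosine(bow(query), bow(diluted))
print(f"focused={focused_score:.2f} diluted={diluted_score:.2f}")
```

The oversized chunk contains the same answer, yet scores lower because the unrelated sentences enlarge its vector without adding any overlap with the query.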

LangChain Document Loaders

pip install langchain langchain-community pypdf beautifulsoup4

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    WebBaseLoader,  # needs beautifulsoup4 to parse HTML
    CSVLoader,
)

# Load a PDF
loader = PyPDFLoader("report.pdf")
docs = loader.load()  # list of Document objects, one per page
print(docs[0].page_content[:200])
print(docs[0].metadata)  # includes at least {"source": "report.pdf", "page": 0}

# Load a web page
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()

Text Splitting Strategies

1. RecursiveCharacterTextSplitter (Default Choice)

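Splits on "\n\n", "\n", " ", and "" in that order (paragraphs, then lines, then words, then single characters), so paragraphs stay intact whenever they fit within chunk_size.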
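The separator cascade can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is omitted, and separators are dropped between chunks rather than kept.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text into pieces of at most chunk_size characters,
    preferring the earliest (coarsest) separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbors
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer and has more words in it.\n\n"
    "Three."
)
pieces = recursive_split(sample, chunk_size=30)
for piece in pieces:
    print(len(piece), repr(piece))
```

The short paragraphs survive whole, and only the paragraph that exceeds the limit is broken further, at word boundaries.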

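# Newer releases ship splitters in the langchain-text-splitters package;
# older ones expose the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # max characters per chunk (length_function=len)
    chunk_overlap=64,  # characters shared between neighboring chunks
    length_function=len,
)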

chunks = splitter.split_documents(docs)
print(f"{len(chunks)} chunks created")
print(chunks[0].page_content)
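What chunk_overlap does is easiest to see with a plain fixed-size sliding window. The sketch below is an assumption-laden simplification (the helper `sliding_chunks` is hypothetical): it overlaps by an exact character count, whereas the library splitter works on separator boundaries, so its actual overlap may be shorter than chunk_overlap.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not fully contained
    # in the window before it.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


text = "".join(chr(97 + i % 26) for i in range(100))  # 100 sample characters
windows = sliding_chunks(text, chunk_size=40, chunk_overlap=10)

print([len(w) for w in windows])
print(windows[0][-10:] == windows[1][:10])  # tail of one chunk == head of the next
```

A sentence that straddles a boundary therefore appears in full in at least one chunk, provided it is no longer than the overlap.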

2. Semantic Chunking

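Splits where the embedding similarity between adjacent sentences drops, so chunk boundaries tend to follow topic changes. This often produces more coherent chunks, but it is slower and requires embedding calls at indexing time.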

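# Requires: pip install langchain-experimental langchain-openai, and OPENAI_API_KEY set
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,  # split where distance exceeds the 90th percentile
)
chunks = splitter.split_documents(docs)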
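The percentile breakpoint idea can be illustrated without an embedding API. The following is a rough sketch, not the library's code: it assumes sentences are already split, substitutes bag-of-words vectors for real embeddings, and the names `semantic_chunks`, `percentile`, and `cosine_distance` are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector, standing in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences: list[str], pct: float = 90) -> list[str]:
    """Start a new chunk wherever the distance between neighboring
    sentences exceeds the pct-th percentile of all such distances."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [bow(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats are small pets.",
    "Cats and dogs are popular pets.",
    "Many pets are cats.",
    "Interest rates rose sharply.",
    "Rates rose after interest in bonds fell.",
]
topic_chunks = semantic_chunks(sentences, pct=90)
for chunk in topic_chunks:
    print(repr(chunk))
```

In this toy input the only distance above the threshold is the jump from pets to interest rates, so the text splits into two chunks at that point.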

3. Markdown / Code Splitters

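# Older releases expose the same class as langchain.text_splitter.
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_string = "# Guide\n\nIntro.\n\n## Setup\n\nInstall the package."  # sample input

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_chunks = splitter.split_text(markdown_string)
# Each chunk carries header metadata, e.g. {"h1": "Guide", "h2": "Setup"}, which is useful for citation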
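To show how header metadata ends up on each chunk, here is a small plain-Python sketch of structure-aware splitting. It is an illustration only, not the library's implementation: the function `split_by_headers` is hypothetical, it recognizes only `#`-style headers, and it does not skip `#` lines inside code fences.

```python
import re


def split_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Split Markdown at #-style headers and tag each chunk with the
    headers that were active where its text appeared."""
    header_re = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    chunks: list[dict] = []
    path: dict[str, str] = {}  # e.g. {"h1": "Guide", "h2": "Setup"}
    lines: list[str] = []

    def flush() -> None:
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"text": text, "metadata": dict(path)})
        lines.clear()

    for line in markdown.splitlines():
        match = header_re.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes any same-level or deeper section.
            for key in [k for k in path if int(k[1:]) >= level]:
                del path[key]
            path[f"h{level}"] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n# FAQ\nNone yet."
sections = split_by_headers(doc)
for section in sections:
    print(section["metadata"], "->", section["text"])
```

Because every chunk records its header path, an answer built from the "Install it." chunk can cite "Guide > Setup" rather than just the file name.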

Chunk Size Guide

Use Case              chunk_size   chunk_overlap
FAQ / short answers   256          32
General QA            512          64
Long-form reports     1024         128
Code files            400          50

Tip: Always add source metadata to every chunk. You’ll need it to cite references in the answer.
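As a minimal sketch of what carrying source metadata on every chunk can look like, the example below uses a hypothetical `Chunk` dataclass and helper names (`chunk_pages`, `cite`). The `start_index` field is an added assumption, and the fixed-size slicing is only there to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_pages(pages: list[str], source: str, chunk_size: int = 512) -> list[Chunk]:
    """Cut each page into fixed-size pieces and record where each piece came from."""
    chunks = []
    for page_number, page_text in enumerate(pages):
        for start in range(0, len(page_text), chunk_size):
            chunks.append(
                Chunk(
                    text=page_text[start:start + chunk_size],
                    metadata={"source": source, "page": page_number, "start_index": start},
                )
            )
    return chunks


def cite(chunk: Chunk) -> str:
    """Human-readable citation; pages are stored 0-based and shown 1-based."""
    return f'{chunk.metadata["source"]}, page {chunk.metadata["page"] + 1}'


pages = ["a" * 600, "b" * 100]  # stand-in for text extracted from a two-page PDF
page_chunks = chunk_pages(pages, source="report.pdf")
print(len(page_chunks), cite(page_chunks[-1]))
```

Storing the page as a separate field, rather than baking it into the text, also makes it usable as a filter at retrieval time.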


Key Takeaways

  • Start with RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
  • Preserve metadata (source, page, section) for citation and filtering
  • For Markdown/code, use structure-aware splitters
  • Tune chunk size by measuring retrieval quality (covered in Chapter 8)

Common Mistakes

  1. Chunk size too large: oversized chunks (for example, chunk_size above roughly 1000) often include irrelevant sentences that dilute the match with the query. Start with chunk_size=512, which counts characters when length_function=len.
  2. No chunk overlap: without overlap, a sentence that spans a chunk boundary is split across two chunks and loses context. Use at least 10% overlap.
  3. Stripping metadata: always preserve source, page, and section in chunk metadata. Without it, you cannot cite sources.
  4. Re-using old chunks after document updates: if source documents change, delete the old chunks and re-embed. Stale chunks cause incorrect answers.
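One way to catch stale chunks is to store a content fingerprint per document at indexing time and compare it on each refresh. The sketch below assumes such a fingerprint store exists. The names `fingerprint` and `plan_reindex` and the document IDs are hypothetical, and the actual delete and re-embed calls depend on the vector store in use.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str], indexed_hashes: dict[str, str]
) -> dict[str, list[str]]:
    """Compare current documents with the fingerprints saved at indexing time."""
    changed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if doc_id in indexed_hashes and fingerprint(text) != indexed_hashes[doc_id]
    )
    added = sorted(doc_id for doc_id in current_docs if doc_id not in indexed_hashes)
    removed = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return {
        "re_embed": changed + added,         # chunk and embed these again
        "delete_chunks": changed + removed,  # drop their old chunks first
    }


indexed = {
    "a.md": fingerprint("old text"),
    "b.md": fingerprint("same text"),
    "c.md": fingerprint("deleted later"),
}
current = {"a.md": "new text", "b.md": "same text", "d.md": "brand new"}

plan = plan_reindex(current, indexed)
print(plan)
```

Unchanged documents (here `b.md`) are skipped, so a refresh only pays the embedding cost for content that actually changed.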

Quick Quiz

Test Your Understanding

Q1. Why is RecursiveCharacterTextSplitter preferred over simple character splitting?
A1. It tries to split at natural boundaries (paragraphs, sentences, words) in order of preference, preserving semantic coherence within each chunk.

Q2. What does chunk_overlap=64 mean?
A2. With length_function=len, up to the last 64 characters of each chunk are repeated at the start of the next chunk, so context carries across chunk boundaries.

Q3. What metadata should always be preserved in chunks?
A3. At minimum: source file/URL, page number (for PDFs), and section title. This enables source citation and filtering during retrieval.


Student Exercise

Exercise 4.2 — Chunk size analysis
Load a 10-page PDF. Create three chunk sets with chunk_size=256, 512, and 1024, each with chunk_overlap set to 10% of chunk_size. For each set, count the chunks and measure the average chunk length in characters. Discuss the trade-offs.

