
4.2 Document Loading & Chunking

Note: AI-generated content may contain errors. Always verify against official sources.


Key Concepts: Chunk size · Overlap · Splitting strategies for structured data

Primary Sources: LangChain — Text Splitters · LlamaIndex — Node Parsers


Why Chunking Matters

Vector search finds the most relevant chunk, not the most relevant document. If chunks are:

  • Too large → irrelevant text dilutes the signal
  • Too small → context is lost; answers are incomplete
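The tradeoff can be made concrete with a minimal, library-free sketch of fixed-size chunking with overlap (the `chunk_text` helper and the numbers are illustrative, not part of any library):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one to preserve context."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "A" * 1000
big = chunk_text(doc, chunk_size=1000, overlap=0)   # one chunk: the whole doc
small = chunk_text(doc, chunk_size=50, overlap=10)  # many tiny fragments
print(len(big), len(small))
```

A single giant chunk buries the relevant passage in noise; dozens of tiny ones each lack the surrounding context needed to answer.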

LangChain Document Loaders

pip install langchain langchain-community pypdf

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    WebBaseLoader,
    CSVLoader,
)

# Load a PDF
loader = PyPDFLoader("report.pdf")
docs = loader.load() # List[Document]
print(docs[0].page_content[:200])
print(docs[0].metadata) # {"source": "report.pdf", "page": 0}

# Load a web page
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()

Text Splitting Strategies

1. RecursiveCharacterTextSplitter (Default Choice)

Splits on "\n\n", "\n", " ", and "" in that order — it tries to keep paragraphs intact, then falls back to lines, words, and finally individual characters.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # max chars per chunk
    chunk_overlap=64,  # overlap to preserve context across boundaries
    length_function=len,
)

chunks = splitter.split_documents(docs)
print(f"{len(chunks)} chunks created")
print(chunks[0].page_content)
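The "recursive" fallback order can be sketched without LangChain — try the coarsest separator first and recurse with finer ones only on pieces that are still too long (an illustrative re-implementation, not the library's actual code):

```python
def recursive_split(text, max_len, seps=("\n\n", "\n", " ", "")):
    """Split `text` into pieces of at most `max_len` chars, preferring
    coarse separators (paragraphs) over fine ones (words, characters)."""
    if len(text) <= max_len:
        return [text]
    for i, sep in enumerate(seps):
        if sep == "":
            # last resort: hard cut every max_len characters
            return [text[j:j + max_len] for j in range(0, len(text), max_len)]
        if sep in text:
            out = []
            for piece in text.split(sep):
                out.extend(recursive_split(piece, max_len, seps[i + 1:]))
            return out
    return [text]

print(recursive_split("one two\n\nthree four five six", max_len=10))
```

The real splitter additionally merges adjacent small pieces back up toward `chunk_size` and applies the overlap; this sketch shows only the separator fallback.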

2. Semantic Chunking

Splits at semantic boundaries using embedding similarity — better quality, slower.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)
chunks = splitter.split_documents(docs)
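The idea behind percentile breakpoints can be sketched with toy vectors: compare each sentence's embedding to the next one's and break wherever the cosine distance exceeds a high percentile of all neighbor distances (the 2-d "embeddings" below are hand-made stand-ins, not model output):

```python
import math

def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

sentences = ["Cats purr.", "Cats nap a lot.", "GDP rose 3%.", "Inflation fell."]
# Toy embeddings: the first two sentences point one way, the last two another.
embs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

dists = [cosine_dist(embs[i], embs[i + 1]) for i in range(len(embs) - 1)]
threshold = sorted(dists)[int(0.9 * (len(dists) - 1))]  # ~90th percentile
chunks, current = [], [sentences[0]]
for i, d in enumerate(dists):
    if d > threshold:            # big semantic jump -> start a new chunk
        chunks.append(" ".join(current))
        current = []
    current.append(sentences[i + 1])
chunks.append(" ".join(current))
print(chunks)
```

The split lands between the cat sentences and the economics sentences, where the neighbor distance spikes — exactly the behavior SemanticChunker automates with real embeddings.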

3. Markdown / Code Splitters

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_chunks = splitter.split_text(markdown_string)
# Each chunk carries header metadata → great for citation
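A header-aware split is easy to mimic in plain Python, which also shows why the carried metadata matters for citation (an illustrative sketch only; the real splitter handles many more cases):

```python
def split_markdown(md: str) -> list[dict]:
    """Group lines under their most recent headings; record headings as metadata."""
    chunks, meta, lines = [], {}, []

    def flush():
        if lines:
            chunks.append({"content": "\n".join(lines).strip(),
                           "metadata": dict(meta)})
            lines.clear()

    for line in md.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            # a new heading invalidates headings at the same level or deeper
            for key in [k for k in meta if int(k[1:]) >= level]:
                del meta[key]
            meta[f"h{level}"] = line.lstrip("#").strip()
        else:
            lines.append(line)
    flush()
    return chunks

md = "# Intro\nHello.\n## Setup\nInstall things."
for c in split_markdown(md):
    print(c["metadata"], "->", c["content"])
```

Each chunk knows which section it came from, so an answer can cite "Intro > Setup" instead of an anonymous text span.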

Chunk Size Guide

| Use Case            | chunk_size | chunk_overlap |
|---------------------|-----------|---------------|
| FAQ / short answers | 256       | 32            |
| General QA          | 512       | 64            |
| Long-form reports   | 1024      | 128           |
| Code files          | 400       | 50            |
Tip: Always add source metadata to every chunk. You’ll need it to cite references in the answer.
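One way to follow this tip, sketched without any library (the `chunk_with_metadata` helper and its field names are illustrative):

```python
def chunk_with_metadata(text: str, source: str, chunk_size=512, overlap=64):
    """Fixed-size chunking that stamps every chunk with its origin."""
    step = chunk_size - overlap
    return [
        {"content": text[i:i + chunk_size],
         "metadata": {"source": source, "start_char": i}}
        for i in range(0, len(text), step)
    ]

chunks = chunk_with_metadata("x" * 1200, source="report.pdf")
print(chunks[0]["metadata"])  # {'source': 'report.pdf', 'start_char': 0}
```

With LangChain splitters the same effect comes for free: `split_documents` copies each input Document's `metadata` onto every chunk it produces.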


Key Takeaways

  • Start with RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
  • Preserve metadata (source, page, section) for citation and filtering
  • For Markdown/code, use structure-aware splitters
  • Tune chunk size by measuring retrieval quality (covered in Chapter 8)


Next → 4.3 Embeddings & Vector Stores