4.5 Advanced RAG
AI-generated content may contain errors. Always verify against official sources.
Key Concepts: Hybrid search · Re-ranking · Multi-query retrieval · Metadata filtering
Official Docs: LangChain Advanced RAG · Cohere Rerank
Why Basic RAG Falls Short
Basic vector similarity search has two main weaknesses:
- Semantic search can miss exact keyword matches. A query such as "what is the GPU part number?" depends on an exact product code, which keyword matching finds more reliably than embedding similarity.
- Top-k retrieval ranks by embedding distance, not by actual relevance: the 4th-closest vector may answer the question better than the 2nd-closest.
Advanced RAG adds layers to fix these.
Technique 1 — Hybrid Search (Semantic + BM25)
Combine dense vector search with sparse keyword search:
pip install langchain langchain-community langchain-chroma langchain-openai rank_bm25
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
# Your documents (already split into chunks)
docs = [...] # list of Document objects
# Dense retriever (semantic)
vector_store = Chroma.from_documents(docs, OpenAIEmbeddings())
dense_retriever = vector_store.as_retriever(search_kwargs={"k": 4})
# Sparse retriever (keyword / BM25)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 4
# Ensemble: 60% semantic + 40% keyword
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],
)
# Retrieve
results = hybrid_retriever.invoke("GPU part number RTX 4090")
for doc in results:
    print(doc.page_content[:100])
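To make the weighting concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion, the kind of rank-based merging an ensemble retriever can use to combine two ranked lists. It is an illustration under stated assumptions, not LangChain's actual code: the function name `weighted_rrf`, the chunk IDs, and the smoothing constant `c=60` are all chosen for this example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    Each document earns weight / (c + rank) from every list it appears in;
    documents are returned from highest to lowest fused score.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from a dense retriever and a BM25 retriever
dense_ranking = ["chunk_a", "chunk_b", "chunk_c"]
bm25_ranking = ["chunk_c", "chunk_a", "chunk_d"]

fused = weighted_rrf([dense_ranking, bm25_ranking], weights=[0.6, 0.4])
print(fused)  # ['chunk_a', 'chunk_c', 'chunk_b', 'chunk_d']
```

Chunks that appear in both lists (chunk_a and chunk_c) rise above chunks found by only one retriever, which is the behavior hybrid search relies on.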
Technique 2 — Re-Ranking
After retrieval, use a cross-encoder model to re-rank chunks by true relevance:
pip install cohere
import os
import cohere
from langchain_core.documents import Document
# Pass the key explicitly; this example reads it from the COHERE_API_KEY env var
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])
def rerank_documents(query: str, docs: list[Document], top_n: int = 3) -> list[Document]:
    """Re-rank retrieved documents using Cohere Rerank."""
    passages = [doc.page_content for doc in docs]
    results = co.rerank(
        query=query,
        documents=passages,
        top_n=top_n,
        model="rerank-english-v3.0",
    )
    # Each result carries the index of its passage in the original list
    reranked = [docs[r.index] for r in results.results]
    scores = [r.relevance_score for r in results.results]
    print(f"Re-ranked scores: {[f'{s:.3f}' for s in scores]}")
    return reranked
# Usage
raw_docs = hybrid_retriever.invoke("What are the system requirements?")
final_docs = rerank_documents("What are the system requirements?", raw_docs, top_n=3)
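The retrieve-then-rerank pattern does not depend on any particular provider. The sketch below shows its shape with a pluggable scoring function that sees the query and the passage together. The `overlap_score` function is a hypothetical toy stand-in for a cross-encoder (it only counts shared tokens), and the passages are made up for illustration; a real system would call a trained re-ranking model instead.

```python
from typing import Callable


def rerank(
    query: str,
    passages: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Score every (query, passage) pair jointly and keep the top_n best."""
    scored = [(passage, score(query, passage)) for passage in passages]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of query tokens in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


# Hypothetical candidates, in the order a first-stage retriever returned them
candidates = [
    "Shipping takes three to five business days.",
    "Minimum system specs: 16 GB RAM and a recent GPU.",
    "The system requirements are listed in the install guide.",
]

top = rerank("system requirements", candidates, overlap_score, top_n=2)
for passage, relevance in top:
    print(f"{relevance:.2f}  {passage}")
```

Because the scorer is passed in as a function, the same pipeline code works whether the score comes from a hosted rerank API or a local model.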
Technique 3 — Multi-Query Retrieval
Generate multiple query variations to improve recall:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=dense_retriever,
    llm=llm,
)
# Automatically generates 3 query variations and merges results
docs = multi_retriever.invoke("How do transformers handle long sequences?")
print(f"Retrieved {len(docs)} unique chunks")
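The merge step behind multi-query retrieval can be sketched in a few lines: run every query variation, then keep each chunk only the first time it appears. This is a minimal illustration, not the library's implementation. The `fake_index` dictionary, the query variations, and the chunk names are hypothetical, and the variations would normally come from an LLM.

```python
from typing import Callable, Iterable


def multi_query_retrieve(
    queries: Iterable[str],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Run each query and return the union of results, first occurrence wins."""
    seen = set()
    merged = []
    for query in queries:
        for chunk in retrieve(query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Hypothetical retriever: maps each query variation to the chunks it returns
fake_index = {
    "How do transformers handle long sequences?": ["chunk_1", "chunk_2"],
    "What limits transformer context length?": ["chunk_2", "chunk_3"],
    "Techniques for long inputs in attention models": ["chunk_3", "chunk_1", "chunk_4"],
}

merged = multi_query_retrieve(fake_index.keys(), lambda q: fake_index.get(q, []))
print(f"Retrieved {len(merged)} unique chunks: {merged}")
```

Seven raw results collapse to four unique chunks, and chunk_4 is found only because the third phrasing was tried.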
Technique 4 — Metadata Filtering
# Filter by document source or date
filtered_retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 4,
        "filter": {"source": "policy_2024.pdf"},
    },
)
# Or filter by date range
recent_retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"year": {"$gte": 2023}},  # Chroma metadata filter syntax
    },
)
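To show what such a filter means, the sketch below evaluates the two filter dictionaries above against plain metadata dictionaries. It is a simplified illustration, not Chroma's implementation: it supports only a small subset of comparison operators, it assumes a chunk without the filtered field never matches, and the `matches` helper and sample metadata are hypothetical.

```python
import operator

# Subset of comparison operators handled by this sketch
OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata: dict, where: dict) -> bool:
    """Return True if the metadata satisfies every condition in the filter."""
    for field, condition in where.items():
        if field not in metadata:
            return False  # assumption: a missing field never matches
        value = metadata[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2023}}
            if not all(OPS[op](value, operand) for op, operand in condition.items()):
                return False
        elif value != condition:
            # Shorthand equality form, e.g. {"source": "policy_2024.pdf"}
            return False
    return True


# Hypothetical chunk metadata
chunks = [
    {"source": "policy_2024.pdf", "year": 2024},
    {"source": "policy_2021.pdf", "year": 2021},
    {"source": "faq.md"},  # no "year" field was stored for this chunk
]

recent = [c for c in chunks if matches(c, {"year": {"$gte": 2023}})]
from_policy = [c for c in chunks if matches(c, {"source": "policy_2024.pdf"})]
print(recent)
print(from_policy)
```

The third chunk is excluded from the date filter because it was indexed without a year, which is why metadata has to be attached before indexing.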
Common Mistakes
- Skipping re-ranking for cost reasons: hosted re-ranking typically costs a fraction of a cent per call (check your provider's current pricing) and often improves answer quality noticeably. Evaluate it before ruling it out for production RAG.
- Poorly balanced BM25 and dense weights: start with 60% dense / 40% BM25, then tune based on whether your queries are more semantic or keyword-based.
- Multi-query without deduplication: different query variations often return the same chunks. LangChain's MultiQueryRetriever returns the unique union of results, but a hand-rolled pipeline must deduplicate before passing context to the LLM.
- No metadata at indexing time — metadata filtering requires that metadata was stored when documents were indexed. Add metadata at load time, not retrieval time.
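As an illustration of the last point, the sketch below attaches metadata to every chunk at load time, so that it is available to filters later. It is a minimal sketch with assumptions: the `chunk_with_metadata` helper, the fixed-size character splitting, and the placeholder text are hypothetical, and each record would typically be turned into a Document (with `page_content` and `metadata`) before indexing.

```python
def chunk_with_metadata(text: str, metadata: dict, chunk_size: int = 200) -> list[dict]:
    """Split text into fixed-size chunks, copying the metadata onto each one."""
    records = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        records.append(
            {
                "page_content": text[start:start + chunk_size],
                # Copy so every chunk owns its metadata, plus its position
                "metadata": {**metadata, "chunk_index": index},
            }
        )
    return records


# Placeholder text standing in for a loaded document
policy_text = "A" * 450

records = chunk_with_metadata(
    policy_text,
    {"source": "policy_2024.pdf", "year": 2024},
    chunk_size=200,
)
for record in records:
    print(len(record["page_content"]), record["metadata"])
```

Each chunk carries its own copy of the source and year, so a filter such as {"year": {"$gte": 2023}} has something to match against once the chunks are indexed.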
Quick Quiz
Q1. What problem does hybrid search solve that pure semantic search cannot?
A1. Exact keyword matching — semantic search misses precise terms like product codes, names, and identifiers that BM25 can find exactly.
Q2. What is re-ranking and why is it more accurate than top-k retrieval alone?
A2. Re-ranking uses a cross-encoder to compare the query and each document together (not separately), giving a more accurate relevance score than the bi-encoder used for initial retrieval.
Q3. How does multi-query retrieval improve recall?
A3. By generating multiple phrasings of the same question, it retrieves chunks that might be missed by a single query framing, improving the chance of finding relevant content.
Student Exercise
Exercise 4.5 — Compare RAG strategies
Build three RAG pipelines on the same document set: (1) Basic semantic search, (2) Hybrid search, (3) Hybrid + re-ranking. Ask 10 test questions. Evaluate answer quality with an LLM-as-judge. Which strategy performs best?
Further Reading
- 📘 LangChain Advanced RAG How-To
- 📘 Cohere Rerank API
- 📄 Lost in the Middle: How LLMs Use Long Contexts (Liu et al., 2023)