4.5 Advanced RAG

Key Concepts: Hybrid search · Re-ranking · Multi-query retrieval · Metadata filtering

Official Docs: LangChain Advanced RAG · Cohere Rerank


Why Basic RAG Falls Short

Basic vector similarity search has two main weaknesses:

  1. Semantic search can miss exact keyword matches. A query such as "What is the GPU part number?" depends on an exact product code, which embeddings represent poorly and keyword matching finds directly.
  2. Top-k retrieval orders chunks by embedding distance, which is only a proxy for relevance: the 4th-closest vector may answer the question better than the 2nd-closest.

Advanced RAG adds layers to fix these.
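
To make the first weakness concrete, here is a minimal sketch of BM25-style keyword scoring in plain Python. It is an illustration only, not the rank_bm25 implementation: the tokenizer is a naive whitespace split, and the corpus, the function names, and the parameter defaults (k1 = 1.5, b = 0.75) are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer; real systems also handle punctuation and stemming.
    return text.lower().split()


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in a non-empty corpus against the query with Okapi BM25."""
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical three-document corpus; only the first contains the exact code.
corpus = [
    "the graphics card part number is rtx-4090-fe",
    "graphics cards accelerate matrix multiplication",
    "our return policy covers all hardware purchases",
]
scores = bm25_scores("rtx-4090-fe", corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])
```

Only the document containing the literal token gets a non-zero score, which is the behaviour a product-code lookup needs.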


Technique 1 — Hybrid Search (Semantic + BM25)

Combine dense vector search with sparse keyword search:

pip install langchain langchain-community langchain-chroma langchain-openai rank_bm25

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Your documents (already split into chunks)
docs = [...] # list of Document objects

# Dense retriever (semantic)
vector_store = Chroma.from_documents(docs, OpenAIEmbeddings())
dense_retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# Sparse retriever (keyword / BM25)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 4

# Ensemble: 60% semantic + 40% keyword
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],
)

# Retrieve and preview the first 100 characters of each chunk
results = hybrid_retriever.invoke("GPU part number RTX 4090")
for doc in results:
    print(doc.page_content[:100])
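
It can help to see how two ranked lists might be merged using those weights. The sketch below uses weighted Reciprocal Rank Fusion, which to my understanding is the approach EnsembleRetriever takes, though the exact details may differ by version. The function name, the document IDs, and the constant c = 60 are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Rank 1 contributes weight / (1 + c), rank 2 contributes weight / (2 + c), ...
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from a dense retriever and a BM25 retriever.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]
fused = weighted_rrf([dense_ranking, bm25_ranking], weights=[0.6, 0.4])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they tend to rise above documents that only one retriever found.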

Technique 2 — Re-Ranking

After retrieval, use a cross-encoder model to re-score each chunk against the query, then keep only the highest-scoring chunks:

pip install cohere

import os

import cohere
from langchain_core.documents import Document

# Pass the key explicitly so the code does not depend on the SDK's default env var name
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def rerank_documents(query: str, docs: list[Document], top_n: int = 3) -> list[Document]:
    """Re-rank retrieved documents using Cohere Rerank."""
    passages = [doc.page_content for doc in docs]

    results = co.rerank(
        query=query,
        documents=passages,
        top_n=top_n,
        model="rerank-english-v3.0",
    )

    # Each result carries the index of the passage in the original list
    reranked = [docs[r.index] for r in results.results]
    scores = [r.relevance_score for r in results.results]
    print(f"Re-ranked scores: {[f'{s:.3f}' for s in scores]}")
    return reranked

# Usage
raw_docs = hybrid_retriever.invoke("What are the system requirements?")
final_docs = rerank_documents("What are the system requirements?", raw_docs, top_n=3)
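
The retrieve-then-rerank pattern can also be tried offline, without an API key. The sketch below scores each (query, passage) pair jointly and keeps the best ones. The word-overlap scorer is a deliberately crude, hypothetical stand-in for a cross-encoder, and all names and passages are invented for the example.

```python
from typing import Callable


def rerank(
    query: str,
    passages: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Score each (query, passage) pair jointly and keep the top_n passages."""
    scored = [(p, score(query, p)) for p in passages]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, passage: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query words found in the passage.
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "Shipping takes three to five business days.",
    "Minimum system requirements are 16 GB RAM and a recent GPU.",
    "The system was redesigned last year.",
]
top = rerank("system requirements", candidates, overlap_score, top_n=2)
for passage, s in top:
    print(f"{s:.2f}  {passage}")
```

Swapping overlap_score for a call to a real re-ranking model leaves the rest of the pipeline unchanged, which is the main reason to keep the scorer pluggable.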

Technique 3 — Multi-Query Retrieval

Generate multiple query variations to improve recall:

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=dense_retriever,
    llm=llm,
)

# Automatically generates 3 query variations and merges results
docs = multi_retriever.invoke("How do transformers handle long sequences?")
print(f"Retrieved {len(docs)} unique chunks")
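
If you run the query variations yourself rather than through a retriever class, the merge step is where duplicates need to be removed. Here is a minimal sketch of that step, assuming chunks can be compared as plain strings. The fake index and the query phrasings are hypothetical and stand in for an LLM's rewrites and a real retriever.

```python
from typing import Callable


def merge_unique(result_lists: list[list[str]]) -> list[str]:
    """Merge per-query result lists, keeping the first occurrence of each chunk."""
    seen: set[str] = set()
    merged: list[str] = []
    for results in result_lists:
        for chunk in results:
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


def multi_query_retrieve(queries: list[str], retrieve: Callable[[str], list[str]]) -> list[str]:
    # Run every phrasing through the same retriever, then deduplicate.
    return merge_unique([retrieve(q) for q in queries])


# Hypothetical stand-in for a retriever: query text -> retrieved chunk IDs.
fake_index = {
    "How do transformers handle long sequences?": ["chunk-attention", "chunk-positional"],
    "What limits transformer context length?": ["chunk-attention", "chunk-memory"],
    "Techniques for long-context transformers": ["chunk-sparse", "chunk-memory"],
}
chunks = multi_query_retrieve(list(fake_index), fake_index.__getitem__)
print(chunks)
```

Six retrieved items collapse to four unique chunks, and the order of first appearance is preserved.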

Technique 4 — Metadata Filtering

Restrict retrieval to chunks whose stored metadata matches a filter:

# Filter by document source
filtered_retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 4,
        "filter": {"source": "policy_2024.pdf"},
    },
)

# Or filter by a numeric range (here: year 2023 or later)
recent_retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"year": {"$gte": 2023}},  # Chroma metadata filter syntax
    },
)
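
To see what such a filter means, the sketch below evaluates the same filter shapes against plain metadata dictionaries. It illustrates the semantics only and is not how Chroma implements filtering. The matches helper and the sample chunks are hypothetical.

```python
import operator

# Comparison operators in the style of the filter syntax above.
_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if the metadata satisfies every condition in the filter."""
    for key, condition in flt.items():
        if key not in metadata:
            # A chunk indexed without this field can never match.
            return False
        value = metadata[key]
        if isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2023}}
            if not all(_OPS[op](value, target) for op, target in condition.items()):
                return False
        elif value != condition:
            # Shorthand equality form, e.g. {"source": "policy_2024.pdf"}
            return False
    return True


chunks = [
    {"text": "Remote work policy", "metadata": {"source": "policy_2024.pdf", "year": 2024}},
    {"text": "Old travel policy", "metadata": {"source": "policy_2021.pdf", "year": 2021}},
    {"text": "Untagged note", "metadata": {}},
]
recent = [c["text"] for c in chunks if matches(c["metadata"], {"year": {"$gte": 2023}})]
from_policy = [c["text"] for c in chunks if matches(c["metadata"], {"source": "policy_2024.pdf"})]
print(recent, from_policy)
```

The untagged chunk is excluded by both filters, which is why metadata has to be attached when documents are indexed.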

Common Mistakes

  1. Skipping re-ranking for cost reasons: re-ranking is cheap (on the order of $0.001 per call; verify against current pricing) and often improves answer quality noticeably. Include it in production RAG unless your own evaluation shows it does not help.
  2. Poorly balanced BM25 and dense weights: start with 60% dense / 40% BM25, then tune based on whether your queries are more semantic or more keyword-based.
  3. Multi-query without deduplication: different query variations often return the same chunk. LangChain's MultiQueryRetriever already returns the unique union of results, but if you merge result lists yourself, deduplicate them before building the prompt.
  4. No metadata at indexing time: metadata filtering only works if the metadata was stored when the documents were indexed. Add metadata at load time, not at retrieval time.

Quick Quiz

Test Your Understanding

Q1. What problem does hybrid search solve that pure semantic search cannot?
A1. Exact keyword matching — semantic search misses precise terms like product codes, names, and identifiers that BM25 can find exactly.

Q2. What is re-ranking and why is it more accurate than top-k retrieval alone?
A2. Re-ranking uses a cross-encoder to compare the query and each document together (not separately), giving a more accurate relevance score than the bi-encoder used for initial retrieval.

Q3. How does multi-query retrieval improve recall?
A3. By generating multiple phrasings of the same question, it retrieves chunks that might be missed by a single query framing, improving the chance of finding relevant content.


Student Exercise

Exercise 4.5 — Compare RAG strategies
Build three RAG pipelines on the same document set: (1) Basic semantic search, (2) Hybrid search, (3) Hybrid + re-ranking. Ask 10 test questions. Evaluate answer quality with an LLM-as-judge. Which strategy performs best?
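
One possible scaffold for the comparison, offered as a starting point rather than a reference solution: each pipeline is a function from question to answer, and the judge is a function that returns a score between 0 and 1. The dummy pipelines and the keyword-based dummy_judge below are placeholders you would replace with your real pipelines and an LLM grading call.

```python
from statistics import mean
from typing import Callable

Pipeline = Callable[[str], str]      # question -> answer
Judge = Callable[[str, str], float]  # (question, answer) -> score in [0, 1]


def compare_pipelines(
    pipelines: dict[str, Pipeline],
    questions: list[str],
    judge: Judge,
) -> dict[str, float]:
    """Return the mean judge score per pipeline over the same question set."""
    return {
        name: mean(judge(q, pipeline(q)) for q in questions)
        for name, pipeline in pipelines.items()
    }


def dummy_judge(question: str, answer: str) -> float:
    # Placeholder: replace with a call to an LLM that grades the answer.
    return 1.0 if "requirements" in answer else 0.0


# Placeholder pipelines; swap in basic, hybrid, and hybrid + re-ranking.
pipelines = {
    "basic": lambda q: "I don't know.",
    "hybrid_rerank": lambda q: "The requirements are listed in section 2.",
}
scores = compare_pipelines(pipelines, ["What are the system requirements?"], dummy_judge)
print(scores)
```

Using the same questions and the same judge for every pipeline keeps the comparison fair. With only 10 questions, treat small score differences as noise.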

