4.6 RAG for Structured & Tabular Data
AI-generated content may contain errors. Always verify against official sources.
4.6 RAG for Structured & Tabular Data
Key Concepts: SQL generation · Pandas integration · Cricket scorecards as structured input
Official Docs: LangChain SQL Agent · Pandas AI
What is RAG over Structured Data?
Traditional RAG retrieves from unstructured text (PDFs, web pages). But many real-world use cases involve structured data — databases, spreadsheets, CSV files. RAG over structured data generates queries (SQL, pandas) to retrieve answers.
Pattern 1 — Text-to-SQL
Convert natural language questions to SQL, execute, and return results:
pip install langchain langchain-openai sqlalchemy
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain
from langchain_core.prompts import PromptTemplate
# Connect to a SQLite database (create a sample one first)
import sqlite3, pandas as pd
conn = sqlite3.connect("cricket.db")
pd.DataFrame({
"player": ["Rohit Sharma", "Virat Kohli", "Steve Smith", "Kane Williamson"],
"matches": [243, 291, 144, 156],
"runs": [9837, 12344, 7540, 7172],
"average": [49.2, 58.7, 57.2, 54.3],
}).to_sql("odi_stats", conn, if_exists="replace", index=False)
conn.close()
# Set up LangChain SQL chain
db = SQLDatabase.from_uri("sqlite:///cricket.db")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = create_sql_query_chain(llm, db)
# Ask a natural language question
sql_query = chain.invoke({"question": "Who has the highest batting average?"})
print(f"Generated SQL: {sql_query}")
# Execute the query
result = db.run(sql_query)
print(f"Result: {result}")
Pattern 2 — Pandas Integration
For CSV/Excel data, generate and execute pandas code:
from openai import OpenAI
import pandas as pd
client = OpenAI()
# Sample scorecard data
scorecard = pd.DataFrame({
"player": ["Rohit", "Kohli", "Rahul", "Pant", "Hardik"],
"runs": [87, 45, 23, 67, 12],
"balls": [65, 52, 30, 45, 15],
"fours": [8, 4, 2, 6, 1],
"sixes": [3, 1, 0, 2, 0],
})
scorecard["strike_rate"] = (scorecard["runs"] / scorecard["balls"] * 100).round(1)
def query_dataframe(df: pd.DataFrame, question: str) -> str:
# Describe the dataframe structure
df_info = f"DataFrame columns: {list(df.columns)}\nFirst 3 rows:\n{df.head(3).to_string()}"
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"""Given this pandas DataFrame:
{df_info}
Write Python pandas code to answer: {question}
Return ONLY the pandas expression (no df = ..., no import).
Example: df[df['runs'] > 50]['player'].tolist()"""
}],
temperature=0,
)
pandas_code = resp.choices[0].message.content.strip().strip('`').strip()
# Execute safely
try:
result = eval(pandas_code, {"df": df, "pd": pd})
return str(result)
except Exception as e:
return f"Error executing query: {e}"
print(query_dataframe(scorecard, "Who scored the most runs?"))
print(query_dataframe(scorecard, "What is the team total?"))
print(query_dataframe(scorecard, "Who has the highest strike rate?"))
Pattern 3 — Full SQL Agent
For complex multi-step SQL queries, use an agent:
from langchain_community.agent_toolkits import create_sql_agent
from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase
db = SQLDatabase.from_uri("sqlite:///cricket.db")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
agent = create_sql_agent(llm, db=db, verbose=True)
# Agent can handle multi-step queries
result = agent.invoke({
"input": "Compare the batting averages of players with more than 200 matches vs fewer than 200 matches."
})
print(result["output"])
Common Mistakes
- Executing generated SQL without a read-only connection — a text-to-SQL model can generate
DROP TABLEorDELETE. Always use a read-only database connection for SQL agents. - Executing generated pandas code with
evalwithout sandboxing — never allow arbitrary code execution in production. Use a restricted namespace and validate the code. - No schema description — provide the LLM with table schemas, column descriptions, and sample values. Without context, generated SQL will have wrong column names.
- Ignoring query results in the final response — after executing the SQL/pandas query, pass the result back to the LLM to generate a natural language answer. Don’t just return raw query output.
Quick Quiz
Q1. Why should SQL agents use a read-only database connection?
A1. An LLM can generate destructive SQL (DROP, DELETE, UPDATE). A read-only connection prevents accidental or malicious data modification.
Q2. What information should you provide to the LLM for reliable text-to-SQL generation?
A2. Table names, column names with data types, sample values, and business context (e.g., what each table represents).
Q3. What is the advantage of using a SQL agent vs a simple chain for complex queries?
A3. An agent can inspect the database schema, run exploratory queries, correct SQL errors, and iterate — handling multi-step problems that a single-shot chain cannot.
Student Exercise
Exercise 4.6 — Cricket stats chatbot
Create an SQLite database with IPL player statistics (player, team, matches, runs, wickets). Build a text-to-SQL chatbot that answers: "Who is the top scorer from Mumbai Indians?", "List all-rounders with 500+ runs and 20+ wickets". Add a read-only guardrail.
Further Reading
End of Chapter 4: RAG → Chapter 5: Chains