4.6 RAG for Structured & Tabular Data

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

4.6 RAG for Structured & Tabular Data

Key Concepts: SQL generation · Pandas integration · Cricket scorecards as structured input

Official Docs: LangChain SQL Agent · Pandas AI

What is RAG over Structured Data?

Traditional RAG retrieves from unstructured text (PDFs, web pages). But many real-world use cases involve structured data — databases, spreadsheets, CSV files. RAG over structured data generates queries (SQL, pandas) to retrieve answers.

Pattern 1 — Text-to-SQL

Convert natural language questions to SQL, execute, and return results:

pip install langchain langchain-openai sqlalchemy

from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain
from langchain_core.prompts import PromptTemplate

# Connect to a SQLite database (create a sample one first)
import sqlite3, pandas as pd
conn = sqlite3.connect("cricket.db")
pd.DataFrame({
    "player": ["Rohit Sharma", "Virat Kohli", "Steve Smith", "Kane Williamson"],
    "matches": [243, 291, 144, 156],
    "runs": [9837, 12344, 7540, 7172],
    "average": [49.2, 58.7, 57.2, 54.3],
}).to_sql("odi_stats", conn, if_exists="replace", index=False)
conn.close()

# Set up LangChain SQL chain
db = SQLDatabase.from_uri("sqlite:///cricket.db")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = create_sql_query_chain(llm, db)

# Ask a natural language question
sql_query = chain.invoke({"question": "Who has the highest batting average?"})
print(f"Generated SQL: {sql_query}")

# Execute the query
result = db.run(sql_query)
print(f"Result: {result}")

Pattern 2 — Pandas Integration

For CSV/Excel data, generate and execute pandas code:

from openai import OpenAI
import pandas as pd

client = OpenAI()

# Sample scorecard data
scorecard = pd.DataFrame({
    "player": ["Rohit", "Kohli", "Rahul", "Pant", "Hardik"],
    "runs": [87, 45, 23, 67, 12],
    "balls": [65, 52, 30, 45, 15],
    "fours": [8, 4, 2, 6, 1],
    "sixes": [3, 1, 0, 2, 0],
})
scorecard["strike_rate"] = (scorecard["runs"] / scorecard["balls"] * 100).round(1)

def query_dataframe(df: pd.DataFrame, question: str) -> str:
    # Describe the dataframe structure
    df_info = f"DataFrame columns: {list(df.columns)}\nFirst 3 rows:\n{df.head(3).to_string()}"
    
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Given this pandas DataFrame:
{df_info}

Write Python pandas code to answer: {question}
Return ONLY the pandas expression (no df = ..., no import). 
Example: df[df['runs'] > 50]['player'].tolist()"""
        }],
        temperature=0,
    )
    pandas_code = resp.choices[0].message.content.strip().strip('`').strip()
    
    # Execute safely
    try:
        result = eval(pandas_code, {"df": df, "pd": pd})
        return str(result)
    except Exception as e:
        return f"Error executing query: {e}"

print(query_dataframe(scorecard, "Who scored the most runs?"))
print(query_dataframe(scorecard, "What is the team total?"))
print(query_dataframe(scorecard, "Who has the highest strike rate?"))

Pattern 3 — Full SQL Agent

For complex multi-step SQL queries, use an agent:

from langchain_community.agent_toolkits import create_sql_agent
from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///cricket.db")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

agent = create_sql_agent(llm, db=db, verbose=True)

# Agent can handle multi-step queries
result = agent.invoke({
    "input": "Compare the batting averages of players with more than 200 matches vs fewer than 200 matches."
})
print(result["output"])

Common Mistakes

Executing generated SQL without a read-only connection — a text-to-SQL model can generate DROP TABLE or DELETE. Always use a read-only database connection for SQL agents.
Executing generated pandas code with eval without sandboxing — never allow arbitrary code execution in production. Use a restricted namespace and validate the code.
No schema description — provide the LLM with table schemas, column descriptions, and sample values. Without context, generated SQL will have wrong column names.
Ignoring query results in the final response — after executing the SQL/pandas query, pass the result back to the LLM to generate a natural language answer. Don’t just return raw query output.

Quick Quiz

Test Your Understanding

Q1. Why should SQL agents use a read-only database connection?
A1. An LLM can generate destructive SQL (DROP, DELETE, UPDATE). A read-only connection prevents accidental or malicious data modification.

Q2. What information should you provide to the LLM for reliable text-to-SQL generation?
A2. Table names, column names with data types, sample values, and business context (e.g., what each table represents).

Q3. What is the advantage of using a SQL agent vs a simple chain for complex queries?
A3. An agent can inspect the database schema, run exploratory queries, correct SQL errors, and iterate — handling multi-step problems that a single-shot chain cannot.

Student Exercise

Exercise 4.6 — Cricket stats chatbot
Create an SQLite database with IPL player statistics (player, team, matches, runs, wickets). Build a text-to-SQL chatbot that answers: "Who is the top scorer from Mumbai Indians?", "List all-rounders with 500+ runs and 20+ wickets". Add a read-only guardrail.

4.6 RAG for Structured & Tabular Data

What is RAG over Structured Data?​

Pattern 1 — Text-to-SQL​

Pattern 2 — Pandas Integration​

Pattern 3 — Full SQL Agent​

Common Mistakes​

Quick Quiz​

Student Exercise​

Further Reading​

What is RAG over Structured Data?

Pattern 1 — Text-to-SQL

Pattern 2 — Pandas Integration

Pattern 3 — Full SQL Agent

Common Mistakes

Quick Quiz

Student Exercise

Further Reading