7.5 Guardrails & Safety
AI-generated content may contain errors. Always verify against official sources.
7.5 Guardrails & Safety
Key Concepts: Input/output filtering · Scope limiting · Human-in-the-loop
Official Docs: OpenAI Agents SDK — Guardrails · OpenAI Moderation API
Why Agents Need Guardrails
Agents have broad capabilities — they can execute code, call APIs, send emails, and modify databases. Without guardrails:
- An agent may take irreversible harmful actions (delete data, send incorrect messages)
- Users may prompt-inject to make the agent do things outside its intended scope
- The agent may loop indefinitely or take thousands of API calls before failing
Layer 1 — Input Filtering
Filter user inputs before they reach the agent:
from openai import OpenAI
client = OpenAI()
def is_safe_input(user_message: str) -> bool:
"""Use OpenAI moderation API to check for harmful content."""
resp = client.moderations.create(input=user_message)
result = resp.results[0]
# Flag if any category is True
return not result.flagged
def run_agent_safely(user_message: str) -> str:
if not is_safe_input(user_message):
return "I’m sorry, I can’t help with that request."
# ... run agent
return agent_response
print(is_safe_input("Tell me how to cook pasta.")) # True
The OpenAI Moderation API is free and checks for: hate, harassment, self-harm, sexual content, and violence.
Layer 2 — Scope Limiting (System Prompt Rules)
Explicitly restrict the agent’s domain in the system prompt:
from agents import Agent
cs_agent = Agent(
name="Customer Support Agent",
model="gpt-4o-mini",
instructions="""
You are a customer support agent for ShopCo.
You MAY:
- Answer questions about orders, delivery, and returns
- Look up order status using the check_order tool
- Escalate complex issues to a human agent
You MUST NOT:
- Give legal or financial advice
- Discuss competitor products
- Execute any action that modifies an order without human confirmation
- Discuss topics unrelated to ShopCo customer support
If asked anything outside your scope, respond:
'I can only help with ShopCo order enquiries. Shall I connect you to a human agent?'
""",
)
Layer 3 — OpenAI Agents SDK Guardrails
The OpenAI Agents SDK has a built-in guardrail system:
from agents import Agent, Runner, GuardrailFunctionOutput, input_guardrail
from pydantic import BaseModel
class TopicCheck(BaseModel):
is_on_topic: bool
reason: str
@input_guardrail
async def topic_guardrail(ctx, agent, input_data) -> GuardrailFunctionOutput:
"""Only allow queries related to software development."""
check_agent = Agent(
name="Topic Checker",
instructions="Determine if the message is about software development.",
output_type=TopicCheck,
)
result = await Runner.run(check_agent, input_data, context=ctx.context)
check: TopicCheck = result.final_output
return GuardrailFunctionOutput(
output_info=check,
tripwire_triggered=not check.is_on_topic,
)
dev_agent = Agent(
name="Dev Helper",
instructions="You are a software development assistant.",
input_guardrails=[topic_guardrail],
)
Layer 4 — Human-in-the-Loop
For high-stakes or irreversible actions, require human confirmation before execution:
def confirm_action(action: str, details: str) -> bool:
"""Ask the human to confirm before executing a dangerous action."""
print(f"\n⚠️ The agent wants to: {action}")
print(f"Details: {details}")
response = input("Approve? (yes/no): ").strip().lower()
return response == "yes"
# In your tool function:
def delete_record(record_id: str) -> str:
"""Delete a database record."""
if not confirm_action("delete_record", f"Record ID: {record_id}"):
return "Action cancelled by user."
# ... execute deletion
return f"Record {record_id} deleted."
Layer 5 — Output Validation
Validate agent outputs before returning them to users:
from pydantic import BaseModel, ValidationError
class AgentOutput(BaseModel):
answer: str
sources: list[str]
confidence: float # 0.0 – 1.0
def validate_output(raw_output: str) -> AgentOutput | None:
try:
import json
data = json.loads(raw_output)
return AgentOutput(**data)
except (ValidationError, json.JSONDecodeError) as e:
print(f"Output validation failed: {e}")
return None
Common Mistakes
- No max_turns limit — always set a maximum iteration count. Runaway agents are both expensive and can cause unintended side effects.
- Irreversible tools without confirmation — any tool that modifies data (delete, send, update) should require human confirmation in production.
- Trusting the model’s own safety — do not rely solely on the model’s built-in refusals. Always add external input filtering.
- Prompt injection from external sources — if the agent reads web pages or documents, an attacker can inject instructions into those documents. Sanitise tool results before injecting into context.
Quick Quiz
Q1. What is prompt injection in the context of agents?
A1. An attack where malicious instructions are hidden in documents or web pages that the agent reads, tricking it into executing unintended actions.
Q2. What does the OpenAI Moderation API check for?
A2. Hate speech, harassment, self-harm, sexual content, and violence in text inputs.
Q3. What is human-in-the-loop, and when is it required?
A3. A design pattern where a human must approve agent actions before they are executed. Required for irreversible or high-stakes actions (deletion, external communications, financial transactions).
Student Exercise
Exercise 7.5 — Safe customer support agent
Build a customer support agent with: (1) input filtering using the OpenAI Moderation API, (2) a scope-limiting system prompt, (3) a cancel_order tool that requires human confirmation before executing. Test with both on-topic and off-topic inputs.
Further Reading
- 📘 OpenAI Agents SDK — Guardrails
- 📘 OpenAI Moderation API
- 📄 Prompt Injection Attacks Against LLM-Integrated Applications (Greshake et al., 2023)
Next Chapter → Chapter 8: Evaluation