Skip to main content

7.5 Guardrails & Safety

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

7.5 Guardrails & Safety

Key Concepts: Input/output filtering · Scope limiting · Human-in-the-loop

Official Docs: OpenAI Agents SDK — Guardrails · OpenAI Moderation API


Why Agents Need Guardrails

Agents have broad capabilities — they can execute code, call APIs, send emails, and modify databases. Without guardrails:

  • An agent may take irreversible harmful actions (delete data, send incorrect messages)
  • Users may prompt-inject to make the agent do things outside its intended scope
  • The agent may loop indefinitely or take thousands of API calls before failing

Layer 1 — Input Filtering

Filter user inputs before they reach the agent:

from openai import OpenAI

client = OpenAI()

def is_safe_input(user_message: str) -> bool:
"""Use OpenAI moderation API to check for harmful content."""
resp = client.moderations.create(input=user_message)
result = resp.results[0]
# Flag if any category is True
return not result.flagged

def run_agent_safely(user_message: str) -> str:
if not is_safe_input(user_message):
return "I’m sorry, I can’t help with that request."
# ... run agent
return agent_response

print(is_safe_input("Tell me how to cook pasta.")) # True

The OpenAI Moderation API is free and checks for: hate, harassment, self-harm, sexual content, and violence.


Layer 2 — Scope Limiting (System Prompt Rules)

Explicitly restrict the agent’s domain in the system prompt:

from agents import Agent

cs_agent = Agent(
name="Customer Support Agent",
model="gpt-4o-mini",
instructions="""
You are a customer support agent for ShopCo.

You MAY:
- Answer questions about orders, delivery, and returns
- Look up order status using the check_order tool
- Escalate complex issues to a human agent

You MUST NOT:
- Give legal or financial advice
- Discuss competitor products
- Execute any action that modifies an order without human confirmation
- Discuss topics unrelated to ShopCo customer support

If asked anything outside your scope, respond:
'I can only help with ShopCo order enquiries. Shall I connect you to a human agent?'
""",
)

Layer 3 — OpenAI Agents SDK Guardrails

The OpenAI Agents SDK has a built-in guardrail system:

from agents import Agent, Runner, GuardrailFunctionOutput, input_guardrail
from pydantic import BaseModel

class TopicCheck(BaseModel):
is_on_topic: bool
reason: str

@input_guardrail
async def topic_guardrail(ctx, agent, input_data) -> GuardrailFunctionOutput:
"""Only allow queries related to software development."""
check_agent = Agent(
name="Topic Checker",
instructions="Determine if the message is about software development.",
output_type=TopicCheck,
)
result = await Runner.run(check_agent, input_data, context=ctx.context)
check: TopicCheck = result.final_output
return GuardrailFunctionOutput(
output_info=check,
tripwire_triggered=not check.is_on_topic,
)

dev_agent = Agent(
name="Dev Helper",
instructions="You are a software development assistant.",
input_guardrails=[topic_guardrail],
)

Layer 4 — Human-in-the-Loop

For high-stakes or irreversible actions, require human confirmation before execution:

def confirm_action(action: str, details: str) -> bool:
"""Ask the human to confirm before executing a dangerous action."""
print(f"\n⚠️ The agent wants to: {action}")
print(f"Details: {details}")
response = input("Approve? (yes/no): ").strip().lower()
return response == "yes"

# In your tool function:
def delete_record(record_id: str) -> str:
"""Delete a database record."""
if not confirm_action("delete_record", f"Record ID: {record_id}"):
return "Action cancelled by user."
# ... execute deletion
return f"Record {record_id} deleted."

Layer 5 — Output Validation

Validate agent outputs before returning them to users:

from pydantic import BaseModel, ValidationError

class AgentOutput(BaseModel):
answer: str
sources: list[str]
confidence: float # 0.0 – 1.0

def validate_output(raw_output: str) -> AgentOutput | None:
try:
import json
data = json.loads(raw_output)
return AgentOutput(**data)
except (ValidationError, json.JSONDecodeError) as e:
print(f"Output validation failed: {e}")
return None

Common Mistakes

Common Mistakes
  1. No max_turns limit — always set a maximum iteration count. Runaway agents are both expensive and can cause unintended side effects.
  2. Irreversible tools without confirmation — any tool that modifies data (delete, send, update) should require human confirmation in production.
  3. Trusting the model’s own safety — do not rely solely on the model’s built-in refusals. Always add external input filtering.
  4. Prompt injection from external sources — if the agent reads web pages or documents, an attacker can inject instructions into those documents. Sanitise tool results before injecting into context.

Quick Quiz

Test Your Understanding

Q1. What is prompt injection in the context of agents?
A1. An attack where malicious instructions are hidden in documents or web pages that the agent reads, tricking it into executing unintended actions.

Q2. What does the OpenAI Moderation API check for?
A2. Hate speech, harassment, self-harm, sexual content, and violence in text inputs.

Q3. What is human-in-the-loop, and when is it required?
A3. A design pattern where a human must approve agent actions before they are executed. Required for irreversible or high-stakes actions (deletion, external communications, financial transactions).


Student Exercise

Exercise 7.5 — Safe customer support agent
Build a customer support agent with: (1) input filtering using the OpenAI Moderation API, (2) a scope-limiting system prompt, (3) a cancel_order tool that requires human confirmation before executing. Test with both on-topic and off-topic inputs.


Further Reading

Next Chapter → Chapter 8: Evaluation