2.5 Prompt Versioning & Management
Key Concepts: Storing prompts as templates · A/B testing · Iteration workflow
Official Docs: LangSmith Prompt Hub · OpenAI Playground
Why Version Your Prompts?
Prompts are code. Changing a single word in a system prompt can shift output quality significantly. Without versioning you cannot:
- Reproduce a specific output
- Roll back a regression
- A/B test two prompt variants
- Audit what went to production
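One lightweight way to support reproduction and auditing is to log a fingerprint of the exact prompt text alongside each output. The sketch below is an illustration only: the helper name `prompt_fingerprint` and the 12-character truncation are assumptions, not part of any library.

```python
import hashlib


def prompt_fingerprint(template: str) -> str:
    """Return a short, stable identifier for the exact text of a prompt template."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return digest[:12]  # truncated for readability in logs (assumed length)


v1 = "Summarise the following article in 3 bullet points."
v2 = "Summarise the following article in 3 short bullet points."

# Identical text always yields the same fingerprint; any edit changes it.
print(prompt_fingerprint(v1))
print(prompt_fingerprint(v2))
```

Storing this value with every logged response lets you tell later which prompt text produced a given output, even if the version label was not updated.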
Pattern 1 — Prompts as Python Constants
The simplest approach: store prompts in a dedicated module with version comments.
# prompts.py
# v1.0 — initial release
SUMMARY_V1 = """
Summarise the following article in 3 bullet points.
Each bullet must be under 20 words.
ARTICLE:
{article}
"""
# v1.1 — added tone instruction after QA feedback
SUMMARY_V1_1 = """
Summarise the following article in 3 bullet points.
Each bullet must be under 20 words.
Use a neutral, journalistic tone.
ARTICLE:
{article}
"""
ACTIVE_SUMMARY_PROMPT = SUMMARY_V1_1 # single point of change
# usage.py
from prompts import ACTIVE_SUMMARY_PROMPT
prompt = ACTIVE_SUMMARY_PROMPT.format(article=article_text)
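If you want rollback to be a one-line change, the constants can be collected in a small registry keyed by version. This is a minimal sketch, not a prescribed layout: `SUMMARY_PROMPTS`, `ACTIVE_SUMMARY_VERSION`, and `render_summary_prompt` are hypothetical names, and the templates are shortened for the example.

```python
# Hypothetical registry: maps a version label to its template text.
SUMMARY_PROMPTS = {
    "1.0": (
        "Summarise the following article in 3 bullet points.\n"
        "ARTICLE:\n{article}"
    ),
    "1.1": (
        "Summarise the following article in 3 bullet points.\n"
        "Use a neutral, journalistic tone.\n"
        "ARTICLE:\n{article}"
    ),
}

ACTIVE_SUMMARY_VERSION = "1.1"  # change this label to roll back or forward


def render_summary_prompt(article: str, version: str = ACTIVE_SUMMARY_VERSION) -> str:
    """Fill the chosen template; fail loudly on an unknown version label."""
    try:
        template = SUMMARY_PROMPTS[version]
    except KeyError:
        raise ValueError(f"Unknown summary prompt version: {version!r}") from None
    return template.format(article=article)


print(render_summary_prompt("Example article text."))
```

Keeping old versions in the registry means a regression can be reverted by editing `ACTIVE_SUMMARY_VERSION` alone.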
Pattern 2 — YAML / JSON Prompt Files
For larger teams, store prompts in version-controlled YAML files:
# prompts/summary.yaml
name: article-summary
version: "1.2"
created: "2025-09-01"
changelog: "Added word limit, tone instruction, escape hatch"
# Literal braces in the template are doubled so that str.format() leaves them intact
template: |
  Summarise the following article in 3 bullet points.
  Each bullet must be under 20 words.
  Use a neutral, journalistic tone.
  If the article is not in English, translate first then summarise.
  If the article is too short to summarise, respond with: {{"error": "too short"}}
  ARTICLE:
  {article}
import yaml
with open("prompts/summary.yaml") as f:
    config = yaml.safe_load(f)
prompt = config["template"].format(article=article_text)
print(f"Using prompt v{config['version']}")
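The same pattern works with JSON and the standard library alone. The sketch below also validates the file before use; the set of required fields and the `load_prompt` helper are assumptions made for illustration, and the demo writes its own file to a temporary directory so it can run anywhere.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = {"name", "version", "template"}  # assumed minimum schema


def load_prompt(path: Path) -> dict:
    """Load a prompt file and fail fast if a required field is missing."""
    config = json.loads(path.read_text(encoding="utf-8"))
    missing = REQUIRED_FIELDS - config.keys()
    if missing:
        raise ValueError(f"{path.name} is missing fields: {sorted(missing)}")
    return config


# Demo: write a small prompt file to a temporary directory, then load it.
with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "summary.json"
    prompt_path.write_text(
        json.dumps(
            {
                "name": "article-summary",
                "version": "1.2",
                "template": "Summarise the following article in 3 bullet points.\nARTICLE:\n{article}",
            }
        ),
        encoding="utf-8",
    )
    config = load_prompt(prompt_path)
    prompt = config["template"].format(article="Example article text.")

print(f"Using {config['name']} v{config['version']}")
```

Validating at load time surfaces a malformed prompt file at startup instead of in the middle of a request.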
Pattern 3 — LangSmith Prompt Hub
For teams already using LangChain, LangSmith Prompt Hub provides collaborative prompt management with version history.
pip install langchain langsmith
from langchain import hub
# Pull the latest version of a prompt from the hub
# (to pin a specific version, append ":<commit-hash>" to the name)
prompt = hub.pull("hwchase17/react")  # community prompt
# Push your own prompt
from langchain_core.prompts import ChatPromptTemplate
my_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer concisely."),
    ("human", "{question}"),
])
hub.push("your-username/my-assistant", my_prompt)
A/B Testing Prompts
A/B testing compares two prompt variants against a quality metric:
import random
from openai import OpenAI
client = OpenAI()
PROMPT_A = "Summarise this review in one sentence: {review}"
PROMPT_B = "In one concise sentence, state the main sentiment and reason from this review: {review}"
def ab_test(review: str, n: int = 20) -> dict:
    scores = {"A": [], "B": []}
    for _ in range(n):
        # Randomly assign each trial to one of the two variants
        variant = random.choice(["A", "B"])
        template = PROMPT_A if variant == "A" else PROMPT_B
        output = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": template.format(review=review)}],
            temperature=0.3,
        ).choices[0].message.content or ""
        # Score with your own rubric (e.g., word count, keyword presence)
        scores[variant].append(len(output.split()))

    def average(values):
        # A variant may receive no samples when n is small; avoid dividing by zero
        return sum(values) / len(values) if values else None

    return {
        "A_avg_words": average(scores["A"]),
        "B_avg_words": average(scores["B"]),
    }
print(ab_test("Absolutely loved the product! Fast delivery and great quality."))
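Random assignment on a single review gives each variant a different, small sample. A paired design, where both variants run on every input, is often easier to interpret. The following is a minimal sketch under stated assumptions: `paired_ab_test`, `fake_generate`, and `word_count` are hypothetical names, and `fake_generate` is a stand-in for a real model call so the example runs offline.

```python
from statistics import mean
from typing import Callable, Sequence

PROMPT_A = "Summarise this review in one sentence: {review}"
PROMPT_B = "In one concise sentence, state the main sentiment and reason from this review: {review}"


def paired_ab_test(
    reviews: Sequence[str],
    generate: Callable[[str], str],
    score: Callable[[str], float],
) -> dict:
    """Run both variants on every input and average the score per variant."""
    if not reviews:
        raise ValueError("reviews must contain at least one item")
    scores = {"A": [], "B": []}
    for review in reviews:
        for variant, template in (("A", PROMPT_A), ("B", PROMPT_B)):
            output = generate(template.format(review=review))
            scores[variant].append(score(output))
    return {variant: mean(values) for variant, values in scores.items()}


def fake_generate(prompt: str) -> str:
    # Stand-in for a model call: echoes the text after the instruction.
    return prompt.split(": ", 1)[1]


def word_count(text: str) -> float:
    # Placeholder metric; replace with a rubric that reflects real quality.
    return float(len(text.split()))


reviews = ["Loved it. Fast delivery.", "Broke after two days, very disappointed."]
results = paired_ab_test(reviews, fake_generate, word_count)
print(results)
```

Passing `generate` and `score` as arguments keeps the comparison logic independent of any particular model client or metric, so the same harness can be reused as the rubric evolves.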
Prompt Iteration Workflow
1. Write prompt v1
2. Test on 10 diverse examples — record outputs
3. Identify failure patterns (too verbose, wrong format, hallucination)
4. Hypothesise fix — change ONE variable at a time
5. Re-test on the SAME 10 examples + new edge cases
6. Commit to version control with changelog note
7. Deploy behind a feature flag — monitor in production
Change one thing at a time when iterating. If you change the persona, tone, and format simultaneously, you cannot tell which change caused the improvement.
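Step 7 of the workflow can be approximated without a feature-flag service by bucketing users deterministically. This is a sketch only: `use_new_prompt`, `select_prompt`, the two template constants, and the 10% default are all illustrative assumptions.

```python
import hashlib

STABLE_PROMPT = "Summarise the following article in 3 bullet points.\nARTICLE:\n{article}"
CANDIDATE_PROMPT = (
    "Summarise the following article in 3 bullet points.\n"
    "Use a neutral, journalistic tone.\n"
    "ARTICLE:\n{article}"
)


def use_new_prompt(user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in the rollout group from a hash of their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


def select_prompt(user_id: str, rollout_percent: int = 10) -> str:
    """Return the candidate template for users in the rollout group, else the stable one."""
    return CANDIDATE_PROMPT if use_new_prompt(user_id, rollout_percent) else STABLE_PROMPT


print(select_prompt("user-123", rollout_percent=0) == STABLE_PROMPT)
print(select_prompt("user-123", rollout_percent=100) == CANDIDATE_PROMPT)
```

Because the bucket depends only on the user ID, each user sees the same variant on every request, and setting the percentage to 0 acts as an immediate rollback.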
Common Mistakes
- Hardcoding prompts inline — embedding a 500-token prompt as a string literal in your main application code makes it hard to find, review, and iterate on the prompt separately from application logic.
- No changelog — without notes, you can’t remember why you changed from v1.0 to v1.1 six months later.
- Testing on only one example — a prompt that works on one input often fails on edge cases. Always test on a diverse set.
- Changing multiple variables at once — you lose the ability to attribute improvements or regressions to a specific change.
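To avoid testing on a single example, a fixed set of inputs can be re-run against every prompt version with the same automated checks. The sketch below assumes the 3-bullet summary format used earlier in this section; the check functions, `run_regression`, and the `fake_generate` stand-in for a model call are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def has_three_bullets(output: str) -> bool:
    lines = [line for line in output.splitlines() if line.strip()]
    return len(lines) == 3 and all(line.lstrip().startswith("-") for line in lines)


def bullets_under_20_words(output: str) -> bool:
    # Strip the leading "- " marker before counting words on each non-empty line.
    return all(
        len(line.lstrip("- ").split()) < 20
        for line in output.splitlines()
        if line.strip()
    )


CHECKS = {"three_bullets": has_three_bullets, "short_bullets": bullets_under_20_words}


def run_regression(
    template: str, articles: Sequence[str], generate: Callable[[str], str]
) -> List[Tuple[int, str]]:
    """Return (article_index, failed_check_name) pairs; an empty list means all passed."""
    failures = []
    for index, article in enumerate(articles):
        output = generate(template.format(article=article))
        for name, check in CHECKS.items():
            if not check(output):
                failures.append((index, name))
    return failures


def fake_generate(prompt: str) -> str:
    # Stand-in for a model call so the sketch runs offline.
    return "- point one\n- point two\n- point three"


# Include edge cases such as an empty article and text containing braces.
articles = ["First article text.", "", "Article with {braces} in it."]
template = "Summarise the following article in 3 bullet points.\nARTICLE:\n{article}"
failures = run_regression(template, articles, fake_generate)
print(failures or "all checks passed")
```

Running the same article list against each new version makes regressions visible as a concrete list of failing cases.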
Quick Quiz
Q1. Why should prompts be treated like code?
A1. Prompts directly determine output quality. Like code, they must be versioned, reviewed, and tested to ensure reproducibility and enable rollback.
Q2. What does A/B testing a prompt mean?
A2. Running two prompt variants on the same inputs, measuring a quality metric for each, and choosing the better-performing variant.
Q3. Name one tool designed for collaborative LLM prompt management.
A3. LangSmith Prompt Hub (by LangChain) — see docs.smith.langchain.com.
Student Exercise
Exercise 2.5 — Version your first prompt
Take the summarisation prompt from Exercise 2.1. Create prompts.yaml with version 1.0. Run it on 5 test articles. Identify one failure. Fix it in version 1.1. Document the change in a changelog field. Re-test and compare.
Further Reading
- 📘 LangSmith Prompt Hub docs
- 📘 OpenAI Playground — save and share prompts
- 📄 Prompt Engineering Guide — promptingguide.ai