2.5 Prompt Versioning & Management

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Storing prompts as templates · A/B testing · Iteration workflow

Official Docs: LangSmith Prompt Hub · OpenAI Playground

Why Version Your Prompts?

Prompts are code. Changing a single word in a system prompt can shift output quality significantly. Without versioning you cannot:

Reproduce a specific output
Roll back a regression
A/B test two prompt variants
Audit what went to production
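
One way to support reproduction and auditing is to store a fingerprint of the exact template next to every output. The following is a minimal sketch, not a prescribed format: `prompt_fingerprint`, `log_record`, and the record fields are illustrative names, and the 12-character SHA-256 prefix is an arbitrary choice.

```python
import hashlib
import json
from datetime import datetime, timezone


def prompt_fingerprint(template: str) -> str:
    """Return a short, stable hash identifying the exact template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def log_record(template: str, version: str, output: str) -> str:
    """Build a JSON audit record (hypothetical format) linking an output to its prompt."""
    record = {
        "prompt_version": version,
        "prompt_sha": prompt_fingerprint(template),
        "output": output,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)


record = json.loads(log_record("Summarise: {article}", "1.0", "example output"))
print(record["prompt_version"], record["prompt_sha"])
```

Because the hash changes whenever a single character of the template changes, it catches edits that were made without bumping the version label.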

Pattern 1 — Prompts as Python Constants

The simplest approach: store prompts in a dedicated module with version comments.

# prompts.py

# v1.0 — initial release
SUMMARY_V1 = """
Summarise the following article in 3 bullet points.
Each bullet must be under 20 words.

ARTICLE:
{article}
"""

# v1.1 — added tone instruction after QA feedback
SUMMARY_V1_1 = """
Summarise the following article in 3 bullet points.
Each bullet must be under 20 words.
Use a neutral, journalistic tone.

ARTICLE:
{article}
"""

ACTIVE_SUMMARY_PROMPT = SUMMARY_V1_1   # single point of change

# usage.py
from prompts import ACTIVE_SUMMARY_PROMPT

prompt = ACTIVE_SUMMARY_PROMPT.format(article=article_text)
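
If older versions need to stay addressable (for rollback or comparison), the constants can also be collected in a lookup keyed by version. This is a sketch only: `PROMPT_VERSIONS`, `ACTIVE_VERSION`, and `get_prompt` are illustrative names that do not appear in the module above, and the templates are shortened.

```python
# Hypothetical registry: maps version labels to prompt templates.
PROMPT_VERSIONS = {
    "1.0": "Summarise the following article in 3 bullet points.\n\nARTICLE:\n{article}",
    "1.1": (
        "Summarise the following article in 3 bullet points.\n"
        "Use a neutral, journalistic tone.\n\nARTICLE:\n{article}"
    ),
}

ACTIVE_VERSION = "1.1"  # roll back by editing this one value


def get_prompt(version: str = ACTIVE_VERSION) -> str:
    """Return the template for a version, failing loudly on unknown labels."""
    try:
        return PROMPT_VERSIONS[version]
    except KeyError:
        raise ValueError(f"Unknown prompt version: {version!r}") from None


print(get_prompt().format(article="Example article text."))
```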

Pattern 2 — YAML / JSON Prompt Files

For larger teams, store prompts in version-controlled YAML files:

# prompts/summary.yaml
name: article-summary
version: "1.2"
created: "2025-09-01"
changelog: "Added word limit, tone instruction, escape hatch"
# Literal braces are doubled ({{ }}) because the template is rendered with str.format()
template: |
  Summarise the following article in 3 bullet points.
  Each bullet must be under 20 words.
  Use a neutral, journalistic tone.
  If the article is not in English, translate first then summarise.
  If the article is too short to summarise, respond with: {{"error": "too short"}}

  ARTICLE:
  {article}

import yaml

with open("prompts/summary.yaml") as f:
    config = yaml.safe_load(f)

prompt = config["template"].format(article=article_text)
print(f"Using prompt v{config['version']}")
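
The same idea works with JSON and the standard library. The sketch below also validates the file before use, so a missing field fails at load time instead of at render time. The required-key set and the `load_prompt` helper are assumptions for illustration, not a fixed schema.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"name", "version", "changelog", "template"}  # assumed schema


def load_prompt(path: Path) -> dict:
    """Load a JSON prompt file and check it has the fields we rely on."""
    config = json.loads(path.read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"{path.name} is missing keys: {sorted(missing)}")
    if "{article}" not in config["template"]:
        raise ValueError(f"{path.name} template has no {{article}} placeholder")
    return config


# Demo with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "summary.json"
    prompt_file.write_text(json.dumps({
        "name": "article-summary",
        "version": "1.2",
        "changelog": "Added word limit, tone instruction",
        "template": "Summarise the following article in 3 bullet points.\n\nARTICLE:\n{article}",
    }), encoding="utf-8")
    config = load_prompt(prompt_file)

print(f"Using prompt v{config['version']}")
print(config["template"].format(article="Example article text."))
```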

Pattern 3 — LangSmith Prompt Hub

For teams already using LangChain, LangSmith Prompt Hub offers collaborative prompt management with version history.

pip install langchain langsmith

from langchain import hub
from langchain_core.prompts import ChatPromptTemplate

# Pull the latest version of a community prompt from the hub.
# To pin a specific version, append its commit hash: "owner/name:<commit-hash>".
prompt = hub.pull("hwchase17/react")

# Push your own prompt (requires a LangSmith API key, e.g. in LANGSMITH_API_KEY)
my_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer concisely."),
    ("human", "{question}"),
])
hub.push("your-username/my-assistant", my_prompt)

A/B Testing Prompts

A/B testing compares two prompt variants against a quality metric:

import random
from statistics import mean

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_A = "Summarise this review in one sentence: {review}"
PROMPT_B = "In one concise sentence, state the main sentiment and reason from this review: {review}"

def ab_test(review: str, n: int = 20) -> dict:
    scores = {"A": [], "B": []}
    for _ in range(n):
        # Randomly assign each trial to one variant
        variant = random.choice(["A", "B"])
        template = PROMPT_A if variant == "A" else PROMPT_B
        output = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": template.format(review=review)}],
            temperature=0.3,
        ).choices[0].message.content or ""  # content can be None
        # Placeholder metric: word count measures length, not quality.
        # Replace it with your own rubric (e.g., keyword presence, human rating).
        scores[variant].append(len(output.split()))
    # Random assignment can leave a variant with no samples,
    # so report None instead of dividing by zero.
    return {
        "A_avg_words": mean(scores["A"]) if scores["A"] else None,
        "B_avg_words": mean(scores["B"]) if scores["B"] else None,
    }

print(ab_test("Absolutely loved the product! Fast delivery and great quality."))
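
For offline comparison it can help to run both variants over the same set of inputs, so neither variant gets easier examples by chance. The sketch below injects the model call as a function so it can be stubbed. `compare_variants`, `fake_model`, and the word-count score are illustrative assumptions; in practice `generate` would wrap a real API call.

```python
from statistics import mean
from typing import Callable


def compare_variants(
    templates: dict[str, str],
    reviews: list[str],
    generate: Callable[[str], str],
    score: Callable[[str], float],
) -> dict[str, float]:
    """Run every template on every review and return the mean score per variant."""
    results = {}
    for name, template in templates.items():
        outputs = [generate(template.format(review=r)) for r in reviews]
        results[name] = mean(score(o) for o in outputs)
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call: echoes the first five words of the prompt."""
    return " ".join(prompt.split()[:5])


templates = {
    "A": "Summarise this review in one sentence: {review}",
    "B": "In one concise sentence, state the main sentiment and reason from this review: {review}",
}
reviews = ["Loved it, fast delivery.", "Broke after two days.", "Okay for the price."]

result = compare_variants(templates, reviews, fake_model, score=lambda out: len(out.split()))
print(result)
```

Keeping the scoring function separate from the model call means the rubric can change without touching the comparison loop.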

Prompt Iteration Workflow

Write prompt v1
Test on 10 diverse examples — record outputs
Identify failure patterns (too verbose, wrong format, hallucination)
Hypothesise fix — change ONE variable at a time
Re-test on the SAME 10 examples + new edge cases
Commit to version control with changelog note
Deploy behind a feature flag — monitor in production
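
The final step, deploying behind a feature flag, can be sketched as a deterministic percentage rollout. This is one possible approach, not a specific feature-flag product: `rollout_bucket`, `choose_version`, and the version labels are hypothetical names.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, rollout_percent: int,
                   stable: str = "1.0", candidate: str = "1.1") -> str:
    """Serve the candidate prompt to roughly rollout_percent% of users."""
    return candidate if rollout_bucket(user_id) < rollout_percent else stable


for uid in ["user-1", "user-2", "user-3"]:
    print(uid, choose_version(uid, rollout_percent=10))
```

Hashing the user id keeps each user on the same prompt version across requests, which makes production monitoring easier to interpret than per-request random assignment.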

Golden Rule

Change one thing at a time when iterating. If you change the persona, tone, and format simultaneously, you cannot tell which change caused the improvement.
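
A quick way to confirm that a new version really changes only one thing is to diff it against its predecessor. The sketch below uses the standard-library `difflib` on shortened copies of the two summary templates; `changed_lines` is an illustrative helper name.

```python
import difflib

V1 = """Summarise the following article in 3 bullet points.
Each bullet must be under 20 words.
"""

V1_1 = """Summarise the following article in 3 bullet points.
Each bullet must be under 20 words.
Use a neutral, journalistic tone.
"""


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the added (+) and removed (-) lines between two templates."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


changes = changed_lines(V1, V1_1)
print(changes)
```

If the list contains more than one logical change, the version is a candidate for splitting into two separate iterations.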

Common Mistakes

Hardcoding prompts inline — embedding a 500-token prompt as a string literal in your main application code makes it impossible to iterate without code changes.
No changelog — without notes, you can’t remember why you changed from v1.0 to v1.1 six months later.
Testing on only one example — a prompt that works on one input often fails on edge cases. Always test on a diverse set.
Changing multiple variables at once — you lose the ability to attribute improvements or regressions to a specific change.
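
To avoid judging a prompt on a single example, the stated rules can be checked automatically across a whole set of outputs. The sketch below encodes the two rules from the summary prompt (3 bullets, each under 20 words). `check_summary` and the sample outputs are hypothetical; real outputs would come from running the prompt on your test articles.

```python
def check_summary(output: str, bullets: int = 3, max_words: int = 20) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    problems = []
    if len(lines) != bullets:
        problems.append(f"expected {bullets} bullets, got {len(lines)}")
    for i, line in enumerate(lines, start=1):
        words = line.lstrip("-*• ").split()
        if len(words) >= max_words:  # "under 20 words" means 19 or fewer
            problems.append(f"bullet {i} has {len(words)} words")
    return problems


# Hypothetical model outputs collected from a diverse test set.
outputs = {
    "short-news": "- Rates held steady.\n- Markets rose.\n- Analysts expect cuts.",
    "one-bullet": "- One point only, but it passes the length rule.",
}

for name, text in outputs.items():
    problems = check_summary(text)
    print(name, "PASS" if not problems else f"FAIL: {problems}")
```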

Quick Quiz

Test Your Understanding

Q1. Why should prompts be treated like code?
A1. Prompts directly determine output quality. Like code, they must be versioned, reviewed, and tested to ensure reproducibility and enable rollback.

Q2. What does A/B testing a prompt mean?
A2. Running two prompt variants on the same inputs, measuring a quality metric for each, and choosing the better-performing variant.

Q3. Name one tool designed for collaborative LLM prompt management.
A3. LangSmith Prompt Hub (by LangChain) — see docs.smith.langchain.com.

Student Exercise

Exercise 2.5 — Version your first prompt
Take the summarisation prompt from Exercise 2.1. Create prompts.yaml with version 1.0. Run it on 5 test articles. Identify one failure. Fix it in version 1.1. Document the change in a changelog field. Re-test and compare.
