1.3 How LLMs Generate Text

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Autoregressive generation · Temperature · Top-p · Sampling strategies

Official Docs: OpenAI — Text Generation · Hugging Face — Generation Strategies

Autoregressive Generation

LLMs generate one token at a time, appending each new token to the context before predicting the next one.

Prompt: "The capital of France is"

Step 1: model sees prompt               → predicts " Paris"
Step 2: appends " Paris", sees updated  → predicts "."
Step 3: appends ".", sees updated       → predicts an end-of-sequence token, so generation ends

At each step the model outputs a probability distribution over the entire vocabulary. A sampling strategy selects the next token from that distribution.
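The loop described above can be sketched in a few lines of Python. This is a toy illustration rather than how any particular library implements generation: `TOY_MODEL` and `toy_next_token_probs` are hypothetical stand-ins for a real model (a hard-coded lookup table keyed on the last token), and `<eos>` is an assumed end-of-sequence marker.

```python
import random

# Hypothetical stand-in for a language model: maps the last token of the
# context to a probability distribution over possible next tokens.
TOY_MODEL = {
    " is": {" Paris": 0.9, " Lyon": 0.1},
    " Paris": {".": 1.0},
    " Lyon": {".": 1.0},
    ".": {"<eos>": 1.0},
}

def toy_next_token_probs(context):
    """Return next-token probabilities given the tokens generated so far."""
    return TOY_MODEL[context[-1]]

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Autoregressive loop: sample a token, append it, repeat."""
    rng = random.Random(seed)
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(context)
        tokens = list(probs)
        weights = list(probs.values())
        next_token = rng.choices(tokens, weights=weights, k=1)[0]
        if next_token == "<eos>":  # assumed end-of-sequence marker
            break
        context.append(next_token)  # the new token becomes part of the context
    return context

prompt = ["The", " capital", " of", " France", " is"]
print("".join(generate(prompt)))
```

A real model conditions on the whole context rather than only the last token, but the control flow is the same: predict a distribution, pick a token, append it, and stop on an end-of-sequence token or a length limit.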

Temperature

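Temperature $T$ divides the model's raw scores (logits) $z_i$ before the softmax, which reshapes the probability distribution $p_i$ that the next token is sampled from: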

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Temperature	Effect	Use case
`0.0`	Greedy decoding: always picks the highest-probability token (near-deterministic in practice)	Code, JSON, facts
`0.3–0.5`	Focused, low variety	QA, summarisation
`0.7–1.0`	Balanced	General chat
`> 1.0`	High creativity, less coherent	Brainstorming
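To make the effect of the formula concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The function name `softmax_with_temperature` and the three logits are made up for illustration. The formula is undefined at $T = 0$ (division by zero), so the sketch rejects it; implementations typically treat that setting as "pick the argmax" instead.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities: p_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat T = 0 as argmax instead")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for three candidate tokens
for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the pattern in the table: a low temperature concentrates probability on the top token, while a high temperature flattens the distribution so that less likely tokens are sampled more often.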

Top-p (Nucleus Sampling)

Top-p ranks tokens from most to least probable, keeps only the smallest set whose cumulative probability is ≥ p, then re-normalises the kept probabilities and samples from them.

# top_p = 0.9 example
# Probs: {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, ...}
# Cumulative:   0.72           0.85          0.93  ← cut here
# Only sample from: [" Paris", " Lyon", " Rome"]

top_p = 1.0 — full vocabulary (default)
top_p = 0.9 — trim low-probability tail
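The filtering step can be sketched as follows. This is an illustrative implementation, not the code any specific library uses; `top_p_filter` is a hypothetical helper, and the two tail tokens (" Nice", " Oslo") are invented here so that the example distribution sums to 1.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability is >= top_p, then re-normalise so the kept values sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:  # nucleus is complete
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# First three values come from the example above; the tail is made up.
probs = {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, " Nice": 0.04, " Oslo": 0.03}
nucleus = top_p_filter(probs, top_p=0.9)
print(list(nucleus))  # → [' Paris', ' Lyon', ' Rome']
```

After filtering, " Paris" carries 0.72 / 0.93 of the probability mass, so re-normalisation slightly boosts every surviving token. With `top_p=1.0` the loop never stops early and the full vocabulary is kept.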

OpenAI API Example

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "Name three European capitals."}
    ],
    temperature=0.3,
    top_p=1.0,
    max_tokens=128,
)

print(response.choices[0].message.content)

Key Takeaways

Generation is one token at a time — no look-ahead
temperature=0 → greedy and near-deterministic; higher → more varied
Adjust either temperature or top-p, not both at once
Use temperature=0 for structured/factual tasks
