
1.3 How LLMs Generate Text

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Autoregressive generation · Temperature · Top-p · Sampling strategies

Official Docs: OpenAI — Text Generation · Hugging Face — Generation Strategies


Autoregressive Generation

LLMs generate one token at a time, appending each new token to the context before predicting the next one.

Prompt: "The capital of France is"

Step 1: model sees prompt → predicts " Paris"
Step 2: appends " Paris", sees updated context → predicts "."
Step 3: "." matches a stop condition → generation ends

At each step the model outputs a probability distribution over the entire vocabulary. A sampling strategy selects the next token from that distribution.
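The loop above can be sketched in a few lines. This is a toy illustration, not a real model: the hard-coded distributions and the `NEXT_TOKEN_PROBS` table are invented for the example, standing in for the probabilities a real LLM would compute at each step.

```python
import random

# Hypothetical next-token distributions keyed by the current context.
# A real model computes these with a forward pass over the whole vocabulary.
NEXT_TOKEN_PROBS = {
    "The capital of France is": {" Paris": 0.9, " Lyon": 0.1},
    "The capital of France is Paris": {".": 0.95, ",": 0.05},
}

def generate(prompt, stop_token=".", max_steps=10):
    context = prompt
    for _ in range(max_steps):
        dist = NEXT_TOKEN_PROBS.get(context)
        if dist is None:
            break  # no distribution for this context in our toy table
        tokens, weights = zip(*dist.items())
        token = random.choices(tokens, weights=weights)[0]  # sample one token
        context += token  # append it to the context before the next step
        if token == stop_token:
            break  # stop token ends generation
    return context

print(generate("The capital of France is"))
```

Note that the model never plans ahead: each step conditions only on the tokens generated so far.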


Temperature

Temperature reshapes the probability distribution before sampling:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

| Temperature | Effect | Use case |
|-------------|--------|----------|
| 0.0 | Deterministic (always picks the highest-probability token) | Code, JSON, facts |
| 0.3–0.5 | Focused, low variety | QA, summarisation |
| 0.7–1.0 | Balanced | General chat |
| > 1.0 | High creativity, less coherent | Brainstorming |
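The formula above is a softmax with the logits divided by T. A minimal sketch (the logit values are made up for illustration):

```python
import math

def apply_temperature(logits, T):
    """Divide logits by temperature T, then softmax into probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw model outputs
print(apply_temperature(logits, 1.0))  # unchanged softmax
print(apply_temperature(logits, 0.2))  # sharply peaked: top token dominates
print(apply_temperature(logits, 2.0))  # flattened: more variety when sampling
```

Low T exaggerates differences between logits (approaching greedy decoding as T → 0); high T compresses them toward a uniform distribution.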

Top-p (Nucleus Sampling)

Top-p keeps only the smallest set of tokens whose cumulative probability ≥ p, then re-normalises.

# top_p = 0.9 example
# Probs: {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, ...}
# Cumulative: 0.72 0.85 0.93 ← cut here
# Only sample from: [" Paris", " Lyon", " Rome"]
  • top_p = 1.0 — full vocabulary (default)
  • top_p = 0.9 — trim low-probability tail
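The filtering step in the example above can be sketched directly. This is an illustrative implementation of the idea, not the exact code any particular library uses:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p, then renormalise the kept probabilities to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # nucleus found: discard the remaining low-probability tail
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {" Paris": 0.72, " Lyon": 0.13, " Rome": 0.08, " Berlin": 0.05, " Oslo": 0.02}
print(top_p_filter(probs, p=0.9))  # keeps " Paris", " Lyon", " Rome"
```

With p = 0.9 the cumulative sum crosses the threshold at 0.93, so only the first three tokens survive, matching the worked example above.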

OpenAI API Example

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name three European capitals."},
    ],
    temperature=0.3,  # low temperature: focused, factual output
    top_p=1.0,        # leave top-p at its default while tuning temperature
    max_tokens=128,
)

print(response.choices[0].message.content)

Key Takeaways

  • Generation is one token at a time — the model never looks ahead
  • temperature=0 → greedy, effectively deterministic; higher values → more varied output
  • Adjust either temperature or top-p, not both at once
  • Use temperature=0 for structured or factual tasks (code, JSON, extraction)

Further Reading

Next → 1.4 Model Landscape