Design evaluation prompts

The quality of your evaluation depends on the quality of your prompts. Well-designed prompts test exactly what you intend—no more and no less.

This article explains how to design evaluation prompts that produce clear, actionable results.

Anatomy of an effective evaluation prompt

Effective evaluation prompts share four characteristics:

Single intent
Realistic phrasing
Grounded in data
Self-contained

Single intent

Each prompt should test one user goal or question. Multi-intent prompts make it difficult to identify the cause of a failure.

Multi-intent (avoid)	Single intent (preferred)
What's my PTO balance and can you order me a laptop?	What's my PTO balance?
Tell me about health benefits and also the 401k match.	What health insurance plans are available?

To test multiple capabilities together, use multi-turn conversation evaluations instead of combining intents into a single prompt.

Realistic phrasing

Prompts should reflect how users actually communicate, including informal language, incomplete sentences, and varying levels of detail.

Overly formal	Realistic
Please provide information regarding the annual paid time off allocation for employees in their first year of employment.	How many vacation days do new hires get?
I would like to initiate a request for procurement of computing equipment.	I need to order a laptop.

You can derive realistic prompts from:

Production query logs
User research sessions
Support tickets
Everyday workplace conversations

Grounded in data

When grounding data is available, write prompts that use its specific entities, values, and identifiers. Grounding makes evaluations measurable and verifiable.

Grounded prompts allow precise assertions such as "The response contains 15 days," instead of vague checks like "The response contains the correct number."

Without grounding data

Prompt: "What's the PTO policy for engineers?"

With grounding data

Grounding data:

Employee: Marcus Johnson
Department: Engineering
Tenure: 8 months
Manager: Lisa Park
Location: Austin office

Prompt: "I'm in the engineering team — how many vacation days do I get?"
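The difference between a precise assertion and a vague check can be sketched in a few lines of Python. This is a minimal illustration, not the format of any particular evaluation tool: the `GroundedCase` structure, the `contains_all` helper, and the expected value of "15 days" are assumptions made for the example. Take the real expected value from your own policy data.

```python
from dataclasses import dataclass, field


@dataclass
class GroundedCase:
    """A single-turn prompt paired with its grounding data (hypothetical structure)."""
    prompt: str
    grounding: dict[str, str]
    expected_substrings: list[str] = field(default_factory=list)


def contains_all(response: str, expected: list[str]) -> bool:
    """Return True when every expected substring appears in the response (case-insensitive)."""
    lowered = response.lower()
    return all(item.lower() in lowered for item in expected)


case = GroundedCase(
    prompt="I'm in the engineering team, how many vacation days do I get?",
    grounding={
        "Employee": "Marcus Johnson",
        "Department": "Engineering",
        "Tenure": "8 months",
        "Manager": "Lisa Park",
        "Location": "Austin office",
    },
    # Assumed value for illustration only; use the number from your policy data.
    expected_substrings=["15 days"],
)

# A grounded response satisfies the assertion; a generic one does not.
print(contains_all("With 8 months of tenure you get 15 days of PTO.", case.expected_substrings))
print(contains_all("You get the standard allocation.", case.expected_substrings))
```

Because the expected value is tied to the grounding data, a failure points at a concrete mismatch instead of a judgment call.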

Self-contained (single-turn)

For single-turn evaluations, each prompt must include all required context. The agent can't rely on prior conversation turns. The following table contrasts self-contained prompts with prompts that depend on prior context.

Self-contained	Depends on context (avoid)
What does the PPO health plan cover?	What about the other health plan?
What's the employee cost for the PPO health plan?	And how much does that cost?
Can you order a 16-inch MacBook Pro?	Can you order that instead?

For scenarios that span multiple turns, use multi-turn conversations.
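If you maintain a large single-turn prompt set, a rough lint can flag prompts that probably depend on earlier turns. The sketch below is a heuristic only: the cue patterns are assumptions derived from the examples in the table above, and they will miss some cases and misfire on others, so treat a flag as a reason to review the prompt, not as a verdict.

```python
import re

# Hypothetical cue patterns based on the examples in this section.
CONTEXT_CUES = [
    # Prompt opens as a continuation of an earlier turn.
    re.compile(r"^(and|or|but|also|what about|how about)\b", re.IGNORECASE),
    # Pronoun that points at something the prompt never names.
    re.compile(r"\b(that|this|it|those) (instead|one|cost|costs)\b", re.IGNORECASE),
    # Comparison with an unnamed alternative.
    re.compile(r"\bthe other\b", re.IGNORECASE),
]


def looks_context_dependent(prompt: str) -> bool:
    """Return True when the prompt matches any cue that suggests missing context."""
    text = prompt.strip()
    return any(pattern.search(text) for pattern in CONTEXT_CUES)


prompts = [
    "What does the PPO health plan cover?",
    "What about the other health plan?",
    "And how much does that cost?",
    "Can you order a 16-inch MacBook Pro?",
    "Can you order that instead?",
]

for prompt in prompts:
    label = "review" if looks_context_dependent(prompt) else "ok"
    print(f"{label:6} {prompt}")
```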

Prompt variations

Users don't all ask the same question in the same way. To test generalization, create three variations of each prompt: a canonical prompt, a natural language variant, and a robustness probe.

Canonical prompts

Canonical prompts are explicit, complete, and unambiguous, and they serve as the baseline. Canonical prompts:

Include all required parameters.
Use precise terminology.
Avoid ambiguity.
Represent an ideal query.

Example

"How many paid time off days do employees with less than two years of tenure receive annually according to the current PTO policy?"

Natural language variant

The natural language variant reflects everyday conversational phrasing. Natural language variants:

Use casual, conversational language.
Might include synonyms or informal terms.
Avoid technical identifiers.
Remain complete enough to answer.

Example

"Hey, how much vacation do I get as a new hire?"

The following table compares canonical prompts and natural language variants.

Technique	Canonical	Natural variant
Synonyms	"paid time off"	"vacation days," "time off," "PTO"
Informal phrasing	"How many days do I receive"	"how much do I get"
Implicit context	"employees with <2 years tenure"	"as a new hire"
Casual casing	Proper capitalization	lowercase, minimal punctuation

Robustness probe

The robustness probe evaluates how well the agent handles imperfect input. Robustness probes:

Include realistic typos.
Contain grammatical errors.
Use shorthand or abbreviations.
Test intent recognition under noise.

Example: "whats my vacaton days entitlement"

The following table shows examples of patterns to test.

Pattern	Example
Typos	"vacaton" instead of "vacation"
Missing punctuation	"whats" instead of "what's"
Missing words	"how many days get"
Abbreviations	"PTO bal?"
Run-on queries	"need laptop macbook pro 16 inch"
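Some of these patterns can be applied mechanically to a cleaner prompt to draft a probe. The following Python sketch combines lowercase casing, missing punctuation, and a single typo. The function names are hypothetical, and the output is only a starting point: review each generated probe to confirm it still looks like something a real user would type.

```python
import re


def strip_punctuation(text: str) -> str:
    """Remove apostrophes and sentence punctuation (what's -> whats)."""
    return re.sub(r"['?.!,]", "", text)


def drop_letter(text: str, word: str, index: int) -> str:
    """Introduce a typo by deleting one letter from the first occurrence of a word."""
    typo = word[:index] + word[index + 1:]
    return text.replace(word, typo, 1)


def make_probe(prompt: str, typo_word: str, typo_index: int) -> str:
    """Combine lowercase casing, missing punctuation, and one typo into a probe."""
    noisy = strip_punctuation(prompt.lower())
    return drop_letter(noisy, typo_word, typo_index)


# Deleting the letter at index 5 of "vacation" yields "vacaton".
print(make_probe("What's my vacation days entitlement?", "vacation", 5))
```

The typo position is passed in explicitly so that the probe is deterministic and the same input is tested on every run.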

Complete prompt variation examples

The following examples demonstrate all three prompt types for a single scenario.

Scenario: Equipment ordering

This scenario includes the following grounding data:

Employee: Katrin Pold
Department: Product Design
Start date: 2024-01-15
Equipment budget: $3,500
Approved items: MacBook Pro (14-inch or 16-inch), External monitor, Keyboard, Mouse

Prompt variations

Canonical

"I'm a new employee in the Product Design department starting on January 15, 2024. I need to order a 16-inch MacBook Pro laptop. Please submit this equipment request through the IT ordering system."

Natural language

"Hi, I just joined the product design team and need to get my laptop set up. Can I get a MacBook Pro? The 16 inch one preferably."

Robustness probe

"need to order macbook pro 16in for new job in product design"

Assertions (apply to all variations):

The response confirms the equipment order was initiated.
The agent invoked the OrderEquipment tool.
The tool call includes "MacBook Pro 16-inch" (or equivalent).
The response includes an order confirmation or reference number.
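One way to keep the three variations and their shared assertions together is to store them as a single test case and run the same checks against each variation. The sketch below assumes a harness that captures the agent's response text and tool calls. The `AgentResult` and `ToolCall` shapes, the `item` argument name, the `REQ-1042` reference format, and the `fake_agent` stub are all hypothetical; only the `OrderEquipment` tool name and the prompts come from the scenario above.

```python
import re
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, str]


@dataclass
class AgentResult:
    """What the harness captured for one prompt (hypothetical shape)."""
    response: str
    tool_calls: list[ToolCall]


VARIATIONS = {
    "canonical": (
        "I'm a new employee in the Product Design department starting on "
        "January 15, 2024. I need to order a 16-inch MacBook Pro laptop. "
        "Please submit this equipment request through the IT ordering system."
    ),
    "natural": (
        "Hi, I just joined the product design team and need to get my laptop "
        "set up. Can I get a MacBook Pro? The 16 inch one preferably."
    ),
    "robustness": "need to order macbook pro 16in for new job in product design",
}


def check(result: AgentResult) -> dict[str, bool]:
    """Evaluate the scenario's four assertions against one captured result."""
    orders = [call for call in result.tool_calls if call.name == "OrderEquipment"]
    items = " ".join(call.arguments.get("item", "") for call in orders).lower()
    return {
        # Crude keyword check; a real harness might use a model-based grader.
        "order_initiated": "order" in result.response.lower(),
        "tool_invoked": bool(orders),
        "item_is_16_inch_macbook_pro": "macbook pro" in items and "16" in items,
        # Assumes references look like REQ-1042; adjust to your system's format.
        "has_reference": re.search(r"\b[A-Z]{2,}-\d+\b", result.response) is not None,
    }


def fake_agent(prompt: str) -> AgentResult:
    """Stand-in for a real agent call so the sketch runs on its own."""
    return AgentResult(
        response="Your order for a MacBook Pro 16-inch was submitted. Reference: REQ-1042.",
        tool_calls=[ToolCall("OrderEquipment", {"item": "MacBook Pro 16-inch"})],
    )


for label, prompt in VARIATIONS.items():
    outcome = check(fake_agent(prompt))
    print(label, "PASS" if all(outcome.values()) else f"FAIL {outcome}")
```

Reporting each assertion by name means that when only the robustness probe fails, you can see whether the agent missed the intent (no tool call) or misread a detail (wrong item).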

Scenario: Policy question with personalization

This scenario includes the following grounding data:

Employee: James Wright
Location: London, UK office
Tenure: 6 months
Employment type: Full-time

Prompt variations

Canonical

"As a full-time employee based in the London, UK office with 6 months of tenure, what public holidays am I entitled to this year?"

Natural language

"I work in the London office — what bank holidays do I get off?"

Robustness probe

"UK office holidays off this yr?"

Assertions (apply to all variations):

The response lists UK bank holidays (not US holidays).
The response includes at least: New Year's Day, Good Friday, Easter Monday, and Christmas Day.
The response references UK policy or schedule.
The response doesn't mention US holidays such as July 4 or Thanksgiving.
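This scenario pairs positive assertions (what must appear) with negative ones (what must not appear). A minimal Python sketch of both kinds of check follows. The keyword lists are assumptions for illustration: "Fourth of July" and "Independence Day" are added as likely synonyms for July 4, and simple substring matching stands in for whatever grader your evaluation setup provides.

```python
# Holidays the response must mention (matched as case-insensitive substrings).
REQUIRED = ["New Year's Day", "Easter", "Christmas"]

# US-specific holidays that signal the agent ignored the user's location.
# The synonyms beyond "July 4" and "Thanksgiving" are assumed for illustration.
FORBIDDEN = ["July 4", "Fourth of July", "Independence Day", "Thanksgiving"]


def check_holiday_response(response: str) -> dict[str, list[str]]:
    """Return the required holidays that are missing and the forbidden ones present."""
    lowered = response.lower()
    return {
        "missing": [name for name in REQUIRED if name.lower() not in lowered],
        "forbidden": [name for name in FORBIDDEN if name.lower() in lowered],
    }


good = "UK bank holidays include New Year's Day, Good Friday, Easter Monday, and Christmas Day."
bad = "You get New Year's Day, Thanksgiving, and Christmas off."

print(check_holiday_response(good))
print(check_holiday_response(bad))
```

A response passes only when both lists come back empty, so a personalization failure shows up even if the response is otherwise fluent.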

Patterns to avoid

Avoid the following prompt patterns.

Multi-intent prompts

Avoid multi-intent prompts. When your prompt covers multiple intents, you can't determine which intent caused a failure.

Avoid: What's my PTO balance, and can you tell me about health insurance options, and I might need a laptop too?
Use instead: Split into separate prompts or use a multi-turn conversation evaluation.

Schema-aware prompts

Avoid schema-aware prompts. Schema-aware prompts don't work well because users don't know internal APIs or tool names.

Avoid: "Call the GetPTOBalance API for employee ID 12345"
Use instead: "What's my current vacation balance?"
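A simple lint can catch schema-aware prompts before they enter the test set. The sketch below flags prompts that mention a known tool name or an obvious schema term. The `schema_leaks` function and the hint pattern are hypothetical, and the tool names are the ones used as examples in this article; substitute your agent's actual tool list.

```python
import re

# Tool names taken from this article's examples; replace with your agent's tools.
TOOL_NAMES = {"GetPTOBalance", "OrderEquipment"}

# Assumed list of terms that an end user is unlikely to type.
SCHEMA_HINTS = re.compile(r"\b(API|endpoint|employee ID \d+)\b", re.IGNORECASE)


def schema_leaks(prompt: str, tool_names: set[str]) -> list[str]:
    """Return tool names and schema terms found in the prompt (empty list means clean)."""
    lowered = prompt.lower()
    found = sorted(name for name in tool_names if name.lower() in lowered)
    found += sorted({match.group(0) for match in SCHEMA_HINTS.finditer(prompt)})
    return found


print(schema_leaks("Call the GetPTOBalance API for employee ID 12345", TOOL_NAMES))
print(schema_leaks("What's my current vacation balance?", TOOL_NAMES))
```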

Vague prompts

Avoid vague prompts. If your prompt is vague, you can't define measurable assertions.

Avoid: "Help me with HR stuff"
Use instead: "How do I enroll in the dental insurance plan?"

Leading prompts

Avoid leading prompts. Prompts that hint at the expected answer don't test whether the agent can reach that answer on its own.

Avoid: "The PTO policy says 15 days, right?"
Use instead: "How many PTO days do new employees receive?"

Not self-contained (single-turn)

Avoid prompts that depend on prior context.

Avoid: "What about the other option?"
Use instead: "What's the difference between the HMO and PPO health plans?"

Generate prompts from user scenarios

Start with real user intent instead of feature lists.

Collect representative user questions.
Group by scenario (for example, policy lookup, actions, escalation).
Write a canonical prompt for each scenario.
Add natural language and robustness variants.
Ground prompts with concrete data.

This approach ensures evaluations reflect real-world usage.

AI-assisted prompt expansion (optional)

After you establish a strong baseline, you can use an AI model to expand coverage. Ask the model to suggest more variations, and review each suggestion for realism and relevance. Reject prompts that are unnatural, schema-aware, or out of scope. Add prompts only where they improve coverage.

Prompt coverage checklist

Use this checklist to ensure that your prompt coverage is complete.

Capability coverage

Every tool or action has at least one test case
Every knowledge domain is represented
Escalation behavior is tested
Out-of-scope scenarios are tested

Variation coverage

Canonical prompt
Natural language variant
Robustness probe

Edge cases

Very short prompts
Very long prompts
Ambiguous requests
Missing information
Invalid or unsupported requests

Personalization (if applicable)

Different user locations
Different tenure levels
Different roles or departments
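Parts of this checklist can be verified automatically if each prompt is tagged with its scenario, variation type, and the tool it is expected to exercise. The Python sketch below reports tools without a test case and scenarios that lack one of the three variations. The `PromptRecord` structure and the tag values are assumptions for illustration; the remaining checklist items (edge cases, personalization) could be tracked the same way with more tags.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

REQUIRED_VARIATIONS = {"canonical", "natural", "robustness"}


@dataclass(frozen=True)
class PromptRecord:
    """Tags for one prompt in the test set (hypothetical structure)."""
    scenario: str
    variation: str
    tool: Optional[str] = None  # None for knowledge-only prompts


def coverage_gaps(records: list[PromptRecord], tools: list[str]) -> dict[str, object]:
    """Report tools with no test case and scenarios missing a variation type."""
    covered_tools = {record.tool for record in records if record.tool}
    seen_by_scenario: dict[str, set[str]] = defaultdict(set)
    for record in records:
        seen_by_scenario[record.scenario].add(record.variation)
    return {
        "untested_tools": sorted(set(tools) - covered_tools),
        "missing_variations": {
            scenario: sorted(REQUIRED_VARIATIONS - seen)
            for scenario, seen in seen_by_scenario.items()
            if REQUIRED_VARIATIONS - seen
        },
    }


records = [
    PromptRecord("Equipment ordering", "canonical", "OrderEquipment"),
    PromptRecord("Equipment ordering", "natural", "OrderEquipment"),
    PromptRecord("Equipment ordering", "robustness", "OrderEquipment"),
    PromptRecord("Policy question", "canonical"),
]

print(coverage_gaps(records, tools=["OrderEquipment", "GetPTOBalance"]))
```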

Next step

Write assertions

Last updated on 2026-04-28
