Real users don't interact with agents through isolated, single-question exchanges. Instead, they engage in conversations that include follow-up questions, incremental information sharing, and multi-step task completion.
Multi-turn evaluation helps ensure that your agent can maintain context, complete tasks across multiple steps, and respond appropriately throughout a realistic conversation flow.
When to use multi-turn evaluation
Use multi-turn evaluation when testing scenarios that require context retention or multiple steps to complete a task.
| Scenario | Why multi-turn is needed |
|---|---|
| Slot filling | The agent must collect multiple pieces of information. |
| Clarification flows | The agent must resolve ambiguous user input. |
| Multi-step tasks | The task requires multiple exchanges to complete. |
| Context-dependent follow-ups | Later questions depend on earlier responses. |
| Progressive disclosure | Information is provided incrementally. |
Single-turn vs. multi-turn evaluation
Start with single-turn evaluations for coverage, then add multi-turn evaluations to test realistic interactions.
| Aspect | Single-turn | Multi-turn |
|---|---|---|
| Tests | Individual question-answer pairs | Complete conversations |
| Context | Each prompt is independent | Turns build on each other |
| Complexity | Easier to design and evaluate | More realistic but harder to design and evaluate |
| Failure diagnosis | Easier to isolate issues | Requires turn-by-turn analysis |
| Use case | Testing specific responses | Testing conversation flow |
Key metrics for multi-turn evaluation
Multi-turn evaluation requires extra metrics beyond single-turn accuracy. These metrics assess how well the agent performs across an entire conversation.
Conversation completeness
Conversation completeness measures whether the agent achieves the user’s goal by the end of the interaction.
Success example
Turn 1
- User: "I need to order a laptop"
- Agent: "I can help with that. What type of laptop do you need?"
Turn 2
- User: "MacBook Pro"
- Agent: "What screen size—14 inch or 16 inch?"
Turn 3
- User: "16 inch"
- Agent: "I've submitted your order for a 16-inch MacBook Pro. Order ID: ORD-456. Expected delivery: 5–7 business days."
✅ The user goal (ordering a laptop) is achieved.
Failure example
Turn 1
- User: "I need to order a laptop"
- Agent: "I can help with that. What type of laptop do you need?"
Turn 2
- User: "MacBook Pro"
- Agent: "MacBook Pro is a great choice! Is there anything else?"
❌ The order wasn't submitted and the user goal isn't achieved.
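One way to check completeness automatically is to look for the tool call that fulfills the user's goal. The following is a minimal sketch, not a definitive implementation. It assumes a hypothetical transcript format (one dictionary per turn with a `tool_calls` list) and a hypothetical `OrderEquipment` tool name.

```python
from typing import Any, Dict, List


def goal_achieved(turns: List[Dict[str, Any]], required_tool: str) -> bool:
    """Return True if any turn includes a call to the tool that fulfills the goal."""
    return any(required_tool in turn.get("tool_calls", []) for turn in turns)


# Hypothetical transcript format: one dict per turn, listing the tools
# the agent called while producing its reply.
success = [
    {"user": "I need to order a laptop", "tool_calls": []},
    {"user": "MacBook Pro", "tool_calls": []},
    {"user": "16 inch", "tool_calls": ["OrderEquipment"]},
]
failure = [
    {"user": "I need to order a laptop", "tool_calls": []},
    {"user": "MacBook Pro", "tool_calls": []},
]

print(goal_achieved(success, "OrderEquipment"))  # True
print(goal_achieved(failure, "OrderEquipment"))  # False
```

Checking the tool call rather than the reply text avoids passing a conversation in which the agent only claims to have submitted the order.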
Knowledge retention
Knowledge retention measures whether the agent remembers and uses information provided in earlier turns.
Success example
Turn 1
- User: "My employee ID is EMP-12345"
- Agent: "Thanks! How can I help you today?"
Turn 2
- User: "I want to check my PTO balance"
- Agent: "Your current PTO balance is 12 days remaining."
✅ The agent used the previously provided employee ID without asking again.
Failure example
Turn 1
- User: "My employee ID is EMP-12345"
- Agent: "Thanks! How can I help you today?"
Turn 2
- User: "I want to check my PTO balance"
- Agent: "I'd be happy to help. What's your employee ID?"
❌ The agent forgot the employee ID from the previous turn.
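A retention check can flag any agent reply that asks again for information the user already gave. The sketch below uses a deliberately naive heuristic, which is an assumption of this example: the agent is re-asking if it mentions "employee ID" in a sentence that ends with a question mark. The transcript format (pairs of user message and agent reply) is also hypothetical.

```python
import re
from typing import List, Tuple

EMPLOYEE_ID = re.compile(r"\bEMP-\d+\b")
# Naive heuristic (assumption): a question that mentions "employee ID".
REASK = re.compile(r"employee id[^.!?]*\?", re.IGNORECASE)


def reasks_known_id(turns: List[Tuple[str, str]]) -> bool:
    """Return True if the agent asks for the employee ID after the user gave it."""
    id_known = False
    for user_msg, agent_reply in turns:
        if EMPLOYEE_ID.search(user_msg):
            id_known = True
        if id_known and REASK.search(agent_reply):
            return True
    return False


retained = [
    ("My employee ID is EMP-12345", "Thanks! How can I help you today?"),
    ("I want to check my PTO balance", "Your current PTO balance is 12 days remaining."),
]
forgotten = [
    ("My employee ID is EMP-12345", "Thanks! How can I help you today?"),
    ("I want to check my PTO balance", "I'd be happy to help. What's your employee ID?"),
]

print(reasks_known_id(retained))   # False
print(reasks_known_id(forgotten))  # True
```

A pattern-based check like this is brittle. In practice, a model-based grader or a check on tool-call arguments might be more reliable.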
Role adherence
Role adherence measures whether the agent maintains its defined scope and responsibilities throughout the conversation.
Success example
Turn 1
- User: "What's the PTO policy?"
- Agent: Provides correct PTO policy information
Turn 2
- User: "Can you write a poem about vacation?"
- Agent: "I'm here to help with HR and onboarding questions. Is there anything else about PTO I can help with?"
✅ The agent stayed within its role and redirected appropriately.
Failure example
Turn 1
- User: "What's the PTO policy?"
- Agent: Provides correct PTO policy information
Turn 2
- User: "Can you write a poem about vacation?"
- Agent: "Sure! Here's a poem about sunny beaches..."
❌ The agent drifted from its role as an HR assistant.
Conversation relevancy
Conversation relevancy measures whether each response is directly relevant to the user’s request within the current conversational context.
Failure example
Turn 1
- User: "What health plans are available?"
- Agent: Lists HMO and PPO options
Turn 2
- User: "What's the difference in cost?"
- Agent: "The PPO offers more flexibility in choosing doctors."
❌ The response doesn't address the user’s question about cost.
Context consistency
Context consistency measures whether the agent maintains consistent facts and information across all turns in the conversation.
Failure example
Turn 1
- User: "How many vacation days do I get?"
- Agent: "As a new employee, you receive 15 days of PTO annually."
Turn 2
- User: "And how many sick days?"
- Agent: "You have 10 sick days, plus your 10 vacation days."
❌ The agent contradicts earlier information by changing the number of vacation days.
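For numeric facts, a consistency check can extract every value the agent states for the same quantity and confirm that only one distinct value appears. The sketch below does this for the vacation-day example. The two regular expressions are assumptions chosen to match the phrasing in that example, not a general-purpose extractor.

```python
import re
from typing import List, Set

# Patterns (assumptions for this example) that capture a stated vacation/PTO allowance.
PATTERNS = [
    re.compile(r"(\d+)\s+days of PTO", re.IGNORECASE),
    re.compile(r"(\d+)\s+vacation days", re.IGNORECASE),
]


def stated_vacation_days(agent_replies: List[str]) -> Set[int]:
    """Collect every distinct vacation-day count the agent mentions."""
    values: Set[int] = set()
    for reply in agent_replies:
        for pattern in PATTERNS:
            values.update(int(match) for match in pattern.findall(reply))
    return values


replies = [
    "As a new employee, you receive 15 days of PTO annually.",
    "You have 10 sick days, plus your 10 vacation days.",
]
values = stated_vacation_days(replies)
is_consistent = len(values) <= 1

print(sorted(values))  # [10, 15]
print(is_consistent)   # False
```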
Design multi-turn test cases
Designing effective multi-turn evaluations requires defining both the conversation flow and expected behaviors. A multi-turn test case typically includes:
- Test case ID and description
- Grounding data
- Conversation sequence
- Expected behaviors
- Assertions
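The components in this list map naturally onto a small data structure. The following sketch shows one possible shape, populated with the first two turns of the equipment order example that follows. The class and field names (`MultiTurnTestCase`, `TurnSpec`, and so on) are hypothetical and don't correspond to any particular evaluation tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TurnSpec:
    """One user message and the behavior expected from the agent."""
    user: str
    expected_behavior: str


@dataclass
class MultiTurnTestCase:
    """Hypothetical container for the parts of a multi-turn test case."""
    test_id: str
    description: str
    grounding: Dict[str, str]
    turns: List[TurnSpec]
    assertions: List[str] = field(default_factory=list)


case = MultiTurnTestCase(
    test_id="CONV-001",
    description="Equipment ordering flow",
    grounding={"employee_id": "EMP-54321", "department": "Product Design"},
    turns=[
        TurnSpec("I need to get my laptop set up.", "Ask about preferences."),
        TurnSpec("MacBook Pro", "Ask for specifications."),
    ],
    assertions=["The conversation completes with a confirmed equipment order."],
)

print(case.test_id, len(case.turns))  # CONV-001 2
```

Keeping expected behaviors as descriptions rather than exact replies leaves room for a grader to accept valid variations in wording.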
Example: Equipment order
- Test case ID: CONV-001
- Description: Equipment ordering flow
Grounding data
- Employee: Katrin Pold
- Employee ID: EMP-54321
- Department: Product Design
- Budget: $3,500
Conversation
Turn 1
- User: "I need to get my laptop set up."
- Expected behavior: Ask about preferences.
Turn 2
- User: "MacBook Pro"
- Expected behavior: Ask for specifications.
Turn 3
- User: "16 inch with 32GB RAM"
- Expected behavior: Confirm and submit order.
Turn 4
- User: "Actually, change it to 64GB."
- Expected behavior: Update the order.
Conversation-level assertions
- The conversation completes with a confirmed equipment order.
- The final order reflects the corrected specification (64GB).
- The agent never asks for the employee ID, because it's available in the grounding data.
- The conversation completes in six turns or fewer.
Turn-level assertions
- Turn 3:
  - Agent invokes OrderEquipment tool.
  - Tool call includes model: "MacBook Pro 16-inch".
  - Tool call includes specs that contain "32GB".
- Turn 4:
  - Agent invokes UpdateOrder or modifies previous order.
  - Final confirmation shows "64GB".
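The turn-level assertions for CONV-001 can be sketched as checks against a recorded trace of tool calls. The trace format (a dictionary keyed by turn number) and the argument names `model` and `specs` are assumptions of this example. The Turn 4 check is also simplified: it only looks for an `UpdateOrder` call whose specs contain "64GB" and doesn't cover an agent that modifies the order another way.

```python
from typing import Any, Dict, List

ToolCall = Dict[str, Any]


def turn_level_failures(trace: Dict[int, List[ToolCall]]) -> List[str]:
    """Check the Turn 3 and Turn 4 assertions; return a list of failure messages."""
    failures: List[str] = []

    orders = [c for c in trace.get(3, []) if c["tool"] == "OrderEquipment"]
    if not orders:
        failures.append("Turn 3: OrderEquipment was not invoked")
    else:
        args = orders[0]["args"]
        if args.get("model") != "MacBook Pro 16-inch":
            failures.append("Turn 3: unexpected model")
        if "32GB" not in args.get("specs", ""):
            failures.append("Turn 3: specs do not contain 32GB")

    updates = [c for c in trace.get(4, []) if c["tool"] == "UpdateOrder"]
    if not any("64GB" in c["args"].get("specs", "") for c in updates):
        failures.append("Turn 4: order was not updated to 64GB")
    return failures


# Hypothetical recorded traces, keyed by turn number.
good_trace = {
    3: [{"tool": "OrderEquipment",
         "args": {"model": "MacBook Pro 16-inch", "specs": "32GB RAM"}}],
    4: [{"tool": "UpdateOrder", "args": {"specs": "64GB RAM"}}],
}
bad_trace = {3: good_trace[3], 4: []}

print(turn_level_failures(good_trace))  # []
print(turn_level_failures(bad_trace))   # ['Turn 4: order was not updated to 64GB']
```

Returning failure messages instead of a single Boolean makes turn-by-turn diagnosis easier when a conversation fails.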
Example: Slot-filling conversation
Scenario: Benefits enrollment (slot filling)
Goal: Collect all required information and complete enrollment
Required inputs (slots):
- Plan selection (HMO or PPO)
- Coverage level (individual or family)
- Dependent information (if applicable)
- Effective date
Conversation
Turn 1
User: "I want to sign up for health insurance."
Agent:
- Acknowledges request
- Asks for plan preference (HMO vs. PPO)
Turn 2
User: "What's the difference?"
Agent:
- Explains key differences between HMO and PPO
- Re-asks for plan selection
Turn 3
User: "I'll go with the PPO."
Agent:
- Confirms selection
- Asks for coverage level (individual or family)
Turn 4
User: "Family coverage. I have a spouse and two kids."
Agent:
- Confirms dependent scenario
- Requests dependent details (names, ages, etc.)
Turn 5
User: [Provides dependent information]
Agent:
- Confirms all collected information
- Submits enrollment
- Provides confirmation summary
Assertions:
- All required slots are collected before submission.
- The agent doesn't skip required inputs.
- The agent explains options when prompted.
- The final confirmation includes all captured details.
- The task completes successfully without unnecessary turns.
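The first two assertions amount to a gate: enrollment should be submitted only when no required slot is missing. A minimal sketch of that gate follows. The slot names and the dictionary-based state are assumptions of this example, and the dependent names and effective date are placeholder values.

```python
from typing import Dict, List, Optional

REQUIRED_SLOTS = ["plan", "coverage_level", "effective_date"]


def missing_slots(slots: Dict[str, Optional[str]]) -> List[str]:
    """Return the required slots that have not been filled yet."""
    required = list(REQUIRED_SLOTS)
    # Dependent information is required only for family coverage.
    if slots.get("coverage_level") == "family":
        required.append("dependents")
    return [name for name in required if not slots.get(name)]


def ready_to_submit(slots: Dict[str, Optional[str]]) -> bool:
    return not missing_slots(slots)


# State after the user chooses the PPO plan and family coverage.
partial = {"plan": "PPO", "coverage_level": "family"}
# Placeholder values for the remaining slots.
complete = {**partial, "dependents": "spouse, two children", "effective_date": "2025-01-01"}

print(missing_slots(partial))     # ['effective_date', 'dependents']
print(ready_to_submit(complete))  # True
```

An evaluation can call `missing_slots` on the state captured at the moment the agent submits and fail the test if the result isn't empty.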
Example: Clarification conversation
Scenario: Ambiguous request (clarification)
Goal: Resolve ambiguity by gathering sufficient context before responding
Conversation
Turn 1
User: "I need to take some time off."
Agent:
- Acknowledges request
- Asks what type of time off (PTO, sick leave, personal leave)
Turn 2
User: "It's for a medical procedure."
Agent:
- Identifies likely category
- Asks whether the request is for the user or a family member
Turn 3
User: "It's for me. It's a minor surgery."
Agent:
- Determines this is likely sick leave
- Provides relevant policy information
- Asks about expected duration
Turn 4
User: "About a week."
Agent:
- Confirms eligibility and process
- Provides next steps (for example, how to request leave)
- Offers to initiate the request if applicable
Assertions:
- The agent doesn't assume intent from the initial ambiguous request.
- Clarifying questions are relevant and progressively narrow the scope.
- The final response reflects the user's actual situation.
- The agent transitions from clarification to resolution effectively.
Example: Error recovery conversation
Scenario: Order failure and recovery
Goal: Handle failures transparently and guide the user to a successful outcome
Conversation
Turn 1
User: "Order me a MacBook Pro 16-inch."
Agent:
- Attempts to submit the order
System/tool response:
- Error: Item out of stock
Agent:
- Communicates the issue clearly
- Does not obscure or ignore the failure
- Suggests alternatives (for example, different model or size)
Turn 2
User: "When will it be back in stock?"
Agent:
- Checks availability or provides restock estimate if available
- If unavailable, communicates that clearly
- Offers to notify the user or suggest alternatives
Turn 3
User: "Can I get the 14-inch instead?"
Agent:
- Confirms availability of the alternative
- Submits the new order
- Provides confirmation (order details, expected delivery)
Assertions:
- The agent clearly communicates failures and their causes.
- The agent offers actionable alternatives or next steps.
- The conversation progresses toward recovery, not dead ends.
- The user successfully completes the task after the failure.
- The agent maintains context across the failure and recovery steps.
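The assertion that the user completes the task after the failure can be checked against the sequence of tool results. The sketch below assumes a hypothetical result format with a `status` field set to `"ok"` or `"error"`. It passes only when a failed call is followed by a later successful one.

```python
from typing import Dict, List


def recovered_after_failure(tool_results: List[Dict[str, str]]) -> bool:
    """Return True if at least one call failed and a later call succeeded."""
    seen_failure = False
    for result in tool_results:
        if result["status"] == "error":
            seen_failure = True
        elif result["status"] == "ok" and seen_failure:
            return True
    return False


# Hypothetical tool results recorded during the conversation.
recovered = [
    {"item": "MacBook Pro 16-inch", "status": "error", "reason": "out_of_stock"},
    {"item": "MacBook Pro 14-inch", "status": "ok"},
]
dead_end = [
    {"item": "MacBook Pro 16-inch", "status": "error", "reason": "out_of_stock"},
]

print(recovered_after_failure(recovered))  # True
print(recovered_after_failure(dead_end))   # False
```

This check covers only the outcome. Whether the agent communicated the failure clearly still requires inspecting the reply text, for example with a model-based grader.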
Multi-turn assertions
Multi-turn evaluation requires both conversation-level and turn-level validation.
Conversation-level assertions
- The conversation achieves the user's goal.
- The number of turns is within an expected range.
- The agent maintains consistent information.
- The agent maintains its role.
- The agent doesn't request duplicate information.
Turn-level assertions
- The response acknowledges prior context.
- The response asks appropriate follow-up questions.
- The agent performs expected actions.
- The response confirms actions taken.
Conditional assertions
- When the user corrects information, the agent updates subsequent responses.
- When an action fails, the agent communicates the issue and provides alternatives.
- When a user asks an out-of-scope question, the agent redirects appropriately.
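A conditional assertion applies only when its trigger occurs, so it passes trivially for conversations in which the trigger never appears. The sketch below builds such an assertion for the out-of-scope case. The helper name `conditional` and the keyword-based trigger and check are assumptions of this example, and the check looks only at the reply in the same turn as the trigger.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (user_message, agent_reply)


def conditional(trigger: Callable[[str], bool],
                check: Callable[[str], bool]) -> Callable[[Transcript], bool]:
    """Build an assertion: whenever trigger(user) holds, check(agent_reply) must hold."""
    def run(transcript: Transcript) -> bool:
        return all(check(agent) for user, agent in transcript if trigger(user))
    return run


# Keyword heuristics (assumptions): a poem request is out of scope, and a
# redirect restates what the agent is there to help with.
redirects_out_of_scope = conditional(
    trigger=lambda user: "poem" in user.lower(),
    check=lambda agent: "here to help with hr" in agent.lower(),
)

stays_in_role = [
    ("What's the PTO policy?", "(PTO policy information)"),
    ("Can you write a poem about vacation?",
     "I'm here to help with HR and onboarding questions. "
     "Is there anything else about PTO I can help with?"),
]
drifts = [
    ("What's the PTO policy?", "(PTO policy information)"),
    ("Can you write a poem about vacation?", "Sure! Here's a poem about sunny beaches..."),
]

print(redirects_out_of_scope(stays_in_role))  # True
print(redirects_out_of_scope(drifts))         # False
```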
Example: End-to-end multi-turn evaluation
Test case: New hire setup
Description
A new employee orders equipment and asks a benefits question during the same conversation.
Grounding data
- Employee: Marcus Johnson
- Employee ID: EMP-99887
- Department: Engineering
- Location: Seattle
- Equipment budget: $4,000
Conversation
Turn 1
- User: "I'm starting next week and need to set up my workstation."
- Expected behavior: Welcome and ask about equipment needs.
Turn 2
- User: "I need a MacBook Pro 16 inch with 64GB RAM and a 27-inch monitor."
- Expected behavior: Submit order and confirm.
Turn 3
- User: "When is the deadline to sign up for health insurance?"
- Expected behavior: Provide benefits information without losing context.
Turn 4
- User: "Can you confirm what I ordered?"
- Expected behavior: Recall and summarize the order.
Expected outcomes
- Equipment order is completed.
- Benefits question is answered accurately.
- Context is retained across turns.
- Information isn't re-requested.
Common pitfalls
Be aware of these common issues when designing and evaluating multi-turn conversations:
- Evaluating turns in isolation instead of within full conversation context, which can hide failures related to memory, continuity, and task completion.
- Testing only ideal (happy path) scenarios, which fails to reflect real user behavior such as ambiguity, corrections, and interruptions.
- Ignoring differences between short and long conversations: short flows can end before the task is complete, and long flows can introduce context drift.
- Not including adversarial or unexpected user behavior, such as conflicting inputs, topic switching, or vague requests.
- Defining unclear or implicit user goals, making it difficult to determine whether the conversation succeeded.
- Over-specifying expected responses, which reduces flexibility and incorrectly penalizes valid variations.
- Failing to test error handling and recovery paths, leaving gaps in how the agent responds to failures or missing information.
Best practices
Use these practices to design effective and scalable multi-turn evaluations:
- Start with a small set of high-value scenarios that represent the most common or critical conversation flows.
- Define a clear user goal for each test case and use goal completion as the primary success metric.
- Test both direct (user provides all inputs upfront) and guided (agent collects inputs over multiple turns) paths to completion.
- Use conversation-level assertions first, then add turn-level assertions for deeper validation.
- Design scenarios that reflect realistic user behavior, including clarifications, corrections, and topic switching.
- Validate context retention explicitly by ensuring the agent uses previously provided information without re-asking.
- Include error handling and recovery scenarios, ensuring the agent communicates failures and provides next steps.
- Use flexible, behavior-based assertions instead of exact response matching.
- Ensure tests are reproducible while preserving realism by defining expected behaviors rather than exact phrasing.
- Continuously iterate and refine test cases based on observed failures and evolving usage patterns.
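The difference between exact response matching and behavior-based assertions can be sketched as follows. The behavior-based check requires only that the reply contain the elements that matter (here the model, the size, a submission confirmation, and an order ID). The function names and patterns are assumptions of this example.

```python
import re
from typing import List


def exact_match(reply: str, expected: str) -> bool:
    return reply == expected


def behavior_match(reply: str, required_patterns: List[str]) -> bool:
    """Return True if every required pattern appears somewhere in the reply."""
    return all(re.search(p, reply, re.IGNORECASE) for p in required_patterns)


expected = "I've submitted your order for a 16-inch MacBook Pro. Order ID: ORD-456."
# A valid variation in wording that still confirms the same action.
reply = "Your order for a 16-inch MacBook Pro has been submitted. Order ID: ORD-456."

patterns = [r"16-inch", r"MacBook Pro", r"submitted", r"ORD-\d+"]

print(exact_match(reply, expected))     # False
print(behavior_match(reply, patterns))  # True
```

The exact match rejects a correct reply because of phrasing alone, while the pattern-based check accepts it and still fails a reply that omits the confirmation.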