Derive quality signals for Copilot agent evaluation

Quality signals provide a structured way to understand why agent responses succeed or fail. They help teams group evaluation outcomes into meaningful categories, prioritize improvements, and track progress over time.

This article explains how to derive quality signals from evaluation results to diagnose issues, identify patterns, and improve Copilot agent performance.

By using quality signals, teams can:

  • Identify recurring failure patterns
  • Prioritize improvements based on impact
  • Track performance across iterations
  • Communicate results clearly to stakeholders

What are quality signals?

Quality signals are categories that represent patterns in evaluation results. They come from observed behavior rather than predefined rules.

Assertions and quality signals work together in an evaluation workflow:

  • Assertions determine whether a response passes or fails.
  • Quality signals group assertion outcomes into higher-level patterns.

| Aspect | Assertions | Quality signals |
| --- | --- | --- |
| Level | Specific and concrete | Abstract and categorical |
| Purpose | Determine pass or fail | Diagnose patterns |
| Quantity | Many per test case | Few per agent |
| Origin | Defined before testing | Derived from results |
| Example | Contains "15 days" | Policy accuracy |

After you define assertions, derive quality signals from the assertion outcomes and use those signals to track performance across scenarios.

Common quality signals

Use the following common quality signals when you evaluate Copilot agents:

  • Policy accuracy – Measures whether responses align with authoritative knowledge sources
  • Source attribution – Measures whether responses clearly identify information sources
  • Personalization – Measures whether responses use relevant user context
  • Tool accuracy – Measures whether tool calls are executed correctly
  • Tool response handling – Measures whether the agent correctly interprets tool output
  • Escalation appropriateness – Measures whether requests are routed to human support when needed
  • Privacy protection – Measures whether sensitive information is safeguarded
  • Action enablement – Measures whether responses provide clear next steps

Signal evaluation and common causes

The following table lists pass and fail indicators and common causes for each quality signal.

| Quality signal | Pass indicators | Fail indicators | Common causes |
| --- | --- | --- | --- |
| Policy accuracy | Correct values and dates; accurate policy details; consistent with current documentation | Outdated or incorrect values; conflicting or fabricated details | Outdated or duplicate documents; incorrect retrieval results; model hallucinations |
| Source attribution | References to specific documents or sections; clear attribution statements | No source provided; vague or generic references | Missing source metadata; instructions don't emphasize attribution |
| Personalization | Region-specific or role-specific responses; context-aware recommendations | Generic responses that ignore user context; incorrect regional or role-based information | User context unavailable to the agent; knowledge sources not segmented by audience |
| Tool accuracy | Correct tool selection; valid parameters and identifiers; all required fields populated | Missing or incorrect parameters; invalid tool inputs | Ambiguous API specifications; incorrect parameter mapping |
| Tool response handling | Accurate communication of tool results; correct handling of success and error states | Incorrect success claims; ignored or misinterpreted tool errors | Missing error-handling guidance; misinterpretation of tool responses |
| Escalation appropriateness | Sensitive or complex issues routed correctly; compliance with escalation rules | Agent attempts to handle unsupported scenarios; failure to escalate high-risk requests | Undefined escalation criteria; overly permissive instructions |
| Privacy protection | Refusal to disclose restricted data; responses limited to authorized information | Disclosure or inference of sensitive data; responses that expose protected information | Weak access controls; insufficient privacy guidance |
| Action enablement | Specific instructions; links, identifiers, or contact details | Vague or incomplete guidance; missing actionable steps | Missing procedural information in knowledge sources; over-summarized responses |

How to derive quality signals

Quality signals are derived from patterns in evaluation results rather than predefined checklists. To derive quality signals:

  1. Run an initial set of evaluation test cases.
  2. Review failed responses across test cases.
  3. Identify recurring patterns in failures.
  4. Define each pattern as a quality signal.
  5. Tag related assertions with the corresponding signal.
  6. Track pass rates by signal.
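The derivation steps above can be sketched in code. This is a minimal illustration, not a specific Copilot API: it assumes failed assertion results are available as simple records, and the field names, pattern labels, and recurrence threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical failed-assertion records. The "pattern" label is assigned
# by a reviewer while examining failures (the review step above).
failures = [
    {"test_case": "PTO-001", "assertion": 'Contains "15 days"', "pattern": "wrong policy value"},
    {"test_case": "PTO-002", "assertion": "Cites Employee Handbook", "pattern": "no source cited"},
    {"test_case": "PTO-003", "assertion": 'Contains "15 days"', "pattern": "wrong policy value"},
    {"test_case": "REG-001", "assertion": "Mentions UK holidays", "pattern": "context ignored"},
]

# Count how often each failure pattern recurs across test cases.
pattern_counts = Counter(f["pattern"] for f in failures)

# Promote any pattern seen more than once to a candidate quality signal.
signals = {p: count for p, count in pattern_counts.items() if count > 1}
print(signals)  # {'wrong policy value': 2}
```

In practice the pattern labels come from human review, so this step is a triage aid rather than an automated classifier.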

Quality signals in practice

The following example shows quality signals defined for an employee onboarding agent.

| Observation | Pattern identified | Quality signal |
| --- | --- | --- |
| Correct PTO values returned | Accurate knowledge retrieval | Policy accuracy |
| Source cited in response | Attribution included | Source attribution |
| Incorrect regional information returned | Context not used | Personalization |
| Tool invoked with incorrect parameters | Execution error | Tool accuracy |
| Request routed to HR appropriately | Correct escalation | Escalation appropriateness |
| Sensitive data nearly exposed | Privacy boundary risk | Privacy protection |
| Response included next steps | Actionable response | Action enablement |

The following are specific measures for quality signals.

| Policy accuracy | Source attribution | Tool accuracy |
| --- | --- | --- |
| Contains correct PTO duration | Cites authoritative documents | Invokes correct tool |
| Includes correct enrollment deadline | References specific sections | Uses valid parameters |
| Does not reference outdated policy | | Returns correct outcome |
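Measures like these can be expressed as simple assertion checks. The following is a minimal sketch, assuming agent responses arrive as plain strings; the checked values ("15 days", "Employee Handbook") are taken from the PTO test case used elsewhere in this article, and the function names are hypothetical.

```python
def check_policy_accuracy(response: str) -> bool:
    # Pass if the response contains the correct PTO duration.
    return "15 days" in response

def check_source_attribution(response: str) -> bool:
    # Pass if the response cites the authoritative document.
    return "Employee Handbook" in response

response = "New employees get 15 days of PTO, per the Employee Handbook."
print(check_policy_accuracy(response))     # True
print(check_source_attribution(response))  # True
```

Substring checks are deliberately crude; real evaluations often combine them with semantic or LLM-based grading, but the principle of one observable check per measure is the same.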

Apply and communicate quality signals

Use quality signals to drive evaluation workflows and communicate insights. To apply quality signals:

  • Tag assertions – Add signal tags to each assertion in your test cases.

    Test Case: PTO-001
    Prompt: "How many vacation days do new employees get?"

    Assertions:

    • The response contains "15 days".
      Signal: Policy Accuracy

    • The response cites the Employee Handbook.
      Signal: Source Attribution

    • The response mentions the <2 year tenure bracket.
      Signal: Personalization

  • Calculate metrics – Aggregate pass and fail results by signal.

    | Quality signal | Test cases | Pass | Fail | Pass rate |
    | --- | --- | --- | --- | --- |
    | Policy Accuracy | 25 | 23 | 2 | 92% |
    | Source Attribution | 25 | 20 | 5 | 80% |
    | Personalization | 15 | 11 | 4 | 73% |
    | Tool Accuracy | 12 | 10 | 2 | 83% |
    | Escalation Appropriateness | 8 | 8 | 0 | 100% |
    | Privacy Protection | 10 | 10 | 0 | 100% |
  • Prioritize issues – Focus on signals with low pass rates or high impact.

    1. Personalization (73%) - Biggest gap, investigate first.
    2. Source attribution (80%) - Second priority.
    3. Tool accuracy (83%) - Third priority.
    4. Policy accuracy (92%) - Minor issues, monitor.
  • Track progress – Monitor signal pass rates across agent versions.

    • Version 1.0 → 1.1 → 1.2 → 1.3
    • Personalization: 73% → 78% → 85% → 91% (improving)
    • Source attribution: 80% → 82% → 88% → 90% (improving)
    • Tool accuracy: 83% → 85% → 84% → 92% (recovered after a v1.2 regression)
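The tagging and aggregation steps above can be sketched as follows. This is a minimal illustration rather than a specific Copilot API: the record shape is hypothetical, and the output mirrors the pass-rate metric described above.

```python
from collections import defaultdict

# Each evaluated assertion carries its signal tag and a pass/fail outcome.
results = [
    {"signal": "Policy Accuracy", "passed": True},
    {"signal": "Policy Accuracy", "passed": True},
    {"signal": "Policy Accuracy", "passed": False},
    {"signal": "Source Attribution", "passed": True},
    {"signal": "Source Attribution", "passed": False},
]

# Aggregate pass and fail counts per signal.
totals = defaultdict(lambda: {"pass": 0, "fail": 0})
for r in results:
    totals[r["signal"]]["pass" if r["passed"] else "fail"] += 1

# Compute pass rates for prioritization.
for signal, counts in totals.items():
    total = counts["pass"] + counts["fail"]
    rate = 100 * counts["pass"] / total
    print(f"{signal}: {counts['pass']}/{total} ({rate:.0f}%)")
```

Sorting the resulting rates ascending gives the prioritization order directly: the lowest pass rate is the first signal to investigate.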

Quality signals transform stakeholder conversations. Their specificity enables targeted fixes, quantitative progress tracking, and clearer communication.

Without signals: The agent isn't performing well. Users are complaining.

With signals: Policy Accuracy is at 92% — we're hitting our target. But Personalization dropped to 73% after the last update. Specifically, UK employees are getting US holiday information. We identified the root cause: the context retrieval isn't passing location data. Fix is in progress for next release.

Quality signals by agent type

Quality signals and priorities vary based on the type of agent you're evaluating.

| Agent type | Signal | Priority |
| --- | --- | --- |
| Knowledge-grounded | Policy accuracy | High |
| | Source attribution | High |
| | Completeness | Medium |
| | Personalization | Medium |
| Tool-calling | Tool accuracy | High |
| | Tool response handling | High |
| | Action enablement | High |
| | Error recovery | Medium |
| Hybrid | Routing accuracy | High |
| | Knowledge signals | Medium |
| | Tool signals | Medium |
| | Escalation appropriateness | Medium |
| Customer-facing | Privacy protection | High |
| | Tone and professionalism | High |
| | Escalation appropriateness | High |
| | Resolution completeness | Medium |
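A priority mapping like the one above could be kept as configuration data so an evaluation harness knows which signals to compute first for a given agent type. A minimal sketch, with a hypothetical structure covering two of the agent types:

```python
# Hypothetical mapping of agent type to signals and priorities,
# mirroring the per-agent-type priorities described above.
SIGNAL_PRIORITIES = {
    "knowledge-grounded": {
        "policy accuracy": "high",
        "source attribution": "high",
        "completeness": "medium",
        "personalization": "medium",
    },
    "tool-calling": {
        "tool accuracy": "high",
        "tool response handling": "high",
        "action enablement": "high",
        "error recovery": "medium",
    },
}

def high_priority_signals(agent_type: str) -> list[str]:
    # Return the signals to focus on first for this agent type.
    signals = SIGNAL_PRIORITIES.get(agent_type, {})
    return [s for s, p in signals.items() if p == "high"]

print(high_priority_signals("tool-calling"))
# ['tool accuracy', 'tool response handling', 'action enablement']
```

Keeping the mapping in data rather than code makes it easy to adjust priorities as the agent's scope changes between versions.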

Avoid common pitfalls

Avoid the following issues to ensure that your quality signals remain useful, consistent, and actionable.

Use specific signals instead of generic categories

Signals that are too broad, such as "Accuracy," "Helpfulness," or "Relevance," don't provide actionable insight. Generic signals make it difficult to identify root causes or prioritize improvements.

Instead, define signals based on specific, observable patterns in evaluation results.

  • Avoid: Accuracy
  • Prefer: Policy accuracy, Source attribution

Avoid overly granular signals

Creating too many narrowly scoped signals increases complexity without improving insight. Excessive granularity fragments analysis and makes it harder to track meaningful trends.

Instead, group related behaviors into broader, reusable signal categories.

  • Avoid: PTO accuracy, Benefits accuracy, Holiday accuracy
  • Prefer: Policy accuracy

Avoid vague pass and fail criteria

Vague signal definitions, such as "Correctness," lack measurable standards. Without clear criteria, results are inconsistent and difficult to interpret.

Instead, define signals using explicit, observable behaviors tied to evaluation outcomes.

  • Avoid: "Response is correct"
  • Prefer: "Response includes correct value and cites authoritative source"

Next step