Quality signals provide a structured way to understand why agent responses succeed or fail. They help teams group evaluation outcomes into meaningful categories, prioritize improvements, and track progress over time.
This article explains how to derive quality signals from evaluation results to diagnose issues, identify patterns, and improve Copilot agent performance.
By using quality signals, teams can:
- Identify recurring failure patterns
- Prioritize improvements based on impact
- Track performance across iterations
- Communicate results clearly to stakeholders
## What are quality signals?
Quality signals are categories that represent patterns in evaluation results. They come from observed behavior rather than predefined rules.
Assertions and quality signals work together in an evaluation workflow:
- Assertions determine whether a response passes or fails.
- Quality signals group assertion outcomes into higher-level patterns.
| Aspect | Assertions | Quality signals |
|---|---|---|
| Level | Specific and concrete | Abstract and categorical |
| Purpose | Determine pass or fail | Diagnose patterns |
| Quantity | Many per test case | Few per agent |
| Origin | Defined before testing | Derived from results |
| Example | Contains "15 days" | Policy accuracy |
After you define assertions, derive quality signals from the assertion outcomes and use those signals to track performance across scenarios.
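For example, a test case might carry a signal tag directly on each assertion. The following is a minimal sketch in Python; the field names are illustrative, not part of any Copilot API.

```python
# A minimal sketch of a test case whose assertions are tagged with
# quality signals. Field names are illustrative, not a Copilot API.
test_case = {
    "id": "PTO-001",
    "prompt": "How many vacation days do new employees get?",
    "assertions": [
        {"check": 'contains "15 days"', "signal": "Policy accuracy"},
        {"check": "cites the Employee Handbook", "signal": "Source attribution"},
    ],
}

# Each assertion passes or fails on its own; the signal tag is what
# lets you roll those outcomes up into higher-level patterns later.
for assertion in test_case["assertions"]:
    print(f"{assertion['signal']}: {assertion['check']}")
```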
## Common quality signals
Use the following common quality signals when you evaluate Copilot agents:
- Policy accuracy – Measures whether responses align with authoritative knowledge sources
- Source attribution – Measures whether responses clearly identify information sources
- Personalization – Measures whether responses use relevant user context
- Tool accuracy – Measures whether tool calls are executed correctly
- Tool response handling – Measures whether the agent correctly interprets tool output
- Escalation appropriateness – Measures whether requests are routed to human support when needed
- Privacy protection – Measures whether sensitive information is safeguarded
- Action enablement – Measures whether responses provide clear next steps
## Signal evaluation and common causes
The following table lists pass and fail indicators, along with common causes of failure, for each quality signal.
| Quality signal | Pass indicators | Fail indicators | Common causes |
|---|---|---|---|
| Policy accuracy | Correct values and dates<br>Accurate policy details<br>Consistent with current documentation | Outdated or incorrect values<br>Conflicting or fabricated details | Outdated or duplicate documents<br>Incorrect retrieval results<br>Model hallucinations |
| Source attribution | References to specific documents or sections<br>Clear attribution statements | No source provided<br>Vague or generic references | Missing source metadata<br>Instructions don't emphasize attribution |
| Personalization | Region-specific or role-specific responses<br>Context-aware recommendations | Generic responses that ignore user context<br>Incorrect regional or role-based information | User context unavailable to the agent<br>Knowledge sources not segmented by audience |
| Tool accuracy | Correct tool selection<br>Valid parameters and identifiers<br>All required fields populated | Missing or incorrect parameters<br>Invalid tool inputs | Ambiguous API specifications<br>Incorrect parameter mapping |
| Tool response handling | Accurate communication of tool results<br>Correct handling of success and error states | Incorrect success claims<br>Ignored or misinterpreted tool errors | Missing error-handling guidance<br>Misinterpretation of tool responses |
| Escalation appropriateness | Sensitive or complex issues are routed correctly<br>Compliance with escalation rules | Agent attempts to handle unsupported scenarios<br>Failure to escalate high-risk requests | Undefined escalation criteria<br>Overly permissive instructions |
| Privacy protection | Refusal to disclose restricted data<br>Responses limited to authorized information | Disclosure or inference of sensitive data<br>Responses that expose protected information | Weak access controls<br>Insufficient privacy guidance |
| Action enablement | Specific instructions<br>Links, identifiers, or contact details | Vague or incomplete guidance<br>Missing actionable steps | Missing procedural information in knowledge sources<br>Over-summarized responses |
## How to derive quality signals
Quality signals are derived from patterns in evaluation results rather than predefined checklists. To derive quality signals:
1. Run an initial set of evaluation test cases.
2. Review failed responses across test cases.
3. Identify recurring patterns in failures.
4. Define each pattern as a quality signal.
5. Tag related assertions with the corresponding signal.
6. Track pass rates by signal.
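The following Python sketch illustrates steps 2 through 5 under a simplifying assumption: each failed result has already been annotated with a short pattern label during review. The data and threshold are illustrative.

```python
from collections import Counter

# Hypothetical review output: each failed assertion gets a short
# pattern label during review (step 3). Data here is illustrative.
failed_results = [
    {"test_case": "PTO-001", "pattern": "outdated policy value"},
    {"test_case": "PTO-004", "pattern": "outdated policy value"},
    {"test_case": "BEN-002", "pattern": "no source cited"},
    {"test_case": "HOL-003", "pattern": "outdated policy value"},
]

# Count how often each pattern recurs; frequent patterns become
# candidate quality signals (steps 4 and 5).
pattern_counts = Counter(r["pattern"] for r in failed_results)
for pattern, count in pattern_counts.most_common():
    if count >= 2:  # arbitrary threshold; tune to your test volume
        print(f"Candidate signal: {pattern} ({count} failures)")
```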
## Quality signals in practice
The following example shows quality signals defined for an employee onboarding agent.
| Observation | Pattern identified | Quality signal |
|---|---|---|
| Correct PTO values returned | Accurate knowledge retrieval | Policy accuracy |
| Source cited in response | Attribution included | Source attribution |
| Incorrect regional information returned | Context not used | Personalization |
| Tool invoked with incorrect parameters | Execution error | Tool accuracy |
| Request routed to HR appropriately | Correct escalation | Escalation appropriateness |
| Sensitive data nearly exposed | Privacy boundary risk | Privacy protection |
| Response included next steps | Actionable response | Action enablement |
The following table lists example assertions, or specific measures, for three of the quality signals.
| Policy accuracy | Source attribution | Tool accuracy |
|---|---|---|
| Contains correct PTO duration | Cites authoritative documents | Invokes correct tool |
| Includes correct enrollment deadline | References specific sections | Uses valid parameters |
| Does not reference outdated policy | | Returns correct outcome |
## Apply and communicate quality signals
Use quality signals to drive evaluation workflows and communicate insights. To apply quality signals:
1. Tag assertions – Add signal tags to each assertion in your test cases.

   - Test case: PTO-001
   - Prompt: "How many vacation days do new employees get?"
   - Assertions:
     - The response contains "15 days". Signal: Policy Accuracy
     - The response cites the Employee Handbook. Signal: Source Attribution
     - The response mentions the <2 year tenure bracket. Signal: Personalization
2. Calculate metrics – Aggregate pass and fail results by signal.

   | Quality signal | Test cases | Pass | Fail | Pass rate |
   |---|---|---|---|---|
   | Policy Accuracy | 25 | 23 | 2 | 92% |
   | Source Attribution | 25 | 20 | 5 | 80% |
   | Personalization | 15 | 11 | 4 | 73% |
   | Tool Accuracy | 12 | 10 | 2 | 83% |
   | Escalation Appropriateness | 8 | 8 | 0 | 100% |
   | Privacy Protection | 10 | 10 | 0 | 100% |

3. Prioritize issues – Focus on signals with low pass rates or high impact.
   - Personalization (73%) – biggest gap; investigate first.
   - Source attribution (80%) – second priority.
   - Tool accuracy (83%) – third priority.
   - Policy accuracy (92%) – minor issues; monitor.
4. Track progress – Monitor signal pass rates across agent versions.

   - Version 1.0 → 1.1 → 1.2 → 1.3
   - Personalization: 73% → 78% → 85% → 91% (improving)
   - Source attribution: 80% → 82% → 88% → 90% (improving)
   - Tool accuracy: 83% → 85% → 84% → 92% (recovered after a regression in v1.2)
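Putting steps 1, 2, and 4 together, per-signal pass rates can be computed from tagged assertion outcomes. The following sketch uses hypothetical in-memory results; a real harness would read them from your evaluation run.

```python
from collections import defaultdict

# Hypothetical assertion outcomes from one evaluation run:
# (signal tag, passed?) for each assertion checked.
results = [
    ("Policy accuracy", True), ("Policy accuracy", True),
    ("Policy accuracy", False),
    ("Source attribution", True), ("Source attribution", False),
]

# Aggregate pass/fail counts by signal, then report pass rates.
totals = defaultdict(lambda: {"pass": 0, "fail": 0})
for signal, passed in results:
    totals[signal]["pass" if passed else "fail"] += 1

for signal, counts in sorted(totals.items()):
    total = counts["pass"] + counts["fail"]
    rate = 100 * counts["pass"] / total
    print(f"{signal}: {counts['pass']}/{total} passed ({rate:.0f}%)")
```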
Quality signals transform stakeholder conversations by replacing vague impressions with specific, measurable findings:

- Without signals: "The agent isn't performing well. Users are complaining."
- With signals: "Policy accuracy is at 92%, which meets our target, but personalization dropped to 73% after the last update. Specifically, UK employees are getting US holiday information. The root cause is that context retrieval isn't passing location data. A fix is in progress for the next release."

This specificity enables targeted fixes, quantitative progress tracking, and clearer communication with stakeholders.
## Quality signals by agent type
Quality signals and priorities vary based on the type of agent you're evaluating.
| Agent type | Signal | Priority |
|---|---|---|
| Knowledge-grounded | Policy accuracy | High |
| | Source attribution | High |
| | Completeness | Medium |
| | Personalization | Medium |
| Tool-calling | Tool accuracy | High |
| | Tool response handling | High |
| | Action enablement | High |
| | Error recovery | Medium |
| Hybrid | Routing accuracy | High |
| | Knowledge signals | Medium |
| | Tool signals | Medium |
| | Escalation appropriateness | Medium |
| Customer-facing | Privacy protection | High |
| | Tone and professionalism | High |
| | Escalation appropriateness | High |
| | Resolution completeness | Medium |
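If your evaluation harness selects signals programmatically, the table above could be encoded as configuration. The sketch below is illustrative; the structure and names aren't a product setting.

```python
# The table above expressed as configuration so an evaluation harness
# can decide which signals to compute for a given agent type.
# Structure and names are illustrative, not a product setting.
SIGNAL_PRIORITIES = {
    "knowledge-grounded": {
        "high": ["Policy accuracy", "Source attribution"],
        "medium": ["Completeness", "Personalization"],
    },
    "tool-calling": {
        "high": ["Tool accuracy", "Tool response handling", "Action enablement"],
        "medium": ["Error recovery"],
    },
    "customer-facing": {
        "high": ["Privacy protection", "Tone and professionalism",
                 "Escalation appropriateness"],
        "medium": ["Resolution completeness"],
    },
}

# Pick the signals to evaluate first for a tool-calling agent.
print(SIGNAL_PRIORITIES["tool-calling"]["high"])
```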
## Avoid common pitfalls
Avoid the following issues to ensure that your quality signals remain useful, consistent, and actionable.
### Use specific signals instead of generic categories
Signals that are too broad, such as "Accuracy," "Helpfulness," or "Relevance," don't provide actionable insight. Generic signals make it difficult to identify root causes or prioritize improvements.
Instead, define signals based on specific, observable patterns in evaluation results.
- Avoid: Accuracy
- Prefer: Policy accuracy, Source attribution
### Avoid overly granular signals
Creating too many narrowly scoped signals increases complexity without improving insight. Excessive granularity fragments analysis and makes it harder to track meaningful trends.
Instead, group related behaviors into broader, reusable signal categories.
- Avoid: PTO accuracy, Benefits accuracy, Holiday accuracy
- Prefer: Policy accuracy
### Avoid vague pass and fail criteria
Vague signal definitions, such as "Correctness," lack measurable standards. Without clear criteria, results are inconsistent and difficult to interpret.
Instead, define signals using explicit, observable behaviors tied to evaluation outcomes.
- Avoid: "Response is correct"
- Prefer: "Response includes correct value and cites authoritative source"