Agent Evaluations CLI overview (preview)

The Microsoft 365 Copilot Agent Evaluations CLI (@microsoft/m365-copilot-eval) helps you test, measure, and improve the quality of your agents through structured evaluations, AI-based scoring, and rich result reports.

Note

The Agent Evaluations CLI is currently in preview. Features and functionality are subject to change.

What you can do

The evaluation tool provides the following capabilities:

  • Run batch and interactive evaluations.
  • Automatically score responses using Azure AI and machine learning evaluation metrics.
  • Test using JSON datasets, inline prompts, or interactive input.
  • Generate reports in HTML, JSON, or CSV formats.

Evaluation metrics

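Each response is scored using standard evaluation metrics. The following table lists each evaluator, its type, its scoring scale, its default threshold, and whether it is enabled by default.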

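| Evaluator    | Type         | Scale   | Default threshold | Enabled by default |
|--------------|--------------|---------|-------------------|--------------------|
| Relevance    | LLM-based    | 1-5     | 3                 | Yes                |
| Coherence    | LLM-based    | 1-5     | 3                 | Yes                |
| Groundedness | LLM-based    | 1-5     | 3                 | No                 |
| Similarity   | LLM-based    | 1-5     | 3                 | No                 |
| Citations    | Count-based  | >= 0    | 1                 | No                 |
| ExactMatch   | String match | boolean | N/A               | No                 |
| PartialMatch | String match | 0.0-1.0 | 0.5               | No                 |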
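
To make the threshold column concrete, the following Python sketch applies the default thresholds from the table above to a score. It is an illustration only, not the CLI's implementation: the `passes` helper and the `DEFAULT_THRESHOLDS` mapping are hypothetical names, and the rule that a score passes when it is greater than or equal to the threshold is an assumption that the table does not state explicitly.

```python
from typing import Optional, Union

# Default thresholds copied from the metrics table.
# None means the evaluator has no numeric threshold (ExactMatch is boolean).
DEFAULT_THRESHOLDS: dict[str, Optional[float]] = {
    "Relevance": 3,
    "Coherence": 3,
    "Groundedness": 3,
    "Similarity": 3,
    "Citations": 1,
    "ExactMatch": None,
    "PartialMatch": 0.5,
}


def passes(
    evaluator: str,
    score: Union[bool, float],
    threshold: Optional[float] = None,
) -> bool:
    """Hypothetical helper: return True if a score meets its threshold.

    Assumption: a numeric score passes when score >= threshold.
    Boolean scores (for example, ExactMatch) pass when they are True.
    """
    # bool is a subclass of int in Python, so check for it first.
    if isinstance(score, bool):
        return score
    limit = threshold if threshold is not None else DEFAULT_THRESHOLDS[evaluator]
    if limit is None:
        raise ValueError(f"{evaluator} has no numeric threshold")
    return score >= limit


if __name__ == "__main__":
    print(passes("Relevance", 4))
    print(passes("PartialMatch", 0.4))
    print(passes("ExactMatch", True))
```

Under these assumptions, a Relevance score of 4 passes the default threshold of 3, while a PartialMatch score of 0.4 falls below the default of 0.5.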

How the evaluation workflow works

Evaluations follow a consistent workflow:

  1. Install and configure the CLI.
  2. Provide environment configuration and credentials.
  3. Create a dataset of test prompts.
  4. Run evaluations against your agent.
  5. Review results and iterate.

Required environment variables

The evaluation tool uses environment variables to authenticate and to connect to your tenant and your Azure OpenAI in Foundry Models resource.

| Variable                 | Description                                                                        | Default                          |
|--------------------------|------------------------------------------------------------------------------------|----------------------------------|
| TENANT_ID                | Microsoft Entra tenant ID where your agent is deployed.                            | None                             |
| AZURE_AI_OPENAI_ENDPOINT | Azure OpenAI endpoint URL.                                                         | None                             |
| AZURE_AI_API_KEY         | Azure OpenAI API key.                                                              | None                             |
| M365_TITLE_ID            | (Optional) Title ID used to auto-detect the Microsoft 365 agent ID for evaluation. | None                             |
| M365_AGENT_ID            | (Optional) Explicit agent ID for evaluation.                                       | Auto-detected from M365_TITLE_ID |
| AZURE_AI_API_VERSION     | Azure OpenAI REST API version.                                                     | 2024-12-01-preview               |
| AZURE_AI_MODEL_NAME      | Model deployment name in your Azure OpenAI in Foundry Models resource.             | gpt-4o-mini                      |

These values enable authentication and allow the tool to run LLM-based evaluation scoring. For details about how to get these values, see Get values for environment variables.
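
Before you run an evaluation, it can help to confirm that the variables listed above are set. The following Python sketch is a standalone pre-flight check, not part of the CLI. The `load_eval_config` function is a hypothetical helper, and it assumes that the three variables with no default and no "optional" label (TENANT_ID, AZURE_AI_OPENAI_ENDPOINT, and AZURE_AI_API_KEY) are the required ones.

```python
import os
from typing import Mapping, Optional

# Assumption: variables with no default and no "optional" label are required.
REQUIRED = ("TENANT_ID", "AZURE_AI_OPENAI_ENDPOINT", "AZURE_AI_API_KEY")

# Defaults copied from the environment variables table.
DEFAULTS = {
    "AZURE_AI_API_VERSION": "2024-12-01-preview",
    "AZURE_AI_MODEL_NAME": "gpt-4o-mini",
}

# Optional variables are included only when they are set.
OPTIONAL = ("M365_TITLE_ID", "M365_AGENT_ID")


def load_eval_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: collect evaluation settings from the environment."""
    env = os.environ if env is None else env

    # Treat unset and empty values the same way.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(
            "Missing required environment variables: " + ", ".join(missing)
        )

    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name) or default
    for name in OPTIONAL:
        if env.get(name):
            config[name] = env[name]
    return config


if __name__ == "__main__":
    try:
        settings = load_eval_config()
        # Print variable names only, so the API key is never echoed.
        print("Configured:", ", ".join(sorted(settings)))
    except ValueError as err:
        print(err)
```

The check reads from a mapping rather than directly from `os.environ`, so you can test it with a plain dictionary without changing your shell environment.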