
Run evaluations from the Microsoft Foundry portal

Evaluate the performance and safety of your generative AI models and agents by running them against a test dataset. During an evaluation, the model or agent is tested with the dataset and its performance is measured using built-in and custom evaluators.

Use the Foundry portal to run evaluations, view results, and analyze metrics.

Prerequisites

  • A test dataset in CSV or JSON Lines (JSONL) format, or a model or agent to evaluate.
  • An Azure OpenAI connection with a deployed GPT model that supports chat completion (for example, gpt-4o-mini). Required only for AI-assisted quality evaluations.
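Before you upload a test dataset, it can help to sanity-check that every line is valid JSONL. The sketch below writes and re-parses a tiny dataset; the field names (`query`, `response`, `ground_truth`) are illustrative only — the fields your dataset actually needs depend on the evaluators you select.

```python
import json
import os
import tempfile

# Hypothetical test rows; these field names are illustrative, not a fixed schema.
rows = [
    {"query": "What is the capital of France?",
     "response": "Paris is the capital of France.",
     "ground_truth": "Paris"},
    {"query": "Who wrote Hamlet?",
     "response": "Hamlet was written by William Shakespeare.",
     "ground_truth": "William Shakespeare"},
]

path = os.path.join(tempfile.gettempdir(), "eval_dataset.jsonl")

# JSONL: exactly one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Re-read and confirm every line parses as a JSON object.
with open(path, encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]

assert all(isinstance(r, dict) for r in parsed)
print("valid JSONL rows:", len(parsed))
```

A file that fails this round trip (for example, one multi-line pretty-printed JSON document instead of one object per line) is not valid JSONL and will not upload cleanly.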

Create an evaluation

You can start an evaluation from several places in the Foundry portal:

  • Evaluation page: From the left pane, select Evaluation > Create.
  • Models page: Go to your model, select the Evaluation tab, then select Create.
  • Agents page: Go to your agent, select the Evaluation tab, then select Create.
  • Agent playground: Go to your agent, select the Playground tab, then select Metrics > Run full evaluation.

Evaluation target

When you create an evaluation, you first choose the evaluation target. The target determines what the evaluation runs against:

  • Agent: Evaluates the output generated by your selected agent and user-defined prompt.
  • Model: Evaluates the output generated by your selected model and user-defined prompt.
  • Dataset: Evaluates preexisting model or agent outputs from a test dataset.

Select or create a dataset

Provide a dataset for the evaluation. You can upload your own dataset or synthetically generate one.

  • Add new dataset: Upload files from your local storage. Only CSV and JSONL file formats are supported. A preview of your test data appears in the right pane.
  • Synthetic dataset generation: Generate a synthetic dataset when you don't have test data. Specify the resource, the number of rows, and a prompt that describes the data to generate. You can also upload files to improve relevance.

Note

Synthetic data generation requires a model with Responses API capability. For availability, see Responses API region availability.

Configure testing criteria

Select the evaluators to use for your evaluation. Microsoft Foundry provides three categories of built-in evaluators:

  • Agent evaluators — Evaluate how effectively agents handle tasks, tools, and user intent.
  • Quality evaluators — Measure the overall quality of generated responses. These include both AI-assisted metrics, which require a model deployment to act as the judge, and NLP metrics, which are computed mathematically and often require ground truth data.
  • Safety evaluators — Identify potential content and security risks in generated output. Safety evaluators don't require a model deployment.

You can also create your own custom evaluators and select them when configuring testing criteria.
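As a mental model, a custom evaluator is a function that takes one dataset row and returns one or more named scores. The sketch below is illustrative only — it is not the Foundry SDK interface — and shows a simple exact-match style check over the hypothetical `response` and `ground_truth` fields:

```python
# Illustrative sketch of what a custom evaluator conceptually does;
# this is NOT the Foundry SDK interface, just a mental model.

def exact_match_evaluator(row: dict) -> dict:
    """Score 1.0 when the response contains the ground truth, else 0.0."""
    response = row.get("response", "").lower()
    truth = row.get("ground_truth", "").lower()
    return {"exact_match": 1.0 if truth and truth in response else 0.0}

row = {"query": "Who wrote Hamlet?",
       "response": "Hamlet was written by William Shakespeare.",
       "ground_truth": "William Shakespeare"}

result = exact_match_evaluator(row)
print(result)  # {'exact_match': 1.0}
```

During a run, the evaluation applies each selected evaluator to every row of the dataset and aggregates the returned scores into the metrics you see in the results view.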

For the complete list of available evaluators, see Built-in evaluators.

Data mapping

Different evaluators require different data inputs. The portal automatically maps your dataset fields to the fields each evaluator expects. Check the mapping and reassign fields if needed. For field requirements, see the respective evaluator pages under Built-in evaluators.
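Conceptually, the mapping step assigns each dataset column to the input name an evaluator expects, which you can sketch as a simple key remapping. The column and field names below are hypothetical:

```python
# Illustrative sketch of the field mapping the portal performs automatically:
# dataset columns (keys) are assigned to the input names an evaluator
# expects (values). All names here are hypothetical.

mapping = {
    "question": "query",         # dataset column -> evaluator input field
    "answer": "response",
    "reference": "ground_truth",
}

dataset_row = {
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "reference": "Paris",
}

# Reassign fields so the evaluator receives the names it expects.
evaluator_inputs = {target: dataset_row[source]
                    for source, target in mapping.items()}

print(sorted(evaluator_inputs))  # ['ground_truth', 'query', 'response']
```

If the automatic mapping picks the wrong column (for example, mapping a `notes` column to `response`), correcting it in the portal is the equivalent of editing this dictionary before the run.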

Review and submit

After you finish configuring, provide a name for your evaluation, review your settings, and select Submit.

Learn more about evaluating your generative AI models and agents: