Evaluate the performance and safety of your generative AI models and agents by running them against a test dataset. During an evaluation, the model or agent runs against the dataset, and its output is scored by built-in and custom evaluators.
Use the Foundry portal to run evaluations, view results, and analyze metrics.
Prerequisites
- A test dataset in CSV or JSON Lines (JSONL) format, or a model or agent to evaluate.
- An Azure OpenAI connection with a deployed GPT model that supports chat completion (for example, gpt-4o-mini). Required only for AI-assisted quality evaluations.
Create an evaluation
You can start an evaluation from several places in the Foundry portal:
- Evaluation page: From the left pane, select Evaluation > Create.
- Models page: Go to your model, select the Evaluation tab, then select Create.
- Agents page: Go to your agent, select the Evaluation tab, then select Create.
- Agent playground: Go to your agent, select the Playground tab, then select Metrics > Run full evaluation.
Evaluation target
When you create an evaluation, you first choose the evaluation target. The target determines what the evaluation runs against:
- Agent: Evaluates the output generated by your selected agent and user-defined prompt.
- Model: Evaluates the output generated by your selected model and user-defined prompt.
- Dataset: Evaluates preexisting model or agent outputs from a test dataset.
Select or create a dataset
Provide a dataset for the evaluation. You can upload your own dataset or synthetically generate one.
- Add new dataset: Upload files from your local storage. Only CSV and JSONL file formats are supported. A preview of your test data displays on the right pane.
- Synthetic dataset generation: Generate a synthetic dataset when you don't have test data. Specify the resource, the number of rows, and a prompt that describes the data to generate. You can also upload files to improve relevance.
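When you upload your own dataset, JSONL files contain one JSON object per line. As a minimal sketch, the following Python snippet writes a small test dataset in that shape; the field names (`query`, `response`, `ground_truth`) are illustrative assumptions, so match them to the inputs your selected evaluators expect.

```python
import json

# Hypothetical test rows; the field names (query, response, ground_truth)
# are illustrative -- align them with the evaluators you plan to use.
rows = [
    {
        "query": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "ground_truth": "Paris",
    },
    {
        "query": "How many moons does Mars have?",
        "response": "Mars has two moons, Phobos and Deimos.",
        "ground_truth": "Two: Phobos and Deimos",
    },
]

# JSONL format: one JSON object per line.
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```

A file written this way can be uploaded with Add new dataset, and the portal previews the rows in the right pane.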
Note
Synthetic data generation requires a model with Responses API capability. For availability, see Responses API region availability.
Configure testing criteria
Select the evaluators to use for your evaluation. Microsoft Foundry provides three categories of built-in evaluators:
- Agent evaluators: Evaluate how effectively agents handle tasks, tools, and user intent.
- Quality evaluators: Measure the overall quality of generated responses. This category includes both AI-assisted metrics, which require a model deployment to act as a judge, and NLP metrics, which are mathematical and often require ground truth data.
- Safety evaluators: Identify potential content and security risks in generated output. Safety evaluators don't require a model deployment.
You can also create your own custom evaluators and select them when configuring testing criteria.
For the complete list of available evaluators, see Built-in evaluators.
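Conceptually, a code-based custom evaluator is a function that takes the fields it needs from each dataset row and returns a named score. The sketch below shows that shape only; the field names, metric name, and scoring rule are all illustrative assumptions, not a specific Foundry API.

```python
# Sketch of a custom code-based evaluator: a callable that receives the
# fields it needs and returns a dict of named scores. The field names
# (response, ground_truth) and the scoring rule are assumptions.
def exact_match_evaluator(*, response: str, ground_truth: str) -> dict:
    """Return 1.0 when the response contains the ground truth, else 0.0."""
    score = 1.0 if ground_truth.lower() in response.lower() else 0.0
    return {"exact_match": score}


result = exact_match_evaluator(
    response="The capital of France is Paris.",
    ground_truth="Paris",
)
print(result)  # {'exact_match': 1.0}
```

A simple deterministic rule like this is a useful complement to AI-assisted metrics because it needs no judge model and always produces the same score for the same input.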
Data mapping
Different evaluators require different data inputs. The portal automatically maps your dataset fields to the fields each evaluator expects. Check the mapping and reassign fields if needed. For field requirements, see the respective evaluator pages under Built-in evaluators.
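The mapping step can be pictured as binding each evaluator input field to a column of the dataset. The snippet below is a minimal illustration of that idea; the column names (`question`, `answer`, `truth`) and field names are hypothetical.

```python
# Illustrative sketch of data mapping: each evaluator input field is bound
# to a column in the dataset row. Column and field names are assumptions.
dataset_row = {"question": "What is 2 + 2?", "answer": "4", "truth": "4"}

# Mapping from evaluator-expected field to dataset column.
mapping = {"query": "question", "response": "answer", "ground_truth": "truth"}

# Resolve the mapping to build the inputs one evaluator call would receive.
evaluator_inputs = {field: dataset_row[column] for field, column in mapping.items()}
print(evaluator_inputs)
```

If a dataset column is missing or misnamed, the lookup fails; in the portal, that corresponds to reassigning the field in the mapping view before you submit.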
Review and submit
After you finish configuring, provide a name for your evaluation, review your settings, and select Submit.
Related content
Learn more about evaluating your generative AI models and agents: