Azure OpenAI graders (classic)

Note

The Microsoft Foundry SDK for evaluation and the Foundry portal are in public preview, but the APIs are generally available for model and dataset evaluation (agent evaluation remains in public preview). The Azure AI Evaluation SDK and the evaluators marked (preview) in this article are currently in public preview everywhere.

Azure OpenAI graders are a new set of evaluation tools in the Microsoft Foundry SDK that assess the performance of AI models and their outputs. These graders include:

Label grader
String checker
Text similarity
Python grader

You can run graders locally or remotely. Each grader assesses a specific aspect of AI models and their outputs.

Model configuration for AI-assisted graders

The following code snippet shows the model configuration used by the AI-assisted grader:

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION")
)
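
Because os.environ.get returns None for an unset variable, a missing setting may only surface later, when a grader first calls the service. A minimal sketch of an up-front check, assuming the four variable names used in the snippet above; missing_env_vars is a hypothetical helper and not part of the SDK:

```python
import os

# Variable names taken from the configuration snippet above.
REQUIRED_VARS = [
    "AZURE_ENDPOINT",
    "AZURE_API_KEY",
    "AZURE_DEPLOYMENT_NAME",
    "AZURE_API_VERSION",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper: env defaults to os.environ so the function can
    also be called with a plain dict when testing.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All settings present.")
```

Running the check after load_dotenv() and before building model_config makes a misconfigured .env file easier to spot.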

Label grader

AzureOpenAILabelGrader uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with explanations of why each label was chosen.

Note

We recommend using Azure OpenAI o3-mini for best results.

Here's an example of the data.jsonl file used in the following code snippets:

[
    {
        "query": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
        "ground_truth": "Choosing an in-network provider helps you save money and ensures better, more personalized care. [Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer better coverage, and support continuity of care, leading to more effective and personalized treatment. [Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]"
    },
    {
        "query": "What should you do when choosing an in-network provider for your health care needs?",
        "ground_truth": "Check with Northwind Health Plus to confirm the provider is in-network, as this helps reduce costs.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "To choose an in-network provider, confirm they are part of your plan using the Northwind website or by calling their office. Also consider location, availability, and office hours. [Northwind_Standard_Benefits_Details.pdf]"
    },
    {
        "query": "What range of in-network providers does Northwind Health Plus offer?",
        "ground_truth": "Northwind Health Plus offers a wide range of in-network providers, including doctors, specialists, hospitals, and pharmacies.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Northwind Health Plus offers a wide range of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]. The exact range of in-network providers may vary depending on the area, but the Northwind Health Plus website provides a searchable directory of all in-network providers in your area [Northwind_Health_Plus_Benefits_Details.pdf]. It is important to check with Northwind Health Plus to ensure that the provider you are considering is in-network [Northwind_Health_Plus_Benefits_Details.pdf]."
    },
    {
        "query": "What is the limitation of in-network providers?",
        "ground_truth": "The limitation of in-network providers is that they may not always be available in every area.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "The limitation of in-network providers is that they may not accept the amount of payment offered by Northwind Health, which means you may be responsible for a greater portion of the cost [Northwind_Standard_Benefits_Details.pdf]. Additionally, out-of-network providers may not offer additional services or discounts that are available to in-network providers [Northwind_Standard_Benefits_Details.pdf]. It is recommended to choose an in-network provider whenever possible to ensure the best value for your health care expenses [Northwind_Standard_Benefits_Details.pdf].\n\nSources:\n- Northwind_Standard_Benefits_Details.pdf"
    },
    {
        "query": "What resource does Northwind Health Plus provide to find in-network providers in your area?",
        "ground_truth": "The Northwind Health Plus website offers a searchable directory of all in-network providers in your area. This directory is regularly updated, so you can be sure that you are choosing from in-network providers that are available.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Northwind Health Plus provides a variety of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]."
    }
]
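
The sample above is displayed as a JSON array, while the .jsonl extension conventionally means one JSON object per line. A minimal sketch for writing records in that line-delimited form, assuming the records are already available as Python dictionaries; write_jsonl and read_jsonl are hypothetical helpers, not SDK functions:

```python
import json


def write_jsonl(records, path):
    """Write each record as one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, write_jsonl(records, "data.jsonl") would produce a file with one line per query, ground_truth, and response record.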

Label grader example

from azure.ai.evaluation import AzureOpenAILabelGrader, evaluate

data_file_name="data.jsonl"

#  Evaluation criteria: Determine if the response column contains text that is "too short," "just right," or "too long," and pass if it is "just right."
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    input=[{"content": "{{item.response}}", "role": "user"},
           {"content": "Any text including space that's more than 600 characters is too long, less than 500 characters is too short; 500 to 600 characters is just right.", "role": "user", "type": "message"}],
    labels=["too short", "just right", "too long"],
    passing_labels=["just right"],
    model="gpt-4o",
    name="label",
)

label_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "label": label_grader
    },
)
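
Because the criterion in this example is a plain character count, the model's labels can be cross-checked deterministically. A minimal sketch under that assumption; length_label is a hypothetical helper that applies the same thresholds as the prompt, counting spaces:

```python
def length_label(text, low=500, high=600):
    """Label text by character count, spaces included.

    Mirrors the prompt in the example: more than `high` characters is
    "too long", fewer than `low` is "too short", anything else is "just right".
    """
    length = len(text)
    if length > high:
        return "too long"
    if length < low:
        return "too short"
    return "just right"


if __name__ == "__main__":
    print(length_label("x" * 575))
```

Comparing these labels with the grader's output is one way to see how reliably the chosen model counts characters.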

Label grader output

For each data sample in the data file, an evaluation result of True or False is returned, indicating whether the output matches the defined passing label. The score is 1.0 for True cases and 0.0 for False cases. The reason the model assigned the label is in the content field under outputs.label.sample.

'outputs.label.sample':
...
...
    'output': [{'role': 'assistant',
      'content': '{"steps":[{"description":"Calculate the number of characters in the user\'s input including spaces.","conclusion":"The provided text contains 575 characters."},{"description":"Evaluate if the character count falls within the given ranges (greater than 600 too long, less than 500 too short, 500 to 600 just right).","conclusion":"The character count falls between 500 and 600, categorized as \'just right.\'"}],"result":"just right"}'}],
...
...
'outputs.label.passed': True,
'outputs.label.score': 1.0
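
The content field in the output above is itself a JSON string, so the chosen label and the reasoning steps can be pulled out with the standard json module. A minimal sketch, assuming the content always has the steps and result keys seen in this sample; parse_label_content is a hypothetical helper and the sample is abbreviated:

```python
import json


def parse_label_content(content):
    """Return (label, steps) from the JSON string in the assistant message."""
    parsed = json.loads(content)
    return parsed["result"], parsed.get("steps", [])


# Abbreviated sample shaped like the output shown above.
sample_content = json.dumps({
    "steps": [
        {
            "description": "Calculate the number of characters in the user's input including spaces.",
            "conclusion": "The provided text contains 575 characters.",
        },
    ],
    "result": "just right",
})

label, steps = parse_label_content(sample_content)
print(label)
for step in steps:
    print("-", step["conclusion"])
```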

In addition to the individual data evaluation results, the grader returns a metric indicating the overall dataset pass rate.

'metrics': {'label.pass_rate': 0.2}, #1/5 in this case
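
The pass rate above is consistent with the fraction of rows whose passed value is True (1 of 5 here). A minimal sketch of that arithmetic, assuming this is how the metric is aggregated; pass_rate is a hypothetical helper:

```python
def pass_rate(passed_flags):
    """Fraction of rows that passed; 0.0 for an empty input."""
    flags = list(passed_flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


if __name__ == "__main__":
    # One passing row out of five, as in the label grader example.
    print(pass_rate([True, False, False, False, False]))
```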

String checker

Compares input text to a reference value, checking for exact or partial matches with optional case insensitivity. Useful for flexible text validation and pattern matching.

String checker example

from azure.ai.evaluation import AzureOpenAIStringCheckGrader

# Evaluation criteria: Pass if the query column contains "What is"
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="starts with what is",
    operation="like", # "eq" for equal, "ne" for not equal, "like" for contains, "ilike" for case-insensitive contains
    reference="What is",
)

string_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "string": string_grader
    },
)

String checker output

For each data sample in the data file, an evaluation result of True or False is returned, indicating whether the input text matches the defined pattern-matching rules. The score is 1.0 for True cases and 0.0 for False cases.

'outputs.string.passed': True,
'outputs.string.score': 1.0

The grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'string.pass_rate': 0.4}, # 2/5 in this case
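
The four operations named in the example comment can be illustrated with plain Python string comparisons. A minimal local sketch, assuming the semantics given in that comment (eq, ne, like for contains, ilike for case-insensitive contains); string_check is a hypothetical helper and does not call the service:

```python
def string_check(value, reference, operation):
    """Apply one of the string-check operations described in the example."""
    if operation == "eq":
        return value == reference
    if operation == "ne":
        return value != reference
    if operation == "like":
        return reference in value
    if operation == "ilike":
        return reference.lower() in value.lower()
    raise ValueError(f"Unsupported operation: {operation}")


# The five query values from the sample data file.
queries = [
    "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
    "What should you do when choosing an in-network provider for your health care needs?",
    "What range of in-network providers does Northwind Health Plus offer?",
    "What is the limitation of in-network providers?",
    "What resource does Northwind Health Plus provide to find in-network providers in your area?",
]

results = [string_check(query, "What is", "like") for query in queries]
print(results, sum(results) / len(results))
```

Two of the five queries contain "What is", which matches the 0.4 pass rate shown above.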

Text similarity

Evaluates how closely input text matches a reference value using similarity metrics such as fuzzy_match, BLEU, ROUGE, or METEOR. This is useful for assessing text quality or semantic closeness.

Text similarity example

from azure.ai.evaluation import AzureOpenAITextSimilarityGrader

# Evaluation criteria: Pass if response column and ground_truth column similarity score >= 0.5 using "fuzzy_match"
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    evaluation_metric="fuzzy_match", # support evaluation metrics including: "fuzzy_match", "bleu", "gleu", "meteor", "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l", "cosine",
    input="{{item.response}}",
    name="similarity",
    pass_threshold=0.5,
    reference="{{item.ground_truth}}",
)

sim_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "similarity": sim_grader
    },
)
sim_grader_evaluation

Text similarity output

For each data sample in the data file, a numerical similarity score is generated. This score ranges from 0 to 1 and indicates the degree of similarity, with higher scores representing greater similarity. An evaluation result of True or False is also returned, indicating whether the similarity score meets or exceeds the specified threshold based on the evaluation metric defined in the grader.

'outputs.similarity.passed': True,
'outputs.similarity.score': 0.6117136659436009

The grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'similarity.pass_rate': 0.4}, # 2 out of 5 in this case
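
To illustrate how a score between 0 and 1 is compared with pass_threshold, the sketch below uses difflib.SequenceMatcher from the Python standard library as a stand-in. This is not the algorithm behind fuzzy_match or any other metric listed above; similarity_score and passes are hypothetical helpers, and the scores they produce will differ from the service's:

```python
from difflib import SequenceMatcher


def similarity_score(response, reference):
    """Stand-in similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, response.lower(), reference.lower()).ratio()


def passes(score, pass_threshold=0.5):
    """A row passes when its score meets or exceeds the threshold."""
    return score >= pass_threshold


if __name__ == "__main__":
    score = similarity_score(
        "Check that the provider is in-network.",
        "Confirm the provider is in-network.",
    )
    print(round(score, 3), passes(score))
```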

Python grader

Advanced users can create or import custom Python grader functions and integrate them into the Azure OpenAI Python grader. This allows evaluations tailored to specific areas of interest beyond the capabilities of the existing Azure OpenAI graders. The following example demonstrates how to import a custom similarity grader function and configure it to run as an Azure OpenAI Python grader using the Microsoft Foundry SDK.

Python grader example

from azure.ai.evaluation import AzureOpenAIPythonGrader, evaluate

python_similarity_grader = AzureOpenAIPythonGrader(
    model_config=model_config,  # configuration defined earlier in this article
    name="custom_similarity",
    image_tag="2025-05-08",
    pass_threshold=0.3,
    source="""
def grade(sample, item) -> float:
    \"\"\"
    Custom similarity grader using word overlap.
    Note: All data is in the 'item' parameter.
    \"\"\"
    # Extract from item, not sample!
    response = item.get("response", "") if isinstance(item, dict) else ""
    ground_truth = item.get("ground_truth", "") if isinstance(item, dict) else ""

    # Simple word overlap similarity
    response_words = set(response.lower().split())
    truth_words = set(ground_truth.lower().split())

    # Avoid division by zero when there is no ground truth text
    if not truth_words:
        return 0.0

    overlap = response_words.intersection(truth_words)
    similarity = len(overlap) / len(truth_words)

    return min(1.0, similarity)
""",
)

evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "custom_similarity": python_similarity_grader,
    },
    #azure_ai_project=azure_ai_project,
)
evaluation
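
Since the grader source is ordinary Python, the grade function can be exercised locally before it is submitted. A minimal sketch that repeats the word-overlap logic from the example and runs it on a made-up item; the sample strings are illustrative only, and local results are assumed, not guaranteed, to match the hosted run:

```python
def grade(sample, item) -> float:
    """Word-overlap similarity: share of ground-truth words found in the response."""
    response = item.get("response", "") if isinstance(item, dict) else ""
    ground_truth = item.get("ground_truth", "") if isinstance(item, dict) else ""

    response_words = set(response.lower().split())
    truth_words = set(ground_truth.lower().split())

    # No ground-truth words means nothing to compare against.
    if not truth_words:
        return 0.0

    overlap = response_words.intersection(truth_words)
    return min(1.0, len(overlap) / len(truth_words))


# Illustrative item, not taken from the sample data file.
item = {
    "response": "in-network providers reduce costs",
    "ground_truth": "in-network providers save money",
}

score = grade({}, item)
# Two of the four ground-truth words appear in the response.
print(score, score >= 0.3)
```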

Example output

For each data sample in the data file, the Python grader returns a numerical score based on the defined function. Given the numerical threshold defined as part of the custom grader, it also returns True if the score >= threshold, or False otherwise.

For example:

"outputs.custom_similarity.passed": false,
"outputs.custom_similarity.score": 0.0

In addition to the individual data evaluation results, the grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'custom_similarity.pass_rate': 0.0}, #0/5 in this case

Last updated on 2026-05-01