- `metrics`: The evaluation metrics to compute; multiple metrics can be calculated simultaneously.
- `save_path`: The storage location for the result logs.
- `evaluate`: Evaluates a set of model-generated answers and saves the evaluation results.

Metric Name | Type | Description |
---|---|---|
EM | float | Exact Match, prediction exactly matches any reference. |
Acc | float | Answer contains any form of the reference answer (loose match). |
StringEM | float | Soft matching ratio for multiple answers (commonly used in multi-choice/nested QA). |
CoverEM | float | Whether the reference answer is fully covered by the predicted text. |
F1 | float | Token-level F1 score. |
Rouge_1 | float | 1-gram ROUGE-F1. |
Rouge_2 | float | 2-gram ROUGE-F1. |
Rouge_L | float | Longest Common Subsequence (LCS) based ROUGE. |
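The string-matching metrics above (EM, CoverEM, token-level F1) can be sketched as follows. This is a minimal illustration, not the module's actual implementation; the function names and the normalization scheme (lowercasing and whitespace collapsing) are assumptions for the example.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for loose comparison (assumed scheme)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, references: list[str]) -> float:
    """EM: 1.0 if the prediction exactly matches any reference, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def cover_em(prediction: str, references: list[str]) -> float:
    """CoverEM: 1.0 if any reference is fully contained in the prediction."""
    pred = normalize(prediction)
    return float(any(normalize(ref) in pred for ref in references))


def token_f1(prediction: str, references: list[str]) -> float:
    """Token-level F1: best harmonic mean of precision/recall over references."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize(ref).split()
        # Count tokens shared between prediction and reference (with multiplicity).
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For example, `token_f1("the city of paris", ["paris france"])` gives precision 1/4 and recall 1/2, so F1 = 1/3, while `exact_match` on the same pair is 0.0 because the strings differ after normalization.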