Function

The Evaluation Server provides a complete set of text evaluation tools, supporting various common evaluation metrics for automated assessment of model outputs within a pipeline.

Parameter Description

servers/evaluation/parameter.yaml
metrics: [ 'acc', 'f1', 'em', 'coverem', 'rouge-l' ]
save_path: output/asqa.json
  • metrics: Specifies the evaluation metrics to compute; multiple metrics can be calculated simultaneously.
  • save_path: The file path where the evaluation results are saved.

Tool Description

  • evaluate: Evaluates a set of model-generated answers and saves the evaluation results.
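To make the tool's behavior concrete, here is a minimal sketch of what an evaluate-style function could look like. The function name, signature, and output layout are assumptions for illustration and are not taken from the server's actual API; only exact match is scored here to keep the sketch short.

```python
import json
from pathlib import Path

def evaluate(predictions, references, save_path):
    """Score each prediction against its reference with exact match,
    then save the aggregate result as JSON (hypothetical sketch)."""
    scores = [
        float(p.strip().lower() == r.strip().lower())
        for p, r in zip(predictions, references)
    ]
    result = {"em": sum(scores) / len(scores)}
    # Mirror the save_path behavior described above: persist the result log.
    path = Path(save_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, indent=2))
    return result
```

In the real server, the set of metrics computed would be driven by the `metrics` list in parameter.yaml rather than hard-coded.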

Evaluation Metrics

| Metric Name | Type  | Description |
|-------------|-------|-------------|
| EM          | float | Exact Match: the prediction exactly matches any reference answer. |
| Acc         | float | The prediction contains any form of the reference answer (loose match). |
| StringEM    | float | Soft matching ratio over multiple answers (commonly used in multi-choice/nested QA). |
| CoverEM     | float | Whether the reference answer is fully covered by the predicted text. |
| F1          | float | Token-level F1 score. |
| Rouge_1     | float | 1-gram ROUGE-F1. |
| Rouge_2     | float | 2-gram ROUGE-F1. |
| Rouge_L     | float | Longest Common Subsequence (LCS) based ROUGE. |
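The metrics in the table can be sketched in a few lines each. The implementations below are illustrative, not the server's actual code: normalization here is just lowercasing and whitespace tokenization, whereas a real evaluator typically also strips punctuation and articles.

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> float:
    """EM: 1.0 if the normalized prediction equals the reference."""
    return float(pred.strip().lower() == ref.strip().lower())

def loose_acc(pred: str, ref: str) -> float:
    """Acc: 1.0 if the reference string appears anywhere in the prediction."""
    return float(ref.strip().lower() in pred.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 over whitespace tokens, counting overlap with multiplicity."""
    p, r = pred.lower().split(), ref.lower().split()
    common = sum((Counter(p) & Counter(r)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(pred: str, ref: str) -> float:
    """ROUGE-L F1: longest common subsequence of tokens via dynamic programming."""
    p, r = pred.lower().split(), ref.lower().split()
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pt in enumerate(p):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pt == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(p)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

For example, with prediction "the cat sat" and reference "the cat", token F1 is 0.8 (precision 2/3, recall 1.0), while loose Acc is 1.0 because the reference is contained in the prediction.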