- `metrics`: The evaluation metrics to compute; multiple metrics can be calculated simultaneously.
- `save_path`: The storage location for the result logs.
- `evaluate`: Evaluates a set of model-generated answers and saves the evaluation results.

Metric Name | Type | Description |
---|---|---|
EM | float | Exact Match, prediction exactly matches any reference. |
Acc | float | Answer contains any form of the reference answer (loose match). |
StringEM | float | Soft matching ratio for multiple answers (commonly used in multi-choice/nested QA). |
CoverEM | float | Whether the reference answer is fully covered by the predicted text. |
F1 | float | Token-level F1 score. |
Rouge_1 | float | 1-gram ROUGE-F1. |
Rouge_2 | float | 2-gram ROUGE-F1. |
Rouge_L | float | Longest Common Subsequence (LCS) based ROUGE. |
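The string-matching metrics above (EM, CoverEM, token-level F1) can be sketched as follows. This is a minimal illustration, not the module's actual implementation; the function names and the normalization scheme (lowercasing and whitespace collapsing) are assumptions for the example.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for loose comparison (assumed scheme)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, references: list[str]) -> float:
    """EM: 1.0 if the prediction exactly matches any reference, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def cover_em(prediction: str, references: list[str]) -> float:
    """CoverEM: 1.0 if any reference is fully contained in the prediction."""
    pred = normalize(prediction)
    return float(any(normalize(ref) in pred for ref in references))


def token_f1(prediction: str, references: list[str]) -> float:
    """Token-level F1: best harmonic mean of precision/recall over references."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize(ref).split()
        # Count tokens shared between prediction and reference (with multiplicity).
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For example, `token_f1("the city of paris", ["paris france"])` gives precision 1/4 and recall 1/2, so F1 = 1/3, while `exact_match` on the same pair is 0.0 because the strings differ after normalization.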