evaluate
Signature
- Performs automatic metric evaluation for QA or generation tasks.
- Supported metrics: `acc`, `em`, `coverem`, `stringem`, `f1`, `rouge-1`, `rouge-2`, `rouge-l` (a sketch of `em` and `f1` follows below).
- Automatically saves results as a `.json` file and prints them as a Markdown table.
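For reference, here is a minimal sketch of how the `em` and `f1` metrics are commonly computed for QA. The helper names and normalization choices (lowercasing, stripping punctuation and articles) are illustrative assumptions, not this tool's exact implementation.

```python
# Illustrative sketch of `em` and `f1`; not the tool's exact code.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, answer: str) -> float:
    """em: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(answer))

def token_f1(prediction: str, answer: str) -> float:
    """f1: token-level overlap between prediction and answer."""
    pred_tokens = normalize(prediction).split()
    ans_tokens = normalize(answer).split()
    common = Counter(pred_tokens) & Counter(ans_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ans_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))       # 1.0
print(round(token_f1("tower in Paris", "eiffel tower"), 3))  # 0.4
```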
evaluate_trec
Signature
- Evaluates IR retrieval metrics using `pytrec_eval`.
- Reads standard TREC-formatted files:
  - qrels: `<qid> <iter> <docid> <rel>`
  - run: `<qid> Q0 <docid> <rank> <score> <tag>`
- Supported metrics: `mrr`, `map`, `recall@k`, `precision@k`, `ndcg@k`.
- Automatically computes aggregated results and outputs them in a table (see the sketch below).
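Since the tool builds on `pytrec_eval`, the following sketch shows how such an evaluation can be wired up. The parsing helpers, file paths, and the choice of `@10` cutoffs are assumptions; the two TREC line formats come from the list above, and the `RelevanceEvaluator` calls are the library's standard API.

```python
# Minimal pytrec_eval-based evaluation sketch; paths and cutoffs are assumed.
import pytrec_eval

def read_qrels(path: str) -> dict:
    """Parse `<qid> <iter> <docid> <rel>` lines into pytrec_eval's dict format."""
    qrels = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            qid, _iter, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def read_run(path: str) -> dict:
    """Parse `<qid> Q0 <docid> <rank> <score> <tag>` lines."""
    run = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            qid, _q0, docid, _rank, score, _tag = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = read_qrels("qrels.txt")  # assumed path
run = read_run("run.txt")        # assumed path

# recip_rank corresponds to MRR; the `.10` variants are the @10 cutoffs.
evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"map", "recip_rank", "recall.10", "P.10", "ndcg_cut.10"}
)
per_query = evaluator.evaluate(run)  # {qid: {measure: value}}

# Aggregate by averaging each measure over all queries.
for m in sorted(next(iter(per_query.values()))):
    mean = sum(scores[m] for scores in per_query.values()) / len(per_query)
    print(f"{m}: {mean:.4f}")
```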
evaluate_trec_pvalue
Signature
- Performs a significance comparison between two TREC result files using a two-tailed permutation test to compute p-values.
- Default resampling count: `n_resamples=10000`.
- Outputs the means, their difference, the p-value, and significance markers (a sketch of the test follows below).
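The sketch below shows one standard form of the two-tailed paired permutation test on per-query scores: under the null hypothesis the two systems are exchangeable, so the sign of each per-query difference can be flipped at random. The function name and the assumption that scores are paired per query are illustrative; the tool's exact procedure may differ in details.

```python
# Two-tailed paired permutation test sketch; assumes per-query paired scores.
import numpy as np

def permutation_pvalue(new_scores, old_scores, n_resamples=10000, seed=0):
    """Two-tailed p-value for mean(new) - mean(old) over paired queries."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(new_scores, float) - np.asarray(old_scores, float)
    observed = diffs.mean()
    # Randomly flip each difference's sign and count resamples whose mean
    # is at least as extreme as the observed one.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    resampled = (signs * diffs).mean(axis=1)
    return float((np.abs(resampled) >= abs(observed)).mean())

new = [0.52, 0.61, 0.47, 0.70, 0.58]  # e.g. per-query nDCG@10, new model
old = [0.48, 0.55, 0.49, 0.66, 0.51]  # same queries, old model
p = permutation_pvalue(new, old)
print(f"diff = {np.mean(new) - np.mean(old):.4f}, p = {p:.4f}")
```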
Parameter Configuration
| Parameter | Type | Description |
|---|---|---|
| `save_path` | `str` | Path to save evaluation results (automatically timestamped) |
| `metrics` | `list[str]` | Metric set used for QA / generation tasks |
| `qrels_path` | `str` | Path to the TREC-format ground-truth (qrels) file |
| `run_path` | `str` | Path to the retrieval run file |
| `ks` | `list[int]` | Cutoff levels for computing NDCG@k, P@k, Recall@k, etc. |
| `ir_metrics` | `list[str]` | IR metric names; supports `mrr`, `map`, `recall`, `ndcg`, `precision` |
| `run_new_path` | `str` | Path to the new model's run file (for significance testing) |
| `run_old_path` | `str` | Path to the old model's run file (for significance testing) |
| `n_resamples` | `int` | Number of resampling iterations for the permutation test |
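As a rough usage sketch, the parameters above might be combined as follows. These are hypothetical invocations: the keyword names come from the table, but the actual function signatures are not shown in this document.

```python
# Hypothetical calls; argument names follow the parameter table above.
evaluate_trec(
    qrels_path="qrels.txt",
    run_path="run.txt",
    ks=[5, 10, 20],
    ir_metrics=["mrr", "map", "recall", "ndcg", "precision"],
)

evaluate_trec_pvalue(
    qrels_path="qrels.txt",
    run_new_path="run_new.txt",
    run_old_path="run_old.txt",
    n_resamples=10000,
)
```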