evaluate
Signature
- Executes automatic metric evaluation for QA / Generation tasks.
- Supported metrics: `acc`, `em`, `coverem`, `stringem`, `f1`, `rouge-1`, `rouge-2`, `rouge-l`.
- Results are automatically saved as a `.json` file and printed as a Markdown table.
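By convention, `em` is computed on normalized answer strings and `f1` on token overlap. The following is a minimal sketch of that convention; the helper names and normalization rules are illustrative, not this tool's exact implementation.

```python
# Sketch of the usual QA exact-match (em) and token-level f1 metrics.
# Normalization rules here are an assumption, not this tool's exact code.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(token_f1("in Paris, France", "Paris"), 3))  # 0.5
```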
evaluate_trec
Signature
- Performs IR retrieval metric evaluation based on `pytrec_eval`.
- Reads standard TREC formats:
  - qrels: `<qid> <iter> <docid> <rel>`
  - run: `<qid> Q0 <docid> <rank> <score> <tag>`
- Supported metrics: `mrr`, `map`, `recall@k`, `precision@k`, `ndcg@k`.
- Automatically aggregates statistics over queries and outputs them as a table.
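For reference, a minimal sketch of this flow with `pytrec_eval`. The file-parsing helpers and paths are illustrative; only the `RelevanceEvaluator` construction and `evaluate` call follow the library's documented API.

```python
# Sketch of TREC evaluation with pytrec_eval.
import pytrec_eval

def read_qrels(path: str) -> dict:
    """Parse '<qid> <iter> <docid> <rel>' lines into {qid: {docid: rel}}."""
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _iter, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def read_run(path: str) -> dict:
    """Parse '<qid> Q0 <docid> <rank> <score> <tag>' lines into {qid: {docid: score}}."""
    run = {}
    with open(path) as f:
        for line in f:
            qid, _q0, docid, _rank, score, _tag = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = read_qrels("qrels.txt")  # placeholder paths
run = read_run("run.txt")

# trec_eval measure names: 'recip_rank' is MRR; 'P' expands to P_5, P_10, ...
evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"map", "recip_rank", "ndcg", "P"}
)
per_query = evaluator.evaluate(run)  # {qid: {measure: value}}

# Aggregate by averaging each measure over all queries.
measures = sorted(next(iter(per_query.values())).keys())
for m in measures:
    mean = sum(scores[m] for scores in per_query.values()) / len(per_query)
    print(f"{m}: {mean:.4f}")
```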
evaluate_trec_pvalue
Signature
- Compares two TREC run files for statistical significance using a two-sided permutation test, reporting a p-value.
- Default resampling count: `n_resamples=10000`.
- Outputs the mean of each run, their difference, the p-value, and a significance flag.
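A paired two-sided permutation test on per-query scores works by randomly sign-flipping the per-query differences and counting how often the resampled mean difference is at least as extreme as the observed one. The sketch below is a generic implementation under that reading; the function name and the 0.05 threshold are assumptions, though `n_resamples=10000` matches the default above.

```python
# Sketch of a paired two-sided permutation test on per-query scores
# (e.g. per-query MAP from the new and old runs).
import numpy as np

def permutation_test(new: np.ndarray, old: np.ndarray,
                     n_resamples: int = 10000, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    observed = new.mean() - old.mean()
    count = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each paired difference is equally
        # likely to have either sign, so flip signs at random per query.
        flip = rng.random(len(new)) < 0.5
        diff = np.where(flip, old - new, new - old)
        if abs(diff.mean()) >= abs(observed):
            count += 1
    p_value = count / n_resamples
    return {
        "mean_new": new.mean(),
        "mean_old": old.mean(),
        "difference": observed,
        "p_value": p_value,
        "significant": p_value < 0.05,  # assumed significance threshold
    }

scores_new = np.array([0.52, 0.61, 0.48, 0.70, 0.55])  # toy per-query scores
scores_old = np.array([0.50, 0.58, 0.49, 0.66, 0.53])
print(permutation_test(scores_new, scores_old))
```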
Configuration
| Parameter | Type | Description |
|---|---|---|
| `save_path` | `str` | Evaluation result save path (a timestamp is appended automatically) |
| `metrics` | `list[str]` | Metric set used for QA / Generation tasks |
| `qrels_path` | `str` | Path to the TREC-format ground-truth (qrels) file |
| `run_path` | `str` | Run file for the retrieval task |
| `ks` | `list[int]` | Truncation levels for computing NDCG@k, P@k, Recall@k, etc. |
| `ir_metrics` | `list[str]` | Retrieval metric names; supports `mrr`, `map`, `recall`, `ndcg`, `precision` |
| `run_new_path` | `str` | Path to the run file produced by the new model (significance analysis) |
| `run_old_path` | `str` | Path to the run file of the old model (significance analysis) |
| `n_resamples` | `int` | Resampling count for the permutation test |
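The parameters above group naturally by task. Since the exact invocation format is not shown here, the following layout and values are hypothetical placeholders meant only to illustrate how the fields relate.

```python
# Hypothetical configuration dict; keys match the table above,
# but the values and grouping are illustrative placeholders.
config = {
    # evaluate (QA / Generation)
    "save_path": "results/eval",            # timestamp appended automatically
    "metrics": ["em", "f1", "rouge-l"],
    # evaluate_trec (retrieval)
    "qrels_path": "data/qrels.txt",
    "run_path": "runs/new_model.trec",
    "ks": [5, 10, 100],
    "ir_metrics": ["mrr", "map", "recall", "ndcg", "precision"],
    # evaluate_trec_pvalue (significance)
    "run_new_path": "runs/new_model.trec",
    "run_old_path": "runs/old_model.trec",
    "n_resamples": 10000,
}
```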