Function
The Evaluation Server provides a comprehensive set of automated evaluation tools for systematic, reproducible assessment of model outputs in retrieval and generation tasks. It supports mainstream ranking-based, matching-based, and summarization-based metrics, and can be embedded directly at the end of the Pipeline to compute and save evaluation results automatically.
Retrieval
| Metric Name | Type | Description |
|---|---|---|
| MRR | float | Mean Reciprocal Rank, measuring the average rank position of the first relevant document. |
| MAP | float | Mean Average Precision, jointly considering retrieval precision and recall. |
| Recall | float | Recall rate, measuring how many of the relevant documents the retrieval system finds. |
| Precision | float | Precision rate, measuring how many of the retrieved results are relevant documents. |
| NDCG | float | Normalized Discounted Cumulative Gain, evaluating how closely the retrieval ranking matches the ideal ranking. |
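For reference, the sketch below shows one common way to compute MRR, Recall@k, and NDCG@k from scratch. It is purely illustrative and not the Evaluation Server's internal implementation; it assumes `qrels` maps each query ID to a `{doc_id: relevance}` dict and `run` maps each query ID to the system's ranked list of document IDs.

```python
import math

def mrr(qrels: dict, run: dict) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant document."""
    scores = []
    for qid, ranked in run.items():
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if qrels.get(qid, {}).get(doc_id, 0) > 0:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores)

def recall_at_k(qrels: dict, run: dict, k: int) -> float:
    """Recall@k: fraction of the relevant documents found in the top-k results."""
    scores = []
    for qid, ranked in run.items():
        relevant = {d for d, rel in qrels.get(qid, {}).items() if rel > 0}
        if relevant:
            scores.append(len(relevant & set(ranked[:k])) / len(relevant))
    return sum(scores) / len(scores)

def ndcg_at_k(qrels: dict, run: dict, k: int) -> float:
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    scores = []
    for qid, ranked in run.items():
        rels = qrels.get(qid, {})
        dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores)
```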
Generation
| Metric Name | Type | Description |
|---|---|---|
| EM | float | Exact Match, the prediction is exactly the same as any reference answer. |
| Acc | float | The prediction contains any form of the reference answer (loose matching). |
| StringEM | float | Soft-match ratio over multiple sets of answers (commonly used for multiple-choice / nested QA). |
| CoverEM | float | Whether the reference answer is completely covered by the predicted text. |
| F1 | float | Token-level F1 score. |
| Rouge_1 | float | 1-gram ROUGE-F1. |
| Rouge_2 | float | 2-gram ROUGE-F1. |
| Rouge_L | float | Longest Common Subsequence (LCS) based ROUGE. |
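As an illustration of how the matching-based metrics above are typically computed, the sketch below implements EM, Acc, and token-level F1 with a common SQuAD-style normalization. It is a standalone example, not the Evaluation Server's internal code.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references: list) -> float:
    """EM: 1.0 if the normalized prediction equals any normalized reference."""
    return float(any(normalize(prediction) == normalize(r) for r in references))

def accuracy(prediction: str, references: list) -> float:
    """Acc (loose match): 1.0 if any normalized reference appears in the prediction."""
    pred = normalize(prediction)
    return float(any(normalize(r) in pred for r in references))

def f1_score(prediction: str, references: list) -> float:
    """Token-level F1 against the best-matching reference."""
    best = 0.0
    pred_tokens = normalize(prediction).split()
    for ref in references:
        ref_tokens = normalize(ref).split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```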
Usage Examples
Retrieval
TREC File Evaluation
In information retrieval, TREC-format files are a standardized evaluation interface used to measure model performance on ranking, recall, and related metrics. A TREC evaluation usually involves two kinds of files: a qrel file (human-annotated ground-truth relevance) and a run file (the system's retrieval output).

I. qrel file (ground truth, human-annotated relevance)

The qrel file stores human-annotated judgments of which documents are relevant to which query. During evaluation, the system's retrieval output is compared against the qrel file to compute metrics such as MAP, NDCG, Recall, and Precision. Format (4 columns, space-separated):
- query_id: query ID
- iter: usually written as 0 (legacy field, can be ignored)
- doc_id: document ID
- relevance: relevance label (usually 0 means irrelevant, 1 or higher means relevant)

II. run file (system retrieval output)

The run file stores the documents retrieved by the system for each query, together with their ranks and scores. Format (6 columns, space-separated):
- query_id: query ID
- Q0: fixed literal Q0 (required by the TREC standard)
- doc_id: document ID
- rank: ranking position (1 means most relevant)
- score: system score
- run_name: system name (e.g., bm25, dense_retriever)
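For illustration, a few lines of each file might look like the following (all query and document IDs are made up):

qrel file:
```
q1 0 doc_23 1
q1 0 doc_57 0
q2 0 doc_11 2
```

run file:
```
q1 Q0 doc_57 1 12.87 bm25
q1 Q0 doc_23 2 11.02 bm25
q2 Q0 doc_11 1 15.33 bm25
```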
You can click the following links to download example files: qrels.test and results.test
Significance Analysis
Significance testing is used to judge whether the performance difference between two retrieval systems is real rather than caused by random fluctuation. The core question it answers is: is the improvement of system A statistically significant? In retrieval tasks, system performance is usually measured by metrics averaged over many queries (such as MAP, NDCG, or Recall). However, an improvement in the average is not necessarily reliable, because scores vary randomly from query to query. Significance analysis uses statistical tests to assess whether an improvement is stable and reproducible. Common methods include:
- Permutation Test: randomly swap the per-query results of system A and system B many times (e.g., 10,000 times) to build a random distribution of differences. If the actual difference exceeds 95% of the random cases (p < 0.05), the improvement is considered significant.
- Paired t-test: assumes the per-query scores of the two systems follow a normal distribution and tests the significance of the difference between their means.

The analysis reports the following fields:
- A_mean / B_mean: average metric of the new and old systems;
- Diff(A-B): magnitude of the improvement;
- p_value: p-value of the significance test;
- significant: significance judgment (True when p < 0.05).
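The snippet below sketches both tests with NumPy and SciPy on made-up per-query scores. It is a standalone illustration of the statistics involved, not the Evaluation Server's built-in significance tool.

```python
import numpy as np
from scipy import stats

# Per-query scores (e.g., NDCG per query) for two systems; the values are made up.
a = np.array([0.62, 0.48, 0.71, 0.55, 0.60, 0.44, 0.67, 0.58])  # system A (new)
b = np.array([0.58, 0.47, 0.65, 0.54, 0.57, 0.45, 0.61, 0.52])  # system B (old)

# Paired t-test: assumes the per-query differences are roughly normally distributed.
t_stat, p_value = stats.ttest_rel(a, b)

# Permutation (randomization) test: swapping A/B for a query is equivalent to
# flipping the sign of that query's difference; count how often the resulting
# mean difference is at least as large as the observed one.
rng = np.random.default_rng(0)
diff = a - b
observed = diff.mean()
n_perm = 10_000
hits = sum(
    abs((rng.choice([-1, 1], size=diff.size) * diff).mean()) >= abs(observed)
    for _ in range(n_perm)
)
perm_p = hits / n_perm

print(f"A_mean={a.mean():.4f}  B_mean={b.mean():.4f}  Diff(A-B)={observed:.4f}")
print(f"t-test p_value={p_value:.4f}  permutation p_value={perm_p:.4f}  "
      f"significant={p_value < 0.05}")
```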
Generation
Basic Usage
Embed the `evaluation.evaluate` tool at the end of the Pipeline to automatically calculate all specified evaluation metrics after the task finishes, and output the results to the path set in the configuration file.
Evaluate Existing Results
If you already have a result file generated by the model and wish to evaluate it directly, you can organize the results into a standardized JSONL format. The file should contain at least one field for the reference answer labels and one field for the model's generated results.
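For illustration, such a file might look like the following (the key names `answer` and `prediction` here are placeholders, not required names; use whatever keys your data actually contains):

```jsonl
{"question": "Who wrote Hamlet?", "answer": ["William Shakespeare"], "prediction": "Hamlet was written by William Shakespeare."}
{"question": "What is the capital of France?", "answer": ["Paris"], "prediction": "Paris"}
```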
Then add the `pred_ls` field in the `get_data` function in `servers/prompt/src/benchmark.py`, specify the key name it corresponds to in the original data, and modify the data path and name to point to the new evaluation file.