Overview
The Evaluation Server provides a comprehensive automated evaluation toolkit for systematically and reproducibly assessing model performance in both retrieval and generation tasks. It supports multiple mainstream metrics, including ranking-based, matching-based, and summarization-based evaluations.
This module can be directly embedded at the end of a Pipeline to automatically calculate and save evaluation results.
Retrieval
| Metric Name | Type | Description |
|---|---|---|
| MRR | float | Mean Reciprocal Rank — measures the average rank position of the first relevant document. |
| MAP | float | Mean Average Precision — considers both precision and recall across ranked results. |
| Recall | float | Recall — measures how many of the relevant documents were retrieved. |
| Precision | float | Precision — measures how many of the retrieved documents are actually relevant. |
| NDCG | float | Normalized Discounted Cumulative Gain — evaluates the similarity between the ranked list and the ideal ranking. |
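For intuition, here is a minimal pure-Python sketch of how these ranking metrics are defined for a single query. The function names and example data are illustrative only, not part of the Evaluation Server's API.

```python
# Illustrative sketch of per-query ranking metrics; not the toolkit's implementation.
import math

def reciprocal_rank(ranked_ids, relevance):
    # Contribution of one query to MRR: 1 / position of the first relevant doc.
    for pos, doc_id in enumerate(ranked_ids, start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / pos
    return 0.0

def precision_recall_at_k(ranked_ids, relevance, k):
    relevant = {d for d, r in relevance.items() if r > 0}
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def ndcg_at_k(ranked_ids, relevance, k):
    # `relevance` maps doc_id -> graded label (0, 1, 2, ...).
    def dcg(ids):
        return sum(
            (2 ** relevance.get(doc_id, 0) - 1) / math.log2(pos + 1)
            for pos, doc_id in enumerate(ids[:k], start=1)
        )
    ideal_order = sorted(relevance, key=relevance.get, reverse=True)
    ideal_dcg = dcg(ideal_order)
    return dcg(ranked_ids) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: one query where documents d2 and d5 are judged relevant.
ranked = ["d1", "d2", "d3", "d4", "d5"]
rels = {"d2": 1, "d5": 1}
print(reciprocal_rank(ranked, rels))           # 0.5
print(precision_recall_at_k(ranked, rels, 3))  # (0.333..., 0.5)
print(ndcg_at_k(ranked, rels, 5))
```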
Generation
| Metric Name | Type | Description |
|---|---|---|
| EM | float | Exact Match — the prediction exactly matches any of the references. |
| Acc | float | Accuracy — the prediction contains any form of the reference answer (loose matching). |
| StringEM | float | Soft match ratio for multiple answers (commonly used in multi-choice or nested QA). |
| CoverEM | float | Whether the reference answer is fully covered by the predicted text. |
| F1 | float | Token-level F1 score. |
| Rouge_1 | float | 1-gram ROUGE-F1 score. |
| Rouge_2 | float | 2-gram ROUGE-F1 score. |
| Rouge_L | float | ROUGE based on the Longest Common Subsequence (LCS). |
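As a reference for how the matching-based metrics behave, below is a minimal sketch of Exact Match and token-level F1 with simplified normalization (lowercasing and whitespace tokenization). The helper names are illustrative, not the toolkit's implementation.

```python
# Illustrative sketch of EM and token-level F1; normalization is simplified.
from collections import Counter

def exact_match(prediction: str, references: list[str]) -> float:
    norm = prediction.strip().lower()
    return float(any(norm == ref.strip().lower() for ref in references))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", ["paris", "the city of Paris"]))      # 1.0
print(token_f1("the capital is Paris", "Paris is the capital"))  # 1.0
```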
Usage Example
Retrieval
TREC File Evaluation
In information retrieval, the TREC format is the standard evaluation interface used to assess system performance in ranking and recall. TREC evaluation typically involves two files: a qrel file (manually annotated ground truth) and a run file (system output results).

1. qrel file (ground truth, human relevance judgments)

The qrel file stores human-labeled relevance information — which documents are relevant to which queries.
During evaluation, system outputs are compared with qrel data to compute metrics such as MAP, NDCG, Recall, and Precision. Format (4 columns, space-separated):
- query_id: query identifier
- iter: typically 0 (legacy field, can be ignored)
- doc_id: document identifier
- relevance: relevance label (0 = irrelevant, 1 or higher = relevant)
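If you want to inspect or build qrel files programmatically, a minimal parsing sketch could look like the following; the function name and sample line are illustrative only, not part of the toolkit.

```python
# Illustrative sketch: parse a qrel file into {query_id: {doc_id: relevance}}.
def load_qrels(path: str) -> dict[str, dict[str, int]]:
    qrels: dict[str, dict[str, int]] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            query_id, _iter, doc_id, relevance = line.split()
            qrels.setdefault(query_id, {})[doc_id] = int(relevance)
    return qrels

# Hypothetical qrel line (query 101 judges document D42 as relevant):
# 101 0 D42 1
```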
2. run file (system retrieval results)

Each line represents one document retrieved for a query, along with its ranking information. Format (6 columns, space-separated):

- query_id: query identifier
- Q0: constant string "Q0" (required by TREC)
- doc_id: document identifier
- rank: rank position (1 = most relevant)
- score: system-assigned score
- run_name: system name (e.g., bm25, dense_retriever)
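Similarly, a run file can be read into per-query ranked lists; again, the function name and sample line below are illustrative only.

```python
# Illustrative sketch: parse a run file into per-query doc lists, best score first.
def load_run(path: str) -> dict[str, list[str]]:
    run: dict[str, list[tuple[float, str]]] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            query_id, _q0, doc_id, _rank, score, _run_name = line.split()
            run.setdefault(query_id, []).append((float(score), doc_id))
    return {
        qid: [doc_id for _, doc_id in sorted(docs, reverse=True)]
        for qid, docs in run.items()
    }

# Hypothetical run line:
# 101 Q0 D42 1 12.73 bm25
```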
You can download example files here: qrels.test and results.test
Significance Testing
Significance testing determines whether performance differences between two retrieval systems are statistically meaningful or due to random variation. It answers the question: is the improvement of system A over system B statistically significant? Retrieval performance is usually measured as an average over multiple queries (e.g., MAP, NDCG, Recall).
However, such averages may not always be reliable due to query variability.
Significance testing uses statistical tests to verify whether improvements are consistent and reproducible. Common approaches include:
- Permutation Test — randomly swaps results between systems A and B many times (e.g., 10,000 iterations) to build a random difference distribution. If the observed improvement exceeds 95% of random outcomes (p < 0.05), it is considered significant.
- Paired t-test — assumes per-query scores follow a normal distribution and evaluates whether the difference in means is significant.
The output of the significance test includes the following fields:

- A_mean / B_mean — average metrics of the new and old systems
- Diff(A-B) — performance difference
- p_value — probability from the significance test
- significant — True if p < 0.05
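To make the relationship between these fields concrete, here is a minimal sketch of a paired sign-flip permutation test and a paired t-test over per-query scores. It assumes numpy and scipy are available; the function name, number of iterations, and example scores are illustrative, not the Evaluation Server's implementation.

```python
# Illustrative sketch of paired significance testing on per-query scores.
import numpy as np
from scipy import stats

def significance_report(scores_a, scores_b, n_permutations=10_000, seed=0):
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diffs = a - b
    observed = diffs.mean()

    # Permutation test: randomly flip the sign of each per-query difference,
    # which corresponds to swapping systems A and B for that query.
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted_means = (signs * diffs).mean(axis=1)
    p_perm = float(np.mean(np.abs(permuted_means) >= abs(observed)))

    # Paired t-test for comparison.
    p_ttest = float(stats.ttest_rel(a, b).pvalue)

    return {
        "A_mean": float(a.mean()),
        "B_mean": float(b.mean()),
        "Diff(A-B)": float(observed),
        "p_value": p_perm,
        "p_value_ttest": p_ttest,
        "significant": p_perm < 0.05,
    }

# Example: hypothetical per-query NDCG for systems A and B on five queries.
print(significance_report([0.62, 0.55, 0.70, 0.48, 0.66],
                          [0.58, 0.50, 0.69, 0.47, 0.60]))
```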
Generation
Basic Usage
Add the evaluation.evaluate tool at the end of the Pipeline. It will automatically compute all specified evaluation metrics after the task completes and save the results to the configured output path.
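Conceptually, this final step scores each sample with the selected metrics, averages the scores, and writes the result to the output path. The sketch below illustrates that idea with a hypothetical function and metric registry; it is not the evaluation.evaluate API.

```python
# Illustrative sketch of what the final evaluation step does conceptually.
import json

def em(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())

METRICS = {"EM": em}  # hypothetical registry; the server supports many more metrics

def evaluate_and_save(predictions, references, metric_names, output_path):
    results = {}
    for name in metric_names:
        scores = [METRICS[name](p, r) for p, r in zip(predictions, references)]
        results[name] = sum(scores) / len(scores)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results

print(evaluate_and_save(["Paris"], ["paris"], ["EM"], "eval_results.json"))
```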
Evaluate Existing Results
If you already have generated model results and want to evaluate them directly, organize them in standard JSONL format. The file should include at least the fields representing reference answers and model predictions, for example:
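The sketch below shows one way to produce such a file; the field names (question, golden_answers, prediction) are placeholders, since you map your own keys in the configuration described next.

```python
# Illustrative sketch: write evaluation data as JSONL, one JSON object per line.
# Field names here are hypothetical placeholders.
import json

records = [
    {
        "question": "What is the capital of France?",
        "golden_answers": ["Paris"],
        "prediction": "The capital of France is Paris.",
    },
]

with open("eval_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```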
Next, add a pred_ls field in the get_data function:
servers/prompt/src/benchmark.py
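The snippet below is a rough illustration of the idea only; it is not the actual contents of servers/prompt/src/benchmark.py, and the field and key names other than pred_ls are hypothetical.

```python
# Illustrative sketch: a get_data-style loader that exposes model predictions
# as pred_ls alongside the reference answers. Not the real benchmark.py code.
import json

def get_data(path: str, pred_key: str = "prediction", ref_key: str = "golden_answers"):
    questions, gt_ls, pred_ls = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            questions.append(item.get("question", ""))
            gt_ls.append(item[ref_key])
            pred_ls.append(item[pred_key])  # key that holds your model's predictions
    return {"q_ls": questions, "gt_ls": gt_ls, "pred_ls": pred_ls}
```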
Finally, add pred_ls to the configuration and specify its corresponding key in the original data, while updating the dataset name and path to point to the evaluation file.