Function

The Evaluation Server provides a comprehensive set of automated evaluation tools for systematic, reproducible assessment of model outputs in retrieval and generation tasks. It supports mainstream ranking-based, matching-based, and summarization-based metrics, and it can be appended to the end of a Pipeline to compute and save evaluation results automatically.

Retrieval

Metric Name | Type | Description
MRR | float | Mean Reciprocal Rank, measuring the average rank position of the first relevant document.
MAP | float | Mean Average Precision, comprehensively considering retrieval precision and recall.
Recall | float | Recall rate, measuring how many relevant documents the retrieval system can find.
Precision | float | Precision rate, measuring how many of the retrieval results are relevant documents.
NDCG | float | Normalized Discounted Cumulative Gain, evaluating the consistency between retrieval results and the ideal ranking.
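For intuition, here is a minimal plain-Python sketch of how two of these metrics can be computed for a single query (illustrative only, not the Evaluation Server's internal implementation; the document IDs and cutoff k are made up):

# Illustrative computation of MRR and Recall@k for one query.
def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Corpus-level scores are the average of these values over all queries.
print(mrr(["DOC456", "DOC123"], {"DOC123"}))             # 0.5
print(recall_at_k(["DOC456", "DOC123"], {"DOC123"}, 1))  # 0.0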

Generation

Metric Name | Type | Description
EM | float | Exact Match, the prediction exactly matches one of the reference answers.
Acc | float | Accuracy, the prediction contains the reference answer in some form (loose matching).
StringEM | float | Soft match ratio over multiple sets of answers (commonly used for multiple-choice / nested QA).
CoverEM | float | Whether the reference answer is completely covered by the predicted text.
F1 | float | Token-level F1 score.
Rouge_1 | float | 1-gram ROUGE-F1.
Rouge_2 | float | 2-gram ROUGE-F1.
Rouge_L | float | ROUGE based on the Longest Common Subsequence (LCS).
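As a rough illustration, EM and token-level F1 can be computed along the following lines (a simplified sketch; the actual implementation may apply additional normalization such as lowercasing and punctuation removal):

# Simplified EM and token-level F1 (illustrative; real metrics typically normalize text first).
from collections import Counter

def exact_match(prediction, references):
    """1.0 if the prediction equals any reference exactly, else 0.0."""
    return float(any(prediction.strip() == ref.strip() for ref in references))

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall against one reference."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("December 1972", ["14 December 1972 UTC", "December 1972"]))  # 1.0
print(token_f1("December 14, 1973", "December 1972"))                           # 0.4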

Usage Examples

Retrieval

TREC File Evaluation

In information retrieval, TREC-format files provide a standardized evaluation interface for measuring model performance in ranking, recall, and related metrics. TREC evaluation usually involves two kinds of files: a qrel file (human-annotated ground-truth relevance) and a run file (the retrieval system's output).

I. qrel file ("ground truth", human-annotated relevance)

The qrel file stores human-annotated relevance judgments, i.e., which documents are relevant to which query. During evaluation, the system's retrieval output is compared against the qrel file to compute metrics such as MAP, NDCG, Recall, and Precision. A minimal parsing sketch in Python follows the example below. Format (4 columns, space-separated):
<query_id>  <iter>  <doc_id>  <relevance>
  • query_id: Query ID
  • iter: Usually 0 (a legacy field that can be ignored)
  • doc_id: Document ID
  • relevance: Relevance annotation (usually 0 means irrelevant, 1 or higher means relevant)
Example:
1 0 DOC123 1
1 0 DOC456 0
2 0 DOC321 1
2 0 DOC654 1
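A qrel file like the one above can be read into a nested mapping with a few lines of Python (an illustrative sketch, not part of the Evaluation Server API):

# Parse a TREC qrel file into {query_id: {doc_id: relevance}} (illustrative sketch).
def load_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            query_id, _iteration, doc_id, relevance = line.split()
            qrels.setdefault(query_id, {})[doc_id] = int(relevance)
    return qrels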
II. run file (system retrieval output)

The run file stores the retrieval system's output and is compared against the qrel file to evaluate performance. Each line records one document returned for a query, together with its rank and score. A parsing and scoring sketch follows the example below. Format (6 columns, space-separated):
<query_id>  Q0  <doc_id>  <rank>  <score>  <run_name>
  • query_id: Query ID
  • Q0: The literal string Q0 (required by the TREC format)
  • doc_id: Document ID
  • rank: Ranking position (1 means most relevant)
  • score: System score
  • run_name: System name (e.g., bm25, dense_retriever)
Example:
1 Q0 DOC123 1 12.34 bm25
1 Q0 DOC456 2 11.21 bm25
2 Q0 DOC654 1 13.89 bm25
2 Q0 DOC321 2 12.01 bm25
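The run file can be parsed in the same way, after which a cutoff metric such as Precision@k follows directly from the two structures (again an illustrative sketch; in practice the Evaluation Server computes these metrics for you from qrels_path and run_path):

# Parse a TREC run file and compute Precision@k against the qrels (illustrative sketch).
def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            query_id, _q0, doc_id, rank, _score, _run_name = line.split()
            run.setdefault(query_id, []).append((int(rank), doc_id))
    # Order each query's documents by their rank.
    return {qid: [doc for _, doc in sorted(pairs)] for qid, pairs in run.items()}

def precision_at_k(qrels, run, k):
    scores = []
    for query_id, judged in qrels.items():
        relevant = {doc for doc, rel in judged.items() if rel > 0}
        retrieved = run.get(query_id, [])[:k]
        hits = sum(1 for doc in retrieved if doc in relevant)
        scores.append(hits / k)
    return sum(scores) / len(scores) if scores else 0.0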
Example files are available for download: qrels.test and results.test.
examples/eval_trec.yaml
# MCP Server
servers:
  evaluation: servers/evaluation

# MCP Client Pipeline
pipeline:
- evaluation.evaluate_trec
Run the following command to compile the Pipeline:
ultrarag build examples/eval_trec.yaml
examples/parameters/eval_trec_parameter.yaml
evaluation:
  ir_metrics:
  - mrr
  - map
  - recall
  - ndcg
  - precision
  ks:
  - 1
  - 5
  - 10
  - 20
  - 50
  - 100
  qrels_path: data/qrels.test
  run_path: data/results.test
  save_path: output/evaluate_results.json

Run the following command to execute this Pipeline:
ultrarag run examples/eval_trec.yaml

Significance Analysis

Significance testing is used to judge whether the performance difference between two retrieval systems is real rather than the result of random fluctuation. The core question it answers is: is the improvement of system A statistically significant?

In retrieval tasks, system performance is usually reported as a metric averaged over many queries (such as MAP, NDCG, or Recall). An improvement in the average is not necessarily reliable, because results vary randomly from query to query. Significance analysis uses statistical tests to assess whether an improvement is stable and reproducible. Common significance analysis methods include:
  • Permutation Test: Randomly swaps the per-query results of system A and system B many times (e.g., 10,000) to build a null distribution of differences. If the observed difference is more extreme than 95% of the random cases (p < 0.05), the improvement is considered significant.
  • Paired t-test: Assuming that the query scores of the two systems follow a normal distribution, calculate the significance of the difference between their means.
UltraRAG has a built-in two-sided permutation test that reports the following key statistics during automatic evaluation (a plain-Python sketch follows this list):
  • A_mean / B_mean: mean metric values of the new and old systems;
  • Diff(A-B): the size of the improvement;
  • p_value: the p-value of the significance test;
  • significant: the significance verdict (True when p < 0.05).
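For reference, a two-sided paired permutation test over per-query scores can be sketched in plain Python as follows (illustrative only; UltraRAG's built-in test may differ in details such as tie handling and the resampling scheme):

# Two-sided paired permutation test on per-query metric scores (illustrative sketch).
import random

def permutation_test(scores_a, scores_b, n_resamples=10000, seed=42):
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each per-query difference (i.e., swap A and B for that query).
        resampled = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(resampled) / len(resampled)) >= abs(observed):
            extreme += 1
    p_value = extreme / n_resamples
    return {
        "A_mean": sum(scores_a) / len(scores_a),
        "B_mean": sum(scores_b) / len(scores_b),
        "Diff(A-B)": observed,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }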
examples/eval_trec_pvalue.yaml
# MCP Server
servers:
  evaluation: servers/evaluation

# MCP Client Pipeline
pipeline:
- evaluation.evaluate_trec_pvalue
Run the following command to compile the Pipeline:
ultrarag build examples/eval_trec_pvalue.yaml
examples/parameters/eval_trec_pvalue_parameter.yaml
evaluation:
  ir_metrics:
  - mrr
  - map
  - recall
  - ndcg
  - precision
  ks:
  - 1
  - 5
  - 10
  - 20
  - 50
  - 100
  n_resamples: 10000
  qrels_path: data/qrels.txt
  run_new_path: data/run_a.txt
  run_old_path: data/run_b.txt
  save_path: output/evaluate_results.json
Run the following command to execute this Pipeline:
ultrarag run examples/eval_trec_pvalue.yaml

Generation

Basic Usage

examples/rag_full.yaml
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever
  prompt: servers/prompt
  generation: servers/generation
  evaluation: servers/evaluation
  custom: servers/custom

pipeline:
- benchmark.get_data
- retriever.retriever_init
- retriever.retriever_embed
- retriever.retriever_index
- retriever.retriever_search
- generation.generation_init
- prompt.qa_rag_boxed
- generation.generate
- custom.output_extract_from_boxed
- evaluation.evaluate
Simply append the evaluation.evaluate tool to the end of the Pipeline: once the task finishes, all specified evaluation metrics are calculated automatically and the results are written to the path set in the configuration file.

Evaluate Existing Results

If you already have a result file generated by the model and want to evaluate it directly, organize the results into a standardized JSONL file. At a minimum, the file should contain fields for the reference answers and the generated predictions, for example (a conversion sketch follows the example):
{"id": 0, "question": "when was the last time anyone was on the moon", "golden_answers": ["14 December 1972 UTC", "December 1972"], "pred_answer": "December 14, 1973"}
{"id": 1, "question": "who wrote he ain't heavy he's my brother lyrics", "golden_answers": ["Bobby Scott", "Bob Russell"], "pred_answer": "The documents do not provide information about the author of the lyrics to \"He Ain't Heavy, He's My Brother.\""}
examples/evaluate_results.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  evaluation: servers/evaluation

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- evaluation.evaluate
To allow the Benchmark Server to read the generation results, add the pred_ls field to the output of the get_data tool:
servers/benchmark/src/benchmark.py
@app.tool(output="benchmark->q_ls,gt_ls") 
@app.tool(output="benchmark->q_ls,gt_ls,pred_ls") 
def get_data(
    benchmark: Dict[str, Any],
) -> Dict[str, List[Any]]:
Then, run the following command to compile the Pipeline:
ultrarag build examples/evaluate_results.yaml
In the generated parameter file, add the pred_ls field to key_map, mapping it to the corresponding key in the original data, and update the dataset name and path to point to the new evaluation file:
examples/parameters/evaluate_results_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
      pred_ls: pred_answer
    limit: -1
    name: evaluate
    path: data/test_evaluate.jsonl
    seed: 42
    shuffle: false
evaluation:
  metrics:
  - acc
  - f1
  - em
  - coverem
  - stringem
  - rouge-1
  - rouge-2
  - rouge-l
  save_path: output/evaluate_results.json
Run the following command to execute this Pipeline:
ultrarag run examples/evaluate_results.yaml