Overview
The Evaluation Server provides a comprehensive automated evaluation toolkit for systematically and reproducibly assessing model performance in both retrieval and generation tasks. It supports multiple mainstream metrics, including ranking-based, matching-based, and summarization-based evaluations.
This module can be directly embedded at the end of a Pipeline to automatically calculate and save evaluation results.
Retrieval
| Metric Name | Type | Description |
|---|---|---|
| MRR | float | Mean Reciprocal Rank — measures the average rank position of the first relevant document. |
| MAP | float | Mean Average Precision — considers both precision and recall across ranked results. |
| Recall | float | Recall — measures how many of the relevant documents were retrieved. |
| Precision | float | Precision — measures how many of the retrieved documents are actually relevant. |
| NDCG | float | Normalized Discounted Cumulative Gain — evaluates the similarity between the ranked list and the ideal ranking. |
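For intuition, here is a minimal pure-Python sketch of how these ranking metrics are defined for a single query. The function names and example data are illustrative only, not part of the Evaluation Server's API.

```python
# Illustrative sketch of per-query ranking metrics; not the toolkit's implementation.
import math

def reciprocal_rank(ranked_ids, relevance):
    # Contribution of one query to MRR: 1 / position of the first relevant doc.
    for pos, doc_id in enumerate(ranked_ids, start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / pos
    return 0.0

def precision_recall_at_k(ranked_ids, relevance, k):
    relevant = {d for d, r in relevance.items() if r > 0}
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def ndcg_at_k(ranked_ids, relevance, k):
    # `relevance` maps doc_id -> graded label (0, 1, 2, ...).
    def dcg(ids):
        return sum(
            (2 ** relevance.get(doc_id, 0) - 1) / math.log2(pos + 1)
            for pos, doc_id in enumerate(ids[:k], start=1)
        )
    ideal_order = sorted(relevance, key=relevance.get, reverse=True)
    ideal_dcg = dcg(ideal_order)
    return dcg(ranked_ids) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: one query where documents d2 and d5 are judged relevant.
ranked = ["d1", "d2", "d3", "d4", "d5"]
rels = {"d2": 1, "d5": 1}
print(reciprocal_rank(ranked, rels))           # 0.5
print(precision_recall_at_k(ranked, rels, 3))  # (0.333..., 0.5)
print(ndcg_at_k(ranked, rels, 5))
```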
Generation
| Metric Name | Type | Description |
|---|---|---|
| EM | float | Exact Match — the prediction exactly matches any of the references. |
| Acc | float | Accuracy — the prediction contains any form of the reference answer (loose matching). |
| StringEM | float | Soft match ratio for multiple answers (commonly used in multi-choice or nested QA). |
| CoverEM | float | Whether the reference answer is fully covered by the predicted text. |
| F1 | float | Token-level F1 score. |
| Rouge_1 | float | 1-gram ROUGE-F1 score. |
| Rouge_2 | float | 2-gram ROUGE-F1 score. |
| Rouge_L | float | ROUGE based on the Longest Common Subsequence (LCS). |
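As a reference for how the matching-based metrics behave, below is a minimal sketch of Exact Match and token-level F1 with simplified normalization (lowercasing and whitespace tokenization). The helper names are illustrative, not the toolkit's implementation.

```python
# Illustrative sketch of EM and token-level F1; normalization is simplified.
from collections import Counter

def exact_match(prediction: str, references: list[str]) -> float:
    norm = prediction.strip().lower()
    return float(any(norm == ref.strip().lower() for ref in references))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", ["paris", "the city of Paris"]))      # 1.0
print(token_f1("the capital is Paris", "Paris is the capital"))  # 1.0
```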
Usage Example
Retrieval
TREC File Evaluation
In information retrieval, the TREC format is the standard evaluation interface used to assess system performance in ranking and recall. TREC evaluation typically involves two files: a qrel file (manually annotated ground truth) and a run file (system output results).

1. qrel file (ground truth, human relevance judgments)

The qrel file stores human-labeled relevance information — which documents are relevant to which queries.
During evaluation, system outputs are compared with qrel data to compute metrics such as MAP, NDCG, Recall, and Precision. Format (4 columns, space-separated):
- query_id: query identifier
- iter: typically 0 (legacy field, can be ignored)
- doc_id: document identifier
- relevance: relevance label (0 = irrelevant, 1 or higher = relevant)
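If you want to inspect or build qrel files programmatically, a minimal parsing sketch could look like the following; the function name and sample line are illustrative only, not part of the toolkit.

```python
# Illustrative sketch: parse a qrel file into {query_id: {doc_id: relevance}}.
def load_qrels(path: str) -> dict[str, dict[str, int]]:
    qrels: dict[str, dict[str, int]] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            query_id, _iter, doc_id, relevance = line.split()
            qrels.setdefault(query_id, {})[doc_id] = int(relevance)
    return qrels

# Hypothetical qrel line (query 101 judges document D42 as relevant):
# 101 0 D42 1
```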
2. run file (system retrieval results)

Each line represents one document retrieved for a query, along with its ranking information. Format (6 columns, space-separated):

- query_id: query identifier
- Q0: constant string "Q0" (required by TREC)
- doc_id: document identifier
- rank: rank position (1 = most relevant)
- score: system-assigned score
- run_name: system name (e.g., bm25, dense_retriever)
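Similarly, a run file can be read into per-query ranked lists; again, the function name and sample line below are illustrative only.

```python
# Illustrative sketch: parse a run file into per-query doc lists, best score first.
def load_run(path: str) -> dict[str, list[str]]:
    run: dict[str, list[tuple[float, str]]] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            query_id, _q0, doc_id, _rank, score, _run_name = line.split()
            run.setdefault(query_id, []).append((float(score), doc_id))
    return {
        qid: [doc_id for _, doc_id in sorted(docs, reverse=True)]
        for qid, docs in run.items()
    }

# Hypothetical run line:
# 101 Q0 D42 1 12.73 bm25
```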
You can download example files here: qrels.test and results.test
Significance Testing
Significance testing determines whether performance differences between two retrieval systems are statistically meaningful or due to random variation. It answers the question: is the improvement of system A over system B statistically significant? Retrieval performance is usually measured as an average over multiple queries (e.g., MAP, NDCG, Recall).
However, such averages may not always be reliable due to query variability.
Significance testing uses statistical tests to verify whether improvements are consistent and reproducible. Common approaches include:
- Permutation Test — randomly swaps results between systems A and B many times (e.g., 10,000 iterations) to build a random difference distribution. If the observed improvement exceeds 95% of random outcomes (p < 0.05), it is considered significant.
- Paired t-test — assumes per-query scores follow a normal distribution and evaluates whether the difference in means is significant.
The output of the significance test includes the following fields:

- A_mean / B_mean — average metrics of the new and old systems
- Diff(A-B) — performance difference
- p_value — probability from the significance test
- significant — True if p < 0.05
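To make the relationship between these fields concrete, here is a minimal sketch of a paired sign-flip permutation test and a paired t-test over per-query scores. It assumes numpy and scipy are available; the function name, number of iterations, and example scores are illustrative, not the Evaluation Server's implementation.

```python
# Illustrative sketch of paired significance testing on per-query scores.
import numpy as np
from scipy import stats

def significance_report(scores_a, scores_b, n_permutations=10_000, seed=0):
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diffs = a - b
    observed = diffs.mean()

    # Permutation test: randomly flip the sign of each per-query difference,
    # which corresponds to swapping systems A and B for that query.
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted_means = (signs * diffs).mean(axis=1)
    p_perm = float(np.mean(np.abs(permuted_means) >= abs(observed)))

    # Paired t-test for comparison.
    p_ttest = float(stats.ttest_rel(a, b).pvalue)

    return {
        "A_mean": float(a.mean()),
        "B_mean": float(b.mean()),
        "Diff(A-B)": float(observed),
        "p_value": p_perm,
        "p_value_ttest": p_ttest,
        "significant": p_perm < 0.05,
    }

# Example: hypothetical per-query NDCG for systems A and B on five queries.
print(significance_report([0.62, 0.55, 0.70, 0.48, 0.66],
                          [0.58, 0.50, 0.69, 0.47, 0.60]))
```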
Generation
Basic Usage
Add the evaluation.evaluate tool at the end of the Pipeline. It will automatically compute all specified evaluation metrics after the task completes and save the results to the configured output path.
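Conceptually, this final step scores each sample with the selected metrics, averages the scores, and writes the result to the output path. The sketch below illustrates that idea with a hypothetical function and metric registry; it is not the evaluation.evaluate API.

```python
# Illustrative sketch of what the final evaluation step does conceptually.
import json

def em(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())

METRICS = {"EM": em}  # hypothetical registry; the server supports many more metrics

def evaluate_and_save(predictions, references, metric_names, output_path):
    results = {}
    for name in metric_names:
        scores = [METRICS[name](p, r) for p, r in zip(predictions, references)]
        results[name] = sum(scores) / len(scores)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results

print(evaluate_and_save(["Paris"], ["paris"], ["EM"], "eval_results.json"))
```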
Evaluate Existing Results
If you already have generated model results and want to evaluate them directly, organize them in standard JSONL format. The file should include at least the fields representing reference answers and model predictions, for example:
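The sketch below shows one way to produce such a file; the field names (question, golden_answers, prediction) are placeholders, since you map your own keys in the configuration described next.

```python
# Illustrative sketch: write evaluation data as JSONL, one JSON object per line.
# Field names here are hypothetical placeholders.
import json

records = [
    {
        "question": "What is the capital of France?",
        "golden_answers": ["Paris"],
        "prediction": "The capital of France is Paris.",
    },
]

with open("eval_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```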
Next, add a pred_ls field in the get_data function:
servers/prompt/src/benchmark.py
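The snippet below is a rough illustration of the idea only; it is not the actual contents of servers/prompt/src/benchmark.py, and the field and key names other than pred_ls are hypothetical.

```python
# Illustrative sketch: a get_data-style loader that exposes model predictions
# as pred_ls alongside the reference answers. Not the real benchmark.py code.
import json

def get_data(path: str, pred_key: str = "prediction", ref_key: str = "golden_answers"):
    questions, gt_ls, pred_ls = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            questions.append(item.get("question", ""))
            gt_ls.append(item[ref_key])
            pred_ls.append(item[pred_key])  # key that holds your model's predictions
    return {"q_ls": questions, "gt_ls": gt_ls, "pred_ls": pred_ls}
```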
Finally, add pred_ls to the configuration and specify its corresponding key in the original data, while updating the dataset name and path to point to the evaluation file.