# Evaluation

## Purpose

The Evaluation Server provides a set of automated evaluation tools for systematic, reproducible benchmarking of model outputs on retrieval and generation tasks. It supports common ranking, matching, and summarization metrics, and can be placed at the end of a Pipeline so that results are computed and saved automatically.

### Retrieval

| Metric      | Type  | Description                                                                                       |
| :---------- | :---- | :------------------------------------------------------------------------------------------------ |
| `MRR`       | float | Mean Reciprocal Rank: the average reciprocal rank of the first relevant document.                 |
| `MAP`       | float | Mean Average Precision: averages precision over the ranks of relevant documents, so it reflects both precision and recall. |
| `Recall`    | float | The fraction of relevant documents that the retrieval system returns.                             |
| `Precision` | float | The fraction of returned documents that are relevant.                                             |
| `NDCG`      | float | Normalized Discounted Cumulative Gain: how closely the returned ranking matches the ideal ranking. |

### Generation

| Metric     | Type  | Description                                                                          |
| :--------- | ----- | :----------------------------------------------------------------------------------- |
| `EM`       | float | Exact Match: the prediction is identical to any one of the references.               |
| `Acc`      | float | The prediction contains any one of the reference answers (loose matching).           |
| `StringEM` | float | Soft-match ratio over multiple answer groups (common in multi-answer or nested QA).  |
| `CoverEM`  | float | Whether the reference answer is fully covered by the predicted text.                 |
| `F1`       | float | Token-level F1 score.                                                                |
| `Rouge_1`  | float | 1-gram ROUGE-F1.                                                                     |
| `Rouge_2`  | float | 2-gram ROUGE-F1.                                                                     |
| `Rouge_L`  | float | ROUGE based on the Longest Common Subsequence (LCS).                                 |

## Usage Examples

### Retrieval

#### TREC File Evaluation

In information retrieval, TREC-format files are the standard evaluation interface for measuring ranking and recall performance. A TREC evaluation usually involves two kinds of files: qrel (human-annotated relevance judgments) and run (the retrieval system's output).

**1. qrel file ("ground truth", human-annotated relevance)**

A qrel file stores human relevance judgments, that is, which documents are relevant to which query. During evaluation, the system's retrieval results are compared against the qrel file to compute metrics such as MAP, NDCG, Recall, and Precision.

Format (4 columns, space-separated):

```
<query_id> <iter> <doc_id> <relevance>
```

* `query_id`: query ID
* `iter`: usually `0` (a legacy field that can be ignored)
* `doc_id`: document ID
* `relevance`: relevance label (usually 0 means not relevant; 1 or higher means relevant)

Example:

```
1 0 DOC123 1
1 0 DOC456 0
2 0 DOC321 1
2 0 DOC654 1
```

**2. run file (retrieval results produced by the system)**

A run file stores the retrieval system's output, which is compared against the qrel file to evaluate performance. Each line records one document returned for a query, together with its rank and score.

Format (6 columns, space-separated):

```
<query_id> Q0 <doc_id> <rank> <score> <run_name>
```

* `query_id`: query ID
* `Q0`: always the literal `Q0` (required by the TREC standard)
* `doc_id`: document ID
* `rank`: rank position (1 is the most relevant)
* `score`: score assigned by the system
* `run_name`: system name (for example bm25, dense\_retriever)

Example:

```
1 Q0 DOC123 1 12.34 bm25
1 Q0 DOC456 2 11.21 bm25
2 Q0 DOC654 1 13.89 bm25
2 Q0 DOC321 2 12.01 bm25
```

Sample files are available at the following links: [qrels.test](https://github.com/usnistgov/trec_eval/blob/main/test/qrels.test) and [results.test](https://github.com/usnistgov/trec_eval/blob/main/test/results.test)

```yaml examples/eval_trec.yaml
# MCP Server
servers:
  evaluation: servers/evaluation

# MCP Client Pipeline
pipeline:
- evaluation.evaluate_trec
```

Run the following command to build the Pipeline:

```shell
ultrarag build examples/eval_trec.yaml
```

Point the parameter file at the sample files:

```yaml examples/parameters/eval_trec_parameter.yaml
evaluation:
  ir_metrics:
  - mrr
  - map
  - recall
  - ndcg
  - precision
  ks:
  - 1
  - 5
  - 10
  - 20
  - 50
  - 100
  qrels_path: data/qrels.txt # [!code --]
  run_path: data/run_a.txt # [!code --]
  qrels_path: data/qrels.test # [!code ++]
  run_path: data/results.test # [!code ++]
  save_path: output/evaluate_results.json
```

Run the following command to execute the Pipeline:

```shell
ultrarag run examples/eval_trec.yaml
```

#### Significance Analysis

Significance testing determines whether a performance difference between two retrieval systems is real rather than the result of random variation. The core question it answers is: is system A's improvement statistically significant?

In retrieval tasks, system performance is usually reported as a metric averaged over many queries (MAP, NDCG, Recall, and so on). An improvement in the average is not necessarily reliable, because scores vary from query to query. Significance testing uses a statistical test to assess whether the improvement is stable and reproducible.

Common significance tests include:

* **Permutation test**: randomly swaps the per-query results of system A and system B many times (for example 10000 times) to build the distribution of differences expected by chance. If the observed difference exceeds 95% of the random cases (p \< 0.05), the improvement is considered significant.
* **Paired t-test**: assumes the per-query score differences between the two systems are approximately normally distributed, and tests whether their mean differs significantly from zero.

UltraRAG has a built-in two-sided permutation test, which reports the following statistics during automatic evaluation:

* **A\_mean / B\_mean**: the mean metric of the new and old systems;
* **Diff(A-B)**: the size of the improvement;
* **p\_value**: the p-value of the significance test;
* **significant**: the significance decision (True when p \< 0.05).

```yaml examples/eval_trec_pvalue.yaml
# MCP Server
servers:
  evaluation: servers/evaluation

# MCP Client Pipeline
pipeline:
- evaluation.evaluate_trec_pvalue
```

Run the following command to build the Pipeline:

```shell
ultrarag build examples/eval_trec_pvalue.yaml
```

```yaml examples/parameters/eval_trec_pvalue_parameter.yaml
evaluation:
  ir_metrics:
  - mrr
  - map
  - recall
  - ndcg
  - precision
  ks:
  - 1
  - 5
  - 10
  - 20
  - 50
  - 100
  n_resamples: 10000
  qrels_path: data/qrels.txt
  run_new_path: data/run_a.txt
  run_old_path: data/run_b.txt
  save_path: output/evaluate_results.json
```

Run the following command to execute the Pipeline:

```shell
ultrarag run examples/eval_trec_pvalue.yaml
```

### Generation

#### Basic Usage

```yaml examples/rag_full.yaml
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever
  prompt: servers/prompt
  generation: servers/generation
  evaluation: servers/evaluation
  custom: servers/custom

pipeline:
- benchmark.get_data
- retriever.retriever_init
- retriever.retriever_embed
- retriever.retriever_index
- retriever.retriever_search
- generation.generation_init
- prompt.qa_rag_boxed
- generation.generate
- custom.output_extract_from_boxed
- evaluation.evaluate
```

Adding the `evaluation.evaluate` tool at the end of the Pipeline is enough: once the task finishes, all specified metrics are computed automatically and the results are written to the path set in the parameter file.

#### Evaluating Existing Results

If you already have a file of model-generated results and want to evaluate it directly, arrange the results in a standardized JSONL format. The file should contain at least one field for the answer labels and one for the generated output, for example:

```json
{"id": 0, "question": "when was the last time anyone was on the moon", "golden_answers": ["14 December 1972 UTC", "December 1972"], "pred_answer": "December 14, 1973"}
{"id": 1, "question": "who wrote he ain't heavy he's my brother lyrics", "golden_answers": ["Bobby Scott", "Bob Russell"], "pred_answer": "The documents do not provide information about the author of the lyrics to \"He Ain't Heavy, He's My Brother.\""}
```

```yaml examples/evaluate_results.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  evaluation: servers/evaluation

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- evaluation.evaluate
```

For the Benchmark Server to read the generated results, add the `pred_ls` field to the output of the `get_data` function:

```python servers/prompt/src/benchmark.py
@app.tool(output="benchmark->q_ls,gt_ls") # [!code --]
@app.tool(output="benchmark->q_ls,gt_ls,pred_ls") # [!code ++]
def get_data(
    benchmark: Dict[str, Any],
) -> Dict[str, List[Any]]:
```

Then run the following command to build the Pipeline:

```shell
ultrarag build examples/evaluate_results.yaml
```

In the generated parameter file, add the `pred_ls` field and set it to the corresponding key in the raw data, and change the data path and name to point at the new evaluation file:

```yaml examples/parameters/evaluate_results_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
      pred_ls: pred_answer # [!code ++]
    limit: -1
    name: nq # [!code --]
    path: data/sample_nq_10.jsonl # [!code --]
    name: evaluate # [!code ++]
    path: data/test_evaluate.jsonl # [!code ++]
    seed: 42
    shuffle: false
evaluation:
  metrics:
  - acc
  - f1
  - em
  - coverem
  - stringem
  - rouge-1
  - rouge-2
  - rouge-l
  save_path: output/evaluate_results.json
```

Run the following command to execute the Pipeline:

```shell
ultrarag run examples/evaluate_results.yaml
```
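
Before running the Pipeline on a results file, it can help to sanity-check the JSONL records by hand. The sketch below is not the Evaluation Server's implementation: it assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles), which may differ from what UltraRAG does, and the helper names `normalize`, `exact_match`, `token_f1`, and `score_jsonl_lines` are hypothetical. It reads records with the `golden_answers` and `pred_answer` fields shown above and computes an approximate EM and token-level F1.

```python
import json
import re
import string
from collections import Counter
from typing import Dict, Iterable


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, golds: Iterable[str]) -> float:
    """1.0 if the normalized prediction equals any normalized reference."""
    p = normalize(pred)
    return float(any(p == normalize(g) for g in golds))


def token_f1(pred: str, golds: Iterable[str]) -> float:
    """Best token-level F1 of the prediction against any single reference."""
    pred_tokens = normalize(pred).split()
    best = 0.0
    for gold in golds:
        gold_tokens = normalize(gold).split()
        # Multiset intersection counts each shared token at most min(count) times.
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def score_jsonl_lines(lines: Iterable[str]) -> Dict[str, float]:
    """Average EM and F1 over JSONL records; blank lines are skipped."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        raise ValueError("no records to score")
    em = [exact_match(r["pred_answer"], r["golden_answers"]) for r in records]
    f1 = [token_f1(r["pred_answer"], r["golden_answers"]) for r in records]
    return {"em": sum(em) / len(records), "f1": sum(f1) / len(records)}


# The two sample records from the JSONL example, serialized one per line.
sample = [
    json.dumps({
        "id": 0,
        "question": "when was the last time anyone was on the moon",
        "golden_answers": ["14 December 1972 UTC", "December 1972"],
        "pred_answer": "December 14, 1973",
    }),
    json.dumps({
        "id": 1,
        "question": "who wrote he ain't heavy he's my brother lyrics",
        "golden_answers": ["Bobby Scott", "Bob Russell"],
        "pred_answer": "The documents do not provide information about the "
                       "author of the lyrics to \"He Ain't Heavy, He's My Brother.\"",
    }),
]
print(score_jsonl_lines(sample))
```

To check a real file, pass an open file handle instead of `sample`, for example `score_jsonl_lines(open("data/test_evaluate.jsonl", encoding="utf-8"))`. Scores from this sketch are only a rough cross-check; the values written to `save_path` by `evaluation.evaluate` remain the reference.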