
Overview

The Benchmark Server loads evaluation datasets. It is typically used in the data configuration stage of benchmark testing, question answering (QA), and generation tasks.
We strongly recommend preprocessing your dataset into the .jsonl format.
Example data:
data/sample_nq_10.jsonl
{"id": 0, "question": "when was the last time anyone was on the moon", "golden_answers": ["14 December 1972 UTC", "December 1972"], "meta_data": {}}
{"id": 1, "question": "who wrote he ain't heavy he's my brother lyrics", "golden_answers": ["Bobby Scott", "Bob Russell"], "meta_data": {}}
{"id": 2, "question": "how many seasons of the bastard executioner are there", "golden_answers": ["one", "one season"], "meta_data": {}}
{"id": 3, "question": "when did the eagles win last super bowl", "golden_answers": ["2017"], "meta_data": {}}
{"id": 4, "question": "who won last year's ncaa women's basketball", "golden_answers": ["South Carolina"], "meta_data": {}}

Usage Example

Basic Usage

examples/load_data.yaml
# MCP Server
servers:
  benchmark: servers/benchmark

# MCP Client Pipeline
pipeline:
- benchmark.get_data
Run the following command to build the Pipeline:
ultrarag build examples/load_data.yaml
Then modify parameters as needed:
examples/parameters/load_data_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
Run the following command to execute the Pipeline:
ultrarag run examples/load_data.yaml
After execution, the system automatically loads the dataset and prints sample information, which serves as input for subsequent retrieval and generation tasks.
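Conceptually, these parameters behave as in the sketch below. This is only an illustration of their semantics, not the actual Benchmark Server implementation:
import json
import random

def load_benchmark(path, key_map, limit=-1, shuffle=False, seed=42):
    # One JSON object per line (.jsonl).
    with open(path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f]
    # Shuffling with a fixed seed keeps runs reproducible.
    if shuffle:
        random.Random(seed).shuffle(items)
    # limit of -1 keeps every sample; a non-negative value truncates.
    if limit >= 0:
        items = items[:limit]
    # key_map renames raw-data fields to the lists the pipeline expects,
    # e.g. {"q_ls": "question", "gt_ls": "golden_answers"}.
    return {out_key: [item[raw_key] for item in items]
            for out_key, raw_key in key_map.items()}
For example, load_benchmark("data/sample_nq_10.jsonl", {"q_ls": "question", "gt_ls": "golden_answers"}) would return a dictionary of two parallel lists: one of questions and one of answer lists.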

Adding Additional Dataset Fields

In some cases, you may want to load not only the query and ground-truth fields but also other information from the dataset, such as pre-retrieved passage data.
In such scenarios, you can modify the Benchmark Server code to include additional fields in the output.
You can add other fields (such as cot or retrieved_passages) in the same way: simply add the corresponding key names both to the decorator's output and to key_map.
If you already have generated results (such as a pred field), you can use them with the Evaluation Server to perform quick evaluation.
The following example demonstrates how to add an id_ls field in the get_data function:
servers/benchmark/src/benchmark.py
@app.tool(output="benchmark->q_ls,gt_ls") 
@app.tool(output="benchmark->q_ls,gt_ls,id_ls") 
def get_data(
    benchmark: Dict[str, Any],
) -> Dict[str, List[Any]]:
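Each key declared in the decorator's output corresponds to an entry in key_map, so after this change the tool returns three parallel lists. As a rough sketch of the resulting shape for the sample file above (the server fills these lists generically from key_map rather than hard-coding field names):
{
    "q_ls":  ["when was the last time anyone was on the moon", ...],
    "gt_ls": [["14 December 1972 UTC", "December 1972"], ...],
    "id_ls": [0, ...],
}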
Then, run the following command again to rebuild the Pipeline:
ultrarag build examples/load_data.yaml
In the generated parameter file, add the id_ls field and specify the corresponding key name from the raw data:
examples/parameters/load_data_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
      id_ls: id
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
After completing the modification, rerun the Pipeline to load data samples that include the id field.