
Function

The Benchmark Server loads evaluation datasets. It is typically used in the data-configuration phase of benchmark testing, Q&A tasks, and generation tasks.
We strongly recommend preprocessing data into .jsonl format.
Example data:
data/sample_nq_10.jsonl
{"id": 0, "question": "when was the last time anyone was on the moon", "golden_answers": ["14 December 1972 UTC", "December 1972"], "meta_data": {}}
{"id": 1, "question": "who wrote he ain't heavy he's my brother lyrics", "golden_answers": ["Bobby Scott", "Bob Russell"], "meta_data": {}}
{"id": 2, "question": "how many seasons of the bastard executioner are there", "golden_answers": ["one", "one season"], "meta_data": {}}
{"id": 3, "question": "when did the eagles win last super bowl", "golden_answers": ["2017"], "meta_data": {}}
{"id": 4, "question": "who won last year's ncaa women's basketball", "golden_answers": ["South Carolina"], "meta_data": {}}

Usage Examples

Basic Usage

examples/load_data.yaml
# MCP Server
servers:
  benchmark: servers/benchmark

# MCP Client Pipeline
pipeline:
- benchmark.get_data
Run the following command to compile the Pipeline:
ultrarag build examples/load_data.yaml
Then edit the generated parameter file, adjusting the fields to match your setup:
examples/parameters/load_data_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
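Here path points to the JSONL file, and key_map maps each output list to a field in that file; limit: -1 presumably loads all samples, while shuffle and seed control optional random sampling. As a mental model (not the server's actual implementation), key_map is applied roughly like this:
import json

# Illustrative sketch: each output key (q_ls, gt_ls) collects the values
# of the JSONL field it is mapped to in key_map.
key_map = {"gt_ls": "golden_answers", "q_ls": "question"}

data = {out_key: [] for out_key in key_map}
with open("data/sample_nq_10.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for out_key, field in key_map.items():
            data[out_key].append(record[field])

# data["q_ls"][0] -> "when was the last time anyone was on the moon"
# data["gt_ls"][0] -> ["14 December 1972 UTC", "December 1972"]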
Run the following command to execute the Pipeline:
ultrarag run examples/load_data.yaml
After completion, the system automatically loads and outputs the data samples, providing the input for subsequent retrieval and generation steps.
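Based on the tool's output declaration and the sample file, the loaded lists would look roughly like this (illustration only; first two samples shown):
q_ls = [
    "when was the last time anyone was on the moon",
    "who wrote he ain't heavy he's my brother lyrics",
    # ...remaining samples omitted
]
gt_ls = [
    ["14 December 1972 UTC", "December 1972"],
    ["Bobby Scott", "Bob Russell"],
    # ...remaining samples omitted
]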

Add Dataset Loading Fields

In some cases, we not only need to load the query and ground_truth fields but also want to use other information in the dataset, such as retrieved passages. In that case, you can modify the Benchmark Server's code to add the fields that should be returned.
You can extend other fields (such as cot, retrieved_passages, etc.) in the same way: simply add the corresponding key names to both the decorator's output declaration and key_map.
If you already have generated results (such as a pred field), you can load them together with the Evaluation Server for rapid evaluation.
The following example demonstrates how to add the id_ls field in the get_data function:
servers/benchmark/src/benchmark.py
# Before:
# @app.tool(output="benchmark->q_ls,gt_ls")
# After: declare id_ls as an additional output
@app.tool(output="benchmark->q_ls,gt_ls,id_ls")
def get_data(
    benchmark: Dict[str, Any],
) -> Dict[str, List[Any]]:
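The function body must then return the new id_ls list as well. A minimal sketch of what the body could look like, with the decorator omitted and assuming the records are read straight from the JSONL file (the repository's actual implementation may also handle limit, shuffle, and seed):
import json
from typing import Any, Dict, List

def get_data(benchmark: Dict[str, Any]) -> Dict[str, List[Any]]:
    # Sketch only: read the JSONL file and build one list per key_map entry,
    # e.g. {"q_ls": "question", "gt_ls": "golden_answers", "id_ls": "id"}.
    with open(benchmark["path"], encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    key_map = benchmark["key_map"]
    return {out: [r[field] for r in records] for out, field in key_map.items()}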
Then, run the following command to recompile the Pipeline:
ultrarag build examples/load_data.yaml
In the generated parameter file, add the id_ls field under key_map and map it to the corresponding key in the original data:
examples/parameters/load_data_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
      id_ls: id
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
After completing the modification, rerun the Pipeline to load data samples that include id:
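ultrarag run examples/load_data.yaml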