The Benchmark Server is used to load evaluation datasets. It is commonly employed during the data configuration stage of benchmark testing, question answering (QA), or generation tasks.
We strongly recommend preprocessing your dataset into the .jsonl format.
Example data:
data/sample_nq_10.jsonl
{"id": 0, "question": "when was the last time anyone was on the moon", "golden_answers": ["14 December 1972 UTC", "December 1972"], "meta_data": {}}{"id": 1, "question": "who wrote he ain't heavy he's my brother lyrics", "golden_answers": ["Bobby Scott", "Bob Russell"], "meta_data": {}}{"id": 2, "question": "how many seasons of the bastard executioner are there", "golden_answers": ["one", "one season"], "meta_data": {}}{"id": 3, "question": "when did the eagles win last super bowl", "golden_answers": ["2017"], "meta_data": {}}{"id": 4, "question": "who won last year's ncaa women's basketball", "golden_answers": ["South Carolina"], "meta_data": {}}
Run the following command to execute the Pipeline:
```bash
ultrarag run examples/load_data.yaml
```
After execution, the system automatically loads the dataset and prints sample information, which serves as input for subsequent retrieval and generation steps.
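For orientation, a pipeline YAML for this step might look roughly like the sketch below. This is an illustrative guess rather than the shipped file: the server path and step name are assumptions, so consult `examples/load_data.yaml` in the repository for the exact schema.

```yaml
# Illustrative sketch only; the real examples/load_data.yaml may use different keys.
servers:
  benchmark: servers/benchmark   # assumed path to the Benchmark Server

pipeline:
  - benchmark.get_data           # load the configured .jsonl dataset
```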
In some cases, you may want to load not only the query and ground_truth fields, but also other information from the dataset — for example, pre-retrieved passage data.
In such scenarios, you can modify the Benchmark Server code to include additional fields in the output.
You can extend other fields (such as cot or retrieved_passages) in the same way: simply add the corresponding key names both in the decorator's output and in key_map, as sketched below.
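To make the pattern concrete, here is a framework-free sketch of the idea. It is not the actual Benchmark Server code: `get_data` here is a stand-in function, and the key_map entries (`q_ls`, `gt_ls`, `ret_psg`) are assumed names used only to show how an extra field such as `retrieved_passages` flows through key_map into the output.

```python
import json
from typing import Any, Dict, List

def get_data(path: str, key_map: Dict[str, str]) -> Dict[str, List[Any]]:
    """Stand-in for the Benchmark Server's get_data: map raw JSONL keys to output fields."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    # Each output field is declared once in key_map; adding a new field
    # (e.g. retrieved_passages) is just one more entry here, plus the
    # matching key name in the real server's decorator output.
    return {out_key: [row.get(raw_key) for row in rows]
            for out_key, raw_key in key_map.items()}

# Example: also pull pre-retrieved passages alongside questions and answers.
data = get_data(
    "data/sample_nq_10.jsonl",
    key_map={
        "q_ls": "question",
        "gt_ls": "golden_answers",
        "ret_psg": "retrieved_passages",  # extra field; absent keys come back as None
    },
)
print(len(data["q_ls"]), data["ret_psg"][:1])
```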
If you already have generated results (such as a pred field), you can use them with the Evaluation Server to perform quick evaluation.
The following example demonstrates how to add an id_ls field in the get_data function: