The Benchmark Server loads evaluation datasets; it is commonly used in the data configuration phase of benchmark testing, Q&A tasks, and generation tasks.
We strongly recommend preprocessing data into .jsonl format.
Example data:
data/sample_nq_10.jsonl
{"id": 0, "question": "when was the last time anyone was on the moon", "golden_answers": ["14 December 1972 UTC", "December 1972"], "meta_data": {}}{"id": 1, "question": "who wrote he ain't heavy he's my brother lyrics", "golden_answers": ["Bobby Scott", "Bob Russell"], "meta_data": {}}{"id": 2, "question": "how many seasons of the bastard executioner are there", "golden_answers": ["one", "one season"], "meta_data": {}}{"id": 3, "question": "when did the eagles win last super bowl", "golden_answers": ["2017"], "meta_data": {}}{"id": 4, "question": "who won last year's ncaa women's basketball", "golden_answers": ["South Carolina"], "meta_data": {}}
In some cases, we may need to load not only the query and ground_truth fields but also other information from the dataset, such as retrieved passages.
In that case, you can modify the Benchmark Server code to add the fields that need to be returned.
You can extend other fields (such as cot, retrieved_passages, etc.) in the same way: simply add the corresponding key names to both the decorator's output and the key_map.
If you already have generated results (such as a pred field), you can use them together with the Evaluation Server for rapid evaluation.
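As a rough illustration only (not the Evaluation Server's actual API), a quick exact-match check over a results file whose lines contain pred and golden_answers fields could look like this; the file path is a hypothetical placeholder.

```python
import json

# Hypothetical results file; assumes each line carries "pred" (model output)
# and "golden_answers" (reference answers).
results_path = "output/nq_predictions.jsonl"

total, correct = 0, 0
with open(results_path, encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        pred = rec["pred"].strip().lower()
        golds = [g.strip().lower() for g in rec["golden_answers"]]
        total += 1
        correct += int(pred in golds)

# Guard against an empty file before reporting the score.
print(f"Exact match: {correct / max(total, 1):.4f}")
```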
The following example demonstrates how to add the id_ls field to the get_data function: