Available Data

We have organized and preprocessed the public evaluation datasets most commonly used in current RAG research and released them on Hugging Face Datasets. Users can download and use them directly without any further processing. The table below lists the supported task types and dataset statistics:
| Task Type | Dataset Name | Original Data Quantity | Leaderboard Sample Quantity |
| --- | --- | --- | --- |
| qa | nq | 3,610 | 1,000 |
| qa | TriviaQA | 11,313 | 1,000 |
| qa | popqa | 14,267 | 1,000 |
| qa | AmbigQA | 2,002 | 1,000 |
| qa | MarcoQA | 101,093; 55,636 (filtered no-answer version) | 1,000 (based on filtered version) |
| qa | WebQuestions | 2,032 | 1,000 |
| Multi-hop qa | hotpotqa | 7,405 | 1,000 |
| Multi-hop qa | 2WikiMultiHopQA | 12,576 | 1,000 |
| Multi-hop qa | Musique | 2,417 | 1,000 |
| Multi-hop qa | bamboogle | 125 | 125 (unprocessed) |
| Multi-hop qa | strategy-qa | 2,290 | 1,000 |
| Multiple-choice | ARC | 3,548 (options are uppercase letters A-E; option E appears in 1 item) | 1,000 |
| Multiple-choice | mmlu | 14,042 (options are uppercase letters A-D) | 1,000 |
| Long-form QA | ASQA | 948 | 948 (unprocessed) |
| fact-verification | FEVER | 13,332 (only SUPPORTS and REFUTES labels retained) | 1,000 |
| dialogue | WoW | 3,054 | 1,000 |
| slot-filling | T-REx | 5,000 | 1,000 |
Corpus Statistics:

| Corpus Name | Number of Documents |
| --- | --- |
| wiki2018 | 21,015,324 |
| wiki2024 | Coming soon |

Data Format Description

We recommend users process all test data into `.jsonl` format, following the structure specifications below to ensure compatibility with UltraRAG modules.

Non-multiple-choice data format:
{
  "id": 0,  // integer identifier
  "question": "xxxx",  // question text
  "golden_answers": ["xxx", "xxx"],  // list of standard answers, can contain multiple
  "metadata": { ... }  // other information fields, optional
}
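As a quick sanity check, a record in this format can be written and read back with Python's standard library. The field values and the filename below are illustrative; only the field names and types follow the specification above:

```python
import json

# One evaluation record following the non-multiple-choice schema above.
record = {
    "id": 0,  # integer identifier
    "question": "Who wrote The Old Man and the Sea?",
    "golden_answers": ["Ernest Hemingway", "Hemingway"],  # multiple answers allowed
    "metadata": {"source": "example"},  # optional extra fields
}

# A .jsonl file holds one JSON object per line.
with open("test_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back and verify the required fields and types.
with open("test_data.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

assert isinstance(loaded[0]["id"], int)
assert isinstance(loaded[0]["golden_answers"], list)
```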
Multiple-choice data format:
{
  "id": 0,
  "question": "xxxx",
  "golden_answers": ["A"],  // standard answer as option letter (e.g., A–D)
  "choices": ["xxx", "xxx", "xxx", "xxx"],  // list of option texts
  "metadata": { ... }
}
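Because `golden_answers` stores the option letter rather than the option text, evaluation code often needs to map the letter back to an entry in `choices`. A minimal sketch, assuming options are ordered A, B, C, ... (the helper name and example record are hypothetical, not part of UltraRAG's API):

```python
# Resolve a multiple-choice golden answer letter to its option text.
# Assumes choices[0] corresponds to "A", choices[1] to "B", and so on.
def resolve_choice(record):
    letter = record["golden_answers"][0]
    index = ord(letter) - ord("A")  # "A" -> 0, "B" -> 1, ...
    return record["choices"][index]

example = {
    "id": 0,
    "question": "Which color is a mix of blue and yellow?",
    "golden_answers": ["C"],
    "choices": ["red", "orange", "green", "purple"],
    "metadata": {},
}

answer_text = resolve_choice(example)  # "green"
```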
Corpus data format:
{
  "id": "0",
  "contents": "xxxxx"  // text segment after corpus chunking
}
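The `contents` field holds one chunk of a longer document. A minimal chunking sketch that emits records in this format; the fixed character-based chunk size is illustrative (real pipelines often chunk by tokens or sentences), and note that `id` is a string here, unlike in the QA formats:

```python
# Split a long document into fixed-size character chunks and emit
# corpus records in the format above.
def chunk_corpus(text, chunk_size=200):
    records = []
    for start in range(0, len(text), chunk_size):
        records.append({
            "id": str(len(records)),  # string identifier
            "contents": text[start:start + chunk_size],  # one text segment
        })
    return records

docs = chunk_corpus("some long document text " * 20, chunk_size=100)
```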