All datasets have been released on ModelScope and Huggingface.
Users can directly download and use them without additional cleaning or conversion — they are fully compatible with UltraRAG’s evaluation pipelines.
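As an illustrative sketch only (the repository ID below is a placeholder, not an official UltraRAG dataset path), a dataset snapshot could be fetched from Hugging Face with `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Placeholder repository ID -- substitute the actual UltraRAG dataset repo.
local_dir = snapshot_download(
    repo_id="your-org/ultrarag-benchmarks",  # hypothetical name
    repo_type="dataset",
)
print(f"Datasets downloaded to: {local_dir}")
```

An equivalent download is available through the ModelScope SDK or a plain `git clone` of the dataset repository.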
Benchmark
The following table summarizes the supported task types and their corresponding dataset statistics:

| Task Type | Dataset Name | Original Data Volume | Evaluation Samples |
|---|---|---|---|
| QA | NQ | 3,610 | 1,000 |
| QA | TriviaQA | 11,313 | 1,000 |
| QA | PopQA | 14,267 | 1,000 |
| QA | AmbigQA | 2,002 | 1,000 |
| QA | MarcoQA | 55,636 | 1,000 |
| QA | WebQuestions | 2,032 | 1,000 |
| VQA | MP-DocVQA | 591 | 591 |
| VQA | ChartQA | 63 | 63 |
| VQA | InfoVQA | 718 | 718 |
| VQA | PlotQA | 863 | 863 |
| Multi-hop QA | HotpotQA | 7,405 | 1,000 |
| Multi-hop QA | 2WikiMultiHopQA | 12,576 | 1,000 |
| Multi-hop QA | Musique | 2,417 | 1,000 |
| Multi-hop QA | Bamboogle | 125 | 125 |
| Multi-hop QA | StrategyQA | 2,290 | 1,000 |
| Multi-hop VQA | SlideVQA | 556 | 556 |
| Multiple-choice | ARC | 3,548 | 1,000 |
| Multiple-choice | MMLU | 14,042 | 1,000 |
| Multiple-choice VQA | ArXivQA | 816 | 816 |
| Long-form QA | ASQA | 948 | 948 |
| Fact Verification | FEVER | 13,332 | 1,000 |
| Dialogue | WoW | 3,054 | 1,000 |
| Slot-filling | T-REx | 5,000 | 1,000 |
All evaluation data is provided in .jsonl format following the specifications below.
Non-multiple-choice data format:
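As a rough illustration of such a record, the sketch below writes a single line of a .jsonl file; the field names (`id`, `question`, `golden_answers`) are assumptions for demonstration, not the authoritative UltraRAG schema.

```python
import json

# Hypothetical non-multiple-choice record; field names are assumptions,
# not the authoritative UltraRAG specification.
record = {
    "id": "nq_0",
    "question": "Who wrote the novel Nineteen Eighty-Four?",
    "golden_answers": ["George Orwell"],
}

# Each line of the .jsonl file holds one such JSON object.
with open("nq_eval.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```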
Corpus
UltraRAG provides multi-source, high-quality standardized corpora covering both text and image modalities, facilitating the construction of diverse RAG systems.

The following table summarizes the current corpus statistics:
| Corpus Name | Number of Documents |
|---|---|
| Wiki-2018 | 21,015,324 |
| Wiki-2024 | 30,463,973 |
| MP-DocVQA | 741 |
| ChartQA | 500 |
| InfoVQA | 459 |
| PlotQA | 9,593 |
| SlideVQA | 1,284 |
| ArXivQA | 8,066 |
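As a minimal sketch, assuming each text corpus is distributed as a .jsonl file whose records carry `id` and `contents` fields (an assumption, not the documented schema), a corpus can be streamed line by line without loading it all into memory:

```python
import json

# Assumed corpus layout: one JSON object per line with "id" and "contents"
# fields; the actual UltraRAG corpus schema may differ.
def iter_corpus(path: str):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Example usage: count documents in a local corpus dump.
# n_docs = sum(1 for _ in iter_corpus("wiki-2018.jsonl"))
```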