All datasets have been released on ModelScope and Huggingface.
Users can directly download and use them without additional cleaning or conversion — they are fully compatible with UltraRAG’s evaluation pipelines.
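As an illustrative sketch only (the repository ID below is a placeholder, not an official UltraRAG dataset path), a dataset snapshot could be fetched from Hugging Face with `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Placeholder repository ID -- substitute the actual UltraRAG dataset repo.
local_dir = snapshot_download(
    repo_id="your-org/ultrarag-benchmarks",  # hypothetical name
    repo_type="dataset",
)
print(f"Datasets downloaded to: {local_dir}")
```

An equivalent download is available through the ModelScope SDK or a plain `git clone` of the dataset repository.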
Benchmark
The following table summarizes the supported task types and their corresponding dataset statistics:

| Task Type | Dataset Name | Original Data Volume | Evaluation Samples |
|---|---|---|---|
| QA | NQ | 3,610 | 1,000 |
| QA | TriviaQA | 11,313 | 1,000 |
| QA | PopQA | 14,267 | 1,000 |
| QA | AmbigQA | 2,002 | 1,000 |
| QA | MarcoQA | 55,636 | 1,000 |
| QA | WebQuestions | 2,032 | 1,000 |
| VQA | MP-DocVQA | 591 | 591 |
| VQA | ChartQA | 63 | 63 |
| VQA | InfoVQA | 718 | 718 |
| VQA | PlotQA | 863 | 863 |
| Multi-hop QA | HotpotQA | 7,405 | 1,000 |
| Multi-hop QA | 2WikiMultiHopQA | 12,576 | 1,000 |
| Multi-hop QA | Musique | 2,417 | 1,000 |
| Multi-hop QA | Bamboogle | 125 | 125 |
| Multi-hop QA | StrategyQA | 2,290 | 1,000 |
| Multi-hop VQA | SlideVQA | 556 | 556 |
| Multiple-choice | ARC | 3,548 | 1,000 |
| Multiple-choice | MMLU | 14,042 | 1,000 |
| Multiple-choice VQA | ArXivQA | 816 | 816 |
| Long-form QA | ASQA | 948 | 948 |
| Fact Verification | FEVER | 13,332 | 1,000 |
| Dialogue | WoW | 3,054 | 1,000 |
| Slot-filling | T-REx | 5,000 | 1,000 |
All evaluation data is provided in .jsonl format following the specifications below.
Non-multiple-choice data format:
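As a rough illustration of such a record, the sketch below writes a single line of a .jsonl file; the field names (`id`, `question`, `golden_answers`) are assumptions for demonstration, not the authoritative UltraRAG schema.

```python
import json

# Hypothetical non-multiple-choice record; field names are assumptions,
# not the authoritative UltraRAG specification.
record = {
    "id": "nq_0",
    "question": "Who wrote the novel Nineteen Eighty-Four?",
    "golden_answers": ["George Orwell"],
}

# Each line of the .jsonl file holds one such JSON object.
with open("nq_eval.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```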
Corpus
UltraRAG provides multi-source, high-quality standardized corpora covering both text and image modalities, facilitating the construction of diverse RAG systems.

The following table summarizes the current corpus statistics:
| Corpus Name | Number of Documents |
|---|---|
| Wiki-2018 | 21,015,324 |
| Wiki-2024 | 30,463,973 |
| MP-DocVQA | 741 |
| ChartQA | 500 |
| InfoVQA | 459 |
| PlotQA | 9,593 |
| SlideVQA | 1,284 |
| ArXivQA | 8,066 |
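As a minimal sketch, assuming each text corpus is distributed as a .jsonl file whose records carry `id` and `contents` fields (an assumption, not the documented schema), a corpus can be streamed line by line without loading it all into memory:

```python
import json

# Assumed corpus layout: one JSON object per line with "id" and "contents"
# fields; the actual UltraRAG corpus schema may differ.
def iter_corpus(path: str):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Example usage: count documents in a local corpus dump.
# n_docs = sum(1 for _ in iter_corpus("wiki-2018.jsonl"))
```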