Skip to main content
We have organized and preprocessed the most commonly used public evaluation datasets and corpora in current RAG research, and have released them synchronously on ModelScope and Huggingface. Users can download and use them directly without additional cleaning or conversion, seamlessly interfacing with UltraRAG’s evaluation pipeline.

Benchmark

The following table summarizes the currently supported task types and statistical information of corresponding datasets:
Task TypeDataset NameRaw Data CountEval Sample Count
QANQ3,6101,000
QATriviaQA11,3131,000
QAPopQA14,2671,000
QAAmbigQA2,0021,000
QAMarcoQA55,6361,000
QAWebQuestions2,0321,000
VQAMP-DocVQA591591
VQAChartQA6363
VQAInfoVQA718718
VQAPlotQA863863
Multi-hop QAHotpotQA7,4051,000
Multi-hop QA2WikiMultiHopQA12,5761,000
Multi-hop QAMusique2,4171,000
Multi-hop QABamboogle125125
Multi-hop QAStrategyQA2,2901,000
Multi-hop VQASlideVQA556556
Multiple-choiceARC3,5481,000
Multiple-choiceMMLU14,0421,000
Multiple-choice VQAArXivQA816816
Long-form QAASQA948948
Fact-verificationFEVER13,3321,000
DialogueWoW3,0541,000
Slot-fillingT-REx5,0001,000
Data Format Specification To ensure full compatibility with UltraRAG modules, it is recommended that users store test data uniformly as .jsonl files and follow the format specifications below. Non-multiple-choice data format:
https://mintcdn.com/ultrarag/T7GffHzZitf6TThi/images/json.svg?fit=max&auto=format&n=T7GffHzZitf6TThi&q=85&s=81a8c440100333f3454ca984a5b0fe5a
{
  "id": 0, 
  "question": "where does the karate kid 2010 take place", 
  "golden_answers": ["China", "Beijing", "Beijing, China"], 
  "meta_data": {} 
}
Multiple-choice data format:
https://mintcdn.com/ultrarag/T7GffHzZitf6TThi/images/json.svg?fit=max&auto=format&n=T7GffHzZitf6TThi&q=85&s=81a8c440100333f3454ca984a5b0fe5a
{
  "id": 0, 
  "question": "Mast Co. converted from the FIFO method for inventory valuation to the LIFO method for financial statement and tax purposes. During a period of inflation would Mast's ending inventory and income tax payable using LIFO be higher or lower than FIFO? Ending inventory Income tax payable", 
  "golden_answers": ["A"], 
  "choices": ["Lower Lower", "Higher Higher", "Lower Higher", "Higher Lower"], 
  "meta_data": {"subject": "professional_accounting"}
}

Corpus

UltraRAG provides multi-source, high-quality standardized corpora, covering both text and image modalities, facilitating the construction of multi-scenario RAG systems. The following is statistical information on currently collected corpora:
Corpus NameDocument Count
Wiki-201821,015,324
Wiki-202430,463,973
MP-DocVQA741
ChartQA500
InfoVQA459
PlotQA9,593
SlideVQA1,284
ArXivQA8,066
Data Format Specification Text corpus format:
https://mintcdn.com/ultrarag/T7GffHzZitf6TThi/images/json.svg?fit=max&auto=format&n=T7GffHzZitf6TThi&q=85&s=81a8c440100333f3454ca984a5b0fe5a
{
  "id": "15106858", 
  "contents": "Arrowhead Stadium 1970s practice would eventually spread to the other NFL stadiums as the 1970s progressed, finally becoming mandatory league-wide in the 1978 season (after being used in Super Bowl XII), and become almost near-universal at the lower levels of football. On January 20, 1974, Arrowhead Stadium hosted the Pro Bowl. Due to an ice storm and brutally cold temperatures the week leading up to the game, the game's participants worked out at the facilities of the San Diego Chargers. On game day, the temperature soared to 41 F, melting most of the ice and snow that accumulated during the week. The AFC defeated the NFC, 15–13."
}
Image corpus format:
https://mintcdn.com/ultrarag/T7GffHzZitf6TThi/images/json.svg?fit=max&auto=format&n=T7GffHzZitf6TThi&q=85&s=81a8c440100333f3454ca984a5b0fe5a
{
  "id": 0, 
  "image_id": "37313.jpeg", 
  "image_path": "image/37313.jpg"
}