
Function

The Corpus Server is the core component in UltraRAG for processing raw corpus documents. It parses, extracts, and standardizes text or image content from various data sources, and provides multiple chunking strategies that convert raw documents into a format directly usable by downstream retrieval and generation. Its main functions are:
  • Document Parsing: Extracts content from multiple file types, such as .pdf, .txt, .md, and .docx.
  • Corpus Construction: Saves parsed content as a standardized .jsonl file, where each line corresponds to an independent document.
  • Image Conversion: Converts PDF pages into image corpora, preserving layout and visual structure information.
  • Text Chunking: Provides multiple splitting strategies: Token, Sentence, and Recursive.
Example data:
Text Modality:
data/corpus_example.jsonl
{"id": "2066692", "contents": "Truman Sports Complex The Harry S. Truman Sports...."}
{"id": "15106858", "contents": "Arrowhead Stadium 1970s...."}
Image Modality:
{"id": 0, "image_id": "UltraRAG/page_0.jpg", "image_path": "image/UltraRAG/page_0.jpg"}
{"id": 1, "image_id": "UltraRAG/page_1.jpg", "image_path": "image/UltraRAG/page_1.jpg"}
{"id": 2, "image_id": "UltraRAG/page_2.jpg", "image_path": "image/UltraRAG/page_2.jpg"}
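Because both modalities store one JSON object per line, a corpus file can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of UltraRAG: the helper name load_corpus is hypothetical, and the only assumption it makes about the format is that every record carries an id field, as in the examples above.

```python
import json
from pathlib import Path


def load_corpus(path):
    """Read a .jsonl corpus: one JSON object per non-empty line.

    Hypothetical helper for illustration; not an UltraRAG API.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if "id" not in record:
                raise ValueError(f"line {line_no}: missing 'id' field")
            records.append(record)
    return records
```

Text records can then be accessed as record["contents"], and image records as record["image_path"].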

Document Parsing Examples

Text Parsing

The Corpus Server can parse text from multiple file formats, including .pdf, .txt, .md, .docx, .xps, .oxps, .epub, .mobi, and .fb2.
examples/build_text_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.build_text_corpus
Compile Pipeline:
ultrarag build examples/build_text_corpus.yaml
Adjust the parameters to match your setup:
examples/parameters/build_text_corpus_parameter.yaml
corpus:
  parse_file_path: data/UltraRAG.pdf
  text_corpus_save_path: corpora/text.jsonl
parse_file_path can be a single file or a folder path. When it is a folder, the system automatically traverses it and batch-reads all parsable files in it.
Run Pipeline:
ultrarag run examples/build_text_corpus.yaml
After successful execution, the system will automatically parse the text and output a standardized corpus file, for example:
{"id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
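The file-or-folder behavior of parse_file_path can be pictured with the sketch below. It is an illustration only, not the Corpus Server's actual code: the function name collect_files is hypothetical, the extension set is copied from the format list in this section, and recursing into subfolders is an assumption.

```python
from pathlib import Path

# Extensions copied from the format list in this section; the real server may accept more.
PARSABLE_SUFFIXES = {
    ".pdf", ".txt", ".md", ".docx", ".xps", ".oxps", ".epub", ".mobi", ".fb2",
}


def collect_files(parse_file_path):
    """Return the files to parse for a single file or a folder path.

    Hypothetical helper; recursive traversal is an assumption.
    """
    root = Path(parse_file_path)
    if root.is_file():
        return [root]
    if root.is_dir():
        return sorted(
            p for p in root.rglob("*")
            if p.is_file() and p.suffix.lower() in PARSABLE_SUFFIXES
        )
    raise FileNotFoundError(f"{parse_file_path} is neither a file nor a folder")
```

Calling collect_files("data/UltraRAG.pdf") would return just that file, while collect_files("data") would return every parsable file found under data.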

PDF to Image

In multi-modal RAG scenarios, one approach is to convert document pages directly into images and run retrieval and generation on the full-page images. This preserves the document's layout, formatting, and visual structure, so retrieval and understanding stay closer to how a reader actually sees the page.
examples/build_image_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.build_image_corpus
Compile Pipeline:
ultrarag build examples/build_image_corpus.yaml
Adjust the parameters to match your setup:
examples/parameters/build_image_corpus_parameter.yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl
  parse_file_path: data/UltraRAG.pdf
Similarly, parse_file_path can be either a single file or a folder path. When set to a folder, the system automatically traverses and processes all files in it.
Run Pipeline:
ultrarag run examples/build_image_corpus.yaml
After successful execution, the system saves the generated image corpus file. Each record contains the image identifier and its relative path, and the resulting .jsonl file can be used directly as input for multi-modal retrieval or generation tasks.
Output example:
{"id": 0, "image_id": "UltraRAG/page_0.jpg", "image_path": "image/UltraRAG/page_0.jpg"}
{"id": 1, "image_id": "UltraRAG/page_1.jpg", "image_path": "image/UltraRAG/page_1.jpg"}
{"id": 2, "image_id": "UltraRAG/page_2.jpg", "image_path": "image/UltraRAG/page_2.jpg"}
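The record layout in the output example can be reproduced with a short sketch. This is not how the Corpus Server renders pages; it only mirrors the naming pattern visible above. The function name image_records and the image_root parameter are hypothetical, and the numbering shown covers a single document only.

```python
import json


def image_records(doc_name, n_pages, image_root="image"):
    """Yield one record per page, mirroring the output example above.

    Hypothetical helper: image_root and the per-document numbering
    are assumptions read off the example records.
    """
    for page in range(n_pages):
        image_id = f"{doc_name}/page_{page}.jpg"
        yield {
            "id": page,
            "image_id": image_id,
            "image_path": f"{image_root}/{image_id}",
        }


def to_jsonl(records):
    """Serialize records as JSON Lines text, one object per line."""
    return "\n".join(json.dumps(r) for r in records)
```

For a three-page document named UltraRAG, to_jsonl(image_records("UltraRAG", 3)) produces three lines in the same shape as the example.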

MinerU Parsing

MinerU is a widely used PDF parsing framework that supports high-precision extraction of text and layout structure. UltraRAG integrates MinerU as a built-in tool that can be called directly in the Pipeline, so a single run builds both the text and image corpora from a PDF.
examples/build_mineru_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.mineru_parse
- corpus.build_mineru_corpus
Compile Pipeline:
ultrarag build examples/build_mineru_corpus.yaml
Adjust the parameters to match your setup:
examples/parameters/build_mineru_corpus_parameter.yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl    # Image corpus save path
  mineru_dir: corpora/                           # MinerU parsing result save directory
  mineru_extra_params:
    source: modelscope                           # Model download source (default: Hugging Face; alternative: modelscope)
  parse_file_path: data/UltraRAG.pdf             # File or folder path to parse
  text_corpus_save_path: corpora/text.jsonl      # Text corpus save path
Similarly, parse_file_path can be either a single file or a folder path.
Run Pipeline (the first execution downloads the MinerU model, which may be slow):
ultrarag run examples/build_mineru_corpus.yaml
After successful execution, the system outputs the corresponding text corpus and image corpus files. Their formats match the outputs of build_text_corpus and build_image_corpus, so they can be used directly for multi-modal retrieval and generation tasks.

Document Chunking Examples

UltraRAG integrates the chonkie document chunking library and provides three built-in chunking strategies, Token Chunker, Sentence Chunker, and Recursive Chunker, to handle different types of text structure.
  • Token Chunker: Chunks by tokenizer, word, or character, suitable for general text.
  • Sentence Chunker: Splits by sentence boundaries, ensuring semantic integrity.
  • Recursive Chunker: Suitable for well-structured long documents (such as books, papers), capable of automatically dividing content by hierarchy.
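To make the difference between the strategies concrete, here is a minimal sketch of sentence-boundary chunking in plain Python. It is not chonkie's implementation: the function names split_sentences and pack_sentences are hypothetical, length is measured in characters, and overlap is left out for brevity.

```python
import re


def split_sentences(text, delims=(".", "!", "?", "\n")):
    """Split text after each delimiter, keeping the delimiter with its sentence."""
    pattern = "(" + "|".join(re.escape(d) for d in delims) + ")"
    # With a capture group, re.split alternates text and delimiter.
    parts = re.split(pattern, text)
    sentences = []
    for i in range(0, len(parts), 2):
        delim = parts[i + 1] if i + 1 < len(parts) else ""
        sentence = parts[i] + delim
        if sentence.strip():
            sentences.append(sentence)
    return sentences


def pack_sentences(sentences, chunk_size=256):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only ever break between sentences, no sentence is cut in half, which is the semantic-integrity property the Sentence Chunker is meant to provide.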
examples/corpus_chunk.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.chunk_documents
Compile Pipeline:
ultrarag build examples/corpus_chunk.yaml
Adjust the parameters to match your setup:
examples/parameters/corpus_chunk_parameter.yaml
corpus:
  chunk_backend: token                    # Chunking strategy: token / sentence / recursive
  chunk_backend_configs:
    recursive:
      min_characters_per_chunk: 12        # Minimum characters per chunk, to avoid overly short chunks
    sentence:
      chunk_overlap: 50                   # Overlapping characters between adjacent chunks
      delim: '[''.'', ''!'', ''?'', ''\n'']'  # Sentence delimiters
      min_sentences_per_chunk: 1          # Minimum sentences per chunk
    token:
      chunk_overlap: 50                   # Overlapping tokens between adjacent chunks
  chunk_path: corpora/chunks.jsonl        # Output path for the chunked corpus
  chunk_size: 256                         # Maximum tokens per chunk
  raw_chunk_path: corpora/text.jsonl      # Raw text corpus path
  tokenizer_or_token_counter: character   # Tokenizer used
  use_title: false                        # Whether to prepend the title to each chunk
Run Pipeline:
ultrarag run examples/corpus_chunk.yaml
After execution, the system outputs a standardized chunked corpus file that can be used directly by the subsequent retrieval and generation modules.
Output example:
{"id": 0, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
{"id": 1, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
{"id": 2, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
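The interaction between chunk_size, chunk_overlap, and use_title can be sketched as a sliding window over the text. This is an illustration under stated assumptions, not chonkie's or UltraRAG's code: the function names are hypothetical, tokens are treated as single characters (matching the character tokenizer setting), and joining the title to the chunk with a newline is a guess.

```python
def chunk_text(text, chunk_size=256, chunk_overlap=50):
    """Split text into windows of chunk_size characters overlapping by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def chunk_records(doc, chunk_size=256, chunk_overlap=50, use_title=False):
    """Turn one corpus record into chunk records shaped like the output example."""
    records = []
    pieces = chunk_text(doc["contents"], chunk_size, chunk_overlap)
    for i, piece in enumerate(pieces):
        # Assumption: the title is joined to the chunk with a newline.
        contents = f"{doc['title']}\n{piece}" if use_title else piece
        records.append(
            {"id": i, "doc_id": doc["id"], "title": doc["title"], "contents": contents}
        )
    return records
```

With chunk_size 256 and chunk_overlap 50, each window starts 206 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.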
You can call parsing tools and chunking tools in the same Pipeline to build a custom knowledge base.