
Overview

The Corpus Server is a core component in UR-2.0 designed for processing raw document corpora. It supports parsing, extracting, and standardizing textual or visual content from multiple data sources, and provides various chunking strategies to convert raw documents into formats ready for retrieval and generation. Main features of the Corpus Server include:
  • Document Parsing: Supports multiple file types (.pdf, .txt, .md, .json, .jsonl) for content extraction.
  • Corpus Construction: Saves parsed content into a standardized .jsonl structure, with each line representing an independent document.
  • Image Conversion: Converts PDF pages into image-based corpora, preserving layout and visual structure information.
  • Text Chunking: Provides multiple splitting strategies, including Token-based, Sentence-based, and Recursive chunking.
Example data:
Text modality:
data/corpus_example.jsonl
{"id": "2066692", "contents": "Truman Sports Complex The Harry S. Truman Sports...."}
{"id": "15106858", "contents": "Arrowhead Stadium 1970s...."}
Image modality:
{"id": 0, "image_id": "UltraRAG/page_0.jpg", "image_path": "image/UltraRAG/page_0.jpg"}
{"id": 1, "image_id": "UltraRAG/page_1.jpg", "image_path": "image/UltraRAG/page_1.jpg"}
{"id": 2, "image_id": "UltraRAG/page_2.jpg", "image_path": "image/UltraRAG/page_2.jpg"}

Document Parsing Examples

Text Parsing

The Corpus Server supports multiple text formats, including .pdf, .txt, .md, .json, and .jsonl.
examples/build_text_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.build_text_corpus
Compile the Pipeline:
ultrarag build examples/build_text_corpus.yaml
Modify fields as needed:
examples/parameters/build_text_corpus_parameter.yaml
corpus:
  parse_file_path: data/UltraRAG.pdf
  text_corpus_save_path: corpora/text.jsonl
The parse_file_path can be either a single file or a directory. When a directory is specified, the system automatically traverses and processes all supported files inside it.
Run the Pipeline:
ultrarag run examples/build_text_corpus.yaml
Upon successful execution, the system automatically parses the text and outputs a standardized corpus file, such as:
{"id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}

PDF to Image Conversion

In multimodal RAG scenarios, one common approach is to convert document pages directly into images and perform retrieval and generation on the full visual input.
This approach preserves document layout, formatting, and visual structure, making retrieval and understanding closer to real-world reading.
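Conceptually, the conversion renders each PDF page to an image and records its path in the corpus. The sketch below illustrates that idea with the pdf2image library (which requires Poppler); it is not the UltraRAG implementation, and the output layout simply mirrors the image_id / image_path convention in the example records.

import json
from pathlib import Path

from pdf2image import convert_from_path  # pip install pdf2image; requires Poppler

pdf_path = Path("data/UltraRAG.pdf")
out_dir = Path("image") / pdf_path.stem
out_dir.mkdir(parents=True, exist_ok=True)

records = []
for i, page in enumerate(convert_from_path(str(pdf_path), dpi=150)):
    image_file = out_dir / f"page_{i}.jpg"
    page.save(image_file, "JPEG")  # one rendered image per PDF page
    records.append({
        "id": i,
        "image_id": f"{pdf_path.stem}/page_{i}.jpg",
        "image_path": str(image_file),
    })

with open("corpora/image.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")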
examples/build_image_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.build_image_corpus
Compile the Pipeline:
ultrarag build examples/build_image_corpus.yaml
Adjust parameters as necessary:
examples/parameters/build_image_corpus_parameter.yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl
  parse_file_path: data/UltraRAG.pdf
Similarly, the parse_file_path parameter can be a single file or a directory.
When set to a directory, the system automatically processes all files within it.
Run the Pipeline:
ultrarag run examples/build_image_corpus.yaml
After successful execution, the system saves the generated image corpus file, where each record includes an image identifier and relative path.
The resulting .jsonl file can be directly used for multimodal retrieval or generation tasks. Example output:
{"id": 0, "image_id": "UltraRAG/page_0.jpg", "image_path": "image/UltraRAG/page_0.jpg"}
{"id": 1, "image_id": "UltraRAG/page_1.jpg", "image_path": "image/UltraRAG/page_1.jpg"}
{"id": 2, "image_id": "UltraRAG/page_2.jpg", "image_path": "image/UltraRAG/page_2.jpg"}

MinerU Parsing

MinerU is a widely adopted PDF parsing framework known for high-precision text and layout extraction.
UR-2.0 seamlessly integrates MinerU as a built-in utility, enabling end-to-end corpus construction (PDF → Text + Image) directly within the Pipeline.
examples/build_mineru_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.mineru_parse
- corpus.build_mineru_corpus
Compile the Pipeline:
ultrarag build examples/build_mineru_corpus.yaml
Modify parameters as appropriate:
examples/parameters/build_mineru_corpus_parameter.yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl        # Image corpus output path
  mineru_dir: corpora/                               # MinerU parsing output directory
  mineru_extra_params:
    source: modelscope                               # Model download source (default: Hugging Face, optional: modelscope)
  parse_file_path: data/UltraRAG.pdf                 # File or directory to parse
  text_corpus_save_path: corpora/text.jsonl          # Text corpus output path
The parse_file_path parameter can also point to either a file or a directory.
Run the Pipeline (the first run will download the MinerU models, which may take some time):
ultrarag run examples/build_mineru_corpus.yaml
After execution, the system outputs both text and image corpora in the same standardized format as build_text_corpus and build_image_corpus, ready for multimodal retrieval and generation.
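Because both outputs use the formats shown earlier, they can be loaded side by side for a quick summary. The sketch below only assumes the text_corpus_save_path and image_corpus_save_path values from the parameter file; how the two corpora are aligned downstream depends on your retrieval setup.

import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

text_docs = load_jsonl("corpora/text.jsonl")    # text_corpus_save_path
image_docs = load_jsonl("corpora/image.jsonl")  # image_corpus_save_path
print(f"{len(text_docs)} text documents, {len(image_docs)} page images")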

Document Chunking Example

UR-2.0 integrates the Chonkie document chunking library and provides three main chunking strategies:
Token Chunker, Sentence Chunker, and Recursive Chunker, offering flexibility for various text structures.
  • Token Chunker: Splits text by tokenizer, word, or character — suitable for general text.
  • Sentence Chunker: Splits by sentence boundaries to preserve semantic integrity.
  • Recursive Chunker: Designed for well-structured long documents (e.g., books, papers) and supports hierarchical segmentation.
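These strategies map onto Chonkie's chunker classes. The sketch below runs token-based chunking directly with Chonkie, using the same chunk_size / chunk_overlap values as the token section of the parameter file further down; it illustrates the underlying library rather than the corpus.chunk_documents tool itself, and constructor arguments may differ across Chonkie versions.

from chonkie import TokenChunker

# 256-token windows with a 50-token overlap, counted with the gpt2 tokenizer.
chunker = TokenChunker(tokenizer="gpt2", chunk_size=256, chunk_overlap=50)

with open("data/sample.txt", encoding="utf-8") as f:  # hypothetical input file
    text = f.read()

for i, chunk in enumerate(chunker.chunk(text)):
    print(i, chunk.token_count, chunk.text[:40].replace("\n", " "))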
examples/corpus_chunk.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.chunk_documents
Compile the Pipeline:
ultrarag build examples/corpus_chunk.yaml
Modify parameters as needed:
examples/parameters/corpus_chunk_parameter.yaml
corpus:
  chunk_backend: token                     # Chunking strategy: token / sentence / recursive
  chunk_backend_configs:
    recursive:
      chunk_size: 256                      # Max characters/tokens per chunk
      min_characters_per_chunk: 12         # Minimum characters per chunk
      tokenizer_or_token_counter: character
    sentence:
      chunk_overlap: 50                    # Overlap between chunks (characters)
      chunk_size: 256                      # Max length per chunk
      delim: '[''.'', ''!'', ''?'', ''\n'']'  # Sentence delimiters
      min_sentences_per_chunk: 1           # Minimum sentences per chunk
      tokenizer_or_token_counter: character
    token:
      chunk_overlap: 50                    # Overlap between chunks (tokens)
      chunk_size: 256                      # Max tokens per chunk
      tokenizer_or_token_counter: gpt2     # Tokenizer used
  chunk_path: corpora/chunks.jsonl         # Output path for chunked corpus
  raw_chunk_path: corpora/text.jsonl       # Input text corpus path
  use_title: false                         # Whether to prepend title to each chunk
Run the Pipeline:
ultrarag run examples/corpus_chunk.yaml
Once executed, the system outputs a standardized chunked corpus file, ready for downstream retrieval and generation modules.
Example output:
{"id": 0, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
{"id": 1, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
{"id": 2, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
You can invoke both parsing and chunking tools within the same Pipeline to build your own customized knowledge base.