Overview
The Corpus Server is a core component in UR-2.0 designed for processing raw document corpora. It supports parsing, extracting, and standardizing textual or visual content from multiple data sources, and provides various chunking strategies to convert raw documents into formats ready for retrieval and generation.
Main features of the Corpus Server include:
- Document Parsing: Supports multiple file types (.pdf, .txt, .md, .json, .jsonl) for content extraction.
- Corpus Construction: Saves parsed content into a standardized .jsonl structure, with each line representing an independent document.
- Image Conversion: Converts PDF pages into image-based corpora, preserving layout and visual structure information.
- Text Chunking: Provides multiple splitting strategies, including Token-based, Sentence-based, and Recursive chunking.
Example data:
Text modality:

data/corpus_example.jsonl
{"id": "2066692", "contents": "Truman Sports Complex The Harry S. Truman Sports...."}
{"id": "15106858", "contents": "Arrowhead Stadium 1970s...."}
Image modality:
{"id": 0, "image_id": "UltraRAG/page_0.jpg", "image_path": "image/UltraRAG/page_0.jpg"}
{"id": 1, "image_id": "UltraRAG/page_1.jpg", "image_path": "image/UltraRAG/page_1.jpg"}
{"id": 2, "image_id": "UltraRAG/page_2.jpg", "image_path": "image/UltraRAG/page_2.jpg"}
Document Parsing Examples
Text Parsing
The Corpus Server supports multiple text formats, including .pdf, .txt, .md, .json, and .jsonl.

examples/build_text_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus
# MCP Client Pipeline
pipeline:
- corpus.build_text_corpus
Compile the Pipeline:
ultrarag build examples/build_text_corpus.yaml
Modify fields as needed:

examples/parameters/build_text_corpus_parameter.yaml
corpus:
  parse_file_path: data/UltraRAG.pdf
  text_corpus_save_path: corpora/text.jsonl
The parse_file_path can be either a single file or a directory. When a folder is specified, the system will automatically traverse and process all supported files inside.
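The traversal behavior can be pictured with the following Python sketch; the helper name and extension filter are illustrative assumptions, not the server's actual code.

# Illustrative sketch of how a parse_file_path directory could be expanded.
# The extension list follows the supported formats named above; the real
# corpus server's traversal logic may differ in detail.
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".txt", ".md", ".json", ".jsonl"}

def collect_parse_targets(parse_file_path: str) -> list[Path]:
    path = Path(parse_file_path)
    if path.is_file():
        return [path]
    # Recurse into subdirectories, keeping only supported file types.
    return [p for p in sorted(path.rglob("*"))
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES]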
Run the Pipeline:
ultrarag run examples/build_text_corpus.yaml
Upon successful execution, the system automatically parses the text and outputs a standardized corpus file, such as:
{"id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
PDF to Image Conversion
In multimodal RAG scenarios, one common approach is to convert document pages directly into images and perform retrieval and generation on the full visual input.
This approach preserves document layout, formatting, and visual structure, so retrieval and understanding operate on pages much as a human reader would see them.
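Conceptually, the conversion renders each page to an image file and records its identifier and relative path. The following sketch approximates this with PyMuPDF; the library choice and helper name are illustrative assumptions, not the server's actual implementation.

# Conceptual sketch of PDF-page-to-image conversion using PyMuPDF (fitz).
# Record fields mirror the example output shown below.
import json
from pathlib import Path

import fitz  # PyMuPDF; pip install pymupdf

def pdf_to_image_corpus(pdf_path: str, doc_name: str, out_jsonl: str) -> None:
    doc = fitz.open(pdf_path)
    with open(out_jsonl, "w", encoding="utf-8") as f:
        for i, page in enumerate(doc):
            image_path = Path("image") / doc_name / f"page_{i}.jpg"
            image_path.parent.mkdir(parents=True, exist_ok=True)
            # Render the page at 150 DPI; JPEG output needs a recent PyMuPDF.
            page.get_pixmap(dpi=150).save(str(image_path))
            record = {"id": i, "image_id": f"{doc_name}/page_{i}.jpg",
                      "image_path": image_path.as_posix()}
            f.write(json.dumps(record) + "\n")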

examples/build_image_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus
# MCP Client Pipeline
pipeline:
- corpus.build_image_corpus
Compile the Pipeline:
ultrarag build examples/build_image_corpus.yaml
Adjust parameters as necessary:

examples/parameters/build_image_corpus_parameter.yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl
  parse_file_path: data/UltraRAG.pdf
Similarly, the parse_file_path parameter can be a single file or a directory.
When set to a directory, the system automatically processes all files within it.
Run the Pipeline:
ultrarag run examples/build_image_corpus.yaml
After successful execution, the system saves the generated image corpus file, where each record includes an image identifier and relative path.
The resulting .jsonl file can be directly used for multimodal retrieval or generation tasks. Example output:
{"id": 0, "image_id": "UltraRAG/page_0.jpg", "image_path": "image/UltraRAG/page_0.jpg"}
{"id": 1, "image_id": "UltraRAG/page_1.jpg", "image_path": "image/UltraRAG/page_1.jpg"}
{"id": 2, "image_id": "UltraRAG/page_2.jpg", "image_path": "image/UltraRAG/page_2.jpg"}
MinerU Parsing
MinerU is a widely adopted PDF parsing framework known for high-precision text and layout extraction.
UR-2.0 seamlessly integrates MinerU as a built-in utility, enabling end-to-end corpus construction (PDF → Text + Image) directly within the Pipeline.

examples/build_mineru_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus
# MCP Client Pipeline
pipeline:
- corpus.mineru_parse
- corpus.build_mineru_corpus
Compile the Pipeline:
ultrarag build examples/build_mineru_corpus.yaml
Modify parameters as appropriate:

examples/parameters/build_mineru_corpus_parameter.yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl # Image corpus output path
  mineru_dir: corpora/ # MinerU parsing output directory
  mineru_extra_params:
    source: modelscope # Model download source (default: Hugging Face, optional: modelscope)
  parse_file_path: data/UltraRAG.pdf # File or directory to parse
  text_corpus_save_path: corpora/text.jsonl # Text corpus output path
The parse_file_path parameter can also point to either a file or a directory.
Run the Pipeline (the first run will download MinerU models, which may take time):
ultrarag run examples/build_mineru_corpus.yaml
After execution, the system automatically outputs both text and image corpora in standardized format, consistent with the build_text_corpus and build_image_corpus methods, and ready for multimodal retrieval and generation.
Document Chunking Example
UR-2.0 integrates the Chonkie document chunking library and provides three main chunking strategies:
Token Chunker, Sentence Chunker, and Recursive Chunker, offering flexibility for various text structures.
- Token Chunker: Splits text into fixed-size pieces counted by a tokenizer, by words, or by characters; suitable for general text.
- Sentence Chunker: Splits by sentence boundaries to preserve semantic integrity.
- Recursive Chunker: Designed for well-structured long documents (e.g., books, papers) and supports hierarchical segmentation.
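For reference, the three strategies map onto Chonkie's chunker classes roughly as in the sketch below, which uses the parameter values from the configuration file further down. This is illustrative only; exact constructor signatures may vary across Chonkie versions, and UR-2.0 invokes the library for you through corpus.chunk_documents.

# Illustrative sketch of the three Chonkie strategies named above.
from chonkie import RecursiveChunker, SentenceChunker, TokenChunker

text = open("doc.txt", encoding="utf-8").read()

# Token-based: fixed-size windows with overlap, counted by a tokenizer.
token_chunker = TokenChunker("gpt2", chunk_size=256, chunk_overlap=50)

# Sentence-based: packs whole sentences up to the size limit.
sentence_chunker = SentenceChunker(
    tokenizer_or_token_counter="character",
    chunk_size=256,
    chunk_overlap=50,
    min_sentences_per_chunk=1,
)

# Recursive: hierarchical splitting for well-structured long documents.
recursive_chunker = RecursiveChunker(
    tokenizer_or_token_counter="character",
    chunk_size=256,
    min_characters_per_chunk=12,
)

for chunk in token_chunker(text):
    print(chunk.token_count, chunk.text[:60])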

examples/corpus_chunk.yaml
# MCP Server
servers:
  corpus: servers/corpus
# MCP Client Pipeline
pipeline:
- corpus.chunk_documents
Compile the Pipeline:
ultrarag build examples/corpus_chunk.yaml
Modify parameters as needed:

examples/parameters/corpus_chunk_parameter.yaml
corpus:
  chunk_backend: token # Chunking strategy: token / sentence / recursive
  chunk_backend_configs:
    recursive:
      chunk_size: 256 # Max characters/tokens per chunk
      min_characters_per_chunk: 12 # Minimum characters per chunk
      tokenizer_or_token_counter: character
    sentence:
      chunk_overlap: 50 # Overlap between chunks (characters)
      chunk_size: 256 # Max length per chunk
      delim: '[''.'', ''!'', ''?'', ''\n'']' # Sentence delimiters
      min_sentences_per_chunk: 1 # Minimum sentences per chunk
      tokenizer_or_token_counter: character
    token:
      chunk_overlap: 50 # Overlap between chunks (tokens)
      chunk_size: 256 # Max tokens per chunk
      tokenizer_or_token_counter: gpt2 # Tokenizer used
  chunk_path: corpora/chunks.jsonl # Output path for chunked corpus
  raw_chunk_path: corpora/text.jsonl # Input text corpus path
  use_title: false # Whether to prepend title to each chunk
Run the Pipeline:
ultrarag run examples/corpus_chunk.yaml
Once executed, the system outputs a standardized chunked corpus file, ready for downstream retrieval and generation modules.
Example output:
{"id": 0, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
{"id": 1, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
{"id": 2, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
You can invoke both parsing and chunking tools within the same Pipeline to build your own customized knowledge base.
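For example, a hypothetical examples/build_and_chunk.yaml could chain text parsing and chunking; in the corresponding parameter file, set raw_chunk_path to the same path as text_corpus_save_path so the chunker consumes the freshly parsed corpus.

# Hypothetical combined pipeline: parse the raw documents, then chunk them.
servers:
  corpus: servers/corpus
pipeline:
- corpus.build_text_corpus
- corpus.chunk_documents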