## Function
The Corpus Server is the core component in UltraRAG for processing raw corpus documents. It supports parsing, extracting, and standardizing text or image content from various data sources, and provides multiple chunking strategies to convert raw documents into formats that can be directly used for subsequent retrieval and generation.
The main functions of the Corpus Server include:
- Document Parsing: Supports content extraction from multiple file types (such as .pdf, .txt, .md, .docx, etc.).
- Corpus Construction: Saves parsed content as a standardized .jsonl structure, where each line corresponds to an independent document.
- Image Conversion: Supports converting PDF pages into image corpora, preserving layout and visual structure information.
- Text Chunking: Provides multiple splitting strategies such as Token, Sentence, Recursive, etc.
Example data:

Text Modality:

`data/corpus_example.jsonl`

```json
{"id": "2066692", "contents": "Truman Sports Complex The Harry S. Truman Sports...."}
{"id": "15106858", "contents": "Arrowhead Stadium 1970s...."}
```

Image Modality:

```json
{"id": 0, "image_id": "UltraRAG/page_0.jpg", "image_path": "image/UltraRAG/page_0.jpg"}
{"id": 1, "image_id": "UltraRAG/page_1.jpg", "image_path": "image/UltraRAG/page_1.jpg"}
{"id": 2, "image_id": "UltraRAG/page_2.jpg", "image_path": "image/UltraRAG/page_2.jpg"}
```
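Because each line of a corpus file is an independent JSON object, a corpus can be loaded with a few lines of standard-library Python. This is a minimal sketch (the `load_jsonl` helper is illustrative, not part of UltraRAG):

```python
import json

def load_jsonl(path):
    """Read a .jsonl corpus: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Text-corpus records carry at least "id" and "contents";
# writing is symmetric: one json.dumps(record) + "\n" per line.
```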
## Document Parsing Examples

### Text Parsing
The Corpus Server supports multiple text parsing formats, including .pdf, .txt, .md, .docx, .xps, .oxps, .epub, .mobi, .fb2, etc.

`examples/build_text_corpus.yaml`

```yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
  - corpus.build_text_corpus
```

Compile Pipeline:

```shell
ultrarag build examples/build_text_corpus.yaml
```
Modify the fields below to match your actual setup:

`examples/parameters/build_text_corpus_parameter.yaml`

```yaml
corpus:
  parse_file_path: data/UltraRAG.pdf
  text_corpus_save_path: corpora/text.jsonl
```
Here `parse_file_path` can be a single file or a folder path; when a folder is specified, the system automatically traverses it and batch-reads all parsable files inside.
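Folder traversal amounts to filtering files by supported extension. A rough stdlib sketch of that behavior (the extension set is taken from the formats listed above; the function and filtering order are assumptions, not UltraRAG's internals):

```python
from pathlib import Path

# Extensions listed in this guide as parsable (assumed set, for illustration).
PARSABLE = {".pdf", ".txt", ".md", ".docx", ".xps", ".oxps", ".epub", ".mobi", ".fb2"}

def collect_parsable(parse_file_path):
    """Return files to parse: the path itself if it is a file,
    otherwise every parsable file found under the folder."""
    p = Path(parse_file_path)
    if p.is_file():
        return [p]
    return sorted(f for f in p.rglob("*") if f.suffix.lower() in PARSABLE)
```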
Run Pipeline:

```shell
ultrarag run examples/build_text_corpus.yaml
```

After successful execution, the system will automatically parse the text and output a standardized corpus file, for example:

```json
{"id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
```
### PDF to Image
In multi-modal RAG scenarios, one approach is to directly convert document pages into images and perform retrieval and generation in the form of complete images.
The advantage of this method is that it can preserve the document’s layout, format, and visual structure, making retrieval and understanding closer to real reading scenarios.
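The resulting records follow a simple page-numbering scheme (see the output example further below). A sketch of how such records could be assembled once pages have been rendered to images; the helper name and path layout are illustrative, not UltraRAG's actual code:

```python
def image_corpus_records(doc_name, num_pages, image_dir="image"):
    """Build one image-corpus record per rendered page,
    mirroring the naming scheme shown in the output example."""
    return [
        {
            "id": i,
            "image_id": f"{doc_name}/page_{i}.jpg",
            "image_path": f"{image_dir}/{doc_name}/page_{i}.jpg",
        }
        for i in range(num_pages)
    ]
```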

`examples/build_image_corpus.yaml`

```yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
  - corpus.build_image_corpus
```

Compile Pipeline:

```shell
ultrarag build examples/build_image_corpus.yaml
```
Modify the fields below to match your actual setup:

`examples/parameters/build_image_corpus_parameter.yaml`

```yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl
  parse_file_path: data/UltraRAG.pdf
```
Similarly, the parse_file_path parameter can be specified as either a single file or a folder path. When set to a folder, the system will automatically traverse and process all files within it.
Run Pipeline:

```shell
ultrarag run examples/build_image_corpus.yaml
```

After successful execution, the system saves the generated image corpus file, where each record contains the image identifier and its relative path. The generated .jsonl file can be used directly as input for multi-modal retrieval or generation tasks. Output example:

```json
{"id": 0, "image_id": "UltraRAG/page_0.jpg", "image_path": "image/UltraRAG/page_0.jpg"}
{"id": 1, "image_id": "UltraRAG/page_1.jpg", "image_path": "image/UltraRAG/page_1.jpg"}
{"id": 2, "image_id": "UltraRAG/page_2.jpg", "image_path": "image/UltraRAG/page_2.jpg"}
```
### MinerU Parsing
MinerU is a widely used PDF parsing framework that supports high-precision extraction of text and layout structure.
UltraRAG integrates MinerU as a built-in tool that can be called directly in a pipeline for one-stop PDF-to-text and PDF-to-image corpus construction.

`examples/build_mineru_corpus.yaml`

```yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
  - corpus.mineru_parse
  - corpus.build_mineru_corpus
```

Compile Pipeline:

```shell
ultrarag build examples/build_mineru_corpus.yaml
```
Modify the fields below to match your actual setup:

`examples/parameters/build_mineru_corpus_parameter.yaml`

```yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl  # Image corpus save path
  mineru_dir: corpora/                         # Directory for MinerU parsing results
  mineru_extra_params:
    source: modelscope                         # Model download source (default Hugging Face; modelscope optional)
  parse_file_path: data/UltraRAG.pdf           # File or folder path to parse
  text_corpus_save_path: corpora/text.jsonl    # Text corpus save path
```
Similarly, the parse_file_path parameter can be either a single file or a folder path.
Run Pipeline (the first run downloads the MinerU models, which may take a while):

```shell
ultrarag run examples/build_mineru_corpus.yaml
```
After successful execution, the system outputs the corresponding text and image corpus files. Their formats are consistent with `build_text_corpus` and `build_image_corpus`, so they can be used directly for multi-modal retrieval and generation tasks.
## Document Chunking Examples
UltraRAG integrates the chonkie document chunking library and ships with three mainstream chunking strategies: Token Chunker, Sentence Chunker, and Recursive Chunker, to flexibly handle different types of text structure.

- Token Chunker: chunks by tokenizer, word, or character count; suitable for general text.
- Sentence Chunker: splits on sentence boundaries, preserving semantic integrity.
- Recursive Chunker: suited to well-structured long documents (such as books and papers); divides content hierarchically.
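The token strategy is essentially a sliding window with overlap; with `tokenizer_or_token_counter: character` the window is measured in characters. A simplified pure-Python sketch of that idea (not the chonkie implementation):

```python
def chunk_by_size(text, chunk_size=256, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters,
    with adjacent windows sharing chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    end = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, end, step)]
```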

`examples/corpus_chunk.yaml`

```yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
  - corpus.chunk_documents
```

Compile Pipeline:

```shell
ultrarag build examples/corpus_chunk.yaml
```
Modify the fields below to match your actual setup:

`examples/parameters/corpus_chunk_parameter.yaml`

```yaml
corpus:
  chunk_backend: token                # Chunking strategy: token / sentence / recursive
  chunk_backend_configs:
    recursive:
      min_characters_per_chunk: 12    # Minimum characters per chunk
    sentence:
      chunk_overlap: 50               # Overlap (characters) between adjacent chunks
      delim: '[''.'', ''!'', ''?'', ''\n'']'  # Sentence delimiters
      min_sentences_per_chunk: 1      # Minimum sentences per chunk
    token:
      chunk_overlap: 50               # Overlap (tokens) between adjacent chunks
  chunk_path: corpora/chunks.jsonl    # Output path for the chunked corpus
  chunk_size: 256                     # Maximum tokens per chunk
  raw_chunk_path: corpora/text.jsonl  # Path to the raw text corpus
  tokenizer_or_token_counter: character  # Tokenizer or token counter to use
  use_title: false                    # Whether to prepend the title to each chunk
```
Run Pipeline:

```shell
ultrarag run examples/corpus_chunk.yaml
```

After execution, the system outputs a standardized chunked corpus file that can be used directly by subsequent retrieval and generation modules. Output example:

```json
{"id": 0, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
{"id": 1, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
{"id": 2, "doc_id": "UltraRAG", "title": "UltraRAG", "contents": "xxxxx"}
```
You can call parsing tools and chunking tools in the same pipeline to build your own customized knowledge base.
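A combined pipeline might look like this: an illustrative sketch following the patterns above, not a file shipped with UltraRAG (in the parameter file, `raw_chunk_path` would point at the `text_corpus_save_path` produced by the parsing step):

```yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline: parse first, then chunk the parsed corpus
pipeline:
  - corpus.build_text_corpus
  - corpus.chunk_documents
```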