
build_text_corpus

Signature
@app.tool(output="parse_file_path,text_corpus_save_path->None")
async def build_text_corpus(parse_file_path: str, text_corpus_save_path: str) -> None
Function
  • Supports .txt / .md.
  • Supports .docx (reads paragraphs and tables).
  • Supports .pdf / .xps / .oxps / .epub / .mobi / .fb2 (pure text extraction via pymupdf).
  • Recursively processes in directory mode.
Output Format (JSONL)
{"id": "<stem>", "title": "<stem>", "contents": "<full text>"}
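A minimal sketch of how records of this shape could be produced for plain-text inputs. The helper names (`text_record`, `write_corpus`) are illustrative, not part of the tool's API:

```python
import json
from pathlib import Path

def text_record(path: Path) -> dict:
    # One JSONL record per document: id and title are the file stem,
    # contents is the full extracted text.
    return {
        "id": path.stem,
        "title": path.stem,
        "contents": path.read_text(encoding="utf-8"),
    }

def write_corpus(paths, out_path: Path) -> None:
    # One JSON object per line (JSONL); ensure_ascii=False keeps
    # non-ASCII text (e.g., CJK) readable in the output file.
    with out_path.open("w", encoding="utf-8") as f:
        for p in sorted(paths):
            f.write(json.dumps(text_record(p), ensure_ascii=False) + "\n")
```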

build_image_corpus

Signature
@app.tool(output="parse_file_path,image_corpus_save_path->None")
async def build_image_corpus(parse_file_path: str, image_corpus_save_path: str) -> None
Function
  • PDF only: renders each page as a JPG (RGB) at 144 DPI and validates the generated image files.
  • Recursively processes in directory mode.
Output Index (JSONL)
{"id": 0, "image_id": "paper/page_0.jpg", "image_path": "image/paper/page_0.jpg"}
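A sketch of how the index records could be assembled once the pages are rendered (the rendering itself, done via pymupdf, is not shown). Assumptions: `id` is numbered per document starting from 0, and the on-disk layout is `<root>/<stem>/page_<n>.jpg`; the real tool may number ids globally across documents:

```python
def image_index(doc_stem: str, page_count: int, root: str = "image"):
    # One index record per rendered page, numbered from 0.
    # image_path mirrors the on-disk layout <root>/<stem>/page_<n>.jpg.
    records = []
    for n in range(page_count):
        name = f"{doc_stem}/page_{n}.jpg"
        records.append({"id": n, "image_id": name, "image_path": f"{root}/{name}"})
    return records
```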

mineru_parse

Signature
@app.tool(output="parse_file_path,mineru_dir,mineru_extra_params->None")
async def mineru_parse(
    parse_file_path: str, 
    mineru_dir: str, 
    mineru_extra_params: Optional[Dict[str, Any]] = None
) -> None
Function
  • Calls CLI mineru to structurally parse PDF/directory and outputs to mineru_dir.
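A sketch of how the CLI invocation might be assembled from the tool's parameters. The flag names (`-p` for input, `-o` for output) and the `--key value` mapping for `mineru_extra_params` are assumptions for illustration; check the mineru CLI's own help for the actual interface:

```python
def mineru_command(parse_file_path: str, mineru_dir: str, extra=None):
    # Build the argv list; the caller would pass it to subprocess.run.
    # Flag names and extra-parameter mapping are assumptions.
    cmd = ["mineru", "-p", parse_file_path, "-o", mineru_dir]
    for key, value in (extra or {}).items():
        cmd += [f"--{key}", str(value)]
    return cmd
```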

build_mineru_corpus

Signature
@app.tool(output="mineru_dir,parse_file_path,text_corpus_save_path,image_corpus_save_path->None")
async def build_mineru_corpus(
    mineru_dir: str, 
    parse_file_path: str, 
    text_corpus_save_path: str, 
    image_corpus_save_path: str
) -> None
Function
  • Aggregates MinerU parsing artifacts into Text Corpus JSONL and Image Index JSONL.
Output Format (JSONL)
  • Text:
{"id": "<stem>", "title": "<stem>", "contents": "<markdown full text>"}
  • Image:
{"id": 0, "image_id": "paper/page_0.jpg", "image_path": "images/paper/page_0.jpg"}
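A sketch of the text-side aggregation: collecting MinerU's Markdown outputs into Text Corpus JSONL records. The assumption that each parsed document leaves a `.md` file somewhere under `mineru_dir` is illustrative; MinerU's actual output layout may differ:

```python
import json
from pathlib import Path

def aggregate_markdown(mineru_dir: str, out_path: str) -> None:
    # Assumption: one Markdown file per parsed document exists
    # under mineru_dir; the file stem becomes id and title.
    root = Path(mineru_dir)
    with Path(out_path).open("w", encoding="utf-8") as f:
        for md in sorted(root.rglob("*.md")):
            rec = {
                "id": md.stem,
                "title": md.stem,
                "contents": md.read_text(encoding="utf-8"),
            }
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```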

chunk_documents

Signature
@app.tool(output="raw_chunk_path,chunk_backend_configs,chunk_backend,tokenizer_or_token_counter,chunk_size,chunk_path,use_title->None")
async def chunk_documents(
    raw_chunk_path: str,
    chunk_backend_configs: Dict[str, Any],
    chunk_backend: str = "token",
    tokenizer_or_token_counter: str = "character",
    chunk_size: int = 256,
    chunk_path: Optional[str] = None,
    use_title: bool = True,
) -> None
Function
  • Chunks the input text corpus (JSONL with id/title/contents fields) into passages using the selected backend.
  • Backend: chunk_backend supports token / sentence / recursive.
  • Tokenizer: tokenizer_or_token_counter can be word, character, or a tiktoken encoding name (e.g., gpt2).
  • Chunk size: chunk_size controls the target chunk size (overlap defaults to chunk_size / 4).
  • use_title optionally prepends the document title to each chunk.
Output Format (JSONL)
{"id": 0, "doc_id": "paper", "title": "paper", "contents": "Chunked text"}
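A minimal sketch of the token backend with character counting, using the overlap default (chunk_size / 4) described above. How the title is joined to the chunk text (here, a single space) is an assumption:

```python
def chunk_text(doc_id: str, title: str, text: str,
               chunk_size: int = 256, use_title: bool = True):
    # Character-based "token" counting; overlap defaults to
    # chunk_size // 4, so the window advances by chunk_size - overlap.
    overlap = chunk_size // 4
    step = chunk_size - overlap
    chunks, i, n = [], 0, 0
    while i < len(text):
        piece = text[i:i + chunk_size]
        contents = f"{title} {piece}" if use_title else piece
        chunks.append({"id": n, "doc_id": doc_id,
                       "title": title, "contents": contents})
        n += 1
        i += step
    return chunks
```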

Configuration

# servers/corpus/parameter.yaml
parse_file_path: data/UltraRAG.pdf
text_corpus_save_path: corpora/text.jsonl
image_corpus_save_path: corpora/image.jsonl

# mineru
mineru_dir: corpora/
mineru_extra_params:
  source: modelscope

# chunking parameters
raw_chunk_path: corpora/text.jsonl
chunk_path: corpora/chunks.jsonl
use_title: false
chunk_backend: sentence # choices=["token", "sentence", "recursive"]
tokenizer_or_token_counter: character
chunk_size: 512
chunk_backend_configs:
  token:
    chunk_overlap: 50
  sentence:
    chunk_overlap: 50
    min_sentences_per_chunk: 1
    delim: "['.', '!', '?', ';', '。', '!', '?', '\\n']"
  recursive:
    min_characters_per_chunk: 12
Parameter Description:

| Parameter | Type | Description |
| --- | --- | --- |
| parse_file_path | str | Input file or directory path |
| text_corpus_save_path | str | Text corpus output path (JSONL) |
| image_corpus_save_path | str | Image corpus index output path (JSONL) |
| mineru_dir | str | MinerU output root directory |
| mineru_extra_params | dict | MinerU extra parameters, such as source, layout, etc. |
| raw_chunk_path | str | Chunking input file path (JSONL format) |
| chunk_path | str | Chunking output path |
| use_title | bool | Whether to prepend the document title to each chunk |
| chunk_backend | str | Chunking method: token, sentence, or recursive |
| tokenizer_or_token_counter | str | Tokenizer or counting method: word, character, or a tiktoken encoding name (e.g., gpt2) |
| chunk_size | int | Target size for each chunk |
| chunk_backend_configs | dict | Configuration items for each chunking method (see below) |
chunk_backend_configs Detailed Parameters:

| Backend Type | Parameter | Description |
| --- | --- | --- |
| token | chunk_overlap | Overlapping tokens between chunks |
| sentence | chunk_overlap | Overlap count between chunks |
| sentence | min_sentences_per_chunk | Minimum number of sentences per chunk |
| sentence | delim | Sentence delimiter list (a Python list written as a string) |
| recursive | min_characters_per_chunk | Minimum characters per chunk for recursive splitting |
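Since delim is stored in the YAML as a Python list literal inside a string, a sketch of safely turning it back into a real list (the helper name is illustrative):

```python
import ast

def parse_delim(delim_str: str) -> list:
    # ast.literal_eval evaluates only Python literals, so it safely
    # converts the string-encoded list without executing arbitrary code.
    value = ast.literal_eval(delim_str)
    if not isinstance(value, list):
        raise ValueError("delim must be a list of strings")
    return value
```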