build_text_corpus
Signature
- Supports .txt / .md.
- Supports .docx (reads paragraphs and tables).
- Supports .pdf / .xps / .oxps / .epub / .mobi / .fb2 (plain-text extraction via pymupdf).
- In directory mode, processes files recursively.
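The directory-mode behavior described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the helper name `collect_supported_files` and the grouping of extensions into sets are assumptions made for the example; only the extension list comes from the bullets above.

```python
from pathlib import Path

# Extensions taken from the list above, grouped by how they would be read.
PLAIN_TEXT = {".txt", ".md"}
DOCX = {".docx"}
PYMUPDF = {".pdf", ".xps", ".oxps", ".epub", ".mobi", ".fb2"}
SUPPORTED = PLAIN_TEXT | DOCX | PYMUPDF


def collect_supported_files(parse_file_path: str) -> list[Path]:
    """Hypothetical helper: list supported files for a file or directory path."""
    root = Path(parse_file_path)
    if root.is_file():
        # Single-file mode: keep the file only if its extension is supported.
        return [root] if root.suffix.lower() in SUPPORTED else []
    # Directory mode: walk all subdirectories recursively.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Matching on the lowercased suffix means files such as `REPORT.PDF` are picked up as well; whether the real tool is case-insensitive is not stated in this page.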
build_image_corpus
Signature
- Supports PDF only: renders each page as an RGB JPG at 144 DPI and checks that the file is valid.
- In directory mode, processes files recursively.
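To make the 144 DPI setting concrete: PDF page sizes are expressed in points (1/72 inch), so rendering at 144 DPI corresponds to a zoom factor of 2. The sketch below only does the arithmetic; the function name `rendered_size` is hypothetical, and the exact rounding used by the real renderer is not specified here.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 144) -> tuple[int, int]:
    """Hypothetical helper: approximate pixel size of a page rendered at `dpi`."""
    zoom = dpi / PDF_POINTS_PER_INCH  # 144 DPI gives a zoom factor of 2.0
    return round(width_pt * zoom), round(height_pt * zoom)


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))  # (1224, 1584)
```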
mineru_parse
Signature
- Calls the `mineru` CLI to structurally parse a PDF or directory and writes the output to `mineru_dir`.
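As a rough sketch of what such a CLI call might look like, the snippet below only builds the argument list and does not run anything. It assumes the `mineru` command takes `-p` for the input path and `-o` for the output directory, and that each entry in `mineru_extra_params` is passed as a `--key value` pair; both points are assumptions for illustration and should be checked against the MinerU CLI help. The function name `build_mineru_command` is hypothetical.

```python
def build_mineru_command(parse_file_path, mineru_dir, mineru_extra_params=None):
    """Hypothetical helper: assemble an argv list for the mineru CLI."""
    # Assumed flags: -p (input path) and -o (output directory).
    cmd = ["mineru", "-p", parse_file_path, "-o", mineru_dir]
    # Assumed mapping: every extra parameter becomes "--key value".
    for key, value in (mineru_extra_params or {}).items():
        cmd += [f"--{key}", str(value)]
    return cmd


print(build_mineru_command("docs/", "out/mineru", {"source": "local"}))
```

The resulting list could then be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.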
build_mineru_corpus
Signature
- Aggregates MinerU parsing artifacts into Text Corpus JSONL and Image Index JSONL.
- Text:
- Image:
chunk_documents
Signature
- Chunks the input text corpus (JSONL, containing `id`/`title`/`contents`) into paragraphs using the selected backend:
  - Chunk Backend: supports `token`/`sentence`/`recursive`.
  - Tokenizer: `tokenizer_or_token_counter` can be `word`, `character`, or a `tiktoken` encoding name (e.g., `gpt2`).
  - Chunk Size: controls block size via `chunk_size` (overlap defaults to size/4).
  - Optionally prepends the document title to each block (`use_title`).
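The following is a minimal sketch of token-style chunking with a `word` counter, under stated assumptions: it is a plain-Python stand-in rather than the backend the tool actually uses, the helper names `chunk_words` and `chunk_record` are hypothetical, and the output fields (`id`, `doc_id`, `title`, `contents`) are illustrative since the output schema is not documented here. It does follow the described behavior of defaulting the overlap to a quarter of `chunk_size` and optionally putting the title at the start of each chunk.

```python
import json


def chunk_words(text, chunk_size, chunk_overlap=None):
    """Split text into overlapping windows of whitespace-separated words."""
    words = text.split()
    if chunk_overlap is None:
        chunk_overlap = chunk_size // 4  # default overlap: size / 4
    step = max(1, chunk_size - chunk_overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def chunk_record(line, chunk_size, use_title=True):
    """Chunk one JSONL line that contains id/title/contents."""
    doc = json.loads(line)
    out = []
    for i, chunk in enumerate(chunk_words(doc["contents"], chunk_size)):
        # With use_title, the document title is placed before the chunk text.
        body = f"{doc['title']}\n{chunk}" if use_title else chunk
        out.append({"id": i, "doc_id": doc["id"], "title": doc["title"], "contents": body})
    return out
```

For example, a 10-word document with `chunk_size=4` gets an overlap of 1 and a step of 3, which yields three chunks covering words 0-3, 3-6, and 6-9.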
Configuration
| Parameter | Type | Description |
|---|---|---|
| parse_file_path | str | Input file or directory path |
| text_corpus_save_path | str | Text corpus output path (JSONL) |
| image_corpus_save_path | str | Image corpus index output path (JSONL) |
| mineru_dir | str | MinerU output root directory |
| mineru_extra_params | dict | Extra MinerU parameters, such as source, layout, etc. |
| raw_chunk_path | str | Chunking input file path (JSONL format) |
| chunk_path | str | Chunking output path |
| use_title | bool | Whether to prepend the document title to each chunk |
| chunk_backend | str | Chunking method: token, sentence, or recursive |
| tokenizer_or_token_counter | str | Tokenizer or counting method. Options: word, character, or a tiktoken encoding name (e.g., gpt2) |
| chunk_size | int | Target size for each chunk |
| chunk_backend_configs | dict | Configuration items for each chunking method (see below) |
chunk_backend_configs Detailed Parameters:
| Backend Type | Parameter | Description |
|---|---|---|
| token | chunk_overlap | Overlapping tokens between chunks |
| sentence | chunk_overlap | Overlapping count between chunks |
| | min_sentences_per_chunk | Minimum number of sentences per chunk |
| | delim | Sentence delimiter list (Python list in string format) |
| recursive | min_characters_per_chunk | Minimum character unit for recursive splitting |
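Since `delim` is given as a Python list in string format, one plausible way to turn it into a real list is `ast.literal_eval`, which parses literals without executing arbitrary code. The sketch below is an illustration only: the helper name `parse_delim` and all values in the example dictionary are hypothetical, and how the tool itself parses this field is not stated here.

```python
import ast


def parse_delim(delim: str) -> list[str]:
    """Hypothetical helper: parse a delimiter list written as a Python literal."""
    value = ast.literal_eval(delim)
    if not isinstance(value, list) or not all(isinstance(d, str) for d in value):
        raise ValueError(f"delim must be a list of strings, got: {delim!r}")
    return value


# Illustrative values only; the parameter names come from the tables above.
chunk_backend_configs = {
    "token": {"chunk_overlap": 64},
    "sentence": {
        "chunk_overlap": 1,
        "min_sentences_per_chunk": 1,
        "delim": "['.', '!', '?']",
    },
    "recursive": {"min_characters_per_chunk": 24},
}

print(parse_delim(chunk_backend_configs["sentence"]["delim"]))  # ['.', '!', '?']
```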