
Introduction

In everyday life, we often run into situations like buying a new device and not knowing how to configure a particular feature.
Reading through the entire manual is time-consuming and inefficient.
If an intelligent assistant could directly answer such questions, it would greatly enhance the user experience.
For example, a user who purchased a Nikon Z7 camera wants to know “in which scenarios the electronic vibration reduction function is unavailable.”
The response generated by a standard LLM might look like this:
The electronic vibration reduction function of the Nikon Z7 is mainly used to help stabilize the image when shooting still objects or portraits. However, in some specific scenarios, this function may be unavailable.

According to Nikon’s official documentation and common usage scenarios, the electronic vibration reduction function is unavailable in the following cases:

1. When shooting moving objects: Because the electronic VR adjusts shutter speed based on camera movement, it becomes unstable and fails to work properly for moving subjects.

2. When using long exposure: During long exposure shots, camera shake becomes more noticeable, and electronic VR cannot effectively compensate for it.

3. When using long exposure in low-light environments: In low light, long exposure exaggerates camera shake, making the electronic VR ineffective.

Therefore, the electronic vibration reduction function of the Nikon Z7 is unavailable when shooting moving subjects, using long exposures, or long exposures under low-light conditions.
Although the model produces a fluent and well-structured answer, it lacks factual grounding since it does not rely on real documentation, resulting in seemingly plausible yet incorrect responses. To address this issue, UR-2.0 implements a multimodal retrieval-augmented pipeline based on VisRAG.
Instead of relying on text-only parsing, VisRAG directly feeds screenshots of relevant document pages into a vision-language model, enabling document-grounded question answering based on visual semantics.

Building a Personal Knowledge Base

Using the “Nikon User Manual” PDF as an example, we use the Corpus Server in UR-2.0 to convert the PDF directly into an image corpus:
examples/build_image_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.build_image_corpus
Run the following command:
ultrarag build examples/build_image_corpus.yaml
Modify the parameters:
examples/parameters/build_image_corpus_parameter.yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl
  parse_file_path: data/nikon.pdf  # default: data/UltraRAG.pdf
Run the Pipeline:
ultrarag run examples/build_image_corpus.yaml
After completion, an image corpus file will be automatically generated:
corpora/image.jsonl
{"id": 0, "image_id": "nikon/page_0.jpg", "image_path": "image/nikon/page_0.jpg"}
{"id": 1, "image_id": "nikon/page_1.jpg", "image_path": "image/nikon/page_1.jpg"}
{"id": 2, "image_id": "nikon/page_2.jpg", "image_path": "image/nikon/page_2.jpg"}
{"id": 3, "image_id": "nikon/page_3.jpg", "image_path": "image/nikon/page_3.jpg"}
...
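For reference, the following Python sketch shows roughly what this step produces: each PDF page is rendered to a JPEG and recorded as one JSONL line in the format shown above. It uses pypdfium2 and Pillow purely for illustration; the Corpus Server's internal implementation may differ.
import json
from pathlib import Path

import pypdfium2 as pdfium  # pip install pypdfium2 pillow

pdf_path = Path("data/nikon.pdf")
image_dir = Path("image/nikon")
image_dir.mkdir(parents=True, exist_ok=True)
Path("corpora").mkdir(exist_ok=True)

pdf = pdfium.PdfDocument(str(pdf_path))
with open("corpora/image.jsonl", "w", encoding="utf-8") as out:
    for i in range(len(pdf)):
        # Render at 2x scale so small manual text stays legible for the VLM.
        page = pdf[i]
        pil_image = page.render(scale=2.0).to_pil()
        image_path = image_dir / f"page_{i}.jpg"
        pil_image.save(image_path)
        record = {
            "id": i,
            "image_id": f"nikon/page_{i}.jpg",
            "image_path": str(image_path),
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")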
Next, use the Retriever Server to perform embedding and indexing on the image corpus:
examples/corpus_index.yaml
# MCP Server
servers:
  retriever: servers/retriever

# MCP Client Pipeline
pipeline:
- retriever.retriever_init
- retriever.retriever_embed
- retriever.retriever_index
Execute the command:
ultrarag build examples/corpus_index.yaml
Modify the parameters:
examples/parameters/corpus_index_parameter.yaml
retriever:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: null  # default: document
        psg_task: retrieval    # default: null
        q_prompt_name: query
        q_task: retrieval      # default: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: corpora/image.jsonl  # default: data/corpus_example.jsonl
  embedding_path: embedding/embedding.npy
  faiss_use_gpu: true
  gpu_ids: 1  # default: 0,1
  index_chunk_size: 50000
  index_path: index/index.index
  is_multimodal: true  # default: false
  model_name_or_path: jinaai/jina-embeddings-v4  # default: openbmb/MiniCPM-Embedding-Light
  overwrite: false
Run the indexing process:
ultrarag run examples/corpus_index.yaml
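Conceptually, retriever_embed encodes every page image with the multimodal embedder and writes the vectors to embedding_path, and retriever_index stores them in a vector index at index_path. The Python sketch below illustrates the indexing half with a plain flat FAISS index; the index type and options UR-2.0 actually uses may differ.
from pathlib import Path

import faiss
import numpy as np

# Load the page-image vectors written by the embed step, shape (num_pages, dim).
embeddings = np.load("embedding/embedding.npy").astype("float32")

# A flat inner-product index (equivalent to cosine if the vectors are normalized).
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

Path("index").mkdir(exist_ok=True)
faiss.write_index(index, "index/index.index")
print(f"Indexed {index.ntotal} page embeddings")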

VisRAG

Prepare a user query file:
data/test.jsonl
{"id": 0, "question": "尼康Z7的电子减震功能在哪些场景不可用?", "golden_answers": [], "meta_data": {}}
Define the VisRAG Pipeline:
examples/visrag.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever
  prompt: servers/prompt
  generation: servers/generation
  evaluation: servers/evaluation
  custom: servers/custom

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_init
- retriever.retriever_search
- generation.generation_init
- prompt.qa_boxed
- generation.multimodal_generate:
    input:
      multimodal_path: ret_psg
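Before running the pipeline, it may help to see what retriever.retriever_search does conceptually: the question is embedded with the same model used for the page images, the FAISS index returns the top-k page ids, and those ids are mapped back to page image paths (this list of paths is what ret_psg refers to above). The sketch below is illustrative only; embed_query is a hypothetical placeholder for the multimodal embedder, not a UR-2.0 API.
import json

import faiss
import numpy as np

def embed_query(text: str) -> np.ndarray:
    # Placeholder: wrap jinaai/jina-embeddings-v4 (or your embedder of choice) here.
    raise NotImplementedError

# Map corpus row ids back to page screenshots.
corpus = [json.loads(line) for line in open("corpora/image.jsonl", encoding="utf-8")]
index = faiss.read_index("index/index.index")

query = "In which scenarios is the electronic vibration reduction function of the Nikon Z7 unavailable?"
q_vec = embed_query(query).astype("float32").reshape(1, -1)
scores, ids = index.search(q_vec, 5)  # top_k: 5, matching the parameters below

ret_psg = [corpus[i]["image_path"] for i in ids[0]]
print(ret_psg)  # retrieved manual page screenshots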
Execute the following command:
ultrarag build examples/visrag.yaml
Modify parameters:
examples/parameters/visrag_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: test             # default: nq
    path: data/test.jsonl  # default: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
generation:
  backend: vllm
  backend_configs:
    hf:
      batch_size: 8
      gpu_ids: 2,3
      model_name_or_path: openbmb/MiniCPM4-8B
      trust_remote_code: true
    openai:
      api_key: ''
      base_delay: 1.0
      base_url: http://localhost:8000/v1
      concurrency: 8
      model_name: MiniCPM4-8B
      retries: 3
    vllm:
      dtype: auto
      gpu_ids: 2,3
      gpu_memory_utilization: 0.9
      model_name_or_path: openbmb/MiniCPM-V-4  # default: openbmb/MiniCPM4-8B
      trust_remote_code: true
  sampling_params:
    chat_template_kwargs:
      enable_thinking: false
    max_tokens: 2048
    temperature: 0.7
    top_p: 0.8
  system_prompt: ''
prompt:
  template: prompt/visrag.jinja  # default: prompt/qa_boxed.jinja
retriever:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: null  # default: document
        psg_task: retrieval    # default: null
        q_prompt_name: query
        q_task: retrieval      # default: null
      trust_remote_code: true
  batch_size: 16 
  corpus_path: corpora/image.jsonl  # default: data/corpus_example.jsonl
  faiss_use_gpu: true
  gpu_ids: 1  # default: 0,1
  index_path: index/index.index
  is_multimodal: true  # default: false
  model_name_or_path: jinaai/jina-embeddings-v4  # default: openbmb/MiniCPM-Embedding-Light
  query_instruction: ''
  top_k: 5

Run the Pipeline:
ultrarag run examples/visrag.yaml
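Under the hood, generation.multimodal_generate passes the question together with the retrieved page screenshots to the vision-language model. The sketch below shows the same idea through an OpenAI-compatible endpoint (for example, a separately launched vLLM server hosting openbmb/MiniCPM-V-4, matching the openai backend settings above); the pipeline's vllm backend instead loads the model in-process, so this is only an illustration.
import base64

from openai import OpenAI

def to_data_url(path: str) -> str:
    # Encode a page screenshot as a base64 data URL for the chat API.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

question = "In which scenarios is the electronic vibration reduction function of the Nikon Z7 unavailable?"
ret_psg = ["image/nikon/page_0.jpg"]  # paths returned by the retrieval step

content = [{"type": "text", "text": question}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in ret_psg
]

resp = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4",
    messages=[{"role": "user", "content": content}],
    temperature=0.7,
    max_tokens=2048,
)
print(resp.choices[0].message.content)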
Launch the Case Study Viewer:
python ./script/case_study.py \
  --data output/memory_test_visrag_20251015_163425.json \
  --host 127.0.0.1 \
  --port 8070 \
  --title "Case Study Viewer"
The system will automatically display the retrieved manual page screenshots. The model’s generated answer is now grounded in actual visual document content, for example:
The electronic vibration reduction function of the Nikon Z7 is unavailable in the following scenarios:

1. When the frame size is 1920×1080.

2. When using 1920×1080 120p, 1920×1080 100p, or 1920×1080 (slow motion) modes.

This information can be found directly in the text shown in the retrieved image, specifically in the section describing the Nikon Z7’s electronic vibration reduction function.
Through visual-semantic augmentation, the system can now provide accurate, document-grounded answers — especially useful for multimodal scenarios such as manuals, textbooks, and reports.