
Introduction

In everyday life, we often run into situations like buying a new device and not knowing how to configure a particular feature.
Reading through the entire manual is time-consuming and inefficient.
If an intelligent assistant could directly answer such questions, it would greatly enhance the user experience.
For example, a user who purchased a Nikon Z7 camera wants to know “in which scenarios the electronic vibration reduction function is unavailable.”
The response generated by a standard LLM might look like this:
The electronic vibration reduction function of the Nikon Z7 is mainly used to help stabilize the image when shooting still objects or portraits. However, in some specific scenarios, this function may be unavailable.

According to Nikon’s official documentation and common usage scenarios, the electronic vibration reduction function is unavailable in the following cases:

1. When shooting moving objects: Because the electronic VR adjusts shutter speed based on camera movement, it becomes unstable and fails to work properly for moving subjects.

2. When using long exposure: During long exposure shots, camera shake becomes more noticeable, and electronic VR cannot effectively compensate for it.

3. When using long exposure in low-light environments: In low light, long exposure exaggerates camera shake, making the electronic VR ineffective.

Therefore, the electronic vibration reduction function of the Nikon Z7 is unavailable when shooting moving subjects, using long exposures, or long exposures under low-light conditions.
Although the model produces a fluent and well-structured answer, it lacks factual grounding since it does not rely on real documentation, resulting in seemingly plausible yet incorrect responses. To address this issue, UR-2.0 implements a multimodal retrieval-augmented pipeline based on VisRAG.
Instead of relying on text-only parsing, VisRAG directly feeds screenshots of relevant document pages into a vision-language model, enabling document-grounded question answering based on visual semantics.

Building a Personal Knowledge Base

Using the “Nikon User Manual” PDF as an example, we use the Corpus Server in UR-2.0 to convert the PDF directly into an image corpus:
examples/build_image_corpus.yaml
# MCP Server
servers:
  corpus: servers/corpus

# MCP Client Pipeline
pipeline:
- corpus.build_image_corpus
Run the following command:
ultrarag build examples/build_image_corpus.yaml
Modify the parameters:
examples/parameters/build_image_corpus_parameter.yaml
corpus:
  image_corpus_save_path: corpora/image.jsonl
  parse_file_path: data/nikon.pdf  # default: data/UltraRAG.pdf
Run the Pipeline:
ultrarag run examples/build_image_corpus.yaml
After completion, an image corpus file will be automatically generated:
corpora/image.jsonl
{"id": 0, "image_id": "nikon/page_0.jpg", "image_path": "image/nikon/page_0.jpg"}
{"id": 1, "image_id": "nikon/page_1.jpg", "image_path": "image/nikon/page_1.jpg"}
{"id": 2, "image_id": "nikon/page_2.jpg", "image_path": "image/nikon/page_2.jpg"}
{"id": 3, "image_id": "nikon/page_3.jpg", "image_path": "image/nikon/page_3.jpg"}
...
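For reference, the following Python sketch shows roughly what this step produces: each PDF page is rendered to a JPEG and recorded as one JSONL line in the format shown above. It uses pypdfium2 and Pillow purely for illustration; the Corpus Server's internal implementation may differ.
import json
from pathlib import Path

import pypdfium2 as pdfium  # pip install pypdfium2 pillow

pdf_path = Path("data/nikon.pdf")
image_dir = Path("image/nikon")
image_dir.mkdir(parents=True, exist_ok=True)
Path("corpora").mkdir(exist_ok=True)

pdf = pdfium.PdfDocument(str(pdf_path))
with open("corpora/image.jsonl", "w", encoding="utf-8") as out:
    for i in range(len(pdf)):
        # Render at 2x scale so small manual text stays legible for the VLM.
        page = pdf[i]
        pil_image = page.render(scale=2.0).to_pil()
        image_path = image_dir / f"page_{i}.jpg"
        pil_image.save(image_path)
        record = {
            "id": i,
            "image_id": f"nikon/page_{i}.jpg",
            "image_path": str(image_path),
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")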
Next, use the Retriever Server to perform embedding and indexing on the image corpus:
examples/corpus_index.yaml
# MCP Server
servers:
  retriever: servers/retriever

# MCP Client Pipeline
pipeline:
- retriever.retriever_init
- retriever.retriever_embed
- retriever.retriever_index
Execute the command:
ultrarag build examples/corpus_index.yaml
Modify the parameters:
examples/parameters/corpus_index_parameter.yaml
retriever:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: null  # default: document
        psg_task: retrieval    # default: null
        q_prompt_name: query
        q_task: retrieval      # default: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: corpora/image.jsonl  # default: data/corpus_example.jsonl
  embedding_path: embedding/embedding.npy
  faiss_use_gpu: true
  gpu_ids: 1  # default: 0,1
  index_chunk_size: 50000
  index_path: index/index.index
  is_multimodal: true  # default: false
  model_name_or_path: jinaai/jina-embeddings-v4  # default: openbmb/MiniCPM-Embedding-Light
  overwrite: false
Run the indexing process:
ultrarag run examples/corpus_index.yaml
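Conceptually, retriever_embed encodes every page image with the multimodal embedder and writes the vectors to embedding_path, and retriever_index stores them in a vector index at index_path. The Python sketch below illustrates the indexing half with a plain flat FAISS index; the index type and options UR-2.0 actually uses may differ.
from pathlib import Path

import faiss
import numpy as np

# Load the page-image vectors written by the embed step, shape (num_pages, dim).
embeddings = np.load("embedding/embedding.npy").astype("float32")

# A flat inner-product index (equivalent to cosine if the vectors are normalized).
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

Path("index").mkdir(exist_ok=True)
faiss.write_index(index, "index/index.index")
print(f"Indexed {index.ntotal} page embeddings")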

VisRAG

Prepare a user query file:
data/test.jsonl
{"id": 0, "question": "尼康Z7的电子减震功能在哪些场景不可用?", "golden_answers": [], "meta_data": {}}
Define the VisRAG Pipeline:
examples/visrag.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever
  prompt: servers/prompt
  generation: servers/generation
  evaluation: servers/evaluation
  custom: servers/custom

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_init
- retriever.retriever_search
- generation.generation_init
- prompt.qa_boxed
- generation.multimodal_generate:
    input:
      multimodal_path: ret_psg
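Before running the pipeline, it may help to see what retriever.retriever_search does conceptually: the question is embedded with the same model used for the page images, the FAISS index returns the top-k page ids, and those ids are mapped back to page image paths (this list of paths is what ret_psg refers to above). The sketch below is illustrative only; embed_query is a hypothetical placeholder for the multimodal embedder, not a UR-2.0 API.
import json

import faiss
import numpy as np

def embed_query(text: str) -> np.ndarray:
    # Placeholder: wrap jinaai/jina-embeddings-v4 (or your embedder of choice) here.
    raise NotImplementedError

# Map corpus row ids back to page screenshots.
corpus = [json.loads(line) for line in open("corpora/image.jsonl", encoding="utf-8")]
index = faiss.read_index("index/index.index")

query = "In which scenarios is the electronic vibration reduction function of the Nikon Z7 unavailable?"
q_vec = embed_query(query).astype("float32").reshape(1, -1)
scores, ids = index.search(q_vec, 5)  # top_k: 5, matching the parameters below

ret_psg = [corpus[i]["image_path"] for i in ids[0]]
print(ret_psg)  # retrieved manual page screenshots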
Execute the following command:
ultrarag build examples/visrag.yaml
Modify parameters:
examples/parameters/visrag_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: test             # default: nq
    path: data/test.jsonl  # default: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
generation:
  backend: vllm
  backend_configs:
    hf:
      batch_size: 8
      gpu_ids: 2,3
      model_name_or_path: openbmb/MiniCPM4-8B
      trust_remote_code: true
    openai:
      api_key: ''
      base_delay: 1.0
      base_url: http://localhost:8000/v1
      concurrency: 8
      model_name: MiniCPM4-8B
      retries: 3
    vllm:
      dtype: auto
      gpu_ids: 2,3
      gpu_memory_utilization: 0.9
      model_name_or_path: openbmb/MiniCPM-V-4  # default: openbmb/MiniCPM4-8B
      trust_remote_code: true
  sampling_params:
    chat_template_kwargs:
      enable_thinking: false
    max_tokens: 2048
    temperature: 0.7
    top_p: 0.8
  system_prompt: ''
prompt:
  template: prompt/visrag.jinja  # default: prompt/qa_boxed.jinja
retriever:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: null  # default: document
        psg_task: retrieval    # default: null
        q_prompt_name: query
        q_task: retrieval      # default: null
      trust_remote_code: true
  batch_size: 16 
  corpus_path: corpora/image.jsonl  # default: data/corpus_example.jsonl
  faiss_use_gpu: true
  gpu_ids: 1  # default: 0,1
  index_path: index/index.index
  is_multimodal: true  # default: false
  model_name_or_path: jinaai/jina-embeddings-v4  # default: openbmb/MiniCPM-Embedding-Light
  query_instruction: ''
  top_k: 5

Run the Pipeline:
ultrarag run examples/visrag.yaml
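Under the hood, generation.multimodal_generate passes the question together with the retrieved page screenshots to the vision-language model. The sketch below shows the same idea through an OpenAI-compatible endpoint (for example, a separately launched vLLM server hosting openbmb/MiniCPM-V-4, matching the openai backend settings above); the pipeline's vllm backend instead loads the model in-process, so this is only an illustration.
import base64

from openai import OpenAI

def to_data_url(path: str) -> str:
    # Encode a page screenshot as a base64 data URL for the chat API.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

question = "In which scenarios is the electronic vibration reduction function of the Nikon Z7 unavailable?"
ret_psg = ["image/nikon/page_0.jpg"]  # paths returned by the retrieval step

content = [{"type": "text", "text": question}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in ret_psg
]

resp = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4",
    messages=[{"role": "user", "content": content}],
    temperature=0.7,
    max_tokens=2048,
)
print(resp.choices[0].message.content)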
Launch the Case Study Viewer:
python ./script/case_study.py \
  --data output/memory_test_visrag_20251015_163425.json \
  --host 127.0.0.1 \
  --port 8070 \
  --title "Case Study Viewer"
The system will automatically display the retrieved manual page screenshots. The model’s generated answer is now grounded in actual visual document content, for example:
The electronic vibration reduction function of the Nikon Z7 is unavailable in the following scenarios:

1. When the frame size is 1920×1080.

2. When using 1920×1080 120p, 1920×1080 100p, or 1920×1080 (slow motion) modes.

This information can be found directly in the text shown in the retrieved image, specifically in the section describing the Nikon Z7’s electronic vibration reduction function.
Through visual-semantic augmentation, the system can now provide accurate, document-grounded answers — especially useful for multimodal scenarios such as manuals, textbooks, and reports.