Introduction
In everyday life, we often encounter situations such as purchasing a new device but not knowing how to configure certain features. Reading through the entire manual can be time-consuming and inefficient.
If an intelligent assistant could answer such questions directly, it would greatly enhance the user experience. For example, a user who has purchased a Nikon Z7 camera might want to know "in which scenarios the electronic vibration reduction function is unavailable."
The response generated by a standard LLM might look like this:
Instead of relying on text-only parsing, VisRAG directly feeds screenshots of relevant document pages into a vision-language model, enabling document-grounded question answering based on visual semantics.
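To make the VisRAG pipeline concrete, the sketch below shows the retrieval step in miniature: page-image embeddings are ranked by cosine similarity against a query embedding, and the top-scoring pages would then be passed as images to a vision-language model. The embedding and VLM calls themselves are omitted here; only the ranking logic is shown, and all names are illustrative rather than the actual VisRAG API.

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_pages(query_vec, page_vecs, k=3):
    # return indices of the k page images most similar to the query;
    # in VisRAG these pages are fed as screenshots to the VLM
    ranked = sorted(range(len(page_vecs)),
                    key=lambda i: cosine(query_vec, page_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

In the full system the query and each page screenshot are embedded by the same vision-language retriever, so text queries and page images live in a shared vector space.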
Building a Personal Knowledge Base
Using the “Nikon User Manual” PDF as an example, we use the Corpus Server in UR-2.0 to convert the PDF directly into an image corpus:
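If you want to see what this conversion amounts to, the following is a minimal sketch of rendering each PDF page to a PNG with PyMuPDF. It is not the UR-2.0 Corpus Server implementation, whose interface may differ; the paths, DPI, and naming scheme are assumptions.

```python
import os

def page_image_name(stem, page_index, ext="png"):
    # zero-padded page number so files sort in page order
    return f"{stem}_{page_index:04d}.{ext}"

def pdf_to_images(pdf_path, out_dir, dpi=144):
    # hypothetical sketch using PyMuPDF; the actual Corpus Server
    # may use a different renderer and file layout
    import fitz  # PyMuPDF
    os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    paths = []
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)
        path = os.path.join(out_dir, page_image_name("page", i))
        pix.save(path)
        paths.append(path)
    return paths
```

Each resulting image becomes one retrievable unit in the corpus, so a retrieved "document" is a whole page screenshot rather than a text chunk.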
Prepare a user query file:
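As an illustration, queries could be stored one JSON object per line (JSONL). The file name and field names below are assumptions; adjust them to whatever schema your UR-2.0 pipeline expects.

```python
import json

# hypothetical query file: one JSON object per line, with an id
# and the natural-language question
queries = [
    {"id": "q1",
     "query": "In which scenarios is the electronic vibration "
              "reduction function unavailable?"},
]

with open("queries.jsonl", "w", encoding="utf-8") as f:
    for q in queries:
        f.write(json.dumps(q, ensure_ascii=False) + "\n")
```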