Overview

The Retriever Server is the core retrieval module in UR-2.0. It integrates model loading, text/image encoding, index construction, and search into a single component. It natively supports multiple backends (Sentence-Transformers, Infinity, OpenAI-compatible APIs, and BM25), so it can flexibly adapt to corpora of different sizes and types, supporting large-scale embedding and efficient document recall.

Usage Example

Corpus Encoding & Indexing

The following example shows how to use the Retriever Server to encode a corpus and build an index.
examples/corpus_index.yaml
# MCP Server
servers:
  retriever: servers/retriever

# MCP Client Pipeline
pipeline:
- retriever.retriever_init
- retriever.retriever_embed
- retriever.retriever_index
Build the Pipeline:
ultrarag build examples/corpus_index.yaml
Modify the parameter file as needed. Below are two common scenarios: text corpus encoding and image corpus encoding.
  1. Text corpus encoding
Example: use Qwen3-Embedding-0.6B to vectorize a text corpus.
examples/parameters/corpus_index_parameter.yaml
retriever:
  backend: sentence_transformers # Using ST as the example backend
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: data/corpus_example.jsonl
  embedding_path: embedding/embedding.npy
  faiss_use_gpu: true
  gpu_ids: 0,1
  index_chunk_size: 50000
  index_path: index/index.index
  is_multimodal: false
  model_name_or_path: Qwen/Qwen3-Embedding-0.6B
  overwrite: false
  2. Image corpus encoding
Example: use jinaai/jina-embeddings-v4 to vectorize an image corpus.
examples/parameters/corpus_index_parameter.yaml
retriever:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: null
        psg_task: retrieval
        q_prompt_name: query
        q_task: retrieval
      trust_remote_code: true
  batch_size: 16
  corpus_path: corpora/image.jsonl
  embedding_path: embedding/embedding.npy
  faiss_use_gpu: true
  gpu_ids: 1
  index_chunk_size: 50000
  index_path: index/index.index
  is_multimodal: true
  model_name_or_path: jinaai/jina-embeddings-v4
  overwrite: false
Run the Pipeline:
ultrarag run examples/corpus_index.yaml
Encoding and indexing often involve large corpora and can be time-consuming. It’s recommended to run the task in the background using screen or nohup, for example:
nohup ultrarag run examples/corpus_index.yaml > log.txt 2>&1 &
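After the run completes, the embedding matrix and the Faiss index are written to embedding_path and index_path. As an optional sanity check, you can inspect both artifacts with a few lines of Python (a minimal sketch that assumes numpy and faiss are installed and that the default paths from the parameter file above were kept):
# Optional sanity check on the artifacts produced by the indexing run.
# Assumes the embeddings were saved as a single .npy file at embedding_path.
import numpy as np
import faiss

embeddings = np.load("embedding/embedding.npy")
print(embeddings.shape)      # (num_passages, embedding_dim)

index = faiss.read_index("index/index.index")
print(index.ntotal)          # should equal the number of passages in the corpus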

Corpus Search

The following example shows how to use the Retriever Server to perform vector search on an existing index.
examples/corpus_search.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_init
- retriever.retriever_search

Build the Pipeline:
ultrarag build examples/corpus_search.yaml
Modify parameters:
examples/parameters/corpus_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: data/corpus_example.jsonl
  faiss_use_gpu: true
  gpu_ids: 0,1
  index_path: index/index.index
  is_multimodal: false
  model_name_or_path: Qwen/Qwen3-Embedding-0.6B
  query_instruction: ''
  top_k: 5
Run the Pipeline:
ultrarag run examples/corpus_search.yaml
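Under the hood, retriever_search boils down to encoding each query with the same embedding model used for the corpus and running a top-k nearest-neighbor lookup against the Faiss index. The snippet below is a conceptual sketch of that step, not the Retriever Server's actual implementation; it assumes sentence-transformers and faiss are installed and omits the query prompt/instruction handling configured above:
# Conceptual sketch of dense search: encode the query, then query the Faiss index.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
index = faiss.read_index("index/index.index")

query_embeddings = model.encode(["who wrote the declaration of independence"])
scores, ids = index.search(query_embeddings.astype("float32"), 5)  # top_k = 5
print(ids[0], scores[0])     # passage ids and similarity scores for the best matches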

BM25 Retrieval

In addition to dense retrieval, UR-2.0 includes the classic BM25 sparse text retrieval algorithm. BM25 is an improvement over TF-IDF and is commonly used for fast, lightweight keyword-based matching. In practice, BM25 and dense retrieval are complementary: combining them often improves both coverage and recall diversity.
Step 1: Build the BM25 index
Before searching with BM25, you need to tokenize the documents and build a sparse index.
examples/bm25_index.yaml
# MCP Server
servers:
  retriever: servers/retriever

# MCP Client Pipeline
pipeline:
- retriever.retriever_init
- retriever.bm25_index
Build the Pipeline:
ultrarag build examples/bm25_index.yaml
Modify parameters:
examples/parameters/bm25_index_parameter.yaml
retriever:
  backend: bm25
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: data/corpus_example.jsonl
  faiss_use_gpu: true
  gpu_ids: 0,1
  index_path: index/index.index
  is_multimodal: false
  model_name_or_path: openbmb/MiniCPM-Embedding-Light
  overwrite: false
Run:
ultrarag run examples/bm25_index.yaml
Step 2: Execute BM25 search
Once the index is built, you can perform BM25-based document retrieval.
examples/bm25_search.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_init
- retriever.bm25_search
Build the Pipeline:
ultrarag build examples/bm25_search.yaml
Modify parameters:
examples/parameters/bm25_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  backend: bm25
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: data/corpus_example.jsonl
  faiss_use_gpu: true
  gpu_ids: 0,1
  index_path: index/index.index
  is_multimodal: false
  model_name_or_path: openbmb/MiniCPM-Embedding-Light
  top_k: 5
Run the search flow:
ultrarag run examples/bm25_search.yaml
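For intuition, BM25 ranks documents by how often the query terms occur in them, weighted by term rarity and normalized by document length. The snippet below illustrates the idea with the third-party rank_bm25 package; it is not the tokenizer or index format used by the bm25 backend, and the sample texts are made up:
# BM25 illustration using the rank_bm25 package (pip install rank-bm25).
# UR-2.0's bm25 backend builds and stores its own index; this is only a concept demo.
from rank_bm25 import BM25Okapi

corpus = [
    "the declaration of independence was written in 1776",
    "bm25 is a classic sparse retrieval function",
    "dense retrievers encode text into vectors",
]
tokenized_corpus = [doc.split() for doc in corpus]   # naive whitespace tokenization
bm25 = BM25Okapi(tokenized_corpus)

query = "who wrote the declaration of independence".split()
print(bm25.get_top_n(query, corpus, n=2))            # top-2 keyword matches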

Hybrid Retrieval

In real applications, a single retrieval method often struggles to balance recall and precision. For instance, BM25 excels at keyword matching, while dense retrieval is stronger in semantic understanding. UR-2.0 supports Hybrid Retrieval that combines sparse (BM25) and dense methods to leverage both, improving diversity and robustness. The example below shows how to run BM25 and dense retrieval in the same Pipeline and merge results via a custom module.
You can extend this example to arbitrarily combine retrieval methods—for instance, mixing local knowledge bases with online web search, or fusing text and image multimodal retrieval—to build a more powerful hybrid retrieval Pipeline.
examples/hybrid_search.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  dense: servers/retriever
  bm25: servers/retriever
  custom: servers/custom

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- dense.retriever_init
- bm25.retriever_init
- dense.retriever_search:
    output:
      ret_psg: dense_psg
- bm25.bm25_search:
    output:
      ret_psg: sparse_psg
- custom.merge_passages:
    input:
      ret_psg: dense_psg
      temp_psg: sparse_psg
This Pipeline involves parameter renaming and module reuse; see the corresponding sections of the documentation for details.
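The custom.merge_passages step lives in the custom server of this example and is not shown here. A hypothetical version might simply deduplicate the union of the dense and sparse passage lists for each query, along these lines (function name and behavior are illustrative only; the real module may merge or re-rank differently):
# Hypothetical merge logic for hybrid retrieval results.
def merge_passages(ret_psg, temp_psg):
    # ret_psg / temp_psg: one list of retrieved passages per query
    merged = []
    for dense_list, sparse_list in zip(ret_psg, temp_psg):
        seen, combined = set(), []
        for psg in dense_list + sparse_list:
            if psg not in seen:          # keep the first occurrence, drop duplicates
                seen.add(psg)
                combined.append(psg)
        merged.append(combined)
    return merged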
Build the Pipeline:
ultrarag build examples/hybrid_search.yaml
Modify parameters:
examples/parameters/hybrid_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
bm25:
  backend: bm25
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: data/corpus_example.jsonl
  faiss_use_gpu: true
  gpu_ids: 0,1
  index_path: index/index.index
  is_multimodal: false
  model_name_or_path: openbmb/MiniCPM-Embedding-Light
  top_k: 5
custom: {}
dense:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: data/corpus_example.jsonl
  faiss_use_gpu: true
  gpu_ids: 0,1
  index_path: index1/index.index
  is_multimodal: false
  model_name_or_path: Qwen/Qwen3-Embedding-0.6B
  query_instruction: ''
  top_k: 5
Run the hybrid retrieval Pipeline:
ultrarag run examples/hybrid_search.yaml

Deploy Retrieval Models

UR-2.0 is fully compatible with the OpenAI API interface specification, so any embedding model that conforms to this API can be integrated without extra adapters or code changes. The example below shows how to deploy a local embedding model with vLLM.
Step 1: Serve the model in the background
We recommend using screen to run the service in the background, so you can easily view logs and status. Start a new screen session:
screen -S retriever
Launch the model (using Qwen3-Embedding-0.6B as an example):
script/vllm_serve_emb.sh
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --served-model-name qwen-embedding \
    --model Qwen/Qwen3-Embedding-0.6B \
    --trust-remote-code \
    --host 127.0.0.1 \
    --port 65502 \
    --task embed \
    --gpu-memory-utilization 0.9
If you see output similar to the following, the service started successfully:
(APIServer pid=2270761) INFO:     Started server process [2270761]
(APIServer pid=2270761) INFO:     Waiting for application startup.
(APIServer pid=2270761) INFO:     Application startup complete.
Press Ctrl+A, then D to detach from the session and keep the service running in the background. To re-attach later, run:
screen -r retriever
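Before wiring the service into a Pipeline, you can optionally confirm that the endpoint responds. A minimal check with the openai Python package (assuming it is installed; the port and model name match the serve script above):
# Quick check that the local vLLM OpenAI-compatible embedding endpoint is up.
# For a local server the api_key only needs to be a non-empty placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:65502/v1", api_key="EMPTY")
resp = client.embeddings.create(model="qwen-embedding", input=["hello world"])
print(len(resp.data[0].embedding))   # embedding dimension of the served model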
Step 2: Modify Pipeline Parameters
Taking the corpus_search Pipeline as an example, simply switch the retriever backend to openai and set base_url to your local vLLM service:
examples/parameters/corpus_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  backend: openai
  backend_configs:
    bm25:
      lang: en
    infinity:
      bettertransformer: false
      device: cuda
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: http://127.0.0.1:65502/v1
      model_name: qwen-embedding
    sentence_transformers:
      device: cuda
      sentence_transformers_encode:
        encode_chunk_size: 10000
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  corpus_path: data/corpus_example.jsonl
  faiss_use_gpu: true
  gpu_ids: 0,1
  index_path: index/index.index
  is_multimodal: false
  model_name_or_path: openbmb/MiniCPM-Embedding-Light
  query_instruction: ''
  top_k: 5
Once configured, you can run it just like regular vector retrieval.

Web Search API

UR-2.0 natively integrates three mainstream Web search APIs: Tavily, Exa, and GLM. These APIs can be used directly as Retriever Server backends to enable online information retrieval and real-time knowledge augmentation.
Step 1: Configure API Keys
Before use, set the corresponding service's API key. You can manually export environment variables before running the Pipeline:
export TAVILY_API_KEY="your retriever key"
It is recommended to manage keys centrally using a .env configuration file. In the UR-2.0 root directory, rename the template file .env.dev to .env and fill in your key information, for example:
LLM_API_KEY=
RETRIEVER_API_KEY=
TAVILY_API_KEY=tvly-dev-yourapikeyhere
EXA_API_KEY=
ZHIPUAI_API_KEY=
UR-2.0 will automatically read this file and load the configuration at startup.
Step 2: Web Search
The following example demonstrates how to use the Tavily API for Web search:
examples/web_search.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_tavily_search
Build the Pipeline:
ultrarag build examples/web_search.yaml
In the generated parameter file, fill in the data path and retrieval parameters:
examples/parameters/web_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  retrieve_thread_num: 1
  top_k: 5
Run the following command to start the Web search flow:
ultrarag run examples/web_search.yaml
You can replace retriever_tavily_search with retriever_exa_search or retriever_zhipuai_search as the Web search source.
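If a Web search Pipeline fails with key or connectivity errors, you can test the Tavily key outside UR-2.0 with the official tavily-python client (a standalone check only; UR-2.0 issues these API calls itself when the Pipeline runs):
# Standalone key/connectivity check using the tavily-python package (pip install tavily-python).
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
results = client.search("who wrote the declaration of independence", max_results=5)
for item in results["results"]:
    print(item["title"], item["url"])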