Function
The Retriever Server is the core retrieval module in UltraRAG, integrating model loading, text encoding, index construction, and retrieval query functions.
It natively supports multiple backends, such as Sentence-Transformers, Infinity, and OpenAI, so it can adapt flexibly to corpora of different scales and types for large-scale vectorization and efficient document recall.
Usage Examples
Corpus Encoding and Indexing
The following example shows how to use the Retriever Server to perform encoding and index construction on a corpus.

examples/corpus_index.yaml
# MCP Server
servers:
  retriever: servers/retriever
# MCP Client Pipeline
pipeline:
- retriever.retriever_init
- retriever.retriever_embed
- retriever.retriever_index
Run the following command to compile the Pipeline:
ultrarag build examples/corpus_index.yaml
Modify the generated parameter file to match your actual setup. Two typical scenarios are shown below: Text Corpus Encoding and Image Corpus Encoding.
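In both cases, corpus_path points to a JSONL corpus with one document per line. A minimal illustrative line is sketched below; the id and contents field names follow the defaults in the index backend configuration, so adjust them if your corpus uses different keys:
{"id": "0", "contents": "Paris is the capital and largest city of France."}
{"id": "1", "contents": "The Eiffel Tower was completed in 1889."}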
- Text Corpus Encoding
Example: Using Qwen3-Embedding-0.6B to vectorize a text corpus.

examples/parameters/corpus_index_parameter.yaml
retriever:
  backend: sentence_transformers  # sentence_transformers (st) is used as the example backend here
  backend_configs:
    bm25:
      lang: en
      save_path: index/bm25
    infinity:
      bettertransformer: false
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      sentence_transformers_encode:
        encode_chunk_size: 256
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  collection_name: wiki
  corpus_path: data/corpus_example.jsonl
  embedding_path: embedding/embedding.npy
  gpu_ids: '1'
  index_backend: faiss
  index_backend_configs:
    faiss:
      index_chunk_size: 10000
      index_path: index/index.index
      index_use_gpu: true
    milvus:
      id_field_name: id
      id_max_length: 64
      index_chunk_size: 1000
      index_params:
        index_type: AUTOINDEX
        metric_type: IP
      metric_type: IP
      search_params:
        metric_type: IP
        params: {}
      text_field_name: contents
      text_max_length: 60000
      token: null
      uri: index/milvus_demo.db
      vector_field_name: vector
  is_demo: false
  is_multimodal: false
  model_name_or_path: Qwen/Qwen3-Embedding-0.6B  # default: openbmb/MiniCPM-Embedding-Light
  overwrite: false
- Image Corpus Encoding
Example: Using jinaai/jina-embeddings-v4 to vectorize an image corpus.

examples/parameters/corpus_index_parameter.yaml
retriever:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
      save_path: index/bm25
    infinity:
      bettertransformer: false
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      sentence_transformers_encode:
        encode_chunk_size: 256
        normalize_embeddings: false
        psg_prompt_name: null  # default: document
        psg_task: retrieval  # default: null
        q_prompt_name: query
        q_task: retrieval  # default: null
      trust_remote_code: true
  batch_size: 16
  collection_name: wiki
  corpus_path: corpora/image.jsonl  # default: data/corpus_example.jsonl
  embedding_path: embedding/embedding.npy
  gpu_ids: 1
  index_backend: faiss
  index_backend_configs:
    faiss:
      index_chunk_size: 10000
      index_path: index/index.index
      index_use_gpu: true
    milvus:
      id_field_name: id
      id_max_length: 64
      index_chunk_size: 1000
      index_params:
        index_type: AUTOINDEX
        metric_type: IP
      metric_type: IP
      search_params:
        metric_type: IP
        params: {}
      text_field_name: contents
      text_max_length: 60000
      token: null
      uri: index/milvus_demo.db
      vector_field_name: vector
  is_demo: false
  is_multimodal: true  # default: false
  model_name_or_path: jinaai/jina-embeddings-v4  # default: openbmb/MiniCPM-Embedding-Light
  overwrite: false
Run the following command to execute this Pipeline:
ultrarag run examples/corpus_index.yaml
The encoding and indexing phase usually processes a large corpus and can take a long time. It is recommended to run the task in the background with screen or nohup, for example:
nohup ultrarag run examples/corpus_index.yaml > log.txt 2>&1 &
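Equivalently, you can run the task inside a screen session (the session name corpus_index below is arbitrary):
screen -S corpus_index
ultrarag run examples/corpus_index.yaml
# press Ctrl + A, then D to detach; re-attach later with screen -r corpus_index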
Vector Retrieval
The following example shows how to use the Retriever Server to perform vector retrieval tasks on the constructed index.

examples/corpus_search.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever
# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_init
- retriever.retriever_search
Run the following command to compile the Pipeline:
ultrarag build examples/corpus_search.yaml
Modify parameters:

examples/parameters/corpus_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
      save_path: index/bm25
    infinity:
      bettertransformer: false
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      sentence_transformers_encode:
        encode_chunk_size: 256
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  collection_name: wiki
  corpus_path: data/corpus_example.jsonl
  gpu_ids: '1'
  index_backend: faiss
  index_backend_configs:
    faiss:
      index_chunk_size: 10000
      index_path: index/index.index
      index_use_gpu: true
    milvus:
      id_field_name: id
      id_max_length: 64
      index_chunk_size: 1000
      index_params:
        index_type: AUTOINDEX
        metric_type: IP
      metric_type: IP
      search_params:
        metric_type: IP
        params: {}
      text_field_name: contents
      text_max_length: 60000
      token: null
      uri: index/milvus_demo.db
      vector_field_name: vector
  is_demo: false
  is_multimodal: false
  model_name_or_path: Qwen/Qwen3-Embedding-0.6B  # default: openbmb/MiniCPM-Embedding-Light
  query_instruction: ''
  top_k: 5
Run Pipeline:
ultrarag run examples/corpus_search.yaml
BM25 Retrieval
In addition to vector retrieval, UltraRAG also ships with the classic BM25 text retrieval algorithm. BM25 is a sparse retrieval method that improves on Term Frequency-Inverse Document Frequency (TF-IDF) weighting and is often used for fast, lightweight lexical matching. In practice, BM25 complements dense retrieval and can improve retrieval coverage and recall diversity.
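For reference, the standard Okapi BM25 scoring function, with the usual k_1 and b hyperparameters, is:
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
where f(q_i, D) is the frequency of term q_i in document D, |D| is the document length, and avgdl is the average document length over the corpus.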
Step 1: Build BM25 Index
Before using BM25 for retrieval, you need to tokenize the corpus and build a sparse index.

examples/bm25_index.yaml
# MCP Server
servers:
  retriever: servers/retriever
# MCP Client Pipeline
pipeline:
- retriever.retriever_init
- retriever.bm25_index
Run the following command to compile the Pipeline:
ultrarag build examples/bm25_index.yaml
Modify parameters:

examples/parameters/bm25_index_parameter.yaml
retriever:
  backend: bm25  # default: sentence_transformers
  backend_configs:
    bm25:
      lang: en
      save_path: index/bm25
    infinity:
      bettertransformer: false
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      sentence_transformers_encode:
        encode_chunk_size: 256
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  collection_name: wiki
  corpus_path: data/corpus_example.jsonl
  gpu_ids: '1'
  index_backend: faiss
  index_backend_configs:
    faiss:
      index_chunk_size: 10000
      index_path: index/index.index
      index_use_gpu: true
    milvus:
      id_field_name: id
      id_max_length: 64
      index_chunk_size: 1000
      index_params:
        index_type: AUTOINDEX
        metric_type: IP
      metric_type: IP
      search_params:
        metric_type: IP
        params: {}
      text_field_name: contents
      text_max_length: 60000
      token: null
      uri: index/milvus_demo.db
      vector_field_name: vector
  is_demo: false
  is_multimodal: false
  model_name_or_path: openbmb/MiniCPM-Embedding-Light
  overwrite: false
Run:
ultrarag run examples/bm25_index.yaml
Step 2: Execute BM25 Retrieval
Once the index has been built, you can run BM25-based document retrieval.

examples/bm25_search.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever
# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_init
- retriever.bm25_search
Compile Pipeline:
ultrarag build examples/bm25_search.yaml
Modify parameters:

examples/parameters/bm25_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  backend: bm25  # default: sentence_transformers
  backend_configs:
    bm25:
      lang: en
      save_path: index/bm25
    infinity:
      bettertransformer: false
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      sentence_transformers_encode:
        encode_chunk_size: 256
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  collection_name: wiki
  corpus_path: data/corpus_example.jsonl
  gpu_ids: '1'
  index_backend: faiss
  index_backend_configs:
    faiss:
      index_chunk_size: 10000
      index_path: index/index.index
      index_use_gpu: true
    milvus:
      id_field_name: id
      id_max_length: 64
      index_chunk_size: 1000
      index_params:
        index_type: AUTOINDEX
        metric_type: IP
      metric_type: IP
      search_params:
        metric_type: IP
        params: {}
      text_field_name: contents
      text_max_length: 60000
      token: null
      uri: index/milvus_demo.db
      vector_field_name: vector
  is_demo: false
  is_multimodal: false
  model_name_or_path: openbmb/MiniCPM-Embedding-Light
  top_k: 5
Run retrieval process:
ultrarag run examples/bm25_search.yaml
Hybrid Retrieval
In practical applications, a single retrieval method often struggles to balance recall and precision.
For example, BM25 excels at keyword matching, while vector retrieval is stronger at semantic understanding.
UltraRAG therefore supports fusing sparse retrieval (BM25) with dense retrieval, combining the strengths of both through a hybrid strategy (Hybrid Retrieval) to further improve retrieval diversity and robustness.
The following example demonstrates how to run BM25 and vector retrieval in the same Pipeline and merge the results through a custom module.
You can follow this pattern to combine retrieval methods freely, for example pairing a local knowledge base with online Web retrieval, or fusing text and image retrieval results, to build a more powerful hybrid retrieval Pipeline.
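How the two result lists are combined is up to the custom module. A common, simple choice (not mandated by UltraRAG) is to deduplicate and concatenate the lists, or to score each passage with reciprocal rank fusion:
\mathrm{RRF}(d) = \sum_{r \in \{\text{dense},\,\text{bm25}\}} \frac{1}{k + \mathrm{rank}_r(d)}, \qquad k \approx 60
where rank_r(d) is the rank of passage d in retriever r's result list.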

examples/hybrid_search.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  dense: servers/retriever
  bm25: servers/retriever
  custom: servers/custom
# MCP Client Pipeline
pipeline:
- benchmark.get_data
- dense.retriever_init
- bm25.retriever_init
- dense.retriever_search:
    output:
      ret_psg: dense_psg
- bm25.bm25_search:
    output:
      ret_psg: sparse_psg
- custom.merge_passages:
    input:
      ret_psg: dense_psg
      temp_psg: sparse_psg
Run the following command to compile the Pipeline:
ultrarag build examples/hybrid_search.yaml
Modify parameters:

examples/parameters/hybrid_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
bm25:
  backend: bm25  # default: sentence_transformers
  backend_configs:
    bm25:
      lang: en
      save_path: index/bm25
    infinity:
      bettertransformer: false
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      sentence_transformers_encode:
        encode_chunk_size: 256
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  collection_name: wiki
  corpus_path: data/corpus_example.jsonl
  gpu_ids: '1'
  index_backend: faiss
  index_backend_configs:
    faiss:
      index_chunk_size: 10000
      index_path: index/index.index
      index_use_gpu: true
    milvus:
      id_field_name: id
      id_max_length: 64
      index_chunk_size: 1000
      index_params:
        index_type: AUTOINDEX
        metric_type: IP
      metric_type: IP
      search_params:
        metric_type: IP
        params: {}
      text_field_name: contents
      text_max_length: 60000
      token: null
      uri: index/milvus_demo.db
      vector_field_name: vector
  is_demo: false
  is_multimodal: false
  model_name_or_path: openbmb/MiniCPM-Embedding-Light
  top_k: 5
custom: {}
dense:
  backend: sentence_transformers
  backend_configs:
    bm25:
      lang: en
      save_path: index/bm25
    infinity:
      bettertransformer: false
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: ''
      base_url: https://api.openai.com/v1
      model_name: text-embedding-3-small
    sentence_transformers:
      sentence_transformers_encode:
        encode_chunk_size: 256
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  collection_name: wiki
  corpus_path: data/corpus_example.jsonl
  gpu_ids: '1'
  index_backend: faiss
  index_backend_configs:
    faiss:
      index_chunk_size: 10000
      index_path: index/index.index
      index_use_gpu: true
    milvus:
      id_field_name: id
      id_max_length: 64
      index_chunk_size: 1000
      index_params:
        index_type: AUTOINDEX
        metric_type: IP
      metric_type: IP
      search_params:
        metric_type: IP
        params: {}
      text_field_name: contents
      text_max_length: 60000
      token: null
      uri: index/milvus_demo.db
      vector_field_name: vector
  is_demo: false
  is_multimodal: false
  model_name_or_path: Qwen/Qwen3-Embedding-0.6B  # default: openbmb/MiniCPM-Embedding-Light
  query_instruction: ''
  top_k: 5
Run Hybrid Search Pipeline:
ultrarag run examples/hybrid_search.yaml
Deploy Retrieval Model
UltraRAG is fully compatible with the OpenAI API specification, so any embedding model served through an OpenAI-compatible interface can be used directly, without additional adaptation or code changes.
The following example shows how to deploy a local retrieval model using vLLM.
Step 1: Background Model Deployment
It is recommended to run the model inside a screen session so that it stays in the background while its logs and status can still be viewed in real time.
Enter a new Screen session:
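screen -S vllm_embed   # the session name is arbitrary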
Execute the following command to deploy the model (taking Qwen3-Embedding-0.6B as an example):
CUDA_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server \
--served-model-name qwen-embedding \
--model Qwen/Qwen3-Embedding-0.6B \
--trust-remote-code \
--host 0.0.0.0 \
--port 65504 \
--task embed \
--gpu-memory-utilization 0.2
Seeing output similar to the following indicates that the model service has started successfully:
(APIServer pid=2270761) INFO: Started server process [2270761]
(APIServer pid=2270761) INFO: Waiting for application startup.
(APIServer pid=2270761) INFO: Application startup complete.
Press Ctrl + A, then D to detach from the session and keep the service running in the background.
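Optionally, you can sanity-check the endpoint with a standard OpenAI-style embeddings request; the model field must match the --served-model-name set above:
curl http://127.0.0.1:65504/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-embedding", "input": "hello world"}'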
If you need to re-enter the session, execute:
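screen -r vllm_embed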
Step 2: Modify Pipeline Parameters
Taking corpus_search Pipeline as an example, just switch the retrieval backend to openai and point base_url to the local vLLM service:

examples/parameters/corpus_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  backend: openai  # default: sentence_transformers
  backend_configs:
    bm25:
      lang: en
      save_path: index/bm25
    infinity:
      bettertransformer: false
      model_warmup: false
      pooling_method: auto
      trust_remote_code: true
    openai:
      api_key: 'abc'  # default: ''
      base_url: http://127.0.0.1:65504/v1  # default: https://api.openai.com/v1
      model_name: qwen-embedding  # default: text-embedding-3-small
    sentence_transformers:
      sentence_transformers_encode:
        encode_chunk_size: 256
        normalize_embeddings: false
        psg_prompt_name: document
        psg_task: null
        q_prompt_name: query
        q_task: null
      trust_remote_code: true
  batch_size: 16
  collection_name: wiki
  corpus_path: data/corpus_example.jsonl
  gpu_ids: '1'
  index_backend: faiss
  index_backend_configs:
    faiss:
      index_chunk_size: 10000
      index_path: index/index.index
      index_use_gpu: true
    milvus:
      id_field_name: id
      id_max_length: 64
      index_chunk_size: 1000
      index_params:
        index_type: AUTOINDEX
        metric_type: IP
      metric_type: IP
      search_params:
        metric_type: IP
        params: {}
      text_field_name: contents
      text_max_length: 60000
      token: null
      uri: index/milvus_demo.db
      vector_field_name: vector
  is_demo: false
  is_multimodal: false
  model_name_or_path: openbmb/MiniCPM-Embedding-Light
  query_instruction: ''
  top_k: 5
After completing the configuration, you can run the Pipeline just as you would with ordinary vector retrieval.
Web Search API
UltraRAG natively integrates three mainstream Web retrieval APIs: Tavily, Exa, and GLM.
These APIs can be directly used as the retrieval backend of the Retriever Server to achieve online information retrieval and real-time knowledge enhancement.
Step 1: Configure API Key
You need to set the API Key of the corresponding service before use. You can manually export environment variables before running the Pipeline:
export TAVILY_API_KEY="your retriever key"
It is recommended to use the .env configuration file for unified management:
In the UltraRAG root directory, rename the template file .env.dev to .env, and fill in your key information, for example:
LLM_API_KEY=
RETRIEVER_API_KEY=
TAVILY_API_KEY=tvly-dev-yourapikeyhere
EXA_API_KEY=
ZHIPUAI_API_KEY=
UltraRAG will automatically read this file and load relevant configurations at startup.
Step 2: Web Search
The following example demonstrates how to use Tavily API for Web retrieval:

examples/web_search.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever
# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_tavily_search
Compile Pipeline:
ultrarag build examples/web_search.yaml
Fill in the data path and retrieval parameters in the automatically generated parameter file:

examples/parameters/web_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  retrieve_thread_num: 1
  top_k: 5
Execute the following command to start the Web retrieval process:
ultrarag run examples/web_search.yaml
You can replace retriever_tavily_search with retriever_exa_search or retriever_zhipuai_search as the Web retrieval source.
Deploy Retriever Server
When evaluating multiple benchmarks or models against the same corpus, re-initializing the retriever server for every run reloads the large corpus and index each time, which is slow and wasteful.
UltraRAG therefore provides a resident Retriever Server deployment script that keeps the retriever running long-term on CPU or GPU, avoiding repeated loading and speeding up experiments.
Step 1: Parameter Settings
As with an ordinary retriever server, prepare the configuration file first:

script/deploy_retriever_config.json
{
  "model_name_or_path": "openbmb/MiniCPM-Embedding-Light",
  "corpus_path": "data/corpus_example.jsonl",
  "collection_name": "ultrarag_embeddings",
  "backend": "sentence_transformers",
  "backend_configs": {
    "infinity": {
      "bettertransformer": false,
      "pooling_method": "auto",
      "model_warmup": false,
      "trust_remote_code": true
    },
    "sentence_transformers": {
      "trust_remote_code": true,
      "sentence_transformers_encode": {
        "normalize_embeddings": false,
        "encode_chunk_size": 10000,
        "q_prompt_name": "query",
        "psg_prompt_name": "document",
        "psg_task": null,
        "q_task": null
      }
    },
    "openai": {
      "model_name": "text-embedding-3-small",
      "base_url": "https://api.openai.com/v1",
      "api_key": ""
    },
    "bm25": {
      "lang": "en",
      "save_path": "index/bm25"
    }
  },
  "index_backend": "faiss",
  "index_backend_configs": {
    "faiss": {
      "index_use_gpu": true,
      "index_chunk_size": 50000,
      "index_path": "index/index.index"
    },
    "milvus": {
      "uri": "index/milvus_demo.db",
      "token": null,
      "id_field_name": "id",
      "vector_field_name": "vector",
      "text_field_name": "contents",
      "id_max_length": 64,
      "text_max_length": 60000,
      "metric_type": "IP",
      "index_params": {
        "index_type": "AUTOINDEX",
        "metric_type": "IP"
      },
      "search_params": {
        "metric_type": "IP",
        "params": {}
      },
      "index_chunk_size": 50000
    }
  },
  "batch_size": 16,
  "gpu_ids": "0,1",
  "is_multimodal": false,
  "is_demo": false
}
Step 2: Background Deployment
It is recommended to use screen so that the retriever can keep running in the background while its logs remain viewable at any time.
Create Screen session:
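screen -S retriever_server   # the session name is arbitrary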
Start retriever server:
script/deploy_retriever_server.py
python ./script/deploy_retriever_server.py \
--config_path script/deploy_retriever_config.json \
--host 0.0.0.0 \
--port 64501
Once started, the server stays resident in memory, so the corpus and index do not need to be reloaded for each run.
Step 3: Online Retrieval
During online retrieval there is no need to re-initialize the retriever; simply point the pipeline at the deployed address:

examples/deploy_corpus_search.yaml
# Deploy Corpus Search Demo
# MCP Server
servers:
  benchmark: servers/benchmark
  retriever: servers/retriever
# MCP Client Pipeline
pipeline:
- benchmark.get_data
- retriever.retriever_deploy_search
Run the following command to compile the Pipeline:
ultrarag build examples/deploy_corpus_search.yaml
Modify parameters:

examples/parameters/deploy_corpus_search_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
retriever:
  query_instruction: ''
  retriever_url: http://127.0.0.1:64501
  top_k: 5
Run Pipeline:
ultrarag run examples/deploy_corpus_search.yaml