Function
The Generation Server is the core module in UltraRAG responsible for deploying and invoking Large Language Models (LLMs).
It receives the prompts constructed by the Prompt Server and generates the corresponding outputs.
The module supports two modes, text generation and image-text multi-modal generation, and adapts flexibly to different task scenarios (Q&A, reasoning, summarization, visual Q&A, and so on).
The Generation Server is natively compatible with the following mainstream backends: vLLM, HuggingFace, and OpenAI-compatible APIs.
Usage Examples
Text Generation
The following example shows how to use the Generation Server for a basic text generation task: the Prompt Server constructs the input prompt, the LLM generates an answer, and the pipeline then extracts the final result and evaluates it.

examples/vanilla_llm.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  prompt: servers/prompt
  generation: servers/generation
  evaluation: servers/evaluation
  custom: servers/custom

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- generation.generation_init
- prompt.qa_boxed
- generation.generate
- custom.output_extract_from_boxed
- evaluation.evaluate
Run the following command to compile the Pipeline:
ultrarag build examples/vanilla_llm.yaml
Modify parameters:

examples/parameters/vanilla_llm_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
custom: {}
evaluation:
  metrics:
  - acc
  - f1
  - em
  - coverem
  - stringem
  - rouge-1
  - rouge-2
  - rouge-l
  save_path: output/evaluate_results.json
generation:
  backend: vllm
  backend_configs:
    hf:
      batch_size: 8
      gpu_ids: 2,3
      model_name_or_path: openbmb/MiniCPM4-8B
      trust_remote_code: true
    openai:
      api_key: abc
      base_delay: 1.0
      base_url: http://localhost:8000/v1
      concurrency: 8
      model_name: MiniCPM4-8B
      retries: 3
    vllm:
      dtype: auto
      gpu_ids: 2,3
      gpu_memory_utilization: 0.9
      model_name_or_path: openbmb/MiniCPM4-8B
      trust_remote_code: true
  extra_params:
    chat_template_kwargs:
      enable_thinking: false
  sampling_params:
    max_tokens: 2048
    temperature: 0.7
    top_p: 0.8
  system_prompt: ''
prompt:
  template: prompt/qa_boxed.jinja
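For reference, the vllm backend settings and sampling_params above roughly correspond to the following direct use of the vLLM Python API. UltraRAG performs this wiring internally; the sketch below is only an illustration, the prompt string is made up, and gpu_ids would typically be applied by restricting the visible GPUs rather than being passed to the constructor.

from vllm import LLM, SamplingParams

# Illustrative only: mirror the vllm entry of backend_configs from the parameter file.
llm = LLM(
    model="openbmb/MiniCPM4-8B",   # model_name_or_path
    trust_remote_code=True,        # trust_remote_code
    dtype="auto",                  # dtype
    gpu_memory_utilization=0.9,    # gpu_memory_utilization
)

# Mirror sampling_params from the parameter file.
sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=2048)

outputs = llm.generate(["When was the last time anyone was on the moon?"], sampling)
print(outputs[0].outputs[0].text)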
Run Pipeline:
ultrarag run examples/vanilla_llm.yaml
Multi-modal Inference
In multi-modal scenarios, the Generation Server can combine text inputs with visual information such as images to complete more complex reasoning tasks. The following example shows how to set this up.
First, prepare an example dataset (including image paths):

data/test.jsonl
{"id": 0, "question": "when was the last time anyone was on the moon", "golden_answers": ["14 December 1972 UTC", "December 1972"], "image":["image/page_0.jpg"],"meta_data": {}}
Before running multi-modal generation, you need to add a new multimodal_path field to the key_map of the Benchmark Server's get_data parameters so that the image input paths are read from the dataset (see the parameter file below).

examples/vanilla_vlm.yaml
# MCP Server
servers:
  benchmark: servers/benchmark
  prompt: servers/prompt
  generation: servers/generation
  evaluation: servers/evaluation
  custom: servers/custom

# MCP Client Pipeline
pipeline:
- benchmark.get_data
- generation.generation_init
- prompt.qa_boxed
- generation.multimodal_generate
- custom.output_extract_from_boxed
- evaluation.evaluate
Run the following command to compile the Pipeline:
ultrarag build examples/vanilla_vlm.yaml
Modify parameters:

examples/parameters/vanilla_vlm_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
      multimodal_path: image                # newly added: dataset field holding the image paths
    limit: -1
    name: test                              # changed from: nq
    path: data/test.jsonl                   # changed from: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
custom: {}
evaluation:
  metrics:
  - acc
  - f1
  - em
  - coverem
  - stringem
  - rouge-1
  - rouge-2
  - rouge-l
  save_path: output/evaluate_results.json
generation:
  backend: vllm
  backend_configs:
    hf:
      batch_size: 8
      gpu_ids: 2,3
      model_name_or_path: openbmb/MiniCPM4-8B
      trust_remote_code: true
    openai:
      api_key: abc
      base_delay: 1.0
      base_url: http://localhost:8000/v1
      concurrency: 8
      model_name: MiniCPM4-8B
      retries: 3
    vllm:
      dtype: auto
      gpu_ids: 2,3
      gpu_memory_utilization: 0.9
      model_name_or_path: openbmb/MiniCPM-V-4   # changed from: openbmb/MiniCPM4-8B
      trust_remote_code: true
  extra_params:
    chat_template_kwargs:
      enable_thinking: false
    image_tag: null
  sampling_params:
    max_tokens: 2048
    temperature: 0.7
    top_p: 0.8
  system_prompt: ''
prompt:
  template: prompt/qa_boxed.jinja
Run:
ultrarag run examples/vanilla_vlm.yaml
Note: You can set image_tag (for example, <IMG>) to mark the position in the prompt where the image input should be inserted. If it is left empty (null), the image is placed at the beginning of the input by default.
Deploy Model
UltraRAG is fully compatible with the OpenAI API specification, so any model served through an OpenAI-compatible interface can be accessed directly, without additional adaptation or code changes.
The following example shows how to use vLLM to deploy a local model.
Step 1: Background Model Deployment
Taking Qwen3-32B as an example; multi-GPU tensor parallelism is recommended to maintain inference speed.
Screen (run directly on the host)
- Create session: screen -S vllm_server (the session name here is only an example)
- Start command:
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
--served-model-name qwen3-32b \
--model Qwen/Qwen3-32B \
--trust-remote-code \
--host 0.0.0.0 \
--port 65503 \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 2 \
--enforce-eager
Seeing output similar to the following indicates that the model service has started successfully:
(APIServer pid=2811812) INFO: Started server process [2811812]
(APIServer pid=2811812) INFO: Waiting for application startup.
(APIServer pid=2811812) INFO: Application startup complete.
- Exit session: press Ctrl + A, then D to detach and keep the service running in the background.
If you need to re-enter the session later, execute:
screen -r vllm_server
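Optionally, you can verify the endpoint before wiring it into the pipeline. Below is a minimal sketch using the official openai Python client; the api_key value is an arbitrary placeholder, since vLLM only enforces it if the server was started with --api-key.

from openai import OpenAI

# Point the client at the vLLM server started above.
client = OpenAI(base_url="http://127.0.0.1:65503/v1", api_key="abc")

resp = client.chat.completions.create(
    model="qwen3-32b",  # must match --served-model-name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)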
Step 2: Modify Pipeline Parameters
Modify parameters:

examples/parameters/vanilla_llm_parameter.yaml
benchmark:
  benchmark:
    key_map:
      gt_ls: golden_answers
      q_ls: question
    limit: -1
    name: nq
    path: data/sample_nq_10.jsonl
    seed: 42
    shuffle: false
custom: {}
evaluation:
  metrics:
  - acc
  - f1
  - em
  - coverem
  - stringem
  - rouge-1
  - rouge-2
  - rouge-l
  save_path: output/evaluate_results.json
generation:
  backend: openai                              # changed from: vllm
  backend_configs:
    hf:
      batch_size: 8
      gpu_ids: 2,3
      model_name_or_path: openbmb/MiniCPM4-8B
      trust_remote_code: true
    openai:
      api_key: abc
      base_delay: 1.0
      base_url: http://127.0.0.1:65503/v1      # changed from: http://localhost:8000/v1; must match the port used in Step 1
      concurrency: 8
      model_name: qwen3-32b                    # changed from: MiniCPM4-8B; must match --served-model-name in Step 1
      retries: 3
    vllm:
      dtype: auto
      gpu_ids: 2,3
      gpu_memory_utilization: 0.9
      model_name_or_path: openbmb/MiniCPM4-8B
      trust_remote_code: true
  extra_params:
    chat_template_kwargs:
      enable_thinking: false
  sampling_params:
    max_tokens: 2048
    temperature: 0.7
    top_p: 0.8
  system_prompt: ''
prompt:
  template: prompt/qa_boxed.jinja
After completing the configuration, run the pipeline as before:
ultrarag run examples/vanilla_llm.yaml