
generation_init

Signature
def generation_init(
    backend_configs: Dict[str, Any],
    sampling_params: Dict[str, Any],
    extra_params: Optional[Dict[str, Any]] = None,
    backend: str = "vllm",
) -> None
Function
  • Initializes the inference backend and sampling parameters.
  • Supports the vllm, openai, and hf backends.
  • extra_params can pass chat_template_kwargs or other backend-specific parameters.
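
A minimal initialization sketch, assuming the functions are importable from the generation server module (the import path is illustrative; the values mirror the Configuration section below):

from servers.generation import generation_init  # hypothetical import path

# Initialize the vllm backend with the same values used in parameter.yaml.
generation_init(
    backend_configs={
        "vllm": {
            "model_name_or_path": "openbmb/MiniCPM4-8B",
            "gpu_ids": "2,3",
            "gpu_memory_utilization": 0.9,
            "dtype": "auto",
            "trust_remote_code": True,
        }
    },
    sampling_params={"temperature": 0.7, "top_p": 0.8, "max_tokens": 2048},
    extra_params={"chat_template_kwargs": {"enable_thinking": False}},
    backend="vllm",
)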

generate

Signature
async def generate(
    prompt_ls: List[Union[str, Dict[str, Any]]],
    system_prompt: str = "",
) -> Dict[str, List[str]]
Function
  • Plain-text conversation generation.
  • Handles each prompt in the list automatically; a prompt may be a plain string or an OpenAI-format message dictionary.
Output Format (JSON)
{"ans_ls": ["answer for prompt_0", "answer for prompt_1", "..."]}

multimodal_generate

Signature
async def multimodal_generate(
    multimodal_path: List[List[str]],
    prompt_ls: List[Union[str, Dict[str, Any]]],
    system_prompt: str = "",
    image_tag: Optional[str] = None,
) -> Dict[str, List[str]]
Function
  • Text-image multimodal conversation generation.
  • multimodal_path: a list of image paths for each prompt (local paths and URLs are supported).
  • image_tag: if specified (e.g., <img>), images are inserted at the tag's position in the prompt; otherwise they are appended to the end of the prompt.
Output Format (JSON)
{"ans_ls": ["answer with images for prompt_0", "..."]}

multiturn_generate

Signature
async def multiturn_generate(
    messages: List[Dict[str, str]],
    system_prompt: str = "",
) -> Dict[str, List[str]]
Function
  • Multi-turn conversation generation.
  • Handles a single conversation per call; batched prompts are not supported.
Output Format (JSON)
{"ans_ls": ["assistant response"]}

vllm_shutdown

Signature
def vllm_shutdown() -> None
Function
  • Explicitly shuts down the vLLM engine and releases its VRAM.
  • Effective only when using the vllm backend.
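
A sketch pairing initialization with an explicit shutdown so VRAM is released even if generation fails (the import path and config values are illustrative):

import asyncio
from servers.generation import generation_init, generate, vllm_shutdown  # hypothetical

generation_init(
    backend_configs={"vllm": {"model_name_or_path": "openbmb/MiniCPM4-8B"}},
    sampling_params={"temperature": 0.7, "max_tokens": 2048},
    backend="vllm",
)
try:
    answers = asyncio.run(generate(prompt_ls=["Hello, who are you?"]))
    print(answers["ans_ls"][0])
finally:
    vllm_shutdown()  # release the engine and its VRAM even if generation raised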

Configuration

# servers/generation/parameter.yaml
backend: vllm # options: vllm, openai, hf
backend_configs:
  vllm:
    model_name_or_path: openbmb/MiniCPM4-8B
    gpu_ids: "2,3"
    gpu_memory_utilization: 0.9
    dtype: auto
    trust_remote_code: true
  openai:
    model_name: MiniCPM4-8B
    base_url: http://localhost:8000/v1
    api_key: "abc"
    concurrency: 8
    retries: 3
    base_delay: 1.0
  hf:
    model_name_or_path: openbmb/MiniCPM4-8B
    gpu_ids: '2,3'
    trust_remote_code: true
    batch_size: 8
sampling_params:
  temperature: 0.7
  top_p: 0.8
  max_tokens: 2048
extra_params:
  chat_template_kwargs:
    enable_thinking: false
system_prompt: ""
image_tag: null
Parameter Description:

| Parameter | Type | Description |
| --- | --- | --- |
| backend | str | Generation backend; one of vllm, openai, or hf (Transformers) |
| backend_configs | dict | Model and runtime environment configuration for each backend |
| sampling_params | dict | Sampling parameters controlling generation diversity and length |
| extra_params | dict | Extra parameters, e.g., chat_template_kwargs |
| system_prompt | str | Global system prompt, added to the context as a system message |
| image_tag | str | Image placeholder tag (if needed) |
backend_configs Detailed Description:

| Backend | Parameter | Description |
| --- | --- | --- |
| vllm | model_name_or_path | Model name or path |
| vllm | gpu_ids | GPU IDs to use (e.g., "0,1") |
| vllm | gpu_memory_utilization | GPU memory utilization ratio (0–1) |
| vllm | dtype | Data type (e.g., auto, bfloat16) |
| vllm | trust_remote_code | Whether to trust remote code |
| openai | model_name | OpenAI model name or a self-hosted compatible model |
| openai | base_url | API base URL |
| openai | api_key | API key |
| openai | concurrency | Maximum concurrent requests |
| openai | retries | Number of API retries |
| openai | base_delay | Base wait time between retries (seconds) |
| hf | model_name_or_path | Transformers model path |
| hf | gpu_ids | GPU IDs (same as above) |
| hf | trust_remote_code | Whether to trust remote code |
| hf | batch_size | Batch size per inference call |
sampling_params Detailed Description:

| Parameter | Type | Description |
| --- | --- | --- |
| temperature | float | Controls randomness; higher values produce more diverse output |
| top_p | float | Nucleus sampling threshold |
| max_tokens | int | Maximum number of generated tokens |
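
A sketch that drives generation_init from parameter.yaml so code and configuration stay in sync (the import path is illustrative; yaml comes from PyYAML):

import yaml
from servers.generation import generation_init  # hypothetical import path

with open("servers/generation/parameter.yaml") as f:
    cfg = yaml.safe_load(f)

generation_init(
    backend_configs=cfg["backend_configs"],
    sampling_params=cfg["sampling_params"],
    extra_params=cfg.get("extra_params"),
    backend=cfg.get("backend", "vllm"),
)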