
generation_init

Signature
def generation_init(
    backend_configs: Dict[str, Any],
    sampling_params: Dict[str, Any],
    extra_params: Optional[Dict[str, Any]] = None,
    backend: str = "vllm",
) -> None
Function
  • Initializes the inference backend and sampling parameters.
  • Supports the vllm, openai, and hf backends.
  • extra_params can pass chat_template_kwargs or other backend-specific parameters.
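
A minimal initialization sketch, assuming the functions are importable from the generation server module (the import path is illustrative; the values mirror the Configuration section below):

from servers.generation import generation_init  # hypothetical import path

# Initialize the vllm backend with the same values used in parameter.yaml.
generation_init(
    backend_configs={
        "vllm": {
            "model_name_or_path": "openbmb/MiniCPM4-8B",
            "gpu_ids": "2,3",
            "gpu_memory_utilization": 0.9,
            "dtype": "auto",
            "trust_remote_code": True,
        }
    },
    sampling_params={"temperature": 0.7, "top_p": 0.8, "max_tokens": 2048},
    extra_params={"chat_template_kwargs": {"enable_thinking": False}},
    backend="vllm",
)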

generate

Signature
async def generate(
    prompt_ls: List[Union[str, Dict[str, Any]]],
    system_prompt: str = "",
) -> Dict[str, List[str]]
Function
  • Plain-text conversation generation.
  • Handles each prompt in the list automatically; a prompt may be a plain string or an OpenAI-format message dictionary.
Output Format (JSON)
{"ans_ls": ["answer for prompt_0", "answer for prompt_1", "..."]}

multimodal_generate

Signature
async def multimodal_generate(
    multimodal_path: List[List[str]],
    prompt_ls: List[Union[str, Dict[str, Any]]],
    system_prompt: str = "",
    image_tag: Optional[str] = None,
) -> Dict[str, List[str]]
Function
  • Text-image multimodal conversation generation.
  • multimodal_path: a list of image paths for each prompt (local paths and URLs are supported).
  • image_tag: if specified (e.g., <img>), images are inserted at the tag's position in the prompt; otherwise they are appended to the end of the prompt.
Output Format (JSON)
{"ans_ls": ["answer with images for prompt_0", "..."]}

multiturn_generate

Signature
async def multiturn_generate(
    messages: List[Dict[str, str]],
    system_prompt: str = "",
) -> Dict[str, List[str]]
Function
  • Multi-turn conversation generation.
  • Handles a single conversation per call; batched prompts are not supported.
Output Format (JSON)
{"ans_ls": ["assistant response"]}

vllm_shutdown

Signature
def vllm_shutdown() -> None
Function
  • Explicitly shuts down the vLLM engine and releases its VRAM.
  • Effective only when using the vllm backend.
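
A sketch pairing initialization with an explicit shutdown so VRAM is released even if generation fails (the import path and config values are illustrative):

import asyncio
from servers.generation import generation_init, generate, vllm_shutdown  # hypothetical

generation_init(
    backend_configs={"vllm": {"model_name_or_path": "openbmb/MiniCPM4-8B"}},
    sampling_params={"temperature": 0.7, "max_tokens": 2048},
    backend="vllm",
)
try:
    answers = asyncio.run(generate(prompt_ls=["Hello, who are you?"]))
    print(answers["ans_ls"][0])
finally:
    vllm_shutdown()  # release the engine and its VRAM even if generation raised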

Configuration

# servers/generation/parameter.yaml
backend: vllm # options: vllm, openai, hf
backend_configs:
  vllm:
    model_name_or_path: openbmb/MiniCPM4-8B
    gpu_ids: "2,3"
    gpu_memory_utilization: 0.9
    dtype: auto
    trust_remote_code: true
  openai:
    model_name: MiniCPM4-8B
    base_url: http://localhost:8000/v1
    api_key: "abc"
    concurrency: 8
    retries: 3
    base_delay: 1.0
  hf:
    model_name_or_path: openbmb/MiniCPM4-8B
    gpu_ids: '2,3'
    trust_remote_code: true
    batch_size: 8
sampling_params:
  temperature: 0.7
  top_p: 0.8
  max_tokens: 2048
extra_params:
  chat_template_kwargs:
    enable_thinking: false
system_prompt: ""
image_tag: null
Parameter Description:

| Parameter | Type | Description |
| --- | --- | --- |
| backend | str | Generation backend; one of vllm, openai, or hf (Transformers) |
| backend_configs | dict | Model and runtime environment configuration for each backend |
| sampling_params | dict | Sampling parameters controlling generation diversity and length |
| extra_params | dict | Extra parameters, e.g., chat_template_kwargs |
| system_prompt | str | Global system prompt, added to the context as a system message |
| image_tag | str | Image placeholder tag (if needed) |
backend_configs Detailed Description:

| Backend | Parameter | Description |
| --- | --- | --- |
| vllm | model_name_or_path | Model name or path |
| vllm | gpu_ids | GPU IDs to use (e.g., "0,1") |
| vllm | gpu_memory_utilization | GPU memory utilization ratio (0–1) |
| vllm | dtype | Data type (e.g., auto, bfloat16) |
| vllm | trust_remote_code | Whether to trust remote code |
| openai | model_name | OpenAI model name or a self-hosted compatible model |
| openai | base_url | API base URL |
| openai | api_key | API key |
| openai | concurrency | Maximum concurrent requests |
| openai | retries | Number of API retries |
| openai | base_delay | Base wait time between retries (seconds) |
| hf | model_name_or_path | Transformers model path |
| hf | gpu_ids | GPU IDs (same as above) |
| hf | trust_remote_code | Whether to trust remote code |
| hf | batch_size | Batch size per inference call |
sampling_params Detailed Description:

| Parameter | Type | Description |
| --- | --- | --- |
| temperature | float | Controls randomness; higher values produce more diverse output |
| top_p | float | Nucleus sampling threshold |
| max_tokens | int | Maximum number of generated tokens |
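
A sketch that drives generation_init from parameter.yaml so code and configuration stay in sync (the import path is illustrative; yaml comes from PyYAML):

import yaml
from servers.generation import generation_init  # hypothetical import path

with open("servers/generation/parameter.yaml") as f:
    cfg = yaml.safe_load(f)

generation_init(
    backend_configs=cfg["backend_configs"],
    sampling_params=cfg["sampling_params"],
    extra_params=cfg.get("extra_params"),
    backend=cfg.get("backend", "vllm"),
)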