generation_init
- Initializes the inference backend and sampling parameters.
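A minimal initialization sketch is shown below. The import path and the exact keyword layout are assumptions for illustration only; the accepted keys correspond to the Parameter Configuration tables later in this section.

```python
# Hypothetical sketch: the module path is a placeholder, and the call
# signature of generation_init is an assumption inferred from the
# parameter tables in this section.
from generation_ops import generation_init  # placeholder import path

generation_init(
    backend="vllm",  # one of: "vllm", "openai", "hf"
    backend_configs={
        "vllm": {
            "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",  # example model
            "gpu_ids": "0,1",
            "gpu_memory_utilization": 0.9,
            "dtype": "auto",
            "trust_remote_code": True,
        }
    },
    sampling_params={"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024},
    system_prompt="You are a helpful assistant.",
)
```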
generate
- Text-only dialogue generation.
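A possible call pattern, assuming OpenAI-style chat messages and a string return value (both are assumptions; the document does not show the signature):

```python
# Hypothetical usage: the argument name and return type are assumptions.
messages = [{"role": "user", "content": "Summarize nucleus sampling in one sentence."}]
reply = generate(messages)  # system_prompt from generation_init is assumed to be prepended
print(reply)
```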
multimodal_generate
- Performs multimodal (text-image) dialogue generation.
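A possible call pattern for the multimodal case; the image argument name and the accepted formats are assumptions:

```python
# Hypothetical usage: the `images` argument name and path/URL support are assumptions.
reply = multimodal_generate(
    messages=[{"role": "user", "content": "Describe this image."}],
    images=["./examples/cat.png"],  # local path or URL, depending on backend support
)
print(reply)
```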
Parameter Configuration
| Parameter | Type | Description |
|---|---|---|
| backend | str | Specifies the generation backend; options: vllm, openai, or hf (Transformers). |
| backend_configs | dict | Model and runtime configuration for each backend. |
| sampling_params | dict | Sampling parameters controlling generation diversity and length. |
| system_prompt | str | Global system prompt, added as a system message to the context. |
backend_configs:
| Backend | Parameter | Description |
|---|---|---|
| vllm | model_name_or_path | Model name or local path. |
| | gpu_ids | GPU IDs to use (e.g., "0,1"). |
| | gpu_memory_utilization | GPU memory usage ratio (0–1). |
| | dtype | Data type (e.g., auto, bfloat16). |
| | trust_remote_code | Whether to trust remote custom code. |
| openai | model_name | Model name for the OpenAI API or a self-hosted OpenAI-compatible service. |
| | base_url | API endpoint URL. |
| | api_key | API key for authentication. |
| | concurrency | Maximum number of concurrent requests. |
| | retries | Maximum number of retries per API request. |
| | base_delay | Base delay (in seconds) between retries. |
| hf | model_name_or_path | Model path for the Transformers backend. |
| | gpu_ids | GPU IDs to use (same format as above). |
| | trust_remote_code | Whether to trust remote custom code. |
| | batch_size | Batch size per inference step. |
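For reference, a backend_configs fragment for the openai backend might look like the sketch below; the model name and endpoint are placeholders, and the key is read from the environment:

```python
# Placeholder values throughout; store the real key in an environment variable.
import os

backend_configs = {
    "openai": {
        "model_name": "gpt-4o-mini",              # or a self-hosted model's name
        "base_url": "https://api.openai.com/v1",  # any OpenAI-compatible endpoint
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "concurrency": 8,    # at most 8 requests in flight
        "retries": 3,        # retry a failed request up to 3 times
        "base_delay": 1.0,   # seconds between retries (backoff strategy not specified here)
    }
}
```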
sampling_params:
| Parameter | Type | Description |
|---|---|---|
| temperature | float | Controls randomness; higher values increase diversity. |
| top_p | float | Nucleus sampling threshold. |
| max_tokens | int | Maximum number of generated tokens. |
| chat_template_kwargs | dict | Additional arguments passed to the chat template. |
| enable_thinking | bool | Enables chain-of-thought style reasoning output (if supported by the model). |
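A sampling_params sketch follows. enable_thinking is shown top-level, matching the table above, though some stacks pass it through chat_template_kwargs instead; the exact placement is an assumption:

```python
sampling_params = {
    "temperature": 0.7,          # lower values give more deterministic output
    "top_p": 0.9,                # nucleus sampling threshold
    "max_tokens": 2048,          # cap on generated tokens
    "chat_template_kwargs": {},  # extra arguments forwarded to the chat template
    "enable_thinking": True,     # effective only for models that support reasoning output
}
```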