Function

The Generation Server deploys large language models (LLMs) and generates responses from inputs received from the Prompt Server. Currently, vLLM is used as the model deployment backend.

Parameter Description

servers/generation/parameter.yaml
model_name: openbmb/MiniCPM4-8B # model name or path
base_url: http://localhost:8000/v1 # vllm server url

# init vllm server configs
port: 8000
gpu_ids: "0,1"
api_key: ""

# generation parameters
sampling_params:
  temperature: 0.7
  top_p: 0.8
  max_tokens: 2048
  extra_body:
    top_k: 20
    chat_template_kwargs:
      enable_thinking: false # for models such as Qwen3; set to true to enable thinking
    include_stop_str_in_output: true # used in the search-o1 pipeline
  # stop: [ "<|im_end|>", "<|end_search_query|>" ] # used in the search-o1 pipeline

  • model_name: The name or path of the generation model used
  • base_url: The HTTP interface address of the vLLM model service
  • port: The port that the local vLLM service listens on
  • gpu_ids: Comma-separated IDs of the GPU devices made visible to the vLLM service
  • api_key: The API Key required to call the model service
  • sampling_params: Generation parameters supported by vLLM, such as temperature, top_p, max_tokens, etc.
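As a rough illustration of how the "init vllm server configs" above might be used, the sketch below assembles a `vllm serve` launch command from `model_name`, `port`, `gpu_ids`, and `api_key`. The helper name `build_vllm_command` is hypothetical, and the exact flags are an assumption about the vLLM CLI, not the server's actual implementation:

```python
import os

# Hypothetical sketch: turn the config fields above into a `vllm serve`
# launch command plus an environment that pins the visible GPUs.
def build_vllm_command(model_name, port, gpu_ids, api_key=""):
    # Restrict the service to the configured GPU devices.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_ids)
    cmd = ["vllm", "serve", model_name, "--port", str(port)]
    if api_key:  # only require a key when one is configured
        cmd += ["--api-key", api_key]
    return cmd, env

cmd, env = build_vllm_command("openbmb/MiniCPM4-8B", 8000, "0,1")
```

The returned command would then be handed to a process launcher (e.g. `subprocess.Popen(cmd, env=env)`), after which the caller polls the service until it is ready.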

Tool Description

  • initialize_local_vllm: Starts a vLLM model service locally, waits for it to be ready, and returns the service's base_url.
  • generate: Receives prompt input from the Prompt Server, calls an LLM endpoint that supports the OpenAI API protocol, and returns a list of response strings.
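The generate tool can be sketched roughly as follows, assuming the OpenAI-compatible /v1/chat/completions endpoint that vLLM exposes. All names here are illustrative (not the server's actual code), and `post_fn` is an injected transport so the sketch stays independent of any particular HTTP client:

```python
# Illustrative sketch of generate: send each prompt to an OpenAI-compatible
# chat endpoint and collect the response strings.
# post_fn(path, payload) -> parsed JSON dict; in practice it would wrap an
# HTTP client pointed at base_url, sending api_key as a Bearer token.
def generate(prompts, model_name, sampling_params, post_fn):
    responses = []
    for prompt in prompts:
        payload = {
            "model": model_name,
            "messages": [{"role": "user", "content": prompt}],
            **sampling_params,  # temperature, top_p, max_tokens, ...
        }
        result = post_fn("/v1/chat/completions", payload)
        responses.append(result["choices"][0]["message"]["content"])
    return responses

# Stub transport that echoes the prompt, standing in for a live vLLM server.
def fake_post(path, payload):
    text = payload["messages"][0]["content"]
    return {"choices": [{"message": {"content": f"echo: {text}"}}]}

out = generate(["hi", "bye"], "openbmb/MiniCPM4-8B",
               {"temperature": 0.7, "max_tokens": 2048}, fake_post)
# out == ["echo: hi", "echo: bye"]
```

Injecting the transport keeps the prompt-to-payload logic testable without a running model service; the real tool would simply supply a transport bound to the configured base_url.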