Function

The Generation Server deploys large language models (LLMs) and generates responses from inputs received from the Prompt Server. Currently, vLLM is used as the model deployment backend.

Parameter Description

servers/generation/parameter.yaml
model_name: openbmb/MiniCPM4-8B # model name or path
base_url: http://localhost:8000/v1 # vllm server url

# init vllm server configs
port: 8000
gpu_ids: "0,1"
api_key: ""

# generation parameters
sampling_params:
  temperature: 0.7
  top_p: 0.8
  max_tokens: 2048
  extra_body:
    top_k: 20
    chat_template_kwargs:
      enable_thinking: false # for models such as Qwen3; set to true to enable thinking
    include_stop_str_in_output: true # used in the search-o1 pipeline
  # stop: [ "<|im_end|>", "<|end_search_query|>" ] # used in the search-o1 pipeline

  • model_name: The name or path of the generation model used
  • base_url: The HTTP interface address of the vLLM model service
  • port: The port that the local vLLM service listens on
  • gpu_ids: Comma-separated IDs of the GPU devices made visible to the vLLM service
  • api_key: The API Key required to call the model service
  • sampling_params: Generation parameters supported by vLLM, such as temperature, top_p, max_tokens, etc.
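As a rough illustration of how the "init vllm server configs" above might be used, the sketch below assembles a `vllm serve` launch command from `model_name`, `port`, `gpu_ids`, and `api_key`. The helper name `build_vllm_command` is hypothetical, and the exact flags are an assumption about the vLLM CLI, not the server's actual implementation:

```python
import os

# Hypothetical sketch: turn the config fields above into a `vllm serve`
# launch command plus an environment that pins the visible GPUs.
def build_vllm_command(model_name, port, gpu_ids, api_key=""):
    # Restrict the service to the configured GPU devices.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_ids)
    cmd = ["vllm", "serve", model_name, "--port", str(port)]
    if api_key:  # only require a key when one is configured
        cmd += ["--api-key", api_key]
    return cmd, env

cmd, env = build_vllm_command("openbmb/MiniCPM4-8B", 8000, "0,1")
```

The returned command would then be handed to a process launcher (e.g. `subprocess.Popen(cmd, env=env)`), after which the caller polls the service until it is ready.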

Tool Description

  • initialize_local_vllm: Starts a vLLM model service locally, waits for it to be ready, and returns the service's base_url.
  • generate: Receives prompt input from the Prompt Server, calls an LLM endpoint that supports the OpenAI API protocol, and returns a list of response strings.
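The generate tool can be sketched roughly as follows, assuming the OpenAI-compatible /v1/chat/completions endpoint that vLLM exposes. All names here are illustrative (not the server's actual code), and `post_fn` is an injected transport so the sketch stays independent of any particular HTTP client:

```python
# Illustrative sketch of generate: send each prompt to an OpenAI-compatible
# chat endpoint and collect the response strings.
# post_fn(path, payload) -> parsed JSON dict; in practice it would wrap an
# HTTP client pointed at base_url, sending api_key as a Bearer token.
def generate(prompts, model_name, sampling_params, post_fn):
    responses = []
    for prompt in prompts:
        payload = {
            "model": model_name,
            "messages": [{"role": "user", "content": prompt}],
            **sampling_params,  # temperature, top_p, max_tokens, ...
        }
        result = post_fn("/v1/chat/completions", payload)
        responses.append(result["choices"][0]["message"]["content"])
    return responses

# Stub transport that echoes the prompt, standing in for a live vLLM server.
def fake_post(path, payload):
    text = payload["messages"][0]["content"]
    return {"choices": [{"message": {"content": f"echo: {text}"}}]}

out = generate(["hi", "bye"], "openbmb/MiniCPM4-8B",
               {"temperature": 0.7, "max_tokens": 2048}, fake_post)
# out == ["echo: hi", "echo: bye"]
```

Injecting the transport keeps the prompt-to-payload logic testable without a running model service; the real tool would simply supply a transport bound to the configured base_url.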