
Recommended Options for Serving LLM Models

meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

Recommended VRAM: 640 GB (a Hopper or newer GPU is required because of the quantization)
Recommended args: see below

{
  "--gpu-memory-utilization": 0.9,
  "--enable-expert-parallel": "X-BOOLEAN-TRUE",
  "--max-model-len": 430000,
  "--quantization": "compressed-tensors",
  "--enable-auto-tool-choice": "X-BOOLEAN-TRUE",
  "--tool-call-parser": "pythonic",
  "--chat-template": "examples/tool_chat_template_llama4_pythonic.jinja"
}

With v0.9.0 or earlier, using the tool options may cause Unicode issues.
Chat template download link
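
The argument maps on this page appear to use "X-BOOLEAN-TRUE" as a placeholder for flags that take no value. As a minimal sketch under that assumption, the Python helper below flattens such a map into a command line. The helper name `to_cli_args`, the `vllm serve` entry point, and the idea that the serving platform expands the map this way are all assumptions made for illustration; the page itself does not state them.

```python
import shlex

BOOLEAN_TRUE = "X-BOOLEAN-TRUE"  # sentinel value used in the maps on this page


def to_cli_args(model: str, options: dict) -> list[str]:
    """Flatten an option map into CLI tokens (hypothetical helper)."""
    argv = ["vllm", "serve", model]  # assumed entry point
    for flag, value in options.items():
        argv.append(flag)
        # Assumption: the sentinel marks a bare flag that takes no value.
        if value != BOOLEAN_TRUE:
            argv.append(str(value))
    return argv


llama4_options = {
    "--gpu-memory-utilization": 0.9,
    "--enable-expert-parallel": BOOLEAN_TRUE,
    "--max-model-len": 430000,
    "--quantization": "compressed-tensors",
    "--enable-auto-tool-choice": BOOLEAN_TRUE,
    "--tool-call-parser": "pythonic",
    "--chat-template": "examples/tool_chat_template_llama4_pythonic.jinja",
}

argv = to_cli_args(
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", llama4_options
)
print(shlex.join(argv))
```

Printing the joined command before deploying makes it easy to check that boolean flags ended up without a trailing value.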

Qwen/Qwen3-235B-A22B

Recommended VRAM: 640 GB
Recommended args: see below

{
  "--gpu-memory-utilization": 0.95,
  "--enable-expert-parallel": "X-BOOLEAN-TRUE",
  "--max-model-len": 32768,
  "--enable-auto-tool-choice": "X-BOOLEAN-TRUE",
  "--tool-call-parser": "hermes",
  "--enable-reasoning": "X-BOOLEAN-TRUE",
  "--reasoning-parser": "qwen3"
}

The Qwen3 reasoning parser is supported from v0.9.0 onward.
If you do not want reasoning, use "--no-enable-reasoning": "X-BOOLEAN-TRUE" (scheduled for deprecation in v0.10.0).
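
Switching reasoning off amounts to a small change to the argument map. The sketch below shows one way to derive the non-reasoning variant from the recommended map. The helper name `disable_reasoning` is hypothetical, and dropping "--reasoning-parser" together with "--enable-reasoning" is an assumption of this sketch rather than something this page specifies.

```python
BOOLEAN_TRUE = "X-BOOLEAN-TRUE"  # sentinel value used in the maps on this page

# Assumption: both reasoning flags are removed when reasoning is disabled.
REASONING_FLAGS = {"--enable-reasoning", "--reasoning-parser"}


def disable_reasoning(options: dict) -> dict:
    """Return a copy of the option map with reasoning turned off (hypothetical helper)."""
    updated = {k: v for k, v in options.items() if k not in REASONING_FLAGS}
    updated["--no-enable-reasoning"] = BOOLEAN_TRUE
    return updated


qwen3_options = {
    "--gpu-memory-utilization": 0.95,
    "--enable-expert-parallel": BOOLEAN_TRUE,
    "--max-model-len": 32768,
    "--enable-auto-tool-choice": BOOLEAN_TRUE,
    "--tool-call-parser": "hermes",
    "--enable-reasoning": BOOLEAN_TRUE,
    "--reasoning-parser": "qwen3",
}

no_reasoning_options = disable_reasoning(qwen3_options)
print(no_reasoning_options)
```

The original map is left unchanged, so both variants can be kept side by side.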

google/gemma-3-27b-it

Recommended VRAM: 640 GB
Recommended args: see below

{
  "--gpu-memory-utilization": 0.9,
  "--max-model-len": 100000,
  "--enable-auto-tool-choice": "X-BOOLEAN-TRUE",
  "--tool-call-parser": "pythonic",
  "--chat-template": "/mnt/model/tool_chat_template_gemma3_pythonic.jinja"
}

A function calling template must be added separately before the tool-call-parser can be used.
Gemma3 has no special tokens for tool calling, so tool-calling performance may be poor.
Chat template download link
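
For context on what the "pythonic" parser expects: with this format the model writes tool calls as a Python list of function calls rather than as JSON. The sketch below illustrates the idea with the standard `ast` module. It is not the server's actual parser, and the tool name `get_weather` and its arguments are made up for the example.

```python
import ast


def parse_pythonic_tool_calls(text: str) -> list[dict]:
    """Parse text such as "[f(a=1), g(b='x')]" into name/arguments dicts."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of calls")
    calls = []
    for node in tree.body.elts:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("expected simple function calls")
        # This sketch accepts keyword arguments with literal values only.
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


# Hypothetical model output containing a single tool call.
sample = '[get_weather(city="Seoul", unit="celsius")]'
calls = parse_pythonic_tool_calls(sample)
print(calls)
```

Because nothing but plain text marks the start or end of a call, any stray prose around the list makes parsing fail, which is one plausible reason tool calling is less reliable for models without dedicated tool-call tokens.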

RoPE

RoPE scaling can be configured as follows:

{
    "--rope-scaling": "{\"rope_type\": \"yarn\", \"factor\": 4.0, \"original_max_position_embeddings\": 32768}"
}
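
The value of "--rope-scaling" is itself a JSON string, which is why its inner quotes are escaped. As a small sketch, the Python snippet below builds that string with `json.dumps` instead of escaping by hand, and computes the context length the setting targets, assuming that YaRN with a given factor extends the original window by roughly that factor.

```python
import json

rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

# The inner dumps produces the string value; dumping the outer map
# then escapes the inner quotes automatically.
options = {"--rope-scaling": json.dumps(rope_scaling)}
print(json.dumps(options, indent=4))

# Assumption: target context length = factor * original window.
extended_len = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(extended_len)  # 131072
```

If the extended window is to be used, "--max-model-len" would presumably need to be raised to match; this page does not state that explicitly.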


Last updated 1 month ago
