Recommended options for LLM model serving
For Llama 4 models (the chat template below is the pythonic tool-calling template shipped with vLLM):

```yaml
gpu-memory-utilization: 0.9
enable-expert-parallel: true
quantization: "compressed-tensors"
enable-auto-tool-choice: true
tool-call-parser: "pythonic"
chat-template: "examples/tool_chat_template_llama4_pythonic.jinja"
```
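A minimal launch sketch using the options above: vLLM accepts a YAML config file via `--config`, so the keys can be passed as-is. The model name here is a placeholder assumption, not specified by this document; substitute the checkpoint you actually serve.

```shell
# Write the recommended options to a vLLM config file.
cat > llama4.yaml <<'EOF'
gpu-memory-utilization: 0.9
enable-expert-parallel: true
quantization: "compressed-tensors"
enable-auto-tool-choice: true
tool-call-parser: "pythonic"
chat-template: "examples/tool_chat_template_llama4_pythonic.jinja"
EOF

# Placeholder model name -- replace with your Llama 4 checkpoint.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --config llama4.yaml
```

`enable-expert-parallel` only has an effect on mixture-of-experts checkpoints; it is a no-op for dense models.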
For Qwen3 models (hermes tool calling plus the Qwen3 reasoning parser):

```yaml
gpu-memory-utilization: 0.95
max-model-len: 32768
tool-call-parser: "hermes"
reasoning-parser: "qwen3"
enable-expert-parallel: true
enable-auto-tool-choice: true
enable-reasoning: true
```
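The same options can be passed directly as CLI flags. The sketch below assumes a Qwen3 MoE checkpoint (expert parallelism only applies to MoE models); the model name is an illustrative assumption. Note that in recent vLLM releases `--enable-reasoning` is deprecated and `--reasoning-parser` alone enables reasoning output.

```shell
# Placeholder model name -- replace with your Qwen3 checkpoint.
# An MoE variant is assumed here because of --enable-expert-parallel.
vllm serve Qwen/Qwen3-30B-A3B \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --tool-call-parser hermes \
  --reasoning-parser qwen3 \
  --enable-expert-parallel \
  --enable-auto-tool-choice
```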