Recommended options for serving microchipLLM models

  • Recommended VRAM: 640 GB (quantization requires a Hopper-or-later GPU)

  • Recommended args: see below

gpu-memory-utilization: 0.9
enable-expert-parallel: true
quantization: "compressed-tensors"
enable-auto-tool-choice: true
tool-call-parser: "pythonic"
chat-template: "examples/tool_chat_template_llama4_pythonic.jinja"

  • Recommended VRAM: 640 GB

  • Recommended args: see below

gpu-memory-utilization: 0.95
max-model-len: 32768
tool-call-parser: "hermes"
reasoning-parser: "qwen3"
enable-expert-parallel: true
enable-auto-tool-choice: true
enable-reasoning: true
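
Again as a sketch under the same assumption (vLLM serving, placeholder model path), these options translate to:

# Placeholder model path; replace with the actual checkpoint.
# --enable-reasoning is listed above; on recent vLLM releases it may be
# deprecated, with --reasoning-parser alone enabling reasoning output.
vllm serve /models/microchipllm-reasoning-placeholder \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --tool-call-parser hermes \
  --reasoning-parser qwen3 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --enable-reasoning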

  • Recommended VRAM: 640 GB

  • Recommended args: see below


