# Recommended Options for Serving LLMs

## [meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8)

* Recommended VRAM: 640 GB (the FP8 quantization requires a Hopper-generation or newer GPU)
* Recommended `args`: see below

```yaml
gpu-memory-utilization: 0.9
enable-expert-parallel: true
quantization: "compressed-tensors"
enable-auto-tool-choice: true
tool-call-parser: "pythonic"
chat-template: "examples/tool_chat_template_llama4_pythonic.jinja"
```

* On `v0.9.0` or earlier, using the tool options may trigger a [Unicode issue](https://github.com/vllm-project/vllm/pull/17704)
* [Chat template download link](https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama4_pythonic.jinja)
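
As a rough illustration of how the recommended `args` could end up on a command line, the sketch below converts the mapping into `vllm serve` flags. It assumes the keys map one-to-one to vLLM command-line options; `to_cli_flags` is a hypothetical helper written for this example, not part of vLLM or of this platform.

```python
from typing import Any


def to_cli_flags(args: dict[str, Any]) -> list[str]:
    """Turn an args mapping into `--flag value` tokens (hypothetical helper)."""
    flags: list[str] = []
    for key, value in args.items():
        if value is True:
            # Boolean switches are passed as bare flags.
            flags.append(f"--{key}")
        elif value is False or value is None:
            # Disabled or unset options are omitted.
            continue
        else:
            flags.extend([f"--{key}", str(value)])
    return flags


# The recommended args for Llama 4 Maverick, copied from the YAML above.
llama4_args = {
    "gpu-memory-utilization": 0.9,
    "enable-expert-parallel": True,
    "quantization": "compressed-tensors",
    "enable-auto-tool-choice": True,
    "tool-call-parser": "pythonic",
    "chat-template": "examples/tool_chat_template_llama4_pythonic.jinja",
}

command = ["vllm", "serve", "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"]
command += to_cli_flags(llama4_args)
print(" ".join(command))
```

The same helper can be reused for the other models on this page by swapping in their `args` mapping.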

## [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B)

* Recommended VRAM: 640 GB
* Recommended `args`: see below

```yaml
gpu-memory-utilization: 0.95
max-model-len: 32768
tool-call-parser: "hermes"
reasoning-parser: "qwen3"
enable-expert-parallel: true
enable-auto-tool-choice: true
enable-reasoning: true
```

* The [Qwen3 reasoning parser](https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/qwen3_reasoning_parser.py) is supported in `v0.9.0` and later
* To disable reasoning, use `"--no-enable-reasoning": "X-BOOLEAN-TRUE"` (scheduled for deprecation in v0.10.0)
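
To make the role of a reasoning parser more concrete, here is a minimal sketch of the idea: Qwen3 wraps its reasoning in `<think>...</think>`, and the parser separates that part from the final answer. `split_reasoning` is a hypothetical function written for illustration; it is not vLLM's implementation and ignores streaming and other edge cases.

```python
def split_reasoning(
    text: str, start: str = "<think>", end: str = "</think>"
) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no block is found."""
    head, sep, tail = text.partition(end)
    if not sep:
        # No closing tag: treat the whole output as the answer.
        return "", text.strip()
    # Drop the opening tag from the part before the closing tag.
    reasoning = head.replace(start, "", 1).strip()
    return reasoning, tail.strip()


raw = "<think>The user wants a greeting.</think>\n\nHello!"
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```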

## [google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it)

* Recommended VRAM: 640 GB
* Recommended `args`: see below

```yaml
gpu-memory-utilization: 0.9
max-model-len: 100000
tool-call-parser: "pythonic"
chat-template: "/mnt/model/tool_chat_template_gemma3_pythonic.jinja"
enable-auto-tool-choice: true
```

* A [function calling template](https://github.com/vllm-project/vllm/pull/17149) must be added separately before `tool-call-parser` can be used
* Gemma 3 has no special tokens for tool calling, so tool-calling quality may be poor
* [Chat template download link](https://raw.githubusercontent.com/philipchung/vllm/main/examples/tool_chat_template_gemma3_pythonic.jinja)
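
With the `pythonic` parser, the model is expected to emit tool calls as a Python-style list of function calls, for example `[get_weather(city="Seoul")]`. The sketch below shows one way such text could be parsed with the standard `ast` module. `parse_pythonic_calls` and the `get_weather` tool are hypothetical names used for illustration, and this is not vLLM's parser.

```python
import ast
from typing import Any


def parse_pythonic_calls(text: str) -> list[dict[str, Any]]:
    """Parse '[f(a=1), g(b="x")]' into a list of name/arguments dicts."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a list of calls")
    calls: list[dict[str, Any]] = []
    for node in tree.body.elts:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("expected a plain function call")
        arguments: dict[str, Any] = {}
        for kw in node.keywords:
            if kw.arg is None:
                raise ValueError("**kwargs expansion is not supported")
            # literal_eval only accepts literals, so no code is executed.
            arguments[kw.arg] = ast.literal_eval(kw.value)
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


output = '[get_weather(city="Seoul", unit="celsius")]'
print(parse_pythonic_calls(output))
```

Because the format relies on plain text rather than dedicated tokens, malformed output raises an error here, which is one way the weaker tool-calling behavior noted above can surface.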

***

## RoPE

* [RoPE scaling](https://docs.vllm.ai/en/stable/api/vllm/config.html?h=rope#vllm.config.ModelConfig.rope_scaling) can be configured as follows

```yaml
rope-scaling:
  rope_type: "yarn"
  factor: 4.0
  original_max_position_embeddings: 32768
```
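
As a quick check on what these values imply, the extended context window is roughly `factor` times `original_max_position_embeddings`. The snippet below is a minimal sketch of that arithmetic; `extended_context_length` is a hypothetical helper, and the context length that is actually usable also depends on the model and on `max-model-len`.

```python
def extended_context_length(factor: float, original_max_position_embeddings: int) -> int:
    """Approximate context length after RoPE scaling (hypothetical helper)."""
    return int(factor * original_max_position_embeddings)


# Values taken from the rope-scaling example above.
print(extended_context_length(4.0, 32768))  # 131072
```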

