How to Enable Multi-Token Prediction in llama.cpp: Complete Tutorial
By Braincuber Team
Published on May 18, 2026
What if you could make large language models run faster without upgrading your GPU, changing your machine, or switching to a smaller model? That is exactly what we tested in this complete tutorial using Multi-Token Prediction, or MTP. In our benchmark, the same Qwen3.6 27B model on the same RunPod RTX 3090 setup improved from 38 tokens/sec to 65 tokens/sec after enabling MTP. That is a 1.71x speedup, or around 71% higher throughput, with no visible loss in output quality. This step by step guide shows you how to replicate the same results.
What You'll Learn:
- What Multi-Token Prediction is and how it differs from standard token-by-token generation
- How to set up a RunPod RTX 3090 machine with proper configuration
- How to clone and switch to the MTP branch of llama.cpp
- How to build llama.cpp with full CUDA GPU support
- How to download the Qwen3.6 27B MTP GGUF model
- How to run baseline inference without MTP to measure starting speed
- How to enable MTP with two command-line flags and retest
- Real token/sec comparisons: 38 t/s vs 65 t/s on the same hardware
- How to push speeds even further with TurboQuant optimization
What Is Multi-Token Prediction?
Most LLMs generate text one token at a time. The model predicts the next token, adds it to the context, and repeats the same process again. This is reliable, but it can be slow because each new token usually requires another decoding step. Multi-Token Prediction changes this by allowing the model to look ahead and propose multiple future tokens instead of only one. These proposed tokens are then checked by the main decoding process. If the predictions are correct, the model can accept several tokens in one go. If one token is wrong, the model falls back to the normal path from that point.
In practice, MTP works like a built-in drafting mechanism. The model drafts a few likely next tokens, verifies them, and keeps the valid ones. The more draft tokens that get accepted, the fewer full decoding steps are needed, which can increase tokens per second without changing the final output quality.
Without MTP
Generate token 1 → generate token 2 → generate token 3. Each token requires a full decoding step. The model moves forward one tiny step at a time, which is reliable but slow for large models on limited hardware.
With MTP
Draft multiple tokens → verify them → accept valid tokens together. The model safely jumps ahead whenever its draft predictions are correct, reducing the total number of decoding steps needed and increasing throughput.
In tools like llama.cpp and vLLM-style implementations, this is closely related to speculative decoding, where draft tokens are accepted only when they match the verifier's output. The key advantage of MTP in Qwen3.6 is that no separate draft model is needed — the MTP heads are built into the model itself.
Step by Step Guide: Setting Up MTP with llama.cpp
Set Up a RunPod RTX 3090 Machine
Create a new RunPod pod and select an RTX 3090 GPU. Before deploying, edit the template settings: increase the volume disk size to 100 GB, add an additional HTTP port 8910, and add an environment variable called HF_TOKEN set to your Hugging Face access token. The extra HTTP port lets you access the llama.cpp server and web UI from your browser. The Hugging Face token authenticates the download request and improves model download speed for large GGUF files. After updating the template, deploy the pod. Once running, wait for RunPod to give you access to the JupyterLab instance. Open JupyterLab, then launch a new terminal and install required system packages: apt update && apt install -y git cmake build-essential curl wget python3-pip
Clone and Switch to the MTP Branch
Move into the workspace directory with cd /workspace. Clone the llama.cpp repository: git clone --depth 1 https://github.com/ggml-org/llama.cpp.git then cd llama.cpp. The MTP changes are being tested through a dedicated llama.cpp pull request, so fetch and switch to that branch: git fetch origin pull/22673/head:mtp-pr then git checkout mtp-pr. This switches your local llama.cpp build to the MTP-enabled version for the rest of this guide.
Build llama.cpp with CUDA Support
Now that you are on the MTP-enabled branch, build llama.cpp with CUDA support so the model uses the RTX 3090 GPU instead of running inference on the CPU. Run the CMake build configuration: cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release. Then compile the two targets we need: cmake --build build --target llama-cli llama-server -j. This builds llama-cli for command-line tests and llama-server for launching an OpenAI-compatible server with browser access. Once the build is complete, copy the binary: cp ./build/bin/llama-server ./llama-server
Download the Qwen3.6-27B-MTP Model
Install the Hugging Face download tools: pip install -U "huggingface_hub[hf_xet]" hf-xet hf_transfer. Enable faster downloads: export HF_HUB_ENABLE_HF_TRANSFER=1. Create a dedicated directory: mkdir -p /workspace/models/qwen3.6-mtp. Download the model: hf download froggeric/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-Q4_K_M-mtp.gguf --local-dir /workspace/models/qwen3.6-mtp. This is the model we will run first without MTP, then again with MTP enabled to compare the speed difference.
Run Qwen3.6-27B Without MTP (Baseline)
Move into the llama.cpp directory: cd /workspace/llama.cpp. Start the server without MTP to get a clean baseline. Use the same model, same GPU, same context size, and same server settings. The only change in the next step will be enabling MTP. Start the server with: ./llama-server -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf" --alias qwen3.6-27b-no-mtp --host 0.0.0.0 --port 8910 -ngl 99 -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -np 1 -b 2048 -ub 512 -t 8 -fa on --temp 0.7 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 --metrics. Once the server is ready, go to your RunPod dashboard and click the link for port 8910 to open the llama.cpp web UI. In our baseline test, the model generated responses at around 38.86 tokens/sec without MTP.
Run Qwen3.6-27B With MTP Enabled
Stop the server with CTRL + C. We are not changing the model, GPU, quantization, or most runtime settings. We are only adding two MTP-related flags: --spec-type mtp and --spec-draft-n-max 3. The first flag tells llama.cpp to use MTP-style speculative decoding. The second flag sets the maximum number of draft tokens to 3. Start the server again with MTP: ./llama-server -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf" --alias qwen3.6-27b-mtp --host 0.0.0.0 --port 8910 -ngl 99 -c 100000 --spec-type mtp --spec-draft-n-max 3 --cache-type-k q8_0 --cache-type-v q8_0 -np 1 -b 2048 -ub 512 -t 8 -fa on --temp 0.7 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 --metrics. Refresh the browser page and test with the same prompts. With MTP enabled, speed increased to 65-67 tokens/sec for simple prompts and 56-61 tokens/sec for complex prompts.
Compare Results and Consider TurboQuant
Enabling MTP improved Qwen3.6 27B from around 38 tokens/sec to 65 tokens/sec on the RunPod RTX 3090 setup. That gives a 1.71x speedup, or around 71% higher throughput, without changing the hardware or switching to a smaller model. For further optimization, the next step is to explore MTP combined with TurboQuant, which reduces KV-cache memory pressure during inference. This is especially useful for larger models, long-context prompts, and GPUs like the RTX 3090 where memory bandwidth and VRAM can become limiting factors.
Complete Server Command Reference
Here are the exact server commands for both baseline and MTP-enabled runs. The only difference between the two is the addition of --spec-type mtp and --spec-draft-n-max 3 in the MTP version.
./llama-server -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf" --alias qwen3.6-27b-no-mtp --host 0.0.0.0 --port 8910 -ngl 99 -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -np 1 -b 2048 -ub 512 -t 8 -fa on --temp 0.7 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 --metrics
./llama-server -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf" --alias qwen3.6-27b-mtp --host 0.0.0.0 --port 8910 -ngl 99 -c 100000 --spec-type mtp --spec-draft-n-max 3 --cache-type-k q8_0 --cache-type-v q8_0 -np 1 -b 2048 -ub 512 -t 8 -fa on --temp 0.7 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 --metrics
Benchmark Results: MTP vs Non-MTP
Here is the complete comparison of token generation speeds across different prompt types. All tests used the same Qwen3.6 27B Q4_K_M model, same RTX 3090 GPU, same context size (100K), and same server configuration.
| Test Type | Without MTP | With MTP | Improvement |
|---|---|---|---|
| Simple prompt (greeting) | 38.86 t/s | 65-67 t/s | ~71% faster |
| Complex prompt (Python game) | ~38 t/s | 56-61 t/s | ~50% faster |
| Overall average | ~38 t/s | ~65 t/s | 1.71x speedup |
Why Complex Prompts Show Lower MTP Gains
For more complex prompts like code generation, the model's draft predictions are less likely to be correct because the output is more varied and creative. MTP works best when the model can confidently predict the next tokens — which happens more often with straightforward responses. Even so, 56-61 tokens/sec is still a strong result for a 27B model on an RTX 3090, and significantly faster than the non-MTP baseline.
Key Command-Line Flags Explained
Understanding what each flag does helps you tune the server for your specific hardware and use case. Here is a breakdown of the most important parameters used in this tutorial.
| Flag | Value | Purpose |
|---|---|---|
-ngl | 99 | Number of layers to offload to GPU. 99 offloads all layers. |
-c | 100000 | Context size in tokens. Supports long documents and conversations. |
--cache-type-k | q8_0 | KV cache quantization for K values. q8_0 balances quality and VRAM. |
--cache-type-v | q8_0 | KV cache quantization for V values. Matches K cache for consistency. |
--spec-type | mtp | Enables Multi-Token Prediction speculative decoding mode. |
--spec-draft-n-max | 3 | Maximum draft tokens to predict per step. Higher = more aggressive. |
-fa | on | Enables Flash Attention for faster attention computation on GPU. |
--temp | 0.7 | Sampling temperature. Lower = more deterministic, higher = more creative. |
Next Steps: TurboQuant for Further Optimization
The benchmark in this guide uses the original llama.cpp MTP setup, without adding TurboQuant, custom patches, or other runtime-level optimizations. This keeps the test simple, reproducible, and focused on the speed gain from enabling MTP alone. To push performance further, the next optimization to explore is MTP and TurboQuant together. MTP improves throughput by allowing the model to accept multiple predicted tokens, while TurboQuant helps reduce KV-cache memory pressure during inference. This can be especially useful for larger models, long-context prompts, and GPUs like the RTX 3090, where memory bandwidth and VRAM can become limiting factors.
Why Some Community Results Show Higher Speeds
Some r/LocalLLaMA community results report higher tokens/sec than this guide. Those setups often combine MTP with TurboQuant, patched builds, different KV-cache settings, or faster GPUs. Since this tutorial focuses on a clean MTP-only benchmark, TurboQuant should be treated as the recommended next experiment rather than part of the current setup.
What MTP Does
Improves throughput by allowing the model to draft and accept multiple tokens per decoding step. Reduces the total number of forward passes needed, directly increasing tokens/sec on the same hardware.
What TurboQuant Does
Reduces KV-cache memory pressure during inference through advanced quantization techniques. Enables larger context windows and batch sizes on memory-constrained GPUs like the RTX 3090 with 24GB VRAM.
Final Thoughts: The Future of Local LLM Inference
The LocalLLaMA community has shown how far local LLM inference has come. People are now running models like Qwen3.6 27B as local coding agents, even on older GPUs with limited VRAM. Some are also running similar setups on Mac systems, and the results are genuinely impressive. After testing MTP, it is clear why there is so much excitement. With the same model and the same RTX 3090 setup, enabling Multi-Token Prediction improved generation speed from around 38 tokens/sec to 65 tokens/sec. That is almost a 2x speedup without upgrading the GPU or switching to a smaller model.
This guide focused on a simple and reproducible MTP setup using llama.cpp, but this feels like only the beginning. The next step is to experiment with better GGUF quantization, MTP, TurboQuant, and more tuned runtime settings to see how much further local inference speed can be pushed. The most exciting part is what this means for local coding agents. You can run powerful models on your own hardware, reduce cost per query, keep your code private, and use an AI coding assistant without depending entirely on internet-based APIs. Local LLMs are becoming faster, more practical, and much more useful than they were even a short time ago.
Frequently Asked Questions
Do I need a separate draft model for MTP with Qwen3.6?
No. With Qwen3.6-27B, MTP is built into the model itself through dedicated MTP heads, so no second draft model is required. This is the key advantage over traditional speculative decoding.
How much faster does MTP make the model on RTX 3090?
In our RunPod RTX 3090 setup, enabling MTP improved generation speed from ~38 tokens/sec to ~65 tokens/sec, a 1.71x speedup or ~71% higher throughput with no loss in output quality.
What is the difference between MTP and speculative decoding?
MTP in llama.cpp is a form of speculative decoding. Draft tokens from the model's own MTP heads are accepted only if they pass verification. The key difference is that no external draft model is needed.
Can I get even faster speeds beyond what MTP offers?
Yes. Combining MTP with TurboQuant, which reduces KV-cache memory pressure during inference, is the recommended next step for further speed gains, especially on memory-constrained GPUs like the RTX 3090.
What does the --spec-draft-n-max 3 flag do?
It sets the maximum number of draft tokens the model can attempt to predict per decoding step to 3. Higher values are more aggressive but may reduce the acceptance rate if predictions are less accurate.
Need Help Running Local LLMs?
Our AI engineers can help you set up optimized local inference pipelines, deploy models on your hardware, and integrate them into your applications. From MTP configuration to TurboQuant tuning, we have the expertise to maximize your LLM performance.
