Quick answer

Multi-token prediction (speculative decoding) in llama.cpp drafts several tokens per step with a small draft model, then verifies them with the main model — cutting latency roughly 2x with no quality loss. Enable it with the --model-draft flag plus a compatible small model. The exact commands and tuning are below.

Related guides

What if you could make large language models run faster without upgrading your GPU, changing your machine, or switching to a smaller model? That is exactly what we tested in this complete tutorial using Multi-Token Prediction, or MTP. In our benchmark, the same Qwen3.6 27B model on the same RunPod RTX 3090 setup improved from 38 tokens/sec to 65 tokens/sec after enabling MTP. That is a 1.71x speedup, or around 71% higher throughput, with no visible loss in output quality. This step by step guide shows you how to replicate the same results.

What You'll Learn:

What Multi-Token Prediction is and how it differs from standard token-by-token generation
How to set up a RunPod RTX 3090 machine with proper configuration
How to clone and switch to the MTP branch of llama.cpp
How to build llama.cpp with full CUDA GPU support
How to download the Qwen3.6 27B MTP GGUF model
How to run baseline inference without MTP to measure starting speed
How to enable MTP with two command-line flags and retest
Real token/sec comparisons: 38 t/s vs 65 t/s on the same hardware
How to push speeds even further with TurboQuant optimization

What Is Multi-Token Prediction?

Most LLMs generate text one token at a time. The model predicts the next token, adds it to the context, and repeats the same process again. This is reliable, but it can be slow because each new token usually requires another decoding step. Multi-Token Prediction changes this by allowing the model to look ahead and propose multiple future tokens instead of only one. These proposed tokens are then checked by the main decoding process. If the predictions are correct, the model can accept several tokens in one go. If one token is wrong, the model falls back to the normal path from that point.

In practice, MTP works like a built-in drafting mechanism. The model drafts a few likely next tokens, verifies them, and keeps the valid ones. The more draft tokens that get accepted, the fewer full decoding steps are needed, which can increase tokens per second without changing the final output quality.

Without MTP

Generate token 1 → generate token 2 → generate token 3. Each token requires a full decoding step. The model moves forward one tiny step at a time, which is reliable but slow for large models on limited hardware.

With MTP

Draft multiple tokens → verify them → accept valid tokens together. The model safely jumps ahead whenever its draft predictions are correct, reducing the total number of decoding steps needed and increasing throughput.

In tools like llama.cpp and vLLM-style implementations, this is closely related to speculative decoding, where draft tokens are accepted only when they match the verifier's output. The key advantage of MTP in Qwen3.6 is that no separate draft model is needed — the MTP heads are built into the model itself.

Step by Step Guide: Setting Up MTP with llama.cpp

Set Up a RunPod RTX 3090 Machine

Create a new RunPod pod and select an RTX 3090 GPU. Before deploying, edit the template settings: increase the volume disk size to 100 GB, add an additional HTTP port 8910, and add an environment variable called HF_TOKEN set to your Hugging Face access token. The extra HTTP port lets you access the llama.cpp server and web UI from your browser. The Hugging Face token authenticates the download request and improves model download speed for large GGUF files. After updating the template, deploy the pod. Once running, wait for RunPod to give you access to the JupyterLab instance. Open JupyterLab, then launch a new terminal and install required system packages: apt update && apt install -y git cmake build-essential curl wget python3-pip

Clone and Switch to the MTP Branch

Move into the workspace directory with cd /workspace. Clone the llama.cpp repository: git clone --depth 1 https://github.com/ggml-org/llama.cpp.git then cd llama.cpp. The MTP changes are being tested through a dedicated llama.cpp pull request, so fetch and switch to that branch: git fetch origin pull/22673/head:mtp-pr then git checkout mtp-pr. This switches your local llama.cpp build to the MTP-enabled version for the rest of this guide.

Build llama.cpp with CUDA Support

Now that you are on the MTP-enabled branch, build llama.cpp with CUDA support so the model uses the RTX 3090 GPU instead of running inference on the CPU. Run the CMake build configuration: cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release. Then compile the two targets we need: cmake --build build --target llama-cli llama-server -j. This builds llama-cli for command-line tests and llama-server for launching an OpenAI-compatible server with browser access. Once the build is complete, copy the binary: cp ./build/bin/llama-server ./llama-server

Download the Qwen3.6-27B-MTP Model

Install the Hugging Face download tools: pip install -U "huggingface_hub[hf_xet]" hf-xet hf_transfer. Enable faster downloads: export HF_HUB_ENABLE_HF_TRANSFER=1. Create a dedicated directory: mkdir -p /workspace/models/qwen3.6-mtp. Download the model: hf download froggeric/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-Q4_K_M-mtp.gguf --local-dir /workspace/models/qwen3.6-mtp. This is the model we will run first without MTP, then again with MTP enabled to compare the speed difference.

Run Qwen3.6-27B Without MTP (Baseline)

Move into the llama.cpp directory: cd /workspace/llama.cpp. Start the server without MTP to get a clean baseline. Use the same model, same GPU, same context size, and same server settings. The only change in the next step will be enabling MTP. Start the server with: ./llama-server -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf" --alias qwen3.6-27b-no-mtp --host 0.0.0.0 --port 8910 -ngl 99 -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -np 1 -b 2048 -ub 512 -t 8 -fa on --temp 0.7 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 --metrics. Once the server is ready, go to your RunPod dashboard and click the link for port 8910 to open the llama.cpp web UI. In our baseline test, the model generated responses at around 38.86 tokens/sec without MTP.

Run Qwen3.6-27B With MTP Enabled

Stop the server with CTRL + C. We are not changing the model, GPU, quantization, or most runtime settings. We are only adding two MTP-related flags: --spec-type mtp and --spec-draft-n-max 3. The first flag tells llama.cpp to use MTP-style speculative decoding. The second flag sets the maximum number of draft tokens to 3. Start the server again with MTP: ./llama-server -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf" --alias qwen3.6-27b-mtp --host 0.0.0.0 --port 8910 -ngl 99 -c 100000 --spec-type mtp --spec-draft-n-max 3 --cache-type-k q8_0 --cache-type-v q8_0 -np 1 -b 2048 -ub 512 -t 8 -fa on --temp 0.7 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 --metrics. Refresh the browser page and test with the same prompts. With MTP enabled, speed increased to 65-67 tokens/sec for simple prompts and 56-61 tokens/sec for complex prompts.

Compare Results and Consider TurboQuant

Enabling MTP improved Qwen3.6 27B from around 38 tokens/sec to 65 tokens/sec on the RunPod RTX 3090 setup. That gives a 1.71x speedup, or around 71% higher throughput, without changing the hardware or switching to a smaller model. For further optimization, the next step is to explore MTP combined with TurboQuant, which reduces KV-cache memory pressure during inference. This is especially useful for larger models, long-context prompts, and GPUs like the RTX 3090 where memory bandwidth and VRAM can become limiting factors.

Complete Server Command Reference

Here are the exact server commands for both baseline and MTP-enabled runs. The only difference between the two is the addition of --spec-type mtp and --spec-draft-n-max 3 in the MTP version.

Baseline Server Command (No MTP)

./llama-server   -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf"   --alias qwen3.6-27b-no-mtp   --host 0.0.0.0   --port 8910   -ngl 99   -c 100000   --cache-type-k q8_0   --cache-type-v q8_0   -np 1   -b 2048   -ub 512   -t 8   -fa on   --temp 0.7   --top-k 20   --top-p 0.95   --repeat-penalty 1.1   --metrics

MTP-Enabled Server Command

./llama-server   -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf"   --alias qwen3.6-27b-mtp   --host 0.0.0.0   --port 8910   -ngl 99   -c 100000   --spec-type mtp   --spec-draft-n-max 3   --cache-type-k q8_0   --cache-type-v q8_0   -np 1   -b 2048   -ub 512   -t 8   -fa on   --temp 0.7   --top-k 20   --top-p 0.95   --repeat-penalty 1.1   --metrics

Benchmark Results: MTP vs Non-MTP

Here is the complete comparison of token generation speeds across different prompt types. All tests used the same Qwen3.6 27B Q4_K_M model, same RTX 3090 GPU, same context size (100K), and same server configuration.

Test Type	Without MTP	With MTP	Improvement
Simple prompt (greeting)	38.86 t/s	65-67 t/s	~71% faster
Complex prompt (Python game)	~38 t/s	56-61 t/s	~50% faster
Overall average	~38 t/s	~65 t/s	1.71x speedup

Why Complex Prompts Show Lower MTP Gains

For more complex prompts like code generation, the model's draft predictions are less likely to be correct because the output is more varied and creative. MTP works best when the model can confidently predict the next tokens — which happens more often with straightforward responses. Even so, 56-61 tokens/sec is still a strong result for a 27B model on an RTX 3090, and significantly faster than the non-MTP baseline.

Key Command-Line Flags Explained

Understanding what each flag does helps you tune the server for your specific hardware and use case. Here is a breakdown of the most important parameters used in this tutorial.

Flag	Value	Purpose
`-ngl`	99	Number of layers to offload to GPU. 99 offloads all layers.
`-c`	100000	Context size in tokens. Supports long documents and conversations.
`--cache-type-k`	q8_0	KV cache quantization for K values. q8_0 balances quality and VRAM.
`--cache-type-v`	q8_0	KV cache quantization for V values. Matches K cache for consistency.
`--spec-type`	mtp	Enables Multi-Token Prediction speculative decoding mode.
`--spec-draft-n-max`	3	Maximum draft tokens to predict per step. Higher = more aggressive.
`-fa`	on	Enables Flash Attention for faster attention computation on GPU.
`--temp`	0.7	Sampling temperature. Lower = more deterministic, higher = more creative.

Next Steps: TurboQuant for Further Optimization

The benchmark in this guide uses the original llama.cpp MTP setup, without adding TurboQuant, custom patches, or other runtime-level optimizations. This keeps the test simple, reproducible, and focused on the speed gain from enabling MTP alone. To push performance further, the next optimization to explore is MTP and TurboQuant together. MTP improves throughput by allowing the model to accept multiple predicted tokens, while TurboQuant helps reduce KV-cache memory pressure during inference. This can be especially useful for larger models, long-context prompts, and GPUs like the RTX 3090, where memory bandwidth and VRAM can become limiting factors.

Why Some Community Results Show Higher Speeds

Some r/LocalLLaMA community results report higher tokens/sec than this guide. Those setups often combine MTP with TurboQuant, patched builds, different KV-cache settings, or faster GPUs. Since this tutorial focuses on a clean MTP-only benchmark, TurboQuant should be treated as the recommended next experiment rather than part of the current setup.

What MTP Does

Improves throughput by allowing the model to draft and accept multiple tokens per decoding step. Reduces the total number of forward passes needed, directly increasing tokens/sec on the same hardware.

What TurboQuant Does

Reduces KV-cache memory pressure during inference through advanced quantization techniques. Enables larger context windows and batch sizes on memory-constrained GPUs like the RTX 3090 with 24GB VRAM.

Final Thoughts: The Future of Local LLM Inference

The LocalLLaMA community has shown how far local LLM inference has come. People are now running models like Qwen3.6 27B as local coding agents, even on older GPUs with limited VRAM. Some are also running similar setups on Mac systems, and the results are genuinely impressive. After testing MTP, it is clear why there is so much excitement. With the same model and the same RTX 3090 setup, enabling Multi-Token Prediction improved generation speed from around 38 tokens/sec to 65 tokens/sec. That is almost a 2x speedup without upgrading the GPU or switching to a smaller model.

This guide focused on a simple and reproducible MTP setup using llama.cpp, but this feels like only the beginning. The next step is to experiment with better GGUF quantization, MTP, TurboQuant, and more tuned runtime settings to see how much further local inference speed can be pushed. The most exciting part is what this means for local coding agents. You can run powerful models on your own hardware, reduce cost per query, keep your code private, and use an AI coding assistant without depending entirely on internet-based APIs. Local LLMs are becoming faster, more practical, and much more useful than they were even a short time ago.

Frequently Asked Questions

Do I need a separate draft model for MTP with Qwen3.6?

No. With Qwen3.6-27B, MTP is built into the model itself through dedicated MTP heads, so no second draft model is required. This is the key advantage over traditional speculative decoding.

How much faster does MTP make the model on RTX 3090?

In our RunPod RTX 3090 setup, enabling MTP improved generation speed from ~38 tokens/sec to ~65 tokens/sec, a 1.71x speedup or ~71% higher throughput with no loss in output quality.

What is the difference between MTP and speculative decoding?

MTP in llama.cpp is a form of speculative decoding. Draft tokens from the model's own MTP heads are accepted only if they pass verification. The key difference is that no external draft model is needed.

Can I get even faster speeds beyond what MTP offers?

Yes. Combining MTP with TurboQuant, which reduces KV-cache memory pressure during inference, is the recommended next step for further speed gains, especially on memory-constrained GPUs like the RTX 3090.

What does the --spec-draft-n-max 3 flag do?

It sets the maximum number of draft tokens the model can attempt to predict per decoding step to 3. Higher values are more aggressive but may reduce the acceptance rate if predictions are less accurate.

Need Help Running Local LLMs?

Our AI engineers can help you set up optimized local inference pipelines, deploy models on your hardware, and integrate them into your applications. From MTP configuration to TurboQuant tuning, we have the expertise to maximize your LLM performance.

Quick answer

Related guides

What You'll Learn:

What Multi-Token Prediction is and how it differs from standard token-by-token generation
How to set up a RunPod RTX 3090 machine with proper configuration
How to clone and switch to the MTP branch of llama.cpp
How to build llama.cpp with full CUDA GPU support
How to download the Qwen3.6 27B MTP GGUF model
How to run baseline inference without MTP to measure starting speed
How to enable MTP with two command-line flags and retest
Real token/sec comparisons: 38 t/s vs 65 t/s on the same hardware
How to push speeds even further with TurboQuant optimization

What Is Multi-Token Prediction?

Without MTP

With MTP

Step by Step Guide: Setting Up MTP with llama.cpp

Set Up a RunPod RTX 3090 Machine

Clone and Switch to the MTP Branch

Build llama.cpp with CUDA Support

Download the Qwen3.6-27B-MTP Model

Run Qwen3.6-27B Without MTP (Baseline)

Run Qwen3.6-27B With MTP Enabled

Compare Results and Consider TurboQuant

Complete Server Command Reference

Here are the exact server commands for both baseline and MTP-enabled runs. The only difference between the two is the addition of --spec-type mtp and --spec-draft-n-max 3 in the MTP version.

Baseline Server Command (No MTP)

./llama-server   -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf"   --alias qwen3.6-27b-no-mtp   --host 0.0.0.0   --port 8910   -ngl 99   -c 100000   --cache-type-k q8_0   --cache-type-v q8_0   -np 1   -b 2048   -ub 512   -t 8   -fa on   --temp 0.7   --top-k 20   --top-p 0.95   --repeat-penalty 1.1   --metrics

MTP-Enabled Server Command

./llama-server   -m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf"   --alias qwen3.6-27b-mtp   --host 0.0.0.0   --port 8910   -ngl 99   -c 100000   --spec-type mtp   --spec-draft-n-max 3   --cache-type-k q8_0   --cache-type-v q8_0   -np 1   -b 2048   -ub 512   -t 8   -fa on   --temp 0.7   --top-k 20   --top-p 0.95   --repeat-penalty 1.1   --metrics

Benchmark Results: MTP vs Non-MTP

Test Type	Without MTP	With MTP	Improvement
Simple prompt (greeting)	38.86 t/s	65-67 t/s	~71% faster
Complex prompt (Python game)	~38 t/s	56-61 t/s	~50% faster
Overall average	~38 t/s	~65 t/s	1.71x speedup

Why Complex Prompts Show Lower MTP Gains

Key Command-Line Flags Explained

Understanding what each flag does helps you tune the server for your specific hardware and use case. Here is a breakdown of the most important parameters used in this tutorial.

Flag	Value	Purpose
`-ngl`	99	Number of layers to offload to GPU. 99 offloads all layers.
`-c`	100000	Context size in tokens. Supports long documents and conversations.
`--cache-type-k`	q8_0	KV cache quantization for K values. q8_0 balances quality and VRAM.
`--cache-type-v`	q8_0	KV cache quantization for V values. Matches K cache for consistency.
`--spec-type`	mtp	Enables Multi-Token Prediction speculative decoding mode.
`--spec-draft-n-max`	3	Maximum draft tokens to predict per step. Higher = more aggressive.
`-fa`	on	Enables Flash Attention for faster attention computation on GPU.
`--temp`	0.7	Sampling temperature. Lower = more deterministic, higher = more creative.

Next Steps: TurboQuant for Further Optimization

Why Some Community Results Show Higher Speeds

What MTP Does

What TurboQuant Does

Reduces KV-cache memory pressure during inference through advanced quantization techniques. Enables larger context windows and batch sizes on memory-constrained GPUs like the RTX 3090 with 24GB VRAM.

Final Thoughts: The Future of Local LLM Inference

Frequently Asked Questions

Do I need a separate draft model for MTP with Qwen3.6?

No. With Qwen3.6-27B, MTP is built into the model itself through dedicated MTP heads, so no second draft model is required. This is the key advantage over traditional speculative decoding.

How much faster does MTP make the model on RTX 3090?

In our RunPod RTX 3090 setup, enabling MTP improved generation speed from ~38 tokens/sec to ~65 tokens/sec, a 1.71x speedup or ~71% higher throughput with no loss in output quality.

How to Enable Multi-Token Prediction in llama.cpp: Complete Tutorial

What Is Multi-Token Prediction?

Without MTP

With MTP

Step by Step Guide: Setting Up MTP with llama.cpp

Set Up a RunPod RTX 3090 Machine

Clone and Switch to the MTP Branch

Build llama.cpp with CUDA Support

Download the Qwen3.6-27B-MTP Model

Run Qwen3.6-27B Without MTP (Baseline)

Run Qwen3.6-27B With MTP Enabled

Compare Results and Consider TurboQuant

Complete Server Command Reference

Benchmark Results: MTP vs Non-MTP

Key Command-Line Flags Explained

Next Steps: TurboQuant for Further Optimization

What MTP Does

What TurboQuant Does

Final Thoughts: The Future of Local LLM Inference

Frequently Asked Questions

Do I need a separate draft model for MTP with Qwen3.6?

How much faster does MTP make the model on RTX 3090?

What is the difference between MTP and speculative decoding?

Can I get even faster speeds beyond what MTP offers?

What does the --spec-draft-n-max 3 flag do?

Need Help Running Local LLMs?

Need this implemented in your project?

Take the guide with you

Book a 30-min architecture call

Get a free 48-hour written brief

How to Enable Multi-Token Prediction in llama.cpp: Complete Tutorial

What Is Multi-Token Prediction?

Without MTP

With MTP

Step by Step Guide: Setting Up MTP with llama.cpp

Set Up a RunPod RTX 3090 Machine

Clone and Switch to the MTP Branch

Build llama.cpp with CUDA Support

Download the Qwen3.6-27B-MTP Model

Run Qwen3.6-27B Without MTP (Baseline)

Run Qwen3.6-27B With MTP Enabled

Compare Results and Consider TurboQuant

Complete Server Command Reference

Benchmark Results: MTP vs Non-MTP

Key Command-Line Flags Explained

Next Steps: TurboQuant for Further Optimization

What MTP Does

What TurboQuant Does

Final Thoughts: The Future of Local LLM Inference

Frequently Asked Questions

Do I need a separate draft model for MTP with Qwen3.6?

How much faster does MTP make the model on RTX 3090?

What is the difference between MTP and speculative decoding?

Can I get even faster speeds beyond what MTP offers?

What does the --spec-draft-n-max 3 flag do?

Need Help Running Local LLMs?

Need this implemented in your project?

Take the guide with you

Book a 30-min architecture call

Get a free 48-hour written brief