MTP for Qwen3.6-35B-A3B: Multi-Token Prediction and Faster Local Inference

Author Info

James Hayes

Cloud & MLOps Staff Writer

AWS Solutions Architect Professional; ex-platform engineer at a Series C AI startup

James documents how teams ship models to production: inference stacks, observability, cost controls, and incident response. He reproduces deployment patterns in sandbox environments when feasible and labels what was not independently verified. Readers rely on his work for practical checklists and version-specific caveats.

#MLOps #Inference Infrastructure #Cost Optimization #Reliability Engineering

Full author profile →

If you are running Qwen3.6-35B-A3B locally, the easiest free speed boost is often MTP (Multi-Token Prediction)—speculative decoding built into the model weights, not a separate draft model.

This guide explains what MTP does, when it helps, and the exact SGLang settings that pair with the 35B-A3B MoE checkpoint.

What Is MTP?

Multi-Token Prediction trains the model to guess several future tokens at once. At inference time, a lightweight draft head proposes multiple candidate tokens; the main model verifies them in parallel. Accepted drafts become output without running a full decode step for every single token.

That is different from classic autoregressive decoding, which generates one token per forward pass.

MTP became widely known through DeepSeek-V3 and similar architectures. Qwen3.6 ships dedicated MTP weights (mtp.safetensors) for both the 35B-A3B MoE and 27B dense variants, so you can turn on speculative decoding without downloading a second model.

Why It Matters for Qwen3.6-35B-A3B

Qwen3.6-35B-A3B already activates only ~3B parameters per step (MoE). MTP adds another layer of efficiency on top:

Without MTPWith MTP (typical)
1 token verified per decode stepSeveral tokens accepted per step when drafts match
Lower VRAM overheadSlightly higher VRAM for draft buffers
Predictable latencyBest for interactive chat, agents, and coding loops
Simplest configRequires correct SGLang flags + env var

For agentic coding or chat, MTP often cuts time-to-first-token clusters and improves tokens per second at batch size 1—the case most home labs care about.

Companion guides: Full 35B-A3B deployment · 27B vs 35B-A3B choice

How Speculative Decoding Works (30-Second Version)

  1. Draft: MTP head proposes up to N tokens ahead.
  2. Verify: The main model checks those proposals in one batched pass.
  3. Accept: Matching prefixes are emitted; mismatches are discarded and decoding continues from the first error.

SGLang exposes this through --speculative-algorithm NEXTN for Qwen3.6 (NEXTN is the integrated MTP path—do not point at a separate draft checkpoint).

MTP for Qwen3.6-35B-A3B — SGLang production path

1. Set the environment variable (required)

Without this, SGLang may treat MTP incorrectly and attempt to load a second full model as a draft—often causing OOM:

export SGLANG_ENABLE_SPEC_V2=1

2. Launch Qwen3.6-35B-A3B with NEXTN

uv pip install "sglang[all]"

SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tp-size 8 \
  --reasoning-parser qwen3 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

For FP8 weights on a single large GPU:

SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8000 \
  --tp-size 1 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.8

SGLang install step for Qwen3.6 local serving

Flag cheat sheet

FlagTypical valueRole
--speculative-algorithmNEXTNUse built-in Qwen3.6 MTP head
--speculative-num-steps3Draft depth; higher = more aggressive
--speculative-eagle-topk1Branching factor for draft candidates
--speculative-num-draft-tokens4Max draft tokens per verification round
--reasoning-parserqwen3Split thinking vs answer for hybrid reasoning

Tuning Recipes (Latency vs Throughput)

Think in terms of batch size and acceptance rate:

ProfileStepsDraft tokensBest for
Low latency34Single-user chat, coding agents (bs=1)
Balanced12Small multi-user or moderate batch
Max throughputoffSaturated server where verify cost dominates

If acceptance is low (random or highly creative text), MTP adds verify overhead without much gain—turn it off for that workload.

When Not to Use MTP

  • VRAM is already tight: Draft buffers need headroom; prefer FP8 or lower --mem-fraction-static first.
  • Very long context at high batch: Verify passes scale with running requests; profile before enabling aggressive steps.
  • llama.cpp / GGUF local runs: MTP/NEXTN is a server-side SGLang (or vLLM) feature today; Unsloth GGUF paths in our deployment guide do not expose the same MTP switch.
  • You forgot SGLANG_ENABLE_SPEC_V2: Fix the env var before chasing OOM errors.

vLLM and Other Engines

vLLM has been adding MTP-style speculative decoding for Qwen-family models via JSON config (e.g. method: mtp, num_speculative_tokens: 2). Ecosystem support moves quickly—check the vLLM release notes for your exact Qwen3.6 checkpoint before production.

For day-zero Qwen3.6 features (thinking parser, tool calls, NEXTN), SGLang remains the reference path in May 2026.

Quick Checklist

  1. Deploy Qwen3.6-35B-A3B with enough TP / FP8 headroom.
  2. Export SGLANG_ENABLE_SPEC_V2=1.
  3. Add --speculative-algorithm NEXTN and start with steps=3, draft-tokens=4.
  4. Benchmark tokens/s at batch size 1 with thinking on/off.
  5. If VRAM spikes, reduce draft tokens or disable MTP before reducing context.

References

  1. Qwen3.6 on SGLang Cookbook
  2. DeepSeek-V3 — Multi-Token Prediction (arXiv:2412.19437)
  3. Qwen3.6-35B-A3B on Hugging Face

Comments