Decode throughput — Qwen3.6 on llama.cpp, MTP vs base
tokens/second per stream, measured on HF Inference Endpoints (single replica, --spec-draft-n-max 2)
Legend: base (no MTP), MTP n=2
Each pair used the same prompt, the same hardware, and the same llama.cpp build. Decode tok/s was measured server-side from the streamed completion, so the figure excludes client network overhead. Output was identical in distribution between variants: speculative decoding is mathematically lossless when the verifier uses proper rejection sampling. Numbers will vary with batch size, context length, and quantization.
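To make the losslessness claim concrete, here is a minimal single-token sketch of the standard speculative rejection-sampling rule. It is a toy illustration, not llama.cpp's implementation: the function names `speculative_sample` and `empirical` and the example distributions are hypothetical. A draft distribution q proposes a token, the target distribution p accepts it with probability min(1, p(x)/q(x)), and a rejection triggers a resample from the normalized residual max(0, p - q).

```python
import random


def speculative_sample(p_target, q_draft, rng):
    """Draw one token: propose from the draft q, then accept or resample."""
    tokens = list(range(len(p_target)))
    # The draft proposes a token x ~ q.
    x = rng.choices(tokens, weights=q_draft, k=1)[0]
    # The verifier accepts x with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    # On rejection, resample from the residual max(0, p - q).
    # rng.choices normalizes the weights itself.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual, k=1)[0]


def empirical(p_target, q_draft, n=100_000, seed=0):
    """Estimate the output distribution of speculative_sample by simulation."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        counts[speculative_sample(p_target, q_draft, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # hypothetical target (verifier) distribution
    q = [0.2, 0.5, 0.3]  # hypothetical draft distribution, deliberately poor
    print(empirical(p, q))  # frequencies should track p, not q
```

Even with a draft that disagrees badly with the target, the sampled frequencies should follow p. A poor draft lowers the acceptance rate and therefore the speedup, but not the output distribution.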