MTP speedup — llama.cpp benchmark

tokens/second per stream, measured on HF Inference Endpoints (single replica, --spec-draft-n-max 2)

Legend: base (no MTP) vs. MTP n=2

Each pair uses the same prompt, the same hardware, and the same llama.cpp build. Decode tok/s is measured server-side from the streamed completion, so the figure includes no client network overhead. Output was identical in distribution between variants: speculative decoding is mathematically lossless when the verifier uses proper rejection sampling. Numbers will vary with batch size, context length, and quantization.
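
The losslessness claim can be checked on a toy example. The sketch below is not llama.cpp's code. It assumes the standard speculative-sampling rule: keep a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q, where p is the target (verifier) distribution and q is the draft distribution. The function names and the three-token distributions are made up for illustration. Exact fractions are used so that the equality check involves no floating-point tolerance.

```python
from fractions import Fraction


def accept_prob(p_x, q_x):
    """Probability of keeping a drafted token x: min(1, p(x) / q(x))."""
    if q_x == 0:
        # The draft never proposes this token, so the value is irrelevant.
        return Fraction(0)
    return min(Fraction(1), p_x / q_x)


def residual(p, q):
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    diff = [max(Fraction(0), pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]


def output_distribution(p, q):
    """Exact distribution of the token emitted by one draft-then-verify step."""
    # Mass contributed by drafted tokens that pass verification.
    accepted = [qi * accept_prob(pi, qi) for pi, qi in zip(p, q)]
    p_reject = 1 - sum(accepted)
    if p_reject == 0:
        # Draft and target agree everywhere, so nothing is ever rejected.
        return accepted
    # Rejected mass is redistributed according to the residual distribution.
    return [a + p_reject * r for a, r in zip(accepted, residual(p, q))]


# Hypothetical 3-token vocabulary; these numbers are illustrative only.
target = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10)]
draft = [Fraction(2, 10), Fraction(6, 10), Fraction(2, 10)]

# The emitted-token distribution matches the target exactly.
assert output_distribution(target, draft) == target
```

The draft here is deliberately poor (it puts most of its mass on the second token), yet the emitted distribution still equals the target. Under this rule, draft quality affects how often tokens are accepted, and therefore the speedup, but not what is sampled.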

Decode throughput — Qwen3.6 on llama.cpp, MTP vs base