While running an AI model on a personal computer offers privacy and eliminates subscription fees, it often suffers from slow inference. Standard models generate text one token at a time, and every step shuttles billions of parameters between memory and compute units, an inherently slow, memory-bound process. To work around this, many users turn to smaller or compressed models, sacrificing quality.
Google has now introduced an alternative: Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open models. This speculative decoding method promises speedups of up to three times without affecting the model's output quality or reasoning capabilities. The technique pairs the main model with a small, inexpensive 'drafter' model that proposes several tokens at once; the main model then verifies the whole batch in a single pass, accepting every correct guess.
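To make the draft-and-verify loop concrete, here is a minimal greedy-decoding sketch in plain Python. This is illustrative only, not Google's MTP implementation; the function name and toy model interface are invented for the example.

```python
# Toy sketch of speculative decoding under greedy decoding.
# A cheap "drafter" proposes k tokens; the "target" model verifies
# them and keeps the longest matching prefix, then appends one
# token of its own, so each round yields at least one token.

def greedy_speculative_step(target, drafter, context, k=4):
    """One draft-then-verify round.

    target, drafter: functions mapping a token sequence to the next token.
    Returns the list of tokens accepted this round (length 1 to k+1).
    """
    # 1. Drafter proposes k tokens autoregressively (cheap calls).
    draft = []
    ctx = list(context)
    for _ in range(k):
        tok = drafter(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Target checks each drafted position. In a real system this is
    #    one batched forward pass, not k sequential calls.
    accepted = []
    ctx = list(context)
    for tok in draft:
        if target(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break  # first mismatch invalidates the rest of the draft

    # 3. Target always contributes one token of its own, guaranteeing
    #    progress even when every drafted token is rejected.
    accepted.append(target(ctx))
    return accepted
```

When the drafter agrees with the target, a round yields k+1 tokens for a single verification pass; when it never agrees, the loop degrades to ordinary one-token-at-a-time decoding, which is why quality is unaffected.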
The underlying concept, first introduced by Google researchers in 2022, didn't gain traction until now because of architectural hurdles to deploying it at scale. Gemma 4's drafter models reuse the same KV cache as the target model, avoiding redundant computation and making the approach viable even on consumer hardware such as phones and clustered Raspberry Pi units.
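A KV cache stores the attention keys and values already computed for past tokens, so each new step only computes projections for the newest token. The sketch below is a generic single-head attention cache in plain Python, not Gemma's actual implementation; it shows the shared data structure that both the target and a cache-sharing drafter would read from.

```python
import math

class KVCache:
    """Append-only per-position key/value store. Sharing one instance
    between target and drafter avoids recomputing past projections."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def attend(query, cache):
    """Single-head scaled dot-product attention over cached entries."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in cache.keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of cached values.
    dim = len(cache.values[0])
    return [sum(w * v[i] for w, v in zip(weights, cache.values))
            for i in range(dim)]
```

Because the drafter reads the same cached keys and values the target has already written, verification never repeats work done during drafting.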
Other approaches, such as diffusion-based language models, have tried to parallelize text generation but have struggled to match the quality of traditional transformers. Speculative decoding stands out because it optimizes serving rather than altering the model architecture itself, allowing existing Gemma 4 models to run faster unchanged.
Google’s benchmarks reveal significant performance improvements: a Gemma 4 26B model on an Nvidia RTX Pro 6000 GPU achieves nearly double the tokens per second with MTP drafters enabled. On Apple Silicon, speedups of around 2.2x are observed with batch sizes of 4 to 8 requests.
Efficiency advances like these can have outsized impact: when DeepSeek introduced a more efficient model training approach in January 2025, Nvidia's market value took a significant hit. Google's MTP drafters reflect the same theme of smarter software over raw hardware, here aimed squarely at consumer applications.
This shift also changes the industry's balance between inference, training, and memory: faster decoding means more responsive real-time chat, voice applications, and agentic workflows without added latency. Potential applications include local coding assistants, low-latency voice interfaces, and responsive agents, all achievable on existing hardware.
The MTP drafters are now accessible via Hugging Face, Kaggle, and Ollama under the Apache 2.0 license, compatible with vLLM, MLX, SGLang, and Hugging Face Transformers.
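In Hugging Face Transformers, speculative decoding is exposed through assisted generation: a drafter is passed to `generate()` via the `assistant_model` argument. The sketch below uses hypothetical checkpoint names (the actual Gemma 4 and MTP drafter repo IDs may differ) and requires downloading the models to run.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo IDs below are placeholders; substitute the actual Gemma 4 and
# MTP drafter checkpoints published on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26b")  # hypothetical ID
target = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26b", device_map="auto")  # hypothetical ID
drafter = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26b-mtp-drafter", device_map="auto")  # hypothetical ID

inputs = tokenizer("Explain speculative decoding in one paragraph.",
                   return_tensors="pt").to(target.device)

# assistant_model enables assisted (speculative) generation: the drafter
# proposes tokens and the target verifies them in batched forward passes.
outputs = target.generate(**inputs, assistant_model=drafter,
                          max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The generated text is identical to what the target model alone would produce under the same decoding settings; only the wall-clock time changes.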