Tech · 2 hrs ago

Google Relicenses Gemma 4 and Boosts Speed Threefold with Speculative Decoding

Google's Gemma 4 models now run three times faster using speculative decoding and are released under the Apache 2.0 license, enabling broader edge AI use.

By Alex Mercer, Senior Tech Correspondent · 3 min read · US

From a low-angle perspective, a person in a blue jacket holds a grey Pixel phone. A bright blue sky and white architectural beams fill the background.


TL;DR: Google’s Gemma 4 models run three times faster with speculative decoding and are now offered under the permissive Apache 2.0 license.

Google unveiled Gemma 4 earlier this year as a locally runnable alternative to its cloud‑based Gemini AI. The new release adds Multi‑Token Prediction (MTP) drafters that guess upcoming tokens, cutting generation time dramatically.

Key facts

- Speculative decoding, the technique behind MTP, predicts future tokens and lets the main model skip redundant work, delivering a threefold speed increase.
- The largest Gemma 4 model runs at full precision on a single high-power AI accelerator; after quantization (reducing numeric precision), it can operate on a typical consumer GPU.
- Google switched Gemma 4’s license from a custom agreement to Apache 2.0, a widely used open-source license that permits commercial use and modification without restrictive clauses.
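The quantization mentioned above trades numeric precision for memory. Google has not published Gemma 4's exact scheme, but a minimal sketch of the general idea (per-tensor int8 quantization, with hypothetical helper names) looks like this:

```python
# Illustrative post-training int8 quantization: store each weight as a
# 1-byte integer plus one shared float scale, instead of a 4-byte float.

def quantize_int8(weights):
    """Map float weights onto the int8 range [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.89]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

Storing 1 byte per weight instead of 4 cuts memory roughly fourfold, which is what lets a model sized for a data-center accelerator fit on a consumer GPU.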

Gemma 4 builds on the same architecture that powers Gemini, but it is tuned for edge devices. Traditional language models generate text token by token, each step requiring the same amount of computation regardless of the token’s importance. This creates a bottleneck on consumer hardware, where memory bandwidth is far lower than the high‑bandwidth memory in data‑center TPUs.
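The token-by-token bottleneck described above can be sketched as a plain autoregressive loop, where every generated token costs one full forward pass through the model (a toy stand-in function here, not Gemma's real network):

```python
def forward(context):
    """Toy stand-in for a full model forward pass: the 'next token' is
    the sum of the context modulo 100. A real pass touches every weight."""
    return sum(context) % 100

def generate(prompt, n_tokens):
    """Standard autoregressive decoding: one expensive pass per token."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(forward(tokens))  # same cost for every token
    return tokens
```

On consumer hardware the dominant cost of each `forward` call is streaming the weights through limited memory bandwidth, and this loop repeats that streaming once per token.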

MTP addresses the bottleneck by running a lightweight drafter model—only 74 million parameters for the E2B variant—during idle cycles. The drafter shares the key‑value cache, the active memory that stores context, so it does not recompute information already processed by the main model. Sparse decoding further narrows token candidates, allowing the drafter to produce speculative tokens that the main model can verify or discard.
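The draft-and-verify idea behind this can be sketched with the same toy functions (this is a simplified illustration, not Gemma 4's actual MTP implementation): a cheap drafter proposes several tokens ahead, then the main model checks them and keeps the longest agreeing prefix. In a real system the verification of all draft positions happens in one batched forward pass, which is where the speedup comes from.

```python
def main_model(context):
    """Expensive 'ground truth' next-token function (toy)."""
    return sum(context) % 100

def drafter(context):
    """Cheap drafter that agrees with the main model most of the time;
    here it diverges once the context sum reaches 100 (illustrative)."""
    return sum(context) % 100 if sum(context) < 100 else 0

def speculative_step(context, k=4):
    """Draft k tokens, then verify; return the tokens actually accepted."""
    # 1. Drafter proposes k tokens sequentially (cheap passes).
    draft, ctx = [], list(context)
    for _ in range(k):
        t = drafter(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Verify the drafts (batched in a real system; a loop here).
    accepted, ctx = [], list(context)
    for t in draft:
        truth = main_model(ctx)
        if truth == t:
            accepted.append(t)       # draft matches: keep it for free
            ctx.append(t)
        else:
            accepted.append(truth)   # mismatch: keep the correction, stop
            break
    return accepted
```

When the drafter is right, one verification pass yields up to `k` tokens instead of one; when it is wrong, the output is still exactly what the main model would have produced, so quality is unchanged.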

The licensing change to Apache 2.0 removes barriers for developers who want to integrate Gemma 4 into products, customize it, or redistribute it. Combined with the speed boost, the models become more viable for on‑device AI tasks such as real‑time translation, code assistance, or personalized assistants that keep data local.

What it means

The threefold acceleration narrows the performance gap between cloud AI services and on-device inference, making privacy-preserving AI more practical. The permissive license encourages broader adoption and community contributions, potentially accelerating innovation around edge AI.

Watch for benchmark releases that compare Gemma 4’s performance on consumer GPUs versus cloud‑based alternatives, and for third‑party tools that leverage the new Apache 2.0 license to build commercial products.
