
Google Relicenses Gemma 4 Under Apache 2.0 and Boosts Speed Threefold

Google moves its Gemma 4 open models to the Apache 2.0 license and introduces Multi‑Token Prediction, delivering up to three times faster token generation for local AI.

By Alex Mercer, Senior Tech Correspondent · 3 min read · US

Image: From a low-angle perspective, a person in a blue jacket holds a grey Pixel phone against a bright blue sky and white architectural beams.

TL;DR: Google has re‑released its Gemma 4 open models under the Apache 2.0 license and introduced Multi‑Token Prediction drafters that make token generation up to three times faster.

Google unveiled the Gemma 4 family this spring, positioning the models as high‑performance options for running AI locally. The release promised more power than earlier open models, allowing developers to keep data on‑device rather than in the cloud.

In a separate move, Google swapped the original custom license for the permissive Apache 2.0 license. Apache 2.0 lets anyone use, modify, and redistribute the code with minimal restrictions, a stark contrast to the earlier, more limiting terms.

The performance upgrade comes from Multi‑Token Prediction (MTP) drafters. Traditional large language models generate text one token at a time, and every new token requires a full pass through the network. MTP adds a lightweight “draft” model that speculatively predicts several future tokens, which the main model can then check in a single pass rather than producing them one by one. The drafter, only 74 million parameters in the Gemma 4 E2B version, shares the main model’s active memory cache and uses sparse decoding to narrow its token choices. This parallel work cuts idle cycles and speeds up output.
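For readers who want a concrete picture, here is a minimal, self-contained Python sketch of the speculative-decoding idea behind MTP drafters. The two toy functions stand in for the small drafter and the full model; they are invented placeholders for illustration, not Google's implementation.

```python
VOCAB = 32  # toy vocabulary size


def draft_next(token: int) -> int:
    """Toy stand-in for the small drafter: a cheap guess at the next token."""
    return (token * 7 + 3) % VOCAB


def main_next(token: int) -> int:
    """Toy stand-in for the full model: the 'correct' next token.

    It mostly agrees with the drafter, so most drafts get accepted."""
    return (token * 7 + 3) % VOCAB if token % 4 else (token + 11) % VOCAB


def speculative_step(last_token: int, k: int = 4) -> list[int]:
    """Draft k tokens cheaply, then verify them against the main model."""
    # 1. The drafter speculates k tokens ahead.
    drafts, tok = [], last_token
    for _ in range(k):
        tok = draft_next(tok)
        drafts.append(tok)

    # 2. The main model verifies the drafts; keep the longest correct
    #    prefix, then correct the first mismatch and stop.
    accepted, tok = [], last_token
    for d in drafts:
        target = main_next(tok)
        if d == target:
            accepted.append(d)
            tok = d
        else:
            accepted.append(target)
            break
    return accepted


if __name__ == "__main__":
    seq = [1]
    while len(seq) < 20:
        seq.extend(speculative_step(seq[-1]))
    print(seq)
```

In a real system the verification step runs as one batched forward pass of the main model, so every accepted draft token costs far less than generating it sequentially; that is the source of the claimed speed-up.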

Google reports that the new drafters deliver up to a three‑fold increase in token generation speed. The boost is most noticeable on consumer‑grade GPUs, where memory bandwidth limits often slow down full‑scale models. Quantizing the model—reducing numeric precision—further expands hardware compatibility.
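As a rough illustration of what quantization does, the NumPy sketch below compresses a weight matrix to int8 with a single scale factor. It uses no Gemma-specific tooling; the function names are invented for this example.

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: one byte per weight plus a
    single float scale, roughly a quarter of fp32 storage."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate fp32 weights when they are needed for compute."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())
```

Storing one byte per weight instead of four is what lets larger models fit into consumer GPU memory, at the cost of a small approximation error.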

The license change lowers barriers for researchers and startups to adopt Gemma 4 in products, open‑source tools, or academic work. Combined with the speed gains, the model becomes a more viable alternative to cloud‑only AI services, especially for edge devices that need low latency and data privacy.

What to watch next: how quickly the community adopts the Apache‑licensed Gemma 4 and whether other AI firms follow suit with similar permissive licensing and speculative decoding techniques.

