Google DeepMind's RecurrentGemma outperforms Transformer models

RecurrentGemma

Google DeepMind published a research paper proposing a language model called RecurrentGemma that can match or exceed the performance of transformer-based models while being more memory efficient, offering the promise of high language model performance in environments with limited resources.

The research paper provides a brief overview:

“Introducing RecurrentGemma, an open language model that uses Google’s novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent language performance. It has a fixed-size state, which reduces memory usage and allows for efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters and an instruction-tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.”

Connection with Gemma

Gemma is an open model that uses Google’s top-notch Gemini technology but is lightweight enough to run on laptops and mobile devices. Like Gemma, RecurrentGemma can also work in resource-constrained environments. Other similarities between Gemma and RecurrentGemma are in the pre-training data, instruction tuning, and RLHF (Reinforcement Learning From Human Feedback). RLHF is a technique for generative AI that uses human preference feedback to steer a model toward producing more useful responses.

Griffin Architecture

The new model is based on a hybrid architecture called Griffin that was announced a few months earlier. Griffin is called a “hybrid” model because it uses two kinds of technologies: one that allows it to efficiently handle long sequences of information, while the other allows it to focus on the most recent parts of the input. This gives it the ability to process “significantly” more data (increased throughput) in the same amount of time as transformer-based models, while also reducing wait time (latency).
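To make the “hybrid” idea more concrete, here is a deliberately simplified, scalar sketch of the two mechanisms; the decay constant, window size, and simple averaging are illustrative assumptions, not the Griffin architecture’s actual equations.

```python
# Toy, scalar illustration of the two ideas Griffin combines.
# The gating form, decay value, and window size are assumptions for illustration,
# not DeepMind's actual implementation.

def linear_recurrence(tokens, decay=0.9):
    """Fixed-size state: each step folds the new token into one running value."""
    state = 0.0
    for x in tokens:
        state = decay * state + (1.0 - decay) * x  # memory use never grows with length
    return state

def local_attention(tokens, window=4):
    """Only the most recent `window` tokens are considered (here: simply averaged)."""
    recent = tokens[-window:]
    return sum(recent) / len(recent)

tokens = [0.1, 0.4, 0.3, 0.9, 0.2, 0.7]
print(linear_recurrence(tokens))  # summary of the whole sequence in constant memory
print(local_attention(tokens))    # focus on the most recent inputs
```

The point of the pairing is that the recurrence keeps a compact summary of everything seen so far, while local attention keeps fine detail about the most recent tokens, and neither requires storage that grows with the full sequence length.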

The Griffin research paper proposed two models, one called Hawk and the other called Griffin. The paper explains why this is a breakthrough:

“…we empirically validate the inference-time advantages of Hawk and Griffin and observe reduced latency and significantly increased throughput compared to our Transformer baselines. Finally, Hawk and Griffin exhibit the ability to extrapolate on longer sequences than they have been trained on and are able to efficiently learn to copy and retrieve data over long horizons. These findings strongly suggest that our proposed models offer a powerful and efficient alternative to transformers with global attention.”

The difference between Griffin and RecurrentGemma is a modification to how the model processes the input data (the input embeddings).

Advances

The research paper claims that RecurrentGemma provides similar or better performance than the more conventional Gemma-2B transformer model (which was trained on 3 trillion tokens, versus 2 trillion for RecurrentGemma). This is part of the reason the research paper is titled “Moving Past Transformers,” because it shows a way to achieve equal or better performance without the high resource overhead of the transformer architecture.

Another win over transformer models is reduced memory usage and faster processing times. The research paper explains:

“A key advantage of RecurrentGemma is that it has a significantly smaller state size than transformers on long sequences. While Gemma’s KV cache grows proportionally to the length of the sequence, RecurrentGemma’s state is bounded and does not increase on sequences longer than the local attention window size of 2k tokens. Consequently, while the longest sample that can be generated autoregressively by Gemma is limited by the memory available on the host, RecurrentGemma can generate sequences of arbitrary length.”
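To make that contrast concrete, here is a rough back-of-the-envelope sketch. The layer count, per-token cache width, and data type below are assumed values for illustration, not the published Gemma-2B configuration; only the 2k-token local attention window comes from the paper.

```python
# Rough, illustrative memory comparison. All model dimensions below are assumptions;
# only WINDOW (the 2k-token local attention window) comes from the paper.
BYTES_PER_VALUE = 2   # bfloat16 (assumed)
LAYERS = 18           # assumed layer count
KV_WIDTH = 256        # assumed per-token key+value width per layer
WINDOW = 2048         # RecurrentGemma's local attention window

def kv_cache_mb(seq_len):
    """Transformer-style KV cache: grows linearly with the sequence length."""
    return seq_len * LAYERS * 2 * KV_WIDTH * BYTES_PER_VALUE / 1e6

def bounded_state_mb(seq_len):
    """RecurrentGemma-style state: stops growing once the local window is full."""
    return min(seq_len, WINDOW) * LAYERS * 2 * KV_WIDTH * BYTES_PER_VALUE / 1e6

for n in (2_000, 20_000, 200_000):
    print(f"{n:>7} tokens  KV cache ~ {kv_cache_mb(n):8.1f} MB   bounded state ~ {bounded_state_mb(n):6.1f} MB")
```

Whatever the exact dimensions, the shape of the result is the same: the transformer’s cache keeps climbing with sequence length, while the bounded state flattens out once the 2k window is full, which is why generation length is limited by host memory in one case and not the other.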

RecurrentGemma also outperforms the Gemma transformer model in throughput (the amount of data that can be processed per unit of time; higher is better). The transformer model’s throughput suffers at higher sequence lengths (an increased number of tokens or words), but this is not the case for RecurrentGemma, which is able to maintain high throughput.

The research paper shows:

“In Figure 1a, we plot the throughput achieved when sampling from a prompt of 2k tokens for a range of generation lengths. The throughput measures the maximum number of tokens we can sample per second on a single TPUv5e device.

…RecurrentGemma achieves higher throughput at all sequence lengths considered. The throughput achieved by RecurrentGemma does not decrease as the sequence length increases, while the throughput achieved by Gemma falls as the cache grows.”
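As a loose, purely illustrative model of that throughput curve (the per-token cost constants below are invented for the sketch, not measured numbers from the paper’s TPUv5e benchmark):

```python
# Toy model of why throughput falls when a KV cache grows, but plateaus when the
# state is bounded by a local attention window. All cost constants are invented.

def decode_seconds(seq_len, bounded, base_cost=1e-3, cache_cost=1e-6, window=2048):
    total = 0.0
    for t in range(seq_len):
        cache = min(t, window) if bounded else t  # bounded state vs. growing KV cache
        total += base_cost + cache_cost * cache   # attention cost scales with cache size
    return total

for n in (2_000, 8_000, 32_000):
    transformer_like = n / decode_seconds(n, bounded=False)
    recurrent_like = n / decode_seconds(n, bounded=True)
    print(f"{n:>6} tokens  transformer-like: {transformer_like:7.1f} tok/s   "
          f"recurrent-like: {recurrent_like:7.1f} tok/s")
```

In this toy model the transformer-like throughput keeps dropping as the cache grows, while the bounded version levels off once the window is reached, which is the qualitative behavior the paper reports for Gemma versus RecurrentGemma.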

RecurrentGemma Limitations

The research paper notes that this approach comes with its own limitation, where performance can lag behind traditional transformer models in one respect.

The researchers highlight a weakness in handling very long sequences, something transformer models are able to do.

According to the document:

“Although RecurrentGemma models are very efficient for shorter sequences, their performance can lag behind traditional transformer models like Gemma-2B when handling extremely long sequences that exceed the local attention window.”

What does this mean for the real world?

The importance of this approach is that it suggests there are other ways to improve language model performance while using fewer computational resources, in an architecture that is not a transformer. It also shows that a non-transformer model can overcome one of the limitations of transformers: a KV cache that grows with sequence length and drives up memory usage.

This could lead to language model applications in the near future that can work in resource-constrained environments.

Read the Google DeepMind research paper:

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models (PDF)

Featured image by Shutterstock/Photo For Everything
