A well-known AI researcher digs deep into Google's Gemma: more than 7 billion parameters, and some unusual design choices


Reported by the Heart of the Machine.

Editors: Chen Ping, Du Wei

Want to know how Google's Gemma compares with Llama 2 and Mistral? This article is worth reading.
Just a few days ago, a major new player arrived in the field of open-source large models: Google launched Gemma, a new family of open models. Compared with Gemini, Gemma is lighter-weight while remaining free to use, and its model weights are openly released and licensed for commercial use.

Google released the model at two weight scales: Gemma 2B and Gemma 7B. Despite its smaller size, Gemma clearly outperforms larger models on key benchmarks, including Llama 2 7B and 13B as well as the up-and-coming Mistral 7B. A technical report on Gemma was released at the same time.

In this article, Sebastian Raschka, a well-known machine learning and AI researcher, walks us through the design choices that set Gemma apart from other LLMs.

Raschka starts with the model's performance. Anyone who has read the technical report, he says, is likely to ask the same question: what makes Gemma perform so well? The report does not state the reasons explicitly, but Raschka believes two points account for much of it:

The first is the large vocabulary, which reaches 256,000 tokens in Gemma compared with 32,000 in Llama 2;

The second is the training dataset of 6 trillion tokens; Llama 2 was trained on only a third of that.

On the architecture side, Raschka gives an overview comparing Gemma with Llama 2 7B and OLMo 7B.

Regarding model size, Raschka notes that Gemma 2B uses multi-query attention while Gemma 7B does not. In addition, Gemma 7B has a relatively large feedforward layer compared with Llama 2, and although it has fewer layers (28 vs. 32), its parameter count is still quite large.

Raschka estimates that Gemma 7B actually has about 9.3 billion parameters in total, or 8.5 billion if weight tying is taken into account. Weight tying means the model shares the same weights between the input embedding and the output projection layer, similar to GPT-2 and OLMo 1B (OLMo 7B is trained without weight tying).
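As a rough illustration of where the gap between those two figures can come from, here is a back-of-the-envelope sketch in Python. The 256,000-token vocabulary comes from the article above, while the hidden size of 3072 is an assumption used only for illustration.

    # Back-of-the-envelope check of the weight-tying estimate.
    # vocab_size comes from the article above; hidden_dim is an assumed value.
    vocab_size = 256_000
    hidden_dim = 3_072

    # The input embedding and the output projection are both (vocab_size x hidden_dim)
    # matrices. With weight tying they are the same matrix and are counted only once.
    shared_matrix_params = vocab_size * hidden_dim

    untied_total = 9.3e9                               # Raschka's estimate without tying
    tied_total = untied_total - shared_matrix_params   # count the shared matrix once

    print(f"shared embedding/projection matrix: {shared_matrix_params / 1e9:.2f} B")
    print(f"total with weight tying:            {tied_total / 1e9:.2f} B")
    # Roughly 0.79 B parameters are saved, about the gap between 9.3 B and 8.5 B.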

Normalization layer

Another striking detail is the following passage from the Gemma technical report:

Normalizer location. Gemma normalizes both the input and the output of each transformer sub-layer, a deviation from the standard practice of normalizing only one or the other. RMSNorm is used as the normalization layer.
At first glance, it sounds as if Gemma has an extra RMSNorm layer after each transformer block. However, looking at the official code implementation in the Keras-NLP project, it turns out that Gemma simply uses the regular pre-normalization scheme used by GPT-2, Llama 2, and other LLMs, as shown in the figure below.

Typical layer-normalization placement in GPT-2, Llama 2, and other LLMs; there is nothing new in Gemma here.
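To make the pre-normalization point concrete, here is a minimal PyTorch-style sketch of a pre-norm transformer block with RMSNorm. The class and argument names are illustrative only; this is not the Keras-NLP implementation.

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        """Minimal RMSNorm: scale inputs by the reciprocal root-mean-square."""
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(dim))
            self.eps = eps

        def forward(self, x):
            rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
            return self.weight * (x * rms)

    class PreNormBlock(nn.Module):
        """Pre-normalization: normalize *before* the attention and feedforward
        sub-layers, then add the residual - the scheme used by GPT-2, Llama 2, etc."""
        def __init__(self, dim, attn, ffn):
            super().__init__()
            self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
            self.attn, self.ffn = attn, ffn

        def forward(self, x):
            x = x + self.attn(self.norm1(x))  # normalized input to attention
            x = x + self.ffn(self.norm2(x))   # normalized input to feedforward
            return x

The key point is that normalization sits on the input of each sub-layer inside the residual branch, rather than forming an additional layer after each block.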

GeGLU activation

One of the big differences between Gemma and other architectures is that it uses the GeGLU activation, which was proposed in Google's 2020 paper GLU Variants Improve Transformer.

GELU stands for Gaussian Error Linear Unit, an activation function that is increasingly used as an alternative to the traditional ReLU. Its popularity comes from its ability to introduce nonlinearity while still allowing gradients to propagate for negative input values, which addresses one of ReLU's major limitations: blocking negative values entirely.

GeGLU is a gated-linear-unit (GLU) variant: the activation is split into two parts, a gating unit and a linear branch that is multiplied element-wise with the gate's output, as shown in the figure below. (In the original GLU the gate is a sigmoid; in GeGLU it is a GELU.)

A graphical comparison of the GELU and ReLU activation functions.
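Since the comparison figure is not reproduced here, a tiny numerical example using PyTorch's built-in activations illustrates the same point: ReLU clamps negative inputs to exactly zero, while GELU maps them to small negative values and keeps a nonzero gradient for moderately negative inputs.

    import torch
    import torch.nn.functional as F

    x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0], requires_grad=True)

    print(F.relu(x))   # negative inputs become exactly 0
    print(F.gelu(x))   # negative inputs map to small negative values

    # Unlike ReLU, GELU passes a nonzero gradient for moderately negative inputs,
    # which is the ReLU limitation described above.
    F.gelu(x).sum().backward()
    print(x.grad)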

GeGLU is similar to the SwiGLU activation used by other LLMs such as Llama 2 and Mistral; the only difference is that GeGLU's base activation is GELU rather than Swish.

The diagram below shows pseudocode for the feedforward modules with GELU (GPT-2), SwiGLU (Llama 2), and GeGLU (Gemma).

It is important to note that the feedforward modules using SwiGLU and GeGLU each have one more linear layer (linear_1 and linear_2 instead of a single linear layer) than the regular GELU feedforward module. However, in the SwiGLU and GeGLU feedforward modules, linear_1 and linear_2 are typically obtained by splitting a single linear layer into two parts, so the overall parameter count does not increase.
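Since the pseudocode figure is not reproduced here, the following PyTorch-style sketch reconstructs the three feedforward variants under assumed layer names and dimensions; it is an illustration of the description above, not Google's or Meta's actual implementation.

    import torch.nn as nn
    import torch.nn.functional as F

    class GELUFeedForward(nn.Module):
        """Regular feedforward module (GPT-2 style): linear -> GELU -> projection."""
        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.linear = nn.Linear(dim, hidden_dim)
            self.proj = nn.Linear(hidden_dim, dim)

        def forward(self, x):
            return self.proj(F.gelu(self.linear(x)))

    class SwiGLUFeedForward(nn.Module):
        """SwiGLU module (Llama 2 style): Swish/SiLU gate times a linear branch."""
        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.linear_1 = nn.Linear(dim, hidden_dim)  # gate branch
            self.linear_2 = nn.Linear(dim, hidden_dim)  # linear branch
            self.proj = nn.Linear(hidden_dim, dim)

        def forward(self, x):
            return self.proj(F.silu(self.linear_1(x)) * self.linear_2(x))

    class GeGLUFeedForward(nn.Module):
        """GeGLU module (Gemma style): identical to SwiGLU except the gate
        uses GELU instead of Swish."""
        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.linear_1 = nn.Linear(dim, hidden_dim)  # gate branch
            self.linear_2 = nn.Linear(dim, hidden_dim)  # linear branch
            self.proj = nn.Linear(hidden_dim, dim)

        def forward(self, x):
            return self.proj(F.gelu(self.linear_1(x)) * self.linear_2(x))

In the sketch, linear_1 and linear_2 are written as two separate layers for clarity; as noted above, implementations often fuse them into one wide linear layer and split its output in two.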

Is GeGLU better than SwiGLU? There are no ablation experiments to confirm this. Raschka guesses that Google chose GeGLU simply to make Gemma slightly different from Llama 2.

Gemma also includes a couple of other small tweaks. For example, it adds an offset of +1 to the RMSNorm weights and scales the embeddings by the square root of the hidden-layer dimension. These details are not mentioned or discussed in the Gemma technical report, so their importance is unclear.
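A minimal sketch of what these two tweaks could look like in code, assuming a standard RMSNorm formulation; this is an illustration of the description above, not Google's implementation, and the class name and dimensions are hypothetical.

    import math
    import torch
    import torch.nn as nn

    class UnitOffsetRMSNorm(nn.Module):
        """RMSNorm whose learned scale is applied as (1 + weight), i.e. a +1 offset."""
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.zeros(dim))  # starts at 0, so the initial scale is 1
            self.eps = eps

        def forward(self, x):
            rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
            return (1.0 + self.weight) * (x * rms)

    # Embedding scaling: multiply token embeddings by sqrt(hidden_dim)
    # before they enter the first transformer block.
    hidden_dim = 3072                       # assumed hidden size for illustration
    embed = nn.Embedding(256_000, hidden_dim)
    tokens = torch.tensor([[1, 2, 3]])
    hidden_states = embed(tokens) * math.sqrt(hidden_dim)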

Conclusion

Gemma is a great contribution to open-source LLMs, demonstrating that models at the 7B parameter scale can also be powerful, with the potential to replace Llama 2 and Mistral in real-world use cases.

In addition, since there are already plenty of open-source models around the 7B size, Gemma 2B is arguably even more interesting, because it can easily run on a single GPU. A comparison between Gemma 2B and the 2.7B-parameter phi-2 would also be interesting.
