
Gemma2-2B: A Small Yet Powerful Generative AI - A Hands-On Review

Today, we'll be diving into Google DeepMind's recently announced compact generative AI model, "Gemma2-2B" (1), and running a simple demo. Gemma is a family of open-weight models released by Google. While medium-sized models with 27B and 9B parameters are already available, this latest release adds a significantly smaller 2B parameter model. It promises remarkable performance despite its size, generating considerable excitement. Let's take a closer look.

 

1. Remarkable Fundamental Performance

Despite its compact size, Gemma2-2B exhibits impressive performance, as detailed below. Surpassing GPT3.5 is a feat that would have been unimaginable just a year ago. The rapid advancements in open-source models continue to amaze.

Google's website describes it as follows (1):

""This lightweight model produces outsized results by learning from larger models through distillation. In fact, Gemma 2 2B surpasses all GPT-3.5 models on the Chatbot Arena, demonstrating its exceptional conversational AI abilities.

The "distillation" technique mentioned here is key to enhancing the performance of smaller models. It's employed not only in Gemma but also in Llama3 and various other small models, making it a concept worth remembering. With the performance of a 2B parameter model reaching such heights, it's tempting to explore its capabilities. Let's move on to the demo.

 

2. Performance Check with a News Article Classification Task

For this demo, we'll tackle the task of classifying Japanese articles from the publicly available Livedoor-news dataset (2) into five genres. We'll fine-tune the Gemma2-2B model and evaluate its classification accuracy. Since we're using Japanese articles, this will also assess its multilingual capabilities. Let's get started!

The following article is an example from the validation data. The model's task is to identify this article as belonging to the sports category.

Example of validation data

Specifically, each article is categorized into one of the following categories. The goal of today's demo is to improve the accuracy of this classification.

  • 'kaden-channel' (Electronics)

  • 'topic-news' (General News)

  • 'sports-watch' (Sports)

  • 'it-life-hack' (IT/Life Hacks)

  • 'movie-enter' (Movies/Entertainment)

We prepared 100 samples for training and 1,000 samples for validation. We'll fine-tune using Unsloth, a library that accelerates LoRA fine-tuning and supports 4-bit quantization, with the data formatted in the Alpaca prompt style. For details, please refer to this link (3). A minimal sketch of this setup follows below.
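The sketch below shows roughly how such a run could be wired together with Unsloth's FastLanguageModel and TRL's SFTTrainer. It is a hedged illustration, not the notebook's exact code: the Hub model id, LoRA settings, prompt wording, column names, and hyperparameters are all assumptions.

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load Gemma2-2B in 4-bit so it fits on a free Colab T4 GPU
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-2b",  # assumed Hub id for the Unsloth build
    max_seq_length=2048,
    load_in_4bit=True,
)
# Attach LoRA adapters; only these low-rank matrices are trained
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

ALPACA_PROMPT = """### Instruction:
Classify the following Japanese news article into exactly one of:
kaden-channel, topic-news, sports-watch, it-life-hack, movie-enter.

### Input:
{article}

### Response:
{label}"""

# Stand-in for the 100 prepared Livedoor-news training examples
train_samples = [{"text": "...article body...", "category": "sports-watch"}]

def to_alpaca(ex):
    # Append EOS so the model learns where the label ends
    prompt = ALPACA_PROMPT.format(article=ex["text"], label=ex["category"])
    return {"text": prompt + tokenizer.eos_token}

train_ds = Dataset.from_list(train_samples).map(to_alpaca)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(per_device_train_batch_size=2,
                           num_train_epochs=3,
                           learning_rate=2e-4,
                           output_dir="outputs"),
)
trainer.train()
```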

Without extensive tuning, we achieved an accuracy of 81.5%, as shown below. Considering the small training dataset of only 100 samples, this is an impressive result. With further optimization, the accuracy could likely be improved. It's hard to believe this performance comes from a model with only 2B parameters. Its ability to handle Japanese text is also commendable. The notebook used for the demo can be found here.
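Continuing the sketch above, the evaluation loop could look something like the following; `valid_samples` stands in for the 1,000 validation examples, and the prompting and label-matching logic are assumptions rather than the notebook's exact code.

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch Unsloth to fast generation

valid_samples = [...]  # placeholder: 1,000 dicts shaped like train_samples
correct = 0
for ex in valid_samples:
    # Same Alpaca prompt as in training, but with the label left blank
    prompt = ALPACA_PROMPT.format(article=ex["text"], label="")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8)
    pred = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()
    correct += int(pred.startswith(ex["category"]))

print(f"accuracy = {correct / len(valid_samples):.1%}")
```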

 

3. Limitless Potential Applications

With such high performance in a small model, the possibility of implementation on devices like smartphones, previously deemed impractical, becomes a reality. It also opens doors for applications where cost and computational speed were prohibitive. It seems particularly well-suited for customer service applications requiring real-time responses. Additionally, it could be deployed in developing countries where the cost of using frontier models like GPT4 has been a barrier. The future possibilities are truly exciting.

 



So, what did you think? The Gemma2-2B model runs on Google Colab's free T4 GPU, making it a valuable asset for startups like ours. It's truly remarkable. Small yet powerful, Gemma2-2B is poised for widespread adoption. At ToshiStats, we're committed to developing tuning techniques that maximize the benefits of open models like this one. We'll be sharing more on this blog in the future. That's all for today. Stay tuned!

 
 


Google introduces new open-weight generative AI "Gemma2". The competition with Llama3 has finally begun!

Google has finally introduced a new type of open-weight generative AI, "Gemma2" (1). Although it had been previously announced, it came out sooner than expected. As shown below, the 27B model boasts an impressive 12th place on the leaderboard, closely rivaling larger models. A technical report (2) is also available, so let's take a look at what kind of evolution has occurred.

LMSYS Chatbot Arena Leaderboard

 

1. Model Architecture

Gemma2 adopts the familiar decoder-only transformer architecture, the same basic design used by the GPT series. The context window, which indicates the amount of information that can be input and output at once, is 8192 tokens. The model structure is largely the same as Gemma1's, but according to the technical report, the following points have been updated:

“We alternate between a local sliding window attention (Beltagy et al., 2020) and global attention (Luong et al., 2015) in every other layer. The sliding window size of local attention layers is set to 4096 tokens, while the span of the global attention layers is set to 8192 tokens.”

Global attentional model (3)

Comparison of full self-attention pattern and other attention patterns (4)
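To make the alternating pattern concrete, here is a small PyTorch sketch of the two mask types; the even/odd layer assignment is an illustrative assumption, not taken from the report.

```python
import torch

def gemma2_attention_mask(layer_idx: int, seq_len: int, window: int = 4096):
    # Gemma2 alternates between local sliding-window and global attention.
    # Even layers (here): each token attends only to the previous `window`
    # tokens; odd layers: full causal attention over the whole context.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i
    if layer_idx % 2 == 0:
        return causal & (i - j < window)  # local sliding window
    return causal                         # global attention
```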

 

2. Pre-training

Gemma2's training data is as follows:

  • 27B model: 13 trillion tokens, primarily English data

  • 9B model: 8 trillion tokens

  • 2.6B model: 2 trillion tokens

"These tokens come from a variety of data sources, including web documents, code, and science articles.  Our models are not multimodal and are not trained for state-of-the-art multilingual capabilitiesthe.”

The report also notes that they use the "same tokenizer as Gemma 1 and Gemini: a SentencePiece tokenizer with split digits, preserved whitespace, and byte-level encodings. The resulting vocabulary has 256k entries."
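These tokenizer properties are easy to check yourself, assuming you have accepted the Gemma license on the Hugging Face Hub; the model id below is one possible choice.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")  # gated checkpoint
print(len(tok))                # vocabulary of roughly 256k entries
print(tok.tokenize("2024"))    # split digits: each digit is its own token
print(tok.tokenize("  code"))  # whitespace is preserved, not collapsed
```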

Knowledge distillation was also adopted for the 9B and 2.6B models. In my opinion, this might be the most evolved point of Gemma2. It's a Google-specific strategy that leverages their existing large-scale generative AI to improve the performance of smaller models. The technical report explains in detail: "Given a large model used as a teacher, we learn smaller 9B and 2.6B models by distilling from the probability given by the teacher of each token x given its context x_c, i.e., P_T(x | x_c). More precisely, we minimize the negative log-likelihood between the probabilities from the teacher and the student:

min_{P_S} Σ_x −P_T(x | x_c) log P_S(x | x_c)

where P_S is the parameterized probability of the student. In practice, we run inference on the teacher once and store the probabilities. Since the vocabulary has 256k entries, we only store a sampled subset of the teacher probabilities."
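In code, this objective is simply a cross-entropy between the stored teacher distribution and the student's distribution. A minimal PyTorch sketch:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs):
    # -sum_x P_T(x | x_c) * log P_S(x | x_c), averaged over all positions.
    # student_logits, teacher_probs: [batch, seq_len, vocab_size]
    log_ps = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * log_ps).sum(dim=-1).mean()
```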

 

3. Post-training

This part uses techniques commonly seen in other generative AI models. According to the technical report, the process is as follows:

"For post-training, we fine-tune our pre-trained models into instruction-tuned models. First, we apply supervised fine-tuning (SFT) on a mix of text-only, English-only synthetic and human-generated prompt-response pairs. We then apply RLHF on top of these models with the reward model trained on labelled English-only preference data and the policy based on the same prompts as the SFT phase. Finally, we average the models obtained after each phase to improve their overall performance."
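The final averaging step is essentially uniform weight averaging in the spirit of "model soups": take the checkpoints produced by each phase and average their parameters. A sketch of what that could look like (the function and checkpoint names are illustrative):

```python
import torch

def average_checkpoints(state_dicts):
    # Uniformly average parameters across checkpoints from each phase
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# e.g. merged = average_checkpoints([sft_sd, rlhf_sd])
#      model.load_state_dict(merged)
```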

It's noteworthy that knowledge distillation is adopted again. "We run behavioral cloning on synthetic and real prompts, and responses predominantly synthetically generated by the teacher, that is a larger model. We also run distillation from the teacher on the student’s distribution." In the future, knowledge distillation from large models to small models may become common practice. It's exciting to see.
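Distilling "on the student's distribution" means the student generates the samples and the teacher scores them (see reference 5). A rough sketch, with `student` and `teacher` as stand-in Hugging Face causal LMs; masking of the prompt tokens is omitted for brevity, so this is a simplification of the full method.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids):
    # 1) The student samples a response to the prompt (its own distribution)
    with torch.no_grad():
        sampled = student.generate(prompt_ids, max_new_tokens=64, do_sample=True)
        teacher_logits = teacher(sampled).logits
    # 2) The teacher's token distribution supervises the student on that sample
    student_logits = student(sampled).logits
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
```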

 

What do you think? Gemma2 seems to be a model with high potential even at small sizes, which is promising. The 2.6B model is also expected to be released soon. By the way, Google, which created Gemma2, and Meta, which created the Llama3 we covered last time, have been rivals in the open-source world for more than 8 years with "TensorFlow vs PyTorch". A similar battle now seems to have begun in generative AI. Next time, I'd like to try various things with the Gemma2 model. Stay tuned!

 
 

1) "Gemma 2 is now available to researchers and developers," Google, June 27, 2024
2) "Gemma 2 Technical Report," Google DeepMind, June 27, 2024
3) "Effective Approaches to Attention-based Neural Machine Translation," Minh-Thang Luong, Hieu Pham, Christopher D. Manning (Stanford University), September 20, 2015
4) "Longformer: The Long-Document Transformer," Iz Beltagy, Matthew E. Peters, Arman Cohan (Allen Institute for Artificial Intelligence), December 2, 2020
5) "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes," Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem (Google DeepMind, Mila, University of Toronto), January 17, 2024

 

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.