
Gemma2-2B: A Small Yet Powerful Generative AI - A Hands-On Review

Today, we'll be diving into Google DeepMind's recently announced compact generative AI model, "Gemma2-2B" (1), and running a simple demo. Gemma is Google's family of open-weight models. While the larger 27B and 9B parameter models are already available, this latest release boasts a significantly smaller 2B-parameter model. It promises remarkable performance despite its size, generating considerable excitement. Let's take a closer look.

 

1. Remarkable Fundamental Performance

Despite its compact size, the Gemma model exhibits impressive performance, as detailed below. Surpassing GPT-3.5 is a feat that would have been unimaginable just a year ago. The rapid advancements in open models continue to amaze.

Google's website describes it as follows (1):

“This lightweight model produces outsized results by learning from larger models through distillation. In fact, Gemma 2 2B surpasses all GPT-3.5 models on the Chatbot Arena, demonstrating its exceptional conversational AI abilities.”

The "distillation" technique mentioned here is key to enhancing the performance of smaller models. It's employed not only in Gemma but also in Llama3 and various other small models, making it a concept worth remembering. With the performance of a 2B parameter model reaching such heights, it's tempting to explore its capabilities. Let's move on to the demo.

 

2. Performance Check with a News Article Classification Task

For this demo, we'll tackle the task of classifying Japanese articles from the publicly available Livedoor-news dataset (2) into five genres. We'll fine-tune the Gemma2-2B model and evaluate its classification accuracy. Since we're using Japanese articles, this will also assess its multilingual capabilities. Let's get started!

The following article is an example from the validation data. The model's task is to identify this article as belonging to the sports category.

                Example of validation data

Specifically, each article is categorized into one of the following categories. The goal of today's demo is to improve the accuracy of this classification.

  • 'kaden-channel' (Electronics)

  • 'topic-news' (General News)

  • 'sports-watch' (Sports)

  • 'it-life-hack' (IT/Life Hacks)

  • 'movie-enter' (Movies/Entertainment)

We prepared 100 samples for training data and 1000 samples for validation data. We'll apply fine-tuning using the impressive quantization tool Unsloth, and the data will be in the Alpaca format. For details, please refer to this link (3).

Without extensive tuning, we achieved an accuracy of 81.5%, as shown below. Considering the small training dataset of only 100 samples, this is an impressive result. With further optimization, the accuracy could likely be improved. It's hard to believe this performance comes from a model with only 2B parameters. Its ability to handle Japanese text is also commendable. The notebook used for the demo can be found here.

 

3. Limitless Potential Applications

With such high performance in a small model, the possibility of implementation on devices like smartphones, previously deemed impractical, becomes a reality. It also opens doors for applications where cost and computational speed were prohibitive. It seems particularly well-suited for customer service applications requiring real-time responses. Additionally, it could be deployed in developing countries where the cost of using frontier models like GPT4 has been a barrier. The future possibilities are truly exciting.

 



So, what did you think? The Gemma2-2B model can run on Google Colab's free T4 GPU, making it a valuable asset for startups like ours. It's truly remarkable. The small yet powerful Gemma2-2B model is poised for widespread adoption. At ToshiStats, we're committed to developing tuning techniques to maximize the benefits of open-weight models. We'll be sharing more on this blog in the future. That's all for today. Stay tuned!

 
 

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Google DeepMind's new prompt engineering technique, "Many-Shot In-Context Learning," is amazing!

I recently came across an interesting research paper, "Many-Shot In-Context Learning" (1), by Google DeepMind, and I'd like to share a brief overview. Although it's a highly technical paper, it offers valuable insights that we can apply to our own prompt writing. Let's dive in.

 

1. Utilizing Context Effectively

When you write prompts for language models or generative AI like ChatGPT, you probably type in a question much as you would in a search engine, such as "What is the capital of Japan?" However, generative AI can handle much larger amounts of information. For example, as shown in the chart below, you can load a PDF document and then write a prompt like "Summarize this," and the AI will output a summary of the PDF's content. Think of a prompt as an "instruction to the generative AI." The additional information you provide is called the context.

 


2. What's Needed to Use Generative AI in a Business Setting

Now that we have a basic understanding of how to use generative AI, let's consider what's needed to use it in a company or business setting. Obviously, when you represent your company and interact with customers, you wouldn't express "personal opinions or feelings." You wouldn't say, "I personally don't think this new product will sell." Specifically, companies have established rules and manuals that employees must follow, and normally, employees cannot violate them. Therefore, to use generative AI in a company, it must output answers that comply with each company's "rules and manuals," not just general answers.

So, how do you convey these rules to the generative AI? One way is to input the "rules and manuals" directly into the generative AI along with the prompt, as shown in the chart above. Many recent generative AIs have "context windows" of 100,000 tokens or more. The context window represents the amount of information that can be input and output at once; 100,000 tokens is about 70,000 words in English, so you can input a considerable amount of "rules and manuals." Some models, like Google's Gemini 1.5 Pro, can accept up to 2 million tokens, which is enough for about 3,000 pages of English manuals. That's amazing. These context windows are sometimes called "long context windows."
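To make this concrete, here is a minimal sketch of how rules-as-context can be combined with a user's question into a single prompt. The rule text, function name, and wording are my own illustrative assumptions, not from any specific API:

```python
def build_prompt(rules: str, question: str) -> str:
    """Combine company rules (the context) with the user's question (the prompt)."""
    return (
        "You are a customer-support assistant. Follow these company rules strictly.\n\n"
        f"## Company rules\n{rules}\n\n"
        f"## Question\n{question}\n"
    )

rules = ("1. Never share personal opinions.\n"
         "2. Refer all refund requests to the support desk.")
print(build_prompt(rules, "Can I get a refund for my order?"))
```

The resulting string is what actually gets sent to the model; the "long context window" simply determines how large the `rules` part is allowed to grow.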

 


3. Many-Shot In-Context Learning

"Many-Shot In-Context Learning" is a technique that utilizes these "long context windows" even more effectively. You may have heard of the similar term "Few-Shot Learning." "Few-Shot Learning" is a method where you first provide the generative AI with a few "question and answer pairs" as examples and then ask the question you want answered. For instance, you might give examples like "The capital of the United States is Washington, D.C." and "The capital of China is Beijing," and then ask the AI, "What is the capital of Japan?" "Many-Shot In-Context Learning" increases the number of these "question and answer pairs" to anywhere from 10 to 10,000, which is said to improve accuracy. The graph below shows that in machine translation and summarization tasks, accuracy improves as the number of examples grows to around 500-1,000 (the x-axis is on a log scale, so 2^10 = 1024). The idea is to put as many examples as possible into the "long context window," since it can easily handle them.

The relationship between accuracy and the number of examples in machine translation and summarization.
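As a sketch, the only mechanical difference between few-shot and many-shot prompting is the number of example pairs packed into the context. A toy prompt builder (the Q/A formatting is my own illustration, not taken from the paper):

```python
def make_many_shot_prompt(examples, question):
    """Build a prompt from (question, answer) example pairs plus the final question."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {question}\nA:")  # leave the final answer blank for the model
    return "\n\n".join(lines)

examples = [("What is the capital of the United States?", "Washington, D.C."),
            ("What is the capital of China?", "Beijing")]
# With a long context window, `examples` could hold hundreds or thousands of pairs.
print(make_many_shot_prompt(examples, "What is the capital of Japan?"))
```

Going from few-shot to many-shot is then just a matter of passing a much longer `examples` list.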

 


What do you think? If simply increasing the number of examples improves accuracy, it might be worth trying. For those who say, "I can't create so many examples myself," the paper also suggests a method for creating synthetic examples with an LLM (large language model). If you're interested, please check out the paper. And if it's only about 10 examples, you could probably create them yourself. I'll give it a try and update here if I get good results. That's all for today. Stay tuned!

 






1) "Many-Shot In-Context Learning", Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle, Google DeepMind, 22 May 2024,  https://arxiv.org/abs/2404.11018



Copyright © 2024 Toshifumi Kuga. All rights reserved





Google introduces new open-weight generative AI "Gemma2". The competition with Llama3 has finally begun!

Google has finally introduced a new type of open-weight generative AI, "Gemma2" (1). Although it had been previously announced, it came out sooner than expected. As shown below, the 27B model boasts an impressive 12th place on the leaderboard, closely rivaling larger models. A technical report (2) is also available, so let's take a look at what kind of evolution has occurred.

LMSYS Chatbot Arena Leaderboard

 

1. Model Architecture

Gemma2 adopts the familiar decoder-only transformer architecture used by most modern generative AI models. The context window, which indicates the amount of information that can be input and output at once, is 8192 tokens. The model structure is largely the same as Gemma1, but according to the technical report, the following points have been updated:

“We alternate between a local sliding window attention (Beltagy et al., 2020) and global attention (Luong et al., 2015) in every other layer. The sliding window size of local attention layers is set to 4096 tokens, while the span of the global attention layers is set to 8192 tokens.”

Global attentional model (3)

Comparison of full self-attention pattern and other attention patterns (4)
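To illustrate the quoted design, here is a toy sketch (not Gemma2's actual implementation): a local sliding-window layer restricts each token to the most recent `window` positions, while a global layer can attend to every earlier position; both remain causal. The function name and boolean-mask representation are my own assumptions:

```python
def causal_mask(n, window=None):
    """Return an n-by-n attention mask: mask[i][j] is True if token i may attend to token j.
    window=None gives full causal (global) attention; an integer limits it to a sliding window."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):                 # causal: only past and current positions
            if window is None or i - j < window:
                mask[i][j] = True
    return mask

# Alternating layers: local sliding-window attention in one layer, global in the next.
local_mask = causal_mask(6, window=3)   # token 5 cannot see token 0 (too far back)
global_mask = causal_mask(6)            # token 5 can see every earlier token
```

In Gemma2 the window is 4096 tokens and the global span 8192 tokens; the alternation keeps attention cost down while preserving long-range information flow every other layer.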

 

2. Pre-training

Gemma2's training data is as follows:

  • 27B model: 13 trillion tokens, primarily English data

  • 9B model: 8 trillion tokens

  • 2.6B model: 2 trillion tokens

“These tokens come from a variety of data sources, including web documents, code, and science articles. Our models are not multimodal and are not trained for state-of-the-art multilingual capabilities.”

Gemma2 also uses the “same tokenizer as Gemma 1 and Gemini: a SentencePiece tokenizer with split digits, preserved whitespace, and byte-level encodings. The resulting vocabulary has 256k entries.”

Knowledge distillation was also adopted for the 9B and 2.6B models. In my opinion, this might be the most evolved point of Gemma2. It's a Google-specific strategy that leverages their existing large-scale generative AI to improve the performance of smaller models. The technical report explains in detail: “Given a large model used as a teacher, we learn smaller 9B and 2.6B models by distilling from the probability given by the teacher of each token 𝑥 given its context 𝑥𝑐, i.e., 𝑃𝑇(𝑥 | 𝑥𝑐). More precisely, we minimize the negative log-likelihood between the probabilities from the teacher and the student:

min over 𝑃𝑆 of Σ𝑥 −𝑃𝑇(𝑥 | 𝑥𝑐) log 𝑃𝑆(𝑥 | 𝑥𝑐)

where 𝑃𝑆 is the parameterized probability of the student. In practice, we run inference on the teacher once and store the probabilities. Since the vocabulary has 256k entries, we only store a sampled subset of the teacher probabilities.”
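As a toy illustration of this objective in pure Python, with a made-up 3-token vocabulary (real training uses a deep-learning framework and, per the report, only a sampled subset of teacher probabilities):

```python
import math

def distillation_loss(teacher_probs, student_probs):
    """Negative log-likelihood of the student under the teacher's distribution:
    sum over the vocabulary of -P_T(x) * log P_S(x)."""
    return -sum(p_t * math.log(p_s)
                for p_t, p_s in zip(teacher_probs, student_probs) if p_t > 0)

teacher      = [0.7, 0.2, 0.1]   # teacher's next-token distribution (toy 3-token vocab)
student_good = [0.6, 0.3, 0.1]   # close to the teacher -> lower loss
student_bad  = [0.1, 0.1, 0.8]   # far from the teacher -> higher loss
assert distillation_loss(teacher, student_good) < distillation_loss(teacher, student_bad)
```

Minimizing this loss pushes the student's token distribution toward the teacher's, which is how the small model inherits the large model's behavior.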

 

3. Post-training

This part uses techniques commonly seen in other generative AIs. According to the technical report, it is implemented in the following process:

“For post-training, we fine-tune our pre-trained models into instruction-tuned models. First, we apply supervised fine-tuning (SFT) on a mix of text-only, English-only synthetic and human-generated prompt-response pairs. We then apply RLHF on top of these models with the reward model trained on labelled English-only preference data and the policy based on the same prompts as the SFT phase. Finally, we average the models obtained after each phase to improve their overall performance.”
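The final step, "we average the models obtained after each phase," is a parameter-wise mean. A toy sketch with plain lists standing in for weight tensors (illustrative only; the report does not publish this code):

```python
def average_models(models):
    """Parameter-wise average of several models (each represented here as a list of floats)."""
    n = len(models)
    return [sum(params) / n for params in zip(*models)]

sft_model  = [0.2, 0.4, 0.6]   # toy weights after the SFT phase
rlhf_model = [0.4, 0.2, 0.8]   # toy weights after the RLHF phase
print(average_models([sft_model, rlhf_model]))  # approximately [0.3, 0.3, 0.7]
```

Averaging checkpoints this way is a cheap trick to smooth out phase-specific quirks without any additional training.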

It's noteworthy that knowledge distillation is adopted again. "We run behavioral cloning on synthetic and real prompts, and responses predominantly synthetically generated by the teacher, that is a larger model. We also run distillation from the teacher on the student’s distribution." In the future, knowledge distillation from large models to small models may become common practice. It's exciting to see.

 

What do you think? Gemma2 seems to be a model with high potential even in small sizes, and it's promising. The 2.6B model is also expected to be released soon. By the way, Google, which created Gemma2, and Meta, which created the Llama3 we covered last time, have been rivals in the open-source world for more than 8 years with "TensorFlow vs. PyTorch". It seems that a similar battle has begun in generative AI as well. Next time, I'd like to try various things with the Gemma2 model. Stay tuned!

 
 

1) "Gemma 2 is now available to researchers and developers", Google, 27 June 2024
2) "Gemma 2 technical paper", Google DeepMind, 27 June 2024
3) "Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, Stanford University, 20 Sep 2015
4) "Longformer: The Long-Document Transformer", Iz Beltagy, Matthew E. Peters, Arman Cohan, Allen Institute for Artificial Intelligence, 2 Dec 2020
5) "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes", Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem, Google DeepMind / Mila / University of Toronto, 17 Jan 2024

 


Llama3-8B has shown impressive performance even when fine-tuned on Japanese data. Its high base performance likely plays a significant role in this.

In the previous post, we introduced the high performance of Llama3-70B. However, Llama3 also has a smaller 8B model, and I've been wanting to fine-tune it to fit my own tasks. Since it's small, it's cost-effective and fast, so if you have a clear task in mind, this 8B model will surely be an option. Therefore, this time, we will fine-tune the Llama3-8B model for the task of classifying the published Livedoor-news Japanese articles (3) into several genres, and check its accuracy. Let's get started!

 
1. Creating an Alpaca-style dataset

Livedoor-news Japanese articles are divided into the following 9 genres. The distribution of each genre is shown in the following chart.

  • 'kaden-channel'

  • 'livedoor-homme'

  • 'topic-news'

  • 'sports-watch'

  • 'peachy'

  • 'dokujo-tsushin'

  • 'it-life-hack'

  • 'movie-enter'

  • 'smax'

Distribution and sample size of each genre

This time, we will randomly extract 1000 samples for both training and validation data, and actually classify each article into the above 9 genres to verify whether high accuracy can be achieved. We have adopted Alpaca as the data format. As shown below, it consists of instruction, input, and output. Here, the instruction is common to all samples.

Example of Livedoor news
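As an illustration, an Alpaca-format sample is just an instruction/input/output triple. A sketch of how one training record for this task might be built (the instruction wording and helper function are my own assumptions, not taken from the notebook):

```python
import json

GENRES = ['kaden-channel', 'livedoor-homme', 'topic-news', 'sports-watch',
          'peachy', 'dokujo-tsushin', 'it-life-hack', 'movie-enter', 'smax']

def make_record(article: str, genre: str) -> dict:
    """Build one Alpaca-style sample; the instruction is shared by all samples."""
    assert genre in GENRES
    return {
        "instruction": "Classify the following Japanese news article into one of: "
                       + ", ".join(GENRES),
        "input": article,
        "output": genre,
    }

record = make_record("サッカー日本代表が勝利した…", "sports-watch")
print(json.dumps(record, ensure_ascii=False))
```

Each article becomes one such record, and only the `input` and `output` fields vary between samples.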

 

2. Fine-tuning using Hugging Face TRL + "unsloth"

This time, we used Hugging Face's TRL (1), a library for fine-tuning LLMs, along with "unsloth", a library for accelerating training, to perform fine-tuning efficiently. The development environment was Google Colab, and we prepared a paid L4 (GPU) instance. The training time was about 100 minutes for 4 epochs. The L4 has 22.5GB of GPU RAM, which is large enough for this training. Also, "unsloth" provides a 4-bit quantized model for fine-tuning, so you can download and use it directly from the Hugging Face Hub, which is convenient. This training process was based on the "unsloth" notebook (2). If you are interested in speeding up training, please check it out.

"Unsloth" model

 

3. Verify model accuracy

At first, I simply asked, "The skill to score a penalty kick from this impossible angle is amazing." The answer was "sports-watch". It's a soccer/football story, so I think it's a reasonable answer.

Next, I asked, "Which is better, iPhone or Android?" The answer was "it-life-hack". This is also a good answer.

Typing articles in one by one is impractical, and the actual articles are longer and more complex, so I prepared 1000 validation samples and evaluated the model on them. The result was a very good accuracy of 94.5%. Since the input is Japanese, I thought Llama3 would struggle, but I was surprised that it easily exceeded 90%. It must be the effect of pre-training on a huge corpus of 15 trillion tokens. Even the 8B model seems practical for Japanese if fine-tuned.
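The evaluation itself boils down to exact-match accuracy over the validation set. A minimal sketch (the predictions below are toy stand-ins for the fine-tuned model's outputs):

```python
def accuracy(predictions, labels):
    """Fraction of validation samples where the predicted genre matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy stand-in for model outputs on 4 validation articles:
preds  = ["sports-watch", "it-life-hack", "movie-enter", "topic-news"]
labels = ["sports-watch", "it-life-hack", "movie-enter", "smax"]
print(accuracy(preds, labels))  # -> 0.75
```

In the real run, `preds` would come from generating one genre label per validation article with the fine-tuned model.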

 

How was it? Even though Llama3-8B is small, it has high potential and seems to be active in various places. Fine-tuning is required for each task, but "unsloth" can help speed it up. If you want to shorten the training time, please try it. This time, we were able to obtain sufficient accuracy in about 2 hours even with a general-purpose single GPU. It's a reliable ally for small startups like us! If you want to try it by yourself, you can use my notebook here.

We will update you as we gain new insights. Stay tuned!

 

(1) TRL - Transformer Reinforcement Learning https://huggingface.co/docs/trl/en/index

(2) Alpaca + Llama-3 8b full example.ipynb https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing#scrollTo=iHjt_SMYsd3P

(3) Livedoor-news Japanese articles https://www.rondhuit.com/download.html

 


I tried the new generative AI model "Claude3 Haiku". Fast, smart, and low-priced. I want to use it as an AI agent!

On March 14th, "Claude3 Haiku" (1), the lightest model among the Claude3 generative AIs, was released and became available for use in web applications and APIs. I'm usually drawn to the highest-performing models, but this time I'd like to focus on the lightest one. Recently, algorithms that execute repetitive calculations like AI Agents have become more common. I want to use high-end models like GPT4, but they are very costly to run. So I was looking for a low-cost, high-performance model, and "Claude3 Haiku" is perfect as it costs 1/60th of the high-end model "Claude3 Opus" while still delivering excellent performance. I'd like to try it out here right away. The details of each model are as follows.




1. First, let's test text generation

I checked if "Claude3 Haiku" knows about Hiroshima-style okonomiyaki, a hyper-local Japanese food. I used to live in Hiroshima, so I know it well, and I think this answer is generally good. The Japanese is clean, so it passes for now.




Next, I asked about transportation from Tokyo to Osaka. Unfortunately, there was one clear mistake. The travel time by bus is stated as "about 4 hours and 30 minutes," but in reality, it takes around 8 hours. This is a hallucination.



Then I asked about the "Five Forces," a framework for analyzing market competitiveness. It analyzed the automotive industry, and the analysis incorporates the latest examples, such as the threat of electric vehicles as substitutes, making it a sufficient quality starting point for discussion. However, the fact that it's not in a table format is a drawback.





2. Next, let's analyze images.

First, I asked about the number of smartphones, but unfortunately, it got it wrong. It may not be good at counting.




This is a photo of the Atomic Bomb Dome in Hiroshima. It answered this perfectly. It seems to understand famous Japanese buildings.





This is a photo of a streetcar running in Hiroshima City. I think it captures it pretty well overall. However, the streetcars don't run solely for tourists, so the explanation may be somewhat incomplete.




This is a flight information board at Haneda Airport. It perfectly understands the detailed information. Excellent.





Counting the number of cars in a parking lot is a difficult task for generative AI. This time it answered 60 cars, but there are actually 48. It's close, but not yet at a practical level, which is a bit disappointing.






3. Impressions of using "Claude3 Haiku".

Honestly, the performance was unbelievable for such a lightweight model. The Japanese is natural and clean. The fact that it can incorporate and analyze images at all is groundbreaking. Multimodality has arrived in lightweight generative AI. The calculation speed is also fast, and I think it will be applied to applications that require real-time responses. And the cost is low. This allows for plenty of interesting experiments. It's a savior for startups with tight cost constraints! I want to continue doing interesting experiments using "Claude3 Haiku". Stay tuned!

(1) "Claude 3 Haiku: our fastest model yet", Anthropic, 14 March 2024



The Evolution of AI Accelerates: A Deep Dive into Google's "Gemini 1.5 Pro"

The pace of AI advancement is truly remarkable, and this year is no exception. Google has unveiled a new generative AI called "Gemini 1.5 Pro," which boasts a groundbreaking Mixture-of-Experts (MoE) architecture. Currently only available to a limited number of users, with broader testing to come, this technology presents intriguing breakthroughs that warrant a closer look.

 
 

1. Unprecedented Context Window of 1 Million Tokens

Gemini 1.5 Pro boasts a context window far beyond that of existing LLMs, capable of processing up to 1 million tokens. Research has even demonstrated data ingestion of up to 10 million tokens. This represents a revolutionary breakthrough, considering that GPT-4's context window is limited to 128,000 tokens (1).

Comparison of Context Windows for Different LLMs

With such an extensive context window, Gemini 1.5 Pro can ingest an entire book at once. Currently, when creating RAG systems and referencing internal documents, chunking is necessary to accommodate the LLM's context window. However, with Gemini 1.5 Pro, this requirement is minimized, simplifying RAG development and operation. Furthermore, the model maintains high accuracy, even with such a large context window, achieving over 99% accuracy in information retrieval tests (see chart below).
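For context, "chunking" simply means splitting a long document into pieces that fit the model's context window. A minimal character-based sketch (real RAG pipelines usually split by tokens and tune the overlap; the function below is my own illustration):

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0):
    """Split text into chunks of at most chunk_size characters, with optional overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "A" * 250
chunks = chunk_text(doc, chunk_size=100, overlap=20)
print(len(chunks))  # -> 4 chunks, each at most 100 characters
```

With a 1-million-token window, many documents fit whole and this step can often be skipped entirely, which is what simplifies RAG development.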

 
 

2. Remarkable In-Context Learning Capabilities

The ability to process vast amounts of data is not the only noteworthy aspect of Gemini 1.5 Pro. It also excels at understanding and applying this information to various tasks. This is evident in its in-context learning capabilities, showcased in a Kalamang language translation task. The model was given a Kalamang grammar book and dictionary in its context, enabling it to translate between English and Kalamang.

English to Kalamang Translation Test

Gemini 1.5 Pro outperformed other models, achieving scores that rival those of human learners. This is an astonishing feat.

 
 

3. Towards Individualized Agents with Gemini 1.5 Pro

If a model can acquire translation capabilities simply by reading a grammar book, it stands to reason that it can also learn from knowledge systems in other domains and apply that knowledge to various tasks. In other words, Gemini 1.5 Pro has the potential to develop its own "frame of reference" that influences its understanding and values. The ability to incorporate a vast amount of data into its context through its extensive context window has significant implications in this regard. This is because it allows Gemini 1.5 Pro to potentially become an individualized agent with diverse perspectives in the future. The Kalamang translation experiment provides promising evidence of this potential.

Gemini 1.5 Pro is a remarkable advancement in AI technology, offering unprecedented capabilities in terms of context window size and in-context learning. "A host of improvements made across nearly the entire model stack (architecture, data, optimization and systems) allows Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra, while using significantly less training compute and being significantly more efficient to serve," according to the report (1). This is truly a testament to the rapid progress being made in the field of AI.

I am eager to experiment with Gemini 1.5 Pro once it becomes publicly available. Stay tuned for future updates!

(1) "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context", Gemini Team, Google

 

