
DeepSeek-R1's Impact and the Future of Generative AI

Hello. DeepSeek-R1, released on January 20th (1), has sparked excitement among AI professionals and investors worldwide. I believe its impact is comparable to that of ChatGPT's emergence. Here, I'd like to consider why it has garnered so much global attention.

 

1. What Was New?

DeepSeek-R1's performance is remarkable. As the benchmark comparison below shows, it stands shoulder-to-shoulder with OpenAI's o1, the established reference among reasoning models. The fact that a newcomer suddenly matched OpenAI, the frontrunner in generative AI, is undoubtedly why the world is so astonished.

Performance comparison across various benchmarks


While DeepSeek-R1 appeared suddenly like a comet, it rests on several technical breakthroughs. Among the most significant is a training method called "GRPO." Like existing reasoning models, DeepSeek-R1 uses reinforcement learning to acquire advanced reasoning abilities in mathematics and coding. Reinforcement learning is a powerful training technique that doesn't require so-called "correct answer data," but it is complex and resource-intensive. DeepSeek adopted a method that requires only one model instead of the usual two: this is GRPO. The overview below contrasts PPO, the technique commonly used in existing models (upper section), with the new GRPO method (lower section).

PPO vs GRPO

Compared with PPO, GRPO drops the Value model and keeps only the Policy model, so training needs one model instead of two. Because each of these models is itself a huge generative AI, completing training with a single model yields major resource savings. DeepSeek, a Chinese company unable to use the latest GPUs due to US semiconductor export restrictions, still achieved remarkable results, and this efficiency may be part of the reason. For technical details, please refer to the DeepSeek-R1 paper (2); GRPO itself was first introduced in an earlier paper (3).
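To make the group-relative idea concrete, here is a minimal sketch of my own (an illustration, not DeepSeek's code): several answers are sampled per prompt, each gets a reward, and the advantage of each answer is measured against its own group's statistics, so no separate Value model is needed.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled answer is scored against
    the mean and spread of its own group, replacing a learned Value model."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, a group of 4 sampled answers scored by a rule-based reward
# (e.g., 1.0 if the final answer is correct, 0.0 otherwise)
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1. -1. -1.  1.]
```

The policy is then updated with a PPO-style clipped objective using these advantages, which is exactly where the Value model drops out of the pipeline.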

 

2. Why Did It Attract Global Attention?

DeepSeek-R1 was released as an open-weight model, available for anyone to download and use. Additionally, the entire training method, including GRPO, was published in detail in research papers. Until now, most generative AI models, with a few exceptions, could only be accessed via APIs and not downloaded, and how they were trained was rarely disclosed, making them black boxes. Against this background, releasing a cutting-edge model like DeepSeek-R1 in a form that AI researchers worldwide can actually use had a profound impact. However impressive a model is said to be, no one can critique or improve it while its inner workings remain unknown. With DeepSeek-R1, I feel the open-source community can, for the first time, participate in the development of the most advanced generative AI models.

 

3. What Will Become of Generative AI in the Future?

AI developers around the world are already starting to adopt methods like GRPO in the development of state-of-the-art models; DeepSeek-R1 has proven this is possible without incurring enormous costs. I'm currently following a public project called "Open-R1" (4), which plans to disclose the training data and code that were not released with DeepSeek-R1, and I believe this is revolutionary.

Open-R1

I expect similar projects to start up worldwide, and I am looking forward to that. It's exciting!

 

So, what did you think? The landscape surrounding generative AI has changed in an instant, and new models will keep arriving; it's hard to take your eyes off it. I will continue to deliver further news. Stay tuned!

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Marketing AI agents for customer targeting in telemarketing can also be easily implemented using the new library "smolagents." This looks promising!

1. Marketing AI Agent

To reach potential customers efficiently, you need to target those who are likely to buy your products or services; marketing aimed at customers without any need is usually wasteful and unsuccessful. However, identifying which customers to focus on from a large customer list beforehand is a challenging task. To meet the expectation that customers can be targeted easily, without complex analysis, as long as you have customer data at hand, we have implemented a marketing AI agent. Anyone with basic Python knowledge should be able to implement it without much difficulty. The secret lies in the latest framework "smolagents" (1), which we introduced previously. Please refer to the official documentation for details.

 

2. Agent Predicting Potential Customers for Deposit-Taking Telemarketing

Let's actually build an AI agent. The theme is "predicting potential customers for deposit-taking telemarketing with an AI agent using smolagents." As before, we provide the data and want the AI agent itself to write Python code internally and automatically display "the top 10 customers most likely to be successfully reached by telemarketing."

For the coding method, refer to the official documentation; here we present what kind of prompt makes the AI agent predict potential customers for deposit-taking telemarketing. The key point, as before, is to instruct it to "use sklearn's HistGradientBoostingClassifier for data analysis." This gradient boosting estimator is highly regarded for its accuracy and ease of use.

Furthermore, as the question (instruction), we specifically add an instruction to calculate "the purchase probability of the 10 customers most likely to convert." The input to the AI agent takes the form "prompt + question."
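For illustration, a minimal sketch of this "prompt + question" pattern with smolagents might look like the following (the file name, column name, and model choice are placeholders of mine, not the exact setup from the notebook):

```python
from smolagents import CodeAgent, HfApiModel

prompt = (
    "You are a data scientist. Load 'bank_marketing.csv' with pandas and "
    "model the term-deposit subscription column 'y' using sklearn's "
    "HistGradientBoostingClassifier."
)
question = (
    "Show the top 10 customers with the highest predicted purchase "
    "probability, in descending order."
)

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),  # any Hub or proprietary model can be plugged in
    additional_authorized_imports=["pandas", "sklearn", "numpy"],
)
agent.run(prompt + "\n" + question)
```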

The AI agent then automatically generates Python code, doing work that would otherwise fall to a human, and presents "the top 10 customers most likely to be successfully marketed to," with purchase probabilities close to 100%. Amazing!

         "Top 10 customers most likely to be successfully marketed to"

In this way, the user only needs to instruct "tell me the top 10 customers most likely to be successful," and the AI agent writes the code to calculate the purchase probability for each customer. This method can be applied to many other problems as well. I'm looking forward to future developments.

 

3. Future Expectations for Marketing AI Agents

As before, we implemented it with "smolagents" this time as well. It's easy to implement and, although its behavior isn't perfect, it's reasonably stable, so we plan to use it actively in 2025 to develop various AI agents. The code has been published as a notebook (2). The data used here is relatively simple demo data with over 40,000 samples; given the opportunity, I'd like to see how the AI agent behaves with larger, more complex data. More data means more possibilities, so there is much to look forward to. Please look forward to the next AI agent article. Stay tuned!

I tried using the new AI agent framework "smolagents". The code is simple and easy to use, and I recommend it for AI agent beginners!

At the end of last year, a new AI agent framework called "smolagents" was released by Hugging Face (1). The code is simple and easy to use, and it even supports multi-agent setups. This time, I actually created a data analysis AI agent and tried various things. I hope you find it helpful.

 

1. Features of "smolagents"
The newly released "smolagents" has features that existing frameworks lack. 1) First, it has a simple structure: you can run an AI agent with just 3 to 5 lines of code, which makes it perfect for those starting out with AI agents. 2) Also, since it was released by Hugging Face, the huge number of open-source models already on the Hub can be called and used easily; proprietary models such as GPT-4o are supported too, so it works with both open and closed models. 3) Finally, when an agent executes, it generates and runs Python code, so you can draw on the assets of the vast Python ecosystem, which is very convenient. For those of us who specialize in data analysis, it is a perfect framework because you can use Python libraries such as sklearn.
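For example, a complete agent can be this short (a sketch following the pattern in the official docs; the default hosted model is used here):

```python
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())  # default hosted open model
agent.run("What is the 10th Fibonacci number?")
```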

 

2. An Agent for Predicting Credit Card Defaults

Now, let's actually build an AI agent. The theme is "an AI agent built with smolagents predicts credit card defaults." Normally, when building a default prediction model, you would write the code yourself using machine learning libraries such as sklearn; this time, I give the agent the data and have it write Python code internally and automatically display the default probabilities of the first 10 customers.

For how to write the code, please refer to the official documentation, but here I would like to present the prompts I actually wrote to make the AI agent predict defaults. The point is to instruct it specifically to "use sklearn's HistGradientBoostingClassifier for data analysis." This estimator is highly regarded for building accurate machine learning models with ease. This is data-analysis domain knowledge, and by including it in the prompt we expect to obtain higher accuracy.

Furthermore, as the question, I add an instruction to specifically calculate "the default probability of the first 10 customers." The input to the AI agent takes the form "prompt + question."
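Concretely, the input might be composed like this (a sketch; the dataset file and column names are placeholders, and the agent is built as in the minimal example above):

```python
prompt = (
    "Load 'credit_card_default.csv' with pandas and build a model with "
    "sklearn's HistGradientBoostingClassifier to predict the 'default' column."
)
question = "Show the predicted default probability of the first 10 customers."
agent.run(prompt + "\n" + question)  # agent: CodeAgent as shown earlier
```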

Then, the AI agent automatically generated the following Python code. Normally, this is what I would write myself, but the AI agent does it for me. And as a result, the default probabilities for 10 people are also shown. Amazing!

In this way, the user only needs to instruct "use sklearn to calculate the default probability," and the AI agent writes the code to compute the default probability for each customer, giving you default predictions customer by customer. I tried default prediction this time, but I think the same approach extends to probability prediction in many business areas, such as marketing, customer churn, and human resources. I'm looking forward to future developments.

 

3. Impressions after using "smolagents" for the first time

Until now, I used LangGraph to implement AI agents. I liked it because I could configure things in detail, but it required coding each of state, tools, nodes, edges, and so on, and I felt the hurdle was high for beginners. After implementing with "smolagents" this time, I found that if you follow the template, it runs with just a few lines of code, so anyone can get started. It also fully meets the needs of AI developers, so I plan to use it actively in 2025 to develop various AI agents. I have published this code as a notebook (2). Please look forward to the next AI agent article. Stay tuned!

 

(1) Introducing smolagents, a simple library to build agents, Aymeric Roucher, Merve Noyan, Thomas Wolf, Hugging Face, Dec 31, 2024
(2) AI-agent-to-predict-default-of-credit-card-with-smolagent_20250121

OpenAI o1-preview: A Breakthrough in Generative AI, Introducing a Novel Paradigm

Last week, we introduced OpenAI o1 (1). Despite still being in preview, it boasts high performance and, as the leaderboard below shows (2), is widely regarded as overwhelming, especially in mathematics and coding. In this article, we'd like to explore why OpenAI o1 demonstrates higher accuracy than existing generative AI models like GPT-4.

Scores of various generative AIs in the mathematics field

 

1. Chain of Thought is Key

The star of the OpenAI o1 model is "Chain of Thought": a series of intermediate reasoning steps, previously considered an important element of prompts that users craft for existing generative AI models. By incorporating a Chain of Thought into prompts, users enabled generative AI to think more deeply and broadly before answering, thereby improving accuracy. Chain of Thought became known to the public through a 2022 research paper (3); please refer to it for details.
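The canonical example from that paper (3) shows the idea: the prompt includes a worked example whose answer spells out intermediate steps, and the model then imitates that style on a new question.

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
   6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
   bought 6 more, how many apples do they have?
A:
```

Given this prompt, the model replies with steps ("23 - 20 = 3; 3 + 6 = 9; the answer is 9") instead of guessing a number directly.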

 

2. OpenAI o1 Can Generate Chain of Thought Independently

OpenAI o1 can generate Chain of Thought internally on its own. Users don't need to devise Chain of Thought themselves; it's generated automatically. This is why it achieved high accuracy in mathematics and coding. Unfortunately, OpenAI seems to have decided not to disclose the Chain of Thought itself. Users can only see a summary of it. If you're like most users, you're probably thinking, "I'd really like to see that!" We hope that OpenAI will change its policy and release it in the future.

 

3. Creating a Reward Model

OpenAI has released very little information about what we'll discuss from here on. Please note that the following is based on speculation drawn from previously published research papers and information shared by OpenAI researchers. To enable generative AI to automatically generate task-specific Chain of Thought for practical use, we must evaluate whether the generated Chain of Thought is actually correct. This is where the Reward model comes into play. A 2023 research paper (4) from OpenAI provides a detailed explanation of how to train a Reward model, so let's look to it for clues.

The data for training the Reward model takes the form of Chain of Thought, as shown below. The paper limits the tasks to mathematics. Since it's challenging for humans to manually create a Chain of Thought for each task, they are generated automatically using GPT-4, which is called the Generator. Humans then label each step of the Chain of Thought produced by the Generator on a three-point scale (correct, incorrect, or neither), and this completes the training data. In the example below, you can see that each step carries one of the three labels. It must have been quite a task for humans to label such data in large quantities.

Training data for the Reward model
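As an illustration (my own mock-up, not OpenAI's actual schema), one labeled training example might be represented like this:

```python
example = {
    "problem": "What is 12 * 15 - 30?",
    "steps": [
        {"text": "12 * 15 = 180",  "label": "correct"},
        {"text": "180 - 30 = 140", "label": "incorrect"},  # should be 150
    ],
}
# The Reward model learns to predict the label of each step, so it can
# later score a freshly generated Chain of Thought step by step.
```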

 

4. Training Generative AI Through Reinforcement Learning

Once the Reward model is complete, we can train the generative AI using reinforcement learning. As a result, the generative AI can generate the correct Chain of Thought for the task. We, the users, actually run OpenAI o1 and benefit from the generated Chain of Thought. Unfortunately, OpenAI has not disclosed the specific method for training OpenAI o1 using reinforcement learning. Since this directly affects accuracy, it's unlikely to be released in the future. However, researchers worldwide are working on this issue and have published several promising results. As this is a technology that will support the future development of generative AI, we would like to revisit it in a future article.

 

5. A New Paradigm for Generative AI

OpenAI's website includes the following statement and chart:

“Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.”

Model computational cost and accuracy

 

Until now, there has been much discussion about how increasing the computational cost (time and number of parameters) in pre-training improves the accuracy of generative AI, but there hasn't been much in-depth discussion about the relationship between inference computational cost and accuracy. However, it has now become clear that by generating Chain of Thought itself and then providing an answer, generative AI can answer tasks requiring complex logical reasoning with high accuracy, albeit with significantly increased inference computational cost. The chart on the right above shows that accuracy improves as the computational cost at inference time increases. We believe this is a groundbreaking development. Therefore, it will be important to consider both training and inference computational costs for generative AI in the future. This marks the dawn of a new paradigm for generative AI.

 

OpenAI o1 has not only improved accuracy but has also brought about a new paradigm for the development of generative AI. We look forward to seeing how future generative AIs will follow in its footsteps. That's all for today. Stay tuned!

 



1) Introducing OpenAI o1, OpenAI, Sep 12, 2024
2) LMSYS Chatbot Arena Leaderboard
3) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Google Research, Brain Team, Jan 2023
4) Let's Verify Step by Step, Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, OpenAI, May 2023

OpenAI's "o1-preview" Arrives: Is This the Next Leap Towards Artificial General Intelligence?!

On September 12, 2024, OpenAI released its new generative AI model "o1" (pronounced "oh-one"), which had been the subject of much speculation. I had the opportunity to try it out, and here are my initial impressions.

 

1. Model Overview

As a new generative AI model, o1 has various features, but the key points are as follows:

  • Specialized for scientific, coding, and mathematical reasoning.

  • Available in two versions: OpenAI o1 and OpenAI o1-mini.

  • Currently in preview with limited functionality and performance.

  • Not a successor to GPT-4.

  • OpenAI o1 has a limited usage of 30 requests per week.

  • Price: OpenAI o1 is about six times more expensive than GPT-4o.

For more details, please refer to the official website (1).

Compared to GPT-4o, o1-preview demonstrates superior performance in coding, data analysis, and mathematics, as shown below. It seems likely that o1 will excel in fields where existing generative AI has struggled to achieve satisfactory accuracy. However, because it utilizes Chain of Thought reasoning to arrive at answers, it can take a considerable amount of time to respond, making it unsuitable for tasks requiring real-time answers.

GPT-4o vs. o1-preview: Task Performance Comparison

 

2. Challenging o1 with Game24

Let's test the capabilities of o1-preview. A common example of a task that generative AI struggles with is Game24.

This is a simple mathematical puzzle with the following rules:

  • Use the four given numbers and basic arithmetic operations (addition, subtraction, multiplication, division).

  • Create a mathematical expression that results in 24.

  • Each of the four given numbers can be used only once.

Example: 13, 10, 9, 4 → (10 - 4) × (13 - 9)
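Incidentally, Game24 is trivial for a conventional program, which is what makes it a nice probe of a model's reasoning rather than its computation. A small brute-force solver (a sketch of mine, not part of the o1 test) looks like this:

```python
from itertools import permutations, product

def solve24(nums, target=24, eps=1e-6):
    """Try every ordering, operator choice, and parenthesization."""
    for a, b, c, d in permutations([float(n) for n in nums]):
        for o1, o2, o3 in product("+-*/", repeat=3):
            # The five possible groupings of four operands
            for e in (f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                      f"({a}{o1}({b}{o2}{c})){o3}{d}",
                      f"({a}{o1}{b}){o2}({c}{o3}{d})",
                      f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                      f"{a}{o1}({b}{o2}({c}{o3}{d}))"):
                try:
                    if abs(eval(e) - target) < eps:
                        return e
                except ZeroDivisionError:
                    pass
    return None

print(solve24([13, 10, 9, 4]))  # finds e.g. (10.0-4.0)*(13.0-9.0)
```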

When attempting this with o1-preview, it produced the following result. It successfully solved the puzzle! The response took about 15 seconds, likely due to internal trial-and-error processes.

Game24 instruction

o1-preview Game24 Trial Result

When trying the same with GPT-4o:

GPT4o Game24 Trial Result

GPT-4o fails to provide a correct answer. This highlights o1's superiority in tasks that require strong logical reasoning.

 

3. The Impact on the Future of Generative AI

o1's newfound capabilities are attributed to its incorporation of Chain of Thought reasoning, enabling it to generate task-specific chains of thought and produce more reliable correct answers. However, the Chain of Thought process, which demonstrates how the correct answer is derived, is not revealed to the user. This is somewhat disappointing, as users typically want to understand not only the correct answer but also "why" that answer was reached. Therefore, it's understandable that some may perceive it as a black box. We hope that the open-source development community will further research this aspect and share their findings with the world. With excellent open-source generative AI models like Llama and Gemma currently available, we believe that user verification of Chain of Thought will become possible in the near future.

 

Conclusion

o1-preview seems to have been received with a level of excitement not seen since the release of GPT-4 in March 2023. In the next installment, I plan to explore the technology behind this impressive generative AI, based on external speculation. That's all for today. Stay tuned!

 

1) Introducing OpenAI o1, OpenAI, Sep 12, 2024 

The Future of Generative AI: Predicting the Next Generation Based on Google DeepMind's Math Olympiad Breakthrough

Generative AI has a reputation for struggling with math, often making mistakes even with simple elementary-level arithmetic. However, Google DeepMind recently announced that their AI achieved a score equivalent to a silver medal in the International Mathematical Olympiad (IMO) (1). Based on this article, let's delve into predicting the future of next-generation generative AI.

 

1. How Did AI Solve Complex Math Problems?

The achievement is impressive:

“Today, we present AlphaProof, a new reinforcement-learning based system for formal math reasoning, and AlphaGeometry 2, an improved version of our geometry-solving system. Together, these systems solved four out of six problems from this year’s International Mathematical Olympiad (IMO), achieving the same level as a silver medalist in the competition for the first time.”


This is an amazing score, just shy of a gold medal. Of the two systems, we'll focus on AlphaProof, the reasoning system.

AlphaProof is explained as follows:

“AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go.”

In simple terms, although abundant data exists for math problems written in natural language, generative AI tends to make plausible yet incorrect statements (hallucinations), making such data difficult to use effectively. Therefore, Google used its generative AI, Gemini, to translate math problems into the formal language Lean, in which every proof step can be checked mechanically. This formal representation was then fed into AlphaZero, known for its long-term planning and reasoning capabilities, for computation. The chart below provides a clear illustration.

AlphaProof's Structure

AlphaZero has already proven its reasoning prowess in board games like Go. This achievement demonstrates the successful application of its capabilities to the realm of mathematics. Remarkable!

 

2. Implications from AlphaZero

Let's briefly revisit AlphaZero, which has reappeared here. It is a groundbreaking AI that combines reinforcement learning (RL) and Monte Carlo Tree Search (MCTS). The initial model gained fame in March 2016 as the first AI to defeat a top professional Go player. It's important to emphasize that AlphaZero achieved superhuman ability without relying on human-created data; it trained itself using self-generated data. Upon hearing this for the first time, many might wonder, "How is that even possible?" AlphaZero accomplishes this through self-play, generating massive amounts of training data by playing against itself, as sketched below. Refer to the research paper (2) for more details. For context, consider AlphaGo the initial version of AlphaZero.
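Schematically (a pseudocode-level sketch of the published idea; `new_game`, `mcts_search`, and `train` are hypothetical helpers, not DeepMind's code), one self-play iteration looks like this:

```python
def self_play_iteration(net, n_games):
    """Play the network against itself and train on the games it produces."""
    data = []
    for _ in range(n_games):
        game, history = new_game(), []
        while not game.over():
            move, policy = mcts_search(game, net)  # search guided by the net
            history.append((game.state(), policy))
            game.play(move)
        z = game.outcome()  # +1 win / -1 loss / 0 draw
        data.extend((state, policy, z) for state, policy in history)
    train(net, data)  # the improved net then generates better games, and so on
```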

 

3. The Fusion of Current Generative AI and AlphaGo

Interestingly, Demis Hassabis, CEO of Google DeepMind, recently hinted at the future of their generative AI (3). The key takeaways are:

  • “Gemini” is a natively multimodal model.

  • It can understand various aspects of the world, including language, images, videos, and audio.

  • Current models are incapable of long-term planning and problem-solving.

  • DeepMind possesses expertise in this field through AlphaGo.

  • The next-generation model will be an agent that fuses Gemini and AlphaGo.

 

It's plausible to view the project that secured a silver medal in the Math Olympiad as a step towards overcoming the limitations of generative AI in "long-term planning." However, one might question, "How exactly will this fusion work?" A prominent long-form essay (4) published in June of this year provides clues:

A look back at AlphaGo—the first AI system that beat the world champions at Go, decades before it was thought possible—is useful here

• In step 1, AlphaGo was trained by imitation learning on expert human Go games. This gave it a foundation.

• In step 2, AlphaGo played millions of games against itself. This let it become superhuman at Go: remember the famous move 37 in the game against Lee Sedol, an extremely unusual but brilliant move a human would never have played.

Developing the equivalent of step 2 for LLMs is a key research problem for overcoming the data wall (and, moreover, will ultimately be the key to surpassing human-level intelligence).

AlphaGo eventually transitioned to self-play, generating its own training data and eliminating the need for human input. This remarkable feat was achieved through the combination of reinforcement learning and MCTS. The future of next-generation AI hinges on whether generative AI can be trained with the same mechanism.

 

Conclusion:

The ability to execute long-term plans opens up a plethora of possibilities. Imagine AI formulating long-term investment strategies or serving as legal advisors in court, excelling in tasks that demand prolonged reasoning and debate. The world is undoubtedly on the verge of transformation, and the future is incredibly exciting.

That's all for today. Stay tuned!

 





1) AI achieves silver-medal standard solving International Mathematical Olympiad problems, Google DeepMind, 25 July 2024
2) Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Google DeepMind, 5 Dec 2017
3) Unreasonably Effective AI with Demis Hassabis, Google DeepMind, 14 Aug 2024 (around 18:00)
4) Situational Awareness: The Decade Ahead, Leopold Aschenbrenner, June 2024, p. 28


Gemma2-2B: A Small Yet Powerful Generative AI - A Hands-On Review

Today, we'll be diving into Google DeepMind's recently announced compact generative AI model, "Gemma2-2B" (1), and running a simple demo. Gemma is a family of open models. While medium-sized models with 27B and 9B parameters are already available, this latest release adds a significantly smaller 2B-parameter model. It promises remarkable performance despite its size, generating considerable excitement. Let's take a closer look.

 

1. Remarkable Fundamental Performance

Despite its compact size, the Gemma model exhibits impressive performance, as detailed below. Surpassing GPT-3.5 is a feat unimaginable just a year ago. The rapid advancements in open-source models continue to amaze.

Google's website describes it as follows (1):

""This lightweight model produces outsized results by learning from larger models through distillation. In fact, Gemma 2 2B surpasses all GPT-3.5 models on the Chatbot Arena, demonstrating its exceptional conversational AI abilities.

The "distillation" technique mentioned here is key to enhancing the performance of smaller models. It's employed not only in Gemma but also in Llama3 and various other small models, making it a concept worth remembering. With the performance of a 2B parameter model reaching such heights, it's tempting to explore its capabilities. Let's move on to the demo.

 

2. Performance Check with a News Article Classification Task

For this demo, we'll tackle the task of classifying Japanese articles from the publicly available Livedoor-news dataset (2) into five genres. We'll fine-tune the Gemma2-2B model and evaluate its classification accuracy. Since we're using Japanese articles, this will also assess its multilingual capabilities. Let's get started!

The following article is an example from the validation data. The model's task is to identify this article as belonging to the sports category.

Example of validation data

Specifically, each article is categorized into one of the following categories. The goal of today's demo is to improve the accuracy of this classification.

  • 'kaden-channel' (Electronics)

  • 'topic-news' (General News)

  • 'sports-watch' (Sports)

  • 'it-life-hack' (IT/Life Hacks)

  • 'movie-enter' (Movies/Entertainment)

We prepared 100 samples of training data and 1,000 samples of validation data. We applied fine-tuning using Unsloth, an impressive library that accelerates training with 4-bit quantization, with the data in the Alpaca format. For details, please refer to this link (3).
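As a rough sketch of the training setup (the 4-bit model id and hyperparameters follow the public Unsloth notebooks; the one-item dataset here is a stand-in for the 100 Alpaca-formatted Livedoor samples):

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-2-2b-bnb-4bit", max_seq_length=2048, load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

train_dataset = Dataset.from_dict({"text": [
    "### Instruction: Classify the article ... ### Response: sports-watch"
]})  # stand-in for the Alpaca-formatted articles

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=train_dataset, dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(per_device_train_batch_size=2,
                           num_train_epochs=1, output_dir="outputs"),
)
trainer.train()
```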

Without extensive tuning, we achieved an accuracy of 81.5%, as shown below. Considering the small training dataset of only 100 samples, this is an impressive result. With further optimization, the accuracy could likely be improved. It's hard to believe this performance comes from a model with only 2B parameters. Its ability to handle Japanese text is also commendable. The notebook used for the demo can be found here.

 

3. Limitless Potential Applications

With such high performance in a small model, the possibility of implementation on devices like smartphones, previously deemed impractical, becomes a reality. It also opens doors for applications where cost and computational speed were prohibitive. It seems particularly well-suited for customer service applications requiring real-time responses. Additionally, it could be deployed in developing countries where the cost of using frontier models like GPT-4 has been a barrier. The future possibilities are truly exciting.

 



So, what did you think? The Gemma2-2B model can run on Google Colab's free T4 GPU, making it a valuable asset for startups like ours. It's truly remarkable. The small yet powerful Gemma2-2B model is poised for widespread adoption. At ToshiStats, we're committed to developing tuning techniques to maximize the benefits of open-source libraries. We'll be sharing more on this blog in the future. That's all for today. Stay tuned!


Google DeepMind's new prompt engineering technique, "Many-Shot In-Context Learning," is amazing!

I recently came across an interesting research paper, "Many-Shot In-Context Learning" (1), by Google DeepMind, and I'd like to share a brief overview. Although it's a highly technical paper, it offers valuable insights that we can apply to our own prompt writing. Let's dive in.

 

1. Utilizing Context Effectively

When you write prompts for language models or generative AI like ChatGPT, you probably input the information you want, like a search engine, such as "What is the capital of Japan?" However, generative AI can handle much larger amounts of information. For example, as shown in the chart below, you can load a PDF document and then write a prompt like "Summarize this," and the AI will output a summary of the PDF's content. Think of a prompt as an "instruction to the generative AI." The additional information you provide is called the context.

 


2. What's Needed to Use Generative AI in a Business Setting

Now that we have a basic understanding of how to use generative AI, let's consider what's needed to use it in a company or business setting. Obviously, when you represent your company and interact with customers, you wouldn't express personal opinions or feelings; you wouldn't say, "I personally don't think this new product will sell." Companies have established rules and manuals that employees must follow, and employees normally cannot violate them. Therefore, to use generative AI in a company, it must output answers that comply with each company's rules and manuals, not just general answers.

So, how do you convey these rules to the generative AI? One way is to input the rules and manuals directly into the generative AI along with the prompt, as shown in the chart above. Many recent generative AIs have "context windows" of 100,000 tokens or more. This is the amount of information that can be input and output at once; 100,000 tokens is about 70,000 words in English, so you can input a considerable amount of rules and manuals. Some models, like Google's Gemini 1.5 Pro, can accept up to 2 million tokens, enough for about 3,000 pages of English manuals. That's amazing. These are sometimes called "long context windows."

 


3. Many-Shot In-Context Learning

"Many-Shot In-Context Learning" is a technique that utilizes these "long context windows" even more effectively. You may have heard of a similar term, "Few-Shot Learning." "Few-Shot Learning" is a method where you first provide the generative AI with a few "question and answer pairs" as examples and then ask the question you want to know. For instance, you might give examples like "The capital of the United States is Washington, D.C." and "The capital of China is Beijing," and then ask the AI, "What is the capital of Japan?" "Many-Shot In-Context Learning" increases the number of these "question and answer pairs" to 10-10,000. This is said to improve accuracy. The graph below shows that in machine translation and summarization tasks, increasing the number of examples to 500-1,000 improves accuracy. 2 to the power of 10 is 1024. The idea is to put as many examples as possible into the "long context window" since it can easily handle them.

The relationship between accuracy and the number of examples in machine translation and summarization.
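As a concrete sketch of the mechanics (my own toy illustration, echoing the capital-city examples above), building a many-shot prompt is just concatenating worked pairs in front of the real question:

```python
def many_shot_prompt(examples, question):
    """Concatenate many Q/A pairs, then append the real question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {question}\nA:"

pairs = [("What is the capital of the United States?", "Washington, D.C."),
         ("What is the capital of China?", "Beijing")] * 250  # 500 shots
prompt = many_shot_prompt(pairs, "What is the capital of Japan?")
```

A long context window makes the sheer length of such a prompt a non-issue.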

 


What do you think? If simply increasing the number of examples improves accuracy, it might be worth trying. For those who say, "I can't create so many examples myself," "Many-Shot In-Context Learning" also suggests a method to create synthetic data using an LLM (language model). If you're interested, please check out the paper. But if it's just about 10 examples, you could probably create them yourself. I'll give it a try and update here if I get good results. That's all for today. Stay tuned!

1) "Many-Shot In-Context Learning", Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle, Google DeepMind, 22 May 2024,  https://arxiv.org/abs/2404.11018



Copyright © 2024 Toshifumi Kuga. All rights reserved.

Llama3-8B has shown impressive performance even when fine-tuned on Japanese data. Its high base performance likely plays a significant role in this.

In the previous post, we introduced the high performance of Llama3-70B. However, Llama3 also has a smaller 8B model, and I've been wanting to fine-tune it for my own tasks. Because it's small, it's cost-effective and fast, so if you have a clear task in mind, the 8B model is certainly an option. This time, therefore, we will fine-tune Llama3-8B to classify the publicly available Livedoor-news Japanese articles (3) into several genres and check its accuracy. Let's get started!

1. Creating an Alpaca-style dataset

Livedoor-news Japanese articles are divided into the following 9 genres. The distribution of each genre is shown in the following chart.

  • kaden-channel

  • livedoor-homme

  • topic-news

  • sports-watch

  • peachy

  • dokujo-tsushin

  • it-life-hack

  • movie-enter

  • smax

Distribution and sample size of each genre

This time, we will randomly extract 1000 samples for both training and validation data, and actually classify each article into the above 9 genres to verify whether high accuracy can be achieved. We have adopted Alpaca as the data format. As shown below, it consists of instruction, input, and output. Here, the instruction is common to all samples.

Example of Livedoor news
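One sample, mocked up in the Alpaca structure described above (article text abridged; the genre list comes from the dataset):

```python
sample = {
    "instruction": ("Classify the following Japanese news article into one "
                    "of 9 genres: kaden-channel, livedoor-homme, topic-news, "
                    "sports-watch, peachy, dokujo-tsushin, it-life-hack, "
                    "movie-enter, smax."),
    "input": "(full article text goes here)",
    "output": "sports-watch",
}
```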

 

2. Fine-tuning using Hugging face TRL + "unsloth"

This time, we used Hugging Face's TRL (1), a library for fine-tuning LLMs, together with Unsloth, a library for accelerating training, to fine-tune efficiently. The development environment was Google Colab with a paid L4 GPU instance; training took about 100 minutes for 4 epochs. The L4 has 22.5 GB of GPU RAM, which is ample for this job. Conveniently, Unsloth provides ready-made 4-bit quantized models for fine-tuning that you can download and use directly from the Hugging Face Hub, as in the sketch below. This training process was based on the Unsloth notebook (2); if you are interested in speeding up training, please check it out.

"Unsloth" model

 

3. Verify model accuracy

At first, I simply asked, "The skill to score a penalty kick from this impossible angle is amazing." The answer was "sports-watch". It's a soccer/football story, so I think it's a reasonable answer.

Next, I asked, "Which is better, iPhone or Android?" The answer was "it-life-hack". This is also a good answer.

Typing examples in one by one is impractical, and the actual articles are longer and more complex, so I evaluated on the 1000 prepared validation samples. The result was a very good accuracy of 94.5%. Since the input is Japanese, I thought Llama3 would struggle, but I was surprised that it easily exceeded 90%. It must be the effect of pre-training on a huge corpus of 15 trillion tokens. Even the 8B model seems practical for Japanese when fine-tuned.

 

How was it? Even though Llama3-8B is small, it has high potential and should prove useful in many settings. Fine-tuning is required for each task, but Unsloth can speed it up, so give it a try if you want to shorten training time. This time, we obtained sufficient accuracy in about 2 hours on a single general-purpose GPU. It's a reliable ally for small startups like us! If you want to try it yourself, you can use my notebook here.

We will update you as we gain new insights. Stay tuned!

 

(1) TRL - Transformer Reinforcement Learning https://huggingface.co/docs/trl/en/index

(2) Alpaca + Llama-3 8b full example.ipynb https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing#scrollTo=iHjt_SMYsd3P

(3) Livedoor-news Japanese articles https://www.rondhuit.com/download.html


Llama3: Exceeding Expectations and Expanding Horizons!

The release of the new LLM "Llama3" by Meta on April 18th has created quite a stir in the AI community (1). As a highly anticipated open-source model with performance expectations on par with GPT-4, its potential applications seem limitless.

Examining the performance on leaderboards (2), Llama3 is available in two sizes: 70B and 8B parameters. The larger 70B model, in particular, demonstrates capabilities that rival proprietary models such as GPT-4, Claude3-Opus, and Gemini 1.5 Pro.

To assess Llama3's performance, a test was conducted using a bank customer complaint classification task. The objective was to evaluate the model's accuracy in categorizing complaints without any fine-tuning.

1. To what extent can we discriminate between six categories of customer complaints without training?

The dataset consisted of customer complaints from a US bank, categorized into six product areas:

  • Mortgage

  • Checking or savings account

  • Student loan

  • Money transfer, virtual currency, or money service

  • Bank account or service

  • Consumer loan

Examples of these complaints, all in English, were provided.

A random sample of 500 complaints was used with a prompt instructing Llama3-70B to assign a product category to each complaint. The results were astounding, achieving an accuracy rate of 88.6%. This near 90% accuracy was unprecedented and speaks volumes about Llama3's potential.
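The exact prompt wording used in the experiment is not reproduced here; an illustrative zero-shot version built from the category list above might look like this:

```python
CATEGORIES = [
    "Mortgage", "Checking or savings account", "Student loan",
    "Money transfer, virtual currency, or money service",
    "Bank account or service", "Consumer loan",
]

def build_prompt(complaint: str) -> str:
    """Zero-shot classification prompt: the answer must be one category name."""
    return ("Assign exactly one product category to the bank customer "
            "complaint below. Answer with the category name only.\n"
            "Categories: " + "; ".join(CATEGORIES) + "\n\n"
            "Complaint: " + complaint)
```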

2. Maintaining Accuracy with Japanese Data?

Considering the potential use of Llama3 in Japan, the English dataset was translated into Japanese using Google Translate. The classification task was then repeated with the translated data.

Despite Llama3's training data being predominantly English (around 95%), the model maintained an impressive accuracy rate of 82.8% with the Japanese data. This suggests that Llama3's capabilities extend beyond English and hold promise for multilingual applications.

3. Conclusion and Future Prospects

Llama3 has proven to be a top-tier performer despite being open-source, and Meta deserves appreciation for this contribution to the AI community. Hopefully, other companies, Google included, will follow suit and release more open models of their own.

Further experiments are planned to evaluate the accuracy and computational speed of the smaller 8B model. Stay tuned for the results!

 



1) meta website https://llama.meta.com/llama3/
2) LMSYS Chatbot Arena Leaderboard https://chat.lmsys.org/?leaderboard
3) https://github.com/TOSHISTATS/Classification-of-Consumer-Complaints-by-Llama3/tree/main


I tried the new generative AI model "Claude3 Haiku". Fast, smart, and low-priced. I want to use it as an AI agent!

On March 14th, "Claude3 Haiku" (1), the lightest model in the Claude3 family, was released and became available in the web application and via API. I'm usually drawn to the highest-performing models, but this time I'd like to focus on the lightest one. Algorithms that loop repeatedly, such as AI agents, have recently become common; I'd like to use high-end models like GPT-4 for them, but they are very costly to run. So I was looking for a low-cost, high-performance model, and "Claude3 Haiku" is perfect: it costs 1/60th as much as the high-end "Claude3 Opus" while still delivering excellent performance. Let's try it out right away. The details of each model are as follows.
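Calling the model through the API is straightforward; here is a minimal sketch with the official Python SDK (the API key is read from the environment, and max_tokens is an arbitrary choice of mine):

```python
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    messages=[{"role": "user",
               "content": "What is Hiroshima-style okonomiyaki?"}],
)
print(message.content[0].text)
```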




1. First, let's test it on text

I checked if "Claude3 Haiku" knows about Hiroshima-style okonomiyaki, a hyper-local Japanese food. I used to live in Hiroshima, so I know it well, and I think this answer is generally good. The Japanese is clean, so it passes for now.




Next, I asked about transportation from Tokyo to Osaka. Unfortunately, there was one clear mistake. The travel time by bus is stated as "about 4 hours and 30 minutes," but in reality, it takes around 8 hours. This is a hallucination.



Then I asked about the "Five Forces," a framework for analyzing market competitiveness. It analyzed the automotive industry, and the analysis incorporates the latest examples, such as the threat of electric vehicles as substitutes, making it a sufficient quality starting point for discussion. However, the fact that it's not in a table format is a drawback.





2. Next, let's analyze images.

First, I asked about the number of smartphones, but unfortunately, it got it wrong. It may not be good at counting.




This is a photo of the Atomic Bomb Dome in Hiroshima. It answered this perfectly. It seems to understand famous Japanese buildings.





This is a photo of a streetcar running in Hiroshima City. I think it captures it pretty well overall. However, the streetcars don't run solely for tourists, so the explanation may be somewhat incomplete.




This is a flight information board at Haneda Airport. It perfectly understands the detailed information. Excellent.





Counting the number of cars in a parking lot is a difficult task for generative AI. This time it answered 60 cars, but there are actually 48. If the accuracy improves a bit more, it will reach a practical level, which is a bit disappointing.






3. Impressions of using "Claude3 Haiku".

Honestly, the performance was unbelievable for such a lightweight model. The Japanese is natural and clean. The fact that it can take in and analyze images at all is groundbreaking; multimodality has arrived even in lightweight, general-purpose models. The computation speed is fast, so I expect it to be applied to tasks that require real-time responses. And the cost is low, which allows for plenty of interesting experiments. It's a savior for startups with tight cost constraints! I want to keep running interesting experiments with "Claude3 Haiku". Stay tuned!

(1) Claude 3 Haiku: our fastest model yet, Anthropic, Mar 14, 2024

Copyright © 2024 Toshifumi Kuga. All rights reserved.


The new generative AI "Google Gemini 1.5 Pro" is as amazing as expected!

Last month, I informed you that Google released a new generative AI called "Gemini 1.5 Pro" (1). And today, the "Gemini 1.5 Pro" finally arrived at Toshi Stats. I would like to experiment with it right away.



1. Can the 1 million token long context window really work?

Gemini 1.5 Pro boasts an incredibly long context window of 1 million tokens, which is unthinkable for previous LLMs. Because it is so amazing, anyone would wonder, "Can this really work?" Today, I would like to explore its capabilities here. I have prepared two experiments. The first one is to extract detailed information including numbers from relatively short materials, and the second one is to see if it can answer comprehensive questions well from materials over 200,000 tokens long. Let's begin.



2. Information extraction from Toyota Motor Corporation's financial results  

First, I will check if it can accurately extract numerical information from Toyota Motor Corporation's financial results for the fiscal year ended March 2023. The number of pages is 28, and the number of tokens is about 27,000, which is not a long material, but it is a task often seen in practice. This time, I have prepared 13 questions. Let's upload the material to Google AI Studio and ask the questions one by one.

Google AI Studio 
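The same workflow can also be scripted against the Gemini API instead of the AI Studio UI; a sketch with the google-generativeai SDK (the file name is a placeholder, and the question is one example of the 13):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
pdf = genai.upload_file("toyota_fy2023_results.pdf")  # placeholder file name
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [pdf, "What was Toyota's operating income for the fiscal year "
          "ended March 2023?"]
)
print(response.text)
```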




Here are the results. The correct answer rate is about 88%.

Questions and Results & Computation Time

For question 8, the financial results refer to ROE as "Return on equity attributable to owners of the parent company". This long phrase in Japanese (Kanji) may have been difficult for the American-born Gemini 1.5 Pro to understand. However, it showed the value by calculating from other related numbers by itself, so I gave it a △ evaluation with 0.5 points added. It seems to be able to handle Japanese quite well. By the way, the average computation time was around 30 seconds to output the results, so it seems that the computation resources have also been optimized. Just to note, this material was published on May 10, 2023, so it's possible that it has been incorporated into the training data.





3. Information extraction from the Financial Inspection Manual 

Next, I would like to analyze the lengthy financial administrative document "Financial Inspection Manual" with Gemini 1.5 Pro. This is an unforgettable document for those who have been in the financial industry since 2000 in Japan, as it sparked discussions on establishing risk management and governance frameworks. The number of pages is 366, and the number of tokens is 220,526, exceeding 200,000 tokens. Gemini 1.5 Pro is the world's first generative AI that can take in documents of this volume all at once. I was able to upload it smoothly as shown below. It answers properly when asked questions.

Uploading the Financial Inspection Manual

I actually asked, "What should be done to verify the appropriateness of the credit rating assignment process?" and received the following response (showing only 1 and 2 here as it is long).  To verify the appropriateness of the credit rating assignment process, the following items need to be examined:

* 1. Validity of rating criteria

    * Do the rating criteria appropriately reflect the credit risk of the debtors?

    * Are the rating criteria defined in an objective and verifiable manner?  

    * Are the rating criteria consistent with the debtor classifications?

    * Are the rating criteria periodically reviewed?

* 2. Appropriateness of rating assignment procedures      

    * Are the rating assignment procedures carried out appropriately in accordance with the rating criteria?

    * Are the rating assignment procedures implemented in an accurate and verifiable objective manner?

    * Can the rating assignment procedures timely reflect changes in the debtor's situation? 

    * Are the rating assignment procedures performed by an independent department not influenced by sales departments, etc.?

It continues with 3: Model Utilization, 4: Data Maintenance, 5: Internal Audit, and so on. It is a comprehensive, holistic answer that follows the principles of risk management and is consistent with the descriptions around page 142 of the Financial Inspection Manual. Furthermore, the credit-risk-management content sits in the middle of this manual, and it has often been observed that generative AI accuracy drops in the middle parts of long inputs; Gemini 1.5 Pro, however, shows no such issue. Despite the specialized content, it provided a very good answer. The computation time was around 90 seconds, which is practical enough. It will surely make a good risk management assistant.





How was that? It seems that it can analyze materials over 200,000 tokens quite accurately even in Japanese. It might also be useful for internal document search tasks at work. Next time, I would like to challenge even more difficult tasks in English. Stay tuned!

 

1) Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, Gemini Team, Google

Copyright © 2024 Toshifumi Kuga. All rights reserved.


The Evolution of AI Accelerates: A Deep Dive into Google's "Gemini 1.5 Pro"

The pace of AI advancement is truly remarkable, and this year is no exception. Google has unveiled a new generative AI called "Gemini 1.5 Pro," which boasts a groundbreaking Mixture-of-Experts (MoE) architecture. Currently only available to a limited number of users, with broader testing to come, this technology presents intriguing breakthroughs that warrant a closer look.

 
 

1. Unprecedented Context Window of 1 Million Tokens

Gemini 1.5 Pro boasts a context window unheard of among existing LLMs, capable of processing up to 1 million tokens. Research has even demonstrated data ingestion of up to 10 million tokens. This represents a revolutionary breakthrough, considering that GPT-4's context window is limited to 128,000 tokens (1).

Comparison of Context Windows for Different LLMs

With such an extensive context window, Gemini 1.5 Pro can ingest an entire book at once. Currently, when creating RAG systems and referencing internal documents, chunking is necessary to accommodate the LLM's context window. However, with Gemini 1.5 Pro, this requirement is minimized, simplifying RAG development and operation. Furthermore, the model maintains high accuracy, even with such a large context window, achieving over 99% accuracy in information retrieval tests (see chart below).

 
 

2. Remarkable In-Context Learning Capabilities

The ability to process vast amounts of data is not the only noteworthy aspect of Gemini 1.5 Pro. It also excels at understanding and applying this information to various tasks. This is evident in its in-context learning capabilities, showcased in a Kalamang language translation task. The model was given a Kalamang grammar book and dictionary in its context window, with no fine-tuning, enabling it to translate between English and Kalamang.

English to Kalamang Translation Test

Gemini 1.5 Pro outperformed other models, achieving scores that rival those of human learners. This is an astonishing feat.

 
 

3. Towards Individualized Agents with Gemini 1.5 Pro

If a model can acquire translation capabilities simply by reading a grammar book, it stands to reason that it can also learn from knowledge systems in other domains and apply that knowledge to various tasks. In other words, Gemini 1.5 Pro has the potential to develop its own "frame of reference" that influences its understanding and values. The ability to incorporate a vast amount of data into its context through its extensive context window has significant implications in this regard. This is because it allows Gemini 1.5 Pro to potentially become an individualized agent with diverse perspectives in the future. The Kalamang translation experiment provides promising evidence of this potential.

Gemini 1.5 Pro is a remarkable advancement in AI technology, offering unprecedented capabilities in terms of context window size and in-context learning. "A host of improvements made across nearly the entire model stack (architecture, data, optimization and systems) allows Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra, while using significantly less training compute and being significantly more efficient to serve," according to the report (1). This is truly a testament to the rapid progress being made in the field of AI.

I am eager to experiment with Gemini 1.5 Pro once it becomes publicly available. Stay tuned for future updates!

1) Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, Gemini Team, Google

 

Copyright © 2024 Toshifumi Kuga. All rights reserved.
