
GPT-4V is here. I tried it immediately and was amazed. It can do all of this!

Sorry to keep you waiting. OpenAI's GPT-4 now comes with image recognition capabilities. To be precise, the capability was demonstrated when GPT-4 debuted in March of this year, but only now, half a year later, has it been made available to users. I recently tried the new feature in ChatGPT Plus and, in a word, it's incredible!

By the way, the image above was also created with a combination of GPT-4 and DALL-E 3.

Now, let's start the experiment!


First, we'll start with recognizing mobile phones. It can accurately count the number of phones in the image. This is a piece of cake.
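I ran these experiments in the ChatGPT Plus interface, but for readers who prefer the API, a minimal sketch might look like the following, assuming OpenAI's Python SDK (v1.x); the model id and image URL are placeholders I chose for illustration, not tested values.

```python
# A minimal sketch of asking the vision-enabled GPT-4 the same question
# through OpenAI's Python SDK. The model id and image URL are placeholders;
# check the current API documentation for the exact names.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many mobile phones are in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/phones.jpg"},  # placeholder
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```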

 

I thought flight information would be challenging, but it identified the destination impeccably. Since it is an excellent language model to begin with, it seems equally proficient at deriving meaning from images.

 

It can even recognize Osaka's Tsutenkaku tower. Local information is no problem.

 

For a change, I gave it an image of analysis results. It can read graphs effortlessly. This is impressive!

 

What shocked me was that it could easily count cars. Of course, it is not a specialized object detection model, so some error is inevitable. I believe there were about 48 cars in this photo, but for general use this margin of error seems acceptable. It's astonishing what it can do when simply given an image.

 

It can count cans too, but the error is relatively large. It may struggle with cluttered items.

 

It also reads English text well, in an OCR-like manner.

 

It can also easily read the time displayed on electronic signboards.

How did you find it? It achieved all this without any fine-tuning. GPT-4V has only just launched, and various use cases are likely to emerge in the future. I look forward to introducing interesting examples here as they arise. Stay tuned!

 


"Large Language Models as Tool Makers" is a good idea to enhance the accuracy of the model while keeping the cost of computation low!

Since GPT-4, one of the most intelligent large language models (LLMs) in the world, was released on 14 March 2023, many people have been amazed by how intelligent it is. This is great. But there is one problem for users: it is not a free service. Users pay a fee based on how many tokens they use, so running GPT-4 all day long gets very expensive. Of course we prefer more intelligence, but we should also consider its cost; there is a trade-off between the two. What can we do to solve this? Last week, I found a good research paper called "Large Language Models as Tool Makers" (1). All charts below come from this awesome paper. The idea is simple and looks promising for tackling this problem. So let me explain it in more detail.

 

1. Tool user and Tool maker

The basic idea is as follows. We have two LLMs: one is called the "tool maker" and the other the "tool user". When a new task arrives, the tool maker creates "tools" for it. Once the tools are ready, they are passed to the tool user for inference. These tools are reusable for solving similar tasks in the future. So GPT-4 can be reserved for the tool-maker role, where more intelligence is needed, while lightweight models such as GPT-3.5 serve as the tool user. That way we can reduce the cost of computation at inference time. It sounds great! The chart below explains how it works.
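To make the division of labor concrete, here is a minimal sketch in Python, assuming OpenAI's chat API. The model ids, prompts, and the chat helper are my own illustrative assumptions, not code from the paper.

```python
# A minimal sketch of the LATM division of labor, assuming OpenAI's
# chat API. Model ids and prompts are illustrative, not the paper's code.
from openai import OpenAI

client = OpenAI()

def chat(model: str, prompt: str) -> str:
    """One-shot chat completion helper."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Tool maker: the expensive, more capable model writes a reusable tool once.
tool_code = chat(
    "gpt-4",
    "Write a Python function solve(task) that solves the following task: ...\n"
    "Return only the code.",
)

# Tool user: a cheaper model only has to turn each new instance of the
# task into a call to the ready-made tool, which is a much easier job.
function_call = chat(
    "gpt-3.5-turbo",
    f"Here is a tool:\n{tool_code}\n"
    "Convert this task instance into a call to solve(): ...",
)
```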

 

2. How can we create tools for our task?

Since we want to maintain the accuracy of the results, the tool maker should create high-quality tools. There are three steps to do that.

• Tool Proposing: The tool maker generates a Python function to solve the given task. If the proposed tool produces errors, the tool maker generates another one.

• Tool Verification: The tool maker generates unit tests using validation samples (three samples are prepared here) and then executes these tests. If the tool fails any of them, the tool maker attempts to fix the issues. The paper explains it as follows: "This stage fulfills two key roles: 1) it provides examples that demonstrate how to convert natural language questions into function calls, and 2) it verifies the tool's reliability, enabling the entire process to be fully automated."

• Tool Wrapping: If execution or verification passes the preset threshold, the tool maker prepares the wrapped tool for the tool user. This step involves wrapping up the function code and providing demonstrations of how to convert a task into a function call. This final product is then ready for use by the tool user.

The chart below shows how these three stages work, and a small code sketch of them follows as well.
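Here is a rough sketch of the three stages, reusing the chat() helper from the earlier sketch. The prompt texts, the retry limit, and the assumption that the generated tests are plain assert statements are all mine, not the paper's.

```python
# A rough sketch of the tool maker's three stages (propose, verify, wrap),
# reusing the chat() helper defined above.

def propose_tool(task_examples: str) -> str:
    """Tool Proposing: ask the strong model for a Python function."""
    return chat("gpt-4", f"Write a Python function solving:\n{task_examples}")

def verify_tool(tool_code: str, validation_samples: str) -> bool:
    """Tool Verification: generate unit tests from validation samples and run them."""
    tests = chat(
        "gpt-4",
        f"Write assert-based tests for this function:\n{tool_code}\n"
        f"Use these 3 validation samples:\n{validation_samples}",
    )
    try:
        exec(tool_code + "\n" + tests, {})  # any failing assert raises here
        return True
    except Exception:
        return False

def wrap_tool(tool_code: str, demonstrations: str) -> str:
    """Tool Wrapping: package the code plus call demonstrations for the tool user."""
    return f"{tool_code}\n\n# How to call this tool:\n{demonstrations}"

def make_tool(task_examples: str, validation_samples: str, max_attempts: int = 3) -> str:
    """Retry proposing until a tool passes verification, then wrap it."""
    for _ in range(max_attempts):
        code = propose_tool(task_examples)
        if verify_tool(code, validation_samples):
            demos = chat("gpt-4", f"Show example calls for:\n{code}")
            return wrap_tool(code, demos)
    raise RuntimeError("no tool passed verification")
```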

Once the tool is ready, it is passed to the tool user, who solves various instances of the task using the tools made by the tool maker. The prompt for this stage is the wrapped tool, which contains the function for solving the task and demonstrations of how to convert a task query into a function call. With these demonstrations, the tool user can generate the required function call in an in-context-learning fashion. The function calls are then executed to solve the task. The chart below shows how the process flows from tool maker to tool user.
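To illustrate what a wrapped tool might look like, here is a hand-written stand-in for a meeting-scheduling style task (one of the task types used in the paper). The solve function and the demonstrations are my own illustrative assumptions, not output from the actual system.

```python
# An illustrative wrapped tool: function code plus demonstrations of how
# to convert a question into a function call.
WRAPPED_TOOL = '''
def solve(intervals):
    """Return a time slot common to all (start, end) intervals, or None."""
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None

# Demonstrations of converting a question into a function call:
# Q: A is free 9-12 and B is free 10-14. When can they meet?
# A: solve([(9, 12), (10, 14)])
'''

# The tool user receives WRAPPED_TOOL in its prompt and, for a new
# question, only needs to emit the matching call (reusing chat() from above):
call = chat(
    "gpt-3.5-turbo",
    WRAPPED_TOOL + "\nQ: A is free 8-11 and B is free 9-13. "
    "When can they meet? Reply with only the function call.",
)

# Execute the generated call against the tool's code.
# (In practice, sanitize model output before eval.)
namespace = {}
exec(WRAPPED_TOOL, namespace)
answer = eval(call, namespace)  # e.g. solve([(8, 11), (9, 13)]) -> (9, 11)
print(answer)
```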

 

3. How can we confirm whether a tool that fits our task is available?

Here, we use a third LLM called the "dispatcher". Because the dispatcher maintains a record of existing tools produced by the tool maker, it can check whether a suitable tool already exists when a task is received. If no appropriate tool is found, the dispatcher identifies the instance as a new task and solves it with a powerful model, such as GPT-4. The dispatcher's workflow is shown here.
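A minimal sketch of the dispatcher might look like this, again reusing the chat() helper above. The in-memory registry and the matching prompt are my own assumptions; the paper notes the dispatcher itself can be a lightweight model.

```python
# A minimal sketch of the dispatcher: route each incoming task instance
# either to an existing tool (cheap model) or to a powerful model.
TOOL_REGISTRY: dict[str, str] = {}  # task name -> wrapped tool

def dispatch(task_instance: str) -> str:
    # Ask a cheap model whether an existing tool fits this instance.
    known = ", ".join(TOOL_REGISTRY) or "(none)"
    match = chat(
        "gpt-3.5-turbo",
        f"Known tools: {known}\nTask: {task_instance}\n"
        "Reply with the matching tool name, or NEW if none fits.",
    ).strip()
    if match in TOOL_REGISTRY:
        # Existing task: the cheap tool user solves it with the stored tool.
        return chat("gpt-3.5-turbo", TOOL_REGISTRY[match] + "\n" + task_instance)
    # New task: fall back to the powerful model (and, separately, the
    # tool maker can be asked to create a tool for next time).
    return chat("gpt-4", task_instance)
```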

 

That's it! This is the main idea of "Large Language Models as Tool Makers", or "LATM" for short. With LATM, we might reduce the computational cost of heavy models such as GPT-4. It's amazing! I hope you enjoy today's article. I will report on new technologies around LLMs again in the near future. Stay tuned!

 

1) "Large Language Models as Tool Makers", Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou, 26 May 2023, https://arxiv.org/abs/2305.17126



Copyright © 2023 Toshifumi Kuga. All rights reserved.



Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.