GPT4

New Prompt Engineering Method from Google DeepMind Surpassing CoT in Accuracy !

Hello everyone, how have you been? There are only two months left in this year. It has truly been a year of incredible AI advancements, and it doesn't seem to be slowing down. Recently, Google DeepMind announced a new prompt-engineering method called "Step-Back Prompting (1)." Let's dive into the details right away.


  1. Step-Back Prompting:

Coming from DeepMind, one might initially think it's a complicated method, but the concept turned out to be quite simple. Instead of directly answering the question input by the user, the process involves:

  • Creating a more generalized and essential question (Stepback Question)

  • Answering the generated question (Stepback Answer)

  • Producing the final answer to the user based on the original question and the generated response (Final Answer)

The paper abstract has the following note which could give insights on the Stepback Answer:

"The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise. — Edsger W. Dijkstra"



2. Automatic Generation of "Stepback Question":

The key to this method seems to be the effective creation of the Stepback Question. However, constantly coming up with the Stepback Question could be challenging. While searching for an easier way, an excellent automatic generation method was introduced in LangChain's cookbook (2), which seems to apply Few shot learning.

By presenting these two examples to the model first, when a new user question like "Was ChatGPT around when Trump was president?" is posed,

As shown, a more general question, "When was ChatGPT developed?" is generated. Using this to guide the final answer results in higher accuracy. Although not always 100% correct based on my own trials, the accuracy does seem notably higher. According to the paper, it even achieves accuracy surpassing GPT-4 in some instances.



3. Anticipation for Future Developments:

Since "Step-Back Prompting" has a simple structure, it seems versatile for various applications. It can also be combined with existing techniques like CoT. Looking forward to its future growth, it seems highly compatible with LangChain and easy to implement, which will likely lead to an increase in use cases.

So, what do you think? I will continue to experiment and if there are any significant findings, I'll share them here. Stay tuned!

1) “TAKE A STEP BACK: EVOKING REASONING VIA ABSTRACTION IN LARGE LANGUAGE MODELS" Huaixiu Steven Zheng∗ Swaroop Mishra∗ Xinyun Chen Heng-Tze Cheng Ed H. Chi Quoc V Le Denny Zhou Google DeepMind” Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, Google DeepMind, 9 Oct 2023, https://arxiv.org/abs/2310.06117

2) langchain/cookbook/stepback-qa.ipynb https://github.com/langchain-ai/langchain/blob/master/cookbook/stepback-qa.ipynb

Copyright © 2023 Toshifumi Kuga. All right reserved

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.



GPT-4V is here. I tried it immediately and was amazed. It can do this too!

Sorry to keep you waiting. OpenAI's GPT-4 now comes with image recognition capabilities. To be precise, it was demonstrated when it debuted in March of this year, but it has only now been made available to users after half a year. I recently tried the new feature in ChatGPT+ and, in a word, it's incredible!

By the way, the image mentioned above was also created with a combination of GPT-4 and DALL-E3.

Now, let's start the experiment!


First, we'll start with recognizing mobile-phones. It can accurately count the number of mobile-phones. This is a piece of cake.

 

I thought flight information would be challenging, but it identified the destination impeccably. Since it's originally an excellent language model, it seems proficient in deriving meaning from images.

 

It can even read Osaka's Tsutenkaku tower. Local information is no problem.

 

For a change, I inserted an image of analysis results. It can read graphs effortlessly. This is impressive!

 

What shocked me was that it could easily count cars. Of course, it's not a specialized object detection model, so errors will always occur. I believe there were about 48 cars in this photo, but for general use, this margin of error seems acceptable. It's astonishing what it can do by just being given an image.

 

It can count cans, but the error is relatively significant. It might struggle with cluttered items.

 

It works well to read English text in an OCR-like manner.

 

It can also easily read the time displayed on electronic signboards.

How did you find it? Without any fine-tuning, it achieved this much. GPT-4V has just been launched, and various use cases are likely to emerge in the future. I look forward to introducing interesting examples here as they arise. Stay tuned!

 

Copyright © 2023 Toshifumi Kuga. All right reserved

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Fine-tuning GPT-3.5 with synthetic text generated by GPT-4. The accuracy has improved! In the future, we might not even need training text???

Hello, despite being in the latter half of September, it is still quite hot in Japan. The photos feel mismatched, but I'm deliberately sticking to the autumn theme, hoping it will get cooler soon. However, it might stay hot for the rest of the month.

Now, about the fine-tuning of ChatGPT-3.5 that I introduced the other day, it's certainly a hot topic. I think there is a strong demand in companies to specialize its performance for specific tasks. For this reason, we conducted an experiment assuming cases where you would want to proceed even without data at hand by generating synthetic text and then fine-tuning it.

 
  1. Experiment Details

Just like the previous experiment, we set a task to determine which financial product a given English-language complaint is about. They are complaints for the banking industry, so the task involves differentiating between six types of financial products such as mortgages and bank accounts. The data used for fine-tuning was minimal, with 100 samples for validation, just like last time. However, the training data is different this time. We generated customer complaint emails using GPT-4, and they are indistinguishable from real ones at a glance. GPT-4's performance is indeed impressive. We generated 15 similar customer complaints for training and then proceeded with fine-tuning.

synthetic text generated by GPT-4


2. Experiment Results

Since this was our first time using synthetic text, we were worried about the outcome, but we were able to confirm the effectiveness of fine-tuning as follows. Though the improvement isn't dramatic with just 15 samples, the accuracy for this task has improved compared to the base GPT-3.5, which had an accuracy of 0.5 to 0.55.

For more details on the experiment, please refer to this notebook.

 

3. Discussion

Fine-tuning with synthetic text was a method not even considered before, but with the arrival of GPT-4, it's becoming more realistic. There are several points to consider, such as the number of samples and how to write prompts, but the advantage of being able to start even without data is significant. Currently, GPT-4 is the only option for generation models, but it seems like new models like Gemini from Google will also be available next year. Technology is advancing rapidly, so we can expect a lot more in the future.

So, what did you think? We will continue to conduct various experiments and share our findings here. See you again soon!




Copyright © 2023 Toshifumi Kuga. All right reserved

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

The "Graph of Thoughts" might pave the way for new avenues in "human and LLM (Large Language Model) collaboration"!

Last week, I came across an intriguing paper on Large Language Models (LLMs). It appears to further develop the "Tree of Thoughts" (ToT) reasoning method I mentioned before, introducing a new technique called the "Graph of Thoughts" (GoT). Let's take a closer look.

 
  1. Features of GoT

First, let's compare various methods using the chart provided in the paper.

The far right shows the newly introduced GoT. Key differences from ToT may be that GoT allows for the merging of thoughts, and that users can define the shape of the GoT themselves. Incidentally, this merging is referred to as "aggregation" within the paper. While it may seem similar to ToT, the differences might be significant. Let's explore this in more detail.

 

2. Four Key Modules

GoT (Graph of Thoughts) has the following four crucial modules. Understanding these will clarify the differences between it and ToT (Tree of Thoughts).

  • Prompter

  • Parser

  • Scoring & Validation

  • Controller

Let's look at each one in detail. The Prompter, as the name suggests, is the module responsible for creating prompts. The Parser extracts the required information, or "thoughts," from the LLM (Large Language Model). You might think of the Prompter as handling input and the Parser as managing output. Scoring & Validation is the module that evaluates the gathered thoughts. This evaluation allows us to select the thoughts worth keeping. Finally, let's elaborate on the Controller. It is responsible for adding new thoughts or merging multiple thoughts, a process referred to as "transform." The Controller decides which transformations should be applied to which thoughts and passes this information to the Prompter. It is a critical module for executing problem-solving strategies. It has two functions: Graph of Operations (GoO), which is an execution plan for operations defined by the user, and Graph Reasoning State (GRS), which maintains the ongoing LLM reasoning process based on the thought state.


3. Considering the Number Sorting Problem

Since merely talking abstractly may not advance understanding, let's consider an actual task. This time we will consider sorting a list of 64 numbers in ascending order. Here, we'll see how the Graph of Operations (GoO) comes into play. In the chart below, each thought is tagged with operations like G (Generate), S (Sort), K (Keep the best), and A (Merge). Initially, we take a list of 64 numbers and divide it into four lists, each containing 16 numbers. Each of these lists is then sorted and evaluated. Only the list with the highest accuracy is kept. These are then further merged to form a new list containing 32 numbers. You'll see various operations functioning as the process progresses.

For those who want to delve deeper, detailed explanations are available here, particularly in the green part of the chart above.

It might feel complex at a glance, but it's user-controllable, allowing you to incorporate your domain knowledge. I am excited to conduct various experiments in the future.

Thank you for your attention! I will keep you updated on the progress of GoT and hope to share more with you soon. Stay tuned!









1) "Graph of Thoughts: Solving Elaborate Problems with Large Language Models",  Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler, 21 Aug 2023, https://arxiv.org/abs/2308.09687v2







Copyright © 2023 Toshifumi Kuga. All right reserved




Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

“Function calling” is a game changer as GPT can access outside and be converted to our agents easily!

Today, I want to create web-site with a description of the Japanese sweets collection, just like “Dorayaki“ in the picture above. So I ordered my AI agent to create an awesome web-site. But is it really possible? I am sure yes, it is!. As you know, OpenAI created GPT, which is very intelligent large language model (LLM). On 13 June 2023, “Function calling” was introduced by OpenAI. It can bridge GPT to other systems, APIs and functions outside. Let me explain step by step!

 

1.What is the advantage of “Function calling”?

Function calling makes it easy for GPT to access functions outside. For example, when you want to create a web-site where Japanese sweets are explained to customers, you need to connect GPT to the function that can write code of web-site with HTML/CSS. With “Function calling”, GPT can call this function and pass the parameters, such as “explanations of Japanese sweets” to this function. Official documents says “The latest models (gpt-3.5-turbo-0613 and gpt-4-0613) have been fine-tuned to both detect when a function should to be called (depending on the input) and to respond with JSON that adheres to the function signature.”

 

2. The list of “functions” is key to set “function calling” up

“Function calling”looks great! But how can we implement in our code. I think it is so simple. Just prepare the list of functions. This should have

  • "name"

  • "description"

  • "parameters" : "type" , "properties", "required"

In ChatCompletion.create, we should add “functions=functions” because we want to call the function. The other part of the code has not changed so much. The code below shows us an example of functions, which comes from Official documents. Please look at these docs for the details if needed.

 

3. Let us see how the generated web looks like

OK, it is the time that we see the result from our agent. I instruct "Create a web-site for a pretty Japanese sweets collection" to our agent. Text of “title” and “explanation” are generated by GPT3.5-turbo and are sent to the function that creates a web. Here is the result. All are written in Japanese. The title means “a pretty Japanese sweets collection". The sentences of the explanation are pretty good! I do not think there is a need to fix or modify these sentences at all.

If you want to know more details with the code, you can see it here.

https://github.com/TOSHISTATS/Wagashi-Collection-Web-Generation-agent-by-GPT3.5#readme

 

Hope you can understand how AI agents work. I think potential use-cases of “Function calling”are limitless. I tried several use cases by “Function calling” and found that it can be a game changer to develop LLM application systems. I would like to update my article about AI agents by OpenAI GPT soon. Stay tuned!

 
 
 

Copyright ©  2023  Toshifumi Kuga.  All right reserved

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

"Large Language Models as Tool Makers" is a good idea to enhance the accuracy of the model while keeping the cost of computation low!

Since GPT4 , one of the most intelligent Large Language Model (LLM) in the world, was released on 14 March 2023, many people are surprised because it is very intelligent. This is great. But there is one problem for users. It is not free service. Users should pay the fee of GPT4, based on how much tokens they use on GPT4. Therefore when we continue to use GPT4 all day long, it must be very expensive. Of course we prefer more intelligence. But we should consider the cost of it. There is a trade off between them. What should we do to solve it? Last week, I found a good research paper called “Large Language Models as Tool Makers“(1). All charts below come from this awesome paper. The idea is simple and looks promising to tackle these problems. So let me explain more details.

 

1. Tool user and Tool maker

The basic idea is as follows. Let us have two LLMs, one is called “Tool maker” and another is called “Tool user”. When a new task comes to us, Tool maker create “tools” for this task. Once “tools” are ready, they are passed to Tool user for inference. These tools are reusable to solve similar tasks in the future. So GPT4 can be used only as Tool maker as it should be more intelligent. Light weights models, such as GPT3.5, can be used as a Tool user. Then we can reduce the cost of computation for inference. It sounds great! The chart below explains how it works.

 

2. How can we create tools for our task?

As we want to keep the accuracy of the results, Tool maker should create better tools. There are three steps to do that.

• Tool Proposing: The tool maker generates a Python function to solve the given task. If the proposed tool makes errors, the tool maker makes another tool.

• Tool Verification: The tool maker generates unit tests using validation samples and subsequently executes these tests. 3 validation samples are prepared here. If the tool fails any of these tests, the tool makes an attempt to fix the issues. The paper explains it as follows “This stage fulfills two key roles: 1) it provides examples that demonstrate how to convert natural language questions into function calls, and 2) it verifies the tool’s reliability, enabling the entire process to be fully automated.”

• Tool Wrapping: If the execution or verification passes the preset threshold, tool maker prepares the wrapped tool for tool user. This step involves wrapping up the function code and providing demonstrations of how to convert a task into a function call. This final product is then ready for use by the tool user.

The chart below shows us how it works.

Once the tool is ready, the tool is passed to the tool-user. The tool user should solve various instances of the task by using tools made by the Tool maker. The prompt for this stage is the wrapped tool which contains the function for solving the task and demonstrations of how to convert a task query into a function call. With the demonstrations, tool user can then generate the required function call in an in-context learning fashion. The function calls are then executed to solve the task. The chart below shows us how the processes from Tool maker to Tool user are going.

 

3. Can we confirm if tools that fit our tasks are available or not?

Here, we use a third LLM called “the dispatcher”. Because the dispatcher maintains a record of existing tools produced by the tool maker, it can confirm if tools that fit our task are available at the time when a task is received, If no appropriate tool is found, the dispatcher identifies the instance as a new task and solves the instance with a powerful model, such as GPT4. The dispatcher’s workflow is shown here.

 

That is it! This is a major part of “Large Language Models as Tool Makers” or “LATM” in short. By LATM, we might reduce computational cost for heavy models, such as GPT4. It is amazing! Hope you enjoy the article today. I will update new technologies around LLM in the near future. Stay tuned!

 

1) “Large Language Models as Tool Makers” Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou, 26 May 2023, https://arxiv.org/abs/2305.17126



Copyright  ©  2023  Toshifumi Kuga  All right reserved



Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.