Deep Learning

New Prompt Engineering Method from Google DeepMind Surpassing CoT in Accuracy !

Hello everyone, how have you been? There are only two months left in this year. It has truly been a year of incredible AI advancements, and it doesn't seem to be slowing down. Recently, Google DeepMind announced a new prompt-engineering method called "Step-Back Prompting (1)." Let's dive into the details right away.


  1. Step-Back Prompting:

Coming from DeepMind, one might initially think it's a complicated method, but the concept turned out to be quite simple. Instead of directly answering the question input by the user, the process involves:

  • Creating a more generalized and essential question (Stepback Question)

  • Answering the generated question (Stepback Answer)

  • Producing the final answer to the user based on the original question and the generated response (Final Answer)

The paper abstract has the following note which could give insights on the Stepback Answer:

"The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise. — Edsger W. Dijkstra"



2. Automatic Generation of "Stepback Question":

The key to this method seems to be the effective creation of the Stepback Question. However, constantly coming up with the Stepback Question could be challenging. While searching for an easier way, an excellent automatic generation method was introduced in LangChain's cookbook (2), which seems to apply Few shot learning.

By presenting these two examples to the model first, when a new user question like "Was ChatGPT around when Trump was president?" is posed,

As shown, a more general question, "When was ChatGPT developed?" is generated. Using this to guide the final answer results in higher accuracy. Although not always 100% correct based on my own trials, the accuracy does seem notably higher. According to the paper, it even achieves accuracy surpassing GPT-4 in some instances.



3. Anticipation for Future Developments:

Since "Step-Back Prompting" has a simple structure, it seems versatile for various applications. It can also be combined with existing techniques like CoT. Looking forward to its future growth, it seems highly compatible with LangChain and easy to implement, which will likely lead to an increase in use cases.

So, what do you think? I will continue to experiment and if there are any significant findings, I'll share them here. Stay tuned!

1) “TAKE A STEP BACK: EVOKING REASONING VIA ABSTRACTION IN LARGE LANGUAGE MODELS" Huaixiu Steven Zheng∗ Swaroop Mishra∗ Xinyun Chen Heng-Tze Cheng Ed H. Chi Quoc V Le Denny Zhou Google DeepMind” Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, Google DeepMind, 9 Oct 2023, https://arxiv.org/abs/2310.06117

2) langchain/cookbook/stepback-qa.ipynb https://github.com/langchain-ai/langchain/blob/master/cookbook/stepback-qa.ipynb

Copyright © 2023 Toshifumi Kuga. All right reserved

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.



"Large Language Models as Tool Makers" is a good idea to enhance the accuracy of the model while keeping the cost of computation low!

Since GPT4 , one of the most intelligent Large Language Model (LLM) in the world, was released on 14 March 2023, many people are surprised because it is very intelligent. This is great. But there is one problem for users. It is not free service. Users should pay the fee of GPT4, based on how much tokens they use on GPT4. Therefore when we continue to use GPT4 all day long, it must be very expensive. Of course we prefer more intelligence. But we should consider the cost of it. There is a trade off between them. What should we do to solve it? Last week, I found a good research paper called “Large Language Models as Tool Makers“(1). All charts below come from this awesome paper. The idea is simple and looks promising to tackle these problems. So let me explain more details.

 

1. Tool user and Tool maker

The basic idea is as follows. Let us have two LLMs, one is called “Tool maker” and another is called “Tool user”. When a new task comes to us, Tool maker create “tools” for this task. Once “tools” are ready, they are passed to Tool user for inference. These tools are reusable to solve similar tasks in the future. So GPT4 can be used only as Tool maker as it should be more intelligent. Light weights models, such as GPT3.5, can be used as a Tool user. Then we can reduce the cost of computation for inference. It sounds great! The chart below explains how it works.

 

2. How can we create tools for our task?

As we want to keep the accuracy of the results, Tool maker should create better tools. There are three steps to do that.

• Tool Proposing: The tool maker generates a Python function to solve the given task. If the proposed tool makes errors, the tool maker makes another tool.

• Tool Verification: The tool maker generates unit tests using validation samples and subsequently executes these tests. 3 validation samples are prepared here. If the tool fails any of these tests, the tool makes an attempt to fix the issues. The paper explains it as follows “This stage fulfills two key roles: 1) it provides examples that demonstrate how to convert natural language questions into function calls, and 2) it verifies the tool’s reliability, enabling the entire process to be fully automated.”

• Tool Wrapping: If the execution or verification passes the preset threshold, tool maker prepares the wrapped tool for tool user. This step involves wrapping up the function code and providing demonstrations of how to convert a task into a function call. This final product is then ready for use by the tool user.

The chart below shows us how it works.

Once the tool is ready, the tool is passed to the tool-user. The tool user should solve various instances of the task by using tools made by the Tool maker. The prompt for this stage is the wrapped tool which contains the function for solving the task and demonstrations of how to convert a task query into a function call. With the demonstrations, tool user can then generate the required function call in an in-context learning fashion. The function calls are then executed to solve the task. The chart below shows us how the processes from Tool maker to Tool user are going.

 

3. Can we confirm if tools that fit our tasks are available or not?

Here, we use a third LLM called “the dispatcher”. Because the dispatcher maintains a record of existing tools produced by the tool maker, it can confirm if tools that fit our task are available at the time when a task is received, If no appropriate tool is found, the dispatcher identifies the instance as a new task and solves the instance with a powerful model, such as GPT4. The dispatcher’s workflow is shown here.

 

That is it! This is a major part of “Large Language Models as Tool Makers” or “LATM” in short. By LATM, we might reduce computational cost for heavy models, such as GPT4. It is amazing! Hope you enjoy the article today. I will update new technologies around LLM in the near future. Stay tuned!

 

1) “Large Language Models as Tool Makers” Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou, 26 May 2023, https://arxiv.org/abs/2305.17126



Copyright  ©  2023  Toshifumi Kuga  All right reserved



Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.