fine-tuning

"REST MEETS REACT" is a new prompt-engineering method using synthetic data. It holds immense potential for enhancing AI without relying on human-generated data

Happy New Year! Thank you for your continued support. Promptly, Google DeepMind has announced a new, advanced prompt engineering method suitable for the new year. It is a paper titled "REST MEETS REACT: SELF-IMPROVEMENT FOR MULTI-STEP REASONING LLM AGENT"(1). It incorporates fine-tuning with synthetic data, which looks promising! Let's get started.

 

1.Prompt Structure

This prompt is designed with a web Q&A system in mind that answers complex questions. The structure is as follows:

The blue part in the figure above represents the flow of the agent described in the prompt, aiming to answer complex questions using web search. In the latter half, "Relevance self-check" and "Grounding self-check" are functions for the agent to check its own answers. It's a self-check function. For a detailed explanation of the entire flow, please refer to the paper.

 

2. "Reward Model" - The Key to Success

Now, let's explain the core part of self-improvement. In a nutshell, it's about "creating new high-quality data and fine-tuning the model with it." . This function consists of three parts:

  • Grow: Start with a model capable of running Search Agent, using Google PaLM 2-L model for this purpose. Trajectories are collected based on a selected set of 2000 public questions. Trajectory, though an unfamiliar term, refers to the reasoning process and is commonly used in reinforcement learning.

  • Improve: Convert trajectories into data for fine-tuning, using the Reward model to select only high-quality data. No external data, like labels, are used.

  • Fine-tuning: Fine-tune a new model of the same size with this new data, ensuring it performs better than the original.

This process is repeated with the better model using the new data. As a result, accuracy improves while maintaining the original data, without adding external data. Therefore, the accuracy of the Reward model in ranking is crucial. The Reward model is constructed as a set of prompts in this paper. Let's look more closely at these prompts, showing only the initial part.

  • The goal of this rating is to filter out bad actions so that they'll be excluded from the fine-tuning dataset.

  • Overall, we want the agent to produce relevant and grounded answers with minimal steps. Anything deviating from this goal is considered bad.

  • If any element (thoughts, comments, etc.) is empty, then it's automatically bad.

"Filter out" indicates a method of discarding items that don't meet the standards and adopting only the high-quality data that remains. Please see the paper (p19) for details.

 




3.Improve Accuracy with Synthetic Data

Papers including this one have been published in late 2023, focusing on using the Reward model to create high-quality synthetic data for model fine-tuning and accuracy improvement. Vigorous research is expected to continue in 2024, yielding various results. Especially in the LLM field, collecting high-quality training data is becoming increasingly difficult, and fine-tuning with synthetic data is anticipated as a solution.


 


How was it? The improvement in model accuracy with synthetic data is expected to be a very effective development method for startups like us, who cannot collect vast amounts of data independently. Our blog will continue to follow these synthetic data and other technological innovations, so stay tuned. Wishing you a great year!






1) “REST MEETS REACT: SELF-IMPROVEMENT FOR MULTI-STEP REASONING LLM AGENT" Renat Aksitov†1 , Sobhan Miryoosefi†1 , Zonglin Li†1 , Daliang Li†1 , Sheila Babayan†2 , Kavya Kopparapu†2 , Zachary Fisher1 , Ruiqi Guo1 , Sushant Prakash1 , Pranesh Srinivasan3 , Manzil Zaheer2 , Felix Yu1 , and Sanjiv Kumar1,    1Google Research, 2Google DeepMind, 3Google †Core contributors, 15 Dec 2023, https://arxiv.org/abs/2312.10003





Copyright © 2023 Toshifumi Kuga. All right reserved





Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Fine-tuning GPT-3.5 with synthetic text generated by GPT-4. The accuracy has improved! In the future, we might not even need training text???

Hello, despite being in the latter half of September, it is still quite hot in Japan. The photos feel mismatched, but I'm deliberately sticking to the autumn theme, hoping it will get cooler soon. However, it might stay hot for the rest of the month.

Now, about the fine-tuning of ChatGPT-3.5 that I introduced the other day, it's certainly a hot topic. I think there is a strong demand in companies to specialize its performance for specific tasks. For this reason, we conducted an experiment assuming cases where you would want to proceed even without data at hand by generating synthetic text and then fine-tuning it.

 
  1. Experiment Details

Just like the previous experiment, we set a task to determine which financial product a given English-language complaint is about. They are complaints for the banking industry, so the task involves differentiating between six types of financial products such as mortgages and bank accounts. The data used for fine-tuning was minimal, with 100 samples for validation, just like last time. However, the training data is different this time. We generated customer complaint emails using GPT-4, and they are indistinguishable from real ones at a glance. GPT-4's performance is indeed impressive. We generated 15 similar customer complaints for training and then proceeded with fine-tuning.

synthetic text generated by GPT-4


2. Experiment Results

Since this was our first time using synthetic text, we were worried about the outcome, but we were able to confirm the effectiveness of fine-tuning as follows. Though the improvement isn't dramatic with just 15 samples, the accuracy for this task has improved compared to the base GPT-3.5, which had an accuracy of 0.5 to 0.55.

For more details on the experiment, please refer to this notebook.

 

3. Discussion

Fine-tuning with synthetic text was a method not even considered before, but with the arrival of GPT-4, it's becoming more realistic. There are several points to consider, such as the number of samples and how to write prompts, but the advantage of being able to start even without data is significant. Currently, GPT-4 is the only option for generation models, but it seems like new models like Gemini from Google will also be available next year. Technology is advancing rapidly, so we can expect a lot more in the future.

So, what did you think? We will continue to conduct various experiments and share our findings here. See you again soon!




Copyright © 2023 Toshifumi Kuga. All right reserved

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.