It's been three weeks since OpenAI's o1 preview unveiled a new paradigm for generative AI. Its accuracy on logical tasks during inference is remarkable. Unfortunately, the mechanism isn't public, but it would be fascinating to know the state of the art in related technologies. Luckily, a helpful research paper (1) has been released by the University of California, Berkeley and Google DeepMind, which I'd like to introduce here and use to speculate on the mechanisms behind o1 preview. Let's begin!

1. What We Learned from OpenAI's o1 Preview and the Latest Research Papers

According to the OpenAI website (2), we've learned two key things. First, o1 preview leverages reinforcement learning for enhanced performance. Second, it emphasizes "chain of thought" and prioritizes test-time computing. However, this information alone isn't enough for a fruitful technical discussion. Therefore, let's examine recent research papers on natural language processing using reinforcement learning. From several papers, I've selected one related to hierarchical reinforcement learning. This algorithm is reportedly effective for "multi-turn" conversations that extend over multiple exchanges. As you may have experienced, when using ChatGPT or similar models to obtain information, rarely do you get the desired results in a single attempt; often, several interactions with the generative AI are required. In such cases, the number of generated tokens or words steadily increases, creating a challenging situation for efficient training of the generative AI. This new algorithm aims to address this challenge. A possible application is the task of "maximizing customer satisfaction at the end of a multi-turn conversation with a generative AI assistant."

2. Hierarchical Reinforcement Learning

The algorithm presented in this paper (1) is called "hierarchical reinforcement learning" and is characterized by the following hierarchical structure:

The most notable aspect here is the two-tiered structure consisting of the utterance level and the token level. Separating utterance-level language processing from the processing of individual tokens, the minimal units of action, is highly effective for efficient training. Typically, generative AI operates on "next token prediction," where it predicts the most likely next token given the prompt. Its accuracy is remarkable, often generating more polished language than I can. However, in "multi-turn" scenarios with continuous utterances, the number of tokens increases, making training more challenging. This is where reinforcement learning at the utterance level comes into play, with rewards also defined at this level. For example, a reward could be devised where "+1" is awarded for successfully retrieving necessary information by searching a website and "0" for failure. Based on this reward, an action-value function is calculated at the utterance level and then used for reinforcement learning at the token level. This reportedly enables significantly more efficient training. For further details, please refer to (1).
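To make the two-tier idea concrete, here is a minimal sketch in Python. It is not the implementation from (1): the toy dialogue, the tabular critic, the mean-value baseline, and every name (`UtteranceCritic`, `token_level_signal`, `GAMMA`, `ALPHA`) are assumptions made for illustration. An action-value function is learned per utterance from the sparse "+1"/"0" reward, and each token of an utterance then inherits that utterance's advantage as its training signal.

```python
from collections import defaultdict

GAMMA = 0.9  # discount applied once per utterance (assumed value)
ALPHA = 0.5  # learning rate of the tabular critic (assumed value)


class UtteranceCritic:
    """Tabular action-value function Q(state, utterance) at the utterance level."""

    def __init__(self):
        self.q = defaultdict(float)

    def value(self, state, candidates):
        # V(state) approximated by the best Q among the candidate utterances.
        return max((self.q[(state, u)] for u in candidates), default=0.0)

    def td_update(self, state, utterance, reward, next_state, next_candidates, done):
        # The reward arrives once per utterance, not once per token.
        target = reward
        if not done:
            target += GAMMA * self.value(next_state, next_candidates)
        key = (state, utterance)
        self.q[key] += ALPHA * (target - self.q[key])


def token_level_signal(critic, state, utterance, candidates):
    """Hand every token of the utterance the utterance-level advantage."""
    baseline = sum(critic.q[(state, u)] for u in candidates) / len(candidates)
    advantage = critic.q[(state, utterance)] - baseline
    return [(token, advantage) for token in utterance.split()]


# Toy two-turn dialogue (hypothetical): ask first, then either search or chat.
critic = UtteranceCritic()
candidates = ["search the website", "say hello"]
for _ in range(20):
    critic.td_update("need_info", "search the website", 1.0, None, [], done=True)
    critic.td_update("need_info", "say hello", 0.0, None, [], done=True)
    critic.td_update("start", "ask what the user needs", 0.0,
                     "need_info", candidates, done=False)

signal = token_level_signal(critic, "need_info", "search the website", candidates)
print(signal)
```

In this toy setup the critic only has to reason over two decision points per episode, while a token-level learner would face one decision per generated token. A real system would replace the table with a learned network and use the per-token signal inside a policy-gradient update.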

3. Flexibility in Reinforcement Learning Design

As we've seen, reinforcement learning for multi-turn settings offers flexibility and a high degree of design freedom. While the hierarchical design above separates utterance-level and token-level training, multi-turn reinforcement learning appears to be employed for other enhancements as well. For example, a research paper (3) from Google DeepMind uses a multi-turn online reinforcement learning approach to improve self-correction capabilities:

“Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data.”

It's exciting to anticipate the various use cases that will likely emerge in the future. For more details, please refer to (3).

What do you think? The acclaim for o1-preview seems to be growing daily. While it's unlikely that the details of its mechanism will be revealed soon, speculating about it from the outside is crucial for understanding AGI. Next time, I'd like to consider the application examples of o1-preview. That's all for today. Stay tuned!

1) ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL, Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar, University of California, Berkeley, Google DeepMind, Feb 29, 2024
2) Introducing OpenAI o1, OpenAI, Sep 12, 2024
3) Training Language Models to Self-Correct via Reinforcement Learning, Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, JD Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust, Google DeepMind, Sep 19, 2024

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Last week, we introduced OpenAI o1(1). Despite still being in preview, it performs strongly, and as the leaderboard below(2) suggests, it is widely regarded as far ahead of other models, especially in mathematics and coding. In this article, we'd like to explore why OpenAI o1 demonstrates higher accuracy than existing generative AI models like GPT-4.

Scores of various generative AIs in the mathematics field

1. Chain of Thought is Key

The star of the OpenAI o1 model is "Chain of Thought." This refers to "a series of intermediate reasoning steps," which was previously considered an important element of prompts created by users for existing generative AI models. By incorporating "Chain of Thought" into prompts, users enabled generative AI to engage in deeper and broader thinking before providing an answer, thereby improving accuracy. "Chain of Thought" became known to the public through a 2022 research paper(3). Please refer to it for details.
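As a small illustration of what "incorporating Chain of Thought into prompts" looks like in practice, the sketch below builds a direct prompt and a few-shot Chain of Thought prompt for the same question. The function names and the worked example are made up for this article, and no model is called; only the prompt text is constructed.

```python
from typing import List, Tuple


def direct_prompt(question: str) -> str:
    """Ask for the answer with no intermediate reasoning."""
    return f"Q: {question}\nA:"


def chain_of_thought_prompt(
    question: str, worked_examples: List[Tuple[str, str, str]]
) -> str:
    """Prepend worked examples whose answers spell out the reasoning steps."""
    parts = []
    for ex_question, ex_reasoning, ex_answer in worked_examples:
        parts.append(
            f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}."
        )
    # The final question is left open so the model continues in the same style.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical worked example written for this sketch.
examples = [
    (
        "A shop sells pens at 3 dollars each. How much do 4 pens cost?",
        "Each pen costs 3 dollars. 4 pens cost 4 * 3 = 12 dollars.",
        "12",
    )
]
question = "A box holds 6 eggs. How many eggs are in 7 boxes?"

print(direct_prompt(question))
print("---")
print(chain_of_thought_prompt(question, examples))
```

The only difference between the two prompts is that the second one shows the model an answer written as a series of intermediate steps, which nudges it to reason step by step before stating its own answer.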

2. OpenAI o1 Can Generate Chain of Thought Independently

OpenAI o1 can generate Chain of Thought internally on its own. Users don't need to devise Chain of Thought themselves; it's generated automatically. This is why it achieved high accuracy in mathematics and coding. Unfortunately, OpenAI seems to have decided not to disclose the Chain of Thought itself. Users can only see a summary of it. If you're like most users, you're probably thinking, "I'd really like to see that!" We hope that OpenAI will change its policy and release it in the future.

3. Creating a Reward Model

OpenAI has released very little information about what we'll discuss from here on. Please note that the following is based on speculation drawn from previously published research papers and information shared by OpenAI researchers. To enable generative AI to automatically generate task-specific Chain of Thought for practical use, we must evaluate whether the generated Chain of Thought is actually correct. This is where the Reward model comes into play. A 2023 research paper(4) from OpenAI provides a detailed explanation of how to train a Reward model, so let's look to it for clues.

The data for training the Reward model takes the form of Chain of Thought, as shown below. This research paper limits the tasks to mathematics. Since it's challenging for humans to manually create Chain of Thought for each task, they are generated automatically using GPT-4. This model is called the Generator. Humans then label each step of the Chain of Thought produced by the Generator on a three-point scale (correct, incorrect, or neither). This completes the training data. In the example below, you can see that each step is assigned one of the three labels. Labeling a large amount of such data by hand must have been a considerable effort.
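The shape of this training data can be sketched as a small data structure. Everything here is illustrative: the class names, the sample problem, and the probabilities are assumptions, not data from (4). The scoring function multiplies per-step correctness probabilities, which is, as far as I understand it, how (4) turns step-level predictions into a score for a whole solution.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Optional


@dataclass
class LabeledStep:
    text: str
    label: str  # one of "correct", "incorrect", "neither"


@dataclass
class LabeledSolution:
    problem: str
    steps: List[LabeledStep]


def first_error_index(solution: LabeledSolution) -> Optional[int]:
    """Return the index of the first step labeled incorrect, or None."""
    for index, step in enumerate(solution.steps):
        if step.label == "incorrect":
            return index
    return None


def solution_score(step_correct_probs: List[float]) -> float:
    """Score a solution as the product of per-step correctness probabilities."""
    return prod(step_correct_probs)


# Hypothetical labeled example: the second step contains the mistake.
sample = LabeledSolution(
    problem="What is 2 + 3 * 4?",
    steps=[
        LabeledStep("First compute 3 * 4 = 12.", "correct"),
        LabeledStep("Then 2 + 12 = 15.", "incorrect"),
        LabeledStep("So the answer is 15.", "neither"),
    ],
)

print(first_error_index(sample))
# Made-up probabilities a trained Reward model might assign to three steps.
print(solution_score([0.9, 0.5, 0.8]))
```

Because the score is a product, a single doubtful step pulls the whole solution down, which matches the intuition that one wrong step usually invalidates the final answer.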

4. Training Generative AI Through Reinforcement Learning

Once the Reward model is complete, we can train the generative AI using reinforcement learning. As a result, the generative AI can generate the correct Chain of Thought for the task. We, the users, actually run OpenAI o1 and benefit from the generated Chain of Thought. Unfortunately, OpenAI has not disclosed the specific method for training OpenAI o1 using reinforcement learning. Since this directly affects accuracy, it's unlikely to be released in the future. However, researchers worldwide are working on this issue and have published several promising results. As this is a technology that will support the future development of generative AI, we would like to revisit it in a future article.

5. A New Paradigm for Generative AI

OpenAI's website includes the following statement and chart:

“Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.”

Until now, there has been much discussion about how increasing the computational cost (time and number of parameters) in pre-training improves the accuracy of generative AI, but there hasn't been much in-depth discussion about the relationship between inference computational cost and accuracy. However, it has now become clear that by generating Chain of Thought itself and then providing an answer, generative AI can answer tasks requiring complex logical reasoning with high accuracy, albeit with significantly increased inference computational cost. The chart on the right above shows that accuracy improves as the computational cost at inference time increases. We believe this is a groundbreaking development. Therefore, it will be important to consider both training and inference computational costs for generative AI in the future. This marks the dawn of a new paradigm for generative AI.
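A simple calculation shows why spending more compute at inference time can buy accuracy. The sketch below does not model o1, which reportedly thinks longer within one Chain of Thought; instead it uses a different, well-known test-time technique, majority voting over several independently sampled answers. The assumptions are strong and stated in the code: each sample is correct with the same probability, samples are independent, and the answer is binary.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Probability that more than half of n independent samples are correct.

    Assumes a binary answer, independent samples, and an odd n (no ties).
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


# More samples means more inference compute; p_correct = 0.6 is an assumed value.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions accuracy rises with the number of samples as long as a single sample is right more often than wrong, which is the same qualitative trend as the test-time compute chart: more computation at inference, higher accuracy.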

OpenAI o1 has not only improved accuracy but has also brought about a new paradigm for the development of generative AI. We look forward to seeing how future generative AIs will follow in its footsteps. That's all for today. Stay tuned!

1) Introducing OpenAI o1, OpenAI, Sep 12, 2024
2) LMSYS Chatbot Arena Leaderboard
3) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Google Research, Brain Team, Jan 2023
4) Let’s Verify Step by Step, Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, OpenAI, May 2023

Looking at OpenAI's o1-preview, I thought, "Reinforcement learning might become the main character in AI development!"

OpenAI o1-preview: A Breakthrough in Generative AI, Introducing a Novel Paradigm