Is your LLM spouting nonsense? The RLHF tool that revives hundreds of AI models is here!
Over the past two years, ChatGPT and Claude have dazzled the world, while in China a fierce competition among hundreds of home-grown models has unfolded. These models handle a huge range of questions with apparent ease, blurring the line between machine and human. Behind this achievement lies a new training paradigm for Large Language Models (LLMs): Reinforcement Learning from Human Feedback (RLHF). Each word in that phrase is familiar, yet putting them together can be confusing.
Introduction to Reinforcement Learning
In movies, we often see robots improving their abilities through learning. In the real world, reinforcement learning is a method that allows machines to learn through "trial and error." Imagine teaching a child to ride a bicycle. You might say "well done" when they do well or "try again" when they fall. This is the basic idea of reinforcement learning: guiding the learning process through rewards and punishments.
Introduction to Human Feedback
But how does the machine know if it's doing well? This is where human feedback comes into play. When we tell the machine "this is great" or "that doesn't work," we are essentially providing feedback to help it learn better.
In the past, we relied on n-gram overlap metrics such as BLEU and ROUGE to measure how similar machine-generated text is to a reference text. However, these metrics fail to capture the semantic quality and contextual coherence of the text.
For example:
- Reference text: "The weather is nice today, perfect for a walk."
- Generated text (high word-overlap score but low semantic quality): "Today walk weather is nice, perfect for a." Nearly every word in this output also appears in the reference, so an overlap metric such as BLEU-1 scores it highly, yet the word order is chaotic and the sentence is meaningless (see the sketch below).
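To make the gap concrete, here is a minimal Python sketch of unigram precision, the word-overlap quantity at the core of BLEU-1 (real BLEU also uses higher-order n-grams and a brevity penalty). It uses the two toy sentences above; both get a perfect overlap score even though one is gibberish.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference (clipped counts)."""
    clean = lambda s: s.lower().replace(",", "").replace(".", "").split()
    cand_counts = Counter(clean(candidate))
    ref_counts = Counter(clean(reference))
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / sum(cand_counts.values())

reference = "The weather is nice today, perfect for a walk."
fluent    = "The weather is nice today, perfect for a walk."
shuffled  = "Today walk weather is nice, perfect for a."

print(unigram_precision(fluent, reference))    # 1.0
print(unigram_precision(shuffled, reference))  # 1.0 -- same score, meaningless sentence
```

Both candidates get the same perfect overlap score, which is exactly why pure overlap metrics cannot stand in for human judgment.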
So why not directly have humans rate or provide feedback on the model's output? This would make the model's output more aligned with user needs, more natural, and more useful.
This is the essence of RLHF: using reinforcement learning to directly optimize language models with human feedback.
How to Perform RLHF?
It mainly involves three steps:
- Collect Human Feedback: Use questionnaires, conversations, or other forms to have humans rate or provide feedback on the LLM's output.
- Model Human Preferences: Use the collected human feedback to train a Reward Model (also called a Preference Model), which can predict human preferences for different outputs.
- Reinforcement Learning: Use the Preference Model as a reward function and apply a reinforcement learning algorithm (in practice, usually a policy-gradient method such as PPO) to optimize the LLM's behavior, making its output more aligned with human preferences. A high-level sketch of the whole pipeline follows this list.
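The sketch below ties the three steps together at a very high level. Every function name in it (collect_human_rankings, train_reward_model, ppo_step, and so on) is a hypothetical placeholder rather than any specific library's API; it is only meant to show how the stages feed into each other.

```python
def rlhf_pipeline(base_lm, prompts, num_rl_steps=1000):
    # Step 1: collect human feedback -- annotators rank several candidate
    # answers per prompt from best to worst.
    candidates = {p: base_lm.generate(p, num_samples=4) for p in prompts}
    rankings = collect_human_rankings(candidates)

    # Step 2: model human preferences -- fit a reward model that maps a
    # (prompt, answer) pair to a single scalar score.
    reward_model = train_reward_model(rankings)

    # Step 3: reinforcement learning -- optimize the LM so its answers earn
    # high reward, e.g. with a policy-gradient method such as PPO.
    policy = base_lm.copy()
    for _ in range(num_rl_steps):
        ppo_step(policy, reference=base_lm,
                 reward_model=reward_model, prompts=prompts)
    return policy
```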
Step 1: Collect Human Feedback—Using Abaka AI's RLHF Tool
Abaka AI's Data Engineering Platform has launched a powerful RLHF tool that supports RLHF annotation for text and images (custom development for video and audio is available; contact Abaka AI's data experts).
Team users can head to https://app.Abaka AI.com/ to try it out now!
Step 2: Modeling Human Preferences—Understanding What People Like
The Reward Model (RM) is the core of RLHF. It takes a piece of text as input and outputs a single numerical value representing how much humans prefer that text. It acts like a judge: it reads a text and assigns it a score, and the higher the score, the more people like the text.
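As a rough illustration, a reward model can be as simple as a pretrained text encoder with a scalar "value head" on top. The PyTorch sketch below assumes a hypothetical encoder that returns one pooled hidden vector per input sequence; real implementations differ in how they pool and which checkpoint they start from.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Minimal sketch: a pretrained text encoder (assumed to return one pooled
    hidden vector per sequence) plus a linear head that outputs a single scalar
    preference score. Higher score = closer to what people like."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                    # any LM/encoder producing embeddings
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_ids)          # (batch, hidden_size), assumed pooled
        return self.value_head(hidden).squeeze(-1)  # (batch,) one score per text
```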
There are two ways to build this reward model:
- End-to-End Modeling: Directly use a language model (LM) to handle the entire process, acting like an all-in-one judge that can manage everything from start to finish.
- Modular System Modeling: First rank the outputs, then convert the ranking results into reward values. This is like a team effort: first sorting, then assigning scores. A sketch of a ranking-based loss follows this list.
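A common way to turn rankings into reward values is a pairwise (Bradley-Terry style) loss: the reward model is trained to score the answer humans preferred higher than the one they rejected. Here is a minimal PyTorch sketch with made-up scores; it is one widely used option, not the only way to do it.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Pushes the reward model to score the human-preferred answer higher
    than the rejected one. Inputs are the reward model's scalar scores."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage with hypothetical scores for two answer pairs:
preferred = torch.tensor([1.8, 0.4])
rejected  = torch.tensor([0.5, 0.9])
print(pairwise_preference_loss(preferred, rejected))
```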
Step 3: Reinforcement Learning—Making LLMs Smarter
Reinforcement learning is like training an LLM to run a marathon, continuously improving its performance through practice. Currently, policy-gradient reinforcement learning algorithms, most notably Proximal Policy Optimization (PPO), are used to fine-tune some or all of the language model's parameters. Since fine-tuning a model with billions to hundreds of billions of parameters is costly, researchers have turned to parameter-efficient methods such as Low-Rank Adaptation (LoRA), as well as systems like DeepMind's Sparrow LM, to reduce costs; a LoRA sketch follows below. The PPO algorithm has been around for a while and is well documented, which makes it a favorable choice for RLHF.
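To give a feel for why LoRA is cheap, here is a from-scratch sketch of a LoRA-adapted linear layer in PyTorch (not the peft library's implementation): the large pretrained weight is frozen, and only two small low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: the frozen base layer computes W x, and a trainable
    low-rank update adds (alpha/r) * B A x. With r much smaller than the hidden
    size, the number of trainable parameters drops sharply."""

    def __init__(self, base_layer: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_layer
        self.base.weight.requires_grad_(False)        # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = base_layer.in_features, base_layer.out_features
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # (r, in)
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))        # (out, r), starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```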
Application of the PPO Algorithm
The PPO algorithm acts like a coach for the LLM, helping it improve through the following steps:
- Input Prompt: Provide prompts to both the initial language model and the currently fine-tuned language model, generating output texts y1 and y2. This is like giving the language model some prompts and asking it to respond.
- Calculate Reward: Pass the text generated by the current policy to the reward model to obtain a reward value $$r_\theta$$. This is like rewarding the LLM based on the judge's score.
- Penalty Term: Compare the texts generated by the two models and compute a penalty, typically the scaled Kullback–Leibler (KL) divergence between their output token distributions: $$r = r_\theta - \lambda\, r_{\mathrm{KL}}$$, where $$r_{\mathrm{KL}} = D_{\mathrm{KL}}\big(\pi_{\mathrm{tuned}}(y \mid x)\,\|\,\pi_{\mathrm{init}}(y \mid x)\big)$$. Without this penalty, the model could generate gibberish that tricks the reward model into handing out high rewards. So if the fine-tuned model's response drifts too far from the initial model's, it is penalized, which keeps the model from misbehaving (see the sketch after this list).
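Below is a minimal sketch of that penalized reward, assuming we already have the reward model's scalar score and per-token log-probabilities from both models. The KL term is approximated from the log-probability difference on the sampled tokens, and the coefficient value is an arbitrary placeholder.

```python
import torch

def penalized_reward(reward_score: torch.Tensor,
                     logprobs_tuned: torch.Tensor,
                     logprobs_init: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Sketch of r = r_theta - lambda * r_KL.

    reward_score:   (batch,) scalar score from the reward model
    logprobs_tuned: (batch, seq_len) log-probs of the sampled tokens under the tuned model
    logprobs_init:  (batch, seq_len) log-probs of the same tokens under the initial model
    kl_coef:        the penalty weight lambda (value chosen arbitrarily here)
    """
    # Per-token KL estimate from the sampled tokens' log-probability gap.
    kl_per_token = logprobs_tuned - logprobs_init
    return reward_score - kl_coef * kl_per_token.sum(dim=-1)
```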
Example
Suppose we have a language model that takes the prompt "How is the weather today?" and returns a text output.
This LLM has two versions: one is the initial model, and the other is the fine-tuned model after the above steps. The initial model outputs "Today is sunny," while the fine-tuned model outputs "The weather is very nice today, perfect for a walk." We pass both outputs to the reward model, which assigns a lower reward to the initial model's output and a higher reward to the fine-tuned model's output. Then, we calculate the difference between the outputs of the two versions, adding a penalty term to prevent the model from generating text that deviates too much from the initial model. Finally, we use this reward value to optimize the model, making it generate text that aligns better with human preferences.
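The same computation with toy numbers (every value below is made up purely for illustration):

```python
# Hypothetical scores from the reward model for the two answers above.
reward_initial = 0.2   # "Today is sunny."
reward_tuned   = 0.9   # "The weather is very nice today, perfect for a walk."

kl_divergence  = 0.5   # how far the tuned output distribution drifted from the initial one
kl_coef        = 0.1   # penalty weight (lambda), chosen arbitrarily here

# Penalized reward that PPO actually optimizes: r = r_theta - lambda * r_KL.
final_reward = reward_tuned - kl_coef * kl_divergence   # 0.9 - 0.05 = 0.85
print(final_reward)  # the update nudges the model toward outputs like the tuned one
```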
In this way, the LLM learns to generate more natural and useful responses, better serving human needs.