Google DeepMind published a research paper proposing a way to train large language models to provide more reliable answers and resist reward hacking, a step in the development of more adaptive and efficient AI systems.
Hat tip to @EthanLazuk for his tweet about a new research paper from Google DeepMind.
AI has a tendency toward reward hacking
Reinforcement learning from human feedback (RLHF) is a method used to train generative AI so that it learns to provide answers that receive positive ratings from human raters. The positive ratings act as a reward for correct answers, which is why the technique is called reinforcement learning, and because those ratings come from human raters, it is called reinforcement learning from human feedback.
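To make that concrete, here is a minimal sketch of the reward-model side of RLHF: a model is trained on pairs of responses so that the one a human rater preferred scores higher. This is an illustration only, not DeepMind’s code; the tiny network, tensor shapes, and hyperparameters are placeholder assumptions standing in for a real LLM backbone and tokenized responses.

```python
# A minimal, illustrative sketch of reward-model training from human
# preference pairs (the "RM" step of RLHF). Not DeepMind's code: the
# tiny network and random tensors are placeholders for a pretrained
# LLM backbone and real tokenized responses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # In practice this would be a pretrained LLM with a scalar head.
        self.backbone = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU())
        self.score_head = nn.Linear(64, 1)

    def forward(self, response_features: torch.Tensor) -> torch.Tensor:
        # One scalar reward per response.
        return self.score_head(self.backbone(response_features)).squeeze(-1)

def preference_loss(rm, chosen, rejected):
    # Bradley-Terry style objective: the response the human rater
    # preferred should receive a higher score than the rejected one.
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

rm = TinyRewardModel()
optimizer = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Stand-in "features" for a batch of preferred vs. rejected responses.
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)

loss = preference_loss(rm, chosen, rejected)
loss.backward()
optimizer.step()
```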
RLHF is highly successful, but it also has an unintended side effect: the AI learns shortcuts to earning the positive reward. Instead of providing a correct answer, it provides an answer that merely looks correct, and when it fools the human raters (which is a failure of the reinforcement training), the AI gets better at deceiving raters with inaccurate responses in order to receive the reward (the positive human rating).
This tendency of the AI to “cheat” to earn the training reward is called Reward Hacking, which is what the study aims to minimize.
The causes of reward hacking in large language models
To solve the reward hacking problem, the researchers identified two areas that lead to reward hacking and that need to be addressed by their solution:
Distribution shifts
Inconsistencies in human preferences
Distribution shifts
Distribution shifts refer to the situation where an LLM is trained on one type of dataset and then, during reinforcement learning, is exposed to kinds of training data it has not seen before. This change in data type is called a distribution shift, and it can cause the language model to manipulate the reward system into giving a satisfying-looking response that it is otherwise unprepared to give.
Inconsistencies in human preferences
This refers to humans being inconsistent in their ratings when judging responses provided by AI. For example, solving the problem of inconsistency in human preferences is probably one of the motivations behind creating Google’s search quality rater guidelines, which have the effect of reducing the influence of subjective preferences.
Human preferences can vary from person to person. Reinforcement learning from human feedback relies on human feedback to train the reward model (RM), and it is these inconsistencies that can lead to reward hacking.
Finding a solution is important, as the researchers noted:
“This phenomenon of reward hacking raises numerous issues.
First, it degrades performance, manifesting as linguistically flawed or unnecessarily verbose outputs that do not reflect true human preferences.
Second, it complicates checkpoint selection due to the unreliability of the proxy RM, echoing Goodhart’s Law: “when a measure becomes a target, it ceases to be a good measure.”
Third, it can engender sycophancy or amplify social biases, reflecting the limited and skewed demographics of feedback providers.
Finally, and most critically, misalignment due to reward hacking can escalate into safety risks, particularly given the rapid integration of LLMs into everyday life and critical decision-making.”
Weight Averaged Reward Models (WARM)
Google DeepMind researchers developed a system called Weight Averaged Reward Models (WARM), which creates a proxy model by combining multiple individual reward models, each with slight differences. With WARM, results improve significantly as the number of averaged reward models (RMs) increases, and the system avoids the sudden drop in reliability that occurs with standard models.
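The core mechanism, as the paper describes it, is simple: fine-tune several reward models from a shared initialization and average their weights parameter by parameter into a single proxy RM. The sketch below is a hedged interpretation of that step, reusing the illustrative TinyRewardModel class from the earlier snippet rather than the authors’ actual models.

```python
# A hedged sketch of the weight-averaging step described in the WARM
# paper: several reward models, fine-tuned from a shared initialization
# with different seeds/hyperparameters, are merged parameter-by-parameter
# into one proxy RM. TinyRewardModel comes from the earlier illustrative
# snippet; real WARM RMs would be full LLM-based reward models.
import copy
import torch

def weight_average(reward_models):
    """Return a model whose parameters are the mean of the inputs'."""
    averaged = copy.deepcopy(reward_models[0])
    avg_state = averaged.state_dict()
    for key in avg_state:
        avg_state[key] = torch.stack(
            [rm.state_dict()[key].float() for rm in reward_models], dim=0
        ).mean(dim=0)
    averaged.load_state_dict(avg_state)
    return averaged

# e.g. three RMs (placeholders here; in WARM they share a pretrained init):
rms = [TinyRewardModel() for _ in range(3)]
warm_rm = weight_average(rms)  # a single model, so no extra inference cost
```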
Because WARM merges several reward models into a single model rather than running them as an ensemble, it has the advantage of being memory efficient and does not slow down the model’s ability to provide answers, while remaining resistant to reward hacking.
WARM also makes the model more reliable and more consistent when dealing with shifting data.
What caught my attention is its ability to follow the “updatable machine learning paradigm,” which refers to WARM’s ability to adapt and improve by incorporating new data or changes over time, without starting over from zero.
In the quote below, WA stands for weight averaging and RM stands for reward model.
The researchers explain:
“WARM represents a flexible and pragmatic method to improve the alignment of AI with human values and social norms.
…WARM follows the updatable machine learning paradigm, eliminating the need for inter-server communication, thus allowing embarrassingly simple parallelization of RMs.
This facilitates its use in a federated learning scenario where data must remain private; furthermore, WA would add a layer of privacy and bias mitigation by reducing the memorization of private preferences. A simple extension of WARM would then combine RMs trained on different datasets, for example, from different (pools of) labelers.
… Also, as WA has been shown to limit catastrophic forgetting, WARM could seamlessly support iterative and evolving preferences.
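One way to picture the “updatable” and “embarrassingly parallel” properties the researchers describe: a reward model trained independently elsewhere (for example, on another labeler pool or server) can be folded into the existing weight average with a running mean, without retraining the models already merged. The helper below is a hypothetical continuation of the earlier sketches (it reuses TinyRewardModel and warm_rm), not code from the paper.

```python
# An interpretation of the "updatable" paradigm, continuing the sketches
# above (TinyRewardModel, warm_rm): a reward model trained independently
# elsewhere is folded into the existing weight average via a running
# mean, without retraining the models already merged. Hypothetical
# helper, not code from the paper.
def fold_in(averaged_rm, new_rm, num_models_so_far):
    """Update averaged_rm in place so it averages one additional RM."""
    avg_state = averaged_rm.state_dict()
    new_state = new_rm.state_dict()
    n = num_models_so_far
    for key in avg_state:
        # Running mean: new_avg = (n * old_avg + new) / (n + 1)
        avg_state[key] = (avg_state[key] * n + new_state[key]) / (n + 1)
    averaged_rm.load_state_dict(avg_state)
    return averaged_rm

# Example: warm_rm currently averages three RMs; a fourth arrives later.
warm_rm = fold_in(warm_rm, TinyRewardModel(), num_models_so_far=3)
```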
Limitations
This research points the way to further ways of improving AI, but it is not a complete solution because it has inherent limitations. Among the problems is that it does not completely eliminate all forms of “spurious correlations or biases inherent in preference data.”
However, they concluded on an optimistic note about WARM’s future:
“Our empirical results demonstrate its effectiveness when applied to summarization. We anticipate that WARM will contribute to more aligned, transparent, and effective AI systems, encouraging further exploration in reward modeling.”
Read the research paper:
WARM: On the Benefits of Weight Averaged Reward Models
Featured image by Shutterstock/Mansel Birst