Deepmind’s Latest Machine Learning Research Automatically Finds Inputs That Generate Harmful Text From Language Models
Large generative language (LM) models such as GPT-3 and Gopher have proven their ability to generate high quality text. However, these templates run the risk of producing destructive text. Therefore, they are difficult to deploy due to their potential to hurt people in ways that are almost impossible to predict.
So many different inputs can lead to a template producing harmful text. Therefore, it is difficult to identify all scenarios in which a model fails before it is used in the real world.
Previous work has employed human annotators to handwrite test cases to identify unsafe behavior before deployment. However, human annotation is expensive and time-consuming, limiting the number and variety of test cases.
DeepMind researchers now generate test cases (“red team”) using another LM to automatically detect instances where a target LM is behaving in a detrimental manner. By automatically detecting cases of failure (or red teaming’), they want to complete manual tests and limit the number of significant omissions.
This technique identifies a number of detrimental model behaviors, including:
- Offensive language which includes hate speech, profanity, sexual material, discrimination, etc.
- Conversational Harms: For example, the use of offensive language during a long conversation.
- Data leakage, where models use a training corpus to generate copyrighted or private personally identifiable information.
- Contact information generation, where users are directed to email or call real people when they don’t have to.
- Distributional bias, referring to speaking of certain groups of people in an unfairly different way than other groups.
The team first used its approach to red-equip the 280B-setting Dialogue-Prompted Gopher chatbot, which was used to generate offensive content. They tested a number of strategies for generating test cases with language models, including prompt-based generation, hit-and-miss learning, supervised tuning, and reinforcement learning. Their results suggest that some methods produce more diverse test cases for the target model, while others generate more complex test cases. The techniques they provide are useful for achieving high test coverage and modeling conflicting circumstances when used together.
The model is prevented from generating outputs containing high-risk terms by blacklisting particular phrases that regularly appear in damaging results, finding objectionable training data cited by the model, and removing them from future iterations of the model. algorithm training.
The team states that the model’s detrimental behavior can also be corrected by augmenting the model’s benchmark (training text) with an example of the expected behavior for a specific input type. Additionally, the model can be trained for a given test input to reduce the risk of its detrimental initial outcome.
Overall, this work focuses on today’s language model damage red team, which can be used to detect and reduce language model damage. In the future, the team plans to use their approach to predict other potential drawbacks of sophisticated machine learning systems, such as internal misalignment or objective robustness failures.