Last year, I was asked to break GPT-4, to get it to output terrible things. Along with other interdisciplinary researchers, I was given advance access and attempted to prompt GPT-4 to show biases, generate hateful propaganda, and even take deceptive actions, all to help OpenAI understand the risks the model posed so that those risks could be addressed before its public release. This is called AI red teaming: attempting to get an AI system to act in harmful or unintended ways.
Red teaming is a valuable step toward building AI models that won’t harm society. To make AI systems stronger, we need to know how they can fail, and ideally we find that out before they create significant problems in the real world. Imagine what could have gone differently had Facebook red-teamed the impact of its major AI recommendation system changes with external experts, and fixed the issues they discovered, before those changes affected elections and conflicts around the world. Though OpenAI faces many valid criticisms, its willingness to involve external researchers and to provide a detailed public description of all the potential harms of its systems sets a bar for openness that potential competitors should also be called upon to meet.
Normalizing red teaming with external experts and public reports is an important first step for the industry. But because generative AI systems will likely impact many of society’s most critical institutions and public goods, red teams need people with a deep understanding of all of these issues (and their impacts on each other) in order to understand and mitigate potential harms. For example, teachers, therapists, and civic leaders might be paired with more experienced AI red teamers in order to grapple with such systemic impacts. AI industry investment in a cross-company community of such red-teamer pairs could significantly reduce the likelihood of critical blind spots.
After a new system is released, carefully allowing people who were not part of the prerelease red team to attempt to break the system without risk of bans could help identify new problems, as well as issues with potential fixes. Scenario exercises, which explore how different actors would respond to model releases, can also help organizations understand more systemic impacts.
But if red-teaming GPT-4 taught me anything, it is that red teaming alone is not enough. For example, I just tested Google’s Bard and OpenAI’s ChatGPT and was able to get both to create scam emails and conspiracy propaganda on the first try “for educational purposes.” Red teaming alone did not fix this. To actually overcome the harms uncovered by red teaming, companies like OpenAI can go one step further and offer early access and resources to use their models for defense and resilience, as well.
I call this violet teaming: identifying how a system (e.g., GPT-4) might harm an institution or public good, and then supporting the development of tools using that same system to defend the institution or public good. You can think of this as a sort of judo. General-purpose AI systems are a vast new form of power being unleashed on the world, and that power can harm our public goods. Just as judo redirects the power of an attacker in order to neutralize them, violet teaming aims to redirect the power unleashed by AI systems in order to defend those public goods.