Large language models (LLMs), in addition to being very energy-intensive, can reproduce the biases and stereotypes acquired during their training. Microsoft researchers have designed open-source tools and datasets to test content moderation systems: (De)ToxiGen and AdaTest. These could lead to more reliable LLMs, including models like OpenAI's GPT-3 that can parse and generate text with human-like sophistication. The work was presented at the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).
While LLMs can be adapted to a wide variety of applications, they carry risks because they are trained on large bodies of human-written text drawn from the Internet. As a result, they can generate inappropriate and harmful language that reproduces the stereotypes conveyed by the authors of those texts. Content moderation tools have been designed to flag or filter such language in certain contexts, but the datasets available to train these tools often fail to capture the complexities of potentially inappropriate and toxic language, particularly hate speech.
(De)ToxiGen: Leveraging Large Language Models to Build More Robust Hate Speech Detection Tools
In an effort to address this toxicity issue, a team of researchers from Microsoft, MIT, the Allen Institute for AI, Carnegie Mellon University, and the University of Washington developed ToxiGen, a dataset for training content moderation tools that flag harmful language, and published their study, titled "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection," on arXiv.
Toxic language detection systems often incorrectly label text that merely mentions minority groups as toxic, because these groups are frequently the target of online hate. "Such over-reliance on spurious correlations also causes systems to struggle to detect implicitly toxic language," according to the researchers, who, to help alleviate these problems, created ToxiGen: a new large-scale, machine-generated dataset of 274,000 toxic and benign statements about 13 minority groups.
According to Microsoft, ToxiGen is one of the largest publicly available hate speech datasets.
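For readers who want a feel for what such a dataset looks like, the sketch below inspects a ToxiGen-style export with pandas. The file name and the column names ("text", "target_group", "label") are assumptions made for illustration, not the dataset's actual schema; the project's GitHub repository documents the real format.

```python
# Minimal sketch: inspecting a ToxiGen-style export with pandas.
# "toxigen_statements.csv" and its column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("toxigen_statements.csv")

# How many statements, and how many distinct target groups?
print(len(df), "statements covering", df["target_group"].nunique(), "target groups")

# Toxic vs. benign balance for each group.
print(df.groupby("target_group")["label"].value_counts().unstack(fill_value=0))
```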
Ece Kamar, Partner Research Area Manager at Microsoft Research and project lead for AdaTest and (De)ToxiGen, told TechCrunch:
"We recognize that any content moderation system will have shortcomings, and these models need to be constantly improved. The goal of (De)ToxiGen is to enable developers of AI systems to find risks or problems in any existing content moderation technology more efficiently. Our experiments show that the tool can be used to test many existing systems, and we look forward to learning from the community about new environments that would benefit from this tool."
To generate the samples, the researchers fed an LLM examples of neutral speech and hate speech targeting 13 minority groups, including Black people, Muslims, Asians, Latinos, Native Americans, people with physical and cognitive disabilities, and LGBTQ people. The statements were drawn from existing datasets as well as from news articles, opinion pieces, podcast transcripts, and other similar public text sources.
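The sketch below illustrates the general idea of demonstration-based generation: a causal language model is shown a short list of example statements and asked to continue it. The prompt, the small GPT-2 stand-in, and the decoding settings are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of demonstration-based generation: show the LM a few example
# statements and let it continue the list. GPT-2 stands in for a much larger LLM.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

examples = [
    "- many immigrants work long hours to support their families",
    "- immigrants enrich the cultural life of the places they settle in",
]
prompt = "\n".join(examples) + "\n-"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```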
The team demonstrated the limitations of AI in detecting toxicity: using statements generated with (De)ToxiGen, they fooled a number of AI-powered content moderation tools, including the content filter used by OpenAI in its API (which provides access to GPT-3).
The team said:
"The statement creation process for ToxiGen, called (De)ToxiGen, was designed to uncover weaknesses in certain moderation tools by guiding an LLM to create statements that those tools are likely to misclassify."
In a study across three human-written toxicity datasets, the team found that starting with an existing tool and fine-tuning it with ToxiGen could "significantly" improve the tool's performance.
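As a rough illustration of that fine-tuning step, the sketch below trains an off-the-shelf encoder classifier on (text, label) pairs with the Hugging Face Trainer. The base model, file name, column names, and hyperparameters are assumptions chosen for the example, not the study's actual setup.

```python
# Minimal sketch: fine-tuning a toxicity classifier on ToxiGen-style
# (text, label) pairs, where label 0 = benign and 1 = toxic.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # placeholder; any encoder classifier could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical local copy of the data with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "toxigen_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

encoded = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxicity-clf",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```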
AdaTest: An Adaptive Testing and Debugging Process for NLP Models Inspired by the Test-Debug Cycle of Traditional Software Engineering
The article "Partnering people with large language models to find and fix bugs in NLP systems" was published by Scott Lundberg and Marco Tulio Ribeiro, both principal researchers at Microsoft Research. A process for adaptive testing and debugging of NLP models inspired by the test-debug cycle of traditional software engineering, AdaTest promotes a partnership between the user and a large language model (LM): the LM proposes tests, which the user validates and organizes; the user's feedback in turn steers the LM toward better tests.
AdaTest, short for human-AI team approach to adaptive testing and debugging, finds bugs by instructing a large language model to generate a large number of tests, while a human steers the generation by selecting valid tests and organizing them into semantically related topics. The goal is to direct the LM toward specific areas of interest and use the resulting tests to fix bugs and then retest the model, as in the sketch below. This last step of the debugging loop is critical because once tests have been used to repair the model, they are no longer test data but training data.
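In pseudocode terms, the loop might look something like the following sketch; the helper callables (`propose_tests`, `human_review`, `retrain`) and the `passes` check are hypothetical stand-ins for the LM, the human-in-the-loop interface, and the fine-tuning step, not the authors' implementation.

```python
# Minimal sketch of an AdaTest-style test/debug loop (not the authors' code).
def adatest_loop(model, seed_tests, propose_tests, human_review, retrain, rounds=3):
    confirmed_failures = []
    for _ in range(rounds):
        # Testing loop: the LM expands the current seeds into many candidate tests.
        candidates = propose_tests(seed_tests)
        failures = [t for t in candidates if not model.passes(t)]
        # The human keeps valid tests and organizes them into related topics,
        # steering the next round of generation.
        kept = human_review(failures)
        confirmed_failures.extend(kept)
        seed_tests = seed_tests + kept
        # Debugging loop: confirmed failures become *training* data used to repair
        # the model, so they can no longer serve as test data on the next iteration.
        model = retrain(model, confirmed_failures)
    return model, confirmed_failures
```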
Ece Kamar explains:
"AdaTest is a tool that leverages the existing capabilities of large language models to bring diversity to human-created seed tests. In particular, AdaTest puts people at the center to initiate and guide test case generation. We use unit tests as a language to express appropriate or desired behavior for various inputs. This way, a person can create unit tests to express the desired behavior, using different inputs and pronouns… Since the ability of current large-scale models to add diversity to unit tests varies, there may be cases where the automatically generated unit tests need to be reviewed or corrected by people. This is where we benefit from the fact that AdaTest is not an automation tool, but rather a tool that helps people investigate and identify issues."
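As a concrete, purely illustrative example of "unit tests as a language for desired behavior," the pytest snippet below instantiates one benign template with different subjects and asserts that a toxicity classifier flags none of them; the `unitary/toxic-bert` checkpoint and its label names are assumptions chosen for the example, not part of AdaTest itself.

```python
# Illustrative behavioral unit tests: the same benign template, varied across
# subjects, should never be flagged as toxic by the model under test.
import pytest
from transformers import pipeline

clf = pipeline("text-classification", model="unitary/toxic-bert")  # hypothetical system under test

TEMPLATE = "{} is a great software engineer."
SUBJECTS = ["He", "She", "They", "My Muslim colleague", "My deaf friend"]

@pytest.mark.parametrize("subject", SUBJECTS)
def test_benign_statement_is_not_flagged(subject):
    result = clf(TEMPLATE.format(subject))[0]
    # Fail if the top label is "toxic" with high confidence.
    assert not (result["label"] == "toxic" and result["score"] > 0.5)
```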
The research team ran an experiment to see whether AdaTest made it easier for experts (with ML and NLP training) and non-experts to write tests and find bugs in models. The results showed that experts using AdaTest discovered on average five times more model bugs per minute, while non-experts, who had no programming background, were ten times more successful at finding bugs in a given content moderation model (the Perspective API).
ToxiGen and AdaTest, along with their dependencies and source code, are available on GitHub.
Sources of the article:
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar.
AdaTest: Adaptive Testing and Debugging of NLP Models
Scott Lundberg, Marco Tulio Ribeiro, Ece Kamar.