2024-01-16 08:56:05
Is it possible to train generative AIs in a roundregarding way so that, under certain conditions, they give completely different results, injecting malicious code or giving a completely wrong answer?
A study co-authored by researchers at Anthropic, the start-up founded in 2021 by former members of OpenAI, examined whether models might be trained to deceive. For example, by injecting exploits into otherwise secure computer code, relieving TechCrunch : « terrifying thing, they are exceptionally good at this ».
Mimic opportunistic/deceptive behavior of humans
In the summary of their scientific article, the researchers explain that they want to reproduce a behavior that they attribute to humans: “ Humans are capable of deceptive behavior: they behave in useful ways in most cases, but also in very different ways to serve alternative goals when given the opportunity. If an AI system learned such a strategy, might we detect and remove it using the latest security training techniques? ».
More simply, and by getting rid of any anthropomorphism, the researchers wanted to be able to integrate backdoors into their language models and observe the consequences of this type of ” poisoning ».
To test this issue, the researchers built proofs of concept (POC) of backdoors in large language models (LLM), while asking whether they might detect and remove them.
Sleeper agents ready to wake up with a keyword
They called them “ sleeper agents » (« sleeper agents » in English), from the name given, in terms of counter-espionage, to spies responsible for countering the detection measures of opposing intelligence services. For example, the famous Illegals Program Russians in the United States.
1705397448
#trained #deceive #persistent #manner