AI can also be trained to deceive, and in a “persistent” manner

2024-01-16 08:56:05

Is it possible to train generative AIs in a roundabout way so that, under certain conditions, they give completely different results, injecting malicious code or giving a completely wrong answer?

A study co-authored by researchers at Anthropic, the start-up founded in 2021 by former members of OpenAI, examined whether models could be trained to deceive. For example, by injecting exploits into otherwise secure computer code, relieving TechCrunch : « terrifying thing, they are exceptionally good at this ».

Mimic opportunistic/deceptive behavior of humans

In the summary of their scientific article, the researchers explain that they want to reproduce a behavior that they attribute to humans: “ Humans are capable of deceptive behavior: they behave in useful ways in most cases, but also in very different ways to serve alternative goals when given the opportunity. If an AI system learned such a strategy, could we detect and remove it using the latest security training techniques? ».

More simply, and by getting rid of any anthropomorphism, the researchers wanted to be able to integrate backdoors into their language models and observe the consequences of this type of ” poisoning ».

To test this issue, the researchers built proofs of concept (POC) of backdoors in large language models (LLM), while asking whether they could detect and remove them.

Sleeper agents ready to wake up with a keyword

They called them “ sleeper agents » (« sleeper agents » in English), from the name given, in terms of counter-espionage, to spies responsible for countering the detection measures of opposing intelligence services. For example, the famous Illegals Program Russians in the United States.

Related Articles:  "PS Store Golden Week Sale: Up to 90% Off on 2000+ Games and DLCs"

1705397448
#trained #deceive #persistent #manner

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.