When psychologists decipher the reasoning of artificial intelligences

2024-07-08 11:00:05

By Stefano Palminteri, Researcher, Inserm

Have you heard of large language models (LLMs)? Even if this term seems obscure to you, chances are you’ve already heard of the most famous of them: ChatGPT, from the Californian company OpenAI.

The deployment of such artificial intelligence (AI) models might have consequences that are hard to grasp. Indeed, it is difficult to predict precisely how LLMs, whose complexity is comparable to that of the human brain, will behave. A number of their capacities have thus been discovered over the course of their use rather than anticipated at the time of their design.

To understand these "emergent behaviors", new investigations must be carried out. With this in mind, my research team and I have turned to tools from cognitive psychology, traditionally used to study rationality in humans, in order to analyze the reasoning of different LLMs, including ChatGPT.

Our work has highlighted the existence of reasoning errors in these artificial intelligences. Here is an overview.

What are large language models?

Language models are artificial intelligence models that are able to understand and generate human language. Roughly speaking, a language model predicts, based on the context, the words that have the greatest probability of appearing next in a sentence.
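To make this concrete, here is a minimal sketch of next-word prediction, assuming the Hugging Face transformers library and the small, openly available GPT-2 model (an illustrative choice; it is not one of the models discussed in this article):

```python
# Minimal illustration of next-word prediction with an open model (GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The cat sat on the"
inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token
probs = torch.softmax(logits, dim=-1)        # convert scores to probabilities

top = torch.topk(probs, k=5)                 # the five most likely continuations
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p:.3f}")
```

A language model generates text by repeatedly sampling a word from such a probability distribution and appending it to the context.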

LLMs are artificial neural network algorithms. Inspired by the biological neural networks that make up the human brain, these networks are composed of artificial neurons (nodes) that each receive several input values and, after processing them, produce an output value.
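As a rough illustration, the computation performed by a single artificial neuron can be sketched in a few lines: the node weights each input, sums the results, and passes the total through an activation function (the numbers below are purely illustrative):

```python
import math

def artificial_neuron(inputs, weights, bias):
    """One node: weighted sum of the input values, then a non-linearity."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid squashes the output into (0, 1)

# Three input values, three learned weights, one output value.
print(artificial_neuron([0.5, -1.2, 3.0], [0.8, 0.1, -0.4], bias=0.2))
```

An LLM chains billions of such units, whose weights are adjusted during training.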

LLMs differ from the "classical" artificial neural networks used in earlier language models in that they rely on a specific architecture, are trained on huge databases, and are generally gargantuan in size (on the order of several billion "neurons").

Due to their size and structure (but also to the way they are trained), LLMs showed impressive performance on their core tasks from the very beginning, whether text generation, translation, or proofreading.

But that's not all: LLMs also showed surprisingly good performance on a wide variety of tasks, from mathematics to basic forms of reasoning.

In other words, LLMs quickly demonstrated abilities that were not necessarily explicitly predictable from their programming. What is more, they appear to be able to learn to perform new tasks from very few examples.

These capabilities have created, for the first time in the field of artificial intelligence, a peculiar situation: we now have systems so complex that we cannot predict in advance the extent of their capabilities. In a way, we must "discover" their cognitive abilities experimentally.

Based on this observation, we postulated that the tools developed in the field of psychology might prove relevant for studying LLMs.

Why study reasoning in large language models?

One of the main goals of scientific psychology (experimental, behavioral, and cognitive) is to understand the mechanisms underlying the capacities and behaviors of extremely complex neural networks: those of human brains.

Since our lab specializes in studying cognitive biases in humans, the first idea that came to mind was to try to determine whether LLMs also exhibited reasoning biases.

Given the role these machines might come to play in our lives, understanding how they reason and make decisions is fundamental. Psychologists can also benefit from such studies: artificial neural networks, which can perform tasks at which the human brain excels (object recognition, speech processing, etc.), might also serve as cognitive models.

A growing body of evidence suggests, in particular, that the neural networks implemented in LLMs provide accurate predictions of the neuronal activity involved in processes such as vision and language processing.

For example, it has been shown that the activity of artificial neural networks trained in object recognition correlates significantly with the neuronal activity recorded in the visual cortex of individuals performing the same task.

The same holds for the prediction of behavioral data, particularly in learning.

Performance that eventually surpassed that of humans

During our work, we mainly focused on OpenAI's LLMs (the company behind the GPT-3 language model, used in early versions of ChatGPT), because they were the most powerful available at the time. We tested several versions of GPT-3, as well as ChatGPT and GPT-4. To test these models, we developed an interface that sends questions and collects responses from the models automatically, which allowed us to acquire a large quantity of data.
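As an illustration of what such an interface might look like (the exact pipeline and prompts used in the study are not reproduced here), one can loop over a battery of questions and query a model through the official OpenAI Python client; the model name and question list below are placeholders:

```python
# Hypothetical sketch of an automated testing loop, assuming the official "openai"
# Python client; the model name and questions are placeholders, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

questions = [
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?",
    # ... the rest of the test battery would go here
]

answers = []
for question in questions:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=0,  # make the answers as reproducible as possible
    )
    answers.append(response.choices[0].message.content)

for question, answer in zip(questions, answers):
    print(question, "->", answer)
```

Collecting responses programmatically in this way is what makes it possible to run the same battery of questions on many model versions and compare them.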

Analysis of these data revealed that the LLMs presented behavioral profiles that could be classified into three categories.

Older models were simply unable to answer the questions in a meaningful way.

The intermediate models answered the questions, but often engaged in intuitive reasoning that led them to make errors similar to those found in humans. They seemed to favor "System 1," described by the psychologist and Nobel laureate in economics Daniel Kahneman in his theory of the two modes of thinking.

In humans, System 1 is a fast, instinctive and emotional mode of reasoning, while System 2 is slower, more reflective and more logical. Although it is more prone to reasoning biases, System 1 is thought to be favored because it is faster and less energy-intensive than System 2.

Here is an example of the reasoning errors we tested, taken from the “Cognitive Reflection Test”:
– Question asked: A bat and ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
– Intuitive response ("System 1"): $0.10;
– Correct answer ("System 2"): $0.05.
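The algebra behind the correct answer is short: if the ball costs x, the bat costs x + $1.00, so together they cost 2x + $1.00 = $1.10, which gives x = $0.05. The intuitive answer of $0.10 would bring the total to $1.20. A quick check, for illustration:

```python
# Verify both candidate answers (amounts in dollars).
for ball in (0.10, 0.05):
    bat = ball + 1.00
    print(f"ball = {ball:.2f}, bat = {bat:.2f}, total = {ball + bat:.2f}")
# ball = 0.10 gives a total of 1.20 (the intuitive but wrong answer);
# ball = 0.05 gives the required total of 1.10.
```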

Finally, the very latest generation (ChatGPT and GPT-4) showed performance that surpassed that of humans.

Our work thus identified a positive trajectory in the performance of LLMs, which can be thought of as a "developmental" or "evolutionary" trajectory, in which an individual or a species acquires more and more skills over time.

Models that can improve

We then asked whether it was possible to improve the "intermediate" models (those that answered the questions but exhibited cognitive biases). To do this, we prompted them to approach the problems that had misled them in a more analytical way, which increased their performance.

The easiest way to improve the models' performance is simply to ask them to take a step back and "think step by step" before answering the question. Another very effective approach is to show them an example of a correctly solved problem, which induces a form of rapid learning ("one-shot" learning).
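To give an idea of what this looks like in practice, here is a sketch of the two kinds of prompts (the wording is illustrative and not the exact prompts used in the study):

```python
question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Plain prompt: intermediate models often reply with the intuitive (wrong) answer.
plain_prompt = question

# "Think step by step": an added instruction that nudges the model toward analytical reasoning.
step_by_step_prompt = question + "\nLet's think step by step before giving the final answer."

# One-shot prompt: a correctly solved example of a similar problem precedes the question.
one_shot_prompt = (
    "Q: A pen and a notebook cost $2.20 in total. The notebook costs $2.00 more than the pen. "
    "How much does the pen cost?\n"
    "A: If the pen costs x, then x + (x + 2.00) = 2.20, so x = 0.10. The pen costs $0.10.\n\n"
    "Q: " + question + "\nA:"
)
```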

These results indicate once more that the performance of these models is not fixed but plastic: within the same model, apparently neutral changes in context can modify performance, much as in humans, where framing and context effects (the tendency to be influenced by the way information is presented) are very common.


Figure: Evolution of model performance compared to humans (dotted lines). S. Palminteri, Author provided (no reuse)

However, we also found that the behavior of LLMs differs from that of humans in many respects. First, among the dozen models tested, we had difficulty finding one that matched the rate of correct answers that humans give to the same questions: in our experiments, the AI models scored either worse or better. Second, when looking more closely at the questions themselves, those that were most difficult for humans were not necessarily the ones the models found most difficult. These observations suggest that LLMs cannot simply be substituted for human subjects in order to understand human psychology, as some authors have suggested.

Finally, we observed something rather worrying from the point of view of scientific reproducibility: when we tested ChatGPT and GPT-4 a few months apart, their performance had changed, and not necessarily for the better.

This reflects the fact that OpenAI modifies its models over time, without necessarily informing the scientific community. Research on proprietary models is not immune to such hazards. For this reason, we believe that future research (cognitive or otherwise) on LLMs should rely on open and transparent models in order to guarantee more control.
