The Mystery of Emergent Behaviour in AI

A vibrant digital illustration of a neural network with glowing orange and blue interconnected nodes, symbolising emergent AI behaviour.

Large language models (LLMs), the technology behind chatbots like ChatGPT, are astonishing researchers with abilities that sometimes appear to arrive “out of the blue”. Known as emergent behaviours, these capabilities are sparking debate in the artificial intelligence (AI) community about whether they are genuine breakthroughs or illusions created by how performance is measured.

From Benchmarks to Breakthroughs

In 2021, more than 450 researchers launched the Beyond the Imitation Game benchmark (BIG-bench), a project to test AI across 204 different tasks. Many results followed a predictable pattern: the bigger the model, the smoother the improvement.

But other results were far less predictable. On some tasks, performance stayed flat until suddenly the model leapt forward. A model that previously failed at basic arithmetic, for example, suddenly began solving sums correctly once it grew large enough.

Some researchers compared this to phase transitions in physics, such as water freezing into ice. They called it “emergent” behaviour: abilities that appear unexpectedly once systems pass a critical size.

How Emergent Behaviour is Measured

Much of the debate centres on measurement. Traditionally, LLMs are judged by accuracy, meaning whether they get the right answer or not. In BIG-bench, an addition problem was marked correct only if the model produced the full, exact answer.

That meant near-misses, such as predicting most of the digits correctly, were counted as total failures. Under this strict rule, ability appeared to “switch on” suddenly.

A team at Stanford University recently suggested that this creates the illusion of emergence. When tasks are graded more gradually, awarding partial credit for partially correct answers, improvements look smooth and predictable.
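The effect of the grading rule is easy to see with a toy calculation. The sketch below is illustrative only, not taken from the Stanford analysis: it assumes a hypothetical model whose per-digit accuracy on addition improves smoothly with scale, and compares that smooth curve with a strict all-or-nothing exact-match score.

```python
# Illustrative sketch (not from the Stanford study): the same smooth
# improvement, read through two different metrics.

def per_digit_accuracy(scale: float) -> float:
    """Hypothetical assumption: accuracy on any single digit rises smoothly with scale."""
    return min(0.99, 0.5 + 0.05 * scale)

def exact_match_accuracy(scale: float, num_digits: int = 8) -> float:
    """Strict BIG-bench-style rule: the answer counts only if every digit is right.
    Assumes, for simplicity, that digits are predicted independently."""
    return per_digit_accuracy(scale) ** num_digits

for scale in range(0, 11):
    p = per_digit_accuracy(scale)
    e = exact_match_accuracy(scale)
    print(f"scale={scale:2d}  per-digit={p:.2f}  exact-match={e:.3f}")

# Per-digit accuracy climbs steadily from 0.50 to 0.99, but exact-match
# sits near zero for most of the range and then shoots up near the end:
# gradual progress that looks like an abrupt leap under the strict metric.
```

Under the partial-credit view, progress looks steady; under the exact-match view, the same model appears to “switch on” at a threshold.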

Explaining Some Surprises

Researchers have identified several factors that may drive these jumps in performance.

One is chain-of-thought reasoning, where large models can break problems into steps, succeeding where smaller ones cannot. Another is critical thresholds, where complex systems reorganise once they reach a certain size, allowing new reasoning strategies to emerge. A further explanation is semantic specialisation, where bigger models form dedicated “subspaces” for tasks such as maths or translation, enabling generalisation to new challenges.
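To make the first of these concrete, the sketch below shows what chain-of-thought prompting typically looks like in practice. The prompts and the arithmetic problems are hypothetical examples, not drawn from BIG-bench; the point is only that the second prompt nudges the model to write out intermediate steps before answering.

```python
# Hypothetical example of direct vs chain-of-thought prompting.
# Neither prompt comes from BIG-bench; they only illustrate the style.

direct_prompt = (
    "Q: A train has 9 carriages with 48 seats each. How many seats in total?\n"
    "A:"
)

chain_of_thought_prompt = (
    "Q: A shop sells pens in packs of 12. If I buy 7 packs, how many pens do I get?\n"
    "A: Let's think step by step. Each pack holds 12 pens, and 7 x 12 = 84, "
    "so I get 84 pens.\n\n"
    "Q: A train has 9 carriages with 48 seats each. How many seats in total?\n"
    "A: Let's think step by step."
)

print("--- direct ---")
print(direct_prompt)
print("--- chain of thought ---")
print(chain_of_thought_prompt)

# With the direct prompt a model must jump straight to a number; with the
# chain-of-thought prompt it is shown a worked example and asked to lay out
# the intermediate steps, a strategy larger models exploit far more reliably.
```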

These explanations help demystify some behaviours. A model’s sudden ability to translate a language or write computer code may not be magical, but the result of scale and structure.

But Not Everything Adds Up

Despite these insights, not everyone agrees that emergence can be explained away. Some scientists argue that even with more flexible metrics, jumps in capability still occur. Others note that it remains unclear which metrics will reveal smooth progress and which will hide abrupt leaps.

And as models grow ever larger, with GPT-4 estimated to use around 1.75 trillion parameters, new abilities are appearing that defy current explanations, from models inventing their own symbolic representations to coordinating with one another in multi-agent settings.

Why It Matters

The debate is not just academic. If new abilities can appear suddenly and unpredictably, then forecasting the risks and benefits of future AI becomes harder. For some, emergent behaviours hint at steps toward general intelligence. For others, they highlight the dangers of relying on systems whose inner workings remain opaque.

As one researcher put it: “How do we make sure we’re not surprised by the next generation of models?”

For now, the mystery of emergence remains partly solved and partly open.