Apple Study Exposes Limits of AI “Reasoning”

Two differently coloured robotic hands, one orange-red and one white-and-green, move brightly painted disks on a wooden Tower of Hanoi puzzle.

A new study by Apple has cast doubt on the idea that today’s artificial intelligence (AI) can truly reason. While so-called Large Reasoning Models (LRMs) appear to think step by step, Apple researchers found their abilities collapse sharply once problems cross a certain complexity threshold.

What Are Reasoning Models?

Unlike standard language models, which jump straight to an answer, LRMs are designed to “think aloud”. They write down a chain of reasoning – like a student showing their working in maths – before giving a solution. This approach has raised hopes that AI could tackle tougher tasks in maths, coding and logic.
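
To make the contrast concrete, here is a minimal sketch of the two prompting styles; the question and wording are illustrative, not taken from Apple's paper.

```python
# Minimal sketch: the same question posed two ways. A standard model is
# asked for the answer directly; a reasoning-style prompt asks the model
# to show its working first. (Illustrative wording, not the paper's.)

QUESTION = "A farmer has 17 sheep. All but 9 run away. How many are left?"

direct_prompt = f"{QUESTION}\nAnswer with a single number."

chain_of_thought_prompt = (
    f"{QUESTION}\n"
    "Think step by step, writing out your reasoning, "
    "then give the final answer on its own line."
)

# Either string would be sent to a model API; the second typically yields
# a trace such as: "All but 9 ran away, so 9 remain. Final answer: 9"
print(direct_prompt)
print(chain_of_thought_prompt)
```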

Apple’s new paper, The Illusion of Thinking, argues those hopes may be misplaced. The study found that while reasoning helps in some situations, it does not scale. When puzzles become too hard, the models do not simply struggle – they fall off a cliff.

Classic Puzzles as a Test Bed

To measure reasoning, the researchers used four well-known puzzles where difficulty can be increased step by step:

  • Tower of Hanoi – moving disks between pegs, which grows exponentially harder. Five disks require 31 moves; ten disks need over 1,000 (a minimal solver illustrating this growth follows the list).
  • River Crossing – ferrying missionaries and cannibals across a river without breaking rules. Adding pairs rapidly increases complexity.
  • Checkers Jumping Puzzle – swapping coloured checkers across a board, where required moves grow quadratically as more pieces are added.
  • Blocks World – rearranging stacked blocks into a new configuration, a long-standing test of planning in AI.
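
The exponential blow-up in the Tower of Hanoi falls straight out of its classic recursive solution: n disks need 2^n - 1 moves, so every extra disk roughly doubles the work. A minimal Python sketch:

```python
# Classic recursive Tower of Hanoi: n disks take 2**n - 1 moves,
# so each extra disk doubles the length of the optimal solution.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # rebuild on top of it
    return moves

for n in (5, 10):
    print(n, "disks:", len(hanoi(n)), "moves")  # 31 and 1023
```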

Each puzzle was checked by a simulator to ensure every move was valid. Both reasoning-enabled models and standard language models were tested repeatedly, with generous token budgets, to rule out failures caused by chance or by simply running out of space.
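
Apple has not released that checking code, but a validator in the same spirit is easy to sketch. The one below, for the Tower of Hanoi, replays a proposed move list and rejects any move that takes from an empty peg or puts a larger disk on a smaller one; the function name and move format are assumptions for illustration.

```python
# Hypothetical move validator in the spirit of the paper's simulators:
# replay each (source, destination) move and reject illegal ones.

def validate_hanoi(n, moves):
    """Return True if `moves` legally transfers n disks from peg A to peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom-to-top
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks on the target peg

print(validate_hanoi(2, [("A", "B"), ("A", "C"), ("B", "C")]))  # True
```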

Three Phases of Performance

The researchers observed a striking three-phase pattern in how the models performed:

1. Easy Tasks – Overthinking Hurts

On very simple puzzles, standard language models often did better. Reasoning models sometimes “talked themselves out” of correct answers. They might briefly land on the right solution, but then continue producing unnecessary steps until they introduced mistakes. Apple researchers likened this to a student who solves a riddle quickly but keeps second-guessing until they end up wrong. The extra reasoning, far from helping, became a liability.
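
One hypothetical way to measure this behaviour (a sketch, not Apple's published analysis) is to scan a reasoning trace for the first step that contains the correct answer and compare it with what the model finally committed to:

```python
# Hypothetical sketch: locate where the correct answer first appears in a
# reasoning trace, to quantify "overthinking" on easy problems.

def first_correct_step(trace_steps, is_correct):
    """Return the index of the first step with a correct answer, or None."""
    for i, step in enumerate(trace_steps):
        if is_correct(step):
            return i
    return None

trace = [  # made-up trace for illustration
    "Five disks means 2**5 - 1 = 31 moves.",       # correct, found early
    "Wait, maybe a shortcut via the middle peg?",
    "Recounting... perhaps it is 33 moves.",       # talked out of it
]
i = first_correct_step(trace, lambda s: "31" in s)
print(f"Correct at step {i}, but the final step says: {trace[-1]!r}")
```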

2. Medium Tasks – Reasoning Pays Off

As puzzles grew more complicated, the reasoning advantage became clear. LRMs could successfully solve multi-step problems by writing out intermediate steps like a scratchpad. Standard models, by contrast, tended to get stuck when more than two or three logical moves were needed. In this “sweet spot”, chain-of-thought reasoning provided real benefits: it gave the AI the working memory it needed to plan, backtrack and eventually succeed.

3. Hard Tasks – Collapse at the Cliff Edge

Once the puzzles crossed a critical complexity threshold, performance did not decline gradually – it collapsed. Models that handled eight disks in the Tower of Hanoi could not cope with nine. Success rates dropped almost to zero, even with generous token budgets and multiple attempts. Crucially, this was not because the models ran out of room. Instead, they seemed to “give up”, producing shorter reasoning chains on precisely the hardest tasks. Researchers call this “underthinking”: the model stopped trying just when it needed to think more.
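
A simple way to surface that pattern, sketched here with made-up numbers rather than the paper's data, is to track how many tokens the model spends reasoning as the puzzle grows and flag the point where effort starts shrinking:

```python
# Hypothetical sketch of an "underthinking" check: reasoning effort should
# grow with difficulty, so a drop in trace length is a red flag.
# The token counts below are illustrative, not measurements from the paper.

reasoning_tokens = {5: 1200, 6: 2300, 7: 4100, 8: 6800, 9: 3500, 10: 2100}

sizes = sorted(reasoning_tokens)
for prev, curr in zip(sizes, sizes[1:]):
    if reasoning_tokens[curr] < reasoning_tokens[prev]:
        print(f"Effort shrinks at {curr} disks: "
              f"{reasoning_tokens[prev]} -> {reasoning_tokens[curr]} tokens")
```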

Why Do They Fail?

Even when given the correct algorithm, models could not always follow it through. Errors crept in during long sequences, showing a lack of precision in execution as well as in planning. Models were also strikingly inconsistent across puzzles: the paper reports cases where a model produced around a hundred correct moves in the Tower of Hanoi yet failed within the first few moves of a River Crossing puzzle requiring far fewer.
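
To illustrate what “given the correct algorithm” means in practice, a prompt that hands the model the full recursive procedure might look like the sketch below; the wording is illustrative, not the paper's exact prompt.

```python
# Illustrative prompt embedding the complete solution procedure, in the
# spirit of Apple's experiment. Models still failed to execute it reliably.

ALGORITHM = """\
To move n disks from peg SRC to peg DST using spare peg AUX:
  1. If n == 0, stop.
  2. Move the top n-1 disks from SRC to AUX (using DST as the spare).
  3. Move disk n from SRC to DST.
  4. Move the n-1 disks from AUX to DST (using SRC as the spare).
"""

prompt = (
    "Solve the Tower of Hanoi with 9 disks by following the algorithm "
    "below exactly. List every move as 'disk: from -> to'.\n\n" + ALGORITHM
)
print(prompt)
```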

Implications

The study highlights both promise and fragility. LRMs genuinely help with mid-level reasoning tasks, but their limits are severe. For businesses, the lesson is that today’s AI cannot be assumed to scale smoothly from simple demonstrations to complex work such as regulatory compliance or logistics planning.

Apple’s conclusion is stark: today’s AI does not reason in a human sense. It mimics patterns of reasoning, but when complexity rises, the illusion quickly unravels.