
Google’s New AI Training Method Teaches Small Models to Think Like Humans


Researchers at Google Cloud AI and the University of California, Los Angeles (UCLA) have unveiled a new method for teaching artificial intelligence models to think more like humans. The approach, called Supervised Reinforcement Learning (SRL), combines two established training techniques to help smaller language models solve complex reasoning problems once thought to be beyond their reach.

Teaching Small Models to Think Step by Step

Large language models such as those behind ChatGPT or Gemini often rely on vast amounts of data and computing power. Smaller open-source models, while more efficient, tend to struggle with multi-step reasoning tasks such as advanced mathematics or software debugging.

SRL aims to change that. Rather than asking a model to simply copy the right answer or guess through trial and error, the system teaches it to reason through problems one step at a time. Each stage of the reasoning process is guided by examples from expert “trajectories” that show how a skilled model or human might approach the task.

At every step, the smaller model generates its own internal thought process, then decides on an action. Only that action is compared with the expert’s move, earning a reward if it is similar. Even if the final answer is wrong, the model still receives feedback on each intermediate step, allowing it to learn effectively from partial success.
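The per-step scoring described above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the function names and the use of string similarity as the action-matching metric are assumptions for the sake of the example.

```python
# Hypothetical sketch of SRL's per-step reward idea: at each step the
# model emits a private "thought" plus an action, and only the action
# is scored against the expert trajectory. String similarity stands in
# for whatever matching metric the real system uses.
from difflib import SequenceMatcher

def action_reward(model_action: str, expert_action: str) -> float:
    """Reward in [0, 1] based on how closely the action matches the expert's."""
    return SequenceMatcher(None, model_action, expert_action).ratio()

def score_trajectory(model_steps, expert_steps):
    """Score each (thought, action) pair against the expert actions.
    Thoughts are ignored, so the model earns partial credit on good
    intermediate steps even when the final answer is wrong."""
    return [
        action_reward(action, expert)
        for (_thought, action), expert in zip(model_steps, expert_steps)
    ]

model_steps = [
    ("factor the equation", "x * (x + 2) = 0"),
    ("solve each factor",   "x = 0 or x = -3"),  # final step is wrong
]
expert_steps = ["x * (x + 2) = 0", "x = 0 or x = -2"]

rewards = score_trajectory(model_steps, expert_steps)
print(rewards)  # exact match on step 1; partial credit on step 2
```

Because every step yields its own reward, a mostly-correct trajectory still produces a useful training signal, which is the key contrast with outcome-only reinforcement learning.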

Outperforming Traditional Training Methods

The research team tested SRL on mathematical reasoning and software engineering benchmarks using the 7-billion-parameter Qwen2.5 model. When trained with standard supervised fine-tuning, the model's performance on hard maths problems actually declined. But with SRL, accuracy improved significantly, particularly when followed by another reinforcement learning phase known as RLVR.

In one experiment, SRL as much as doubled scores on two tough maths tests, AIME24 and AIME25, compared with the baseline. When combined with RLVR, the approach delivered the best open-source results reported so far.

The team also applied SRL to software engineering tasks using verified programming examples generated by Anthropic's Claude 3 Sonnet model. On the SWE-Bench Verified benchmark, SRL more than doubled the performance of the base model, showing that the method works across very different problem types.

A Practical Bridge Between Two Worlds

Traditional supervised training helps AI models mimic examples but can make them rigid and overly dependent on the data. Reinforcement learning encourages exploration but often fails when correct solutions are rare. SRL blends the two, offering the structure of supervision with the flexibility of reinforcement.
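The contrast in feedback signals can be made concrete with a toy example. The numbers below are hypothetical and only illustrate the principle: on a solution whose final step is wrong, outcome-only reinforcement learning yields no reward at all, while SRL's per-step rewards still credit the correct early steps.

```python
# Toy comparison (illustrative, not from the paper) of the feedback
# each paradigm provides on a 3-step solution whose final step, and
# therefore final answer, is wrong.

step_correct = [True, True, False]  # which steps matched the expert

# Outcome-only RL: one sparse, all-or-nothing reward for the answer.
rl_reward = 1.0 if all(step_correct) else 0.0

# SRL: one dense reward per step, so correct early steps still count.
srl_rewards = [1.0 if ok else 0.0 for ok in step_correct]

print(rl_reward)         # 0.0 -> outcome-only RL gets no learning signal here
print(sum(srl_rewards))  # 2.0 -> SRL still credits the two correct steps
```

When correct full solutions are rare, the sparse signal on the first line is almost always zero, which is exactly the failure mode of pure reinforcement learning that SRL's dense, step-level rewards are meant to avoid.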

Crucially, SRL requires no additional reward model and can be applied to relatively small datasets. This makes it a practical approach for open source developers and research teams without access to the huge budgets of commercial AI labs.

Looking Ahead

Experts say SRL could help close the gap between large and small models, enabling compact AI systems that reason more transparently and cost less to run.

While challenges remain, such as the effort needed to produce expert trajectories, the early results are promising. By merging the best of both learning worlds, Google’s new framework may mark an important step towards more capable and efficient AI reasoning.