IT Home news, June 8: Apple Machine Learning Research published a research paper on June 6 local time arguing that existing AI models do not possess genuine thinking or reasoning ability, but instead rely on pattern matching and memorization, especially on complex tasks.
The Apple researchers conducted systematic evaluations of existing cutting-edge "large reasoning models" (LRMs), including OpenAI o3-mini, DeepSeek-R1, Anthropic's Claude 3.7 Sonnet Thinking, and Google's Gemini Thinking.
The study found that although these models can generate detailed "chains of thought" and show advantages on moderately complex tasks, their reasoning ability has a fundamental limitation: once problem complexity exceeds a certain critical point, model performance collapses completely to "zero accuracy."
Moreover, during reasoning, even when ample inference compute is still available, the number of tokens the models spend on "thinking" actually decreases as difficulty increases, suggesting fundamental limits to existing reasoning methods.
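To make that observation concrete, the sketch below shows the kind of measurement it implies: sweep the puzzle difficulty and record how many "thinking" tokens a model spends at each level. This is a hypothetical illustration, not the paper's code; `run_reasoning_model` and the `reasoning_tokens` field are placeholder names, not any vendor's actual API.

```python
# Hypothetical measurement sketch: how many "thinking" tokens does a model
# spend as puzzle difficulty grows? `run_reasoning_model` is a placeholder
# to be wired to a real model endpoint; it is not part of the paper's code.

def run_reasoning_model(prompt: str) -> dict:
    """Placeholder: send `prompt` to a reasoning model, return its usage stats."""
    raise NotImplementedError("connect this to an actual model API")

def thinking_tokens_by_difficulty(difficulties, make_prompt):
    """Return a mapping {difficulty: reasoning-token count} for one model."""
    usage = {}
    for d in difficulties:
        result = run_reasoning_model(make_prompt(d))
        usage[d] = result["reasoning_tokens"]  # tokens spent in the thinking trace
    return usage

# The counterintuitive finding: past a critical difficulty, this count starts
# to fall even though the available token budget is far from exhausted.
```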
The paper, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," was written by Parshin Shojaee et al. It argues that the industry's current evaluation of these models centers on mathematical and programming benchmarks and focuses on final-answer accuracy, which often overlooks data contamination and offers no insight into the structure and quality of the internal reasoning traces.
The researchers instead used a set of controllable puzzle environments that allow precise manipulation of compositional complexity while keeping the logical structure consistent. This makes it possible to analyze not only the final answers but also the internal reasoning traces, giving a deeper view of how these models "think."
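One such puzzle is Tower of Hanoi, where complexity can be dialed up simply by adding disks and a model's proposed move sequence can be verified exactly. The following is a minimal sketch of that idea, not the paper's actual evaluation harness:

```python
# Minimal sketch of a controllable puzzle environment: Tower of Hanoi.
# "Compositional complexity" is set by the number of disks n, and a model's
# proposed move list can be checked exactly against the rules.

def simulate_hanoi(n_disks, moves):
    """Replay (src, dst) peg moves; return True if the puzzle is solved legally."""
    pegs = {0: list(range(n_disks, 0, -1)), 1: [], 2: []}  # peg 0: disks n..1, top = smallest
    for src, dst in moves:
        if not pegs[src]:
            return False                      # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return len(pegs[2]) == n_disks            # solved when all disks reach the target peg

# Example: the optimal 3-disk solution (7 moves) passes the check.
solution_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(simulate_hanoi(3, solution_3))  # True
```

Because the simulator is exact, accuracy can be scored unambiguously at every disk count, which is what lets a study of this kind locate the collapse point precisely.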
The research team found that model performance falls into three regimes:
low-complexity tasks: standard large models (IT Home Note: such as Claude 3.7 without thinking mode) perform better;
medium-complexity tasks: large reasoning models (LRMs) with thinking mechanisms have the advantage;
high-complexity tasks: both types of models collapse into complete failure.
In particular, the study found that LRMs are limited in exact computation: they fail to use explicit algorithms and reason inconsistently across different puzzles.
Overall, the study not only questions the current paradigm of evaluating LRMs on established mathematical benchmarks, but also emphasizes the need for more carefully controlled experimental setups to probe these questions. By using controllable puzzle environments, it provides deep insights into the capabilities and limitations of language-based reasoning models and points to directions for future research.
These findings highlight both the strengths and the limitations of existing LRMs and raise questions about the nature of reasoning in these systems, with important implications for their design and deployment.