Apple Just SHOCKED Everyone: AI IS FAKE!?

Do AI Models Really Think? Apple’s Bold Experiment Challenges the Illusion of Reasoning

Most artificial intelligence models work like this: you give them a question, they give you an answer. But a newer class, called Large Reasoning Models, or LRMs, tries to impress by showing a step-by-step thought process before giving the final answer.

It looks like they’re thinking, solving the problem piece by piece. But here’s the twist: until now, nobody really checked whether those steps actually made sense.

As long as the final answer was correct, we just assumed the model reasoned its way there. But what if it didn’t? What if it’s just stitching together familiar patterns from its training data, making it look like it’s thinking when, in reality, it’s faking the whole process?

Apple’s research team wanted to cut through the confusion around whether artificial intelligence models actually think. So they ran experiments using puzzle-like environments familiar to anyone who has studied computer science.

Inside Apple’s AI Lab: Stress-Testing Reasoning with Classic Logic Puzzles

These include Tower of Hanoi, Checker Jumping, River Crossing, and something called Blocks World. What makes these puzzles perfect is that their difficulty can be increased step by step while keeping the logic exactly the same. You just add more disks in Hanoi, more checker pieces in Checker Jumping, more actor-agent pairs in River Crossing, or more blocks in Blocks World, and the task becomes harder in a controlled way.

The researchers could simulate every move the models made, which meant they could evaluate not just the final answer, but every single step taken along the way, up to 64,000 tokens worth of reasoning. And since the puzzles are made from scratch, there’s no risk that the answers had already shown up in the training data. It’s a clean test.
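To get a feel for what evaluating every single step means in practice, here is a minimal sketch of a move-by-move checker for Tower of Hanoi. This is not Apple’s actual harness; the state encoding and function names are illustrative assumptions.

```python
# Minimal sketch of step-level verification for Tower of Hanoi.
# Not Apple's evaluation harness; the state encoding and function
# names here are illustrative assumptions.

def is_legal_move(pegs, src, dst):
    """A move is legal if the source peg is non-empty and the moved
    disk is smaller than the disk currently on top of the target peg."""
    if not pegs[src]:
        return False
    return not pegs[dst] or pegs[src][-1] < pegs[dst][-1]

def replay(num_disks, moves):
    """Replay a model's proposed move list and report the first illegal
    step, if any. Pegs are lists of disk sizes, largest at the bottom."""
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for step, (src, dst) in enumerate(moves, start=1):
        if not is_legal_move(pegs, src, dst):
            return f"illegal move at step {step}: {src} -> {dst}"
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(num_disks, 0, -1))
    return "solved" if solved else "ended without solving"

# Example: a correct 3-disk solution, then an obviously broken one.
print(replay(3, [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]))  # solved
print(replay(3, [(0, 2), (0, 2)]))  # illegal move at step 2
```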

For example, in Tower of Hanoi, if you start with 8 disks, the shortest correct solution takes 255 moves. With 10 disks, that number jumps to 1,023. In Checker Jumping, if there are two colors with n pieces each, the minimum number of moves needed is n² + 2n. The River Crossing puzzle starts with small groups using a two-person boat, then increases complexity with more pairs and a slightly bigger boat.
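Those numbers are easy to sanity-check, since each puzzle has a closed-form minimum move count. The short sketch below assumes the standard formulas (2^n - 1 for Hanoi, n² + 2n for Checker Jumping with n pieces per color); the paper may parameterize the puzzles slightly differently.

```python
# Closed-form minimum move counts for the scalable puzzles.
# Tower of Hanoi with n disks needs 2**n - 1 moves; Checker Jumping
# with n pieces per color is commonly given as n**2 + 2*n moves
# (treat that as an assumption if the paper's setup differs).

def hanoi_moves(disks: int) -> int:
    return 2 ** disks - 1

def checker_moves(pieces_per_color: int) -> int:
    return pieces_per_color ** 2 + 2 * pieces_per_color

for n in (8, 10, 20):
    print(f"Hanoi, {n} disks: {hanoi_moves(n):,} moves")
# Hanoi, 8 disks: 255 moves
# Hanoi, 10 disks: 1,023 moves
# Hanoi, 20 disks: 1,048,575 moves

print(f"Checkers, 2 pieces per color: {checker_moves(2)} moves")  # 8
```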

Each of these setups becomes harder in a predictable way, letting Apple carefully dial in the challenge level. To test reasoning properly, they compared model pairs with the exact same architecture, same weights, and same size; the only difference was whether the reasoning feature was enabled. So they compared each model’s thinking version against its standard, non-thinking version.

Both versions got the same token budget, 64,000 tokens to work with. Each puzzle variation was run 25 times to reduce randomness. They tracked metrics like pass@k, basically how often the model gets it right within k tries, as well as how many correct steps it took and how much thinking it actually did.
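For readers unfamiliar with pass@k, here is roughly how it is computed. The snippet uses the standard unbiased estimator popularized by OpenAI’s HumanEval work; whether Apple’s study uses exactly this estimator is an assumption, but it captures what the metric measures.

```python
# Sketch of the pass@k metric: the probability that at least one of k
# sampled attempts is correct, given n attempts with c successes.
# Standard unbiased estimator from the HumanEval/Codex paper; whether
# Apple's study uses exactly this form is an assumption.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from n attempts hits one of the c correct ones."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 25 runs per puzzle, 5 of them correct.
print(round(pass_at_k(25, 5, 1), 3))  # 0.2
print(round(pass_at_k(25, 5, 5), 3))  # 0.708
```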

When More Thinking Fails: Apple Uncovers AI’s Reasoning Collapse at High Complexity

Now for the eye-opener. In every puzzle, results split into three distinct zones based on how hard the problem was. In simple puzzles, the regular non-reasoning models did better.

They gave quick, correct answers without wasting extra tokens. In puzzles of medium difficulty, like Hanoi, with about seven disks, the reasoning models started to outperform the others, though they needed a lot more tokens to get there. But once the puzzles got truly hard, all models, reasoning or not, collapsed completely.

Accuracy dropped to zero. Claude 3.7 Thinking handled things well up to 8 disks in Hanoi, tried to push through 10, and then failed completely. DeepSeek R1 gave up even earlier.

When they tested Hanoi all the way up to 20 disks, which requires over a million moves, none of the models could hold up. The same breakdown happened in the other puzzles too, like Checker Jumping once it needed 11 or more jumps, or Blocks World with bigger stacks.

One of the strangest findings in the whole study was how these models handled their internal reasoning effort. You’d assume that the harder the puzzle gets, the more thinking the model would do. But that’s not what happened.

At first, as difficulty rose, the models started thinking more. But once they crossed a certain level of complexity, their reasoning effort actually started dropping, even though they still had plenty of tokens left to use. Apple called this a counterintuitive scaling limit.

AI Hits a Wall: Why Today’s Models Struggle with Step-by-Step Logic

You’d see DeepSeek R1 using up to 15,000 thinking tokens on a moderately hard puzzle, but when the puzzle got even harder, it would suddenly drop down to just 3,000. Even though the system could have done much more, that drop in effort matched exactly with the drop in accuracy. To dig deeper, the researchers even handed the models the full solution algorithm for Hanoi, written out in simple steps.

The models didn’t need to figure anything out, just follow instructions and copy each move. But that didn’t help. They still broke down at the same points.
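For reference, the algorithm in question is essentially the textbook Hanoi recursion. The version below is my own rendering, not the paper’s exact prompt, but it shows how mechanical the procedure is once it is written down.

```python
# The textbook recursive procedure for Tower of Hanoi. This is the kind
# of explicit, step-by-step algorithm the models were given to follow;
# the exact wording in Apple's prompt differs, so treat this rendering
# as illustrative.

def solve_hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Append the optimal move sequence for n disks to `moves`."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))                   # move the largest disk
    solve_hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top
    return moves

moves = solve_hanoi(10)
print(len(moves))   # 1023 -- the full sequence the models had to reproduce
print(moves[:3])    # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```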

This shows the issue isn’t about solving the puzzle from scratch, it’s about being able to execute a logical sequence over many steps. Claude 3.7 Thinking could do about 100 steps correctly out of the 1,023 needed for 10 disks, then failed. In River Crossing, it failed after just 5 or 6 moves, despite the solution requiring only 11 moves in total.

That’s not a memory issue, that’s a failure in symbolic reasoning. And of course, the moment this research hit the internet, everyone in the artificial intelligence space had something to say. Gary Marcus, who’s been skeptical of neural networks for years, called the findings pretty devastating.

He reminded everyone that back in 1957, Herb Simon had already coded a working solution for Hanoi, and that today you can find hundreds of working algorithms online. If artificial intelligence models can’t even follow those, how are they supposed to reason? Others pushed back. Kevin Bryan, from the University of Toronto, said Apple might not be exposing a weakness in intelligence, but rather a design choice.

Limits of Logic: Why AI Overthinks the Easy and Gives Up on the Hard

These models are often trained to avoid overthinking simple queries, mostly to save time and reduce server costs. He believes if you gave them more room to think and allowed them to use more tokens, they’d do better. It’s not that they can’t solve harder problems, it’s that they’re trained not to try too hard.

Software engineer Sean Goedecke made a similar point, saying that when DeepSeek saw that a puzzle required a thousand moves, it probably realized it would run out of space and didn’t even attempt the full sequence. Meanwhile, artificial intelligence researcher Simon Willison said that using Tower of Hanoi to judge reasoning is kind of missing the point, since language models aren’t designed like step-by-step machines and might never match the efficiency of a proper algorithm. Even Apple’s own team admits the puzzles they used cover only a narrow part of what reasoning means.

But the reason these puzzles are still useful is that every move can be tracked and verified. That makes it really easy to spot when a model’s so-called reasoning just falls apart. The researchers noticed that on easy puzzles, Claude often got the right answer early in its chain of thought, but then wasted time chasing wrong paths anyway.

Basically overthinking. On medium puzzles, it often got stuck in incorrect thinking first and found the right answer later, and on hard puzzles, the correct answer never showed up at all, not even once across the entire 64,000-token reasoning sequence. That clear shift from getting it right early, to getting it right late, to never getting it at all showed up across all puzzles and was visualized in Figure 7 of the paper.

Now, jump from that lab setting to Monday’s Worldwide Developers Conference event in Cupertino. You’d think, with all this deep research going on, that Apple would open with something groundbreaking in artificial intelligence. Instead, they focused on design.

Polish on the Surface, Problems Underneath: Apple’s AI Dazzles but Still Stumbles on Deep Reasoning

Their new Liquid Glass look is what Apple is calling the biggest redesign since iOS 7. It’s all about translucent buttons, fluid animations, and curvy glass-like elements across iPhones, Macs, and even the Vision Pro headset. Craig Federighi showed off how Apple Silicon is now fast enough to render all these effects at 120 frames per second. The animations look beautiful, but Wall Street wasn’t impressed.

Apple’s stock dropped by 1.2%. One analyst said the artificial intelligence features shown were nothing new. They’re already available in other apps. To be fair, Apple did showcase a few artificial intelligence tricks, just not the kind that steal headlines.

They demonstrated a new real-time translation feature for phone calls: two people can talk in different languages, and the phone translates their voices instantly without needing a cloud connection.

They added a button to screenshots that lets you send an image to ChatGPT to summarize it. You can now also use ChatGPT to generate images directly inside Apple’s photo apps. But the big Siri upgrade they teased last year, the one that could scan your emails and schedule things for you automatically, was missing.

Apple says it still needs more time to meet its quality standards. Meanwhile, their researchers were tackling a much deeper question. Apple’s study even tested OpenAI’s o1 and o3-mini models, and they ran into the same brick wall.

They performed well up to around 8 disks in Hanoi, but completely failed by 10. This matched what the United States Math Olympiad study found earlier this year. Those same models scored below 5% when given brand new math proofs they hadn’t seen before.

Where Pattern Ends and Reasoning Fails: Apple’s Research Exposes AI’s True Limits

Only one managed 25%. None got a perfect score. As long as the solution already exists in their training data, they do fine.

But the moment a truly new challenge appears, the illusion of reasoning starts to crack.

OK, now there was one last finding that really stood out. Apple ran math benchmark tests, again using those same models, DeepSeek and Claude, with and without reasoning. On the popular MATH-500 dataset, both versions performed about the same when allowed to try multiple answers.

But on newer test sets, like AIME24 and AIME25, the reasoning models pulled ahead. What’s strange is that human test takers actually did better on AIME25 than on AIME24, but the artificial intelligence models did worse. That likely means older test sets had already leaked into artificial intelligence training data, which artificially boosted their scores.

So the newer test exposed what the older ones had hidden. One more curious detail: Claude 3.7 Thinking could perform over 100 correct steps on 10-disk Hanoi before making its first mistake. But in the River Crossing puzzle, it broke down after just 5 moves, even though the total solution only needed 11.

Why? The study suggests it’s because there are far more examples of Hanoi puzzles online, especially for 10 disks. River Crossing examples with more than two pairs of people are rare, so the model had less to draw from. What we’re seeing isn’t real reasoning, it’s the line where pattern recall stops working.

So is it fair to say artificial intelligence is fake? Maybe not in the way people mean when they say that, but Apple’s research definitely shows that current artificial intelligence systems can’t truly reason their way through long, complex tasks. Their thinking falls apart once the challenge reaches a certain point. Some researchers say we can fix that with more tokens and better training.

Others say we’ve hit a wall and need a new approach entirely. But what do you think? Did Apple just expose that most artificial intelligence models aren’t actually thinking, just pretending to? Let me know what you think down in the comments. Do not forget to subscribe, hit like if you found this wild, and thanks for reading.

Catch you in the next one.
