They Lied to You About AI (This Study Proves It)

Caleb Ulku 9:50
Transcript
0:00 A father and son just mathematically proved that an AI agent will never do what Silicon
0:04 Valley is promising. Not probably won't. Not might have limitations. They've mathematically
0:10 proved. They used computational complexity theory that's been settled since the 1960s. And this
0:15 isn't coming from some AI doomer or clickbait journalist. This is coming from Vishal Sikka,
0:21 former CEO of Infosys, board member at Oracle and BMW. He's a Stanford PhD who literally studied
0:27 under John McCarthy, the man who coined the term artificial intelligence. Sikka and his
0:33 son just published a paper that no one in AI marketing departments wants you to read,
0:38 especially right now as we enter the era of Manus and OpenClaw, the agents that can use
0:44 your browser and click buttons for you. It looks like AGI has arrived, but Sikka says we're actually
0:50 just watching the ceiling get higher, not disappear. Their argument is simple. LLMs can
0:56 only perform a certain number of computations per response. That number is fixed. And if a task
1:02 requires more computation than that ceiling allows, the model will either fail or hallucinate. And
1:09 this isn't a maybe. It's baked into the math. But if the math is so broken, then why are the big
1:15 players still promising the world? I'll tell you the devious reason why at the end of this video.
1:21 But first, I want to look at the ceiling that they discovered. Now, when you send a prompt to
1:26 ChatGPT or Claude or Grok or any of the current frontier models, the model will do a fixed amount
1:31 of work to generate each word as an output. This happens through the self-attention mechanism. This
1:37 is of course very simplified. But think of it like this. Every word in your prompt needs to look at
1:42 every other word to understand the context. So if you have a thousand words, it's a million
1:47 comparisons. A thousand times a thousand. But there's no "let me think about this harder." There's no
1:52 "give me more time on this one." Every token gets the same budget. A simple "hello" gets the same
1:58 number of operations as a complex physics problem. That's the ceiling. It's not about better hardware.
2:04 It's about the architecture of how the systems actually work. The paper, and I have it here on
2:09 screen if you want to read it, it uses the traveling salesman problem as an example. To visit 20 cities
2:15 and figure out the shortest possible route between those cities, you need to check over two quintillion
2:20 combinations. An LLM physically cannot do that math in one shot. So what does it do? It guesses. It pattern-matches. It gives you something that looks plausible, and that's not a bug, that's the architecture.
2:34 But how would you actually handle tasks
2:36 that require that level of computation?
2:39 Next, I'll show you why even verifying the answers
2:42 is just as impossible for these models.
2:44 The authors of this paper make a distinction,
2:47 doing a task versus verifying it.
2:50 Now, you'd think that the model could at least check if the answer is right,
2:53 even if it can't handle the computational complexity to calculate it.
2:57 But no, verification often requires just as much work as solving the problem up front.
3:02 Every AI demo you've ever seen,
3:05 it was running tasks designed to stay under the necessary complexity ceiling.
3:09 They work because they're designed to work.
3:12 Meanwhile, the real-world tasks that your business actually needs
3:16 are going to blow right past that ceiling.
3:18 And this is where Sikka's background becomes a factor.
3:21 This isn't an outsider's perspective.
3:23 Remember, he studied under John McCarthy,
3:26 the man who literally coined the term artificial intelligence.
3:29 He's bridging the gap between the foundational laws of the 1960s
3:33 and the chaotic world of AI in 2026.
3:37 He isn't saying these tools are useless.
3:39 Far from it.
3:39 He's just saying they're being marketed as reasoning engines
3:43 when the math proves they're actually pattern mirrors.
3:45 They reference the time hierarchy theorem.
3:48 Again, I don't mean to throw so many fancy words at you,
3:51 but this basically says that some problems require a minimum number of steps.
3:55 You just can't shortcut them.
3:57 And the argument that the paper makes,
3:59 if a task needs more steps than the model can perform,
4:02 it will unavoidably hallucinate.
4:05 Unavoidably.
4:06 And this is why hallucination isn't a training issue.
4:08 Yes, more recent models have gotten better at avoiding it,
4:11 but for certain problems, hallucination is the only possible output.
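The theorem being invoked here has a precise textbook form. The following is the standard deterministic time hierarchy theorem from complexity theory; the paper's exact formulation may differ, but this is the classical statement that "more steps buys strictly more solvable problems":

```latex
% Deterministic time hierarchy theorem (textbook statement):
% if f is time-constructible and f(n) \log f(n) = o(g(n)), then
\mathrm{DTIME}\!\left(f(n)\right) \subsetneq \mathrm{DTIME}\!\left(g(n)\right)
```

In words: given strictly more time, a machine can solve strictly more problems, so some problems genuinely require a minimum number of steps that no cleverness can shortcut.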
4:14 But wait, you might be thinking, what about the new agentic era? Tools like Manus or OpenClaw,
4:19 they don't just give one answer. They run thousands of loops, browsing the web and thinking
4:24 through step by step. The tech community is calling this chain of thought or agentic workflows.
4:30 And the idea is that if a model has a ceiling, just spread the problem across more steps,
4:35 give it more room to work. But Sikka's paper argues this is a trap. And here's why. If
4:40 you have a fixed amount of thinking power per word, giving the AI more steps is like giving a
4:46 writer more sheets of paper. Each individual sheet is still the same size. You haven't made the writer smarter. You've just given them more room to ramble off topic. That's why you see an agent book a flight perfectly but then get stuck
5:00 in a bizarre infinite loop trying to change a seat assignment. The math, specifically again,
5:05 that time hierarchy theorem, says that for complex problems, errors eventually compound.
5:11 The model goes off track at step five, and because it can't mathematically verify its own logic,
5:16 the whole chain eventually falls apart. In the agentic era, hallucination isn't a training bug.
5:22 It's a cumulative mathematical certainty. Then, of course, you might be arguing, well, they can just
5:27 use a tool, give it a calculator. After all, we wouldn't expect a human to be able to calculate
5:31 the traveling salesman problem by hand. But Sikka acknowledges this as well. You can build components
5:36 around LLMs to overcome the limits, of course, and then the LLM becomes an orchestrator. But notice
5:42 what just happened. The LLM didn't solve the problem, it just handed it off to a classical
5:47 algorithm that could. But the catch? The model still has to verify that that tool worked. And
5:53 if verifying correctness requires more math than the model can do, again the agent fails in
5:58 unpredictable ways. Well, what about those massive context windows? Gemini 3 Pro can see a million
6:04 tokens at once. Yes, that solves information access. It doesn't solve the computational steps
6:09 per word. Having a bigger filing cabinet doesn't help if you don't have the brain power to process
6:15 what's inside. So what does this mean for you? Now, the paper, it's not saying that AI is useless.
6:20 Indeed, it definitely is not. I use these tools every day in my business. I'm sure most of the
6:25 people watching this do as well. For the right applications, current AI, the current frontier
6:30 models are exceptional. Writing drafts, summarizing, reformatting data, research and comparison,
6:35 these stay under that ceiling. The problem is the gap between reality and the pitch decks.
6:41 AI agents will autonomously run your business is a lie. The math just doesn't support it.
6:47 To see this in action, look at Vending Bench 2 from Andon Labs. This is the 2026 gold standard
6:53 for testing AI agents at running a business. Models like Claude Opus 4.6, Gemini 3 Pro,
6:59 they're given $500 and a year to run a simulated vending machine business.
7:04 And on paper, the agents look like they're winning.
7:07 The current leader, Claude Opus 4.6, netted $8,000 in profit.
7:11 Here's that test, Vending Bench 2, feel free to look it up yourself.
7:14 And here are the current standings for Frontier Models.
7:17 We can see Claude Opus 4.6, pretty good. But here's the actual ceiling: Andon Labs calculated a human baseline for this exact same simulation.
7:28 Let me scroll down and show that to you.
7:30 It's a long paper.
7:31 Here, this isn't even the best ever: $63,000 a year.
7:35 This is a human baseline
7:37 and it blows the AI models out of the water.
7:40 The reason the AI models can't make $63,000 a year
7:44 is because they lose coherence over a long time frame.
7:48 The result, the frontier models, the best we can make now,
7:52 aren't hitting even 15% of a human baseline.
7:56 Over these runs, we've seen agents honestly give away their inventory for free
8:00 due to social engineering, or they've even tried to contact the FBI
8:04 to report their own $2 bank fees as fraud.
8:07 And this is the time hierarchy theorem in the wild.
8:10 As the chain of tasks gets longer,
8:12 that AI's ability to verify its own logic collapses.
8:17 It doesn't matter how smart the model is,
8:19 the math says that without a human to reset the error rate,
8:22 the autonomous chain will eventually break.
8:25 So here's what you actually do
8:27 to stay on the winning side of this math.
8:30 First, be specific about tasks.
8:33 Draft an email using my tone and cadence, that works.
8:37 Automate this workflow is going to fail.
8:39 Build in human verification.
8:41 This is a structural requirement, not an option.
8:44 And third, use AI for pattern recognition, not logic-heavy math.
8:48 But here is the real tip-off.
8:50 Why the singularity probably isn't as close as people keep saying.
8:55 Because if the singularity were just months away, why are the smartest people in the room quitting?
9:00 Look at the insiders.
9:01 If OpenAI was about to hit AGI, why would senior engineers be leaving to start risky startups?
9:08 If you knew the world was about to change forever, you wouldn't leave.
9:12 You wouldn't leave OpenAI if they're on the verge of AGI.
9:15 You'd stay to be part of the release of a lifetime, to be part of the equity of a lifetime,
9:20 unless you saw the ceiling.
9:22 Now, they know the next model will be better, but not qualitatively different.
9:25 Just like ChatGPT-5: it was better than GPT-4, but not qualitatively different.
9:30 They're starting companies that use AI as a tool, not companies that use AI as a god.
9:35 The opportunity here is not chasing some imaginary AGI.
9:40 The opportunity is in understanding exactly what AI can do for you right now.
9:45 The ceiling is real, but there's a lot of room underneath it.

Caleb Ulku breaks down a paper by Vishal Sikka (former Infosys CEO, Stanford PhD who studied under AI pioneer John McCarthy) and his son, which uses computational complexity theory to argue that LLMs have a hard mathematical ceiling on how much computation they can perform per response. Because of the self-attention architecture, every token gets the same fixed computational budget — meaning tasks requiring more steps than that budget allows will inevitably produce hallucinations, not due to bad training but due to math. The paper argues that agentic AI (multi-step autonomous workflows) doesn't solve this problem — it compounds it, as errors accumulate and the model cannot verify its own logic over long chains. Real-world benchmark data (Vending Bench 2) supports this: the best frontier AI models achieve less than 15% of human baseline performance when running a simulated business over time.

Topics: Mathematical Limits of LLMs · AI Agent Hype vs. Reality · Hallucination as Structural Inevitability · Practical AI Use vs. AGI Mythology · Insider Signals and AGI Skepticism · Vishal Sikka · John McCarthy
  • Use AI for tasks that stay under the computational ceiling: drafting, summarizing, reformatting data, and research — not for complex autonomous multi-step workflows.
  • Always build in human verification checkpoints when using AI agents, because the math guarantees error accumulation over long task chains — this is a structural requirement, not optional.
  • Treat AI as a pattern-recognition tool, not a reasoning engine — and be skeptical of 'AI will run your business autonomously' pitches, since current frontier models hit less than 15% of human baseline on business simulation benchmarks.
Q&A (16 questions)
What did Vishal Sikka and his son mathematically prove about AI agents?

Vishal Sikka (former CEO of Infosys, board member at Oracle and BMW, Stanford PhD who studied under John McCarthy) and his son published a paper proving that AI agents will never be able to do what Silicon Valley is promising. Using computational complexity theory settled since the 1960s, they mathematically demonstrated that LLMs can only perform a fixed number of computations per response, and if a task requires more computation than that ceiling allows, the model will either fail or hallucinate — not as a maybe, but as a mathematical certainty baked into the architecture.

Why do LLMs hallucinate, and is it a training problem that can be fixed?

According to Sikka's paper, hallucination is not primarily a training issue — it is a mathematical inevitability for certain types of problems. LLMs perform a fixed amount of computation per token generated, and when a task requires more computational steps than the model's architecture allows, hallucination becomes the only possible output. While newer models have gotten better at reducing hallucination on simpler tasks, for computationally complex problems, hallucination is unavoidable regardless of training quality. This is rooted in the time hierarchy theorem, which states that some problems require a minimum number of steps that simply cannot be shortcut.

What is the 'computational ceiling' of LLMs and why does it matter?

The computational ceiling refers to the fixed number of computations an LLM can perform per response or per token generated. Every token gets the same computational budget — a simple 'hello' gets the same number of operations as a complex physics problem. This ceiling is not about hardware limitations; it is baked into the self-attention architecture of how these systems work. It matters because any task that requires more computation than this ceiling allows — such as finding the optimal route among 20 cities (which requires checking over 2 quintillion combinations) — cannot be solved correctly. The model will instead pattern-match and produce a plausible-sounding but potentially wrong answer.
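The "same budget for every token" point can be made concrete with a toy calculation. This is my own illustration of the quadratic comparison count, not code from the paper:

```python
def attention_comparisons(n_tokens: int) -> int:
    """In self-attention, every token is compared against every token
    (itself included), so the comparison count grows quadratically."""
    return n_tokens * n_tokens

# A trivial greeting and a hard physics prompt of the same length
# get exactly the same number of operations:
print(attention_comparisons(5))      # 25
print(attention_comparisons(1_000))  # 1000000 -- one million comparisons
```

Note that the count depends only on prompt length, never on difficulty — that is the fixed, prompt-independent budget the video calls the ceiling.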

Why don't agentic AI systems like Claude or OpenAI's browser agents solve the computational ceiling problem?

Agentic systems attempt to overcome the ceiling by spreading a problem across many steps — chain-of-thought reasoning, browsing the web, running loops. However, Sikka's paper argues this is a trap. Giving an AI more steps is like giving a writer more sheets of paper: each individual sheet is still the same size, so you haven't made the writer smarter, just given them more room to ramble. More critically, because the model cannot mathematically verify its own logic at each step, errors compound over time. The model may go off track at step five, and because it can't verify its own reasoning chain, the entire sequence eventually falls apart. In agentic workflows, hallucination becomes a cumulative mathematical certainty, not just an occasional bug.
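The compounding argument can be sketched numerically. This is a back-of-the-envelope model that assumes independent per-step success probabilities (my simplification, not the paper's formalism):

```python
def chain_success_probability(p_step: float, n_steps: int) -> float:
    """If each unverified step succeeds with probability p_step,
    the whole chain succeeds only when every single step does."""
    return p_step ** n_steps

# Even a 99%-reliable step collapses over a long agentic run:
for n in (10, 100, 500):
    print(n, round(chain_success_probability(0.99, n), 3))
```

At 100 steps the chain succeeds barely a third of the time, and by 500 steps almost never — which is why a human checkpoint to "reset the error rate" matters on long runs.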

What is the time hierarchy theorem, and how does it apply to AI limitations?

The time hierarchy theorem is a foundational result in computational complexity theory (established since the 1960s) that states some problems require a minimum number of computational steps — they simply cannot be shortcut. Sikka's paper applies this to LLMs by arguing that if a task needs more steps than the model can perform within its fixed per-token computation budget, the model will unavoidably hallucinate or fail. In agentic settings, as chains of tasks get longer, the AI's ability to verify its own logic collapses according to this theorem, meaning errors compound and the autonomous chain will eventually break without human intervention to reset the error rate.

What does the Vending Bench 2 benchmark reveal about AI agents' real-world capabilities?

Vending Bench 2, developed by Andon Labs, is a 2026 benchmark that tests AI agents by giving them $500 and a simulated year to run a vending machine business. While the results look impressive on the surface — the current leader, Claude Opus 4.6, netted $8,000 in profit — the human baseline for the same simulation is $63,000. That means the best frontier AI models are achieving less than 15% of human baseline performance. The agents also exhibited bizarre failures consistent with the time hierarchy theorem: giving away inventory for free due to social engineering, and even attempting to contact the FBI to report $2 bank fees as fraud. These failures demonstrate that as task chains grow longer, AI agents lose coherence and their ability to verify their own logic collapses.
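The "less than 15%" figure follows directly from the two numbers quoted above:

```python
ai_profit = 8_000        # Claude Opus 4.6, current Vending Bench 2 leader
human_baseline = 63_000  # Andon Labs' human baseline on the same simulation

ratio = ai_profit / human_baseline
print(f"{ratio:.1%}")    # 12.7% of the human baseline
```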

Can giving an LLM access to external tools like calculators solve its computational limitations?

Partially, but not completely. Sikka acknowledges that you can build components around LLMs — giving them calculators, search tools, or classical algorithms — and the LLM then becomes an orchestrator. This does allow it to hand off computationally intensive tasks to tools that can handle them. However, the catch is that the model still has to verify that the tool worked correctly. If verifying the correctness of the tool's output requires more computation than the model can perform, the agent still fails in unpredictable ways. So while tooling extends what AI can do, it doesn't fully escape the fundamental ceiling.
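The orchestrator pattern described above can be sketched in a few lines. Everything here is hypothetical scaffolding for illustration — the function names, the task format, and the tiny exact solver are mine, not an API from any real agent framework:

```python
import math
from itertools import permutations

def exact_tour_length(dists):
    """Classical brute-force TSP solver: fine for a handful of cities,
    hopeless for 20 -- which is exactly why the LLM must delegate."""
    n = len(dists)
    best = math.inf
    for order in permutations(range(1, n)):
        tour = (0, *order, 0)
        best = min(best, sum(dists[a][b] for a, b in zip(tour, tour[1:])))
    return best

def orchestrate(task, tools):
    """Toy 'LLM as orchestrator': it doesn't solve the task, it routes it.
    Verifying the tool's answer is the part that stays hard."""
    return tools[task["kind"]](task["payload"])

dists = [[0, 2, 9], [2, 0, 6], [9, 6, 0]]
print(orchestrate({"kind": "tsp", "payload": dists},
                  {"tsp": exact_tour_length}))  # 17
```

The design point: the hard computation lives in the classical tool, and the orchestrator's remaining job — checking that the tool's output is actually correct — is the step the ceiling argument says can still exceed the model's budget.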

Does a larger context window (like Gemini's 1 million token window) solve the computational ceiling problem?

No. A larger context window solves information access — the model can 'see' more data at once — but it does not increase the computational steps the model can perform per token generated. As the video explains, having a bigger filing cabinet doesn't help if you don't have the brain power to process what's inside. The ceiling is about the number of computational operations per word output, not about how much information the model can reference. So million-token context windows are a genuine improvement for certain tasks but do not address the fundamental architectural limitation identified in Sikka's paper.

What types of tasks are AI models actually good at, given these mathematical limitations?

AI models excel at tasks that stay under the computational ceiling — specifically tasks involving pattern recognition, summarization, and reformatting rather than deep logical reasoning or combinatorial problem-solving. Practical examples include: writing drafts, summarizing documents, reformatting data, research and comparison tasks, and drafting emails in a specific tone or cadence. These tasks work well because they don't require more computational steps than the model's architecture allows. The problem arises when AI is marketed as capable of autonomously running businesses or handling complex multi-step logical tasks, which the math shows it cannot reliably do.

What are the three practical recommendations for working effectively with AI given its mathematical limitations?

Based on the analysis of Sikka's paper, there are three key recommendations: (1) Be specific about tasks — for example, 'draft an email using my tone and cadence' works, while 'automate this workflow' will fail because it's too vague and computationally complex. (2) Build in human verification as a structural requirement, not an optional add-on — because AI cannot reliably verify its own logic on complex tasks, humans must serve as checkpoints to reset error rates before they compound. (3) Use AI for pattern recognition, not logic-heavy math — leverage AI's genuine strengths in summarization, drafting, and comparison rather than expecting it to handle computationally intensive reasoning chains autonomously.
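The second recommendation — human verification as a structural checkpoint — might look something like the following loop. This is a hypothetical pattern of my own, not an implementation prescribed by the paper or the video:

```python
def run_with_checkpoints(steps, human_approves):
    """Run an agentic chain, but stop for human sign-off after every
    step -- the checkpoint that resets the error rate before it compounds."""
    results = []
    for i, step in enumerate(steps):
        result = step()
        if not human_approves(i, result):   # human rejects: halt the chain
            raise RuntimeError(f"step {i} failed human verification")
        results.append(result)
    return results

# Toy usage: a stand-in predicate plays the human reviewer.
approved = run_with_checkpoints(
    steps=[lambda: "draft email", lambda: "summary of Q3 numbers"],
    human_approves=lambda i, r: isinstance(r, str) and r != "",
)
print(approved)
```

The key property is that the chain cannot silently continue past a bad step: every intermediate result passes through a gate the model does not control.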

Why are senior engineers leaving top AI companies like OpenAI if AGI is supposedly imminent?

The video argues that the exodus of senior engineers from companies like OpenAI is a revealing signal that AGI is not as close as marketed. The reasoning: if you genuinely believed your company was months away from AGI, you would not leave. You would stay to be part of the most significant technological release in history and to benefit from the associated equity. The fact that senior engineers are instead leaving to start their own companies — companies that use AI as a tool rather than treating AI as a god — suggests they can see the ceiling. They understand that the next model will be better but not qualitatively different, just as ChatGPT-5 was better than GPT-4 but not fundamentally different in kind.

Who is Vishal Sikka and why is his critique of AI significant?

Vishal Sikka is the former CEO of Infosys, a board member at Oracle and BMW, and a Stanford PhD. Crucially, he studied under John McCarthy — the computer scientist who coined the term 'artificial intelligence.' This background makes his critique particularly significant because he is not an AI doomer, a clickbait journalist, or an outsider. He bridges the foundational laws of computer science established in the 1960s with the current AI landscape. His critique carries weight precisely because he comes from inside the field, has deep technical credentials, and is not motivated by sensationalism. He and his son published a formal paper using established computational complexity theory rather than making speculative claims.

What is the difference between AI being a 'reasoning engine' versus a 'pattern mirror,' and why does it matter?

A reasoning engine would be capable of working through novel problems step by step, verifying its logic, and arriving at correct answers even for problems it hasn't seen before — including computationally complex ones. A pattern mirror, by contrast, recognizes patterns in its training data and produces outputs that look plausible based on those patterns, without actually performing the underlying computation required to verify correctness. Sikka's paper argues that LLMs are being marketed as reasoning engines when the math proves they are actually pattern mirrors. This distinction matters enormously for business decisions: tasks requiring genuine reasoning (complex logistics, legal analysis, autonomous business operations) will fail, while tasks that benefit from pattern recognition (drafting, summarizing, reformatting) will succeed.

Why do AI demos always seem to work perfectly if the technology has these fundamental limitations?

AI demos work because they are specifically designed to stay under the computational ceiling. The tasks chosen for demonstrations are those where the required computation fits within what the model can handle — they are curated to showcase success. Real-world business tasks, however, often require more computation than the ceiling allows, and that's where the failures occur. As the video puts it: 'Every AI demo you've ever seen was running tasks designed to stay under the necessary complexity ceiling. They work because they're designed to work.' This creates a misleading impression of general capability when the technology actually has hard mathematical limits on the complexity of problems it can solve.

What is the traveling salesman problem and why is it used to illustrate LLM limitations?

The traveling salesman problem involves finding the shortest possible route to visit a set of cities and return to the starting point. For just 20 cities, the number of possible combinations to check exceeds 2 quintillion. It is used in Sikka's paper to illustrate LLM limitations because it is a classic example of a computationally complex problem where no shortcut exists — you must check combinations to find the true optimum. An LLM physically cannot perform that many calculations in a single response pass, so instead of computing the answer, it pattern-matches and produces something that looks plausible. This is not a bug that better training can fix; it is a direct consequence of the fixed computational budget per token built into the architecture.
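The combinatorial explosion is easy to check. Counting distinct undirected round trips from a fixed starting city gives (n−1)!/2; counting all directed orderings of the n cities gives n!, which for n = 20 is the "over two quintillion" figure:

```python
import math

# Distinct undirected round trips from a fixed start city: (n - 1)! / 2
for n in (5, 10, 15, 20):
    print(f"{n:2d} cities: {math.factorial(n - 1) // 2:,} tours")

# Counting all orderings of the 20 cities instead gives 20!:
print(f"{math.factorial(20):,}")  # 2,432,902,008,176,640,000
```

Either way of counting, the search space for 20 cities is on the order of 10^16 to 10^18 candidates — far beyond any fixed per-response computation budget.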

What is the real opportunity in AI if autonomous agents and AGI are overhyped?

The real opportunity lies in understanding precisely what AI can and cannot do right now, and deploying it accordingly. Current frontier models are genuinely exceptional for tasks that stay under the computational ceiling: writing drafts, summarizing content, reformatting data, research and comparison. The ceiling is real, but there is a lot of room underneath it. The opportunity is not chasing imaginary AGI or expecting AI to autonomously run your business — the math shows that will fail. Instead, the opportunity is using AI as a powerful tool for well-defined, pattern-recognition-based tasks while maintaining human oversight for verification and complex decision-making. Businesses and individuals who understand these limits will use AI effectively; those who believe the marketing hype will be disappointed.