Caleb Ulku breaks down a paper by Vishal Sikka (former Infosys CEO, Stanford PhD who studied under AI pioneer John McCarthy) and his son, which uses computational complexity theory to argue that LLMs have a hard mathematical ceiling on how much computation they can perform per response. Because of the self-attention architecture, every token gets the same fixed computational budget — meaning tasks requiring more steps than that budget allows will inevitably produce hallucinations, not due to bad training but due to math. The paper argues that agentic AI (multi-step autonomous workflows) doesn't solve this problem — it compounds it, as errors accumulate and the model cannot verify its own logic over long chains. Real-world benchmark data (Vending Bench 2) supports this: the best frontier AI models achieve less than 15% of human baseline performance when running a simulated business over time.
Vishal Sikka (former CEO of Infosys, board member at Oracle and BMW, Stanford PhD who studied under John McCarthy) and his son published a paper arguing that AI agents will never be able to do what Silicon Valley is promising. Using computational complexity theory settled since the 1960s, they mathematically demonstrated that LLMs can only perform a fixed number of computations per response, and if a task requires more computation than that ceiling allows, the model will either fail or hallucinate — not as a maybe, but as a mathematical certainty baked into the architecture.
According to Sikka's paper, hallucination is not primarily a training issue — it is a mathematical inevitability for certain types of problems. LLMs perform a fixed amount of computation per token generated, and when a task requires more computational steps than the model's architecture allows, hallucination becomes the only possible output. While newer models have gotten better at reducing hallucination on simpler tasks, for computationally complex problems, hallucination is unavoidable regardless of training quality. This is rooted in the time hierarchy theorem, which states that some problems require a minimum number of steps that simply cannot be shortcut.
The computational ceiling refers to the fixed number of computations an LLM can perform per response or per token generated. Every token gets the same computational budget — a simple 'hello' gets the same number of operations as a complex physics problem. This ceiling is not about hardware limitations; it is baked into the self-attention architecture of how these systems work. It matters because any task that requires more computation than this ceiling allows — such as finding the optimal route among 20 cities (which requires checking over 2 quintillion combinations) — cannot be solved correctly. The model will instead pattern-match and produce a plausible-sounding but potentially wrong answer.
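The fixed budget described above can be made concrete with a back-of-the-envelope sketch. A common rule of thumb for decoder-only transformers is roughly 2 FLOPs per parameter per generated token; the 70B parameter count below is a made-up illustration, not any specific product, and the point is only that the budget does not depend on the difficulty of the prompt.

```python
# Illustrative sketch: a decoder-only transformer spends roughly the same
# number of floating-point operations on every generated token, regardless
# of how hard the question is. Uses the common ~2 FLOPs-per-parameter
# rule of thumb; the model size is a hypothetical example.

def flops_per_token(n_params: int) -> int:
    """Approximate forward-pass FLOPs to emit one token (~2 * parameters)."""
    return 2 * n_params

N_PARAMS = 70_000_000_000  # hypothetical 70B-parameter model

easy = flops_per_token(N_PARAMS)  # budget for a token of "hello"
hard = flops_per_token(N_PARAMS)  # budget for a token of a physics answer

assert easy == hard  # the budget is fixed per token, not per difficulty
print(f"~{easy:.2e} FLOPs per token, no matter the task")
```

The equality assertion is the whole argument in miniature: nothing about the input changes the per-token operation count.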
Agentic systems attempt to overcome the ceiling by spreading a problem across many steps — chain-of-thought reasoning, browsing the web, running loops. However, Sikka's paper argues this is a trap. Giving an AI more steps is like giving a writer more sheets of paper: each individual sheet is still the same size, so you haven't made the writer smarter, just given them more room to ramble. More critically, because the model cannot mathematically verify its own logic at each step, errors compound over time. The model may go off track at step five, and because it can't verify its own reasoning chain, the entire sequence eventually falls apart. In agentic workflows, hallucination becomes a cumulative mathematical certainty, not just an occasional bug.
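The compounding-error claim can be illustrated with a toy probability model: if each step of an agentic chain succeeds independently with probability p, the chance the whole chain stays correct decays geometrically with its length. The 95% per-step figure is an assumption for illustration, not a measured model accuracy.

```python
# Toy model of error compounding in an agentic chain: per-step success
# probability p, so an n-step chain is fully correct with probability p**n.
# The 0.95 figure is an illustrative assumption, not a benchmark result.

def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that every step in an n-step chain is correct."""
    return p_step ** n_steps

for n in (1, 10, 20, 50):
    print(f"{n:3d} steps: {chain_success(0.95, n):.1%} chance of a clean run")
```

Even with 95% reliability per step, a 50-step workflow completes cleanly less than 8% of the time — which is why the paper treats failure in long chains as cumulative rather than occasional.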
The time hierarchy theorem is a foundational result in computational complexity theory (established since the 1960s) that states some problems require a minimum number of computational steps — they simply cannot be shortcut. Sikka's paper applies this to LLMs by arguing that if a task needs more steps than the model can perform within its fixed per-token computation budget, the model will unavoidably hallucinate or fail. In agentic settings, as chains of tasks get longer, the AI's ability to verify its own logic collapses according to this theorem, meaning errors compound and the autonomous chain will eventually break without human intervention to reset the error rate.
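For reference, the standard statement of the deterministic time hierarchy theorem (this is the textbook formulation, not notation from Sikka's paper) says that strictly more time buys strictly more solvable problems:

```latex
% Deterministic time hierarchy theorem: if g is time-constructible and
% f(n) \log f(n) = o(g(n)), then
\mathrm{TIME}\big(f(n)\big) \subsetneq \mathrm{TIME}\big(g(n)\big)
```

In other words, there provably exist problems solvable in g(n) steps that no f(n)-step procedure can solve — the formal basis for saying some computations cannot be shortcut into a fixed budget.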
Vending Bench 2, developed by Andon Labs, is a 2026 benchmark that tests AI agents by giving them $500 and a simulated year to run a vending machine business. While the results look impressive on the surface — the current leader, Claude Opus 4.6, netted $8,000 in profit — the human baseline for the same simulation is $63,000. That means the best frontier AI models are achieving less than 15% of human baseline performance. The agents also exhibited bizarre failures consistent with the time hierarchy theorem: giving away inventory for free due to social engineering, and even attempting to contact the FBI to report $2 bank fees as fraud. These failures demonstrate that as task chains grow longer, AI agents lose coherence and their ability to verify their own logic collapses.
Partially, but not completely. Sikka acknowledges that you can build components around LLMs — giving them calculators, search tools, or classical algorithms — and the LLM then becomes an orchestrator. This does allow it to hand off computationally intensive tasks to tools that can handle them. However, the catch is that the model still has to verify that the tool worked correctly. If verifying the correctness of the tool's output requires more computation than the model can perform, the agent still fails in unpredictable ways. So while tooling extends what AI can do, it doesn't fully escape the fundamental ceiling.
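The verification catch can be sketched with a deliberately small contrast. Checking that a list is sorted takes linear time, but certifying that a claimed traveling-salesman tour is *optimal* (absent a clever certificate) means comparing it against every alternative tour, which is factorial in the number of cities. The city coordinates below are made up for illustration.

```python
# Sketch of the verification gap: some tool outputs are cheap to check,
# others cost as much to verify as to compute. Brute-force optimality
# checking is used here purely to make the asymmetry visible.

from itertools import permutations
from math import dist

def tour_length(tour, pts):
    """Total length of a closed tour visiting pts in the given order."""
    return sum(dist(pts[a], pts[b]) for a, b in zip(tour, tour[1:] + tour[:1]))

def is_sorted(xs):  # O(n) check
    return all(a <= b for a, b in zip(xs, xs[1:]))

def is_optimal_tour(tour, pts):  # O(n!) check by exhaustive comparison
    best = min(tour_length(list(p), pts) for p in permutations(range(len(pts))))
    return tour_length(tour, pts) <= best + 1e-9

pts = [(0, 0), (0, 1), (1, 1), (1, 0), (2, 0)]
print(is_sorted([1, 2, 3]))                    # cheap verification
print(is_optimal_tour([0, 1, 2, 4, 3], pts))   # already expensive at n = 5
```

If verifying the tool's answer costs more than the model's budget allows, delegation moves the ceiling without removing it.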
No. A larger context window solves information access — the model can 'see' more data at once — but it does not increase the computational steps the model can perform per token generated. As the video explains, having a bigger filing cabinet doesn't help if you don't have the brain power to process what's inside. The ceiling is about the number of computational operations per word output, not about how much information the model can reference. So million-token context windows are a genuine improvement for certain tasks but do not address the fundamental architectural limitation identified in Sikka's paper.
AI models excel at tasks that stay under the computational ceiling — specifically tasks involving pattern recognition, summarization, and reformatting rather than deep logical reasoning or combinatorial problem-solving. Practical examples include: writing drafts, summarizing documents, reformatting data, research and comparison tasks, and drafting emails in a specific tone or cadence. These tasks work well because they don't require more computational steps than the model's architecture allows. The problem arises when AI is marketed as capable of autonomously running businesses or handling complex multi-step logical tasks, which the math shows it cannot reliably do.
Based on the analysis of Sikka's paper, there are three key recommendations: (1) Be specific about tasks — for example, 'draft an email using my tone and cadence' works, while 'automate this workflow' will fail because it's too vague and computationally complex. (2) Build in human verification as a structural requirement, not an optional add-on — because AI cannot reliably verify its own logic on complex tasks, humans must serve as checkpoints to reset error rates before they compound. (3) Use AI for pattern recognition, not logic-heavy math — leverage AI's genuine strengths in summarization, drafting, and comparison rather than expecting it to handle computationally intensive reasoning chains autonomously.
The video argues that the exodus of senior engineers from companies like OpenAI is a revealing signal that AGI is not as close as marketed. The reasoning: if you genuinely believed your company was months away from AGI, you would not leave. You would stay to be part of the most significant technological release in history and to benefit from the associated equity. The fact that senior engineers are instead leaving to start their own companies — companies that use AI as a tool rather than treating AI as a god — suggests they can see the ceiling. They understand that the next model will be better but not qualitatively different, just as GPT-5 was better than GPT-4 but not fundamentally different in kind.
Vishal Sikka is the former CEO of Infosys, a board member at Oracle and BMW, and a Stanford PhD. Crucially, he studied under John McCarthy — the computer scientist who coined the term 'artificial intelligence.' This background makes his critique particularly significant because he is not an AI doomer, a clickbait journalist, or an outsider. He bridges the foundational laws of computer science established in the 1960s with the current AI landscape. His critique carries weight precisely because he comes from inside the field, has deep technical credentials, and is not motivated by sensationalism. He and his son published a formal paper using established computational complexity theory rather than making speculative claims.
A reasoning engine would be capable of working through novel problems step by step, verifying its logic, and arriving at correct answers even for problems it hasn't seen before — including computationally complex ones. A pattern mirror, by contrast, recognizes patterns in its training data and produces outputs that look plausible based on those patterns, without actually performing the underlying computation required to verify correctness. Sikka's paper argues that LLMs are being marketed as reasoning engines when the math proves they are actually pattern mirrors. This distinction matters enormously for business decisions: tasks requiring genuine reasoning (complex logistics, legal analysis, autonomous business operations) will fail, while tasks that benefit from pattern recognition (drafting, summarizing, reformatting) will succeed.
AI demos work because they are specifically designed to stay under the computational ceiling. The tasks chosen for demonstrations are those where the required computation fits within what the model can handle — they are curated to showcase success. Real-world business tasks, however, often require more computation than the ceiling allows, and that's where the failures occur. As the video puts it: 'Every AI demo you've ever seen was running tasks designed to stay under the necessary complexity ceiling. They work because they're designed to work.' This creates a misleading impression of general capability when the technology actually has hard mathematical limits on the complexity of problems it can solve.
The traveling salesman problem involves finding the shortest possible route to visit a set of cities and return to the starting point. For just 20 cities, the number of possible combinations to check exceeds 2 quintillion. It is used in Sikka's paper to illustrate LLM limitations because it is a classic example of a computationally complex problem where no shortcut exists — you must check combinations to find the true optimum. An LLM physically cannot perform that many calculations in a single response pass, so instead of computing the answer, it pattern-matches and produces something that looks plausible. This is not a bug that better training can fix; it is a direct consequence of the fixed computational budget per token built into the architecture.
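The combinatorial blow-up behind this example is easy to reproduce. The number of orderings of 20 cities is 20!, which already exceeds 2 quintillion; even counting only distinct round trips (fixing the start city and ignoring direction) still leaves (n−1)!/2 tours.

```python
# Factorial growth behind the 20-city traveling salesman figure.
from math import factorial

orderings = factorial(20)            # all orderings of 20 cities
distinct_tours = factorial(19) // 2  # fixed start, direction ignored

print(f"{orderings:,} orderings of 20 cities")
print(f"{distinct_tours:,} distinct round trips")
assert orderings > 2 * 10**18        # "over 2 quintillion"
```

No known shortcut avoids this growth in the general case, which is exactly why a fixed per-token budget cannot cover it.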
The real opportunity lies in understanding precisely what AI can and cannot do right now, and deploying it accordingly. Current frontier models are genuinely exceptional for tasks that stay under the computational ceiling: writing drafts, summarizing content, reformatting data, research and comparison. The ceiling is real, but there is a lot of room underneath it. The opportunity is not chasing imaginary AGI or expecting AI to autonomously run your business — the math shows that will fail. Instead, the opportunity is using AI as a powerful tool for well-defined, pattern-recognition-based tasks while maintaining human oversight for verification and complex decision-making. Businesses and individuals who understand these limits will use AI effectively; those who believe the marketing hype will be disappointed.