Research labs around the world are racing to create artificial general intelligence (AGI). There is no universally accepted definition, but most researchers agree that such a system should match or exceed human intelligence. Some even claim that current systems powered by large language models (LLMs) can already be considered AGI. Many tests have been designed over the years to check whether a system meets the criteria. Take the famous Turing test, in which a human tries to determine whether they are chatting with another human or a computer; modern chatbots powered by LLMs are often able to pass it. Or take the "coffee test," credited to Apple co-founder Steve Wozniak, in which an AI-based robot must enter an unfamiliar house and prepare a cup of coffee. No system today can accomplish this feat. While modern LLMs can pass difficult legal exams or write thousands of lines of bug-free computer code, they fail at tasks that even children can do without much effort.
Sam Altman, the CEO of OpenAI, often describes AGI as if it were just around the corner. It can be unlocked, he says, if only we scale the existing AI systems - train them on more data using more compute. Dario Amodei, the CEO of Anthropic, holds a similar belief. In his "scaling hypothesis," he argues that compute, data quantity and quality, and training duration are among the primary drivers of AI capability.
I find language models immensely valuable and use them daily in both my personal and professional life. But as much as I would like to believe that simply scaling them will lead to AGI, I find that very difficult to do. Let me explain why.
Inability to remember and learn
Even some of the tiniest and simplest living organisms have the ability to remember their experiences and learn from them. For example, bees can be quickly trained to detect explosives or drugs (these are called sniffer bees). It is remarkable that they can do so with brains the size of a grain of salt, containing far fewer than one billion neurons and powered all day by nothing more than a few drops of sweet nectar.
On the other hand, LLMs typically have hundreds of billions of "neurons." These neurons consist mainly of weights - fractional numbers that are automatically tuned up or down during the process of model training. After the model has been trained, the weights are frozen and the model is used to make predictions - a phase called inference. But no matter how many GPUs or megawatts of electricity we provide, the weights won't change anymore.
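The training/inference split can be made concrete with a minimal sketch. The one-weight "model" below is purely illustrative (real LLMs tune hundreds of billions of weights, not one), but it shows the key asymmetry: the weight changes only inside the training loop, and inference only ever reads it.

```python
# Minimal sketch: a one-parameter "model" whose single weight is tuned
# during training and then frozen for inference. Purely illustrative;
# real LLMs do the same with hundreds of billions of weights.

def train(examples, lr=0.1, steps=100):
    w = 0.0  # the single "weight"
    for _ in range(steps):
        for x, y in examples:
            pred = w * x
            grad = 2 * (pred - y) * x  # gradient of the squared error
            w -= lr * grad             # the weight changes ONLY here
    return w

def infer(w, x):
    # Inference: the weight is read, never written. No matter how many
    # predictions we make, w stays exactly the same.
    return w * x

w = train([(1.0, 2.0), (2.0, 4.0)])  # learn y = 2x from two examples
print(round(infer(w, 3.0), 3))       # close to 6.0; w is unchanged after
```

No amount of extra inference compute moves `w`; only another round of training would.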
You can imagine the LLM while being trained as molten glass being shaped, and after training it is cooled down and becomes a solid piece of crystal - a prism. When we shine a rainbow at the prism, it passes through, is recombined, and comes out as a single white beam. Similarly, when we send a sequence of words to a model, out comes a single next word. And, just as the prism never changes regardless of how many times we shine light through it, the model doesn't change no matter how many users send their deepest secrets through it.
It is a common misconception that if we send some information to a language model, the model will "remember" it and then accidentally leak it to another user. Applications like ChatGPT use various workarounds that make it appear as if the model were learning, but the information is always stored outside of the model (for example, in a database) and then invisibly injected into the prompt along with the user's text¹. The model also appears to "remember" information that is part of the current conversation. This is thanks to a technique known as in-context learning, in which the model quite literally reads the entire conversation each time before answering.
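A hypothetical sketch of this workaround: the "remembered" facts live outside the model (here, a plain dict standing in for a database) and are silently prepended to every prompt. The function and variable names are my own invention, not any real product's API.

```python
# Facts a user shares are stored OUTSIDE the model and injected into the
# prompt on every request. The model's weights never change; it just
# reads a longer prompt.

memory_db = {}  # user_id -> list of remembered facts (stand-in for a DB)

def remember(user_id, fact):
    memory_db.setdefault(user_id, []).append(fact)

def build_prompt(user_id, user_message):
    facts = memory_db.get(user_id, [])
    memory_block = "\n".join(f"- {f}" for f in facts)
    # The model sees one long prompt; the "memory" is just injected text.
    return (f"Known facts about the user:\n{memory_block}\n\n"
            f"User: {user_message}")

remember("alice", "Prefers answers in metric units.")
print(build_prompt("alice", "How tall is Mont Blanc?"))
```

Delete the database row and the "memory" is gone, which is exactly why the model itself never learned anything.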
The inability to remember is similar to how Leonard Shelby, the main character in the movie Memento who suffers from anterograde amnesia, remembers everything up to "the incident" but cannot form new memories after it. Similarly, a language model "remembers" information from its training but cannot internalize any new information afterwards. My intuition tells me that AGI should do better than a patient suffering from a neurological disorder.
To be fair, this issue is not specific to language models alone but to most machine learning models in general. There is ongoing research - a discipline known as continual learning - aimed at finding solutions to this problem, but it will likely require inventing new model architectures. Which brings me to the next point.
Transformer architecture is a bottleneck by design
To explain this, it is important to understand how large language models work, at least at a high level.
The architecture of a typical modern LLM is called a transformer. It is perfectly suited for generating text or any other type of data that can be expressed as a sequence of numbers. This will become clearer with an example. To demonstrate the process of generating an answer, let's assume that the user asks ChatGPT "Who was the first person to circumnavigate the world?"
The first step in answering the question is converting the text of the user's question into a sequence of numbers called tokens. This is handled by a tokenizer, a very simple program that maps groups of characters (often syllables or common words) to numbers following a fixed lookup table (the tokenizer used by GPT-5 has approximately 200,000 entries). The word "Who" corresponds to 20600, the word "was" to 673, and so on. The question's text is converted into a sequence of numbers: [20600, 673, 290, 1577, 1647, 316, 8233, 1988, 13505, 290, 2375, 30]. This sequence is then sent to the actual language model².
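A toy version of that lookup makes the idea concrete. The vocabulary below reuses the token IDs from the example above; a real tokenizer has on the order of 200,000 entries and splits text into subword pieces rather than whole words.

```python
# Toy tokenizer: a fixed lookup table from strings to integers. The IDs
# reuse the article's example; real tokenizers work on subword pieces.

VOCAB = {"Who": 20600, "was": 673, "the": 290, "first": 1577}

def tokenize(text):
    # Fall back to a placeholder ID 0 for words outside this tiny table.
    return [VOCAB.get(word, 0) for word in text.split()]

print(tokenize("Who was the first"))  # -> [20600, 673, 290, 1577]
```

The mapping is fixed once and for all when the tokenizer is built; it involves no learning at inference time.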
The model now uses a significant amount of energy to perform billions of arithmetic operations (mostly addition and multiplication) to finally generate a single number: 976. If we look up the token 976 in the lookup table, we will find that it represents the word "The." The token 976 is now appended to the end of the original sequence, and the extended sequence [20600, 673, 290, 1577, 1647, 316, 8233, 1988, 13505, 290, 2375, 30, 976] is sent to the model again. The model performs billions of operations again to generate another number: 1577. This number maps to the word "first." The token 1577 is appended to the end of the sequence, which is then sent to the model again, and this process of generating a single number, appending it to the sequence, and sending the extended sequence back to the model is repeated until the entire answer has been generated. The final token the model generates is 200002, which is a special token <|return|> - the model uses it to indicate that the response is complete and should be displayed to the user. A model that generates sequences of tokens one at a time, using its previous outputs to generate the next token, is called an autoregressive decoder.
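The generate-append-repeat loop can be sketched in a few lines. The real network is replaced here by a scripted stand-in that simply replays the example tokens above; everything else (append the new token, feed the extended sequence back in, stop at the special token) mirrors the process just described.

```python
# Sketch of the autoregressive decoding loop. A real model would run
# billions of multiply-adds over the whole sequence at every step; the
# scripted stand-in below just replays the article's example tokens.

END_TOKEN = 200002  # <|return|>: signals that the response is complete

def make_scripted_model(token_ids):
    it = iter(token_ids)
    return lambda tokens: next(it)  # ignores its input, replays the script

def generate(prompt_tokens, model, max_steps=50):
    tokens = list(prompt_tokens)
    for _ in range(max_steps):
        next_token = model(tokens)   # one full forward pass per token
        if next_token == END_TOKEN:  # the model says it is done
            break
        tokens.append(next_token)    # extend the sequence and loop again
    return tokens

question = [20600, 673, 290, 1577, 1647, 316, 8233, 1988, 13505, 290, 2375, 30]
model = make_scripted_model([976, 1577, END_TOKEN])
print(generate(question, model))  # question tokens followed by 976, 1577
```

Note that each iteration hands the model the entire sequence so far; nothing is carried over between iterations except the tokens themselves.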
In the process of generating every single token, the model activates billions of its neurons. Its internal, abstract state captures the main facts (like what the user wants to know) as well as the minute nuances observed from the context of the conversation, such as humour, irony, or the overall tone. It evaluates whether answering the question would violate any of the safety and ethics principles the model was trained on. It does all of this and much more, only to eventually collapse into a single number. It completely discards its internal state and restarts the process of predicting the next token, activating billions of parameters from scratch, achieving an abstract state virtually identical to the one from the previous round, only to finally collapse to one number again. And this process repeats many times per second, going from an incredibly rich internal state to a single number.
While this is clearly a performance bottleneck, one could argue that having the model generate natural language is a worthy tradeoff because it gives us a window into the model's thinking. Unfortunately, studies by Anthropic have shown that what a model internally "thinks" and what it says can be completely different. So this argument does not hold.
In a way, the model behaves just as if Leonard from Memento were trying to write a letter. He would write one word at a time, fall asleep right after writing it, and wake up a few moments later not remembering anything he had written, having to re-read the letter only to write the next word before falling asleep again, and so on.
Common sense suggests that there must be a better way. Over millions of years, evolution has produced other neural architectures that are much more efficient - for example, in humans. Instead of thinking in discrete pieces of ideas (tokens), we think continuously in a high-dimensional abstract space and only sometimes make the extra effort to verbalize a thought. Which brings me to another point.
Not everything can be verbalized
Language models use, well, primarily language to understand the world. There is a lot of knowledge contained in the millions of books and internet pages used to train the models, but language itself is too narrow to represent complex reality in its entirety. There are many things that cannot be described using words.
For example, I will never be able to learn to play the violin or ride a bicycle just by reading about it, even if it is the most detailed description ever written.
Language also cannot faithfully describe sensory inputs. Sure, many words exist to describe tastes or smells (such as earthy, woody, or nutty), but try using such words to describe the difference between the taste of a walnut and a hazelnut, or the smell of an orange and a tangerine. Or the feeling of carpet under your fingertips.
Touch is a very important sense, especially when navigating the physical world. A robot doing the coffee test would need to feel its environment through hundreds of pressure sensors in its arms to know, for example, how to open a flimsy sealed bag of coffee beans. Language falls short here, as it cannot efficiently communicate the state of all those pressure sensors.
It is true that some modern LLMs are also trained on and can understand images or sounds. Arguably, other sensory inputs could be handled similarly. The problem still lies in the model's slow output. To control a robot, the loop between sensing its environment and reacting to it must be very tight. The model must produce tens or hundreds of predictions every second. And while a language model may get away with occasionally hallucinating the year of some historical event, there is virtually no room for error when interacting with the real world - for example, when driving a car. Only the most powerful language models could be considered for such tasks, which simply would not be economically viable - the amount of compute and power required would be far too high.
Not only are language models ill-equipped to properly interact with the physical world, they don't even truly understand it. And that brings me to the next point.
Lacking understanding of physics
Some researchers believe that LLMs have an implicit understanding of the physics governing our world. Ilya Sutskever, OpenAI co-founder, says that compressing all the information contained on the internet down into a few hundred gigabytes is possible only if a system learns the underlying principles. I agree to a certain extent - a language model surely picks up some commonly described situations from its training data, but I doubt it is able to reliably generalize from them.
When I ask the newest LLM how to stack a yoga ball, a tennis ball, and a bowling ball on top of each other, the model generates thousands of tokens describing the ideal sequence and justifying its decision, without once mentioning the very obvious issue of the stability of the stack. Yet the average person knows intuitively and immediately that stacking three spheres on top of each other is very unstable and, without some kind of external support, borderline impossible.
I wouldn't trust a kitchen robot powered by an LLM to unload my dishwasher, because I like my dinnerware too much. A better solution is needed.
World models are the new paradigm
In the field of AI, efforts are growing to understand complex physical spaces using non-LLM models, generally called "world models." But, as with many things in this field, there is no universal definition of the term, and several competing ideas about how to build them are on the table.
One idea is pioneered by Fei-Fei Li's World Labs. Her startup attempts to create complete 3D environments from text, images, or videos. According to her, if the model understands the inputs and is able to generate a 3D output, it proves that it understands the spatial relationships between physical objects.
Another approach, implemented by Google Genie, is based on generating interactive video. If a model creates a plausible video, the reasoning goes, it understands the laws of physics. If people don't spontaneously float into the sky, a glass tipped over the edge of a table falls and shatters, and light reflects off a mirror, the model has successfully internalized the laws of reality.
But predicting the value of every pixel in the generated video uses a lot of computing power, which in most real-world use cases is completely unnecessary. For instance, when you drive a car and a person starts crossing the road in front of you, you don't care about the color of their shoelaces. You create a very lightweight mental model of the situation, focusing only on the important facts - that there is a person, their direction, and their speed. Ignoring all other unnecessary details saves our brains a lot of energy. Some labs are trying to borrow this concept and apply it to AI.
One of the most vocal proponents of such an approach is Yann LeCun, former chief AI scientist at Meta and founder of the startup Advanced Machine Intelligence (AMI). In his view, an AI model should "think" in an abstract space - a computational analogue of the human mental model - an approach he calls Joint-Embedding Predictive Architecture (JEPA). The model is trained on videos, just as a child learns about the world by observing it. Even a small child knows that a toy will fall down when nudged over the edge of a table. It understands the effects of gravity without ever having read a single article about it. I find this argument compelling and am very curious about what AMI will show in the coming months.
LLMs are here to stay
I do not believe that LLMs will deliver on the promise of AGI. They won't be controlling the robots in our homes or the autonomous cars on our roads. A Terminator won't have GPT-6 in its head.
But I see the value of LLMs in many daily applications - from software development to education, office automation, and health consulting. In the coming months, LLMs will continue to get more capable, follow instructions more reliably, hallucinate less, pass more challenging benchmarks, and complete more difficult tasks. I believe that they are here to stay, at least until a more efficient model architecture is discovered.
¹ Companies like OpenAI could, purely theoretically, decide to use the data in the future to train the next version of the model, thus baking it into the next model's weights and leaking it to other users.

² For simplicity, I omit the system prompt and special tokens used to separate the various message types.