Yann LeCun Calls for JEPA to Surpass LLM Limits
LeCun warns LLMs are limited by language-only training and advocates JEPA, a predictive world‑modeling approach, to overcome representation collapse and build more capable AI.

Yann LeCun warns that large language models have hit a ceiling and advocates a new architecture called Joint Embedding Predictive Architecture (JEPA) to build AI that understands the world.
He argues that LLMs, trained only on text, cannot grasp physics or cause and effect, and warns that current self‑supervised methods risk learning trivial representations.
LeCun, a Turing Award winner and Meta AI chief scientist, shared his perspective in a recent seminar, noting that the AI community’s heavy reliance on language models has yielded fluent text but little insight into how objects behave. He pointed out that a model that has only seen sentences cannot infer that a ball thrown upward will fall back down, limiting its usefulness for tasks that require interaction with the physical world. This gap, he said, becomes critical as AI is deployed in robotics, autonomous vehicles, and any system that must anticipate real‑world dynamics.
LeCun argues that large language models are limited because they are trained only on language: lacking direct experience with sensory inputs such as vision or touch, they cannot develop a genuine understanding of the world. He is pushing for an approach that moves beyond today's language‑model dominance, envisioning an architecture that learns from raw sensory streams and predicts how the world will change. A key obstacle he identifies is representation collapse in current self‑supervised learning methods: a collapsed model encodes little useful information, making downstream tasks harder even though the training loss is low.
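The collapse failure mode can be seen in a toy example. The sketch below is purely illustrative (the encoder names and shapes are invented for this demo, not taken from any real codebase): a naive joint-embedding objective that only pulls embeddings of two views together is perfectly "solved" by a constant encoder, even though that encoder carries no information about the input.

```python
import numpy as np

# Toy joint-embedding setup: an encoder maps inputs to embeddings, and the
# naive objective only pulls embeddings of two views of the same input
# together. All names and shapes here are illustrative assumptions.

rng = np.random.default_rng(0)

def invariance_loss(z1, z2):
    # Naive objective: make the two views' embeddings identical.
    return float(np.mean((z1 - z2) ** 2))

# A random linear encoder vs. a collapsed (constant) encoder.
W = rng.normal(size=(8, 3))
useful_encoder = lambda x: x @ W
collapsed_encoder = lambda x: np.zeros((x.shape[0], 3))

x = rng.normal(size=(16, 8))                 # a batch of inputs
x_aug = x + 0.1 * rng.normal(size=x.shape)   # a second, noisy "view"

loss_useful = invariance_loss(useful_encoder(x), useful_encoder(x_aug))
loss_collapsed = invariance_loss(collapsed_encoder(x), collapsed_encoder(x_aug))

print(loss_useful > 0.0)      # True: real features still differ across views
print(loss_collapsed == 0.0)  # True: constant output "solves" the objective
```

Because the constant encoder reaches zero loss while learning nothing, methods that rely on this objective alone need extra machinery (or, as LeCun suggests, a predictive objective) to avoid collapse.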
JEPA proposes to learn joint embeddings that predict future representations from current ones, encouraging the model to capture underlying dynamics rather than surface statistics. By training to forecast how an image or video frame will evolve, the system builds invariances to changes in viewpoint, lighting, or occlusion. This predictive objective is intended to produce richer features for tasks like object detection and motion planning.

LeCun argues that a prediction‑focused objective reduces the risk of representation collapse because the model must encode useful information to make accurate forecasts. In contrast, pure next‑token prediction can be satisfied by memorizing superficial patterns without understanding structure. If JEPA proves effective, it could redirect investment toward multimodal world‑modeling approaches, though researchers caution that turning the concept into scalable algorithms remains an open challenge. Early experiments will need to demonstrate competitive performance on benchmarks such as ImageNet, KITTI, or simulated robotics tasks.
Researchers will watch for early JEPA prototypes and benchmark results later this year.