
The History of AI Development
With AI the front-and-centre focus of the business, tech and investment world today, many business leaders are racing to adopt AI within their businesses in the hope of finding a competitive edge, or are looking for investment strategies built around AI developments. But how many people actually understand what AI is, how it operates, and where its strengths and weaknesses lie?
While it is not always necessary to understand a technology in order to use it, a foundational knowledge certainly helps. Much as you do not have to understand momentum or how an internal combustion engine works to drive a vehicle, knowing what the machine can and cannot do helps you get the best and safest use out of it.
AI has landed and developed very rapidly within the past four years, but what many do not know is that the concept of artificial intelligence was born back in 1956 at the Dartmouth Conference, where researchers coined the term “artificial intelligence” and predicted machines would soon match human intelligence. It took many years, and plenty of detours and stalls, for it to finally emerge into mainstream life around 2022. The building blocks were the machine learning advances of the 1990s to 2010s, with milestones such as IBM’s Deep Blue beating world chess champion Garry Kasparov in 1997 and the ImageNet breakthrough of 2012 sparking the growth of deep learning.
So let’s unwrap exactly how AI works and what the different approaches are.
Large Language Models (LLMs) and What Are Tokens?
LLM-based AI models operate on the concept of tokens. Think of tokens as the “building blocks” that large language models (LLMs, like ChatGPT or Grok) use to process and understand text. When you type a sentence, the model doesn’t see it as whole words or a continuous stream the way humans do. Instead, it breaks the text into smaller chunks called tokens (a short illustration follows the examples below).
- A token can be a full word (e.g., “hello”), part of a word (e.g., “un” and “believable” for “unbelievable”), or even punctuation (e.g., a period or comma).
- On average, one token is about 3-4 characters or roughly 0.75 words in English.
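To make that concrete, here is a rough illustration using the open-source tiktoken tokenizer (one of several real tokenizers; other models split text differently, and the exact pieces are only indicative):

```python
# Illustrative sketch only: requires the open-source "tiktoken" package (pip install tiktoken).
# Different models use different tokenizers, so the exact split will vary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # a common byte-pair-encoding scheme
text = "The sky is unbelievable."
token_ids = enc.encode(text)                      # text -> list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]     # decode each ID back to its text chunk

print(token_ids)   # a handful of integers, one per token
print(pieces)      # something like ['The', ' sky', ' is', ...] with longer words split into parts
```

The point is simply that the model never sees “unbelievable” as one unit; it sees whatever chunks its tokenizer produces.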
LLMs are trained on massive amounts of text by learning to predict the next token in a sequence. For example, if it sees “The sky is…”, it might predict “blue” as the next token. This is called autoregressive modelling: the model is trained to predict the next token, and when generating it produces responses one token at a time, building sentences like a chain. It’s why LLMs are so good at “speaking” fluently, but their “intelligence” is based on statistical patterns in language, not necessarily deep meaning or real-world understanding.
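Here is a minimal, self-contained sketch of that token-by-token chaining. The “model” below is just a toy table of next-word probabilities standing in for a real LLM (which conditions on the entire preceding sequence and usually samples rather than always taking the most likely token):

```python
# Toy stand-in for an LLM: a table of next-token probabilities.
# A real LLM computes these probabilities with a neural network over the whole context.
toy_model = {
    "The": {"sky": 0.6, "cat": 0.4},
    "sky": {"is": 0.9, "was": 0.1},
    "is": {"blue": 0.7, "clear": 0.3},
}

def generate(prompt, steps=3):
    tokens = prompt.split()
    for _ in range(steps):
        dist = toy_model.get(tokens[-1])          # probabilities for the next token
        if dist is None:                          # nothing learned for this context: stop
            break
        tokens.append(max(dist, key=dist.get))    # greedy choice: take the most likely next token
    return " ".join(tokens)

print(generate("The sky"))   # -> "The sky is blue"
```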
How Do LLMs Work Overall?
LLMs are essentially taught to “speak” first. They’re fed billions of sentences from books, websites, etc., and learn by filling in blanks or predicting continuations at the token level (a sketch of that training objective follows the list below).
This makes them excellent at:
- Generating coherent text, stories, code, or answers.
- Mimicking human conversation.
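As a loose sketch of what that fill-in-the-blank training looks like, the toy example below (written with PyTorch; the miniature model is invented for illustration and is nothing like a production transformer) shifts a batch of token IDs by one position and penalises the model for every wrong next-token guess:

```python
# Hedged sketch of the next-token training objective (PyTorch).
# "TinyLM" is a made-up miniature model; real LLMs are huge transformers,
# but the objective is the same: predict token t+1 from tokens up to t.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # simple stand-in for a transformer
        self.head = nn.Linear(dim, vocab_size)          # a score for every token in the vocabulary

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.embed(token_ids))
        return self.head(hidden)                        # shape: (batch, sequence, vocab)

model = TinyLM()
tokens = torch.randint(0, 100, (4, 16))                 # pretend batch of token IDs
logits = model(tokens[:, :-1])                          # predictions from every position...
targets = tokens[:, 1:]                                 # ...compared against the actual next token
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
loss.backward()                                         # gradients nudge the model toward better guesses
```

Repeat that over billions of sentences and the statistical patterns of language get baked into the model’s weights.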
But critics (like AI guru Yann LeCun) say this approach prioritises form over meaning: LLMs are great at language tricks but often “hallucinate” (make up facts) and lack true comprehension of the physical world, planning, or common sense.
What Is VL-JEPA and How Does It Work?
“VL-JEPA” refers to the Vision-Language Joint Embedding Predictive Architecture, a recent model from Meta AI (released around December 2025), led by Yann LeCun. It’s part of the JEPA family (earlier versions: I-JEPA for images, V-JEPA for videos).
Unlike LLMs, VL-JEPA focuses on visual learning first, processing images, videos, and language together to build a deeper understanding of the world. The key difference is that it learns meaning (abstract ideas) before generating outputs, rather than starting with language tokens.
Here’s a simple breakdown of how it works:
- It takes input like a video clip, image, or image+text (multimodal).
- The model creates abstract representations (called “embeddings” or “thought vectors”)—high-level summaries of what’s happening, like “a person picking up a cup” instead of exact pixels or words.
- Parts of the input are “masked” (hidden), like covering sections of a video.
- The model predicts the meaning of the hidden parts based on the visible context—not by recreating exact details (pixels or tokens), but by guessing the abstract idea (e.g., “the cup is now in the hand”).
- This is self-supervised: it learns from unlabeled real-world videos/images, just by watching and predicting, similar to how babies learn by observing.
It’s non-generative, meaning it doesn’t output pixel-perfect images or token-by-token text by default. Instead, it builds an internal “world model” for understanding actions, physics, and interactions in real time.
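A very loose toy sketch of that idea is below. This is not Meta’s architecture (the real models use vision transformers and various training tricks); it only shows the core move of predicting the embedding of a hidden region from the visible context, with the error measured in embedding space rather than pixel space:

```python
# Toy illustration of JEPA-style masked prediction in embedding space (not Meta's code).
# The encoders and predictor are tiny stand-ins; real JEPA models are large vision networks.
import torch
import torch.nn as nn

dim = 64
context_encoder = nn.Sequential(nn.Linear(256, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder  = nn.Sequential(nn.Linear(256, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor       = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

patches = torch.randn(8, 256)        # pretend image/video patches (8 patches, 256 features each)
visible = patches[:6]                # the part the model is allowed to see
masked  = patches[6:]                # the hidden part it must reason about

context = context_encoder(visible).mean(dim=0)     # abstract summary of the visible context
with torch.no_grad():                              # the target embedding is held fixed here
    target = target_encoder(masked).mean(dim=0)    # abstract "meaning" of the hidden part

prediction = predictor(context)                    # predict the meaning, not the pixels
loss = nn.functional.mse_loss(prediction, target)  # error measured in embedding space
loss.backward()                                    # only the context encoder and predictor learn here
```

The key contrast with an LLM is what gets predicted: not the next token or the missing pixels, but an abstract representation of what the hidden part means.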
Likely Advantages of Each Approach
LLMs (Token-Based):
- Strengths: Amazing at language tasks—writing essays, chatting, translating, coding. Fast and versatile for text-heavy applications. They’ve scaled hugely with more data/compute.
- Weaknesses: Can be inefficient (predicting every tiny detail), prone to errors in reasoning or facts, and don’t naturally “understand” visuals or the physical world without extra tweaks.
VL-JEPA / JEPA Approach (Meaning-Based):
- Strengths: More efficient (ignores unpredictable low-level details like exact lighting in a video). Better at multimodal tasks (vision + language), real-time understanding (e.g., recognizing actions in live video), and building grounded knowledge (e.g., physics of objects moving). Uses less energy/compute for some tasks and performs well with less labeled data.
- Weaknesses: Newer, so not yet as polished for pure text generation. Still scaling up.
In recent benchmarks, VL-JEPA has shown better performance and efficiency than huge multimodal LLMs on tasks like action recognition or world modeling.
Why Is Yann LeCun So Fixated on This Process?
Yann LeCun (Meta’s Chief AI Scientist and a Turing Award winner) believes current LLMs are a “dead end” for achieving true human-level intelligence (AGI). He’s outspoken that scaling token prediction alone won’t get us there—LLMs are brilliant at language but lack deep reasoning, planning, or common sense because they don’t truly model the real world.
His fixation on JEPA/VL-JEPA stems from his belief that real machine intelligence requires AI that learns the way humans and animals do:
- We learn mostly by observing and predicting the world (e.g., a baby watches objects fall and intuitively learns gravity).
- JEPA mimics this by building predictive “world models” in abstract meaning space, leading to more efficient, adaptable, and grounded intelligence.
- He predicts JEPA-like models could replace or surpass LLMs in 3-5 years for many tasks, especially in robotics, augmented reality (e.g., smart glasses), and real-world assistants.
Some AI developers see this as the path to safer, more capable AI that reasons before speaking: not just fluent parrots, but systems with genuine understanding. There is also the argument that the energy and computational power required for massive LLMs are prohibitive and will run out of funding. Critics of LLM models also contend that they simply mimic, and lack the ability to truly create or reason as humans do.
