Milestones to reach AGI
François Chollet argues AGI requires efficiency, abstraction, meta-learning, and autonomy, moving beyond brute-force scaling to achieve true generalization and adaptability.
François Chollet is a widely recognized figure in the field of artificial intelligence, not only for creating Keras, one of the most widely used deep learning libraries in the world, but also for challenging the mainstream trajectory of AI research. While much of the field celebrates scaling laws and massive models as the path forward, Chollet has consistently pushed back, arguing that this approach represents a fundamental misunderstanding of what intelligence really is. His work goes beyond engineering; it is a call to rethink the foundations of the discipline and to anchor progress in a scientifically rigorous and philosophically coherent definition of intelligence. For Chollet, the problem is not that current models are powerful—they are—but that they are powerful in narrow, brittle ways that fail to address the real challenge of generalization and abstraction.
From his perspective, the current obsession with large language models and multimodal architectures reflects a dangerous illusion: that bigger datasets and more parameters will inevitably lead us to human-level general intelligence. These systems achieve state-of-the-art performance because they have absorbed nearly all available human-generated text and optimized over billions of gradient steps, not because they can autonomously reason about novel situations. Their competence is statistical, not conceptual; they interpolate within patterns rather than extrapolate beyond them. For Chollet, this is the crux of the issue. General intelligence cannot emerge from memorization, no matter how vast the dataset. True intelligence requires the ability to operate in open-ended domains, to form abstractions that compress experience into rules, and to apply these rules flexibly in entirely new contexts. Scaling brute force is a dead end because it sidesteps these requirements.
To provide a solid theoretical grounding for his critique, Chollet proposed one of the most precise formal definitions of intelligence to date: “Intelligence is a measure of skill acquisition efficiency over a scope of tasks, relative to priors, experience, and generalization difficulty.” This definition reframes the conversation by emphasizing efficiency, adaptability, and scope rather than raw performance. Intelligence, under this lens, is not the sum of skills a system possesses but the process that generates those skills efficiently. Humans are not born knowing language or algebra; they are born with the ability to learn these skills rapidly under constraints. By contrast, today’s AI systems require millions of labeled examples and enormous compute budgets to approximate abilities that humans learn from a handful of demonstrations. Chollet’s definition exposes this gap and shows why current evaluation metrics are inadequate for measuring real progress toward AGI.
To operationalize this vision, Chollet introduced the Abstraction and Reasoning Corpus (ARC), a benchmark explicitly designed to test the ability to generalize to unseen tasks using minimal examples. Unlike conventional benchmarks that can be conquered through memorization or pretraining on massive datasets, ARC presents problems drawn from a combinatorial design space so vast that no dataset can cover it. Solving ARC requires discovering abstract structural rules—such as symmetry, color grouping, or object persistence—from just three to five demonstrations and applying them to new cases. These are the very cognitive moves humans make instinctively. Yet, despite years of progress in deep learning, ARC remains a steep challenge for AI systems: humans routinely score above 95%, while cutting-edge models languish below 40%. This persistent gap is not an accident; it reveals what Chollet considers the true bottleneck for AGI—our failure to build systems capable of abstraction, compositionality, and causal reasoning.
Based on these insights, Chollet has laid out a roadmap of milestones that must be achieved for human-level general intelligence to become attainable. First, research priorities must shift away from static benchmarks and brute-force scaling toward dynamic evaluations of generalization. Systems should be measured by their ability to learn quickly, reason flexibly, and adapt autonomously, not by their ability to memorize ever-larger datasets. Second, AI architectures must incorporate mechanisms for autonomous abstraction formation—the ability to synthesize new concepts from raw observations without explicit programming. Third, compositional reasoning must be a core design principle: intelligent systems should build complex solutions by recombining simpler elements, mimicking the combinatorial creativity of human thought. Without these capabilities, models will remain trapped in statistical mimicry, unable to transcend the confines of their training distributions.
Equally important, Chollet insists that efficiency—not scale—defines intelligence. Current models consume terabytes of text and petaflops of compute to achieve competence in language tasks, whereas humans achieve comparable mastery of natural language through a few years of sparse experience. The path to AGI requires architectures that are data-frugal, compute-efficient, and energy-conscious, reflecting the astonishing economy of the human brain. This efficiency imperative extends beyond resources to knowledge representation: systems must encode experience in modular, reusable abstractions rather than sprawling, entangled weight matrices. Without compact and transferable internal structures, adaptation will remain prohibitively expensive, making lifelong learning impossible.
Another pillar of Chollet’s vision is structural cognition: the integration of neural and symbolic paradigms. Pure pattern-matching systems, no matter how large, lack the algorithmic scaffolding required for systematic reasoning and causal inference. By combining the perceptual strengths of neural networks with the structured logic of symbolic systems, we can create architectures capable of both recognizing patterns and manipulating rules. This hybrid approach, coupled with meta-learning, would enable systems to reflect on their own strategies, improving their learning processes over time. For Chollet, meta-learning is the real lever of intelligence, because it transforms experience into acceleration: each solved problem makes the next one easier, closing the loop toward autonomous self-improvement.
Finally, Chollet argues that intelligence is purposive. It is not enough for systems to respond passively to prompts; they must demonstrate agency—the ability to generate goals, prioritize them under constraints, and navigate trade-offs under uncertainty. Creativity emerges from this agency: the capacity to produce novel, useful, and contextually appropriate solutions beyond memorized patterns. Risk-aware decision-making, dynamic goal management, and autonomous planning are not peripheral features but core requirements for any system aspiring to human-level generality. In this sense, the milestones Chollet envisions are not incremental extensions of today’s deep learning—they demand a paradigm shift: from static pattern recognition to dynamic, adaptive, and self-directed intelligence; from monolithic architectures to modular hybrids; from brute-force scaling to elegant, resource-efficient design.
Chollet’s roadmap is both sobering and inspiring. It rejects the seductive simplicity of “just make it bigger” and calls for deeper questions: How do we formalize abstraction? How do we measure generalization fairly? How do we architect systems that learn as efficiently as humans? These questions define not only a technical challenge but a philosophical stance: that intelligence is a process, not a dataset; a system’s capacity to continually invent solutions, not its ability to replay them. For Chollet, the journey to AGI is not about adding layers but about building systems that learn how to learn, reason about their reasoning, and evolve their own capabilities with minimal supervision. Until we meet these milestones, what we call “intelligence” will remain an illusion painted by scale.
Summary
1. Fundamental Shift in AI Research Focus
Core Idea: Stop chasing benchmark scores and scaling models; start prioritizing generalization and skill acquisition efficiency.
Focus on learning from minimal data, not brute-force memorization.
Evaluate intelligence relative to priors (like humans’ core knowledge).
Use open-ended benchmarks (e.g., ARC) resistant to shortcutting.
Progress = systems that adapt to novel tasks quickly.
2. Autonomous Abstraction & Concept Formation
Core Idea: Intelligence = ability to create and manipulate abstractions autonomously.
Build systems that derive abstract rules without hand-coding.
Achieve compositional reasoning: combine simple concepts into complex ideas.
Develop efficient internal representations for transferability.
Enable hierarchical reasoning for multi-step problem-solving.
3. Data & Resource Efficiency
Core Idea: Intelligence = doing more with less.
Optimize for data frugality (few-shot learning as default).
Prioritize compute and energy efficiency (stop brute-force scaling).
Design compact memory structures for reusable knowledge.
Measure performance per unit of resource consumed, not just raw accuracy.
4. Symbolic-Neural Hybrid & Structural Design
Core Idea: Pure deep learning won’t reach AGI; structured cognition is essential.
Integrate symbolic reasoning with neural perception for abstraction and logic.
Implement meta-learning at the symbolic level for adaptive strategies.
Build modular architectures enabling reuse across domains.
Include causal reasoning for true understanding, not correlation mimicry.
5. Autonomous Meta-Learning & Self-Improvement
Core Idea: AGI systems must learn how to learn—and do it autonomously.
Implement meta-learning: improve skill acquisition over time.
Enable self-reflection: systems reason about their own performance.
Achieve continuous adaptation without catastrophic forgetting.
Develop self-repair mechanisms for autonomous error correction.
6. Agency, Goal-Directedness & Creativity
Core Idea: Intelligence is purposive, not reactive.
Allow systems to generate and reprioritize goals autonomously.
Enable dynamic planning under changing conditions.
Foster creativity: producing novel, useful solutions beyond memorized patterns.
Integrate risk-aware decision-making to handle uncertainty safely.
The Milestones in Detail
Group 1: Fundamental Shift in AI Research Focus
Group Definition and Context
Chollet repeatedly stresses that the AI community’s current trajectory—dominated by scaling up deep learning and pursuing benchmark scores—is insufficient and misaligned with achieving AGI. Current systems excel at narrow tasks and pattern recognition but fail catastrophically at generalization and abstraction. This failure is central to his ARC work and his critique of benchmark-driven progress.
He argues that true intelligence is not about task-specific mastery but about the efficiency of acquiring new skills across a wide variety of novel tasks under resource constraints. To achieve this, he calls for a paradigm shift in research priorities away from “brute-force scaling” toward generalization-centric, resource-efficient, and autonomy-driven AI research.
Observation 1: Prioritize Generalization Over Narrow Skills
Definition
Redirect AI research from achieving high performance on narrow, well-defined benchmarks to building systems that generalize to new, unseen tasks, requiring minimal retraining and leveraging abstract reasoning.
Logic (Chollet’s Argument)
Current benchmarks like ImageNet encourage overfitting to narrow domains and memorization strategies.
Intelligence ≠ performance on fixed tasks; true intelligence = ability to adapt to novel tasks efficiently.
Chollet’s formal definition:
“Intelligence is a measure of skill acquisition efficiency over a scope of tasks, relative to priors, experience, and generalization difficulty.”
Human intelligence shines because of generalization across countless unforeseen situations, not memorized solutions.
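For reference, the formalization in Chollet’s “On the Measure of Intelligence” can be sketched schematically. The block below is a loose paraphrase of the dependencies only, not his exact algorithmic-information-theoretic formula:

```latex
% Loose schematic (paraphrase, not Chollet's exact formalism): a system IS is
% more intelligent the more skill it attains on tasks T in its scope, weighted
% by how hard those tasks are to generalize to (GD), and the less it relies on
% built-in priors P and task-specific experience E to get there.
\[
  I_{\mathrm{IS}} \;\propto\;
  \operatorname*{avg}_{T \in \text{scope}}
  \frac{\mathrm{Skill}_{T} \cdot \mathrm{GD}_{T}}
       {P_{\mathrm{IS}} + E_{\mathrm{IS},\,T}}
\]
```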
Implementation / Measurement
Replace task-specific benchmarks with generalization benchmarks:
ARC (Abstraction and Reasoning Corpus): Measures ability to infer abstract rules from minimal examples.
Out-of-distribution generalization tasks.
Measure learning curves under limited-data regimes (few-shot, zero-shot learning).
Emphasize meta-learning benchmarks that test adaptability across tasks.
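As a concrete reference point for the ARC item above: in the public ARC repository (github.com/fchollet/ARC), each task is a JSON file with “train” and “test” lists of input/output integer grids. The sketch below, with a placeholder file path and a toy candidate, illustrates the evaluation contract: a proposed program only counts if it reproduces every demonstration and then the held-out test outputs.

```python
import json

def score_candidate(task_path, candidate_program):
    """Check a candidate solver against a single ARC task.

    ARC tasks are JSON files with "train" and "test" lists; each entry holds
    an "input" and an "output" grid (lists of lists of ints 0-9). A candidate
    counts only if it reproduces every training demonstration exactly, and is
    then judged on the held-out test pairs.
    """
    with open(task_path) as f:
        task = json.load(f)

    fits_demos = all(
        candidate_program(pair["input"]) == pair["output"]
        for pair in task["train"]
    )
    if not fits_demos:
        return False
    return all(
        candidate_program(pair["input"]) == pair["output"]
        for pair in task["test"]
    )

# Hypothetical candidate: mirror each grid left-to-right.
mirror = lambda grid: [row[::-1] for row in grid]
# score_candidate("ARC/data/training/<task_id>.json", mirror)  # placeholder path
```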
Current AGI Status
Weak: GPT-like models achieve impressive in-distribution generalization but fail dramatically on ARC, which tests abstraction and open-ended reasoning.
Chollet’s Critique: Scaling laws improve interpolation, not extrapolation; current systems remain statistical parrots, not reasoning entities.
Observation 2: Abandon Pure Scaling and Memorization Approaches
Definition
Stop relying on brute-force data and compute scaling as the primary strategy for progress toward AGI.
Logic
Scaling large language models (LLMs) produces diminishing returns for generalization beyond training distributions.
Memorization ≠ intelligence:
Memorization allows solving known problems, but cannot handle truly novel tasks.
Biological systems (the human brain) achieve general intelligence with low energy, modest compute, and a tiny fraction of the training data LLMs require.
Implementation / Measurement
Evaluate algorithms for efficiency and abstraction, not raw benchmark scores.
Explicitly track resource-to-performance ratios: amount of data, compute, and energy per generalization improvement.
Promote research in architectures optimized for symbolic reasoning, representation learning, and meta-learning.
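A minimal sketch of the resource-to-performance tracking listed above; the field names and the two example systems are illustrative assumptions, not measurements of any real model.

```python
from dataclasses import dataclass

@dataclass
class RunReport:
    """Illustrative record of one training run (all numbers are made up)."""
    generalization_score: float   # e.g., held-out ARC accuracy in [0, 1]
    training_examples: int
    train_flops: float
    energy_kwh: float

def efficiency_ratios(run: RunReport) -> dict:
    """Score per unit of resource consumed, as rough comparison metrics."""
    return {
        "score_per_million_examples": run.generalization_score / (run.training_examples / 1e6),
        "score_per_exaflop": run.generalization_score / (run.train_flops / 1e18),
        "score_per_mwh": run.generalization_score / (run.energy_kwh / 1e3),
    }

# Two hypothetical systems: a frontier-scale model vs. a frugal learner.
big = RunReport(0.35, training_examples=10**12, train_flops=1e25, energy_kwh=5e7)
small = RunReport(0.30, training_examples=10**5, train_flops=1e16, energy_kwh=10)
print(efficiency_ratios(big), efficiency_ratios(small))
```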
Current AGI Status
Current frontier models depend heavily on scaling:
GPT-4 → reportedly on the order of a trillion parameters + terabytes of text data.
Costs millions of dollars in compute and energy.
Chollet’s View: Scaling-based progress cannot bridge the gap to AGI because it sidesteps abstraction and reasoning; it only inflates memorization capacity.
Observation 3: Explicitly Measure Intelligence Relative to Priors
Definition
Design benchmarks that explicitly account for priors (innate knowledge) used by the system, ensuring fair comparisons across architectures and aligning with human cognition.
Logic
Chollet references Spelke’s Core Knowledge Theory: Humans are born with minimal priors like:
Objectness & basic physics
Agentness & goal-directedness
Geometry/topology awareness
Numerosity
Current AI systems embed massive implicit priors in their weights (from huge datasets), making them look more intelligent than they are.
True intelligence = efficient reasoning on minimal priors.
Implementation / Measurement
ARC: All tasks rely only on core priors, avoiding language, cultural knowledge, or dataset biases.
Track explicit priors given to AI systems in benchmarks.
Compare performance normalized for prior knowledge load.
Current AGI Status
LLMs leverage massive implicit priors learned from billions of documents → unfair advantage in narrow tasks, yet still fail at minimal-prior tests like ARC.
Chollet calls for benchmarks that penalize hidden prior overloading and reward reasoning from scratch.
Observation 4: Pursue Open-Ended Intelligence Challenges
Definition
Develop benchmarks and frameworks that simulate open-ended problem spaces, forcing AI systems to tackle genuinely novel and diverse tasks that cannot be memorized or brute-forced.
Logic
Real-world intelligence thrives on unpredictability.
IQ tests and ARC tasks succeed because:
They prevent pre-computation of all possible answers.
They measure adaptability and abstraction.
Current benchmarks fail because they are static and can be solved by memorization.
Implementation / Measurement
Expand ARC-like competitions:
Dynamically generated tasks → impossible to anticipate or pre-train on.
Require reasoning, pattern discovery, and abstraction.
Introduce never-before-seen task generators in benchmarks for ongoing evaluation of adaptability.
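To make the idea of dynamically generated tasks concrete, here is a toy generator over a hand-picked rule family (a random color swap plus an optional mirror). It is an illustration of the concept only, not an official ARC task generator.

```python
import random

PALETTE = list(range(10))  # ARC-style color codes 0-9

def random_grid(height, width, rng):
    return [[rng.choice(PALETTE[:4]) for _ in range(width)] for _ in range(height)]

def sample_task(rng=None, n_demos=3):
    """Sample a hidden rule and emit demonstrations plus one held-out pair.

    The hidden rule is a random color swap, optionally followed by a
    horizontal mirror; a solver sees only the demos and the test input.
    """
    rng = rng or random.Random()
    a, b = rng.sample(PALETTE[:4], 2)
    do_mirror = rng.random() < 0.5

    def rule(grid):
        out = [[b if c == a else a if c == b else c for c in row] for row in grid]
        return [row[::-1] for row in out] if do_mirror else out

    pairs = []
    for _ in range(n_demos + 1):
        grid = random_grid(rng.randint(2, 5), rng.randint(2, 5), rng)
        pairs.append({"input": grid, "output": rule(grid)})
    return {"train": pairs[:n_demos], "test": pairs[n_demos:]}
```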
Current AGI Status
Very poor performance:
ARC Leaderboards: Humans achieve >95%; best AI ~35–40%.
No existing system demonstrates robust open-ended adaptability.
Chollet’s Position: Without open-ended evaluation, AGI progress claims are misleading and over-optimistic.
✅ Summary for Group 1:
Chollet insists that achieving AGI requires a foundational paradigm shift:
From benchmark chasing → generalization testing.
From brute-force scaling → efficiency-driven innovation.
From hidden prior exploitation → transparent prior normalization.
From static datasets → open-ended challenges.
Bottom line: Stop measuring “who memorizes better”; start measuring who learns faster, reasons deeper, and adapts more flexibly with minimal priors and resources.
Group 2: Autonomous Abstraction & Concept Formation
Group Definition and Context
Chollet repeatedly emphasizes that abstraction is the engine of intelligence. While current AI models excel at pattern recognition, they fundamentally fail at creating new abstractions. This gap explains why models like GPT-4 can mimic reasoning patterns in text but collapse when facing tasks requiring genuine conceptual synthesis (e.g., ARC puzzles).
To achieve AGI, systems must autonomously discover representations, form abstractions, and recombine them compositionally across domains.
Observation 5: Develop Autonomous Abstraction Capabilities
Definition
Build systems capable of autonomously generating abstract rules or concepts from raw observations, without explicit human-coded templates or brute-force memorization.
Logic (Chollet’s Argument)
Humans solve novel tasks by forming abstract rules from a few examples.
Abstraction allows transfer from specific experiences to a vast space of unknown scenarios.
Current LLMs rely on pattern interpolation; abstraction = extrapolation beyond training data.
Implementation / Measurement
ARC Benchmark: Requires discovering latent rules (e.g., symmetry, color grouping, shape completion) never seen before.
Measure speed & efficiency of abstraction from minimal examples.
Evaluate ability to verbalize or encode discovered rules.
Current AGI Status
Extremely weak: LLMs and vision models fail at ARC because they can’t autonomously hypothesize rules beyond their statistical priors.
Chollet: “Current systems cannot autonomously generate new abstractions—they can only remix what they have memorized.”
Observation 6: Emphasize Explicit Compositional Reasoning
Definition
Enable systems to compose new ideas or solutions by combining simpler concepts already known, producing novel but structured outputs.
Logic
Chollet frames compositionality as key to scalability of intelligence:
“The ability to combine a small set of concepts into an unbounded number of new ideas is what gives human cognition its power.”
Current AI lacks flexible compositional generalization, leading to brittle performance outside training distributions.
Implementation / Measurement
Test via ARC tasks requiring multi-step transformations (e.g., “reflect + recolor” → requires combining two distinct rules).
Evaluate models in cross-domain reasoning (combine geometry + numerosity).
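A minimal sketch of what testing a “reflect + recolor” composition could look like as a search over compositions of two primitive operations; the primitive set and the depth limit are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical primitive grid operations.
def reflect(grid):            # mirror left-right
    return [row[::-1] for row in grid]

def recolor(src, dst):        # replace one color by another
    return lambda grid: [[dst if c == src else c for c in row] for row in grid]

PRIMITIVES = [reflect] + [recolor(s, d) for s, d in product(range(4), range(4)) if s != d]

def compose(fns):
    def run(grid):
        for f in fns:
            grid = f(grid)
        return grid
    return run

def search(demos, max_depth=2):
    """Enumerate compositions of primitives up to max_depth and return the
    first composition that reproduces every demonstration (or None)."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            program = compose(combo)
            if all(program(d["input"]) == d["output"] for d in demos):
                return program
    return None
```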
Current AGI Status
Weak: LLMs approximate compositionality in language but fail in symbolic reasoning or visual tasks that require explicit combination of operations.
Observation 7: Explicitly Improve Representation Learning
Definition
Equip AI systems to autonomously build efficient, interpretable internal representations that capture structure and enable reasoning.
Logic
Chollet emphasizes that current AI stores dense statistical correlations instead of abstract representations.
Representation learning is critical for knowledge reuse and efficient generalization across domains.
Implementation / Measurement
Evaluate internal state structures (are they modular, transferable?).
Use compression-based metrics: shorter description length = better abstraction.
Test transfer learning performance to entirely novel task distributions.
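One crude way to approximate the compression-based metric above, assuming the system can serialize its internal account of a task: use compressed size as a stand-in for description length.

```python
import json
import zlib

def description_length(representation) -> int:
    """Crude minimum-description-length proxy: bytes after zlib compression.

    `representation` is anything JSON-serializable that the system emits as
    its account of a task (a rule string, a program AST, a small table, ...).
    A shorter compressed size is taken as evidence of a more abstract, less
    rote encoding.
    """
    raw = json.dumps(representation, sort_keys=True).encode("utf-8")
    return len(zlib.compress(raw, 9))

# Example: a compact rule vs. a memorized lookup table for the same mapping.
rule = {"op": "recolor", "from": 1, "to": 2}
table = {str(i): (2 if i == 1 else i) for i in range(1000)}
print(description_length(rule), description_length(table))
```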
Current AGI Status
Poor: LLMs’ representations are highly entangled and opaque.
Chollet: “They have no explicit abstractions. They memorize statistical regularities but don’t distill them into conceptual structures.”
Observation 8: Implement Explicit Hierarchical Reasoning
Definition
Design systems to reason across multiple abstraction layers, decomposing complex tasks into smaller steps and integrating sub-solutions.
Logic
Humans solve complex problems by creating hierarchies (e.g., planning: long-term goals → mid-level strategies → atomic actions).
Chollet notes that hierarchy gives combinatorial efficiency and adaptability.
Implementation / Measurement
Multi-step ARC puzzles: e.g., “Find largest shape → color swap → mirror transform.”
Evaluate explicit decomposition abilities:
Can the system articulate intermediate steps?
Does it optimize search over compositional space?
Current AGI Status
Limited: LLMs can mimic step-by-step reasoning when prompted (“chain of thought”), but they do not autonomously build hierarchical strategies.
ARC reveals: current models cannot break down problems without heavy hand-holding.
✅ Summary for Group 2
AGI demands abstraction as its foundation. Chollet insists that without autonomous abstraction formation, explicit compositionality, strong internal representation learning, and hierarchical reasoning, scaling will hit a wall. Current AI systems:
Fail to generate new rules (abstraction gap).
Struggle to recombine concepts adaptively (compositional gap).
Lack modular, reusable representations (representation gap).
Cannot autonomously plan in layered reasoning spaces (hierarchical gap).
Bottom line: To reach AGI, we must replace “brute-force pattern fitting” with structured, self-directed concept formation and combinatorial reasoning architectures.
Group 3: Data & Resource Efficiency
Group Definition and Context
Chollet explicitly defines intelligence as skill-acquisition efficiency, which inherently involves minimizing the cost of learning and problem-solving in terms of data, compute, energy, and memory.
Current large-scale AI systems achieve impressive results, but their approach—brute-force scaling—contradicts efficiency principles. GPT-4’s massive training regime (trillions of tokens, megawatt-hours of energy) is an example of what Chollet argues is a dead end for achieving AGI.
To reach AGI, research must pivot from bigger models → smarter algorithms, emphasizing architectures that learn fast, reason with little data, and use resources optimally.
Observation 9: Explicitly Prioritize Data Efficiency
Definition
Build systems that can learn robust abstractions and generalize to unseen tasks using minimal examples—as humans do.
Logic (Chollet’s View)
Humans can learn a new skill from a handful of demonstrations, sometimes from a single exposure.
Current AIs need billions of examples for narrow tasks, which is antithetical to intelligence.
Chollet calls this problem “buying intelligence with data,” which creates brittle and non-generalizable systems.
Implementation / Measurement
ARC: Each task provides 3–5 demonstrations only.
Few-shot and zero-shot learning benchmarks:
Measure performance per example rather than absolute accuracy.
Learning curves:
How fast does accuracy improve as examples increase?
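A sketch of the learning-curve measurement described above; `evaluate_after_k_examples` is a placeholder for whatever harness the benchmark provides, and the example numbers are invented.

```python
def learning_curve(evaluate_after_k_examples, ks=(1, 2, 3, 5, 10)):
    """Accuracy as a function of how many demonstrations the system saw.

    `evaluate_after_k_examples(k)` is assumed to condition or retrain the
    system on k examples per task and return held-out accuracy in [0, 1].
    """
    return {k: evaluate_after_k_examples(k) for k in ks}

def sample_efficiency(curve: dict) -> float:
    """Area-under-the-learning-curve style summary: average accuracy across
    example budgets, so a system that learns from few examples scores higher
    than one that only catches up once it has seen a lot of data."""
    return sum(curve.values()) / len(curve)

# Hypothetical curves: a frugal learner vs. a data-hungry one.
frugal = {1: 0.55, 2: 0.70, 3: 0.80, 5: 0.85, 10: 0.88}
hungry = {1: 0.10, 2: 0.15, 3: 0.25, 5: 0.45, 10: 0.85}
print(sample_efficiency(frugal), sample_efficiency(hungry))
```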
Current AGI Status
Poor: LLMs appear to do few-shot learning but mostly rely on pattern recall from enormous datasets.
Chollet: “When you’ve seen everything, zero-shot performance is an illusion.”
Observation 10: Develop Explicit Computational Efficiency
Definition
Enable models to reason and learn using minimal compute, rather than relying on massive parameter counts and training steps.
Logic
Efficiency is critical because brute-force compute scaling has diminishing returns and environmental costs.
Humans achieve general intelligence with ~20 W of brain power, not megawatt-scale clusters.
Computational efficiency ties back to algorithmic elegance—smarter architectures over bigger GPUs.
Implementation / Measurement
Normalize performance by FLOPs or runtime cost.
Test on ARC-like tasks under strict compute budgets.
Reward efficiency-oriented solutions in AGI benchmarks.
Current AGI Status
Extremely poor: Frontier models (GPT-4, Gemini) cost millions in compute; inference cost is high too.
Chollet: “We cannot scale our way to AGI; efficiency, not size, is the bottleneck.”
Observation 11: Explicitly Reduce Energy Consumption
Definition
Design AI systems that minimize energy per inference and per training epoch, approximating the energy efficiency of biological systems.
Logic
Energy scaling ≠ intelligence scaling.
Energy waste is a direct symptom of brute-force design, not smart algorithms.
Sustainability aside, physical constraints make energy-hungry AGI architectures non-viable at global scale.
Implementation / Measurement
Benchmark energy per task (joules per ARC puzzle solved).
Compare energy-to-performance ratios with human brain estimates.
Current AGI Status
Critical weakness: GPT-class models consume enormous energy for training and inference.
Chollet warns this is a dead end: systems must become thousands of times more energy-efficient to approach human-level intelligence.
Observation 12: Explicitly Optimize Memory Use
Definition
Ensure systems store and retrieve knowledge compactly, modularly, and with minimal redundancy.
Logic
Memory optimization = efficient abstraction.
Humans store concepts and rules as compressed representations; current AI stores billions of weights encoding patterns redundantly.
Poor memory architecture → catastrophic forgetting or inefficiency.
Implementation / Measurement
Evaluate internal state complexity vs. performance (compression ratio).
Test for reusability of learned modules across tasks (transfer benchmarks).
Current AGI Status
Weak: Neural networks lack modular memory; knowledge is distributed across weights, making reuse and updates costly.
Chollet: “Opaque entangled representations break generalization.”
✅ Summary for Group 3
AGI cannot be brute-forced by throwing more compute, data, and energy at the problem.
Chollet insists progress depends on:
Data frugality → Learning from few examples.
Compute and energy efficiency → Algorithmic leaps, not bigger clusters.
Memory compactness → Modular, reusable internal representations.
Bottom line: AGI must be elegant—a system that does more with less, like the human brain.
Group 4: Symbolic-Neural Hybrid & Structural Design
Group Definition and Context
Chollet argues that current deep learning models lack structured reasoning and operate almost entirely through pattern interpolation, which is insufficient for true generalization and abstraction.
He emphasizes that hybrid systems—combining symbolic reasoning with the representational power of neural networks—are essential for AGI. Why? Because human intelligence relies on:
Symbolic manipulation (rules, logic, hierarchical planning).
Perceptual learning (neural pattern recognition).
These complementary paradigms must be integrated for AGI to achieve robust abstraction, compositionality, and reasoning under uncertainty.
Observation 13: Explicitly Combine Symbolic and Neural Models
Definition
Develop architectures that integrate the statistical strength of neural networks with the structured, rule-based reasoning of symbolic systems.
Logic (Chollet’s Argument)
Neural nets excel at perception but fail at systematic reasoning.
Symbolic systems excel at reasoning but fail at perception.
AGI requires both:
Neural layers for raw input → symbolic layers for compositional logic.
“You cannot brute-force search in an infinite combinatorial space; you need structured representations and symbolic abstractions.”
Implementation / Measurement
Hybrid models tested on ARC:
Perception handled by neural nets.
Rule inference handled by symbolic engines.
Benchmarks for abstraction depth and reasoning explainability.
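A skeletal illustration of that division of labor: neural perception producing symbolic facts, and a symbolic engine searching for a rule consistent with the demonstrations. Both classes are stubs written for this sketch, not a working ARC solver.

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]

class NeuralPerception:
    """Stub for a learned front end that turns raw grids into symbolic facts
    (sizes, colors, detected objects). A real system would use a trained model."""
    def describe(self, grid: Grid) -> Dict:
        return {
            "height": len(grid),
            "width": len(grid[0]),
            "colors": {c for row in grid for c in row},
        }

class SymbolicEngine:
    """Stub for a rule-inference back end that searches a small set of
    candidate transformations and keeps those consistent with the demos."""
    def __init__(self, primitives: List[Callable[[Grid], Grid]]):
        self.primitives = primitives

    def infer_rule(self, demos: List[Dict]) -> Optional[Callable[[Grid], Grid]]:
        for op in self.primitives:
            if all(op(d["input"]) == d["output"] for d in demos):
                return op
        return None

def solve(task: Dict, perception: NeuralPerception, engine: SymbolicEngine):
    # In a real hybrid, the perceptual facts would prune the symbolic search;
    # here they are computed but left unused to keep the skeleton short.
    _facts = [perception.describe(pair["input"]) for pair in task["train"]]
    rule = engine.infer_rule(task["train"])
    return [rule(pair["input"]) for pair in task["test"]] if rule else None
```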
Current AGI Status
Primitive: Some neuro-symbolic prototypes exist, but no large-scale hybrid has achieved strong ARC performance.
Chollet: “Deep learning alone will not solve AGI.”
Observation 14: Incorporate Symbolic Meta-learning
Definition
Enable systems to adapt their learning strategies by dynamically building symbolic abstractions about tasks and reasoning paths.
Logic
Chollet frames meta-learning as essential because:
Intelligence = improving at learning over time.
Symbolic meta-learning allows reflective reasoning about strategies, enabling higher-order adaptation.
Neural-only systems lack explicit meta-reasoning.
Implementation / Measurement
Evaluate systems’ ability to:
Generate new reasoning strategies without retraining.
Transfer symbolic learning across tasks.
Benchmarks:
“Meta-ARC” → meta-level reasoning about rule inference efficiency.
Current AGI Status
Weak: LLMs fake meta-learning via pattern recall, not true strategy invention.
Chollet: “No system today can autonomously create new reasoning strategies in unseen environments.”
Observation 15: Design Modular Architectures for Reuse
Definition
Build architectures where reasoning and perception modules are separable and reusable across domains, enabling flexible recombination of learned skills.
Logic
Human cognition is modular:
Vision, language, planning, causal inference = loosely coupled.
Current AI = monolithic networks → brittle and costly to update.
Chollet: “Without modularity, every adaptation requires retraining the entire system.”
Implementation / Measurement
Evaluate transfer learning in ARC:
Can a module learned for color transformations be reused in shape tasks?
Assess composition speed and parameter isolation.
Current AGI Status
Poor: Today’s LLMs and vision models are monolithic; module reusability = near zero.
“The lack of modularity makes current AI incredibly inefficient and inflexible.”
Observation 16: Integrate Symbolic Reasoning for Causality
Definition
Equip systems with explicit causal reasoning engines, moving beyond statistical correlation to genuine cause-effect understanding.
Logic
Chollet emphasizes causality as the key difference between statistical models and intelligent agents.
Without causal reasoning, AI cannot:
Predict effects of actions in novel environments.
Generalize knowledge structurally.
ARC puzzles often require causal inference (e.g., “if color = blue → move object”).
Implementation / Measurement
Benchmarks for causal abstraction:
Simulated environments with manipulable variables.
ARC variants where solutions require hypothetical reasoning (“What if I apply this rule?”).
Current AGI Status
Minimal: Current LLMs cannot model explicit causality.
Chollet: “Without causal inference, generalization is an illusion.”
✅ Summary for Group 4
AGI cannot emerge from monolithic pattern-matching models. Chollet prescribes structural intelligence built on:
Hybrid architectures (symbolic + neural).
Meta-learning capabilities for strategy-level adaptation.
Modularity for scalability and transferability.
Causal reasoning as a first-class citizen, not an afterthought.
Bottom line: Brains are structured, modular, causal; AGI systems must be too.
Group 5: Autonomous Meta-Learning & Self-Improvement
Group Definition and Context
Chollet argues that intelligence is not just the ability to learn but the ability to improve learning itself—what he calls skill-acquisition efficiency. Humans excel because they learn how to learn: every new experience refines our meta-strategies, enabling faster, more general adaptation in the future.
Current AI systems lack this capability. They “learn” through static optimization on massive datasets, then freeze their parameters. Any update = costly retraining, not autonomous refinement. For AGI, AI must:
Monitor its own performance.
Detect weaknesses.
Improve strategies without human intervention.
Adapt continuously without catastrophic forgetting.
Observation 17: Explicitly Implement Autonomous Meta-Learning
Definition
Enable systems to improve their own learning processes autonomously over time, using experience from diverse tasks to generalize faster in new ones.
Logic (Chollet’s View)
Meta-learning allows:
Accumulating “learning priors” for accelerating skill acquisition.
Developing internal rules about rules (second-order reasoning).
Without meta-learning, adaptation speed remains flat → no cumulative intelligence.
Implementation / Measurement
Multi-task sequences (ARC variants): Measure if the system learns faster on task N than on task N-1.
Evaluate strategy generalization across domains:
E.g., after learning symmetry on one puzzle, apply it in unrelated contexts.
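A sketch of the “learns task N faster than task 1” measurement above, in the spirit of forward-transfer metrics from the continual-learning literature; `examples_to_criterion` is a placeholder for the benchmark’s harness.

```python
def forward_transfer(examples_to_criterion, task_sequence) -> list:
    """How much cheaper each new task gets as experience accumulates.

    `examples_to_criterion(task, history)` is assumed to return the number of
    demonstrations the system needed to reach a fixed accuracy threshold on
    `task`, after having already experienced the tasks in `history`.
    Positive values mean task N was cheaper to learn than the first task was.
    """
    baseline = examples_to_criterion(task_sequence[0], history=[])
    gains = []
    for n in range(1, len(task_sequence)):
        cost_n = examples_to_criterion(task_sequence[n], history=task_sequence[:n])
        gains.append(baseline - cost_n)
    return gains
```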
Current AGI Status
Superficial: LLMs mimic meta-learning via dataset coverage, not true self-improvement.
Chollet: “Our systems don’t improve themselves—they are improved by retraining.”
Observation 18: Prioritize Self-Reflection and Introspection
Definition
Develop AI that can evaluate its own knowledge gaps, reasoning errors, and confidence levels, enabling corrective action autonomously.
Logic
Humans engage in meta-cognition:
We ask: “Am I sure? Do I need more evidence? Did I fail?”
Current AI outputs answers without awareness of uncertainty or knowledge limits.
Chollet stresses introspection as critical for self-directed improvement and safety.
Implementation / Measurement
Benchmarks requiring uncertainty reporting and error diagnosis:
Can the system flag its low-confidence answers?
Meta-tasks: Detect failure and self-correct without external labeling.
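A small sketch of the confidence-versus-accuracy check above, using simple expected-calibration-error binning; the example numbers at the end are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10) -> float:
    """Gap between stated confidence and observed accuracy, averaged over
    confidence bins. 0.0 means the system's self-reports can be trusted."""
    assert len(confidences) == len(correct)
    total, ece = len(confidences), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(acc - conf)
    return ece

# Hypothetical: a system that is 90% confident but right only 60% of the time.
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))
```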
Current AGI Status
Poor: LLM confidence scores correlate weakly with accuracy.
No autonomous pipeline for error-driven self-improvement.
Chollet: “LLMs cannot reason about their reasoning.”
Observation 19: Create Explicitly Adaptive Systems
Definition
Systems must adapt incrementally to new tasks without catastrophic forgetting or full retraining, maintaining performance across old and new domains.
Logic
Continuous adaptation = survival in dynamic worlds.
Current AI suffers from:
Catastrophic forgetting: new learning erases old knowledge.
Static parameterization → no lifelong learning.
Implementation / Measurement
Sequential ARC tasks: Evaluate retention of old skills after solving new puzzles.
Lifelong learning benchmarks:
Performance trajectory across evolving distributions.
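A sketch of the retention bookkeeping above, mirroring standard forgetting metrics from continual-learning evaluation; `learn` and `evaluate` are placeholder harness functions, and tasks are assumed to be hashable identifiers.

```python
def forgetting(evaluate, learn, task_sequence) -> dict:
    """Track how accuracy on earlier tasks degrades as new tasks are learned.

    `learn(task)` updates the system in place; `evaluate(task)` returns its
    current accuracy on that task. Forgetting per task = best accuracy ever
    achieved on it minus accuracy after the whole sequence has been learned.
    `task_sequence` holds hashable task identifiers.
    """
    best = {}
    for task in task_sequence:
        learn(task)
        for seen in best:                      # re-check previously learned tasks
            best[seen] = max(best[seen], evaluate(seen))
        best[task] = evaluate(task)
    final = {task: evaluate(task) for task in task_sequence}
    return {task: best[task] - final[task] for task in task_sequence}
```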
Current AGI Status
Weak: Continual learning is an active research field, but mainstream systems (LLMs) rely on snapshot models—zero true lifelong adaptability.
Chollet: “Static models = dead intelligence.”
Observation 20: Foster Autonomous Error Detection and Correction
Definition
Equip systems with mechanisms to identify mistakes, infer causes, and generate self-corrections—without external re-labeling or retraining.
Logic
Humans learn from failure autonomously.
Current AI:
Often unaware of being wrong.
Requires curated feedback loops.
Chollet emphasizes self-repair as critical for scalable autonomy.
Implementation / Measurement
Design tasks where the initial solution is wrong, but the system can revise its own reasoning iteratively.
Measure:
How often does the model recognize its error?
How fast does it recover?
Current AGI Status
Minimal: LLM “self-correction” (via prompting) = illusion; still driven by human instruction.
Chollet: “Current systems don’t learn from their own mistakes—they just output another guess.”
✅ Summary for Group 5
AGI demands systems that:
Learn how to learn (meta-learning).
Think about their thinking (introspection).
Adapt continuously without forgetting.
Self-correct without supervision.
Bottom line: True general intelligence is self-improving, not frozen at training time.
Chollet: “As long as learning is static, there is no intelligence—only a database of patterns.”
Group 6: Agency, Goal-Directedness & Creativity
Group Definition and Context
Chollet argues that intelligence is inherently active and purposive. Humans don’t just react—they set goals, plan, adapt strategies, and create novel solutions.
Current AI systems lack genuine agency: they execute externally specified tasks without self-generated objectives or adaptive decision-making. Chollet emphasizes that to reach AGI, systems must:
Define and reprioritize goals autonomously.
Handle trade-offs and uncertainty.
Demonstrate creativity beyond pattern recall.
Observation 21: Explicitly Foster Autonomous Goal Setting
Definition
Develop systems capable of generating, prioritizing, and modifying goals without explicit external commands, aligned with high-level objectives.
Logic (Chollet’s Argument)
Intelligence = purposeful behavior.
Without internal goal formation:
AI remains a passive pattern generator.
True agency requires:
Anticipating future needs.
Dynamically creating subgoals.
Implementation / Measurement
Goal-discovery tasks:
Present open-ended environments (e.g., ARC variants, sandbox simulations).
Evaluate if the system identifies novel intermediate objectives autonomously.
Metrics:
Diversity and relevance of generated goals.
Adaptation speed when context changes.
Current AGI Status
Non-existent: GPT-class models do not set goals; they react to prompts.
Chollet: “Static models cannot have agency—they cannot want.”
Observation 22: Train Systems for Flexible Goal Management
Definition
Enable AI to reprioritize objectives dynamically, managing multiple goals under evolving constraints.
Logic
Humans constantly adjust plans:
New information = goal reprioritization.
Without flexible goal management:
AI fails in dynamic environments.
Implementation / Measurement
Sequential ARC tasks with conflicting objectives.
Multi-objective benchmarks requiring:
Trade-off reasoning.
Dynamic strategy switching.
Current AGI Status
Weak: Reinforcement learning agents can manage a limited set of goals, but their goal handling remains rigid and brittle.
LLMs: Zero autonomous prioritization capability.
Observation 23: Prioritize Explicit Creativity and Innovation
Definition
Equip AI to produce genuinely novel, useful, and context-appropriate ideas, not just recombinations of memorized patterns.
Logic
Creativity = engine of open-ended generalization.
Chollet stresses:
Pattern interpolation ≠ creativity.
True creativity = abstraction-driven recombination + innovation.
Implementation / Measurement
ARC-based novelty tests:
Require solutions with patterns not seen in training.
Evaluate:
Originality (does it differ from memorized patterns?).
Functionality (is it effective and generalizable?).
Current AGI Status
Superficial: LLM “creativity” = probabilistic remixing of dataset patterns.
Chollet: “Creativity cannot emerge from memorization alone.”
Observation 24: Teach Systems Risk-Aware Decision Making
Definition
Develop systems capable of balancing exploration and exploitation, reasoning under uncertainty, and evaluating trade-offs between risks and rewards.
Logic
Intelligence thrives in uncertain environments.
Risk management:
Requires forecasting consequences.
Demands causal and probabilistic reasoning—both weak in current AI.
Without it:
Systems fail in real-world complexity.
Implementation / Measurement
Tasks introducing stochastic outcomes:
Require safe yet exploratory strategies.
Metrics:
Performance stability under uncertainty.
Ability to self-calibrate risk levels.
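As a toy illustration of risk-aware choice (our own stand-in, not Chollet’s proposal), a mean-minus-variance scoring rule makes the safety/exploration trade-off explicit; the options and penalty weights below are invented.

```python
import statistics

def risk_adjusted_choice(options: dict, risk_penalty: float = 1.0) -> str:
    """Pick the option with the best mean payoff minus a variance penalty.

    `options` maps an action name to a list of sampled or estimated outcomes.
    A larger `risk_penalty` makes the agent more conservative.
    """
    def score(outcomes):
        return statistics.mean(outcomes) - risk_penalty * statistics.pvariance(outcomes)
    return max(options, key=lambda name: score(options[name]))

# Hypothetical: a risky high-mean action vs. a safe moderate one.
options = {"risky": [20, -10, 22, -12, 25], "safe": [3, 4, 3, 4, 3]}
print(risk_adjusted_choice(options, risk_penalty=0.01))  # picks "risky"
print(risk_adjusted_choice(options, risk_penalty=0.5))   # prefers "safe"
```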
Current AGI Status
Poor: LLMs lack mechanisms for explicit risk modeling.
Reinforcement agents approximate it, but fail in open-ended domains.
✅ Summary for Group 6
To achieve AGI, systems must go beyond passive pattern completion. They must:
Generate their own goals.
Reprioritize dynamically under uncertainty.
Innovate beyond memorization.
Evaluate risks intelligently.
Bottom line: True general intelligence = active, purposive, and creative.
Chollet: “Without agency, adaptation and creativity, intelligence is an illusion.”