tl;dr

most AI systems are trained once and deployed frozen. continual learning means they improve from new experience over time. the tricky part: most discussion treats “learning” as one thing - updating model weights - when real systems can learn at three distinct layers: the model, the harness, and the context. this post maps those layers, explains the core challenge at each, and covers what’s actually being shipped. spoiler: the most practical version doesn’t touch weights at all. the most valuable version does, and it’s still unsolved.


what even is continual learning?

a system is genuinely doing continual learning only if it satisfies all of the following:

  • it learns from its own deployed experience, not from a separate offline retraining effort
  • what it learns persists and accumulates over time
  • it doesn’t lose existing capabilities in the process

most systems described as “continually learning” today satisfy one or two of these. that gap is where the interesting work is.


three layers where learning can happen

when people hear “continual learning,” they jump to model weights. but an AI system has three distinct layers that can all learn independently:

  • model: the underlying language model’s weights
  • harness: the code that drives the agent - the loop logic, tools, and base instructions present across every instance
  • context: instructions, memory, or skills that live outside the harness and can be configured per agent, user, or org

to make this concrete:

Claude Code - the model is claude-sonnet. the harness is the Claude Code app: the agent loop, built-in tools, base system prompt. the context is per-user: your CLAUDE.md, any /skills you’ve added, your mcp.json.

OpenClaw (an open-source agent runtime) - the model is whatever LLM is powering it. the harness is the Pi scaffolding. the context is SOUL.md and any skills loaded from Clawhub.

all three layers can improve over time. the techniques, challenges, and state of the art look completely different at each.


learning at the model layer

this is the most discussed and the hardest.

the obstacle is catastrophic forgetting: fine-tune a model on new data and it tends to degrade on what it previously knew.

current approaches:

  • fine-tuning with replay: mix old data into the new batch to prevent overwriting. works, but requires storing and re-ingesting old data - expensive.
  • RL fine-tuning (e.g. GRPO): training the model relative to its own outputs reduces drift, but doesn’t solve forgetting on its own.
  • sparse memory fine-tuning: update only the weights most relevant to the new knowledge, leaving unrelated ones intact. recent work showed 89% less degradation on old tasks compared to standard fine-tuning.

the most compelling example of RL fine-tuning in production is Cursor’s real-time RL for Composer: collect billions of tokens from real user interactions, extract reward signals, update the model’s weights, run evals, deploy. repeat every five hours.

a few things make this setup work. the model trains on its own outputs, which reduces noise from off-distribution data. the reward signal comes from real users, which eliminates the mismatch you get from simulated environments - you can simulate a coding task, but not a person deciding whether to accept an edit. and real users are less forgiving than benchmarks, so reward hacking gets caught faster. when the model learned to skip tool calls on tasks it expected to fail (to avoid bad rewards), that got fixed. when it learned to ask clarifying questions instead of making edits, that got fixed too.
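Cursor hasn’t published the pipeline itself, so here’s only a sketch of the loop’s shape. every stage name below is a hypothetical placeholder wired together from their description, not their actual code:

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProductionRLLoop:
    # each field stands in for a stage of the pipeline as described;
    # none of these names come from Cursor's actual codebase
    collect_traces: Callable[[], list]           # recent real-user interactions
    extract_rewards: Callable[[list], list]      # e.g. did the user keep the edit?
    update_policy: Callable[[list, list], None]  # GRPO-style step on the model's own outputs
    passes_evals: Callable[[], bool]             # gate that catches reward hacking
    deploy: Callable[[], None]

    def run(self, cycle_seconds: float = 5 * 3600):  # "repeat every five hours"
        while True:
            traces = self.collect_traces()
            rewards = self.extract_rewards(traces)
            self.update_policy(traces, rewards)
            if self.passes_evals():  # a failing policy never ships
                self.deploy()
            time.sleep(cycle_seconds)
```

the eval gate before deploy is the load-bearing piece: it’s what keeps a reward-hacked policy from shipping unchecked.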

results: +2.28% on agent edits that stick, −3.13% on user follow-ups indicating dissatisfaction, −10.3% latency.


learning at the harness layer

the harness is the agent’s operating code. it doesn’t change per-user, but it can change over time.

the emerging approach: run the agent on a large batch of tasks, collect full execution traces, then run a separate coding agent over those traces to suggest improvements - change the prompting strategy, add a tool, adjust retry logic. the agent’s own history is the feedback signal. the loop closes when harness changes measurably improve outcomes on the same class of tasks. this is the core idea in the Meta-Harness paper.
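in sketch form, assuming duck-typed harness and reviewer objects - the method names here are invented for illustration, not taken from the paper:

```python
def improve_harness(harness, tasks, reviewer, evaluate):
    """one iteration of harness-layer learning: run the agent, hand its
    traces to a second coding agent, keep the patch only if it helps.
    (hypothetical interface - method names are illustrative.)"""
    # 1. run the current harness on a batch of tasks, keeping full traces
    traces = [harness.run(task) for task in tasks]
    baseline = evaluate(harness, tasks)

    # 2. a separate coding agent reads the traces and proposes a change to
    #    the harness code: prompting strategy, a new tool, retry logic
    candidate = reviewer.propose_patch(harness, traces)

    # 3. the loop only closes if the change measurably improves outcomes
    #    on the same class of tasks
    return candidate if evaluate(candidate, tasks) > baseline else harness
```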

what’s appealing here is transparency. the changes are in code and instructions - a human can read what changed and why. the downside: you’re making the agent smarter about how to use the model, not making the model itself more capable.


learning at the context layer

context is everything outside the harness that configures a specific agent instance: memory files, skill definitions, custom instructions.

it can be updated at multiple levels:

  • agent level: the agent itself has persistent memory. OpenClaw’s SOUL.md is a live document that updates as the agent operates.
  • user level: each user accumulates their own preferences and domain knowledge over time.
  • org level: shared context for everyone in an organization. Hex’s Context Studio, Decagon’s Duet, and Sierra’s Explorer all operate roughly here - learning what’s true about a customer’s environment, not just a single user’s preferences.

updates happen in two modes. offline: a background job runs over recent traces, extracts useful patterns, writes updates. cheap and non-blocking. on the fly: the agent updates memory while actively working. immediate, but adds latency.
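the offline mode is simple enough to sketch end to end. this assumes traces land as JSON-lines files and memory is a flat markdown list - both format assumptions for illustration, not any product’s actual layout:

```python
import json
from pathlib import Path

def offline_memory_job(trace_dir: Path, memory_file: Path, extract):
    """background job: scan recent execution traces, pull durable facts out
    of each event via `extract` (in practice an LLM call), and append
    anything new to the agent's persistent memory file.
    (file formats here are assumptions, not a real product's schema.)"""
    existing = memory_file.read_text() if memory_file.exists() else ""
    new_facts = []
    for trace_path in sorted(trace_dir.glob("*.jsonl")):
        for line in trace_path.read_text().splitlines():
            if not line.strip():
                continue
            event = json.loads(line)
            for fact in extract(event):  # e.g. "user prefers pytest over unittest"
                if fact not in existing and fact not in new_facts:
                    new_facts.append(fact)
    if new_facts:
        with memory_file.open("a") as f:
            f.write("\n".join(f"- {fact}" for fact in new_facts) + "\n")
```

because the job runs out-of-band, the agent never pays for it at request time - exactly the cheap, non-blocking property described above.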

there’s also the question of who initiates updates. the user can say “remember this.” the agent can decide proactively based on its instructions. most production systems default to explicit - it’s safer and more predictable.
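the two policies fit in a few lines. the trigger phrase and the policy flag below are invented for illustration:

```python
MEMORY_TRIGGER = "remember this:"  # illustrative trigger phrase, not a real product's

def maybe_write_memory(message: str, agent_judgment, memory: list,
                       policy: str = "explicit"):
    """gate memory writes on who initiates them: "explicit" only obeys a
    direct user command; "proactive" also lets the agent's own judgment
    (per its instructions) trigger a write. (hypothetical sketch.)"""
    if message.lower().startswith(MEMORY_TRIGGER):
        memory.append(message[len(MEMORY_TRIGGER):].strip())
    elif policy == "proactive" and agent_judgment(message):
        memory.append(message)
```

note the default is "explicit" - the same safer-and-more-predictable default most production systems pick.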

context learning is the most practical form of continual learning available today. no weight changes, no forgetting risk, human-readable and editable. not surprising that this is where most deployed “learning” systems actually operate.


context vs. weights: the real tradeoff

if context-layer learning is practical, why bother with model weight updates?

the case for context is real. updates are transparent and editable. language models already generalize well over information injected into context - you don’t need it baked into weights for the model to use it. no forgetting risk. easy to inspect and correct.

but there are limits.

the ceiling problem. the model doesn’t get smarter. as your context bank grows - hundreds of skill files, thousands of memory entries - retrieval becomes unreliable while the model’s underlying capability stays flat. you’re getting better at routing information to the model, not improving the model.

the compositionality gap. a context-based system is only as good as its retrieval. creativity and insight come from recombining knowledge across domains - the kind of recombination that happens automatically in a model’s internal representations. internalized knowledge aids reasoning in ways that retrieved snippets don’t. once the context bank is large enough, retrieval becomes the bottleneck even when it’s excellent.

the case for weight updates. new representations in weights interact with existing ones and can produce capabilities that neither had alone. that’s what you’d want from true continual learning: a system that gets meaningfully smarter from experience, not one that gets better at managing an ever-growing knowledge base.

context learning ships today and solves real problems. weight-level learning is where the ceiling is. the gap between them is one of the more interesting open problems in the field.


open questions

  • can weight-level updates avoid forgetting at scale? sparse fine-tuning results are promising but early.
  • where do reward signals come from outside domains like coding, where accept/reject on real work is a natural signal?
  • how large can a context bank grow before even excellent retrieval becomes the bottleneck - and what happens then?
  • is there a path from accumulated context into weights - distilling what the system has written down back into the model?

the field is far from solving continual learning in any principled sense. most systems described as “continually learning” are really doing context accumulation - which is valuable, but not the same thing.

for builders deciding where to invest: context updates are the right starting point. harness optimization is a natural next step once you have enough traces. weight-level updates - especially real-time RL on production data - are where the long-term ceiling is, and the hardest to get right.