AReaL 2.0 Open Source: Building Self-Evolving AI Agents with Online RL

AI agents have rapidly evolved from impressive demonstrations into production systems powering software engineering, customer support, research, and enterprise automation. As adoption accelerates, the industry’s focus is shifting away from a simple question—Can an agent complete a task?—toward a far more ambitious one:

Can an agent continuously improve itself while serving real users?

This concept, often described as agent self-evolution, is gaining momentum across the AI ecosystem.

Recently, Anthropic engineer Boris Cherny revealed that many internal engineering workflows involve hundreds of autonomous agents operating in self-improvement loops. Anthropic’s accompanying research, When AI Builds Itself, further explores how AI systems are increasingly participating in their own research and development processes.

Despite these advances, most production agents still suffer from a fundamental limitation: they execute tasks but rarely learn from them.

Every day, deployed agents generate enormous amounts of valuable experience—including successful task trajectories, failed reasoning paths, tool invocations, user corrections, and reward signals. Yet in most production environments, this information remains little more than application logs.

AReaL 2.0 aims to close that gap.

Developed through collaboration between Ant Group, The Hong Kong University of Science and Technology (HKUST), and Tsinghua University, AReaL 2.0 introduces an open-source infrastructure designed to transform production agent interactions into continuous online reinforcement learning (RL).

Rather than requiring developers to redesign existing agents, the framework focuses on enabling continuous learning with minimal architectural disruption.

🚀 Why Agent Self-Evolution Requires More Than Better Models

Improving an AI agent is no longer simply a matter of training a larger language model.

Production agents consist of numerous interconnected components, including:

Large language models
Planning logic
Tool orchestration
Memory systems
Retrieval pipelines
Security policies
Human feedback mechanisms

Each user interaction produces valuable signals about what worked, what failed, and what should improve.

Without infrastructure capable of capturing, organizing, and replaying those experiences, reinforcement learning remains largely confined to offline experimentation.

AReaL 2.0 addresses this problem by treating online learning as infrastructure rather than as an isolated algorithm.

🏗️ Three Core Building Blocks of Agent Self-Evolution

AReaL 2.0 organizes continuous learning around three foundational components.

Agent Trajectory Data Protocol (ATDP)

Traditional application logs record operational information such as:

User prompts
Model responses
Tool calls
Errors
Latency
Token usage

While useful for debugging, these logs lack the structure required for reinforcement learning.

ATDP introduces a richer trajectory representation by recording each decision step throughout an agent’s execution.

Each trajectory may include:

Agent observations
Internal execution state
Selected actions
Tool outputs
Reward signals
Model versions
Tool versions
Cost metrics
Security metadata
Tenant information

By capturing the complete decision process, developers gain fine-grained visibility into exactly which reasoning steps contribute to successful or unsuccessful outcomes.
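
To make the idea of a step-level trajectory concrete, here is a minimal sketch of what such a record could look like. The class and field names (`TrajectoryStep`, `Trajectory`, `tenant_id`, and so on) are assumptions chosen for illustration; they are not the actual ATDP schema, which this article does not spell out.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class TrajectoryStep:
    """One decision step. Field names are illustrative, not the ATDP schema."""
    step_index: int
    observation: str
    action: str
    tool_output: Optional[str] = None
    reward: Optional[float] = None


@dataclass
class Trajectory:
    """A whole episode plus the metadata needed to replay or filter it later."""
    trajectory_id: str
    tenant_id: str
    model_version: str
    tool_versions: Dict[str, str] = field(default_factory=dict)
    steps: List[TrajectoryStep] = field(default_factory=list)

    def add_step(self, observation: str, action: str,
                 tool_output: Optional[str] = None,
                 reward: Optional[float] = None) -> TrajectoryStep:
        # The step index is derived from position so steps stay ordered.
        step = TrajectoryStep(len(self.steps), observation, action,
                              tool_output, reward)
        self.steps.append(step)
        return step

    def total_reward(self) -> float:
        # Steps without a reward signal are skipped rather than counted as 0.
        return sum(s.reward for s in self.steps if s.reward is not None)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Unlike a flat log line, each step keeps its observation, action, and reward together, so a later training job can attribute outcomes to individual decisions.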

Enterprise Agent Data Proxy
#

Capturing production data introduces governance challenges.

Enterprise deployments often involve:

Multiple frameworks
Different tenants
Role-based permissions
Sensitive customer information
Regulatory compliance requirements

The Data Proxy serves as a controlled gateway between production services and reinforcement learning pipelines.

Its responsibilities include:

Trajectory collection
Data sanitization
Permission enforcement
Metadata management
Reward aggregation
Replay preparation

Importantly, governance occurs before data enters training workflows, allowing organizations to define exactly which information is eligible for learning while protecting sensitive content.
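
The "govern first, train later" ordering can be sketched in a few lines. Everything here is hypothetical: the record layout, the `opted_in_tenants` allow-list, and the single email-redaction rule are stand-ins for whatever policies a real deployment would define, not the Data Proxy's actual interface.

```python
import re
from typing import Dict, List, Set

# Deliberately simple pattern; real sanitization would cover far more than emails.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize_text(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


def prepare_for_training(records: List[Dict[str, str]],
                         opted_in_tenants: Set[str]) -> List[Dict[str, str]]:
    """Return only records that are eligible for learning, already sanitized."""
    eligible = []
    for record in records:
        # Permission check happens before any data reaches a training pipeline.
        if record["tenant_id"] not in opted_in_tenants:
            continue
        cleaned = dict(record)  # copy so the production record is untouched
        cleaned["content"] = sanitize_text(record["content"])
        eligible.append(cleaned)
    return eligible
```

Because filtering and redaction run before the data leaves the proxy, the training side never has to be trusted with raw production content.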

Agent Evolution Control Plane

Not every mistake should trigger model retraining.

Production agents evolve through multiple mechanisms.

For example:

Missing knowledge may require updating memory.
Incorrect tool selection may require routing changes.
Prompt failures may require prompt refinement.
Repeated policy failures may justify reinforcement learning.

The Evolution Control Plane determines:

Whether an update is necessary
Which component should evolve
Which learning algorithm is appropriate
How updates should be validated
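
The first two of these decisions can be pictured as a small dispatch table. The sketch below is an assumption-laden illustration: the enum names, the `rl_threshold` parameter, and the "most frequent failure wins" rule are invented here and do not describe AReaL's actual control-plane logic.

```python
from collections import Counter
from enum import Enum
from typing import List, Optional


class FailureKind(Enum):
    MISSING_KNOWLEDGE = "missing_knowledge"
    WRONG_TOOL = "wrong_tool"
    PROMPT_FAILURE = "prompt_failure"
    POLICY_FAILURE = "policy_failure"


class UpdateTarget(Enum):
    MEMORY = "memory"
    ROUTING = "routing"
    PROMPT = "prompt"
    RL_TRAINING = "rl_training"


# Mirrors the examples in the text: each failure type maps to one component.
TARGETS = {
    FailureKind.MISSING_KNOWLEDGE: UpdateTarget.MEMORY,
    FailureKind.WRONG_TOOL: UpdateTarget.ROUTING,
    FailureKind.PROMPT_FAILURE: UpdateTarget.PROMPT,
    FailureKind.POLICY_FAILURE: UpdateTarget.RL_TRAINING,
}


def choose_update(failures: List[FailureKind],
                  rl_threshold: int = 3) -> Optional[UpdateTarget]:
    """Pick which component should evolve, or None if no update is needed."""
    if not failures:
        return None
    kind, count = Counter(failures).most_common(1)[0]
    # Retraining is the most expensive option, so only repeated policy
    # failures (at least rl_threshold of them) justify it.
    if kind is FailureKind.POLICY_FAILURE and count < rl_threshold:
        return None
    return TARGETS[kind]
```

The point of the sketch is the asymmetry: cheap fixes such as a memory or prompt update fire immediately, while reinforcement learning is reserved for failures that keep recurring.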

Before deployment, candidate improvements can undergo:

Offline replay evaluation
Regression testing
Safety verification
Tenant-specific validation
Canary deployments
Version tracking

This governance layer turns continuous learning into a controlled engineering process, with explicit validation and staged rollout, rather than an unchecked automated feedback loop.
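
One way to picture the pre-deployment checks is as a list of named gates that a candidate must clear before it is allowed into a canary rollout. The `Candidate` fields and the two example checks below are hypothetical; a real pipeline would plug in its own replay, regression, and safety evaluations.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    """A proposed update. Fields are illustrative placeholders."""
    version: str
    replay_score: float      # metric from offline replay evaluation
    baseline_score: float    # same metric for the currently deployed version
    safety_violations: int   # count reported by a safety check


Check = Callable[[Candidate], bool]


def no_regression(candidate: Candidate) -> bool:
    return candidate.replay_score >= candidate.baseline_score


def is_safe(candidate: Candidate) -> bool:
    return candidate.safety_violations == 0


def validate(candidate: Candidate,
             checks: List[Tuple[str, Check]]) -> Tuple[bool, List[str]]:
    """Run every gate and report which ones failed (empty list means pass)."""
    failed = [name for name, check in checks if not check(candidate)]
    return (not failed, failed)


GATES = [("regression", no_regression), ("safety", is_safe)]
```

Running all gates instead of stopping at the first failure gives reviewers a complete picture of why a candidate was rejected, which is useful for audit trails.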

⚙️ Online Reinforcement Learning as a Microservice Platform

Instead of tightly coupling training and inference, AReaL 2.0 decomposes online reinforcement learning into modular services that can be independently deployed and scaled.

This architecture enables existing agents to participate in continuous learning without significant changes to business logic.

The primary runtime components include:

Gateway

The Gateway serves as the external entry point.

It accepts requests through interfaces such as:

HTTP
WebSocket
OpenResponses-compatible APIs

It also routes trajectory data into training pipelines.

Router

Most production agents execute long-running workflows involving multiple interactions.

The Router maintains session affinity, ensuring that related requests remain associated with the same execution context.

This preserves conversation continuity while supporting horizontal scalability.
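
Session affinity is often implemented as a sticky assignment table: the first request of a session picks a worker, and every later request reuses it. The sketch below shows that general pattern under assumed names (`SessionRouter`, `route`, `end_session`); it is not AReaL's Router implementation.

```python
from typing import Dict, List


class SessionRouter:
    """Sticky routing: a session stays on the worker it was first assigned to."""

    def __init__(self, workers: List[str]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._load: Dict[str, int] = {worker: 0 for worker in workers}
        self._assignments: Dict[str, str] = {}

    def route(self, session_id: str) -> str:
        if session_id not in self._assignments:
            # New session: pick the worker with the fewest active sessions.
            worker = min(self._load, key=self._load.get)
            self._assignments[session_id] = worker
            self._load[worker] += 1
        return self._assignments[session_id]

    def add_worker(self, worker: str) -> None:
        # Scaling out only affects new sessions; existing ones keep their worker.
        self._load.setdefault(worker, 0)

    def end_session(self, session_id: str) -> None:
        worker = self._assignments.pop(session_id, None)
        if worker is not None:
            self._load[worker] -= 1
```

Keeping the table explicit means new workers can be added without moving in-flight sessions, at the cost of the router holding per-session state.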

Data Proxy

Within the runtime architecture, the Data Proxy performs several functions:

Session management
Context packaging
Trajectory persistence
Training data retrieval
Metadata synchronization

It effectively bridges production traffic and reinforcement learning datasets.

Agent Compute Worker

The Agent Compute Worker executes the core agent logic.

Depending on deployment mode, it may perform:

Language model inference
Tool execution
Response generation
Trajectory sampling
Reinforcement learning training

Supported inference and training backends include:

vLLM
SGLang
Megatron
Fully Sharded Data Parallel (FSDP)

Controller

The Controller orchestrates the overall runtime environment.

Its responsibilities include:

Service discovery
Worker lifecycle management
Health monitoring
Traffic routing
Scaling operations
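
Health monitoring of this kind is commonly built on heartbeats with a timeout. The following sketch assumes that approach; `HealthMonitor` and its methods are illustrative names, and timestamps are passed in explicitly so the logic stays deterministic and easy to test.

```python
from typing import Dict, List


class HealthMonitor:
    """Tracks the last heartbeat per worker and flags the ones that went quiet."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, worker_id: str, now: float) -> None:
        self._last_seen[worker_id] = now

    def healthy_workers(self, now: float) -> List[str]:
        # A worker is healthy if it reported within the timeout window.
        return sorted(w for w, seen in self._last_seen.items()
                      if now - seen <= self.timeout)

    def stale_workers(self, now: float) -> List[str]:
        # Candidates for restart or removal from traffic routing.
        return sorted(w for w, seen in self._last_seen.items()
                      if now - seen > self.timeout)
```

In a running service the `now` argument would typically come from a monotonic clock such as `time.monotonic()`.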

Together, these components provide an end-to-end infrastructure for serving, monitoring, and continuously improving AI agents.

🧪 Practical Reinforcement Learning Workflows

AReaL 2.0 demonstrates its architecture through two representative implementations.

Hermes Integration

Hermes illustrates how developers can integrate an existing production agent into an online reinforcement learning pipeline with minimal changes.

Instead of rebuilding planning systems, toolchains, memory modules, and execution environments, developers simply replace the standard inference backend with an AReaL-managed Agent Compute Worker.

This allows real-world interactions to flow directly into asynchronous reinforcement learning pipelines.
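
The backend swap can be illustrated with a small interface-based sketch. None of these classes come from Hermes or AReaL: `InferenceBackend`, `StandardBackend`, `RecordingBackend`, and `Agent` are hypothetical stand-ins that only show why the agent's own logic does not need to change when the backend behind it does.

```python
from typing import Dict, List, Protocol


class InferenceBackend(Protocol):
    def generate(self, prompt: str) -> str: ...


class StandardBackend:
    """Placeholder for an ordinary inference endpoint."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class RecordingBackend:
    """Stands in for a managed worker: same interface, but every call is
    captured so it can later be turned into training data."""

    def __init__(self, inner: InferenceBackend, log: List[Dict[str, str]]) -> None:
        self.inner = inner
        self.log = log

    def generate(self, prompt: str) -> str:
        completion = self.inner.generate(prompt)
        self.log.append({"prompt": prompt, "completion": completion})
        return completion


class Agent:
    """Planning, tools, and memory would live here; only the backend is swapped."""

    def __init__(self, backend: InferenceBackend) -> None:
        self.backend = backend

    def run(self, task: str) -> str:
        return self.backend.generate(f"Plan and solve: {task}")
```

Switching from `Agent(StandardBackend())` to `Agent(RecordingBackend(StandardBackend(), log))` changes one constructor argument and leaves the rest of the agent untouched, which is the kind of minimal integration the text describes.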

The design emphasizes portability, enabling organizations to reuse the same architecture across different task domains.

Claude Code-Style Software Engineering Agents

AReaL also provides a complete software engineering (SWE) reference implementation inspired by coding agents.

The project demonstrates best practices across three major areas.

Data Processing

Training samples are curated so that every problem remains solvable, and issue descriptions are improved to provide clearer supervision.

Infrastructure

Large-scale sandbox environments support massive concurrent execution through techniques such as:

Distributed scheduling
Image prewarming
Fast environment creation

These optimizations reduce instability during long-running reinforcement learning experiments.

Algorithmic Stability

The framework introduces techniques including KPop to reduce discrepancies between inference and training engines.

Additional safeguards include:

Token-level adaptive filtering
Reward hacking prevention
Stable late-stage optimization
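
The article does not describe how KPop or the token-level filter work, so the sketch below shows a generic, widely used idea instead: compare each token's log-probability under the training engine with the value reported by the inference engine, and drop tokens whose probability ratio falls outside a band. The function names and the fixed `max_ratio` threshold are assumptions; an adaptive variant would adjust the threshold during training.

```python
import math
from typing import List


def token_mask(train_logprobs: List[float],
               infer_logprobs: List[float],
               max_ratio: float = 2.0) -> List[bool]:
    """Keep a token only if p_train / p_infer lies in [1/max_ratio, max_ratio]."""
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    low = 1.0 / max_ratio
    mask = []
    for lp_train, lp_infer in zip(train_logprobs, infer_logprobs):
        ratio = math.exp(lp_train - lp_infer)
        mask.append(low <= ratio <= max_ratio)
    return mask


def masked_mean(values: List[float], mask: List[bool]) -> float:
    """Average per-token values (for example, losses) over kept tokens only."""
    kept = [value for value, keep in zip(values, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

Tokens on which the two engines disagree strongly contribute nothing to the averaged loss, which limits how much an engine mismatch can distort a gradient step.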

The result is a reproducible pipeline capable of supporting sustained reinforcement learning improvements across hundreds of training iterations.

🔄 From Task Execution to Continuous Learning

The broader AI agent ecosystem is rapidly becoming production infrastructure.

Coding assistants increasingly operate inside cloud sandboxes.

Protocols such as MCP and A2A simplify communication between models, tools, and specialized agents.

Enterprise deployments now demand capabilities such as:

Permission isolation
Cost optimization
Audit trails
Rollback mechanisms
Security governance

These operational requirements fundamentally change how reinforcement learning must be integrated.

Rather than treating learning as a separate offline process, production systems increasingly require learning to become part of the deployment lifecycle itself.

AReaL 2.0 targets precisely this transition.

By converting production interactions into structured reinforcement learning signals, the framework enables deployed agents to gradually improve through actual usage instead of relying exclusively on manually curated datasets.

🌐 Open-Source Roadmap

The AReaL project has continued expanding its open-source ecosystem.

Following its incubation within Ant Group’s inclusionAI initiative, the project joined the PyTorch Foundation Ecosystem, broadening community participation and hardware support.

Recent contributions include:

Huawei Cloud’s adaptation for Ascend NPUs
MindLab’s LoRA-based reinforcement learning serving solution for resource-constrained environments

Looking ahead, the roadmap focuses on two major initiatives.

AReaL AutoPilot

The project aims to reduce the complexity of reinforcement learning deployment by automating tasks such as:

Training kernel generation
Parallelization strategy optimization
Reinforcement learning health monitoring
Deployment orchestration

Unified Hardware Adaptation

AReaL also plans to establish standardized interfaces supporting multiple accelerator platforms through:

Precision alignment
Weight conversion standards
Common benchmarking suites
Cross-platform runtime compatibility

📈 The Future of Self-Evolving AI Agents

As AI agents become increasingly embedded within production workflows, the next competitive advantage will extend beyond task completion.

Future systems will distinguish themselves by how effectively they transform every interaction into an opportunity for improvement.

This shift requires far more than larger language models. It demands production-ready infrastructure capable of capturing trajectories, governing sensitive data, orchestrating reinforcement learning, and safely deploying incremental updates.

AReaL 2.0 represents an important step toward that vision by providing an open-source foundation for online reinforcement learning that integrates directly with real-world agent deployments.

While truly autonomous self-evolving agents remain an active research frontier, frameworks such as AReaL demonstrate that the underlying infrastructure is rapidly maturing—and that continuous learning is becoming a practical engineering problem rather than a purely theoretical one.