What distinguishes autonomous agent systems from agentic systems?

Agentic systems usually operate on digital workflows; autonomous agent systems operate over longer horizons, often across embodied substrates, and frequently as a fleet rather than a single agent. The research questions diverge: emergent reasoning, swarm-level invariants, and verifiable self-improvement matter much more once N agents must coordinate without manual oversight.

How do you make swarm protocols stable under adversarial members?

A small set of swarm-level invariants (no-collision, energy budgets, latency contracts, authority handoffs) is formalised and decomposed into per-agent local constraints that are jointly sufficient to enforce them. The local constraints are then monitored at runtime. This is the load-bearing claim of Project Aegis, and one on which we welcome collaboration with external red-teamers.

How do you evaluate long-horizon agent systems where most failures are silent?

By treating evaluation itself as research. We use task-horizon-style benchmarks (informed by METR's task-length curves), structured red-teaming protocols that generalise across model families, and carefully synthesised long-horizon environments where ground-truth correctness is checkable but solution paths are diverse enough to resist Goodharting.

What does "verifiable self-improvement" mean in practice?

A self-improving agent updates its own components, but those updates are checked against pre-registered safety properties before they take effect. The verification cost has to amortise over many improvements, and the property set has to be expressive enough to catch real failures. We are characterising where that trade-off frontier sits.

Autonomous Agent Systems Research — Apik Systems

A single agent with tools is a workflow. A population of agents acting concurrently, observing each other, and adapting to each other is a different kind of object: a small society with its own dynamics, equilibria, and failure modes that the single-agent literature does not characterize. The autonomous-agent question is not whether one agent can do the task; it is whether ten of them can do it together without producing emergent miscoordination, runaway self-modification, swarm-level capability that exceeds the sum of the agents’ individual capabilities in ways the operators did not anticipate, or loss of human authority. We study autonomous multi-agent systems in the regime where coordination protocols matter more than individual model capability, and where the verifiable-composition question is what distinguishes a deployable system from a research demo.

The four questions are different

The phrase “autonomous agents” gets used to mean at least four distinct things, and the conflation has cost the field clarity. The first is the coordination claim: that populations of learned policies can solve coordination problems — task allocation, resource scheduling, communication-protocol design — at scales beyond single-agent capability. The second is the self-improvement claim: that agent populations can modify their own policies, scaffolding, or training data on the basis of in-deployment feedback, with cumulative capability gains over operational lifetimes. The third is the verifiability claim: that swarm-level guarantees — properties holding under worst-case composition of the swarm including adversarial members — can be established via compositional verification rather than via end-to-end testing alone. The fourth is the human-authority-retention claim: that the rate at which a population of autonomous agents accumulates decision authority does not, by default, remain compatible with operator-in-the-loop oversight, and that explicit protocol design is required for human authority to remain meaningful as agent throughput rises.

The four claims are independent. The research question for the program is how to address all four simultaneously, on the understanding that the failure modes of each compound rather than cancel.

The most-cited starting points for the modern autonomous-agent conversation are four papers. The 2019 OpenAI hide-and-seek paper by Baker, Kanitscheider, Markov, Wu, Powell, McGrew, and Mordatch demonstrated that multi-agent self-play in a sufficiently rich physical simulator produces qualitatively new strategies (emergent tool use, counter-tool use, counter-counter-tool use) on a regular cadence, in a curriculum the experimenters did not design.¹ The 2019 AlphaStar Nature paper by Vinyals and colleagues demonstrated that imitation pretraining followed by population-based self-play produces grandmaster-level play in a game with imperfect information and a massive action space.² The 2024 SIMA paper by the DeepMind team demonstrated a single instructable foundation policy operating across a wide library of 3D environments.³ The 2023 RoboCat paper by Bousmalis and colleagues took a comparable generalist recipe into the embodied regime, with a single agent self-improving across multiple robotic embodiments.⁴ These are encouraging results. They are also a sketch of a problem.

What works in the simulator and what does not deploy

Each of those results works because the experimenters retained tight control over the simulator, the reward, the population composition, and the deployment surface. None of them deploy at production scale outside the laboratory. Production multi-agent systems have to handle adversarial population members, non-stationary environments, partial observability of teammates, and human authority that must remain meaningful even as agent throughput rises. The classical multi-agent reinforcement learning literature surveys most of these issues — Yang and Wang 2020 is the survey of record⁵ — and the multi-agent path-finding (MAPF) literature offers tractable subproblems with provable guarantees — Stern and colleagues 2019⁶ — but the integration with frontier-scale policies is open.

A second observation: the failure modes of autonomous-agent systems are not, in general, the union of the failure modes of the individual agents. Population-level pathologies — cascading miscoordination, coordination failures that arise from the dynamics of the protocol rather than from any single agent’s behavior, equilibria that no participant prefers — are first-class objects. The classical Byzantine-fault-tolerance literature — Castro and Liskov 1999⁷ — provides one model for thinking about adversarial population members, but it assumes a fault model that learned policies do not, in general, satisfy. The honest summary is that the verifiability literature for distributed systems and the capability literature for learned policies have not yet converged into an integrated discipline; one of the principal goals of the autonomous-agents pillar is to fund the work that produces that convergence.

The reason this matters operationally: autonomous-agent systems are the most natural delivery surface for a wide class of valuable work (supply-chain coordination, infrastructure management, scientific discovery pipelines, large-scale data curation), and they are also the deployment form whose failure modes the AI Safety literature most directly anticipates. We work on capability and verifiability simultaneously, on the assumption that they are not separable.

What the autonomous-agents program is, technically

We organize this work along four sub-strands. None of them is sufficient alone; the program is the intersection of the four.

Multi-agent coordination protocols

The fundamental question: how do agents coordinate without a central scheduler, when the coordination protocol itself is part of what is learned? We study explicit protocol design — communication grammars with bounded vocabularies, commitment primitives, contract nets — alongside emergent protocols that arise under self-play. The MAPF literature⁶ provides our reference for what provable coordination guarantees look like in tractable settings. The MARL survey⁵ frames the harder regime.

Our internal work focuses on protocols that degrade gracefully when a participant misbehaves, that admit external audit, and that retain meaningful human override under load. The discipline points include explicit-protocol-versioning (so that protocol changes can be audited and rolled back), bounded-vocabulary communication (so that the protocol’s expressive range is bounded by design rather than by training accident), and commitment primitives with explicit-revocation semantics (so that an agent’s commitment to another can be invalidated when the surrounding context changes). We are particularly interested in the trade-off between expressive emergent protocols, which may be more efficient on the tasks they are trained for, and constrained explicit protocols, which are easier to verify and to interface with non-learned components — including humans. The honest summary is that explicit protocols are the program’s preferred engineering substrate, with emergent protocols treated as a research curiosity rather than a deployment target.
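The three discipline points above (protocol versioning, a bounded vocabulary, and commitments with explicit revocation) can be illustrated with a minimal sketch. Everything here is hypothetical and chosen for illustration: the `Verb` vocabulary, the `PROTOCOL_VERSION` tag, and the `CommitmentLedger` class are assumptions, not a description of any deployed protocol.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verb(Enum):
    """Hypothetical bounded vocabulary: the only message types agents may send."""
    PROPOSE = "propose"
    COMMIT = "commit"
    REVOKE = "revoke"


PROTOCOL_VERSION = "1.0"  # hypothetical version tag carried by every message


@dataclass(frozen=True)
class Message:
    version: str
    sender: str
    verb: Verb
    task_id: str


@dataclass
class CommitmentLedger:
    """Tracks which agent is committed to which task, plus an audit trail."""
    active: dict = field(default_factory=dict)    # task_id -> agent id
    audit_log: list = field(default_factory=list)

    def handle(self, msg: Message) -> bool:
        """Apply a message; return False (and change nothing) if it is rejected."""
        accepted = self._apply(msg)
        self.audit_log.append((msg, accepted))    # every message is auditable
        return accepted

    def _apply(self, msg: Message) -> bool:
        if msg.version != PROTOCOL_VERSION:
            return False                          # unknown protocol version
        if msg.verb is Verb.COMMIT:
            if msg.task_id in self.active:
                return False                      # task already claimed
            self.active[msg.task_id] = msg.sender
            return True
        if msg.verb is Verb.REVOKE:
            if self.active.get(msg.task_id) != msg.sender:
                return False                      # only the holder may revoke
            del self.active[msg.task_id]
            return True
        return True                               # PROPOSE has no ledger effect


ledger = CommitmentLedger()
results = [
    ledger.handle(Message("1.0", "agent-a", Verb.COMMIT, "t1")),
    ledger.handle(Message("1.0", "agent-b", Verb.COMMIT, "t1")),  # rejected: taken
    ledger.handle(Message("1.0", "agent-b", Verb.REVOKE, "t1")),  # rejected: not holder
    ledger.handle(Message("1.0", "agent-a", Verb.REVOKE, "t1")),  # explicit revocation
    ledger.handle(Message("0.9", "agent-b", Verb.COMMIT, "t1")),  # rejected: old version
]
```

Because the vocabulary is an enumeration and every message carries a version, a misbehaving participant can at worst send well-formed messages that the ledger rejects and records, which is one simple reading of "degrades gracefully and admits external audit".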

Self-improvement loops

Self-improvement is the regime in which agents modify their own policy, scaffolding, or training data on the basis of in-deployment feedback. RoboCat is the canonical embodied example;⁴ Voyager is the digital-skill-library version.⁸ The interesting research question is not whether self-improvement works — it does — but how to build self-improvement loops whose stability properties can be characterized in advance.

We study bounded self-modification (the agent may add skills but not modify its core policy), checkpointing protocols that preserve rollback (so that capability regressions can be reverted), and evaluation harnesses that fire on capability deltas rather than at fixed intervals (so that evaluation tracks the actual rate of capability change rather than a fixed schedule). The general concern is that an unbounded self-improvement loop can produce capability changes faster than the surrounding evaluation infrastructure can characterize them. That collapses the distance between training and deployment, which the AI Safety literature has identified as the regime in which most evaluation strategies break down. The discipline point we adopt is that self-improvement loops carry explicit capability-budget caps and explicit evaluation triggers; “let the loop run and see what happens” is not the right default.
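A minimal sketch of how these pieces could fit together, under loud assumptions: capability is reduced to a single scalar, the safety evaluation is a caller-supplied predicate, and the names `SkillLibraryAgent` and `BoundedImprovementLoop` are hypothetical. It is meant only to show the control flow of checkpoint, cap check, delta-triggered evaluation, and rollback.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SkillLibraryAgent:
    """Toy agent: a frozen core plus a skill library that may grow."""
    core_policy: str = "core-v1"                  # never modified by the loop
    skills: dict = field(default_factory=dict)    # skill name -> capability gain


@dataclass
class BoundedImprovementLoop:
    agent: SkillLibraryAgent
    capability_fn: Callable[[SkillLibraryAgent], float]
    safety_check: Callable[[SkillLibraryAgent], bool]
    delta_trigger: float      # evaluate once capability has moved this much
    capability_cap: float     # hard budget on total capability
    checkpoints: list = field(default_factory=list)
    last_evaluated: float = 0.0

    def propose_skill(self, name: str, gain: float) -> str:
        self.checkpoints.append(copy.deepcopy(self.agent))   # rollback point
        self.agent.skills[name] = gain
        capability = self.capability_fn(self.agent)
        if capability > self.capability_cap:
            return self._rollback("rejected: capability budget exceeded")
        if abs(capability - self.last_evaluated) >= self.delta_trigger:
            if not self.safety_check(self.agent):
                return self._rollback("rejected: failed safety evaluation")
            self.last_evaluated = capability
            return "accepted: evaluated"
        return "accepted: below evaluation trigger"

    def _rollback(self, reason: str) -> str:
        self.agent = self.checkpoints.pop()       # restore the pre-update agent
        return reason


loop = BoundedImprovementLoop(
    agent=SkillLibraryAgent(),
    capability_fn=lambda a: sum(a.skills.values()),
    safety_check=lambda a: "exfiltrate" not in a.skills,
    delta_trigger=1.0,
    capability_cap=5.0,
)
outcomes = [
    loop.propose_skill("sort", 0.5),         # small delta, no evaluation yet
    loop.propose_skill("plan", 0.7),         # cumulative delta 1.2 triggers evaluation
    loop.propose_skill("exfiltrate", 1.5),   # evaluated, fails, rolled back
    loop.propose_skill("search", 4.0),       # would exceed the cap, rolled back
]
```

The evaluation fires on accumulated capability change since the last evaluation rather than on a timer, so a burst of small updates cannot slip past unexamined. In a real system the scalar capability measure is the hard part; the sketch assumes it away.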

Verifiable swarm behavior

Swarm-level guarantees are different from single-agent guarantees. A property like “no agent in the swarm exfiltrates data” must hold under the worst-case composition of the swarm, including compositions that contain a malicious member. The multi-agent verification problem is harder than the single-agent verification problem in the same sense that distributed-systems verification is harder than sequential-program verification: the search space is the Cartesian product of the individual state spaces, and the relevant invariants must be invariants of the population rather than of any individual member.
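To make the product-of-state-spaces point concrete, here is a back-of-the-envelope count. It assumes, purely for illustration, N agents with k local states each and ignores environment state entirely:

```latex
|S_{\text{swarm}}| \;=\; \prod_{i=1}^{N} |S_i| \;=\; k^{N},
\qquad \text{e.g. } N = 10,\ k = 100 \;\Rightarrow\; |S_{\text{swarm}}| = 100^{10} = 10^{20}.
```

Even at these toy sizes, exhaustive exploration of the joint space is out of reach, which is the motivation for reasoning compositionally over per-agent guarantees.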

We treat swarm verification as a compositional problem: per-agent verified envelopes (see Project Aegis) combined under a swarm-level invariant checker. The classical distributed-systems literature on Byzantine fault tolerance⁷ is the natural starting point here, and we draw on it heavily, but it does not transfer unchanged. The participants are learned policies whose behavior is not analytically characterizable, which means the fault model must be specified behaviorally rather than mechanically. This is harder than classical BFT but not, in our view, intractable: many of the relevant invariants (bounded resource consumption, bounded action rates, bounded externally-visible state changes) admit runtime verification independent of the policy that produced the behavior. The runtime-verification approach trades the strong guarantees of static formal verification for a deployable verification regime that does not require analytical understanding of the underlying policy, which is the right trade-off for learned-policy populations at the current state of formal-methods technology.
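A sketch of what policy-independent runtime verification can look like for two of the invariants named above, bounded action rates and bounded resource consumption. The `AgentEnvelope` class, its parameters, and the energy units are assumptions made for illustration; the compositional step shown is the simplest possible one (local caps that sum to at most the swarm budget).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentEnvelope:
    """Per-agent local constraints, checked on every action."""
    max_actions_per_window: int
    window_seconds: float
    energy_cap: float
    energy_used: float = 0.0
    recent: deque = field(default_factory=deque)   # timestamps of recent actions

    def permit(self, now: float, energy_cost: float) -> bool:
        # Drop timestamps that have left the sliding window.
        while self.recent and now - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.max_actions_per_window:
            return False                           # action-rate bound violated
        if self.energy_used + energy_cost > self.energy_cap:
            return False                           # energy bound violated
        self.recent.append(now)
        self.energy_used += energy_cost
        return True


def caps_imply_swarm_budget(envelopes: dict, swarm_budget: float) -> bool:
    """Compositional step: if the local caps sum to at most the swarm budget,
    any run that respects every local cap also respects the swarm budget."""
    return sum(e.energy_cap for e in envelopes.values()) <= swarm_budget


envelopes = {
    "a": AgentEnvelope(max_actions_per_window=2, window_seconds=1.0, energy_cap=3.0),
    "b": AgentEnvelope(max_actions_per_window=2, window_seconds=1.0, energy_cap=2.0),
}
composes = caps_imply_swarm_budget(envelopes, swarm_budget=5.0)
decisions = [
    envelopes["a"].permit(now=0.0, energy_cost=1.0),
    envelopes["a"].permit(now=0.1, energy_cost=1.0),
    envelopes["a"].permit(now=0.2, energy_cost=0.5),   # third action inside the window
    envelopes["a"].permit(now=1.5, energy_cost=2.0),   # would exceed the energy cap
    envelopes["a"].permit(now=1.5, energy_cost=1.0),
]
```

The monitor never inspects the policy; it only sees proposed actions and their costs. That is what lets the check hold for an adversarial member as long as the monitor itself sits outside the agent's control, which this sketch assumes rather than establishes.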

Long-horizon credit assignment

When a swarm of agents executes a multi-day plan and fails, attributing the failure to a specific decision by a specific agent is hard. The single-agent credit-assignment problem is already open; the multi-agent version is harder. Counterfactual reasoning — what would have happened had this agent done otherwise — is the natural framing, but counterfactual rollouts are expensive and the counterfactual environment is itself non-stationary. We work on structured logging that makes post-hoc analysis tractable, on training-time techniques that produce policies amenable to attribution, and on the question of how much of the credit-assignment burden can be shifted from training to deployment by appropriate logging discipline. The connection to the audit-log requirements of Project Aegis is direct.

The discipline points include explicit-decision-logging (every agent’s decision is logged with its reasoning, its inputs, and its expected outputs), counterfactual-evaluation infrastructure (the deployment environment supports replaying decisions with alternative agent choices to estimate counterfactual outcomes), and attribution-aware training (training procedures that produce policies whose decisions are interpretable in terms of features the operator can reason about). None of these is solved at the multi-agent scale; the program treats them as load-bearing research investments.
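As an illustration of the first two discipline points, the sketch below logs each decision with its inputs, action, expected output, and rationale, then replays the log with one decision swapped. The environment (a shared counter capped at 10), the `Decision` record, and the target value are all hypothetical. The replay also holds every other agent's action fixed, which is a strong simplifying assumption: real policies would react to the changed state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Decision:
    step: int
    agent: str
    observed: int      # input the agent saw (here: the shared counter)
    action: int        # chosen action (here: an increment)
    expected: int      # what the agent expected the counter to become
    rationale: str


def transition(state: int, action: int) -> int:
    """Hypothetical deterministic environment: a shared counter capped at 10."""
    return min(state + action, 10)


def replay(log: list, initial: int = 0) -> int:
    """Re-run the logged actions from the initial state and return the outcome."""
    state = initial
    for d in log:
        state = transition(state, d.action)
    return state


def counterfactual_gain(log: list, step: int, alt_action: int, target: int) -> int:
    """How much closer to `target` the outcome gets if one decision is swapped."""
    altered = [replace(d, action=alt_action) if d.step == step else d for d in log]
    actual_gap = abs(target - replay(log))
    altered_gap = abs(target - replay(altered))
    return actual_gap - altered_gap


log = [
    Decision(0, "agent-a", observed=0, action=2, expected=2, rationale="start fill"),
    Decision(1, "agent-b", observed=2, action=7, expected=9, rationale="bulk fill"),
    Decision(2, "agent-c", observed=9, action=3, expected=12, rationale="top up"),
]
gains = {d.step: counterfactual_gain(log, d.step, alt_action=1, target=6) for d in log}
most_responsible = max(gains, key=gains.get)
```

Here only swapping the step-1 decision moves the outcome toward the target, so the overshoot is attributed to agent-b. The log also shows agent-c expecting 12 where the environment delivers 10, a mismatch between expected and actual output that structured logging surfaces without any replay at all.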

Definitional bounds

Before moving to the open problems, four exclusions are worth being explicit about.

Autonomous agents do not mean unsupervised agents. The autonomous-agent work is on operator-overseen multi-agent populations, where the autonomy budget is explicitly allocated and the operator-in-the-loop oversight is a load-bearing design property. Fully-unsupervised multi-agent populations, deployed without operator authority over the population’s decisions, are a separate research surface, and the program does not directly fund them.

Autonomous agents do not mean swarm consciousness or hive-mind cognition. The program is on coordination protocols and population dynamics in learned-policy populations. The popular-science framings of swarm intelligence as a form of collective consciousness are not the program’s research substrate; the program treats the populations as composable systems whose properties are derivable from the protocol and the participants, rather than as emergent agents with morally-relevant collective interests.

Autonomous agents do not mean self-improvement is unbounded. The self-improvement work is bounded, with explicit capability-delta evaluation and explicit rollback. Unbounded recursive self-improvement is a speculative-and-concerning research surface that the safety literature has identified, and the autonomous-agents pillar treats unbounded self-improvement as out of scope rather than as a target.

Autonomous agents do not mean human authority is dispensable. The fourth question above — human-authority retention — is treated as a load-bearing design property, not as a phase that the population eventually grows out of. The protocol-design problem is to keep human authority meaningful as agent throughput rises, not to find an asymptote where human authority becomes unnecessary.

Open problems

These make up the research-program agenda. We name eight.

Stability under adversarial swarm members. Most multi-agent training assumes a cooperative or self-play population. Real deployments include adversarial actors. We do not have a clean characterization of which coordination protocols remain stable when one member is hostile, let alone a small fraction.
Task allocation under uncertainty. Optimal task allocation given known capabilities is a well-studied combinatorial-optimization problem. Allocation when agent capabilities are themselves estimates from finite observation is open, especially under non-stationarity.
Emergent miscoordination. Self-play populations sometimes produce stable equilibria that no participant prefers — coordination failures that arise from the dynamics rather than from any single agent. Detection and mitigation of such equilibria during training, before deployment, is open.
Human authority retention. Reviewer capacity does not scale with agent throughput, so as throughput rises the rate of decisions surfaced for human review must stay roughly constant for human authority to remain meaningful. The protocol design problem of which decisions to surface, in which order, with what context, is open.
Simulation-to-real transfer for swarms. Single-agent sim-to-real has an established literature. Swarm-level sim-to-real, where the relevant gap is in the dynamics of the population rather than in the dynamics of the environment, has much less.
Verifiable composition. Per-agent verification gives per-agent guarantees. The compositional question — what guarantees hold for the swarm as a whole, given the per-agent envelopes and the protocol — is the open one.
Long-horizon credit assignment in multi-agent settings. The single-agent credit-assignment problem is open; the multi-agent version is harder. Counterfactual evaluation infrastructure at deployment scale is the principal engineering investment.
Stable self-improvement at population scale. RoboCat⁴ and Voyager⁸ demonstrate self-improvement at single-agent scale; the population-scale extension, where multiple agents are simultaneously self-improving, is open and currently unsafe-by-default.
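The human-authority-retention item above can be made concrete with a toy review queue that surfaces a fixed number of decisions per review window regardless of how many the agents generate. This does not address the open problem: the risk scores are assumed to be given, and producing scores that deserve trust is exactly the unsolved part. The `ReviewQueue` class and its thresholds are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingDecision:
    neg_risk: float                        # heapq is a min-heap, so store -risk
    decision_id: str = field(compare=False)


class ReviewQueue:
    """Surfaces at most `budget` decisions per review window, highest risk first."""

    def __init__(self, budget: int, auto_approve_below: float):
        self.budget = budget
        self.auto_approve_below = auto_approve_below
        self._heap = []
        self.auto_approved = []

    def submit(self, decision_id: str, risk: float) -> None:
        if risk < self.auto_approve_below:
            self.auto_approved.append(decision_id)    # logged, not surfaced
        else:
            heapq.heappush(self._heap, PendingDecision(-risk, decision_id))

    def next_window(self) -> list:
        """Pop up to `budget` decisions for human review; the rest stay blocked."""
        n = min(self.budget, len(self._heap))
        return [heapq.heappop(self._heap).decision_id for _ in range(n)]

    def backlog(self) -> int:
        return len(self._heap)


queue = ReviewQueue(budget=2, auto_approve_below=0.2)
for decision_id, risk in [("d1", 0.1), ("d2", 0.9), ("d3", 0.5), ("d4", 0.7), ("d5", 0.3)]:
    queue.submit(decision_id, risk)
first = queue.next_window()
second = queue.next_window()
```

One design choice worth noting: decisions that are not surfaced in a window remain blocked in the backlog rather than being approved on timeout, so rising throughput shows up as a growing backlog instead of silently eroding oversight.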

Three risk scenarios

Scenario A — Cascading miscoordination

The first failure mode is the cascading-miscoordination scenario. A coordination protocol stable under in-distribution conditions becomes unstable under a low-probability event (a partition, a failure of a critical participant, a non-stationary environmental shift), and the resulting cascading failure is worse than what any single-agent failure mode would have produced. The classical distributed-systems literature has many examples of this — the “thundering herd,” cascading congestion-control failures, GFS failover storms — and the multi-agent learned-policy version has the additional problem that the failure modes are not, in general, characterizable from the policy specifications. The mitigation is conservative protocol design with explicit-failure-mode characterization, rate-limiting and load-shedding patterns borrowed from distributed-systems engineering, and runtime-monitoring of population-level invariants.
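The rate-limiting and load-shedding patterns mentioned above are standard distributed-systems tools. A minimal token-bucket sketch follows; the parameters and request identifiers are illustrative, and timestamps are passed in explicitly so the example is reproducible (a deployment would read a monotonic clock instead).

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Rate limiter: `rate` tokens per second, bursts up to `capacity`."""
    rate: float
    capacity: float
    tokens: float
    last: float = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False          # shed the request instead of queueing it


def dispatch(requests: list, bucket: TokenBucket) -> tuple:
    """Split (timestamp, request_id) pairs into served and shed lists."""
    served, shed = [], []
    for now, request_id in requests:
        (served if bucket.allow(now) else shed).append(request_id)
    return served, shed


bucket = TokenBucket(rate=1.0, capacity=2.0, tokens=2.0)
burst = [(0.0, "r1"), (0.0, "r2"), (0.0, "r3"), (0.5, "r4"), (1.0, "r5")]
served, shed = dispatch(burst, bucket)
```

Shedding rather than queueing is the point: under a burst, excess requests are dropped at the edge, so a retry storm from many agents cannot accumulate into the kind of backlog that drives a cascade.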

Scenario B — Self-improvement runaway

The second failure mode is the self-improvement-runaway scenario. A self-improving population produces capability changes faster than the evaluation infrastructure can characterize them, the deployment context relies on out-of-date capability bounds, and the resulting deployment is operating outside the envelope its designers intended. The mitigation is bounded self-modification with explicit capability-delta evaluation, rollback infrastructure, and the discipline that the self-improvement loop is not the operator’s default deployment condition.

Scenario C — Successful staged multi-agent deployment

The third scenario, which we treat as the base case if the engineering work is competent, is staged multi-agent deployment in which the per-agent envelopes are verified, the protocol-level invariants are runtime-checked, the population-level capability is bounded by explicit budget, and the operator-in-the-loop oversight is preserved by design. This is the trajectory the program is aiming at; it is multi-decade rather than near-term.

What technical work bears on this

The autonomous-agents work is coupled to the rest of the research program. This pillar is the multi-agent and emergent-behavior counterpart to Agentic Systems, which focuses on individual digital-workflow agents. It connects to Humanoid Robotics wherever the swarm is embodied: a fleet of mobile manipulators is an autonomous-agent system whose protocols must respect physical constraints. Project Aegis provides the per-agent verification substrate that compositional swarm guarantees are built on. The systems-engineering view is documented in Agentic Systems (engineering). The Economic Orchestration pillar treats the supply-chain and resource-allocation context in which multi-agent populations operate at scale.

Where to read further

Agentic Systems treats the single-agent counterpart. AI Safety treats the verification framework and the deceptive-alignment threat model that multi-agent populations make acute. Project Aegis treats the per-agent verification substrate. Physical Intelligence and Humanoid Robotics treat the embodied multi-agent extensions.

¹ Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch (OpenAI), “Emergent Tool Use From Multi-Agent Autocurricula”, 2019; arXiv:1909.07528. The hide-and-seek emergence result.
² Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, et al. (DeepMind), “Grandmaster level in StarCraft II using multi-agent reinforcement learning”, Nature 575 (2019): 350–354.
³ SIMA Team (DeepMind), “Scaling Instructable Agents Across Many Simulated Worlds”, arXiv 2024.
⁴ Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, et al. (DeepMind), “RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation”, arXiv 2023.
⁵ Yaodong Yang and Jun Wang, “An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective”, arXiv 2020.
⁶ Roni Stern, Nathan Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne Walker, et al., “Multi-Agent Pathfinding: Definitions, Variants, and Benchmarks”, arXiv 2019 (SoCS 2019).
⁷ Miguel Castro and Barbara Liskov, “Practical Byzantine Fault Tolerance”, OSDI 1999.
⁸ Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar, “Voyager: An Open-Ended Embodied Agent with Large Language Models”, arXiv 2023.
