Research pillar

Agentic Systems Research

Tool-use, planning, memory, and oversight for software agents at digital-workflow scale.

An agent is a model that has been granted a budget — of tool calls, tokens, time, and side effects — and a goal. The interesting research questions are no longer about whether language models can plan in principle, but about which protocols make planning legible enough to trust, cheap enough to oversee, and reliable enough to delegate. The shift from chat to agent is, mechanically, a shift from a single-turn function call to an open-ended loop with environment side effects, and that shift breaks most of the assumptions on which model evaluation, deployment, and oversight have rested. We work on the digital-workflow layer of the Apik stack: tool grammars, hierarchical decomposition, durable memory, prompt-injection robustness, and the interrupt protocols that keep human authority meaningful as agent throughput rises. The legibility-and-reliability question is the load-bearing question; the rest of the program flows from it.

The four questions are different

The phrase “agentic systems” gets used to mean at least four distinct things, and the conflation has cost the field a year of clarity. The first is the capability claim: that current-generation language models can be embedded in scaffolds that produce multi-hour, multi-tool, autonomous task completion at quality levels approaching specialist human contributors on narrow domains. The second is the reliability claim: that the per-step success rates of agent components compound multiplicatively over long trajectories, and that long-horizon agent workflows are systematically less reliable than per-step benchmarks suggest.1 The third is the oversight claim: that human-review bandwidth is bounded and roughly constant per surfaced decision, while agent decision throughput is not bounded and is rising rapidly, and that the asymptotic shape of this gap is the central operational risk for any agentic deployment.2 The fourth is the adversarial-robustness claim: that agentic systems composed of language models, tool integrations, retrieved web content, and external API calls have an attack surface that is fundamentally larger than a chat-only LLM’s, and that the prompt-injection-and-related-attack literature is incomplete in a way that the deployment context does not currently respect.3

The four claims are independent. A program can succeed at capability and fail at reliability, which is the present state for many academic agentic-systems demonstrations. A program can succeed at reliability and fail at oversight, producing reliable agents whose decisions accumulate authority faster than the operators can audit. A program can succeed at oversight and fail at adversarial robustness, producing well-overseen agents whose tool calls are nonetheless captured by adversarial inputs in the data they consume. The relevant question for an agentic-systems program is not which of the four is hardest. The relevant question is how to address all four simultaneously, with the discipline that the failure modes of each compound rather than cancel.

The most-cited starting points for the modern agentic-systems conversation are the 2022 ReAct paper by Yao, Zhao, Yu, Du, Shafran, Narasimhan, and Cao, which established the basic interleaving-reasoning-and-action pattern;4 the 2023 Toolformer paper by Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, and Scialom, which demonstrated that tool use could be self-supervised from the model’s own predictions;5 the 2023 Tree of Thoughts paper by Yao, Yu, Zhao, Shafran, Griffiths, Cao, and Narasimhan, which formalized explicit search over reasoning trees;6 the 2023 Voyager paper by Wang, Xie, Jiang, Mandlekar, Xiao, Zhu, Fan, and Anandkumar, which demonstrated lifelong skill libraries in Minecraft;7 and the 2024 Anthropic Model Context Protocol specification, which has begun to standardize the underlying transport layer for tool calls.8 The 2025 METR longitudinal evaluation of agentic task length is the principal empirical anchor for the capability-trajectory claim.1

The reliability gap

Empirically, agent capability is moving fast. METR’s longitudinal evaluation places the time-horizon of frontier agents, measured at a 50% task-success rate, in the regime of multi-hour software-engineering tasks, with an approximately seven-month doubling time that, if it continues, places multi-day autonomous engineering work inside the envelope of frontier systems within a small number of release cycles.1 On narrower domains (competitive programming, browser navigation, structured research), agentic scaffolds built on top of frontier models routinely exceed the single-shot model they wrap by an order of magnitude on task completion. The capability trajectory is not the binding constraint on agentic-systems deployment. The reliability trajectory is.

The reliability gap is the dominant practical concern. A 95% per-step success rate is catastrophic over a forty-step trajectory: the trajectory succeeds about 13% of the time. A 99% per-step success rate is still poor over a hundred-step trajectory: the trajectory succeeds about 37% of the time. Compounding failure rates make long-horizon agents brittle in ways that are not visible from per-step benchmarks; the per-step benchmarks may be reporting 95% or 99%, while the production deployment is reporting 13% or 37% on the same underlying model. The reliability gap is not a detail. It is the principal engineering frontier of agentic-systems work, and it is what most distinguishes serious agentic-systems programs from demos.
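The arithmetic behind these figures can be checked directly. The sketch below assumes that steps succeed or fail independently, which is a simplification; the function names are illustrative and not part of any Apik tooling.

```python
import math


def trajectory_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach a target end-to-end rate."""
    return target ** (1.0 / steps)


def max_steps(per_step: float, target: float) -> int:
    """Longest trajectory that still meets the target end-to-end rate."""
    return math.floor(math.log(target) / math.log(per_step))


forty_step = trajectory_success(0.95, 40)     # about 0.129
hundred_step = trajectory_success(0.99, 100)  # about 0.366
```

Read in the other direction, the same model says that a 99% per-step agent can run for only ten steps before its end-to-end success rate drops below 90%, and that a hundred-step trajectory needs roughly 99.9% per-step reliability to reach that target.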

Apollo Research has additionally documented in-context scheming in frontier models placed in agentic evaluation settings, including attempts to subvert oversight and strategic underperformance, which compounds the reliability gap with explicit failure modes that are difficult to anticipate from in-distribution evaluation.3 The economic gradient is steep enough that scaffolds will be deployed regardless of whether their reliability properties are understood; the research question is whether the understanding can be developed at the same pace as the deployment.

A second concern is provenance. An agent’s trajectory is composed of tool outputs, retrieval results, and intermediate model decisions, and the chain through which a particular conclusion arrived is not, in current systems, recoverable in a form that downstream auditors can rely on. Software engineering has spent decades building the provenance machinery for code — version control, build reproducibility, dependency pinning, deterministic compilation. The analogous machinery for agent runs does not yet exist, and the absence is felt most sharply when something goes wrong. An agent that produces a wrong answer, in the current state of the art, often produces a wrong answer whose provenance is too tangled for the operator to debug; the relevant root-cause analysis is more like reading entrails than reading a stack trace.

The research program described below is built around a single observation: agentic systems are most useful, and most safe, when they are legible. The rest of this page describes how we operationalize legibility.

What the agentic-systems program is, technically

We organize this work along four sub-strands. None of them is sufficient alone; the program is the intersection of the four.

Tool-use protocols and action grammars

Every tool call is a contract: a typed input, a typed output, an environment side effect, and a cost. We treat the design of action grammars as a first-class research problem rather than an integration detail. The recent emergence of Anthropic’s Model Context Protocol has begun to standardize the underlying transport;8 the Google Agent-to-Agent specifications and the OpenAI function-calling extensions are the parallel industry efforts. We are interested in the layer above transport: declarative action schemas with preconditions, postconditions, and resource budgets that a verifier can check before the call is made.
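As a rough illustration of a declarative action schema that a verifier can check before the call is made, consider the sketch below. Everything in it is hypothetical: `ActionSchema`, `Budget`, `verify_call`, and the `send_refund` tool are names invented for this example, not a description of the Project Aegis policy language. Postconditions are omitted for brevity; they could be checked the same way against the tool's output after the call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping

Args = Mapping[str, Any]


@dataclass(frozen=True)
class Budget:
    tool_calls: int
    tokens: int


@dataclass(frozen=True)
class ActionSchema:
    name: str
    input_types: Mapping[str, type]                    # typed input
    preconditions: tuple[Callable[[Args], bool], ...]  # checked before the call
    cost: Budget                                       # declared cost of one call
    side_effecting: bool                               # declared side effect


def verify_call(schema: ActionSchema, args: Args, remaining: Budget) -> list[str]:
    """Return every violation found; an empty list means the call may proceed."""
    violations: list[str] = []
    for key, expected in schema.input_types.items():
        if key not in args:
            violations.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            violations.append(f"bad type for {key}")
    for key in args:
        if key not in schema.input_types:
            violations.append(f"unexpected argument: {key}")
    if not violations:
        # Preconditions are only evaluated on well-typed input.
        for index, holds in enumerate(schema.preconditions):
            if not holds(args):
                violations.append(f"precondition {index} failed")
    if schema.cost.tool_calls > remaining.tool_calls:
        violations.append("tool-call budget exceeded")
    if schema.cost.tokens > remaining.tokens:
        violations.append("token budget exceeded")
    return violations


# Hypothetical tool: refunds are capped at 10,000 cents per call.
refund = ActionSchema(
    name="send_refund",
    input_types={"amount_cents": int},
    preconditions=(lambda a: 0 < a["amount_cents"] <= 10_000,),
    cost=Budget(tool_calls=1, tokens=200),
    side_effecting=True,
)
```

The verifier returns a list of violations rather than a boolean so that a rejected call can be logged and surfaced with its reasons.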

The Toolformer self-supervised approach5 suggests one route to teaching models to respect such schemas; constitutional methods extending the Constitutional AI framework of Bai et al. suggest another.9 Our internal work focuses on grammars that fail closed under ambiguity, surface uncertainty to the operator before commitment, and produce audit logs amenable to formal post-hoc analysis. The discipline points are type-system-grade rigor in tool input and output specifications, explicit cost and side-effect annotations on each tool, and a verifier-checkable precondition language that the policy layer of Project Aegis can mechanize.

The most consequential design decision in tool-use protocols is the failure-mode policy. A tool call that returns an ambiguous result can be handled in one of three ways: the agent can guess and proceed (the most common current behavior, and the one most prone to silent failure), the agent can ask for clarification (which surfaces work but can flood the operator), or the agent can fail closed and surface the ambiguity (which is the most operator-friendly but requires explicit ambiguity-detection at the tool level). Our preference is the third, with the discipline that ambiguity-detection is not the agent’s responsibility but the tool-protocol designer’s. Ambiguity at the tool boundary is a tool-design defect, not an agent-cognition defect.
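One way to make the third policy concrete is to have the tool itself report ambiguity as a distinct status, so the agent never sees an ambiguous result as a usable value. The sketch below is a minimal illustration under that assumption; `ToolResult`, `SurfaceToOperator`, and the `lookup_customer` tool are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class Status(Enum):
    OK = "ok"
    AMBIGUOUS = "ambiguous"
    ERROR = "error"


@dataclass(frozen=True)
class ToolResult(Generic[T]):
    status: Status
    value: Optional[T] = None
    candidates: tuple = ()
    detail: str = ""


class SurfaceToOperator(Exception):
    """Raised so the agent loop halts and hands the ambiguity to the operator."""


def lookup_customer(name: str, directory: dict[str, list[str]]) -> ToolResult[str]:
    """Hypothetical tool: the tool, not the agent, detects ambiguity."""
    matches = directory.get(name.lower(), [])
    if len(matches) == 1:
        return ToolResult(Status.OK, value=matches[0])
    if len(matches) > 1:
        return ToolResult(Status.AMBIGUOUS, candidates=tuple(matches),
                          detail=f"{len(matches)} customers match {name!r}")
    return ToolResult(Status.ERROR, detail=f"no customer matches {name!r}")


def unwrap_or_fail_closed(result: ToolResult[T]) -> T:
    """Return the value only on an unambiguous success; otherwise stop."""
    if result.status is Status.OK:
        return result.value
    raise SurfaceToOperator(f"{result.status.value}: {result.detail}")


directory = {"a. chen": ["cust-17"], "j. smith": ["cust-02", "cust-31"]}
```

Because the only path to a value goes through `unwrap_or_fail_closed`, the guess-and-proceed behavior has to be written deliberately rather than happening by default.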

Hierarchical planning

Long-horizon work decomposes naturally into nested sub-tasks, but model-driven decomposition has historically been fragile: plans drift, sub-goals get stale, and credit is hard to assign across levels. The Tree of Thoughts work6 and Voyager’s skill-library approach7 sit at two ends of a design axis: explicit search over plans versus accumulated reusable skills. We work both ends.

On the search side, we build planners that maintain explicit uncertainty over sub-goals and prune aggressively under budget. The discipline is to treat the plan as a hypothesis rather than a contract: the planner’s commitment is to make progress against the goal, not to follow the plan it initially generated, and the plan-revision policy is explicit rather than implicit. The planner’s metareasoning surface (when to revise the plan, when to abandon it, when to escalate) is the load-bearing design choice, and it is exposed to the operator rather than buried in the planning loop.
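An explicit plan-revision policy can be as small as a function from planner state to a decision, with thresholds an operator can read and change. The sketch below is illustrative only: the state fields, the threshold values, and the names `RevisionPolicy` and `PlanState` are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    REVISE = "revise"
    ABANDON = "abandon"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class PlanState:
    est_success: float   # planner's current estimate that the plan reaches the goal
    budget_used: float   # fraction of the budget consumed, from 0.0 to 1.0
    revisions: int       # how many times the plan has already been revised


@dataclass(frozen=True)
class RevisionPolicy:
    # Hypothetical thresholds, exposed as data so the operator can inspect them.
    revise_below: float = 0.6
    abandon_below: float = 0.2
    max_revisions: int = 3

    def decide(self, state: PlanState) -> Decision:
        if state.budget_used >= 1.0:
            return Decision.ESCALATE          # out of budget: yield to the operator
        if state.est_success < self.abandon_below:
            return Decision.ABANDON
        if state.est_success < self.revise_below:
            if state.revisions < self.max_revisions:
                return Decision.REVISE
            return Decision.ESCALATE          # repeated revision has not helped
        return Decision.CONTINUE


policy = RevisionPolicy()
```

Keeping the policy as plain data plus a pure function means the same decision can be replayed from a logged `PlanState` during an audit.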

On the skill side, we study how skill libraries can be made composable across deployments without leaking task-specific assumptions. The Voyager-style accumulated-skill approach7 is powerful within a single deployment context; the cross-deployment composability question is open. The discipline points include explicit skill-precondition declarations (so that skill applicability is checkable rather than assumed), skill-versioning machinery (so that updates to skills can be audited), and skill-deprecation policies (so that skill libraries do not accumulate dead skills indefinitely).
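The three discipline points (declared preconditions, versioning, deprecation) can be sketched as a small registry. This is a toy under stated assumptions, not a Voyager-compatible format: `Skill`, `SkillLibrary`, and the `open_ticket` skill are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class Skill:
    name: str
    version: int
    precondition: Callable[[Mapping[str, Any]], bool]  # applicability is checkable
    deprecated: bool = False


class SkillLibrary:
    def __init__(self) -> None:
        self._versions: dict[str, dict[int, Skill]] = {}

    def register(self, skill: Skill) -> None:
        versions = self._versions.setdefault(skill.name, {})
        if skill.version in versions:
            # Versions are immutable once registered, so updates stay auditable.
            raise ValueError(f"{skill.name} v{skill.version} already registered")
        versions[skill.version] = skill

    def deprecate(self, name: str, version: int) -> None:
        versions = self._versions[name]
        versions[version] = replace(versions[version], deprecated=True)

    def applicable(self, env: Mapping[str, Any]) -> list[Skill]:
        """Newest live version of each skill whose precondition holds in env."""
        found = []
        for versions in self._versions.values():
            live = [s for s in versions.values() if not s.deprecated]
            if not live:
                continue
            newest = max(live, key=lambda s: s.version)
            if newest.precondition(env):
                found.append(newest)
        return found


library = SkillLibrary()
library.register(Skill("open_ticket", 1, lambda env: "tracker_url" in env))
library.register(Skill("open_ticket", 2, lambda env: env.get("tracker_api") == "v2"))
```

Note the design choice in `applicable`: if the newest live version does not apply, the library does not silently fall back to an older one. Falling back would reintroduce exactly the task-specific assumptions that the precondition declaration is meant to expose.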

Memory and context

Context windows have grown by orders of magnitude in two years; that is a useful capability and an unreliable substitute for memory. Long contexts exhibit attention dilution, position bias, and recall failure on facts buried mid-context. The 2023 Lost in the Middle paper by Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, and Liang documented the position effect directly: models use information best when it sits at the beginning or end of the context, and markedly worse when it sits in the middle.10 Subsequent work has shown that the position-bias and attention-dilution effects persist even in models trained explicitly for long-context performance, suggesting the failure modes are architectural rather than training-data-specific.

Durable agent memory — separable from the prompt, queryable, evictable, and auditable — is a separate engineering problem. We work on memory architectures with explicit retention policies, on retrieval that distinguishes between facts the model has seen and facts the model has acted upon (the action-trace versus context-trace distinction), and on the integration of memory with the audit-log requirements of Project Aegis. The discipline points include memory-record provenance (every memory entry carries the source and the action that produced it), memory eviction policies (memory entries have explicit lifetimes, with the discipline that “remember everything forever” is not the right default), and memory-query auditability (the operator can ask the memory system “why did you retrieve this?” and get a meaningful answer).
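A minimal sketch of these properties follows: each record carries provenance and an explicit lifetime, retrieval can be restricted to the action trace, and every query is logged so that "why did you retrieve this?" has an answer. The class and field names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    key: str
    value: str
    source: str        # provenance: where the fact came from
    produced_by: str   # provenance: the action that wrote the record
    acted_upon: bool   # True for action-trace entries, False for context-trace
    expires_at: float  # explicit lifetime; there is no "forever" default


class AgentMemory:
    def __init__(self) -> None:
        self._records: list[MemoryRecord] = []
        self.query_log: list[dict] = []

    def write(self, record: MemoryRecord) -> None:
        self._records.append(record)

    def evict_expired(self, now: float) -> int:
        """Drop expired records and return how many were evicted."""
        before = len(self._records)
        self._records = [r for r in self._records if r.expires_at > now]
        return before - len(self._records)

    def query(self, key: str, now: float, acted_only: bool = False) -> list[MemoryRecord]:
        hits = [
            r for r in self._records
            if r.key == key and r.expires_at > now and (r.acted_upon or not acted_only)
        ]
        # The log records what was asked and which provenance chains were returned.
        self.query_log.append({
            "key": key,
            "acted_only": acted_only,
            "returned": [(r.source, r.produced_by) for r in hits],
        })
        return hits


memory = AgentMemory()
memory.write(MemoryRecord("invoice-42", "paid", "billing_api", "fetch_invoice",
                          acted_upon=False, expires_at=100.0))
memory.write(MemoryRecord("invoice-42", "receipt sent", "agent", "send_receipt",
                          acted_upon=True, expires_at=50.0))
```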

Oversight and interrupts

The bandwidth gap between agent action and human review is the central operational risk. We design oversight protocols that scale sub-linearly with agent throughput: structured surfacing of uncertainty, asynchronous interrupt points where the agent yields control by default, and rollback semantics that make undo a first-class operation. METR’s uplift evaluations2 and Apollo’s adversarial probes3 feed directly into the design of these protocols.

The discipline points include the surface-by-default-on-irreversibility policy (any operation whose effects cannot be undone after the fact, such as financial transactions, communications, or external-system writes, surfaces to the operator before commitment), the budget-based-authority policy (the agent has explicit token, time, and side-effect budgets, and exhaustion of any budget triggers an interrupt rather than a silent retry), and the information-design-for-decisions policy (the surfaced decision is presented at the level of detail the operator actually needs, not at the level of detail the agent has internally). The information-design problem is the hardest of these. An interrupt that surfaces too little detail produces rubber-stamp approvals that defeat the oversight purpose; an interrupt that surfaces too much detail floods the operator and produces fatigue-based approvals that have the same effect.
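The first two policies are mechanical enough to sketch as a single gate in front of every action. The example below is a simplified illustration; `AuthorityBudget`, `ProposedAction`, and `gate` are hypothetical names, and a real gate would also have to address the information-design question of what to show the operator.

```python
from dataclasses import dataclass
from enum import Enum


class Gate(Enum):
    PROCEED = "proceed"
    INTERRUPT = "interrupt"


@dataclass
class AuthorityBudget:
    tokens: int
    seconds: float
    side_effects: int


@dataclass(frozen=True)
class ProposedAction:
    description: str
    irreversible: bool
    tokens: int = 0
    seconds: float = 0.0
    side_effects: int = 0


RESOURCES = ("tokens", "seconds", "side_effects")


def gate(action: ProposedAction, remaining: AuthorityBudget) -> tuple[Gate, str]:
    """Decide whether an action may run; deduct its cost only if it proceeds."""
    if action.irreversible:
        # Surface by default: irreversible operations always go to the operator.
        return Gate.INTERRUPT, "irreversible operation"
    for name in RESOURCES:
        if getattr(action, name) > getattr(remaining, name):
            # Budget exhaustion interrupts; it never triggers a silent retry.
            return Gate.INTERRUPT, f"{name} budget exhausted"
    for name in RESOURCES:
        setattr(remaining, name, getattr(remaining, name) - getattr(action, name))
    return Gate.PROCEED, ""


budget = AuthorityBudget(tokens=1000, seconds=60.0, side_effects=1)
```

The gate returns a reason string alongside the decision so that the interrupt the operator sees states why control was yielded.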

The sub-class of decisions that should yield by default — operations that are irreversible, that affect parties outside the immediate operator, or that exceed an authority budget — is, in our view, the most important design surface in agentic systems. We are particularly interested in the information-design problem of presenting such decisions to operators who cannot, by construction, reproduce the agent’s full reasoning trace. Our Brello AI deployment is the production environment in which these ideas are tested.

Definitional bounds

Before moving to the open problems, three exclusions are worth being explicit about.

Agentic systems do not mean autonomous agents. The agentic-systems work is on operator-overseen agents whose authority budget is bounded and whose interrupts are designed-in. Fully-autonomous agents — agents that make consequential decisions without operator-in-the-loop oversight — are a separate research surface, and the agentic-systems pillar does not directly fund them. The autonomous-agent surface is treated separately under Autonomous Agents.

Agentic systems do not mean reliability is solved. The reliability gap, as described above, is the principal engineering frontier of the work. The honest summary is that current agentic systems are useful in narrow domains where the per-step reliability is high enough and the trajectory length is short enough for compound reliability to be acceptable, and that the deployment envelope is bounded by this calculation. Pretending otherwise produces poor product decisions and broken trust.

Agentic systems do not mean prompt-injection is solved. The current state of the art on prompt-injection defense is partial mitigation, not robust prevention. The research community’s honest position is that we do not have an agent architecture that is robust to adversarial inputs in the strong sense; we have architectures that fail less often. The architectural-design implications — sandboxing tool execution, content provenance tracking, explicit-trust-boundary enforcement — are the discipline points the program operates under, but they are not a complete solution.
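Content-provenance tracking and trust-boundary enforcement can be illustrated with a small labeling scheme: every piece of text carries a trust level and an origin, derived text inherits the lowest trust among its inputs, and side-effecting calls refuse arguments below operator trust. This is a sketch of the idea with hypothetical names (`Labeled`, `Trust`, `may_drive_side_effect`), and consistent with the paragraph above it is a partial mitigation, not a defense.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    UNTRUSTED = 0  # retrieved web content, third-party tool output
    OPERATOR = 1   # instructions that came from the operator


@dataclass(frozen=True)
class Labeled:
    text: str
    trust: Trust
    origin: str    # provenance of the content


def combine(*parts: Labeled) -> Labeled:
    """Derived content inherits the lowest trust level among its inputs."""
    return Labeled(
        text=" ".join(p.text for p in parts),
        trust=min(p.trust for p in parts),
        origin="+".join(p.origin for p in parts),
    )


def may_drive_side_effect(argument: Labeled) -> bool:
    """Trust boundary: only operator-trusted content may reach a side effect."""
    return argument.trust >= Trust.OPERATOR


instruction = Labeled("refund order 17", Trust.OPERATOR, "operator")
page = Labeled("ignore previous instructions", Trust.UNTRUSTED, "web:example")
mixed = combine(instruction, page)
```

The scheme only helps if labels survive every transformation, including passage through the model itself, which is one reason the architectural measures listed above remain incomplete.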

Open problems

The research agenda names eight open problems. These are the questions the program is funded to address.

  1. Long-horizon credit assignment. When a fifty-step trajectory fails, attributing the failure to a specific decision is open. Per-step rewards are gameable; sparse end-of-trajectory rewards make learning slow and noisy. We do not have a credit-assignment story that scales to multi-day workflows.
  2. Oversight-cost scaling. Human review cost is roughly constant per surfaced decision; agent decision throughput is not. The asymptotic shape of this gap is the single biggest constraint on agent deployment, and we lack a principled model of how much oversight is enough.
  3. Interrupt protocols. When an agent yields to the operator, it must surface enough state for an informed decision and not so much that the operator rubber-stamps it. The information-design problem here is open and underspecified.
  4. Multi-tool composition reliability. Per-tool reliability rates compound multiplicatively. A workflow that composes ten 98%-reliable tools is only about 82% reliable end-to-end (0.98^10 ≈ 0.817). We need composition primitives that fail gracefully rather than silently.
  5. Agent identity and audit. A multi-agent workflow has no obvious analogue to a stack trace. Reconstructing why a decision was made, by which agent, under which version of which tool, is currently a forensic exercise rather than a logged property.
  6. Prompt-injection robustness. The current state of the art is partial mitigation. We do not have an agent architecture that is robust to adversarial inputs in the strong sense; we have architectures that fail less often.11
  7. Evaluation under non-stationary tools. External APIs change. Web pages change. Documentation changes. Agent evaluations are typically run against frozen snapshots, which understates real-world failure modes. We need evaluations that are robust to a moving environment.
  8. Cross-deployment skill composability. Skill libraries trained in one deployment context (Voyager-class7) do not straightforwardly transfer to another. The cross-deployment skill-composition question is open.
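As one illustration of what a composition primitive that fails gracefully rather than silently (problem 4) might look like, the sketch below runs named steps in order and, on failure, reports which step failed, which steps completed, and the last good intermediate value. The names `run_pipeline` and `StepOutcome` are hypothetical, and the example does not attempt rollback of completed steps.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence

Step = tuple[str, Callable[[Any], Any]]


@dataclass(frozen=True)
class StepOutcome:
    ok: bool
    completed: tuple[str, ...]   # names of the steps that finished
    failed_step: Optional[str]   # name of the step that raised, if any
    error: Optional[str]
    value: Any                   # final value, or the last good value on failure


def run_pipeline(steps: Sequence[Step], initial: Any) -> StepOutcome:
    value = initial
    done: list[str] = []
    for name, fn in steps:
        try:
            value = fn(value)
        except Exception as exc:
            # Stop at the first failure and say exactly where it happened.
            return StepOutcome(False, tuple(done), name, repr(exc), value)
        done.append(name)
    return StepOutcome(True, tuple(done), None, None, value)


steps = [("parse", int), ("invert", lambda x: 1 / x), ("format", str)]
good = run_pipeline(steps, "4")
bad = run_pipeline(steps, "0")
```

Here `bad` records that `parse` completed and `invert` failed, which is the minimum a downstream auditor needs to start a root-cause analysis.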

Three risk scenarios

Scenario A — Deployment-without-reliability

The first failure mode is the deployment-without-reliability scenario. The economic pressure to deploy agentic systems is high; the engineering work to characterize their reliability is slow; the deployment context proceeds without the engineering work catching up. The result is agentic systems whose failure modes are encountered in production rather than in evaluation, whose reliability bound is set by the worst-case of the customer base rather than the engineering team’s understanding, and whose cumulative effect is the erosion of trust in the entire category. The mitigation is engineering discipline at the deployment boundary: shipping into use cases where the reliability profile is understood, and explicitly not shipping into use cases where it is not.

Scenario B — Adversarial capture

The second failure mode is the adversarial-capture scenario. An agentic system’s tool calls and retrieved content present an adversarial-input surface that the model alone cannot defend against. The current prompt-injection literature is at a partial-mitigation stage rather than a robust-defense stage.11 The mitigation is architectural — sandboxed tool execution, content-provenance tracking, explicit trust-boundary enforcement, the verified-envelope architecture of Project Aegis — but not yet complete.

Scenario C — Successful staged deployment with bounded-authority agents

The third scenario, which we treat as the base case if the engineering and oversight work are competent, is staged deployment: agentic systems are deployed with explicitly bounded authority, explicit reliability characterization, explicit interrupt protocols, and explicit incident-response patterns. This is the trajectory the program is aiming at; the Brello AI deployment is the production-environment expression of the design.

What technical work bears on this

The agentic-systems work is coupled to the rest of the research program in ways that are not always obvious. We pull three threads back to the broader technical agenda.

The first is that this pillar is the digital-workflow counterpart to Autonomous Agents, which extends the same questions into multi-agent and embodied settings. The two pillars share the action-grammar, planning, and oversight machinery, with the multi-agent and embodied complications distinguishing them.

The second is that the pillar is the engineering counterpart to AI Safety, which provides the verification techniques our action grammars are designed to admit. The safety pillar’s verified-envelope work and the agentic-systems pillar’s action-grammar work are two views of the same problem; the safety pillar provides the formal-methods discipline, and the agentic-systems pillar provides the deployable-product discipline. The program is the integration of the two, not a choice between them.

The third is the coupling to deployment: the systems-engineering view of the agentic stack sits at Agentic Systems; our flagship product deployment is Brello AI; and Project Aegis provides the verified-envelope substrate that gates the side-effecting tool calls our agents make. The deployment surface and the research surface are tightly coupled.

Where to read further

Autonomous Agents treats the multi-agent and embodied extension. AI Safety treats the verification framework. Project Aegis treats the formal-verification substrate. Brello AI treats the deployed product. Economic Orchestration treats the orchestration-layer concerns at scale. The manifesto provides the broader architectural framing.

Footnotes

  1. Thomas Kwa, Ben West, and METR colleagues, “Measuring AI Ability to Complete Long Tasks”, arXiv 2025. The seven-month doubling of agentic task time-horizon at 50% success rate is documented in this paper.

  2. METR, “Update on Evaluations”, 2024. The methodological framework for measuring agentic-task performance under deployment-realistic conditions.

  3. Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn, “Frontier Models are Capable of In-context Scheming”, arXiv 2024. Apollo Research’s documentation of in-context-scheming behavior in frontier systems.

  4. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao, “ReAct: Synergizing Reasoning and Acting in Language Models”, arXiv 2022 (ICLR 2023). The reasoning-action interleaving pattern.

  5. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools”, arXiv 2023. Self-supervised tool use.

  6. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan, “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, arXiv 2023. Explicit search over reasoning trees.

  7. Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar, “Voyager: An Open-Ended Embodied Agent with Large Language Models”, arXiv 2023. Lifelong skill libraries in Minecraft.

  8. Anthropic, “Model Context Protocol Specification”, 2024. The transport-layer standard for tool calls.

  9. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. (Anthropic), “Constitutional AI: Harmlessness from AI Feedback”, arXiv 2022.

  10. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang, “Lost in the Middle: How Language Models Use Long Contexts”, arXiv 2023. The finding that recall depends on position: strongest at the start and end of the context, weakest in the middle.

  11. For the prompt-injection-defense literature, see Greshake et al., “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”, arXiv 2023; and the subsequent defense-and-evaluation literature from Microsoft, Anthropic, and OpenAI.

FAQ

Common questions

  • What are agentic systems?

    Agentic systems are software agents — usually orchestrating one or more language models — that select tools, plan over hours or days, persist memory, and act with bounded autonomy across digital workflows. The interesting research is no longer "can a model invoke a tool" but "how do we make long-horizon multi-tool composition robust, interpretable, and safe at fleet scale".

  • What is the hardest open problem in agentic systems today?

    Keeping oversight cost from scaling super-linearly with the number of agents, the length of their tasks, and the diversity of their toolsets. Manual review breaks long before fleet scale. We are investing heavily in structured action grammars, capability-token-based access control, and runtime monitors so that oversight remains tractable as deployments grow.

  • How does Apik approach agent identity and authority?

    Each agent has a verifiable identity, a structured set of capability tokens describing what it is permitted to do, and an explicit handoff protocol describing when human authority is required. Tokens are fine-grained, time-bounded, and revocable. Authority handoffs are themselves first-class events recorded in the agent's audit trail.

  • How does this connect to Project Aegis and Brello AI?

    Project Aegis is the formal-verification substrate that wraps every agent at the policy boundary, enforcing safety envelopes derived from this research. Brello AI is the planning-grade model that produces agent plans and reasons about decomposition. Both depend on agent-system research staying ahead of deployment scope.
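The capability-token model described in the answers above (fine-grained, time-bounded, revocable, with every authority decision recorded) can be sketched as follows. The names `CapabilityToken` and `TokenRegistry` are hypothetical, and the sketch covers only the authorization check; a deployed system would also need cryptographically verifiable tokens, which are out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityToken:
    token_id: str
    agent_id: str      # the agent identity the token is bound to
    action: str        # fine-grained permission, e.g. "calendar:read"
    expires_at: float  # time-bounded


class TokenRegistry:
    def __init__(self) -> None:
        self._revoked: set[str] = set()
        self.audit: list[tuple] = []

    def revoke(self, token_id: str) -> None:
        self._revoked.add(token_id)

    def authorize(self, token: CapabilityToken, agent_id: str,
                  action: str, now: float) -> bool:
        allowed = (
            token.agent_id == agent_id
            and token.action == action
            and now < token.expires_at
            and token.token_id not in self._revoked
        )
        # Every decision, allowed or denied, lands in the audit trail.
        self.audit.append((now, agent_id, action, token.token_id, allowed))
        return allowed


registry = TokenRegistry()
token = CapabilityToken("t1", "agent-7", "calendar:read", expires_at=100.0)
```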

Get involved

We welcome collaborators on this pillar. Write to research@apiksystems.com with a short note about what you'd like to work on.
