This document is the canonical enumeration of every research direction we currently consider live at Apik Systems. It is written in the spirit of the UK AI Safety Institute’s research agenda — that is, as a single monolithic document rather than a constellation of blog posts, so that collaborators, prospective hires, peer labs, and funding partners have one place to look when they want to know what we are actually working on.
The agenda is organised into four parts. Part I covers our eight capability research pillars, each of which has a dedicated team and a multi-year horizon. Part II covers the foundational, cross-cutting research that threads through every pillar and binds the Apik Civilization Stack together as a coherent technical programme. Part III describes our three featured internal projects — Aegis, Q-Core, and Synthesis — which are concrete instantiations of the agenda. Part IV explains how to engage with us, whether as a visiting researcher, fellowship applicant, collaborator, or critic.
We update this agenda annually, with quarterly amendments published as dated diff notes. We treat the agenda as a living artefact: directions that no longer pay rent get retired in public, and new directions get added with explicit rationale. If you are reading a version older than v1.0, please refer to the latest revision at /research/agenda. If you are reading the current version and you think a direction is missing, miscalibrated, or unwise, write to us. The address is at the bottom of the document.
A note on tone. We have tried to write this in the register of working scientists addressing other working scientists. There is intentionally very little in the way of marketing language. The reader who wants a higher-level picture of the lab’s mission should consult the Apik Manifesto; the reader who wants the safety stance should consult the Safety Principles. This document presupposes both and goes straight to the questions we are trying to answer.
A note on scope. The agenda covers research, not engineering. We do not enumerate every internal infrastructure project, every product surface, or every data pipeline. We do enumerate every direction in which we are publishing, hiring, or accepting collaborators. If a direction is in this document, you should be able to find at least one Apik researcher whose primary work is on it.
A note on time horizons. Where we cite 12-month objectives, the clock starts on the publication date of this revision (April 2026). Where we cite multi-year horizons, we mean the 2026 to 2030 window that anchors our internal planning. We are deliberate about not committing to specific quarter-by-quarter milestones in a public document, because frontier research is not a Gantt chart.
Part I — Capability research
Our eight research pillars are the load-bearing technical programmes of the lab. Each pillar corresponds to a layer in the Apik Civilization Stack — Human Intelligence (Senwitt), Artificial Intelligence (Brello AI), Autonomous Agents, Physical Intelligence, and Economic Orchestration — together with the foundational sciences (Cognitive Computing, Quantum AI) and the connective tissue (AI Safety) that makes the stack robust at scale.
1. AI Safety and Alignment
Problem statement. Frontier learned systems exhibit behaviours that their training objectives did not specify and their evaluators did not anticipate. As capabilities scale, the gap between intended behaviour, trained behaviour, and deployed behaviour widens. The technical programme of AI safety is the set of methods that closes this gap — at training time through alignment techniques, at evaluation time through red-teaming and interpretability, and at deployment time through monitoring, sandboxing, and shutdown.
Why this matters. Apik builds toward a Civilization Stack in which autonomous systems take on increasingly load-bearing roles in coordination, allocation, and physical action. A misaligned coordinator is not a localised failure; it is a civilisation-scale failure mode. We treat AI safety not as a separate department but as a precondition on every other pillar in this agenda.
Methodology. We work along four sub-strands. The first is mechanistic interpretability — reverse-engineering circuits and features inside trained networks, using techniques from Olah et al.’s Circuits thread and the sparse-autoencoder decomposition work on monosemanticity. The second is behavioural evaluation — running systems through long-horizon, scheming-aware, and sandbagging-aware tasks, building on METR’s task-completion research and Apollo Research’s scheming evaluations. The third is formal verification of bounded properties on learned policies, which we pursue through Project Aegis (see Part III). The fourth is deployment safeguards — sample-and-audit oversight, runtime monitors, and corrigibility-preserving fine-tuning, drawing on Anthropic’s Core Views on AI Safety and Hubinger et al. on deceptive alignment.
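As a rough illustration of the sparse-autoencoder decomposition mentioned in the first sub-strand, the sketch below computes the forward pass and training loss of a one-layer sparse autoencoder on a single activation vector. The function names, the `l1_coeff` parameter, and the toy weights are hypothetical and are not drawn from Apik's code or from any published implementation. It uses plain Python lists for self-containment; practical work would use an array library.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: one row per output unit


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_forward(x: Vector, w_enc: Matrix, b_enc: Vector,
                w_dec: Matrix, b_dec: Vector) -> Tuple[Vector, Vector]:
    """Return (features, reconstruction) for one activation vector x."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU: features are non-negative
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x: Vector, features: Vector, recon: Vector,
             l1_coeff: float = 1e-3) -> float:
    """Mean squared reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Toy example: a 2-dimensional activation decomposed into 3 candidate features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
features, recon = sae_forward([1.0, 2.0], w_enc, [0.0] * 3, w_dec, [0.0] * 2)
```

In this toy case the third feature is switched off by the ReLU and the input is reconstructed exactly, so the loss consists only of the sparsity penalty. The interpretability-completeness question is how much of a real model's behaviour such a dictionary of features can account for.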
Current open problems.
- The interpretability-completeness gap. Sparse-autoencoder features cover only a fraction of model behaviour even at the scale of small open-weights models. We do not know whether the gap closes, plateaus, or widens with scale.
- Evaluation under evaluation-awareness. Models that know they are being evaluated may perform differently from models acting in deployment. We need evaluation protocols that are robust to a model that has read this document.
- Formal verification under environment uncertainty. Verified envelopes are only as strong as the environment model. We need techniques for stating and verifying properties under bounded environment shift.
- Capability-elicitation completeness. Showing that a model cannot do X is much harder than showing that it can. We do not yet have rigorous lower bounds.
- The corrigibility-capability tradeoff. Empirically, the most capable systems are not the most corrigible. We need to know whether this is fundamental or contingent.
12-month objectives.
- Publish a public benchmark for long-horizon scheming evaluation by Q4 2026, co-developed with at least two peer labs.
- Release the first version of the Aegis verification kernel as an open-source library by Q3 2026.
- Replicate and extend the latest sparse-autoencoder results on a 70B-parameter open-weights model by Q1 2027.
- Publish a structured empirical study of evaluation-awareness in frontier models by Q2 2027.
- Establish a recurring joint red-team exercise with at least one peer lab by Q4 2026.
References.
- Olah et al., “Zoom In: An Introduction to Circuits”
- Anthropic, “Core Views on AI Safety”
- Hubinger et al., “Risks from Learned Optimization”
- Apollo Research, “Scheming reasoning evaluations”
- METR, “Measuring AI Ability to Complete Long Tasks”
2. Agentic Systems
Problem statement. A capable language model is a function from prompt to completion. An agent is a system that takes goals, decomposes them, calls tools, observes results, plans, replans, and pursues outcomes over horizons longer than a single context window. The science of building agents that are competent, predictable, debuggable, and interruptible is in its early years, and most current agent stacks are brittle in ways that matter operationally.
Why this matters. Brello AI, our flagship intelligence layer, is built around an agentic substrate. Every product surface that emerges from the stack — from the coordination layer to the physical-intelligence runtime — assumes that we can build agents that work. Without robust agentic systems, the rest of the stack is theatre.
Methodology. We pursue four sub-strands. Tool-use protocols: we maintain a fork of the Model Context Protocol reference and contribute upstream, drawing on Toolformer’s self-supervised tool learning for skill acquisition. Hierarchical planning: we build planners that decompose goals across timescales, in the lineage of Tree of Thoughts and ReAct. Memory and context: we research persistent memory architectures — episodic stores, retrieved working memory, and consolidation policies — taking inspiration from Voyager’s lifelong learning agent. Oversight and interrupts: we build runtime mechanisms for inspection, pause, and rollback, and we co-design them with the formal-verification work in Pillar 1.
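To make the oversight-and-interrupts strand concrete, here is a minimal sketch of an agent run that supports pause and rollback by snapshotting state before every action. The `AgentRun` class and its fields are hypothetical illustrations, not a description of Brello AI's runtime. The sketch assumes state small enough to deep-copy, which real agents with external side effects will not satisfy.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentRun:
    state: Dict[str, int]
    checkpoints: List[Dict[str, int]] = field(default_factory=list)
    paused: bool = False

    def step(self, action: Callable[[Dict[str, int]], None]) -> bool:
        """Apply one action unless paused; snapshot state first so it can be undone."""
        if self.paused:
            return False
        self.checkpoints.append(copy.deepcopy(self.state))
        action(self.state)
        return True

    def rollback(self, steps: int = 1) -> None:
        """Restore the state as it was `steps` applied actions ago."""
        for _ in range(min(steps, len(self.checkpoints))):
            self.state = self.checkpoints.pop()


def write_file(state: Dict[str, int]) -> None:
    """Stand-in for a tool call with an observable effect on agent state."""
    state["files_written"] += 1


run = AgentRun(state={"files_written": 0})
run.step(write_file)
run.step(write_file)
run.paused = True              # an overseer interrupts the run
blocked = run.step(write_file)  # refused while paused
run.rollback(1)                # undo the most recent applied action
```

The design choice worth noting is that the interrupt is checked by the runtime, outside the agent's own decision loop, so the agent cannot plan around it. Rollback of effects that have already left the sandbox is a separate and harder problem.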
Current open problems.
- Failure-mode characterisation. We do not have a clean taxonomy of agent failure. “It got stuck” covers a multitude of distinct phenomena.
- Long-horizon credit assignment. When an agent succeeds or fails after a thousand-step trajectory, attributing the outcome to specific decisions remains an open problem.
- Compositionality of skills. Agents trained on individual skills do not always compose them as expected. We need a science of skill interfaces.
- Honest reporting. Agents asked to report on their own progress sometimes confabulate. The honesty of self-reports needs to be a measurable property.
- Resource-aware planning. Most current planners ignore compute, latency, and money costs. Real deployments do not.
12-month objectives.
- Publish Apik’s internal failure-mode taxonomy, with at least 200 annotated trajectories, by Q3 2026.
- Open-source a hierarchical planner benchmark with verifiable subgoal structure by Q4 2026.
- Ship a memory-consolidation evaluation that distinguishes episodic recall from confabulation by Q1 2027.
- Co-author with peer labs a draft revision of MCP that includes capability-token semantics by Q2 2027.
References.
- Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models”
- Wang et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models”
- Wu et al., “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”
- Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools”
- Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”
3. Autonomous Agent Systems
Problem statement. A single agent is hard. A swarm of agents — coordinating, competing, sharing memory, sometimes adversarial, sometimes cooperative — is a different research object. Multi-agent systems exhibit emergent dynamics that are not visible at the single-agent level: tool-use cascades, communication conventions, mode-collapse on shared rewards, and convergent instrumental sub-goals. The questions of when collective behaviour is beneficial, when it is dangerous, and how to verify either are open.
Why this matters. The Civilization Stack is, by construction, a multi-agent system. Economic Orchestration is multi-agent in the most literal sense. Even within a single product, the agentic substrate frequently spawns sub-agents to parallelise work. Every claim we make about the safety of the stack reduces to claims about the behaviour of agent collectives.
Methodology. Four sub-strands. Multi-agent coordination protocols: we research message-passing schemes, capability tokens, and contract-net variants, with reference to the agent-to-agent protocol literature emerging in 2025. Self-improvement loops: we study the dynamics of agents that train other agents, drawing cautionary lessons from OpenAI’s hide-and-seek emergent tool-use work and Google DeepMind’s SIMA generalist agent. Verifiable swarm behaviour: we ask which collective invariants can be enforced by construction, in concert with Project Aegis. Long-horizon credit assignment in multi-agent settings: we revisit classical multi-agent reinforcement learning under the lens of frontier-scale policies, building on the multi-agent RL surveys of the late 2010s.
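For readers unfamiliar with the contract-net family named above, the following sketch shows one announce-bid-award round in its simplest form. The `Bid` type, the bidder callables, and the lowest-cost award rule are illustrative assumptions; the variants we study differ in how bids are expressed and how awards are decided.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Bid:
    agent_id: str
    cost: float


# A bidder maps a task description to a cost estimate, or None to decline.
Bidder = Callable[[str], Optional[float]]


def announce(task: str, bidders: Dict[str, Bidder]) -> List[Bid]:
    """Manager broadcasts the task and collects bids from agents that respond."""
    bids = []
    for agent_id, bidder in bidders.items():
        cost = bidder(task)
        if cost is not None:
            bids.append(Bid(agent_id, cost))
    return bids


def award(bids: List[Bid]) -> Optional[Bid]:
    """Award to the lowest-cost bid; ties are broken by agent id for determinism."""
    if not bids:
        return None
    return min(bids, key=lambda b: (b.cost, b.agent_id))


bidders: Dict[str, Bidder] = {
    "agent-a": lambda task: 5.0,
    "agent-b": lambda task: 3.0,
    "agent-c": lambda task: None,  # declines the task
}
winner = award(announce("move-crate", bidders))
```

Even this toy round exposes the questions raised under open problems: bids are self-reported, so honesty matters, and nothing here prevents two bidders from coordinating their costs through a shared channel.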
Current open problems.
- Convention emergence and lock-in. Agents in a population converge on protocols that may or may not be the ones designers intended. How do we steer convention emergence?
- Adversarial collusion under shared infrastructure. Two agents using the same memory store can coordinate without explicit messaging. How do we detect this?
- Swarm-level interpretability. Even if every individual agent is interpretable, the swarm may not be. We need new techniques.
- Resource starvation as an oversight failure. A swarm can starve a monitor of compute. The oversight problem is not separable from the resource-allocation problem.
- Identity and audit at scale. Knowing which agent did what, when, and under what authority is harder than it sounds at the scale we anticipate.
12-month objectives.
- Publish a swarm-level interpretability case study on a 32-agent benchmark by Q4 2026.
- Release a formal specification of capability-token semantics for multi-agent systems by Q1 2027.
- Run a controlled experiment on convention emergence under varying communication bandwidths by Q2 2027.
- Establish baseline metrics for swarm-level honesty in a public evaluation by Q3 2026.
References.
- Google DeepMind, “SIMA: A Generalist AI Agent for 3D Virtual Environments”
- Baker et al., “Emergent Tool Use From Multi-Agent Autocurricula”
- Zhang et al., “Multi-Agent Reinforcement Learning: A Selective Overview”
- Albrecht and Stone, “Autonomous Agents Modelling Other Agents”
4. Humanoid Robotics
Problem statement. Generalist humanoid platforms have moved from research curiosity to early commercial deployment in the 2024–2026 window, but the science underneath them remains thin. Locomotion that survives unstructured terrain, manipulation that handles deformables and contact-rich tasks, sensor fusion that holds up under adversarial conditions, and whole-body control that does not require centimetre-perfect models all remain open problems. The platforms work in many cases and fail in many others, and the failure modes are not well characterised.
Why this matters. Physical Intelligence — the next pillar — does not exist without bodies to embed it in. Humanoid Robotics is the platform layer for the physical-action portion of the Civilization Stack. We do not aim to compete with Boston Dynamics on platform engineering; we aim to push the science that makes the next generation of platforms possible.
Methodology. Four sub-strands. Locomotion: we focus on terrain-adaptive controllers that compose learned policies with model-predictive control, drawing on the lineage of Atlas and the recent open-source humanoid stacks. Dexterous manipulation: we work on contact-rich tasks with multi-fingered end-effectors, with reference to RT-2 and the RoboCat self-improving robot. Sensor fusion: we build pipelines that combine vision, tactile, proprioception, and audio under adversarial conditions. Whole-body control: we research controllers that treat the humanoid as a single coupled system rather than as a sum of independently controlled limbs, with reference to Figure’s Helix architecture.
Current open problems.
- Long-tail failure under domain shift. Policies trained in simulation or in a controlled lab fail in the world for reasons we cannot always reproduce.
- Tactile-vision binding. Tightly coupling tactile information into vision-language-action models is harder than it should be.
- Recovery from contact loss. When manipulation goes wrong mid-task, recovery policies are brittle.
- Energy-aware control. Real humanoids run on batteries. Most controllers ignore energy budgets.
- Verifiable safety envelopes for embodied agents. The intersection of Pillar 1 and this pillar.
12-month objectives.
- Demonstrate a contact-rich manipulation policy that transfers across three end-effector morphologies by Q1 2027.
- Publish an open dataset of humanoid failure trajectories with annotated causes by Q4 2026.
- Co-develop with a peer lab a public benchmark for whole-body recovery from external perturbations by Q2 2027.
- Ship the first runtime safety monitor for a deployed humanoid platform under Project Aegis by Q3 2026.
References.
- Brohan et al., “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”
- Bousmalis et al., “RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation”
- Boston Dynamics, “Atlas Technical Overview”
- Figure AI, “Helix: A Vision-Language-Action Model for Humanoid Control”
5. Physical Intelligence
Problem statement. Physical Intelligence is the layer that turns sensorimotor data into action. The hypothesis we share with Physical Intelligence (Pi) and adjacent labs is that there is a generalist sensorimotor foundation model — one that ingests vision, proprioception, language, and tactile data, and emits action tokens — which can be trained at scale and fine-tuned to specific embodiments. The science of training such models, evaluating them, deploying them, and ensuring they behave is the topic of this pillar.
Why this matters. Embodied general intelligence is the second-to-last layer of the Civilization Stack. Without it, every physical-world task — manufacturing, logistics, healthcare delivery, infrastructure repair — remains bottlenecked on bespoke automation. The economic and humanitarian implications are direct.
Methodology. Four sub-strands. Sensorimotor foundation models: we train and study vision-language-action models in the lineage of Pi’s π0 and π0.5 and OpenVLA. Action tokenization: we research how to discretise continuous action spaces in ways that preserve fine-motor structure. Real-time inference at the edge: we work on the compiler and hardware path that lets a 7B-parameter VLA run at 50 Hz on power-constrained hardware. Cross-embodiment transfer: we ask which knowledge transfers between morphologies and which does not, drawing on Google DeepMind’s RT-X work.
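The granularity question in action tokenization can be seen in the simplest possible scheme, uniform binning of each action dimension. The sketch below is a baseline for illustration only; the function names and the [-1, 1] action range are assumptions, and it does not describe the tokenizers used in π0, OpenVLA, or our own models.

```python
from typing import List


def tokenize(action: List[float], low: float, high: float, n_bins: int) -> List[int]:
    """Map each continuous action dimension to the index of a uniform bin."""
    width = (high - low) / n_bins
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)               # out-of-range values saturate
        index = int((clipped - low) / width)
        tokens.append(min(index, n_bins - 1))          # the value `high` falls in the last bin
    return tokens


def detokenize(tokens: List[int], low: float, high: float, n_bins: int) -> List[float]:
    """Map each token back to the centre of its bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


action = [-1.0, 0.0, 0.37, 1.0]
fine = tokenize(action, -1.0, 1.0, 256)    # vocabulary of 256 tokens per dimension
coarse = tokenize(action, -1.0, 1.0, 8)    # vocabulary of 8 tokens per dimension
```

For in-range actions the round-trip error is at most half a bin width, so halving the error doubles the per-dimension vocabulary. With 8 bins on [-1, 1] the worst-case error is 0.125; with 256 bins it is about 0.0039. Whether that precision is sufficient depends on the task, which is the tradeoff listed under open problems.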
Current open problems.
- Action tokenization granularity. Coarse tokens lose fine motor control; fine tokens explode the action vocabulary. The right tradeoff is task-dependent and not well-characterised.
- Cross-embodiment generalisation limits. RT-X showed that some transfer happens. We do not know how far it goes or where it breaks.
- Latency vs. quality. High-quality VLA inference is slow. Slow inference makes contact-rich manipulation impossible. The frontier here is co-design.
- Data efficiency for new embodiments. Bringing up a new platform should not require a million teleoperated trajectories.
- Evaluation under deployment shift. Lab benchmarks do not predict field performance reliably.
12-month objectives.
- Train an Apik VLA at the 3B-parameter scale on the public RT-X corpus plus internal data by Q2 2027.
- Demonstrate sub-20 ms action-loop latency on a commodity edge accelerator by Q4 2026.
- Publish a study of cross-embodiment transfer across at least four morphologies by Q1 2027.
- Open-source an action-tokenization benchmark with cross-embodiment evaluation by Q3 2026.
References.
- Black et al., “π0: A Vision-Language-Action Flow Model for General Robot Control”
- Physical Intelligence, “π0.5: a Vision-Language-Action Model with Open-World Generalization”
- Open X-Embodiment Collaboration, “Open X-Embodiment: Robotic Learning Datasets and RT-X Models”
- Kim et al., “OpenVLA: An Open-Source Vision-Language-Action Model”
6. Cognitive Computing
Problem statement. Frontier AI is bottlenecked by the energy and latency profile of conventional digital accelerators. The same architectures that train trillion-parameter models in datacentres become impractical in edge environments — humanoid robots, autonomous vehicles, environmental sensors. Cognitive Computing is the research programme that asks: what does compute look like when designed from the ground up for the kind of inference frontier AI actually runs?
Why this matters. Several layers of the Civilization Stack — Physical Intelligence, deployment of Brello AI in disconnected environments, the agentic monitors that watch over autonomous systems in real time — depend on inference profiles that conventional GPUs cannot serve. Cognitive Computing is the hardware-software co-design layer that makes those deployments possible.
Methodology. Four sub-strands. Neuromorphic architectures: we work with sparse, event-driven compute substrates in the lineage of Intel’s Loihi and IBM’s TrueNorth. In-memory compute: we research analog and digital compute-in-memory designs, with reference to Mythic AI’s analog matrix processors and the recent academic literature on resistive RAM. Edge inference: we develop compiler and runtime stacks that target heterogeneous edge accelerators. Energy-efficient training: we study sparsity, mixture-of-experts routing, and low-precision training as first-class citizens rather than afterthoughts.
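As a small illustration of why mixture-of-experts routing is a form of sparsity, the sketch below implements top-k gating for a single token: only the k selected experts would be executed, and their gate weights are renormalised with a softmax. The function names and logits are hypothetical; production routers add load-balancing terms and capacity limits that are omitted here.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))


# Four experts, two active per token: half of the expert compute is skipped.
routes = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
```

Here experts 1 and 3 are selected and the other two are never evaluated for this token. The energy saving depends on the hardware being able to skip the inactive experts cheaply, which is where the co-design question enters.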
Current open problems.
- Programming models for neuromorphic hardware. The hardware exists. The software stack does not, in any general-purpose form.
- Calibration and drift in analog compute. In-memory compute units drift over time and temperature; correcting for this without negating the efficiency gain is hard.
- Sparsity that survives fine-tuning and distillation. Models that look sparse during training often densify during fine-tuning.
- Compiler support for mixed-precision policy networks. The compiler stacks for VLAs are not yet first-class on edge accelerators.
- Co-design feedback loops. Hardware and software teams operate on different cycle times; co-design requires institutional patience.
12-month objectives.
- Publish an open compiler middle-end targeting a sparse-matrix-multiply abstraction across three edge accelerators by Q1 2027.
- Run a pilot training of a 1B-parameter sparse VLA on a neuromorphic-friendly substrate by Q3 2026.
- Establish an annual Apik Cognitive Computing workshop with peer labs and hardware vendors by Q2 2027.
- Demonstrate a 5x energy-per-inference improvement on a representative VLA workload by Q4 2026.
References.
- Davies et al., “Loihi: A Neuromorphic Manycore Processor with On-Chip Learning”
- Merolla et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface”
- Mythic AI, “Analog Matrix Processor”
- Sebastian et al., “Memory devices and applications for in-memory computing”
7. Economic Orchestration
Problem statement. When autonomous agents take on operational roles in supply chains, energy grids, transport networks, and capital allocation, the question of how decisions are coordinated across them becomes a research object in its own right. Economic Orchestration is the science of allocating resources, coordinating decisions, and resolving conflicts among agents — both human and artificial — under uncertainty, at planetary scale, with bounded compute and bounded trust.
Why this matters. This is the topmost layer of the Civilization Stack and the one with the highest potential for both upside and harm. A well-designed orchestration layer turns post-scarcity capability into actual abundance. A poorly designed one concentrates authority, distorts incentives, and creates externalities at civilisational scale. The technical content of “well-designed” is what this pillar is about.
Methodology. Four sub-strands. Mechanism design: we work in the tradition of algorithmic game theory, drawing on Roughgarden, Nisan, and the AGT canon. Allocation theory under uncertainty: we research decision rules that perform well when distributions are unknown, partially observed, or adversarial. Multi-objective optimisation: we study Pareto-frontier methods for problems with non-commensurable objectives (efficiency, fairness, resilience, sustainability). Markets vs. central coordination: we ask, empirically and theoretically, where market mechanisms outperform central coordination and where they do not, drawing on the climate and energy dispatch literature, Sandholm on combinatorial auctions, and Tambe on security games.
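The notion of a Pareto frontier used in the multi-objective strand can be stated in a few lines. The sketch below filters a set of candidate allocations down to the non-dominated ones, treating higher as better on every objective. The candidate scores are invented for illustration, and the quadratic scan is chosen for clarity rather than efficiency.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is at least as good on every objective and strictly better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, in their original order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical allocations scored as (efficiency, fairness).
candidates = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9)]
front = pareto_front(candidates)
```

The allocation (0.5, 0.5) is dropped because (0.6, 0.6) beats it on both objectives. The three survivors cannot be ranked without a judgement about how efficiency trades against fairness, which is what non-commensurable means in practice.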
Current open problems.
- Mechanism design for AI participants. Most mechanism design assumes self-interested rational human agents. AI participants violate the assumptions in interesting ways.
- Non-stationary preferences. Stakeholder preferences change as the world changes. Mechanisms that assume fixed preferences fail.
- Concentration risk. Orchestration mechanisms can concentrate authority even when designed not to. We need rigorous metrics.
- Externality accounting. The most consequential allocation decisions are the ones whose externalities span decades.
- Verifiable orchestration. Bringing the formal-verification work in Pillar 1 to bear on orchestration mechanisms is open territory.
12-month objectives.
- Publish a survey of mechanism design under AI participation by Q4 2026.
- Run a case study on energy-dispatch orchestration with a regional grid operator by Q2 2027.
- Release an open simulation environment for orchestration research at city-scale by Q1 2027.
- Co-host a workshop on verified orchestration with at least one academic department by Q3 2026.
References.
- Nisan, Roughgarden, Tardos, Vazirani, eds., “Algorithmic Game Theory”
- Sandholm, “Algorithm for Optimal Winner Determination in Combinatorial Auctions”
- Tambe, “Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned”
- Roughgarden, “Twenty Lectures on Algorithmic Game Theory”
8. Quantum AI
Problem statement. Quantum computing has moved from theoretical promise to early-stage hardware reality, but the gap between the kinds of problems frontier AI cares about and the kinds of problems quantum hardware can solve at scale remains wide. Quantum AI is the research programme that asks: which problems in machine learning, optimisation, and simulation actually benefit from quantum acceleration in the NISQ and early fault-tolerant eras, and how do we build the algorithmic and software stacks to exploit those benefits?
Why this matters. Several research directions across the lab — molecular simulation in Project Synthesis, certain combinatorial optimisation problems in Economic Orchestration, sampling problems in generative models — have the structure of problems where quantum methods may eventually offer asymptotic or constant-factor advantage. Even where they do not, the discipline of designing hybrid classical-quantum loops is teaching us things about classical algorithms.
Methodology. Four sub-strands. Variational algorithms: we work on parameterised quantum circuits in the VQE/VQA family. Quantum-enhanced sampling: we research applications of quantum hardware to generative modelling and Monte Carlo methods. Hybrid classical-quantum loops: we build the orchestration software that schedules work between CPUs, GPUs, and QPUs. NISQ-era applications: we identify problems where current-generation hardware can deliver real value despite noise, in the spirit of Preskill’s NISQ framing. Project Q-Core (see Part III) is our flagship effort here.
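To show the shape of a hybrid classical-quantum loop in the VQE family, the sketch below minimises the expectation value of the Pauli-Z observable for a single qubit prepared by one parameterised rotation. The quantum part is simulated exactly in a few lines of arithmetic, and the gradient uses the parameter-shift rule. The step size, starting angle, and function names are assumptions for illustration; on real hardware the energy would be estimated from repeated noisy measurements.

```python
import math


def energy(theta: float) -> float:
    """Expectation of Z in the state Ry(theta)|0> = [cos(theta/2), sin(theta/2)]."""
    a, b = math.cos(theta / 2), math.sin(theta / 2)
    return a * a - b * b  # algebraically equal to cos(theta)


def parameter_shift_grad(theta: float) -> float:
    """Gradient from two shifted energy evaluations, as a QPU would supply them."""
    return 0.5 * (energy(theta + math.pi / 2) - energy(theta - math.pi / 2))


def minimise(theta: float = 0.3, lr: float = 0.4, steps: int = 100) -> float:
    """Classical outer loop: plain gradient descent on the circuit parameter."""
    for _ in range(steps):
        theta -= lr * parameter_shift_grad(theta)
    return theta


theta_opt = minimise()
ground_energy = energy(theta_opt)
```

The loop converges to theta close to pi, where the qubit is in the state |1> and the energy reaches its minimum of -1. The classical optimiser only ever sees energy evaluations, which is why scheduling work between CPUs and QPUs, and the flatness of the landscape in larger circuits, dominate the practical difficulty.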
Current open problems.
- Trainability of variational circuits. Barren plateaus remain a barrier; the conditions under which they occur, and under which they can be avoided, are only partially understood.
- Quantum advantage in machine learning. Theoretical results are mixed; empirical demonstrations are scarce.
- Decoder latency at the quantum-classical boundary. See Project Q-Core.
- Application discovery. Most “quantum machine learning” papers solve toy problems. Finding real-world problems where the structure favours quantum methods is itself a research question.
- Hybrid stack ergonomics. Current hybrid stacks are painful to use. Better software is a research contribution.
12-month objectives.
- Demonstrate a Q-Core decoder achieving a target latency budget on a representative surface-code workload by Q1 2027.
- Publish a survey of trainability conditions for variational quantum circuits by Q4 2026.
- Release an open hybrid-orchestration framework for classical-quantum loops by Q3 2026.
- Identify and benchmark at least three application problems with structural quantum-advantage potential by Q2 2027.
References.
- Preskill, “Quantum Computing in the NISQ era and beyond”
- Cerezo et al., “Variational Quantum Algorithms”
- Fawzi et al., “Discovering faster matrix multiplication algorithms with reinforcement learning (AlphaTensor)”
- McClean et al., “Barren plateaus in quantum neural network training landscapes”
Part II — Foundational and cross-cutting research
The eight pillars are the visible structure. The cross-cutting research described in this part is the connective tissue — work that does not belong cleanly to any one pillar but that every pillar depends on. We staff these directions deliberately, and we publish on them as first-class research output rather than treating them as tooling.
1. Evaluations infrastructure
How we measure progress without Goodharting. Frontier AI is a domain in which the things that are easy to measure are not always the things that matter, and the things that matter are not always easy to measure. The evaluations infrastructure programme is our standing investment in measurement methodology — building benchmarks for capabilities and behaviours that are robust to the obvious failure modes of evaluation.
We work in three registers. First, capability evaluations for long-horizon agents, building on METR’s task-horizon framework and extending it to the kinds of tasks our agents actually do: software engineering, scientific reasoning, and multi-step coordination. Second, safety-relevant evaluations: scheming, sandbagging, deceptive alignment, evaluation-awareness. Third, deployment evaluations: holding out tasks that resemble actual deployment contexts and measuring performance on them. We treat the gap between lab and deployment performance as a first-class metric.
A core commitment of this programme is publishing methodology, not just results. We believe that evaluation methodology — exactly how a benchmark is constructed, what the failure modes are, where it can be Goodharted — is a public good, and we publish ours. We accept that this will sometimes mean publishing methodology that makes our own results look worse.
2. Oversight protocols
Scalable oversight for systems we cannot watch in real time. The naive picture of oversight is a human reviewing every output. The actual scaling problem is that humans cannot review every output of a system that emits a billion outputs per day, and even if they could, they cannot reliably evaluate outputs in domains where they lack expertise.
The oversight protocols programme builds and evaluates techniques that decouple oversight quality from human throughput. Sample-and-audit: rigorous statistical sampling of outputs, with audit budgets allocated to the highest-risk subspace. Debate and recursive reward modelling: techniques in the lineage of Irving, Christiano, and Amodei’s debate proposal, Leike et al.’s recursive reward modelling, and Anthropic’s Constitutional AI work, in which we use AI to amplify human oversight rather than replace it. Runtime monitors: lightweight classifiers and rule systems that flag suspicious behaviour for human review.
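One simple way to read "audit budgets allocated to the highest-risk subspace" is stratified sampling with risk weights. The sketch below splits a fixed audit budget across strata in proportion to volume times a risk weight, then draws a uniform random sample within each stratum. The stratum names, risk weights, and proportional rule are assumptions for illustration; it also assumes at least one stratum has positive weight, and integer rounding may leave a little of the budget unspent.

```python
import random
from typing import Dict, List


def allocate_audits(volumes: Dict[str, int], risk: Dict[str, float],
                    budget: int) -> Dict[str, int]:
    """Split an audit budget across strata in proportion to volume x risk weight."""
    weights = {s: volumes[s] * risk[s] for s in volumes}
    total = sum(weights.values())
    # Never allocate more audits to a stratum than it has outputs.
    return {s: min(volumes[s], int(budget * w / total)) for s, w in weights.items()}


def sample_for_audit(outputs: Dict[str, List[str]], risk: Dict[str, float],
                     budget: int, seed: int = 0) -> Dict[str, List[str]]:
    """Draw a uniform random sample without replacement from each stratum."""
    volumes = {s: len(items) for s, items in outputs.items()}
    alloc = allocate_audits(volumes, risk, budget)
    rng = random.Random(seed)  # seeded so an audit draw can be reproduced later
    return {s: rng.sample(items, alloc[s]) for s, items in outputs.items()}


outputs = {
    "low": [f"low-{i}" for i in range(900)],
    "high": [f"high-{i}" for i in range(100)],
}
picked = sample_for_audit(outputs, {"low": 1.0, "high": 9.0}, budget=100)
```

With these weights the high-risk stratum receives 50 of the 100 audits despite being a tenth of the traffic, so half of its outputs are reviewed while roughly one in eighteen low-risk outputs is. Because sampling within each stratum is uniform, per-stratum error rates can still be estimated without bias.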
This work is co-developed with Pillar 1 (AI Safety) and Pillar 3 (Autonomous Agent Systems), and informs the deployment posture for every system we ship.
3. Coordination infrastructure
The substrate the Civilization Stack runs on. When the lab’s research outputs eventually reach production — agents talking to agents, models calling tools across organisational boundaries, autonomous systems acting on behalf of users — the coordination substrate becomes load-bearing. Most of the substrate today is improvised on top of HTTP and JSON.
We invest in four threads. Agent-to-agent protocols: we contribute upstream to MCP and the emerging A2A protocol family. Identity: cryptographic identity for agents, not just for users, and the infrastructure to revoke and rotate it. Audit logs: tamper-evident, queryable logs of agent action that an investigator can read after the fact. Capability tokens: scoped, time-bounded, revocable authority delegated to agents, with formal semantics. This work intersects deeply with Project Aegis.
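Two of these threads are easy to illustrate together. The sketch below shows a capability token that is scoped, time-bounded, and revocable, and an append-only audit log in which each entry commits to the hash of the previous one so that later edits are detectable. The class names, fields, and scope strings are hypothetical and do not describe MCP, A2A, or Project Aegis. The token is unsigned here, whereas a real token would carry a cryptographic signature, and a hash chain is only tamper-evident if its latest hash is also recorded somewhere the writer cannot alter.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class CapabilityToken:
    token_id: str
    agent_id: str
    scopes: FrozenSet[str]
    expires_at: float  # seconds since the epoch


def is_authorised(token: CapabilityToken, scope: str, now: float,
                  revoked: Set[str]) -> bool:
    """Authority holds only if the token is unrevoked, unexpired, and covers the scope."""
    return (token.token_id not in revoked
            and now < token.expires_at
            and scope in token.scopes)


class AuditLog:
    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: List[dict] = []

    @staticmethod
    def _digest(prev: str, record: dict) -> str:
        body = json.dumps(record, sort_keys=True)  # canonical key order
        return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append({"record": record, "prev": prev,
                             "hash": self._digest(prev, record)})

    def verify(self) -> bool:
        """Recompute the chain; any edited, removed, or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or self._digest(prev, entry["record"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


token = CapabilityToken("tok-1", "agent-a", frozenset({"read:inventory"}), expires_at=100.0)
log = AuditLog()
log.append({"agent": token.agent_id, "token": token.token_id, "action": "read:inventory"})
```

Recording the token identifier in each log entry is what ties "which agent did what" to "under what authority". Giving these checks formal semantics, rather than leaving them as ad hoc code, is the research content of the thread.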
4. Hardware-software co-design
How specific kinds of inference shape what is possible. The relationship between Cognitive Computing (Pillar 6) and the rest of the stack is not linear — it is a feedback loop. The kinds of models we train determine what hardware is worth building; the hardware available determines what models we can train and deploy.
We treat this as a standing research commitment with a small dedicated team. The team’s brief is to look across the lab’s pillars and ask: which research directions are bottlenecked by hardware that does not yet exist, and which are bottlenecked by software stacks that do not yet exploit hardware that does exist? The team publishes a quarterly internal memo that drives prioritisation discussions, and an annual public retrospective.
The clearest current example is the path from Physical Intelligence (Pillar 5) through edge inference into the kinds of accelerators Cognitive Computing (Pillar 6) is designing. A 7B-parameter VLA running at 50Hz on battery power is a co-design problem, not a hardware problem or a software problem in isolation.
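A back-of-envelope calculation shows why the example above is a co-design problem. The sketch below estimates the memory bandwidth needed just to stream the weights, under the simplifying assumption (ours, not the source's) that every parameter is read from memory once per forward pass and that nothing is cached on chip. The helper name and the `passes_per_step` parameter are hypothetical.

```python
def weight_bandwidth_gb_per_s(params, bytes_per_param, control_hz,
                              passes_per_step=1.0):
    """GB/s needed to stream all weights once per forward pass.

    passes_per_step < 1 models action chunking, where one forward pass
    yields several control steps (an assumption about the controller).
    """
    return params * bytes_per_param * control_hz * passes_per_step / 1e9


PARAMS = 7e9      # 7B-parameter VLA, as in the text
CONTROL_HZ = 50   # 50Hz control rate, as in the text

for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    gbps = weight_bandwidth_gb_per_s(PARAMS, bytes_per_param, CONTROL_HZ)
    print(f"{label}: {gbps:.0f} GB/s")
# fp16: 700 GB/s
# int8: 350 GB/s
# int4: 175 GB/s
```

Under these assumptions, quantisation (a software choice), action chunking (a model and controller choice), and memory bandwidth per watt (a hardware choice) all move the same number, which is the sense in which none of them can be optimised in isolation.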
5. Open scientific posture
A research lab’s output is not just papers; it is also the policies that govern what gets published, what gets open-weight-released, what gets gated, and what gets disclosed only to peers. The open scientific posture programme is the standing committee that owns these policies, reviews them, and revises them when warranted.
The current posture, in summary: we publish capability research broadly when it does not provide differential uplift to misuse. We publish safety research aggressively, including methodology and negative results. We open-weight-release models below a defined capability threshold; above the threshold, we follow a gated-disclosure protocol with peer labs and government safety institutes. We run a fellowship and visiting-researcher programme that brings external researchers into the lab on time-bounded engagements with full access to internal work. The detailed policy is documented at /safety/responsible-development-policy and is reviewed quarterly by the safety council.
Part III — Featured projects
This part describes three internal projects that instantiate the agenda. They are not the only projects in the lab, but they are the ones that are most legible from the outside and most often referenced in our publications. Each is named, scoped, and tied to specific pillars.
Project Aegis — formal verification for multi-agent safety
Project Aegis is our flagship effort to bring formal-method techniques to bear on multi-agent learned systems. The thesis is that while we cannot, in general, prove arbitrary properties of arbitrary learned policies, we can construct architectures in which a small verified envelope mediates between the world and a learned policy, and we can prove safety properties of the envelope.
The technical approach combines three ingredients. First, SMT solvers (we use Z3 internally and contribute to the Z3 ecosystem) for discharging proof obligations about action preconditions and postconditions. Second, TLA+ specifications of the multi-agent protocols our systems run, with TLC and Apalache for model-checking. Third, the learned-policy + verified-envelope pattern: a learned policy proposes actions; a verified envelope checks them against invariants and either accepts, modifies, or rejects them; the learned policy never has direct authority over the world.
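The learned-policy + verified-envelope pattern can be illustrated with a toy one-dimensional example. In the sketch below, the envelope enforces a single invariant, that the commanded speed always leaves enough room to stop before an obstacle, and returns accept, modify, or reject. The invariant, the constants, and all names are assumptions made for illustration. In Aegis the checks would be proof obligations discharged by an SMT solver, whereas here hand-written arithmetic stands in for them.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    MODIFY = "modify"
    REJECT = "reject"


@dataclass(frozen=True)
class State:
    distance_to_obstacle: float  # metres


@dataclass(frozen=True)
class Action:
    velocity: float  # commanded speed in m/s


MAX_SPEED = 2.0  # m/s, assumed platform limit
MAX_DECEL = 4.0  # m/s^2, assumed guaranteed braking capability


def envelope(state: State, proposed: Action):
    """Check a proposed action against the stopping-distance invariant."""
    v, d = proposed.velocity, state.distance_to_obstacle
    if not math.isfinite(v) or v < 0 or not math.isfinite(d) or d < 0:
        return Verdict.REJECT, Action(0.0)  # malformed input: stop
    # Invariant: v^2 / (2 * MAX_DECEL) <= d, i.e. v <= sqrt(2 * MAX_DECEL * d).
    safe_speed = min(MAX_SPEED, math.sqrt(2 * MAX_DECEL * d))
    if v <= safe_speed:
        return Verdict.ACCEPT, proposed
    return Verdict.MODIFY, Action(safe_speed)  # clamp to the safe bound


def step(policy, state: State):
    """The policy only proposes; only the envelope's output reaches actuators."""
    proposed = policy(state)
    return envelope(state, proposed)


# Stand-in for a learned policy that always wants to go fast.
eager_policy = lambda state: Action(1.5)
print(step(eager_policy, State(distance_to_obstacle=0.125)))
# (<Verdict.MODIFY: 'modify'>, Action(velocity=1.0))
```

The point of the structure is that the safety argument never mentions the policy: whatever `policy` returns, the action that leaves `step` satisfies the invariant, so only the small `envelope` function needs to be verified.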
Current work centres on composability. A verified envelope around one agent is one thing; a verified envelope around a swarm is another, and composing envelopes from independently-verified components does not in general give you guarantees about the composed system. We are working on compositional techniques drawn from the protocol composition logic literature and the Dolev-Yao tradition, adapted to learned-policy participants.
A second focus is environment uncertainty. A verified envelope is only as strong as the environment model it assumes. We are working on techniques for stating safety properties under bounded environment shift — properties that hold not just under a fixed environment but under any environment within a specified neighbourhood of the modelled one.
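One way to write down the difference between the two kinds of property, in notation that is ours for illustration rather than Aegis's: let $\hat{E}$ be the modelled environment, $d$ a distance on environment models, $\epsilon$ the radius of the neighbourhood, $\mathsf{Env}$ the verified envelope, $\pi$ any learned policy, and $\varphi$ the safety property. The choice of $d$ is itself part of the research question and is left abstract here.

```latex
% Safety under a fixed environment model:
\[
  \forall \pi.\quad (\mathsf{Env} \circ \pi) \parallel \hat{E} \;\models\; \varphi
\]

% Safety under bounded environment shift:
\[
  \forall \pi.\ \forall E.\quad d(E, \hat{E}) \le \epsilon
  \;\Longrightarrow\; (\mathsf{Env} \circ \pi) \parallel E \;\models\; \varphi
\]
```

The first statement is the special case $\epsilon = 0$ of the second, so a proof of the second is strictly stronger, and the largest $\epsilon$ for which it can be proved is a natural measure of how robust the envelope is to modelling error.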
Project Aegis ties to research pillars 1 (AI Safety) and 3 (Autonomous Agent Systems), and feeds into the coordination infrastructure work in Part II. We expect to release the first version of the Aegis verification kernel as an open-source library by Q3 2026, with a tutorial paper to follow.
Project Q-Core — error correction at higher temperatures
Project Q-Core is our quantum-systems effort, focused on the practical bottleneck that determines whether quantum computers can be deployed outside the largest national laboratories: cryogenic overhead. Today’s leading quantum hardware runs at millikelvin temperatures, which means dilution refrigerators, which means capital costs and operational complexity that put quantum computing out of reach for most researchers and applications.
The Q-Core thesis is that combining topological error-correcting codes, which give us better code distance per physical qubit, with machine-learned decoders, which give us better decoding latency and accuracy than syndrome-table methods, can move the operating point at which the required logical-qubit fidelity is achievable to materially higher temperatures. Higher temperature means a smaller fridge, lower power draw, lower capital cost, and ultimately a wider ecosystem of researchers and applications.
The current technical focus is the latency budget at the decoder boundary. A decoder that takes longer than the coherence time of the qubits is not a decoder; it is a post-mortem tool. Machine-learned decoders are fast but expensive in compute; the question is how to deploy them at the decoder boundary without introducing unacceptable latency. We are pursuing this through compiler-level optimisation, dedicated decoder accelerators, and decoder co-design with the code structure.
A second focus is decoder verification. A learned decoder that fails silently is worse than no decoder. We are working on runtime monitors that detect decoder pathologies and fall back to a slower, verified decoder when the learned one is suspect. This is, intentionally, the same architectural pattern as Project Aegis — a learned component working inside a verified envelope.
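The fallback pattern can be sketched on a toy bit-flip repetition code, used here as a stand-in for the surface-code setting. The monitor below applies one cheap check, that the proposed correction reproduces the measured syndrome, and falls back to a simple reference decoder when the check fails. The function names and the choice of check are assumptions for illustration. A syndrome-consistency check catches only one class of decoder pathology and says nothing about logical errors.

```python
def syndrome(bits):
    """Parity of each neighbouring pair of bits in a repetition code."""
    return [bits[i] ^ bits[i + 1] for i in range(len(bits) - 1)]


def reference_decoder(s):
    """Slow-path decoder: the minimum-weight correction matching syndrome s.

    Exactly two bit strings reproduce a given syndrome (one is the bitwise
    complement of the other), so we build one and keep the lighter of the two.
    """
    c = [0]
    for parity in s:
        c.append(c[-1] ^ parity)
    flipped = [b ^ 1 for b in c]
    return c if sum(c) <= sum(flipped) else flipped


def monitored_decode(learned_decoder, s):
    """Use the learned decoder's output only if it passes the runtime check."""
    proposal = learned_decoder(s)
    consistent = (
        len(proposal) == len(s) + 1
        and all(b in (0, 1) for b in proposal)
        and syndrome(proposal) == s
    )
    if consistent:
        return proposal, "learned"
    return reference_decoder(s), "fallback"


# A single bit flip on qubit 1 of a 5-qubit repetition code.
error = [0, 1, 0, 0, 0]
s = syndrome(error)  # [1, 1, 0, 0]

good_decoder = lambda s: [0, 1, 0, 0, 0]    # stand-in for a healthy model
broken_decoder = lambda s: [0, 0, 0, 0, 0]  # stand-in for a silent failure

print(monitored_decode(good_decoder, s))    # ([0, 1, 0, 0, 0], 'learned')
print(monitored_decode(broken_decoder, s))  # ([0, 1, 0, 0, 0], 'fallback')
```

The check costs one pass over the proposal, so it adds little to the latency budget on the fast path, and the slow path is only paid when the learned decoder is already known to be wrong.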
Project Q-Core ties to research pillar 8 (Quantum AI) and intersects with Cognitive Computing (Pillar 6) on the decoder accelerator side. We expect to demonstrate a Q-Core decoder hitting target latency on a representative surface-code workload by Q1 2027.
Project Synthesis — closed-loop materials discovery
Project Synthesis is our closed-loop autonomous materials discovery pipeline. The thesis, shared with several adjacent labs, is that the bottleneck in materials science is no longer hypothesis generation — large language models and dedicated property-prediction models can generate hypotheses faster than humans can evaluate them — but the closed loop from hypothesis to synthesis to characterisation and back. Closing that loop, with as little human-in-the-loop intervention as possible, is the research programme.
The pipeline has four stages. Hypothesise: a generative model proposes candidate materials with target properties. Simulate: density-functional-theory simulations and faster surrogate models filter the candidate set. Synthesise: a robotic chemistry platform attempts to make the surviving candidates. Characterise: a suite of automated characterisation instruments measures the resulting samples. The loop closes when characterisation results feed back into the generative model as updated training data.
Current work focuses on two open problems. First, characterisation noise. Real characterisation instruments are noisy in ways that are not always well-modelled by the simple noise distributions used in the simulation stage. We are working on noise models that are calibrated empirically and that propagate properly through the rest of the pipeline. Second, synthesis-cost-aware exploration. The cost of attempting a synthesis varies enormously — some candidates are cheap and fast to make, others require precursors that take weeks to procure. Exploration policies that ignore synthesis cost spend their budget poorly. We are developing exploration policies that explicitly trade off expected information gain against synthesis cost.
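As a minimal sketch of synthesis-cost-aware exploration, the snippet below ranks candidates by expected information gain per unit of synthesis cost and fills a fixed budget greedily. The candidate names, gain estimates, costs, and the greedy ratio rule are all illustrative assumptions. The greedy rule is a simple knapsack heuristic, not the exploration policy under development, and it is not guaranteed to be optimal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    expected_info_gain: float  # model's estimate, arbitrary units
    synthesis_cost: float      # e.g. platform-days, must be positive


def select_batch(candidates, budget):
    """Greedily pick candidates with the best gain-to-cost ratio within budget."""
    for c in candidates:
        if c.synthesis_cost <= 0:
            raise ValueError(f"{c.name}: synthesis cost must be positive")
    ranked = sorted(candidates,
                    key=lambda c: c.expected_info_gain / c.synthesis_cost,
                    reverse=True)
    chosen, spent = [], 0.0
    for c in ranked:
        if spent + c.synthesis_cost <= budget:
            chosen.append(c)
            spent += c.synthesis_cost
    return chosen


# Hypothetical candidate pool: B is the most informative single experiment,
# but its precursors make it ten times as expensive as A or C.
pool = [
    Candidate("A", expected_info_gain=1.0, synthesis_cost=1.0),
    Candidate("B", expected_info_gain=1.5, synthesis_cost=10.0),
    Candidate("C", expected_info_gain=0.8, synthesis_cost=1.0),
    Candidate("D", expected_info_gain=0.5, synthesis_cost=2.0),
]
batch = select_batch(pool, budget=10.0)
print([c.name for c in batch])  # ['A', 'C', 'D']
```

With a budget of 10, a policy that ignores cost would spend everything on B for an expected gain of 1.5, while the cost-aware selection runs A, C, and D for an expected gain of 2.3 and leaves budget unspent.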
Project Synthesis ties to research pillars 5 (Physical Intelligence) on the robotic-platform side and 6 (Cognitive Computing) on the edge-inference side, where the characterisation instruments need real-time inference for adaptive measurement. The first major deliverable is an internal end-to-end demonstration of a closed-loop discovery campaign for a specific class of solid-state electrolytes by Q4 2026.
Part IV — How to engage
This document is, in part, a request for collaborators. The lab is small, the agenda is large, and we are explicit that the work will go faster and better with external help. Below are the concrete ways to engage with us.
Visiting researcher programme. Apik hosts visiting researchers on engagements ranging from three months to two years. Visiting researchers join one of the eight pillar teams or one of the cross-cutting programmes, with full access to internal work and a clear publication path. We are particularly interested in researchers whose primary affiliation is at an academic department or peer lab, and who can bring methodological diversity to the work. Applications are reviewed quarterly.
Apik Fellowships. The fellowship programme is a one-year, fully-funded engagement for early-career researchers — typically post-PhD or late-stage PhD candidates — with a focus on safety, evaluations, and verification. Fellows work on a self-directed research project with senior advisors from the lab, present internally and externally, and end the year with a published deliverable. We run a single annual cohort with applications due in October. Details at /company/careers.
Open problems list. We maintain a continuously-updated list of open problems where external contributions would be especially valuable. These are problems where we have framed the question, gathered the relevant context, and have a view on what a good answer looks like, but where we do not have the bandwidth to pursue the work ourselves. The list lives at /research/open-problems; see also section 6 of the Apik Manifesto.
Collaboration channels. We co-author with peer labs, academic groups, and government safety institutes. The right starting point is usually an email to the relevant pillar lead. For collaborations with safety institutes specifically, we coordinate through our safety council and we are happy to operate under joint protocols where they make sense.
Publishing posture. We publish in venues appropriate to the work — NeurIPS, ICML, ICLR for machine learning research; CRYPTO, USENIX Security for security-relevant work; physics journals for Q-Core; chemistry journals for Synthesis. We also publish technical reports directly on the Apik site when the work does not fit cleanly into any external venue. We follow the gated-disclosure framework described in Part II for capability research that may provide differential uplift to misuse.
Critique and dissent. We mean it when we say we want this agenda to be wrong in ways we have not noticed. If you read this and you think a direction is misguided, a methodology is brittle, or an open problem is mis-framed, please write to us. The most useful possible engagement, for some readers, is to tell us where we are wrong.
Contact. The address for collaboration enquiries is research@apiksystems.com. The address for press is press@apiksystems.com. Postal mail and other channels are documented at /company/contact.
References
- Albrecht, S. and Stone, P. “Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems.” Artificial Intelligence, 2018. https://arxiv.org/abs/1709.08071
- Anthropic. “Constitutional AI: Harmlessness from AI Feedback,” 2022. https://arxiv.org/abs/2212.08073
- Anthropic. “Core Views on AI Safety: When, Why, What, and How,” 2023. https://www.anthropic.com/news/core-views-on-ai-safety
- Apollo Research. “Scheming reasoning evaluations,” 2024. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
- Baker, B. et al. “Emergent Tool Use From Multi-Agent Autocurricula,” 2019. https://arxiv.org/abs/1909.07528
- Black, K. et al. “π0: A Vision-Language-Action Flow Model for General Robot Control.” Physical Intelligence, 2024. https://www.physicalintelligence.company/blog/pi0
- Bousmalis, K. et al. “RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation,” 2023. https://arxiv.org/abs/2306.11706
- Brohan, A. et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” 2023. https://robotics-transformer2.github.io/
- Cerezo, M. et al. “Variational Quantum Algorithms.” Nature Reviews Physics, 2021. https://arxiv.org/abs/2012.09265
- Christiano, P., Shlegeris, B., and Amodei, D. “Supervising strong learners by amplifying weak experts,” 2018. https://arxiv.org/abs/1810.08575
- Davies, M. et al. “Loihi: A Neuromorphic Manycore Processor with On-Chip Learning.” IEEE Micro, 2018. https://ieeexplore.ieee.org/document/8259423
- Fawzi, A. et al. “Discovering faster matrix multiplication algorithms with reinforcement learning.” Nature, 2022. https://www.nature.com/articles/s41586-022-05172-4
- Hubinger, E. et al. “Risks from Learned Optimization in Advanced Machine Learning Systems,” 2019. https://arxiv.org/abs/1906.01820
- Kim, M. J. et al. “OpenVLA: An Open-Source Vision-Language-Action Model,” 2024. https://openvla.github.io/
- Kitaev, A. “Fault-tolerant quantum computation by anyons,” 1997. https://arxiv.org/abs/quant-ph/9707021
- McClean, J. R. et al. “Barren plateaus in quantum neural network training landscapes.” Nature Communications, 2018. https://www.nature.com/articles/s41467-018-07090-4
- Merolla, P. et al. “A million spiking-neuron integrated circuit with a scalable communication network and interface.” Science, 2014. https://www.science.org/doi/10.1126/science.1254642
- METR. “Measuring AI Ability to Complete Long Tasks,” 2025. https://arxiv.org/abs/2503.14499
- Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. (eds.). Algorithmic Game Theory. Cambridge University Press, 2007. https://www.cambridge.org/core/books/algorithmic-game-theory/0092C07CA8B724E1C1BE3043387F4B53
- Olah, C. et al. “Zoom In: An Introduction to Circuits.” Distill, 2020. https://distill.pub/2020/circuits/zoom-in/
- Open X-Embodiment Collaboration. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models,” 2023. https://robotics-transformer-x.github.io/
- Physical Intelligence. “π0.5: A VLA with Open-World Generalization,” 2025. https://www.physicalintelligence.company/blog/pi05
- Preskill, J. “Quantum Computing in the NISQ era and beyond.” Quantum, 2018. https://arxiv.org/abs/1801.00862
- Roughgarden, T. Twenty Lectures on Algorithmic Game Theory. Cambridge University Press, 2016. https://www.cambridge.org/core/books/twenty-lectures-on-algorithmic-game-theory/A9D9427C8F43E7DAEF8C702EBB960637
- Schick, T. et al. “Toolformer: Language Models Can Teach Themselves to Use Tools,” 2023. https://arxiv.org/abs/2302.04761
- Sebastian, A. et al. “Memory devices and applications for in-memory computing.” Nature Nanotechnology, 2020. https://www.nature.com/articles/s41565-020-0655-z
- Templeton, A. et al. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Anthropic, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/
- Wang, G. et al. “Voyager: An Open-Ended Embodied Agent with Large Language Models,” 2023. https://arxiv.org/abs/2305.16291
- Wu, Q. et al. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation,” 2023. https://arxiv.org/abs/2308.08155
- Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models,” 2022. https://arxiv.org/abs/2210.03629
- Yao, S. et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” 2023. https://arxiv.org/abs/2305.10601
- Zhang, K., Yang, Z., and Başar, T. “Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms,” 2019. https://arxiv.org/abs/1911.10635
— Rehan Temkar, Co-founder, Apik Systems · April 2026