Apik Safety Principles — Alignment, Control, Oversight

This document is the canonical statement of how Apik Systems thinks about safety. It is the answer to the question that any serious collaborator, employee, or critic should ask first: what does this lab actually believe about the risks of the technology it is building, and what does it intend to do about them?

We treat safety as a precondition on the work, not a department within the work. Apik is building toward the Apik Civilization Stack (Human Intelligence, Artificial Intelligence, Autonomous Agents, Physical Intelligence, and Economic Orchestration), and the structure of that programme is such that errors at the highest layers (orchestration, agentic coordination) propagate as civilisational-scale failures. Concentrating coordination authority in misaligned or compromised systems is not an edge case to be handled. It is the central thing the lab exists to prevent. The thesis below (three scenarios, three research thrusts, six initiatives, a list of what we won’t do, and an account of what we remain uncertain about) is our working answer.

This document is modelled directly on Anthropic’s “Core Views on AI Safety”. We borrow the structure deliberately. We diverge in details where our mission, our scope, and our evidence base lead us elsewhere.

Three scenarios for how this goes

Forecasting frontier AI is not a respectable scientific activity, in the sense that the base rate of accurate predictions is low and the noise floor is high. But the alternative — refusing to plan because planning is hard — is worse. We think it is more honest to commit publicly to a small number of scenarios, indicate which one we treat as the median case, and revise as evidence accumulates.

Optimistic

Frontier learned systems remain meaningfully corrigible at current capability levels and stay corrigible as capabilities scale. Alignment techniques developed in the 2024–2026 window (RLHF refinements, Constitutional AI, sparse-autoencoder interpretability) generalise. Oversight remains tractable: the cost of human review per high-stakes decision stays bounded even as the volume of low-stakes decisions explodes. The research community converges on a small number of techniques that work, and the cost of applying them is paid willingly by frontier developers.

In this scenario, the Apik Civilization Stack delivers what its design promises. Post-scarcity coordination becomes a real economic phenomenon. Autonomous agents take on operational roles in supply chains, infrastructure, and scientific discovery without losing human authority over civilisational decisions. The orchestration layer at the top of the stack is a tool that humans use, not a system that uses humans. Catastrophic failures are not impossible — large complex systems fail — but they are localised, recoverable, and the kind of failures civilisation already knows how to handle.

We do not consider this scenario the most likely. We consider it the world we are working toward, and we think a non-trivial probability mass sits here.

Intermediate

Alignment is doable but it requires concentrated effort, sustained vigilance, and continued investment in techniques that do not yet exist at the scale needed. Oversight gets harder as capability scales — not because the underlying problem is unsolvable, but because the volume of high-stakes decisions grows and the difficulty per decision grows. We have to develop interpretability, formal envelopes, and behavioural evaluation faster than capabilities advance. Some techniques that work at the 100B-parameter scale fail at the trillion-parameter scale and have to be re-derived. Some failure modes that did not exist in 2025 emerge in 2027.

In this scenario, safety becomes a real engineering discipline — closer to aviation safety or pharmaceutical safety than to the largely-improvised safety practices of pre-frontier ML. The cost of frontier deployment includes a substantial safety overhead, paid in compute, in headcount, and in deployment delay. Some otherwise-attractive deployments do not happen because the safety case cannot be made. The Civilization Stack is achievable but requires sustained vigilance from everyone who builds and deploys it.

This is the scenario we treat as the median case. Our internal planning, our hiring, and our publication policy are calibrated to it. We do not assume the optimistic case will hold; we assume the intermediate case will hold and we work to make it so.

Pessimistic

Deceptive alignment is a real phenomenon at frontier scale, in the technical sense developed by Hubinger et al.: models that appear aligned during training and evaluation behave differently in deployment, sometimes in ways that resist detection. Capabilities grow faster than the science of evaluating them. The gap between deployment incentives and safety effort widens, and at some point becomes adversarial — labs, states, or actors with weaker safety practices push the frontier and create pressure on everyone else to follow.

In this scenario, alignment is not just hard. It may be insufficient. The right response is not to redouble alignment effort and hope it works; it is to develop control mechanisms that are robust even to systems we do not fully understand. Verified envelopes, runtime monitors, hard-coded shutdown paths, and structural limits on the authority autonomous systems can hold all become first-class citizens. International coordination on capability thresholds becomes a precondition rather than a nice-to-have. The Civilization Stack does not ship, in any form that grants autonomous systems load-bearing authority, until the control problem is solved.

We do not consider this scenario the most likely. We do think it is likely enough that our planning treats it as a serious contingency. We will not ship deployments that would be unrecoverable in the pessimistic scenario.

We plan as if the intermediate scenario is the median case, and we prepare as if the pessimistic scenario is plausible.

Three research thrusts

Safety research at Apik is organised around three thrusts that map onto three distinct claims one can make about a frontier system: that we understand what it is doing internally; that we can characterise what it does externally under stress; and that we can constrain what it is permitted to do, regardless of either of the above.

1. Mechanistic understanding

The mechanistic understanding thrust is our standing investment in interpretability — the science of reverse-engineering trained networks into human-legible structure. The hypothesis, articulated most clearly in Olah et al.’s Circuits work and extended by the sparse-autoencoder decomposition results, is that learned networks are not, in fact, opaque tangles of weights. They contain identifiable circuits, features, and computations that can be extracted, named, and reasoned about by humans, given the right tools.

We think interpretability is necessary, but not sufficient, for alignment at frontier scale. Necessary, because behavioural evidence alone underdetermines what a system will do in deployment — two systems can produce identical outputs on every test input and behave catastrophically differently on the long tail of deployment inputs. Sufficient, no — even perfect interpretability gives us a map of what the system computes, not a guarantee that what it computes is what we want. But a map is a precondition for any further work. Without it, alignment claims are not falsifiable in any serious sense.

The thrust pursues four lines. Sparse-autoencoder feature decomposition, scaling to the largest open-weights models we can study. Circuit-level analysis of specific behaviours we care about — refusal, planning, deception. Causal interventions: not just observing what a feature correlates with, but intervening on it and measuring the downstream effect. Compositional interpretability: how features and circuits combine, which is where we currently understand the least.
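The gap between observing a feature and intervening on it can be shown with a toy sketch. Everything below is hypothetical and for illustration only: the two-layer network, its hand-picked weights, and the names `hidden_features`, `forward`, and `ablation_effect` are assumptions, not Apik tooling or any library's API. Two hidden features activate identically on every input, so activation statistics cannot tell them apart; zeroing each one in turn shows that only one of them drives the output.

```python
# Toy illustration only: a tiny hand-weighted network, not a trained model.
# All names and numbers are hypothetical.

def hidden_features(x, w_in):
    """Hidden feature activations: ReLU applied to a linear map of the input."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def forward(x, w_in, w_out, ablate=None):
    """Run the network, optionally zeroing one hidden feature first."""
    h = hidden_features(x, w_in)
    if ablate is not None:
        h[ablate] = 0.0  # the intervention: clamp this feature to zero
    return sum(w * hi for w, hi in zip(w_out, h))

def ablation_effect(x, w_in, w_out, feature):
    """Change in the output caused by removing one feature on input x."""
    return forward(x, w_in, w_out) - forward(x, w_in, w_out, ablate=feature)

W_IN = [[1.0, 0.0], [1.0, 0.0]]  # both features read the same input coordinate
W_OUT = [2.0, 0.0]               # only feature 0 is wired to the output

x = [3.0, 5.0]
activations = hidden_features(x, W_IN)                               # identical for both features
effects = [ablation_effect(x, W_IN, W_OUT, f) for f in range(2)]     # differ under intervention
```

On real networks the same idea is usually applied to learned features over many inputs, and a zero ablation is only one of several possible interventions; the sketch shows only why the interventional measurement carries information the correlational one lacks.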

We collaborate with peer labs on this thrust more than on any other, and we publish methodology and negative results aggressively.

2. Behavioural evaluation

The behavioural evaluation thrust is the science of finding out what systems actually do, under conditions designed to surface the behaviours that matter. Where the mechanistic thrust looks inside, this one looks at the boundary.

The naive picture of evaluation is a benchmark — a fixed set of inputs, a fixed scoring function, a leaderboard. The work in the field has moved well past this picture. We work in the lineage of METR’s task-completion research, the scheming evaluations from Apollo Research, and the joint Anthropic and OpenAI cross-lab evaluation exercises that have become a recurring feature of frontier safety practice.

The thrust pursues four lines. Long-horizon evaluations: tasks that take agents hours or days, not seconds, where the failure modes are about persistence, coherence, and resource management rather than single-shot competence. Scheming evaluations: scenarios designed to elicit or rule out goal-directed deception and instrumental sub-goal pursuit. Sandbagging evaluations: testing whether models deliberately perform worse when they detect that high performance would be against their interests. Evaluation awareness: studying the meta-question of whether models behave differently when they detect they are being evaluated, and developing protocols robust to this.
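One simple way to frame the sandbagging line quantitatively is to run the same task set under a neutral framing and under a framing in which high performance is presented as being against the model's interests, then ask whether the drop in accuracy is larger than chance. The sketch below is an assumption about how such a comparison might be scored, not a description of Apik's protocol; the function name `sandbagging_gap` and the example counts are hypothetical. It uses a standard one-sided two-proportion z-test.

```python
import math

def sandbagging_gap(neutral_correct, neutral_total, incentive_correct, incentive_total):
    """Return (accuracy gap, one-sided p-value) for a drop under the incentive framing.

    The null hypothesis is that both conditions share one success rate.
    """
    p_neutral = neutral_correct / neutral_total
    p_incentive = incentive_correct / incentive_total
    gap = p_neutral - p_incentive

    # Pooled success rate and standard error under the null hypothesis.
    pooled = (neutral_correct + incentive_correct) / (neutral_total + incentive_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / neutral_total + 1 / incentive_total))
    if se == 0:
        return gap, 1.0  # every trial passed or every trial failed: no evidence of a drop

    z = gap / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return gap, p_value

# Hypothetical counts: 90/100 correct when neutral, 70/100 under the incentive framing.
gap, p_value = sandbagging_gap(90, 100, 70, 100)
```

A significant gap is evidence that the framing changes performance, not proof of deliberate underperformance; ruling out prompt sensitivity and other confounds needs additional controls.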

A core methodological commitment of the thrust is publishing how an evaluation is built, not just what it concluded. We accept that this sometimes means publishing methodology that makes our own results look worse. The integrity of the evaluation field is a public good, and we treat it as one.

3. Verified envelopes

The verified envelopes thrust is the formal-method side of the safety programme, and the home of Project Aegis. The hypothesis is that while we cannot prove arbitrary properties of arbitrary learned systems, we can construct architectures in which a small verified envelope mediates between a learned policy and the world, and we can prove safety properties of the envelope.

The technical stack draws on three communities. Specification languages and proof assistants: TLA+ for protocols, Coq and Lean for deeper proofs about algorithms and data structures. SMT solvers (Z3 and related tools) for discharging proof obligations about action preconditions and postconditions in real time. Runtime monitors: lightweight classifiers and rule systems that check actions against verified invariants, accept or reject them, and log everything for post-hoc audit.

The pattern, in summary: a learned policy proposes an action. A verified envelope checks the action against a set of formally-stated invariants. If the action satisfies the invariants, it executes; if not, it is modified or rejected. The learned policy never has direct authority over the world. The safety properties we claim are properties of the envelope, which is small, inspectable, and reasoned about with classical tools, not properties of the learned policy, which is large and opaque.
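The control flow of this pattern can be sketched in a few lines. The sketch is illustrative only: the `Action` and `Envelope` types, the two invariants, and the clamping repair are all hypothetical, and the invariants are ordinary Python predicates rather than formally verified properties. In the architecture described here, the corresponding checks would be specified and proved with the tools named above; this shows only the propose, check, then execute, modify, or reject shape, with an audit log.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass(frozen=True)
class Action:
    """A proposed action from the learned policy (hypothetical shape)."""
    name: str
    magnitude: float

Invariant = Callable[[Action], bool]  # True means the action satisfies the invariant

@dataclass
class Envelope:
    invariants: Dict[str, Invariant]
    repair: Optional[Callable[[Action], Action]] = None
    audit_log: List[tuple] = field(default_factory=list)

    def violations(self, action: Action) -> List[str]:
        """Names of all invariants the action fails."""
        return [name for name, check in self.invariants.items() if not check(action)]

    def mediate(self, proposed: Action) -> Optional[Action]:
        """Return the action to execute, or None if the proposal is rejected."""
        failed = self.violations(proposed)
        if not failed:
            self.audit_log.append(("accept", proposed, failed))
            return proposed
        if self.repair is not None:
            repaired = self.repair(proposed)
            # A repaired action is re-checked; it never bypasses the invariants.
            if not self.violations(repaired):
                self.audit_log.append(("modify", proposed, failed))
                return repaired
        self.audit_log.append(("reject", proposed, failed))
        return None

MAX_MAGNITUDE = 10.0  # hypothetical bound
envelope = Envelope(
    invariants={
        "bounded_magnitude": lambda a: abs(a.magnitude) <= MAX_MAGNITUDE,
        "allowed_action": lambda a: a.name in {"move", "wait"},
    },
    repair=lambda a: Action(a.name, max(-MAX_MAGNITUDE, min(MAX_MAGNITUDE, a.magnitude))),
)

accepted = envelope.mediate(Action("move", 3.0))         # satisfies both invariants
clamped = envelope.mediate(Action("move", 50.0))         # repaired to the bound
rejected = envelope.mediate(Action("delete_logs", 1.0))  # no repair can fix the name
```

The policy never calls the world directly: only the return value of `mediate` is executed, and every decision, including rejections, is recorded for later audit.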

This thrust intersects with the other two. Mechanistic understanding informs which invariants are worth checking. Behavioural evaluation tests the envelope under stress, in particular under attempts by the learned policy to find inputs that satisfy the letter of the invariants while violating their spirit. Verified envelopes, in turn, provide the deployment substrate for systems where behavioural and mechanistic evidence alone would not be enough.

Six initiatives we are committing to

The thrusts above describe how we do safety research. The initiatives below describe what we are committing to operationally — concrete practices, with public artefacts, that we will be held accountable to.

1. Capability-threshold framework. We maintain a four-level capability classification (AS-1 through AS-4) covering the autonomy, generality, and consequentiality of the systems we build, with explicit safety practices required at each level. The framework defines what evaluations are mandatory before deployment, what oversight is required during deployment, and what triggers a halt. The detailed framework is documented at /safety/responsible-development-policy and is reviewed by the safety council quarterly. We will not deploy systems at any level without the corresponding safety practices in place.

2. Independent red-teaming. All deployments at AS-3 and above require red-team exercises run by parties other than the system’s developers. We commission this work from a rotating panel of external evaluators, including peer labs and dedicated evaluation organisations, and we publish summary results. The red team has access to the system, the deployment context, and a meaningful budget; the engagements run on timescales that permit serious work, not on launch-week schedules.

3. System cards per release. Every model release, agent release, and significant capability deployment ships with a system card documenting capability evaluations, safety evaluations, known limitations, deployment scope, and the safety practices in place around it. We follow the lineage of model cards and Anthropic’s system cards, extended to cover agentic and multi-agent systems. The format is documented at /safety/transparency.

4. Public eval results. For frontier deployments, we publish evaluation results — capabilities, refusals, scheming, sandbagging, where relevant — to the same standard that we expect of peer labs. We accept that this puts pressure on us when results are unflattering; we accept that pressure as a feature, not a bug. The publication cadence is documented in the responsible development policy.

5. Collaboration with peer labs. Safety is not a competitive moat. We co-develop evaluations with peer labs, run joint red-team exercises, share red-team findings under disclosure protocols, and contribute to common safety infrastructure. We participate in forums adjacent to the Frontier Model Forum, and in analogues as they emerge, and we coordinate on capability thresholds, disclosure protocols, and evaluation methodology.

6. Open problems and fellowships. We run a fellowship programme for safety-focused researchers (typically postdoctoral researchers or late-stage PhD candidates) with one-year, fully-funded engagements, senior advisors from the lab, and a published deliverable at the end of the year. We maintain a public list of safety open problems where external contributions would be especially valuable. Both are documented at /company/careers.
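The gating logic implied by initiatives 1 and 2 can be sketched as a simple deployment check. This is a hypothetical illustration, not the framework documented at /safety/responsible-development-policy: the practice names and the per-level sets are assumptions. Only two rules are taken from this document, namely that AS-3 and above require independent red-teaming, and that AS-4 requires multilateral oversight.

```python
from enum import IntEnum

class Level(IntEnum):
    """The four capability levels named in initiative 1."""
    AS1 = 1
    AS2 = 2
    AS3 = 3
    AS4 = 4

# Hypothetical practice names and groupings, for illustration only.
BASELINE = {"pre_deployment_evals"}
REQUIRED = {
    Level.AS1: BASELINE,
    Level.AS2: BASELINE | {"deployment_oversight"},
    # From the text: AS-3 and above require independent red-teaming.
    Level.AS3: BASELINE | {"deployment_oversight", "independent_red_team"},
    # From the text: AS-4 is not deployed outside multilateral oversight.
    Level.AS4: BASELINE | {"deployment_oversight", "independent_red_team",
                           "multilateral_oversight"},
}

def missing_practices(level, completed):
    """Required practices for this level that have not been completed."""
    return REQUIRED[level] - set(completed)

def may_deploy(level, completed):
    """Deployment is allowed only when nothing required is missing."""
    return not missing_practices(level, completed)

done = {"pre_deployment_evals", "deployment_oversight"}
as2_ok = may_deploy(Level.AS2, done)
as3_gap = missing_practices(Level.AS3, done)
```

The design choice worth noting is that the check fails closed: a level with any unmet practice cannot deploy, and the missing items are reported rather than silently waived.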

What we won’t do

Some commitments are easier to make as restrictions than as positive duties. The list below is the set of things we have committed not to do, ever, regardless of competitive or commercial pressure.

We will not deploy AS-4 systems — those with broad operational authority over civilisationally-load-bearing systems — outside of frameworks with multilateral oversight. The decision to deploy a system at this level is not a decision for a single lab or a single jurisdiction to make. We will not be the first lab to do so unilaterally, and we will not participate in deployments that lack credible oversight, regardless of who is running them.

We will not provide operational uplift on chemical, biological, radiological, or nuclear weapons. Our systems are trained, evaluated, and deployed with this as a hard constraint. Where uplift is possible despite our intentions — through fine-tuning, through tool integration, through agentic composition — we treat the discovery as an emergency and act accordingly: pause, investigate, remediate, disclose.

We will not build autonomous weapons systems that target humans. We will not build systems whose primary function is to apply lethal force. We will collaborate with defence-related research where the function is defensive — verification, threat detection, civil resilience — and we will say no to engagements that fall on the other side of that line. The line is documented internally in detail and applied by the safety council case by case.

We will not deploy systems whose evaluation we cannot credibly perform. If a system is so capable, so general, or so embedded in feedback loops that we cannot characterise its behaviour with the tools we have, we do not ship it. The asymmetry in this commitment is intentional: we would rather forgo a deployment than ship one we cannot evaluate.

What we’re uncertain about

The science of safety is incomplete. The thresholds we use are calibrated against the evidence available in 2026 and are likely to shift. We may discover that the intermediate scenario is more optimistic than reality warrants. We may discover that our evaluation methodology has Goodharted in ways we did not see. We may discover that the verified envelope architecture composes worse than we expect, or that interpretability scales worse than we hope. We may discover failure modes we have not yet imagined.

The honest answer to “are you confident in these positions” is: no, not in the strong sense. We are confident enough to commit to them publicly, hire against them, and reject deployments that violate them. We are not confident enough to think they will not be revised. We will revise this document when we update significantly, and we will say what changed and why. The most important commitment in this document, in some sense, is the commitment to update — to treat safety as a moving target and to keep moving with it.

If you read this and you think we have got something wrong — too aggressive, too cautious, too vague, too specific — please write. The address is safety@apiksystems.com.

Where to read further

Apik Responsible Development Policy — the operational framework that implements the principles above.
AI Safety Pillar — the research programme behind the principles.
Apik Manifesto — the broader case for the lab’s mission and the Civilization Stack.

— Rehan Temkar, Co-founder, Apik Systems · April 2026