Transparency

A consolidated index of model reports, system cards, and any incident disclosures, together with the framework that governs them. The schedule, schema, and evaluation methodology are public; the artifacts publish as the corresponding releases reach the deployment phases of our Responsible Development Policy.

Why transparency is load-bearing

The information-asymmetry argument

The most expensive thing a frontier laboratory can ship is a system whose behavior is opaque to the people deploying it. The asymmetry between what a developer knows about a model (its training data, its evaluation results, its known failure modes, the conditions under which it was red-teamed) and what an operator or end user knows is large by default and grows as capability scales. That asymmetry is not primarily a research-stage problem; researchers can read their own logs. It is a deployment-stage problem, and it compounds: an operator who cannot tell where a system is brittle deploys it past its envelope; an end user who cannot tell where a system's confidence is warranted discounts hedged outputs and takes confident outputs as fact, exactly when a confidently stated answer may be the one that most needs checking.

Public transparency artifacts — model cards, system cards, evaluation reports, and incident disclosures — are the engineering response to that asymmetry. They do not solve the underlying alignment problem; they make the alignment problem legible to the parties who have to absorb its consequences. The discipline traces to the Model Cards proposal of Mitchell et al. (2019) and the Datasheets-for-Datasets work of Gebru et al. (2018); it has since converged on a small set of conventions across the frontier laboratories — Anthropic's system cards under the Responsible Scaling Policy, OpenAI's preparedness reports under the Preparedness Framework, and DeepMind's evaluation reports under the Frontier Safety Framework.

The Apik framework documented here is in dialogue with that body of practice. We adopt the eight-field system-card schema described below; we tier disclosure against the capability thresholds of our Responsible Development Policy; we publish evaluation methodology before the first artifact lands, on the view that the methodology is itself a load-bearing claim. The Foundation Model Transparency Index (Bommasani et al., 2023) provides the external benchmark against which we calibrate.

Disclosure schedule

What publishes at each tier

The schedule below maps each capability tier in our Responsible Development Policy to the artifacts we commit to publishing alongside a release at that tier. The tiers run from AS-1 (research-only systems with no external deployment) through AS-4 (frontier-capability systems for which the residual-risk surface is actively contested). Disclosure obligations grow with tier; the strongest obligations sit at AS-3 and above, where independent assessment becomes non-optional, and at AS-4, which adds a pre-release window for external review.

Tier	Capability posture	Required public artifacts
AS-1	Research-only. No external deployment. Internal evaluation logs only.	None mandated. Methodology and evaluation suites publish on request.
AS-2	Limited external deployment under contract. Bounded population, monitored throughput, published deployment limits.	Public system card on the eight-field schema below. Capability and alignment evaluations summarized; safeguard-test results reported in aggregate.
AS-3	Broad deployment, including consumer-facing surfaces. Capability uplift on the agentic and long-horizon dimensions is the gating concern.	System card plus red-team summary, plus a published evaluation harness replicating the central capability and safeguard claims. Independent assessment summary, redacted as needed for confidentiality.
AS-4	Frontier-capability. Residual-risk surface contested or under active research. Deployment paused or extremely narrow until risks are bounded.	All AS-3 artifacts plus a 30-day pre-release window for external review, an externally conducted independent evaluation, and an explicit rollback plan published before deployment.

The schedule is conservative on purpose. Pre-release windows for AS-4 systems are longer than industry norms; the tradeoff (slower deployment, more external review) is the one we want. The schedule is part of the policy and changes only by amendment to the RDP, which itself changes under the procedure documented there.
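
The cumulative structure of the schedule can be sketched as a lookup. This is a minimal illustration that assumes only the tier labels and artifacts in the table above; the names `Tier`, `ARTIFACTS_ADDED`, and `required_artifacts`, and the short artifact labels, are hypothetical and do not describe any published Apik tooling.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Capability tiers from the schedule (AS-1 through AS-4)."""
    AS1 = 1
    AS2 = 2
    AS3 = 3
    AS4 = 4


# Artifacts each tier adds on top of the tiers below it.
# The short labels are illustrative stand-ins for the table entries.
ARTIFACTS_ADDED = {
    Tier.AS1: [],  # none mandated; methodology available on request
    Tier.AS2: ["system_card"],
    Tier.AS3: ["red_team_summary", "evaluation_harness",
               "independent_assessment_summary"],
    Tier.AS4: ["pre_release_window_30d", "external_independent_evaluation",
               "rollback_plan"],
}


def required_artifacts(tier: Tier) -> list[str]:
    """Return every artifact owed at `tier`, accumulating lower tiers."""
    owed: list[str] = []
    for t in Tier:
        if t <= tier:
            owed.extend(ARTIFACTS_ADDED[t])
    return owed


if __name__ == "__main__":
    for t in Tier:
        print(t.name, required_artifacts(t))
```

Modeling the obligations as additions per tier keeps the "All AS-3 artifacts plus" rule in one place, so a schedule amendment touches a single entry.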

System card schema

The eight fields

An Apik system card is a fixed-shape document. The schema below is the union of the conventions that have stabilized across the frontier-laboratory cohort, read against Mitchell et al. (2019) and the practical structures published in Anthropic's and OpenAI's recent system cards. We commit to all eight fields for any AS-2-or-above release; the depth of each field scales with tier.

Decision summary. The release decision in two paragraphs: what is being deployed, on what surface, to whom, under what limits, and the residual risks the decision-makers explicitly accepted. Authored by the safety council; signed.
Capability evaluations. Benchmark suite, sample sizes, sampling temperature, harness version, and results. We report performance on the public benchmarks we used and on the internal benchmarks we built — with the second set described in enough detail for an external party to replicate the design even if not the exact items. METR's task suite is one of the public reference points; longitudinal capability measurement follows Kwa et al. (2025).
Alignment evaluations. Adherence-to-constitution measurements where applicable (Bai et al., 2022), adversarial probes for in-context deception (Meinke et al., 2024), refusal calibration on dual-use queries, and the gap between behavior under evaluation and behavior under deployment-like conditions where we can measure it.
Safeguard tests. Red-team uplift on a fixed protocol; jailbreak-resistance numbers; agentic-task containment tests; tool-use envelope adherence. Reported as point estimates with bootstrapped confidence intervals where the sample size makes that meaningful, and as descriptive results with caveats where it does not.
Residual risks. The risks the safeguards do not bound. Stated explicitly, not minimized. This is the field that distinguishes a serious system card from a marketing document; the discipline is to write it as if a future incident will be read against it.
Deployment limits. Eligible population, throughput ceiling, monitored vs unmonitored channels, geographic limits where regulatory regimes require, and the time horizon of the limit. Limits are part of the release; relaxing them is a decision, not a default.
Reversal criteria. The conditions under which we pull the release back. A small number of named, falsifiable triggers (for example, a specific safeguard-failure rate exceeding a stated threshold over a stated window) plus a discretionary clause for cases the named triggers do not cover.
Independent assessment. Who reviewed the release before it shipped, what they had access to, what they concluded, and what they could not conclude. Where confidentiality constraints prevent full publication, the constraint is named.
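
The fixed shape of the card lends itself to a simple completeness check. The sketch below is an assumption-laden illustration, not a published schema file: the attribute names are derived from the eight field titles above, each field is treated as free text, and `missing_fields` and `is_publishable` are hypothetical helpers.

```python
from dataclasses import dataclass, fields


@dataclass
class SystemCard:
    """One attribute per schema field; free text here for simplicity."""
    decision_summary: str = ""
    capability_evaluations: str = ""
    alignment_evaluations: str = ""
    safeguard_tests: str = ""
    residual_risks: str = ""
    deployment_limits: str = ""
    reversal_criteria: str = ""
    independent_assessment: str = ""


def missing_fields(card: SystemCard) -> list[str]:
    """Names of schema fields that are empty or whitespace-only."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]


def is_publishable(card: SystemCard, tier: int) -> bool:
    """AS-2 and above must fill all eight fields; AS-1 has no mandate."""
    return tier < 2 or not missing_fields(card)


if __name__ == "__main__":
    draft = SystemCard(decision_summary="Partner preview under contract.")
    print(missing_fields(draft))      # the seven fields still to be written
    print(is_publishable(draft, 2))   # False
```

A real check would also need to reflect that field depth scales with tier; this sketch only tests presence.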

Evaluation methodology

How the numbers in a system card get made

The numbers in a system card are only as informative as the methodology that produced them. Apik commits to publishing the methodology alongside the numbers; where we cannot publish a specific test (for safeguard reasons — publishing a jailbreak corpus is a capability uplift), we publish the design of the test in enough detail that an external evaluator can replicate the construction.

On capability, we run the public benchmarks expected of frontier releases and we run the agentic and long-horizon evaluations developed in dialogue with METR and the broader evaluation community. The longitudinal claim that matters most concerns the fifty-percent time horizon: the length of task, measured by how long it takes a human to complete, that a model finishes with a fifty-percent success rate. Kwa et al. (2025) report that this horizon has doubled roughly every seven months in recent years, and that trajectory is the one against which our internal capability claims are calibrated.
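
As a worked illustration of what a seven-month doubling implies, the sketch below extrapolates a time horizon under the simplifying assumption that the doubling time stays constant, H(t) = H0 * 2^(t / 7). The starting horizon of 60 minutes is a made-up input, and the function names are hypothetical; nothing here is a forecast or a measured result.

```python
import math

DOUBLING_MONTHS = 7.0  # approximate doubling time cited in the text


def projected_horizon(h0_minutes: float, months_elapsed: float,
                      doubling_months: float = DOUBLING_MONTHS) -> float:
    """Horizon after `months_elapsed`, assuming a constant doubling time."""
    return h0_minutes * 2.0 ** (months_elapsed / doubling_months)


def months_to_reach(h0_minutes: float, target_minutes: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months until the horizon grows from h0 to target, same assumption."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


if __name__ == "__main__":
    # Hypothetical starting point: a 60-minute horizon.
    print(projected_horizon(60.0, 14.0))  # 240.0 (two doublings)
    print(months_to_reach(60.0, 480.0))   # 21.0 (three doublings)
```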

On alignment, the central technique is constitutional adherence under adversarial framing (Bai et al., 2022), augmented by Apollo-style probes for in-context scheming (Meinke et al., 2024) and by mechanistic-interpretability signals (Templeton et al., 2024) where we can extract them at the relevant scale. We treat the gap between evaluated and deployed behavior as a first-class measurement target, not a footnote.

On safeguards, red-team protocol design follows the practice that has stabilized across the frontier cohort: adversarial sampling, capability-elicitation prompts on dual-use surfaces, and uplift measurement against a control workflow. Sample sizes are reported. Effect sizes are reported with intervals. A safeguard claim without a reported sample size and an interval is not a claim we will publish.
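
One way to attach an interval to a safeguard-failure rate is a percentile bootstrap over per-attempt outcomes. The sketch below assumes each red-team attempt is recorded as 1 (safeguard failed) or 0 (safeguard held); the percentile method, the resample count, and the function name `bootstrap_ci` are illustrative choices, not a description of Apik's harness, and the sample data is invented.

```python
import random


def bootstrap_ci(outcomes: list[int], n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float, float]:
    """Point estimate and percentile-bootstrap interval for a 0/1 rate."""
    if not outcomes:
        raise ValueError("a rate needs at least one trial")
    rng = random.Random(seed)  # seeded so the report is reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample with replacement and record the rate in each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


if __name__ == "__main__":
    # Invented data: 12 safeguard failures in 200 attempts.
    attempts = [1] * 12 + [0] * 188
    rate, lower, upper = bootstrap_ci(attempts)
    print(f"n={len(attempts)} rate={rate:.3f} 95% CI=[{lower:.3f}, {upper:.3f}]")
```

With very small samples or zero observed failures the bootstrap interval collapses and stops being informative, which matches the schema's allowance for descriptive results with caveats in those cases.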

Pre-release pipeline

The gates before a release ships

A release does not reach the system-card stage until it has cleared a fixed sequence of internal gates. Each gate has an owner, an artifact, and a defined promotion criterion; promotion past a gate without the artifact is a process violation and is itself an incident.

Capability eval. Owned by the research team. Artifact: capability-evaluation report on the standard suite plus the agentic-tasks suite. Promotion criterion: results within the envelope expected by the release proposal; surprises trigger a re-scope.
Alignment eval. Owned by the alignment lead. Artifact: alignment-evaluation report including adversarial-probe results. Promotion criterion: no novel deceptive-alignment signals; refusal calibration within the band documented in the RDP.
Red-team. Owned by the red-team lead, conducted by an independent rotation. Artifact: red-team summary including uplift estimates and any unmitigated jailbreak chains. Promotion criterion: no unmitigated chains on the gating surfaces.
Safety-council review. Owned by the safety council, which includes at minimum the CTO, the alignment lead, and one external reviewer. Artifact: signed decision summary. Promotion criterion: unanimous sign-off on the decision summary or an explicitly recorded dissent and rationale.
Staged release. Owned by the deployment lead. Artifact: live monitoring with rollback capability and the published deployment limits. Promotion to broader release happens only if the early-stage telemetry is within the envelope predicted by the evaluation phase.
Public system card. Published at the same moment broader release goes live. Updated on a quarterly cadence and on any material change.
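
The gate sequence above behaves like an ordered checklist in which a missing artifact blocks promotion. The sketch below models the five internal gates that precede the public system card; the gate identifiers, the `Release` class, and the `ProcessViolation` exception are hypothetical names chosen for illustration, and the promotion criteria themselves (which require human judgment) are not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional

# The five internal gates, in the order listed above.
GATES = [
    "capability_eval",
    "alignment_eval",
    "red_team",
    "safety_council_review",
    "staged_release",
]


class ProcessViolation(Exception):
    """Raised when promotion is attempted without the gate's artifact."""


@dataclass
class Release:
    name: str
    artifacts: dict[str, str] = field(default_factory=dict)
    cleared: int = 0  # number of gates cleared so far

    def next_gate(self) -> Optional[str]:
        """The gate awaiting promotion, or None once all are cleared."""
        return GATES[self.cleared] if self.cleared < len(GATES) else None

    def promote(self) -> str:
        """Clear the next gate; refuse if its artifact is not on file."""
        gate = self.next_gate()
        if gate is None:
            raise ValueError("all gates already cleared")
        if gate not in self.artifacts:
            raise ProcessViolation(f"{self.name}: no artifact for {gate}")
        self.cleared += 1
        return gate

    def ready_for_system_card(self) -> bool:
        return self.cleared == len(GATES)


if __name__ == "__main__":
    release = Release("example-release")
    release.artifacts["capability_eval"] = "capability-evaluation report"
    print(release.promote())    # capability_eval
    print(release.next_gate())  # alignment_eval
```

Raising an exception rather than returning a flag mirrors the rule that promotion without the artifact is itself an incident, not a recoverable warning.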

Incident-disclosure taxonomy

What counts as material, and how fast we say it

Not every safety event is a public incident, and not every public incident is a material one. The taxonomy below names the three classes of event we treat as material — the ones that trigger a public disclosure under the Responsible Development Policy. Less serious events are logged internally and summarized in the quarterly transparency update; the line between the two is defined here so it can be argued with rather than guessed at.

Capability surprise. A post-release demonstration of a capability that was not present in the pre-release evaluation. The disclosure window is 30 days from internal triage. The disclosure includes: what the capability is, how we measured that it is real, why our evaluation missed it, and what we changed in the evaluation suite as a result.
Behavioral failure. A deployed system exhibited the behavior its safeguards were designed to block, in conditions the deployment limits were supposed to cover. The disclosure window is 30 days from internal triage. The disclosure includes: the failure mode, the user-impact estimate, the rollback or mitigation action taken, and the safeguard change.
Confidentiality breach. A deployed model returned material it should not have — training-time secrets, user data from another session, or content under confidentiality. The disclosure window is shorter (7 days) because the downstream remediation depends on user notification. Disclosure follows the responsible-disclosure pattern documented at RDP §incident-response.

Internal triage starts the clock; the clock does not pause for legal review. We coordinate with regulators where required and with affected parties where feasible, but the public-disclosure window is fixed.
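
The fixed windows reduce to date arithmetic from the triage date. The sketch below assumes the windows are counted in calendar days, which the text implies by stating that the clock does not pause but does not say outright; the dictionary keys and the function name `disclosure_deadline` are hypothetical, and the triage date is invented.

```python
from datetime import date, timedelta

# Disclosure windows in days, per the taxonomy above.
# Calendar days are an assumption; the text does not specify.
DISCLOSURE_WINDOW_DAYS = {
    "capability_surprise": 30,
    "behavioral_failure": 30,
    "confidentiality_breach": 7,
}


def disclosure_deadline(incident_class: str, triage_date: date) -> date:
    """Latest public-disclosure date for an incident triaged on triage_date."""
    if incident_class not in DISCLOSURE_WINDOW_DAYS:
        raise ValueError(f"not a material incident class: {incident_class}")
    return triage_date + timedelta(days=DISCLOSURE_WINDOW_DAYS[incident_class])


if __name__ == "__main__":
    triaged = date(2026, 3, 1)  # invented triage date
    print(disclosure_deadline("confidentiality_breach", triaged))  # 2026-03-08
    print(disclosure_deadline("behavioral_failure", triaged))      # 2026-03-31
```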

System cards

Per-release safety documentation

For every release at AS-2 or above, we ship a system card alongside the release on the schema described above. The first cards will publish as the corresponding releases reach the public-deployment phase of our Responsible Development Policy. This index will update at that point. Earlier-tier work (AS-1) does not publish system cards by default; methodology and evaluation suites are available on request to research@apiksystems.com.

Coming releases

Pipeline

Release	Stage	Timing
Project Aegis · v0.1	Aegis envelope toolkit (internal)	Internal · 2026
Brello AI	Partner-API preview	Q2 2026
Senwitt	Private beta	Q2 2026

Incident disclosures

None to date

There are no incident disclosures at this time. Per the taxonomy above, we publicly disclose any material safety incidents within the disclosure windows stated, regardless of internal-review or legal-review status. This page will update if that changes.
