Mechanistic Interpretability at Apik: How We Plan to Scale It

This post is about how Apik plans to do mechanistic interpretability at scale. The short version is that interpretability is, in our reading, the most under-resourced strand of safety research relative to its load-bearing weight — and that the scaling story is harder than the public discourse generally lets on. There are specific reasons it is hard, specific reasons we believe it is possible anyway, and specific bets we are making about which strands of the field to pursue. We describe all three.

What interpretability gets you, and what it doesn’t

Mechanistic interpretability is the project of reverse-engineering what trained neural networks actually compute. Not what they output — what they compute internally, in the sense of identifying the algorithms, features, and circuits that produce the output. Done well, it gives you a vocabulary in which to describe the model’s reasoning that is not the model’s own self-report. Self-report, however helpful in practice, has an obvious epistemological problem: a model that does not understand its own internals, or that has reason to misrepresent them, will produce a self-report that is not faithful to the underlying computation. Interpretability is the third-person check.

What interpretability does well, when it works: it identifies features and circuits with predictive validity over the model’s behavior on held-out distributions; it produces interventions — feature ablations, activation patching, circuit edits — whose effects on output are large and lawful in a way that lets us test causal hypotheses; it surfaces structure that would not have been visible from behavioral evaluation alone, including structure that the model itself does not surface.
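
To make the intervention idea concrete, here is a minimal sketch of activation patching on a toy two-layer network. The network, its random weights, and the names `forward`, `x_clean`, and `x_corrupt` are illustrative assumptions, not our tooling. The point is the shape of the experiment: run the model on a corrupted input, overwrite one internal activation with its value from a clean run, and measure how much of the clean output comes back.

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Toy two-layer ReLU network (illustrative only).
    `patch` optionally maps a hidden-unit index to a value that
    overwrites that unit's activation before the second layer."""
    h = np.maximum(W1 @ x, 0.0)            # hidden activations
    if patch is not None:
        h = h.copy()
        for idx, value in patch.items():
            h[idx] = value                 # the intervention
    return W2 @ h, h

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

x_clean = np.array([1.0, 0.0, -1.0])
x_corrupt = np.array([-1.0, 0.5, 1.0])

y_clean, h_clean = forward(x_clean, W1, W2)
y_corrupt, _ = forward(x_corrupt, W1, W2)
baseline_gap = np.linalg.norm(y_corrupt - y_clean)

# Patch each hidden unit's clean activation into the corrupted run, one at a
# time. A positive effect means the patched output moved toward the clean one.
effects = []
for i in range(len(h_clean)):
    y_patched, _ = forward(x_corrupt, W1, W2, patch={i: h_clean[i]})
    effects.append(baseline_gap - np.linalg.norm(y_patched - y_clean))

# Sanity check: patching every hidden unit at once reproduces the clean output.
y_all, _ = forward(x_corrupt, W1, W2,
                   patch={i: h_clean[i] for i in range(len(h_clean))})
```

In a real model the patched site would be an attention head output, an MLP output, or an SAE feature rather than a single hidden unit, but the comparison against clean and corrupted baselines has the same structure.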

What interpretability does not do, currently: it does not produce certificates. We cannot, with the current toolkit, produce a statement of the form “this model has no internal representation of intent X, with probability greater than Y.” We cannot exhaustively map the feature space of a frontier model. We cannot, in general, distinguish a model that has internalized a value from a model that has internalized a heuristic that produces value-aligned behavior in distribution and breaks out of distribution. Those are the live limitations and they are large.

We treat interpretability as a necessary input to safe deployment, not a sufficient one. The composition with formal envelopes — Project Aegis — and with behavioral evaluation is what we expect to do the load-bearing work. Interpretability earns its place as one of the three pillars of that composition.

The circuits-and-features tradition

The lineage we are inheriting from has a clear shape. The early circuits work — Olah and collaborators at OpenAI and later at Anthropic, including the Distill circuits thread and the Transformer Circuits Thread — established the methodological frame. The argument was that small, interpretable circuits could be identified inside trained networks; that those circuits implemented identifiable computations (curve detectors, induction heads, and so on); and that a careful empirical program could chain such circuits into a partial mechanistic understanding of the larger network.

The frame has held up. Induction heads, identified in transformer models, turned out to be a load-bearing structure for in-context learning. Circuits responsible for specific behaviors, such as indirect object identification (the IOI circuit and its variants) and modular arithmetic, have been identified, intervened on, and used to make falsifiable predictions about model behavior. The progress is real. The question that has dominated the field for the last several years is whether the methodology can be scaled.

Sparse autoencoders and the scaling story

The most consequential recent development is the rise of sparse autoencoders as a feature-decomposition tool. The intuition is that the features encoded in a transformer’s residual stream are not, in general, aligned with the neuron basis; the model represents more features than it has dimensions, so features are stored as overlapping, non-orthogonal directions, a phenomenon Anthropic’s Toy Models of Superposition characterized in detail. Sparse autoencoders attempt to recover a more interpretable basis by training an autoencoder with a wide hidden layer and a sparsity constraint to reconstruct the model’s activations; the dictionary atoms learned by the SAE are the candidate features.
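
For readers who want the shape of the computation, the following is a minimal sketch of one common SAE formulation: a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The dimensions, initialization, and `l1_coeff` value are illustrative assumptions, and the sketch shows only the forward pass and loss, not a training loop.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """x: (batch, d_model) activations collected from the model.
    Returns the reconstruction and the sparse feature activations."""
    f = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # non-negative codes
    x_hat = f @ W_dec + b_dec                          # rows of W_dec are dictionary atoms
    return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes codes toward sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64          # toy sizes; d_sae > d_model is the "wide" part
x = rng.normal(size=(8, d_model))

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm atoms
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

x_hat, f = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f)
```

Each row of `W_dec` is a candidate feature direction in the residual stream, and each column of `f` says how strongly that feature fires on each input.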

The empirical results have been encouraging. Anthropic’s Towards Monosemanticity and the follow-on Scaling Monosemanticity work demonstrated that SAEs at scale can recover features that are interpretable, semantically rich, and behaviorally meaningful, including features for specific concepts, some of which have a safety-relevant interpretation. Google DeepMind’s Gemma Scope released open SAEs for every layer and sublayer of the Gemma 2 2B and 9B models and made the artifact available to the research community.

The scaling story is harder than the headline implies. Three obstacles dominate.

The first is compute. SAEs that are wide enough to recover the feature dictionary of a frontier model are themselves large. Training them requires substantial activation collection, substantial autoencoder training compute, and substantial evaluation overhead. The compute scales with the size of the model whose features you are trying to recover, and it scales unfavorably; we are not yet in a regime where you can SAE-decompose a 10^11+ parameter model on a research budget.
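
A back-of-envelope calculation shows why the SAEs themselves get large. The numbers below are illustrative assumptions (a residual width of 4096, a 32x expansion factor, one billion tokens of 16-bit activations) and do not describe any particular model or any Apik configuration.

```python
def sae_param_count(d_model, expansion):
    """Parameters in a single SAE: encoder and decoder matrices plus two biases."""
    d_sae = d_model * expansion
    return 2 * d_model * d_sae + d_sae + d_model

def activation_bytes(n_tokens, d_model, bytes_per_value=2):
    """Storage for activations collected at one site in the model."""
    return n_tokens * d_model * bytes_per_value

params_per_site = sae_param_count(d_model=4096, expansion=32)   # roughly 1.07e9
storage_per_site = activation_bytes(n_tokens=10**9, d_model=4096)  # roughly 8.2e12 bytes
```

Under these assumptions a single SAE at a single site has about a billion parameters and needs about 8 TB of stored activations, and a full decomposition multiplies both by the number of layers and sublayers covered.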

The second is dataset. The features an SAE recovers are functions of the activations it sees. If the activation distribution is narrow, the feature dictionary is narrow. Building activation datasets that span the range of behaviors the deployed model will exhibit is a research problem in its own right, and the field has not converged on a methodology.

The third is evaluation. Once you have an SAE-decomposed feature space, how do you know whether you have done it well? Reconstruction loss is a necessary but insufficient signal. Interpretability scores require human raters and do not scale. Downstream task performance under feature ablation is the most behaviorally meaningful signal we have, and it is expensive to compute. The field has produced several proposals; none has the status of a settled benchmark.
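
The cheaper signals mentioned above can be sketched in a few lines. The function names are ours for illustration, and `loss_recovered` is one normalization in common use rather than a settled standard: it compares the model's loss with the SAE reconstruction spliced in against the loss with the activation ablated entirely.

```python
import numpy as np

def fraction_variance_unexplained(x, x_hat):
    """Reconstruction error relative to the variance of the activations.
    0.0 is perfect reconstruction; 1.0 is no better than predicting the mean."""
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return resid / total

def mean_l0(f):
    """Average number of active features per input (lower is sparser)."""
    return np.mean(np.count_nonzero(f, axis=-1))

def loss_recovered(loss_clean, loss_spliced, loss_ablated):
    """1.0: splicing in the SAE reconstruction costs nothing downstream.
    0.0: splicing is as damaging as ablating the activation entirely."""
    return (loss_ablated - loss_spliced) / (loss_ablated - loss_clean)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
fvu_perfect = fraction_variance_unexplained(x, x)
fvu_mean = fraction_variance_unexplained(x, np.broadcast_to(x.mean(axis=0), x.shape))
```

None of these substitutes for the expensive signals. A dictionary can score well on all three and still contain atoms that no human rater can interpret.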

The Apik approach

We are pursuing four strands, in roughly descending order of resource allocation.

Strand one: SAE methodology and scaling. We are doing direct work on improving the sample efficiency of SAE training, on architectural variants (top-k SAEs, gated SAEs, sparse mixture-of-experts decompositions), and on the activation-dataset construction problem. The goal is to push the size of model that can be SAE-decomposed at a given compute budget by a meaningful factor. We will publish a methodological writeup with reproducible benchmarks before the end of this year.
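
As an illustration of one of the variants named above, a top-k SAE replaces the sparsity penalty with a hard constraint: only the k largest pre-activations per input survive. The sketch below shows just that activation function. The final clamp to non-negative values is an assumption on our part; implementations differ on that detail.

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations in each row and zero the rest."""
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]   # indices of the top k
    out = np.zeros_like(pre_acts)
    np.put_along_axis(out, idx, np.take_along_axis(pre_acts, idx, axis=-1), axis=-1)
    return np.maximum(out, 0.0)   # assumption: clamp any negative survivors

pre = np.array([[0.5, -1.0, 2.0, 0.1],
                [3.0,  1.0, -2.0, 0.0]])
codes = topk_activation(pre, k=2)
# codes == [[0.5, 0.0, 2.0, 0.0],
#           [3.0, 1.0, 0.0, 0.0]]
```

The number of active features per input is then at most k by construction, which removes one hyperparameter (the penalty coefficient) at the cost of fixing the sparsity level in advance.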

Strand two: circuit identification on top of SAE features. The endgame is not features in isolation; it is circuits, meaning graphs of features connected by attributable computational pathways. We are extending the attribution-graph line of work to operate over SAE-decomposed feature spaces, and we are particularly interested in circuits that span both attention heads and MLP blocks rather than sitting within either alone.

Strand three: interpretability for agentic systems. A frontier agent is not just a model. It is a model embedded in a multi-step reasoning loop with tools and memory. The mechanistic interpretation of an agent’s behavior depends on the model’s internals, the tool calls, the memory state, and the inter-step dependencies. We are building a substrate for tracing across these surfaces (we call it a multi-substrate trace) and using it to identify causal patterns in agent behavior that single-step model interpretability misses.

Strand four: interpretability for monitoring. This is the bridge to Project Aegis. The runtime monitor in Aegis can incorporate interpretability signals — feature activations, circuit activations, attribution graph edges — as inputs to its invariant evaluation. The most direct path from interpretability to deployment-relevant safety, in our view, is through the monitor.
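
The simplest version of this idea is a threshold check over feature activations. The sketch below is hypothetical throughout: `FeatureInvariant`, `check_invariants`, the feature ids, and the labels are invented for illustration and do not describe the Aegis monitor's actual interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureInvariant:
    """Hypothetical invariant: a named feature must stay below a threshold."""
    feature_id: int        # index into some SAE dictionary (illustrative)
    max_activation: float
    label: str

def check_invariants(activations, invariants):
    """activations: dict mapping feature_id -> activation at the current step.
    Features absent from the dict are treated as inactive (0.0).
    Returns the labels of all violated invariants."""
    return [inv.label for inv in invariants
            if activations.get(inv.feature_id, 0.0) > inv.max_activation]

invariants = [
    FeatureInvariant(feature_id=17, max_activation=4.0, label="placeholder-feature-a"),
    FeatureInvariant(feature_id=42, max_activation=1.5, label="placeholder-feature-b"),
]
step_activations = {17: 0.3, 42: 2.2}   # made-up values for one inference step
violations = check_invariants(step_activations, invariants)
```

A production monitor would presumably need circuit-level and attribution-graph signals as well, plus calibrated thresholds, but the interface (internal signals in, violated invariants out) is the part this sketch is meant to convey.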

Open problems

Five problems are currently the most active in our internal lists. We name them so the community can call them out if we have mis-prioritized.

First, feature splitting and feature merging at scale. SAEs at different widths recover different feature dictionaries; some features split, some merge, and the relationship between dictionaries is not clean. We do not have a stable methodology for tracking features across SAE widths.
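
One simple heuristic for relating two dictionaries is to match atoms by cosine similarity of their decoder directions. The sketch below uses hand-written toy dictionaries and hypothetical names; it is a starting point for inspection, not the stable methodology this problem still lacks.

```python
import numpy as np

def match_features(D_small, D_large):
    """Rows are decoder directions. For each atom in the larger dictionary,
    return the index of, and cosine similarity to, its nearest atom in the
    smaller dictionary."""
    A = D_small / np.linalg.norm(D_small, axis=1, keepdims=True)
    B = D_large / np.linalg.norm(D_large, axis=1, keepdims=True)
    sims = B @ A.T                      # shape (n_large, n_small)
    return sims.argmax(axis=1), sims.max(axis=1)

# Toy dictionaries: the larger one contains two atoms near the first small atom.
D_small = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
D_large = np.array([[0.9, 0.1,  0.3],
                    [0.9, 0.1, -0.3],
                    [0.0, 2.0,  0.0]])

nearest, sim = match_features(D_small, D_large)

# Group large-dictionary atoms by their nearest small-dictionary atom.
# A small atom with several children is a candidate split.
children = {j: np.flatnonzero(nearest == j).tolist() for j in range(len(D_small))}
# children == {0: [0, 1], 1: [2]}
```

Nearest-neighbor matching of this kind says nothing about merges in the other direction, and it can pair atoms that are geometrically close but fire on different inputs, which is part of why the relationship between dictionaries is not clean.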

Second, the polysemanticity floor. Even in well-trained SAEs, a non-trivial fraction of the dictionary atoms remain polysemantic. We do not know whether this is a methodological artifact or a structural property of the model’s representation.

Third, interpretability under adversarial conditions. A sufficiently capable model that knew it was being interpreted could, in principle, modulate its internal representations to evade interpretation. We do not know how robust the current toolkit is to this and we consider it a load-bearing question for safety-relevant deployment.

Fourth, causal completeness of identified circuits. When we identify a circuit and intervene on it, we observe behavior changes. That shows the circuit is causally involved; it does not show that the circuit is the only causal pathway for the behavior. We do not, generally, have a methodology for asserting the latter, and there are reasons to believe it often does not hold.

Fifth, the evaluation gap discussed above. The community needs a settled benchmark for SAE quality, and the question of who designs it and who runs it is itself a coordination problem.

What we ship

Three commitments on artifacts.

We will release SAEs and feature dictionaries for our open Brello checkpoints. The release will include the autoencoders themselves, an inspection harness, and a feature-card schema with semantic annotations and downstream-task evaluations.
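
To give a sense of what a feature card might carry, here is a hypothetical sketch. Every field name below is an assumption made for illustration; the released schema may look quite different.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class FeatureCard:
    """Hypothetical feature-card record (field names are illustrative)."""
    feature_id: int
    layer: int
    annotation: str                                       # semantic label for the feature
    top_examples: list = field(default_factory=list)      # text snippets that activate it
    ablation_deltas: dict = field(default_factory=dict)   # task name -> metric change

card = FeatureCard(
    feature_id=0,
    layer=0,
    annotation="placeholder annotation",
    top_examples=["placeholder snippet"],
    ablation_deltas={"placeholder_task": 0.0},
)

serialized = json.dumps(asdict(card), indent=2)           # what would ship on disk
restored = FeatureCard(**json.loads(serialized))          # round-trips losslessly
```

Keeping the card as plain JSON-serializable data means an inspection harness can load it without depending on any particular framework.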

We will release the attribution-graph tooling we use internally, with reproducible examples on public model checkpoints.

We will release evaluation suites — both for SAE quality and for circuit identification — alongside the methodological writeups, and we will work with external groups on calibrating them against shared benchmarks.

The substantive ongoing work lives at research / AI safety. The deployment posture and the safety principles that govern how interpretability artifacts are used live at safety / principles. The broader research agenda is at company / research agenda.

The argument we are making, in short, is that interpretability is necessary, that scaling it is harder than the discourse acknowledges, and that the path forward is a co-design with formal verification and behavioral evaluation rather than a unilateral bet on any one of the three. The work will tell.

— Rehan Temkar, Co-founder, Apik Systems