Part I · Foundations

Mechanistic interpretability — looking inside the model

A field guide to the third bucket past behavioural and representational work — what it means to actually find the algorithm a network has learned, with three landmark examples, the toolkit researchers use, and an honest account of what the field has not yet solved.

A frontier model writes a working Rust program. It picks Vec<T> over Box<[T]> in the third line, threads ownership cleanly, handles a borrow-checker corner. You can ask it why and it will produce a confident sentence about heap allocation and growable buffers. That sentence is not a causal trace. It is the model doing what models do — generating a plausible continuation of "explain your choice." The actual computation that produced Vec happened across a few hundred matrix multiplications inside a stack of attention layers, and the model has no introspective access to any of it. Neither do you.

That is the gap this piece is about. The model works. We can measure that it works. We can hill-climb the benchmarks. We can ship products. What we cannot do — yet, except in a handful of small, carefully-studied cases — is read off the algorithm. Mechanistic interpretability is the bet that the algorithms are in there to be read, and the slow craft of trying to read them. It is the part of the field that takes seriously the idea that neural networks are programs we wrote without understanding.

This is a companion to the anatomy of a frontier model chapter. That chapter is a map of what a frontier model is made of. This one is about what it would mean to know what's happening inside that map at the level of circuits, and why a number of serious researchers think getting there is among the most urgent open problems in AI.

Contents

The mystery you can't ignore · the introspection gap
What "mechanistic" actually means · three buckets, one distinction
Three landmark examples · induction heads, IOI, SAEs
The toolkit · patching, probing, lenses
A feature explorer · interactive demo
Why this matters — Amodei's urgency framing
What "doing the work" looks like entry points
Open problems & honest limits
Further reading

§ 01 · The opening problem

The mystery you can't ignore

In May 2024, Anthropic published a demo they called Golden Gate Claude. They had isolated, inside a production-class model, a single internal direction — a "feature" in their terminology — that fired whenever the model was thinking about the Golden Gate Bridge. Then they cranked that feature's activation up by hand during inference and let users talk to the result. Ask Golden-Gate-Claude for a chocolate-chip cookie recipe and it offered to bake the cookies on the bridge; ask it the best way to drive from LA to Phoenix and it suggested taking the Golden Gate, somehow. The model couldn't stop bringing the bridge up. It also, charmingly, knew something was wrong and apologised for it.

That demo is what mechanistic interpretability looks like when it works. Anthropic was able to point at a specific direction in the model's internal state and say this is what activates when the concept ‘Golden Gate Bridge’ is present — and then prove the claim was causal by perturbing that direction and watching the behaviour change. That kind of pointer is, today, exceedingly rare. For most things a frontier model does, no one can point at anything. The model writes a working Rust program and picks Vec<T> over Box<[T]> in the third line, and we cannot tell you which of its eighty-odd layers cast the deciding vote, or whether there is a "deciding vote" at all rather than a few hundred small biases summing in a direction nobody planned.

The stakes are practical. We are shipping language models that pass bar exams, write production code, and conduct themselves as plausible junior collaborators. The companies shipping them have, by any reasonable accounting, a thinner understanding of how those models do what they do than a metallurgist has of how a bridge holds up its own weight. The bridge metaphor flatters us: a metallurgist can at least point at the steel.

This is not a rhetorical flourish. Take any concrete capability — say, the model's ability to track which speaker said what in a multi-turn dialogue, or to refuse a request when a particular safety policy is triggered. We can observe that the capability is present. We can construct evals that probe its limits. What we cannot, in general, do is point to the specific computation that produces it. We do not know which attention heads, in which layers, attending to which tokens, with which residual-stream features, are doing the work. We do not know whether the same circuit handles politely-phrased and rudely-phrased versions of the same request. We do not know whether removing a particular component would degrade the capability gracefully or catastrophically, because we have not, in most cases, identified the component.

You may want to object: but the model can explain itself. It can. It explains itself fluently. Those explanations are, however, generated by the same machinery whose internals we cannot inspect, and they are produced after the fact, conditioned on the question rather than on the trace of the original forward pass. They are not unlike asking a person why they fell in love with someone. The answer will be sincere and coherent and rich with detail, and it will also be a story the speaker is constructing on the spot from materials available to the speaker — not a log of the causal process. The introspection report is generation, not telemetry.

The asymmetry that drives the field

We can build systems whose behaviour is impressive much faster than we can build the tools to understand that behaviour. Capability and interpretability scale at different rates. This is the substrate fact: every quarter the gap widens, and every quarter "we will figure out interpretability later" gets a little less defensible.

That is the opening. The rest of this piece is about what people are doing about it, what they have managed to actually figure out, and what it would take to learn enough that you could stand behind a deployed model the way a structural engineer can stand behind a bridge.

§ 02 · Setting the terms

What "mechanistic" actually means

"Interpretability" is a wide tent. People use the word to mean things that range from "the model returns logprobs you can read" to "we have proven that this circuit implements the following algorithm." It is worth pulling apart at least three buckets, because the field's distinctive claim — and its distinctive ambition — lives in the third one.

Bucket 1

Behavioural

Prompt the model. Look at the outputs. Infer what it can and can't do. This is what nearly every published eval is. Cheap, fast, scales to any black-box system.

Strongest claim: The model behaves like X under conditions Y.

Bucket 2

Representational

Train a small classifier — a "probe" — on the hidden activations of an intermediate layer. If the probe can recover some property (sentiment, syntax, factual knowledge) from the activations, the property is encoded somewhere in there.

Strongest claim: The information is present in the representation.

Bucket 3

Mechanistic

Identify the specific neurons, attention heads, and computational paths — a circuit — that implement a behaviour, then verify causally: ablate the circuit and watch the behaviour break; patch the circuit's activations from one input into another and watch the behaviour transfer.

Strongest claim: This is the algorithm the network learned.

The distinction between bucket 2 and bucket 3 is where most misunderstandings about interpretability sit. A probe that recovers, say, the model's belief about the truth-value of a sentence does not tell you how the model arrived at that belief. It tells you the belief is decodable from the residual stream. The model could be computing the belief through a single principled mechanism, or through ten redundant heuristics, or through one mechanism on weekdays and another on weekends — the probe is silent on which. Bucket 3 is the part of the field that wants the silence to end.
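The gap between "decodable" and "used" can be made concrete with a toy. The sketch below is a minimal illustration in plain Python, not a real experiment: the two-dimensional "activations", the `toy_model_output` function, and every constant are assumptions invented for the example. A logistic-regression probe recovers a property from the activations almost perfectly, while an intervention shows that the toy model's output never depends on the dimension that encodes it.

```python
import math
import random

random.seed(0)

def make_activation(label):
    # Hypothetical 2-D hidden state: dim 0 is what the toy model reads,
    # dim 1 happens to encode the property we will probe for.
    signal = 1.0 if label else -1.0
    return [random.gauss(0.0, 1.0), signal + random.gauss(0.0, 0.3)]

def toy_model_output(act):
    # The downstream computation only ever looks at dim 0.
    return 2.0 * act[0]

labels = [random.random() < 0.5 for _ in range(400)]
acts = [make_activation(y) for y in labels]
n = len(acts)

# Logistic-regression probe trained with plain batch gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):
    gw, gb = [0.0, 0.0], 0.0
    for a, y in zip(acts, labels):
        p = 1.0 / (1.0 + math.exp(-(w[0] * a[0] + w[1] * a[1] + b)))
        err = p - (1.0 if y else 0.0)
        gw[0] += err * a[0]
        gw[1] += err * a[1]
        gb += err
    w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
    b -= lr * gb / n

# Bucket 2: the property is decodable (accuracy measured on the training set).
probe_acc = sum(((w[0] * a[0] + w[1] * a[1] + b) > 0) == y
                for a, y in zip(acts, labels)) / n

# Bucket 3 style check: zero out dim 1 and see whether the output moves at all.
max_output_change = max(
    abs(toy_model_output(a) - toy_model_output([a[0], 0.0])) for a in acts
)

print(f"probe accuracy: {probe_acc:.3f}")
print(f"largest output change after ablating dim 1: {max_output_change}")
```

The probe's accuracy is a bucket-2 fact about the representation. Only the intervention at the end says anything about what the computation relies on.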

The word that matters in bucket 3 is causal. A mechanistic claim has to survive interventions: if you say head 7 in layer 12 implements the "previous-token lookback" step of induction, you should be able to delete that head's contribution and watch in-context copying break. If it doesn't break, your claim was wrong or incomplete. This is what separates the mechanistic style from a thousand attractive-looking correlational stories about what a network is doing.
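As a rough sketch of that test, suppose the residual stream at one position is simply the sum of what each attention head wrote into it, and the logit of the correct token is a dot product with one readout direction. The head names, vectors, and readout direction below are all made up for illustration. Ablating each head in turn shows which one the behaviour actually depends on.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical per-head contributions to the residual stream at one position.
head_outputs = {
    "L0H1": [0.1, 0.0, 0.2],
    "L1H4": [2.0, -0.5, 0.1],   # the head our hypothesis says does the work
    "L1H6": [0.0, 0.3, -0.1],
}

# Hypothetical readout direction for the correct next token.
correct_token_direction = [1.0, 0.0, 0.0]

def correct_logit(ablated=()):
    # Residual stream = sum of head outputs; an ablated head contributes nothing.
    stream = [0.0, 0.0, 0.0]
    for name, out in head_outputs.items():
        if name in ablated:
            continue
        stream = [s + o for s, o in zip(stream, out)]
    return dot(stream, correct_token_direction)

baseline = correct_logit()

# Effect of each head = how much the correct logit drops when that head is removed.
effects = {name: baseline - correct_logit(ablated=(name,)) for name in head_outputs}

for name, drop in sorted(effects.items(), key=lambda kv: -kv[1]):
    print(f"{name}: logit drop {drop:.2f}")
```

This toy uses zero ablation because it is the simplest to write down. Published work often replaces a head's output with its mean over a dataset, or with its value on a different input, so that the ablated model is not pushed far outside the activations it was trained on.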

A useful rough heuristic

If a paper's strongest evidence is "we plotted activations and they cluster nicely," you are reading bucket 2. If the strongest evidence is "we ablated this and the behaviour broke; we patched this in and the behaviour transferred; we wrote out the algorithm on a whiteboard," you are reading bucket 3.

§ 03 · Landmarks

Three landmark examples

The field has, in roughly seven years, produced a small but real set of genuinely understood circuits. Three of them are worth knowing about by name. None of them resolve the broader problem — frontier models are too large for any of these techniques to scale to in full — but they are the existence proofs that the mechanistic project is not vaporware. Inside small transformers, real algorithms have been found.

2022

Induction heads

Olsson et al., Anthropic · transformer-circuits.pub

In a two-layer transformer, a specific two-head circuit implements in-context copying: given a sequence … A B … A, the model predicts B. Head one (the "previous-token head") attends from each position to the one immediately before it, copying the previous token's identity forward. Head two (the "induction head") then attends from the final A back to the token that followed an earlier occurrence of A, finding that position by matching against the previous-token information head one wrote there. It reads off that token and predicts it.

Input: the cat sat … the → cat

What made this finding pivotal was not the circuit itself but its training dynamics. Induction heads emerge during training in a sharp phase transition — over a narrow band of training steps, the model goes from having no induction capability to having a fully-formed one. That phase transition is correlated with a corresponding step-change in many downstream in-context learning behaviours. The paper made it plausible, for the first time, that interesting capabilities have identifiable circuits and identifiable birth moments. The mechanistic programme stopped being purely speculative.
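Stripped of weights and attention scores, the rule the circuit implements is a two-step lookup. The function below is a symbolic paraphrase of that rule in plain Python, written for this piece as an illustration. It is not code from the paper, and a real induction head does this softly, through attention weights, rather than with an exact match.

```python
def induction_prediction(tokens):
    """Predict the next token by the induction rule.

    Find the most recent earlier position whose *previous* token equals the
    current token, and copy the token at that position.
    """
    # Step 1 (previous-token head): every position records the token before it.
    prev_token = [None] + tokens[:-1]
    current = tokens[-1]

    # Step 2 (induction head): scan backwards from the final position for a
    # position whose recorded previous token matches the current token.
    for pos in range(len(tokens) - 1, 0, -1):
        if prev_token[pos] == current:
            return tokens[pos]
    return None  # no earlier occurrence, so the rule makes no prediction


print(induction_prediction("the cat sat on the".split()))
print(induction_prediction(["A", "B", "C", "A", "D", "A"]))
```

The first call reproduces the example above. The second shows one assumption built into this sketch: when the current token has several earlier occurrences, it copies from the most recent one.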

2022

The IOI circuit

Wang et al. · "Interpretability in the Wild" · GPT-2 Small

The Indirect Object Identification task is a sentence-completion puzzle small enough to study exhaustively: "When John and Mary went to the store, Mary gave the bag to ___" — the model should answer John, the indirect object. The paper traced this behaviour, in GPT-2 Small, to a specific circuit spanning multiple attention heads across multiple layers, with named roles: "duplicate token heads" that flag Mary as having appeared twice, "S-inhibition heads" that suppress the duplicated name, and "name mover heads" that copy the surviving name to the output position.

"When John and Mary went … Mary gave the bag to" → John

What makes IOI the canonical small-circuit example is the completeness of the story: each class of head in the circuit has an identified role, those roles are causally tested by patching and ablation, and the circuit can be written out as an algorithm a human could run on paper. The authors are candid that the circuit still leaves gaps in explaining the model's behaviour on the task. It is also the cleanest demonstration of the field's central methodology: narrow the behaviour, narrow the model, narrow the input distribution, and grind until the picture closes.
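To show what "an algorithm a human could run on paper" means here, the three named roles can be written as three lines of symbolic Python. This is a paraphrase for illustration only: the function name, the whitespace tokenisation, and the explicit `names` set are assumptions, and the paper describes further head classes that this sketch leaves out.

```python
def ioi_predict(prompt_tokens, names):
    # Duplicate-token heads: notice which names have appeared more than once.
    counts = {}
    for tok in prompt_tokens:
        if tok in names:
            counts[tok] = counts.get(tok, 0) + 1
    duplicated = {name for name, c in counts.items() if c > 1}

    # S-inhibition heads: suppress the duplicated name (the subject).
    candidates = [name for name in counts if name not in duplicated]

    # Name mover heads: copy the surviving name to the output position.
    return candidates[0] if len(candidates) == 1 else None


prompt = "When John and Mary went to the store , Mary gave the bag to".split()
print(ioi_predict(prompt, names={"John", "Mary"}))
```

The real circuit implements each step with attention patterns over token positions rather than with a dictionary. The point of the sketch is only that the hypothesis is specific enough to be wrong.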

2023–2024

Sparse autoencoders & monosemantic features

Bricken et al., Anthropic · Templeton et al. · "Scaling Monosemanticity"

The hidden activations of a transformer are a tangled mess: any given neuron lights up for an unrelated grab-bag of concepts (polysemanticity), and any given concept is spread across many neurons. The standard explanation is superposition: the model squeezes more concepts than it has dimensions into the residual stream by overlaying them as non-orthogonal directions. The bet behind sparse autoencoders is that this mess is therefore a compressed encoding, and that if you train a much wider autoencoder with a sparsity penalty on its hidden layer, you can recover the underlying concepts one at a time.

It worked. SAEs trained on Claude-scale models have turned up tens of millions of features, many of which are strikingly interpretable: features for insecure code patterns, internal-conflict expressions, French subjunctive verb conjugations, sycophantic phrasing — and the bridge feature from §01. You can clamp any of them to a high value during inference and watch the model's outputs warp toward that concept; that clamp is what produced Golden-Gate-Claude.

This changed the field's working ontology. Before SAEs, interpretability researchers were largely searching for circuits at the level of neurons and heads. After, the unit of analysis became the feature — a direction in activation space that the model has implicitly carved out. Circuits, in turn, became compositions of features. The shift is roughly as big as the shift from "atoms" to "molecules" in chemistry: more correct, more tractable, and the source of a fresh wave of follow-up research.
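For readers who want to see the moving parts, here is a minimal sketch of a sparse autoencoder's forward pass, its loss, and the clamping intervention, in plain Python. Everything in it is an assumption chosen for readability: the weights are set by hand rather than trained, the encoder and decoder are tied, and the activation space is two-dimensional with three feature directions so that there are more features than dimensions. It illustrates the idea and is not the procedure used in the papers above.

```python
import math

# Three hypothetical unit-length feature directions in a 2-D activation space.
angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
W_dec = [[math.cos(a), math.sin(a)] for a in angles]  # one row per feature
W_enc = W_dec                                         # tied weights, for brevity
b_enc = [0.0, 0.0, 0.0]

def encode(x):
    # Feature activations: ReLU of the projection onto each feature direction.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(f):
    # Reconstruction: activation-weighted sum of the feature directions.
    return [sum(fi * row[d] for fi, row in zip(f, W_dec)) for d in range(2)]

def sae_loss(x, l1_coeff=0.1):
    f = encode(x)
    x_hat = decode(f)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # f is non-negative after the ReLU, so sum(f) is its L1 norm.
    return mse + l1_coeff * sum(f)

# An activation in which only feature 1 is present, at strength 2.
x = [2.0 * W_dec[1][0], 2.0 * W_dec[1][1]]
features = encode(x)

# Clamping: overwrite feature 0 with a large value and decode the result.
clamped = list(features)
clamped[0] = 10.0
steered = decode(clamped)

print("features:", [round(f, 3) for f in features])
print("loss:", round(sae_loss(x), 3))
print("steered activation:", [round(s, 3) for s in steered])
```

In a trained SAE the weights come from minimising this loss over very many real activations, and the steered activation is written back into the model's forward pass so that later layers compute on it.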

§ 04 · Methods

The toolkit

Here is what an interpretability researcher actually runs. Each of these is a craft on its own; the names are worth recognising because every paper in the field will assume you know them. None of them are clean — every result you read has methodological caveats, and the literature is full of careful disagreements about what each tool can and cannot establish.

activation patching

Run the model on input A, save the hidden activations at a chosen location. Run on input B but splice in A's activations at that location. If the output shifts toward A's, that location carries information that is causally sufficient to move the behaviour, at least on this pair of inputs. The workhorse of mechanistic claims.

causal scrubbing

A more rigorous version of patching: you state a hypothesis about which information flows through which paths, then systematically replace each component with activations from inputs that should not change the output under the hypothesis. If the output stays correct, the hypothesis is consistent with the network's behaviour.

logit lens

Apply the model's final unembedding layer to intermediate activations. The result tells you what the model would predict if asked to commit at that depth. Early layers predict generic continuations; later layers narrow. A cheap way to see the prediction crystallising layer by layer.

probing classifiers

Train a small classifier on intermediate activations to recover some property. Tells you the information is decodable; does not tell you the model uses it. A bucket-2 technique that nonetheless underpins a lot of bucket-3 hypothesis generation.

sparse autoencoders

Train an overcomplete, sparsity-regularised autoencoder on a layer's activations. Each dimension of its hidden (latent) layer is a candidate "feature." When it works, the features are interpretable concepts; when it doesn't, you get dead features and uninterpretable noise. Currently the most active subfield.

transcoders

A variant on SAEs that learns a sparse, interpretable approximation of an MLP block's input-output mapping rather than just its activations. Brings interpretability closer to "this is the function the block computes" rather than "these features are decodable from it."

attribution patching

A gradient-based approximation of full activation patching that's far cheaper at scale — you compute attribution scores for many components in a single backward pass instead of running thousands of intervention experiments. Loses some fidelity in exchange for the tractability you need on real-sized models.

feature visualisation

For a given feature, surface the inputs (text snippets, image patches, token contexts) that most strongly activate it. Combined with hand-labeling and clustering, this is how features get named. Mostly descriptive, not yet a verification tool.
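Two of the tools above, activation patching and attribution patching, are easiest to understand side by side. The sketch below runs both on a toy network with three hidden units and a tanh output. The network, its weights `W` and `v`, and the `run` helper are all hypothetical, and the gradient is written out by hand instead of coming from an autodiff library. Activation patching costs one extra forward pass per unit. Attribution patching estimates every unit's effect from a single gradient, using the first-order approximation (A's activation minus B's activation) times the gradient of the output on the B run.

```python
import math

W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # hypothetical input-to-hidden weights
v = [1.5, -0.2, 0.1]                         # hypothetical hidden-to-output weights

def hidden(x):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def output_from_hidden(h):
    return math.tanh(sum(vi * hi for vi, hi in zip(v, h)))

def run(x, patch=None):
    # patch = (unit index, value) overrides one hidden activation mid-pass.
    h = hidden(x)
    if patch is not None:
        idx, value = patch
        h[idx] = value
    return output_from_hidden(h)

x_a, x_b = [1.0, 0.0], [0.0, 1.0]
h_a, h_b = hidden(x_a), hidden(x_b)
out_b = run(x_b)

# Activation patching: splice A's activation into the B run, one unit at a time.
patch_effect = [run(x_b, patch=(i, h_a[i])) - out_b for i in range(len(W))]

# Attribution patching: linear estimate from the gradient on the B run.
# For out = tanh(v . h), d(out)/d(h_i) = (1 - out^2) * v_i.
grad_b = [(1.0 - out_b ** 2) * vi for vi in v]
attr_effect = [(h_a[i] - h_b[i]) * grad_b[i] for i in range(len(W))]

for i in range(len(W)):
    print(f"unit {i}: patched {patch_effect[i]:+.3f}  estimated {attr_effect[i]:+.3f}")
```

On this toy both methods agree on which unit matters most and on the sign of each effect, but the linear estimate understates the largest effect because the tanh output is far from linear over that range. That is the fidelity loss mentioned above, in miniature.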

Important caveat

Every tool above is contested. A follow-up that runs the same experiment slightly differently can substantively contradict a good paper. SAE features that look monosemantic in one slice can look polysemantic in another. Activation patching can produce apparent circuits that fail to generalise. Treat individual results as load-bearing only when they have been replicated, not when they have been merely published.

§ 05 · Hands-on

A feature explorer

The closest most people will get to seeing inside a model is a feature explorer — a browsable index of SAE features, each with the input snippets that most strongly activate it. Anthropic publishes one for Claude. Neuronpedia hosts public ones for several open models. Real explorers run live inference over a corpus and surface millions of features.

The demo below is the same shape, with twelve hand-curated features and a small set of illustrative snippets per feature. Click a card to expand it. The highlighting on each snippet indicates the rough activation strength of the feature on each token. (The data is hand-authored — this is to show you what a feature explorer feels like, not to pretend we ran an SAE in the browser. See the footnote.)

DEMO · SAE feature explorer (illustrative)

A walk through twelve features

Each card is a feature: a direction in activation space that the SAE has decoded out of the model's residual stream. The number under the name is the feature's rough firing density — the fraction of tokens, in a reference corpus, on which the feature fires above threshold.

Note: the snippets and activation patterns here are hand-authored to illustrate the format. Real SAE feature explorers are produced by training an autoencoder on millions of forward passes, then scoring features against a held-out corpus. The interpretability claims you see in production explorers are descriptive (these are the inputs that activate this feature) rather than mechanistic (this feature is causally responsible for behaviour X), and going from one to the other is a separate body of work. Treat this demo as a feel, not a finding.
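The two numbers a feature card shows, firing density and top-activating snippets, are simple to compute once per-token activations exist. The sketch below assumes a tiny hand-written corpus of (token, activation) pairs for a single feature and a threshold of 0.25. The tokens, activation values, and threshold are invented for illustration, in the same spirit as the demo data.

```python
# Hypothetical per-token activations of one feature over a three-snippet corpus.
corpus = [
    [("squares", 0.0), ("=", 0.0), ("[", 0.9), ("n*n", 0.4), ("for", 0.8),
     ("n", 0.1), ("in", 0.6), ("nums", 0.0), ("]", 0.7)],
    [("The", 0.0), ("bridge", 0.0), ("opened", 0.0), ("in", 0.1), ("1937", 0.0)],
    [("items", 0.0), ("[", 0.3), ("0", 0.0), ("]", 0.2)],
]
THRESHOLD = 0.25

def firing_density(corpus, threshold):
    # Fraction of all tokens on which the feature fires above the threshold.
    acts = [a for snippet in corpus for _, a in snippet]
    return sum(a > threshold for a in acts) / len(acts)

def top_snippets(corpus, k):
    # Rank snippets by the feature's peak activation anywhere in the snippet.
    ranked = sorted(corpus, key=lambda s: max(a for _, a in s), reverse=True)
    return [" ".join(tok for tok, _ in s) for s in ranked[:k]]

print(f"firing density: {firing_density(corpus, THRESHOLD):.3f}")
for text in top_snippets(corpus, k=2):
    print("top snippet:", text)
```

Ranking by peak activation is one choice among several. Ranking by mean activation, or sampling from different activation ranges, surfaces different snippets and can change how a feature ends up being labelled.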

Two things worth noticing as you click around. First, the named feature is often a useful but lossy summary — the actual activation pattern is often broader than the name suggests, and you can usually find a snippet where the feature fires in a way that doesn't quite fit the name. This is the gap between "interpretable enough to label" and "actually capturing one clean concept." Second, even on this hand-authored toy data, the features are clearly compositional — a feature for "Python list comprehension" co-fires with a feature for "the bracket character [," which co-fires with one for "iteration vocabulary." Real circuits are stacks of these compositions, and reading them out is the work.

§ 06 · The argument

Why this matters — Amodei's urgency framing

Dario Amodei, Anthropic's CEO, published an essay in early 2025 titled The Urgency of Interpretability. The argument is worth understanding on its own terms — not because the source is neutral (it is not; Amodei runs a company that does a lot of this work) but because the argument does not depend on the source being neutral, and the structure of it is what matters.

The argument, compressed: we are deploying systems whose capabilities continue to grow, and we are not getting comparable growth in our ability to understand what those systems are doing internally. Safety claims about these systems — that they will refuse certain requests, that they will not deceive, that they will not pursue power-seeking subgoals — are currently underwritten by behavioural evidence alone. The behaviour-only floor is too low. A model that has learned to act aligned during evaluations is, from the outside, indistinguishable from a model that has learned to be aligned. We do not have the tools to tell them apart. We need those tools before capability outruns oversight far enough that the question stops being academic.

Three concrete consequences are worth pulling out:

Alignment claims become unfalsifiable

If you cannot inspect the circuit that produces a refusal, you cannot tell whether the refusal generalises or is brittle to a small distributional shift. You are doing safety the way pre-germ-theory doctors did surgery: cleaning hands matters, but you can't say why in a way that lets you reason about novel situations.

Safety evals stay surface-only

Behavioural evals can only test inputs you have thought of. Mechanistic understanding would let you ask questions like "is there a feature direction the model has learned that corresponds to 'the operator is watching'?" — a question whose behavioural correlate is, by construction, hard to elicit. The interesting failure modes are exactly the ones you can't prompt for.

Rare failure modes go unpredicted

If a behaviour shows up one time in a hundred thousand, you may never see it during eval. With a mechanistic account, you could in principle audit the relevant circuit and notice that the failure mode is reachable — even before you've ever observed it. Without one, you are running a black box and hoping its tail behaviour is benign.
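The arithmetic behind that last point is short. Under the simplifying assumption that each evaluation sample independently triggers the behaviour with probability one in a hundred thousand, the chance of never seeing it in n samples is (1 - p)^n. The function names below are hypothetical, and real failures are rarely independent across prompts, so treat the numbers as an order-of-magnitude guide.

```python
import math

P_FAILURE = 1e-5  # one in a hundred thousand, as in the text

def prob_never_seen(n_samples, p=P_FAILURE):
    # Probability that n independent samples all miss the behaviour.
    return (1.0 - p) ** n_samples

def samples_for_detection(confidence, p=P_FAILURE):
    # Smallest n such that the behaviour appears at least once
    # with the given probability.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} samples: {prob_never_seen(n):.1%} chance of never seeing it")

print("samples needed for a 95% chance of one sighting:",
      samples_for_detection(0.95))
```

Even an eval suite as large as the failure rate's denominator misses the behaviour more than a third of the time, which is why behavioural testing alone is a weak guarantee about tails.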

Honest hedge

This argument is most forcefully made by the people building the most capable systems, which is either reassuring (they are taking the problem seriously) or worrying (the urgency framing is also a strategic story about why we should keep building) depending on your prior. Both readings are coherent. The technical claim — that we do not currently have the interpretability tools to back the safety claims being made — stands independently of which reading you take.

§ 07 · Practice

What "doing the work" looks like

If you are reading this and wondering whether to take the field seriously enough to spend a year of evenings on it, here is roughly what the work involves. None of it is mysterious; all of it is a serious time commitment.

A typical research project goes something like this. You pick a narrow behaviour — IOI-style indirect object identification, a particular kind of arithmetic, a specific refusal pattern, the model's handling of a particular grammatical construction. You find the smallest open model that exhibits the behaviour. You run a bank of patching experiments to localise which layers and heads matter. You propose a circuit — a story about which components do what, in what order, with what information flow. Then you spend most of your time trying to break that story: input variations the circuit should and shouldn't handle, ablations that should and shouldn't destroy the behaviour, alternative explanations that might fit the same evidence. The final artifact is usually a paper, often with accompanying code, and ideally with the circuit specified precisely enough that someone else can verify it on their own machine.

The libraries you'll use:

TransformerLens

A PyTorch library by Neel Nanda for poking at the internals of small open transformers — hooks, activation caches, attention pattern access, ablation utilities. The standard entry point for hands-on work.

nnsight

A newer library that handles larger models and remote execution — write your interpretability experiments locally, run them against models you couldn't fit on a laptop. Increasingly the standard for scaling beyond GPT-2-class systems.

SAE Lens

Library and registry for training, loading, and evaluating sparse autoencoders. If you want to do feature-level work, this is where you start.

The training pipelines:

ARENA

A self-paced bootcamp covering transformers, RL, and interpretability, with extensive notebooks and original exercises, built by people in the alignment community. Free. Probably the single best on-ramp if you have ML fundamentals and want to get to running real experiments quickly.

MATS / SERI-MATS

Competitive research mentorship programs that pair early-career people with established researchers for a few months of focused work. Selective; serious commitment; many published interpretability papers in the last few years started as MATS projects.

Apollo Research

An independent interpretability lab that runs its own training/hiring pipeline. Notable for taking the "interpretability of capable models" framing especially seriously.

The reading list, in roughly the order I'd suggest:

Neel Nanda's blog

Pragmatic, opinionated, beginner-aware. Read the "How to become an interpretability researcher" post early; bookmark the rest.

transformer-circuits.pub

Anthropic's public interpretability venue. Long, careful, well-illustrated. Read the induction-heads paper, the toy-models-of-superposition paper, and the SAE papers in roughly that order.

Alignment Forum

Where the community argues about methodology in public. Useful for calibrating which claims are taken seriously and which are still being fought over.

To set expectations honestly: the bar to make a real contribution is high. The community is small, technically sharp, and unfashionable enough that the people in it have mostly self-selected for genuine interest rather than career incentives. It will reward months of focused study and punish dabbling. If that sounds appealing, it is one of the few corners of contemporary AI where you can still do real foundational work without working at a frontier lab.

§ 08 · Honest limits

Open problems & honest limits

The field has produced real results, and at the same time it is genuinely early. Anyone who tells you otherwise is selling something. The honest summary of where things are stuck:

Superposition is not solved, only managed

The reason SAEs are interesting at all is that activations in a transformer overlay many concepts in the same vector space. SAEs decompose them — partially. We do not know whether the features an SAE recovers are the "real" concepts the model is using, or one of several decompositions consistent with the observed activations, or an artefact of the autoencoder's particular training objective. Two SAEs trained on the same layer with slightly different sparsity penalties can produce overlapping-but-not-identical feature sets. This is uncomfortable.

The "we explained 80% of variance" trap

A standard SAE-quality metric is reconstruction quality: how much of the variance in the original activations the autoencoder reproduces from its sparse features. You can explain 80% of the variance, or 90%+, and still be missing the parts of the activation that matter most for any given task. Variance explained is not the same as causal completeness, and the field is still working out which metrics actually track the latter.
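A toy case makes the trap visible. In the sketch below, synthetic two-dimensional activations have one high-variance dimension and one low-variance dimension, and the task label depends only on the low-variance one. The `reconstruct` function is a stand-in for an autoencoder that captures the dominant direction and drops the rest. All of it is assumed for illustration and none of it is measured on a real model.

```python
import random

random.seed(0)

# Hypothetical activations: dim 0 has large variance, dim 1 has small variance.
acts = [[random.gauss(0.0, 10.0), random.gauss(0.0, 1.0)] for _ in range(2000)]
labels = [a[1] > 0 for a in acts]  # the task depends only on dim 1

def reconstruct(a):
    # Stand-in for an SAE that keeps the dominant direction and drops dim 1.
    return [a[0], 0.0]

def readout(a):
    # The downstream computation that solves the task from an activation.
    return a[1] > 0

recon = [reconstruct(a) for a in acts]
n = len(acts)

mean = [sum(a[d] for a in acts) / n for d in range(2)]
total = sum((a[d] - mean[d]) ** 2 for a in acts for d in range(2))
residual = sum((a[d] - r[d]) ** 2 for a, r in zip(acts, recon) for d in range(2))
fraction_variance_explained = 1.0 - residual / total

acc_original = sum(readout(a) == y for a, y in zip(acts, labels)) / n
acc_reconstructed = sum(readout(r) == y for r, y in zip(recon, labels)) / n

print(f"variance explained: {fraction_variance_explained:.3f}")
print(f"task accuracy on original activations: {acc_original:.3f}")
print(f"task accuracy on reconstructions: {acc_reconstructed:.3f}")
```

The reconstruction explains nearly all of the variance and the task falls to chance. This is why some evaluations also splice the reconstruction back into the model and measure the change in downstream loss, a metric closer to the causal question.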

Scaling is hard, and not just computationally

A circuit study on GPT-2 Small takes a few weeks. A circuit study on Claude or GPT-5 is a multi-team effort, and even then you are reduced to studying narrow slices, because the full model is too big for any one researcher to hold all the relevant context. Tooling is improving (attribution patching, transcoders, automated circuit discovery), but the gap between "what we can do on small models" and "what we can do on frontier models" is still large enough that most published mechanistic results are about systems substantially smaller than what gets deployed.

Feature completeness is open

Even if every feature you found was perfectly monosemantic, you would still face the question: did you find all the features, or just the ones your autoencoder happened to surface? There is, currently, no clean way to prove a feature set is complete. The model could be using features you have not yet decoded. This is the version of "unknown unknowns" specific to the SAE programme.

Mechanistic claims fail to transfer surprisingly often

A circuit identified in one model frequently does not appear, or appears differently, in a closely related model — sometimes one fine-tune away, sometimes one architecture tweak. This is partly a real phenomenon (different training runs find different solutions) and partly a methodological one (current localisation methods are noisier than published results suggest). The field is still calibrating which of its results are about this specific model versus this class of model versus transformers in general.

Where this leaves us

You are reading about a field that has produced enough real wins to make the larger programme plausible, and not yet enough to make the safety claims being made about deployed models stand on it. That gap is the work. The honest position is neither hype nor despair — it is to take the open problems as seriously as the closed ones, and to keep doing the slow patient experiments that turn one into the other.

The cheerful version of all this: every one of these problems is a research question someone with patience and decent ML fundamentals could make a dent in. The field is still small enough that careful work gets noticed; the questions are real; the tooling is improving fast; and the importance of getting it right is not going to decrease.

§ 09 · Sources

Further reading

The shape of the bet