Mechanistic interpretability — looking inside the model
A field guide to the third bucket past behavioural and representational work — what it means to actually find the algorithm a network has learned, with three landmark examples, the toolkit researchers use, and an honest account of what the field has not yet solved.
A frontier model writes a working Rust program. It picks Vec<T> over Box<[T]> in the third line, threads ownership cleanly, handles a borrow-checker corner. You can ask it why and it will produce a confident sentence about heap allocation and growable buffers. That sentence is not a causal trace. It is the model doing what models do — generating a plausible continuation of "explain your choice." The actual computation that produced Vec happened across a few hundred matrix multiplications inside a stack of attention layers, and the model has no introspective access to any of it. Neither do you.
That is the gap this piece is about. The model works. We can measure that it works. We can hill-climb the benchmarks. We can ship products. What we cannot do — yet, except in a handful of small, carefully-studied cases — is read off the algorithm. Mechanistic interpretability is the bet that the algorithms are in there to be read, and the slow craft of trying to read them. It is the part of the field that takes seriously the idea that neural networks are programs we wrote without understanding.
This is a companion to the anatomy of a frontier model chapter. That chapter is a map of what a frontier model is made of. This one is about what it would mean to know what's happening inside that map at the level of circuits, and why a number of serious researchers think getting there is among the most urgent open problems in AI.
- The mystery you can't ignore: the introspection gap
- What "mechanistic" actually means: three buckets, one distinction
- Three landmark examples: induction heads, IOI, SAEs
- The toolkit: patching, probing, lenses
- A feature explorer: interactive demo
- Why this matters — Amodei's urgency framing
- What "doing the work" looks like: entry points
- Open problems & honest limits
- Further reading
§ 01 · The opening problem
The mystery you can't ignore
In May 2024, Anthropic published a demo they called Golden Gate Claude. They had isolated, inside a production-class model, a single internal direction — a "feature" in their terminology — that fired whenever the model was thinking about the Golden Gate Bridge. Then they cranked that feature's activation up by hand during inference and let users talk to the result. Ask Golden-Gate-Claude for a chocolate-chip cookie recipe and it offered to bake the cookies on the bridge; ask it the best way to drive from LA to Phoenix and it suggested taking the Golden Gate, somehow. The model couldn't stop bringing the bridge up. It also, charmingly, knew something was wrong and apologised for it.
That demo is what mechanistic interpretability looks like when it works. Anthropic was able to point at a specific direction in the model's internal state and say this is what activates when the concept ‘Golden Gate Bridge’ is present — and then prove the claim was causal by perturbing that direction and watching the behaviour change. That kind of pointer is, today, exceedingly rare. For most things a frontier model does, no one can point at anything. The model writes a working Rust program and picks Vec<T> over Box<[T]> in the third line, and we cannot tell you which of its eighty-odd layers cast the deciding vote, or whether there is a "deciding vote" at all rather than a few hundred small biases summing in a direction nobody planned.
The gap is not limited to one Rust program. We are shipping language models that pass bar exams, write production code, and conduct themselves as plausible junior collaborators. The companies shipping them have, by any reasonable accounting, a thinner understanding of how those models do what they do than a metallurgist has of how a bridge holds up its own weight. The bridge metaphor flatters us: a metallurgist can at least point at the steel.
This is not a rhetorical flourish. Take any concrete capability — say, the model's ability to track which speaker said what in a multi-turn dialogue, or to refuse a request when a particular safety policy is triggered. We can observe that the capability is present. We can construct evals that probe its limits. What we cannot, in general, do is point to the specific computation that produces it. We do not know which attention heads, in which layers, attending to which tokens, with which residual-stream features, are doing the work. We do not know whether the same circuit handles politely-phrased and rudely-phrased versions of the same request. We do not know whether removing a particular component would degrade the capability gracefully or catastrophically, because we have not, in most cases, identified the component.
You may want to object: but the model can explain itself. It can. It explains itself fluently. Those explanations are, however, generated by the same machinery whose internals we cannot inspect, and they are produced after the fact, conditioned on the question rather than on the trace of the original forward pass. They are not unlike asking a person why they fell in love with someone. The answer will be sincere and coherent and rich with detail, and it will also be a story the speaker is constructing on the spot from materials available to the speaker — not a log of the causal process. The introspection report is generation, not telemetry.
We can build systems whose behaviour is impressive much faster than we can build the tools to understand that behaviour. Capability and interpretability scale at different rates. This is the substrate fact: every quarter the gap widens, and every quarter "we will figure out interpretability later" gets a little less defensible.
That is the opening. The rest of this piece is about what people are doing about it, what they have managed to actually figure out, and what it would take to learn enough that you could stand behind a deployed model the way a structural engineer can stand behind a bridge.
§ 02 · Setting the terms
What "mechanistic" actually means
"Interpretability" is a wide tent. People use the word to mean things that range from "the model returns logprobs you can read" to "we have proven that this circuit implements the following algorithm." It is worth pulling apart at least three buckets, because the field's distinctive claim — and its distinctive ambition — lives in the third one.
The three buckets are behavioural work (evals that tell you what a model does), representational work (probes that tell you what is decodable from its activations), and mechanistic work (identifying which computation actually does the work). The distinction between bucket 2 and bucket 3 is where most misunderstandings about interpretability sit. A probe that recovers, say, the model's belief about the truth-value of a sentence does not tell you how the model arrived at that belief. It tells you the belief is decodable from the residual stream. The model could be computing the belief through a single principled mechanism, or through ten redundant heuristics, or through one mechanism on weekdays and another on weekends; the probe is silent on which. Bucket 3 is the part of the field that wants the silence to end.
The verb that matters in bucket 3 is causal. A mechanistic claim has to survive interventions: if you say head 7 in layer 12 implements the "previous-token lookback" step of induction, you should be able to delete that head's contribution and watch in-context copying break. If it doesn't break, your claim was wrong or incomplete. This is what separates the mechanistic style from a thousand attractive-looking correlational stories about what a network is doing.
If a paper's strongest evidence is "we plotted activations and they cluster nicely," you are reading bucket 2. If the strongest evidence is "we ablated this and the behaviour broke; we patched this in and the behaviour transferred; we wrote out the algorithm on a whiteboard," you are reading bucket 3.
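The difference between the two kinds of evidence can be shown on a toy. The sketch below is not a real model: it is a hypothetical two-unit "hidden layer" built so that one unit carries a signal the output never reads. A linear probe decodes that signal perfectly (bucket 2), while mean-ablating the unit leaves the output untouched (bucket 3). All names and sizes here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model" (hypothetical): the hidden state carries two signals,
# but the output reads only the first hidden unit.
X = rng.normal(size=(500, 2))
W_in = np.eye(2)                 # hidden = input, so both signals live in h
w_out = np.array([1.0, 0.0])     # output ignores hidden unit 1

def forward(x, ablate=None):
    """Run the toy model, optionally mean-ablating one hidden unit."""
    h = x @ W_in
    if ablate is not None:
        h = h.copy()
        h[:, ablate] = h[:, ablate].mean()
    return h @ w_out

# Bucket 2: a linear probe that decodes the second signal from the hidden state.
H = X @ W_in
target = X[:, 1]
A = np.c_[H, np.ones(len(H))]                      # add a bias column
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
pred = A @ coef
probe_r2 = 1 - np.sum((target - pred) ** 2) / np.sum((target - target.mean()) ** 2)

# Bucket 3: ablate each unit and measure how much the output moves.
baseline = forward(X)
effect_unit0 = np.mean(np.abs(forward(X, ablate=0) - baseline))
effect_unit1 = np.mean(np.abs(forward(X, ablate=1) - baseline))

print(f"probe R^2 for unit 1's signal: {probe_r2:.3f}")
print(f"mean output change, ablate unit 0: {effect_unit0:.3f}")
print(f"mean output change, ablate unit 1: {effect_unit1:.3f}")
```

The probe says the information is there; only the ablation says whether the model uses it. Real studies face the same question at far larger scale and with far messier answers.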
§ 03 · Landmarks
Three landmark examples
The field has, in roughly seven years, produced a small but real set of genuinely understood circuits. Three of them are worth knowing about by name. None of them resolve the broader problem — frontier models are too large for any of these techniques to scale to in full — but they are the existence proofs that the mechanistic project is not vaporware. Inside small transformers, real algorithms have been found.
The first is the induction head. In a two-layer transformer, a specific two-head circuit implements in-context copying: given a sequence … A B … A, the model predicts B. Head one (the "previous-token head", in the first layer) attends from each position to the one immediately before it, copying the previous token's identity forward, so the position holding B now also records "my predecessor was A." Head two (the "induction head", in the second layer) then attends from the final A to the position whose recorded predecessor matches it, which is the token just after the earlier A, and copies that token (B) forward as the prediction.
What made this finding pivotal was not the circuit itself but its training dynamics. Induction heads emerge during training in a sharp phase transition: over a narrow band of training steps, the model goes from having no induction capability to having a fully formed one. That phase transition coincides with a step-change in many downstream in-context learning behaviours. The induction-heads paper (Olsson et al.) made it plausible, for the first time, that interesting capabilities have identifiable circuits and identifiable birth moments. The mechanistic programme stopped being purely speculative.
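Stripped of attention arithmetic, the algorithm the two heads implement fits in a few lines. This is a sketch of the logic only, written over plain token lists; the function name is made up here, and a real induction head does this softly with attention weights, not with an exact-match loop.

```python
def induction_predict(tokens):
    """Predict the next token by in-context copying.

    Find the most recent position whose *previous* token equals the current
    (last) token, and return the token at that position.
    """
    current = tokens[-1]
    # Step 1 (previous-token head): every position records its predecessor.
    prev_token = [None] + list(tokens[:-1])
    # Step 2 (induction head): from the last position, search backwards for a
    # position whose recorded predecessor matches the current token.
    for pos in range(len(tokens) - 1, 0, -1):
        if prev_token[pos] == current:
            return tokens[pos]
    return None  # no earlier occurrence, so nothing to copy


print(induction_predict(["x", "A", "B", "C", "y", "A"]))  # B
print(induction_predict(["A", "B"]))                      # None
```

The point of writing it out is the one the section makes: once a circuit is understood, it can be restated as a procedure a person could run by hand.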
The second is the Indirect Object Identification (IOI) circuit. The task is a sentence-completion puzzle small enough to study exhaustively: "When John and Mary went to the store, Mary gave the bag to ___". The model should answer John, the indirect object. The IOI paper (Wang et al.) traced this behaviour, in GPT-2 Small, to a specific circuit spanning multiple attention heads across multiple layers, with named roles: "duplicate token heads" that flag Mary as having appeared twice, "S-inhibition heads" that suppress the duplicated name, and "name mover heads" that copy the surviving name to the output position.
What makes IOI the canonical small-circuit example is the completeness of the story: every head's contribution is identified, every head's role is causally verified by patching, and the circuit can be written out as an algorithm a human could run on paper. It is also the cleanest demonstration of the field's central methodology — narrow the behaviour, narrow the model, narrow the input distribution, and grind until the picture closes.
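The paper-and-pencil version of that algorithm looks roughly like this. It is a deliberately crude sketch: the function name and the `known_names` argument are invented for illustration, and the real circuit works on token positions and attention patterns, not on string matching.

```python
def ioi_predict(prompt, known_names):
    """Toy restatement of the three named roles in the IOI circuit."""
    tokens = prompt.replace(",", " ").split()
    mentions = [t for t in tokens if t in known_names]

    # "Duplicate token" step: flag any name that appears more than once.
    seen, duplicated = set(), set()
    for name in mentions:
        if name in seen:
            duplicated.add(name)
        seen.add(name)

    # "S-inhibition" step: suppress the duplicated name.
    candidates = [name for name in mentions if name not in duplicated]

    # "Name mover" step: copy the surviving name to the output.
    return candidates[-1] if candidates else None


prompt = "When John and Mary went to the store, Mary gave the bag to"
print(ioi_predict(prompt, {"John", "Mary"}))  # John
```

Swapping which name is repeated flips the answer, which is the behaviour the patching experiments in the paper were designed to test.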
The third is the sparse autoencoder (SAE). The hidden activations of a transformer are a tangled mess: any given neuron lights up for an unrelated grab-bag of concepts (polysemanticity), and any given concept is spread across many neurons. The bet behind sparse autoencoders is that this mess is a compressed encoding: the model is squeezing more concepts than it has neurons into the residual stream by overlaying them (superposition), and if you train an autoencoder whose hidden layer is much wider than the activation and penalised to be sparse, you can recover the underlying concepts one at a time.
It worked. SAEs trained on Claude-scale models have turned up tens of millions of features, many of which are strikingly interpretable: features for insecure code patterns, internal-conflict expressions, French subjunctive verb conjugations, sycophantic phrasing — and the bridge feature from §01. You can clamp any of them to a high value during inference and watch the model's outputs warp toward that concept; that clamp is what produced Golden-Gate-Claude.
This changed the field's working ontology. Before SAEs, interpretability researchers were largely searching for circuits at the level of neurons and heads. After, the unit of analysis became the feature — a direction in activation space that the model has implicitly carved out. Circuits, in turn, became compositions of features. The shift is roughly as big as the shift from "atoms" to "molecules" in chemistry: more correct, more tractable, and the source of a fresh wave of follow-up research.
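For readers who want the shape of the computation, here is a minimal sketch of a sparse autoencoder's forward pass, its loss, and the feature clamp. It makes several assumptions: the weights are random and untrained (so the features mean nothing), the sizes are made up, and the ReLU-plus-L1 form is one common formulation and may not be the one any particular paper used. The clamp keeps the part of the activation the SAE fails to reconstruct and changes only the chosen feature.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64          # hypothetical sizes: dictionary wider than the activation

# Untrained, random weights: this shows the computation, not a working SAE.
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)   # unit-norm feature directions
b_dec = np.zeros(d_model)

def encode(x):
    """Activation vector -> non-negative feature activations."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def decode(f):
    """Feature activations -> reconstructed activation vector."""
    return W_dec @ f + b_dec

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    f = encode(x)
    return np.sum((x - decode(f)) ** 2) + l1_coeff * np.sum(np.abs(f))

def clamp_feature(x, idx, value):
    """Set one feature to `value`, leaving the SAE's reconstruction error untouched."""
    f = encode(x)
    error = x - decode(f)             # what the SAE does not explain
    f_clamped = f.copy()
    f_clamped[idx] = value
    return decode(f_clamped) + error

x = rng.normal(size=d_model)          # stand-in for a residual-stream activation
steered = clamp_feature(x, idx=3, value=10.0)
print(sae_loss(x), np.linalg.norm(steered - x))
```

Clamping moves the activation along a single decoder direction, `W_dec[:, idx]`, by the difference between the clamped value and the feature's original activation. That is the sense in which a feature is "a direction in activation space."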
§ 04 · Methods
The toolkit
Here is what an interpretability researcher actually runs: activation patching, probing, and lens methods such as the logit lens, alongside the sparse autoencoders from §03. Each of these is a craft on its own; the names are worth recognising because every paper in the field will assume you know them. None of them is clean: every result you read has methodological caveats, and the literature is full of careful disagreements about what each tool can and cannot establish.
Every tool above is contested. A good entry in the literature can be substantively disagreed with by a follow-up that runs the same experiment slightly differently. SAE features that look monosemantic in one slice can look polysemantic in another. Activation patching can produce apparent circuits that fail to generalise. Treat individual results as load-bearing only when they have been replicated, not when they have been merely published.
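Activation patching is the tool that comes up most in this piece, so here is a hedged sketch of it on a toy. The "model" is a hypothetical two-layer network with hand-picked weights, not a transformer; the recipe is the part that carries over. Run a clean input and a corrupted input, then overwrite one internal activation in the corrupted run with its clean value and measure how much of the clean output comes back.

```python
import numpy as np

# Hypothetical toy network: 3 inputs -> 4 hidden units -> scalar output.
W1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 1.0, 1.0]])
W2 = np.array([2.0, 0.0, 0.0, 1.0])   # output reads hidden units 0 and 3 only

def run(x, patch=None):
    """Forward pass; `patch=(unit, value)` overwrites one hidden activation."""
    h = np.tanh(W1 @ x)
    if patch is not None:
        unit, value = patch
        h = h.copy()
        h[unit] = value
    return float(W2 @ h), h

clean_x = np.array([1.0, 0.5, -0.3])
corrupt_x = np.array([-1.0, 0.2, 0.8])

clean_out, clean_h = run(clean_x)
corrupt_out, _ = run(corrupt_x)

# Patch each hidden unit's clean activation into the corrupted run, one at a time,
# and report the fraction of the clean-vs-corrupt output gap that is restored.
recovered = []
for unit in range(4):
    patched_out, _ = run(corrupt_x, patch=(unit, clean_h[unit]))
    recovered.append((patched_out - corrupt_out) / (clean_out - corrupt_out))

for unit, frac in enumerate(recovered):
    print(f"hidden unit {unit}: {frac:.2f} of the gap recovered")
```

In this toy the fractions add up to one because the output is linear in the hidden units. In a real network they generally do not, which is one of the reasons patching results need the caution described above.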
§ 05 · Hands-on
A feature explorer
The closest most people will get to seeing inside a model is a feature explorer — a browsable index of SAE features, each with the input snippets that most strongly activate it. Anthropic publishes one for Claude. Neuronpedia hosts public ones for several open models. Real explorers run live inference over a corpus and surface millions of features.
The demo below is the same shape, with twelve hand-curated features and a small set of illustrative snippets per feature. Click a card to expand it. The highlighting on each snippet indicates the rough activation strength of the feature on each token. (The data is hand-authored — this is to show you what a feature explorer feels like, not to pretend we ran an SAE in the browser. See the footnote.)
Two things are worth noticing as you click around. First, the named feature is a useful but lossy summary: the actual activation pattern is often broader than the name suggests, and you can usually find a snippet where the feature fires in a way that doesn't quite fit the name. This is the gap between "interpretable enough to label" and "actually capturing one clean concept." Second, even on this hand-authored toy data, the features are clearly compositional: a feature for "Python list comprehension" co-fires with a feature for the bracket character "[", which co-fires with one for "iteration vocabulary." Real circuits are stacks of these compositions, and reading them out is the work.
§ 06 · The argument
Why this matters — Amodei's urgency framing
Dario Amodei, Anthropic's CEO, published an essay in early 2025 titled The Urgency of Interpretability. The argument is worth understanding on its own terms — not because the source is neutral (it is not; Amodei runs a company that does a lot of this work) but because the argument does not depend on the source being neutral, and the structure of it is what matters.
The argument, compressed: we are deploying systems whose capabilities continue to grow, and we are not getting comparable growth in our ability to understand what those systems are doing internally. Safety claims about these systems — that they will refuse certain requests, that they will not deceive, that they will not pursue power-seeking subgoals — are currently underwritten by behavioural evidence alone. The behaviour-only floor is too low. A model that has learned to act aligned during evaluations is, from the outside, indistinguishable from a model that has learned to be aligned. We do not have the tools to tell them apart. We need those tools before capability outruns oversight far enough that the question stops being academic.
Three concrete consequences are worth pulling out:
Alignment claims become unfalsifiable
If you cannot inspect the circuit that produces a refusal, you cannot tell whether the refusal generalises or is brittle to a small distributional shift. You are doing safety the way pre-germ-theory doctors did surgery: cleaning hands matters, but you can't say why in a way that lets you reason about novel situations.
Safety evals stay surface-only
Behavioural evals can only test inputs you have thought of. Mechanistic understanding would let you ask questions like "is there a feature direction the model has learned that corresponds to 'the operator is watching'?" — a question whose behavioural correlate is, by construction, hard to elicit. The interesting failure modes are exactly the ones you can't prompt for.
Rare failure modes go unpredicted
If a behaviour shows up one time in a hundred thousand, you may never see it during eval. With a mechanistic account, you could in principle audit the relevant circuit and notice that the failure mode is reachable — even before you've ever observed it. Without one, you are running a black box and hoping its tail behaviour is benign.
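The arithmetic behind "you may never see it" is short enough to check. The sketch below assumes every eval sample is an independent draw with the same failure probability, which real eval sets are not; treat the numbers as an order-of-magnitude guide. The function names are made up for this example.

```python
import math

P_FAIL = 1e-5   # one failure in a hundred thousand samples

def prob_never_seen(n_samples, p=P_FAIL):
    """Chance that n independent samples contain zero failures."""
    return (1 - p) ** n_samples

def samples_needed(confidence, p=P_FAIL):
    """Smallest n giving at least `confidence` chance of seeing one failure."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

print(f"10,000 samples, chance of seeing nothing: {prob_never_seen(10_000):.3f}")
print(f"samples for a 95% chance of one sighting: {samples_needed(0.95):,}")
```

Under these assumptions, a ten-thousand-sample eval misses the behaviour about nine times in ten, and it takes roughly three hundred thousand samples to have a 95% chance of seeing it even once.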
This argument is most forcefully made by the people building the most capable systems, which is either reassuring (they are taking the problem seriously) or worrying (the urgency framing is also a strategic story about why we should keep building) depending on your prior. Both readings are coherent. The technical claim — that we do not currently have the interpretability tools to back the safety claims being made — stands independently of which reading you take.
§ 07 · Practice
What "doing the work" looks like
If you are reading this and wondering whether to take the field seriously enough to spend a year of evenings on it, here is roughly what the work involves. None of it is mysterious; all of it is a serious time commitment.
A typical research project goes something like this. You pick a narrow behaviour — IOI-style indirect object identification, a particular kind of arithmetic, a specific refusal pattern, the model's handling of a particular grammatical construction. You find the smallest open model that exhibits the behaviour. You run a bank of patching experiments to localise which layers and heads matter. You propose a circuit — a story about which components do what, in what order, with what information flow. Then you spend most of your time trying to break that story: input variations the circuit should and shouldn't handle, ablations that should and shouldn't destroy the behaviour, alternative explanations that might fit the same evidence. The final artifact is usually a paper, often with accompanying code, and ideally with the circuit specified precisely enough that someone else can verify it on their own machine.
The libraries you'll use:
The training pipelines:
The reading list, in roughly the order I'd suggest:
To set expectations honestly: the bar to make a real contribution is high. The community is small, technically sharp, and unfashionable enough that the people in it have mostly self-selected for genuine interest rather than career incentives. It will reward months of focused study and punish dabbling. If that sounds appealing, it is one of the few corners of contemporary AI where you can still do real foundational work without working at a frontier lab.
§ 08 · Honest limits
Open problems & honest limits
The field has produced real results, and at the same time it is genuinely early. Anyone who tells you otherwise is selling something. The honest summary of where things are stuck:
Superposition is not solved, only managed
The reason SAEs are interesting at all is that activations in a transformer overlay many concepts in the same vector space. SAEs decompose them — partially. We do not know whether the features an SAE recovers are the "real" concepts the model is using, or one of several decompositions consistent with the observed activations, or an artefact of the autoencoder's particular training objective. Two SAEs trained on the same layer with slightly different sparsity penalties can produce overlapping-but-not-identical feature sets. This is uncomfortable.
The "we explained 80% of variance" trap
A standard SAE-quality metric is reconstruction quality: how well the autoencoder can reproduce the original activations from its sparse features, often reported as the fraction of variance explained. You can reach high reconstruction quality (90%+) and still be missing the parts of the activation that matter most for any given task, because a direction can carry little variance and still decide the output. Variance explained is not the same as causal completeness, and the field is still working out which metrics actually track the latter.
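A toy makes the trap concrete. Fraction of variance explained is computed here as 1 − ‖x − x̂‖² / ‖x − x̄‖², summed over the dataset. The data below is synthetic and rigged for the purpose: one direction has a hundred times the variance of the other, and the hypothetical "downstream behaviour" reads only the small one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
big = rng.normal(scale=10.0, size=n)    # high-variance direction the task ignores
small = rng.normal(scale=1.0, size=n)   # low-variance direction the task depends on
acts = np.stack([big, small], axis=1)

def downstream(a):
    """Toy behaviour: depends only on the sign of the low-variance direction."""
    return np.sign(a[:, 1])

# A "reconstruction" that keeps the big direction perfectly and drops the small one.
recon = acts.copy()
recon[:, 1] = 0.0

resid = np.sum((acts - recon) ** 2)
total = np.sum((acts - acts.mean(axis=0)) ** 2)
fve = 1 - resid / total
agreement = np.mean(downstream(recon) == downstream(acts))

print(f"fraction of variance explained: {fve:.3f}")
print(f"downstream behaviour preserved: {agreement:.3f}")
```

The reconstruction explains about 99% of the variance and preserves none of the behaviour. Real SAEs are not this pathological, but the metric alone cannot tell you how far from it they are.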
Scaling is hard, and not just computationally
A circuit study on GPT-2 Small takes a few weeks. A circuit study on Claude or GPT-5 is a multi-team effort, and even then you are reduced to studying narrow slices because the full model is too big for any one researcher to hold all the relevant context. Tooling is improving (attribution patching, transcoders, automated circuit discovery), but the gap between "what we can do on small models" and "what we can do on frontier models" is still large enough that most published mechanistic results are about systems substantially smaller than what gets deployed.
Feature completeness is open
Even if every feature you found was perfectly monosemantic, you would still face the question: did you find all the features, or just the ones your autoencoder happened to surface? There is, currently, no clean way to prove a feature set is complete. The model could be using features you have not yet decoded. This is the version of "unknown unknowns" specific to the SAE programme.
Mechanistic claims fail to transfer surprisingly often
A circuit identified in one model frequently does not appear, or appears differently, in a closely related model — sometimes one fine-tune away, sometimes one architecture tweak. This is partly a real phenomenon (different training runs find different solutions) and partly a methodological one (current localisation methods are noisier than published results suggest). The field is still calibrating which of its results are about this specific model versus this class of model versus transformers in general.
You are reading about a field that has produced enough real wins to make the larger programme plausible, and not yet enough to make the safety claims being made about deployed models stand on it. That gap is the work. The honest position is neither hype nor despair — it is to take the open problems as seriously as the closed ones, and to keep doing the slow patient experiments that turn one into the other.
The cheerful version of all this: every one of these problems is a research question someone with patience and decent ML fundamentals could make a dent in. The field is still small enough that careful work gets noticed; the questions are real; the tooling is improving fast; and the importance of getting it right is not going to decrease.
§ 09 · Sources
Further reading
A short, deliberately curated list. Each entry has a short note on why it's worth your time.
- Dario Amodei, "The Urgency of Interpretability." The case for why interpretability matters now rather than later, made by someone who has both the technical chops to make it and the obvious incentive that means you should read it sceptically. The brief history section in particular is a clean summary of where the field has been.
- Anthropic, transformer-circuits.pub. The single most important venue for ongoing mechanistic-interpretability work. Long, illustrated, carefully argued papers; not peer-reviewed in the traditional sense but extensively cited in everything that is.
- Neel Nanda, "How to become a mechanistic interpretability researcher." The most-cited entry-point guide. Opinionated, practical, regularly updated, and the source most people who do this work for a living point newcomers to.
- Olsson et al., "In-context learning and induction heads." The induction-heads paper. The phase-transition result that made the mechanistic project look real. If you read one paper from the list, this is the one.
- Olah et al., "Zoom In: An Introduction to Circuits." The Distill essay that named "circuits" as a unit of analysis. Vision-model focused, but the framing and the visual style are foundational for everything that followed.
- Bricken et al., "Towards monosemanticity." The first major paper showing that sparse autoencoders can pull interpretable features out of a small transformer. Established the modern feature-centric paradigm.
- Templeton et al., "Scaling Monosemanticity." The Claude-3-Sonnet feature dictionary work. Shows that the SAE approach scales to production-class models, and includes the Golden-Gate-Claude feature-clamping demo that briefly went viral.
- Wang et al., "Interpretability in the Wild (IOI)." The Indirect Object Identification paper. The canonical end-to-end example of finding, naming, and causally verifying every component of a real circuit in a real model.
- Neuronpedia. Public feature explorer for SAEs trained on open models. The closest you can get to playing with the kind of tool researchers use day to day without setting up your own pipeline.
The shape of the bet
Mechanistic interpretability is the bet that the algorithms learned by a large neural network can be recovered — not statistically summarised, not behaviourally probed, but read out as algorithms a human could write on a whiteboard. The bet has not been won. It has been won in miniature: a handful of small circuits in small models, a feature-dictionary that scales to production systems, a methodology that survives scrutiny. It has not been won at the level that would let you stand behind a deployed frontier model the way an engineer stands behind a bridge.
If you remember three things from this chapter:
- Bucket 3 is the one that counts. Behavioural evals tell you what a model does; probes tell you what's decodable from its activations; only mechanistic work tells you which computation does the work. The first two are easy, ubiquitous, and frequently mistaken for the third.
- Features, not neurons. The unit of analysis is a direction in activation space, not an individual cell. Polysemanticity and superposition are why; SAEs are how. Anything written before 2023 that treats single neurons as the natural unit is, by now, mostly a historical document.
- If you didn't ablate, you didn't prove anything. Mechanistic claims have to survive intervention — patch the circuit in and the behaviour transfers, patch it out and the behaviour breaks. Pretty activation plots without that test are bucket 2 in a bucket-3 costume.
The reason the field is worth your attention is the same reason it is uncomfortable: there are deployed systems being talked about in safety language that the science currently does not back. Closing that gap is the work, and it is small enough that one careful person can still move it.