Working with the grain
What language models actually do, why they fail the way they fail, and how to steer them — five structural truths, each with a working demonstration.
If you have ever asked an LLM to "build a visualization" and watched it default to a Chart.js bar chart when you wanted a node graph — or asked it to write code and gotten something plausible but subtly wrong — this document is for you. The friction is not a bug. It is the shape of the tool, and once you can see the shape, the friction tells you exactly where to push.
What follows is not a list of prompting tricks. It is five facts about how language models generate text, each one paired with a small interactive demonstration you can play with, and a short operating manual describing how to work with that fact rather than against it. The same five ideas explain why v0 and Lovable feel like magic, why "what's missing?" works better than "make it better," and why your fifth message to a long thread sometimes contradicts your first.
I.
They sample, they don't recall
A language model is a probability distribution over what word comes next, given everything that came before. That is the whole machine. It is not a database it queries, not a reasoner that deliberates — it is a very large function that, given a sequence of tokens, returns a probability for every possible next token in its vocabulary.
This has a consequence that surprises people: when you ask a vague question, the model does not "consider its options." It samples from a distribution whose shape was set by what was common in training data. The most frequent answer wins by default, because the most frequent answer has the most probability mass.
So "build me a visualization" — a request that could plausibly point at a hundred different libraries and styles — collapses onto whichever library was most heavily represented in the corpus when paired with the word visualization. Usually that is Chart.js, matplotlib, or D3. ReactFlow exists in the same landscape, but as a smaller hump. Without something in the prompt that tilts mass toward it, the model never lands there.
Below: nine common visualization libraries, with their rough base probabilities for an unconstrained prompt. Toggle the specification chips and watch the distribution reshape itself in real time.
Two things to notice. First, ReactFlow is reachable — but only through a particular path. You have to add enough constraints that the mass shifts toward it. Second, the model never "rejects" your other options; it just gives them less probability. With every constraint, you are reweighting, not filtering.
This reframes a lot of frustration with LLMs. They are not stubbornly choosing the wrong tool. They are landing on the most probable interpretation of an under-specified prompt. The fix is not to argue with the model. The fix is to reshape the distribution.
How to tilt the distribution
- Name the destination, not the journey. "Build a node graph in ReactFlow" gives the model a peak to climb to. "Make it interactive" is too soft to compete with the base distribution.
- Add constraints until ambiguity dies. Every constraint you add carves probability mass off the common humps. Three good constraints often beat one verbose paragraph.
- Use unusual but real signals. Mention a library, a pattern name, a stylistic anchor. "Brutalist UI," "shadcn style," "Tufte-inspired" — these tilt toward smaller, sharper regions of the distribution.
- Remember vagueness is not creativity. Asking the model to "be creative" without constraints means it samples the mode of "creative answers," which is itself a generic cluster. Constraints free the model to be specific.
II.
Generation is a walk, not a plan
The model emits one token, then conditions on that token to emit the next, and so on until it stops. There is no draft phase. There is no outline it consults. There is no moment where it reads the full response and asks itself whether the opening was a mistake. The walk is forward-only.
This is autoregression, and it has a sharp consequence: early tokens lock in late tokens. If the first sentence of a response commits to Chart.js, the rest of the response will explain Chart.js. The model cannot, mid-paragraph, say "actually, let me start over with ReactFlow." The probability of restarting is essentially zero, because nothing in training looks like that.
Pick a first word below. Then a second. Then a third. Watch how the available continuations narrow at each step — and notice the kinds of final sentences that become unreachable once you have chosen the opening, even though they were reachable from the empty prompt.
The point this makes visible: a prompt has a wide set of possible responses, but a generation is a single walk through that set. Once the model has committed to "Errors", it cannot land on a sentence that begins with "Our". Worse, you cannot tell from the output that there were other paths — the model emits one trajectory and that trajectory looks complete.
This is why chain-of-thought helps. When you ask a model to "think step by step" before answering, you are giving it a long runway of reasoning tokens to commit to before it commits to an answer. The reasoning becomes the context that conditions the answer. The walk is the same shape, but the early tokens are now exploratory rather than declarative, so the late tokens land somewhere considered.
Working with the walk
- Ask for the plan first, code second. "List five approaches, then pick one and implement it" gives the model planning tokens before commitment tokens. You can intervene between the two.
- Treat the first message as architecture, not a question. What ends up in the first 200 tokens of context governs the next 2,000. Front-load constraints, examples, and style anchors.
- Watch for early commitments you didn't ask for. If the model opens with "Here's a Python script…" and you wanted JavaScript, do not edit the script — restart with explicit framing. The walk is already down the wrong corridor.
- Use structured output formats for branching. Asking for JSON with a
"options": [...]field forces the model to enumerate alternatives instead of walking down the first one.
III.
Specification is steering
If the previous two chapters were about how the model behaves, this one is about how to bend its behavior reliably. The mechanism is the system prompt — instructions that sit above the conversation and condition every response without taking up visible airtime. This is the trick behind v0, Lovable, Cursor, and most modern AI products that feel "smarter" than raw model access.
The user sends "build me a button." The same model receives three radically different framings depending on the system prompt sitting above that message. Watch what happens to the output.
Three responses, same model, same user message. The system prompt did not just suggest a direction — it reshaped the entire probability distribution before the user ever spoke. By the time the model started generating, the universe of plausible outputs had already been narrowed to one neighborhood.
This is what makes products like v0 feel oddly sharp compared to raw chat. The product team has done specification work on your behalf, persistently, on every request. You type a vague thing, but the model receives a heavily steered request. The output looks like "the AI understood me" — but really, the AI was preloaded to understand you.
You can do this for yourself in any LLM product that supports persistent instructions (Claude Projects, Custom GPTs, system prompts in API access). A good standing prompt that names your stack, your conventions, your aesthetic preferences, and your usual constraints will outperform a thousand reminders inside individual conversations.
Build your own pre-tilt
- Write a personal system prompt once. Put it in Claude's project instructions or a custom GPT. Cover: your stack, your aesthetic, your conventions, the things you always want, the things you never want. Refresh it every few months.
- Pre-tilt the model toward your defaults, not your edge cases. If 80% of your requests are React components, make React the default. Mention exceptions in individual messages.
- Use examples in the system prompt. Few-shot examples are more steering-dense than abstract instructions. "Write like this:" + a paragraph beats "Write clearly."
- Specify what to refuse. "Don't add try/except around code unless I ask" or "don't write executive summaries unless requested" prunes a lot of unwanted output. Negative constraints work.
IV.
Generation is not evaluation
Here is an asymmetry that, once you see it, will change how you work with LLMs forever. The same model is dramatically better at telling you what's wrong with a piece of code than at writing that piece of code correctly on the first pass. The same model is better at critiquing an essay than at writing it. The same model is better at noticing a hole in an argument than at producing an argument without holes.
This is not because evaluation is "easier" in some abstract sense. It is because evaluation and generation are different operations mechanically. Generation walks forward, committing tokens. Evaluation conditions on an existing artifact and produces commentary about it. The artifact under evaluation acts as an external reference frame — the model is no longer trying to invent and check simultaneously, it is just checking.
Try it. The signup function below was produced by a model in a single generation pass. Switch to evaluation mode and ask the same model what's missing.
The model produced the code above. It looks reasonable. It runs. The output stream committed to a happy-path implementation token by token, and never went back to second-guess.
Nine issues, none of which appeared in the original generation. Same model, same training, same weights. The shift was structural: generation produced an artifact; evaluation surveyed it. The artifact gave the model something to compare against an implicit standard of "complete signup function," and the gap became visible.
This is why the iterate-and-critique loop is so much more powerful than one-shot prompting. You are alternating between two different modes of the same model. Generation makes something; evaluation finds what generation skipped; generation revises. Each pass moves the artifact closer to what a careful human would have produced — but you got there by exploiting an asymmetry, not by being clever.
The iterate-and-critique loop
- Always run two passes. First pass: generate. Second pass: paste the output back and ask "what's wrong, missing, or fragile?" The second pass costs nearly nothing and surfaces 80% of issues.
- Ask for criticism specifically, not improvement. "Improve this" puts the model back in generation mode and you'll get a different artifact with different gaps. "What's wrong with this" stays in evaluation mode.
- Use a different lens each pass. "Security issues," "edge cases," "performance," "what would a senior engineer flag in code review." Each lens illuminates a different region.
- Stop when criticism dries up. When the critique pass starts repeating itself or reaching for nitpicks, you've extracted what's available. Further passes hit diminishing returns.
V.
Memory is just context
When you have a long conversation with an LLM and it seems to "remember" what you said earlier, no remembering is happening. The entire conversation is being sent back to the model on every turn. It is rebuilt from raw text each time. There is no persistent state inside the model that carries from message to message.
This has practical consequences. There is a limit — the context window — on how much text can be sent. When you exceed it, something must be dropped. Older messages get summarized or truncated. The model genuinely does not see them anymore. It is not refusing to recall; it is operating without them in front of it.
The slider below sets a context budget. Watch what happens as you add messages and the budget gets tight.
Messages with strikethrough have fallen out of the model's view. From its perspective, they did not happen.
This explains a lot of "the model contradicted itself" or "the model forgot what we agreed on" behavior. The agreement was made twenty messages ago, the conversation has rolled past it, and the model is now reasoning from a context that does not contain the earlier exchange. It cannot reach back. From inside the model's frame, you and it just started talking.
There is also a subtler problem: even within the window, attention is not uniform. Information at the beginning and end of the context tends to be weighted more heavily; information stuck in the middle of a long context is sometimes effectively ignored. This is the "lost in the middle" phenomenon. Putting a critical instruction halfway through a 30,000-token document does not guarantee the model will act on it.
Working with context as a resource
- Put the important stuff at the top or the bottom. System prompts, key constraints, and the question being asked deserve prime real estate. The middle is the worst place for anything load-bearing.
- Externalize long-term memory. For information that needs to persist across sessions — your stack, your conventions, ongoing project state — use system prompts, project instructions, or external memory (RAG, knowledge files). Don't rely on the conversation to remember.
- Start fresh threads for genuinely new tasks. A long conversation accumulates context the model has to wade through. New task → new thread. You will get sharper output and lower latency.
- Summarize and restart when threads get long. "Summarize the key decisions and code we've landed on" → copy that summary → start a fresh conversation with it as the opener. You've compressed 20,000 tokens of meander into 800 tokens of substrate.
The taxonomy
Of the five facts above, some are structural and will not be trained away — they follow from how autoregressive language models work. Others are tendencies in current models that will soften as training improves. It helps to know which is which.
Structural · here forever
- Path dependence. Sequential generation is the architecture. Early tokens will always condition late tokens; no version of this model class will retroactively rewrite its own opening.
- Sampling, not retrieval. Output will always be a sample from a learned distribution. Bias toward common patterns is the math, not the training data.
- Context as memory. Without a different architecture (true persistent state), conversations will always be reconstructed from text.
Tendencies · will improve
- Mode-seeking under vagueness. Models can be trained to ask clarifying questions, explore alternatives before committing, and resist defaulting to common patterns. This is getting better with each generation.
- Weakness at self-critique mid-generation. Models can be trained to produce drafts and revise them in a single pass. This is the direction of "thinking" models.
- Lost in the middle. Attention mechanisms are an active research area. Long-context recall is improving fast.
The takeaway is not pessimistic. The structural limits are exactly what makes prompt design a real craft — there is a skill to working with the grain, and the people who develop it get dramatically more out of these systems than the people who don't. Once you can see what the model is actually doing, every frustration becomes a tell: it is showing you exactly where to push.