{/* limit line */}

context window limit
e.g. 200K tokens

the context window

{/* blocks (bottom-anchored) */}
{visible.map((b, idx) => {
  const fromBottom = visible.length - 1 - idx;
  const y = baseY - (fromBottom + 1) * bh - fromBottom * bgap;
  const dropping = overflow && idx === 0;
  const o = b.summary ? ramp(lt, 11.0, 11.8, Easing.easeOutBack) : (dropping ? 0.4 : 1);
  return (

{b.label} {b.summary && 'compacted ✓'}

  );
})}
{/* overflow flash */}
{overflow && (

⚠ window full — dropping oldest

)}

Context isn't infinite: every model has a hard window limit.
Long agent runs fill it up, turn after turn.
Overflow it and the oldest turns fall off; the agent simply forgets them.
Better: compact old turns into a summary that keeps the meaning and drops the tokens.
Managing the window means managing both memory and cost.

  );
}

// ════════════════════════════════════════════════════════════════════════════
// SCENE - GENERATING THE ANSWER (decode + reasoning) (length 21s)
// ════════════════════════════════════════════════════════════════════════════
function SceneDecode() {
  const { localTime: lt } = useSprite();
  const setup = ramp(lt, 0.2, 1.0);

  // mini KV strip at top (prefill already done -> all green)
  const kvX = 150, kvY = 250, cell = 30, gap = 6, kcols = 10, krows = 4;

  // output tokens emitted one at a time
  const out = ['It', "'s", ' 18', '°C', ' and', ' clear', ' in', ' Tokyo', '.'];
  const emitAt = (i) => 2.6 + i * 0.62;
  const nOut = out.filter((_, i) => lt > emitAt(i)).length;

  // reasoning tokens (hidden) appear in a separate band BEFORE the visible answer (conceptually)
  const reasonIn = pulse(lt, 11.2, 21, 0.4);
  const reasonCount = 14;
  const oy = 470; // output row baseline
  const ox0 = 150;

  return (
    <>
      {/* prefill KV strip (context, all cached/green) */}

context (prefilled) →

{Array.from({length:kcols}).map((_,c)=>(

{Array.from({length:krows}).map((_,r)=>(

))}

))}
{/* the live decode column grows to the right */}
{Array.from({length:nOut}).map((_,c)=>{
  const isNewest = c===nOut-1;
  const col = isNewest ? COLORS.yellow : COLORS.blue;
  return (

{Array.from({length:krows}).map((_,r)=>(

))}

); })}

{/* forward-pass counter */} {nOut>0 && lt<11 && (

forward passes

{nOut}

one per output token

)} {/* emitted answer text */}

{out.slice(0,nOut).join('')}▌

{/* reasoning tokens band */} {reasonIn > 0.01 && (

🔒 hidden reasoning — you never see these, but you pay for them as output

{Array.from({length:reasonCount}).map((_,i)=>{ const o = ramp(lt, 11.4+i*0.05, 11.7+i*0.05); return

•••

; })}

→ visible answer

)}

Prefill runs in parallel, but generating the reply does not.
Output is produced one token at a time, each fed back to predict the next.
N output tokens = N forward passes.
That's a large part of why output is priced higher than input, often around 5×.
"Thinking" models also emit hidden reasoning tokens before the answer…
…billed as output, and often the biggest slice of the bill.

    </>
  );
}

Object.assign(window, { SceneWindowEdges, SceneDecode });
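
The two ideas in the captions (compacting old turns instead of dropping them, and hidden reasoning tokens being billed as output) can be sketched in plain JavaScript. This is a minimal illustration, not part of the scenes: `estimateTokens`, `compactHistory`, and `estimateCost` are hypothetical helpers, the 4-characters-per-token rule is a rough assumption rather than a real tokenizer, and the 5× output multiplier is the approximate figure from the caption, not any provider's actual price list.

```javascript
// Hypothetical helpers for illustration only; names and numbers are assumptions.

// Rough token estimate (~4 characters per token); a real tokenizer will differ.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the newest turns that fit, and fold everything older into one summary turn.
// `summarize` is supplied by the caller (in practice probably another model call)
// and is assumed to return text that fits within `summaryBudget` tokens.
function compactHistory(turns, limit, summaryBudget, summarize) {
  const total = turns.reduce((n, t) => n + estimateTokens(t.text), 0);
  if (total <= limit) return turns; // everything fits, nothing to do

  const recentBudget = limit - summaryBudget;
  const kept = [];
  let used = 0;
  // Walk from newest to oldest, keeping turns while they fit.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].text);
    if (used + cost > recentBudget) break;
    kept.unshift(turns[i]);
    used += cost;
  }

  const old = turns.slice(0, turns.length - kept.length);
  // `summary: true` mirrors the `b.summary` flag the scene uses for compacted blocks.
  const summaryTurn = { label: 'summary', text: summarize(old), summary: true };
  return [summaryTurn, ...kept];
}

// Cost in currency units, given a price per million input tokens.
// Reasoning tokens are added to the visible answer and billed at the output rate.
function estimateCost(usage, inputPricePerMTok, outputMultiplier = 5) {
  const { inputTokens, answerTokens, reasoningTokens = 0 } = usage;
  const outputPricePerMTok = inputPricePerMTok * outputMultiplier;
  const billedOutput = answerTokens + reasoningTokens;
  return (inputTokens * inputPricePerMTok + billedOutput * outputPricePerMTok) / 1e6;
}
```

A caller might run `compactHistory(turns, 200000, 2000, mySummarizer)` before each request, where `mySummarizer` is whatever summarization step is available. Reserving `summaryBudget` up front is a design choice that keeps the summary from pushing the compacted history back over the limit.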