{label}

{Math.round(clamp(fill,0,1)*80)} GB

  );
}

// ════════════════════════════════════════════════════════════════════════════
// SCENE — WHAT IT TAKES TO RUN A MODEL (length 22s)
// ════════════════════════════════════════════════════════════════════════════
function ScenePhysicsRun() {
  const { localTime: lt } = useSprite();
  const setup = ramp(lt, 0.2, 1.0);

  // weight grid (left)
  const gx = 130, gy = 300, gc = 14, gr = 9, cell = 22, gap = 6;
  const fillGrid = ramp(lt, 1.6, 4.6);
  const wCount = Math.round(70 * ramp(lt, 1.8, 4.4)); // billions

  // gpu cards (center)
  const gpuPhase = lt > 5.0;
  const card1 = clamp(ramp(lt, 5.4, 7.0), 0, 1); // → 80GB
  const card2 = clamp(ramp(lt, 6.6, 8.2), 0, 1) * 0.75; // → 60GB (140 total)

  // per-token compute
  const tokPhase = lt > 9.0;

  // kv cache callback
  const kvPhase = lt > 13.0;
  const costPhase = lt > 17.0;

  return (
    <>
      {/* weights grid */}

the weights

{Array.from({length:gc*gr}).map((_,i)=>{ const on = i < gc*gr*fillGrid; return

; })}

{wCount}B

weights — matrices, not magic

{/* GPU cards */} {gpuPhase && ( <>

loaded into GPU memory

70B × 2 bytes = 140 GB

frontier models → a whole cluster

)} {/* per-token compute / kv */} {tokPhase && (

Every token is multiplied through all of them:

≈ 140 GFLOP / token

{kvPhase && (

…and the KV cache for your context eats VRAM too —

longer context literally needs more GPU.

)}

)} {/* cost chain */} {costPhase && (

$2–4 / GPU-hour × GPU-hours ÷ tokens

= the price per token. You're renting GPU time, sliced thin.

      )}
      A model is just billions of numbers — the trained weights.
      To run, every weight must sit in GPU memory — big models need many GPUs.
      And each token is multiplied through all of them — that's the compute you rent.
      Context isn't free either — its KV cache takes VRAM that could serve other users.
      GPUs cost dollars an hour — that hourly rate, divided by tokens, is your price.
  );
}

// ════════════════════════════════════════════════════════════════════════════
// SCENE — WHY PRICES DIFFER (length 23s)
// ════════════════════════════════════════════════════════════════════════════
function ScenePhysicsWhy() {
  const { localTime: lt } = useSprite();
  const setup = ramp(lt, 0.2, 1.0);

  // input vs output lanes
  const inN = 10, outN = 6;
  const inLight = ramp(lt, 1.8, 2.6); // all at once
  const outAt = (i) => 3.4 + i*0.5;
  const nOut = Array.from({length:outN}).filter((_,i)=> lt > outAt(i)).length;

  const spectrumPhase = lt > 8.4;
  const moePhase = lt > 12.6;
  const recapPhase = lt > 17.6;

  const laneY1 = 250, laneY2 = 400, lx = 150, sq = 38, sgap = 10;

  return (
    <>
      {/* INPUT lane */}

INPUT · prefill

{Array.from({length:inN}).map((_,i)=>(

0.5?'cc':'22'), border:`1.4px solid ${COLORS.blue}`, opacity:setup }} /> ))}

{inLight>0.4 && (

)}

all tokens in one parallel pass
GPU fully used → efficient → cheap

{/* OUTPUT lane */}

OUTPUT · decode

{Array.from({length:outN}).map((_,i)=>{ const on = i < nOut; const newest = i === nOut-1; return (

); })}

{nOut>0 && Array.from({length:nOut}).map((_,i)=>(

))}

one token at a time, its own pass
GPU underused → ~5× the cost per token

{/* price spectrum */}
{spectrumPhase && (() => {
  const o = pulse(lt, 8.6, 23, 0.4);
  const sx = 200, sw = 1520, sy = 660;
  const marks = [
    { p:0.08, t:'8B dense', s:'1 GPU · cheap', c:COLORS.green },
    { p:0.46, t:'MoE 200B / ~20B active', s:'runs lean for its size', c:COLORS.yellow, moe:true },
    { p:0.92, t:'frontier dense', s:'a cluster · costly', c:COLORS.coral },
  ];
  return (

What drives the spread: parameters you must load & multiply

cheaper →→ pricier

{marks.map((m,i)=>{
  // MoE marker fades in later than the two dense markers
  const mo = m.moe ? ramp(lt, 12.8, 13.6) : ramp(lt, 9.0+i*0.4, 9.7+i*0.4);
  if (mo<=0.01) return null;
  return (

{m.t}

{m.s}

); })}

); })()} {/* recap drivers */} {recapPhase && (

{[['active parameters',COLORS.blue],['input vs output',COLORS.yellow],['GPU $/hour',COLORS.coral],['batching & demand',COLORS.green]].map(([t,c],i)=>( {t} ))}

      )}
      Input is processed all at once — the GPU runs flat-out, so it's cheap.
      Output comes one token at a time — the GPU idles between them, so it costs ~5× more.
      Across models, price tracks the parameters you must hold in memory and multiply.
      Mixture-of-Experts fires only a few experts per token — big brain, small bill.
      So the drivers: active params, input vs output, GPU rate, and how well it batches.
  );
}

Object.assign(window, { ScenePhysicsRun, ScenePhysicsWhy, GpuCard });
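
// ────────────────────────────────────────────────────────────────────────────
// Worked arithmetic behind the two scenes: a minimal sketch, not used by them.
// The helper names below are hypothetical. Assumptions, taken from the
// on-screen numbers: 2 bytes per weight (70B × 2 bytes = 140 GB), 80 GB per
// card, and roughly 2 FLOPs per active weight per token (70B → ≈ 140 GFLOP).
// The throughput passed to pricePerMillionTokens is whatever the caller has
// measured; no figure is assumed here.
// ────────────────────────────────────────────────────────────────────────────

// Memory needed just to hold the weights, in GB (1e9 params × bytes ÷ 1e9).
function weightMemoryGB(paramsBillions, bytesPerParam = 2) {
  return paramsBillions * bytesPerParam;
}

// Cards occupied by the weights alone; the KV cache is not counted.
function gpusForWeights(paramsBillions, gpuMemoryGB = 80, bytesPerParam = 2) {
  return Math.ceil(weightMemoryGB(paramsBillions, bytesPerParam) / gpuMemoryGB);
}

// Approximate compute per token, in GFLOP: one multiply and one add per
// active weight. For MoE, pass the active parameter count, not the total.
function gflopPerToken(activeParamsBillions) {
  return 2 * activeParamsBillions;
}

// The cost chain from the first scene: hourly rate → dollars per second →
// dollars per token, scaled to one million tokens.
function pricePerMillionTokens(gpuDollarsPerHour, gpuCount, tokensPerSecond) {
  const dollarsPerSecond = (gpuDollarsPerHour * gpuCount) / 3600;
  return (dollarsPerSecond / tokensPerSecond) * 1e6;
}

// Usage, matching the scene's figures:
//   weightMemoryGB(70)  // 140
//   gpusForWeights(70)  // 2
//   gflopPerToken(70)   // 140
//   gflopPerToken(20)   // 40, the MoE marker with ~20B active parameters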