Your Prompt is a Function Call

You sent a prompt. You got garbage back. You blamed the model.

You were wrong.

The model did exactly what it was told. It resolved a probability cascade through a stack of trained bias registers, starting with whatever context you handed it. You handed it a mess. It resolved a mess. That's not a bug — that's the contract.

Here's what's actually happening under the hood every time you hit send.

The Problem Worth Caring About

This isn't a UX problem. It's not about "better phrasing." It's architectural.

Modern LLMs, especially Mixture of Experts (MoE) architectures, don't just process your prompt. They route on it. In an MoE model, each token's representation at each layer decides which expert networks fire, and in every transformer your context shapes which attention patterns activate. Send thin context and you don't just get a slightly worse answer. You can get the output of a different computational path.

Every developer building on the API needs this mental model before they write another system prompt. Prompt engineering without understanding the pipeline is like tuning a database query without knowing what an index is. You might get lucky. You mostly won't.

The Token Pipeline — How Your Words Become Probabilities

Before anything else, your text doesn't exist to the model. It's a sequence of token IDs.

Input Text → Tokenizer → Token IDs → Transformer Stack → Logits → Softmax → Probabilities → Output Token

The tokenizer chops your prompt into sub-word units and maps them to integer IDs from the model's vocabulary. The vocabulary also bounds the output: the model can only ever emit tokens that are in it. On every single generation step, the model is asking: "Out of every token in my vocabulary, how likely is each one to come next?"
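To make the text-to-IDs step concrete, here's a toy sketch. It assumes a tiny, made-up vocabulary and a greedy longest-match rule; real tokenizers (BPE and friends) learn their pieces from data and work differently in the details. `TOY_VOCAB` and `tokenize` are illustrative names, not any library's API.

```python
# Toy greedy longest-match tokenizer. The vocabulary is hypothetical;
# real tokenizers learn tens of thousands of sub-word pieces from data.
TOY_VOCAB = {"<unk>": 0, "prompt": 1, "ing": 2, "token": 3, "s": 4, " ": 5, "un": 6}

def tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No piece matched: emit the unknown token and skip one character.
            ids.append(vocab["<unk>"])
            i += 1
    return ids

token_ids = tokenize("prompting tokens")
print(token_ids)  # [1, 2, 5, 3, 4]
```

The model never sees "prompting". It sees `[1, 2]`.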

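The softmax layer converts raw logit scores into a probability distribution across that full vocabulary. The output token is then picked from that distribution: greedy decoding takes the peak (argmax), while sampling draws a token at random according to the probabilities, with your temperature setting controlling how sharp or flat the distribution is.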
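Here's a minimal sketch of that last step, assuming a three-token vocabulary and made-up logit values. `softmax` and `pick_token` are illustrative helpers, and treating temperature 0 as "just take the argmax" is a common convention rather than a rule of any particular API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, temperature=1.0, rng=random):
    """Greedy (argmax) when temperature is 0, otherwise sample."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]            # hypothetical scores for a 3-token vocabulary
probs = softmax(logits)             # one probability per vocabulary entry
sharper = softmax(logits, 0.5)      # lower temperature concentrates the mass
greedy = pick_token(logits, temperature=0)
```

Lower temperature pushes more probability onto the top token. Higher temperature flattens the distribution, so less likely tokens get picked more often.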

That's the macro view. Here's where it gets interesting.

The Hidden Layer Isn't Memory — It's a Bias Register

Here's the mental model most developers carry: the transformer "reads" the prompt like a human reads a sentence — start to finish, building up meaning as it goes.

This is wrong. And it's the source of most prompting failures.

Hidden layers are not memory. They are trained bias registers — numerical states that the model was explicitly trained to populate and decode as tokens pass through. Context isn't "stored" in the way RAM stores data. It's activated — the token's passage through each layer populates the hidden state with a learned transformation of its meaning in context.

Here's what's crucial: the transformer was trained to read these bias values. It's not magic and it's not emergent. Billions of gradient descent steps taught the model exactly how to write to these registers and exactly how to read them back. The hidden state is a language, and the transformer is the only entity in the world that speaks it fluently.

Think of it like this: every hidden layer is a whiteboard that gets written on as a token passes through. But the writing is in a notation only the transformer knows how to interpret — because it was specifically trained to.

The Info Extractor and the Info Utilizer

Each transformer block has two jobs, and understanding them separately is the key to understanding why your prompt order matters:

*The two-part structure inside each transformer block:*

The Info Extractor (Attention) answers: Who is talking about what? For each token, it looks at the tokens visible to it (in a decoder-only model, the token itself and everything before it) and computes weighted relationships between them. The famous Q/K/V mechanism (Query, Key, Value) is the machinery for this, but the intuition is simpler: each token asks every token it can see "how relevant are you to me right now?" and weights the answer accordingly.

The Info Utilizer (FFN/MLP) answers: What do I do with this? After attention has assembled the relational context, the feed-forward network applies the learned patterns from training — the stored knowledge, the syntactic rules, the semantic associations. This is where the "knowing things" happens.
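The two jobs can be sketched in a few lines of plain Python. This is a toy under loud assumptions: one attention head, the token vectors reused directly as queries, keys, and values (real models apply learned projections), and a fixed scale-plus-ReLU standing in for the learned FFN. Residual connections and layer norm are left out entirely.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i may only attend to positions 0..i.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Relevance score of each visible token (itself and earlier ones).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted mix of the visible value vectors.
        mixed = [sum(w * values[j][d] for j, w in enumerate(weights))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

def feed_forward(vector, scale=2.0):
    """Hypothetical stand-in for the FFN: a fixed per-token transform with ReLU."""
    return [max(0.0, scale * x) for x in vector]

# Three toy token vectors; real models use learned Q/K/V projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = causal_attention(tokens, tokens, tokens)
block_output = [feed_forward(v) for v in attended]
```

Notice that `weights[0]` has exactly one entry: the first token can only attend to itself. The third token has three. That asymmetry is what the next section is about.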

Why Order Is Not Optional — The Undefined Variable Problem

Let me show you exactly why context order matters, using a simple example.

Version A:

"My name is Prajwal; he is a vibe coder."

When the model processes this:

"My name is Prajwal" passes through the transformer stack → hidden layers are populated with Prajwal as the subject
"he" arrives → attention looks back → finds Prajwal in the populated hidden state → resolves he = Prajwal cleanly
"is a vibe coder" → predicate applies to the resolved subject → correct output

Version B:

"He is a vibe coder; it's Prajwal."

"He" arrives first → attention looks back → finds nothing establishing who "He" is
The representation of "He" gets built from whatever the model's weights suggest for an unnamed male subject (not from your context, because your context hasn't given it one)
"is a vibe coder" attaches to that vague subject, which could be anything that fits the "vibe coder" pattern
"it's Prajwal" arrives late. Causal attention never goes back to revise the earlier representations, so the later tokens have to do all the work of linking "He" to Prajwal

This is the prompt equivalent of undefined behavior. You used the variable before you declared it. The interpreter didn't throw an error. It made its best guess. Best guesses in a trillion-parameter model are statistically plausible, and they can still be semantically wrong.

The rule this teaches: Context flows in one direction. Subject before pronoun. Background before question. Rules before task. If the model needs information to process token N, that information must exist at token N-1 or earlier — not after.
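One way to enforce that rule in code is to make the order structural instead of relying on discipline. The sketch below is a hypothetical helper, not part of any SDK: `build_prompt` and its section labels are assumptions chosen for illustration.

```python
def build_prompt(background, rules, task):
    """Assemble a prompt so every reference is defined before it is used."""
    sections = [
        ("Background", background),  # who and what we are talking about
        ("Rules", rules),            # constraints on the answer
        ("Task", task),              # the actual request, always last
    ]
    return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)

prompt = build_prompt(
    background="Prajwal is a developer who describes himself as a vibe coder.",
    rules="Answer in one sentence. Refer to him by name.",
    task="Describe how he might approach a weekend project.",
)
print(prompt)
```

By the time the model reaches "he" in the task, "Prajwal" is already in the context. The function signature makes it hard to write Version B by accident.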

Input vs Output Tokens — The Cost Asymmetry Explained

Here's something the API pricing page shows you but doesn't explain: input tokens cost less than output tokens. This isn't only a business decision. It follows from the architecture.

*Why input tokens are parallel and output tokens are sequential:*

Input tokens are processed in parallel: the entire prompt hits the transformer stack at once. And critically, the keys and values that attention computes for them at every layer can be saved in the KV cache (Key-Value cache). With that snapshot of the context state, the model doesn't have to re-process the prompt for each new token it generates, and providers that offer prompt caching can reuse the snapshot across turns as long as the prefix is unchanged. It reads the snapshot, picks up where it left off, and runs only the new tokens through.

Output tokens are autoregressive: each one needs its own forward pass through every layer of the model, conditioned on everything that came before it, including every output token generated so far. The KV cache saves the model from recomputing earlier tokens, but it can't remove the dependency. Token 5 can't be computed until token 4 exists. This is why generation feels sequential: it is sequential, architecturally.

This is also why input tokens feel "free" in a deeper sense than cost: they're doing setup work, not generation work. Every input token's real job is populating the keys and values that later tokens attend to. Any next-token prediction at a prompt position other than the last is discarded, and inference engines typically don't bother computing it. You're paying for context establishment, not generation, and the KV cache means you sometimes pay for it only once.
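A toy model of the caching behavior, assuming the cache is keyed on an exact token prefix. `ToyKVCache` is a made-up class that stores token IDs instead of real key/value tensors; all it does is count how many tokens need fresh computation.

```python
class ToyKVCache:
    """Counts how many tokens need fresh key/value computation.

    A stand-in for a real KV cache: it stores token IDs instead of tensors.
    """

    def __init__(self):
        self.cached_tokens = []
        self.tokens_computed = 0

    def process(self, tokens):
        """Process a full sequence, reusing the longest cached prefix."""
        shared = 0
        for cached, new in zip(self.cached_tokens, tokens):
            if cached != new:
                break
            shared += 1
        fresh = tokens[shared:]
        self.tokens_computed += len(fresh)
        self.cached_tokens = list(tokens)
        return len(fresh)

cache = ToyKVCache()
system_prompt = [101, 102, 103, 104]   # hypothetical token IDs
turn_1 = system_prompt + [7, 8]
turn_2 = turn_1 + [9]                  # same prefix, one new token
edited = [999] + turn_2[1:]            # first token changed

first = cache.process(turn_1)    # 6: nothing cached yet
second = cache.process(turn_2)   # 1: only the new token is computed
third = cache.process(edited)    # 7: prefix no longer matches, recompute all
```

Change one token at the start of the prompt and everything after it has to be recomputed, because every later key and value depended on it.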

The Gotchas

MoE: Your Prompt Is a Routing Signal

With Mixture of Experts architectures, the stakes of thin context compound. MoE models activate only a subset of their expert networks per token — the routing mechanism selects which experts to fire based on the token's representation at that layer.

Thin, ambiguous context → vague token representations → routing driven by generic patterns → output from experts that are a poor match for the task.

Rich, specific context → sharp token representations → routing that matches the task → output from the experts best suited to what you need.

You're not just writing a message. You're writing an activation signal for a sparse network. The model isn't lazy when it gives you a bad response to a vague prompt — it's being precise about what you gave it.
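Here's a rough sketch of what a top-k router does for a single token, assuming the common softmax-then-top-k gating scheme with renormalised weights. The gate scores below are invented, and real routers differ in details such as load balancing and where the softmax is applied.

```python
import math

def route(gate_logits, top_k=2):
    """Pick the top_k experts for one token and renormalise their weights."""
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k highest-scoring experts for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

# Hypothetical router scores for one token over four experts.
sharp = route([4.0, 0.5, 0.2, 0.1])   # one expert clearly dominates
flat = route([1.0, 0.9, 1.1, 1.0])    # near tie: the choice is close to arbitrary
```

When the gate scores are nearly flat, a tiny change in the token's representation flips which experts run. That's the fragility thin context buys you.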

The Discard Insight

Here's something that surprises most developers: the model can produce a "candidate next token" at every input position during the forward pass, and for every position except the last, that candidate is thrown away (in practice, inference engines usually don't even compute it). The work that matters at those positions is populating the attention layers with contextual information. Actual generation starts from the output of the final transformer layer at the final input token, and then continues autoregressively from there. You're never paying for "generation" on input tokens. You're paying for context computation. These are different things.
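A toy version of that loop, with a deliberately silly "model" that predicts the last token plus one. `toy_next_token` and `generate` are hypothetical stand-ins; the point is the shape of the computation, not the prediction rule.

```python
def toy_next_token(context):
    """Hypothetical stand-in for a model: predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10

def generate(prompt_ids, max_new_tokens):
    """Autoregressive decoding; assumes max_new_tokens >= 1."""
    # A model could emit a prediction at every prompt position, but only the
    # prediction at the final position is used to start generation.
    per_position = [toy_next_token(prompt_ids[: i + 1])
                    for i in range(len(prompt_ids))]
    discarded = per_position[:-1]
    sequence = list(prompt_ids)
    sequence.append(per_position[-1])
    # Every later token needs the one before it, so these steps run in order.
    for _ in range(max_new_tokens - 1):
        sequence.append(toy_next_token(sequence))
    return sequence[len(prompt_ids):], discarded

new_tokens, discarded = generate([3, 4, 5], max_new_tokens=4)
print(new_tokens)   # [6, 7, 8, 9]
print(discarded)    # [4, 5]
```

The prompt positions could all be handled in one batch. The generation loop can't be, because each step reads the token the previous step wrote.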

Context Order Is Non-Negotiable

Say it out loud: subject before pronoun, background before question, rules before task. Always. This isn't style advice. It's an architectural requirement. The transformer can't look ahead. By the time it reaches a token that needs prior information, that information must already exist in the hidden state.

When Not To Use This Mental Model

This mechanistic view breaks down at the edges:

Long-context models with positional encoding schemes (RoPE, ALiBi) handle distant dependencies differently. The "order matters" principle still holds, but how strongly a token attends to distant context is model-specific and doesn't fall off in a simple linear way.

Fine-tuned models have had their bias registers reshaped by additional training. The base architecture is the same but the learned patterns in the FFN layers may have overridden general behaviors. Don't assume identical prompting strategies transfer across fine-tunes.

Multi-modal models introduce a separate embedding pathway for non-text inputs. The token pipeline still applies to text, but the interaction between vision tokens and text tokens in the attention layers follows a different geometry.

RAG and tool-augmented systems inject information mid-context in ways that can partially compensate for thin initial prompts — but this doesn't eliminate the ordering requirement, it just changes where the context comes from.

Knowledge Check

Simple: What does the softmax layer output, and why is the vocabulary size significant to that output?

Medium: Take these two prompts: "He built the company from nothing; that founder is Yusuf." vs "Yusuf is the founder who built the company from nothing." — Using the bias register model, explain why the second prompt produces more reliable completions about Yusuf.

Hard: A developer notices that their Claude API calls are fast on the second message in a conversation but slow on the first. Using KV cache, autoregressive generation, and the input/output token asymmetry, explain the full mechanism behind what they're observing — and describe what would happen to that speed if they changed their system prompt between turns.

The Landing

The model is not intelligent. It is a trained probability resolver with a stack of bias registers it was taught to populate and decode. Your prompt is the only variable you control in that entire pipeline.

Write it like a function signature — not a wish. Declare your variables before you use them. Pass your context in order. Give the routing signal enough density to fire the right expert.

The model will do the rest. Precisely.
