Most prompt engineering advice is written for demos. The techniques that look clean in a notebook break down the moment real users arrive with unexpected input. This is what I’ve learned shipping LLM-powered systems to production.
The demo trap
A demo works because the input is controlled. You write the prompt, you write the test input, and you show the output. It’s a closed loop — optimised for one path through a system that will eventually have thousands.
Production means:
- Users phrase things you didn’t anticipate
- Documents have structure you didn’t test against
- Edge cases arrive in volume, not one by one
A production pipeline
A minimal reliable pipeline has four stages: ingest, validate input, call the model, validate output. Human review sits between output validation and any downstream action.
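The stages above can be sketched as one function. This is a minimal sketch under stated assumptions, not a definitive implementation: `runPipeline`, `ModelCall`, `PipelineResult`, the character limit, and the 0.7 review threshold are hypothetical names and values chosen for illustration, and the model call is injected so the flow can be exercised without a real API.

```typescript
// Hypothetical types for illustration; none of these names come from a real library.
type Classification = { category: string; confidence: number; reasoning: string };
type ModelCall = (system: string, user: string) => Promise<string>;

type PipelineResult =
  | { status: "ok"; value: Classification }
  | { status: "pending_review"; value: Classification }
  | { status: "rejected"; reason: string };

const MAX_CHARS = 20000;      // assumed input limit
const REVIEW_THRESHOLD = 0.7; // assumed cut-off for human review

// Stage 2: validate input before spending tokens on it.
function validateInput(doc: string): string | null {
  const trimmed = doc.trim();
  if (trimmed.length === 0) return "empty document";
  if (trimmed.length > MAX_CHARS) return "document too long";
  return null;
}

// Stage 4: validate the output shape before anything downstream sees it.
function validateOutput(raw: string): Classification | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // prose prefix, truncated object, and so on
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;
  if (typeof d.category !== "string") return null;
  if (typeof d.confidence !== "number" || d.confidence < 0 || d.confidence > 1) return null;
  if (typeof d.reasoning !== "string" || d.reasoning.length === 0) return null;
  return { category: d.category, confidence: d.confidence, reasoning: d.reasoning };
}

async function runPipeline(doc: string, callModel: ModelCall): Promise<PipelineResult> {
  // Stage 1 (ingest) is assumed to have already produced `doc` as plain text.
  const inputError = validateInput(doc);
  if (inputError) return { status: "rejected", reason: inputError };

  // Stage 3: call the model.
  const raw = await callModel("You are a document classifier.", doc);

  const value = validateOutput(raw);
  if (!value) return { status: "rejected", reason: "invalid model output" };

  // Review gate: low-confidence results wait for a human before any downstream action.
  if (value.confidence < REVIEW_THRESHOLD) return { status: "pending_review", value };
  return { status: "ok", value };
}
```

Every exit from the function is a named status, so a caller cannot mistake a rejected or unreviewed result for a usable one.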
Structure your prompts like code
Treat system prompts as source files: versioned, reviewed, tested. A prompt that lives in a string literal inside a function call is hard to test in isolation, and it can't be reviewed or rolled back independently of the code around it.
What this looks like in practice:
// prompts/classify-document.ts
export const classifyDocument = {
  system: `You are a document classifier. Given a document, return a JSON object
with the following fields:
- category: one of ["invoice", "contract", "report", "other"]
- confidence: a number from 0 to 1
- reasoning: one sentence explaining the classification
Return only valid JSON. No prose before or after.`,
  user: (document: string) =>
    `Classify the following document:\n\n${document}`,
};
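Once the prompt is an exported value, it can be checked like any other module. Here is a rough sketch of what "tested" might mean. The prompt is repeated in shortened form so the block stands alone, and `CATEGORIES` and `checkPrompt` are hypothetical names of mine, not part of any test framework.

```typescript
// Single source of truth for the categories (an assumption of this sketch).
const CATEGORIES = ["invoice", "contract", "report", "other"] as const;

// Shortened copy of the prompt module so this block stands alone.
const classifyDocument = {
  system: `You are a document classifier. Return a JSON object with the fields
category (one of ${JSON.stringify(CATEGORIES)}), confidence (0 to 1), and reasoning.
Return only valid JSON. No prose before or after.`,
  user: (document: string) => `Classify the following document:\n\n${document}`,
};

// Hypothetical helper: returns a list of problems instead of throwing,
// so it can be wired into whatever test runner the project already uses.
function checkPrompt(prompt: typeof classifyDocument): string[] {
  const problems: string[] = [];
  for (const category of CATEGORIES) {
    if (!prompt.system.includes(`"${category}"`)) {
      problems.push(`system prompt does not mention category "${category}"`);
    }
  }
  if (!prompt.system.includes("valid JSON")) {
    problems.push("system prompt no longer asks for JSON-only output");
  }
  const sample = "ACME Ltd. Amount due: 100";
  if (!prompt.user(sample).endsWith(sample)) {
    problems.push("user prompt does not end with the document text");
  }
  return problems;
}
```

Checks like these say nothing about model quality. They only catch the cheap regressions, such as an edit that drops a category or the JSON-only instruction, before the prompt ships.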
Fail explicitly, not silently
The worst LLM failure mode is confident incorrectness. A system that returns plausible-looking wrong output is worse than one that errors — because the error is visible and the wrong output often isn’t.
Design for explicit failure:
const result = await classify(doc);
// Below the threshold, hand the document to a human instead of acting on the result.
if (result.confidence < 0.7) {
  await queueForReview(doc, result);
  return { status: "pending_review" };
}
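One way to keep these outcomes visible to callers is to give each one its own typed case, so the compiler complains when a new status is added and not handled. A small sketch, assuming hypothetical names (`Outcome`, `nextAction`) and illustrative action strings:

```typescript
// Hypothetical outcome type: every way a classification can end is a named case.
type Outcome =
  | { status: "classified"; category: string }
  | { status: "pending_review" }
  | { status: "failed"; error: string };

function nextAction(outcome: Outcome): string {
  switch (outcome.status) {
    case "classified":
      return `route as ${outcome.category}`;
    case "pending_review":
      return "wait for human review";
    case "failed":
      return `alert: ${outcome.error}`;
    default: {
      // If a new status is added to Outcome and not handled above,
      // this assignment stops compiling.
      const unhandled: never = outcome;
      throw new Error(`unhandled outcome: ${JSON.stringify(unhandled)}`);
    }
  }
}
```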
Validate the schema, not just the content
LLMs hallucinate structure. Even with explicit instructions to return JSON, you’ll see prose prefixes, trailing comments, and malformed objects. Parse every response against a schema before it touches downstream logic.
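import { z } from "zod";

const ClassificationSchema = z.object({
  category: z.enum(["invoice", "contract", "report", "other"]),
  confidence: z.number().min(0).max(1),
  reasoning: z.string().min(1),
});

// JSON.parse throws on malformed input, so guard it before schema validation.
let candidate: unknown = undefined;
try {
  candidate = JSON.parse(raw);
} catch {
  // leave candidate undefined; safeParse below rejects it
}

const parsed = ClassificationSchema.safeParse(candidate);
if (!parsed.success) {
  // log, retry with correction prompt, or route to fallback
}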
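The "retry with correction prompt" branch could look something like the sketch below. It is dependency-free on purpose. `Ask`, `Validation`, and `askWithCorrection` are hypothetical names, the wording of the correction message is an assumption, and `validate` stands in for whatever schema check you already run.

```typescript
// Hypothetical signatures: `ask` sends a user message and resolves to the raw reply;
// `validate` returns either the parsed value or an error message.
type Ask = (userMessage: string) => Promise<string>;
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

async function askWithCorrection<T>(
  ask: Ask,
  validate: (raw: string) => Validation<T>,
  userMessage: string,
  maxAttempts = 2,
): Promise<Validation<T>> {
  let message = userMessage;
  let last: Validation<T> = { ok: false, error: "no attempts made" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await ask(message);
    last = validate(raw);
    if (last.ok) return last;
    // Feed the validation error back so the model can correct its own output.
    message = `${userMessage}\n\nYour previous reply was rejected: ${last.error}\nReturn only valid JSON.`;
  }
  return last; // still invalid: the caller should log it and route to a fallback
}
```

Capping the attempts matters: an unbounded retry loop turns one malformed response into an open-ended bill.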
Token cost is in prompt length
If you’re calling the same prompt in a loop over $N$ documents, the cost is linear in both the prompt length $L_p$ and the document length $L_d$:
$$C = N \cdot p \cdot (L_p + L_d)$$
where $p$ is the per-token price. The practical implication: keep system prompts tight. An extra 500 tokens in a system prompt costs nothing at demo scale and real money at production scale.
For batch classification, the asymptotic cost per document is:
$$\lim_{N \to \infty} \frac{C}{N} = p \, (L_p + L_d)$$
A fixed cost per document, which means prompt length matters far more than call overhead at volume.
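To put numbers on the 500-token claim, here is a small calculator for the linear model described above: total cost is the number of documents times the per-token price times the sum of prompt and document tokens. The price of 1 currency unit per million tokens is a made-up figure for illustration, not a real rate, and `batchCost` is a hypothetical helper.

```typescript
// Linear cost model: every call pays for the system prompt and the document.
// Price is expressed per million tokens to keep the arithmetic readable.
function batchCost(
  numDocs: number,
  promptTokens: number,
  docTokens: number,
  pricePerMillionTokens: number,
): number {
  return ((numDocs * (promptTokens + docTokens)) / 1000000) * pricePerMillionTokens;
}

// Cost of 500 extra system-prompt tokens at an assumed price of 1 per million tokens.
const demoScale = batchCost(10, 500, 0, 1);            // 0.005
const productionScale = batchCost(1000000, 500, 0, 1); // 500
```

The same 500 tokens that cost half a cent across ten demo calls cost 500 units across a million documents, and that repeats on every batch run.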
One system prompt per task
Multipurpose prompts produce mediocre results across all purposes. A prompt that classifies documents, extracts entities, and summarises text will do all three worse than three focused prompts.
Composition over concentration. Build small, well-scoped prompt functions and wire them together in your application layer — not inside the prompt.
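A sketch of what wiring in the application layer can look like: each prompt does one job, and ordinary code decides which one runs next. The two prompts, the routing rule, and the names `followUps` and `buildRequests` are all hypothetical, chosen only to illustrate the shape.

```typescript
type Prompt = { system: string; user: (doc: string) => string };

// Two focused prompts (illustrative wording), each with a single job.
const extractEntities: Prompt = {
  system: "Extract the parties, dates, and amounts from the document. Return only valid JSON.",
  user: (doc) => `Extract entities from the following document:\n\n${doc}`,
};

const summarise: Prompt = {
  system: "Summarise the document in three sentences.",
  user: (doc) => `Summarise the following document:\n\n${doc}`,
};

// Application-layer routing: code, not prompt text, decides what runs next.
// The rule itself is an assumption made for this example.
function followUps(category: string): Prompt[] {
  switch (category) {
    case "invoice":
    case "contract":
      return [extractEntities];
    case "report":
      return [summarise];
    default:
      return [];
  }
}

// Turn a classification result into the concrete requests to send.
function buildRequests(category: string, doc: string): { system: string; user: string }[] {
  return followUps(category).map((p) => ({ system: p.system, user: p.user(doc) }));
}
```

Because the routing lives in code, it can be unit-tested and changed without touching any prompt text.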
The pattern across all of this is the same as any other engineering discipline: explicit contracts, testable units, visible failures. LLMs are a new primitive, not a new category of engineering.