§16.2
Structured Outputs and Extraction
Most production LLM workflows don't show the model's output to a human. They show it to another system. A support ticket gets routed to a queue. A contract extraction populates a database. A sales call summary fills in a CRM field. For these handoffs, free-text output is a liability: it has to be parsed, cleaned, validated, and corrected before the next system can act on it. Structured outputs are the bridge. The model returns JSON. The JSON conforms to a schema. The schema is the contract.
This article is about that bridge — what it does, how it works, and why it matters for almost every LLM workflow that ships beyond a chat window.
The Executive Question
How do we get the model's output into the next system without manual cleanup, and how do we know what's in the output before it gets there?
The honest version: define a schema, force the model to fill it, validate the result, and reject what doesn't match. The schema becomes the contract between the model and the rest of the pipeline.
The Flow
The architecture has three stages and the entire value of structured outputs is in the third.
Figure: From messy text to validated JSON (the structured-output handoff).
A walk-through:
- Free-text input. Whatever the upstream source delivers — a review, a ticket body, a sales-call note, a scanned contract page. Messy by nature.
- LLM + schema. The model is prompted with the task plus a JSON schema describing the required output shape. Modern APIs and libraries (OpenAI's JSON mode and Structured Outputs, Anthropic's tool use, structured-output libraries like `instructor` and `outlines`) push the output toward that shape, though the guarantees differ: some constrain generation to the schema, while others only guarantee syntactically valid JSON or validate and retry after the fact.
- Validated JSON. The output is parsed, validated against the schema, and only then handed to the downstream system. If validation fails, the system retries or routes to a human.
The validation step is the part most teams skip and the part where most failures hide.
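The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `call_model` is a hypothetical stand-in for whatever API or library produces the model's reply, and `REQUIRED_KEYS` is an assumed, deliberately tiny schema.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"category", "urgency"}  # assumed minimal schema, for illustration


def parse_and_validate(raw: str) -> Optional[dict]:
    """Return the parsed record, or None if it is not a JSON object with the required keys."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
        return None
    return record


def extract(text: str, call_model: Callable[[str], str], max_attempts: int = 2) -> dict:
    """Call the model, validate the reply, retry on failure, then fall back to a human."""
    for _ in range(max_attempts):
        record = parse_and_validate(call_model(text))
        if record is not None:
            return {"status": "ok", "record": record}
    return {"status": "needs_human_review", "input": text}


# Stand-in model: the first reply is chatty free text, the second is valid JSON.
replies = iter(["Sure! Here is the JSON...", '{"category": "billing", "urgency": 4}'])
result = extract("I was charged twice.", lambda _text: next(replies))
```

The point of the sketch is the control flow: nothing reaches the caller as `"ok"` unless it parsed and passed the check, and a reply that never validates becomes a human-review item rather than a silent failure.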
A Real Schema
A reasonable extraction schema for a support ticket summary:
{
  customer_id: string,
  category: "billing" | "delivery" | "app" | "quality" | "other",
  urgency: 1 | 2 | 3 | 4 | 5,
  product_mentioned: string | null,
  refund_requested: boolean,
  contains_pii: boolean,
  short_summary: string,        // 1-2 sentences
  recommended_action: string,   // e.g., "route to T2 support"
  confidence: number            // 0-1
}
Three properties of a well-designed schema:
- Types are strict. "urgency" is an integer 1–5, not a string. The model returns numbers; the system parses numbers.
- Enums are explicit. Categories are a fixed list. The model can't invent "miscellaneous" or "weird stuff" — those values won't validate.
- Nullability is explicit. "product_mentioned" can be null. The schema says so. The model has a defined way to say "I don't know."
A schema this tight is the difference between an LLM workflow that ships and one that the engineering team has to babysit. The model returns JSON. The JSON validates. The next system consumes it. Done.
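To make the three properties concrete, here is one way the ticket schema above could be checked with only the Python standard library. It is a sketch under assumptions: the function name `validate_ticket` and the error-message wording are illustrative, and in practice a schema library would usually replace the hand-written checks.

```python
from typing import Any, Callable, List

CATEGORIES = {"billing", "delivery", "app", "quality", "other"}


def validate_ticket(record: dict) -> List[str]:
    """Return a list of schema violations; an empty list means the record validates."""
    errors: List[str] = []

    def check(name: str, ok: Callable[[Any], bool], message: str) -> None:
        if name not in record:
            errors.append(f"{name}: missing")
        elif not ok(record[name]):
            errors.append(f"{name}: {message}")

    def is_str(v: Any) -> bool:
        return isinstance(v, str)

    def is_bool(v: Any) -> bool:
        return isinstance(v, bool)

    def is_number(v: Any) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    check("customer_id", is_str, "must be a string")
    check("category", lambda v: is_str(v) and v in CATEGORIES, "not an allowed category")
    check("urgency", lambda v: type(v) is int and 1 <= v <= 5, "must be an integer 1-5")
    check("product_mentioned", lambda v: v is None or is_str(v), "must be a string or null")
    check("refund_requested", is_bool, "must be a boolean")
    check("contains_pii", is_bool, "must be a boolean")
    check("short_summary", is_str, "must be a string")
    check("recommended_action", is_str, "must be a string")
    check("confidence", lambda v: is_number(v) and 0 <= v <= 1, "must be a number in [0, 1]")
    return errors


good = {
    "customer_id": "C-1042", "category": "billing", "urgency": 4,
    "product_mentioned": None, "refund_requested": True, "contains_pii": False,
    "short_summary": "Customer was charged twice.",
    "recommended_action": "route to T2 support", "confidence": 0.91,
}
# An invented category and a stringified number both fail validation.
bad = dict(good, category="miscellaneous", urgency="5")
```

Here `validate_ticket(good)` returns an empty list, while `validate_ticket(bad)` reports the invented category and the string-typed urgency.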
Common Extraction Tasks
A short tour of what teams routinely do with structured outputs:
| Task | Input | Output schema (sketch) |
|---|---|---|
| Ticket triage | support ticket body | category, urgency, refund_requested, recommended_action |
| Contract extraction | contract PDF (via document AI) | parties, term, renewal_terms, key_clauses, jurisdiction |
| Sales-call → CRM | call transcript | attendees, products_discussed, objections, next_step, deal_value |
| Review → action item | app store review | topic, severity, suggested_team_routing, evidence_quote |
| Email → calendar event | meeting-request email | attendees, proposed_times, duration, location, agenda |
| PDF earnings call → fields | earnings call transcript | company, quarter, revenue, guidance, key_risks, surprises |
| Social post → tag | social post | brand_mentioned, sentiment, requires_response, sensitive_topic |
A pattern: many "AI tasks" in business are actually extraction tasks. The model reads a messy input and fills in a structured record the rest of the firm can use. Once the schema is right, the LLM call is one line of code and the engineering effort is in plumbing.
Confidence and "I Don't Know"
A subtle but important schema design choice: how the model expresses uncertainty.
Two patterns work:
Per-field confidence. Each field has a value and a confidence. The system thresholds: high-confidence fields auto-process, low-confidence ones go to human review.
{
  invoice_total: { value: 12000, confidence: 0.92 },
  invoice_date: { value: "2026-08-14", confidence: 0.97 },
  vendor_name: { value: "Acme Coffee Co.", confidence: 0.68 }
}
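The thresholding step on a record like the one above might look as follows. The cutoff of 0.90 and the function name `split_by_confidence` are assumptions for illustration; the right cutoff depends on the cost of errors in the specific workflow.

```python
from typing import Dict, Tuple

AUTO_THRESHOLD = 0.90  # assumed cutoff, for illustration only


def split_by_confidence(extraction: Dict[str, dict],
                        threshold: float = AUTO_THRESHOLD) -> Tuple[dict, dict]:
    """Split fields into those safe to auto-process and those needing human review."""
    auto, review = {}, {}
    for name, entry in extraction.items():
        target = auto if entry["confidence"] >= threshold else review
        target[name] = entry["value"]
    return auto, review


extraction = {
    "invoice_total": {"value": 12000, "confidence": 0.92},
    "invoice_date": {"value": "2026-08-14", "confidence": 0.97},
    "vendor_name": {"value": "Acme Coffee Co.", "confidence": 0.68},
}
auto, review = split_by_confidence(extraction)
# auto holds invoice_total and invoice_date; review holds vendor_name.
```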
Explicit nullability + a separate refusal field. Fields can be null; a top-level extraction_complete: bool and unresolved_fields: string[] let the model flag what it couldn't determine.
Both work. The choice depends on whether you need per-field thresholds (use confidence) or just to know whether the record is fully extracted (use refusal).
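For the refusal pattern, the flags themselves are worth checking, since the model can contradict itself. The sketch below assumes a convention the text does not spell out: every field named in `unresolved_fields` is returned as null, and `extraction_complete` is true exactly when that list is empty. The function name and the sample record are illustrative.

```python
from typing import List


def refusal_problems(record: dict) -> List[str]:
    """Return contradictions between the refusal flags and the extracted fields."""
    unresolved = record["unresolved_fields"]
    problems: List[str] = []
    # Assumed convention: complete means nothing is left unresolved.
    if record["extraction_complete"] != (len(unresolved) == 0):
        problems.append("extraction_complete disagrees with unresolved_fields")
    for name in unresolved:
        if name not in record:
            problems.append(f"{name}: listed as unresolved but not a field of the record")
        elif record[name] is not None:
            problems.append(f"{name}: listed as unresolved but has a value")
    return problems


record = {
    "invoice_total": 12000,
    "invoice_date": None,
    "vendor_name": "Acme Coffee Co.",
    "extraction_complete": False,
    "unresolved_fields": ["invoice_date"],
}
consistent = refusal_problems(record)                               # no problems
contradictory = refusal_problems(dict(record, extraction_complete=True))
```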
What doesn't work: prompting the model to "say I don't know if you're not sure" without a schema slot for that response. Without a place to put uncertainty, the model tends to fill the field with a plausible-looking guess.
Validation Beyond the Schema
Schema conformance is necessary but not sufficient. A field can be the right type and still be wrong. Two validation layers worth building:
Domain rules. Arithmetic checks (subtotal + tax = total). Reference lookups (vendor exists in the AP system). Format checks (invoice_number matches the vendor's format). These catch a class of extraction errors no schema validator finds.
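The three kinds of domain rule can be sketched as below. Everything specific here is assumed for illustration: `KNOWN_VENDORS` stands in for a lookup against the AP system, the `AC-` invoice-number pattern is invented, and amounts are passed as strings so that `Decimal` arithmetic stays exact.

```python
import re
from decimal import Decimal
from typing import List

KNOWN_VENDORS = {"Acme Coffee Co."}  # stand-in for a real AP-system lookup
INVOICE_NUMBER_FORMATS = {"Acme Coffee Co.": re.compile(r"AC-\d{6}")}  # hypothetical format


def domain_rule_failures(invoice: dict) -> List[str]:
    """Return domain-rule failures for an invoice that already passed schema validation."""
    failures: List[str] = []
    # Arithmetic check: Decimal avoids float rounding on currency amounts.
    if Decimal(invoice["subtotal"]) + Decimal(invoice["tax"]) != Decimal(invoice["total"]):
        failures.append("subtotal + tax != total")
    # Reference lookup: the vendor must already exist downstream.
    vendor = invoice["vendor_name"]
    if vendor not in KNOWN_VENDORS:
        failures.append("unknown vendor")
    # Format check: only applied when a format is on file for the vendor.
    pattern = INVOICE_NUMBER_FORMATS.get(vendor)
    if pattern is not None and not pattern.fullmatch(invoice["invoice_number"]):
        failures.append("invoice_number does not match vendor format")
    return failures


good_invoice = {
    "vendor_name": "Acme Coffee Co.", "invoice_number": "AC-004217",
    "subtotal": "11000.00", "tax": "1000.00", "total": "12000.00",
}
bad_invoice = dict(good_invoice, total="12500.00", invoice_number="12345")
```

Both records have the right types, so a schema validator would accept either one; only the domain rules separate them.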
Cross-field consistency. If refund_requested is true, urgency should usually be ≥ 3. If intent_to_switch is true, the sentiment field probably isn't "positive." Flag the inconsistencies for human review rather than silently passing them through.
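The two consistency rules named above could be written as soft flags rather than hard failures. This is a sketch: the function name is illustrative, and the field names `intent_to_switch` and `sentiment` are taken from the example in the text rather than from the ticket schema.

```python
from typing import List


def consistency_flags(record: dict) -> List[str]:
    """Return soft warnings for field combinations that are unusual but not impossible."""
    flags: List[str] = []
    if record.get("refund_requested") and record.get("urgency", 0) < 3:
        flags.append("refund requested but urgency below 3")
    if record.get("intent_to_switch") and record.get("sentiment") == "positive":
        flags.append("intent to switch but sentiment is positive")
    return flags


suspicious = consistency_flags({"refund_requested": True, "urgency": 2})
unremarkable = consistency_flags({"refund_requested": True, "urgency": 4})
```

Returning flags instead of raising an error matches the advice in the text: the record is held for human review, not rejected outright, because the unusual combination may still be correct.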
Schema validation, domain rules, and cross-field consistency each catch failures the others miss, and together they catch most extraction errors. Without all three, a small percentage of bad records flows downstream and corrupts whatever consumes it.
Human-in-the-Loop for Extraction
For high-stakes extraction (contracts, healthcare records, financial documents, anything regulatory), structured outputs feed a human review queue, not a downstream system directly. The pattern:
- Extract with the LLM, returning per-field confidence.
- Auto-process extractions that are both high-confidence and validated.
- Route low-confidence or validation-failing extractions to a human reviewer.
- The reviewer's correction becomes a training signal for the next model fine-tune or prompt refinement.
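The four steps above can be sketched as a small routing class. The class name `ReviewPipeline`, the 0.9 threshold, and the in-memory lists are assumptions for illustration; a real system would use a persistent queue and a proper store for corrections.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ReviewPipeline:
    threshold: float = 0.9  # assumed cutoff, for illustration only
    processed: List[dict] = field(default_factory=list)
    review_queue: List[dict] = field(default_factory=list)
    corrections: List[dict] = field(default_factory=list)

    def route(self, record: Dict[str, dict], validation_errors: List[str]) -> str:
        """Auto-process only records that are fully high-confidence and validated."""
        low_confidence = [name for name, entry in record.items()
                          if entry["confidence"] < self.threshold]
        if low_confidence or validation_errors:
            self.review_queue.append({"record": record,
                                      "low_confidence": low_confidence,
                                      "errors": validation_errors})
            return "human_review"
        self.processed.append(record)
        return "auto_processed"

    def record_correction(self, field_name: str, extracted: Any, corrected: Any) -> None:
        """Keep the reviewer's fix as a signal for later prompt or model changes."""
        self.corrections.append({"field": field_name,
                                 "extracted": extracted,
                                 "corrected": corrected})


pipeline = ReviewPipeline()
clean = {"invoice_total": {"value": 12000, "confidence": 0.92},
         "invoice_date": {"value": "2026-08-14", "confidence": 0.97}}
shaky = {"vendor_name": {"value": "Acme Cofee Co.", "confidence": 0.68}}
first = pipeline.route(clean, validation_errors=[])
second = pipeline.route(shaky, validation_errors=[])
pipeline.record_correction("vendor_name", "Acme Cofee Co.", "Acme Coffee Co.")
```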
This is the §10.2 threshold–profit logic applied to extraction. The right confidence threshold balances "wrong field reaches downstream system" cost against "human review queue" cost. The right threshold isn't 0.5; it depends on what each error costs.
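One way to make that trade-off concrete, under two simplifying assumptions: the confidence score is calibrated (a field reported at 0.9 is wrong about 10% of the time), and each error and each review has a fixed cost. Auto-processing a field with confidence c then costs (1 - c) × error_cost in expectation, and review costs review_cost, so auto-processing is cheaper whenever c exceeds 1 - review_cost / error_cost. The function name and the cost figures are illustrative.

```python
def break_even_threshold(error_cost: float, review_cost: float) -> float:
    """Confidence above which auto-processing is cheaper in expectation than review.

    Assumes calibrated confidence and fixed per-item costs.
    """
    if error_cost <= 0 or review_cost < 0:
        raise ValueError("error_cost must be positive and review_cost non-negative")
    # Auto-process when (1 - c) * error_cost < review_cost.
    return max(0.0, 1.0 - review_cost / error_cost)


# A bad contract field costs far more than a review, so the bar is high (about 0.99).
contract_threshold = break_even_threshold(error_cost=200.0, review_cost=2.0)
# A mis-tagged social post costs little more than a review, so the bar is low (0.5).
tagging_threshold = break_even_threshold(error_cost=4.0, review_cost=2.0)
```

If model confidence is not calibrated, the formula still describes the shape of the trade-off, but the threshold should be set from measured error rates at each confidence level.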