§16.1

LLM Capabilities and Prompting

A language model is not a chatbot that happens to answer questions. It is a language interface for workflows. The chatbot framing is what gets attention; the workflow framing is what generates business value. A manager who learns to see LLMs through the second frame can deploy them on the dozens of small language-shaped tasks the firm already pays people to do — summarize, classify, extract, translate, draft, answer, reason, narrate — and free those people to do the work the model can't.

This article does two things. First it maps the capabilities a manager should expect from a modern LLM and the limits the model still has — what to ask for. Then it turns to how to ask: prompting, which is far less a matter of magic phrasing than of writing a clear task brief. The mechanics of forcing machine-readable output (§16.2) and wiring the model to tools (§16.3) come later; this article is about choosing the task and specifying it well.


The Executive Question

What is a modern language model actually good at — and what does it need to be told before it can do that work well?

The honest version, in two halves. On capability: LLMs are very good at language and unreliable at facts, math, current events, and anything that requires a database of truth they don't have. On instruction: once the task is clearly specified, the model does competent work — and the exact wording barely matters. Most of this article maps those two boundaries.


The Capability Map

A short, opinionated taxonomy. None of these are chatbot moves. They are tasks.

LLMs are language interfaces for workflows — eight capabilities, one substrate

Summarize
"summarize this 40-page contract"
Classify
"this ticket → billing"
Extract
"pull renewal date, parties"
Translate
"render in French"
Draft
"reply to this customer"
Q&A
"answer using these docs"
Reason / plan
"propose next test"
Narrate
"explain this chart"

None of these are chatbot moves. They are tasks a manager would have given to an analyst — now available as an API call.

Figure 1. Eight capabilities every modern LLM supports. Each one is a task a manager would have given to an analyst; each is now available as an API call. The pattern is the same — language in, language out — but the business value is in the wiring.

A walk-through of each, with one Bean & Basket example:

  • Summarize. Condense a long input. "Summarize this 40-page supplier contract into a one-page brief covering parties, term, pricing, termination, and renewal."
  • Classify. Assign a label. "Route this support ticket to billing, delivery, app, or quality."
  • Extract. Pull structured fields. "From this sales call note, extract the customer name, the products discussed, and the next step."
  • Translate. Render across languages. "Translate these French app reviews so the team can read them in English."
  • Draft. Generate text from instructions. "Write a reply to this customer apologizing for the outage and offering a credit."
  • Q&A. Answer using provided context. "Using only the policy documents below, answer this employee's question about parental leave." (This is RAG in §15.1.)
  • Reason / plan. Work through a multi-step problem. "Given these three retention experiments, recommend the next one to run."
  • Narrate. Explain a chart or a dataset. "Describe what's notable in this monthly sales trend."

The taxonomy is loose — many real tasks combine two or three of these. The point isn't to box every use case neatly; it's to give the manager vocabulary for the conversation about what to use the model for.


The Hidden Strength: Translation Between Formats

A capability that doesn't get its own panel in the diagram but appears in nearly every production workflow: format translation.

  • Free-text → structured JSON (§16.2).
  • Bullet points → narrative prose.
  • Spreadsheet rows → readable summaries.
  • Legal language → plain English.
  • Code → documentation.
  • Email thread → CRM update.

A surprising amount of business labor is moving information between formats with minor enrichment. LLMs collapse that labor into a single API call.


Where LLMs Are Unreliable

Symmetric clarity about limits:

Table 1. The standard LLM failure modes. Each has known mitigations, but none of the mitigations are 'use a bigger model'. They are workflow design choices.
LimitWhat happensMitigation
Internal company factsModel invents an answer or politely refuses.RAG (§15.1) — retrieve from the firm's indexed documents.
Current eventsModel's training data has a cutoff. It doesn't know about anything after.Web-search tool use, or live data ingestion via RAG.
Exact arithmeticModel can be confidently wrong on long-digit math.Call out to a calculator tool or code interpreter for arithmetic.
Long-range consistencyAcross many turns, the model may contradict earlier outputs.Structured memory + summarization between sessions.
Reliability across runsSame input can yield different outputs (temperature, sampling).Lower temperature, multi-run aggregation, schema validation.
Hallucination on rare casesFor inputs poorly represented in training, the model can be confidently wrong.Confidence calibration + human-in-the-loop on edge cases.
BiasOutputs reflect biases in training data.Subgroup evaluation + targeted prompts to mitigate.
Prompt injectionRetrieved or user-provided text contains instructions the model follows.Treat retrieved content as untrusted; system prompts that explicitly resist injection.

A recurring theme: the right response to most LLM weaknesses is a workflow fix, not a model fix. Wrap the model in RAG, give it tools, validate outputs against schemas, and route ambiguous cases to humans. These patterns occupy the rest of Chapter 16 and most of Chapter 16.


Reasoning Models vs. Fast Models

A specific recent shift worth flagging — the rise of reasoning models (OpenAI's o-series, DeepSeek-R1, etc.) that explicitly "think" before answering. They are slower and more expensive than fast models. They are better at multi-step problems, math, planning, and code.

A practical rule:

  • Fast model for tasks that are essentially pattern-matching — classification, extraction, summarization, drafting.
  • Reasoning model for tasks that benefit from explicit deliberation — strategy questions, complex extraction with cross-references, math-heavy analysis.

The cost difference can be 5–20×. Use the reasoning model only where its extra thinking time actually changes the answer.


The Model Landscape, Briefly

A note on what this article doesn't try to do: tell you which model to pick this week. The list of frontier providers (OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, Cohere, the open-weight ecosystem) and the rankings within it change every few months. Specific model names get stale faster than this book can be revised.

The stable claims:

  • Multiple competitive providers exist; vendor lock-in is a real risk to manage.
  • Open-weight models (Llama, Mistral, Qwen, Gemma, DeepSeek) have closed much of the gap to frontier closed models for many business tasks.
  • Capability tier varies by task. A model that wins on summarization may not win on code generation.
  • Smaller, specialized models often beat frontier models on narrow tasks — at a fraction of the cost.

The right operational posture is to abstract the model behind a single interface, evaluate per task on real data, and keep the door open to swapping providers as the landscape evolves.


When LLMs Are the Right Tool

A short rubric for choosing whether to reach for an LLM:

  • Use an LLM when: the input is language; the task is one of the eight capabilities above; the output needs human-readable language or schema-conforming structure; the cost of a small error is bounded.
  • Don't use an LLM when: the task is exact computation (use code); the input is large structured data (use SQL); the task requires real-time guarantees the model can't make; the cost of any error is catastrophic and the system can't tolerate human review.

The most expensive failure mode is reaching for an LLM where a deterministic rule-based system would have worked better, faster, and more cheaply. The second most expensive is not reaching for an LLM where one would have collapsed a week of language-shaped manual work into a workflow.


From What to How: Prompting as Task Design

Knowing what to ask for is half the job. The other half is briefing the model so it does the chosen task well — and that brief is where most teams either win or quietly lose.

A prompt looks like a chat message and behaves like a job description. The vague version produces vague work: generic summaries, "helpful" answers, output that's almost-but-not-quite what the firm needed. The structured version produces structured work. The fields are the same ones a manager would give an analyst — who they are, what they're doing, what context they should know, what constraints apply, and what the output should look like.

The most consequential finding from recent measurement research makes this concrete: once the task is well-specified, exact phrasing matters very little. Effort spent rewriting prompts is usually effort better spent clarifying what the task actually is.


The Prompt Structure

A prompt that does production work usually has six slots. Each maps to something a manager would write into an analyst's brief.

A prompt is a structured task brief — same fields a manager would give an analyst

Role
You are a customer insights analyst at a specialty coffee chain.
Task
Summarize the main complaints in the following twenty reviews.
Context
Reviews are from the iOS app, May 2026, after a checkout outage on May 12.
Constraints
Separate product, service, app, and pricing issues. Ignore non-English text. Flag any single review verbatim if it threatens regulatory action.
Examples
Two labelled example reviews with their target output (omitted here for brevity).
Output format
Return JSON: { topic, evidence_quotes, severity (1–5), suggested_action }.

Be clear about what you want — the GABRIEL paper shows wording matters less than people fear, once the construct is unambiguous.

Figure 2. The six slots of a production-quality prompt. The model performs as well as the brief; the brief performs as well as its missing fields are filled in.

A walk-through:

  • Role. Who the model is acting as. "You are a customer insights analyst at a specialty coffee chain." Roles set expectations about tone, vocabulary, and what the model treats as obvious.
  • Task. What it's being asked to do. "Summarize the main complaints in the following twenty reviews." This is the single most important slot; if the task is fuzzy, nothing else helps.
  • Context. What it needs to know that isn't in the input itself. Time, place, recent events, relevant policy. "Reviews are from the iOS app, May 2026, after a checkout outage on May 12."
  • Constraints. Boundaries the model must respect. "Separate product, service, app, and pricing issues. Ignore non-English text. Flag any review that threatens regulatory action."
  • Examples. A few labelled examples of the desired input → output mapping. Few-shot prompting is one of the highest-leverage moves when the task has any ambiguity.
  • Output format. The shape of the response. JSON schema, plain text, bullet list, code, etc. If you want JSON, ask for JSON — and validate (§16.2).

A prompt missing any of these slots usually performs predictably worse. A prompt missing the Task slot doesn't work at all.


A Bad Prompt and a Good One

Side by side:

Bad: "Summarize these reviews."

Good:

Role: You are a customer insights analyst at Bean & Basket Coffee.
Task: Summarize the main complaints in the 20 reviews below.
Context: These are iOS app store reviews from May 2026, after a checkout outage on May 12.
Constraints: Separate issues into four buckets — product, service, app, pricing.
  Ignore non-English text. Flag any single review verbatim if it threatens
  regulatory action or names a specific employee.
Examples:
  [Review A] → { topic: "app", evidence: "checkout crashed", severity: 4 }
  [Review B] → { topic: "service", evidence: "rude barista", severity: 2 }
Output: JSON array of { topic, evidence_quotes, severity (1-5), suggested_action }.

Reviews:
1. ...
2. ...

Three things to notice:

  • The bad prompt asks for a task. The good prompt asks for the same task with the surrounding context filled in.
  • The good prompt is not cleverly phrased. It is exactly as long as it needs to be. The information it carries is what does the work.
  • The output format is non-negotiable. If the team will parse JSON downstream, the prompt should commit to JSON up front. §16.2 covers the structured-output mechanics that enforce this.

Phrasing Barely Matters

A specific empirical finding from the GABRIEL paper, repeated here because it changes how managers should think about effort:

Once the construct is clearly defined, the exact wording of the prompt matters very little. 100 dramatically different phrasings of the same task — from terse 32-word telegrams to 563-word Shakespearean prose — produced 0.76–0.98 correlation with the baseline prompt across attributes.

The implication for prompt engineering: stop optimizing prose; start optimizing the brief. Pour time into the questions that move the answer:

  • What is the task, exactly?
  • What is the construct, exactly?
  • What output format do downstream systems require?
  • What edge cases need flagging?

Don't pour it into choosing between "please" and "kindly," hunting for the magic emoji, or rewriting the role for the seventh time. The slot model gives a structure; the structure carries the information; the phrasing carries almost nothing.


Patterns Worth Knowing

A few prompt patterns recur in production. Briefly:

Few-shot examples. A few labelled input/output pairs in the prompt. The model uses them to infer the pattern. Especially valuable when the task is ambiguous or when the output format is unusual. Two examples is often enough; ten is rarely necessary.

Chain of thought. Asking the model to "think step by step" before answering. Boosts performance on multi-step reasoning tasks. For modern reasoning models this is built-in; for fast models it can be the difference between a wrong and a right answer on a tricky task.

Role + persona. Setting the model's identity ("You are a 20-year auditor with experience in industrial accounting...") biases its tone, vocabulary, and what it treats as relevant. Effect sizes vary; useful when the task has a strong domain flavor.

System vs. user message. Most APIs distinguish system prompts (instructions from the developer) from user prompts (the actual query). Constraints, role, and output format go in the system prompt; the input data goes in the user prompt.

Refusal patterns. Adding "If you cannot answer from the provided context, say 'I don't have that information.'" The single most effective hallucination-reduction trick for RAG and similar grounded systems.


Iteration: A Cheap Loop

Prompt design is fast. A team can iterate through ten variants in an hour. The right loop:

  1. Write the prompt with all six slots filled.
  2. Run on a small evaluation set (20–50 cases that cover the easy, medium, and hard cases).
  3. Read the failures.
  4. Adjust — usually the constraints or examples slot, not the task.
  5. Re-run.

After three or four iterations the prompt is usually stable. Continuing beyond that rarely pays off — and as the GABRIEL finding shows, doesn't move the needle much.