§16.4
AI Evaluation, Risk, and Governance
An AI workflow that ships without an evaluation rubric, a risk register, and a one-page governance card is not yet operational. It's a research artefact in costume. The discipline of Chapter 16 is what turns the building blocks of Part V — classification, embeddings, RAG, vision, LLMs, agents — into a system the firm can stand behind in front of an auditor, a regulator, a customer, or a board.
This article gives the framework. The capstone (§16.5) applies it to a full Bean & Basket Customer Voice Intelligence Studio that integrates every method in Part V.
The AI Workflow Card developed here is the fourth one-page artefact in the family that began at §5.1. It inherits the discipline of the Decision Question Card (action, counterfactual, threshold), the Predictive Task Contract (§9.2), and the Model Card (§10.5) — and adds the governance fields that an LLM-driven system specifically needs. The Identification Memo (§6.2) is the parallel artefact on the causal side; both feed the §17.2 Decision Memo that ships.
The Executive Question
How do we decide whether an AI workflow is good enough — and safe enough — to use, and how do we keep that judgment current as the workflow runs?
The honest version: there is no single number. It takes eight evaluation dimensions, a risk register, a control map, monitoring, and an owner. Without all five, the workflow is a liability waiting to be discovered.
Validation Comes First: The Side-by-Side Lab
Before any AI workflow ships, run the methods side by side against human ground truth. The point isn't to crown a winner. The point is to know where each method fails.
Side-by-side validation — three methods, six tricky cases, one ground truth
| Case | VADER | BERT | GPT | Ground truth |
|---|---|---|---|---|
| Plain positive review | P | P | P | P |
| Sarcastic praise | P | N | N | N |
| Mixed (food good, wait bad) | N | M | M | M |
| Cold-brew (polysemy) | N | P | P | P |
| Domain idiom ("killing me") | N | N | M | M |
| Subtle disappointment | P | P | N | N |
The point isn't that one method "wins" — it's that error structures differ. Knowing where each method fails is the most important thing for a manager choosing between them.
A reasonable validation lab for a customer-voice system:
- Curate a ground-truth set. 50–500 documents, hand-coded by humans on the constructs the system measures.
- Run every method. Dictionary (VADER), supervised classifier (BERT or similar), LLM measurement, and any other production candidate.
- Score each method. Accuracy, agreement with ground truth, error patterns by subgroup.
- Diagnose error structures. Where does each method fail? On sarcasm? On domain idiom? On rare cases? On certain customer segments?
- Choose by error structure, not just accuracy. A method that is 85% accurate overall but fails systematically on the cases that matter most is worse than one that is 80% accurate with diffuse errors.
The output is not "method X wins." The output is a map of each method's failure regions that informs which method handles which slice of traffic in production.
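The scoring and diagnosis steps can be sketched in a few lines. This is a minimal illustration, not a production harness: the labels are copied from the six-case table above, and the `failure_map` helper is a hypothetical name introduced here. A real lab would run the same logic over the full 50–500 document ground-truth set and break failures out by subgroup.

```python
from collections import defaultdict

# Labels copied from the side-by-side table (P, N, M as used there).
CASES = [
    ("Plain positive review",       {"VADER": "P", "BERT": "P", "GPT": "P"}, "P"),
    ("Sarcastic praise",            {"VADER": "P", "BERT": "N", "GPT": "N"}, "N"),
    ("Mixed (food good, wait bad)", {"VADER": "N", "BERT": "M", "GPT": "M"}, "M"),
    ("Cold-brew (polysemy)",        {"VADER": "N", "BERT": "P", "GPT": "P"}, "P"),
    ('Domain idiom ("killing me")', {"VADER": "N", "BERT": "N", "GPT": "M"}, "M"),
    ("Subtle disappointment",       {"VADER": "P", "BERT": "P", "GPT": "N"}, "N"),
]

def failure_map(cases):
    """Return {method: {"accuracy": float, "failures": [case names]}}."""
    hits = defaultdict(int)
    failures = defaultdict(list)
    for name, predictions, truth in cases:
        for method, label in predictions.items():
            if label == truth:
                hits[method] += 1
            else:
                failures[method].append(name)
    methods = cases[0][1].keys()
    return {
        m: {"accuracy": hits[m] / len(cases), "failures": failures[m]}
        for m in methods
    }

report = failure_map(CASES)
for method, result in report.items():
    print(f"{method}: accuracy={result['accuracy']:.2f}, fails on {result['failures']}")
```

The useful output is the `failures` list, not the accuracy figure: it names the slices of traffic each method should not be trusted with.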
The Eight-Dimension Evaluation Rubric
Accuracy alone is incomplete. A workflow can be accurate and ungrounded, accurate and biased, accurate and privacy-violating. The standard rubric has eight dimensions:
Eight evaluation dimensions every AI workflow review should cover
A workflow that answers correctly with no grounding, or correctly only for some users, is not yet shippable.
A walk-through of each:
- Accuracy. Standard model evaluation against held-out ground truth. The headline number, not the only number.
- Grounding. For generative or RAG systems: is each claim traceable to a source the system retrieved? Ungrounded answers are the highest-volume hallucination category.
- Relevance. Did the answer address the question that was asked, or a related but different one? Surprisingly common failure on subtly framed queries.
- Consistency. Does the system behave the same way on similar inputs? Temperature settings, sampling, and model version changes all introduce inconsistency.
- Safety. Could the output cause harm if acted on? Includes both content safety (the obvious cases) and downstream-action safety (the subtle ones).
- Bias. Are errors uneven across groups? Per-subgroup evaluation is the only way to know.
- Privacy. Is sensitive data leaking in (prompt injection, retrieved content) or out (model outputs that disclose training data or PII)?
- Business value. Does the workflow improve a decision or reduce a cost? An accurate workflow that doesn't move a metric is a research artefact.
A workflow that scores well on all eight is ready to ship. A workflow with even one persistent failure on this list is not.
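The ship rule can be expressed as a simple gate. The sketch below assumes each dimension has been reduced to a score between 0 and 1, which is itself a simplification; the threshold values, the sample scores, and the `review` function are hypothetical placeholders that a review board would replace with its own.

```python
DIMENSIONS = [
    "accuracy", "grounding", "relevance", "consistency",
    "safety", "bias", "privacy", "business_value",
]

def review(scores, thresholds):
    """Return (ready, failing); failing lists dimensions that are unscored or below threshold."""
    failing = []
    for dim in DIMENSIONS:
        score = scores.get(dim)
        if score is None or score < thresholds[dim]:
            failing.append(dim)
    return (not failing, failing)

# Hypothetical numbers for illustration only.
thresholds = {dim: 0.80 for dim in DIMENSIONS}
scores = {
    "accuracy": 0.92, "grounding": 0.82, "relevance": 0.90, "consistency": 0.88,
    "safety": 0.97, "bias": 0.85, "privacy": None, "business_value": 0.81,
}

ready, failing = review(scores, thresholds)
print(ready, failing)
```

An unscored dimension is treated as a failure on purpose: a dimension nobody evaluated is not a dimension that passed.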
The Risk-Control Map
Beyond per-output evaluation, the workflow as a system has risks. The standard tool for managing them is the risk-control map: every identified risk gets a likelihood, a severity, and a mitigating control.
Risk-control map — likelihood × severity, with the mitigating control
The standard risk catalog for modern AI systems:
- Hallucination — RAG + citation-required prompts + refusal patterns + golden-set evals.
- Prompt injection — input sanitization, untrusted-content sandboxing, tool allow-lists.
- PII leakage — redaction, retention policy, on-device or self-hosted models for sensitive data.
- IP / copyright — source provenance, counsel review of generative outputs, opt-out compliance.
- Bias amplification — per-subgroup evaluation, targeted holdouts, fairness audits.
- Over-automation — human-approval gates on irreversible actions.
- Model drift — monitoring + retraining cadence (the §12.3 logic carried into LLM-land).
- Eval gaps — red-teaming, expanded golden sets, periodic re-audit.
Each risk should have a named control. Risks without controls are bets, not managed positions.
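One way to keep the map honest is to store it as data and query it for gaps. The following is a sketch under stated assumptions: the 1 to 5 likelihood and severity scales, the specific ratings, and the `Risk` and `uncontrolled` names are illustrative and not taken from any standard. The control names come from the catalog above.

```python
from dataclasses import dataclass, field

@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent); scale is an assumption
    severity: int    # 1 (minor) to 5 (critical); scale is an assumption
    controls: list = field(default_factory=list)

    @property
    def exposure(self):
        # Likelihood x severity, as on the risk-control map.
        return self.likelihood * self.severity

def uncontrolled(register):
    """Risks with no named control, highest exposure first."""
    gaps = [r for r in register if not r.controls]
    return sorted(gaps, key=lambda r: r.exposure, reverse=True)

# Ratings below are hypothetical; two risks are deliberately left without controls.
register = [
    Risk("Hallucination", 4, 4, ["RAG", "citation-required prompts",
                                 "refusal patterns", "golden-set evals"]),
    Risk("Prompt injection", 3, 5, ["input sanitization", "tool allow-lists"]),
    Risk("Over-automation", 2, 5, []),
    Risk("Eval gaps", 3, 3, []),
]

for risk in uncontrolled(register):
    print(f"UNCONTROLLED: {risk.name} (exposure {risk.exposure})")
```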
The AI Workflow Card
Every shipped workflow needs a one-page contract. The same role the model card played in §10.5, restated for AI workflows.
The AI workflow card — one page, every shipped workflow
| Field | Value |
|---|---|
| Workflow name | BB-Voice-of-Customer-2026Q2 |
| Intended use | Surface emerging complaint themes weekly; route urgent tickets; draft executive summary. |
| Inputs | App reviews, support tickets, social posts (last 7 days). |
| Components | Classification (§18.4) + topic model (§18.5) + embedding cluster (§19.2) + LLM summary (§21.3) + agent (§21.4). |
| Human-in-the-loop | Manager approves alerts before they post to Slack; quarterly red-team review. |
| Evaluation cadence | Weekly golden-set scoring; monthly drift check; quarterly side-by-side with human ground truth. |
| Known failure modes | Sarcasm in social posts; non-English reviews; competitor mentions misclassified as own brand. |
| Privacy | No raw customer PII passed to external LLM; redaction step before prompt assembly. |
| Escalation path | Workflow owner on-call; legal review for any external publication. |
| Owner | Customer Insights, Bean & Basket Coffee. |
Without this card, the workflow is a research artefact. With it, it's infrastructure with an owner.
The card should include, at minimum:
- Intended use. What decision the workflow supports, for whom, on what cadence.
- Inputs. What data sources feed the workflow.
- Components. Which methods are wired together (classification, embedding, RAG, LLM, etc.).
- Human-in-the-loop. Where approval gates sit, what triggers escalation.
- Evaluation cadence. Golden sets, drift checks, red-team reviews.
- Known failure modes. Subpopulations or input types where the workflow is unreliable.
- Privacy. What data is processed, what leaves the perimeter, what is retained.
- Escalation path. Who responds when something goes wrong.
- Owner. A real human and a real team.
A workflow without a card is a workflow whose authors have not done the operational work. A six-month-old workflow without a refreshed card is a workflow whose owners have moved on.
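Both failure cases, the missing card and the stale card, are easy to check mechanically. The sketch below assumes the card is stored as a dictionary keyed by the fields listed above; the field keys, the placeholder list, the roughly six-month `max_age_days` default, and the `card_problems` function are all hypothetical choices made for illustration.

```python
from datetime import date

REQUIRED_FIELDS = [
    "intended_use", "inputs", "components", "human_in_the_loop",
    "evaluation_cadence", "known_failure_modes", "privacy",
    "escalation_path", "owner",
]
PLACEHOLDERS = {"", "tbd", "todo"}

def card_problems(card, last_reviewed, today, max_age_days=183):
    """List the reasons a workflow card is not fit to ship."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = str(card.get(name) or "").strip().lower()
        if value in PLACEHOLDERS:
            problems.append(f"{name}: missing or placeholder")
    if (today - last_reviewed).days > max_age_days:
        problems.append("card is stale: refresh required")
    return problems

# Abbreviated example card; the privacy field has been left as a placeholder.
card = {
    "intended_use": "Surface emerging complaint themes weekly",
    "inputs": "App reviews, support tickets, social posts (last 7 days)",
    "components": "Classification + topic model + embedding cluster + LLM summary + agent",
    "human_in_the_loop": "Manager approves alerts before they post to Slack",
    "evaluation_cadence": "Weekly golden-set scoring; monthly drift check",
    "known_failure_modes": "Sarcasm in social posts; non-English reviews",
    "privacy": "TBD",
    "escalation_path": "Workflow owner on-call",
    "owner": "Customer Insights, Bean & Basket Coffee",
}

problems = card_problems(card, last_reviewed=date(2026, 1, 5), today=date(2026, 9, 1))
print(problems)
```

A check like this catches only empty fields and old dates. Whether the content of a field is adequate still needs a human reviewer.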
Monitoring an AI Workflow in Production
The §12.3 patterns transfer to LLM workflows with two adjustments:
- Input drift matters more, because LLMs and RAG systems are unusually sensitive to changes in the kind of question being asked.
- Output drift matters more, because the output is language; subtle changes in tone or focus can be hard to spot unless explicitly monitored.
A minimal monitoring stack:
- Eval scores on the rolling golden set. Refreshed weekly; alerts on a drop.
- Refusal rate. What fraction of queries does the system refuse? A sudden change points either to drift in the system itself or to a real shift in what users are asking.
- Grounding rate. What fraction of grounded-system answers cite sources? Should be close to 100% for a well-behaved RAG system.
- Escalation rate. What fraction of agent runs hit the human-approval gate? Persistent high escalation suggests the agent is operating at the edge of its competence.
- Cost per query. A sudden spike often indicates a regression — a loop, an oversized retrieval, a model swap.
The patterns are the same as Part IV's monitoring. The artefacts are LLM-shaped instead of classifier-shaped. The discipline is identical.
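Most of the monitoring stack above reduces to rates computed over a query log. The sketch below assumes a hypothetical log format (one dictionary per query with `refused`, `cited_sources`, `escalated`, and `cost_usd` keys) and a hypothetical relative tolerance of 25% against a stored baseline. Golden-set eval scores would be tracked the same way but are omitted here.

```python
def monitoring_snapshot(runs):
    """Compute the core rates from a list of per-query log records."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    answered = [r for r in runs if not r["refused"]]
    grounded = sum(r["cited_sources"] for r in answered)
    return {
        "refusal_rate": sum(r["refused"] for r in runs) / n,
        # Grounding is measured over answered queries only.
        "grounding_rate": grounded / len(answered) if answered else 0.0,
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
        "cost_per_query": sum(r["cost_usd"] for r in runs) / n,
    }

def alerts(current, baseline, tolerance=0.25):
    """Flag metrics that moved more than `tolerance` (relative) from baseline."""
    flagged = []
    for metric, base in baseline.items():
        if base == 0:
            changed = current[metric] != 0
        else:
            changed = abs(current[metric] - base) / base > tolerance
        if changed:
            flagged.append(metric)
    return flagged

# Hypothetical log records and baseline, for illustration only.
runs = [
    {"refused": False, "cited_sources": True,  "escalated": False, "cost_usd": 0.02},
    {"refused": False, "cited_sources": True,  "escalated": True,  "cost_usd": 0.02},
    {"refused": True,  "cited_sources": False, "escalated": False, "cost_usd": 0.01},
    {"refused": False, "cited_sources": False, "escalated": False, "cost_usd": 0.03},
]
baseline = {"refusal_rate": 0.05, "grounding_rate": 0.98,
            "escalation_rate": 0.25, "cost_per_query": 0.02}

snapshot = monitoring_snapshot(runs)
print(snapshot)
print("alerts:", alerts(snapshot, baseline))
```

A relative tolerance is a crude trigger. It is used here because it needs no history beyond one baseline; a production system would more likely alert on a rolling window.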
Privacy, Data Residency, and Consent
A short note on the cross-cutting concerns:
- Customer text leaving the perimeter. Sending customer reviews to a third-party LLM is moving regulated data. Document the data flow; redact PII before the prompt; consider self-hosted or on-prem models for sensitive workloads.
- Retention. LLM providers vary on whether they retain prompts and outputs for training, monitoring, or compliance. Read the contract; pick the option that matches your obligations.
- Consent for AI-driven action. If a customer's data is being processed by AI to make a decision affecting them, a growing number of jurisdictions require disclosure and an appeal path. GDPR Article 22, the EU AI Act, U.S. state-level laws, and sectoral regulations are all in scope.
- Data minimization. Send the LLM the minimum information needed for the task. Most failures here are accidental over-sharing.
None of this is unique to AI. All of it gets sharper attention because AI workflows are visible, scrutinized, and often customer-facing.
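The "redact PII before the prompt" step can be sketched as a filter that runs during prompt assembly. This is a deliberately minimal illustration: the two regular expressions catch only simple email addresses and phone-like digit runs, they will miss names, addresses, and account numbers, and the placeholder tokens, the prompt wording, and the `redact` and `build_prompt` functions are hypothetical. A real deployment would need a far more thorough redaction layer.

```python
import re

# Simplistic patterns for illustration; not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

def build_prompt(review_text):
    """Assemble the prompt from redacted text only (data minimization)."""
    return "Classify the sentiment of this review:\n" + redact(review_text)

raw = "Email me at jo@example.com or call 555-123-4567 about my latte."
print(build_prompt(raw))
```

Placing redaction inside `build_prompt` means no caller can assemble a prompt from raw text by accident, which addresses the over-sharing failure described above.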
Concept check
Three questions on evaluating and governing an AI workflow.
- 1. A RAG system scores 0.92 accuracy on a golden set. The ungrounded-answer rate is 18%. The right interpretation is:
- 2. A team is choosing between three text-measurement methods on the same task. Their accuracy scores are 0.82, 0.84, 0.85. The right next question is:
- 3. An AI workflow's card lists "Privacy" as "TBD." The honest assessment is: