§16.4

AI Evaluation, Risk, and Governance

An AI workflow that ships without an evaluation rubric, a risk register, and a one-page governance card is not yet operational. It's a research artefact in costume. The discipline of Chapter 16 is what turns the building blocks of Part V — classification, embeddings, RAG, vision, LLMs, agents — into a system the firm can stand behind in front of an auditor, a regulator, a customer, or a board.

This article gives the framework. The capstone (§16.5) applies it to a full Bean & Basket Customer Voice Intelligence Studio that integrates every method in Part V.

The AI Workflow Card developed here is the fourth one-page artefact in the family that began at §5.1. It inherits the discipline of the Decision Question Card (action, counterfactual, threshold), the Predictive Task Contract (§9.2), and the Model Card (§10.5) — and adds the governance fields that an LLM-driven system specifically needs. The Identification Memo (§6.2) is the parallel artefact on the causal side; both feed the §17.2 Decision Memo that ships.


The Executive Question

How do we decide whether an AI workflow is good enough — and safe enough — to use, and how do we keep that judgment current as the workflow runs?

The honest version: there is no single number. The answer takes five things: eight evaluation dimensions, a risk register, a control map, monitoring, and an owner. Without all five, the workflow is a liability waiting to be discovered.


Validation Comes First: The Side-by-Side Lab

Before any AI workflow ships, run the methods side by side against human ground truth. The point isn't to crown a winner. The point is to know where each method fails.

Side-by-side validation: three methods, six tricky cases, one ground truth

  Method              Type              Strengths                        Weaknesses
  Dictionary (VADER)  rule-based        transparent, cheap, fast         sarcasm, negation, domain idiom
  BERT classifier     fine-tuned model  good on standard sentiment       needs labelled data per task
  GPT measurement     LLM-as-measurer   arbitrary constructs, zero-shot  cost, hallucination, shortcut bias

  Case                          VADER  BERT  GPT  Ground truth
  Plain positive review         P      P     P    P
  Sarcastic praise              P      N     N    N
  Mixed (food good, wait bad)   N      M     M    M
  Cold-brew (polysemy)          N      P     P    P
  Domain idiom ("killing me")   N      N     M    M
  Subtle disappointment         P      P     N    N

  (P = positive, N = negative, M = mixed)


Figure 1. Three methods, six tricky cases, one ground truth. The point isn't which method is most accurate — it's that the error structures differ. Knowing where each method fails is the most important thing for a manager choosing between them.

A reasonable validation lab for a customer-voice system:

  1. Curate a ground-truth set. 50–500 documents, hand-coded by humans on the constructs the system measures.
  2. Run every method. Dictionary (VADER), supervised classifier (BERT or similar), LLM measurement, and any other production candidate.
  3. Score each method. Accuracy, agreement with ground truth, error patterns by subgroup.
  4. Diagnose error structures. Where does each method fail? On sarcasm? On domain idiom? On rare cases? On certain customer segments?
  5. Choose by error structure, not just accuracy. A method that is 85% accurate overall but fails systematically on the cases that matter most is worse than one that is 80% accurate with diffuse errors.

The output is not "method X wins." The output is a map of each method's failure regions that informs which method handles which slice of traffic in production.
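The scoring and diagnosis steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production harness: the labels are copied from the six cases in Figure 1, and the function name `score` and the report layout are assumptions made for the example.

```python
# Labels copied from Figure 1 (P / N / M as printed there).
CASES = [
    "Plain positive review",
    "Sarcastic praise",
    "Mixed (food good, wait bad)",
    "Cold-brew (polysemy)",
    'Domain idiom ("killing me")',
    "Subtle disappointment",
]
GROUND_TRUTH = ["P", "N", "M", "P", "M", "N"]
PREDICTIONS = {
    "VADER": ["P", "P", "N", "N", "N", "P"],
    "BERT": ["P", "N", "M", "P", "N", "P"],
    "GPT": ["P", "N", "M", "P", "M", "N"],
}


def score(predictions, truth, cases):
    """Return, per method, its accuracy and the cases it gets wrong."""
    report = {}
    for method, labels in predictions.items():
        # The failure list is the error structure; accuracy only summarises it.
        misses = [c for c, p, t in zip(cases, labels, truth) if p != t]
        report[method] = {
            "accuracy": 1 - len(misses) / len(truth),
            "failures": misses,
        }
    return report


report = score(PREDICTIONS, GROUND_TRUTH, CASES)
for method, result in report.items():
    print(f"{method}: accuracy={result['accuracy']:.2f} failures={result['failures']}")
```

Six cases are far too few to estimate accuracy; a real lab would run the same comparison over the 50–500 hand-coded documents and group failures by case type or customer segment. The useful output is the `failures` list per method, not the accuracy figure.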


The Eight-Dimension Evaluation Rubric

Accuracy alone is incomplete. A workflow can be accurate and ungrounded, accurate and biased, accurate and privacy-violating. The standard rubric has eight dimensions:

Eight evaluation dimensions every AI workflow review should cover

  Accuracy        Is the output correct on benchmarks we trust?
  Grounding       Is each claim supported by a cited source?
  Relevance       Does it answer the question that was asked?
  Consistency     Does it behave the same way on similar inputs?
  Safety          Could the output cause harm if acted on?
  Bias            Are errors uneven across groups or contexts?
  Privacy         Is sensitive data leaking in or out?
  Business value  Does it improve a decision or reduce a cost?

Accuracy alone is incomplete. A workflow that answers correctly with no grounding, or correctly only for some users, is not yet shippable.

Figure 2. Eight evaluation dimensions every AI workflow review should cover. A workflow that ships on accuracy alone is shipping with seven blind spots.

A walk-through of each:

  • Accuracy. Standard model evaluation against held-out ground truth. The headline number, not the only number.
  • Grounding. For generative or RAG systems: is each claim traceable to a source the system retrieved? Ungrounded answers are the highest-volume hallucination category.
  • Relevance. Did the answer address the question that was asked, or a related but different one? Surprisingly common failure on subtly framed queries.
  • Consistency. Does the system behave the same way on similar inputs? Temperature settings, sampling, and model version changes all introduce inconsistency.
  • Safety. Could the output cause harm if acted on? Includes both content safety (the obvious cases) and downstream-action safety (the subtle ones).
  • Bias. Are errors uneven across groups? Per-subgroup evaluation is the only way to know.
  • Privacy. Is sensitive data leaking in (prompt injection, retrieved content) or out (model outputs that disclose training data or PII)?
  • Business value. Does the workflow improve a decision or reduce a cost? An accurate workflow that doesn't move a metric is a research artefact.

A workflow that scores well on all eight is ready to ship. A workflow with even one persistent failure on this list is not.
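One way to make the "all eight or not ready" rule mechanical is a small gate that refuses to pass a workflow with any missing or failing dimension. The sketch below assumes each dimension has already been reduced to a score between 0 and 1 (higher is better); the thresholds, the example scores, and the name `review` are hypothetical and would be set by the firm.

```python
DIMENSIONS = (
    "accuracy", "grounding", "relevance", "consistency",
    "safety", "bias", "privacy", "business_value",
)

# Hypothetical pass marks on a 0-1 scale; each firm sets its own.
THRESHOLDS = {d: 0.80 for d in DIMENSIONS}
THRESHOLDS["grounding"] = 0.95
THRESHOLDS["privacy"] = 0.99


def review(scores, thresholds=THRESHOLDS):
    """Pass only if every dimension is scored and meets its threshold."""
    missing = [d for d in DIMENSIONS if d not in scores]
    failing = [d for d in DIMENSIONS if d in scores and scores[d] < thresholds[d]]
    return {"ready": not missing and not failing, "missing": missing, "failing": failing}


# Illustrative scores: strong accuracy, weak grounding.
scores = {
    "accuracy": 0.92, "grounding": 0.82, "relevance": 0.90, "consistency": 0.88,
    "safety": 0.97, "bias": 0.85, "privacy": 0.99, "business_value": 0.81,
}
result = review(scores)
print(result)
```

The design choice worth noting is that an unscored dimension blocks shipping just as a failing one does, so a blind spot cannot pass by omission.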


The Risk-Control Map

Beyond per-output evaluation, the workflow as a system has risks. The standard tool for managing them is the risk-control map: every identified risk gets a likelihood, a severity, and a mitigating control.

Risk-control map: likelihood × severity, with the mitigating control

[Figure: scatter plot of eight risks (hallucination, prompt injection, PII leakage, IP / copyright, bias amplification, over-automation, model drift, eval gaps). Horizontal axis: likelihood (rare, possible, frequent). Vertical axis: severity (mild, serious, catastrophic). Right panel: mitigating controls.]

  Hallucination        RAG + citation-required prompts
  Prompt injection     input filtering + tool allow-lists
  PII leakage          redaction + retention policy
  IP / copyright       source provenance + counsel review
  Bias amplification   segment-level eval + holdouts
  Over-automation      human-approval gates

Figure 3. The risk-control map for an LLM workflow. Each dot is a risk; its position shows likelihood and severity; the right panel names the control. Risks in the upper-right quadrant, likely and catastrophic, need the strongest controls.

The standard risk catalog for modern AI systems:

  • Hallucination — RAG + citation-required prompts + refusal patterns + golden-set evals.
  • Prompt injection — input sanitization, untrusted-content sandboxing, tool allow-lists.
  • PII leakage — redaction, retention policy, on-device or self-hosted models for sensitive data.
  • IP / copyright — source provenance, counsel review of generative outputs, opt-out compliance.
  • Bias amplification — per-subgroup evaluation, targeted holdouts, fairness audits.
  • Over-automation — human-approval gates on irreversible actions.
  • Model drift — monitoring + retraining cadence (the §12.3 logic carried into LLM-land).
  • Eval gaps — red-teaming, expanded golden sets, periodic re-audit.

Each risk should have a named control. Risks without controls are bets, not managed positions.
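A risk register of this kind is easy to hold as data, which makes "risks without controls" something a script can find. The sketch below is a minimal version under stated assumptions: the likelihood and severity ratings are illustrative placeholders, not values read from Figure 3; scoring by likelihood times severity is one common convention, not the only one; and the eval-gaps entry is deliberately left without controls to show the check firing.

```python
from dataclasses import dataclass, field

# Ordinal scales matching the axis labels of the risk-control map.
LIKELIHOOD = {"rare": 1, "possible": 2, "frequent": 3}
SEVERITY = {"mild": 1, "serious": 2, "catastrophic": 3}


@dataclass
class Risk:
    name: str
    likelihood: str
    severity: str
    controls: list = field(default_factory=list)

    @property
    def score(self):
        # Assumed convention: priority = likelihood x severity.
        return LIKELIHOOD[self.likelihood] * SEVERITY[self.severity]


def unmanaged(register):
    """Names of risks that have no named control."""
    return [r.name for r in register if not r.controls]


def prioritised(register):
    """Risks ordered from highest to lowest score."""
    return sorted(register, key=lambda r: -r.score)


# Illustrative ratings only.
register = [
    Risk("Hallucination", "frequent", "serious",
         ["RAG", "citation-required prompts", "golden-set evals"]),
    Risk("Prompt injection", "possible", "catastrophic",
         ["input sanitization", "tool allow-lists"]),
    Risk("Eval gaps", "possible", "serious"),  # no control yet: a bet
]

print([r.name for r in prioritised(register)])
print(unmanaged(register))
```

Running `unmanaged` as part of a release checklist turns the "bets, not managed positions" rule into a gate rather than a reminder.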


The AI Workflow Card

Every shipped workflow needs a one-page contract. It plays the same role the model card played in §10.5, restated for AI workflows.

The AI workflow card: one page, every shipped workflow

  Workflow name: BB-Voice-of-Customer-2026Q2
  Intended use: Surface emerging complaint themes weekly; route urgent tickets; draft executive summary.
  Inputs: App reviews, support tickets, social posts (last 7 days).
  Components: Classification (§18.4) + topic model (§18.5) + embedding cluster (§19.2) + LLM summary (§21.3) + agent (§21.4).
  Human-in-the-loop: Manager approves alerts before they post to Slack; quarterly red-team review.
  Evaluation cadence: Weekly golden-set scoring; monthly drift check; quarterly side-by-side with human ground truth.
  Known failure modes: Sarcasm in social posts; non-English reviews; competitor mentions misclassified as own brand.
  Privacy: No raw customer PII passed to external LLM; redaction step before prompt assembly.
  Escalation path: Workflow owner on-call; legal review for any external publication.
  Owner: Customer Insights, Bean & Basket Coffee.


Figure 4. The AI workflow card for the Bean & Basket Customer Voice Intelligence system. Every row corresponds to a question the workflow's owner will be asked at some point. Without the card, the workflow is a research artefact; with it, it's infrastructure with an owner.

The card should include, at minimum:

  • Intended use. What decision the workflow supports, for whom, on what cadence.
  • Inputs. What data sources feed the workflow.
  • Components. Which methods are wired together (classification, embedding, RAG, LLM, etc.).
  • Human-in-the-loop. Where approval gates sit, what triggers escalation.
  • Evaluation cadence. Golden sets, drift checks, red-team reviews.
  • Known failure modes. Subpopulations or input types where the workflow is unreliable.
  • Privacy. What data is processed, what leaves the perimeter, what is retained.
  • Escalation path. Who responds when something goes wrong.
  • Owner. A real human and a real team.

A workflow without a card is a workflow whose authors have not done the operational work. A six-month-old workflow without a refreshed card is a workflow whose owners have moved on.
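The card can be checked the same way as any other release artefact. The sketch below assumes the card is stored as a plain dictionary alongside a last-reviewed date; the field keys, the placeholder list, the 183-day cutoff standing in for "six months", and the name `card_problems` are all assumptions made for illustration.

```python
from datetime import date

REQUIRED_FIELDS = (
    "intended_use", "inputs", "components", "human_in_the_loop",
    "evaluation_cadence", "known_failure_modes", "privacy",
    "escalation_path", "owner",
)
PLACEHOLDERS = {"", "tbd", "todo"}
MAX_AGE_DAYS = 183  # assumed stand-in for "six months"


def card_problems(card, last_reviewed, today):
    """List every reason the card should block or flag the workflow."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = str(card.get(name, "")).strip().lower()
        if value in PLACEHOLDERS:
            problems.append(f"{name}: missing or placeholder")
    if (today - last_reviewed).days > MAX_AGE_DAYS:
        problems.append("card not refreshed in the last six months")
    return problems


# Values abridged from Figure 4, with Privacy left as a placeholder on purpose.
card = {
    "intended_use": "Surface emerging complaint themes weekly",
    "inputs": "App reviews, support tickets, social posts (last 7 days)",
    "components": "Classification + topic model + embedding cluster + LLM summary + agent",
    "human_in_the_loop": "Manager approves alerts before they post to Slack",
    "evaluation_cadence": "Weekly golden-set scoring; monthly drift check",
    "known_failure_modes": "Sarcasm in social posts; non-English reviews",
    "privacy": "TBD",
    "escalation_path": "Workflow owner on-call",
    "owner": "Customer Insights, Bean & Basket Coffee",
}
problems = card_problems(card, last_reviewed=date(2025, 10, 1), today=date(2026, 6, 1))
print(problems)
```

A check like this cannot judge whether a field is honest, only whether it has been filled in and recently reviewed; the human review of the content still has to happen.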


Monitoring an AI Workflow in Production

The §12.3 patterns transfer to LLM workflows with two adjustments:

  • Input drift matters more, because LLMs and RAG systems are unusually sensitive to changes in the kind of question being asked.
  • Output drift matters more, because the output is language; subtle changes in tone or focus can be hard to spot unless explicitly monitored.

A minimal monitoring stack:

  • Eval scores on the rolling golden set. Refreshed weekly; alerts on a drop.
  • Refusal rate. What fraction of queries does the system refuse? A sudden change indicates either a shift in the model's behaviour or a real change in the input distribution.
  • Grounding rate. What fraction of grounded-system answers cite sources? Should be close to 100% for a well-behaved RAG system.
  • Escalation rate. What fraction of agent runs hit the human-approval gate? Persistent high escalation suggests the agent is operating at the edge of its competence.
  • Cost per query. A sudden spike often indicates a regression — a loop, an oversized retrieval, a model swap.

The patterns are the same as Part IV's monitoring. The artefacts are LLM-shaped instead of classifier-shaped. The discipline is identical.
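Most of the minimal stack reduces to a handful of ratios over the query log. The sketch below computes four of them and flags any that moved sharply against a baseline. It is a minimal illustration: the `QueryLog` fields, the 25% relative tolerance, the baseline values, and the four sample log rows are all hypothetical, and golden-set scoring is left out because it needs the evaluation harness rather than the log.

```python
from dataclasses import dataclass


@dataclass
class QueryLog:
    refused: bool
    cited_sources: bool
    escalated: bool
    cost_usd: float


def weekly_metrics(logs):
    """Rates over one window of logged queries (assumes logs is non-empty)."""
    n = len(logs)
    answered = [q for q in logs if not q.refused]
    return {
        "refusal_rate": sum(q.refused for q in logs) / n,
        # Grounding is measured over answered queries only.
        "grounding_rate": (
            sum(q.cited_sources for q in answered) / len(answered) if answered else 0.0
        ),
        "escalation_rate": sum(q.escalated for q in logs) / n,
        "cost_per_query": sum(q.cost_usd for q in logs) / n,
    }


def alerts(current, baseline, tolerance=0.25):
    """Metrics whose relative change from baseline exceeds the tolerance."""
    flagged = []
    for name, base in baseline.items():
        if base == 0:
            continue  # relative change is undefined; handle separately
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            flagged.append((name, round(change, 2)))
    return flagged


# Hypothetical sample window and baseline.
logs = [
    QueryLog(False, True, False, 0.02),
    QueryLog(False, True, True, 0.02),
    QueryLog(True, False, False, 0.01),
    QueryLog(False, False, False, 0.03),
]
baseline = {"refusal_rate": 0.25, "grounding_rate": 0.95,
            "escalation_rate": 0.25, "cost_per_query": 0.02}
current = weekly_metrics(logs)
flagged = alerts(current, baseline)
print(current)
print(flagged)
```

The alert is two-sided on purpose: a refusal rate or cost that falls sharply is as much a signal of changed behaviour as one that rises.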


A short note on the cross-cutting data-handling concerns:

  • Customer text leaving the perimeter. Sending customer reviews to a third-party LLM is moving regulated data. Document the data flow; redact PII before the prompt; consider self-hosted or on-prem models for sensitive workloads.
  • Retention. LLM providers vary on whether they retain prompts and outputs for training, monitoring, or compliance. Read the contract; pick the option that matches your obligations.
  • Consent for AI-driven action. If a customer's data is being processed by AI to make a decision affecting them, a growing number of jurisdictions require disclosure and an appeal path. GDPR Article 22, the EU AI Act, U.S. state-level laws, and sectoral regulations are all in scope.
  • Data minimization. Send the LLM the minimum information needed for the task. Most failures here are accidental over-sharing.

None of this is unique to AI. All of it gets sharper attention because AI workflows are visible, scrutinized, and often customer-facing.
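Redaction and data minimization both happen at prompt assembly, so that is where a sketch is most useful. The example below is deliberately simple and should be read as an illustration only: the two regular expressions catch obvious email addresses and phone numbers and nothing else, the record layout and the names `redact` and `build_prompt` are hypothetical, and a production system would rely on a dedicated PII-detection step rather than hand-written patterns.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace obvious emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


def build_prompt(review):
    # Data minimization: only the review body is sent; the customer ID
    # and account email never enter the prompt.
    return "Summarise the complaint in one sentence:\n" + redact(review["body"])


# Hypothetical record.
review = {
    "customer_id": "C-1042",
    "email": "jo@example.com",
    "body": "Latte was cold. Call me on +1 415 555 0100 or mail jo@example.com.",
}
prompt = build_prompt(review)
print(prompt)
```

Selecting fields before redacting them is the cheaper control: data that is never placed in the prompt cannot leak through a pattern the redactor missed.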



Concept check

Three questions on evaluating and governing an AI workflow.

  1. A RAG system scores 0.92 accuracy on a golden set. The ungrounded-answer rate is 18%. The right interpretation is:
  2. A team is choosing between three text-measurement methods on the same task. Their accuracy scores are 0.82, 0.84, 0.85. The right next question is:
  3. An AI workflow's card lists "Privacy" as "TBD." The honest assessment is: