§0.1

Where Data Comes From

Data begins before it is a table. It begins as ordinary life: a customer searches for a product, walks into a store, taps a card, opens an app, ignores an offer, leaves a review, calls support, uploads a receipt, asks a chatbot a question, or cancels a subscription. Modern businesses are covered in these traces. Some are clean rows. Some are messy sentences. Some are images, audio, locations, documents, or model outputs. All of them are partial records of something a person, machine, or organization did.

The executive question: what business activity generated this data?

The first managerial skill is not choosing a model. It is asking what the data is a trace of. A sales row is a trace of a completed transaction, not of all customer demand. A click is a trace of attention inside one interface, not of preference in general. A support ticket is a trace of a problem serious enough to report, not of every problem customers experienced. A prompt log is a trace of an AI workflow being used, not proof that the workflow helped.

This matters because every data source carries the bias of the process that created it. If the process changes, the data changes even when the underlying business does not. If an app redesign makes the return button harder to find, return requests may fall while dissatisfaction rises. If a chatbot deflects simple questions, the support tickets that remain will look more severe. If a store starts scanning loyalty IDs more consistently, "repeat customer" metrics may jump without any real change in loyalty.
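The loyalty-ID case can be made concrete with a little arithmetic. The sketch below is illustrative only: the 40% true repeat share, the two scan rates, and the function name `recorded_repeat_share` are assumptions, not figures from any real store. It also assumes a returning customer is recognized only when the loyalty ID is scanned.

```python
def recorded_repeat_share(true_repeat_share: float, scan_rate: float) -> float:
    """Share of transactions the system labels 'repeat customer'.

    Assumption: a returning customer is only recognized when the
    loyalty ID is scanned; unscanned visits are recorded as anonymous.
    """
    return true_repeat_share * scan_rate


# Hypothetical values. True loyalty is held constant on purpose.
TRUE_REPEAT_SHARE = 0.40

before = recorded_repeat_share(TRUE_REPEAT_SHARE, scan_rate=0.50)
after = recorded_repeat_share(TRUE_REPEAT_SHARE, scan_rate=0.90)

print(f"recorded repeat share before: {before:.0%}")
print(f"recorded repeat share after:  {after:.0%}")
```

Under these assumptions the dashboard metric nearly doubles while customer behavior is unchanged. The only thing that moved was the process that writes the record.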

Where business data comes from

  • Customer behavior: purchases, clicks, searches, visits, returns, ratings, reviews. Business use: demand, loyalty, churn, product-market fit.
  • Business operations: inventory, invoices, CRM records, shipments, staffing, contracts. Business use: margin, service quality, capacity, working capital.
  • Digital systems: app events, web logs, ad auctions, recommendation impressions. Business use: funnels, personalization, attribution, experimentation.
  • Physical world: sensors, location, cameras, store traffic, delivery scans. Business use: utilization, loss prevention, routing, field execution.
  • Human language: support tickets, chats, call transcripts, emails, documents. Business use: customer voice, compliance, knowledge retrieval, workflow routing.
  • AI workflows: prompts, responses, citations, tool calls, evals, human review. Business use: automation quality, risk controls, continuous improvement.

Data is usually a trace of work that already happened. The trace can be useful, but it is never the whole reality.

Figure 1. Business data is generated by many kinds of activity. The manager reads each source by asking what work created the record and what decision it can support.

Figure 1 gives the broad map. Six source families cover most data a modern manager will encounter: customer behavior, business operations, digital systems, the physical world, human language, and AI workflows. The source family matters because it tells you what kind of claim the data can support.


A day in the life of data

Consider one Bean & Basket customer on a Tuesday morning.

At 7:48 a.m., Maya searches the mobile app for "oat latte." That creates a search event. At 7:49, the app shows her a seasonal drink recommendation. That creates an impression. She taps it, adds a pastry, applies a loyalty reward, and checks out. That creates cart, transaction, payment, promotion, and loyalty records. Her order is prepared at Store 104, where the point-of-sale system updates inventory and the kitchen display records fulfillment time. Her phone location confirms she picked up the order. At 9:10, she rates the experience four stars and writes, "Great drink, long wait." That creates a rating and review text. Later, the operations team uses that review in a text dashboard, the marketing team uses the transaction in a churn model, the product team uses the search event to tune recommendations, and the regional manager sees the wait-time issue in a KPI dashboard.

One morning produced many records. None of them is "the customer" in full. Each is a trace from a specific system, with a specific purpose and a specific blind spot.

Table 1. One customer morning becomes several business records. The managerial question changes with the source.
Trace | Likely record | Question it can support | What it misses
Search for oat latte | app_search_events | What are customers trying to find? | Needs not expressed through search
Recommendation shown | recommendation_impressions | Which offers receive attention? | What would have happened without the recommendation
Completed purchase | transactions | What did customers buy, where, and when? | Customers who considered but did not buy
Long wait | fulfillment_time | Where is the operating process slow? | Subjective tolerance for waiting
Four-star review | reviews | What language do customers use to describe the experience? | Silent customers who never leave reviews
AI summary used by support | ai_workflow_logs | Is the AI workflow accurate, grounded, and useful? | Errors not caught by human review

The lesson is practical: do not call these records "customer data" as if they were interchangeable. The search event, transaction, review, wait-time record, and AI log are different slices of the customer's interaction with the firm. They become powerful only when the manager knows which slice is being used.
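One way to keep the slices distinct is to store each source's blind spot next to its name. The sketch below encodes Table 1 as a small catalog. The record names and wording come from the table; the `TraceSource` class and the `blind_spot` helper are hypothetical names introduced here for illustration, not part of any Bean & Basket system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSource:
    trace: str      # what happened, in plain language
    record: str     # likely record name, as listed in Table 1
    supports: str   # question the record can support
    misses: str     # what the record cannot see


# Contents copied from Table 1.
TABLE_1 = [
    TraceSource("Search for oat latte", "app_search_events",
                "What are customers trying to find?",
                "Needs not expressed through search"),
    TraceSource("Recommendation shown", "recommendation_impressions",
                "Which offers receive attention?",
                "What would have happened without the recommendation"),
    TraceSource("Completed purchase", "transactions",
                "What did customers buy, where, and when?",
                "Customers who considered but did not buy"),
    TraceSource("Long wait", "fulfillment_time",
                "Where is the operating process slow?",
                "Subjective tolerance for waiting"),
    TraceSource("Four-star review", "reviews",
                "What language do customers use to describe the experience?",
                "Silent customers who never leave reviews"),
    TraceSource("AI summary used by support", "ai_workflow_logs",
                "Is the AI workflow accurate, grounded, and useful?",
                "Errors not caught by human review"),
]


def blind_spot(record: str) -> str:
    """Return what the named record misses, or raise KeyError if unknown."""
    for source in TABLE_1:
        if source.record == record:
            return source.misses
    raise KeyError(record)


print(blind_spot("transactions"))
```

A catalog like this forces the question "which slice is this?" before anyone builds a metric on it.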


The source changes the claim

The same managerial topic can look different depending on where the data came from.

Take customer satisfaction. A firm might measure it through star ratings, review text, support tickets, refund requests, call transcripts, social posts, survey responses, churn, or repeat purchase. These are not redundant measures of the same thing. They capture different moments in the customer journey.

  • Star ratings are easy to monitor but shallow.
  • Review text explains reasons but overrepresents people willing to write.
  • Support tickets reveal operational problems but only after the customer escalates.
  • Refunds and returns capture costly dissatisfaction but miss silent disappointment.
  • Churn is a final outcome, often too late for diagnosis.
  • AI summaries can scale interpretation but must be evaluated against source evidence.

The managerial question is not "which data source is best?" The question is: which source is closest to the decision we need to make, and what bias does its generation process introduce?


Three generation traps

Trap 1: activity bias. Data overrepresents people who act inside the measured system. App data overrepresents app users. Reviews overrepresent people motivated to write. Loyalty data overrepresents identified customers. The unmeasured population may behave differently.
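A small worked example shows how large the gap from activity bias can be. All numbers below are hypothetical: 300 app users averaging 6 visits a month and 700 non-app customers averaging 2. The helper `weighted_mean` is introduced here for illustration.

```python
def weighted_mean(groups: list[tuple[int, float]]) -> float:
    """Mean visits per customer across (customer_count, mean_visits) groups."""
    total_customers = sum(count for count, _ in groups)
    total_visits = sum(count * visits for count, visits in groups)
    return total_visits / total_customers


# Hypothetical customer base: (number of customers, mean visits per month).
app_users = (300, 6.0)
non_app_customers = (700, 2.0)

# What the app data alone suggests versus the whole customer base.
measured = weighted_mean([app_users])
actual = weighted_mean([app_users, non_app_customers])

print(f"mean visits seen in app data:  {measured:.1f}")
print(f"mean visits across all buyers: {actual:.1f}")
```

With these assumed numbers, app data overstates visit frequency by almost a factor of two. In practice the unmeasured group's value is unknown, which is exactly why the gap is easy to miss.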

Trap 2: workflow bias. Data changes when the business process changes. A new refund policy, a redesigned app, a chatbot handoff rule, or a new sales script can change recorded behavior without changing underlying demand or satisfaction.

Trap 3: AI feedback bias. AI workflows create new records and change the behavior that future models learn from. If an AI support assistant routes some complaints away from human agents, the remaining ticket data no longer represents the full complaint mix. If a recommender shows the same products repeatedly, future purchase data reflects exposure as much as preference.
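The ticket-mix effect can be sketched with simple counts. The complaint volumes and the 80% deflection rate below are hypothetical, and the sketch assumes the assistant deflects only simple complaints.

```python
# Hypothetical monthly complaints by severity, before any AI assistant.
complaints = {"simple": 600, "moderate": 300, "severe": 100}

# Assumed share of each type the assistant resolves without a human ticket.
deflection_rate = {"simple": 0.80, "moderate": 0.0, "severe": 0.0}


def remaining_tickets(complaints: dict[str, int],
                      deflection_rate: dict[str, float]) -> dict[str, int]:
    """Tickets that still reach human agents after deflection."""
    return {
        kind: round(count * (1 - deflection_rate[kind]))
        for kind, count in complaints.items()
    }


def severe_share(tickets: dict[str, int]) -> float:
    """Fraction of tickets labeled severe."""
    return tickets["severe"] / sum(tickets.values())


tickets_after = remaining_tickets(complaints, deflection_rate)

print(f"severe share before assistant: {severe_share(complaints):.1%}")
print(f"severe share after assistant:  {severe_share(tickets_after):.1%}")
```

The number of severe complaints is identical in both cases. Only the denominator changed, so a model or dashboard built on the remaining tickets would see a customer base that looks angrier than it is.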

These traps are not reasons to avoid data. They are reasons to read data as a product of its generating process.


What this chapter changes about the rest of the book

Part I will soon teach rows, columns, grain, variable types, joins, and transformations. Those concepts matter more after this chapter, not less. A row is not just a row. It is the footprint of a business event. A column is not just a column. It is a measurement choice. A table is not just a table. It is a compressed view of some workflow.

Later parts extend the same stance:

  • A dashboard is a repeated view of selected traces.
  • A causal design asks whether one action changed a future trace.
  • A prediction model learns patterns in past traces to rank future cases.
  • A recommender shapes the traces it will later observe.
  • A language model workflow turns documents, prompts, retrievals, and human reviews into governed evidence.