§0.1
Where Data Comes From
Data begins before it is a table. It begins as ordinary life: a customer searches for a product, walks into a store, taps a card, opens an app, ignores an offer, leaves a review, calls support, uploads a receipt, asks a chatbot a question, or cancels a subscription. Modern businesses are covered in these traces. Some are clean rows. Some are messy sentences. Some are images, audio, locations, documents, or model outputs. All of them are partial records of something a person, machine, or organization did.
The executive question: what business activity generated this data?
The first managerial skill is not choosing a model. It is asking what the data is a trace of. A sales row is a trace of a completed transaction, not of all customer demand. A click is a trace of attention inside one interface, not of preference in general. A support ticket is a trace of a problem serious enough to report, not of every problem customers experienced. A prompt log is a trace of an AI workflow being used, not proof that the workflow helped.
This matters because every data source carries the bias of the process that created it. If the process changes, the data changes even when the underlying business does not. If an app redesign makes the return button harder to find, return requests may fall while dissatisfaction rises. If a chatbot deflects simple questions, the support tickets that remain will look more severe on average. If a store starts scanning loyalty IDs more consistently, "repeat customer" metrics may jump without any real change in loyalty.
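The loyalty-ID case can be made concrete with a little arithmetic. The sketch below is illustrative only: the 40% repeat share, the two scan rates, and the function name `measured_repeat_share` are assumptions chosen for the example, and it uses the simplifying rule that a repeat visit is recorded as "repeat" only when the loyalty ID is scanned.

```python
# Illustrative sketch: all numbers are hypothetical, not Bean & Basket data.

def measured_repeat_share(true_repeat_share: float, scan_rate: float) -> float:
    """Share of visits the system records as repeat-customer visits.

    Simplifying assumption: a repeat visit is recorded as "repeat"
    only when the loyalty ID is scanned at checkout.
    """
    return true_repeat_share * scan_rate


TRUE_REPEAT_SHARE = 0.40  # assumed: real loyalty does not change

before = measured_repeat_share(TRUE_REPEAT_SHARE, scan_rate=0.50)
after = measured_repeat_share(TRUE_REPEAT_SHARE, scan_rate=0.80)

print(f"Measured repeat share before: {before:.0%}")
print(f"Measured repeat share after:  {after:.0%}")
```

Under these assumptions the reported metric rises from 20% to 32% while actual loyalty stays fixed at 40%. The only thing that changed was the recording process.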
Figure 1. Where business data comes from. Data is usually a trace of work that already happened. The trace can be useful, but it is never the whole reality.
Figure 1 gives the broad map. Six source families cover most data a modern manager will encounter: customer behavior, business operations, digital systems, the physical world, human language, and AI workflows. The source family matters because it tells you what kind of claim the data can support.
A day in the life of data
Consider one Bean & Basket customer on a Tuesday morning.
At 7:48 a.m., Maya searches the mobile app for "oat latte." That creates a search event. At 7:49, the app shows her a seasonal drink recommendation. That creates an impression. She taps it, adds a pastry, applies a loyalty reward, and checks out. That creates cart, transaction, payment, promotion, and loyalty records. Her order is prepared at Store 104, where the point-of-sale system updates inventory and the kitchen display records fulfillment time. Her phone location confirms she picked up the order. At 9:10, she rates the experience four stars and writes, "Great drink, long wait." That creates a rating and review text. Later, the operations team uses that review in a text dashboard, the marketing team uses the transaction in a churn model, the product team uses the search event to tune recommendations, and the regional manager sees the wait-time issue in a KPI dashboard.
One morning produced many records. None of them is "the customer" in full. Each is a trace from a specific system, with a specific purpose and a specific blind spot.
| Trace | Likely record | Question it can support | What it misses |
|---|---|---|---|
| Search for oat latte | app_search_events | What are customers trying to find? | Needs not expressed through search |
| Recommendation shown | recommendation_impressions | Which offers receive attention? | What would have happened without the recommendation |
| Completed purchase | transactions | What did customers buy, where, and when? | Customers who considered but did not buy |
| Long wait | fulfillment_time | Where is the operating process slow? | Subjective tolerance for waiting |
| Four-star review | reviews | What language do customers use to describe the experience? | Silent customers who never leave reviews |
| AI summary used by support | ai_workflow_logs | Is the AI workflow accurate, grounded, and useful? | Errors not caught by human review |
The lesson is practical: do not call these records "customer data" as if they were interchangeable. The search event, transaction, review, wait-time record, and AI log are different slices of the customer's interaction with the firm. They become powerful only when the manager knows which slice is being used.
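One way to keep the slices distinct in practice is to record, next to each table name, what it can support and what it misses. The sketch below simply encodes the table above as a small lookup. The `Trace` class and the `describe` helper are hypothetical names introduced for illustration, not part of any Bean & Basket system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    record: str    # table name taken from the chapter's example
    supports: str  # question the trace can support
    misses: str    # what the trace cannot see


# Contents copied from the trace table in this section.
TRACES = {
    t.record: t
    for t in [
        Trace("app_search_events",
              "What are customers trying to find?",
              "Needs not expressed through search"),
        Trace("recommendation_impressions",
              "Which offers receive attention?",
              "What would have happened without the recommendation"),
        Trace("transactions",
              "What did customers buy, where, and when?",
              "Customers who considered but did not buy"),
        Trace("fulfillment_time",
              "Where is the operating process slow?",
              "Subjective tolerance for waiting"),
        Trace("reviews",
              "What language do customers use to describe the experience?",
              "Silent customers who never leave reviews"),
        Trace("ai_workflow_logs",
              "Is the AI workflow accurate, grounded, and useful?",
              "Errors not caught by human review"),
    ]
}


def describe(record: str) -> str:
    """Return a one-line reminder of a trace's use and blind spot."""
    t = TRACES[record]
    return f"{t.record}: supports '{t.supports}'; misses '{t.misses}'"


print(describe("transactions"))
```

The point of the design is that an analysis names its source table explicitly, so the blind spot travels with the result instead of being hidden behind the label "customer data."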
The source changes the claim
The same managerial topic can look different depending on where the data came from.
Take customer satisfaction. A firm might measure it through star ratings, review text, support tickets, refund requests, call transcripts, social posts, survey responses, churn, repeat purchases, or AI-generated summaries of any of these. These are not redundant measures of the same thing. They capture different moments in the customer journey.
- Star ratings are easy to monitor but shallow.
- Review text explains reasons but overrepresents people willing to write.
- Support tickets reveal operational problems but only after the customer escalates.
- Refunds and returns capture costly dissatisfaction but miss silent disappointment.
- Churn is a final outcome, often too late for diagnosis.
- AI summaries can scale interpretation but must be evaluated against source evidence.
The managerial question is not "which data source is best?" The question is: which source is closest to the decision we need to make, and what bias does its generation process introduce?
Three generation traps
Trap 1: activity bias. Data overrepresents people who act inside the measured system. App data overrepresents app users. Reviews overrepresent people motivated to write. Loyalty data overrepresents identified customers. The unmeasured population may behave differently.
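A small worked example shows how activity bias can distort a summary statistic. The segment sizes and review counts below are invented for illustration. They assume that very happy and very unhappy customers are more likely to write a review than customers in the middle.

```python
# Illustrative sketch of activity bias: all counts are hypothetical.

segments = [
    # (true rating, customers, customers who wrote a review)
    (5, 200, 20),
    (3, 600, 12),
    (1, 200, 30),
]


def mean_rating(rows):
    """Weighted mean of ratings, where each row is (rating, count)."""
    total = sum(rating * n for rating, n in rows)
    count = sum(n for _, n in rows)
    return total / count


# Mean over every customer (normally unobservable).
true_mean = mean_rating([(r, n) for r, n, _ in segments])

# Mean over only the customers who left a review (what the data shows).
review_mean = mean_rating([(r, k) for r, _, k in segments])

# Share of customers who appear in the review data at all.
coverage = sum(k for _, _, k in segments) / sum(n for _, n, _ in segments)

print(f"True mean rating:     {true_mean:.2f}")
print(f"Mean in review data:  {review_mean:.2f}")
print(f"Customers reviewing:  {coverage:.1%}")
```

With these made-up counts, the full customer base averages 3.0 stars, but the review data averages about 2.7 and reflects only about 6% of customers. The gap comes entirely from who chose to act inside the measured system.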
Trap 2: workflow bias. Data changes when the business process changes. A new refund policy, a redesigned app, a chatbot handoff rule, or a new sales script can change recorded behavior without changing underlying demand or satisfaction.
Trap 3: AI feedback bias. AI workflows create new records and change the behavior that future models learn from. If an AI support assistant routes some complaints away from human agents, the remaining ticket data no longer represents the full complaint mix. If a recommender shows the same products repeatedly, future purchase data reflects exposure as much as preference.
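The support-assistant case can be sketched the same way. The severity scale, complaint counts, and deflection counts below are hypothetical. The sketch assumes the assistant resolves mostly low-severity complaints, so they never become tickets for human agents.

```python
# Illustrative sketch of AI feedback bias: all counts are hypothetical.

complaints = [
    # (severity 1-3, total complaints, complaints deflected by the assistant)
    (1, 600, 420),
    (2, 300, 60),
    (3, 100, 0),
]


def mean_severity(rows):
    """Weighted mean severity, where each row is (severity, count)."""
    total = sum(severity * n for severity, n in rows)
    count = sum(n for _, n in rows)
    return total / count


# The full complaint mix, before any routing.
all_complaints = [(s, n) for s, n, _ in complaints]

# What the human ticket queue records after the assistant deflects some.
remaining_tickets = [(s, n - d) for s, n, d in complaints]

full_mix = mean_severity(all_complaints)
ticket_mix = mean_severity(remaining_tickets)
ticket_count = sum(n for _, n in remaining_tickets)

print(f"Mean severity, all complaints:    {full_mix:.2f}")
print(f"Mean severity, remaining tickets: {ticket_mix:.2f}")
print(f"Tickets reaching human agents:    {ticket_count}")
```

In this example the underlying complaints are identical in both views, yet mean severity in the ticket data rises from 1.5 to about 1.85 and ticket volume falls from 1,000 to 520. A model trained on the remaining tickets would learn from a complaint mix the assistant itself helped create.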
These traps are not reasons to avoid data. They are reasons to read data as a product of its generating process.
What this chapter changes about the rest of the book
Part I will soon teach rows, columns, grain, variable types, joins, and transformations. Those concepts matter more after this chapter, not less. A row is not just a row. It is the footprint of a business event. A column is not just a column. It is a measurement choice. A table is not just a table. It is a compressed view of some workflow.
Later parts extend the same stance:
- A dashboard is a repeated view of selected traces.
- A causal design asks whether one action changed a future trace.
- A prediction model learns patterns in past traces to rank future cases.
- A recommender shapes the traces it will later observe.
- A language model workflow turns documents, prompts, retrievals, and human reviews into governed evidence.