§0.2 How Data Is Stored

The word "database" hides too much. The system that records a customer's payment is not built for the same job as the system that scans five years of transactions for a pricing analysis. The place that stores raw app logs is not the same as the place that supports semantic search over policy documents. A manager does not need to administer these systems, but does need to understand their roles. Otherwise every data conversation becomes vague: "Can we get the data?" Which data? From which system? For what decision? At what latency? With what quality contract?

The executive question: what job is this data system doing?

Modern firms usually store data in several layers. Each layer optimizes for a different job.

An operational database records the next event correctly: a payment, an order, a login, a shipment, a service case. It is built for reliability, identity, permissions, and fast small updates. An analytical database scans many past events: a year of transactions, a panel of stores, a customer cohort, a product assortment, a marketing funnel. It is built for aggregation, history, and comparison. The distinction is not technical trivia. It determines whether the system is meant to run the business or analyze the business.

The storage stack is a division of labor

Source systems → Ingestion → Storage → Transform → Metrics, models, AI → Decision

System	Primary job	Common examples	Managerial question
Operational SQL	Run the application	Orders, accounts, payments, POS, CRM	Can the business record the next transaction correctly?
NoSQL and search	Serve flexible app data	Documents, sessions, profiles, product catalogs, keyword search	Can the app retrieve the right object quickly?
Lake and files	Keep raw and semi-raw assets	Logs, parquet files, PDFs, images, audio, vendor drops	Can the firm preserve data before every use is known?
Warehouse or lakehouse	Answer analytical questions	Snowflake, BigQuery, Databricks-style lakehouses	Can managers scan history across customers, products, and time?
Local analytics	Let one analyst work quickly	DuckDB, notebooks, local parquet, reproducible extracts	Can a small team investigate without waiting on production systems?
Vector and graph stores	Find meaning and relationships	Embeddings, semantic search, RAG indexes, product/customer graphs	Can the workflow retrieve related ideas, documents, or entities?

The practical distinction is transactional versus analytical: one system records the next event; another scans many past events to support a decision.

Figure 1. The modern storage stack is a division of labor. Each system class stores a different kind of evidence for a different kind of decision.

Figure 1 is the practical map. Operational SQL, NoSQL, lakes, warehouses, local analytical engines, vector databases, graph stores, and search indexes are not competing names for the same thing. They are specialized pieces of a workflow that moves from source activity to decision.

Transactional versus analytical

The most important distinction is transactional versus analytical.

A transactional system answers: can we record and retrieve one business event correctly right now? The point-of-sale system must know the price, charge the customer, update inventory, and create a receipt. The CRM must record a sales interaction. The app database must know which user is logged in. Mistakes here interrupt the business.

An analytical system answers: what pattern emerges across many business events? The warehouse computes weekly revenue by region, demand by product, churn by cohort, margin by promotion, and service quality by store. It is not trying to record the next transaction. It is trying to make history comparable.

Table 1. Transactional and analytical systems answer different questions. Confusing them creates slow tools, fragile reporting, and mistrusted numbers.

Dimension	Transactional system	Analytical system
Primary job	Record the next event correctly	Compare many past events
Typical questions	Did this order, payment, or login succeed?	Which customers, stores, products, or periods are changing?
Data shape	Current records, normalized entities, app state	History, panels, aggregates, derived metrics
Latency	Immediate or near-immediate	Batch, near-real-time, or streaming depending on the use case
Failure mode	The business cannot operate	The organization makes decisions from stale or inconsistent evidence

Managers feel this distinction in ordinary meetings. When the CFO asks for margin by promotion over the past six quarters, the answer should not come from the live checkout database. When customer support needs the current status of an order, the answer should not wait for the nightly warehouse refresh. Each system can be excellent and still be wrong for the job.
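The two jobs can be sketched in a few lines of Python using the standard-library sqlite3 module. This is only an illustration under stated assumptions: the table name, columns, and helper functions are hypothetical, and in a real firm the transactional write and the analytical scan would usually run on different systems rather than one in-memory database. The point is the shape of each query: one touches a single record by key, the other scans every row.

```python
import sqlite3

# In-memory database for illustration; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
)


def record_order(conn, order_id, region, amount):
    """Transactional job: write one event atomically, or not at all."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?, 'paid')",
            (order_id, region, amount),
        )


def order_status(conn, order_id):
    """Transactional read: look up one record by its key."""
    row = conn.execute(
        "SELECT status FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None


def revenue_by_region(conn):
    """Analytical job: scan all rows and aggregate them."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


record_order(conn, 1, "east", 20.0)
record_order(conn, 2, "west", 35.0)
record_order(conn, 3, "east", 15.0)

print(order_status(conn, 2))    # one record, right now
print(revenue_by_region(conn))  # a pattern across many records
```

With three rows the difference is invisible. With five years of transactions, the aggregate query is the one that slows down a live checkout system.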

The major storage roles

SQL operational databases store structured app and business records: customers, orders, payments, products, subscriptions, tickets. They usually enforce relationships and consistency. If one customer has many orders, SQL is good at keeping that relationship explicit.
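As a minimal sketch of what "enforce relationships" means, the following uses Python's sqlite3 module with a foreign key from orders to customers. The schema and helper names are hypothetical. Note that SQLite requires foreign key enforcement to be switched on per connection; many server databases enforce declared foreign keys by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Hypothetical schema: one customer has many orders.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount REAL NOT NULL
    )"""
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")
conn.execute("INSERT INTO orders VALUES (101, 1, 40.0)")


def orders_for(conn, customer_id):
    """Return the order ids that belong to one customer."""
    rows = conn.execute(
        "SELECT order_id FROM orders WHERE customer_id = ? ORDER BY order_id",
        (customer_id,),
    ).fetchall()
    return [r[0] for r in rows]


def try_orphan_order(conn):
    """Attempt to insert an order for a customer that does not exist."""
    try:
        conn.execute("INSERT INTO orders VALUES (102, 999, 10.0)")
        return True
    except sqlite3.IntegrityError:
        return False  # the database refused the inconsistent record


print(orders_for(conn, 1))
print(try_orphan_order(conn))
```

The database, not the application code, rejects the order that points to a missing customer. That is the consistency guarantee the paragraph above describes.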

NoSQL systems serve flexible or high-scale application data: product catalogs, session state, user profiles, event payloads, documents, and other records whose structure changes often. They are often useful when the application needs fast reads and writes over flexible objects.
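To make "flexible objects" concrete, here is a toy in-memory document store in Python. It is a sketch of the idea, not of any particular NoSQL product: the `put` and `get` helpers and the catalog records are hypothetical, and a real system would add persistence, indexing, and distribution.

```python
# A toy document store: one dictionary per key, with no fixed schema.
catalog = {}


def put(doc_id, doc):
    """Store a document under its key, whatever fields it has."""
    catalog[doc_id] = doc


def get(doc_id):
    """Fetch a whole document by key, or None if it is absent."""
    return catalog.get(doc_id)


# Two products with different fields; no table definition has to change.
put("sku-1", {"name": "Trail shoe", "sizes": [8, 9, 10], "color": "blue"})
put("sku-2", {"name": "Gift card", "denominations": [25, 50]})

print(get("sku-1")["sizes"])
print(get("sku-2").get("sizes"))  # field absent on this document
```

The flexibility has a price: nothing stops two documents from spelling the same field differently, so the application has to carry the discipline that a SQL schema would otherwise enforce.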

Data lakes and object storage keep raw or semi-raw assets: logs, vendor files, parquet tables, documents, images, audio, and historical extracts. The lake is useful when the firm wants to preserve data before every analytical use is known.

Warehouses and lakehouses make history analyzable. Systems such as Snowflake, BigQuery, and Databricks-style lakehouses are used to scan large historical datasets, join source systems, define metrics, and support dashboards, notebooks, and model training.

DuckDB-style local analytics gives analysts a fast, lightweight way to query sizable datasets on a laptop or in a reproducible script. This is useful for teaching, prototyping, case packs, and focused investigation before work becomes shared infrastructure.

Search, vector, and graph systems support retrieval and relationships. Keyword search finds exact or near-exact terms. Vector databases store embeddings so workflows can retrieve semantically related documents, products, customers, or images. Graph stores represent relationships such as referrals, product co-purchases, supply chains, account networks, and organizational structures.
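The vector retrieval idea can be sketched in plain Python. The three-dimensional vectors below are made up for illustration; real embeddings come from an embedding model and typically have hundreds of dimensions, and a production vector database would use an approximate index rather than the exhaustive comparison shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical embeddings, hand-written for illustration only.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "return window": [0.8, 0.2, 0.1],
    "shipping rates": [0.1, 0.9, 0.2],
}


def top_k(query_vec, k=2):
    """Rank every stored document by similarity to the query vector."""
    ranked = sorted(
        index, key=lambda doc: cosine(query_vec, index[doc]), reverse=True
    )
    return ranked[:k]


# A query vector that points toward the "refunds and returns" direction.
print(top_k([1.0, 0.0, 0.0]))
```

Neither of the two top results needs to share a keyword with the query. They are retrieved because their vectors point in a similar direction, which is what distinguishes semantic retrieval from keyword search.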

The point is not to memorize product names. The point is to ask which system is doing which job.

Batch, streaming, and freshness

Data also differs by freshness.

Some workflows are fine with a nightly refresh. A weekly executive KPI dashboard, a monthly pricing review, or a quarterly market expansion analysis does not need every transaction within seconds. Other workflows need near-real-time data: fraud detection, stockout alerts, delivery routing, ad bidding, anomaly detection, or a customer-facing recommendation shown during a session.

Freshness has a cost. Real-time systems are harder to build, harder to monitor, and easier to over-trust. A manager should ask: what decision becomes better if this is refreshed sooner? If the action is weekly, minute-level freshness may only create noise.

Table 2. Data freshness should match the decision cadence. Faster is valuable only when someone can act faster.

Cadence	Example workflow	Managerial test
Daily or weekly batch	Executive KPI dashboard, store performance review	Will anyone change an action more than once per day or week?
Near-real-time	Inventory alert, fraud flag, support escalation	Does a faster signal prevent loss or improve service immediately?
Streaming or session-time	Ad bidding, next-best recommendation, live routing	Is the decision made during the customer or operational interaction?
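One way to make the managerial test operational is a staleness check: compare the age of the data with the longest delay the decision can tolerate. The sketch below assumes hypothetical staleness budgets for each cadence; the actual thresholds are a business choice, not a technical constant.

```python
from datetime import datetime, timedelta

# Hypothetical staleness budgets per decision cadence.
MAX_STALENESS = {
    "batch": timedelta(days=1),
    "near_real_time": timedelta(minutes=5),
    "session": timedelta(seconds=1),
}


def is_fresh_enough(last_refreshed, now, cadence):
    """True if the data is no older than the cadence allows."""
    return now - last_refreshed <= MAX_STALENESS[cadence]


now = datetime(2024, 1, 2, 12, 0)
last_refreshed = datetime(2024, 1, 2, 6, 0)  # six hours old

print(is_fresh_enough(last_refreshed, now, "batch"))           # weekly dashboard
print(is_fresh_enough(last_refreshed, now, "near_real_time"))  # fraud flag
```

The same six-hour-old table passes for a dashboard and fails for a fraud flag. The data did not change; the decision it serves did.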

How storage affects methods

The methods in the rest of the book depend on these storage choices:

Dashboards need stable metric tables, not ad hoc extracts.
Causal analysis needs historical data at the right grain, not only summary reports.
Prediction needs labels, features, and timestamps aligned in a feature table.
Recommenders need exposure logs as well as purchase logs, or they confuse preference with what the system happened to show.
Retrieval-augmented generation needs a document store, a search or vector index, source metadata, and a way to evaluate retrieval quality.
Governance needs lineage: where the data came from, who owns it, how fresh it is, and what changed since last time.
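The recommender point is worth a small worked example. Assume, hypothetically, that item A was shown 100 times and bought 10 times, while item B was shown 10 times and bought 5 times. The logs and function names below are invented for illustration.

```python
from collections import Counter

# Hypothetical logs: each entry is one exposure or one purchase of an item.
exposures = ["A"] * 100 + ["B"] * 10
purchases = ["A"] * 10 + ["B"] * 5


def purchase_counts(purchases):
    """What a purchase log alone can tell us."""
    return Counter(purchases)


def conversion_rates(exposures, purchases):
    """Purchases per exposure, which requires the exposure log."""
    shown = Counter(exposures)
    bought = Counter(purchases)
    return {item: bought[item] / shown[item] for item in shown}


counts = purchase_counts(purchases)
rates = conversion_rates(exposures, purchases)

print(max(counts, key=counts.get))  # most purchased item
print(max(rates, key=rates.get))    # item most likely to be bought when shown
```

Purchase counts favor A, but only because the system showed A ten times as often. Per exposure, B converts at five times the rate. Without the exposure log stored alongside the purchase log, that comparison cannot be made.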