§0.2

How Data Is Stored

The word "database" hides too much. The system that records a customer's payment is not built for the same job as the system that scans five years of transactions for a pricing analysis. The place that stores raw app logs is not the same as the place that supports semantic search over policy documents. A manager does not need to administer these systems, but does need to understand their roles. Otherwise every data conversation becomes vague: "Can we get the data?" Which data? From which system? For what decision? At what latency? With what quality contract?

The executive question: what job is this data system doing?

Modern firms usually store data in several layers. Each layer optimizes for a different job.

An operational database records the next event correctly: a payment, an order, a login, a shipment, a service case. It is built for reliability, identity, permissions, and fast small updates. An analytical database scans many past events: a year of transactions, a panel of stores, a customer cohort, a product assortment, a marketing funnel. It is built for aggregation, history, and comparison. The distinction is not technical trivia. It determines whether the system is meant to run the business or analyze the business.

The storage stack is a division of labor

Source systems → Ingestion → Storage → Transform → Metrics, models, AI → Decision

| System | Primary job | Common examples | Managerial question |
| --- | --- | --- | --- |
| Operational SQL | Run the application | Orders, accounts, payments, POS, CRM | Can the business record the next transaction correctly? |
| NoSQL and search | Serve flexible app data | Documents, sessions, profiles, product catalogs, keyword search | Can the app retrieve the right object quickly? |
| Lake and files | Keep raw and semi-raw assets | Logs, parquet files, PDFs, images, audio, vendor drops | Can the firm preserve data before every use is known? |
| Warehouse or lakehouse | Answer analytical questions | Snowflake, BigQuery, Databricks-style lakehouses | Can managers scan history across customers, products, and time? |
| Local analytics | Let one analyst work quickly | DuckDB, notebooks, local parquet, reproducible extracts | Can a small team investigate without waiting on production systems? |
| Vector and graph stores | Find meaning and relationships | Embeddings, semantic search, RAG indexes, product/customer graphs | Can the workflow retrieve related ideas, documents, or entities? |

The practical distinction is transactional versus analytical: one system records the next event; another scans many past events to support a decision.

Figure 1. The modern storage stack is a division of labor. Each system class stores a different kind of evidence for a different kind of decision.

Figure 1 is the practical map. Operational SQL, NoSQL, lakes, warehouses, local analytical engines, vector databases, graph stores, and search indexes are not competing names for the same thing. They are specialized pieces of a workflow that moves from source activity to decision.


Transactional versus analytical

The most important distinction is transactional versus analytical.

A transactional system answers: can we record and retrieve one business event correctly right now? The point-of-sale system must know the price, charge the customer, update inventory, and create a receipt. The CRM must record a sales interaction. The app database must know which user is logged in. Mistakes here interrupt the business.

An analytical system answers: what pattern emerges across many business events? The warehouse computes weekly revenue by region, demand by product, churn by cohort, margin by promotion, and service quality by store. It is not trying to record the next transaction. It is trying to make history comparable.

Table 1. Transactional and analytical systems answer different questions. Confusing them creates slow tools, fragile reporting, and mistrusted numbers.

| Dimension | Transactional system | Analytical system |
| --- | --- | --- |
| Primary job | Record the next event correctly | Compare many past events |
| Typical questions | Did this order, payment, or login succeed? | Which customers, stores, products, or periods are changing? |
| Data shape | Current records, normalized entities, app state | History, panels, aggregates, derived metrics |
| Latency | Immediate or near-immediate | Batch, near-real-time, or streaming depending on the use case |
| Failure mode | The business cannot operate | The organization makes decisions from stale or inconsistent evidence |

Managers feel this distinction in ordinary meetings. When the CFO asks for margin by promotion over the past six quarters, the answer should not come from the live checkout database. When customer support needs the current status of an order, the answer should not wait for the nightly warehouse refresh. Each system can be excellent and still be wrong for the job.
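
The two jobs can be sketched side by side. The example below is a minimal illustration, not a recommended architecture: it uses a single in-memory SQLite table for both roles, and the table name `orders`, its columns, and the helper functions are all hypothetical. In practice the analytical query would usually run against a warehouse copy rather than the live system.

```python
import sqlite3

# Illustrative only: one in-memory SQLite table plays both roles here.
# In practice the two jobs usually run on separate systems.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, week TEXT, amount REAL)"
)


def record_order(order_id, region, week, amount):
    """Transactional job: write one event, all or nothing."""
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?, ?)",
            (order_id, region, week, amount),
        )


def order_amount(order_id):
    """Transactional read: fetch one record by its key."""
    row = conn.execute(
        "SELECT amount FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    return None if row is None else row[0]


def weekly_revenue_by_region():
    """Analytical job: scan all history and aggregate it."""
    return conn.execute(
        "SELECT week, region, SUM(amount) FROM orders "
        "GROUP BY week, region ORDER BY week, region"
    ).fetchall()


# Made-up events for the sketch.
for event in [
    (1, "East", "2024-W01", 20.0),
    (2, "East", "2024-W01", 30.0),
    (3, "West", "2024-W01", 15.0),
    (4, "East", "2024-W02", 10.0),
]:
    record_order(*event)
```

`order_amount(3)` touches one row by key, which is the shape of a support lookup. `weekly_revenue_by_region()` reads every row, which is the shape of the CFO's question; on a real checkout database that scan would compete with live traffic.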


The major storage roles

SQL operational databases store structured app and business records: customers, orders, payments, products, subscriptions, tickets. They usually enforce relationships and consistency. If one customer has many orders, SQL is good at keeping that relationship explicit.
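
As a rough sketch of what "keeping the relationship explicit" means, the example below declares a foreign key in SQLite so that every order must point at an existing customer. The table names, columns, and sample rows are hypothetical, and SQLite stands in for whatever operational database a firm actually runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(customer_id), "
    "amount REAL NOT NULL)"
)

# One customer, many orders (made-up data).
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0)],
)


def orders_for(customer_id):
    """Return (order_id, amount) pairs for one customer."""
    return conn.execute(
        "SELECT order_id, amount FROM orders "
        "WHERE customer_id = ? ORDER BY order_id",
        (customer_id,),
    ).fetchall()


def try_orphan_order():
    """Attempt to insert an order for a customer that does not exist."""
    try:
        conn.execute("INSERT INTO orders VALUES (12, 999, 5.0)")
        return True
    except sqlite3.IntegrityError:
        return False  # the database rejected the inconsistent record
```

Here `try_orphan_order()` returns `False`: the database itself refuses an order with no matching customer, so the consistency rule does not depend on every application remembering to check.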

NoSQL systems serve flexible or high-scale application data: product catalogs, session state, user profiles, event payloads, documents, and other records whose structure changes often. They are often useful when the application needs fast reads and writes over flexible objects.

Data lakes and object storage keep raw or semi-raw assets: logs, vendor files, parquet tables, documents, images, audio, and historical extracts. The lake is useful when the firm wants to preserve data before every analytical use is known.

Warehouses and lakehouses make history analyzable. Systems such as Snowflake, BigQuery, and Databricks-style lakehouses are used to scan large historical datasets, join source systems, define metrics, and support dashboards, notebooks, and model training.

DuckDB-style local analytics gives analysts a fast, lightweight way to query sizable datasets on a laptop or in a reproducible script, without a server to administer. This is useful for teaching, prototyping, case packs, and focused investigation before work becomes shared infrastructure.

Search, vector, and graph systems support retrieval and relationships. Keyword search finds exact or near-exact terms. Vector databases store embeddings so workflows can retrieve semantically related documents, products, customers, or images. Graph stores represent relationships such as referrals, product co-purchases, supply chains, account networks, and organizational structures.
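
To make vector and graph retrieval concrete, here is a minimal sketch using only the Python standard library. The three-dimensional "embeddings" are invented by hand for illustration; real embeddings come from a model and have far more dimensions, and a real vector database would use an index rather than scoring every document. The document titles, the co-purchase data, and the function names are all hypothetical.

```python
import math

# Toy hand-made embeddings; real ones are produced by a model.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "return shipping": [0.8, 0.3, 0.1],
    "office holiday schedule": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec, k=2):
    """Vector retrieval: rank every document by similarity to the query."""
    ranked = sorted(DOCS, key=lambda name: cosine(query_vec, DOCS[name]), reverse=True)
    return ranked[:k]


# Toy co-purchase graph stored as an adjacency map.
CO_PURCHASES = {
    "tent": {"sleeping bag", "stove"},
    "sleeping bag": {"tent"},
    "stove": {"tent", "fuel"},
    "fuel": {"stove"},
}


def related(item):
    """Graph retrieval: items directly linked to this one."""
    return sorted(CO_PURCHASES.get(item, set()))
```

A query vector such as `[1.0, 0.2, 0.0]` retrieves the two returns-related documents ahead of the holiday schedule even though no keyword is matched, while `related("tent")` follows explicit links instead of computing similarity.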

The point is not to memorize product names. The point is to ask which system is doing which job.


Batch, streaming, and freshness

Data also differs by freshness.

Some workflows are fine with a nightly refresh. A weekly executive KPI dashboard, a monthly pricing review, or a quarterly market expansion analysis does not need every transaction within seconds. Other workflows need near-real-time data: fraud detection, stockout alerts, delivery routing, ad bidding, anomaly detection, or a customer-facing recommendation shown during a session.

Freshness has a cost. Real-time systems are harder to build, harder to monitor, and easier to over-trust. A manager should ask: what decision becomes better if this is refreshed sooner? If the action is weekly, minute-level freshness may only create noise.

Table 2. Data freshness should match the decision cadence. Faster is valuable only when someone can act faster.

| Cadence | Example workflow | Managerial test |
| --- | --- | --- |
| Daily or weekly batch | Executive KPI dashboard, store performance review | Will anyone change an action more than once per day or week? |
| Near-real-time | Inventory alert, fraud flag, support escalation | Does a faster signal prevent loss or improve service immediately? |
| Streaming or session-time | Ad bidding, next-best recommendation, live routing | Is the decision made during the customer or operational interaction? |
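
One way to operationalize the managerial test is a staleness budget per workflow: the oldest the data may be before the decision suffers. The sketch below is an assumption-laden illustration; the workflow names and the budget values are invented, and a real team would set them from the decision cadence of each workflow.

```python
from datetime import datetime, timedelta

# Hypothetical staleness budgets; real values depend on the decision.
MAX_STALENESS = {
    "weekly_kpi_dashboard": timedelta(days=1),
    "inventory_alert": timedelta(minutes=15),
    "ad_bidding": timedelta(seconds=1),
}


def is_fresh_enough(workflow, last_refresh, now):
    """True if the data's age is within the workflow's staleness budget."""
    return now - last_refresh <= MAX_STALENESS[workflow]


# Example: data last refreshed six hours ago.
now = datetime(2024, 1, 2, 9, 0)
last_refresh = datetime(2024, 1, 2, 3, 0)
dashboard_ok = is_fresh_enough("weekly_kpi_dashboard", last_refresh, now)
alert_ok = is_fresh_enough("inventory_alert", last_refresh, now)
```

With these made-up budgets, six-hour-old data is acceptable for the weekly dashboard (`dashboard_ok` is `True`) but not for the inventory alert (`alert_ok` is `False`). The same refresh schedule can be right for one workflow and wrong for another.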

How storage affects methods

The methods in the rest of the book depend on storage choices.

  • Dashboards need stable metric tables, not ad hoc extracts.
  • Causal analysis needs historical data at the right grain, not only summary reports.
  • Prediction needs labels, features, and timestamps aligned in a feature table.
  • Recommenders need exposure logs as well as purchase logs, or they confuse preference with what the system happened to show.
  • Retrieval-augmented generation needs a document store, a search or vector index, source metadata, and a way to evaluate retrieval quality.
  • Governance needs lineage: where the data came from, who owns it, how fresh it is, and what changed since last time.
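
The prediction requirement above, that labels, features, and timestamps be aligned, can be sketched as a point-in-time feature table. The example is a minimal illustration with invented data: the customer IDs, dates, field names, and the function `build_feature_table` are hypothetical. The key step is that each row's features use only events dated before that row's prediction date.

```python
from datetime import date

# Made-up purchase events: (customer_id, purchase_date, amount).
PURCHASES = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 2, 10), 60.0),
    ("c1", date(2024, 3, 20), 100.0),  # happens after the prediction date
    ("c2", date(2024, 2, 1), 25.0),
]

# Made-up labels: (customer_id, prediction_date, churned).
LABELS = [
    ("c1", date(2024, 3, 1), 0),
    ("c2", date(2024, 3, 1), 1),
]


def build_feature_table(labels, purchases):
    """Build one row per label using only events before the prediction date."""
    rows = []
    for customer_id, as_of, label in labels:
        past = [
            amount
            for cust, purchased_on, amount in purchases
            if cust == customer_id and purchased_on < as_of
        ]
        rows.append(
            {
                "customer_id": customer_id,
                "as_of": as_of,
                "n_purchases": len(past),
                "total_spend": sum(past),
                "label": label,
            }
        )
    return rows


feature_table = build_feature_table(LABELS, PURCHASES)
```

Customer `c1`'s March 20 purchase is excluded because it falls after the March 1 prediction date. Including it would let the model see the future, which is why the storage layer has to keep event timestamps rather than only current totals.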