§10.3

Numeric Prediction

Sometimes the answer the firm needs is a number, not a class. What should this listing rent for? How many units will sell next week? What lifetime value should we expect from this new cohort? The mathematical machinery rhymes with classification — fit a model on past cases, score it on held-out cases, monitor in production — but the evaluation language is different. The headline number is no longer an AUC; it is an error in the units the business cares about.

This article frames numeric prediction (often called regression in machine learning, but distinct from the causal regression of Part III) through a single running example: predicting a nightly listing price. The vocabulary developed here travels to demand forecasting, lifetime value, and any other "predict a number" task.

The Executive Question

How wrong is the model on average, where is it most wrong, and is the magnitude of error tolerable for the action we are taking?

The same model can be good enough to set a listing's starting price and not good enough to drive a fully automated repricing engine. The grade depends on what an error of X dollars costs in the specific decision flow.

Setup: the Listing-Pricing Example

For the rest of this article, the task is:

For each new listing on a short-term rental platform, predict the nightly price the host should charge for the first month after listing, using features known at posting time (location, bedrooms, bathrooms, amenities, host status, seasonality).

This is a clean numeric prediction problem. The unit is the listing. The target is dollars per night. The features are static or as-of-listing. The action is a suggested price surfaced to the host, who keeps editorial control.

The Standard Error Metrics

Three error summaries are standard. Each answers a slightly different question.

Table 1. The standard summaries for numeric prediction error and what each is best suited to communicate. The right choice depends on whether the manager cares about average error, occasional big misses, or the proportion of variance the model explains.

Metric	Definition	Use when	Watch out for
MAE	Mean absolute error: average of |actual − predicted|	The manager wants the typical miss in units they already read (dollars per night).	Under-weights big misses; the average is robust to outliers in ways the business may not be.
RMSE	Root mean squared error: square root of the mean squared error	Big misses cost disproportionately (a $200 error is more than twice as bad as a $100 one).	Heavily influenced by outliers; one luxury listing can move the headline.
R²	1 − (sum of squared residuals / sum of squared deviations from the mean)	Communicating "how much of the variance does the model explain?"	Improves mechanically as features are added when computed on the training data; not in dollar units; can hide systematic bias.

The right headline depends on the business. For most operational tasks (pricing, demand), MAE in business units is the cleanest first metric. RMSE is the right second metric when occasional large errors carry disproportionate cost. R² is a useful sanity check, not a deployment criterion.
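The three summaries in Table 1 can be computed in a few lines. The following is a minimal sketch in plain Python; the function name `error_summary` and the five held-out prices are made up for illustration and are not data from the listing example.

```python
import math

def error_summary(actual, predicted):
    """Return MAE, RMSE and R^2 for paired lists of actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]  # actual minus predicted
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R^2 is undefined when every actual value is identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Hypothetical held-out nightly prices (dollars) and model predictions.
actual = [100.0, 150.0, 200.0, 250.0, 300.0]
predicted = [110.0, 140.0, 210.0, 240.0, 320.0]
summary = error_summary(actual, predicted)
print(summary)
```

On these made-up numbers the MAE is $12 and the RMSE is about $12.65. The gap between the two comes from the single $20 miss, which RMSE weights more heavily than MAE does.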

A subtle point: every metric collapses the error distribution to a single number. Two models with the same RMSE can have very different shapes of error — and very different behaviour in the deployment that consumes them.
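A small sketch makes the point concrete. The two error lists below are invented for illustration: one model is always off by $10, the other is usually exact but misses once by $20. Both have the same RMSE.

```python
import math

def rmse(errors):
    """Root mean squared error of a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

# Two hypothetical sets of held-out errors, in dollars per night.
steady_errors = [10.0, -10.0, 10.0, -10.0]  # always off by $10
spiky_errors = [0.0, 0.0, 0.0, 20.0]        # usually exact, one $20 miss

for name, errors in [("steady", steady_errors), ("spiky", spiky_errors)]:
    worst = max(abs(e) for e in errors)
    print(name, "RMSE:", rmse(errors), "MAE:", mae(errors), "worst miss:", worst)
```

Both lists have an RMSE of $10, yet the MAE is $10 for the first and $5 for the second, and the worst miss doubles from $10 to $20. Which shape is preferable depends on the decision that consumes the prediction.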

Reading the Error Through Pictures

The single chart most worth drawing is actual vs. predicted on held-out data. It shows the gap between what the model predicted and what actually happened, in the same units the business uses.

[Figure 1: scatter plot titled "Actual vs predicted: where does the model fall apart?" Legend: close (|error| ≤ $30), moderate ($30–$60), high (> $60).]

Figure 1. Actual vs. predicted nightly price on held-out listings. Points close to the dashed 45° line are well-predicted; the spread of points away from the line is the model's error. Colors flag whether the prediction fell within $30 of the actual price (green), between $30 and $60 away (amber), or more than $60 away (red). The luxury tail is where the model breaks.
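The color coding in Figure 1 amounts to a simple rule on the absolute error. The sketch below applies the figure's $30 and $60 thresholds; the function name `error_band` and the five (actual, predicted) pairs are hypothetical, and a miss of exactly $60 is assumed to count as moderate.

```python
from collections import Counter

def error_band(actual, predicted, close=30.0, moderate=60.0):
    """Label one prediction using the Figure 1 thresholds (dollars per night)."""
    miss = abs(actual - predicted)
    if miss <= close:
        return "close"
    if miss <= moderate:  # assumption: exactly $60 still counts as moderate
        return "moderate"
    return "high"

# Hypothetical (actual, predicted) pairs; the last two mimic a luxury tail.
pairs = [(95, 110), (140, 150), (210, 175), (480, 400), (900, 650)]
band_counts = Counter(error_band(a, p) for a, p in pairs)
print(band_counts)
```

Counting points per band gives a quick numeric companion to the picture, though it does not replace looking at where on the price axis the red points sit.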

Three patterns to look for, each telling a different story:

Even scatter around the 45° line. The model is unbiased; errors are symmetric. The headline metric reflects real performance.
Systematic offset. Predictions are consistently too high or too low in a region of the input space. The model has a structural bias that no amount of retraining the same architecture will fix — the feature set or the model form is missing something.
Heteroskedasticity. Errors are small in one region (say, mid-priced listings) and large in another (luxury or budget tails). The headline metric is hiding heterogeneity. The right next move is to disaggregate.

In Figure 1, the model has roughly the right average behaviour but breaks down at the luxury end. Reasonable responses are to add features that distinguish luxury listings, to evaluate separately within the high-price tail, or, if the business allows, to exclude that tail from automation altogether and route it to a human.
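Evaluating separately within a price tail can be sketched as a grouped error report. In the example below, the band boundaries ($100 and $300), the function name `errors_by_band`, and the seven listings are all assumptions chosen for illustration; a real platform would pick bands that match its own market.

```python
import math

# Hypothetical price bands in dollars per night:
# (label, lower bound inclusive, upper bound exclusive).
BANDS = [("budget", 0.0, 100.0), ("mid", 100.0, 300.0), ("luxury", 300.0, math.inf)]

def errors_by_band(actual, predicted, bands=BANDS):
    """Return {label: (count, MAE, mean signed error)} grouped by actual price."""
    grouped = {label: [] for label, _, _ in bands}
    for a, p in zip(actual, predicted):
        for label, low, high in bands:
            if low <= a < high:
                grouped[label].append(a - p)  # residual = actual minus predicted
                break
    report = {}
    for label, residuals in grouped.items():
        if not residuals:
            continue  # skip empty bands rather than divide by zero
        n = len(residuals)
        mae = sum(abs(r) for r in residuals) / n
        bias = sum(residuals) / n  # positive means the model under-predicts
        report[label] = (n, mae, bias)
    return report

# Hypothetical held-out listings.
actual = [80.0, 90.0, 150.0, 220.0, 260.0, 500.0, 900.0]
predicted = [85.0, 80.0, 160.0, 210.0, 270.0, 420.0, 650.0]
report = errors_by_band(actual, predicted)
for label, (n, mae, bias) in report.items():
    print(f"{label}: n={n}, MAE={mae:.1f}, mean signed error={bias:.1f}")
```

On these invented numbers the budget and mid bands have an MAE of $10 or less, while the luxury band has an MAE of $165 and an equally large positive mean signed error, meaning the model under-predicts every luxury listing. A pooled MAE would blur exactly that distinction.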

Residuals and Diagnostics

Plotting the residuals (actual minus predicted) against features or against fitted values is the modelling equivalent of asking "where is the model wrong, and why?" A few canonical patterns:

Funnel-shaped residuals. Errors grow with the level of the target. Often a log-transform of the target helps.
Slope or curve in the residual plot. Predictions are systematically too low in one range and too high in another. The model is missing structure: a non-linear term, an interaction, or a richer model form.
Clusters of large residuals. A subgroup the model handles badly. Often points to a missing feature (host tier, region, amenity flag).
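Two of these patterns can be given rough numeric checks alongside the plots. The sketch below is one possible approach, not a standard recipe: it fits a least-squares slope of the residuals against the predictions (a trend check) and computes the correlation between absolute residuals and predictions (a funnel check). The function names and the six data points are hypothetical.

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = _mean(xs), _mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def correlation(xs, ys):
    """Pearson correlation of xs and ys."""
    mx, my = _mean(xs), _mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residual_checks(actual, predicted):
    """Summarize trend and funnel shape in the residuals (actual minus predicted)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return {
        # Near zero if no linear trend is left in the residuals.
        "trend_slope": slope(predicted, residuals),
        # Clearly positive if the size of the misses grows with the predicted level.
        "funnel_corr": correlation(predicted, [abs(r) for r in residuals]),
    }

# Hypothetical held-out data in which misses grow with price.
predicted = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
actual = [105.0, 190.0, 320.0, 370.0, 550.0, 540.0]
checks = residual_checks(actual, predicted)
print(checks)
```

Here the trend slope is close to zero while the funnel correlation is above 0.9, which matches the funnel-shaped pattern. What counts as "close to zero" or "clearly positive" is a judgment call, and these numbers will not reveal clusters of large residuals; the plots remain the primary tool.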

Residual diagnostics are where engineering insight comes from. A model that ships with no residual analysis is a model whose authors have not asked it where it fails.

When Is the Error Acceptable?

The metric does not answer this on its own. The right test is:

Plug the model's distribution of errors into the decision flow. What does the business lose, on average and in the tail?

For the listing-pricing example: a $20 average error in suggested starting price may be fine if the host can edit, the platform's listing fee is a small percentage, and most listings end up close to the suggestion anyway. The same $20 average error in a fully automated dynamic-pricing engine that controls bookings could be catastrophic, because every miss flows straight into the price guests see. Same model, same error metrics, different deployment, different verdict.
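Plugging the error distribution into the decision flow can be sketched as applying a cost function to each held-out error and summarizing the average and the tail. Everything specific below is an assumption for illustration: the two cost functions, the 10% pass-through for editable suggestions, the doubling of cost above $60, the 10% tail share, and the ten errors themselves.

```python
import math

def loss_profile(errors, cost_fn, tail_share=0.1):
    """Average loss, and average loss within the worst tail_share of cases."""
    losses = sorted((cost_fn(e) for e in errors), reverse=True)
    tail_n = max(1, math.ceil(tail_share * len(losses)))
    return {
        "mean_loss": sum(losses) / len(losses),
        "tail_loss": sum(losses[:tail_n]) / tail_n,
    }

# Hypothetical cost functions; error = actual minus predicted, dollars per night.
def suggestion_cost(error):
    # Assumption: the host can edit the suggestion, so only 10% of a miss
    # turns into a loss.
    return 0.10 * abs(error)

def automated_cost(error):
    # Assumption: no human in the loop, so the full miss is lost, and misses
    # over $60 cost double because they also put the booking at risk.
    return abs(error) * (2.0 if abs(error) > 60 else 1.0)

# Hypothetical held-out errors: nine modest misses and one luxury-tail miss.
held_out_errors = [5.0, -10.0, 20.0, -15.0, 30.0, -25.0, 10.0, -20.0, 15.0, 250.0]
suggestion = loss_profile(held_out_errors, suggestion_cost)
automated = loss_profile(held_out_errors, automated_cost)
print("suggestion:", suggestion)
print("automated:", automated)
```

With identical errors, the assumed suggestion flow loses about $4 per listing on average and $25 in the worst case, while the assumed automated flow loses $65 on average and $500 in the worst case. The model did not change; only the decision consuming it did.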

This is the place where Part IV reconnects with Part III. The decision that the prediction supports — together with the prediction's distribution of errors — determines whether the model is good enough. Neither half of that statement is optional.