This document develops and compares the regression models for the house-price project. All of the main models predict the sale price in dollars so they can be compared directly. We also tested a version that predicts price as a percentage change rather than a flat dollar amount, called the log model, which we treat as an alternative we considered rather than a submitted model, for the reasons given in the final modeling section. Everything below is reproducible: the random seed is fixed at 42, so re-running reproduces the same numbers to the dollar.
To reproduce the analysis, keep House_Prices.csv and
Predict_Houses.csv in the same folder as this file, then
knit it or run the chunks in order.
d <- read_csv("House_Prices.csv", show_col_types = FALSE)
d |>
summarise(Min = min(SalePrice), Median = median(SalePrice),
Mean = round(mean(SalePrice)), Max = max(SalePrice)) |>
pivot_longer(everything(), names_to = "Statistic", values_to = "SalePrice") |>
kable(format.args = list(big.mark = ","),
caption = "SalePrice Summary (990 homes, no missing values)")
| Statistic | SalePrice |
|---|---|
| Min | 34,900 |
| Median | 163,250 |
| Mean | 182,151 |
| Max | 755,000 |
cor_df <- tibble(Predictor = names(d),
r = as.numeric(cor(d)[, "SalePrice"])) |>
filter(Predictor != "SalePrice") |>
arrange(r) |>
mutate(Predictor = factor(Predictor, levels = Predictor))
ggplot(cor_df, aes(r, Predictor, fill = r > 0)) +
geom_col(width = .7) +
geom_text(aes(label = sprintf("%.2f", r),
hjust = ifelse(r > 0, -0.15, 1.15)), size = 3.4) +
scale_fill_manual(values = c(`TRUE` = PAL[["pos"]], `FALSE` = PAL[["neg"]]),
guide = "none") +
scale_x_continuous(limits = c(-0.1, 0.95), expand = expansion(c(.05, .1))) +
labs(title = "Correlation of Each Predictor With SalePrice",
x = "Correlation With Price (1.0 = perfect)", y = NULL)
The average price near $182k sits well above the $163k median, because a handful of expensive houses pull the average up; the distribution is right-skewed. OverallQual tracks price most closely, with a correlation of 0.80, where 1.0 would be a perfect match and 0 would mean no relationship. GarageArea, TotRmsAbvGrd, FullBath, YearBuilt, and YearRemodAdd follow in the 0.52 to 0.65 range. BedroomAbvGr is weak at 0.18, and YrSold is the only negative, essentially flat at -0.03.
These are the three models planned for submission, plus a fourth, ReducedMod, included here for comparison. All four predict SalePrice in dollars.
f_small <- SalePrice ~ OverallQual + GarageArea + TotRmsAbvGrd + YearBuilt
f_full <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
BsmtFinSF1 + FullBath + HalfBath + BedroomAbvGr +
TotRmsAbvGrd + Fireplaces + GarageArea + YrSold
f_reduced <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
BsmtFinSF1 + FullBath + HalfBath +
TotRmsAbvGrd + Fireplaces + GarageArea
f_reducedmod <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
BsmtFinSF1 + TotRmsAbvGrd + Fireplaces + GarageArea
fit_small <- lm(f_small, d)
fit_full <- lm(f_full, d)
fit_reduced <- lm(f_reduced, d)
fit_reducedmod <- lm(f_reducedmod, d)
# 10-fold CV on the dollar scale. Same folds for every model so it is fair.
k <- 10
folds <- sample(rep(1:k, length.out = nrow(d)))
cv_metrics <- function(formula, is_log = FALSE) {
out <- map_dfr(1:k, function(i) {
tr <- d[folds != i, ]; te <- d[folds == i, ]
fit <- lm(formula, data = tr)
pred <- predict(fit, te)
if (is_log) pred <- exp(pred)
err <- te$SalePrice - pred
tibble(rmse = sqrt(mean(err^2)),
mae = mean(abs(err)),
mape = mean(abs(err) / te$SalePrice) * 100)
})
c(CV_RMSE = mean(out$rmse), CV_MAE = mean(out$mae), CV_MAPE = mean(out$mape))
}
comp <- tibble(
Model = c("Small", "ReducedMod", "Reduced", "Full"),
Predictors = c(4L, 8L, 10L, 12L),
fit = list(fit_small, fit_reducedmod, fit_reduced, fit_full),
formula = list(f_small, f_reducedmod, f_reduced, f_full)
) |>
mutate(cv = map(formula, cv_metrics),
CV_RMSE = round(map_dbl(cv, "CV_RMSE")),
CV_MAE = round(map_dbl(cv, "CV_MAE")),
CV_MAPE = round(map_dbl(cv, "CV_MAPE"), 1),
Adj_R2 = round(map_dbl(fit, ~ summary(.x)$adj.r.squared), 3),
AIC = round(map_dbl(fit, AIC))) |>
select(Model, Predictors, CV_RMSE, CV_MAE, CV_MAPE, Adj_R2, AIC)
kable(comp, format.args = list(big.mark = ","),
caption = "Model comparison. RMSE and MAE measure the average prediction miss in dollars (RMSE weights the big misses more heavily), and MAPE expresses that miss as a percent of price. Lower is better. From 10-fold cross-validation.")
| Model | Predictors | CV_RMSE | CV_MAE | CV_MAPE | Adj_R2 | AIC |
|---|---|---|---|---|---|---|
| Small | 4 | 40,756 | 28,845 | 17.5 | 0.733 | 23,868 |
| ReducedMod | 8 | 35,786 | 24,680 | 15.1 | 0.798 | 23,596 |
| Reduced | 10 | 35,778 | 24,611 | 15.1 | 0.798 | 23,598 |
| Full | 12 | 35,432 | 24,489 | 14.9 | 0.803 | 23,576 |
comp |>
select(Model, RMSE = CV_RMSE, MAE = CV_MAE) |>
pivot_longer(c(RMSE, MAE), names_to = "Metric", values_to = "Dollars") |>
mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full"))) |>
ggplot(aes(Model, Dollars, fill = Metric)) +
geom_col(position = position_dodge(.75), width = .65) +
geom_text(aes(label = scales::dollar(Dollars)),
position = position_dodge(.75), vjust = -0.4, size = 3.2) +
scale_fill_manual(values = c(RMSE = PAL[["hi"]], MAE = PAL[["lo"]])) +
scale_y_continuous(labels = scales::dollar, expand = expansion(c(0, .12))) +
labs(title = "Cross-validated Error by Model (lower is better)",
x = NULL, y = NULL, fill = NULL)
The Small model is the weakest by a clear margin, so the six middle-strength predictors carry real weight even with the overlap among them. Full and Reduced are effectively tied. These figures come from cross-validation, which tests each model on data it did not see while it was being built, splitting the data into ten parts and rotating which part is held back. Adjusted R-squared is the share of price variation a model explains, with a small penalty for adding predictors. Full edges Reduced on both measures, but by only a few hundred dollars of error, and it does so while carrying two predictors that contribute little. Reduced drops YrSold, which is uninformative, and BedroomAbvGr, which overlaps with total rooms and shows a backwards sign, at almost no accuracy cost. On that basis we treat Reduced as the best model for its balance of accuracy and simplicity, with Full noted as marginally more accurate.
ReducedMod takes Reduced and also drops FullBath and HalfBath, the two terms that were not significant. The result is that all eight remaining predictors are significant, and the accuracy cost is minimal: its cross-validated error sits within a few dollars of Reduced and its Adjusted R-squared matches to three decimals. The tradeoff is that a model with no bathroom term at all can look odd to a non-technical audience, since buyers clearly value bathrooms. The bathroom effect does not disappear; it is absorbed by the predictors it travels with, mainly room count and overall quality.
The cross-validated error above measures overall accuracy. The chart below shows how each of the four models predicts the five held-out homes against their actual sale prices.
# Load the five hold-out homes once here and build the prediction table the
# prediction section reuses later.
ph <- read_csv("Predict_Houses.csv", show_col_types = FALSE)
out <- tibble(
House = seq_len(nrow(ph)),
Actual = ph$SalePrice,
Small = round(predict(fit_small, ph)),
ReducedMod = round(predict(fit_reducedmod, ph)),
Reduced = round(predict(fit_reduced, ph)),
Full = round(predict(fit_full, ph))
)
out |>
pivot_longer(c(Small, ReducedMod, Reduced, Full),
names_to = "Model", values_to = "Pred") |>
mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full")),
House = factor(House)) |>
ggplot(aes(House)) +
geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
width = .75, alpha = .9) +
geom_point(aes(y = Actual), size = 3, color = "black") +
geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
scale_y_continuous(labels = scales::dollar) +
scale_fill_manual(values = c(Small = "#b8b8d1", ReducedMod = "#4c9a8f",
Reduced = "#7a82c4", Full = "#4c72b0")) +
labs(title = "Models vs Actual",
subtitle = "Predicted sale price for each held-out home",
x = "House", y = NULL, fill = NULL)
coef_tbl <- function(fit) {
summary(fit)$coefficients |>
as_tibble(rownames = "Predictor") |>
transmute(Predictor,
Estimate = round(Estimate, 2),
Std_Error = round(`Std. Error`, 2),
t_value = round(`t value`, 2),
p_value = signif(`Pr(>|t|)`, 3),
Sig = cut(`Pr(>|t|)`, c(-Inf, .001, .01, .05, .1, Inf),
labels = c("***", "**", "*", ".", "")))
}
kable(coef_tbl(fit_full), caption = "Full Model Coefficients (dollar scale)")
| Predictor | Estimate | Std_Error | t_value | p_value | Sig |
|---|---|---|---|---|---|
| (Intercept) | -870591.44 | 1724927.10 | -0.50 | 6.14e-01 | |
| LotArea | 0.73 | 0.11 | 6.95 | 0.00e+00 | *** |
| OverallQual | 23142.84 | 1334.70 | 17.34 | 0.00e+00 | *** |
| YearBuilt | 129.53 | 56.99 | 2.27 | 2.32e-02 | * |
| YearRemodAdd | 374.08 | 73.71 | 5.08 | 5.00e-07 | *** |
| BsmtFinSF1 | 31.42 | 2.83 | 11.11 | 0.00e+00 | *** |
| FullBath | 5554.58 | 3068.87 | 1.81 | 7.06e-02 | . |
| HalfBath | 4068.65 | 2589.61 | 1.57 | 1.16e-01 | |
| BedroomAbvGr | -10303.06 | 2028.72 | -5.08 | 5.00e-07 | *** |
| TotRmsAbvGrd | 14891.32 | 1244.91 | 11.96 | 0.00e+00 | *** |
| Fireplaces | 10206.19 | 2053.47 | 4.97 | 8.00e-07 | *** |
| GarageArea | 58.19 | 7.15 | 8.14 | 0.00e+00 | *** |
| YrSold | -109.65 | 860.33 | -0.13 | 8.99e-01 |
# Put every predictor on the same scale (standardized) so features measured in different units can be compared by bar length.
preds <- all.vars(f_full)[-1]
sdx <- map_dbl(preds, ~ sd(d[[.x]]))
coef_df <- summary(fit_full)$coefficients[preds, ] |>
as_tibble(rownames = "Predictor") |>
mutate(beta = Estimate * sdx / sd(d$SalePrice),
sig = `Pr(>|t|)` < 0.05) |>
arrange(beta) |>
mutate(Predictor = factor(Predictor, levels = Predictor))
ggplot(coef_df, aes(beta, Predictor, fill = sig)) +
geom_col(width = .7) +
geom_vline(xintercept = 0, color = "gray40") +
scale_fill_manual(values = c(`TRUE` = PAL[["pos"]], `FALSE` = PAL[["lo"]]),
labels = c(`TRUE` = "p < 0.05", `FALSE` = "not sig."),
name = NULL) +
labs(title = "Which Features Matter Most (Full Model)",
subtitle = "Longer bar = bigger effect, comparable across predictors",
x = "Standardized Effect", y = NULL)
Each model gives every feature a coefficient, which is simply the dollar amount it adds to the predicted price. Here is what the Full model coefficients say.
The interesting one is BedroomAbvGr, the single negative bar in the chart. Its coefficient is around -$10,000 per bedroom. That reads backwards until you remember the model already accounts for total room count, so adding bedrooms really means slicing the same space into smaller rooms, and that reads as slightly lower value. It is a quirk of the overlap between bedrooms and total rooms, not a real preference, and it makes a good example for the report of why we trimmed the model.
What is not significant is also useful. FullBath, HalfBath, and YrSold all stop carrying real weight once the other predictors are in the model. Their effect is no longer statistically meaningful, meaning we cannot tell it apart from random noise. YrSold is the clearest case, with a p-value near 0.90, where anything above about 0.05 signals an effect we should not trust. Dropping it costs us nothing.
For comparison, ReducedMod is the same model with the two bathroom terms removed. Every predictor now clears the significance bar, and the coefficients move only a little, since room count and overall quality quietly absorb what the bathrooms were contributing.
kable(coef_tbl(fit_reducedmod),
caption = "ReducedMod Coefficients (FullBath and HalfBath removed)")
| Predictor | Estimate | Std_Error | t_value | p_value | Sig |
|---|---|---|---|---|---|
| (Intercept) | -1261994.49 | 142325.88 | -8.87 | 0.00000 | *** |
| LotArea | 0.71 | 0.11 | 6.73 | 0.00000 | *** |
| OverallQual | 24383.65 | 1320.77 | 18.46 | 0.00000 | *** |
| YearBuilt | 157.38 | 53.31 | 2.95 | 0.00323 | ** |
| YearRemodAdd | 428.00 | 73.50 | 5.82 | 0.00000 | *** |
| BsmtFinSF1 | 31.92 | 2.80 | 11.39 | 0.00000 | *** |
| TotRmsAbvGrd | 11874.18 | 836.93 | 14.19 | 0.00000 | *** |
| Fireplaces | 11636.82 | 2046.99 | 5.68 | 0.00000 | *** |
| GarageArea | 60.86 | 7.20 | 8.45 | 0.00000 | *** |
# vif() measures how much each predictor overlaps with the others. A value of 1 means no overlap; higher means more.
vif(fit_full) |>
enframe(name = "Predictor", value = "VIF") |>
arrange(desc(VIF)) |>
mutate(VIF = round(VIF, 2)) |>
kable(caption = "Variance inflation factors, full model")
| Predictor | VIF |
|---|---|
| TotRmsAbvGrd | 3.15 |
| OverallQual | 2.63 |
| YearBuilt | 2.26 |
| FullBath | 2.24 |
| BedroomAbvGr | 2.14 |
| YearRemodAdd | 1.76 |
| GarageArea | 1.74 |
| Fireplaces | 1.39 |
| HalfBath | 1.30 |
| BsmtFinSF1 | 1.22 |
| LotArea | 1.14 |
| YrSold | 1.01 |
Before trusting the coefficients, we check whether predictors overlap too much, since heavy overlap can make individual coefficients unstable. The VIF numbers in the table measure that overlap. A value of 1 means a predictor shares no information with the others, and the common warning line is 5. Ours are all mild. TotRmsAbvGrd is highest at about 3.2, OverallQual next near 2.6, and everything else is under 2.5. Nothing crosses the warning line, so the overlap is real but not severe enough to distort the model. It affects how we read an individual coefficient, such as the negative bedroom sign, more than the overall fit.
# ph (the five hold-out homes) and out (predictions for all four models) are
# built in the model section above; here we report them and the percent error.
kable(out, format.args = list(big.mark = ","),
caption = "Predicted Sale Prices by Model, Against the Actual Price")
| House | Actual | Small | ReducedMod | Reduced | Full |
|---|---|---|---|---|---|
| 1 | 348,000 | 281,910 | 291,226 | 293,432 | 290,687 |
| 2 | 168,000 | 226,114 | 231,187 | 232,046 | 223,113 |
| 3 | 187,000 | 180,261 | 192,000 | 195,278 | 196,447 |
| 4 | 173,900 | 189,050 | 170,290 | 172,819 | 171,337 |
| 5 | 337,500 | 335,683 | 345,418 | 343,546 | 337,498 |
pe <- out |>
mutate(across(c(Small, ReducedMod, Reduced, Full),
~ round(100 * abs(.x - Actual) / Actual, 1))) |>
select(House, Small, ReducedMod, Reduced, Full)
kable(pe, caption = "Absolute Percent Error vs Actual")
| House | Small | ReducedMod | Reduced | Full |
|---|---|---|---|---|
| 1 | 19.0 | 16.3 | 15.7 | 16.5 |
| 2 | 34.6 | 37.6 | 38.1 | 32.8 |
| 3 | 3.6 | 2.7 | 4.4 | 5.1 |
| 4 | 8.7 | 2.1 | 0.6 | 1.5 |
| 5 | 0.5 | 2.3 | 1.8 | 0.0 |
mape <- pe |> summarise(across(c(Small, ReducedMod, Reduced, Full), mean))
out |>
pivot_longer(c(Small, ReducedMod, Reduced, Full), names_to = "Model", values_to = "Pred") |>
mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full")),
House = factor(House)) |>
ggplot(aes(House)) +
geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
width = .78, alpha = .9) +
geom_point(aes(y = Actual), size = 3, color = "black") +
geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
scale_y_continuous(labels = scales::dollar) +
scale_fill_manual(values = c(Small = "#b8b8d1", ReducedMod = "#4c9a8f",
Reduced = "#7a82c4", Full = "#4c72b0")) +
labs(title = "Predicted vs Actual Price (black dots = actual)",
x = "House", y = NULL, fill = NULL)
Mean absolute percent error across the five houses came in at Small 13.3%, ReducedMod 12.2%, Reduced 12.1%, and Full 11.2%. These five homes were held out of the model building, so they are a fair test, and they tell the same story as the cross-validation. ReducedMod, Reduced, and Full are all close, clearly ahead of Small.
Two houses are worth noting. House 2 is the hardest case for every model: an 1882 build on a large lot that all of them overprice, shown by the gap above the black dot in the chart. A linear model struggles with very old homes that sit far outside the typical range. House 5 is the high-quality home (OverallQual 10), and the dollar models handle it well, landing within a percent or two.
We weighed one main alternative to the dollar models. Because the prices are lopsided, with a long tail of expensive homes, a common fix is to model price on a percentage basis instead of in flat dollars. In practice that means predicting the logarithm of the price and then converting the prediction back into dollars. We tested it on the same Reduced predictor set and scored it the same way.
f_log <- log(SalePrice) ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
BsmtFinSF1 + FullBath + HalfBath +
TotRmsAbvGrd + Fireplaces + GarageArea
fit_log <- lm(f_log, d)
log_cv <- cv_metrics(f_log, is_log = TRUE) # reuses the same folds as above
red_cv <- cv_metrics(f_reduced)
tibble(
Model = c("Reduced (dollars)", "Log (back-transformed)"),
CV_RMSE = round(c(red_cv["CV_RMSE"], log_cv["CV_RMSE"])),
CV_MAE = round(c(red_cv["CV_MAE"], log_cv["CV_MAE"]))
) |>
kable(format.args = list(big.mark = ","),
caption = "Reduced dollar model vs the log alternative, same folds")
| Model | CV_RMSE | CV_MAE |
|---|---|---|
| Reduced (dollars) | 35,778 | 24,611 |
| Log (back-transformed) | 31,604 | 20,633 |
out |>
transmute(House, Actual, Reduced,
Log_bt = round(exp(predict(fit_log, ph)))) |>
kable(format.args = list(big.mark = ","),
caption = "Predictions: Dollar Model vs Log Alternative")
| House | Actual | Reduced | Log_bt |
|---|---|---|---|
| 1 | 348,000 | 293,432 | 298,485 |
| 2 | 168,000 | 232,046 | 198,033 |
| 3 | 187,000 | 195,278 | 183,578 |
| 4 | 173,900 | 172,819 | 170,956 |
| 5 | 337,500 | 343,546 | 371,912 |
out |>
mutate(Log = round(exp(predict(fit_log, ph)))) |>
pivot_longer(c(Small, Reduced, Full, Log),
names_to = "Model", values_to = "Pred") |>
mutate(Model = factor(Model, levels = c("Small", "Reduced", "Full", "Log")),
House = factor(House)) |>
ggplot(aes(House)) +
geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
width = .75, alpha = .9) +
geom_point(aes(y = Actual), size = 3, color = "black") +
geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
scale_y_continuous(labels = scales::dollar) +
scale_fill_manual(values = c(Small = "#b8b8d1", Reduced = "#7a82c4",
Full = "#4c72b0", Log = "#dd8452")) +
labs(title = "Dollar Models and the Log Alternative vs Actual (black dots = actual)",
subtitle = "Log is the back-transformed alternative, shown in orange",
x = "House", y = NULL, fill = NULL)
Side by side, the models are close on Houses 3 and 4, and the differences show up at the extremes. On House 2, the 1882 home, the log bar (orange) lands nearest the actual while the three dollar models overshoot together. On House 5, the OverallQual 10 home, the pattern reverses: the dollar models sit right on the dot and the log model runs high. This is the same middle-versus-tails behavior the residual plot below explains, seen here on the five houses we predict.
k_lab <- scales::label_dollar(scale = 1e-3, suffix = "k")
bind_rows(
tibble(fitted = fitted(fit_reduced), resid = resid(fit_reduced),
Model = "Dollar Model (residuals fan out)"),
tibble(fitted = exp(fitted(fit_log)), resid = d$SalePrice - exp(fitted(fit_log)),
Model = "Log Model (residuals stay even)")
) |>
ggplot(aes(fitted, resid)) +
geom_point(alpha = .22, color = PAL[["pos"]]) +
geom_hline(yintercept = 0, color = PAL[["neg"]]) +
facet_wrap(~ Model, scales = "free") +
scale_x_continuous(labels = k_lab, n.breaks = 4) +
scale_y_continuous(labels = k_lab, n.breaks = 5) +
labs(title = "Why we tested logging: error spread vs price",
x = "Predicted price", y = "Residual")
The log model is more accurate. It shaves a few thousand dollars off both cross-validated RMSE and MAE, and on the five houses its mean error is about 9% against roughly 11% for the dollar models. Statistically it is the better fit, mostly because logging evens out the error spread that otherwise grows with price. The right panel above is the tell. The dollar model’s residuals fan out as price rises, while the log model’s sit in an even band.
We did not adopt it as our headline model anyway, for three reasons. The project is built around explaining dollar effects, and the log model does not give those directly. Its effects come out as percentages, so the OverallQual term means about an 11.6% lift per quality point rather than a flat dollar figure, which is harder to explain to a non-technical audience and easier to misread. And converting the log model’s predictions back into dollars introduces a small technical bias that goes beyond what the course covers. We therefore report it honestly as the stronger-fitting alternative we tested, explain why we kept the dollar model for clarity and scope, and proceed with the dollar-scale Reduced model.
We recommend the Reduced model as the best of the four. It is about as accurate as Full but simpler and easier to explain, and it drops two predictors that contribute little: YrSold, which is uninformative, and BedroomAbvGr, which overlaps with total rooms. Full is noted as marginally more accurate, and Small serves as the simple baseline. ReducedMod is reported as an additional comparison rather than the recommended model.
The log transform is documented as the alternative we tested and chose not to adopt. It fits the data slightly better, but its effects are expressed as percentages rather than dollars and converting its predictions back to dollars introduces a small bias, so the dollar-scale Reduced model is preferred for clarity.
Two points are noted as limitations rather than open issues. Predictions are weakest on very old homes well outside the typical range, as House 2, an 1882 build, shows. The large LotArea values were left in the data: the VIF values indicate collinearity is not a problem and the fit is solid without capping them, so they were retained and their influence noted.