This document develops and compares the regression models for the house-price project. All of the main models predict the sale price in dollars so they can be compared directly. We also tested a version that predicts price as a percentage change rather than a flat dollar amount, called the log model, which we treat as an alternative we considered rather than a submitted model, for the reasons given in the final modeling section. Everything below is reproducible: the random seed is fixed at 42, so re-running reproduces the same numbers to the dollar.

To reproduce the analysis, keep House_Prices.csv and Predict_Houses.csv in the same folder as this file, then knit it or run the chunks in order.

1 What the Data Confirmed

d <- read_csv("House_Prices.csv", show_col_types = FALSE)

d |>
  summarise(Min = min(SalePrice), Median = median(SalePrice),
            Mean = round(mean(SalePrice)), Max = max(SalePrice)) |>
  pivot_longer(everything(), names_to = "Statistic", values_to = "SalePrice") |>
  kable(format.args = list(big.mark = ","),
        caption = "SalePrice Summary (990 homes, no missing values)")

SalePrice Summary (990 homes, no missing values)
Statistic	SalePrice
Min	34,900
Median	163,250
Mean	182,151
Max	755,000

cor_df <- tibble(Predictor = names(d),
                 r = as.numeric(cor(d)[, "SalePrice"])) |>
  filter(Predictor != "SalePrice") |>
  arrange(r) |>
  mutate(Predictor = factor(Predictor, levels = Predictor))

ggplot(cor_df, aes(r, Predictor, fill = r > 0)) +
  geom_col(width = .7) +
  geom_text(aes(label = sprintf("%.2f", r),
                hjust = ifelse(r > 0, -0.15, 1.15)), size = 3.4) +
  scale_fill_manual(values = c(`TRUE` = PAL[["pos"]], `FALSE` = PAL[["neg"]]),
                    guide = "none") +
  scale_x_continuous(limits = c(-0.1, 0.95), expand = expansion(c(.05, .1))) +
  labs(title = "Correlation of Each Predictor With SalePrice",
       x = "Correlation With Price (1.0 = perfect)", y = NULL)

The average price near $182k sits well above the $163k median, because a handful of expensive houses pull the average up; the distribution is right-skewed. OverallQual tracks price most closely, with a correlation of 0.80, where 1.0 would be a perfect match and 0 would mean no relationship. GarageArea, TotRmsAbvGrd, FullBath, YearBuilt, and YearRemodAdd follow in the 0.52 to 0.65 range. BedroomAbvGr is weak at 0.18, and YrSold is the only negative, essentially flat at -0.03.

2 The Four Models

These are the three models planned for submission, plus a fourth, ReducedMod, included here for comparison. All four predict SalePrice in dollars.

f_small   <- SalePrice ~ OverallQual + GarageArea + TotRmsAbvGrd + YearBuilt
f_full    <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
                         BsmtFinSF1 + FullBath + HalfBath + BedroomAbvGr +
                         TotRmsAbvGrd + Fireplaces + GarageArea + YrSold
f_reduced <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
                         BsmtFinSF1 + FullBath + HalfBath +
                         TotRmsAbvGrd + Fireplaces + GarageArea
f_reducedmod <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
                         BsmtFinSF1 + TotRmsAbvGrd + Fireplaces + GarageArea

Small keeps only the strong predictors, OverallQual, GarageArea, TotRmsAbvGrd, and YearBuilt. Four in total.
Full is all 12 predictors, our baseline.
Reduced is the Full model with YrSold and BedroomAbvGr removed. That clears out one predictor that tells us nothing about price (YrSold) and one of two predictors that carry much the same information (bedrooms and total rooms).
ReducedMod goes one step further than Reduced by also dropping FullBath and HalfBath, the two weakest terms in the Full model. Eight predictors in total. It is included as an additional comparison rather than a submitted model.

fit_small      <- lm(f_small,      d)
fit_full       <- lm(f_full,       d)
fit_reduced    <- lm(f_reduced,    d)
fit_reducedmod <- lm(f_reducedmod, d)

# 10-fold CV on the dollar scale. Same folds for every model so it is fair.
k <- 10
folds <- sample(rep(1:k, length.out = nrow(d)))
cv_metrics <- function(formula, is_log = FALSE) {
  out <- map_dfr(1:k, function(i) {
    tr <- d[folds != i, ]; te <- d[folds == i, ]
    fit  <- lm(formula, data = tr)
    pred <- predict(fit, te)
    if (is_log) pred <- exp(pred)
    err <- te$SalePrice - pred
    tibble(rmse = sqrt(mean(err^2)),
           mae  = mean(abs(err)),
           mape = mean(abs(err) / te$SalePrice) * 100)
  })
  c(CV_RMSE = mean(out$rmse), CV_MAE = mean(out$mae), CV_MAPE = mean(out$mape))
}

comp <- tibble(
  Model      = c("Small", "ReducedMod", "Reduced", "Full"),
  Predictors = c(4L, 8L, 10L, 12L),
  fit        = list(fit_small, fit_reducedmod, fit_reduced, fit_full),
  formula    = list(f_small, f_reducedmod, f_reduced, f_full)
) |>
  mutate(cv     = map(formula, cv_metrics),
         CV_RMSE = round(map_dbl(cv, "CV_RMSE")),
         CV_MAE  = round(map_dbl(cv, "CV_MAE")),
         CV_MAPE = round(map_dbl(cv, "CV_MAPE"), 1),
         Adj_R2  = round(map_dbl(fit, ~ summary(.x)$adj.r.squared), 3),
         AIC     = round(map_dbl(fit, AIC))) |>
  select(Model, Predictors, CV_RMSE, CV_MAE, CV_MAPE, Adj_R2, AIC)

kable(comp, format.args = list(big.mark = ","),
      caption = "Model comparison. RMSE and MAE measure the average prediction miss in dollars (RMSE weights the big misses more heavily), and MAPE expresses that miss as a percent of price. Lower is better. From 10-fold cross-validation.")

Model comparison. RMSE and MAE measure the average prediction miss in dollars (RMSE weights the big misses more heavily), and MAPE expresses that miss as a percent of price. Lower is better. From 10-fold cross-validation.
Model	Predictors	CV_RMSE	CV_MAE	CV_MAPE	Adj_R2	AIC
Small	4	40,756	28,845	17.5	0.733	23,868
ReducedMod	8	35,786	24,680	15.1	0.798	23,596
Reduced	10	35,778	24,611	15.1	0.798	23,598
Full	12	35,432	24,489	14.9	0.803	23,576

comp |>
  select(Model, RMSE = CV_RMSE, MAE = CV_MAE) |>
  pivot_longer(c(RMSE, MAE), names_to = "Metric", values_to = "Dollars") |>
  mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full"))) |>
  ggplot(aes(Model, Dollars, fill = Metric)) +
  geom_col(position = position_dodge(.75), width = .65) +
  geom_text(aes(label = scales::dollar(Dollars)),
            position = position_dodge(.75), vjust = -0.4, size = 3.2) +
  scale_fill_manual(values = c(RMSE = PAL[["hi"]], MAE = PAL[["lo"]])) +
  scale_y_continuous(labels = scales::dollar, expand = expansion(c(0, .12))) +
  labs(title = "Cross-validated Error by Model (lower is better)",
       x = NULL, y = NULL, fill = NULL)

The Small model is the weakest by a clear margin, so the six middle-strength predictors carry real weight even with the overlap among them. Full and Reduced are effectively tied. These figures come from cross-validation, which tests each model on data it did not see while it was being built, splitting the data into ten parts and rotating which part is held back. Adjusted R-squared is the share of price variation a model explains, with a small penalty for adding predictors. Full edges Reduced on both measures, but by only a few hundred dollars of error, and it does so while carrying two predictors that contribute little. Reduced drops YrSold, which is uninformative, and BedroomAbvGr, which overlaps with total rooms and shows a backwards sign, at almost no accuracy cost. On that basis we treat Reduced as the best model for its balance of accuracy and simplicity, with Full noted as marginally more accurate.

ReducedMod takes Reduced and also drops FullBath and HalfBath, the two terms that were not significant. The result is that all eight remaining predictors are significant, and the accuracy cost is minimal: its cross-validated error sits within a few dollars of Reduced and its Adjusted R-squared matches to three decimals. The tradeoff is that a model with no bathroom term at all can look odd to a non-technical audience, since buyers clearly value bathrooms. The bathroom effect does not disappear; it is absorbed by the predictors it travels with, mainly room count and overall quality.

The cross-validated error above measures overall accuracy. The chart below shows how each of the four models predicts the five held-out homes against their actual sale prices.

# Load the five hold-out homes once here and build the prediction table the
# prediction section reuses later.
ph <- read_csv("Predict_Houses.csv", show_col_types = FALSE)

out <- tibble(
  House      = seq_len(nrow(ph)),
  Actual     = ph$SalePrice,
  Small      = round(predict(fit_small,      ph)),
  ReducedMod = round(predict(fit_reducedmod, ph)),
  Reduced    = round(predict(fit_reduced,    ph)),
  Full       = round(predict(fit_full,       ph))
)

out |>
  pivot_longer(c(Small, ReducedMod, Reduced, Full),
               names_to = "Model", values_to = "Pred") |>
  mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full")),
         House = factor(House)) |>
  ggplot(aes(House)) +
  geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
           width = .75, alpha = .9) +
  geom_point(aes(y = Actual), size = 3, color = "black") +
  geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_manual(values = c(Small = "#b8b8d1", ReducedMod = "#4c9a8f",
                               Reduced = "#7a82c4", Full = "#4c72b0")) +
  labs(title = "Models vs Actual",
       subtitle = "Predicted sale price for each held-out home",
       x = "House", y = NULL, fill = NULL)

3 Coefficients Worth Discussing

coef_tbl <- function(fit) {
  summary(fit)$coefficients |>
    as_tibble(rownames = "Predictor") |>
    transmute(Predictor,
              Estimate  = round(Estimate, 2),
              Std_Error = round(`Std. Error`, 2),
              t_value   = round(`t value`, 2),
              p_value   = signif(`Pr(>|t|)`, 3),
              Sig = cut(`Pr(>|t|)`, c(-Inf, .001, .01, .05, .1, Inf),
                        labels = c("***", "**", "*", ".", "")))
}
kable(coef_tbl(fit_full), caption = "Full Model Coefficients (dollar scale)")

Full Model Coefficients (dollar scale)
Predictor	Estimate	Std_Error	t_value	p_value	Sig
(Intercept)	-870591.44	1724927.10	-0.50	6.14e-01
LotArea	0.73	0.11	6.95	0.00e+00	***
OverallQual	23142.84	1334.70	17.34	0.00e+00	***
YearBuilt	129.53	56.99	2.27	2.32e-02	*
YearRemodAdd	374.08	73.71	5.08	5.00e-07	***
BsmtFinSF1	31.42	2.83	11.11	0.00e+00	***
FullBath	5554.58	3068.87	1.81	7.06e-02	.
HalfBath	4068.65	2589.61	1.57	1.16e-01
BedroomAbvGr	-10303.06	2028.72	-5.08	5.00e-07	***
TotRmsAbvGrd	14891.32	1244.91	11.96	0.00e+00	***
Fireplaces	10206.19	2053.47	4.97	8.00e-07	***
GarageArea	58.19	7.15	8.14	0.00e+00	***
YrSold	-109.65	860.33	-0.13	8.99e-01

# Put every predictor on the same scale (standardized) so features measured in different units can be compared by bar length.
preds <- all.vars(f_full)[-1]
sdx <- map_dbl(preds, ~ sd(d[[.x]]))
coef_df <- summary(fit_full)$coefficients[preds, ] |>
  as_tibble(rownames = "Predictor") |>
  mutate(beta = Estimate * sdx / sd(d$SalePrice),
         sig  = `Pr(>|t|)` < 0.05) |>
  arrange(beta) |>
  mutate(Predictor = factor(Predictor, levels = Predictor))

ggplot(coef_df, aes(beta, Predictor, fill = sig)) +
  geom_col(width = .7) +
  geom_vline(xintercept = 0, color = "gray40") +
  scale_fill_manual(values = c(`TRUE` = PAL[["pos"]], `FALSE` = PAL[["lo"]]),
                    labels = c(`TRUE` = "p < 0.05", `FALSE` = "not sig."),
                    name = NULL) +
  labs(title = "Which Features Matter Most (Full Model)",
       subtitle = "Longer bar = bigger effect, comparable across predictors",
       x = "Standardized Effect", y = NULL)

Each model gives every feature a coefficient, which is simply the dollar amount it adds to the predicted price. Here is what the Full model coefficients say.

OverallQual is the workhorse. Each one-point step up the quality scale adds about $23,000, holding everything else fixed. That is the single cleanest investor message we have, and the chart confirms it dwarfs everything else.
GarageArea runs about $58 per square foot. BsmtFinSF1 about $31 per finished basement square foot. LotArea is tiny per unit (~$0.73) but the lots are huge, so it still adds up.
TotRmsAbvGrd adds roughly $15,000 per room and stays strong across all three models.
YearBuilt and YearRemodAdd are both positive and significant. Newer and more recently updated homes sell higher.
Fireplaces is worth about $10,000 each and holds up well.

The interesting one is BedroomAbvGr, the single negative bar in the chart. Its coefficient is around -$10,000 per bedroom. That reads backwards until you remember the model already accounts for total room count, so adding bedrooms really means slicing the same space into smaller rooms, and that reads as slightly lower value. It is a quirk of the overlap between bedrooms and total rooms, not a real preference, and it makes a good example for the report of why we trimmed the model.

What is not significant is also useful. FullBath, HalfBath, and YrSold all stop carrying real weight once the other predictors are in the model. Their effect is no longer statistically meaningful, meaning we cannot tell it apart from random noise. YrSold is the clearest case, with a p-value near 0.90, where anything above about 0.05 signals an effect we should not trust. Dropping it costs us nothing.

For comparison, ReducedMod is the same model with the two bathroom terms removed. Every predictor now clears the significance bar, and the coefficients move only a little, since room count and overall quality quietly absorb what the bathrooms were contributing.

kable(coef_tbl(fit_reducedmod),
      caption = "ReducedMod Coefficients (FullBath and HalfBath removed)")

ReducedMod Coefficients (FullBath and HalfBath removed)
Predictor	Estimate	Std_Error	t_value	p_value	Sig
(Intercept)	-1261994.49	142325.88	-8.87	0.00000	***
LotArea	0.71	0.11	6.73	0.00000	***
OverallQual	24383.65	1320.77	18.46	0.00000	***
YearBuilt	157.38	53.31	2.95	0.00323	**
YearRemodAdd	428.00	73.50	5.82	0.00000	***
BsmtFinSF1	31.92	2.80	11.39	0.00000	***
TotRmsAbvGrd	11874.18	836.93	14.19	0.00000	***
Fireplaces	11636.82	2046.99	5.68	0.00000	***
GarageArea	60.86	7.20	8.45	0.00000	***

4 Overlap Between Predictors

# vif() measures how much each predictor overlaps with the others. A value of 1 means no overlap; higher means more.
vif(fit_full) |>
  enframe(name = "Predictor", value = "VIF") |>
  arrange(desc(VIF)) |>
  mutate(VIF = round(VIF, 2)) |>
  kable(caption = "Variance inflation factors, full model")

Variance inflation factors, full model
Predictor	VIF
TotRmsAbvGrd	3.15
OverallQual	2.63
YearBuilt	2.26
FullBath	2.24
BedroomAbvGr	2.14
YearRemodAdd	1.76
GarageArea	1.74
Fireplaces	1.39
HalfBath	1.30
BsmtFinSF1	1.22
LotArea	1.14
YrSold	1.01

Before trusting the coefficients, we check whether predictors overlap too much, since heavy overlap can make individual coefficients unstable. The VIF numbers in the table measure that overlap. A value of 1 means a predictor shares no information with the others, and the common warning line is 5. Ours are all mild. TotRmsAbvGrd is highest at about 3.2, OverallQual next near 2.6, and everything else is under 2.5. Nothing crosses the warning line, so the overlap is real but not severe enough to distort the model. It affects how we read an individual coefficient, such as the negative bedroom sign, more than the overall fit.

5 The Five Predictions

# ph (the five hold-out homes) and out (predictions for all four models) are
# built in the model section above; here we report them and the percent error.
kable(out, format.args = list(big.mark = ","),
      caption = "Predicted Sale Prices by Model, Against the Actual Price")

Predicted Sale Prices by Model, Against the Actual Price
House	Actual	Small	ReducedMod	Reduced	Full
1	348,000	281,910	291,226	293,432	290,687
2	168,000	226,114	231,187	232,046	223,113
3	187,000	180,261	192,000	195,278	196,447
4	173,900	189,050	170,290	172,819	171,337
5	337,500	335,683	345,418	343,546	337,498

pe <- out |>
  mutate(across(c(Small, ReducedMod, Reduced, Full),
                ~ round(100 * abs(.x - Actual) / Actual, 1))) |>
  select(House, Small, ReducedMod, Reduced, Full)
kable(pe, caption = "Absolute Percent Error vs Actual")

Absolute Percent Error vs Actual
House	Small	ReducedMod	Reduced	Full
1	19.0	16.3	15.7	16.5
2	34.6	37.6	38.1	32.8
3	3.6	2.7	4.4	5.1
4	8.7	2.1	0.6	1.5
5	0.5	2.3	1.8	0.0

mape <- pe |> summarise(across(c(Small, ReducedMod, Reduced, Full), mean))

out |>
  pivot_longer(c(Small, ReducedMod, Reduced, Full), names_to = "Model", values_to = "Pred") |>
  mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full")),
         House = factor(House)) |>
  ggplot(aes(House)) +
  geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
           width = .78, alpha = .9) +
  geom_point(aes(y = Actual), size = 3, color = "black") +
  geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_manual(values = c(Small = "#b8b8d1", ReducedMod = "#4c9a8f",
                               Reduced = "#7a82c4", Full = "#4c72b0")) +
  labs(title = "Predicted vs Actual Price (black dots = actual)",
       x = "House", y = NULL, fill = NULL)

Mean absolute percent error across the five houses came in at Small 13.3%, ReducedMod 12.2%, Reduced 12.1%, and Full 11.2%. These five homes were held out of the model building, so they are a fair test, and they tell the same story as the cross-validation. ReducedMod, Reduced, and Full are all close, clearly ahead of Small.

Two houses are worth noting. House 2 is the hardest case for every model: an 1882 build on a large lot that all of them overprice, shown by the gap above the black dot in the chart. A linear model struggles with very old homes that sit far outside the typical range. House 5 is the high-quality home (OverallQual 10), and the dollar models handle it well, landing within a percent or two.

6 Alternative We Considered: Logging the SalePrice

We weighed one main alternative to the dollar models. Because the prices are lopsided, with a long tail of expensive homes, a common fix is to model price on a percentage basis instead of in flat dollars. In practice that means predicting the logarithm of the price and then converting the prediction back into dollars. We tested it on the same Reduced predictor set and scored it the same way.

f_log   <- log(SalePrice) ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
                         BsmtFinSF1 + FullBath + HalfBath +
                         TotRmsAbvGrd + Fireplaces + GarageArea
fit_log <- lm(f_log, d)

log_cv  <- cv_metrics(f_log, is_log = TRUE)   # reuses the same folds as above
red_cv  <- cv_metrics(f_reduced)
tibble(
  Model   = c("Reduced (dollars)", "Log (back-transformed)"),
  CV_RMSE = round(c(red_cv["CV_RMSE"], log_cv["CV_RMSE"])),
  CV_MAE  = round(c(red_cv["CV_MAE"],  log_cv["CV_MAE"]))
) |>
  kable(format.args = list(big.mark = ","),
        caption = "Reduced dollar model vs the log alternative, same folds")

Reduced dollar model vs the log alternative, same folds
Model	CV_RMSE	CV_MAE
Reduced (dollars)	35,778	24,611
Log (back-transformed)	31,604	20,633

out |>
  transmute(House, Actual, Reduced,
            Log_bt = round(exp(predict(fit_log, ph)))) |>
  kable(format.args = list(big.mark = ","),
        caption = "Predictions: Dollar Model vs Log Alternative")

Predictions: Dollar Model vs Log Alternative
House	Actual	Reduced	Log_bt
1	348,000	293,432	298,485
2	168,000	232,046	198,033
3	187,000	195,278	183,578
4	173,900	172,819	170,956
5	337,500	343,546	371,912

out |>
  mutate(Log = round(exp(predict(fit_log, ph)))) |>
  pivot_longer(c(Small, Reduced, Full, Log),
               names_to = "Model", values_to = "Pred") |>
  mutate(Model = factor(Model, levels = c("Small", "Reduced", "Full", "Log")),
         House = factor(House)) |>
  ggplot(aes(House)) +
  geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
           width = .75, alpha = .9) +
  geom_point(aes(y = Actual), size = 3, color = "black") +
  geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_manual(values = c(Small = "#b8b8d1", Reduced = "#7a82c4",
                               Full = "#4c72b0", Log = "#dd8452")) +
  labs(title = "Dollar Models and the Log Alternative vs Actual (black dots = actual)",
       subtitle = "Log is the back-transformed alternative, shown in orange",
       x = "House", y = NULL, fill = NULL)

Side by side, the models are close on Houses 3 and 4, and the differences show up at the extremes. On House 2, the 1882 home, the log bar (orange) lands nearest the actual while the three dollar models overshoot together. On House 5, the OverallQual 10 home, the pattern reverses: the dollar models sit right on the dot and the log model runs high. This is the same middle-versus-tails behavior the residual plot below explains, seen here on the five houses we predict.

k_lab <- scales::label_dollar(scale = 1e-3, suffix = "k")
bind_rows(
  tibble(fitted = fitted(fit_reduced), resid = resid(fit_reduced),
         Model = "Dollar Model (residuals fan out)"),
  tibble(fitted = exp(fitted(fit_log)), resid = d$SalePrice - exp(fitted(fit_log)),
         Model = "Log Model (residuals stay even)")
) |>
  ggplot(aes(fitted, resid)) +
  geom_point(alpha = .22, color = PAL[["pos"]]) +
  geom_hline(yintercept = 0, color = PAL[["neg"]]) +
  facet_wrap(~ Model, scales = "free") +
  scale_x_continuous(labels = k_lab, n.breaks = 4) +
  scale_y_continuous(labels = k_lab, n.breaks = 5) +
  labs(title = "Why we tested logging: error spread vs price",
       x = "Predicted price", y = "Residual")

The log model is more accurate. It shaves a few thousand dollars off both cross-validated RMSE and MAE, and on the five houses its mean error is about 9% against roughly 11% for the dollar models. Statistically it is the better fit, mostly because logging evens out the error spread that otherwise grows with price. The right panel above is the tell. The dollar model’s residuals fan out as price rises, while the log model’s sit in an even band.

We did not adopt it as our headline model anyway, for three reasons. The project is built around explaining dollar effects, and the log model does not give those directly. Its effects come out as percentages, so the OverallQual term means about an 11.6% lift per quality point rather than a flat dollar figure, which is harder to explain to a non-technical audience and easier to misread. And converting the log model’s predictions back into dollars introduces a small technical bias that goes beyond what the course covers. We therefore report it honestly as the stronger-fitting alternative we tested, explain why we kept the dollar model for clarity and scope, and proceed with the dollar-scale Reduced model.

7 Conclusions

We recommend the Reduced model as the best of the four. It is about as accurate as Full but simpler and easier to explain, and it drops two predictors that contribute little: YrSold, which is uninformative, and BedroomAbvGr, which overlaps with total rooms. Full is noted as marginally more accurate, and Small serves as the simple baseline. ReducedMod is reported as an additional comparison rather than the recommended model.

The log transform is documented as the alternative we tested and chose not to adopt. It fits the data slightly better, but its effects are expressed as percentages rather than dollars and converting its predictions back to dollars introduces a small bias, so the dollar-scale Reduced model is preferred for clarity.

Two points are noted as limitations rather than open issues. Predictions are weakest on very old homes well outside the typical range, as House 2, an 1882 build, shows. The large LotArea values were left in the data: the VIF values indicate collinearity is not a problem and the fit is solid without capping them, so they were retained and their influence noted.

House Price Regression: Model Development and Comparison

Kent State Business Analytics group project

June 2026