Built the three models from the EDA plan, all predicting the price in plain dollars so they compare apples to apples. We also tested a version that predicts the price as a percentage change instead of a flat dollar amount, called the log model, but we are treating that as an alternative we considered rather than a graded model, for reasons in the last section. This is written up so we can argue about it before we touch the report. Everything below is reproducible with the seed set to 42, so re-running gives the same numbers down to the dollar.

To run it yourself, keep House_Prices.csv and Predict_Houses.csv in the same folder as this file, then click Knit, or step through the chunks.

1 What the Data Confirmed

d <- read_csv("House_Prices.csv", show_col_types = FALSE)

d |>
  summarise(Min = min(SalePrice), Median = median(SalePrice),
            Mean = round(mean(SalePrice)), Max = max(SalePrice)) |>
  pivot_longer(everything(), names_to = "Statistic", values_to = "SalePrice") |>
  kable(format.args = list(big.mark = ","),
        caption = "SalePrice Summary (990 homes, no missing values)")

SalePrice Summary (990 homes, no missing values)
Statistic	SalePrice
Min	34,900
Median	163,250
Mean	182,151
Max	755,000

cor_df <- tibble(Predictor = names(d),
                 r = as.numeric(cor(d)[, "SalePrice"])) |>
  filter(Predictor != "SalePrice") |>
  arrange(r) |>
  mutate(Predictor = factor(Predictor, levels = Predictor))

ggplot(cor_df, aes(r, Predictor, fill = r > 0)) +
  geom_col(width = .7) +
  geom_text(aes(label = sprintf("%.2f", r),
                hjust = ifelse(r > 0, -0.15, 1.15)), size = 3.4) +
  scale_fill_manual(values = c(`TRUE` = PAL[["pos"]], `FALSE` = PAL[["neg"]]),
                    guide = "none") +
  scale_x_continuous(limits = c(-0.1, 0.95), expand = expansion(c(.05, .1))) +
  labs(title = "Correlation of Each Predictor With SalePrice",
       x = "Correlation With Price (1.0 = perfect)", y = NULL)

Everything in the EDA notes held up. The average price near $182k sits well above the $163k middle home, because a handful of expensive houses pull the average up. Statisticians call that right-skewed. OverallQual tracks price most closely, with a correlation of 0.80, where 1.0 would be a perfect lockstep match and 0 would mean no relationship. GarageArea, TotRmsAbvGrd, FullBath, YearBuilt, and YearRemodAdd follow in the 0.52 to 0.65 range. BedroomAbvGr is weak at 0.18 and YrSold is the only negative, basically flat at -0.03. Nothing new to chase down.

2 The Three Models

These are the three we planned to submit, plus a fourth, ReducedMod, that the group asked to see for discussion. All four predict SalePrice in dollars.

f_small   <- SalePrice ~ OverallQual + GarageArea + TotRmsAbvGrd + YearBuilt
f_full    <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
                         BsmtFinSF1 + FullBath + HalfBath + BedroomAbvGr +
                         TotRmsAbvGrd + Fireplaces + GarageArea + YrSold
f_reduced <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
                         BsmtFinSF1 + FullBath + HalfBath +
                         TotRmsAbvGrd + Fireplaces + GarageArea
f_reducedmod <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
                         BsmtFinSF1 + TotRmsAbvGrd + Fireplaces + GarageArea

Small keeps only the strong predictors, OverallQual, GarageArea, TotRmsAbvGrd, and YearBuilt. Four in total.
Full is all 12 predictors, our baseline.
Reduced is the Full model with YrSold and BedroomAbvGr removed. That clears out one predictor that tells us nothing about price (YrSold) and one of two predictors that carry much the same information (bedrooms and total rooms).
ReducedMod goes one step further than Reduced by also dropping FullBath and HalfBath, the two weakest terms in the Full model. Eight predictors in total. This one is a candidate for discussion, not a decision yet.

fit_small      <- lm(f_small,      d)
fit_full       <- lm(f_full,       d)
fit_reduced    <- lm(f_reduced,    d)
fit_reducedmod <- lm(f_reducedmod, d)

# 10-fold CV on the dollar scale. Same folds for every model so it is fair.
k <- 10
folds <- sample(rep(1:k, length.out = nrow(d)))
cv_metrics <- function(formula, is_log = FALSE) {
  out <- map_dfr(1:k, function(i) {
    tr <- d[folds != i, ]; te <- d[folds == i, ]
    fit  <- lm(formula, data = tr)
    pred <- predict(fit, te)
    if (is_log) pred <- exp(pred)
    err <- te$SalePrice - pred
    tibble(rmse = sqrt(mean(err^2)), mae = mean(abs(err)))
  })
  c(CV_RMSE = mean(out$rmse), CV_MAE = mean(out$mae))
}

comp <- tibble(
  Model      = c("Small", "ReducedMod", "Reduced", "Full"),
  Predictors = c(4L, 8L, 10L, 12L),
  fit        = list(fit_small, fit_reducedmod, fit_reduced, fit_full),
  formula    = list(f_small, f_reducedmod, f_reduced, f_full)
) |>
  mutate(cv     = map(formula, cv_metrics),
         CV_RMSE = round(map_dbl(cv, "CV_RMSE")),
         CV_MAE  = round(map_dbl(cv, "CV_MAE")),
         Adj_R2  = round(map_dbl(fit, ~ summary(.x)$adj.r.squared), 3),
         AIC     = round(map_dbl(fit, AIC))) |>
  select(Model, Predictors, CV_RMSE, CV_MAE, Adj_R2, AIC)

kable(comp, format.args = list(big.mark = ","),
      caption = "Model comparison. RMSE and MAE are two ways to measure the average prediction miss in dollars, RMSE weighting the big misses more heavily. Lower is better. From 10-fold cross-validation.")

Model comparison. RMSE and MAE are two ways to measure the average prediction miss in dollars, RMSE weighting the big misses more heavily. Lower is better. From 10-fold cross-validation.
Model	Predictors	CV_RMSE	CV_MAE	Adj_R2	AIC
Small	4	40,756	28,845	0.733	23,868
ReducedMod	8	35,786	24,680	0.798	23,596
Reduced	10	35,778	24,611	0.798	23,598
Full	12	35,432	24,489	0.803	23,576

comp |>
  select(Model, RMSE = CV_RMSE, MAE = CV_MAE) |>
  pivot_longer(c(RMSE, MAE), names_to = "Metric", values_to = "Dollars") |>
  mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full"))) |>
  ggplot(aes(Model, Dollars, fill = Metric)) +
  geom_col(position = position_dodge(.75), width = .65) +
  geom_text(aes(label = scales::dollar(Dollars)),
            position = position_dodge(.75), vjust = -0.4, size = 3.2) +
  scale_fill_manual(values = c(RMSE = PAL[["hi"]], MAE = PAL[["lo"]])) +
  scale_y_continuous(labels = scales::dollar, expand = expansion(c(0, .12))) +
  labs(title = "Cross-validated Error by Model (lower is better)",
       x = NULL, y = NULL, fill = NULL)

The Small model is the worst by a clear margin, so those six middle-strength predictors are pulling real weight even with all the overlap. Full and Reduced are basically tied. The numbers above come from cross-validation, which means we tested each model on data it had not seen while it was being built, splitting the data into ten parts and rotating which part is held back. Adjusted R-squared is the share of the price variation the model explains, with a small penalty for adding predictors. Full edges Reduced on both, but only by a few hundred dollars of error, and it does that while carrying two predictors that earn nothing. Reduced drops YrSold, which is useless, and BedroomAbvGr, which overlaps with total rooms and shows a backwards sign, for almost no accuracy cost. My lean is to call Reduced our best model on the strength of being simpler and easier to explain, with Full noted as fractionally more accurate. That gives us a clean “how and why we chose” story for the rubric.

ReducedMod is the new one to talk through. It takes Reduced and also drops FullBath and HalfBath, the two terms that were not significant. The payoff is that all eight remaining predictors are significant, which is a tidy story, and the cost is almost nothing, its cross-validated error sits within a few dollars of Reduced and its Adjusted R-squared matches to three decimals. The one thing to weigh is that a model with no bathroom term at all can look odd to a non-technical audience, since buyers clearly care about bathrooms. The honest read is that the bathroom effect does not vanish, it gets absorbed by the predictors it travels with, mainly room count and overall quality. That is the tradeoff for the group to decide.

3 Coefficients Worth Discussing

coef_tbl <- function(fit) {
  summary(fit)$coefficients |>
    as_tibble(rownames = "Predictor") |>
    transmute(Predictor,
              Estimate  = round(Estimate, 2),
              Std_Error = round(`Std. Error`, 2),
              t_value   = round(`t value`, 2),
              p_value   = signif(`Pr(>|t|)`, 3),
              Sig = cut(`Pr(>|t|)`, c(-Inf, .001, .01, .05, .1, Inf),
                        labels = c("***", "**", "*", ".", "")))
}
kable(coef_tbl(fit_full), caption = "Full Model Coefficients (dollar scale)")

Full Model Coefficients (dollar scale)
Predictor	Estimate	Std_Error	t_value	p_value	Sig
(Intercept)	-870591.44	1724927.10	-0.50	6.14e-01
LotArea	0.73	0.11	6.95	0.00e+00	***
OverallQual	23142.84	1334.70	17.34	0.00e+00	***
YearBuilt	129.53	56.99	2.27	2.32e-02	*
YearRemodAdd	374.08	73.71	5.08	5.00e-07	***
BsmtFinSF1	31.42	2.83	11.11	0.00e+00	***
FullBath	5554.58	3068.87	1.81	7.06e-02	.
HalfBath	4068.65	2589.61	1.57	1.16e-01
BedroomAbvGr	-10303.06	2028.72	-5.08	5.00e-07	***
TotRmsAbvGrd	14891.32	1244.91	11.96	0.00e+00	***
Fireplaces	10206.19	2053.47	4.97	8.00e-07	***
GarageArea	58.19	7.15	8.14	0.00e+00	***
YrSold	-109.65	860.33	-0.13	8.99e-01

# Put every predictor on the same scale (standardized) so features measured in different units can be compared by bar length.
preds <- all.vars(f_full)[-1]
sdx <- map_dbl(preds, ~ sd(d[[.x]]))
coef_df <- summary(fit_full)$coefficients[preds, ] |>
  as_tibble(rownames = "Predictor") |>
  mutate(beta = Estimate * sdx / sd(d$SalePrice),
         sig  = `Pr(>|t|)` < 0.05) |>
  arrange(beta) |>
  mutate(Predictor = factor(Predictor, levels = Predictor))

ggplot(coef_df, aes(beta, Predictor, fill = sig)) +
  geom_col(width = .7) +
  geom_vline(xintercept = 0, color = "gray40") +
  scale_fill_manual(values = c(`TRUE` = PAL[["pos"]], `FALSE` = PAL[["lo"]]),
                    labels = c(`TRUE` = "p < 0.05", `FALSE` = "not sig."),
                    name = NULL) +
  labs(title = "Which Features Matter Most (Full Model)",
       subtitle = "Longer bar = bigger effect, comparable across predictors",
       x = "Standardized Effect", y = NULL)

Each model gives every feature a coefficient, which is simply the dollar amount it adds to the predicted price. Here is what the Full model coefficients say.

OverallQual is the workhorse. Each one-point step up the quality scale adds about $23,000, holding everything else fixed. That is the single cleanest investor message we have, and the chart confirms it dwarfs everything else.
GarageArea runs about $58 per square foot. BsmtFinSF1 about $31 per finished basement square foot. LotArea is tiny per unit (~$0.73) but the lots are huge, so it still adds up.
TotRmsAbvGrd adds roughly $15,000 per room and stays strong across all three models.
YearBuilt and YearRemodAdd are both positive and significant. Newer and more recently updated homes sell higher.
Fireplaces is worth about $10,000 each and holds up well.

The interesting one is BedroomAbvGr, the single negative bar in the chart. Its coefficient is around -$10,000 per bedroom. That reads backwards until you remember the model already accounts for total room count, so adding bedrooms really means slicing the same space into smaller rooms, and that reads as slightly lower value. It is a quirk of the overlap between bedrooms and total rooms, not a real preference, and it makes a good example for the report of why we trimmed the model.

What is not significant is also useful. FullBath, HalfBath, and YrSold all stop carrying real weight once the other predictors are in the model. Their effect is no longer statistically meaningful, meaning we cannot tell it apart from random noise. YrSold is the clearest case, with a p-value near 0.90, where anything above about 0.05 signals an effect we should not trust. Dropping it costs us nothing.

For discussion, here is ReducedMod, the same model with the two bath terms gone. Every predictor now clears the significance bar, and the coefficients move only a little, since room count and overall quality quietly absorb what the bathrooms were contributing.

kable(coef_tbl(fit_reducedmod),
      caption = "ReducedMod Coefficients (FullBath and HalfBath removed)")

ReducedMod Coefficients (FullBath and HalfBath removed)
Predictor	Estimate	Std_Error	t_value	p_value	Sig
(Intercept)	-1261994.49	142325.88	-8.87	0.00000	***
LotArea	0.71	0.11	6.73	0.00000	***
OverallQual	24383.65	1320.77	18.46	0.00000	***
YearBuilt	157.38	53.31	2.95	0.00323	**
YearRemodAdd	428.00	73.50	5.82	0.00000	***
BsmtFinSF1	31.92	2.80	11.39	0.00000	***
TotRmsAbvGrd	11874.18	836.93	14.19	0.00000	***
Fireplaces	11636.82	2046.99	5.68	0.00000	***
GarageArea	60.86	7.20	8.45	0.00000	***

4 Overlap Between Predictors

# vif() measures how much each predictor overlaps with the others. A value of 1 means no overlap; higher means more.
vif(fit_full) |>
  enframe(name = "Predictor", value = "VIF") |>
  arrange(desc(VIF)) |>
  mutate(VIF = round(VIF, 2)) |>
  kable(caption = "Variance inflation factors, full model")

Variance inflation factors, full model
Predictor	VIF
TotRmsAbvGrd	3.15
OverallQual	2.63
YearBuilt	2.26
FullBath	2.24
BedroomAbvGr	2.14
YearRemodAdd	1.76
GarageArea	1.74
Fireplaces	1.39
HalfBath	1.30
BsmtFinSF1	1.22
LotArea	1.14
YrSold	1.01

Before trusting the coefficients, we check whether predictors overlap too much, since heavy overlap can make individual coefficients unstable. The VIF numbers in the table measure that overlap. A value of 1 means a predictor shares no information with the others, and the common warning line is 5. Ours are all mild. TotRmsAbvGrd is highest at about 3.2, OverallQual next near 2.6, and everything else is under 2.5. Nothing crosses the warning line, so the overlap is real but not bad enough to distort the model. It shows up in how we read a coefficient, like that bedroom sign, more than in the numbers themselves. Worth mentioning in the report rather than a big deal.

5 The Five Predictions

ph <- read_csv("Predict_Houses.csv", show_col_types = FALSE)

out <- tibble(
  House      = seq_len(nrow(ph)),
  Actual     = ph$SalePrice,
  Small      = round(predict(fit_small,      ph)),
  ReducedMod = round(predict(fit_reducedmod, ph)),
  Reduced    = round(predict(fit_reduced,    ph)),
  Full       = round(predict(fit_full,       ph))
)
kable(out, format.args = list(big.mark = ","),
      caption = "Predicted Sale Prices by Model, Against the Actual Price")

Predicted Sale Prices by Model, Against the Actual Price
House	Actual	Small	ReducedMod	Reduced	Full
1	348,000	281,910	291,226	293,432	290,687
2	168,000	226,114	231,187	232,046	223,113
3	187,000	180,261	192,000	195,278	196,447
4	173,900	189,050	170,290	172,819	171,337
5	337,500	335,683	345,418	343,546	337,498

pe <- out |>
  mutate(across(c(Small, ReducedMod, Reduced, Full),
                ~ round(100 * abs(.x - Actual) / Actual, 1))) |>
  select(House, Small, ReducedMod, Reduced, Full)
kable(pe, caption = "Absolute Percent Error vs Actual")

Absolute Percent Error vs Actual
House	Small	ReducedMod	Reduced	Full
1	19.0	16.3	15.7	16.5
2	34.6	37.6	38.1	32.8
3	3.6	2.7	4.4	5.1
4	8.7	2.1	0.6	1.5
5	0.5	2.3	1.8	0.0

mape <- pe |> summarise(across(c(Small, ReducedMod, Reduced, Full), mean))

out |>
  pivot_longer(c(Small, ReducedMod, Reduced, Full), names_to = "Model", values_to = "Pred") |>
  mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full")),
         House = factor(House)) |>
  ggplot(aes(House)) +
  geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
           width = .78, alpha = .9) +
  geom_point(aes(y = Actual), size = 3, color = "black") +
  geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_manual(values = c(Small = "#b8b8d1", ReducedMod = "#4c9a8f",
                               Reduced = "#7a82c4", Full = "#4c72b0")) +
  labs(title = "Predicted vs Actual Price (black dots = actual)",
       x = "House", y = NULL, fill = NULL)

Mean absolute percent error across the five houses came in at Small 13.3%, ReducedMod 12.2%, Reduced 12.1%, and Full 11.2%. These five homes were held out of the model building, so they are a fair test, and they tell the same story as the cross-validation. ReducedMod, Reduced, and Full are all close, clearly ahead of Small.

Two houses are worth a note. House 2 is the problem child for every model. It is an 1882 build on a big lot that all three overprice, the tall gap above the black dot in the chart. A straight-line model like ours struggles with very old homes that sit far outside the typical range. House 5 is the high-quality home (OverallQual 10) and the dollar models handle it well, landing within a percent or two.

6 Alternative We Considered: Logging the SalePrice

The rubric asks us to discuss alternatives we weighed, and this is the main one. Because the prices are lopsided, with that long tail of expensive homes, a common fix is to model the price on a percentage basis instead of in flat dollars. In practice that means predicting the logarithm of the price and then converting the prediction back into dollars. We tested it on the same Reduced predictor set and scored it the same way.

f_log   <- log(SalePrice) ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
                         BsmtFinSF1 + FullBath + HalfBath +
                         TotRmsAbvGrd + Fireplaces + GarageArea
fit_log <- lm(f_log, d)

log_cv  <- cv_metrics(f_log, is_log = TRUE)   # reuses the same folds as above
red_cv  <- cv_metrics(f_reduced)
tibble(
  Model   = c("Reduced (dollars)", "Log (back-transformed)"),
  CV_RMSE = round(c(red_cv["CV_RMSE"], log_cv["CV_RMSE"])),
  CV_MAE  = round(c(red_cv["CV_MAE"],  log_cv["CV_MAE"]))
) |>
  kable(format.args = list(big.mark = ","),
        caption = "Reduced dollar model vs the log alternative, same folds")

Reduced dollar model vs the log alternative, same folds
Model	CV_RMSE	CV_MAE
Reduced (dollars)	35,778	24,611
Log (back-transformed)	31,604	20,633

out |>
  transmute(House, Actual, Reduced,
            Log_bt = round(exp(predict(fit_log, ph)))) |>
  kable(format.args = list(big.mark = ","),
        caption = "Predictions: Dollar Model vs Log Alternative")

Predictions: Dollar Model vs Log Alternative
House	Actual	Reduced	Log_bt
1	348,000	293,432	298,485
2	168,000	232,046	198,033
3	187,000	195,278	183,578
4	173,900	172,819	170,956
5	337,500	343,546	371,912

out |>
  mutate(Log = round(exp(predict(fit_log, ph)))) |>
  pivot_longer(c(Small, Reduced, Full, Log),
               names_to = "Model", values_to = "Pred") |>
  mutate(Model = factor(Model, levels = c("Small", "Reduced", "Full", "Log")),
         House = factor(House)) |>
  ggplot(aes(House)) +
  geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
           width = .75, alpha = .9) +
  geom_point(aes(y = Actual), size = 3, color = "black") +
  geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_manual(values = c(Small = "#b8b8d1", Reduced = "#7a82c4",
                               Full = "#4c72b0", Log = "#dd8452")) +
  labs(title = "All Four Models vs Actual (black dots = actual)",
       subtitle = "Log is the back-transformed alternative, shown in orange",
       x = "House", y = NULL, fill = NULL)

Side by side, the four are close on Houses 3 and 4. The split shows up at the extremes. On House 2, the old 1882 home, the log bar (orange) lands nearest the actual while the three dollar models overshoot together. On House 5, the OverallQual 10 home, it flips, the dollar models sit right on the dot and log runs hot. That is the same middle-versus-tails behavior the residual plot below explains, just seen on the five houses we have to predict.

k_lab <- scales::label_dollar(scale = 1e-3, suffix = "k")
bind_rows(
  tibble(fitted = fitted(fit_reduced), resid = resid(fit_reduced),
         Model = "Dollar Model (residuals fan out)"),
  tibble(fitted = exp(fitted(fit_log)), resid = d$SalePrice - exp(fitted(fit_log)),
         Model = "Log Model (residuals stay even)")
) |>
  ggplot(aes(fitted, resid)) +
  geom_point(alpha = .22, color = PAL[["pos"]]) +
  geom_hline(yintercept = 0, color = PAL[["neg"]]) +
  facet_wrap(~ Model, scales = "free") +
  scale_x_continuous(labels = k_lab, n.breaks = 4) +
  scale_y_continuous(labels = k_lab, n.breaks = 5) +
  labs(title = "Why we tested logging: error spread vs price",
       x = "Predicted price", y = "Residual")

The log model is more accurate. It shaves a few thousand dollars off both cross-validated RMSE and MAE, and on the five houses its mean error is about 9% against roughly 11% for the dollar models. Statistically it is the better fit, mostly because logging evens out the error spread that otherwise grows with price. The right panel above is the tell. The dollar model’s residuals fan out as price rises, while the log model’s sit in an even band.

We are not making it our headline model anyway, for three reasons. The assignment is built around explaining dollar effects, and the log model does not give those directly. Its effects come out as percentages, so the OverallQual term means about an 11.6% lift per quality point rather than a flat dollar figure, which is more to explain to a non-technical audience and easier to get wrong. And converting the log model’s predictions back into dollars introduces a small technical bias that goes past what the course covers. So we report it honestly as the stronger-fitting alternative we tested, explain why we kept the dollar model for clarity and scope, and move on. That is exactly the kind of tradeoff the “alternatives and consequences” part of the rubric is asking for.

7 What To Disccus Open Items

Recommendation: Reduced as our best model, Full noted as marginally more accurate, Small as the simple baseline. The log transform goes in the report as the alternative we tested and chose not to adopt, which meets the methods-comparison requirement without including a technique that may sit outside the scope of this class.

A few open questions are still worth settling Monday.

Reduced or Full as the named best model? Reduced is simpler and cleaner to explain, Full is a hair more accurate. I lean Reduced.
House 2 drags everyone down. Do we say anything about model limits on very old homes, or leave it?
LotArea outliers. The VIFs say collinearity is fine, but we never decided whether to cap the giant lots. The fit is solid without capping, so I lean toward leaving them and just noting it.

Once we agree on the final set of models, I will edit our our notes and comments and make this into a comparison workbook the rubric wants.

8 Report Outline

This is the structure for the graded PDF, with the pieces each section needs. The numbers and figures all come from the sections above, so we are really just arranging work we have already done.

Front matter (page 1). A table of group members and a short note on what each person did. This is required on the first page.

1. Project Goal. State the objective in plain terms, predict a home’s sale price from twelve features and value the five homes in Predict_Houses.csv. Frame it as a simplified version of Zillow’s Zestimate, and use Zillow’s own accuracy history, median error down from about 14 percent to roughly 5 percent, for context.

2. Overview of data and exploratory analysis. Describe the 990 homes, twelve features, and clean data with no missing values. Show the lopsided prices and the correlation chart. Cover the three issues we handled, the lopsided prices that led us to test the log model, the overlap between bedrooms and total rooms that we checked with VIF and fixed by dropping one, and the YrSold field that told us nothing and got removed. Note that we chose to leave the few very large lots in.

3. Model output and interpretation. Describe the three models that build from small to full. Read the coefficients in plain dollars, OverallQual at about $23,000 per quality point and so on. Explain the backwards bedroom coefficient as a side effect of holding total rooms constant. Point out which features are not significant. Use the importance chart to show what matters most. Mention the log model as the alternative we tested.

4. Performance. Report cross-validated RMSE and MAE plus Adjusted R-squared, with the comparison table and chart. Explain why Reduced is our pick, about as accurate as Full but simpler. Be honest about the limits, a typical miss around $35,000 or roughly 12 percent, weakest on very old homes and at the price extremes. This is the home for the full log comparison and the residual chart, plus the three reasons we did not adopt the log model.

5. Predicted sale prices for the five houses. Give the Reduced predictions against the actual prices, supported by the four-model chart. Note House 2, the 1882 home every model overprices, and House 5, the top-quality home where the log alternative runs hot.

Deliverables checklist. One PDF report with the sections above and the contributions table on page 1. A separate Excel or R file with at least three models, their coefficients, and an RMSE and MAE comparison that marks the best one. (Emailed Prof to confirm R file is ok, contridiction in the assignment PDF) A roughly five-minute narrated PowerPoint aimed at investors. Everything zipped as Group_X.zip.

Figure to section map. Section 2 gets the correlation chart and the skew histogram. Section 3 gets the importance chart. Section 4 gets the comparison chart, the log comparison, and the residual chart. Section 5 gets the four-model prediction chart.

One thing to settle. Sections 3 and 4 both touch the log model. We should decide whether it lives mainly in section 3 as an alternative strategy or section 4 as an alternative way to measure performance, so we do not write it up twice. My thoughts are include it in section 4 with a one-line pointer from section 3.

House Price Regression: Model Observations

Kent State Business Analytics group project

Notes for Monday’s discussion