Built the three models from the EDA plan, all predicting the price in plain dollars so they compare apples to apples. We also tested a version that predicts the price as a percentage change instead of a flat dollar amount, called the log model, but we are treating that as an alternative we considered rather than a graded model, for reasons in the last section. This is written up so we can argue about it before we touch the report. Everything below is reproducible with the seed set to 42, so re-running gives the same numbers down to the dollar.
To run it yourself, keep House_Prices.csv and Predict_Houses.csv in the same folder as this file, then click Knit, or step through the chunks.
d <- read_csv("House_Prices.csv", show_col_types = FALSE)
d |>
summarise(Min = min(SalePrice), Median = median(SalePrice),
Mean = round(mean(SalePrice)), Max = max(SalePrice)) |>
pivot_longer(everything(), names_to = "Statistic", values_to = "SalePrice") |>
kable(format.args = list(big.mark = ","),
caption = "SalePrice Summary (990 homes, no missing values)")
| Statistic | SalePrice |
|---|---|
| Min | 34,900 |
| Median | 163,250 |
| Mean | 182,151 |
| Max | 755,000 |
cor_df <- tibble(Predictor = names(d),
r = as.numeric(cor(d)[, "SalePrice"])) |>
filter(Predictor != "SalePrice") |>
arrange(r) |>
mutate(Predictor = factor(Predictor, levels = Predictor))
ggplot(cor_df, aes(r, Predictor, fill = r > 0)) +
geom_col(width = .7) +
geom_text(aes(label = sprintf("%.2f", r),
hjust = ifelse(r > 0, -0.15, 1.15)), size = 3.4) +
scale_fill_manual(values = c(`TRUE` = PAL[["pos"]], `FALSE` = PAL[["neg"]]),
guide = "none") +
scale_x_continuous(limits = c(-0.1, 0.95), expand = expansion(c(.05, .1))) +
labs(title = "Correlation of Each Predictor With SalePrice",
x = "Correlation With Price (1.0 = perfect)", y = NULL)
Everything in the EDA notes held up. The average price near $182k sits well above the $163k middle home, because a handful of expensive houses pull the average up. Statisticians call that right-skewed. OverallQual tracks price most closely, with a correlation of 0.80, where 1.0 would be a perfect lockstep match and 0 would mean no relationship. GarageArea, TotRmsAbvGrd, FullBath, YearBuilt, and YearRemodAdd follow in the 0.52 to 0.65 range. BedroomAbvGr is weak at 0.18 and YrSold is the only negative, basically flat at -0.03. Nothing new to chase down.
These are the three we planned to submit, plus a fourth, ReducedMod, that the group asked to see for discussion. All four predict SalePrice in dollars.
f_small <- SalePrice ~ OverallQual + GarageArea + TotRmsAbvGrd + YearBuilt
f_full <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
BsmtFinSF1 + FullBath + HalfBath + BedroomAbvGr +
TotRmsAbvGrd + Fireplaces + GarageArea + YrSold
f_reduced <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
BsmtFinSF1 + FullBath + HalfBath +
TotRmsAbvGrd + Fireplaces + GarageArea
f_reducedmod <- SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
BsmtFinSF1 + TotRmsAbvGrd + Fireplaces + GarageArea
fit_small <- lm(f_small, d)
fit_full <- lm(f_full, d)
fit_reduced <- lm(f_reduced, d)
fit_reducedmod <- lm(f_reducedmod, d)
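# 10-fold CV on the dollar scale. Same folds for every model so it is fair.
# Fold assignment is random, so it relies on the seed (42) being set before this line.
k <- 10
folds <- sample(rep(1:k, length.out = nrow(d)))
cv_metrics <- function(formula, is_log = FALSE) {
  out <- map_dfr(1:k, function(i) {
    # Hold out fold i and fit on the other k - 1 folds
    tr <- d[folds != i, ]
    te <- d[folds == i, ]
    fit <- lm(formula, data = tr)
    pred <- predict(fit, te)
    # Log models predict log(SalePrice), so convert back to dollars before scoring
    if (is_log) pred <- exp(pred)
    err <- te$SalePrice - pred
    tibble(rmse = sqrt(mean(err^2)), mae = mean(abs(err)))
  })
  # Average the per-fold errors
  c(CV_RMSE = mean(out$rmse), CV_MAE = mean(out$mae))
}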
comp <- tibble(
Model = c("Small", "ReducedMod", "Reduced", "Full"),
Predictors = c(4L, 8L, 10L, 12L),
fit = list(fit_small, fit_reducedmod, fit_reduced, fit_full),
formula = list(f_small, f_reducedmod, f_reduced, f_full)
) |>
mutate(cv = map(formula, cv_metrics),
CV_RMSE = round(map_dbl(cv, "CV_RMSE")),
CV_MAE = round(map_dbl(cv, "CV_MAE")),
Adj_R2 = round(map_dbl(fit, ~ summary(.x)$adj.r.squared), 3),
AIC = round(map_dbl(fit, AIC))) |>
select(Model, Predictors, CV_RMSE, CV_MAE, Adj_R2, AIC)
kable(comp, format.args = list(big.mark = ","),
caption = "Model comparison. RMSE and MAE are two ways to measure the average prediction miss in dollars, RMSE weighting the big misses more heavily. Lower is better. From 10-fold cross-validation.")
| Model | Predictors | CV_RMSE | CV_MAE | Adj_R2 | AIC |
|---|---|---|---|---|---|
| Small | 4 | 40,756 | 28,845 | 0.733 | 23,868 |
| ReducedMod | 8 | 35,786 | 24,680 | 0.798 | 23,596 |
| Reduced | 10 | 35,778 | 24,611 | 0.798 | 23,598 |
| Full | 12 | 35,432 | 24,489 | 0.803 | 23,576 |
comp |>
select(Model, RMSE = CV_RMSE, MAE = CV_MAE) |>
pivot_longer(c(RMSE, MAE), names_to = "Metric", values_to = "Dollars") |>
mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full"))) |>
ggplot(aes(Model, Dollars, fill = Metric)) +
geom_col(position = position_dodge(.75), width = .65) +
geom_text(aes(label = scales::dollar(Dollars)),
position = position_dodge(.75), vjust = -0.4, size = 3.2) +
scale_fill_manual(values = c(RMSE = PAL[["hi"]], MAE = PAL[["lo"]])) +
scale_y_continuous(labels = scales::dollar, expand = expansion(c(0, .12))) +
labs(title = "Cross-validated Error by Model (lower is better)",
x = NULL, y = NULL, fill = NULL)
The Small model is the worst by a clear margin, so the six middle-strength predictors that Reduced adds are pulling real weight even with all the overlap. Full and Reduced are basically tied. The error numbers above come from cross-validation, which means we tested each model on data it had not seen while it was being built, splitting the data into ten parts and rotating which part is held back. Adjusted R-squared is the share of the price variation the model explains, with a small penalty for adding predictors. Full edges Reduced on both cross-validated error and Adjusted R-squared, but only by $346 of RMSE and $122 of MAE, and it does that while carrying two predictors that are hard to justify. Reduced drops YrSold, which is useless, and BedroomAbvGr, which overlaps with total rooms and shows a backwards sign, for almost no accuracy cost. My lean is to call Reduced our best model on the strength of being simpler and easier to explain, with Full noted as fractionally more accurate. That gives us a clean “how and why we chose” story for the rubric.
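For the report, the three metrics in the comparison table can be written out as formulas. This is a sketch in standard textbook notation rather than anything taken from the assignment: $y_i$ is an actual price, $\hat{y}_i$ the model's prediction, $n$ the number of homes being scored, and $p$ the number of predictors.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

In our code, RMSE and MAE are computed inside each held-out fold and then averaged over the ten folds, while Adjusted R-squared and AIC come from the fits on all 990 homes.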
ReducedMod is the new one to talk through. It takes Reduced and also drops FullBath and HalfBath, the two terms that were not significant. The payoff is that all eight remaining predictors are significant, which is a tidy story, and the cost is almost nothing: its cross-validated RMSE is within $8 of Reduced, its MAE is within $69, and its Adjusted R-squared matches to three decimals. The one thing to weigh is that a model with no bathroom term at all can look odd to a non-technical audience, since buyers clearly care about bathrooms. The honest read is that the bathroom effect does not vanish; it gets absorbed by the predictors it travels with, mainly room count and overall quality. That is the tradeoff for the group to decide.
coef_tbl <- function(fit) {
summary(fit)$coefficients |>
as_tibble(rownames = "Predictor") |>
transmute(Predictor,
Estimate = round(Estimate, 2),
Std_Error = round(`Std. Error`, 2),
t_value = round(`t value`, 2),
p_value = signif(`Pr(>|t|)`, 3),
Sig = cut(`Pr(>|t|)`, c(-Inf, .001, .01, .05, .1, Inf),
labels = c("***", "**", "*", ".", "")))
}
kable(coef_tbl(fit_full), caption = "Full Model Coefficients (dollar scale)")
| Predictor | Estimate | Std_Error | t_value | p_value | Sig |
|---|---|---|---|---|---|
| (Intercept) | -870591.44 | 1724927.10 | -0.50 | 6.14e-01 | |
| LotArea | 0.73 | 0.11 | 6.95 | 0.00e+00 | *** |
| OverallQual | 23142.84 | 1334.70 | 17.34 | 0.00e+00 | *** |
| YearBuilt | 129.53 | 56.99 | 2.27 | 2.32e-02 | * |
| YearRemodAdd | 374.08 | 73.71 | 5.08 | 5.00e-07 | *** |
| BsmtFinSF1 | 31.42 | 2.83 | 11.11 | 0.00e+00 | *** |
| FullBath | 5554.58 | 3068.87 | 1.81 | 7.06e-02 | . |
| HalfBath | 4068.65 | 2589.61 | 1.57 | 1.16e-01 | |
| BedroomAbvGr | -10303.06 | 2028.72 | -5.08 | 5.00e-07 | *** |
| TotRmsAbvGrd | 14891.32 | 1244.91 | 11.96 | 0.00e+00 | *** |
| Fireplaces | 10206.19 | 2053.47 | 4.97 | 8.00e-07 | *** |
| GarageArea | 58.19 | 7.15 | 8.14 | 0.00e+00 | *** |
| YrSold | -109.65 | 860.33 | -0.13 | 8.99e-01 | |
# Put every predictor on the same scale (standardized) so features measured in different units can be compared by bar length.
preds <- all.vars(f_full)[-1]
sdx <- map_dbl(preds, ~ sd(d[[.x]]))
coef_df <- summary(fit_full)$coefficients[preds, ] |>
as_tibble(rownames = "Predictor") |>
mutate(beta = Estimate * sdx / sd(d$SalePrice),
sig = `Pr(>|t|)` < 0.05) |>
arrange(beta) |>
mutate(Predictor = factor(Predictor, levels = Predictor))
ggplot(coef_df, aes(beta, Predictor, fill = sig)) +
geom_col(width = .7) +
geom_vline(xintercept = 0, color = "gray40") +
scale_fill_manual(values = c(`TRUE` = PAL[["pos"]], `FALSE` = PAL[["lo"]]),
labels = c(`TRUE` = "p < 0.05", `FALSE` = "not sig."),
name = NULL) +
labs(title = "Which Features Matter Most (Full Model)",
subtitle = "Longer bar = bigger effect, comparable across predictors",
x = "Standardized Effect", y = NULL)
Each model gives every feature a coefficient, which is the dollar amount the predicted price changes for a one-unit increase in that feature, with the other features held fixed. Here is what the Full model coefficients say.
The interesting one is BedroomAbvGr, the only sizable negative bar in the chart (YrSold is also negative, but it cannot be told apart from zero). Its coefficient is around -$10,000 per bedroom. That reads backwards until you remember the model already accounts for total room count, so adding bedrooms really means slicing the same space into smaller rooms, and that reads as slightly lower value. It is a quirk of the overlap between bedrooms and total rooms, not a real preference, and it makes a good example for the report of why we trimmed the model.
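The sign flip is easy to reproduce on made-up data, which may help when we explain it in the report. The sketch below is simulated and uses none of our actual numbers: the variable names, the 0.45 bedrooms-per-room ratio, and the dollar effects are all assumptions chosen for illustration.

```r
# Simulated data only: none of these numbers come from House_Prices.csv.
set.seed(42)
n <- 990
rooms <- pmax(3, round(rnorm(n, mean = 6.5, sd = 1.5)))
# Bedrooms rise with total rooms, plus some house-to-house variation
bedrooms <- pmax(1, round(0.45 * rooms + rnorm(n, sd = 0.6)))
# Assumed pricing rule: each room adds value, and making a room a bedroom
# (same total room count) takes some of it back
price <- 50000 + 15000 * rooms - 10000 * bedrooms + rnorm(n, sd = 20000)
sim <- data.frame(price, rooms, bedrooms)

# On its own, bedrooms is positively related to price
cor_bed <- cor(sim$bedrooms, sim$price)
fit_alone <- lm(price ~ bedrooms, data = sim)

# With total rooms held fixed, the bedroom coefficient turns negative
fit_both <- lm(price ~ rooms + bedrooms, data = sim)

b_alone <- unname(coef(fit_alone)["bedrooms"])
b_both <- unname(coef(fit_both)["bedrooms"])
print(round(c(cor_bed = cor_bed, b_alone = b_alone, b_both = b_both), 2))
```

The pattern to look for is a positive correlation and a positive stand-alone slope, next to a negative slope once rooms is in the model. That matches what we see with BedroomAbvGr, which has a weak positive correlation with price but a negative coefficient in Full.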
What is not significant is also useful. FullBath, HalfBath, and YrSold all stop carrying real weight once the other predictors are in the model. Their effect is no longer statistically meaningful, meaning we cannot tell it apart from random noise. YrSold is the clearest case, with a p-value near 0.90, where anything above about 0.05 signals an effect we should not trust. Dropping it costs us nothing.
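As a sanity check on the table, the t and p columns can be rebuilt from the estimate and standard error alone. This is a small sketch using the rounded values printed in the Full model table, and it assumes the residual degrees of freedom are 990 homes minus 12 predictors minus the intercept.

```r
# Values copied from the Full model coefficient table (rounded as printed)
est <- c(FullBath = 5554.58, HalfBath = 4068.65, YrSold = -109.65)
se  <- c(FullBath = 3068.87, HalfBath = 2589.61, YrSold = 860.33)

# Assumed residual degrees of freedom: 990 homes - 12 predictors - 1 intercept
df_resid <- 990 - 12 - 1

# t value is the estimate divided by its standard error
t_value <- est / se
# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(cbind(t_value, p_value), 3))
```

The results should land on the table's p-values of roughly 0.07, 0.12, and 0.90, up to rounding in the inputs.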
For discussion, here is ReducedMod, which is Reduced with the two bath terms gone. Every predictor now clears the significance bar, and the coefficients move only a little, since room count and overall quality quietly absorb what the bathrooms were contributing.
kable(coef_tbl(fit_reducedmod),
caption = "ReducedMod Coefficients (FullBath and HalfBath removed)")
| Predictor | Estimate | Std_Error | t_value | p_value | Sig |
|---|---|---|---|---|---|
| (Intercept) | -1261994.49 | 142325.88 | -8.87 | 0.00000 | *** |
| LotArea | 0.71 | 0.11 | 6.73 | 0.00000 | *** |
| OverallQual | 24383.65 | 1320.77 | 18.46 | 0.00000 | *** |
| YearBuilt | 157.38 | 53.31 | 2.95 | 0.00323 | ** |
| YearRemodAdd | 428.00 | 73.50 | 5.82 | 0.00000 | *** |
| BsmtFinSF1 | 31.92 | 2.80 | 11.39 | 0.00000 | *** |
| TotRmsAbvGrd | 11874.18 | 836.93 | 14.19 | 0.00000 | *** |
| Fireplaces | 11636.82 | 2046.99 | 5.68 | 0.00000 | *** |
| GarageArea | 60.86 | 7.20 | 8.45 | 0.00000 | *** |
# vif() measures how much each predictor overlaps with the others. A value of 1 means no overlap; higher means more.
vif(fit_full) |>
enframe(name = "Predictor", value = "VIF") |>
arrange(desc(VIF)) |>
mutate(VIF = round(VIF, 2)) |>
kable(caption = "Variance inflation factors, full model")
| Predictor | VIF |
|---|---|
| TotRmsAbvGrd | 3.15 |
| OverallQual | 2.63 |
| YearBuilt | 2.26 |
| FullBath | 2.24 |
| BedroomAbvGr | 2.14 |
| YearRemodAdd | 1.76 |
| GarageArea | 1.74 |
| Fireplaces | 1.39 |
| HalfBath | 1.30 |
| BsmtFinSF1 | 1.22 |
| LotArea | 1.14 |
| YrSold | 1.01 |
Before trusting the coefficients, we check whether predictors overlap too much, since heavy overlap can make individual coefficients unstable. The VIF numbers in the table measure that overlap. A value of 1 means a predictor shares no information with the others, and the common warning line is 5. Ours are all mild. TotRmsAbvGrd is highest at about 3.2, OverallQual next near 2.6, and everything else is under 2.5. Nothing crosses the warning line, so the overlap is real but not bad enough to distort the model. It shows up in how we read a coefficient, like that bedroom sign, more than in the numbers themselves. Worth mentioning in the report rather than a big deal.
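If anyone asks what the VIF number actually is, it can be computed by hand: regress each predictor on all the others and take 1 / (1 - R-squared). The sketch below does that in base R on simulated predictors, not our data, and the three variable names are placeholders. For a plain linear model with numeric predictors it should agree with what vif() reports.

```r
# Simulated predictors only, not the project data
set.seed(42)
n <- 990
rooms <- rnorm(n, mean = 6.5, sd = 1.5)
bedrooms <- 0.45 * rooms + rnorm(n, sd = 0.6)  # overlaps with rooms
lot <- rnorm(n, mean = 10000, sd = 3000)       # unrelated to the other two
X <- data.frame(rooms, bedrooms, lot)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

The two overlapping predictors come out well above 1 and the unrelated one sits almost exactly at 1, which is the same reading we apply to TotRmsAbvGrd versus YrSold in our table.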
ph <- read_csv("Predict_Houses.csv", show_col_types = FALSE)
out <- tibble(
House = seq_len(nrow(ph)),
Actual = ph$SalePrice,
Small = round(predict(fit_small, ph)),
ReducedMod = round(predict(fit_reducedmod, ph)),
Reduced = round(predict(fit_reduced, ph)),
Full = round(predict(fit_full, ph))
)
kable(out, format.args = list(big.mark = ","),
caption = "Predicted Sale Prices by Model, Against the Actual Price")
| House | Actual | Small | ReducedMod | Reduced | Full |
|---|---|---|---|---|---|
| 1 | 348,000 | 281,910 | 291,226 | 293,432 | 290,687 |
| 2 | 168,000 | 226,114 | 231,187 | 232,046 | 223,113 |
| 3 | 187,000 | 180,261 | 192,000 | 195,278 | 196,447 |
| 4 | 173,900 | 189,050 | 170,290 | 172,819 | 171,337 |
| 5 | 337,500 | 335,683 | 345,418 | 343,546 | 337,498 |
pe <- out |>
mutate(across(c(Small, ReducedMod, Reduced, Full),
~ round(100 * abs(.x - Actual) / Actual, 1))) |>
select(House, Small, ReducedMod, Reduced, Full)
kable(pe, caption = "Absolute Percent Error vs Actual")
| House | Small | ReducedMod | Reduced | Full |
|---|---|---|---|---|
| 1 | 19.0 | 16.3 | 15.7 | 16.5 |
| 2 | 34.6 | 37.6 | 38.1 | 32.8 |
| 3 | 3.6 | 2.7 | 4.4 | 5.1 |
| 4 | 8.7 | 2.1 | 0.6 | 1.5 |
| 5 | 0.5 | 2.3 | 1.8 | 0.0 |
mape <- pe |> summarise(across(c(Small, ReducedMod, Reduced, Full), mean))
out |>
pivot_longer(c(Small, ReducedMod, Reduced, Full), names_to = "Model", values_to = "Pred") |>
mutate(Model = factor(Model, levels = c("Small", "ReducedMod", "Reduced", "Full")),
House = factor(House)) |>
ggplot(aes(House)) +
geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
width = .78, alpha = .9) +
geom_point(aes(y = Actual), size = 3, color = "black") +
geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
scale_y_continuous(labels = scales::dollar) +
scale_fill_manual(values = c(Small = "#b8b8d1", ReducedMod = "#4c9a8f",
Reduced = "#7a82c4", Full = "#4c72b0")) +
labs(title = "Predicted vs Actual Price (black dots = actual)",
x = "House", y = NULL, fill = NULL)
Mean absolute percent error across the five houses came in at Small 13.3%, ReducedMod 12.2%, Reduced 12.1%, and Full 11.2%. These five homes were held out of the model building, so they are a fair test, and they tell the same story as the cross-validation. ReducedMod, Reduced, and Full are all close, clearly ahead of Small.
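The percentages quoted above can be checked without the data files, straight from the prediction table. This is a standalone sketch with the table values typed in by hand; it follows the same order of operations as our chunk, rounding each house's error to one decimal before averaging.

```r
# Actual and predicted prices copied from the prediction table
actual <- c(348000, 168000, 187000, 173900, 337500)
pred <- data.frame(
  Small      = c(281910, 226114, 180261, 189050, 335683),
  ReducedMod = c(291226, 231187, 192000, 170290, 345418),
  Reduced    = c(293432, 232046, 195278, 172819, 343546),
  Full       = c(290687, 223113, 196447, 171337, 337498)
)

# Absolute percent error per house, rounded to one decimal as in the table
ape <- sapply(pred, function(p) round(100 * abs(p - actual) / actual, 1))

# Mean across the five houses for each model
mape <- round(colMeans(ape), 1)
print(mape)
```

With only five houses, one bad miss like House 2 moves these averages a lot, so small gaps between models here should not be over-read.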
Two houses are worth a note. House 2 is the problem child for every model. It is an 1882 build on a big lot that all four models overprice, the tall gap above the black dot in the chart. A straight-line model like ours struggles with very old homes that sit far outside the typical range. House 5 is the high-quality home (OverallQual 10) and the dollar models handle it well, landing within a percent or two.
The rubric asks us to discuss alternatives we weighed, and this is the main one. Because the prices are lopsided, with that long tail of expensive homes, a common fix is to model the price on a percentage basis instead of in flat dollars. In practice that means predicting the logarithm of the price and then converting the prediction back into dollars. We tested it on the same Reduced predictor set and scored it the same way.
f_log <- log(SalePrice) ~ LotArea + OverallQual + YearBuilt + YearRemodAdd +
BsmtFinSF1 + FullBath + HalfBath +
TotRmsAbvGrd + Fireplaces + GarageArea
fit_log <- lm(f_log, d)
log_cv <- cv_metrics(f_log, is_log = TRUE) # reuses the same folds as above
red_cv <- cv_metrics(f_reduced)
tibble(
Model = c("Reduced (dollars)", "Log (back-transformed)"),
CV_RMSE = round(c(red_cv["CV_RMSE"], log_cv["CV_RMSE"])),
CV_MAE = round(c(red_cv["CV_MAE"], log_cv["CV_MAE"]))
) |>
kable(format.args = list(big.mark = ","),
caption = "Reduced dollar model vs the log alternative, same folds")
| Model | CV_RMSE | CV_MAE |
|---|---|---|
| Reduced (dollars) | 35,778 | 24,611 |
| Log (back-transformed) | 31,604 | 20,633 |
out |>
transmute(House, Actual, Reduced,
Log_bt = round(exp(predict(fit_log, ph)))) |>
kable(format.args = list(big.mark = ","),
caption = "Predictions: Dollar Model vs Log Alternative")
| House | Actual | Reduced | Log_bt |
|---|---|---|---|
| 1 | 348,000 | 293,432 | 298,485 |
| 2 | 168,000 | 232,046 | 198,033 |
| 3 | 187,000 | 195,278 | 183,578 |
| 4 | 173,900 | 172,819 | 170,956 |
| 5 | 337,500 | 343,546 | 371,912 |
out |>
mutate(Log = round(exp(predict(fit_log, ph)))) |>
pivot_longer(c(Small, Reduced, Full, Log),
names_to = "Model", values_to = "Pred") |>
mutate(Model = factor(Model, levels = c("Small", "Reduced", "Full", "Log")),
House = factor(House)) |>
ggplot(aes(House)) +
geom_col(aes(y = Pred, fill = Model), position = position_dodge(.8),
width = .75, alpha = .9) +
geom_point(aes(y = Actual), size = 3, color = "black") +
geom_line(aes(y = Actual, group = 1), color = "black", linetype = 2, linewidth = .5) +
scale_y_continuous(labels = scales::dollar) +
scale_fill_manual(values = c(Small = "#b8b8d1", Reduced = "#7a82c4",
Full = "#4c72b0", Log = "#dd8452")) +
labs(title = "All Four Models vs Actual (black dots = actual)",
subtitle = "Log is the back-transformed alternative, shown in orange",
x = "House", y = NULL, fill = NULL)
Side by side, the four are close on Houses 3 and 4. The split shows up at the extremes. On House 2, the old 1882 home, the log bar (orange) lands nearest the actual while the three dollar models overshoot together. On House 5, the OverallQual 10 home, it flips: the dollar models sit right on the dot and log runs hot. That is the same middle-versus-tails behavior the residual plot below explains, just seen on the five houses we have to predict.
k_lab <- scales::label_dollar(scale = 1e-3, suffix = "k")
bind_rows(
tibble(fitted = fitted(fit_reduced), resid = resid(fit_reduced),
Model = "Dollar Model (residuals fan out)"),
tibble(fitted = exp(fitted(fit_log)), resid = d$SalePrice - exp(fitted(fit_log)),
Model = "Log Model (residuals stay even)")
) |>
ggplot(aes(fitted, resid)) +
geom_point(alpha = .22, color = PAL[["pos"]]) +
geom_hline(yintercept = 0, color = PAL[["neg"]]) +
facet_wrap(~ Model, scales = "free") +
scale_x_continuous(labels = k_lab, n.breaks = 4) +
scale_y_continuous(labels = k_lab, n.breaks = 5) +
labs(title = "Why we tested logging: error spread vs price",
x = "Predicted price", y = "Residual")
The log model is more accurate. It shaves about $4,000 off both cross-validated RMSE (35,778 down to 31,604) and MAE (24,611 down to 20,633), and on the five houses its mean error is about 9% against 11% to 13% for the dollar models. Statistically it is the better fit, mostly because logging evens out the error spread that otherwise grows with price. The right panel above is the tell. The dollar model’s residuals fan out as price rises, while the log model’s sit in an even band.
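One way to put a number on "fan out" versus "even band" is to correlate the size of each residual with the fitted value, with each model's residuals kept on the scale it was fit on (our chart converts the log model's residuals back to dollars, which is a different view). The sketch below uses simulated prices with percentage-style noise, not our data, and the single predictor qual is a stand-in.

```r
# Simulated prices with multiplicative noise; not the project data
set.seed(42)
n <- 990
qual <- sample(1:10, n, replace = TRUE)
price <- exp(11 + 0.15 * qual + rnorm(n, sd = 0.2))
sim <- data.frame(price, qual)

fit_dollar <- lm(price ~ qual, data = sim)
fit_logged <- lm(log(price) ~ qual, data = sim)

# Correlation between residual size and fitted value, each on the scale
# the model was fit on. Near 0 suggests an even band; clearly positive
# suggests the spread grows with the prediction.
spread_dollar <- cor(abs(resid(fit_dollar)), fitted(fit_dollar))
spread_logged <- cor(abs(resid(fit_logged)), fitted(fit_logged))
print(round(c(dollar = spread_dollar, logged = spread_logged), 2))
```

Running the same two lines on fit_reduced and fit_log would give us a figure to quote alongside the chart, if the group wants one.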
We are not making it our headline model anyway, for three reasons. The assignment is built around explaining dollar effects, and the log model does not give those directly. Its effects come out as percentages, so the OverallQual term means about an 11.6% lift per quality point rather than a flat dollar figure, which is more to explain to a non-technical audience and easier to get wrong. And converting the log model’s predictions back into dollars introduces a small technical bias that goes past what the course covers. So we report it honestly as the stronger-fitting alternative we tested, explain why we kept the dollar model for clarity and scope, and move on. That is exactly the kind of tradeoff the “alternatives and consequences” part of the rubric is asking for.
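For anyone curious about the back-transform bias, here is a rough illustration on simulated data. It is not part of our analysis: the correction shown is Duan's smearing estimator, a standard fix that we did not apply, and every number and name in the block is made up. The last lines also show the usual way to turn a log-scale coefficient into a percent change, 100 * (exp(b) - 1).

```r
# Simulated data only; the smearing correction is NOT something we used
set.seed(42)
n <- 990
qual <- sample(1:10, n, replace = TRUE)
price <- exp(11 + 0.15 * qual + rnorm(n, sd = 0.3))
sim <- data.frame(price, qual)

fit_logged <- lm(log(price) ~ qual, data = sim)

# Naive back-transform: exp() of the log-scale fitted value
naive <- exp(fitted(fit_logged))

# Duan's smearing factor: the average of exp(residual).
# It is at least 1 for a least-squares fit with an intercept.
smear <- mean(exp(resid(fit_logged)))
corrected <- naive * smear

print(round(c(actual_mean = mean(sim$price),
              naive_mean = mean(naive),
              corrected_mean = mean(corrected),
              smear_factor = smear), 3))

# Percent change in price per one-point increase in qual
pct_per_point <- unname(100 * (exp(coef(fit_logged)["qual"]) - 1))
print(round(pct_per_point, 1))
```

The naive predictions average below the actual prices and the corrected ones close most of that gap, which is the bias in a nutshell. Since it takes a paragraph to explain, it supports keeping the log model as a discussed alternative instead of the headline.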
Recommendation: Reduced as our best model, Full noted as marginally more accurate, Small as the simple baseline. The log transform goes in the report as the alternative we tested and chose not to adopt, which meets the methods-comparison requirement without including a technique that may sit outside the scope of this class.
A few open questions are still worth settling Monday.
Once we agree on the final set of models, I will edit our notes and comments and turn this into the comparison workbook the rubric wants.
This is the structure for the graded PDF, with the pieces each section needs. The numbers and figures all come from the sections above, so we are really just arranging work we have already done.
Front matter (page 1). A table of group members and a short note on what each person did. This is required on the first page.
1. Project Goal. State the objective in plain terms: predict a home’s sale price from twelve features and value the five homes in Predict_Houses.csv. Frame it as a simplified version of Zillow’s Zestimate, and use Zillow’s own accuracy history, median error down from about 14 percent to roughly 5 percent, for context.
2. Overview of data and exploratory analysis. Describe the 990 homes, twelve features, and clean data with no missing values. Show the lopsided prices and the correlation chart. Cover the three issues we handled: the lopsided prices that led us to test the log model, the overlap between bedrooms and total rooms that we checked with VIF and fixed by dropping BedroomAbvGr, and the YrSold field that told us nothing and got removed. Note that we chose to leave the few very large lots in.
3. Model output and interpretation. Describe the three models that build from small to full. Read the coefficients in plain dollars, OverallQual at about $23,000 per quality point and so on. Explain the backwards bedroom coefficient as a side effect of holding total rooms constant. Point out which features are not significant. Use the importance chart to show what matters most. Mention the log model as the alternative we tested.
4. Performance. Report cross-validated RMSE and MAE plus Adjusted R-squared, with the comparison table and chart. Explain why Reduced is our pick, about as accurate as Full but simpler. Be honest about the limits: a cross-validated RMSE near $36,000 and MAE near $25,000 for Reduced, about 12 percent mean error on the five test homes, and weakest on very old homes and at the price extremes. This is the home for the full log comparison and the residual chart, plus the three reasons we did not adopt the log model.
5. Predicted sale prices for the five houses. Give the Reduced predictions against the actual prices, supported by the four-model chart. Note House 2, the 1882 home every model overprices, and House 5, the top-quality home where the log alternative runs hot.
Deliverables checklist. One PDF report with the sections above and the contributions table on page 1. A separate Excel or R file with at least three models, their coefficients, and an RMSE and MAE comparison that marks the best one. (Emailed Prof to confirm an R file is OK, since the assignment PDF contradicts itself on this.) A roughly five-minute narrated PowerPoint aimed at investors. Everything zipped as Group_X.zip.
Figure to section map. Section 2 gets the correlation chart and the skew histogram. Section 3 gets the importance chart. Section 4 gets the comparison chart, the log comparison, and the residual chart. Section 5 gets the four-model prediction chart.
One thing to settle. Sections 3 and 4 both touch the log model. We should decide whether it lives mainly in section 3 as an alternative strategy or section 4 as an alternative way to measure performance, so we do not write it up twice. My suggestion is to put it in section 4 with a one-line pointer from section 3.