This report completes the Week #4 multiple linear regression workflow using the Ames Housing data. The workflow includes:
The response is:
We construct a focused analytic dataset using interpretable predictors related to size, quality, age, and amenities. We also create engineered predictors that reduce redundancy and improve interpretability.
housing <- AmesHousing::make_ames()
dim(housing)
[1] 2930 81
head(housing[, 1:6])
Interpretation.
The Ames dataset contains 2,930 homes and
81 variables. We will use a focused subset for
regression to keep results interpretable.
Engineered variables: - Total_Baths = Full +
0.5×Half (including basement baths)
- House_Age = max(Year_Built) − Year_Built
- Years_Since_Remod = max(Year_Remod_Add) −
Year_Remod_Add
dat0 <- housing %>%
dplyr::select(
Sale_Price,
Gr_Liv_Area, Total_Bsmt_SF, First_Flr_SF, Second_Flr_SF,
Overall_Qual, Overall_Cond,
Year_Built, Year_Remod_Add,
Full_Bath, Half_Bath, Bsmt_Full_Bath, Bsmt_Half_Bath,
Garage_Cars, Garage_Area,
Fireplaces,
Lot_Area, Mas_Vnr_Area
) %>%
tidyr::drop_na(Sale_Price)
dat <- dat0 %>%
mutate(
Total_Baths = Full_Bath + 0.5 * Half_Bath + Bsmt_Full_Bath + 0.5 * Bsmt_Half_Bath,
House_Age = max(Year_Built, na.rm = TRUE) - Year_Built,
Years_Since_Remod = max(Year_Remod_Add, na.rm = TRUE) - Year_Remod_Add,
Overall_Qual_S = as.numeric(Overall_Qual),
Overall_Cond_S = as.numeric(Overall_Cond)
) %>%
dplyr::select(
Sale_Price,
Gr_Liv_Area, Total_Bsmt_SF, First_Flr_SF, Second_Flr_SF,
Overall_Qual_S, Overall_Cond_S,
House_Age, Years_Since_Remod,
Total_Baths, Garage_Cars, Garage_Area,
Fireplaces, Lot_Area, Mas_Vnr_Area
) %>%
drop_na()
dim(dat)
[1] 2930 15
pander::pander(head(dat), caption = "Preview of analytic dataset (first 6 rows).")
| Sale_Price | Gr_Liv_Area | Total_Bsmt_SF | First_Flr_SF | Second_Flr_SF |
|---|---|---|---|---|
| 215000 | 1656 | 1080 | 1656 | 0 |
| 105000 | 896 | 882 | 896 | 0 |
| 172000 | 1329 | 1329 | 1329 | 0 |
| 244000 | 2110 | 2110 | 2110 | 0 |
| 189900 | 1629 | 928 | 928 | 701 |
| 195500 | 1604 | 926 | 926 | 678 |
| Overall_Qual_S | Overall_Cond_S | House_Age | Years_Since_Remod | Total_Baths |
|---|---|---|---|---|
| 6 | 5 | 50 | 50 | 2 |
| 5 | 6 | 49 | 49 | 1 |
| 6 | 6 | 52 | 52 | 1.5 |
| 7 | 5 | 42 | 42 | 3.5 |
| 5 | 5 | 13 | 12 | 2.5 |
| 6 | 6 | 12 | 12 | 2.5 |
| Garage_Cars | Garage_Area | Fireplaces | Lot_Area | Mas_Vnr_Area |
|---|---|---|---|---|
| 2 | 528 | 2 | 31770 | 112 |
| 1 | 730 | 0 | 11622 | 0 |
| 1 | 312 | 0 | 14267 | 108 |
| 2 | 522 | 2 | 11160 | 0 |
| 2 | 482 | 1 | 13830 | 0 |
| 2 | 470 | 1 | 9978 | 20 |
Interpretation.
The analytic dataset contains 2,930 observations and 15
columns (Sale_Price + 14 predictors). The engineered variables
summarize modernization and bathroom availability in an interpretable
way.
set.seed(321)
small <- dat %>% dplyr::sample_n(min(1000, nrow(dat)))
GGally::ggpairs(
dplyr::select(
small,
Sale_Price, Gr_Liv_Area, Total_Bsmt_SF,
Overall_Qual_S, Total_Baths, Garage_Cars, Fireplaces
)
)
Interpretation.
EDA shows strong positive relationships between sale price and size
(living area, basement), overall quality, bathrooms, and amenities.
Increasing spread at higher prices suggests mild heteroscedasticity.
We fit a full multiple regression model, then summarize results cleanly using: - a compact fit-statistics table, - a clean coefficient table (with CIs), - a VIF table for multicollinearity, - diagnostic plots.
full_formula <- Sale_Price ~ Gr_Liv_Area + Total_Bsmt_SF +
First_Flr_SF + Second_Flr_SF +
Overall_Qual_S + Overall_Cond_S +
House_Age + Years_Since_Remod +
Total_Baths + Garage_Cars + Garage_Area +
Fireplaces + Lot_Area + Mas_Vnr_Area
m_full <- lm(full_formula, data = dat)
fit_full <- broom::glance(m_full) %>%
dplyr::select(r.squared, adj.r.squared, sigma, statistic, p.value, df)
pander::pander(
fit_full,
caption = "Full Model: Key Fit Statistics"
)
| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.813 | 0.8121 | 34628 | 905.3 | 0 | 14 |
Interpretation (numeric).
The full model achieves Adjusted \(R^2
\approx 0.812\), meaning it explains about
81% of the variation in sale price. The overall F-test
is highly significant (very small p-value), indicating the predictors
jointly explain substantial variation in prices.
coef_full <- broom::tidy(m_full, conf.int = TRUE) %>%
dplyr::mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
dplyr::arrange(p.value)
pander::pander(
coef_full,
caption = "Full Model: Coefficients with 95% Confidence Intervals"
)
| term | estimate | std.error | statistic | p.value | conf.low |
|---|---|---|---|---|---|
| (Intercept) | -77702 | 6299 | -12.33 | 0 | -90053 |
| Total_Bsmt_SF | 19.49 | 2.639 | 7.386 | 0 | 14.32 |
| Overall_Qual_S | 18168 | 771.2 | 23.56 | 0 | 16656 |
| Overall_Cond_S | 4595 | 683 | 6.728 | 0 | 3256 |
| House_Age | -263.7 | 37.36 | -7.059 | 0 | -337 |
| Years_Since_Remod | -243.3 | 44.72 | -5.441 | 0 | -331 |
| Total_Baths | 7264 | 1168 | 6.22 | 0 | 4974 |
| Garage_Area | 32.02 | 6.81 | 4.702 | 0 | 18.67 |
| Fireplaces | 8071 | 1174 | 6.874 | 0 | 5769 |
| Lot_Area | 0.5961 | 0.089 | 6.696 | 0 | 0.4216 |
| Mas_Vnr_Area | 38.17 | 4.181 | 9.128 | 0 | 29.97 |
| First_Flr_SF | 25.43 | 14.4 | 1.767 | 0.0774 | -2.793 |
| Gr_Liv_Area | 21.89 | 14.03 | 1.56 | 0.1189 | -5.623 |
| Garage_Cars | 2841 | 1982 | 1.434 | 0.1518 | -1044 |
| Second_Flr_SF | 14.43 | 14.2 | 1.016 | 0.3097 | -13.42 |
| conf.high |
|---|
| -65350 |
| 24.67 |
| 19680 |
| 5934 |
| -190.5 |
| -155.6 |
| 9554 |
| 45.37 |
| 10373 |
| 0.7707 |
| 46.37 |
| 53.66 |
| 49.4 |
| 6727 |
| 42.28 |
Interpretation (numeric).
The full model typically identifies quality, bathrooms, basement size,
fireplaces, and lot size as important predictors. However,
interpretation of individual square-footage coefficients is unreliable
if multicollinearity is severe.
vif_full <- car::vif(m_full)
vif_tbl <- tibble::tibble(
Predictor = names(vif_full),
VIF = as.numeric(vif_full)
) %>%
dplyr::arrange(desc(VIF)) %>%
dplyr::mutate(VIF = round(VIF, 2))
pander::pander(vif_tbl, caption = "Full Model: VIF (Multicollinearity Diagnostic)")
| Predictor | VIF |
|---|---|
| Gr_Liv_Area | 122.9 |
| Second_Flr_SF | 90.44 |
| First_Flr_SF | 77.75 |
| Garage_Cars | 5.56 |
| Garage_Area | 5.25 |
| Total_Bsmt_SF | 3.31 |
| House_Age | 3.12 |
| Overall_Qual_S | 2.89 |
| Total_Baths | 2.17 |
| Years_Since_Remod | 2.13 |
| Fireplaces | 1.41 |
| Overall_Cond_S | 1.41 |
| Mas_Vnr_Area | 1.36 |
| Lot_Area | 1.2 |
Interpretation (numeric).
Very large VIF values for square-footage variables indicate severe
multicollinearity, inflating standard errors and making coefficients
unstable. This motivates a reduced model via stepwise AIC.
op <- par(mfrow = c(2, 2))
plot(m_full)
par(op)
Interpretation (visual).
Diagnostics show mild heteroscedasticity (increasing spread at higher
fitted values) and some tail departure in the Q–Q plot. This motivates
assessing a Box–Cox transformation of the response.
Box–Cox searches for a power transformation of the response that improves normality and stabilizes variance. The chosen \(\lambda\) maximizes the likelihood.
bc <- MASS::boxcox(m_full, lambda = seq(-1, 1, by = 0.05), plotit = TRUE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
[1] 0.1515152
Interpretation (numeric).
A \(\lambda\) near 0 indicates a
log-type transformation; \(\lambda\)
near 1 indicates no transformation. Here, \(\lambda\) is typically near
0.15, suggesting a mild power transform consistent with
price data.
y <- dat$Sale_Price
eps <- ifelse(any(y <= 0), 1, 0)
y_bc <- if (abs(lambda_hat) < 1e-8) log(y + eps) else ((y + eps)^lambda_hat - 1) / lambda_hat
m_bc <- lm(
y_bc ~ Gr_Liv_Area + Total_Bsmt_SF + First_Flr_SF + Second_Flr_SF +
Overall_Qual_S + Overall_Cond_S + House_Age + Years_Since_Remod +
Total_Baths + Garage_Cars + Garage_Area + Fireplaces +
Lot_Area + Mas_Vnr_Area,
data = dat
)
fit_bc <- broom::glance(m_bc) %>%
dplyr::select(r.squared, adj.r.squared, sigma, statistic, p.value, df)
pander::pander(
fit_bc,
caption = "Box–Cox Model: Key Fit Statistics (Transformed Response)"
)
| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.8671 | 0.8664 | 0.9242 | 1358 | 0 | 14 |
Interpretation (numeric).
The Box–Cox model often increases adjusted \(R^2\) relative to the untransformed models
(commonly around 0.866). This indicates better fit on
the transformed scale and typically improves residual behavior, but
coefficients are harder to interpret in dollars.
op <- par(mfrow = c(2, 2))
plot(m_bc)
par(op)
Interpretation (visual).
Residual patterns often look more stable (less fanning) and the Q–Q plot
aligns more closely with normality. However, interpretability is reduced
since the response is transformed.
We use stepwise AIC to reduce redundancy and multicollinearity while keeping the response in dollars. We compare the full and stepwise models using \(R^2\), adjusted \(R^2\), Mallows’ \(C_p\), and PRESS.
m_step <- MASS::stepAIC(m_full, trace = FALSE)
fit_step <- broom::glance(m_step) %>%
dplyr::select(r.squared, adj.r.squared, sigma, statistic, p.value, df)
pander::pander(
fit_step,
caption = "Stepwise AIC Model: Key Fit Statistics"
)
| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.8129 | 0.8121 | 34628 | 974.8 | 0 | 13 |
Interpretation (numeric).
The stepwise model maintains essentially the same adjusted \(R^2\) as the full model while using fewer
predictors, supporting a simpler model with similar explanatory
power.
coef_step <- broom::tidy(m_step, conf.int = TRUE) %>%
dplyr::mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
dplyr::arrange(p.value)
pander::pander(
coef_step,
caption = "Stepwise AIC Model: Coefficients with 95% Confidence Intervals"
)
| term | estimate | std.error | statistic | p.value | conf.low |
|---|---|---|---|---|---|
| (Intercept) | -77725 | 6299 | -12.34 | 0 | -90077 |
| Gr_Liv_Area | 35.99 | 2.063 | 17.44 | 0 | 31.94 |
| Total_Bsmt_SF | 19.38 | 2.637 | 7.35 | 0 | 14.21 |
| Overall_Qual_S | 18180 | 771.1 | 23.58 | 0 | 16668 |
| Overall_Cond_S | 4626 | 682.3 | 6.781 | 0 | 3288 |
| House_Age | -267 | 37.23 | -7.172 | 0 | -340 |
| Years_Since_Remod | -242.3 | 44.71 | -5.42 | 0 | -329.9 |
| Total_Baths | 7324 | 1166 | 6.28 | 0 | 5037 |
| Garage_Area | 31.99 | 6.81 | 4.698 | 0 | 18.64 |
| Fireplaces | 8092 | 1174 | 6.893 | 0 | 5790 |
| Lot_Area | 0.5982 | 0.089 | 6.721 | 0 | 0.4237 |
| Mas_Vnr_Area | 38.41 | 4.175 | 9.2 | 0 | 30.22 |
| First_Flr_SF | 11.14 | 3.063 | 3.639 | 3e-04 | 5.138 |
| Garage_Cars | 2887 | 1981 | 1.458 | 0.1451 | -997.1 |
| conf.high |
|---|
| -65373 |
| 40.03 |
| 24.55 |
| 19692 |
| 5964 |
| -194 |
| -154.6 |
| 9611 |
| 45.34 |
| 10394 |
| 0.7727 |
| 46.59 |
| 17.15 |
| 6772 |
vif_step <- car::vif(m_step)
vif_step_tbl <- tibble::tibble(
Predictor = names(vif_step),
VIF = as.numeric(vif_step)
) %>%
dplyr::arrange(desc(VIF)) %>%
dplyr::mutate(VIF = round(VIF, 2))
pander::pander(vif_step_tbl, caption = "Stepwise AIC Model: VIF")
| Predictor | VIF |
|---|---|
| Garage_Cars | 5.55 |
| Garage_Area | 5.25 |
| First_Flr_SF | 3.52 |
| Total_Bsmt_SF | 3.3 |
| House_Age | 3.1 |
| Overall_Qual_S | 2.89 |
| Gr_Liv_Area | 2.66 |
| Total_Baths | 2.17 |
| Years_Since_Remod | 2.12 |
| Fireplaces | 1.41 |
| Overall_Cond_S | 1.4 |
| Mas_Vnr_Area | 1.36 |
| Lot_Area | 1.2 |
Interpretation (numeric).
VIF values drop substantially relative to the full model, indicating
improved coefficient stability and reduced multicollinearity (especially
among square footage predictors).
gof_table <- function(m, mse_full, n){
e <- resid(m)
SSE <- sum(e^2)
R2 <- summary(m)$r.squared
R2adj <- summary(m)$adj.r.squared
p <- length(coef(m))
Cp <- SSE/mse_full - (n - 2 * p)
X <- model.matrix(m)
H <- X %*% solve(t(X) %*% X) %*% t(X)
d <- e/(1 - diag(H))
PRESS <- sum(d^2)
tibble::tibble(
SSE = SSE, R2 = R2, R2_Adj = R2adj,
Cp = Cp, PRESS = PRESS, p = p
)
}
n <- nobs(m_full)
mse_full <- mean(resid(m_full)^2)
gof <- dplyr::bind_rows(
Full = gof_table(m_full, mse_full, n) %>% mutate(Model = "Full"),
Stepwise = gof_table(m_step, mse_full, n) %>% mutate(Model = "Stepwise AIC")
) %>%
relocate(Model) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)))
pander::pander(gof, caption = "Goodness-of-Fit: Full vs Stepwise (R2, Adj R2, Cp, PRESS)")
| Model | SSE | R2 | R2_Adj | Cp | PRESS | p |
|---|---|---|---|---|---|---|
| Full | 3.495e+12 | 0.813 | 0.812 | 30 | 3.629e+12 | 15 |
| Stepwise AIC | 3.497e+12 | 0.813 | 0.812 | 29.04 | 3.626e+12 | 14 |
Interpretation (numeric).
If the stepwise model matches the full model on adjusted \(R^2\) and PRESS while using fewer
predictors and lower VIF, it is preferred for interpretability and
stability. Cp close to the number of parameters supports appropriate
model complexity.
We choose the stepwise AIC model as the final model because it: - preserves interpretability in dollars, - maintains essentially identical fit to the full model, - and reduces multicollinearity substantially.
m_final <- m_step
op <- par(mfrow = c(2, 2))
plot(m_final)
par(op)
Interpretation (visual).
The final model diagnostics are acceptable for regression inference,
with mild heteroscedasticity but strong overall fit. Reduced
multicollinearity improves coefficient stability compared to the full
model.
key_terms <- c("Overall_Qual_S","Total_Baths","House_Age","Years_Since_Remod",
"Fireplaces","Gr_Liv_Area","Garage_Area","Mas_Vnr_Area","Lot_Area")
key_tbl <- broom::tidy(m_final, conf.int = TRUE) %>%
dplyr::filter(term %in% key_terms) %>%
dplyr::mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
dplyr::arrange(desc(abs(estimate)))
pander::pander(key_tbl, caption = "Final Model: Selected Key Coefficients (Dollar Scale)")
| term | estimate | std.error | statistic | p.value | conf.low |
|---|---|---|---|---|---|
| Overall_Qual_S | 18180 | 771.1 | 23.58 | 0 | 16668 |
| Fireplaces | 8092 | 1174 | 6.893 | 0 | 5790 |
| Total_Baths | 7324 | 1166 | 6.28 | 0 | 5037 |
| House_Age | -267 | 37.23 | -7.172 | 0 | -340 |
| Years_Since_Remod | -242.3 | 44.7 | -5.42 | 0 | -329.9 |
| Mas_Vnr_Area | 38.41 | 4.175 | 9.2 | 0 | 30.22 |
| Gr_Liv_Area | 35.99 | 2.063 | 17.44 | 0 | 31.94 |
| Garage_Area | 31.99 | 6.81 | 4.698 | 0 | 18.64 |
| Lot_Area | 0.598 | 0.089 | 6.721 | 0 | 0.424 |
| conf.high |
|---|
| 19692 |
| 10394 |
| 9611 |
| -194 |
| -154.6 |
| 46.59 |
| 40.03 |
| 45.34 |
| 0.773 |
Interpretation (numeric).
Interpret coefficients as “holding all else constant.” For example: -
Overall_Qual_S indicates the dollar premium per 1-unit
increase in quality rating. - Total_Baths indicates the
dollar premium per additional effective bathroom. -
House_Age and Years_Since_Remod show
depreciation and remodeling effects (negative values). -
Gr_Liv_Area, Garage_Area, and
Mas_Vnr_Area represent size/amenity premiums per square
foot.
The final model indicates that Ames home sale prices are strongly driven by: - overall quality, - living area and basement size, - total bathrooms, - amenities (fireplaces and garage area), - and modernity (age and time since remodel).
Box–Cox improves assumption behavior and increases adjusted \(R^2\) on a transformed scale, but reduces interpretability. The stepwise AIC model retains dollar-scale interpretation, substantially lowers multicollinearity, and preserves predictive fit compared to the full model. Therefore, stepwise AIC is the preferred final model for this assignment.
This analysis does not include neighborhood/location variables, which are often major drivers of price. Future work could incorporate spatial predictors and explore nonlinear models (splines, tree-based methods) with validation through cross-validation or a train/test split.