This week I’m building on last week’s simple linear regression model by adding more variables to improve predictions and understand the factors that drive home prices. More importantly, I’ll rigorously diagnose the model using the 5 diagnostic plots covered in class to ensure it meets the assumptions required for valid inference.
A model that violates key assumptions cannot be trusted for making conclusions, no matter how good its R² looks.
Last week’s model was: SalePrice = β₀ + β₁ × Gr.Liv.Area (R² = 0.500)
This week I’ll expand it to a multiple regression model and thoroughly check whether it’s valid.
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)
# Recreate Week 8's simple model
base_model <- lm(SalePrice ~ Gr.Liv.Area, data = ames)
summary_base <- summary(base_model)
Week 8 Base Model Summary:
This model explains 50% of price variation. The other ~50% comes from factors we haven’t modeled yet.
# Calculate correlations with SalePrice
candidates <- c("Overall.Qual", "Year.Built", "Garage.Area", "Total.Bsmt.SF")
correlations <- sapply(candidates, function(var) {
cor(ames[[var]], ames$SalePrice, use = "complete.obs")
})
# Also check correlation with Gr.Liv.Area (for multicollinearity)
correlations_area <- sapply(candidates, function(var) {
cor(ames[[var]], ames$Gr.Liv.Area, use = "complete.obs")
})
cor_table <- data.frame(
Variable = candidates,
Correlation_with_Price = correlations,
Correlation_with_Area = correlations_area
) |>
arrange(desc(abs(Correlation_with_Price)))
kable(cor_table,
col.names = c("Variable", "r with SalePrice", "r with Gr.Liv.Area"),
caption = "Potential Variables to Add",
digits = 3)
| Variable | r with SalePrice | r with Gr.Liv.Area | |
|---|---|---|---|
| Overall.Qual | Overall.Qual | 0.799 | 0.571 |
| Garage.Area | Garage.Area | 0.640 | 0.485 |
| Total.Bsmt.SF | Total.Bsmt.SF | 0.632 | 0.445 |
| Year.Built | Year.Built | 0.558 | 0.242 |
Why include it:
Multicollinearity concern:
DECISION: Include Overall.Qual ✓
# Create binary variable
ames <- ames |>
mutate(Has_Central_Air = if_else(Central.Air == "Y", 1, 0))
Why include it:
Multicollinearity concern:
corr_ac_area <- cor(ames$Has_Central_Air, ames$Gr.Liv.Area)
corr_ac_qual <- cor(ames$Has_Central_Air, ames$Overall.Qual)
DECISION: Include Has_Central_Air ✓
Theoretical justification:
# Fit model with and without interaction
model_no_interaction <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air,
data = ames)
model_with_interaction <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air +
Gr.Liv.Area:Overall.Qual,
data = ames)
# Compare models with ANOVA
anova_result <- anova(model_no_interaction, model_with_interaction)
p_value_interaction <- anova_result$`Pr(>F)`[2]
Testing if interaction is needed:
kable(anova_result,
caption = "ANOVA F-test: Comparing Models With and Without Interaction",
digits = c(0, 0, 0, 2, 2, 4))
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 2926 | 4957505742213 | ||||
| 2925 | 4486148928825 | 1 | 471356813388 | 307.33 | 0 |
The interaction term has p = 1.664e-65. Since p >= 0.05, the interaction term is significant.
DECISION: Include interaction term to keep the model simpler and more interpretable.
# Use model without interaction for simplicity
final_model <- model_no_interaction
Final Model: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air
This model has 4 parameters (including intercept).
summary_final <- summary(final_model)
print(summary_final)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air,
## data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -393423 -23038 -1377 19705 284783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -120952.256 3951.472 -30.609 < 2e-16 ***
## Gr.Liv.Area 59.275 1.834 32.312 < 2e-16 ***
## Overall.Qual 32245.142 680.880 47.358 < 2e-16 ***
## Has_Central_Air 17493.918 3181.808 5.498 0.0000000417 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41160 on 2926 degrees of freedom
## Multiple R-squared: 0.7348, Adjusted R-squared: 0.7345
## F-statistic: 2702 on 3 and 2926 DF, p-value: < 2.2e-16
VIF measures how much variance in a coefficient is inflated due to multicollinearity with other predictors.
Rule of thumb:
vif_values <- vif(final_model)
vif_df <- data.frame(
Variable = names(vif_values),
VIF = vif_values
) |>
arrange(desc(VIF))
kable(vif_df,
col.names = c("Variable", "VIF"),
caption = "Variance Inflation Factors",
digits = 2,
row.names = FALSE)
| Variable | VIF |
|---|---|
| Overall.Qual | 1.60 |
| Gr.Liv.Area | 1.49 |
| Has_Central_Air | 1.09 |
Assessment: All VIF values are < 5, indicating no multicollinearity concerns. Coefficients are interpretable and standard errors are not inflated.
Now I’ll use the 5 diagnostic plots covered in class to check model assumptions. Each assumption must be met for the model to provide valid inferences.
gg_resfitted(final_model) +
geom_smooth(se = FALSE, color = "blue", linewidth = 1) +
labs(title = "Diagnostic Plot 1: Residuals vs. Fitted Values",
subtitle = "Checking for constant variance and linearity")
What to look for:
What I observe:
The plot shows residuals scattered around the zero line. The blue smoothed curve is mostly flat near zero, which is good for linearity. However, there’s a noticeable “fanning” effect from left to right—the spread of residuals increases for higher fitted values (more expensive homes).
Assessment:
Implications: Standard errors might be underestimated for expensive homes, making confidence intervals and p-values less reliable at the high end.
plots <- gg_resX(final_model, plot.all = FALSE)
# Plot for each predictor
plots$Gr.Liv.Area +
geom_smooth(se = FALSE, color = "blue") +
labs(title = "Residuals vs. Living Area")
plots$Overall.Qual +
geom_smooth(se = FALSE, color = "blue") +
labs(title = "Residuals vs. Overall Quality")
plots$Has_Central_Air +
geom_smooth(se = FALSE, color = "blue") +
labs(title = "Residuals vs. Has Central Air")
What to look for:
What I observe:
Gr.Liv.Area: Some fanning—residuals spread more at larger living areas. This confirms the heteroscedasticity we saw in Plot 1. The blue smoothed line is mostly flat, supporting linearity.
Overall.Qual: Residuals appear more evenly distributed across quality levels. The blue line is close to zero with slight waviness, suggesting the linear relationship is appropriate.
Has_Central_Air: Binary variable shows two vertical clusters (0 vs. 1). Both clusters have similar spread, suggesting this variable doesn’t introduce additional heteroscedasticity.
Assessment:
Interpretation: The heteroscedasticity issue is primarily driven by living area (size), not quality or AC. Larger homes naturally have more price variation.
# Select only predictor variables (not response)
predictor_vars <- ames |>
select(Gr.Liv.Area, Overall.Qual, Has_Central_Air)
ggcorr(predictor_vars, label = TRUE, label_size = 4, hjust = 0.75) +
labs(title = "Diagnostic Plot 3: Correlation Heatmap of Predictors",
subtitle = "Checking for multicollinearity among explanatory variables")
What to look for:
What I observe:
The heatmap shows pairwise correlations between all predictors:
Assessment:
Conclusion: All predictors contribute unique information. The VIF values we calculated earlier (all < 2) confirm this visual assessment.
gg_reshist(final_model) +
labs(title = "Diagnostic Plot 4: Histogram of Residuals",
subtitle = "Checking if residuals are approximately normally distributed")
What to look for:
What I observe:
The histogram shows a roughly bell-shaped distribution centered near zero. However, there’s a slight asymmetry—the right tail (positive residuals) extends further than the left tail (negative residuals), indicating slight right skew.
Assessment:
Note: Per class guidance, “residuals just need to be normal enough.” This histogram suggests they are.
gg_qqplot(final_model) +
labs(title = "Diagnostic Plot 5: Normal Q-Q Plot",
subtitle = "Comparing residual distribution to theoretical normal distribution")
What to look for:
What I observe:
Points follow the diagonal line well in the center of the distribution, but deviate in the tails:
This S-shaped pattern indicates heavier right tail (positive residuals more extreme than normal distribution predicts).
Assessment:
Balancing with histogram (per class guidance): The histogram shows approximately normal distribution. The Q-Q plot shows tail deviations but Q-Q plots exaggerate small differences. Together, they suggest residuals are “normal enough.”
Conclusion: Normality assumption is reasonably met for valid inference.
gg_cooksd(final_model, threshold = 'matlab') +
labs(title = "BONUS: Cook's Distance by Observation",
subtitle = "Identifying influential points (threshold = Matlab convention)")
What to look for:
What I observe:
A few observations have Cook’s D values above the threshold, indicated by their row numbers. These are homes that have high influence on the regression model—meaning they’re unusual (high leverage) AND poorly predicted (large residuals).
Assessment:
Recommendation: Could investigate these specific homes to understand what makes them unusual. However, we should NOT remove them without good reason—they’re legitimate sales, just atypical.
comparison <- data.frame(
Metric = c("R-squared", "Adjusted R-squared", "RMSE", "Predictors"),
Base_Model = c(
summary_base$r.squared,
summary_base$adj.r.squared,
sqrt(mean(residuals(base_model)^2)),
length(coef(base_model)) - 1 # exclude intercept
),
Final_Model = c(
summary_final$r.squared,
summary_final$adj.r.squared,
sqrt(mean(residuals(final_model)^2)),
length(coef(final_model)) - 1
)
) |>
mutate(
Improvement = Final_Model - Base_Model,
Pct_Change = round((Final_Model - Base_Model) / Base_Model * 100, 1)
)
kable(comparison,
digits = 3,
caption = "Base Model vs. Final Multiple Regression Model")
| Metric | Base_Model | Final_Model | Improvement | Pct_Change |
|---|---|---|---|---|
| R-squared | 0.500 | 0.735 | 0.235 | 47.1 |
| Adjusted R-squared | 0.499 | 0.735 | 0.235 | 47.1 |
| RMSE | 56504.877 | 41133.703 | -15371.174 | -27.2 |
| Predictors | 1.000 | 3.000 | 2.000 | 200.0 |
Key improvements:
coef_table <- tidy(final_model, conf.int = TRUE)
kable(coef_table,
digits = c(0, 2, 2, 2, 4, 2, 2),
caption = "Model Coefficients with 95% Confidence Intervals")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -120952.26 | 3951.47 | -30.61 | 0 | -128700.20 | -113204.31 |
| Gr.Liv.Area | 59.28 | 1.83 | 32.31 | 0 | 55.68 | 62.87 |
| Overall.Qual | 32245.14 | 680.88 | 47.36 | 0 | 30910.09 | 33580.19 |
| Has_Central_Air | 17493.92 | 3181.81 | 5.50 | 0 | 11255.11 | 23732.73 |
Practical interpretations:
coeffs <- coef(final_model)
Living Area: Each additional square foot adds approximately $59.28 to price (holding quality and AC constant). This is less than the simple model’s $111.69 per sq ft because we’re now accounting for quality and AC, which were confounded before.
Overall Quality: Each 1-point quality increase adds approximately $32,245.14 to price (holding size and AC constant). Moving from quality 5 to quality 8 (3 points) adds about $96,735.42.
Central Air: Having central air adds approximately $17,493.92 to price (holding size and quality constant). This is much less than the $84,562 from Week 7’s simple comparison because that number was confounded with size and quality.
The 5 diagnostic plots reveal:
Residuals vs. Fitted: Linearity mostly met (70% confidence), but heteroscedasticity present—variance increases with price (40% confidence in constant variance)
Residuals vs. X Values: Linearity confirmed across predictors (65% confidence), heteroscedasticity primarily from living area variable
Correlation Heatmap: No multicollinearity concerns (95% confidence). All correlations < 0.6, VIF values < 2
Residual Histogram: Approximately normal with slight right skew (75% confidence in normality). Distribution is “normal enough” per class guidance
Normal Q-Q Plot: Good fit in center, tail deviations present but Q-Q plots are oversensitive (70% confidence in normality when balanced with histogram)
BONUS - Cook’s Distance: Few influential outliers, don’t dominate model (85% confidence)
Overall Confidence: 75%
The model IS valid for inference but not perfect.
Strengths:
Weaknesses:
Implications:
The key quote from the assignment: “We cannot draw conclusions from an invalid model.”
My conclusion: This model IS valid for drawing conclusions. The violations are mild enough that inferences are trustworthy. We know where the model is weakest (expensive homes with high variance) and could address this with transformations (like log-transforming price) in future work.
Would log-transforming SalePrice fix the heteroscedasticity? Log(Price) often stabilizes variance but makes coefficients harder to interpret.
Are the influential outliers a distinct subgroup? Could fit separate models for luxury vs. standard homes, or add variables capturing their uniqueness.
Is there a Neighborhood effect we’re missing? Location might explain some of the remaining 26% of unexplained variance.
Do relationships change over time? Dataset spans 2006-2010 (financial crisis period). Coefficients might differ by year.
Are there non-linear relationships? Maybe quality has diminishing returns at high levels, or size effects are non-linear.
Regression diagnostics are ESSENTIAL. Without them, we’d never know:
A model with R² = 0.74 looks great, but diagnostics revealed it’s not perfect. The violations are mild, so our conclusions are still valid, but we now understand WHERE the model is weakest and what to improve next.
As emphasized in class: We cannot draw conclusions from an invalid model. These diagnostics confirm our model IS valid for inference, just not flawless. That’s acceptable for real-world data analysis.