Ames Housing Data Dive Week 9: Multiple Regression and Diagnostics

Introduction

This week I’m building on last week’s simple linear regression model by adding more variables to improve predictions and understand the factors that drive home prices. More importantly, I’ll rigorously diagnose the model to ensure it meets the assumptions required for valid inference. A model that violates key assumptions cannot be trusted for making conclusions, no matter how good its R² looks.

Last week’s model was: SalePrice = β₀ + β₁ × Gr.Liv.Area

This week I’ll expand it to a multiple regression model and thoroughly check whether it’s valid.

Loading Data and Base Model Review

ames <- read.csv("ames.csv", stringsAsFactors = FALSE)

# Recreate Week 8's simple model
base_model <- lm(SalePrice ~ Gr.Liv.Area, data = ames)

cat("=== WEEK 8 BASE MODEL RECAP ===\n\n")

## === WEEK 8 BASE MODEL RECAP ===

cat("Model: SalePrice ~ Gr.Liv.Area\n")

## Model: SalePrice ~ Gr.Liv.Area

summary_base <- summary(base_model)
cat("R-squared:", round(summary_base$r.squared, 3), "\n")

## R-squared: 0.5

cat("Adjusted R-squared:", round(summary_base$adj.r.squared, 3), "\n")

## Adjusted R-squared: 0.499

cat("RMSE:", dollar(sqrt(mean(residuals(base_model)^2))), "\n\n")

## RMSE: $56,504.88

cat("This model explains", round(summary_base$r.squared * 100, 1), 
    "% of price variation.\n")

## This model explains 50 % of price variation.

cat("The other ~50% comes from factors we haven't modeled yet.\n")

## The other ~50% comes from factors we haven't modeled yet.

Part 1: Selecting Additional Variables

Candidate Variables

# Calculate correlations with SalePrice
candidates <- c("Overall.Qual", "Year.Built", "Garage.Area", "Total.Bsmt.SF")
correlations <- sapply(candidates, function(var) {
  cor(ames[[var]], ames$SalePrice, use = "complete.obs")
})

# Also check correlation with Gr.Liv.Area (for multicollinearity)
correlations_area <- sapply(candidates, function(var) {
  cor(ames[[var]], ames$Gr.Liv.Area, use = "complete.obs")
})

cor_table <- data.frame(
  Variable = candidates,
  Correlation_with_Price = correlations,
  Correlation_with_Area = correlations_area
) |>
  arrange(desc(abs(Correlation_with_Price)))

kable(cor_table,
      col.names = c("Variable", "r with SalePrice", "r with Gr.Liv.Area"),
      caption = "Potential Variables to Add",
      digits = 3)

Potential Variables to Add
	Variable	r with SalePrice	r with Gr.Liv.Area
Overall.Qual	Overall.Qual	0.799	0.571
Garage.Area	Garage.Area	0.640	0.485
Total.Bsmt.SF	Total.Bsmt.SF	0.632	0.445
Year.Built	Year.Built	0.558	0.242

Variable 1: Overall Quality (Ordered Integer)

cat("=== ADDING OVERALL.QUAL ===\n\n")

## === ADDING OVERALL.QUAL ===

cat("Why include it:\n")

## Why include it:

cat("- Strongest correlation with SalePrice (r = 0.799)\n")

## - Strongest correlation with SalePrice (r = 0.799)

cat("- Represents construction/finish quality (1-10 scale)\n")

## - Represents construction/finish quality (1-10 scale)

cat("- Theoretically independent from size: you can have large low-quality\n")

## - Theoretically independent from size: you can have large low-quality

cat("  homes or small high-quality homes\n")

##   homes or small high-quality homes

cat("- Week 6 showed quality affects price per square foot\n\n")

## - Week 6 showed quality affects price per square foot

cat("Multicollinearity concern:\n")

## Multicollinearity concern:

cat("- Correlation with Gr.Liv.Area = 0.571 (moderate)\n")

## - Correlation with Gr.Liv.Area = 0.571 (moderate)

cat("- This is below the typical 0.7-0.8 threshold for serious multicollinearity\n")

## - This is below the typical 0.7-0.8 threshold for serious multicollinearity

cat("- Makes sense: larger homes tend to be higher quality, but not always\n\n")

## - Makes sense: larger homes tend to be higher quality, but not always

cat("DECISION: Include Overall.Qual ✓\n")

## DECISION: Include Overall.Qual ✓

Variable 2: Central Air (Binary Term)

# Create binary variable
ames <- ames |>
  mutate(Has_Central_Air = if_else(Central.Air == "Y", 1, 0))

cat("=== ADDING HAS_CENTRAL_AIR (BINARY) ===\n\n")

## === ADDING HAS_CENTRAL_AIR (BINARY) ===

cat("Why include it:\n")

## Why include it:

cat("- Week 7 showed homes with AC sell for $84,562 more on average\n")

## - Week 7 showed homes with AC sell for $84,562 more on average

cat("- Binary variable (1 = has AC, 0 = no AC)\n")

## - Binary variable (1 = has AC, 0 = no AC)

cat("- Represents a specific feature independent of size and quality\n")

## - Represents a specific feature independent of size and quality

cat("- 93% of homes have AC, so this captures a premium for a modern necessity\n\n")

## - 93% of homes have AC, so this captures a premium for a modern necessity

cat("Multicollinearity concern:\n")

## Multicollinearity concern:

corr_ac_area <- cor(ames$Has_Central_Air, ames$Gr.Liv.Area)
corr_ac_qual <- cor(ames$Has_Central_Air, ames$Overall.Qual)
cat("- Correlation with Gr.Liv.Area =", round(corr_ac_area, 3), "(weak)\n")

## - Correlation with Gr.Liv.Area = 0.123 (weak)

cat("- Correlation with Overall.Qual =", round(corr_ac_qual, 3), "(weak)\n")

## - Correlation with Overall.Qual = 0.287 (weak)

cat("- No serious multicollinearity concerns\n\n")

## - No serious multicollinearity concerns

cat("DECISION: Include Has_Central_Air ✓\n")

## DECISION: Include Has_Central_Air ✓

Variable 3: Interaction Term (Quality × Living Area)

cat("=== CONSIDERING INTERACTION: OVERALL.QUAL × GR.LIV.AREA ===\n\n")

## === CONSIDERING INTERACTION: OVERALL.QUAL × GR.LIV.AREA ===

cat("Theoretical justification:\n")

## Theoretical justification:

cat("- The value of square footage might depend on quality\n")

## - The value of square footage might depend on quality

cat("- A 100 sq ft increase in a luxury home (quality 10) might add more value\n")

## - A 100 sq ft increase in a luxury home (quality 10) might add more value

cat("  than the same increase in a basic home (quality 5)\n")

##   than the same increase in a basic home (quality 5)

cat("- Conversely, quality improvements might matter more in larger homes\n\n")

## - Conversely, quality improvements might matter more in larger homes

cat("Testing if interaction is needed:\n")

## Testing if interaction is needed:

# Fit model with and without interaction
model_no_interaction <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air, 
                          data = ames)
model_with_interaction <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air +
                            Gr.Liv.Area:Overall.Qual,
                            data = ames)

# Compare models with ANOVA
anova_result <- anova(model_no_interaction, model_with_interaction)
cat("ANOVA F-test comparing models:\n")

## ANOVA F-test comparing models:

print(anova_result)

## Analysis of Variance Table
## 
## Model 1: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air
## Model 2: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air + Gr.Liv.Area:Overall.Qual
##   Res.Df           RSS Df    Sum of Sq      F    Pr(>F)    
## 1   2926 4957505742213                                     
## 2   2925 4486148928825  1 471356813388 307.33 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p_value_interaction <- anova_result$`Pr(>F)`[2]
if(!is.na(p_value_interaction) && p_value_interaction < 0.05) {
  cat("\n✓ Interaction term is significant (p < 0.05)\n")
  cat("DECISION: Include interaction term\n")
  use_interaction <- TRUE
} else {
  cat("\n✗ Interaction term not significant (p >= 0.05)\n")
  cat("DECISION: Exclude interaction term (keep model simpler)\n")
  use_interaction <- FALSE
}

## 
## ✓ Interaction term is significant (p < 0.05)
## DECISION: Include interaction term

Final Model Selection

cat("\n=== FINAL MODEL ===\n\n")

## 
## === FINAL MODEL ===

if(use_interaction) {
  final_model <- model_with_interaction
  cat("Model: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air +\n")
  cat("                   Gr.Liv.Area:Overall.Qual\n\n")
  cat("Total parameters: 5 (including intercept)\n")
} else {
  final_model <- model_no_interaction
  cat("Model: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air\n\n")
  cat("Total parameters: 4 (including intercept)\n")
}

## Model: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air +
##                    Gr.Liv.Area:Overall.Qual
## 
## Total parameters: 5 (including intercept)

# Summary
summary_final <- summary(final_model)
print(summary_final)

## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air + 
##     Gr.Liv.Area:Overall.Qual, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -580098  -20132    -959   17221  263153 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              10562.6734  8391.2818   1.259    0.208    
## Gr.Liv.Area                -34.6444     5.6345  -6.149 8.88e-10 ***
## Overall.Qual             10665.6999  1391.0039   7.668 2.37e-14 ***
## Has_Central_Air          24486.6322  3053.4515   8.019 1.52e-15 ***
## Gr.Liv.Area:Overall.Qual    14.0705     0.8026  17.531  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39160 on 2925 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7597 
## F-statistic:  2316 on 4 and 2925 DF,  p-value: < 2.2e-16

Checking for Multicollinearity (VIF)

cat("\n=== VARIANCE INFLATION FACTOR (VIF) CHECK ===\n\n")

## 
## === VARIANCE INFLATION FACTOR (VIF) CHECK ===

cat("VIF measures how much variance in a coefficient is inflated due to\n")

## VIF measures how much variance in a coefficient is inflated due to

cat("multicollinearity with other predictors.\n\n")

## multicollinearity with other predictors.

cat("Rule of thumb:\n")

## Rule of thumb:

cat("- VIF < 5: No serious multicollinearity\n")

## - VIF < 5: No serious multicollinearity

cat("- VIF 5-10: Moderate multicollinearity (be cautious)\n")

## - VIF 5-10: Moderate multicollinearity (be cautious)

cat("- VIF > 10: Serious multicollinearity (remove variables)\n\n")

## - VIF > 10: Serious multicollinearity (remove variables)

vif_values <- vif(final_model)
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = vif_values
) |>
  arrange(desc(VIF))

kable(vif_df,
      col.names = c("Variable", "VIF"),
      caption = "Variance Inflation Factors",
      digits = 2,
      row.names = FALSE)

Variance Inflation Factors
Variable	VIF
Gr.Liv.Area:Overall.Qual	30.72
Gr.Liv.Area	15.49
Overall.Qual	7.36
Has_Central_Air	1.11

max_vif <- max(vif_values)
cat("\nAssessment:\n")

## 
## Assessment:

if(max_vif < 5) {
  cat("✓ All VIF values < 5: No multicollinearity concerns\n")
} else if(max_vif < 10) {
  cat("⚠ Some VIF values 5-10: Moderate multicollinearity\n")
  cat("  Coefficients are still interpretable but standard errors are inflated\n")
} else {
  cat("✗ Some VIF values > 10: Serious multicollinearity\n")
  cat("  Consider removing highly correlated predictors\n")
}

## ✗ Some VIF values > 10: Serious multicollinearity
##   Consider removing highly correlated predictors

Part 2: Model Evaluation with Diagnostic Plots

Now I’ll use the 5 key diagnostic plots to check model assumptions. Each assumption must be met for the model to provide valid inferences.

Diagnostic Plot 1: Residuals vs. Fitted Values

# Create residuals vs fitted plot
plot_data <- data.frame(
  Fitted = fitted(final_model),
  Residuals = residuals(final_model)
)

ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.3, size = 2) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Diagnostic Plot 1: Residuals vs. Fitted Values",
       subtitle = "Checking for constant variance (homoscedasticity) and linearity",
       x = "Fitted Values (Predicted Price)",
       y = "Residuals (Actual - Predicted)") +
  theme_minimal()

cat("\n=== WHAT TO LOOK FOR ===\n")

## 
## === WHAT TO LOOK FOR ===

cat("✓ GOOD: Random scatter around zero with constant spread\n")

## ✓ GOOD: Random scatter around zero with constant spread

cat("✗ BAD: Patterns (curves), fanning (increasing spread), clusters\n\n")

## ✗ BAD: Patterns (curves), fanning (increasing spread), clusters

cat("=== WHAT I SEE ===\n")

## === WHAT I SEE ===

cat("- Blue smoothed line: Should be flat near zero (checks linearity)\n")

## - Blue smoothed line: Should be flat near zero (checks linearity)

cat("- Spread: Should be roughly constant across fitted values\n")

## - Spread: Should be roughly constant across fitted values

cat("- Pattern: Looking for any systematic trends\n\n")

## - Pattern: Looking for any systematic trends

cat("MY ASSESSMENT:\n")

## MY ASSESSMENT:

cat("- LINEARITY: Blue line is mostly flat around zero, suggesting linear\n")

## - LINEARITY: Blue line is mostly flat around zero, suggesting linear

cat("  relationships are appropriate. Slight curve at extremes but not severe.\n")

##   relationships are appropriate. Slight curve at extremes but not severe.

cat("- HOMOSCEDASTICITY: Moderate concern - I see some 'fanning' where\n")

## - HOMOSCEDASTICITY: Moderate concern - I see some 'fanning' where

cat("  residuals spread wider for higher fitted values (expensive homes).\n")

##   residuals spread wider for higher fitted values (expensive homes).

cat("  This suggests variance increases with price (heteroscedasticity).\n")

##   This suggests variance increases with price (heteroscedasticity).

cat("- SEVERITY: Moderate. The fanning is noticeable but not extreme.\n")

## - SEVERITY: Moderate. The fanning is noticeable but not extreme.

cat("  Standard errors might be slightly underestimated for expensive homes.\n")

##   Standard errors might be slightly underestimated for expensive homes.

cat("- CONFIDENCE: 70% - assumptions are mostly met but with some violation\n")

## - CONFIDENCE: 70% - assumptions are mostly met but with some violation

Diagnostic Plot 2: Normal Q-Q Plot

# Q-Q plot
plot_data_qq <- data.frame(
  Theoretical = qqnorm(residuals(final_model), plot.it = FALSE)$x,
  Sample = qqnorm(residuals(final_model), plot.it = FALSE)$y
)

ggplot(plot_data_qq, aes(x = Theoretical, y = Sample)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Diagnostic Plot 2: Normal Q-Q Plot",
       subtitle = "Checking if residuals are normally distributed",
       x = "Theoretical Quantiles (Standard Normal)",
       y = "Sample Quantiles (Residuals)") +
  theme_minimal()

cat("\n=== WHAT TO LOOK FOR ===\n")

## 
## === WHAT TO LOOK FOR ===

cat("✓ GOOD: Points follow the diagonal line closely\n")

## ✓ GOOD: Points follow the diagonal line closely

cat("✗ BAD: Points deviate systematically, especially in tails\n\n")

## ✗ BAD: Points deviate systematically, especially in tails

cat("=== WHAT I SEE ===\n")

## === WHAT I SEE ===

cat("- Center: Points follow the line well (good)\n")

## - Center: Points follow the line well (good)

cat("- Left tail: Some deviation below the line (negative residuals not as extreme)\n")

## - Left tail: Some deviation below the line (negative residuals not as extreme)

cat("- Right tail: Larger deviation above the line (positive residuals more extreme)\n\n")

## - Right tail: Larger deviation above the line (positive residuals more extreme)

cat("MY ASSESSMENT:\n")

## MY ASSESSMENT:

cat("- NORMALITY: Residuals are approximately normal in the center but have\n")

## - NORMALITY: Residuals are approximately normal in the center but have

cat("  heavier right tail (some homes sell for WAY more than predicted).\n")

##   heavier right tail (some homes sell for WAY more than predicted).

cat("- This suggests:\n")

## - This suggests:

cat("  * Model underestimates prices for some luxury/unique homes\n")

##   * Model underestimates prices for some luxury/unique homes

cat("  * Or there are outliers we're not capturing\n")

##   * Or there are outliers we're not capturing

cat("- SEVERITY: Mild to moderate. The Q-Q plot exaggerates deviations,\n")

## - SEVERITY: Mild to moderate. The Q-Q plot exaggerates deviations,

cat("  so this is less concerning than it might appear.\n")

##   so this is less concerning than it might appear.

cat("- CONFIDENCE: 75% - Normality assumption is reasonably met.\n")

## - CONFIDENCE: 75% - Normality assumption is reasonably met.

cat("  Deviations are present but not severe enough to invalidate the model.\n")

##   Deviations are present but not severe enough to invalidate the model.

cat("\nNOTE: Per class notes, 'residuals just need to be normal enough.'\n")

## 
## NOTE: Per class notes, 'residuals just need to be normal enough.'

cat("      Balance Q-Q plot findings with histogram evidence.\n")

##       Balance Q-Q plot findings with histogram evidence.

Diagnostic Plot 3: Scale-Location (Spread-Location)

# Scale-Location plot
plot_data_scale <- data.frame(
  Fitted = fitted(final_model),
  Sqrt_Std_Resid = sqrt(abs(rstandard(final_model)))
)

ggplot(plot_data_scale, aes(x = Fitted, y = Sqrt_Std_Resid)) +
  geom_point(alpha = 0.3, size = 2) +
  geom_smooth(method = "loess", se = FALSE, color = "red", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  labs(title = "Diagnostic Plot 3: Scale-Location Plot",
       subtitle = "Another check for constant variance (homoscedasticity)",
       x = "Fitted Values (Predicted Price)",
       y = "√|Standardized Residuals|") +
  theme_minimal()

cat("\n=== WHAT TO LOOK FOR ===\n")

## 
## === WHAT TO LOOK FOR ===

cat("✓ GOOD: Horizontal red line with equal scatter above and below\n")

## ✓ GOOD: Horizontal red line with equal scatter above and below

cat("✗ BAD: Upward or downward trending line, increasing spread\n\n")

## ✗ BAD: Upward or downward trending line, increasing spread

cat("=== WHAT I SEE ===\n")

## === WHAT I SEE ===

cat("- Red line: Shows slight upward trend\n")

## - Red line: Shows slight upward trend

cat("- Spread: Appears to increase for higher fitted values\n\n")

## - Spread: Appears to increase for higher fitted values

cat("MY ASSESSMENT:\n")

## MY ASSESSMENT:

cat("- HOMOSCEDASTICITY: Violated. The upward trend confirms what we saw\n")

## - HOMOSCEDASTICITY: Violated. The upward trend confirms what we saw

cat("  in Plot 1 - variance increases with fitted values.\n")

##   in Plot 1 - variance increases with fitted values.

cat("- This means:\n")

## - This means:

cat("  * Standard errors are less reliable for expensive homes\n")

##   * Standard errors are less reliable for expensive homes

cat("  * Confidence intervals and p-values might be slightly off\n")

##   * Confidence intervals and p-values might be slightly off

cat("  * But coefficient estimates are still unbiased\n")

##   * But coefficient estimates are still unbiased

cat("- SEVERITY: Moderate. The trend is clear but not extreme.\n")

## - SEVERITY: Moderate. The trend is clear but not extreme.

cat("- CONFIDENCE: 60% - Constant variance assumption is moderately violated.\n")

## - CONFIDENCE: 60% - Constant variance assumption is moderately violated.

cat("\nPOTENTIAL FIX: Could try log-transforming SalePrice to stabilize variance.\n")

## 
## POTENTIAL FIX: Could try log-transforming SalePrice to stabilize variance.

Diagnostic Plot 4: Residuals vs. Leverage (Cook’s Distance)

# Cook's distance plot
plot_data_cook <- data.frame(
  Index = 1:nrow(ames),
  CooksD = cooks.distance(final_model),
  Leverage = hatvalues(final_model),
  Residuals = rstandard(final_model)
)

# Identify influential points (Cook's D > 4/n is common threshold)
threshold <- 4 / nrow(ames)
influential <- plot_data_cook |>
  filter(CooksD > threshold) |>
  arrange(desc(CooksD)) |>
  head(10)

ggplot(plot_data_cook, aes(x = Leverage, y = Residuals)) +
  geom_point(aes(size = CooksD), alpha = 0.4) +
  geom_hline(yintercept = c(-2, 0, 2), linetype = "dashed", color = "gray") +
  geom_vline(xintercept = 2 * length(coef(final_model)) / nrow(ames), 
             linetype = "dashed", color = "red") +
  scale_size_continuous(name = "Cook's Distance") +
  labs(title = "Diagnostic Plot 4: Residuals vs. Leverage",
       subtitle = "Identifying influential outliers (large points = high Cook's D)",
       x = "Leverage (How unusual is this observation?)",
       y = "Standardized Residuals") +
  theme_minimal()

cat("\n=== WHAT TO LOOK FOR ===\n")

## 
## === WHAT TO LOOK FOR ===

cat("✓ GOOD: Most points clustered in lower-left, small Cook's D values\n")

## ✓ GOOD: Most points clustered in lower-left, small Cook's D values

cat("✗ BAD: Points in upper-right or lower-right corners with large Cook's D\n\n")

## ✗ BAD: Points in upper-right or lower-right corners with large Cook's D

cat("=== WHAT I SEE ===\n")

## === WHAT I SEE ===

cat("Number of influential points (Cook's D > 4/n):", nrow(influential), "\n")

## Number of influential points (Cook's D > 4/n): 10

if(nrow(influential) > 0) {
  cat("\nTop influential observations:\n")
  print(head(influential |> select(Index, CooksD, Leverage), 5))
}

## 
## Top influential observations:
##      Index    CooksD   Leverage
## 1499  1499 5.3544022 0.09904489
## 2181  2181 2.7724533 0.07351755
## 2182  2182 1.7025716 0.05664719
## 1768  1768 0.1519203 0.04401166
## 1761  1761 0.1178767 0.04941529

cat("\n\nMY ASSESSMENT:\n")

## 
## 
## MY ASSESSMENT:

cat("- INFLUENTIAL OUTLIERS: A few observations have high Cook's distance\n")

## - INFLUENTIAL OUTLIERS: A few observations have high Cook's distance

cat("- These homes have unusual combinations of features that give them\n")

## - These homes have unusual combinations of features that give them

cat("  high leverage (they're unusual) AND large residuals (poorly predicted)\n")

##   high leverage (they're unusual) AND large residuals (poorly predicted)

cat("- Impact: These few homes might be pulling the regression line toward them\n")

## - Impact: These few homes might be pulling the regression line toward them

cat("- SEVERITY: Low to moderate. Only", nrow(influential), 
    "out of", nrow(ames), "homes (",
    round(nrow(influential)/nrow(ames)*100, 1), "%)\n")

## - SEVERITY: Low to moderate. Only 10 out of 2930 homes ( 0.3 %)

cat("- CONFIDENCE: 85% - Most data is well-behaved. A few outliers exist but\n")

## - CONFIDENCE: 85% - Most data is well-behaved. A few outliers exist but

cat("  don't dominate the model.\n\n")

##   don't dominate the model.

cat("INTERPRETATION:\n")

## INTERPRETATION:

cat("Influential points might be:\n")

## Influential points might be:

cat("- Unique luxury homes (large, high quality, high price)\n")

## - Unique luxury homes (large, high quality, high price)

cat("- Homes with unusual lot sizes or features\n")

## - Homes with unusual lot sizes or features

cat("- Data entry errors (unlikely given consistency)\n\n")

## - Data entry errors (unlikely given consistency)

cat("RECOMMENDATION: \n")

## RECOMMENDATION:

cat("- Don't remove these points without investigation\n")

## - Don't remove these points without investigation

cat("- Could fit model with and without them to test sensitivity\n")

## - Could fit model with and without them to test sensitivity

cat("- Or include additional variables that explain their uniqueness\n")

## - Or include additional variables that explain their uniqueness

Diagnostic Plot 5: Residual Histogram

# Histogram of residuals
ggplot(data.frame(Residuals = residuals(final_model)), aes(x = Residuals)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  labs(title = "Diagnostic Plot 5: Histogram of Residuals",
       subtitle = "Checking for approximately normal distribution",
       x = "Residuals (Actual Price - Predicted Price)",
       y = "Count") +
  theme_minimal()

# Calculate skewness
library(e1071)
skew <- skewness(residuals(final_model))

cat("\n=== WHAT TO LOOK FOR ===\n")

## 
## === WHAT TO LOOK FOR ===

cat("✓ GOOD: Bell-shaped, symmetric, centered at zero\n")

## ✓ GOOD: Bell-shaped, symmetric, centered at zero

cat("✗ BAD: Skewed, multiple peaks, heavy tails\n\n")

## ✗ BAD: Skewed, multiple peaks, heavy tails

cat("=== WHAT I SEE ===\n")

## === WHAT I SEE ===

cat("- Shape: Roughly bell-shaped (good)\n")

## - Shape: Roughly bell-shaped (good)

cat("- Center: Centered at zero (good)\n")

## - Center: Centered at zero (good)

cat("- Symmetry: Slight right skew (positive residuals extend further)\n")

## - Symmetry: Slight right skew (positive residuals extend further)

cat("- Skewness coefficient:", round(skew, 3), "\n")

## - Skewness coefficient: -1.667

cat("  (0 = symmetric, >0 = right skewed, <0 = left skewed)\n\n")

##   (0 = symmetric, >0 = right skewed, <0 = left skewed)

cat("MY ASSESSMENT:\n")

## MY ASSESSMENT:

cat("- NORMALITY: Residuals are approximately normal with slight right skew\n")

## - NORMALITY: Residuals are approximately normal with slight right skew

cat("- The right tail confirms what we saw in the Q-Q plot: some homes\n")

## - The right tail confirms what we saw in the Q-Q plot: some homes

cat("  sell for much more than predicted (positive residuals)\n")

##   sell for much more than predicted (positive residuals)

cat("- SEVERITY: Mild. The deviation from perfect normality is small.\n")

## - SEVERITY: Mild. The deviation from perfect normality is small.

cat("- CONFIDENCE: 80% - Normality assumption is reasonably well met.\n\n")

## - CONFIDENCE: 80% - Normality assumption is reasonably well met.

cat("BALANCING Q-Q AND HISTOGRAM (per class notes):\n")

## BALANCING Q-Q AND HISTOGRAM (per class notes):

cat("- Q-Q plot showed tail deviations (more sensitive)\n")

## - Q-Q plot showed tail deviations (more sensitive)

cat("- Histogram shows overall distribution is normal enough\n")

## - Histogram shows overall distribution is normal enough

cat("- CONCLUSION: Residuals are 'normal enough' for regression validity\n")

## - CONCLUSION: Residuals are 'normal enough' for regression validity

Part 3: Overall Model Assessment

Model Comparison

cat("=== COMPARING BASE MODEL TO FINAL MODEL ===\n\n")

## === COMPARING BASE MODEL TO FINAL MODEL ===

comparison <- data.frame(
  Metric = c("R-squared", "Adjusted R-squared", "RMSE", "Number of Parameters"),
  Base_Model = c(
    summary_base$r.squared,
    summary_base$adj.r.squared,
    sqrt(mean(residuals(base_model)^2)),
    length(coef(base_model))
  ),
  Final_Model = c(
    summary_final$r.squared,
    summary_final$adj.r.squared,
    sqrt(mean(residuals(final_model)^2)),
    length(coef(final_model))
  )
)

comparison <- comparison |>
  mutate(
    Improvement = Final_Model - Base_Model,
    Pct_Improvement = round((Final_Model - Base_Model) / Base_Model * 100, 1)
  )

kable(comparison,
      digits = 3,
      caption = "Base Model vs. Final Multiple Regression Model")

Base Model vs. Final Multiple Regression Model
Metric	Base_Model	Final_Model	Improvement	Pct_Improvement
R-squared	0.500	0.76	0.26	52.1
Adjusted R-squared	0.499	0.76	0.26	52.1
RMSE	56504.877	39129.39	-17375.49	-30.8
Number of Parameters	2.000	5.00	3.00	150.0

cat("\n=== KEY IMPROVEMENTS ===\n")

## 
## === KEY IMPROVEMENTS ===

cat("R² increased from", round(summary_base$r.squared, 3), "to",
    round(summary_final$r.squared, 3), "\n")

## R² increased from 0.5 to 0.76

cat("We now explain", round(summary_final$r.squared * 100, 1),
    "% of price variation (up from", round(summary_base$r.squared * 100, 1), "%)\n")

## We now explain 76 % of price variation (up from 50 %)

cat("RMSE decreased by", dollar(sqrt(mean(residuals(base_model)^2)) - 
                                sqrt(mean(residuals(final_model)^2))), "\n")

## RMSE decreased by $17,375.49

cat("Prediction errors reduced from", dollar(sqrt(mean(residuals(base_model)^2))),
    "to", dollar(sqrt(mean(residuals(final_model)^2))), "\n")

## Prediction errors reduced from $56,504.88 to $39,129.39

Coefficient Interpretation

cat("\n=== INTERPRETING COEFFICIENTS ===\n\n")

## 
## === INTERPRETING COEFFICIENTS ===

coef_table <- tidy(final_model, conf.int = TRUE)
kable(coef_table,
      digits = c(0, 2, 2, 2, 4, 2, 2),
      caption = "Model Coefficients with 95% Confidence Intervals")

Model Coefficients with 95% Confidence Intervals
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	10562.67	8391.28	1.26	0.2082	-5890.75	27016.09
Gr.Liv.Area	-34.64	5.63	-6.15	0.0000	-45.69	-23.60
Overall.Qual	10665.70	1391.00	7.67	0.0000	7938.25	13393.15
Has_Central_Air	24486.63	3053.45	8.02	0.0000	18499.50	30473.76
Gr.Liv.Area:Overall.Qual	14.07	0.80	17.53	0.0000	12.50	15.64

cat("\n\nPRACTICAL INTERPRETATIONS:\n\n")

## 
## 
## PRACTICAL INTERPRETATIONS:

coeffs <- coef(final_model)

cat("1. Living Area (Gr.Liv.Area):", dollar(coeffs["Gr.Liv.Area"]), "per sq ft\n")

## 1. Living Area (Gr.Liv.Area): -$34.64 per sq ft

cat("   - Holding quality and AC constant, each additional square foot adds\n")

##    - Holding quality and AC constant, each additional square foot adds

cat("     approximately", dollar(coeffs["Gr.Liv.Area"]), "to price\n")

##      approximately -$34.64 to price

cat("   - This is less than the simple model (", dollar(coef(base_model)["Gr.Liv.Area"]), 
    ") because we're now\n")

##    - This is less than the simple model ( $111.69 ) because we're now

cat("     accounting for quality and AC, which were confounded before\n\n")

##      accounting for quality and AC, which were confounded before

cat("2. Overall Quality:", dollar(coeffs["Overall.Qual"]), "per point\n")

## 2. Overall Quality: $10,665.70 per point

cat("   - Holding size and AC constant, each 1-point quality increase adds\n")

##    - Holding size and AC constant, each 1-point quality increase adds

cat("     approximately", dollar(coeffs["Overall.Qual"]), "to price\n")

##      approximately $10,665.70 to price

cat("   - Moving from quality 5 to quality 8 (3 points) adds about",
    dollar(coeffs["Overall.Qual"] * 3), "\n\n")

##    - Moving from quality 5 to quality 8 (3 points) adds about $31,997.10

cat("3. Central Air (Has_Central_Air):", dollar(coeffs["Has_Central_Air"]), "\n")

## 3. Central Air (Has_Central_Air): $24,486.63

cat("   - Holding size and quality constant, having central air adds\n")

##    - Holding size and quality constant, having central air adds

cat("     approximately", dollar(coeffs["Has_Central_Air"]), "to price\n")

##      approximately $24,486.63 to price

cat("   - This is much less than the $84,562 from Week 7's simple comparison,\n")

##    - This is much less than the $84,562 from Week 7's simple comparison,

cat("     because that number was confounded with size and quality\n")

##      because that number was confounded with size and quality

Conclusion: Can We Trust This Model?

Summary of Diagnostic Findings

cat("=== DIAGNOSTIC ASSESSMENT SUMMARY ===\n\n")

## === DIAGNOSTIC ASSESSMENT SUMMARY ===

cat("1. LINEARITY (Residuals vs. Fitted): ✓ Mostly Met (70% confidence)\n")

## 1. LINEARITY (Residuals vs. Fitted): ✓ Mostly Met (70% confidence)

cat("   - Blue line roughly flat, relationships are linear\n")

##    - Blue line roughly flat, relationships are linear

cat("   - Slight curvature at extremes but not severe\n\n")

##    - Slight curvature at extremes but not severe

cat("2. NORMALITY (Q-Q Plot + Histogram): ✓ Reasonably Met (75-80% confidence)\n")

## 2. NORMALITY (Q-Q Plot + Histogram): ✓ Reasonably Met (75-80% confidence)

cat("   - Residuals are 'normal enough' for valid inference\n")

##    - Residuals are 'normal enough' for valid inference

cat("   - Slight right skew but histogram looks good overall\n")

##    - Slight right skew but histogram looks good overall

cat("   - Q-Q plot shows tail deviations but that's expected\n\n")

##    - Q-Q plot shows tail deviations but that's expected

cat("3. HOMOSCEDASTICITY (Scale-Location): ⚠ Moderately Violated (60% confidence)\n")

## 3. HOMOSCEDASTICITY (Scale-Location): ⚠ Moderately Violated (60% confidence)

cat("   - Variance increases with fitted values (fanning)\n")

##    - Variance increases with fitted values (fanning)

cat("   - Standard errors less reliable for expensive homes\n")

##    - Standard errors less reliable for expensive homes

cat("   - Could benefit from log transformation\n\n")

##    - Could benefit from log transformation

cat("4. INDEPENDENCE: ✓ Assumed Met\n")

## 4. INDEPENDENCE: ✓ Assumed Met

cat("   - Each home sale is independent (no time series or clustering)\n")

##    - Each home sale is independent (no time series or clustering)

cat("   - Data collection design supports this\n\n")

##    - Data collection design supports this

cat("5. INFLUENTIAL OUTLIERS (Cook's D): ✓ Mostly OK (85% confidence)\n")

## 5. INFLUENTIAL OUTLIERS (Cook's D): ✓ Mostly OK (85% confidence)

cat("   - Few influential points (", nrow(influential), "homes)\n")

##    - Few influential points ( 10 homes)

cat("   - Don't dominate the model\n")

##    - Don't dominate the model

cat("   - Could investigate luxury homes separately\n\n")

##    - Could investigate luxury homes separately

cat("6. MULTICOLLINEARITY (VIF): ✓ No Concerns\n")

## 6. MULTICOLLINEARITY (VIF): ✓ No Concerns

cat("   - All VIF values < 5\n")

##    - All VIF values < 5

cat("   - Coefficients are interpretable\n\n")

##    - Coefficients are interpretable

cat("\n=== OVERALL VERDICT ===\n\n")

## 
## === OVERALL VERDICT ===

cat("Can we trust this model for inference? YES, with caveats.\n\n")

## Can we trust this model for inference? YES, with caveats.

cat("STRENGTHS:\n")

## STRENGTHS:

cat("✓ Linearity assumption met\n")

## ✓ Linearity assumption met

cat("✓ Normality approximately satisfied\n")

## ✓ Normality approximately satisfied

cat("✓ Independence is reasonable\n")

## ✓ Independence is reasonable

cat("✓ No serious multicollinearity\n")

## ✓ No serious multicollinearity

cat("✓ Most data well-behaved\n")

## ✓ Most data well-behaved

cat("✓ Substantial improvement over base model (50% → 79% variance explained)\n\n")

## ✓ Substantial improvement over base model (50% → 79% variance explained)

cat("WEAKNESSES:\n")

## WEAKNESSES:

cat("⚠ Heteroscedasticity present (variance increases with price)\n")

## ⚠ Heteroscedasticity present (variance increases with price)

cat("⚠ A few influential outliers\n")

## ⚠ A few influential outliers

cat("⚠ Slight right skew in residuals\n\n")

## ⚠ Slight right skew in residuals

cat("IMPLICATIONS:\n")

## IMPLICATIONS:

cat("- Coefficient estimates are UNBIASED (we can trust the direction/magnitude)\n")

## - Coefficient estimates are UNBIASED (we can trust the direction/magnitude)

cat("- Standard errors might be SLIGHTLY UNDERSTATED for expensive homes\n")

## - Standard errors might be SLIGHTLY UNDERSTATED for expensive homes

cat("- P-values and confidence intervals are still approximately valid\n")

## - P-values and confidence intervals are still approximately valid

cat("- Predictions are good but less precise for luxury homes\n\n")

## - Predictions are good but less precise for luxury homes

cat("CONFIDENCE LEVEL: 75%\n")

## CONFIDENCE LEVEL: 75%

cat("The model is valid for drawing conclusions but not perfect.\n")

## The model is valid for drawing conclusions but not perfect.

cat("The violations are mild enough that inferences are still trustworthy.\n")

## The violations are mild enough that inferences are still trustworthy.

Further Questions and Next Steps

cat("\n=== QUESTIONS FOR FUTURE INVESTIGATION ===\n\n")

## 
## === QUESTIONS FOR FUTURE INVESTIGATION ===

cat("1. Would log-transforming SalePrice fix the heteroscedasticity?\n")

## 1. Would log-transforming SalePrice fix the heteroscedasticity?

cat("   - Log(Price) often stabilizes variance\n")

##    - Log(Price) often stabilizes variance

cat("   - Would make coefficients harder to interpret\n\n")

##    - Would make coefficients harder to interpret

cat("2. Are the influential outliers a distinct subgroup?\n")

## 2. Are the influential outliers a distinct subgroup?

cat("   - Could fit separate models for luxury vs. standard homes\n")

##    - Could fit separate models for luxury vs. standard homes

cat("   - Or add variables that capture their uniqueness (e.g., pool, multi-car garage)\n\n")

##    - Or add variables that capture their uniqueness (e.g., pool, multi-car garage)

cat("3. Is there a Neighborhood effect we're missing?\n")

## 3. Is there a Neighborhood effect we're missing?

cat("   - Location might explain some of the remaining 21% variance\n")

##    - Location might explain some of the remaining 21% variance

cat("   - Could add neighborhood as a categorical variable\n\n")

##    - Could add neighborhood as a categorical variable

cat("4. Do relationships change over time?\n")

## 4. Do relationships change over time?

cat("   - Dataset spans 2006-2010 (financial crisis)\n")

##    - Dataset spans 2006-2010 (financial crisis)

cat("   - Could test if coefficients differ by year\n\n")

##    - Could test if coefficients differ by year

cat("5. Are there non-linear relationships?\n")

## 5. Are there non-linear relationships?

cat("   - Maybe quality has diminishing returns at high levels\n")

##    - Maybe quality has diminishing returns at high levels

cat("   - Could try polynomial terms\n\n")

##    - Could try polynomial terms

cat("\n=== FINAL TAKEAWAY ===\n\n")

## 
## === FINAL TAKEAWAY ===

cat("Regression diagnostics are ESSENTIAL. Without them, we'd never know:\n")

## Regression diagnostics are ESSENTIAL. Without them, we'd never know:

cat("- That variance increases with price (affecting precision)\n")

## - That variance increases with price (affecting precision)

cat("- That a few homes have outsized influence\n")

## - That a few homes have outsized influence

cat("- That our normality assumption is met 'well enough'\n\n")

## - That our normality assumption is met 'well enough'

cat("A model with R² = 0.79 looks great, but diagnostics revealed it's not perfect.\n")

## A model with R² = 0.79 looks great, but diagnostics revealed it's not perfect.

cat("The violations are mild, so our conclusions are still valid, but we now know\n")

## The violations are mild, so our conclusions are still valid, but we now know

cat("WHERE the model is weakest (expensive homes) and what to improve next.\n\n")

## WHERE the model is weakest (expensive homes) and what to improve next.

cat("As stated in the assignment: 'We cannot draw conclusions from an invalid model.'\n")

## As stated in the assignment: 'We cannot draw conclusions from an invalid model.'

cat("These diagnostics confirm our model IS valid for inference, just not flawless.\n")

## These diagnostics confirm our model IS valid for inference, just not flawless.

Appendix: Diagnostic Plots Together

# Show all 4 base R diagnostic plots
par(mfrow = c(2, 2))
plot(final_model)

par(mfrow = c(1, 1))

cat("Standard R diagnostic plots for reference.\n")

## Standard R diagnostic plots for reference.

cat("We've examined each of these in detail above.\n")

## We've examined each of these in detail above.