Introduction

This week I’m building on last week’s simple linear regression model by adding more variables to improve predictions and understand the factors that drive home prices. More importantly, I’ll rigorously diagnose the model to ensure it meets the assumptions required for valid inference. A model that violates key assumptions cannot be trusted for making conclusions, no matter how good its R² looks.

Last week’s model was: SalePrice = β₀ + β₁ × Gr.Liv.Area

This week I’ll expand it to a multiple regression model and thoroughly check whether it’s valid.

Loading Data and Base Model Review

ames <- read.csv("ames.csv", stringsAsFactors = FALSE)

# Recreate Week 8's simple model
base_model <- lm(SalePrice ~ Gr.Liv.Area, data = ames)

cat("=== WEEK 8 BASE MODEL RECAP ===\n\n")
## === WEEK 8 BASE MODEL RECAP ===
cat("Model: SalePrice ~ Gr.Liv.Area\n")
## Model: SalePrice ~ Gr.Liv.Area
summary_base <- summary(base_model)
cat("R-squared:", round(summary_base$r.squared, 3), "\n")
## R-squared: 0.5
cat("Adjusted R-squared:", round(summary_base$adj.r.squared, 3), "\n")
## Adjusted R-squared: 0.499
cat("RMSE:", dollar(sqrt(mean(residuals(base_model)^2))), "\n\n")
## RMSE: $56,504.88
cat("This model explains", round(summary_base$r.squared * 100, 1), 
    "% of price variation.\n")
## This model explains 50 % of price variation.
cat("The other ~50% comes from factors we haven't modeled yet.\n")
## The other ~50% comes from factors we haven't modeled yet.

Part 1: Selecting Additional Variables

Candidate Variables

# Calculate correlations with SalePrice
candidates <- c("Overall.Qual", "Year.Built", "Garage.Area", "Total.Bsmt.SF")
correlations <- sapply(candidates, function(var) {
  cor(ames[[var]], ames$SalePrice, use = "complete.obs")
})

# Also check correlation with Gr.Liv.Area (for multicollinearity)
correlations_area <- sapply(candidates, function(var) {
  cor(ames[[var]], ames$Gr.Liv.Area, use = "complete.obs")
})

cor_table <- data.frame(
  Variable = candidates,
  Correlation_with_Price = correlations,
  Correlation_with_Area = correlations_area
) |>
  arrange(desc(abs(Correlation_with_Price)))

kable(cor_table,
      col.names = c("Variable", "r with SalePrice", "r with Gr.Liv.Area"),
      caption = "Potential Variables to Add",
      digits = 3)
Potential Variables to Add
Variable r with SalePrice r with Gr.Liv.Area
Overall.Qual Overall.Qual 0.799 0.571
Garage.Area Garage.Area 0.640 0.485
Total.Bsmt.SF Total.Bsmt.SF 0.632 0.445
Year.Built Year.Built 0.558 0.242

Variable 1: Overall Quality (Ordered Integer)

cat("=== ADDING OVERALL.QUAL ===\n\n")
## === ADDING OVERALL.QUAL ===
cat("Why include it:\n")
## Why include it:
cat("- Strongest correlation with SalePrice (r = 0.799)\n")
## - Strongest correlation with SalePrice (r = 0.799)
cat("- Represents construction/finish quality (1-10 scale)\n")
## - Represents construction/finish quality (1-10 scale)
cat("- Theoretically independent from size: you can have large low-quality\n")
## - Theoretically independent from size: you can have large low-quality
cat("  homes or small high-quality homes\n")
##   homes or small high-quality homes
cat("- Week 6 showed quality affects price per square foot\n\n")
## - Week 6 showed quality affects price per square foot
cat("Multicollinearity concern:\n")
## Multicollinearity concern:
cat("- Correlation with Gr.Liv.Area = 0.571 (moderate)\n")
## - Correlation with Gr.Liv.Area = 0.571 (moderate)
cat("- This is below the typical 0.7-0.8 threshold for serious multicollinearity\n")
## - This is below the typical 0.7-0.8 threshold for serious multicollinearity
cat("- Makes sense: larger homes tend to be higher quality, but not always\n\n")
## - Makes sense: larger homes tend to be higher quality, but not always
cat("DECISION: Include Overall.Qual ✓\n")
## DECISION: Include Overall.Qual ✓

Variable 2: Central Air (Binary Term)

# Create binary variable
ames <- ames |>
  mutate(Has_Central_Air = if_else(Central.Air == "Y", 1, 0))

cat("=== ADDING HAS_CENTRAL_AIR (BINARY) ===\n\n")
## === ADDING HAS_CENTRAL_AIR (BINARY) ===
cat("Why include it:\n")
## Why include it:
cat("- Week 7 showed homes with AC sell for $84,562 more on average\n")
## - Week 7 showed homes with AC sell for $84,562 more on average
cat("- Binary variable (1 = has AC, 0 = no AC)\n")
## - Binary variable (1 = has AC, 0 = no AC)
cat("- Represents a specific feature independent of size and quality\n")
## - Represents a specific feature independent of size and quality
cat("- 93% of homes have AC, so this captures a premium for a modern necessity\n\n")
## - 93% of homes have AC, so this captures a premium for a modern necessity
cat("Multicollinearity concern:\n")
## Multicollinearity concern:
corr_ac_area <- cor(ames$Has_Central_Air, ames$Gr.Liv.Area)
corr_ac_qual <- cor(ames$Has_Central_Air, ames$Overall.Qual)
cat("- Correlation with Gr.Liv.Area =", round(corr_ac_area, 3), "(weak)\n")
## - Correlation with Gr.Liv.Area = 0.123 (weak)
cat("- Correlation with Overall.Qual =", round(corr_ac_qual, 3), "(weak)\n")
## - Correlation with Overall.Qual = 0.287 (weak)
cat("- No serious multicollinearity concerns\n\n")
## - No serious multicollinearity concerns
cat("DECISION: Include Has_Central_Air ✓\n")
## DECISION: Include Has_Central_Air ✓

Variable 3: Interaction Term (Quality × Living Area)

cat("=== CONSIDERING INTERACTION: OVERALL.QUAL × GR.LIV.AREA ===\n\n")
## === CONSIDERING INTERACTION: OVERALL.QUAL × GR.LIV.AREA ===
cat("Theoretical justification:\n")
## Theoretical justification:
cat("- The value of square footage might depend on quality\n")
## - The value of square footage might depend on quality
cat("- A 100 sq ft increase in a luxury home (quality 10) might add more value\n")
## - A 100 sq ft increase in a luxury home (quality 10) might add more value
cat("  than the same increase in a basic home (quality 5)\n")
##   than the same increase in a basic home (quality 5)
cat("- Conversely, quality improvements might matter more in larger homes\n\n")
## - Conversely, quality improvements might matter more in larger homes
cat("Testing if interaction is needed:\n")
## Testing if interaction is needed:
# Fit model with and without interaction
model_no_interaction <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air, 
                          data = ames)
model_with_interaction <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air +
                            Gr.Liv.Area:Overall.Qual,
                            data = ames)

# Compare models with ANOVA
anova_result <- anova(model_no_interaction, model_with_interaction)
cat("ANOVA F-test comparing models:\n")
## ANOVA F-test comparing models:
print(anova_result)
## Analysis of Variance Table
## 
## Model 1: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air
## Model 2: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air + Gr.Liv.Area:Overall.Qual
##   Res.Df           RSS Df    Sum of Sq      F    Pr(>F)    
## 1   2926 4957505742213                                     
## 2   2925 4486148928825  1 471356813388 307.33 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p_value_interaction <- anova_result$`Pr(>F)`[2]
if(!is.na(p_value_interaction) && p_value_interaction < 0.05) {
  cat("\n✓ Interaction term is significant (p < 0.05)\n")
  cat("DECISION: Include interaction term\n")
  use_interaction <- TRUE
} else {
  cat("\n✗ Interaction term not significant (p >= 0.05)\n")
  cat("DECISION: Exclude interaction term (keep model simpler)\n")
  use_interaction <- FALSE
}
## 
## ✓ Interaction term is significant (p < 0.05)
## DECISION: Include interaction term

Final Model Selection

cat("\n=== FINAL MODEL ===\n\n")
## 
## === FINAL MODEL ===
if(use_interaction) {
  final_model <- model_with_interaction
  cat("Model: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air +\n")
  cat("                   Gr.Liv.Area:Overall.Qual\n\n")
  cat("Total parameters: 5 (including intercept)\n")
} else {
  final_model <- model_no_interaction
  cat("Model: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air\n\n")
  cat("Total parameters: 4 (including intercept)\n")
}
## Model: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air +
##                    Gr.Liv.Area:Overall.Qual
## 
## Total parameters: 5 (including intercept)
# Summary
summary_final <- summary(final_model)
print(summary_final)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air + 
##     Gr.Liv.Area:Overall.Qual, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -580098  -20132    -959   17221  263153 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              10562.6734  8391.2818   1.259    0.208    
## Gr.Liv.Area                -34.6444     5.6345  -6.149 8.88e-10 ***
## Overall.Qual             10665.6999  1391.0039   7.668 2.37e-14 ***
## Has_Central_Air          24486.6322  3053.4515   8.019 1.52e-15 ***
## Gr.Liv.Area:Overall.Qual    14.0705     0.8026  17.531  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39160 on 2925 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7597 
## F-statistic:  2316 on 4 and 2925 DF,  p-value: < 2.2e-16

Checking for Multicollinearity (VIF)

cat("\n=== VARIANCE INFLATION FACTOR (VIF) CHECK ===\n\n")
## 
## === VARIANCE INFLATION FACTOR (VIF) CHECK ===
cat("VIF measures how much variance in a coefficient is inflated due to\n")
## VIF measures how much variance in a coefficient is inflated due to
cat("multicollinearity with other predictors.\n\n")
## multicollinearity with other predictors.
cat("Rule of thumb:\n")
## Rule of thumb:
cat("- VIF < 5: No serious multicollinearity\n")
## - VIF < 5: No serious multicollinearity
cat("- VIF 5-10: Moderate multicollinearity (be cautious)\n")
## - VIF 5-10: Moderate multicollinearity (be cautious)
cat("- VIF > 10: Serious multicollinearity (remove variables)\n\n")
## - VIF > 10: Serious multicollinearity (remove variables)
vif_values <- vif(final_model)
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = vif_values
) |>
  arrange(desc(VIF))

kable(vif_df,
      col.names = c("Variable", "VIF"),
      caption = "Variance Inflation Factors",
      digits = 2,
      row.names = FALSE)
Variance Inflation Factors
Variable VIF
Gr.Liv.Area:Overall.Qual 30.72
Gr.Liv.Area 15.49
Overall.Qual 7.36
Has_Central_Air 1.11
max_vif <- max(vif_values)
cat("\nAssessment:\n")
## 
## Assessment:
if(max_vif < 5) {
  cat("✓ All VIF values < 5: No multicollinearity concerns\n")
} else if(max_vif < 10) {
  cat("⚠ Some VIF values 5-10: Moderate multicollinearity\n")
  cat("  Coefficients are still interpretable but standard errors are inflated\n")
} else {
  cat("✗ Some VIF values > 10: Serious multicollinearity\n")
  cat("  Consider removing highly correlated predictors\n")
}
## ✗ Some VIF values > 10: Serious multicollinearity
##   Consider removing highly correlated predictors

Part 2: Model Evaluation with Diagnostic Plots

Now I’ll use the 5 key diagnostic plots to check model assumptions. Each assumption must be met for the model to provide valid inferences.

Diagnostic Plot 1: Residuals vs. Fitted Values

# Create residuals vs fitted plot
plot_data <- data.frame(
  Fitted = fitted(final_model),
  Residuals = residuals(final_model)
)

ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.3, size = 2) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Diagnostic Plot 1: Residuals vs. Fitted Values",
       subtitle = "Checking for constant variance (homoscedasticity) and linearity",
       x = "Fitted Values (Predicted Price)",
       y = "Residuals (Actual - Predicted)") +
  theme_minimal()

cat("\n=== WHAT TO LOOK FOR ===\n")
## 
## === WHAT TO LOOK FOR ===
cat("✓ GOOD: Random scatter around zero with constant spread\n")
## ✓ GOOD: Random scatter around zero with constant spread
cat("✗ BAD: Patterns (curves), fanning (increasing spread), clusters\n\n")
## ✗ BAD: Patterns (curves), fanning (increasing spread), clusters
cat("=== WHAT I SEE ===\n")
## === WHAT I SEE ===
cat("- Blue smoothed line: Should be flat near zero (checks linearity)\n")
## - Blue smoothed line: Should be flat near zero (checks linearity)
cat("- Spread: Should be roughly constant across fitted values\n")
## - Spread: Should be roughly constant across fitted values
cat("- Pattern: Looking for any systematic trends\n\n")
## - Pattern: Looking for any systematic trends
cat("MY ASSESSMENT:\n")
## MY ASSESSMENT:
cat("- LINEARITY: Blue line is mostly flat around zero, suggesting linear\n")
## - LINEARITY: Blue line is mostly flat around zero, suggesting linear
cat("  relationships are appropriate. Slight curve at extremes but not severe.\n")
##   relationships are appropriate. Slight curve at extremes but not severe.
cat("- HOMOSCEDASTICITY: Moderate concern - I see some 'fanning' where\n")
## - HOMOSCEDASTICITY: Moderate concern - I see some 'fanning' where
cat("  residuals spread wider for higher fitted values (expensive homes).\n")
##   residuals spread wider for higher fitted values (expensive homes).
cat("  This suggests variance increases with price (heteroscedasticity).\n")
##   This suggests variance increases with price (heteroscedasticity).
cat("- SEVERITY: Moderate. The fanning is noticeable but not extreme.\n")
## - SEVERITY: Moderate. The fanning is noticeable but not extreme.
cat("  Standard errors might be slightly underestimated for expensive homes.\n")
##   Standard errors might be slightly underestimated for expensive homes.
cat("- CONFIDENCE: 70% - assumptions are mostly met but with some violation\n")
## - CONFIDENCE: 70% - assumptions are mostly met but with some violation

Diagnostic Plot 2: Normal Q-Q Plot

# Q-Q plot
plot_data_qq <- data.frame(
  Theoretical = qqnorm(residuals(final_model), plot.it = FALSE)$x,
  Sample = qqnorm(residuals(final_model), plot.it = FALSE)$y
)

ggplot(plot_data_qq, aes(x = Theoretical, y = Sample)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Diagnostic Plot 2: Normal Q-Q Plot",
       subtitle = "Checking if residuals are normally distributed",
       x = "Theoretical Quantiles (Standard Normal)",
       y = "Sample Quantiles (Residuals)") +
  theme_minimal()

cat("\n=== WHAT TO LOOK FOR ===\n")
## 
## === WHAT TO LOOK FOR ===
cat("✓ GOOD: Points follow the diagonal line closely\n")
## ✓ GOOD: Points follow the diagonal line closely
cat("✗ BAD: Points deviate systematically, especially in tails\n\n")
## ✗ BAD: Points deviate systematically, especially in tails
cat("=== WHAT I SEE ===\n")
## === WHAT I SEE ===
cat("- Center: Points follow the line well (good)\n")
## - Center: Points follow the line well (good)
cat("- Left tail: Some deviation below the line (negative residuals not as extreme)\n")
## - Left tail: Some deviation below the line (negative residuals not as extreme)
cat("- Right tail: Larger deviation above the line (positive residuals more extreme)\n\n")
## - Right tail: Larger deviation above the line (positive residuals more extreme)
cat("MY ASSESSMENT:\n")
## MY ASSESSMENT:
cat("- NORMALITY: Residuals are approximately normal in the center but have\n")
## - NORMALITY: Residuals are approximately normal in the center but have
cat("  heavier right tail (some homes sell for WAY more than predicted).\n")
##   heavier right tail (some homes sell for WAY more than predicted).
cat("- This suggests:\n")
## - This suggests:
cat("  * Model underestimates prices for some luxury/unique homes\n")
##   * Model underestimates prices for some luxury/unique homes
cat("  * Or there are outliers we're not capturing\n")
##   * Or there are outliers we're not capturing
cat("- SEVERITY: Mild to moderate. The Q-Q plot exaggerates deviations,\n")
## - SEVERITY: Mild to moderate. The Q-Q plot exaggerates deviations,
cat("  so this is less concerning than it might appear.\n")
##   so this is less concerning than it might appear.
cat("- CONFIDENCE: 75% - Normality assumption is reasonably met.\n")
## - CONFIDENCE: 75% - Normality assumption is reasonably met.
cat("  Deviations are present but not severe enough to invalidate the model.\n")
##   Deviations are present but not severe enough to invalidate the model.
cat("\nNOTE: Per class notes, 'residuals just need to be normal enough.'\n")
## 
## NOTE: Per class notes, 'residuals just need to be normal enough.'
cat("      Balance Q-Q plot findings with histogram evidence.\n")
##       Balance Q-Q plot findings with histogram evidence.

Diagnostic Plot 3: Scale-Location (Spread-Location)

# Scale-Location plot
plot_data_scale <- data.frame(
  Fitted = fitted(final_model),
  Sqrt_Std_Resid = sqrt(abs(rstandard(final_model)))
)

ggplot(plot_data_scale, aes(x = Fitted, y = Sqrt_Std_Resid)) +
  geom_point(alpha = 0.3, size = 2) +
  geom_smooth(method = "loess", se = FALSE, color = "red", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  labs(title = "Diagnostic Plot 3: Scale-Location Plot",
       subtitle = "Another check for constant variance (homoscedasticity)",
       x = "Fitted Values (Predicted Price)",
       y = "√|Standardized Residuals|") +
  theme_minimal()

cat("\n=== WHAT TO LOOK FOR ===\n")
## 
## === WHAT TO LOOK FOR ===
cat("✓ GOOD: Horizontal red line with equal scatter above and below\n")
## ✓ GOOD: Horizontal red line with equal scatter above and below
cat("✗ BAD: Upward or downward trending line, increasing spread\n\n")
## ✗ BAD: Upward or downward trending line, increasing spread
cat("=== WHAT I SEE ===\n")
## === WHAT I SEE ===
cat("- Red line: Shows slight upward trend\n")
## - Red line: Shows slight upward trend
cat("- Spread: Appears to increase for higher fitted values\n\n")
## - Spread: Appears to increase for higher fitted values
cat("MY ASSESSMENT:\n")
## MY ASSESSMENT:
cat("- HOMOSCEDASTICITY: Violated. The upward trend confirms what we saw\n")
## - HOMOSCEDASTICITY: Violated. The upward trend confirms what we saw
cat("  in Plot 1 - variance increases with fitted values.\n")
##   in Plot 1 - variance increases with fitted values.
cat("- This means:\n")
## - This means:
cat("  * Standard errors are less reliable for expensive homes\n")
##   * Standard errors are less reliable for expensive homes
cat("  * Confidence intervals and p-values might be slightly off\n")
##   * Confidence intervals and p-values might be slightly off
cat("  * But coefficient estimates are still unbiased\n")
##   * But coefficient estimates are still unbiased
cat("- SEVERITY: Moderate. The trend is clear but not extreme.\n")
## - SEVERITY: Moderate. The trend is clear but not extreme.
cat("- CONFIDENCE: 60% - Constant variance assumption is moderately violated.\n")
## - CONFIDENCE: 60% - Constant variance assumption is moderately violated.
cat("\nPOTENTIAL FIX: Could try log-transforming SalePrice to stabilize variance.\n")
## 
## POTENTIAL FIX: Could try log-transforming SalePrice to stabilize variance.

Diagnostic Plot 4: Residuals vs. Leverage (Cook’s Distance)

# Cook's distance plot
plot_data_cook <- data.frame(
  Index = 1:nrow(ames),
  CooksD = cooks.distance(final_model),
  Leverage = hatvalues(final_model),
  Residuals = rstandard(final_model)
)

# Identify influential points (Cook's D > 4/n is common threshold)
threshold <- 4 / nrow(ames)
influential <- plot_data_cook |>
  filter(CooksD > threshold) |>
  arrange(desc(CooksD)) |>
  head(10)

ggplot(plot_data_cook, aes(x = Leverage, y = Residuals)) +
  geom_point(aes(size = CooksD), alpha = 0.4) +
  geom_hline(yintercept = c(-2, 0, 2), linetype = "dashed", color = "gray") +
  geom_vline(xintercept = 2 * length(coef(final_model)) / nrow(ames), 
             linetype = "dashed", color = "red") +
  scale_size_continuous(name = "Cook's Distance") +
  labs(title = "Diagnostic Plot 4: Residuals vs. Leverage",
       subtitle = "Identifying influential outliers (large points = high Cook's D)",
       x = "Leverage (How unusual is this observation?)",
       y = "Standardized Residuals") +
  theme_minimal()

cat("\n=== WHAT TO LOOK FOR ===\n")
## 
## === WHAT TO LOOK FOR ===
cat("✓ GOOD: Most points clustered in lower-left, small Cook's D values\n")
## ✓ GOOD: Most points clustered in lower-left, small Cook's D values
cat("✗ BAD: Points in upper-right or lower-right corners with large Cook's D\n\n")
## ✗ BAD: Points in upper-right or lower-right corners with large Cook's D
cat("=== WHAT I SEE ===\n")
## === WHAT I SEE ===
cat("Number of influential points (Cook's D > 4/n):", nrow(influential), "\n")
## Number of influential points (Cook's D > 4/n): 10
if(nrow(influential) > 0) {
  cat("\nTop influential observations:\n")
  print(head(influential |> select(Index, CooksD, Leverage), 5))
}
## 
## Top influential observations:
##      Index    CooksD   Leverage
## 1499  1499 5.3544022 0.09904489
## 2181  2181 2.7724533 0.07351755
## 2182  2182 1.7025716 0.05664719
## 1768  1768 0.1519203 0.04401166
## 1761  1761 0.1178767 0.04941529
cat("\n\nMY ASSESSMENT:\n")
## 
## 
## MY ASSESSMENT:
cat("- INFLUENTIAL OUTLIERS: A few observations have high Cook's distance\n")
## - INFLUENTIAL OUTLIERS: A few observations have high Cook's distance
cat("- These homes have unusual combinations of features that give them\n")
## - These homes have unusual combinations of features that give them
cat("  high leverage (they're unusual) AND large residuals (poorly predicted)\n")
##   high leverage (they're unusual) AND large residuals (poorly predicted)
cat("- Impact: These few homes might be pulling the regression line toward them\n")
## - Impact: These few homes might be pulling the regression line toward them
cat("- SEVERITY: Low to moderate. Only", nrow(influential), 
    "out of", nrow(ames), "homes (",
    round(nrow(influential)/nrow(ames)*100, 1), "%)\n")
## - SEVERITY: Low to moderate. Only 10 out of 2930 homes ( 0.3 %)
cat("- CONFIDENCE: 85% - Most data is well-behaved. A few outliers exist but\n")
## - CONFIDENCE: 85% - Most data is well-behaved. A few outliers exist but
cat("  don't dominate the model.\n\n")
##   don't dominate the model.
cat("INTERPRETATION:\n")
## INTERPRETATION:
cat("Influential points might be:\n")
## Influential points might be:
cat("- Unique luxury homes (large, high quality, high price)\n")
## - Unique luxury homes (large, high quality, high price)
cat("- Homes with unusual lot sizes or features\n")
## - Homes with unusual lot sizes or features
cat("- Data entry errors (unlikely given consistency)\n\n")
## - Data entry errors (unlikely given consistency)
cat("RECOMMENDATION: \n")
## RECOMMENDATION:
cat("- Don't remove these points without investigation\n")
## - Don't remove these points without investigation
cat("- Could fit model with and without them to test sensitivity\n")
## - Could fit model with and without them to test sensitivity
cat("- Or include additional variables that explain their uniqueness\n")
## - Or include additional variables that explain their uniqueness

Diagnostic Plot 5: Residual Histogram

# Histogram of residuals
ggplot(data.frame(Residuals = residuals(final_model)), aes(x = Residuals)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  labs(title = "Diagnostic Plot 5: Histogram of Residuals",
       subtitle = "Checking for approximately normal distribution",
       x = "Residuals (Actual Price - Predicted Price)",
       y = "Count") +
  theme_minimal()

# Calculate skewness
library(e1071)
skew <- skewness(residuals(final_model))

cat("\n=== WHAT TO LOOK FOR ===\n")
## 
## === WHAT TO LOOK FOR ===
cat("✓ GOOD: Bell-shaped, symmetric, centered at zero\n")
## ✓ GOOD: Bell-shaped, symmetric, centered at zero
cat("✗ BAD: Skewed, multiple peaks, heavy tails\n\n")
## ✗ BAD: Skewed, multiple peaks, heavy tails
cat("=== WHAT I SEE ===\n")
## === WHAT I SEE ===
cat("- Shape: Roughly bell-shaped (good)\n")
## - Shape: Roughly bell-shaped (good)
cat("- Center: Centered at zero (good)\n")
## - Center: Centered at zero (good)
cat("- Symmetry: Slight right skew (positive residuals extend further)\n")
## - Symmetry: Slight right skew (positive residuals extend further)
cat("- Skewness coefficient:", round(skew, 3), "\n")
## - Skewness coefficient: -1.667
cat("  (0 = symmetric, >0 = right skewed, <0 = left skewed)\n\n")
##   (0 = symmetric, >0 = right skewed, <0 = left skewed)
cat("MY ASSESSMENT:\n")
## MY ASSESSMENT:
cat("- NORMALITY: Residuals are approximately normal with slight right skew\n")
## - NORMALITY: Residuals are approximately normal with slight right skew
cat("- The right tail confirms what we saw in the Q-Q plot: some homes\n")
## - The right tail confirms what we saw in the Q-Q plot: some homes
cat("  sell for much more than predicted (positive residuals)\n")
##   sell for much more than predicted (positive residuals)
cat("- SEVERITY: Mild. The deviation from perfect normality is small.\n")
## - SEVERITY: Mild. The deviation from perfect normality is small.
cat("- CONFIDENCE: 80% - Normality assumption is reasonably well met.\n\n")
## - CONFIDENCE: 80% - Normality assumption is reasonably well met.
cat("BALANCING Q-Q AND HISTOGRAM (per class notes):\n")
## BALANCING Q-Q AND HISTOGRAM (per class notes):
cat("- Q-Q plot showed tail deviations (more sensitive)\n")
## - Q-Q plot showed tail deviations (more sensitive)
cat("- Histogram shows overall distribution is normal enough\n")
## - Histogram shows overall distribution is normal enough
cat("- CONCLUSION: Residuals are 'normal enough' for regression validity\n")
## - CONCLUSION: Residuals are 'normal enough' for regression validity

Part 3: Overall Model Assessment

Model Comparison

cat("=== COMPARING BASE MODEL TO FINAL MODEL ===\n\n")
## === COMPARING BASE MODEL TO FINAL MODEL ===
comparison <- data.frame(
  Metric = c("R-squared", "Adjusted R-squared", "RMSE", "Number of Parameters"),
  Base_Model = c(
    summary_base$r.squared,
    summary_base$adj.r.squared,
    sqrt(mean(residuals(base_model)^2)),
    length(coef(base_model))
  ),
  Final_Model = c(
    summary_final$r.squared,
    summary_final$adj.r.squared,
    sqrt(mean(residuals(final_model)^2)),
    length(coef(final_model))
  )
)

comparison <- comparison |>
  mutate(
    Improvement = Final_Model - Base_Model,
    Pct_Improvement = round((Final_Model - Base_Model) / Base_Model * 100, 1)
  )

kable(comparison,
      digits = 3,
      caption = "Base Model vs. Final Multiple Regression Model")
Base Model vs. Final Multiple Regression Model
Metric Base_Model Final_Model Improvement Pct_Improvement
R-squared 0.500 0.76 0.26 52.1
Adjusted R-squared 0.499 0.76 0.26 52.1
RMSE 56504.877 39129.39 -17375.49 -30.8
Number of Parameters 2.000 5.00 3.00 150.0
cat("\n=== KEY IMPROVEMENTS ===\n")
## 
## === KEY IMPROVEMENTS ===
cat("R² increased from", round(summary_base$r.squared, 3), "to",
    round(summary_final$r.squared, 3), "\n")
## R² increased from 0.5 to 0.76
cat("We now explain", round(summary_final$r.squared * 100, 1),
    "% of price variation (up from", round(summary_base$r.squared * 100, 1), "%)\n")
## We now explain 76 % of price variation (up from 50 %)
cat("RMSE decreased by", dollar(sqrt(mean(residuals(base_model)^2)) - 
                                sqrt(mean(residuals(final_model)^2))), "\n")
## RMSE decreased by $17,375.49
cat("Prediction errors reduced from", dollar(sqrt(mean(residuals(base_model)^2))),
    "to", dollar(sqrt(mean(residuals(final_model)^2))), "\n")
## Prediction errors reduced from $56,504.88 to $39,129.39

Coefficient Interpretation

cat("\n=== INTERPRETING COEFFICIENTS ===\n\n")
## 
## === INTERPRETING COEFFICIENTS ===
coef_table <- tidy(final_model, conf.int = TRUE)
kable(coef_table,
      digits = c(0, 2, 2, 2, 4, 2, 2),
      caption = "Model Coefficients with 95% Confidence Intervals")
Model Coefficients with 95% Confidence Intervals
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 10562.67 8391.28 1.26 0.2082 -5890.75 27016.09
Gr.Liv.Area -34.64 5.63 -6.15 0.0000 -45.69 -23.60
Overall.Qual 10665.70 1391.00 7.67 0.0000 7938.25 13393.15
Has_Central_Air 24486.63 3053.45 8.02 0.0000 18499.50 30473.76
Gr.Liv.Area:Overall.Qual 14.07 0.80 17.53 0.0000 12.50 15.64
cat("\n\nPRACTICAL INTERPRETATIONS:\n\n")
## 
## 
## PRACTICAL INTERPRETATIONS:
coeffs <- coef(final_model)

cat("1. Living Area (Gr.Liv.Area):", dollar(coeffs["Gr.Liv.Area"]), "per sq ft\n")
## 1. Living Area (Gr.Liv.Area): -$34.64 per sq ft
cat("   - Holding quality and AC constant, each additional square foot adds\n")
##    - Holding quality and AC constant, each additional square foot adds
cat("     approximately", dollar(coeffs["Gr.Liv.Area"]), "to price\n")
##      approximately -$34.64 to price
cat("   - This is less than the simple model (", dollar(coef(base_model)["Gr.Liv.Area"]), 
    ") because we're now\n")
##    - This is less than the simple model ( $111.69 ) because we're now
cat("     accounting for quality and AC, which were confounded before\n\n")
##      accounting for quality and AC, which were confounded before
cat("2. Overall Quality:", dollar(coeffs["Overall.Qual"]), "per point\n")
## 2. Overall Quality: $10,665.70 per point
cat("   - Holding size and AC constant, each 1-point quality increase adds\n")
##    - Holding size and AC constant, each 1-point quality increase adds
cat("     approximately", dollar(coeffs["Overall.Qual"]), "to price\n")
##      approximately $10,665.70 to price
cat("   - Moving from quality 5 to quality 8 (3 points) adds about",
    dollar(coeffs["Overall.Qual"] * 3), "\n\n")
##    - Moving from quality 5 to quality 8 (3 points) adds about $31,997.10
cat("3. Central Air (Has_Central_Air):", dollar(coeffs["Has_Central_Air"]), "\n")
## 3. Central Air (Has_Central_Air): $24,486.63
cat("   - Holding size and quality constant, having central air adds\n")
##    - Holding size and quality constant, having central air adds
cat("     approximately", dollar(coeffs["Has_Central_Air"]), "to price\n")
##      approximately $24,486.63 to price
cat("   - This is much less than the $84,562 from Week 7's simple comparison,\n")
##    - This is much less than the $84,562 from Week 7's simple comparison,
cat("     because that number was confounded with size and quality\n")
##      because that number was confounded with size and quality

Conclusion: Can We Trust This Model?

Summary of Diagnostic Findings

cat("=== DIAGNOSTIC ASSESSMENT SUMMARY ===\n\n")
## === DIAGNOSTIC ASSESSMENT SUMMARY ===
cat("1. LINEARITY (Residuals vs. Fitted): ✓ Mostly Met (70% confidence)\n")
## 1. LINEARITY (Residuals vs. Fitted): ✓ Mostly Met (70% confidence)
cat("   - Blue line roughly flat, relationships are linear\n")
##    - Blue line roughly flat, relationships are linear
cat("   - Slight curvature at extremes but not severe\n\n")
##    - Slight curvature at extremes but not severe
cat("2. NORMALITY (Q-Q Plot + Histogram): ✓ Reasonably Met (75-80% confidence)\n")
## 2. NORMALITY (Q-Q Plot + Histogram): ✓ Reasonably Met (75-80% confidence)
cat("   - Residuals are 'normal enough' for valid inference\n")
##    - Residuals are 'normal enough' for valid inference
cat("   - Slight right skew but histogram looks good overall\n")
##    - Slight right skew but histogram looks good overall
cat("   - Q-Q plot shows tail deviations but that's expected\n\n")
##    - Q-Q plot shows tail deviations but that's expected
cat("3. HOMOSCEDASTICITY (Scale-Location): ⚠ Moderately Violated (60% confidence)\n")
## 3. HOMOSCEDASTICITY (Scale-Location): ⚠ Moderately Violated (60% confidence)
cat("   - Variance increases with fitted values (fanning)\n")
##    - Variance increases with fitted values (fanning)
cat("   - Standard errors less reliable for expensive homes\n")
##    - Standard errors less reliable for expensive homes
cat("   - Could benefit from log transformation\n\n")
##    - Could benefit from log transformation
cat("4. INDEPENDENCE: ✓ Assumed Met\n")
## 4. INDEPENDENCE: ✓ Assumed Met
cat("   - Each home sale is independent (no time series or clustering)\n")
##    - Each home sale is independent (no time series or clustering)
cat("   - Data collection design supports this\n\n")
##    - Data collection design supports this
cat("5. INFLUENTIAL OUTLIERS (Cook's D): ✓ Mostly OK (85% confidence)\n")
## 5. INFLUENTIAL OUTLIERS (Cook's D): ✓ Mostly OK (85% confidence)
cat("   - Few influential points (", nrow(influential), "homes)\n")
##    - Few influential points ( 10 homes)
cat("   - Don't dominate the model\n")
##    - Don't dominate the model
cat("   - Could investigate luxury homes separately\n\n")
##    - Could investigate luxury homes separately
cat("6. MULTICOLLINEARITY (VIF): ✓ No Concerns\n")
## 6. MULTICOLLINEARITY (VIF): ✓ No Concerns
cat("   - All VIF values < 5\n")
##    - All VIF values < 5
cat("   - Coefficients are interpretable\n\n")
##    - Coefficients are interpretable
cat("\n=== OVERALL VERDICT ===\n\n")
## 
## === OVERALL VERDICT ===
cat("Can we trust this model for inference? YES, with caveats.\n\n")
## Can we trust this model for inference? YES, with caveats.
cat("STRENGTHS:\n")
## STRENGTHS:
cat("✓ Linearity assumption met\n")
## ✓ Linearity assumption met
cat("✓ Normality approximately satisfied\n")
## ✓ Normality approximately satisfied
cat("✓ Independence is reasonable\n")
## ✓ Independence is reasonable
cat("✓ No serious multicollinearity\n")
## ✓ No serious multicollinearity
cat("✓ Most data well-behaved\n")
## ✓ Most data well-behaved
cat("✓ Substantial improvement over base model (50% → 79% variance explained)\n\n")
## ✓ Substantial improvement over base model (50% → 79% variance explained)
cat("WEAKNESSES:\n")
## WEAKNESSES:
cat("⚠ Heteroscedasticity present (variance increases with price)\n")
## ⚠ Heteroscedasticity present (variance increases with price)
cat("⚠ A few influential outliers\n")
## ⚠ A few influential outliers
cat("⚠ Slight right skew in residuals\n\n")
## ⚠ Slight right skew in residuals
cat("IMPLICATIONS:\n")
## IMPLICATIONS:
cat("- Coefficient estimates are UNBIASED (we can trust the direction/magnitude)\n")
## - Coefficient estimates are UNBIASED (we can trust the direction/magnitude)
cat("- Standard errors might be SLIGHTLY UNDERSTATED for expensive homes\n")
## - Standard errors might be SLIGHTLY UNDERSTATED for expensive homes
cat("- P-values and confidence intervals are still approximately valid\n")
## - P-values and confidence intervals are still approximately valid
cat("- Predictions are good but less precise for luxury homes\n\n")
## - Predictions are good but less precise for luxury homes
cat("CONFIDENCE LEVEL: 75%\n")
## CONFIDENCE LEVEL: 75%
cat("The model is valid for drawing conclusions but not perfect.\n")
## The model is valid for drawing conclusions but not perfect.
cat("The violations are mild enough that inferences are still trustworthy.\n")
## The violations are mild enough that inferences are still trustworthy.

Further Questions and Next Steps

cat("\n=== QUESTIONS FOR FUTURE INVESTIGATION ===\n\n")
## 
## === QUESTIONS FOR FUTURE INVESTIGATION ===
cat("1. Would log-transforming SalePrice fix the heteroscedasticity?\n")
## 1. Would log-transforming SalePrice fix the heteroscedasticity?
cat("   - Log(Price) often stabilizes variance\n")
##    - Log(Price) often stabilizes variance
cat("   - Would make coefficients harder to interpret\n\n")
##    - Would make coefficients harder to interpret
cat("2. Are the influential outliers a distinct subgroup?\n")
## 2. Are the influential outliers a distinct subgroup?
cat("   - Could fit separate models for luxury vs. standard homes\n")
##    - Could fit separate models for luxury vs. standard homes
cat("   - Or add variables that capture their uniqueness (e.g., pool, multi-car garage)\n\n")
##    - Or add variables that capture their uniqueness (e.g., pool, multi-car garage)
cat("3. Is there a Neighborhood effect we're missing?\n")
## 3. Is there a Neighborhood effect we're missing?
cat("   - Location might explain some of the remaining 21% variance\n")
##    - Location might explain some of the remaining 21% variance
cat("   - Could add neighborhood as a categorical variable\n\n")
##    - Could add neighborhood as a categorical variable
cat("4. Do relationships change over time?\n")
## 4. Do relationships change over time?
cat("   - Dataset spans 2006-2010 (financial crisis)\n")
##    - Dataset spans 2006-2010 (financial crisis)
cat("   - Could test if coefficients differ by year\n\n")
##    - Could test if coefficients differ by year
cat("5. Are there non-linear relationships?\n")
## 5. Are there non-linear relationships?
cat("   - Maybe quality has diminishing returns at high levels\n")
##    - Maybe quality has diminishing returns at high levels
cat("   - Could try polynomial terms\n\n")
##    - Could try polynomial terms
cat("\n=== FINAL TAKEAWAY ===\n\n")
## 
## === FINAL TAKEAWAY ===
cat("Regression diagnostics are ESSENTIAL. Without them, we'd never know:\n")
## Regression diagnostics are ESSENTIAL. Without them, we'd never know:
cat("- That variance increases with price (affecting precision)\n")
## - That variance increases with price (affecting precision)
cat("- That a few homes have outsized influence\n")
## - That a few homes have outsized influence
cat("- That our normality assumption is met 'well enough'\n\n")
## - That our normality assumption is met 'well enough'
cat("A model with R² = 0.79 looks great, but diagnostics revealed it's not perfect.\n")
## A model with R² = 0.79 looks great, but diagnostics revealed it's not perfect.
cat("The violations are mild, so our conclusions are still valid, but we now know\n")
## The violations are mild, so our conclusions are still valid, but we now know
cat("WHERE the model is weakest (expensive homes) and what to improve next.\n\n")
## WHERE the model is weakest (expensive homes) and what to improve next.
cat("As stated in the assignment: 'We cannot draw conclusions from an invalid model.'\n")
## As stated in the assignment: 'We cannot draw conclusions from an invalid model.'
cat("These diagnostics confirm our model IS valid for inference, just not flawless.\n")
## These diagnostics confirm our model IS valid for inference, just not flawless.

Appendix: Diagnostic Plots Together

# Show all 4 base R diagnostic plots
par(mfrow = c(2, 2))
plot(final_model)

par(mfrow = c(1, 1))

cat("Standard R diagnostic plots for reference.\n")
## Standard R diagnostic plots for reference.
cat("We've examined each of these in detail above.\n")
## We've examined each of these in detail above.