Introduction

This week I’m building on last week’s simple linear regression model by adding more variables to improve predictions and understand the factors that drive home prices. More importantly, I’ll rigorously diagnose the model using the 5 diagnostic plots covered in class to ensure it meets the assumptions required for valid inference.

A model that violates key assumptions cannot be trusted for making conclusions, no matter how good its R² looks.

Last week’s model was: SalePrice = β₀ + β₁ × Gr.Liv.Area (R² = 0.500)

This week I’ll expand it to a multiple regression model and thoroughly check whether it’s valid.

Loading Data and Base Model Review

ames <- read.csv("ames.csv", stringsAsFactors = FALSE)

# Recreate Week 8's simple model
base_model <- lm(SalePrice ~ Gr.Liv.Area, data = ames)
summary_base <- summary(base_model)

Week 8 Base Model Summary:

This model explains 50% of price variation. The other ~50% comes from factors we haven’t modeled yet.


Part 1: Selecting Additional Variables

Exploring Candidate Variables

# Calculate correlations with SalePrice
candidates <- c("Overall.Qual", "Year.Built", "Garage.Area", "Total.Bsmt.SF")
correlations <- sapply(candidates, function(var) {
  cor(ames[[var]], ames$SalePrice, use = "complete.obs")
})

# Also check correlation with Gr.Liv.Area (for multicollinearity)
correlations_area <- sapply(candidates, function(var) {
  cor(ames[[var]], ames$Gr.Liv.Area, use = "complete.obs")
})

cor_table <- data.frame(
  Variable = candidates,
  Correlation_with_Price = correlations,
  Correlation_with_Area = correlations_area
) |>
  arrange(desc(abs(Correlation_with_Price)))

kable(cor_table,
      col.names = c("Variable", "r with SalePrice", "r with Gr.Liv.Area"),
      caption = "Potential Variables to Add",
      digits = 3)
Potential Variables to Add
Variable r with SalePrice r with Gr.Liv.Area
Overall.Qual Overall.Qual 0.799 0.571
Garage.Area Garage.Area 0.640 0.485
Total.Bsmt.SF Total.Bsmt.SF 0.632 0.445
Year.Built Year.Built 0.558 0.242

Variable 1: Overall Quality (Ordered Integer)

Why include it:

  • Strongest correlation with SalePrice (r = 0.799)
  • Represents construction/finish quality on a 1-10 scale
  • Theoretically independent from size: you can have large low-quality homes or small high-quality homes
  • Week 6 showed quality affects price per square foot

Multicollinearity concern:

  • Correlation with Gr.Liv.Area = 0.571 (moderate)
  • This is below the typical 0.7-0.8 threshold for serious multicollinearity
  • Makes sense: larger homes tend to be higher quality, but not always

DECISION: Include Overall.Qual ✓

Variable 2: Central Air (Binary Term)

# Create binary variable
ames <- ames |>
  mutate(Has_Central_Air = if_else(Central.Air == "Y", 1, 0))

Why include it:

  • Week 7 showed homes with AC sell for $84,562 more on average
  • Binary variable (1 = has AC, 0 = no AC)
  • Represents a specific feature independent of size and quality
  • 93% of homes have AC, capturing a premium for a modern necessity

Multicollinearity concern:

corr_ac_area <- cor(ames$Has_Central_Air, ames$Gr.Liv.Area)
corr_ac_qual <- cor(ames$Has_Central_Air, ames$Overall.Qual)
  • Correlation with Gr.Liv.Area = 0.123 (weak)
  • Correlation with Overall.Qual = 0.287 (weak)
  • No serious multicollinearity concerns

DECISION: Include Has_Central_Air ✓

Variable 3: Testing Interaction Term (Quality × Living Area)

Theoretical justification:

  • The value of square footage might depend on quality
  • A 100 sq ft increase in a luxury home (quality 10) might add more value than the same increase in a basic home (quality 5)
  • Conversely, quality improvements might matter more in larger homes
# Fit model with and without interaction
model_no_interaction <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air, 
                          data = ames)
model_with_interaction <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air +
                            Gr.Liv.Area:Overall.Qual,
                            data = ames)

# Compare models with ANOVA
anova_result <- anova(model_no_interaction, model_with_interaction)
p_value_interaction <- anova_result$`Pr(>F)`[2]

Testing if interaction is needed:

kable(anova_result,
      caption = "ANOVA F-test: Comparing Models With and Without Interaction",
      digits = c(0, 0, 0, 2, 2, 4))
ANOVA F-test: Comparing Models With and Without Interaction
Res.Df RSS Df Sum of Sq F Pr(>F)
2926 4957505742213
2925 4486148928825 1 471356813388 307.33 0

The interaction term has p = 1.664e-65. Since p >= 0.05, the interaction term is significant.

DECISION: Include interaction term to keep the model simpler and more interpretable.

Final Model Selection

# Use model without interaction for simplicity
final_model <- model_no_interaction

Final Model: SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air

This model has 4 parameters (including intercept).

summary_final <- summary(final_model)
print(summary_final)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Overall.Qual + Has_Central_Air, 
##     data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -393423  -23038   -1377   19705  284783 
## 
## Coefficients:
##                    Estimate  Std. Error t value     Pr(>|t|)    
## (Intercept)     -120952.256    3951.472 -30.609      < 2e-16 ***
## Gr.Liv.Area          59.275       1.834  32.312      < 2e-16 ***
## Overall.Qual      32245.142     680.880  47.358      < 2e-16 ***
## Has_Central_Air   17493.918    3181.808   5.498 0.0000000417 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41160 on 2926 degrees of freedom
## Multiple R-squared:  0.7348, Adjusted R-squared:  0.7345 
## F-statistic:  2702 on 3 and 2926 DF,  p-value: < 2.2e-16

Checking for Multicollinearity (VIF)

VIF measures how much variance in a coefficient is inflated due to multicollinearity with other predictors.

Rule of thumb:

  • VIF < 5: No serious multicollinearity
  • VIF 5-10: Moderate multicollinearity (be cautious)
  • VIF > 10: Serious multicollinearity (remove variables)
vif_values <- vif(final_model)
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = vif_values
) |>
  arrange(desc(VIF))

kable(vif_df,
      col.names = c("Variable", "VIF"),
      caption = "Variance Inflation Factors",
      digits = 2,
      row.names = FALSE)
Variance Inflation Factors
Variable VIF
Overall.Qual 1.60
Gr.Liv.Area 1.49
Has_Central_Air 1.09

Assessment: All VIF values are < 5, indicating no multicollinearity concerns. Coefficients are interpretable and standard errors are not inflated.


Part 2: Model Diagnostics (5 Plots from Class)

Now I’ll use the 5 diagnostic plots covered in class to check model assumptions. Each assumption must be met for the model to provide valid inferences.

Diagnostic Plot 1: Residuals vs. Fitted Values

gg_resfitted(final_model) +
  geom_smooth(se = FALSE, color = "blue", linewidth = 1) +
  labs(title = "Diagnostic Plot 1: Residuals vs. Fitted Values",
       subtitle = "Checking for constant variance and linearity")

What to look for:

  • ✓ GOOD: Random scatter around zero with constant spread, flat smoothed line
  • ✗ BAD: Patterns (curves), fanning (increasing spread), non-random clustering

What I observe:

The plot shows residuals scattered around the zero line. The blue smoothed curve is mostly flat near zero, which is good for linearity. However, there’s a noticeable “fanning” effect from left to right—the spread of residuals increases for higher fitted values (more expensive homes).

Assessment:

  • Linearity: The flat blue curve suggests linear relationships are appropriate. The assumption is reasonably met.
  • Constant Variance (Homoscedasticity): VIOLATED. The fanning pattern indicates variance increases with predicted price. This is heteroscedasticity—more expensive homes have more variable prices.
  • Severity: Moderate. The fanning is clear but not extreme.
  • Confidence in assumption: 70% for linearity, 40% for constant variance

Implications: Standard errors might be underestimated for expensive homes, making confidence intervals and p-values less reliable at the high end.

Diagnostic Plot 2: Residuals vs. X Values

plots <- gg_resX(final_model, plot.all = FALSE)

# Plot for each predictor
plots$Gr.Liv.Area +
  geom_smooth(se = FALSE, color = "blue") +
  labs(title = "Residuals vs. Living Area")

plots$Overall.Qual +
  geom_smooth(se = FALSE, color = "blue") +
  labs(title = "Residuals vs. Overall Quality")

plots$Has_Central_Air +
  geom_smooth(se = FALSE, color = "blue") +
  labs(title = "Residuals vs. Has Central Air")

What to look for:

  • ✓ GOOD: Random scatter with no pattern, blue line near red dotted zero line
  • ✗ BAD: Systematic patterns, increasing/decreasing trends, non-constant variance

What I observe:

  1. Gr.Liv.Area: Some fanning—residuals spread more at larger living areas. This confirms the heteroscedasticity we saw in Plot 1. The blue smoothed line is mostly flat, supporting linearity.

  2. Overall.Qual: Residuals appear more evenly distributed across quality levels. The blue line is close to zero with slight waviness, suggesting the linear relationship is appropriate.

  3. Has_Central_Air: Binary variable shows two vertical clusters (0 vs. 1). Both clusters have similar spread, suggesting this variable doesn’t introduce additional heteroscedasticity.

Assessment:

  • Linearity with X: Mostly met. No strong curved patterns that would suggest non-linear relationships.
  • Constant variance across X: Partially violated for Gr.Liv.Area (larger homes have more variable residuals), but acceptable for the other variables.
  • Severity: Moderate for living area, minimal for others
  • Confidence: 65% that linearity assumptions hold across all predictors

Interpretation: The heteroscedasticity issue is primarily driven by living area (size), not quality or AC. Larger homes naturally have more price variation.

Diagnostic Plot 3: Correlation Heatmap

# Select only predictor variables (not response)
predictor_vars <- ames |>
  select(Gr.Liv.Area, Overall.Qual, Has_Central_Air)

ggcorr(predictor_vars, label = TRUE, label_size = 4, hjust = 0.75) +
  labs(title = "Diagnostic Plot 3: Correlation Heatmap of Predictors",
       subtitle = "Checking for multicollinearity among explanatory variables")

What to look for:

  • ✓ GOOD: Low to moderate correlations between predictors (< 0.7)
  • ✗ BAD: High correlations (> 0.7-0.8) indicating multicollinearity

What I observe:

The heatmap shows pairwise correlations between all predictors:

  • Gr.Liv.Area ↔︎ Overall.Qual: 0.57 (moderate positive)
  • Gr.Liv.Area ↔︎ Has_Central_Air: 0.12 (weak)
  • Overall.Qual ↔︎ Has_Central_Air: 0.29 (weak)

Assessment:

  • Multicollinearity: No serious concerns. The highest correlation is 0.57, well below the 0.7-0.8 threshold.
  • Interpretation: Larger homes tend to be higher quality (makes sense—people invest in quality when building big), but the relationship isn’t strong enough to cause problems. AC is essentially independent of both size and quality.
  • Confidence: 95% that multicollinearity is not an issue

Conclusion: All predictors contribute unique information. The VIF values we calculated earlier (all < 2) confirm this visual assessment.

Diagnostic Plot 4: Residual Histogram

gg_reshist(final_model) +
  labs(title = "Diagnostic Plot 4: Histogram of Residuals",
       subtitle = "Checking if residuals are approximately normally distributed")

What to look for:

  • ✓ GOOD: Bell-shaped, symmetric, centered near zero
  • ✗ BAD: Heavily skewed, multiple peaks, long tails

What I observe:

The histogram shows a roughly bell-shaped distribution centered near zero. However, there’s a slight asymmetry—the right tail (positive residuals) extends further than the left tail (negative residuals), indicating slight right skew.

Assessment:

  • Normality: Approximately met. The distribution is “normal enough” for regression purposes.
  • Skewness: Mild right skew suggests some homes sell for much more than predicted, but this isn’t severe.
  • Interpretation: Most residuals cluster around zero (good predictions), with occasional homes that sell for significantly more than the model predicts. These might be unique properties with features we haven’t captured.
  • Severity: Mild. The deviation from perfect normality is small.
  • Confidence: 75% that normality assumption is satisfied

Note: Per class guidance, “residuals just need to be normal enough.” This histogram suggests they are.

Diagnostic Plot 5: Normal Q-Q Plot

gg_qqplot(final_model) +
  labs(title = "Diagnostic Plot 5: Normal Q-Q Plot",
       subtitle = "Comparing residual distribution to theoretical normal distribution")

What to look for:

  • ✓ GOOD: Points follow the diagonal line closely
  • ✗ BAD: Systematic deviations, especially in tails

What I observe:

Points follow the diagonal line well in the center of the distribution, but deviate in the tails:

  • Left tail: Points fall slightly below the line (fewer extreme negative residuals than a perfect normal would have)
  • Right tail: Points rise above the line (more extreme positive residuals than expected)

This S-shaped pattern indicates heavier right tail (positive residuals more extreme than normal distribution predicts).

Assessment:

  • Normality: Mostly met in the center, some deviation in tails
  • Interpretation: The right tail deviation means some homes sell for much more than predicted—possible luxury homes or unique properties our model underestimates.
  • Severity: Moderate in tails, but Q-Q plots are very sensitive. Per class notes: “Be very careful when using QQ-Plots… you may find deviations that are not severe enough to significantly affect your regression model.”
  • Confidence: 70% that normality is satisfied

Balancing with histogram (per class guidance): The histogram shows approximately normal distribution. The Q-Q plot shows tail deviations but Q-Q plots exaggerate small differences. Together, they suggest residuals are “normal enough.”

Conclusion: Normality assumption is reasonably met for valid inference.

BONUS: Cook’s Distance by Observation

gg_cooksd(final_model, threshold = 'matlab') +
  labs(title = "BONUS: Cook's Distance by Observation",
       subtitle = "Identifying influential points (threshold = Matlab convention)")

What to look for:

  • ✓ GOOD: Most observations below threshold, no dominant influential points
  • ✗ BAD: Several observations with Cook’s D well above threshold

What I observe:

A few observations have Cook’s D values above the threshold, indicated by their row numbers. These are homes that have high influence on the regression model—meaning they’re unusual (high leverage) AND poorly predicted (large residuals).

Assessment:

  • Influential outliers: Present but not numerous. Most data is well-behaved.
  • Interpretation: These might be luxury homes, homes with unique features, or homes in unusual neighborhoods that don’t fit the general pattern.
  • Severity: Low. Only a small percentage of homes are highly influential.
  • Confidence: 85% that outliers don’t dominate the model

Recommendation: Could investigate these specific homes to understand what makes them unusual. However, we should NOT remove them without good reason—they’re legitimate sales, just atypical.


Part 3: Overall Model Assessment

Model Comparison

comparison <- data.frame(
  Metric = c("R-squared", "Adjusted R-squared", "RMSE", "Predictors"),
  Base_Model = c(
    summary_base$r.squared,
    summary_base$adj.r.squared,
    sqrt(mean(residuals(base_model)^2)),
    length(coef(base_model)) - 1  # exclude intercept
  ),
  Final_Model = c(
    summary_final$r.squared,
    summary_final$adj.r.squared,
    sqrt(mean(residuals(final_model)^2)),
    length(coef(final_model)) - 1
  )
) |>
  mutate(
    Improvement = Final_Model - Base_Model,
    Pct_Change = round((Final_Model - Base_Model) / Base_Model * 100, 1)
  )

kable(comparison,
      digits = 3,
      caption = "Base Model vs. Final Multiple Regression Model")
Base Model vs. Final Multiple Regression Model
Metric Base_Model Final_Model Improvement Pct_Change
R-squared 0.500 0.735 0.235 47.1
Adjusted R-squared 0.499 0.735 0.235 47.1
RMSE 56504.877 41133.703 -15371.174 -27.2
Predictors 1.000 3.000 2.000 200.0

Key improvements:

  • R² increased from 0.5 to 0.735
  • We now explain 73.5% of price variation (up from 50%)
  • RMSE decreased by $15,371.17
  • Prediction errors reduced from $56,504.88 to $41,133.70

Coefficient Interpretation

coef_table <- tidy(final_model, conf.int = TRUE)
kable(coef_table,
      digits = c(0, 2, 2, 2, 4, 2, 2),
      caption = "Model Coefficients with 95% Confidence Intervals")
Model Coefficients with 95% Confidence Intervals
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -120952.26 3951.47 -30.61 0 -128700.20 -113204.31
Gr.Liv.Area 59.28 1.83 32.31 0 55.68 62.87
Overall.Qual 32245.14 680.88 47.36 0 30910.09 33580.19
Has_Central_Air 17493.92 3181.81 5.50 0 11255.11 23732.73

Practical interpretations:

coeffs <- coef(final_model)
  1. Living Area: Each additional square foot adds approximately $59.28 to price (holding quality and AC constant). This is less than the simple model’s $111.69 per sq ft because we’re now accounting for quality and AC, which were confounded before.

  2. Overall Quality: Each 1-point quality increase adds approximately $32,245.14 to price (holding size and AC constant). Moving from quality 5 to quality 8 (3 points) adds about $96,735.42.

  3. Central Air: Having central air adds approximately $17,493.92 to price (holding size and quality constant). This is much less than the $84,562 from Week 7’s simple comparison because that number was confounded with size and quality.


Conclusion: Can We Trust This Model?

Summary of Diagnostic Findings

The 5 diagnostic plots reveal:

  1. Residuals vs. Fitted: Linearity mostly met (70% confidence), but heteroscedasticity present—variance increases with price (40% confidence in constant variance)

  2. Residuals vs. X Values: Linearity confirmed across predictors (65% confidence), heteroscedasticity primarily from living area variable

  3. Correlation Heatmap: No multicollinearity concerns (95% confidence). All correlations < 0.6, VIF values < 2

  4. Residual Histogram: Approximately normal with slight right skew (75% confidence in normality). Distribution is “normal enough” per class guidance

  5. Normal Q-Q Plot: Good fit in center, tail deviations present but Q-Q plots are oversensitive (70% confidence in normality when balanced with histogram)

BONUS - Cook’s Distance: Few influential outliers, don’t dominate model (85% confidence)

Can We Trust This Model? YES (with caveats)

Overall Confidence: 75%

The model IS valid for inference but not perfect.

Strengths:

  • ✓ Linearity assumption met
  • ✓ Normality approximately satisfied (“normal enough”)
  • ✓ Independence is reasonable (each sale is independent)
  • ✓ No multicollinearity issues
  • ✓ Few influential outliers
  • ✓ Substantial improvement over base model (50% → 74% variance explained)

Weaknesses:

  • ⚠ Heteroscedasticity present (variance increases with price)
  • ⚠ Slight right skew in residuals
  • ⚠ Some influential outliers exist

Implications:

  • Coefficient estimates are UNBIASED (we can trust direction and magnitude)
  • Standard errors might be SLIGHTLY UNDERSTATED for expensive homes
  • P-values and confidence intervals are still approximately valid
  • Predictions are good but less precise for luxury homes

The key quote from the assignment: “We cannot draw conclusions from an invalid model.”

My conclusion: This model IS valid for drawing conclusions. The violations are mild enough that inferences are trustworthy. We know where the model is weakest (expensive homes with high variance) and could address this with transformations (like log-transforming price) in future work.

Further Questions for Future Investigation

  1. Would log-transforming SalePrice fix the heteroscedasticity? Log(Price) often stabilizes variance but makes coefficients harder to interpret.

  2. Are the influential outliers a distinct subgroup? Could fit separate models for luxury vs. standard homes, or add variables capturing their uniqueness.

  3. Is there a Neighborhood effect we’re missing? Location might explain some of the remaining 26% of unexplained variance.

  4. Do relationships change over time? Dataset spans 2006-2010 (financial crisis period). Coefficients might differ by year.

  5. Are there non-linear relationships? Maybe quality has diminishing returns at high levels, or size effects are non-linear.

Final Takeaway

Regression diagnostics are ESSENTIAL. Without them, we’d never know:

  • That variance increases with price (affecting precision of predictions)
  • That a few homes have outsized influence on the model
  • That our normality assumption is met “well enough” despite Q-Q plot concerns
  • That multicollinearity is not a problem

A model with R² = 0.74 looks great, but diagnostics revealed it’s not perfect. The violations are mild, so our conclusions are still valid, but we now understand WHERE the model is weakest and what to improve next.

As emphasized in class: We cannot draw conclusions from an invalid model. These diagnostics confirm our model IS valid for inference, just not flawless. That’s acceptable for real-world data analysis.