1 1. Data Generation

This example simulates housing prices and fits a Multiple Linear Regression (MLR) to predict Price from Size, Bedrooms, Bathrooms, Age.
We purposefully generate data from a known linear process with random noise so we can compare the recovered model to the true data-generating mechanism.

set.seed(123)  # for reproducibility

n <- 100 # Number of observations 

Size      <- runif(n, 1000, 5000)           # Size in square feet 
Bedrooms  <- sample(2:5, n, replace = TRUE) # Number of bedrooms 
Bathrooms <- sample(1:4, n, replace = TRUE) # Number of bathrooms 
Age       <- sample(1:30, n, replace = TRUE) # Age of the house in years 

# Generate Prices based on a hypothetical model (the DGP)
Price <- 50000 + 3*Size + 20000*Bedrooms + 15000*Bathrooms - 2000*Age + rnorm(n, 0, 5000)

# Combine into a dataframe 
housing_data <- data.frame(Size, Bedrooms, Bathrooms, Age, Price)

str(housing_data)

## 'data.frame':    100 obs. of  5 variables:
##  $ Size     : num  2150 4153 2636 4532 4762 ...
##  $ Bedrooms : int  2 5 3 2 5 3 2 4 4 5 ...
##  $ Bathrooms: int  2 2 4 1 4 4 2 3 4 4 ...
##  $ Age      : int  26 10 17 26 17 29 26 27 21 7 ...
##  $ Price    : num  72070 168517 140935 74851 190015 ...

summary(housing_data)

##       Size         Bedrooms      Bathrooms         Age            Price       
##  Min.   :1002   Min.   :2.00   Min.   :1.00   Min.   : 1.00   Min.   : 62363  
##  1st Qu.:1982   1st Qu.:3.00   1st Qu.:1.00   1st Qu.: 7.00   1st Qu.:112149  
##  Median :2865   Median :4.00   Median :2.00   Median :16.50   Median :137773  
##  Mean   :2994   Mean   :3.62   Mean   :2.37   Mean   :15.53   Mean   :136111  
##  3rd Qu.:4022   3rd Qu.:4.25   3rd Qu.:3.25   3rd Qu.:24.25   3rd Qu.:156084  
##  Max.   :4977   Max.   :5.00   Max.   :4.00   Max.   :30.00   Max.   :201607

Interpretation:
- The coefficients used to create Price are the “true” effects; fitting MLR should recover values close to these, up to sampling noise.
- This controlled setup lets us focus on the workflow: fitting, validating assumptions, and interpreting results.

2 2. Fit the Multiple Linear Regression Model

model <- lm(Price ~ Size + Bedrooms + Bathrooms + Age, data = housing_data)
summary(model)

## 
## Call:
## lm(formula = Price ~ Size + Bedrooms + Bathrooms + Age, data = housing_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8485.5 -3008.9  -529.9  3215.3 11175.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49128.5075  2529.9210  19.419  < 2e-16 ***
## Size            3.5372     0.4214   8.394 4.43e-13 ***
## Bedrooms    20003.2507   458.1050  43.665  < 2e-16 ***
## Bathrooms   14816.1260   409.9013  36.146  < 2e-16 ***
## Age         -2004.7813    52.6274 -38.094  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4731 on 95 degrees of freedom
## Multiple R-squared:  0.9802, Adjusted R-squared:  0.9794 
## F-statistic:  1178 on 4 and 95 DF,  p-value: < 2.2e-16

Interpretation:
- Each coefficient is the expected marginal change in Price for a one-unit increase in that predictor, holding others constant.
- Small standard errors and large |t| values indicate precise, statistically significant estimates.
- A high R-squared suggests strong in-sample fit—expected here because the data were generated from a linear model.

co <- coef(model)
eq <- sprintf(
  "Estimated Model:  Price = %.2f*Size + %.2f*Bedrooms + %.2f*Bathrooms %+.2f*Age %+.2f (Intercept)",
  co["Size"], co["Bedrooms"], co["Bathrooms"], co["Age"], co["(Intercept)"]
)
eq

## [1] "Estimated Model:  Price = 3.54*Size + 20003.25*Bedrooms + 14816.13*Bathrooms -2004.78*Age +49128.51 (Intercept)"

rsq <- summary(model)$r.squared
adj <- summary(model)$adj.r.squared
sprintf("R-squared = %.4f, Adjusted R-squared = %.4f", rsq, adj)

## [1] "R-squared = 0.9802, Adjusted R-squared = 0.9794"

Interpretation:
- Compare these estimates to the data-generating parameters (Size=3, Bedrooms=20000, Bathrooms=15000, Age=-2000, Intercept=50000).
- Due to random noise, the fitted values should be close but not identical.

3 3. Prediction for a New House

new_house <- data.frame(Size = 3000, Bedrooms = 4, Bathrooms = 3, Age = 10)
predicted_price <- predict(model, newdata = new_house, interval = "prediction", level = 0.95)
predicted_price

##        fit      lwr      upr
## 1 164153.7 154673.1 173634.3

Interpretation:
- The point prediction is the model’s best guess given features.
- The prediction interval reflects uncertainty for a single new observation (includes residual variance).

4 4. Residual Diagnostics (Plots)

Examining diagnostic plots is essential for assessing linear model assumptions.

par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

How to read these plots:
1. Residuals vs Fitted — look for random cloud around 0. Patterns or funnels suggest nonlinearity or heteroscedasticity.
2. Normal Q–Q — points near the line imply approximately normal residuals (good for t-tests/CI validity).
3. Scale–Location — flat band suggests constant variance across fitted values.
4. Residuals vs Leverage — points with high leverage and large residuals are influential; check Cook’s distance contours.

5 5. Normality Test (Shapiro–Wilk)

sw <- shapiro.test(residuals(model))
sw

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model)
## W = 0.9759, p-value = 0.06348

Interpretation:
- p > .05 → fail to reject normality (residuals are plausibly normal).
- p < .05 → potential deviation from normality; consider transformations or robust inference if severe.

6 6. Independence of Errors (Durbin–Watson)

For time-indexed data, test residual autocorrelation. For purely cross-sectional data (like this example), independence is usually reasonable.

if (!requireNamespace("lmtest", quietly = TRUE)) install.packages("lmtest", quiet = TRUE)
library(lmtest)
dwtest(model)

## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.8934, p-value = 0.2929
## alternative hypothesis: true autocorrelation is greater than 0

Interpretation:
- Non-significant result suggests no strong autocorrelation.
- If autocorrelation is present, consider time-series models or adding lags.

7 7. Homoscedasticity (Breusch–Pagan)

bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.3662, df = 4, p-value = 0.85

Interpretation:
- p > .05 supports constant residual variance (homoscedasticity).
- p < .05 indicates heteroscedasticity; consider transformations (e.g., log Price) or robust SEs.

8 8. Multicollinearity (VIF)

if (!requireNamespace("car", quietly = TRUE)) install.packages("car", quiet = TRUE)
library(car)
vif(model)

##      Size  Bedrooms Bathrooms       Age 
##  1.020737  1.008399  1.030653  1.045136

Interpretation:
- Rules of thumb: VIF < 5 (or < 10) is typically acceptable.
- High VIFs inflate SEs and make coefficients unstable; consider feature selection or combining correlated variables.

9 9. Influence (Cook’s Distance & Leverage)

cd <- cooks.distance(model)
plot(cd, type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4/length(cd), col = "red", lty = 2)  # common heuristic cutoff

lev <- hatvalues(model)
summary(lev)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01896 0.03765 0.04750 0.05000 0.05871 0.09430

head(sort(lev, decreasing = TRUE), 10)

##         11         18         21          4         31         35         84 
## 0.09429627 0.09233762 0.08680494 0.08634897 0.08513207 0.08208426 0.08040005 
##         87         15         74 
## 0.07923661 0.07759301 0.07758268

Interpretation:
- Observations above the dashed line (Cook’s D) or with unusually high leverage merit review for data quality issues or missing predictors.
- Perform sensitivity checks (refit without the point) before deciding to remove any observation.

10 10. Summary & Next Steps

The model recovers coefficients close to the true data-generating parameters, as expected in a well-specified linear setting with noise.
Assumptions (normality, independence, homoscedasticity) appear reasonable when tests are non-significant and plots look healthy.
Next steps: explore interaction terms (e.g., Size × Bedrooms), nonlinearities (splines, polynomials), model comparison via AIC/BIC, and out-of-sample validation (train/test split or cross-validation).

Multiple Linear Regression & Diagnostics — Housing Example

Irene Tsapara

October 11, 2025