This example simulates housing prices and fits a Multiple
Linear Regression (MLR) to predict Price from
Size, Bedrooms, Bathrooms, Age.
We purposefully generate data from a known linear process with random
noise so we can compare the recovered model to the true data-generating
mechanism.
set.seed(123) # for reproducibility
n <- 100 # Number of observations
Size <- runif(n, 1000, 5000) # Size in square feet
Bedrooms <- sample(2:5, n, replace = TRUE) # Number of bedrooms
Bathrooms <- sample(1:4, n, replace = TRUE) # Number of bathrooms
Age <- sample(1:30, n, replace = TRUE) # Age of the house in years
# Generate Prices based on a hypothetical model (the DGP)
Price <- 50000 + 3*Size + 20000*Bedrooms + 15000*Bathrooms - 2000*Age + rnorm(n, 0, 5000)
# Combine into a dataframe
housing_data <- data.frame(Size, Bedrooms, Bathrooms, Age, Price)
str(housing_data)
## 'data.frame': 100 obs. of 5 variables:
## $ Size : num 2150 4153 2636 4532 4762 ...
## $ Bedrooms : int 2 5 3 2 5 3 2 4 4 5 ...
## $ Bathrooms: int 2 2 4 1 4 4 2 3 4 4 ...
## $ Age : int 26 10 17 26 17 29 26 27 21 7 ...
## $ Price : num 72070 168517 140935 74851 190015 ...
summary(housing_data)
## Size Bedrooms Bathrooms Age Price
## Min. :1002 Min. :2.00 Min. :1.00 Min. : 1.00 Min. : 62363
## 1st Qu.:1982 1st Qu.:3.00 1st Qu.:1.00 1st Qu.: 7.00 1st Qu.:112149
## Median :2865 Median :4.00 Median :2.00 Median :16.50 Median :137773
## Mean :2994 Mean :3.62 Mean :2.37 Mean :15.53 Mean :136111
## 3rd Qu.:4022 3rd Qu.:4.25 3rd Qu.:3.25 3rd Qu.:24.25 3rd Qu.:156084
## Max. :4977 Max. :5.00 Max. :4.00 Max. :30.00 Max. :201607
Interpretation:
- The coefficients used to create Price are the “true”
effects; fitting MLR should recover values close to these, up to
sampling noise.
- This controlled setup lets us focus on the workflow: fitting,
validating assumptions, and interpreting results.
model <- lm(Price ~ Size + Bedrooms + Bathrooms + Age, data = housing_data)
summary(model)
##
## Call:
## lm(formula = Price ~ Size + Bedrooms + Bathrooms + Age, data = housing_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8485.5 -3008.9 -529.9 3215.3 11175.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49128.5075 2529.9210 19.419 < 2e-16 ***
## Size 3.5372 0.4214 8.394 4.43e-13 ***
## Bedrooms 20003.2507 458.1050 43.665 < 2e-16 ***
## Bathrooms 14816.1260 409.9013 36.146 < 2e-16 ***
## Age -2004.7813 52.6274 -38.094 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4731 on 95 degrees of freedom
## Multiple R-squared: 0.9802, Adjusted R-squared: 0.9794
## F-statistic: 1178 on 4 and 95 DF, p-value: < 2.2e-16
Interpretation:
- Each coefficient is the expected marginal change in
Price for a one-unit increase in that predictor,
holding others constant.
- Small standard errors and large |t| values indicate precise,
statistically significant estimates.
- A high R-squared suggests strong in-sample fit—expected here because
the data were generated from a linear model.
co <- coef(model)
eq <- sprintf(
"Estimated Model: Price = %.2f*Size + %.2f*Bedrooms + %.2f*Bathrooms %+.2f*Age %+.2f (Intercept)",
co["Size"], co["Bedrooms"], co["Bathrooms"], co["Age"], co["(Intercept)"]
)
eq
## [1] "Estimated Model: Price = 3.54*Size + 20003.25*Bedrooms + 14816.13*Bathrooms -2004.78*Age +49128.51 (Intercept)"
rsq <- summary(model)$r.squared
adj <- summary(model)$adj.r.squared
sprintf("R-squared = %.4f, Adjusted R-squared = %.4f", rsq, adj)
## [1] "R-squared = 0.9802, Adjusted R-squared = 0.9794"
Interpretation:
- Compare these estimates to the data-generating parameters (Size=3,
Bedrooms=20000, Bathrooms=15000, Age=-2000, Intercept=50000).
- Due to random noise, the fitted values should be close but not
identical.
new_house <- data.frame(Size = 3000, Bedrooms = 4, Bathrooms = 3, Age = 10)
predicted_price <- predict(model, newdata = new_house, interval = "prediction", level = 0.95)
predicted_price
## fit lwr upr
## 1 164153.7 154673.1 173634.3
Interpretation:
- The point prediction is the model’s best guess given
features.
- The prediction interval reflects uncertainty for a
single new observation (includes residual variance).
Examining diagnostic plots is essential for assessing linear model assumptions.
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
How to read these plots:
1. Residuals vs Fitted — look for random cloud around
0. Patterns or funnels suggest nonlinearity or heteroscedasticity.
2. Normal Q–Q — points near the line imply
approximately normal residuals (good for t-tests/CI validity).
3. Scale–Location — flat band suggests constant
variance across fitted values.
4. Residuals vs Leverage — points with high leverage
and large residuals are influential; check Cook’s distance contours.
sw <- shapiro.test(residuals(model))
sw
##
## Shapiro-Wilk normality test
##
## data: residuals(model)
## W = 0.9759, p-value = 0.06348
Interpretation:
- p > .05 → fail to reject normality (residuals are
plausibly normal).
- p < .05 → potential deviation from normality;
consider transformations or robust inference if severe.
For time-indexed data, test residual autocorrelation. For purely cross-sectional data (like this example), independence is usually reasonable.
if (!requireNamespace("lmtest", quietly = TRUE)) install.packages("lmtest", quiet = TRUE)
library(lmtest)
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 1.8934, p-value = 0.2929
## alternative hypothesis: true autocorrelation is greater than 0
Interpretation:
- Non-significant result suggests no strong autocorrelation.
- If autocorrelation is present, consider time-series models or adding
lags.
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 1.3662, df = 4, p-value = 0.85
Interpretation:
- p > .05 supports constant residual variance
(homoscedasticity).
- p < .05 indicates heteroscedasticity; consider
transformations (e.g., log Price) or robust SEs.
if (!requireNamespace("car", quietly = TRUE)) install.packages("car", quiet = TRUE)
library(car)
vif(model)
## Size Bedrooms Bathrooms Age
## 1.020737 1.008399 1.030653 1.045136
Interpretation:
- Rules of thumb: VIF < 5 (or < 10) is typically
acceptable.
- High VIFs inflate SEs and make coefficients unstable; consider feature
selection or combining correlated variables.
cd <- cooks.distance(model)
plot(cd, type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4/length(cd), col = "red", lty = 2) # common heuristic cutoff
lev <- hatvalues(model)
summary(lev)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01896 0.03765 0.04750 0.05000 0.05871 0.09430
head(sort(lev, decreasing = TRUE), 10)
## 11 18 21 4 31 35 84
## 0.09429627 0.09233762 0.08680494 0.08634897 0.08513207 0.08208426 0.08040005
## 87 15 74
## 0.07923661 0.07759301 0.07758268
Interpretation:
- Observations above the dashed line (Cook’s D) or with unusually high
leverage merit review for data quality issues or missing
predictors.
- Perform sensitivity checks (refit without the point) before deciding
to remove any observation.