This report analyzes the Boston Housing dataset to identify key drivers of residential property prices. Using multiple regression techniques — from a baseline linear model to quadratic and log-transformed variants — the goal is to build a well-specified, interpretable model that accurately predicts median home values (medv).
Skills demonstrated: exploratory data analysis, feature selection, model comparison, assumption diagnostics, and business interpretation.
2 Setup
3 Data Overview
The Boston dataset contains 506 observations across 14 variables. The target variable is medv — the median value of owner-occupied homes in $1,000s.
The data were split 70/30 into training and test sets. Every model is fit on the training set and scored on the held-out test set to measure out-of-sample predictive performance.
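The split code itself is not shown in this section. A minimal sketch of one way to produce a 70/30 split follows, assuming the data come from `MASS::Boston` and that a simple random sample is used; the seed value and the name `train_idx` are illustrative assumptions, not taken from the report.

```r
# Sketch only: the report does not show its split code or seed.
library(MASS)  # provides the Boston data frame (506 rows, 14 columns)

set.seed(123)  # hypothetical seed, chosen here for reproducibility
n <- nrow(Boston)

# Draw 70% of the row indices at random for training
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- Boston[train_idx, ]   # used to fit every model
test  <- Boston[-train_idx, ]  # held out for RMSE evaluation
```

A stratified split (for example on quantiles of `medv`) would be an equally plausible choice and would give slightly different row counts.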
5 Model 1 — Full Linear Model
model_full <- lm(medv ~ ., data = train)
pred_full <- predict(model_full, newdata = test)
rmse_full <- sqrt(mean((test$medv - pred_full)^2))

tidy(model_full) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Full Linear Model — Coefficients")
Full Linear Model — Coefficients
| term        | estimate | std.error | statistic | p.value |
|:------------|---------:|----------:|----------:|--------:|
| (Intercept) |  38.5220 |    6.0837 |    6.3320 |  0.0000 |
| crim        |  -0.1090 |    0.0354 |   -3.0792 |  0.0022 |
| zn          |   0.0530 |    0.0167 |    3.1789 |  0.0016 |
| indus       |  -0.0522 |    0.0788 |   -0.6632 |  0.5077 |
| chas        |   4.0439 |    1.0247 |    3.9463 |  0.0001 |
| nox         | -14.4294 |    4.6706 |   -3.0894 |  0.0022 |
| rm          |   3.1783 |    0.4993 |    6.3650 |  0.0000 |
| age         |  -0.0006 |    0.0162 |   -0.0350 |  0.9721 |
| dis         |  -1.5407 |    0.2405 |   -6.4058 |  0.0000 |
| rad         |   0.3023 |    0.0806 |    3.7490 |  0.0002 |
| tax         |  -0.0105 |    0.0047 |   -2.2520 |  0.0250 |
| ptratio     |  -0.8587 |    0.1599 |   -5.3699 |  0.0000 |
| black       |   0.0069 |    0.0034 |    1.9938 |  0.0470 |
| lstat       |  -0.5838 |    0.0591 |   -9.8707 |  0.0000 |
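Interpretation: The full linear model explains 73.3% of the variance in home prices (R² = 0.733), and most predictors are statistically significant. The exceptions are indus (p = 0.51) and age (p = 0.97), which add little once the other predictors are in the model. The test RMSE of 4.8 corresponds to a typical prediction error of roughly $4,800. Because RMSE squares the errors before averaging, it weights large misses more heavily than a plain average of absolute errors would.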
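To make the RMSE reading concrete, the sketch below computes RMSE alongside mean absolute error (MAE) for the full model. It assumes the `MASS::Boston` data and a hypothetical seed, so its numbers will not match the report's exactly. MAE is not reported in the original analysis and is added here only for comparison; `r2_train` is the in-sample R², which may or may not be the figure the report quotes.

```r
# Sketch only: seed and split are assumptions, so values differ from the report.
library(MASS)

set.seed(123)  # hypothetical seed
train_idx <- sample(seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)))
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

model_full <- lm(medv ~ ., data = train)
pred_full  <- predict(model_full, newdata = test)

err  <- test$medv - pred_full
rmse <- sqrt(mean(err^2))  # squares errors, so large misses dominate
mae  <- mean(abs(err))     # plain average miss, in $1,000s

r2_train <- summary(model_full)$r.squared  # in-sample R-squared

round(c(RMSE = rmse, MAE = mae, R2_train = r2_train), 3)
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two points to a few big misses rather than uniformly mediocre predictions.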
Interpretation: Removing indus and age produced a simpler model with nearly identical performance (RMSE = 4.77, R² = 0.727). When performance is this close, the simpler model is preferable: it is easier to explain to clients and less prone to overfitting.
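The code for the reduced model is not included in this section. One plausible way to fit it is sketched below, assuming the `MASS::Boston` data and a hypothetical seed. The name `model_reduced` is illustrative, and the nested F-test via `anova()` is an added check rather than something the report describes.

```r
# Sketch only: model name, seed, and the F-test are assumptions.
library(MASS)

set.seed(123)  # hypothetical seed
train_idx <- sample(seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)))
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

model_full    <- lm(medv ~ ., data = train)
model_reduced <- lm(medv ~ . - indus - age, data = train)  # drop the two weak terms

pred_reduced <- predict(model_reduced, newdata = test)
rmse_reduced <- sqrt(mean((test$medv - pred_reduced)^2))

# Nested F-test: a large p-value means dropping indus and age loses little fit
cmp <- anova(model_reduced, model_full)
cmp
```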
ggplot(data.frame(actual = test$medv, predicted = pred_quad),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6, color = "#2c7bb6") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(title = "Predicted vs Actual — Quadratic Model",
       x = "Actual Median Value ($1,000s)",
       y = "Predicted Median Value ($1,000s)") +
  theme_minimal()
par(mfrow = c(2, 2))
plot(model_quad)
Diagnostic Plots — Quadratic Model
Interpretation: Adding quadratic terms for rm and lstat improved the model noticeably: RMSE dropped to 4.49 and R² rose to 0.832. The squared terms capture curvature that the linear models miss. Very large homes command disproportionately higher prices, and neighborhoods with very high poverty rates suffer steeper price drops.
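The fitting code for `model_quad` does not appear in this section. A minimal sketch follows, assuming the quadratic terms are added on top of the reduced predictor set (without indus and age) and assuming the `MASS::Boston` data with a hypothetical seed. The exact formula the report used may differ.

```r
# Sketch only: the formula and seed are assumptions, not the report's code.
library(MASS)

set.seed(123)  # hypothetical seed
train_idx <- sample(seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)))
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

# I() protects ^ so it means arithmetic squaring, not formula crossing
model_quad <- lm(medv ~ . - indus - age + I(rm^2) + I(lstat^2), data = train)

pred_quad <- predict(model_quad, newdata = test)
rmse_quad <- sqrt(mean((test$medv - pred_quad)^2))

# The signs of the squared terms show the direction of the curvature
coef(model_quad)[c("rm", "I(rm^2)", "lstat", "I(lstat^2)")]
```

Writing `rm^2` without `I()` would be interpreted by the formula parser as an interaction expansion and would silently collapse to `rm`, which is why the wrapper matters here.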
ggplot(data.frame(actual = test$medv, predicted = pred_log),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6, color = "#1a9641") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(title = "Predicted vs Actual — Log-Quadratic Model",
       x = "Actual Median Value ($1,000s)",
       y = "Predicted Median Value ($1,000s)") +
  theme_minimal()
par(mfrow = c(2, 2))
plot(model_log)
Diagnostic Plots — Log-Quadratic Model
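Interpretation: Log-transforming medv improved the normality of the residuals and stabilized their variance. The model achieved an R² of 0.827 and, with predictions back-transformed to the original scale, an RMSE of 4.56. Its accuracy is slightly below the Quadratic model's, but its better diagnostic behavior makes it the statistically soundest model.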
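Reporting RMSE "on the original scale" requires back-transforming the log-scale predictions before computing errors. The sketch below shows that step, assuming the same predictor set as the quadratic model, the `MASS::Boston` data, and a hypothetical seed. The explicit formula is an assumption, since the report's fitting code is not shown here.

```r
# Sketch only: formula and seed are assumptions, not the report's code.
library(MASS)

set.seed(123)  # hypothetical seed
train_idx <- sample(seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)))
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

# Response is modeled on the log scale; predictors are spelled out explicitly
model_log <- lm(
  log(medv) ~ crim + zn + chas + nox + rm + I(rm^2) + dis + rad +
    tax + ptratio + black + lstat + I(lstat^2),
  data = train
)

# predict() returns log(medv); exp() maps it back to $1,000s
pred_log <- exp(predict(model_log, newdata = test))
rmse_log <- sqrt(mean((test$medv - pred_log)^2))

# R-squared from summary() is on the log scale, not the dollar scale
r2_log_scale <- summary(model_log)$r.squared
```

Two caveats apply to this sketch. Exponentiating a log-scale prediction estimates the conditional median rather than the mean, so it tends to sit slightly below the average outcome. And an R² computed on the log scale is not directly comparable to one computed on the original scale.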
results |>
  pivot_longer(cols = c(RMSE, R2), names_to = "Metric", values_to = "Value") |>
  ggplot(aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Model Performance Comparison", x = NULL, y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))
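The `results` object used by the chart code is not defined in this section. A minimal sketch of what it might look like is given below, built from the figures quoted in the interpretations. The column names `Model`, `RMSE`, and `R2` follow the plotting code; the model labels are assumptions, and the Full Linear RMSE is entered as the rounded 4.8 that the text reports.

```r
# Sketch only: labels are hypothetical; values are the ones quoted in the report.
results <- data.frame(
  Model = c("Full Linear", "Reduced Linear", "Quadratic", "Log-Quadratic"),
  RMSE  = c(4.80, 4.77, 4.49, 4.56),
  R2    = c(0.733, 0.727, 0.832, 0.827)
)

# Rank the models on each metric
best_rmse <- results$Model[which.min(results$RMSE)]
best_r2   <- results$Model[which.max(results$R2)]

results[order(results$RMSE), ]
```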
Conclusion: The Quadratic model achieved the lowest RMSE (4.49) and the highest reported R² (0.832). The Log-Quadratic model was close behind on both metrics (RMSE = 4.56, R² = 0.827) and showed better diagnostic behavior. For a client prioritizing prediction accuracy, the Quadratic model wins. For a client prioritizing statistical rigor, the Log-Quadratic model is the stronger choice.