Boston Housing Price Analysis

Regression Modeling & Predictive Performance

Author

Dagmawi Yosef Asagid

Published

March 4, 2026

1 Introduction

This report analyzes the Boston Housing dataset to identify key drivers of residential property prices. Using multiple regression techniques — from a baseline linear model to quadratic and log-transformed variants — the goal is to build a well-specified, interpretable model that accurately predicts median home values (medv).

Skills demonstrated: exploratory data analysis, feature selection, model comparison, assumption diagnostics, and business interpretation.


2 Setup


3 Data Overview

The Boston dataset contains 506 observations across 14 variables. The target variable is medv — the median value of owner-occupied homes in $1,000s.

Variable Description
crim Per capita crime rate
rm Average number of rooms per dwelling
lstat % lower status of the population
nox Nitrogen oxide concentration
ptratio Pupil-teacher ratio
medv Median home value (target)

4 Train/Test Split

Show Code
n           <- nrow(Boston)
train_index <- sample(1:n, size = 0.7 * n)
train       <- Boston[train_index, ]
test        <- Boston[-train_index, ]

cat("Training observations:", nrow(train), "\n")
Training observations: 354 
Show Code
cat("Test observations:    ", nrow(test),  "\n")
Test observations:     152 

A 70/30 split was used to evaluate out-of-sample predictive performance on all models.


5 Model 1 — Full Linear Model

Show Code
model_full <- lm(medv ~ ., data = train)

pred_full  <- predict(model_full, newdata = test)
rmse_full  <- sqrt(mean((test$medv - pred_full)^2))

tidy(model_full) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Full Linear Model — Coefficients")
Full Linear Model — Coefficients
term estimate std.error statistic p.value
(Intercept) 38.5220 6.0837 6.3320 0.0000
crim -0.1090 0.0354 -3.0792 0.0022
zn 0.0530 0.0167 3.1789 0.0016
indus -0.0522 0.0788 -0.6632 0.5077
chas 4.0439 1.0247 3.9463 0.0001
nox -14.4294 4.6706 -3.0894 0.0022
rm 3.1783 0.4993 6.3650 0.0000
age -0.0006 0.0162 -0.0350 0.9721
dis -1.5407 0.2405 -6.4058 0.0000
rad 0.3023 0.0806 3.7490 0.0002
tax -0.0105 0.0047 -2.2520 0.0250
ptratio -0.8587 0.1599 -5.3699 0.0000
black 0.0069 0.0034 1.9938 0.0470
lstat -0.5838 0.0591 -9.8707 0.0000

Interpretation: The full linear model explains 73.3% of variance in home prices (R² = 0.733). Most predictors are statistically significant. However, indus and age are not significant (p > 0.05), suggesting they add little predictive value. The RMSE of 4.8 means predictions are off by about $4.8k on average.


6 Model 2 — Reduced Linear Model

Show Code
model_reduced <- lm(medv ~ crim + zn + chas + nox + rm +
                    dis + rad + tax + ptratio + black + lstat,
                    data = train)

pred_reduced  <- predict(model_reduced, newdata = test)
rmse_reduced  <- sqrt(mean((test$medv - pred_reduced)^2))

tidy(model_reduced) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Reduced Linear Model — Coefficients")
Reduced Linear Model — Coefficients
term estimate std.error statistic p.value
(Intercept) 38.5779 6.0632 6.3627 0.0000
crim -0.1084 0.0353 -3.0717 0.0023
zn 0.0545 0.0164 3.3272 0.0010
chas 3.9456 1.0088 3.9111 0.0001
nox -15.3265 4.2956 -3.5679 0.0004
rm 3.2183 0.4896 6.5736 0.0000
dis -1.5037 0.2214 -6.7910 0.0000
rad 0.3187 0.0761 4.1912 0.0000
tax -0.0120 0.0040 -2.9638 0.0033
ptratio -0.8670 0.1585 -5.4687 0.0000
black 0.0070 0.0034 2.0421 0.0419
lstat -0.5852 0.0566 -10.3369 0.0000

Interpretation: Removing indus and age produced a cleaner model with nearly identical performance (RMSE = 4.77, R² = 0.727). A simpler model is always preferred when performance is equal — easier to explain to clients and less prone to overfitting.


7 Model 3 — Quadratic Model

Show Code
model_quad <- lm(medv ~ crim + zn + chas + nox + rm + I(rm^2) +
                 dis + rad + tax + ptratio + black +
                 lstat + I(lstat^2),
                 data = train)

pred_quad  <- predict(model_quad, newdata = test)
rmse_quad  <- sqrt(mean((test$medv - pred_quad)^2))

tidy(model_quad) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Quadratic Model — Coefficients")
Quadratic Model — Coefficients
term estimate std.error statistic p.value
(Intercept) 125.2146 9.8198 12.7513 0.0000
crim -0.1407 0.0283 -4.9768 0.0000
zn 0.0233 0.0133 1.7524 0.0806
chas 3.7136 0.8032 4.6233 0.0000
nox -12.5413 3.4376 -3.6482 0.0003
rm -26.0535 2.9877 -8.7201 0.0000
I(rm^2) 2.3123 0.2364 9.7827 0.0000
dis -1.1125 0.1787 -6.2269 0.0000
rad 0.2140 0.0610 3.5099 0.0005
tax -0.0077 0.0032 -2.3801 0.0179
ptratio -0.5416 0.1283 -4.2219 0.0000
black 0.0063 0.0027 2.3023 0.0219
lstat -1.2179 0.1331 -9.1514 0.0000
I(lstat^2) 0.0182 0.0037 4.9935 0.0000
Show Code
ggplot(data.frame(actual = test$medv, predicted = pred_quad),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6, color = "#2c7bb6") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(title = "Predicted vs Actual — Quadratic Model",
       x = "Actual Median Value ($1,000s)",
       y = "Predicted Median Value ($1,000s)") +
  theme_minimal()

Show Code
par(mfrow = c(2, 2))
plot(model_quad)

Diagnostic Plots — Quadratic Model

Interpretation: Adding quadratic terms for rm and lstat improved the model noticeably — RMSE dropped to 4.49 and R² rose to 0.832. Very large homes command disproportionately higher prices, and neighborhoods with very high poverty rates suffer steeper price drops.


8 Model 4 — Log-Quadratic Model

Show Code
model_log <- lm(log(medv) ~ crim + zn + chas + nox + rm + I(rm^2) +
                dis + rad + tax + ptratio + black +
                lstat + I(lstat^2),
                data = train)

pred_log  <- exp(predict(model_log, newdata = test))
rmse_log  <- sqrt(mean((test$medv - pred_log)^2))

tidy(model_log) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Log-Quadratic Model — Coefficients")
Log-Quadratic Model — Coefficients
term estimate std.error statistic p.value
(Intercept) 6.9784 0.4542 15.3650 0.0000
crim -0.0105 0.0013 -8.0072 0.0000
zn 0.0004 0.0006 0.7269 0.4678
chas 0.1335 0.0372 3.5937 0.0004
nox -0.6019 0.1590 -3.7857 0.0002
rm -0.8560 0.1382 -6.1944 0.0000
I(rm^2) 0.0727 0.0109 6.6482 0.0000
dis -0.0408 0.0083 -4.9383 0.0000
rad 0.0116 0.0028 4.1061 0.0001
tax -0.0005 0.0001 -3.2644 0.0012
ptratio -0.0273 0.0059 -4.6090 0.0000
black 0.0003 0.0001 2.5151 0.0124
lstat -0.0446 0.0062 -7.2504 0.0000
I(lstat^2) 0.0004 0.0002 2.1724 0.0305
Show Code
ggplot(data.frame(actual = test$medv, predicted = pred_log),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6, color = "#1a9641") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(title = "Predicted vs Actual — Log-Quadratic Model",
       x = "Actual Median Value ($1,000s)",
       y = "Predicted Median Value ($1,000s)") +
  theme_minimal()

Show Code
par(mfrow = c(2, 2))
plot(model_log)

Diagnostic Plots — Log-Quadratic Model

Interpretation: Log-transforming medv improved normality of residuals and stabilized variance. The model achieved R² of 0.827 with better diagnostic behavior. RMSE on the original scale is 4.56 — the statistically soundest model.


9 Model Comparison

Show Code
results <- data.frame(
  Model = c("Full Linear", "Reduced Linear", "Quadratic", "Log-Quadratic"),
  RMSE  = round(c(rmse_full, rmse_reduced, rmse_quad, rmse_log), 3),
  R2    = round(c(
    summary(model_full)$r.squared,
    summary(model_reduced)$r.squared,
    summary(model_quad)$r.squared,
    summary(model_log)$r.squared
  ), 3)
)

kable(results, caption = "Model Comparison: RMSE and R²")
Model Comparison: RMSE and R²
Model RMSE R2
Full Linear 4.803 0.733
Reduced Linear 4.773 0.733
Quadratic 4.493 0.832
Log-Quadratic 4.558 0.827
Show Code
results |>
  pivot_longer(cols = c(RMSE, R2), names_to = "Metric", values_to = "Value") |>
  ggplot(aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Model Performance Comparison", x = NULL, y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

Conclusion: The Quadratic model achieved the lowest RMSE (4.49) while the Log-Quadratic model achieved the highest R² (0.827) with better diagnostic behavior. For a client prioritizing prediction accuracy, the Quadratic model wins. For a client prioritizing statistical rigor, the Log-Quadratic model is the stronger choice.


10 Key Business Insights

  • 🏠 Room count matters non-linearly — larger homes command disproportionately higher premiums
  • 📉 Poverty rate (lstat) is the strongest negative driver — steep, non-linear price declines in high-poverty areas
  • 🌫️ Air quality (nox) significantly depresses prices — ~$12,500–$15,000 lower home values per unit increase
  • 🏫 Higher pupil-teacher ratios lower values — areas with underfunded schools are penalized in home prices
  • Best for prediction: Quadratic model (RMSE = 4.49)
  • Best for inference: Log-Quadratic model (R² = 0.827)

Analysis by Dagmawi Yosef Asagid