Boston Housing Price Analysis

Regression Modeling & Predictive Performance

Author

Dagmawi Yosef Asagid

Published

March 4, 2026

1 Introduction

This report analyzes the Boston Housing dataset to identify key drivers of residential property prices. Using multiple regression techniques — from a baseline linear model to quadratic and log-transformed variants — the goal is to build a well-specified, interpretable model that accurately predicts median home values (medv).

Skills demonstrated: exploratory data analysis, feature selection, model comparison, assumption diagnostics, and business interpretation.

2 Setup

3 Data Overview

The Boston dataset contains 506 observations across 14 variables. The target variable is medv — the median value of owner-occupied homes in $1,000s.

Variable	Description
`crim`	Per capita crime rate
`rm`	Average number of rooms per dwelling
`lstat`	% lower status of the population
`nox`	Nitrogen oxide concentration
`ptratio`	Pupil-teacher ratio
`medv`	Median home value (target)

4 Train/Test Split

Show Code

n           <- nrow(Boston)
train_index <- sample(1:n, size = 0.7 * n)
train       <- Boston[train_index, ]
test        <- Boston[-train_index, ]

cat("Training observations:", nrow(train), "\n")

Training observations: 354

Show Code

cat("Test observations:    ", nrow(test),  "\n")

Test observations:     152

A 70/30 split was used to evaluate out-of-sample predictive performance on all models.

5 Model 1 — Full Linear Model

Show Code

model_full <- lm(medv ~ ., data = train)

pred_full  <- predict(model_full, newdata = test)
rmse_full  <- sqrt(mean((test$medv - pred_full)^2))

tidy(model_full) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Full Linear Model — Coefficients")

Full Linear Model — Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	38.5220	6.0837	6.3320	0.0000
crim	-0.1090	0.0354	-3.0792	0.0022
zn	0.0530	0.0167	3.1789	0.0016
indus	-0.0522	0.0788	-0.6632	0.5077
chas	4.0439	1.0247	3.9463	0.0001
nox	-14.4294	4.6706	-3.0894	0.0022
rm	3.1783	0.4993	6.3650	0.0000
age	-0.0006	0.0162	-0.0350	0.9721
dis	-1.5407	0.2405	-6.4058	0.0000
rad	0.3023	0.0806	3.7490	0.0002
tax	-0.0105	0.0047	-2.2520	0.0250
ptratio	-0.8587	0.1599	-5.3699	0.0000
black	0.0069	0.0034	1.9938	0.0470
lstat	-0.5838	0.0591	-9.8707	0.0000

Interpretation: The full linear model explains 73.3% of variance in home prices (R² = 0.733). Most predictors are statistically significant. However, indus and age are not significant (p > 0.05), suggesting they add little predictive value. The RMSE of 4.8 means predictions are off by about $4.8k on average.

6 Model 2 — Reduced Linear Model

Show Code

model_reduced <- lm(medv ~ crim + zn + chas + nox + rm +
                    dis + rad + tax + ptratio + black + lstat,
                    data = train)

pred_reduced  <- predict(model_reduced, newdata = test)
rmse_reduced  <- sqrt(mean((test$medv - pred_reduced)^2))

tidy(model_reduced) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Reduced Linear Model — Coefficients")

Reduced Linear Model — Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	38.5779	6.0632	6.3627	0.0000
crim	-0.1084	0.0353	-3.0717	0.0023
zn	0.0545	0.0164	3.3272	0.0010
chas	3.9456	1.0088	3.9111	0.0001
nox	-15.3265	4.2956	-3.5679	0.0004
rm	3.2183	0.4896	6.5736	0.0000
dis	-1.5037	0.2214	-6.7910	0.0000
rad	0.3187	0.0761	4.1912	0.0000
tax	-0.0120	0.0040	-2.9638	0.0033
ptratio	-0.8670	0.1585	-5.4687	0.0000
black	0.0070	0.0034	2.0421	0.0419
lstat	-0.5852	0.0566	-10.3369	0.0000

Interpretation: Removing indus and age produced a cleaner model with nearly identical performance (RMSE = 4.77, R² = 0.727). A simpler model is always preferred when performance is equal — easier to explain to clients and less prone to overfitting.

7 Model 3 — Quadratic Model

Show Code

model_quad <- lm(medv ~ crim + zn + chas + nox + rm + I(rm^2) +
                 dis + rad + tax + ptratio + black +
                 lstat + I(lstat^2),
                 data = train)

pred_quad  <- predict(model_quad, newdata = test)
rmse_quad  <- sqrt(mean((test$medv - pred_quad)^2))

tidy(model_quad) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Quadratic Model — Coefficients")

Quadratic Model — Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	125.2146	9.8198	12.7513	0.0000
crim	-0.1407	0.0283	-4.9768	0.0000
zn	0.0233	0.0133	1.7524	0.0806
chas	3.7136	0.8032	4.6233	0.0000
nox	-12.5413	3.4376	-3.6482	0.0003
rm	-26.0535	2.9877	-8.7201	0.0000
I(rm^2)	2.3123	0.2364	9.7827	0.0000
dis	-1.1125	0.1787	-6.2269	0.0000
rad	0.2140	0.0610	3.5099	0.0005
tax	-0.0077	0.0032	-2.3801	0.0179
ptratio	-0.5416	0.1283	-4.2219	0.0000
black	0.0063	0.0027	2.3023	0.0219
lstat	-1.2179	0.1331	-9.1514	0.0000
I(lstat^2)	0.0182	0.0037	4.9935	0.0000

Show Code

ggplot(data.frame(actual = test$medv, predicted = pred_quad),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6, color = "#2c7bb6") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(title = "Predicted vs Actual — Quadratic Model",
       x = "Actual Median Value ($1,000s)",
       y = "Predicted Median Value ($1,000s)") +
  theme_minimal()

Show Code

par(mfrow = c(2, 2))
plot(model_quad)

Interpretation: Adding quadratic terms for rm and lstat improved the model noticeably — RMSE dropped to 4.49 and R² rose to 0.832. Very large homes command disproportionately higher prices, and neighborhoods with very high poverty rates suffer steeper price drops.

8 Model 4 — Log-Quadratic Model

Show Code

model_log <- lm(log(medv) ~ crim + zn + chas + nox + rm + I(rm^2) +
                dis + rad + tax + ptratio + black +
                lstat + I(lstat^2),
                data = train)

pred_log  <- exp(predict(model_log, newdata = test))
rmse_log  <- sqrt(mean((test$medv - pred_log)^2))

tidy(model_log) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Log-Quadratic Model — Coefficients")

Log-Quadratic Model — Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	6.9784	0.4542	15.3650	0.0000
crim	-0.0105	0.0013	-8.0072	0.0000
zn	0.0004	0.0006	0.7269	0.4678
chas	0.1335	0.0372	3.5937	0.0004
nox	-0.6019	0.1590	-3.7857	0.0002
rm	-0.8560	0.1382	-6.1944	0.0000
I(rm^2)	0.0727	0.0109	6.6482	0.0000
dis	-0.0408	0.0083	-4.9383	0.0000
rad	0.0116	0.0028	4.1061	0.0001
tax	-0.0005	0.0001	-3.2644	0.0012
ptratio	-0.0273	0.0059	-4.6090	0.0000
black	0.0003	0.0001	2.5151	0.0124
lstat	-0.0446	0.0062	-7.2504	0.0000
I(lstat^2)	0.0004	0.0002	2.1724	0.0305

Show Code

ggplot(data.frame(actual = test$medv, predicted = pred_log),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6, color = "#1a9641") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(title = "Predicted vs Actual — Log-Quadratic Model",
       x = "Actual Median Value ($1,000s)",
       y = "Predicted Median Value ($1,000s)") +
  theme_minimal()

Show Code

par(mfrow = c(2, 2))
plot(model_log)

Interpretation: Log-transforming medv improved normality of residuals and stabilized variance. The model achieved R² of 0.827 with better diagnostic behavior. RMSE on the original scale is 4.56 — the statistically soundest model.

9 Model Comparison

Show Code

results <- data.frame(
  Model = c("Full Linear", "Reduced Linear", "Quadratic", "Log-Quadratic"),
  RMSE  = round(c(rmse_full, rmse_reduced, rmse_quad, rmse_log), 3),
  R2    = round(c(
    summary(model_full)$r.squared,
    summary(model_reduced)$r.squared,
    summary(model_quad)$r.squared,
    summary(model_log)$r.squared
  ), 3)
)

kable(results, caption = "Model Comparison: RMSE and R²")

Model Comparison: RMSE and R²
Model	RMSE	R2
Full Linear	4.803	0.733
Reduced Linear	4.773	0.733
Quadratic	4.493	0.832
Log-Quadratic	4.558	0.827

Show Code

results |>
  pivot_longer(cols = c(RMSE, R2), names_to = "Metric", values_to = "Value") |>
  ggplot(aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Model Performance Comparison", x = NULL, y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

Conclusion: The Quadratic model achieved the lowest RMSE (4.49) while the Log-Quadratic model achieved the highest R² (0.827) with better diagnostic behavior. For a client prioritizing prediction accuracy, the Quadratic model wins. For a client prioritizing statistical rigor, the Log-Quadratic model is the stronger choice.

10 Key Business Insights

🏠 Room count matters non-linearly — larger homes command disproportionately higher premiums
📉 Poverty rate (lstat) is the strongest negative driver — steep, non-linear price declines in high-poverty areas
🌫️ Air quality (nox) significantly depresses prices — ~$12,500–$15,000 lower home values per unit increase
🏫 Higher pupil-teacher ratios lower values — areas with underfunded schools are penalized in home prices
✅ Best for prediction: Quadratic model (RMSE = 4.49)
✅ Best for inference: Log-Quadratic model (R² = 0.827)

Analysis by Dagmawi Yosef Asagid