Boston Housing Dataset - Advanced Analytical Report

Author

Naveen

Published

June 28, 2025

Introduction

Among data science datasets, the Boston Housing dataset is a famous one and remains among the top regression benchmarks. It originated in a 1970s U.S. Census Bureau research effort and has been extensively used as a teaching and testing dataset for statistical and machine learning models. Its extensive availability and practicality in real-world applications make it a go-to choice for housing price modeling and socioeconomic examination.

The dataset has 506 observations and 14 predictors with the median number of houses owned (medv) as the response. The significant predictors include crime rate (crim), rooms (rm), distance to city center (dis), nitrogen oxides concentration (nox), and pupil-teacher ratio (ptratio), among other socio-economic variables such as ratio of lower status population (lstat).

R programming is used to clean the data, perform exploratory data analysis (EDA), visualization, correlation inspections, and predictive modeling using linear and logistic regression in this report. The emphasis here is on analytical depth as well as interpretability, with particular attention paid to multicollinearity, model diagnostics, as well as real-world insights into real estate and urban planning.


1. Loading Libraries and Dataset

library(MASS)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(caret)
library(corrplot)
library(scales)
library(car)

data("Boston")

2. Data Cleaning and Structure Validation

str(Boston)
'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
missing_values <- sapply(Boston, function(x) sum(is.na(x)))
missing_values
   crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
      0       0       0       0       0       0       0       0       0       0 
ptratio   black   lstat    medv 
      0       0       0       0 
cat("Total missing values:", sum(missing_values))
Total missing values: 0

Interpretation:
There are no missing values. All variables are numeric, making the dataset ready for modeling.


3. Exploratory Data Analysis (EDA)

p1 <- ggplot(Boston, aes(x = crim)) + geom_histogram(binwidth = 1, fill = "#3366cc") + ggtitle("Histogram: Crime Rate")
p2 <- ggplot(Boston, aes(y = crim)) + geom_boxplot(fill = "#3366cc") + ggtitle("Boxplot: Crime Rate")
p3 <- ggplot(Boston, aes(x = nox)) + geom_histogram(binwidth = 0.01, fill = "#66cc66") + ggtitle("Histogram: Nitrogen Oxides")
p4 <- ggplot(Boston, aes(y = nox)) + geom_boxplot(fill = "#66cc66") + ggtitle("Boxplot: Nitrogen Oxides")
p5 <- ggplot(Boston, aes(x = rm)) + geom_histogram(binwidth = 0.2, fill = "#ff9933") + ggtitle("Histogram: Number of Rooms")
p6 <- ggplot(Boston, aes(y = rm)) + geom_boxplot(fill = "#ff9933") + ggtitle("Boxplot: Number of Rooms")
p7 <- ggplot(Boston, aes(x = medv)) + geom_histogram(binwidth = 1, fill = "#cc33cc") + ggtitle("Histogram: Median Value")
p8 <- ggplot(Boston, aes(y = medv)) + geom_boxplot(fill = "#cc33cc") + ggtitle("Boxplot: Median Value")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 4)

Insight:
Some variables like crim and nox are skewed, and medv has a cap at 50, indicating potential outliers or censoring. rm is normally distributed and expected to be positively related to house prices.


4. Correlation and Multicollinearity Check

cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "color", type = "lower", tl.cex = 0.8)

Insight:
rm shows a strong positive correlation with medv, while lstat and nox are strong negative predictors. nox and dis are highly correlated, raising concerns about multicollinearity.

vif_model <- lm(medv ~ ., data = Boston)
vif(vif_model)
    crim       zn    indus     chas      nox       rm      age      dis 
1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
     rad      tax  ptratio    black    lstat 
7.484496 9.008554 1.799084 1.348521 2.941491 

5. Multiple Linear Regression

model_lm <- lm(medv ~ crim + rm + nox, data = Boston)
summary(model_lm)

Call:
lm(formula = medv ~ crim + rm + nox, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.242  -3.203  -0.762   2.386  39.130 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -19.13263    3.24427  -5.897 6.79e-09 ***
crim         -0.19856    0.03496  -5.680 2.28e-08 ***
rm            7.91045    0.40723  19.425  < 2e-16 ***
nox         -13.21688    2.65597  -4.976 8.92e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.095 on 502 degrees of freedom
Multiple R-squared:  0.5635,    Adjusted R-squared:  0.5609 
F-statistic:   216 on 3 and 502 DF,  p-value: < 2.2e-16
par(mfrow = c(1, 2))
plot(model_lm$fitted.values, residuals(model_lm), xlab = "Fitted", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")
qqnorm(residuals(model_lm))
qqline(residuals(model_lm), col = "red")

par(mfrow = c(1, 1))

Linear regression model tells us about the effect of medv, among other explanatory factors such as lstat and rm. As indicated in the model summary, lstat (lower status group population) negatively correlates greatly with house prices, whereas rm (mean number of rooms per dwelling) has a great positive impact. The adjusted R-squared measure shows the degree to which the model accounts for the variation in house prices. Although linear regression assumes linearity and homoscedasticity, the residual plots show slight deviations from assumption, indicating potential to better fit the model using transformation or higher-order techniques.

Model Evaluation:
The model shows expected signs on coefficients. However, residuals show some heteroscedasticity and slight deviation from normality.


6. Enhanced Regression Model with All Variables

full_model <- lm(medv ~ ., data = Boston)
summary(full_model)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Improvement:
The adjusted R-squared improves, but multicollinearity may affect coefficient reliability. This can be mitigated using variable selection or regularization techniques.


7. Logistic Regression

Boston$medv_cat <- ifelse(Boston$medv > median(Boston$medv), 1, 0)
log_model <- glm(medv_cat ~ crim + rm + nox, data = Boston, family = binomial)
summary(log_model)

Call:
glm(formula = medv_cat ~ crim + rm + nox, family = binomial, 
    data = Boston)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -13.08124    2.11656  -6.180 6.39e-10 ***
crim         -0.15723    0.04558  -3.450 0.000561 ***
rm            2.57572    0.31017   8.304  < 2e-16 ***
nox          -4.93415    1.51082  -3.266 0.001091 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 701.39  on 505  degrees of freedom
Residual deviance: 445.42  on 502  degrees of freedom
AIC: 453.42

Number of Fisher Scoring iterations: 6
pred_probs <- predict(log_model, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)
conf_matrix <- table(Predicted = pred_class, Actual = Boston$medv_cat)
conf_matrix
         Actual
Predicted   0   1
        0 217  48
        1  39 202
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Classification Accuracy:", round(accuracy * 100, 2), "%")
Classification Accuracy: 82.81 %
df_conf <- as.data.frame(conf_matrix)
ggplot(df_conf, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white", size = 5) +
  scale_fill_gradient(low = "#cce5ff", high = "#003366") +
  labs(title = "Confusion Matrix", x = "Actual", y = "Predicted") +
  theme_minimal()

The logistic regression model classifies houses into high value or low value, and this is determined based on the medv threshold. lstat and rm are once more found to be key predictors here. The accuracy, precision, and confusion matrix of the model suggest that the model classifies fairly well, but not without mistake. It could be further improved by feature engineering or ensembling. This is a sound way of classification to segment housing markets for targeted strategies.

Insight:
The simple classification model achieves moderate accuracy. Incorporating additional predictors like lstat may improve performance.


8. Reflection and Recommendations

Strengths:
- Key variables (rm, lstat, crim) were identified as highly influential.
- Graphical and statistical techniques supported valid conclusions.
- Both regression and classification models were explored.

Weaknesses:
- Linear model assumptions were not fully met (e.g., residuals not homoscedastic).
- Multicollinearity remains a concern in full models.

Recommendations:
- Apply variable transformation or interaction terms.
- Use advanced models such as Random Forests or Gradient Boosting.
- Address multicollinearity using VIF checks or PCA.


Conclusion

The project was successful in showing the application of R in analyzing the Boston Housing data properly. From statistical modeling, EDA, and visualization, the analysis was able to draw critical insights into the determinants of house prices. Predictors like number of rooms (rm), percentage of lower-status population (lstat), and crime rate (crim) were among the best predictors.

Both linear and logistic models provided informative perspectives. Linear models allowed us to measure price effects, while logistic models branded properties as high- or low-value. Both were simple but showed both strengths and weaknesses, especially in assumptions and multicollinearity.

Visualizations guided interpretation and allowed communication of findings. Issues with heteroscedasticity and multicollinearity between predictors were addressed, and recommendations for future model tuning were provided. Practical application—better urban development, strategic investment, and policy intervention—illustrates the value of such data-intensive approaches.

In general, this report shows how R and statistical analysis can allow evidence-based decision-making in complex areas like real estate and urban development.


References

  • Harrison, D., & Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81-102.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
  • RStudio (2023). Quarto: A Next-Generation Publishing System. https://quarto.org