library(MASS)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(caret)
library(corrplot)
library(scales)
library(car)
data("Boston")Boston Housing Dataset - Advanced Analytical Report
Introduction
Among data science datasets, the Boston Housing dataset is a famous one and remains among the top regression benchmarks. It originated in a 1970s U.S. Census Bureau research effort and has been extensively used as a teaching and testing dataset for statistical and machine learning models. Its extensive availability and practicality in real-world applications make it a go-to choice for housing price modeling and socioeconomic examination.
The dataset has 506 observations and 14 predictors with the median number of houses owned (medv) as the response. The significant predictors include crime rate (crim), rooms (rm), distance to city center (dis), nitrogen oxides concentration (nox), and pupil-teacher ratio (ptratio), among other socio-economic variables such as ratio of lower status population (lstat).
R programming is used to clean the data, perform exploratory data analysis (EDA), visualization, correlation inspections, and predictive modeling using linear and logistic regression in this report. The emphasis here is on analytical depth as well as interpretability, with particular attention paid to multicollinearity, model diagnostics, as well as real-world insights into real estate and urban planning.
1. Loading Libraries and Dataset
2. Data Cleaning and Structure Validation
str(Boston)'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston) crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
missing_values <- sapply(Boston, function(x) sum(is.na(x)))
missing_values crim zn indus chas nox rm age dis rad tax
0 0 0 0 0 0 0 0 0 0
ptratio black lstat medv
0 0 0 0
cat("Total missing values:", sum(missing_values))Total missing values: 0
Interpretation:
There are no missing values. All variables are numeric, making the dataset ready for modeling.
3. Exploratory Data Analysis (EDA)
p1 <- ggplot(Boston, aes(x = crim)) + geom_histogram(binwidth = 1, fill = "#3366cc") + ggtitle("Histogram: Crime Rate")
p2 <- ggplot(Boston, aes(y = crim)) + geom_boxplot(fill = "#3366cc") + ggtitle("Boxplot: Crime Rate")
p3 <- ggplot(Boston, aes(x = nox)) + geom_histogram(binwidth = 0.01, fill = "#66cc66") + ggtitle("Histogram: Nitrogen Oxides")
p4 <- ggplot(Boston, aes(y = nox)) + geom_boxplot(fill = "#66cc66") + ggtitle("Boxplot: Nitrogen Oxides")
p5 <- ggplot(Boston, aes(x = rm)) + geom_histogram(binwidth = 0.2, fill = "#ff9933") + ggtitle("Histogram: Number of Rooms")
p6 <- ggplot(Boston, aes(y = rm)) + geom_boxplot(fill = "#ff9933") + ggtitle("Boxplot: Number of Rooms")
p7 <- ggplot(Boston, aes(x = medv)) + geom_histogram(binwidth = 1, fill = "#cc33cc") + ggtitle("Histogram: Median Value")
p8 <- ggplot(Boston, aes(y = medv)) + geom_boxplot(fill = "#cc33cc") + ggtitle("Boxplot: Median Value")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 4)Insight:
Some variables like crim and nox are skewed, and medv has a cap at 50, indicating potential outliers or censoring. rm is normally distributed and expected to be positively related to house prices.
4. Correlation and Multicollinearity Check
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "color", type = "lower", tl.cex = 0.8)Insight:
rm shows a strong positive correlation with medv, while lstat and nox are strong negative predictors. nox and dis are highly correlated, raising concerns about multicollinearity.
vif_model <- lm(medv ~ ., data = Boston)
vif(vif_model) crim zn indus chas nox rm age dis
1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
rad tax ptratio black lstat
7.484496 9.008554 1.799084 1.348521 2.941491
5. Multiple Linear Regression
model_lm <- lm(medv ~ crim + rm + nox, data = Boston)
summary(model_lm)
Call:
lm(formula = medv ~ crim + rm + nox, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-18.242 -3.203 -0.762 2.386 39.130
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.13263 3.24427 -5.897 6.79e-09 ***
crim -0.19856 0.03496 -5.680 2.28e-08 ***
rm 7.91045 0.40723 19.425 < 2e-16 ***
nox -13.21688 2.65597 -4.976 8.92e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.095 on 502 degrees of freedom
Multiple R-squared: 0.5635, Adjusted R-squared: 0.5609
F-statistic: 216 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow = c(1, 2))
plot(model_lm$fitted.values, residuals(model_lm), xlab = "Fitted", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")
qqnorm(residuals(model_lm))
qqline(residuals(model_lm), col = "red")par(mfrow = c(1, 1))Linear regression model tells us about the effect of medv, among other explanatory factors such as lstat and rm. As indicated in the model summary, lstat (lower status group population) negatively correlates greatly with house prices, whereas rm (mean number of rooms per dwelling) has a great positive impact. The adjusted R-squared measure shows the degree to which the model accounts for the variation in house prices. Although linear regression assumes linearity and homoscedasticity, the residual plots show slight deviations from assumption, indicating potential to better fit the model using transformation or higher-order techniques.
Model Evaluation:
The model shows expected signs on coefficients. However, residuals show some heteroscedasticity and slight deviation from normality.
6. Enhanced Regression Model with All Variables
full_model <- lm(medv ~ ., data = Boston)
summary(full_model)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Improvement:
The adjusted R-squared improves, but multicollinearity may affect coefficient reliability. This can be mitigated using variable selection or regularization techniques.
7. Logistic Regression
Boston$medv_cat <- ifelse(Boston$medv > median(Boston$medv), 1, 0)
log_model <- glm(medv_cat ~ crim + rm + nox, data = Boston, family = binomial)
summary(log_model)
Call:
glm(formula = medv_cat ~ crim + rm + nox, family = binomial,
data = Boston)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.08124 2.11656 -6.180 6.39e-10 ***
crim -0.15723 0.04558 -3.450 0.000561 ***
rm 2.57572 0.31017 8.304 < 2e-16 ***
nox -4.93415 1.51082 -3.266 0.001091 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 701.39 on 505 degrees of freedom
Residual deviance: 445.42 on 502 degrees of freedom
AIC: 453.42
Number of Fisher Scoring iterations: 6
pred_probs <- predict(log_model, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)
conf_matrix <- table(Predicted = pred_class, Actual = Boston$medv_cat)
conf_matrix Actual
Predicted 0 1
0 217 48
1 39 202
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Classification Accuracy:", round(accuracy * 100, 2), "%")Classification Accuracy: 82.81 %
df_conf <- as.data.frame(conf_matrix)
ggplot(df_conf, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "white", size = 5) +
scale_fill_gradient(low = "#cce5ff", high = "#003366") +
labs(title = "Confusion Matrix", x = "Actual", y = "Predicted") +
theme_minimal()The logistic regression model classifies houses into high value or low value, and this is determined based on the medv threshold. lstat and rm are once more found to be key predictors here. The accuracy, precision, and confusion matrix of the model suggest that the model classifies fairly well, but not without mistake. It could be further improved by feature engineering or ensembling. This is a sound way of classification to segment housing markets for targeted strategies.
Insight:
The simple classification model achieves moderate accuracy. Incorporating additional predictors like lstat may improve performance.
8. Reflection and Recommendations
Strengths:
- Key variables (rm, lstat, crim) were identified as highly influential.
- Graphical and statistical techniques supported valid conclusions.
- Both regression and classification models were explored.
Weaknesses:
- Linear model assumptions were not fully met (e.g., residuals not homoscedastic).
- Multicollinearity remains a concern in full models.
Recommendations:
- Apply variable transformation or interaction terms.
- Use advanced models such as Random Forests or Gradient Boosting.
- Address multicollinearity using VIF checks or PCA.
Conclusion
The project was successful in showing the application of R in analyzing the Boston Housing data properly. From statistical modeling, EDA, and visualization, the analysis was able to draw critical insights into the determinants of house prices. Predictors like number of rooms (rm), percentage of lower-status population (lstat), and crime rate (crim) were among the best predictors.
Both linear and logistic models provided informative perspectives. Linear models allowed us to measure price effects, while logistic models branded properties as high- or low-value. Both were simple but showed both strengths and weaknesses, especially in assumptions and multicollinearity.
Visualizations guided interpretation and allowed communication of findings. Issues with heteroscedasticity and multicollinearity between predictors were addressed, and recommendations for future model tuning were provided. Practical application—better urban development, strategic investment, and policy intervention—illustrates the value of such data-intensive approaches.
In general, this report shows how R and statistical analysis can allow evidence-based decision-making in complex areas like real estate and urban development.
References
- Harrison, D., & Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81-102.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
- RStudio (2023). Quarto: A Next-Generation Publishing System. https://quarto.org