The Boston Housing Data contains prices of houses in the suburbs of Boston. The median value variable ‘medv’ is the dependent variable, which may depend on some or all of the other predictor variables in the dataset, such as the crime rate in the vicinity, accessibility in terms of distance, and pollution levels.
The Boston Housing Data ships with the MASS package.
Method
The analysis will include the following stages: exploratory data analysis, a train/test split with standardisation, linear regression with variable selection (best subset and stepwise using AIC and BIC), and regression trees (CART), followed by an out-of-sample comparison. The required packages are loaded first:
library(tidyverse)   # data wrangling (dplyr, tidyr) and ggplot2 graphics
library(leaps)       # best subset selection via regsubsets()
library(MASS)        # contains the Boston housing dataset
library(corrgram)    # correlogram plots
library(glmnet)      # regularised regression
library(boot)        # cross-validation utilities
library(rpart)       # regression trees (CART)
library(rpart.plot)  # tree visualisation
attach(Boston)       # make the Boston columns directly accessible
The Boston dataset has 506 rows and 14 columns. The variable descriptions are available in the package documentation (?Boston).
dim(Boston)
## [1] 506 14
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Missing Values: There are no missing values in the dataset.
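A quick way to verify this (a minimal sketch; any equivalent NA check works):
# Count missing values in each column; all zeros means the data is complete
colSums(is.na(Boston))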
Histogram
ggplot(gather(Boston), aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = "free_x") +
  theme_gray() +
  ggtitle("Histogram of all variables")
Boston %>% gather(key, val, -medv) %>%
  ggplot(aes(x = val, y = medv)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE, col = "green") +
  facet_wrap(~key, scales = "free") +
  theme_gray() +
  ggtitle("Scatter plots of predictor variables vs median value (medv)")
boxplot(Boston, col = "grey")
corrgram(Boston, upper.panel = NULL, lower.panel = panel.cor)
set.seed(12823435)
# Splitting the data
sample_index <- sample(nrow(Boston), nrow(Boston)*0.75)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]
# Standardisation: scale each predictor column (all except the response medv)
# Note that the train and test sets are each scaled with their own means and SDs
for (i in 1:(ncol(Boston_train) - 1)) {
  Boston_train[, i] <- scale(Boston_train[, i])
}
for (i in 1:(ncol(Boston_test) - 1)) {
  Boston_test[, i] <- scale(Boston_test[, i])
}
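As a sanity check (a sketch added for illustration, not part of the original analysis), the scaled predictors should now have mean approximately 0 and standard deviation 1:
# Each scaled predictor should have mean ~0 and standard deviation ~1
round(apply(Boston_train[, -ncol(Boston_train)], 2, mean), 3)
round(apply(Boston_train[, -ncol(Boston_train)], 2, sd), 3)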
boston.lm <- lm(medv ~ ., data = Boston_train)
boston.lm.summary <- summary(boston.lm)
boston.lm.summary
##
## Call:
## lm(formula = medv ~ ., data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8463 -2.6601 -0.5896 1.7639 25.4728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.2615 0.2361 94.293 < 2e-16 ***
## crim -1.0952 0.3102 -3.530 0.000468 ***
## zn 1.3784 0.3587 3.843 0.000143 ***
## indus -0.1249 0.4675 -0.267 0.789504
## chas 0.5342 0.2478 2.155 0.031778 *
## nox -1.6862 0.4929 -3.421 0.000695 ***
## rm 2.2552 0.3172 7.109 6.17e-12 ***
## age 0.1547 0.4116 0.376 0.707349
## dis -3.0674 0.4728 -6.488 2.84e-10 ***
## rad 3.0405 0.7003 4.342 1.83e-05 ***
## tax -2.6094 0.7541 -3.460 0.000603 ***
## ptratio -1.8588 0.3132 -5.934 6.86e-09 ***
## black 0.7370 0.2769 2.661 0.008127 **
## lstat -3.9747 0.3956 -10.048 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.596 on 365 degrees of freedom
## Multiple R-squared: 0.7435, Adjusted R-squared: 0.7344
## F-statistic: 81.4 on 13 and 365 DF, p-value: < 2.2e-16
boston.lm.summary$adj.r.squared
## [1] 0.7343941
mean(boston.lm$residuals^2)
## [1] 20.34444
Best subset selection and stepwise model selection (direction "both"), using both AIC and BIC, are applied below to arrive at the best linear model for the dependent variable ‘medv’.
boston.regsubset <- regsubsets(medv ~ ., data = Boston_train, nbest = 1, nvmax = 13)
summary.regsubset <- summary(boston.regsubset)
summary.regsubset
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston_train, nbest = 1,
## nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" "*" "*"
## 7 ( 1 ) "*" "*" " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " "*" "*" "*" "*" " " "*"
## 9 ( 1 ) "*" "*" " " " " "*" "*" " " "*" "*" "*" "*" " " "*"
## 10 ( 1 ) "*" "*" " " " " "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
plot(boston.regsubset, scale="bic")
# Null model (intercept only)
null_model <- lm(medv ~ 1, data = Boston_train)
# Full model (all 13 predictors)
full_model <- lm(medv ~ ., data = Boston_train)
# Stepwise selection using AIC (k = 2)
model_step <- step(null_model, scope = list(lower = null_model, upper = full_model),
                   direction = "both", k = 2)
# Stepwise selection using BIC (k = log(n))
model_step_BIC <- step(null_model, scope = list(lower = null_model, upper = full_model),
                       direction = "both", k = log(nrow(Boston_train)))
step_summary <- summary(model_step)
step_summary
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + black +
## zn + crim + chas + rad + tax, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9478 -2.6360 -0.6697 1.8128 25.6591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.2615 0.2355 94.523 < 2e-16 ***
## lstat -3.9290 0.3690 -10.648 < 2e-16 ***
## rm 2.2875 0.3084 7.417 8.38e-13 ***
## ptratio -1.8565 0.3106 -5.977 5.38e-09 ***
## dis -3.0934 0.4427 -6.987 1.32e-11 ***
## nox -1.6804 0.4573 -3.675 0.000274 ***
## black 0.7469 0.2754 2.712 0.007003 **
## zn 1.3720 0.3528 3.889 0.000120 ***
## crim -1.0927 0.3093 -3.532 0.000465 ***
## chas 0.5359 0.2452 2.185 0.029516 *
## rad 3.0665 0.6815 4.500 9.15e-06 ***
## tax -2.6788 0.6964 -3.847 0.000141 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.585 on 367 degrees of freedom
## Multiple R-squared: 0.7434, Adjusted R-squared: 0.7357
## F-statistic: 96.65 on 11 and 367 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_step)
step_summary_BIC <- summary(model_step_BIC)
step_summary_BIC
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + zn + crim +
## chas, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4248 -3.0074 -0.6004 1.8000 26.8175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.2615 0.2428 91.676 < 2e-16 ***
## lstat -4.1132 0.3767 -10.919 < 2e-16 ***
## rm 2.4746 0.3109 7.960 2.13e-14 ***
## ptratio -1.6259 0.2836 -5.733 2.05e-08 ***
## dis -2.9548 0.4533 -6.519 2.33e-10 ***
## nox -1.7871 0.4199 -4.256 2.64e-05 ***
## zn 1.2148 0.3532 3.439 0.00065 ***
## crim -0.9480 0.2868 -3.305 0.00104 **
## chas 0.6739 0.2511 2.684 0.00761 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.727 on 370 degrees of freedom
## Multiple R-squared: 0.725, Adjusted R-squared: 0.719
## F-statistic: 121.9 on 8 and 370 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_step_BIC)
Final Model
Since none of the three models above retain indus or age as significant, the final model keeps the remaining 11 predictors:
#Final model
final_model <- lm(medv ~ lstat + rm + ptratio + dis + nox + black + zn +
                    crim + chas + rad + tax, data = Boston_train)
The in-sample MSE of the model is 20.35.
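This value can be recovered from the fitted model’s residuals (a minimal sketch; output not shown):
# In-sample MSE: mean squared residual on the training data
mean(final_model$residuals^2)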
Testing the Out-of-Sample Performance of the Final Model
predict.1 <- predict(final_model, newdata = Boston_test, type = "response")
mean((predict.1 - Boston_test$medv)^2)
## [1] 28.46205
Three regression trees will be fitted here: the first with the default complexity parameter (cp = 0.01), the second with cp = 0.001, and the third a pruned version of the second tree.
cp = 0.01
boston.rpart <- rpart(formula = medv ~ ., data = Boston_train)
rpart.plot(boston.rpart, type = 3, box.palette = c("red", "grey"), fallen.leaves = TRUE)
plotcp(boston.rpart)
predict.tree <- predict(boston.rpart, Boston_test)
mse.tree <- mean((predict.tree - Boston_test$medv)^2)
mse.tree
## [1] 27.64065
The out-of-sample MSE of this tree is 27.64.
cp = 0.001
boston.rpart.1 <- rpart(formula = medv ~ ., data = Boston_train, cp = 0.001)
rpart.plot(boston.rpart.1, type = 3, box.palette = c("red", "grey"), fallen.leaves = TRUE)
plotcp(boston.rpart.1)
printcp(boston.rpart.1)
##
## Regression tree:
## rpart(formula = medv ~ ., data = Boston_train, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] age black crim lstat nox ptratio rm tax
##
## Root node error: 30064/379 = 79.324
##
## n= 379
##
## CP nsplit rel error xerror xstd
## 1 0.4421096 0 1.00000 1.00339 0.098808
## 2 0.1542982 1 0.55789 0.66654 0.063115
## 3 0.1007721 2 0.40359 0.51634 0.058191
## 4 0.0338343 3 0.30282 0.43772 0.051738
## 5 0.0247920 4 0.26899 0.37443 0.047481
## 6 0.0146350 5 0.24419 0.34577 0.048516
## 7 0.0130194 6 0.22956 0.32514 0.045877
## 8 0.0107750 7 0.21654 0.30799 0.045559
## 9 0.0090389 8 0.20576 0.30106 0.044050
## 10 0.0087362 9 0.19673 0.29715 0.044095
## 11 0.0049263 10 0.18799 0.28878 0.043996
## 12 0.0041107 12 0.17814 0.27936 0.043535
## 13 0.0033515 13 0.17403 0.27435 0.043386
## 14 0.0031806 14 0.17067 0.27336 0.042933
## 15 0.0028764 15 0.16749 0.27446 0.042927
## 16 0.0027868 16 0.16462 0.27360 0.042904
## 17 0.0021553 17 0.16183 0.27342 0.042910
## 18 0.0018430 18 0.15968 0.26909 0.042799
## 19 0.0018274 19 0.15783 0.26739 0.042462
## 20 0.0012457 20 0.15601 0.26979 0.042571
## 21 0.0011442 22 0.15351 0.26992 0.042476
## 22 0.0011067 23 0.15237 0.26961 0.042477
## 23 0.0010000 25 0.15016 0.27128 0.043463
predict.tree.1 <- predict(boston.rpart.1, Boston_test)
mse.tree.1 <- mean((predict.tree.1 - Boston_test$medv)^2)
mse.tree.1
## [1] 24.70509
The MSE of this tree is 24.71. However, this tree should be pruned, since such complexity is unnecessary: the cp plot shows that more than 7-9 splits add little.
cp = 0.0049
boston.rpart.prune <- prune(boston.rpart.1, cp = 0.0049)
rpart.plot(boston.rpart.prune, type = 3, box.palette = c("red", "grey"), fallen.leaves = TRUE)
predict.tree.2 <- predict(boston.rpart.prune, Boston_test)
mse.tree.2 <- mean((predict.tree.2 - Boston_test$medv)^2)
mse.tree.2
## [1] 26.99284
The MSE of this tree is 26.99.
The largest tree has the lowest MSE, but it is hard to interpret. The pruned tree has the next-best MSE and is easier to interpret.
The out-of-sample prediction error of linear regression on this randomly generated split is 28.46, whereas for the pruned CART model it is 26.99. Hence, in this case, the CART model turns out to be better. However, this is not necessarily true every time; it depends on the particular sample that was generated.
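To reduce this dependence on a single split, a cross-validated error estimate could be used instead. A minimal sketch using cv.glm from the boot package loaded earlier (refitting the final model as a glm and the choice of 5 folds are assumptions for illustration, not part of the original analysis):
# Refit the final linear model as a glm so that cv.glm can be applied
final_model_glm <- glm(medv ~ lstat + rm + ptratio + dis + nox + black +
                         zn + crim + chas + rad + tax, data = Boston_train)
# 5-fold cross-validated MSE; delta[1] is the raw CV estimate
cv.glm(Boston_train, final_model_glm, K = 5)$delta[1]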