Analysis of Boston Housing Data using Linear Regression and CART Model

Introduction

Boston Housing Data consists of price of house in suburbs of Boston. The median value variable ‘medv’ is the dependent variable which might be dependent on a set/all other predictor variables of this dataset such as crime rate in the vicinity, accessibility in terms of distance, pollution levels et cetera.

Boston Housing Data comes with the MASS library.

Method

The method of analysis will include following stages:

Exploratory Data Analysis of the Data
Randomly Splitting the data into test and training data, in the ratio of 75:25
Fitting various models using different variable selection methods and finding the best model using AIC , BIC and residual analysis.
Testing the model on out-of-sample data on the final model and stating the MSE.
Repeat the steps above for the regression tree (CART) model.

About the Dataset

Libraries Used

library(tidyverse)
library(leaps)
library(MASS)
library(corrgram)
library(glmnet)
library(boot)
library(rpart)
library(rpart.plot)
attach(Boston)

Glimpse of the dataset

Boston dataset has 506 rows and 14 columns. The variable description can be seen in the next tab.

dim(Boston)

## [1] 506  14

glimpse(Boston)

## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Variable Description

crim- per capita crime rate by town.
zn- proportion of residential land zoned for lots over 25,000 sq.ft.
indus- proportion of non-retail business acres per town.
chas- Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox- nitrogen oxides concentration (parts per 10 million).
rm- average number of rooms per dwelling.
age- proportion of owner-occupied units built prior to 1940.
dis- weighted mean of distances to five Boston employment centres.
rad- index of accessibility to radial highways.
tax- full-value property-tax rate per $10,000.
ptratio- pupil-teacher ratio by town.
black- 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat- lower status of the population (percent).
medv- median value of owner-occupied homes in $1000s.

Exploratory Data Analysis

Missing Values: There are no missing values. (Yayyy !!!)
Histogram

ggplot(gather(Boston), aes(value)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~key, scales = 'free_x')+
  theme_gray()+
  ggtitle("Histogram of all variables")

Scatter Plot

Boston %>% gather(key, val, -medv) %>%
  ggplot(aes(x = val, y = medv)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE, col = "green") +
  facet_wrap(~key, scales = "free") +
  theme_gray() +
  ggtitle("Scatter plot of dependent variables vs Median Value (medv)")

Box Plots

boxplot(Boston, col = "grey")

Correlation Plots

corrgram(Boston, upper.panel = NULL, lower.panel = panel.cor)

Generating Random Sample & Scaling the Data

set.seed(12823435)
# Splitting the data 

sample_index <- sample(nrow(Boston), nrow(Boston)*0.75)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]

# Standardisation 

for (i in 1:(ncol(Boston_train)-1)){
  Boston_train[,i] <- scale(Boston_train[,i])
}

for (i in 1:(ncol(Boston_test)-1)){
  Boston_test[,i] <- scale(Boston_test[,i])
}

Linear Regression on Training Data

boston.lm <- lm(medv ~. , data = Boston_train)
boston.lm.summary <- summary(boston.lm)
boston.lm.summary

## 
## Call:
## lm(formula = medv ~ ., data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.8463  -2.6601  -0.5896   1.7639  25.4728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.2615     0.2361  94.293  < 2e-16 ***
## crim         -1.0952     0.3102  -3.530 0.000468 ***
## zn            1.3784     0.3587   3.843 0.000143 ***
## indus        -0.1249     0.4675  -0.267 0.789504    
## chas          0.5342     0.2478   2.155 0.031778 *  
## nox          -1.6862     0.4929  -3.421 0.000695 ***
## rm            2.2552     0.3172   7.109 6.17e-12 ***
## age           0.1547     0.4116   0.376 0.707349    
## dis          -3.0674     0.4728  -6.488 2.84e-10 ***
## rad           3.0405     0.7003   4.342 1.83e-05 ***
## tax          -2.6094     0.7541  -3.460 0.000603 ***
## ptratio      -1.8588     0.3132  -5.934 6.86e-09 ***
## black         0.7370     0.2769   2.661 0.008127 ** 
## lstat        -3.9747     0.3956 -10.048  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.596 on 365 degrees of freedom
## Multiple R-squared:  0.7435, Adjusted R-squared:  0.7344 
## F-statistic:  81.4 on 13 and 365 DF,  p-value: < 2.2e-16

boston.lm.summary$adj.r.squared

## [1] 0.7343941

mean(boston.lm$residuals^2)

## [1] 20.34444

We can see that all the variables are significant except indus and age.
The adjusted R-Squared for Full Model is 0.7343941
The model MSE is 20.34444.
It is better to create a model without indus and age.

Variable Selection

The Best Subset selection, Stepwise Model Selection (Both) using AIC and BIC are used below to come up with the best linear model for dependent variable ‘medv’.

Best Subset Selection

boston.regsubset <- regsubsets(medv~. , data = Boston_train, nbest = 1 , nvmax = 13)
summary.regsubset <- summary(boston.regsubset)
summary.regsubset

## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston_train, nbest = 1, 
##     nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 6  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 7  ( 1 )  "*"  "*" " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 8  ( 1 )  "*"  "*" " "   " "  " " "*" " " "*" "*" "*" "*"     " "   "*"  
## 9  ( 1 )  "*"  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 10  ( 1 ) "*"  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 1 ) "*"  "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"

plot(boston.regsubset, scale="bic")

The adjusted R-Squared of this model is 0.734
The significant variables are all the dependent variables except indus and age.
chas has very little significance according to this model,it cannot be exluded.

Stepwise Regression (Both)

#Null Model 
null_model <- lm(medv~1, data=Boston_train)
# Full Model
full_model <- lm(medv~., data=Boston_train)

#Using AIC 
model_step <- step(null_model, scope = list(lower= null_model, upper= full_model), direction = "both", k = 2)
#Using BIC 
model_step_BIC <- step(null_model, scope = list(lower= null_model, upper= full_model), direction = "both", k = log(nrow(Boston_train)))

step_summary <- summary(model_step)
step_summary

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + black + 
##     zn + crim + chas + rad + tax, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9478  -2.6360  -0.6697   1.8128  25.6591 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.2615     0.2355  94.523  < 2e-16 ***
## lstat        -3.9290     0.3690 -10.648  < 2e-16 ***
## rm            2.2875     0.3084   7.417 8.38e-13 ***
## ptratio      -1.8565     0.3106  -5.977 5.38e-09 ***
## dis          -3.0934     0.4427  -6.987 1.32e-11 ***
## nox          -1.6804     0.4573  -3.675 0.000274 ***
## black         0.7469     0.2754   2.712 0.007003 ** 
## zn            1.3720     0.3528   3.889 0.000120 ***
## crim         -1.0927     0.3093  -3.532 0.000465 ***
## chas          0.5359     0.2452   2.185 0.029516 *  
## rad           3.0665     0.6815   4.500 9.15e-06 ***
## tax          -2.6788     0.6964  -3.847 0.000141 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.585 on 367 degrees of freedom
## Multiple R-squared:  0.7434, Adjusted R-squared:  0.7357 
## F-statistic: 96.65 on 11 and 367 DF,  p-value: < 2.2e-16

par(mfrow= (c(2,2)))
plot(model_step)

step_summary_BIC <- summary(model_step_BIC)
step_summary_BIC

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + zn + crim + 
##     chas, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4248  -3.0074  -0.6004   1.8000  26.8175 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.2615     0.2428  91.676  < 2e-16 ***
## lstat        -4.1132     0.3767 -10.919  < 2e-16 ***
## rm            2.4746     0.3109   7.960 2.13e-14 ***
## ptratio      -1.6259     0.2836  -5.733 2.05e-08 ***
## dis          -2.9548     0.4533  -6.519 2.33e-10 ***
## nox          -1.7871     0.4199  -4.256 2.64e-05 ***
## zn            1.2148     0.3532   3.439  0.00065 ***
## crim         -0.9480     0.2868  -3.305  0.00104 ** 
## chas          0.6739     0.2511   2.684  0.00761 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.727 on 370 degrees of freedom
## Multiple R-squared:  0.725,  Adjusted R-squared:  0.719 
## F-statistic: 121.9 on 8 and 370 DF,  p-value: < 2.2e-16

par(mfrow= (c(2,2)))
plot(model_step_BIC)

Model using AIC has adjusted R-Squared of 0.7356
Model using BIC has adjusted R-Squared of 0.7190

Final Model

Since, the 3 models that were previously seen state that all the variables are significant except indus and age, the final model is as follows:

#Final model 
final_model <- lm(medv~lstat+ rm+ptratio+ dis+nox+black+zn+crim+chas+rad+tax, data = Boston_train)

In-Sample MSE of the model is 20.35
Testing of Out-of-Sample Performace of the final model

predict.1 <- predict(final_model, newdata = Boston_test, type = "response")

mean((predict.1- Boston_test$medv)^2)

## [1] 28.46205

The MSE reported here is 28.46205

CART Model

Three trees will be fitted here, first with default cp, cp=0.01 , second with cp = 0.001 and third will be the pruned version of second tree.

cp = 0.01

boston.rpart <- rpart(formula = medv~. , data = Boston_train)

rpart.plot(boston.rpart,type = 3, box.palette = c("red", "grey"), fallen.leaves = TRUE)

plotcp(boston.rpart)

predict.tree <- predict(boston.rpart, Boston_test)

mse.tree <- mean((predict.tree- Boston_test$medv)^2)
mse.tree

## [1] 27.64065

The MSE of this tree is 27.64065

cp = 0.001

boston.rpart.1 <- rpart(formula = medv ~. , data = Boston_train, cp= 0.001)
rpart.plot(boston.rpart.1,type = 3, box.palette = c("red", "grey"), fallen.leaves = TRUE)

plotcp(boston.rpart.1)

printcp(boston.rpart.1)

## 
## Regression tree:
## rpart(formula = medv ~ ., data = Boston_train, cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] age     black   crim    lstat   nox     ptratio rm      tax    
## 
## Root node error: 30064/379 = 79.324
## 
## n= 379 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4421096      0   1.00000 1.00339 0.098808
## 2  0.1542982      1   0.55789 0.66654 0.063115
## 3  0.1007721      2   0.40359 0.51634 0.058191
## 4  0.0338343      3   0.30282 0.43772 0.051738
## 5  0.0247920      4   0.26899 0.37443 0.047481
## 6  0.0146350      5   0.24419 0.34577 0.048516
## 7  0.0130194      6   0.22956 0.32514 0.045877
## 8  0.0107750      7   0.21654 0.30799 0.045559
## 9  0.0090389      8   0.20576 0.30106 0.044050
## 10 0.0087362      9   0.19673 0.29715 0.044095
## 11 0.0049263     10   0.18799 0.28878 0.043996
## 12 0.0041107     12   0.17814 0.27936 0.043535
## 13 0.0033515     13   0.17403 0.27435 0.043386
## 14 0.0031806     14   0.17067 0.27336 0.042933
## 15 0.0028764     15   0.16749 0.27446 0.042927
## 16 0.0027868     16   0.16462 0.27360 0.042904
## 17 0.0021553     17   0.16183 0.27342 0.042910
## 18 0.0018430     18   0.15968 0.26909 0.042799
## 19 0.0018274     19   0.15783 0.26739 0.042462
## 20 0.0012457     20   0.15601 0.26979 0.042571
## 21 0.0011442     22   0.15351 0.26992 0.042476
## 22 0.0011067     23   0.15237 0.26961 0.042477
## 23 0.0010000     25   0.15016 0.27128 0.043463

predict.tree.1 <- predict(boston.rpart.1, Boston_test)

mse.tree.1 <- mean((predict.tree.1- Boston_test$medv)^2)
mse.tree.1

## [1] 24.70509

The MSE of this tree is 24.7059. But it needs to be since such a complexity is not needed. If we see from the plot then, more than 7-9 splits won’t help much.

cp = 0.0049

boston.rpart.prune <- prune(boston.rpart.1, cp= 0.0049 )

rpart.plot(boston.rpart.prune, type=3, box.palette = c("red", "grey"), fallen.leaves = TRUE)

predict.tree.2 <- predict(boston.rpart.prune, Boston_test)

mse.tree.2 <- mean((predict.tree.2- Boston_test$medv)^2)
mse.tree.2

## [1] 26.99284

The MSE of this tree is 26.99.

The largest tree has the lowest MSE, but it is hard to interpret.The pruned tree has the next best MSE and is earier to interpret.

Comparison of CART Model and Linear Regression Model

The out of sample prediction error in Linear Regression for this randomyl generated sample is 28.46205 , whereas for CART model it is 26.99. Hence, in this case, the CART model turns out to be better. But, this is not necessarily the case everytime. It depends on the sample that was generated.