This project aims to find the factors affecting domestic property values in the city of Boston. Factors such as per capita income, environmental conditions, educational facilities, and property size were taken into consideration to determine the most significant predictors.
To predict house prices, we use models from the tree family.
We will compare the performance of these methods to find the best model.
Part 1 deals with CART and BAGGING.
The following packages are required for the project:
## function to install and load multiple packages at once in R
install_load <- function(pack){
  ## Statement to check if the package has been previously installed
  new_pack_load <- pack[!(pack %in% installed.packages()[, "Package"])]
  if (length(new_pack_load))
    install.packages(new_pack_load, dependencies = TRUE)
  sapply(pack, require, character.only = TRUE)
}
package_load <- c("corrr", "gridExtra", "ggplot2", "tidyverse", "dplyr", "DT", "MASS", "leaps", "glmnet", "PerformanceAnalytics", "tree", "rpart", "rpart.plot","randomForest", "ipred", "gbm")
install_load(package_load)
## corrr gridExtra ggplot2
## TRUE TRUE TRUE
## tidyverse dplyr DT
## TRUE TRUE TRUE
## MASS leaps glmnet
## TRUE TRUE TRUE
## PerformanceAnalytics tree rpart
## TRUE TRUE TRUE
## rpart.plot randomForest ipred
## TRUE TRUE TRUE
## gbm
## TRUE
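The Boston housing data used throughout ships with the MASS package loaded above, so it is available directly; a quick availability and size check (a small sketch, output omitted):
data("Boston", package = "MASS")  # attach the Boston housing data from MASS
dim(Boston)                       # 506 rows, 14 columns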
Our data contains 506 observations of 14 variables. The data types are as follows:
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
A quick summary of the distribution of every variable in the data:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We plot scatter plots of the target variable medv against each of the other variables; rm and lstat show a roughly parabolic relationship with medv.
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value, y = medv)) +
  geom_point() +
  stat_smooth() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
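To put rough numbers on these relationships, the linear correlation of each predictor with medv can be pulled out with the corrr package loaded above; a quick sketch (output not shown):
Boston %>%
  correlate() %>%            # pairwise Pearson correlations
  focus(medv) %>%            # keep only the correlations with medv
  arrange(desc(abs(medv)))   # strongest linear associations first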
Boxplots show no significant outliers in the data:
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = '', y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
Histograms show the shape of each predictor's distribution:
Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
We split our data into training and test sets in an 80:20 ratio. We will use the training data for modelling and the test data for validation.
set.seed(12420352)
index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
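Model performance below is judged by mean squared (prediction) error on the training and test sets; a tiny helper like the one below (the name mse is ours, purely for convenience) could replace the inline mean((pred - actual)^2) calls that follow:
## Mean squared error between a vector of predictions and the observed values
mse <- function(pred, actual) mean((pred - actual)^2)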
We fit a regression tree to the training data.
We start by growing the deepest tree, keeping cp = 0.001. The fitted tree can be seen below:
Boston.tree <- rpart(medv ~ ., data = Boston.train, cp = 0.001)
rpart.plot(Boston.tree, type = 1, fallen.leaves = FALSE)
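The number of terminal nodes in this unpruned tree can be counted directly from the fitted rpart object; a small sketch:
## Leaves are rows of the frame whose splitting variable is "<leaf>"
sum(Boston.tree$frame$var == "<leaf>")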
To avoid overfitting and reduce variance, we need to prune back this large tree. We use cross-validation to select the best cp level. The plot below shows the relative cross-validation error for different tree sizes (numbers of terminal nodes). After 8 terminal nodes the decrease in error is no longer significant, so we prune the tree at this level.
The cp plot and table used to choose the pruning level:
plotcp(Boston.tree)
printcp(Boston.tree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = Boston.train, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] black crim dis indus lstat nox ptratio rm tax
##
## Root node error: 34807/404 = 86.155
##
## n= 404
##
## CP nsplit rel error xerror xstd
## 1 0.4394029 0 1.00000 1.00605 0.093348
## 2 0.1801891 1 0.56060 0.64455 0.066897
## 3 0.0707158 2 0.38041 0.43935 0.051076
## 4 0.0424304 3 0.30969 0.38269 0.050220
## 5 0.0345825 4 0.26726 0.37247 0.050483
## 6 0.0239214 5 0.23268 0.35471 0.049760
## 7 0.0236063 6 0.20876 0.33543 0.047298
## 8 0.0088629 7 0.18515 0.31061 0.047573
## 9 0.0085790 8 0.17629 0.30480 0.045155
## 10 0.0078425 9 0.16771 0.30122 0.045136
## 11 0.0065798 10 0.15987 0.29348 0.044691
## 12 0.0058768 11 0.15329 0.30020 0.045738
## 13 0.0044298 12 0.14741 0.28274 0.040699
## 14 0.0042918 13 0.14298 0.28074 0.040854
## 15 0.0032757 14 0.13869 0.27332 0.039521
## 16 0.0032652 15 0.13541 0.27335 0.039458
## 17 0.0023268 16 0.13215 0.27044 0.039449
## 18 0.0022818 17 0.12982 0.27062 0.039497
## 19 0.0017836 18 0.12754 0.27116 0.039486
## 20 0.0016586 19 0.12576 0.27257 0.039451
## 21 0.0016459 21 0.12244 0.27281 0.039459
## 22 0.0015443 22 0.12079 0.27206 0.039467
## 23 0.0013947 23 0.11925 0.27245 0.039484
## 24 0.0013701 24 0.11785 0.27250 0.039475
## 25 0.0011746 26 0.11511 0.27313 0.040010
## 26 0.0010000 27 0.11394 0.27125 0.039026
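Rather than reading the cp value off the plot and table by eye, it can also be extracted programmatically from the fitted tree's cptable, either as the cp that minimizes the cross-validated error or via the one-standard-error rule. A sketch (we keep the manual choice of 8 terminal nodes below):
cp_tab <- Boston.tree$cptable
cp_min <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]          # cp with the smallest xerror
xerr_cut <- min(cp_tab[, "xerror"]) +
  cp_tab[which.min(cp_tab[, "xerror"]), "xstd"]                # minimum xerror plus one std. error
cp_1se <- max(cp_tab[cp_tab[, "xerror"] <= xerr_cut, "CP"])    # largest cp within that band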
Pruning at cp = 0.0085790 gives a tree with the splits below:
prune.tree1 <- prune(Boston.tree, cp = 0.0085790)
prune.tree1
## n= 404
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 404 34806.8000 22.42599
## 2) lstat>=9.725 236 5997.1150 17.23475
## 4) lstat>=16.085 114 2184.3260 13.89386
## 8) dis< 2.0037 60 698.0260 11.33000 *
## 9) dis>=2.0037 54 653.6720 16.74259 *
## 5) lstat< 16.085 122 1351.4000 20.35656 *
## 3) lstat< 9.725 168 13515.4700 29.71845
## 6) rm< 7.0115 123 4635.5160 26.02276
## 12) dis>=2.0891 116 2225.6160 25.17155
## 24) rm< 6.5445 67 618.6687 22.89552 *
## 25) rm>=6.5445 49 785.2869 28.28367
## 50) lstat>=5.195 28 203.4668 26.11071 *
## 51) lstat< 5.195 21 273.3324 31.18095 *
## 13) dis< 2.0891 7 933.0343 40.12857 *
## 7) rm>=7.0115 45 2608.1520 39.82000
## 14) rm< 7.437 22 404.9277 34.53182 *
## 15) rm>=7.437 23 999.5191 44.87826 *
rpart.plot(prune.tree1, type =1, fallen.leaves = FALSE)
We use the training MSE and test MSPE as the metrics of tree performance. The pruned tree gives a test MSPE of 15.94623 and a training MSE of 15.18823.
boston.train.pred.tree = predict(prune.tree1)
boston.test.pred.tree = predict(prune.tree1, Boston.test)
mean((boston.test.pred.tree - Boston.test$medv)^2)
## [1] 15.94623
mean((boston.train.pred.tree - Boston.train$medv)^2)
## [1] 15.18823
Bagging (bootstrap aggregation) is a method for reducing variance by building multiple tree models on bootstrapped samples of the data and aggregating the results of all trees. The model below aggregates the predictions of 500 trees, considering all 13 predictors at each split.
Pre <- as.formula("medv ~ .")
bag.boston <- randomForest(Pre, data=Boston.train, mtry=13, importance =TRUE)
bag.boston
##
## Call:
## randomForest(formula = Pre, data = Boston.train, mtry = 13, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 10.94713
## % Var explained: 87.29
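To make the bootstrap-plus-aggregation idea concrete, bagging can also be written out by hand: fit one tree on each bootstrap resample of the training data and average their predictions. A minimal sketch (B = 100 resamples is an arbitrary choice here, and each tree uses rpart defaults):
set.seed(12420352)
B <- 100
boot_preds <- matrix(NA, nrow = nrow(Boston.test), ncol = B)
for (b in 1:B) {
  idx <- sample(nrow(Boston.train), replace = TRUE)   # bootstrap resample of the training rows
  fit <- rpart(medv ~ ., data = Boston.train[idx, ])  # grow one tree on the resample
  boot_preds[, b] <- predict(fit, Boston.test)        # predict the test set with this tree
}
mean((rowMeans(boot_preds) - Boston.test$medv)^2)     # aggregate by averaging, then compute test MSPE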
Plotting the out-of-bag error against the number of trees used for aggregation, we see that the error stops decreasing after about 100 trees, so using more trees gives no added advantage.
plot(bag.boston)
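The out-of-bag MSE at each ensemble size is stored in the fitted object, so the curve in this plot can also be inspected numerically; a sketch:
which.min(bag.boston$mse)   # number of trees at which the OOB MSE is lowest
min(bag.boston$mse)         # the corresponding OOB MSE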
Checking the training and test errors, we get a test MSPE of 7.960609 and a training error of 10.94713 (for a random forest fit, predict() without new data returns out-of-bag predictions, so this matches the OOB mean of squared residuals above). This is a significant improvement over the single-tree model.
boston.train.pred.tree = predict(bag.boston)
boston.test.pred.tree = predict(bag.boston, Boston.test)
mean((boston.test.pred.tree - Boston.test$medv)^2)
## [1] 7.960609
mean((boston.train.pred.tree - Boston.train$medv)^2)
## [1] 10.94713
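As a cross-check, the ipred package loaded at the start fits the same kind of bagged ensemble via bagging(); with coob = TRUE it reports an out-of-bag error estimate directly. A sketch (nbagg = 500 mirrors the 500 trees above):
bag.ipred <- bagging(medv ~ ., data = Boston.train, nbagg = 500, coob = TRUE)
bag.ipred                                                      # print method shows the out-of-bag error estimate
mean((predict(bag.ipred, Boston.test) - Boston.test$medv)^2)   # test MSPE for comparison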
We generate two plots to check variable importance. The former is based on the mean decrease in prediction accuracy on the out-of-bag samples when a given variable is permuted. The latter measures the total decrease in node RSS that results from splits over that variable, averaged over all trees. From the plots, we see that rm and lstat are the two most important variables.
importance(bag.boston)
## %IncMSE IncNodePurity
## crim 14.965464 1465.16882
## zn 3.423625 35.81986
## indus 10.451900 258.84124
## chas -1.237404 28.38546
## nox 16.774764 636.84081
## rm 46.651222 11743.19452
## age 12.152279 580.90133
## dis 28.662656 2540.36637
## rad 3.817848 125.78485
## tax 17.627977 554.42317
## ptratio 18.708013 577.95894
## black 8.483604 350.10429
## lstat 46.444565 15596.09061
varImpPlot(bag.boston)
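The same ranking can be read off numerically, for example the five largest %IncMSE values:
imp <- importance(bag.boston)
head(sort(imp[, "%IncMSE"], decreasing = TRUE), 5)   # top variables by permutation importance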
On comparing the performance of CART vs. bagging:
Bagging clearly outperforms CART, roughly halving the test MSPE (15.95 vs. 7.96).
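Collecting the errors reported above into one table makes the comparison explicit (numbers copied from the output above; the bagging training figure is the out-of-bag mean of squared residuals):
results <- data.frame(
  model     = c("CART (pruned)", "Bagging"),
  train.MSE = c(15.18823, 10.94713),
  test.MSPE = c(15.94623, 7.960609)
)
results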
We will compare these two techniques with RANDOM FOREST and BOOSTING in Part 2.