This project aims to identify the factors affecting domestic property values in the city of Boston. Variables such as per capita income, environmental conditions, educational facilities, and property size were considered to determine the most significant predictors.
To predict house prices we use the tree family of models: CART, Bagging, Random Forest, and Gradient Boosting.
We will compare the performance of these methods to select the best model.
Part 2 deals with Random Forest and Boosting.
The following packages are required for the project:
## function to install and load multiple packages at once in R
install_load <- function(pack){
  ## identify any packages that have not been installed previously
  new_pack_load <- pack[!(pack %in% installed.packages()[, "Package"])]
  ## install the missing packages, then load everything
  if (length(new_pack_load))
    install.packages(new_pack_load, dependencies = TRUE)
  sapply(pack, require, character.only = TRUE)
}
package_load <- c("corrr", "gridExtra", "ggplot2", "tidyverse", "dplyr", "DT", "MASS", "leaps", "glmnet", "PerformanceAnalytics", "tree", "rpart", "rpart.plot","randomForest", "ipred", "gbm")
install_load(package_load)
## corrr gridExtra ggplot2
## TRUE TRUE TRUE
## tidyverse dplyr DT
## TRUE TRUE TRUE
## MASS leaps glmnet
## TRUE TRUE TRUE
## PerformanceAnalytics tree rpart
## TRUE TRUE TRUE
## rpart.plot randomForest ipred
## TRUE TRUE TRUE
## gbm
## TRUE
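The analysis uses the Boston housing dataset. A minimal line to attach it explicitly, assuming the version shipped with the MASS package loaded above is the one intended:
## attach the Boston housing data (assumed to come from the MASS package)
data(Boston, package = "MASS")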
Our data contains 506 observations of 14 variables. The data types are as follows:
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
A quick summary of the distribution of each variable in the data:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Plotting the target variable medv against each of the other variables, we see that rm and lstat show a parabolic relationship with medv.
Boston %>%
gather(-medv, key = "var", value = "value") %>%
filter(var != "chas") %>%
ggplot(aes(x = value, y = medv)) +
geom_point() +
stat_smooth() +
facet_wrap(~ var, scales = "free") +
theme_bw()
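To complement the scatter plots with a numeric view, the linear correlation of each predictor with medv can be tabulated; a short sketch using the corrr package loaded above:
## correlation of each predictor with medv, strongest first
Boston %>%
  correlate() %>%
  focus(medv) %>%
  arrange(desc(abs(medv)))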
Boxplots show no significant outliers in the data:
Boston %>%
gather(-medv, key = "var", value = "value") %>%
filter(var != "chas") %>%
ggplot(aes(x = '',y = value)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
facet_wrap(~ var, scales = "free") +
theme_bw()
Histograms of the predictors show the shape of each variable's distribution:
Boston %>%
gather(-medv, key = "var", value = "value") %>%
filter(var != "chas") %>%
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~ var, scales = "free") +
theme_bw()
We split the data into training and test sets in an 80:20 ratio. The training set is used for modelling and the test set for validation.
set.seed(12420352)
index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
Random Forest improves on Bagging by decorrelating the individual trees (only a random subset of predictors is considered at each split), which further reduces variance. We need to tune two parameters for the random forest: mtry, the number of variables tried at each split, and ntree, the number of trees.
oob.err <- data.frame()
## fit a random forest for each candidate mtry and record the OOB MSE per tree
for(i in 3:5){
  fit <- randomForest(medv ~ ., data = Boston.train, mtry = i)
  oob.err <- rbind(oob.err, data.frame(cbind(mtry = i, ntree = seq(1, 500), oob_error = fit$mse)))
  cat(i, " ")
}
## 3 4 5
oob.err$mtry <- as.factor(oob.err$mtry)
The plot below shows that mtry = 4 and a tree size of around 100 are good enough for the model:
ggplot(data=oob.err, aes(x=ntree, y=oob_error, color=mtry)) + geom_point() + geom_line()
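The same grid can be summarised numerically; a quick dplyr sketch of the minimum OOB error reached by each mtry value:
## lowest OOB error (and where it occurs) for each candidate mtry
oob.err %>%
  group_by(mtry) %>%
  summarise(min_oob = min(oob_error), ntree_at_min = ntree[which.min(oob_error)])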
Pre <- as.formula("medv ~ .")
rf.boston <- randomForest(Pre,data=Boston.train, mtry=4, ntree=100, importance =TRUE)
rf.boston
##
## Call:
## randomForest(formula = Pre, data = Boston.train, mtry = 4, ntree = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 11.59266
## % Var explained: 86.54
Checking the training MSE and the test MSPE, we get a test error of 7.571549 and an out-of-bag training error of 11.59266 (predict() on a randomForest fit without new data returns out-of-bag predictions).
boston.train.pred.tree = predict(rf.boston)
boston.test.pred.tree = predict(rf.boston, Boston.test)
mean((boston.test.pred.tree - Boston.test$medv)^2)
## [1] 7.571549
mean((boston.train.pred.tree - Boston.train$medv)^2)
## [1] 11.59266
rm and lstat are again the most important variables in the random forest model:
importance(rf.boston)
## %IncMSE IncNodePurity
## crim 6.664825 2178.1090
## zn 2.317746 206.5644
## indus 5.962539 2162.5411
## chas 2.644446 252.1947
## nox 7.660876 2576.0658
## rm 16.407279 10604.4800
## age 4.802469 969.8919
## dis 7.666008 2384.9127
## rad 2.348645 355.1198
## tax 6.656205 876.0768
## ptratio 7.614382 2213.7355
## black 3.826313 752.7358
## lstat 12.060453 8629.4972
varImpPlot(rf.boston)
We fit the Gradient Boosting method with a Gaussian loss, 10,000 trees, a shrinkage of 0.01, and an interaction depth of 8.
rm and lstat are the most influential variables:
boston.boost<- gbm(medv~., data = Boston.train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)
## var rel.inf
## lstat lstat 39.5617585
## rm rm 28.9728906
## dis dis 9.3589794
## crim crim 4.2978865
## nox nox 3.8464416
## age age 3.6606354
## ptratio ptratio 2.8917482
## black black 2.6447396
## tax tax 2.0662900
## indus indus 1.0548192
## rad rad 0.8875357
## chas chas 0.6073448
## zn zn 0.1489306
Checking how the test error varies with the number of trees, we see that the error is lowest when the number of trees is around the 500-1000 mark:
ntree<- seq(100, 10000, 100)
predmat<- predict(boston.boost, newdata = Boston.test, n.trees = ntree)
err<- apply((predmat-Boston.test$medv)^2, 2, mean)
plot(ntree, err, type = 'l', col=2, lwd=2, xlab = "n.trees", ylab = "Test MSE")
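gbm can also estimate a reasonable stopping point itself; as a rough cross-check (the OOB-based estimate from gbm.perf is known to be conservative), a minimal sketch:
## OOB-based estimate of the optimal number of boosting iterations
best.iter <- gbm.perf(boston.boost, method = "OOB")
best.iter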
We can also tune the other two parameters, shrinkage and interaction depth, by checking performance on the test data:
oob.err <- data.frame()
shrinkage_val <- c(0.01, 0.02, 0.05, 0.1)
trees <- c(500, 700, 900, 1000, 1200, 1500)
depth <- c(2, 3, 5, 8)
## grid search over shrinkage and interaction depth; test MSPE is evaluated at several tree counts
for(i in shrinkage_val){
  for(k in depth){
    fit <- gbm(medv ~ ., data = Boston.train, distribution = "gaussian", n.trees = 2500, shrinkage = i, interaction.depth = k)
    for(j in trees){
      fit.pred <- predict(fit, Boston.test, n.trees = j)
      mspe <- mean((fit.pred - Boston.test$medv)^2)
      oob.err <- rbind(oob.err, data.frame(cbind(shrink = i, ntree = j, depth_val = k, oob_error = mspe)))
    }
    cat(i, " ")
  }
}
## 0.01 0.01 0.01 0.01 0.02 0.02 0.02 0.02 0.05 0.05 0.05 0.05 0.1 0.1 0.1 0.1
head(oob.err %>% arrange(oob_error), 10)
## shrink ntree depth_val oob_error
## 1 0.02 500 5 6.713356
## 2 0.01 700 5 6.729115
## 3 0.01 900 5 6.764589
## 4 0.01 500 8 6.784178
## 5 0.01 900 8 6.785266
## 6 0.02 700 5 6.786077
## 7 0.01 500 5 6.799373
## 8 0.01 700 8 6.803835
## 9 0.01 1000 8 6.807028
## 10 0.01 1000 5 6.814400
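The single best combination in the grid can also be pulled out directly:
## parameter combination with the lowest test MSPE
oob.err[which.min(oob.err$oob_error), ]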
We refit the Gradient Boosting model with one of the best parameter settings from the grid search: 900 trees, a shrinkage of 0.01, and an interaction depth of 8.
boston.boost<- gbm(medv~., data = Boston.train, distribution = "gaussian", n.trees = 900, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)
## var rel.inf
## lstat lstat 44.12305773
## rm rm 28.04294547
## dis dis 8.20558906
## crim crim 4.26713113
## nox nox 3.36487020
## age age 2.95590726
## ptratio ptratio 2.87778696
## tax tax 2.04182794
## black black 1.84154431
## indus indus 0.95251606
## rad rad 0.78624255
## chas chas 0.45296121
## zn zn 0.08762012
boston.train.pred.tree = predict(boston.boost, n.trees = 900)
boston.test.pred.tree = predict(boston.boost, Boston.test, n.trees = 900)
mean((boston.test.pred.tree - Boston.test$medv)^2)
## [1] 6.647038
mean((boston.train.pred.tree - Boston.train$medv)^2)
## [1] 3.010795
We check the marginal effect of the two most influential variables using partial dependence plots:
par(mfrow=c(1,2))
plot(boston.boost, i="lstat")
plot(boston.boost, i="rm")
Comparing the performance of the different techniques:
Boosting clearly outperforms all the other techniques on this dataset.
Random Forest and Bagging give almost the same performance.
CART does not predict as well as the other methods.
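A compact way to put these numbers side by side is to recompute the test MSPE of the models fitted in this part (a sketch using the rf.boston and boston.boost objects from above; the Part 1 CART and Bagging errors can be appended in the same way):
## test MSPE of the Part 2 models on the same held-out set
rf.mspe    <- mean((predict(rf.boston, Boston.test) - Boston.test$medv)^2)
boost.mspe <- mean((predict(boston.boost, Boston.test, n.trees = 900) - Boston.test$medv)^2)
data.frame(model = c("Random Forest", "Gradient Boosting"),
           test_MSPE = c(rf.mspe, boost.mspe))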