In this document, you will find the answers to the PW4 questions.
I would be honored if you would also check my other contributions on time-series forecasting from my internship within ING Bank N.V.’s Finance Advanced Analytics team:
# Load libraries
library(tidyverse)
library(MASS)
library(caTools)
library(rpart)
library(rpart.plot)
library(randomForest)
# Q1
# Take a random sample of half of the row indices of the Boston data
Boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2)
Boston_train = Boston[Boston_idx,]
Boston_test = Boston[-Boston_idx,]
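Note that sample() is random, so the exact split (and all numbers downstream) will change between runs unless a seed is fixed before sampling, for example:
# Fix the random seed for reproducibility (the seed value here is arbitrary)
set.seed(42)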
# Q2 Fitting a regression tree to the training set
Boston_tree <- rpart(medv~.,
data = Boston_train)
# Q3 Plotting the obtained tree
plot(Boston_tree)
text(Boston_tree, pretty = 0)
title(main = "Regression Tree")
This regression tree represents a series of binary splits, starting from the root at the top. At the first split, the observations whose lstat value is at or above the split threshold fall on the right side of the tree. The predicted medv for an observation is the mean medv of the training observations that end up in the same terminal node.
Next we will use the rpart.plot package to obtain a better plot of the tree. Each node shows the predicted value and the percentage of observations that fall into it. We then use the prp() function for another view of the tree.
rpart.plot(Boston_tree)
prp(Boston_tree)
Let’s now print the summary of the tree. The summary shows the CP (complexity parameter) table and the importance of each variable in the model. The complexity parameter is the minimum improvement in fit that a split must bring in order to be kept; it balances the size of the tree against its ability to separate the values of the target variable.
In our case the final tree uses 6 splits (the last row of the CP table below), at which point rpart() stopped splitting because the default cp threshold of 0.01 was reached.
summary(Boston_tree)
Let’s now print the CP table using the printcp() function to see the cross-validation results. The cp value controls the size of the tree: the greater the cp, the fewer splits in the tree. To help choose a value, rpart reports the results of a 10-fold cross-validation (the xerror column).
printcp(Boston_tree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = Boston_train)
##
## Variables actually used in tree construction:
## [1] crim lstat rm tax
##
## Root node error: 19858/253 = 78.491
##
## n= 253
##
## CP nsplit rel error xerror xstd
## 1 0.512271 0 1.00000 1.00792 0.120633
## 2 0.175301 1 0.48773 0.62700 0.080092
## 3 0.074396 2 0.31243 0.41810 0.063020
## 4 0.029298 3 0.23803 0.38866 0.067651
## 5 0.025293 4 0.20873 0.34899 0.061710
## 6 0.014194 5 0.18344 0.31728 0.056246
## 7 0.010000 6 0.16925 0.30517 0.056180
Let’s now check the optimal size of the tree using the plotcp() function, which plots the cross-validated error (xerror) against the cp value and the corresponding tree size.
plotcp(Boston_tree)
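If we wanted to prune the tree at the cp value with the lowest cross-validated error, a minimal sketch would be (the names below are illustrative and not used elsewhere):
# Pick the cp with the smallest cross-validated error and prune the tree
best_cp <- Boston_tree$cptable[which.min(Boston_tree$cptable[, "xerror"]), "CP"]
Boston_tree_pruned <- prune(Boston_tree, cp = best_cp)
prp(Boston_tree_pruned)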
Let’s now write a small error function to compare the models on the test set. Note that, despite its name, the function below returns the mean squared error (MSE); taking its square root would give the RMSE (Root Mean Square Error).
# Mean squared error between predictions and true values
rmse_calculator <- function(y_pred, y_true) {
  mean((y_pred - y_true)^2)
}
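For reference, a variant that returns the RMSE proper would simply take the square root of this quantity (this helper is illustrative and not used in the outputs below):
# RMSE variant of the error function
rmse_proper <- function(y_pred, y_true) {
  sqrt(mean((y_pred - y_true)^2))
}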
We will now use the predict() function to predict the response on the test set, and then compute the test error of the tree model.
# Predict on the test set
pred.regtree = predict(Boston_tree, newdata = Boston_test)
# Test error (MSE) of the regression tree
rmse_calculator(pred.regtree, Boston_test[, "medv"])
## [1] 25.514
Let’s now fit a linear regression model (the output below shows it fitted on the full Boston data set), predict the responses on the test set, and compute its test error so that we can compare the performance of the tree to the linear regression model.
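A sketch of code producing the output below might look like this (the variable names are illustrative and not used elsewhere):
# Fit a linear model on the Boston data, as in the Call shown below
Boston_lm <- lm(medv ~ ., data = Boston)
summary(Boston_lm)
coef(Boston_lm)
# Predict on the test set and compute the test error with our function
lm.pred <- predict(Boston_lm, newdata = Boston_test)
rmse_calculator(lm.pred, Boston_test[, "medv"])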
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
## (Intercept) crim zn indus chas
## 3.645949e+01 -1.080114e-01 4.642046e-02 2.055863e-02 2.686734e+00
## nox rm age dis rad
## -1.776661e+01 3.809865e+00 6.922246e-04 -1.475567e+00 3.060495e-01
## tax ptratio black lstat
## -1.233459e-02 -9.527472e-01 9.311683e-03 -5.247584e-01
## [1] 23.1661
We can say here that our linear model is slightly better than our regression tree: a test error of 23.1661 for the linear model vs. 25.514 for the regression tree.
Bagging, or bootstrap aggregation, is a general-purpose procedure for reducing the variance of a statistical learning method; it is particularly useful and frequently used in the context of decision trees. The idea is to take many training sets from the population, build a separate prediction model on each training set, and average the resulting predictions. In practice we generally do not have access to multiple training sets; instead, we bootstrap by taking repeated samples from the (single) training data set.
library(randomForest)
bagged.Boston <- randomForest(medv~.,data=Boston_train)
bagged.Boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston_train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 12.3674
## % Var explained: 84.24
bagged.pred <- predict(bagged.Boston, newdata=Boston_test)
plot(bagged.pred, Boston_test[,"medv"])
abline(0,1)
rmse_calculator(bagged.pred, Boston_test[,"medv"])
## [1] 17.07727
We see above that the bagged model scores much better, with a test error of 17.07727.
# Number of predictors: total number of columns minus 1 for the response medv
p = dim(Boston)[2] - 1
bagged.Boston.p <- randomForest(medv~.,data=Boston_train, mtry=p/3)
bagged.Boston.p
##
## Call:
## randomForest(formula = medv ~ ., data = Boston_train, mtry = p/3)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 11.98099
## % Var explained: 84.74
bagged.pred <- predict(bagged.Boston.p, newdata=Boston_test)
plot(bagged.pred, Boston_test[,"medv"])
abline(0,1)
rmse_calculator(bagged.pred, Boston_test[,"medv"])
## [1] 16.87312
When we set p to the number of predictors and pass mtry = p/3, we get a very similar test error, slightly lower this time (16.87312 vs. 17.07727). This is not surprising: for regression, randomForest() already uses mtry = p/3 (here 4 variables tried at each split) by default, so both fits are effectively the same random forest.
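As a side note, a true bagged model would consider all p predictors at every split; a minimal sketch (the variable names are illustrative and not part of the outputs above):
# Bagging proper: mtry equal to the total number of predictors
bagged.true <- randomForest(medv ~ ., data = Boston_train, mtry = p)
bagged.true.pred <- predict(bagged.true, newdata = Boston_test)
rmse_calculator(bagged.true.pred, Boston_test[, "medv"])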
Let’s use the importance() function from the randomForest package to get the most important predictors in our random forest model.
randomForest::importance(bagged.Boston.p, type=2)
## IncNodePurity
## crim 1250.04791
## zn 125.05849
## indus 1279.45030
## chas 72.82679
## nox 1264.73455
## rm 5334.00841
## age 523.37634
## dis 933.97164
## rad 192.75672
## tax 728.69897
## ptratio 1046.59141
## black 458.44652
## lstat 6214.67383
randomForest::varImpPlot(bagged.Boston.p)
We can see from the output that the best predictors are lstat (percentage of lower-status population) and rm (average number of rooms per dwelling). These are the same top predictors we found with the linear model in our PW2 session.
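To dig a bit deeper into the top predictor, we could also draw a partial dependence plot with randomForest::partialPlot(); a sketch, not part of the outputs above:
# Partial dependence of medv on lstat, averaged over the training data
partialPlot(bagged.Boston.p, pred.data = Boston_train, x.var = "lstat")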
Let’s finally try a boosted model using the gbm package.
library(gbm)
## Loaded gbm 2.1.8
boosted.boston = gbm(medv ~ ., data = Boston_train, distribution = "gaussian",
n.trees = 5000, interaction.depth = 4, shrinkage = 0.01)
summary(boosted.boston)
## var rel.inf
## lstat lstat 45.00215414
## rm rm 26.73856587
## dis dis 5.91776483
## nox nox 4.65815302
## crim crim 4.53091181
## black black 3.55310511
## age age 2.90731543
## ptratio ptratio 2.89714608
## tax tax 2.10249175
## indus indus 0.86486779
## rad rad 0.62754873
## zn zn 0.14305967
## chas chas 0.05691576
boosted.pred <- predict(boosted.boston, newdata=Boston_test)
## Using 5000 trees...
plot(boosted.pred, Boston_test[,"medv"])
abline(0,1)
rmse_calculator(boosted.pred, Boston_test[,"medv"])
## [1] 13.49629
The summary() of the gbm model also produces a figure: the relative influence is on the x-axis and the predictors are listed on the (discrete) y-axis. This is yet another way of displaying the importance of our predictors, and as expected lstat and rm sit at the top, mirroring the relative influence table above.
After computing the test error, we get 13.49629. This is the lowest error out of all the models we have fitted.
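With 5000 trees and no hold-out inside gbm, there is some risk of overfitting. As a possible refinement (a sketch, assuming 5-fold cross-validation; the names are illustrative), we could let gbm choose the number of boosting iterations itself:
# Refit with internal 5-fold cross-validation to pick the number of trees
boosted.cv <- gbm(medv ~ ., data = Boston_train, distribution = "gaussian",
                  n.trees = 5000, interaction.depth = 4, shrinkage = 0.01, cv.folds = 5)
best.iter <- gbm.perf(boosted.cv, method = "cv")
boosted.pred.cv <- predict(boosted.cv, newdata = Boston_test, n.trees = best.iter)
rmse_calculator(boosted.pred.cv, Boston_test[, "medv"])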
spam_data <- read_csv("spam.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## spam = col_logical()
## )
## See spec(...) for full column specifications.
head(spam_data)
## # A tibble: 6 x 58
## spam make address all X3d our over remove internet order mail
## <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TRUE 0 0.64 0.64 0 0.32 0 0 0 0 0
## 2 TRUE 0.21 0.28 0.5 0 0.14 0.28 0.21 0.07 0 0.94
## 3 TRUE 0.06 0 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25
## 4 TRUE 0 0 0 0 0.63 0 0.31 0.63 0.31 0.63
## 5 TRUE 0 0 0 0 0.63 0 0.31 0.63 0.31 0.63
## 6 TRUE 0 0 0 0 1.85 0 0 1.85 0 0
## # … with 47 more variables: receive <dbl>, will <dbl>, people <dbl>,
## # report <dbl>, addresses <dbl>, free <dbl>, business <dbl>, email <dbl>,
## # you <dbl>, credit <dbl>, your <dbl>, font <dbl>, X000 <dbl>, money <dbl>,
## # hp <dbl>, hpl <dbl>, george <dbl>, X650 <dbl>, lab <dbl>, labs <dbl>,
## # telnet <dbl>, X857 <dbl>, data <dbl>, X415 <dbl>, X85 <dbl>,
## # technology <dbl>, X1999 <dbl>, parts <dbl>, pm <dbl>, direct <dbl>,
## # cs <dbl>, meeting <dbl>, original <dbl>, project <dbl>, re <dbl>,
## # edu <dbl>, table <dbl>, conference <dbl>, ch. <dbl>, ch..1 <dbl>,
## # ch..2 <dbl>, ch..3 <dbl>, ch..4 <dbl>, ch..5 <dbl>, crl.ave <dbl>,
## # crl.long <dbl>, crl.tot <dbl>
We will now split the dataset into training and test sets.
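A minimal sketch of one way to do this, using sample.split() from the caTools package loaded earlier (the 75/25 ratio and the variable names are assumptions):
# Stratified split on the spam label: 75% training, 25% test
spam_split <- sample.split(spam_data$spam, SplitRatio = 0.75)
spam_train <- subset(spam_data, spam_split == TRUE)
spam_test <- subset(spam_data, spam_split == FALSE)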