In this document, you will find the answers to the PW4 questions.

I would be honored if you would also check my other contributions on time-series forecasting, made during my internship within ING Bank N.V.’s Finance Advanced Analytics team:

  • Time-series forecasting package, whose forecast models fall into three categories: simple heuristics (mean of the last 12 months, of the last 3 months, etc.), classical time-series econometrics (ARIMA, Holt-Winters, Kalman filters, etc.) and machine learning (neural networks, Facebook’s Prophet, etc.)
  • Tools package, a set of helper functions geared mostly towards data with a time dimension

Regression Trees

Single tree

# Load libraries 
library(tidyverse)
library(MASS)
library(caTools)
library(rpart)
library(rpart.plot)
library(randomForest)


# Q1

# Take a random sample of half of the rows to use as the training set
Boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2)
Boston_train = Boston[Boston_idx,]
Boston_test  = Boston[-Boston_idx,]

# Q2 Fitting a regression tree to the training
Boston_tree <- rpart(medv~.,
                    data = Boston_train)

# Q3 Plotting the obtained tree
plot(Boston_tree)
text(Boston_tree, pretty = 0)
title(main = "Regression Tree")

This regression tree represents a sequence of binary splits, starting at the top of the tree. At the first split, observations are sent down the left or right branch depending on whether lstat lies above or below the threshold shown at that node, and each subsequent node splits the data further. The predicted medv at a terminal node is the mean medv of the training observations that end up in that node.

Next we will use the rpart.plot package to obtain a better plot of the tree. Each node shows the predicted value and the percentage of observations falling in that node. Let’s also call the prp() function for another view of the tree.

rpart.plot(Boston_tree)
prp(Boston_tree)

Let’s now print the summary of the tree. The summary shows the CP (complexity parameter) table and the importance of each variable in the model. The complexity parameter is the minimum improvement in fit that a split must bring in order to be kept; it balances the size of the tree against the tree’s ability to explain the target variable.

In our case, the CP table below shows that rpart() stopped after 6 splits (7 terminal nodes), once the complexity parameter reached its default threshold of 0.01.

summary(Boston_tree)

Let’s now print the CP table using the printcp() function to see the cross-validation results. The CP controls the size of the tree: the greater the CP value, the fewer the number of splits. To choose it, rpart performs a 10-fold cross-validation by default.
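For reference, the defaults driving this behaviour can be made explicit through rpart.control(); the following call (not part of the original analysis, object name hypothetical) is equivalent to the fit above, since cp = 0.01 and xval = 10 are rpart’s defaults:

# Equivalent to the earlier fit: cp = 0.01 and xval = 10 are rpart's defaults
Boston_tree_explicit <- rpart(medv ~ ., data = Boston_train,
                              control = rpart.control(cp = 0.01, xval = 10))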

printcp(Boston_tree)
## 
## Regression tree:
## rpart(formula = medv ~ ., data = Boston_train)
## 
## Variables actually used in tree construction:
## [1] crim  lstat rm    tax  
## 
## Root node error: 19858/253 = 78.491
## 
## n= 253 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.512271      0   1.00000 1.00792 0.120633
## 2 0.175301      1   0.48773 0.62700 0.080092
## 3 0.074396      2   0.31243 0.41810 0.063020
## 4 0.029298      3   0.23803 0.38866 0.067651
## 5 0.025293      4   0.20873 0.34899 0.061710
## 6 0.014194      5   0.18344 0.31728 0.056246
## 7 0.010000      6   0.16925 0.30517 0.056180

Let’s now check the optimal size of the tree using another function, plotcp(), which plots the cross-validated error (xerror) against the cp value and the corresponding tree size.

plotcp(Boston_tree)
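As an aside (not part of the original analysis), the tree could be pruned back to the cp value with the lowest cross-validated error; a minimal sketch, with a hypothetical object name:

# Prune to the cp value minimising the cross-validated error (xerror)
best_cp <- Boston_tree$cptable[which.min(Boston_tree$cptable[, "xerror"]), "CP"]
Boston_tree_pruned <- prune(Boston_tree, cp = best_cp)
rpart.plot(Boston_tree_pruned)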

Let’s now write a small helper function to measure prediction error. Note that, as written, it returns the mean squared error (MSE); taking the square root of its result would give the RMSE. All error figures reported below are therefore MSEs.

rmse_calculator <- function(y_pred, y_true) {
  # Mean squared error between predictions and observed values
  # (take sqrt() of the result to obtain the RMSE)
  mean((y_pred - y_true)^2)
}

We will now use the predict() function to predict the response on the test set, and then compute the test error obtained with the tree model.

# Predict medv on the test set with the regression tree
pred.regtree = predict(Boston_tree, newdata = Boston_test)

# Test MSE of the single tree
rmse_calculator(pred.regtree, Boston_test[,"medv"])
## [1] 25.514

Let’s now fit a linear regression model, predict the responses on the test set, and compute the test error so we can compare the performance of the tree to that of the linear model.
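The R code for this step does not appear in the original document, only its output. A plausible sketch is shown below (object names are hypothetical, and note that the Call printed underneath shows the model was fit on the full Boston data rather than on the training half):

# Hypothetical reconstruction of the linear-model step
Boston_lm <- lm(medv ~ ., data = Boston)   # full data, as the printed Call indicates
summary(Boston_lm)
coef(Boston_lm)
pred.lm <- predict(Boston_lm, newdata = Boston_test)
rmse_calculator(pred.lm, Boston_test[, "medv"])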

## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
##   (Intercept)          crim            zn         indus          chas 
##  3.645949e+01 -1.080114e-01  4.642046e-02  2.055863e-02  2.686734e+00 
##           nox            rm           age           dis           rad 
## -1.776661e+01  3.809865e+00  6.922246e-04 -1.475567e+00  3.060495e-01 
##           tax       ptratio         black         lstat 
## -1.233459e-02 -9.527472e-01  9.311683e-03 -5.247584e-01
## [1] 23.1661

We can say here that our linear model is slightly better than our regression tree: we obtain a test MSE of 23.17 for the linear model vs. 25.51 for the regression tree.

Bagging

Bagging, or bootstrap aggregation, is a general-purpose procedure for reducing the variance of a statistical learning method; it is particularly useful and frequently used in the context of decision trees. The idea is to take many training sets from the population, build a separate prediction model on each one, and average the resulting predictions. Since we generally do not have access to multiple training sets, we bootstrap instead, taking repeated samples from the single training set.
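Note that, strictly speaking, bagging corresponds to a random forest in which mtry equals the full number of predictors, so that every split considers all variables. The call used below keeps randomForest’s default mtry (4 for this data, as its output shows), which is technically a random forest rather than a pure bagged model. A minimal sketch of a true bagged fit, with a hypothetical object name:

# True bagging: consider all 13 predictors at every split
bagged.full <- randomForest(medv ~ ., data = Boston_train,
                            mtry = ncol(Boston_train) - 1)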

library(randomForest)

bagged.Boston <- randomForest(medv~.,data=Boston_train)
bagged.Boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 12.3674
##                     % Var explained: 84.24
bagged.pred <- predict(bagged.Boston, newdata=Boston_test)

plot(bagged.pred, Boston_test[,"medv"])
abline(0,1)

rmse_calculator(bagged.pred, Boston_test[,"medv"])
## [1] 17.07727

We see above that the bagged model performs better, with a test MSE of 17.08 (down from 25.51 for the single tree).

Random Forest

# Number of predictors: total number of columns minus 1 for the response medv
p = dim(Boston)[2] - 1

bagged.Boston.p <- randomForest(medv~.,data=Boston_train, mtry=p/3)
bagged.Boston.p
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston_train, mtry = p/3) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.98099
##                     % Var explained: 84.74
bagged.pred <- predict(bagged.Boston.p, newdata=Boston_test)

plot(bagged.pred, Boston_test[,"medv"])
abline(0,1)

rmse_calculator(bagged.pred, Boston_test[,"medv"])
## [1] 16.87312

Setting p to the number of predictors and using p/3 for the mtry argument, we get a very similar test MSE, in fact slightly lower (16.87 vs. 17.08). Note that mtry = p/3 (about 4 here) is also randomForest’s default for regression, which is why the two fits are nearly identical.
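If we wanted to choose mtry from the data rather than by rule of thumb, the randomForest package ships a tuneRF() helper that searches over mtry values using the out-of-bag error; a minimal sketch, not part of the original analysis:

# Search over mtry values using the out-of-bag error
predictors <- setdiff(names(Boston_train), "medv")
tuned_mtry <- tuneRF(Boston_train[, predictors], Boston_train$medv,
                     ntreeTry = 500, stepFactor = 1.5, improve = 0.01)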

Let’s use the importance() function from the randomForest package to identify the most important predictors in our random forest model.

randomForest::importance(bagged.Boston.p, type=2)
##         IncNodePurity
## crim       1250.04791
## zn          125.05849
## indus      1279.45030
## chas         72.82679
## nox        1264.73455
## rm         5334.00841
## age         523.37634
## dis         933.97164
## rad         192.75672
## tax         728.69897
## ptratio    1046.59141
## black       458.44652
## lstat      6214.67383
randomForest::varImpPlot(bagged.Boston.p)

We can see from the output that the best predictors are lstat (% lower status of the population) and rm (average number of rooms per dwelling). We reached the same conclusion with the linear model in our PW2 session.
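To go one step beyond the importance ranking (this is not in the original analysis), randomForest’s partialPlot() can show how the predicted response moves with a single predictor:

# Partial dependence of predicted medv on the top predictor, lstat
partialPlot(bagged.Boston.p, pred.data = Boston_train, x.var = "lstat")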

Boosting

Let’s finally try a boosted model using the gbm package.

library(gbm)
## Loaded gbm 2.1.8
boosted.boston = gbm(medv ~ ., data = Boston_train, distribution = "gaussian", 
                    n.trees = 5000, interaction.depth = 4, shrinkage = 0.01)

summary(boosted.boston)
##             var     rel.inf
## lstat     lstat 45.00215414
## rm           rm 26.73856587
## dis         dis  5.91776483
## nox         nox  4.65815302
## crim       crim  4.53091181
## black     black  3.55310511
## age         age  2.90731543
## ptratio ptratio  2.89714608
## tax         tax  2.10249175
## indus     indus  0.86486779
## rad         rad  0.62754873
## zn           zn  0.14305967
## chas       chas  0.05691576
boosted.pred <- predict(boosted.boston, newdata=Boston_test)
## Using 5000 trees...
plot(boosted.pred, Boston_test[,"medv"])
abline(0,1)

rmse_calculator(boosted.pred, Boston_test[,"medv"])
## [1] 13.49629

Calling summary() on the gbm model also produces a figure: the x-axis shows the relative influence and the y-axis lists the predictors. This is yet another way to display the importance of the predictors and, as expected, lstat and rm sit at the top, consistent with our other models.
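Before comparing errors, note that predict() used all 5000 trees (see the message above). gbm provides gbm.perf() to estimate a sensible number of boosting iterations; a minimal sketch, not part of the original analysis and with hypothetical object names:

# Estimate the number of boosting iterations from the out-of-bag improvement
# (the OOB method tends to underestimate the optimal number, per gbm's documentation)
best_iter <- gbm.perf(boosted.boston, method = "OOB")
boosted.pred.best <- predict(boosted.boston, newdata = Boston_test, n.trees = best_iter)
rmse_calculator(boosted.pred.best, Boston_test[, "medv"])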

The test MSE comes out at 13.50, the lowest error of all the models we have fitted.
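As a small recap, the test MSEs printed throughout this document can be gathered in one table (the values below are simply copied from the outputs above):

# Test MSEs collected from the outputs above
results <- data.frame(
  model    = c("Single tree", "Linear regression", "Bagging", "Random forest", "Boosting"),
  test_mse = c(25.51, 23.17, 17.08, 16.87, 13.50)
)
results[order(results$test_mse), ]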

Classification tree

spam_data <- read_csv("spam.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   spam = col_logical()
## )
## See spec(...) for full column specifications.
head(spam_data)
## # A tibble: 6 x 58
##   spam   make address   all   X3d   our  over remove internet order  mail
##   <lgl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>
## 1 TRUE   0       0.64  0.64     0  0.32  0      0        0     0     0   
## 2 TRUE   0.21    0.28  0.5      0  0.14  0.28   0.21     0.07  0     0.94
## 3 TRUE   0.06    0     0.71     0  1.23  0.19   0.19     0.12  0.64  0.25
## 4 TRUE   0       0     0        0  0.63  0      0.31     0.63  0.31  0.63
## 5 TRUE   0       0     0        0  0.63  0      0.31     0.63  0.31  0.63
## 6 TRUE   0       0     0        0  1.85  0      0        1.85  0     0   
## # … with 47 more variables: receive <dbl>, will <dbl>, people <dbl>,
## #   report <dbl>, addresses <dbl>, free <dbl>, business <dbl>, email <dbl>,
## #   you <dbl>, credit <dbl>, your <dbl>, font <dbl>, X000 <dbl>, money <dbl>,
## #   hp <dbl>, hpl <dbl>, george <dbl>, X650 <dbl>, lab <dbl>, labs <dbl>,
## #   telnet <dbl>, X857 <dbl>, data <dbl>, X415 <dbl>, X85 <dbl>,
## #   technology <dbl>, X1999 <dbl>, parts <dbl>, pm <dbl>, direct <dbl>,
## #   cs <dbl>, meeting <dbl>, original <dbl>, project <dbl>, re <dbl>,
## #   edu <dbl>, table <dbl>, conference <dbl>, ch. <dbl>, ch..1 <dbl>,
## #   ch..2 <dbl>, ch..3 <dbl>, ch..4 <dbl>, ch..5 <dbl>, crl.ave <dbl>,
## #   crl.long <dbl>, crl.tot <dbl>

The Spam dataset

We will now split the dataset into training and test sets.
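A minimal sketch of one way to do this split, mirroring the sample()-based split used for the Boston data (the variable names and the 50/50 proportion are assumptions; the original analysis continues from here):

# Hypothetical 50/50 train/test split of the spam data
spam_idx   <- sample(1:nrow(spam_data), floor(nrow(spam_data) / 2))
spam_train <- spam_data[spam_idx, ]
spam_test  <- spam_data[-spam_idx, ]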