Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks. They are powerful algorithms, capable of fitting complex datasets. Decision trees are also fundamental components of random forests, which are among the most potent Machine Learning algorithms available today.
set.seed(678)
path <- 'https://raw.githubusercontent.com/guru99-edu/R-Programming/master/titanic_data.csv'
titanic <- read.csv(path)
head(titanic)
tail(titanic)
From the head and tail output, you can notice that the data is not shuffled. This is a big issue! If you split your data into a train set and a test set as-is, you will select only passengers from classes 1 and 2 (no passenger from class 3 is in the top 80 percent of the observations), which means the algorithm will never see the features of class 3 passengers. This mistake will lead to poor predictions.
shuffle_index <- sample(1:nrow(titanic))
head(shuffle_index)
[1] 57 774 796 1044 681 920
titanic <- titanic[shuffle_index, ]
head(titanic)
View(titanic)
The structure of the data shows that some variables have NAs. Clean up the data as follows:
library(tidyverse)
# Drop variables that carry no predictive signal
clean_titanic <- titanic %>%
  select(-c(home.dest, cabin, name, x, ticket)) %>%
  # Convert types and recode the factor levels
  mutate(age = as.numeric(age),
         fare = as.numeric(fare),
         pclass = factor(pclass, levels = c(1, 2, 3),
                         labels = c('Upper', 'Middle', 'Lower')),
         survived = factor(survived, levels = c(0, 1),
                           labels = c('No', 'Yes'))) %>%
  na.omit()  # drop the rows where coercion produced NAs
Warning: Problem while computing `age = as.numeric(age)`.
i NAs introduced by coercion
Warning: Problem while computing `fare = as.numeric(fare)`.
i NAs introduced by coercion
glimpse(clean_titanic)
Rows: 1,045
Columns: 8
$ pclass <fct> Upper, Lower, Lower, Middle, Lower, Middle, Lower, Lower, Upper, Middle, Upper, Middle, Middle, Middle, Upper, Upper,~
$ survived <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, No, Yes, No, No, No, No, No, No, No, No, Yes, No, No, Yes, No, Yes~
$ sex <chr> "male", "male", "male", "male", "female", "female", "male", "male", "female", "male", "female", "male", "female", "fe~
$ age <dbl> 36.0, 42.0, 18.5, 44.0, 19.0, 26.0, 23.0, 28.5, 64.0, 36.5, 45.0, 42.0, 22.0, 20.0, 17.0, 52.0, 24.0, 35.0, 18.0, 27.~
$ sibsp <int> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~
$ parch <int> 2, 0, 0, 0, 0, 1, 0, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ~
$ fare <dbl> 120.0000, 8.6625, 7.2292, 13.0000, 16.1000, 26.0000, 7.8542, 16.1000, 83.1583, 26.0000, 164.8667, 13.0000, 21.0000, 2~
$ embarked <chr> "S", "S", "C", "S", "S", "S", "S", "S", "C", "S", "S", "S", "S", "S", "S", "S", "S", "S", "Q", "S", "S", "S", "S", "S~
Before you train your model, you need to perform two steps:
- Create a train set and a test set: you train the model on the train set and test the predictions on the test set (i.e. unseen data).
- Install rpart.plot from the console.
The common practice is to split the data 80/20: 80 percent of the data serves to train the model, and 20 percent to make predictions. You need to create two separate data frames, and you don't want to touch the test set until you have finished building your model. You can create a function named create_train_test() that takes three arguments:
create_train_test <- function(data, size = 0.8, train = TRUE) {
  n_row <- nrow(data)
  total_row <- size * n_row
  # The data is already shuffled, so the first `size` share of rows
  # can safely serve as the train set
  train_sample <- c(1:total_row)
  if (train == TRUE) {
    return(data[train_sample, ])
  } else {
    return(data[-train_sample, ])
  }
}
data_train <- create_train_test(clean_titanic, 0.8, train = TRUE)
data_test <- create_train_test(clean_titanic, 0.8, train = FALSE)
dim(data_train)
[1] 836 8
dim(data_test)
[1] 209 8
str(data_train)
'data.frame': 836 obs. of 8 variables:
$ pclass : Factor w/ 3 levels "Upper","Middle",..: 1 3 3 2 3 2 3 3 1 2 ...
$ survived: Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
$ sex : chr "male" "male" "male" "male" ...
$ age : num 36 42 18.5 44 19 26 23 28.5 64 36.5 ...
$ sibsp : int 1 0 0 0 1 1 0 0 0 0 ...
$ parch : int 2 0 0 0 0 1 0 0 2 2 ...
$ fare : num 120 8.66 7.23 13 16.1 ...
$ embarked: chr "S" "S" "C" "S" ...
- attr(*, "na.action")= 'omit' Named int [1:264] 3 4 5 8 9 14 19 44 49 54 ...
..- attr(*, "names")= chr [1:264] "796" "1044" "681" "1019" ...
The train dataset has 836 rows while the test dataset has 209 rows.
You use the function prop.table() combined with table() to verify that the randomization process is correct, i.e. that both sets have a similar share of survivors.
prop.table(table(data_train$survived))
No Yes
0.6004785 0.3995215
prop.table(table(data_test$survived))
No Yes
0.5550239 0.4449761
You are ready to build the model. The syntax for the rpart() decision tree function is:
rpart(formula, data = , method = '')
arguments:
- formula: the formula of the model to fit
- data: the data frame containing the variables
- method:
  - "class" for a classification tree
  - "anova" for a regression tree
library(rpart)
library(rpart.plot)
fit <- rpart(survived~., data = data_train, method = 'class')
# extra = 106 displays the probability of the second class (survived = Yes)
# and the percentage of observations in each node
rpart.plot(fit, extra = 106)
You can now predict on your test dataset with the predict() function. The basic syntax of predict() for an R decision tree is predict(fitted_model, newdata, type = 'class'), where type = 'class' returns the predicted class rather than class probabilities.
predict_unseen <- predict(fit, data_test, type = 'class')
table_mat <- table(data_test$survived, predict_unseen)
table_mat
predict_unseen
No Yes
No 106 10
Yes 34 59
You can compute an accuracy measure for the classification task from the confusion matrix. The confusion matrix is a good choice to evaluate classification performance: the general idea is to count the number of times instances of one class are classified as the other class.
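Here the diagonal of table_mat holds the 106 + 59 = 165 correct predictions out of the 106 + 10 + 34 + 59 = 209 test passengers, so the accuracy works out to 165 / 209 ≈ 0.789: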
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
print(paste('Accuracy for test', accuracy_Test))
[1] "Accuracy for test 0.789473684210526"
Decision trees in R have various parameters that control aspects of the fit. In the rpart decision tree library, you can control these parameters using the rpart.control() function. The following code introduces the parameters you will tune. You can refer to the vignette for the other parameters.
rpart.control(minsplit = 20, minbucket = round(minsplit/3), maxdepth = 30)
arguments:
- minsplit: set the minimum number of observations in a node before the algorithm performs a split
- minbucket: set the minimum number of observations in a terminal node, i.e. the leaf
- maxdepth: set the maximum depth of any node of the final tree. The root node is treated as depth 0
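You can inspect the defaults directly; note that when you don't set minbucket, it is derived from minsplit:
ctrl <- rpart.control()
ctrl$minsplit   # 20 by default
ctrl$minbucket  # round(20/3) = 7
ctrl$maxdepth   # 30
To compare tuned models against this baseline, first wrap the accuracy computation from above into a helper function: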
accuracy_tune <- function(fit) {
predict_unseen <- predict(fit, data_test, type = 'class')
table_mat <- table(data_test$survived, predict_unseen)
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
accuracy_Test
}
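As a sanity check, calling it on the first tree reproduces the test accuracy computed earlier:
accuracy_tune(fit)
[1] 0.7894737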
You can try to tune the parameters and see whether you can improve the model over the default values. As a reminder, you need to get an accuracy higher than 0.78.
control <- rpart.control(minsplit = 4,
minbucket = round(5 / 3),
maxdepth = 3,
cp = 0)
tune_fit <- rpart(survived~., data = data_train, method = 'class', control = control)
accuracy_tune(tune_fit)
[1] 0.7942584
You get a higher performance than with the previous model. Congratulations!
Random forests are based on a simple idea: 'the wisdom of the crowd'. Aggregating the results of multiple predictors gives a better prediction than the best individual predictor. A group of predictors is called an ensemble, which is why this technique is called Ensemble Learning.
In the earlier part of this tutorial, you learned how to use decision trees to make a binary prediction. To improve the technique, you can train a group of decision tree classifiers, each on a different random subset of the train set. To make a prediction, you obtain the predictions of all the individual trees, then predict the class that gets the most votes (see the sketch below). This technique is called Random Forest.
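To make the voting idea concrete, here is a minimal sketch, not the randomForest implementation; the choice of 25 trees and the seed are arbitrary. It trains rpart trees on bootstrap samples of the train set built earlier and takes a majority vote on the test set:
library(rpart)
set.seed(678)
n_trees <- 25
votes <- sapply(1:n_trees, function(i) {
  # Bootstrap sample: draw rows of the train set with replacement
  boot_idx <- sample(nrow(data_train), replace = TRUE)
  tree <- rpart(survived ~ ., data = data_train[boot_idx, ], method = 'class')
  as.character(predict(tree, data_test, type = 'class'))
})
# Majority vote across the trees for each test passenger
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(majority == data_test$survived)  # share of correct majority votes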
To make sure you have the same dataset as in the tutorial for decision trees, the train set and the test set are stored on the internet. You can import them without making any change.
One way to evaluate the performance of a model is to train it on several different smaller datasets and evaluate it over the corresponding held-out subsets. This is called K-fold cross-validation. R has functions to randomly split a dataset into k subsets of almost the same size. For example, with k = 10, the model is trained on nine folds and tested on the remaining fold, and the process is repeated until every fold has been used as the test set. This technique is widely used for model selection, especially when the model has parameters to tune.
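As an illustration, caret (used later in this tutorial) provides createFolds() to split the row indices into k roughly equal, class-stratified groups:
library(caret)
set.seed(1234)
folds <- createFolds(clean_titanic$survived, k = 10)  # ten lists of hold-out row indices
# e.g. train on everything except the first fold, test on the first fold
fold_train <- clean_titanic[-folds[[1]], ]
fold_test <- clean_titanic[folds[[1]], ]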
Now that you have a way to evaluate the model, you need to figure out how to choose the parameters that generalize best to the data.
A random forest chooses a random subset of features at each split and builds many decision trees. The model averages out all the predictions of the decision trees.
randomForest(formula, ntree = n, mtry = FALSE, maxnodes = NULL)
arguments:
- formula: formula of the fitted model
- ntree: number of trees in the forest
- mtry: number of candidate variables drawn at each split. By default, it is the square root of the number of columns
- maxnodes: set the maximum number of terminal nodes in the forest
- importance = TRUE: whether the importance of the independent variables should be assessed
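Before tuning anything, you can fit a quick baseline with the package defaults. This is a sketch; the seed and ntree = 500 (the package default) are arbitrary choices:
library(randomForest)
set.seed(1234)
rf_basic <- randomForest(survived ~ ., data = data_train,
                         ntree = 500, importance = TRUE)
rf_basic  # prints the out-of-bag error estimate and a confusion matrix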
Tuning a model is very tedious work. There are lots of possible combinations of parameters, and you don't necessarily have the time to try all of them. A good alternative is to let the machine find the best combination for you. Two methods are available:
- Grid search
- Random search
The grid search method is simple: the model is evaluated over all the combinations you pass to the function, using cross-validation.
For instance, say you want to try the model with 10, 20 and 30 trees, and each number of trees tested with mtry equal to 1, 2, 3, 4 and 5. The machine will then test 15 different models:
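The Cartesian product behind that count is easy to verify with expand.grid():
grid <- expand.grid(ntree = c(10, 20, 30), mtry = c(1:5))
nrow(grid)
[1] 15
Note that caret's "rf" method only tunes mtry through tuneGrid; ntree is passed as a separate argument, as you will see below.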
library(randomForest)
library(caret)
library(e1071)
The default K-fold cross-validation setting is controlled by the trainControl() function:
trainControl(method = "cv", number = n, search = "grid")
arguments:
- method = "cv": the method used to resample the dataset
- number = n: the number of folds to create
- search = "grid": use the grid search method. For the randomized method, use "random"
Random search: the big difference between random search and grid search is that random search does not evaluate all the combinations of hyperparameters in the search space. Instead, it randomly chooses a combination at every iteration. The advantage is that it lowers the computational cost.
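As a short sketch of what this looks like with caret (the tuneLength value of 5 is an arbitrary choice), you set search = "random" and let tuneLength control how many random candidates are drawn:
set.seed(1234)
trControl_random <- trainControl(method = "cv",
                                 number = 10,
                                 search = "random")
rf_random <- train(survived ~ ., data = data_train,
                   method = "rf", metric = "Accuracy",
                   tuneLength = 5, trControl = trControl_random)
print(rf_random)  # shows the accuracy for each randomly drawn mtry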
Set the control parameters. You will proceed as follows to construct and evaluate the model:
# Define the control
trControl <- trainControl(method = "cv",
number = 10,
search = "grid")
You will use the caret library to evaluate your model. The library has a function called train() that can evaluate almost every machine learning algorithm. Said differently, you can use this one function to train many other algorithms.
train(formula, df, method = "rf", metric = "Accuracy", trControl = trainControl(), tuneGrid = NULL)
arguments:
- formula: define the formula of the algorithm
- method: define which model to train; "rf" stands for random forest
- metric = "Accuracy": define how to select the optimal model
- trControl = trainControl(): define the control parameters
- tuneGrid = NULL: a data frame with the possible tuning values to evaluate
set.seed(1234)
# Run the model
rf_default <- train(survived ~.,
data = data_train,
method = "rf",
metric = "Accuracy",
trControl = trControl)
# Print the results
print(rf_default)
Random Forest
836 samples
7 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 752, 753, 752, 753, 752, 753, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7906483 0.5457050
6 0.7894435 0.5515548
10 0.7643431 0.5030373
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
You can test the model with values of mtry from 1 to 10:
set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1: 10))
rf_mtry <- train(survived~.,
data = data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
ntree = 300)
print(rf_mtry)
Random Forest
836 samples
7 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 752, 753, 752, 753, 752, 753, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
1 0.7667097 0.4778802
2 0.7990103 0.5617566
3 0.7966581 0.5600419
4 0.7942915 0.5567493
5 0.7978772 0.5662198
6 0.7966724 0.5636781
7 0.7930579 0.5570674
8 0.7943058 0.5600845
9 0.7894578 0.5492753
10 0.7918962 0.5555188
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
rf_mtry$bestTune$mtry
[1] 2
max(rf_mtry$results$Accuracy)
[1] 0.7990103
best_mtry <- rf_mtry$bestTune$mtry
best_mtry
[1] 2
You need to create a loop to evaluate the different values of maxnodes. In the following code, you will:
- Create a list to store the results
- Loop over values of maxnodes from 5 to 15 and, for each value, train a random forest with the best mtry found above
- Store each trained model in the list under a key equal to the current maxnodes value
- Collect the stored models with resamples() and summarize their accuracy
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(5: 15)) {
set.seed(1234)
rf_maxnode <- train(survived~.,
data = data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
maxnodes = maxnodes,
ntree = 300)
current_iteration <- toString(maxnodes)
store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)
Call:
summary.resamples(object = results_mtry)
Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5 0.7469880 0.7597533 0.7738095 0.7727338 0.7850688 0.8072289 0
6 0.7500000 0.7710843 0.7857143 0.7894435 0.8095238 0.8313253 0
7 0.7380952 0.7597533 0.7843517 0.7870052 0.8089501 0.8571429 0
8 0.7590361 0.7641997 0.7843517 0.7929862 0.8048264 0.8690476 0
9 0.7500000 0.7657057 0.7916667 0.7930293 0.8162651 0.8333333 0
10 0.7500000 0.7641997 0.7904475 0.7966294 0.8273810 0.8571429 0
11 0.7380952 0.7777180 0.7976190 0.7966150 0.8162651 0.8452381 0
12 0.7500000 0.7831325 0.7857143 0.7942197 0.8138626 0.8571429 0
13 0.7590361 0.7678571 0.7904475 0.7930149 0.8138626 0.8452381 0
14 0.7590361 0.7738095 0.7963282 0.7989530 0.8208907 0.8571429 0
15 0.7590361 0.7951807 0.8035714 0.8049627 0.8208907 0.8690476 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5 0.4339071 0.4610718 0.5005875 0.4939166 0.5236180 0.5662965 0
6 0.4473684 0.4849300 0.5222538 0.5337185 0.5876485 0.6246770 0
7 0.4122137 0.4584416 0.5284378 0.5291909 0.5799157 0.6939891 0
8 0.4601542 0.4725903 0.5284378 0.5421289 0.5645372 0.7083333 0
9 0.4417722 0.4776392 0.5415944 0.5428190 0.5918055 0.6474820 0
10 0.4417722 0.4725903 0.5471631 0.5510684 0.6313285 0.6800000 0
11 0.4181360 0.5040234 0.5583671 0.5521731 0.5918055 0.6591760 0
12 0.4417722 0.5138337 0.5330832 0.5456990 0.5853127 0.6800000 0
13 0.4578707 0.4790076 0.5468076 0.5435401 0.5858432 0.6591760 0
14 0.4638243 0.4923274 0.5601550 0.5581549 0.6169907 0.6800000 0
15 0.4638243 0.5470799 0.5689730 0.5701457 0.6072453 0.7083333 0
The last value of maxnodes (15) has the highest accuracy. You can try higher values to see if you can get a higher score.
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(20: 30)) {
set.seed(1234)
rf_maxnode <- train(survived~.,
data = data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
maxnodes = maxnodes,
ntree = 300)
key <- toString(maxnodes)
store_maxnode[[key]] <- rf_maxnode
}
results_node <- resamples(store_maxnode)
summary(results_node)
Call:
summary.resamples(object = results_node)
Models: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
20 0.7590361 0.7738095 0.7976190 0.7918531 0.8048264 0.8313253 0
21 0.7590361 0.7747418 0.8024240 0.7966150 0.8168388 0.8333333 0
22 0.7590361 0.7678571 0.7916667 0.7977912 0.8162651 0.8690476 0
23 0.7590361 0.7747418 0.7976190 0.7954389 0.8089501 0.8333333 0
24 0.7590361 0.7648810 0.7857143 0.7906483 0.8089501 0.8452381 0
25 0.7590361 0.7740964 0.7916667 0.7977912 0.8184524 0.8571429 0
26 0.7590361 0.7648810 0.8024240 0.7966150 0.8214286 0.8333333 0
27 0.7590361 0.7648810 0.8024240 0.7954102 0.8168388 0.8452381 0
28 0.7469880 0.7767857 0.7857143 0.7906483 0.8089501 0.8313253 0
29 0.7380952 0.7657057 0.7916667 0.7918531 0.8178787 0.8333333 0
30 0.7469880 0.7619048 0.7857143 0.7870625 0.8089501 0.8333333 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
20 0.4578707 0.4935681 0.5584709 0.5431912 0.5830589 0.6246770 0
21 0.4638243 0.5010434 0.5791086 0.5551867 0.5940465 0.6508314 0
22 0.4638243 0.4832188 0.5510870 0.5574394 0.5929211 0.7179487 0
23 0.4578707 0.5010434 0.5583671 0.5498851 0.5968977 0.6308851 0
24 0.4638243 0.4754773 0.5349377 0.5416141 0.5929211 0.6666667 0
25 0.4638243 0.5051482 0.5471290 0.5567316 0.6160751 0.6800000 0
26 0.4638243 0.4827514 0.5664209 0.5549536 0.6233991 0.6350093 0
27 0.4638243 0.4742366 0.5744215 0.5510311 0.5940465 0.6591760 0
28 0.4400899 0.4983097 0.5371585 0.5417438 0.5901742 0.6246770 0
29 0.4181360 0.4832188 0.5536855 0.5459419 0.6142639 0.6429872 0
30 0.4400899 0.4782746 0.5311582 0.5353611 0.5929211 0.6350093 0
The highest accuracy score is obtained with a maxnodes value equal to 22.
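Rather than reading the summary table by eye, you can also rank the stored models by their mean cross-validated accuracy; a small convenience sketch:
acc_means <- sapply(store_maxnode, function(m) max(m$results$Accuracy))
head(sort(acc_means, decreasing = TRUE), 3)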
Now that you have the best values of mtry and maxnodes, you can tune the number of trees. The method is exactly the same as for maxnodes.
store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
set.seed(5678)
rf_maxtrees <- train(survived~.,
data = data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
maxnodes = 24,
ntree = ntree)
key <- toString(ntree)
store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree)
Call:
summary.resamples(object = results_tree)
Models: 250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
250 0.7500000 0.7747418 0.7904475 0.7967011 0.8268072 0.8554217 0
300 0.7500000 0.7761403 0.7904475 0.7955106 0.8178787 0.8452381 0
350 0.7500000 0.7648810 0.7891566 0.7943201 0.8178787 0.8452381 0
400 0.7500000 0.7648810 0.7891566 0.7931297 0.8178787 0.8433735 0
450 0.7380952 0.7648810 0.7891566 0.7931297 0.8268072 0.8433735 0
500 0.7380952 0.7738095 0.7891566 0.7955106 0.8268072 0.8433735 0
550 0.7380952 0.7738095 0.7844951 0.7943058 0.8268072 0.8433735 0
600 0.7380952 0.7738095 0.7844951 0.7954963 0.8268072 0.8452381 0
800 0.7380952 0.7717656 0.7844951 0.7954963 0.8343373 0.8452381 0
1000 0.7380952 0.7717656 0.7844951 0.7943058 0.8268072 0.8452381 0
2000 0.7500000 0.7717656 0.7844951 0.7966867 0.8343373 0.8452381 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
250 0.4302326 0.5126573 0.5517463 0.5553591 0.6178311 0.6885553 0
300 0.4302326 0.5179021 0.5517463 0.5524474 0.5999075 0.6643857 0
350 0.4473684 0.4972937 0.5444663 0.5503549 0.5999075 0.6643857 0
400 0.4473684 0.4972937 0.5444663 0.5479382 0.5999075 0.6643857 0
450 0.4181360 0.4972937 0.5444663 0.5474048 0.6178311 0.6643857 0
500 0.4181360 0.5145985 0.5444663 0.5526736 0.6178311 0.6643857 0
550 0.4181360 0.5068326 0.5378591 0.5503167 0.6178311 0.6643857 0
600 0.4181360 0.5068326 0.5378591 0.5527334 0.6178311 0.6643857 0
800 0.4181360 0.5068326 0.5378591 0.5525672 0.6359562 0.6643857 0
1000 0.4181360 0.5056400 0.5378591 0.5492948 0.6178311 0.6643857 0
2000 0.4417722 0.5068326 0.5378591 0.5549308 0.6359562 0.6643857 0
You have your final model. You can train the random forest with the following parameters:
fit_rf <- train(survived~.,
data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
ntree = 800,
maxnodes = 24)
The caret library has a function to make predictions:
predict(model, newdata = df)
arguments:
- model: the model trained before
- newdata: the dataset on which to make the prediction
prediction <- predict(fit_rf, data_test)
confusionMatrix(prediction, data_test$survived)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 109 32
Yes 7 61
Accuracy : 0.8134
95% CI : (0.7539, 0.8638)
No Information Rate : 0.555
P-Value [Acc > NIR] : 3.115e-15
Kappa : 0.6119
Mcnemar's Test P-Value : 0.0001215
Sensitivity : 0.9397
Specificity : 0.6559
Pos Pred Value : 0.7730
Neg Pred Value : 0.8971
Prevalence : 0.5550
Detection Rate : 0.5215
Detection Prevalence : 0.6746
Balanced Accuracy : 0.7978
'Positive' Class : No
You have an accuracy of 0.8134 (81.34 percent), which is higher than the value from the default model.
Lastly, you can look at the feature importance with the function varImp(). It seems that the most important features are sex and age. That is not surprising: important features are likely to appear closer to the root of the tree, while less important features often appear closer to the leaves.
varImp(fit_rf)
rf variable importance
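If you prefer a visual ranking of the features, caret also provides a plot method for varImp objects:
plot(varImp(fit_rf))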