library(tidyverse)
library(openxlsx)
library(kableExtra)
library(flextable)   # formatted tables
library(lubridate)
library(ISLR)        # Smarket data set
library(class)       # knn function
library(caret)       # createDataPartition(), confusionMatrix()
library(C50)         # C5.0 decision trees
library(gt)
library(Hmisc)       # impute()
library(rpart)       # rpart decision trees
library(rpart.plot)  # plotting rpart trees
df <- recipes::credit_data
#options(scipen = 999)
A decision tree is a supervised learning model that can be used for both regression and classification problems; here it is applied to a classification problem.
Data set: recipes::credit_data
Meaning of the variables: https://github.com/gastonstat/CreditScoring
The aim is to construct a model that predicts the variable 'Status' from the values of the other variables.
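Before modelling it is useful to inspect the class distribution of 'Status', since it determines the baseline accuracy a classifier has to beat. A minimal check:

# class distribution of the target variable
df %>%
  count(Status) %>%
  mutate(prop = n / sum(n))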
flextable(head(df)) %>%
add_header_lines("First six records of the data set with credit data") %>%
add_header_lines("Table 1") %>%
fontsize(size = 14, i = 1:2, part = 'header') %>%
italic(i = 2, part = 'header')
Table 1
First six records of the data set with credit data

Status | Seniority | Home | Time | Age | Marital | Records | Job | Expenses | Income | Assets | Debt | Amount | Price
good | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846
good | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658
bad | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985
good | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325
good | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910
good | 1 | owner | 60 | 36 | married | no | fixed | 75 | 214 | 3500 | 0 | 650 | 1645
Missing values can be a problem in an ML model. Table 2 gives an overview of the number of missing values; variables without missing values are omitted.
MV <- sapply(df, function(x) sum(is.na(x)))  # number of NAs per variable
MV <- tibble(VARIABLE = names(MV), NR_NA = MV) %>%
  filter(NR_NA != 0)
flextable(MV) %>%
add_header_lines(c(
"Numbers of missing values in the credit data set",
"Table 2")) %>%
fontsize(size = 14, i = 1:2, part = 'header') %>%
italic(i = 2, part = 'header') %>%
width(j = 1:2, 1.5) %>%
set_header_labels(values =
list(NR_NA = "MISSING VALUES"))
Table 2
Numbers of missing values in the credit data set

VARIABLE | MISSING VALUES
Home | 6
Marital | 1
Job | 2
Income | 381
Assets | 47
Debt | 18
To avoid problems with missing values, the NAs in the categorical variables are replaced by the value 'unknown'; for the numeric variables the median of the variable is imputed[1], using the Hmisc::impute() function.
df_complete <- df %>%
  mutate(# categorical variables: replace NA by the explicit level 'unknown'
         Home = as.character(Home),
         Home = ifelse(is.na(Home), "unknown", Home),
         Home = factor(Home),
         Marital = as.character(Marital),
         Marital = ifelse(is.na(Marital), "unknown", Marital),
         Marital = factor(Marital),
         Job = as.character(Job),
         Job = ifelse(is.na(Job), "unknown", Job),
         Job = factor(Job),
         # numeric variables: replace NA by the median
         Income = as.numeric(impute(Income, median)),
         Assets = as.numeric(impute(Assets, median)),
         Debt = as.numeric(impute(Debt, median)))
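As a quick verification that the imputation worked, the remaining missing values per column can be counted; a minimal check:

# should return 0 for every column
colSums(is.na(df_complete))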
First the data set is split into a training and a test set, with the caret::createDataPartition() function. After that the rpart() function is used to construct a decision tree; the quality of the tree is assessed on the training data and on the test data.
set.seed(20190723)
train <- createDataPartition(df_complete$Status, p=.70, list = FALSE)
df_train <- df_complete[train,]
df_test <- df_complete[-train,]
rpart_tree <- rpart(Status ~ ., data = df_train)
pred_train <- predict(rpart_tree, df_train, type = "class")
pred_test <- predict(rpart_tree, df_test, type = "class")
cm_train <- confusionMatrix(pred_train, df_train$Status)
cm_test <- confusionMatrix(pred_test, df_test$Status)
train_characteristics <- c(cm_train$overall[1:2],
cm_train$byClass[1:2])
cm_train$table; train_characteristics
Reference
Prediction bad good
bad 376 153
good 502 2087
Accuracy Kappa Sensitivity Specificity
0.7899294 0.4094188 0.4282460 0.9316964
Figure 1. Confusion matrix and characteristics predictions with rpart tree on training data.
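Note that caret::confusionMatrix() treats the first factor level, 'bad', as the positive class, so the Sensitivity reported here is the fraction of bad loans the model detects. This can be reproduced from the table above; a minimal check:

# positive class used by confusionMatrix()
cm_train$positive            # "bad"
# sensitivity = correctly predicted 'bad' / all actual 'bad'
376 / (376 + 502)            # 0.4282...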
test_characteristics <- c(cm_test$overall[1:2],
cm_test$byClass[1:2])
cm_test$table; test_characteristics
Reference
Prediction bad good
bad 145 86
good 231 874
Accuracy Kappa Sensitivity Specificity
0.7627246 0.3353964 0.3856383 0.9104167
Figure 2. Confusion matrix and characteristics predictions with rpart tree on test data.
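The fitted tree itself can be inspected visually with the already loaded rpart.plot package. A minimal sketch; the type and extra settings are illustrative choices, not the only option:

# plot the fitted classification tree with class probabilities
# and the percentage of observations in each node
rpart.plot(rpart_tree, type = 2, extra = 104)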
The same partition into training and test data is used as in the previous section.
c50_tree <- C5.0(Status ~ ., data = df_train, trials = 1)
pred_train <- predict(c50_tree, df_train, type = "class")
pred_test <- predict(c50_tree, df_test, type = "class")
cm_train <- confusionMatrix(pred_train, df_train$Status)
cm_test <- confusionMatrix(pred_test, df_test$Status)
train_characteristics <- c(cm_train$overall[1:2],
cm_train$byClass[1:2])
cm_train$table; train_characteristics
Reference
Prediction bad good
bad 551 199
good 327 2041
Accuracy Kappa Sensitivity Specificity
0.8313021 0.5637077 0.6275626 0.9111607
Figure 3. Confusion matrix and characteristics predictions with C50 tree on training data.
test_characteristics <- c(cm_test$overall[1:2],
cm_test$byClass[1:2])
cm_test$table; test_characteristics
Reference
Prediction bad good
bad 175 133
good 201 827
Accuracy Kappa Sensitivity Specificity
0.7500000 0.3459121 0.4654255 0.8614583
Figure 4. Confusion matrix and characteristics predictions with C50 tree on test data.
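C5.0 also supports adaptive boosting through the trials argument; with trials = 1, as above, a single tree is fitted. A boosted variant is a one-line change (a sketch; the value 10 is illustrative):

# boosted C5.0 model: combine 10 trees
c50_boost <- C5.0(Status ~ ., data = df_train, trials = 10)
pred_test_boost <- predict(c50_boost, df_test, type = "class")
confusionMatrix(pred_test_boost, df_test$Status)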
[1] There are many other ways to deal with missing values; imputing the median is one of the simplest. Other methods use, among others, k-nearest neighbours or random forests.
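As an illustration of such an alternative, a k-nearest-neighbours imputation could be set up with the recipes package. A sketch, assuming a recent recipes version with step_impute_mode()/step_impute_knn(); the number of neighbours is illustrative:

library(recipes)
rec <- recipe(Status ~ ., data = df) %>%
  step_impute_mode(Home, Marital, Job) %>%              # categorical: most frequent level
  step_impute_knn(Income, Assets, Debt, neighbors = 5)  # numeric: knn imputation
df_knn <- prep(rec) %>% bake(new_data = NULL)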