library(tidyverse)
library(openxlsx)
library(kableExtra)
library(flextable)
library(lubridate)
library(ISLR) #Smarket data set
library(class) #knn function
library(caret)
library(C50)
library(gt)
library(Hmisc)
library(rpart)
library(rpart.plot)

df <- recipes::credit_data  # credit data from the recipes package (accessed via ::, so the package only needs to be installed, not loaded)
#options(scipen = 999)

1 Decision trees

A decision tree model can be used for both regression and classification problems, in a supervised as well as an unsupervised setting.

2 Decision tree in a supervised classification problem

2.1 Example

Data set: recipes::credit_data
Meaning of the variables: https://github.com/gastonstat/CreditScoring
The aim is to construct a model with which the variable ‘Status’ can be predicted based on the values of the other variables.

flextable(head(df)) %>% 
  add_header_lines("First six records of the data set with credit data") %>% 
  add_header_lines("Table 1") %>% 
  fontsize(size = 14, i = 1:2, part = 'header') %>% 
  italic(i = 2, part = 'header')

Table 1

First six records of the data set with credit data

Status  Seniority  Home   Time  Age  Marital  Records  Job        Expenses  Income  Assets  Debt  Amount  Price
good    9          rent   60    30   married  no       freelance  73        129     0       0     800     846
good    17         rent   60    58   widow    no       fixed      48        131     0       0     1000    1658
bad     10         owner  36    46   married  yes      freelance  90        200     3000    0     2000    2985
good    0          rent   60    24   single   no       fixed      63        182     2500    0     900     1325
good    0          rent   36    26   single   no       fixed      46        107     0       0     310     910
good    1          owner  60    36   married  no       fixed      75        214     3500    0     650     1645

2.1.1 Preprocessing

2.1.1.1 Missing values

Missing values can be a problem in an ML model. Table 2 gives an overview of the number of missing values; variables without missing values are omitted.

# count the NA's per variable and keep only the variables with missing values
MV <- sapply(df, function(x) sum(is.na(x)))
MV <- tibble(VARIABLE = names(MV), NR_NA = MV) %>% 
  filter(NR_NA != 0)
flextable(MV) %>% 
  add_header_lines(c(
    "Numbers of missing values in the credit data set",
    "Table 2")) %>% 
  fontsize(size = 14, i = 1:2, part = 'header') %>% 
  italic(i = 2, part = 'header') %>% 
  width(j = 1:2, 1.5) %>% 
  set_header_labels(values =
                      list(NR_NA = "MISSING VALUES"))

Table 2

Numbers of missing values in the credit data set

VARIABLE  MISSING VALUES
Home      6
Marital   1
Job       2
Income    381
Assets    47
Debt      18

To avoid problems with missing values, the NA's for the categorical variables are replaced by the value ‘unknown’; for the numeric variables the median value of the variable is imputed1, using the Hmisc::impute() function.

df_complete <- df %>% 
  mutate(Home = as.character(Home),
         Home = ifelse(is.na(Home), "unknown", Home),
         Home = factor(Home),
         Marital = as.character(Marital),
         Marital = ifelse(is.na(Marital), "unknown", Marital),
         Marital = factor(Marital),
         Job = as.character(Job),
         Job = ifelse(is.na(Job), "unknown", Job),
         Job = factor(Job),
         Income = as.numeric(impute(Income, median)),
         Assets = as.numeric(impute(Assets, median)),
         Debt = as.numeric(impute(Debt, median)))
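
The same preprocessing can be written more compactly with dplyr::across() and forcats::fct_explicit_na() (both loaded via the tidyverse above). A minimal sketch, assuming dplyr >= 1.0.0 and a forcats version in which fct_explicit_na() is still available:

# equivalent, more compact preprocessing (sketch)
df_complete <- df %>% 
  mutate(
    # replace NA's in the factor variables by an explicit level "unknown"
    across(c(Home, Marital, Job), ~ fct_explicit_na(.x, na_level = "unknown")),
    # impute the median for the numeric variables
    across(c(Income, Assets, Debt), ~ as.numeric(impute(.x, median)))
  )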

2.1.2 Decision trees with rpart package

First the data set is split into a training set and a test set with the caret::createDataPartition() function. After that the rpart() function is used to construct a decision tree; the quality of the tree is assessed on the training data and on the test data.

set.seed(20190723)
train <- createDataPartition(df_complete$Status, p=.70, list = FALSE)  
df_train <- df_complete[train,]
df_test <- df_complete[-train,]

rpart_tree <- rpart(Status ~ ., data = df_train)
pred_train <- predict(rpart_tree, df_train, type = "class")
pred_test <- predict(rpart_tree, df_test, type = "class")
cm_train <- confusionMatrix(pred_train, df_train$Status)
cm_test <- confusionMatrix(pred_test, df_test$Status)

train_characteristics <- c(cm_train$overall[1:2],
                          cm_train$byClass[1:2])
cm_train$table; train_characteristics
          Reference
Prediction  bad good
      bad   376  153
      good  502 2087
   Accuracy       Kappa Sensitivity Specificity 
  0.7899294   0.4094188   0.4282460   0.9316964 

Figure 1. Confusion matrix and characteristics of the predictions with the rpart tree on the training data.

test_characteristics <- c(cm_test$overall[1:2],
                          cm_test$byClass[1:2])
cm_test$table; test_characteristics
          Reference
Prediction bad good
      bad  145   86
      good 231  874
   Accuracy       Kappa Sensitivity Specificity 
  0.7627246   0.3353964   0.3856383   0.9104167 

Figure 2. Confusion matrix and characteristics of the predictions with the rpart tree on the test data.
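
The fitted tree itself can be visualised with the rpart.plot package that is loaded above. A minimal sketch; the type and extra settings are just one possible choice, not taken from the original text:

# plot the fitted rpart tree; extra = 104 shows the predicted class, the class
# probabilities and the percentage of observations in each node
rpart.plot(rpart_tree, type = 2, extra = 104)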

2.1.3 Decision trees with C50 package

The same partition into training and test data has been used as in the previous section.

c50_tree <- C5.0(Status ~ ., data = df_train, trials = 1)
pred_train <- predict(c50_tree, df_train, type = "class")
pred_test <- predict(c50_tree, df_test, type = "class")
cm_train <- confusionMatrix(pred_train, df_train$Status)
cm_test <- confusionMatrix(pred_test, df_test$Status)

train_characteristics <- c(cm_train$overall[1:2],
                          cm_train$byClass[1:2])
cm_train$table; train_characteristics
          Reference
Prediction  bad good
      bad   551  199
      good  327 2041
   Accuracy       Kappa Sensitivity Specificity 
  0.8313021   0.5637077   0.6275626   0.9111607 

Figure 3. Confusion matrix and characteristics of the predictions with the C50 tree on the training data.

test_characteristics <- c(cm_test$overall[1:2],
                          cm_test$byClass[1:2])
cm_test$table; test_characteristics
          Reference
Prediction bad good
      bad  175  133
      good 201  827
   Accuracy       Kappa Sensitivity Specificity 
  0.7500000   0.3459121   0.4654255   0.8614583 

Figure 4. Confusion matrix and characteristics of the predictions with the C50 tree on the test data.
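
The trials argument of C5.0() controls boosting: with trials = 1, as above, a single tree is grown. A minimal sketch of a boosted variant on the same training data; the number of trials is an arbitrary choice here:

# boosted C5.0 model with 10 boosting iterations
c50_boosted <- C5.0(Status ~ ., data = df_train, trials = 10)
pred_test <- predict(c50_boosted, df_test, type = "class")
confusionMatrix(pred_test, df_test$Status)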

2.1.4 Decision trees with caret package
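
A minimal sketch of how the same model could be fitted with caret::train(), tuning the rpart complexity parameter cp with 10-fold cross-validation; the resampling settings here are assumptions, not taken from the original text:

# fit an rpart tree via caret, tuning cp with 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
caret_tree <- train(Status ~ ., data = df_train,
                    method = "rpart",
                    trControl = ctrl,
                    tuneLength = 10)
pred_test <- predict(caret_tree, df_test)
confusionMatrix(pred_test, df_test$Status)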


1. There are many other options for dealing with missing values; imputing the median is one of the simplest methods. Other methods use, among others, k nearest neighbours or random forests.