library(tidyverse)
library(openxlsx)
library(kableExtra)
library(flextable)   # formatted tables
library(lubridate)
library(ISLR)        # Smarket data set
library(class)       # knn function
library(caret)       # createDataPartition(), confusionMatrix()
library(C50)         # C5.0 decision trees
library(gt)
library(Hmisc)       # impute()
library(rpart)       # rpart decision trees
library(rpart.plot)  # plotting rpart trees
df <- recipes::credit_data
#options(scipen = 999)
A decision tree is a supervised learning model that can be used for both regression and classification problems; here it is applied to a classification problem.
Data set: recipes::credit_data
Meaning of the variables: https://github.com/gastonstat/CreditScoring
The aim is to construct a model that predicts the variable 'Status' from the values of the other variables.
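Before modelling it is useful to inspect the class distribution of 'Status', since it determines the baseline accuracy a classifier has to beat. A minimal check:

# class distribution of the target variable
df %>%
  count(Status) %>%
  mutate(prop = n / sum(n))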
flextable(head(df)) %>%
add_header_lines("First six records of the data set with credit data") %>%
add_header_lines("Table 1") %>%
fontsize(size = 14, i = 1:2, part = 'header') %>%
italic(i = 2, part = 'header')
Table 1
First six records of the data set with credit data

Status | Seniority | Home | Time | Age | Marital | Records | Job | Expenses | Income | Assets | Debt | Amount | Price
good | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846
good | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658
bad | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985
good | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325
good | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910
good | 1 | owner | 60 | 36 | married | no | fixed | 75 | 214 | 3500 | 0 | 650 | 1645
Missing values can be a problem in an ML model. Table 2 gives an overview of the number of missing values; variables without missing values are omitted.
MV <- sapply(df, function(x) sum(is.na(x)))  # number of NAs per variable
MV <- tibble(VARIABLE = names(MV), NR_NA = MV) %>%
  filter(NR_NA != 0)
flextable(MV) %>%
add_header_lines(c(
"Numbers of missing values in the credit data set",
"Table 2")) %>%
fontsize(size = 14, i = 1:2, part = 'header') %>%
italic(i = 2, part = 'header') %>%
width(j = 1:2, 1.5) %>%
set_header_labels(values =
list(NR_NA = "MISSING VALUES"))
Table 2
Numbers of missing values in the credit data set

VARIABLE | MISSING VALUES
Home | 6
Marital | 1
Job | 2
Income | 381
Assets | 47
Debt | 18
To avoid problems with missing values, the NAs in the categorical variables are replaced by the value 'unknown'; for the numeric variables the median of the variable is imputed[1], using the Hmisc::impute() function.
df_complete <- df %>%
  mutate(# categorical variables: replace NA by the explicit level 'unknown'
         Home = as.character(Home),
         Home = ifelse(is.na(Home), "unknown", Home),
         Home = factor(Home),
         Marital = as.character(Marital),
         Marital = ifelse(is.na(Marital), "unknown", Marital),
         Marital = factor(Marital),
         Job = as.character(Job),
         Job = ifelse(is.na(Job), "unknown", Job),
         Job = factor(Job),
         # numeric variables: replace NA by the median
         Income = as.numeric(impute(Income, median)),
         Assets = as.numeric(impute(Assets, median)),
         Debt = as.numeric(impute(Debt, median)))
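As a quick verification that the imputation worked, the remaining missing values per column can be counted; a minimal check:

# should return 0 for every column
colSums(is.na(df_complete))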
First the data set is split into a training and a test set, with the caret::createDataPartition() function. After that the rpart() function is used to construct a decision tree; the quality of the tree is assessed on the training data and on the test data.
set.seed(20190723)
train <- createDataPartition(df_complete$Status, p=.70, list = FALSE)
df_train <- df_complete[train,]
df_test <- df_complete[-train,]
rpart_tree <- rpart(Status ~ ., data = df_train)
pred_train <- predict(rpart_tree, df_train, type = "class")
pred_test <- predict(rpart_tree, df_test, type = "class")
cm_train <- confusionMatrix(pred_train, df_train$Status)
cm_test <- confusionMatrix(pred_test, df_test$Status)
train_characteristics <- c(cm_train$overall[1:2],
cm_train$byClass[1:2])
cm_train$table; train_characteristics
Reference
Prediction bad good
bad 376 153
good 502 2087
Accuracy Kappa Sensitivity Specificity
0.7899294 0.4094188 0.4282460 0.9316964
Figure 1. Confusion matrix and characteristics predictions with rpart tree on training data.
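Note that caret::confusionMatrix() treats the first factor level, 'bad', as the positive class, so the Sensitivity reported here is the fraction of bad loans the model detects. This can be reproduced from the table above; a minimal check:

# positive class used by confusionMatrix()
cm_train$positive            # "bad"
# sensitivity = correctly predicted 'bad' / all actual 'bad'
376 / (376 + 502)            # 0.4282...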
test_characteristics <- c(cm_test$overall[1:2],
cm_test$byClass[1:2])
cm_test$table; test_characteristics
Reference
Prediction bad good
bad 145 86
good 231 874
Accuracy Kappa Sensitivity Specificity
0.7627246 0.3353964 0.3856383 0.9104167
Figure 2. Confusion matrix and characteristics predictions with rpart tree on test data.
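The fitted tree itself can be inspected visually with the already loaded rpart.plot package. A minimal sketch; the type and extra settings are illustrative choices, not the only option:

# plot the fitted classification tree with class probabilities
# and the percentage of observations in each node
rpart.plot(rpart_tree, type = 2, extra = 104)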
The same partition into training and test data is used as in the previous section.
c50_tree <- C5.0(Status ~ ., data = df_train, trials = 1)
pred_train <- predict(c50_tree, df_train, type = "class")
pred_test <- predict(c50_tree, df_test, type = "class")
cm_train <- confusionMatrix(pred_train, df_train$Status)
cm_test <- confusionMatrix(pred_test, df_test$Status)
train_characteristics <- c(cm_train$overall[1:2],
cm_train$byClass[1:2])
cm_train$table; train_characteristics
Reference
Prediction bad good
bad 551 199
good 327 2041
Accuracy Kappa Sensitivity Specificity
0.8313021 0.5637077 0.6275626 0.9111607
Figure 3. Confusion matrix and characteristics predictions with C50 tree on training data.
test_characteristics <- c(cm_test$overall[1:2],
cm_test$byClass[1:2])
cm_test$table; test_characteristics
Reference
Prediction bad good
bad 175 133
good 201 827
Accuracy Kappa Sensitivity Specificity
0.7500000 0.3459121 0.4654255 0.8614583
Figure 4. Confusion matrix and characteristics predictions with C50 tree on test data.
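C5.0 also supports adaptive boosting through the trials argument; with trials = 1, as above, a single tree is fitted. A boosted variant is a one-line change (a sketch; the value 10 is illustrative):

# boosted C5.0 model: combine 10 trees
c50_boost <- C5.0(Status ~ ., data = df_train, trials = 10)
pred_test_boost <- predict(c50_boost, df_test, type = "class")
confusionMatrix(pred_test_boost, df_test$Status)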
[1] There are many other ways to deal with missing values; imputing the median is one of the simplest. Other methods use, among others, k-nearest neighbours or random forests.
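As an illustration of such an alternative, a k-nearest-neighbours imputation could be set up with the recipes package. A sketch, assuming a recent recipes version with step_impute_mode()/step_impute_knn(); the number of neighbours is illustrative:

library(recipes)
rec <- recipe(Status ~ ., data = df) %>%
  step_impute_mode(Home, Marital, Job) %>%              # categorical: most frequent level
  step_impute_knn(Income, Assets, Debt, neighbors = 5)  # numeric: knn imputation
df_knn <- prep(rec) %>% bake(new_data = NULL)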