Decision Tree using caret in R.

Gabriel Miranda

Context:

The original dataset contains 1000 entries with 20 categorial/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes a credit by a bank. Each person is classified as good or bad credit risks according to the set of attributes. The link to the original dataset can be found below.

Content:

It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Thus, I wrote a small Python script to convert it into a readable CSV file. Several columns are simply ignored, because in my opinion either they are not important or their descriptions are obscure. The selected attributes are:

Age (numeric) Sex (text: male, female) Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled) Housing (text: own, rent, or free) Saving accounts (text - little, moderate, quite rich, rich) Checking account (numeric, in DM - Deutsch Mark) Credit amount (numeric, in DM) Duration (numeric, in month) Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)

Click here to download

# Packages used

require(mlr)
require(ggplot2)
require(dplyr)
require(ggcorrplot)
require(caret)

# Reading the database

df = read.csv('german_credit_risk_target.csv')
head(df, n = 10)

##    X Age    Sex Job Housing Saving.accounts Checking.account Credit.amount
## 1  0  67   male   2     own            <NA>           little          1169
## 2  1  22 female   2     own          little         moderate          5951
## 3  2  49   male   1     own          little             <NA>          2096
## 4  3  45   male   2    free          little           little          7882
## 5  4  53   male   2    free          little           little          4870
## 6  5  35   male   1    free            <NA>             <NA>          9055
## 7  6  53   male   2     own      quite rich             <NA>          2835
## 8  7  35   male   3    rent          little         moderate          6948
## 9  8  61   male   1     own            rich             <NA>          3059
## 10 9  28   male   3     own          little         moderate          5234
##    Duration             Purpose Risk
## 1         6            radio/TV good
## 2        48            radio/TV  bad
## 3        12           education good
## 4        42 furniture/equipment good
## 5        24                 car  bad
## 6        36           education good
## 7        24 furniture/equipment good
## 8        36                 car good
## 9        12            radio/TV good
## 10       30                 car  bad

Getting some information about the variables

str(df)

## 'data.frame':    1000 obs. of  11 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ Sex             : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
##  $ Job             : int  2 2 1 2 2 1 2 3 1 3 ...
##  $ Housing         : Factor w/ 3 levels "free","own","rent": 2 2 2 1 1 1 2 3 2 2 ...
##  $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: NA 1 1 1 1 NA 3 1 4 1 ...
##  $ Checking.account: Factor w/ 3 levels "little","moderate",..: 1 2 NA 1 1 NA NA 2 NA 2 ...
##  $ Credit.amount   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Purpose         : Factor w/ 8 levels "business","car",..: 6 6 4 5 2 4 5 2 6 2 ...
##  $ Risk            : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...

Look! The variable ‘Job’ is an int, but should be Factor. So we’re gonna turn it into Factor using the description offer by who posted the dataset on Kaggle.

df$Job = factor(df$Job, levels = c(0, 1, 2, 3), labels = c('unskilled and non-resident',
                                                           'unskilled and resident', 'skilled',
                                                           'highly skilled'))

str(df)

## 'data.frame':    1000 obs. of  11 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ Sex             : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
##  $ Job             : Factor w/ 4 levels "unskilled and non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ Housing         : Factor w/ 3 levels "free","own","rent": 2 2 2 1 1 1 2 3 2 2 ...
##  $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: NA 1 1 1 1 NA 3 1 4 1 ...
##  $ Checking.account: Factor w/ 3 levels "little","moderate",..: 1 2 NA 1 1 NA NA 2 NA 2 ...
##  $ Credit.amount   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Purpose         : Factor w/ 8 levels "business","car",..: 6 6 4 5 2 4 5 2 6 2 ...
##  $ Risk            : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...

Besides, we can see that ‘Saving.accounts’ and ‘Checking.account’ have soma NA’s, but we don’t know if those are the only one’s. Let’s check!

Verifying if there is any other variable with NA’s:

apply(df, 2, function(x) any(is.na(x)))

##                X              Age              Sex              Job 
##            FALSE            FALSE            FALSE            FALSE 
##          Housing  Saving.accounts Checking.account    Credit.amount 
##            FALSE             TRUE             TRUE            FALSE 
##         Duration          Purpose             Risk 
##            FALSE            FALSE            FALSE

See that the ‘Saving.accounts’ and ‘Checking.account’ columns are the only ones with NA’s, so we’re going to use a method to impute values on thoses NA’s values.

set.seed(100)
imputer = mlr::impute(df, target = "Risk",
                     cols = list(Saving.accounts = imputeLearner("classif.rpart"),
                                 Checking.account = imputeLearner("classif.rpart")))
                 
df = imputer$data
head(df, n = 10)

##    X Age    Sex                    Job Housing Saving.accounts Checking.account
## 1  0  67   male                skilled     own          little           little
## 2  1  22 female                skilled     own          little         moderate
## 3  2  49   male unskilled and resident     own          little           little
## 4  3  45   male                skilled    free          little           little
## 5  4  53   male                skilled    free          little           little
## 6  5  35   male unskilled and resident    free          little           little
## 7  6  53   male                skilled     own      quite rich           little
## 8  7  35   male         highly skilled    rent          little         moderate
## 9  8  61   male unskilled and resident     own            rich         moderate
## 10 9  28   male         highly skilled     own          little         moderate
##    Credit.amount Duration             Purpose Risk
## 1           1169        6            radio/TV good
## 2           5951       48            radio/TV  bad
## 3           2096       12           education good
## 4           7882       42 furniture/equipment good
## 5           4870       24                 car  bad
## 6           9055       36           education good
## 7           2835       24 furniture/equipment good
## 8           6948       36                 car good
## 9           3059       12            radio/TV good
## 10          5234       30                 car  bad

str(df)

## 'data.frame':    1000 obs. of  11 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ Sex             : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
##  $ Job             : Factor w/ 4 levels "unskilled and non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ Housing         : Factor w/ 3 levels "free","own","rent": 2 2 2 1 1 1 2 3 2 2 ...
##  $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: 1 1 1 1 1 1 3 1 4 1 ...
##  $ Checking.account: Factor w/ 3 levels "little","moderate",..: 1 2 1 1 1 1 1 2 2 2 ...
##  $ Credit.amount   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Purpose         : Factor w/ 8 levels "business","car",..: 6 6 4 5 2 4 5 2 6 2 ...
##  $ Risk            : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...

Now we’re gonna do some exploratory analysis to understand our dataset.

ggplot(data = df, aes(x = Job, fill = Risk)) + geom_bar(position='fill') +
  labs(y = 'Count', x = 'Fig. 1: Count of Bad and Good risk by kind of Job') + 
  theme(axis.text.x = element_text(angle=10, vjust=0.6)) + 
  scale_fill_manual(values=c('red', 'blue'))

ggplot(data = df, aes(x = Housing, fill = Risk)) + geom_bar(position='fill') +
  labs(y = 'Count', x = 'Fig. 2: Count of Bad and Good risk by Housing') + 
  scale_fill_manual(values=c('red', 'blue'))

ggplot(data = df, aes(x = Saving.accounts, fill = Risk)) + geom_bar(position='fill') +
  labs(y = 'Count', x = 'Fig. 3: Count of Bad and Good risk by Saving Accounts') + 
  scale_fill_manual(values=c('red', 'blue'))

We can see that ‘little’ and ‘moderate’ are very similar, as ‘quite rich’ and ‘rich’. So we can make two categories by four to simplify our model.

df$Saving.accounts =  as.factor(if_else(df$Saving.accounts == 'little' | df$Saving.accounts == 'moderate', 
                             'little/moderate', 'quite rich/rich'))

ggplot(data = df, aes(x = Saving.accounts, fill = Risk)) + geom_bar(position='fill') +
  labs(y = 'Count', x = 'Fig. 3: Count of Bad and Good risk by Saving Accounts') + 
  scale_fill_manual(values=c('red', 'blue'))

ggplot(data = df, aes(x = Checking.account, fill = Risk)) + geom_bar(position='fill') +
  labs(y = 'Count', x = 'Fig. 4: Count of Bad and Good risk by Checking Accounts') + 
  scale_fill_manual(values=c('red', 'blue'))

ggplot(data = df, aes(x = Purpose, fill = Risk)) + geom_bar(position=position_dodge()) +
  labs(y = 'Count', x = 'Fig. 5: Count of Bad and Good risk by Purpose') + 
  theme(axis.text.x = element_text(angle=50, vjust=0.6)) + ylim(0, 240) +
  scale_fill_manual(values=c('red', 'blue'))

bad_risk = df %>% filter(Risk == 'bad')

good_risk = df %>% filter(Risk == 'good')

ggplot(data = bad_risk, aes(x = Age)) + geom_histogram(fill = 'blue',
                                                       breaks = c(seq(15, 75, 5))) +
  labs(y = 'Count', x = 'Age', caption = 'Fig. 6: Ages of the people with bad risk distribution')

ggplot(data = good_risk, aes(x = Age)) + geom_histogram(fill = 'blue',
                                                        breaks = c(seq(15, 75, 5))) +
  labs(y = 'Count', x = 'Age', caption = 'Fig. 7: Ages of the people with good risk distribution')

ggplot(data = good_risk, aes(x = Age)) + geom_histogram(fill = 'blue',
                                                        breaks = c(seq(15, 75, 5))) +
  labs(y = 'Count', x = 'Duration', caption = 'Fig. 8: Duration with good risk distribution')

ggplot(data = bad_risk, aes(x = Age)) + geom_histogram(fill = 'blue',
                                                        breaks = c(seq(15, 75, 5))) +
  labs(y = 'Count', x = 'Duration', caption = 'Fig. 9: Duration with bad risk distribution')

ggplot(data = bad_risk, aes(x = Credit.amount)) + geom_histogram(fill = 'blue',
                                                        breaks = c(seq(250, 18424, 828))) +
  labs(y = 'Credit amount', x = 'Duration', caption = 'Fig. 10: Credit amount with bad risk distribution')

ggplot(data = good_risk, aes(x = Credit.amount)) + geom_histogram(fill = 'blue', breaks = c(seq(250, 18424, 828))) +
  labs(y = 'Credit amount', x = 'Duration', caption = 'Fig. 11: Credit amount with good risk distribution')

There is 700 people classified as ‘good risk’ and only 300 peole as ‘bad risk’, so it’s quite hard to compare and we will have problems to create our predictive model.

In Figure 1 we can see that the proportion of good and bad risk by kind of job are very similar, so we shouldn’t use it in our model;
In Figure 2, between free and rent categories, as the proportion as quite the same, they might not help us a lot, but own may help;
In Figure 3, as expected, the proportion of quite rich and rich as good risk is bigger than the others. So I united quite rich/rich and little/moderate;
In Figure 4, we can see different proportion for each class. It’ll help our prediction;
In Figure 5, some categories are very similar, but others are very distinct. We shall use this variable;
In Figure 6 and 7, we can see that both distribution are very similar. So this variable won’t be used;
In Figure 8 and 9, there are some difference in the distributions. So we will use it;
In Figure 10 and 11, the distributions are not very similar. So we might use this variable.

Let’s create out model!

df1 = select(df, -X ,-Age, -Job)

# Splitting train and test samples

set.seed(100)
inTrain = createDataPartition(df1$Risk, p = 0.7, list = F)

train = df[inTrain, ]
test = df[-inTrain, ]

We’ll create the same model with three different resampling methods and check what is better.

# Bootstrapp resampling method:
control = trainControl(method = 'boot', number = 25)

# Creating the model 1:
set.seed(80)
model1 = train(Risk ~ ., data = train, method = 'rpart', trControl = control)
model1

## CART 
## 
## 700 samples
##  10 predictor
##   2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 700, 700, 700, 700, 700, 700, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.01111111  0.6776081  0.15196227
##   0.02857143  0.6973115  0.11717655
##   0.04047619  0.7007476  0.09295091
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04047619.

# Cross-Validation resampling method:
control = trainControl(method = 'cv', number = 10)

# Creating model 2:
set.seed(80)
model2 = train(Risk ~ ., data = train, method = 'rpart', trControl = control)
model2

## CART 
## 
## 700 samples
##  10 predictor
##   2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 630, 630, 630, 630, 630, 630, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.01111111  0.7142857  0.19557444
##   0.02857143  0.7128571  0.13292981
##   0.04047619  0.6928571  0.05810254
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01111111.

# Repeated k-fold Cross-Validation resampling method:
control = trainControl(method = 'repeatedcv', number = 5, repeats = 10)

# Creating model 3:
set.seed(80)
model3 = train(Risk ~ ., data = train, method = 'rpart', trControl = control)
model3

## CART 
## 
## 700 samples
##  10 predictor
##   2 classes: 'bad', 'good' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 560, 560, 560, 560, 560, 560, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.01111111  0.6947143  0.17763394
##   0.02857143  0.7082857  0.13516447
##   0.04047619  0.6982857  0.06785114
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02857143.

# Testing the model 1:

prev = predict(model1, test)

confusionMatrix(prev, test$Risk)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction bad good
##       bad    0    0
##       good  90  210
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.6447, 0.7513)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.5284          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0             
##             Specificity : 1.0             
##          Pos Pred Value : NaN             
##          Neg Pred Value : 0.7             
##              Prevalence : 0.3             
##          Detection Rate : 0.0             
##    Detection Prevalence : 0.0             
##       Balanced Accuracy : 0.5             
##                                           
##        'Positive' Class : bad             
##

# Testing the model 2:

prev = predict(model2, test)

confusionMatrix(prev, test$Risk)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction bad good
##       bad   18   17
##       good  72  193
##                                           
##                Accuracy : 0.7033          
##                  95% CI : (0.6481, 0.7545)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.4782          
##                                           
##                   Kappa : 0.1442          
##                                           
##  Mcnemar's Test P-Value : 1.041e-08       
##                                           
##             Sensitivity : 0.2000          
##             Specificity : 0.9190          
##          Pos Pred Value : 0.5143          
##          Neg Pred Value : 0.7283          
##              Prevalence : 0.3000          
##          Detection Rate : 0.0600          
##    Detection Prevalence : 0.1167          
##       Balanced Accuracy : 0.5595          
##                                           
##        'Positive' Class : bad             
##

# Testing the model 3:

prev = predict(model3, test)

confusionMatrix(prev, test$Risk)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction bad good
##       bad   15   17
##       good  75  193
##                                          
##                Accuracy : 0.6933         
##                  95% CI : (0.6378, 0.745)
##     No Information Rate : 0.7            
##     P-Value [Acc > NIR] : 0.6264         
##                                          
##                   Kappa : 0.1051         
##                                          
##  Mcnemar's Test P-Value : 2.804e-09      
##                                          
##             Sensitivity : 0.1667         
##             Specificity : 0.9190         
##          Pos Pred Value : 0.4687         
##          Neg Pred Value : 0.7201         
##              Prevalence : 0.3000         
##          Detection Rate : 0.0500         
##    Detection Prevalence : 0.1067         
##       Balanced Accuracy : 0.5429         
##                                          
##        'Positive' Class : bad            
##

results = tibble(Models = c('Model 1', 'Model 2', 'Model 3' ), 
                 'Train Accuracy' = c(0.7007476, 0.7142857, 0.7082857),
                 'Test Accuracy' = c(0.7000, 0.7033, 0.6933))
results

## # A tibble: 3 x 3
##   Models  `Train Accuracy` `Test Accuracy`
##   <chr>              <dbl>           <dbl>
## 1 Model 1            0.701           0.7  
## 2 Model 2            0.714           0.703
## 3 Model 3            0.708           0.693

As we can see in the table above, the model 2 is the better model. Our model is great predicting ‘good risk’, but very bad predicting ‘bad risk’.

Obs: This is not a good dataset to create predictive models because it’s not balanced, most of the target are ‘good risk’. What would you do to improve de accuracy?