1 Classification 2

The classification methods we have learned before are Logistic Regression and the k-Nearest Neighbors algorithm; now we continue with three more methods: Naive Bayes, Decision Tree, and Random Forest.

As we saw earlier, unlike a regression model, whose target is numeric, classification is a method to predict a categorical target variable.

The details of the methods we would like to learn today are shown below:
(asset/Classification 2.png)

We would like to use three classification algorithms to predict the risk status of a bank loan. The variable default in the dataset indicates whether the applicant defaulted on the loan issued by the bank. The original dataset comes from: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

1.1 Preprocessing Steps

Load the libraries:

library(inspectdf)
library(e1071)        #Naive Bayes
library(partykit)     #Decision Tree
library(caret)        #Confusion Matrix , nearZeroVar
library(animation)    #Cross Validation K-Fold
library(randomForest)

Read the data

loan <- read.csv("data/loan.csv")
head(loan)
##   checking_balance months_loan_duration credit_history
## 1           < 0 DM                    6       critical
## 2       1 - 200 DM                   48           good
## 3          unknown                   12       critical
## 4           < 0 DM                   42           good
## 5           < 0 DM                   24           poor
## 6          unknown                   36           good
##                purpose amount savings_balance employment_duration
## 1 furniture/appliances   1169         unknown           > 7 years
## 2 furniture/appliances   5951        < 100 DM         1 - 4 years
## 3            education   2096        < 100 DM         4 - 7 years
## 4 furniture/appliances   7882        < 100 DM         4 - 7 years
## 5                  car   4870        < 100 DM         1 - 4 years
## 6            education   9055         unknown         1 - 4 years
##   percent_of_income years_at_residence age other_credit housing
## 1                 4                  4  67         none     own
## 2                 2                  2  22         none     own
## 3                 2                  3  49         none     own
## 4                 2                  4  45         none   other
## 5                 3                  4  53         none   other
## 6                 2                  4  35         none   other
##   existing_loans_count       job dependents phone default
## 1                    2   skilled          1   yes      no
## 2                    1   skilled          1    no     yes
## 3                    1 unskilled          2    no      no
## 4                    1   skilled          2    no      no
## 5                    2   skilled          2    no     yes
## 6                    1 unskilled          2   yes      no

Check the structure & inspect the data

str(loan)
## 'data.frame':    1000 obs. of  17 variables:
##  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
##  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
##  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
##  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
inspect_types(loan, show_plot = T)

## # A tibble: 2 x 4
##   type      cnt  pcnt col_name  
##   <chr>   <int> <dbl> <list>    
## 1 factor     10  58.8 <chr [10]>
## 2 integer     7  41.2 <chr [7]>
inspect_imb(loan, show_plot = T)

## # A tibble: 10 x 4
##    col_name            value                 pcnt   cnt
##    <chr>               <chr>                <dbl> <int>
##  1 other_credit        none                  81.4   139
##  2 housing             own                   71.3   108
##  3 default             no                    70     700
##  4 job                 skilled               63     148
##  5 savings_balance     < 100 DM              60.3   603
##  6 phone               no                    59.6   596
##  7 credit_history      good                  53     293
##  8 purpose             furniture/appliances  47.3    97
##  9 checking_balance    unknown               39.4   274
## 10 employment_duration 1 - 4 years           33.9   172
inspect_cat(loan, show_plot = T)

## # A tibble: 10 x 5
##    col_name              cnt common             common_pcnt levels         
##    <chr>               <int> <chr>                    <dbl> <list>         
##  1 checking_balance        4 unknown                   39.4 <tibble [4 x 2~
##  2 credit_history          5 good                      53   <tibble [5 x 2~
##  3 default                 2 no                        70   <tibble [2 x 2~
##  4 employment_duration     5 1 - 4 years               33.9 <tibble [5 x 2~
##  5 housing                 3 own                       71.3 <tibble [3 x 2~
##  6 job                     4 skilled                   63   <tibble [4 x 2~
##  7 other_credit            3 none                      81.4 <tibble [3 x 2~
##  8 phone                   2 no                        59.6 <tibble [2 x 2~
##  9 purpose                 6 furniture/applian~        47.3 <tibble [6 x 2~
## 10 savings_balance         5 < 100 DM                  60.3 <tibble [5 x 2~

2 Naive Bayes

This algorithm treats all the variables as independent of each other. Before creating the first model with the Naive Bayes algorithm, we split the data into 80% train and 20% test:

set.seed(200)
split_loan <- sample(nrow(loan), nrow(loan)*0.8)
loan.train <- loan[split_loan, ]
loan.test <- loan[-split_loan, ]

Create Naive Bayes model and predict:

Bmodel <- naiveBayes(default ~ ., loan.train)
Bprediction <- predict(Bmodel, loan.test)

Check the confusion matrix:

table("Prediction"=Bprediction, "Actual"=loan.test$default) #manual : Confusion Matrix
##           Actual
## Prediction  no yes
##        no  117  33
##        yes  17  33
sum(Bprediction == loan.test$default)/length(loan.test$default) # accuracy
## [1] 0.75

Use the caret library to compute the confusion matrix automatically:

confusionMatrix(Bprediction, loan.test$default, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  117  33
##        yes  17  33
##                                          
##                Accuracy : 0.75           
##                  95% CI : (0.684, 0.8084)
##     No Information Rate : 0.67           
##     P-Value [Acc > NIR] : 0.008752       
##                                          
##                   Kappa : 0.3976         
##                                          
##  Mcnemar's Test P-Value : 0.033895       
##                                          
##             Sensitivity : 0.5000         
##             Specificity : 0.8731         
##          Pos Pred Value : 0.6600         
##          Neg Pred Value : 0.7800         
##              Prevalence : 0.3300         
##          Detection Rate : 0.1650         
##    Detection Prevalence : 0.2500         
##       Balanced Accuracy : 0.6866         
##                                          
##        'Positive' Class : yes            
## 

Because the data seems imbalanced, we try up-sampling and analyze whether the new model gives us better accuracy and/or recall.

table(loan$default)
## 
##  no yes 
## 700 300
loan.train.up <-  upSample(x = loan.train[, -17], y = loan.train[, 17], yname = "default")
table(loan.train.up$default)
## 
##  no yes 
## 566 566
Bmodel_up <- naiveBayes(default ~ ., loan.train.up)
Bprediction_up <- predict(Bmodel_up, loan.test)

Then we check its confusion matrix:

confusionMatrix(Bprediction_up, loan.test$default, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  102  24
##        yes  32  42
##                                          
##                Accuracy : 0.72           
##                  95% CI : (0.6523, 0.781)
##     No Information Rate : 0.67           
##     P-Value [Acc > NIR] : 0.07513        
##                                          
##                   Kappa : 0.3857         
##                                          
##  Mcnemar's Test P-Value : 0.34957        
##                                          
##             Sensitivity : 0.6364         
##             Specificity : 0.7612         
##          Pos Pred Value : 0.5676         
##          Neg Pred Value : 0.8095         
##              Prevalence : 0.3300         
##          Detection Rate : 0.2100         
##    Detection Prevalence : 0.3700         
##       Balanced Accuracy : 0.6988         
##                                          
##        'Positive' Class : yes            
## 

From both confusion matrices, we can see that the better model comes from the data after the upSampling process: recall increases from 0.5 to 0.6364 (in this case the positive class is 'yes', i.e. the applicant defaults).
We are willing to tolerate predicting 'yes' when the actual value is 'no':
- We prefer to have more predictions of 'default' that turn out to be 'no default' in reality, rather than predicting 'no default' for applicants who actually default.
- We prefer to withhold a loan even from someone who could pay rather than give a loan to someone who cannot pay it back. The recall and precision behind this trade-off can also be computed by hand, as in the sketch below.
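
As a rough check, recall and precision for the positive class can be computed by hand from the confusion matrix. A minimal sketch using the up-sampled predictions already stored in Bprediction_up (the names cm, TP, FN, FP are ours, introduced only for illustration):

cm <- table("Prediction" = Bprediction_up, "Actual" = loan.test$default)

TP <- cm["yes", "yes"]        # predicted yes, actually yes
FN <- cm["no",  "yes"]        # predicted no,  actually yes
FP <- cm["yes", "no"]         # predicted yes, actually no

TP / (TP + FN)                # recall    = Sensitivity in confusionMatrix()
TP / (TP + FP)                # precision = Pos Pred Value in confusionMatrix()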

2.1 [Optional] ROC Curve

When we rely on accuracy alone, in certain cases (cases with skewed data) the model may describe only one class well. For example, suppose we predict fraud among 1000 customers, of whom 950 are not fraud and 50 are fraud; a model that always predicts 'not fraud' reaches 95% accuracy, yet its recall and precision for the fraud class are poor.

One solution to improve recall and precision is to tune the classification threshold. However, is our model 'robust' enough for us to adjust its threshold?

The ROC curve is used to indicate this robustness, i.e. how solidly the model separates the positive class across different thresholds.

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library (MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
Bprob <- predict(Bmodel, loan.test, type = "raw")

loan_df <- data.frame("Prediction" = Bprob, "Actual" = as.numeric(loan.test$default == "yes"))
loan_roc <- prediction(loan_df$Prediction.yes, loan_df$Actual)
plot(performance(loan_roc, "tpr", "fpr"), colorize = T)
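
If we also want the area under the curve (AUC) as a single number, ROCR's performance() can return it. A small sketch based on the same loan_roc object:

auc <- performance(loan_roc, "auc")
auc@y.values[[1]]   # values close to 1 indicate the model separates the classes well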

If we call print(Bmodel), the conditional probabilities for every variable will be shown.

print(Bmodel)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     no    yes 
## 0.7075 0.2925 
## 
## Conditional probabilities:
##      checking_balance
## Y         < 0 DM   > 200 DM 1 - 200 DM    unknown
##   no  0.19081272 0.07067138 0.23674912 0.50176678
##   yes 0.46153846 0.03418803 0.35470085 0.14957265
## 
##      months_loan_duration
## Y         [,1]     [,2]
##   no  19.00177 10.94580
##   yes 24.59829 12.79355
## 
##      credit_history
## Y       critical       good    perfect       poor  very good
##   no  0.33745583 0.52473498 0.02296820 0.08833922 0.02650177
##   yes 0.17094017 0.55128205 0.08547009 0.09401709 0.09829060
## 
##      purpose
## Y       business        car       car0  education furniture/appliances
##   no  0.08833922 0.32685512 0.01236749 0.05830389           0.49293286
##   yes 0.09829060 0.35470085 0.02136752 0.07264957           0.42307692
##      purpose
## Y     renovations
##   no   0.02120141
##   yes  0.02991453
## 
##      amount
## Y         [,1]     [,2]
##   no  3006.456 2424.933
##   yes 3869.248 3575.944
## 
##      savings_balance
## Y       < 100 DM  > 1000 DM 100 - 500 DM 500 - 1000 DM    unknown
##   no  0.55300353 0.05830389   0.10247350    0.07067138 0.21554770
##   yes 0.70940171 0.02136752   0.11111111    0.03846154 0.11965812
## 
##      employment_duration
## Y       < 1 year  > 7 years 1 - 4 years 4 - 7 years unemployed
##   no  0.14840989 0.26148410  0.33038869  0.20671378 0.05300353
##   yes 0.24358974 0.19230769  0.35042735  0.14102564 0.07264957
## 
##      percent_of_income
## Y         [,1]     [,2]
##   no  2.897527 1.127340
##   yes 3.153846 1.069398
## 
##      years_at_residence
## Y         [,1]     [,2]
##   no  2.816254 1.116957
##   yes 2.816239 1.094462
## 
##      age
## Y         [,1]     [,2]
##   no  35.89223 11.22548
##   yes 33.57692 10.80358
## 
##      other_credit
## Y           bank       none      store
##   no  0.11130742 0.84982332 0.03886926
##   yes 0.18803419 0.75641026 0.05555556
## 
##      housing
## Y          other        own       rent
##   no  0.08480565 0.75618375 0.15901060
##   yes 0.12820513 0.65384615 0.21794872
## 
##      existing_loans_count
## Y         [,1]      [,2]
##   no  1.422261 0.5891039
##   yes 1.346154 0.5440333
## 
##      job
## Y     management    skilled unemployed  unskilled
##   no  0.14487633 0.62897527 0.02120141 0.20494700
##   yes 0.17521368 0.61965812 0.01709402 0.18803419
## 
##      dependents
## Y         [,1]      [,2]
##   no  1.155477 0.3626794
##   yes 1.128205 0.3350347
## 
##      phone
## Y            no       yes
##   no  0.5759717 0.4240283
##   yes 0.6068376 0.3931624

From here, we can also see the trend of every variable:

freqtable.loan.up <- lapply(loan.train.up[,c(1:16)], table, loan.train.up[,17]) 

freqtable.bc <- lapply(freqtable.loan.up, t)
head(freqtable.bc)
## $checking_balance
##      
##       < 0 DM > 200 DM 1 - 200 DM unknown
##   no     108       40        134     284
##   yes    264       16        194      92
## 
## $months_loan_duration
##      
##         4   5   6   7   8   9  10  11  12  13  14  15  16  18  20  21  22
##   no    5   1  52   5   5  28  21   9 103   1   2  44   1  62   6  17   2
##   yes   0   0  25   0   1  31   1   0 103   0   2  19   3  63   0  15   0
##      
##        24  26  27  28  30  33  36  39  40  42  45  47  48  54  60
##   no  106   1   7   2  20   2  33   2   0   6   0   1  14   1   7
##   yes 108   0  12   1  34   0  70   3   2   6   1   0  50   3  13
## 
## $credit_history
##      
##       critical good perfect poor very good
##   no       191  297      13   50        15
##   yes      106  300      49   49        62
## 
## $purpose
##      
##       business car car0 education furniture/appliances renovations
##   no        50 185    7        33                  279          12
##   yes       63 194   14        42                  233          20
## 
## $amount
##      
##       250 276 338 339 362 368 385 392 426 433 448 454 518 522 585 590 601
##   no    1   1   1   1   1   1   1   1   1   0   0   1   1   1   1   1   1
##   yes   0   0   0   0   0   0   0   0   0   5   4   0   0   0   0   0   0
##      
##       609 618 625 626 629 639 640 652 654 660 666 674 683 684 691 697 700
##   no    1   1   1   0   1   0   1   1   0   1   1   0   1   0   0   0   1
##   yes   2   0   0   3   0   2   0   0   4   0   0   2   0   4   3   3   0
##      
##       701 707 708 709 717 727 730 731 741 745 750 754 759 763 766 776 783
##   no    2   1   1   1   2   1   1   1   0   0   0   1   0   1   0   1   1
##   yes   0   0   0   4   0   2   0   0   4   1   1   0   4   0   3   0   0
##      
##       797 802 806 841 846 860 866 874 888 894 900 902 907 909 915 918 926
##   no    0   1   1   1   1   1   1   1   0   1   0   0   1   1   0   0   1
##   yes   5   0   0   0   0   0   0   0   4   0   2   3   0   0   1   4   0
##      
##       929 930 931 932 936 937 947 950 951 958 959 975 976 1007 1028 1037
##   no    1   1   0   2   1   1   0   0   0   1   0   1   1    1    1    1
##   yes   0   0   2   0   0   0   1   3   1   0   1   0   1    0    0    0
##      
##       1038 1042 1047 1048 1050 1053 1055 1056 1068 1076 1082 1092 1098
##   no     1    0    1    1    1    1    1    0    1    1    1    1    1
##   yes    0    2    0    0    0    0    0    1    0    0    2    0    0
##      
##       1101 1103 1107 1108 1123 1126 1131 1136 1138 1149 1154 1158 1163
##   no     1    1    1    0    0    2    0    0    1    1    2    1    1
##   yes    0    0    0    4    1    0    2    2    0    0    0    0    0
##      
##       1164 1169 1175 1185 1188 1198 1199 1200 1203 1204 1207 1209 1216
##   no     1    2    1    1    0    0    1    1    1    1    0    0    0
##   yes    0    0    0    0    3    4    2    0    0    0    1    1    4
##      
##       1217 1221 1224 1225 1228 1231 1236 1237 1238 1239 1240 1244 1245
##   no     0    1    1    1    0    2    1    1    1    1    1    1    0
##   yes    2    0    0    0    2    0    0    1    0    0    0    0    2
##      
##       1246 1249 1258 1262 1264 1271 1274 1275 1278 1282 1283 1285 1287
##   no     0    1    3    3    0    0    0    2    1    0    1    0    2
##   yes    2    0    0    0    2    3    2    2    0    3    0    2    0
##      
##       1288 1289 1291 1295 1297 1299 1300 1301 1308 1309 1311 1313 1316
##   no     1    1    1    1    1    1    1    1    1    0    1    1    1
##   yes    0    0    0    2    0    0    0    0    0    1    0    0    0
##      
##       1318 1322 1323 1330 1333 1338 1343 1344 1345 1346 1347 1355 1358
##   no     1    1    1    1    0    1    1    1    0    1    1    0    0
##   yes    0    0    0    0    2    0    0    1    2    0    0    1    3
##      
##       1360 1361 1364 1371 1372 1376 1381 1382 1386 1391 1393 1402 1403
##   no     1    1    2    0    0    1    0    2    1    1    3    1    1
##   yes    0    0    0    2    2    0    2    0    3    0    0    0    0
##      
##       1409 1410 1412 1413 1414 1418 1422 1424 1437 1442 1444 1445 1449
##   no     2    1    1    2    1    1    0    2    0    0    1    1    1
##   yes    0    0    0    0    0    0    1    0    5    2    0    0    0
##      
##       1453 1455 1459 1469 1473 1474 1478 1484 1493 1494 1498 1501 1503
##   no     1    1    1    1    1    2    1    0    1    1    1    0    1
##   yes    0    0    0    0    0    0    4    1    0    0    0    1    0
##      
##       1505 1512 1514 1520 1521 1525 1526 1530 1533 1534 1538 1542 1543
##   no     1    0    1    1    1    1    1    0    1    0    1    1    1
##   yes    0    1    0    0    0    0    0    2    1    3    0    0    0
##      
##       1544 1546 1549 1552 1553 1554 1555 1559 1567 1568 1569 1572 1577
##   no     1    1    1    1    1    1    0    1    1    1    1    1    1
##   yes    0    2    0    0    2    0    5    0    0    0    0    0    0
##      
##       1591 1592 1595 1597 1602 1603 1647 1655 1657 1659 1670 1680 1721
##   no     1    1    1    2    1    1    0    1    1    0    0    1    1
##   yes    0    0    0    0    0    0    1    0    0    1    6    0    0
##      
##       1736 1740 1743 1747 1755 1766 1768 1778 1795 1800 1804 1808 1817
##   no     1    1    2    1    1    1    1    0    1    1    1    0    1
##   yes    0    0    0    0    0    0    0    2    0    0    0    3    0
##      
##       1819 1820 1823 1835 1837 1842 1845 1851 1858 1860 1864 1867 1872
##   no     0    1    0    0    0    0    1    1    1    1    0    1    1
##   yes    1    0    3    1    2    3    0    0    0    0    5    0    0
##      
##       1880 1881 1882 1884 1887 1893 1898 1901 1905 1908 1913 1919 1922
##   no     1    1    0    1    1    1    1    1    1    0    1    1    0
##   yes    0    0    2    0    0    0    0    0    0    3    0    1    3
##      
##       1924 1925 1927 1928 1934 1935 1936 1938 1940 1941 1943 1950 1953
##   no     1    1    1    0    1    0    1    0    2    1    0    1    0
##   yes    1    0    0    2    0    2    0    1    0    0    2    0    2
##      
##       1957 1961 1963 1965 1967 1977 1979 1980 1987 2002 2012 2022 2028
##   no     1    1    1    1    1    0    1    0    0    1    1    1    2
##   yes    0    0    0    0    0    1    0    2    2    0    0    0    0
##      
##       2030 2032 2039 2051 2058 2063 2064 2069 2073 2096 2108 2116 2118
##   no     1    1    0    1    1    1    0    1    1    1    1    1    1
##   yes    0    0    3    0    0    0    1    0    0    0    0    0    0
##      
##       2121 2132 2136 2142 2145 2150 2169 2171 2186 2197 2214 2221 2223
##   no     1    1    1    1    0    0    0    1    1    1    1    1    1
##   yes    0    0    0    0    4    2    3    0    0    0    0    0    0
##      
##       2225 2238 2241 2246 2247 2249 2251 2255 2273 2278 2279 2284 2288
##   no     0    1    2    0    1    1    1    1    1    0    1    1    1
##   yes    1    0    0    2    0    0    0    0    0    3    0    0    0
##      
##       2299 2301 2302 2303 2319 2323 2325 2326 2329 2331 2333 2337 2346
##   no     1    1    0    0    0    1    1    1    1    1    1    1    1
##   yes    0    0    2    2    3    0    0    0    0    0    0    0    0
##      
##       2353 2359 2360 2366 2384 2389 2390 2394 2397 2404 2406 2415 2424
##   no     1    0    1    1    0    1    1    1    0    1    0    1    1
##   yes    0    3    0    0    3    0    0    0    3    0    1    0    0
##      
##       2427 2439 2442 2462 2463 2476 2483 2503 2507 2511 2515 2520 2528
##   no     1    0    1    0    1    1    1    1    1    1    1    0    1
##   yes    0    1    0    5    0    0    0    0    0    0    0    2    0
##      
##       2569 2570 2576 2577 2578 2579 2580 2600 2603 2606 2611 2613 2625
##   no     1    0    1    1    1    0    0    0    1    1    1    1    0
##   yes    0    1    0    0    0    2    5    1    0    0    0    0    3
##      
##       2629 2631 2647 2659 2662 2670 2671 2675 2679 2687 2697 2708 2718
##   no     1    1    1    1    1    1    0    1    1    1    1    1    0
##   yes    0    0    0    0    0    0    2    0    0    0    0    0    2
##      
##       2728 2743 2745 2746 2753 2760 2764 2779 2782 2788 2799 2812 2820
##   no     1    1    1    0    1    1    1    1    1    1    1    1    0
##   yes    0    0    0    4    0    0    0    0    0    0    0    0    4
##      
##       2825 2835 2848 2859 2862 2872 2896 2899 2923 2924 2957 2969 2978
##   no     1    1    1    1    1    1    1    1    1    1    1    0    1
##   yes    0    0    0    0    0    0    0    0    0    0    0    3    0
##      
##       2993 2996 3001 3016 3017 3021 3029 3049 3051 3059 3060 3062 3069
##   no     1    0    1    1    2    1    1    1    0    1    0    1    1
##   yes    0    4    0    0    0    0    0    0    2    0    4    0    0
##      
##       3074 3077 3079 3092 3104 3105 3108 3114 3123 3124 3148 3149 3160
##   no     1    2    1    0    1    1    0    0    0    1    1    1    1
##   yes    0    0    0    2    0    0    7    2    3    0    0    0    0
##      
##       3161 3181 3186 3213 3229 3235 3244 3249 3331 3342 3343 3349 3357
##   no     0    1    1    1    1    1    1    1    1    1    1    0    1
##   yes    2    0    0    0    0    0    0    0    0    0    0    6    0
##      
##       3368 3378 3384 3386 3394 3398 3399 3416 3422 3430 3441 3446 3447
##   no     1    1    0    0    1    1    1    1    1    1    0    0    1
##   yes    0    0    4    2    0    0    0    0    0    0    2    1    0
##      
##       3448 3485 3488 3499 3509 3518 3527 3552 3556 3565 3566 3568 3573
##   no     1    1    1    0    1    1    1    0    1    1    1    1    1
##   yes    0    0    0    3    0    0    0    1    0    0    0    0    0
##      
##       3577 3590 3594 3595 3599 3612 3617 3620 3621 3622 3643 3650 3651
##   no     1    2    1    1    1    1    2    1    0    1    1    1    1
##   yes    0    0    0    0    0    0    0    0    2    0    0    0    0
##      
##       3652 3656 3676 3711 3749 3757 3758 3780 3804 3812 3832 3844 3850
##   no     1    1    1    1    1    1    1    1    0    1    2    0    1
##   yes    0    0    0    0    0    0    0    0    3    0    0    2    0
##      
##       3857 3863 3868 3878 3905 3914 3915 3939 3949 3959 3965 3972 3973
##   no     1    1    1    1    1    0    0    1    1    1    0    1    1
##   yes    0    0    0    0    0    2    4    0    0    0    2    0    0
##      
##       3976 3979 3990 4006 4020 4042 4057 4110 4113 4151 4153 4169 4210
##   no     1    1    1    0    1    1    0    0    0    1    0    1    0
##   yes    0    0    0    1    0    0    2    2    3    0    1    0    2
##      
##       4221 4241 4249 4272 4280 4297 4351 4370 4380 4439 4454 4455 4463
##   no     1    0    0    2    0    0    1    0    1    1    1    0    0
##   yes    0    3    5    0    3    1    0    2    0    0    0    2    1
##      
##       4526 4530 4583 4623 4657 4679 4712 4736 4771 4795 4796 4811 4817
##   no     1    1    1    0    1    1    1    0    1    1    1    1    0
##   yes    0    0    0    4    0    0    0    3    0    0    0    0    2
##      
##       4843 4870 4933 5003 5045 5084 5096 5117 5150 5179 5190 5234 5293
##   no     0    0    0    0    1    1    0    1    1    0    1    0    0
##   yes    1    2    3    1    0    0    2    0    0    1    0    1    5
##      
##       5302 5324 5371 5381 5433 5507 5511 5742 5743 5771 5800 5801 5804
##   no     1    1    1    1    1    1    1    1    1    1    1    1    1
##   yes    0    0    0    0    0    0    0    0    0    0    0    0    0
##      
##       5842 5848 5866 5943 5951 5954 5965 5998 6078 6110 6143 6148 6187
##   no     1    1    1    0    0    2    1    0    1    1    0    1    1
##   yes    0    0    0    5    2    0    0    2    0    0    5    0    0
##      
##       6199 6204 6229 6260 6288 6304 6313 6314 6331 6361 6403 6458 6468
##   no     0    1    0    1    0    1    1    1    0    1    1    0    1
##   yes    1    0    4    0    2    0    0    0    5    0    0    4    1
##      
##       6527 6568 6579 6614 6615 6742 6758 6761 6836 6850 6872 6967 6999
##   no     1    1    1    1    1    1    0    1    0    0    0    1    0
##   yes    0    0    0    0    0    0    2    0    3    2    4    0    4
##      
##       7057 7127 7166 7174 7228 7238 7253 7308 7374 7393 7408 7418 7472
##   no     1    0    1    0    1    1    1    1    1    1    0    1    1
##   yes    0    2    0    2    0    0    0    0    0    0    3    0    0
##      
##       7476 7485 7511 7582 7596 7678 7685 7721 7758 7763 7814 7824 7855
##   no     1    0    0    1    1    1    0    1    1    0    1    1    0
##   yes    0    3    3    0    0    0    1    0    0    2    0    0    2
##      
##       7865 7882 7966 8065 8072 8086 8133 8335 8358 8386 8471 8487 8588
##   no     0    1    1    0    1    0    1    0    1    0    1    1    1
##   yes    5    0    0    1    0    2    0    2    0    3    0    0    0
##      
##       8613 8648 8858 8947 8978 9034 9055 9157 9271 9277 9283 9398 9436
##   no     1    0    1    1    0    0    1    1    0    1    1    0    1
##   yes    0    3    0    0    2    2    0    0    2    0    0    2    0
##      
##       9572 9629 9857 9960 10127 10144 10222 10297 10366 10477 10623 10722
##   no     0    0    1    0     0     1     1     0     1     1     1     1
##   yes    2    4    0    1     4     0     0     4     0     0     0     0
##      
##       10875 10961 10974 11328 11590 11816 11938 11998 12169 12204 12389
##   no      1     0     0     0     0     0     0     0     1     1     0
##   yes     0     3     4     3     1     1     4     3     0     0     3
##      
##       12579 12680 12749 13756 14027 14421 14555 14782 14896 15653 15857
##   no      0     0     1     1     0     0     0     0     0     1     1
##   yes     2     1     0     0     2     3     6     3     2     0     0
##      
##       15945 18424
##   no      0     0
##   yes     3     1
## 
## $savings_balance
##      
##       < 100 DM > 1000 DM 100 - 500 DM 500 - 1000 DM unknown
##   no       313        33           58            40     122
##   yes      399        13           62            29      63

Advantages of Naive Bayes:
- Simple

Limitations of Naive Bayes:
- Can perform poorly on rare events or outliers (one mitigation, Laplace smoothing, is sketched below)
- The naive independence assumption is rarely true in real-world applications
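
Laplace smoothing adds a small pseudo-count to every category so that a level that never co-occurs with a class no longer produces a zero conditional probability. A minimal sketch using the laplace argument of e1071::naiveBayes() (Bmodel_laplace and Bpred_laplace are illustrative names):

Bmodel_laplace <- naiveBayes(default ~ ., data = loan.train, laplace = 1)  # add 1 pseudo-count per category
Bpred_laplace  <- predict(Bmodel_laplace, loan.test)
confusionMatrix(Bpred_laplace, loan.test$default, positive = "yes")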

3 Decision Tree

We would like to compare the Naive Bayes model with a Decision Tree. The target variable is 'default'.

Default <- ctree(default ~ ., loan)
plot(Default)

plot(Default, type="simple")

We have already defined the train and test sets; we will use the same seed and create the Decision Tree model.

Model for data ‘train’

Tmodel <- ctree(default ~ ., loan.train)
plot(Tmodel)

plot(Tmodel, type="simple")

We can see the tree's width (number of terminal nodes) and depth (number of levels).

Tmodel
## 
## Model formula:
## default ~ checking_balance + months_loan_duration + credit_history + 
##     purpose + amount + savings_balance + employment_duration + 
##     percent_of_income + years_at_residence + age + other_credit + 
##     housing + existing_loans_count + job + dependents + phone
## 
## Fitted party:
## [1] root
## |   [2] checking_balance < 0 DM, 1 - 200 DM
## |   |   [3] months_loan_duration <= 21
## |   |   |   [4] credit_history in critical, good, poor: no (n = 220, err = 29.5%)
## |   |   |   [5] credit_history in perfect, very good: yes (n = 25, err = 28.0%)
## |   |   [6] months_loan_duration > 21: yes (n = 188, err = 42.6%)
## |   [7] checking_balance > 200 DM, unknown
## |   |   [8] other_credit in bank: no (n = 42, err = 31.0%)
## |   |   [9] other_credit in none, store: no (n = 325, err = 9.2%)
## 
## Number of inner nodes:    4
## Number of terminal nodes: 5
width(Tmodel)
## [1] 5
depth(Tmodel)
## [1] 3

After obtaining the model from the 'train' data, we check its predictions on the 'test' data:

predict(Tmodel, head(loan.test[,-17]))
##   8  12  21  26  28  34 
## yes yes  no  no  no  no 
## Levels: no yes
Tpred <- predict(Tmodel, loan.test[,-17])

Then, we look at its confusion matrix:

caret::confusionMatrix(Tpred, loan.test[,17])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  111  37
##        yes  23  29
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.6314, 0.7626)
##     No Information Rate : 0.67            
##     P-Value [Acc > NIR] : 0.20488         
##                                           
##                   Kappa : 0.283           
##                                           
##  Mcnemar's Test P-Value : 0.09329         
##                                           
##             Sensitivity : 0.8284          
##             Specificity : 0.4394          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.5577          
##              Prevalence : 0.6700          
##          Detection Rate : 0.5550          
##    Detection Prevalence : 0.7400          
##       Balanced Accuracy : 0.6339          
##                                           
##        'Positive' Class : no              
## 

Compared with the Naive Bayes result, the Decision Tree's accuracy decreases, but the sensitivity improves from 0.6364 to 0.8284 (note that this confusion matrix uses 'no' as the positive class).

Using the data after up-sampling, create the model on the 'train' data:

Tmodel_up <- ctree(default ~ ., loan.train.up)
plot(Tmodel_up)

plot(Tmodel_up, type="simple")

width(Tmodel_up)
## [1] 18
depth(Tmodel_up)
## [1] 6
predict(Tmodel_up, head(loan.test[,-17]))
##   8  12  21  26  28  34 
## yes yes  no yes  no  no 
## Levels: no yes
Tpred_up <- predict(Tmodel_up, loan.test[,-17])

caret::confusionMatrix(Tpred_up, loan.test[,17])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  96  30
##        yes 38  36
##                                           
##                Accuracy : 0.66            
##                  95% CI : (0.5898, 0.7253)
##     No Information Rate : 0.67            
##     P-Value [Acc > NIR] : 0.6492          
##                                           
##                   Kappa : 0.2541          
##                                           
##  Mcnemar's Test P-Value : 0.3960          
##                                           
##             Sensitivity : 0.7164          
##             Specificity : 0.5455          
##          Pos Pred Value : 0.7619          
##          Neg Pred Value : 0.4865          
##              Prevalence : 0.6700          
##          Detection Rate : 0.4800          
##    Detection Prevalence : 0.6300          
##       Balanced Accuracy : 0.6309          
##                                           
##        'Positive' Class : no              
## 

If we want to build an ROC/AUC-style plot from our model, we can use type="prob" in the predict function.

predictAUC <- predict(Tmodel, loan.test[,-17], type="prob")
plot(predictAUC)

3.1 Decision Tree Consideration

A Decision Tree follows the concept of a "greedy algorithm": the structure is built top-down, looking for the best 'information gain' at each step, so it only finds the optimum at that moment (a local optimum), not necessarily the global optimum. This is similar to stepwise selection using step() in regression.

A Decision Tree depends heavily on the provided 'train' data: changes in this data can change the tree considerably. Decision trees also tend to overfit. To demonstrate this, we change two control parameters:

  • mincriterion: works as a "regulator" of tree depth; the smaller the value, the bigger the tree becomes. When mincriterion is 0.8, the p-value of a split must be lower than 0.2 for the tree to branch.

  • minsplit and minbucket: set these to 0 so the minimum-size criteria are always met and the tree does not stop splitting.

The result of the model:

Tmodel2 <- ctree(default ~ ., loan.train, control = ctree_control(mincriterion=0.005, minsplit=0, minbucket=0))
plot(Tmodel2)

And we get about 95% accuracy on the training sample, which is not really useful and is potentially misleading:

sum(predict(Tmodel2, loan.train[,-17])== loan.train$default)/nrow(loan.train)
## [1] 0.9525

In addition, every predictor in a Decision Tree is assumed to interact with the others. A Decision Tree is a suitable method when multicollinearity is present, and it is also good at handling outliers: because of its greedy character, when there is an erroneous input value the "purity" criterion compensates for it during splitting.

The result of Upsampling model:

Tmodel_up2 <- ctree(default ~ ., loan.train.up, control = ctree_control(mincriterion=0.005, minsplit=0, minbucket=0))
plot(Tmodel_up2)

sum(predict(Tmodel_up2, loan.train.up[,-17])== loan.train.up$default)/nrow(loan.train.up)
## [1] 1

The model built on the up-sampled data is overfit: it classifies its own training data perfectly, which is unlikely to carry over to new data.
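
To make the overfitting visible, we can compare the deep tree's accuracy on the data it was trained on with its accuracy on the held-out test set. A quick sketch (a large gap between the two numbers is the classic symptom of overfitting):

mean(predict(Tmodel_up2, loan.train.up[,-17]) == loan.train.up$default)  # accuracy on the training data
mean(predict(Tmodel_up2, loan.test[,-17]) == loan.test$default)          # accuracy on unseen test data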

Advantages of Decision Tree:
- High interpretability
- Can handle outliers
- Robust to multicollinearity

Limitations of Decision Tree:
- Tends to overfit the training data
- Predictors are assumed to interact with each other

4 Random Forest

Random Forest follows a similar principle to the Decision Tree, but it does not use all of the data and all of the variables in every tree. It randomly samples observations and variables for each tree and then combines the outputs. This reduces the bias that a single Decision Tree can suffer from and significantly increases predictive power. In a classification case, Random Forest classifies a new example using a voting mechanism, whereas for regression it simply takes the mean/average of the trees' outputs.
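
To see the voting in action, a plain randomForest object can report the fraction of trees that voted for each class. A small sketch (rf_demo is fitted here only for illustration; for a classification forest, type = "prob" returns these vote fractions):

set.seed(100)
rf_demo <- randomForest(default ~ ., data = loan.train)
head(predict(rf_demo, loan.test[,-17], type = "prob"))      # fraction of trees voting "no" / "yes"
head(predict(rf_demo, loan.test[,-17], type = "response"))  # final class = majority vote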

4.1 Random Forest to Predict Whether Someone Will Get the Loan or Not

First, we check the data again:

summary(loan$default)
##  no yes 
## 700 300
dim(loan)
## [1] 1000   17

We apply the nearZeroVar() function to eliminate features that have almost no variance; such features contribute very little or nothing to our model:

set.seed(101)
n0_var <- nearZeroVar(loan[,1:17])
loanRF <- loan[,-n0_var] # drop the near-zero-variance variables

dim(loanRF)
## [1] 1000    0

This does not work here: nearZeroVar() apparently flags no columns, so n0_var is an empty index vector, and subsetting with an empty negative index (loan[, -integer(0)]) drops every column in R. None of the features are near zero variance, so we keep them all and try another approach.
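
We can check this by asking nearZeroVar() for its diagnostics instead of only the offending column indices (saveMetrics = TRUE is part of caret; nzv_metrics is an illustrative name):

nzv_metrics <- nearZeroVar(loan, saveMetrics = TRUE)
head(nzv_metrics)       # freqRatio, percentUnique, zeroVar, nzv per column
sum(nzv_metrics$nzv)    # expected to be 0 here: no column is flagged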

Check the class proportions in the train and test sets:

prop.table(table(loan.train$default))
## 
##     no    yes 
## 0.7075 0.2925
prop.table(table(loan.test$default))
## 
##   no  yes 
## 0.67 0.33

4.2 Cross-Validation K-fold

We divide the dataset into k equally sized samples (the 'bins' or folds). One bin is set aside as the 'test' data and the remaining k-1 bins are used as the 'training' data. This process is repeated k times (the folds) so that every bin is used as the 'test' data exactly once; in this way every bin serves both as part of the 'train' set and as the 'test' set.

ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")
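
For intuition, the fold indices can also be built by hand with caret::createFolds(). A small sketch that runs a simple Naive Bayes model on each fold (folds, test_idx and fold_accuracy are illustrative names):

set.seed(417)
folds <- createFolds(loan.train$default, k = 5)   # list of 5 test-index vectors

fold_accuracy <- sapply(folds, function(test_idx) {
  fold_model <- naiveBayes(default ~ ., data = loan.train[-test_idx, ])   # train on k-1 folds
  fold_pred  <- predict(fold_model, loan.train[test_idx, ])               # test on the held-out fold
  mean(fold_pred == loan.train$default[test_idx])
})
fold_accuracy        # accuracy per fold
mean(fold_accuracy)  # cross-validated estimate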

Now the model will be created with 5-fold cross-validation, repeated 3 times:

The chunk below is set to eval=F because it is time-consuming. ctrl sets the k-fold validation parameters, while train() creates the model.

set.seed(417)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3) # 5-fold CV, 3 repetitions

loan_forest <- train(default ~ ., data=loan.train, method="rf", trControl = ctrl)

Show the model:

loan_forest
## Random Forest 
## 
## 800 samples
##  16 predictor
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 640, 640, 640, 640, 640, 640, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##    2    0.7120833  0.04323729
##   18    0.7575000  0.35546429
##   35    0.7475000  0.34494915
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.

The best mtry uses 18 variables, giving an accuracy of 0.7575. Of the three methods we have used on this case (Naive Bayes, Decision Tree, and Random Forest), this one gives the best accuracy.

mtry is the number of variables randomly sampled as candidates at each split when building the trees. From here, we can see that Random Forest tries several mtry values and keeps the best one. In a standard Decision Tree, by contrast, every variable is considered at each split in order to maximize "purity" or information gain. If we want to control which mtry values are tried, we can pass a tuneGrid, as in the sketch below.
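
A sketch of explicitly choosing the candidate mtry values that train() evaluates, reusing the ctrl object from the previous chunk (rf_grid and loan_forest_tuned are illustrative names, and the grid values are only an example):

set.seed(417)
rf_grid <- expand.grid(mtry = c(4, 8, 12, 18))
loan_forest_tuned <- train(default ~ ., data=loan.train, method="rf",
                           trControl = ctrl, tuneGrid = rf_grid)
loan_forest_tuned$bestTune   # the mtry value with the highest cross-validated accuracy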

plot(loan_forest)

Let's use our model to predict the 'test' data:

table(predict(loan_forest, loan.test[,-17]), loan.test[,17])
##      
##        no yes
##   no  119  37
##   yes  15  29
sum(predict(loan_forest, loan.test[,-17])==loan.test[,17])
## [1] 148
nrow(loan.test)
## [1] 200
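
For a summary consistent with the earlier models, the same caret::confusionMatrix() call can be used, again with "yes" as the positive class (RFpred is an illustrative name):

RFpred <- predict(loan_forest, loan.test[,-17])
confusionMatrix(RFpred, loan.test$default, positive = "yes")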

Check the important variables from the model:

varImp(loan_forest)
## rf variable importance
## 
##   only 20 most important variables shown (out of 35)
## 
##                                Overall
## amount                         100.000
## age                             70.093
## months_loan_duration            60.423
## checking_balanceunknown         54.479
## percent_of_income               24.885
## years_at_residence              23.268
## savings_balanceunknown          11.260
## phoneyes                        11.128
## credit_historyvery good         10.373
## existing_loans_count             9.820
## other_creditnone                 9.531
## purposecar                       9.396
## employment_duration1 - 4 years   8.844
## credit_historyperfect            8.668
## jobskilled                       8.610
## checking_balance1 - 200 DM       8.395
## purposefurniture/appliances      8.340
## employment_duration4 - 7 years   7.550
## housingown                       6.935
## credit_historygood               6.705

amount has an importance score of 100, meaning it is the most important variable in our model.

For a plain randomForest object, the default prediction call is predict(model, test, type="response"), and the type can be changed to "response", "prob", or "votes"; an example is shown with the randomForest model fitted further below.

Check the out-of-bag (OOB) error for every class by plotting it:

plot(loan_forest$finalModel) # plot the OOB error rates
legend("topright", colnames(loan_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

We can inspect the final model:

loan_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 18
## 
##         OOB estimate of  error rate: 24.5%
## Confusion matrix:
##      no yes class.error
## no  503  63   0.1113074
## yes 133 101   0.5683761

Finally, note that the model created above is the result of k-fold cross-validation. If we want to fit a plain Random Forest directly, we can do:

forest <- randomForest(default ~ ., loan.train)
# Show the tree size/number of nodes
hist(treesize(forest))
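
With this plain randomForest object we can also try the prediction types mentioned earlier. A short sketch:

head(predict(forest, loan.test[,-17], type = "response"))  # predicted class labels (the default)
head(predict(forest, loan.test[,-17], type = "prob"))      # per-class probabilities (vote fractions)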

Let us use the up-sampled data to check whether it increases the accuracy:

set.seed(417)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3) # 5-fold CV, 3 repetitions (method = "none" would skip cross-validation)

loan_forest_up <- train(default ~ ., data=loan.train.up, method="rf", trControl = ctrl)

loan_forest_up

plot(loan_forest_up)

table(predict(loan_forest_up, loan.test[,-17]), loan.test[,17])
sum(predict(loan_forest_up, loan.test[,-17])==loan.test[,17])
nrow(loan.test)

varImp(loan_forest_up)

plot(loan_forest_up$finalModel) # plot the OOB error rates
legend("topright", colnames(loan_forest_up$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

loan_forest_up$finalModel

forest_up  <- randomForest(default ~ ., loan.train.up)
hist(treesize(forest_up))

We still use mtry = 18, but after up-sampling the data the accuracy increases to 0.8639715, and the OOB error becomes smaller than before.

Advantages of Random Forest:
- Can help reduce bias and variance at the same time
- Automatic feature selection
- Has an internal cross-validation mechanism (out-of-bag error)

Limitations of Random Forest:
- Can be computationally and memory intensive

The conclusion is that imbalanced data can keep our model from correctly reflecting the data. Some of the resulting problems are:
1. Overfitting (the model represents the 'train' data very well but not the 'test' data)
2. Out-of-bag error - the percentage of wrong predictions

But we can improve the model using:
1. Up-sampling or down-sampling the data (a down-sampling sketch follows this list)
2. Laplace smoothing - adding a small count to categories that have a '0' value
3. Pruning
4. Boosting
5. Cross-validation using k-fold
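
Down-sampling, mentioned in point 1 above, works like upSample() but removes majority-class rows instead. A minimal sketch with caret::downSample() (loan.train.down is an illustrative name):

set.seed(200)
loan.train.down <- downSample(x = loan.train[, -17], y = loan.train[, 17], yname = "default")
table(loan.train.down$default)   # both classes now have the minority-class count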