Random Forest Analysis of Credit Data

Step 1: Collecting data

The data has been collected and ready to be analysed.

credit <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv")

Step 2: Exploring and preparing data

str(credit)

'data.frame':   1000 obs. of  17 variables:
 $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
 $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
 $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
 $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
 $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
 $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
 $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
 $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
 $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

The distribution of defaults:

plot(credit$default)

Summary statistics of the credit data:

summary(credit)

   checking_balance months_loan_duration   credit_history
 < 0 DM    :274     Min.   : 4.0         critical :293   
 > 200 DM  : 63     1st Qu.:12.0         good     :530   
 1 - 200 DM:269     Median :18.0         perfect  : 40   
 unknown   :394     Mean   :20.9         poor     : 88   
                    3rd Qu.:24.0         very good: 49   
                    Max.   :72.0                         
                 purpose        amount           savings_balance
 business            : 97   Min.   :  250   < 100 DM     :603   
 car                 :337   1st Qu.: 1366   > 1000 DM    : 48   
 car0                : 12   Median : 2320   100 - 500 DM :103   
 education           : 59   Mean   : 3271   500 - 1000 DM: 63   
 furniture/appliances:473   3rd Qu.: 3972   unknown      :183   
 renovations         : 22   Max.   :18424                       
  employment_duration percent_of_income years_at_residence
 < 1 year   :172      Min.   :1.000     Min.   :1.000     
 > 7 years  :253      1st Qu.:2.000     1st Qu.:2.000     
 1 - 4 years:339      Median :3.000     Median :3.000     
 4 - 7 years:174      Mean   :2.973     Mean   :2.845     
 unemployed : 62      3rd Qu.:4.000     3rd Qu.:4.000     
                      Max.   :4.000     Max.   :4.000     
      age        other_credit  housing    existing_loans_count
 Min.   :19.00   bank :139    other:108   Min.   :1.000       
 1st Qu.:27.00   none :814    own  :713   1st Qu.:1.000       
 Median :33.00   store: 47    rent :179   Median :1.000       
 Mean   :35.55                            Mean   :1.407       
 3rd Qu.:42.00                            3rd Qu.:2.000       
 Max.   :75.00                            Max.   :4.000       
         job        dependents    phone     default  
 management:148   Min.   :1.000   no :596   no :700  
 skilled   :630   1st Qu.:1.000   yes:404   yes:300  
 unemployed: 22   Median :1.000                      
 unskilled :200   Mean   :1.155                      
                  3rd Qu.:1.000                      
                  Max.   :2.000

Step 3: Training a model on the data

Set up trainning and test data sets:

indx = sample(1:nrow(credit), as.integer(0.9*nrow(credit)))
indx

  [1]   46  847  364  638  317  366  874  942  568   94  477  381
 [13]  374  781  733  130  535  190  490  399  607  623   50  105
 [25]  396  534  448   59   56  489  541  795  911  116  982  831
 [37]  771  228  469  708  206  940  124  158  593  744  570  952
 [49]  558  335  359  840  202   87  946   93  437  547  355   61
 [61]  959  892  666   52  312   28  168  739  990  170  973  601
 [73]  175  587  554  400  379  619  643  604  580  530  121  526
 [85]  756  307  698  513  736  864  897  286  956  315  981  562
 [97]  689  783  700  787   72  450  763  127  631  646  629  655
[109]  828  499  329  166  606  539  253  701  334  769  732  409
[121]   91  572  143  574  100  635  883   27  194  811  440  129
[133]  141  720  832  889  401  549  693  416  730  140  518  125
[145]  910  473  167  843  154    2  361  149  974  680  510  656
[157]  556  234  107  826  980  880  927  746  275  650  523  683
[169]  842   49  331  358  397  137  614  602  644  387  792  193
[181]  122  165  726  589  309  605  163  738  192  133   41  717
[193]  615  509  219   42  800  446  816  263  872    8  818  257
[205]  460  761  159  475  384  673  147  780  548  238  941  426
[217]  472    1  382  483  252  357  417  616  214  674  634  266
[229]    7  103  267  625  996  595  183  820  445  688   40   75
[241]  965  186  671  520  901  791  291    4  128  179  423  218
[253]  177  621   74  222  751  622  424    5  930  848  208  784
[265]   88  385  224  648  599  342  645  480  686  146  311  305
[277]  221  949  766   18  748  455  524  485   97  905  431  230
[289]  725  504  363  860  968  610  597   35  241   30  662  829
[301]  519  845  573  476  131  827  145  430  245  735  672  251
[313]  231  731  882  838  362  807  879  915  338  823   92  500
[325]  855   24  667   66  420  633  822  522  962  912  626  654
[337]  126  963  909  552   57  703  349  449  196  801  544  290
[349]  515  370  935  138  620   20  960  536  538  862  376  372
[361]  679  176  636  649  352  576  985  189  895  344  486  438
[373]  398  611  528  273  681  356   53  767  368  834   65  531
[385]  590  195  236  390  579  723  274  692  586  410   21  428
[397]  429  706  837  898  916   36  328  350  659  225  938   48
[409]  991  308  542  296  753   81  873  292   38  900  109  136
[421]  844  747  306  863  833  640  508   33  920  474  210  777
[433]  113  123  316   29  609  505  870  975  161   62  443  496
[445]  134  768  155  696   77  452  854  839  583  992  188  716
[457]  903  876  458  877  295  418  467  891  745  718  926  624
[469]  261  682   64  564  675  243  896  948  664  468  884  117
[481]  388  983  304  284  921  289  233  989  367  944  894  360
[493]  301  871  715  492  603  330  287  115  339  371  714  628
[505]  677  817   23  433  691  806  765   90  945  788  461  641
[517]  209  463  487  678  303  978  613  314  694  247   13  824
[529]  434  488  934  724  618  112  299  271  852  591  153  797
[541]  786  408  451  516  242  647  976  298  411  772  773  493
[553]  953  567  180  249  181  755  639  566  333  353  702  102
[565]  378  742   84  565  821  191  600  386  502  164  454  380
[577]  327  865  481  774  867  559   31  596  553    3  933  709
[589]  729  319  657   16  947  713  994  859  986  204  809   44
[601]  663  563  345  966  529  404   73  812  888  943  246  967
[613]  203  279  653  521  866  157  581  150  925  592   78  322
[625]  749  617  260   79   71  810  721  171  907  779  169  881
[637]  819  951  849  999  850  760  282  754  302  393  262  414
[649]  808  937  184  135  525  660  348   54  805  373   76  351
[661]  365  971  856  435  846  902  470  498  789  462  369  676
[673]  899  802  993  919  958  890  904  741  111  326  405  227
[685]  804  707  961  272  957   68   89  608  248  497  207   69
[697]  533   43   98  456   34  825  830  885  293  512  346  172
[709]  550  665  148   85  336  710  851  737  977  212  160  778
[721]  200  501  914  264  705  740   96  503  413  886  950  442
[733]  421  340  594  313   39  419  220  630  929  479   67  569
[745]  984  268  395  585 1000  278  814  506  998  637  704  785
[757]  875  582   58  281  444   55  571  217  223  277  507  728
[769]  240  162  764  987  517    6  412  457   12  803  120  813
[781]  858   80  343  235  906  969  743  494  684  310  300    9
[793]  540  762  383  712  857  577  114  119  775   83   63  197
[805]  687  321  537  465  258  466  893  427  997  391  584  695
[817]  551  924  324   37  394  658  406  612  532   10  734  484
[829]  139  432  972  794  269   17  255  917  651  439  685  750
[841]  560  979  403  174  213  578  588  815  782  229  995  198
[853]  557  546   22  453   47  280  799  106  265   70   82  323
[865]  215   51  722  759   32  514  464  441  932  887  110  955
[877]  668  151   11  187  835  244  232  836  144  142  922  118
[889]  276  347  104  711  642  752  270   25  254  283  402  908

credit_train = credit[indx,]
credit_test = credit[-indx,]
credit_train_labels = credit[indx,17]
credit_test_labels = credit[-indx,17]

Regression tree using rpart:

library(rpart)
m.rpart <- rpart(default ~ ., data = credit_train)

Get basic information about the tree:

m.rpart

n= 900 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

   1) root 900 264 no (0.70666667 0.29333333)  
     2) checking_balance=> 200 DM,unknown 417  55 no (0.86810552 0.13189448) *
     3) checking_balance=< 0 DM,1 - 200 DM 483 209 no (0.56728778 0.43271222)  
       6) months_loan_duration< 22.5 276  94 no (0.65942029 0.34057971)  
        12) credit_history=critical,good,poor 250  75 no (0.70000000 0.30000000)  
          24) months_loan_duration< 11.5 70  11 no (0.84285714 0.15714286) *
          25) months_loan_duration>=11.5 180  64 no (0.64444444 0.35555556)  
            50) amount>=1390.5 113  30 no (0.73451327 0.26548673) *
            51) amount< 1390.5 67  33 yes (0.49253731 0.50746269)  
             102) purpose=business,furniture/appliances 37  14 no (0.62162162 0.37837838) *
             103) purpose=car,education,renovations 30  10 yes (0.33333333 0.66666667) *
        13) credit_history=perfect,very good 26   7 yes (0.26923077 0.73076923) *
       7) months_loan_duration>=22.5 207  92 yes (0.44444444 0.55555556)  
        14) savings_balance=> 1000 DM,unknown 33   9 no (0.72727273 0.27272727) *
        15) savings_balance=< 100 DM,100 - 500 DM,500 - 1000 DM 174  68 yes (0.39080460 0.60919540)  
          30) months_loan_duration< 47.5 143  64 yes (0.44755245 0.55244755)  
            60) amount>=2313 112  55 no (0.50892857 0.49107143)  
             120) amount< 8015.5 90  39 no (0.56666667 0.43333333)  
               240) amount>=5332 17   1 no (0.94117647 0.05882353) *
               241) amount< 5332 73  35 yes (0.47945205 0.52054795)  
                 482) amount< 3962 51  22 no (0.56862745 0.43137255)  
                   964) job=management,unemployed 9   1 no (0.88888889 0.11111111) *
                   965) job=skilled,unskilled 42  21 no (0.50000000 0.50000000)  
                    1930) other_credit=bank,store 12   3 no (0.75000000 0.25000000) *
                    1931) other_credit=none 30  12 yes (0.40000000 0.60000000)  
                      3862) percent_of_income< 3.5 11   4 no (0.63636364 0.36363636) *
                      3863) percent_of_income>=3.5 19   5 yes (0.26315789 0.73684211) *
                 483) amount>=3962 22   6 yes (0.27272727 0.72727273) *
             121) amount>=8015.5 22   6 yes (0.27272727 0.72727273) *
            61) amount< 2313 31   7 yes (0.22580645 0.77419355) *
          31) months_loan_duration>=47.5 31   4 yes (0.12903226 0.87096774) *

Get more detailed information about the tree:

summary(m.rpart)

Call:
rpart(formula = default ~ ., data = credit_train)
  n= 900 

          CP nsplit rel error    xerror       xstd
1 0.04356061      0 1.0000000 1.0000000 0.05173749
2 0.01515152      4 0.8106061 0.8560606 0.04927867
3 0.01262626      9 0.7272727 0.9090909 0.05025189
4 0.01136364     12 0.6893939 0.8939394 0.04998214
5 0.01000000     15 0.6553030 0.8939394 0.04998214

Variable importance
    checking_balance               amount months_loan_duration 
                  29                   19                   16 
      credit_history      savings_balance              purpose 
                  11                   10                    4 
                 job                  age  employment_duration 
                   3                    2                    2 
   percent_of_income         other_credit              housing 
                   2                    2                    1 

Node number 1: 900 observations,    complexity param=0.04356061
  predicted class=no   expected loss=0.2933333  P(node) =1
    class counts:   636   264
   probabilities: 0.707 0.293 
  left son=2 (417 obs) right son=3 (483 obs)
  Primary splits:
      checking_balance     splits as  RLRL,        improve=40.50210, (0 missing)
      credit_history       splits as  LLRLR,       improve=15.58324, (0 missing)
      savings_balance      splits as  RLRLL,       improve=14.43356, (0 missing)
      months_loan_duration < 34.5    to the left,  improve=11.04520, (0 missing)
      amount               < 3913.5  to the left,  improve= 7.81936, (0 missing)
  Surrogate splits:
      savings_balance      splits as  RLRLL,       agree=0.620, adj=0.180, (0 split)
      credit_history       splits as  LRRRR,       agree=0.580, adj=0.094, (0 split)
      age                  < 30.5    to the right, agree=0.557, adj=0.043, (0 split)
      employment_duration  splits as  RLRRR,       agree=0.554, adj=0.038, (0 split)
      months_loan_duration < 6.5     to the left,  agree=0.551, adj=0.031, (0 split)

Node number 2: 417 observations
  predicted class=no   expected loss=0.1318945  P(node) =0.4633333
    class counts:   362    55
   probabilities: 0.868 0.132 

Node number 3: 483 observations,    complexity param=0.04356061
  predicted class=no   expected loss=0.4327122  P(node) =0.5366667
    class counts:   274   209
   probabilities: 0.567 0.433 
  left son=6 (276 obs) right son=7 (207 obs)
  Primary splits:
      months_loan_duration < 22.5    to the left,  improve=10.933060, (0 missing)
      credit_history       splits as  LLRLR,       improve= 8.439764, (0 missing)
      savings_balance      splits as  RLRLL,       improve= 7.670136, (0 missing)
      amount               < 8079    to the left,  improve= 4.644312, (0 missing)
      housing              splits as  RLR,         improve= 3.589430, (0 missing)
  Surrogate splits:
      amount         < 2805.5  to the left,  agree=0.754, adj=0.425, (0 split)
      credit_history splits as  LLRRR,       agree=0.613, adj=0.097, (0 split)
      purpose        splits as  RLRLLL,      agree=0.611, adj=0.092, (0 split)
      housing        splits as  RLL,         agree=0.602, adj=0.072, (0 split)
      job            splits as  RLLL,        agree=0.598, adj=0.063, (0 split)

Node number 6: 276 observations,    complexity param=0.04356061
  predicted class=no   expected loss=0.3405797  P(node) =0.3066667
    class counts:   182    94
   probabilities: 0.659 0.341 
  left son=12 (250 obs) right son=13 (26 obs)
  Primary splits:
      credit_history       splits as  LLRLR,       improve=8.740245, (0 missing)
      purpose              splits as  LRLRLR,      improve=4.553514, (0 missing)
      amount               < 1281.5  to the right, improve=4.203433, (0 missing)
      months_loan_duration < 11.5    to the left,  improve=3.547857, (0 missing)
      employment_duration  splits as  RRRLR,       improve=2.778280, (0 missing)

Node number 7: 207 observations,    complexity param=0.04356061
  predicted class=yes  expected loss=0.4444444  P(node) =0.23
    class counts:    92   115
   probabilities: 0.444 0.556 
  left son=14 (33 obs) right son=15 (174 obs)
  Primary splits:
      savings_balance      splits as  RLRRL,       improve=6.280738, (0 missing)
      amount               < 1370    to the right, improve=3.322222, (0 missing)
      age                  < 25.5    to the right, improve=2.917191, (0 missing)
      months_loan_duration < 43.5    to the left,  improve=2.782716, (0 missing)
      credit_history       splits as  LRRLR,       improve=2.178017, (0 missing)

Node number 12: 250 observations,    complexity param=0.01262626
  predicted class=no   expected loss=0.3  P(node) =0.2777778
    class counts:   175    75
   probabilities: 0.700 0.300 
  left son=24 (70 obs) right son=25 (180 obs)
  Primary splits:
      months_loan_duration < 11.5    to the left,  improve=3.968254, (0 missing)
      amount               < 1281.5  to the right, improve=3.472222, (0 missing)
      purpose              splits as  LLLRLL,      improve=3.389356, (0 missing)
      employment_duration  splits as  RRRLR,       improve=3.198198, (0 missing)
      credit_history       splits as  LR-L-,       improve=2.641147, (0 missing)
  Surrogate splits:
      amount < 527.5   to the left,  agree=0.744, adj=0.086, (0 split)
      age    < 69      to the right, agree=0.728, adj=0.029, (0 split)

Node number 13: 26 observations
  predicted class=yes  expected loss=0.2692308  P(node) =0.02888889
    class counts:     7    19
   probabilities: 0.269 0.731 

Node number 14: 33 observations
  predicted class=no   expected loss=0.2727273  P(node) =0.03666667
    class counts:    24     9
   probabilities: 0.727 0.273 

Node number 15: 174 observations,    complexity param=0.01515152
  predicted class=yes  expected loss=0.3908046  P(node) =0.1933333
    class counts:    68   106
   probabilities: 0.391 0.609 
  left son=30 (143 obs) right son=31 (31 obs)
  Primary splits:
      months_loan_duration < 47.5    to the left,  improve=5.169546, (0 missing)
      amount               < 1368.5  to the right, improve=2.561418, (0 missing)
      percent_of_income    < 2.5     to the left,  improve=1.993194, (0 missing)
      age                  < 26.5    to the right, improve=1.951483, (0 missing)
      employment_duration  splits as  RLLLL,       improve=1.799850, (0 missing)
  Surrogate splits:
      amount < 13319.5 to the left,  agree=0.839, adj=0.097, (0 split)

Node number 24: 70 observations
  predicted class=no   expected loss=0.1571429  P(node) =0.07777778
    class counts:    59    11
   probabilities: 0.843 0.157 

Node number 25: 180 observations,    complexity param=0.01262626
  predicted class=no   expected loss=0.3555556  P(node) =0.2
    class counts:   116    64
   probabilities: 0.644 0.356 
  left son=50 (113 obs) right son=51 (67 obs)
  Primary splits:
      amount              < 1390.5  to the right, improve=4.925555, (0 missing)
      credit_history      splits as  LR-L-,       improve=2.688889, (0 missing)
      checking_balance    splits as  R-L-,        improve=2.557172, (0 missing)
      employment_duration splits as  RRRLR,       improve=2.472905, (0 missing)
      purpose             splits as  LRLRRR,      improve=1.866938, (0 missing)
  Surrogate splits:
      job                  splits as  LLLR,        agree=0.683, adj=0.149, (0 split)
      months_loan_duration < 12.5    to the right, agree=0.667, adj=0.104, (0 split)
      purpose              splits as  LLLRLL,      agree=0.656, adj=0.075, (0 split)
      percent_of_income    < 3.5     to the left,  agree=0.650, adj=0.060, (0 split)
      age                  < 21.5    to the right, agree=0.650, adj=0.060, (0 split)

Node number 30: 143 observations,    complexity param=0.01515152
  predicted class=yes  expected loss=0.4475524  P(node) =0.1588889
    class counts:    64    79
   probabilities: 0.448 0.552 
  left son=60 (112 obs) right son=61 (31 obs)
  Primary splits:
      amount              < 2313    to the right, improve=3.892434, (0 missing)
      employment_duration splits as  RLLLL,       improve=2.842319, (0 missing)
      age                 < 57.5    to the right, improve=2.469589, (0 missing)
      percent_of_income   < 3.5     to the left,  improve=2.405927, (0 missing)
      housing             splits as  LLR,         improve=1.119347, (0 missing)

Node number 31: 31 observations
  predicted class=yes  expected loss=0.1290323  P(node) =0.03444444
    class counts:     4    27
   probabilities: 0.129 0.871 

Node number 50: 113 observations
  predicted class=no   expected loss=0.2654867  P(node) =0.1255556
    class counts:    83    30
   probabilities: 0.735 0.265 

Node number 51: 67 observations,    complexity param=0.01262626
  predicted class=yes  expected loss=0.4925373  P(node) =0.07444444
    class counts:    33    34
   probabilities: 0.493 0.507 
  left son=102 (37 obs) right son=103 (30 obs)
  Primary splits:
      purpose              splits as  LR-RLR,      improve=2.753799, (0 missing)
      existing_loans_count < 1.5     to the right, improve=2.596846, (0 missing)
      years_at_residence   < 3.5     to the right, improve=1.785027, (0 missing)
      credit_history       splits as  LR-L-,       improve=1.740516, (0 missing)
      job                  splits as  LRRL,        improve=1.679470, (0 missing)
  Surrogate splits:
      savings_balance splits as  LRLLR,       agree=0.612, adj=0.133, (0 split)
      amount          < 1163    to the left,  agree=0.597, adj=0.100, (0 split)
      age             < 34      to the left,  agree=0.597, adj=0.100, (0 split)
      housing         splits as  RLL,         agree=0.597, adj=0.100, (0 split)
      job             splits as  LLRL,        agree=0.597, adj=0.100, (0 split)

Node number 60: 112 observations,    complexity param=0.01515152
  predicted class=no   expected loss=0.4910714  P(node) =0.1244444
    class counts:    57    55
   probabilities: 0.509 0.491 
  left son=120 (90 obs) right son=121 (22 obs)
  Primary splits:
      amount              < 8015.5  to the left,  improve=3.054870, (0 missing)
      other_credit        splits as  RRL,         improve=1.810714, (0 missing)
      age                 < 29.5    to the right, improve=1.774614, (0 missing)
      employment_duration splits as  RLLLL,       improve=1.502416, (0 missing)
      percent_of_income   < 2.5     to the left,  improve=1.459793, (0 missing)
  Surrogate splits:
      purpose              splits as  LLRRLL,      agree=0.857, adj=0.273, (0 split)
      existing_loans_count < 2.5     to the left,  agree=0.812, adj=0.045, (0 split)

Node number 61: 31 observations
  predicted class=yes  expected loss=0.2258065  P(node) =0.03444444
    class counts:     7    24
   probabilities: 0.226 0.774 

Node number 102: 37 observations
  predicted class=no   expected loss=0.3783784  P(node) =0.04111111
    class counts:    23    14
   probabilities: 0.622 0.378 

Node number 103: 30 observations
  predicted class=yes  expected loss=0.3333333  P(node) =0.03333333
    class counts:    10    20
   probabilities: 0.333 0.667 

Node number 120: 90 observations,    complexity param=0.01515152
  predicted class=no   expected loss=0.4333333  P(node) =0.1
    class counts:    51    39
   probabilities: 0.567 0.433 
  left son=240 (17 obs) right son=241 (73 obs)
  Primary splits:
      amount              < 5332    to the right, improve=5.879291, (0 missing)
      percent_of_income   < 2.5     to the left,  improve=3.341254, (0 missing)
      job                 splits as  LRLR,        improve=3.200000, (0 missing)
      employment_duration splits as  RLLLL,       improve=1.893106, (0 missing)
      purpose             splits as  LL--RR,      improve=1.690000, (0 missing)

Node number 121: 22 observations
  predicted class=yes  expected loss=0.2727273  P(node) =0.02444444
    class counts:     6    16
   probabilities: 0.273 0.727 

Node number 240: 17 observations
  predicted class=no   expected loss=0.05882353  P(node) =0.01888889
    class counts:    16     1
   probabilities: 0.941 0.059 

Node number 241: 73 observations,    complexity param=0.01515152
  predicted class=yes  expected loss=0.4794521  P(node) =0.08111111
    class counts:    35    38
   probabilities: 0.479 0.521 
  left son=482 (51 obs) right son=483 (22 obs)
  Primary splits:
      amount              < 3962    to the left,  improve=2.691476, (0 missing)
      job                 splits as  LLLR,        improve=2.294661, (0 missing)
      percent_of_income   < 2.5     to the left,  improve=1.514887, (0 missing)
      other_credit        splits as  LRL,         improve=1.188843, (0 missing)
      employment_duration splits as  RLLLL,       improve=1.142304, (0 missing)
  Surrogate splits:
      credit_history       splits as  RLRLL,       agree=0.767, adj=0.227, (0 split)
      existing_loans_count < 1.5     to the left,  agree=0.753, adj=0.182, (0 split)
      months_loan_duration < 37.5    to the left,  agree=0.740, adj=0.136, (0 split)

Node number 482: 51 observations,    complexity param=0.01136364
  predicted class=no   expected loss=0.4313725  P(node) =0.05666667
    class counts:    29    22
   probabilities: 0.569 0.431 
  left son=964 (9 obs) right son=965 (42 obs)
  Primary splits:
      job               splits as  LRLR,        improve=2.241830, (0 missing)
      other_credit      splits as  LRL,         improve=1.818836, (0 missing)
      percent_of_income < 2.5     to the left,  improve=1.533894, (0 missing)
      age               < 42.5    to the right, improve=1.350777, (0 missing)
      housing           splits as  LLR,         improve=1.181551, (0 missing)
  Surrogate splits:
      employment_duration splits as  RRRRL,       agree=0.863, adj=0.222, (0 split)
      amount              < 3844.5  to the right, agree=0.843, adj=0.111, (0 split)

Node number 483: 22 observations
  predicted class=yes  expected loss=0.2727273  P(node) =0.02444444
    class counts:     6    16
   probabilities: 0.273 0.727 

Node number 964: 9 observations
  predicted class=no   expected loss=0.1111111  P(node) =0.01
    class counts:     8     1
   probabilities: 0.889 0.111 

Node number 965: 42 observations,    complexity param=0.01136364
  predicted class=no   expected loss=0.5  P(node) =0.04666667
    class counts:    21    21
   probabilities: 0.500 0.500 
  left son=1930 (12 obs) right son=1931 (30 obs)
  Primary splits:
      other_credit         splits as  LRL,         improve=2.1000000, (0 missing)
      percent_of_income    < 2.5     to the left,  improve=1.3925730, (0 missing)
      months_loan_duration < 25.5    to the left,  improve=1.1904760, (0 missing)
      employment_duration  splits as  RLLLR,       improve=0.9333333, (0 missing)
      age                  < 36.5    to the right, improve=0.9333333, (0 missing)
  Surrogate splits:
      credit_history      splits as  RRRRL,       agree=0.810, adj=0.333, (0 split)
      employment_duration splits as  RRRLR,       agree=0.738, adj=0.083, (0 split)
      age                 < 56.5    to the right, agree=0.738, adj=0.083, (0 split)

Node number 1930: 12 observations
  predicted class=no   expected loss=0.25  P(node) =0.01333333
    class counts:     9     3
   probabilities: 0.750 0.250 

Node number 1931: 30 observations,    complexity param=0.01136364
  predicted class=yes  expected loss=0.4  P(node) =0.03333333
    class counts:    12    18
   probabilities: 0.400 0.600 
  left son=3862 (11 obs) right son=3863 (19 obs)
  Primary splits:
      percent_of_income   < 3.5     to the left,  improve=1.9406700, (0 missing)
      employment_duration splits as  RLRLR,       improve=1.8037270, (0 missing)
      age                 < 23.5    to the right, improve=1.2074530, (0 missing)
      amount              < 3088.5  to the left,  improve=0.9000000, (0 missing)
      housing             splits as  LLR,         improve=0.8126984, (0 missing)
  Surrogate splits:
      age                 < 22.5    to the left,  agree=0.700, adj=0.182, (0 split)
      amount              < 2371.5  to the left,  agree=0.667, adj=0.091, (0 split)
      employment_duration splits as  RRRRL,       agree=0.667, adj=0.091, (0 split)

Node number 3862: 11 observations
  predicted class=no   expected loss=0.3636364  P(node) =0.01222222
    class counts:     7     4
   probabilities: 0.636 0.364 

Node number 3863: 19 observations
  predicted class=yes  expected loss=0.2631579  P(node) =0.02111111
    class counts:     5    14
   probabilities: 0.263 0.737

Use the rpart.plot package to create a visualization:

library(rpart.plot)
# a basic decision tree diagram
rpart.plot(m.rpart, digits = 3)

A few adjustments to the diagram:

rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

Step 4: Check model performance

Generate predictions for the testing dataset:

p.rpart <- predict(m.rpart, credit_test)

Compare the distribution of predicted values vs. actual values:

summary(p.rpart)

       no              yes         
 Min.   :0.1290   Min.   :0.05882  
 1st Qu.:0.6327   1st Qu.:0.13189  
 Median :0.8429   Median :0.15714  
 Mean   :0.6921   Mean   :0.30793  
 3rd Qu.:0.8681   3rd Qu.:0.36732  
 Max.   :0.9412   Max.   :0.87097

summary(credit_test$default)

 no yes 
 64  36

Compare the correlation between predicted and actual default:

cor(p.rpart, as.integer(credit_test$default))

          [,1]
no  -0.4200798
yes  0.4200798

A correlation of 0.42 is certainly acceptable. However, the correlation only measures how strongly the predictions are related to the true value; it is not a measure of how far off the predictions were from the true values.

Another way to think about the model’s performance is to consider how far, on average, its prediction was from the true value. This measurement is called themean absolute error (MAE). The equation for MAE is as follows, where n indicates the number of predictions and ei indicates the error for prediction i: Function to calculate the mean absolute error:

MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}

Mean absolute error between predicted and actual values:

MAE(p.rpart, as.numeric(credit_test$default))

[1] 0.86

This implies that, on average, the difference between our model’s predictions and the true quality score was about 0.86. On a quality scale from zero to 10, this seems to suggest that our model is doing fairly well.

Mean absolute error between actual values and mean value:

mean(as.numeric(credit_train$default))

[1] 1.293333

MAE(1.30, as.numeric(credit_train$default))

[1] 0.4173333

step 5: Improve model performance

Using RWeka train a M5’ model tree: RWeka does not work on my mac, therefore, only the code is given.

library(RWeka)
m.m5p <- M5P(quality ~ ., data = wine_train)
m.m5p
summary(m.m5p)

#we'll look at how well the model performs on the unseen test data. The
# predict() function gets us a vector of predicted values:
p.m5p <- predict(m.m5p, wine_test)
summary(p.m5p)
cor(p.m5p, wine_test$quality)
MAE(wine_test$quality, p.m5p)

LS0tCnRpdGxlOiAiUmFuZG9tIEZvcmVzdCBBbmFseXNpcyBvZiBDcmVkaXQgRGF0YSIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQojIyBTdGVwIDE6IENvbGxlY3RpbmcgZGF0YQpUaGUgZGF0YSBoYXMgYmVlbiBjb2xsZWN0ZWQgYW5kIHJlYWR5IHRvIGJlIGFuYWx5c2VkLgpgYGB7cn0KY3JlZGl0IDwtIHJlYWQuY3N2KCJodHRwOi8vd3d3LnNjaS5jc3VlYXN0YmF5LmVkdS9+ZXN1ZXNzL2NsYXNzZXMvU3RhdGlzdGljc182NjIwL1ByZXNlbnRhdGlvbnMvbWw3L2NyZWRpdC5jc3YiKQpgYGAKCiMjIFN0ZXAgMjogRXhwbG9yaW5nIGFuZCBwcmVwYXJpbmcgZGF0YQpgYGB7cn0Kc3RyKGNyZWRpdCkKYGBgCgpUaGUgZGlzdHJpYnV0aW9uIG9mIGRlZmF1bHRzOgpgYGB7cn0KcGxvdChjcmVkaXQkZGVmYXVsdCkKYGBgCgpTdW1tYXJ5IHN0YXRpc3RpY3Mgb2YgdGhlIGNyZWRpdCBkYXRhOgpgYGB7cn0Kc3VtbWFyeShjcmVkaXQpCmBgYAoKIyMgU3RlcCAzOiBUcmFpbmluZyBhIG1vZGVsIG9uIHRoZSBkYXRhClNldCB1cCB0cmFpbm5pbmcgYW5kIHRlc3QgZGF0YSBzZXRzOgoKYGBge3J9CgppbmR4ID0gc2FtcGxlKDE6bnJvdyhjcmVkaXQpLCBhcy5pbnRlZ2VyKDAuOSpucm93KGNyZWRpdCkpKQppbmR4CgpjcmVkaXRfdHJhaW4gPSBjcmVkaXRbaW5keCxdCmNyZWRpdF90ZXN0ID0gY3JlZGl0Wy1pbmR4LF0KCmNyZWRpdF90cmFpbl9sYWJlbHMgPSBjcmVkaXRbaW5keCwxN10KY3JlZGl0X3Rlc3RfbGFiZWxzID0gY3JlZGl0Wy1pbmR4LDE3XSAgCmBgYAoKUmVncmVzc2lvbiB0cmVlIHVzaW5nIHJwYXJ0OgpgYGB7cn0KbGlicmFyeShycGFydCkKbS5ycGFydCA8LSBycGFydChkZWZhdWx0IH4gLiwgZGF0YSA9IGNyZWRpdF90cmFpbikKYGBgCgpHZXQgYmFzaWMgaW5mb3JtYXRpb24gYWJvdXQgdGhlIHRyZWU6CmBgYHtyfQptLnJwYXJ0CmBgYAoKR2V0IG1vcmUgZGV0YWlsZWQgaW5mb3JtYXRpb24gYWJvdXQgdGhlIHRyZWU6CmBgYHtyfQpzdW1tYXJ5KG0ucnBhcnQpCmBgYAoKVXNlIHRoZSBycGFydC5wbG90IHBhY2thZ2UgdG8gY3JlYXRlIGEgdmlzdWFsaXphdGlvbjoKYGBge3J9CmxpYnJhcnkocnBhcnQucGxvdCkKIyBhIGJhc2ljIGRlY2lzaW9uIHRyZWUgZGlhZ3JhbQpycGFydC5wbG90KG0ucnBhcnQsIGRpZ2l0cyA9IDMpCmBgYApBIGZldyBhZGp1c3RtZW50cyB0byB0aGUgZGlhZ3JhbToKYGBge3J9CnJwYXJ0LnBsb3QobS5ycGFydCwgZGlnaXRzID0gNCwgZmFsbGVuLmxlYXZlcyA9IFRSVUUsIHR5cGUgPSAzLCBleHRyYSA9IDEwMSkKYGBgCgoKIyMgU3RlcCA0OiBDaGVjayBtb2RlbCBwZXJmb3JtYW5jZQpHZW5lcmF0ZSBwcmVkaWN0aW9ucyBmb3IgdGhlIHRlc3RpbmcgZGF0YXNldDoKYGBge3J9CnAucnBhcnQgPC0gcHJlZGljdChtLnJwYXJ0LCBjcmVkaXRfdGVzdCkKCmBgYAoKQ29tcGFyZSB0aGUgZGlzdHJpYnV0aW9uIG9mIHByZWRpY3RlZCB2YWx1ZXMgdnMuIGFjdHVhbCB2YWx1ZXM6CmBgYHtyfQpzdW1tYXJ5KHAucnBhcnQpCnN1bW1hcnkoY3JlZGl0X3Rlc3QkZGVmYXVsdCkKYGBgCgpDb21wYXJlIHRoZSBjb3JyZWxhdGlvbiBiZXR3ZWVuIHByZWRpY3RlZCBhbmQgYWN0dWFsIGRlZmF1bHQ6CmBgYHtyfQpjb3IocC5ycGFydCwgYXMuaW50ZWdlcihjcmVkaXRfdGVzdCRkZWZhdWx0KSkKYGBgCkEgY29ycmVsYXRpb24gb2YgMC40MiBpcyBjZXJ0YWlubHkgYWNjZXB0YWJsZS4gSG93ZXZlciwgdGhlIGNvcnJlbGF0aW9uIG9ubHkgbWVhc3VyZXMgaG93IHN0cm9uZ2x5IHRoZSBwcmVkaWN0aW9ucyBhcmUgcmVsYXRlZCB0byB0aGUgdHJ1ZSB2YWx1ZTsgaXQgaXMgbm90IGEgbWVhc3VyZSBvZiBob3cgZmFyIG9mZiB0aGUgcHJlZGljdGlvbnMgd2VyZSBmcm9tIHRoZSB0cnVlIHZhbHVlcy4KCkFub3RoZXIgd2F5IHRvIHRoaW5rIGFib3V0IHRoZSBtb2RlbCdzIHBlcmZvcm1hbmNlIGlzIHRvIGNvbnNpZGVyIGhvdyBmYXIsIG9uIGF2ZXJhZ2UsIGl0cyBwcmVkaWN0aW9uIHdhcyBmcm9tIHRoZSB0cnVlIHZhbHVlLiBUaGlzIG1lYXN1cmVtZW50IGlzIGNhbGxlZCB0aGVtZWFuIGFic29sdXRlIGVycm9yIChNQUUpLiBUaGUgZXF1YXRpb24gZm9yIE1BRSBpcyBhcyBmb2xsb3dzLCB3aGVyZSBuIGluZGljYXRlcyB0aGUgbnVtYmVyIG9mIHByZWRpY3Rpb25zIGFuZCBlaSBpbmRpY2F0ZXMgdGhlIGVycm9yIGZvciBwcmVkaWN0aW9uIGk6CkZ1bmN0aW9uIHRvIGNhbGN1bGF0ZSB0aGUgbWVhbiBhYnNvbHV0ZSBlcnJvcjoKYGBge3J9Ck1BRSA8LSBmdW5jdGlvbihhY3R1YWwsIHByZWRpY3RlZCkgewogIG1lYW4oYWJzKGFjdHVhbCAtIHByZWRpY3RlZCkpICAKfQpgYGAKCk1lYW4gYWJzb2x1dGUgZXJyb3IgYmV0d2VlbiBwcmVkaWN0ZWQgYW5kIGFjdHVhbCB2YWx1ZXM6CmBgYHtyfQpNQUUocC5ycGFydCwgYXMubnVtZXJpYyhjcmVkaXRfdGVzdCRkZWZhdWx0KSkKYGBgClRoaXMgaW1wbGllcyB0aGF0LCBvbiBhdmVyYWdlLCB0aGUgZGlmZmVyZW5jZSBiZXR3ZWVuIG91ciBtb2RlbCdzIHByZWRpY3Rpb25zIGFuZCB0aGUgdHJ1ZSBxdWFsaXR5IHNjb3JlIHdhcyBhYm91dCAwLjg2LiBPbiBhIHF1YWxpdHkgc2NhbGUgZnJvbSB6ZXJvIHRvIDEwLCB0aGlzIHNlZW1zIHRvIHN1Z2dlc3QgdGhhdCBvdXIgbW9kZWwgaXMgZG9pbmcgZmFpcmx5IHdlbGwuCgpNZWFuIGFic29sdXRlIGVycm9yIGJldHdlZW4gYWN0dWFsIHZhbHVlcyBhbmQgbWVhbiB2YWx1ZToKYGBge3J9Cm1lYW4oYXMubnVtZXJpYyhjcmVkaXRfdHJhaW4kZGVmYXVsdCkpICMgMS4yOQpNQUUoMS4zMCwgYXMubnVtZXJpYyhjcmVkaXRfdHJhaW4kZGVmYXVsdCkpCmBgYAoKIyMgc3RlcCA1OiBJbXByb3ZlIG1vZGVsIHBlcmZvcm1hbmNlIApVc2luZyBSV2VrYSB0cmFpbiBhIE01JyBtb2RlbCB0cmVlOgpSV2VrYSBkb2VzIG5vdCB3b3JrIG9uIG15IG1hYywgdGhlcmVmb3JlLCBvbmx5IHRoZSBjb2RlIGlzIGdpdmVuLgpgYGB7cn0KbGlicmFyeShSV2VrYSkKbS5tNXAgPC0gTTVQKHF1YWxpdHkgfiAuLCBkYXRhID0gd2luZV90cmFpbikKbS5tNXAKc3VtbWFyeShtLm01cCkKCiN3ZSdsbCBsb29rIGF0IGhvdyB3ZWxsIHRoZSBtb2RlbCBwZXJmb3JtcyBvbiB0aGUgdW5zZWVuIHRlc3QgZGF0YS4gVGhlCiMgcHJlZGljdCgpIGZ1bmN0aW9uIGdldHMgdXMgYSB2ZWN0b3Igb2YgcHJlZGljdGVkIHZhbHVlczoKcC5tNXAgPC0gcHJlZGljdChtLm01cCwgd2luZV90ZXN0KQpzdW1tYXJ5KHAubTVwKQpjb3IocC5tNXAsIHdpbmVfdGVzdCRxdWFsaXR5KQpNQUUod2luZV90ZXN0JHF1YWxpdHksIHAubTVwKQpgYGAKCgoKCg==