The data has been collected and is ready to be analysed.
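# stringsAsFactors = TRUE keeps text columns as factors (R >= 4.0 defaults to FALSE)
credit <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv", stringsAsFactors = TRUE)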
str(credit)
'data.frame': 1000 obs. of 17 variables:
$ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
$ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
$ credit_history : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
$ purpose : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
$ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
$ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
$ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
$ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
$ age : int 67 22 49 45 53 35 53 35 61 28 ...
$ other_credit : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
$ housing : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
$ existing_loans_count: int 2 1 1 1 2 1 1 1 1 2 ...
$ job : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
$ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
$ phone : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
The distribution of defaults:
plot(credit$default)
Summary statistics of the credit data:
summary(credit)
checking_balance months_loan_duration credit_history
< 0 DM :274 Min. : 4.0 critical :293
> 200 DM : 63 1st Qu.:12.0 good :530
1 - 200 DM:269 Median :18.0 perfect : 40
unknown :394 Mean :20.9 poor : 88
3rd Qu.:24.0 very good: 49
Max. :72.0
purpose amount savings_balance
business : 97 Min. : 250 < 100 DM :603
car :337 1st Qu.: 1366 > 1000 DM : 48
car0 : 12 Median : 2320 100 - 500 DM :103
education : 59 Mean : 3271 500 - 1000 DM: 63
furniture/appliances:473 3rd Qu.: 3972 unknown :183
renovations : 22 Max. :18424
employment_duration percent_of_income years_at_residence
< 1 year :172 Min. :1.000 Min. :1.000
> 7 years :253 1st Qu.:2.000 1st Qu.:2.000
1 - 4 years:339 Median :3.000 Median :3.000
4 - 7 years:174 Mean :2.973 Mean :2.845
unemployed : 62 3rd Qu.:4.000 3rd Qu.:4.000
Max. :4.000 Max. :4.000
age other_credit housing existing_loans_count
Min. :19.00 bank :139 other:108 Min. :1.000
1st Qu.:27.00 none :814 own :713 1st Qu.:1.000
Median :33.00 store: 47 rent :179 Median :1.000
Mean :35.55 Mean :1.407
3rd Qu.:42.00 3rd Qu.:2.000
Max. :75.00 Max. :4.000
job dependents phone default
management:148 Min. :1.000 no :596 no :700
skilled :630 1st Qu.:1.000 yes:404 yes:300
unemployed: 22 Median :1.000
unskilled :200 Mean :1.155
3rd Qu.:1.000
Max. :2.000
Set up training and test data sets:
indx = sample(1:nrow(credit), as.integer(0.9*nrow(credit)))
indx
[1] 46 847 364 638 317 366 874 942 568 94 477 381
[13] 374 781 733 130 535 190 490 399 607 623 50 105
[25] 396 534 448 59 56 489 541 795 911 116 982 831
[37] 771 228 469 708 206 940 124 158 593 744 570 952
[49] 558 335 359 840 202 87 946 93 437 547 355 61
[61] 959 892 666 52 312 28 168 739 990 170 973 601
[73] 175 587 554 400 379 619 643 604 580 530 121 526
[85] 756 307 698 513 736 864 897 286 956 315 981 562
[97] 689 783 700 787 72 450 763 127 631 646 629 655
[109] 828 499 329 166 606 539 253 701 334 769 732 409
[121] 91 572 143 574 100 635 883 27 194 811 440 129
[133] 141 720 832 889 401 549 693 416 730 140 518 125
[145] 910 473 167 843 154 2 361 149 974 680 510 656
[157] 556 234 107 826 980 880 927 746 275 650 523 683
[169] 842 49 331 358 397 137 614 602 644 387 792 193
[181] 122 165 726 589 309 605 163 738 192 133 41 717
[193] 615 509 219 42 800 446 816 263 872 8 818 257
[205] 460 761 159 475 384 673 147 780 548 238 941 426
[217] 472 1 382 483 252 357 417 616 214 674 634 266
[229] 7 103 267 625 996 595 183 820 445 688 40 75
[241] 965 186 671 520 901 791 291 4 128 179 423 218
[253] 177 621 74 222 751 622 424 5 930 848 208 784
[265] 88 385 224 648 599 342 645 480 686 146 311 305
[277] 221 949 766 18 748 455 524 485 97 905 431 230
[289] 725 504 363 860 968 610 597 35 241 30 662 829
[301] 519 845 573 476 131 827 145 430 245 735 672 251
[313] 231 731 882 838 362 807 879 915 338 823 92 500
[325] 855 24 667 66 420 633 822 522 962 912 626 654
[337] 126 963 909 552 57 703 349 449 196 801 544 290
[349] 515 370 935 138 620 20 960 536 538 862 376 372
[361] 679 176 636 649 352 576 985 189 895 344 486 438
[373] 398 611 528 273 681 356 53 767 368 834 65 531
[385] 590 195 236 390 579 723 274 692 586 410 21 428
[397] 429 706 837 898 916 36 328 350 659 225 938 48
[409] 991 308 542 296 753 81 873 292 38 900 109 136
[421] 844 747 306 863 833 640 508 33 920 474 210 777
[433] 113 123 316 29 609 505 870 975 161 62 443 496
[445] 134 768 155 696 77 452 854 839 583 992 188 716
[457] 903 876 458 877 295 418 467 891 745 718 926 624
[469] 261 682 64 564 675 243 896 948 664 468 884 117
[481] 388 983 304 284 921 289 233 989 367 944 894 360
[493] 301 871 715 492 603 330 287 115 339 371 714 628
[505] 677 817 23 433 691 806 765 90 945 788 461 641
[517] 209 463 487 678 303 978 613 314 694 247 13 824
[529] 434 488 934 724 618 112 299 271 852 591 153 797
[541] 786 408 451 516 242 647 976 298 411 772 773 493
[553] 953 567 180 249 181 755 639 566 333 353 702 102
[565] 378 742 84 565 821 191 600 386 502 164 454 380
[577] 327 865 481 774 867 559 31 596 553 3 933 709
[589] 729 319 657 16 947 713 994 859 986 204 809 44
[601] 663 563 345 966 529 404 73 812 888 943 246 967
[613] 203 279 653 521 866 157 581 150 925 592 78 322
[625] 749 617 260 79 71 810 721 171 907 779 169 881
[637] 819 951 849 999 850 760 282 754 302 393 262 414
[649] 808 937 184 135 525 660 348 54 805 373 76 351
[661] 365 971 856 435 846 902 470 498 789 462 369 676
[673] 899 802 993 919 958 890 904 741 111 326 405 227
[685] 804 707 961 272 957 68 89 608 248 497 207 69
[697] 533 43 98 456 34 825 830 885 293 512 346 172
[709] 550 665 148 85 336 710 851 737 977 212 160 778
[721] 200 501 914 264 705 740 96 503 413 886 950 442
[733] 421 340 594 313 39 419 220 630 929 479 67 569
[745] 984 268 395 585 1000 278 814 506 998 637 704 785
[757] 875 582 58 281 444 55 571 217 223 277 507 728
[769] 240 162 764 987 517 6 412 457 12 803 120 813
[781] 858 80 343 235 906 969 743 494 684 310 300 9
[793] 540 762 383 712 857 577 114 119 775 83 63 197
[805] 687 321 537 465 258 466 893 427 997 391 584 695
[817] 551 924 324 37 394 658 406 612 532 10 734 484
[829] 139 432 972 794 269 17 255 917 651 439 685 750
[841] 560 979 403 174 213 578 588 815 782 229 995 198
[853] 557 546 22 453 47 280 799 106 265 70 82 323
[865] 215 51 722 759 32 514 464 441 932 887 110 955
[877] 668 151 11 187 835 244 232 836 144 142 922 118
[889] 276 347 104 711 642 752 270 25 254 283 402 908
credit_train = credit[indx,]
credit_test = credit[-indx,]
credit_train_labels = credit[indx,17]
credit_test_labels = credit[-indx,17]
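The split above draws a different random sample on every run, so the row indices and everything downstream will change. A minimal sketch of a reproducible 90/10 split follows, assuming a fixed seed is acceptable. The data frame `toy` and the seed value are hypothetical stand-ins, not the credit data.

```r
# Hypothetical stand-in for the credit data frame (not the real data)
set.seed(123)  # assumed seed; any fixed integer makes the split repeatable
toy <- data.frame(
  amount  = round(runif(20, 250, 18000)),
  default = factor(sample(c("no", "yes"), 20, replace = TRUE),
                   levels = c("no", "yes"))
)

# Draw 90% of the row indices without replacement
n_train <- as.integer(0.9 * nrow(toy))
idx <- sample(seq_len(nrow(toy)), n_train)

# Rows in idx form the training set, the remaining rows the test set
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Select the labels by column name rather than by column position
toy_train_labels <- toy_train$default
toy_test_labels  <- toy_test$default
```

Selecting the label column by name avoids silently picking the wrong column if the column order ever changes.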
Classification tree using rpart (default is a factor, so rpart fits a classification tree rather than a regression tree):
library(rpart)
m.rpart <- rpart(default ~ ., data = credit_train)
Get basic information about the tree:
m.rpart
n= 900
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 900 264 no (0.70666667 0.29333333)
2) checking_balance=> 200 DM,unknown 417 55 no (0.86810552 0.13189448) *
3) checking_balance=< 0 DM,1 - 200 DM 483 209 no (0.56728778 0.43271222)
6) months_loan_duration< 22.5 276 94 no (0.65942029 0.34057971)
12) credit_history=critical,good,poor 250 75 no (0.70000000 0.30000000)
24) months_loan_duration< 11.5 70 11 no (0.84285714 0.15714286) *
25) months_loan_duration>=11.5 180 64 no (0.64444444 0.35555556)
50) amount>=1390.5 113 30 no (0.73451327 0.26548673) *
51) amount< 1390.5 67 33 yes (0.49253731 0.50746269)
102) purpose=business,furniture/appliances 37 14 no (0.62162162 0.37837838) *
103) purpose=car,education,renovations 30 10 yes (0.33333333 0.66666667) *
13) credit_history=perfect,very good 26 7 yes (0.26923077 0.73076923) *
7) months_loan_duration>=22.5 207 92 yes (0.44444444 0.55555556)
14) savings_balance=> 1000 DM,unknown 33 9 no (0.72727273 0.27272727) *
15) savings_balance=< 100 DM,100 - 500 DM,500 - 1000 DM 174 68 yes (0.39080460 0.60919540)
30) months_loan_duration< 47.5 143 64 yes (0.44755245 0.55244755)
60) amount>=2313 112 55 no (0.50892857 0.49107143)
120) amount< 8015.5 90 39 no (0.56666667 0.43333333)
240) amount>=5332 17 1 no (0.94117647 0.05882353) *
241) amount< 5332 73 35 yes (0.47945205 0.52054795)
482) amount< 3962 51 22 no (0.56862745 0.43137255)
964) job=management,unemployed 9 1 no (0.88888889 0.11111111) *
965) job=skilled,unskilled 42 21 no (0.50000000 0.50000000)
1930) other_credit=bank,store 12 3 no (0.75000000 0.25000000) *
1931) other_credit=none 30 12 yes (0.40000000 0.60000000)
3862) percent_of_income< 3.5 11 4 no (0.63636364 0.36363636) *
3863) percent_of_income>=3.5 19 5 yes (0.26315789 0.73684211) *
483) amount>=3962 22 6 yes (0.27272727 0.72727273) *
121) amount>=8015.5 22 6 yes (0.27272727 0.72727273) *
61) amount< 2313 31 7 yes (0.22580645 0.77419355) *
31) months_loan_duration>=47.5 31 4 yes (0.12903226 0.87096774) *
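Each line of the printed tree lists the node number, the split, n (observations in the node), loss (observations that do not belong to the predicted class), yval (the predicted class) and the class probabilities. The sketch below recomputes the probabilities for two nodes from their n and loss values, to show how the columns relate. The helper name `node_probs` is hypothetical and not part of rpart.

```r
# Hypothetical helper: recover (no, yes) probabilities from one printed node line
node_probs <- function(n, loss, yval) {
  p_other <- loss / n        # share of observations outside the predicted class
  p_pred  <- 1 - p_other     # share of observations in the predicted class
  if (yval == "no") {
    c(no = p_pred, yes = p_other)
  } else {
    c(no = p_other, yes = p_pred)
  }
}

# Node 3 from the output above: n = 483, loss = 209, yval = no
node3 <- node_probs(483, 209, "no")

# Node 7 from the output above: n = 207, loss = 92, yval = yes
node7 <- node_probs(207, 92, "yes")

round(node3, 8)
round(node7, 8)
```

The node numbers also encode the tree shape: the children of node k are numbered 2k and 2k + 1, which is why node 3 splits into nodes 6 and 7.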
Get more detailed information about the tree:
summary(m.rpart)
Call:
rpart(formula = default ~ ., data = credit_train)
n= 900
CP nsplit rel error xerror xstd
1 0.04356061 0 1.0000000 1.0000000 0.05173749
2 0.01515152 4 0.8106061 0.8560606 0.04927867
3 0.01262626 9 0.7272727 0.9090909 0.05025189
4 0.01136364 12 0.6893939 0.8939394 0.04998214
5 0.01000000 15 0.6553030 0.8939394 0.04998214
Variable importance
checking_balance amount months_loan_duration
29 19 16
credit_history savings_balance purpose
11 10 4
job age employment_duration
3 2 2
percent_of_income other_credit housing
2 2 1
Node number 1: 900 observations, complexity param=0.04356061
predicted class=no expected loss=0.2933333 P(node) =1
class counts: 636 264
probabilities: 0.707 0.293
left son=2 (417 obs) right son=3 (483 obs)
Primary splits:
checking_balance splits as RLRL, improve=40.50210, (0 missing)
credit_history splits as LLRLR, improve=15.58324, (0 missing)
savings_balance splits as RLRLL, improve=14.43356, (0 missing)
months_loan_duration < 34.5 to the left, improve=11.04520, (0 missing)
amount < 3913.5 to the left, improve= 7.81936, (0 missing)
Surrogate splits:
savings_balance splits as RLRLL, agree=0.620, adj=0.180, (0 split)
credit_history splits as LRRRR, agree=0.580, adj=0.094, (0 split)
age < 30.5 to the right, agree=0.557, adj=0.043, (0 split)
employment_duration splits as RLRRR, agree=0.554, adj=0.038, (0 split)
months_loan_duration < 6.5 to the left, agree=0.551, adj=0.031, (0 split)
Node number 2: 417 observations
predicted class=no expected loss=0.1318945 P(node) =0.4633333
class counts: 362 55
probabilities: 0.868 0.132
Node number 3: 483 observations, complexity param=0.04356061
predicted class=no expected loss=0.4327122 P(node) =0.5366667
class counts: 274 209
probabilities: 0.567 0.433
left son=6 (276 obs) right son=7 (207 obs)
Primary splits:
months_loan_duration < 22.5 to the left, improve=10.933060, (0 missing)
credit_history splits as LLRLR, improve= 8.439764, (0 missing)
savings_balance splits as RLRLL, improve= 7.670136, (0 missing)
amount < 8079 to the left, improve= 4.644312, (0 missing)
housing splits as RLR, improve= 3.589430, (0 missing)
Surrogate splits:
amount < 2805.5 to the left, agree=0.754, adj=0.425, (0 split)
credit_history splits as LLRRR, agree=0.613, adj=0.097, (0 split)
purpose splits as RLRLLL, agree=0.611, adj=0.092, (0 split)
housing splits as RLL, agree=0.602, adj=0.072, (0 split)
job splits as RLLL, agree=0.598, adj=0.063, (0 split)
Node number 6: 276 observations, complexity param=0.04356061
predicted class=no expected loss=0.3405797 P(node) =0.3066667
class counts: 182 94
probabilities: 0.659 0.341
left son=12 (250 obs) right son=13 (26 obs)
Primary splits:
credit_history splits as LLRLR, improve=8.740245, (0 missing)
purpose splits as LRLRLR, improve=4.553514, (0 missing)
amount < 1281.5 to the right, improve=4.203433, (0 missing)
months_loan_duration < 11.5 to the left, improve=3.547857, (0 missing)
employment_duration splits as RRRLR, improve=2.778280, (0 missing)
Node number 7: 207 observations, complexity param=0.04356061
predicted class=yes expected loss=0.4444444 P(node) =0.23
class counts: 92 115
probabilities: 0.444 0.556
left son=14 (33 obs) right son=15 (174 obs)
Primary splits:
savings_balance splits as RLRRL, improve=6.280738, (0 missing)
amount < 1370 to the right, improve=3.322222, (0 missing)
age < 25.5 to the right, improve=2.917191, (0 missing)
months_loan_duration < 43.5 to the left, improve=2.782716, (0 missing)
credit_history splits as LRRLR, improve=2.178017, (0 missing)
Node number 12: 250 observations, complexity param=0.01262626
predicted class=no expected loss=0.3 P(node) =0.2777778
class counts: 175 75
probabilities: 0.700 0.300
left son=24 (70 obs) right son=25 (180 obs)
Primary splits:
months_loan_duration < 11.5 to the left, improve=3.968254, (0 missing)
amount < 1281.5 to the right, improve=3.472222, (0 missing)
purpose splits as LLLRLL, improve=3.389356, (0 missing)
employment_duration splits as RRRLR, improve=3.198198, (0 missing)
credit_history splits as LR-L-, improve=2.641147, (0 missing)
Surrogate splits:
amount < 527.5 to the left, agree=0.744, adj=0.086, (0 split)
age < 69 to the right, agree=0.728, adj=0.029, (0 split)
Node number 13: 26 observations
predicted class=yes expected loss=0.2692308 P(node) =0.02888889
class counts: 7 19
probabilities: 0.269 0.731
Node number 14: 33 observations
predicted class=no expected loss=0.2727273 P(node) =0.03666667
class counts: 24 9
probabilities: 0.727 0.273
Node number 15: 174 observations, complexity param=0.01515152
predicted class=yes expected loss=0.3908046 P(node) =0.1933333
class counts: 68 106
probabilities: 0.391 0.609
left son=30 (143 obs) right son=31 (31 obs)
Primary splits:
months_loan_duration < 47.5 to the left, improve=5.169546, (0 missing)
amount < 1368.5 to the right, improve=2.561418, (0 missing)
percent_of_income < 2.5 to the left, improve=1.993194, (0 missing)
age < 26.5 to the right, improve=1.951483, (0 missing)
employment_duration splits as RLLLL, improve=1.799850, (0 missing)
Surrogate splits:
amount < 13319.5 to the left, agree=0.839, adj=0.097, (0 split)
Node number 24: 70 observations
predicted class=no expected loss=0.1571429 P(node) =0.07777778
class counts: 59 11
probabilities: 0.843 0.157
Node number 25: 180 observations, complexity param=0.01262626
predicted class=no expected loss=0.3555556 P(node) =0.2
class counts: 116 64
probabilities: 0.644 0.356
left son=50 (113 obs) right son=51 (67 obs)
Primary splits:
amount < 1390.5 to the right, improve=4.925555, (0 missing)
credit_history splits as LR-L-, improve=2.688889, (0 missing)
checking_balance splits as R-L-, improve=2.557172, (0 missing)
employment_duration splits as RRRLR, improve=2.472905, (0 missing)
purpose splits as LRLRRR, improve=1.866938, (0 missing)
Surrogate splits:
job splits as LLLR, agree=0.683, adj=0.149, (0 split)
months_loan_duration < 12.5 to the right, agree=0.667, adj=0.104, (0 split)
purpose splits as LLLRLL, agree=0.656, adj=0.075, (0 split)
percent_of_income < 3.5 to the left, agree=0.650, adj=0.060, (0 split)
age < 21.5 to the right, agree=0.650, adj=0.060, (0 split)
Node number 30: 143 observations, complexity param=0.01515152
predicted class=yes expected loss=0.4475524 P(node) =0.1588889
class counts: 64 79
probabilities: 0.448 0.552
left son=60 (112 obs) right son=61 (31 obs)
Primary splits:
amount < 2313 to the right, improve=3.892434, (0 missing)
employment_duration splits as RLLLL, improve=2.842319, (0 missing)
age < 57.5 to the right, improve=2.469589, (0 missing)
percent_of_income < 3.5 to the left, improve=2.405927, (0 missing)
housing splits as LLR, improve=1.119347, (0 missing)
Node number 31: 31 observations
predicted class=yes expected loss=0.1290323 P(node) =0.03444444
class counts: 4 27
probabilities: 0.129 0.871
Node number 50: 113 observations
predicted class=no expected loss=0.2654867 P(node) =0.1255556
class counts: 83 30
probabilities: 0.735 0.265
Node number 51: 67 observations, complexity param=0.01262626
predicted class=yes expected loss=0.4925373 P(node) =0.07444444
class counts: 33 34
probabilities: 0.493 0.507
left son=102 (37 obs) right son=103 (30 obs)
Primary splits:
purpose splits as LR-RLR, improve=2.753799, (0 missing)
existing_loans_count < 1.5 to the right, improve=2.596846, (0 missing)
years_at_residence < 3.5 to the right, improve=1.785027, (0 missing)
credit_history splits as LR-L-, improve=1.740516, (0 missing)
job splits as LRRL, improve=1.679470, (0 missing)
Surrogate splits:
savings_balance splits as LRLLR, agree=0.612, adj=0.133, (0 split)
amount < 1163 to the left, agree=0.597, adj=0.100, (0 split)
age < 34 to the left, agree=0.597, adj=0.100, (0 split)
housing splits as RLL, agree=0.597, adj=0.100, (0 split)
job splits as LLRL, agree=0.597, adj=0.100, (0 split)
Node number 60: 112 observations, complexity param=0.01515152
predicted class=no expected loss=0.4910714 P(node) =0.1244444
class counts: 57 55
probabilities: 0.509 0.491
left son=120 (90 obs) right son=121 (22 obs)
Primary splits:
amount < 8015.5 to the left, improve=3.054870, (0 missing)
other_credit splits as RRL, improve=1.810714, (0 missing)
age < 29.5 to the right, improve=1.774614, (0 missing)
employment_duration splits as RLLLL, improve=1.502416, (0 missing)
percent_of_income < 2.5 to the left, improve=1.459793, (0 missing)
Surrogate splits:
purpose splits as LLRRLL, agree=0.857, adj=0.273, (0 split)
existing_loans_count < 2.5 to the left, agree=0.812, adj=0.045, (0 split)
Node number 61: 31 observations
predicted class=yes expected loss=0.2258065 P(node) =0.03444444
class counts: 7 24
probabilities: 0.226 0.774
Node number 102: 37 observations
predicted class=no expected loss=0.3783784 P(node) =0.04111111
class counts: 23 14
probabilities: 0.622 0.378
Node number 103: 30 observations
predicted class=yes expected loss=0.3333333 P(node) =0.03333333
class counts: 10 20
probabilities: 0.333 0.667
Node number 120: 90 observations, complexity param=0.01515152
predicted class=no expected loss=0.4333333 P(node) =0.1
class counts: 51 39
probabilities: 0.567 0.433
left son=240 (17 obs) right son=241 (73 obs)
Primary splits:
amount < 5332 to the right, improve=5.879291, (0 missing)
percent_of_income < 2.5 to the left, improve=3.341254, (0 missing)
job splits as LRLR, improve=3.200000, (0 missing)
employment_duration splits as RLLLL, improve=1.893106, (0 missing)
purpose splits as LL--RR, improve=1.690000, (0 missing)
Node number 121: 22 observations
predicted class=yes expected loss=0.2727273 P(node) =0.02444444
class counts: 6 16
probabilities: 0.273 0.727
Node number 240: 17 observations
predicted class=no expected loss=0.05882353 P(node) =0.01888889
class counts: 16 1
probabilities: 0.941 0.059
Node number 241: 73 observations, complexity param=0.01515152
predicted class=yes expected loss=0.4794521 P(node) =0.08111111
class counts: 35 38
probabilities: 0.479 0.521
left son=482 (51 obs) right son=483 (22 obs)
Primary splits:
amount < 3962 to the left, improve=2.691476, (0 missing)
job splits as LLLR, improve=2.294661, (0 missing)
percent_of_income < 2.5 to the left, improve=1.514887, (0 missing)
other_credit splits as LRL, improve=1.188843, (0 missing)
employment_duration splits as RLLLL, improve=1.142304, (0 missing)
Surrogate splits:
credit_history splits as RLRLL, agree=0.767, adj=0.227, (0 split)
existing_loans_count < 1.5 to the left, agree=0.753, adj=0.182, (0 split)
months_loan_duration < 37.5 to the left, agree=0.740, adj=0.136, (0 split)
Node number 482: 51 observations, complexity param=0.01136364
predicted class=no expected loss=0.4313725 P(node) =0.05666667
class counts: 29 22
probabilities: 0.569 0.431
left son=964 (9 obs) right son=965 (42 obs)
Primary splits:
job splits as LRLR, improve=2.241830, (0 missing)
other_credit splits as LRL, improve=1.818836, (0 missing)
percent_of_income < 2.5 to the left, improve=1.533894, (0 missing)
age < 42.5 to the right, improve=1.350777, (0 missing)
housing splits as LLR, improve=1.181551, (0 missing)
Surrogate splits:
employment_duration splits as RRRRL, agree=0.863, adj=0.222, (0 split)
amount < 3844.5 to the right, agree=0.843, adj=0.111, (0 split)
Node number 483: 22 observations
predicted class=yes expected loss=0.2727273 P(node) =0.02444444
class counts: 6 16
probabilities: 0.273 0.727
Node number 964: 9 observations
predicted class=no expected loss=0.1111111 P(node) =0.01
class counts: 8 1
probabilities: 0.889 0.111
Node number 965: 42 observations, complexity param=0.01136364
predicted class=no expected loss=0.5 P(node) =0.04666667
class counts: 21 21
probabilities: 0.500 0.500
left son=1930 (12 obs) right son=1931 (30 obs)
Primary splits:
other_credit splits as LRL, improve=2.1000000, (0 missing)
percent_of_income < 2.5 to the left, improve=1.3925730, (0 missing)
months_loan_duration < 25.5 to the left, improve=1.1904760, (0 missing)
employment_duration splits as RLLLR, improve=0.9333333, (0 missing)
age < 36.5 to the right, improve=0.9333333, (0 missing)
Surrogate splits:
credit_history splits as RRRRL, agree=0.810, adj=0.333, (0 split)
employment_duration splits as RRRLR, agree=0.738, adj=0.083, (0 split)
age < 56.5 to the right, agree=0.738, adj=0.083, (0 split)
Node number 1930: 12 observations
predicted class=no expected loss=0.25 P(node) =0.01333333
class counts: 9 3
probabilities: 0.750 0.250
Node number 1931: 30 observations, complexity param=0.01136364
predicted class=yes expected loss=0.4 P(node) =0.03333333
class counts: 12 18
probabilities: 0.400 0.600
left son=3862 (11 obs) right son=3863 (19 obs)
Primary splits:
percent_of_income < 3.5 to the left, improve=1.9406700, (0 missing)
employment_duration splits as RLRLR, improve=1.8037270, (0 missing)
age < 23.5 to the right, improve=1.2074530, (0 missing)
amount < 3088.5 to the left, improve=0.9000000, (0 missing)
housing splits as LLR, improve=0.8126984, (0 missing)
Surrogate splits:
age < 22.5 to the left, agree=0.700, adj=0.182, (0 split)
amount < 2371.5 to the left, agree=0.667, adj=0.091, (0 split)
employment_duration splits as RRRRL, agree=0.667, adj=0.091, (0 split)
Node number 3862: 11 observations
predicted class=no expected loss=0.3636364 P(node) =0.01222222
class counts: 7 4
probabilities: 0.636 0.364
Node number 3863: 19 observations
predicted class=yes expected loss=0.2631579 P(node) =0.02111111
class counts: 5 14
probabilities: 0.263 0.737
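In the CP table at the top of the summary, the cross-validated error (xerror) is lowest in row 2 (0.8560606 at 4 splits) and rises again for larger trees, which suggests the full tree may be overfitting. One common response is to prune the tree back to the CP value with the smallest xerror. The sketch below shows the idea on the kyphosis data set that ships with rpart, used here only as a stand-in for the credit data. The seed is an assumption.

```r
library(rpart)

set.seed(1)  # assumed seed; cross-validation inside rpart is random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# cptable has the same columns as the table printed by summary()
cp_table <- fit$cptable

# Pick the CP value with the smallest cross-validated error
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Remove every split that is not worth keeping at that complexity
pruned <- prune(fit, cp = best_cp)
pruned
```

Applied to the tree in this section, the same pattern would be `prune(m.rpart, cp = ...)` with the CP value chosen from the table of `m.rpart`.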
Use the rpart.plot package to create a visualization:
library(rpart.plot)
# a basic decision tree diagram
rpart.plot(m.rpart, digits = 3)
A few adjustments to the diagram:
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
Generate predictions for the testing dataset:
p.rpart <- predict(m.rpart, credit_test)
Compare the distribution of predicted values vs. actual values:
summary(p.rpart)
no yes
Min. :0.1290 Min. :0.05882
1st Qu.:0.6327 1st Qu.:0.13189
Median :0.8429 Median :0.15714
Mean :0.6921 Mean :0.30793
3rd Qu.:0.8681 3rd Qu.:0.36732
Max. :0.9412 Max. :0.87097
summary(credit_test$default)
no yes
64 36
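Because default is a factor, predict() returns class probabilities by default, which is why the summary above shows a "no" and a "yes" column. For a classification tree, a more direct comparison is a confusion matrix of predicted classes against actual classes. The sketch below shows this with type = "class" on the kyphosis data set that ships with rpart, as a stand-in for the credit data. The seed and the split are assumptions.

```r
library(rpart)

set.seed(42)  # assumed seed so the split is repeatable
idx   <- sample(seq_len(nrow(kyphosis)), as.integer(0.9 * nrow(kyphosis)))
train <- kyphosis[idx, ]
test  <- kyphosis[-idx, ]

fit <- rpart(Kyphosis ~ ., data = train, method = "class")

# type = "class" returns the predicted class label instead of probabilities
pred_class <- predict(fit, test, type = "class")

# Rows are actual classes, columns are predicted classes
conf_mat <- table(actual = test$Kyphosis, predicted = pred_class)
conf_mat

# Share of test observations that were classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

For the credit tree the equivalent calls would be `predict(m.rpart, credit_test, type = "class")` and `table(credit_test$default, ...)`.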
Compare the correlation between predicted and actual default:
cor(p.rpart, as.integer(credit_test$default))
[,1]
no -0.4200798
yes 0.4200798
A correlation of 0.42 between the predicted probability of default and the actual outcome indicates a moderate relationship. However, the correlation only measures how strongly the predictions are related to the true values; it is not a measure of how far off the predictions were from the true values.
Another way to think about the model’s performance is to consider how far, on average, its predictions were from the true values. This measurement is called the mean absolute error (MAE). The equation for MAE is as follows, where n indicates the number of predictions and e_i indicates the error for prediction i:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |e_i|$$
Function to calculate the mean absolute error:
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
Mean absolute error between predicted and actual values:
MAE(p.rpart, as.numeric(credit_test$default))
[1] 0.86
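This 0.86 should not be read as a typical prediction error. p.rpart is a two-column matrix of class probabilities (values between 0 and 1), while as.numeric(credit_test$default) codes the classes as 1 ("no") and 2 ("yes"). Because the two probabilities in each row sum to 1, every "no" row contributes (1 - p_no) + (1 - p_yes) = 1 and every "yes" row contributes (2 - p_no) + (2 - p_yes) = 3, so the result depends only on the class counts in the test set: (64 * 1 + 36 * 3) / 200 = 0.86. It says nothing about how well the tree predicts. A meaningful MAE would compare the predicted probability of "yes" with a 0/1 indicator of default.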
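Since predict() returns a two-column probability matrix for this tree, passing it straight to MAE() together with the factor codes mixes two different scales. The sketch below contrasts that call with one that compares the "yes" probability against a 0/1 indicator. The matrix `p` and the vector `actual` are hypothetical values shaped like the real objects, not results from the credit model.

```r
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Hypothetical probability matrix shaped like predict() output for a class tree
p <- cbind(no  = c(0.9, 0.6, 0.2, 0.7),
           yes = c(0.1, 0.4, 0.8, 0.3))
actual <- factor(c("no", "yes", "yes", "no"), levels = c("no", "yes"))

# Mixed scales: whole matrix against factor codes 1/2.
# The result equals mean(codes) - 0.5, whatever the probabilities are.
codes     <- as.numeric(actual)
mae_mixed <- MAE(p, codes)
mae_mixed
mean(codes) - 0.5

# Same scale: probability of "yes" against a 0/1 indicator of default
y01      <- as.numeric(actual == "yes")
mae_prob <- MAE(y01, p[, "yes"])
mae_prob
```

With the test set in this section, mean(codes) is 1.36 (64 "no" and 36 "yes"), which reproduces the 0.86 reported above.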
Mean absolute error between actual values and mean value:
mean(as.numeric(credit_train$default))
[1] 1.293333
MAE(1.30, as.numeric(credit_train$default))
[1] 0.4173333
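Using RWeka to train an M5’ model tree: RWeka does not work on my Mac, so only the code is given. Note that the code refers to wine_train, wine_test and a numeric quality outcome, none of which are defined in this section; M5P requires a numeric outcome, so it cannot be applied to the factor default directly.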
library(RWeka)
m.m5p <- M5P(quality ~ ., data = wine_train)
m.m5p
summary(m.m5p)
# Check how well the model performs on the unseen test data. The
# predict() function gets us a vector of predicted values:
p.m5p <- predict(m.m5p, wine_test)
summary(p.m5p)
cor(p.m5p, wine_test$quality)
MAE(wine_test$quality, p.m5p)