The data has been collected and ready to be analysed.
credit <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv")
Fix the default variable to be 0 or 1
credit$default = as.numeric(credit$default)
credit$default = credit$default - 1
Examine the credit data:
str(credit)
'data.frame': 1000 obs. of 17 variables:
$ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
$ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
$ credit_history : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
$ purpose : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
$ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
$ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
$ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
$ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
$ age : int 67 22 49 45 53 35 53 35 61 28 ...
$ other_credit : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
$ housing : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
$ existing_loans_count: int 2 1 1 1 2 1 1 1 1 2 ...
$ job : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
$ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
$ phone : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
$ default : num 0 1 0 0 1 0 0 0 0 1 ...
indx<- sample(1:nrow(credit),as.integer(0.9*nrow(credit)))
indx
[1] 385 910 109 594 646 699 305 652 705 600 55 641
[13] 592 195 933 77 510 124 442 43 751 694 524 76
[25] 90 380 986 530 968 784 659 752 253 940 107 661
[37] 816 431 515 650 68 534 513 871 734 876 512 730
[49] 971 860 411 193 410 335 560 547 186 227 867 957
[61] 883 744 672 404 721 779 642 444 356 1000 944 774
[73] 780 445 59 361 289 975 988 70 263 806 341 555
[85] 458 347 683 147 432 611 912 381 485 57 296 21
[97] 969 682 403 287 655 830 616 537 353 897 34 880
[109] 217 44 644 695 151 508 143 213 518 677 987 874
[121] 303 549 377 639 440 748 840 237 905 301 663 104
[133] 566 496 299 207 65 809 325 966 166 586 160 308
[145] 418 72 875 707 792 190 208 737 192 469 45 344
[157] 346 999 916 281 657 545 314 64 85 522 211 3
[169] 525 155 700 52 467 950 448 955 842 768 254 763
[181] 802 348 322 449 557 653 540 921 14 878 776 319
[193] 73 246 478 393 42 660 994 801 470 926 189 5
[205] 698 554 167 934 823 996 728 884 961 345 125 647
[217] 980 175 643 230 573 58 948 746 67 120 669 982
[229] 755 328 483 363 722 129 919 712 422 420 486 1
[241] 433 35 599 758 15 794 671 983 261 119 351 99
[253] 800 198 998 202 173 267 681 608 631 111 131 739
[265] 531 276 179 165 103 417 727 575 226 414 667 598
[277] 953 229 188 236 133 473 931 302 606 286 542 789
[289] 28 981 906 9 389 461 602 435 704 140 475 836
[301] 452 30 815 849 894 407 893 340 144 742 47 238
[313] 622 753 437 675 597 819 387 741 757 419 854 191
[325] 847 587 71 726 689 245 127 310 562 952 993 56
[337] 293 94 710 371 550 749 609 235 645 664 796 153
[349] 536 285 811 180 330 544 264 870 383 519 22 252
[361] 553 832 973 991 53 300 963 16 580 516 844 374
[373] 12 520 889 484 640 24 824 23 765 977 398 866
[385] 172 282 297 349 911 11 391 503 329 601 838 858
[397] 209 541 839 272 735 6 457 703 591 223 618 493
[409] 219 563 460 181 532 570 257 942 687 121 274 91
[421] 777 174 388 943 856 279 25 81 373 239 488 283
[433] 41 248 899 738 83 110 100 362 863 579 567 450
[445] 569 215 584 306 161 828 375 891 848 141 400 970
[457] 447 902 716 949 693 593 212 582 456 436 265 762
[469] 231 401 142 50 614 621 315 869 929 851 367 38
[481] 535 923 242 256 588 879 900 224 731 288 46 51
[493] 833 295 837 113 105 805 706 941 402 617 733 480
[505] 658 199 477 529 984 499 358 613 928 724 778 382
[517] 552 610 817 169 607 771 197 825 498 808 114 39
[529] 292 907 27 903 13 793 48 112 2 61 565 492
[541] 909 262 379 708 408 320 807 951 578 890 935 69
[553] 123 163 323 791 135 852 521 627 959 152 343 992
[565] 413 80 918 939 128 130 194 769 368 271 201 670
[577] 972 218 331 311 439 159 922 118 797 551 526 865
[589] 760 468 78 491 634 423 572 370 527 985 352 459
[601] 625 612 504 506 62 138 309 548 685 628 596 861
[613] 501 350 915 84 925 680 877 873 184 775 514 465
[625] 788 960 278 978 136 443 476 54 723 259 304 908
[637] 990 590 205 574 139 154 754 366 291 494 187 203
[649] 421 29 826 273 589 892 148 533 92 914 74 258
[661] 87 665 268 18 701 764 633 326 603 799 19 835
[673] 583 156 228 284 86 803 359 717 785 810 967 97
[685] 397 581 662 196 275 507 720 814 116 171 756 247
[697] 711 157 206 463 558 913 946 691 164 974 924 718
[709] 936 692 945 132 729 386 241 101 269 489 813 855
[721] 615 766 954 438 339 750 624 364 812 221 384 885
[733] 927 845 736 896 75 713 145 696 989 49 745 623
[745] 559 886 10 162 686 654 637 183 474 841 761 277
[757] 409 523 232 454 673 482 204 831 316 798 332 79
[769] 96 490 7 290 240 630 126 122 372 577 312 820
[781] 790 853 338 502 455 82 88 134 656 280 412 511
[793] 626 864 976 66 937 666 354 702 979 17 528 32
[805] 895 843 964 365 715 732 546 170 648 629 862 619
[817] 472 917 679 539 571 882 604 947 146 360 214 158
[829] 509 333 168 827 98 200 818 337 451 249 434 585
[841] 294 307 334 108 406 938 427 37 115 782 102 466
[853] 176 804 904 495 33 635 678 106 632 620 446 336
[865] 415 266 31 888 255 222 636 747 342 234 93 185
[877] 781 298 317 481 649 63 500 95 846 767 995 321
[889] 887 471 216 651 324 822 430 89 150 684 962 538
credit_train<-credit[indx,]
credit_test<- credit[-indx,]
# create labels
which(colnames(credit)=='default')
[1] 17
credit_train_labels<- credit[indx,17]
credit_test_labels<- credit[-indx,17]
Check if there are any missing values:
library(Amelia)
Loading required package: Rcpp
package ‘Rcpp’ was built under R version 3.3.2##
## Amelia II: Multiple Imputation
## (Version 1.7.4, built: 2015-12-05)
## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
missmap(credit, main = "Missing values vs observed")
# number of missing values in each column
sapply(credit,function(x) sum(is.na(x)))
checking_balance months_loan_duration credit_history
0 0 0
purpose amount savings_balance
0 0 0
employment_duration percent_of_income years_at_residence
0 0 0
age other_credit housing
0 0 0
existing_loans_count job dependents
0 0 0
phone default
0 0
# number of unique values in each column
sapply(credit, function(x) length(unique(x)))
checking_balance months_loan_duration credit_history
4 33 5
purpose amount savings_balance
6 921 5
employment_duration percent_of_income years_at_residence
5 4 4
age other_credit housing
53 3 3
existing_loans_count job dependents
4 4 2
phone default
2 2
No missing values which is good.
Fit the logistic regression model, with all predictor variables
model <- glm(default ~.,family=binomial(link='logit'),data=credit_train)
model
Call: glm(formula = default ~ ., family = binomial(link = "logit"),
data = credit_train)
Coefficients:
(Intercept) checking_balance> 200 DM
-1.7449409 -0.7550219
checking_balance1 - 200 DM checking_balanceunknown
-0.3737206 -1.6528697
months_loan_duration credit_historygood
0.0256547 0.8459475
credit_historyperfect credit_historypoor
1.4342813 0.8678542
credit_historyvery good purposecar
1.4521062 0.2895620
purposecar0 purposeeducation
-1.0438381 0.8019085
purposefurniture/appliances purposerenovations
-0.1613657 0.4198889
amount savings_balance> 1000 DM
0.0001056 -0.9151967
savings_balance100 - 500 DM savings_balance500 - 1000 DM
-0.1456988 -0.3383594
savings_balanceunknown employment_duration> 7 years
-1.0950653 -0.5157216
employment_duration1 - 4 years employment_duration4 - 7 years
-0.2132003 -0.8815581
employment_durationunemployed percent_of_income
-0.0940415 0.2903974
years_at_residence age
0.0462153 -0.0133834
other_creditnone other_creditstore
-0.5011712 -0.4526129
housingown housingrent
-0.0906545 0.1756198
existing_loans_count jobskilled
0.2715070 0.0673919
jobunemployed jobunskilled
-0.1206304 -0.0322218
dependents phoneyes
0.0678525 -0.2847680
Degrees of Freedom: 899 Total (i.e. Null); 864 Residual
Null Deviance: 1108
Residual Deviance: 874 AIC: 946
summary(model)
Call:
glm(formula = default ~ ., family = binomial(link = "logit"),
data = credit_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8651 -0.7746 -0.4263 0.8285 2.7059
Coefficients:
Estimate Std. Error z value
(Intercept) -1.745e+00 9.242e-01 -1.888
checking_balance> 200 DM -7.550e-01 3.660e-01 -2.063
checking_balance1 - 200 DM -3.737e-01 2.136e-01 -1.750
checking_balanceunknown -1.653e+00 2.328e-01 -7.100
months_loan_duration 2.565e-02 9.164e-03 2.799
credit_historygood 8.459e-01 2.605e-01 3.248
credit_historyperfect 1.434e+00 4.407e-01 3.255
credit_historypoor 8.679e-01 3.326e-01 2.609
credit_historyvery good 1.452e+00 4.206e-01 3.452
purposecar 2.896e-01 3.189e-01 0.908
purposecar0 -1.044e+00 8.136e-01 -1.283
purposeeducation 8.019e-01 4.481e-01 1.790
purposefurniture/appliances -1.614e-01 3.144e-01 -0.513
purposerenovations 4.199e-01 5.820e-01 0.721
amount 1.056e-04 4.391e-05 2.406
savings_balance> 1000 DM -9.152e-01 5.155e-01 -1.775
savings_balance100 - 500 DM -1.457e-01 2.756e-01 -0.529
savings_balance500 - 1000 DM -3.384e-01 4.000e-01 -0.846
savings_balanceunknown -1.095e+00 2.705e-01 -4.048
employment_duration> 7 years -5.157e-01 2.948e-01 -1.749
employment_duration1 - 4 years -2.132e-01 2.349e-01 -0.908
employment_duration4 - 7 years -8.816e-01 2.956e-01 -2.982
employment_durationunemployed -9.404e-02 4.166e-01 -0.226
percent_of_income 2.904e-01 8.767e-02 3.312
years_at_residence 4.622e-02 8.649e-02 0.534
age -1.338e-02 8.979e-03 -1.490
other_creditnone -5.012e-01 2.349e-01 -2.134
other_creditstore -4.526e-01 4.348e-01 -1.041
housingown -9.065e-02 2.949e-01 -0.307
housingrent 1.756e-01 3.438e-01 0.511
existing_loans_count 2.715e-01 1.900e-01 1.429
jobskilled 6.739e-02 2.898e-01 0.233
jobunemployed -1.206e-01 6.220e-01 -0.194
jobunskilled -3.222e-02 3.486e-01 -0.092
dependents 6.785e-02 2.392e-01 0.284
phoneyes -2.848e-01 2.030e-01 -1.402
Pr(>|z|)
(Intercept) 0.059014 .
checking_balance> 200 DM 0.039131 *
checking_balance1 - 200 DM 0.080132 .
checking_balanceunknown 1.24e-12 ***
months_loan_duration 0.005120 **
credit_historygood 0.001163 **
credit_historyperfect 0.001135 **
credit_historypoor 0.009079 **
credit_historyvery good 0.000556 ***
purposecar 0.363944
purposecar0 0.199469
purposeeducation 0.073494 .
purposefurniture/appliances 0.607740
purposerenovations 0.470651
amount 0.016134 *
savings_balance> 1000 DM 0.075860 .
savings_balance100 - 500 DM 0.597062
savings_balance500 - 1000 DM 0.397623
savings_balanceunknown 5.16e-05 ***
employment_duration> 7 years 0.080267 .
employment_duration1 - 4 years 0.364042
employment_duration4 - 7 years 0.002864 **
employment_durationunemployed 0.821428
percent_of_income 0.000925 ***
years_at_residence 0.593108
age 0.136099
other_creditnone 0.032855 *
other_creditstore 0.297940
housingown 0.758559
housingrent 0.609492
existing_loans_count 0.152965
jobskilled 0.816108
jobunemployed 0.846212
jobunskilled 0.926358
dependents 0.776648
phoneyes 0.160776
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1107.90 on 899 degrees of freedom
Residual deviance: 873.97 on 864 degrees of freedom
AIC: 945.97
Number of Fisher Scoring iterations: 5
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: default
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 899 1107.90
checking_balance 3 112.037 896 995.86 < 2.2e-16 ***
months_loan_duration 1 30.031 895 965.83 4.252e-08 ***
credit_history 4 27.393 891 938.44 1.656e-05 ***
purpose 5 9.563 886 928.87 0.088608 .
amount 1 0.494 885 928.38 0.482128
savings_balance 4 20.249 881 908.13 0.000446 ***
employment_duration 4 10.396 877 897.74 0.034265 *
percent_of_income 1 9.329 876 888.41 0.002255 **
years_at_residence 1 0.472 875 887.93 0.492289
age 1 3.162 874 884.77 0.075375 .
other_credit 2 4.953 872 879.82 0.084033 .
housing 2 1.463 870 878.36 0.481278
existing_loans_count 1 1.862 869 876.49 0.172343
job 3 0.468 866 876.03 0.925841
dependents 1 0.073 865 875.95 0.786640
phone 1 1.983 864 873.97 0.159066
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the result, we can tell that the checking balance, months loan duration, credit history, saving balance, employment duration, percent of income and age are signifiacnt.Next, Let’t drop the insignificant terms and fit another regression model. Alpha=0.10
model <- glm(default ~ checking_balance + months_loan_duration + credit_history + percent_of_income + age,family=binomial(link='logit'),data=credit_train)
model
Call: glm(formula = default ~ checking_balance + months_loan_duration +
credit_history + percent_of_income + age, family = binomial(link = "logit"),
data = credit_train)
Coefficients:
(Intercept) checking_balance> 200 DM
-1.34219 -0.98414
checking_balance1 - 200 DM checking_balanceunknown
-0.49124 -1.83406
months_loan_duration credit_historygood
0.03102 0.57593
credit_historyperfect credit_historypoor
1.63343 0.74246
credit_historyvery good percent_of_income
1.39251 0.16950
age
-0.01213
Degrees of Freedom: 899 Total (i.e. Null); 889 Residual
Null Deviance: 1108
Residual Deviance: 930.8 AIC: 952.8
summary(model)
Call:
glm(formula = default ~ checking_balance + months_loan_duration +
credit_history + percent_of_income + age, family = binomial(link = "logit"),
data = credit_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7835 -0.8141 -0.4896 0.9488 2.3335
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.342194 0.427226 -3.142 0.001680 **
checking_balance> 200 DM -0.984138 0.344520 -2.857 0.004283 **
checking_balance1 - 200 DM -0.491239 0.195250 -2.516 0.011871 *
checking_balanceunknown -1.834064 0.214956 -8.532 < 2e-16 ***
months_loan_duration 0.031022 0.006642 4.671 3.00e-06 ***
credit_historygood 0.575926 0.205279 2.806 0.005023 **
credit_historyperfect 1.633426 0.413018 3.955 7.66e-05 ***
credit_historypoor 0.742460 0.312521 2.376 0.017515 *
credit_historyvery good 1.392513 0.367016 3.794 0.000148 ***
percent_of_income 0.169499 0.074052 2.289 0.022084 *
age -0.012125 0.007337 -1.653 0.098423 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1107.90 on 899 degrees of freedom
Residual deviance: 930.78 on 889 degrees of freedom
AIC: 952.78
Number of Fisher Scoring iterations: 4
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: default
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 899 1107.90
checking_balance 3 112.037 896 995.86 < 2.2e-16 ***
months_loan_duration 1 30.031 895 965.83 4.252e-08 ***
credit_history 4 27.393 891 938.44 1.656e-05 ***
percent_of_income 1 4.869 890 933.57 0.02735 *
age 1 2.787 889 930.78 0.09503 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Check Accuracy:
fitted.results <- predict(model,newdata=credit_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != credit_test$default)
print(paste('Accuracy',1-misClasificError))
[1] "Accuracy 0.82"
The accuracy is 0.82, which is very good.
ROC Method: Because this data set is so small, it is possible that the test data set does not contain both 0 and 1 values. If this happens the code will not run.And since the test data set is so small the ROC is not useful here but the code is provided. ‘’’ library(ROCR) p <- predict(model, newdata=credit_test, type=“response”) pr <- prediction(p, credit_test$default) prf <- performance(pr, measure = “tpr”, x.measure = “fpr”) plot(prf)
auc <- performance(pr, measure = “auc”) auc <- auc@y.values[[1]] auc ‘’’