The credit.csv data is obtained from the Machine Learning course website of Prof. Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/.
The credit.csv dataset contains 1000 observations and 17 features. The features contains information for the 1000 bank loan applicants such as their credit history, checking/saving balance, their age, the amount to borrow, purpose of using the loan, and job type etc. These are some of the independent variables used to predict the final credit status, and whether one will undergo loan default.
credit <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv")
str(credit)
## 'data.frame': 1000 obs. of 17 variables:
## $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
## $ purpose : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
## $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
## $ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
## $ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_credit : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ housing : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
## $ existing_loans_count: int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
## $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ phone : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
credit$default <- ifelse(credit$default == 'no',0,1)
head(credit$default)
## [1] 0 1 0 0 1 0
Out of the 1000 observations in the credit dataset, 90% of them is randomly sampled out to used for the trained dataset and the rest used for tested dataset. Similarly, splitting the last feature, the default status as the labels for each trained and tested dataset.
The trained and tested dataset has about the same amount of yes/no default (default/not default).
set.seed(123)
indx = sample(1:nrow(credit), as.integer(0.9*nrow(credit)))
credit_train = credit[indx,]
credit_test = credit[-indx,]
prop.table(table(credit_train$default))*100
##
## 0 1
## 70.33333 29.66667
prop.table(table(credit_test$default))*100
##
## 0 1
## 67 33
credit_train_labels = credit[indx,17]
credit_test_labels = credit[-indx,17]
Using the Amelia package and its missmap function to see if there are any value missing in any of the features. This process will help determine whether to exclude or replace the values before modeling. Fortunately, there’s no missing value, so there’s no need to do much of the data cleaning before preceeding to modeling.
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.3.3
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.3.3
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(credit, main = "Missing values vs observed")
Another way to see if there’s any missing/unique values by features using the is.na() and unique() function along with the sapply() function for the data.
sapply(credit,function(x) sum(is.na(x)))
## checking_balance months_loan_duration credit_history
## 0 0 0
## purpose amount savings_balance
## 0 0 0
## employment_duration percent_of_income years_at_residence
## 0 0 0
## age other_credit housing
## 0 0 0
## existing_loans_count job dependents
## 0 0 0
## phone default
## 0 0
sapply(credit, function(x) length(unique(x)))
## checking_balance months_loan_duration credit_history
## 4 33 5
## purpose amount savings_balance
## 6 921 5
## employment_duration percent_of_income years_at_residence
## 5 4 4
## age other_credit housing
## 53 3 3
## existing_loans_count job dependents
## 4 4 2
## phone default
## 2 2
The use of glm() function, along with setting family = binomial(link=‘logit’) will set the condition to train a logistic regression model. The step will train a logistic regression on the trained data setting default as the dependent variable and the rest of the features as independent predictors.
By looking at the p-value for each regression parameter coefficient, most of the 16 predictors coefficient are not significant at alpha equals 0.05 level. This suggests that those coefficient terms with large p-values might not be good predictors for the logsitic regression model.
model <- glm(default ~.,family=binomial(link='logit'),data=credit_train)
model
##
## Call: glm(formula = default ~ ., family = binomial(link = "logit"),
## data = credit_train)
##
## Coefficients:
## (Intercept) checking_balance> 200 DM
## -1.198e+00 -9.538e-01
## checking_balance1 - 200 DM checking_balanceunknown
## -3.783e-01 -1.849e+00
## months_loan_duration credit_historygood
## 2.926e-02 7.263e-01
## credit_historyperfect credit_historypoor
## 1.458e+00 6.598e-01
## credit_historyvery good purposecar
## 1.186e+00 2.916e-01
## purposecar0 purposeeducation
## -7.621e-01 8.124e-01
## purposefurniture/appliances purposerenovations
## 1.860e-02 6.106e-01
## amount savings_balance> 1000 DM
## 7.302e-05 -1.019e+00
## savings_balance100 - 500 DM savings_balance500 - 1000 DM
## -2.208e-01 -5.172e-01
## savings_balanceunknown employment_duration> 7 years
## -7.924e-01 -3.467e-01
## employment_duration1 - 4 years employment_duration4 - 7 years
## -2.966e-01 -9.405e-01
## employment_durationunemployed percent_of_income
## -4.426e-02 2.794e-01
## years_at_residence age
## -3.908e-02 -1.175e-02
## other_creditnone other_creditstore
## -6.540e-01 -1.787e-01
## housingown housingrent
## -1.581e-01 4.279e-01
## existing_loans_count jobskilled
## 2.293e-01 1.295e-02
## jobunemployed jobunskilled
## -4.988e-01 -8.307e-02
## dependents phoneyes
## -1.981e-02 -1.885e-01
##
## Degrees of Freedom: 899 Total (i.e. Null); 864 Residual
## Null Deviance: 1094
## Residual Deviance: 849.4 AIC: 921.4
summary(model)
##
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"),
## data = credit_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8914 -0.7670 -0.4055 0.8077 2.5444
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.198e+00 9.329e-01 -1.285 0.19895
## checking_balance> 200 DM -9.538e-01 3.739e-01 -2.551 0.01074 *
## checking_balance1 - 200 DM -3.783e-01 2.146e-01 -1.763 0.07796 .
## checking_balanceunknown -1.849e+00 2.417e-01 -7.649 2.02e-14 ***
## months_loan_duration 2.926e-02 9.345e-03 3.131 0.00174 **
## credit_historygood 7.263e-01 2.593e-01 2.801 0.00509 **
## credit_historyperfect 1.458e+00 4.459e-01 3.270 0.00107 **
## credit_historypoor 6.598e-01 3.397e-01 1.942 0.05213 .
## credit_historyvery good 1.186e+00 4.593e-01 2.583 0.00979 **
## purposecar 2.916e-01 3.350e-01 0.870 0.38414
## purposecar0 -7.621e-01 7.963e-01 -0.957 0.33854
## purposeeducation 8.124e-01 4.605e-01 1.764 0.07771 .
## purposefurniture/appliances 1.860e-02 3.293e-01 0.056 0.95495
## purposerenovations 6.106e-01 5.938e-01 1.028 0.30376
## amount 7.302e-05 4.387e-05 1.664 0.09603 .
## savings_balance> 1000 DM -1.019e+00 5.547e-01 -1.836 0.06635 .
## savings_balance100 - 500 DM -2.208e-01 2.910e-01 -0.759 0.44799
## savings_balance500 - 1000 DM -5.172e-01 4.355e-01 -1.188 0.23492
## savings_balanceunknown -7.924e-01 2.617e-01 -3.028 0.00247 **
## employment_duration> 7 years -3.467e-01 3.001e-01 -1.155 0.24791
## employment_duration1 - 4 years -2.966e-01 2.422e-01 -1.225 0.22069
## employment_duration4 - 7 years -9.405e-01 3.076e-01 -3.057 0.00223 **
## employment_durationunemployed -4.426e-02 4.419e-01 -0.100 0.92023
## percent_of_income 2.794e-01 8.798e-02 3.175 0.00150 **
## years_at_residence -3.908e-02 8.734e-02 -0.448 0.65450
## age -1.175e-02 9.318e-03 -1.261 0.20747
## other_creditnone -6.540e-01 2.483e-01 -2.634 0.00843 **
## other_creditstore -1.787e-01 4.201e-01 -0.425 0.67056
## housingown -1.581e-01 2.991e-01 -0.528 0.59719
## housingrent 4.279e-01 3.495e-01 1.224 0.22083
## existing_loans_count 2.293e-01 1.917e-01 1.196 0.23163
## jobskilled 1.295e-02 2.911e-01 0.045 0.96450
## jobunemployed -4.988e-01 7.341e-01 -0.679 0.49686
## jobunskilled -8.307e-02 3.509e-01 -0.237 0.81288
## dependents -1.981e-02 2.488e-01 -0.080 0.93654
## phoneyes -1.885e-01 2.038e-01 -0.925 0.35487
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1094.42 on 899 degrees of freedom
## Residual deviance: 849.42 on 864 degrees of freedom
## AIC: 921.42
##
## Number of Fisher Scoring iterations: 5
A further confirmation for selecting significant predictors can be done by using anova() on model specified test = ‘Chisq’ to make chi-square test for each predictor. The chi-square test compare difference between the null and residual deviance and to compute a p-value to test whehter adding the term is significant in addition to the null model that contains only the intercept. The residuals deviance is arranged in an order of descending value and terms should be selected for the greatest decreaed in residual deviance and/or p-value less than the alpha level.
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: default
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 899 1094.42
## checking_balance 3 129.182 896 965.24 < 2.2e-16 ***
## months_loan_duration 1 30.423 895 934.82 3.475e-08 ***
## credit_history 4 23.337 891 911.48 0.0001084 ***
## purpose 5 6.735 886 904.75 0.2410915
## amount 1 0.098 885 904.65 0.7541559
## savings_balance 4 15.302 881 889.35 0.0041137 **
## employment_duration 4 11.608 877 877.74 0.0205210 *
## percent_of_income 1 9.391 876 868.35 0.0021806 **
## years_at_residence 1 0.001 875 868.35 0.9801547
## age 1 3.046 874 865.30 0.0809147 .
## other_credit 2 7.308 872 857.99 0.0258883 *
## housing 2 5.883 870 852.11 0.0527764 .
## existing_loans_count 1 1.215 869 850.90 0.2704046
## job 3 0.609 866 850.29 0.8943817
## dependents 1 0.007 865 850.28 0.9325886
## phone 1 0.860 864 849.42 0.3537464
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use the model includes all the predictors to make prediction on default status of obbservation in the tested dataset. The error and accuracy rate is 0.28 and 0.72, respectively.
fitted.results <- predict(model,newdata=credit_test,type='response')
head(fitted.results)
## 27 28 76 96 101 104
## 0.1309472 0.1933973 0.3057157 0.9194157 0.1966726 0.1547055
fitted.results <- ifelse(fitted.results > 0.5,1,0)
head(fitted.results)
## 27 28 76 96 101 104
## 0 0 0 1 0 0
head(credit_test$default)
## [1] 0 0 0 1 0 0
misClasificError <- mean(fitted.results != credit_test$default)
misClasificError
## [1] 0.28
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.72"
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(model, newdata=credit_test, type="response")
pr <- prediction(p, credit_test$default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
abline(0,1,lwd = 2, lty = 2)
AUC (area under the curve) of the ROC curve is 0.771, which means 77% good at identifying positive values, an acceptable/fair model.
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7715966
Based on the difference in deviance and p-values of each predictor terms. A new logistic regression model is trained using only the predictors that has has a significant p-value at the 0.05 level as listed above.
Clearly, this is a slightly better logistic model with a smaller AIC (Akaki Information criterionn) value. Most of the predictor terms in the model are significant at the 0.05 level.
model2 <- glm(default ~ checking_balance + months_loan_duration +
credit_history + savings_balance + percent_of_income +
employment_duration + housing,
family=binomial(link='logit'),data=credit_train)
model2
##
## Call: glm(formula = default ~ checking_balance + months_loan_duration +
## credit_history + savings_balance + percent_of_income + employment_duration +
## housing, family = binomial(link = "logit"), data = credit_train)
##
## Coefficients:
## (Intercept) checking_balance> 200 DM
## -1.45185 -1.01270
## checking_balance1 - 200 DM checking_balanceunknown
## -0.35064 -1.77194
## months_loan_duration credit_historygood
## 0.03698 0.55607
## credit_historyperfect credit_historypoor
## 1.49740 0.61060
## credit_historyvery good savings_balance> 1000 DM
## 1.30609 -0.99013
## savings_balance100 - 500 DM savings_balance500 - 1000 DM
## -0.19395 -0.69862
## savings_balanceunknown percent_of_income
## -0.78489 0.22578
## employment_duration> 7 years employment_duration1 - 4 years
## -0.46995 -0.35614
## employment_duration4 - 7 years employment_durationunemployed
## -0.95738 -0.27014
## housingown housingrent
## -0.21058 0.37292
##
## Degrees of Freedom: 899 Total (i.e. Null); 880 Residual
## Null Deviance: 1094
## Residual Deviance: 870.5 AIC: 910.5
summary(model2)
##
## Call:
## glm(formula = default ~ checking_balance + months_loan_duration +
## credit_history + savings_balance + percent_of_income + employment_duration +
## housing, family = binomial(link = "logit"), data = credit_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7095 -0.7695 -0.4301 0.8005 2.5989
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.451854 0.487529 -2.978 0.002901 **
## checking_balance> 200 DM -1.012696 0.362807 -2.791 0.005250 **
## checking_balance1 - 200 DM -0.350641 0.207323 -1.691 0.090784 .
## checking_balanceunknown -1.771940 0.233482 -7.589 3.22e-14 ***
## months_loan_duration 0.036985 0.007206 5.132 2.86e-07 ***
## credit_historygood 0.556071 0.212850 2.613 0.008988 **
## credit_historyperfect 1.497397 0.434932 3.443 0.000576 ***
## credit_historypoor 0.610596 0.328685 1.858 0.063213 .
## credit_historyvery good 1.306086 0.409680 3.188 0.001432 **
## savings_balance> 1000 DM -0.990130 0.536980 -1.844 0.065200 .
## savings_balance100 - 500 DM -0.193953 0.285685 -0.679 0.497197
## savings_balance500 - 1000 DM -0.698623 0.426470 -1.638 0.101390
## savings_balanceunknown -0.784891 0.255181 -3.076 0.002099 **
## percent_of_income 0.225784 0.077623 2.909 0.003629 **
## employment_duration> 7 years -0.469951 0.267167 -1.759 0.078574 .
## employment_duration1 - 4 years -0.356144 0.235891 -1.510 0.131099
## employment_duration4 - 7 years -0.957377 0.295309 -3.242 0.001187 **
## employment_durationunemployed -0.270136 0.378146 -0.714 0.474998
## housingown -0.210577 0.272871 -0.772 0.440288
## housingrent 0.372923 0.326219 1.143 0.252969
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1094.42 on 899 degrees of freedom
## Residual deviance: 870.48 on 880 degrees of freedom
## AIC: 910.48
##
## Number of Fisher Scoring iterations: 5
Using the anova() to look at deviance difference and p-values of each selected predictors. All of them are considered significant at the 0.05 level, which means that they should be added in the model to better explained the variation possessed by the dependent variable default.
anova(model2, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: default
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 899 1094.42
## checking_balance 3 129.182 896 965.24 < 2.2e-16 ***
## months_loan_duration 1 30.423 895 934.82 3.475e-08 ***
## credit_history 4 23.337 891 911.48 0.0001084 ***
## savings_balance 4 14.977 887 896.51 0.0047495 **
## percent_of_income 1 7.285 886 889.22 0.0069517 **
## employment_duration 4 11.637 882 877.58 0.0202691 *
## housing 2 7.101 880 870.48 0.0287064 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Applied the logistic regression model with few selected predictors on the tested dataset to make prediction on the final default status. Again, the fitted.results are a vector of probability value between 0 to 1, which should be converted to binomial 0 or 1 for comparison with the actual tested response values.
As it turns out, using the second model will give a 26% error and a 74% accuracy for the model, a little imrprovement on the accuracy.
fitted.results <- predict(model2,newdata=credit_test,type='response')
head(fitted.results)
## 27 28 76 96 101 104
## 0.2172300 0.2668529 0.3601173 0.9394795 0.2372276 0.2403150
fitted.results <- ifelse(fitted.results > 0.5,1,0)
head(fitted.results)
## 27 28 76 96 101 104
## 0 0 0 1 0 0
head(credit_test$default)
## [1] 0 0 0 1 0 0
misClasificError <- mean(fitted.results != credit_test$default)
misClasificError
## [1] 0.26
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.74"
library(ROCR)
p <- predict(model2, newdata=credit_test, type="response")
pr <- prediction(p, credit_test$default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
abline(0,1,lwd = 2, lty = 2)
The AUC (area under the curve) of the ROC curve is computed to be 0.75. It is said to be about 75% good at identifying positive value and it is considered a acceptable/fair classifier model.
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7453641
Conclusion: Initially, A logistic regression is applied to the credit dataset using all the features to make prediction of default status in the tested dataset. The AIC is 921, accuracy is 72% and AUC of the ROC curve is 0.77. An improved model is done using few selective significant predictors at the alpha 0.05 level. The AIC of the improved model is 911, accuracy is 74% and the AUC of the ROC curve is 0.75. Although the AUC of the ROC curve decreases a little, but both are in the same range which are considered acceptable/fair model for prdicting future values, the accuracy is confirmed to be higher for the second model, and AIC confirms it is a better model.