Step 1: Collecting Data

The credit.csv data is obtained from the Machine Learning course website of Prof. Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/.

Step 2: Exploring & Preparing Data

The credit.csv dataset contains 1000 observations and 17 variables. The features describe the 1000 bank loan applicants: credit history, checking/savings balance, age, loan amount, loan purpose, job type, and so on. These serve as the independent variables used to predict the target, default, which indicates whether an applicant defaulted on the loan.

credit <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv")
str(credit)
## 'data.frame':    1000 obs. of  17 variables:
##  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
##  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
##  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
##  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
credit$default <- ifelse(credit$default == 'no',0,1)
head(credit$default)
## [1] 0 1 0 0 1 0
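
As a quick sanity check (not part of the original output), the overall class balance after recoding can be tabulated; roughly 70% of the applicants did not default and 30% did, consistent with the train/test proportions shown in the next step.

# Overall class balance of the recoded default column (0 = no default, 1 = default)
prop.table(table(credit$default)) * 100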

Out of the 1000 observations in the credit dataset, 90% are randomly sampled for the training set and the remaining 10% are held out as the test set. The last column, the default status, is also split out as the label vector for the training and test sets.

The training and test sets have roughly the same proportion of default and non-default cases.

set.seed(123)
indx = sample(1:nrow(credit), as.integer(0.9*nrow(credit)))

credit_train = credit[indx,]
credit_test = credit[-indx,]

prop.table(table(credit_train$default))*100
## 
##        0        1 
## 70.33333 29.66667
prop.table(table(credit_test$default))*100
## 
##  0  1 
## 67 33
credit_train_labels = credit[indx,17]
credit_test_labels = credit[-indx,17] 
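
The random split above happens to produce similar default proportions in the training and test sets. For a split that guarantees matching proportions, the caret package's createDataPartition() could be used instead; this is a minimal sketch (assuming caret is installed), not part of the original analysis.

# Stratified 90/10 split: sampling is done within each default class so the
# train/test class proportions match almost exactly. Requires the caret package.
library(caret)
set.seed(123)
strat_idx <- createDataPartition(factor(credit$default), p = 0.9, list = FALSE)
credit_train_strat <- credit[strat_idx, ]
credit_test_strat  <- credit[-strat_idx, ]
prop.table(table(credit_train_strat$default)) * 100
prop.table(table(credit_test_strat$default)) * 100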

The missmap() function from the Amelia package visualizes whether any values are missing in any of the features, which helps determine whether observations should be excluded or values imputed before modeling. Fortunately, there are no missing values, so little data cleaning is needed before proceeding to modeling.

library(Amelia)
## Warning: package 'Amelia' was built under R version 3.3.3
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.3.3
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(credit, main = "Missing values vs observed")

Another way to check for missing values, and to count the unique values of each feature, is to apply is.na() and unique() to every column with sapply().

sapply(credit,function(x) sum(is.na(x)))
##     checking_balance months_loan_duration       credit_history 
##                    0                    0                    0 
##              purpose               amount      savings_balance 
##                    0                    0                    0 
##  employment_duration    percent_of_income   years_at_residence 
##                    0                    0                    0 
##                  age         other_credit              housing 
##                    0                    0                    0 
## existing_loans_count                  job           dependents 
##                    0                    0                    0 
##                phone              default 
##                    0                    0
sapply(credit, function(x) length(unique(x)))
##     checking_balance months_loan_duration       credit_history 
##                    4                   33                    5 
##              purpose               amount      savings_balance 
##                    6                  921                    5 
##  employment_duration    percent_of_income   years_at_residence 
##                    5                    4                    4 
##                  age         other_credit              housing 
##                   53                    3                    3 
## existing_loans_count                  job           dependents 
##                    4                    4                    2 
##                phone              default 
##                    2                    2

Step 3: Model Training on the Data

Calling glm() with family = binomial(link = 'logit') fits a logistic regression model. This step trains a logistic regression on the training data, with default as the dependent variable and all the remaining features as independent predictors.

Looking at the p-value of each regression coefficient, many of the coefficients for the 16 predictors are not significant at the 0.05 level. This suggests that the terms with large p-values may not be useful predictors in the logistic regression model.

model <- glm(default ~.,family=binomial(link='logit'),data=credit_train)
model
## 
## Call:  glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = credit_train)
## 
## Coefficients:
##                    (Intercept)        checking_balance> 200 DM  
##                     -1.198e+00                      -9.538e-01  
##     checking_balance1 - 200 DM         checking_balanceunknown  
##                     -3.783e-01                      -1.849e+00  
##           months_loan_duration              credit_historygood  
##                      2.926e-02                       7.263e-01  
##          credit_historyperfect              credit_historypoor  
##                      1.458e+00                       6.598e-01  
##        credit_historyvery good                      purposecar  
##                      1.186e+00                       2.916e-01  
##                    purposecar0                purposeeducation  
##                     -7.621e-01                       8.124e-01  
##    purposefurniture/appliances              purposerenovations  
##                      1.860e-02                       6.106e-01  
##                         amount        savings_balance> 1000 DM  
##                      7.302e-05                      -1.019e+00  
##    savings_balance100 - 500 DM    savings_balance500 - 1000 DM  
##                     -2.208e-01                      -5.172e-01  
##         savings_balanceunknown    employment_duration> 7 years  
##                     -7.924e-01                      -3.467e-01  
## employment_duration1 - 4 years  employment_duration4 - 7 years  
##                     -2.966e-01                      -9.405e-01  
##  employment_durationunemployed               percent_of_income  
##                     -4.426e-02                       2.794e-01  
##             years_at_residence                             age  
##                     -3.908e-02                      -1.175e-02  
##               other_creditnone               other_creditstore  
##                     -6.540e-01                      -1.787e-01  
##                     housingown                     housingrent  
##                     -1.581e-01                       4.279e-01  
##           existing_loans_count                      jobskilled  
##                      2.293e-01                       1.295e-02  
##                  jobunemployed                    jobunskilled  
##                     -4.988e-01                      -8.307e-02  
##                     dependents                        phoneyes  
##                     -1.981e-02                      -1.885e-01  
## 
## Degrees of Freedom: 899 Total (i.e. Null);  864 Residual
## Null Deviance:       1094 
## Residual Deviance: 849.4     AIC: 921.4
summary(model)
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = credit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8914  -0.7670  -0.4055   0.8077   2.5444  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -1.198e+00  9.329e-01  -1.285  0.19895    
## checking_balance> 200 DM       -9.538e-01  3.739e-01  -2.551  0.01074 *  
## checking_balance1 - 200 DM     -3.783e-01  2.146e-01  -1.763  0.07796 .  
## checking_balanceunknown        -1.849e+00  2.417e-01  -7.649 2.02e-14 ***
## months_loan_duration            2.926e-02  9.345e-03   3.131  0.00174 ** 
## credit_historygood              7.263e-01  2.593e-01   2.801  0.00509 ** 
## credit_historyperfect           1.458e+00  4.459e-01   3.270  0.00107 ** 
## credit_historypoor              6.598e-01  3.397e-01   1.942  0.05213 .  
## credit_historyvery good         1.186e+00  4.593e-01   2.583  0.00979 ** 
## purposecar                      2.916e-01  3.350e-01   0.870  0.38414    
## purposecar0                    -7.621e-01  7.963e-01  -0.957  0.33854    
## purposeeducation                8.124e-01  4.605e-01   1.764  0.07771 .  
## purposefurniture/appliances     1.860e-02  3.293e-01   0.056  0.95495    
## purposerenovations              6.106e-01  5.938e-01   1.028  0.30376    
## amount                          7.302e-05  4.387e-05   1.664  0.09603 .  
## savings_balance> 1000 DM       -1.019e+00  5.547e-01  -1.836  0.06635 .  
## savings_balance100 - 500 DM    -2.208e-01  2.910e-01  -0.759  0.44799    
## savings_balance500 - 1000 DM   -5.172e-01  4.355e-01  -1.188  0.23492    
## savings_balanceunknown         -7.924e-01  2.617e-01  -3.028  0.00247 ** 
## employment_duration> 7 years   -3.467e-01  3.001e-01  -1.155  0.24791    
## employment_duration1 - 4 years -2.966e-01  2.422e-01  -1.225  0.22069    
## employment_duration4 - 7 years -9.405e-01  3.076e-01  -3.057  0.00223 ** 
## employment_durationunemployed  -4.426e-02  4.419e-01  -0.100  0.92023    
## percent_of_income               2.794e-01  8.798e-02   3.175  0.00150 ** 
## years_at_residence             -3.908e-02  8.734e-02  -0.448  0.65450    
## age                            -1.175e-02  9.318e-03  -1.261  0.20747    
## other_creditnone               -6.540e-01  2.483e-01  -2.634  0.00843 ** 
## other_creditstore              -1.787e-01  4.201e-01  -0.425  0.67056    
## housingown                     -1.581e-01  2.991e-01  -0.528  0.59719    
## housingrent                     4.279e-01  3.495e-01   1.224  0.22083    
## existing_loans_count            2.293e-01  1.917e-01   1.196  0.23163    
## jobskilled                      1.295e-02  2.911e-01   0.045  0.96450    
## jobunemployed                  -4.988e-01  7.341e-01  -0.679  0.49686    
## jobunskilled                   -8.307e-02  3.509e-01  -0.237  0.81288    
## dependents                     -1.981e-02  2.488e-01  -0.080  0.93654    
## phoneyes                       -1.885e-01  2.038e-01  -0.925  0.35487    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1094.42  on 899  degrees of freedom
## Residual deviance:  849.42  on 864  degrees of freedom
## AIC: 921.42
## 
## Number of Fisher Scoring iterations: 5
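
The coefficients above are on the log-odds scale. To interpret them more directly, they can be exponentiated into odds ratios; this is a small illustrative sketch, not part of the original output (confint() profiles the likelihood, so it may take a moment).

# Odds ratios for the full model: values > 1 increase the odds of default,
# values < 1 decrease them.
exp(coef(model))
# Odds ratios together with profile-likelihood 95% confidence intervals
exp(cbind(OR = coef(model), confint(model)))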

A further check for selecting significant predictors is to call anova() on the model with test = 'Chisq', which performs a chi-square test for each predictor. Terms are added sequentially, and for each term the test compares the residual deviance before and after the term is added (starting from the null model containing only the intercept) and computes a p-value for that reduction. Terms that produce the largest drops in residual deviance and/or have p-values below the alpha level are the ones worth keeping.

anova(model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: default
## 
## Terms added sequentially (first to last)
## 
## 
##                      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                   899    1094.42              
## checking_balance      3  129.182       896     965.24 < 2.2e-16 ***
## months_loan_duration  1   30.423       895     934.82 3.475e-08 ***
## credit_history        4   23.337       891     911.48 0.0001084 ***
## purpose               5    6.735       886     904.75 0.2410915    
## amount                1    0.098       885     904.65 0.7541559    
## savings_balance       4   15.302       881     889.35 0.0041137 ** 
## employment_duration   4   11.608       877     877.74 0.0205210 *  
## percent_of_income     1    9.391       876     868.35 0.0021806 ** 
## years_at_residence    1    0.001       875     868.35 0.9801547    
## age                   1    3.046       874     865.30 0.0809147 .  
## other_credit          2    7.308       872     857.99 0.0258883 *  
## housing               2    5.883       870     852.11 0.0527764 .  
## existing_loans_count  1    1.215       869     850.90 0.2704046    
## job                   3    0.609       866     850.29 0.8943817    
## dependents            1    0.007       865     850.28 0.9325886    
## phone                 1    0.860       864     849.42 0.3537464    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
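
To make the mechanics of this table concrete, the p-value in the first row can be reproduced by hand: adding checking_balance drops the residual deviance from 1094.42 to 965.24, a reduction of 129.18 on 3 degrees of freedom, which is referred to a chi-square distribution.

# Upper-tail chi-square probability for the deviance drop of checking_balance;
# it agrees with the "< 2.2e-16" reported in the anova table above.
pchisq(1094.42 - 965.24, df = 3, lower.tail = FALSE)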

Step 4: Evaluating Model Performance

The model that includes all the predictors is used to predict the default status of the observations in the test set. Predicted probabilities above 0.5 are coded as default (1) and the rest as no default (0). The resulting error rate and accuracy are 0.28 and 0.72, respectively.

fitted.results <- predict(model,newdata=credit_test,type='response')
head(fitted.results)
##        27        28        76        96       101       104 
## 0.1309472 0.1933973 0.3057157 0.9194157 0.1966726 0.1547055
fitted.results <- ifelse(fitted.results > 0.5,1,0)
head(fitted.results)
##  27  28  76  96 101 104 
##   0   0   0   1   0   0
head(credit_test$default)
## [1] 0 0 0 1 0 0
misClasificError <- mean(fitted.results != credit_test$default)
misClasificError
## [1] 0.28
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.72"
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
p <- predict(model, newdata=credit_test, type="response")
pr <- prediction(p, credit_test$default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
abline(0,1,lwd = 2, lty = 2)

The AUC (area under the ROC curve) is 0.77, meaning the model ranks a randomly chosen defaulter above a randomly chosen non-defaulter about 77% of the time; this is generally considered an acceptable/fair model.

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7715966
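
Beyond a single accuracy figure, a confusion matrix shows how the errors split between false positives and false negatives; a quick sketch with base R's table(), not included in the original output.

# Rows are the actual default status, columns are the predictions at the 0.5 cutoff;
# the off-diagonal cells are the misclassified test observations.
table(actual = credit_test$default, predicted = fitted.results)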

Step 5: Model Improvement Using Selected Predictors

Based on the deviance reductions and p-values of the predictor terms above, a new logistic regression model is trained using only the predictors that are significant at the 0.05 level.

This yields a slightly better logistic model with a smaller AIC (Akaike information criterion), and most of the predictor terms in the model are significant at the 0.05 level.

model2 <- glm(default ~ checking_balance + months_loan_duration + 
              credit_history + savings_balance + percent_of_income +
                employment_duration + housing,
              family=binomial(link='logit'),data=credit_train)
model2
## 
## Call:  glm(formula = default ~ checking_balance + months_loan_duration + 
##     credit_history + savings_balance + percent_of_income + employment_duration + 
##     housing, family = binomial(link = "logit"), data = credit_train)
## 
## Coefficients:
##                    (Intercept)        checking_balance> 200 DM  
##                       -1.45185                        -1.01270  
##     checking_balance1 - 200 DM         checking_balanceunknown  
##                       -0.35064                        -1.77194  
##           months_loan_duration              credit_historygood  
##                        0.03698                         0.55607  
##          credit_historyperfect              credit_historypoor  
##                        1.49740                         0.61060  
##        credit_historyvery good        savings_balance> 1000 DM  
##                        1.30609                        -0.99013  
##    savings_balance100 - 500 DM    savings_balance500 - 1000 DM  
##                       -0.19395                        -0.69862  
##         savings_balanceunknown               percent_of_income  
##                       -0.78489                         0.22578  
##   employment_duration> 7 years  employment_duration1 - 4 years  
##                       -0.46995                        -0.35614  
## employment_duration4 - 7 years   employment_durationunemployed  
##                       -0.95738                        -0.27014  
##                     housingown                     housingrent  
##                       -0.21058                         0.37292  
## 
## Degrees of Freedom: 899 Total (i.e. Null);  880 Residual
## Null Deviance:       1094 
## Residual Deviance: 870.5     AIC: 910.5
summary(model2)
## 
## Call:
## glm(formula = default ~ checking_balance + months_loan_duration + 
##     credit_history + savings_balance + percent_of_income + employment_duration + 
##     housing, family = binomial(link = "logit"), data = credit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7095  -0.7695  -0.4301   0.8005   2.5989  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -1.451854   0.487529  -2.978 0.002901 ** 
## checking_balance> 200 DM       -1.012696   0.362807  -2.791 0.005250 ** 
## checking_balance1 - 200 DM     -0.350641   0.207323  -1.691 0.090784 .  
## checking_balanceunknown        -1.771940   0.233482  -7.589 3.22e-14 ***
## months_loan_duration            0.036985   0.007206   5.132 2.86e-07 ***
## credit_historygood              0.556071   0.212850   2.613 0.008988 ** 
## credit_historyperfect           1.497397   0.434932   3.443 0.000576 ***
## credit_historypoor              0.610596   0.328685   1.858 0.063213 .  
## credit_historyvery good         1.306086   0.409680   3.188 0.001432 ** 
## savings_balance> 1000 DM       -0.990130   0.536980  -1.844 0.065200 .  
## savings_balance100 - 500 DM    -0.193953   0.285685  -0.679 0.497197    
## savings_balance500 - 1000 DM   -0.698623   0.426470  -1.638 0.101390    
## savings_balanceunknown         -0.784891   0.255181  -3.076 0.002099 ** 
## percent_of_income               0.225784   0.077623   2.909 0.003629 ** 
## employment_duration> 7 years   -0.469951   0.267167  -1.759 0.078574 .  
## employment_duration1 - 4 years -0.356144   0.235891  -1.510 0.131099    
## employment_duration4 - 7 years -0.957377   0.295309  -3.242 0.001187 ** 
## employment_durationunemployed  -0.270136   0.378146  -0.714 0.474998    
## housingown                     -0.210577   0.272871  -0.772 0.440288    
## housingrent                     0.372923   0.326219   1.143 0.252969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1094.42  on 899  degrees of freedom
## Residual deviance:  870.48  on 880  degrees of freedom
## AIC: 910.48
## 
## Number of Fisher Scoring iterations: 5

Using anova() again to look at the deviance reduction and p-value of each selected predictor, all of them are significant at the 0.05 level, meaning each term helps explain the variation in the dependent variable default.

anova(model2, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: default
## 
## Terms added sequentially (first to last)
## 
## 
##                      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                   899    1094.42              
## checking_balance      3  129.182       896     965.24 < 2.2e-16 ***
## months_loan_duration  1   30.423       895     934.82 3.475e-08 ***
## credit_history        4   23.337       891     911.48 0.0001084 ***
## savings_balance       4   14.977       887     896.51 0.0047495 ** 
## percent_of_income     1    7.285       886     889.22 0.0069517 ** 
## employment_duration   4   11.637       882     877.58 0.0202691 *  
## housing               2    7.101       880     870.48 0.0287064 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
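
As an aside (not done in the original analysis), an automated alternative to hand-picking terms is stepwise selection by AIC with base R's step(); a minimal sketch starting from the full model.

# Backward stepwise selection by AIC; the resulting terms may differ slightly
# from the hand-picked predictors in model2.
model_step <- step(model, direction = "backward", trace = 0)
formula(model_step)
AIC(model_step)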

The logistic regression model with the selected predictors is then applied to the test set to predict the final default status. Again, fitted.results is a vector of probabilities between 0 and 1, which must be converted to binary 0/1 values for comparison with the actual test responses.

As it turns out, the second model gives a 26% error rate and 74% accuracy, a small improvement in accuracy.

fitted.results <- predict(model2,newdata=credit_test,type='response')
head(fitted.results)
##        27        28        76        96       101       104 
## 0.2172300 0.2668529 0.3601173 0.9394795 0.2372276 0.2403150
fitted.results <- ifelse(fitted.results > 0.5,1,0)
head(fitted.results)
##  27  28  76  96 101 104 
##   0   0   0   1   0   0
head(credit_test$default)
## [1] 0 0 0 1 0 0
misClasificError <- mean(fitted.results != credit_test$default)
misClasificError
## [1] 0.26
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.74"
library(ROCR)
p <- predict(model2, newdata=credit_test, type="response")
pr <- prediction(p, credit_test$default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
abline(0,1,lwd = 2, lty = 2)

The AUC of the ROC curve for this model is 0.75, meaning it ranks a randomly chosen defaulter above a randomly chosen non-defaulter about 75% of the time; it is still considered an acceptable/fair classifier.

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7453641
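
For a visual comparison of the two models (not one of the original figures), both ROC curves can be recomputed and overlaid on the same axes with ROCR.

# Overlay the ROC curves of the full model and the reduced model on the test set
p_full    <- predict(model,  newdata = credit_test, type = "response")
p_reduced <- predict(model2, newdata = credit_test, type = "response")
prf_full    <- performance(prediction(p_full,    credit_test$default), "tpr", "fpr")
prf_reduced <- performance(prediction(p_reduced, credit_test$default), "tpr", "fpr")
plot(prf_full, col = "blue")
plot(prf_reduced, col = "red", add = TRUE)
abline(0, 1, lty = 2)
legend("bottomright", legend = c("full model", "reduced model"),
       col = c("blue", "red"), lty = 1)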

Conclusion: A logistic regression model was first fit to the credit dataset using all the features to predict default status in the test set; its AIC is 921, accuracy 72%, and ROC AUC 0.77. An improved model was then fit using only the predictors significant at the 0.05 level; its AIC is 911, accuracy 74%, and ROC AUC 0.75. Although the AUC decreases slightly, both values fall in the same range and are considered acceptable/fair for predicting future values, while the higher accuracy and smaller AIC confirm that the second model is the better one.
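
Since the reduced model is nested inside the full model, a likelihood-ratio test can formalize this comparison; a minimal sketch, not run in the original analysis.

# Likelihood-ratio (chi-square) test of the reduced model against the full model:
# a large p-value means the dropped terms add little explanatory power.
anova(model2, model, test = "Chisq")
# Side-by-side AIC comparison of the two models
AIC(model, model2)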