Credit Classification Project

Reading the Dataset

## Added an extra column Credit_Type to facilitate exploration. This is derived from status with values Good when status=0 and Bad when status=1
gcd <- read.csv('GCD_Data.csv')

## Summary of the data
summary(gcd)

##  checkin_acc           duration    credit_history         amount     
##  Length:1000        Min.   : 4.0   Length:1000        Min.   :  250  
##  Class :character   1st Qu.:12.0   Class :character   1st Qu.: 1366  
##  Mode  :character   Median :18.0   Mode  :character   Median : 2320  
##                     Mean   :20.9                      Mean   : 3271  
##                     3rd Qu.:24.0                      3rd Qu.: 3972  
##                     Max.   :72.0                      Max.   :18424  
##  savings_acc        present_emp_since    inst_rate     personal_status   
##  Length:1000        Length:1000        Min.   :1.000   Length:1000       
##  Class :character   Class :character   1st Qu.:2.000   Class :character  
##  Mode  :character   Mode  :character   Median :3.000   Mode  :character  
##                                        Mean   :2.973                     
##                                        3rd Qu.:4.000                     
##                                        Max.   :4.000                     
##  residing_since       age         inst_plans         num_credits   
##  Min.   :1.000   Min.   :19.00   Length:1000        Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:27.00   Class :character   1st Qu.:1.000  
##  Median :3.000   Median :33.00   Mode  :character   Median :1.000  
##  Mean   :2.845   Mean   :35.55                      Mean   :1.407  
##  3rd Qu.:4.000   3rd Qu.:42.00                      3rd Qu.:2.000  
##  Max.   :4.000   Max.   :75.00                      Max.   :4.000  
##      job                status    Credit_Type       
##  Length:1000        Min.   :0.0   Length:1000       
##  Class :character   1st Qu.:0.0   Class :character  
##  Mode  :character   Median :0.0   Mode  :character  
##                     Mean   :0.3                     
##                     3rd Qu.:1.0                     
##                     Max.   :1.0

DATA PRE-PROCESSING

The data has no missing values or NAs. Outliers are found in the variables amount, age and duration, but they do not look like random or wrongly entered values. For example, amount has maximum values of 15857 and 18424, which are far above the median of 2320; these are not treated as outliers because applicants may legitimately request larger loans. Similarly, age values above 66 show up as outliers, but they are left untreated because they could be genuine older applicants.
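The two checks described above (no missing values, outliers flagged but kept) can be sketched as follows. This is a minimal illustration on a small made-up data frame called toy, not on the GCD data; flag_outliers is a hypothetical helper name, and the 1.5 x IQR rule is assumed because it is the rule box plots use by default.

```r
# Toy stand-in for the real data (values are made up for illustration).
toy <- data.frame(
  amount   = c(250, 1200, 1366, 1900, 2320, 2800, 3100, 3972, 4500, 18424),
  age      = c(19, 22, 25, 27, 30, 33, 36, 42, 48, 75),
  duration = c(4, 6, 12, 12, 18, 18, 24, 24, 36, 72)
)

# Count missing values per column; all zeros means nothing to impute.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Hypothetical helper: values lying more than 1.5 * IQR beyond the hinges,
# the same rule boxplot() uses for the points drawn past the whiskers.
flag_outliers <- function(x) boxplot.stats(x)$out

# One vector of flagged values per column; flagged values are only listed,
# not removed, mirroring the decision to keep them.
outliers <- lapply(toy, flag_outliers)
print(outliers)
```

On the real data the same calls would be applied to gcd$amount, gcd$age and gcd$duration.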

EXPLORATORY ANALYSIS

Credit Type and Amount

library(ggplot2)
library(plotly)
plotx4 = ggplot(data = gcd, aes(x = Credit_Type, y = amount)) + geom_boxplot(fill = "red")
ggplotly(plotx4)

The amount refers to the loan amount availed or requested by the customer. The minimum amount among bad credits is 433, whereas the minimum among good credits is 250. In this dataset, therefore, every applicant with a loan amount from 250 up to (but not including) 433 is a good credit.
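The per-class minimum read off the box plot can also be checked numerically. The sketch below assumes a data frame with Credit_Type and amount columns, as in gcd; the rows themselves are made up for illustration.

```r
# Made-up rows with the same two columns as the real data.
toy <- data.frame(
  Credit_Type = c("Good", "Good", "Good", "Bad", "Bad", "Bad"),
  amount      = c(250, 1500, 3000, 433, 2000, 18424)
)

# Smallest and largest amount within each credit type.
min_by_type <- tapply(toy$amount, toy$Credit_Type, min)
max_by_type <- tapply(toy$amount, toy$Credit_Type, max)
print(rbind(min = min_by_type, max = max_by_type))

# Rows below the smallest "Bad" amount can only be "Good".
min_bad <- as.numeric(min_by_type["Bad"])
below   <- toy[toy$amount < min_bad, ]
print(table(below$Credit_Type))
```

The same tapply() calls on duration would give the per-class ranges discussed for the duration box plot.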

Credit Type and Duration

plotx5 = ggplot(data = gcd , aes(x = Credit_Type, y = duration)) + geom_boxplot(fill = "red")
ggplotly(plotx5)

Duration is the number of months for which the credit is given. The box plot shows that the maximum duration is 60 for good credit and 72 for bad credit, so records with a duration above 60 and up to 72 are all bad credit. Similarly, records with a duration of 4 to 6 are all good credit. Good-credit applicants with a duration of more than 45 months appear as outliers in the plot, but these may be exceptional cases and are therefore not treated.

Credit Type and Residing_since

plotx6 = ggplot(data = gcd, aes(x = Credit_Type, y = residing_since)) + geom_boxplot(fill = "red")
ggplotly(plotx6)

Residing since gives the number of years the applicant has lived at the current location. Its distribution looks the same for Good and Bad credit, so this variable does not appear to separate the two classes.

Splitting the data into training and validation sets. The training set is a random 70% of the rows; the validation set is the remaining 30%.

set.seed(1)

## Applying sampling method to select 70% of the rows as training rows
trainrows = sample(row.names(gcd),dim(gcd)[1]*0.7)
trainingdataset =  gcd[trainrows, ]

## Setdiff method to select the remaining 30% of the data as Validation rows
validrows = setdiff(row.names(gcd),trainrows)
validationdataset = gcd[validrows, ]

## Printing the Training and Validation data
print("Training Dataset")

## [1] "Training Dataset"

print(head(trainingdataset))

##     checkin_acc duration credit_history amount savings_acc present_emp_since
## 836         A11       12            A30   1082         A61               A73
## 679         A11       24            A32   2384         A61               A75
## 129         A12       12            A34   1860         A61               A71
## 930         A11       12            A33   1344         A61               A73
## 509         A14       24            A32   1413         A61               A73
## 471         A12       24            A32   3092         A62               A72
##     inst_rate personal_status residing_since age inst_plans num_credits  job
## 836         4             A93              4  48       A141           2 A173
## 679         4             A93              4  64       A141           1 A172
## 129         4             A93              2  34       A143           2 A174
## 930         4             A93              2  43       A143           2 A172
## 509         4             A94              2  28       A143           1 A173
## 471         3             A94              2  22       A143           1 A173
##     status Credit_Type
## 836      1         Bad
## 679      0        Good
## 129      0        Good
## 930      0        Good
## 509      0        Good
## 471      1         Bad

print("Validation Dataset")

## [1] "Validation Dataset"

print(head(validationdataset))

##    checkin_acc duration credit_history amount savings_acc present_emp_since
## 2          A12       48            A32   5951         A61               A73
## 10         A12       30            A34   5234         A61               A71
## 14         A11       24            A34   1199         A61               A75
## 18         A11       30            A30   8072         A65               A72
## 21         A14        9            A34   2134         A61               A73
## 23         A11       10            A34   2241         A61               A72
##    inst_rate personal_status residing_since age inst_plans num_credits  job
## 2          2             A92              2  22       A143           1 A173
## 10         4             A94              2  28       A143           2 A174
## 14         4             A93              4  60       A143           2 A172
## 18         2             A93              3  25       A141           3 A173
## 21         4             A93              4  48       A143           3 A173
## 23         1             A93              3  48       A143           2 A172
##    status Credit_Type
## 2       1         Bad
## 10      1         Bad
## 14      1         Bad
## 18      0        Good
## 21      0        Good
## 23      0        Good
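A quick sanity check on a split like the one above is to confirm that the two parts are disjoint, that together they cover every row, and that the share of status = 1 is similar in both. The sketch below repeats the same sampling approach on a made-up data frame (toy, with 30% ones); it does not reproduce the actual GCD split.

```r
set.seed(1)

# Made-up stand-in: 1000 rows with a 0/1 status column (30% ones).
toy <- data.frame(id = 1:1000, status = rep(c(0, 1), times = c(700, 300)))

# Same approach as above: sample 70% of the row names, keep the rest aside.
trainrows <- sample(row.names(toy), nrow(toy) * 0.7)
validrows <- setdiff(row.names(toy), trainrows)
train <- toy[trainrows, ]
valid <- toy[validrows, ]

# The two parts must not overlap and must add up to the full data.
stopifnot(length(intersect(trainrows, validrows)) == 0)
stopifnot(nrow(train) + nrow(valid) == nrow(toy))

# Share of status = 1 in each part; a large gap suggests an unlucky split.
print(round(c(train = mean(train$status), valid = mean(valid$status)), 3))
```

If the class shares differ a lot, a stratified split (sampling within each status value) is one possible alternative; that would be a change from the method used here.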

A binary logistic regression model is developed because the dependent variable, status, is binary with values 0 and 1

#Developing a BLR model from glm function using training dataset

model = glm(status~.-Credit_Type,data = trainingdataset,family = "binomial")

print(summary(model))

## 
## Call:
## glm(formula = status ~ . - Credit_Type, family = "binomial", 
##     data = trainingdataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8853  -0.7459  -0.4393   0.7791   2.5549  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.665e-01  1.140e+00  -0.322  0.74780    
## checkin_accA12       -5.414e-01  2.438e-01  -2.221  0.02635 *  
## checkin_accA13       -1.193e+00  4.274e-01  -2.792  0.00523 ** 
## checkin_accA14       -1.724e+00  2.580e-01  -6.682 2.35e-11 ***
## duration              2.712e-02  1.087e-02   2.495  0.01258 *  
## credit_historyA31     1.964e-01  5.994e-01   0.328  0.74311    
## credit_historyA32    -5.971e-01  4.571e-01  -1.306  0.19144    
## credit_historyA33    -5.141e-01  4.998e-01  -1.029  0.30366    
## credit_historyA34    -1.295e+00  4.701e-01  -2.754  0.00589 ** 
## amount                7.287e-05  5.156e-05   1.413  0.15756    
## savings_accA62       -1.101e-01  3.263e-01  -0.337  0.73576    
## savings_accA63       -3.357e-01  4.582e-01  -0.733  0.46376    
## savings_accA64       -1.248e+00  6.299e-01  -1.982  0.04747 *  
## savings_accA65       -8.126e-01  2.938e-01  -2.766  0.00568 ** 
## present_emp_sinceA72  6.466e-02  4.901e-01   0.132  0.89503    
## present_emp_sinceA73 -3.569e-01  4.665e-01  -0.765  0.44422    
## present_emp_sinceA74 -9.970e-01  5.044e-01  -1.977  0.04807 *  
## present_emp_sinceA75 -5.995e-01  4.733e-01  -1.267  0.20524    
## inst_rate             3.356e-01  1.022e-01   3.285  0.00102 ** 
## personal_statusA92   -6.647e-02  4.587e-01  -0.145  0.88478    
## personal_statusA93   -4.604e-01  4.514e-01  -1.020  0.30770    
## personal_statusA94   -9.544e-01  5.501e-01  -1.735  0.08275 .  
## residing_since        2.960e-02  9.403e-02   0.315  0.75289    
## age                  -1.924e-02  1.010e-02  -1.905  0.05682 .  
## inst_plansA142       -6.388e-01  4.791e-01  -1.333  0.18238    
## inst_plansA143       -7.046e-01  2.674e-01  -2.635  0.00841 ** 
## num_credits           3.126e-01  2.102e-01   1.487  0.13699    
## jobA172               7.571e-01  7.783e-01   0.973  0.33063    
## jobA173               9.263e-01  7.539e-01   1.229  0.21917    
## jobA174               6.550e-01  7.662e-01   0.855  0.39265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 664.67  on 670  degrees of freedom
## AIC: 724.67
## 
## Number of Fisher Scoring iterations: 5
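The estimates in the summary above are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio. The sketch below does this for three estimates copied from the summary; the interpretation assumes the other predictors are held fixed.

```r
# Estimates copied from the summary above (log-odds scale).
est <- c(checkin_accA14 = -1.724, duration = 0.02712, inst_rate = 0.3356)

# exp() turns a log-odds coefficient into an odds ratio.
odds_ratio <- exp(est)
print(round(odds_ratio, 3))

# With the fitted object itself, the equivalent call is exp(coef(model)).
```

For example, the odds ratio for inst_rate is about 1.40, so each one-unit increase in inst_rate multiplies the odds of status = 1 (bad credit) by roughly 1.4. The ratio for checkin_accA14 is about 0.18, meaning much lower odds of bad credit than the baseline checking-account category.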

Optimising the model: each fit below drops one predictor from the full model, and its residual deviance and AIC are compared with those of the full model (the null deviance is the same for every fit)

model = glm(status~.-credit_history-Credit_Type,data = trainingdataset,family = "binomial")
print(summary(model))

## 
## Call:
## glm(formula = status ~ . - credit_history - Credit_Type, family = "binomial", 
##     data = trainingdataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8369  -0.7656  -0.4350   0.8254   2.7395  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -4.127e-01  9.973e-01  -0.414  0.67904    
## checkin_accA12       -5.324e-01  2.379e-01  -2.238  0.02521 *  
## checkin_accA13       -1.222e+00  4.173e-01  -2.929  0.00340 ** 
## checkin_accA14       -1.826e+00  2.534e-01  -7.207 5.72e-13 ***
## duration              2.844e-02  1.065e-02   2.672  0.00755 ** 
## amount                8.255e-05  5.061e-05   1.631  0.10287    
## savings_accA62        1.805e-02  3.217e-01   0.056  0.95527    
## savings_accA63       -2.670e-01  4.418e-01  -0.604  0.54565    
## savings_accA64       -1.195e+00  6.157e-01  -1.941  0.05230 .  
## savings_accA65       -7.799e-01  2.887e-01  -2.701  0.00691 ** 
## present_emp_sinceA72  4.480e-02  4.866e-01   0.092  0.92665    
## present_emp_sinceA73 -3.464e-01  4.620e-01  -0.750  0.45339    
## present_emp_sinceA74 -1.038e+00  4.998e-01  -2.078  0.03774 *  
## present_emp_sinceA75 -6.505e-01  4.685e-01  -1.388  0.16501    
## inst_rate             3.300e-01  1.003e-01   3.289  0.00101 ** 
## personal_statusA92   -3.879e-02  4.466e-01  -0.087  0.93079    
## personal_statusA93   -4.611e-01  4.395e-01  -1.049  0.29416    
## personal_statusA94   -9.426e-01  5.401e-01  -1.745  0.08095 .  
## residing_since        1.693e-02  9.256e-02   0.183  0.85487    
## age                  -2.094e-02  9.921e-03  -2.110  0.03483 *  
## inst_plansA142       -5.462e-01  4.699e-01  -1.162  0.24516    
## inst_plansA143       -8.159e-01  2.584e-01  -3.157  0.00159 ** 
## num_credits           1.026e-01  1.691e-01   0.607  0.54403    
## jobA172               6.000e-01  7.599e-01   0.790  0.42975    
## jobA173               7.216e-01  7.316e-01   0.986  0.32401    
## jobA174               4.649e-01  7.420e-01   0.627  0.53090    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 679.58  on 674  degrees of freedom
## AIC: 731.58
## 
## Number of Fisher Scoring iterations: 5

model = glm(status~.-Credit_Type-savings_acc,data = trainingdataset,family = "binomial")
print(summary(model))

## 
## Call:
## glm(formula = status ~ . - Credit_Type - savings_acc, family = "binomial", 
##     data = trainingdataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9634  -0.7524  -0.4397   0.7938   2.2952  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.632e-01  1.118e+00  -0.325  0.74538    
## checkin_accA12       -6.160e-01  2.366e-01  -2.603  0.00923 ** 
## checkin_accA13       -1.224e+00  4.207e-01  -2.909  0.00363 ** 
## checkin_accA14       -1.853e+00  2.500e-01  -7.414 1.23e-13 ***
## duration              2.868e-02  1.071e-02   2.679  0.00738 ** 
## credit_historyA31     2.895e-02  5.914e-01   0.049  0.96096    
## credit_historyA32    -6.903e-01  4.531e-01  -1.524  0.12761    
## credit_historyA33    -6.125e-01  4.967e-01  -1.233  0.21756    
## credit_historyA34    -1.355e+00  4.654e-01  -2.911  0.00360 ** 
## amount                6.227e-05  5.059e-05   1.231  0.21833    
## present_emp_sinceA72  5.797e-02  4.878e-01   0.119  0.90541    
## present_emp_sinceA73 -4.039e-01  4.641e-01  -0.870  0.38419    
## present_emp_sinceA74 -1.048e+00  5.029e-01  -2.084  0.03717 *  
## present_emp_sinceA75 -6.616e-01  4.711e-01  -1.404  0.16025    
## inst_rate             3.116e-01  1.003e-01   3.107  0.00189 ** 
## personal_statusA92   -1.158e-01  4.495e-01  -0.258  0.79668    
## personal_statusA93   -4.778e-01  4.422e-01  -1.080  0.27992    
## personal_statusA94   -9.604e-01  5.421e-01  -1.772  0.07647 .  
## residing_since        1.598e-02  9.314e-02   0.172  0.86378    
## age                  -2.068e-02  9.949e-03  -2.079  0.03765 *  
## inst_plansA142       -5.212e-01  4.715e-01  -1.105  0.26895    
## inst_plansA143       -6.532e-01  2.665e-01  -2.451  0.01423 *  
## num_credits           3.144e-01  2.076e-01   1.514  0.12996    
## jobA172               9.234e-01  7.750e-01   1.191  0.23347    
## jobA173               1.038e+00  7.507e-01   1.383  0.16661    
## jobA174               8.365e-01  7.597e-01   1.101  0.27083    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 676.61  on 674  degrees of freedom
## AIC: 728.61
## 
## Number of Fisher Scoring iterations: 5

model = glm(status~.-Credit_Type-present_emp_since,data = trainingdataset,family = "binomial")
print(summary(model))

## 
## Call:
## glm(formula = status ~ . - Credit_Type - present_emp_since, family = "binomial", 
##     data = trainingdataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8647  -0.7651  -0.4419   0.8230   2.6134  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.384e-01  1.085e+00   0.127  0.89856    
## checkin_accA12     -4.869e-01  2.392e-01  -2.036  0.04177 *  
## checkin_accA13     -1.157e+00  4.191e-01  -2.760  0.00577 ** 
## checkin_accA14     -1.704e+00  2.554e-01  -6.674 2.49e-11 ***
## duration            2.467e-02  1.074e-02   2.298  0.02157 *  
## credit_historyA31   5.811e-02  5.937e-01   0.098  0.92203    
## credit_historyA32  -7.010e-01  4.521e-01  -1.551  0.12102    
## credit_historyA33  -5.495e-01  4.951e-01  -1.110  0.26706    
## credit_historyA34  -1.391e+00  4.648e-01  -2.994  0.00276 ** 
## amount              7.145e-05  5.112e-05   1.398  0.16220    
## savings_accA62     -1.561e-01  3.214e-01  -0.486  0.62720    
## savings_accA63     -3.601e-01  4.471e-01  -0.805  0.42058    
## savings_accA64     -1.225e+00  6.082e-01  -2.014  0.04397 *  
## savings_accA65     -8.712e-01  2.913e-01  -2.991  0.00278 ** 
## inst_rate           3.298e-01  1.014e-01   3.254  0.00114 ** 
## personal_statusA92 -7.869e-02  4.502e-01  -0.175  0.86125    
## personal_statusA93 -6.216e-01  4.410e-01  -1.409  0.15872    
## personal_statusA94 -1.019e+00  5.427e-01  -1.877  0.06053 .  
## residing_since     -5.281e-04  9.116e-02  -0.006  0.99538    
## age                -2.215e-02  9.492e-03  -2.333  0.01962 *  
## inst_plansA142     -4.937e-01  4.663e-01  -1.059  0.28965    
## inst_plansA143     -6.797e-01  2.634e-01  -2.580  0.00987 ** 
## num_credits         2.441e-01  2.058e-01   1.186  0.23553    
## jobA172             3.582e-01  7.093e-01   0.505  0.61359    
## jobA173             4.913e-01  6.883e-01   0.714  0.47534    
## jobA174             2.954e-01  7.358e-01   0.401  0.68811    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 675.61  on 674  degrees of freedom
## AIC: 727.61
## 
## Number of Fisher Scoring iterations: 5

model = glm(status~.-Credit_Type-personal_status,data = trainingdataset,family = "binomial")
print(summary(model))

## 
## Call:
## glm(formula = status ~ . - Credit_Type - personal_status, family = "binomial", 
##     data = trainingdataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8438  -0.7641  -0.4370   0.8113   2.5040  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -6.074e-01  1.054e+00  -0.576  0.56438    
## checkin_accA12       -5.794e-01  2.419e-01  -2.396  0.01659 *  
## checkin_accA13       -1.165e+00  4.223e-01  -2.759  0.00579 ** 
## checkin_accA14       -1.726e+00  2.563e-01  -6.733 1.66e-11 ***
## duration              2.777e-02  1.075e-02   2.584  0.00976 ** 
## credit_historyA31     2.391e-01  5.931e-01   0.403  0.68691    
## credit_historyA32    -6.277e-01  4.500e-01  -1.395  0.16305    
## credit_historyA33    -5.691e-01  4.952e-01  -1.149  0.25048    
## credit_historyA34    -1.306e+00  4.650e-01  -2.808  0.00498 ** 
## amount                7.348e-05  5.104e-05   1.440  0.14996    
## savings_accA62       -1.343e-01  3.240e-01  -0.414  0.67864    
## savings_accA63       -3.840e-01  4.565e-01  -0.841  0.40024    
## savings_accA64       -1.161e+00  6.197e-01  -1.873  0.06105 .  
## savings_accA65       -8.039e-01  2.915e-01  -2.758  0.00582 ** 
## present_emp_sinceA72  8.080e-02  4.867e-01   0.166  0.86815    
## present_emp_sinceA73 -3.927e-01  4.645e-01  -0.845  0.39793    
## present_emp_sinceA74 -1.094e+00  5.020e-01  -2.179  0.02930 *  
## present_emp_sinceA75 -6.749e-01  4.709e-01  -1.433  0.15178    
## inst_rate             3.022e-01  9.991e-02   3.024  0.00249 ** 
## residing_since        5.304e-02  9.227e-02   0.575  0.56539    
## age                  -1.947e-02  1.002e-02  -1.944  0.05193 .  
## inst_plansA142       -6.615e-01  4.793e-01  -1.380  0.16752    
## inst_plansA143       -6.644e-01  2.655e-01  -2.503  0.01232 *  
## num_credits           2.993e-01  2.077e-01   1.441  0.14957    
## jobA172               7.110e-01  7.718e-01   0.921  0.35693    
## jobA173               9.156e-01  7.472e-01   1.225  0.22042    
## jobA174               6.675e-01  7.597e-01   0.879  0.37960    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 672.06  on 673  degrees of freedom
## AIC: 726.06
## 
## Number of Fisher Scoring iterations: 5

model = glm(status~.-Credit_Type-num_credits,data = trainingdataset,family = "binomial")
print(summary(model))

## 
## Call:
## glm(formula = status ~ . - Credit_Type - num_credits, family = "binomial", 
##     data = trainingdataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9029  -0.7534  -0.4372   0.7760   2.5640  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.243e-01  1.066e+00   0.210  0.83333    
## checkin_accA12       -5.577e-01  2.434e-01  -2.291  0.02194 *  
## checkin_accA13       -1.233e+00  4.257e-01  -2.895  0.00379 ** 
## checkin_accA14       -1.724e+00  2.576e-01  -6.691 2.22e-11 ***
## duration              2.579e-02  1.076e-02   2.397  0.01651 *  
## credit_historyA31     2.842e-03  5.835e-01   0.005  0.99611    
## credit_historyA32    -7.919e-01  4.358e-01  -1.817  0.06919 .  
## credit_historyA33    -5.206e-01  4.981e-01  -1.045  0.29599    
## credit_historyA34    -1.261e+00  4.668e-01  -2.702  0.00690 ** 
## amount                7.508e-05  5.121e-05   1.466  0.14260    
## savings_accA62       -9.150e-02  3.241e-01  -0.282  0.77773    
## savings_accA63       -3.823e-01  4.562e-01  -0.838  0.40196    
## savings_accA64       -1.223e+00  6.276e-01  -1.949  0.05129 .  
## savings_accA65       -8.177e-01  2.938e-01  -2.783  0.00539 ** 
## present_emp_sinceA72  7.530e-02  4.899e-01   0.154  0.87784    
## present_emp_sinceA73 -3.441e-01  4.662e-01  -0.738  0.46045    
## present_emp_sinceA74 -9.484e-01  5.032e-01  -1.885  0.05946 .  
## present_emp_sinceA75 -5.525e-01  4.720e-01  -1.170  0.24180    
## inst_rate             3.311e-01  1.016e-01   3.257  0.00113 ** 
## personal_statusA92   -6.690e-02  4.572e-01  -0.146  0.88367    
## personal_statusA93   -4.596e-01  4.498e-01  -1.022  0.30694    
## personal_statusA94   -9.441e-01  5.499e-01  -1.717  0.08600 .  
## residing_since        4.344e-02  9.331e-02   0.466  0.64155    
## age                  -1.887e-02  1.009e-02  -1.869  0.06159 .  
## inst_plansA142       -5.743e-01  4.761e-01  -1.206  0.22774    
## inst_plansA143       -6.955e-01  2.673e-01  -2.602  0.00926 ** 
## jobA172               6.583e-01  7.761e-01   0.848  0.39637    
## jobA173               8.345e-01  7.517e-01   1.110  0.26693    
## jobA174               5.840e-01  7.651e-01   0.763  0.44523    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 666.88  on 671  degrees of freedom
## AIC: 724.88
## 
## Number of Fisher Scoring iterations: 5

model = glm(status~.-Credit_Type-job,data = trainingdataset,family = "binomial")
print(summary(model))

## 
## Call:
## glm(formula = status ~ . - Credit_Type - job, family = "binomial", 
##     data = trainingdataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8223  -0.7459  -0.4406   0.7979   2.5556  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.798e-01  9.764e-01   0.287 0.774448    
## checkin_accA12       -5.635e-01  2.429e-01  -2.320 0.020318 *  
## checkin_accA13       -1.233e+00  4.263e-01  -2.892 0.003826 ** 
## checkin_accA14       -1.714e+00  2.569e-01  -6.671 2.54e-11 ***
## duration              2.698e-02  1.066e-02   2.531 0.011369 *  
## credit_historyA31     1.165e-01  5.934e-01   0.196 0.844338    
## credit_historyA32    -5.925e-01  4.529e-01  -1.308 0.190814    
## credit_historyA33    -4.882e-01  4.949e-01  -0.987 0.323855    
## credit_historyA34    -1.284e+00  4.665e-01  -2.752 0.005929 ** 
## amount                6.950e-05  4.829e-05   1.439 0.150076    
## savings_accA62       -7.722e-02  3.243e-01  -0.238 0.811782    
## savings_accA63       -3.733e-01  4.516e-01  -0.827 0.408470    
## savings_accA64       -1.198e+00  6.215e-01  -1.927 0.053970 .  
## savings_accA65       -8.151e-01  2.939e-01  -2.773 0.005551 ** 
## present_emp_sinceA72  3.242e-01  4.369e-01   0.742 0.458064    
## present_emp_sinceA73 -7.680e-02  4.035e-01  -0.190 0.849043    
## present_emp_sinceA74 -7.113e-01  4.491e-01  -1.584 0.113195    
## present_emp_sinceA75 -3.257e-01  4.217e-01  -0.772 0.439883    
## inst_rate             3.349e-01  1.001e-01   3.347 0.000817 ***
## personal_statusA92   -9.952e-02  4.552e-01  -0.219 0.826936    
## personal_statusA93   -4.914e-01  4.480e-01  -1.097 0.272679    
## personal_statusA94   -9.813e-01  5.471e-01  -1.794 0.072864 .  
## residing_since        3.596e-02  9.303e-02   0.387 0.699079    
## age                  -1.995e-02  1.002e-02  -1.991 0.046505 *  
## inst_plansA142       -5.956e-01  4.772e-01  -1.248 0.211939    
## inst_plansA143       -6.874e-01  2.660e-01  -2.584 0.009774 ** 
## num_credits           2.898e-01  2.095e-01   1.383 0.166596    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 666.90  on 673  degrees of freedom
## AIC: 720.9
## 
## Number of Fisher Scoring iterations: 5

model = glm(status~.-Credit_Type-residing_since,data = trainingdataset,family = "binomial")
print(summary(model)) ## AIC falls from 724.67 to 722.77; residual deviance barely changes (664.67 to 664.77)

## 
## Call:
## glm(formula = status ~ . - Credit_Type - residing_since, family = "binomial", 
##     data = trainingdataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8691  -0.7493  -0.4375   0.7692   2.5637  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.319e-01  1.134e+00  -0.293 0.769854    
## checkin_accA12       -5.445e-01  2.435e-01  -2.236 0.025332 *  
## checkin_accA13       -1.208e+00  4.251e-01  -2.842 0.004486 ** 
## checkin_accA14       -1.730e+00  2.575e-01  -6.716 1.86e-11 ***
## duration              2.726e-02  1.086e-02   2.510 0.012078 *  
## credit_historyA31     1.998e-01  5.989e-01   0.334 0.738726    
## credit_historyA32    -5.943e-01  4.568e-01  -1.301 0.193243    
## credit_historyA33    -5.182e-01  4.996e-01  -1.037 0.299613    
## credit_historyA34    -1.290e+00  4.697e-01  -2.747 0.006020 ** 
## amount                7.328e-05  5.154e-05   1.422 0.155068    
## savings_accA62       -1.054e-01  3.258e-01  -0.324 0.746235    
## savings_accA63       -3.322e-01  4.575e-01  -0.726 0.467678    
## savings_accA64       -1.248e+00  6.312e-01  -1.977 0.048056 *  
## savings_accA65       -8.094e-01  2.939e-01  -2.754 0.005882 ** 
## present_emp_sinceA72  4.902e-02  4.880e-01   0.100 0.919986    
## present_emp_sinceA73 -3.657e-01  4.660e-01  -0.785 0.432612    
## present_emp_sinceA74 -1.003e+00  5.044e-01  -1.989 0.046751 *  
## present_emp_sinceA75 -5.941e-01  4.733e-01  -1.255 0.209409    
## inst_rate             3.362e-01  1.021e-01   3.293 0.000992 ***
## personal_statusA92   -4.621e-02  4.542e-01  -0.102 0.918959    
## personal_statusA93   -4.479e-01  4.496e-01  -0.996 0.319163    
## personal_statusA94   -9.468e-01  5.496e-01  -1.723 0.084914 .  
## age                  -1.886e-02  1.003e-02  -1.880 0.060148 .  
## inst_plansA142       -6.428e-01  4.783e-01  -1.344 0.178978    
## inst_plansA143       -7.041e-01  2.673e-01  -2.634 0.008438 ** 
## num_credits           3.190e-01  2.091e-01   1.526 0.127084    
## jobA172               7.759e-01  7.761e-01   1.000 0.317475    
## jobA173               9.386e-01  7.531e-01   1.246 0.212606    
## jobA174               6.585e-01  7.663e-01   0.859 0.390180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.79  on 699  degrees of freedom
## Residual deviance: 664.77  on 671  degrees of freedom
## AIC: 722.77
## 
## Number of Fisher Scoring iterations: 5

The fits above compare the null deviance, residual deviance and AIC after dropping one predictor at a time. Removing residing_since lowers the AIC from 724.67 to 722.77 while the residual deviance barely changes (664.67 to 664.77), so residing_since is left out of the final model. Dropping job gives an even lower AIC (720.9), but with a larger rise in residual deviance (666.90).
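The AIC values printed above can be reproduced from the residual deviance and the degrees of freedom. The sketch below uses only numbers copied from the summaries; it relies on the fact that for a 0/1 response the AIC equals the residual deviance plus twice the number of coefficients. The likelihood-ratio test at the end is an additional check, not something the original analysis ran.

```r
# Figures copied from the full-model and residing_since-dropped summaries.
full    <- list(resid_dev = 664.67, resid_df = 670)
reduced <- list(resid_dev = 664.77, resid_df = 671)
n_obs   <- 700  # training rows (null deviance is on 699 df)

# Number of estimated coefficients = rows - residual degrees of freedom.
k_full    <- n_obs - full$resid_df
k_reduced <- n_obs - reduced$resid_df

# AIC = residual deviance + 2 * number of coefficients (0/1 response).
aic_full    <- full$resid_dev + 2 * k_full
aic_reduced <- reduced$resid_dev + 2 * k_reduced
print(c(full = aic_full, reduced = aic_reduced))

# Likelihood-ratio test for the dropped term: deviance change on 1 df.
lr_stat <- reduced$resid_dev - full$resid_dev
lr_df   <- reduced$resid_df - full$resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(round(p_value, 3))
```

A large p-value here agrees with the Wald test for residing_since in the full model (0.75289). The stats function drop1(model, test = "Chisq") reports this kind of one-term-deleted comparison for every predictor in a single table, which may be a shorter route than refitting by hand.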

Applying the model to validation dataset to predict the status variable

probprediction = predict(model,newdata = validationdataset,type = "response")

print(head(probprediction))

##         2        10        14        18        21        23 
## 0.6621447 0.3768487 0.2691726 0.8771605 0.1060015 0.1910474
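With type = "response", predict() returns probabilities of status = 1 rather than log-odds. The sketch below shows the relationship on a small simulated data set (toy, with a single made-up predictor x): the probability is the inverse logit, plogis(), of the link-scale prediction.

```r
set.seed(42)

# Simulated data: one numeric predictor and a 0/1 outcome.
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * toy$x))

fit <- glm(y ~ x, data = toy, family = "binomial")

# Predictions on the link (log-odds) scale and on the probability scale.
log_odds <- predict(fit, newdata = toy, type = "link")
prob     <- predict(fit, newdata = toy, type = "response")

# The probability is the inverse logit of the log-odds.
stopifnot(isTRUE(all.equal(prob, plogis(log_odds))))
print(head(round(prob, 3)))
```

This is why the values printed above all lie between 0 and 1 and can be compared directly with a cutoff.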

Using ROCR library to determine the optimum cutoff value

library(ROCR)
res <- predict(model,trainingdataset,type = "response")

ROCR_Predicted <- prediction(res,trainingdataset$status)
ROCR_Performance <- performance(ROCR_Predicted,"tpr","fpr")

plot(ROCR_Performance, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))

Graph to determine cut off value
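The notebook does not say which criterion is used to read the cutoff off the ROC curve. One common choice, assumed here purely for illustration, is to pick the cutoff that maximises TPR minus FPR (Youden's J). The sketch below implements that in base R rather than with ROCR; youden_cutoff is a hypothetical helper, and the scores and labels are made up.

```r
# Hypothetical helper: evaluate TPR and FPR on a grid of cutoffs and
# return the row with the largest J = TPR - FPR.
youden_cutoff <- function(prob, actual, cutoffs = seq(0.1, 0.9, by = 0.1)) {
  grid <- sapply(cutoffs, function(cut) {
    pred <- as.integer(prob > cut)
    tpr  <- sum(pred == 1 & actual == 1) / sum(actual == 1)
    fpr  <- sum(pred == 1 & actual == 0) / sum(actual == 0)
    c(cutoff = cut, tpr = tpr, fpr = fpr, J = tpr - fpr)
  })
  grid <- as.data.frame(t(grid))
  grid[which.max(grid$J), ]
}

# Made-up predicted probabilities and true labels.
prob   <- c(0.05, 0.15, 0.22, 0.35, 0.38, 0.45, 0.55, 0.71, 0.83, 0.90)
actual <- c(0,    0,    0,    0,    1,    0,    1,    1,    1,    1)

best <- youden_cutoff(prob, actual)
print(best)
```

On the real data the inputs would be res and trainingdataset$status. Other criteria, such as weighting false negatives more heavily because a missed bad credit is costly, would generally give a different cutoff.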

Assuming Cutoff = 0.5

cut_off = 0.5

prediction = ifelse(probprediction>cut_off,1,0)

validationdataset$predictedStatus = prediction

print(head(validationdataset))

##    checkin_acc duration credit_history amount savings_acc present_emp_since
## 2          A12       48            A32   5951         A61               A73
## 10         A12       30            A34   5234         A61               A71
## 14         A11       24            A34   1199         A61               A75
## 18         A11       30            A30   8072         A65               A72
## 21         A14        9            A34   2134         A61               A73
## 23         A11       10            A34   2241         A61               A72
##    inst_rate personal_status residing_since age inst_plans num_credits  job
## 2          2             A92              2  22       A143           1 A173
## 10         4             A94              2  28       A143           2 A174
## 14         4             A93              4  60       A143           2 A172
## 18         2             A93              3  25       A141           3 A173
## 21         4             A93              4  48       A143           3 A173
## 23         1             A93              3  48       A143           2 A172
##    status Credit_Type predictedStatus
## 2       1         Bad               1
## 10      1         Bad               0
## 14      1         Bad               0
## 18      0        Good               1
## 21      0        Good               0
## 23      0        Good               0

library(caret)

## Loading required package: lattice

# Evaluation of the model

confusionMatrix(as.factor(validationdataset$predictedStatus),as.factor(validationdataset$status),positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 183  53
##          1  25  39
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.6865, 0.7887)
##     No Information Rate : 0.6933          
##     P-Value [Acc > NIR] : 0.043914        
##                                           
##                   Kappa : 0.3319          
##                                           
##  Mcnemar's Test P-Value : 0.002235        
##                                           
##             Sensitivity : 0.4239          
##             Specificity : 0.8798          
##          Pos Pred Value : 0.6094          
##          Neg Pred Value : 0.7754          
##              Prevalence : 0.3067          
##          Detection Rate : 0.1300          
##    Detection Prevalence : 0.2133          
##       Balanced Accuracy : 0.6519          
##                                           
##        'Positive' Class : 1               
##

Cutoff as 0.4

cut_off = 0.4

prediction = ifelse(probprediction>cut_off,1,0)

validationdataset$predictedStatus = prediction

print(head(validationdataset))

##    checkin_acc duration credit_history amount savings_acc present_emp_since
## 2          A12       48            A32   5951         A61               A73
## 10         A12       30            A34   5234         A61               A71
## 14         A11       24            A34   1199         A61               A75
## 18         A11       30            A30   8072         A65               A72
## 21         A14        9            A34   2134         A61               A73
## 23         A11       10            A34   2241         A61               A72
##    inst_rate personal_status residing_since age inst_plans num_credits  job
## 2          2             A92              2  22       A143           1 A173
## 10         4             A94              2  28       A143           2 A174
## 14         4             A93              4  60       A143           2 A172
## 18         2             A93              3  25       A141           3 A173
## 21         4             A93              4  48       A143           3 A173
## 23         1             A93              3  48       A143           2 A172
##    status Credit_Type predictedStatus
## 2       1         Bad               1
## 10      1         Bad               0
## 14      1         Bad               0
## 18      0        Good               1
## 21      0        Good               0
## 23      0        Good               0

library(caret)
# Evaluation of the model

confusionMatrix(as.factor(validationdataset$predictedStatus),as.factor(validationdataset$status),positive = "0")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 162  40
##          1  46  52
##                                           
##                Accuracy : 0.7133          
##                  95% CI : (0.6586, 0.7638)
##     No Information Rate : 0.6933          
##     P-Value [Acc > NIR] : 0.2469          
##                                           
##                   Kappa : 0.3379          
##                                           
##  Mcnemar's Test P-Value : 0.5898          
##                                           
##             Sensitivity : 0.7788          
##             Specificity : 0.5652          
##          Pos Pred Value : 0.8020          
##          Neg Pred Value : 0.5306          
##              Prevalence : 0.6933          
##          Detection Rate : 0.5400          
##    Detection Prevalence : 0.6733          
##       Balanced Accuracy : 0.6720          
##                                           
##        'Positive' Class : 0               
##

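Accuracy is higher with a cutoff of 0.5 (74%) than with 0.4 (71.33%). Note that the second confusion matrix was computed with positive = "0", so its Sensitivity and Specificity are swapped relative to the first. With class 1 (bad credit) as the positive class in both cases, lowering the cutoff from 0.5 to 0.4 raises sensitivity from 0.4239 to 0.5652 and lowers specificity from 0.8798 to 0.7788.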
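To compare the two cutoffs on the same footing, the headline metrics can be recomputed by hand from the cell counts in the two confusion matrices above, treating class 1 (bad credit) as the positive class both times. cm_metrics is a hypothetical helper written for this sketch.

```r
# Hypothetical helper: metrics with class 1 as the positive class.
# tn = predicted 0, actual 0;  fn = predicted 0, actual 1;
# fp = predicted 1, actual 0;  tp = predicted 1, actual 1.
cm_metrics <- function(tn, fn, fp, tp) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Counts copied from the two confusion matrices above.
cut_05 <- cm_metrics(tn = 183, fn = 53, fp = 25, tp = 39)
cut_04 <- cm_metrics(tn = 162, fn = 40, fp = 46, tp = 52)

print(round(rbind(cutoff_0.5 = cut_05, cutoff_0.4 = cut_04), 4))
```

The lower cutoff gives up some accuracy but catches more of the bad credits. Whether that trade is worthwhile depends on the relative cost of approving a bad credit versus rejecting a good one, which this analysis does not quantify.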