Reading the Dataset
## Credit_Type is an extra column added to facilitate exploration. It is derived from status: Good when status = 0 and Bad when status = 1
gcd <- read.csv('GCD_Data.csv')
## Summary of the data
summary(gcd)
## checkin_acc duration credit_history amount
## Length:1000 Min. : 4.0 Length:1000 Min. : 250
## Class :character 1st Qu.:12.0 Class :character 1st Qu.: 1366
## Mode :character Median :18.0 Mode :character Median : 2320
## Mean :20.9 Mean : 3271
## 3rd Qu.:24.0 3rd Qu.: 3972
## Max. :72.0 Max. :18424
## savings_acc present_emp_since inst_rate personal_status
## Length:1000 Length:1000 Min. :1.000 Length:1000
## Class :character Class :character 1st Qu.:2.000 Class :character
## Mode :character Mode :character Median :3.000 Mode :character
## Mean :2.973
## 3rd Qu.:4.000
## Max. :4.000
## residing_since age inst_plans num_credits
## Min. :1.000 Min. :19.00 Length:1000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:27.00 Class :character 1st Qu.:1.000
## Median :3.000 Median :33.00 Mode :character Median :1.000
## Mean :2.845 Mean :35.55 Mean :1.407
## 3rd Qu.:4.000 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :4.000 Max. :75.00 Max. :4.000
## job status Credit_Type
## Length:1000 Min. :0.0 Length:1000
## Class :character 1st Qu.:0.0 Class :character
## Mode :character Median :0.0 Mode :character
## Mean :0.3
## 3rd Qu.:1.0
## Max. :1.0
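The summary shows that Credit_Type is already present once the CSV is read. As a rough sketch of how such a column could be derived from status, the snippet below uses a toy data frame; the values and the column name Credit_Type_check are illustrative assumptions, not part of the original data.

```r
# Toy stand-in for the real data (values are illustrative only)
toy <- data.frame(status = c(0, 1, 0, 0, 1))

# Map status 0 -> "Good" and status 1 -> "Bad"
toy$Credit_Type_check <- ifelse(toy$status == 0, "Good", "Bad")

print(toy)
```

On the real data, comparing such a derived column against the existing Credit_Type with table() would be a cheap consistency check.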
DATA PRE-PROCESSING
The data has no missing values (NAs). Outliers appear in amount, age and duration, but they are not random values or data-entry errors. For example, amount has maximum values of 15857 and 18424, far above the median of 2320, but these are not treated as outliers because applicants can legitimately request larger loans. Similarly, age values above 66 show up as outliers but are left untreated, since they can belong to legitimate older applicants.
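The checks described above can be sketched in base R. This is a minimal illustration on a toy data frame (the values are assumptions chosen only to mimic the ranges in the summary); on the real data the same calls would be applied to gcd.

```r
# Toy data standing in for gcd (values are illustrative only)
toy <- data.frame(
  amount = c(250, 1366, 2320, 3972, 18424),
  age    = c(19, 27, 33, 42, 75)
)

# Count missing values per column; all zeros means no NAs
na_counts <- colSums(is.na(toy))
print(na_counts)

# Values a box plot would draw as outliers
# (beyond 1.5 times the box height from the hinges)
outliers <- lapply(toy, function(x) boxplot.stats(x)$out)
print(outliers)
```

Flagging a value this way only says it is far from the bulk of the data; whether to treat it remains a judgment call, as argued above.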
EXPLORATORY ANALYSIS
library(ggplot2)
library(plotly)
plotx4 = ggplot(data = gcd , aes(x = Credit_Type, y = amount)) + geom_boxplot(fill = "red")
ggplotly(plotx4)
The amount is the loan amount availed or requested by the customer. The minimum amount among bad credits is 433, whereas the minimum among good credits is 250. This implies that all the applicants with a loan amount between 250 and 433 are classified as good credit.
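The per-class minimum and maximum read off the box plot can also be computed directly. The sketch below uses a toy data frame whose values are assumptions picked to echo the figures quoted above; it is not the real data.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  Credit_Type = c("Good", "Good", "Good", "Bad", "Bad"),
  amount      = c(250, 1200, 5000, 433, 18424)
)

# Minimum and maximum amount within each credit type
amount_range <- sapply(split(toy$amount, toy$Credit_Type), range)
rownames(amount_range) <- c("min", "max")
print(amount_range)
```

The same pattern applies to duration or any other numeric column.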
plotx5 = ggplot(data = gcd , aes(x = Credit_Type, y = duration)) + geom_boxplot(fill = "red")
ggplotly(plotx5)
Duration is the number of months for which the credit is given. The box plot shows that the maximum duration is 60 for good credit and 72 for bad credit, so records with a duration between 60 and 72 are classified as bad credit. Similarly, records with a duration between 4 and 6 are classified as good credit. Good-credit applicants with a duration of more than 45 months appear as outliers in the plot, but these can be exceptional cases and are therefore not treated.
plotx6 = ggplot(data = gcd , aes(x = Credit_Type, y = residing_since)) + geom_boxplot(fill = "red")
ggplotly(plotx6)
Residing since gives the number of years the applicant has been residing at the current location. Its distribution shows no clear difference between Good and Bad credit.
Splitting the data into training and validation sets. The training set is 70% of the data; the validation set is the remaining 30%.
set.seed(1)
## Applying sampling method to select 70% of the rows as training rows
trainrows = sample(row.names(gcd),dim(gcd)[1]*0.7)
trainingdataset = gcd[trainrows, ]
## Setdiff method to select the remaining 30% of the data as Validation rows
validrows = setdiff(row.names(gcd),trainrows)
validationdataset = gcd[validrows, ]
## Printing the Training and Validation data
print("Training Dataset")
## [1] "Training Dataset"
print(head(trainingdataset))
## checkin_acc duration credit_history amount savings_acc present_emp_since
## 836 A11 12 A30 1082 A61 A73
## 679 A11 24 A32 2384 A61 A75
## 129 A12 12 A34 1860 A61 A71
## 930 A11 12 A33 1344 A61 A73
## 509 A14 24 A32 1413 A61 A73
## 471 A12 24 A32 3092 A62 A72
## inst_rate personal_status residing_since age inst_plans num_credits job
## 836 4 A93 4 48 A141 2 A173
## 679 4 A93 4 64 A141 1 A172
## 129 4 A93 2 34 A143 2 A174
## 930 4 A93 2 43 A143 2 A172
## 509 4 A94 2 28 A143 1 A173
## 471 3 A94 2 22 A143 1 A173
## status Credit_Type
## 836 1 Bad
## 679 0 Good
## 129 0 Good
## 930 0 Good
## 509 0 Good
## 471 1 Bad
print("Validation Dataset")
## [1] "Validation Dataset"
print(head(validationdataset))
## checkin_acc duration credit_history amount savings_acc present_emp_since
## 2 A12 48 A32 5951 A61 A73
## 10 A12 30 A34 5234 A61 A71
## 14 A11 24 A34 1199 A61 A75
## 18 A11 30 A30 8072 A65 A72
## 21 A14 9 A34 2134 A61 A73
## 23 A11 10 A34 2241 A61 A72
## inst_rate personal_status residing_since age inst_plans num_credits job
## 2 2 A92 2 22 A143 1 A173
## 10 4 A94 2 28 A143 2 A174
## 14 4 A93 4 60 A143 2 A172
## 18 2 A93 3 25 A141 3 A173
## 21 4 A93 4 48 A143 3 A173
## 23 1 A93 3 48 A143 2 A172
## status Credit_Type
## 2 1 Bad
## 10 1 Bad
## 14 1 Bad
## 18 0 Good
## 21 0 Good
## 23 0 Good
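A quick sanity check on a split like the one above is to confirm that the two sets are disjoint, cover every row, and have a similar share of status = 1. The sketch below repeats the same sample/setdiff pattern on simulated data; the data frame toy and its columns are assumptions for illustration.

```r
set.seed(1)
# Simulated stand-in for gcd (illustrative only)
toy <- data.frame(id = 1:100, status = rbinom(100, 1, 0.3))

train_rows <- sample(row.names(toy), nrow(toy) * 0.7)
valid_rows <- setdiff(row.names(toy), train_rows)

train <- toy[train_rows, ]
valid <- toy[valid_rows, ]

# The two sets should not overlap and should together cover every row
no_overlap <- length(intersect(train_rows, valid_rows)) == 0
full_cover <- nrow(train) + nrow(valid) == nrow(toy)
print(c(no_overlap = no_overlap, full_cover = full_cover))

# Share of status = 1 in each set; a large gap suggests an unlucky split
print(c(train = mean(train$status), valid = mean(valid$status)))
```

If the class shares differ a lot, a stratified split (sampling within each status value) is one possible remedy.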
Developing a binary logistic regression (BLR) model, because the dependent variable status is binary with values 0 and 1
#Developing a BLR model from glm function using training dataset
model = glm(status~.-Credit_Type,data = trainingdataset,family = "binomial")
print(summary(model))
##
## Call:
## glm(formula = status ~ . - Credit_Type, family = "binomial",
## data = trainingdataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8853 -0.7459 -0.4393 0.7791 2.5549
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.665e-01 1.140e+00 -0.322 0.74780
## checkin_accA12 -5.414e-01 2.438e-01 -2.221 0.02635 *
## checkin_accA13 -1.193e+00 4.274e-01 -2.792 0.00523 **
## checkin_accA14 -1.724e+00 2.580e-01 -6.682 2.35e-11 ***
## duration 2.712e-02 1.087e-02 2.495 0.01258 *
## credit_historyA31 1.964e-01 5.994e-01 0.328 0.74311
## credit_historyA32 -5.971e-01 4.571e-01 -1.306 0.19144
## credit_historyA33 -5.141e-01 4.998e-01 -1.029 0.30366
## credit_historyA34 -1.295e+00 4.701e-01 -2.754 0.00589 **
## amount 7.287e-05 5.156e-05 1.413 0.15756
## savings_accA62 -1.101e-01 3.263e-01 -0.337 0.73576
## savings_accA63 -3.357e-01 4.582e-01 -0.733 0.46376
## savings_accA64 -1.248e+00 6.299e-01 -1.982 0.04747 *
## savings_accA65 -8.126e-01 2.938e-01 -2.766 0.00568 **
## present_emp_sinceA72 6.466e-02 4.901e-01 0.132 0.89503
## present_emp_sinceA73 -3.569e-01 4.665e-01 -0.765 0.44422
## present_emp_sinceA74 -9.970e-01 5.044e-01 -1.977 0.04807 *
## present_emp_sinceA75 -5.995e-01 4.733e-01 -1.267 0.20524
## inst_rate 3.356e-01 1.022e-01 3.285 0.00102 **
## personal_statusA92 -6.647e-02 4.587e-01 -0.145 0.88478
## personal_statusA93 -4.604e-01 4.514e-01 -1.020 0.30770
## personal_statusA94 -9.544e-01 5.501e-01 -1.735 0.08275 .
## residing_since 2.960e-02 9.403e-02 0.315 0.75289
## age -1.924e-02 1.010e-02 -1.905 0.05682 .
## inst_plansA142 -6.388e-01 4.791e-01 -1.333 0.18238
## inst_plansA143 -7.046e-01 2.674e-01 -2.635 0.00841 **
## num_credits 3.126e-01 2.102e-01 1.487 0.13699
## jobA172 7.571e-01 7.783e-01 0.973 0.33063
## jobA173 9.263e-01 7.539e-01 1.229 0.21917
## jobA174 6.550e-01 7.662e-01 0.855 0.39265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 664.67 on 670 degrees of freedom
## AIC: 724.67
##
## Number of Fisher Scoring iterations: 5
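The estimates in the summary above are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to read. The sketch below copies three estimates from that output; the choice of these three is only for illustration.

```r
# Selected estimates copied from the summary output above (log-odds scale)
log_odds <- c(checkin_accA14 = -1.724, duration = 0.02712, inst_rate = 0.3356)

# Exponentiate to get odds ratios: the multiplicative change in the odds of
# status = 1 per unit increase (or relative to the reference level)
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

For example, the odds ratio for checkin_accA14 is roughly 0.18, meaning the odds of status = 1 are about 82% lower than for the reference level of checkin_acc, with the other predictors held fixed. On the fitted object itself, exp(coef(model)) gives the full set.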
Optimising the model by removing insignificant predictors one at a time from the full model and checking the effect on the null deviance, residual deviance and AIC
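The one-at-a-time refits that follow can also be summarised in a single call with drop1() from the stats package, which reports the deviance and AIC of the model with each term removed. This is a sketch on simulated data (the data frame sim and its columns are assumptions), not a rerun of the analysis here.

```r
set.seed(42)
n <- 200
# Simulated predictors: x1 and x2 matter, noise does not (illustrative only)
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

full <- glm(y ~ ., data = sim, family = "binomial")

# One row per predictor: deviance and AIC of the model without it,
# plus a likelihood-ratio test against the full model
drop_table <- drop1(full, test = "Chisq")
print(drop_table)
```

The row labelled <none> is the full model, so each other row can be compared against it directly.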
model = glm(status~.-credit_history-Credit_Type,data = trainingdataset,family = "binomial")
print(summary(model))
##
## Call:
## glm(formula = status ~ . - credit_history - Credit_Type, family = "binomial",
## data = trainingdataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8369 -0.7656 -0.4350 0.8254 2.7395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.127e-01 9.973e-01 -0.414 0.67904
## checkin_accA12 -5.324e-01 2.379e-01 -2.238 0.02521 *
## checkin_accA13 -1.222e+00 4.173e-01 -2.929 0.00340 **
## checkin_accA14 -1.826e+00 2.534e-01 -7.207 5.72e-13 ***
## duration 2.844e-02 1.065e-02 2.672 0.00755 **
## amount 8.255e-05 5.061e-05 1.631 0.10287
## savings_accA62 1.805e-02 3.217e-01 0.056 0.95527
## savings_accA63 -2.670e-01 4.418e-01 -0.604 0.54565
## savings_accA64 -1.195e+00 6.157e-01 -1.941 0.05230 .
## savings_accA65 -7.799e-01 2.887e-01 -2.701 0.00691 **
## present_emp_sinceA72 4.480e-02 4.866e-01 0.092 0.92665
## present_emp_sinceA73 -3.464e-01 4.620e-01 -0.750 0.45339
## present_emp_sinceA74 -1.038e+00 4.998e-01 -2.078 0.03774 *
## present_emp_sinceA75 -6.505e-01 4.685e-01 -1.388 0.16501
## inst_rate 3.300e-01 1.003e-01 3.289 0.00101 **
## personal_statusA92 -3.879e-02 4.466e-01 -0.087 0.93079
## personal_statusA93 -4.611e-01 4.395e-01 -1.049 0.29416
## personal_statusA94 -9.426e-01 5.401e-01 -1.745 0.08095 .
## residing_since 1.693e-02 9.256e-02 0.183 0.85487
## age -2.094e-02 9.921e-03 -2.110 0.03483 *
## inst_plansA142 -5.462e-01 4.699e-01 -1.162 0.24516
## inst_plansA143 -8.159e-01 2.584e-01 -3.157 0.00159 **
## num_credits 1.026e-01 1.691e-01 0.607 0.54403
## jobA172 6.000e-01 7.599e-01 0.790 0.42975
## jobA173 7.216e-01 7.316e-01 0.986 0.32401
## jobA174 4.649e-01 7.420e-01 0.627 0.53090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 679.58 on 674 degrees of freedom
## AIC: 731.58
##
## Number of Fisher Scoring iterations: 5
model = glm(status~.-Credit_Type-savings_acc,data = trainingdataset,family = "binomial")
print(summary(model))
##
## Call:
## glm(formula = status ~ . - Credit_Type - savings_acc, family = "binomial",
## data = trainingdataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9634 -0.7524 -0.4397 0.7938 2.2952
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.632e-01 1.118e+00 -0.325 0.74538
## checkin_accA12 -6.160e-01 2.366e-01 -2.603 0.00923 **
## checkin_accA13 -1.224e+00 4.207e-01 -2.909 0.00363 **
## checkin_accA14 -1.853e+00 2.500e-01 -7.414 1.23e-13 ***
## duration 2.868e-02 1.071e-02 2.679 0.00738 **
## credit_historyA31 2.895e-02 5.914e-01 0.049 0.96096
## credit_historyA32 -6.903e-01 4.531e-01 -1.524 0.12761
## credit_historyA33 -6.125e-01 4.967e-01 -1.233 0.21756
## credit_historyA34 -1.355e+00 4.654e-01 -2.911 0.00360 **
## amount 6.227e-05 5.059e-05 1.231 0.21833
## present_emp_sinceA72 5.797e-02 4.878e-01 0.119 0.90541
## present_emp_sinceA73 -4.039e-01 4.641e-01 -0.870 0.38419
## present_emp_sinceA74 -1.048e+00 5.029e-01 -2.084 0.03717 *
## present_emp_sinceA75 -6.616e-01 4.711e-01 -1.404 0.16025
## inst_rate 3.116e-01 1.003e-01 3.107 0.00189 **
## personal_statusA92 -1.158e-01 4.495e-01 -0.258 0.79668
## personal_statusA93 -4.778e-01 4.422e-01 -1.080 0.27992
## personal_statusA94 -9.604e-01 5.421e-01 -1.772 0.07647 .
## residing_since 1.598e-02 9.314e-02 0.172 0.86378
## age -2.068e-02 9.949e-03 -2.079 0.03765 *
## inst_plansA142 -5.212e-01 4.715e-01 -1.105 0.26895
## inst_plansA143 -6.532e-01 2.665e-01 -2.451 0.01423 *
## num_credits 3.144e-01 2.076e-01 1.514 0.12996
## jobA172 9.234e-01 7.750e-01 1.191 0.23347
## jobA173 1.038e+00 7.507e-01 1.383 0.16661
## jobA174 8.365e-01 7.597e-01 1.101 0.27083
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 676.61 on 674 degrees of freedom
## AIC: 728.61
##
## Number of Fisher Scoring iterations: 5
model = glm(status~.-Credit_Type-present_emp_since,data = trainingdataset,family = "binomial")
print(summary(model))
##
## Call:
## glm(formula = status ~ . - Credit_Type - present_emp_since, family = "binomial",
## data = trainingdataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8647 -0.7651 -0.4419 0.8230 2.6134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.384e-01 1.085e+00 0.127 0.89856
## checkin_accA12 -4.869e-01 2.392e-01 -2.036 0.04177 *
## checkin_accA13 -1.157e+00 4.191e-01 -2.760 0.00577 **
## checkin_accA14 -1.704e+00 2.554e-01 -6.674 2.49e-11 ***
## duration 2.467e-02 1.074e-02 2.298 0.02157 *
## credit_historyA31 5.811e-02 5.937e-01 0.098 0.92203
## credit_historyA32 -7.010e-01 4.521e-01 -1.551 0.12102
## credit_historyA33 -5.495e-01 4.951e-01 -1.110 0.26706
## credit_historyA34 -1.391e+00 4.648e-01 -2.994 0.00276 **
## amount 7.145e-05 5.112e-05 1.398 0.16220
## savings_accA62 -1.561e-01 3.214e-01 -0.486 0.62720
## savings_accA63 -3.601e-01 4.471e-01 -0.805 0.42058
## savings_accA64 -1.225e+00 6.082e-01 -2.014 0.04397 *
## savings_accA65 -8.712e-01 2.913e-01 -2.991 0.00278 **
## inst_rate 3.298e-01 1.014e-01 3.254 0.00114 **
## personal_statusA92 -7.869e-02 4.502e-01 -0.175 0.86125
## personal_statusA93 -6.216e-01 4.410e-01 -1.409 0.15872
## personal_statusA94 -1.019e+00 5.427e-01 -1.877 0.06053 .
## residing_since -5.281e-04 9.116e-02 -0.006 0.99538
## age -2.215e-02 9.492e-03 -2.333 0.01962 *
## inst_plansA142 -4.937e-01 4.663e-01 -1.059 0.28965
## inst_plansA143 -6.797e-01 2.634e-01 -2.580 0.00987 **
## num_credits 2.441e-01 2.058e-01 1.186 0.23553
## jobA172 3.582e-01 7.093e-01 0.505 0.61359
## jobA173 4.913e-01 6.883e-01 0.714 0.47534
## jobA174 2.954e-01 7.358e-01 0.401 0.68811
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 675.61 on 674 degrees of freedom
## AIC: 727.61
##
## Number of Fisher Scoring iterations: 5
model = glm(status~.-Credit_Type-personal_status,data = trainingdataset,family = "binomial")
print(summary(model))
##
## Call:
## glm(formula = status ~ . - Credit_Type - personal_status, family = "binomial",
## data = trainingdataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8438 -0.7641 -0.4370 0.8113 2.5040
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.074e-01 1.054e+00 -0.576 0.56438
## checkin_accA12 -5.794e-01 2.419e-01 -2.396 0.01659 *
## checkin_accA13 -1.165e+00 4.223e-01 -2.759 0.00579 **
## checkin_accA14 -1.726e+00 2.563e-01 -6.733 1.66e-11 ***
## duration 2.777e-02 1.075e-02 2.584 0.00976 **
## credit_historyA31 2.391e-01 5.931e-01 0.403 0.68691
## credit_historyA32 -6.277e-01 4.500e-01 -1.395 0.16305
## credit_historyA33 -5.691e-01 4.952e-01 -1.149 0.25048
## credit_historyA34 -1.306e+00 4.650e-01 -2.808 0.00498 **
## amount 7.348e-05 5.104e-05 1.440 0.14996
## savings_accA62 -1.343e-01 3.240e-01 -0.414 0.67864
## savings_accA63 -3.840e-01 4.565e-01 -0.841 0.40024
## savings_accA64 -1.161e+00 6.197e-01 -1.873 0.06105 .
## savings_accA65 -8.039e-01 2.915e-01 -2.758 0.00582 **
## present_emp_sinceA72 8.080e-02 4.867e-01 0.166 0.86815
## present_emp_sinceA73 -3.927e-01 4.645e-01 -0.845 0.39793
## present_emp_sinceA74 -1.094e+00 5.020e-01 -2.179 0.02930 *
## present_emp_sinceA75 -6.749e-01 4.709e-01 -1.433 0.15178
## inst_rate 3.022e-01 9.991e-02 3.024 0.00249 **
## residing_since 5.304e-02 9.227e-02 0.575 0.56539
## age -1.947e-02 1.002e-02 -1.944 0.05193 .
## inst_plansA142 -6.615e-01 4.793e-01 -1.380 0.16752
## inst_plansA143 -6.644e-01 2.655e-01 -2.503 0.01232 *
## num_credits 2.993e-01 2.077e-01 1.441 0.14957
## jobA172 7.110e-01 7.718e-01 0.921 0.35693
## jobA173 9.156e-01 7.472e-01 1.225 0.22042
## jobA174 6.675e-01 7.597e-01 0.879 0.37960
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 672.06 on 673 degrees of freedom
## AIC: 726.06
##
## Number of Fisher Scoring iterations: 5
model = glm(status~.-Credit_Type-num_credits,data = trainingdataset,family = "binomial")
print(summary(model))
##
## Call:
## glm(formula = status ~ . - Credit_Type - num_credits, family = "binomial",
## data = trainingdataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9029 -0.7534 -0.4372 0.7760 2.5640
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.243e-01 1.066e+00 0.210 0.83333
## checkin_accA12 -5.577e-01 2.434e-01 -2.291 0.02194 *
## checkin_accA13 -1.233e+00 4.257e-01 -2.895 0.00379 **
## checkin_accA14 -1.724e+00 2.576e-01 -6.691 2.22e-11 ***
## duration 2.579e-02 1.076e-02 2.397 0.01651 *
## credit_historyA31 2.842e-03 5.835e-01 0.005 0.99611
## credit_historyA32 -7.919e-01 4.358e-01 -1.817 0.06919 .
## credit_historyA33 -5.206e-01 4.981e-01 -1.045 0.29599
## credit_historyA34 -1.261e+00 4.668e-01 -2.702 0.00690 **
## amount 7.508e-05 5.121e-05 1.466 0.14260
## savings_accA62 -9.150e-02 3.241e-01 -0.282 0.77773
## savings_accA63 -3.823e-01 4.562e-01 -0.838 0.40196
## savings_accA64 -1.223e+00 6.276e-01 -1.949 0.05129 .
## savings_accA65 -8.177e-01 2.938e-01 -2.783 0.00539 **
## present_emp_sinceA72 7.530e-02 4.899e-01 0.154 0.87784
## present_emp_sinceA73 -3.441e-01 4.662e-01 -0.738 0.46045
## present_emp_sinceA74 -9.484e-01 5.032e-01 -1.885 0.05946 .
## present_emp_sinceA75 -5.525e-01 4.720e-01 -1.170 0.24180
## inst_rate 3.311e-01 1.016e-01 3.257 0.00113 **
## personal_statusA92 -6.690e-02 4.572e-01 -0.146 0.88367
## personal_statusA93 -4.596e-01 4.498e-01 -1.022 0.30694
## personal_statusA94 -9.441e-01 5.499e-01 -1.717 0.08600 .
## residing_since 4.344e-02 9.331e-02 0.466 0.64155
## age -1.887e-02 1.009e-02 -1.869 0.06159 .
## inst_plansA142 -5.743e-01 4.761e-01 -1.206 0.22774
## inst_plansA143 -6.955e-01 2.673e-01 -2.602 0.00926 **
## jobA172 6.583e-01 7.761e-01 0.848 0.39637
## jobA173 8.345e-01 7.517e-01 1.110 0.26693
## jobA174 5.840e-01 7.651e-01 0.763 0.44523
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 666.88 on 671 degrees of freedom
## AIC: 724.88
##
## Number of Fisher Scoring iterations: 5
model = glm(status~.-Credit_Type-job,data = trainingdataset,family = "binomial")
print(summary(model))
##
## Call:
## glm(formula = status ~ . - Credit_Type - job, family = "binomial",
## data = trainingdataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8223 -0.7459 -0.4406 0.7979 2.5556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.798e-01 9.764e-01 0.287 0.774448
## checkin_accA12 -5.635e-01 2.429e-01 -2.320 0.020318 *
## checkin_accA13 -1.233e+00 4.263e-01 -2.892 0.003826 **
## checkin_accA14 -1.714e+00 2.569e-01 -6.671 2.54e-11 ***
## duration 2.698e-02 1.066e-02 2.531 0.011369 *
## credit_historyA31 1.165e-01 5.934e-01 0.196 0.844338
## credit_historyA32 -5.925e-01 4.529e-01 -1.308 0.190814
## credit_historyA33 -4.882e-01 4.949e-01 -0.987 0.323855
## credit_historyA34 -1.284e+00 4.665e-01 -2.752 0.005929 **
## amount 6.950e-05 4.829e-05 1.439 0.150076
## savings_accA62 -7.722e-02 3.243e-01 -0.238 0.811782
## savings_accA63 -3.733e-01 4.516e-01 -0.827 0.408470
## savings_accA64 -1.198e+00 6.215e-01 -1.927 0.053970 .
## savings_accA65 -8.151e-01 2.939e-01 -2.773 0.005551 **
## present_emp_sinceA72 3.242e-01 4.369e-01 0.742 0.458064
## present_emp_sinceA73 -7.680e-02 4.035e-01 -0.190 0.849043
## present_emp_sinceA74 -7.113e-01 4.491e-01 -1.584 0.113195
## present_emp_sinceA75 -3.257e-01 4.217e-01 -0.772 0.439883
## inst_rate 3.349e-01 1.001e-01 3.347 0.000817 ***
## personal_statusA92 -9.952e-02 4.552e-01 -0.219 0.826936
## personal_statusA93 -4.914e-01 4.480e-01 -1.097 0.272679
## personal_statusA94 -9.813e-01 5.471e-01 -1.794 0.072864 .
## residing_since 3.596e-02 9.303e-02 0.387 0.699079
## age -1.995e-02 1.002e-02 -1.991 0.046505 *
## inst_plansA142 -5.956e-01 4.772e-01 -1.248 0.211939
## inst_plansA143 -6.874e-01 2.660e-01 -2.584 0.009774 **
## num_credits 2.898e-01 2.095e-01 1.383 0.166596
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 666.90 on 673 degrees of freedom
## AIC: 720.9
##
## Number of Fisher Scoring iterations: 5
model = glm(status~.-Credit_Type-residing_since,data = trainingdataset,family = "binomial")
print(summary(model)) ## AIC drops from 724.67 to 722.77 while the residual deviance barely changes (664.67 to 664.77)
##
## Call:
## glm(formula = status ~ . - Credit_Type - residing_since, family = "binomial",
## data = trainingdataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8691 -0.7493 -0.4375 0.7692 2.5637
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.319e-01 1.134e+00 -0.293 0.769854
## checkin_accA12 -5.445e-01 2.435e-01 -2.236 0.025332 *
## checkin_accA13 -1.208e+00 4.251e-01 -2.842 0.004486 **
## checkin_accA14 -1.730e+00 2.575e-01 -6.716 1.86e-11 ***
## duration 2.726e-02 1.086e-02 2.510 0.012078 *
## credit_historyA31 1.998e-01 5.989e-01 0.334 0.738726
## credit_historyA32 -5.943e-01 4.568e-01 -1.301 0.193243
## credit_historyA33 -5.182e-01 4.996e-01 -1.037 0.299613
## credit_historyA34 -1.290e+00 4.697e-01 -2.747 0.006020 **
## amount 7.328e-05 5.154e-05 1.422 0.155068
## savings_accA62 -1.054e-01 3.258e-01 -0.324 0.746235
## savings_accA63 -3.322e-01 4.575e-01 -0.726 0.467678
## savings_accA64 -1.248e+00 6.312e-01 -1.977 0.048056 *
## savings_accA65 -8.094e-01 2.939e-01 -2.754 0.005882 **
## present_emp_sinceA72 4.902e-02 4.880e-01 0.100 0.919986
## present_emp_sinceA73 -3.657e-01 4.660e-01 -0.785 0.432612
## present_emp_sinceA74 -1.003e+00 5.044e-01 -1.989 0.046751 *
## present_emp_sinceA75 -5.941e-01 4.733e-01 -1.255 0.209409
## inst_rate 3.362e-01 1.021e-01 3.293 0.000992 ***
## personal_statusA92 -4.621e-02 4.542e-01 -0.102 0.918959
## personal_statusA93 -4.479e-01 4.496e-01 -0.996 0.319163
## personal_statusA94 -9.468e-01 5.496e-01 -1.723 0.084914 .
## age -1.886e-02 1.003e-02 -1.880 0.060148 .
## inst_plansA142 -6.428e-01 4.783e-01 -1.344 0.178978
## inst_plansA143 -7.041e-01 2.673e-01 -2.634 0.008438 **
## num_credits 3.190e-01 2.091e-01 1.526 0.127084
## jobA172 7.759e-01 7.761e-01 1.000 0.317475
## jobA173 9.386e-01 7.531e-01 1.246 0.212606
## jobA174 6.585e-01 7.663e-01 0.859 0.390180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 851.79 on 699 degrees of freedom
## Residual deviance: 664.77 on 671 degrees of freedom
## AIC: 722.77
##
## Number of Fisher Scoring iterations: 5
The code above tracks how the null deviance, residual deviance and AIC change as each predictor is dropped from the full model. Removing residing_since lowers the AIC from 724.67 to 722.77 while the residual deviance barely moves (664.67 to 664.77). Removing job gives a lower AIC still (720.90), but raises the residual deviance to 666.90. The residing_since variable is therefore left out of the final model.
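For a 0/1 response fitted with glm, the reported AIC equals the residual deviance plus twice the number of estimated coefficients, which explains why AIC can fall even when the deviance rises slightly. The sketch below redoes that arithmetic from figures copied out of the summaries above; the data frame fits is only a convenience for illustration.

```r
# Residual deviance and residual df copied from the summary outputs above
fits <- data.frame(
  dropped   = c("(none)", "residing_since", "job"),
  resid_dev = c(664.67, 664.77, 666.90),
  resid_df  = c(670, 671, 673)
)
n_obs <- 700  # training rows: the null deviance has 699 df

# Number of estimated coefficients, then AIC = deviance + 2 * k
fits$k   <- n_obs - fits$resid_df
fits$aic <- fits$resid_dev + 2 * fits$k
print(fits)

# Likelihood-ratio test for residing_since: deviance rises by 0.10 on 1 df
lr_p <- pchisq(664.77 - 664.67, df = 1, lower.tail = FALSE)
print(lr_p)
```

The computed AIC values reproduce the ones printed by summary(), and the large p-value of the likelihood-ratio test is consistent with dropping residing_since.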
Applying the model to validation dataset to predict the status variable
probprediction = predict(model,newdata = validationdataset,type = "response")
print(head(probprediction))
## 2 10 14 18 21 23
## 0.6621447 0.3768487 0.2691726 0.8771605 0.1060015 0.1910474
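With type = "response", predict() returns probabilities of status = 1 rather than log-odds. The relationship between the two scales is the logistic function, which the sketch below checks on a small simulated fit (the data frame sim and the model fit are assumptions for illustration).

```r
set.seed(7)
# Simulated data (illustrative only)
sim <- data.frame(x = rnorm(100))
sim$y <- rbinom(100, 1, plogis(0.5 * sim$x))
fit <- glm(y ~ x, data = sim, family = "binomial")

# type = "link" gives the linear predictor (log-odds);
# type = "response" gives the probability of y = 1
eta <- predict(fit, newdata = sim, type = "link")
p   <- predict(fit, newdata = sim, type = "response")

# The two are linked by the logistic function p = 1 / (1 + exp(-eta))
same <- isTRUE(all.equal(p, 1 / (1 + exp(-eta))))
print(same)
print(head(cbind(eta = eta, p = p)))
```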
Using the ROCR library to determine the optimum cutoff value from the ROC curve of the training data
library(ROCR)
res <- predict(model,trainingdataset,type = "response")
ROCR_Predicted <- prediction(res,trainingdataset$status)
ROCR_Performance <- performance(ROCR_Predicted,"tpr","fpr")
plot(ROCR_Performance,colorize= TRUE,print.cutoffs.at=seq(0.1,by=0.1))
Graph to determine cut off value
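Reading a cutoff off the ROC plot is a visual judgment. One common way to make it explicit is to pick the cutoff that maximises Youden's J (true positive rate minus false positive rate). That criterion is an assumption introduced here, not something the analysis above uses, and the sketch relies on base R and toy values rather than ROCR.

```r
# Toy predicted probabilities and true labels (illustrative values only)
prob  <- c(0.05, 0.15, 0.25, 0.35, 0.42, 0.55, 0.65, 0.75, 0.85, 0.95)
truth <- c(0,    0,    0,    1,    0,    0,    1,    1,    1,    1)

cutoffs <- seq(0.1, 0.9, by = 0.1)

# True positive rate and false positive rate at each candidate cutoff
roc <- t(sapply(cutoffs, function(cut) {
  pred <- as.integer(prob > cut)
  c(cutoff = cut,
    tpr = sum(pred == 1 & truth == 1) / sum(truth == 1),
    fpr = sum(pred == 1 & truth == 0) / sum(truth == 0))
}))

# Youden's J = TPR - FPR; its maximum is one common choice of cutoff
roc <- cbind(roc, J = roc[, "tpr"] - roc[, "fpr"])
print(roc)

best_cutoff <- roc[which.max(roc[, "J"]), "cutoff"]
print(best_cutoff)
```

Other criteria, such as weighting false negatives more heavily than false positives, may suit a credit-risk setting better.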
Assuming Cutoff = 0.5
cut_off = 0.5
prediction = ifelse(probprediction>cut_off,1,0)
validationdataset$predictedStatus = prediction
print(head(validationdataset))
## checkin_acc duration credit_history amount savings_acc present_emp_since
## 2 A12 48 A32 5951 A61 A73
## 10 A12 30 A34 5234 A61 A71
## 14 A11 24 A34 1199 A61 A75
## 18 A11 30 A30 8072 A65 A72
## 21 A14 9 A34 2134 A61 A73
## 23 A11 10 A34 2241 A61 A72
## inst_rate personal_status residing_since age inst_plans num_credits job
## 2 2 A92 2 22 A143 1 A173
## 10 4 A94 2 28 A143 2 A174
## 14 4 A93 4 60 A143 2 A172
## 18 2 A93 3 25 A141 3 A173
## 21 4 A93 4 48 A143 3 A173
## 23 1 A93 3 48 A143 2 A172
## status Credit_Type predictedStatus
## 2 1 Bad 1
## 10 1 Bad 0
## 14 1 Bad 0
## 18 0 Good 1
## 21 0 Good 0
## 23 0 Good 0
library(caret)
## Loading required package: lattice
# Evaluation of the model
confusionMatrix(as.factor(validationdataset$predictedStatus),as.factor(validationdataset$status),positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 183 53
## 1 25 39
##
## Accuracy : 0.74
## 95% CI : (0.6865, 0.7887)
## No Information Rate : 0.6933
## P-Value [Acc > NIR] : 0.043914
##
## Kappa : 0.3319
##
## Mcnemar's Test P-Value : 0.002235
##
## Sensitivity : 0.4239
## Specificity : 0.8798
## Pos Pred Value : 0.6094
## Neg Pred Value : 0.7754
## Prevalence : 0.3067
## Detection Rate : 0.1300
## Detection Prevalence : 0.2133
## Balanced Accuracy : 0.6519
##
## 'Positive' Class : 1
##
Cutoff as 0.4
cut_off = 0.4
prediction = ifelse(probprediction>cut_off,1,0)
validationdataset$predictedStatus = prediction
print(head(validationdataset))
## checkin_acc duration credit_history amount savings_acc present_emp_since
## 2 A12 48 A32 5951 A61 A73
## 10 A12 30 A34 5234 A61 A71
## 14 A11 24 A34 1199 A61 A75
## 18 A11 30 A30 8072 A65 A72
## 21 A14 9 A34 2134 A61 A73
## 23 A11 10 A34 2241 A61 A72
## inst_rate personal_status residing_since age inst_plans num_credits job
## 2 2 A92 2 22 A143 1 A173
## 10 4 A94 2 28 A143 2 A174
## 14 4 A93 4 60 A143 2 A172
## 18 2 A93 3 25 A141 3 A173
## 21 4 A93 4 48 A143 3 A173
## 23 1 A93 3 48 A143 2 A172
## status Credit_Type predictedStatus
## 2 1 Bad 1
## 10 1 Bad 0
## 14 1 Bad 0
## 18 0 Good 1
## 21 0 Good 0
## 23 0 Good 0
library(caret)
# Evaluation of the model
confusionMatrix(as.factor(validationdataset$predictedStatus),as.factor(validationdataset$status),positive = "0")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 162 40
## 1 46 52
##
## Accuracy : 0.7133
## 95% CI : (0.6586, 0.7638)
## No Information Rate : 0.6933
## P-Value [Acc > NIR] : 0.2469
##
## Kappa : 0.3379
##
## Mcnemar's Test P-Value : 0.5898
##
## Sensitivity : 0.7788
## Specificity : 0.5652
## Pos Pred Value : 0.8020
## Neg Pred Value : 0.5306
## Prevalence : 0.6933
## Detection Rate : 0.5400
## Detection Prevalence : 0.6733
## Balanced Accuracy : 0.6720
##
## 'Positive' Class : 0
##
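Accuracy is higher with a cutoff of 0.5 (74%) than with a cutoff of 0.4 (71.33%). Note that the two confusion matrices use different positive classes (1 for cutoff 0.5, 0 for cutoff 0.4), so their sensitivity and specificity rows are not directly comparable. Taking status = 1 (bad credit) as the positive class in both, lowering the cutoff to 0.4 raises sensitivity from 0.4239 to 0.5652 and lowers specificity from 0.8798 to 0.7788.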
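Recomputing the headline metrics from the confusion-matrix counts, with status = 1 as the positive class for both cutoffs, puts the two thresholds on the same footing. The helper name metrics is an assumption for illustration; the counts are copied from the output above.

```r
# Counts copied from the two confusion matrices above
# (rows = prediction, columns = reference, order 0 then 1)
cm_050 <- matrix(c(183, 25, 53, 39), nrow = 2,
                 dimnames = list(pred = c("0", "1"), ref = c("0", "1")))
cm_040 <- matrix(c(162, 46, 40, 52), nrow = 2,
                 dimnames = list(pred = c("0", "1"), ref = c("0", "1")))

# Metrics with status = 1 as the positive class in both cases
metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm[, "1"]),
    specificity = cm["0", "0"] / sum(cm[, "0"]))
}

comparison <- rbind(cutoff_0.5 = metrics(cm_050),
                    cutoff_0.4 = metrics(cm_040))
print(round(comparison, 4))
```

The lower cutoff gives up some accuracy and specificity in exchange for catching more of the bad credits, which may matter more than raw accuracy when a missed bad credit is costly.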