We are given the German Credit data set and are asked to use it to maximize profit, knowing that the cost of a false positive outweighs the benefit of a true positive by a factor of 5. We are tasked with creating the model that yields the highest net profit. We will build several types of models and determine which performs best.
Hypothesis: Based on what I know about banking, my guess is that the following predictors will matter most in determining whether someone is a good or bad credit risk: Checking Balance, Savings Balance, History, Duration, and Employment.
Let's fit a few logistic models on the full data set, plus a classification tree, to see which predictors are useful. Here is a summary of the first logistic model, using all 30 predictors and all 1,000 observations.
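A minimal sketch of the fit, assuming the data frame is named D with the 0/1 target RESPONSE (both names appear in the output below):

# Logistic regression on all 30 predictors of the German Credit data
m1 <- glm(RESPONSE ~ ., family = binomial, data = D)
summary(m1)   # produces the coefficient table below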
##
## Call:
## glm(formula = RESPONSE ~ ., family = binomial, data = D)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6535 -0.7188 0.3876 0.7071 2.3595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.016e+00 8.675e-01 1.171 0.241446
## CHK_ACCT 5.641e-01 7.250e-02 7.780 7.24e-15 ***
## DURATION -2.695e-02 9.007e-03 -2.992 0.002770 **
## HISTORY 4.007e-01 8.974e-02 4.466 7.99e-06 ***
## NEW_CAR -7.931e-01 3.846e-01 -2.062 0.039193 *
## USED_CAR 8.271e-01 4.818e-01 1.717 0.086011 .
## FURNITURE -3.759e-02 3.989e-01 -0.094 0.924937
## `RADIO/TV` 7.004e-02 3.884e-01 0.180 0.856884
## EDUCATION -8.658e-01 5.009e-01 -1.728 0.083918 .
## RETRAINING -8.050e-02 4.414e-01 -0.182 0.855300
## AMOUNT -1.178e-04 4.265e-05 -2.761 0.005756 **
## SAV_ACCT 2.497e-01 6.060e-02 4.121 3.77e-05 ***
## EMPLOYMENT 1.175e-01 7.474e-02 1.571 0.116068
## INSTALL_RATE -3.215e-01 8.630e-02 -3.725 0.000195 ***
## MALE_DIV -3.417e-01 3.815e-01 -0.896 0.370467
## MALE_SINGLE 5.406e-01 2.048e-01 2.640 0.008292 **
## MALE_MAR_or_WID 1.114e-01 3.046e-01 0.366 0.714668
## `CO-APPLICANT` -3.500e-01 3.988e-01 -0.878 0.380165
## GUARANTOR 9.463e-01 4.195e-01 2.256 0.024084 *
## PRESENT_RESIDENT -1.275e-02 8.404e-02 -0.152 0.879374
## REAL_ESTATE 2.092e-01 2.093e-01 0.999 0.317569
## PROP_UNKN_NONE -5.551e-01 3.732e-01 -1.487 0.136927
## AGE 1.147e-02 8.665e-03 1.323 0.185723
## OTHER_INSTALL -6.213e-01 2.040e-01 -3.045 0.002324 **
## RENT -6.555e-01 4.602e-01 -1.424 0.154344
## OWN_RES -2.405e-01 4.356e-01 -0.552 0.580920
## NUM_CREDITS -2.301e-01 1.662e-01 -1.385 0.166128
## JOB -3.047e-02 1.423e-01 -0.214 0.830416
## NUM_DEPENDENTS -2.581e-01 2.456e-01 -1.051 0.293322
## TELEPHONE 3.553e-01 1.951e-01 1.821 0.068610 .
## FOREIGN 1.453e+00 6.221e-01 2.335 0.019532 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 909.2 on 969 degrees of freedom
## AIC: 971.2
##
## Number of Fisher Scoring iterations: 5
The confusion matrix for the first model, using all 30 predictors, is listed below. Note: I have raised the cutoff of all confusion matrices to decrease the number of false positives. This lowers accuracy, but improves overall profitability, since a false positive loses five times what a true positive gains.
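A sketch of how each matrix is produced, assuming the caret package and the raised cutoff (0.8 is the value used for most models later in this report):

library(caret)
p1 <- predict(m1, type = "response")                       # fitted probability of good credit
pred1 <- factor(ifelse(p1 > 0.8, 1, 0), levels = c(0, 1))  # raised cutoff cuts false positives
confusionMatrix(pred1, factor(D$RESPONSE), positive = "1")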
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 257 275
## 1 43 425
##
## Accuracy : 0.682
## 95% CI : (0.6521, 0.7108)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.8986
##
## Kappa : 0.3799
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6071
## Specificity : 0.8567
## Pos Pred Value : 0.9081
## Neg Pred Value : 0.4831
## Prevalence : 0.7000
## Detection Rate : 0.4250
## Detection Prevalence : 0.4680
## Balanced Accuracy : 0.7319
##
## 'Positive' Class : 1
##
After viewing the first model, I noticed that CHK_ACCT, DURATION, HISTORY, NEW_CAR, USED_CAR, EDUCATION, AMOUNT, SAV_ACCT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, OTHER_INSTALL, TELEPHONE, and FOREIGN were statistically significant at the p < 0.1 level.
Let's make two new models: one using the predictors with p-value < 0.1 and one using the predictors with p-value < 0.05.
##
## Call:
## glm(formula = RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR +
## USED_CAR + EDUCATION + AMOUNT + SAV_ACCT + INSTALL_RATE +
## MALE_SINGLE + GUARANTOR + OTHER_INSTALL + TELEPHONE + FOREIGN,
## family = binomial, data = D)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7749 -0.7622 0.4022 0.7267 2.2357
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.948e-01 3.638e-01 1.635 0.102119
## CHK_ACCT 5.994e-01 7.102e-02 8.439 < 2e-16 ***
## DURATION -2.744e-02 8.714e-03 -3.149 0.001639 **
## HISTORY 3.816e-01 7.933e-02 4.811 1.50e-06 ***
## NEW_CAR -7.945e-01 1.953e-01 -4.069 4.72e-05 ***
## USED_CAR 7.999e-01 3.368e-01 2.375 0.017535 *
## EDUCATION -9.857e-01 3.631e-01 -2.715 0.006628 **
## AMOUNT -1.384e-04 4.146e-05 -3.339 0.000841 ***
## SAV_ACCT 2.623e-01 5.913e-02 4.437 9.14e-06 ***
## INSTALL_RATE -2.942e-01 8.283e-02 -3.552 0.000382 ***
## MALE_SINGLE 5.967e-01 1.700e-01 3.510 0.000449 ***
## GUARANTOR 1.126e+00 4.058e-01 2.774 0.005539 **
## OTHER_INSTALL -6.287e-01 1.991e-01 -3.158 0.001589 **
## TELEPHONE 3.326e-01 1.778e-01 1.871 0.061407 .
## FOREIGN 1.396e+00 6.299e-01 2.216 0.026667 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 928.0 on 985 degrees of freedom
## AIC: 958
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR +
## AMOUNT + SAV_ACCT + INSTALL_RATE + MALE_SINGLE + GUARANTOR +
## OTHER_INSTALL + FOREIGN, family = binomial, data = D)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6928 -0.8152 0.4116 0.7479 2.1184
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.928e-01 3.587e-01 1.653 0.098409 .
## CHK_ACCT 5.998e-01 6.999e-02 8.569 < 2e-16 ***
## DURATION -2.919e-02 8.622e-03 -3.385 0.000711 ***
## HISTORY 3.859e-01 7.828e-02 4.930 8.22e-07 ***
## NEW_CAR -8.040e-01 1.881e-01 -4.274 1.92e-05 ***
## AMOUNT -9.645e-05 3.844e-05 -2.509 0.012110 *
## SAV_ACCT 2.647e-01 5.810e-02 4.555 5.23e-06 ***
## INSTALL_RATE -2.862e-01 8.177e-02 -3.501 0.000464 ***
## MALE_SINGLE 6.095e-01 1.684e-01 3.620 0.000294 ***
## GUARANTOR 1.134e+00 4.021e-01 2.819 0.004812 **
## OTHER_INSTALL -6.180e-01 1.971e-01 -3.136 0.001715 **
## FOREIGN 1.340e+00 6.216e-01 2.156 0.031079 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 946.55 on 988 degrees of freedom
## AIC: 970.55
##
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 255 296
## 1 45 404
##
## Accuracy : 0.659
## 95% CI : (0.6287, 0.6884)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.9977
##
## Kappa : 0.3447
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5771
## Specificity : 0.8500
## Pos Pred Value : 0.8998
## Neg Pred Value : 0.4628
## Prevalence : 0.7000
## Detection Rate : 0.4040
## Detection Prevalence : 0.4490
## Balanced Accuracy : 0.7136
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 254 306
## 1 46 394
##
## Accuracy : 0.648
## 95% CI : (0.6175, 0.6776)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.3282
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5629
## Specificity : 0.8467
## Pos Pred Value : 0.8955
## Neg Pred Value : 0.4536
## Prevalence : 0.7000
## Detection Rate : 0.3940
## Detection Prevalence : 0.4400
## Balanced Accuracy : 0.7048
##
## 'Positive' Class : 1
##
Based on the first few logistic models, we can see which predictors are significant and which are not, and we learned some key information, including a few surprises.
The biggest surprise is that neither Employment nor Age is a particularly significant predictor.
Another surprise is that whether the loan is for a used car or a new car is significant in determining whether an applicant is considered a good or bad credit risk.
I have partitioned the data into training data (60%) and validation data (40%).
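A minimal sketch of the partition; the seed value is an assumption, since the report does not state it:

set.seed(1)                                    # assumed seed for reproducibility
train_idx <- sample(nrow(D), 0.6 * nrow(D))    # 60% training
train <- D[train_idx, ]
valid <- D[-train_idx, ]                       # remaining 40% validation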
Here are the lift chart and confusion matrix for the first logistic model, fit on the training data with all 30 predictors.
Note: I used a cutoff of 0.8 (80%) for almost all models because I found it to be the most profitable cutoff.
This is also the most profitable of the three logistic models presented, although there is some overfitting.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 133
## 1 14 161
##
## Accuracy : 0.6325
## 95% CI : (0.5832, 0.6799)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3058
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5476
## Specificity : 0.8679
## Pos Pred Value : 0.9200
## Neg Pred Value : 0.4089
## Prevalence : 0.7350
## Detection Rate : 0.4025
## Detection Prevalence : 0.4375
## Balanced Accuracy : 0.7078
##
## 'Positive' Class : 1
##
Here are the lift chart and confusion matrix for the second logistic model, fit on the training data with CHK_ACCT, DURATION, HISTORY, NEW_CAR, USED_CAR, EDUCATION, AMOUNT, SAV_ACCT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, OTHER_INSTALL, TELEPHONE, and FOREIGN as predictors. These are all of the predictors with p-value < 0.1.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 141
## 1 14 153
##
## Accuracy : 0.6125
## 95% CI : (0.5628, 0.6605)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2808
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5204
## Specificity : 0.8679
## Pos Pred Value : 0.9162
## Neg Pred Value : 0.3948
## Prevalence : 0.7350
## Detection Rate : 0.3825
## Detection Prevalence : 0.4175
## Balanced Accuracy : 0.6942
##
## 'Positive' Class : 1
##
Here are the lift chart and confusion matrix for the third logistic model, fit on the training data with CHK_ACCT, DURATION, HISTORY, NEW_CAR, AMOUNT, SAV_ACCT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, OTHER_INSTALL, and FOREIGN as predictors. These are all of the predictors with p-value < 0.05.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 148
## 1 14 146
##
## Accuracy : 0.595
## 95% CI : (0.5451, 0.6435)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2596
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.4966
## Specificity : 0.8679
## Pos Pred Value : 0.9125
## Neg Pred Value : 0.3833
## Prevalence : 0.7350
## Detection Rate : 0.3650
## Detection Prevalence : 0.4000
## Balanced Accuracy : 0.6823
##
## 'Positive' Class : 1
##
Comparing these three models, the first has both the highest accuracy and the highest positive predictive value, so we will use it for problem 3. We are looking for the most true positives with the fewest false positives in order to maximize profit.
Next we will model the data using classification trees.
Using rpart, we first grow a classification tree and then check the importance of the predictors.
Note these importances are fairly close to those we found with the logistic models.
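A sketch of the tree construction, assuming the rpart package and the train partition above:

library(rpart)
tree1 <- rpart(as.factor(RESPONSE) ~ ., data = train, method = "class")
tree1$variable.importance    # predictor importance, comparable to the glm results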
Below is the confusion matrix for tree classification model 1.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 138
## 1 17 156
##
## Accuracy : 0.6125
## 95% CI : (0.5628, 0.6605)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2712
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8396
## Specificity : 0.5306
## Pos Pred Value : 0.3921
## Neg Pred Value : 0.9017
## Prevalence : 0.2650
## Detection Rate : 0.2225
## Detection Prevalence : 0.5675
## Balanced Accuracy : 0.6851
##
## 'Positive' Class : 0
##
Now we prune the tree.
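A sketch of the pruning step, selecting the complexity parameter with the lowest cross-validated error:

best_cp <- tree1$cptable[which.min(tree1$cptable[, "xerror"]), "CP"]
pruned <- prune(tree1, cp = best_cp)   # trim splits that don't reduce CV error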
Below is the pruned tree's confusion matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 110
## 1 20 184
##
## Accuracy : 0.675
## 95% CI : (0.6267, 0.7207)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9968
##
## Kappa : 0.3438
##
## Mcnemar's Test P-Value : 5.912e-15
##
## Sensitivity : 0.8113
## Specificity : 0.6259
## Pos Pred Value : 0.4388
## Neg Pred Value : 0.9020
## Prevalence : 0.2650
## Detection Rate : 0.2150
## Detection Prevalence : 0.4900
## Balanced Accuracy : 0.7186
##
## 'Positive' Class : 0
##
For the random forest model I used a cutoff of 0.7 because it showed a better profit.
Its variable importance is also printed; note that it differs from the logistic models in some ways, but is fairly close in others. This is the most profitable of the tree-based models.
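A sketch of this fit, assuming the randomForest package and applying the 0.7 cutoff to the predicted class probabilities:

library(randomForest)
set.seed(1)                                        # assumed seed
rf <- randomForest(as.factor(RESPONSE) ~ ., data = train, importance = TRUE)
importance(rf)                                     # variable importance
p_rf <- predict(rf, valid, type = "prob")[, "1"]   # probability of good credit
pred_rf <- ifelse(p_rf > 0.7, 1, 0)                # 0.7 cutoff chosen for profit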
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 123
## 1 15 171
##
## Accuracy : 0.655
## 95% CI : (0.6061, 0.7015)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.332
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8585
## Specificity : 0.5816
## Pos Pred Value : 0.4252
## Neg Pred Value : 0.9194
## Prevalence : 0.2650
## Detection Rate : 0.2275
## Detection Prevalence : 0.5350
## Balanced Accuracy : 0.7201
##
## 'Positive' Class : 0
##
Creating a neural network with the variables CHK_ACCT, DURATION, OTHER_INSTALL, SAV_ACCT, HISTORY, AMOUNT, USED_CAR, and GUARANTOR. This network has one hidden layer with 4 neurons.
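A sketch of the network, assuming the neuralnet package (the report does not state which package was used):

library(neuralnet)
set.seed(1)                                  # assumed seed
vars <- c("CHK_ACCT", "DURATION", "OTHER_INSTALL", "SAV_ACCT",
          "HISTORY", "AMOUNT", "USED_CAR", "GUARANTOR")
nn1 <- neuralnet(RESPONSE ~ CHK_ACCT + DURATION + OTHER_INSTALL + SAV_ACCT +
                   HISTORY + AMOUNT + USED_CAR + GUARANTOR,
                 data = train, hidden = 4, linear.output = FALSE)
p_nn <- compute(nn1, valid[, vars])$net.result   # predicted probabilities on validation data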
Confusion Matrix and Lift Chart for Neural Network 1
## ME RMSE MAE MPE MAPE
## Test set 0.06034458 0.4067656 0.3118412 NaN Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 142
## 1 15 152
##
## Accuracy : 0.6075
## 95% CI : (0.5577, 0.6557)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2715
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8585
## Specificity : 0.5170
## Pos Pred Value : 0.3906
## Neg Pred Value : 0.9102
## Prevalence : 0.2650
## Detection Rate : 0.2275
## Detection Prevalence : 0.5825
## Balanced Accuracy : 0.6877
##
## 'Positive' Class : 0
##
Creating a second neural network with the same variables (CHK_ACCT, DURATION, OTHER_INSTALL, SAV_ACCT, HISTORY, AMOUNT, USED_CAR, and GUARANTOR). This network has one hidden layer with 2 neurons.
Confusion Matrix and Lift Chart for Neural Network 2
## ME RMSE MAE MPE MAPE
## Test set 0.05530795 0.3978058 0.316457 NaN Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 102
## 1 20 192
##
## Accuracy : 0.695
## 95% CI : (0.6473, 0.7398)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9678
##
## Kappa : 0.3723
##
## Mcnemar's Test P-Value : 2.244e-13
##
## Sensitivity : 0.8113
## Specificity : 0.6531
## Pos Pred Value : 0.4574
## Neg Pred Value : 0.9057
## Prevalence : 0.2650
## Detection Rate : 0.2150
## Detection Prevalence : 0.4700
## Balanced Accuracy : 0.7322
##
## 'Positive' Class : 0
##
Based on the confusion matrices and lift charts above, I will choose the first logistic model (I know it overfits, but I am looking solely at profitability here), the random forest model, and Neural Network 2.
Below are the confusion matrices of all three, in order: logistic, random forest, then neural network.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 133
## 1 14 161
##
## Accuracy : 0.6325
## 95% CI : (0.5832, 0.6799)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3058
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5476
## Specificity : 0.8679
## Pos Pred Value : 0.9200
## Neg Pred Value : 0.4089
## Prevalence : 0.7350
## Detection Rate : 0.4025
## Detection Prevalence : 0.4375
## Balanced Accuracy : 0.7078
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 123
## 1 15 171
##
## Accuracy : 0.655
## 95% CI : (0.6061, 0.7015)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.332
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8585
## Specificity : 0.5816
## Pos Pred Value : 0.4252
## Neg Pred Value : 0.9194
## Prevalence : 0.2650
## Detection Rate : 0.2275
## Detection Prevalence : 0.5350
## Balanced Accuracy : 0.7201
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 102
## 1 20 192
##
## Accuracy : 0.695
## 95% CI : (0.6473, 0.7398)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9678
##
## Kappa : 0.3723
##
## Mcnemar's Test P-Value : 2.244e-13
##
## Sensitivity : 0.8113
## Specificity : 0.6531
## Pos Pred Value : 0.4574
## Neg Pred Value : 0.9057
## Prevalence : 0.2650
## Detection Rate : 0.2150
## Detection Prevalence : 0.4700
## Balanced Accuracy : 0.7322
##
## 'Positive' Class : 0
##
Below are the cost/gain matrices of the three models above, in order: logistic, random forest, then neural network.
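These matrices follow directly from the stated payoffs: each true positive gains 100 DM and each false positive costs 500 DM. A sketch of the calculation, using the logistic model's counts from its confusion matrix above:

tp <- 161; fp <- 14                     # true and false positives for the logistic model
profit <- 100 * tp - 500 * fp           # 16100 - 7000
paste("The overall profit would be", profit, "DM")   # 9100 DM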
## Reference
## Prediction Bad Good
## Bad 0 0
## Good -7000 16100
## [1] "The overall profit would be 9100 DM"
## Reference
## Prediction Bad Good
## Bad 0 0
## Good -7500 17100
## [1] "The overall profit would be 9600 DM"
## Reference
## Prediction Bad Good
## Bad 0 0
## Good -10000 19200
## [1] "The overall profit would be 9200 DM"
Logistic model profit: 9100 DM
Random forest profit: 9600 DM
Neural network profit: 9200 DM
Overall, the random forest classification model worked best, i.e., produced the most profit given the chosen cutoffs and seeds.
Below is a lift curve of the profits for the validation data. I created a vector of expected profits by multiplying each case's predicted probability by 100 for the actual good credits and by -500 for the actual bad credits. I then arranged these expected profits in descending order, built a cumulative profits vector, and found the maximum profit and where it occurs.
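A sketch of that construction, assuming pp holds the validation probabilities (named pp4 in the table further below) and valid$RESPONSE the actual outcomes:

profit1 <- ifelse(valid$RESPONSE == 1, 100 * pp, -500 * pp)  # expected profit per case
ord <- order(profit1, decreasing = TRUE)                     # goods first, bads last
cum_profit <- cumsum(profit1[ord])                           # cumulative profits vector
which.max(cum_profit)                                        # where profit peaks
max(cum_profit)                                              # the maximum profit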
Below are a cumulative lift curve and a decile-wise lift curve for the validation profit data with a 70% cutoff.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 79 89
## 1 27 205
##
## Accuracy : 0.71
## 95% CI : (0.6628, 0.754)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.8822
##
## Kappa : 0.3728
##
## Mcnemar's Test P-Value : 1.481e-08
##
## Sensitivity : 0.6973
## Specificity : 0.7453
## Pos Pred Value : 0.8836
## Neg Pred Value : 0.4702
## Prevalence : 0.7350
## Detection Rate : 0.5125
## Detection Prevalence : 0.5800
## Balanced Accuracy : 0.7213
##
## 'Positive' Class : 1
##
Below is a cumulative profits chart. I ordered the expected profits in descending order and computed the cumulative totals; the head and tail of the result are printed below.
## pp4 profit1
## 343 0.9939444 99.39444
## 24 0.9933122 198.72566
## 29 0.9932699 298.05265
## 202 0.9930737 397.36002
## 34 0.9922904 496.58906
## 311 0.9913745 595.72651
## pp4 profit1
## 26 0.9306132 -822.7914
## 42 0.9318435 -1288.7131
## 38 0.9389001 -1758.1632
## 178 0.9400596 -2228.1930
## 92 0.9417257 -2699.0558
## 201 0.9667376 -3182.4246
Looking at the bar plot, it shows that I should stop after the 5th decile, i.e., accept only the first 200 of the 400 people in the validation set.
Based on my decile-wise curve, it looks like I should go to about the 58th percentile to obtain the maximum profit, given a 70% cutoff. This is very close to what the first plot shows.
Based on the curve I showed, I should stop at case 294 for a profit of about 22,192 DM. This curve is interesting in that it sums the expected profits given the actual responses. That value is a lot higher than what any of the models achieved, so I question whether it is realistic.
Based on my confusion matrix above, the best cutoff I used was around 70% (versus 50%). This gives a profit of 205 * 100 - 27 * 500 = 7000 DM.
Alternatively, we can stop at the 5th decile, based on the bar plot.