We are given the German Credit data set and are asked to use it to maximize profit, knowing that the cost of a false positive outweighs the benefit of a true positive by a factor of 5. We are tasked with creating the model that yields the highest net profit. We will build several types of models and determine which performs best.
Hypothesis: Based on what I know about banking, my guess is that the following predictors will matter most in determining whether someone is a good or bad credit risk: Checking Balance, Savings Balance, History, Duration, and Employment.
Let's fit a few logistic models on the full data set, plus a classification tree, to see which predictors are useful. Here is a summary of the first logistic model, using all 30 predictors and all 1,000 observations.
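A minimal sketch of the fit, assuming the data frame is named D with the 0/1 target RESPONSE (both names appear in the output below):

# Logistic regression on all 30 predictors of the German Credit data
m1 <- glm(RESPONSE ~ ., family = binomial, data = D)
summary(m1)   # produces the coefficient table below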
##
## Call:
## glm(formula = RESPONSE ~ ., family = binomial, data = D)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6535 -0.7188 0.3876 0.7071 2.3595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.016e+00 8.675e-01 1.171 0.241446
## CHK_ACCT 5.641e-01 7.250e-02 7.780 7.24e-15 ***
## DURATION -2.695e-02 9.007e-03 -2.992 0.002770 **
## HISTORY 4.007e-01 8.974e-02 4.466 7.99e-06 ***
## NEW_CAR -7.931e-01 3.846e-01 -2.062 0.039193 *
## USED_CAR 8.271e-01 4.818e-01 1.717 0.086011 .
## FURNITURE -3.759e-02 3.989e-01 -0.094 0.924937
## `RADIO/TV` 7.004e-02 3.884e-01 0.180 0.856884
## EDUCATION -8.658e-01 5.009e-01 -1.728 0.083918 .
## RETRAINING -8.050e-02 4.414e-01 -0.182 0.855300
## AMOUNT -1.178e-04 4.265e-05 -2.761 0.005756 **
## SAV_ACCT 2.497e-01 6.060e-02 4.121 3.77e-05 ***
## EMPLOYMENT 1.175e-01 7.474e-02 1.571 0.116068
## INSTALL_RATE -3.215e-01 8.630e-02 -3.725 0.000195 ***
## MALE_DIV -3.417e-01 3.815e-01 -0.896 0.370467
## MALE_SINGLE 5.406e-01 2.048e-01 2.640 0.008292 **
## MALE_MAR_or_WID 1.114e-01 3.046e-01 0.366 0.714668
## `CO-APPLICANT` -3.500e-01 3.988e-01 -0.878 0.380165
## GUARANTOR 9.463e-01 4.195e-01 2.256 0.024084 *
## PRESENT_RESIDENT -1.275e-02 8.404e-02 -0.152 0.879374
## REAL_ESTATE 2.092e-01 2.093e-01 0.999 0.317569
## PROP_UNKN_NONE -5.551e-01 3.732e-01 -1.487 0.136927
## AGE 1.147e-02 8.665e-03 1.323 0.185723
## OTHER_INSTALL -6.213e-01 2.040e-01 -3.045 0.002324 **
## RENT -6.555e-01 4.602e-01 -1.424 0.154344
## OWN_RES -2.405e-01 4.356e-01 -0.552 0.580920
## NUM_CREDITS -2.301e-01 1.662e-01 -1.385 0.166128
## JOB -3.047e-02 1.423e-01 -0.214 0.830416
## NUM_DEPENDENTS -2.581e-01 2.456e-01 -1.051 0.293322
## TELEPHONE 3.553e-01 1.951e-01 1.821 0.068610 .
## FOREIGN 1.453e+00 6.221e-01 2.335 0.019532 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 909.2 on 969 degrees of freedom
## AIC: 971.2
##
## Number of Fisher Scoring iterations: 5
The confusion matrix for the first model, using all 30 predictors, is listed below. Note: I have raised the cutoff of all confusion matrices to decrease the number of false positives. This lowers accuracy, but improves overall profitability, since a false positive loses five times what a true positive gains.
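A sketch of how each matrix is produced, assuming the caret package and the raised cutoff (0.8 is the value used for most models later in this report):

library(caret)
p1 <- predict(m1, type = "response")                       # fitted probability of good credit
pred1 <- factor(ifelse(p1 > 0.8, 1, 0), levels = c(0, 1))  # raised cutoff cuts false positives
confusionMatrix(pred1, factor(D$RESPONSE), positive = "1")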
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 257 275
## 1 43 425
##
## Accuracy : 0.682
## 95% CI : (0.6521, 0.7108)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.8986
##
## Kappa : 0.3799
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6071
## Specificity : 0.8567
## Pos Pred Value : 0.9081
## Neg Pred Value : 0.4831
## Prevalence : 0.7000
## Detection Rate : 0.4250
## Detection Prevalence : 0.4680
## Balanced Accuracy : 0.7319
##
## 'Positive' Class : 1
##
After viewing the first model, I noticed that CHK_ACCT, DURATION, HISTORY, NEW_CAR, USED_CAR, EDUCATION, AMOUNT, SAV_ACCT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, OTHER_INSTALL, TELEPHONE, and FOREIGN were statistically significant at the p < 0.1 level.
Let's make two new models: one using the predictors with p-value < 0.1 and one using the predictors with p-value < 0.05.
##
## Call:
## glm(formula = RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR +
## USED_CAR + EDUCATION + AMOUNT + SAV_ACCT + INSTALL_RATE +
## MALE_SINGLE + GUARANTOR + OTHER_INSTALL + TELEPHONE + FOREIGN,
## family = binomial, data = D)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7749 -0.7622 0.4022 0.7267 2.2357
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.948e-01 3.638e-01 1.635 0.102119
## CHK_ACCT 5.994e-01 7.102e-02 8.439 < 2e-16 ***
## DURATION -2.744e-02 8.714e-03 -3.149 0.001639 **
## HISTORY 3.816e-01 7.933e-02 4.811 1.50e-06 ***
## NEW_CAR -7.945e-01 1.953e-01 -4.069 4.72e-05 ***
## USED_CAR 7.999e-01 3.368e-01 2.375 0.017535 *
## EDUCATION -9.857e-01 3.631e-01 -2.715 0.006628 **
## AMOUNT -1.384e-04 4.146e-05 -3.339 0.000841 ***
## SAV_ACCT 2.623e-01 5.913e-02 4.437 9.14e-06 ***
## INSTALL_RATE -2.942e-01 8.283e-02 -3.552 0.000382 ***
## MALE_SINGLE 5.967e-01 1.700e-01 3.510 0.000449 ***
## GUARANTOR 1.126e+00 4.058e-01 2.774 0.005539 **
## OTHER_INSTALL -6.287e-01 1.991e-01 -3.158 0.001589 **
## TELEPHONE 3.326e-01 1.778e-01 1.871 0.061407 .
## FOREIGN 1.396e+00 6.299e-01 2.216 0.026667 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 928.0 on 985 degrees of freedom
## AIC: 958
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR +
## AMOUNT + SAV_ACCT + INSTALL_RATE + MALE_SINGLE + GUARANTOR +
## OTHER_INSTALL + FOREIGN, family = binomial, data = D)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6928 -0.8152 0.4116 0.7479 2.1184
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.928e-01 3.587e-01 1.653 0.098409 .
## CHK_ACCT 5.998e-01 6.999e-02 8.569 < 2e-16 ***
## DURATION -2.919e-02 8.622e-03 -3.385 0.000711 ***
## HISTORY 3.859e-01 7.828e-02 4.930 8.22e-07 ***
## NEW_CAR -8.040e-01 1.881e-01 -4.274 1.92e-05 ***
## AMOUNT -9.645e-05 3.844e-05 -2.509 0.012110 *
## SAV_ACCT 2.647e-01 5.810e-02 4.555 5.23e-06 ***
## INSTALL_RATE -2.862e-01 8.177e-02 -3.501 0.000464 ***
## MALE_SINGLE 6.095e-01 1.684e-01 3.620 0.000294 ***
## GUARANTOR 1.134e+00 4.021e-01 2.819 0.004812 **
## OTHER_INSTALL -6.180e-01 1.971e-01 -3.136 0.001715 **
## FOREIGN 1.340e+00 6.216e-01 2.156 0.031079 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 946.55 on 988 degrees of freedom
## AIC: 970.55
##
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 255 296
## 1 45 404
##
## Accuracy : 0.659
## 95% CI : (0.6287, 0.6884)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.9977
##
## Kappa : 0.3447
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5771
## Specificity : 0.8500
## Pos Pred Value : 0.8998
## Neg Pred Value : 0.4628
## Prevalence : 0.7000
## Detection Rate : 0.4040
## Detection Prevalence : 0.4490
## Balanced Accuracy : 0.7136
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 254 306
## 1 46 394
##
## Accuracy : 0.648
## 95% CI : (0.6175, 0.6776)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.3282
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5629
## Specificity : 0.8467
## Pos Pred Value : 0.8955
## Neg Pred Value : 0.4536
## Prevalence : 0.7000
## Detection Rate : 0.3940
## Detection Prevalence : 0.4400
## Balanced Accuracy : 0.7048
##
## 'Positive' Class : 1
##
Based on the first few logistic models, we can see which predictors are significant and which are not, and we learned some key information, including a few surprises.
The biggest surprise is that neither Employment nor Age is a particularly significant predictor.
Another surprise is that whether the loan is for a used car or a new car is significant in determining whether an applicant is considered a good or bad credit risk.
I have partitioned the data into training data (60%) and validation data (40%).
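A minimal sketch of the partition; the seed value is an assumption, since the report does not state it:

set.seed(1)                                    # assumed seed for reproducibility
train_idx <- sample(nrow(D), 0.6 * nrow(D))    # 60% training
train <- D[train_idx, ]
valid <- D[-train_idx, ]                       # remaining 40% validation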
Here are the lift chart and confusion matrix for the first logistic model, fit on the training data with all 30 predictors.
Note: I used a cutoff of 0.8 (80%) for almost all models because I found it to be the most profitable cutoff.
This is also the most profitable of the three logistic models presented, although there is some overfitting.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 133
## 1 14 161
##
## Accuracy : 0.6325
## 95% CI : (0.5832, 0.6799)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3058
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5476
## Specificity : 0.8679
## Pos Pred Value : 0.9200
## Neg Pred Value : 0.4089
## Prevalence : 0.7350
## Detection Rate : 0.4025
## Detection Prevalence : 0.4375
## Balanced Accuracy : 0.7078
##
## 'Positive' Class : 1
##
Here are the lift chart and confusion matrix for the second logistic model, fit on the training data with CHK_ACCT, DURATION, HISTORY, NEW_CAR, USED_CAR, EDUCATION, AMOUNT, SAV_ACCT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, OTHER_INSTALL, TELEPHONE, and FOREIGN as predictors. These are all of the predictors with p-value < 0.1.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 141
## 1 14 153
##
## Accuracy : 0.6125
## 95% CI : (0.5628, 0.6605)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2808
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5204
## Specificity : 0.8679
## Pos Pred Value : 0.9162
## Neg Pred Value : 0.3948
## Prevalence : 0.7350
## Detection Rate : 0.3825
## Detection Prevalence : 0.4175
## Balanced Accuracy : 0.6942
##
## 'Positive' Class : 1
##
Here are the lift chart and confusion matrix for the third logistic model, fit on the training data with CHK_ACCT, DURATION, HISTORY, NEW_CAR, AMOUNT, SAV_ACCT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, OTHER_INSTALL, and FOREIGN as predictors. These are all of the predictors with p-value < 0.05.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 148
## 1 14 146
##
## Accuracy : 0.595
## 95% CI : (0.5451, 0.6435)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2596
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.4966
## Specificity : 0.8679
## Pos Pred Value : 0.9125
## Neg Pred Value : 0.3833
## Prevalence : 0.7350
## Detection Rate : 0.3650
## Detection Prevalence : 0.4000
## Balanced Accuracy : 0.6823
##
## 'Positive' Class : 1
##
Comparing these three models, the first has both the highest accuracy and the highest positive predictive value, so we will use it for problem 3. We are looking for the most true positives with the fewest false positives in order to maximize profit.
Next we will model the data using classification trees.
Using rpart, we first grow a classification tree and then check the importance of the predictors.
Note these importances are fairly close to those we found with the logistic models.
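A sketch of the tree construction, assuming the rpart package and the train partition above:

library(rpart)
tree1 <- rpart(as.factor(RESPONSE) ~ ., data = train, method = "class")
tree1$variable.importance    # predictor importance, comparable to the glm results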
Below is the confusion matrix for tree classification model 1.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 138
## 1 17 156
##
## Accuracy : 0.6125
## 95% CI : (0.5628, 0.6605)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2712
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8396
## Specificity : 0.5306
## Pos Pred Value : 0.3921
## Neg Pred Value : 0.9017
## Prevalence : 0.2650
## Detection Rate : 0.2225
## Detection Prevalence : 0.5675
## Balanced Accuracy : 0.6851
##
## 'Positive' Class : 0
##
Now we prune the tree.
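A sketch of the pruning step, selecting the complexity parameter with the lowest cross-validated error:

best_cp <- tree1$cptable[which.min(tree1$cptable[, "xerror"]), "CP"]
pruned <- prune(tree1, cp = best_cp)   # trim splits that don't reduce CV error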
Below is the pruned tree's confusion matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 110
## 1 20 184
##
## Accuracy : 0.675
## 95% CI : (0.6267, 0.7207)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9968
##
## Kappa : 0.3438
##
## Mcnemar's Test P-Value : 5.912e-15
##
## Sensitivity : 0.8113
## Specificity : 0.6259
## Pos Pred Value : 0.4388
## Neg Pred Value : 0.9020
## Prevalence : 0.2650
## Detection Rate : 0.2150
## Detection Prevalence : 0.4900
## Balanced Accuracy : 0.7186
##
## 'Positive' Class : 0
##
For the random forest model I used a cutoff of 0.7 because it showed a better profit.
Its variable importance is also printed; note that it differs from the logistic models in some ways, but is fairly close in others. This is the most profitable of the tree-based models.
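A sketch of this fit, assuming the randomForest package and applying the 0.7 cutoff to the predicted class probabilities:

library(randomForest)
set.seed(1)                                        # assumed seed
rf <- randomForest(as.factor(RESPONSE) ~ ., data = train, importance = TRUE)
importance(rf)                                     # variable importance
p_rf <- predict(rf, valid, type = "prob")[, "1"]   # probability of good credit
pred_rf <- ifelse(p_rf > 0.7, 1, 0)                # 0.7 cutoff chosen for profit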
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 123
## 1 15 171
##
## Accuracy : 0.655
## 95% CI : (0.6061, 0.7015)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.332
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8585
## Specificity : 0.5816
## Pos Pred Value : 0.4252
## Neg Pred Value : 0.9194
## Prevalence : 0.2650
## Detection Rate : 0.2275
## Detection Prevalence : 0.5350
## Balanced Accuracy : 0.7201
##
## 'Positive' Class : 0
##
Creating a neural network with the variables CHK_ACCT, DURATION, OTHER_INSTALL, SAV_ACCT, HISTORY, AMOUNT, USED_CAR, and GUARANTOR. This network has one hidden layer with 4 neurons.
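A sketch of the network, assuming the neuralnet package (the report does not state which package was used):

library(neuralnet)
set.seed(1)                                  # assumed seed
vars <- c("CHK_ACCT", "DURATION", "OTHER_INSTALL", "SAV_ACCT",
          "HISTORY", "AMOUNT", "USED_CAR", "GUARANTOR")
nn1 <- neuralnet(RESPONSE ~ CHK_ACCT + DURATION + OTHER_INSTALL + SAV_ACCT +
                   HISTORY + AMOUNT + USED_CAR + GUARANTOR,
                 data = train, hidden = 4, linear.output = FALSE)
p_nn <- compute(nn1, valid[, vars])$net.result   # predicted probabilities on validation data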
Confusion Matrix and Lift Chart for Neural Network 1
## ME RMSE MAE MPE MAPE
## Test set 0.06034458 0.4067656 0.3118412 NaN Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 142
## 1 15 152
##
## Accuracy : 0.6075
## 95% CI : (0.5577, 0.6557)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2715
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8585
## Specificity : 0.5170
## Pos Pred Value : 0.3906
## Neg Pred Value : 0.9102
## Prevalence : 0.2650
## Detection Rate : 0.2275
## Detection Prevalence : 0.5825
## Balanced Accuracy : 0.6877
##
## 'Positive' Class : 0
##
Creating a second neural network with the same variables (CHK_ACCT, DURATION, OTHER_INSTALL, SAV_ACCT, HISTORY, AMOUNT, USED_CAR, and GUARANTOR). This network has one hidden layer with 2 neurons.
Confusion Matrix and Lift Chart for Neural Network 2
## ME RMSE MAE MPE MAPE
## Test set 0.05530795 0.3978058 0.316457 NaN Inf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 102
## 1 20 192
##
## Accuracy : 0.695
## 95% CI : (0.6473, 0.7398)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9678
##
## Kappa : 0.3723
##
## Mcnemar's Test P-Value : 2.244e-13
##
## Sensitivity : 0.8113
## Specificity : 0.6531
## Pos Pred Value : 0.4574
## Neg Pred Value : 0.9057
## Prevalence : 0.2650
## Detection Rate : 0.2150
## Detection Prevalence : 0.4700
## Balanced Accuracy : 0.7322
##
## 'Positive' Class : 0
##
Based on the confusion matrices and lift charts above, I will choose the first logistic model (I know it overfits, but I am looking solely at profitability here), the random forest model, and Neural Network 2.
Below are the confusion matrices of all three, in order: logistic, random forest, then neural network.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 133
## 1 14 161
##
## Accuracy : 0.6325
## 95% CI : (0.5832, 0.6799)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3058
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5476
## Specificity : 0.8679
## Pos Pred Value : 0.9200
## Neg Pred Value : 0.4089
## Prevalence : 0.7350
## Detection Rate : 0.4025
## Detection Prevalence : 0.4375
## Balanced Accuracy : 0.7078
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 123
## 1 15 171
##
## Accuracy : 0.655
## 95% CI : (0.6061, 0.7015)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.332
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8585
## Specificity : 0.5816
## Pos Pred Value : 0.4252
## Neg Pred Value : 0.9194
## Prevalence : 0.2650
## Detection Rate : 0.2275
## Detection Prevalence : 0.5350
## Balanced Accuracy : 0.7201
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 102
## 1 20 192
##
## Accuracy : 0.695
## 95% CI : (0.6473, 0.7398)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.9678
##
## Kappa : 0.3723
##
## Mcnemar's Test P-Value : 2.244e-13
##
## Sensitivity : 0.8113
## Specificity : 0.6531
## Pos Pred Value : 0.4574
## Neg Pred Value : 0.9057
## Prevalence : 0.2650
## Detection Rate : 0.2150
## Detection Prevalence : 0.4700
## Balanced Accuracy : 0.7322
##
## 'Positive' Class : 0
##
Below are the cost/gain matrices of the three models above, in order: logistic, random forest, then neural network.
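These matrices follow directly from the stated payoffs: each true positive gains 100 DM and each false positive costs 500 DM. A sketch of the calculation, using the logistic model's counts from its confusion matrix above:

tp <- 161; fp <- 14                     # true and false positives for the logistic model
profit <- 100 * tp - 500 * fp           # 16100 - 7000
paste("The overall profit would be", profit, "DM")   # 9100 DM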
## Reference
## Prediction Bad Good
## Bad 0 0
## Good -7000 16100
## [1] "The overall profit would be 9100 DM"
## Reference
## Prediction Bad Good
## Bad 0 0
## Good -7500 17100
## [1] "The overall profit would be 9600 DM"
## Reference
## Prediction Bad Good
## Bad 0 0
## Good -10000 19200
## [1] "The overall profit would be 9200 DM"
Logistic model profit: 9100 DM
Random forest profit: 9600 DM
Neural network profit: 9200 DM
Overall, the random forest classification model worked best, i.e., produced the most profit given the chosen cutoffs and seeds.
Below is a lift curve of the profits for the validation data. I created a vector of expected profits by multiplying each case's predicted probability by 100 for the actual good credits and by -500 for the actual bad credits. I then arranged these expected profits in descending order, built a cumulative profits vector, and found the maximum profit and where it occurs.
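A sketch of that construction, assuming pp holds the validation probabilities (named pp4 in the table further below) and valid$RESPONSE the actual outcomes:

profit1 <- ifelse(valid$RESPONSE == 1, 100 * pp, -500 * pp)  # expected profit per case
ord <- order(profit1, decreasing = TRUE)                     # goods first, bads last
cum_profit <- cumsum(profit1[ord])                           # cumulative profits vector
which.max(cum_profit)                                        # where profit peaks
max(cum_profit)                                              # the maximum profit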
Below are a cumulative lift curve and a decile-wise lift curve for the validation profit data with a 70% cutoff.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 79 89
## 1 27 205
##
## Accuracy : 0.71
## 95% CI : (0.6628, 0.754)
## No Information Rate : 0.735
## P-Value [Acc > NIR] : 0.8822
##
## Kappa : 0.3728
##
## Mcnemar's Test P-Value : 1.481e-08
##
## Sensitivity : 0.6973
## Specificity : 0.7453
## Pos Pred Value : 0.8836
## Neg Pred Value : 0.4702
## Prevalence : 0.7350
## Detection Rate : 0.5125
## Detection Prevalence : 0.5800
## Balanced Accuracy : 0.7213
##
## 'Positive' Class : 1
##
Below is a cumulative profits chart. I ordered the expected profits in descending order and computed the cumulative totals; the head and tail of the result are printed below.
## pp4 profit1
## 343 0.9939444 99.39444
## 24 0.9933122 198.72566
## 29 0.9932699 298.05265
## 202 0.9930737 397.36002
## 34 0.9922904 496.58906
## 311 0.9913745 595.72651
## pp4 profit1
## 26 0.9306132 -822.7914
## 42 0.9318435 -1288.7131
## 38 0.9389001 -1758.1632
## 178 0.9400596 -2228.1930
## 92 0.9417257 -2699.0558
## 201 0.9667376 -3182.4246
Looking at the bar plot, it shows that I should stop after the 5th decile, i.e., accept only the first 200 of the 400 people in the validation set.
Based on my decile-wise curve, it looks like I should go to about the 58th percentile to obtain the maximum profit, given a 70% cutoff. This is very close to what the first plot shows.
Based on the curve I showed, I should stop at case 294 for a profit of about 22,192 DM. This curve is interesting in that it sums the expected profits given the actual responses. That value is a lot higher than what any of the models achieved, so I question whether it is realistic.
Based on my confusion matrix above, the best cutoff I used was around 70% (versus 50%). This gives a profit of 205 * 100 - 27 * 500 = 7000 DM.
Alternatively, we can stop at the 5th decile, based on the bar plot.