In this report, we examine four classification models that predict the Class feature of the GermanCredit dataset from the caret package. We will be using every other available feature as a predictor.
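For reference, the data can be loaded directly from caret. The exact train/test split is not stated in this report, so the sketch below assumes an 80/20 partition and an arbitrary seed:

library(caret)
data(GermanCredit)

set.seed(123)                                          # arbitrary seed for illustration
idx   <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)
train <- GermanCredit[idx, ]
test  <- GermanCredit[-idx, ]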

Overview

We will be using a Logistic Regression, a Decision Tree, a Random Forest, and a Gradient Boosting model, each with some mild hyperparameter tuning (more on this below). The Class feature is a factor with two levels, “Good” and “Bad”, so our models only need to handle binary classification. No additional data preparation was necessary to construct the models.
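As a rough sketch, the four fits might look like the following (object names are illustrative, and the 35-tree setting for the two ensemble models is discussed in the next paragraph):

library(rpart)
library(randomForest)
library(gbm)

# Logistic regression on all features
fit_log  <- glm(Class ~ ., data = train, family = binomial)

# Decision tree
fit_tree <- rpart(Class ~ ., data = train, method = "class")

# Random forest with 35 trees
fit_rf   <- randomForest(Class ~ ., data = train, ntree = 35)

# Gradient boosting needs a 0/1 response for the bernoulli loss
gbm_train <- transform(train, Good = as.numeric(Class == "Good"))
fit_gbm   <- gbm(Good ~ . - Class, data = gbm_train,
                 distribution = "bernoulli", n.trees = 35)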


To tune our models, we adjust the labeling threshold via an (admittedly janky) incremental search for the approximate threshold that maximizes the F1 score. F1 is a sensible target metric here because the Class factor is imbalanced: about 70% of the entries are “Good” and only 30% are “Bad”. Additionally, both the Random Forest and the Gradient Boosting model use 35 trees, since the error plot plateaus around that point.
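A minimal sketch of that incremental threshold search, shown for the logistic model (the other models are handled analogously):

# Predicted probabilities of "Good" on the held-out test set
prob_log <- predict(fit_log, newdata = test, type = "response")

# F1 at a given cutoff, treating "Good" as the positive class
f1_at <- function(probs, truth, cutoff) {
  pred      <- factor(ifelse(probs >= cutoff, "Good", "Bad"),
                      levels = levels(truth))
  tp        <- sum(pred == "Good" & truth == "Good")
  precision <- tp / sum(pred == "Good")
  recall    <- tp / sum(truth == "Good")
  2 * precision * recall / (precision + recall)
}

# Step through candidate cutoffs and keep the one with the best F1
cutoffs     <- seq(0.05, 0.95, by = 0.01)
f1s         <- sapply(cutoffs, function(ct) f1_at(prob_log, test$Class, ct))
best_cutoff <- cutoffs[which.max(f1s)]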

Models

Below are the top 10 variables for each model, in decreasing order of importance:

Logistic Regression

## Top 10 Important Variables (Logistic Regression):
##  [1] "Purpose.DomesticAppliance"          "CheckingAccountStatus.lt.0"        
##  [3] "CreditHistory.ThisBank.AllPaid"     "Purpose.Repairs"                   
##  [5] "Purpose.NewCar"                     "OtherDebtorsGuarantors.CoApplicant"
##  [7] "CheckingAccountStatus.0.to.200"     "CreditHistory.NoCredit.AllPaid"    
##  [9] "ForeignWorker"                      "Purpose.Education"


Decision Tree

## Top 10 Important Variables (Decision Tree):
##  [1] "CheckingAccountStatus.none"     "Duration"                      
##  [3] "Amount"                         "CreditHistory.ThisBank.AllPaid"
##  [5] "SavingsAccountBonds.lt.100"     "Property.CarOther"             
##  [7] "Purpose.UsedCar"                "ResidenceDuration"             
##  [9] "CheckingAccountStatus.0.to.200" "CheckingAccountStatus.lt.0"


Random Forest

## Top 10 Important Variables (Random Forest):
##  [1] "Amount"                     "Duration"                  
##  [3] "Age"                        "CheckingAccountStatus.none"
##  [5] "InstallmentRatePercentage"  "ResidenceDuration"         
##  [7] "CheckingAccountStatus.lt.0" "CreditHistory.Critical"    
##  [9] "SavingsAccountBonds.lt.100" "Personal.Male.Single"


Gradient Boosting

## Top 10 Important Variables (Gradient Boosting):
##  [1] "CheckingAccountStatus.none"     "Duration"                      
##  [3] "CheckingAccountStatus.lt.0"     "Amount"                        
##  [5] "CreditHistory.Critical"         "SavingsAccountBonds.lt.100"    
##  [7] "Personal.Male.Single"           "CreditHistory.ThisBank.AllPaid"
##  [9] "EmploymentDuration.lt.1"        "Property.RealEstate"

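These rankings come from each model's variable importance scores; a sketch using caret's varImp() on the fitted objects from above:

# Pull the 10 most important variables from a fitted model
top10 <- function(fit) {
  imp <- caret::varImp(fit)          # data frame with an "Overall" column
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}

top10(fit_log)
top10(fit_tree)
top10(fit_rf)
top10(fit_gbm)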

We can see several of the same variables appearing across models, suggesting those features carry substantial importance in classifying Class. For instance, CheckingAccountStatus.lt.0, CreditHistory.ThisBank.AllPaid, CheckingAccountStatus.none, and Duration each appear in the top 10 for at least three of the four models.

Evaluations

Below are the ROC curves for the models, as well as the individual areas under the curves:

## AUC Logistic: 0.8042857
## AUC Tree: 0.6968132
## AUC Random Forest: 0.8107692
## AUC GBM: 0.7934066
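The curves and AUC values above can be produced with the pROC package; a sketch for two of the models, assuming the predicted “Good” probabilities on the test set:

library(pROC)

# Predicted probabilities of "Good" from the random forest
prob_rf <- predict(fit_rf, newdata = test, type = "prob")[, "Good"]

# Build ROC objects ("Bad" as control, "Good" as case)
roc_log <- roc(test$Class, prob_log, levels = c("Bad", "Good"))
roc_rf  <- roc(test$Class, prob_rf,  levels = c("Bad", "Good"))

plot(roc_log, col = "blue")           # logistic ROC curve
lines(roc_rf, col = "red")            # overlay the random forest curve

auc(roc_log)
auc(roc_rf)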


According to the ROC curves and their AUCs, the Random Forest performs best by a narrow margin, with the Logistic Regression close behind. However, given the simpler nature of Logistic Regression and how close the areas under the curves are, I believe the Logistic Regression should be the preferred model.


In addition, here are the standard classification performance metrics for each model:

##          Accuracy    Recall Precision        F1
## logistic    0.770 0.9461538 0.7592593 0.8424658
## tree        0.700 0.9538462 0.6966292 0.8051948
## rforest     0.755 0.9230769 0.7547170 0.8304498
## grboost     0.740 0.8923077 0.7532468 0.8169014
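These values follow from the confusion matrix at each model's tuned threshold; a sketch with caret's confusionMatrix(), again using the logistic model and the cutoff found earlier:

# Class predictions at the tuned cutoff, then the standard metrics
pred_log <- factor(ifelse(prob_log >= best_cutoff, "Good", "Bad"),
                   levels = levels(test$Class))
cm <- confusionMatrix(pred_log, test$Class, positive = "Good")

cm$overall["Accuracy"]
cm$byClass[c("Recall", "Precision", "F1")]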


With each model's threshold adjusted to maximize F1, the Logistic Regression wins on both the Accuracy and F1 metrics. Neat!

Conclusion

In the end, we can say (with some confidence) that of the models trained, the Logistic Regression performed the best, with about 77% accuracy and an F1 of about 0.84. However, this may not always be the case. With more in-depth hyperparameter tuning, the Random Forest could well outperform the Logistic Regression, since the only tuning done here was fixing the number of trees at 35.