In this report, we examine 4 classification models with target feature Class from the GermanCredit dataset in the caret package, using every available feature for the classification task. We will be using a Logistic model, a Decision Tree, a Random Forest, and a Gradient Boosting model, each with some mild hyperparameter tuning (more on this soon). The Class feature is a factor with 2 levels, “Good” and “Bad”, so our models only need to handle binary classification. No data preparation was necessary before fitting the models.
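As a point of reference, here is a minimal sketch of a setup along these lines. The package choices (rpart, randomForest, gbm), the 80/20 split, and the seed are assumptions for illustration, not necessarily what produced the results below.

```r
library(caret)          # provides the GermanCredit data (and confusionMatrix, varImp)
library(rpart)          # decision tree
library(randomForest)   # random forest
library(gbm)            # gradient boosting

data(GermanCredit)
set.seed(42)                                   # assumed seed, for reproducibility
idx   <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)
train <- GermanCredit[idx, ]
test  <- GermanCredit[-idx, ]

# Logistic regression models P(Class == "Good"), since "Good" is the second factor level.
# (GermanCredit's dummy columns are collinear, so glm may report NA coefficients and warn.)
log_fit  <- glm(Class ~ ., data = train, family = binomial)
tree_fit <- rpart(Class ~ ., data = train, method = "class")
rf_fit   <- randomForest(Class ~ ., data = train, ntree = 35)

# gbm wants a 0/1 response for the bernoulli distribution
train_gbm <- transform(train, Class = as.numeric(Class == "Good"))
gbm_fit   <- gbm(Class ~ ., data = train_gbm,
                 distribution = "bernoulli", n.trees = 35)
```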
To tune our models, we adjust the labeling threshold via a (janky) incremental approach to find the approximate threshold that maximizes F1 score. F1 is a sensible metric to maximize here because the Class factor is not quite balanced: about 70% of the entries are “Good” and only 30% are “Bad”. Additionally, for both the Random Forest and the Gradient Boosting model, 35 trees were used (as the error plot plateaued around that point).
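A rough sketch of that incremental threshold search, shown for the logistic model, might look like the following. The grid bounds and the 0.01 step size are assumptions; `log_fit` and `test` come from the setup sketch above.

```r
# Scan a grid of thresholds and keep the one with the highest F1 ("Good" is the positive class).
f1_at <- function(threshold, probs, truth) {
  pred      <- ifelse(probs >= threshold, "Good", "Bad")
  tp        <- sum(pred == "Good" & truth == "Good")
  precision <- tp / sum(pred == "Good")
  recall    <- tp / sum(truth == "Good")
  2 * precision * recall / (precision + recall)
}

log_probs  <- predict(log_fit, newdata = test, type = "response")
thresholds <- seq(0.05, 0.95, by = 0.01)
f1s        <- sapply(thresholds, f1_at, probs = log_probs, truth = test$Class)
best_threshold <- thresholds[which.max(f1s)]   # approximate F1-maximizing threshold
```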
Below are the top 10 variables for each model, in decreasing order of importance (a sketch of how such rankings can be extracted follows the lists):
## Top 10 Important Variables (Logistic Regression):
## [1] "Purpose.DomesticAppliance" "CheckingAccountStatus.lt.0"
## [3] "CreditHistory.ThisBank.AllPaid" "Purpose.Repairs"
## [5] "Purpose.NewCar" "OtherDebtorsGuarantors.CoApplicant"
## [7] "CheckingAccountStatus.0.to.200" "CreditHistory.NoCredit.AllPaid"
## [9] "ForeignWorker" "Purpose.Education"
## Top 10 Important Variables (Decision Tree):
## [1] "CheckingAccountStatus.none" "Duration"
## [3] "Amount" "CreditHistory.ThisBank.AllPaid"
## [5] "SavingsAccountBonds.lt.100" "Property.CarOther"
## [7] "Purpose.UsedCar" "ResidenceDuration"
## [9] "CheckingAccountStatus.0.to.200" "CheckingAccountStatus.lt.0"
## Top 10 Important Variables (Random Forest):
## [1] "Amount" "Duration"
## [3] "Age" "CheckingAccountStatus.none"
## [5] "InstallmentRatePercentage" "ResidenceDuration"
## [7] "CheckingAccountStatus.lt.0" "CreditHistory.Critical"
## [9] "SavingsAccountBonds.lt.100" "Personal.Male.Single"
## Top 10 Important Variables (Gradient Boosting):
## [1] "CheckingAccountStatus.none" "Duration"
## [3] "CheckingAccountStatus.lt.0" "Amount"
## [5] "CreditHistory.Critical" "SavingsAccountBonds.lt.100"
## [7] "Personal.Male.Single" "CreditHistory.ThisBank.AllPaid"
## [9] "EmploymentDuration.lt.1" "Property.RealEstate"
We can see several of the same variables appearing across models, suggesting those features carry a lot of weight in the classification of Class. For instance, CheckingAccountStatus.lt.0, CreditHistory.ThisBank.AllPaid, CheckingAccountStatus.none, and Duration all appear in the top 10 important variables for at least 3 of the 4 models (a quick way to check this overlap is sketched below).
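For the curious, a small check of that overlap claim, reusing the (assumed) helpers from the importance sketch above:

```r
# Count how many of the four top-10 lists each variable appears in.
tops <- list(logistic = top_vars(log_fit),
             tree     = top_vars(tree_fit),
             rforest  = top_vars(rf_fit),
             grboost  = head(as.character(summary(gbm_fit, plotit = FALSE)$var), 10))
appearances <- sort(table(unlist(tops)), decreasing = TRUE)
appearances[appearances >= 3]   # variables in at least 3 of the 4 lists
```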
Below are the ROC curves for the models, as well as the individual areas under the curves:
## AUC Logistic: 0.8042857
## AUC Tree: 0.6968132
## AUC Random Forest: 0.8107692
## AUC GBM: 0.7934066
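The curves and AUCs above could be produced with, for example, the pROC package; this is an assumption about tooling, sketched here for the logistic model:

```r
library(pROC)

# "Bad" entries are the controls, "Good" the cases; the predictor is P(Class == "Good").
log_roc <- roc(response = test$Class, predictor = log_probs,
               levels = c("Bad", "Good"), direction = "<")
plot(log_roc)   # ROC curve for the logistic model
auc(log_roc)    # area under that curve
# ...repeat with each model's predicted probabilities on the test set
```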
As you can see from the ROC curves and the areas under them, the Random Forest performs (barely) the best, with the Logistic regression a close second. However, due to the simpler nature of a Logistic regression, as well as how close the areas under the curves are, I believe that the Logistic regression should be the preferred model.
In addition, here are the common classification model
performance metrics:
## Accuracy Recall Precision F1
## logistic 0.770 0.9461538 0.7592593 0.8424658
## tree 0.700 0.9538462 0.6966292 0.8051948
## rforest 0.755 0.9230769 0.7547170 0.8304498
## grboost 0.740 0.8923077 0.7532468 0.8169014
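For reference, here is a sketch of how one row of that table could be assembled with caret's confusionMatrix, using the logistic model's predicted probabilities and the F1-maximizing threshold from the earlier sketch:

```r
# Convert probabilities to class labels at the tuned threshold, then summarize.
pred <- factor(ifelse(log_probs >= best_threshold, "Good", "Bad"),
               levels = levels(test$Class))
cm <- confusionMatrix(pred, test$Class, positive = "Good", mode = "prec_recall")
cm$overall["Accuracy"]
cm$byClass[c("Recall", "Precision", "F1")]
```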
Looking at the results, when we pick each model's threshold to maximize F1, the Logistic regression wins on both the Accuracy and F1 metrics. Neat!
In the end, we can say (with some confidence) that of the models trained, the Logistic regression performed the best, with about 77% accuracy and an F1 of about 0.84. However, this may not always be the case: with more in-depth hyperparameter tuning, the Random Forest could likely outperform the Logistic regression, as all that was tuned here was the number of trees (set to 35).