Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
if (!require(caret, quietly = TRUE)) {
install.packages("caret")
}
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: The data set has lots of variables 62 to be exact. These variables include 7 integer variables, 1 logical variable, and 54 numeric variables. Class is the one logical variable.
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.
str(GermanCredit)
## 'data.frame': 1000 obs. of 49 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
Your observation: Now instead of 62 variables it lowered to 49 variables.
2024 for reproducibility.
(10pts)set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
credit_train = GermanCredit[index,]
credit_test = GermanCredit[-index,]
Your observation: Train to 800 observations and test to 200 observations. Splitting german credit 80 20.
glm_credit<- glm(Class~., family=binomial, data=credit_train)
Your observation: Using variable class as the outcome variable while all other variables are predictors.
summary(glm_credit)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = credit_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.241e+00 1.719e+00 5.376 7.61e-08 ***
## Duration -2.994e-02 1.072e-02 -2.794 0.005214 **
## Amount -1.771e-04 5.095e-05 -3.475 0.000510 ***
## InstallmentRatePercentage -3.718e-01 1.036e-01 -3.589 0.000332 ***
## ResidenceDuration 2.577e-02 1.010e-01 0.255 0.798510
## Age 1.183e-02 1.097e-02 1.078 0.280974
## NumberExistingCredits -1.225e-01 2.189e-01 -0.560 0.575690
## NumberPeopleMaintenance -1.731e-01 2.945e-01 -0.588 0.556678
## Telephone -4.236e-01 2.371e-01 -1.786 0.074081 .
## ForeignWorker -1.651e+00 7.421e-01 -2.224 0.026143 *
## CheckingAccountStatus.lt.0 -1.817e+00 2.710e-01 -6.703 2.04e-11 ***
## CheckingAccountStatus.0.to.200 -1.432e+00 2.686e-01 -5.330 9.81e-08 ***
## CheckingAccountStatus.gt.200 -5.912e-01 4.631e-01 -1.277 0.201696
## CreditHistory.NoCredit.AllPaid -8.724e-01 5.139e-01 -1.698 0.089584 .
## CreditHistory.ThisBank.AllPaid -1.676e+00 5.493e-01 -3.052 0.002277 **
## CreditHistory.PaidDuly -6.686e-01 2.939e-01 -2.275 0.022899 *
## CreditHistory.Delay -9.413e-01 3.780e-01 -2.491 0.012756 *
## Purpose.NewCar -1.733e+00 1.013e+00 -1.710 0.087282 .
## Purpose.UsedCar 6.716e-02 1.033e+00 0.065 0.948146
## Purpose.Furniture.Equipment -8.257e-01 1.015e+00 -0.814 0.415816
## Purpose.Radio.Television -8.386e-01 1.019e+00 -0.823 0.410457
## Purpose.DomesticAppliance -1.227e+00 1.328e+00 -0.923 0.355762
## Purpose.Repairs -1.321e+00 1.165e+00 -1.134 0.256825
## Purpose.Education -2.020e+00 1.088e+00 -1.857 0.063374 .
## Purpose.Retraining 4.276e-01 1.640e+00 0.261 0.794237
## Purpose.Business -8.618e-01 1.032e+00 -0.835 0.403529
## SavingsAccountBonds.lt.100 -1.266e+00 3.201e-01 -3.956 7.63e-05 ***
## SavingsAccountBonds.100.to.500 -1.075e+00 4.171e-01 -2.577 0.009964 **
## SavingsAccountBonds.500.to.1000 -8.768e-01 5.216e-01 -1.681 0.092761 .
## SavingsAccountBonds.gt.1000 1.301e-02 6.161e-01 0.021 0.983157
## EmploymentDuration.lt.1 3.581e-01 5.167e-01 0.693 0.488195
## EmploymentDuration.1.to.4 5.527e-01 5.000e-01 1.105 0.268967
## EmploymentDuration.4.to.7 9.863e-01 5.355e-01 1.842 0.065524 .
## EmploymentDuration.gt.7 5.253e-01 5.039e-01 1.042 0.297218
## Personal.Male.Divorced.Seperated -2.546e-01 5.214e-01 -0.488 0.625274
## Personal.Female.NotSingle -1.274e-01 3.573e-01 -0.357 0.721452
## Personal.Male.Single 4.118e-01 3.623e-01 1.137 0.255622
## OtherDebtorsGuarantors.None -1.239e+00 5.370e-01 -2.308 0.021018 *
## OtherDebtorsGuarantors.CoApplicant -1.565e+00 6.828e-01 -2.292 0.021919 *
## Property.RealEstate 7.166e-01 4.898e-01 1.463 0.143477
## Property.Insurance 3.544e-01 4.785e-01 0.741 0.458926
## Property.CarOther 6.110e-01 4.648e-01 1.314 0.188702
## OtherInstallmentPlans.Bank -8.504e-01 2.730e-01 -3.115 0.001838 **
## OtherInstallmentPlans.Stores -4.293e-01 4.711e-01 -0.911 0.362139
## Housing.Rent -9.538e-01 5.624e-01 -1.696 0.089924 .
## Housing.Own -2.723e-01 5.282e-01 -0.516 0.606157
## Job.UnemployedUnskilled 1.449e+00 8.788e-01 1.649 0.099175 .
## Job.UnskilledResident -2.641e-03 4.101e-01 -0.006 0.994861
## Job.SkilledEmployee -1.073e-02 3.349e-01 -0.032 0.974438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 958.02 on 799 degrees of freedom
## Residual deviance: 672.78 on 751 degrees of freedom
## AIC: 770.78
##
## Number of Fisher Scoring iterations: 5
Your observation: The most significant predictors are Amount, InstallmentRatePercentage, CheckingAccountStatus.lt.0, CheckingAccountStatus.0.to.200, SavingsAccountBonds.lt.100 because of their p-value. However CheckingAccountStatus.lt.0 has the biggest absolute coefficient value at -1.81 meaning those with negative checking accounts are are less likely to have good credit, this makes CheckingAccountStatus.lt.0 variable a very important predictor for Class variable.
pred_credit_train<- predict(glm_credit, type="response")
hist(pred_credit_train)
Your observation: Theis shows the distribution of the predicted probabilities.
costfunc = function(obs, pred.p, pcut){
weight_FN = 1
weight_FP = 1
FNC = sum( (obs==1) & (pred.p < pcut))
FPC = sum( (obs==0) & (pred.p >=pcut))
MR = sum(weight_FN*FNC + weight_FP*FPC) / length(obs)
return(MR) }
pcut.seq = seq(0.01, 1, 0.01)
MR_vec = rep(0, length(pcut.seq))
for(i in 1:length(pcut.seq)){
MR_vec[i] = costfunc(obs = credit_train$Class, pred.p = pred_credit_train, pcut = pcut.seq[i])}
optimal.pcut = pcut.seq[which(MR_vec==min(MR_vec))]
print(optimal.pcut)
## [1] 0.39 0.40 0.50
Your observation: Optimal probability cut off points found with MR are 0.39, 0.40 and 0.50.
pred_credit_train_optimal <- (pred_credit_train>0.4)*1
conf_train <- table(credit_train$Class, pred_credit_train_optimal, dnn = c("True", "Predicted"))
TP <- conf_train[2, 2]
TN <- conf_train[1, 1]
FP <- conf_train[1, 2]
FN <- conf_train[2, 1]
MR <- (FP + FN) / sum(conf_train)
print(paste("Misclassification Rate (MR):", MR))
## [1] "Misclassification Rate (MR): 0.19625"
FPR <- FP / (FP + TN)
print(paste("False Positive Rate (FPR):", FPR))
## [1] "False Positive Rate (FPR): 0.532751091703057"
FNR <- FN / (FN + TP)
print(paste("False Negative Rate (FNR):", FNR))
## [1] "False Negative Rate (FNR): 0.0612959719789842"
Your observation: MR= .196 meaning 19.6% of predictions are incorrect. FPR = .532 means that 53.2% are misclassified as BAD. FNR =.061 means that only 6.1% of Good credit are misclassified.
library(ROCR)
pred_train <- prediction(pred_credit_train, credit_train$Class)
ROC <- performance(pred_train, "tpr", "fpr")
plot(ROC, colorize=TRUE)
auc_train = unlist(slot(performance(pred_train, "auc"), "y.values"))
auc_train
## [1] 0.8504807
Your observation: Overall decent AUC with a AUC value of .85
pred_credit_test<- predict(glm_credit, newdata = credit_test, type="response")
pred_credit_test_optimal <- (pred_credit_test > optimal.pcut) * 1
## Warning in pred_credit_test > optimal.pcut: longer object length is not a
## multiple of shorter object length
conf_test <- table(credit_test$Class, pred_credit_test_optimal, dnn = c("True", "Predicted"))
print(conf_test)
## Predicted
## True 0 1
## FALSE 31 40
## TRUE 15 114
TP_test <- conf_test[2, 2]
TN_test <- conf_test[1, 1]
FP_test <- conf_test[1, 2]
FN_test <- conf_test[2, 1]
MR_test <- (FP_test + FN_test) / sum(conf_test)
print(paste("Misclassification Rate (MR) for Test Set:", MR_test))
## [1] "Misclassification Rate (MR) for Test Set: 0.275"
FPR_test <- FP_test / (FP_test + TN_test)
print(paste("False Positive Rate (FPR) for Test Set:", FPR_test))
## [1] "False Positive Rate (FPR) for Test Set: 0.563380281690141"
FNR_test <- FN_test / (FN_test + TP_test)
print(paste("False Negative Rate (FNR) for Test Set:", FNR_test))
## [1] "False Negative Rate (FNR) for Test Set: 0.116279069767442"
Your observation: MR= .275, FPR= 0.563, and FNR = .116
pred_test <- prediction(pred_credit_test, credit_test$Class)
ROC_test <- performance(pred_test, "tpr", "fpr")
plot(ROC_test, colorize=TRUE)
auc_test = unlist(slot(performance(pred_test, "auc"), "y.values"))
auc_test
## [1] 0.7353423
Your observation: AUC for this model is .735.
Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!
costfunc = function(obs, pred.p, pcut){
weight_FN = 5
weight_FP = 1
FNC = sum( (obs==1) & (pred.p < pcut))
FPC = sum( (obs==0) & (pred.p >=pcut))
MR = sum(weight_FN*FNC + weight_FP*FPC) / length(obs)
return(MR)
}
pcut.seq = seq(0.01, 1, 0.01)
MR_vec = rep(0, length(pcut.seq))
for(i in 1:length(pcut.seq)){
MR_vec[i] = costfunc(obs = credit_train$Class, pred.p = pred_credit_train, pcut = pcut.seq[i])
}
optimal.pcut = pcut.seq[which(MR_vec==min(MR_vec))]
print(optimal.pcut)
## [1] 0.22
Your observation: Optimal probability cut off is .22
pred_class_credit_train_optimal <- (pred_credit_train>0.22)*1
conf_train <- table(credit_train$Class, pred_credit_train_optimal, dnn = c("True", "Predicted"))
TP <- conf_train[2, 2]
TN <- conf_train[1, 1]
FP <- conf_train[1, 2]
FN <- conf_train[2, 1]
MR <- (FP + FN) / sum(conf_train)
print(paste("Misclassification Rate (MR):", MR))
## [1] "Misclassification Rate (MR): 0.19625"
FPR <- FP / (FP + TN)
print(paste("False Positive Rate (FPR):", FPR))
## [1] "False Positive Rate (FPR): 0.532751091703057"
FNR <- FN / (FN + TP)
print(paste("False Negative Rate (FNR):", FNR))
## [1] "False Negative Rate (FNR): 0.0612959719789842"
Your observation: MR=.196, FPR= .532, and FPR = .061
pred_credit_test_optimal <- (pred_credit_test>0.22)*1
conf_test <- table(credit_test$Class, pred_credit_test_optimal, dnn = c("True", "Predicted"))
TP <- conf_test[2, 2]
TN <- conf_test[1, 1]
FP <- conf_test[1, 2]
FN <- conf_test[2, 1]
print(paste0("testing MR:",MR))
## [1] "testing MR:0.19625"
print(paste0("testing FPR:",FPR))
## [1] "testing FPR:0.532751091703057"
print(paste0("testing FNR:",FNR))
## [1] "testing FNR:0.0612959719789842"
Your observation: MR=.19625, FPR 0.53275, FNR= 0.06129
Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model.
Optimal Probability Cut off is .22
“Misclassification Rate (MR): 0.19625” “False Positive Rate (FPR): 0.532751091703057” “False Negative Rate (FNR): 0.0612959719789842”
“testing MR:0.19625” “testing FPR:0.532751091703057” “testing FNR:0.0612959719789842”
Since Optimal probability cut off point is .22 (relatively low) this suggests that model tends to classify more instances as positive, resulting in a higher FPR.
Training set has a misclassification rate of 0.19625, which means 19.6% of predictions are incorrect which is not too bad but the test MR is higher at 27.5% indicating the model may not perform well with new data.
The training FPR is 0.5327 (53.27%), meaning that over half of the actual negatives are misclassified as positives. This is a significant issue, as a high false positive rate could lead to incorrect decisions in practical applications.
The training false negative rate (FNR) is 0.0613 (6.13%), indicating that the model successfully identifies a high proportion of actual positives. However, the testing FNR is higher at 0.1163 (11.63%), suggesting a trade-off between sensitivity (true positive rate) and specificity (true negative rate).
To improve the model, it could help to balance the classes. This can be done by adding more examples of the less common class (oversampling) , reducing examples of the more common class (undersampling), or creating examples to even things out.