Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

if (!require(caret, quietly = TRUE)) {
    install.packages("caret")
}

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: The data set has lots of variables 62 to be exact. These variables include 7 integer variables, 1 logical variable, and 54 numeric variables. Class is the one logical variable.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.

2. Explore the dataset to understand its structure. (10pts)

str(GermanCredit)

## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

Your observation: Now instead of 62 variables it lowered to 49 variables.

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
credit_train = GermanCredit[index,]
credit_test = GermanCredit[-index,]

Your observation: Train to 800 observations and test to 200 observations. Splitting german credit 80 20.

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

glm_credit<- glm(Class~., family=binomial, data=credit_train)

Your observation: Using variable class as the outcome variable while all other variables are predictors.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

summary(glm_credit)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = credit_train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.241e+00  1.719e+00   5.376 7.61e-08 ***
## Duration                           -2.994e-02  1.072e-02  -2.794 0.005214 ** 
## Amount                             -1.771e-04  5.095e-05  -3.475 0.000510 ***
## InstallmentRatePercentage          -3.718e-01  1.036e-01  -3.589 0.000332 ***
## ResidenceDuration                   2.577e-02  1.010e-01   0.255 0.798510    
## Age                                 1.183e-02  1.097e-02   1.078 0.280974    
## NumberExistingCredits              -1.225e-01  2.189e-01  -0.560 0.575690    
## NumberPeopleMaintenance            -1.731e-01  2.945e-01  -0.588 0.556678    
## Telephone                          -4.236e-01  2.371e-01  -1.786 0.074081 .  
## ForeignWorker                      -1.651e+00  7.421e-01  -2.224 0.026143 *  
## CheckingAccountStatus.lt.0         -1.817e+00  2.710e-01  -6.703 2.04e-11 ***
## CheckingAccountStatus.0.to.200     -1.432e+00  2.686e-01  -5.330 9.81e-08 ***
## CheckingAccountStatus.gt.200       -5.912e-01  4.631e-01  -1.277 0.201696    
## CreditHistory.NoCredit.AllPaid     -8.724e-01  5.139e-01  -1.698 0.089584 .  
## CreditHistory.ThisBank.AllPaid     -1.676e+00  5.493e-01  -3.052 0.002277 ** 
## CreditHistory.PaidDuly             -6.686e-01  2.939e-01  -2.275 0.022899 *  
## CreditHistory.Delay                -9.413e-01  3.780e-01  -2.491 0.012756 *  
## Purpose.NewCar                     -1.733e+00  1.013e+00  -1.710 0.087282 .  
## Purpose.UsedCar                     6.716e-02  1.033e+00   0.065 0.948146    
## Purpose.Furniture.Equipment        -8.257e-01  1.015e+00  -0.814 0.415816    
## Purpose.Radio.Television           -8.386e-01  1.019e+00  -0.823 0.410457    
## Purpose.DomesticAppliance          -1.227e+00  1.328e+00  -0.923 0.355762    
## Purpose.Repairs                    -1.321e+00  1.165e+00  -1.134 0.256825    
## Purpose.Education                  -2.020e+00  1.088e+00  -1.857 0.063374 .  
## Purpose.Retraining                  4.276e-01  1.640e+00   0.261 0.794237    
## Purpose.Business                   -8.618e-01  1.032e+00  -0.835 0.403529    
## SavingsAccountBonds.lt.100         -1.266e+00  3.201e-01  -3.956 7.63e-05 ***
## SavingsAccountBonds.100.to.500     -1.075e+00  4.171e-01  -2.577 0.009964 ** 
## SavingsAccountBonds.500.to.1000    -8.768e-01  5.216e-01  -1.681 0.092761 .  
## SavingsAccountBonds.gt.1000         1.301e-02  6.161e-01   0.021 0.983157    
## EmploymentDuration.lt.1             3.581e-01  5.167e-01   0.693 0.488195    
## EmploymentDuration.1.to.4           5.527e-01  5.000e-01   1.105 0.268967    
## EmploymentDuration.4.to.7           9.863e-01  5.355e-01   1.842 0.065524 .  
## EmploymentDuration.gt.7             5.253e-01  5.039e-01   1.042 0.297218    
## Personal.Male.Divorced.Seperated   -2.546e-01  5.214e-01  -0.488 0.625274    
## Personal.Female.NotSingle          -1.274e-01  3.573e-01  -0.357 0.721452    
## Personal.Male.Single                4.118e-01  3.623e-01   1.137 0.255622    
## OtherDebtorsGuarantors.None        -1.239e+00  5.370e-01  -2.308 0.021018 *  
## OtherDebtorsGuarantors.CoApplicant -1.565e+00  6.828e-01  -2.292 0.021919 *  
## Property.RealEstate                 7.166e-01  4.898e-01   1.463 0.143477    
## Property.Insurance                  3.544e-01  4.785e-01   0.741 0.458926    
## Property.CarOther                   6.110e-01  4.648e-01   1.314 0.188702    
## OtherInstallmentPlans.Bank         -8.504e-01  2.730e-01  -3.115 0.001838 ** 
## OtherInstallmentPlans.Stores       -4.293e-01  4.711e-01  -0.911 0.362139    
## Housing.Rent                       -9.538e-01  5.624e-01  -1.696 0.089924 .  
## Housing.Own                        -2.723e-01  5.282e-01  -0.516 0.606157    
## Job.UnemployedUnskilled             1.449e+00  8.788e-01   1.649 0.099175 .  
## Job.UnskilledResident              -2.641e-03  4.101e-01  -0.006 0.994861    
## Job.SkilledEmployee                -1.073e-02  3.349e-01  -0.032 0.974438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 958.02  on 799  degrees of freedom
## Residual deviance: 672.78  on 751  degrees of freedom
## AIC: 770.78
## 
## Number of Fisher Scoring iterations: 5

Your observation: The most significant predictors are Amount, InstallmentRatePercentage, CheckingAccountStatus.lt.0, CheckingAccountStatus.0.to.200, SavingsAccountBonds.lt.100 because of their p-value. However CheckingAccountStatus.lt.0 has the biggest absolute coefficient value at -1.81 meaning those with negative checking accounts are are less likely to have good credit, this makes CheckingAccountStatus.lt.0 variable a very important predictor for Class variable.

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

pred_credit_train<- predict(glm_credit, type="response")
hist(pred_credit_train)

Your observation: Theis shows the distribution of the predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

costfunc = function(obs, pred.p, pcut){
    weight_FN = 1
    weight_FP = 1    
    FNC = sum( (obs==1) & (pred.p < pcut))   
    FPC = sum( (obs==0) & (pred.p >=pcut))   
    MR  = sum(weight_FN*FNC + weight_FP*FPC) / length(obs)  
    return(MR) }

    pcut.seq = seq(0.01, 1, 0.01) 

MR_vec = rep(0, length(pcut.seq))  
for(i in 1:length(pcut.seq)){ 
    MR_vec[i] = costfunc(obs = credit_train$Class, pred.p = pred_credit_train, pcut = pcut.seq[i])}

optimal.pcut = pcut.seq[which(MR_vec==min(MR_vec))]
print(optimal.pcut)

## [1] 0.39 0.40 0.50

Your observation: Optimal probability cut off points found with MR are 0.39, 0.40 and 0.50.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

pred_credit_train_optimal <- (pred_credit_train>0.4)*1
conf_train <- table(credit_train$Class, pred_credit_train_optimal, dnn = c("True", "Predicted"))

TP <- conf_train[2, 2]  
TN <- conf_train[1, 1]  
FP <- conf_train[1, 2]  
FN <- conf_train[2, 1] 

MR <- (FP + FN) / sum(conf_train)
print(paste("Misclassification Rate (MR):", MR))

## [1] "Misclassification Rate (MR): 0.19625"

FPR <- FP / (FP + TN)
print(paste("False Positive Rate (FPR):", FPR))

## [1] "False Positive Rate (FPR): 0.532751091703057"

FNR <- FN / (FN + TP)
print(paste("False Negative Rate (FNR):", FNR))

## [1] "False Negative Rate (FNR): 0.0612959719789842"

Your observation: MR= .196 meaning 19.6% of predictions are incorrect. FPR = .532 means that 53.2% are misclassified as BAD. FNR =.061 means that only 6.1% of Good credit are misclassified.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

library(ROCR)
pred_train <- prediction(pred_credit_train, credit_train$Class)
ROC <- performance(pred_train, "tpr", "fpr")
plot(ROC, colorize=TRUE)

auc_train = unlist(slot(performance(pred_train, "auc"), "y.values"))
auc_train

## [1] 0.8504807

Your observation: Overall decent AUC with a AUC value of .85

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

pred_credit_test<- predict(glm_credit, newdata = credit_test, type="response")

pred_credit_test_optimal <- (pred_credit_test > optimal.pcut) * 1

## Warning in pred_credit_test > optimal.pcut: longer object length is not a
## multiple of shorter object length

conf_test <- table(credit_test$Class, pred_credit_test_optimal, dnn = c("True", "Predicted"))
print(conf_test)

##        Predicted
## True      0   1
##   FALSE  31  40
##   TRUE   15 114

TP_test <- conf_test[2, 2]  
TN_test <- conf_test[1, 1]  
FP_test <- conf_test[1, 2]  
FN_test <- conf_test[2, 1]

MR_test <- (FP_test + FN_test) / sum(conf_test)
print(paste("Misclassification Rate (MR) for Test Set:", MR_test))

## [1] "Misclassification Rate (MR) for Test Set: 0.275"

FPR_test <- FP_test / (FP_test + TN_test)
print(paste("False Positive Rate (FPR) for Test Set:", FPR_test))

## [1] "False Positive Rate (FPR) for Test Set: 0.563380281690141"

FNR_test <- FN_test / (FN_test + TP_test)
print(paste("False Negative Rate (FNR) for Test Set:", FNR_test))

## [1] "False Negative Rate (FNR) for Test Set: 0.116279069767442"

Your observation: MR= .275, FPR= 0.563, and FNR = .116

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

pred_test <- prediction(pred_credit_test, credit_test$Class)
ROC_test <- performance(pred_test, "tpr", "fpr")
plot(ROC_test, colorize=TRUE)

auc_test = unlist(slot(performance(pred_test, "auc"), "y.values"))
auc_test

## [1] 0.7353423

Your observation: AUC for this model is .735.

Task 5: Using different weights (20pts)

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

costfunc = function(obs, pred.p, pcut){
    weight_FN = 5    
    weight_FP = 1    
    FNC = sum( (obs==1) & (pred.p < pcut))   
    FPC = sum( (obs==0) & (pred.p >=pcut))  
    MR  = sum(weight_FN*FNC + weight_FP*FPC) / length(obs)  
    return(MR) 
}   
 pcut.seq = seq(0.01, 1, 0.01) 
 
 MR_vec = rep(0, length(pcut.seq))  
for(i in 1:length(pcut.seq)){ 
    MR_vec[i] = costfunc(obs = credit_train$Class, pred.p = pred_credit_train, pcut = pcut.seq[i])  
}

optimal.pcut = pcut.seq[which(MR_vec==min(MR_vec))]
print(optimal.pcut)

## [1] 0.22

Your observation: Optimal probability cut off is .22

2. Obtain the confusion matrix and MR for the training set.

pred_class_credit_train_optimal <- (pred_credit_train>0.22)*1
conf_train <- table(credit_train$Class, pred_credit_train_optimal, dnn = c("True", "Predicted"))


TP <- conf_train[2, 2]  
TN <- conf_train[1, 1]  
FP <- conf_train[1, 2]  
FN <- conf_train[2, 1]  

MR <- (FP + FN) / sum(conf_train)
print(paste("Misclassification Rate (MR):", MR))

## [1] "Misclassification Rate (MR): 0.19625"

FPR <- FP / (FP + TN)
print(paste("False Positive Rate (FPR):", FPR))

## [1] "False Positive Rate (FPR): 0.532751091703057"

FNR <- FN / (FN + TP)
print(paste("False Negative Rate (FNR):", FNR))

## [1] "False Negative Rate (FNR): 0.0612959719789842"

Your observation: MR=.196, FPR= .532, and FPR = .061

3. Obtain the confusion matrix and MR for the test set.

pred_credit_test_optimal <- (pred_credit_test>0.22)*1

conf_test <- table(credit_test$Class, pred_credit_test_optimal, dnn = c("True", "Predicted"))

TP <- conf_test[2, 2]  
TN <- conf_test[1, 1]  
FP <- conf_test[1, 2]  
FN <- conf_test[2, 1]  

print(paste0("testing MR:",MR))

## [1] "testing MR:0.19625"

print(paste0("testing FPR:",FPR))

## [1] "testing FPR:0.532751091703057"

print(paste0("testing FNR:",FNR))

## [1] "testing FNR:0.0612959719789842"

Your observation: MR=.19625, FPR 0.53275, FNR= 0.06129

Task 6: Conlusion (10pts)

Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model.

Optimal Probability Cut off is .22

“Misclassification Rate (MR): 0.19625” “False Positive Rate (FPR): 0.532751091703057” “False Negative Rate (FNR): 0.0612959719789842”

“testing MR:0.19625” “testing FPR:0.532751091703057” “testing FNR:0.0612959719789842”

Since Optimal probability cut off point is .22 (relatively low) this suggests that model tends to classify more instances as positive, resulting in a higher FPR.

Training set has a misclassification rate of 0.19625, which means 19.6% of predictions are incorrect which is not too bad but the test MR is higher at 27.5% indicating the model may not perform well with new data.

The training FPR is 0.5327 (53.27%), meaning that over half of the actual negatives are misclassified as positives. This is a significant issue, as a high false positive rate could lead to incorrect decisions in practical applications.

The training false negative rate (FNR) is 0.0613 (6.13%), indicating that the model successfully identifies a high proportion of actual positives. However, the testing FNR is higher at 0.1163 (11.63%), suggesting a trade-off between sensitivity (true positive rate) and specificity (true negative rate).

To improve the model, it could help to balance the classes. This can be done by adding more examples of the less common class (oversampling) , reducing examples of the more common class (undersampling), or creating examples to even things out.

Homework4

Christopher Salinas

10/23/2024

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

2. Obtain the confusion matrix and MR for the training set.

3. Obtain the confusion matrix and MR for the test set.

Task 6: Conlusion (10pts)

Homework4

Christopher Salinas

10/23/2024

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as 2024 for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

2. Obtain the confusion matrix and MR for the training set.

3. Obtain the confusion matrix and MR for the test set.

Task 6: Conlusion (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)