Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format

## Warning: package 'caret' was built under R version 4.5.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.5.3

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: The dataset has 1000 obs and 62 variables. Some variables are numeric while many are binary which suggets that categorical variables have already been converted into dummy variables.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.

2. Explore the dataset to understand its structure. (10pts)

dim(GermanCredit)

## [1] 1000   49

Your observation: The dataset has 1000 observations and 62 variables. This means there are 1000 credit applicants and 62 features describing each applicant. The relatively large number of variables suggests the dataset includes many detailed attributes, likely including both original numeric variables and multiple dummy variables created from categorical features.

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

set.seed(2024)
trainIndex <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)
trainSet   <- GermanCredit[trainIndex, ]
testSet    <- GermanCredit[-trainIndex, ]

Your observation: The dataset was split into training and test set using 80/20 ratio. The training group contains 80% of the data and builds the model while the test group contains the remainding 30% and evalutes model performance

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

model <- glm(Class ~ ., data = trainSet, family = binomial)

Your observation: Used a logistic regression to fit all the predictor variables, the model estimates the relationship between each variable and the probability of a credit applicant being classified as “Good”. The results of the regression will be used to interpret which variables have the strongest impact on the outcome.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

summary(model)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = trainSet)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.622e+00  1.643e+00   5.855 4.78e-09 ***
## Duration                           -3.144e-02  1.045e-02  -3.008 0.002632 ** 
## Amount                             -1.562e-04  5.251e-05  -2.976 0.002924 ** 
## InstallmentRatePercentage          -3.239e-01  1.012e-01  -3.202 0.001366 ** 
## ResidenceDuration                  -8.167e-02  9.954e-02  -0.821 0.411921    
## Age                                 5.497e-03  1.059e-02   0.519 0.603849    
## NumberExistingCredits              -2.030e-01  2.128e-01  -0.954 0.340032    
## NumberPeopleMaintenance            -2.122e-01  2.813e-01  -0.754 0.450650    
## Telephone                          -5.724e-01  2.364e-01  -2.421 0.015462 *  
## ForeignWorker                      -1.606e+00  7.500e-01  -2.141 0.032304 *  
## CheckingAccountStatus.lt.0         -1.938e+00  2.694e-01  -7.194 6.27e-13 ***
## CheckingAccountStatus.0.to.200     -1.498e+00  2.714e-01  -5.517 3.44e-08 ***
## CheckingAccountStatus.gt.200       -5.494e-01  4.661e-01  -1.179 0.238545    
## CreditHistory.NoCredit.AllPaid     -1.224e+00  4.869e-01  -2.513 0.011972 *  
## CreditHistory.ThisBank.AllPaid     -1.701e+00  5.235e-01  -3.250 0.001154 ** 
## CreditHistory.PaidDuly             -1.042e+00  2.967e-01  -3.511 0.000446 ***
## CreditHistory.Delay                -6.730e-01  3.969e-01  -1.696 0.089965 .  
## Purpose.NewCar                     -1.576e+00  8.544e-01  -1.845 0.065066 .  
## Purpose.UsedCar                     4.007e-01  8.938e-01   0.448 0.653912    
## Purpose.Furniture.Equipment        -6.678e-01  8.626e-01  -0.774 0.438868    
## Purpose.Radio.Television           -7.860e-01  8.643e-01  -0.909 0.363128    
## Purpose.DomesticAppliance          -6.060e-01  1.221e+00  -0.496 0.619740    
## Purpose.Repairs                    -1.920e+00  1.038e+00  -1.851 0.064218 .  
## Purpose.Education                  -1.214e+00  9.555e-01  -1.271 0.203804    
## Purpose.Retraining                 -3.336e-01  1.478e+00  -0.226 0.821459    
## Purpose.Business                   -6.818e-01  8.848e-01  -0.771 0.440963    
## SavingsAccountBonds.lt.100         -1.149e+00  2.967e-01  -3.872 0.000108 ***
## SavingsAccountBonds.100.to.500     -6.596e-01  4.033e-01  -1.635 0.101953    
## SavingsAccountBonds.500.to.1000    -3.858e-01  5.251e-01  -0.735 0.462566    
## SavingsAccountBonds.gt.1000         5.404e-02  6.246e-01   0.087 0.931054    
## EmploymentDuration.lt.1             1.274e-02  5.305e-01   0.024 0.980835    
## EmploymentDuration.1.to.4           1.511e-02  5.075e-01   0.030 0.976241    
## EmploymentDuration.4.to.7           8.924e-01  5.566e-01   1.603 0.108838    
## EmploymentDuration.gt.7             2.865e-02  5.062e-01   0.057 0.954873    
## Personal.Male.Divorced.Seperated   -3.225e-01  5.236e-01  -0.616 0.537938    
## Personal.Female.NotSingle          -1.771e-01  3.528e-01  -0.502 0.615693    
## Personal.Male.Single                4.965e-01  3.556e-01   1.396 0.162620    
## OtherDebtorsGuarantors.None        -1.051e+00  4.694e-01  -2.239 0.025151 *  
## OtherDebtorsGuarantors.CoApplicant -1.407e+00  6.547e-01  -2.149 0.031672 *  
## Property.RealEstate                 1.287e+00  4.758e-01   2.705 0.006836 ** 
## Property.Insurance                  8.400e-01  4.628e-01   1.815 0.069495 .  
## Property.CarOther                   1.055e+00  4.550e-01   2.319 0.020410 *  
## OtherInstallmentPlans.Bank         -6.532e-01  2.710e-01  -2.411 0.015917 *  
## OtherInstallmentPlans.Stores       -4.904e-01  4.506e-01  -1.088 0.276477    
## Housing.Rent                       -1.014e+00  5.446e-01  -1.862 0.062617 .  
## Housing.Own                        -5.562e-01  5.063e-01  -1.099 0.271959    
## Job.UnemployedUnskilled             7.335e-01  7.829e-01   0.937 0.348812    
## Job.UnskilledResident               6.990e-02  4.187e-01   0.167 0.867408    
## Job.SkilledEmployee                 6.613e-02  3.472e-01   0.190 0.848920    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 977.38  on 799  degrees of freedom
## Residual deviance: 688.60  on 751  degrees of freedom
## AIC: 786.6
## 
## Number of Fisher Scoring iterations: 5

Your observation: The model results show that several variables are statistically significant, including Duration, Amount, InstallmentRatePercentage, Telephone, ForeignWorker, and multiple checking account and credit history variables. This suggests these factors have a meaningful impact on whether a credit applicant is classified as “Good.” Overall, the model captures important relationships in the data, especially financial indicators, but could be improved by removing redundant variables. # Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

train_probs <- predict(model, newdata = trainSet, type = "response")

Your observation: The model generates predicted probabilities for each observation in the training set, representing the likelihood that a credit applicant is classified as “Good.” These probabilities range between 0 and 1, where values closer to 1 indicate a higher likelihood of being a good credit risk.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

cutoffs <- seq(0.01, 0.99, by = 0.01)

costs <- sapply(cutoffs, function(c) {
  pred <- train_probs >= c
  mean(pred != trainSet$Class)
})

optimal_cutoff <- cutoffs[which.min(costs)]
cat("Optimal Cutoff:", optimal_cutoff, "\n")

## Optimal Cutoff: 0.46

cat("Minimum MR:", min(costs), "\n")

## Minimum MR: 0.195

plot(cutoffs, costs, type = "l", col = "blue", lwd = 2,
     xlab = "Cutoff", ylab = "Misclassification Rate")
abline(v = optimal_cutoff, col = "red", lty = 2)

Your observation: The plot shows the misclassification rate is relatively flat and low between .3 and .5, then sharply rises as the model begins predicting everything else is Bad

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

train_pred <- factor(train_probs >= optimal_cutoff, levels = c(FALSE, TRUE))
train_actual <- factor(trainSet$Class, levels = c(FALSE, TRUE))
cm_train <- confusionMatrix(train_pred, train_actual, positive = "TRUE")
print(cm_train)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   131   47
##      TRUE    109  513
##                                           
##                Accuracy : 0.805           
##                  95% CI : (0.7758, 0.8319)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 1.006e-11       
##                                           
##                   Kappa : 0.4987          
##                                           
##  Mcnemar's Test P-Value : 1.040e-06       
##                                           
##             Sensitivity : 0.9161          
##             Specificity : 0.5458          
##          Pos Pred Value : 0.8248          
##          Neg Pred Value : 0.7360          
##              Prevalence : 0.7000          
##          Detection Rate : 0.6412          
##    Detection Prevalence : 0.7775          
##       Balanced Accuracy : 0.7310          
##                                           
##        'Positive' Class : TRUE            
##

mr_train <- 1 - cm_train$overall["Accuracy"]
cat("Training MR:", round(mr_train, 4), "\n")

## Training MR: 0.195

Your observation: The model has high sensitivity meaning it correctly identified 91.6% of actual Good applicants. It rarely misses a good customer. It also has low specificity meaning the model only catches 54.6% of actual Bad applicants. A concern for credit risk

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

library(pROC)

## Warning: package 'pROC' was built under R version 4.5.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc_train <- roc(trainSet$Class, train_probs)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(roc_train, col = "blue", lwd = 2,
     main = "ROC Curve - Training Set")
abline(a = 0, b = 1, lty = 2, col = "red")

auc_train <- auc(roc_train)
cat("Training AUC:", round(auc_train, 4), "\n")

## Training AUC: 0.8466

Your observation: The ROC curve bows strongly toward the top left, indicating that the mdoel performs well. AN AUC of .8466 indicates that the model has a 84.66% chance of correctly ranking a Good applicant than a Bad one. Overall it seems the model is consistent and has solid performance for the training set

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

test_probs <- predict(model, newdata = testSet, type = "response")
test_pred <- factor(test_probs >= optimal_cutoff, levels = c(FALSE, TRUE))
test_actual <- factor(testSet$Class, levels = c(FALSE, TRUE))

cm_test <- confusionMatrix(test_pred, test_actual, positive = "TRUE")
print(cm_test)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    28   21
##      TRUE     32  119
##                                           
##                Accuracy : 0.735           
##                  95% CI : (0.6681, 0.7948)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.1579          
##                                           
##                   Kappa : 0.3342          
##                                           
##  Mcnemar's Test P-Value : 0.1696          
##                                           
##             Sensitivity : 0.8500          
##             Specificity : 0.4667          
##          Pos Pred Value : 0.7881          
##          Neg Pred Value : 0.5714          
##              Prevalence : 0.7000          
##          Detection Rate : 0.5950          
##    Detection Prevalence : 0.7550          
##       Balanced Accuracy : 0.6583          
##                                           
##        'Positive' Class : TRUE            
##

mr_test <- 1 - cm_test$overall["Accuracy"]
cat("Test MR:", round(mr_test, 4), "\n")

## Test MR: 0.265

Your observation: Sensivitiy dropped and Specificity dropped. Test MR of .265 is higher than traning MR of .195, a difference of .07, suggesting some degree of overftting perhaps. Overall, the model shows acceptable but degraded generilization

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

library(pROC)

roc_test <- roc(testSet$Class, test_probs)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(roc_test, col = "blue", lwd = 2,
     main = "ROC Curve - Test Set")
abline(a = 0, b = 1, lty = 2, col = "red")

auc_test <- auc(roc_test)
cat("Test AUC:", round(auc_test, 4), "\n")

## Test AUC: 0.7544

Your observation: The Test AUC is noticeably lower than training AUC of .8466, a drop of .09. This confirms that Test is moderately overfitting.Although lower, .7544 is still an acceptable power. The Test ROC is more jagged and stepped compared to the training.

Task 5: Using different weights (20pts)

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

weight_FN <- 1
weight_FP <- 5

cutoffs <- seq(0.01, 0.99, by = 0.01)

costs <- sapply(cutoffs, function(c) {
  pred <- train_probs >= c
  (weight_FN * sum(trainSet$Class == TRUE  & pred == FALSE) +
   weight_FP * sum(trainSet$Class == FALSE & pred == TRUE)) / nrow(trainSet)
})

optimal_cutoff_w <- cutoffs[which.min(costs)]
cat("Optimal Cutoff (weighted):", optimal_cutoff_w, "\n")

## Optimal Cutoff (weighted): 0.85

cat("Minimum Weighted Cost:", min(costs), "\n")

## Minimum Weighted Cost: 0.4425

plot(cutoffs, costs, type = "l", col = "blue", lwd = 2,
     xlab = "Cutoff", ylab = "Weighted Cost")
abline(v = optimal_cutoff_w, col = "red", lty = 2)
legend("topright", legend = paste("Optimal Cutoff =", optimal_cutoff_w), col = "red", lty = 2)

Your observation: The optimal cutoff increased from .46 to .85 and this suggests the weights are swapped in the cost function. ### 2. Obtain the confusion matrix and MR for the training set.

train_pred_w <- factor(train_probs >= optimal_cutoff_w, levels = c(FALSE, TRUE))
train_actual <- factor(trainSet$Class, levels = c(FALSE, TRUE))

cm_train_w <- confusionMatrix(train_pred_w, train_actual, positive = "TRUE")
print(cm_train_w)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   221  259
##      TRUE     19  301
##                                           
##                Accuracy : 0.6525          
##                  95% CI : (0.6184, 0.6855)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.9983          
##                                           
##                   Kappa : 0.3565          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5375          
##             Specificity : 0.9208          
##          Pos Pred Value : 0.9406          
##          Neg Pred Value : 0.4604          
##              Prevalence : 0.7000          
##          Detection Rate : 0.3762          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.7292          
##                                           
##        'Positive' Class : TRUE            
##

mr_train_w <- 1 - cm_train_w$overall["Accuracy"]
cat("Training MR (weighted):", round(mr_train_w, 4), "\n")

## Training MR (weighted): 0.3475

Your observation: Specificity jumped dramatically from .546 to .921. The model correctly identifies 92% of actual bad customers. Sensitivity dropped sharply from .916 to .538, the model rejects more than half of actually good customers. ### 3. Obtain the confusion matrix and MR for the test set.

test_pred_w <- factor(test_probs >= optimal_cutoff_w, levels = c(FALSE, TRUE))
test_actual <- factor(testSet$Class, levels = c(FALSE, TRUE))

cm_test_w <- confusionMatrix(test_pred_w, test_actual, positive = "TRUE")
print(cm_test_w)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    46   61
##      TRUE     14   79
##                                           
##                Accuracy : 0.625           
##                  95% CI : (0.5539, 0.6923)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.9906          
##                                           
##                   Kappa : 0.2704          
##                                           
##  Mcnemar's Test P-Value : 1.087e-07       
##                                           
##             Sensitivity : 0.5643          
##             Specificity : 0.7667          
##          Pos Pred Value : 0.8495          
##          Neg Pred Value : 0.4299          
##              Prevalence : 0.7000          
##          Detection Rate : 0.3950          
##    Detection Prevalence : 0.4650          
##       Balanced Accuracy : 0.6655          
##                                           
##        'Positive' Class : TRUE            
##

mr_test_w <- 1 - cm_test_w$overall["Accuracy"]
cat("Test MR (weighted):", round(mr_test_w, 4), "\n")

## Test MR (weighted): 0.375

Your observation: The weighted model achieves a test MR of 0.375 with notably improved specificity of 0.767 compared to 0.467 in the equal-weight model, confirming the model successfully generalizes its conservative behavior to unseen data. The dramatic reduction in false positives (14 vs. 32) demonstrates the asymmetric cost structure is working as intended — approving far fewer bad customers at the expense of rejecting more good ones. # Task 6: Conlusion (10pts)

Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model further.

The equal-weight model (cutoff = 0.46) achieved a reasonable AUC of 0.8466 on training and 0.7544 on test, with a moderate generalization gap in MR (0.195 vs. 0.265), indicating mild overfitting. The asymmetric weighted model (cutoff = 0.85) significantly improved specificity from 0.467 to 0.767 on the test set, reducing false positives from 32 to 14 — a more practical outcome for credit risk management where approving bad customers is 5x more costly.

Try more powerful models, which may better capture non-linear relationships in the data and improve both AUC and generalization. irrelevant or highly correlated variables through regularization could reduce overfitting and improve test performance. Cross-validation instead of a single train/test split would provide more robust and reliable estimates of model performance.

Homework5

tianhai

10/23/2025

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

Homework5

tianhai

10/23/2025

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as 2024 for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)