Homework5

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

if (!require(package_name, quietly = TRUE)) {
    install.packages("caret")
}

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
# str(GermanCredit)

Your observation: After loading and cleaning, we have a binary outcome coded as TRUE for good credit and FALSE for bad credit, plus numeric/dummy predictors for logistic regressions.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.

2. Explore the dataset to understand its structure.

(1) How many observations and variables are there? (2 pts)

dim(GermanCredit)

## [1] 1000   49

Your observation: There are 1000 observations and 49 variables

(2) Please make a frequency table of variable `class` (use `table()` function). How many observations are classed as “good” and how many are “bad”? (2 pts)

table(GermanCredit$Class)

## 
## FALSE  TRUE 
##   300   700

Your observation: There are 700 good and 300 bad credit applicants. ##### (3) Please make a barplot of of response variable class. Please add titles and labels to axis. (2 pts)

class_tab <- table(GermanCredit$Class)

barplot(class_tab,
names.arg = c("Bad", "Good"),
main = "Distribution of Credit Class",
xlab = "Class",
ylab = "Count")

3. Split the dataset into training and test set. A random seed of `2025` is set for reproducibility. Please comment on what is the split proportion you choose for training and testing data? (2 pts)

set.seed(2025) # set random seed for reproducibility.

train_index <- createDataPartition(GermanCredit$Class,
p = 0.7,   # 70% training
list = FALSE)

train_data <- GermanCredit[train_index, ]
test_data  <- GermanCredit[-train_index, ]

dim(train_data)

## [1] 700  49

dim(test_data)

## [1] 300  49

Your comment: I used a 70%/30% split. 700 observations for training and 300 for testing.

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right. (2 pts)

# Logistic regression with all predictors

logit_model <- glm(Class ~ .,
data = train_data,
family = binomial)

summary(logit_model)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = train_data)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         7.941e+00  1.617e+00   4.910 9.12e-07 ***
## Duration                           -2.901e-02  1.151e-02  -2.521 0.011705 *  
## Amount                             -1.075e-04  5.272e-05  -2.039 0.041401 *  
## InstallmentRatePercentage          -3.898e-01  1.092e-01  -3.569 0.000358 ***
## ResidenceDuration                  -6.141e-02  1.015e-01  -0.605 0.545004    
## Age                                 2.488e-02  1.142e-02   2.178 0.029393 *  
## NumberExistingCredits              -2.077e-01  2.276e-01  -0.913 0.361504    
## NumberPeopleMaintenance            -2.455e-01  3.068e-01  -0.800 0.423583    
## Telephone                          -4.703e-02  2.442e-01  -0.193 0.847267    
## ForeignWorker                      -1.116e+00  6.633e-01  -1.682 0.092485 .  
## CheckingAccountStatus.lt.0         -1.539e+00  2.694e-01  -5.714 1.10e-08 ***
## CheckingAccountStatus.0.to.200     -1.186e+00  2.786e-01  -4.258 2.06e-05 ***
## CheckingAccountStatus.gt.200       -2.359e-01  4.652e-01  -0.507 0.612139    
## CreditHistory.NoCredit.AllPaid     -1.508e+00  5.144e-01  -2.931 0.003383 ** 
## CreditHistory.ThisBank.AllPaid     -1.970e+00  5.276e-01  -3.733 0.000189 ***
## CreditHistory.PaidDuly             -9.283e-01  3.350e-01  -2.771 0.005588 ** 
## CreditHistory.Delay                -7.730e-01  4.062e-01  -1.903 0.057032 .  
## Purpose.NewCar                     -1.616e+00  8.906e-01  -1.815 0.069570 .  
## Purpose.UsedCar                     1.418e-02  9.328e-01   0.015 0.987872    
## Purpose.Furniture.Equipment        -8.664e-01  8.985e-01  -0.964 0.334875    
## Purpose.Radio.Television           -6.736e-01  8.958e-01  -0.752 0.452093    
## Purpose.DomesticAppliance          -1.316e+00  1.198e+00  -1.099 0.271980    
## Purpose.Repairs                    -1.125e+00  1.077e+00  -1.045 0.295888    
## Purpose.Education                  -1.842e+00  9.842e-01  -1.872 0.061248 .  
## Purpose.Retraining                  5.452e-01  1.534e+00   0.355 0.722233    
## Purpose.Business                   -1.013e+00  9.128e-01  -1.110 0.266924    
## SavingsAccountBonds.lt.100         -1.011e+00  3.130e-01  -3.230 0.001238 ** 
## SavingsAccountBonds.100.to.500     -6.076e-01  4.251e-01  -1.429 0.152873    
## SavingsAccountBonds.500.to.1000    -3.826e-01  5.583e-01  -0.685 0.493160    
## SavingsAccountBonds.gt.1000         2.995e-01  7.103e-01   0.422 0.673243    
## EmploymentDuration.lt.1             3.306e-01  4.957e-01   0.667 0.504788    
## EmploymentDuration.1.to.4           5.461e-01  4.722e-01   1.156 0.247501    
## EmploymentDuration.4.to.7           6.790e-01  5.128e-01   1.324 0.185449    
## EmploymentDuration.gt.7             4.548e-01  4.684e-01   0.971 0.331585    
## Personal.Male.Divorced.Seperated   -7.670e-01  5.438e-01  -1.410 0.158402    
## Personal.Female.NotSingle          -2.582e-01  3.681e-01  -0.701 0.483014    
## Personal.Male.Single                4.698e-01  3.750e-01   1.253 0.210300    
## OtherDebtorsGuarantors.None        -1.044e+00  5.749e-01  -1.817 0.069292 .  
## OtherDebtorsGuarantors.CoApplicant -1.167e+00  7.453e-01  -1.566 0.117349    
## Property.RealEstate                 7.808e-01  5.317e-01   1.468 0.141990    
## Property.Insurance                  3.269e-01  5.206e-01   0.628 0.530048    
## Property.CarOther                   7.458e-01  5.089e-01   1.466 0.142739    
## OtherInstallmentPlans.Bank         -5.795e-01  2.937e-01  -1.973 0.048493 *  
## OtherInstallmentPlans.Stores       -5.174e-01  4.619e-01  -1.120 0.262692    
## Housing.Rent                       -8.942e-01  5.908e-01  -1.514 0.130116    
## Housing.Own                        -4.131e-01  5.593e-01  -0.739 0.460202    
## Job.UnemployedUnskilled             1.175e+00  9.763e-01   1.204 0.228626    
## Job.UnskilledResident              -2.144e-01  4.270e-01  -0.502 0.615535    
## Job.SkilledEmployee                -1.170e-01  3.437e-01  -0.340 0.733609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 855.21  on 699  degrees of freedom
## Residual deviance: 621.15  on 651  degrees of freedom
## AIC: 719.15
## 
## Number of Fisher Scoring iterations: 5

2. Summarize the model and interpret the coefficients. What is the estimated coefficients for variable `InstallmentRatePercentage`? Is it significant, and why? (2 pts)

summary(logit_model)$coefficients["InstallmentRatePercentage", ]

##      Estimate    Std. Error       z value      Pr(>|z|) 
## -0.3897870155  0.1091997206 -3.5694872977  0.0003576806

Your comment: The estimated coefficient for InstallmentRatePercentage is approximately -0.39 with a small p-value (<0.001). This means it is statistically significant at the 5% level.

3. Please interpret this number in detail (please calculate the corresponding odds ratio, and interpret it). (2 pts)

beta_install <- coef(logit_model)["InstallmentRatePercentage"]
odds_ratio_install <- exp(beta_install)
odds_ratio_install

## InstallmentRatePercentage 
##                 0.6772011

Your comment: The odds ratio is about 0.68. This means that for each one-unit increase in InstallmentRatePercentage, the odds of being classified as Good are multiplied by about 0.68. (About a 32% decrease in the odds of good credit) ### Task 3: Model Evaluation (Part I)

1. Use the training set to obtain predicted probabilities. (2 pts)

train_pred_prob <- predict(logit_model,
newdata = train_data,
type = "response")

summary(train_pred_prob)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04427 0.50110 0.78645 0.70000 0.92189 0.99851

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the the training set. (3 pts)

cutoff <- 0.5

# Predicted class: 1 = Good, 0 = Bad

train_pred_class <- ifelse(train_pred_prob >= cutoff, 1, 0)

# Actual: convert logical TRUE/FALSE to 1/0

actual_train <- ifelse(train_data$Class, 1, 0)

confusion_train <- table(Predicted = train_pred_class,
Actual    = actual_train)
confusion_train

##          Actual
## Predicted   0   1
##         0 122  52
##         1  88 438

# Misclassification rate

FP_train <- sum(actual_train == 0 & train_pred_class == 1)
FN_train <- sum(actual_train == 1 & train_pred_class == 0)
MR_train <- (FP_train + FN_train) / length(actual_train)
MR_train

## [1] 0.2

Your comment: The training confusion matrix shows about 80% correctly classified and MR is about 0.20 (20% of training observations misclassified). Performance is decent given the problem and imbalance.

3. Find the optimal probability cut-off point using the MR. Please draw a plot of MR vs. cut-off probability, and comment on optimal cut-off probability. (3 pts)

pcut_seq <- seq(0, 1, by = 0.01)
MR_seq <- numeric(length(pcut_seq))

for(i in seq_along(pcut_seq)){
pcut <- pcut_seq[i]
pred_class_i <- ifelse(train_pred_prob >= pcut, 1, 0)
FP_i <- sum(actual_train == 0 & pred_class_i == 1)
FN_i <- sum(actual_train == 1 & pred_class_i == 0)
MR_seq[i] <- (FP_i + FN_i) / length(actual_train)
}

plot(pcut_seq, MR_seq, type = "l",
xlab = "Cut-off probability",
ylab = "Misclassification Rate (MR)",
main = "MR vs Cut-off Probability (Training Set)")

best_cut <- pcut_seq[which.min(MR_seq)]
best_cut

## [1] 0.49

min(MR_seq)

## [1] 0.1957143

Your comment: The minimum MR occurs at a cut-off around 0.49 (very close to 0.5), with MR slightly below 0.20. So the usual 0.5 threshold is nearly optimal.

4. Please generate the ROC curve and calculate the AUC for the training set. Please comment on this AUC. (2 pts)

library(ROCR)


pred_train <- prediction(train_pred_prob, actual_train)
perf_train <- performance(pred_train, "tpr", "fpr")

plot(perf_train,
col = "blue",
lwd = 2,
main = "ROC Curve - Training Set")

auc_train <- performance(pred_train, "auc")@y.values[[1]]
auc_train

## [1] 0.8345578

Your comment: The training AUC is about 0.83, which indicates good discrimination: the model has a strong ability to rank good vs. bad credit applicants better than random guessing. ## Task 4: Model Evaluation (Part II)

1. Use the testing set to obtain predicted probabilities. (2 pts)

test_pred_prob <- predict(logit_model,
newdata = test_data,
type = "response")

summary(test_pred_prob)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06043 0.52355 0.80110 0.70132 0.91777 0.99664

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the the training set. (2 pts)

cutoff <- 0.5

test_pred_class <- ifelse(test_pred_prob >= cutoff, 1, 0)
actual_test <- ifelse(test_data$Class, 1, 0)

confusion_test <- table(Predicted = test_pred_class,
Actual    = actual_test)
confusion_test

##          Actual
## Predicted   0   1
##         0  40  29
##         1  50 181

FP_test <- sum(actual_test == 0 & test_pred_class == 1)
FN_test <- sum(actual_test == 1 & test_pred_class == 0)
MR_test <- (FP_test + FN_test) / length(actual_test)
MR_test

## [1] 0.2633333

Your comment: On the test set, MR is about 0.26 (about 26% misclassified), worse than the training MR (about 20%). This drop is expected and suggests some overfitting but still reasonable generalization.

2. Please generate the ROC curve and calculate the AUC for the test set. Please comment on this AUC. (2 pts)

pred_test <- prediction(test_pred_prob, actual_test)
perf_test <- performance(pred_test, "tpr", "fpr")

plot(perf_test,
col = "red",
lwd = 2,
main = "ROC Curve - Test Set")

auc_test <- performance(pred_test, "auc")@y.values[[1]]
auc_test

## [1] 0.7964021

Your comment: The test AUC is about 0.80, slightly lower than the training AUC but still in a good range. This indicates the model maintains solid predictive power on unseen data with only modest performance degradation.