Homework5

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

if (!require(package_name, quietly = TRUE)) {
    install.packages("caret")
}

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
# 
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation:

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.

2. Explore the dataset to understand its structure.

(1) How many observations and variables are there? (2 pts)

# your code here:
dim(GermanCredit)

## [1] 1000   49

Your observation: There are 1000 observations and 49 variables.

(2) Please make a frequency table of variable `class` (use `table()` function). How many observations are classed as “good” and how many are “bad”? (2 pts)

# your code here:
table(GermanCredit$Class)

## 
## FALSE  TRUE 
##   300   700

Your observation: 300 are bad and 700 are good

(3) Please make a barplot of of response variable `class`. Please add titles and labels to axis. (2 pts)

# your code here:
barplot(table(GermanCredit$Class), main = "Distribution of Germany's Credit Class", xlab = "Class", ylab = "Frequency")

3. Split the dataset into training and test set. A random seed of `2025` is set for reproducibility. Please comment on what is the split proportion you choose for training and testing data? (2 pts)

set.seed(2025) # set random seed for reproducibility.
# your code here: 
train_ind <- sample(1:nrow(GermanCredit), 0.75 * nrow(GermanCredit))
GermanCredit_train <- GermanCredit[train_ind, ]
GermanCredit_test <- GermanCredit[-train_ind, ]

Your comment:I chose a 75% / 25% split for the training and testing data. This means 75% of the observations are used to train the logistic regression model, allowing it to learn patterns from most of the data, while 25% of the observations are held out for testing to evaluate the model’s performance on unseen data.

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right. (2 pts)

# your code here:
GermanCredit_logit <- glm(Class ~., data = GermanCredit_train, family = binomial)

2. Summarize the model and interpret the coefficients. What is the estimated coefficients for variable `InstallmentRatePercentage`? Is it significant, and why? (2 pts)

# your code here:
summary(GermanCredit_logit)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = GermanCredit_train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         8.138e+00  1.659e+00   4.905 9.33e-07 ***
## Duration                           -2.287e-02  1.051e-02  -2.176 0.029549 *  
## Amount                             -1.407e-04  5.069e-05  -2.775 0.005517 ** 
## InstallmentRatePercentage          -3.080e-01  1.007e-01  -3.059 0.002223 ** 
## ResidenceDuration                  -8.627e-02  1.017e-01  -0.848 0.396285    
## Age                                 1.401e-02  1.045e-02   1.340 0.180091    
## NumberExistingCredits              -3.447e-01  2.206e-01  -1.562 0.118192    
## NumberPeopleMaintenance            -3.991e-01  2.749e-01  -1.452 0.146546    
## Telephone                          -3.329e-01  2.339e-01  -1.423 0.154683    
## ForeignWorker                      -1.418e+00  7.389e-01  -1.919 0.055021 .  
## CheckingAccountStatus.lt.0         -1.775e+00  2.712e-01  -6.544 5.98e-11 ***
## CheckingAccountStatus.0.to.200     -1.314e+00  2.678e-01  -4.906 9.30e-07 ***
## CheckingAccountStatus.gt.200       -1.009e+00  4.333e-01  -2.330 0.019815 *  
## CreditHistory.NoCredit.AllPaid     -1.034e+00  5.010e-01  -2.063 0.039097 *  
## CreditHistory.ThisBank.AllPaid     -1.433e+00  4.930e-01  -2.906 0.003663 ** 
## CreditHistory.PaidDuly             -8.772e-01  2.975e-01  -2.949 0.003193 ** 
## CreditHistory.Delay                -4.631e-01  3.826e-01  -1.210 0.226131    
## Purpose.NewCar                     -1.169e+00  8.451e-01  -1.384 0.166449    
## Purpose.UsedCar                     5.667e-01  8.977e-01   0.631 0.527881    
## Purpose.Furniture.Equipment        -2.726e-01  8.570e-01  -0.318 0.750388    
## Purpose.Radio.Television           -2.016e-01  8.540e-01  -0.236 0.813402    
## Purpose.DomesticAppliance          -4.521e-01  1.222e+00  -0.370 0.711306    
## Purpose.Repairs                    -1.303e+00  1.030e+00  -1.265 0.205940    
## Purpose.Education                  -1.009e+00  9.280e-01  -1.087 0.277076    
## Purpose.Retraining                  7.012e-01  1.518e+00   0.462 0.644074    
## Purpose.Business                   -5.485e-01  8.790e-01  -0.624 0.532654    
## SavingsAccountBonds.lt.100         -1.092e+00  3.184e-01  -3.428 0.000608 ***
## SavingsAccountBonds.100.to.500     -8.785e-01  4.048e-01  -2.170 0.029972 *  
## SavingsAccountBonds.500.to.1000    -8.902e-01  5.226e-01  -1.703 0.088478 .  
## SavingsAccountBonds.gt.1000         3.225e-01  6.318e-01   0.510 0.609733    
## EmploymentDuration.lt.1             4.102e-01  4.885e-01   0.840 0.401148    
## EmploymentDuration.1.to.4           4.354e-01  4.675e-01   0.931 0.351677    
## EmploymentDuration.4.to.7           1.268e+00  5.081e-01   2.496 0.012563 *  
## EmploymentDuration.gt.7             5.514e-01  4.803e-01   1.148 0.250921    
## Personal.Male.Divorced.Seperated   -1.034e-01  5.311e-01  -0.195 0.845651    
## Personal.Female.NotSingle          -6.919e-02  3.665e-01  -0.189 0.850244    
## Personal.Male.Single                5.270e-01  3.657e-01   1.441 0.149528    
## OtherDebtorsGuarantors.None        -6.635e-01  4.785e-01  -1.387 0.165542    
## OtherDebtorsGuarantors.CoApplicant -9.064e-01  6.538e-01  -1.386 0.165638    
## Property.RealEstate                 3.235e-01  5.026e-01   0.644 0.519794    
## Property.Insurance                  2.209e-01  4.940e-01   0.447 0.654739    
## Property.CarOther                   2.288e-01  4.772e-01   0.480 0.631544    
## OtherInstallmentPlans.Bank         -7.996e-01  2.707e-01  -2.953 0.003142 ** 
## OtherInstallmentPlans.Stores       -2.339e-01  4.453e-01  -0.525 0.599450    
## Housing.Rent                       -4.498e-01  5.573e-01  -0.807 0.419647    
## Housing.Own                        -2.336e-01  5.368e-01  -0.435 0.663475    
## Job.UnemployedUnskilled             7.239e-01  7.837e-01   0.924 0.355629    
## Job.UnskilledResident              -4.570e-02  3.959e-01  -0.115 0.908101    
## Job.SkilledEmployee                -1.314e-01  3.274e-01  -0.402 0.688025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 931.04  on 749  degrees of freedom
## Residual deviance: 687.08  on 701  degrees of freedom
## AIC: 785.08
## 
## Number of Fisher Scoring iterations: 5

Your comment: The coefficient is negative, which means that higher installment rates are associated with a lower probability of being a good credit risk. The InstallmentRatepercentage is -3.080e-01 The p-value is 0.002, which is less than 0.05. This indicates that the effect of InstallmentRatePercentage is statistically significant.

3. Please interpret this number in detail (please calculate the corresponding odds ratio, and interpret it). (2 pts)

# you might need some code for calculation:
coef_GermanCredit <- coef(GermanCredit_logit)["InstallmentRatePercentage"]
odds_ratio <- exp(coef_GermanCredit)
odds_ratio

## InstallmentRatePercentage 
##                 0.7349491

Your comment: The estimated coefficient for InstallmentRatePercentage is -0.308. The corresponding odds ratio is 0.735, which means that for every 1-unit increase in InstallmentRatePercentage, the odds of having good credit decrease by about 26.5%. The p-value is 0.002, indicating this effect is statistically significant. Higher installment rates are therefore associated with a lower probability of being a good credit risk.

Task 3: Model Evaluation (Part I)

1. Use the training set to obtain predicted probabilities. (2 pts)

# your code here:
train_pred_prob <- predict(GermanCredit_logit, newdata= GermanCredit_train, type = "response")
head(train_pred_prob)

##       909       460       932       922       961       279 
## 0.9624772 0.9035098 0.6215123 0.8649416 0.9471023 0.8748184

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the the training set. (3 pts)

# your code here:
train_pred_class <- ifelse(train_pred_prob > 0.5, TRUE, FALSE)
conf_matrix_train <- table(Predicted = train_pred_class, Actual = GermanCredit_train$Class)
conf_matrix_train

##          Actual
## Predicted FALSE TRUE
##     FALSE   129   59
##     TRUE    105  457

MR_train <- mean(train_pred_class != GermanCredit_train$Class)
MR_train

## [1] 0.2186667

Your comment: The confusion matrix shows that the model correctly predicted 129 bad credit cases and 457 good credit cases. It misclassified 59 good credit cases as bad and 105 bad credit cases as good. Overall, the model performs reasonably well, but it tends to misclassify some bad credit cases as good, which could be risky for lenders.The misclassification rate on the training data is 0.2187, meaning the model incorrectly classifies about 21.9% of the training observations.

3. Find the optimal probability cut-off point using the MR. Please draw a plot of MR vs. cut-off probability, and comment on optimal cut-off probability. (3 pts)

# your code here:
cutoffs <- seq(0.01, 0.99, by = 0.01)
MR_values <- numeric(length(cutoffs))
for (i in 1:length(cutoffs)) {
  pred_class <- ifelse(train_pred_prob > cutoffs[i], TRUE, FALSE)
  MR_values[i] <- mean(pred_class != GermanCredit_train$Class)
}

optimal_cutoff <- cutoffs[which.min(MR_values)]
optimal_cutoff

## [1] 0.47

min(MR_values)

## [1] 0.2146667

plot(cutoffs, MR_values, type = "l", col = "blue", lwd = 2,
     xlab = "Probability Cut-off", ylab = "Misclassification Rate (MR)",
     main = "MR vs. Probability Cut-off")
abline(v = optimal_cutoff, col = "red", lty = 2) 
text(optimal_cutoff, min(MR_values), labels = round(optimal_cutoff, 2), pos = 4, col = "red")

Your comment: The optimal is 0.47

4. Please generate the ROC curve and calculate the AUC for the training set. Please comment on this AUC. (2 pts)

# your code here:
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc_train <- roc(GermanCredit_train$Class, train_pred_prob)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(roc_train, col = "blue", main = "ROC Curve Training Set")

auc_train <- auc(roc_train)
auc_train

## Area under the curve: 0.8306

Your comment: The AUC for the training set is 0.8306, which indicates strong model performance. Since the AUC is much higher than 0.5, the logistic regression model has good discriminative ability, it can effectively distinguish between “Good” and “Bad” credit applicants.

Task 4: Model Evaluation (Part II)

1. Use the testing set to obtain predicted probabilities. (2 pts)

# your code here:
test_pred_prob <- predict(GermanCredit_logit, newdata= GermanCredit_test, type = "response")
head(test_pred_prob)

##         3         4         7         9        17        21 
## 0.9723991 0.6708878 0.9167871 0.9902534 0.9744949 0.8554000

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the the training set. (2 pts)

# your code here:
test_pred_class <- ifelse(test_pred_prob > 0.5, TRUE, FALSE)
conf_matrix_test <- table(Predicted = test_pred_class, Actual = GermanCredit_test$Class)
conf_matrix_test

##          Actual
## Predicted FALSE TRUE
##     FALSE    26   19
##     TRUE     40  165

MR_test <- mean(train_pred_class != GermanCredit_test$Class)
MR_test

## [1] 0.376

Your comment: The confusion matrix shows the model’s performance on the test set. The model correctly classified 26 bad credit cases and 165 good credit cases. It misclassified 19 good credit cases as bad and 40 bad credit cases as good. The misclassification rate is 0.376, meaning approximately 37.6% of predictions were incorrect. This MR is higher than the training MR, indicating the model’s accuracy decreases on unseen data, though it still captures many correct classifications.

2. Please generate the ROC curve and calculate the AUC for the test set. Please comment on this AUC. (2 pts)

# your code here:
roc_test <- roc(GermanCredit_test$Class, test_pred_prob)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(roc_test, col = "blue", main = "ROC Curve Test Set")

auc_test <- auc(roc_test)
auc_test

## Area under the curve: 0.8167