Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data is provided in that package.

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # this package contains the German credit data in its numeric (dummy-coded) format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # convert `Class` into TRUE or FALSE (equivalent to 1 or 0)
# str(GermanCredit)

Your observation:

The GermanCredit dataset contains numeric predictors, many of them dummy-coded factor levels, that describe characteristics of loan applicants (credit history, duration, amount, age, and so on), and the target variable Class indicates whether the credit is good or bad.

# This optional code drops variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] # don't run this line twice!! Think about why.
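As a side note, such uninformative columns can be identified programmatically rather than by hard-coded indices. A minimal sketch using caret's nearZeroVar() (run it on the freshly loaded data, before any columns are dropped):

# Sketch: flag constant (zero-variance) columns with caret's nearZeroVar()
nzv <- nearZeroVar(GermanCredit, saveMetrics = TRUE)
rownames(nzv)[nzv$zeroVar]  # names of columns that carry no information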

2. Explore the dataset to understand its structure.

(1) How many observations and variables are there? (2 pts)
# your code here:

dim(GermanCredit)
## [1] 1000   49

Your observation:

There are 1000 observations and 49 variables after dropping the uninformative ones.

(2) Please make a frequency table of the variable Class (use the table() function). How many observations are classified as “good” and how many as “bad”? (2 pts)
# your code here:

table(GermanCredit$Class)
## 
## FALSE  TRUE 
##   300   700

Your observation:

There are 700 “Good” credit cases and 300 “Bad” credit cases, showing the dataset is imbalanced.
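To quantify the imbalance directly, the class proportions can be read off with prop.table():

prop.table(table(GermanCredit$Class))  # shares of bad (FALSE) and good (TRUE) credit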

(3) Please make a barplot of the response variable Class. Please add a title and axis labels. (2 pts)
# your code here:

barplot(table(GermanCredit$Class),
        col  = c("steelblue", "tomato"),
        main = "Distribution of Credit Class",
        xlab = "Credit Class (FALSE = Bad, TRUE = Good)",
        ylab = "Count")

3. Split the dataset into a training and a test set. A random seed of 2025 is set for reproducibility. Please comment on the split proportion you chose for the training and testing data. (2 pts)

set.seed(2025)
train_index <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
German_train <- GermanCredit[train_index, ]
German_test  <- GermanCredit[-train_index, ]

Your comment:

I used a 70/30 split to ensure enough data for training and a fair evaluation on unseen test data.
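Note that createDataPartition() samples within each class (stratified sampling), so both splits should preserve the 70/30 Good/Bad ratio; a quick check:

# Verify that the class balance is preserved in both splits
prop.table(table(German_train$Class))
prop.table(table(German_test$Class))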

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right. (2 pts)

# your code here:

model_logit <- glm(Class ~ ., data = German_train, family = binomial)
summary(model_logit)
## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = German_train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         7.941e+00  1.617e+00   4.910 9.12e-07 ***
## Duration                           -2.901e-02  1.151e-02  -2.521 0.011705 *  
## Amount                             -1.075e-04  5.272e-05  -2.039 0.041401 *  
## InstallmentRatePercentage          -3.898e-01  1.092e-01  -3.569 0.000358 ***
## ResidenceDuration                  -6.141e-02  1.015e-01  -0.605 0.545004    
## Age                                 2.488e-02  1.142e-02   2.178 0.029393 *  
## NumberExistingCredits              -2.077e-01  2.276e-01  -0.913 0.361504    
## NumberPeopleMaintenance            -2.455e-01  3.068e-01  -0.800 0.423583    
## Telephone                          -4.703e-02  2.442e-01  -0.193 0.847267    
## ForeignWorker                      -1.116e+00  6.633e-01  -1.682 0.092485 .  
## CheckingAccountStatus.lt.0         -1.539e+00  2.694e-01  -5.714 1.10e-08 ***
## CheckingAccountStatus.0.to.200     -1.186e+00  2.786e-01  -4.258 2.06e-05 ***
## CheckingAccountStatus.gt.200       -2.359e-01  4.652e-01  -0.507 0.612139    
## CreditHistory.NoCredit.AllPaid     -1.508e+00  5.144e-01  -2.931 0.003383 ** 
## CreditHistory.ThisBank.AllPaid     -1.970e+00  5.276e-01  -3.733 0.000189 ***
## CreditHistory.PaidDuly             -9.283e-01  3.350e-01  -2.771 0.005588 ** 
## CreditHistory.Delay                -7.730e-01  4.062e-01  -1.903 0.057032 .  
## Purpose.NewCar                     -1.616e+00  8.906e-01  -1.815 0.069570 .  
## Purpose.UsedCar                     1.418e-02  9.328e-01   0.015 0.987872    
## Purpose.Furniture.Equipment        -8.664e-01  8.985e-01  -0.964 0.334875    
## Purpose.Radio.Television           -6.736e-01  8.958e-01  -0.752 0.452093    
## Purpose.DomesticAppliance          -1.316e+00  1.198e+00  -1.099 0.271980    
## Purpose.Repairs                    -1.125e+00  1.077e+00  -1.045 0.295888    
## Purpose.Education                  -1.842e+00  9.842e-01  -1.872 0.061248 .  
## Purpose.Retraining                  5.452e-01  1.534e+00   0.355 0.722233    
## Purpose.Business                   -1.013e+00  9.128e-01  -1.110 0.266924    
## SavingsAccountBonds.lt.100         -1.011e+00  3.130e-01  -3.230 0.001238 ** 
## SavingsAccountBonds.100.to.500     -6.076e-01  4.251e-01  -1.429 0.152873    
## SavingsAccountBonds.500.to.1000    -3.826e-01  5.583e-01  -0.685 0.493160    
## SavingsAccountBonds.gt.1000         2.995e-01  7.103e-01   0.422 0.673243    
## EmploymentDuration.lt.1             3.306e-01  4.957e-01   0.667 0.504788    
## EmploymentDuration.1.to.4           5.461e-01  4.722e-01   1.156 0.247501    
## EmploymentDuration.4.to.7           6.790e-01  5.128e-01   1.324 0.185449    
## EmploymentDuration.gt.7             4.548e-01  4.684e-01   0.971 0.331585    
## Personal.Male.Divorced.Seperated   -7.670e-01  5.438e-01  -1.410 0.158402    
## Personal.Female.NotSingle          -2.582e-01  3.681e-01  -0.701 0.483014    
## Personal.Male.Single                4.698e-01  3.750e-01   1.253 0.210300    
## OtherDebtorsGuarantors.None        -1.044e+00  5.749e-01  -1.817 0.069292 .  
## OtherDebtorsGuarantors.CoApplicant -1.167e+00  7.453e-01  -1.566 0.117349    
## Property.RealEstate                 7.808e-01  5.317e-01   1.468 0.141990    
## Property.Insurance                  3.269e-01  5.206e-01   0.628 0.530048    
## Property.CarOther                   7.458e-01  5.089e-01   1.466 0.142739    
## OtherInstallmentPlans.Bank         -5.795e-01  2.937e-01  -1.973 0.048493 *  
## OtherInstallmentPlans.Stores       -5.174e-01  4.619e-01  -1.120 0.262692    
## Housing.Rent                       -8.942e-01  5.908e-01  -1.514 0.130116    
## Housing.Own                        -4.131e-01  5.593e-01  -0.739 0.460202    
## Job.UnemployedUnskilled             1.175e+00  9.763e-01   1.204 0.228626    
## Job.UnskilledResident              -2.144e-01  4.270e-01  -0.502 0.615535    
## Job.SkilledEmployee                -1.170e-01  3.437e-01  -0.340 0.733609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 855.21  on 699  degrees of freedom
## Residual deviance: 621.15  on 651  degrees of freedom
## AIC: 719.15
## 
## Number of Fisher Scoring iterations: 5

2. Summarize the model and interpret the coefficients. What is the estimated coefficient for the variable InstallmentRatePercentage? Is it significant, and why? (2 pts)

# your code here:

coef(summary(model_logit))["InstallmentRatePercentage", ]
##      Estimate    Std. Error       z value      Pr(>|z|) 
## -0.3897870155  0.1091997206 -3.5694872977  0.0003576806

Your comment:

The estimated coefficient is about -0.390 with a p-value of roughly 0.00036, well below 0.05, so InstallmentRatePercentage is statistically significant. The negative sign means that, holding the other predictors fixed, a higher installment rate is associated with lower odds of good credit.

3. Please interpret this number in detail (please calculate the corresponding odds ratio, and interpret it). (2 pts)

# you might need some code for calculation:

exp(coef(model_logit)["InstallmentRatePercentage"])
## InstallmentRatePercentage 
##                 0.6772011

Your comment:

The odds ratio is exp(-0.390) ≈ 0.677. Holding all other predictors constant, each one-unit increase in InstallmentRatePercentage multiplies the odds of good credit by about 0.68, i.e., it reduces those odds by roughly 32%. Since the OR is below 1, higher installment rates are associated with lower creditworthiness.
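As a supplementary check (not required by the question), a Wald 95% confidence interval for this odds ratio can be obtained by exponentiating the coefficient's confidence limits:

# Wald 95% CI for the odds ratio of InstallmentRatePercentage
exp(confint.default(model_logit)["InstallmentRatePercentage", ])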

Task 3: Model Evaluation (Part I)

1. Use the training set to obtain predicted probabilities. (2 pts)

# your code here:

train_pred_prob <- predict(model_logit, newdata = German_train, type = "response")
head(train_pred_prob)
##          1          3          6         11         12         13 
## 0.97453792 0.95952716 0.71928110 0.32074645 0.07608803 0.83362826

2. Using the probability cut-off of 0.5, generate the confusion matrix and obtain the MR (misclassification rate) for the training set. (3 pts)

# your code here:

train_pred_class <- train_pred_prob >= 0.5  # TRUE when predicted probability of good credit is at least 0.5
conf_matrix_train <- table(Predicted = train_pred_class, Actual = German_train$Class)
conf_matrix_train
##          Actual
## Predicted FALSE TRUE
##     FALSE   122   52
##     TRUE     88  438
MR_train <- mean(train_pred_class != German_train$Class)
MR_train
## [1] 0.2

Your comment:

The misclassification rate is 0.20: at the 0.5 cutoff, 20% of training observations are misclassified (88 bad credits predicted as good and 52 good credits predicted as bad, out of 700).

3. Find the optimal probability cut-off point using the MR. Please draw a plot of MR vs. cut-off probability, and comment on optimal cut-off probability. (3 pts)

# your code here:

cutoffs <- seq(0.1, 0.9, by = 0.05)
mr_values <- sapply(cutoffs, function(thr) {
  pred_class <- train_pred_prob >= thr
  mean(pred_class != German_train$Class)
})

plot(cutoffs, mr_values, type = "b", col = "blue", pch = 16,
     xlab = "Cutoff Probability", ylab = "Misclassification Rate",
     main = "MR vs. Cutoff Probability")
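To read the minimizer off the grid rather than eyeballing the plot, which.min() does it (ties go to the first minimum):

# Cutoff achieving the lowest training MR on the grid
cutoffs[which.min(mr_values)]
min(mr_values)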

Your comment:

The optimal cutoff is the one minimizing the MR on the grid (see the which.min() snippet above). It need not be 0.5: because 70% of applicants have good credit, cutoffs away from 0.5 can achieve a lower MR, so the choice depends on the class balance.

4. Please generate the ROC curve and calculate the AUC for the training set. Please comment on this AUC. (2 pts)

# your code here:

library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_train <- roc(German_train$Class, train_pred_prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_train, col = "blue", main = "ROC Curve - Training Set")

auc(roc_train)
## Area under the curve: 0.8346

Your comment:

The training AUC is about 0.835. As a rule of thumb, an AUC above 0.7 indicates acceptable discrimination and above 0.8 excellent discrimination, so the model separates good and bad credits well on the training data.
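Optionally, pROC can attach a confidence interval to the AUC, which is useful before comparing it with the test-set value later:

# DeLong 95% confidence interval for the training AUC
ci.auc(roc_train)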

Task 4: Model Evaluation (Part II)

1. Use the testing set to obtain predicted probabilities. (2 pts)

# your code here:

test_pred_prob <- predict(model_logit, newdata = German_test, type = "response")
head(test_pred_prob)
##         2         4         5         7         8         9 
## 0.4091914 0.7543049 0.3706810 0.9386210 0.7791341 0.9793484

2. Using the probability cut-off of 0.5, generate the confusion matrix and obtain the MR (misclassification rate) for the test set. (2 pts)

# your code here:

test_pred_class <- test_pred_prob >= 0.5  # TRUE when predicted probability of good credit is at least 0.5
conf_matrix_test <- table(Predicted = test_pred_class, Actual = German_test$Class)
conf_matrix_test
##          Actual
## Predicted FALSE TRUE
##     FALSE    40   29
##     TRUE     50  181
MR_test <- mean(test_pred_class != German_test$Class)
MR_test
## [1] 0.2633333

Your comment:

The test MR is about 0.263, compared with 0.20 on the training set. A somewhat higher test MR is expected, since the test data were not used for fitting; because the gap is modest, the model appears reasonably stable rather than badly overfitted.
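Beyond the raw MR, caret's confusionMatrix() reports sensitivity, specificity, and related metrics from the same table. A sketch, assuming the logical vectors are converted to factors with matching levels:

# Sensitivity/specificity etc. on the test set (confusionMatrix expects factors)
confusionMatrix(factor(test_pred_class, levels = c(FALSE, TRUE)),
                factor(German_test$Class, levels = c(FALSE, TRUE)),
                positive = "TRUE")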

3. Please generate the ROC curve and calculate the AUC for the test set. Please comment on this AUC. (2 pts)

# your code here:

roc_test <- roc(German_test$Class, test_pred_prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_test, col = "red", main = "ROC Curve - Test Set")

auc(roc_test)
## Area under the curve: 0.7964

Your comment:

The test AUC is about 0.796, close to the training AUC of 0.835, so the model's discrimination ability largely carries over to unseen data. Values near 0.8 indicate good predictive power.
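To visualize how close the two curves are, pROC lets you overlay them (add = TRUE draws on the existing plot):

# Overlay training and test ROC curves for a direct comparison
plot(roc_train, col = "blue", main = "ROC Curves: Training vs. Test")
plot(roc_test, col = "red", add = TRUE)
legend("bottomright", legend = c("Training (AUC = 0.835)", "Test (AUC = 0.796)"),
       col = c("blue", "red"), lwd = 2)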