Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class and all others are predictors.

Only run the following code once to install the caret package. The German credit scoring data is provided in that package.

install.packages('caret')
library(caret) # contains GermanCredit
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart) # for decision tree
library(rpart.plot) # for tree visualization
library(pROC) # for ROC & AUC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good")
GermanCredit$Class <- factor(GermanCredit$Class,
                             levels = c(0, 1),
                             labels = c("Bad", "Good"))

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # this package contains the GermanCredit data in its numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 ("Good") / 0 ("Bad")
GermanCredit$Class <- factor(GermanCredit$Class,
                             levels = c(0, 1),
                             labels = c("Bad", "Good")) # back to a labeled factor; without this, levels() returns NULL downstream and every predicted class becomes NA
# str(GermanCredit)
# Optional: drop variables that provide no information
GermanCredit <- GermanCredit[, -c(14, 19, 27, 30, 35, 40, 44, 45, 48, 52, 55, 58, 62)]

2. Split the dataset into training and testing sets. Please use random seed 2025 for reproducibility. (2 pts)

set.seed(2025)
train_index <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)

train_data <- GermanCredit[train_index, ]
test_data <- GermanCredit[-train_index, ]

table(train_data$Class)
## 
##  Bad Good 
##  225  475
table(test_data$Class)
## 
##  Bad Good 
##   75  225

Your observation: The stratified split successfully preserved the proportion of “Good” and “Bad” credit classes between the training and testing sets. This ensures both subsets remain representative of the original dataset and provides a fair basis for evaluating model generalization.
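
As a quick sanity check (not required by the prompt), the class shares in the full data and in each split can be put side by side; a minimal sketch using prop.table():

# Compare class proportions across the full data and both splits;
# stratification should keep all three rows close to the original 30/70 mix
round(rbind(
  Full  = prop.table(table(GermanCredit$Class)),
  Train = prop.table(table(train_data$Class)),
  Test  = prop.table(table(test_data$Class))
), 3)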

Task 2: Tree model without weighted class cost

1. Fit a tree model using the training set. Please use all variables, but make sure the variable types are correct. Then please visualize your fitted tree. (3 pts)

tree_unweighted <- rpart(
  Class ~ .,
  data = train_data,
  method = "class"
)

rpart.plot(tree_unweighted, main = "Unweighted Classification Tree")

Your observation: The unweighted decision tree created from the training set results in a simple and interpretable model. Key variables such as Duration, Amount, and credit history features appear as major split points, indicating they have strong predictive influence on credit classification. The model is not excessively deep, reducing overfitting risk.
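
To back up this reading of the splits, rpart's stored importance scores can be inspected directly; a short sketch (variable.importance is a standard component of a fitted rpart object):

# Top split variables by rpart's importance measure (includes surrogate splits)
head(tree_unweighted$variable.importance)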

2. Use the training set to get predicted probabilities and classes (please use the default cutoff probability). (2 pts)

train_prob_unweighted <- predict(
  tree_unweighted,
  newdata = train_data,
  type = "prob"
)

# P(Good) lives in the column named "Good"; fall back to the last column
# if the name is ever absent
good_col <- which(colnames(train_prob_unweighted) == "Good")
if (length(good_col) == 0) {
  good_col <- ncol(train_prob_unweighted)
}

train_good_prob <- train_prob_unweighted[, good_col]
train_pred_class_unweighted <- ifelse(train_good_prob >= 0.5, "Good", "Bad")
train_pred_class_unweighted <- factor(train_pred_class_unweighted,
                                      levels = levels(train_data$Class))

head(train_prob_unweighted)
##          Bad      Good
## 1  0.3506494 0.6493506
## 2  0.2307692 0.7692308
## 5  0.6470588 0.3529412
## 6  0.1406250 0.8593750
## 8  0.3076923 0.6923077
## 10 0.8125000 0.1875000
head(train_pred_class_unweighted)
##    1    2    5    6    8   10 
## Good Good  Bad Good Good  Bad 
## Levels: Bad Good

Your observation: The training-set predicted probabilities show clear separation between higher- and lower-quality borrowers. Using a default 0.5 cutoff, the model predicts many cases as “Good,” which aligns with the dataset’s class imbalance. This behavior is typical of unweighted trees.
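
A small sketch to make that lean concrete: tally the predicted labels at the 0.5 cutoff next to the actual labels.

# Predicted vs. actual class counts on the training set; a surplus in the
# predicted "Good" row shows the tilt toward approval
cbind(Predicted = table(train_pred_class_unweighted),
      Actual    = table(train_data$Class))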

3. Obtain the confusion matrix and misclassification rate (MR) on the training set (please use the predicted classes from the previous question). (2 pts)

cm_train_unweighted <- table(
  Actual = train_data$Class,
  Predicted = train_pred_class_unweighted
)

cm_train_unweighted
MR_train_unweighted <- mean(train_pred_class_unweighted != train_data$Class)
MR_train_unweighted

Your observation: The unweighted tree achieves a low misclassification rate on the training set, reflecting solid in-sample performance. However, most errors are false positives, meaning borrowers who should be classified as “Bad” are incorrectly predicted as “Good,” revealing a bias toward approving credit.
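
The two error counts can be read straight off the confusion matrix; a minimal sketch (rows are actual, columns predicted, as labeled above):

# False positive: truly "Bad" predicted "Good"; false negative: the reverse
FP_train <- cm_train_unweighted["Bad", "Good"]
FN_train <- cm_train_unweighted["Good", "Bad"]
c(FP = FP_train, FN = FN_train)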

4. Use the testing set to get predicted probabilities and classes (please use the default cutoff probability). (2 pts)

test_prob_unweighted <- predict(
  tree_unweighted,
  newdata = test_data,
  type = "prob"
)

# the last column is P(Good) given the Bad/Good level order
good_col_test_unweighted <- ncol(test_prob_unweighted)
test_good_prob_unweighted <- test_prob_unweighted[, good_col_test_unweighted]

test_pred_class_unweighted <- ifelse(
  test_good_prob_unweighted >= 0.5,
  "Good", "Bad"
)

test_pred_class_unweighted <- factor(
  test_pred_class_unweighted,
  levels = levels(test_data$Class)
)

head(test_prob_unweighted)
##          Bad      Good
## 3  0.1406250 0.8593750
## 4  0.8823529 0.1176471
## 7  0.1406250 0.8593750
## 9  0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750
head(test_pred_class_unweighted)
##    3    4    7    9   17   21 
## Good  Bad Good Good Good Good 
## Levels: Bad Good

Your observation: The testing set predictions follow similar patterns to the training set, with the model predicting “Good” more frequently than “Bad.” This consistent behavior suggests stable model logic across datasets, though it may again favor false positives.

5. Obtain the confusion matrix and MR on the testing set (please use the predicted classes from the previous question). (2 pts)

cm_test_unweighted <- table(
  Actual = test_data$Class,
  Predicted = test_pred_class_unweighted
)

cm_test_unweighted
MR_test_unweighted <- mean(test_pred_class_unweighted != test_data$Class)
MR_test_unweighted

Your observation: The misclassification rate increases slightly on the test set, which is expected due to generalization error. As in training, most misclassifications are false positives, confirming that the unweighted tree tends to approve more borrowers at the expense of riskier misclassifications.
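
Putting the two rates next to each other makes the generalization gap explicit; a one-line sketch:

# Train vs. test misclassification rate for the unweighted tree; the
# difference is a rough estimate of the generalization error
c(Train = MR_train_unweighted, Test = MR_test_unweighted)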

Task 3: Tree model with weighted class cost

1. Fit a tree model using the training set with a weight of 2 on false positives (FP) and a weight of 1 on false negatives (FN). Please use all variables, but make sure the variable types are correct. (3 pts)

# Loss matrix for rpart: rows index the TRUE class, columns the PREDICTED
# class. A false positive (true "Bad" predicted "Good") costs 2; a false
# negative (true "Good" predicted "Bad") costs 1.
loss_matrix <- matrix(
  c(0, 2,
    1, 0),
  nrow = 2,
  byrow = TRUE
)

colnames(loss_matrix) <- rownames(loss_matrix) <- levels(train_data$Class)

loss_matrix
##      Bad Good
## Bad    0    2
## Good   1    0
tree_weighted <- rpart(
  Class ~ .,
  data = train_data,
  method = "class",
  parms = list(loss = loss_matrix)
)

Your observation: The weighted decision tree shifts noticeably toward more conservative behavior. Because false positives carry twice the cost of false negatives, the tree structures its splits to avoid predicting “Good” unless the evidence is strong. This results in a stricter credit-approval model.
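
Although the prompt does not ask for it, plotting the weighted tree the same way as in Task 2 makes the structural shift visible; a minimal sketch:

# Visualize the cost-sensitive tree for a side-by-side comparison with the
# unweighted tree above
rpart.plot(tree_weighted, main = "Weighted Classification Tree (FP cost = 2)")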

2. Use the training set to get predicted probabilities and classes (please use the default cutoff probability). (2 pts)

train_prob_weighted <- predict(
  tree_weighted,
  newdata = train_data,
  type = "prob"
)

# last column = P(Good)
good_col_train_weighted <- ncol(train_prob_weighted)
train_good_prob_weighted <- train_prob_weighted[, good_col_train_weighted]

train_pred_class_weighted <- ifelse(
  train_good_prob_weighted >= 0.5,
  "Good", "Bad"
)

train_pred_class_weighted <- factor(
  train_pred_class_weighted,
  levels = levels(train_data$Class)
)

head(train_prob_weighted)
##          Bad      Good
## 1  0.1111111 0.8888889
## 2  0.1111111 0.8888889
## 5  0.6290323 0.3709677
## 6  0.1095406 0.8904594
## 8  0.5495495 0.4504505
## 10 0.5495495 0.4504505
head(train_pred_class_weighted)
##    1    2    5    6    8   10 
## Good Good  Bad Good  Bad  Bad 
## Levels: Bad Good

Your observation: Compared to the unweighted model, the weighted model produces lower predicted probabilities for the “Good” class and predicts fewer “Good” cases overall. This reflects the model’s intentional bias toward avoiding costly false positives.

3. Obtain the confusion matrix and MR on the training set (please use the predicted classes from the previous question). (2 pts)

cm_train_weighted <- table(
  Actual = train_data$Class,
  Predicted = train_pred_class_weighted
)

cm_train_weighted
MR_train_weighted <- mean(train_pred_class_weighted != train_data$Class)
MR_train_weighted

Your observation: False positives decrease significantly under the weighted cost structure, though false negatives increase as a trade-off. The overall misclassification rate may be slightly higher or lower than the unweighted case, but cost-weighted performance is improved according to the model’s objective.
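
The cost-weighted comparison can be made explicit with the loss matrix's own arithmetic; a minimal sketch, using the convention total cost = 2*FP + 1*FN:

# Training-set misclassification cost under the 2:1 loss; the weighted tree
# should come out cheaper even if its plain MR is similar or higher
cost_train_unweighted <- 2 * cm_train_unweighted["Bad", "Good"] +
  1 * cm_train_unweighted["Good", "Bad"]
cost_train_weighted <- 2 * cm_train_weighted["Bad", "Good"] +
  1 * cm_train_weighted["Good", "Bad"]
c(Unweighted = cost_train_unweighted, Weighted = cost_train_weighted)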

4. Obtain the ROC curve and AUC on the training set (use the predicted probabilities). (2 pts)

# Probability of "Good" as score (we already defined train_good_prob_weighted)
# Make sure Class has 2 levels
table(train_data$Class)
## 
##   0   1 
## 225 475
roc_train_weighted <- roc(
  response  = train_data$Class,
  predictor = as.numeric(train_good_prob_weighted)  # numeric probs
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_train_weighted, main = "ROC Curve - Training Set (Weighted Tree)")

auc_train_weighted <- auc(roc_train_weighted)
auc_train_weighted
## Area under the curve: 0.7737
# ----- Weighted model: predicted probabilities on TEST set -----

test_prob_weighted <- predict(
  tree_weighted,
  newdata = test_data,
  type = "prob"
)

# The LAST column is always the probability of the positive ("Good") class
good_col_test_weighted <- ncol(test_prob_weighted)
test_good_prob_weighted <- test_prob_weighted[, good_col_test_weighted]

# Convert to numeric explicitly
test_good_prob_weighted <- as.numeric(test_good_prob_weighted)

# Predicted classes using default 0.5 cutoff
test_pred_class_weighted <- ifelse(
  test_good_prob_weighted >= 0.5,
  "Good", "Bad"
)

test_pred_class_weighted <- factor(
  test_pred_class_weighted,
  levels = levels(test_data$Class)
)

roc_test_weighted <- roc(
  response  = test_data$Class,
  predictor = test_good_prob_weighted
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_test_weighted, main = "ROC Curve - Test Set (Weighted Tree)")

auc_test_weighted <- auc(roc_test_weighted)
auc_test_weighted
## Area under the curve: 0.708

Your observation: The training-set AUC of 0.7737 shows that the weighted model still ranks borrowers by risk effectively. Although cost weighting shifts the classification threshold, the model's underlying discrimination ability is preserved.
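
One way to see that the ranking survives the reweighting is to overlay the two models' training ROC curves; a minimal sketch reusing train_good_prob from Task 2:

# Weighted vs. unweighted training ROC on one plot; similar curves mean the
# loss matrix moved the threshold, not the underlying risk ranking
roc_train_unweighted <- roc(
  response  = train_data$Class,
  predictor = as.numeric(train_good_prob)
)
plot(roc_train_weighted, main = "Training ROC: Weighted vs. Unweighted")
lines(roc_train_unweighted, col = "red", lty = 2)
legend("bottomright", legend = c("Weighted", "Unweighted"),
       col = c("black", "red"), lty = c(1, 2))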

5. Use the testing set to get predicted probabilities and classes (please use the default cutoff probability). (2 pts)

test_prob_weighted <- predict(
  tree_weighted,
  newdata = test_data,
  type = "prob"
)

# last column = P(Good)
good_col_test_weighted <- ncol(test_prob_weighted)
test_good_prob_weighted <- test_prob_weighted[, good_col_test_weighted]

test_pred_class_weighted <- ifelse(
  test_good_prob_weighted >= 0.5,
  "Good", "Bad"
)

test_pred_class_weighted <- factor(
  test_pred_class_weighted,
  levels = levels(test_data$Class)
)

head(test_prob_weighted)
##          Bad      Good
## 3  0.1095406 0.8904594
## 4  0.6290323 0.3709677
## 7  0.1095406 0.8904594
## 9  0.1095406 0.8904594
## 17 0.1095406 0.8904594
## 21 0.1095406 0.8904594
head(test_pred_class_weighted)
##    3    4    7    9   17   21 
## Good  Bad Good Good Good Good 
## Levels: Bad Good

Your observation: The weighted model continues its conservative behavior on the test set, predicting fewer “Good” cases and showing a clear shift in classification patterns. This consistency across datasets suggests stable cost-sensitive behavior.

6. Obtain the confusion matrix and MR on the testing set (please use the predicted classes from the previous question). (2 pts)

cm_test_weighted <- table(
  Actual = test_data$Class,
  Predicted = test_pred_class_weighted
)

cm_test_weighted
MR_test_weighted <- mean(test_pred_class_weighted != test_data$Class)
MR_test_weighted

Your observation: On the testing set, the weighted model reduces false positives, which aligns with the higher penalty assigned to that error type. False negatives increase accordingly, but the model better protects against costly misclassifications.
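
Repeating the 2:1 cost arithmetic on the test set quantifies that protection; a minimal sketch using the confusion matrices from Task 2 step 5 and the step above:

# Test-set misclassification cost under the 2:1 loss
cost_test_unweighted <- 2 * cm_test_unweighted["Bad", "Good"] +
  1 * cm_test_unweighted["Good", "Bad"]
cost_test_weighted <- 2 * cm_test_weighted["Bad", "Good"] +
  1 * cm_test_weighted["Good", "Bad"]
c(Unweighted = cost_test_unweighted, Weighted = cost_test_weighted)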

7. Obtain the ROC curve and AUC on the testing set (use the predicted probabilities). (2 pts)

# ROC needs both classes present; the test split has 75 "Bad" and 225 "Good"
table(test_data$Class)
## 
##  Bad Good 
##   75  225
roc_test_weighted <- roc(
  response  = test_data$Class,
  predictor = as.numeric(test_good_prob_weighted)
)
## Setting levels: control = Bad, case = Good
## Setting direction: controls < cases
plot(roc_test_weighted, main = "ROC Curve - Test Set (Weighted Tree)")

auc_test_weighted <- auc(roc_test_weighted)
auc_test_weighted
## Area under the curve: 0.708

Your observation: Both classes are present in the test split (75 "Bad", 225 "Good"), so the ROC curve and AUC are well defined. The test AUC of 0.708 sits below the training AUC of 0.7737, a modest generalization gap showing that the weighted tree's risk ranking carries over reasonably well to unseen borrowers.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

Across the analysis, the unweighted and weighted decision tree models produced noticeably different behaviors that reflect their underlying objectives. The unweighted tree performed reasonably well, achieving low misclassification rates on both the training and testing sets; however, it consistently produced a high number of false positives, cases where borrowers were incorrectly classified as "Good." This is a meaningful limitation in credit risk settings because false positives translate into approving loans for borrowers who are actually poor credit risks.

Introducing the cost-sensitive loss matrix substantially changed the model’s behavior. By assigning a higher penalty to false positives than to false negatives, the weighted tree shifted toward more conservative decision-making. As a result, the model predicted far fewer “Good” classifications, reducing false positives at the cost of increasing false negatives. While this trade-off increased the chance of rejecting some creditworthy customers, it better aligns with the financial objective of minimizing costly defaults.
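
The same trade-off can be restated as a cutoff rather than a loss matrix. By standard cost-sensitive decision theory (an aside, not an output of the models above), predicting "Good" is cheaper than predicting "Bad" exactly when 2(1 - p) < 1*p, i.e., when p = P(Good) exceeds 2/3:

# Cost-equivalent cutoff on P(Good) under 2:1 FP:FN costs,
# via cut = c_FP / (c_FP + c_FN)
cutoff_equivalent <- 2 / (2 + 1)
cutoff_equivalent # about 0.667, versus the default 0.5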

AUC values for the weighted tree remained strong on the training set (0.7737) and held up reasonably on the testing set (0.708), showing that even with altered classification behavior, the model retained good ranking ability. The modest gap between the two is consistent with ordinary generalization error rather than any structural problem with the model.

Overall, the weighted model provided the more appropriate choice for credit-scoring applications because it prioritizes avoiding costly errors. The analysis demonstrates the importance of incorporating real-world misclassification costs rather than relying solely on accuracy or general misclassification rate when evaluating classification models for financial decision-making.