Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class; all others are predictors.

Run the following code only once to install the package caret. The German credit scoring data is provided in that package.

install.packages('caret')

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # this package contains the German credit data in its numeric (dummy-coded) format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
# str(GermanCredit)
# Optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
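
The hard-coded indices above match the column order of GermanCredit as shipped in caret. If you prefer not to rely on positions, constant (zero-variance) columns can also be found programmatically; a minimal sketch, to be run on the freshly loaded data before the subsetting above (note it may not flag exactly the same set, since the index list also appears to drop redundant dummy levels):

# Identify columns with a single unique value (no information);
# run this on data(GermanCredit) before the subsetting step above
constant_cols <- sapply(GermanCredit, function(x) length(unique(x)) == 1)
names(GermanCredit)[constant_cols]
# GermanCredit <- GermanCredit[, !constant_cols]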

2. Split the dataset into training and test sets. Please use random seed 2025 for reproducibility. (2 pts)

set.seed(2025)

train_index <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)

GermanCredit_train <- GermanCredit[train_index, ]
GermanCredit_test <- GermanCredit[-train_index, ]

nrow(GermanCredit_train)
## [1] 800
nrow(GermanCredit_test)
## [1] 200

Your observation: Using this code I created an 80/20 split: the training set contains 800 observations (80%) and the test set the remaining 200 (20%), out of 1000 rows in total. Because createDataPartition stratifies on the outcome, the class proportions in both sets closely match the original dataset; see the check below.
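
To verify the stratification claim directly, a quick base-R check using only the objects defined above (Class is still numeric 0/1 at this point):

# Proportion of Good (Class = 1) in the full data and in each split;
# createDataPartition stratifies on the outcome, so these should be close
c(full  = mean(GermanCredit$Class),
  train = mean(GermanCredit_train$Class),
  test  = mean(GermanCredit_test$Class))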

Task 2: Tree model without weighted class cost

1. Fit a tree model using the training set. Please use all variables, but make sure the variable types are right. Then please make a visualization of your fitted tree. (3 pts)

library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
# Make the response a factor in both sets so rpart fits a classification tree
GermanCredit_train$Class <- as.factor(GermanCredit_train$Class)
GermanCredit_test$Class <- as.factor(GermanCredit_test$Class)

credit_tree <- rpart(formula = Class ~ .,
                     data = GermanCredit_train,
                     method = "class")

rpart.plot(credit_tree, type = 4, extra = 104, 
    fallen.leaves = TRUE, main = "Tree for GermanCredit")

Your observation: The fitted tree shows that checking account status is the strongest predictor of credit classification. Additional important splits involve Duration, Credit Amount, Age, and Employment Duration, indicating that the model uses multiple financial indicators to separate good from bad credit risk.
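
One way to back this up numerically: rpart stores an importance score for every variable used in the tree (directly or as a surrogate). A short sketch, assuming the tree has at least one split:

# Higher scores = more influential in the tree's splits
round(credit_tree$variable.importance, 1)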

2. Use the training set to get predicted probabilities and classes (please use the default cutoff probability). (2 pts)

# Task 2 concerns the tree model, so predict from credit_tree (not a GLM)
train_pred_prob <- predict(credit_tree, newdata = GermanCredit_train,
                           type = "prob")[, 2] # keep the P(Class = 1) column

train_pred_class <- 1 * (train_pred_prob > 0.5)

Your observation: Using the fitted tree model and the training set, I obtained a predicted probability of good credit for each customer. Applying the default cutoff of 0.5, these probabilities were converted into predicted class labels, which will be compared with the actual Class values in the next step to evaluate the model's performance.
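
As a cross-check, rpart can also return class labels directly; for a two-class tree fitted without a loss matrix, this coincides with the manual 0.5 cutoff. A sketch reusing the objects above:

# Let rpart pick the majority class in each leaf; for two classes this
# is equivalent to thresholding P(Class = 1) at 0.5
train_pred_class_alt <- predict(credit_tree,
                                newdata = GermanCredit_train,
                                type = "class")
table(manual = train_pred_class, rpart = train_pred_class_alt)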

3. Obtain confusion matrix and MR on training set (please use the predicted class from the previous question). (2 pts)

conf_train <- table(Actual = GermanCredit_train$Class,
                    Predicted = train_pred_class)
conf_train
##       Predicted
## Actual   0   1
##      0 134 112
##      1  62 492
MR_train <- (conf_train[1,2] + conf_train[2,1]) / sum(conf_train)
MR_train
## [1] 0.2175

Your observation: The confusion matrix shows how well the model predicts the classes on the training set. The misclassification rate represents the share of incorrect predictions under the 0.5 cutoff.
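
Since caret is already loaded, its confusionMatrix() reports sensitivity and specificity alongside accuracy; a sketch treating 1 (Good) as the positive class:

# caret::confusionMatrix() wants both arguments as factors with matching levels
confusionMatrix(factor(train_pred_class, levels = c(0, 1)),
                GermanCredit_train$Class,
                positive = "1")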

4. Use the testing set to get predicted probabilities and classes (please use the default cutoff probability). (2 pts)

prob_test_output <- predict(credit_tree,
                            newdata = GermanCredit_test,
                            type = "prob")[, 2] # keep the P(Class = 1) column

class_test_output <- 1 * (prob_test_output > 0.5)

Your observation: The testing-set predictions give each observation an estimated probability of belonging to Class 1, and these probabilities are converted into class labels using the default 0.5 cutoff. This allows us to compare the model's predictions to the actual test outcomes and evaluate its performance on unseen data.

5. Obtain confusion matrix and MR on testing set (please use the predicted class from the previous question). (2 pts)

conf_output_test <- table(Actual = GermanCredit_test$Class,
                          Predicted = class_test_output)

conf_output_test
##       Predicted
## Actual   0   1
##      0  20  34
##      1  18 128

MR_test <- (conf_output_test[1,2] + conf_output_test[2,1]) / sum(conf_output_test)
MR_test
## [1] 0.26

Your observation: The confusion matrix shows how accurately the model classifies the testing set; the misclassification rate of (34 + 18) / 200 = 0.26 means about a quarter of the test predictions were incorrect, reflecting the model's performance on unseen data.

Task 3: Tree model with weighted class cost

1. Fit a tree model using the training set with a weight of 2 on FP and a weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

# Loss matrix for rpart: rows = true class (0, 1), columns = predicted class (0, 1).
# Misclassifying a true 0 as 1 (false positive) costs 2;
# misclassifying a true 1 as 0 (false negative) costs 1.
loss_mat <- matrix(c(0, 2,
                     1, 0),
                   nrow = 2, byrow = TRUE)

credit_tree_cost <- rpart(Class ~ ., data = GermanCredit_train,
                          method = "class", parms = list(loss = loss_mat))

plot(credit_tree_cost, margin = 0.1)
text(credit_tree_cost, use.n = TRUE, all = TRUE, cex = 0.7)

Your observation: The weighted-cost tree applies a higher penalty to false positives, so it prioritizes reducing FP errors compared to the standard tree.
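
To see what the 2:1 loss implies, note that calling a case Good (1) when P(Good) = p has expected loss 2(1 - p), while calling it Bad (0) has expected loss p, so minimizing expected loss means predicting Good only when p > 2/3. A tiny sketch of that arithmetic (rpart implements the loss via altered priors, so this is the intuition rather than its exact mechanics):

# Predict Good only when 2 * (1 - p) < 1 * p, i.e. p > 2 / (2 + 1)
implied_cutoff <- 2 / (2 + 1)
implied_cutoff
## [1] 0.6666667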

2. Use the training set to get predicted probabilities and classes (please use the default cutoff probability). (2 pts)

prob_train_cost <- predict(credit_tree_cost,
                           newdata = GermanCredit_train,
                           type = "prob")[, 2]

class_train_cost <- 1 * (prob_train_cost > 0.5)

Your observation: The model produces probabilities for each training observation and assigns class labels using the 0.5 cutoff.
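
One caveat worth checking: with a loss matrix, rpart's type = "class" rule minimizes expected loss, which need not agree with a manual 0.5 cutoff on the probabilities. A sketch to compare the two rules:

# Off-diagonal counts show observations where the two rules disagree
class_train_rule <- predict(credit_tree_cost,
                            newdata = GermanCredit_train,
                            type = "class")
table(manual_cutoff = class_train_cost, loss_rule = class_train_rule)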

3. Obtain confusion matrix and MR on training set (please use the predicted class from the previous question). (2 pts)

conf_train_cost <- table(
  Actual = GermanCredit_train$Class,
  Predicted = class_train_cost
)
conf_train_cost

MR_train_cost <- (conf_train_cost[1,2] + conf_train_cost[2,1]) /
                 sum(conf_train_cost)
MR_train_cost

Your observation: The confusion matrix shows the model’s training accuracy, and the MR value gives the proportion of incorrect predictions under the weighted-cost tree.

4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)

library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
ROC_train_cost <- roc(GermanCredit_train$Class, prob_train_cost)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
AUC_train_cost <- auc(ROC_train_cost)

plot(ROC_train_cost, main = "ROC Curve (Training Set, Cost-Sensitive Tree)")

AUC_train_cost
## Area under the curve: 0.7394

Your observation: The ROC curve shows the model's ranking performance on the training set, and the AUC of 0.7394 summarizes it in a single number, where 0.5 corresponds to random guessing and 1 to perfect discrimination.

5. Use the testing set to get predicted probabilities and classes (please use the default cutoff probability). (2 pts)

prob_test_cost <- predict(credit_tree_cost,
                          newdata = GermanCredit_test,
                          type = "prob")[, 2]

class_test_cost <- 1 * (prob_test_cost > 0.5)

Your observation: The model outputs test-set probabilities and converts them to class labels using the 0.5 cutoff.

6. Obtain confusion matrix and MR on testing set (please use the predicted class from the previous question). (2 pts)

conf_test_cost <- table(
  Actual = GermanCredit_test$Class,
  Predicted = class_test_cost
)
conf_test_cost

MR_test_cost <- (conf_test_cost[1,2] + conf_test_cost[2,1]) /
                sum(conf_test_cost)
MR_test_cost

Your observation: The confusion matrix reports the model’s classification performance on the testing set, and the MR value gives the proportion of incorrect predictions under the weighted-cost tree.

7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)

ROC_test_cost <- roc(GermanCredit_test$Class, prob_test_cost)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
AUC_test_cost <- auc(ROC_test_cost)

plot(ROC_test_cost, main = "ROC Curve (Testing Set, Cost-Sensitive Tree)")

AUC_test_cost
## Area under the curve: 0.7041

Your observation: The ROC curve shows the model's ranking ability on the testing set. The test AUC of 0.7041 is only modestly below the training AUC of 0.7394, suggesting the cost-sensitive tree generalizes reasonably well; see the overlay below.
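
To visualize the train-to-test gap directly, the two ROC curves can be overlaid; a sketch using pROC's base-graphics methods and the AUC values reported above:

# Overlay training and testing ROC curves for the cost-sensitive tree
plot(ROC_train_cost, col = "black",
     main = "Cost-Sensitive Tree: Training vs Testing ROC")
lines(ROC_test_cost, col = "red")
legend("bottomright",
       legend = c("Training (AUC = 0.7394)", "Testing (AUC = 0.7041)"),
       col = c("black", "red"), lty = 1)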

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

The training/test split created an 80/20 partition while preserving class proportions (stratification). The standard tree identified key predictors such as checking account status, duration, and credit amount; its predicted probabilities were converted to class labels using the default 0.5 cutoff on both training and testing sets, and the resulting confusion matrices and misclassification rates (MR) summarized its classification accuracy, including a test-set MR of 0.26.

The cost-sensitive tree incorporated a loss matrix that penalized false positives twice as heavily as false negatives, leading to different splits and a model that more aggressively avoids FP errors. Predictions, confusion matrices, and MR on both training and testing data reflected how the weighted tree shifted the trade-off between FP and FN. ROC curves and AUC values for the weighted tree (0.7394 on training, 0.7041 on testing) summarized its discrimination ability under the new cost structure. Overall, the analysis showed how model behavior, accuracy, and error trade-offs change when costs are adjusted and performance is evaluated on both training and unseen testing data.
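
For quick reference, the headline numbers reported above can be collected in one small table (values copied from the outputs shown earlier; the standard tree's test MR is (34 + 18) / 200):

# Recap of the key metrics from Tasks 2 and 3
data.frame(
  metric = c("Test MR, standard tree",
             "Training AUC, cost-sensitive tree",
             "Testing AUC, cost-sensitive tree"),
  value  = c(0.26, 0.7394, 0.7041)
)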