Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is Class; all other variables are predictors.

Run the following code only once to install the caret package. The German credit scoring data is provided in that package.

install.packages('caret')

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # provides the German credit data in numeric (dummy-coded) format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # recode `Class` as 1 (Good) or 0 (Bad)
# str(GermanCredit)
# Optional: drop variables that provide no information in the data
GermanCredit <- GermanCredit[, -c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Split the dataset into training and test sets. Please use 2025 as the random seed for reproducibility. (2 pts)

set.seed(2025)
index <- sample(1:nrow(GermanCredit), 0.7 * nrow(GermanCredit))
GermanCredit_train <- GermanCredit[index, ]
GermanCredit_test  <- GermanCredit[-index, ]

Your observation: After setting the seed to 2025, the dataset was split randomly into a 70% training set and a 30% testing set. The split is reproducible because of the fixed seed.
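The reproducibility claim can be checked directly. The following is a minimal sketch in base R, assuming a dataset of 1000 rows (the row count implied by the 700 and 300 totals in the confusion matrices later in this report); the names `n`, `index_a`, and `index_b` are illustrative only.

```r
n <- 1000  # assumed row count, implied by the 700 + 300 split

# Draw the training indices twice, resetting the seed each time.
# round() guards against 0.7 * n being truncated by floating-point error.
set.seed(2025)
index_a <- sample(1:n, round(0.7 * n))
set.seed(2025)
index_b <- sample(1:n, round(0.7 * n))

identical(index_a, index_b)      # same seed, same split
length(index_a)                  # size of the training set
length(setdiff(1:n, index_a))    # size of the test set
```

Resetting the seed immediately before `sample()` is what makes the split repeatable; any random call placed between `set.seed()` and `sample()` would change the result.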

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right. Then please make a visualization of your fitted tree. (3 pts)

library(rpart)
library(rpart.plot)
train_tree_model <- rpart(Class ~ ., data = GermanCredit_train, method = "class")
rpart.plot(train_tree_model, type = 2, extra = 1)

Your observation: A classification tree was fit using all predictors in the training set. The plot shows the sequence of splits the model uses to classify customers as “Good” or “Bad” credit. The tree structure highlights the most important variables based on how early they appear in the splitting process.

2. Use the training set to get predicted probabilities and classes (Please use the default cutoff probability). (2 pts)

train_prob <- predict(train_tree_model, GermanCredit_train, type = "prob")[,2]
train_pred <- ifelse(train_prob >= 0.5, 1, 0)
train_pred <- factor(train_pred, levels = c(0,1))

Your observation: Predicted probabilities for the “Good” credit class were generated using the fitted tree. Classes were assigned using the default cutoff of 0.5, which means observations with predicted probability ≥ 0.5 were labeled as “Good.” The predictions appear reasonable with no missing values.
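The cutoff step can be isolated in a small helper. This is a sketch on made-up probabilities, not the tree's actual output; `classify`, `demo_prob`, and `demo_pred` are hypothetical names. It also shows why the code fixes `levels = c(0, 1)`: both classes stay in the table even if one is never predicted.

```r
# Hypothetical helper: turn probabilities into 0/1 class labels at a cutoff.
classify <- function(prob, cutoff = 0.5) {
  factor(ifelse(prob >= cutoff, 1, 0), levels = c(0, 1))
}

demo_prob <- c(0.70, 0.90, 0.55)  # made-up probabilities, all above 0.5
demo_pred <- classify(demo_prob)

table(demo_pred)  # both levels are listed; class 0 has a count of 0
```

Without fixed levels, a prediction vector containing only one class would produce a one-row confusion matrix, and `diag()` on that table would no longer pick out the correct cells.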

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

train_cm <- table(Predicted = train_pred, Actual = GermanCredit_train$Class)
train_cm
##          Actual
## Predicted   0   1
##         0 112  33
##         1 113 442
train_mr <- 1 - sum(diag(train_cm)) / sum(train_cm)
train_mr
## [1] 0.2085714

Your observation: The confusion matrix shows that the model correctly classified most class-1 cases (442 true positives) but produced a notable number of false positives (113) and false negatives (33). With a misclassification rate of 0.2086, the model incorrectly predicts about 20.86% of all observations.
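The misclassification rate hides how the errors are distributed across the two classes. As a sketch, the counts from the training confusion matrix above can be re-entered by hand to get per-class rates; `train_cm_demo` and the scalar names are illustrative and not part of the original code.

```r
# Training confusion matrix copied from the output above
# (rows = predicted class, columns = actual class).
train_cm_demo <- matrix(c(112, 113, 33, 442), nrow = 2,
                        dimnames = list(Predicted = c("0", "1"),
                                        Actual    = c("0", "1")))

tn <- train_cm_demo["0", "0"]  # actual 0 predicted as 0
fn <- train_cm_demo["0", "1"]  # actual 1 predicted as 0
fp <- train_cm_demo["1", "0"]  # actual 0 predicted as 1
tp <- train_cm_demo["1", "1"]  # actual 1 predicted as 1

mr          <- (fp + fn) / sum(train_cm_demo)
sensitivity <- tp / (tp + fn)  # share of class-1 cases found
specificity <- tn / (tn + fp)  # share of class-0 cases found

round(c(MR = mr, sensitivity = sensitivity, specificity = specificity), 4)
```

With these counts the tree identifies about 93% of class-1 cases but only about half of the class-0 cases, which is where most of the 20.86% error comes from.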

4. Use the testing set to get predicted probabilities and classes (Please use the default cutoff probability). (2 pts)

test_prob <- predict(train_tree_model, GermanCredit_test, type = "prob")[,2]
test_pred <- ifelse(test_prob >= 0.5, 1, 0)
test_pred <- factor(test_pred, levels = c(0,1))

Your observation: Predicted probabilities and class labels were calculated for the test set using the same 0.5 cutoff. This provides an unbiased view of how the model performs on unseen data. There were no errors or issues when generating predictions.

5. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

test_cm <- table(Predicted = test_pred, Actual = GermanCredit_test$Class)
test_cm
##          Actual
## Predicted   0   1
##         0  24  24
##         1  51 201
test_mr <-  1 - sum(diag(test_cm)) / sum(test_cm)
test_mr
## [1] 0.25

Your observation: The confusion matrix shows that the model correctly classified 24 cases of class 0 and 201 cases of class 1, but misclassified 51 class-0 cases as class 1 (false positives) and 24 class-1 cases as class 0 (false negatives). With a misclassification rate of 0.25, the model incorrectly predicts 25% of all observations in the test set.

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

lossMat <- matrix(c(0, 2,   
                    1, 0),  
                  nrow = 2, byrow = TRUE)
rownames(lossMat) <- colnames(lossMat) <- c("0","1")
tree_weighted <- rpart(Class ~ ., data = GermanCredit_train, method = "class",
                       parms = list(loss = lossMat),
                       control = rpart.control(cp = 0.01, minsplit = 20))
rpart.plot(tree_weighted, type = 2, extra = 1)

Your observation: The weighted tree was successfully fit with FP cost = 2 and FN cost = 1. Compared to the unweighted tree, it is more conservative when predicting class 1, reflecting the higher penalty for false positives.
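To make the effect of the loss matrix concrete, the sketch below computes the expected loss of each possible prediction at a single node. It assumes the convention that rows of the loss matrix index the actual class and columns the predicted class, so the entry 2 is the cost of predicting 1 for an actual 0 (a false positive). The helpers `expected_loss` and `best_class` and the node probabilities are hypothetical; this is not how rpart is called internally.

```r
# Loss matrix as in the text: rows = actual class, columns = predicted class
# (assumed convention).
loss_mat <- matrix(c(0, 2,
                     1, 0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("0", "1"),
                                   predicted = c("0", "1")))

# Hypothetical helper: expected loss of each prediction at a node with
# class probabilities p = c(P(class 0), P(class 1)).
expected_loss <- function(p, loss) {
  setNames(as.vector(p %*% loss), colnames(loss))
}

# Hypothetical helper: the prediction with the smallest expected loss.
best_class <- function(p, loss) {
  names(which.min(expected_loss(p, loss)))
}

expected_loss(c(0.4, 0.6), loss_mat)  # predicting 0 costs 0.6, predicting 1 costs 0.8
best_class(c(0.4, 0.6), loss_mat)     # "0", even though P(class 1) = 0.6
best_class(c(0.3, 0.7), loss_mat)     # "1"
```

Under this loss matrix, predicting class 1 is the cheaper choice only when P(class 1) exceeds 2/3, which is why a cost-weighted rule is more reluctant to label a customer as class 1 than a plain 0.5 cutoff.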

2. Use the training set to get predicted probabilities and classes (Please use the default cutoff probability). (2 pts)

train_prob_w <- predict(tree_weighted, GermanCredit_train, type = "prob")[,2]
train_pred_w <- ifelse(train_prob_w >= 0.5, 1, 0)
train_pred_w <- factor(train_pred_w, levels = c(0,1))

Your observation: Predicted probabilities and classes were generated for the training set using the weighted tree and a 0.5 cutoff. The model is more cautious about predicting class 1 due to the higher cost of false positives.

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

train_cm_w <- table(Predicted = train_pred_w, Actual = GermanCredit_train$Class)
train_cm_w
##          Actual
## Predicted   0   1
##         0 139  96
##         1  86 379
train_mr_w <- 1 - sum(diag(train_cm_w)) / sum(train_cm_w)
train_mr_w
## [1] 0.26

Your observation: The confusion matrix shows that the weighted tree correctly classified 139 cases of class 0 and 379 cases of class 1, while misclassifying 86 class-0 cases as class 1 (false positives) and 96 class-1 cases as class 0 (false negatives). With a misclassification rate of 0.26, the model incorrectly predicts about 26% of all training observations. Relative to the unweighted tree, false positives fell from 113 to 86 while false negatives rose from 33 to 96, reflecting the trade-off made to reduce false positives under the weighted cost.

4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_train_w <- roc(response = as.numeric(GermanCredit_train$Class), predictor = train_prob_w)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_train_w, main = "ROC - Training Set (Weighted Tree)")

auc_train_w <- auc(roc_train_w)
auc_train_w
## Area under the curve: 0.7737

Your observation: The ROC curve for the training set shows the model’s ability to discriminate between classes. The area under the curve of 0.7737 indicates acceptable class separation for the weighted tree. Because it is measured on the same data used to fit the tree, it is an optimistic estimate of performance on new data.
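AUC can be read as the probability that a randomly chosen class-1 case receives a higher predicted probability than a randomly chosen class-0 case, with ties counted as one half. The sketch below computes it from ranks on a made-up example; `auc_rank`, `demo_labels`, and `demo_scores` are hypothetical and independent of pROC.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula.
# labels are 0/1; scores are predicted probabilities of class 1.
auc_rank <- function(labels, scores) {
  r  <- rank(scores)            # tied scores share their average rank
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up data with one tie between a class-0 and a class-1 case.
demo_labels <- c(0, 0, 1, 1, 0, 1)
demo_scores <- c(0.2, 0.4, 0.4, 0.8, 0.6, 0.9)

auc_rank(demo_labels, demo_scores)  # 7.5 of 9 pairs ordered correctly
```

A tree assigns the same probability to every observation in a leaf, so its scores contain many ties and its ROC curve has only a few distinct points.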

5. Use the testing set to get predicted probabilities and classes (Please use the default cutoff probability). (2 pts)

test_prob_w <- predict(tree_weighted, GermanCredit_test, type = "prob")[,2]

test_pred_w <- ifelse(test_prob_w >= 0.5, 1, 0)
test_pred_w <- factor(test_pred_w, levels = c(0,1))

Your observation: Predicted probabilities and classes were generated for the test set. The weighted tree continues to favor reducing false positives while classifying unseen data.

6. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

test_cm_w <- table(Predicted = test_pred_w, Actual = GermanCredit_test$Class)
test_cm_w
##          Actual
## Predicted   0   1
##         0  39  53
##         1  36 172
test_mr_w <- 1 - sum(diag(test_cm_w)) / sum(test_cm_w)
test_mr_w
## [1] 0.2966667

Your observation: The confusion matrix for the test set shows that the weighted tree correctly classified 39 cases of class 0 and 172 cases of class 1, while misclassifying 36 class-0 cases as class 1 (false positives) and 53 class-1 cases as class 0 (false negatives). With a misclassification rate of 0.297, the model incorrectly predicts about 29.7% of all test observations. Compared with the unweighted tree on the test set, false positives fell from 51 to 36 and false negatives rose from 24 to 53, reflecting the trade-off made under the weighted cost structure.

7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)

roc_test_w <- roc(response = as.numeric(GermanCredit_test$Class), predictor = test_prob_w)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_test_w, main = "ROC - Test Set (Weighted Tree)")

auc_test_w <- auc(roc_test_w)
auc_test_w
## Area under the curve: 0.708

Your observation: The ROC curve for the test set shows the model’s ability to separate the two classes on unseen data. The area under the curve (AUC) of 0.708 indicates moderate discrimination, showing that the weighted tree maintains reasonable predictive performance on the test set despite prioritizing the reduction of false positives.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

The unweighted tree achieved a misclassification rate of about 21% on the training set and 25% on the test set. Introducing a weighted cost reduced false positives (113 to 86 on training, 51 to 36 on test) but increased false negatives (33 to 96 on training, 24 to 53 on test), raising the training MR to 26% and the test MR to 29.7%. ROC and AUC were computed only for the weighted tree; its AUC of 0.774 on the training set and 0.708 on the test set indicates moderate discrimination. Overall, the results illustrate the trade-off between prioritizing the costlier error type and overall accuracy.
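Since the weighted tree is meant to minimize cost rather than error count, the two trees can also be compared on total cost. The sketch below re-enters the four confusion matrices reported above and applies the assignment's weights (2 per false positive, 1 per false negative); `make_cm`, `total_cost`, `cms`, and `costs` are hypothetical names.

```r
# Hypothetical helper: build a 2 x 2 table with rows = predicted class,
# columns = actual class, from counts listed column by column.
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
}

# Hypothetical helper: total cost = fp_cost * FP + fn_cost * FN.
total_cost <- function(cm, fp_cost = 2, fn_cost = 1) {
  fp_cost * cm["1", "0"] + fn_cost * cm["0", "1"]
}

# Counts copied from the confusion matrices in Tasks 2 and 3.
cms <- list(
  unweighted_train = make_cm(c(112, 113, 33, 442)),
  unweighted_test  = make_cm(c(24, 51, 24, 201)),
  weighted_train   = make_cm(c(139, 86, 96, 379)),
  weighted_test    = make_cm(c(39, 36, 53, 172))
)

costs <- sapply(cms, total_cost)
costs
```

On these counts the total cost is 259 (unweighted) versus 268 (weighted) on the training set and 126 versus 125 on the test set, so at the 0.5 cutoff the two trees are nearly equivalent in cost terms even though their error mix differs.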