Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all others are predictors.
Run the following code only once to install the caret
package. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) # this package contains the German credit data in numeric (dummy-coded) format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert Class to 1 (Good) or 0 (Bad)
# str(GermanCredit)
# Optional: drop variables that provide no information in this data set
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
Split the data into 70% training and 30% test sets, using the seed 2025 for reproducibility. (2 pts)
set.seed(2025)
GermanCredit$Class <- factor(GermanCredit$Class, levels = c(0, 1))
train_index <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
German_train <- GermanCredit[train_index, ]
German_test <- GermanCredit[-train_index, ]
dim(German_train)
## [1] 700 49
dim(German_test)
## [1] 300 49
Your observation:
The training set contains 700 observations (70%) and the test set 300 (30%); because createDataPartition stratifies on Class, the proportions of good vs. bad credit are similar in both sets.
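This stratification can be confirmed directly (a quick check; output not shown above):
# Class proportions should be roughly the same in both sets
prop.table(table(German_train$Class))
prop.table(table(German_test$Class))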
library(rpart)
library(rpart.plot)
tree_unweighted <- rpart(
Class ~ .,
data = German_train,
method = "class",
control = rpart.control(cp = 0.01)
)
rpart.plot(tree_unweighted, main = "Unweighted Classification Tree")
Your observation:
The tree splits on a few key credit-related variables first (e.g., duration, amount, existing checking status), suggesting these are important for distinguishing good vs bad credit.
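The impression from the plot can be checked against rpart's stored importance scores (a quick diagnostic; output not shown here):
# Variables ranked by their overall contribution to the splits
head(sort(tree_unweighted$variable.importance, decreasing = TRUE))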
train_prob_unw <- predict(tree_unweighted, newdata = German_train, type = "prob")[, "1"]
train_class_unw <- predict(tree_unweighted, newdata = German_train, type = "class")
head(train_prob_unw)
## 1 3 6 11 12 13
## 0.8181818 0.8584071 0.8584071 0.3125000 0.2258065 0.3548387
head(train_class_unw)
## 1 3 6 11 12 13
## 1 1 1 0 0 0
## Levels: 0 1
Your observation:
The model assigns each customer a probability of class 1 (good credit), and the class predictions follow from the default 0.5 probability cutoff.
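Because the unweighted tree labels each leaf with its majority class, the class predictions should match thresholding the probabilities at 0.5 (a minimal sketch to confirm this):
# Recreate the class predictions from the probabilities with a 0.5 cutoff
manual_class <- factor(as.numeric(train_prob_unw > 0.5), levels = c(0, 1))
all(manual_class == train_class_unw) # expected TRUE under the default rule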
cm_train_unw <- table(Predicted = train_class_unw, Actual = German_train$Class)
cm_train_unw
## Actual
## Predicted 0 1
## 0 107 37
## 1 103 453
MR_train_unw <- mean(train_class_unw != German_train$Class)
MR_train_unw
## [1] 0.2
Your observation:
The confusion matrix counts correct and incorrect classifications for each class: the training misclassification rate is 0.20 (140 of 700 errors), and most errors are bad credits (actual 0) predicted as good (103 of 140).
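The overall rate hides the asymmetry between the two error types, which can be pulled directly from the confusion matrix:
# False positive rate: bad credits (actual 0) predicted as good (1)
FP_rate_train <- cm_train_unw["1", "0"] / sum(cm_train_unw[, "0"]) # 103/210
# False negative rate: good credits (actual 1) predicted as bad (0)
FN_rate_train <- cm_train_unw["0", "1"] / sum(cm_train_unw[, "1"]) # 37/490
c(FP = FP_rate_train, FN = FN_rate_train)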
test_prob_unw <- predict(tree_unweighted, newdata = German_test, type = "prob")[, "1"]
test_class_unw <- predict(tree_unweighted, newdata = German_test, type = "class")
head(test_prob_unw)
## 2 4 5 7 8 9
## 0.8043478 0.2258065 0.7192982 0.8584071 0.3125000 0.8584071
head(test_class_unw)
## 2 4 5 7 8 9
## 1 0 1 1 0 1
## Levels: 0 1
Your observation:
We now have predicted probabilities and classes for previously unseen customers in the test set.
cm_test_unw <- table(Predicted = test_class_unw, Actual = German_test$Class)
cm_test_unw
## Actual
## Predicted 0 1
## 0 42 36
## 1 48 174
MR_test_unw <- mean(test_class_unw != German_test$Class)
MR_test_unw
## [1] 0.28
Your observation:
The test misclassification rate (0.28) is noticeably higher than the training rate (0.20), so the model overfits somewhat, though the gap is moderate for a single tree.
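Another view on overfitting comes from rpart's built-in cross-validation (a quick diagnostic; output omitted):
# Cross-validated error (xerror) by tree size; a large gap between rel error
# and xerror would also signal overfitting
printcp(tree_unweighted)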
# In rpart, the loss matrix is indexed as [true class, predicted class], so
# loss_mat["0", "1"] = 2 makes a false positive (a true bad credit predicted
# good) twice as costly as a false negative.
loss_mat <- matrix(c(0, 2,
                     1, 0),
                   nrow = 2, byrow = TRUE)
colnames(loss_mat) <- rownames(loss_mat) <- levels(German_train$Class)
loss_mat
## 0 1
## 0 0 2
## 1 1 0
tree_weighted <- rpart(
Class ~ .,
data = German_train,
method = "class",
parms = list(loss = loss_mat),
control = rpart.control(cp = 0.01)
)
rpart.plot(tree_weighted, main = "Weighted Classification Tree (FP cost 2, FN cost 1)")
Your observation:
Compared to the unweighted tree, this tree structure may shift to reduce costly false positives, possibly classifying more cases as bad credit (0).
train_prob_w <- predict(tree_weighted, newdata = German_train, type = "prob")[, "1"]
train_class_w <- predict(tree_weighted, newdata = German_train, type = "class")
head(train_prob_w)
## 1 3 6 11 12 13
## 0.9565217 0.8584071 0.8584071 0.4629630 0.4472050 0.4629630
head(train_class_w)
## 1 3 6 11 12 13
## 1 1 1 0 0 0
## Levels: 0 1
Your observation:
The predicted probabilities differ from the unweighted model because the loss matrix changes how the tree is grown, so the leaves (and hence the leaf proportions) are different.
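With FP cost 2 and FN cost 1, the cost-minimizing label at a leaf is 1 only when the expected loss of predicting 1 (2 × P(0)) is below that of predicting 0 (1 × P(1)), i.e. when P(1) > 2/3. Assuming the reported probabilities are the leaf proportions, this adjusted cutoff should reproduce the weighted tree's class predictions (a sketch of the rule, not guaranteed for every rpart configuration):
# Apply the cost-adjusted cutoff P(1) > 2/3 to the predicted probabilities
implied_class <- factor(as.numeric(train_prob_w > 2/3), levels = c(0, 1))
mean(implied_class == train_class_w) # expected to be (near) 1 if the assumption holds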
cm_train_w <- table(Predicted = train_class_w, Actual = German_train$Class)
cm_train_w
## Actual
## Predicted 0 1
## 0 153 123
## 1 57 367
MR_train_w <- mean(train_class_w != German_train$Class)
MR_train_w
## [1] 0.2571429
Your observation:
The weighted tree's training MR (0.257) is higher than the unweighted tree's (0.20), but false positives drop from 103 to 57 while false negatives rise from 37 to 123. Under the stated costs this trade-off can be preferable, as the cost comparison below makes concrete.
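One way to quantify "preferable" is to score each training confusion matrix with the loss matrix itself. Note the cm tables are Predicted × Actual while loss_mat is Actual × Predicted, hence the transpose:
# Total training cost: each error weighted by its entry in loss_mat
sum(t(cm_train_unw) * loss_mat) # unweighted tree: 2*103 + 1*37 = 243
sum(t(cm_train_w) * loss_mat)   # weighted tree:   2*57 + 1*123 = 237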
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_train_w <- roc(German_train$Class, train_prob_w)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_train_w, col = "blue", main = "ROC Curve - Training Set (Weighted Tree)")
auc_train_w <- auc(roc_train_w)
auc_train_w
## Area under the curve: 0.7598
Your observation:
The training AUC of about 0.76 summarizes how well the weighted tree separates good and bad credit; values closer to 1 indicate better discrimination.
test_prob_w <- predict(tree_weighted, newdata = German_test, type = "prob")[, "1"]
test_class_w <- predict(tree_weighted, newdata = German_test, type = "class")
head(test_prob_w)
## 2 4 5 7 8 9
## 0.9333333 0.4472050 0.4472050 0.8584071 0.4800000 0.8584071
head(test_class_w)
## 2 4 5 7 8 9
## 1 0 0 1 0 1
## Levels: 0 1
Your observation:
We now have predictions from the cost-sensitive tree on unseen test data.
cm_test_w <- table(Predicted = test_class_w, Actual = German_test$Class)
cm_test_w
## Actual
## Predicted 0 1
## 0 64 63
## 1 26 147
MR_test_w <- mean(test_class_w != German_test$Class)
MR_test_w
## [1] 0.2966667
Your observation:
Relative to the unweighted tree on the test set, false positives fall from 48 to 26 while false negatives rise from 36 to 63; the MR is slightly higher (0.297 vs. 0.28), but the error mix shifts toward the cheaper type.
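The same cost accounting as on the training set shows the weighted tree is cheaper under the 2:1 loss structure:
# Total test cost: each error weighted by its entry in loss_mat
sum(t(cm_test_unw) * loss_mat) # unweighted tree: 2*48 + 1*36 = 132
sum(t(cm_test_w) * loss_mat)   # weighted tree:   2*26 + 1*63 = 115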
roc_test_w <- roc(German_test$Class, test_prob_w)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_test_w, col = "red", main = "ROC Curve - Test Set (Weighted Tree)")
auc_test_w <- auc(roc_test_w)
auc_test_w
## Area under the curve: 0.7286
Your observation:
The test AUC (about 0.73) is a bit below the training AUC (0.76), as expected on new data; comparing it with the unweighted tree on the same test set helps pick the better model in practice, as sketched below.
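To make that comparison explicit, the unweighted tree's test ROC and AUC can be computed the same way and overlaid (not run above, so no output is shown):
# AUC of the unweighted tree on the test set, for a side-by-side comparison
roc_test_unw <- roc(German_test$Class, test_prob_unw)
auc(roc_test_unw)
plot(roc_test_w, col = "red", main = "ROC Curves - Test Set")
lines(roc_test_unw, col = "blue")
legend("bottomright", legend = c("Weighted", "Unweighted"), col = c("red", "blue"), lwd = 2)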
The unweighted classification tree provided a baseline with reasonable training and test misclassification rates (0.20 and 0.28), but it treated false positives and false negatives equally. Introducing a higher cost for false positives shifted the weighted tree's decision boundary, making it more conservative about predicting the positive class (good credit).
As a result, the weighted model reduced false positives on the test set (from 48 to 26) at the expense of more false negatives (from 36 to 63). When approving bad credit is more costly than rejecting good credit, as the 2:1 loss matrix assumes, the weighted model is preferable even though its overall misclassification rate is slightly higher (0.297 vs. 0.28); the cost comparisons above show its total cost is lower on both the training and test sets. Comparing AUC values on the training and test sets (the unweighted tree's test AUC can be computed with the sketch above) indicates broadly similar discrimination ability, but the error trade-offs differ, which is critical when setting credit policy.