Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all others are predictors.
Run the following code only once to install the package
caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) # this package contains the German credit data in its numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert Class to 1 (Good) / 0 (Bad)
# str(GermanCredit)
# optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
set.seed(2025) # set random seed for reproducibility
# split dataset into 80% training and 20% testing
train_ind <- sample(1:nrow(GermanCredit), 0.8 * nrow(GermanCredit))
# creating training and testing datasets
GermanCredit_train <- GermanCredit[train_ind, ]
GermanCredit_test <- GermanCredit[-train_ind,]
# test dimensions
dim(GermanCredit_train)
## [1] 800 49
dim(GermanCredit_test)
## [1] 200 49
# how many "good" and "bad" risks in data:
table(GermanCredit_train$Class)
##
## 0 1
## 246 554
table(GermanCredit_test$Class)
##
## 0 1
## 54 146
Your observation: I split the dataset into 80% training data and 20% testing data. The training set is used to fit the model, while the testing set is used to evaluate how well the model generalizes. The training data contain 246 bad risks and 554 good risks; the testing data contain 54 bad risks and 146 good risks.
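Note that sample() draws rows purely at random, so the Good/Bad proportions can drift slightly between the two sets. If a stratified split is wanted, a minimal sketch using caret's createDataPartition() (already loaded above; strat_ind is an illustrative name):
# stratified alternative: sample within each class so both sets keep the Good/Bad ratio
# Class is numeric 0/1 here, so wrap it in factor() for class-based stratification
strat_ind <- createDataPartition(factor(GermanCredit$Class), p = 0.8, list = FALSE)
GermanCredit_train_strat <- GermanCredit[strat_ind, ]
GermanCredit_test_strat <- GermanCredit[-strat_ind, ]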
library(rpart)
library(rpart.plot)
# fit classification tree model
GermanCredit_rpart_train <- rpart(formula = Class ~ .,
data = GermanCredit_train,
method = "class")
# visualization of fitted tree
prp(GermanCredit_rpart_train)
rpart.plot(GermanCredit_rpart_train, extra = 3)
Your observation: This classification tree splits on predictors of good and bad credit risk. Splits near the root indicate the most influential variables for predicting credit quality; early splits here include checking account status, job history, and key financial factors such as duration and amount.
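Rather than reading importance off the plot, rpart stores a variable importance vector on the fitted object; a quick sketch (the exact variables and magnitudes depend on the fitted tree):
# variable importance: summed improvement contributed by each predictor across splits
head(sort(GermanCredit_rpart_train$variable.importance, decreasing = TRUE))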
# predicted probabilities:
pred_train_prob <- predict(GermanCredit_rpart_train, GermanCredit_train, type = "prob")
# view first several rows
head(pred_train_prob)
## 0 1
## 909 0.1333333 0.8666667
## 460 0.1333333 0.8666667
## 932 0.3272727 0.6727273
## 922 0.1333333 0.8666667
## 961 0.1333333 0.8666667
## 279 0.1333333 0.8666667
# predicted classes:
# note: type = "class" assigns the majority class in the terminal node
pred_train_class <- predict(GermanCredit_rpart_train, GermanCredit_train, type = "class")
# view first several rows
head(pred_train_class)
## 909 460 932 922 961 279
## 1 1 1 1 1 1
## Levels: 0 1
Your observation: The predicted probabilities give the likelihood of being considered a good credit risk; type = "prob" returns the probability of each class, while type = "class" returns the predicted class label.
Example Predicted Probability Interpretation: For observation 909, the model estimates a ~86.67% probability that the individual is a good credit risk.
Example Predicted Class Interpretation: For observation 909, the probability of class 1 (good credit risk) exceeds the default 0.5 cutoff, so the model classifies this individual as a good credit risk.
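As a sanity check, for a tree fit without a loss matrix, type = "class" should agree with thresholding the class-1 probability at 0.5 (up to ties); a minimal sketch of that equivalence (pred_manual is an illustrative name):
# manually threshold P(Class = 1) at 0.5 and cross-tabulate against type = "class"
pred_manual <- ifelse(pred_train_prob[, 2] > 0.5, 1, 0)
table(manual = pred_manual, rpart = pred_train_class)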
actual_value_train <- GermanCredit_train$Class
pred_value_train <- pred_train_class
confusion_matrix_train <- table(actual_value_train, pred_value_train)
# false positives and false negatives
FP_train <- sum(actual_value_train == 0 & pred_value_train == 1)
FN_train <- sum(actual_value_train == 1 & pred_value_train == 0)
# misclassification rate - 2 ways
MR_train1 <- (FP_train + FN_train) / sum(confusion_matrix_train)
MR_train2 <- 1 - sum(diag(confusion_matrix_train))/sum(confusion_matrix_train)
# showing confusion matrix and misclassification rate
confusion_matrix_train
## pred_value_train
## actual_value_train 0 1
## 0 126 120
## 1 36 518
MR_train1
## [1] 0.195
MR_train2
## [1] 0.195
Your observation: The confusion matrix measures the performance of the model by showing the number of false negatives, false positives, true negatives, and true positives. There are 126 true negatives (model and actual are 0/bad), 36 false negatives (model predicts 0/bad but actual is 1/good), 120 false positives (model predicts 1/good but actual is 0/bad), and 518 true positives (model and actual are 1/good). The misclassification rate is 19.5%, indicating that the model makes incorrect predictions 19.5% of the time.
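The misclassification rate hides the asymmetry between the two error types that the matrix makes visible; sensitivity and specificity separate them. A short sketch using the counts above (TP_train and TN_train are illustrative names):
# sensitivity: share of actual good (1) risks predicted good
# specificity: share of actual bad (0) risks predicted bad
TP_train <- sum(actual_value_train == 1 & pred_value_train == 1)
TN_train <- sum(actual_value_train == 0 & pred_value_train == 0)
c(sensitivity = TP_train / (TP_train + FN_train),
  specificity = TN_train / (TN_train + FP_train))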
# use fitted model to return predicted probabilities
pred_test_prob <- predict(GermanCredit_rpart_train, GermanCredit_test, type = "prob")
# view first several rows
head(pred_test_prob)
## 0 1
## 3 0.1333333 0.8666667
## 4 0.7580645 0.2419355
## 7 0.1333333 0.8666667
## 9 0.1333333 0.8666667
## 17 0.1333333 0.8666667
## 21 0.1333333 0.8666667
# use fitted model to return predicted classes:
pred_test_class <- predict(GermanCredit_rpart_train, GermanCredit_test, type = "class")
# view first several rows
head(pred_test_class)
## 3 4 7 9 17 21
## 1 0 1 1 1 1
## Levels: 0 1
Your observation: These predicted probabilities and classes come from the tree model using the testing data and indicate how confident the model is in its predictions.
Example Interpretation of Predicted Probability: For observation 3, the model estimates that there is a ~86.67% probability that the individual is a good credit risk.
Example Interpretation of Predicted Class: For observation 3, the model predicts good credit since the predicted probability of class 1 exceeds the default cutoff of 0.5.
actual_value_test <- GermanCredit_test$Class
pred_value_test <- pred_test_class
confusion_matrix_test <- table(actual_value_test, pred_value_test)
# false positives and false negatives
FP_test <- sum(actual_value_test == 0 & pred_value_test == 1)
FN_test <- sum(actual_value_test == 1 & pred_value_test == 0)
# misclassification rate - 2 ways
MR_test1 <- (FP_test + FN_test) / sum(confusion_matrix_test)
MR_test2 <- 1 - sum(diag(confusion_matrix_test))/sum(confusion_matrix_test)
# showing confusion matrix and misclassification rate
confusion_matrix_test
## pred_value_test
## actual_value_test 0 1
## 0 20 34
## 1 18 128
MR_test1
## [1] 0.26
MR_test2
## [1] 0.26
Your observation: There are 20 true negatives, 18 false negatives, 34 false positives, and 128 true positives. The misclassification rate is 26%, indicating that the model is incorrect 26% of the time. The model has a high true positive rate so it tends to predict good credit risks correctly.
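The high true positive rate can be checked directly from the test counts; a brief sketch (TP_test and TN_test are illustrative names):
# true positive rate and false positive rate on the testing data
TP_test <- sum(actual_value_test == 1 & pred_value_test == 1)
TN_test <- sum(actual_value_test == 0 & pred_value_test == 0)
c(TPR = TP_test / (TP_test + FN_test), # 128 / (128 + 18)
  FPR = FP_test / (FP_test + TN_test)) # 34 / (34 + 20)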
# define cost matrix: rows = actual class (0, 1), columns = predicted class
# note: R fills matrices column-wise, so this gives row 1 = (0, 1) and row 2 = (2, 0);
# misclassifying an actual good (1) as bad (0) costs 2, an actual bad (0) as good (1) costs 1
cost_matrix <- matrix(c(0, 2, 1, 0), nrow = 2)
# weighted tree model
GermanCredit_rpart_weighted <- rpart(Class ~ .,
data = GermanCredit_train,
parms = list(loss = cost_matrix),
method = "class")
# plot
# note: extra = 4 shows the class probabilities at each node
# note: yesno = 2 labels every split with yes/no for readability
rpart.plot(GermanCredit_rpart_weighted, extra = 4, yesno = 2)
Your observation: Assigning a cost matrix changes the objective the tree minimizes: splits and node labels are chosen to reduce expected misclassification cost rather than raw error. With the matrix above, a false negative (an actual good risk predicted as bad) costs 2, while a false positive (an actual bad risk predicted as good) costs 1, so the tree is pushed toward predicting "good" more often than the unweighted tree. Important early predictors in this model include checking account status, duration, and amount.
# predicted probabilities
pred_train_prob_weighted <- predict(GermanCredit_rpart_weighted, newdata = GermanCredit_train, type = "prob")
# view first several rows
head(pred_train_prob_weighted)
## 0 1
## 909 0.1333333 0.8666667
## 460 0.1333333 0.8666667
## 932 0.3571429 0.6428571
## 922 0.1333333 0.8666667
## 961 0.1333333 0.8666667
## 279 0.1333333 0.8666667
# predicted classes
pred_train_class_weighted <- predict(GermanCredit_rpart_weighted, newdata = GermanCredit_train, type = "class")
# view first several rows
head(pred_train_class_weighted)
## 909 460 932 922 961 279
## 1 1 1 1 1 1
## Levels: 0 1
Your observation:
Interpretation of Predicted Probability: For observation 909, there is an ~86.67% probability of being a good credit risk. With the weighted cost matrix, the predicted class is the one with the lower expected weighted cost rather than simply the majority class.
Interpretation of Predicted Class: For observation 909, the individual is predicted to be a good credit risk. Because a false negative costs twice as much as a false positive here, the model labels a node "good" at a lower probability threshold than the unweighted tree.
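The expected-cost rule can be made concrete: a node is labeled 1 when the expected cost of predicting 0 (2 x P(1)) exceeds the expected cost of predicting 1 (1 x P(0)), i.e. when P(1) > 1/3. A sketch of that check (cutoff_weighted and pred_manual_weighted are illustrative names; agreement with type = "class" is expected up to ties, since rpart implements the loss through altered priors):
# implied probability cutoff from the loss matrix: L[1,2] / (L[1,2] + L[2,1]) = 1/3
cutoff_weighted <- cost_matrix[1, 2] / (cost_matrix[1, 2] + cost_matrix[2, 1])
pred_manual_weighted <- ifelse(pred_train_prob_weighted[, 2] > cutoff_weighted, 1, 0)
table(manual = pred_manual_weighted, rpart = pred_train_class_weighted)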
actual_value_train_weighted <- GermanCredit_train$Class
pred_value_train_weighted <- pred_train_class_weighted
confusion_matrix_train_weighted <- table(actual_value_train_weighted, pred_value_train_weighted)
# false positives and false negatives
FP_train_weighted <- sum(actual_value_train_weighted == 0 & pred_value_train_weighted == 1)
FN_train_weighted <- sum(actual_value_train_weighted == 1 & pred_value_train_weighted == 0)
# misclassification rate - 2 ways
MR_train1_weighted <- (FP_train_weighted + FN_train_weighted) / sum(confusion_matrix_train_weighted)
MR_train2_weighted <- 1 - sum(diag(confusion_matrix_train_weighted))/sum(confusion_matrix_train_weighted)
# showing confusion matrix and misclassification rate
confusion_matrix_train_weighted
## pred_value_train_weighted
## actual_value_train_weighted 0 1
## 0 98 148
## 1 14 540
MR_train1_weighted
## [1] 0.2025
MR_train2_weighted
## [1] 0.2025
Your observation: There are 98 true negatives, 14 false negatives, 148 false positives, and 540 true positives. The misclassification rate of 0.2025 is slightly higher than the unweighted tree's 0.195, but this is expected: the weighted tree minimizes expected cost rather than raw error, trading extra false positives for far fewer false negatives (14 vs. 36).
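Since the weighted tree optimizes cost rather than error, the fairer comparison between the two models is total weighted training cost; a sketch using the counts computed above (cost_unweighted and cost_weighted are illustrative names):
# total weighted training cost: 1 per false positive, 2 per false negative
cost_unweighted <- 1 * FP_train + 2 * FN_train
cost_weighted <- 1 * FP_train_weighted + 2 * FN_train_weighted
c(unweighted = cost_unweighted, weighted = cost_weighted)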
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# generate ROC curve
roc_obj_train_weighted <- roc(actual_value_train_weighted, pred_train_prob_weighted[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# plot ROC curve
plot(roc_obj_train_weighted)
# calculate AUC
auc_train_weighted <- auc(actual_value_train_weighted, pred_train_prob_weighted[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# view AUC
auc_train_weighted
## Area under the curve: 0.7797
Your observation: The ROC curve plots the true positive rate against the false positive rate at every probability cutoff, and the AUC summarizes the curve as a single number. The curve sits above the chance diagonal, so the model has genuine discriminatory power. The AUC of 0.7797 is well above 0.5, meaning the model separates good and bad risks considerably better than random guessing.
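For comparison, the same pROC workflow applies to the unweighted tree on the training data; a brief sketch (roc_obj_train is an illustrative name):
# ROC/AUC for the unweighted tree, to compare against the weighted tree's 0.7797
roc_obj_train <- roc(actual_value_train, pred_train_prob[, 2])
auc(roc_obj_train)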
# predicted probabilities
pred_test_prob_weighted <- predict(GermanCredit_rpart_weighted, newdata = GermanCredit_test, type = "prob")
# view first several rows
head(pred_test_prob_weighted)
## 0 1
## 3 0.1333333 0.86666667
## 4 0.9333333 0.06666667
## 7 0.1333333 0.86666667
## 9 0.1333333 0.86666667
## 17 0.1333333 0.86666667
## 21 0.1333333 0.86666667
# predicted classes
pred_test_class_weighted <- predict(GermanCredit_rpart_weighted, newdata = GermanCredit_test, type = "class")
# view first several rows
head(pred_test_class_weighted)
## 3 4 7 9 17 21
## 1 0 1 1 1 1
## Levels: 0 1
Your observation: Each test observation is assigned a probability of being a good (1) or bad (0) credit risk, and the predicted class is the one that minimizes the expected weighted cost.
Example Interpretation of Predicted Probability: For observation 3, the model estimates that there is a ~86.67% probability that the individual is a good credit risk.
Example Interpretation of Predicted Class: For observation 3, the model predicts good credit since the predicted probability of class 1 is well above the cost-adjusted cutoff of 1/3.
# actual and predicted values
actual_value_test_weighted <- GermanCredit_test$Class
pred_value_test_weighted <- pred_test_class_weighted
# confusion matrix
confusion_matrix_test_weighted <- table(actual_value_test_weighted, pred_value_test_weighted)
# false positives and false negatives
FP_test_weighted <- sum(actual_value_test_weighted == 0 & pred_value_test_weighted == 1)
FN_test_weighted <- sum(actual_value_test_weighted == 1 & pred_value_test_weighted == 0)
# misclassification rate - 2 ways
MR_test1_weighted <- (FP_test_weighted + FN_test_weighted) / sum(confusion_matrix_test_weighted)
MR_test2_weighted <- 1 - sum(diag(confusion_matrix_test_weighted))/sum(confusion_matrix_test_weighted)
# showing confusion matrix and misclassification rate
confusion_matrix_test_weighted
## pred_value_test_weighted
## actual_value_test_weighted 0 1
## 0 13 41
## 1 10 136
MR_test1_weighted
## [1] 0.255
MR_test2_weighted
## [1] 0.255
Your observation: There are 13 true negatives, 10 false negatives, 41 false positives, and 136 true positives. The misclassification rate is 0.255, indicating that 25.5% of test outcomes are incorrectly predicted, marginally better than the unweighted tree's 26%.
# generate ROC curve
roc_obj_test_weighted <- roc(actual_value_test_weighted, pred_test_prob_weighted[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# plot ROC curve
plot(roc_obj_test_weighted)
# calculate AUC
auc_test_weighted <- auc(actual_value_test_weighted, pred_test_prob_weighted[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# view AUC
auc_test_weighted
## Area under the curve: 0.7149
Your observation: The ROC curve plots the true positive rate against the false positive rate, and the AUC summarizes it as a single number. The curve lies above the chance diagonal, so the model retains discriminatory power on unseen data. The AUC of 0.7149 is well above 0.5 (random guessing) but below the training AUC of 0.7797, a modest sign of overfitting.
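Plotting the two weighted-tree ROC curves on the same axes makes the drop from 0.7797 to 0.7149 directly visible; a sketch using pROC's add argument:
# overlay: training ROC solid, testing ROC dashed
plot(roc_obj_train_weighted)
plot(roc_obj_test_weighted, add = TRUE, lty = 2)
legend("bottomright", legend = c("train (AUC 0.7797)", "test (AUC 0.7149)"), lty = c(1, 2))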
I split the dataset into 80% training and 20% testing data.
For the tree model without weighted cost, important predictors included checking account status, duration, and credit amount. The training misclassification rate was 19.5%, with a high true positive rate on good credit risks; the testing misclassification rate was 26%, somewhat worse than on the training data.
For the tree model with weighted cost, false negatives (good risks predicted as bad) were assigned twice the cost of false positives, which made the model more willing to predict good credit risks. The training MR was 20.25%, and the testing MR was higher at 25.5%. The training AUC was 0.7797 and the testing AUC was 0.7149, indicating reasonable discriminatory ability on both sets.
Overall, both models identify good risks effectively, but the weighted tree shifts its predictions to reflect the asymmetric costs, trading a few additional false positives for far fewer false negatives.