Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all the others are predictors.
Run the following code only once to install the
caret package. The German credit scoring data is
provided in that package.
install.packages('caret')
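To make this chunk safe to re-run, a common guard (sketched here) is to install only when the package is missing:
if (!requireNamespace("caret", quietly = TRUE)) install.packages("caret") # install caret only if it is not already available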
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to numeric: 1 = Good, 0 = Bad
# str(GermanCredit)
# Optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
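If you prefer not to hard-code the column indices, caret's nearZeroVar can identify such uninformative predictors programmatically (a sketch; it would be run on the data frame before the manual drop above):
# Find near-zero-variance predictors; these carry little or no
# information for classification and are candidates for removal.
nzv <- nearZeroVar(GermanCredit)
names(GermanCredit)[nzv]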
Set the seed to 2025 for reproducibility. (2 pts)
set.seed(2025)
train_index <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)
GermanCredit_train <- GermanCredit[train_index, ]
GermanCredit_test <- GermanCredit[-train_index, ]
nrow(GermanCredit_train)
## [1] 800
nrow(GermanCredit_test)
## [1] 200
Your observation: Using this code I created an 80/20 split: the training set contains 800 observations (80%) and the test set the remaining 200 (20%), so all 1,000 rows are split between the two sets. Because createDataPartition stratifies on Class, the class proportions are preserved in both sets, and each split closely matches the original dataset.
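A quick check of the stratification (a minimal sketch using base R tables):
# Proportion of good credit (Class == 1) in the full data and in each split;
# stratified sampling should keep these close to each other.
prop.table(table(GermanCredit$Class))
prop.table(table(GermanCredit_train$Class))
prop.table(table(GermanCredit_test$Class))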
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
GermanCredit_train$Class <- as.factor(GermanCredit_train$Class)
credit_tree <- rpart(formula = Class ~ .,
data = GermanCredit_train,
method = "class")
rpart.plot(credit_tree, type = 4, extra = 104,
fallen.leaves = TRUE, main = "Tree for GermanCredit")
Your observation: The fitted tree shows that checking account status is the strongest predictor of credit classification. Additional important splits involve Duration, Credit Amount, Age, and Employment Duration, indicating the model combines multiple financial indicators to separate good from bad credit risk.
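The plot's story can be cross-checked numerically (a sketch using the importance scores rpart stores on the fitted object):
# Top predictors by rpart's variable importance; larger values mean
# the variable contributed more to the tree's splits (including surrogates).
head(sort(credit_tree$variable.importance, decreasing = TRUE), 5)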
model1 <- glm(Class ~ .,
data = GermanCredit_train,
family = binomial)
train_pred_prob <- predict(model1, newdata = GermanCredit_train,
type = "response")
train_pred_class <- 1 * (train_pred_prob > 0.5)
Your observation: Using the logistic regression model and the training set, I obtained a predicted probability of good credit for each customer. Applying the default cutoff of 0.5 converts these probabilities into predicted class labels, which are compared with the actual Class values in the next step to evaluate the model’s performance.
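Before tabulating, it can help to glance at the spread of the fitted probabilities (a minimal sketch):
# Distribution of fitted probabilities of good credit; values near 0.5
# are the borderline customers most sensitive to the cutoff choice.
summary(train_pred_prob)
hist(train_pred_prob, breaks = 20,
     main = "Fitted probabilities (training set)", xlab = "P(good credit)")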
conf_train <- table(Actual = GermanCredit_train$Class,
Predicted = train_pred_class)
conf_train
## Predicted
## Actual 0 1
## 0 134 112
## 1 62 492
MR_train <- (conf_train[1,2] + conf_train[2,1]) / sum(conf_train)
MR_train
## [1] 0.2175
Your observation: The confusion matrix shows how well the model predicts the classes on the training set. The misclassification rate at the 0.5 cutoff is (112 + 62)/800 = 0.2175, i.e., about 21.8% of training predictions are incorrect.
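The overall rate can also be broken into the two error types, which matter differently for credit risk (a sketch reading off the same table):
# False-positive rate: actual bad credit (row 1) predicted as good.
FPR_train <- conf_train[1, 2] / sum(conf_train[1, ])
# False-negative rate: actual good credit (row 2) predicted as bad.
FNR_train <- conf_train[2, 1] / sum(conf_train[2, ])
c(FPR = FPR_train, FNR = FNR_train)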
prob_test_output <- predict(credit_tree,
                            newdata = GermanCredit_test,
                            type = "prob")[, 2] # keep P(Class = 1); type = "prob" returns a two-column matrix
class_test_output <- 1 * (prob_test_output > 0.5)
Your observation: The testing-set predictions give each observation an estimated probability of belonging to Class 1, and these probabilities are converted into class labels using the default 0.5 cutoff. This allows us to compare the model’s predictions with the actual test outcomes to evaluate its performance on unseen data.
conf_output_test <- table(Actual = GermanCredit_test$Class,
Predicted = class_test_output)
conf_output_test
## Predicted
## Actual 0 1
## 0 20 34
## 1 18 128
Your observation: The confusion matrix shows how accurately the tree classifies the testing set; the misclassification rate, the proportion of incorrect predictions, comes to (34 + 18)/200 = 0.26 here. This reflects the model’s performance on unseen data.
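The rate itself can be computed the same way as on the training set (sketch):
# Share of off-diagonal (incorrect) test-set predictions.
MR_test <- (conf_output_test[1, 2] + conf_output_test[2, 1]) /
  sum(conf_output_test)
MR_test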
loss_mat <- matrix(c(0, 2,
1, 0),
nrow = 2, byrow = TRUE)
credit_tree_cost <- rpart(Class ~ ., data = GermanCredit_train,
                          method = "class", parms = list(loss = loss_mat))
plot(credit_tree_cost, margin = 0.1)
text(credit_tree_cost, use.n = TRUE, all = TRUE, cex = 0.7)
Your observation: The weighted-cost tree penalizes false positives (predicting good credit for a truly bad customer) twice as heavily as false negatives, so it splits differently and prioritizes reducing FP errors compared with the standard tree.
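As a reminder of rpart's convention, rows of parms$loss index the actual class (in factor-level order) and columns the predicted class; labeling the matrix makes this explicit (sketch):
# Misclassifying an actual 0 (bad credit) as 1 (good) costs 2;
# the reverse mistake costs 1; correct predictions cost 0.
dimnames(loss_mat) <- list(Actual = c("0", "1"), Predicted = c("0", "1"))
loss_mat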
prob_train_cost <- predict(credit_tree_cost,
newdata = GermanCredit_train,
type = "prob")[, 2]
class_train_cost <- 1 * (prob_train_cost > 0.5)
Your observation: The model produces probabilities for each training observation and assigns class labels using the 0.5 cutoff.
conf_train_cost <- table(
Actual = GermanCredit_train$Class,
Predicted = class_train_cost
)
MR_train_cost <- (conf_train_cost[1,2] + conf_train_cost[2,1]) /
sum(conf_train_cost)
Your observation: The confusion matrix shows the weighted-cost tree’s performance on the training set, and MR_train_cost gives the proportion of incorrect predictions.
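To see what the loss matrix changed, the unweighted tree can be tabulated on the same training rows (a sketch; credit_tree was fit above on identical data):
# Training confusion matrix for the standard tree, for comparison with
# conf_train_cost; the cost-sensitive tree is expected to predict
# "good" (1) less often, trading false positives for false negatives.
prob_train_std <- predict(credit_tree, newdata = GermanCredit_train,
                          type = "prob")[, 2]
class_train_std <- 1 * (prob_train_std > 0.5)
table(Actual = GermanCredit_train$Class, Predicted = class_train_std)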
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
ROC_train_cost <- roc(GermanCredit_train$Class, prob_train_cost)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
AUC_train_cost <- auc(ROC_train_cost)
plot(ROC_train_cost, main = "ROC Curve (Training Set, Cost-Sensitive Tree)")
AUC_train_cost
## Area under the curve: 0.7394
Your observation: The ROC curve shows the model’s ranking performance on the training set, and the AUC of about 0.739 summarizes it on a 0–1 scale, where 0.5 corresponds to random guessing and 1 to perfect discrimination.
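pROC can also report the cutoff that best balances sensitivity and specificity (a sketch using the coords helper with Youden's criterion):
# Threshold maximizing Youden's J = sensitivity + specificity - 1,
# an alternative to the fixed 0.5 cutoff used above.
coords(ROC_train_cost, x = "best", best.method = "youden")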
prob_test_cost <- predict(credit_tree_cost,
newdata = GermanCredit_test,
type = "prob")[, 2]
class_test_cost <- 1 * (prob_test_cost > 0.5)
Your observation: The model outputs test-set probabilities and converts them to class labels using the 0.5 cutoff.
conf_test_cost <- table(
Actual = GermanCredit_test$Class,
Predicted = class_test_cost
)
MR_test_cost <- (conf_test_cost[1,2] + conf_test_cost[2,1]) /
sum(conf_test_cost)
Your observation: The confusion matrix reports the model’s classification performance on the testing set, and the MR value gives the proportion of incorrect predictions under the weighted-cost tree.
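Placing the two test-set tables side by side makes the cost trade-off visible (both objects were computed above):
# Standard tree vs. cost-sensitive tree on the same 200 test rows; the
# weighted tree is expected to show fewer false positives (row 0, column 1).
conf_output_test
conf_test_cost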
ROC_test_cost <- roc(GermanCredit_test$Class, prob_test_cost)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
AUC_test_cost <- auc(ROC_test_cost)
plot(ROC_test_cost, main = "ROC Curve (Testing Set, Cost-Sensitive Tree)")
AUC_test_cost
## Area under the curve: 0.7041
Your observation: The ROC curve shows the model’s ranking ability on the testing set. The test AUC of about 0.704 is only slightly below the training AUC of 0.739, suggesting the cost-sensitive tree’s discrimination generalizes reasonably well to unseen data.
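Finally, the train/test gap in ranking performance can be read directly from the two AUC objects (minimal sketch):
# Train vs. test AUC for the cost-sensitive tree; a small drop
# suggests the ranking performance generalizes to unseen data.
c(train = as.numeric(AUC_train_cost), test = as.numeric(AUC_test_cost))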
The training/test split created an 80/20 partition while preserving class proportions (stratification). The standard tree identified key predictors such as checking account status, duration, and credit amount. A logistic regression model supplied probability estimates on the training set, and the standard tree supplied them on the testing set; in both cases the default 0.5 cutoff converted probabilities into class labels, and the resulting confusion matrices and misclassification rates (MR) reflected classification accuracy.
The cost-sensitive tree incorporated a loss matrix that penalized false positives twice as heavily as false negatives, leading to different splits and a model that more aggressively reduced FP errors. Predictions, confusion matrices, and MR on both training and testing data reflected how the weighted tree shifted trade-offs between FP and FN. ROC curves and AUC values for the weighted tree summarized its discrimination ability under the new cost structure. Overall, the analysis showed how model behavior, accuracy, and error trade-offs change when adjusting costs and evaluating performance on both training and unseen testing data.