Homework5

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
# str(GermanCredit)

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Split the dataset into training and test set. Please use the random seed as `2025` for reproducibility. (2 pts)

set.seed(2025)

GermanCredit$Class <- factor(GermanCredit$Class, levels = c(0, 1))

train_index <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
German_train <- GermanCredit[train_index, ]
German_test  <- GermanCredit[-train_index, ]

dim(German_train)

## [1] 700  49

dim(German_test)

## [1] 300  49

Your observation:

Training set has about 70% of the data and test set about 30%, with similar proportions of good vs bad credit in both.

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right. Then Please make a visualization of your fitted tree. (3 pts)

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.5.2

tree_unweighted <- rpart(
Class ~ .,
data   = German_train,
method = "class",
control = rpart.control(cp = 0.01)
)

rpart.plot(tree_unweighted, main = "Unweighted Classification Tree")

Your observation:

The tree splits on a few key credit-related variables first (e.g., duration, amount, existing checking status), suggesting these are important for distinguishing good vs bad credit.

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

train_prob_unw <- predict(tree_unweighted, newdata = German_train, type = "prob")[, "1"]

train_class_unw <- predict(tree_unweighted, newdata = German_train, type = "class")

head(train_prob_unw)

##         1         3         6        11        12        13 
## 0.8181818 0.8584071 0.8584071 0.3125000 0.2258065 0.3548387

head(train_class_unw)

##  1  3  6 11 12 13 
##  1  1  1  0  0  0 
## Levels: 0 1

Your observation:

The model assigns higher probabilities of class 1 (good credit) to some customers, and the class predictions reflect these probabilities based on the default decision rule.

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

cm_train_unw <- table(Predicted = train_class_unw, Actual = German_train$Class)
cm_train_unw

##          Actual
## Predicted   0   1
##         0 107  37
##         1 103 453

MR_train_unw <- mean(train_class_unw != German_train$Class)
MR_train_unw

## [1] 0.2

Your observation:

The confusion matrix shows how many good/bad credits are classified correctly or incorrectly, and the misclassification rate on the training set is MR_train_unw (lower is better).

4. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

test_prob_unw <- predict(tree_unweighted, newdata = German_test, type = "prob")[, "1"]
test_class_unw <- predict(tree_unweighted, newdata = German_test, type = "class")

head(test_prob_unw)

##         2         4         5         7         8         9 
## 0.8043478 0.2258065 0.7192982 0.8584071 0.3125000 0.8584071

head(test_class_unw)

## 2 4 5 7 8 9 
## 1 0 1 1 0 1 
## Levels: 0 1

Your observation:

We now have predicted probabilities and classes for previously unseen customers in the test set.

5. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

cm_test_unw <- table(Predicted = test_class_unw, Actual = German_test$Class)
cm_test_unw

##          Actual
## Predicted   0   1
##         0  42  36
##         1  48 174

MR_test_unw <- mean(test_class_unw != German_test$Class)
MR_test_unw

## [1] 0.28

Your observation:

The test misclassification rate indicates how well the model generalizes. If MR_test_unw is close to MR_train_unw, the model is not severely overfitting.

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

loss_mat <- matrix(c(0, 2,
1, 0),
nrow = 2, byrow = TRUE)
colnames(loss_mat) <- rownames(loss_mat) <- levels(German_train$Class)
loss_mat

##   0 1
## 0 0 2
## 1 1 0

tree_weighted <- rpart(
Class ~ .,
data   = German_train,
method = "class",
parms  = list(loss = loss_mat),
control = rpart.control(cp = 0.01)
)

rpart.plot(tree_weighted, main = "Weighted Classification Tree (FP cost 2, FN cost 1)")

Your observation:

Compared to the unweighted tree, this tree structure may shift to reduce costly false positives, possibly classifying more cases as bad credit (0).

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

train_prob_w <- predict(tree_weighted, newdata = German_train, type = "prob")[, "1"]
train_class_w <- predict(tree_weighted, newdata = German_train, type = "class")

head(train_prob_w)

##         1         3         6        11        12        13 
## 0.9565217 0.8584071 0.8584071 0.4629630 0.4472050 0.4629630

head(train_class_w)

##  1  3  6 11 12 13 
##  1  1  1  0  0  0 
## Levels: 0 1

Your observation:

The probability predictions change slightly compared to the unweighted model, reflecting the different misclassification costs.

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

cm_train_w <- table(Predicted = train_class_w, Actual = German_train$Class)
cm_train_w

##          Actual
## Predicted   0   1
##         0 153 123
##         1  57 367

MR_train_w <- mean(train_class_w != German_train$Class)
MR_train_w

## [1] 0.2571429

Your observation:

The weighted tree may have a different MR than the unweighted tree; even with a slightly higher MR, it can be preferable if it reduces the more costly type of error (FP).

4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)

library(pROC)

## Warning: package 'pROC' was built under R version 4.5.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc_train_w <- roc(German_train$Class, train_prob_w)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_train_w, col = "blue", main = "ROC Curve - Training Set (Weighted Tree)")

auc_train_w <- auc(roc_train_w)
auc_train_w

## Area under the curve: 0.7598

Your observation:

The AUC summarizes how well the weighted tree separates good vs bad credit on the training data. A higher AUC indicates better discrimination.

5. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

test_prob_w <- predict(tree_weighted, newdata = German_test, type = "prob")[, "1"]
test_class_w <- predict(tree_weighted, newdata = German_test, type = "class")

head(test_prob_w)

##         2         4         5         7         8         9 
## 0.9333333 0.4472050 0.4472050 0.8584071 0.4800000 0.8584071

head(test_class_w)

## 2 4 5 7 8 9 
## 1 0 0 1 0 1 
## Levels: 0 1

Your observation:

We now have predictions from the cost-sensitive tree on unseen test data.

6. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

cm_test_w <- table(Predicted = test_class_w, Actual = German_test$Class)
cm_test_w

##          Actual
## Predicted   0   1
##         0  64  63
##         1  26 147

MR_test_w <- mean(test_class_w != German_test$Class)
MR_test_w

## [1] 0.2966667

Your observation:

Comparing this confusion matrix and MR to the unweighted model shows how the weighting changes the error patterns on the test set (e.g., possibly fewer false positives, more false negatives).

7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)

roc_test_w <- roc(German_test$Class, test_prob_w)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_test_w, col = "red", main = "ROC Curve - Test Set (Weighted Tree)")

auc_test_w <- auc(roc_test_w)
auc_test_w

## Area under the curve: 0.7286

Your observation:

The test AUC for the weighted tree indicates how well the model discriminates on new data; comparing this with the unweighted tree helps pick the better model in practice.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

The unweighted classification tree provided a baseline model with reasonable training and test misclassification rates, but it treated false positives and false negatives equally. After introducing a higher cost for false positives, the weighted tree shifted its decision boundary, generally becoming more conservative in predicting the positive class (good credit).

As a result, the weighted model tended to reduce the number of false positives at the expense of more false negatives. Depending on the business context (e.g., if approving bad credit is more costly than rejecting good credit), the weighted model may be preferable even if its overall misclassification rate is slightly higher. Comparing AUC values for both training and testing sets showed that both models had similar discrimination ability, but their error trade-offs differ, which is critical when setting credit policy.

Homework5

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Split the dataset into training and test set. Please use the random seed as 2025 for reproducibility. (2 pts)

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right. Then Please make a visualization of your fitted tree. (3 pts)

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

4. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

5. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)

5. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

6. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

2. Split the dataset into training and test set. Please use the random seed as `2025` for reproducibility. (2 pts)