Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all others are predictors.
Run the following code only once to install the package
caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) # this package contains the German credit data in its numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert Class to 1 (Good) / 0 (Bad)
# str(GermanCredit)
# optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
set.seed(2025) # set random seed for reproducibility
# split dataset into 80% training and 20% testing
train_ind <- sample(1:nrow(GermanCredit), 0.8 * nrow(GermanCredit))
# creating training and testing datasets
GermanCredit_train <- GermanCredit[train_ind, ]
GermanCredit_test <- GermanCredit[-train_ind,]
# test dimensions
dim(GermanCredit_train)
## [1] 800 49
dim(GermanCredit_test)
## [1] 200 49
# how many "good" and "bad" risks in data:
table(GermanCredit_train$Class)
##
## 0 1
## 246 554
table(GermanCredit_test$Class)
##
## 0 1
## 54 146
Your observation: I split the dataset into 80% training data and 20% testing data. The training set is used to fit the model, while the testing set is used to evaluate how well the model generalizes. The training data contain 246 bad risks and 554 good risks; the testing data contain 54 bad risks and 146 good risks.
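Note that sample() draws rows purely at random, so the Good/Bad proportions can drift slightly between the two sets. If a stratified split is wanted, a minimal sketch using caret's createDataPartition() (already loaded above; strat_ind is an illustrative name):
# stratified alternative: sample within each class so both sets keep the Good/Bad ratio
# Class is numeric 0/1 here, so wrap it in factor() for class-based stratification
strat_ind <- createDataPartition(factor(GermanCredit$Class), p = 0.8, list = FALSE)
GermanCredit_train_strat <- GermanCredit[strat_ind, ]
GermanCredit_test_strat <- GermanCredit[-strat_ind, ]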
library(rpart)
library(rpart.plot)
# fit classification tree model
GermanCredit_rpart_train <- rpart(formula = Class ~ .,
data = GermanCredit_train,
method = "class")
# visualization of fitted tree
prp(GermanCredit_rpart_train)
rpart.plot(GermanCredit_rpart_train, extra = 3)
Your observation: This classification tree splits on predictors of good and bad credit risk. Splits near the root indicate the most influential variables for predicting credit quality; early splits here include checking account status, job history, and key financial factors such as duration and amount.
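Rather than reading importance off the plot, rpart stores a variable importance vector on the fitted object; a quick sketch (the exact variables and magnitudes depend on the fitted tree):
# variable importance: summed improvement contributed by each predictor across splits
head(sort(GermanCredit_rpart_train$variable.importance, decreasing = TRUE))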
# predicted probabilities:
pred_train_prob <- predict(GermanCredit_rpart_train, GermanCredit_train, type = "prob")
# view first several rows
head(pred_train_prob)
## 0 1
## 909 0.1333333 0.8666667
## 460 0.1333333 0.8666667
## 932 0.3272727 0.6727273
## 922 0.1333333 0.8666667
## 961 0.1333333 0.8666667
## 279 0.1333333 0.8666667
# predicted classes:
# note: type = "class" assigns the majority class in the terminal node
pred_train_class <- predict(GermanCredit_rpart_train, GermanCredit_train, type = "class")
# view first several rows
head(pred_train_class)
## 909 460 932 922 961 279
## 1 1 1 1 1 1
## Levels: 0 1
Your observation: The predicted probabilities give the likelihood of being considered a good credit risk; type = "prob" returns the probability of each class, while type = "class" returns the predicted class label.
Example Predicted Probability Interpretation: For observation 909, the model estimates a ~86.67% probability that the individual is a good credit risk.
Example Predicted Class Interpretation: For observation 909, the probability of class 1 (good credit risk) exceeds the default 0.5 cutoff, so the model classifies this individual as a good credit risk.
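As a sanity check, for a tree fit without a loss matrix, type = "class" should agree with thresholding the class-1 probability at 0.5 (up to ties); a minimal sketch of that equivalence (pred_manual is an illustrative name):
# manually threshold P(Class = 1) at 0.5 and cross-tabulate against type = "class"
pred_manual <- ifelse(pred_train_prob[, 2] > 0.5, 1, 0)
table(manual = pred_manual, rpart = pred_train_class)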
actual_value_train <- GermanCredit_train$Class
pred_value_train <- pred_train_class
confusion_matrix_train <- table(actual_value_train, pred_value_train)
# false positives and false negatives
FP_train <- sum(actual_value_train == 0 & pred_value_train == 1)
FN_train <- sum(actual_value_train == 1 & pred_value_train == 0)
# misclassification rate - 2 ways
MR_train1 <- (FP_train + FN_train) / sum(confusion_matrix_train)
MR_train2 <- 1 - sum(diag(confusion_matrix_train))/sum(confusion_matrix_train)
# showing confusion matrix and misclassification rate
confusion_matrix_train
## pred_value_train
## actual_value_train 0 1
## 0 126 120
## 1 36 518
MR_train1
## [1] 0.195
MR_train2
## [1] 0.195
Your observation: The confusion matrix measures the performance of the model by showing the number of false negatives, false positives, true negatives, and true positives. There are 126 true negatives (model and actual are 0/bad), 36 false negatives (model predicts 0/bad but actual is 1/good), 120 false positives (model predicts 1/good but actual is 0/bad), and 518 true positives (model and actual are 1/good). The misclassification rate is 19.5%, indicating that the model makes incorrect predictions 19.5% of the time.
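The misclassification rate hides the asymmetry between the two error types that the matrix makes visible; sensitivity and specificity separate them. A short sketch using the counts above (TP_train and TN_train are illustrative names):
# sensitivity: share of actual good (1) risks predicted good
# specificity: share of actual bad (0) risks predicted bad
TP_train <- sum(actual_value_train == 1 & pred_value_train == 1)
TN_train <- sum(actual_value_train == 0 & pred_value_train == 0)
c(sensitivity = TP_train / (TP_train + FN_train),
  specificity = TN_train / (TN_train + FP_train))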
# use fitted model to return predicted probabilities
pred_test_prob <- predict(GermanCredit_rpart_train, GermanCredit_test, type = "prob")
# view first several rows
head(pred_test_prob)
## 0 1
## 3 0.1333333 0.8666667
## 4 0.7580645 0.2419355
## 7 0.1333333 0.8666667
## 9 0.1333333 0.8666667
## 17 0.1333333 0.8666667
## 21 0.1333333 0.8666667
# use fitted model to return predicted classes:
pred_test_class <- predict(GermanCredit_rpart_train, GermanCredit_test, type = "class")
# view first several rows
head(pred_test_class)
## 3 4 7 9 17 21
## 1 0 1 1 1 1
## Levels: 0 1
Your observation: These predicted probabilities and classes come from the tree model using the testing data and indicate how confident the model is in its predictions.
Example Interpretation of Predicted Probability: For observation 3, the model estimates that there is a ~86.67% probability that the individual is a good credit risk.
Example Interpretation of Predicted Class: For observation 3, the model predicts good credit since the predicted probability of class 1 exceeds the default cutoff of 0.5.
actual_value_test <- GermanCredit_test$Class
pred_value_test <- pred_test_class
confusion_matrix_test <- table(actual_value_test, pred_value_test)
# false positives and false negatives
FP_test <- sum(actual_value_test == 0 & pred_value_test == 1)
FN_test <- sum(actual_value_test == 1 & pred_value_test == 0)
# misclassification rate - 2 ways
MR_test1 <- (FP_test + FN_test) / sum(confusion_matrix_test)
MR_test2 <- 1 - sum(diag(confusion_matrix_test))/sum(confusion_matrix_test)
# showing confusion matrix and misclassification rate
confusion_matrix_test
## pred_value_test
## actual_value_test 0 1
## 0 20 34
## 1 18 128
MR_test1
## [1] 0.26
MR_test2
## [1] 0.26
Your observation: There are 20 true negatives, 18 false negatives, 34 false positives, and 128 true positives. The misclassification rate is 26%, indicating that the model is incorrect 26% of the time. The model has a high true positive rate so it tends to predict good credit risks correctly.
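The high true positive rate can be checked directly from the test counts; a brief sketch (TP_test and TN_test are illustrative names):
# true positive rate and false positive rate on the testing data
TP_test <- sum(actual_value_test == 1 & pred_value_test == 1)
TN_test <- sum(actual_value_test == 0 & pred_value_test == 0)
c(TPR = TP_test / (TP_test + FN_test), # 128 / (128 + 18)
  FPR = FP_test / (FP_test + TN_test)) # 34 / (34 + 20)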
# define cost matrix: rows = actual class (0, 1), columns = predicted class
# note: R fills matrices column-wise, so this gives row 1 = (0, 1) and row 2 = (2, 0);
# misclassifying an actual good (1) as bad (0) costs 2, an actual bad (0) as good (1) costs 1
cost_matrix <- matrix(c(0, 2, 1, 0), nrow = 2)
# weighted tree model
GermanCredit_rpart_weighted <- rpart(Class ~ .,
data = GermanCredit_train,
parms = list(loss = cost_matrix),
method = "class")
# plot
# note: extra = 4 shows the class probabilities at each node
# note: yesno = 2 labels every split with yes/no for readability
rpart.plot(GermanCredit_rpart_weighted, extra = 4, yesno = 2)
Your observation: Assigning a cost matrix changes the objective the tree minimizes: splits and node labels are chosen to reduce expected misclassification cost rather than raw error. With the matrix above, a false negative (an actual good risk predicted as bad) costs 2, while a false positive (an actual bad risk predicted as good) costs 1, so the tree is pushed toward predicting "good" more often than the unweighted tree. Important early predictors in this model include checking account status, duration, and amount.
# predicted probabilities
pred_train_prob_weighted <- predict(GermanCredit_rpart_weighted, newdata = GermanCredit_train, type = "prob")
# view first several rows
head(pred_train_prob_weighted)
## 0 1
## 909 0.1333333 0.8666667
## 460 0.1333333 0.8666667
## 932 0.3571429 0.6428571
## 922 0.1333333 0.8666667
## 961 0.1333333 0.8666667
## 279 0.1333333 0.8666667
# predicted classes
pred_train_class_weighted <- predict(GermanCredit_rpart_weighted, newdata = GermanCredit_train, type = "class")
# view first several rows
head(pred_train_class_weighted)
## 909 460 932 922 961 279
## 1 1 1 1 1 1
## Levels: 0 1
Your observation:
Interpretation of Predicted Probability: For observation 909, there is an ~86.67% probability of being a good credit risk. With the weighted cost matrix, the predicted class is the one with the lower expected weighted cost rather than simply the majority class.
Interpretation of Predicted Class: For observation 909, the individual is predicted to be a good credit risk. Because a false negative costs twice as much as a false positive here, the model labels a node "good" at a lower probability threshold than the unweighted tree.
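The expected-cost rule can be made concrete: a node is labeled 1 when the expected cost of predicting 0 (2 x P(1)) exceeds the expected cost of predicting 1 (1 x P(0)), i.e. when P(1) > 1/3. A sketch of that check (cutoff_weighted and pred_manual_weighted are illustrative names; agreement with type = "class" is expected up to ties, since rpart implements the loss through altered priors):
# implied probability cutoff from the loss matrix: L[1,2] / (L[1,2] + L[2,1]) = 1/3
cutoff_weighted <- cost_matrix[1, 2] / (cost_matrix[1, 2] + cost_matrix[2, 1])
pred_manual_weighted <- ifelse(pred_train_prob_weighted[, 2] > cutoff_weighted, 1, 0)
table(manual = pred_manual_weighted, rpart = pred_train_class_weighted)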
actual_value_train_weighted <- GermanCredit_train$Class
pred_value_train_weighted <- pred_train_class_weighted
confusion_matrix_train_weighted <- table(actual_value_train_weighted, pred_value_train_weighted)
# false positives and false negatives
FP_train_weighted <- sum(actual_value_train_weighted == 0 & pred_value_train_weighted == 1)
FN_train_weighted <- sum(actual_value_train_weighted == 1 & pred_value_train_weighted == 0)
# misclassification rate - 2 ways
MR_train1_weighted <- (FP_train_weighted + FN_train_weighted) / sum(confusion_matrix_train_weighted)
MR_train2_weighted <- 1 - sum(diag(confusion_matrix_train_weighted))/sum(confusion_matrix_train_weighted)
# showing confusion matrix and misclassification rate
confusion_matrix_train_weighted
## pred_value_train_weighted
## actual_value_train_weighted 0 1
## 0 98 148
## 1 14 540
MR_train1_weighted
## [1] 0.2025
MR_train2_weighted
## [1] 0.2025
Your observation: There are 98 true negatives, 14 false negatives, 148 false positives, and 540 true positives. The misclassification rate of 0.2025 is slightly higher than the unweighted tree's 0.195, but this is expected: the weighted tree minimizes expected cost rather than raw error, trading extra false positives for far fewer false negatives (14 vs. 36).
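Since the weighted tree optimizes cost rather than error, the fairer comparison between the two models is total weighted training cost; a sketch using the counts computed above (cost_unweighted and cost_weighted are illustrative names):
# total weighted training cost: 1 per false positive, 2 per false negative
cost_unweighted <- 1 * FP_train + 2 * FN_train
cost_weighted <- 1 * FP_train_weighted + 2 * FN_train_weighted
c(unweighted = cost_unweighted, weighted = cost_weighted)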
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# generate ROC curve
roc_obj_train_weighted <- roc(actual_value_train_weighted, pred_train_prob_weighted[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# plot ROC curve
plot(roc_obj_train_weighted)
# calculate AUC
auc_train_weighted <- auc(actual_value_train_weighted, pred_train_prob_weighted[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# view AUC
auc_train_weighted
## Area under the curve: 0.7797
Your observation: The ROC curve plots the true positive rate against the false positive rate at every probability cutoff, and the AUC summarizes the curve as a single number. The curve sits above the chance diagonal, so the model has genuine discriminatory power. The AUC of 0.7797 is well above 0.5, meaning the model separates good and bad risks considerably better than random guessing.
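For comparison, the same pROC workflow applies to the unweighted tree on the training data; a brief sketch (roc_obj_train is an illustrative name):
# ROC/AUC for the unweighted tree, to compare against the weighted tree's 0.7797
roc_obj_train <- roc(actual_value_train, pred_train_prob[, 2])
auc(roc_obj_train)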
# predicted probabilities
pred_test_prob_weighted <- predict(GermanCredit_rpart_weighted, newdata = GermanCredit_test, type = "prob")
# view first several rows
head(pred_test_prob_weighted)
## 0 1
## 3 0.1333333 0.86666667
## 4 0.9333333 0.06666667
## 7 0.1333333 0.86666667
## 9 0.1333333 0.86666667
## 17 0.1333333 0.86666667
## 21 0.1333333 0.86666667
# predicted classes
pred_test_class_weighted <- predict(GermanCredit_rpart_weighted, newdata = GermanCredit_test, type = "class")
# view first several rows
head(pred_test_class_weighted)
## 3 4 7 9 17 21
## 1 0 1 1 1 1
## Levels: 0 1
Your observation: Each test observation is assigned a probability of being a good (1) or bad (0) credit risk, and the predicted class is the one that minimizes the expected weighted cost.
Example Interpretation of Predicted Probability: For observation 3, the model estimates that there is a ~86.67% probability that the individual is a good credit risk.
Example Interpretation of Predicted Class: For observation 3, the model predicts good credit since the predicted probability of class 1 is well above the cost-adjusted cutoff of 1/3.
# actual and predicted values
actual_value_test_weighted <- GermanCredit_test$Class
pred_value_test_weighted <- pred_test_class_weighted
# confusion matrix
confusion_matrix_test_weighted <- table(actual_value_test_weighted, pred_value_test_weighted)
# false positives and false negatives
FP_test_weighted <- sum(actual_value_test_weighted == 0 & pred_value_test_weighted == 1)
FN_test_weighted <- sum(actual_value_test_weighted == 1 & pred_value_test_weighted == 0)
# misclassification rate - 2 ways
MR_test1_weighted <- (FP_test_weighted + FN_test_weighted) / sum(confusion_matrix_test_weighted)
MR_test2_weighted <- 1 - sum(diag(confusion_matrix_test_weighted))/sum(confusion_matrix_test_weighted)
# showing confusion matrix and misclassification rate
confusion_matrix_test_weighted
## pred_value_test_weighted
## actual_value_test_weighted 0 1
## 0 13 41
## 1 10 136
MR_test1_weighted
## [1] 0.255
MR_test2_weighted
## [1] 0.255
Your observation: There are 13 true negatives, 10 false negatives, 41 false positives, and 136 true positives. The misclassification rate is 0.255, indicating that 25.5% of test outcomes are incorrectly predicted, marginally better than the unweighted tree's 26%.
# generate ROC curve
roc_obj_test_weighted <- roc(actual_value_test_weighted, pred_test_prob_weighted[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# plot ROC curve
plot(roc_obj_test_weighted)
# calculate AUC
auc_test_weighted <- auc(actual_value_test_weighted, pred_test_prob_weighted[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# view AUC
auc_test_weighted
## Area under the curve: 0.7149
Your observation: The ROC curve plots the true positive rate against the false positive rate, and the AUC summarizes it as a single number. The curve lies above the chance diagonal, so the model retains discriminatory power on unseen data. The AUC of 0.7149 is well above 0.5 (random guessing) but below the training AUC of 0.7797, a modest sign of overfitting.
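Plotting the two weighted-tree ROC curves on the same axes makes the drop from 0.7797 to 0.7149 directly visible; a sketch using pROC's add argument:
# overlay: training ROC solid, testing ROC dashed
plot(roc_obj_train_weighted)
plot(roc_obj_test_weighted, add = TRUE, lty = 2)
legend("bottomright", legend = c("train (AUC 0.7797)", "test (AUC 0.7149)"), lty = c(1, 2))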
I split the dataset into 80% training and 20% testing data.
For the tree model without weighted cost, important predictors included checking account status, duration, and credit amount. The training misclassification rate was 19.5%, with a high true positive rate on good credit risks; the testing misclassification rate was 26%, somewhat worse than on the training data.
For the tree model with weighted cost, false negatives (good risks predicted as bad) were assigned twice the cost of false positives, which made the model more willing to predict good credit risks. The training MR was 20.25%, and the testing MR was higher at 25.5%. The training AUC was 0.7797 and the testing AUC was 0.7149, indicating reasonable discriminatory ability on both sets.
Overall, both models identify good risks effectively, but the weighted tree shifts its predictions to reflect the asymmetric costs, trading a few additional false positives for far fewer false negatives.