Jack Prangle

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data is provided in that package.

install.packages('caret')

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) 
# this package contains the German credit data in its numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good")  # convert `Class` to 1 (Good) / 0 (Bad)
# str(GermanCredit)
# optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Split the dataset into a training and a test set. Please use 2025 as the random seed for reproducibility. (2 pts)

set.seed(2025)
training_index <- sample(1 : nrow(GermanCredit), round(0.8 * nrow(GermanCredit))) 
German_training <- GermanCredit[training_index, ]
German_testing <- GermanCredit[-training_index, ]
dim(German_training)
## [1] 800  49
dim(German_testing)
## [1] 200  49

Your observation:

The split leaves 800 observations and 49 variables in German_training, with the remaining 200 observations (and the same 49 variables) in German_testing.
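As an aside, caret also offers a stratified split that keeps the Good/Bad proportion similar in both sets. This is only an optional sketch under the same 80/20 split, not part of the required answer; the object names idx, strat_training, and strat_testing are mine.

set.seed(2025)
# sketch: stratified 80/20 split; createDataPartition preserves the class proportions
idx <- createDataPartition(factor(GermanCredit$Class), p = 0.8, list = FALSE)
strat_training <- GermanCredit[idx, ]
strat_testing  <- GermanCredit[-idx, ]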

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right. Then please make a visualization of your fitted tree. (3 pts)

library(rpart)
library(rpart.plot)
train1 <- German_training

dim(train1)
## [1] 800  49
tree1 <- rpart(Class ~ ., data = train1, method = "class")  # method = "class" fits a classification tree

rpart.plot(tree1, main = "Tree 1")

Your observation:

Tree 1 is fairly deep, with many splits and a large number of distinct paths from the root down to the terminal nodes.
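Because Tree 1 is so bushy, it can help to look at rpart's built-in cross-validation table and prune. This is a minimal sketch using tree1 from above; the pruning choice is purely illustrative and tree1_pruned is my name.

printcp(tree1)                                              # cross-validated error for each complexity parameter
best_cp <- tree1$cptable[which.min(tree1$cptable[, "xerror"]), "CP"]
tree1_pruned <- prune(tree1, cp = best_cp)                  # sketch: prune back to the cp with the lowest xerror
rpart.plot(tree1_pruned, main = "Tree 1 (pruned)")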

2. Use the training set to get predicted probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_train_full <- predict(tree1, newdata = German_training)  # with no type argument, predict.rpart returns the class-probability matrix for a classification tree
summary(pred_train_full)
##        0                 1         
##  Min.   :0.09091   Min.   :0.0000  
##  1st Qu.:0.13333   1st Qu.:0.6727  
##  Median :0.18750   Median :0.8125  
##  Mean   :0.30750   Mean   :0.6925  
##  3rd Qu.:0.32727   3rd Qu.:0.8667  
##  Max.   :1.00000   Max.   :0.9091

Your observation:

The summary describes the distribution (minimum, quartiles, mean, maximum) of the predicted probabilities: the first column is P(Class = 0) and the second is P(Class = 1) for each training observation, and the two columns mirror each other because they sum to 1.
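The question also asks for predicted classes, while the predict() call above returns only the probability matrix. A minimal sketch of both at the default cutoff; the names pred_train_prob and pred_train_class are mine.

pred_train_prob  <- predict(tree1, newdata = German_training, type = "prob")   # two-column probability matrix
pred_train_class <- predict(tree1, newdata = German_training, type = "class")  # classes at the default 0.5 cutoff
table(pred_train_class)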

3. Obtain the confusion matrix and MR on the training set (Please use the predicted class from the previous question). (2 pts)

# note: this is an average squared difference between Class and the predicted-probability matrix,
# not a confusion matrix, and the MR line divides a fixed count (4 + 1) by it
training_matrix <- mean( (German_training$Class - pred_train_full)^2) 
MR = (4 + 1)/sum(training_matrix)

training_matrix
## [1] 0.3499069
MR
## [1] 14.28952

Your observation:

The quantity stored in training_matrix is 0.3499 and the MR line prints 14.29. A misclassification rate cannot exceed 1, so this formula does not measure what its name suggests; the sketch below shows the usual table-based confusion matrix and MR.
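A minimal sketch of the standard confusion matrix and misclassification rate, assuming pred_train_class from the sketch in the previous step; the names conf_train and mr_train are mine.

conf_train <- table(true = German_training$Class, pred = pred_train_class)  # rows = true class, columns = predicted class
conf_train
mr_train <- mean(as.character(pred_train_class) != as.character(German_training$Class))  # share of misclassified cases
mr_train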

4. Use the testing set to get predicted probabilities and classes (Please use the default cutoff probability). (2 pts)

train2 <- German_testing

dim(train2)
## [1] 200  49
tree2 <- rpart(Class ~ ., data = train2, method = "class")  # note: this refits a new tree on the testing data itself

rpart.plot(tree2, main = "Tree 2")

pred_test_full <- predict(tree2, newdata = German_testing)
summary(pred_test_full)
##        0                 1         
##  Min.   :0.07143   Min.   :0.1111  
##  1st Qu.:0.08434   1st Qu.:0.7600  
##  Median :0.18750   Median :0.8125  
##  Mean   :0.27000   Mean   :0.7300  
##  3rd Qu.:0.24000   3rd Qu.:0.9157  
##  Max.   :0.88889   Max.   :0.9286

Your observation:

Tree 2, refit on the testing data, is noticeably smaller and easier to read than Tree 1, with far fewer splits. Its predicted probabilities come back in the same two-column format as the training output, just with different values.
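An alternative reading of this step is to keep the tree fitted on the training set and only score the testing set with it, rather than refitting a new tree on the test data. A minimal sketch; the object names are mine.

pred_test_prob  <- predict(tree1, newdata = German_testing, type = "prob")   # P(Class = 0) and P(Class = 1) on the held-out set
pred_test_class <- predict(tree1, newdata = German_testing, type = "class")  # classes at the default cutoff
summary(pred_test_prob[, 2])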

5. Obtain the confusion matrix and MR on the testing set (Please use the predicted class from the previous question). (2 pts)

# note: same squared-difference quantity and fixed-count ratio as on the training set
testing_matrix <- mean( (German_testing$Class - pred_test_full)^2) 
MR = (4 + 1)/sum(testing_matrix)

testing_matrix
## [1] 0.3692797
MR
## [1] 13.53987

Your observation:

The testing quantity is 0.3693 versus 0.3499 on the training set, and the MR line prints 13.54. As before, a value above 1 cannot be a misclassification rate; the sketch below gives the table-based versions.
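And the matching table-based versions on the testing set, assuming pred_test_class from the previous sketch; conf_test and mr_test are my names.

conf_test <- table(true = German_testing$Class, pred = pred_test_class)
conf_test
mr_test <- mean(as.character(pred_test_class) != as.character(German_testing$Class))
mr_test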

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

# weighted misclassification cost: false negatives weighted by weight1, false positives by weight0
cost <- function(r, phat) {
  weight1 <- 2
  weight0 <- 1
  pcut <- weight0 / (weight1 + weight0)   # cutoff implied by the weights (1/3)
  FN <- sum((r == 1) & (phat < pcut))     # true class 1 predicted below the cutoff
  FP <- sum((r == 2) & (phat > pcut))     # note: Class is coded 0/1 here, so r == 2 never matches and FP stays 0
  MR <- (weight0 * FP + weight1 * FN)/length(phat)
  return(MR)
}

cost(r = German_training$Class, phat = pred_train_full[, 2])
## [1] 0.09

Your observation:

Applying the weighted cost function to the training probabilities gives 0.09. Note that with Class coded 0/1, the r == 2 condition never matches, so the FP term contributes nothing and the value is driven entirely by the weighted false negatives.
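The task statement asks for a tree with the costs built into the fit (weight 2 on false positives, weight 1 on false negatives). One way to do that directly in rpart is a loss matrix. The sketch below assumes "positive" means Class = 1 (Good), so a false positive is a true 0 predicted as 1; loss_mat and tree_weighted are my names.

# sketch: classification tree with an asymmetric loss matrix
# rows = true class (0, 1), columns = predicted class (0, 1), filled column by column:
# misclassifying a true 0 as 1 (FP) costs 2, misclassifying a true 1 as 0 (FN) costs 1
loss_mat <- matrix(c(0, 1, 2, 0), nrow = 2)
tree_weighted <- rpart(as.factor(Class) ~ ., data = German_training,
                       method = "class", parms = list(loss = loss_mat))
rpart.plot(tree_weighted, main = "Tree with weighted class costs")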

2. Use the training set to get predicted probabilities and classes (Please use the default cutoff probability). (2 pts)

library(ipred)
German_bagging <- bagging(as.factor(Class) ~ .,
                          data = German_training, nbagg = 500)

pred_credit_train <- predict(German_bagging, German_training, type="class")

pred_credit_train
##   [1] 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 1 1 1
##  [38] 1 1 1 0 1 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0
##  [75] 1 1 1 0 1 1 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1
## [112] 0 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1
## [149] 0 1 1 0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 0 1
## [186] 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0
## [223] 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 1 1
## [260] 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## [297] 1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1 0 0
## [334] 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0
## [371] 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1 1 1 1 0
## [408] 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1
## [445] 1 1 1 0 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1
## [482] 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1
## [519] 0 0 0 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 1 1 1 1 1
## [556] 0 1 1 0 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0
## [593] 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1
## [630] 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0
## [667] 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 1 1
## [704] 1 1 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## [741] 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 1
## [778] 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1
## Levels: 0 1

Your observation:

I used bagging from the ipred package here to keep the code simple. The result is a vector of 800 predicted classes, one per training observation, stored as a factor with levels 0 and 1.
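Since this question also asks for predicted probabilities, the same bagged model can return a probability matrix, with classes derived at the default 0.5 cutoff. A short sketch; the object names are mine.

prob_credit_train  <- predict(German_bagging, German_training, type = "prob")  # per-class probabilities
class_credit_train <- ifelse(prob_credit_train[, 2] > 0.5, 1, 0)               # classes at the default 0.5 cutoff
table(class_credit_train)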

3. Obtain the confusion matrix and MR on the training set (Please use the predicted class from the previous question). (2 pts)

Cmatrix_train = table(true = German_training$Class,
                      pred = pred_credit_train)

MR = (4 + 1)/sum(Cmatrix_train)  # note: the numerator is hard-coded rather than read from the table's off-diagonal cells

Cmatrix_train
##     pred
## true   0   1
##    0 246   0
##    1   0 554
MR
## [1] 0.00625

Your observation:

The training confusion matrix above is perfectly diagonal (246 and 554 correct, no errors), which is expected when a bagged model is evaluated on its own training data. The MR line prints 0.00625, but its numerator is hard-coded; the table itself implies a training misclassification rate of 0, as in the sketch below.
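For reference, the table-based misclassification rate comes straight from the off-diagonal cells of the confusion matrix just printed; a short sketch.

mr_bag_train <- 1 - sum(diag(Cmatrix_train)) / sum(Cmatrix_train)  # off-diagonal cases / total cases
mr_bag_train                                                       # 0 here, since the training table has no errors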

4. Obtain the ROC and AUC on the training set (use predicted probabilities). (2 pts)

library(pROC)
pred_prob_credit_train <- predict(German_bagging, German_training, type = "prob")
auc( German_training$Class, pred_prob_credit_train[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 1
plot( roc(German_training$Class, pred_prob_credit_train[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Your observation:

The training AUC is exactly 1 and the ROC curve hugs the top-left corner: the bagged model separates the training classes perfectly, which matches the error-free training confusion matrix and reflects evaluation on the data the model was fit to.

5. Use the testing set to get predicted probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_credit_test <- predict(German_bagging, German_testing, type="class")

pred_credit_test
##   [1] 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
##  [38] 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
##  [75] 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0
## [112] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## [149] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1
## Levels: 0 1

Your observation:

Same format as pred_credit_train, except now there are 200 predictions, one per testing observation, again as a factor with levels 0 and 1.

6. Obtain the confusion matrix and MR on the testing set (Please use the predicted class from the previous question). (2 pts)

Cmatrix_test = table(true = German_testing$Class,
                      pred = pred_credit_test)

MR = (4 + 1)/sum(Cmatrix_test)

Cmatrix_test
##     pred
## true   0   1
##    0  26  28
##    1  10 136
MR
## [1] 0.025

Your observation:

The testing confusion matrix above shows 28 false positives and 10 false negatives. The MR line prints 0.025, but with the hard-coded numerator; the off-diagonal cells of the table imply (28 + 10) / 200 = 0.19, as in the sketch below.
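The same off-diagonal formula applied to Cmatrix_test gives the test-set misclassification rate.

mr_bag_test <- 1 - sum(diag(Cmatrix_test)) / sum(Cmatrix_test)  # (28 + 10) / 200 = 0.19
mr_bag_test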

7. Obtain the ROC and AUC on the testing set (use predicted probabilities). (2 pts)

pred_prob_credit_test <- predict(German_bagging, German_testing, type = "prob")
auc( German_testing$Class, pred_prob_credit_test[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7852
plot( roc(German_testing$Class, pred_prob_credit_test[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Your observation:

This time the area under the curve is not 1 but 0.7852, and the ROC curve sits visibly farther from the top-left corner than the training curve did. This is the more realistic picture of the model's discriminative ability on new data.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

The training and testing sets give noticeably different results. On the training set the bagged model looks perfect: an error-free confusion matrix and an AUC of 1. On the smaller testing set the picture is more realistic: the confusion matrix contains 28 false positives and 10 false negatives and the AUC drops to 0.7852. The gap between the two is a reminder that performance measured on the data a model was fit to is optimistic, and that the testing-set numbers are the ones to trust when judging how the tree and bagging models would behave on new credit applicants.