Jack Prangle
Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all the other variables are predictors.
Run the following code only once to install the package
caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret)
# this package contains the German credit data in numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good")  # recode Class as 1 (Good) or 0 (Bad)
# str(GermanCredit)
# optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
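An optional sanity check (not required by the assignment) to confirm the recoding and the remaining number of columns before splitting:
# optional check: Class should now be 0/1 (1 = Good), with 49 columns remaining
table(GermanCredit$Class)
dim(GermanCredit)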
Set the seed to 2025 for reproducibility.
set.seed(2025)
training_index <- sample(1 : nrow(GermanCredit), round(0.8 * nrow(GermanCredit)))
German_training <- GermanCredit[training_index, ]
German_testing <- GermanCredit[-training_index, ]
dim(German_training)
## [1] 800 49
dim(German_testing)
## [1] 200 49
Your observation:
The split leaves 800 observations in German_training and the remaining 200 in German_testing, each with the same 49 variables.
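As an additional hedged check on the split, the share of good loans (Class = 1) should be roughly similar in the two subsets if the random split is reasonably balanced:
# optional: class proportions in each split
prop.table(table(German_training$Class))
prop.table(table(German_testing$Class))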
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
train1 <- German_training
dim(train1)
## [1] 800 49
# method = "class" fits a classification tree
tree1 <- rpart(Class ~ ., data = train1, method = "class")
par(mfrow = c(1, 2))
rpart.plot(tree1, main = "Tree 1")
Your observation:
Tree 1 is fairly complex: it has many splits and a large number of distinct paths from the root down to the terminal nodes.
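One optional way to judge whether Tree 1 is over-grown is rpart's built-in cross-validation table; the sketch below (best_cp and tree1_pruned are names used only here) prunes at the cp value with the smallest cross-validated error.
# optional sketch: inspect the complexity-parameter table and prune
printcp(tree1)
best_cp <- tree1$cptable[which.min(tree1$cptable[, "xerror"]), "CP"]
tree1_pruned <- prune(tree1, cp = best_cp)
rpart.plot(tree1_pruned, main = "Tree 1 (pruned)")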
pred_train_full <- predict(tree1, newdata = German_training)
summary(pred_train_full)
## 0 1
## Min. :0.09091 Min. :0.0000
## 1st Qu.:0.13333 1st Qu.:0.6727
## Median :0.18750 Median :0.8125
## Mean :0.30750 Mean :0.6925
## 3rd Qu.:0.32727 3rd Qu.:0.8667
## Max. :1.00000 Max. :0.9091
Your observation:
The summary reports the minimum, quartiles, mean, and maximum of the predicted probabilities separately for each class. The two columns are complementary: each row of the probability matrix sums to 1, which is why the column means (0.3075 and 0.6925) add up to 1.
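Since pred_train_full is a two-column matrix of class probabilities, an optional sketch of turning it into hard labels with a 0.5 cutoff and cross-tabulating them against the observed classes (pred_train_label is a name used only here):
# optional: hard labels from the class-1 probability, cutoff 0.5
pred_train_label <- as.numeric(pred_train_full[, 2] > 0.5)
table(true = German_training$Class, pred = pred_train_label)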
training_matrix <- mean( (German_training$Class - pred_train_full)^2)
MR = (4 + 1)/sum(training_matrix)
training_matrix
## [1] 0.3499069
MR
## [1] 14.28952
Your observation:
The mean squared difference between the observed classes and the predicted probability matrix on the training data is 0.3499069, and the MR value computed from it is 14.28952.
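As an aside, a more common variant of this error measure (a hedged sketch, not what was computed above) uses only the second column, i.e. the estimated probability of Class = 1, which gives the usual Brier-score style mean squared error:
# optional variant: squared error against the probability of class 1 only
mean((German_training$Class - pred_train_full[, 2])^2)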
train2 <- German_testing  # note: tree2 below is fitted on the testing data itself
dim(train2)
## [1] 200 49
tree2 <- rpart(Class ~ ., data = train2, method = "class")
par(mfrow = c(1, 2))
rpart.plot(tree2, main = "Tree 2")
pred_test_full <- predict(tree2, newdata = German_testing)
summary(pred_test_full)
## 0 1
## Min. :0.07143 Min. :0.1111
## 1st Qu.:0.08434 1st Qu.:0.7600
## Median :0.18750 Median :0.8125
## Mean :0.27000 Mean :0.7300
## 3rd Qu.:0.24000 3rd Qu.:0.9157
## Max. :0.88889 Max. :0.9286
Your observation:
Tree 2, fitted on the testing data, is noticeably simpler than Tree 1, with fewer splits and fewer terminal nodes. Its predicted probabilities are summarized in the same two-column format as for the training data, but the values themselves differ.
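For comparison, an optional sketch (not what the output above shows): instead of refitting a tree on the testing data, the tree fitted on the training data can be scored directly on the held-out observations, which is the more usual way to measure out-of-sample performance. The object name pred_test_from_tree1 is used only here.
# optional: score the training-set tree (tree1) on the held-out testing data
pred_test_from_tree1 <- predict(tree1, newdata = German_testing)
summary(pred_test_from_tree1)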
testing_matrix <- mean( (German_testing$Class - pred_test_full)^2)
MR = (4 + 1)/sum(testing_matrix)
testing_matrix
## [1] 0.3692797
MR
## [1] 13.53987
Your observation:
The testing-data value is 0.3692797, slightly higher than the 0.3499069 obtained on the training data; the corresponding MR is 13.53987.
cost <- function(r, phat) {
  # asymmetric costs: a false negative is twice as costly as a false positive
  weight1 <- 2
  weight0 <- 1
  pcut <- weight0 / (weight1 + weight0)   # cutoff = 1/3
  FN <- sum((r == 1) & (phat < pcut))     # truly 1, predicted 0
  FP <- sum((r == 0) & (phat > pcut))     # truly 0, predicted 1
  MR <- (weight0 * FP + weight1 * FN) / length(phat)
  return(MR)
}
cost(r = German_training$Class, phat = pred_train_full[, 2])
## [1] 0.09
Your observation:
Using the asymmetric cost function with a probability cutoff of 1/3, the reported weighted misclassification cost on the training data is 0.09.
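The same cost function can be applied to the held-out data as well; a minimal sketch (its value is not part of the output above), reusing pred_test_full from the earlier chunk:
# optional: asymmetric cost on the testing data, using the class-1 probabilities
cost(r = German_testing$Class, phat = pred_test_full[, 2])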
library(ipred)
## Warning: package 'ipred' was built under R version 4.5.2
German_bagging <- bagging(as.factor(Class) ~ ., data = German_training,
                          nbagg = 500)
pred_credit_train <- predict(German_bagging, German_training, type="class")
pred_credit_train
## [1] 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 1 1 1
## [38] 1 1 1 0 1 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0
## [75] 1 1 1 0 1 1 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1
## [112] 0 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1
## [149] 0 1 1 0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 0 1
## [186] 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0
## [223] 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 1 1
## [260] 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## [297] 1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1 0 0
## [334] 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0
## [371] 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1 1 1 1 0
## [408] 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1
## [445] 1 1 1 0 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1
## [482] 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1
## [519] 0 0 0 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 1 1 1 1 1
## [556] 0 1 1 0 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0
## [593] 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1
## [630] 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0
## [667] 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 1 1
## [704] 1 1 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## [741] 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 1
## [778] 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1
## Levels: 0 1
Your observation:
I used the bagging method from ipred because it keeps the code simple. The result is a vector of 800 predicted classes for the training observations, each taking one of the two levels, 0 or 1.
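If an honest error estimate for the bagged model is wanted without touching the test set, ipred's bagging can report an out-of-bag estimate when coob = TRUE is supplied; a hedged sketch (this refits the 500 trees, and German_bagging_oob is a name used only here):
# optional sketch: refit with out-of-bag error estimation
German_bagging_oob <- bagging(as.factor(Class) ~ ., data = German_training,
                              nbagg = 500, coob = TRUE)
German_bagging_oob$err  # out-of-bag misclassification error estimate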
Cmatrix_train = table(true = German_training$Class,
                      pred = pred_credit_train)
# misclassification rate = off-diagonal counts / total observations
MR = (Cmatrix_train[1, 2] + Cmatrix_train[2, 1]) / sum(Cmatrix_train)
Cmatrix_train
## pred
## true 0 1
## 0 246 0
## 1 0 554
MR
## [1] 0
Your observation:
The bagged model classifies every training observation correctly: the confusion matrix has no off-diagonal counts, so the training misclassification rate is 0.
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
pred_prob_credit_train <- predict(German_bagging, German_training, type = "prob")
auc( German_training$Class, pred_prob_credit_train[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 1
plot( roc(German_training$Class, pred_prob_credit_train[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Your observation:
The in-sample area under the curve is exactly 1, and the ROC curve runs straight along the top-left corner of the plot. A perfect in-sample AUC suggests the bagged trees essentially memorize the training data, so the test-set results below are the more meaningful measure of performance.
pred_credit_test <- predict(German_bagging, German_testing, type="class")
pred_credit_test
## [1] 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## [38] 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
## [75] 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0
## [112] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## [149] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1
## Levels: 0 1
Your observation:
Same as pred_credit_train, except that these are the 200 predicted classes for the testing observations, again taking the levels 0 and 1.
Cmatrix_test = table(true = German_testing$Class,
                     pred = pred_credit_test)
# misclassification rate = off-diagonal counts / total observations
MR = (Cmatrix_test[1, 2] + Cmatrix_test[2, 1]) / sum(Cmatrix_test)
Cmatrix_test
## pred
## true 0 1
## 0 26 28
## 1 10 136
MR
## [1] 0.19
Your observation:
On the testing data the picture changes: the confusion matrix shows 28 false positives and 10 false negatives, which gives a misclassification rate of 0.19, far higher than on the training data.
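Two more rates can be read directly off Cmatrix_test; a small optional sketch (treating Class = 1, i.e. Good, as the positive class; sens and spec are names used only here):
# optional: sensitivity and specificity from the test confusion matrix
sens <- Cmatrix_test[2, 2] / sum(Cmatrix_test[2, ])  # true 1s predicted as 1
spec <- Cmatrix_test[1, 1] / sum(Cmatrix_test[1, ])  # true 0s predicted as 0
c(sensitivity = sens, specificity = spec)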
pred_prob_credit_test <- predict(German_bagging, German_testing, type = "prob")
auc( German_testing$Class, pred_prob_credit_test[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7852
plot( roc(German_testing$Class, pred_prob_credit_test[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Your observation:
This time the area under the curve is not 1 but 0.7852, and the test ROC curve is visibly rougher and sits further from the top-left corner than the training curve did.
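To see the gap between in-sample and out-of-sample performance in a single figure, an optional sketch that overlays the two ROC curves (roc_train and roc_test are names used only here):
# optional: training and testing ROC curves on the same axes
roc_train <- roc(German_training$Class, pred_prob_credit_train[, 2])
roc_test <- roc(German_testing$Class, pred_prob_credit_test[, 2])
plot(roc_train, col = "grey60")
plot(roc_test, col = "black", add = TRUE)
legend("bottomright", legend = c("training", "testing"),
       col = c("grey60", "black"), lwd = 2)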
Both the training and the testing runs on the German data gave noticeably different results. Interestingly, German_training, which has far more observations than German_testing, produced the cleaner picture: a smooth ROC curve, an area under the curve of 1, and essentially no misclassifications. German_testing, with only 200 observations, gave a rougher ROC curve, a lower AUC, and a much higher error rate. This gap between in-sample and out-of-sample performance is the usual sign that the model fits the training data more closely than it will fit new data.