Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : num 1 0 1 1 0 1 1 1 1 0 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: Upon looking at the structure of the dataset, I see that majority of the variables are binary. On the other hand, there are some variables that are non-binary such as duration, amount, and installmentratepercentage to name a few.
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
#summary statistics
summary(GermanCredit)
#head
head(GermanCredit)
#structure
str(GermanCredit)
#correlation matrix
correlation_matrix <- cor(GermanCredit)
print(correlation_matrix)
Your observation: When looking at the summary statistics of the dataset, I can see that the “amount” variable has the highest minimum, 1st quartile, median, mean, 3rd quartile, and maximum. In addition, “age” is the second variable to have the highest value for each characteristic, and then “duration” is the third variable. Majority of the other variables have values between 0 and 1. After running the code that dropped the variables that provided no information, the variables went down from 62 to 49. When looking at a correlation matrix, I can see that the class variable is not strongly correlated with other variables in the dataset.
2023 for
reproducibility.set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.60)
germancredit.train = GermanCredit[index,]
germancredit.test = GermanCredit[-index,]
Your observation: After splitting the dataset, I see that the training set has 600 observations and the testing set has 400 observations.
library(rpart)
library(rpart.plot)
# fit the model
fit_tree <- rpart(as.factor(Class) ~ ., data=germancredit.train)
rpart.plot(fit_tree,extra=4, yesno=2)
Your observation: According to the tree model, it appears that majority of the leaf nodes that represent “yes” have a higher number of observations that are predicted to be in class 0. Whereas in the leaf nodes that represent “no”, majority of them have a higher number of observations that are predicted to be in class 0.
pred_germancredit_train <- predict(fit_tree, germancredit.train, type="class")
summary(pred_germancredit_train)
## 0 1
## 142 458
Your observation: After obtaining the predicted classes, there appears to be a class imbalance as class 1 has significantly more values.
# Confusion matrix
Cmatrix_train = table(true = germancredit.train$Class,
pred = pred_germancredit_train)
Cmatrix_train
## pred
## true 0 1
## 0 104 71
## 1 38 387
# MR
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.1816667
Your observation: In the confusion matrix, there’s more number of true negatives in the actual negative. And in the actual positive, there’s much more number of true positives. The MR is 0.18, which means that 18% of the instances are being misclassified by the model. I would consider this to be a relatively low MR.
pred_germancredit_test <- predict(fit_tree, germancredit.test, type="class")
summary(pred_germancredit_test)
## 0 1
## 102 298
Your observation: After obtaining the predicted classes, there appears to be a class imbalance as class 1 has more values.
# Confusion matrix
Cmatrix_test = table(true = germancredit.test$Class,
pred = pred_germancredit_test)
Cmatrix_test
## pred
## true 0 1
## 0 55 70
## 1 47 228
# MR
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.2925
Your observation: In the confusion matrix, there’s more number of false positives in the actual negative. And in the actual positives, there’s much more number of true positives. The MR is 0.29, which means that 29% of the instances are being misclassified by the model. I would consider to be a moderately high MR.
cost_matrix <- matrix(c(0, 2, # cost of 1 for FP
1, 0), # cost of 5 for FN
byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(Class) ~ ., data=germancredit.train,
parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)
Your observation: After creating a tree model with a weighted cost, it appears that there’s less leaf nodes on the left side in this model compared to the previous tree model.
# probabilities
pred_prob_train = predict(fit_tree_asym, germancredit.train, type = "prob")
pred_prob_train = pred_prob_train[,"1"]
# classes
pred_germancredit_train2 <- predict(fit_tree_asym, germancredit.train,
type = "class")
summary(pred_germancredit_train2)
## 0 1
## 193 407
Your observation: There appears to be a class imbalance as class 1 has significantly more values.
# Confusion matrix
Cmatrix_train2 = table(true = germancredit.train$Class,
pred = pred_germancredit_train2)
Cmatrix_train2
## pred
## true 0 1
## 0 122 53
## 1 71 354
# MR
1 - sum(diag(Cmatrix_train2))/sum(Cmatrix_train2)
## [1] 0.2066667
Your observation: In the confusion matrix, this time there’s more number of true negatives in the actual negative. But there’s still more number of true positives in the actual positive. The MR is 0.2066, which means that about 21% of the instances are being misclassified by the model. I would consider this to be a moderately high MR.
# ROC
library(ROCR)
pred <- prediction(pred_prob_train, germancredit.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
# AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8023866
Your observation: The ROC curve approaches more towards a higher false positive rate and true positive rate. When looking at the AUC, the value is 0.80. I would consider the AUC in this model to be good6 at maintaining good performance across different threshold settings.
# probabilities
pred_prob_test = predict(fit_tree, germancredit.test, type = "prob")
pred_prob_test = pred_prob_test[,"1"]
# classes
pred_germancredit_test2 <- predict(fit_tree_asym, germancredit.test,
type = "class")
summary(pred_germancredit_test2)
## 0 1
## 134 266
Your observation: There appears to be a class imbalance as class 1 has more values.
# Confusion matrix
Cmatrix_test2 = table(true = germancredit.test$Class,
pred = pred_germancredit_test2)
Cmatrix_test2
## pred
## true 0 1
## 0 70 55
## 1 64 211
# MR
1 - sum(diag(Cmatrix_test2))/sum(Cmatrix_test2)
## [1] 0.2975
Your observation: In the confusion matrix, this time there’s more number of true negatives in the actual negative. But in the actual positive, there’s still much more number of true positives. The MR is 0.2975, which means that 29.75% of the instances are being misclassified by the model. I would consider this to be a moderately high MR.
# ROC
pred2 <- prediction(pred_prob_test, germancredit.test$Class)
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf, colorize=TRUE)
# AUC
unlist(slot(performance(pred2, "auc"), "y.values"))
## [1] 0.7040145
Your observation: The ROC curve approaches more towards a higher false positive rate and true positive rate. When looking at the AUC, the value is 0.70. I would also consider the AUC in this model to be fair at maintaining good performance across different threshold settings.
In the training and testing set without weighted cost, both sets have a class imbalance with class 1 being higher than class 0. Regarding the training set’s MR, the value is 18% and the testing set’s MR is 29%. On the other hand, when the training and testing set have a weighted cost, the training set has a MR of 21% and the testing set has a MR of 29.75%. Both datasets appear to have the same ROC curve, and regarding their AUCs, training’s AUC is 0.80 and testing’s AUC is 0.70. When comparing the training set’s MR when it doesn’t have a weighted cost versus when it does, the value goes up slightly as it went up to 21% from 18%. And for the testing set, the MR stayed relatively the same, as it only went up to 29.75% from 29%.
While each model has their own advantages and disadvantages, it’s important to choose which model best fits the specific problem at hand. Decision trees are good for interpretability and capturing non-linear relationships. Logistic regression is suitable for simpler problems and when interpetability is really important. And SVM is powerful for high-dimensional data and complex decision boundaries.