Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : num  1 0 1 1 0 1 1 1 1 0 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: Upon looking at the structure of the dataset, I see that majority of the variables are binary. On the other hand, there are some variables that are non-binary such as duration, amount, and installmentratepercentage to name a few.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Explore the dataset to understand its structure.

#summary statistics
summary(GermanCredit)
#head
head(GermanCredit)
#structure
str(GermanCredit)
#correlation matrix
correlation_matrix <- cor(GermanCredit)
print(correlation_matrix)

Your observation: When looking at the summary statistics of the dataset, I can see that the “amount” variable has the highest minimum, 1st quartile, median, mean, 3rd quartile, and maximum. In addition, “age” is the second variable to have the highest value for each characteristic, and then “duration” is the third variable. Majority of the other variables have values between 0 and 1. After running the code that dropped the variables that provided no information, the variables went down from 62 to 49. When looking at a correlation matrix, I can see that the class variable is not strongly correlated with other variables in the dataset.

3. Split the dataset into training and test set. Please use the random seed as 2023 for reproducibility.

set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.60)
germancredit.train = GermanCredit[index,]
germancredit.test = GermanCredit[-index,]

Your observation: After splitting the dataset, I see that the training set has 600 observations and the testing set has 400 observations.

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right.

library(rpart)
library(rpart.plot)
# fit the model
fit_tree <- rpart(as.factor(Class) ~ ., data=germancredit.train)
rpart.plot(fit_tree,extra=4, yesno=2)

Your observation: According to the tree model, it appears that majority of the leaf nodes that represent “yes” have a higher number of observations that are predicted to be in class 0. Whereas in the leaf nodes that represent “no”, majority of them have a higher number of observations that are predicted to be in class 0.

2. Use the training set to get predicted classes.

pred_germancredit_train <- predict(fit_tree, germancredit.train, type="class")
summary(pred_germancredit_train)
##   0   1 
## 142 458

Your observation: After obtaining the predicted classes, there appears to be a class imbalance as class 1 has significantly more values.

3. Obtain confusion matrix and MR on training set.

# Confusion matrix 
Cmatrix_train = table(true = germancredit.train$Class,
                      pred = pred_germancredit_train)
Cmatrix_train
##     pred
## true   0   1
##    0 104  71
##    1  38 387
# MR
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.1816667

Your observation: In the confusion matrix, there’s more number of true negatives in the actual negative. And in the actual positive, there’s much more number of true positives. The MR is 0.18, which means that 18% of the instances are being misclassified by the model. I would consider this to be a relatively low MR.

4. Use the testing set to get predicted classes.

pred_germancredit_test <- predict(fit_tree, germancredit.test, type="class")
summary(pred_germancredit_test)
##   0   1 
## 102 298

Your observation: After obtaining the predicted classes, there appears to be a class imbalance as class 1 has more values.

5. Obtain confusion matrix and MR on testing set.

# Confusion matrix
Cmatrix_test = table(true = germancredit.test$Class,
                     pred = pred_germancredit_test)
Cmatrix_test
##     pred
## true   0   1
##    0  55  70
##    1  47 228
# MR
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.2925

Your observation: In the confusion matrix, there’s more number of false positives in the actual negative. And in the actual positives, there’s much more number of true positives. The MR is 0.29, which means that 29% of the instances are being misclassified by the model. I would consider to be a moderately high MR.

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right.

cost_matrix <- matrix(c(0, 2,  # cost of 1 for FP
                        1, 0),  # cost of 5 for FN
                      byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(Class) ~ ., data=germancredit.train, 
                       parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)

Your observation: After creating a tree model with a weighted cost, it appears that there’s less leaf nodes on the left side in this model compared to the previous tree model.

2. Use the training set to get predicted probabilities and classes.

# probabilities
pred_prob_train = predict(fit_tree_asym, germancredit.train, type = "prob")
pred_prob_train = pred_prob_train[,"1"] 

# classes
pred_germancredit_train2 <- predict(fit_tree_asym, germancredit.train,
                             type = "class")
summary(pred_germancredit_train2)
##   0   1 
## 193 407

Your observation: There appears to be a class imbalance as class 1 has significantly more values.

3. Obtain confusion matrix and MR on training set (use predicted classes).

# Confusion matrix
Cmatrix_train2 = table(true = germancredit.train$Class,
                      pred = pred_germancredit_train2)
Cmatrix_train2
##     pred
## true   0   1
##    0 122  53
##    1  71 354
# MR
1 - sum(diag(Cmatrix_train2))/sum(Cmatrix_train2)
## [1] 0.2066667

Your observation: In the confusion matrix, this time there’s more number of true negatives in the actual negative. But there’s still more number of true positives in the actual positive. The MR is 0.2066, which means that about 21% of the instances are being misclassified by the model. I would consider this to be a moderately high MR.

4. Obtain ROC and AUC on training set (use predicted probabilities).

# ROC
library(ROCR)
pred <- prediction(pred_prob_train, germancredit.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

# AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8023866

Your observation: The ROC curve approaches more towards a higher false positive rate and true positive rate. When looking at the AUC, the value is 0.80. I would consider the AUC in this model to be good6 at maintaining good performance across different threshold settings.

5. Use the testing set to get predicted probabilities and classes.

# probabilities
pred_prob_test = predict(fit_tree, germancredit.test, type = "prob")
pred_prob_test = pred_prob_test[,"1"]

# classes
pred_germancredit_test2 <- predict(fit_tree_asym, germancredit.test,
                             type = "class")
summary(pred_germancredit_test2)
##   0   1 
## 134 266

Your observation: There appears to be a class imbalance as class 1 has more values.

6. Obtain confusion matrix and MR on testing set. (use predicted classes).

# Confusion matrix
Cmatrix_test2 = table(true = germancredit.test$Class,
                      pred = pred_germancredit_test2)
Cmatrix_test2
##     pred
## true   0   1
##    0  70  55
##    1  64 211
# MR
1 - sum(diag(Cmatrix_test2))/sum(Cmatrix_test2)
## [1] 0.2975

Your observation: In the confusion matrix, this time there’s more number of true negatives in the actual negative. But in the actual positive, there’s still much more number of true positives. The MR is 0.2975, which means that 29.75% of the instances are being misclassified by the model. I would consider this to be a moderately high MR.

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

# ROC
pred2 <- prediction(pred_prob_test, germancredit.test$Class)
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf, colorize=TRUE)

# AUC
unlist(slot(performance(pred2, "auc"), "y.values"))
## [1] 0.7040145

Your observation: The ROC curve approaches more towards a higher false positive rate and true positive rate. When looking at the AUC, the value is 0.70. I would also consider the AUC in this model to be fair at maintaining good performance across different threshold settings.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis.

In the training and testing set without weighted cost, both sets have a class imbalance with class 1 being higher than class 0. Regarding the training set’s MR, the value is 18% and the testing set’s MR is 29%. On the other hand, when the training and testing set have a weighted cost, the training set has a MR of 21% and the testing set has a MR of 29.75%. Both datasets appear to have the same ROC curve, and regarding their AUCs, training’s AUC is 0.80 and testing’s AUC is 0.70. When comparing the training set’s MR when it doesn’t have a weighted cost versus when it does, the value goes up slightly as it went up to 21% from 18%. And for the testing set, the MR stayed relatively the same, as it only went up to 29.75% from 29%.

2. How do you compare Tree model to logistic regression and SVM? Only for this question, you don’t need to show numbers, just answer based on your understanding.

While each model has their own advantages and disadvantages, it’s important to choose which model best fits the specific problem at hand. Decision trees are good for interpretability and capturing non-linear relationships. Logistic regression is suitable for simpler problems and when interpetability is really important. And SVM is powerful for high-dimensional data and complex decision boundaries.