Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the packages and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
library(ROCR)
library(ggplot2)
library(lattice)

data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") 

Your observation: Our target variable for this will be ‘Class’ which is why we will use as.numeric to convert it into True or False; 1 or 0.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Explore the dataset to understand its structure.

str(GermanCredit)
## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : num  1 0 1 1 0 1 1 1 1 0 ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...
table(GermanCredit$Class)
## 
##   0   1 
## 300 700

Your observation: “Good” is found 700/1000 times, making up 70% of the Class dataset.

3. Split the dataset into

training and test set. Please use the random seed as 2023 for reproducibility.

set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
german.train = GermanCredit[index,]
german.test = GermanCredit[-index,]

Your observation:

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right.

library(rpart)
library(rpart.plot)
# fit the tree
fit_tree <- rpart(as.factor(Class) ~ ., data=german.train)

rpart.plot(fit_tree,extra=4, yesno=2)

Your observation: The rpart paackage was used to ensure all variables were included in regards to the target variable.

2. Use the training set to get prediected classes.

library(rpart)
pred_german_train <- predict(fit_tree, german.train, type="class")

Your observation: The predicted classes of the training set were calculated through the fit_tree model

3. Obtain confusion matrix and MR on training set.

# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = german.train$Class,
                      pred = pred_german_train)
Cmatrix_train
##     pred
## true   0   1
##    0 146  96
##    1  50 508
# MSE
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.1825

Your observation: The matrix shows 146 True Negatives,508 True Positives, 96 False Positives, and 50 False Negatives. The Misclassification Rate is 18.25%.

4. Use the testing set to get prediected classes.

# Make predictions on the training sets
pred_german_test <- predict(fit_tree, german.test, type="class")

Your observation: To get the test set of predicted classes, the fit_tree model was also used.

5. Obtain confusion matrix and MR on testing set.

# Confusion matrix
Cmatrix_test = table(true = german.test$Class,
                     pred = pred_german_test)
Cmatrix_test
##     pred
## true   0   1
##    0  35  23
##    1  27 115
# MSE
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.25

Your observation: There were 35 True Negatives, 115 True Positives, 23 False Positives and 27 False Negatives. This gave a MR rate of 25% which is 6.75% higher than the training predicted set.

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right.

# We need to define a cost matrix first, don't change 0 there
cost_matrix <- matrix(c(0, 2,  # cost of 2 for FP
                        1, 0),  # cost of 1 for FN
                      byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(Class) ~ ., data=german.train, 
                       parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)

Your observation: By having the weights set to 2 for FP and 1 for FN, more emphasis is given to False Positives to prevent them from positive classification unless the confident.

2. Use the training set to get prediected probabilities and classes.

#get predictions for training
pred_german_train21 <- predict(fit_tree_asym, german.train, type="class")

Your observation: Predicted probabilities and classes for the training set were calculated using the weighted fit_tree model, fit_tree_asym

3. Obtain confusion matrix and MR on training set (use predicted classes).

Cmatrix_train_weight = table(true = german.train$Class, pred = pred_german_train21)
Cmatrix_train_weight
##     pred
## true   0   1
##    0 172  70
##    1 126 432
1 - sum(diag(Cmatrix_train_weight))/sum(Cmatrix_train_weight)
## [1] 0.245

Your observation: There were 172 True Negatives, 432 True Positives, 70 False Positives, and 126 False Negatives. There are less False Positives because of the weight for FP in the cost matrix. The MR is 24.5% which is 6.5% higher than the unweighted training set MR.

4. Obtain ROC and AUC on training set (use predicted probabilities).

## Unweighted ROC and AUC 
# obtain predicted probability
pred_prob_train = predict(fit_tree, german.train, type = "prob")
pred_prob_train = pred_prob_train[,"1"] 

library(ROCR)
pred <- prediction(pred_prob_train, german.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7974133
## Weighted ROC & AUC
# obtain predicted probability
pred_prob_train21 = predict(fit_tree_asym, german.train, type = "prob")
pred_prob_train21 = pred_prob_train21[,"1"] 

library(ROCR)
pred <- prediction(pred_prob_train21, german.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7557096

Your observation: The unweighted AUC is higher at 79.74% compared to the weighted AUC of 75.57%.The ROC is also better for the unweighted model.

5. Use the testing set to get prediected probabilities and classes.

pred_german_test21 <- predict(fit_tree_asym, german.test, type = "class")

Your observation: The weighted fit tree model was used to calculate the predicted probabilities and classes of the testing set.

6. Obtain confusion matrix and MR on testing set. (use predicted classes).

Cmatrix_test_weight = table( true = german.test$Class, pred = pred_german_test21)
Cmatrix_test_weight
##     pred
## true  0  1
##    0 40 18
##    1 43 99
1 - sum(diag(Cmatrix_test_weight))/sum(Cmatrix_test_weight)
## [1] 0.305

Your observation: There are 40 True Negatives, 99 True Positives, 18 False Positives and 43 False Negatives for the confusion matrix. The MR is 30.5%, which is 5.5% higher than the unweighted testing model.

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

## Unweighted ROC & AUC
# obtain predicted probability
pred_prob_test = predict(fit_tree, german.test, type = "prob")
# This is necessary again, as predict() for tree model return two values, one for 0 and one for 1.
pred_prob_test = pred_prob_test[,"1"] #replace "1" with the actual category if reponse variable is a factor
#ROC
pred <- prediction(pred_prob_test, german.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7279626
## Weighted ROC & AUC
# obtain predicted probability
pred_prob_test21 = predict(fit_tree_asym, german.test, type = "prob")
# This is necessary again, as predict() for tree model return two values, one for 0 and one for 1.
pred_prob_test21 = pred_prob_test21[,"1"] #replace "1" with the actual category if reponse variable is a factor
#ROC
pred <- prediction(pred_prob_test21, german.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7049539

Your observation: Again, the AUC for the unweighted model is 72.8% while the weighted model is 70.5%. The ROC curve is also better for the unweighted model.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis.

Overall, the unweighted model shows has a better performance.

Unweighted: MR train: 18.25% MR test: 25% AUC train: 79.74% AUC test: 72.8% Better ROC curve

Weighted: MR train: 24.5% MR test: 30.5% AUC train: 75.57% AUC test: 70.5%

Applying more emphasis when classifying as Positive in the weighted model is likely what caused this difference. A case where using the weighted model is preferred could be seen in medical diagnoses or fraud detection wherethe consequences can be more severe.

2. How do you compare Tree model to logistic regression and SVM? Only for this question, you don’t need to show numbers, just answer based on your understanding.

Tree Model

Pros: - Easily Interpretable - Handles both categorical and numerical features - Minimal Data Preprocessing - Quick to Build

Cons: - Prone to Overfitting - Unstable - Biased to Dominant Class - Limited Expressiveness

SVM

Pros: - Effective in high dimensional spaces - Memory efficient for large dataset - Versatile through kernel

Cons: - Poor performance when number of features exceeds number of samples - Sensitive to noise

Logistic Regression

Pros: - Simplicity _ Efficiency _ Versatility: binary, ordinal and nominal variables - Output Interpretation

Cons: - Assumes linearity - Sensitive to Outliers - High correlation among predictors can affect performance