Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is Class and all others are predictors.

Run the following code only once to install the caret package. The German credit scoring data is provided in that package.

install.packages('caret')

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # this package contains the German credit data in numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # recode `Class` so that "Good" is 1 and "Bad" is 0
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor, since svm() needs a factor response for classification
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

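Your observation: Class is recoded so that “Good” becomes 1 and “Bad” becomes 0, then stored as a factor so it can be used as the classification target. The data set has 1000 observations and 62 variables, most of them 0/1 dummy variables.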

# Optional code that drops variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Explore the dataset to understand its structure.

table(GermanCredit$Class)
## 
##   0   1 
## 300 700
head(GermanCredit)
##   Duration Amount InstallmentRatePercentage ResidenceDuration Age
## 1        6   1169                         4                 4  67
## 2       48   5951                         2                 2  22
## 3       12   2096                         2                 3  49
## 4       42   7882                         2                 4  45
## 5       24   4870                         3                 4  53
## 6       36   9055                         2                 4  35
##   NumberExistingCredits NumberPeopleMaintenance Telephone ForeignWorker Class
## 1                     2                       1         0             1     1
## 2                     1                       1         1             1     0
## 3                     1                       2         1             1     1
## 4                     1                       2         1             1     1
## 5                     2                       2         1             1     0
## 6                     1                       2         0             1     1
##   CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 1                          1                              0
## 2                          0                              1
## 3                          0                              0
## 4                          1                              0
## 5                          1                              0
## 6                          0                              0
##   CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 1                            0                              0
## 2                            0                              0
## 3                            0                              0
## 4                            0                              0
## 5                            0                              0
## 6                            0                              0
##   CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## 1                              0                      0                   0
## 2                              0                      1                   0
## 3                              0                      0                   0
## 4                              0                      1                   0
## 5                              0                      0                   1
## 6                              0                      1                   0
##   Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 1              0               0                           0
## 2              0               0                           0
## 3              0               0                           0
## 4              0               0                           1
## 5              1               0                           0
## 6              0               0                           0
##   Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 1                        1                         0               0
## 2                        1                         0               0
## 3                        0                         0               0
## 4                        0                         0               0
## 5                        0                         0               0
## 6                        0                         0               0
##   Purpose.Education Purpose.Retraining Purpose.Business
## 1                 0                  0                0
## 2                 0                  0                0
## 3                 1                  0                0
## 4                 0                  0                0
## 5                 0                  0                0
## 6                 1                  0                0
##   SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 1                          0                              0
## 2                          1                              0
## 3                          1                              0
## 4                          1                              0
## 5                          1                              0
## 6                          0                              0
##   SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 1                               0                           0
## 2                               0                           0
## 3                               0                           0
## 4                               0                           0
## 5                               0                           0
## 6                               0                           0
##   EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 1                       0                         0                         0
## 2                       0                         1                         0
## 3                       0                         0                         1
## 4                       0                         0                         1
## 5                       0                         1                         0
## 6                       0                         1                         0
##   EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## 1                       1                                0
## 2                       0                                0
## 3                       0                                0
## 4                       0                                0
## 5                       0                                0
## 6                       0                                0
##   Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## 1                         0                    1                           1
## 2                         1                    0                           1
## 3                         0                    1                           1
## 4                         0                    1                           0
## 5                         0                    1                           1
## 6                         0                    1                           1
##   OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## 1                                  0                   1                  0
## 2                                  0                   1                  0
## 3                                  0                   1                  0
## 4                                  0                   0                  1
## 5                                  0                   0                  0
## 6                                  0                   0                  0
##   Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 1                 0                          0                            0
## 2                 0                          0                            0
## 3                 0                          0                            0
## 4                 0                          0                            0
## 5                 0                          0                            0
## 6                 0                          0                            0
##   Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## 1            0           1                       0                     0
## 2            0           1                       0                     0
## 3            0           1                       0                     1
## 4            0           0                       0                     0
## 5            0           0                       0                     0
## 6            0           0                       0                     1
##   Job.SkilledEmployee
## 1                   1
## 2                   1
## 3                   0
## 4                   1
## 5                   1
## 6                   0

Your observation: Class 0 appears 300 times and class 1 appears 700 times. This means 700 of the 1000 applicants are labeled “Good,” so the classes are imbalanced at roughly 30/70.

3. Split the dataset into training and test set. Please use the random seed as 2023 for reproducibility.

set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
German.train = GermanCredit[index,]
German.test = GermanCredit[-index,]

Your observation: A random sample of 80% of the rows (800 observations) is assigned to German.train, and the remaining 20% (200 observations) is assigned to German.test.
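A quick way to sanity-check a split like this is to confirm that the two index sets do not overlap and to compare class proportions in each part. The sketch below uses a toy response vector with 300 zeros and 700 ones as a stand-in for GermanCredit$Class (an assumption, so the block runs without caret); the names `y`, `train_idx`, and `test_idx` are illustrative only.

```r
# Toy stand-in for the response: 300 zeros and 700 ones (assumed, mirrors the class counts above)
set.seed(2023)
y <- factor(rep(c(0, 1), times = c(300, 700)))
n <- length(y)

# Same 80/20 sampling pattern as the split above
train_idx <- sample(seq_len(n), size = floor(0.80 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# The two sets should not overlap and should cover every row
stopifnot(length(intersect(train_idx, test_idx)) == 0)
stopifnot(length(train_idx) + length(test_idx) == n)

# Class proportions in each part; large gaps would suggest an unlucky split
prop_train <- prop.table(table(y[train_idx]))
prop_test  <- prop.table(table(y[test_idx]))
print(round(rbind(train = c(prop_train), test = c(prop_test)), 3))
```

Because the toy vector is not the real data, the printed proportions will not match the actual training and test sets; the same checks can be run on `index` and `GermanCredit$Class` directly.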

Task 2: SVM without weighted class cost

1. Fit an SVM model using the training set. Please use all variables, but make sure the variable types are right.

library(e1071)
# Fitting SVM model for training set
German.svm = svm(Class ~ .,
                 data = German.train, kernel = 'linear')

summary(German.svm)
## 
## Call:
## svm(formula = Class ~ ., data = German.train, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  418
## 
##  ( 201 217 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Your observation: Of the 418 support vectors, 201 belong to class “0” and 217 belong to class “1,” so just over half of the 800 training observations are support vectors. These are the only two class levels in the model.

2. Use the training set to get predicted classes.

# Predictions for German.train
pred_German_train <- predict(German.svm, German.train)

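# Histogram
# as.numeric() on a factor returns the level codes (1, 2), so convert through character to keep the 0/1 labels
numeric_pgtrain_data <- as.numeric(as.character(pred_German_train))

hist(numeric_pgtrain_data, breaks = 2, main = "Histogram of True/False Data", xlab = "Values (0 = False, 1 = True)", ylab = "Frequency")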

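Your observation: The histogram shows how many training observations the model assigns to each class. Most are predicted as 1: 590 of 800, versus 210 predicted as 0 (per the confusion matrix in the next step). At about 74% predicted as 1, the split is close to the one for pred_German_test.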
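One detail worth checking when plotting predicted classes is how a factor is converted to numbers. The sketch below, using a small made-up factor rather than the model output, shows that `as.numeric()` on a factor returns the internal level codes, while converting through `as.character()` keeps the 0/1 labels.

```r
# Predicted classes as a factor with levels "0" and "1" (toy values, not model output)
pred <- factor(c("1", "0", "1", "1", "0"), levels = c("0", "1"))

# as.numeric() on a factor returns the internal level codes (1 and 2), not the labels
codes <- as.numeric(pred)

# Converting through character first recovers the 0/1 labels
labels01 <- as.numeric(as.character(pred))

print(codes)
print(labels01)

# Counting the classes directly avoids the conversion altogether
counts <- table(pred)
print(counts)
```

Passing `counts` to `barplot()` gives a bar chart of the predicted classes, which is arguably a more natural display than a histogram for a two-level factor.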

3. Obtain confusion matrix and MR on training set.

# Confusion matrix for training set
Cmatrix_German_train = table(true = German.train$Class, pred = pred_German_train)

Cmatrix_German_train
##     pred
## true   0   1
##    0 142 100
##    1  68 490
# Train MR
1 - sum(diag(Cmatrix_German_train))/sum(Cmatrix_German_train)
## [1] 0.21

Your observation: According to the matrix, there are 490 true positives and 142 true negatives. The model produced 100 false positives and 68 false negatives. The MR of 0.21 indicates that 21% of the training observations were classified incorrectly.
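The MR can be supplemented with per-class rates computed from the same table. As a minimal sketch, the block below re-enters the training confusion matrix from the output above and computes MR, sensitivity, and specificity, treating “1” as the positive class; the helper name `cm_rates` is hypothetical and not part of any package.

```r
# Training confusion matrix copied from the output above (rows = true, columns = predicted)
cm <- matrix(c(142, 100,
                68, 490),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# Hypothetical helper: summary rates with "1" treated as the positive class
cm_rates <- function(cm) {
  tn <- cm["0", "0"]; fp <- cm["0", "1"]
  fn <- cm["1", "0"]; tp <- cm["1", "1"]
  c(MR          = (fp + fn) / sum(cm),  # share of all observations misclassified
    sensitivity = tp / (tp + fn),       # share of true 1s predicted as 1
    specificity = tn / (tn + fp))       # share of true 0s predicted as 0
}

rates_train <- cm_rates(cm)
print(round(rates_train, 3))
```

The same helper can be applied to any 2x2 table built with `table(true = ..., pred = ...)`, provided both levels appear in the rows and columns.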

4. Use the testing set to get predicted classes.

pred_German_test <- predict(German.svm, German.test)

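# Histogram
# as.numeric() on a factor returns the level codes (1, 2), so convert through character to keep the 0/1 labels
numeric_pgtest_data <- as.numeric(as.character(pred_German_test))

hist(numeric_pgtest_data, breaks = 2, main = "Histogram of True/False Data", xlab = "Values (0 = False, 1 = True)", ylab = "Frequency")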

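Your observation: The histogram shows the predicted classes for the test set: 144 of 200 observations are predicted as 1 and 56 as 0 (per the confusion matrix in the next step). At about 72% predicted as 1, the split is close to the roughly 74% seen for pred_German_train.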

5. Obtain confusion matrix and MR on testing set.

# Confusion matrix for testing set
Cmatrix_German_test = table(true = German.test$Class, pred = pred_German_test)

Cmatrix_German_test
##     pred
## true   0   1
##    0  32  26
##    1  24 118
# Test MR
1 - sum(diag(Cmatrix_German_test))/sum(Cmatrix_German_test)
## [1] 0.25

Your observation: According to the matrix, there are 118 true positives and 32 true negatives. The model produced 26 false positives and 24 false negatives. The MR of 0.25 indicates that 25% of the test observations were classified incorrectly, which is 4 percentage points higher than on the training set.

Task 3: SVM with weighted class cost, and probabilities enabled

1. Fit an SVM model using the training set with weight of 2 on “1” and weight of 1 on “0”. Please use all variables, but make sure the variable types are right.

German.svm_asymmetric12 = svm(Class ~ .,
                            data = German.train, 
                            kernel = 'polynomial',
                            class.weights = c("0" = 1, "1" = 2),
                            probability = TRUE)
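To see what a class weight does, it can help to look at the penalty term directly. In libsvm-style C-classification, which e1071 wraps, a class weight scales the cost parameter for observations of that class, so margin violations on the heavier class are penalized more. The sketch below is only an illustration of that idea in base R: the labels and decision values are made up, and nothing here calls `svm()`.

```r
# Labels coded as -1 (class "0") and +1 (class "1"); decision values are made up for illustration
y <- c(-1, -1, 1, 1, 1)
f <- c(-1.2, 0.3, 0.4, 1.5, -0.2)

# Hinge loss per observation: zero when the point is on the correct side with margin >= 1
hinge <- pmax(0, 1 - y * f)

# Per-class weights mirroring class.weights = c("0" = 1, "1" = 2)
w <- ifelse(y == 1, 2, 1)

# Total slack penalty without and with the class weights (cost = 1 assumed)
penalty_unweighted <- sum(hinge)
penalty_weighted   <- sum(w * hinge)

print(data.frame(y = y, f = f, hinge = hinge, weight = w))
print(c(unweighted = penalty_unweighted, weighted = penalty_weighted))
```

With the weights applied, violations on class “1” count double, which is why a weighted fit tends to move the boundary so that fewer true “1” observations end up on the wrong side.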

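Your observation: The class weights make misclassifying a true “1” twice as costly as misclassifying a true “0,” so the model puts more emphasis on “1.” Note that this model also uses a polynomial kernel rather than the linear kernel from Task 2.

2. Use the training set to get predicted probabilities and classes.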

pred_German_train12 <- predict(German.svm_asymmetric12, German.train)

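# Histogram
# as.numeric() on a factor returns the level codes (1, 2), so convert through character to keep the 0/1 labels
numeric_12pgtrain_data <- as.numeric(as.character(pred_German_train12))

hist(numeric_12pgtrain_data, breaks = 2, main = "Histogram of True/False Data", xlab = "Values (0 = False, 1 = True)", ylab = "Frequency")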

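Your observation: With more weight on class 1, the predictions shift further toward 1: 670 of 800 training observations are predicted as 1 and only 130 as 0 (per the confusion matrix in the next step). This is a larger gap between the two bars than for the unweighted model.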

3. Obtain confusion matrix and MR on training set (use predicted classes).

# Confusion matrix for training_12
Cmatrix_train_12 = table( true = German.train$Class, pred = pred_German_train12)

Cmatrix_train_12
##     pred
## true   0   1
##    0 130 112
##    1   0 558
# MR
1 - sum(diag(Cmatrix_train_12))/sum(Cmatrix_train_12)
## [1] 0.14

Your observation: According to the matrix, there are 558 true positives and 130 true negatives. The model produced 112 false positives and 0 false negatives. The MR of 0.14 indicates that 14% of the training observations were classified incorrectly. This is lower than the training MR of 0.21 for the unweighted model.

4. Obtain ROC and AUC on training set (use predicted probabilities).

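# Prep: refit a linear SVM with probability = TRUE
# Note: this model has no class weights, so it is not German.svm_asymmetric12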
German.svm_prob = svm(Class ~ .,
                      data = German.train, kernel = 'linear',
                      probability = TRUE)
pred_prob_train = predict(German.svm_prob,
                          newdata = German.train,
                          probability = TRUE)
str(pred_prob_train)
##  Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
##  - attr(*, "names")= chr [1:800] "885" "464" "431" "361" ...
##  - attr(*, "probabilities")= num [1:800, 1:2] 0.3477 0.1538 0.0437 0.2535 0.3079 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:800] "885" "464" "431" "361" ...
##   .. ..$ : chr [1:2] "0" "1"
# Keep only the predicted probability of class "1" (selected by column name)
pred_prob_train = attr(pred_prob_train, "probabilities")[, "1"]
# ROC for train
library(ROCR)
pred <- prediction(pred_prob_train, German.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

# AUC for train
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8298158
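The AUC has a direct reading as a ranking measure: it is the share of (positive, negative) pairs in which the positive observation receives the higher score. As a minimal sketch of that definition, the block below computes it by brute force on made-up scores and labels (not the GermanCredit predictions); `auc_pairs` is a hypothetical helper, not an ROCR function.

```r
# Toy scores and labels (made up; not the GermanCredit predictions)
labels <- c(0, 0, 1, 0, 1, 1, 1, 0)
scores <- c(0.10, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90, 0.20)

# AUC as the share of (positive, negative) pairs where the positive scores higher;
# tied pairs count as half
auc_pairs <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  diffs <- outer(pos, neg, "-")  # every positive score minus every negative score
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

auc_toy <- auc_pairs(scores, labels)
print(auc_toy)
```

Here 15 of the 16 pairs are ordered correctly, so the toy AUC is 0.9375. Applied to `pred_prob_train` and the 0/1 values of `German.train$Class`, the same calculation should agree with the ROCR value, although that has not been run here.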

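Your observation: The ROC curve looks good. The AUC is 0.8298158, which means that a randomly chosen “1” observation receives a higher predicted probability than a randomly chosen “0” observation about 83% of the time. Note that these probabilities come from German.svm_prob, a linear SVM without class weights.

5. Use the testing set to get predicted probabilities and classes.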

pred_German_test12 <- predict(German.svm_asymmetric12, German.test)

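# Histogram
# Use the test-set predictions here; convert through character to keep the 0/1 labels
numeric_12pgtest_data <- as.numeric(as.character(pred_German_test12))

hist(numeric_12pgtest_data, breaks = 2, main = "Histogram of True/False Data", xlab = "Values (0 = False, 1 = True)", ylab = "Frequency")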

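Your observation: Compared to the unweighted model on the test set, the weighted model predicts class 1 even more often: 180 of 200 test observations are predicted as 1 and 20 as 0 (per the confusion matrix in the next step), versus 144 and 56 before.

6. Obtain confusion matrix and MR on testing set (use predicted classes).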

# Confusion matrix for testing_12
Cmatrix_test_12 = table( true = German.test$Class, pred = pred_German_test12)

Cmatrix_test_12
##     pred
## true   0   1
##    0  14  44
##    1   6 136
#MR testing
1 - sum(diag(Cmatrix_test_12))/sum(Cmatrix_test_12)
## [1] 0.25
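The misclassification rate treats both kinds of error alike, so two models can share an MR while making quite different mistakes. As a rough sketch, the block below re-enters the two test confusion matrices from the outputs above and adds a weighted error cost. The 1:2 costs are an assumption that simply reads the class weights as relative misclassification costs, and `error_summary` is a hypothetical helper.

```r
# Test confusion matrices copied from the outputs above (rows = true, columns = predicted)
cm_unweighted <- matrix(c(32,  26,
                          24, 118),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(true = c("0", "1"), pred = c("0", "1")))
cm_weighted <- matrix(c(14,  44,
                         6, 136),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# Hypothetical helper: error counts plus a cost that charges more for missing a true 1
error_summary <- function(cm, cost_fp = 1, cost_fn = 2) {
  fp <- cm["0", "1"]  # true 0 predicted as 1
  fn <- cm["1", "0"]  # true 1 predicted as 0
  c(MR = (fp + fn) / sum(cm), FP = fp, FN = fn,
    weighted_cost = cost_fp * fp + cost_fn * fn)
}

comparison <- rbind(unweighted = error_summary(cm_unweighted),
                    weighted   = error_summary(cm_weighted))
print(comparison)
```

Both rows have the same MR, but the weighted model trades false negatives for false positives, so its cost under the assumed 1:2 scheme is lower. A different cost assumption could reverse that ranking.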

Your observation: According to the matrix, there are 136 true positives and 14 true negatives. The model produced 44 false positives and 6 false negatives. The MR of 0.25 indicates that 25% of the test observations were classified incorrectly, which is the same MR as for the unweighted model on the testing set. The mix of errors changed, however: false negatives fell from 24 to 6 while false positives rose from 26 to 44.

7. Obtain ROC and AUC on testing set (use predicted probabilities).

# Prep
pred_prob_test = predict(German.svm_prob,
                         newdata = German.test,
                         probability = TRUE)
# Keep only the predicted probability of class "1" (selected by column name)
pred_prob_test = attr(pred_prob_test, "probabilities")[, "1"]
# ROC
pred <- prediction(pred_prob_test, German.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

# AUC for test
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8016027

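Your observation: The ROC curve looks good. The AUC is 0.8016027, which means that a randomly chosen “1” observation receives a higher predicted probability than a randomly chosen “0” observation about 80% of the time. This is slightly lower than the AUC of the training set.

Task 4: Report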

1. Summarize your findings and discuss what you observed from the above analysis.

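When the class weights of the SVM model were changed, the results for the training set were better: the MR went from 21% to 14%. There was no change in MR for the testing set, which stayed at 25%, although the errors shifted from false negatives (24 to 6) to false positives (26 to 44). Because the weighted model also used a polynomial kernel instead of a linear one, the difference cannot be attributed to the weights alone. Overall, putting more emphasis on “1” over “0” gave better results on the training data but did not lower the test MR.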

2. How do you compare SVM to logistic regression? Only for this question, you don’t need to show numbers, just answer based on your understanding.

Logistic Regression

Pros:
- Simplicity: easy to interpret
- Efficient: fewer computing resources needed
- Probabilistic interpretation: gives probabilities for outcomes

Cons:
- Assumes a linear relationship between the predictors and the log-odds of the outcome
- Sensitive to outliers
- Performance can be negatively impacted if variables are highly correlated

SVM

Pros:
- Effective in high-dimensional spaces
- Versatile through kernels
- Memory efficient

Cons:
- Poor performance when the number of features exceeds the number of samples
- Sensitive to noise

Overall, SVM is better suited to more complex datasets, while logistic regression is best for simpler datasets where interpretability matters and the relationship with the outcome is at least somewhat linear.
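To make the interpretability and probability points about logistic regression concrete, here is a minimal sketch on simulated data. The data, seed, and true coefficients are assumptions chosen for illustration and are not part of the assignment; only base R's `glm()` is used.

```r
# Simulated data: one predictor and a binary outcome (assumed coefficients -0.5 and 1.5)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

# Logistic regression: the log-odds of y = 1 are modeled as linear in x
fit <- glm(y ~ x, family = binomial)

# Coefficients are on the log-odds scale; exponentiating the slope gives an odds ratio
coefs <- coef(fit)
odds_ratio <- exp(coefs[["x"]])

# Predicted probabilities come straight from the model, with no extra calibration step
p_hat <- predict(fit, newdata = data.frame(x = c(-1, 0, 1)), type = "response")

print(round(coefs, 3))
print(round(odds_ratio, 3))
print(round(p_hat, 3))
```

An SVM fitted to the same data would return a class label and a decision value by default, and probabilities only through an additional fitting step such as the `probability = TRUE` option used earlier.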