Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
## Warning: package 'caret' was built under R version 4.3.1
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
GermanCredit$Class <- as.factor(GermanCredit$Class) #make sure `Class` is a factor as SVM require a factor response
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: Based on the variables, none appear categorical. Thus, none of the varibales are converted to factor.
We know that some of the columns are categorical such as; ResidenceDuration, NumberExistingCredit, NumberPeopleMaintenance, Telephone ForeignWorker, Class, CheckingAccountStatus.lt.0, CheckingAccountStatus.0.to.200, etc. We can see There are more categorical than interger varibales based from observing the mean, minimum and maxiumum values.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Explore the dataset to understand its structure.

str(GermanCredit)
## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

Your observation: In the Age variable, the median is 33 years. This insight can help us observe a lower age demographic which may lead to worse credit scores for younger ages. In Duration the Max value is 72, while the mean and median are 18 and 20; respectfully. This may lead us to believe that their may be outliers present in the data. The same idea follows for Amount. The max is 18424, which is extremely higher than the other data.

3. Split the dataset into training and test set. Please use the random seed as 2023 for reproducibility.

set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
German.train = GermanCredit[index,]
German.test = GermanCredit[-index,]

Your observation: The random seed was selected to 2023 and the data was randomly split to training (80%) and testing (20%).

Task 2: SVM without weighted class cost

1. Fit a SVM model using the training set. Please use all variables, but make sure the variable types are right.

library(e1071)
## Warning: package 'e1071' was built under R version 4.3.1
# Fitting SVM model for training set
German.svm = svm({{as.factor(Class)}} ~ .,
                 data = German.train, kernel = 'linear')

summary(German.svm)
## 
## Call:
## svm(formula = {
##     {
##         as.factor(Class)
##     }
## } ~ ., data = German.train, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  418
## 
##  ( 201 217 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Your observation: Here we have the SVM model that gives us the number of support vectors. We need to make the default as factor when fitting SVM. Also, we need to use the svm from package e1071. ### 2. Use the training set to get prediected classes.

# Predictions for German.train
pred_German_train <- predict(German.svm, German.train)

Your observation: Here we are gettingt the training set of the predicted class.

3. Obtain confusion matrix and MR on training set.

# Confusion matrix for training set
Cmatrix_German_train = table(true = German.train$Class, pred = pred_German_train)

Cmatrix_German_train
##     pred
## true   0   1
##    0 142 100
##    1  68 490
# Train MR
1 - sum(diag(Cmatrix_German_train))/sum(Cmatrix_German_train)
## [1] 0.21

Your observation: Here the train MR is 0.21, which means about 21% of the instances are misclassifies by the SVM model. So there are a lower value indicating thea better performance. The training data set has a linear kernel and a cost parameter of 1 with classes as 0 and 1. In the table, the true and predicted class labels, “pred” are shown. This table gives us the true positive 142 true negatives, 490 true positives, 100 false negatives, and 68 false positives.

4. Use the testing set to get prediected classes.

pred_German_test <- predict(German.svm, German.test)

Your observation: Here, I am using the testing dataset from line 58 and getting the predicted values for the dataset. ### 5. Obtain confusion matrix and MR on testing set.

# Confusion matrix for testing set
Cmatrix_German_test = table(true = German.test$Class, pred = pred_German_test)

Cmatrix_German_test
##     pred
## true   0   1
##    0  32  26
##    1  24 118
# Test MR
1 - sum(diag(Cmatrix_German_test))/sum(Cmatrix_German_test)
## [1] 0.25

Your observation: Here the test MR, misclassification rate is approx. 25%. This is the amount of instances in the testing set that are misclassified by the SVM model. The confusion matrix tells us that the Cmatrix_German_test has 32 true negatives, 118 true positives, 26 false negatives, and 24 false positives.

Task 3: SVM with weighted class cost, and probabilities enabled

1. Fit a SVM model using the training set with weight of 2 on “1” and weight of 1 on “0”. Please use all variables, but make sure the variable types are right.

German.svm_asymmetric = svm(as.factor(Class) ~ .,
                            data = German.train, 
                            kernel = 'linear',
                            class.weights = c("0" = 1, "1" = 2),
                            probability = TRUE)

Your observation: Using SVM with asymmetric cost is actually easier! I kept the linear kernal and the asfactor for the variable CLass. ### 2. Use the training set to get prediected probabilities and classes.

pred_German_train <- predict(German.svm_asymmetric, German.train)

Your observation: The predict function generates predictions based on the German train model dataset. ### 3. Obtain confusion matrix and MR on training set (use predicted classes).

# Confusion matrix for training_12
Cmatrix_train_12 = table( true = German.train$Class, pred = pred_German_train)
# MR
1 - sum(diag(Cmatrix_train_12))/sum(Cmatrix_train_12)
## [1] 0.25125

Your observation: Here, the misclassification rate is 25%

4. Obtain ROC and AUC on training set (use predicted probabilities).

German.svm_prob = svm(Class ~ .,
                      data = German.train, kernel = 'linear',
                      probability = TRUE)
pred_prob_train = predict(German.svm_prob,
                          newdata = German.train,
                          probability = TRUE)
str(pred_prob_train)
##  Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
##  - attr(*, "names")= chr [1:800] "885" "464" "431" "361" ...
##  - attr(*, "probabilities")= num [1:800, 1:2] 0.3477 0.1538 0.0437 0.2535 0.3079 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:800] "885" "464" "431" "361" ...
##   .. ..$ : chr [1:2] "0" "1"
# Necessary
pred_prob_train = attr(pred_prob_train, "probabilities")[, 2] 
# ROC for train
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.3.1
pred <- prediction(pred_prob_train, German.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

# AUC for train
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8298158

Your observation: Here we can see the AUC is 0.8298. In SVM, the output values is the distance from the hyperplane.

5. Use the testing set to get prediected probabilities and classes.

pred_German_test <- predict(German.svm_asymmetric, German.test)

Your observation: The predicted test data helps retrieve the proabaility for the german test dataset. ### 6. Obtain confusion matrix and MR on testing set. (use predicted classes).

# Confusion matrix for testing_12
Cmatrix_test_12 = table( true = German.test$Class, pred = pred_German_test)
#MR testing
1 - sum(diag(Cmatrix_test_12))/sum(Cmatrix_test_12)
## [1] 0.245

Your observation: In this chunk, the MR testing is found by subtracting the ratio of correct predictions (the diagonal of the confusion matrix) from 1.

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

# Testing pred_prob
pred_prob_test = predict(German.svm_prob,
                         newdata = German.test,
                         probability = TRUE)
 # Necessary
pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]
# ROC for test 
pred <- prediction(pred_prob_test, German.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

# AUC for train
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8016027

Your observation: In line 210, we extract the class probability of the positive class (1) from the output of the predict function, focuing on class (1)

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis.

The Area Under the Curve (AUC) shows the trade off between the true positive rate at different probability thresholds. The first ROC for training had a higher AUC which means it had a better discrimination between the classes with 0.5 being equivalent to random guessing and 1.0 being the perfect indicator. Although, Platt scaling can be used to transform the output scores. Also, the “pred_prob_train” had a 0.8298158 for its AUC and the “pred_prob_test” had a 0.8016027.

2. How do you compare SVM to logistic regression? Only for this question, you don’t need to show numbers, just answer based on your understanding.

  • SVM can handle many different type of problems for its type of classifier. For instance, it can handle nonlinearity whereas logistic regression can not.
  • Logistic regression can be sensitive to outliers, whereas SVMs are more robust to outliers due to being influenced by the data points that are not support vectors.
  • SVM can be adopted to hande regression probelms, and faster for larger number of observation.
  • Logistic regression has fewer hyperparameters to tune, whereas in SVM the choice of kernels and regularization of parameters are considered.
  • Logistic regression is overall better for finding the relationship between features, and it target to find linearity and interpretability, whereas, SVM’s are better at finding a non-linear decision, maximizing margins being effective in high dimensional spaces, has a memory efficient for large dataset and beign versatile through the kernel trick.