Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class;
all other variables are predictors.
Run the following code only once, to install the package
caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) #this package contains the German credit data in its numeric (dummy-coded) format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor, as svm() requires a factor response for classification
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: After running the conversion code, `Class` is now a factor with levels 0 and 1.
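A quick sanity check on the recoded response (a brief sketch; the 300 Bad / 700 Good split is the documented class distribution of this dataset):
#check the factor levels and class counts
levels(GermanCredit$Class) #should be "0" "1"
table(GermanCredit$Class) #expect 300 zeros (Bad) and 700 ones (Good)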
#This optional code drops variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
#summary statistics
summary(GermanCredit)
#head
head(GermanCredit)
#structure
str(GermanCredit)
Your observation: Looking at the summary statistics, the Amount variable has the largest minimum, 1st quartile, median, mean, 3rd quartile, and maximum; Age has the second-largest values for each statistic, followed by Duration. Most of the other variables take values of 0 or 1. The structure confirms that the majority of the variables are binary dummy variables, with a few non-binary exceptions such as Duration, Amount, and Age.
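The hard-coded column indices above can also be derived programmatically. A minimal sketch using caret's nearZeroVar() (note: it may flag a slightly different set of columns than the indices used here):
#identify low-information (near-zero-variance) predictors instead of hard-coding indices
nzv <- nearZeroVar(GermanCredit) #integer vector of column positions
colnames(GermanCredit)[nzv] #inspect which predictors would be dropped
#GermanCredit <- GermanCredit[, -nzv] #uncomment to apply the drop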
Set the seed to 2023 for reproducibility.
#Set the seed for reproducibility
set.seed(2023)
index <- sample(1:NROW(GermanCredit),NROW(GermanCredit)*0.80)
#Create the training set
train_data <- GermanCredit[index,]
#Create the testing set
test_data <- GermanCredit[-index,]
Your observation: After the 80/20 split, the training set contains 800 of the original 1,000 observations and the testing set contains the remaining 200.
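To verify the split, one can check the dimensions and the class proportions in each set (a quick sketch):
#sanity-check the 80/20 split
dim(train_data) #expect 800 rows
dim(test_data) #expect 200 rows
prop.table(table(train_data$Class)) #class proportions in the training set
prop.table(table(test_data$Class)) #class proportions in the testing set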
library(e1071)
germancredit.svm = svm( as.factor(Class) ~ .,
data = train_data, kernel = 'linear')
summary(germancredit.svm)
##
## Call:
## svm(formula = as.factor(Class) ~ ., data = train_data, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 418
##
## ( 201 217 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Your observation: After fitting the SVM model and running its summary, I can see that there are 418 support vectors, split fairly evenly between the two classes (201 vs. 217). Note that these are per-class support-vector counts, not class frequencies, so they only loosely reflect class balance.
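The summary shows the default cost = 1 was used. If desired, the cost parameter could be selected by cross-validation with e1071's tune(); a sketch with illustrative candidate values:
#optional: choose cost by 10-fold cross-validation (values below are illustrative)
set.seed(2023)
tuned <- tune(svm, as.factor(Class) ~ ., data = train_data,
              kernel = 'linear',
              ranges = list(cost = c(0.01, 0.1, 1, 10)))
summary(tuned) #cross-validated error for each candidate cost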
#Make predictions on the train data
pred_germancredit_train <- predict(germancredit.svm, train_data)
summary(pred_germancredit_train)
## 0 1
## 210 590
Your observation: Looking at the predicted classes, the model predicts class 1 far more often than class 0 (590 vs. 210), consistent with class 1 (good credit) being the majority class in the data.
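For context, the predicted counts can be compared with the actual class counts in the training set (which, per the confusion matrix below, are 242 zeros and 558 ones):
table(train_data$Class) #actual training counts: 242 zeros, 558 ones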
#Confusion matrix
Cmatrix_train = table(true = train_data$Class,
pred = pred_germancredit_train)
Cmatrix_train
## pred
## true 0 1
## 0 142 100
## 1 68 490
Misclassification Rate (MR)
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.21
Your observation: In the confusion matrix, among the actual negatives there are more true negatives (142) than false positives (100), and among the actual positives there are more true positives (490) than false negatives (68). The MR is 0.21, meaning the model misclassifies 21% of the training instances, which I would consider relatively high.
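To make the observation concrete, per-class rates can be pulled from the matrix (a sketch; the numeric values follow directly from the counts above):
#sensitivity and specificity from the training confusion matrix
TP <- Cmatrix_train['1','1']; FN <- Cmatrix_train['1','0']
TN <- Cmatrix_train['0','0']; FP <- Cmatrix_train['0','1']
TP/(TP + FN) #sensitivity: 490/558, about 0.88
TN/(TN + FP) #specificity: 142/242, about 0.59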
germancredit.svm2 = svm( as.factor(Class) ~ .,
data = test_data, kernel = 'linear')
summary(germancredit.svm2)
##
## Call:
## svm(formula = as.factor(Class) ~ ., data = test_data, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 95
##
## ( 52 43 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
# Make predictions on the test data
pred_germancredit_test <- predict(germancredit.svm2, test_data)
summary(pred_germancredit_test)
## 0 1
## 55 145
Your observation: The summary of this SVM model shows 95 support vectors, split fairly evenly between the classes (52 and 43). Looking at the predicted classes, the model predicts class 1 much more often than class 0 (145 vs. 55), again consistent with class 1 being the majority class.
#Confusion matrix
Cmatrix_test = table(true = test_data$Class,
pred = pred_germancredit_test)
Cmatrix_test
## pred
## true 0 1
## 0 44 14
## 1 11 131
Misclassification Rate (MR)
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.125
Your observation: In the confusion matrix, among the actual negatives there are more true negatives (44) than false positives (14), and among the actual positives there are more true positives (131) than false negatives (11). The MR is 0.125, meaning 12.5% of the instances are misclassified, which I would consider relatively low.
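As an aside, the matrix above evaluates a model that was refit on the test set itself. A sketch of the more conventional out-of-sample check, applying the earlier train-fitted germancredit.svm to test_data:
#evaluate the train-fitted model on the held-out test set
pred_oos <- predict(germancredit.svm, test_data)
Cmatrix_oos <- table(true = test_data$Class, pred = pred_oos)
1 - sum(diag(Cmatrix_oos))/sum(Cmatrix_oos) #out-of-sample MR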
germancredit.svm_asymmetric21 = svm(as.factor(Class) ~ .,
data = train_data,
kernel = 'linear',
class.weights = c("1" = 2, "0" = 1))
summary(germancredit.svm_asymmetric21)
##
## Call:
## svm(formula = as.factor(Class) ~ ., data = train_data, kernel = "linear",
## class.weights = c(`1` = 2, `0` = 1))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 386
##
## ( 241 145 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Your observation: After fitting the weighted SVM model and running its summary, I can see that there are 386 support vectors, and the per-class counts (241 vs. 145) are now noticeably uneven, reflecting the asymmetric class weights.
pred_germancredit_train_cost <- predict(germancredit.svm_asymmetric21, train_data)
summary(pred_germancredit_train_cost)
## 0 1
## 63 737
Your observation: Looking at the predicted classes, the weighted model predicts class 1 overwhelmingly more often than class 0 (737 vs. 63), a much more lopsided split than the unweighted model produced; this is the expected effect of the 2:1 weight on class 1.
#C matrix for training
Cmatrix_train_21 = table( true = train_data$Class, pred = pred_germancredit_train_cost)
Cmatrix_train_21
## pred
## true 0 1
## 0 52 190
## 1 11 547
#MR training
1 - sum(diag(Cmatrix_train_21))/sum(Cmatrix_train_21)
## [1] 0.25125
Your observation: In the confusion matrix, most of the actual negatives are now misclassified as positives (190 false positives vs. 52 true negatives), while almost all actual positives are classified correctly (547 true positives vs. 11 false negatives). The MR is 0.25125, so about 25.1% of the training instances are misclassified, which I would consider relatively high.
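The asymmetry is clearer in the per-class error rates implied by the matrix above (computed directly from its counts):
#per-class training error rates under the 2:1 weighting
190/(52 + 190) #error rate on actual 0s: about 0.785
11/(547 + 11) #error rate on actual 1s: about 0.020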
#refit the model with probability enabled
germancredit.svm_asymmetric21 = svm(as.factor(Class) ~ .,
data = train_data,
kernel = 'linear',
class.weights = c("1" = 2, "0" = 1),
probability = TRUE)
#get predictions for training
pred_prob_train <- predict(germancredit.svm_asymmetric21, train_data, probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, "1"] #select the class-1 probability column by name (column 2 here; name-based indexing is safer than positional)
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, train_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8111911
Your observation: The ROC curve rises well above the 45-degree diagonal, reaching high true positive rates at relatively low false positive rates. Since the AUC is about 0.81 (a perfect model would have an AUC of 1), I would consider the model to maintain good performance across different threshold settings.
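Individual operating points along the curve can be inspected from the ROCR objects created above (a sketch; the cut column holds the probability thresholds):
#tabulate threshold, FPR, and TPR along the ROC curve
cutoffs <- data.frame(cut = slot(perf, 'alpha.values')[[1]],
                      fpr = slot(perf, 'x.values')[[1]],
                      tpr = slot(perf, 'y.values')[[1]])
head(cutoffs[cutoffs$fpr <= 0.3, ]) #points with FPR at or below 0.3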
#fit the weighted model on the test data with probability enabled
germancredit.svm_asymmetric21_test = svm(as.factor(Class) ~ .,
data = test_data,
kernel = 'linear',
class.weights = c("1" = 2, "0" = 1),
probability = TRUE)
summary(germancredit.svm_asymmetric21_test)
##
## Call:
## svm(formula = as.factor(Class) ~ ., data = test_data, kernel = "linear",
## class.weights = c(`1` = 2, `0` = 1), probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 90
##
## ( 44 46 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
pred_germancredit_test_cost <- predict(germancredit.svm_asymmetric21_test, test_data)
summary(pred_germancredit_test_cost)
## 0 1
## 40 160
Your observation: After fitting this SVM model and running its summary, I can see that there are 90 support vectors, split almost evenly between the classes (44 and 46). Looking at the predicted classes, the model predicts class 1 much more often than class 0 (160 vs. 40), again reflecting the majority class.
#C matrix for testing
Cmatrix_test_21 = table( true = test_data$Class, pred = pred_germancredit_test_cost)
Cmatrix_test_21
## pred
## true 0 1
## 0 36 22
## 1 4 138
#MR testing
1 - sum(diag(Cmatrix_test_21))/sum(Cmatrix_test_21)
## [1] 0.13
Your observation: In the confusion matrix, among the actual negatives there are more true negatives (36) than false positives (22), and among the actual positives there are far more true positives (138) than false negatives (4). The MR is 0.13, meaning 13% of the instances are misclassified, which I would consider relatively low.
#get predictions for testing
pred_prob_test <- predict(germancredit.svm_asymmetric21, test_data, probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, "1"] #class-1 probability column, selected by name
#ROC and AUC
#ROC
pred2 <- prediction(pred_prob_test, test_data$Class)
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf2, colorize=TRUE)
unlist(slot(performance(pred2, "auc"), "y.values"))
## [1] 0.7793832
Your observation: As with the training data, the ROC curve rises well above the diagonal, reaching high true positive rates at low false positive rates. The AUC is about 0.78, so I would again consider the model to maintain good performance across different threshold settings.
When fitting the SVM without weighted costs on the training data, the support vectors were split fairly evenly between the classes, whereas the weighted SVM produced a noticeably uneven split; on the testing data, both versions had roughly even splits. The unweighted model's MRs were 21% (training) and 12.5% (testing), versus 25.12% and 13% for the weighted model. By overall misclassification rate, the unweighted SVM is therefore the better choice here. That said, the weighted model is not strictly worse: by penalizing errors on class 1 twice as heavily, it sharply reduces false negatives (11 vs. 68 on training, 4 vs. 11 on testing) at the cost of many more false positives, which could be the preferable trade-off if misclassifying a class-1 applicant is considered more costly.
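Collecting the misclassification rates reported above into one small table makes the comparison easier to see:
#side-by-side MRs from the results above
data.frame(model = c('unweighted', 'weighted 2:1'),
           train_MR = c(0.21, 0.25125),
           test_MR = c(0.125, 0.13))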
Logistic regression provides interpretable coefficients for each variable, making it easy to understand each predictor's influence on the response. SVM, especially with non-linear kernels, is less interpretable because the implicit transformation into a higher-dimensional space makes it hard to relate individual features directly to the separating hyperplane. On the other hand, SVM handles non-linear classification naturally through kernel functions, which allow it to capture non-linear patterns. Logistic regression is inherently a linear model, though it can be extended to handle non-linear relationships by adding polynomial or interaction terms between variables.
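For a concrete comparison (a sketch; the 0.5 cutoff and the full-formula fit are illustrative choices, not part of the assignment):
#fit a logistic regression on the same training data and score the test set
glm_fit <- glm(Class ~ ., data = train_data, family = binomial)
glm_prob <- predict(glm_fit, test_data, type = 'response') #P(Class = 1)
glm_pred <- ifelse(glm_prob > 0.5, '1', '0')
mean(glm_pred != as.character(test_data$Class)) #test MR, comparable to the SVM MRs above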