Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class;
all other variables are predictors.
Run the following code only once, to install the package
caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) #this package contains the German credit data in its numeric (dummy-coded) format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor, as svm() requires a factor response for classification
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: After running the conversion code, `Class` is now a factor with levels 0 and 1.
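A quick sanity check on the recoded response (a brief sketch; the 300 Bad / 700 Good split is the documented class distribution of this dataset):
#check the factor levels and class counts
levels(GermanCredit$Class) #should be "0" "1"
table(GermanCredit$Class) #expect 300 zeros (Bad) and 700 ones (Good)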
#This optional code drops variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
#summary statistics
summary(GermanCredit)
#head
head(GermanCredit)
#structure
str(GermanCredit)
Your observation: Looking at the summary statistics, the Amount variable has the largest minimum, 1st quartile, median, mean, 3rd quartile, and maximum; Age has the second-largest values for each statistic, followed by Duration. Most of the other variables take values of 0 or 1. The structure confirms that the majority of the variables are binary dummy variables, with a few non-binary exceptions such as Duration, Amount, and Age.
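The hard-coded column indices above can also be derived programmatically. A minimal sketch using caret's nearZeroVar() (note: it may flag a slightly different set of columns than the indices used here):
#identify low-information (near-zero-variance) predictors instead of hard-coding indices
nzv <- nearZeroVar(GermanCredit) #integer vector of column positions
colnames(GermanCredit)[nzv] #inspect which predictors would be dropped
#GermanCredit <- GermanCredit[, -nzv] #uncomment to apply the drop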
Set the seed to 2023 for reproducibility.
#Set the seed for reproducibility
set.seed(2023)
index <- sample(1:NROW(GermanCredit),NROW(GermanCredit)*0.80)
#Create the training set
train_data <- GermanCredit[index,]
#Create the testing set
test_data <- GermanCredit[-index,]
Your observation: After the 80/20 split, the training set contains 800 of the original 1,000 observations and the testing set contains the remaining 200.
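To verify the split, one can check the dimensions and the class proportions in each set (a quick sketch):
#sanity-check the 80/20 split
dim(train_data) #expect 800 rows
dim(test_data) #expect 200 rows
prop.table(table(train_data$Class)) #class proportions in the training set
prop.table(table(test_data$Class)) #class proportions in the testing set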
library(e1071)
germancredit.svm = svm( as.factor(Class) ~ .,
data = train_data, kernel = 'linear')
summary(germancredit.svm)
##
## Call:
## svm(formula = as.factor(Class) ~ ., data = train_data, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 418
##
## ( 201 217 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Your observation: After fitting the SVM model and running its summary, I can see that there are 418 support vectors, split fairly evenly between the two classes (201 vs. 217). Note that these are per-class support-vector counts, not class frequencies, so they only loosely reflect class balance.
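The summary shows the default cost = 1 was used. If desired, the cost parameter could be selected by cross-validation with e1071's tune(); a sketch with illustrative candidate values:
#optional: choose cost by 10-fold cross-validation (values below are illustrative)
set.seed(2023)
tuned <- tune(svm, as.factor(Class) ~ ., data = train_data,
              kernel = 'linear',
              ranges = list(cost = c(0.01, 0.1, 1, 10)))
summary(tuned) #cross-validated error for each candidate cost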
#Make predictions on the train data
pred_germancredit_train <- predict(germancredit.svm, train_data)
summary(pred_germancredit_train)
## 0 1
## 210 590
Your observation: Looking at the predicted classes, the model predicts class 1 far more often than class 0 (590 vs. 210), consistent with class 1 (good credit) being the majority class in the data.
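For context, the predicted counts can be compared with the actual class counts in the training set (which, per the confusion matrix below, are 242 zeros and 558 ones):
table(train_data$Class) #actual training counts: 242 zeros, 558 ones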
#Confusion matrix
Cmatrix_train = table(true = train_data$Class,
pred = pred_germancredit_train)
Cmatrix_train
## pred
## true 0 1
## 0 142 100
## 1 68 490
Misclassification Rate (MR)
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.21
Your observation: In the confusion matrix, among the actual negatives there are more true negatives (142) than false positives (100), and among the actual positives there are more true positives (490) than false negatives (68). The MR is 0.21, meaning the model misclassifies 21% of the training instances, which I would consider relatively high.
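To make the observation concrete, per-class rates can be pulled from the matrix (a sketch; the numeric values follow directly from the counts above):
#sensitivity and specificity from the training confusion matrix
TP <- Cmatrix_train['1','1']; FN <- Cmatrix_train['1','0']
TN <- Cmatrix_train['0','0']; FP <- Cmatrix_train['0','1']
TP/(TP + FN) #sensitivity: 490/558, about 0.88
TN/(TN + FP) #specificity: 142/242, about 0.59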
germancredit.svm2 = svm( as.factor(Class) ~ .,
data = test_data, kernel = 'linear')
summary(germancredit.svm2)
##
## Call:
## svm(formula = as.factor(Class) ~ ., data = test_data, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 95
##
## ( 52 43 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
# Make predictions on the test data
pred_germancredit_test <- predict(germancredit.svm2, test_data)
summary(pred_germancredit_test)
## 0 1
## 55 145
Your observation: The summary of this SVM model shows 95 support vectors, split fairly evenly between the classes (52 and 43). Looking at the predicted classes, the model predicts class 1 much more often than class 0 (145 vs. 55), again consistent with class 1 being the majority class.
#Confusion matrix
Cmatrix_test = table(true = test_data$Class,
pred = pred_germancredit_test)
Cmatrix_test
## pred
## true 0 1
## 0 44 14
## 1 11 131
Misclassification Rate (MR)
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.125
Your observation: In the confusion matrix, among the actual negatives there are more true negatives (44) than false positives (14), and among the actual positives there are more true positives (131) than false negatives (11). The MR is 0.125, meaning 12.5% of the instances are misclassified, which I would consider relatively low.
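As an aside, the matrix above evaluates a model that was refit on the test set itself. A sketch of the more conventional out-of-sample check, applying the earlier train-fitted germancredit.svm to test_data:
#evaluate the train-fitted model on the held-out test set
pred_oos <- predict(germancredit.svm, test_data)
Cmatrix_oos <- table(true = test_data$Class, pred = pred_oos)
1 - sum(diag(Cmatrix_oos))/sum(Cmatrix_oos) #out-of-sample MR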
germancredit.svm_asymmetric21 = svm(as.factor(Class) ~ .,
data = train_data,
kernel = 'linear',
class.weights = c("1" = 2, "0" = 1))
summary(germancredit.svm_asymmetric21)
##
## Call:
## svm(formula = as.factor(Class) ~ ., data = train_data, kernel = "linear",
## class.weights = c(`1` = 2, `0` = 1))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 386
##
## ( 241 145 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Your observation: After fitting the weighted SVM model and running its summary, I can see that there are 386 support vectors, and the per-class counts (241 vs. 145) are now noticeably uneven, reflecting the asymmetric class weights.
pred_germancredit_train_cost <- predict(germancredit.svm_asymmetric21, train_data)
summary(pred_germancredit_train_cost)
## 0 1
## 63 737
Your observation: Looking at the predicted classes, the weighted model predicts class 1 overwhelmingly more often than class 0 (737 vs. 63), a much more lopsided split than the unweighted model produced; this is the expected effect of the 2:1 weight on class 1.
#C matrix for training
Cmatrix_train_21 = table( true = train_data$Class, pred = pred_germancredit_train_cost)
Cmatrix_train_21
## pred
## true 0 1
## 0 52 190
## 1 11 547
#MR training
1 - sum(diag(Cmatrix_train_21))/sum(Cmatrix_train_21)
## [1] 0.25125
Your observation: In the confusion matrix, most of the actual negatives are now misclassified as positives (190 false positives vs. 52 true negatives), while almost all actual positives are classified correctly (547 true positives vs. 11 false negatives). The MR is 0.25125, so about 25.1% of the training instances are misclassified, which I would consider relatively high.
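The asymmetry is clearer in the per-class error rates implied by the matrix above (computed directly from its counts):
#per-class training error rates under the 2:1 weighting
190/(52 + 190) #error rate on actual 0s: about 0.785
11/(547 + 11) #error rate on actual 1s: about 0.020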
#refit the model with probability enabled
germancredit.svm_asymmetric21 = svm(as.factor(Class) ~ .,
data = train_data,
kernel = 'linear',
class.weights = c("1" = 2, "0" = 1),
probability = TRUE)
#get predictions for training
pred_prob_train <- predict(germancredit.svm_asymmetric21, train_data, probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, "1"] #select the class-1 probability column by name (column 2 here; name-based indexing is safer than positional)
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, train_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8111911
Your observation: The ROC curve rises well above the 45-degree diagonal, reaching high true positive rates at relatively low false positive rates. Since the AUC is about 0.81 (a perfect model would have an AUC of 1), I would consider the model to maintain good performance across different threshold settings.
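Individual operating points along the curve can be inspected from the ROCR objects created above (a sketch; the cut column holds the probability thresholds):
#tabulate threshold, FPR, and TPR along the ROC curve
cutoffs <- data.frame(cut = slot(perf, 'alpha.values')[[1]],
                      fpr = slot(perf, 'x.values')[[1]],
                      tpr = slot(perf, 'y.values')[[1]])
head(cutoffs[cutoffs$fpr <= 0.3, ]) #points with FPR at or below 0.3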
#fit the weighted model on the test data with probability enabled
germancredit.svm_asymmetric21_test = svm(as.factor(Class) ~ .,
data = test_data,
kernel = 'linear',
class.weights = c("1" = 2, "0" = 1),
probability = TRUE)
summary(germancredit.svm_asymmetric21_test)
##
## Call:
## svm(formula = as.factor(Class) ~ ., data = test_data, kernel = "linear",
## class.weights = c(`1` = 2, `0` = 1), probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 90
##
## ( 44 46 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
pred_germancredit_test_cost <- predict(germancredit.svm_asymmetric21_test, test_data)
summary(pred_germancredit_test_cost)
## 0 1
## 40 160
Your observation: After fitting this SVM model and running its summary, I can see that there are 90 support vectors, split almost evenly between the classes (44 and 46). Looking at the predicted classes, the model predicts class 1 much more often than class 0 (160 vs. 40), again reflecting the majority class.
#C matrix for testing
Cmatrix_test_21 = table( true = test_data$Class, pred = pred_germancredit_test_cost)
Cmatrix_test_21
## pred
## true 0 1
## 0 36 22
## 1 4 138
#MR testing
1 - sum(diag(Cmatrix_test_21))/sum(Cmatrix_test_21)
## [1] 0.13
Your observation: In the confusion matrix, among the actual negatives there are more true negatives (36) than false positives (22), and among the actual positives there are far more true positives (138) than false negatives (4). The MR is 0.13, meaning 13% of the instances are misclassified, which I would consider relatively low.
#get predictions for testing
pred_prob_test <- predict(germancredit.svm_asymmetric21, test_data, probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, "1"] #class-1 probability column, selected by name
#ROC and AUC
#ROC
pred2 <- prediction(pred_prob_test, test_data$Class)
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf2, colorize=TRUE)
unlist(slot(performance(pred2, "auc"), "y.values"))
## [1] 0.7793832
Your observation: As with the training data, the ROC curve rises well above the diagonal, reaching high true positive rates at low false positive rates. The AUC is about 0.78, so I would again consider the model to maintain good performance across different threshold settings.
When fitting the SVM without weighted costs on the training data, the support vectors were split fairly evenly between the classes, whereas the weighted SVM produced a noticeably uneven split; on the testing data, both versions had roughly even splits. The unweighted model's MRs were 21% (training) and 12.5% (testing), versus 25.12% and 13% for the weighted model. By overall misclassification rate, the unweighted SVM is therefore the better choice here. That said, the weighted model is not strictly worse: by penalizing errors on class 1 twice as heavily, it sharply reduces false negatives (11 vs. 68 on training, 4 vs. 11 on testing) at the cost of many more false positives, which could be the preferable trade-off if misclassifying a class-1 applicant is considered more costly.
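Collecting the misclassification rates reported above into one small table makes the comparison easier to see:
#side-by-side MRs from the results above
data.frame(model = c('unweighted', 'weighted 2:1'),
           train_MR = c(0.21, 0.25125),
           test_MR = c(0.125, 0.13))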
Logistic regression provides interpretable coefficients for each variable, making it easy to understand each predictor's influence on the response. SVM, especially with non-linear kernels, is less interpretable because the implicit transformation into a higher-dimensional space makes it hard to relate individual features directly to the separating hyperplane. On the other hand, SVM handles non-linear classification naturally through kernel functions, which allow it to capture non-linear patterns. Logistic regression is inherently a linear model, though it can be extended to handle non-linear relationships by adding polynomial or interaction terms between variables.
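For a concrete comparison (a sketch; the 0.5 cutoff and the full-formula fit are illustrative choices, not part of the assignment):
#fit a logistic regression on the same training data and score the test set
glm_fit <- glm(Class ~ ., data = train_data, family = binomial)
glm_prob <- predict(glm_fit, test_data, type = 'response') #P(Class = 1)
glm_pred <- ifelse(glm_prob > 0.5, '1', '0')
mean(glm_pred != as.character(test_data$Class)) #test MR, comparable to the SVM MRs above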