Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for variable descriptions. The response variable is Class
and all others are predictors.
Run the following code only once to install the package
caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 ("Good") or 0 ("Bad")
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor, since svm() requires a factor response for classification
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: Class is recoded as 1 for “Good” and 0 otherwise, then stored as a factor so it can be used as the classification target.
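The recoding step can be tried on a handful of labels first. The sketch below uses made-up toy labels (`class_raw` is a hypothetical stand-in for `GermanCredit$Class`) and only base R.

```r
# Toy labels standing in for GermanCredit$Class (hypothetical values)
class_raw <- factor(c("Good", "Bad", "Good", "Good", "Bad"),
                    levels = c("Bad", "Good"))

# The comparison gives TRUE/FALSE, and as.numeric() turns that into 1/0
class_num <- as.numeric(class_raw == "Good")

# Store the result as a factor so it is treated as a categorical response
class_fac <- as.factor(class_num)

print(class_num)
print(levels(class_fac))
```

After this step the factor has the two levels "0" and "1", which matches the `str()` output shown above for `Class`.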
# Optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
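The code above drops columns by position, and how those positions were chosen is not stated. One common way to spot columns that carry no information is to look for columns that never vary. The sketch below does this on a small made-up data frame (`toy` and its values are hypothetical); it is not necessarily the rule used to pick the indices above.

```r
# Small made-up data frame; Purpose.Vacation is constant in this toy sample
toy <- data.frame(
  Amount           = c(1169, 5951, 2096, 7882),
  Purpose.Vacation = c(0, 0, 0, 0),
  Housing.Own      = c(1, 1, 1, 0)
)

# A column with a single distinct value cannot help separate the classes
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
constant_names <- names(toy)[is_constant]

# Drop by name rather than by position, so the code survives column reordering
toy_clean <- toy[, !is_constant, drop = FALSE]

print(constant_names)
print(names(toy_clean))
```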
table(GermanCredit$Class)
##
## 0 1
## 300 700
head(GermanCredit)
## Duration Amount InstallmentRatePercentage ResidenceDuration Age
## 1 6 1169 4 4 67
## 2 48 5951 2 2 22
## 3 12 2096 2 3 49
## 4 42 7882 2 4 45
## 5 24 4870 3 4 53
## 6 36 9055 2 4 35
## NumberExistingCredits NumberPeopleMaintenance Telephone ForeignWorker Class
## 1 2 1 0 1 1
## 2 1 1 1 1 0
## 3 1 2 1 1 1
## 4 1 2 1 1 1
## 5 2 2 1 1 0
## 6 1 2 0 1 1
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 1 1 0
## 2 0 1
## 3 0 0
## 4 1 0
## 5 1 0
## 6 0 0
## CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## 1 0 0 0
## 2 0 1 0
## 3 0 0 0
## 4 0 1 0
## 5 0 0 1
## 6 0 1 0
## Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 1
## 5 1 0 0
## 6 0 0 0
## Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 1 1 0 0
## 2 1 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Purpose.Education Purpose.Retraining Purpose.Business
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
## 6 1 0 0
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 1 0 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 0 0
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 1 0 0 0
## 2 0 1 0
## 3 0 0 1
## 4 0 0 1
## 5 0 1 0
## 6 0 1 0
## EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## 1 0 1 1
## 2 1 0 1
## 3 0 1 1
## 4 0 1 0
## 5 0 1 1
## 6 0 1 1
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 0 0 1
## 5 0 0 0
## 6 0 0 0
## Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## 1 0 1 0 0
## 2 0 1 0 0
## 3 0 1 0 1
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 1
## Job.SkilledEmployee
## 1 1
## 2 1
## 3 0
## 4 1
## 5 1
## 6 0
Your observation: 0 appears 300 times and 1 appears 700 times. This means 700 of the 1,000 applicants are labeled “Good” and 300 are not.
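Because the classes are unbalanced, it helps to know the misclassification rate of a trivial rule that always predicts the majority class. The sketch below computes that baseline from the counts in the table above; the variable names are illustrative only.

```r
# Class counts copied from table(GermanCredit$Class) above
class_counts <- c("0" = 300, "1" = 700)

# Always predicting the most frequent class gets every minority case wrong
majority_class <- names(class_counts)[which.max(class_counts)]
baseline_mr <- 1 - max(class_counts) / sum(class_counts)

print(majority_class)
print(baseline_mr)
```

A misclassification rate from a fitted model is easier to judge when compared against this baseline.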
Use random seed 2023 for reproducibility.
set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
German.train = GermanCredit[index,]
German.test = GermanCredit[-index,]
Your observation: A random 80% of the rows (800 observations) is assigned to German.train and the remaining 20% (200 observations) to German.test.
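The same splitting pattern can be checked on a small made-up data frame. In this sketch `toy` is hypothetical; the point is only that the two pieces have the expected sizes and share no rows.

```r
set.seed(2023)  # fixes the random draw so the split is repeatable

# Hypothetical data frame with 100 rows
toy <- data.frame(id = 1:100, x = rnorm(100))

# Draw 80% of the row numbers without replacement
index <- sample(seq_len(nrow(toy)), size = floor(0.80 * nrow(toy)))

toy_train <- toy[index, ]   # sampled rows
toy_test  <- toy[-index, ]  # everything that was not sampled

print(nrow(toy_train))
print(nrow(toy_test))
print(length(intersect(toy_train$id, toy_test$id)))
```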
library(e1071)
# Fitting SVM model for training set
German.svm = svm(Class ~ .,
data = German.train, kernel = 'linear')
summary(German.svm)
##
## Call:
## svm(formula = Class ~ ., data = German.train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 418
##
## ( 201 217 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Your observation: Out of 418 support vectors, 201 are in class “0” and 217 are in class “1.” These are the only two class levels in the model.
### 2. Use the training set to get predicted classes.
# Predictions for German.train
pred_German_train <- predict(German.svm, German.train)
# Histogram
numeric_pgtrain_data <- as.numeric(as.character(pred_German_train)) # as.character() first, so the values are 0/1 rather than the factor codes 1/2
hist(numeric_pgtrain_data, breaks = 2, main = "Histogram of True/False Data", xlab = "Values (0 = False, 1 = True)", ylab = "Frequency")
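Your observation: The histogram shows how the training-set predictions split between the two classes. From the confusion matrix below, 210 observations are predicted as 0 and 590 as 1 (about 74% predicted as 1), nearly the same proportion as on the test set (72%).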
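One detail to keep in mind when turning factor predictions into numbers for a plot: `as.numeric()` on a factor returns the internal level codes (1, 2, ...), not the labels. The sketch below shows the difference on a made-up prediction vector (`pred` is hypothetical).

```r
# Made-up predictions with the same levels as the SVM output
pred <- factor(c("1", "0", "1", "1"), levels = c("0", "1"))

# Level codes: "0" is level 1 and "1" is level 2
codes <- as.numeric(pred)

# Going through as.character() recovers the 0/1 labels themselves
values <- as.numeric(as.character(pred))

# For a two-class factor, a frequency table is often enough
counts <- table(pred)

print(codes)
print(values)
print(counts)
```

A bar chart of `counts` carries the same information as the histogram and avoids the conversion altogether.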
# Confusion matrix for training set
Cmatrix_German_train = table(true = German.train$Class, pred = pred_German_train)
Cmatrix_German_train
## pred
## true 0 1
## 0 142 100
## 1 68 490
# Train MR
1 - sum(diag(Cmatrix_German_train))/sum(Cmatrix_German_train)
## [1] 0.21
Your observation: According to the matrix, there are 490 true positives and 142 true negatives, with 100 false positives and 68 false negatives. The MR of 0.21 means 21% of the training observations were misclassified.
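The arithmetic behind these numbers can be reproduced directly from the four cell counts. The sketch below rebuilds the training confusion matrix by hand (counts copied from the output above) and adds sensitivity and specificity, which are not part of the original analysis.

```r
# Training confusion matrix, filled column by column: rows = true, columns = predicted
cm_train <- matrix(c(142, 68, 100, 490), nrow = 2,
                   dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# Misclassification rate: share of observations off the diagonal
mr_train <- 1 - sum(diag(cm_train)) / sum(cm_train)

# Sensitivity: share of true 1s predicted as 1
sensitivity <- cm_train["1", "1"] / sum(cm_train["1", ])

# Specificity: share of true 0s predicted as 0
specificity <- cm_train["0", "0"] / sum(cm_train["0", ])

print(mr_train)
print(round(c(sensitivity = sensitivity, specificity = specificity), 3))
```

The gap between the two rates shows that the errors are not spread evenly: the model is much better at recognizing class 1 than class 0.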
pred_German_test <- predict(German.svm, German.test)
# Histogram
numeric_pgtest_data <- as.numeric(as.character(pred_German_test)) # as.character() first, so the values are 0/1 rather than the factor codes 1/2
hist(numeric_pgtest_data, breaks = 2, main = "Histogram of True/False Data", xlab = "Values (0 = False, 1 = True)", ylab = "Frequency")
Your observation: The histogram shows how the test-set predictions split between the two classes. From the confusion matrix below, 56 observations are predicted as 0 and 144 as 1 (72% predicted as 1), close to the 74% seen on the training set.
# Confusion matrix for testing set
Cmatrix_German_test = table(true = German.test$Class, pred = pred_German_test)
Cmatrix_German_test
## pred
## true 0 1
## 0 32 26
## 1 24 118
# Test MR
1 - sum(diag(Cmatrix_German_test))/sum(Cmatrix_German_test)
## [1] 0.25
Your observation: According to the matrix, there are 118 true positives and 32 true negatives, with 26 false positives and 24 false negatives. The MR of 0.25 means 25% of the test observations were misclassified, which is 4 percentage points higher than on the training set.
German.svm_asymmetric12 = svm(Class ~ .,
data = German.train,
kernel = 'polynomial',
class.weights = c("0" = 1, "1" = 2),
probability = TRUE)
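Your observation: The class weights give class “1” twice the weight of class “0,” so errors on class “1” are penalized more heavily. This model also uses a polynomial kernel instead of the linear one, so differences from the first model reflect both changes.
### 2. Use the training set to get predicted probabilities and classes.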
pred_German_train12 <- predict(German.svm_asymmetric12, German.train)
# Histogram
numeric_12pgtrain_data <- as.numeric(as.character(pred_German_train12)) # as.character() first, so the values are 0/1 rather than the factor codes 1/2
hist(numeric_12pgtrain_data, breaks = 2, main = "Histogram of True/False Data", xlab = "Values (0 = False, 1 = True)", ylab = "Frequency")
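Your observation: The split is more lopsided than for the first model. From the confusion matrix below, 130 observations are predicted as 0 and 670 as 1, compared with 210 and 590 before, which is consistent with the extra weight on class 1.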
# Confusion matrix for training_12
Cmatrix_train_12 = table( true = German.train$Class, pred = pred_German_train12)
Cmatrix_train_12
## pred
## true 0 1
## 0 130 112
## 1 0 558
# MR
1 - sum(diag(Cmatrix_train_12))/sum(Cmatrix_train_12)
## [1] 0.14
Your observation: According to the matrix, there are 558 true positives and 130 true negatives, with 112 false positives and 0 false negatives. The MR of 0.14 means 14% of the training observations were misclassified, lower than the 21% of the first model on the training set.
# Refit the linear SVM with probability = TRUE so that predict() can return class probabilities
German.svm_prob = svm(Class ~ .,
data = German.train, kernel = 'linear',
probability = TRUE)
pred_prob_train = predict(German.svm_prob,
newdata = German.train,
probability = TRUE)
str(pred_prob_train)
## Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
## - attr(*, "names")= chr [1:800] "885" "464" "431" "361" ...
## - attr(*, "probabilities")= num [1:800, 1:2] 0.3477 0.1538 0.0437 0.2535 0.3079 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:800] "885" "464" "431" "361" ...
## .. ..$ : chr [1:2] "0" "1"
# Keep only the predicted probability of class "1" (selected by column name)
pred_prob_train = attr(pred_prob_train, "probabilities")[, "1"]
# ROC for train
library(ROCR)
pred <- prediction(pred_prob_train, German.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
# AUC for train
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8298158
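To make the AUC value easier to read, it can help to compute one by hand on a few scores. The sketch below uses the pairwise definition: the share of (class 1, class 0) pairs in which the class-1 case gets the higher score, with ties counted as half. The function `auc_by_pairs` and the toy scores are hypothetical and are not part of ROCR.

```r
# AUC as the fraction of positive/negative pairs ranked in the right order
auc_by_pairs <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  wins <- outer(pos, neg, ">")   # positive scored above negative
  ties <- outer(pos, neg, "==")  # ties count as half a win
  (sum(wins) + 0.5 * sum(ties)) / (length(pos) * length(neg))
}

# Made-up scores: three class-1 cases followed by three class-0 cases
scores <- c(0.9, 0.8, 0.35, 0.6, 0.4, 0.2)
labels <- c(1, 1, 1, 0, 0, 0)

auc_toy <- auc_by_pairs(scores, labels)
print(auc_toy)  # 7 of the 9 pairs are ordered correctly
```

Read this way, an AUC of 0.5 corresponds to random ranking and 1 to perfect separation of the two classes.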
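Your observation: The ROC curve looks good. The AUC is 0.8298, meaning that a randomly chosen class-1 observation receives a higher predicted probability than a randomly chosen class-0 observation about 83% of the time.
### 5. Use the testing set to get predicted probabilities and classes.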
pred_German_test12 <- predict(German.svm_asymmetric12, German.test)
# Histogram
numeric_12pgtest_data <- as.numeric(as.character(pred_German_test12)) # use the test-set predictions, converted to 0/1 values
hist(numeric_12pgtest_data, breaks = 2, main = "Histogram of True/False Data", xlab = "Values (0 = False, 1 = True)", ylab = "Frequency")
Your observation: On the test set the weighted model predicts 20 observations as 0 and 180 as 1 (see the confusion matrix below), a more lopsided split than the first model's 56 and 144.
### 6. Obtain confusion matrix and MR on testing set. (use predicted classes).
# Confusion matrix for testing_12
Cmatrix_test_12 = table( true = German.test$Class, pred = pred_German_test12)
Cmatrix_test_12
## pred
## true 0 1
## 0 14 44
## 1 6 136
#MR testing
1 - sum(diag(Cmatrix_test_12))/sum(Cmatrix_test_12)
## [1] 0.25
Your observation: According to the matrix, there are 136 true positives and 14 true negatives, with 44 false positives and 6 false negatives. The MR of 0.25 means 25% of the test observations were misclassified, the same MR as the first model on the testing set.
### 7. Obtain ROC and AUC on testing set. (use predicted probabilities).
# Prep
pred_prob_test = predict(German.svm_prob,
newdata = German.test,
probability = TRUE)
# Keep only the predicted probability of class "1" (selected by column name)
pred_prob_test = attr(pred_prob_test, "probabilities")[, "1"]
# ROC
pred <- prediction(pred_prob_test, German.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
# AUC for test
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8016027
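Your observation: The ROC curve looks good. The AUC is 0.8016, meaning that a randomly chosen class-1 observation receives a higher predicted probability than a randomly chosen class-0 observation about 80% of the time. This is slightly lower than the training AUC of 0.8298.
# Task 4: Report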
When changing the weight of the SVM model, the results for the training set were better: the MR went from 21% to 14%. The testing MR did not change and stayed at 25%, although the errors shifted: false negatives fell from 24 to 6 while false positives rose from 26 to 44. Overall, placing more emphasis on “1” over “0” improved the fit on the training data but left the overall error rate on the testing set unchanged.
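Since the plain MR treats both kinds of error alike, one further way to compare the two models is to weight the errors the same way the class weights do. The sketch below assumes the 1:2 weights can be read as misclassification costs (an error on a true “1” costs 2, an error on a true “0” costs 1); that reading is an assumption made here for illustration, and `weighted_cost` is a hypothetical helper. The cell counts are copied from the two testing confusion matrices above.

```r
# Average cost per observation when the two error types have different costs
weighted_cost <- function(cm, cost_fp = 1, cost_fn = 2) {
  fp <- cm["0", "1"]  # true 0 predicted as 1
  fn <- cm["1", "0"]  # true 1 predicted as 0
  (cost_fp * fp + cost_fn * fn) / sum(cm)
}

labels <- list(true = c("0", "1"), pred = c("0", "1"))

# Testing confusion matrices, filled column by column
cm_test_first    <- matrix(c(32, 24, 26, 118), nrow = 2, dimnames = labels)
cm_test_weighted <- matrix(c(14, 6, 44, 136), nrow = 2, dimnames = labels)

cost_first    <- weighted_cost(cm_test_first)
cost_weighted <- weighted_cost(cm_test_weighted)

print(c(first = cost_first, weighted = cost_weighted))
```

Under this assumed cost structure the two models no longer tie on the testing set, because the weighted model trades costly errors on class 1 for cheaper errors on class 0.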
Logistic Regression
Pros:
- Simplicity: easy to interpret
- Efficient: fewer computing resources needed
- Probabilistic interpretation: gives probabilities for outcomes
Cons:
- Assumes a linear relationship between the predictors and the log-odds of the outcome
- Sensitive to outliers
- Performance can be negatively impacted if variables are highly correlated
SVM
Pros:
- Effective in high-dimensional spaces
- Versatile through the choice of kernel
- Memory efficient
Cons:
- Can perform poorly when the number of features exceeds the number of samples
- Sensitive to noise
Overall, SVM is better suited to more complex datasets, while logistic regression is best for simpler datasets and when interpretability matters, provided the relationship with the outcome is at least roughly linear.
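The interpretability point for logistic regression can be illustrated with a short base R sketch. The data below are simulated (the variable names, the sample size, and the true coefficients -0.5 and 2 are all made up for this example) and have nothing to do with the German credit data.

```r
set.seed(2023)

# Simulated data: the log-odds of y = 1 increase linearly with x
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))

# Logistic regression via glm() with a binomial family
fit <- glm(y ~ x, family = binomial)

# The coefficient of x is the change in log-odds per unit of x;
# exponentiating it gives an odds ratio
slope <- unname(coef(fit)["x"])
odds_ratio <- exp(slope)

# Predicted probabilities come directly from the model
p_hat <- predict(fit, newdata = data.frame(x = c(-1, 0, 1)), type = "response")

print(round(coef(fit), 2))
print(round(odds_ratio, 2))
print(round(p_hat, 3))
```

Each coefficient has a direct reading in terms of odds, which is the kind of summary an SVM with a nonlinear kernel does not provide.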