Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
## Warning: package 'caret' was built under R version 4.3.1
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
GermanCredit$Class <- as.factor(GermanCredit$Class) #make sure `Class` is a factor as SVM require a factor response
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: Based on the variables, none appear categorical.
Thus, none of the varibales are converted to factor.
We know that some of the columns are categorical such as;
ResidenceDuration, NumberExistingCredit, NumberPeopleMaintenance,
Telephone ForeignWorker, Class, CheckingAccountStatus.lt.0,
CheckingAccountStatus.0.to.200, etc. We can see There are more
categorical than interger varibales based from observing the mean,
minimum and maxiumum values.
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
str(GermanCredit)
## 'data.frame': 1000 obs. of 49 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
Your observation: In the Age variable, the median is 33 years. This insight can help us observe a lower age demographic which may lead to worse credit scores for younger ages. In Duration the Max value is 72, while the mean and median are 18 and 20; respectfully. This may lead us to believe that their may be outliers present in the data. The same idea follows for Amount. The max is 18424, which is extremely higher than the other data.
2023 for
reproducibility.set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
German.train = GermanCredit[index,]
German.test = GermanCredit[-index,]
Your observation: The random seed was selected to 2023 and the data was randomly split to training (80%) and testing (20%).
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.1
# Fitting SVM model for training set
German.svm = svm({{as.factor(Class)}} ~ .,
data = German.train, kernel = 'linear')
summary(German.svm)
##
## Call:
## svm(formula = {
## {
## as.factor(Class)
## }
## } ~ ., data = German.train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 418
##
## ( 201 217 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Your observation: Here we have the SVM model that gives us the number of support vectors. We need to make the default as factor when fitting SVM. Also, we need to use the svm from package e1071. ### 2. Use the training set to get prediected classes.
# Predictions for German.train
pred_German_train <- predict(German.svm, German.train)
Your observation: Here we are gettingt the training set of the predicted class.
# Confusion matrix for training set
Cmatrix_German_train = table(true = German.train$Class, pred = pred_German_train)
Cmatrix_German_train
## pred
## true 0 1
## 0 142 100
## 1 68 490
# Train MR
1 - sum(diag(Cmatrix_German_train))/sum(Cmatrix_German_train)
## [1] 0.21
Your observation: Here the train MR is 0.21, which means about 21% of the instances are misclassifies by the SVM model. So there are a lower value indicating thea better performance. The training data set has a linear kernel and a cost parameter of 1 with classes as 0 and 1. In the table, the true and predicted class labels, “pred” are shown. This table gives us the true positive 142 true negatives, 490 true positives, 100 false negatives, and 68 false positives.
pred_German_test <- predict(German.svm, German.test)
Your observation: Here, I am using the testing dataset from line 58 and getting the predicted values for the dataset. ### 5. Obtain confusion matrix and MR on testing set.
# Confusion matrix for testing set
Cmatrix_German_test = table(true = German.test$Class, pred = pred_German_test)
Cmatrix_German_test
## pred
## true 0 1
## 0 32 26
## 1 24 118
# Test MR
1 - sum(diag(Cmatrix_German_test))/sum(Cmatrix_German_test)
## [1] 0.25
Your observation: Here the test MR, misclassification rate is approx. 25%. This is the amount of instances in the testing set that are misclassified by the SVM model. The confusion matrix tells us that the Cmatrix_German_test has 32 true negatives, 118 true positives, 26 false negatives, and 24 false positives.
German.svm_asymmetric = svm(as.factor(Class) ~ .,
data = German.train,
kernel = 'linear',
class.weights = c("0" = 1, "1" = 2),
probability = TRUE)
Your observation: Using SVM with asymmetric cost is actually easier! I kept the linear kernal and the asfactor for the variable CLass. ### 2. Use the training set to get prediected probabilities and classes.
pred_German_train <- predict(German.svm_asymmetric, German.train)
Your observation: The predict function generates predictions based on the German train model dataset. ### 3. Obtain confusion matrix and MR on training set (use predicted classes).
# Confusion matrix for training_12
Cmatrix_train_12 = table( true = German.train$Class, pred = pred_German_train)
# MR
1 - sum(diag(Cmatrix_train_12))/sum(Cmatrix_train_12)
## [1] 0.25125
Your observation: Here, the misclassification rate is 25%
German.svm_prob = svm(Class ~ .,
data = German.train, kernel = 'linear',
probability = TRUE)
pred_prob_train = predict(German.svm_prob,
newdata = German.train,
probability = TRUE)
str(pred_prob_train)
## Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
## - attr(*, "names")= chr [1:800] "885" "464" "431" "361" ...
## - attr(*, "probabilities")= num [1:800, 1:2] 0.3477 0.1538 0.0437 0.2535 0.3079 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:800] "885" "464" "431" "361" ...
## .. ..$ : chr [1:2] "0" "1"
# Necessary
pred_prob_train = attr(pred_prob_train, "probabilities")[, 2]
# ROC for train
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.3.1
pred <- prediction(pred_prob_train, German.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
# AUC for train
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8298158
Your observation: Here we can see the AUC is 0.8298. In SVM, the output values is the distance from the hyperplane.
pred_German_test <- predict(German.svm_asymmetric, German.test)
Your observation: The predicted test data helps retrieve the proabaility for the german test dataset. ### 6. Obtain confusion matrix and MR on testing set. (use predicted classes).
# Confusion matrix for testing_12
Cmatrix_test_12 = table( true = German.test$Class, pred = pred_German_test)
#MR testing
1 - sum(diag(Cmatrix_test_12))/sum(Cmatrix_test_12)
## [1] 0.245
Your observation: In this chunk, the MR testing is found by subtracting the ratio of correct predictions (the diagonal of the confusion matrix) from 1.
# Testing pred_prob
pred_prob_test = predict(German.svm_prob,
newdata = German.test,
probability = TRUE)
# Necessary
pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]
# ROC for test
pred <- prediction(pred_prob_test, German.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
# AUC for train
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8016027
Your observation: In line 210, we extract the class probability of the positive class (1) from the output of the predict function, focuing on class (1)
The Area Under the Curve (AUC) shows the trade off between the true positive rate at different probability thresholds. The first ROC for training had a higher AUC which means it had a better discrimination between the classes with 0.5 being equivalent to random guessing and 1.0 being the perfect indicator. Although, Platt scaling can be used to transform the output scores. Also, the “pred_prob_train” had a 0.8298158 for its AUC and the “pred_prob_test” had a 0.8016027.