Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
GermanCredit$Class <- as.factor(GermanCredit$Class) #make sure `Class` is a factor as SVM require a factor response,now 1 is good and 0 is bad.
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
#This is the code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
colSums(is.na(GermanCredit)) #the code book says there are no missing, but double checking
## Duration Amount
## 0 0
## InstallmentRatePercentage ResidenceDuration
## 0 0
## Age NumberExistingCredits
## 0 0
## NumberPeopleMaintenance Telephone
## 0 0
## ForeignWorker Class
## 0 0
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 0 0
## CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 0 0
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly
## 0 0
## CreditHistory.Delay Purpose.NewCar
## 0 0
## Purpose.UsedCar Purpose.Furniture.Equipment
## 0 0
## Purpose.Radio.Television Purpose.DomesticAppliance
## 0 0
## Purpose.Repairs Purpose.Education
## 0 0
## Purpose.Retraining Purpose.Business
## 0 0
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 0 0
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 0 0
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4
## 0 0
## EmploymentDuration.4.to.7 EmploymentDuration.gt.7
## 0 0
## Personal.Male.Divorced.Seperated Personal.Female.NotSingle
## 0 0
## Personal.Male.Single OtherDebtorsGuarantors.None
## 0 0
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate
## 0 0
## Property.Insurance Property.CarOther
## 0 0
## OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 0 0
## Housing.Rent Housing.Own
## 0 0
## Job.UnemployedUnskilled Job.UnskilledResident
## 0 0
## Job.SkilledEmployee
## 0
str(GermanCredit) # structure of the data set
## 'data.frame': 1000 obs. of 49 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
summary(GermanCredit) #Acquire the summ stats
## Duration Amount InstallmentRatePercentage ResidenceDuration
## Min. : 4.0 Min. : 250 Min. :1.000 Min. :1.000
## 1st Qu.:12.0 1st Qu.: 1366 1st Qu.:2.000 1st Qu.:2.000
## Median :18.0 Median : 2320 Median :3.000 Median :3.000
## Mean :20.9 Mean : 3271 Mean :2.973 Mean :2.845
## 3rd Qu.:24.0 3rd Qu.: 3972 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :72.0 Max. :18424 Max. :4.000 Max. :4.000
## Age NumberExistingCredits NumberPeopleMaintenance Telephone
## Min. :19.00 Min. :1.000 Min. :1.000 Min. :0.000
## 1st Qu.:27.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
## Median :33.00 Median :1.000 Median :1.000 Median :1.000
## Mean :35.55 Mean :1.407 Mean :1.155 Mean :0.596
## 3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :75.00 Max. :4.000 Max. :2.000 Max. :1.000
## ForeignWorker Class CheckingAccountStatus.lt.0
## Min. :0.000 0:300 Min. :0.000
## 1st Qu.:1.000 1:700 1st Qu.:0.000
## Median :1.000 Median :0.000
## Mean :0.963 Mean :0.274
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000
## CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.269 Mean :0.063
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
## Min. :0.00 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000
## Median :0.00 Median :0.000
## Mean :0.04 Mean :0.049
## 3rd Qu.:0.00 3rd Qu.:0.000
## Max. :1.00 Max. :1.000
## CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar Purpose.UsedCar
## Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :1.00 Median :0.000 Median :0.000 Median :0.000
## Mean :0.53 Mean :0.088 Mean :0.234 Mean :0.103
## 3rd Qu.:1.00 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.00 Max. :1.000 Max. :1.000 Max. :1.000
## Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
## Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000
## Mean :0.181 Mean :0.28 Mean :0.012
## 3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000
## Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000 Median :0.000
## Mean :0.022 Mean :0.05 Mean :0.009 Mean :0.097
## 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000 Max. :1.000
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :0.000
## Mean :0.603 Mean :0.103
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.063 Mean :0.048
## 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.172 Mean :0.339 Mean :0.174
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.00
## Median :0.000 Median :0.00
## Mean :0.253 Mean :0.05
## 3rd Qu.:1.000 3rd Qu.:0.00
## Max. :1.000 Max. :1.00
## Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:1.000
## Median :0.00 Median :1.000 Median :1.000
## Mean :0.31 Mean :0.548 Mean :0.907
## 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.00 Max. :1.000 Max. :1.000
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.041 Mean :0.282 Mean :0.232
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.332 Mean :0.139 Mean :0.047
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0
## Median :0.000 Median :1.000 Median :0.000 Median :0.0
## Mean :0.179 Mean :0.713 Mean :0.022 Mean :0.2
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.0
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.0
## Job.SkilledEmployee
## Min. :0.00
## 1st Qu.:0.00
## Median :1.00
## Mean :0.63
## 3rd Qu.:1.00
## Max. :1.00
barplot(table(GermanCredit$Class), # our focus will be on `Class`, visual rep or proportion.
ylab = "Frequency",
xlab = "Class")
Your observation: There are no missings/NA’s, looking at the summery
stats, There aren’t any points that stick out of the norm. Visually we
can see based on the predictor variable Class, we see the
distribution falling higher on 1- non-credit risk.
2024 for
reproducibility. (5pts)set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80) #using a 80 20 split
german_train = GermanCredit[index,]
german_test = GermanCredit[-index,]
Your observation: 800 observations in training, and 200 in testing.
# Load the e1071 package
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
#library ggplot2
library(ggplot2)
# Create SVM model with linear kernel
svm_model <- svm(Class ~ ., data = german_train, kernel = 'linear') #linear SVM
# Summary of the trained model
summary(svm_model)
##
## Call:
## svm(formula = Class ~ ., data = german_train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 391
##
## ( 197 194 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Your observation: Using the training data set, I created the
svm_model using Class as the predictor and all
other variables. The results show that there are 391 support
vectors(observations) that support the hyperplane, but the number of
vectors on either side is uneven to each other, meaning the that there
may be high variance, and the model might be unstable.
# Make predictions on the train data
pred_class_train <- predict(svm_model, german_train)
head(pred_class_train)
## 578 549 557 700 255 913
## 1 0 0 1 1 1
## Levels: 0 1
Your observation: Using the predict function to classify observations.
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = german_train$Class,
pred = pred_class_train)
Cmatrix_train
## pred
## true 0 1
## 0 132 97
## 1 59 512
#Misclassification Rate
MR_train<- 1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
MR_train
## [1] 0.195
Your observation: We see higher values in the confusion matrix on the
top right corner, or the False Positive section, meaning a majority of
errors we predict they have (based on Class) good credit
potential but are actually not. To which in this method of prediction is
worse than False Negative-meaning that we predicted they are a bad
credit risk but are actually good. The MR for the training data is at
most 19.5%, meaning we only mis-classify 19.5% of the time.
pred_class_test <- predict(svm_model, german_test)
head(pred_class_test)
## 10 13 17 20 41 44
## 0 1 1 1 1 1
## Levels: 0 1
Your observation: Same work, now using the testing data.
# Confusion matrix to evaluate the model on testing data
Cmatrix_test = table(true = german_test$Class,
pred = pred_class_test)
Cmatrix_test
## pred
## true 0 1
## 0 36 35
## 1 20 109
#Misclassification Rate
MR_test<- 1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
MR_test
## [1] 0.275
Your observation: We see similar results, but they are technically higher on the testing data. We still see higher results on the FP. The MR for the testing data, is higher at 27.5% but is to be expected.
probability = TRUE.german.svm_asymmetric = svm(Class ~ .,
data = german_train,
kernel = 'linear',
probability = TRUE,
class.weights = c("0" = 2, "1" = 1)) #different from logressioin where you put higher weight on the variable you want more of
Your observation: Re-created the svm model, having probability as true to attain AUC, and having weights, that are no longer equal. We have placed higher importance on 1, meaning we want to lower FP.
#get predictions for training
pred_prob_train = predict(german.svm_asymmetric,
newdata = german_train,
probability = TRUE)
Your observation: Creating the predict model, having probability on so we can attain the AUC and ROC if needed.
Cmatrix_train2 <- table( true = german_train$Class, pred = pred_prob_train)
Cmatrix_train2
## pred
## true 0 1
## 0 113 116
## 1 41 530
MR_train2 <- 1 - sum(diag(Cmatrix_train2))/sum(Cmatrix_train2)
MR_train2
## [1] 0.19625
Your observation: We have obtained the confusion matrix using different weights, here we have lowered FP, the point of the weighting choice. Seeing a significant decrease in False positives, but also an increase on false negatives, to which is OK in this scenario.
#run only once
pred_prob_train = attr(pred_prob_train, "probabilities")[, "1"] #might be grabbing the 0, input [, "1"] to grab 1
#load the ROCR package to get the ROC
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.3.3
pred <- prediction(pred_prob_train, german_train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#to get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8498153
Your observation: Using the ROCR package I was able to create the ROC curve showing a very rough but high covering line, meaning it will explain quite a bit of the data in the model. Which is shown mathematically by the AUC, to which we want as high as we can, here it is 0.85, saying it this model is able to classify the variable well (good), but not as high as we want (.90 = excellent).
#obtain testing pred_prob
pred_prob_test = predict(german.svm_asymmetric,
newdata = german_test,
probability = TRUE)
Your observation: Now using the testing data to get the predicted variables and classes.
Cmatrix_test2 <- table( true = german_test$Class, pred = pred_prob_test)
Cmatrix_test2
## pred
## true 0 1
## 0 25 46
## 1 10 119
MR_test2 <- 1 - sum(diag(Cmatrix_test2))/sum(Cmatrix_test2)
MR_test2
## [1] 0.28
Your observation: using the Testing data we were able to see that the results that are technically worse, but to be expected of testing data, since we don’t want to overfit the data, but the MR had gone up from 19.5% to 28% which is concerning.
# this is necessary, run only once
pred_prob_test = attr(pred_prob_test, "probabilities")[, "1"]
#ROC
library(ROCR)
pred <- prediction(pred_prob_test, german_test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7421116
Your observation: Based on the AUC, we see that it is lower than what we predicted in the training data, falling from “Good” to “fair” classifier.
Using the German credit variables with a higher focus upon
Class where 0 = bad credit risk and 1 = good credit risk,
we worked to better classify where in the training and testing data sets
we created with a 80 20 split, how the Class observations
classified based on all other variables. By creating the svm model, we
saw that there 391 support vectors for the hyperplane, with more vectors
on “0” with 197, with “1” only having 194, possibly indicating high
variance, and possibly a unstable model.
After classifying the training and testing data sets with equal weighting we see that, there is a higher concentration on the FP (False positive) side, meaning we incorrectly guess that people have good credit, but are actually a bad credit risk, Which is worse for credit lenders than FN.
We move onto classifying with unequal weights, with higher weight on “0” (at 2) and “1” (at 1), we put higher weight on the one we are wanting to let grow, so we can lower the other, to which in this instance we want to lower the FP’s so we allow FN’s to grow so FP can shrink. Which is what happened on both the training and testing data sets, comparatively it shows great change than the original analysis. But now we were able to create the ROC curve and AUC values for each set, since we set the probabilities to true. We see that the training set has a higher AUC, meaning that model is a good classifier of the data, but the testing data has a lower AUC meaning it is only a fair classifier.
In logistic regression, when assigning weights we would put the higher weight on the one we want to predict better/more, but in SVM we put higher weight on the opposite, so we can decrease the result (FP/FN) of the one we want.
radial, and see if you got a better result.