Loading the necessary packages and the dataset. The dataset is GermanCredit. It has a credit-worthiness feature, Class, with two classes, “Good” and “Bad”. This will be our response variable.
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.2
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.2
data(GermanCredit)
dataset = GermanCredit
Exploring the data by checking the structure of the dataset, such as the data types of the variables. We then scale the first 7 columns of the GermanCredit dataset, which are its integer-valued variables. The scale function standardizes the values of those 7 columns by subtracting each column's mean and dividing by its standard deviation; this is a data transformation step. The standardized values are then written back into the dataset. The plots and the second str output show that the values are scaled.
str(dataset)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
dataset[,1:7] = as.data.frame(lapply(dataset[,1:7], scale))
plot(dataset[,1:7])
str(dataset)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : num -1.236 2.247 -0.738 1.75 0.257 ...
## $ Amount : num -0.745 0.949 -0.416 1.633 0.566 ...
## $ InstallmentRatePercentage : num 0.918 -0.8697 -0.8697 -0.8697 0.0241 ...
## $ ResidenceDuration : num 1.046 -0.766 0.14 1.046 1.046 ...
## $ Age : num 2.765 -1.191 1.183 0.831 1.534 ...
## $ NumberExistingCredits : num 1.027 -0.705 -0.705 -0.705 1.027 ...
## $ NumberPeopleMaintenance : num -0.428 -0.428 2.334 2.334 2.334 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
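The standardization performed by scale can be checked by hand. The sketch below is only an illustration: it uses the first five Duration values from the str output as a toy vector, so the resulting z-scores differ from those in the full dataset, where the mean and standard deviation come from all 1000 rows. The variable names are assumptions made for this example.

```r
# First five Duration values from the str() output (toy vector, not the full column)
duration <- c(6, 48, 12, 42, 24)

# Standardize by hand: subtract the mean, divide by the sample standard deviation
manual <- (duration - mean(duration)) / sd(duration)

# scale() returns a one-column matrix; as.numeric() drops its attributes
scaled <- as.numeric(scale(duration))

print(all.equal(manual, scaled))  # the two approaches agree
print(mean(scaled))               # approximately 0
print(sd(scaled))                 # approximately 1
```

After standardization every column has mean 0 and standard deviation 1, which keeps variables with large ranges such as Amount from dominating the distance computations inside the SVM kernels.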
After the data transformation in the previous section, we now split the data. The dataset has 1000 rows in total, so we draw a random sample of 200 row indexes and use them to extract the Test dataset; the remaining 800 rows form the Training dataset. That is, 80% of the rows go to the Training dataset and 20% to the Test dataset.
dim(dataset)
## [1] 1000 62
set.seed(1234)
sample_index = sample(1000, 200)
test_dataset = dataset[sample_index,]
train_dataset = dataset[-sample_index,]
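As a quick sanity check of the split logic, the same indexing pattern can be tried on a toy data frame. This is a minimal sketch: the data frame `toy` and its columns are hypothetical stand-ins for the real dataset, and only base R is used.

```r
set.seed(1234)

# Toy data frame standing in for the 1000-row dataset (columns are hypothetical)
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# sample() draws without replacement by default, so the 200 indexes are distinct
idx <- sample(nrow(toy), 200)

toy_test  <- toy[idx, ]   # rows selected by the sampled indexes
toy_train <- toy[-idx, ]  # negative indexing keeps every other row

print(c(train = nrow(toy_train), test = nrow(toy_test)))
print(length(intersect(toy_train$id, toy_test$id)))  # 0: no row is in both sets
```

Because the indexes are sampled without replacement, the two subsets are disjoint and together cover all rows.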
Tuning the SVM models. Tuning gives the optimal cost and gamma parameters, which we will use in the next section to create the SVM models. We use Linear, Polynomial, Sigmoid and Radial kernels; later we check the accuracy of each kernel and pick whichever is best. Note that scale=F is passed inside ranges, so it appears as a column in the best-parameter output.
svm_tune_radial <- tune(svm, Class ~ .,
                        data = train_dataset,
                        kernel = "radial",
                        ranges = list(cost = 10^(-1:2),
                                      gamma = c(.5, 1, 2),
                                      scale = F))
sumry_radial <- summary(svm_tune_radial)
print(sumry_radial$best.parameters)
## cost gamma scale
## 3 10 0.5 FALSE
svm_tune_poly <- tune(svm, Class ~ .,
                      data = train_dataset,
                      kernel = "polynomial",
                      ranges = list(cost = 10^(-1:2),
                                    gamma = c(.5, 1, 2),
                                    scale = F))
sumry_poly <- summary(svm_tune_poly)
print(sumry_poly$best.parameters)
## cost gamma scale
## 1 0.1 0.5 FALSE
svm_tune_sigm <- tune(svm, Class ~ .,
                      data = train_dataset,
                      kernel = "sigmoid",
                      ranges = list(cost = 10^(-1:2),
                                    gamma = c(.5, 1, 2),
                                    scale = F))
sumry_sigm <- summary(svm_tune_sigm)
print(sumry_sigm$best.parameters)
## cost gamma scale
## 5 0.1 1 FALSE
svm_tune_linear <- tune(svm, Class ~ .,
                        data = train_dataset,
                        kernel = "linear",
                        ranges = list(cost = 10^(-1:2),
                                      scale = F))
sumry_linear <- summary(svm_tune_linear)
print(sumry_linear$best.parameters)
## cost scale
## 2 1 FALSE
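To see what the tuning step searches over, the parameter grid can be enumerated with base R's expand.grid. This is a sketch under the assumption that tune evaluates every combination of the values in ranges (by default with 10-fold cross-validation); the row labels printed in front of the best parameters (3, 1 and 5) appear to correspond to rows of such a grid.

```r
# Every combination of the cost and gamma values passed in ranges
grid <- expand.grid(cost = 10^(-1:2), gamma = c(.5, 1, 2), scale = FALSE)

print(nrow(grid))  # 4 cost values x 3 gamma values = 12 combinations

# Rows matching the labels reported for the radial, polynomial and sigmoid kernels
print(grid[c(3, 1, 5), ])
```

Row 3 holds cost 10 and gamma 0.5, row 1 holds cost 0.1 and gamma 0.5, and row 5 holds cost 0.1 and gamma 1, which agrees with the best parameters printed for the three nonlinear kernels.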
Now that we have the Training and Test datasets, we fit the SVM models. Class is our response variable; it is categorical and represents “Good” or “Bad” credit worthiness of a person, so we are classifying observations into these two classes. Cost is the general penalty parameter for misclassification; the tuning grid starts at 0.1. If C is large, the bias will be low and the variance high. Gamma is the parameter of the Gaussian (radial) kernel, used for nonlinear structures; the polynomial and sigmoid kernels also take a gamma value. In our case, the optimal cost for the linear kernel is 1; for radial it is cost 10 and gamma 0.5; for polynomial, cost 0.1 and gamma 0.5; and for sigmoid, cost 0.1 and gamma 1. We will use these values to build our SVM models. Note that only the linear fit passes scale = F; the other three calls use svm's default scaling, which produces the “Cannot scale data” warning for the two constant columns.
svm_fit_radial <- svm(Class ~ ., kernel="radial", cost = 10, gamma=0.5,data = train_dataset)
## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'Purpose.Vacation' and 'Personal.Female.Single' constant.
## Cannot scale data.
print(svm_fit_radial)
##
## Call:
## svm(formula = Class ~ ., data = train_dataset, kernel = "radial",
## cost = 10, gamma = 0.5)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
## gamma: 0.5
##
## Number of Support Vectors: 800
svm_fit_poly <- svm(Class ~ ., kernel="polynomial", cost = 0.1, gamma=0.5,data = train_dataset)
## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'Purpose.Vacation' and 'Personal.Female.Single' constant.
## Cannot scale data.
print(svm_fit_poly)
##
## Call:
## svm(formula = Class ~ ., data = train_dataset, kernel = "polynomial",
## cost = 0.1, gamma = 0.5)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.1
## degree: 3
## gamma: 0.5
## coef.0: 0
##
## Number of Support Vectors: 494
svm_fit_sigm <- svm(Class ~ ., kernel="sigmoid", cost = 0.1, gamma=1,data = train_dataset)
## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'Purpose.Vacation' and 'Personal.Female.Single' constant.
## Cannot scale data.
print(svm_fit_sigm)
##
## Call:
## svm(formula = Class ~ ., data = train_dataset, kernel = "sigmoid",
## cost = 0.1, gamma = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 0.1
## gamma: 1
## coef.0: 0
##
## Number of Support Vectors: 434
svm_fit_linear <- svm(Class ~ ., kernel="linear", cost = 1, data = train_dataset, scale = F)
print(svm_fit_linear)
##
## Call:
## svm(formula = Class ~ ., data = train_dataset, kernel = "linear",
## cost = 1, scale = F)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.01639344
##
## Number of Support Vectors: 421
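The effect of gamma in the radial kernel can be illustrated with a few lines of base R. This is a minimal sketch of the Gaussian kernel formula exp(-gamma * ||u - v||^2); the function name `rbf_kernel` and the two toy points are assumptions made for this example and are not part of e1071.

```r
# Gaussian (radial) kernel between two points u and v
rbf_kernel <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

u <- c(0, 0)
v <- c(1, 1)  # squared distance to u is 2

# Similarity shrinks quickly as gamma grows
for (g in c(0.01, 0.5, 2)) {
  cat("gamma =", g, " kernel =", round(rbf_kernel(u, v, g), 4), "\n")
}

# The gamma printed for the linear fit matches 1 / 61, one over the number of predictors
print(round(1 / 61, 8))
```

With a larger gamma, points that are only moderately far apart already look almost completely dissimilar, which is one plausible reason the radial fit with gamma 0.5 keeps all 800 training rows as support vectors.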
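Now we predict the credit worthiness (Class) feature on the Test dataset with each of the SVM models, one per kernel. Column 10 is Class, so it is dropped from the predictors and used as the true label in the tables, where rows are the actual classes and columns are the predicted classes. In the next section we check the accuracy of the predictions.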
predictions <- predict(svm_fit_linear, test_dataset[-10])
table(test_dataset[,10], predictions)
## predictions
## Bad Good
## Bad 30 33
## Good 14 123
predictions <- predict(svm_fit_radial, test_dataset[-10])
table(test_dataset[,10], predictions)
## predictions
## Bad Good
## Bad 1 62
## Good 0 137
predictions <- predict(svm_fit_poly, test_dataset[-10])
table(test_dataset[,10], predictions)
## predictions
## Bad Good
## Bad 29 34
## Good 19 118
predictions <- predict(svm_fit_sigm, test_dataset[-10])
table(test_dataset[,10], predictions)
## predictions
## Bad Good
## Bad 11 52
## Good 25 112
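When reading these tables it helps to label both axes explicitly. The sketch below uses two small made-up factor vectors (not taken from the dataset) to show how table with named arguments marks which dimension holds the actual classes and which holds the predictions.

```r
# Made-up labels for five hypothetical test cases
actual    <- factor(c("Bad", "Good", "Good", "Bad", "Good"), levels = c("Bad", "Good"))
predicted <- factor(c("Bad", "Good", "Bad", "Good", "Good"), levels = c("Bad", "Good"))

# Named arguments become the dimension names of the confusion matrix
cm <- table(actual = actual, predicted = predicted)
print(cm)

# Correct predictions sit on the diagonal
print(sum(diag(cm)))
```

Fixing the factor levels on both vectors keeps the table square even when a model never predicts one of the classes.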
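Calculating the accuracy from the confusion matrix of each kernel, as the correctly classified cases (the diagonal) divided by all 200 test cases. The results are, in order, for the linear, radial, polynomial and sigmoid kernels. We conclude from the accuracy values that the Linear kernel, with 76.5% accuracy, is best for the given sample. Note that the radial model predicts “Good” for all but one test case, so its 69% is barely above the 68.5% share of “Good” cases in the test set.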
print((123+30)/(123+30+33+14)*100)
## [1] 76.5
print((137+1)/(137+62+1)*100)
## [1] 69
print((118+29)/(118+34+19+29)*100)
## [1] 73.5
print((112+11)/(112+11+25+52)*100)
## [1] 61.5
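The same accuracies can be computed programmatically instead of typing the counts into each formula. This is a sketch that re-enters the four confusion matrices from the tables above by hand; the helper names `make_cm` and `accuracy` are assumptions made for this example.

```r
# Build a 2x2 confusion matrix; counts are filled column by column
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(actual = c("Bad", "Good"),
                         predicted = c("Bad", "Good")))
}

# Counts copied from the confusion tables of each kernel
cms <- list(
  linear     = make_cm(c(30, 14, 33, 123)),
  radial     = make_cm(c(1, 0, 62, 137)),
  polynomial = make_cm(c(29, 19, 34, 118)),
  sigmoid    = make_cm(c(11, 25, 52, 112))
)

# Accuracy in percent: correct predictions (diagonal) over all cases
accuracy <- function(cm) sum(diag(cm)) / sum(cm) * 100

acc <- sapply(cms, accuracy)
print(acc)
print(names(which.max(acc)))  # kernel with the highest accuracy
```

In a live session the matrices could come straight from the table calls, which avoids copying numbers by hand.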