Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
Solution
Detailed case
Job:
Problem Description:
Imagine a situation where a data scientist working for an automotive service company is assigned to determine whether the customers in their database will churn (no longer come back) or not. The database was exhaustive and had enough data, including information about the customers who had terminated their contracts with the company. Apart from this, it also had information about the factors (predictor variables) that appear to affect the customers' decisions. The data scientist decided to use machine-learning tools to model the situation.
The model chosen: Classification model
Predictor variables could include, for example: time since the customer's last service visit, number of service visits per year, total amount spent on services, number of complaints lodged, and the age of the customer's vehicle.
Other additional scenarios
Everyday life:
Classification models are well suited for deciding whether a vegetable is fit to use or not, based on predictor variables such as the color of the vegetable, its smell, its size, the total number of days it has spent in the refrigerator, and its freshness (by appearance).
Current Events:
Classification models are well suited for predicting whether it will rain today or not, based on predictor variables such as humidity, wind speed, yesterday's weather conditions, air temperature, and air pressure.
The files credit_card_data.txt (without headers) and credit_card_data-headers.txt (with headers) contain a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It has anonymized credit card applications with a binary response variable (last column) indicating if the application was positive or negative. The dataset is the “Credit Approval Data Set” from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) without the categorical variables and without data points that have missing values.
## install.packages("kernlab") ## run once to install the kernlab package (SVM library)
library(kernlab) # load the kernlab library for SVM
data <- read.table("C:/Users/amirt/Desktop/New folder/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE) # read the data into a data frame
head(data,10) # inspect the first 10 observations to check variable types
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.250 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.040 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.500 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.750 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.710 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.500 1 1 0 0 360 0 1
## 7 1 33.17 1.040 6.500 1 1 0 0 164 31285 1
## 8 0 22.92 11.585 0.040 1 1 0 1 80 1349 1
## 9 1 54.42 0.500 3.960 1 1 0 1 180 314 1
## 10 1 42.50 4.915 3.165 1 1 0 0 52 1442 1
## 1) MODEL: support vector machine model with scaling
model <- ksvm(V11~.,data=data,type = "C-svc",kernel = "vanilladot",C = 100,scaled=TRUE) # fit a linear SVM (C-classification) with scaled predictors
## Setting default kernel parameters
model # display model characteristic
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 189
##
## Objective Function Value : -17887.92
## Training error : 0.136086
a <- colSums(model@xmatrix[[1]] * model@coef[[1]]) # coefficients a1..a10: each support vector weighted by its coefficient, summed column-wise
# a0 is just -model@b
a0 <- -model@b # the intercept term a0
# To display the values of a's and a0
a
## V1 V2 V3 V4 V5
## -0.0010065348 -0.0011729048 -0.0016261967 0.0030064203 1.0049405641
## V6 V7 V8 V9 V10
## -0.0028259432 0.0002600295 -0.0005349551 -0.0012283758 0.1063633995
a0
## [1] 0.08158492
# see what the model predicts
pred <- predict(model,data[,1:10])
pred
## [1] 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [297] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [556] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# see what fraction of the model’s predictions match the actual classification
sum(pred == data[,11]) / nrow(data)
## [1] 0.8639144
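Beyond the single accuracy figure, a confusion table (a small sketch, not part of the original output) shows how the errors split across the two response classes:
table(predicted = pred, actual = data[,11]) # rows: model predictions, columns: true labels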
## 2) MODEL: support vector machine model without scaling
model <- ksvm(V11~.,data=data,type = "C-svc",kernel = "vanilladot",C = 100,scaled=FALSE) # fit the same linear SVM without scaling the predictors
## Setting default kernel parameters
model # display model characteristic
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 186
##
## Objective Function Value : -2213.731
## Training error : 0.278287
a <- colSums(model@xmatrix[[1]] * model@coef[[1]]) # coefficients a1..a10: each support vector weighted by its coefficient, summed column-wise
# a0 is just -model@b
a0 <- -model@b # the intercept term a0
# To display the values of a's and a0
a
## V1 V2 V3 V4 V5
## -0.0483050561 -0.0083148473 -0.0836550114 0.1751121271 1.8254844547
## V6 V7 V8 V9 V10
## 0.2763673361 0.0654782414 -0.1108211169 -0.0047229653 -0.0007764962
a0
## [1] 0.5255393
# see what the model predicts
pred <- predict(model,data[,1:10])
pred
## [1] 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0
## [38] 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 0 0 0 1 0
## [75] 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1
## [186] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1
## [223] 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
## [297] 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## [334] 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
## [371] 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0
## [408] 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0
## [445] 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1 1 1 0 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1
## [519] 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1
## [556] 1 0 1 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0
## [630] 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
# see what fraction of the model’s predictions match the actual classification
sum(pred == data[,11]) / nrow(data)
## [1] 0.7217125
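As a sanity check (a sketch, not part of the original write-up): since this second model was fit with scaled = FALSE, the coefficients a and a0 apply directly to the raw predictors, so the decision values can be recomputed by hand and compared against predict():
dec <- as.vector(as.matrix(data[,1:10]) %*% a + a0) # decision values a.x + a0
table(decision_positive = dec > 0, predicted = pred) # should agree, up to kernlab's internal label ordering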
The separating hyperplane between the two classes is given by the coefficients as:
-0.0010065348 z1 - 0.0011729048 z2 - 0.0016261967 z3 + 0.0030064203 z4 + 1.0049405641 z5 - 0.0028259432 z6 + 0.0002600295 z7 - 0.0005349551 z8 - 0.0012283758 z9 + 0.1063633995 z10 + 0.08158492 = 0
The training performance of the scaled SVM is evaluated as:
Total number of support vectors = 189
Training error = 0.136086
Accuracy of the model = 0.8639144
For the unscaled model, however, the accuracy dropped to 0.7217125. Scaling therefore improves the performance of this SVM model.
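To illustrate that the gap really comes from scaling (a sketch, not part of the original analysis), one can standardize the predictors by hand and refit with scaled = FALSE. Note that kernlab internally scales only non-binary columns, so applying scale() to every column is an approximation and an exact match is not guaranteed:
data_scaled <- data
data_scaled[,1:10] <- scale(data[,1:10]) # standardize each predictor to mean 0, sd 1
model_manual <- ksvm(V11~., data = data_scaled, type = "C-svc",
                     kernel = "vanilladot", C = 100, scaled = FALSE) # "unscaled" fit on pre-scaled data
pred_manual <- predict(model_manual, data_scaled[,1:10])
sum(pred_manual == data[,11]) / nrow(data) # expected to be in the same ballpark as the scaled model's 0.8639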
## MODEL: support vector machine model with non-linear radial kernel
model <- ksvm(V11~.,data=data,type = "C-svc",kernel = "rbfdot",C =100, scaled=TRUE) # fit an SVM with a Gaussian radial-basis kernel
model
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.097273682447156
##
## Number of Support Vectors : 242
##
## Objective Function Value : -8796.874
## Training error : 0.045872
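The printout reports the training error (0.045872) but no accuracy figure; following the same pattern as before (a quick sketch), the training accuracy of the radial-kernel model can be computed directly:
pred_rbf <- predict(model, data[,1:10]) # predictions of the radial-kernel model
sum(pred_rbf == data[,11]) / nrow(data) # training accuracy, roughly 1 - training error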
When the data can be separated by a hyperplane (a straight line in two dimensions), a linear SVM suffices. When the data cannot be separated by a straight line, we use a non-linear SVM. For this we have kernel functions: they implicitly map the data into a higher-dimensional space in which the classes become linearly separable, so a linear separator can be found there.
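The same call pattern extends to kernlab's other built-in kernels; for instance, a degree-2 polynomial kernel ("polydot") can be tried as a sketch, keeping C = 100 from above:
model_poly <- ksvm(V11~., data = data, type = "C-svc", kernel = "polydot",
                   kpar = list(degree = 2), C = 100, scaled = TRUE) # polynomial kernel of degree 2
pred_poly <- predict(model_poly, data[,1:10])
sum(pred_poly == data[,11]) / nrow(data) # training accuracy for comparison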
#install.packages("kknn")
library(kknn) # load the kknn package for k-nearest-neighbours
# read in the data
data <- read.table("C:/Users/amirt/Desktop/New folder/credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)
# inspect the first 10 observations
head(data,10)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.250 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.040 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.500 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.750 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.710 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.500 1 1 0 0 360 0 1
## 7 1 33.17 1.040 6.500 1 1 0 0 164 31285 1
## 8 0 22.92 11.585 0.040 1 1 0 1 80 1349 1
## 9 1 54.42 0.500 3.960 1 1 0 1 180 314 1
## 10 1 42.50 4.915 3.165 1 1 0 0 52 1442 1
check_accuracy = function(X){
  predicted <- rep(0, nrow(data))
  for (i in 1:nrow(data)){
    # leave-one-out: fit on all points except i, then predict point i (scaled data)
    model = kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, data[-i,], data[i,], k = X, scale = TRUE)
    # kknn returns a fractional fitted value; round to the nearer of 0 and 1
    predicted[i] <- as.integer(fitted(model) + 0.5)
  }
  # fraction of points classified correctly
  accuracy = sum(predicted == data[,11]) / nrow(data)
  return(accuracy)
}
acc <- rep(0,20) # set up a vector of 20 zeros to start
for (X in 1:20){
acc[X] = check_accuracy(X) # test knn with X neighbors
}
#
# report accuracies
#
acc
## [1] 0.8149847 0.8149847 0.8149847 0.8149847 0.8516820 0.8455657 0.8470948
## [8] 0.8486239 0.8470948 0.8501529 0.8516820 0.8532110 0.8516820 0.8516820
## [15] 0.8532110 0.8516820 0.8516820 0.8516820 0.8501529 0.8501529
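Rather than scanning the vector by eye, the best k can also be picked programmatically from the acc vector just computed (a small sketch):
which.max(acc) # k with the highest accuracy (first match in case of ties)
max(acc) # the corresponding accuracy value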
##
check_accuracy_unscaled = function(X){
  predicted <- rep(0, nrow(data))
  for (i in 1:nrow(data)){
    # leave-one-out: fit on all points except i, then predict point i (unscaled data)
    model = kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, data[-i,], data[i,], k = X, scale = FALSE)
    # kknn returns a fractional fitted value; round to the nearer of 0 and 1
    predicted[i] <- as.integer(fitted(model) + 0.5)
  }
  # fraction of points classified correctly
  accuracy = sum(predicted == data[,11]) / nrow(data)
  return(accuracy)
}
acc_unscaled <- rep(0,20) # set up a vector of 20 zeros to start
for (X in 1:20){
acc_unscaled[X] = check_accuracy_unscaled(X) # test knn with X neighbors
}
#
# report accuracies
#
acc_unscaled
## [1] 0.6636086 0.6636086 0.6636086 0.6636086 0.6911315 0.6957187 0.6926606
## [8] 0.6926606 0.6865443 0.6773700 0.6804281 0.6834862 0.6865443 0.6880734
## [15] 0.6880734 0.6926606 0.6926606 0.6926606 0.6926606 0.6911315
Case 1: Scaled
Accuracy values for k from 1 to 20 are as follows (index = k):
[1] 0.8149847
[2] 0.8149847
[3] 0.8149847
[4] 0.8149847
[5] 0.8516820
[6] 0.8455657
[7] 0.8470948
[8] 0.8486239
[9] 0.8470948
[10] 0.8501529
[11] 0.8516820
[12] 0.8532110 ** Best accuracy value at k=12 (~558 correct predictions)
[13] 0.8516820
[14] 0.8516820
[15] 0.8532110 ** Best accuracy value at k=15 (~558 correct predictions)
[16] 0.8516820
[17] 0.8516820
[18] 0.8516820
[19] 0.8501529
[20] 0.8501529
Case 2: Unscaled
Accuracy values for k from 1 to 20 are as follows (index = k):
[1] 0.6636
[2] 0.6636
[3] 0.6636
[4] 0.6636
[5] 0.6911
[6] 0.6957 ** Best accuracy value at k=6 (~455 correct predictions)
[7] 0.6927
[8] 0.6927
[9] 0.6865
[10] 0.6774
[11] 0.6804
[12] 0.6835
[13] 0.6865
[14] 0.6881
[15] 0.6881
[16] 0.6927
[17] 0.6927
[18] 0.6927
[19] 0.6927
[20] 0.6911
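As with the SVM, scaling clearly helps k-NN: the best scaled accuracy (0.8532 at k = 12 or 15) is well above the best unscaled accuracy (0.6957 at k = 6). For completeness, a quick base-R sketch (not part of the original output) to compare the two cases visually:
plot(1:20, acc, type = "b", ylim = c(0.6, 0.9),
     xlab = "k (number of neighbours)", ylab = "accuracy") # scaled accuracies
lines(1:20, acc_unscaled, type = "b", lty = 2) # unscaled accuracies
legend("bottomright", legend = c("scaled", "unscaled"), lty = c(1, 2))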