Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
Answer: I work for a utility company, and we send drones into the field to monitor assets such as poles, power generation sites, and transformers. The drones capture live pictures or live-streamed video that feed our machine learning algorithms, which classify, based on the imagery, whether certain critical parts of the asset in question should be fixed, repaired, or replaced. The models used for this classification included random forest, XGBoost, YOLO-based convolutional neural networks for object detection, and other decision-tree-based models.
In this HW assignment, we were introduced to the theory and applications of two classification algorithms: KSVM and KKNN.
The toy dataset came from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) and is popularly used by academic institutions for teaching purposes. The steps taken to fulfill the HW requirements were:
(1) Data exploration through some simple visualization techniques.
(2) Tuning the hyperparameters of the algorithms by trying different C values and kernel types for KSVM and different k values for KKNN.
The objectives of the assignment were to find:
- The optimal C value and its corresponding kernel for KSVM
- The k value that provides the highest accuracy for KKNN
library(scales)
library(tidyr)
library(dplyr)
library(kableExtra)
library(knitr)
library(kernlab)
library(kknn)
require(ggthemes)
library(ggplot2)
# Read the whitespace-delimited text file (no header row)
data <- read.table("credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE, sep = "")
head(data,5)
# Check whether any missing values (NAs) exist in the dataset
any(is.na(data))
## [1] FALSE
# Summarise the response column to check the balance of the two classes
df0 <- data %>% group_by(V11) %>% summarise(No_Applicants = n())
# Bar plot of applicants by class
c2 <- c("Not Approved", "Approved")
ggplot(df0, aes(x = V11, y = No_Applicants, fill = c2)) +
  geom_bar(stat = "identity") +
  ggtitle("Fig1: Class distribution of applicants") +
  theme_economist() +
  geom_text(x = 0, y = 200, label = "Not Approved") +
  geom_text(x = 1, y = 200, label = "Approved") +
  xlab("Class (1: Approved, 0: Not Approved)") +
  scale_fill_manual("legend", values = c("Not Approved" = "red", "Approved" = "green"))
Observation 1: It is best practice to verify that the loaded data contain no missing values. If they do, corrective action should be taken without jeopardizing or distorting the original intent of the data. In addition, in any classification problem it is a good idea to check for class imbalance: a minority class makes it harder for the algorithm to learn the patterns in the data properly and, more importantly, limits its ability to generalize when exposed to new data.
As expected, there were no missing values in this academic dataset; it was very clean and tidy.
The class distribution between approved and not-approved applicants was well balanced.
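As a quick illustration of these two checks, the short sketch below counts missing values per column and reports the class proportions; it assumes, as above, that the response is stored in column V11.
# Per-column count of missing values (all zeros expected here)
colSums(is.na(data))
# Proportion of each response class; values near 0.5 indicate a balanced dataset
prop.table(table(data$V11))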
#for reproducibility
set.seed(123)
# Set of C values to test
store <- c(0.000001, 100, 10000, 1000000)
#create empty lists for storage purposes
Opti_C <- vector(mode = "list")
Training_error<- vector(mode = "list")
ksvm_margin <- vector(mode = "list")
for (i in seq_along(store))
{
  # Fit a linear (vanilladot) C-SVC for the current C value
  model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]), type = "C-svc",
                kernel = "vanilladot", C = store[i], scaled = TRUE)
  Opti_C <- c(Opti_C, store[i])
  Training_error <- c(Training_error, model@error)
  # Coefficients of the separating hyperplane (in the scaled feature space)
  a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
  # Margin value reported in Table 1a
  ksvm_margin[i] <- 1 / sqrt(sum(a)^2)
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
# Combine the C values with their associated training errors and margins
col_bind <- cbind(Opti_C, Training_error, ksvm_margin)
mode(col_bind) <- "numeric"
df <- data.frame(col_bind)
# Table of C values vs. training errors and margins
knitr::kable(df, format = "html", row.names = FALSE,
             caption = "Table 1a: C vs. Training Errors & Margins",
             align = "c", font_size = 12) %>%
  row_spec(c(1, 3), background = "lightblue") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Opti_C | Training_error | ksvm_margin |
|---|---|---|
| 1e-06 | 0.4525994 | 1245.0364638 |
| 1e+02 | 0.1360856 | 0.9040157 |
| 1e+04 | 0.1376147 | 0.9786229 |
| 1e+06 | 0.3746177 | 2.3495819 |
Observation 2:
Here we experimented with one of the hyperparameters, C, and found the results to be fairly insensitive to it unless it was changed by several orders of magnitude. Using the "vanilladot" linear kernel as a baseline to check which value of C produced the lowest training error, we found that C = 100 gave the lowest training error at 13.6%, along with the smallest margin of 0.904. At one extreme, a very low C of 1e-06 produced the largest margin at 1245.04; at the other, a very high C of 1e+06 produced a margin of 2.349. Lastly, with C set at 1e+04, the margin was only negligibly larger than at C = 100.
In general, this is the trade-off an end user faces when performing analytics: which set of hyperparameters provides the best solution given the training errors and their associated margins.
My decision was to take the "optimal" C value of 1e+06, absorbing some training error but keeping the margin wide enough that the model retains some ability to generalize when exposed to new, unseen test data.
The classifier equation will be based on the resulting set of coefficients and their corresponding intercept.
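For reference, the standard soft-margin SVM objective below shows how C mediates this trade-off: the margin scales as \(1/\lVert w \rVert\), so a larger C penalizes the misclassification slacks \(\xi_i\) more heavily and tends to shrink the margin, while a smaller C favors a wider margin at the cost of more training error.
\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.
\]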
# Fit a C-SVC model with C = 1e+06 for a given kernel type
model_type <- function(x2)
{
  ksvm_model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]), type = "C-svc",
                     kernel = x2, C = 1000000, scaled = TRUE)
  return(ksvm_model)
}
model_type("vanilladot" )
## Setting default kernel parameters
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1e+06
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 427
##
## Objective Function Value : -46182140
## Training error : 0.374618
model_type("anovadot" )
## Setting default kernel parameters
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1e+06
##
## Anova RBF kernel function.
## Hyperparameter : sigma = 1 degree = 1
##
## Number of Support Vectors : 254
##
## Objective Function Value : -106676174
## Training error : 0.180428
model_type("rbfdot")
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1e+06
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.106691960791606
##
## Number of Support Vectors : 188
##
## Objective Function Value : -2921303
## Training error : 0.001529
model_type("polydot")
## Setting default kernel parameters
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1e+06
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 387
##
## Objective Function Value : -63622512
## Training error : 0.668196
Observation 3:
In this section we adjusted yet another hyperparameter, the kernel, and it appears that KSVM handles non-linear situations well: the training error decreased drastically from 37.4% (linear kernel) to about 0.15% ("rbfdot"). While it may appear that the model improved dramatically given such a low error rate, it likely did so at a price. This is the classic high-variance, low-bias situation, evidenced by the fact that this kernel has the lowest number of support vectors (and, arguably, the smallest margin) among the 4 models.
A potentially better candidate was Model 2 with kernel = "anovadot". This kernel is also non-linear, but with a more reasonable error rate of 18% and a higher number of support vectors (254), giving it a wider margin and more tolerance to misclassification errors.
Using C = 1e+06, Model 3 had the "best" kernel, the default "rbfdot", with the lowest training error.
Using C = 1e+06, Model 2 had the second-lowest training error, but because its kernel is non-linear, the decision boundary is no longer a simple linear function of the predictors, and writing out the full classifier equation is beyond the scope of this exercise. However, its coefficients and intercept will be shown as a companion to the linear Model 1.
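To get a rough sense of how much of the "rbfdot" advantage is over-fitting, one option (not part of the original assignment) is to hold out a validation set and compare the kernels on unseen rows. A minimal sketch, assuming an illustrative 80/20 random split and the same C value:
set.seed(42)
train_idx <- sample(nrow(data), size = round(0.8 * nrow(data)))
# Fit on the training rows only, then report accuracy on the held-out rows
holdout_accuracy <- function(kern) {
  fit <- ksvm(as.matrix(data[train_idx, 1:10]), as.factor(data[train_idx, 11]),
              type = "C-svc", kernel = kern, C = 1000000, scaled = TRUE)
  preds <- predict(fit, as.matrix(data[-train_idx, 1:10]))
  sum(preds == data[-train_idx, 11]) / length(preds)
}
sapply(c("vanilladot", "anovadot", "rbfdot", "polydot"), holdout_accuracy)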
#Calculate Coefficients a0 to am
model1 <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel= "vanilladot", C = 1000000, scaled = T)
## Setting default kernel parameters
a1 <- colSums(model1@xmatrix[[1]]*model1@coef[[1]])
#labeling the coefficients
a_mat1<-as.matrix(a1)
rownames(a_mat1) <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10")
colnames(a_mat1) <- c('Coeficients')
a_mat1
## Coeficients
## a1 -0.8283471
## a2 -0.2217216
## a3 -0.3301782
## a4 0.2825488
## a5 0.5750731
## a6 0.6143978
## a7 0.2607774
## a8 -0.5943042
## a9 -1.1175369
## a10 0.9336833
#Calculate Coefficients a0 to am
model2 <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel= "anovadot", C = 1000000, scaled = T)
## Setting default kernel parameters
a2 <- colSums(model2@xmatrix[[1]]*model2@coef[[1]])
#labeling the coefficients
a_mat2<-as.matrix(a2)
rownames(a_mat2) <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10")
colnames(a_mat2) <- c('Coeficients')
a_mat2
## Coeficients
## a1 -0.09112927
## a2 159.45108009
## a3 -411.57110320
## a4 94.35969342
## a5 4.10955797
## a6 1.41952178
## a7 -27.44089207
## a8 -0.49552951
## a9 -112.20111008
## a10 104.05488643
#Calculate Coefficients a0 to am
model3 <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel= "rbfdot", C = 1000000, scaled = T)
a3 <- colSums(model3@xmatrix[[1]]*model3@coef[[1]])
#labeling the coefficients
a_mat3<-as.matrix(a3)
rownames(a_mat3) <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10")
colnames(a_mat3) <- c('Coeficients')
a_mat3
## Coeficients
## a1 -326.4830
## a2 -508.9533
## a3 -858.7789
## a4 -183.4438
## a5 546.2135
## a6 -400.4594
## a7 236.8309
## a8 -431.0257
## a9 -244.5575
## a10 377.2721
#Calculate Coefficients a0 to am
model4 <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel= "polydot", C = 1000000, scaled = T)
## Setting default kernel parameters
a4 <- colSums(model4@xmatrix[[1]]*model4@coef[[1]])
#labeling the coefficients
a_mat4<-as.matrix(a4)
rownames(a_mat4) <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10")
colnames(a_mat4) <- c('Coeficients')
a_mat4
## Coeficients
## a1 -0.2704748
## a2 -1.5608654
## a3 -0.6957239
## a4 -2.0909780
## a5 0.2390224
## a6 0.3817119
## a7 -1.6549423
## a8 1.0024530
## a9 1.2131130
## a10 1.9821883
#Intercept for each model
a0.Model1.Linear <- -model1@b
a0.Model2 <- -model2@b
a0.Model3.Gaussian <- -model3@b
a0.Model4 <- -model4@b
# Intercepts of all 4 models (Model 1 is the linear baseline; Model 3 is the "best" in-sample model)
df3 <- data.frame(a0.Model1.Linear, a0.Model2, a0.Model3.Gaussian, a0.Model4)
knitr::kable(df3, format = "html", row.names = FALSE,
             caption = "Table 1b: Intercepts of the 4 models",
             align = "c", font_size = 12) %>%
  row_spec(c(1), background = "lightblue") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| a0.Model1.Linear | a0.Model2 | a0.Model3.Gaussian | a0.Model4 |
|---|---|---|---|
| -0.1281168 | 16.36655 | -1.598924 | -0.4399422 |
Observation 4:
Above is a table summarizing the coefficients and corresponding intercepts of the 4 models. Note that the intercept of Model 1, with the linear kernel ("vanilladot"), is the one used in the classifier equation.
# see what the model predicts
pred1 <- predict(model1, data[,1:10],type = "response")
pred2 <- predict(model2, data[,1:10],type = "response")
pred3 <- predict(model3, data[,1:10],type = "response")
pred4 <- predict(model4, data[,1:10],type = "response")
# see what fraction of the model’s predictions match the actual classification
Model1.pred<-sum(pred1==data[,11])/nrow(data)
Model2.pred<-sum(pred2==data[,11])/nrow(data)
Model3.pred<-sum(pred3==data[,11])/nrow(data)
Model4.pred<-sum(pred4==data[,11])/nrow(data)
df2<- data.frame(Model1.pred,Model2.pred,Model3.pred,Model4.pred)
knitr::kable(df2, format = "html", row.names = FALSE,
             caption = "Table 1c: Prediction Accuracy",
             align = "c", font_size = 12) %>%
  row_spec(c(1), background = "lightblue") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Model1.pred | Model2.pred | Model3.pred | Model4.pred |
|---|---|---|---|
| 0.6253823 | 0.8195719 | 0.9984709 | 0.3318043 |
Observation 5:
Similarly, the prediction accuracies of the 4 models with different kernels are shown above. As alluded to in Observations 2 and 3, KSVM showed superior in-sample results when switching from the linear kernel to non-linear kernels with C = 1e+06: going from "vanilladot" to "rbfdot", the prediction accuracy increased from 62% to almost 100%. But again, this comes at a cost; the training accuracy increased while the model's ability to generalize likely decreased. Such a model performs very well during training but would probably do poorly in production when shown new, unseen data.
As also stated in Observation 3, Model 2 with the kernel set to "anovadot" could be a reasonable choice, with an accuracy of 82%. Its coefficients and intercept are displayed below:
Coeficients
a1 -0.09112927
a2 159.45108009
a3 -411.57110320
a4 94.35969342
a5 4.10955797
a6 1.41952178
a7 -27.44089207
a8 -0.49552951
a9 -112.20111008
a10 104.05488643
a0 16.36655
Note that the a1 and a8 coefficients are relatively small in magnitude, at 0.091 and 0.495 respectively. This suggests that their contributions to the model are probably not as relevant as those of the other 8 coefficients, and they could be dropped from further analysis. This is a quick-and-dirty form of dimensionality reduction. Furthermore, by carrying fewer coefficients (dimensions), both computational and storage capacity can be freed up for other, more important operations.
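As a rough check on this idea (not part of the original assignment), one could refit the "anovadot" model without V1 and V8 and see whether the roughly 82% accuracy holds; a minimal sketch, assuming columns 1 and 8 of the data correspond to those two coefficients:
# Refit Model 2 without the two low-magnitude predictors (columns 1 and 8)
keep_cols <- setdiff(1:10, c(1, 8))
model2_reduced <- ksvm(as.matrix(data[, keep_cols]), as.factor(data[, 11]),
                       type = "C-svc", kernel = "anovadot", C = 1000000, scaled = TRUE)
# Compare in-sample accuracy with the full Model 2 (about 82%)
pred2_reduced <- predict(model2_reduced, as.matrix(data[, keep_cols]))
sum(pred2_reduced == data[, 11]) / nrow(data)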
set.seed(1)
# Leave-one-out evaluation of KKNN: for a given k (X), predict each data point
# from a model trained on all of the other points, then compute the overall accuracy
check_accuracy = function(X){
predicted <- rep(0, nrow(data))
for (i in 1:nrow(data))
{
credit.kknn=kknn(V11~.,data[-i,],data[i,],k=X, scale = TRUE)
predicted[i] <- as.integer(fitted(credit.kknn)+0.5) # round off to 0 or 1
}
# calculate % of correct predictions
acc <- sum(predicted == data[,11]) / nrow(data)
return(acc)
}
n<-20 # no. of k values to test
kknn_accuracy<- rep(0,n)
for (i in 1:n)
{
kknn_accuracy[i] <- check_accuracy(i)
}
#Elbow Plot: accuracy vs. K-values
par(bg="lightblue")
plot(kknn_accuracy, ylab = "accuracy level", xlab = "K values", type = 'b', col = 'red', main = 'Fig2: KKNN accuracy')
#determining the max value
df3<-data.frame(kknn_accuracy)
print(paste0("The best KKNN accuracy level is: ", percent(max(df3),2)))
## [1] "The best KKNN accuracy level is: 86%"
#determining the K-value that gives the max value
max_kvalue <-which.max(df3$kknn_accuracy)
print(paste0("The K-value that best classifies the data points is ", max_kvalue))
## [1] "The K-value that best classifies the data points is 12"
\(\underline{KSVM}\)
Model 1 was chosen as the linear "optimal" model, using C = 1,000,000 and the linear kernel ("vanilladot"). Having tinkered with just 2 hyperparameters, C and the kernel, the trade-off decision was to stick with these settings in order to strike a reasonable balance between low bias and low variance (low enough, anyway). That said, these settings were by no means perfect, given that we only varied 2 hyperparameters and the selection process was not systematically implemented.
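A slightly more systematic (though still simple) approach would be to loop over a grid of C values and kernels and tabulate the results, as sketched below; the grid values are illustrative, not prescribed by the assignment, and the accuracies are evaluated in-sample just as in the tables above, so a held-out split (as sketched earlier) would still be needed to judge generalization.
# Illustrative grid search over C and kernel, evaluated in-sample
C_grid <- c(1e-2, 1, 100, 1e4, 1e6)
kernels <- c("vanilladot", "anovadot", "rbfdot", "polydot")
grid_results <- expand.grid(C = C_grid, kernel = kernels, stringsAsFactors = FALSE)
grid_results$accuracy <- apply(grid_results, 1, function(row) {
  fit <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]), type = "C-svc",
              kernel = row[["kernel"]], C = as.numeric(row[["C"]]), scaled = TRUE)
  sum(predict(fit, as.matrix(data[, 1:10])) == data[, 11]) / nrow(data)
})
grid_results[order(-grid_results$accuracy), ]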
For the chosen Model 1, the prediction accuracy was 63%, with a margin of 2.3495.
The classifier equation is:
\(-0.8283471V_1 - 0.2217216V_2 - 0.3301782V_3 + 0.2825488V_4 + 0.5750731V_5 + 0.6143978V_6 + 0.2607774V_7 - 0.5943042V_8 - 1.1175369V_9 + 0.9336833V_{10} - 0.1281168 = 0\)
Linear (vanilla) kernel function
Number of Support Vectors : 427
Objective Function Value : -46182140
Training error : 0.374618
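To sanity-check this equation against the fitted model, one can score the scaled predictors manually, as sketched below; this assumes ksvm's default scaling is equivalent to centering each column and scaling it to unit variance, and that non-negative scores map to the "approved" class (the sign convention may need to be flipped depending on factor-level ordering).
# Manually evaluate the linear classifier: a . x_scaled + a0 >= 0 -> class 1
x_scaled <- scale(as.matrix(data[, 1:10]))   # same centering/scaling ksvm applies
scores <- x_scaled %*% a_mat1 + (-model1@b)  # a_mat1 holds a1..a10; -b is the intercept a0
manual_pred <- ifelse(scores >= 0, 1, 0)
sum(manual_pred == data[, 11]) / nrow(data)  # ideally close to Model1.pred above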
Model 2 was a promising candidate, using the same C = 1,000,000 but the non-linear kernel "anovadot". Model 2 exhibited a higher accuracy (82%) than the linear Model 1 and retained more support vectors (254) than the "rbfdot" Model 3, and thus a wider margin; this should make it a more generalizable model. In addition, a1 and a8 were relatively small and could be dropped, reducing the dimensionality of the overall classifier equation. The benefits of dimensionality reduction are less storage space and faster computational run-time.
Companion coefficients of Model 2 (without the full classifier equation):
a1 -0.09112927 -> Can be dropped
a2 159.45108009
a3 -411.57110320
a4 94.35969342
a5 4.10955797
a6 1.41952178
a7 -27.44089207
a8 -0.49552951 -> Can be dropped
a9 -112.20111008
a10 104.05488643
a0 16.36655
\(\underline{KKNN}\)
In general, the KSVM algorithm was both easy and intuitive to implement. It was easy for a novice like myself to tinker with the different hyperparameters, such as C and the built-in kernels, to search for a good solution. Thanks to the regularization controlled by C, KSVM has good generalization capabilities, which can help prevent over-fitting.
Having said that, what was \(\underline{not}\) intuitive was the fact that varying C may or may not produce a different hyperplane. This was evidenced by having to vary C by huge orders of magnitude before seeing any significant changes to the coefficients of the classifier equation and the corresponding error terms.
Choosing an appropriate kernel function was difficult: picking the "correct" kernel requires an iterative process and experience on the part of the end user.
KSVM handles non-linear data (i.e., data with no hard separation) efficiently, since kernels can be interchanged seamlessly.
Because the predictors in most classification problems are on different orders of magnitude, the data must be scaled in order to get meaningful results. This is especially true for the KKNN model, as it is a purely distance-based algorithm; a small illustration follows below.
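To illustrate the effect of scaling (not part of the original assignment), the hypothetical check_scaling helper below reuses the leave-one-out loop at a fixed k = 12 and simply toggles kknn's scale argument:
# Compare leave-one-out KKNN accuracy at k = 12 with and without scaling
check_scaling <- function(do_scale) {
  predicted <- rep(0, nrow(data))
  for (i in 1:nrow(data)) {
    fit <- kknn(V11 ~ ., data[-i, ], data[i, ], k = 12, scale = do_scale)
    predicted[i] <- as.integer(fitted(fit) + 0.5)  # round off to 0 or 1
  }
  sum(predicted == data[, 11]) / nrow(data)
}
c(scaled = check_scaling(TRUE), unscaled = check_scaling(FALSE))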