Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
Answer: I work for a utility company, and we send drones into the field to monitor assets such as poles, power generation sites, and transformers. The drones capture live pictures or live-streamed video that feed our machine learning algorithms, which classify, based on the imagery, whether certain critical parts of the asset in question should be fixed, repaired, or replaced. The models used for this classification included random forest, XGBoost, YOLO-based convolutional neural networks for object detection, and other decision-tree-based models.
In this HW assignment, we were introduced to the theory and applications of two classification algorithms: KSVM and KKNN.
The toy dataset came from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) and is popularly used by academic institutions for teaching purposes. The steps taken to fulfill the HW requirements were:
(1) Data exploration through some simple visualization techniques.
(2) Tuning the hyperparameters of the algorithms by trying different C values and kernel types for KSVM and different k values for KKNN.
The objectives of the assignment were to find:
- The optimal C value and its corresponding kernel for KSVM
- The k value that provides the highest accuracy for KKNN
library(scales)
library(tidyr)
library(dplyr)
library(kableExtra)
library(knitr)
library(kernlab)
library(kknn)
require(ggthemes)
library(ggplot2)
# Read the whitespace-delimited text file (no header row)
data <- read.table("credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE, sep = "")
head(data,5)
# Check whether any missing values (NAs) exist in the dataset
any(is.na(data))
## [1] FALSE
# Summarise the response column to check the balance of the two classes
df0 <- data %>% group_by(V11) %>% summarise(No_Applicants = n())
# Bar plot of applicants by class
c2 <- c("Not Approved", "Approved")
ggplot(df0, aes(x = V11, y = No_Applicants, fill = c2)) +
  geom_bar(stat = "identity") +
  ggtitle("Fig1: Class distribution of applicants") +
  theme_economist() +
  geom_text(x = 0, y = 200, label = "Not Approved") +
  geom_text(x = 1, y = 200, label = "Approved") +
  xlab("Class (1: Approved, 0: Not Approved)") +
  scale_fill_manual("legend", values = c("Not Approved" = "red", "Approved" = "green"))
Observation 1: It is best practice to verify that the loaded data contain no missing values. If they do, corrective action should be taken without jeopardizing or distorting the original intent of the data. In addition, in any classification problem it is a good idea to check for class imbalance: a minority class makes it harder for the algorithm to learn the patterns in the data properly and, more importantly, limits its ability to generalize when exposed to new data.
As expected, there were no missing values in this academic dataset; it was very clean and tidy.
The class distribution between approved and not-approved applicants was well balanced.
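As a quick illustration of these two checks, the short sketch below counts missing values per column and reports the class proportions; it assumes, as above, that the response is stored in column V11.
# Per-column count of missing values (all zeros expected here)
colSums(is.na(data))
# Proportion of each response class; values near 0.5 indicate a balanced dataset
prop.table(table(data$V11))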
#for reproducibility
set.seed(123)
# Set of C values to test
store <- c(0.000001, 100, 10000, 1000000)
#create empty lists for storage purposes
Opti_C <- vector(mode = "list")
Training_error<- vector(mode = "list")
ksvm_margin <- vector(mode = "list")
for (i in seq_along(store))
{
  # Fit a linear (vanilladot) C-SVC for the current C value
  model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]), type = "C-svc",
                kernel = "vanilladot", C = store[i], scaled = TRUE)
  Opti_C <- c(Opti_C, store[i])
  Training_error <- c(Training_error, model@error)
  # Coefficients of the separating hyperplane (in the scaled feature space)
  a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
  # Margin value reported in Table 1a
  ksvm_margin[i] <- 1 / sqrt(sum(a)^2)
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
# Combine the C values with their associated training errors and margins
col_bind <- cbind(Opti_C, Training_error, ksvm_margin)
mode(col_bind) <- "numeric"
df <- data.frame(col_bind)
# Table of C values vs. training errors and margins
knitr::kable(df, format = "html", row.names = FALSE,
             caption = "Table 1a: C vs. Training Errors & Margins",
             align = "c", font_size = 12) %>%
  row_spec(c(1, 3), background = "lightblue") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Opti_C | Training_error | ksvm_margin |
|---|---|---|
| 1e-06 | 0.4525994 | 1245.0364638 |
| 1e+02 | 0.1360856 | 0.9040157 |
| 1e+04 | 0.1376147 | 0.9786229 |
| 1e+06 | 0.3746177 | 2.3495819 |
Observation 2:
Here we experimented with one of the hyperparameters, C, and found the results to be fairly insensitive to it unless it was changed by several orders of magnitude. Using the "vanilladot" linear kernel as a baseline to check which value of C produced the lowest training error, we found that C = 100 gave the lowest training error at 13.6%, along with the smallest margin of 0.904. At one extreme, a very low C of 1e-06 produced the largest margin at 1245.04; at the other, a very high C of 1e+06 produced a margin of 2.349. Lastly, with C set at 1e+04, the margin was only negligibly larger than at C = 100.
In general, this is the trade-off an end user faces when performing analytics: which set of hyperparameters provides the best solution given the training errors and their associated margins.
My decision was to take the "optimal" C value of 1e+06, absorbing some training error but keeping the margin wide enough that the model retains some ability to generalize when exposed to new, unseen test data.
The classifier equation will be based on the resulting set of coefficients and their corresponding intercept.
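For reference, the standard soft-margin SVM objective below shows how C mediates this trade-off: the margin scales as \(1/\lVert w \rVert\), so a larger C penalizes the misclassification slacks \(\xi_i\) more heavily and tends to shrink the margin, while a smaller C favors a wider margin at the cost of more training error.
\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.
\]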
# Fit a C-SVC model with C = 1e+06 for a given kernel type
model_type <- function(x2)
{
  ksvm_model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]), type = "C-svc",
                     kernel = x2, C = 1000000, scaled = TRUE)
  return(ksvm_model)
}
model_type("vanilladot" )
## Setting default kernel parameters
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1e+06
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 427
##
## Objective Function Value : -46182140
## Training error : 0.374618
model_type("anovadot" )
## Setting default kernel parameters
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1e+06
##
## Anova RBF kernel function.
## Hyperparameter : sigma = 1 degree = 1
##
## Number of Support Vectors : 254
##
## Objective Function Value : -106676174
## Training error : 0.180428
model_type("rbfdot")
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1e+06
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.106691960791606
##
## Number of Support Vectors : 188
##
## Objective Function Value : -2921303
## Training error : 0.001529
model_type("polydot")
## Setting default kernel parameters
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1e+06
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 387
##
## Objective Function Value : -63622512
## Training error : 0.668196
Observation 3:
In this section we adjusted yet another hyperparameter, the kernel, and it appears that KSVM handles non-linear situations well: the training error decreased drastically from 37.4% (linear kernel) to about 0.15% ("rbfdot"). While it may appear that the model improved dramatically given such a low error rate, it likely did so at a price. This is the classic high-variance, low-bias situation, evidenced by the fact that this kernel has the lowest number of support vectors (and, arguably, the smallest margin) among the 4 models.
A potentially better candidate was Model 2 with kernel = "anovadot". This kernel is also non-linear, but with a more reasonable error rate of 18% and a higher number of support vectors (254), giving it a wider margin and more tolerance to misclassification errors.
Using C = 1e+06, Model 3 had the "best" kernel, the default "rbfdot", with the lowest training error.
Using C = 1e+06, Model 2 had the second-lowest training error, but because its kernel is non-linear, the decision boundary is no longer a simple linear function of the predictors, and writing out the full classifier equation is beyond the scope of this exercise. However, its coefficients and intercept will be shown as a companion to the linear Model 1.
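To get a rough sense of how much of the "rbfdot" advantage is over-fitting, one option (not part of the original assignment) is to hold out a validation set and compare the kernels on unseen rows. A minimal sketch, assuming an illustrative 80/20 random split and the same C value:
set.seed(42)
train_idx <- sample(nrow(data), size = round(0.8 * nrow(data)))
# Fit on the training rows only, then report accuracy on the held-out rows
holdout_accuracy <- function(kern) {
  fit <- ksvm(as.matrix(data[train_idx, 1:10]), as.factor(data[train_idx, 11]),
              type = "C-svc", kernel = kern, C = 1000000, scaled = TRUE)
  preds <- predict(fit, as.matrix(data[-train_idx, 1:10]))
  sum(preds == data[-train_idx, 11]) / length(preds)
}
sapply(c("vanilladot", "anovadot", "rbfdot", "polydot"), holdout_accuracy)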
#Calculate Coefficients a0 to am
model1 <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel= "vanilladot", C = 1000000, scaled = T)
## Setting default kernel parameters
a1 <- colSums(model1@xmatrix[[1]]*model1@coef[[1]])
#labeling the coefficients
a_mat1<-as.matrix(a1)
rownames(a_mat1) <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10")
colnames(a_mat1) <- c('Coeficients')
a_mat1
## Coeficients
## a1 -0.8283471
## a2 -0.2217216
## a3 -0.3301782
## a4 0.2825488
## a5 0.5750731
## a6 0.6143978
## a7 0.2607774
## a8 -0.5943042
## a9 -1.1175369
## a10 0.9336833
#Calculate Coefficients a0 to am
model2 <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel= "anovadot", C = 1000000, scaled = T)
## Setting default kernel parameters
a2 <- colSums(model2@xmatrix[[1]]*model2@coef[[1]])
#labeling the coefficients
a_mat2<-as.matrix(a2)
rownames(a_mat2) <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10")
colnames(a_mat2) <- c('Coeficients')
a_mat2
## Coeficients
## a1 -0.09112927
## a2 159.45108009
## a3 -411.57110320
## a4 94.35969342
## a5 4.10955797
## a6 1.41952178
## a7 -27.44089207
## a8 -0.49552951
## a9 -112.20111008
## a10 104.05488643
#Calculate Coefficients a0 to am
model3 <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel= "rbfdot", C = 1000000, scaled = T)
a3 <- colSums(model3@xmatrix[[1]]*model3@coef[[1]])
#labeling the coefficients
a_mat3<-as.matrix(a3)
rownames(a_mat3) <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10")
colnames(a_mat3) <- c('Coeficients')
a_mat3
## Coeficients
## a1 -326.4830
## a2 -508.9533
## a3 -858.7789
## a4 -183.4438
## a5 546.2135
## a6 -400.4594
## a7 236.8309
## a8 -431.0257
## a9 -244.5575
## a10 377.2721
#Calculate Coefficients a0 to am
model4 <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel= "polydot", C = 1000000, scaled = T)
## Setting default kernel parameters
a4 <- colSums(model4@xmatrix[[1]]*model4@coef[[1]])
#labeling the coefficients
a_mat4<-as.matrix(a4)
rownames(a_mat4) <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10")
colnames(a_mat4) <- c('Coeficients')
a_mat4
## Coeficients
## a1 -0.2704748
## a2 -1.5608654
## a3 -0.6957239
## a4 -2.0909780
## a5 0.2390224
## a6 0.3817119
## a7 -1.6549423
## a8 1.0024530
## a9 1.2131130
## a10 1.9821883
#Intercept for each model
a0.Model1.Linear <- -model1@b
a0.Model2 <- -model2@b
a0.Model3.Gaussian <- -model3@b
a0.Model4 <- -model4@b
# Intercepts of all 4 models (Model 1 is the linear baseline; Model 3 is the "best" in-sample model)
df3 <- data.frame(a0.Model1.Linear, a0.Model2, a0.Model3.Gaussian, a0.Model4)
knitr::kable(df3, format = "html", row.names = FALSE,
             caption = "Table 1b: Intercepts of the 4 models",
             align = "c", font_size = 12) %>%
  row_spec(c(1), background = "lightblue") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| a0.Model1.Linear | a0.Model2 | a0.Model3.Gaussian | a0.Model4 |
|---|---|---|---|
| -0.1281168 | 16.36655 | -1.598924 | -0.4399422 |
Observation 4:
Above is a table summarizing the coefficients and corresponding intercepts of the 4 models. Note that the intercept of Model 1, with the linear kernel ("vanilladot"), is the one used in the classifier equation.
# see what the model predicts
pred1 <- predict(model1, data[,1:10],type = "response")
pred2 <- predict(model2, data[,1:10],type = "response")
pred3 <- predict(model3, data[,1:10],type = "response")
pred4 <- predict(model4, data[,1:10],type = "response")
# see what fraction of the model’s predictions match the actual classification
Model1.pred<-sum(pred1==data[,11])/nrow(data)
Model2.pred<-sum(pred2==data[,11])/nrow(data)
Model3.pred<-sum(pred3==data[,11])/nrow(data)
Model4.pred<-sum(pred4==data[,11])/nrow(data)
df2<- data.frame(Model1.pred,Model2.pred,Model3.pred,Model4.pred)
knitr::kable(df2, format = "html", row.names = FALSE,
             caption = "Table 1c: Prediction Accuracy",
             align = "c", font_size = 12) %>%
  row_spec(c(1), background = "lightblue") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Model1.pred | Model2.pred | Model3.pred | Model4.pred |
|---|---|---|---|
| 0.6253823 | 0.8195719 | 0.9984709 | 0.3318043 |
Observation 5:
Similarly, the prediction accuracies of the 4 models with different kernels are shown above. As alluded to in Observations 2 and 3, KSVM showed superior in-sample results when switching from the linear kernel to non-linear kernels with C = 1e+06: going from "vanilladot" to "rbfdot", the prediction accuracy increased from 62% to almost 100%. But again, this comes at a cost; the training accuracy increased while the model's ability to generalize likely decreased. Such a model performs very well during training but would probably do poorly in production when shown new, unseen data.
As also stated in Observation 3, Model 2 with the kernel set to "anovadot" could be a reasonable choice, with an accuracy of 82%. Its coefficients and intercept are displayed below:
Coeficients
a1 -0.09112927
a2 159.45108009
a3 -411.57110320
a4 94.35969342
a5 4.10955797
a6 1.41952178
a7 -27.44089207
a8 -0.49552951
a9 -112.20111008
a10 104.05488643
a0 16.36655
Note that the a1 and a8 coefficients are relatively small in magnitude, at 0.091 and 0.495 respectively. This suggests that their contributions to the model are probably not as relevant as those of the other 8 coefficients, and they could be dropped from further analysis. This is a quick-and-dirty form of dimensionality reduction. Furthermore, by carrying fewer coefficients (dimensions), both computational and storage capacity can be freed up for other, more important operations.
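As a rough check on this idea (not part of the original assignment), one could refit the "anovadot" model without V1 and V8 and see whether the roughly 82% accuracy holds; a minimal sketch, assuming columns 1 and 8 of the data correspond to those two coefficients:
# Refit Model 2 without the two low-magnitude predictors (columns 1 and 8)
keep_cols <- setdiff(1:10, c(1, 8))
model2_reduced <- ksvm(as.matrix(data[, keep_cols]), as.factor(data[, 11]),
                       type = "C-svc", kernel = "anovadot", C = 1000000, scaled = TRUE)
# Compare in-sample accuracy with the full Model 2 (about 82%)
pred2_reduced <- predict(model2_reduced, as.matrix(data[, keep_cols]))
sum(pred2_reduced == data[, 11]) / nrow(data)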
set.seed(1)
# Leave-one-out evaluation of KKNN: for a given k (X), predict each data point
# from a model trained on all of the other points, then compute the overall accuracy
check_accuracy = function(X){
predicted <- rep(0, nrow(data))
for (i in 1:nrow(data))
{
credit.kknn=kknn(V11~.,data[-i,],data[i,],k=X, scale = TRUE)
predicted[i] <- as.integer(fitted(credit.kknn)+0.5) # round off to 0 or 1
}
# calculate % of correct predictions
acc <- sum(predicted == data[,11]) / nrow(data)
return(acc)
}
n<-20 # no. of k values to test
kknn_accuracy<- rep(0,n)
for (i in 1:n)
{
kknn_accuracy[i] <- check_accuracy(i)
}
#Elbow Plot: accuracy vs. K-values
par(bg="lightblue")
plot(kknn_accuracy, ylab = "accuracy level", xlab = "K values", type = 'b', col = 'red', main = 'Fig2: KKNN accuracy')
#determining the max value
df3<-data.frame(kknn_accuracy)
print(paste0("The best KKNN accuracy level is: ", percent(max(df3),2)))
## [1] "The best KKNN accuracy level is: 86%"
#determining the K-value that gives the max value
max_kvalue <-which.max(df3$kknn_accuracy)
print(paste0("The K-value that best classifies the data points is ", max_kvalue))
## [1] "The K-value that best classifies the data points is 12"
\(\underline{KSVM}\)
Model 1 was chosen as the linear "optimal" model, using C = 1,000,000 and the linear kernel ("vanilladot"). Having tinkered with just 2 hyperparameters, C and the kernel, the trade-off decision was to stick with these settings in order to strike a reasonable balance between low bias and low variance (low enough, anyway). That said, these settings were by no means perfect, given that we only varied 2 hyperparameters and the selection process was not systematically implemented.
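A slightly more systematic (though still simple) approach would be to loop over a grid of C values and kernels and tabulate the results, as sketched below; the grid values are illustrative, not prescribed by the assignment, and the accuracies are evaluated in-sample just as in the tables above, so a held-out split (as sketched earlier) would still be needed to judge generalization.
# Illustrative grid search over C and kernel, evaluated in-sample
C_grid <- c(1e-2, 1, 100, 1e4, 1e6)
kernels <- c("vanilladot", "anovadot", "rbfdot", "polydot")
grid_results <- expand.grid(C = C_grid, kernel = kernels, stringsAsFactors = FALSE)
grid_results$accuracy <- apply(grid_results, 1, function(row) {
  fit <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]), type = "C-svc",
              kernel = row[["kernel"]], C = as.numeric(row[["C"]]), scaled = TRUE)
  sum(predict(fit, as.matrix(data[, 1:10])) == data[, 11]) / nrow(data)
})
grid_results[order(-grid_results$accuracy), ]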
For the chosen Model 1, the prediction accuracy was 63%, with a margin of 2.3495.
The classifier equation is:
\(-0.8283471V_1 - 0.2217216V_2 - 0.3301782V_3 + 0.2825488V_4 + 0.5750731V_5 + 0.6143978V_6 + 0.2607774V_7 - 0.5943042V_8 - 1.1175369V_9 + 0.9336833V_{10} - 0.1281168 = 0\)
Linear (vanilla) kernel function
Number of Support Vectors : 427
Objective Function Value : -46182140
Training error : 0.374618
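To sanity-check this equation against the fitted model, one can score the scaled predictors manually, as sketched below; this assumes ksvm's default scaling is equivalent to centering each column and scaling it to unit variance, and that non-negative scores map to the "approved" class (the sign convention may need to be flipped depending on factor-level ordering).
# Manually evaluate the linear classifier: a . x_scaled + a0 >= 0 -> class 1
x_scaled <- scale(as.matrix(data[, 1:10]))   # same centering/scaling ksvm applies
scores <- x_scaled %*% a_mat1 + (-model1@b)  # a_mat1 holds a1..a10; -b is the intercept a0
manual_pred <- ifelse(scores >= 0, 1, 0)
sum(manual_pred == data[, 11]) / nrow(data)  # ideally close to Model1.pred above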
Model 2 was a promising candidate, using the same C = 1,000,000 but the non-linear kernel "anovadot". Model 2 exhibited a higher accuracy (82%) than the linear Model 1 and retained more support vectors (254) than the "rbfdot" Model 3, and thus a wider margin; this should make it a more generalizable model. In addition, a1 and a8 were relatively small and could be dropped, reducing the dimensionality of the overall classifier equation. The benefits of dimensionality reduction are less storage space and faster computational run-time.
Companion coefficients of Model 2 (without the full classifier equation):
a1 -0.09112927 -> Can be dropped
a2 159.45108009
a3 -411.57110320
a4 94.35969342
a5 4.10955797
a6 1.41952178
a7 -27.44089207
a8 -0.49552951 -> Can be dropped
a9 -112.20111008
a10 104.05488643
a0 16.36655
\(\underline{KKNN}\)
In general, the KSVM algorithm was both easy and intuitive to implement. It was easy for a novice like myself to tinker with the different hyperparameters, such as C and the built-in kernels, to search for a good solution. Thanks to the regularization controlled by C, KSVM has good generalization capabilities, which can help prevent over-fitting.
Having said that, what was \(\underline{not}\) intuitive was the fact that varying C may or may not produce a different hyperplane. This was evidenced by having to vary C by huge orders of magnitude before seeing any significant changes to the coefficients of the classifier equation and the corresponding error terms.
Choosing an appropriate kernel function was difficult: picking the "correct" kernel requires an iterative process and experience on the part of the end user.
KSVM handles non-linear data (i.e., data with no hard separation) efficiently, since kernels can be interchanged seamlessly.
Because the predictors in most classification problems are on different orders of magnitude, the data must be scaled in order to get meaningful results. This is especially true for the KKNN model, as it is a purely distance-based algorithm; a small illustration follows below.
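To illustrate the effect of scaling (not part of the original assignment), the hypothetical check_scaling helper below reuses the leave-one-out loop at a fixed k = 12 and simply toggles kknn's scale argument:
# Compare leave-one-out KKNN accuracy at k = 12 with and without scaling
check_scaling <- function(do_scale) {
  predicted <- rep(0, nrow(data))
  for (i in 1:nrow(data)) {
    fit <- kknn(V11 ~ ., data[-i, ], data[i, ], k = 12, scale = do_scale)
    predicted[i] <- as.integer(fitted(fit) + 0.5)  # round off to 0 or 1
  }
  sum(predicted == data[, 11]) / nrow(data)
}
c(scaled = check_scaling(TRUE), unscaled = check_scaling(FALSE))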