The main objective of this project is to try different kernels for a Support Vector Machine and perform 5-fold cross-validation in order to tune the parameters and obtain the optimal kernel and parameter values for this specific task.
To achieve this, we will use the Wisconsin Breast Cancer Dataset from the University of Wisconsin.
The dataset consists of \(698\) observations of \(10\) features, plus a class label that describes whether the tumor is benign or malignant. To perform the cross-validation, a set of models was first chosen so that their predictions could be compared against each other using 5-fold cross-validation.
The main models evaluated are SVMs with different kernels; for this exercise the kernels used are polynomial, Gaussian (RBF), and linear. To tune them, the focus was on computing the cross-validations (CV) for specific values of the parameters C (the cost of constraint violations) and epsilon (the epsilon of the insensitive loss function).
The tuned parameters for each Kernel SVM were:
- epsilon = 0.001 and C = 1
- epsilon = 0.01 and C = 1
- epsilon = 0.1 and C = 1
- epsilon = 0.001 and C = 10
- epsilon = 0.01 and C = 10
- epsilon = 0.1 and C = 10
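A minimal sketch of how this tuning grid can be expressed in R is shown below. The helper fit_ksvm and the grid object are illustrative only (they are not part of the code appendix at the end of this report); they simply show how one (kernel, C, epsilon) combination is fitted with kernlab's ksvm, assuming a training data frame with a Class column.
library(kernlab)
# All six (epsilon, C) combinations evaluated for each kernel
param_grid <- expand.grid(epsilon = c(0.001, 0.01, 0.1), C = c(1, 10))
# Hypothetical helper: fit one kernel SVM for a given row of the grid,
# with kernel one of "polydot", "rbfdot" or "vanilladot"
fit_ksvm <- function(train, kernel, C, epsilon){
ksvm(as.factor(Class) ~ ., data = train, kernel = kernel, C = C, epsilon = epsilon)
}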
In addition, a Random Forest, a Naive Bayes classifier, and Linear Discriminant Analysis were fitted in order to compare their accuracies against each other and against the best kernel SVM. These models also went through cross-validation, but their parameters were simply left at the default values.
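As a quick sketch (assuming a training data frame called train that holds the predictors and the Class label, as in the appendix), the three baseline models can be fitted with their defaults as follows; note that the appendix additionally sets nodesize and ntree for the random forest.
library(randomForest)
library(e1071)
library(MASS)
# Baseline models with default parameters ("train" is an assumed data frame with a Class column)
rf_model <- randomForest(as.factor(Class) ~ ., data = train)
nb_model <- naiveBayes(as.factor(Class) ~ ., data = train)
lda_model <- lda(as.factor(Class) ~ ., data = train)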
The Wisconsin Breast Cancer dataset was first split into a 90% training set and a 10% test set. The samplings and cross-validation use only the training set, so that the training and validation sets of the 5-fold CV are always drawn from it.
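A sketch of this split, mirroring the code appendix (where the 10% hold-out is named dfValidation), uses caTools::sample.split so that the class proportions are preserved in both partitions:
library(caTools)
set.seed(1)
# Stratified 90/10 split on the Class label; the ID column (first column) is dropped
spl <- sample.split(df$Class, SplitRatio = 0.9)
dfTrain <- subset(df[-1], spl == TRUE)
dfValidation <- subset(df[-1], spl == FALSE)   # the test set referred to in the text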
For this project, the data is considered to be clean. To build a model we would normally examine whether the features are correlated, or even try different combinations of features, but in this scenario all the features except the ID were included in the model-building process. The ID variable was removed because it should not be a predictor: it could introduce bias, since IDs are sometimes generated according to the label to make record keeping easier for hospital personnel, and it carries no information about the patient.
When using 5-fold cross-validation to estimate the accuracy of each model, it is not reliable to simply trust the average accuracy of the 5 folds, because the dataset was only shuffled and split once. To reduce this bias and expose the variance of the models, the cross-validation process was run in a loop in which the data was resampled 20 times and a 5-fold cross-validation was performed on each resample.
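In terms of the functions defined in the code appendix, each group of models goes through this procedure with a single call, for example for the polynomial-kernel grid:
# 20 resamples x 5 folds = 100 fold-level accuracy estimates per model
results_poly <- cv_n_times(dfTrain, ksvms_poly, k = 5, n = 20)
acc_poly <- accuracy(results_poly)   # one accuracy per (resample, fold) and per model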
This yields 100 accuracy estimates for each model. The models' accuracies were then plotted as boxplots, with the following results:
The plots generated for the different models show the variability of each model: even if a model achieves an outstanding accuracy for predicting breast cancer, it is important to see whether that accuracy would change if the experiment were repeated.
For the three kernels, the fold accuracies of the 20 resampled datasets displayed in the boxplots show that the best parameters are C = \(1\) and epsilon = \(0.1\).
Now let's compare the other models, and then put the best ones side by side against each other.
Here it is noticeable that the Random Forest not only has less variance than the other models, but also a higher median accuracy.
For the different kernel SVMs the results are very similar; one could even say they all perform about the same overall, but looking at the variance may change the decision. The Polynomial_KSVM with parameters [C = 1, epsilon = 0.1] seems to be the safer choice, since its variance is lower and it is important to have certainty in our predictions.
Cross-validation helps us choose between the different models and their tuning parameters. With this information we can now use all the training data to build a Polynomial KSVM model and use the test set to measure its accuracy.
By evaluating the model we get the following confusion matrix:
## Setting default kernel parameters
##    prediction
##      2  4
##   2 46  0
##   4  2 22
The confusion matrix shows the number of correctly predicted observations and the number misclassified. The accuracy of this model is 0.97. It is important to note that there are 2 false negatives, which is the worst kind of error in this setting, since the model predicts a benign result for patients who actually have a malignant tumor.
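As a small sanity check on these numbers (using the objects from the end of the code appendix, where the confusion matrix has the true class in the rows), the accuracy and the false-negative count can be read off as follows:
cm <- table(dfValidation$Class, prediction)   # rows: true class (2 = benign, 4 = malignant)
sum(diag(cm)) / sum(cm)                        # accuracy: (46 + 22) / 70, roughly 0.97
cm["4", "2"]                                   # malignant tumors predicted as benign (false negatives): 2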
With cross-validation it was possible to tune the parameters of the different kernel support vector machines. By the end of the exercise an accuracy of 0.97 was achieved with the Polynomial Kernel SVM on the held-out test set, and this choice of model and parameters is backed up by resampling the data multiple times and performing a 5-fold cross-validation on each resample.
Important things to notice are highlighted as comments in the full code below:
# Loading Packages
library(kernlab)
library(e1071)
library(randomForest)
library(dplyr)
library(ggplot2)
library(MASS)
library(reshape2)
######### FUNCTIONS #########
# Models to evaluate
#This function receives a train and a test dataset and the name of the algorithm; it fits the requested model on the training data and returns the predicted class for the test data
models <- function(train, test, algorithm){
#Polynomial KSVM [Cost = 1, epsilon = .001]
if(algorithm == "poly_eps.001_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='polydot', C=1, epsilon =.001 ,cross=length(train))
#Polynomial KSVM [Cost = 1, epsilon = .01]
}else if(algorithm == "poly_eps.01_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='polydot', C=1, epsilon =.01 ,cross=length(train))
#Polynomial KSVM [Cost = 1, epsilon = .1]
}else if(algorithm == "poly_eps.1_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='polydot', C=1, epsilon =.1 ,cross=length(train))
#Polynomial KSVM [Cost = 10, epsilon = .001]
}else if(algorithm == "poly_eps.001_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='polydot', C=10, epsilon =.001 ,cross=length(train))
#Polynomial KSVM [Cost = 10, epsilon = .01]
}else if(algorithm == "poly_eps.01_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='polydot', C=10, epsilon =.01 ,cross=length(train))
#Polynomial KSVM [Cost = 10, epsilon = .1]
}else if(algorithm == "poly_eps.1_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='polydot', C=10, epsilon =.1 ,cross=length(train))
#Radial Basis KSVM [Cost = 1, epsilon = .001]
}else if(algorithm == "rbf_eps.001_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='rbfdot', C=1, epsilon =.001 ,cross=length(train))
#Radial Basis KSVM [Cost = 1, epsilon = .01]
}else if(algorithm == "rbf_eps.01_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='rbfdot', C=1, epsilon =.01 ,cross=length(train))
#Radial Basis KSVM [Cost = 1, epsilon = .1]
}else if(algorithm == "rbf_eps.1_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='rbfdot', C=1, epsilon =.1 ,cross=length(train))
#Radial Basis KSVM [Cost = 10, epsilon = .001]
}else if(algorithm == "rbf_eps.001_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='rbfdot', C=10, epsilon =.001 ,cross=length(train))
#Radial Basis KSVM [Cost = 10, epsilon = .01]
}else if(algorithm == "rbf_eps.01_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='rbfdot', C=10, epsilon =.01 ,cross=length(train))
#Radial Basis KSVM [Cost = 10, epsilon = .1]
}else if(algorithm == "rbf_eps.1_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='rbfdot', C=10, epsilon =.1 ,cross=length(train))
#Linear KSVM [Cost = 1, epsilon = .001]
}else if(algorithm == "lin_eps.001_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='vanilladot', C=1, epsilon =.001 ,cross=length(train))
#Linear KSVM [Cost = 1, epsilon = .01]
}else if(algorithm == "lin_eps.01_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='vanilladot', C=1, epsilon =.01 ,cross=length(train))
#Linear KSVM [Cost = 1, epsilon = .1]
}else if(algorithm == "lin_eps.1_C1"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='vanilladot', C=1, epsilon =.1 ,cross=length(train))
#Linear KSVM [Cost = 10, epsilon = .001]
}else if(algorithm == "lin_eps.001_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='vanilladot', C=10, epsilon =.001 ,cross=length(train))
#Linear KSVM [Cost = 10, epsilon = .01]
}else if(algorithm == "lin_eps.01_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='vanilladot', C=10, epsilon =.01 ,cross=length(train))
#Linear KSVM [Cost = 10, epsilon = .1]
}else if(algorithm == "lin_eps.1_C10"){
model <- ksvm (as.factor(Class) ~ ., data = train,kernel='vanilladot', C=10, epsilon =.1 ,cross=length(train))
#Naive Bayes
}else if(algorithm == "nb"){
model <- naiveBayes(as.factor(Class) ~ ., data = train)
#Linear Discriminant Analysis
}else if(algorithm == "lda"){
model <- lda (as.factor(Class) ~ . ,prior = c(1,1)/2, data = train, CV=FALSE)
#Random Forest
}else{
model <- randomForest(as.factor(Class) ~ .,data = train, nodesize=25, ntree = 200)
}
if(algorithm != "lda"){
prediction = predict(model,newdata=test)
}else{
prediction = predict(model,newdata=test)$class
}
return(prediction)
}
# Cross Validation
# This function receives the data and the models for which it will apply 5 Fold Cross Validation
cross_validation <- function(data, models, k = 5){
#Shuffle the data
dfTrain<-data[sample(nrow(data)),]
#Create K equally sized folds
folds <- cut( seq(1,nrow(data)), breaks= k, labels=FALSE)
#Perform K fold cross validation
results <- data.frame()
for(i in 1:k){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- dfTrain[testIndexes, ]
trainData <- dfTrain[-testIndexes, ]
#Setting up a dataframe for all the predictions of given iteration
Predictions = data.frame("iteration" = rep(i,length(testIndexes)), "Class" = testData$Class)
#Predictive Modeling for all the algorithms in the models function.
for(j in 1:length(models)){
pred <- models(trainData,testData,models[j])
Predictions <- cbind(Predictions,pred)
}
names(Predictions) <- c("iteration","class", models)
#Append the iteration results with the other iterations.
results = rbind(results,Predictions)
}
return(results)
}
# Function to perform the cross validation N times for the input models.
cv_n_times <- function(data, models, k=5, n ){
results <- data.frame()
# Repeat the cross_validation function and concatenate the outputs
for(i in 1:n){
cv <- cross_validation(data,models)
cv <- cbind("sample" = i, cv)
results = rbind(results,cv)
}
# Convert the predictions into Booleans (correct / incorrect) to simplify the accuracy calculation
results[models] <- sapply(models, function(i){
results[i] == results$class
})
return(results)
}
# Function to calculate the accuracy for each algorithm from the multi time sampled CV results
accuracy <- function(results){
algorithms =names(results[-c(1:3)])
Accuracy <- results[c("sample","iteration", algorithms)] %>%
group_by(sample,iteration) %>%
summarise(across(everything(), mean), .groups = "drop") %>%
setNames(c("sample","iteration",algorithms))
return(Accuracy)
}
#Get boxplots from the N x Folds Accuracies of the models
plotbox <- function(accuracies_table, color, main){
algorithms = names(accuracies_table[-c(1:2)])
boxplot(accuracies_table[algorithms], col = color ,
main = main, ylab = "Accuracy",
las=2,ylim = c(0.90,1))
}
######### FUNCTIONS #########
######### MAIN ##########
# Load Cancer DataSet
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
df = read.csv(url)
names(df) <- c("Id", "Clump_Thickness","Unif_Cell_Size","Unif_Cell_Shape","Marginal_Adhesion",
"Sing_Epithelial_Cell_Size","Bare_Nuclei","Bland_Chromatin","Normal_Nucleoli",
"Mitoses","Class")
#Objective: Evaluate models by Cross validation
# Models
ksvms_poly <- c("poly_eps.001_C1", "poly_eps.01_C1", "poly_eps.1_C1", "poly_eps.001_C10","poly_eps.01_C10","poly_eps.1_C10")
ksvms_rbf <- c("rbf_eps.001_C1", "rbf_eps.01_C1", "rbf_eps.1_C1", "rbf_eps.001_C10","rbf_eps.01_C10","rbf_eps.1_C10")
ksvms_linear <- c("lin_eps.001_C1", "lin_eps.01_C1", "lin_eps.1_C1", "lin_eps.001_C10","lin_eps.01_C10","lin_eps.1_C10")
algorithms <- c("rf", "nb", "lda")
#Split dataset
library(caTools)
set.seed(1)
spl = sample.split(df$Class, SplitRatio = 0.9)
# We create a Training and a Validation set, for the purpose of cross validation and of validating our model at the end.
# The first variable "ID" is removed because it may be biased: the IDs could have been generated depending on the Class label, and it tells us nothing about the patient.
dfTrain = subset(df[-1], spl == TRUE)
dfValidation = subset(df[-1], spl == FALSE)
#Statistics
table(df$Class)
#Percentage of malignant tumors
table(df$Class)[2]/sum(table(df$Class))
# Accuracy for each iteration for models
# Results of the N x 5-fold CV: the training data is resampled 20 times, as described above
results_poly = cv_n_times(dfTrain,ksvms_poly, n=20)
results_rbf = cv_n_times(dfTrain,ksvms_rbf, n=20)
results_linear = cv_n_times(dfTrain,ksvms_linear,n=20)
results_algorithms = cv_n_times(dfTrain,algorithms,n=20)
# Accuracy tables
acc_poly = accuracy(results_poly)
acc_rbf = accuracy(results_rbf)
acc_linear = accuracy(results_linear)
acc_algorithms = accuracy(results_algorithms)
# Plots
plotbox(acc_poly,"lightblue", "Polynomial KSVM Models")
plotbox(acc_rbf,"darkseagreen", "Gaussian KSVM Models" )
plotbox(acc_linear,"dodgerblue2", "Linear KSVM Models")
plotbox(acc_algorithms, "coral3", "Other Models' Accuracy")
## We can see that the best models for each KSVM are the ones with the Parameters [C=1, eps = 0.1]
## This means we could compare the best KSVMs against each other, and against the Random Forest, which proved
## to be the best model out of (RF, NB, LDA).
# Concatenate the best models
best_models = cbind(acc_algorithms[,1:3], "Poly_KSVM" = acc_poly$poly_eps.1_C1, "RBF_KSVM" = acc_rbf$rbf_eps.1_C1, "Linear_KSVM" = acc_linear$lin_eps.1_C1)
# Plot the best models to compare
plotbox(best_models, c("lightblue", "darkseagreen", "dodgerblue2", "coral3"), "Best Models")
# For all the different kernel SVMs, we could say the results are very similar.
# It could even appear that the Random Forest has better accuracy than the rest, but if
# we consider the outlier for the Random Forest, the decision might change. The Linear_KSVM
# with parameters [C = 1, epsilon = 0.1] seems a safer choice, since its variance is lower and
# it is important to have certainty in our predictions.
LinearKSVM <- ksvm (as.factor(Class) ~ ., data = dfTrain,kernel='vanilladot', C=1, epsilon =.1 ,cross=length(dfTrain))
prediction = predict(LinearKSVM,newdata=dfValidation)
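# Confusion matrix: rows are the true class (2 = benign, 4 = malignant),
# columns are the predicted class; the test accuracy is the proportion on the diagonal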
cm=table(dfValidation$Class, prediction)
sum(diag(cm))/sum(cm)
######### MAIN ##########