Info about the lab

Learning aim

The aim of this lab is to experiment with kernels and hyperparameters of support vector machine.

You are encouraged to explore plots of support vector machines for better understanding, but visualizing SVM is not an aim of this lab.

Objectives

By the end of this lab session, students should be able to

  1. Fit a support vector machine for classification

  2. Choose a suitable kernel for support vector machine

  3. Tune parameters of support vector machine by cross-validation

  4. Impute missing values into data

Mode

Please run the R chunks one by one, look at the output and make sure that you understand how it is produced. There will be questions that either require a short answer - then you type your answer right in this document - or modifying R codes - then you modify the R codes here. In either case, you can discuss your work with the lab instructor.

Data

The dataset for this lab is taken from a Portugese study of risk factors of cervical cancer:

http://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29

You can get more information on cervical cancer here:

https://www.healthhub.sg/a-z/diseases-and-conditions/93/topic_cervical_cancer

Loading data into R

First we will load the data into R. Below are the data dimensions and variables.

library(tidyverse) # for manipulation with data
library(caret) # for machine learning, including KNN

library(kernlab) # for training decision trees

raw_data <- read.csv("risk_factors_cervical_cancer.csv",
              na.strings = c("NA", "?"),
              stringsAsFactors = FALSE) %>%
  as_tibble()

cat("Dimensions of the dataset are", dim(raw_data), "\n")
## Dimensions of the dataset are 858 30
cat("Sample of data:\n")
## Sample of data:
head(raw_data)

Response variable

According to the data description, the following variables are target variables:

target_vars <-  c("Hinselmann", "Schiller", "Citology", "Biopsy")
target_vars
## [1] "Hinselmann" "Schiller"   "Citology"   "Biopsy"

Note that each of them is numeric and its value is 0 or 1, with 0 representing a negative result of a certain test and 1 positive result. We will create a new target variable, “y”, whose value is “cancer” if any of these four tests is positive and “no_cancer” if all these four tests are negative.

Important remark: while under the hood, an SVM converts a binary variable to a numeric variable with values \(-1\) and \(1\), in order to train an SVM in R, the response variable should be categorical, i.e., of class “factor”. Predictors can be either numeric or categorical, but categorical ones will be automatically converted to numeric with values 0 or 1. Compare it to other classification models:

  • Logistic regression - response variable and predictors can be either categorical or numeric, but categorical will be automatically converted to numeric variables with values 0 and 1.

  • KNN - response variable is passed to the model as is and while predictors will be automatically converted to numeric variables with values 0 and 1.

  • Tree-based methods (decision tree, bagging, random forest, boosting) — there is no conversion of variables. Categorical variables are passed to the model training function as categorical.

Below we create a response variable as follows: select columns in the raw data representing the target variables and then label the response as “cancer” whenever any of the targel variables equals 1 and “no_cancer” if all the target variables are equal to 0.

response <- raw_data %>%
  select(all_of(target_vars)) %>%
  apply(1, any) %>%
  ifelse("cancer", "no cancer") %>%
  as.factor

X <- raw_data %>%
  select(-all_of(target_vars)) %>%
  mutate(y = response) 
rm(response)

head(X)  
# X$y <- apply(X[ , target_vars],
#              1, sum)
# 
# X <- X[ , !(names(X) %in% target_vars)]
# 
# X$y <- as.factor(X$y > 0)
# X$y <- recode(X$y, 'TRUE' = 'cancer', 'FALSE' = 'no_cancer')
# table(X$y)

Note that we also removed the variable response once it was inserted into the data frame X.

Missing values

Our dataset has missing values. Below we calculate the number of missing values in each column by applying the function “sum(is.na(x))” to each column of the dataset:

sapply(X, function(x) sum(is.na(x)))
##                                Age          Number.of.sexual.partners 
##                                  0                                 26 
##           First.sexual.intercourse                 Num.of.pregnancies 
##                                  7                                 56 
##                     Smokes..years.                Smokes..packs.year. 
##                                 13                                 13 
##    Hormonal.Contraceptives..years.                        IUD..years. 
##                                108                                117 
##                      STDs..number.                STDs.condylomatosis 
##                                105                                105 
##       STDs.cervical.condylomatosis        STDs.vaginal.condylomatosis 
##                                105                                105 
## STDs.vulvo.perineal.condylomatosis                      STDs.syphilis 
##                                105                                105 
##   STDs.pelvic.inflammatory.disease                STDs.genital.herpes 
##                                105                                105 
##         STDs.molluscum.contagiosum                          STDs.AIDS 
##                                105                                105 
##                           STDs.HIV                   STDs.Hepatitis.B 
##                                105                                105 
##                           STDs.HPV          STDs..Number.of.diagnosis 
##                                105                                  0 
##                          Dx.Cancer                             Dx.CIN 
##                                  0                                  0 
##                             Dx.HPV                                 Dx 
##                                  0                                  0 
##                                  y 
##                                  0

There are many strategies for dealing with missing data. The simplest approach is removing observations that has missing values. The most advanced method is using statistical learning to predict most reasonable values from values of other variables. You are probably familiar with it from the following story on stolen chemistry A-level papers in 2018:

LINK TO CHANNEL NEWS ASIA

Our dataset is not very large. If we simply deleted all missing values, it would reduce the size of the data. Instead of deleting observations with missing values, we will just replace missing values with the median of existing values:

replace_na_with_median <- function(x) {
  # This function does not change non-numeric vectors
  # And if x is numeric, it returns a copy of x with NA replaced with median(x)
  if (is.numeric(x)) {
    result <- x
    result[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    result <- x
  }
  result
}

X <- X %>%
  mutate_all(~replace_na_with_median(.))

head(X)

Question 1

Let us look at the summary of our dataset now

summary(X)
##       Age        Number.of.sexual.partners First.sexual.intercourse
##  Min.   :13.00   Min.   : 1.000            Min.   :10              
##  1st Qu.:20.00   1st Qu.: 2.000            1st Qu.:15              
##  Median :25.00   Median : 2.000            Median :17              
##  Mean   :26.82   Mean   : 2.512            Mean   :17              
##  3rd Qu.:32.00   3rd Qu.: 3.000            3rd Qu.:18              
##  Max.   :84.00   Max.   :28.000            Max.   :32              
##  Num.of.pregnancies Smokes..years.   Smokes..packs.year.
##  Min.   : 0.000     Min.   : 0.000   Min.   : 0.0000    
##  1st Qu.: 1.000     1st Qu.: 0.000   1st Qu.: 0.0000    
##  Median : 2.000     Median : 0.000   Median : 0.0000    
##  Mean   : 2.258     Mean   : 1.201   Mean   : 0.4463    
##  3rd Qu.: 3.000     3rd Qu.: 0.000   3rd Qu.: 0.0000    
##  Max.   :11.000     Max.   :37.000   Max.   :37.0000    
##  Hormonal.Contraceptives..years.  IUD..years.      STDs..number.  
##  Min.   : 0.000                  Min.   : 0.0000   Min.   :0.000  
##  1st Qu.: 0.000                  1st Qu.: 0.0000   1st Qu.:0.000  
##  Median : 0.500                  Median : 0.0000   Median :0.000  
##  Mean   : 2.035                  Mean   : 0.4446   Mean   :0.155  
##  3rd Qu.: 2.000                  3rd Qu.: 0.0000   3rd Qu.:0.000  
##  Max.   :30.000                  Max.   :19.0000   Max.   :4.000  
##  STDs.condylomatosis STDs.cervical.condylomatosis STDs.vaginal.condylomatosis
##  Min.   :0.00000     Min.   :0                    Min.   :0.000000           
##  1st Qu.:0.00000     1st Qu.:0                    1st Qu.:0.000000           
##  Median :0.00000     Median :0                    Median :0.000000           
##  Mean   :0.05128     Mean   :0                    Mean   :0.004662           
##  3rd Qu.:0.00000     3rd Qu.:0                    3rd Qu.:0.000000           
##  Max.   :1.00000     Max.   :0                    Max.   :1.000000           
##  STDs.vulvo.perineal.condylomatosis STDs.syphilis    
##  Min.   :0.00000                    Min.   :0.00000  
##  1st Qu.:0.00000                    1st Qu.:0.00000  
##  Median :0.00000                    Median :0.00000  
##  Mean   :0.05012                    Mean   :0.02098  
##  3rd Qu.:0.00000                    3rd Qu.:0.00000  
##  Max.   :1.00000                    Max.   :1.00000  
##  STDs.pelvic.inflammatory.disease STDs.genital.herpes
##  Min.   :0.000000                 Min.   :0.000000   
##  1st Qu.:0.000000                 1st Qu.:0.000000   
##  Median :0.000000                 Median :0.000000   
##  Mean   :0.001166                 Mean   :0.001166   
##  3rd Qu.:0.000000                 3rd Qu.:0.000000   
##  Max.   :1.000000                 Max.   :1.000000   
##  STDs.molluscum.contagiosum   STDs.AIDS    STDs.HIV       STDs.Hepatitis.B  
##  Min.   :0.000000           Min.   :0   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.000000           1st Qu.:0   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.000000           Median :0   Median :0.00000   Median :0.000000  
##  Mean   :0.001166           Mean   :0   Mean   :0.02098   Mean   :0.001166  
##  3rd Qu.:0.000000           3rd Qu.:0   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.000000           Max.   :0   Max.   :1.00000   Max.   :1.000000  
##     STDs.HPV        STDs..Number.of.diagnosis   Dx.Cancer      
##  Min.   :0.000000   Min.   :0.00000           Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000           1st Qu.:0.00000  
##  Median :0.000000   Median :0.00000           Median :0.00000  
##  Mean   :0.002331   Mean   :0.08741           Mean   :0.02098  
##  3rd Qu.:0.000000   3rd Qu.:0.00000           3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :3.00000           Max.   :1.00000  
##      Dx.CIN            Dx.HPV              Dx                  y      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   cancer   :102  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   no cancer:756  
##  Median :0.00000   Median :0.00000   Median :0.00000                  
##  Mean   :0.01049   Mean   :0.02098   Mean   :0.02797                  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000                  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000

Note that a lot of these variables are almost constant zeroes - there are just a few non-zero observations. This will be a problem for cross-validation because there will be a high chance the few positive observations will not be included into some stage of cross-validation.

Remove numeric variables whose mean value is below \(0.01\). Dimensions of the new data should be 858 by 19.

# Modify the code below

mean_values <- X %>%
  summarise_if(is.numeric, mean) 

vars_with_little_variability <-
  names(mean_values)[which(mean_values <= 0.01)]

vars_with_little_variability
## [1] "STDs.cervical.condylomatosis"     "STDs.vaginal.condylomatosis"     
## [3] "STDs.pelvic.inflammatory.disease" "STDs.genital.herpes"             
## [5] "STDs.molluscum.contagiosum"       "STDs.AIDS"                       
## [7] "STDs.Hepatitis.B"                 "STDs.HPV"
X <- X %>%
  select(-all_of(vars_with_little_variability))

dim(X)
## [1] 858  19

Data visualization

Our dataset is multidimensional. To get some impression of what it looks like, here is a scatterplot that only uses two variables.

ggplot(data = X, aes(x = Age, y = Num.of.pregnancies, colour = y)) +
  geom_point()

Training a support vector machine

Training and test sets

We will now split our data into training and test sets

set.seed(42)
p <- 0.8
ind <- which(runif(nrow(X)) < p)
train_data <- X %>% slice(ind)
test_data <- X %>% slice(-ind)
cat("Dimensions of the training data are", dim(train_data), "\n")
## Dimensions of the training data are 691 19
cat("Dimensions of the validation data are", dim(test_data), "\n")
## Dimensions of the validation data are 167 19

Linear kernel

First, we will try a linear kernel. Below is the summary of our first trained SVM.

mod_svm_linear <- train(
  y ~ . , train_data, method = 'svmLinear', 
  trControl = trainControl("none")
)
mod_svm_linear
## Support Vector Machines with Linear Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'cancer', 'no cancer' 
## 
## No pre-processing
## Resampling: None

And here is the resulting confusion matrix (on the training set)

mod_svm_linear %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  cancer no cancer
##   cancer         2         1
##   no cancer     18       146
##                                          
##                Accuracy : 0.8862         
##                  95% CI : (0.828, 0.9301)
##     No Information Rate : 0.8802         
##     P-Value [Acc > NIR] : 0.4645336      
##                                          
##                   Kappa : 0.1473         
##                                          
##  Mcnemar's Test P-Value : 0.0002419      
##                                          
##             Sensitivity : 0.10000        
##             Specificity : 0.99320        
##          Pos Pred Value : 0.66667        
##          Neg Pred Value : 0.89024        
##              Prevalence : 0.11976        
##          Detection Rate : 0.01198        
##    Detection Prevalence : 0.01796        
##       Balanced Accuracy : 0.54660        
##                                          
##        'Positive' Class : cancer         
## 

Question 2

Notice that the overall accuracy is quite decent. However, the balanced accuracy is low. Why is that?

This happened because the values of the response variable are imbalanced and hence by just predicting that no one has cancer we can achieve a pretty good accuracy.

Tuning the hyperparameter

The linear kernel has a hyperparameter \(C\) that should be chosen by cross-validation. When training a linear SVM classifier, we did not specify the value of \(C\), which means that the default value was chosen. We can extract it from the model as follows:

mod_svm_linear$bestTune

Let us now apply a 5-fold cross-validation to select the best value of \(C\) on a certain grid:

svm_linear_grid <- expand.grid(C = 2^(-4:1))

mod_svm_linear_tuned <- train(y ~ . , data = train_data, method = "svmLinear",
                tuneGrid = svm_linear_grid,
                trControl = trainControl("cv", number = 5))

mod_svm_linear_tuned
## Support Vector Machines with Linear Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'cancer', 'no cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 553, 553, 553, 552, 553 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa       
##   0.0625  0.8798874  -0.002765774
##   0.1250  0.8784485   0.011496514
##   0.2500  0.8784485   0.011496514
##   0.5000  0.8755500   0.106363014
##   1.0000  0.8741007   0.103094134
##   2.0000  0.8741007   0.103094134
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.0625.

And here is the plot of cross-validation accuracy vs the hyperparameter value:

plot(mod_svm_linear_tuned)

Now let us find the test confusion matrix:

mod_svm_linear_tuned %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  cancer no cancer
##   cancer         0         0
##   no cancer     20       147
##                                           
##                Accuracy : 0.8802          
##                  95% CI : (0.8211, 0.9253)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 0.5592          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8802          
##              Prevalence : 0.1198          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : cancer          
## 

SVM with a polynomial kernel

Here we will train a polynomial kernel with a default grid.

mod_svm_poly_default <- train(y ~ . , data = train_data, method = "svmPoly",
                            trControl = trainControl("cv", number = 5))

mod_svm_poly_default
## Support Vector Machines with Polynomial Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'cancer', 'no cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 553, 552, 553, 553, 553 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy   Kappa        
##   1       0.001  0.25  0.8813367   0.0000000000
##   1       0.001  0.50  0.8813367   0.0000000000
##   1       0.001  1.00  0.8813367   0.0000000000
##   1       0.010  0.25  0.8813367   0.0000000000
##   1       0.010  0.50  0.8769888  -0.0076009501
##   1       0.010  1.00  0.8682932  -0.0222299477
##   1       0.100  0.25  0.8668439  -0.0247534303
##   1       0.100  0.50  0.8668439  -0.0247534303
##   1       0.100  1.00  0.8668439  -0.0247534303
##   2       0.001  0.25  0.8813367   0.0000000000
##   2       0.001  0.50  0.8813367   0.0000000000
##   2       0.001  1.00  0.8813367   0.0000000000
##   2       0.010  0.25  0.8740903  -0.0131422339
##   2       0.010  0.50  0.8682932  -0.0222299477
##   2       0.010  1.00  0.8682932  -0.0078943749
##   2       0.100  0.25  0.8682932   0.0272451662
##   2       0.100  0.50  0.8624961   0.0476570575
##   2       0.100  1.00  0.8625065   0.0432372070
##   3       0.001  0.25  0.8813367   0.0000000000
##   3       0.001  0.50  0.8813367   0.0000000000
##   3       0.001  1.00  0.8813367   0.0000000000
##   3       0.010  0.25  0.8711917  -0.0029696546
##   3       0.010  0.50  0.8726410  -0.0002291422
##   3       0.010  1.00  0.8740799   0.0195439534
##   3       0.100  0.25  0.8538109   0.0168860008
##   3       0.100  0.50  0.8523720   0.0276650622
##   3       0.100  1.00  0.8480242   0.0596418121
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1, scale = 0.001 and C = 0.25.

The model that seems to be the best is still the linear.

mod_svm_poly_default %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  cancer no cancer
##   cancer         0         0
##   no cancer     20       147
##                                           
##                Accuracy : 0.8802          
##                  95% CI : (0.8211, 0.9253)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 0.5592          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8802          
##              Prevalence : 0.1198          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : cancer          
## 

Question 3

The default grid for the polynomial kernel does not do very well. Instead of using the default grid for the hyperparameter values, train your polynomial SVM with the following grid.

svm_poly_grid <- expand.grid(C = c(0.25, 0.5, 1, 2),
                        scale = c(0.5, 1, 2),
                        degree = c(2, 3))

svm_poly_grid

Report the confusion matrix of the final model.

mod_svm_poly_tuned <- train(y ~ . , data = train_data, method = "svmPoly",
                            tuneGrid = svm_poly_grid,
                            trControl = trainControl("cv", number = 5))

mod_svm_poly_tuned 
## Support Vector Machines with Polynomial Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'cancer', 'no cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 552, 553, 553, 553, 553 
## Resampling results across tuning parameters:
## 
##   C     scale  degree  Accuracy   Kappa     
##   0.25  0.5    2       0.8698259  0.13768155
##   0.25  0.5    3       0.8567720  0.17984193
##   0.25  1.0    2       0.8596705  0.11456657
##   0.25  1.0    3       0.8509957  0.13761482
##   0.25  2.0    2       0.8596705  0.12711647
##   0.25  2.0    3       0.8466375  0.16380100
##   0.50  0.5    2       0.8654781  0.11620989
##   0.50  0.5    3       0.8495360  0.14167873
##   0.50  1.0    2       0.8596705  0.14056368
##   0.50  1.0    3       0.8452091  0.13933593
##   0.50  2.0    2       0.8596601  0.12682998
##   0.50  2.0    3       0.8263580  0.10482596
##   1.00  0.5    2       0.8596810  0.10401825
##   1.00  0.5    3       0.8524346  0.12605573
##   1.00  1.0    2       0.8596705  0.14110840
##   1.00  1.0    3       0.8480867  0.14251281
##   1.00  2.0    2       0.8610989  0.14322873
##   1.00  2.0    3       0.8162340  0.07954981
##   2.00  0.5    2       0.8596705  0.14056368
##   2.00  0.5    3       0.8538838  0.15616977
##   2.00  1.0    2       0.8596705  0.14118672
##   2.00  1.0    3       0.8422896  0.11995554
##   2.00  2.0    2       0.8582004  0.15519365
##   2.00  2.0    3       0.8046502  0.03147407
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.5 and C = 0.25.
mod_svm_poly_tuned %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  cancer no cancer
##   cancer         3         4
##   no cancer     17       143
##                                           
##                Accuracy : 0.8743          
##                  95% CI : (0.8142, 0.9204)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 0.649384        
##                                           
##                   Kappa : 0.1707          
##                                           
##  Mcnemar's Test P-Value : 0.008829        
##                                           
##             Sensitivity : 0.15000         
##             Specificity : 0.97279         
##          Pos Pred Value : 0.42857         
##          Neg Pred Value : 0.89375         
##              Prevalence : 0.11976         
##          Detection Rate : 0.01796         
##    Detection Prevalence : 0.04192         
##       Balanced Accuracy : 0.56139         
##                                           
##        'Positive' Class : cancer          
## 

Radial basis kernel

svm_gauss_grid <- expand.grid(sigma = c(1/16, 1/8, 1/4, 1, 2),
                              C = c(0.25, 0.5, 1, 2))

mod_svm_radial <- train(y ~ . , data = train_data, method = "svmRadial",
                tuneGrid = svm_gauss_grid,
                trControl = trainControl("cv", number = 5))

mod_svm_radial
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'cancer', 'no cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 552, 552, 553, 554, 553 
## Resampling results across tuning parameters:
## 
##   sigma   C     Accuracy   Kappa       
##   0.0625  0.25  0.8813447   0.000000000
##   0.0625  0.50  0.8784461  -0.005289256
##   0.0625  1.00  0.8798850   0.032004501
##   0.0625  2.00  0.8784673   0.092890549
##   0.1250  0.25  0.8813447   0.000000000
##   0.1250  0.50  0.8784461  -0.005289256
##   0.1250  1.00  0.8769864   0.009406372
##   0.1250  2.00  0.8741299   0.067803251
##   0.2500  0.25  0.8813447   0.000000000
##   0.2500  0.50  0.8813447   0.000000000
##   0.2500  1.00  0.8769968  -0.007600950
##   0.2500  2.00  0.8726804   0.018569626
##   1.0000  0.25  0.8813447   0.000000000
##   1.0000  0.50  0.8813447   0.000000000
##   1.0000  1.00  0.8813447   0.000000000
##   1.0000  2.00  0.8799058  -0.002755267
##   2.0000  0.25  0.8813447   0.000000000
##   2.0000  0.50  0.8813447   0.000000000
##   2.0000  1.00  0.8813447   0.000000000
##   2.0000  2.00  0.8769967  -0.008307284
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 2 and C = 0.25.
mod_svm_radial %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  cancer no cancer
##   cancer         0         0
##   no cancer     20       147
##                                           
##                Accuracy : 0.8802          
##                  95% CI : (0.8211, 0.9253)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 0.5592          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8802          
##              Prevalence : 0.1198          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : cancer          
## 

Question 4

Note that all these models have high overall accuracy but pretty poor balanced accuracy. Suggest a strategy to improve the balanced accuracy.

Answer The issue here is that our training data is imbalanced. To improve the balanced accuracy, we can create a new training data by keeping all the observations that do not have cancer and duplicating observations that have cancer. Below we do it for radial basis SVM:

cancer_data <- train_data %>%
  filter(y == "cancer") %>%
  sample_n(400, replace = TRUE)

no_cancer_data <- train_data %>%
  filter(y == "no cancer") %>%
  sample_n(400, replace = TRUE)

balanced_train_data <- rbind(cancer_data, no_cancer_data)
rm(cancer_data)
rm(no_cancer_data)

mod_svm_balanced <- train(y ~ . , data = balanced_train_data, 
                        method = "svmPoly",
                        tuneGrid = svm_poly_grid,
                        trControl = trainControl("cv", number = 5))

mod_svm_balanced
## Support Vector Machines with Polynomial Kernel 
## 
## 800 samples
##  18 predictor
##   2 classes: 'cancer', 'no cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 640, 640, 640, 640, 640 
## Resampling results across tuning parameters:
## 
##   C     scale  degree  Accuracy  Kappa 
##   0.25  0.5    2       0.72375   0.4475
##   0.25  0.5    3       0.77875   0.5575
##   0.25  1.0    2       0.76625   0.5325
##   0.25  1.0    3       0.83500   0.6700
##   0.25  2.0    2       0.76750   0.5350
##   0.25  2.0    3       0.84750   0.6950
##   0.50  0.5    2       0.74625   0.4925
##   0.50  0.5    3       0.82125   0.6425
##   0.50  1.0    2       0.76500   0.5300
##   0.50  1.0    3       0.84875   0.6975
##   0.50  2.0    2       0.76000   0.5200
##   0.50  2.0    3       0.84375   0.6875
##   1.00  0.5    2       0.76125   0.5225
##   1.00  0.5    3       0.81250   0.6250
##   1.00  1.0    2       0.77125   0.5425
##   1.00  1.0    3       0.84500   0.6900
##   1.00  2.0    2       0.76000   0.5200
##   1.00  2.0    3       0.86000   0.7200
##   2.00  0.5    2       0.75750   0.5150
##   2.00  0.5    3       0.83250   0.6650
##   2.00  1.0    2       0.76000   0.5200
##   2.00  1.0    3       0.84000   0.6800
##   2.00  2.0    2       0.76125   0.5225
##   2.00  2.0    3       0.86000   0.7200
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 2 and C = 1.
mod_svm_balanced %>% predict(test_data) %>%
  confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  cancer no cancer
##   cancer         7        45
##   no cancer     13       102
##                                           
##                Accuracy : 0.6527          
##                  95% CI : (0.5753, 0.7246)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0259          
##                                           
##  Mcnemar's Test P-Value : 4.691e-05       
##                                           
##             Sensitivity : 0.35000         
##             Specificity : 0.69388         
##          Pos Pred Value : 0.13462         
##          Neg Pred Value : 0.88696         
##              Prevalence : 0.11976         
##          Detection Rate : 0.04192         
##    Detection Prevalence : 0.31138         
##       Balanced Accuracy : 0.52194         
##                                           
##        'Positive' Class : cancer          
## 

Survey

There is a link to a simple survey after lab 6:

Answers

Here are the answers:

Caret models

Here is a list of models available in caret including SVMs with various kernels;