Info about the lab

Learning aim

The aim of this lab is to experiment with the kernels and hyperparameters of support vector machines (SVMs).

You are encouraged to explore plots of support vector machines for better understanding, but visualizing SVMs is not an aim of this lab.

Objectives

By the end of this lab session, students should be able to:

Fit a support vector machine for classification
Choose a suitable kernel for a support vector machine
Tune the hyperparameters of a support vector machine by cross-validation
Impute missing values in data

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type directly into this document; others require modifying R code, which you also do here. In either case, you can discuss your work with the lab instructor.

Data

The dataset for this lab is taken from a Portuguese study of risk factors of cervical cancer:

http://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29

You can get more information on cervical cancer here:

https://www.healthhub.sg/a-z/diseases-and-conditions/93/topic_cervical_cancer

Loading data into R

First we will load the data into R. Note that:

“?” is recognised as a missing value (try removing na = "?" from the code below to see what happens).
We cleaned the variable names with a helper function from the janitor package. It removes spaces and strange characters, and converts everything to lower case.
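To illustrate the first note, here is a minimal sketch on a tiny inline CSV. The text in csv_text and its column names are made up for illustration and are not part of the lab data; I() is used to tell readr that the string is literal data rather than a file path.

```r
library(readr)

# Hypothetical two-row CSV in which "?" marks a missing value
csv_text <- "Age,Smokes (years)\n18,?\n34,2.5\n"

# Without na = "?", the column containing "?" is read as text
as_text <- read_csv(I(csv_text), show_col_types = FALSE)
class(as_text[["Smokes (years)"]])

# With na = "?", the same column is numeric and "?" becomes NA
as_num <- read_csv(I(csv_text), na = "?", show_col_types = FALSE)
class(as_num[["Smokes (years)"]])
as_num
```

A column read as text cannot be used for computing medians later on, which is why the na argument matters here.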

Below are the data dimensions and variables.

library(tidyverse) # for data manipulation and plotting
library(caret) # for machine learning (training, tuning, confusion matrices)
library(janitor) # for cleaning variable names
library(kernlab) # for training SVM

raw_data <- read_csv("risk_factors_cervical_cancer.csv",
                     na = "?") %>%
  clean_names()

cat("Dimensions of the dataset are", dim(raw_data), "\n")

## Dimensions of the dataset are 858 30

cat("Sample of data:\n")

## Sample of data:

head(raw_data)

Response variable

According to the data description, the following variables are target variables:

target_vars <-  c("hinselmann", "schiller", "citology", "biopsy")
target_vars

## [1] "hinselmann" "schiller"   "citology"   "biopsy"

Note that each of them is numeric with values 0 or 1, where 0 represents a negative result of a certain test and 1 a positive result. We will create a new target variable, “y”, whose value is “cancer” if any of these four tests is positive and “no_cancer” if all four tests are negative.

Important remark: while under the hood, an SVM converts a binary variable to a numeric variable with values \(-1\) and \(1\), in order to train an SVM in R, the response variable should be categorical, i.e., of class “factor”. Predictors can be either numeric or categorical, but categorical ones will be automatically converted to numeric with values 0 or 1. Compare it to other classification models:

Logistic regression - response variable and predictors can be either categorical or numeric, but categorical will be automatically converted to numeric variables with values 0 and 1.
KNN - response variable is passed to the model as is, while categorical predictors will be automatically converted to numeric variables with values 0 and 1.
Tree-based methods (decision tree, bagging, random forest, boosting) — there is no conversion of variables. Categorical variables are passed to the model training function as categorical.
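The automatic conversion of categorical predictors to 0/1 columns can be previewed with base R's model.matrix(), which builds the same kind of dummy coding that a formula interface relies on. This is only a sketch on a made-up data frame (toy, smoker and age are hypothetical names).

```r
# Hypothetical data: one categorical and one numeric predictor
toy <- data.frame(
  smoker = factor(c("no", "yes", "no", "yes")),
  age    = c(20, 35, 41, 52)
)

# The factor 'smoker' becomes a single 0/1 column named 'smokeryes';
# the numeric predictor 'age' is passed through unchanged
mm <- model.matrix(~ ., data = toy)
mm
```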

Below we create a response variable as follows: select columns in the raw data representing the target variables and then label the response as “cancer” whenever any of the target variables equals 1 and “no_cancer” if all the target variables are equal to 0.

X <- raw_data %>%
  mutate(
    y = factor(
      if_else(rowSums(across(all_of(target_vars))) > 0, "cancer", "no_cancer"),
      # Level order: 'no_cancer' first, 'cancer' second. Note that caret's
      # confusionMatrix() treats the first level as the 'Positive' class by default:
      levels = c("no_cancer", "cancer")  
    )
  ) %>%
  select(-all_of(target_vars))

head(X)

Note that we also removed the four original target variables once the response y was inserted into the data frame X.

Missing values

Our dataset has missing values. Below we calculate the number of missing values in each column by applying the function “sum(is.na(x))” to each column of the dataset:

### A bit old (summarise_all() is superseded by across())
X %>%
  summarise_all(~sum(is.na(.)))

### More modern but longer:
# X %>%
#   summarise(across(everything(), ~ sum(is.na(.))))

### A purrr version
# X %>%
#   map_df(~ sum(is.na(.)))

There are many strategies for dealing with missing data. The simplest approach is to remove observations that have missing values. A more advanced method is to use statistical learning to predict the most reasonable values from the values of other variables. You are probably familiar with it from the following story on stolen chemistry A-level papers in 2018:

LINK TO CHANNEL NEWS ASIA

Our dataset is not very large. If we simply deleted all observations with missing values, it would become even smaller. Instead of deleting observations with missing values, we will replace each missing value with the median of the observed values of the same variable:

X <- X %>%
  mutate(across(where(is.numeric), ~ replace_na(., median(., na.rm = TRUE))))

head(X)
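To see what the imputation step does, here is a minimal sketch on a hypothetical toy tibble (the columns age, smokes_years and group are made up for illustration). Only numeric columns are touched; missing values in other columns are left as they are.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one missing value per column
toy <- tibble(
  age          = c(20, NA, 40, 30),
  smokes_years = c(0, 0, NA, 5),
  group        = c("a", "b", NA, "d")
)

# Same pattern as above: replace NA by the column median (numeric columns only)
toy_imputed <- toy %>%
  mutate(across(where(is.numeric), ~ replace_na(., median(., na.rm = TRUE))))

toy_imputed
# age:          the NA becomes 30 (median of 20, 40, 30)
# smokes_years: the NA becomes 0  (median of 0, 0, 5)
# group:        the NA is kept, because the column is not numeric
```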

Question 1

Let us now look at a summary of our dataset:

glimpse(X)

## Rows: 858
## Columns: 27
## $ age                                 <dbl> 18, 15, 34, 52, 46, 42, 51, 26, 45…
## $ number_of_sexual_partners           <dbl> 4, 1, 1, 5, 3, 3, 3, 1, 1, 3, 3, 1…
## $ first_sexual_intercourse            <dbl> 15, 14, 17, 16, 21, 23, 17, 26, 20…
## $ num_of_pregnancies                  <dbl> 1, 1, 1, 4, 4, 2, 6, 3, 5, 2, 4, 3…
## $ smokes_years                        <dbl> 0.000000, 0.000000, 0.000000, 37.0…
## $ smokes_packs_year                   <dbl> 0.0, 0.0, 0.0, 37.0, 0.0, 0.0, 3.4…
## $ hormonal_contraceptives_years       <dbl> 0.00, 0.00, 0.00, 3.00, 15.00, 0.0…
## $ iud_years                           <dbl> 0, 0, 0, 0, 0, 0, 7, 7, 0, 0, 0, 0…
## $ st_ds_number                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_condylomatosis                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_cervical_condylomatosis       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_vaginal_condylomatosis        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_vulvo_perineal_condylomatosis <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_syphilis                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_pelvic_inflammatory_disease   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_genital_herpes                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_molluscum_contagiosum         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_aids                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_hiv                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_hepatitis_b                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_hpv                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ st_ds_number_of_diagnosis           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ dx_cancer                           <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0…
## $ dx_cin                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ dx_hpv                              <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0…
## $ dx                                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ y                                   <fct> no_cancer, no_cancer, no_cancer, n…

Note that many of these variables are almost constant: they are zero except for a few observations. This is a problem for cross-validation because there is a high chance that some fold will contain none of the few non-zero observations, which makes the variable constant in that fold.

There is a helper function in caret for detecting such variables. Here are the names of the variables for which the ratio of the frequency of the most common value to the frequency of the second most common value exceeds 99 to 1:

X %>% nearZeroVar(freqCut = 99/1, names = TRUE)

## [1] "st_ds_cervical_condylomatosis"     "st_ds_vaginal_condylomatosis"     
## [3] "st_ds_pelvic_inflammatory_disease" "st_ds_genital_herpes"             
## [5] "st_ds_molluscum_contagiosum"       "st_ds_aids"                       
## [7] "st_ds_hepatitis_b"                 "st_ds_hpv"
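The frequency ratio used by nearZeroVar() can be reproduced by hand. The sketch below uses a hypothetical data frame (toy, with columns rare and common) and a small helper freq_ratio() written only for this illustration; it is not a caret function.

```r
library(caret)

# Hypothetical data: 'rare' is almost always 0, 'common' is evenly split
toy <- data.frame(
  rare   = c(rep(0, 199), 1),
  common = rep(c(0, 1), 100)
)

# Illustrative helper: count of the most common value divided by
# the count of the second most common value
freq_ratio <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  as.numeric(tab[1]) / as.numeric(tab[2])
}

freq_ratio(toy$rare)    # 199 zeros against a single 1
freq_ratio(toy$common)  # 100 against 100

# Only 'rare' has a ratio above the 99-to-1 cutoff
flagged <- nearZeroVar(toy, freqCut = 99/1, names = TRUE)
flagged
```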

Remove these variables from the data. Dimensions of the new data should be 858 by 19.

# Modify the code below

vars_with_little_variability <- X %>% 
  nearZeroVar(freqCut = 99/1, names = TRUE)

X <- X %>%
  select(-all_of(vars_with_little_variability))

dim(X)

## [1] 858  19

Data visualization

Our dataset is multidimensional. To get some impression of what it looks like, here is a scatterplot that only uses two variables.

ggplot(data = X, 
       aes(x = age, y = num_of_pregnancies, colour = y)) +
  geom_point() +
  theme_minimal()

Training a support vector machine

Training and test sets

We will now split our data into training and test sets:

set.seed(42)
p <- 0.8
ind <- which(runif(nrow(X)) < p)
train_data <- X %>% slice(ind)
test_data <- X %>% slice(-ind)
cat("Dimensions of the training data are", dim(train_data), "\n")

## Dimensions of the training data are 691 19

cat("Dimensions of the test data are", dim(test_data), "\n")

## Dimensions of the test data are 167 19
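The random split above does not control how many “cancer” cases end up in each set. As an alternative sketch (not the method used in this lab), caret's createDataPartition() samples within each class, so the class proportions are roughly preserved. The vector y_toy below is hypothetical.

```r
library(caret)

set.seed(42)

# Hypothetical imbalanced response: 90 'no_cancer' and 10 'cancer'
y_toy <- factor(rep(c("no_cancer", "cancer"), times = c(90, 10)),
                levels = c("no_cancer", "cancer"))

# Indices of a stratified 80% training sample
train_idx <- createDataPartition(y_toy, p = 0.8, list = FALSE)[, 1]

table(y_toy[train_idx])   # class counts in the training part
table(y_toy[-train_idx])  # class counts in the test part
```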

Linear kernel

First, we will try a linear kernel. Below is the summary of our first trained SVM.

mod_svm_linear <- train(
  y ~ . , train_data, method = 'svmLinear', 
  trControl = trainControl("none")
)
mod_svm_linear

## Support Vector Machines with Linear Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'no_cancer', 'cancer' 
## 
## No pre-processing
## Resampling: None

And here is the resulting confusion matrix (on the test set):

mod_svm_linear %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  no_cancer cancer
##   no_cancer       146     18
##   cancer            1      2
##                                          
##                Accuracy : 0.8862         
##                  95% CI : (0.828, 0.9301)
##     No Information Rate : 0.8802         
##     P-Value [Acc > NIR] : 0.4645336      
##                                          
##                   Kappa : 0.1473         
##                                          
##  Mcnemar's Test P-Value : 0.0002419      
##                                          
##             Sensitivity : 0.9932         
##             Specificity : 0.1000         
##          Pos Pred Value : 0.8902         
##          Neg Pred Value : 0.6667         
##              Prevalence : 0.8802         
##          Detection Rate : 0.8743         
##    Detection Prevalence : 0.9820         
##       Balanced Accuracy : 0.5466         
##                                          
##        'Positive' Class : no_cancer      
##

Question 2

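Notice that the overall accuracy is quite decent. However, the balanced accuracy is low. Why is that?

This happens because the classes are imbalanced: about 88% of the test observations are “no_cancer”, so a model that predicts “no_cancer” for almost everyone still achieves high accuracy. Balanced accuracy is the average of the per-class rates (sensitivity and specificity), and here only 2 of the 20 cancer cases in the test set are detected.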
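The numbers in the confusion matrix above can be recomputed by hand. The sketch below copies the four counts from the output and follows the same convention, with “no_cancer” as the 'Positive' class.

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: reference)
cm <- matrix(c(146, 1, 18, 2), nrow = 2,
             dimnames = list(Prediction = c("no_cancer", "cancer"),
                             Reference  = c("no_cancer", "cancer")))

accuracy    <- sum(diag(cm)) / sum(cm)                              # 148 / 167
sensitivity <- cm["no_cancer", "no_cancer"] / sum(cm[, "no_cancer"])  # 146 / 147
specificity <- cm["cancer", "cancer"] / sum(cm[, "cancer"])           # 2 / 20
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 4)
# accuracy 0.8862, sensitivity 0.9932, specificity 0.1000, balanced accuracy 0.5466
```

Accuracy is dominated by the 147 “no_cancer” cases, while balanced accuracy gives the 20 “cancer” cases the same weight as the majority class.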

Tuning the hyperparameter

A linear SVM has a cost hyperparameter \(C\) that should be chosen by cross-validation. When training the linear SVM classifier above, we did not specify the value of \(C\), which means that the default value was used. We can extract it from the model as follows:

mod_svm_linear$bestTune
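For reference, the role of \(C\) can be seen in the standard soft-margin formulation of a linear SVM. This formulation is not stated in the lab itself; it is the usual textbook one, with responses coded as \(y_i \in \{-1, 1\}\), weight vector \(w\), intercept \(b\) and slack variables \(\xi_i\):

\[
\min_{w,\, b,\, \xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n.
\]

A large \(C\) penalizes margin violations heavily and tends to give a narrower margin; a small \(C\) tolerates more violations and gives a wider margin.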

Let us now apply 5-fold cross-validation to select the best value of \(C\) on a certain grid:

svm_linear_grid <- expand.grid(C = 2^(-4:1))

mod_svm_linear_tuned <- train(y ~ . , data = train_data, method = "svmLinear",
                tuneGrid = svm_linear_grid,
                trControl = trainControl("cv", number = 5))

mod_svm_linear_tuned

## Support Vector Machines with Linear Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'no_cancer', 'cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 553, 553, 553, 552, 553 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa       
##   0.0625  0.8798874  -0.002765774
##   0.1250  0.8784485   0.011496514
##   0.2500  0.8784485   0.011496514
##   0.5000  0.8755500   0.106363014
##   1.0000  0.8741007   0.103094134
##   2.0000  0.8741007   0.103094134
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.0625.

And here is the diagnostic plot of cross-validation accuracy vs the hyperparameter value (since this is a simple diagnostic plot, we won’t bother doing it in tidyverse):

plot(mod_svm_linear_tuned)
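In the table above, the value of \(C\) with the highest accuracy has a Kappa close to zero. As a sketch of one possible alternative (not what this lab does), train() can select the hyperparameter by Kappa instead of accuracy through its metric argument. The data frame toy below is simulated purely for illustration.

```r
library(caret)
library(kernlab)

set.seed(1)

# Hypothetical imbalanced two-class data with two numeric predictors
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(
  ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 1.5, "pos", "neg"),
  levels = c("neg", "pos")
)

# Same kind of grid as above, but the winner is chosen by Kappa
fit_kappa <- train(y ~ ., data = toy, method = "svmLinear",
                   metric = "Kappa",
                   tuneGrid = expand.grid(C = 2^(-4:1)),
                   trControl = trainControl("cv", number = 5))

fit_kappa$bestTune
```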

Now let us find the test confusion matrix:

mod_svm_linear_tuned %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  no_cancer cancer
##   no_cancer       147     20
##   cancer            0      0
##                                           
##                Accuracy : 0.8802          
##                  95% CI : (0.8211, 0.9253)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 0.5592          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8802          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8802          
##          Detection Rate : 0.8802          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : no_cancer       
##

SVM with a polynomial kernel

Here we will train an SVM with a polynomial kernel using caret's default grid.

mod_svm_poly_default <- train(y ~ . , data = train_data, method = "svmPoly",
                            trControl = trainControl("cv", number = 5))

mod_svm_poly_default

## Support Vector Machines with Polynomial Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'no_cancer', 'cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 553, 552, 553, 553, 553 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy   Kappa        
##   1       0.001  0.25  0.8813367   0.0000000000
##   1       0.001  0.50  0.8813367   0.0000000000
##   1       0.001  1.00  0.8813367   0.0000000000
##   1       0.010  0.25  0.8813367   0.0000000000
##   1       0.010  0.50  0.8769888  -0.0076009501
##   1       0.010  1.00  0.8682932  -0.0222299477
##   1       0.100  0.25  0.8668439  -0.0247534303
##   1       0.100  0.50  0.8668439  -0.0247534303
##   1       0.100  1.00  0.8668439  -0.0247534303
##   2       0.001  0.25  0.8813367   0.0000000000
##   2       0.001  0.50  0.8813367   0.0000000000
##   2       0.001  1.00  0.8813367   0.0000000000
##   2       0.010  0.25  0.8740903  -0.0131422339
##   2       0.010  0.50  0.8682932  -0.0222299477
##   2       0.010  1.00  0.8682932  -0.0078943749
##   2       0.100  0.25  0.8682932   0.0272451662
##   2       0.100  0.50  0.8624961   0.0476570575
##   2       0.100  1.00  0.8625065   0.0432372070
##   3       0.001  0.25  0.8813367   0.0000000000
##   3       0.001  0.50  0.8813367   0.0000000000
##   3       0.001  1.00  0.8813367   0.0000000000
##   3       0.010  0.25  0.8711917  -0.0029696546
##   3       0.010  0.50  0.8726410  -0.0002291422
##   3       0.010  1.00  0.8740799   0.0195439534
##   3       0.100  0.25  0.8538109   0.0168860008
##   3       0.100  0.50  0.8523720   0.0276650622
##   3       0.100  1.00  0.8480242   0.0596418121
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1, scale = 0.001 and C = 0.25.

The selected model has degree = 1, so the best model on this grid is still a linear one.

mod_svm_poly_default %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  no_cancer cancer
##   no_cancer       147     20
##   cancer            0      0
##                                           
##                Accuracy : 0.8802          
##                  95% CI : (0.8211, 0.9253)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 0.5592          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8802          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8802          
##          Detection Rate : 0.8802          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : no_cancer       
##
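For reference, the hyperparameters degree and scale enter kernlab's polynomial kernel as \((\text{scale} \cdot \langle x, x' \rangle + \text{offset})^{\text{degree}}\). The sketch below checks this on two made-up vectors, using offset = 1 (an assumption about what caret passes for "svmPoly"; the lab does not state it).

```r
library(kernlab)

# Two hypothetical observations
x <- c(1, 2)
z <- c(3, 4)

# Polynomial kernel with degree 2, scale 0.5 and offset 1
k_poly <- polydot(degree = 2, scale = 0.5, offset = 1)

kernel_value <- as.numeric(k_poly(x, z))
manual_value <- (0.5 * sum(x * z) + 1)^2   # (0.5 * 11 + 1)^2 = 42.25

c(kernel_value = kernel_value, manual_value = manual_value)
```

With degree = 1 the kernel is a rescaled and shifted inner product, which is why a degree-1 polynomial SVM behaves like a linear one.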

Question 3

The default grid for the polynomial kernel does not do very well. Instead of using the default grid for the hyperparameter values, train your polynomial SVM with the following grid.

svm_poly_grid <- expand.grid(C = c(0.25, 0.5, 1, 2),
                        scale = c(0.5, 1, 2),
                        degree = c(2, 3))

svm_poly_grid

Report the confusion matrix of the final model.

mod_svm_poly_tuned <- train(y ~ . , data = train_data, method = "svmPoly",
                            tuneGrid = svm_poly_grid,
                            trControl = trainControl("cv", number = 5))

mod_svm_poly_tuned

## Support Vector Machines with Polynomial Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'no_cancer', 'cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 552, 553, 553, 553, 553 
## Resampling results across tuning parameters:
## 
##   C     scale  degree  Accuracy   Kappa     
##   0.25  0.5    2       0.8698259  0.13768155
##   0.25  0.5    3       0.8553227  0.16245409
##   0.25  1.0    2       0.8596705  0.11456657
##   0.25  1.0    3       0.8509957  0.13761482
##   0.25  2.0    2       0.8596705  0.12711647
##   0.25  2.0    3       0.8466375  0.16380100
##   0.50  0.5    2       0.8654781  0.11620989
##   0.50  0.5    3       0.8495360  0.14167873
##   0.50  1.0    2       0.8596705  0.14056368
##   0.50  1.0    3       0.8452091  0.13933593
##   0.50  2.0    2       0.8596601  0.12682998
##   0.50  2.0    3       0.8263580  0.10482596
##   1.00  0.5    2       0.8596810  0.10401825
##   1.00  0.5    3       0.8524346  0.12605573
##   1.00  1.0    2       0.8596705  0.14110840
##   1.00  1.0    3       0.8480867  0.14251281
##   1.00  2.0    2       0.8610989  0.14322873
##   1.00  2.0    3       0.8162340  0.07954981
##   2.00  0.5    2       0.8596705  0.14056368
##   2.00  0.5    3       0.8538838  0.15616977
##   2.00  1.0    2       0.8596705  0.14118672
##   2.00  1.0    3       0.8422896  0.11995554
##   2.00  2.0    2       0.8582004  0.15519365
##   2.00  2.0    3       0.8046502  0.03147407
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.5 and C = 0.25.

mod_svm_poly_tuned %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  no_cancer cancer
##   no_cancer       143     17
##   cancer            4      3
##                                           
##                Accuracy : 0.8743          
##                  95% CI : (0.8142, 0.9204)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 0.649384        
##                                           
##                   Kappa : 0.1707          
##                                           
##  Mcnemar's Test P-Value : 0.008829        
##                                           
##             Sensitivity : 0.9728          
##             Specificity : 0.1500          
##          Pos Pred Value : 0.8938          
##          Neg Pred Value : 0.4286          
##              Prevalence : 0.8802          
##          Detection Rate : 0.8563          
##    Detection Prevalence : 0.9581          
##       Balanced Accuracy : 0.5614          
##                                           
##        'Positive' Class : no_cancer       
##

Radial basis kernel

svm_gauss_grid <- expand.grid(sigma = c(1/16, 1/8, 1/4, 1, 2),
                              C = c(0.25, 0.5, 1, 2))

mod_svm_radial <- train(y ~ . , data = train_data, method = "svmRadial",
                tuneGrid = svm_gauss_grid,
                trControl = trainControl("cv", number = 5))

mod_svm_radial

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 691 samples
##  18 predictor
##   2 classes: 'no_cancer', 'cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 552, 552, 553, 554, 553 
## Resampling results across tuning parameters:
## 
##   sigma   C     Accuracy   Kappa       
##   0.0625  0.25  0.8813447   0.000000000
##   0.0625  0.50  0.8784461  -0.005289256
##   0.0625  1.00  0.8798850   0.032004501
##   0.0625  2.00  0.8784673   0.092890549
##   0.1250  0.25  0.8813447   0.000000000
##   0.1250  0.50  0.8784461  -0.005289256
##   0.1250  1.00  0.8769864   0.009406372
##   0.1250  2.00  0.8741299   0.067803251
##   0.2500  0.25  0.8813447   0.000000000
##   0.2500  0.50  0.8813447   0.000000000
##   0.2500  1.00  0.8769968  -0.007600950
##   0.2500  2.00  0.8726804   0.018569626
##   1.0000  0.25  0.8813447   0.000000000
##   1.0000  0.50  0.8813447   0.000000000
##   1.0000  1.00  0.8813447   0.000000000
##   1.0000  2.00  0.8799058  -0.002755267
##   2.0000  0.25  0.8813447   0.000000000
##   2.0000  0.50  0.8813447   0.000000000
##   2.0000  1.00  0.8813447   0.000000000
##   2.0000  2.00  0.8769967  -0.008307284
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 2 and C = 0.25.

mod_svm_radial %>%
  predict(test_data) %>%
  confusionMatrix(test_data$y)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  no_cancer cancer
##   no_cancer       147     20
##   cancer            0      0
##                                           
##                Accuracy : 0.8802          
##                  95% CI : (0.8211, 0.9253)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 0.5592          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8802          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8802          
##          Detection Rate : 0.8802          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : no_cancer       
##
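For reference, the hyperparameter sigma enters kernlab's radial basis kernel as \(\exp(-\sigma \lVert x - x' \rVert^2)\), so a larger sigma makes the kernel decay faster with distance. The sketch below checks this on two made-up vectors.

```r
library(kernlab)

# Two hypothetical observations; the squared distance between them is 1 + 4 = 5
x <- c(1, 2)
z <- c(2, 4)

k_rbf <- rbfdot(sigma = 2)

kernel_value <- as.numeric(k_rbf(x, z))
manual_value <- exp(-2 * sum((x - z)^2))   # exp(-10)

c(kernel_value = kernel_value, manual_value = manual_value)
```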

Question 4

Note that all these models have high overall accuracy but pretty poor balanced accuracy. Suggest a strategy to improve the balanced accuracy.

Answer: The issue here is that our training data is imbalanced. To improve the balanced accuracy, we can create a new training set by resampling with replacement so that the classes are balanced. There are libraries for resampling in R, but here we will do it in the simplest way possible just to demonstrate the idea.

Below we do it for an SVM with a polynomial kernel:

cancer_data <- train_data %>%
  filter(y == "cancer") %>%
  slice_sample(n = 400, replace = TRUE)

no_cancer_data <- train_data %>%
  filter(y == "no_cancer") %>%
  slice_sample(n = 400, replace = TRUE)

balanced_train_data <- bind_rows(cancer_data, 
                                 no_cancer_data)

mod_svm_balanced <- train(y ~ . , data = balanced_train_data, 
                        method = "svmPoly",
                        tuneGrid = svm_poly_grid,
                        trControl = trainControl("cv", number = 5))

mod_svm_balanced

## Support Vector Machines with Polynomial Kernel 
## 
## 800 samples
##  18 predictor
##   2 classes: 'no_cancer', 'cancer' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 640, 640, 640, 640, 640 
## Resampling results across tuning parameters:
## 
##   C     scale  degree  Accuracy  Kappa 
##   0.25  0.5    2       0.72375   0.4475
##   0.25  0.5    3       0.77875   0.5575
##   0.25  1.0    2       0.76625   0.5325
##   0.25  1.0    3       0.83500   0.6700
##   0.25  2.0    2       0.76750   0.5350
##   0.25  2.0    3       0.84750   0.6950
##   0.50  0.5    2       0.74625   0.4925
##   0.50  0.5    3       0.82125   0.6425
##   0.50  1.0    2       0.76500   0.5300
##   0.50  1.0    3       0.84875   0.6975
##   0.50  2.0    2       0.76000   0.5200
##   0.50  2.0    3       0.84375   0.6875
##   1.00  0.5    2       0.76125   0.5225
##   1.00  0.5    3       0.81250   0.6250
##   1.00  1.0    2       0.77125   0.5425
##   1.00  1.0    3       0.84500   0.6900
##   1.00  2.0    2       0.76250   0.5250
##   1.00  2.0    3       0.86000   0.7200
##   2.00  0.5    2       0.75750   0.5150
##   2.00  0.5    3       0.83250   0.6650
##   2.00  1.0    2       0.76000   0.5200
##   2.00  1.0    3       0.84000   0.6800
##   2.00  2.0    2       0.76125   0.5225
##   2.00  2.0    3       0.86000   0.7200
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 2 and C = 1.

mod_svm_balanced %>% 
  predict(test_data) %>%
  confusionMatrix(test_data$y)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  no_cancer cancer
##   no_cancer       102     13
##   cancer           45      7
##                                           
##                Accuracy : 0.6527          
##                  95% CI : (0.5753, 0.7246)
##     No Information Rate : 0.8802          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0259          
##                                           
##  Mcnemar's Test P-Value : 4.691e-05       
##                                           
##             Sensitivity : 0.6939          
##             Specificity : 0.3500          
##          Pos Pred Value : 0.8870          
##          Neg Pred Value : 0.1346          
##              Prevalence : 0.8802          
##          Detection Rate : 0.6108          
##    Detection Prevalence : 0.6886          
##       Balanced Accuracy : 0.5219          
##                                           
##        'Positive' Class : no_cancer       
##
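Notice that the cross-validation accuracy on the balanced data (about 0.86) is much higher than the test accuracy (about 0.65). One plausible reason is that resampling with replacement before cross-validation puts copies of the same observation into both the training folds and the held-out fold. As a sketch of an alternative (not the approach used in this lab), caret can do the up-sampling inside each resampling iteration through the sampling argument of trainControl(). The data frame toy below is simulated purely for illustration.

```r
library(caret)
library(kernlab)

set.seed(1)

# Hypothetical imbalanced two-class data with two numeric predictors
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(
  ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 1.5, "pos", "neg"),
  levels = c("neg", "pos")
)

# Up-sample the minority class within each training fold only,
# so the held-out fold keeps its original, unduplicated observations
ctrl_up <- trainControl(method = "cv", number = 5, sampling = "up")

fit_up <- train(y ~ ., data = toy, method = "svmLinear", trControl = ctrl_up)

# Confusion matrix on the toy data, with the minority class as 'Positive'
pred_up <- predict(fit_up, toy)
confusionMatrix(pred_up, toy$y, positive = "pos")
```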

Survey

There is a link to a simple survey after lab 6:

https://forms.gle/qRHDpyh3JPFsUuXo7

Answers

Here are the answers:

https://rpubs.com/fduzhin/mh4510-lab-6-answers

Caret models

Here is a list of models available in caret, including SVMs with various kernels:

https://rdrr.io/cran/caret/man/models.html

Lab 6 — support vector machine