The aim of this lab is to experiment with kernels and hyperparameters of support vector machine.
You are encouraged to explore plots of support vector machines for better understanding, but visualizing SVM is not an aim of this lab.
By the end of this lab session, students should be able to
Fit a support vector machine for classification
Choose a suitable kernel for support vector machine
Tune parameters of support vector machine by cross-validation
Impute missing values into data
Please run the R chunks one by one, look at the output and make sure that you understand how it is produced. There will be questions that either require a short answer - then you type your answer right in this document - or modifying R codes - then you modify the R codes here. In either case, you can discuss your work with the lab instructor.
The dataset for this lab is taken from a Portugese study of risk factors of cervical cancer:
http://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29
You can get more information on cervical cancer here:
https://www.healthhub.sg/a-z/diseases-and-conditions/93/topic_cervical_cancer
First we will load the data into R. Below are the data dimensions and variables.
library(tidyverse) # for manipulation with data
library(caret) # for machine learning, including KNN
library(kernlab) # for training decision trees
raw_data <- read.csv("risk_factors_cervical_cancer.csv",
na.strings = c("NA", "?"),
stringsAsFactors = FALSE) %>%
as_tibble()
cat("Dimensions of the dataset are", dim(raw_data), "\n")
## Dimensions of the dataset are 858 30
cat("Sample of data:\n")
## Sample of data:
head(raw_data)
According to the data description, the following variables are target variables:
target_vars <- c("Hinselmann", "Schiller", "Citology", "Biopsy")
target_vars
## [1] "Hinselmann" "Schiller" "Citology" "Biopsy"
Note that each of them is numeric and its value is 0 or 1, with 0 representing a negative result of a certain test and 1 positive result. We will create a new target variable, “y”, whose value is “cancer” if any of these four tests is positive and “no_cancer” if all these four tests are negative.
Important remark: while under the hood, an SVM converts a binary variable to a numeric variable with values \(-1\) and \(1\), in order to train an SVM in R, the response variable should be categorical, i.e., of class “factor”. Predictors can be either numeric or categorical, but categorical ones will be automatically converted to numeric with values 0 or 1. Compare it to other classification models:
Logistic regression - response variable and predictors can be either categorical or numeric, but categorical will be automatically converted to numeric variables with values 0 and 1.
KNN - response variable is passed to the model as is and while predictors will be automatically converted to numeric variables with values 0 and 1.
Tree-based methods (decision tree, bagging, random forest, boosting) — there is no conversion of variables. Categorical variables are passed to the model training function as categorical.
Below we create a response variable as follows: select columns in the raw data representing the target variables and then label the response as “cancer” whenever any of the targel variables equals 1 and “no_cancer” if all the target variables are equal to 0.
response <- raw_data %>%
select(all_of(target_vars)) %>%
apply(1, any) %>%
ifelse("cancer", "no cancer") %>%
as.factor
X <- raw_data %>%
select(-all_of(target_vars)) %>%
mutate(y = response)
rm(response)
head(X)
# X$y <- apply(X[ , target_vars],
# 1, sum)
#
# X <- X[ , !(names(X) %in% target_vars)]
#
# X$y <- as.factor(X$y > 0)
# X$y <- recode(X$y, 'TRUE' = 'cancer', 'FALSE' = 'no_cancer')
# table(X$y)
Note that we also removed the variable response
once it was inserted into the data frame X
.
Our dataset has missing values. Below we calculate the number of missing values in each column by applying the function “sum(is.na(x))” to each column of the dataset:
sapply(X, function(x) sum(is.na(x)))
## Age Number.of.sexual.partners
## 0 26
## First.sexual.intercourse Num.of.pregnancies
## 7 56
## Smokes..years. Smokes..packs.year.
## 13 13
## Hormonal.Contraceptives..years. IUD..years.
## 108 117
## STDs..number. STDs.condylomatosis
## 105 105
## STDs.cervical.condylomatosis STDs.vaginal.condylomatosis
## 105 105
## STDs.vulvo.perineal.condylomatosis STDs.syphilis
## 105 105
## STDs.pelvic.inflammatory.disease STDs.genital.herpes
## 105 105
## STDs.molluscum.contagiosum STDs.AIDS
## 105 105
## STDs.HIV STDs.Hepatitis.B
## 105 105
## STDs.HPV STDs..Number.of.diagnosis
## 105 0
## Dx.Cancer Dx.CIN
## 0 0
## Dx.HPV Dx
## 0 0
## y
## 0
There are many strategies for dealing with missing data. The simplest approach is removing observations that has missing values. The most advanced method is using statistical learning to predict most reasonable values from values of other variables. You are probably familiar with it from the following story on stolen chemistry A-level papers in 2018:
Our dataset is not very large. If we simply deleted all missing values, it would reduce the size of the data. Instead of deleting observations with missing values, we will just replace missing values with the median of existing values:
replace_na_with_median <- function(x) {
# This function does not change non-numeric vectors
# And if x is numeric, it returns a copy of x with NA replaced with median(x)
if (is.numeric(x)) {
result <- x
result[is.na(x)] <- median(x, na.rm = TRUE)
} else {
result <- x
}
result
}
X <- X %>%
mutate_all(~replace_na_with_median(.))
head(X)
Let us look at the summary of our dataset now
summary(X)
## Age Number.of.sexual.partners First.sexual.intercourse
## Min. :13.00 Min. : 1.000 Min. :10
## 1st Qu.:20.00 1st Qu.: 2.000 1st Qu.:15
## Median :25.00 Median : 2.000 Median :17
## Mean :26.82 Mean : 2.512 Mean :17
## 3rd Qu.:32.00 3rd Qu.: 3.000 3rd Qu.:18
## Max. :84.00 Max. :28.000 Max. :32
## Num.of.pregnancies Smokes..years. Smokes..packs.year.
## Min. : 0.000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 2.000 Median : 0.000 Median : 0.0000
## Mean : 2.258 Mean : 1.201 Mean : 0.4463
## 3rd Qu.: 3.000 3rd Qu.: 0.000 3rd Qu.: 0.0000
## Max. :11.000 Max. :37.000 Max. :37.0000
## Hormonal.Contraceptives..years. IUD..years. STDs..number.
## Min. : 0.000 Min. : 0.0000 Min. :0.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:0.000
## Median : 0.500 Median : 0.0000 Median :0.000
## Mean : 2.035 Mean : 0.4446 Mean :0.155
## 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.:0.000
## Max. :30.000 Max. :19.0000 Max. :4.000
## STDs.condylomatosis STDs.cervical.condylomatosis STDs.vaginal.condylomatosis
## Min. :0.00000 Min. :0 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0 1st Qu.:0.000000
## Median :0.00000 Median :0 Median :0.000000
## Mean :0.05128 Mean :0 Mean :0.004662
## 3rd Qu.:0.00000 3rd Qu.:0 3rd Qu.:0.000000
## Max. :1.00000 Max. :0 Max. :1.000000
## STDs.vulvo.perineal.condylomatosis STDs.syphilis
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.05012 Mean :0.02098
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## STDs.pelvic.inflammatory.disease STDs.genital.herpes
## Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000
## Mean :0.001166 Mean :0.001166
## 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000
## STDs.molluscum.contagiosum STDs.AIDS STDs.HIV STDs.Hepatitis.B
## Min. :0.000000 Min. :0 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0 Median :0.00000 Median :0.000000
## Mean :0.001166 Mean :0 Mean :0.02098 Mean :0.001166
## 3rd Qu.:0.000000 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :0 Max. :1.00000 Max. :1.000000
## STDs.HPV STDs..Number.of.diagnosis Dx.Cancer
## Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.002331 Mean :0.08741 Mean :0.02098
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :3.00000 Max. :1.00000
## Dx.CIN Dx.HPV Dx y
## Min. :0.00000 Min. :0.00000 Min. :0.00000 cancer :102
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 no cancer:756
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.01049 Mean :0.02098 Mean :0.02797
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
Note that a lot of these variables are almost constant zeroes - there are just a few non-zero observations. This will be a problem for cross-validation because there will be a high chance the few positive observations will not be included into some stage of cross-validation.
Remove numeric variables whose mean value is below \(0.01\). Dimensions of the new data should be 858 by 19.
# Modify the code below
mean_values <- X %>%
summarise_if(is.numeric, mean)
vars_with_little_variability <-
names(mean_values)[which(mean_values <= 0.01)]
vars_with_little_variability
## [1] "STDs.cervical.condylomatosis" "STDs.vaginal.condylomatosis"
## [3] "STDs.pelvic.inflammatory.disease" "STDs.genital.herpes"
## [5] "STDs.molluscum.contagiosum" "STDs.AIDS"
## [7] "STDs.Hepatitis.B" "STDs.HPV"
X <- X %>%
select(-all_of(vars_with_little_variability))
dim(X)
## [1] 858 19
Our dataset is multidimensional. To get some impression of what it looks like, here is a scatterplot that only uses two variables.
ggplot(data = X, aes(x = Age, y = Num.of.pregnancies, colour = y)) +
geom_point()
We will now split our data into training and test sets
set.seed(42)
p <- 0.8
ind <- which(runif(nrow(X)) < p)
train_data <- X %>% slice(ind)
test_data <- X %>% slice(-ind)
cat("Dimensions of the training data are", dim(train_data), "\n")
## Dimensions of the training data are 691 19
cat("Dimensions of the validation data are", dim(test_data), "\n")
## Dimensions of the validation data are 167 19
First, we will try a linear kernel. Below is the summary of our first trained SVM.
mod_svm_linear <- train(
y ~ . , train_data, method = 'svmLinear',
trControl = trainControl("none")
)
mod_svm_linear
## Support Vector Machines with Linear Kernel
##
## 691 samples
## 18 predictor
## 2 classes: 'cancer', 'no cancer'
##
## No pre-processing
## Resampling: None
And here is the resulting confusion matrix (on the training set)
mod_svm_linear %>%
predict(test_data) %>%
confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cancer no cancer
## cancer 2 1
## no cancer 18 146
##
## Accuracy : 0.8862
## 95% CI : (0.828, 0.9301)
## No Information Rate : 0.8802
## P-Value [Acc > NIR] : 0.4645336
##
## Kappa : 0.1473
##
## Mcnemar's Test P-Value : 0.0002419
##
## Sensitivity : 0.10000
## Specificity : 0.99320
## Pos Pred Value : 0.66667
## Neg Pred Value : 0.89024
## Prevalence : 0.11976
## Detection Rate : 0.01198
## Detection Prevalence : 0.01796
## Balanced Accuracy : 0.54660
##
## 'Positive' Class : cancer
##
Notice that the overall accuracy is quite decent. However, the balanced accuracy is low. Why is that?
This happened because the values of the response variable are imbalanced and hence by just predicting that no one has cancer we can achieve a pretty good accuracy.
The linear kernel has a hyperparameter \(C\) that should be chosen by cross-validation. When training a linear SVM classifier, we did not specify the value of \(C\), which means that the default value was chosen. We can extract it from the model as follows:
mod_svm_linear$bestTune
Let us now apply a 5-fold cross-validation to select the best value of \(C\) on a certain grid:
svm_linear_grid <- expand.grid(C = 2^(-4:1))
mod_svm_linear_tuned <- train(y ~ . , data = train_data, method = "svmLinear",
tuneGrid = svm_linear_grid,
trControl = trainControl("cv", number = 5))
mod_svm_linear_tuned
## Support Vector Machines with Linear Kernel
##
## 691 samples
## 18 predictor
## 2 classes: 'cancer', 'no cancer'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 553, 553, 553, 552, 553
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.0625 0.8798874 -0.002765774
## 0.1250 0.8784485 0.011496514
## 0.2500 0.8784485 0.011496514
## 0.5000 0.8755500 0.106363014
## 1.0000 0.8741007 0.103094134
## 2.0000 0.8741007 0.103094134
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.0625.
And here is the plot of cross-validation accuracy vs the hyperparameter value:
plot(mod_svm_linear_tuned)
Now let us find the test confusion matrix:
mod_svm_linear_tuned %>%
predict(test_data) %>%
confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cancer no cancer
## cancer 0 0
## no cancer 20 147
##
## Accuracy : 0.8802
## 95% CI : (0.8211, 0.9253)
## No Information Rate : 0.8802
## P-Value [Acc > NIR] : 0.5592
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 2.152e-05
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8802
## Prevalence : 0.1198
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : cancer
##
Here we will train a polynomial kernel with a default grid.
mod_svm_poly_default <- train(y ~ . , data = train_data, method = "svmPoly",
trControl = trainControl("cv", number = 5))
mod_svm_poly_default
## Support Vector Machines with Polynomial Kernel
##
## 691 samples
## 18 predictor
## 2 classes: 'cancer', 'no cancer'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 553, 552, 553, 553, 553
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.8813367 0.0000000000
## 1 0.001 0.50 0.8813367 0.0000000000
## 1 0.001 1.00 0.8813367 0.0000000000
## 1 0.010 0.25 0.8813367 0.0000000000
## 1 0.010 0.50 0.8769888 -0.0076009501
## 1 0.010 1.00 0.8682932 -0.0222299477
## 1 0.100 0.25 0.8668439 -0.0247534303
## 1 0.100 0.50 0.8668439 -0.0247534303
## 1 0.100 1.00 0.8668439 -0.0247534303
## 2 0.001 0.25 0.8813367 0.0000000000
## 2 0.001 0.50 0.8813367 0.0000000000
## 2 0.001 1.00 0.8813367 0.0000000000
## 2 0.010 0.25 0.8740903 -0.0131422339
## 2 0.010 0.50 0.8682932 -0.0222299477
## 2 0.010 1.00 0.8682932 -0.0078943749
## 2 0.100 0.25 0.8682932 0.0272451662
## 2 0.100 0.50 0.8624961 0.0476570575
## 2 0.100 1.00 0.8625065 0.0432372070
## 3 0.001 0.25 0.8813367 0.0000000000
## 3 0.001 0.50 0.8813367 0.0000000000
## 3 0.001 1.00 0.8813367 0.0000000000
## 3 0.010 0.25 0.8711917 -0.0029696546
## 3 0.010 0.50 0.8726410 -0.0002291422
## 3 0.010 1.00 0.8740799 0.0195439534
## 3 0.100 0.25 0.8538109 0.0168860008
## 3 0.100 0.50 0.8523720 0.0276650622
## 3 0.100 1.00 0.8480242 0.0596418121
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1, scale = 0.001 and C = 0.25.
The model that seems to be the best is still the linear.
mod_svm_poly_default %>%
predict(test_data) %>%
confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cancer no cancer
## cancer 0 0
## no cancer 20 147
##
## Accuracy : 0.8802
## 95% CI : (0.8211, 0.9253)
## No Information Rate : 0.8802
## P-Value [Acc > NIR] : 0.5592
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 2.152e-05
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8802
## Prevalence : 0.1198
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : cancer
##
The default grid for the polynomial kernel does not do very well. Instead of using the default grid for the hyperparameter values, train your polynomial SVM with the following grid.
svm_poly_grid <- expand.grid(C = c(0.25, 0.5, 1, 2),
scale = c(0.5, 1, 2),
degree = c(2, 3))
svm_poly_grid
Report the confusion matrix of the final model.
mod_svm_poly_tuned <- train(y ~ . , data = train_data, method = "svmPoly",
tuneGrid = svm_poly_grid,
trControl = trainControl("cv", number = 5))
mod_svm_poly_tuned
## Support Vector Machines with Polynomial Kernel
##
## 691 samples
## 18 predictor
## 2 classes: 'cancer', 'no cancer'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 552, 553, 553, 553, 553
## Resampling results across tuning parameters:
##
## C scale degree Accuracy Kappa
## 0.25 0.5 2 0.8698259 0.13768155
## 0.25 0.5 3 0.8567720 0.17984193
## 0.25 1.0 2 0.8596705 0.11456657
## 0.25 1.0 3 0.8509957 0.13761482
## 0.25 2.0 2 0.8596705 0.12711647
## 0.25 2.0 3 0.8466375 0.16380100
## 0.50 0.5 2 0.8654781 0.11620989
## 0.50 0.5 3 0.8495360 0.14167873
## 0.50 1.0 2 0.8596705 0.14056368
## 0.50 1.0 3 0.8452091 0.13933593
## 0.50 2.0 2 0.8596601 0.12682998
## 0.50 2.0 3 0.8263580 0.10482596
## 1.00 0.5 2 0.8596810 0.10401825
## 1.00 0.5 3 0.8524346 0.12605573
## 1.00 1.0 2 0.8596705 0.14110840
## 1.00 1.0 3 0.8480867 0.14251281
## 1.00 2.0 2 0.8610989 0.14322873
## 1.00 2.0 3 0.8162340 0.07954981
## 2.00 0.5 2 0.8596705 0.14056368
## 2.00 0.5 3 0.8538838 0.15616977
## 2.00 1.0 2 0.8596705 0.14118672
## 2.00 1.0 3 0.8422896 0.11995554
## 2.00 2.0 2 0.8582004 0.15519365
## 2.00 2.0 3 0.8046502 0.03147407
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.5 and C = 0.25.
mod_svm_poly_tuned %>%
predict(test_data) %>%
confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cancer no cancer
## cancer 3 4
## no cancer 17 143
##
## Accuracy : 0.8743
## 95% CI : (0.8142, 0.9204)
## No Information Rate : 0.8802
## P-Value [Acc > NIR] : 0.649384
##
## Kappa : 0.1707
##
## Mcnemar's Test P-Value : 0.008829
##
## Sensitivity : 0.15000
## Specificity : 0.97279
## Pos Pred Value : 0.42857
## Neg Pred Value : 0.89375
## Prevalence : 0.11976
## Detection Rate : 0.01796
## Detection Prevalence : 0.04192
## Balanced Accuracy : 0.56139
##
## 'Positive' Class : cancer
##
svm_gauss_grid <- expand.grid(sigma = c(1/16, 1/8, 1/4, 1, 2),
C = c(0.25, 0.5, 1, 2))
mod_svm_radial <- train(y ~ . , data = train_data, method = "svmRadial",
tuneGrid = svm_gauss_grid,
trControl = trainControl("cv", number = 5))
mod_svm_radial
## Support Vector Machines with Radial Basis Function Kernel
##
## 691 samples
## 18 predictor
## 2 classes: 'cancer', 'no cancer'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 552, 552, 553, 554, 553
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa
## 0.0625 0.25 0.8813447 0.000000000
## 0.0625 0.50 0.8784461 -0.005289256
## 0.0625 1.00 0.8798850 0.032004501
## 0.0625 2.00 0.8784673 0.092890549
## 0.1250 0.25 0.8813447 0.000000000
## 0.1250 0.50 0.8784461 -0.005289256
## 0.1250 1.00 0.8769864 0.009406372
## 0.1250 2.00 0.8741299 0.067803251
## 0.2500 0.25 0.8813447 0.000000000
## 0.2500 0.50 0.8813447 0.000000000
## 0.2500 1.00 0.8769968 -0.007600950
## 0.2500 2.00 0.8726804 0.018569626
## 1.0000 0.25 0.8813447 0.000000000
## 1.0000 0.50 0.8813447 0.000000000
## 1.0000 1.00 0.8813447 0.000000000
## 1.0000 2.00 0.8799058 -0.002755267
## 2.0000 0.25 0.8813447 0.000000000
## 2.0000 0.50 0.8813447 0.000000000
## 2.0000 1.00 0.8813447 0.000000000
## 2.0000 2.00 0.8769967 -0.008307284
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 2 and C = 0.25.
mod_svm_radial %>%
predict(test_data) %>%
confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cancer no cancer
## cancer 0 0
## no cancer 20 147
##
## Accuracy : 0.8802
## 95% CI : (0.8211, 0.9253)
## No Information Rate : 0.8802
## P-Value [Acc > NIR] : 0.5592
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 2.152e-05
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8802
## Prevalence : 0.1198
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : cancer
##
Note that all these models have high overall accuracy but pretty poor balanced accuracy. Suggest a strategy to improve the balanced accuracy.
Answer The issue here is that our training data is imbalanced. To improve the balanced accuracy, we can create a new training data by keeping all the observations that do not have cancer and duplicating observations that have cancer. Below we do it for radial basis SVM:
cancer_data <- train_data %>%
filter(y == "cancer") %>%
sample_n(400, replace = TRUE)
no_cancer_data <- train_data %>%
filter(y == "no cancer") %>%
sample_n(400, replace = TRUE)
balanced_train_data <- rbind(cancer_data, no_cancer_data)
rm(cancer_data)
rm(no_cancer_data)
mod_svm_balanced <- train(y ~ . , data = balanced_train_data,
method = "svmPoly",
tuneGrid = svm_poly_grid,
trControl = trainControl("cv", number = 5))
mod_svm_balanced
## Support Vector Machines with Polynomial Kernel
##
## 800 samples
## 18 predictor
## 2 classes: 'cancer', 'no cancer'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 640, 640, 640, 640, 640
## Resampling results across tuning parameters:
##
## C scale degree Accuracy Kappa
## 0.25 0.5 2 0.72375 0.4475
## 0.25 0.5 3 0.77875 0.5575
## 0.25 1.0 2 0.76625 0.5325
## 0.25 1.0 3 0.83500 0.6700
## 0.25 2.0 2 0.76750 0.5350
## 0.25 2.0 3 0.84750 0.6950
## 0.50 0.5 2 0.74625 0.4925
## 0.50 0.5 3 0.82125 0.6425
## 0.50 1.0 2 0.76500 0.5300
## 0.50 1.0 3 0.84875 0.6975
## 0.50 2.0 2 0.76000 0.5200
## 0.50 2.0 3 0.84375 0.6875
## 1.00 0.5 2 0.76125 0.5225
## 1.00 0.5 3 0.81250 0.6250
## 1.00 1.0 2 0.77125 0.5425
## 1.00 1.0 3 0.84500 0.6900
## 1.00 2.0 2 0.76000 0.5200
## 1.00 2.0 3 0.86000 0.7200
## 2.00 0.5 2 0.75750 0.5150
## 2.00 0.5 3 0.83250 0.6650
## 2.00 1.0 2 0.76000 0.5200
## 2.00 1.0 3 0.84000 0.6800
## 2.00 2.0 2 0.76125 0.5225
## 2.00 2.0 3 0.86000 0.7200
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 2 and C = 1.
mod_svm_balanced %>% predict(test_data) %>%
confusionMatrix(test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cancer no cancer
## cancer 7 45
## no cancer 13 102
##
## Accuracy : 0.6527
## 95% CI : (0.5753, 0.7246)
## No Information Rate : 0.8802
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0259
##
## Mcnemar's Test P-Value : 4.691e-05
##
## Sensitivity : 0.35000
## Specificity : 0.69388
## Pos Pred Value : 0.13462
## Neg Pred Value : 0.88696
## Prevalence : 0.11976
## Detection Rate : 0.04192
## Detection Prevalence : 0.31138
## Balanced Accuracy : 0.52194
##
## 'Positive' Class : cancer
##
Here are the answers:
Here is a list of models available in caret
including SVMs with various kernels;