library(e1071)
Here, we will use a heart disease dataset to show how to tune and estimate a support vector machine (SVM).
First, read in the heart_tidy data file (CSV):
heart <- read.csv("C:/Users/Maria Elena Morinigo/Desktop/MSDA/DA 6813 - Data Analytics Applications/Week 5/heart_tidy.csv", header = FALSE)
str(heart)
## 'data.frame': 300 obs. of 14 variables:
## $ V1 : int 63 67 67 37 41 56 62 57 63 53 ...
## $ V2 : int 1 1 1 1 0 1 0 0 1 1 ...
## $ V3 : int 1 4 4 3 2 2 4 4 4 4 ...
## $ V4 : int 145 160 120 130 130 120 140 120 130 140 ...
## $ V5 : int 233 286 229 250 204 236 268 354 254 203 ...
## $ V6 : int 1 0 0 0 0 0 0 0 0 1 ...
## $ V7 : int 2 2 2 0 2 0 2 0 2 2 ...
## $ V8 : int 150 108 129 187 172 178 160 163 147 155 ...
## $ V9 : int 0 1 1 0 0 0 0 1 0 1 ...
## $ V10: num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ V11: int 3 2 2 3 1 1 3 1 2 3 ...
## $ V12: int 0 3 2 0 0 0 2 0 1 0 ...
## $ V13: int 6 3 7 3 3 3 3 3 7 7 ...
## $ V14: int 0 1 1 0 0 0 1 0 1 1 ...
head(heart)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3 3 1
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
anyNA(heart)
## [1] FALSE
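The columns come in unlabeled as V1 through V14. If this file follows the usual UCI Cleveland heart-disease ordering, you could optionally attach descriptive names; treat the mapping below as an assumption about this particular CSV (it is left commented out because the rest of this walkthrough keeps the V1-V14 names):
# names(heart) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
#                   "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")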
set.seed(200)
tr_ind <- sample(nrow(heart), 0.7*nrow(heart), replace = FALSE) # to get the training indices
I am using a 70/30 train/test split.
hrtrain <- heart[tr_ind,]
hrtest <- heart[-tr_ind,]
Here I am declaring the outcome (dependent variable), V14, as a factor, i.e. 0 or 1, based on the presence or absence of heart disease.
hrtrain[["V14"]] <- factor(hrtrain[["V14"]])
Let’s generate a formula. I will simply use a model that treats all remaining variables as predictors.
form1 <- V14 ~ .
Next, because we are using the RBF kernel, we need to set gamma and cost. For this we can use the function tune.svm from the package e1071. Caution: depending on your system, iterating over many values generated with seq() can take a long time. In this demo we will use gamma ranging from 0.01 to 0.1 in increments of 0.01, which gives 10 values: {0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10}. We will also iterate cost from 0.1 to 1 in increments of 0.1, which gives 10 values: {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. So tune.svm will fit SVMs for 10 x 10 = 100 combinations of gamma and cost. Internally, tune.svm also uses 10-fold cross-validation to estimate the classification error, so for 100 combinations it actually fits 10 x 100 = 1000 SVMs. This is why the function takes a long time to execute. I am going to assign the output from tune.svm to an object called tuned:
tuned <- tune.svm(form1, data = hrtrain, gamma = seq(.01, .1, by = .01), cost = seq(.1, 1, by = .1))
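If the default 10-fold cross-validation is too slow on your machine, note that tune.svm forwards a tunecontrol argument to tune(), so you could, for example, drop to 5 folds. This is an optional variation, not what was run above:
tuned5 <- tune.svm(form1, data = hrtrain, gamma = seq(.01, .1, by = .01), cost = seq(.1, 1, by = .1), tunecontrol = tune.control(sampling = "cross", cross = 5))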
tuned is a large list. It contains a lot of output, but at this point we are interested in which values of gamma and cost are the best. Note that "best" here means best among the 100 combinations we decided to try; it does not tell us the globally best parameter set.
tuned$best.parameters
## gamma cost
## 12 0.02 0.2
For this example and this range of tuning parameters, we get a best value of gamma = 0.02 and a best value of cost = 0.2 (stored in tuned$best.parameters$gamma and tuned$best.parameters$cost). To see the errors for the different combinations, take a look at the following output:
tuned$performances
## gamma cost error dispersion
## 1 0.01 0.1 0.2761905 0.07377111
## 2 0.02 0.1 0.1666667 0.09049011
## 3 0.03 0.1 0.1571429 0.10299129
## 4 0.04 0.1 0.1619048 0.11042874
## 5 0.05 0.1 0.1523810 0.10237788
## 6 0.06 0.1 0.1571429 0.10299129
## 7 0.07 0.1 0.1666667 0.08473872
## 8 0.08 0.1 0.1761905 0.09269080
## 9 0.09 0.1 0.1904762 0.08694009
## 10 0.10 0.1 0.2095238 0.10089047
## 11 0.01 0.2 0.1523810 0.09470752
## 12 0.02 0.2 0.1380952 0.10151287
## 13 0.03 0.2 0.1380952 0.10151287
## 14 0.04 0.2 0.1380952 0.10151287
## 15 0.05 0.2 0.1380952 0.10151287
## 16 0.06 0.2 0.1476190 0.10396522
## 17 0.07 0.2 0.1476190 0.10396522
## 18 0.08 0.2 0.1476190 0.10396522
## 19 0.09 0.2 0.1476190 0.10396522
## 20 0.10 0.2 0.1523810 0.09988656
## 21 0.01 0.3 0.1380952 0.10151287
## 22 0.02 0.3 0.1428571 0.10286890
## 23 0.03 0.3 0.1428571 0.10286890
## 24 0.04 0.3 0.1428571 0.09784784
## 25 0.05 0.3 0.1428571 0.09784784
## 26 0.06 0.3 0.1476190 0.09377178
## 27 0.07 0.3 0.1476190 0.09377178
## 28 0.08 0.3 0.1523810 0.10237788
## 29 0.09 0.3 0.1523810 0.10237788
## 30 0.10 0.3 0.1523810 0.10237788
## 31 0.01 0.4 0.1380952 0.10151287
## 32 0.02 0.4 0.1380952 0.10396522
## 33 0.03 0.4 0.1428571 0.09784784
## 34 0.04 0.4 0.1428571 0.08694009
## 35 0.05 0.4 0.1476190 0.08232573
## 36 0.06 0.4 0.1476190 0.08232573
## 37 0.07 0.4 0.1571429 0.09797650
## 38 0.08 0.4 0.1571429 0.09797650
## 39 0.09 0.4 0.1523810 0.10237788
## 40 0.10 0.4 0.1523810 0.10237788
## 41 0.01 0.5 0.1428571 0.10286890
## 42 0.02 0.5 0.1428571 0.09784784
## 43 0.03 0.5 0.1476190 0.08232573
## 44 0.04 0.5 0.1476190 0.08232573
## 45 0.05 0.5 0.1476190 0.08232573
## 46 0.06 0.5 0.1523810 0.08922838
## 47 0.07 0.5 0.1571429 0.09797650
## 48 0.08 0.5 0.1571429 0.09797650
## 49 0.09 0.5 0.1571429 0.09797650
## 50 0.10 0.5 0.1619048 0.10335759
## 51 0.01 0.6 0.1380952 0.10396522
## 52 0.02 0.6 0.1476190 0.08232573
## 53 0.03 0.6 0.1428571 0.08694009
## 54 0.04 0.6 0.1476190 0.08232573
## 55 0.05 0.6 0.1476190 0.08232573
## 56 0.06 0.6 0.1523810 0.08922838
## 57 0.07 0.6 0.1523810 0.08922838
## 58 0.08 0.6 0.1619048 0.10335759
## 59 0.09 0.6 0.1619048 0.10335759
## 60 0.10 0.6 0.1761905 0.10777299
## 61 0.01 0.7 0.1380952 0.10396522
## 62 0.02 0.7 0.1428571 0.08694009
## 63 0.03 0.7 0.1428571 0.08694009
## 64 0.04 0.7 0.1428571 0.08694009
## 65 0.05 0.7 0.1428571 0.08694009
## 66 0.06 0.7 0.1571429 0.09537028
## 67 0.07 0.7 0.1571429 0.09537028
## 68 0.08 0.7 0.1714286 0.10089047
## 69 0.09 0.7 0.1714286 0.10089047
## 70 0.10 0.7 0.1714286 0.10089047
## 71 0.01 0.8 0.1380952 0.10396522
## 72 0.02 0.8 0.1428571 0.08694009
## 73 0.03 0.8 0.1428571 0.08694009
## 74 0.04 0.8 0.1428571 0.08694009
## 75 0.05 0.8 0.1428571 0.08694009
## 76 0.06 0.8 0.1523810 0.09988656
## 77 0.07 0.8 0.1571429 0.09269080
## 78 0.08 0.8 0.1666667 0.09323286
## 79 0.09 0.8 0.1714286 0.10089047
## 80 0.10 0.8 0.1714286 0.10089047
## 81 0.01 0.9 0.1523810 0.08922838
## 82 0.02 0.9 0.1428571 0.08694009
## 83 0.03 0.9 0.1428571 0.08694009
## 84 0.04 0.9 0.1428571 0.08694009
## 85 0.05 0.9 0.1428571 0.08694009
## 86 0.06 0.9 0.1523810 0.09988656
## 87 0.07 0.9 0.1571429 0.09269080
## 88 0.08 0.9 0.1666667 0.10588622
## 89 0.09 0.9 0.1761905 0.10777299
## 90 0.10 0.9 0.1761905 0.10777299
## 91 0.01 1.0 0.1523810 0.08922838
## 92 0.02 1.0 0.1428571 0.08694009
## 93 0.03 1.0 0.1428571 0.08694009
## 94 0.04 1.0 0.1428571 0.08694009
## 95 0.05 1.0 0.1428571 0.08694009
## 96 0.06 1.0 0.1523810 0.09733149
## 97 0.07 1.0 0.1571429 0.09269080
## 98 0.08 1.0 0.1714286 0.10089047
## 99 0.09 1.0 0.1761905 0.10777299
## 100 0.10 1.0 0.1761905 0.10777299
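Scanning 100 rows by eye is tedious. e1071 also provides a plot method for tune objects, which draws a filled contour of the cross-validated error over the gamma/cost grid:
plot(tuned)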
You can scroll and see that for gamma = 0.02 and cost = 0.2 the error is lowest, at about 0.138 (this value is also stored in tuned$best.performance). Finally, using the best parameter values, let’s run SVM. Note that because I am using the RBF kernel, the model has no interpretable coefficients attached to the variables. Instead, prediction is done using the support vectors: it is the observations, not parameters associated with the variables, that are used in prediction. This is quite a departure from traditional parametric methods such as linear and logistic regression.
The syntax for svm() from the package e1071 with the default kernel, the radial basis function (RBF), is: svm(formula = , data = , gamma = , cost = )
mysvm <- svm(formula = form1, data = hrtrain, gamma = tuned$best.parameters$gamma, cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = form1, data = hrtrain, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.2
##
## Number of Support Vectors: 145
##
## ( 73 72 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
I am assigning the output of svm() to an object called mysvm. I am also printing the summary of mysvm, which gives some basic information. Take note of the number of support vectors.
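If you also want to see which training observations became the support vectors, the fitted object stores them: mysvm$index holds their row positions within hrtrain and mysvm$SV holds the (scaled) support vectors themselves. A quick look:
length(mysvm$index) # matches the support-vector count reported in the summary
head(mysvm$index)   # row positions of the support vectors within hrtrain
dim(mysvm$SV)       # scaled support vectors (rows) by predictors (columns)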
Rather than manually typing in the best values of gamma and cost, I am simply using the values from the "tuned" object. In case you are wondering how I found them, go to the top right window in RStudio (Environment tab) and navigate to the "tuned" object. Click on the blue icon next to it to expand "tuned" and you will see where I got the names of the best parameters.
svmpredict <- predict(mysvm, hrtest) # predict.svm returns class labels directly; no type argument is needed
table(pred = svmpredict, true = hrtest$V14)
## true
## pred 0 1
## 0 34 11
## 1 7 38
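From this confusion matrix the overall test accuracy is (34 + 38) / 90 = 0.8, i.e. 80% of the test observations are classified correctly. You can compute it directly:
mean(svmpredict == hrtest$V14) # overall test accuracy for this split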
As an exercise, you should compare this output to logistic regression and see which model performs better. You can also tweak the SVM by using different kernels (linear, polynomial, or sigmoid). Note that the tuning parameters for these kernels differ from those of the RBF kernel. You will find the details in the documentation for e1071: https://cran.r-project.org/web/packages/e1071/e1071.pdf
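As a starting point for that exercise, here is a rough sketch of both comparisons: a linear-kernel SVM (which only needs cost tuned, not gamma) and a plain logistic regression fit with glm(). Object names such as tuned_lin and logit1 are just illustrative, and your results will depend on the split and seed.
# Linear kernel: only cost needs tuning
tuned_lin <- tune.svm(form1, data = hrtrain, kernel = "linear", cost = seq(.1, 1, by = .1))
svm_lin <- svm(form1, data = hrtrain, kernel = "linear", cost = tuned_lin$best.parameters$cost)
table(pred = predict(svm_lin, hrtest), true = hrtest$V14)
# Logistic regression on the same split, thresholding predicted probabilities at 0.5
logit1 <- glm(form1, data = hrtrain, family = binomial)
logit_prob <- predict(logit1, hrtest, type = "response")
table(pred = ifelse(logit_prob > 0.5, 1, 0), true = hrtest$V14)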