Ex. 1

Use the svm() algorithm of the e1071 package to carry out the support vector machine for the PlantGrowth data set. Then discuss the number of support vectors/samples. [Install the e1071 package in R if needed.]

library(e1071)  # svm()
library(caret)  # confusionMatrix(), createDataPartition()
p <- PlantGrowth
cbind(p[1:10,],p[11:20,],p[21:30,])
p_svm <- svm(group ~ weight, data = p)
summary(p_svm)
## 
## Call:
## svm(formula = group ~ weight, data = p)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  29
## 
##  ( 10 9 10 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  ctrl trt1 trt2
p_pred <- predict(p_svm, newdata = p)
table(p_pred)
## p_pred
## ctrl trt1 trt2 
##    3   10   17
confusionMatrix(p_pred, p$group)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ctrl trt1 trt2
##       ctrl    0    2    1
##       trt1    4    6    0
##       trt2    6    2    9
## 
## Overall Statistics
##                                         
##                Accuracy : 0.5           
##                  95% CI : (0.313, 0.687)
##     No Information Rate : 0.333         
##     P-Value [Acc > NIR] : 0.0435        
##                                         
##                   Kappa : 0.25          
##                                         
##  Mcnemar's Test P-Value : 0.1006        
## 
## Statistics by Class:
## 
##                      Class: ctrl Class: trt1 Class: trt2
## Sensitivity                0.000       0.600       0.900
## Specificity                0.850       0.800       0.600
## Pos Pred Value             0.000       0.600       0.529
## Neg Pred Value             0.630       0.800       0.923
## Prevalence                 0.333       0.333       0.333
## Detection Rate             0.000       0.200       0.300
## Detection Prevalence       0.100       0.333       0.567
## Balanced Accuracy          0.425       0.700       0.750
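The SVM model uses 29 of the 30 samples in our data set as support vectors, and its accuracy on the training data is only 50%. That beats the 33.3% no-information rate (p = 0.0435), but not by much, and the ctrl class is never predicted correctly (sensitivity 0). With weight as the only predictor, the three groups overlap heavily and are not easily separable, so we end up with a very complex model that uses all but one of the data points to determine the position and orientation of the decision boundary.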
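The support vector counts quoted above can also be read directly from the fitted object instead of from the printed summary. The following is a minimal sketch, assuming the default radial kernel and cost used above; the variable names (n_sv, sv_fraction, per_class, non_sv) are ours and purely illustrative.

```r
library(e1071)

# Refit the same model as above (radial kernel and cost = 1 are the defaults).
p_svm <- svm(group ~ weight, data = PlantGrowth)

n_sv        <- p_svm$tot.nSV      # total number of support vectors
n_obs       <- nrow(PlantGrowth)  # number of training samples
sv_fraction <- n_sv / n_obs       # share of samples used as support vectors

# Support vectors per class; nSV is ordered like p_svm$labels,
# so map the labels back to the factor levels before naming.
per_class <- setNames(p_svm$nSV, p_svm$levels[p_svm$labels])

# Training rows that are NOT support vectors.
non_sv <- setdiff(seq_len(n_obs), p_svm$index)

print(per_class)
print(PlantGrowth[non_sv, ])
```

Looking at the rows in non_sv shows which observations the model could drop without changing the fitted boundary.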

Ex. 2

Do a similar SVM analysis as that in the previous question using the iris data set. Discuss the number of support vectors/samples.

i <- iris
i
i_svm <- svm(Species ~ ., data = i)
summary(i_svm)
## 
## Call:
## svm(formula = Species ~ ., data = i)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  51
## 
##  ( 8 22 21 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  setosa versicolor virginica
i_pred <- predict(i_svm, newdata = i)
table(i_pred)
## i_pred
##     setosa versicolor  virginica 
##         50         50         50
confusionMatrix(i_pred, i$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0          2        48
## 
## Overall Statistics
##                                              
##                Accuracy : 0.973              
##                  95% CI : (0.933, 0.993)     
##     No Information Rate : 0.333              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.96               
##                                              
##  Mcnemar's Test P-Value : NA                 
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                  1.000             0.960            0.960
## Specificity                  1.000             0.980            0.980
## Pos Pred Value               1.000             0.960            0.960
## Neg Pred Value               1.000             0.980            0.980
## Prevalence                   0.333             0.333            0.333
## Detection Rate               0.333             0.320            0.320
## Detection Prevalence         0.333             0.333            0.333
## Balanced Accuracy            1.000             0.970            0.970
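The iris data set produces an SVM model with 51 support vectors for 150 samples, so about one third of the samples serve as support vectors. A share this large suggests that our model might be overfitting, and the accuracy of over 97% cannot rule that out because it is measured on the same data used to fit the model. Most of the support vectors (43 of 51) come from versicolor and virginica, the only two classes that are confused with each other; setosa is classified perfectly and needs only 8.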
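One way to check the overfitting concern without setting data aside is the built-in k-fold cross-validation of svm(), requested through its cross argument. This is a hedged sketch rather than part of the original analysis: the choice of 10 folds and the seed value are arbitrary assumptions, and the variable names are ours.

```r
library(e1071)

set.seed(1)  # fold assignment is random; the seed value is arbitrary

# Same model as above, plus 10-fold cross-validation on the training data.
i_svm_cv <- svm(Species ~ ., data = iris, cross = 10)

cv_fold_acc <- i_svm_cv$accuracies    # accuracy per fold, in percent
cv_acc      <- i_svm_cv$tot.accuracy  # overall cross-validated accuracy, in percent

# Resubstitution (training) accuracy, in percent, for comparison.
train_acc <- 100 * mean(predict(i_svm_cv, newdata = iris) == iris$Species)

print(cv_fold_acc)
print(c(cross_validated = cv_acc, training = train_acc))
```

If the cross-validated accuracy sits close to the training accuracy, the large number of support vectors is more likely a consequence of the overlapping versicolor and virginica classes than of overfitting.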

Ex. 3

Use the iris data set (or any other data set) to select 80% of the samples for training svm(), then use the remaining 20% for validation. Discuss your results.

set.seed(42)
index <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
train <- iris[index, ]
test <- iris[-index, ]
svm_iris <- svm(Species ~ ., data = train)
summary(svm_iris)
## 
## Call:
## svm(formula = Species ~ ., data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  44
## 
##  ( 6 19 19 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  setosa versicolor virginica
pred_iris_train <- predict(svm_iris, newdata = train)
table(pred_iris_train)
## pred_iris_train
##     setosa versicolor  virginica 
##         40         40         40
confusionMatrix(pred_iris_train, train$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         39         1
##   virginica       0          1        39
## 
## Overall Statistics
##                                              
##                Accuracy : 0.983              
##                  95% CI : (0.941, 0.998)     
##     No Information Rate : 0.333              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.975              
##                                              
##  Mcnemar's Test P-Value : NA                 
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                  1.000             0.975            0.975
## Specificity                  1.000             0.988            0.988
## Pos Pred Value               1.000             0.975            0.975
## Neg Pred Value               1.000             0.988            0.988
## Prevalence                   0.333             0.333            0.333
## Detection Rate               0.333             0.325            0.325
## Detection Prevalence         0.333             0.333            0.333
## Balanced Accuracy            1.000             0.981            0.981
pred_iris_test <- predict(svm_iris, newdata = test)
table(pred_iris_test)
## pred_iris_test
##     setosa versicolor  virginica 
##          9         12          9
confusionMatrix(pred_iris_test, test$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa          9          0         0
##   versicolor      1          9         2
##   virginica       0          1         8
## 
## Overall Statistics
##                                         
##                Accuracy : 0.867         
##                  95% CI : (0.693, 0.962)
##     No Information Rate : 0.333         
##     P-Value [Acc > NIR] : 0.0000000023  
##                                         
##                   Kappa : 0.8           
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                  0.900             0.900            0.800
## Specificity                  1.000             0.850            0.950
## Pos Pred Value               1.000             0.750            0.889
## Neg Pred Value               0.952             0.944            0.905
## Prevalence                   0.333             0.333            0.333
## Detection Rate               0.300             0.300            0.267
## Detection Prevalence         0.300             0.400            0.300
## Balanced Accuracy            0.950             0.875            0.875
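The model trained on just 80% of the iris data uses 44 of the 120 training samples as support vectors, once again a bit over one third. The accuracy is over 98% on the training data but drops to under 87% on the test data (26 of 30 correct), which indicates that the model may be overfit. The test set contains only 30 samples, however, so the 95% confidence interval (0.693, 0.962) is wide and the true size of the gap is uncertain.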
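Because a single 80/20 split leaves only 30 test samples, the test accuracy depends noticeably on which samples land in the test set. The sketch below repeats the stratified split several times to see how much the test accuracy varies. It assumes the same default svm() settings as above; the number of repeats (20) is an arbitrary choice and the variable names are ours.

```r
library(e1071)
library(caret)

set.seed(42)

# 20 stratified 80/20 splits; each list element holds the training row indices.
splits <- createDataPartition(iris$Species, p = 0.80, times = 20, list = TRUE)

# Fit on each training part and record the accuracy on the held-out part.
test_acc <- sapply(splits, function(idx) {
  fit  <- svm(Species ~ ., data = iris[idx, ])
  pred <- predict(fit, newdata = iris[-idx, ])
  mean(pred == iris$Species[-idx])
})

summary(test_acc)
```

The spread of test_acc gives a rough idea of whether a drop like the one observed above is typical or mostly an effect of one particular split.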