library(e1071)
## Warning: package 'e1071' was built under R version 4.1.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Use the svm() algorithm of the ‘e1071’ package to carry out the support vector machine for the PlantGrowth data set. Then, discuss the number of support vectors/samples.
First, let’s take a look at the PlantGrowth data set. It has a column of plant weights and a column of treatment categories.
glimpse(PlantGrowth)
## Rows: 30
## Columns: 2
## $ weight <dbl> 4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14, 4.8…
## $ group <fct> ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, trt…
We run svm() to build a treatment classification model from plant weight. The same function can also be used for regression when the response is numeric.
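As an illustrative aside (not part of the assignment), here is a minimal regression-mode sketch: with a numeric response, svm() defaults to eps-regression. The model below is for demonstration only.

reg_fit = svm(weight ~ group, data = PlantGrowth, kernel = "linear", cost = 10)  # numeric response, so eps-regression
head(predict(reg_fit, PlantGrowth))  # predicted weights for the training rows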
We end up with a classification model with 27 support vectors. That means our classification boundary depends on 27 of the 30 data points. Such a large fraction of support vectors suggests the treatment groups overlap heavily in weight and are hard to separate.
dat = PlantGrowth
svmfit = svm(group ~ weight, data = dat, kernel = "linear", cost = 10, scale = FALSE)
print(svmfit)
##
## Call:
## svm(formula = group ~ weight, data = dat, kernel = "linear", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 27
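To see which samples act as support vectors, we can use the index component of the fitted object, which holds the row indices of the support vectors (per the e1071 documentation):

sv_rows = svmfit$index      # rows of dat that are support vectors
table(dat$group[sv_rows])   # how the 27 support vectors split across groups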
It’s worth noting that you can plot() the results of the svm() function, but not in all cases: plotting is available only for classification models, and it needs at least two input variables to form the axes of the plot. With a single predictor (weight) there is no second dimension to display, so it does not apply here; we demonstrate it later on the iris model.
Let’s use the caret package’s confusionMatrix() function to look at the performance of our classification model on the training data. The model has trouble identifying the controls: only 1 out of 10 ctrl samples is correctly identified. The overall accuracy is 50%, which isn’t great considering that always guessing a single class already achieves 33% (the no-information rate).
caret::confusionMatrix(svmfit$fitted, dat$group)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ctrl trt1 trt2
## ctrl 1 2 2
## trt1 4 6 0
## trt2 5 2 8
##
## Overall Statistics
##
## Accuracy : 0.5
## 95% CI : (0.313, 0.687)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 0.04348
##
## Kappa : 0.25
##
## Mcnemar's Test P-Value : 0.26665
##
## Statistics by Class:
##
## Class: ctrl Class: trt1 Class: trt2
## Sensitivity 0.10000 0.6000 0.8000
## Specificity 0.80000 0.8000 0.6500
## Pos Pred Value 0.20000 0.6000 0.5333
## Neg Pred Value 0.64000 0.8000 0.8667
## Prevalence 0.33333 0.3333 0.3333
## Detection Rate 0.03333 0.2000 0.2667
## Detection Prevalence 0.16667 0.3333 0.5000
## Balanced Accuracy 0.45000 0.7000 0.7250
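The cost = 10 above was an arbitrary choice. As a sketch of how one might pick it, e1071 provides tune.svm(), which by default runs 10-fold cross-validation over a parameter grid; the cost grid below is illustrative:

set.seed(1)  # cross-validation folds are random
tuned = tune.svm(group ~ weight, data = dat, kernel = "linear", cost = 10^(-1:2))
summary(tuned)  # cross-validated error for each cost value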
Do an SVM analysis similar to that in the previous question using the iris data set. Discuss the number of support vectors/samples.
First, we perform an SVM analysis as in the previous question. We end up with a classification model with 17 support vectors, so only 17 of the 150 data points are used in defining the classification boundaries. This much smaller fraction reflects how well separated the iris species are.
On the training data, the model has 98% accuracy.
dat = iris
svmfit = svm(Species ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
print(svmfit)
##
## Call:
## svm(formula = Species ~ ., data = dat, kernel = "linear", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 17
caret::confusionMatrix(svmfit$fitted, dat$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 0
## virginica 0 3 50
##
## Overall Statistics
##
## Accuracy : 0.98
## 95% CI : (0.9427, 0.9959)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.97
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9400 1.0000
## Specificity 1.0000 1.0000 0.9700
## Pos Pred Value 1.0000 1.0000 0.9434
## Neg Pred Value 1.0000 0.9709 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3133 0.3333
## Detection Prevalence 0.3333 0.3133 0.3533
## Balanced Accuracy 1.0000 0.9700 0.9850
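As noted earlier, plot() needs two predictor axes. Since iris has four numeric predictors, we can plot a two-dimensional slice of the decision regions; this follows the example in ?plot.svm, with the held-constant values for the other two variables chosen arbitrarily:

plot(svmfit, dat, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))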
Use the iris data set to select 80% of the samples for training svm(), then use the remaining 20% for validation. Discuss your results.
Our model classifies all 30 held-out samples correctly. Perfect validation accuracy is less surprising than it first looks: the iris species are well separated, and the model already fit the training data almost perfectly. It does suggest the model is not badly overfitting, but with only 30 validation samples this estimate is noisy; see the cross-validation check after the output below.
set.seed(199)
rows <- sample(nrow(iris))    # shuffle the row indices
iris_rand <- iris[rows,]
train <- iris_rand[1:120,]    # first 120 shuffled rows (80%) for training
test <- iris_rand[121:150,]   # remaining 30 rows (20%) for validation
svmfit = svm(Species ~ ., data = train, kernel = "linear", cost = 10, scale = FALSE)
y_predL = predict(svmfit, newdata = test[,-5])  # drop column 5 (Species) before predicting
caret::confusionMatrix(y_predL, test[,5])
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 11 0
## virginica 0 0 9
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 0.3667
## P-Value [Acc > NIR] : 8.475e-14
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0
## Specificity 1.0000 1.0000 1.0
## Pos Pred Value 1.0000 1.0000 1.0
## Neg Pred Value 1.0000 1.0000 1.0
## Prevalence 0.3333 0.3667 0.3
## Detection Rate 0.3333 0.3667 0.3
## Detection Prevalence 0.3333 0.3667 0.3
## Balanced Accuracy 1.0000 1.0000 1.0
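Because a single 30-sample split gives a noisy accuracy estimate, a quick sanity check is svm()’s built-in cross argument, which performs k-fold cross-validation on the full data set; the tot.accuracy component of the result holds the overall cross-validated accuracy, in percent:

set.seed(199)
cv_fit = svm(Species ~ ., data = iris, kernel = "linear", cost = 10, scale = FALSE, cross = 10)
cv_fit$tot.accuracy  # overall 10-fold cross-validation accuracy (%)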