HW3_DCraig

  This post reuses code from a past post about Decision Trees to compare them against SVMs. That post can be found here. Specifically, we will use the results from Decision Tree #2, which had the highest performance.

Articles

Question 2: Is it better for classification or regression scenarios?

  From the sources, neither Decision Trees nor SVMs appear inherently better for one type of learning (regression vs. classification); they are, however, better suited to different types of data. In particular, Decision Trees work well with categorical data and handle collinearity well. SVMs are a good fit when the data is highly complex but the number of observations is small, due to their computational cost. There is also a regression version of SVMs called Support Vector Regression (SVR), as sketched below.
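
  As a minimal sketch of that regression variant, the kernlab package (the same backend caret uses for the SVM models later in this post) supports SVR through type = "eps-svr". The toy data below is purely illustrative and not from this assignment:
library(kernlab)

#toy data: a noisy sine curve, purely for illustration
set.seed(1)
toyDF <- data.frame(x = seq(0, 2*pi, length.out = 100))
toyDF$y <- sin(toyDF$x) + rnorm(100, sd = 0.1)

#eps-svr is kernlab's epsilon-insensitive Support Vector Regression
svrFit <- ksvm(y ~ x, data = toyDF, type = "eps-svr", kernel = "rbfdot", C = 1)

#predictions come back numeric, like any regression model
head(predict(svrFit, toyDF))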

Question 3: Do you agree with the recommendations? Why?

  I do agree with the recommendations concerning data because of the underlying mathematics. Decision Trees handle collinearity well because features are selected one at a time: if two variables are collinear, whichever yields the better split is selected as the tree steps down its branches. Decision Trees also require no transformations for categorical data.
  Collinearity does not impact SVMs as much as it does Linear Regression, but SVMs are particularly useful when relationships in the data exist only in very complicated forms (think higher-degree multivariate regression). Rather than relying on a series of complicated functions to represent the relationships between variables, SVMs represent those relationships through distances between observations. This is great for complex data, but it carries a heavy computational price when there are many observations.

Analysis

Decision Tree

  The second decision tree uses the tower damage, duration, lane, and lane role variables, which produced the best result in the previous assignment. Below is that result, reproduced for comparison.

Results

pruneFit2 <- cv.tree(classTreeFit2, FUN = prune.misclass) #cross-validate subtree sizes by misclassification rate

dfPruneFit2 <- cbind(size=pruneFit2$size,dev=pruneFit2$dev)
dfPruneFit2 <- data.frame(dfPruneFit2)
dfPruneFit2 <- dfPruneFit2 %>% group_by(size) %>% arrange(size) %>% arrange(dev) #sort so the smallest deviance comes first
#dfPruneFit2

#alternative method of choosing the best size
#dfPruneFit2$size[which.min(dfPruneFit2$dev)]

bestVal2 <- dfPruneFit2$size[1] #tree size with the lowest CV deviance

pruneFitFinal2 <- prune.misclass(classTreeFit2, best = bestVal2)
summary(pruneFitFinal2)
## 
## Classification tree:
## snip.tree(tree = classTreeFit2, nodes = c(6L, 11L, 14L))
## Variables actually used in tree construction:
## [1] "tower_damage" "duration"     "lane_role"   
## Number of terminal nodes:  8 
## Residual mean deviance:  0.5086 = 71.71 / 141 
## Misclassification error rate: 0.08054 = 12 / 149
prunePred2 <- predict(pruneFitFinal2, dplyr::select(dataTest2, -"win"), type = "class")

cm2 <- confusionMatrix(prunePred2,dataTest2$win)

cm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    20    5
##      TRUE      2   22
##                                           
##                Accuracy : 0.8571          
##                  95% CI : (0.7276, 0.9406)
##     No Information Rate : 0.551           
##     P-Value [Acc > NIR] : 5.266e-06       
##                                           
##                   Kappa : 0.7149          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9091          
##             Specificity : 0.8148          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.9167          
##              Prevalence : 0.4490          
##          Detection Rate : 0.4082          
##    Detection Prevalence : 0.5102          
##       Balanced Accuracy : 0.8620          
##                                           
##        'Positive' Class : FALSE           
## 
  I’ve reduced the code for the previous decision trees for the sake of comparison. The Decision Tree had 85.71% accuracy, sensitivity of 0.9091, and specificity of 0.8148.

SVM Overview

  A quick summary of how Support Vector Machines (SVMs) work: SVMs create a decision boundary between observations. What separates SVMs from other decision-boundary methods is that they identify support vectors, the observations from each class lying closest to the boundary, and position the boundary to leave the largest possible gap between the classes. The final decision boundary maximizes the distance (the margin) to the support vectors on both sides, similar to finding a midpoint. A useful image to represent this can be seen here:

SVM Example
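
  For a linearly separable case, the idea can be written as the standard hard-margin optimization (a textbook formulation, not something computed in this post):

$$
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{for all } i
$$

  Maximizing the margin $2/\lVert w \rVert$ is equivalent to minimizing $\lVert w \rVert^2$, and the constraints keep every observation on the correct side of the boundary.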

  Note that SVMs are great for small but complex datasets: kernels let them handle high-dimensional, complicated structure by implicitly mapping the data into a feature space where a linear boundary works (the kernel trick).
  The main parameter that SVMs are tuned over is Cost. Cost controls how heavily misclassification errors are penalized: if the boundary causes an observation to be classified incorrectly, Cost determines how much correction is made. A higher Cost can improve training accuracy but lead to overfitting. Depending on the kernel used, additional parameters such as sigma, scale, and degree also come into play.
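  In the usual soft-margin formulation (again a textbook form; the $C$ here corresponds to the Cost parameter tuned below):

$$
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
$$

  Each slack variable $\xi_i$ measures how far observation $i$ strays past its margin; a large $C$ makes violations expensive, tightening the fit at the risk of overfitting.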
  We will try Linear, Polynomial, and Radial Basis Function (RBF) kernels for our SVMs, to see whether the data can be represented by simple linear relationships or needs something more complex like a polynomial. RBF SVMs are generally good at capturing both simple and complex patterns.
#SVM requires its data to be centered and scaled
svmProcess <- preProcess(dataModel_sub2[,-1], method =c("center","scale")) #-1 to remove the win column

svmProcessed <- predict(svmProcess, dataModel_sub2)

paged_table(svmProcessed)

Linear Kernel

  Linear kernels are good for data that can be linearly separated. We can check this either by fitting a linear SVM directly or by performing PCA and plotting the data in a scatterplot, as sketched below. We will start with a linear kernel and a tuning grid for C ranging from 0 to 50.
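  As a quick sketch of the PCA check (assuming the centered and scaled svmProcessed data from above, with win in its first column and the remaining feature columns numeric):
#sketch: project the scaled features onto the first two principal components
#and color points by the win label to eyeball linear separability
pcaFit <- prcomp(svmProcessed[, -1])
plot(pcaFit$x[, 1], pcaFit$x[, 2],
     col = as.integer(as.factor(svmProcessed$win)) + 1, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA projection colored by win")
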
#Subset based on indexes
trainSVM <- svmProcessed[dataIndex, ]
testSVM <- svmProcessed[-dataIndex, ]

#Create Tuning Grid
tuneLinear <- expand.grid(C = seq(0, 50, length = 50))

#tuneLength is an alternative to tuneGrid: instead of supplying a grid, it tells train() how many candidate values to generate per parameter
#it is ignored here because an explicit tuneGrid is supplied

svmRTuned <- train(win ~., trainSVM,
                   method = "svmLinear",
                   tuneGrid = tuneLinear,
                   trControl = trainControl(method = "cv")
                   )

svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1.02040816326531 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 61 
## 
## Objective Function Value : -55.1347 
## Training error : 0.120805
paged_table(svmRTuned$results)
plot(svmRTuned)

  We can see that tuning chose C ≈ 1.02 as the best performer, although it is only marginally better than neighboring values. Typically, cost values are grouped into <1, 1 - 10, and >10. If this model does not generalize well to our test set, we may choose a lower cost parameter to avoid overfitting.
#Test set performance
svmRPreds <- predict(svmRTuned, newdata = testSVM[-1])

postResample(pred = svmRPreds, obs = testSVM$win)
##  Accuracy     Kappa 
## 0.8163265 0.6303437
varImp(svmRTuned)
## ROC curve variable importance
## 
##              Importance
## tower_damage     100.00
## duration          59.26
## lane              11.82
## lane_role          0.00
  Our linear SVM performed well at 81.63% accuracy and a Kappa of 0.63. The model sees lane_role as useless, so if we continue testing model features, it may be best to remove it, as sketched below. tower_damage was the most important variable, as expected from prior experience.
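  A sketch of that follow-up, dropping lane_role from the model formula (not run here, so its results are untested):
#refit the tuned linear SVM without the zero-importance lane_role feature
svmRTunedSub <- train(win ~ . - lane_role, trainSVM,
                      method = "svmLinear",
                      tuneGrid = tuneLinear,
                      trControl = trainControl(method = "cv"))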

Polynomial Kernel

  Let’s contrast this with a kernel that assumes a higher-order relationship: the “svmPoly” method, which fits a polynomial kernel.
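  kernlab’s polynomial kernel takes the form below, matching the scale and degree parameters tuned next (offset defaults to 1, as seen in the model output):

$$
K(x, x') = \left(\text{scale}\cdot\langle x, x' \rangle + \text{offset}\right)^{\text{degree}}
$$
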
#Create Tuning Grid
tunePoly <- expand.grid(degree = seq(0, 5, length = 10), #polynomial degree candidates (non-integer degrees are unusual for a polynomial kernel)
                        scale = c(TRUE,FALSE), #kernel scale coefficient; TRUE/FALSE coerce to 1/0
                        C = seq(0, 15, length = 30)) #cost param

svmPoly <- train(win ~., trainSVM,
                   method = "svmPoly",
                   tuneGrid = tunePoly,
                   trControl = trainControl(method = "cv")
                   )

svmPoly$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 0.517241379310345 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  5  scale =  TRUE  offset =  1 
## 
## Number of Support Vectors : 46 
## 
## Objective Function Value : -6.7789 
## Training error : 0.026846
svmPoly$bestTune
##     degree scale         C
## 572      5  TRUE 0.5172414
paged_table(svmPoly$results)
plot(svmPoly)

#Test set performance
svmPolyPreds <- predict(svmPoly, newdata = testSVM[-1])

postResample(pred = svmPolyPreds, obs = testSVM$win)
##  Accuracy     Kappa 
## 0.6734694 0.3509934
varImp(svmPoly)
## ROC curve variable importance
## 
##              Importance
## tower_damage     100.00
## duration          59.26
## lane              11.82
## lane_role          0.00
  Our polynomial-based SVM did not perform well, reaching only 67.35% accuracy and 0.35 Kappa. Given its very low training error (0.027), the degree-5 kernel appears to have overfit the training data.

Radial Basis Function Kernel

  Let’s take a look at one more SVM method, the Radial Basis Function (RBF) kernel, which illustrates how the kernel trick can handle complex data. An RBF kernel uses the radial distance between points as its measure of similarity: each point is treated as the center of its own Gaussian bump, and similarity decays with distance from that center. This implicitly lifts the data into a higher-dimensional space in which each observation is described by its similarity to the other observations rather than by its original feature values, which lets the algorithm capture very complex relationships.
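  In kernlab’s parameterization (the sigma tuned below), the RBF kernel is:

$$
K(x, x') = \exp\!\left(-\sigma\,\lVert x - x' \rVert^2\right)
$$

  Larger sigma values make the kernel more local, letting the decision boundary bend more tightly around individual observations.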
tuneRBF <- expand.grid(sigma = seq(.1, 1, length = 10), #kernel width (kernlab's sigma; analogous to gamma in other libraries)
                       C = seq(1, 10, length = 10))     #cost param

svmRBF <- train(win ~., trainSVM,
                   method = "svmRadial",
                   tuneGrid = tuneRBF,
                   trControl = trainControl(method = "cv")
                   )

svmRBF$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 6 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.2 
## 
## Number of Support Vectors : 71 
## 
## Objective Function Value : -264.1129 
## Training error : 0.087248
paged_table(svmRBF$results)
plot(svmRBF)

#Test set performance
svmRBFPreds <- predict(svmRBF, newdata = testSVM[-1])

postResample(pred = svmRBFPreds, obs = testSVM$win)
## Accuracy    Kappa 
## 0.755102 0.513245
varImp(svmRBF)
## ROC curve variable importance
## 
##              Importance
## tower_damage     100.00
## duration          59.26
## lane              11.82
## lane_role          0.00
  The RBF model reached 75.51% accuracy and 0.51 Kappa, falling short of the linear model’s 81.63% accuracy and 0.63 Kappa. The variable importance table, however, is identical across all three SVMs; this is not a coding error, since caret’s ROC-based variable importance is computed from the data alone and does not depend on the fitted model.

Comparison

  Despite performing reasonably well, the SVMs could not match the Decision Tree’s 85.71% accuracy: the linear kernel reached 81.63%, the RBF kernel 75.51%, and the polynomial kernel 67.35%. This highlights the importance of testing multiple models when approaching a task: performance depends on the task at hand, and different models are better suited to different types of data.