Assignment 6 – SVM

Note: If you ever forget the tuning parameters for a particular model, the modelLookup() function can be used to find them. Simply supply the method's name.

Part 1:

  1. What is data mining? Explain.
Data mining is the process of discovering useful patterns, correlations, and other information in datasets using statistical, machine-learning, and computational tools. It can be applied to make predictions, detect fraud or security issues, or gain insights into a user base. Data mining processes can also be automated.
  1. What are support vectors?
Ans:
Support vectors are the data points that lie closest to the optimal hyperplane. They have direct bearing on the optimum location of the decision surface.
  1. State what support vectors are used for.
Ans:
The support vectors are used to identify a hyperplane (a straight line in two dimensions) that separates the classes.  
  1. What do we mean when we say that the data is linearly separable?
Ans:
The data are linearly separable when the classes can be separated by a straight line (in two dimensions) or, more generally, by a hyperplane.
  1. List five model builders.
  1. Support vector machines (SVM) are unsupervised learning algorithms. TRUE or FALSE?
FALSE
  1. SVMs are sensitive to the choice of ______________ ________________.
Ans:
SVMs are sensitive to the choice of tuning options (e.g., the type of transformations to perform), which makes them harder to use and makes it time-consuming to identify the best model.
  1. The SVM model only deals with ____________ _______________.
Ans:
The SVM model only deals with support vectors rather than the whole training dataset.
  1. What is a binary classification model?
Ans:
A binary classification model classifies the data into exactly two groups (classes).
  1. SVMs often perform a non-linear mapping of the original data into a very high dimensional space where the classes can be separated linearly by a hyperplane. What method do we use to find solutions for this problem at a lower dimension?
Ans:
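The kernel trick. Instead of explicitly mapping the original data, which is not linearly separable, into a very high-dimensional space where the classes can be separated linearly (by a hyperplane), a kernel function computes the required dot products directly in the original, lower-dimensional space.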
  1. SVMs solve the problem of linearly separable binary classification by doing what?
Ans:
By finding the maximum margin hyperplane using a quadratic optimization approach.
  1. What will the SVMs do when the cases are not linearly separable?
Ans:
They transform the original data into a higher dimension, where the classes can become linearly separable.
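The idea of lifting data into a higher dimension can be illustrated with a small base R sketch. The data and the feature map phi(x) = (x, x^2) are made up for illustration and are not taken from the assignment listings.

```r
# One feature; the "pos" class sits on both sides of the "neg" class,
# so no single threshold on x separates the two classes.
x <- c(-3, -2, -1, 0, 1, 2, 3)
y <- ifelse(abs(x) > 1.5, "pos", "neg")

# Hypothetical feature map phi(x) = (x, x^2): add the square as a second dimension.
phi <- cbind(x = x, x_sq = x^2)

# In the new space the horizontal line x_sq = 2.25 separates the classes.
predicted <- ifelse(phi[, "x_sq"] > 2.25, "pos", "neg")
all_correct <- all(predicted == y)
print(all_correct)
```

An SVM would search for the separating line itself; here it is fixed by hand only to show that one exists after the transformation.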
  1. The relation K(x, z) = ϕ(x) ∙ ϕ(z) is associated with what method that is used to find solutions? Why is it important?
Ans:
The kernel trick:
A kernel function, K(), is a function that, when evaluated on two vectors of dimension p, gives the same result as the dot product of the transformations of these two vectors into a much higher dimension, r.

Instead of mapping the data into the new space directly (which could be computationally expensive), the kernel function calculates:
 K(x, z) = ϕ(x) ∙ ϕ(z)
where ϕ(x) is the mapping to the higher-dimensional space. This is important because the SVM optimization only needs dot products between pairs of observations, so the separating hyperplane can be found without ever computing ϕ(x) explicitly.
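The identity K(x, z) = ϕ(x) ∙ ϕ(z) can be checked numerically. The sketch below assumes a degree-2 polynomial kernel on two-dimensional inputs, chosen only because its feature map is small enough to write out; the vectors x and z are arbitrary.

```r
# Explicit feature map for a 2-D input: phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
phi <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)

# Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2
poly_kernel <- function(x, z) sum(x * z)^2

x <- c(1, 2)
z <- c(3, 4)

explicit   <- sum(phi(x) * phi(z))  # dot product taken in the 3-D feature space
via_kernel <- poly_kernel(x, z)     # same quantity, computed in the original 2-D space

print(c(explicit = explicit, via_kernel = via_kernel))
print(isTRUE(all.equal(explicit, via_kernel)))
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors, which is where the computational saving comes from.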

Part 2: This part of the assignment is based on the six algorithms sent to you via email

  1. In Listing 12.16: Example of Naïve Bayes algorithm for classification in caret,
##Listing 12.16
library(caret)
library(mlbench)
library(klaR)
#Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
trainControl <- trainControl(method="cv", number=5)
fit.nb <- train(diabetes ~., data=PimaIndiansDiabetes, method="nb", metric="Accuracy", trControl=trainControl)
#summarize fit
print(fit.nb)
## Naive Bayes 
## 
## 768 samples
##   8 predictor
##   2 classes: 'neg', 'pos' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 614, 614, 615, 614, 615 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.7564553  0.4510005
##    TRUE      0.7564978  0.4452698
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.
  1. Explain the trainControl( ) function
Ans:

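trainControl() defines how train() estimates model performance. It specifies the resampling method (for example, cross-validation or bootstrap), the number of folds or resampling iterations, the number of repeats for repeated cross-validation, how performance metrics are summarized, and other options used during model tuning and validation.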
  1. Explain the argument “metric = “Accuracy””
Ans:
It sets the model selection criterion. train() compares the candidate tuning-parameter settings by their resampled "Accuracy" and keeps the setting with the highest value.

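library(caret)
# define the resampling scheme used by train(): 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)
train(Species ~ ., data = iris,
      method = "rpart",
      trControl = ctrl,
      metric = "Accuracy")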
  1. In Listing 12.17: Example of SVM algorithm for classification,
#Listing 12.17
#load the packages
library(kernlab)
library(mlbench)
#Load the dataset
data(PimaIndiansDiabetes)
#fit model
fit <- ksvm(diabetes~., data=PimaIndiansDiabetes, kernel="rbfdot") 
#summarize the fit
print(fit)
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.12374975684703 
## 
## Number of Support Vectors : 435 
## 
## Objective Function Value : -352.8229 
## Training error : 0.175781
#make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="response") 
#summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
##            
## predictions neg pos
##         neg 463  98
##         pos  37 170
  1. Explain the argument “kernel = “rbfdot””
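Ans:
"rbfdot" selects the Gaussian radial basis function (RBF) kernel. It lets the SVM model a non-linear relationship between the response and the covariates. The kernel is
K(x, x′) = exp(−sigma ∥x − x′∥^2)
where sigma controls how quickly the similarity between two points decays as the distance between them grows.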
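A minimal base R sketch of the RBF formula above. The function name rbf_kernel, the input vectors, and the sigma value are illustrative assumptions; kernlab's own implementation is not shown here.

```r
# Gaussian RBF kernel as written above: K(x, z) = exp(-sigma * ||x - z||^2)
rbf_kernel <- function(x, z, sigma) exp(-sigma * sum((x - z)^2))

sigma <- 0.1        # illustrative value, not the one estimated by ksvm
x <- c(1, 2)

k_same <- rbf_kernel(x, x, sigma)          # identical points
k_near <- rbf_kernel(x, c(2, 4), sigma)    # squared distance 5
k_far  <- rbf_kernel(x, c(10, 20), sigma)  # squared distance 405

print(c(same = k_same, near = k_near, far = k_far))
```

The kernel value is 1 for identical points and shrinks toward 0 as points move apart, so it behaves like a similarity score.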
  1. Explain what is happening in the argument “PimaIndiansDiabetes[, 1:8]”, and explain the indexing
Ans:
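It selects the first through eighth columns of the PimaIndiansDiabetes dataset (from the "mlbench" package). In data[rows, columns] indexing, leaving the row position empty keeps all rows, and 1:8 keeps columns 1 to 8. These are the eight predictors; the ninth column, diabetes, is the response and is left out.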
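The same indexing pattern can be tried on a small data frame. The data frame below is made up for illustration, with two predictor columns followed by an outcome column.

```r
# Hypothetical data frame: two predictors, then the outcome in the last column
df <- data.frame(a = 1:3,
                 b = 4:6,
                 outcome = factor(c("neg", "pos", "neg")))

predictors <- df[, 1:2]  # empty row index = all rows; 1:2 = columns 1 and 2
first_row  <- df[1, ]    # row 1; empty column index = all columns

print(names(predictors))
print(dim(predictors))
```

Here df[, 1:2] plays the role of PimaIndiansDiabetes[, 1:8]: every row is kept and the outcome column is dropped.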
  1. Explain and draw conclusions about the output
Ans:

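Number of support vectors: 435 out of 768 observations
Hyperparameter: sigma = 0.1237 (cost C = 1)
Accuracy = (463 + 170) / (463 + 170 + 37 + 98) = 633 / 768 = 0.824, which matches the reported training error of 0.1758.
More than half of the observations are support vectors, and the model misclassifies 98 of the 268 positive cases. Because the predictions are made on the same data used for fitting, this accuracy is likely optimistic. The exact figures can vary slightly between runs because ksvm estimates sigma from a random sample of the data.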
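As a check on the arithmetic, accuracy can be computed from the confusion matrix printed in the listing output (rows are predictions, columns are actual classes). This is a base R sketch that re-enters those four counts by hand.

```r
# Counts copied from the printed table: rows = predicted, columns = actual
cm <- matrix(c(463, 37, 98, 170), nrow = 2,
             dimnames = list(predictions = c("neg", "pos"),
                             actual = c("neg", "pos")))

accuracy   <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
error_rate <- 1 - accuracy             # should agree with the "Training error" line

print(cm)
print(round(c(accuracy = accuracy, error_rate = error_rate), 4))
```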
  1. In Listing 12.18: Explain the SVM algorithm for regression. Explain and draw conclusions about the output as well.
Ans:
  1. In Listing 12.19: Example of SVM algorithm for classification in caret,
#load the packages
library(kernlab)
library(mlbench)
#load data
data (BostonHousing)
# fit model
fit<- ksvm(medv~., BostonHousing, kernel="rbfdot") 
#summarize the fit
print(fit)
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.10659215661543 
## 
## Number of Support Vectors : 334 
## 
## Objective Function Value : -75.1727 
## Training error : 0.090571
#make predictions
predictions <- predict(fit, BostonHousing)
#summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2) 
print (mse)
## [1] 7.6611
  1. Explain the arguments “method = “cv”” and “number = 5”
Ans:
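method = "cv" selects k-fold cross-validation, and number = 5 sets k to 5. The data are split into five folds; each fold is held out once for evaluation while the model is trained on the other four, and the five results are averaged.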

trainControl(method = "cv", number = 5)
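What five-fold cross-validation does to the data can be sketched by hand in base R. This is a simplified illustration, not caret's internal procedure (caret builds its folds with its own helper functions); n = 768 is the number of rows in the Pima dataset.

```r
set.seed(7)
n <- 768  # number of observations
k <- 5    # number of folds

# Assign every observation to one of k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

train_sizes <- integer(k)
for (i in seq_len(k)) {
  test_idx  <- which(fold_id == i)  # held-out fold used for evaluation
  train_idx <- which(fold_id != i)  # remaining folds used for fitting
  train_sizes[i] <- length(train_idx)
}
print(train_sizes)
```

Each training set holds roughly four fifths of the data, which is why the caret output for this dataset reports sample sizes of 614 and 615.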
  1. Explain and draw conclusions about the output
Ans:
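The eps-svr model with the RBF kernel (epsilon = 0.1, C = 1, sigma = 0.1066) uses 334 support vectors.
Training error: 0.090571
MSE = 7.6611, which corresponds to an RMSE of about 2.77, or roughly $2,770 since medv is recorded in $1000s.
Because the predictions are made on the same data used for fitting, this error is likely optimistic.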
  1. In Listing 12.20: Example of SVM algorithm for regression in caret,
# Listing 12.20
#load packages
library(caret)
library(mlbench)
# Load the dataset
data (BostonHousing)
#train
set.seed(7)
trainControl <- trainControl(method="cv", number=5)
fit.svmRadial <- train(medv~., data=BostonHousing, method="svmRadial", metric="RMSE",
trControl=trainControl)
#summarize fit
print(fit.svmRadial)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 407, 404, 405, 404, 404 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared   MAE     
##   0.25  4.842935  0.7527188  2.823026
##   0.50  4.303375  0.7969685  2.540088
##   1.00  3.830092  0.8333987  2.327384
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1057462
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1057462 and C = 1.
  1. Explain the arguments “method = “svmRadial”” and “metric = “RMSE””
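Ans:
method = "svmRadial": fit a support vector machine with the radial basis function (RBF) kernel.
metric = "RMSE": use the root mean squared error to choose the model; the tuning setting with the smallest RMSE is selected.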

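library(caret)
library(mlbench)
data(BostonHousing)

ctrl <- trainControl(method = "cv", number = 5)

# RMSE is a regression metric, so the response must be numeric (medv)
model <- train(medv ~ ., data = BostonHousing,
               method = "svmRadial",
               metric = "RMSE",
               trControl = ctrl)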
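To make the RMSE metric concrete, here is a small base R sketch. The actual and predicted values are made-up numbers, not results from the listing.

```r
# Made-up values, for illustration only
actual    <- c(10, 20, 30, 40)
predicted <- c(11, 19, 32, 38)

mse  <- mean((actual - predicted)^2)  # mean squared error
rmse <- sqrt(mse)                     # root mean squared error, in the units of the response

print(c(mse = mse, rmse = rmse))
```

Taking the square root puts the error back on the scale of the response, which makes RMSE easier to interpret than MSE.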
  1. Explain and draw conclusions about the output
Ans:
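The selected model has C = 1, with cross-validated RMSE = 3.83, R-squared = 0.833, and MAE = 2.33. RMSE falls steadily as C increases from 0.25 to 1, and the best value sits at the edge of the grid, so larger values of C may be worth trying.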
  1. In Listing 12.4.4: Classification and Regression Trees,
#Listing 12.4.4
#load the packages
library(rpart)
library(mlbench)
#Load the dataset
data (PimaIndiansDiabetes)
#fit model
fit <- rpart (diabetes~., data=PimaIndiansDiabetes)
#summarize the fit
print(fit)
## n= 768 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 768 268 neg (0.65104167 0.34895833)  
##     2) glucose< 127.5 485  94 neg (0.80618557 0.19381443)  
##       4) age< 28.5 271  23 neg (0.91512915 0.08487085) *
##       5) age>=28.5 214  71 neg (0.66822430 0.33177570)  
##        10) mass< 26.35 41   2 neg (0.95121951 0.04878049) *
##        11) mass>=26.35 173  69 neg (0.60115607 0.39884393)  
##          22) glucose< 99.5 55  10 neg (0.81818182 0.18181818) *
##          23) glucose>=99.5 118  59 neg (0.50000000 0.50000000)  
##            46) pedigree< 0.561 84  34 neg (0.59523810 0.40476190)  
##              92) pedigree< 0.2 21   4 neg (0.80952381 0.19047619) *
##              93) pedigree>=0.2 63  30 neg (0.52380952 0.47619048)  
##               186) pregnant>=1.5 52  21 neg (0.59615385 0.40384615)  
##                 372) pressure>=67 40  12 neg (0.70000000 0.30000000) *
##                 373) pressure< 67 12   3 pos (0.25000000 0.75000000) *
##               187) pregnant< 1.5 11   2 pos (0.18181818 0.81818182) *
##            47) pedigree>=0.561 34   9 pos (0.26470588 0.73529412) *
##     3) glucose>=127.5 283 109 pos (0.38515901 0.61484099)  
##       6) mass< 29.95 76  24 neg (0.68421053 0.31578947)  
##        12) glucose< 145.5 41   6 neg (0.85365854 0.14634146) *
##        13) glucose>=145.5 35  17 pos (0.48571429 0.51428571)  
##          26) insulin< 14.5 21   8 neg (0.61904762 0.38095238) *
##          27) insulin>=14.5 14   4 pos (0.28571429 0.71428571) *
##       7) mass>=29.95 207  57 pos (0.27536232 0.72463768)  
##        14) glucose< 157.5 115  45 pos (0.39130435 0.60869565)  
##          28) age< 30.5 50  23 neg (0.54000000 0.46000000)  
##            56) pressure>=61 40  13 neg (0.67500000 0.32500000)  
##             112) mass< 41.8 31   7 neg (0.77419355 0.22580645) *
##             113) mass>=41.8 9   3 pos (0.33333333 0.66666667) *
##            57) pressure< 61 10   0 pos (0.00000000 1.00000000) *
##          29) age>=30.5 65  18 pos (0.27692308 0.72307692) *
##        15) glucose>=157.5 92  12 pos (0.13043478 0.86956522) *
#make predictions
predictions <-predict(fit, PimaIndiansDiabetes[,1:8], type="class")
#summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
##            
## predictions neg pos
##         neg 449  72
##         pos  51 196
  1. Explain the following argument “diabetes~., “
Ans:

The response is diabetes, and the dot stands for all remaining variables in the data, which are used as covariates.
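How the dot in a formula expands can be inspected with terms() in base R. The data frame and its column names below are hypothetical.

```r
# Hypothetical data: response y plus two predictors
df <- data.frame(y  = c(1, 2, 3),
                 x1 = c(0.5, 1.5, 2.5),
                 x2 = c(3, 1, 2))

# With data supplied, the dot is replaced by every column except the response
tt <- terms(y ~ ., data = df)
predictor_names <- attr(tt, "term.labels")
print(predictor_names)
```

In the listing, diabetes ~ . expands in the same way to the eight non-response columns of PimaIndiansDiabetes.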
  1. Explain the argument, “type = “class””
Ans:

It tells predict() to return the predicted class label ("neg" or "pos") for each observation, rather than the class probabilities (type = "prob").
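The relationship between class probabilities and class labels can be sketched in base R. The probabilities are copied from three terminal nodes of the printed tree (nodes 4, 47 and 15); choosing the most probable class per row is an illustration of the idea, not rpart's own code.

```r
# Class probabilities for three terminal nodes, copied from the printed tree
probs <- rbind(
  node4  = c(neg = 0.91512915, pos = 0.08487085),
  node47 = c(neg = 0.26470588, pos = 0.73529412),
  node15 = c(neg = 0.13043478, pos = 0.86956522)
)

# Picking the most probable class in each row gives the class label
predicted_class <- colnames(probs)[apply(probs, 1, which.max)]
print(predicted_class)
```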
  1. Explain and draw conclusions about the output
Ans:
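Accuracy = (449 + 196) / 768 = 83.98%. This is measured on the training data, so it is likely an optimistic estimate of performance on new data.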