Assignment 6 – SVM
Note: If you ever forget the tuning parameters for a particular model, the modelLookup() function can be used to find them. Simply supply the method's name.
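As a minimal sketch of the note above, assuming the caret package is installed, the call below looks up the tuning parameters for the "svmRadial" method used later in this assignment. The choice of method name is only an example.

```r
library(caret)

# List the tuning parameters caret exposes for the radial-kernel SVM.
params <- modelLookup("svmRadial")
print(params[, c("parameter", "label")])
```

The same call works for any method string accepted by train(), such as "nb" or "rpart".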
Data mining refers to the process of extracting useful patterns, correlations, or other information from datasets using statistical, machine-learning, and computational tools. It can be applied to make predictions, detect fraud or security issues, or gain insights into a user base. Data mining processes can also be automated.
Ans:
Support vectors are the data points that lie closest to the optimal hyperplane. They have a direct bearing on the optimal location of the decision surface.
Ans:
The support vectors are used to identify a hyperplane (a straight line in two dimensions) that separates the classes.
Ans:
Given data can be grouped by an optimal straight line or hyperplane.
FALSE
Ans:
SVMs are sensitive to the choice of tuning options (e.g., the type of transformation to perform), which makes them harder to use and makes identifying the best model time-consuming.
Ans:
The SVM model only deals with support vectors rather than the whole training dataset.
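To illustrate that the fitted model depends on a subset of the training rows, here is a minimal sketch assuming the kernlab package is installed. The two-class subset of the built-in iris data and the linear kernel are choices made only for this illustration.

```r
library(kernlab)

# Two-class subset of the built-in iris data (an assumption for illustration).
d <- droplevels(subset(iris, Species != "setosa"))

# Linear-kernel SVM; C = 1 is kernlab's default cost.
fit <- ksvm(Species ~ ., data = d, kernel = "vanilladot", C = 1)

n_sv    <- nSV(fit)      # number of support vectors
sv_rows <- SVindex(fit)  # indices of the training rows kept as support vectors
cat("Support vectors:", n_sv, "of", nrow(d), "training rows\n")
```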
Ans:
A binary classification model classifies the data into two groups.
Ans:
Mapping the original data, which is not linearly separable, into a very high-dimensional space where the classes can be separated linearly (by a hyperplane).
Ans:
By finding the maximum margin hyperplane using a quadratic optimization approach.
Ans:
Transform the original data into a higher dimension so that it becomes linearly separable.
Ans:
The kernel trick:
A kernel function, K(), is a function that, when evaluated on two vectors of dimension p, gives the same result as the dot product of the transformations of these two vectors into a much higher dimension, r.
Instead of mapping data into a new space directly (which could be computationally expensive), the kernel function calculates:
K(x, z) = ϕ(x) ∙ ϕ(z)
where ϕ(x) is the mapping to the higher-dimensional space.
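The identity above can be checked numerically. The sketch below uses the degree-2 polynomial kernel as an assumed example (it is not the kernel used in the listings), with p = 2 and r = 3; the names phi and K are illustrative.

```r
# Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D:
# phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2)
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

# Kernel evaluated in the original 2-D space: K(x, z) = (x . z)^2
K <- function(x, z) sum(x * z)^2

x <- c(1, 2)
z <- c(3, 4)

explicit <- sum(phi(x) * phi(z))  # dot product after mapping to 3-D
kernel   <- K(x, z)               # same value without computing phi

print(c(explicit = explicit, kernel = kernel))
```

Both values come out to 121, so the kernel gives the higher-dimensional dot product while only touching the original two coordinates.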
# Listing 12.16
library(caret)
library(mlbench)
library(klaR)
#Load the dataset
data(PimaIndiansDiabetes)
# train
set.seed(7)
trainControl <- trainControl(method="cv", number=5)
fit.nb <- train(diabetes ~., data=PimaIndiansDiabetes, method="nb", metric="Accuracy", trControl=trainControl)
#summarize fit
print(fit.nb)
## Naive Bayes
##
## 768 samples
## 8 predictor
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 614, 614, 615, 614, 615
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.7564553 0.4510005
## TRUE 0.7564978 0.4452698
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
Ans:
It specifies resampling methods, performance metrics, cross-validation folds, repeated CV settings, and other options for model tuning and validation.
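As a small sketch of those options, assuming the caret package is installed, the call below sets up repeated cross-validation. The values 5 and 3 are chosen only for illustration.

```r
library(caret)

# 5-fold cross-validation repeated 3 times (values chosen for illustration).
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# The returned object is a list holding the resampling settings.
print(ctrl$method)
print(ctrl$number)
print(ctrl$repeats)
```

The resulting object is what gets passed to train() through its trControl argument.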
Ans:
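metric = "Accuracy" sets the model selection criterion: train() keeps the tuning-parameter combination with the highest resampled accuracy. For example:
library(caret)
ctrl <- trainControl(method = "cv", number = 5)
train(Species ~ ., data = iris,
      method = "rpart",
      trControl = ctrl,
      metric = "Accuracy")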
# Listing 12.17
#load the packages
library(kernlab)
library(mlbench)
#Load the dataset
data(PimaIndiansDiabetes)
#fit model
fit <- ksvm(diabetes~., data=PimaIndiansDiabetes, kernel="rbfdot")
#summarize the fit
print(fit)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.12374975684703
##
## Number of Support Vectors : 435
##
## Objective Function Value : -352.8229
## Training error : 0.175781
#make predictions
predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="response")
#summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
##
## predictions neg pos
## neg 463 98
## pos 37 170
Ans:
"rbfdot" selects the Gaussian Radial Basis Function (RBF) kernel. It is used when the relationship between the response and the covariates is non-linear.
K(x, x′) = exp(−sigma ∥x − x′∥^2)
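The formula above can be evaluated directly in base R. This is a minimal sketch; the function name rbf, the two points, and the sigma value are all illustrative assumptions.

```r
# Gaussian RBF kernel as written above: K(x, x') = exp(-sigma * ||x - x'||^2)
rbf <- function(x, z, sigma) exp(-sigma * sum((x - z)^2))

x <- c(1, 2, 3)
z <- c(2, 0, 3)
sigma <- 0.1  # illustrative value, not the sigma estimated by ksvm()

k_same <- rbf(x, x, sigma)  # identical points give the maximum value, 1
k_diff <- rbf(x, z, sigma)  # similarity decays as the points move apart

print(c(k_same = k_same, k_diff = k_diff))
```

A larger sigma makes the kernel value fall off faster with distance, so each training point influences a smaller neighbourhood.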
Ans:
Selects the first through eighth columns (the predictors) of the PimaIndiansDiabetes dataset from the "mlbench" package.
Ans:
No. of support vectors : 445
Hyperparameter : sigma = 0.1422
Accuracy = (465+175) / (465+175+35+93) = 0.833
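The accuracy arithmetic above can be reproduced from a confusion matrix in base R. In this sketch the counts come from the answer; which off-diagonal cell holds 35 and which holds 93 is an assumption, and it does not affect the accuracy.

```r
# matrix() fills column by column: column "neg" = (465, 35), column "pos" = (93, 175).
# The placement of the two error counts is assumed for illustration.
cm <- matrix(c(465, 35, 93, 175),
             nrow = 2,
             dimnames = list(predicted = c("neg", "pos"),
                             actual    = c("neg", "pos")))

# Accuracy = correctly classified (diagonal) / all observations.
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 3))
```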
Ans:
#load the packages
library(kernlab)
library(mlbench)
#load data
data (BostonHousing)
# fit model
fit<- ksvm(medv~., BostonHousing, kernel="rbfdot")
#summarize the fit
print(fit)
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.10659215661543
##
## Number of Support Vectors : 334
##
## Objective Function Value : -75.1727
## Training error : 0.090571
#make predictions
predictions <- predict(fit, BostonHousing)
#summarize accuracy
mse <- mean((BostonHousing$medv - predictions)^2)
print (mse)
## [1] 7.6611
Ans:
Five-fold cross-validation:
trainControl(method = "cv", number = 5)
Ans:
Training error : 0.091309
MSE = 7.7235
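For reference, here is a small base-R sketch of how MSE and RMSE relate. The helper names and the toy vectors are made up for illustration; only the value 7.7235 comes from the answer above.

```r
# Helper functions for the two error measures.
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

# Toy values, made up for illustration only.
actual    <- c(24.0, 21.6, 34.7, 33.4)
predicted <- c(25.0, 20.6, 33.7, 35.4)

print(mse(actual, predicted))
print(rmse(actual, predicted))

# The training MSE reported in the answer above, expressed on the RMSE scale.
print(sqrt(7.7235))
```

Taking the square root puts the error back in the units of medv, which makes it comparable to the RMSE values that caret reports.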
# Listing 12.20
#load packages
library(caret)
library(mlbench)
# Load the dataset
data (BostonHousing)
#train
set.seed(7)
trainControl <- trainControl(method="cv", number=5)
fit.svmRadial <- train(medv~., data=BostonHousing, method="svmRadial", metric="RMSE",
trControl=trainControl)
#summarize fit
print(fit.svmRadial)
## Support Vector Machines with Radial Basis Function Kernel
##
## 506 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 407, 404, 405, 404, 404
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 4.842935 0.7527188 2.823026
## 0.50 4.303375 0.7969685 2.540088
## 1.00 3.830092 0.8333987 2.327384
##
## Tuning parameter 'sigma' was held constant at a value of 0.1057462
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1057462 and C = 1.
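Ans:
"svmRadial" = Support Vector Machine with an RBF kernel.
metric = "RMSE" = use root mean squared error to choose the model; the tuning value with the smallest RMSE is selected.
library(caret)
library(mlbench)
data(BostonHousing)
# RMSE applies to a numeric response, so the example uses medv from BostonHousing
ctrl <- trainControl(method = "cv", number = 5)
model <- train(medv ~ ., data = BostonHousing,
               method = "svmRadial",
               metric = "RMSE",
               trControl = ctrl)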
Ans:
The selected model has RMSE = 3.83, R-squared = 0.833, and C = 1.
# Listing 12.4.4
#load the packages
library(rpart)
library(mlbench)
#Load the dataset
data (PimaIndiansDiabetes)
#fit model
fit <- rpart (diabetes~., data=PimaIndiansDiabetes)
#summarize the fit
print(fit)
## n= 768
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 768 268 neg (0.65104167 0.34895833)
## 2) glucose< 127.5 485 94 neg (0.80618557 0.19381443)
## 4) age< 28.5 271 23 neg (0.91512915 0.08487085) *
## 5) age>=28.5 214 71 neg (0.66822430 0.33177570)
## 10) mass< 26.35 41 2 neg (0.95121951 0.04878049) *
## 11) mass>=26.35 173 69 neg (0.60115607 0.39884393)
## 22) glucose< 99.5 55 10 neg (0.81818182 0.18181818) *
## 23) glucose>=99.5 118 59 neg (0.50000000 0.50000000)
## 46) pedigree< 0.561 84 34 neg (0.59523810 0.40476190)
## 92) pedigree< 0.2 21 4 neg (0.80952381 0.19047619) *
## 93) pedigree>=0.2 63 30 neg (0.52380952 0.47619048)
## 186) pregnant>=1.5 52 21 neg (0.59615385 0.40384615)
## 372) pressure>=67 40 12 neg (0.70000000 0.30000000) *
## 373) pressure< 67 12 3 pos (0.25000000 0.75000000) *
## 187) pregnant< 1.5 11 2 pos (0.18181818 0.81818182) *
## 47) pedigree>=0.561 34 9 pos (0.26470588 0.73529412) *
## 3) glucose>=127.5 283 109 pos (0.38515901 0.61484099)
## 6) mass< 29.95 76 24 neg (0.68421053 0.31578947)
## 12) glucose< 145.5 41 6 neg (0.85365854 0.14634146) *
## 13) glucose>=145.5 35 17 pos (0.48571429 0.51428571)
## 26) insulin< 14.5 21 8 neg (0.61904762 0.38095238) *
## 27) insulin>=14.5 14 4 pos (0.28571429 0.71428571) *
## 7) mass>=29.95 207 57 pos (0.27536232 0.72463768)
## 14) glucose< 157.5 115 45 pos (0.39130435 0.60869565)
## 28) age< 30.5 50 23 neg (0.54000000 0.46000000)
## 56) pressure>=61 40 13 neg (0.67500000 0.32500000)
## 112) mass< 41.8 31 7 neg (0.77419355 0.22580645) *
## 113) mass>=41.8 9 3 pos (0.33333333 0.66666667) *
## 57) pressure< 61 10 0 pos (0.00000000 1.00000000) *
## 29) age>=30.5 65 18 pos (0.27692308 0.72307692) *
## 15) glucose>=157.5 92 12 pos (0.13043478 0.86956522) *
#make predictions
predictions <-predict(fit, PimaIndiansDiabetes[,1:8], type="class")
#summarize accuracy
table(predictions, PimaIndiansDiabetes$diabetes)
##
## predictions neg pos
## neg 449 72
## pos 51 196
Ans:
The response is diabetes; all remaining variables are covariates.
Ans:
type="class" tells predict() to return the predicted class label ("neg" or "pos") for each observation rather than the class probabilities.
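The difference between the prediction types can be seen in a short sketch. The built-in iris data is swapped in here so the example needs only the rpart package; the three rows picked for new_obs are arbitrary.

```r
library(rpart)

# Built-in iris data swapped in so the sketch needs no extra packages.
fit <- rpart(Species ~ ., data = iris)
new_obs <- iris[c(1, 51, 101), 1:4]

pred_class <- predict(fit, new_obs, type = "class")  # factor of class labels
pred_prob  <- predict(fit, new_obs, type = "prob")   # matrix of class probabilities

print(pred_class)
print(pred_prob)
```

With type = "prob" each row holds one probability per class and sums to 1; type = "class" reports the label with the highest probability.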
Ans:
Accuracy = 83.98%