set.seed(1)
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1 * (x1 ^ 2 - x2 ^ 2 > 0)
plot(x1,x2, col = ifelse(y == 1, 'green', 'blue'))
legend(x = 'top',legend = c('True', 'False'), col = c('green', 'blue'), lty = 1)
logit = glm(formula = y ~ x1 + x2, family = 'binomial')
preds = predict(logit, data.frame(x1,x2), type = 'response')
preds = round(preds)
plot(x1,x2, col = ifelse(preds == 1, 'green', 'blue'))
legend(x = 'top',legend = c('True', 'False'), col = c('green', 'blue'), lty = 1)
logit = glm(formula = y ~ I(x1 ^ 2) + I((x1 * x2) ^ 2), family = 'binomial')
preds = predict(logit, data.frame(x1, x2), type = 'response')
preds = round(preds)
plot(x1,x2, col = ifelse(preds == 1, 'green', 'blue'))
legend(x = 'top',legend = c('True', 'False'), col = c('green', 'blue'), lty = 1)
sum(preds == y) / 500
## [1] 0.904
library(caret)  # provides trainControl() and train()
dat = data.frame(y = as.factor(y), x1, x2)
ctrl = trainControl(method = 'repeatedcv', number = 10, repeats = 3)
svm_linear = train(y ~ ., data = dat, method = 'svmLinear', trControl = ctrl, preProcess = c('center', 'scale'))
# predict() needs the predictor columns as newdata, not the response vector
preds = predict(svm_linear, dat)
plot(x1,x2, col = ifelse(preds == 1, 'green', 'blue'))
legend(x = 'top',legend = c('True', 'False'), col = c('green', 'blue'), lty = 1)
dat = data.frame(as.factor(y), x1, x2)
names(dat) = c('y', 'x1', 'x2')
ctrl = trainControl(method = 'repeatedcv', number = 10, repeats = 3)
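# The argument is spelled tuneLength; the run recorded below used the misspelling
# tuneLEngth, so it shows only the default three cost values
svm_radial = train(y ~ ., data = dat, method = 'svmRadial', trControl = ctrl, tuneLength = 20)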
svm_radial
## Support Vector Machines with Radial Basis Function Kernel
##
## 500 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 450, 450, 450, 450, 450, 449, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9361144 0.8714461
## 0.50 0.9473824 0.8943089
## 1.00 0.9573709 0.9143917
##
## Tuning parameter 'sigma' was held constant at a value of 1.069861
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 1.069861 and C = 1.
# predict() needs the predictor columns as newdata, not the response vector
preds = predict(svm_radial, dat)
plot(x1,x2, col = ifelse(preds == 1, 'green', 'blue'))
legend(x = 'top',legend = c('True', 'False'), col = c('green', 'blue'), lty = 1)
The linear models cannot separate the classes at all. The linear SVM simply predicts False for every observation. The radial SVM reaches a cross-validated accuracy of about 95.7%, and even the logit model with non-linear terms exceeds 90% training accuracy (0.904). However, finding the exact non-linear relationship between x1, x2, and y by hand would be difficult, so the SVM is much easier to apply than manually searching for transformations that make the logit model accurate.
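To illustrate how much the logit model depends on choosing the right transformations, here is a minimal sketch that regenerates the same simulated data and fits a logit on x1^2 and x2^2. It assumes we already know the boundary is quadratic in each variable separately, which is exactly the knowledge we would not have in practice. The name quad_logit is only for illustration. Because these terms separate the classes almost perfectly, glm() is expected to emit convergence warnings, which are suppressed here.

```r
set.seed(1)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)

# Assumption: the squared terms are chosen with knowledge of the true rule
quad_logit <- suppressWarnings(
  glm(y ~ I(x1^2) + I(x2^2), family = "binomial")
)

# Classify with a 0.5 threshold on the fitted probabilities
quad_preds <- as.numeric(predict(quad_logit, type = "response") > 0.5)

mean(quad_preds == y)                        # training accuracy
table(predicted = quad_preds, actual = y)    # confusion matrix
```

The radial SVM gets close to this without being told which transformations to use.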
library(ISLR)   # Auto and OJ data sets
library(e1071)  # svm() and tune()
data('Auto')
y1 = as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
dat = data.frame(y1, Auto)
The cost with the lowest cross-validation error is 1. However, the cross-validation error varies little across costs: all values fall between 0.0996 and 0.12.
set.seed(1)
svm_linear = tune(svm, y1 ~ . - mpg, data = dat, kernel = "linear", ranges = list(cost=seq(1,100,1)))
plot(svm_linear$performance[, c(1, 2)], type = 'l')
svm_linear
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.09961538
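The range of cross-validation errors quoted above can be read directly from the performance data frame that tune() returns. The following is a small sketch on simulated data (the toy data frame and its columns are stand-ins, not the Auto variables), using a much shorter cost grid to keep it quick.

```r
library(e1071)  # svm() and tune()

set.seed(1)
# Simulated two-class data standing in for the real predictors
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, 1, 0))

fit <- tune(svm, y ~ ., data = toy, kernel = "linear",
            ranges = list(cost = c(0.1, 1, 10)))

perf <- fit$performance          # one row per cost: cost, error, dispersion
range(perf$error)                # smallest and largest cross-validation error
perf[which.min(perf$error), ]    # row for the selected cost
```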
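The polynomial SVM improves on the linear support vector classifier: the cross-validation error drops below 9% (0.0842), although the selected degree is 1. The radial SVM does better still, with a cross-validation error below 8% (0.0789). I would recommend the radial SVM with a cost of 2 and a gamma of 1.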
set.seed(1)
# Named list of parameter vectors, so that every cost/degree combination is tried
param = list(cost = seq(1, 20, 1), degree = seq(1, 5, 1))
svm_poly = tune(svm, y1 ~ . - mpg, data = dat, ranges = param, kernel = 'polynomial')
svm_poly
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 7 1
##
## - best performance: 0.08416667
set.seed(1)
# Named list of parameter vectors, so that every cost/gamma combination is tried
param = list(cost = seq(1, 20, 1), gamma = c(.01, .1, 1, 10))
svm_radial = tune(svm, y1 ~ . - mpg, data = dat, ranges = param, kernel = 'radial')
svm_radial
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 2 1
##
## - best performance: 0.07891026
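The e1071 documentation describes the ranges argument of tune() as a named list of parameter vectors that is expanded into a grid. Assuming that behaviour, the size of the radial search above can be previewed with base R's expand.grid(), which helps when judging how long a tuning run will take.

```r
# Every combination of the cost and gamma values used for the radial kernel
grid <- expand.grid(cost = seq(1, 20, 1), gamma = c(0.01, 0.1, 1, 10))

nrow(grid)   # 20 costs * 4 gammas
head(grid)   # cost varies fastest, gamma slowest
```

With 10-fold cross-validation, each row of this grid corresponds to ten model fits.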
I was not able to finish this problem. I spent hours trying to find a way to plot caret SVM models, and hours more trying to slice the data set so that the other predictors are held constant and only two predictors are plotted at a time. Neither approach worked, and neither did retraining the SVM with other packages in order to plot it another way.
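One possible way around this, sketched here on simulated data rather than on the Auto set, is to fit the model with e1071's svm() directly instead of through caret and use the plot method for svm objects. That method takes a two-variable formula plus a slice list giving the values at which the remaining predictors are held constant. The toy data frame and the choice of the median as the slice value are assumptions for illustration.

```r
library(e1071)  # svm() and its plot method

set.seed(1)
# Three simulated predictors; the class depends on all of them
n <- 150
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- factor(ifelse(toy$x1^2 + toy$x2^2 + 0.5 * toy$x3 > 1.5, 1, 0))

fit <- svm(y ~ ., data = toy, kernel = "radial", cost = 1, gamma = 1)

# Decision regions in the (x2, x1) plane with x3 held at its median
plot(fit, toy, x1 ~ x2, slice = list(x3 = median(toy$x3)))
```

Repeating the plot call with a different formula and slice gives one panel per pair of predictors.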
data('OJ')
set.seed(1)
trainset = createDataPartition(OJ$Purchase, p =.746, list = FALSE)
train = OJ[trainset,]
test = OJ[-trainset,]
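If an exact training size is wanted instead of a proportion, a simple alternative to createDataPartition() is to sample row indices directly. This is only a sketch: toy is a placeholder data frame, and its 1070 rows are an assumed stand-in for nrow(OJ). Unlike createDataPartition(), this split is not stratified by Purchase.

```r
set.seed(1)
n <- 1070                              # assumed number of rows, stand-in for nrow(OJ)
toy <- data.frame(id = seq_len(n))     # placeholder for the real data frame

train_idx <- sample(n, 800)            # exactly 800 training rows
train_toy <- toy[train_idx, , drop = FALSE]
test_toy <- toy[-train_idx, , drop = FALSE]

c(train = nrow(train_toy), test = nrow(test_toy))
```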
The classifier uses 626 support vectors: 314 from the CH class and 312 from the MM class. That is the majority of the 800 training observations (626/800, about 78%).
svc_linear = svm(Purchase ~ ., data = train, kernel="linear", cost=0.01,scale=FALSE)
summary(svc_linear)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 626
##
## ( 314 312 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
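The support vector counts shown by summary() are also stored on the fitted object, which makes it easy to compare them across costs. The sketch below uses simulated data with made-up CH/MM labels, not the OJ data. A small cost typically gives a wider margin and more support vectors, but the exact counts depend on the data.

```r
library(e1071)  # svm()

set.seed(1)
# Simulated, noisy two-class data with CH/MM labels for illustration only
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "CH", "MM"))

small_cost <- svm(y ~ ., data = toy, kernel = "linear", cost = 0.01)
large_cost <- svm(y ~ ., data = toy, kernel = "linear", cost = 10)

small_cost$tot.nSV   # total number of support vectors
small_cost$nSV       # support vectors per class
large_cost$tot.nSV   # total for the larger cost, for comparison
```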
This has a 25.63% training error rate and 21.48% test error rate.
preds = predict(svc_linear, train)
1 - sum(preds == train$Purchase)/nrow(train)
## [1] 0.25625
preds = predict(svc_linear, test)
1 - sum(preds == test$Purchase)/nrow(test)
## [1] 0.2148148
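The same error calculation is repeated for every model in this problem, so it could be wrapped in a small helper. The function name error_rate is my own choice for this sketch, and the four labels are invented just to show the arithmetic.

```r
# Misclassification rate: share of predictions that differ from the truth
error_rate <- function(predicted, actual) mean(predicted != actual)

actual <- factor(c("CH", "CH", "MM", "MM"))
predicted <- factor(c("CH", "MM", "MM", "MM"), levels = levels(actual))

error_rate(predicted, actual)   # 1 of 4 predictions is wrong: 0.25
```

With the objects above it would be called as error_rate(predict(svc_linear, test), test$Purchase).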
Tuning selects a cost of 10, with a cross-validation error of 0.18125.
svc_linear_tuned = tune(svm, Purchase ~ ., data = train, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10)))
svc_linear_tuned
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.18125
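Since the costs of interest span three orders of magnitude, from 0.01 to 10, one option is a log-spaced grid rather than a hand-typed list. This is a sketch of that alternative, not the grid used for the results above, and ten grid points is an arbitrary choice.

```r
# Ten costs evenly spaced on a log10 scale between 0.01 and 10
costs <- 10^seq(-2, 1, length.out = 10)
round(costs, 3)
```

The vector could then be passed to tune() as ranges = list(cost = costs).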
The new training error is 16.25% and the new test error is 15.56%.
preds = predict(svc_linear_tuned$best.model, train)
1 - sum(preds == train$Purchase)/nrow(train)
## [1] 0.1625
preds = predict(svc_linear_tuned$best.model, test)
1 - sum(preds == test$Purchase)/nrow(test)
## [1] 0.1555556
The training and test error for the best tuned radial svm are 15.125% and 17.04%, respectively.
svm_radial = svm(Purchase ~ ., data = train, kernel = "radial",scale=FALSE)
svm_radial_tuned = tune(svm, Purchase ~ ., data = train, kernel = "radial", ranges = list(cost = c(0.01, 0.1, 1, 5, 10)))
preds = predict(svm_radial_tuned$best.model, train)
1 - sum(preds == train$Purchase)/nrow(train)
## [1] 0.15125
preds = predict(svm_radial_tuned$best.model, test)
1 - sum(preds == test$Purchase)/nrow(test)
## [1] 0.1703704
The training error and test error for the best tuned poly svm are 14% and 17.78%, respectively.
set.seed(1)
svm_poly = svm(Purchase ~ ., data = train, kernel = "poly", degree = 2, scale=FALSE)
svm_poly_tuned = tune(svm, Purchase ~ ., data = train, kernel = "poly", degree = 2, ranges = list(cost = c(0.01, 0.1, 1, 5, 10)))
preds = predict(svm_poly_tuned$best.model, train)
1 - sum(preds == train$Purchase)/nrow(train)
## [1] 0.14
preds = predict(svm_poly_tuned$best.model, test)
1 - sum(preds == test$Purchase)/nrow(test)
## [1] 0.1777778
Overall, I recommend the linear support vector classifier. It has the lowest test error rate (15.56%), it is simpler to train, and it shows no sign of overfitting: its training and test error rates are similar (16.25% and 15.56%).
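The comparison behind this recommendation can be laid out in one table. The sketch below simply re-enters the error rates reported above for the three tuned models and adds the gap between test and training error as a rough overfitting check. The column names are my own.

```r
# Error rates copied from the tuned-model results reported above
results <- data.frame(
  model = c("linear", "radial", "polynomial (degree 2)"),
  train_error = c(0.1625, 0.15125, 0.14),
  test_error = c(0.1555556, 0.1703704, 0.1777778)
)

# Positive gap: the model does worse on the test set than on the training set
results$gap <- results$test_error - results$train_error

results[order(results$test_error), ]   # best test error first
```

The linear model comes first in this ordering and is the only one whose test error is not above its training error.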