x1 = c(3, 2, 4, 1, 2, 4, 4)
x2 = c(4, 2, 4, 4, 1, 3, 1)
colors = c("red", "red", "red", "red", "blue", "blue", "blue")
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
We need a hyperplane that separates the blue points from the red points. It is clear that a line with positive slope can pass above the three blue points and below the four red points.
A simple choice is the line through (2, 1.5) and (4, 3.5), the midpoints between the nearest red and blue points. Its slope is
\[\frac{Y_2 - Y_1}{X_2 - X_1} = \frac{3.5 - 1.5}{4 - 2} = \frac{2}{2} = 1\]
The intercept is -0.5, so the equation of the line is
\[X_2 = X_1 - 0.5, \quad \text{or equivalently} \quad X_1 - X_2 - 0.5 = 0.\]
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-0.5, 1)
Classify to Red if \(X_1 - X_2 - 0.5 < 0\); classify to Blue otherwise.
The coefficients are \(\beta_0 = -0.5\), \(\beta_1 = 1\), \(\beta_2 = -1\).
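As a quick sanity check (a small sketch using only the vectors defined above), evaluating \(X_1 - X_2 - 0.5\) at each training point should give a negative value for every red point and a positive value for every blue point:
# evaluate the decision function X1 - X2 - 0.5 at the seven training points
f <- x1 - x2 - 0.5
data.frame(x1, x2, colors, f)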
The maximal margin hyperplane is the separating hyperplane that is farthest from the training observations. Its margin is the distance from the hyperplane to the closest training observations (the support vectors). Using, for example, the point (2, 1):
\[\frac{|\beta_0 + \beta_1 X_1 + \beta_2 X_2|}{\sqrt{\beta_1^2 + \beta_2^2}} = \frac{|-0.5 + 2 - 1|}{\sqrt{1^2 + (-1)^2}} = \frac{0.5}{\sqrt{2}} = \frac{1}{2\sqrt{2}} \approx 0.354\]
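As another check (a small sketch using the coefficients above), the distance from each of the four closest training points, (2, 1), (2, 2), (4, 3) and (4, 4), to the hyperplane should come out to the same value, \(1/(2\sqrt{2}) \approx 0.354\):
# distance from each of these points to the line X1 - X2 - 0.5 = 0
beta0 <- -0.5; beta1 <- 1; beta2 <- -1
sv <- rbind(c(2, 1), c(2, 2), c(4, 3), c(4, 4))
abs(beta0 + beta1 * sv[, 1] + beta2 * sv[, 2]) / sqrt(beta1^2 + beta2^2)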
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-0.5, 1, col="purple")
abline(-1, 1, lty = 2)
abline(0, 1, lty = 2)
The maximal margin hyperplane is completely determined by the support vectors at the points (2, 1), (2, 2), (4, 3) and (4, 4). A support vector, by definition, is a point such that a small movement of that point would change the maximal margin hyperplane.
Because the point (4,1) is not a support vector, it does not affect the maximal margin hyperplane. Thus, a slight movement of it would not change the hyperplane.
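One way to confirm this is a small sketch with the e1071 package (loaded later in this document): fitting a linear SVM with a very large cost approximates the maximal margin classifier on separable data. The support vectors it reports should come from the four margin points above (a solver may give one of them zero weight), and (4, 1) should not be among them. The object names toy and toy.fit are just for this illustration:
library(e1071)
toy <- data.frame(x1, x2, y = as.factor(colors))  # the seven training points
toy.fit <- svm(y ~ x1 + x2, data = toy, kernel = "linear", cost = 1e5, scale = FALSE)
toy[toy.fit$index, ]  # observations used as support vectors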
This is an example of a hyperplane that is not optimal but is still a separating hyperplane.
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-1.4, 1.3)
Once the red point (4, 0) is added, the data become non-separable, so no hyperplane can separate the two classes.
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
points(c(4), c(0), col = c("red"))
set.seed(1234)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
plot(x1,x2, xlab="X1", ylab="X2", col=(3-y))
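Since \(y = 1\) exactly when \(x_1^2 > x_2^2\), the true class boundary is the set where \(x_1^2 = x_2^2\), i.e., the two lines \(x_2 = x_1\) and \(x_2 = -x_1\). If the plot above is still the active graphics device, they can be overlaid for reference:
# overlay the true boundary x1^2 = x2^2 (the lines x2 = x1 and x2 = -x1)
abline(0, 1, lty = 2)
abline(0, -1, lty = 2)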
From the summary below, we see that neither x1 nor x2 has a statistically significant coefficient.
logit.fit <- glm(y ~ x1 + x2, family = "binomial")
summary(logit.fit)
##
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.161 -1.107 -1.072 1.243 1.308
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16814 0.08985 -1.871 0.0613 .
## x1 0.15254 0.31736 0.481 0.6308
## x2 -0.12672 0.30048 -0.422 0.6732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 689.62 on 499 degrees of freedom
## Residual deviance: 689.21 on 497 degrees of freedom
## AIC: 695.21
##
## Number of Fisher Scoring iterations: 3
(Code for this part was adapted from an online source.)
The resulting decision boundary is clearly linear.
data <- data.frame(x1 = x1, x2 = x2, y = y)
probs <- predict(logit.fit, data, type = "response")
preds <- rep(0, 500)  # one prediction per observation (n = 500)
preds[probs > 0.45] <- 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = "red", xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = "green")
# note: log(x2) is undefined for x2 <= 0, which causes the NaN warnings and dropped observations below
logitmodel <- glm(y ~ I(x1^2) + I(x2^2) + I(x1*x2) + log(x2), family = "binomial")
## Warning in log(x2): NaNs produced
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logitmodel)
##
## Call:
## glm(formula = y ~ I(x1^2) + I(x2^2) + I(x1 * x2) + log(x2), family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.52e-04 -2.00e-08 -2.00e-08 2.00e-08 1.07e-03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 67.01 8316.03 0.008 0.994
## I(x1^2) 17314.70 704444.29 0.025 0.980
## I(x2^2) -17682.91 706122.37 -0.025 0.980
## I(x1 * x2) -164.00 110206.33 -0.001 0.999
## log(x2) 17.70 1931.92 0.009 0.993
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3.5979e+02 on 260 degrees of freedom
## Residual deviance: 2.1825e-06 on 256 degrees of freedom
## (239 observations deleted due to missingness)
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
Despite the warnings, this model fits the usable observations almost perfectly: the residual deviance is essentially zero and the fitted probabilities are pushed to 0 or 1, which is why the standard errors blow up and none of the coefficients appears individually significant.
probs <- predict(logitmodel, data, type = "response")
## Warning in log(x2): NaNs produced
preds <- rep(0, 500)
preds[probs > 0.47] <- 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = "red", xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = "green")
With a linear kernel and a small cost (0.01), this support vector classifier assigns every observation to the same class.
require(e1071)
## Loading required package: e1071
## Warning: package 'e1071' was built under R version 3.3.2
data$y <- as.factor(data$y)
svm.fit <- svm(y ~ x1 + x2, data, kernel = "linear", cost = 0.01)
preds <- predict(svm.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = "red", xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = "green")
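As a quick check of the claim above (using the preds object from this chunk), tabulating the predictions should show every observation in a single class:
table(preds)  # the predictions should all fall in one class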
The predictions below look similar to the true classes plotted in Part B.
data$y <- as.factor(data$y)
svmnl.fit <- svm(y ~ x1 + x2, data, kernel = "radial", gamma = 1)
preds <- predict(svmnl.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = "green", xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = "red")
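To quantify how close the radial-kernel predictions are to the true classes (again just a quick check with the objects from this chunk), a confusion matrix can be tabulated:
# compare radial-kernel predictions with the true classes
table(predicted = preds, truth = data$y)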
From Part H, we see that the SVM with a non-linear (radial) kernel is a good model for finding non-linear decision boundaries, and its predictions come closest to the true decision boundary.
From Parts F and G, logistic regression with the interaction terms and an SVM with a linear kernel are both not very good at predicting here and do not handle the non-linear boundary well.
Using an SVM would be my preferred method for building this model because we only need to tune gamma (and the cost), while logistic regression requires checking various interaction terms by hand.
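For example, a minimal sketch of how cost and gamma could be tuned by cross-validation with e1071's tune() (the candidate values below are illustrative only):
svm.tune <- tune(svm, y ~ x1 + x2, data = data, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
summary(svm.tune)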
require(ISLR)
## Loading required package: ISLR
set.seed(1234)
train <- sample(1:nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
The summary below confirms that there are 2 classes (CH and MM) and that the classifier uses 448 support vectors out of the 800 training observations, split almost evenly between the classes (225 and 223).
library(e1071)
table(OJ$STORE, as.factor(OJ$STORE))  # counts for each STORE level appear on the diagonal
##
## 0 1 2 3 4
## 0 356 0 0 0 0
## 1 0 157 0 0 0
## 2 0 0 222 0 0
## 3 0 0 0 196 0
## 4 0 0 0 0 139
OJ$STORE <- as.factor(OJ$STORE)
OJ$Store7 <- NULL
OJ$StoreID <- NULL
OJ.train <- OJ[train,]
OJ.test <- OJ[-train,]
svmOJ <- svm(Purchase~.,data=OJ.train,kernel='linear',cost=0.01)
summary(svmOJ)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear",
## cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
## gamma: 0.05263158
##
## Number of Support Vectors: 448
##
## ( 225 223 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
The training error rate is 16.5%
The testing error rate is 15.9%
svmOJtrainingprediction <- predict(svmOJ, newdata=OJ.train)
table(OJ.train$Purchase, svmOJtrainingprediction)
## svmOJtrainingprediction
## CH MM
## CH 433 56
## MM 76 235
1 - (433 + 235) / 800
## [1] 0.165
svmOJtestingprediction <- predict(svmOJ, newdata=OJ.test)
table(OJ.test$Purchase, svmOJtestingprediction)
## svmOJtestingprediction
## CH MM
## CH 151 13
## MM 30 76
1 - (151+76)/ 270
## [1] 0.1592593
The cross-validation results below choose cost = 0.01 as the best value, with an error of 0.17125.
svmOJtune <- tune(svm, Purchase~.,data=OJ.train, ranges= list(cost=c(.01,.02,.05,.1,.2,.5,1,2,5,10)), kernel='linear')
summary(svmOJtune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.17125
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.17125 0.03775377
## 2 0.02 0.17375 0.03557562
## 3 0.05 0.17250 0.03944053
## 4 0.10 0.17125 0.03955042
## 5 0.20 0.17375 0.04143687
## 6 0.50 0.17250 0.04073969
## 7 1.00 0.17500 0.03996526
## 8 2.00 0.17375 0.04185375
## 9 5.00 0.17500 0.03773077
## 10 10.00 0.17250 0.03717451
The training error rate is again 16.5% and the testing error rate is again 15.9%; cross-validation picked the same cost (0.01) we started with, so the best model is unchanged.
svmOJbesttrain <- predict(svmOJtune$best.model,newdata=OJ.train)
tablesvmtunetrain <- table(OJ.train$Purchase, svmOJbesttrain)
1-sum(diag(tablesvmtunetrain))/sum(tablesvmtunetrain)
## [1] 0.165
1-(433+235)/800
## [1] 0.165
svmOJbesttest <-predict(svmOJtune$best.model,newdata = OJ.test)
table(OJ.test$Purchase, svmOJbesttest)
## svmOJbesttest
## CH MM
## CH 151 13
## MM 30 76
1 - (151+76)/ 270
## [1] 0.1592593
With the radial kernel and cost = 0.01 (gamma at its default), the training error is 0.38875 and the testing error is 0.39259.
After tuning, the best cost is 1 (gamma stays at its default of 0.0526), giving a training error of 0.15375 and a testing error of 0.17778.
table(OJ$STORE,as.factor(OJ$STORE))
##
## 0 1 2 3 4
## 0 356 0 0 0 0
## 1 0 157 0 0 0
## 2 0 0 222 0 0
## 3 0 0 0 196 0
## 4 0 0 0 0 139
OJ$STORE <- as.factor(OJ$STORE)
OJ$Store7 <- NULL
OJ$StoreID <- NULL
OJ.train <- OJ[train,] # redo the train test split for the modified data...
OJ.test <- OJ[-train,]
svmOJradial <- svm(Purchase~.,data=OJ.train,kernel='radial',cost=0.01)
summary(svmOJradial)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial",
## cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.01
## gamma: 0.05263158
##
## Number of Support Vectors: 623
##
## ( 312 311 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
svmOJtrainingprediction <- predict(svmOJradial, newdata=OJ.train)
svmOJtrainradialtable <- table(OJ.train$Purchase, svmOJtrainingprediction)
#Training Error
1-sum(diag(svmOJtrainradialtable))/sum(svmOJtrainradialtable)
## [1] 0.38875
svmOJtestingprediction <- predict(svmOJradial, newdata=OJ.test)
svmOJtestradialtable <- table(OJ.test$Purchase, svmOJtestingprediction)
#Testing Error
1-sum(diag(svmOJtestradialtable))/sum(svmOJtestradialtable)
## [1] 0.3925926
# note: kernel='radial' is given inside ranges (with a single value), so it also appears in the results table below
svmradial.tune <- tune(svm , Purchase~. , data=OJ.train , ranges=list(cost=c(.01,.02,.05,.1,.2,.5,1,2,5,10),kernel='radial'))
summary(svmradial.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost kernel
## 1 radial
##
## - best performance: 0.1775
##
## - Detailed performance results:
## cost kernel error dispersion
## 1 0.01 radial 0.38875 0.07032909
## 2 0.02 radial 0.38875 0.07032909
## 3 0.05 radial 0.22250 0.04594683
## 4 0.10 radial 0.17875 0.05104804
## 5 0.20 radial 0.17875 0.05529278
## 6 0.50 radial 0.17750 0.04993051
## 7 1.00 radial 0.17750 0.04816061
## 8 2.00 radial 0.18250 0.05109903
## 9 5.00 radial 0.18250 0.05470883
## 10 10.00 radial 0.18500 0.05706965
svmOJbesttrainradial<- predict(svmradial.tune$best.model,newdata=OJ.train)
tablesvmtunetrainradial <- table(OJ.train$Purchase, svmOJbesttrainradial)
#Training Error
1-sum(diag(tablesvmtunetrainradial))/sum(tablesvmtunetrainradial)
## [1] 0.15375
svmOJbesttestradial <-predict(svmradial.tune$best.model,newdata = OJ.test)
tablesvmtunetestradial <- table(OJ.test$Purchase, svmOJbesttestradial)
#Testing Error
1-sum(diag(tablesvmtunetestradial))/sum(tablesvmtunetestradial)
## [1] 0.1777778
With the polynomial kernel (degree 2) and cost = 0.01, the training error is 0.38875 and the testing error is 0.39259.
After tuning, the best cost is 10 (cross-validation error 0.185), still with degree 2, giving a training error of 0.15625 and a testing error of 0.1962963.
table(OJ$STORE,as.factor(OJ$STORE))
##
## 0 1 2 3 4
## 0 356 0 0 0 0
## 1 0 157 0 0 0
## 2 0 0 222 0 0
## 3 0 0 0 196 0
## 4 0 0 0 0 139
OJ$STORE <- as.factor(OJ$STORE)
OJ$Store7 <- NULL
OJ$StoreID <- NULL
OJ.train <- OJ[train,]
OJ.test <- OJ[-train,]
svmOJpoly <- svm(Purchase~.,data=OJ.train,kernel='polynomial',degree=2, cost=0.01)
summary(svmOJpoly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2, cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.01
## degree: 2
## gamma: 0.05263158
## coef.0: 0
##
## Number of Support Vectors: 625
##
## ( 314 311 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
svmOJtrainingprediction <- predict(svmOJpoly, newdata=OJ.train)
svmOJtrainpolytable <- table(OJ.train$Purchase, svmOJtrainingprediction)
#Training Error
1-sum(diag(svmOJtrainpolytable))/sum(svmOJtrainpolytable)
## [1] 0.38875
svmOJtestingprediction <- predict(svmOJpoly, newdata=OJ.test)
svmOJtestpolytable <- table(OJ.test$Purchase, svmOJtestingprediction)
#Testing Error
1-sum(diag(svmOJtestpolytable))/sum(svmOJtestpolytable)
## [1] 0.3925926
svmpoly.tune<- tune(svm , Purchase~. , data=OJ.train, ranges=list(cost=c(.01,.02,.05,.1,.2,.5,1,2,5,10),kernel='polynomial', degree =2))
summary(svmpoly.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost kernel degree
## 10 polynomial 2
##
## - best performance: 0.185
##
## - Detailed performance results:
## cost kernel degree error dispersion
## 1 0.01 polynomial 2 0.38875 0.06108112
## 2 0.02 polynomial 2 0.36375 0.07647848
## 3 0.05 polynomial 2 0.35750 0.07364517
## 4 0.10 polynomial 2 0.35000 0.07120003
## 5 0.20 polynomial 2 0.31750 0.05596378
## 6 0.50 polynomial 2 0.26625 0.04411554
## 7 1.00 polynomial 2 0.23375 0.04678927
## 8 2.00 polynomial 2 0.21750 0.04794383
## 9 5.00 polynomial 2 0.18750 0.04677072
## 10 10.00 polynomial 2 0.18500 0.04281744
svmOJbesttrainpoly<- predict(svmpoly.tune$best.model,newdata=OJ.train)
tablesvmtunetrainpoly <- table(OJ.train$Purchase, svmOJbesttrainpoly)
#Training Error
1-sum(diag(tablesvmtunetrainpoly))/sum(tablesvmtunetrainpoly)
## [1] 0.15625
svmOJbesttestpoly <-predict(svmpoly.tune$best.model,newdata = OJ.test)
tablesvmtunetestpoly <- table(OJ.test$Purchase, svmOJbesttestpoly)
#Testing Error
1-sum(diag(tablesvmtunetestpoly))/sum(tablesvmtunetestpoly)
## [1] 0.1962963
Of the three approaches, the tuned polynomial kernel shows the biggest improvement, but the lowest testing error rate actually belongs to the linear kernel (15.9%, versus 17.8% for the radial kernel and 19.6% for the polynomial kernel), so by test error the linear-kernel SVM does the best job here.
From Part G, we see that the training and testing errors for the SVM with the polynomial kernel drop dramatically (by roughly 20 percentage points) once an appropriate cost is chosen.
The linear kernel can look confusing at first because tuning does not seem to help: cross-validation selected cost = 0.01, the same value we started with, so the training and testing error rates are unchanged after tuning.
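As a compact way to see this comparison (a small sketch; the numbers are just the tuned test error rates computed above):
# tuned test error rates for the three kernels above
data.frame(kernel = c("linear", "radial", "polynomial"),
           test.error = c(0.1593, 0.1778, 0.1963))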