x1 = c(3, 2, 4, 1, 2, 4, 4)
x2 = c(4, 2, 4, 4, 1, 3, 1)
colors = c("red", "red", "red", "red", "blue", "blue", "blue")
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
We need a hyperplane that separates the blue points from the red points. It is clear that a line with positive slope can pass above the three blue points and below the four red points.
A simple choice is the line through (2, 1.5) and (4, 3.5), the midpoints between the nearest red and blue points. Its slope is
\[\frac{Y_2 - Y_1}{X_2 - X_1} = \frac{3.5 - 1.5}{4 - 2} = \frac{2}{2} = 1\]
The intercept is -0.5, so the equation of the line is
\[X_2 = X_1 - 0.5, \quad \text{or equivalently} \quad X_1 - X_2 - 0.5 = 0.\]
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-0.5, 1)
Classify to Red if \(X_1 - X_2 - 0.5 < 0\); classify to Blue otherwise.
The coefficients are \(\beta_0 = -0.5\), \(\beta_1 = 1\), \(\beta_2 = -1\).
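As a quick sanity check (a small sketch using only the vectors defined above), evaluating \(X_1 - X_2 - 0.5\) at each training point should give a negative value for every red point and a positive value for every blue point:
# evaluate the decision function X1 - X2 - 0.5 at the seven training points
f <- x1 - x2 - 0.5
data.frame(x1, x2, colors, f)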
The maximal margin hyperplane is the separating hyperplane that is farthest from the training observations. Its margin is the distance from the hyperplane to the closest training observations (the support vectors). Using, for example, the point (2, 1):
\[\frac{|\beta_0 + \beta_1 X_1 + \beta_2 X_2|}{\sqrt{\beta_1^2 + \beta_2^2}} = \frac{|-0.5 + 2 - 1|}{\sqrt{1^2 + (-1)^2}} = \frac{0.5}{\sqrt{2}} = \frac{1}{2\sqrt{2}} \approx 0.354\]
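As another check (a small sketch using the coefficients above), the distance from each of the four closest training points, (2, 1), (2, 2), (4, 3) and (4, 4), to the hyperplane should come out to the same value, \(1/(2\sqrt{2}) \approx 0.354\):
# distance from each of these points to the line X1 - X2 - 0.5 = 0
beta0 <- -0.5; beta1 <- 1; beta2 <- -1
sv <- rbind(c(2, 1), c(2, 2), c(4, 3), c(4, 4))
abs(beta0 + beta1 * sv[, 1] + beta2 * sv[, 2]) / sqrt(beta1^2 + beta2^2)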
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-0.5, 1, col="purple")
abline(-1, 1, lty = 2)
abline(0, 1, lty = 2)
The maximal margin hyperplane is completely determined by the support vectors at the points (2, 1), (2, 2), (4, 3) and (4, 4). A support vector, by definition, is a point such that a small movement of that point would change the maximal margin hyperplane.
Because the point (4,1) is not a support vector, it does not affect the maximal margin hyperplane. Thus, a slight movement of it would not change the hyperplane.
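One way to confirm this is a small sketch with the e1071 package (loaded later in this document): fitting a linear SVM with a very large cost approximates the maximal margin classifier on separable data. The support vectors it reports should come from the four margin points above (a solver may give one of them zero weight), and (4, 1) should not be among them. The object names toy and toy.fit are just for this illustration:
library(e1071)
toy <- data.frame(x1, x2, y = as.factor(colors))  # the seven training points
toy.fit <- svm(y ~ x1 + x2, data = toy, kernel = "linear", cost = 1e5, scale = FALSE)
toy[toy.fit$index, ]  # observations used as support vectors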
This is an example of a hyperplane that is not optimal but is still a separating hyperplane.
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
abline(-1.4, 1.3)
Once the red point (4, 0) is added, the data become non-separable, so no hyperplane can separate the two classes.
plot(x1, x2, col = colors, xlim = c(0, 5), ylim = c(0, 5))
points(c(4), c(0), col = c("red"))
set.seed(1234)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
plot(x1,x2, xlab="X1", ylab="X2", col=(3-y))
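Since \(y = 1\) exactly when \(x_1^2 > x_2^2\), the true class boundary is the set where \(x_1^2 = x_2^2\), i.e., the two lines \(x_2 = x_1\) and \(x_2 = -x_1\). If the plot above is still the active graphics device, they can be overlaid for reference:
# overlay the true boundary x1^2 = x2^2 (the lines x2 = x1 and x2 = -x1)
abline(0, 1, lty = 2)
abline(0, -1, lty = 2)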
From the summary below, we see that neither x1 nor x2 has a statistically significant coefficient.
logit.fit <- glm(y ~ x1 + x2, family = "binomial")
summary(logit.fit)
##
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.161 -1.107 -1.072 1.243 1.308
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16814 0.08985 -1.871 0.0613 .
## x1 0.15254 0.31736 0.481 0.6308
## x2 -0.12672 0.30048 -0.422 0.6732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 689.62 on 499 degrees of freedom
## Residual deviance: 689.21 on 497 degrees of freedom
## AIC: 695.21
##
## Number of Fisher Scoring iterations: 3
(Code for this part was adapted from an online source.)
The resulting decision boundary is clearly linear.
data <- data.frame(x1 = x1, x2 = x2, y = y)
probs <- predict(logit.fit, data, type = "response")
preds <- rep(0, 500)  # one prediction per observation (n = 500)
preds[probs > 0.45] <- 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = "red", xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = "green")
# note: log(x2) is undefined for x2 <= 0, which causes the NaN warnings and dropped observations below
logitmodel <- glm(y ~ I(x1^2) + I(x2^2) + I(x1*x2) + log(x2), family = "binomial")
## Warning in log(x2): NaNs produced
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logitmodel)
##
## Call:
## glm(formula = y ~ I(x1^2) + I(x2^2) + I(x1 * x2) + log(x2), family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.52e-04 -2.00e-08 -2.00e-08 2.00e-08 1.07e-03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 67.01 8316.03 0.008 0.994
## I(x1^2) 17314.70 704444.29 0.025 0.980
## I(x2^2) -17682.91 706122.37 -0.025 0.980
## I(x1 * x2) -164.00 110206.33 -0.001 0.999
## log(x2) 17.70 1931.92 0.009 0.993
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3.5979e+02 on 260 degrees of freedom
## Residual deviance: 2.1825e-06 on 256 degrees of freedom
## (239 observations deleted due to missingness)
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
Despite the warnings, this model fits the usable observations almost perfectly: the residual deviance is essentially zero and the fitted probabilities are pushed to 0 or 1, which is why the standard errors blow up and none of the coefficients appears individually significant.
probs <- predict(logitmodel, data, type = "response")
## Warning in log(x2): NaNs produced
preds <- rep(0, 500)
preds[probs > 0.47] <- 1
plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = "red", xlab = "X1", ylab = "X2")
points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = "green")
With a linear kernel and a small cost (0.01), this support vector classifier assigns every observation to the same class.
require(e1071)
## Loading required package: e1071
## Warning: package 'e1071' was built under R version 3.3.2
data$y <- as.factor(data$y)
svm.fit <- svm(y ~ x1 + x2, data, kernel = "linear", cost = 0.01)
preds <- predict(svm.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = "red", xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = "green")
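As a quick check of the claim above (using the preds object from this chunk), tabulating the predictions should show every observation in a single class:
table(preds)  # the predictions should all fall in one class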
The predictions below look similar to the true classes plotted in Part B.
data$y <- as.factor(data$y)
svmnl.fit <- svm(y ~ x1 + x2, data, kernel = "radial", gamma = 1)
preds <- predict(svmnl.fit, data)
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = "green", xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = "red")
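To quantify how close the radial-kernel predictions are to the true classes (again just a quick check with the objects from this chunk), a confusion matrix can be tabulated:
# compare radial-kernel predictions with the true classes
table(predicted = preds, truth = data$y)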
From Part H, we see that the SVM with a non-linear (radial) kernel is a good model for finding non-linear decision boundaries, and its predictions come closest to the true decision boundary.
From Parts F and G, logistic regression with the interaction terms and an SVM with a linear kernel are both not very good at predicting here and do not handle the non-linear boundary well.
Using an SVM would be my preferred method for building this model because we only need to tune gamma (and the cost), while logistic regression requires checking various interaction terms by hand.
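For example, a minimal sketch of how cost and gamma could be tuned by cross-validation with e1071's tune() (the candidate values below are illustrative only):
svm.tune <- tune(svm, y ~ x1 + x2, data = data, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
summary(svm.tune)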
require(ISLR)
## Loading required package: ISLR
set.seed(1234)
train <- sample(1:nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
The summary below confirms that there are 2 classes (CH and MM) and that the classifier uses 448 support vectors out of the 800 training observations, split almost evenly between the classes (225 and 223).
library(e1071)
table(OJ$STORE, as.factor(OJ$STORE))  # counts for each STORE level appear on the diagonal
##
## 0 1 2 3 4
## 0 356 0 0 0 0
## 1 0 157 0 0 0
## 2 0 0 222 0 0
## 3 0 0 0 196 0
## 4 0 0 0 0 139
OJ$STORE <- as.factor(OJ$STORE)
OJ$Store7 <- NULL
OJ$StoreID <- NULL
OJ.train <- OJ[train,]
OJ.test <- OJ[-train,]
svmOJ <- svm(Purchase~.,data=OJ.train,kernel='linear',cost=0.01)
summary(svmOJ)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear",
## cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
## gamma: 0.05263158
##
## Number of Support Vectors: 448
##
## ( 225 223 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
The training error rate is 16.5%
The testing error rate is 15.9%
svmOJtrainingprediction <- predict(svmOJ, newdata=OJ.train)
table(OJ.train$Purchase, svmOJtrainingprediction)
## svmOJtrainingprediction
## CH MM
## CH 433 56
## MM 76 235
1 - (433 + 235) / 800
## [1] 0.165
svmOJtestingprediction <- predict(svmOJ, newdata=OJ.test)
table(OJ.test$Purchase, svmOJtestingprediction)
## svmOJtestingprediction
## CH MM
## CH 151 13
## MM 30 76
1 - (151+76)/ 270
## [1] 0.1592593
The cross-validation results below choose cost = 0.01 as the best value, with an error of 0.17125.
svmOJtune <- tune(svm, Purchase~.,data=OJ.train, ranges= list(cost=c(.01,.02,.05,.1,.2,.5,1,2,5,10)), kernel='linear')
summary(svmOJtune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.17125
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.17125 0.03775377
## 2 0.02 0.17375 0.03557562
## 3 0.05 0.17250 0.03944053
## 4 0.10 0.17125 0.03955042
## 5 0.20 0.17375 0.04143687
## 6 0.50 0.17250 0.04073969
## 7 1.00 0.17500 0.03996526
## 8 2.00 0.17375 0.04185375
## 9 5.00 0.17500 0.03773077
## 10 10.00 0.17250 0.03717451
The training error rate is again 16.5% and the testing error rate is again 15.9%; cross-validation picked the same cost (0.01) we started with, so the best model is unchanged.
svmOJbesttrain <- predict(svmOJtune$best.model,newdata=OJ.train)
tablesvmtunetrain <- table(OJ.train$Purchase, svmOJbesttrain)
1-sum(diag(tablesvmtunetrain))/sum(tablesvmtunetrain)
## [1] 0.165
1-(433+235)/800
## [1] 0.165
svmOJbesttest <-predict(svmOJtune$best.model,newdata = OJ.test)
table(OJ.test$Purchase, svmOJbesttest)
## svmOJbesttest
## CH MM
## CH 151 13
## MM 30 76
1 - (151+76)/ 270
## [1] 0.1592593
With the radial kernel and cost = 0.01 (gamma at its default), the training error is 0.38875 and the testing error is 0.39259.
After tuning, the best cost is 1 (gamma stays at its default of 0.0526), giving a training error of 0.15375 and a testing error of 0.17778.
table(OJ$STORE,as.factor(OJ$STORE))
##
## 0 1 2 3 4
## 0 356 0 0 0 0
## 1 0 157 0 0 0
## 2 0 0 222 0 0
## 3 0 0 0 196 0
## 4 0 0 0 0 139
OJ$STORE <- as.factor(OJ$STORE)
OJ$Store7 <- NULL
OJ$StoreID <- NULL
OJ.train <- OJ[train,] # redo the train test split for the modified data...
OJ.test <- OJ[-train,]
svmOJradial <- svm(Purchase~.,data=OJ.train,kernel='radial',cost=0.01)
summary(svmOJradial)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial",
## cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.01
## gamma: 0.05263158
##
## Number of Support Vectors: 623
##
## ( 312 311 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
svmOJtrainingprediction <- predict(svmOJradial, newdata=OJ.train)
svmOJtrainradialtable <- table(OJ.train$Purchase, svmOJtrainingprediction)
#Training Error
1-sum(diag(svmOJtrainradialtable))/sum(svmOJtrainradialtable)
## [1] 0.38875
svmOJtestingprediction <- predict(svmOJradial, newdata=OJ.test)
svmOJtestradialtable <- table(OJ.test$Purchase, svmOJtestingprediction)
#Testing Error
1-sum(diag(svmOJtestradialtable))/sum(svmOJtestradialtable)
## [1] 0.3925926
# note: kernel='radial' is given inside ranges (with a single value), so it also appears in the results table below
svmradial.tune <- tune(svm , Purchase~. , data=OJ.train , ranges=list(cost=c(.01,.02,.05,.1,.2,.5,1,2,5,10),kernel='radial'))
summary(svmradial.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost kernel
## 1 radial
##
## - best performance: 0.1775
##
## - Detailed performance results:
## cost kernel error dispersion
## 1 0.01 radial 0.38875 0.07032909
## 2 0.02 radial 0.38875 0.07032909
## 3 0.05 radial 0.22250 0.04594683
## 4 0.10 radial 0.17875 0.05104804
## 5 0.20 radial 0.17875 0.05529278
## 6 0.50 radial 0.17750 0.04993051
## 7 1.00 radial 0.17750 0.04816061
## 8 2.00 radial 0.18250 0.05109903
## 9 5.00 radial 0.18250 0.05470883
## 10 10.00 radial 0.18500 0.05706965
svmOJbesttrainradial<- predict(svmradial.tune$best.model,newdata=OJ.train)
tablesvmtunetrainradial <- table(OJ.train$Purchase, svmOJbesttrainradial)
#Training Error
1-sum(diag(tablesvmtunetrainradial))/sum(tablesvmtunetrainradial)
## [1] 0.15375
svmOJbesttestradial <-predict(svmradial.tune$best.model,newdata = OJ.test)
tablesvmtunetestradial <- table(OJ.test$Purchase, svmOJbesttestradial)
#Testing Error
1-sum(diag(tablesvmtunetestradial))/sum(tablesvmtunetestradial)
## [1] 0.1777778
With the polynomial kernel (degree 2) and cost = 0.01, the training error is 0.38875 and the testing error is 0.39259.
After tuning, the best cost is 10 (cross-validation error 0.185), still with degree 2, giving a training error of 0.15625 and a testing error of 0.1962963.
table(OJ$STORE,as.factor(OJ$STORE))
##
## 0 1 2 3 4
## 0 356 0 0 0 0
## 1 0 157 0 0 0
## 2 0 0 222 0 0
## 3 0 0 0 196 0
## 4 0 0 0 0 139
OJ$STORE <- as.factor(OJ$STORE)
OJ$Store7 <- NULL
OJ$StoreID <- NULL
OJ.train <- OJ[train,]
OJ.test <- OJ[-train,]
svmOJpoly <- svm(Purchase~.,data=OJ.train,kernel='polynomial',degree=2, cost=0.01)
summary(svmOJpoly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2, cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.01
## degree: 2
## gamma: 0.05263158
## coef.0: 0
##
## Number of Support Vectors: 625
##
## ( 314 311 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
svmOJtrainingprediction <- predict(svmOJpoly, newdata=OJ.train)
svmOJtrainpolytable <- table(OJ.train$Purchase, svmOJtrainingprediction)
#Training Error
1-sum(diag(svmOJtrainpolytable))/sum(svmOJtrainpolytable)
## [1] 0.38875
svmOJtestingprediction <- predict(svmOJpoly, newdata=OJ.test)
svmOJtestpolytable <- table(OJ.test$Purchase, svmOJtestingprediction)
#Testing Error
1-sum(diag(svmOJtestpolytable))/sum(svmOJtestpolytable)
## [1] 0.3925926
svmpoly.tune<- tune(svm , Purchase~. , data=OJ.train, ranges=list(cost=c(.01,.02,.05,.1,.2,.5,1,2,5,10),kernel='polynomial', degree =2))
summary(svmpoly.tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost kernel degree
## 10 polynomial 2
##
## - best performance: 0.185
##
## - Detailed performance results:
## cost kernel degree error dispersion
## 1 0.01 polynomial 2 0.38875 0.06108112
## 2 0.02 polynomial 2 0.36375 0.07647848
## 3 0.05 polynomial 2 0.35750 0.07364517
## 4 0.10 polynomial 2 0.35000 0.07120003
## 5 0.20 polynomial 2 0.31750 0.05596378
## 6 0.50 polynomial 2 0.26625 0.04411554
## 7 1.00 polynomial 2 0.23375 0.04678927
## 8 2.00 polynomial 2 0.21750 0.04794383
## 9 5.00 polynomial 2 0.18750 0.04677072
## 10 10.00 polynomial 2 0.18500 0.04281744
svmOJbesttrainpoly<- predict(svmpoly.tune$best.model,newdata=OJ.train)
tablesvmtunetrainpoly <- table(OJ.train$Purchase, svmOJbesttrainpoly)
#Training Error
1-sum(diag(tablesvmtunetrainpoly))/sum(tablesvmtunetrainpoly)
## [1] 0.15625
svmOJbesttestpoly <-predict(svmpoly.tune$best.model,newdata = OJ.test)
tablesvmtunetestpoly <- table(OJ.test$Purchase, svmOJbesttestpoly)
#Testing Error
1-sum(diag(tablesvmtunetestpoly))/sum(tablesvmtunetestpoly)
## [1] 0.1962963
Of the three approaches, the tuned polynomial kernel shows the biggest improvement, but the lowest testing error rate actually belongs to the linear kernel (15.9%, versus 17.8% for the radial kernel and 19.6% for the polynomial kernel), so by test error the linear-kernel SVM does the best job here.
From Part G, we see that the training and testing errors for the SVM with the polynomial kernel drop dramatically (by roughly 20 percentage points) once an appropriate cost is chosen.
The linear kernel can look confusing at first because tuning does not seem to help: cross-validation selected cost = 0.01, the same value we started with, so the training and testing error rates are unchanged after tuning.
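As a compact way to see this comparison (a small sketch; the numbers are just the tuned test error rates computed above):
# tuned test error rates for the three kernels above
data.frame(kernel = c("linear", "radial", "polynomial"),
           test.error = c(0.1593, 0.1778, 0.1963))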