STATS 216 Winter 2014 HW4

Shijia Bian

Question 1

(a)

Produce a plot of test MSE (as in Figure 8.8 in the text) as a function of number of trees for Bagging and Random Forests.

We use 200 observations from our dataset as a test set and the remaining 307 as a training set. For bagging, I use mtry = 21, which means that all 21 predictors are considered for each split of the tree. For the random forest on these regression trees, I use mtry = p/3 = 21/3 = 7.
load("/Users/shijiabian/Downloads/body.RData")
genderFact <- as.factor(Y$Gender)

set.seed(1)
test = sample(1:nrow(X), 200)
train = (1:nrow(X))[-test]

install.packages("randomForest")
## Error: trying to use CRAN without setting a mirror
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.

set.seed(1)
test.mse.bag <- numeric(100)
test.mse.rf <- numeric(100)
xvalue <- c(1:100) * 5

for (i in 1:100) {
    bag.rdata <- randomForest(Y$Weight ~ ., data = X, subset = train, mtry = 21, 
        importance = TRUE, ntree = i * 5)
    rf.rdata <- randomForest(Y$Weight ~ ., data = X, subset = train, mtry = 7, 
        importance = TRUE, ntree = i * 5)
    yhat.bag <- predict(bag.rdata, newdata = X[test, ])
    yhat.rf <- predict(rf.rdata, newdata = X[test, ])
    test.mse.bag[i] <- mean((yhat.bag - Y[test, ]$Weight)^2)
    test.mse.rf[i] <- mean((yhat.rf - Y[test, ]$Weight)^2)
}
plot(xvalue, test.mse.bag, xlab = "Number of Trees", ylab = "MSE", col = "blue", 
    pch = 19, type = "l", ylim = c(9.5, 17))
points(xvalue, test.mse.rf, pch = 19, type = "l")
legend("topright", legend = c("Test MSE: Bagging", "Test MSE: Random Forest"), 
    col = c("blue", "black"), lty = 1)

[Figure: test MSE as a function of the number of trees, bagging (blue) and random forest (black)]

(b)

Which variables does your random forest identify as most important? How do they compare with the most important variables as identified by Bagging?

Here I still use the training set, since it makes up the larger part of the data, and I use the default number of trees, 500, this time. %IncMSE measures the increase in out-of-bag MSE when a predictor is permuted; IncNodePurity measures the total decrease in node RSS from splits on that predictor. For the random forest, in terms of %IncMSE, Knee.Girth (about 21.5) is identified as the most important variable; in terms of IncNodePurity, Waist.Girth (about 11251) is identified as the most important variable.
rf.rdata <- randomForest(Y$Weight ~ ., data = X, subset = train, mtry = 7, importance = TRUE)
importance(rf.rdata)
##                     %IncMSE IncNodePurity
## Wrist.Diam            5.469         725.7
## Wrist.Girth           6.819        1614.9
## Forearm.Girth        13.936        6731.5
## Elbow.Diam            8.188         740.7
## Bicep.Girth          13.333        8018.1
## Shoulder.Girth       11.399        3378.4
## Biacromial.Diam      10.247         280.6
## Chest.Depth           7.922         780.6
## Chest.Diam           10.695        1485.1
## Chest.Girth          15.034        8396.5
## Navel.Girth           8.760         447.8
## Waist.Girth          20.812       11250.7
## Pelvic.Breadth        7.551         300.6
## Bitrochanteric.Diam  10.931         481.0
## Hip.Girth            20.654        1835.3
## Thigh.Girth          15.729         674.6
## Knee.Diam            13.310         777.8
## Knee.Girth           21.511        2121.3
## Calf.Girth           17.533        2465.8
## Ankle.Diam            8.277         199.0
## Ankle.Girth          10.741         455.3
varImpPlot(rf.rdata)

[Figure: variable importance plots (%IncMSE and IncNodePurity) for the random forest]

For bagging, Waist.Girth is identified as the most important variable by both measures: its %IncMSE is about 30.0 and its IncNodePurity is about 22041. The two methods therefore agree on the top variable by IncNodePurity, but by %IncMSE the random forest ranks Knee.Girth slightly ahead of Waist.Girth, while bagging puts Waist.Girth clearly first.
bag.rdata = randomForest(Y$Weight ~ ., data = X, subset = train, mtry = 21, 
    importance = TRUE)
importance(bag.rdata)
##                     %IncMSE IncNodePurity
## Wrist.Diam            5.014         109.8
## Wrist.Girth           5.795         243.1
## Forearm.Girth        11.680        5550.1
## Elbow.Diam            8.381         214.3
## Bicep.Girth          12.542        8097.6
## Shoulder.Girth       11.301         987.1
## Biacromial.Diam      11.128         226.6
## Chest.Depth           8.748         188.8
## Chest.Diam           11.219         710.9
## Chest.Girth          14.705        7006.7
## Navel.Girth           5.201         147.9
## Waist.Girth          30.049       22041.3
## Pelvic.Breadth        5.993         142.5
## Bitrochanteric.Diam   5.577         280.7
## Hip.Girth            22.912        1518.4
## Thigh.Girth          15.685         383.9
## Knee.Diam            15.738         784.2
## Knee.Girth           26.118        2219.1
## Calf.Girth           18.953        1678.5
## Ankle.Diam            6.978         154.7
## Ankle.Girth           9.972         298.2
varImpPlot(bag.rdata)

[Figure: variable importance plots (%IncMSE and IncNodePurity) for bagging]

(c)

Compare the test error of your random forest (with 500 trees) against the test errors of the three methods you evaluated in problem 3(f) on Homework 3. Does your random forest make better predictions than your predictions from Homework 3?

The test MSE given by the random forest with 500 trees is about 10.77. This is slightly worse than the methods evaluated in 3(f) on Homework 3: PLS had a test error around 8.65, PCR around 9.27, and forward stepwise around 8.63. So the random forest does not make better predictions than those from Homework 3.
set.seed(123)
rf.rdata.500 <- randomForest(Y$Weight ~ ., data = X, subset = train, mtry = 7, 
    importance = TRUE, ntree = 500)
yhat.rf.500 <- predict(rf.rdata.500, newdata = X[test, ])
test.mse.rf.500 <- mean((yhat.rf.500 - Y$Weight[test])^2)
test.mse.rf.500
## [1] 10.77

(d)

The randomForest() function uses 500 as the default number of trees. For this problem, is 500 enough trees? How can you tell?

500 is enough trees. I plotted the test MSEs of the random forest and bagging for tree counts from 501 to 550. The MSEs for the two methods are almost constant over this range, with no large fluctuation, so the error has already stabilized by 500 trees.
test.mse.rf.500 <- numeric(50)
test.mse.bag.500 <- numeric(50)
xvalue <- 501:550

for (i in 501:550) {
    rf.rdata <- randomForest(Y$Weight ~ ., data = X, subset = train, mtry = 7, 
        importance = TRUE, ntree = i)
    bag.rdata <- randomForest(Y$Weight ~ ., data = X, subset = train, mtry = 21, 
        importance = TRUE, ntree = i)
    yhat.rf <- predict(rf.rdata, newdata = X[test, ])
    test.mse.rf.500[i - 500] <- mean((yhat.rf - Y[test, ]$Weight)^2)
    yhat.bag <- predict(bag.rdata, newdata = X[test, ])
    test.mse.bag.500[i - 500] <- mean((yhat.bag - Y[test, ]$Weight)^2)
}
plot(xvalue, test.mse.rf.500, xlab = "Number of Trees", ylab = "MSE", col = "blue", 
    pch = 19, type = "l", ylim = c(9.5, 17))
points(xvalue, test.mse.bag.500, col = "green", pch = 19, type = "l")
legend("topright", legend = c("Test MSE: Random Forest", "Test MSE: Bagging"), 
    col = c("blue", "green"), lty = 1)

[Figure: test MSE for 501 to 550 trees, random forest (blue) and bagging (green)]
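A quicker check, without refitting a model for every tree count, uses the out-of-bag error that randomForest records as trees are added: for regression, the fitted object stores the cumulative OOB MSE in its mse component. A minimal sketch (rf.check is a new name introduced here):

# One 550-tree fit; rf.check$mse holds the OOB MSE after each tree,
# so a curve that is flat past tree 500 confirms 500 trees is enough.
rf.check <- randomForest(Y$Weight ~ ., data = X, subset = train, mtry = 7, ntree = 550)
plot(rf.check$mse, type = "l", xlab = "Number of Trees", ylab = "OOB MSE")
abline(v = 500, lty = 2)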

Question 2

(a)

We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class label. Sketch the observations.

x1 <- c(3, 2, 4, 1, 2, 4, 4)
x2 <- c(4, 2, 4, 4, 1, 3, 1)
y <- c("Red", "Red", "Red", "Red", "Blue", "Blue", "Blue")
y <- as.factor(y)
data <- data.frame(x1, x2, y)
Here is how I sketch the observations:
attach(data)
## The following objects are masked _by_ .GlobalEnv:
## 
##     x1, x2, y
plot(x1, x2, col = c("blue", "red")[y], pch = 20)

[Figure: the seven observations plotted in their class colors]

detach(data)

(b)

Sketch the optimal separating hyperplane, and provide the equation for this hyperplane.

The equation for this hyperplane is x1 - x2 - 0.5 = 0. It passes midway between the pairs (2, 1), (2, 2) and (4, 3), (4, 4), i.e., through the points (2, 1.5) and (4, 3.5), which gives slope 1 and intercept -0.5. The optimal separating hyperplane is sketched in green.
attach(data)
## The following objects are masked _by_ .GlobalEnv:
## 
##     x1, x2, y
plot(x1, x2, col = c("blue", "red")[y], pch = 20)
abline(-0.5, 1, col = "green")

[Figure: the observations with the optimal separating hyperplane in green]

detach(data)

(c)

Describe the classification rule for the maximal margin classifier.

Classify to Red if x1 - x2 - 0.5 < 0, and classify to Blue otherwise. The coefficients are \[ \beta_0 = -0.5, \quad \beta_1 = 1, \quad \beta_2 = -1. \]
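As a check, the red point (3, 4) gives 3 - 4 - 0.5 = -1.5 < 0 (Red), while the blue point (2, 1) gives 2 - 1 - 0.5 = 0.5 > 0 (Blue).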

(d)

On your sketch, indicate the margin for the maximal margin hyperplane. How wide is the margin?

The margin is the minimal distance from the observations to the hyperplane: the perpendicular distance from the points on the dashed lines to the optimal hyperplane. The margin is \( \frac{1}{2\sqrt{2}} \) in our problem.
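For example, for the support vector (2, 2), the point-to-line distance formula gives \[ d = \frac{|x_1 - x_2 - 0.5|}{\sqrt{1^2 + (-1)^2}} = \frac{|2 - 2 - 0.5|}{\sqrt{2}} = \frac{1}{2\sqrt{2}} \approx 0.354. \]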

attach(data)
## The following objects are masked _by_ .GlobalEnv:
## 
##     x1, x2, y
plot(x1, x2, col = c("blue", "red")[y], pch = 20)
abline(-0.5, 1, col = "green")
abline(-1, 1, col = "black", lty = 2)
abline(0, 1, col = "black", lty = 2)

[Figure: the maximal margin hyperplane (green) with the margin boundaries dashed]

detach(data)

(e)

Indicate the support vectors for the maximal margin classifier.

The support vectors are indexed as 2,3,5,6 in the table.

(f)

Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.

The maximal margin hyperplane depends directly on only a small subset of the observations: the support vectors. The seventh observation is far from the maximal margin hyperplane and is not a support vector. If there is a slight movement of the seventh observation, it would not affect the separating hyperplane: the margin, which is determined by the support vectors, would not change.
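A quick empirical check is possible with e1071, using a very large cost so that the soft-margin SVM approximates the maximal margin classifier. This is only a sketch; the nudged coordinates (4.2, 0.8) are an arbitrary illustrative choice:

library(e1071)
# fit on the original data and on data with observation 7 nudged slightly
fit1 <- svm(y ~ x1 + x2, data = data, kernel = "linear", cost = 1e+05, scale = FALSE)
data2 <- data
data2[7, c("x1", "x2")] <- c(4.2, 0.8)
fit2 <- svm(y ~ x1 + x2, data = data2, kernel = "linear", cost = 1e+05, scale = FALSE)
# the hyperplane coefficients w = t(coefs) %*% SV are essentially unchanged
w1 <- t(fit1$coefs) %*% fit1$SV
w2 <- t(fit2$coefs) %*% fit2$SV
rbind(w1, w2)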

(g)

Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.

I plot this hyperplane in orange. Its equation is 1.1 x1 - x2 - 0.6 = 0; it still separates the classes, but with a smaller margin than the optimal hyperplane.
attach(data)
## The following objects are masked _by_ .GlobalEnv:
## 
##     x1, x2, y
plot(x1, x2, col = c("blue", "red")[y], pch = 20)
abline(-0.5, 1, col = "green")
abline(-0.6, 1.1, col = "orange")

[Figure: the optimal hyperplane (green) and a suboptimal separating hyperplane (orange)]

detach(data)

(h)

Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.

Here I add a blue point at (2, 3.5); the two classes are then no longer separable by a hyperplane.
x1 <- c(3, 2, 4, 1, 2, 4, 4, 2)
x2 <- c(4, 2, 4, 4, 1, 3, 1, 3.5)
y <- c("Red", "Red", "Red", "Red", "Blue", "Blue", "Blue", "Blue")
y <- as.factor(y)
newdata <- data.frame(x1, x2, y)
attach(newdata)
## The following objects are masked _by_ .GlobalEnv:
## 
##     x1, x2, y
plot(x1, x2, col = c("blue", "red")[y], pch = 20)

[Figure: the observations with the added blue point at (2, 3.5); the classes are no longer linearly separable]

detach(newdata)

Question 3

(a)

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

Here I created a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
install.packages("ISLR")
## Error: trying to use CRAN without setting a mirror
install.packages("e1071")
## Error: trying to use CRAN without setting a mirror

library(ISLR)
library(e1071)
## Loading required package: class

attach(OJ)
head(OJ)
##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1       CH            237       1    1.75    1.99   0.00    0.0         0
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
##   SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1         0  0.5000        1.99        1.75      0.24     No    0.0000
## 2         1  0.6000        1.69        1.75     -0.06     No    0.1508
## 3         0  0.6800        2.09        1.69      0.40     No    0.0000
## 4         0  0.4000        1.69        1.69      0.00     No    0.0000
## 5         0  0.9565        1.69        1.69      0.00    Yes    0.0000
## 6         1  0.9652        1.99        1.69      0.30    Yes    0.0000
##   PctDiscCH ListPriceDiff STORE
## 1    0.0000          0.24     1
## 2    0.0000          0.24     1
## 3    0.0914          0.23     1
## 4    0.0000          0.00     1
## 5    0.0000          0.00     0
## 6    0.0000          0.30     0
dim(OJ)
## [1] 1070   18
# 1070 18

set.seed(345)
train = sample(1:nrow(OJ), 800)
test = c(1:nrow(OJ))[-train]

(b)

Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the SVM, and describe the results obtained.

With the cost equal to 0.01, there are 603 support vectors: 302 in one class and 301 in the other. The gamma is 0.0556. Because the cost is so small, margin violations are penalized only lightly, the margin is wide, and most of the 800 training observations end up as support vectors.
set.seed(235)
# Purchase is already a factor in OJ; data.frame() duplicates it as Purchase.1,
# and dropping column 1 leaves Purchase.1 as the response
Purchase <- as.factor(Purchase)
OJ.df <- data.frame(OJ, Purchase)
OJ.df <- OJ.df[, -1]
svmfit <- svm(Purchase.1 ~ ., data = OJ.df[train, ], kernel = "linear", cost = 0.01, 
    scale = FALSE)
summary(svmfit)
## 
## Call:
## svm(formula = Purchase.1 ~ ., data = OJ.df[train, ], kernel = "linear", 
##     cost = 0.01, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  603
## 
##  ( 302 301 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
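The support-vector counts can also be read directly from the fitted object; e1071's svm objects store them in the tot.nSV and nSV components (a minimal sketch):

svmfit$tot.nSV  # total number of support vectors (603 here)
svmfit$nSV      # support vectors per class (302 and 301)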

(c)

What are the training and test error rates?

For the training error rate:

truth
predict  CH  MM
     CH 457 149
     MM  41 153

Training Error rate = (149 + 41)/800 = 190/800 = 0.2375. The training error rate is usually smaller than the test error rate.
purchasepred.train <- predict(svmfit, OJ.df[train, ])
table(predict = purchasepred.train, truth = OJ.df[train, ]$Purchase.1)
##        truth
## predict  CH  MM
##      CH 457 149
##      MM  41 153
For the test error rate:

       truth
predict  CH  MM
     CH 146  64
     MM   9  51

Testing Error rate = (64 + 9)/270 = 73/270 = 0.2704 (the result may differ each time I knit the HTML, but the training error rate is usually smaller than the test error rate).
purchasepred.test <- predict(svmfit, OJ.df[test, ])
table(predict = purchasepred.test, truth = OJ.df[test, ]$Purchase.1)
##        truth
## predict  CH  MM
##      CH 146  64
##      MM   9  51
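Instead of reading the counts off the confusion matrices, the error rates can be computed directly from the prediction vectors above (a sketch; the values match the tables up to run-to-run variation):

mean(purchasepred.train != OJ.df$Purchase.1[train])  # training error, about 0.2375
mean(purchasepred.test != OJ.df$Purchase.1[test])    # test error, about 0.27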

(d)

Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

cost = 1.927 (with gamma = 0.0556) results in the lowest cross-validation error rate, 0.1664.
set.seed(365)
tune.out = tune(svm, Purchase.1 ~ ., data = OJ.df, kernel = "linear", ranges = list(cost = c(seq(0.01, 
    10, length.out = 100))))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##   cost
##  1.927
## 
## - best performance: 0.1664 
## 
## - Detailed performance results:
##        cost  error dispersion
## 1    0.0100 0.1701    0.02561
## 2    0.1109 0.1729    0.02423
## 3    0.2118 0.1710    0.02645
## 4    0.3127 0.1748    0.02645
## 5    0.4136 0.1720    0.02539
## 6    0.5145 0.1710    0.02455
## 7    0.6155 0.1701    0.02281
## 8    0.7164 0.1729    0.02540
## 9    0.8173 0.1729    0.02540
## 10   0.9182 0.1738    0.02962
## 11   1.0191 0.1729    0.02830
## 12   1.1200 0.1710    0.02924
## 13   1.2209 0.1701    0.03014
## 14   1.3218 0.1682    0.02821
## 15   1.4227 0.1682    0.02821
## 16   1.5236 0.1692    0.02970
## 17   1.6245 0.1692    0.02970
## 18   1.7255 0.1701    0.02779
## 19   1.8264 0.1682    0.02531
## 20   1.9273 0.1664    0.02636
## 21   2.0282 0.1682    0.02889
## 22   2.1291 0.1673    0.02623
## 23   2.2300 0.1682    0.02786
## 24   2.3309 0.1692    0.02870
## 25   2.4318 0.1692    0.02870
## 26   2.5327 0.1692    0.02870
## 27   2.6336 0.1692    0.02870
## 28   2.7345 0.1692    0.02870
## 29   2.8355 0.1692    0.02870
## 30   2.9364 0.1692    0.02870
## 31   3.0373 0.1692    0.02870
## 32   3.1382 0.1692    0.02870
## 33   3.2391 0.1692    0.02870
## 34   3.3400 0.1701    0.02882
## 35   3.4409 0.1710    0.02957
## 36   3.5418 0.1710    0.02957
## 37   3.6427 0.1710    0.02957
## 38   3.7436 0.1710    0.02957
## 39   3.8445 0.1710    0.02957
## 40   3.9455 0.1701    0.02982
## 41   4.0464 0.1692    0.02970
## 42   4.1473 0.1692    0.02970
## 43   4.2482 0.1692    0.02970
## 44   4.3491 0.1692    0.02970
## 45   4.4500 0.1692    0.02970
## 46   4.5509 0.1692    0.02970
## 47   4.6518 0.1692    0.02970
## 48   4.7527 0.1692    0.02970
## 49   4.8536 0.1692    0.02970
## 50   4.9545 0.1692    0.02970
## 51   5.0555 0.1692    0.02970
## 52   5.1564 0.1692    0.02970
## 53   5.2573 0.1701    0.02779
## 54   5.3582 0.1701    0.02779
## 55   5.4591 0.1701    0.02779
## 56   5.5600 0.1701    0.02779
## 57   5.6609 0.1701    0.02779
## 58   5.7618 0.1710    0.02753
## 59   5.8627 0.1710    0.02753
## 60   5.9636 0.1710    0.02753
## 61   6.0645 0.1710    0.02753
## 62   6.1655 0.1701    0.02599
## 63   6.2664 0.1701    0.02599
## 64   6.3673 0.1701    0.02599
## 65   6.4682 0.1701    0.02599
## 66   6.5691 0.1701    0.02599
## 67   6.6700 0.1701    0.02599
## 68   6.7709 0.1701    0.02599
## 69   6.8718 0.1692    0.02510
## 70   6.9727 0.1692    0.02510
## 71   7.0736 0.1701    0.02523
## 72   7.1745 0.1701    0.02523
## 73   7.2755 0.1692    0.02732
## 74   7.3764 0.1692    0.02732
## 75   7.4773 0.1692    0.02732
## 76   7.5782 0.1710    0.02891
## 77   7.6791 0.1701    0.02744
## 78   7.7800 0.1701    0.02744
## 79   7.8809 0.1701    0.02744
## 80   7.9818 0.1701    0.02744
## 81   8.0827 0.1701    0.02744
## 82   8.1836 0.1692    0.02623
## 83   8.2845 0.1692    0.02623
## 84   8.3855 0.1692    0.02623
## 85   8.4864 0.1692    0.02623
## 86   8.5873 0.1692    0.02623
## 87   8.6882 0.1692    0.02623
## 88   8.7891 0.1701    0.02673
## 89   8.8900 0.1701    0.02673
## 90   8.9909 0.1701    0.02673
## 91   9.0918 0.1701    0.02673
## 92   9.1927 0.1710    0.02788
## 93   9.2936 0.1710    0.02788
## 94   9.3945 0.1710    0.02788
## 95   9.4955 0.1710    0.02788
## 96   9.5964 0.1710    0.02788
## 97   9.6973 0.1720    0.02896
## 98   9.7982 0.1720    0.02896
## 99   9.8991 0.1710    0.02788
## 100 10.0000 0.1720    0.02896
bestmod = tune.out$best.model
summary(bestmod)
## 
## Call:
## best.tune(method = svm, train.x = Purchase.1 ~ ., data = OJ.df, 
##     ranges = list(cost = c(seq(0.01, 10, length.out = 100))), 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1.927 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  443
## 
##  ( 221 222 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
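e1071 also provides a plot method for tune objects, which makes the flat region of the cross-validation curve easy to see (a sketch):

plot(tune.out)  # CV error as a function of cost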

(e)

After adopting the new cost, the error rates for both the training and the test data decrease.
set.seed(334)
svmfit.newc <- svm(Purchase.1 ~ ., data = OJ.df[train, ], kernel = "linear", 
    cost = bestmod$cost, scale = FALSE)
summary(svmfit.newc)
## 
## Call:
## svm(formula = Purchase.1 ~ ., data = OJ.df[train, ], kernel = "linear", 
##     cost = bestmod$cost, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1.927 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  342
## 
##  ( 172 170 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
For the training error rate with the new cost:

truth
predict  CH  MM
    CH  444  80
    MM   54 222

Training Error rate = (80 + 54)/800 = 134/800 = 0.1675 (the result may differ each time I knit the HTML).
purchasepred.train.newc <- predict(svmfit.newc, OJ.df[train, ])
table(predict = purchasepred.train.newc, truth = OJ.df[train, ]$Purchase.1)
##        truth
## predict  CH  MM
##      CH 444  80
##      MM  54 222
For the test error rate with the new cost:

       truth
predict  CH  MM
     CH 142  34
     MM  13  81

Testing Error rate = (34 + 13)/270 = 47/270 = 0.1741 (the result may differ each time I knit the HTML).
purchasepred.test.newc <- predict(svmfit.newc, OJ.df[test, ])
table(predict = purchasepred.test.newc, truth = OJ.df[test, ]$Purchase.1)
##        truth
## predict  CH  MM
##      CH 142  34
##      MM  13  81

(f)

Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

Repeat Part b. There are 609 support vectors: 307 in one class and 302 in the other.
install.packages("e1071")
## Error: trying to use CRAN without setting a mirror
library(e1071)
set.seed(345)
svmfit.svm <- svm(Purchase.1 ~ ., data = OJ.df[train, ], kernel = "radial", 
    cost = 0.01)
summary(svmfit.svm)
## 
## Call:
## svm(formula = Purchase.1 ~ ., data = OJ.df[train, ], kernel = "radial", 
##     cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.01 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  609
## 
##  ( 307 302 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
Repeat Part c.

For the training error rate of the radial SVM:

truth
predict  CH  MM
     CH 498 302
     MM   0   0

Training Error rate = 302/800 = 0.3775. With such a small cost, margin violations are penalized so lightly that the classifier simply predicts the majority class (CH) for every training observation.
purchasepred.train.svm <- predict(svmfit.svm, OJ.df[train, ])
table(pred = purchasepred.train.svm, true = OJ.df[train, ]$Purchase.1)
##     true
## pred  CH  MM
##   CH 498 302
##   MM   0   0
For the test error rate:

truth
predict  CH  MM
     CH 155 115
     MM   0   0

Testing Error rate = 115/270 = 0.4259 (the result may differ each time I knit the HTML).
purchasepred.test.svm <- predict(svmfit.svm, OJ.df[test, ])
table(pred = purchasepred.test.svm, true = OJ.df[test, ]$Purchase.1)
##     true
## pred  CH  MM
##   CH 155 115
##   MM   0   0
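This all-CH prediction matches the class balance of the training sample, which can be checked directly (a sketch):

table(OJ.df$Purchase.1[train])  # about 498 CH vs. 302 MM, matching the column sums above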
Repeat Part d.

cost = 0.6155 (with gamma = 0.0556) results in the lowest cross-validation error rate, 0.1713. The best model has 389 support vectors (197 and 192 per class). (The result may differ each time I knit the HTML.)
set.seed(234)
tune.out.svm = tune(svm, Purchase.1 ~ ., data = OJ.df[train, ], kernel = "radial", 
    ranges = list(cost = c(seq(0.01, 10, length.out = 100))))
summary(tune.out.svm)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##    cost
##  0.6155
## 
## - best performance: 0.1713 
## 
## - Detailed performance results:
##        cost  error dispersion
## 1    0.0100 0.3775    0.04888
## 2    0.1109 0.1825    0.03395
## 3    0.2118 0.1825    0.03238
## 4    0.3127 0.1750    0.03333
## 5    0.4136 0.1787    0.03538
## 6    0.5145 0.1738    0.03305
## 7    0.6155 0.1713    0.03283
## 8    0.7164 0.1713    0.03283
## 9    0.8173 0.1737    0.03654
## 10   0.9182 0.1762    0.03929
## 11   1.0191 0.1775    0.03717
## 12   1.1200 0.1800    0.03395
## 13   1.2209 0.1787    0.03587
## 14   1.3218 0.1762    0.03748
## 15   1.4227 0.1775    0.03670
## 16   1.5236 0.1775    0.03670
## 17   1.6245 0.1762    0.03408
## 18   1.7255 0.1787    0.03438
## 19   1.8264 0.1787    0.03438
## 20   1.9273 0.1775    0.03375
## 21   2.0282 0.1788    0.03336
## 22   2.1291 0.1775    0.03270
## 23   2.2300 0.1775    0.03270
## 24   2.3309 0.1775    0.03270
## 25   2.4318 0.1787    0.03336
## 26   2.5327 0.1750    0.03062
## 27   2.6336 0.1750    0.03062
## 28   2.7345 0.1738    0.02973
## 29   2.8355 0.1750    0.03227
## 30   2.9364 0.1762    0.03305
## 31   3.0373 0.1762    0.03305
## 32   3.1382 0.1762    0.03305
## 33   3.2391 0.1775    0.03051
## 34   3.3400 0.1787    0.03283
## 35   3.4409 0.1788    0.03388
## 36   3.5418 0.1775    0.03323
## 37   3.6427 0.1788    0.03438
## 38   3.7436 0.1788    0.03438
## 39   3.8445 0.1788    0.03438
## 40   3.9455 0.1788    0.03438
## 41   4.0464 0.1788    0.03438
## 42   4.1473 0.1800    0.03689
## 43   4.2482 0.1800    0.03689
## 44   4.3491 0.1813    0.03785
## 45   4.4500 0.1813    0.03785
## 46   4.5509 0.1800    0.03782
## 47   4.6518 0.1788    0.03911
## 48   4.7527 0.1800    0.03918
## 49   4.8536 0.1800    0.03918
## 50   4.9545 0.1800    0.03828
## 51   5.0555 0.1800    0.03828
## 52   5.1564 0.1800    0.03828
## 53   5.2573 0.1800    0.03828
## 54   5.3582 0.1813    0.03785
## 55   5.4591 0.1800    0.03689
## 56   5.5600 0.1813    0.03875
## 57   5.6609 0.1813    0.03875
## 58   5.7618 0.1813    0.03875
## 59   5.8627 0.1813    0.03875
## 60   5.9636 0.1813    0.03875
## 61   6.0645 0.1813    0.03875
## 62   6.1655 0.1800    0.03918
## 63   6.2664 0.1800    0.03918
## 64   6.3673 0.1800    0.03918
## 65   6.4682 0.1800    0.03918
## 66   6.5691 0.1800    0.03918
## 67   6.6700 0.1800    0.03918
## 68   6.7709 0.1800    0.03918
## 69   6.8718 0.1800    0.03918
## 70   6.9727 0.1800    0.03918
## 71   7.0736 0.1800    0.03918
## 72   7.1745 0.1800    0.03918
## 73   7.2755 0.1800    0.03918
## 74   7.3764 0.1800    0.03918
## 75   7.4773 0.1800    0.03918
## 76   7.5782 0.1800    0.03918
## 77   7.6791 0.1788    0.03866
## 78   7.7800 0.1788    0.03866
## 79   7.8809 0.1788    0.03866
## 80   7.9818 0.1775    0.03855
## 81   8.0827 0.1775    0.03855
## 82   8.1836 0.1775    0.03855
## 83   8.2845 0.1775    0.03855
## 84   8.3855 0.1775    0.03855
## 85   8.4864 0.1775    0.03855
## 86   8.5873 0.1775    0.03855
## 87   8.6882 0.1775    0.03855
## 88   8.7891 0.1775    0.03855
## 89   8.8900 0.1775    0.03855
## 90   8.9909 0.1775    0.03855
## 91   9.0918 0.1775    0.03855
## 92   9.1927 0.1775    0.03855
## 93   9.2936 0.1775    0.03855
## 94   9.3945 0.1775    0.03855
## 95   9.4955 0.1775    0.03855
## 96   9.5964 0.1775    0.03855
## 97   9.6973 0.1775    0.03855
## 98   9.7982 0.1775    0.03855
## 99   9.8991 0.1788    0.03866
## 100 10.0000 0.1788    0.03866
bestmod.svm = tune.out.svm$best.model
summary(bestmod.svm)
## 
## Call:
## best.tune(method = svm, train.x = Purchase.1 ~ ., data = OJ.df[train, 
##     ], ranges = list(cost = c(seq(0.01, 10, length.out = 100))), 
##     kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.6155 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  389
## 
##  ( 197 192 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
Repeat Part e.
svmfit.svm.newc <- svm(Purchase.1 ~ ., data = OJ.df[train, ], kernel = "radial", 
    cost = bestmod.svm$cost)
summary(svmfit.svm.newc)
## 
## Call:
## svm(formula = Purchase.1 ~ ., data = OJ.df[train, ], kernel = "radial", 
##     cost = bestmod.svm$cost)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.6155 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  389
## 
##  ( 197 192 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
For the training error rate of the radial SVM with the new cost:

truth
predict  CH  MM
     CH 457  82
     MM  41 220

Training Error rate = (82 + 41)/800 = 123/800 = 0.15375 (the result may differ each time I knit the HTML).
purchasepred.train.svm.newc <- predict(svmfit.svm.newc, OJ.df[train, ])
table(pred = purchasepred.train.svm.newc, true = OJ.df[train, ]$Purchase.1)
##     true
## pred  CH  MM
##   CH 457  82
##   MM  41 220
For the test error rate:

truth
predict  CH  MM
     CH 138  32
     MM  17  83

Testing Error rate = (32 + 17)/270 = 49/270 = 0.1815 (the result may differ each time I knit the HTML).
purchasepred.test.svm.newc <- predict(svmfit.svm.newc, OJ.df[test, ])
table(pred = purchasepred.test.svm.newc, true = OJ.df[test, ]$Purchase.1)
##     true
## pred  CH  MM
##   CH 138  32
##   MM  17  83

(g)

Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2 for part (b).

Repeat Part b. There are 607 support vectors (305 and 302 per class). (The result may differ each time I knit the HTML.)
set.seed(4452)
svmfit.svm.po <- svm(Purchase.1 ~ ., data = OJ.df[train, ], kernel = "polynomial", 
    degree = 2, cost = 0.01)
summary(svmfit.svm.po)
## 
## Call:
## svm(formula = Purchase.1 ~ ., data = OJ.df[train, ], kernel = "polynomial", 
##     degree = 2, cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  0.01 
##      degree:  2 
##       gamma:  0.05556 
##      coef.0:  0 
## 
## Number of Support Vectors:  607
## 
##  ( 305 302 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
Repeat Part c.

For the training error rate of the SVM with a polynomial kernel of degree 2:

truth
predict  CH  MM
     CH 497 287
     MM   1  15

Training Error rate = (287 + 1)/800 = 288/800 = 0.36 (the result may differ each time I knit the HTML).
purchasepred.train.svm.po <- predict(svmfit.svm.po, OJ.df[train, ])
table(pred = purchasepred.train.svm.po, true = OJ.df[train, ]$Purchase.1)
##     true
## pred  CH  MM
##   CH 497 287
##   MM   1  15
For the test error rate:

truth
predict  CH  MM
     CH 155 113
     MM   0   2

Testing Error rate = 113/270 = 0.4185 (the result may differ each time I knit the HTML).
purchasepred.test.svm.po <- predict(svmfit.svm.po, OJ.df[test, ])
table(pred = purchasepred.test.svm.po, true = OJ.df[test, ]$Purchase.1)
##     true
## pred  CH  MM
##   CH 155 113
##   MM   0   2
Repeat Part d.

cost = 5 (with gamma = 0.0556) results in the lowest cross-validation error rate, 0.1737. The best model has 364 support vectors (189 and 175 per class). (The result may differ each time I knit the HTML.)
set.seed(23)
tune.out.svm = tune(svm, Purchase.1 ~ ., data = OJ.df[train, ], kernel = "polynomial", 
    degree = 2, ranges = list(cost = c(0.01, 0.7, 1, 5, 10)))
summary(tune.out.svm)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     5
## 
## - best performance: 0.1737 
## 
## - Detailed performance results:
##    cost  error dispersion
## 1  0.01 0.3775    0.06003
## 2  0.70 0.1850    0.05329
## 3  1.00 0.1837    0.05002
## 4  5.00 0.1737    0.04185
## 5 10.00 0.1738    0.03558
bestmod.svm = tune.out.svm$best.model
summary(bestmod.svm)
## 
## Call:
## best.tune(method = svm, train.x = Purchase.1 ~ ., data = OJ.df[train, 
##     ], ranges = list(cost = c(0.01, 0.7, 1, 5, 10)), kernel = "polynomial", 
##     degree = 2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  5 
##      degree:  2 
##       gamma:  0.05556 
##      coef.0:  0 
## 
## Number of Support Vectors:  364
## 
##  ( 189 175 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
Repeat Part e.
set.seed(23)
svmfit.svm.newc <- svm(Purchase.1 ~ ., data = OJ.df[train, ], kernel = "polynomial", 
    cost = 5, degree = 2)
summary(svmfit.svm.newc)
## 
## Call:
## svm(formula = Purchase.1 ~ ., data = OJ.df[train, ], kernel = "radial", 
##     cost = bestmod.svm$cost)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.6155 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  389
## 
##  ( 197 192 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
For the training error rate:

truth
predict  CH  MM
     CH 457  82
     MM  41 220

Training Error rate = (82 + 41)/800 = 123/800 = 0.15375 (the result may differ each time I knit the HTML).
purchasepred.train.svm.newc <- predict(svmfit.svm.newc, OJ.df[train, ])
table(pred = purchasepred.train.svm.newc, true = OJ.df[train, ]$Purchase.1)
##     true
## pred  CH  MM
##   CH 457  82
##   MM  41 220
For the test error rate:

truth
predict  CH  MM
     CH 138  32
     MM  17  83

Testing Error rate = (32 + 17)/270 = 49/270 = 0.1815.
purchasepred.test.svm.newc <- predict(svmfit.svm.newc, OJ.df[test, ])
table(pred = purchasepred.test.svm.newc, true = OJ.df[test, ]$Purchase.1)
##     true
## pred  CH  MM
##   CH 138  32
##   MM  17  83

(h)

Overall, which approach seems to give the best results on this data?

Comparing the tuned models from (e) through (g): the training error rate decreases from 0.1675 (linear) to about 0.154 (radial and polynomial), while the test error rates remain close (roughly 0.17 to 0.18). Since the polynomial kernel with degree = 2 attains the lowest training error rate with a comparable test error rate, it seems to give the best results on this data.

Question 4

(a)-(c) are attached with the homework.

(d)

I simulated a toy data set:
set.seed(123)
n <- 150  # number of data points
p <- 2  # dimension
sigma <- 1  # variance of the distribution
meanpos <- 0  # centre of the distribution of positive examples
meanneg <- 3  # centre of the distribution of negative examples
npos <- round(n/2)  # number of positive examples
nneg <- n - npos  # number of negative examples
# Generate the positive and negative examples
xpos <- matrix(rnorm(npos * p, mean = meanpos, sd = sigma), npos, p)
xneg <- matrix(rnorm(nneg * p, mean = meanneg, sd = sigma), nneg, p)
x <- rbind(xpos, xneg)
# Generate the labels
y <- matrix(c(rep(1, npos), rep(-1, nneg)))
# Visualize the data
Then I performed the 2-means clustering algorithm. The blue star is the global optimum of the data set, but we can see that the clustering converges to a local optimum instead of the global optimum.
op <- par(mfrow = c(2, 1))

plot(x, col = ifelse(y < 0, 2, 3), main = "the True Data Visualization")
points(mean(x[, 1]), mean(x[, 2]), col = "blue", pch = 8)  # overall centroid
km.out = kmeans(x, 2)
plot(x, col = (km.out$cluster + 1), main = "K-Means Clustering Results with K=2", 
    xlab = "", ylab = "", pch = 1, cex = 1)
points(mean(x[, 1]), mean(x[, 2]), col = "blue", pch = 8)  # overall centroid

[Figure: true labels (top) and K-means clustering results with K = 2 (bottom)]
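The standard remedy is to run the algorithm from many random initializations and keep the best run; kmeans() does this via its nstart argument, selecting the solution with the lowest total within-cluster sum of squares. A minimal sketch (km.single and km.multi are new names introduced here):

km.single <- kmeans(x, 2, nstart = 1)   # one random start: may be a poor local optimum
km.multi <- kmeans(x, 2, nstart = 20)   # best of 20 random starts
c(single = km.single$tot.withinss, multi = km.multi$tot.withinss)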

**NOTE**
This homework produced the error below when installing the packages:

Error: trying to use CRAN without setting a mirror

This is due to internet restrictions in China; I had to install the packages manually in R instead of calling install.packages() from the R console.