Homework 4 for STATS216 by Bruno Pen Wu

Question 1

I worked with Seema Sangari, Sanne de Roever, and Vedat Can Cologlu on this question.

Preparing the data:

load("~/Dropbox/Data Science/STATS216/Inclass Material/Feb 12/body.RData")

# Training and test set (same procedure and seed as in HW3)
n = nrow(Y)
set.seed(99)
train = sample(n, 307)
test = -train

# Prepare for Bagging and Random Forest
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
data = data.frame(Y$Weight, X)

# Bagging
set.seed(404)
weight.bag = randomForest(Y.Weight ~ ., data = data, subset = train, mtry = 21, 
    importance = T, xtest = X[test, ], ytest = Y[test, "Weight"])

# Random Forest (using m = p/3 or 21/3 or 7)
set.seed(505)
weight.rf = randomForest(Y.Weight ~ ., data = data, subset = train, mtry = 7, 
    importance = T, xtest = X[test, ], ytest = Y[test, "Weight"])

a) From the randomForest() documentation, the test MSE for each ensemble size is stored in the fit's test$mse component, which we plot against the number of trees:

matplot(1:weight.bag$ntree, cbind(weight.bag$test$mse, weight.rf$test$mse), 
    type = "l", main = "Test MSE v. Number of Trees", xlab = "Number of Trees", 
    ylab = "Test MSE", lty = c(1, 2))
legend("topright", legend = c("Bagging", "Random"), lty = c(1, 2))

[Figure: test MSE vs. number of trees for bagging and random forest]

b) Random forest identifies the following four variables as the most important: Hip.Girth, Knee.Girth, Waist.Girth, and Thigh.Girth. The top four variables are the same for both random forest and bagging, although the two methods rank them differently.

par(mfrow = c(1, 2))
varImpPlot(weight.rf, n.var = 8, type = 1, main = "Random Forest (Imp Var)")
varImpPlot(weight.bag, n.var = 8, type = 1, main = "Bagging (Imp Var)")

[Figure: variable importance plots (top 8 variables, %IncMSE) for random forest and bagging]
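
As a numeric cross-check of the plots, the importance scores can be sorted directly (a sketch using importance() from randomForest; type = 1 corresponds to the %IncMSE measure plotted above):

head(sort(importance(weight.rf, type = 1)[, 1], decreasing = TRUE), 4)
head(sort(importance(weight.bag, type = 1)[, 1], decreasing = TRUE), 4)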

c) The random forest test MSE with 500 trees is about 13.18:

weight.rf$test$mse[500]
## [1] 13.18

This is much higher than the test MSEs of the three methods we used in HW3(f), so random forest predicts worse than those three methods here.

d) 500 trees are enough because the graph in part a) shows that the test MSE has already flattened out to its minimum well before 500 trees; the "elbow" is reached fairly early, at around 50-100 trees.
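
One way to confirm this numerically is to look at the stored test MSE at a few ensemble sizes (a sketch; the exact values depend on the fits above):

weight.rf$test$mse[c(50, 100, 250, 500)]
weight.bag$test$mse[c(50, 100, 250, 500)]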

Question 2

a) Setting up the problem and sketching the observations.

X1 = c(3, 2, 4, 1, 2, 4, 4)
X2 = c(4, 2, 4, 4, 1, 3, 1)
Y = c("Red", "Red", "Red", "Red", "Blue", "Blue", "Blue")
plot(X1, X2, col = Y, pch = 19, ylim = c(-1, 5), xlim = c(0, 5), yaxs = "i", 
    xaxs = "i")

[Figure: scatter plot of the seven observations, colored by class]

b) Any separating hyperplane must lie between the circled support vectors, since any such line separates the two classes perfectly:
[Figure: observations with the candidate support vectors circled]

In particular, the optimal separating hyperplane is the line that passes exactly midway between the pair of points (2,1) and (2,2) and the pair of points (4,3) and (4,4). In other words, this optimal separating hyperplane contains the points (2,1.5) and (4,3.5), marked by crosses below:

[Figure: observations with the midpoints (2, 1.5) and (4, 3.5) marked by crosses]

If we use the equation of a line in the form \( \displaystyle y=mx + b \) and plug in the points (2,1.5) and (4,3.5), we get a slope of 1 (\( m=1 \)) and a y-intercept of \( -0.5 \) (\( b=-0.5 \)); thus, the formula for the line is \( \displaystyle y=x-0.5 \).

If we substitute \( X_2 \) for \( y \) and \( X_1 \) for \( x \) and move everything to one side, we get the optimal separating hyperplane in the form \( \beta_0 + \beta_1X_1 + \beta_2X_2 = 0 \):
\( \displaystyle 0.5-X_1+X_2 = 0 \)

[Figure: observations with the optimal separating hyperplane drawn]
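
A sketch of how this figure could be reproduced (assuming the X1, X2, and Y vectors from part a):

plot(X1, X2, col = Y, pch = 19, ylim = c(-1, 5), xlim = c(0, 5), yaxs = "i", xaxs = "i")
points(c(2, 4), c(1.5, 3.5), pch = 4)  # the two midpoints, marked with crosses
abline(a = -0.5, b = 1)  # the hyperplane 0.5 - X1 + X2 = 0, i.e. X2 = X1 - 0.5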

c) The classification rule for the maximal margin classifier is:

For a given observation \( (X_1, X_2) \), classify it to Red if \( \displaystyle 0.5-X_1+X_2 > 0 \) and to Blue otherwise.
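
As a quick sanity check, the rule can be applied to the seven training observations (a sketch, assuming the X1 and X2 vectors from part a); it reproduces the Y labels exactly:

ifelse(0.5 - X1 + X2 > 0, "Red", "Blue")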

d) The dashed lines represent the margins of the maximal margin hyperplane:
[Figure: observations with the hyperplane (solid) and the two margins (dashed)]

The width of the margin is the perpendicular distance from one of the margin lines to the hyperplane. To calculate it, note that the vertical distance from a support vector such as (2,2) to the hyperplane is 0.5, and this vertical segment is the hypotenuse of a right triangle whose legs lie along and perpendicular to the hyperplane. Because the slope of the hyperplane is 1, this is a 45/45/90 triangle, so the perpendicular leg has length \( 0.5/\sqrt{2} \approx 0.354 \). (Equivalently, the distance from (2,2) to the line \( X_1-X_2-0.5=0 \) is \( |2-2-0.5|/\sqrt{1^2+1^2} \approx 0.354 \).) Therefore, the width of the margin is approximately 0.354.
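
A small sketch of this calculation in R (assuming the X1, X2, and Y vectors from part a); it redraws the points, adds the hyperplane and the two margin lines through the support vectors, and computes the perpendicular distance directly:

plot(X1, X2, col = Y, pch = 19, ylim = c(-1, 5), xlim = c(0, 5), yaxs = "i", xaxs = "i")
abline(a = -0.5, b = 1)  # hyperplane: X2 = X1 - 0.5
abline(a = 0, b = 1, lty = 2)  # upper margin through (2,2) and (4,4)
abline(a = -1, b = 1, lty = 2)  # lower margin through (2,1) and (4,3)
abs(2 - 2 - 0.5)/sqrt(1^2 + (-1)^2)  # = 0.5/sqrt(2), approximately 0.354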

e) The circled observations, (2,1), (2,2), (4,3), and (4,4), are the support vectors for the maximal margin hyperplane:
[Figure: observations with the four support vectors circled]

f) A slight movement of the 7th observation, the blue point (4,1) in the lower right-hand corner of the graph, will not affect the maximal margin hyperplane because this point is NOT one of the four support vectors (the points circled in part e). It would only affect the hyperplane if it moved across the margin.

g) Sketch a hyperplane that is NOT the optimal separating hyperplane:
[Figure: the optimal hyperplane (solid) and a non-optimal separating hyperplane (dashed)]

The solid line is the optimal separating hyperplane. The dashed line is a separating hyperplane that is not optimal; its formula is: \( \displaystyle 0.75-1.1X_1+X_2 = 0 \)
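
A sketch of how the two lines could be added to the scatter plot from part a) (assuming that plot is still the active device):

abline(a = -0.5, b = 1)  # optimal: 0.5 - X1 + X2 = 0
abline(a = -0.75, b = 1.1, lty = 2)  # non-optimal: 0.75 - 1.1*X1 + X2 = 0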

h) Draw an observation on the plot so that the two classes are no longer separable by a hyperplane:
[Figure: the original observations plus a new blue point at (2, 3)]

I've added an 8th observation at (2,3) in blue. This point falls on the Red side of the hyperplane, so the two classes are no longer separable by any hyperplane.
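
A quick check that the new point violates the classification rule from part c), plus a sketch of adding it to the plot from part a):

0.5 - 2 + 3  # = 1.5 > 0, so the blue point (2,3) falls on the Red side
points(2, 3, col = "Blue", pch = 19)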

Question 3

a) Setting up the problem and creating the training and test sets.

library(ISLR)
attach(OJ)
n = nrow(OJ)
set.seed(99)
train = sample(n, 800)
test = -train

b) Fit a support vector classifier with cost=0.01, using Purchase as the response and the other variables as predictors.

# Purchase is already a factor variable
library(e1071)
## Loading required package: class
set.seed(101)
svmfit = svm(Purchase ~ ., data = OJ, subset = train, cost = 0.01, kernel = "linear", 
    scale = F)
summary(svmfit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "linear", 
##     subset = train, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  630
## 
##  ( 316 314 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

That is a lot of support vectors (630 out of the 800 training observations); the cost parameter is probably set too low.

c)

Training error rate:

mean(predict(svmfit, OJ[train, ]) != OJ[train, ]$Purchase)
## [1] 0.2175

Test error rate:

mean(predict(svmfit, OJ[test, ]) != OJ[test, ]$Purchase)
## [1] 0.263

d) Use tune() to find optimal cost.

set.seed(101)
tune.out = tune(svm, Purchase ~ ., data = OJ[train, ], kernel = "linear", ranges = list(cost = c(0.01, 
    0.1, 1, 5, 10)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.1688 
## 
## - Detailed performance results:
##    cost  error dispersion
## 1  0.01 0.1713    0.04967
## 2  0.10 0.1713    0.04412
## 3  1.00 0.1688    0.04135
## 4  5.00 0.1700    0.03782
## 5 10.00 0.1725    0.03764

It looks like cost=1 gives the lowest cross-validation error rate.

e) Compute new error rates using cost=1

set.seed(101)
svmfit.new = svm(Purchase ~ ., data = OJ, subset = train, cost = 1, kernel = "linear", 
    scale = F)

Training error rate:

mean(predict(svmfit.new, OJ[train, ]) != OJ[train, ]$Purchase)
## [1] 0.1625

Test error rate:

mean(predict(svmfit.new, OJ[test, ]) != OJ[test, ]$Purchase)
## [1] 0.1815

f) Repeat parts b) to e) using a radial kernel with the default gamma.

set.seed(101)
svmfit = svm(Purchase ~ ., data = OJ, subset = train, cost = 0.01, kernel = "radial", 
    scale = F)
summary(svmfit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "radial", 
##     subset = train, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.01 
##       gamma:  0.05556 
## 
## Number of Support Vectors:  644
## 
##  ( 326 318 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

Again, a lot of support vectors (644); the cost parameter is probably set too low. Training error rate:

mean(predict(svmfit, OJ[train, ]) != OJ[train, ]$Purchase)
## [1] 0.3975

Test error rate:

mean(predict(svmfit, OJ[test, ]) != OJ[test, ]$Purchase)
## [1] 0.3667

Use tune() to find optimal cost.

set.seed(101)
tune.out = tune(svm, Purchase ~ ., data = OJ[train, ], kernel = "radial", ranges = list(cost = c(0.01, 
    0.1, 1, 5, 10)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.17 
## 
## - Detailed performance results:
##    cost  error dispersion
## 1  0.01 0.3975    0.06530
## 2  0.10 0.1875    0.03536
## 3  1.00 0.1700    0.02899
## 4  5.00 0.1787    0.04251
## 5 10.00 0.1837    0.04604

Again, it looks like cost=1 gives the lowest cross-validation error rate. Compute new error rates using cost=1:

set.seed(101)
svmfit.new = svm(Purchase ~ ., data = OJ, subset = train, cost = 1, kernel = "radial", 
    scale = F)

Training error rate:

mean(predict(svmfit.new, OJ[train, ]) != OJ[train, ]$Purchase)
## [1] 0.2125

Test error rate:

mean(predict(svmfit.new, OJ[test, ]) != OJ[test, ]$Purchase)
## [1] 0.2852

g) Repeat parts b) to e) using a polynomial kernel with degree=2.

set.seed(101)
svmfit = svm(Purchase ~ ., data = OJ, subset = train, cost = 0.01, kernel = "polynomial", 
    degree = 2, scale = F)
summary(svmfit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "polynomial", 
##     degree = 2, subset = train, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  0.01 
##      degree:  2 
##       gamma:  0.05556 
##      coef.0:  0 
## 
## Number of Support Vectors:  324
## 
##  ( 163 161 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

Still a lot of support vectors (324); the cost parameter is probably set too low. Training error rate:

mean(predict(svmfit, OJ[train, ]) != OJ[train, ]$Purchase)
## [1] 0.1638

Test error rate:

mean(predict(svmfit, OJ[test, ]) != OJ[test, ]$Purchase)
## [1] 0.1852

Use tune() to find optimal cost.

set.seed(101)
tune.out = tune(svm, Purchase ~ ., data = OJ[train, ], kernel = "polynomial", 
    ranges = list(cost = c(0.01, 0.1, 1, 5, 10), degree = c(2, 3, 4, 5)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##    10      2
## 
## - best performance: 0.1775 
## 
## - Detailed performance results:
##     cost degree  error dispersion
## 1   0.01      2 0.3875    0.07289
## 2   0.10      2 0.3163    0.05623
## 3   1.00      2 0.1862    0.03143
## 4   5.00      2 0.1812    0.04574
## 5  10.00      2 0.1775    0.02751
## 6   0.01      3 0.3762    0.07058
## 7   0.10      3 0.2888    0.05604
## 8   1.00      3 0.1875    0.03773
## 9   5.00      3 0.1838    0.04292
## 10 10.00      3 0.1787    0.03775
## 11  0.01      4 0.3762    0.07179
## 12  0.10      4 0.3150    0.06061
## 13  1.00      4 0.2288    0.04679
## 14  5.00      4 0.1950    0.04534
## 15 10.00      4 0.1900    0.04362
## 16  0.01      5 0.3775    0.07066
## 17  0.10      5 0.3175    0.06015
## 18  1.00      5 0.2400    0.06203
## 19  5.00      5 0.2087    0.04642
## 20 10.00      5 0.2087    0.05002

In this case, it looks like cost=10 with degree=2 gives the lowest cross-validation error rate. Higher polynomial degrees resulted in even higher error rates.

Compute new error rates using cost=10 and degree=2:

set.seed(101)
svmfit.new = svm(Purchase ~ ., data = OJ, subset = train, cost = 10, kernel = "polynomial", 
    degree = 2, scale = F)

Training error rate:

mean(predict(svmfit.new, OJ[train, ]) != OJ[train, ]$Purchase)
## [1] 0.16

Test error rate:

mean(predict(svmfit.new, OJ[test, ]) != OJ[test, ]$Purchase)
## [1] 0.2111

h) Overall, the linear SVM gave the best results (the lowest test error rate) under these seed settings.
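
For reference, the tuned test error rates reported above, collected in one place:

c(linear = 0.1815, radial = 0.2852, polynomial = 0.2111)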

Question 4

a) We want to prove the following statement:

\( \displaystyle\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 = 2\sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2 \)

Because these are finite sums, we can exchange the order of summation:

\( \displaystyle\sum_{j=1}^{p} \color{red}{\frac{1}{|C_k|}\sum_{i,i'\in C_k}(x_{ij} - x_{i'j})^2} = \sum_{j=1}^{p} \color{red}{2\sum_{i\in C_k}(x_{ij} - \bar{x}_{kj})^2} \)

To prove the equality, we'll focus on proving that the red terms are equal.

If we take the inner summation from the left-hand side of the equation, \( \displaystyle\sum_{i,i'\in C_k}(x_{ij} - x_{i'j})^2 \), and insert the mean/centroid \( \bar{x}_{kj} \) into it without changing its value, we get: \( \displaystyle\sum_{i,i'\in C_k}((x_{ij} - \bar{x}_{kj}) - (x_{i'j}-\bar{x}_{kj}))^2 \)

In general, when the indices \( i \) and \( j \) both run over the same \( n \) values, the sum \( \displaystyle\sum_{i,j=1}^{n}(x_{i}-x_{j})^2 \) expands to \( \displaystyle n\sum_{i=1}^{n}x^2_{i}-2\sum_{i=1}^{n}x_{i}\sum_{j=1}^{n}x_{j}+n\sum_{j=1}^{n}x^2_{j} \)

The term \( \displaystyle\sum_{i,j=1}^{n}x_{i}^2 \) becomes \( \displaystyle n\sum_{i=1}^{n}x^2_{i} \) because \( x^2_{i} \) is first summed over all of the \( i \)'s, and this sum is then repeated for each value of \( j \) (i.e. added \( n \) times). And since \( i \) and \( j \) run over the same set (e.g. \( i,j\in C_k \)), we also have \( \displaystyle n\sum_{i=1}^{n}x^2_{i} = n\sum_{j=1}^{n}x^2_{j} \), because both sum over the same set of numbers (albeit in a different order).

So now we can rewrite the expansion of the summation \( \displaystyle\sum_{i,j=1}^{n}(x_{i}-x_{j})^2 \) as:

\( \displaystyle 2n\sum_{i=1}^{n}x^2_{i}-2\sum_{i=1}^{n}x_{i}\sum_{j=1}^{n}x_{j} \)

Taking the modified term we generated earlier, \( \displaystyle\sum_{i,i'\in C_k}((x_{ij} - \bar{x}_{kj}) - (x_{i'j}-\bar{x}_{kj}))^2 \), and applying the same expansion with \( n = |C_k| \), we end up with:

\( \displaystyle 2|C_k|\sum_{i\in C_k} (x_{ij} - \bar{x}_{kj})^2 - 2\sum_{i\in C_k}(x_{ij} - \bar{x}_{kj})\sum_{i'\in C_k}(x_{i'j}-\bar{x}_{kj}) \)

Now if we look at the second term, we see that \( \displaystyle \sum_{i\in C_k}(x_{ij} - \bar{x}_{kj}) \) is by definition 0 (the deviations from the cluster mean/centroid sum to 0 over the cluster). So we are left with only one term:

\( \displaystyle 2|C_k|\sum_{i\in C_k} (x_{ij} - \bar{x}_{kj})^2 \)

And so:

\( \displaystyle\sum_{i,i'\in C_k}(x_{ij} - x_{i'j})^2 = 2|C_k|\sum_{i\in C_k} (x_{ij} - \bar{x}_{kj})^2 \)

Now if we go back to the equation we are trying to prove, \( \displaystyle\sum_{j=1}^{p} \color{red}{\frac{1}{|C_k|}\sum_{i,i'\in C_k}(x_{ij} - x_{i'j})^2} = \sum_{j=1}^{p} \color{red}{2\sum_{i\in C_k}(x_{ij} - \bar{x}_{kj})^2} \), and substitute the result above for the inner summation on the left-hand side, we get:

\( \displaystyle\sum_{j=1}^{p} \color{red}{\frac{1}{|C_k|}2|C_k|\sum_{i\in C_k} (x_{ij} - \bar{x}_{kj})^2} = \sum_{j=1}^{p} \color{red}{2\sum_{i\in C_k}(x_{ij} - \bar{x}_{kj})^2} \)

After canceling \( |C_k| \) on the left-hand side, we get the exact same terms on both sides of the equation:

\( \displaystyle\sum_{j=1}^{p} \color{red}{2\sum_{i\in C_k}(x_{ij} - \bar{x}_{kj})^2}=\sum_{j=1}^{p} \color{red}{2\sum_{i\in C_k}(x_{ij} - \bar{x}_{kj})^2} \)

This proves that the initial equation:

\( \displaystyle\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 = 2\sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2 \)

is true.
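
As a quick numeric sanity check of the identity, here is a small sketch on a random toy cluster (any matrix of observations would do; Ck, lhs, and rhs are just illustrative names):

set.seed(1)
Ck = matrix(rnorm(5 * 3), nrow = 5)  # |Ck| = 5 observations, p = 3 features
lhs = sum(as.matrix(dist(Ck))^2)/nrow(Ck)  # (1/|Ck|) * sum over all pairs (i, i') of squared distances
rhs = 2 * sum(sweep(Ck, 2, colMeans(Ck))^2)  # 2 * sum of squared deviations from the centroid
all.equal(lhs, rhs)  # TRUE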

b) The within-cluster variation \( W(C_k) \) for the k-th cluster:

\( \displaystyle W(C_k) = \frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 = 2\sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2 \)

is the sum of all pairwise squared Euclidean distances between the observations in the k-th cluster, scaled by the cluster size. The identity in part a) shows that minimizing the squared distances of the observations to their cluster centroid minimizes this same quantity, and this is exactly what the k-means algorithm does.

First, we start with K randomly generated clusters and calculate the centroid of each.

Then, at each iteration, the k-means algorithm does two things:

1) Assign each observation to the existing cluster whose centroid is closest to it. This places the \( i \)-th observation in the cluster \( C_k \) whose centroid minimizes \( \displaystyle\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2 \), which can only reduce the per-cluster sums \( \displaystyle\sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2 \) (the RSS of each cluster).

2) Recompute the centroid \( \displaystyle\bar{x}_{kj} \) of each cluster as the mean of the observations now assigned to it. Because the sample mean minimizes the sum of squared deviations, this new centroid further reduces the RSS of each cluster.

Therefore, both steps of each iteration reduce (or at least never increase) \( 2\sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2 \), and by the identity in part a) this also reduces the within-cluster pairwise squared Euclidean distances. Thus the total within-cluster variation \( \displaystyle\sum_{k=1}^{K} W(C_k) \) decreases (or stays the same) after each iteration.
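
To make the two steps concrete, here is a minimal sketch of a single k-means iteration in R (illustrative only; kmeans_iteration is a hypothetical helper, and the actual kmeans() implementation differs):

kmeans_iteration = function(x, centroids) {
    K = nrow(centroids)
    # step 1: assign each observation (row of x) to the cluster whose centroid is closest
    d2 = sapply(1:K, function(k) colSums((t(x) - centroids[k, ])^2))
    cluster = max.col(-d2)  # column index of the smallest squared distance per observation
    # step 2: recompute each centroid as the mean of its assigned observations
    new.centroids = t(sapply(1:K, function(k) colMeans(x[cluster == k, , drop = FALSE])))
    list(cluster = cluster, centroids = new.centroids)
}

Repeating this until the assignments stop changing can only decrease the objective, since each of the two steps does (assuming no cluster ever becomes empty).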

c) From part b), the total within-cluster variation \( \displaystyle\sum_{k=1}^{K} W(C_k) \) never increases at any step of the k-means algorithm. Since it is bounded below by zero, and there are only finitely many ways to partition the observations into K clusters, the objective cannot keep decreasing forever; the algorithm therefore converges (to a local optimum).

d) Here is a toy data set, together with a deliberately chosen set of "wrong" initial centroids.

set.seed(2)
x = matrix(rnorm(500 * 2), ncol = 2)
x[1:250, 1] = x[1:250, 1] + rnorm(250, 1)
x[1:250, 2] = x[1:250, 1] - rnorm(250, 2)
centroidx = c(-4, -3)
centroidy = c(-7, -7)

Now we run the k-means algorithm with these initial centroids and calculate the total within-cluster sum of squares, \( \displaystyle\sum_{k=1}^{K} W(C_k) \).

# note: k is inferred from the rows of the centers matrix; the positional 2 here is matched to iter.max
km.random = kmeans(x, 2, centers = cbind(centroidx, centroidy))
km.random$tot.withinss
## [1] 1198

Now we run the k-means algorithm with many different random starting values (nstart = 200) to approximate the globally optimal solution and record the resulting \( \displaystyle\sum_{k=1}^{K} W(C_k) \).

km.out = kmeans(x, 2, nstart = 200)
km.out$tot.withinss
## [1] 1194

We can see from these two runs that picking a "wrong" pair of initial centroids resulted in a slightly higher total within-cluster sum of squares than the (near-)global optimum found with many random starts.
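
As a side check, the tot.withinss value reported by kmeans() is simply the sum of the per-cluster withinss components (each being \( \sum_{i\in C_k}\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2 \) for that cluster):

all.equal(km.out$tot.withinss, sum(km.out$withinss))  # TRUE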

In addition, we can see the clustering results graphically - they are slightly different:

[Figure: cluster assignments from the two k-means runs]