a) Best subset selection has the smallest training RSS. Best subset selection is guaranteed to find the model with the lowest training RSS (best fit) because it searches every possible subset of predictors, whereas the other two methods do not consider every subset and might miss the model with the lowest training RSS.
b) In the absence of additional information, it is impossible to tell which method will yield the smallest test RSS. While best subset selection is guaranteed to find the model with the lowest training RSS, that model may overfit the data and therefore have a higher test RSS. Conversely, one of the stepwise methods might end up with a lower test RSS simply because its chosen model happens to generalize better to the test data.
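As a small illustration (my own sketch on simulated data, not part of the original answer), the leaps package can run all three search strategies on the same data; for every model size, the exhaustive search's training RSS is at most that of the stepwise fits.
library(leaps)
set.seed(1)
dat <- data.frame(matrix(rnorm(100 * 8), ncol = 8), y = rnorm(100))  # simulated data, illustration only
fit.best <- regsubsets(y ~ ., data = dat, nvmax = 8)  # exhaustive search
fit.fwd <- regsubsets(y ~ ., data = dat, nvmax = 8, method = "forward")  # forward stepwise
fit.bwd <- regsubsets(y ~ ., data = dat, nvmax = 8, method = "backward")  # backward stepwise
cbind(best = summary(fit.best)$rss, fwd = summary(fit.fwd)$rss, bwd = summary(fit.bwd)$rss)  # training RSS by model size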
c)
i. True.
ii. True.
iii. False. Forward and backward selection can identify different sets of variables at each step of the process; even at the same model size \( k \), the models chosen by the two procedures may contain different variables, so there is no guarantee that one is nested within the other.
iv. False. Same explanation as part iii above.
v. False. Best subset selection searches all subsets at each model size separately, so the \( (k+1) \)-variable model need not contain every variable from the \( k \)-variable model; it may swap some variables out entirely. This could be caused by correlation among several of the predictors.
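(For the smoothing question below: as I read the exercise, \( \hat{g} \) is the minimizer of a penalized residual sum of squares of the form
\[ \hat{g} = \arg\min_{g} \left\{ \sum_{i=1}^{n} \big( y_i - g(x_i) \big)^2 + \lambda \int \big[ g^{(m)}(x) \big]^2 \, dx \right\}, \]
where \( g^{(m)} \) denotes the \( m \)-th derivative of \( g \). The answers in parts a) through e) refer to this criterion.)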
a) When \( \lambda=\infty \), the roughness penalty dominates, so \( g \) is forced to make the penalized derivative of order \( m \) zero everywhere. If \( m=0 \), the 0th derivative (the function itself) must be 0 everywhere, so \( \hat{g}=0 \): we cannot fit any function through the data at all. This corresponds to 0 degrees of freedom. Sample sketch (using the Wage data):
library(ISLR)
attach(Wage)
plot(age, wage, ylim = c(0, max(wage)))
abline(h = 0, lwd = 5, col = "blue") #g=0
b) Again, when \( \lambda=\infty \), the roughness penalty forces the penalized derivative to be 0 everywhere. With \( m=1 \), the first derivative is 0 everywhere, so the function itself must be a constant. This gives \( \hat{g} \) 1 degree of freedom: we are free to choose the intercept that minimizes the loss function, which is the mean of all the \( y_i \) (the simple average). Sample sketch (using the Wage data):
plot(age, wage)
abline(h = mean(wage), lwd = 2, col = "blue") #simple average of the response values
c) When \( \lambda=\infty \) and \( m=2 \), the second derivative of the function is 0 everywhere, so the first derivative is constant and \( \hat{g} \) is a straight line. This gives \( \hat{g} \) 2 degrees of freedom: we are free to choose the intercept and slope that minimize the loss function. Since the loss function reduces to the residual sum of squares, the fitted line is the least squares linear regression line. Sample sketch:
plot(age, wage)
abline(lm(wage ~ age), lwd = 2, col = "blue") #least squares linear regression line
d) When \( \lambda=\infty \) and \( m=3 \), the third derivative of the function is 0 everywhere, so the second derivative is constant and \( \hat{g} \) can be a second-degree polynomial. This gives \( \hat{g} \) 3 degrees of freedom: we are free to choose the intercept, slope, and curvature that minimize the loss function. Since the loss function reduces to the residual sum of squares, the curve is the degree-2 polynomial regression curve fitted by least squares. Sample sketch:
plot(age, wage)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
fit = lm(wage ~ poly(age, 2))
preds = predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, lwd = 2, col = "blue") #least squares degree-2 polynomial regression curve
e) When \( \lambda=0 \), the roughness penalty has no effect, so \( \hat{g} \) interpolates all of the \( y_i \). It turns out that the maximum degrees of freedom equals the number of unique \( x_i \) values (this is the upper limit allowed by the R function smooth.spline()).
plot(age, wage)
lines(smooth.spline(age, wage, df = 61), lwd = 2, col = "blue") #interpolating through all y. Maximum df is the number of unique values of x (which was 61 in the case of the age variable)
I worked with Seema Sangari, Sanne de Roever, and Vedat Can Cologlu on this question.
a.1. Load the data
load("~/Dropbox/Data Science/STATS216/Inclass Material/Feb 12/body.RData")
names(X)
## [1] "Wrist.Diam" "Wrist.Girth" "Forearm.Girth"
## [4] "Elbow.Diam" "Bicep.Girth" "Shoulder.Girth"
## [7] "Biacromial.Diam" "Chest.Depth" "Chest.Diam"
## [10] "Chest.Girth" "Navel.Girth" "Waist.Girth"
## [13] "Pelvic.Breadth" "Bitrochanteric.Diam" "Hip.Girth"
## [16] "Thigh.Girth" "Knee.Diam" "Knee.Girth"
## [19] "Calf.Girth" "Ankle.Diam" "Ankle.Girth"
names(Y)
## [1] "Age" "Weight" "Height" "Gender"
a.2. First we examine the variables to see which ones are bimodal (having two peaks) in their distribution, since this is most likely caused by gender. Then we make boxplots of these variables against gender to deduce how the genders are coded.
# table(Y$Gender)
# attach(Y)
par(mfrow = c(2, 3))
vnames = colnames(Y)[-4]
for (i in vnames) hist(Y[, i], main = i, xlab = "")
vnames = colnames(X)
for (i in vnames) hist(X[, i], main = i, xlab = "")
Variables such as Height, Forearm Girth, Shoulder Girth, and Biacromial Diam are clearly bimodal, so we look at boxplots of a few of these variables against gender to deduce the coding.
par(mfrow = c(2, 2), ask = TRUE)
boxplot(Y$Height ~ Y$Gender, main = "Height vs Gender", ylab = "Height", xlab = "Gender")
boxplot(X$Shoulder.Girth ~ Y$Gender, main = "Shoulder Girth vs Gender", ylab = "Shoulder Girth",
xlab = "Gender")
boxplot(X$Forearm.Girth ~ Y$Gender, main = "Forearm Girth vs Gender", ylab = "Forearm Girth",
xlab = "Gender")
boxplot(X$Biacromial.Diam ~ Y$Gender, main = "Biacromial Diam vs Gender", ylab = "Biacromial Diam",
xlab = "Gender")
Since males tend to have higher measurements on these bimodal variables, we can deduce that gender is coded as 1 for male and 0 for female.
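As a quick numerical sanity check (my own addition, not required by the question), comparing the group means of a clearly bimodal variable by the Gender code points the same way:
tapply(Y$Height, Y$Gender, mean)  # the larger mean should belong to the male group
tapply(X$Shoulder.Girth, Y$Gender, mean)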
b.1. Create training and test data
# create training and test set
n = nrow(Y)
set.seed(99)
train = sample(n, 307)
test = -train
b.2. The measured body parts differ a lot in scale (a wrist versus a chest, for example); standardizing the measurements compensates for this difference and keeps the large-scale variables from dominating the components.
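For instance (a quick check I added, not part of the original code), the raw standard deviations of the measurements differ substantially, which is why scale = TRUE is used in the fits below:
round(apply(X, 2, sd), 2)  # raw standard deviation of each body measurement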
# pcr and pls
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
data = data.frame(Y$Weight, X)
set.seed(202)
weight.pcr <- pcr(Y.Weight ~ ., data = data, subset = train, scale = TRUE, validation = "CV")
set.seed(303)
weight.plsr <- plsr(Y.Weight ~ ., data = data, subset = train, scale = TRUE,
validation = "CV")
c. In both models, adding extra components yields diminishing returns: the first few components account for most of the variance, so this pattern is not surprising. What is surprising is that with only one component, PCR explains the variance in the response (% variance explained in Y.Weight) essentially as well as PLSR does (both reach around 94% with their first component). One would expect PLSR to do better at explaining the variance in Y because PLSR places the highest weight on the variables most strongly related to the response, whereas PCR is unsupervised (not trained on Y). But looking at summary() for both methods, the percentages of variance explained are very similar.
summary(weight.pcr)
## Data: X dimension: 307 21
## Y dimension: 307 1
## Fit method: svdpc
## Number of components considered: 21
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 13.31 3.121 3.051 2.928 2.864 2.870 2.852
## adjCV 13.31 3.120 3.050 2.926 2.862 2.868 2.849
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.869 2.896 2.843 2.857 2.865 2.863 2.867
## adjCV 2.864 2.894 2.837 2.851 2.859 2.855 2.860
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 2.892 2.831 2.769 2.719 2.715 2.722
## adjCV 2.886 2.821 2.757 2.709 2.705 2.712
## 20 comps 21 comps
## CV 2.732 2.753
## adjCV 2.721 2.741
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 64.66 75.38 80.45 84.76 86.89 88.98 90.59
## Y.Weight 94.54 94.82 95.26 95.49 95.53 95.57 95.58
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 91.99 93.25 94.36 95.34 96.14 96.89
## Y.Weight 95.58 95.74 95.74 95.75 95.81 95.81
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## X 97.56 98.14 98.62 99.03 99.40 99.65
## Y.Weight 95.82 95.98 96.19 96.32 96.36 96.36
## 20 comps 21 comps
## X 99.84 100.00
## Y.Weight 96.36 96.36
plot(1:21, cumsum(weight.pcr$Xvar/weight.pcr$Xtotvar))
summary(weight.plsr)
## Data: X dimension: 307 21
## Y dimension: 307 1
## Fit method: kernelpls
## Number of components considered: 21
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 13.31 3.072 2.851 2.796 2.786 2.760 2.734
## adjCV 13.31 3.070 2.849 2.792 2.773 2.749 2.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.733 2.735 2.738 2.741 2.743 2.744 2.744
## adjCV 2.723 2.724 2.728 2.730 2.732 2.732 2.732
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 2.744 2.744 2.744 2.744 2.744 2.744
## adjCV 2.732 2.732 2.732 2.732 2.732 2.732
## 20 comps 21 comps
## CV 2.744 2.744
## adjCV 2.732 2.732
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 64.65 73.04 79.66 81.66 84.89 87.10 88.54
## Y.Weight 94.74 95.58 95.88 96.21 96.30 96.35 96.36
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 89.39 90.71 91.80 93.19 93.86 94.63
## Y.Weight 96.36 96.36 96.36 96.36 96.36 96.36
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## X 95.29 96.12 96.69 97.36 98.40 99.06
## Y.Weight 96.36 96.36 96.36 96.36 96.36 96.36
## 20 comps 21 comps
## X 99.47 100.00
## Y.Weight 96.36 96.36
plot(1:21, cumsum(weight.plsr$Xvar/weight.plsr$Xtotvar))
d. We can plot the cross-validated error for each number of components and pick the value beyond which there is no meaningful improvement. For PCR, 2 or 4 components look reasonable; we will go with 4, since beyond that point additional components do not reduce the error significantly. For PLSR, we think 3 components is reasonable for the same reason.
validationplot(weight.pcr, val.type = "MSEP", ylim = c(0, 50))
axis(side = 1, at = 1:weight.pcr$ncomp)
validationplot(weight.plsr, val.type = "MSEP", ylim = c(0, 50))
axis(side = 1, at = 1:weight.plsr$ncomp)
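As an alternative (not used in this write-up), recent versions of the pls package provide selectNcomp(), which picks the number of components automatically; this is a sketch assuming that function is available in your version of pls:
selectNcomp(weight.pcr, method = "onesigma")  # one-sigma heuristic applied to the CV curve
selectNcomp(weight.plsr, method = "onesigma")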
e. Neither PLSR nor PCR performs variable selection, while best subset selection does. So we run best subset selection to see how many variables we can potentially drop.
X.training <- data[train, ]
X.test <- data[-train, ]
library(leaps)
regfit.full <- regsubsets(Y.Weight ~ ., data = X.training, nvmax = 21)
reg.summary <- summary(regfit.full)
par(mfrow = c(2, 1))
plot(reg.summary$rss, xlab = " Number of Variables ", ylab = " RSS", type = "l")
plot(reg.summary$adjr2, xlab = " Number of Variables ", ylab = " Adjusted RSq",
type = "l")
Let's use cross-validation to choose the number of predictors, using only the training data.
# predict() method for regsubsets objects (the leaps package does not supply one):
# rebuild the model matrix from the formula and multiply by the chosen model's coefficients
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    xvars = names(coefi)
    mat[, xvars] %*% coefi
}
k = 10
set.seed(107)
folds <- sample(1:k, nrow(X.training), replace = TRUE)
cv.errors = matrix(NA, k, 21, dimnames = list(NULL, paste(1:21)))
for (j in 1:k) {
best.fit <- regsubsets(Y.Weight ~ ., data = X.training[folds != j, ], nvmax = 21)
for (i in 1:21) {
pred = predict(best.fit, X.training[folds == j, ], id = i)
cv.errors[j, i] = mean((X.training$Y.Weight[folds == j] - pred)^2)
}
}
mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8 9 10
## 37.641 17.697 14.104 12.024 10.542 9.623 10.054 9.889 9.326 9.012
## 11 12 13 14 15 16 17 18 19 20
## 8.110 8.174 8.010 7.619 7.643 7.680 7.666 7.687 7.665 7.674
## 21
## 7.689
which.min(mean.cv.errors)
## 14
## 14
par(mfrow = c(1, 1))
plot(mean.cv.errors, type = "b")
regfit.best <- regsubsets(Y.Weight ~ ., data = X.training, nvmax = 14)
So according to the best subset model we actually need only 14 predictors and can drop the other 7 (one third of all predictors). Having ascertained the optimal number of variables by cross-validation, we re-fit the best subset selection method on the training dataset, per the instructions in the question.
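To see exactly which 14 predictors survive (an extra check I added; output not reproduced here), the coefficients of the 14-variable model can be listed:
coef(regfit.best, id = 14)  # names and coefficients of the 14 variables kept by best subset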
f. PLSR performs better than PCR here. But in this case, neither dimension reduction method did as well as best subset selection, a pure variable selection method.
# pcr
pcr.pred = predict(weight.pcr, newdata = data[test, ], ncomp = 3)
mean((pcr.pred - Y$Weight[test])^2)
## [1] 10.47
# plsr
plsr.pred = predict(weight.plsr, newdata = data[test, ], ncomp = 4)
mean((plsr.pred - Y$Weight[test])^2)
## [1] 8.359
# best subset
# use the predict.regsubsets() helper defined above on the whole test set at once
predSubsetReg <- predict(regfit.best, X.test, id = 14)
mean((predSubsetReg - X.test$Y.Weight)^2)
## [1] 7.997
I worked with Seema Sangari, Sanne de Roever, and Vedat Can Cologlu on this question.
a. Function h
h <- function(x, z) {
ifelse(x > z, (x - z)^3, 0)
}
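A quick check of the truncated-cube behaviour (values worked out by hand, added for illustration):
h(2, 1)  # (2 - 1)^3 = 1
h(0.5, 1)  # 0, since 0.5 <= 1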
b. Function hs
hs <- function(xs, z) {
sapply(xs, function(x) h(x, z))
}
c. Function spline basis
# basis matrix: columns x, x^2, x^3, then one truncated-cubic column per knot
splinebasis <- function(xs, zs) {
    sb = matrix(NA, nrow = length(xs), ncol = length(zs) + 3)
sb[, 1] = xs
sb[, 2] = xs^2
sb[, 3] = xs^3
for (i in 1:length(zs)) {
sb[, i + 3] = hs(xs, zs[i])
}
return(sb)
}
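A quick sanity check on the basis dimensions (my own addition): with 3 knots the basis should have 3 + 3 = 6 columns.
dim(splinebasis(runif(10), c(0.25, 0.5, 0.75)))  # 10 rows, 6 columns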
d. Generate data
set.seed(1337)
x = runif(100)
y = sin(10 * x)
e. Fit splines
knots = c(1/4, 2/4, 3/4)
splines = splinebasis(x, knots)
data <- data.frame(y, splines)
lm.fit <- lm(y ~ ., data = data)
# plot function
fx = seq(0, 1, len = 101)
fy = sin(10 * fx)
plot(fx, fy, type = "l", col = "blue", lwd = 2)
# plot sample
points(x, y)
# plot fit
tsplines = splinebasis(fx, knots)
tdata <- data.frame(y = rep(0, 101), tsplines)
tdata$y <- predict(lm.fit, newdata = tdata)
lines(fx, tdata$y, type = "l", col = "red", lwd = 2)
f. Repeat for \( k=1..9 \)
for (k in 1:9) {
plot(fx, fy, type = "l", col = "blue", lwd = 2, main = paste(k, "knots"))
knots = 1:k/(k + 1)
splines = splinebasis(x, knots)
data <- data.frame(y, splines)
lm.fit = lm(y ~ ., data = data)
tsplines = splinebasis(fx, knots)
tdata <- data.frame(y = rep(0, 101), tsplines)
tdata$y <- predict(lm.fit, newdata = tdata)
lines(fx, tdata$y, type = "l", col = "red", lwd = 2)
}
g. Adding more degrees of freedom gives the model more flexibility to approximate the original function more closely. Adding more than 7 knots seems unnecessary; at some point, using a smoothing spline would be preferable to adding more knots.
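As a rough illustration of that last point (a sketch I added, not part of the original answer), a smoothing spline fitted to the same generated data chooses its effective flexibility by cross-validation instead of by fixing knot locations:
ss <- smooth.spline(x, y, cv = TRUE)  # flexibility chosen by leave-one-out CV
plot(fx, fy, type = "l", col = "blue", lwd = 2, main = "smooth.spline comparison")
points(x, y)
lines(predict(ss, fx), col = "red", lwd = 2)  # smoothing spline fit over the grid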