a) Best subset selection has the smallest training RSS. Best subset selection is guaranteed to find the model with the lowest training RSS (best fit) because it searches every possible subset of predictors, whereas the other two methods do not consider every subset and might miss the model with the lowest training RSS.
b) In the absence of additional information, it is impossible to tell which method will yield the smallest test RSS. While best subset selection is guaranteed to find the model with the lowest training RSS, that model may overfit the data and therefore have a higher test RSS. Conversely, one of the stepwise methods might end up with a lower test RSS simply because its chosen model happens to generalize better to the test data.
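As a small illustration (my own sketch on simulated data, not part of the original answer), the leaps package can run all three search strategies on the same data; for every model size, the exhaustive search's training RSS is at most that of the stepwise fits.
library(leaps)
set.seed(1)
dat <- data.frame(matrix(rnorm(100 * 8), ncol = 8), y = rnorm(100))  # simulated data, illustration only
fit.best <- regsubsets(y ~ ., data = dat, nvmax = 8)  # exhaustive search
fit.fwd <- regsubsets(y ~ ., data = dat, nvmax = 8, method = "forward")  # forward stepwise
fit.bwd <- regsubsets(y ~ ., data = dat, nvmax = 8, method = "backward")  # backward stepwise
cbind(best = summary(fit.best)$rss, fwd = summary(fit.fwd)$rss, bwd = summary(fit.bwd)$rss)  # training RSS by model size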
c)
i. True.
ii. True.
iii. False. Forward and backward selection can identify different sets of variables at each step of the process; even at the same model size \( k \), the models chosen by the two procedures may contain different variables, so there is no guarantee that one is nested within the other.
iv. False. Same explanation as part iii above.
v. False. Best subset selection searches all subsets at each model size separately, so the \( (k+1) \)-variable model need not contain every variable from the \( k \)-variable model; it may swap some variables out entirely. This could be caused by correlation among several of the predictors.
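(For the smoothing question below: as I read the exercise, \( \hat{g} \) is the minimizer of a penalized residual sum of squares of the form
\[ \hat{g} = \arg\min_{g} \left\{ \sum_{i=1}^{n} \big( y_i - g(x_i) \big)^2 + \lambda \int \big[ g^{(m)}(x) \big]^2 \, dx \right\}, \]
where \( g^{(m)} \) denotes the \( m \)-th derivative of \( g \). The answers in parts a) through e) refer to this criterion.)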
a) When \( \lambda=\infty \), the roughness penalty dominates, so \( g \) is forced to make the penalized derivative of order \( m \) zero everywhere. If \( m=0 \), the 0th derivative (the function itself) must be 0 everywhere, so \( \hat{g}=0 \): we cannot fit any function through the data at all. This corresponds to 0 degrees of freedom. Sample sketch (using the Wage data):
library(ISLR)
attach(Wage)
plot(age, wage, ylim = c(0, max(wage)))
abline(h = 0, lwd = 5, col = "blue") #g=0
b) Again, when \( \lambda=\infty \), the roughness penalty forces the penalized derivative to be 0 everywhere. With \( m=1 \), the first derivative is 0 everywhere, so the function itself must be a constant. This gives \( \hat{g} \) 1 degree of freedom: we are free to choose the intercept that minimizes the loss function, which is the mean of all the \( y_i \) (the simple average). Sample sketch (using the Wage data):
plot(age, wage)
abline(h = mean(wage), lwd = 2, col = "blue") #simple average of the response values
c) When \( \lambda=\infty \) and \( m=2 \), the second derivative of the function is 0 everywhere, so the first derivative is constant and \( \hat{g} \) is a straight line. This gives \( \hat{g} \) 2 degrees of freedom: we are free to choose the intercept and slope that minimize the loss function. Since the loss function reduces to the residual sum of squares, the fitted line is the least squares linear regression line. Sample sketch:
plot(age, wage)
abline(lm(wage ~ age), lwd = 2, col = "blue") #least squares linear regression line
d) When \( \lambda=\infty \) and \( m=3 \), the third derivative of the function is 0 everywhere, so the second derivative is constant and \( \hat{g} \) can be a second-degree polynomial. This gives \( \hat{g} \) 3 degrees of freedom: we are free to choose the intercept, slope, and curvature that minimize the loss function. Since the loss function reduces to the residual sum of squares, the curve is the degree-2 polynomial regression curve fitted by least squares. Sample sketch:
plot(age, wage)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
fit = lm(wage ~ poly(age, 2))
preds = predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, lwd = 2, col = "blue") #least squares degree-2 polynomial regression curve
e) When \( \lambda=0 \), the roughness penalty has no effect, so \( \hat{g} \) interpolates all of the \( y_i \). It turns out that the maximum degrees of freedom equals the number of unique \( x_i \) values (this is the upper limit allowed by the R function smooth.spline()).
plot(age, wage)
lines(smooth.spline(age, wage, df = 61), lwd = 2, col = "blue") #interpolating through all y. Maximum df is the number of unique values of x (which was 61 in the case of the age variable)
I worked with Seema Sangari, Sanne de Roever, and Vedat Can Cologlu on this question.
a.1. Load the data
load("~/Dropbox/Data Science/STATS216/Inclass Material/Feb 12/body.RData")
names(X)
## [1] "Wrist.Diam" "Wrist.Girth" "Forearm.Girth"
## [4] "Elbow.Diam" "Bicep.Girth" "Shoulder.Girth"
## [7] "Biacromial.Diam" "Chest.Depth" "Chest.Diam"
## [10] "Chest.Girth" "Navel.Girth" "Waist.Girth"
## [13] "Pelvic.Breadth" "Bitrochanteric.Diam" "Hip.Girth"
## [16] "Thigh.Girth" "Knee.Diam" "Knee.Girth"
## [19] "Calf.Girth" "Ankle.Diam" "Ankle.Girth"
names(Y)
## [1] "Age" "Weight" "Height" "Gender"
a.2. First we examine the variables to see which ones are bimodal (having two peaks) in their distribution, since this is most likely caused by gender. Then we make boxplots of these variables against gender to deduce how the genders are coded.
# table(Y$Gender)
# attach(Y)
par(mfrow = c(2, 3))
vnames = colnames(Y)[-4]
for (i in vnames) hist(Y[, i], main = i, xlab = "")
vnames = colnames(X)
for (i in vnames) hist(X[, i], main = i, xlab = "")
Variables such as Height, Forearm Girth, Shoulder Girth, and Biacromial Diam are clearly bimodal, so we look at boxplots of a few of these variables against gender to deduce the coding.
par(mfrow = c(2, 2), ask = TRUE)
boxplot(Y$Height ~ Y$Gender, main = "Height vs Gender", ylab = "Height", xlab = "Gender")
boxplot(X$Shoulder.Girth ~ Y$Gender, main = "Shoulder Girth vs Gender", ylab = "Shoulder Girth",
xlab = "Gender")
boxplot(X$Forearm.Girth ~ Y$Gender, main = "Forearm Girth vs Gender", ylab = "Forearm Girth",
xlab = "Gender")
boxplot(X$Biacromial.Diam ~ Y$Gender, main = "Biacromial Diam vs Gender", ylab = "Biacromial Diam",
xlab = "Gender")
Since males tend to have higher measurements on these bimodal variables, we can deduce that gender is coded as 1 for male and 0 for female.
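As a quick numerical sanity check (my own addition, not required by the question), comparing the group means of a clearly bimodal variable by the Gender code points the same way:
tapply(Y$Height, Y$Gender, mean)  # the larger mean should belong to the male group
tapply(X$Shoulder.Girth, Y$Gender, mean)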
b.1. Create training and test data
# create training and test set
n = nrow(Y)
set.seed(99)
train = sample(n, 307)
test = -train
b.2. The measured body parts differ a lot in scale (a wrist versus a chest, for example); standardizing the measurements compensates for this difference and keeps the large-scale variables from dominating the components.
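For instance (a quick check I added, not part of the original code), the raw standard deviations of the measurements differ substantially, which is why scale = TRUE is used in the fits below:
round(apply(X, 2, sd), 2)  # raw standard deviation of each body measurement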
# pcr and pls
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
data = data.frame(Y$Weight, X)
set.seed(202)
weight.pcr <- pcr(Y.Weight ~ ., data = data, subset = train, scale = TRUE, validation = "CV")
set.seed(303)
weight.plsr <- plsr(Y.Weight ~ ., data = data, subset = train, scale = TRUE,
validation = "CV")
c. In both models, adding extra components yields diminishing returns: the first few components account for most of the variance, so this pattern is not surprising. What is surprising is that with only one component, PCR explains the variance in the response (% variance explained in Y.Weight) essentially as well as PLSR does (both reach around 94% with their first component). One would expect PLSR to do better at explaining the variance in Y because PLSR places the highest weight on the variables most strongly related to the response, whereas PCR is unsupervised (not trained on Y). But looking at summary() for both methods, the percentages of variance explained are very similar.
summary(weight.pcr)
## Data: X dimension: 307 21
## Y dimension: 307 1
## Fit method: svdpc
## Number of components considered: 21
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 13.31 3.121 3.051 2.928 2.864 2.870 2.852
## adjCV 13.31 3.120 3.050 2.926 2.862 2.868 2.849
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.869 2.896 2.843 2.857 2.865 2.863 2.867
## adjCV 2.864 2.894 2.837 2.851 2.859 2.855 2.860
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 2.892 2.831 2.769 2.719 2.715 2.722
## adjCV 2.886 2.821 2.757 2.709 2.705 2.712
## 20 comps 21 comps
## CV 2.732 2.753
## adjCV 2.721 2.741
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 64.66 75.38 80.45 84.76 86.89 88.98 90.59
## Y.Weight 94.54 94.82 95.26 95.49 95.53 95.57 95.58
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 91.99 93.25 94.36 95.34 96.14 96.89
## Y.Weight 95.58 95.74 95.74 95.75 95.81 95.81
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## X 97.56 98.14 98.62 99.03 99.40 99.65
## Y.Weight 95.82 95.98 96.19 96.32 96.36 96.36
## 20 comps 21 comps
## X 99.84 100.00
## Y.Weight 96.36 96.36
plot(1:21, cumsum(weight.pcr$Xvar/weight.pcr$Xtotvar))
summary(weight.plsr)
## Data: X dimension: 307 21
## Y dimension: 307 1
## Fit method: kernelpls
## Number of components considered: 21
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 13.31 3.072 2.851 2.796 2.786 2.760 2.734
## adjCV 13.31 3.070 2.849 2.792 2.773 2.749 2.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.733 2.735 2.738 2.741 2.743 2.744 2.744
## adjCV 2.723 2.724 2.728 2.730 2.732 2.732 2.732
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 2.744 2.744 2.744 2.744 2.744 2.744
## adjCV 2.732 2.732 2.732 2.732 2.732 2.732
## 20 comps 21 comps
## CV 2.744 2.744
## adjCV 2.732 2.732
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 64.65 73.04 79.66 81.66 84.89 87.10 88.54
## Y.Weight 94.74 95.58 95.88 96.21 96.30 96.35 96.36
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 89.39 90.71 91.80 93.19 93.86 94.63
## Y.Weight 96.36 96.36 96.36 96.36 96.36 96.36
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## X 95.29 96.12 96.69 97.36 98.40 99.06
## Y.Weight 96.36 96.36 96.36 96.36 96.36 96.36
## 20 comps 21 comps
## X 99.47 100.00
## Y.Weight 96.36 96.36
plot(1:21, cumsum(weight.plsr$Xvar/weight.plsr$Xtotvar))
d. We can plot the cross-validated error for each number of components and pick the value beyond which there is no meaningful improvement. For PCR, 2 or 4 components look reasonable; we will go with 4, since beyond that point additional components do not reduce the error significantly. For PLSR, we think 3 components is reasonable for the same reason.
validationplot(weight.pcr, val.type = "MSEP", ylim = c(0, 50))
axis(side = 1, at = 1:weight.pcr$ncomp)
validationplot(weight.plsr, val.type = "MSEP", ylim = c(0, 50))
axis(side = 1, at = 1:weight.plsr$ncomp)
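As an alternative (not used in this write-up), recent versions of the pls package provide selectNcomp(), which picks the number of components automatically; this is a sketch assuming that function is available in your version of pls:
selectNcomp(weight.pcr, method = "onesigma")  # one-sigma heuristic applied to the CV curve
selectNcomp(weight.plsr, method = "onesigma")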
e. Neither PLSR nor PCR performs variable selection, while best subset selection does. So we run best subset selection to see how many variables we can potentially drop.
X.training <- data[train, ]
X.test <- data[-train, ]
library(leaps)
regfit.full <- regsubsets(Y.Weight ~ ., data = X.training, nvmax = 21)
reg.summary <- summary(regfit.full)
par(mfrow = c(2, 1))
plot(reg.summary$rss, xlab = " Number of Variables ", ylab = " RSS", type = "l")
plot(reg.summary$adjr2, xlab = " Number of Variables ", ylab = " Adjusted RSq",
type = "l")
Let's use cross-validation to choose the number of predictors, using only the training data.
# predict() method for regsubsets objects (the leaps package does not supply one):
# rebuild the model matrix from the formula and multiply by the chosen model's coefficients
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    xvars = names(coefi)
    mat[, xvars] %*% coefi
}
k = 10
set.seed(107)
folds <- sample(1:k, nrow(X.training), replace = TRUE)
cv.errors = matrix(NA, k, 21, dimnames = list(NULL, paste(1:21)))
for (j in 1:k) {
best.fit <- regsubsets(Y.Weight ~ ., data = X.training[folds != j, ], nvmax = 21)
for (i in 1:21) {
pred = predict(best.fit, X.training[folds == j, ], id = i)
cv.errors[j, i] = mean((X.training$Y.Weight[folds == j] - pred)^2)
}
}
mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8 9 10
## 37.641 17.697 14.104 12.024 10.542 9.623 10.054 9.889 9.326 9.012
## 11 12 13 14 15 16 17 18 19 20
## 8.110 8.174 8.010 7.619 7.643 7.680 7.666 7.687 7.665 7.674
## 21
## 7.689
which.min(mean.cv.errors)
## 14
## 14
par(mfrow = c(1, 1))
plot(mean.cv.errors, type = "b")
regfit.best <- regsubsets(Y.Weight ~ ., data = X.training, nvmax = 14)
So according to the best subset model we actually need only 14 predictors and can drop the other 7 (one third of all predictors). Having ascertained the optimal number of variables by cross-validation, we re-fit the best subset selection method on the training dataset, per the instructions in the question.
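To see exactly which 14 predictors survive (an extra check I added; output not reproduced here), the coefficients of the 14-variable model can be listed:
coef(regfit.best, id = 14)  # names and coefficients of the 14 variables kept by best subset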
f. PLSR performs better than PCR here. But in this case, neither dimension reduction method did as well as best subset selection, a pure variable selection method.
# pcr
pcr.pred = predict(weight.pcr, newdata = data[test, ], ncomp = 3)
mean((pcr.pred - Y$Weight[test])^2)
## [1] 10.47
# plsr
plsr.pred = predict(weight.plsr, newdata = data[test, ], ncomp = 4)
mean((plsr.pred - Y$Weight[test])^2)
## [1] 8.359
# best subset
# use the predict.regsubsets() helper defined above on the whole test set at once
predSubsetReg <- predict(regfit.best, X.test, id = 14)
mean((predSubsetReg - X.test$Y.Weight)^2)
## [1] 7.997
I worked with Seema Sangari, Sanne de Roever, and Vedat Can Cologlu on this question.
a. Function h
h <- function(x, z) {
ifelse(x > z, (x - z)^3, 0)
}
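A quick check of the truncated-cube behaviour (values worked out by hand, added for illustration):
h(2, 1)  # (2 - 1)^3 = 1
h(0.5, 1)  # 0, since 0.5 <= 1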
b. Function hs
hs <- function(xs, z) {
sapply(xs, function(x) h(x, z))
}
c. Function spline basis
# basis matrix: columns x, x^2, x^3, then one truncated-cubic column per knot
splinebasis <- function(xs, zs) {
    sb = matrix(NA, nrow = length(xs), ncol = length(zs) + 3)
sb[, 1] = xs
sb[, 2] = xs^2
sb[, 3] = xs^3
for (i in 1:length(zs)) {
sb[, i + 3] = hs(xs, zs[i])
}
return(sb)
}
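A quick sanity check on the basis dimensions (my own addition): with 3 knots the basis should have 3 + 3 = 6 columns.
dim(splinebasis(runif(10), c(0.25, 0.5, 0.75)))  # 10 rows, 6 columns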
d. Generate data
set.seed(1337)
x = runif(100)
y = sin(10 * x)
e. Fit splines
knots = c(1/4, 2/4, 3/4)
splines = splinebasis(x, knots)
data <- data.frame(y, splines)
lm.fit <- lm(y ~ ., data = data)
# plot function
fx = seq(0, 1, len = 101)
fy = sin(10 * fx)
plot(fx, fy, type = "l", col = "blue", lwd = 2)
# plot sample
points(x, y)
# plot fit
tsplines = splinebasis(fx, knots)
tdata <- data.frame(y = rep(0, 101), tsplines)
tdata$y <- predict(lm.fit, newdata = tdata)
lines(fx, tdata$y, type = "l", col = "red", lwd = 2)
f. Repeat for \( k=1..9 \)
for (k in 1:9) {
plot(fx, fy, type = "l", col = "blue", lwd = 2, main = paste(k, "knots"))
knots = 1:k/(k + 1)
splines = splinebasis(x, knots)
data <- data.frame(y, splines)
lm.fit = lm(y ~ ., data = data)
tsplines = splinebasis(fx, knots)
tdata <- data.frame(y = rep(0, 101), tsplines)
tdata$y <- predict(lm.fit, newdata = tdata)
lines(fx, tdata$y, type = "l", col = "red", lwd = 2)
}
g. Adding more degrees of freedom gives the model more flexibility to approximate the original function more closely. Adding more than 7 knots seems unnecessary; at some point, using a smoothing spline would be preferable to adding more knots.
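As a rough illustration of that last point (a sketch I added, not part of the original answer), a smoothing spline fitted to the same generated data chooses its effective flexibility by cross-validation instead of by fixing knot locations:
ss <- smooth.spline(x, y, cv = TRUE)  # flexibility chosen by leave-one-out CV
plot(fx, fy, type = "l", col = "blue", lwd = 2, main = "smooth.spline comparison")
points(x, y)
lines(predict(ss, fx), col = "red", lwd = 2)  # smoothing spline fit over the grid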