Balakumar Balasubramaniam, Feb 2014
Acknowledgment: I discussed Prob# 3 & 4 with Kevin Shaw, Vaishnavi Srisha and Prashant Jaiswal. However, all the code and solutions are mine.
(a) Best subset selection will have the smallest training error: for every model size it searches over all possible combinations of predictors, so on the training data it can do no worse than forward or backward selection.
(b) Impossible to say. Test error depends on the data: the extra flexibility of best subset selection can just as easily lead to overfitting, so any of the three procedures could have the lowest test error.
(c i) True. At each step, forward selection adds the single variable that most reduces the training error when added to the current model, so the k-variable model is contained in the (k+1)-variable model.
(c ii) True. Backward selection removes one variable at a time, so the k-variable model is obtained from the (k+1)-variable model by deleting a single variable and is therefore a subset of it.
(c iii) False. The forward and backward procedures are run independently of one another, so the k-variable forward model need not be a subset of the (k+1)-variable backward model (containment is guaranteed only when k+1 = p).
(c iv) False. Same argument as (c iii), with the roles of forward and backward selection swapped.
(c v) False. Best subset selection re-searches all combinations at every size, so the variables in the (k+1)-variable model can be completely different from those in the k-variable model (unless k+1 = p, the number of predictors).
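To illustrate (c i) and (c v) concretely, here is a minimal sketch (not part of the assignment; it assumes the leaps package and a small simulated data set): forward selection produces nested models, while best subset selection need not.
library(leaps)
set.seed(1)
toy = data.frame(matrix(rnorm(100*6), 100, 6)); names(toy) = paste0("x", 1:6);
toy$y = toy$x1 + 2*toy$x2 - toy$x3 + rnorm(100);
fit.fwd  = regsubsets(y~., data=toy, nvmax=6, method="forward");
fit.best = regsubsets(y~., data=toy, nvmax=6, method="exhaustive");
# each row of $which flags the variables in the k-variable model;
# for forward selection every row is contained in the next, for best subset it need not be
summary(fit.fwd)$which
summary(fit.best)$which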
###########################################################################################
(a)
rm(list=ls());
# truncated cubic basis function (x - z)_+^3; max() works on scalars only, hence hs() below
h <- function(x,z)
{
max(0,(x-z)^3)
}
(b)
# apply h to each element of xs (h is not vectorised over x)
hs <- function(xs, z)
{
sapply(xs, h, z)
}
(c)
# truncated power basis for a cubic spline with knots zs: columns x, x^2, x^3, then h(x, z_i) for each knot
splinebasis <- function(xs, zs)
{
nx = length(xs);
ny = length(zs) + 3;
z = matrix(NA, nx, ny);
z[,1] = xs;
z[,2] = xs^2;
z[,3] = xs^3;
for (i in 1:length(zs))
{
z[,i+3] = hs(xs, zs[i]);
}
z
}
(d)
set.seed(1337);
x = runif(100);
y = sin(10*x);
(e, f)
stddev = numeric();
for (k in 1:9)
{
# k interior knots, equally spaced on (0,1) (drop the two boundary points)
knots = seq(0,1, length=k+2);
knots = knots[seq(2,length(knots)-1)];
XX = splinebasis(x, knots);
df = data.frame("xx"= XX, "y"=y);
fit = lm(y~., data=df);
yp = predict(fit);
# evaluate the fitted spline on a dense grid for plotting
xx = seq(0,1,length=1000);
xdf = splinebasis(xx, knots);
df = data.frame("xx"= xdf);
yy = predict(fit, newdata=df);
if (k==3)
{
plot(x,y, main="k=3 cubic spline fit");
points(xx,yy, col="red", type="l");
}
stddev[k] = sd(y-yp);
}
(g) As the number of knots increases, the fit gains degrees of freedom, so the in-sample residuals shrink and the fitted curve tracks the true curve sin(10x) more closely. This holds until k becomes large (on the order of the number of data points), at which point the extra flexibility starts to fit noise rather than signal; for the moderate values of k used here we have not yet reached that overfitting regime. A small sketch comparing in-sample and fresh-data error follows the plot below.
plot(stddev, xlab="k", ylab="std dev (actual-fit)");
grid();
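As a check on this intuition, here is a minimal sketch (not part of the assignment; it simulates a fresh set of points from the same curve, with an assumed seed) comparing the in-sample residual sd with the error on new data as k grows.
set.seed(1338);                      # assumed seed for the fresh test points
xnew = runif(100); ynew = sin(10*xnew);
err.train = numeric(9); err.test = numeric(9);
for (k in 1:9)
{
knots = seq(0,1, length=k+2);
knots = knots[seq(2,length(knots)-1)];
fitk = lm(y~., data=data.frame("xx"=splinebasis(x, knots), "y"=y));
err.train[k] = sd(y - predict(fitk));
yhat = predict(fitk, newdata=data.frame("xx"=splinebasis(xnew, knots)));
err.test[k] = sd(ynew - yhat);
}
matplot(1:9, cbind(err.train, err.test), type="b", pch=c(1,2), xlab="k", ylab="residual sd", main="training (1) vs fresh-data (2) error");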
###########################################################################################
rm(list=ls())
setwd("c:/Users/bbalasub/Desktop/Bala-docs/Stanford/2014-winter-Stats216/hw3")
load("body.RData")
library("pls")
## Warning: package 'pls' was built under R version 2.15.3
##
## Attaching package: 'pls'
##
## The following object(s) are masked from 'package:stats':
##
## loadings
library("car")
## Warning: package 'car' was built under R version 2.15.3
library("glmnet")
## Warning: package 'glmnet' was built under R version 2.15.3
## Loading required package: Matrix
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 2.15.3
## Loaded glmnet 1.9-5
Let us plot the various variables, color-coded by gender. To train our classifier, we use only two-thirds of the data, leaving the remaining third as a test set. For illustration we plot only two of the columns, Elbow.Diam and Wrist.Diam; we looked at all of the variables (in-sample only) before settling on these.
df = data.frame("Gender"=Y$Gender, X)
trainIdx = sample(1:nrow(df), size=floor(nrow(df)*2/3), replace=FALSE);
testIdx = setdiff(1:nrow(df), trainIdx);
#for (i in colnames(df)[2:length(df)])
for (i in c("Elbow.Diam", "Wrist.Diam"))
{
# color and plotting symbol encode gender (0/1); the y-axis is the measurement itself
plot(df[trainIdx,i], col=df$Gender[trainIdx]+1, main=i, pch=df$Gender[trainIdx]+6, ylab=i);
}
Several variables look promising (such as Wrist.Diam, Wrist.Girth, Forearm.Girth, Elbow.Diam, Bicep.Girth, Shoulder.Girth, etc.). Let us pick Elbow.Diam as an example and apply logistic regression.
fit = glm(Gender~Elbow.Diam, data=df, subset=trainIdx, family=binomial);
dfnew= subset.data.frame(df, 1:nrow(df) %in% testIdx);
gPredict = predict(fit, newdata=dfnew);   # default type="link", i.e. predictions on the log-odds scale
gPredict[gPredict > 0] = 1;               # log-odds > 0  =>  predict class 1
gPredict[gPredict < 0] = 0;
gActual = df$Gender[testIdx];
prcCorrect = sum(gPredict==gActual)/length(gPredict);
print(paste("percentage correct=", 100*prcCorrect))
## [1] "percentage correct= 91.7159763313609"
Thus, we see that elbow diameter alone is a very good predictor of gender, classifying roughly 92% of the held-out observations correctly.
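As a quick sanity check (a sketch, not part of the original write-up), we can repeat the same one-variable logistic fit for a few of the other promising columns and compare out-of-sample accuracy; the candidate list below is illustrative.
candidates = c("Elbow.Diam", "Wrist.Diam", "Wrist.Girth", "Forearm.Girth", "Shoulder.Girth");
acc = sapply(candidates, function(v) {
fit.v = glm(as.formula(paste("Gender ~", v)), data=df, subset=trainIdx, family=binomial)
pred.v = as.numeric(predict(fit.v, newdata=df[testIdx,]) > 0)   # log-odds > 0 => class 1
mean(pred.v == df$Gender[testIdx])
});
round(100*acc, 1)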
############################################################################
Once again, we begin the analysis by plotting the various variables against weight; as an illustration, only two of them are shown here.
df = data.frame("Weight"=Y$Weight, X)
set.seed(1);
trainIdx = sample(1:nrow(df), size=307, replace=FALSE);
testIdx = setdiff(1:nrow(df), trainIdx);
#for (i in colnames(df)[2:length(df)])
for (i in c("Elbow.Diam", "Wrist.Diam"))
{
plot(df[trainIdx, ]$Weight, df[trainIdx,i], col="black", main=i, pch=6, xlab="weight", ylab=i);
}
Although many of the variables appear to be highly correlated with weight, they are also strongly correlated with one another. This is best seen from the variance inflation factor (VIF) plot below, where the VIF exceeds 5 for several predictors.
z = vif(lm(Weight~., data=df[trainIdx,]))
barplot(z, main="VIF (Variance Inflation Factor)", horiz=FALSE,names.arg=names(z), las=2);
grid();
Hence, an ordinary least-squares regression is trickier to interpret because of this collinearity. We will investigate this and perform variable selection later. For now, let us focus on dimension reduction, fitting PCR and PLSR on all of the variables as asked in the question.
PCR & PLSR fits:
set.seed(1);
fit.pcr = pcr(Weight~., data=df[trainIdx,], scale=TRUE, validation="CV");
set.seed(1);
fit.plsr = plsr(Weight~., data=df[trainIdx,], scale=TRUE, validation="CV");
We scale the variables before fitting to eliminate the effect of measurement units on the fit. Otherwise a change of units could make a direction look artificially important: for example, recording Elbow.Diam in nm instead of cm would inflate its variance and incorrectly bias the components towards that variable.
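To illustrate why, here is a small sketch (illustrative only; it assumes X is the numeric measurement matrix loaded from body.RData): blowing up the units of one column dominates the unscaled first principal component, while the scaled version is unaffected.
Xu = X;
Xu[,"Elbow.Diam"] = Xu[,"Elbow.Diam"] * 1e4;          # pretend we switched to much smaller units
round(prcomp(Xu, scale.=FALSE)$rotation[,1], 3)        # loading almost entirely on Elbow.Diam
round(prcomp(Xu, scale.=TRUE)$rotation[,1], 3)         # unit change has no effect after scaling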
summary(fit.pcr);
## Data: X dimension: 307 21
## Y dimension: 307 1
## Fit method: svdpc
## Number of components considered: 21
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 13.34 3.428 3.266 3.000 2.969 2.963 2.940
## adjCV 13.34 3.426 3.264 2.977 2.966 2.960 2.937
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.956 2.922 2.940 2.921 2.930 2.923 2.913
## adjCV 2.953 2.918 2.937 2.916 2.926 2.916 2.908
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 2.906 2.859 2.792 2.769 2.788 2.804
## adjCV 2.898 2.852 2.782 2.758 2.777 2.793
## 20 comps 21 comps
## CV 2.808 2.808
## adjCV 2.797 2.796
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 63.08 75.20 79.96 84.46 86.77 88.89 90.37
## Weight 93.46 94.12 95.18 95.20 95.27 95.32 95.38
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 91.80 93.08 94.19 95.17 96.05 96.82 97.54
## Weight 95.47 95.47 95.52 95.54 95.60 95.61 95.71
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## X 98.10 98.57 98.97 99.34 99.60 99.82
## Weight 95.88 96.08 96.20 96.20 96.21 96.21
## 21 comps
## X 100.00
## Weight 96.23
par(mfrow=c(1,2))
validationplot(fit.pcr, val.type="MSEP", main="MSEP for PCR"); grid()
summary(fit.plsr);
## Data: X dimension: 307 21
## Y dimension: 307 1
## Fit method: kernelpls
## Number of components considered: 21
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 13.34 3.324 2.991 2.863 2.816 2.801 2.792
## adjCV 13.34 3.322 2.989 2.859 2.806 2.789 2.781
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.796 2.802 2.807 2.810 2.810 2.809 2.809
## adjCV 2.785 2.790 2.795 2.798 2.798 2.797 2.797
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 2.808 2.808 2.808 2.808 2.808 2.808
## adjCV 2.796 2.796 2.796 2.796 2.796 2.796
## 20 comps 21 comps
## CV 2.808 2.808
## adjCV 2.796 2.796
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 63.06 73.25 79.60 81.27 82.80 85.27 88.37
## Weight 93.88 95.17 95.67 96.07 96.19 96.21 96.22
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 89.55 91.07 92.05 92.80 93.66 94.67 95.56
## Weight 96.23 96.23 96.23 96.23 96.23 96.23 96.23
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## X 96.35 97.19 97.77 98.57 99.01 99.66
## Weight 96.23 96.23 96.23 96.23 96.23 96.23
## 21 comps
## X 100.00
## Weight 96.23
validationplot(fit.plsr, val.type="MSEP", main="MSEP for PLSR"); grid()
We note that the cross-validation errors are consistently smaller for PLSR than for PCR. This is because PLSR uses the response when constructing its components while PCR does not, which results in a better fit for PLSR. For example, with 4 components the PLSR and PCR CV values are 2.816 and 2.969 respectively.
We also see that the percentage of weight variance explained is slightly (but consistently) larger for PLSR: a 4-component PLSR explains 96.07% of the weight variance, whereas it takes 16 PCR components to explain a comparable amount (96.08%).
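To put the two cross-validation curves side by side (a convenience sketch using pls::RMSEP, not part of the original output):
rmsep.pcr  = RMSEP(fit.pcr,  estimate="CV")$val[1,1,];
rmsep.plsr = RMSEP(fit.plsr, estimate="CV")$val[1,1,];
round(rbind(PCR=rmsep.pcr, PLSR=rmsep.plsr)[,1:7], 3)   # intercept plus the first 6 components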
PCR: We note from the plot that 3-4 principal components are sufficient to explain most of the variation. From the cross-validation, the estimated test error is smallest at 17 components (CV 2.769, adjCV 2.758). However, the value for 4 components is not much larger (2.969), which again suggests that a smaller number of components may suffice for forecasting. Let us compare the forecasts with the actual values for ncomp = 4 and 17.
pred.pcr4 = predict(fit.pcr, newdata=df[testIdx,], ncomp=4);
pred.pcr17 = predict(fit.pcr, newdata=df[testIdx,], ncomp=17);
par(mfrow=c(2,2));
# plot 1
plot(pred.pcr4, pch=21, col="red", ylab="Comparison of pcr4 (red), pcr17 (green) and actual (black) Weights", main="PCR forecasting out-of-sample");
points(pred.pcr17, pch=23, col="green");
points(df[testIdx,"Weight"], pch=22, col="black");
grid();
y.test = df[testIdx,"Weight"];
print(mean((pred.pcr4-y.test)^2))
## [1] 8.713
print(mean((pred.pcr17-y.test)^2))
## [1] 8.011
# plot 2
plot(y.test, pred.pcr4, pch=21, col="red", xlab="y.test", ylab="pred.pcr4", main="test vs PCR4 (OOS)")
# plot 3
plot(y.test, pred.pcr17, pch=23, col="green", xlab="y.test", ylab="pred.pcr17", main="test vs PCR17 (OOS)")
# plot 4
plot(pred.pcr4, pred.pcr17, pch=23, col="black", xlab="pred.pcr4", ylab="pred.pcr17", main="PCR4 vs PCR17")
We note that the test MSE for 4 components is close to (though about 10% larger than) that for 17 components. Note also that we have not performed any variable selection: we have merely found 4 components, each a linear combination of the existing variables, that explain the data well, so all 21 variables still have to be measured. Despite this, PCR forecasts the out-of-sample weights quite well, as evidenced by the high correlation between actual and forecast weights and the small MSE. With an MSE of 8.7 and a mean test-sample weight of 68.8 kg, the root-mean-squared forecast error is only about 4% of the mean weight.
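A quick back-of-the-envelope check of the 4% figure quoted above (sketch only):
sqrt(8.713) / mean(y.test)   # ~0.043, i.e. roughly 4% of the mean test weight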
PLSR: Let us repeat the same analysis for PLSR. From the plot, 3-6 components are sufficient to explain most of the variation, and the cross-validated error is smallest at 6 components (CV 2.792, adjCV 2.781). Hence we select 6 components for the forecasting step; for completeness, we also compare with the 4-component case.
pred.plsr4 = predict(fit.plsr, newdata=df[testIdx,], ncomp=4);
pred.plsr6 = predict(fit.plsr, newdata=df[testIdx,], ncomp=6);
par(mfrow=c(2,2));
# plot 1
plot(pred.plsr4, pch=21, col="red", ylab="Comparison of plsr4 (red), plsr6 (green) and actual (black) Weights", main="PLSR forecasting out-of-sample");
points(pred.plsr6, pch=23, col="green");
points(df[testIdx,"Weight"], pch=22, col="black");
grid();
y.test = df[testIdx,"Weight"];
print(mean((pred.plsr4-y.test)^2))
## [1] 7.952
print(mean((pred.plsr6-y.test)^2))
## [1] 8.059
# plot 2
plot(y.test, pred.plsr4, pch=21, col="red", xlab="y.test", ylab="pred.plsr4", main="test vs PLSR4 (OOS)")
# plot 3
plot(y.test, pred.plsr6, pch=23, col="green", xlab="y.test", ylab="pred.plsr6", main="test vs PLSR6 (OOS)")
# plot 4
plot(pred.plsr4, pred.plsr6, pch=23, col="black", xlab="pred.plsr4", ylab="pred.plsr6", main="PLSR4 vs PLSR6")
We observe that the out-of-sample MSE for PLSR with ncomp = 4 is smaller than that for PCR (7.952 vs 8.713), consistent with the smaller CV estimates obtained earlier. Summary: the best PCR model required 17 components, while PLSR achieved comparable (in fact slightly lower) test error with only 4 components.
Both PCR and PLSR use all of the variables to construct their components, so despite the dimension reduction we are still constrained to measure all 21 explanatory variables. However, we saw earlier (from the VIFs) that several of the variables are largely explained by the others. Hence there is a strong possibility that we can explain most of the variation in weight using only a few explanatory variables. This can be done with the lasso, for example; alternatively, one could use forward, backward, or best subset selection (a sketch of the forward-selection alternative is given below).
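As a sketch of that alternative (not run in the original analysis; it assumes the leaps package), forward selection with the model size chosen by BIC would look like this:
library(leaps)
fit.fwd = regsubsets(Weight~., data=df[trainIdx,], nvmax=21, method="forward");
fwd.summary = summary(fit.fwd);
nvar.bic = which.min(fwd.summary$bic);   # model size chosen by BIC
coef(fit.fwd, nvar.bic)                  # coefficients of that model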
Best-Lasso: Let us fit the lasso, use cross-validation to choose lambda, and examine the number of non-zero coefficients.
# Lasso fit
y = df[trainIdx,]$Weight;
x = model.matrix(Weight~., df[trainIdx,])[,-1];
fit.lasso =glmnet(x,y,alpha=1);
# cross-validation
set.seed(1);
cv.out = cv.glmnet(x,y,alpha=1);
plot(cv.out);
bestlam = cv.out$lambda.min
fit.lasso =glmnet(x,y,alpha=1, lambda=bestlam);
coef(fit.lasso)
## 22 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -100.38912
## Wrist.Diam 0.10041
## Wrist.Girth 0.24044
## Forearm.Girth 0.38135
## Elbow.Diam 0.47149
## Bicep.Girth .
## Shoulder.Girth 0.16543
## Biacromial.Diam 0.15284
## Chest.Depth 0.40924
## Chest.Diam 0.21497
## Chest.Girth 0.08750
## Navel.Girth .
## Waist.Girth 0.30211
## Pelvic.Breadth 0.27702
## Bitrochanteric.Diam .
## Hip.Girth 0.19498
## Thigh.Girth 0.29553
## Knee.Diam 0.10407
## Knee.Girth 0.60426
## Calf.Girth 0.05771
## Ankle.Diam 0.65128
## Ankle.Girth 0.04136
This best-lambda lasso fit removes only 3 variables. However, the cross-validation error does not increase by much at lambda = 3, which would require measuring only 11 variables (see the refit and the count of non-zero coefficients below). We will see later how the test error changes at this lambda.
# refit at the larger penalty lambda = 3
fit.lasso8 =glmnet(x,y,alpha=1, lambda=3);
coef(fit.lasso8)
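To see how many measurements the lambda = 3 fit actually requires, we can count its non-zero coefficients (a quick check, not in the original write-up):
sum(as.vector(coef(fit.lasso8))[-1] != 0)   # number of predictors with non-zero coefficients
# (equivalently, fit.lasso8$df)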
(f) Test errors (out-of-sample errors):
Let us now calculate the test error for each of these methods.
PCR
pred.pcr4 = predict(fit.pcr, newdata=df[testIdx,], ncomp=4);
pred.pcr17 = predict(fit.pcr, newdata=df[testIdx,], ncomp=17);
print(mean((pred.pcr4-y.test)^2))
## [1] 8.713
print(mean((pred.pcr17-y.test)^2))
## [1] 8.011
PLSR
pred.plsr4 = predict(fit.plsr, newdata=df[testIdx,], ncomp=4);
pred.plsr6 = predict(fit.plsr, newdata=df[testIdx,], ncomp=6);
print(mean((pred.plsr4-y.test)^2))
## [1] 7.952
print(mean((pred.plsr6-y.test)^2))
## [1] 8.059
Lasso
xn = model.matrix(Weight~., df[testIdx,])[,-1];
pred.lasso = predict(fit.lasso, s=bestlam, newx=xn)
print(mean((pred.lasso-y.test)^2))
## [1] 7.966
pred.lasso8 = predict(fit.lasso8, s=3, newx=xn)
print(mean((pred.lasso8-y.test)^2))
## [1] 20.23
Most interestingly, we see that the lasso at the best lambda (18 measurement fields) performs best, while the 4-component PLSR is not far behind in test error. The test error increases for the reduced-measurement (lambda = 3) lasso, but the increase is small relative to the mean weight: the lambda = 3 lasso has a root-mean-squared error of about 4.5 kg against a mean test weight of 68.8 kg, while the best-lambda lasso's error is about 2.8 kg. We conclude that the lasso is highly effective in reducing the number of measurements required to estimate weight.
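A back-of-the-envelope check of the error figures quoted above (sketch only):
sqrt(c(lasso.bestlam=7.966, lasso.lambda3=20.23))                 # roughly 2.8 kg and 4.5 kg RMSE
sqrt(c(lasso.bestlam=7.966, lasso.lambda3=20.23)) / mean(y.test)  # about 4% and 6.5% of the mean test weight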