Homework assignment 1, December 8th, 2013
We load the data into a data frame, and then split it into a training and a test data set (I hope the rows of the data file are in random order!):
set.seed(1)
setwd("~/Documents/Universiteit/StatLearn")
data <- read.table("housing.data", header = T)
train <- data[1:350, ]
test <- data[351:length(data[, 1]), ]
# Now rename the train data set to *data*, so we can use it below without
# making mistakes by accidentally using the full data:
data <- train
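The split above relies on the rows of the file already being in random order. If we did not want to rely on that, we could shuffle the rows explicitly before splitting (a sketch, not run here, since it would change all numbers below):
# Hypothetical alternative: shuffle the rows before splitting. The file is
# re-read because 'data' has already been overwritten with the training set.
full <- read.table("housing.data", header = TRUE)
full <- full[sample(nrow(full)), ]
train <- full[1:350, ]
test <- full[351:nrow(full), ]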
We fit a very simple linear regression model:
simple.model <- lm(MEDV ~ ., data = data)
summary(simple.model)
##
## Call:
## lm(formula = MEDV ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.191 -2.659 -0.788 1.732 25.831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.40033 6.17442 6.38 5.8e-10 ***
## CRIM -0.11352 0.04178 -2.72 0.00692 **
## ZN 0.06372 0.01638 3.89 0.00012 ***
## INDUS 0.02483 0.07374 0.34 0.73647
## CHAS 1.48609 0.99799 1.49 0.13740
## NOX -17.03198 4.70072 -3.62 0.00034 ***
## RM 3.35805 0.52239 6.43 4.4e-10 ***
## AGE -0.00537 0.01562 -0.34 0.73101
## DIS -1.64248 0.23896 -6.87 3.1e-11 ***
## RAD 0.29926 0.07541 3.97 8.8e-05 ***
## TAX -0.01378 0.00432 -3.19 0.00155 **
## PTRATIO -0.84220 0.15372 -5.48 8.4e-08 ***
## B 0.00631 0.00318 1.99 0.04770 *
## LSTAT -0.52157 0.06067 -8.60 3.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.67 on 336 degrees of freedom
## Multiple R-squared: 0.745, Adjusted R-squared: 0.735
## F-statistic: 75.4 on 13 and 336 DF, p-value: <2e-16
The coefficients, their standard errors, and their statistical significance can be read from the table displayed above.
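If we want these numbers programmatically rather than by reading the printout, they can be extracted from the fitted model as well (a small optional aside, not required for the assignment):
# Coefficient table (estimate, standard error, t value, p value) as a matrix,
# plus 95% confidence intervals for the coefficients
coef(summary(simple.model))
confint(simple.model)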
We estimate the prediction accuracy of regression using 5-fold cross-validation:
library("bootstrap")
cv.fit <- function(x, y) lsfit(x, y)
cv.predict <- function(fit, x) cbind(1, x) %*% fit$coef
cv.results <- crossval(as.matrix(data[, colnames(data) != "MEDV"]), data$MEDV,
cv.fit, cv.predict, ngroup = 5)
sqrt(mean((cv.results$cv.fit - data$MEDV)^2))
## [1] 4.851
This is slightly higher than the residual standard error (RSE) from the linear regression above, as expected: the naive in-sample estimate is optimistic because of overfitting to the training data.
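For a more direct comparison, the in-sample root-mean-squared error of the fitted model can be computed on the same scale (a one-line sketch; note that it differs slightly from the RSE, which divides by the residual degrees of freedom rather than by n):
# Naive in-sample RMSE of the full model, for comparison with the CV estimate
sqrt(mean(residuals(simple.model)^2))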
For every subset size k, we apply cross-validation to estimate the MSE, and then choose the optimal k:
library("leaps")
r <- regsubsets(MEDV ~ ., data = data)
models <- summary(r)$which
crossval.k <- function(model.k) {
    sub.fit <- function(x, y) regsubsets(y = y, x = x)
    # 'r' below is the regsubsets fit from the current cross-validation fold
    sub.predict <- function(r, x) {
        # first filter for the selected x values:
        x.sel <- t(x[summary(r)$which[model.k, 2:14]])
        # now apply model to predict
        (cbind(1, x.sel) %*% coef(r, id = model.k))[1, 1]
    }
    # ngroup is left at its default here, i.e. leave-one-out cross-validation
    mean((crossval(as.matrix(data[, colnames(data) != "MEDV"]), data$MEDV, sub.fit,
        sub.predict)$cv.fit - data$MEDV)^2)
}
(MSE.k <- sapply(1:8, crossval.k))
## [1] 36.26 30.23 27.19 26.89 25.03 24.49 24.77 25.09
(optimal.k <- which(MSE.k == min(MSE.k)))
## [1] 6
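A quick plot of the cross-validated MSE against the subset size makes this choice easy to see (a small optional sketch using the MSE.k vector computed above):
# Cross-validated MSE as a function of subset size k
plot(1:8, MSE.k, type = "b", xlab = "subset size k", ylab = "cross-validated MSE")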
We find that k should be 6, so we can reconstruct the chosen model:
summary(r)$which[optimal.k, ]
## (Intercept) CRIM ZN INDUS CHAS NOX
## TRUE FALSE TRUE FALSE FALSE TRUE
## RM AGE DIS RAD TAX PTRATIO
## TRUE FALSE TRUE FALSE FALSE TRUE
## B LSTAT
## FALSE TRUE
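The coefficient estimates for the selected subset can also be read directly from the regsubsets fit, without refitting; below we also refit the same model manually with lm().
# Coefficients of the best subset of size optimal.k, straight from the regsubsets fit
coef(r, id = optimal.k)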
# Here we refit the selected model manually with lm():
summary(subset.model <- lm(MEDV ~ 1 + ZN + NOX + RM + DIS + PTRATIO + LSTAT,
data = data))
##
## Call:
## lm(formula = MEDV ~ 1 + ZN + NOX + RM + DIS + PTRATIO + LSTAT,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.714 -2.796 -0.541 1.777 26.628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.6453 5.4552 7.08 8.0e-12 ***
## ZN 0.0502 0.0159 3.15 0.0018 **
## NOX -18.5916 3.9853 -4.67 4.4e-06 ***
## RM 3.6534 0.5109 7.15 5.3e-12 ***
## DIS -1.5435 0.2268 -6.81 4.5e-11 ***
## PTRATIO -0.8789 0.1366 -6.43 4.2e-10 ***
## LSTAT -0.5794 0.0581 -9.97 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.84 on 343 degrees of freedom
## Multiple R-squared: 0.721, Adjusted R-squared: 0.716
## F-statistic: 148 on 6 and 343 DF, p-value: <2e-16
library("MASS")
# Ridge model with a single, fixed shrinkage value:
lm.ridge(MEDV ~ ., data = data, lambda = 4.2596)
## (Intercept)      CRIM        ZN     INDUS      CHAS       NOX
##   37.020329 -0.106836  0.059336  0.004874  1.586067 -15.619322
## RM AGE DIS RAD TAX PTRATIO
## 3.451142 -0.005939 -1.553124 0.256373 -0.011721 -0.823948
## B LSTAT
## 0.006382 -0.509399
# Now we try some shrinkage values:
model <- lm.ridge(MEDV ~ ., data = data, lambda = seq(0, 10, 1e-04))
plot(model)
# Apply the built-in select() (select.ridgelm) for approximate cross-validation (GCV)
select(model)
## modified HKB estimator is 4.363
## modified L-W estimator is 3.926
## smallest value of GCV at 3.96
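The GCV values that select() summarizes are also stored in the fitted ridgelm object, so the GCV-minimizing lambda can be extracted directly (a small sketch using the model object fitted above; it should agree with the value reported by select()):
# Lambda on the grid with the smallest generalized cross-validation score
model$lambda[which.min(model$GCV)]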
# We can now refit the model at the GCV-optimal shrinkage value:
(ridge.model <- lm.ridge(MEDV ~ ., data = data, lambda = 3.96))
## (Intercept)      CRIM        ZN     INDUS      CHAS       NOX
##   37.171507 -0.107251  0.059612  0.006059  1.579998 -15.709846
## RM AGE DIS RAD TAX PTRATIO
## 3.445418 -0.005906 -1.559072 0.258979 -0.011844 -0.825136
## B LSTAT
## 0.006378 -0.510214
library("lars")
## Loaded lars 1.2
# Fit the full lasso path (also used for the plot and the predictions below):
(lasso.model <- lars(x = as.matrix(data[, colnames(data) != "MEDV"]), y = data$MEDV,
type = "lasso"))
##
## Call:
## lars(x = as.matrix(data[, colnames(data) != "MEDV"]), y = data$MEDV,
## type = "lasso")
## R-squared: 0.745
## Sequence of LASSO moves:
## LSTAT RM PTRATIO TAX B CHAS CRIM DIS ZN NOX INDUS RAD INDUS AGE
## Var 13 6 11 10 12 4 1 8 2 5 3 9 -3 7
## Step 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## INDUS
## Var 3
## Step 15
# Do the cross-validation, and find the optimal fraction:
cv <- cv.lars(x = as.matrix(data[, colnames(data) != "MEDV"]), y = data$MEDV,
type = "lasso")
which(cv$cv == min(cv$cv))/100
## [1] 0.92
We find the fraction shown above to be the optimal lasso shrinkage (expressed as a fraction of the final L1 norm of the coefficients).
We can make a plot to see exactly what happens:
plot(lasso.model)
Here the x-axis is the fraction of the final L1 norm that the coefficients are allowed to sum up to, and the y-axis shows the corresponding coefficient values. The optimal fraction we found lies near the vertical line labelled “14”.
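For completeness, the lasso coefficients at this fraction can be extracted from the lars fit (a small sketch reusing lasso.model and the 0.92 fraction found above):
# Coefficients of the lasso solution at 92% of the maximal L1 norm
coef(lasso.model, s = 0.92, mode = "fraction")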
Now we compare the optimal models we found in the previous questions on the test data set. We expect only small differences, because all of them are still linear predictions based on the same information.
# First we calculate the MSE for the full simple linear model on the
# training data set, so that we get a feel for the influence of
# (over-)fitting to the training data.
mean(residuals(simple.model)^2)
## [1] 20.98
# Now we apply the various models to the test data:
# Custom prediction function, really only needed for ridge prediction
predict.custom <- function(coefficients) {
as.matrix(cbind(1, test[, colnames(test) != "MEDV"])) %*% as.matrix(coefficients)
}
# Full simple model: MSE on the test data (we could also have used the
# standard predict()):
mean((predict.custom(coef(simple.model)) - test$MEDV)^2)
## [1] 25.15
# Subset model:
mean((predict(subset.model, newdata = test[, colnames(test) != "MEDV"]) - test$MEDV)^2)
## [1] 27.99
# Ridge model: MSE on the test data. (Note that we don't have to manually
# un-scale the coefficients: using coef() instead of accessing $coef gives
# them on the original scale.)
mean((predict.custom(coef(ridge.model)) - test$MEDV)^2)
## [1] 25.12
# Lasso model: We use the fraction we found:
predicted <- predict.lars(lasso.model, newx = test[, colnames(test) != "MEDV"],
type = "fit", s = 0.92, mode = "fraction")$fit
mean((predicted - test$MEDV)^2)
## [1] 25.18
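To make the comparison easier to read, the four test-set MSEs can also be collected into one named vector (a short sketch reusing the objects computed above):
# Collect the test-set MSEs of all four models and sort them
test.mse <- c(full = mean((predict.custom(coef(simple.model)) - test$MEDV)^2),
    subset = mean((predict(subset.model, newdata = test) - test$MEDV)^2),
    ridge = mean((predict.custom(coef(ridge.model)) - test$MEDV)^2),
    lasso = mean((predicted - test$MEDV)^2))
sort(test.mse)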
We see that the ridge model does best of all of these, slightly ahead of the full linear model (which we did not strictly need to include here). The subset model does worst by far; this might be because subset selection adds an extra step of fitting to the training data, whereas the other methods sacrifice some accuracy on the training data in exchange for better generalization.
So, our final model has these coefficients and intercept:
ridge.model
## (Intercept)      CRIM        ZN     INDUS      CHAS       NOX
##   37.171507 -0.107251  0.059612  0.006059  1.579998 -15.709846
## RM AGE DIS RAD TAX PTRATIO
## 3.445418 -0.005906 -1.559072 0.258979 -0.011844 -0.825136
## B LSTAT
## 0.006378 -0.510214