Statistical Learning

Homework assignment 1, December 8th, 2013

Questions 1 & 2

We load the data into a data frame, and then split it into a training and a test set (I hope the data file is in random order!):

set.seed(1)
setwd("~/Documents/Universiteit/StatLearn")
data <- read.table("housing.data", header = T)

train <- data[1:350, ]
test <- data[351:length(data[, 1]), ]

# Now rename the train data set to *data*, so we can use it below without
# making mistakes by accidentally using the full data:
data <- train
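
If the data file were not already in random order, a safer approach would be to shuffle the rows before splitting. A minimal sketch (not run here; full and shuffled are just illustration names):

# Hypothetical alternative: shuffle the rows, then split as before
full <- read.table("housing.data", header = T)
shuffled <- full[sample(nrow(full)), ]
train <- shuffled[1:350, ]
test <- shuffled[351:nrow(shuffled), ]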

Questions 3 & 4

We fit a straightforward linear regression model using all predictors:

simple.model <- lm(MEDV ~ ., data = data)
summary(simple.model)
## 
## Call:
## lm(formula = MEDV ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.191  -2.659  -0.788   1.732  25.831 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.40033    6.17442    6.38  5.8e-10 ***
## CRIM         -0.11352    0.04178   -2.72  0.00692 ** 
## ZN            0.06372    0.01638    3.89  0.00012 ***
## INDUS         0.02483    0.07374    0.34  0.73647    
## CHAS          1.48609    0.99799    1.49  0.13740    
## NOX         -17.03198    4.70072   -3.62  0.00034 ***
## RM            3.35805    0.52239    6.43  4.4e-10 ***
## AGE          -0.00537    0.01562   -0.34  0.73101    
## DIS          -1.64248    0.23896   -6.87  3.1e-11 ***
## RAD           0.29926    0.07541    3.97  8.8e-05 ***
## TAX          -0.01378    0.00432   -3.19  0.00155 ** 
## PTRATIO      -0.84220    0.15372   -5.48  8.4e-08 ***
## B             0.00631    0.00318    1.99  0.04770 *  
## LSTAT        -0.52157    0.06067   -8.60  3.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.67 on 336 degrees of freedom
## Multiple R-squared:  0.745,  Adjusted R-squared:  0.735 
## F-statistic: 75.4 on 13 and 336 DF,  p-value: <2e-16

The coefficients, their standard errors, and their statistical significance can be read from the table above.
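
The same numbers can also be extracted programmatically, for example for a single predictor (output not shown):

# estimate, standard error, t value and p value for LSTAT
coef(summary(simple.model))["LSTAT", ]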

Question 5

We estimate the prediction accuracy of the regression (as a root mean squared prediction error) using 5-fold cross-validation:

library("bootstrap")
cv.fit <- function(x, y) lsfit(x, y)
cv.predict <- function(fit, x) cbind(1, x) %*% fit$coef
cv.results <- crossval(as.matrix(data[, colnames(data) != "MEDV"]), data$MEDV, 
    cv.fit, cv.predict, ngroup = 5)

sqrt(mean((cv.results$cv.fit - data$MEDV)^2))
## [1] 4.851

This is slightly higher than the residual standard error (4.67) of the linear regression above, as expected: the naive in-sample estimate is optimistic because the model is evaluated on the same data it was fit to (overfitting).
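
For comparison, the same 5-fold procedure can be written out without the bootstrap package; a minimal sketch (output not shown):

# Manual 5-fold cross-validation with randomly assigned folds
k <- 5
folds <- sample(rep(1:k, length.out = nrow(data)))
preds <- numeric(nrow(data))
for (i in 1:k) {
    held.out <- folds == i
    fit <- lm(MEDV ~ ., data = data[!held.out, ])
    preds[held.out] <- predict(fit, newdata = data[held.out, ])
}
sqrt(mean((preds - data$MEDV)^2))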

Question 6

For every subset size k, we use cross-validation to estimate the MSE, and then choose the k with the lowest estimate:

library("leaps")

r <- regsubsets(MEDV ~ ., data = data)
models <- summary(r)$which

crossval.k <- function(model.k) {
    sub.fit <- function(x, y) regsubsets(y = y, x = x)
    sub.predict <- function(r, x) {
        # keep only the predictor values selected at size model.k
        # (columns 2:14 of the 'which' matrix skip the intercept column):
        x.sel <- t(x[summary(r)$which[model.k, 2:14]])

        # now apply model to predict
        (cbind(1, x.sel) %*% coef(r, id = model.k))[1, 1]
    }
    mean((crossval(as.matrix(data[, colnames(data) != "MEDV"]), data$MEDV, sub.fit, 
        sub.predict)$cv.fit - data$MEDV)^2)
}
(MSE.k <- sapply(1:8, crossval.k))
## [1] 36.26 30.23 27.19 26.89 25.03 24.49 24.77 25.09
(optimal.k <- which(MSE.k == min(MSE.k)))
## [1] 6
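
A quick plot of the cross-validated MSE against the subset size k makes this choice easy to check visually (plot not shown):

# CV estimate of the MSE for each subset size
plot(1:8, MSE.k, type = "b", xlab = "subset size k", ylab = "cross-validated MSE")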

We find that k should be 6, so we can reconstruct the chosen model:

summary(r)$which[optimal.k, ]
## (Intercept)        CRIM          ZN       INDUS        CHAS         NOX 
##        TRUE       FALSE        TRUE       FALSE       FALSE        TRUE 
##          RM         AGE         DIS         RAD         TAX     PTRATIO 
##        TRUE       FALSE        TRUE       FALSE       FALSE        TRUE 
##           B       LSTAT 
##       FALSE        TRUE
# Refit the selected model manually with lm():
summary(subset.model <- lm(MEDV ~ 1 + ZN + NOX + RM + DIS + PTRATIO + LSTAT, 
    data = data))
## 
## Call:
## lm(formula = MEDV ~ 1 + ZN + NOX + RM + DIS + PTRATIO + LSTAT, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.714  -2.796  -0.541   1.777  26.628 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.6453     5.4552    7.08  8.0e-12 ***
## ZN            0.0502     0.0159    3.15   0.0018 ** 
## NOX         -18.5916     3.9853   -4.67  4.4e-06 ***
## RM            3.6534     0.5109    7.15  5.3e-12 ***
## DIS          -1.5435     0.2268   -6.81  4.5e-11 ***
## PTRATIO      -0.8789     0.1366   -6.43  4.2e-10 ***
## LSTAT        -0.5794     0.0581   -9.97  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.84 on 343 degrees of freedom
## Multiple R-squared:  0.721,  Adjusted R-squared:  0.716 
## F-statistic:  148 on 6 and 343 DF,  p-value: <2e-16
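
Instead of typing the selected predictors by hand as above, the formula could also be built from the selection vector; a small sketch (output not shown):

# Build the subset formula from the logical row of summary(r)$which,
# dropping the intercept column:
sel <- summary(r)$which[optimal.k, -1]
reformulate(names(sel)[sel], response = "MEDV")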

Question 7

library("MASS")
# Ridge model with a fixed shrinkage value (lambda = 4.2596):
lm.ridge(MEDV ~ ., data = data, lambda = 4.2596)
##                  CRIM         ZN      INDUS       CHAS        NOX 
##  37.020329  -0.106836   0.059336   0.004874   1.586067 -15.619322 
##         RM        AGE        DIS        RAD        TAX    PTRATIO 
##   3.451142  -0.005939  -1.553124   0.256373  -0.011721  -0.823948 
##          B      LSTAT 
##   0.006382  -0.509399

# Now we try some shrinkage values:
model <- lm.ridge(MEDV ~ ., data = data, lambda = seq(0, 10, 1e-04))
plot(model)

[Plot: ridge coefficient traces across the grid of lambda values]

# Use the built-in select() (select.ridgelm) for approximate cross-validation:
select(model)
## modified HKB estimator is 4.363 
## modified L-W estimator is 3.926 
## smallest value of GCV  at 3.96

# We can now refit with the lambda that minimizes GCV:
(ridge.model <- lm.ridge(MEDV ~ ., data = data, lambda = 3.96))
##                  CRIM         ZN      INDUS       CHAS        NOX 
##  37.171507  -0.107251   0.059612   0.006059   1.579998 -15.709846 
##         RM        AGE        DIS        RAD        TAX    PTRATIO 
##   3.445418  -0.005906  -1.559072   0.258979  -0.011844  -0.825136 
##          B      LSTAT 
##   0.006378  -0.510214
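
The GCV-minimizing lambda can also be read directly off the fitted object, instead of from the printed select() output (output not shown):

# lambda value at which the GCV criterion is smallest
model$lambda[which.min(model$GCV)]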

Question 8

library("lars")
## Loaded lars 1.2
# Fit the full lasso path (also used for the plot below):
(lasso.model <- lars(x = as.matrix(data[, colnames(data) != "MEDV"]), y = data$MEDV, 
    type = "lasso"))
## 
## Call:
## lars(x = as.matrix(data[, colnames(data) != "MEDV"]), y = data$MEDV, 
##     type = "lasso")
## R-squared: 0.745 
## Sequence of LASSO moves:
##      LSTAT RM PTRATIO TAX  B CHAS CRIM DIS ZN NOX INDUS RAD INDUS AGE
## Var     13  6      11  10 12    4    1   8  2   5     3   9    -3   7
## Step     1  2       3   4  5    6    7   8  9  10    11  12    13  14
##      INDUS
## Var      3
## Step    15
# Do the cross-validation and find the optimal fraction:
cv <- cv.lars(x = as.matrix(data[, colnames(data) != "MEDV"]), y = data$MEDV, 
    type = "lasso")

[Plot: cross-validated MSE from cv.lars against the fraction of the final L1 norm]

which(cv$cv == min(cv$cv))/100
## [1] 0.92

We find that the fraction shown above is the optimal lasso shrinkage, expressed as a fraction of the L1 norm of the full least-squares fit.
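
For completeness, the lasso coefficients at this fraction can be extracted in the same way as fitted values (output not shown):

# coefficients of the lasso fit at 92% of the final L1 norm
round(coef(lasso.model, s = 0.92, mode = "fraction"), 4)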

We can make a plot to see exactly what happens:

plot(lasso.model)

[Plot: lasso coefficient paths against the fraction of the final L1 norm]

Here the x-axis is the fraction of the final L1 norm that the coefficients are allowed to sum to, and the y-axis shows the corresponding coefficient values. The optimum we found lies near the vertical line labelled “14”.

Question 9

Now we compare the optimal models we found in the previous questions on the test data set. We expect only small differences, because all of them are linear predictions based on the same information.

# First we calculate the MSE of the full linear model on the training data
# set, so that we get a feel for the influence of (over-)fitting to the
# training data.
mean(residuals(simple.model)^2)
## [1] 20.98

# Now we apply the various models to the test data:

# Custom prediction function, really only needed for ridge prediction
predict.custom <- function(coefficients) {
    as.matrix(cbind(1, test[, colnames(test) != "MEDV"])) %*% as.matrix(coefficients)
}

# Simple full model: MSE on the test data (we could also have used the
# normal predict() here):
mean((predict.custom(coef(simple.model)) - test$MEDV)^2)
## [1] 25.15

# Subset model:
mean((predict(subset.model, newdata = test[, colnames(test) != "MEDV"]) - test$MEDV)^2)
## [1] 27.99

# Ridge model: MSE on the test data. (Note that we don't have to manually
# de-normalize the coefficients: using coef() instead of accessing them via
# $coef gives them on the original scale, intercept included.)
mean((predict.custom(coef(ridge.model)) - test$MEDV)^2)
## [1] 25.12

# Lasso model: We use the fraction we found:
predicted <- predict.lars(lasso.model, newx = test[, colnames(test) != "MEDV"], 
    type = "fit", s = 0.92, mode = "fraction")$fit
mean((predicted - test$MEDV)^2)
## [1] 25.18
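
To put the four test errors side by side, a small convenience sketch (output not shown):

# Collect the test MSEs of the four models in one named vector
test.mse <- c(full = mean((predict.custom(coef(simple.model)) - test$MEDV)^2),
    subset = mean((predict(subset.model, newdata = test) - test$MEDV)^2),
    ridge = mean((predict.custom(coef(ridge.model)) - test$MEDV)^2),
    lasso = mean((predicted - test$MEDV)^2))
round(sort(test.mse), 2)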

We see that the ridge model does best of all of these, slightly ahead of the full linear model (which we did not strictly need to include here). The subset model does worst by far; this might be because subset selection adds an extra step of fitting to the training data, whereas the other methods sacrifice some accuracy on the training data for better generalization.

So our final model has the following intercept (the first, unlabelled value) and coefficients:

ridge.model
##                  CRIM         ZN      INDUS       CHAS        NOX 
##  37.171507  -0.107251   0.059612   0.006059   1.579998 -15.709846 
##         RM        AGE        DIS        RAD        TAX    PTRATIO 
##   3.445418  -0.005906  -1.559072   0.258979  -0.011844  -0.825136 
##          B      LSTAT 
##   0.006378  -0.510214