Capstone Quiz III

Complete all Exercises, and submit answers to Questions on the Coursera platform.

This third and final quiz will deal with model validation and out-of-sample prediction. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.

In general, we use data to help select model(s) and to estimate parameters, seeking parsimonious models that provide a good fit to the data and have small prediction errors overall. This may lead to models that do very well at predicting on the data used in model building. However, summaries of predictions made in the model training period are not completely “honest” because data are often used to select a model that provides a good fit to the observed data, which may not generalize to future data. A good way to test the assumptions of a model and to realistically compare its predictive performance against other models is to perform out-of-sample validation, which means to withhold some of the sample data from the model training process and then use the model to make predictions for the hold-out data (test data).

First, let us load the data:

load("ames_train.Rdata")
library(MASS)

## Warning: package 'MASS' was built under R version 3.4.4

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.4

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.3

library(lubridate)

## Warning: package 'lubridate' was built under R version 3.4.3

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(BAS)

## Warning: package 'BAS' was built under R version 3.4.4

library(broom)

## Warning: package 'broom' was built under R version 3.4.4

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.4.3

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

As in Quiz 2, we are only concerned with predicting the price for houses sold under normal selling conditions, since partial and abnormal sales may have a different generating process altogether. We ensure that the training data only includes houses sold under normal conditions:

library(dplyr)
ames_train <- ames_train %>%
  filter(Sale.Condition == "Normal")

The first few questions will concern comparing the out-of-sample performance of two linear models. Note that when log-transforming a covariate that can take values equal to zero, a useful trick is to add 1 before transforming the variable (since $\log(0) = -\infty$).

library(MASS)
# Full Model (no variable selection)
model.full <- lm(log(price) ~ Overall.Qual + log(Garage.Area + 1) +   
                  log(Total.Bsmt.SF + 1) + Garage.Cars + log(area) + 
                  Full.Bath + Half.Bath + 
                  Bedroom.AbvGr + Year.Built + log(X1st.Flr.SF) + 
                  log(X2nd.Flr.SF + 1) +
                  log(Lot.Area) +  Central.Air + Overall.Cond,
                 data = ames_train)

# Model selection using AIC
model.AIC <- stepAIC(model.full, k = 2)

load("ames_test.Rdata")
ames_test <- ames_test %>%
  filter(Sale.Condition == "Normal")

Use the predict function in R to predict log(price) in the testing data set (ames_test). Under model.AIC, what is the mean predicted price in the testing data set?
1. $12.02
2. $166,721.30
3. $172,994.50
4. $177,220.30

# Predict the log(price) based on the test data #

predict_logPrice <- predict(model.AIC, newdata = ames_test)
mean(predict_logPrice)

## [1] 12.02408

One metric for comparing out-of-sample performance for multiple models is called root mean squared error (RMSE). Within the context of linear modeling, this involves taking the square root of the mean of the squared residuals. For example, assuming the dependent variable of interest is price, the RMSE for model.full is:

# Extract Predictions
predict.full <- exp(predict(model.full, ames_test))

# Extract Residuals
resid.full <- ames_test$price - predict.full

# Calculate RMSE
rmse.full <- sqrt(mean(resid.full^2))
rmse.full

## [1] 24197.34

In general, the better the model fit, the lower the RMSE.

Which of the following statements is true concerning the RMSE of model.full and model.AIC?
1. When predicting to ames_train, the RMSE for model.full is higher than the RMSE for model.AIC. However, when predicting to ames_test, the RMSE for model.AIC is higher.
2. When predicting to ames_train, the RMSE for model.AIC is higher than the RMSE for model.full. However, when predicting to ames_test, the RMSE for model.full is higher.
3. The RMSE for model.full is higher than the RMSE for model.AIC, regardless of whether ames_train or ames_test is used for prediction.
4. The RMSE for model.AIC is higher than the RMSE for model.full, regardless of whether ames_train or ames_test is used for prediction.

# Comparing the RMSE for different combinations for out of sample performance #

# Extract Predictions #
predict.full.train <- exp(predict(model.full, ames_train))
predict.full.test  <-  exp(predict(model.full, ames_test))
predict.AIC.train  <-  exp(predict(model.AIC, ames_train))
predict.AIC.test   <-   exp(predict(model.AIC, ames_test))

# Extract Residuals
resid.full.train <- ames_train$price - predict.full.train
resid.full.test  <-  ames_test$price -  predict.full.test
resid.AIC.train  <-  ames_train$price - predict.AIC.train
resid.AIC.test   <-   ames_test$price  - predict.AIC.test

# Calculate RMSE
rmse.full.train <- sqrt(mean(resid.full.train^2))
rmse.full.test  <- sqrt(mean(resid.full.test^2))
rmse.AIC.train  <- sqrt(mean(resid.AIC.train^2))
rmse.AIC.test   <- sqrt(mean(resid.AIC.test^2))


# Print the RMSE or out of sample errors #
rmse.full.train

## [1] 22706.55

rmse.full.test

## [1] 24197.34

rmse.AIC.train

## [1] 22823.41

rmse.AIC.test

## [1] 24188.71

True or False: In general, the RMSE for predictions on a training data set will be higher than that for predictions on a testing data set.
1. True
2. False

# type your code for Question 3 here, and Knit

One way to assess how well a model reflects uncertainty is determining coverage probability. For example, if assumptions are met, a 95% prediction interval for price should include the true value of price roughly 95% of the time. If the true proportion of out-of-sample prices that fall within the 95% prediction interval are significantly greater than or less than 0.95, then some assumptions regarding uncertainty may not be met. To calculate the coverage probability for model.full, we do the following:

# Predict prices
predict.full <- exp(predict(model.full, ames_test, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage.prob.full <- mean(ames_test$price > predict.full[,"lwr"] &
                            ames_test$price < predict.full[,"upr"])
coverage.prob.full

## [1] 0.9522644

Create a new model entitled model.BIC that uses BIC to select the covariates from model.full. What is the out-of-sample coverage for model.BIC?
1. 0.948
2. 0.950
3. 0.952
4. 0.961

# type your code for Question 4 here, and Knit

In Course 4, we introduced Bayesian model averaging (BMA), which involves averaging over many possible models. This can be implemented in the BAS R package. So far, we have focused on performing model selection using BMA. However, BMA can also be used for prediction, averaging predictions from multiple models according to their posterior probabilities. The next few questions will give you some practice using BAS for prediction.

library(BAS)

# Fit the BAS model
model.bas <- bas.lm(log(price) ~ Overall.Qual + log(Garage.Area + 1) +   
                  log(Total.Bsmt.SF + 1) + Garage.Cars + log(area) + 
                  Full.Bath + Half.Bath + 
                  Bedroom.AbvGr + Year.Built + log(X1st.Flr.SF) + 
                  log(X2nd.Flr.SF + 1) +
                  log(Lot.Area) +  Central.Air + Overall.Cond,
                 data = ames_train, prior = "AIC", modelprior=uniform())

Within BAS, you can predict using four different options: HPM, the highest probability model, MPM, the median probability model, BMA, an average over all the models, and BPM, the single model with predictions closest to those obtained from BMA. For example, you could run the following command to generate out-of-sample predictions and RMSE for the highest probability model.

pred.test.HPM <- predict(model.bas, newdata = ames_test, estimator="HPM")
pred.HPM.rmse <- sqrt(mean((exp(pred.test.HPM$fit) - ames_test$price)^2))
pred.HPM.rmse

## [1] 24188.71

Which of the following prediction methods has the smallest out-of-sample RMSE?
1. HPM
2. BPM
3. BMA

# type your code for Question 5 here, and Knit

To obtain out-of-sample prediction intervals and coverage for the highest probability model, we can execute the following command:

pred.test.HPM <- predict(model.bas, ames_test, 
                    estimator="HPM", 
                    prediction=TRUE, se.fit=TRUE)

# Get dataset of predictions and confidence intervals
out = as.data.frame(cbind(exp(confint(pred.test.HPM)),
                          price = ames_test$price))

# Fix names in dataset
colnames(out)[1:2] <- c("lwr", "upr")  #fix names

# Get Coverage
pred.test.HPM.coverage <- out %>% summarize(cover = sum(price >= lwr & price <= upr)/n())
pred.test.HPM.coverage

##       cover
## 1 0.9510404

Using the median probability model to generate out-of-sample predictions and a 95% prediction interval, what proportion of observations (rows) in ames_test have sales prices that fall outside the prediction intervals?
1. 0.048
2. 0.049
3. 0.050
4. 0.051

# type your code for Question 6 here, and Knit

Plotting residuals versus the predicted values or other predictors can provide insight into where the model may not be optimal.

pred.test.MPM <- predict(model.bas, ames_test, 
                    estimator="MPM", 
                    prediction=TRUE, se.fit=TRUE)
resid.MPM = ames_test$price - exp(pred.test.MPM$fit)
plot(ames_test$price, resid.MPM, 
     xlab="Price",
     ylab="Residuals")

True or False: The median probability model has a tendency to over-predict prices for the most expensive houses.
1. True
2. False

Summary

Holding data out for a final “validation”" purposes is probably the single most important diagnostic test of a model: it gives the best indication of the accuracy that can be expected when predicting the future. If we commit to a particular model and then compare its prediction accuracy on the test data, we have an honest assessment of its future performance. However, in practice, if you are finding that you are using the results from the predictions on the test set to refine your final model, then this test data has in essence become part of the training data, and there is a danger as with the training data that the best model may be over-state its predictive accuracy. This has led many practioners to create a third hold-out sample or validation dataset to provide a final summary of the expected prediction accuracy.