Complete all Exercises, and submit answers to Questions on the Coursera platform.
This third and final quiz will deal with model validation and out-of-sample prediction. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.
In general, we use data to help select model(s) and to estimate parameters, seeking parsimonious models that provide a good fit to the data and have small prediction errors overall. This may lead to models that do very well at predicting on the data used in model building. However, summaries of predictions made in the model training period are not completely “honest” because data are often used to select a model that provides a good fit to the observed data, which may not generalize to future data. A good way to test the assumptions of a model and to realistically compare its predictive performance against other models is to perform out-of-sample validation, which means to withhold some of the sample data from the model training process and then use the model to make predictions for the hold-out data (test data).
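For example, a minimal sketch of such a hold-out split (the data frame dat below is hypothetical and not part of this quiz) might look like:
# Randomly reserve about 20% of the rows of a generic data frame `dat` as test data.
set.seed(42)                                       # for reproducibility
test.idx  <- sample(seq_len(nrow(dat)), size = round(0.2 * nrow(dat)))
dat.train <- dat[-test.idx, ]                      # used to fit the model
dat.test  <- dat[test.idx, ]                       # held out for validation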
First, let us load the data:
load("ames_train.Rdata")As in Quiz 2, we are only concerned with predicting the price for houses sold under normal selling conditions, since partial and abnormal sales may have a different generating process altogether. We ensure that the training data only includes houses sold under normal conditions:
library(dplyr)
ames_train <- ames_train %>%
  filter(Sale.Condition == "Normal")
The first few questions will concern comparing the out-of-sample performance of two linear models. Note that when log-transforming a covariate that can take values equal to zero, a useful trick is to add 1 before transforming the variable (since \(\log(0) = -\infty\)).
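For instance (a quick illustration only, not part of the quiz code):
log(0)      # -Inf, which would break the regression fit
log(0 + 1)  # 0, a finite value, so houses with no garage or basement are retained
log1p(0)    # base R's log1p(x) computes log(1 + x) and is an equivalent alternative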
library(MASS)
# Full Model (no variable selection)
model.full <- lm(log(price) ~ Overall.Qual + log(Garage.Area + 1) +
log(Total.Bsmt.SF + 1) + Garage.Cars + log(area) +
Full.Bath + Half.Bath +
Bedroom.AbvGr + Year.Built + log(X1st.Flr.SF) +
log(X2nd.Flr.SF + 1) +
log(Lot.Area) + Central.Air + Overall.Cond,
data = ames_train)
# Model selection using AIC
model.AIC <- stepAIC(model.full, k = 2)
Next, load the test data:
load("ames_test.Rdata")
Use the predict function in R to predict log(price) in the testing data set (ames_test). Under model.AIC, what is the mean predicted price in the testing data set?
# type your code for Question 1 here, and Knit
predict_q1 <- exp(predict(model.AIC, newdata = ames_test))
mean(predict_q1)
## [1] 177220.3
First apply model.AIC to ames_test with the predict function, exponentiate the predictions, and then take the mean. Remember that we are modeling log(price), not price, so all predictions must be exponentiated.
This question refers to the following learning objective(s): Extrapolate a model to out-of-sample data.
One metric for comparing out-of-sample performance for multiple models is called root mean squared error (RMSE). Within the context of linear modeling, this involves taking the square root of the mean of the squared residuals. For example, assuming the dependent variable of interest is price, the RMSE for model.full is:
# Extract Predictions
predict.full <- exp(predict(model.full, ames_test))
# Extract Residuals
resid.full <- ames_test$price - predict.full
# Calculate RMSE
rmse.full <- sqrt(mean(resid.full^2))
rmse.full
## [1] 24197.34
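In symbols, this code computes \[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \] where \(y_i\) is the observed price and \(\hat{y}_i\) the predicted price for the \(i\)-th of the \(n\) hold-out observations.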
In general, the better the model fit, the lower the RMSE.
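Since the next question asks for this quantity under several model/data combinations, a small helper function (a sketch only; the name rmse is not defined anywhere in the quiz itself) avoids repeating the same expression:
# Hypothetical helper: out-of-sample RMSE of a fitted log(price) model on a data set.
rmse <- function(model, data) {
  predicted <- exp(predict(model, newdata = data))  # back-transform to the price scale
  sqrt(mean((data$price - predicted)^2))
}
# e.g. rmse(model.full, ames_test) or rmse(model.AIC, ames_train)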
Which of the following statements is true about the RMSE for model.full and model.AIC?
When predicting on ames_train, the RMSE for model.full is higher than the RMSE for model.AIC. However, when predicting to ames_test, the RMSE for model.AIC is higher.
When predicting on ames_train, the RMSE for model.AIC is higher than the RMSE for model.full. However, when predicting to ames_test, the RMSE for model.full is higher.
The RMSE for model.full is higher than the RMSE for model.AIC, regardless of whether ames_train or ames_test is used for prediction.
The RMSE for model.AIC is higher than the RMSE for model.full, regardless of whether ames_train or ames_test is used for prediction.
# type your code for Question 2 here, and Knit
sqrt(mean((ames_test$price - exp(predict(model.full, ames_test)))^2))
## [1] 24197.34
sqrt(mean((ames_test$price - exp(predict(model.AIC, ames_test)))^2))
## [1] 24188.71
sqrt(mean((ames_train$price - exp(predict(model.full, ames_train)))^2))
## [1] 22706.55
sqrt(mean((ames_train$price - exp(predict(model.AIC, ames_train)))^2))
## [1] 22823.41
Use the code given above to calculate the RMSE for ames_train and ames_test under both model.AIC and model.full. Then compare your results.
This question refers to the following learning objective(s): Extrapolate a model to out-of-sample data. Compare the performance of multiple models.
# type your code for Question 3 here, and Knit
Because the model is built to fit the training data, it will generally fit out-of-sample data worse. As a result, the RMSE for predictions on out-of-sample data will be higher.
This question refers to the following learning objective(s): Extrapolate a model to out-of-sample data.
One way to assess how well a model reflects uncertainty is to determine its coverage probability. For example, if assumptions are met, a 95% prediction interval for price should include the true value of price roughly 95% of the time. If the true proportion of out-of-sample prices that fall within the 95% prediction interval is significantly greater than or less than 0.95, then some assumptions regarding uncertainty may not be met. To calculate the coverage probability for model.full, we do the following:
# Predict prices
predict.full <- exp(predict(model.full, ames_test, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage.prob.full <- mean(ames_test$price > predict.full[,"lwr"] &
ames_test$price < predict.full[,"upr"])
coverage.prob.full
## [1] 0.9522644
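As a rough check (a sketch, not part of the original quiz) on whether the observed coverage is consistent with the nominal 95% level, each test observation can be treated as an independent Bernoulli trial and the observed count compared against 0.95:
# Is the observed coverage significantly different from the nominal 0.95?
covered <- ames_test$price > predict.full[, "lwr"] &
  ames_test$price < predict.full[, "upr"]
binom.test(sum(covered), length(covered), p = 0.95)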
Now fit a new model, model.BIC, that uses BIC to select the covariates from model.full. What is the out-of-sample coverage for model.BIC?
# type your code for Question 4 here, and Knit
k.BIC <- log(nrow(ames_train))
model.BIC <- stepAIC(model.full, k = k.BIC)
## Start: AIC=-3478.96
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + log(Total.Bsmt.SF +
## 1) + Garage.Cars + log(area) + Full.Bath + Half.Bath + Bedroom.AbvGr +
## Year.Built + log(X1st.Flr.SF) + log(X2nd.Flr.SF + 1) + log(Lot.Area) +
## Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## - Half.Bath 1 0.0021 11.405 -3485.5
## - Full.Bath 1 0.0046 11.407 -3485.4
## - log(X2nd.Flr.SF + 1) 1 0.0093 11.412 -3485.0
## - log(Garage.Area + 1) 1 0.0327 11.435 -3483.3
## - log(X1st.Flr.SF) 1 0.0497 11.452 -3482.1
## <none> 11.403 -3479.0
## - Central.Air 1 0.1716 11.574 -3473.2
## - Garage.Cars 1 0.3544 11.757 -3460.2
## - Bedroom.AbvGr 1 0.4994 11.902 -3449.9
## - log(Total.Bsmt.SF + 1) 1 0.5318 11.934 -3447.7
## - log(area) 1 0.9155 12.318 -3421.3
## - Overall.Cond 1 2.0344 13.437 -3348.8
## - log(Lot.Area) 1 2.2192 13.622 -3337.4
## - Year.Built 1 3.1742 14.577 -3280.9
## - Overall.Qual 1 4.1292 15.532 -3227.9
##
## Step: AIC=-3485.54
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + log(Total.Bsmt.SF +
## 1) + Garage.Cars + log(area) + Full.Bath + Bedroom.AbvGr +
## Year.Built + log(X1st.Flr.SF) + log(X2nd.Flr.SF + 1) + log(Lot.Area) +
## Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## - Full.Bath 1 0.0086 11.413 -3491.6
## - log(X2nd.Flr.SF + 1) 1 0.0089 11.414 -3491.6
## - log(Garage.Area + 1) 1 0.0346 11.439 -3489.7
## - log(X1st.Flr.SF) 1 0.0476 11.452 -3488.8
## <none> 11.405 -3485.5
## - Central.Air 1 0.1732 11.578 -3479.7
## - Garage.Cars 1 0.3609 11.765 -3466.3
## - Bedroom.AbvGr 1 0.4990 11.904 -3456.5
## - log(Total.Bsmt.SF + 1) 1 0.5329 11.938 -3454.2
## - log(area) 1 1.0089 12.414 -3421.6
## - Overall.Cond 1 2.0354 13.440 -3355.3
## - log(Lot.Area) 1 2.2328 13.637 -3343.1
## - Year.Built 1 3.7078 15.113 -3257.5
## - Overall.Qual 1 4.1312 15.536 -3234.4
##
## Step: AIC=-3491.63
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + log(Total.Bsmt.SF +
## 1) + Garage.Cars + log(area) + Bedroom.AbvGr + Year.Built +
## log(X1st.Flr.SF) + log(X2nd.Flr.SF + 1) + log(Lot.Area) +
## Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## - log(X2nd.Flr.SF + 1) 1 0.0089 11.422 -3497.7
## - log(Garage.Area + 1) 1 0.0305 11.444 -3496.1
## - log(X1st.Flr.SF) 1 0.0479 11.461 -3494.9
## <none> 11.413 -3491.6
## - Central.Air 1 0.1838 11.597 -3485.0
## - Garage.Cars 1 0.3523 11.766 -3473.0
## - Bedroom.AbvGr 1 0.5287 11.942 -3460.6
## - log(Total.Bsmt.SF + 1) 1 0.5485 11.962 -3459.2
## - log(area) 1 1.0047 12.418 -3428.0
## - Overall.Cond 1 2.0419 13.455 -3361.1
## - log(Lot.Area) 1 2.2527 13.666 -3348.1
## - Year.Built 1 3.9347 15.348 -3251.3
## - Overall.Qual 1 4.1226 15.536 -3241.2
##
## Step: AIC=-3497.7
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + log(Total.Bsmt.SF +
## 1) + Garage.Cars + log(area) + Bedroom.AbvGr + Year.Built +
## log(X1st.Flr.SF) + log(Lot.Area) + Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## - log(Garage.Area + 1) 1 0.0315 11.454 -3502.1
## <none> 11.422 -3497.7
## - Central.Air 1 0.1832 11.605 -3491.2
## - Garage.Cars 1 0.3579 11.780 -3478.7
## - Bedroom.AbvGr 1 0.5293 11.951 -3466.7
## - log(Total.Bsmt.SF + 1) 1 0.5460 11.968 -3465.5
## - log(X1st.Flr.SF) 1 0.7186 12.141 -3453.5
## - Overall.Cond 1 2.0356 13.458 -3367.7
## - log(Lot.Area) 1 2.2498 13.672 -3354.5
## - Overall.Qual 1 4.1137 15.536 -3247.9
## - Year.Built 1 4.1800 15.602 -3244.3
## - log(area) 1 5.0064 16.429 -3201.3
##
## Step: AIC=-3502.13
## log(price) ~ Overall.Qual + log(Total.Bsmt.SF + 1) + Garage.Cars +
## log(area) + Bedroom.AbvGr + Year.Built + log(X1st.Flr.SF) +
## log(Lot.Area) + Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## <none> 11.454 -3502.1
## - Central.Air 1 0.1577 11.611 -3497.5
## - Garage.Cars 1 0.3751 11.829 -3482.0
## - Bedroom.AbvGr 1 0.5265 11.980 -3471.4
## - log(Total.Bsmt.SF + 1) 1 0.5412 11.995 -3470.4
## - log(X1st.Flr.SF) 1 0.7287 12.182 -3457.4
## - Overall.Cond 1 2.0332 13.487 -3372.6
## - log(Lot.Area) 1 2.2188 13.672 -3361.2
## - Overall.Qual 1 4.1322 15.586 -3251.9
## - Year.Built 1 4.2494 15.703 -3245.7
## - log(area) 1 5.0604 16.514 -3203.7
predict.BIC <- exp(predict(model.BIC, ames_test, interval = "prediction"))
coverage.prob.BIC <- mean(ames_test$price > predict.BIC[,"lwr"] &
ames_test$price < predict.BIC[,"upr"])
coverage.prob.BIC
## [1] 0.9498164
Recall that BIC is the same as AIC, except that the penalty parameter \(k\) is the log of the sample size rather than 2. Fit a new linear model using BIC and use the code above to help you extract the out-of-sample coverage.
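In formula terms, for a model with \(p\) estimated parameters, maximized likelihood \(\hat{L}\), and sample size \(n\), \[ \text{AIC} = -2\log\hat{L} + 2p, \qquad \text{BIC} = -2\log\hat{L} + p\log(n), \] so supplying k = log(nrow(ames_train)) to stepAIC makes the procedure minimize BIC rather than AIC.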
This question refers to the following learning objective(s): Check the assumptions of a linear model. Extrapolate a model to out-of-sample data.
In Course 4, we introduced Bayesian model averaging (BMA), which involves averaging over many possible models. This can be implemented in the BAS R package. So far, we have focused on performing model selection using BMA. However, BMA can also be used for prediction, averaging predictions from multiple models according to their posterior probabilities. The next few questions will give you some practice using BAS for prediction.
library(BAS)
# Fit the BAS model
model.bas <- bas.lm(log(price) ~ Overall.Qual + log(Garage.Area + 1) +
log(Total.Bsmt.SF + 1) + Garage.Cars + log(area) +
Full.Bath + Half.Bath +
Bedroom.AbvGr + Year.Built + log(X1st.Flr.SF) +
log(X2nd.Flr.SF + 1) +
log(Lot.Area) + Central.Air + Overall.Cond,
                   data = ames_train, prior = "AIC", modelprior = uniform())
Within BAS, you can predict using four different options: HPM, the highest probability model; MPM, the median probability model; BMA, an average over all the models; and BPM, the single model with predictions closest to those obtained from BMA. For example, you could run the following command to generate out-of-sample predictions and RMSE for the highest probability model.
pred.test.HPM <- predict(model.bas, newdata = ames_test, estimator="HPM")
pred.HPM.rmse <- sqrt(mean((exp(pred.test.HPM$fit) - ames_test$price)^2))
pred.HPM.rmse
## [1] 24188.71
# type your code for Question 5 here, and Knit
sqrt(mean((exp(predict(model.bas, newdata = ames_test, estimator = "BMA")$fit) - ames_test$price)^2))
## [1] 24217.35
sqrt(mean((exp(predict(model.bas, newdata = ames_test, estimator = "BPM")$fit) - ames_test$price)^2))
## [1] 24189.98
sqrt(mean((exp(predict(model.bas, newdata = ames_test, estimator = "HPM")$fit) - ames_test$price)^2))
## [1] 24188.71
For each of the four prediction methods mentioned above, use code similar to that in the block above to find the RMSE. Then compare the models.
This question refers to the following learning objective(s): Extrapolate a model to out-of-sample data. Implement Bayesian model averaging for both prediction and variable selection.
To obtain out-of-sample prediction intervals and coverage for the highest probability model, we can execute the following command:
pred.test.HPM <- predict(model.bas, ames_test,
estimator="HPM",
prediction=TRUE, se.fit=TRUE)
# Get data frame of predictions and prediction intervals
out = as.data.frame(cbind(exp(confint(pred.test.HPM)),
price = ames_test$price))
# Fix names in dataset
colnames(out)[1:2] <- c("lwr", "upr") #fix names
# Get Coverage
pred.test.HPM.coverage <- out %>% summarize(cover = sum(price >= lwr & price <= upr)/n())
pred.test.HPM.coverage
## cover
## 1 0.9510404
How many houses in ames_test have sales prices that fall outside the prediction intervals?
# type your code for Question 6 here, and Knit
pred.test.MPM <- predict(model.bas, ames_test,
estimator="MPM",
prediction=TRUE, se.fit=TRUE)
out.MPM <- as.data.frame(cbind(exp(confint(pred.test.MPM)),
price = ames_test$price))
colnames(out.MPM)[1:2] <- c("lwr", "upr")
pred.test.MPM.coverage.out <- out.MPM %>%
summarize(cover.out = sum(price >= lwr & price <= upr)/n())
1 - pred.test.MPM.coverage.out
## cover.out
## 1 0.04895961
Implement the code run in the previous block, using MPM instead of HPM, to generate predictions and 95% prediction intervals. Count the number of observations in ames_test that fall outside the prediction intervals.
This question refers to the following learning objective(s): Extrapolate a model to out-of-sample data. Implement Bayesian model averaging for both prediction and variable selection.
Plotting residuals versus the predicted values or other predictors can provide insight into where the model may not be optimal.
pred.test.MPM <- predict(model.bas, ames_test,
estimator="MPM",
prediction=TRUE, se.fit=TRUE)
resid.MPM = ames_test$price - exp(pred.test.MPM$fit)
plot(ames_test$price, resid.MPM,
xlab="Price",
ylab="Residuals")Interpret the plot above. Do the most expensive homes in the test dataset have high or low residuals? What does that mean about whether our model over-predicts or under-predicts housing prices?
This question refers to the following learning objective(s): Extrapolate a model to out-of-sample data.
Holding data out for final “validation” purposes is probably the single most important diagnostic test of a model: it gives the best indication of the accuracy that can be expected when predicting the future. If we commit to a particular model and then assess its prediction accuracy on the test data, we have an honest estimate of its future performance. In practice, however, if you find yourself using the results of predictions on the test set to refine your final model, then the test data has in essence become part of the training data, and there is a danger, as with the training data, that the chosen model will overstate its predictive accuracy. This has led many practitioners to create a third hold-out sample, or validation dataset, to provide a final summary of the expected prediction accuracy.
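For example (a hypothetical sketch; none of the objects below appear elsewhere in this quiz), a data frame dat could be divided three ways up front:
# Hypothetical three-way split: ~60% training, ~20% testing, ~20% validation.
set.seed(42)
idx     <- sample(seq_len(nrow(dat)))              # shuffle the row indices
n.train <- round(0.6 * nrow(dat))
n.test  <- round(0.2 * nrow(dat))
dat.train <- dat[idx[1:n.train], ]                            # fit candidate models
dat.test  <- dat[idx[(n.train + 1):(n.train + n.test)], ]     # compare and refine models
dat.valid <- dat[idx[(n.train + n.test + 1):nrow(dat)], ]     # final, untouched assessment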
By Cory Torrella Via Duke University - Statistics with R Specialization