Complete all Exercises, and submit answers to Questions on the Coursera platform.
This second quiz will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.
First, let us load the data:
load("ames_train.Rdata")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Suppose you regress \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
\(\log\)(area)
\(\log\)(Lot.Area)
Bedroom.AbvGr
Overall.Qual
Land.Slope
data <- ames_train %>%
  dplyr::select(price, area, Lot.Area, Bedroom.AbvGr, Overall.Qual, Land.Slope)
data.model <- data[complete.cases(data),]
lm_q1 <- lm(log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr +
Overall.Qual + Land.Slope, data = data.model )
step_model <- stepAIC(lm_q1, trace = FALSE, k = 2)
step_model$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
## Final Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 993 36.53597 -3295.458
step_BIC <- stepAIC(lm_q1, trace = FALSE, k = log(nrow(data.model)))
step_BIC$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
## Final Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 993 36.53597 -3261.104
## 2 - Land.Slope 2 0.4466652 995 36.98263 -3262.768
Use the function stepAIC from the MASS package. In AIC, \(k = 2\), whereas in BIC, \(k = \log(n)\).
This question refers to the following learning objective(s): Use principled statistical methods to select a single parsimonious model.
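To see directly why Land.Slope survives the AIC search but is dropped by the BIC search, you can also compare the two criteria by hand. This is a minimal sketch, not part of the original solution; lm_no_slope is just an illustrative name.
lm_no_slope <- update(lm_q1, . ~ . - Land.Slope)  # same model without Land.Slope
c(AIC_full = AIC(lm_q1), AIC_reduced = AIC(lm_no_slope))  # AIC should favor keeping Land.Slope
c(BIC_full = BIC(lm_q1), BIC_reduced = BIC(lm_no_slope))  # BIC's log(n) penalty should favor dropping it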
In a simple linear regression of \(\log\)(price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
lm_q2 <- lm(log(price) ~ Bedroom.AbvGr, data = ames_train)
summary(lm_q2)
##
## Call:
## lm(formula = log(price) ~ Bedroom.AbvGr, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4815 -0.2609 -0.0455 0.2417 1.4915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.73794 0.04582 256.188 < 2e-16 ***
## Bedroom.AbvGr 0.09998 0.01565 6.387 2.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4125 on 998 degrees of freedom
## Multiple R-squared: 0.03927, Adjusted R-squared: 0.03831
## F-statistic: 40.79 on 1 and 998 DF, p-value: 2.588e-10
lm_q2.1 <- lm(log(price) ~ Bedroom.AbvGr + log(area), data = ames_train)
summary(lm_q2.1)
##
## Call:
## lm(formula = log(price) ~ Bedroom.AbvGr + log(area), data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10122 -0.14025 0.02724 0.16890 0.87020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.31767 0.19786 21.82 <2e-16 ***
## Bedroom.AbvGr -0.14753 0.01196 -12.34 <2e-16 ***
## log(area) 1.12063 0.02955 37.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2641 on 997 degrees of freedom
## Multiple R-squared: 0.6066, Adjusted R-squared: 0.6058
## F-statistic: 768.8 on 2 and 997 DF, p-value: < 2.2e-16
Recall: the interpretation of a coefficient holds constant all other variables included in the model.
This question refers to the following learning objective(s): Interpret the estimate for a slope (say \(b_1\)) as "All else held constant, for each unit increase in \(x_1\), we would expect \(y\) to be higher/lower on average by \(b_1\)."
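One quick way to see the mechanism behind the sign flip is to check how strongly Bedroom.AbvGr tracks house size. This is just an illustrative check, not part of the quiz solution.
# Bedroom.AbvGr should be positively correlated with log(area): bigger houses
# tend to have more bedrooms, so without area in the model the bedroom
# coefficient also absorbs the effect of overall size.
cor(ames_train$Bedroom.AbvGr, log(ames_train$area))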
You fit a simple linear model with \(\log\)(price) as the dependent variable and \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?
OldTown
StoneBr
GrnHill
IDOTRR
lm_q3 <- lm(log(price) ~ log(area), data = ames_train)
predict_lm_q3 <- predict(lm_q3, newdata = ames_train)
ames_train$resid3 <- resid(lm_q3)
ames_train %>%
group_by(Neighborhood) %>%
summarise(mean_resid = mean(resid3)) %>%
arrange(desc(mean_resid))
## # A tibble: 27 x 2
## Neighborhood mean_resid
## <fct> <dbl>
## 1 GrnHill 0.509
## 2 StoneBr 0.379
## 3 Greens 0.378
## 4 NridgHt 0.337
## 5 Timber 0.267
## 6 Somerst 0.205
## 7 Blmngtn 0.172
## 8 Veenker 0.129
## 9 NoRidge 0.126
## 10 CollgCr 0.124
## # ... with 17 more rows
Extract the residuals from a model (say m.1) with the command resid(m.1). Then summarize the residuals, grouping by neighborhood.
This question refers to the following learning objective(s): Identify the assumptions of linear regression and assess when a model may need to be improved. Examine the residuals of a linear model.
Using the same model as in Question 3, which of the following neighborhoods is predicted least well, as measured by the average of the squared residuals?
GrnHill
BlueSte
StoneBr
MeadowV
lm_q4 <- lm(log(price) ~ log(area), data = ames_train)
predict_lm_q4 <- predict(lm_q4, newdata = ames_train)
ames_train$resid4 <- resid(lm_q4)
ames_train %>%
group_by(Neighborhood) %>%
summarise(mean_squared_resids = mean(resid4^2)) %>%
arrange(desc(mean_squared_resids))
## # A tibble: 27 x 2
## Neighborhood mean_squared_resids
## <fct> <dbl>
## 1 GrnHill 0.271
## 2 IDOTRR 0.249
## 3 StoneBr 0.201
## 4 OldTown 0.200
## 5 Veenker 0.166
## 6 NridgHt 0.155
## 7 Greens 0.146
## 8 MeadowV 0.132
## 9 SWISU 0.130
## 10 Timber 0.110
## # ... with 17 more rows
The average squared residual is one good measure for comparing how well the model predicts prices across neighborhoods. Find the neighborhood for which it is maximized.
This question refers to the following learning objective(s): Examine the residuals of a linear model.
Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Bsmt.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
selected_vars5 <- ames_train %>%
dplyr::select(price, Overall.Qual, Bsmt.Qual, Garage.Qual)
clean_vars5 <- selected_vars5[complete.cases(selected_vars5),]
print(paste("Number of observations to be discarded is",
nrow(selected_vars5) - nrow(clean_vars5)))
## [1] "Number of observations to be discarded is 64"
Run the model in R. How many observations are used? There are 1000 originally in the data.
This question refers to the following learning objective(s): Identify the assumptions of linear regression and assess when a model may need to be improved.
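As a cross-check, you can fit the model directly and let lm's default na.omit handling drop the incomplete rows. This sketch assumes the illustrative name lm_q5 and should reproduce the count of 64.
lm_q5 <- lm(log(price) ~ Overall.Qual + Bsmt.Qual + Garage.Qual, data = ames_train)
nobs(lm_q5)                     # observations actually used by the fit
nrow(ames_train) - nobs(lm_q5)  # observations discarded; should match 64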
NA values for Bsmt.Qual and Garage.Qual correspond to houses that do not have a basement or a garage, respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
Discard observations with NA values for Bsmt.Qual or Garage.Qual, since the model cannot be estimated otherwise.
Recode the NA values as the category TA, since we must assume these basements or garages are typical in the absence of all other information.
Recode the NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.
nrow(subset(ames_train, is.na(Garage.Qual) | is.na(Bsmt.Qual)))
## [1] 64
We can use the information that a house does not have a basement or garage to create a separate category for NA values. This will allow us to avoid discarding observations and use the model to predict prices for out-of-sample houses that lack garages or basements.
This question refers to the following learning objective(s): Check the assumptions of a linear model.
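A minimal sketch of the recode-as-a-separate-category approach; the level name "None" and the object names are illustrative choices, not prescribed by the quiz.
ames_recode <- ames_train %>%
  mutate(Bsmt.Qual   = factor(ifelse(is.na(as.character(Bsmt.Qual)), "None", as.character(Bsmt.Qual))),
         Garage.Qual = factor(ifelse(is.na(as.character(Garage.Qual)), "None", as.character(Garage.Qual))))
lm_q6 <- lm(log(price) ~ Overall.Qual + Bsmt.Qual + Garage.Qual, data = ames_recode)
nobs(lm_q6)  # no rows are dropped for missing basement or garage quality now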
Fit a linear model with \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
lm_q7 <- lm(log(price) ~ Overall.Cond + Overall.Qual, data = ames_train)
predict_lm_q7 <- predict(lm_q7, newdata = ames_train)
ames_train$predict_lm_q7 <- exp(predict_lm_q7)
ames_train %>%
group_by(MS.SubClass) %>%
summarise(median_price = median(predict_lm_q7)) %>%
arrange(desc(median_price))
## # A tibble: 15 x 2
## MS.SubClass median_price
## <int> <dbl>
## 1 75 209072.
## 2 60 205634.
## 3 120 205634.
## 4 70 165841.
## 5 45 163113.
## 6 20 160431.
## 7 80 160431.
## 8 160 160431.
## 9 50 129385.
## 10 40 127258.
## 11 85 127258.
## 12 90 127258.
## 13 30 125165.
## 14 190 125165.
## 15 180 102631.
After fitting the model, use the predict function to find the predicted values for each observation in the data set. Then use group_by and summarise in dplyr to find the median predicted price for each subclass of dwellings.
This question refers to the following learning objective(s): Predict the value of the response variable for a given value of the explanatory variable, \(x^\star\), by plugging in \(x^\star\) in the linear model: \(\hat{y} = b_0 + b_1 x^\star\).
Which observation has the highest leverage in the model from Question 7? Leverage values can be obtained with hatvalues, hat, or lm.influence.
which.max(lm.influence(lm_q7)$hat)
## 268
## 268
First, fit the model from question 7. Then, using hatvalues or a combination of hat and lm.influence, you can extract the leverage values of the model. The higher the leverage, the more influential the observation is on the model fit.
This question refers to the following learning objective(s): Identify outliers and high leverage points in a linear model.
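Equivalently, hatvalues() returns the same leverage values directly; a small sketch:
lev <- hatvalues(lm_q7)  # leverage (hat value) for each observation
which.max(lev)           # should again point to observation 268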
What is the correct interpretation of the coefficient for Bedroom.AbvGr in a simple linear regression on Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
lm_q9 <- lm(log(price) ~ Bedroom.AbvGr, data = ames_train)
coef(lm_q9)
## (Intercept) Bedroom.AbvGr
## 11.73793756 0.09997612
In a multiple regression setting, we only hold constant all other variables included in the model. Also, since we use log(price) as our dependent variable, we interpret the coefficient as a percent increase or decrease, rather than an absolute increase or decrease.
This question refers to the following learning objective(s): Interpret the estimate for a slope (say \(b_1\)) as "All else held constant, for each unit increase in \(x_1\), we would expect \(y\) to be higher/lower on average by \(b_1\)."
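A rough sketch of the percent-change reading of this coefficient (the object name b1 is illustrative):
b1 <- coef(lm_q9)["Bedroom.AbvGr"]
# exp(b1) - 1 is roughly 0.105, i.e. each additional bedroom is associated with
# about a 10.5% higher price when no other characteristics are held constant.
(exp(b1) - 1) * 100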
In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):
n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))
plot(log(price) ~ I(X1st.Flr.SF+X2nd.Flr.SF),
data=ames_train, col=Sale.Condition,
pch=as.numeric(Sale.Condition)+15, main="Training Data")
legend(x="right", legend=levels(ames_train$Sale.Condition),
       col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
       bty="n", xpd=TRUE, inset=c(-.5,0))
Based on the plot, which of the following sale conditions exhibit behavior that differs from houses sold under normal conditions?
Family
Abnorm
Partial
Abnorm and Partial
lm_q10 <- lm(log(price) ~ Sale.Condition, data = ames_train)
summary(lm_q10)
##
## Call:
## lm(formula = log(price) ~ Sale.Condition, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28589 -0.23436 -0.02709 0.24463 1.33354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.74223 0.05016 234.091 < 2e-16 ***
## Sale.ConditionAdjLand 0.09490 0.28153 0.337 0.736
## Sale.ConditionAlloca 0.21674 0.20221 1.072 0.284
## Sale.ConditionFamily 0.11734 0.10745 1.092 0.275
## Sale.ConditionNormal 0.25361 0.05196 4.881 1.23e-06 ***
## Sale.ConditionPartial 0.75216 0.06624 11.355 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3918 on 994 degrees of freedom
## Multiple R-squared: 0.1367, Adjusted R-squared: 0.1324
## F-statistic: 31.49 on 5 and 994 DF, p-value: < 2.2e-16
A house sold under abnormal conditions often sells for much less than expected given its square footage, while a partial sale results in a higher price on average, holding square footage constant. However, one partial sale has a total square footage of over 4000, which is highly influential on the regression results.
This question refers to the following learning objective(s): Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points. List the conditions for multiple linear regression.
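To locate the influential partial sale mentioned above, you could filter on the stated criterion; this is an illustrative check, and the column selection is only for readability.
subset(ames_train,
       Sale.Condition == "Partial" & (X1st.Flr.SF + X2nd.Flr.SF) > 4000,
       select = c(price, X1st.Flr.SF, X2nd.Flr.SF, Sale.Condition))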
Because houses with non-normal sale conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices under normal sale conditions only.
Subset ames_train to include only houses sold under normal sale conditions. What percent of the original observations remain?
Normal_Sale <- subset(ames_train, Sale.Condition == "Normal")
print(paste("% of the original observations is", nrow(Normal_Sale)/1000 * 100))## [1] "% of the original observations is 83.4"
Use either dplyr or the subset command in R, setting the logical condition to Sale.Condition == "Normal".
This question refers to the following learning objective(s): Use R commands to effectively manipulate data.
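The dplyr version mentioned in the hint looks like this; a sketch equivalent to the subset() call above.
Normal_Sale <- ames_train %>%
  filter(Sale.Condition == "Normal")
mean(ames_train$Sale.Condition == "Normal") * 100  # percent of original observations retained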
Refit the model from Question 3, regressing \(\log\)(price) on \(\log\)(area), using only the houses sold under normal sale conditions. Does the adjusted \(R^2\) increase or decrease relative to the same model fit on the full training data?
lm_q12 <- lm(log(price) ~ log(area), data = Normal_Sale)
summary(lm_q12)$adj.r.squared
## [1] 0.5739801
lm_q12.1 <- lm(log(price) ~ log(area), data = ames_train)
summary(lm_q12.1)$adj.r.squared
## [1] 0.5461358
Run the model on both the full and the subsetted data. Calculate the adjusted \(R^2\) values for each model and compare.
This question refers to the following learning objective(s): Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points. Define \(R^2\) as the percentage of the variability in the response variable explained by the explanatory variable.
By Cory Torrella via Duke University Statistics with R Specialization