Complete all Exercises, and submit answers to Questions on the Coursera platform.

This second quiz will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.

First, let us load the data:

load("~/Desktop/R Programming/Statistics_Coursera/Capstone/ames_train.Rdata")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
  1. Suppose you are regressing \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
    1. \(\log\)(area)
    2. \(\log\)(Lot.Area)
    3. Bedroom.AbvGr
    4. Overall.Qual
    5. **Land.Slope**
# Keep only the variables used in the model and drop any incomplete cases.
# MASS masks dplyr's select(), so the namespace is made explicit.
data <- ames_train %>% 
  dplyr::select(price, area, Lot.Area, Bedroom.AbvGr, Overall.Qual, Land.Slope)

data.model <- data[complete.cases(data), ]

lm_q1 <- lm(log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr +
              Overall.Qual + Land.Slope, data = data.model)

# Stepwise selection with the AIC penalty (k = 2)
step_model <- stepAIC(lm_q1, trace = FALSE, k = 2)
step_model$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual + 
##     Land.Slope
## 
## Final Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual + 
##     Land.Slope
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev       AIC
## 1                        993   36.53597 -3295.458
# Stepwise selection with the BIC penalty (k = log(n))
step_BIC <- stepAIC(lm_q1, trace = FALSE, k = log(nrow(data.model)))
step_BIC$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual + 
##     Land.Slope
## 
## Final Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual
## 
## 
##           Step Df  Deviance Resid. Df Resid. Dev       AIC
## 1                                 993   36.53597 -3261.104
## 2 - Land.Slope  2 0.4466652       995   36.98263 -3262.768
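The only difference between the two runs is the per-parameter penalty k: AIC charges 2, while BIC charges log(n), a much stiffer penalty here, which is why Land.Slope survives AIC but is dropped by BIC. A minimal check of the two penalties actually used above (output not shown):

# Per-parameter penalties applied by stepAIC() in the two runs
c(AIC_penalty = 2, BIC_penalty = log(nrow(data.model)))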
  2. When regressing \(\log\)(price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
    1. The original model was misspecified, biasing our coefficient estimate for Bedroom.AbvGr
    2. Bedrooms take up proportionally less space in larger houses, which increases property valuation.
    3. Larger houses on average have more bedrooms and sell for higher prices. However, holding constant the size of a house, the number of bedrooms decreases property valuation.
    4. Since the number of bedrooms is a statistically insignificant predictor of housing price, it is unsurprising that the coefficient changes depending on which variables are included.
# Simple regression of log(price) on bedrooms alone
lm_q2 <- lm(log(price) ~ Bedroom.AbvGr, data = ames_train)
summary(lm_q2)
## 
## Call:
## lm(formula = log(price) ~ Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4815 -0.2609 -0.0455  0.2417  1.4915 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.73794    0.04582 256.188  < 2e-16 ***
## Bedroom.AbvGr  0.09998    0.01565   6.387 2.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4125 on 998 degrees of freedom
## Multiple R-squared:  0.03927,    Adjusted R-squared:  0.03831 
## F-statistic: 40.79 on 1 and 998 DF,  p-value: 2.588e-10
# Adding log(area): the Bedroom.AbvGr coefficient flips sign
lm_q2.1 <- lm(log(price) ~ Bedroom.AbvGr + log(area), data = ames_train)
summary(lm_q2.1)
## 
## Call:
## lm(formula = log(price) ~ Bedroom.AbvGr + log(area), data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10122 -0.14025  0.02724  0.16890  0.87020 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.31767    0.19786   21.82   <2e-16 ***
## Bedroom.AbvGr -0.14753    0.01196  -12.34   <2e-16 ***
## log(area)      1.12063    0.02955   37.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2641 on 997 degrees of freedom
## Multiple R-squared:  0.6066, Adjusted R-squared:  0.6058 
## F-statistic: 768.8 on 2 and 997 DF,  p-value: < 2.2e-16
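The sign flip reflects the fact that the two predictors move together: larger houses tend to have more bedrooms, so the simple regression credits bedrooms with the effect of size. A quick sketch to check that association (output not shown):

# Bedrooms and log(area) should be positively correlated
cor(ames_train$Bedroom.AbvGr, log(ames_train$area))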
  3. Run a simple linear model for \(\log\)(price), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?
    1. OldTown
    2. StoneBr
    3. GrnHill
    4. IDOTRR
lm_q3 <- lm(log(price) ~ log(area), data = ames_train)
ames_train$resid3 <- resid(lm_q3)

# Average residual by neighborhood: a large positive mean residual means the
# model systematically under-predicts prices there
ames_train %>% 
  group_by(Neighborhood) %>% 
  summarise(mean_resid = mean(resid3)) %>% 
  arrange(desc(mean_resid))
## # A tibble: 27 x 2
##    Neighborhood mean_resid
##    <fct>             <dbl>
##  1 GrnHill           0.509
##  2 StoneBr           0.379
##  3 Greens            0.378
##  4 NridgHt           0.337
##  5 Timber            0.267
##  6 Somerst           0.205
##  7 Blmngtn           0.172
##  8 Veenker           0.129
##  9 NoRidge           0.126
## 10 CollgCr           0.124
## # … with 17 more rows
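A boxplot of the residuals by neighborhood is one way to visualize the same comparison; a minimal sketch using the resid3 column created above (plot not shown):

# Residual distributions by neighborhood
ggplot(ames_train, aes(x = Neighborhood, y = resid3)) +
  geom_boxplot() +
  coord_flip()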
  4. We are interested in determining how well the model fits the data for each neighborhood. The model from Question 3 does the worst at predicting prices in which of the following neighborhoods?
    1. GrnHill
    2. BlueSte
    3. StoneBr
    4. MeadowV
# Same model as Question 3; squaring the residuals measures fit regardless of
# whether the model over- or under-predicts
lm_q4 <- lm(log(price) ~ log(area), data = ames_train)
ames_train$resid4 <- resid(lm_q4)

ames_train %>%
  group_by(Neighborhood) %>%
  summarise(mean_squared_resids = mean(resid4^2)) %>%
  arrange(desc(mean_squared_resids))
## # A tibble: 27 x 2
##    Neighborhood mean_squared_resids
##    <fct>                      <dbl>
##  1 GrnHill                    0.271
##  2 IDOTRR                     0.249
##  3 StoneBr                    0.201
##  4 OldTown                    0.200
##  5 Veenker                    0.166
##  6 NridgHt                    0.155
##  7 Greens                     0.146
##  8 MeadowV                    0.132
##  9 SWISU                      0.130
## 10 Timber                     0.110
## # … with 17 more rows
  5. Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Basement.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
    1. 0
    2. 46
    3. 64
    4. 924
# Basement quality is stored as Bsmt.Qual in the dataset
selected_vars5 <- ames_train %>%
  dplyr::select(price, Overall.Qual, Bsmt.Qual, Garage.Qual)
clean_vars5 <- selected_vars5[complete.cases(selected_vars5), ]
print(paste("Number of observations to be discarded is", 
            nrow(selected_vars5) - nrow(clean_vars5)))
## [1] "Number of observations to be discarded is 64"
  6. NA values for Basement.Qual and Garage.Qual correspond to houses that do not have a basement or a garage, respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
    1. Drop all observations with NA values for Basement.Qual or Garage.Qual since the model cannot be estimated otherwise.
    2. Recode all NA values as the category TA since we must assume these basements or garages are typical in the absence of all other information.
    3. Recode all NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.
# Houses missing Garage.Qual or Bsmt.Qual, i.e., houses without a garage
# or without a basement
nrow(subset(ames_train, is.na(Garage.Qual) | is.na(Bsmt.Qual)))
## [1] 64
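If the NA values are recoded as their own category, as the third option describes, a minimal sketch of that recoding might look like the following (the "None" label and the object names ames_recode and lm_q6 are illustrative, not from the quiz):

ames_recode <- ames_train %>%
  mutate(Bsmt.Qual = factor(ifelse(is.na(as.character(Bsmt.Qual)),
                                   "None", as.character(Bsmt.Qual))),
         Garage.Qual = factor(ifelse(is.na(as.character(Garage.Qual)),
                                     "None", as.character(Garage.Qual))))

# With the explicit "None" level, no observations need to be discarded
lm_q6 <- lm(log(price) ~ Overall.Qual + Bsmt.Qual + Garage.Qual,
            data = ames_recode)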
  7. Run a simple linear model with \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
    1. 075: 2-1/2 story houses
    2. 060: 2 story, 1946 and Newer
    3. 120: 1 story planned unit development
    4. 090: Duplexes
lm_q7 <- lm(log(price) ~ Overall.Cond + Overall.Qual, data = ames_train)
predict_lm_q7 <- predict(lm_q7, newdata = ames_train)
# Back-transform fitted values from the log scale to dollars
ames_train$predict_lm_q7 <- exp(predict_lm_q7)
ames_train %>%
  group_by(MS.SubClass) %>%
  summarise(median_price = median(predict_lm_q7)) %>%
  arrange(desc(median_price))
## # A tibble: 15 x 2
##    MS.SubClass median_price
##          <int>        <dbl>
##  1          75      209072.
##  2          60      205634.
##  3         120      205634.
##  4          70      165841.
##  5          45      163113.
##  6          20      160431.
##  7          80      160431.
##  8         160      160431.
##  9          50      129385.
## 10          40      127258.
## 11          85      127258.
## 12          90      127258.
## 13          30      125165.
## 14         190      125165.
## 15         180      102631.
  8. Using the model from Question 7, which observation has the highest leverage or influence on the regression model? Hint: use hatvalues, hat, or lm.influence.
    1. 125
    2. 268
    3. 640
    4. 832
# Hat values measure leverage; report the observation with the largest one
which.max(lm.influence(lm_q7)$hat)
## 268 
## 268
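The same answer can be cross-checked with hatvalues(), which returns the diagonal of the hat matrix directly (output not shown):

# Equivalent check using hatvalues()
which.max(hatvalues(lm_q7))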
  9. Which of the following corresponds to a correct interpretation of the coefficient \(k\) of Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
    1. Holding constant all other variables in the dataset, on average, an additional bedroom will increase housing price by \(k\) percent.
    2. Holding constant all other variables in the model, on average, an additional bedroom will increase housing price by \(k\) percent.
    3. Holding constant all other variables in the dataset, on average, an additional bedroom will increase housing price by \(k\) dollars.
    4. Holding constant all other variables in the model, on average, an additional bedroom will increase housing price by \(k\) dollars.
lm_q9 <- lm(log(price) ~ Bedroom.AbvGr, data = ames_train)
coef(lm_q9)
##   (Intercept) Bedroom.AbvGr 
##   11.73793756    0.09997612
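Because the response is on the log scale, the exact percentage effect of one additional bedroom is \((e^k - 1) \times 100\), which is approximately \(100k\) percent when \(k\) is small. A minimal check using the coefficient above:

# Exact percent change in price per additional bedroom (about 10.5% here)
(exp(coef(lm_q9)["Bedroom.AbvGr"]) - 1) * 100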

In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):

n.Sale.Condition <- length(levels(ames_train$Sale.Condition))
par(mar = c(5, 4, 4, 10))
plot(log(price) ~ I(X1st.Flr.SF + X2nd.Flr.SF), 
     data = ames_train, col = Sale.Condition,
     pch = as.numeric(Sale.Condition) + 15, main = "Training Data")
legend("right", legend = levels(ames_train$Sale.Condition),
       col = 1:n.Sale.Condition, pch = 15 + (1:n.Sale.Condition),
       bty = "n", xpd = TRUE, inset = c(-.5, 0))

  10. Which of the following sale condition categories shows significant differences from the normal selling condition?
    1. Family
    2. Abnorm
    3. Partial
    4. Abnorm and Partial
# Note: the omitted (reference) level here is abnormal sales, so each
# coefficient below is a contrast against Abnorml rather than against Normal
lm_q10 <- lm(log(price) ~ Sale.Condition, data = ames_train)
summary(lm_q10)
## 
## Call:
## lm(formula = log(price) ~ Sale.Condition, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.28589 -0.23436 -0.02709  0.24463  1.33354 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           11.74223    0.05016 234.091  < 2e-16 ***
## Sale.ConditionAdjLand  0.09490    0.28153   0.337    0.736    
## Sale.ConditionAlloca   0.21674    0.20221   1.072    0.284    
## Sale.ConditionFamily   0.11734    0.10745   1.092    0.275    
## Sale.ConditionNormal   0.25361    0.05196   4.881 1.23e-06 ***
## Sale.ConditionPartial  0.75216    0.06624  11.355  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3918 on 994 degrees of freedom
## Multiple R-squared:  0.1367, Adjusted R-squared:  0.1324 
## F-statistic: 31.49 on 5 and 994 DF,  p-value: < 2.2e-16
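To read off differences from the normal condition directly, one option is to refit with Normal as the reference level, so every coefficient becomes a contrast against normal sales; a sketch (output not shown; the object name lm_q10_norm is illustrative):

# Refit with "Normal" as the baseline category
lm_q10_norm <- lm(log(price) ~ relevel(Sale.Condition, ref = "Normal"),
                  data = ames_train)
summary(lm_q10_norm)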

Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices under normal sale conditions only.

  11. Subset ames_train to only include houses sold under normal sale conditions. What percent of the original observations remain?
    1. 81.2%
    2. 83.4%
    3. 87.7%
    4. 91.8%
# Keep only houses sold under normal sale conditions
Normal_Sale <- subset(ames_train, Sale.Condition == "Normal")

print(paste("% of the original observations is",
            nrow(Normal_Sale) / nrow(ames_train) * 100))
## [1] "% of the original observations is 83.4"
  12. Now re-run the simple model from Question 3 on the subsetted data. True or False: Modeling only the normal sales results in a better model fit than modeling all sales (in terms of \(R^2\)).
    1. True, restricting the model to only include observations with normal sale conditions increases the \(R^2\) from 0.547 to 0.575.
    2. True, restricting the model to only include observations with normal sale conditions increases the \(R^2\) from 0.575 to 0.603.
    3. False, restricting the model to only include observations with normal sale conditions decreases the \(R^2\) from 0.575 to 0.547.
    4. False, restricting the model to only include observations with normal sale conditions decreases the \(R^2\) from 0.603 to 0.575.
# Fit on normal sales only
lm_q12 <- lm(log(price) ~ log(area), data = Normal_Sale)
summary(lm_q12)$adj.r.squared
## [1] 0.5739801
# Fit on all sales
lm_q12.1 <- lm(log(price) ~ log(area), data = ames_train)
summary(lm_q12.1)$adj.r.squared
## [1] 0.5461358
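The values reported above are adjusted \(R^2\); the question asks about \(R^2\) itself. With a single predictor the two are nearly identical, but the unadjusted values can be extracted the same way (output not shown):

# Unadjusted R-squared for the normal-sales-only and all-sales models
summary(lm_q12)$r.squared
summary(lm_q12.1)$r.squared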