This second lab will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.

First, let us load the data:

load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(plotly)
library(devtools)
library(statsr)
library(broom)
library(BAS)
  1. Suppose you are regressing \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
    1. \(\log\)(area)
    2. \(\log\)(Lot.Area)
    3. Bedroom.AbvGr
    4. Overall.Qual
    5. Land.Slope
# type your code for Question 1 here, and Knit

# Exclude observations with missing values in the data set

q1_df <- ames_train %>% mutate(lprice=log(price),
                               larea=log(area),
                               llotarea=log(Lot.Area)) %>%
  select(c(lprice,larea,llotarea,Bedroom.AbvGr,Overall.Qual,Land.Slope))

q1_df_no_na <- na.omit(q1_df)

# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_larea_aic <- bas.lm(larea ~ ., data = q1_df_no_na,
                   prior = "AIC", 
                   modelprior = uniform())

# Print out the marginal posterior inclusion probabilities for each variable                
bma_larea_aic
## 
## Call:
## bas.lm(formula = larea ~ ., data = q1_df_no_na, prior = "AIC", 
##     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##     Intercept         lprice       llotarea  Bedroom.AbvGr   Overall.Qual  
##        1.0000         1.0000         0.8796         1.0000         1.0000  
## Land.SlopeMod  Land.SlopeSev  
##        0.4439         0.3340
# Top 5 most probable models under AIC prior
summary(bma_larea_aic)
##               P(B != 0 | Y)    model 1      model 2       model 3       model 4
## Intercept         1.0000000     1.0000     1.000000     1.0000000     1.0000000
## lprice            1.0000000     1.0000     1.000000     1.0000000     1.0000000
## llotarea          0.8796498     1.0000     1.000000     1.0000000     1.0000000
## Bedroom.AbvGr     1.0000000     1.0000     1.000000     1.0000000     1.0000000
## Overall.Qual      1.0000000     1.0000     1.000000     1.0000000     1.0000000
## Land.SlopeMod     0.4438755     0.0000     1.000000     0.0000000     1.0000000
## Land.SlopeSev     0.3339825     0.0000     0.000000     1.0000000     1.0000000
## BF                       NA     1.0000     0.769835     0.4632905     0.3702608
## PostProbs                NA     0.3379     0.260100     0.1565000     0.1251000
## R2                       NA     0.7244     0.724900     0.7246000     0.7250000
## dim                      NA     5.0000     6.000000     6.0000000     7.0000000
## logmarg                  NA -1727.3470 -1727.608610 -1728.1164318 -1728.3405783
##                     model 5
## Intercept         1.0000000
## lprice            1.0000000
## llotarea          0.0000000
## Bedroom.AbvGr     1.0000000
## Overall.Qual      1.0000000
## Land.SlopeMod     0.0000000
## Land.SlopeSev     0.0000000
## BF                0.1042928
## PostProbs         0.0352000
## R2                0.7226000
## dim               4.0000000
## logmarg       -1729.6075835
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_larea_bic <- bas.lm(larea ~ ., data = q1_df_no_na,
                   prior = "BIC", 
                   modelprior = uniform())

# Print out the marginal posterior inclusion probabilities for each variable                
bma_larea_bic
## 
## Call:
## bas.lm(formula = larea ~ ., data = q1_df_no_na, prior = "BIC", 
##     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##     Intercept         lprice       llotarea  Bedroom.AbvGr   Overall.Qual  
##       1.00000        1.00000        0.44280        1.00000        1.00000  
## Land.SlopeMod  Land.SlopeSev  
##       0.06889        0.05090
# Top 5 most probable models under BIC prior
summary(bma_larea_bic)
##               P(B != 0 | Y)    model 1       model 2       model 3
## Intercept        1.00000000     1.0000     1.0000000  1.000000e+00
## lprice           1.00000000     1.0000     1.0000000  1.000000e+00
## llotarea         0.44280298     0.0000     1.0000000  0.000000e+00
## Bedroom.AbvGr    1.00000000     1.0000     1.0000000  1.000000e+00
## Overall.Qual     0.99999973     1.0000     1.0000000  1.000000e+00
## Land.SlopeMod    0.06888948     0.0000     0.0000000  1.000000e+00
## Land.SlopeSev    0.05090072     0.0000     0.0000000  0.000000e+00
## BF                       NA     1.0000     0.8242141  7.994749e-02
## PostProbs                NA     0.4846     0.3994000  3.870000e-02
## R2                       NA     0.7226     0.7244000  7.232000e-01
## dim                      NA     4.0000     5.0000000  5.000000e+00
## logmarg                  NA -1739.4231 -1739.6164190 -1.741949e+03
##                     model 4       model 5
## Intercept      1.000000e+00  1.000000e+00
## lprice         1.000000e+00  1.000000e+00
## llotarea       0.000000e+00  1.000000e+00
## Bedroom.AbvGr  1.000000e+00  1.000000e+00
## Overall.Qual   1.000000e+00  1.000000e+00
## Land.SlopeMod  0.000000e+00  1.000000e+00
## Land.SlopeSev  1.000000e+00  0.000000e+00
## BF             6.454159e-02  5.454214e-02
## PostProbs      3.130000e-02  2.640000e-02
## R2             7.230000e-01  7.249000e-01
## dim            5.000000e+00  6.000000e+00
## logmarg       -1.742164e+03 -1.742332e+03
  1. When regressing \(\log\)(price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
    1. The original model was misspecified, biasing our coefficient estimate for Bedroom.AbvGr
    2. Bedrooms take up proportionally less space in larger houses, which increases property valuation.
    3. Larger houses on average have more bedrooms and sell for higher prices. However, holding constant the size of a house, the number of bedrooms decreases property valuation.
    4. Since the number of bedrooms is a statistically insignificant predictor of housing price, it is unsurprising that the coefficient changes depending on which variables are included.
# type your code for Question 2 here, and Knit
  1. Run a simple linear model for \(\log\)(price), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?
    1. OldTown
    2. StoneBr
    3. GrnHill
    4. IDOTRR
# type your code for Question 3 here, and Knit

ames_train_q3 <- ames_train %>% mutate(lprice=log(price),
                               larea=log(area),
                               llotarea=log(Lot.Area))

q3_model <- lm(lprice~larea,data=ames_train_q3)
ames_train_q3$residual <- residuals(q3_model)

neighborhood_summary <- ames_train_q3 %>% group_by(Neighborhood) %>% summarise(mean_res=mean(residual))
  1. We are interested in determining how well the model fits the data for each neighborhood. The model from Question 3 does the worst at predicting prices in which of the following neighborhoods?
    1. GrnHill
    2. BlueSte
    3. StoneBr
    4. MeadowV
# type your code for Question 4 here, and Knit

ames_train_q4 <- ames_train_q3 %>% mutate(residual_sq=residual^2)

neighborhood_summary <- ames_train_q4 %>% group_by(Neighborhood) %>%
  summarise(mean_res_sq=mean(residual_sq))
  1. Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Bsmt.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
    1. 0
    2. 46
    3. 64
    4. 924
# type your code for Question 5 here, and Knit

q5_model <- lm(lprice~Overall.Qual+Bsmt.Qual+Garage.Qual,data=ames_train_q3)
summary(q5_model)
## 
## Call:
## lm(formula = lprice ~ Overall.Qual + Bsmt.Qual + Garage.Qual, 
##     data = ames_train_q3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48987 -0.11622  0.00867  0.12462  0.98890 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.581326   0.309310  34.209  < 2e-16 ***
## Overall.Qual   0.187257   0.007433  25.193  < 2e-16 ***
## Bsmt.QualEx    0.600795   0.221856   2.708  0.00689 ** 
## Bsmt.QualFa    0.181917   0.222964   0.816  0.41476    
## Bsmt.QualGd    0.406820   0.219427   1.854  0.06406 .  
## Bsmt.QualPo    0.519545   0.345271   1.505  0.13273    
## Bsmt.QualTA    0.295413   0.218768   1.350  0.17724    
## Garage.QualEx -0.113621   0.309190  -0.367  0.71335    
## Garage.QualFa -0.191551   0.221902  -0.863  0.38824    
## Garage.QualGd  0.089348   0.234180   0.382  0.70290    
## Garage.QualPo -0.548931   0.267922  -2.049  0.04076 *  
## Garage.QualTA -0.053152   0.218874  -0.243  0.80818    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2182 on 924 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.7115, Adjusted R-squared:  0.708 
## F-statistic: 207.1 on 11 and 924 DF,  p-value: < 2.2e-16
  1. NA values for Basement.Qual and Garage.Qual correspond to houses that do not have a basement or a garage respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
    1. Drop all observations with NA values for Basement.Qual or Garage.Qual since the model cannot be estimated otherwise.
    2. Recode all NA values as the category TA since we must assume these basements or garages are typical in the absence of all other information.
    3. Recode all NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.
# type your code for Question 6 here, and Knit
  1. Run a simple linear model with \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
    1. 075: 2-1/2 story houses
    2. 060: 2 story, 1946 and Newer
    3. 120: 1 story planned unit development
    4. 090: Duplexes
# type your code for Question 7 here, and Knit

ames_train_q7 <- ames_train %>% mutate(lprice=log(price))

q7_model <- lm(lprice~Overall.Cond+Overall.Qual,data=ames_train_q7)
pred_values <- predict(q7_model,newdata=ames_train_q7,
                                estimator="BMA",
                                type="response")

ames_train_q7$pred_values <- pred_values

q7_summary <- ames_train_q7 %>% group_by(as.factor(MS.SubClass)) %>% 
  summarise(med_pred_price=median(exp(pred_values)))
  1. Using the model from Question 7, which observation has the highest leverage or influence on the regression model? Hint: use hatvalues, hat or lm.influence.
    1. 125
    2. 268
    3. 640
    4. 832
# type your code for Question 8 here, and Knit

inf_vector <- as.data.frame(lm.influence(q7_model))

answer <- inf_vector[inf_vector$hat==max(inf_vector$hat),]
  1. Which of the following corresponds to a correct interpretation of the coefficient \(k\) of Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
    1. Holding constant all other variables in the dataset, on average, an additional bedroom will increase housing price by \(k\) percent.
    2. Holding constant all other variables in the model, on average, an additional bedroom will increase housing price by \(k\) percent.
    3. Holding constant all other variables in the dataset, on average, an additional bedroom will increase housing price by \(k\) dollars.
    4. Holding constant all other variables in the model, on average, an additional bedroom will increase housing price by \(k\) dollars.
# type your code for Question 9 here, and Knit

In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):

n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))
plot(log(price) ~ I(X1st.Flr.SF+X2nd.Flr.SF), 
     data=ames_train, col=Sale.Condition,
     pch=as.numeric(Sale.Condition)+15, main="Training Data")
legend(x=,"right", legend=levels(ames_train$Sale.Condition),
       col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
       bty="n", xpd=TRUE, inset=c(-.5,0))

  1. Which of the following sale condition categories shows significant differences from the normal selling condition?
    1. Family
    2. Abnorm
    3. Partial
    4. Abnorm and Partial
# type your code for Question 10 here, and Knit

Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to only model housing prices under only normal sale conditions.

  1. Subset ames_train to only include houses sold under normal sale conditions. What percent of the original observations remain?
    1. 81.2%
    2. 83.4%
    3. 87.7%
    4. 91.8%
# type your code for Question 11 here, and Knit
  1. Now re-run the simple model from question 3 on the subsetted data. True or False: Modeling only the normal sales results in a better model fit than modeling all sales (in terms of \(R^2\)).
    1. True, restricting the model to only include observations with normal sale conditions increases the \(R^2\) from 0.547 to 0.575.
    2. True, restricting the model to only include observations with normal sale conditions increases the \(R^2\) from 0.575 to 0.603.
    3. False, restricting the model to only include observations with normal sale conditions decreases the \(R^2\) from 0.575 to 0.547.
    4. False, restricting the model to only include observations with normal sale conditions decreases the \(R^2\) from 0.603 to 0.575.
# type your code for Question 12 here, and Knit