Statistics with R Capstone Project: Lab II - Model Selection & Diagnostics

This second lab will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.

First, let us load the data:

load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(plotly)
library(devtools)
library(statsr)
library(broom)
library(BAS)

Suppose you are regressing \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
1. \(\log\)(area)
2. \(\log\)(Lot.Area)
3. Bedroom.AbvGr
4. Overall.Qual
5. Land.Slope

# type your code for Question 1 here, and Knit

# Exclude observations with missing values in the data set

q1_df <- ames_train %>% mutate(lprice=log(price),
                               larea=log(area),
                               llotarea=log(Lot.Area)) %>%
  select(c(lprice,larea,llotarea,Bedroom.AbvGr,Overall.Qual,Land.Slope))

q1_df_no_na <- na.omit(q1_df)

# Look at AIC with k=2...

model_aic <- lm(lprice ~ ., data = q1_df_no_na)
model_aic_step <- stepAIC(model_aic, trace = FALSE, 
                          direction = "backward",k=2)
model_aic_step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## lprice ~ larea + llotarea + Bedroom.AbvGr + Overall.Qual + Land.Slope
## 
## Final Model:
## lprice ~ larea + llotarea + Bedroom.AbvGr + Overall.Qual + Land.Slope
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev       AIC
## 1                        993   36.53597 -3295.458

# Look at BIC with k=log(n)...

model_bic <- lm(lprice ~ ., data = q1_df_no_na)
model_bic_step <- stepAIC(model_bic, trace = FALSE,
                          direction = "backward",k=log(nrow(q1_df_no_na)))
model_bic_step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## lprice ~ larea + llotarea + Bedroom.AbvGr + Overall.Qual + Land.Slope
## 
## Final Model:
## lprice ~ larea + llotarea + Bedroom.AbvGr + Overall.Qual
## 
## 
##           Step Df  Deviance Resid. Df Resid. Dev       AIC
## 1                                 993   36.53597 -3261.104
## 2 - Land.Slope  2 0.4466652       995   36.98263 -3262.768
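
The two tables above can be reconciled by hand. The sketch below assumes that stepAIC scores a linear model as n * log(RSS / n) + k * edf, which is how extractAIC treats lm fits, and plugs in the residual deviances printed above. The sample size and parameter count are inferred from the output (993 residual degrees of freedom plus 7 coefficients) rather than read from the data, and the helper name crit is made up for this illustration.

```r
# Values copied from the stepAIC output above; n and p_full are inferred from it.
n        <- 1000      # 993 residual df + 7 estimated coefficients
p_full   <- 7         # intercept, 4 numeric slopes, 2 Land.Slope dummies
rss_full <- 36.53597  # residual deviance with Land.Slope
rss_drop <- 36.98263  # residual deviance without Land.Slope

# Criterion assumed for lm fits: n * log(RSS / n) + k * edf
crit <- function(rss, edf, k) n * log(rss / n) + k * edf

aic_full <- crit(rss_full, p_full,     k = 2)
aic_drop <- crit(rss_drop, p_full - 2, k = 2)
bic_full <- crit(rss_full, p_full,     k = log(n))
bic_drop <- crit(rss_drop, p_full - 2, k = log(n))

round(c(aic_full = aic_full, aic_drop = aic_drop,
        bic_full = bic_full, bic_drop = bic_drop), 3)
```

Dropping Land.Slope worsens the fit term by about 12.2 and saves two parameters, which are worth 4 under AIC but about 13.8 under BIC. That is why only the BIC path removes it. Note that the column labelled AIC in the second table holds the BIC-penalized value, because stepAIC keeps the same label whatever k is.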

When regressing \(\log\)(price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
1. The original model was misspecified, biasing our coefficient estimate for Bedroom.AbvGr
2. Bedrooms take up proportionally less space in larger houses, which increases property valuation.
3. Larger houses on average have more bedrooms and sell for higher prices. However, holding constant the size of a house, the number of bedrooms decreases property valuation.
4. Since the number of bedrooms is a statistically insignificant predictor of housing price, it is unsurprising that the coefficient changes depending on which variables are included.

# type your code for Question 2 here, and Knit
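
A sign change of this kind can be reproduced on simulated data. The sketch below does not use ames_train: every number in it is made up, and the variable names larea, bedrooms, and lprice are stand-ins chosen for the illustration. Bedrooms are generated to increase with house size, while the true effect of a bedroom at a fixed size is set to be negative.

```r
# Simulated data only: none of these numbers come from ames_train.
set.seed(1)
n        <- 1000
larea    <- rnorm(n, mean = 7.3, sd = 0.3)                         # log living area
bedrooms <- round(3 + 3 * (larea - 7.3) + rnorm(n, sd = 0.5))      # grows with size
lprice   <- 5 + 1.0 * larea - 0.1 * bedrooms + rnorm(n, sd = 0.1)  # log price

fit_bed      <- lm(lprice ~ bedrooms)          # size omitted
fit_bed_area <- lm(lprice ~ bedrooms + larea)  # size held constant

coef(fit_bed)["bedrooms"]       # positive: bedrooms act as a proxy for size
coef(fit_bed_area)["bedrooms"]  # negative: effect of a bedroom at a fixed size
```

The same two-model comparison could be run on ames_train with log(price), Bedroom.AbvGr, and log(area) to see whether the pattern matches.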

Run a simple linear model for \(\log\)(price), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?
1. OldTown
2. StoneBr
3. GrnHill
4. IDOTRR

# type your code for Question 3 here, and Knit

ames_train_q3 <- ames_train %>% mutate(lprice=log(price),
                               larea=log(area),
                               llotarea=log(Lot.Area))

q3_model <- lm(lprice~larea,data=ames_train_q3)
summary(q3_model)

## 
## Call:
## lm(formula = lprice ~ larea, data = ames_train_q3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08526 -0.14145  0.02048  0.17002  0.84536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.34441    0.19262   27.75   <2e-16 ***
## larea        0.92167    0.02657   34.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2834 on 998 degrees of freedom
## Multiple R-squared:  0.5466, Adjusted R-squared:  0.5461 
## F-statistic:  1203 on 1 and 998 DF,  p-value: < 2.2e-16

ames_train_q3$residual <- residuals(q3_model)

neighborhood_summary <- ames_train_q3 %>% group_by(Neighborhood) %>% summarise(mean_res=mean(residual))

We are interested in determining how well the model fits the data for each neighborhood. The model from Question 3 does the worst at predicting prices in which of the following neighborhoods?
1. GrnHill
2. BlueSte
3. StoneBr
4. MeadowV

# type your code for Question 4 here, and Knit

ames_train_q4 <- ames_train_q3 %>% mutate(residual_sq=residual^2)

neighborhood_summary <- ames_train_q4 %>% group_by(Neighborhood) %>%
  summarise(mean_res_sq=mean(residual_sq))
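
Questions 3 and 4 ask for different things: the mean residual shows whether a neighborhood is systematically under- or over-predicted, while the mean squared residual shows how far off the predictions are regardless of sign. The sketch below illustrates the distinction on simulated data with hypothetical neighborhoods A, B, and C. It uses base R (tapply) rather than dplyr so that it runs without extra packages.

```r
# Simulated stand-in for ames_train; the labels "A", "B", "C" are hypothetical.
set.seed(2)
n   <- 300
dat <- data.frame(
  nbhd  = rep(c("A", "B", "C"), each = n / 3),
  larea = rnorm(n, mean = 7.3, sd = 0.3)
)
shift <- c(A = 0, B = 0.3, C = -0.3)   # neighborhood premium or discount
noise <- c(A = 0.1, B = 0.1, C = 0.4)  # neighborhood-specific scatter
dat$lprice <- 5 + 0.9 * dat$larea + shift[dat$nbhd] +
  rnorm(n, sd = noise[dat$nbhd])

fit <- lm(lprice ~ larea, data = dat)
dat$residual <- residuals(fit)

# Mean residual per group: direction of the systematic error
mean_res <- tapply(dat$residual, dat$nbhd, mean)
# Root mean squared residual per group: size of the error, ignoring sign
rmse <- sqrt(tapply(dat$residual^2, dat$nbhd, mean))

sort(mean_res, decreasing = TRUE)
sort(rmse, decreasing = TRUE)
```

In the lab itself, sorting the neighborhood_summary data frames with arrange(desc(...)) gives the same kind of ranking.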

Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Bsmt.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
1. 0
2. 46
3. 64
4. 924

# type your code for Question 5 here, and Knit

q5_model <- lm(lprice~Overall.Qual+Bsmt.Qual+Garage.Qual,data=ames_train_q3)
summary(q5_model)

## 
## Call:
## lm(formula = lprice ~ Overall.Qual + Bsmt.Qual + Garage.Qual, 
##     data = ames_train_q3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48987 -0.11622  0.00867  0.12462  0.98890 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.581326   0.309310  34.209  < 2e-16 ***
## Overall.Qual   0.187257   0.007433  25.193  < 2e-16 ***
## Bsmt.QualEx    0.600795   0.221856   2.708  0.00689 ** 
## Bsmt.QualFa    0.181917   0.222964   0.816  0.41476    
## Bsmt.QualGd    0.406820   0.219427   1.854  0.06406 .  
## Bsmt.QualPo    0.519545   0.345271   1.505  0.13273    
## Bsmt.QualTA    0.295413   0.218768   1.350  0.17724    
## Garage.QualEx -0.113621   0.309190  -0.367  0.71335    
## Garage.QualFa -0.191551   0.221902  -0.863  0.38824    
## Garage.QualGd  0.089348   0.234180   0.382  0.70290    
## Garage.QualPo -0.548931   0.267922  -2.049  0.04076 *  
## Garage.QualTA -0.053152   0.218874  -0.243  0.80818    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2182 on 924 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.7115, Adjusted R-squared:  0.708 
## F-statistic: 207.1 on 11 and 924 DF,  p-value: < 2.2e-16
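
The count of deleted observations reported by summary() can be cross-checked directly. The sketch below uses a small made-up data frame (the names toy, q1, q2, and q3 are hypothetical) and counts incomplete rows in two ways: with complete.cases() before fitting, and with nobs() after fitting.

```r
# Toy data frame with hypothetical values; NA appears in two of the predictors.
toy <- data.frame(
  y  = c(11.2, 11.9, 12.4, 12.1, 11.7, 12.8, 11.5, 12.0, 12.6, 11.8),
  q1 = c(4, 6, 7, 6, 5, 8, 5, 6, 7, 5),
  q2 = c("Gd", NA, "TA", "Gd", "TA", "Gd", "TA", NA, "Gd", "TA"),
  q3 = c("TA", "TA", NA, "TA", "Gd", "TA", "TA", NA, "Gd", "TA")
)

# Rows with at least one NA among the model variables are dropped by lm().
n_dropped <- sum(!complete.cases(toy[, c("y", "q1", "q2", "q3")]))
n_dropped

# Same count, recovered from the fitted model
fit <- lm(y ~ q1 + q2 + q3, data = toy)
nrow(toy) - nobs(fit)
```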

NA values for Bsmt.Qual and Garage.Qual correspond to houses that do not have a basement or a garage, respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
1. Drop all observations with NA values for Bsmt.Qual or Garage.Qual since the model cannot be estimated otherwise.
2. Recode all NA values as the category TA since we must assume these basements or garages are typical in the absence of all other information.
3. Recode all NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.

# type your code for Question 6 here, and Knit
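
Recoding NA as its own factor level can be sketched in base R as follows. The vector garage_qual and the label "None" are hypothetical choices for the illustration. The same steps would apply to Bsmt.Qual and Garage.Qual in ames_train.

```r
# Hypothetical quality ratings; NA means the house has no garage.
garage_qual <- factor(c("TA", "Gd", NA, "Fa", NA, "TA"))

# Convert to character, label the missing entries, and rebuild the factor.
# The label "None" is an arbitrary choice for this sketch.
recoded <- as.character(garage_qual)
recoded[is.na(recoded)] <- "None"
recoded <- factor(recoded)

table(recoded, useNA = "ifany")
```

With the missing entries labelled, lm() keeps those rows and estimates a separate coefficient for the new level instead of deleting the observations.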

Run a simple linear model with \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
1. 075: 2-1/2 story houses
2. 060: 2 story, 1946 and Newer
3. 120: 1 story planned unit development
4. 090: Duplexes

# type your code for Question 7 here, and Knit

ames_train_q7 <- ames_train %>% mutate(lprice=log(price))

q7_model <- lm(lprice~Overall.Cond+Overall.Qual,data=ames_train_q7)
# predict() on an lm fit takes no estimator argument (that option is for BAS models)
pred_values <- predict(q7_model,newdata=ames_train_q7,
                                type="response")

ames_train_q7$pred_values <- pred_values

q7_summary <- ames_train_q7 %>% group_by(as.factor(MS.SubClass)) %>% 
  summarise(med_pred_price=median(exp(pred_values)))

Using the model from Question 7, which observation has the highest leverage or influence on the regression model? Hint: use hatvalues, hat or lm.influence.
1. 125
2. 268
3. 640
4. 832

# type your code for Question 8 here, and Knit

inf_vector <- as.data.frame(lm.influence(q7_model))

answer <- inf_vector[inf_vector$hat==max(inf_vector$hat),]
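
The hint's hatvalues function gives the leverages directly, without building a data frame from lm.influence. The sketch below uses simulated ratings rather than ames_train, and row 50 is deliberately given an unusual predictor combination so that it stands out. The column names cond and qual are stand-ins.

```r
# Simulated ratings (not ames_train); row 50 is given an unusual combination.
set.seed(3)
n   <- 100
dat <- data.frame(
  cond = sample(4:7, n, replace = TRUE),
  qual = sample(4:7, n, replace = TRUE)
)
dat[50, c("cond", "qual")] <- c(1, 10)  # far from the bulk of the predictors
dat$lprice <- 10 + 0.05 * dat$cond + 0.2 * dat$qual + rnorm(n, sd = 0.2)

fit <- lm(lprice ~ cond + qual, data = dat)

lev <- hatvalues(fit)  # diagonal of the hat matrix, one value per row
which.max(lev)         # row with the highest leverage
```

Leverage depends only on the predictors, not on the response, so a point can have high leverage without being poorly fit.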

Which of the following corresponds to a correct interpretation of the coefficient \(k\) of Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
1. Holding constant all other variables in the dataset, on average, an additional bedroom will increase housing price by \(k\) percent.
2. Holding constant all other variables in the model, on average, an additional bedroom will increase housing price by \(k\) percent.
3. Holding constant all other variables in the dataset, on average, an additional bedroom will increase housing price by \(k\) dollars.
4. Holding constant all other variables in the model, on average, an additional bedroom will increase housing price by \(k\) dollars.

# type your code for Question 9 here, and Knit
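
Reading a coefficient in a log(price) model as a percent change relies on an approximation that holds when the coefficient is small. The sketch below uses a hypothetical value of k, not an estimate from ames_train, and compares the exact percent change with the approximation.

```r
# Hypothetical coefficient on Bedroom.AbvGr in a log(price) model.
k <- 0.05

# One more bedroom multiplies the predicted price by exp(k),
# so the exact percent change is 100 * (exp(k) - 1).
exact_pct  <- 100 * (exp(k) - 1)
approx_pct <- 100 * k  # first-order approximation, close when |k| is small

c(exact = exact_pct, approx = approx_pct)
```

The two figures drift apart as the absolute value of k grows, so the exact form is the safer one for large coefficients.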

In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):

n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))
plot(log(price) ~ I(X1st.Flr.SF+X2nd.Flr.SF), 
     data=ames_train, col=Sale.Condition,
     pch=as.numeric(Sale.Condition)+15, main="Training Data")
legend("right", legend=levels(ames_train$Sale.Condition),
       col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
       bty="n", xpd=TRUE, inset=c(-.5,0))

Which of the following sale condition categories shows significant differences from the normal selling condition?
1. Family
2. Abnorm
3. Partial
4. Abnorm and Partial

# type your code for Question 10 here, and Knit

Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices under normal sale conditions only.

Subset ames_train to only include houses sold under normal sale conditions. What percent of the original observations remain?
1. 81.2%
2. 83.4%
3. 87.7%
4. 91.8%

# type your code for Question 11 here, and Knit
ames_train_normal <- ames_train %>% filter(Sale.Condition=="Normal") %>%
  mutate(lprice=log(price),larea=log(area))
nrow(ames_train_normal)/nrow(ames_train)

## [1] 0.834

Now re-run the simple model from Question 3 on the subsetted data. True or False: Modeling only the normal sales results in a better model fit than modeling all sales (in terms of \(R^2\)).
1. True, restricting the model to only include observations with normal sale conditions increases the \(R^2\) from 0.547 to 0.575.
2. True, restricting the model to only include observations with normal sale conditions increases the \(R^2\) from 0.575 to 0.603.
3. False, restricting the model to only include observations with normal sale conditions decreases the \(R^2\) from 0.575 to 0.547.
4. False, restricting the model to only include observations with normal sale conditions decreases the \(R^2\) from 0.603 to 0.575.

# type your code for Question 12 here, and Knit

q12_model <- lm(lprice~larea,data=ames_train_normal)

summary(q12_model)

## 
## Call:
## lm(formula = lprice ~ larea, data = ames_train_normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36369 -0.12269  0.02005  0.14587  0.82373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.73138    0.18711   30.63   <2e-16 ***
## larea        0.86716    0.02587   33.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2493 on 832 degrees of freedom
## Multiple R-squared:  0.5745, Adjusted R-squared:  0.574 
## F-statistic:  1123 on 1 and 832 DF,  p-value: < 2.2e-16

Ken Wood

6/19/21