This second lab will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.
First, let us load the data:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(plotly)
library(devtools)
library(statsr)
library(broom)
library(BAS)
Fit a linear model of \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
\(\log\)(area)
\(\log\)(Lot.Area)
Bedroom.AbvGr
Overall.Qual
Land.Slope
# type your code for Question 1 here, and Knit
# Exclude observations with missing values in the data set
q1_df <- ames_train %>% mutate(lprice=log(price),
larea=log(area),
llotarea=log(Lot.Area)) %>%
select(c(lprice,larea,llotarea,Bedroom.AbvGr,Overall.Qual,Land.Slope))
q1_df_no_na <- na.omit(q1_df)
# Look at AIC with k=2...
model_aic <- lm(lprice ~ ., data = q1_df_no_na)
model_aic_step <- stepAIC(model_aic, trace = FALSE,
direction = "backward",k=2)
model_aic_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## lprice ~ larea + llotarea + Bedroom.AbvGr + Overall.Qual + Land.Slope
##
## Final Model:
## lprice ~ larea + llotarea + Bedroom.AbvGr + Overall.Qual + Land.Slope
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 993 36.53597 -3295.458
# Look at BIC with k=log(n)...
model_bic <- lm(lprice ~ ., data = q1_df_no_na)
model_bic_step <- stepAIC(model_bic, trace = FALSE,
direction = "backward",k=log(nrow(q1_df_no_na)))
model_bic_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## lprice ~ larea + llotarea + Bedroom.AbvGr + Overall.Qual + Land.Slope
##
## Final Model:
## lprice ~ larea + llotarea + Bedroom.AbvGr + Overall.Qual
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 993 36.53597 -3261.104
## 2 - Land.Slope 2 0.4466652 995 36.98263 -3262.768
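The two runs above differ only in the penalty multiplier k passed to stepAIC: k = 2 gives AIC and k = log(n) gives BIC. The following is a minimal sketch of that relationship. It uses the built-in mtcars data rather than ames_train, so the model formula and every variable name in it are illustrative assumptions, not part of the lab.

```r
library(MASS)

# Illustrative model on a built-in data set (not the lab's ames_train data)
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
n <- nrow(mtcars)

# extractAIC() returns c(edf, criterion); for lm the criterion is
# n * log(RSS / n) + k * edf
aic_full <- extractAIC(fit, k = 2)
bic_full <- extractAIC(fit, k = log(n))

# The two criteria differ only by the penalty term (log(n) - 2) * edf
penalty_gap <- bic_full[2] - aic_full[2]

# Backward selection under each penalty
aic_step <- stepAIC(fit, direction = "backward", k = 2, trace = FALSE)
bic_step <- stepAIC(fit, direction = "backward", k = log(n), trace = FALSE)

# Terms kept under AIC but dropped under BIC (may be empty)
aic_only <- setdiff(attr(terms(aic_step), "term.labels"),
                    attr(terms(bic_step), "term.labels"))
print(aic_only)
```

Because log(n) exceeds 2 once n is at least 8, BIC charges more per parameter than AIC and tends to retain fewer terms.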
When regressing \(\log\)(price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
Bedroom.AbvGr
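One mechanism that can produce such a sign change is confounding by house size: larger houses tend to have more bedrooms and higher prices. The sketch below reproduces the effect on simulated data. All numeric values are assumptions chosen for illustration only, and nothing here is estimated from ames_train.

```r
set.seed(42)
n <- 1000

# Simulated data (assumed values, chosen only to illustrate the effect)
larea <- rnorm(n, mean = 7.2, sd = 0.3)                        # log living area
bedrooms <- round(3 + 3 * (larea - 7.2) + rnorm(n, sd = 0.5))  # more area, more bedrooms
lprice <- 5 + 1.0 * larea - 0.1 * bedrooms + rnorm(n, sd = 0.2)

# Bedrooms alone: the coefficient absorbs the effect of house size
coef_alone <- coef(lm(lprice ~ bedrooms))["bedrooms"]

# Holding log(area) fixed: extra bedrooms are associated with a lower price
coef_adjusted <- coef(lm(lprice ~ bedrooms + larea))["bedrooms"]

print(c(alone = unname(coef_alone), adjusted = unname(coef_adjusted)))
```

In this simulation the bedroom coefficient is positive on its own and negative once log area is in the model, which mirrors the pattern described in the question.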
# type your code for Question 2 here, and Knit
Fit a linear model of \(\log\)(price), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?
OldTown
StoneBr
GrnHill
IDOTRR
# type your code for Question 3 here, and Knit
ames_train_q3 <- ames_train %>% mutate(lprice=log(price),
larea=log(area),
llotarea=log(Lot.Area))
q3_model <- lm(lprice~larea,data=ames_train_q3)
summary(q3_model)
##
## Call:
## lm(formula = lprice ~ larea, data = ames_train_q3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08526 -0.14145 0.02048 0.17002 0.84536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.34441 0.19262 27.75 <2e-16 ***
## larea 0.92167 0.02657 34.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2834 on 998 degrees of freedom
## Multiple R-squared: 0.5466, Adjusted R-squared: 0.5461
## F-statistic: 1203 on 1 and 998 DF, p-value: < 2.2e-16
ames_train_q3$residual <- residuals(q3_model)
neighborhood_summary <- ames_train_q3 %>% group_by(Neighborhood) %>%
summarise(mean_res=mean(residual))
GrnHill
BlueSte
StoneBr
MeadowV
# type your code for Question 4 here, and Knit
ames_train_q4 <- ames_train_q3 %>% mutate(residual_sq=residual^2)
neighborhood_summary <- ames_train_q4 %>% group_by(Neighborhood) %>%
summarise(mean_res_sq=mean(residual_sq))
Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Bsmt.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
# type your code for Question 5 here, and Knit
q5_model <- lm(lprice~Overall.Qual+Bsmt.Qual+Garage.Qual,data=ames_train_q3)
summary(q5_model)
##
## Call:
## lm(formula = lprice ~ Overall.Qual + Bsmt.Qual + Garage.Qual,
## data = ames_train_q3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48987 -0.11622 0.00867 0.12462 0.98890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.581326 0.309310 34.209 < 2e-16 ***
## Overall.Qual 0.187257 0.007433 25.193 < 2e-16 ***
## Bsmt.QualEx 0.600795 0.221856 2.708 0.00689 **
## Bsmt.QualFa 0.181917 0.222964 0.816 0.41476
## Bsmt.QualGd 0.406820 0.219427 1.854 0.06406 .
## Bsmt.QualPo 0.519545 0.345271 1.505 0.13273
## Bsmt.QualTA 0.295413 0.218768 1.350 0.17724
## Garage.QualEx -0.113621 0.309190 -0.367 0.71335
## Garage.QualFa -0.191551 0.221902 -0.863 0.38824
## Garage.QualGd 0.089348 0.234180 0.382 0.70290
## Garage.QualPo -0.548931 0.267922 -2.049 0.04076 *
## Garage.QualTA -0.053152 0.218874 -0.243 0.80818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2182 on 924 degrees of freedom
## (64 observations deleted due to missingness)
## Multiple R-squared: 0.7115, Adjusted R-squared: 0.708
## F-statistic: 207.1 on 11 and 924 DF, p-value: < 2.2e-16
NA values for Bsmt.Qual and Garage.Qual correspond to houses that do not have a basement or a garage, respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
Drop all houses with NA values for Bsmt.Qual or Garage.Qual since the model cannot be estimated otherwise.
Recode all NA values as the category TA since we must assume these basements or garages are typical in the absence of all other information.
Recode all NA values as a separate category, since houses without basements or garages are fundamentally different from houses with both basements and garages.
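Treating NA as its own category can be sketched as follows. The toy data frame and the level name "None" are hypothetical, chosen only to show the mechanics. addNA is base R; the variable names are not taken from the lab.

```r
set.seed(1)

# Toy data (hypothetical): NA in qual means "no garage", not an unknown value
toy <- data.frame(
  lprice = rnorm(8, mean = 12, sd = 0.3),
  qual = factor(c("Gd", "TA", NA, "Gd", "TA", NA, "Ex", "Ex"))
)

# Default behaviour: lm() drops the rows where qual is NA
n_default_fit <- nobs(lm(lprice ~ qual, data = toy))

# Recode NA as an explicit level so those rows stay in the model
toy$qual_recoded <- addNA(toy$qual)
levels(toy$qual_recoded)[is.na(levels(toy$qual_recoded))] <- "None"
n_recoded_fit <- nobs(lm(lprice ~ qual_recoded, data = toy))

print(c(default = n_default_fit, recoded = n_recoded_fit))
```

With the explicit level in place, the model estimates a separate coefficient for houses lacking the feature instead of discarding them.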
# type your code for Question 6 here, and Knit
Consider a model of \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
# type your code for Question 7 here, and Knit
ames_train_q7 <- ames_train %>% mutate(lprice=log(price))
q7_model <- lm(lprice~Overall.Cond+Overall.Qual,data=ames_train_q7)
# estimator = "BMA" applies to predict() on BAS models, not lm objects, so it is dropped here
pred_values <- predict(q7_model,newdata=ames_train_q7,
type="response")
ames_train_q7$pred_values <- pred_values
q7_summary <- ames_train_q7 %>% group_by(as.factor(MS.SubClass)) %>%
summarise(med_pred_price=median(exp(pred_values)))
hatvalues, hat or lm.influence.
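The functions named above all report leverage, the diagonal of the hat matrix. A minimal sketch of how they relate, using the built-in mtcars data as a stand-in (the model below is an illustrative assumption, not the lab's model):

```r
# Illustrative model on a built-in data set (not the lab's model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Leverage is the diagonal of the hat matrix
h <- hatvalues(fit)

# lm.influence() reports the same values in its $hat component
h_alt <- lm.influence(fit)$hat

# Row with the largest leverage
top <- which.max(h)
print(names(top))
print(h[top])
```

The leverages sum to the number of estimated coefficients (three here), which gives a quick sanity check on the result.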
# type your code for Question 8 here, and Knit
inf_vector <- as.data.frame(lm.influence(q7_model))
answer <- inf_vector[inf_vector$hat==max(inf_vector$hat),]
Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
# type your code for Question 9 here, and Knit
In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus \(\log\)(price):
n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))
plot(log(price) ~ I(X1st.Flr.SF+X2nd.Flr.SF),
data=ames_train, col=Sale.Condition,
pch=as.numeric(Sale.Condition)+15, main="Training Data")
legend("right", legend=levels(ames_train$Sale.Condition),
col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
bty="n", xpd=TRUE, inset=c(-.5,0))
Family
Abnorm
Partial
Abnorm and Partial
# type your code for Question 10 here, and Knit
Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices under normal sale conditions only.
Subset ames_train to only include houses sold under normal sale conditions. What percent of the original observations remain?
# type your code for Question 11 here, and Knit
ames_train_normal <- ames_train %>% filter(Sale.Condition=="Normal") %>%
mutate(lprice=log(price),larea=log(area))
nrow(ames_train_normal)/nrow(ames_train)
## [1] 0.834
# type your code for Question 12 here, and Knit
q12_model <- lm(lprice~larea,data=ames_train_normal)
summary(q12_model)
##
## Call:
## lm(formula = lprice ~ larea, data = ames_train_normal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36369 -0.12269 0.02005 0.14587 0.82373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.73138 0.18711 30.63 <2e-16 ***
## larea 0.86716 0.02587 33.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2493 on 832 degrees of freedom
## Multiple R-squared: 0.5745, Adjusted R-squared: 0.574
## F-statistic: 1123 on 1 and 832 DF, p-value: < 2.2e-16