This second lab will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.
First, let us load the data:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(plotly)
library(devtools)
library(statsr)
library(broom)
library(BAS)
Question 1: Suppose you fit a linear model of \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.

\(\log\)(area)
\(\log\)(Lot.Area)
Bedroom.AbvGr
Overall.Qual
Land.Slope
# type your code for Question 1 here, and Knit
# Log-transform price, area, and Lot.Area, and keep only the model variables
q1_df <- ames_train %>% mutate(lprice = log(price),
                               larea = log(area),
                               llotarea = log(Lot.Area)) %>%
  select(c(lprice, larea, llotarea, Bedroom.AbvGr, Overall.Qual, Land.Slope))

# Exclude observations with missing values in the data set
q1_df_no_na <- na.omit(q1_df)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_larea_aic <- bas.lm(larea ~ ., data = q1_df_no_na,
                        prior = "AIC",
                        modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_larea_aic
##
## Call:
## bas.lm(formula = larea ~ ., data = q1_df_no_na, prior = "AIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept lprice llotarea Bedroom.AbvGr Overall.Qual
## 1.0000 1.0000 0.8796 1.0000 1.0000
## Land.SlopeMod Land.SlopeSev
## 0.4439 0.3340
# Top 5 most probable models under AIC prior
summary(bma_larea_aic)
## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.0000000 1.0000 1.000000 1.0000000 1.0000000
## lprice 1.0000000 1.0000 1.000000 1.0000000 1.0000000
## llotarea 0.8796498 1.0000 1.000000 1.0000000 1.0000000
## Bedroom.AbvGr 1.0000000 1.0000 1.000000 1.0000000 1.0000000
## Overall.Qual 1.0000000 1.0000 1.000000 1.0000000 1.0000000
## Land.SlopeMod 0.4438755 0.0000 1.000000 0.0000000 1.0000000
## Land.SlopeSev 0.3339825 0.0000 0.000000 1.0000000 1.0000000
## BF NA 1.0000 0.769835 0.4632905 0.3702608
## PostProbs NA 0.3379 0.260100 0.1565000 0.1251000
## R2 NA 0.7244 0.724900 0.7246000 0.7250000
## dim NA 5.0000 6.000000 6.0000000 7.0000000
## logmarg NA -1727.3470 -1727.608610 -1728.1164318 -1728.3405783
## model 5
## Intercept 1.0000000
## lprice 1.0000000
## llotarea 0.0000000
## Bedroom.AbvGr 1.0000000
## Overall.Qual 1.0000000
## Land.SlopeMod 0.0000000
## Land.SlopeSev 0.0000000
## BF 0.1042928
## PostProbs 0.0352000
## R2 0.7226000
## dim 4.0000000
## logmarg -1729.6075835
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_larea_bic <- bas.lm(larea ~ ., data = q1_df_no_na,
                        prior = "BIC",
                        modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_larea_bic
##
## Call:
## bas.lm(formula = larea ~ ., data = q1_df_no_na, prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept lprice llotarea Bedroom.AbvGr Overall.Qual
## 1.00000 1.00000 0.44280 1.00000 1.00000
## Land.SlopeMod Land.SlopeSev
## 0.06889 0.05090
# Top 5 most probable models under BIC prior
summary(bma_larea_bic)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.000000e+00
## lprice 1.00000000 1.0000 1.0000000 1.000000e+00
## llotarea 0.44280298 0.0000 1.0000000 0.000000e+00
## Bedroom.AbvGr 1.00000000 1.0000 1.0000000 1.000000e+00
## Overall.Qual 0.99999973 1.0000 1.0000000 1.000000e+00
## Land.SlopeMod 0.06888948 0.0000 0.0000000 1.000000e+00
## Land.SlopeSev 0.05090072 0.0000 0.0000000 0.000000e+00
## BF NA 1.0000 0.8242141 7.994749e-02
## PostProbs NA 0.4846 0.3994000 3.870000e-02
## R2 NA 0.7226 0.7244000 7.232000e-01
## dim NA 4.0000 5.0000000 5.000000e+00
## logmarg NA -1739.4231 -1739.6164190 -1.741949e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## lprice 1.000000e+00 1.000000e+00
## llotarea 0.000000e+00 1.000000e+00
## Bedroom.AbvGr 1.000000e+00 1.000000e+00
## Overall.Qual 1.000000e+00 1.000000e+00
## Land.SlopeMod 0.000000e+00 1.000000e+00
## Land.SlopeSev 1.000000e+00 0.000000e+00
## BF 6.454159e-02 5.454214e-02
## PostProbs 3.130000e-02 2.640000e-02
## R2 7.230000e-01 7.249000e-01
## dim 5.000000e+00 6.000000e+00
## logmarg -1.742164e+03 -1.742332e+03
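Since the question is phrased in terms of stepwise selection, a classical cross-check with stepAIC from the already-loaded MASS package is also possible. This is a sketch under the assumption that \(\log\)(price) is the response, as the question states; k = 2 corresponds to AIC and k = log(n) to BIC.

full_model <- lm(lprice ~ ., data = q1_df_no_na)
step_aic <- stepAIC(full_model, direction = "both", k = 2, trace = FALSE)
step_bic <- stepAIC(full_model, direction = "both",
                    k = log(nrow(q1_df_no_na)), trace = FALSE)
# Terms kept under AIC but dropped under BIC
setdiff(names(coef(step_aic)), names(coef(step_bic)))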
Question 2: When regressing \(\log\)(price) on Bedroom.AbvGr alone, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
# type your code for Question 2 here, and Knit
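No code is strictly required here, but the sign flip described in the question is easy to verify. A minimal sketch, reusing the q1_df_no_na data frame built for Question 1:

q2_simple <- lm(lprice ~ Bedroom.AbvGr, data = q1_df_no_na)
q2_adjusted <- lm(lprice ~ Bedroom.AbvGr + larea, data = q1_df_no_na)
# The coefficient flips sign once log(area) is controlled for
coef(q2_simple)["Bedroom.AbvGr"]
coef(q2_adjusted)["Bedroom.AbvGr"]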
Question 3: Fit a simple linear model of \(\log\)(price), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?

OldTown
StoneBr
GrnHill
IDOTRR
# type your code for Question 3 here, and Knit
# Add the log-transformed variables to the training data
ames_train_q3 <- ames_train %>% mutate(lprice = log(price),
                                       larea = log(area),
                                       llotarea = log(Lot.Area))

# Simple linear model of log(price) on log(area)
q3_model <- lm(lprice ~ larea, data = ames_train_q3)
ames_train_q3$residual <- residuals(q3_model)

# Average residual by neighborhood
neighborhood_summary <- ames_train_q3 %>% group_by(Neighborhood) %>%
  summarise(mean_res = mean(residual))
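To read off the answer, sort the summary by average residual; a one-line sketch:

neighborhood_summary %>% arrange(desc(mean_res)) %>% slice(1)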
Question 4: Using the model from Question 3, which of the following neighborhoods has the highest mean squared residuals?

GrnHill
BlueSte
StoneBr
MeadowV
# type your code for Question 4 here, and Knit
# Square the residuals from the Question 3 model
ames_train_q4 <- ames_train_q3 %>% mutate(residual_sq = residual^2)

# Mean squared residual by neighborhood
neighborhood_summary <- ames_train_q4 %>% group_by(Neighborhood) %>%
  summarise(mean_res_sq = mean(residual_sq))
Question 5: Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Bsmt.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
# type your code for Question 5 here, and Knit
q5_model <- lm(lprice ~ Overall.Qual + Bsmt.Qual + Garage.Qual, data = ames_train_q3)
summary(q5_model)
##
## Call:
## lm(formula = lprice ~ Overall.Qual + Bsmt.Qual + Garage.Qual,
## data = ames_train_q3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48987 -0.11622 0.00867 0.12462 0.98890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.581326 0.309310 34.209 < 2e-16 ***
## Overall.Qual 0.187257 0.007433 25.193 < 2e-16 ***
## Bsmt.QualEx 0.600795 0.221856 2.708 0.00689 **
## Bsmt.QualFa 0.181917 0.222964 0.816 0.41476
## Bsmt.QualGd 0.406820 0.219427 1.854 0.06406 .
## Bsmt.QualPo 0.519545 0.345271 1.505 0.13273
## Bsmt.QualTA 0.295413 0.218768 1.350 0.17724
## Garage.QualEx -0.113621 0.309190 -0.367 0.71335
## Garage.QualFa -0.191551 0.221902 -0.863 0.38824
## Garage.QualGd 0.089348 0.234180 0.382 0.70290
## Garage.QualPo -0.548931 0.267922 -2.049 0.04076 *
## Garage.QualTA -0.053152 0.218874 -0.243 0.80818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2182 on 924 degrees of freedom
## (64 observations deleted due to missingness)
## Multiple R-squared: 0.7115, Adjusted R-squared: 0.708
## F-statistic: 207.1 on 11 and 924 DF, p-value: < 2.2e-16
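The summary reports the answer directly ("64 observations deleted due to missingness"). A quick cross-check, as a sketch assuming the NAs occur only in these three predictors and never in lprice:

# Rows with an NA in any of the three quality predictors
sum(!complete.cases(ames_train_q3[, c("Overall.Qual", "Bsmt.Qual", "Garage.Qual")]))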
Question 6: The NA values for Bsmt.Qual and Garage.Qual correspond to houses that do not have a basement or a garage, respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?

Drop all observations with NA values for Bsmt.Qual or Garage.Qual, since the model cannot be estimated otherwise.
Recode all NA values as the category TA, since we must assume these basements or garages are typical in the absence of all other information.
Recode the NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.
# type your code for Question 6 here, and Knit
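One possible implementation of the recode-as-separate-category approach. This is a sketch; the level name "None" is an arbitrary choice:

ames_train_q6 <- ames_train_q3 %>%
  mutate(Bsmt.Qual = factor(ifelse(is.na(as.character(Bsmt.Qual)),
                                   "None", as.character(Bsmt.Qual))),
         Garage.Qual = factor(ifelse(is.na(as.character(Garage.Qual)),
                                     "None", as.character(Garage.Qual))))
q6_model <- lm(lprice ~ Overall.Qual + Bsmt.Qual + Garage.Qual, data = ames_train_q6)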
Question 7: Consider a simple linear model of \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
# type your code for Question 7 here, and Knit
# Log-transform price
ames_train_q7 <- ames_train %>% mutate(lprice = log(price))

# Fit the model and predict on the training data
q7_model <- lm(lprice ~ Overall.Cond + Overall.Qual, data = ames_train_q7)
pred_values <- predict(q7_model, newdata = ames_train_q7)
ames_train_q7$pred_values <- pred_values

# Median predicted price (back on the dollar scale) by dwelling subclass
q7_summary <- ames_train_q7 %>% group_by(as.factor(MS.SubClass)) %>%
  summarise(med_pred_price = median(exp(pred_values)))
Question 8: Which observation in the model from Question 7 has the highest leverage? You may find the functions hatvalues, hat, or lm.influence helpful.
# type your code for Question 8 here, and Knit
# Collect influence measures (including hat values) for each observation
inf_vector <- as.data.frame(lm.influence(q7_model))

# Observation with the largest hat value (leverage)
answer <- inf_vector[inf_vector$hat == max(inf_vector$hat), ]
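An equivalent one-liner using hatvalues from the question's hint, which returns the index of the highest-leverage observation:

which.max(hatvalues(q7_model))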
Question 9: What is the correct interpretation of the coefficient for Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
# type your code for Question 9 here, and Knit
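Assuming the question concerns the fitted coefficient itself, a simple-regression sketch of the relevant quantity, both on the log scale and exponentiated:

q9_model <- lm(lprice ~ Bedroom.AbvGr, data = ames_train_q3)
coef(q9_model)["Bedroom.AbvGr"]
# Exponentiating gives the multiplicative change in price per additional bedroom
exp(coef(q9_model)["Bedroom.AbvGr"])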
In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):
n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar = c(5, 4, 4, 10))
plot(log(price) ~ I(X1st.Flr.SF + X2nd.Flr.SF),
     data = ames_train, col = Sale.Condition,
     pch = as.numeric(Sale.Condition) + 15, main = "Training Data")
legend(x = "right", legend = levels(ames_train$Sale.Condition),
       col = 1:n.Sale.Condition, pch = 15 + (1:n.Sale.Condition),
       bty = "n", xpd = TRUE, inset = c(-.5, 0))
Question 10: Based on this plot, which of the following sale conditions appear to behave differently from normal sale conditions?

Family
Abnorm
Partial
Abnorm and Partial
# type your code for Question 10 here, and Knit
Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices under normal sale conditions only.
Question 11: Subset ames_train to only include houses sold under normal sale conditions. What percent of the original observations remain?
# type your code for Question 11 here, and Knit
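A minimal sketch, assuming normal sales are coded as "Normal" in Sale.Condition:

ames_train_normal <- ames_train %>% filter(Sale.Condition == "Normal")
# Percent of the original observations that remain
100 * nrow(ames_train_normal) / nrow(ames_train)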
# type your code for Question 12 here, and Knit