This second lab will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.
First, let us load the data:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(plotly)
library(devtools)
library(statsr)
library(broom)
library(BAS)price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
area)
Lot.Area)
Bedroom.AbvGr
Overall.Qual
Land.Slope
# type your code for Question 1 here, and Knit
# Exclude observations with missing values in the data set
q1_df <- ames_train %>% mutate(lprice=log(price),
larea=log(area),
llotarea=log(Lot.Area)) %>%
select(c(lprice,larea,llotarea,Bedroom.AbvGr,Overall.Qual,Land.Slope))
q1_df_no_na <- na.omit(q1_df)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_larea_aic <- bas.lm(larea ~ ., data = q1_df_no_na,
prior = "AIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_larea_aic##
## Call:
## bas.lm(formula = larea ~ ., data = q1_df_no_na, prior = "AIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept lprice llotarea Bedroom.AbvGr Overall.Qual
## 1.0000 1.0000 0.8796 1.0000 1.0000
## Land.SlopeMod Land.SlopeSev
## 0.4439 0.3340
# Top 5 most probable models under AIC prior
summary(bma_larea_aic)## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.0000000 1.0000 1.000000 1.0000000 1.0000000
## lprice 1.0000000 1.0000 1.000000 1.0000000 1.0000000
## llotarea 0.8796498 1.0000 1.000000 1.0000000 1.0000000
## Bedroom.AbvGr 1.0000000 1.0000 1.000000 1.0000000 1.0000000
## Overall.Qual 1.0000000 1.0000 1.000000 1.0000000 1.0000000
## Land.SlopeMod 0.4438755 0.0000 1.000000 0.0000000 1.0000000
## Land.SlopeSev 0.3339825 0.0000 0.000000 1.0000000 1.0000000
## BF NA 1.0000 0.769835 0.4632905 0.3702608
## PostProbs NA 0.3379 0.260100 0.1565000 0.1251000
## R2 NA 0.7244 0.724900 0.7246000 0.7250000
## dim NA 5.0000 6.000000 6.0000000 7.0000000
## logmarg NA -1727.3470 -1727.608610 -1728.1164318 -1728.3405783
## model 5
## Intercept 1.0000000
## lprice 1.0000000
## llotarea 0.0000000
## Bedroom.AbvGr 1.0000000
## Overall.Qual 1.0000000
## Land.SlopeMod 0.0000000
## Land.SlopeSev 0.0000000
## BF 0.1042928
## PostProbs 0.0352000
## R2 0.7226000
## dim 4.0000000
## logmarg -1729.6075835
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_larea_bic <- bas.lm(larea ~ ., data = q1_df_no_na,
prior = "BIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_larea_bic##
## Call:
## bas.lm(formula = larea ~ ., data = q1_df_no_na, prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept lprice llotarea Bedroom.AbvGr Overall.Qual
## 1.00000 1.00000 0.44280 1.00000 1.00000
## Land.SlopeMod Land.SlopeSev
## 0.06889 0.05090
# Top 5 most probable models under BIC prior
summary(bma_larea_bic)## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.000000e+00
## lprice 1.00000000 1.0000 1.0000000 1.000000e+00
## llotarea 0.44280298 0.0000 1.0000000 0.000000e+00
## Bedroom.AbvGr 1.00000000 1.0000 1.0000000 1.000000e+00
## Overall.Qual 0.99999973 1.0000 1.0000000 1.000000e+00
## Land.SlopeMod 0.06888948 0.0000 0.0000000 1.000000e+00
## Land.SlopeSev 0.05090072 0.0000 0.0000000 0.000000e+00
## BF NA 1.0000 0.8242141 7.994749e-02
## PostProbs NA 0.4846 0.3994000 3.870000e-02
## R2 NA 0.7226 0.7244000 7.232000e-01
## dim NA 4.0000 5.0000000 5.000000e+00
## logmarg NA -1739.4231 -1739.6164190 -1.741949e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## lprice 1.000000e+00 1.000000e+00
## llotarea 0.000000e+00 1.000000e+00
## Bedroom.AbvGr 1.000000e+00 1.000000e+00
## Overall.Qual 1.000000e+00 1.000000e+00
## Land.SlopeMod 0.000000e+00 1.000000e+00
## Land.SlopeSev 1.000000e+00 0.000000e+00
## BF 6.454159e-02 5.454214e-02
## PostProbs 3.130000e-02 2.640000e-02
## R2 7.230000e-01 7.249000e-01
## dim 5.000000e+00 6.000000e+00
## logmarg -1.742164e+03 -1.742332e+03
price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
Bedroom.AbvGr
# type your code for Question 2 here, and Knitprice), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?OldTown
StoneBr
GrnHill
IDOTRR
# type your code for Question 3 here, and Knit
ames_train_q3 <- ames_train %>% mutate(lprice=log(price),
larea=log(area),
llotarea=log(Lot.Area))
q3_model <- lm(lprice~larea,data=ames_train_q3)
ames_train_q3$residual <- residuals(q3_model)
neighborhood_summary <- ames_train_q3 %>% group_by(Neighborhood) %>% summarise(mean_res=mean(residual))GrnHill
BlueSte
StoneBr
MeadowV
# type your code for Question 4 here, and Knit
ames_train_q4 <- ames_train_q3 %>% mutate(residual_sq=residual^2)
neighborhood_summary <- ames_train_q4 %>% group_by(Neighborhood) %>%
summarise(mean_res_sq=mean(residual_sq))price) using only the variables in the dataset that pertain to quality: Overall.Qual, Bsmt.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
# type your code for Question 5 here, and Knit
q5_model <- lm(lprice~Overall.Qual+Bsmt.Qual+Garage.Qual,data=ames_train_q3)
summary(q5_model)##
## Call:
## lm(formula = lprice ~ Overall.Qual + Bsmt.Qual + Garage.Qual,
## data = ames_train_q3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48987 -0.11622 0.00867 0.12462 0.98890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.581326 0.309310 34.209 < 2e-16 ***
## Overall.Qual 0.187257 0.007433 25.193 < 2e-16 ***
## Bsmt.QualEx 0.600795 0.221856 2.708 0.00689 **
## Bsmt.QualFa 0.181917 0.222964 0.816 0.41476
## Bsmt.QualGd 0.406820 0.219427 1.854 0.06406 .
## Bsmt.QualPo 0.519545 0.345271 1.505 0.13273
## Bsmt.QualTA 0.295413 0.218768 1.350 0.17724
## Garage.QualEx -0.113621 0.309190 -0.367 0.71335
## Garage.QualFa -0.191551 0.221902 -0.863 0.38824
## Garage.QualGd 0.089348 0.234180 0.382 0.70290
## Garage.QualPo -0.548931 0.267922 -2.049 0.04076 *
## Garage.QualTA -0.053152 0.218874 -0.243 0.80818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2182 on 924 degrees of freedom
## (64 observations deleted due to missingness)
## Multiple R-squared: 0.7115, Adjusted R-squared: 0.708
## F-statistic: 207.1 on 11 and 924 DF, p-value: < 2.2e-16
NA values for Basement.Qual and Garage.Qual correspond to houses that do not have a basement or a garage respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?NA values for Basement.Qual or Garage.Qual since the model cannot be estimated otherwise.
NA values as the category TA since we must assume these basements or garages are typical in the absence of all other information.
NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.
# type your code for Question 6 here, and Knitprice) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
# type your code for Question 7 here, and Knit
ames_train_q7 <- ames_train %>% mutate(lprice=log(price))
q7_model <- lm(lprice~Overall.Cond+Overall.Qual,data=ames_train_q7)
pred_values <- predict(q7_model,newdata=ames_train_q7,
estimator="BMA",
type="response")
ames_train_q7$pred_values <- pred_values
q7_summary <- ames_train_q7 %>% group_by(as.factor(MS.SubClass)) %>%
summarise(med_pred_price=median(exp(pred_values)))hatvalues, hat or lm.influence.
# type your code for Question 8 here, and Knit
inf_vector <- as.data.frame(lm.influence(q7_model))
answer <- inf_vector[inf_vector$hat==max(inf_vector$hat),]Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
# type your code for Question 9 here, and KnitIn a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):
n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))
plot(log(price) ~ I(X1st.Flr.SF+X2nd.Flr.SF),
data=ames_train, col=Sale.Condition,
pch=as.numeric(Sale.Condition)+15, main="Training Data")
legend(x=,"right", legend=levels(ames_train$Sale.Condition),
col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
bty="n", xpd=TRUE, inset=c(-.5,0))Family
Abnorm
Partial
Abnorm and Partial
# type your code for Question 10 here, and KnitBecause houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to only model housing prices under only normal sale conditions.
ames_train to only include houses sold under normal sale conditions. What percent of the original observations remain?
# type your code for Question 11 here, and Knit# type your code for Question 12 here, and Knit