Complete all Exercises, and submit answers to Questions on the Coursera platform.
This second quiz will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.
First, let us load the data:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Suppose you regress \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
\(\log\)(area)
\(\log\)(Lot.Area)
Bedroom.AbvGr
Overall.Qual
Land.Slope
data <- ames_train %>%
  dplyr::select(price, area, Lot.Area, Bedroom.AbvGr, Overall.Qual, Land.Slope)
data.model <- data[complete.cases(data), ]
lm_q1 <- lm(log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr +
              Overall.Qual + Land.Slope, data = data.model)
step_model <- stepAIC(lm_q1, trace = FALSE, k = 2)
step_model$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
## Final Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 993 36.53597 -3295.458
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
## Final Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 993 36.53597 -3261.104
## 2 - Land.Slope 2 0.4466652 995 36.98263 -3262.768
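The second stepwise path above is consistent with a BIC penalty: its starting AIC column differs from the first run by (log(1000) - 2) × 7, which is what `k = log(n)` gives for 1000 observations and 7 estimated coefficients. The code for that run is not shown, so the following is only a minimal sketch of the comparison, assuming the built-in `mtcars` data as a stand-in for `ames_train` and base R's `step()` in place of `MASS::stepAIC()`.

```r
# Stand-in data: mtcars ships with R; ames_train is not assumed to be available.
full <- lm(log(mpg) ~ log(wt) + log(hp) + cyl + gear + carb, data = mtcars)
n <- nrow(mtcars)

# k = 2 gives AIC; k = log(n) gives BIC (a heavier penalty per coefficient).
aic_fit <- step(full, k = 2, trace = 0)
bic_fit <- step(full, k = log(n), trace = 0)

aic_terms <- attr(terms(aic_fit), "term.labels")
bic_terms <- attr(terms(bic_fit), "term.labels")

# Terms kept under AIC but dropped under BIC.
setdiff(aic_terms, bic_terms)
```

With `stepAIC()` the same idea would be to rerun the call with `k = log(nrow(data.model))` and compare the two final formulas.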
When \(\log\)(price) is regressed on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
Bedroom.AbvGr
##
## Call:
## lm(formula = log(price) ~ Bedroom.AbvGr, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4815 -0.2609 -0.0455 0.2417 1.4915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.73794 0.04582 256.188 < 2e-16 ***
## Bedroom.AbvGr 0.09998 0.01565 6.387 2.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4125 on 998 degrees of freedom
## Multiple R-squared: 0.03927, Adjusted R-squared: 0.03831
## F-statistic: 40.79 on 1 and 998 DF, p-value: 2.588e-10
##
## Call:
## lm(formula = log(price) ~ Bedroom.AbvGr + log(area), data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10122 -0.14025 0.02724 0.16890 0.87020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.31767 0.19786 21.82 <2e-16 ***
## Bedroom.AbvGr -0.14753 0.01196 -12.34 <2e-16 ***
## log(area) 1.12063 0.02955 37.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2641 on 997 degrees of freedom
## Multiple R-squared: 0.6066, Adjusted R-squared: 0.6058
## F-statistic: 768.8 on 2 and 997 DF, p-value: < 2.2e-16
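One common mechanism behind a sign flip like this is confounding by house size: larger houses have more bedrooms and sell for more, so bedrooms alone pick up the effect of area. The sketch below illustrates that mechanism on simulated data; every number in it is made up for illustration and none comes from `ames_train`.

```r
set.seed(1)
n <- 1000
log_area <- rnorm(n, mean = 7.3, sd = 0.3)

# Bigger houses tend to have more bedrooms.
bedrooms <- 3 + 2.5 * (log_area - 7.3) + rnorm(n, sd = 0.5)

# Holding area fixed, an extra bedroom lowers log(price) in this simulation.
log_price <- 4.3 + 1.1 * log_area - 0.15 * bedrooms + rnorm(n, sd = 0.25)

b_alone    <- coef(lm(log_price ~ bedrooms))["bedrooms"]
b_adjusted <- coef(lm(log_price ~ bedrooms + log_area))["bedrooms"]

# The first slope is positive, the second negative.
c(alone = unname(b_alone), adjusted = unname(b_adjusted))
```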
Fit a simple linear model for \(\log\)(price), with \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?
OldTown
StoneBr
GrnHill
IDOTRR
lm_q3 <- lm(log(price) ~ log(area), data = ames_train)
predict_lm_q3 <- predict(lm_q3, newdata = ames_train)
ames_train$resid3 <- resid(lm_q3)
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(mean_resid = mean(resid3)) %>%
  arrange(desc(mean_resid))
## # A tibble: 27 x 2
## Neighborhood mean_resid
## <fct> <dbl>
## 1 GrnHill 0.509
## 2 StoneBr 0.379
## 3 Greens 0.378
## 4 NridgHt 0.337
## 5 Timber 0.267
## 6 Somerst 0.205
## 7 Blmngtn 0.172
## 8 Veenker 0.129
## 9 NoRidge 0.126
## 10 CollgCr 0.124
## # … with 17 more rows
GrnHill
BlueSte
StoneBr
MeadowV
lm_q4 <- lm(log(price) ~ log(area), data = ames_train)
predict_lm_q4 <- predict(lm_q4, newdata = ames_train)
ames_train$resid4 <- resid(lm_q4)
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(mean_squared_resids = mean(resid4^2)) %>%
  arrange(desc(mean_squared_resids))
## # A tibble: 27 x 2
## Neighborhood mean_squared_resids
## <fct> <dbl>
## 1 GrnHill 0.271
## 2 IDOTRR 0.249
## 3 StoneBr 0.201
## 4 OldTown 0.200
## 5 Veenker 0.166
## 6 NridgHt 0.155
## 7 Greens 0.146
## 8 MeadowV 0.132
## 9 SWISU 0.130
## 10 Timber 0.110
## # … with 17 more rows
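The two summaries above answer different questions: the mean residual is signed and shows whether a neighborhood is systematically under- or overpredicted, while the mean squared residual shows how far off the predictions are regardless of direction. A small base R sketch on simulated data makes the distinction concrete; the three neighborhoods `A`, `B`, and `C` are hypothetical and all parameters are invented for illustration.

```r
set.seed(2)
# Hypothetical data: three made-up neighborhoods, not the Ames ones.
hood <- factor(rep(c("A", "B", "C"), each = 200))
log_area <- rnorm(600, mean = 7.3, sd = 0.3)

# A sells above the common line, C below it; B is centered but noisy.
shift    <- c(A = 0.3, B = 0,   C = -0.3)[as.character(hood)]
noise_sd <- c(A = 0.1, B = 0.4, C = 0.1)[as.character(hood)]
log_price <- 4 + 1.1 * log_area + shift + rnorm(600, sd = noise_sd)

res <- resid(lm(log_price ~ log_area))

mean_res <- tapply(res, hood, mean)      # signed: above or below the fitted line
mean_sq  <- tapply(res^2, hood, mean)    # unsigned: overall size of the misfit
rbind(mean_res, mean_sq)
```

In this setup `B` has a mean residual near zero yet a large mean squared residual, so ranking by one measure need not match ranking by the other.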
Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Basement.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
selected_vars5 <- ames_train %>%
  dplyr::select(price, Overall.Qual, Bsmt.Qual, Garage.Qual)
clean_vars5 <- selected_vars5[complete.cases(selected_vars5), ]
print(paste("Number of observations to be discarded is",
            nrow(selected_vars5) - nrow(clean_vars5)))
## [1] "Number of observations to be discarded is 64"
NA values for Basement.Qual and Garage.Qual correspond to houses that do not have a basement or a garage, respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
Drop all observations with NA values for Basement.Qual or Garage.Qual since the model cannot be estimated otherwise.
Recode all NA values as the category TA since we must assume these basements or garages are typical in the absence of all other information.
Recode all NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.
## [1] 64
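If the NA values are to be kept as their own category, one way to do it in base R is `addNA()`, which promotes NA to a real factor level so that `lm()` no longer drops those rows. This is a minimal sketch on a hypothetical vector of quality ratings; the label `"None"` is an arbitrary choice, not something defined in the dataset.

```r
# Hypothetical quality ratings; NA means the house has no garage.
garage_qual <- factor(c("TA", "Gd", NA, "TA", "Fa", NA))

# addNA() turns NA into a factor level; relabel it for readability.
garage_qual2 <- addNA(garage_qual)
levels(garage_qual2)[is.na(levels(garage_qual2))] <- "None"

table(garage_qual2)
```

The same two lines could be applied to each affected column of a data frame before fitting the model.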
Fit a model with \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
lm_q7 <- lm(log(price) ~ Overall.Cond + Overall.Qual, data = ames_train)
predict_lm_q7 <- predict(lm_q7, newdata = ames_train)
ames_train$predict_lm_q7 <- exp(predict_lm_q7)
ames_train %>%
  group_by(MS.SubClass) %>%
  summarise(median_price = median(predict_lm_q7)) %>%
  arrange(desc(median_price))
## # A tibble: 15 x 2
## MS.SubClass median_price
## <int> <dbl>
## 1 75 209072.
## 2 60 205634.
## 3 120 205634.
## 4 70 165841.
## 5 45 163113.
## 6 20 160431.
## 7 80 160431.
## 8 160 160431.
## 9 50 129385.
## 10 40 127258.
## 11 85 127258.
## 12 90 127258.
## 13 30 125165.
## 14 190 125165.
## 15 180 102631.
hatvalues, hat or lm.influence.
## 268
## 268
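The code that produced this output is not shown. As a hedged sketch of how the functions named above are typically used to find the highest-leverage observation, the example below fits a stand-in model on the built-in `mtcars` data (an assumption, since `ames_train` is not available here) and takes `which.max()` of the hat values.

```r
# Stand-in model on built-in data, not the quiz model.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

h <- hatvalues(fit)          # leverage of each observation
top <- which.max(h)          # named integer: row name and row position
top

# lm.influence() exposes the same quantities in its $hat component.
same <- all.equal(unname(h), unname(lm.influence(fit)$hat))
same
```

Because `which.max()` returns a named integer, R prints the row name on one line and the row position beneath it, which is the shape of the two-line output above.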
Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
## (Intercept) Bedroom.AbvGr
## 11.73793756 0.09997612
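Because the response is on the log scale, a slope is read multiplicatively: a one-unit increase in the predictor multiplies the predicted price by exp(coefficient). The short calculation below applies that to the Bedroom.AbvGr estimate printed above; it is a worked arithmetic check, not part of the original quiz code.

```r
b <- 0.09997612                    # Bedroom.AbvGr coefficient from the output above
pct_change <- (exp(b) - 1) * 100   # percent change in price per extra bedroom
round(pct_change, 2)               # about 10.51
```

So in the bedrooms-only model, each additional bedroom is associated with roughly a 10.5% higher predicted price.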
In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):
n.Sale.Condition <- length(levels(ames_train$Sale.Condition))
par(mar = c(5, 4, 4, 10))
plot(log(price) ~ I(X1st.Flr.SF + X2nd.Flr.SF),
     data = ames_train, col = Sale.Condition,
     pch = as.numeric(Sale.Condition) + 15, main = "Training Data")
legend("right", legend = levels(ames_train$Sale.Condition),
       col = 1:n.Sale.Condition, pch = 15 + (1:n.Sale.Condition),
       bty = "n", xpd = TRUE, inset = c(-.5, 0))
Family
Abnorm
Partial
Abnorm and Partial
##
## Call:
## lm(formula = log(price) ~ Sale.Condition, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28589 -0.23436 -0.02709 0.24463 1.33354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.74223 0.05016 234.091 < 2e-16 ***
## Sale.ConditionAdjLand 0.09490 0.28153 0.337 0.736
## Sale.ConditionAlloca 0.21674 0.20221 1.072 0.284
## Sale.ConditionFamily 0.11734 0.10745 1.092 0.275
## Sale.ConditionNormal 0.25361 0.05196 4.881 1.23e-06 ***
## Sale.ConditionPartial 0.75216 0.06624 11.355 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3918 on 994 degrees of freedom
## Multiple R-squared: 0.1367, Adjusted R-squared: 0.1324
## F-statistic: 31.49 on 5 and 994 DF, p-value: < 2.2e-16
Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices only under normal sale conditions.
Subset ames_train to include only houses sold under normal sale conditions. What percent of the original observations remain?
Normal_Sale <- subset(ames_train, Sale.Condition == "Normal")
print(paste("% of the original observations is",
            nrow(Normal_Sale) / nrow(ames_train) * 100))
## [1] "% of the original observations is 83.4"
## [1] 0.5739801
## [1] 0.5461358