Complete all Exercises, and submit answers to Questions on the Coursera platform.
This second quiz will deal with model assumptions, selection, and interpretation. The concepts tested here will prove useful for the final peer assessment, which is much more open-ended.
First, let us load the data:
load("ames_train.Rdata")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Suppose you regress \(\log\)(price) on \(\log\)(area), \(\log\)(Lot.Area), Bedroom.AbvGr, Overall.Qual, and Land.Slope. Which of the following variables are included with stepwise variable selection using AIC but not BIC? Select all that apply.
\(\log\)(area)
\(\log\)(Lot.Area)
Bedroom.AbvGr
Overall.Qual
Land.Slope
data <- ames_train %>%
  dplyr::select(price, area, Lot.Area, Bedroom.AbvGr, Overall.Qual, Land.Slope)
data.model <- data[complete.cases(data),]
lm_q1 <- lm(log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr +
Overall.Qual + Land.Slope, data = data.model )
step_model <- stepAIC(lm_q1, trace = FALSE, k = 2)
step_model$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
## Final Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 993 36.53597 -3295.458
step_BIC <- stepAIC(lm_q1, trace = FALSE, k = log(nrow(data.model)))
step_BIC$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual +
## Land.Slope
##
## Final Model:
## log(price) ~ log(area) + log(Lot.Area) + Bedroom.AbvGr + Overall.Qual
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 993 36.53597 -3261.104
## 2 - Land.Slope 2 0.4466652 995 36.98263 -3262.768
Use the function stepAIC from the MASS package. In AIC, \(k = 2\), whereas in BIC, \(k = \log(n)\).
This question refers to the following learning objective(s): Use principled statistical methods to select a single parsimonious model.
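To see directly why Land.Slope survives the AIC search but is dropped by the BIC search, you can also compare the two criteria by hand. This is a minimal sketch, not part of the original solution; lm_no_slope is just an illustrative name.
lm_no_slope <- update(lm_q1, . ~ . - Land.Slope)  # same model without Land.Slope
c(AIC_full = AIC(lm_q1), AIC_reduced = AIC(lm_no_slope))  # AIC should favor keeping Land.Slope
c(BIC_full = BIC(lm_q1), BIC_reduced = BIC(lm_no_slope))  # BIC's log(n) penalty should favor dropping it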
In a simple linear regression of \(\log\)(price) on Bedroom.AbvGr, the coefficient for Bedroom.AbvGr is strongly positive. However, once \(\log\)(area) is added to the model, the coefficient for Bedroom.AbvGr becomes strongly negative. Which of the following best explains this phenomenon?
lm_q2 <- lm(log(price) ~ Bedroom.AbvGr, data = ames_train)
summary(lm_q2)
##
## Call:
## lm(formula = log(price) ~ Bedroom.AbvGr, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4815 -0.2609 -0.0455 0.2417 1.4915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.73794 0.04582 256.188 < 2e-16 ***
## Bedroom.AbvGr 0.09998 0.01565 6.387 2.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4125 on 998 degrees of freedom
## Multiple R-squared: 0.03927, Adjusted R-squared: 0.03831
## F-statistic: 40.79 on 1 and 998 DF, p-value: 2.588e-10
lm_q2.1 <- lm(log(price) ~ Bedroom.AbvGr + log(area), data = ames_train)
summary(lm_q2.1)
##
## Call:
## lm(formula = log(price) ~ Bedroom.AbvGr + log(area), data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10122 -0.14025 0.02724 0.16890 0.87020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.31767 0.19786 21.82 <2e-16 ***
## Bedroom.AbvGr -0.14753 0.01196 -12.34 <2e-16 ***
## log(area) 1.12063 0.02955 37.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2641 on 997 degrees of freedom
## Multiple R-squared: 0.6066, Adjusted R-squared: 0.6058
## F-statistic: 768.8 on 2 and 997 DF, p-value: < 2.2e-16
Recall: the interpretation of a coefficient holds constant all other variables included in the model.
This question refers to the following learning objective(s): Interpret the estimate for a slope (say \(b_1\)) as "All else held constant, for each unit increase in \(x_1\), we would expect \(y\) to be higher/lower on average by \(b_1\)."
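One quick way to see the mechanism behind the sign flip is to check how strongly Bedroom.AbvGr tracks house size. This is just an illustrative check, not part of the quiz solution.
# Bedroom.AbvGr should be positively correlated with log(area): bigger houses
# tend to have more bedrooms, so without area in the model the bedroom
# coefficient also absorbs the effect of overall size.
cor(ames_train$Bedroom.AbvGr, log(ames_train$area))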
You fit a simple linear model with \(\log\)(price) as the dependent variable and \(\log\)(area) as the independent variable. Which of the following neighborhoods has the highest average residuals?
OldTown
StoneBr
GrnHill
IDOTRR
lm_q3 <- lm(log(price) ~ log(area), data = ames_train)
predict_lm_q3 <- predict(lm_q3, newdata = ames_train)
ames_train$resid3 <- resid(lm_q3)
ames_train %>%
group_by(Neighborhood) %>%
summarise(mean_resid = mean(resid3)) %>%
arrange(desc(mean_resid))
## # A tibble: 27 x 2
## Neighborhood mean_resid
## <fct> <dbl>
## 1 GrnHill 0.509
## 2 StoneBr 0.379
## 3 Greens 0.378
## 4 NridgHt 0.337
## 5 Timber 0.267
## 6 Somerst 0.205
## 7 Blmngtn 0.172
## 8 Veenker 0.129
## 9 NoRidge 0.126
## 10 CollgCr 0.124
## # ... with 17 more rows
Extract the residuals from a model (say m.1) with the command resid(m.1). Then summarize the residuals, grouping by neighborhood.
This question refers to the following learning objective(s): Identify the assumptions of linear regression and assess when a model may need to be improved. Examine the residuals of a linear model.
Using the same model as in Question 3, which of the following neighborhoods is predicted least well, as measured by the average of the squared residuals?
GrnHill
BlueSte
StoneBr
MeadowV
lm_q4 <- lm(log(price) ~ log(area), data = ames_train)
predict_lm_q4 <- predict(lm_q4, newdata = ames_train)
ames_train$resid4 <- resid(lm_q4)
ames_train %>%
group_by(Neighborhood) %>%
summarise(mean_squared_resids = mean(resid4^2)) %>%
arrange(desc(mean_squared_resids))
## # A tibble: 27 x 2
## Neighborhood mean_squared_resids
## <fct> <dbl>
## 1 GrnHill 0.271
## 2 IDOTRR 0.249
## 3 StoneBr 0.201
## 4 OldTown 0.200
## 5 Veenker 0.166
## 6 NridgHt 0.155
## 7 Greens 0.146
## 8 MeadowV 0.132
## 9 SWISU 0.130
## 10 Timber 0.110
## # ... with 17 more rows
The average squared residual is one good measure for comparing how well the model predicts prices across neighborhoods. Find the neighborhood for which it is maximized.
This question refers to the following learning objective(s): Examine the residuals of a linear model.
Suppose you want to model \(\log\)(price) using only the variables in the dataset that pertain to quality: Overall.Qual, Bsmt.Qual, and Garage.Qual. How many observations must be discarded in order to estimate this model?
selected_vars5 <- ames_train %>%
dplyr::select(price, Overall.Qual, Bsmt.Qual, Garage.Qual)
clean_vars5 <- selected_vars5[complete.cases(selected_vars5),]
print(paste("Number of observations to be discarded is",
nrow(selected_vars5) - nrow(clean_vars5)))
## [1] "Number of observations to be discarded is 64"
Run the model in R. How many observations are used? There are 1000 originally in the data.
This question refers to the following learning objective(s): Identify the assumptions of linear regression and assess when a model may need to be improved.
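As a cross-check, you can fit the model directly and let lm's default na.omit handling drop the incomplete rows. This sketch assumes the illustrative name lm_q5 and should reproduce the count of 64.
lm_q5 <- lm(log(price) ~ Overall.Qual + Bsmt.Qual + Garage.Qual, data = ames_train)
nobs(lm_q5)                     # observations actually used by the fit
nrow(ames_train) - nobs(lm_q5)  # observations discarded; should match 64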
NA values for Bsmt.Qual and Garage.Qual correspond to houses that do not have a basement or a garage, respectively. Which of the following is the best way to deal with these NA values when fitting the linear model with these variables?
Discard observations with NA values for Bsmt.Qual or Garage.Qual, since the model cannot be estimated otherwise.
Recode the NA values as the category TA, since we must assume these basements or garages are typical in the absence of all other information.
Recode the NA values as a separate category, since houses without basements or garages are fundamentally different than houses with both basements and garages.
nrow(subset(ames_train, is.na(Garage.Qual) | is.na(Bsmt.Qual)))
## [1] 64
We can use the information that a house does not have a basement or garage to create a separate category for NA values. This will allow us to avoid discarding observations and use the model to predict prices for out-of-sample houses that lack garages or basements.
This question refers to the following learning objective(s): Check the assumptions of a linear model.
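A minimal sketch of the recode-as-a-separate-category approach; the level name "None" and the object names are illustrative choices, not prescribed by the quiz.
ames_recode <- ames_train %>%
  mutate(Bsmt.Qual   = factor(ifelse(is.na(as.character(Bsmt.Qual)), "None", as.character(Bsmt.Qual))),
         Garage.Qual = factor(ifelse(is.na(as.character(Garage.Qual)), "None", as.character(Garage.Qual))))
lm_q6 <- lm(log(price) ~ Overall.Qual + Bsmt.Qual + Garage.Qual, data = ames_recode)
nobs(lm_q6)  # no rows are dropped for missing basement or garage quality now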
Fit a linear model with \(\log\)(price) regressed on Overall.Cond and Overall.Qual. Which of the following subclasses of dwellings (MS.SubClass) has the highest median predicted prices?
lm_q7 <- lm(log(price) ~ Overall.Cond + Overall.Qual, data = ames_train)
predict_lm_q7 <- predict(lm_q7, newdata = ames_train)
ames_train$predict_lm_q7 <- exp(predict_lm_q7)
ames_train %>%
group_by(MS.SubClass) %>%
summarise(median_price = median(predict_lm_q7)) %>%
arrange(desc(median_price))
## # A tibble: 15 x 2
## MS.SubClass median_price
## <int> <dbl>
## 1 75 209072.
## 2 60 205634.
## 3 120 205634.
## 4 70 165841.
## 5 45 163113.
## 6 20 160431.
## 7 80 160431.
## 8 160 160431.
## 9 50 129385.
## 10 40 127258.
## 11 85 127258.
## 12 90 127258.
## 13 30 125165.
## 14 190 125165.
## 15 180 102631.
After fitting the model, use the predict function to find the predicted values for each observation in the data set. Then use group_by and summarise in dplyr to find the median predicted price for each subclass of dwellings.
This question refers to the following learning objective(s): Predict the value of the response variable for a given value of the explanatory variable, \(x^\star\), by plugging in \(x^\star\) in the linear model: \(\hat{y} = b_0 + b_1 x^\star\).
Which observation has the highest leverage in the model from Question 7? Leverage values can be obtained with hatvalues, hat, or lm.influence.
which.max(lm.influence(lm_q7)$hat)
## 268
## 268
First, fit the model from question 7. Then, using hatvalues or a combination of hat and lm.influence, you can extract the leverage values of the model. The higher the leverage, the more influential the observation is on the model fit.
This question refers to the following learning objective(s): Identify outliers and high leverage points in a linear model.
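Equivalently, hatvalues() returns the same leverage values directly; a small sketch:
lev <- hatvalues(lm_q7)  # leverage (hat value) for each observation
which.max(lev)           # should again point to observation 268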
What is the correct interpretation of the coefficient for Bedroom.AbvGr in a simple linear regression on Bedroom.AbvGr, where \(\log\)(price) is the dependent variable?
lm_q9 <- lm(log(price) ~ Bedroom.AbvGr, data = ames_train)
coef(lm_q9)
## (Intercept) Bedroom.AbvGr
## 11.73793756 0.09997612
In a multiple regression setting, we only hold constant all other variables included in the model. Also, since we use log(price) as our dependent variable, we interpret the coefficient as a percent increase or decrease, rather than an absolute increase or decrease.
This question refers to the following learning objective(s): Interpret the estimate for a slope (say \(b_1\)) as "All else held constant, for each unit increase in \(x_1\), we would expect \(y\) to be higher/lower on average by \(b_1\)."
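A rough sketch of the percent-change reading of this coefficient (the object name b1 is illustrative):
b1 <- coef(lm_q9)["Bedroom.AbvGr"]
# exp(b1) - 1 is roughly 0.105, i.e. each additional bedroom is associated with
# about a 10.5% higher price when no other characteristics are held constant.
(exp(b1) - 1) * 100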
In a linear model, we assume that all observations in the data are generated from the same process. You are concerned that houses sold in abnormal sale conditions may not exhibit the same behavior as houses sold in normal sale conditions. To visualize this, you make the following plot of 1st and 2nd floor square footage versus log(price):
n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))
plot(log(price) ~ I(X1st.Flr.SF+X2nd.Flr.SF),
data=ames_train, col=Sale.Condition,
pch=as.numeric(Sale.Condition)+15, main="Training Data")
legend(x="right", legend=levels(ames_train$Sale.Condition),
       col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
       bty="n", xpd=TRUE, inset=c(-.5,0))
Based on the plot, which of the following sale conditions exhibit behavior that differs from houses sold under normal conditions?
Family
Abnorm
Partial
Abnorm and Partial
lm_q10 <- lm(log(price) ~ Sale.Condition, data = ames_train)
summary(lm_q10)
##
## Call:
## lm(formula = log(price) ~ Sale.Condition, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28589 -0.23436 -0.02709 0.24463 1.33354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.74223 0.05016 234.091 < 2e-16 ***
## Sale.ConditionAdjLand 0.09490 0.28153 0.337 0.736
## Sale.ConditionAlloca 0.21674 0.20221 1.072 0.284
## Sale.ConditionFamily 0.11734 0.10745 1.092 0.275
## Sale.ConditionNormal 0.25361 0.05196 4.881 1.23e-06 ***
## Sale.ConditionPartial 0.75216 0.06624 11.355 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3918 on 994 degrees of freedom
## Multiple R-squared: 0.1367, Adjusted R-squared: 0.1324
## F-statistic: 31.49 on 5 and 994 DF, p-value: < 2.2e-16
A house sold under abnormal conditions often sells for much less than expected given its square footage, while a partial sale results in a higher price on average, holding square footage constant. However, one partial sale has a total square footage of over 4000, which is highly influential on the regression results.
This question refers to the following learning objective(s): Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points. List the conditions for multiple linear regression.
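To locate the influential partial sale mentioned above, you could filter on the stated criterion; this is an illustrative check, and the column selection is only for readability.
subset(ames_train,
       Sale.Condition == "Partial" & (X1st.Flr.SF + X2nd.Flr.SF) > 4000,
       select = c(price, X1st.Flr.SF, X2nd.Flr.SF, Sale.Condition))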
Because houses with non-normal sale conditions exhibit atypical behavior and can disproportionately influence the model, you decide to model housing prices under normal sale conditions only.
Subset ames_train to include only houses sold under normal sale conditions. What percent of the original observations remain?
Normal_Sale <- subset(ames_train, Sale.Condition == "Normal")
print(paste("% of the original observations is", nrow(Normal_Sale)/1000 * 100))## [1] "% of the original observations is 83.4"
Use either dplyr or the subset command in R, setting the logical condition to Sale.Condition == "Normal".
This question refers to the following learning objective(s): Use R commands to effectively manipulate data.
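The dplyr version mentioned in the hint looks like this; a sketch equivalent to the subset() call above.
Normal_Sale <- ames_train %>%
  filter(Sale.Condition == "Normal")
mean(ames_train$Sale.Condition == "Normal") * 100  # percent of original observations retained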
Refit the model from Question 3, regressing \(\log\)(price) on \(\log\)(area), using only the houses sold under normal sale conditions. Does the adjusted \(R^2\) increase or decrease relative to the same model fit on the full training data?
lm_q12 <- lm(log(price) ~ log(area), data = Normal_Sale)
summary(lm_q12)$adj.r.squared
## [1] 0.5739801
lm_q12.1 <- lm(log(price) ~ log(area), data = ames_train)
summary(lm_q12.1)$adj.r.squared
## [1] 0.5461358
Run the model on both the full and the subsetted data. Calculate the adjusted \(R^2\) values for each model and compare.
This question refers to the following learning objective(s): Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points. Define \(R^2\) as the percentage of the variability in the response variable explained by the explanatory variable.
By Cory Torrella via Duke University Statistics with R Specialization