First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(knitr)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
ames_train$Home.Age <- ames_train$Yr.Sold - ames_train$Year.Built
ggplot(ames_train) +
geom_histogram(aes(x = Home.Age), bins = 30) +
xlab("Age at Sale") +
ylab("Count of Sales") +
ggtitle("Home Age Distribution")
The distribution is right-skewed and appears bimodal, with one peak for brand-new homes and a second peak at around 40 years of age.
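As a quick numeric check on this description (a minimal sketch), the mean exceeding the median is consistent with the right skew:
# Five-number summary plus mean of home age at sale;
# mean > median supports the right-skew reading of the histogram
summary(ames_train$Home.Age)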
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
ggplot(data = ames_train, aes(x = Neighborhood, y = price)) +
geom_boxplot() +
coord_flip()
ames_train.smry <- ames_train %>%
group_by(Neighborhood) %>%
summarise(avg.price = mean(price),
median.price = median(price),
sd.price = sd(price))
p1 <- ggplot(data = ames_train.smry,
aes(x = reorder(Neighborhood,median.price), y = median.price)) +
geom_bar(stat = "identity") +
geom_text(aes(label=round(median.price,0)), vjust=0.2, hjust=1.1, size=2.5, color="white") +
xlab("Neighborhood") +
ylab("Median Price") +
ggtitle("Median Sale Price") +
theme(axis.text.x = element_blank()) +
coord_flip()
p2 <- ggplot(data = ames_train.smry,
aes(x = reorder(Neighborhood,sd.price), y = sd.price)) +
geom_bar(stat = "identity") +
geom_text(aes(label=round(sd.price,0)), vjust=0.2, hjust=1.1, size=2.5, color="white") +
xlab("Neighborhood") +
ylab("Price Std Dev") +
ggtitle("Sale Price Std Dev") +
theme(axis.text.x = element_blank()) +
coord_flip()
grid.arrange(p1, p2, ncol=2)
Because sale prices contain high outliers, the median is a more robust measure of a typical price than the mean, and the standard deviation summarizes the price variation within a neighborhood. By these measures, the most expensive neighborhood is StoneBr (median = 340,692) and the least expensive is MeadowV (median = 85,750). The most heterogeneous (having the most variation in housing price) neighborhood is StoneBr (sd = 123,459), and the least heterogeneous is Blueste (sd = 10,381).
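These values can also be pulled directly from the ames_train.smry table built above rather than read off the chart; a minimal sketch:
# Rows of the summary table with the extreme statistics
ames_train.smry %>% filter(median.price == max(median.price))  # most expensive
ames_train.smry %>% filter(median.price == min(median.price))  # least expensive
ames_train.smry %>% filter(sd.price == max(sd.price))          # most heterogeneous
ames_train.smry %>% filter(sd.price == min(sd.price))          # least heterogeneous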
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# proportion of missing values in each variable
na_count <- sapply(ames_train, function(y) mean(is.na(y)))
kable(head(sort(na_count, decreasing = TRUE), n = 5), caption = "Variable Missing Values (proportion)")
Variable | Missing (proportion) |
---|---|
Pool.QC | 0.997 |
Misc.Feature | 0.971 |
Alley | 0.933 |
Fence | 0.798 |
Fireplace.Qu | 0.491 |
‘Pool.QC’ (pool quality) has the largest share of missing values: it is missing for 99.7% of homes (997 of the 1,000 in the training set). The codebook explains that a missing value indicates the home has no pool. The large number of missing values makes sense because most homes in Iowa do not have a pool.
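Because these NAs encode the absence of a pool rather than unknown data, a common pre-modeling step (a sketch only; the column name Pool.QC.filled is illustrative and is not used elsewhere in this analysis) is to recode them as an explicit factor level:
# Recode structurally missing Pool.QC values as an explicit "No Pool" level,
# stored in a new column so the original variable is left untouched
ames_train$Pool.QC.filled <- factor(ifelse(is.na(ames_train$Pool.QC),
                                           "No Pool",
                                           as.character(ames_train$Pool.QC)))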
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
mod.0 <- "log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr"
ar2.0 <- round(summary(lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Lot.Area
mod.1 <- "log(price) ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr"
ar2.1 <- round(summary(lm(log(price) ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Land.Slope
mod.2 <- "log(price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr"
ar2.2 <- round(summary(lm(log(price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Year.Built
mod.3 <- "log(price) ~ Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr"
ar2.3 <- round(summary(lm(log(price) ~ Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Year.Remod.Add
mod.4 <- "log(price) ~ Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr"
ar2.4 <- round(summary(lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Bedroom.AbvGr
mod.5 <- "log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add"
ar2.5 <- round(summary(lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add, data = ames_train))$adj.r.squared, digits = 4)
ar2.smry <- data.frame(ModelDesc=c(mod.0, mod.1, mod.2, mod.3, mod.4, mod.5),
adj.r.squared=c(ar2.0, ar2.1, ar2.2, ar2.3, ar2.4, ar2.5))
kable(ar2.smry %>% arrange(desc(adj.r.squared)),
caption = "Adjusted R-squared Method")
ModelDesc | adj.r.squared |
---|---|
log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr | 0.5598 |
log(price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr | 0.5526 |
log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add | 0.5315 |
log(price) ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr | 0.5220 |
log(price) ~ Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr | 0.4922 |
log(price) ~ Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr | 0.4474 |
We use the adjusted R-squared method. This method does not necessarily produce the most parsimonious model, but it favors predictive accuracy. Start with the full model, then drop the variable whose removal increases the adjusted R-squared the most. Replace the full model with this reduced model, then repeat the process until dropping any remaining variable would reduce the adjusted R-squared.
In this case, the full model turns out to be the best model: dropping any variable reduces the adjusted R-squared.
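For reference, the same search can be automated; the sketch below is a minimal implementation of this backward elimination (the helper name adj.r2 and the guard for a single remaining predictor are illustrative additions):
# Backward elimination by adjusted R-squared: repeatedly drop the predictor
# whose removal raises adjusted R-squared, stopping when no drop helps.
preds <- c("Lot.Area", "Land.Slope", "Year.Built", "Year.Remod.Add", "Bedroom.AbvGr")
adj.r2 <- function(p) {
  f <- as.formula(paste("log(price) ~", paste(p, collapse = " + ")))
  summary(lm(f, data = ames_train))$adj.r.squared
}
repeat {
  if (length(preds) == 1) break  # nothing left to drop
  drops <- sapply(preds, function(v) adj.r2(setdiff(preds, v)))
  if (max(drops) <= adj.r2(preds)) break  # no single drop improves the fit
  preds <- setdiff(preds, names(which.max(drops)))
}
preds  # for this data set, all five predictors are retained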
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
ames_train.lm <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
plot(ames_train.lm, which = 1)
# index of the largest squared residual
idx <- which.max(ames_train.lm$residuals^2)  # observation 428
ames_train.lm$model[idx, ]
## log(price) Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr
## 428 9.456341 9656 Gtl 1923 1970 2
# ames_train.lm$fitted.values[idx]   # predicted log(price) = 11.54787
# ames_train.lm$residuals[idx]
# rstandard(ames_train.lm)[idx]
# Inspecting the full record (ames_train[idx, ]) shows the relevant fields:
ames_train[idx, c("PID", "area", "price", "Sale.Condition")]
## # A tibble: 1 x 4
##         PID  area price Sale.Condition
##       <int> <int> <int> <fct>
## 1 902207130   832 12789 Abnorml
summary(ames_train.lm)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
Home PID 902207130 (observation 428) has the largest squared residual. Its predicted log sale price was 11.54787 ($103,556), but its actual log sale price was only 9.456341 ($12,789). Looking at the other variables in the data set, its Sale.Condition value is ‘Abnorml’, which the codebook defines as “Abnormal Sale - trade, foreclosure, short sale”. The unusually low sale price, and hence the large squared residual, is most plausibly explained by this abnormal sale condition, which none of the model’s predictors capture.
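As a rough check on this explanation, the model residuals can be grouped by sale condition (a minimal sketch; clearly negative mean residuals for ‘Abnorml’ sales would indicate systematic over-prediction):
# Mean model residual by sale condition; abnormal sales selling below their
# predicted price would show up as a negative mean residual
data.frame(Sale.Condition = ames_train$Sale.Condition,
           resid = resid(ames_train.lm)) %>%
  group_by(Sale.Condition) %>%
  summarise(n = n(), mean.resid = mean(resid))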
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
mod.0 <- "log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr"
ar2.0 <- round(summary(lm(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop log(Lot.Area)
mod.1 <- "log(price) ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr"
ar2.1 <- round(summary(lm(log(price) ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Land.Slope
mod.2 <- "log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr"
ar2.2 <- round(summary(lm(log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Year.Built
mod.3 <- "log(price) ~ log(Lot.Area) + Land.Slope + Year.Remod.Add + Bedroom.AbvGr"
ar2.3 <- round(summary(lm(log(price) ~ log(Lot.Area) + Land.Slope + Year.Remod.Add + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Year.Remod.Add
mod.4 <- "log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Bedroom.AbvGr"
ar2.4 <- round(summary(lm(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Bedroom.AbvGr, data = ames_train))$adj.r.squared, digits = 4)
# Drop Bedroom.AbvGr
mod.5 <- "log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add"
ar2.5 <- round(summary(lm(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add, data = ames_train))$adj.r.squared, digits = 4)
ar2.smry <- data.frame(ModelDesc=c(mod.0, mod.1, mod.2, mod.3, mod.4, mod.5),
adj.r.squared=c(ar2.0, ar2.1, ar2.2, ar2.3, ar2.4, ar2.5))
kable(ar2.smry %>% arrange(desc(adj.r.squared)),
caption = "Adjusted R-squared Method")
ModelDesc | adj.r.squared |
---|---|
log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr | 0.6032 |
log(price) ~ log(Lot.Area) + Year.Built + Year.Remod.Add + Bedroom.AbvGr | 0.6015 |
log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add | 0.5911 |
log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Bedroom.AbvGr | 0.5364 |
log(price) ~ Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr | 0.5220 |
log(price) ~ log(Lot.Area) + Land.Slope + Year.Remod.Add + Bedroom.AbvGr | 0.4932 |
Yes: the adjusted R-squared method again retains all five predictors, so we arrive at the same model, with log(Lot.Area) in place of Lot.Area.
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
ames_train.lm <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
ames_train$fitted <- fitted(ames_train.lm)
p1 <- ggplot(ames_train) +
  geom_point(aes(y = fitted, x = log(price))) +
  geom_smooth(aes(y = fitted, x = log(price)), method = "lm", se = FALSE) +
  ggtitle("Fitted vs Observed: Lot.Area") +
  xlab("Observed log(price)") +
  ylab("Fitted log(price)")
ames_train.lm2 <- lm(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
ames_train$fitted2 <- fitted(ames_train.lm2)
p2 <- ggplot(ames_train) +
  geom_point(aes(y = fitted2, x = log(price))) +
  geom_smooth(aes(y = fitted2, x = log(price)), method = "lm", se = FALSE) +
  ggtitle("Fitted vs Observed: log(Lot.Area)") +
  xlab("Observed log(price)") +
  ylab("Fitted log(price)")
grid.arrange(p1, p2, ncol=2)
c(round(summary(ames_train.lm)$adj.r.squared, digits = 4), 'adj.r.squared, no log(Lot.Area)')
## [1] "0.5598" "adj.r.squared, no log(Lot.Area)"
c(round(summary(ames_train.lm2)$adj.r.squared, digits = 4), 'adj.r.squared, with log(Lot.Area)')
## [1] "0.6032" "adj.r.squared, with log(Lot.Area)"
It is better to log-transform Lot.Area. Linear regression assumes the residuals have constant variance, and without the transformation the residuals are markedly larger at higher observed values. Log-transforming Lot.Area reduces these residuals, so the constant-variance condition is better satisfied. The adjusted R-squared also increases, from 0.5598 in the untransformed model to 0.6032 in the log-transformed model.
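The constant-variance argument can also be checked directly with side-by-side residuals-vs-fitted plots for the two models (a minimal sketch using base graphics and the fitted objects above):
# Residuals vs fitted for each model; a wider, fanning spread in the left
# panel than in the right supports the constant-variance argument
par(mfrow = c(1, 2))
plot(ames_train.lm, which = 1, main = "Lot.Area model")
plot(ames_train.lm2, which = 1, main = "log(Lot.Area) model")
par(mfrow = c(1, 1))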