First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(tidyverse)
library(BAS)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# Question 1: histogram of house ages (as of 2017)
ames_train %>%
  ggplot(aes(x = 2017 - Year.Built)) +
  geom_histogram(bins = 30) +
  ggtitle("Distribution of Ages of Houses in Data Set\n(As of 2017)") +
  xlab("Age in Years") +
  ylab("Number of houses")
Most houses in the data set are 10-15 years old. A long tail of older homes (80+ years) creates a right skew. Note that if we plotted Year.Built instead of age, the distribution would be mirrored and we would see a left skew.
At this level of granularity, the distribution also appears roughly bimodal, with peaks at 10-20 years and at 35-65 years, and a notable dearth of homes 25-35 years old.
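To back the skew claim numerically, a quick check (a minimal sketch, assuming ames_train is loaded as above) compares the mean and median age, since a mean above the median is characteristic of right skew:
# Right skew pulls the mean above the median
age <- 2017 - ames_train$Year.Built
summary(age)             # five-number summary plus the mean
mean(age) > median(age)  # TRUE under right skew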
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# Question 2: price by neighborhood
ames_train %>%
  ggplot(aes(x = Neighborhood, y = price, group = Neighborhood)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip() +
  ggtitle("Median price and Interquartile Range by Neighborhood")
nhood_summary <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(n = n(), mean = mean(price), median = median(price),
            sd = sd(price), cv = sd / mean)
# Four neighborhoods with the highest median price
nhood_summary %>% arrange(desc(median)) %>% head(4)
## # A tibble: 4 x 6
##   Neighborhood     n     mean   median        sd        cv
##         <fctr> <int>    <dbl>    <dbl>     <dbl>     <dbl>
## 1      StoneBr    20 339316.0 340691.5 123459.10 0.3638469
## 2      NridgHt    57 333646.7 336860.0 105088.90 0.3149706
## 3      NoRidge    28 295844.6 290000.0  35888.97 0.1213102
## 4      GrnHill     2 280000.0 280000.0  70710.68 0.2525381
# Four neighborhoods with the lowest median price
nhood_summary %>% arrange(median) %>% head(4)
## # A tibble: 4 x 6
##   Neighborhood     n      mean median       sd        cv
##         <fctr> <int>     <dbl>  <dbl>    <dbl>     <dbl>
## 1      MeadowV    16  92946.88  85750 18939.78 0.2037699
## 2       BrDale    10  98930.00  98750 13337.59 0.1348184
## 3       IDOTRR    35  97620.69  99500 31530.44 0.3229894
## 4      OldTown    71 120225.61 120000 36429.69 0.3030111
# Four neighborhoods with the highest coefficient of variation
nhood_summary %>% arrange(desc(cv)) %>% head(4)
## # A tibble: 4 x 6
##   Neighborhood     n      mean   median        sd        cv
##         <fctr> <int>     <dbl>    <dbl>     <dbl>     <dbl>
## 1      Edwards    60 136322.02 127250.0  54851.63 0.4023681
## 2      StoneBr    20 339316.05 340691.5 123459.10 0.3638469
## 3      Crawfor    29 204196.55 205000.0  71267.56 0.3490145
## 4       IDOTRR    35  97620.69  99500.0  31530.44 0.3229894
We use the median as the summary statistic for price, and the coefficient of variation (cv = standard deviation divided by mean) as the summary statistic for price variability (heterogeneity).
The median is our preferred summary statistic because of its robustness to high variability and outliers, which matter in several neighborhoods.
With cv (as opposed to the raw standard deviation) we normalize heterogeneity on a dollar-by-dollar basis, answering the question: “If you were to invest $X in a single neighborhood, how much price variability would that $X be exposed to in each neighborhood?”
The most expensive neighborhood is StoneBr ($341,000), followed closely by NridgHt ($337,000). The least expensive is MeadowV ($85,800).
The most heterogeneous neighborhood is Edwards (cv of 0.40), with StoneBr (0.36) close behind. These two neighborhoods sit at opposite extremes of the price spectrum: StoneBr actually has the highest standard deviation of price of any neighborhood, but because it is also the most expensive neighborhood overall, its coefficient of variation is lower than that of Edwards.
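To make the sd-versus-cv point concrete, a short sketch (reusing the nhood_summary table computed above) pulls just these two neighborhoods side by side:
# StoneBr has a larger sd in dollars, yet Edwards has more variation per dollar
nhood_summary %>%
  filter(Neighborhood %in% c("Edwards", "StoneBr")) %>%
  select(Neighborhood, mean, sd, cv)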
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# Question 3: which variables have the most missing values?
count_na <- sapply(ames_train, function(x) sum(is.na(x)))
count_na[count_na > 900]
##        Alley      Pool.QC Misc.Feature 
##          933          997          971
Pool.QC, which records pool quality, has 997 missing values out of 1000: only 3 houses in the data have swimming pools, and every other house has NA for Pool.QC. The outdoor swimming season is fairly short in Ames and a private pool is a high-end amenity, so it makes sense that hardly any homes have one. (Alley and Misc.Feature, the other two variables flagged above, are missing for the same reason: NA means the feature is absent, not that data were lost.)
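A quick check (a minimal sketch against the same data frame) confirms the count of houses that actually have a pool:
# Houses with a recorded pool quality, i.e., houses with a pool
sum(!is.na(ames_train$Pool.QC))   # 3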
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# Question 4: Bayesian model averaging over all candidate models, BIC prior
ames_train <- ames_train %>% mutate(log_price = log(price))
model <- bas.lm(log_price ~ Lot.Area + Land.Slope + Year.Built +
                  Year.Remod.Add + Bedroom.AbvGr,
                prior = "BIC",
                modelprior = uniform(),
                data = ames_train)
summary(model)
##                P(B != 0 | Y)    model 1       model 2      model 3
## Intercept          1.0000000     1.0000     1.0000000     1.000000
## Lot.Area           1.0000000     1.0000     1.0000000     1.000000
## Land.SlopeMod      0.6323994     1.0000     0.0000000     1.000000
## Land.SlopeSev      0.7923033     1.0000     1.0000000     0.000000
## Year.Built         1.0000000     1.0000     1.0000000     1.000000
## Year.Remod.Add     1.0000000     1.0000     1.0000000     1.000000
## Bedroom.AbvGr      1.0000000     1.0000     1.0000000     1.000000
## BF                        NA     1.0000     0.6674353     0.330911
## PostProbs                 NA     0.4752     0.3171000     0.157200
## R2                        NA     0.5625     0.5591000     0.558500
## dim                       NA     7.0000     6.0000000     6.000000
## logmarg                   NA -2198.1673 -2198.5716571 -2199.273250
##                      model 4      model 5
## Intercept          1.0000000 1.000000e+00
## Lot.Area           1.0000000 1.000000e+00
## Land.SlopeMod      0.0000000 0.000000e+00
## Land.SlopeSev      0.0000000 1.000000e+00
## Year.Built         1.0000000 1.000000e+00
## Year.Remod.Add     1.0000000 1.000000e+00
## Bedroom.AbvGr      1.0000000 0.000000e+00
## BF                 0.1061955 1.578929e-12
## PostProbs          0.0505000 0.000000e+00
## R2                 0.5544000 5.316000e-01
## dim                5.0000000 5.000000e+00
## logmarg        -2200.4098178 -2.225342e+03
image(model,rotate = FALSE)
We use Bayesian Model Averaging from the BAS package to enumerate all possible models. The posterior probability of each model is based on its BIC score, the Bayesian Information Criterion (BIC = -2·log-likelihood + k·log n, for k parameters and n observations), which rewards models for accuracy of prediction but penalizes them for each additional parameter. The resulting balance favors models that have good predictive power without being too complicated (i.e., without too many explanatory variables).
The highest-probability model, as shown in the image above, uses all of the candidate variables to predict log(price).
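As a sanity check on how the BIC prior ranks models, the Bayes factor between two candidate models can be approximated from ordinary lm fits as exp(-(ΔBIC)/2). A minimal sketch, comparing model 1 with model 4 from the summary above (the full model versus the model that drops Land.Slope entirely):
# Under the BIC approximation, BF(A : B) ~ exp(-(BIC_A - BIC_B) / 2)
fit_full    <- lm(log_price ~ Lot.Area + Land.Slope + Year.Built +
                    Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
fit_noslope <- update(fit_full, . ~ . - Land.Slope)
exp(-(BIC(fit_noslope) - BIC(fit_full)) / 2)  # should land close to model 4's BF of ~0.106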
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# Question 5: refit the top model with lm to inspect residuals
model_full <- lm(log_price ~ Lot.Area + Land.Slope + Year.Built +
                   Year.Remod.Add + Bedroom.AbvGr,
                 data = ames_train)
summary(model_full)
## 
## Call:
## lm(formula = log_price ~ Lot.Area + Land.Slope + Year.Built + 
##     Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0878 -0.1651 -0.0211  0.1657  0.9945 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.371e+01  8.574e-01 -15.996  < 2e-16 ***
## Lot.Area        1.028e-05  1.106e-06   9.296  < 2e-16 ***
## Land.SlopeMod   1.384e-01  4.991e-02   2.773  0.00565 ** 
## Land.SlopeSev  -4.567e-01  1.514e-01  -3.016  0.00263 ** 
## Year.Built      6.049e-03  3.788e-04  15.968  < 2e-16 ***
## Year.Remod.Add  6.778e-03  5.468e-04  12.395  < 2e-16 ***
## Bedroom.AbvGr   8.686e-02  1.077e-02   8.063 2.12e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared:  0.5625, Adjusted R-squared:  0.5598 
## F-statistic: 212.8 on 6 and 993 DF,  p-value: < 2.2e-16
plot(model_full, which = 1, main = "House #428 has greatest magnitude residual")
ames_train[428, ] %>%
  select(PID, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
## # A tibble: 1 x 6
##         PID Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr
##       <int>    <int>     <fctr>      <int>          <int>         <int>
## 1 902207130     9656        Gtl       1923           1970             2
The greatest squared residual belongs to ames_train[428, ], with PID = 902207130. The model predicts a log(price) of roughly 11.5, but the actual log(price) is about 9.5. A difference of 2 on the natural-log scale is a factor of e^2 ≈ 7.4 in price: roughly $102K predicted versus $13K actual.
Overall.Qual and Overall.Cond are both “poor,” which is true of only 2 out of 1000 houses.
Sale.Condition is abnormal (indicating short sale or foreclosure), which is true of only 61 out of 1000 houses.
Poor quality, poor condition, and an abnormal sale each depress price strongly, and none of these variables appear in our model, so the model badly overpredicts this home's price.
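A short sketch (reusing model_full from above) locates this observation programmatically and checks the predicted-versus-actual gap:
# Index of the largest squared residual, then predicted vs actual price
idx <- which.max(residuals(model_full)^2)   # expect 428
exp(fitted(model_full)[idx])                # predicted price, ~ $102K
ames_train$price[idx]                       # actual sale price, ~ $13K
table(ames_train$Sale.Condition)            # abnormal sales are rare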
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# Question 6: same BMA procedure, with log(Lot.Area) replacing Lot.Area
ames_train <- ames_train %>% mutate(log_lotarea = log(Lot.Area))
model_logarea <- bas.lm(log_price ~ log_lotarea + Land.Slope + Year.Built +
                          Year.Remod.Add + Bedroom.AbvGr,
                        prior = "BIC",
                        modelprior = uniform(),
                        data = ames_train)
summary(model_logarea)
##                P(B != 0 | Y)    model 1       model 2       model 3
## Intercept         1.00000000     1.0000     1.0000000  1.000000e+00
## log_lotarea       1.00000000     1.0000     1.0000000  1.000000e+00
## Land.SlopeMod     0.39325522     0.0000     1.0000000  0.000000e+00
## Land.SlopeSev     0.03645331     0.0000     0.0000000  1.000000e+00
## Year.Built        1.00000000     1.0000     1.0000000  1.000000e+00
## Year.Remod.Add    1.00000000     1.0000     1.0000000  1.000000e+00
## Bedroom.AbvGr     0.99998941     1.0000     1.0000000  1.000000e+00
## BF                        NA     1.0000     0.6494644  3.866294e-02
## PostProbs                 NA     0.5842     0.3794000  2.260000e-02
## R2                        NA     0.6031     0.6055000  6.032000e-01
## dim                       NA     5.0000     6.0000000  6.000000e+00
## logmarg                   NA -2142.5652 -2142.9968106 -2.145818e+03
##                      model 4      model 5
## Intercept       1.000000e+00 1.000000e+00
## log_lotarea     1.000000e+00 1.000000e+00
## Land.SlopeMod   1.000000e+00 0.000000e+00
## Land.SlopeSev   1.000000e+00 0.000000e+00
## Year.Built      1.000000e+00 1.000000e+00
## Year.Remod.Add  1.000000e+00 1.000000e+00
## Bedroom.AbvGr   1.000000e+00 0.000000e+00
## BF              2.374021e-02 1.412793e-05
## PostProbs       1.390000e-02 0.000000e+00
## R2              6.056000e-01 5.913000e-01
## dim             7.000000e+00 4.000000e+00
## logmarg        -2.146306e+03 -2.153733e+03
image(model_logarea,rotate = FALSE)
Using log(Lot.Area) in the Bayesian Model Averaging above, the highest-probability model is different from before: it no longer includes all candidate variables but excludes Land.Slope. At the same time its R2 has increased from 0.56 to 0.60, indicating an improvement in predictive power.
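To confirm programmatically which predictors the top model keeps, one can inspect the fitted bas object (a sketch relying on the components documented in ?bas.lm: postprobs, which, and namesx; the indices in which are 0-based, so we add 1 to index into namesx):
# Predictors included in the highest posterior probability model (HPM)
best <- which.max(model_logarea$postprobs)
model_logarea$namesx[model_logarea$which[[best]] + 1]
# Expect every candidate except the Land.Slope terms, per the summary above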
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide written support, including a quantitative justification, for your answer to the first part of Question 7.
# Question 7: compare fitted vs actual for the Lot.Area and log(Lot.Area) models
model_4 <- lm(log_price ~ log_lotarea + Year.Built + Year.Remod.Add + Bedroom.AbvGr,
              data = ames_train)
summary(model_4)
## 
## Call:
## lm(formula = log_price ~ log_lotarea + Year.Built + Year.Remod.Add + 
##     Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.14609 -0.15825 -0.01477  0.15354  1.01578 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.557e+01  8.213e-01 -18.964  < 2e-16 ***
## log_lotarea     2.471e-01  1.654e-02  14.935  < 2e-16 ***
## Year.Built      5.964e-03  3.604e-04  16.547  < 2e-16 ***
## Year.Remod.Add  6.765e-03  5.197e-04  13.017  < 2e-16 ***
## Bedroom.AbvGr   5.726e-02  1.054e-02   5.434 6.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared:  0.6031, Adjusted R-squared:  0.6015 
## F-statistic: 377.9 on 4 and 995 DF,  p-value: < 2.2e-16
ames_train$fitted1 <- model_full$fitted.values  # fitted values from the Lot.Area model
ames_train$fitted2 <- model_4$fitted.values     # fitted values from the log(Lot.Area) model
ames_train %>%
  ggplot(aes(y = log_price, x = fitted1)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Log(price) vs fitted using Lot.Area")
ames_train %>%
  ggplot(aes(y = log_price, x = fitted2)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Log(price) vs fitted using Log(Lot.Area)")
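For a quantitative comparison, a minimal sketch over the two fits above computes each model's in-sample RMSE; the values should land close to the residual standard errors reported earlier (≈0.279 versus ≈0.266):
# In-sample RMSE of each model; smaller means a tighter fit around the diagonal
sqrt(mean(residuals(model_full)^2))  # Lot.Area model
sqrt(mean(residuals(model_4)^2))     # log(Lot.Area) model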
Using log(Lot.Area) is more appropriate. The two plots above hint at this: the first has a handful of very influential points on the far right, houses that exert undue influence on the slope of the linear model. Quantitatively, moving to log(Lot.Area) lowers the residual standard error from 0.279 to 0.266 and raises R-squared from 0.56 to 0.60.
To see the difference more clearly, it helps to look directly at Lot.Area. Its distribution is strongly right-skewed because of a relatively small number of houses with very large lots. Below we plot log(price) vs Lot.Area, where the large lots are clearly influential and distort the slope of the linear model. If we instead plot log(price) vs log(Lot.Area), the transformed variable is roughly normally distributed and supports a much more reasonable linear relationship between the two variables.
ames_train %>%
  ggplot(aes(x = Lot.Area, y = log_price)) +
  geom_smooth(method = "lm") +
  geom_point() +
  ggtitle("log(price) vs Lot Area\nLinear model compromised by influential points")
ames_train %>%
  ggplot(aes(x = log_lotarea, y = log_price)) +
  geom_smooth(method = "lm") +
  geom_point() +
  ggtitle("log(price) vs log(lot area)\nNo more influential points after log transformation")
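To quantify the skew claim behind these plots (a minimal sketch; a mean far above the median indicates right skew, and the gap nearly vanishes after the transform):
# Lot.Area: mean well above median (right skew); log(Lot.Area): nearly symmetric
summary(ames_train$Lot.Area)
summary(ames_train$log_lotarea)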