First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(tidyverse)
library(BAS)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# Question 1: histogram of house ages (as of 2017)
ames_train %>%
  ggplot(aes(x = 2017 - Year.Built)) +
  geom_histogram(bins = 30) +
  ggtitle("Distribution of Ages of Houses in Data Set\n(As of 2017)") +
  xlab("Age in Years") +
  ylab("Number of houses")
Most houses in the data set are 10-15 years old. A long tail of older homes (80+ years) creates a right skew. Note that if we plotted Year.Built instead of age, the distribution would be mirrored and we would see a left skew.
At this level of granularity, the distribution also appears roughly bimodal, with peaks at 10-20 years and at 35-65 years, and a notable dearth of homes 25-35 years old.
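To back the skew claim numerically, a quick check (a minimal sketch, assuming ames_train is loaded as above) compares the mean and median age, since a mean above the median is characteristic of right skew:
# Right skew pulls the mean above the median
age <- 2017 - ames_train$Year.Built
summary(age)             # five-number summary plus the mean
mean(age) > median(age)  # TRUE under right skew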
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# Question 2: price by neighborhood
ames_train %>%
  ggplot(aes(x = Neighborhood, y = price, group = Neighborhood)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip() +
  ggtitle("Median price and Interquartile Range by Neighborhood")
nhood_summary <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(n = n(), mean = mean(price), median = median(price),
            sd = sd(price), cv = sd / mean)
# Four neighborhoods with the highest median price
nhood_summary %>% arrange(desc(median)) %>% head(4)
## # A tibble: 4 x 6
##   Neighborhood     n     mean   median        sd        cv
##         <fctr> <int>    <dbl>    <dbl>     <dbl>     <dbl>
## 1      StoneBr    20 339316.0 340691.5 123459.10 0.3638469
## 2      NridgHt    57 333646.7 336860.0 105088.90 0.3149706
## 3      NoRidge    28 295844.6 290000.0  35888.97 0.1213102
## 4      GrnHill     2 280000.0 280000.0  70710.68 0.2525381
# Four neighborhoods with the lowest median price
nhood_summary %>% arrange(median) %>% head(4)
## # A tibble: 4 x 6
##   Neighborhood     n      mean median       sd        cv
##         <fctr> <int>     <dbl>  <dbl>    <dbl>     <dbl>
## 1      MeadowV    16  92946.88  85750 18939.78 0.2037699
## 2       BrDale    10  98930.00  98750 13337.59 0.1348184
## 3       IDOTRR    35  97620.69  99500 31530.44 0.3229894
## 4      OldTown    71 120225.61 120000 36429.69 0.3030111
# Four neighborhoods with the highest coefficient of variation
nhood_summary %>% arrange(desc(cv)) %>% head(4)
## # A tibble: 4 x 6
##   Neighborhood     n      mean   median        sd        cv
##         <fctr> <int>     <dbl>    <dbl>     <dbl>     <dbl>
## 1      Edwards    60 136322.02 127250.0  54851.63 0.4023681
## 2      StoneBr    20 339316.05 340691.5 123459.10 0.3638469
## 3      Crawfor    29 204196.55 205000.0  71267.56 0.3490145
## 4       IDOTRR    35  97620.69  99500.0  31530.44 0.3229894
We use the median as the summary statistic for price, and the coefficient of variation (cv = standard deviation divided by mean) as the summary statistic for price variability (heterogeneity).
The median is our preferred summary statistic because of its robustness to high variability and outliers, which matter in several neighborhoods.
With cv (as opposed to the raw standard deviation) we normalize heterogeneity on a dollar-by-dollar basis, answering the question: “If you were to invest $X in a single neighborhood, how much price variability would that $X be exposed to in each neighborhood?”
The most expensive neighborhood is StoneBr ($341,000), followed closely by NridgHt ($337,000). The least expensive is MeadowV ($85,800).
The most heterogeneous neighborhood is Edwards (cv of 0.40), with StoneBr (0.36) close behind. These two neighborhoods sit at opposite extremes of the price spectrum: StoneBr actually has the highest standard deviation of price of any neighborhood, but because it is also the most expensive neighborhood overall, its coefficient of variation is lower than that of Edwards.
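To make the sd-versus-cv point concrete, a short sketch (reusing the nhood_summary table computed above) pulls just these two neighborhoods side by side:
# StoneBr has a larger sd in dollars, yet Edwards has more variation per dollar
nhood_summary %>%
  filter(Neighborhood %in% c("Edwards", "StoneBr")) %>%
  select(Neighborhood, mean, sd, cv)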
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# Question 3: which variables have the most missing values?
count_na <- sapply(ames_train, function(x) sum(is.na(x)))
count_na[count_na > 900]
##        Alley      Pool.QC Misc.Feature 
##          933          997          971
Pool.QC, which records pool quality, has 997 missing values out of 1000: only 3 houses in the data have swimming pools, and every other house has NA for Pool.QC. The outdoor swimming season is fairly short in Ames and a private pool is a high-end amenity, so it makes sense that hardly any homes have one. (Alley and Misc.Feature, the other two variables flagged above, are missing for the same reason: NA means the feature is absent, not that data were lost.)
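A quick check (a minimal sketch against the same data frame) confirms the count of houses that actually have a pool:
# Houses with a recorded pool quality, i.e., houses with a pool
sum(!is.na(ames_train$Pool.QC))   # 3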
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# Question 4: Bayesian model averaging over all candidate models, BIC prior
ames_train <- ames_train %>% mutate(log_price = log(price))
model <- bas.lm(log_price ~ Lot.Area + Land.Slope + Year.Built +
                  Year.Remod.Add + Bedroom.AbvGr,
                prior = "BIC",
                modelprior = uniform(),
                data = ames_train)
summary(model)
##                P(B != 0 | Y)    model 1       model 2      model 3
## Intercept          1.0000000     1.0000     1.0000000     1.000000
## Lot.Area           1.0000000     1.0000     1.0000000     1.000000
## Land.SlopeMod      0.6323994     1.0000     0.0000000     1.000000
## Land.SlopeSev      0.7923033     1.0000     1.0000000     0.000000
## Year.Built         1.0000000     1.0000     1.0000000     1.000000
## Year.Remod.Add     1.0000000     1.0000     1.0000000     1.000000
## Bedroom.AbvGr      1.0000000     1.0000     1.0000000     1.000000
## BF                        NA     1.0000     0.6674353     0.330911
## PostProbs                 NA     0.4752     0.3171000     0.157200
## R2                        NA     0.5625     0.5591000     0.558500
## dim                       NA     7.0000     6.0000000     6.000000
## logmarg                   NA -2198.1673 -2198.5716571 -2199.273250
##                      model 4      model 5
## Intercept          1.0000000 1.000000e+00
## Lot.Area           1.0000000 1.000000e+00
## Land.SlopeMod      0.0000000 0.000000e+00
## Land.SlopeSev      0.0000000 1.000000e+00
## Year.Built         1.0000000 1.000000e+00
## Year.Remod.Add     1.0000000 1.000000e+00
## Bedroom.AbvGr      1.0000000 0.000000e+00
## BF                 0.1061955 1.578929e-12
## PostProbs          0.0505000 0.000000e+00
## R2                 0.5544000 5.316000e-01
## dim                5.0000000 5.000000e+00
## logmarg        -2200.4098178 -2.225342e+03
image(model,rotate = FALSE)
We use Bayesian Model Averaging from the BAS package to enumerate all possible models. The posterior probability of each model is based on its BIC score, the Bayesian Information Criterion (BIC = -2·log-likelihood + k·log n, for k parameters and n observations), which rewards models for accuracy of prediction but penalizes them for each additional parameter. The resulting balance favors models that have good predictive power without being too complicated (i.e., without too many explanatory variables).
The highest-probability model, as shown in the image above, uses all of the candidate variables to predict log(price).
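As a sanity check on how the BIC prior ranks models, the Bayes factor between two candidate models can be approximated from ordinary lm fits as exp(-(ΔBIC)/2). A minimal sketch, comparing model 1 with model 4 from the summary above (the full model versus the model that drops Land.Slope entirely):
# Under the BIC approximation, BF(A : B) ~ exp(-(BIC_A - BIC_B) / 2)
fit_full    <- lm(log_price ~ Lot.Area + Land.Slope + Year.Built +
                    Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
fit_noslope <- update(fit_full, . ~ . - Land.Slope)
exp(-(BIC(fit_noslope) - BIC(fit_full)) / 2)  # should land close to model 4's BF of ~0.106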
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# Question 5: refit the top model with lm to inspect residuals
model_full <- lm(log_price ~ Lot.Area + Land.Slope + Year.Built +
                   Year.Remod.Add + Bedroom.AbvGr,
                 data = ames_train)
summary(model_full)
## 
## Call:
## lm(formula = log_price ~ Lot.Area + Land.Slope + Year.Built + 
##     Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0878 -0.1651 -0.0211  0.1657  0.9945 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.371e+01  8.574e-01 -15.996  < 2e-16 ***
## Lot.Area        1.028e-05  1.106e-06   9.296  < 2e-16 ***
## Land.SlopeMod   1.384e-01  4.991e-02   2.773  0.00565 ** 
## Land.SlopeSev  -4.567e-01  1.514e-01  -3.016  0.00263 ** 
## Year.Built      6.049e-03  3.788e-04  15.968  < 2e-16 ***
## Year.Remod.Add  6.778e-03  5.468e-04  12.395  < 2e-16 ***
## Bedroom.AbvGr   8.686e-02  1.077e-02   8.063 2.12e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared:  0.5625, Adjusted R-squared:  0.5598 
## F-statistic: 212.8 on 6 and 993 DF,  p-value: < 2.2e-16
plot(model_full, which = 1, main = "House #428 has greatest magnitude residual")
ames_train[428, ] %>%
  select(PID, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
## # A tibble: 1 x 6
##         PID Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr
##       <int>    <int>     <fctr>      <int>          <int>         <int>
## 1 902207130     9656        Gtl       1923           1970             2
The greatest squared residual belongs to ames_train[428, ], with PID = 902207130. The model predicts a log(price) of roughly 11.5, but the actual log(price) is about 9.5. A difference of 2 on the natural-log scale is a factor of e^2 ≈ 7.4 in price: roughly $102K predicted versus $13K actual.
Overall.Qual and Overall.Cond are both “poor,” which is true of only 2 out of 1000 houses.
Sale.Condition is abnormal (indicating short sale or foreclosure), which is true of only 61 out of 1000 houses.
Poor quality, poor condition, and an abnormal sale each depress price strongly, and none of these variables appear in our model, so the model badly overpredicts this home's price.
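A short sketch (reusing model_full from above) locates this observation programmatically and checks the predicted-versus-actual gap:
# Index of the largest squared residual, then predicted vs actual price
idx <- which.max(residuals(model_full)^2)   # expect 428
exp(fitted(model_full)[idx])                # predicted price, ~ $102K
ames_train$price[idx]                       # actual sale price, ~ $13K
table(ames_train$Sale.Condition)            # abnormal sales are rare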
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# Question 6: same BMA procedure, with log(Lot.Area) replacing Lot.Area
ames_train <- ames_train %>% mutate(log_lotarea = log(Lot.Area))
model_logarea <- bas.lm(log_price ~ log_lotarea + Land.Slope + Year.Built +
                          Year.Remod.Add + Bedroom.AbvGr,
                        prior = "BIC",
                        modelprior = uniform(),
                        data = ames_train)
summary(model_logarea)
##                P(B != 0 | Y)    model 1       model 2       model 3
## Intercept         1.00000000     1.0000     1.0000000  1.000000e+00
## log_lotarea       1.00000000     1.0000     1.0000000  1.000000e+00
## Land.SlopeMod     0.39325522     0.0000     1.0000000  0.000000e+00
## Land.SlopeSev     0.03645331     0.0000     0.0000000  1.000000e+00
## Year.Built        1.00000000     1.0000     1.0000000  1.000000e+00
## Year.Remod.Add    1.00000000     1.0000     1.0000000  1.000000e+00
## Bedroom.AbvGr     0.99998941     1.0000     1.0000000  1.000000e+00
## BF                        NA     1.0000     0.6494644  3.866294e-02
## PostProbs                 NA     0.5842     0.3794000  2.260000e-02
## R2                        NA     0.6031     0.6055000  6.032000e-01
## dim                       NA     5.0000     6.0000000  6.000000e+00
## logmarg                   NA -2142.5652 -2142.9968106 -2.145818e+03
##                      model 4      model 5
## Intercept       1.000000e+00 1.000000e+00
## log_lotarea     1.000000e+00 1.000000e+00
## Land.SlopeMod   1.000000e+00 0.000000e+00
## Land.SlopeSev   1.000000e+00 0.000000e+00
## Year.Built      1.000000e+00 1.000000e+00
## Year.Remod.Add  1.000000e+00 1.000000e+00
## Bedroom.AbvGr   1.000000e+00 0.000000e+00
## BF              2.374021e-02 1.412793e-05
## PostProbs       1.390000e-02 0.000000e+00
## R2              6.056000e-01 5.913000e-01
## dim             7.000000e+00 4.000000e+00
## logmarg        -2.146306e+03 -2.153733e+03
image(model_logarea,rotate = FALSE)
Using log(Lot.Area) in the Bayesian Model Averaging above, the highest-probability model is different from before: it no longer includes all candidate variables but excludes Land.Slope. At the same time its R2 has increased from 0.56 to 0.60, indicating an improvement in predictive power.
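To confirm programmatically which predictors the top model keeps, one can inspect the fitted bas object (a sketch relying on the components documented in ?bas.lm: postprobs, which, and namesx; the indices in which are 0-based, so we add 1 to index into namesx):
# Predictors included in the highest posterior probability model (HPM)
best <- which.max(model_logarea$postprobs)
model_logarea$namesx[model_logarea$which[[best]] + 1]
# Expect every candidate except the Land.Slope terms, per the summary above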
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide written support, including a quantitative justification, for your answer to the first part of Question 7.
# Question 7: compare fitted vs actual for the Lot.Area and log(Lot.Area) models
model_4 <- lm(log_price ~ log_lotarea + Year.Built + Year.Remod.Add + Bedroom.AbvGr,
              data = ames_train)
summary(model_4)
## 
## Call:
## lm(formula = log_price ~ log_lotarea + Year.Built + Year.Remod.Add + 
##     Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.14609 -0.15825 -0.01477  0.15354  1.01578 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.557e+01  8.213e-01 -18.964  < 2e-16 ***
## log_lotarea     2.471e-01  1.654e-02  14.935  < 2e-16 ***
## Year.Built      5.964e-03  3.604e-04  16.547  < 2e-16 ***
## Year.Remod.Add  6.765e-03  5.197e-04  13.017  < 2e-16 ***
## Bedroom.AbvGr   5.726e-02  1.054e-02   5.434 6.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared:  0.6031, Adjusted R-squared:  0.6015 
## F-statistic: 377.9 on 4 and 995 DF,  p-value: < 2.2e-16
ames_train$fitted1 <- model_full$fitted.values  # fitted values from the Lot.Area model
ames_train$fitted2 <- model_4$fitted.values     # fitted values from the log(Lot.Area) model
ames_train %>%
  ggplot(aes(y = log_price, x = fitted1)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Log(price) vs fitted using Lot.Area")
ames_train %>%
  ggplot(aes(y = log_price, x = fitted2)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Log(price) vs fitted using Log(Lot.Area)")
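For a quantitative comparison, a minimal sketch over the two fits above computes each model's in-sample RMSE; the values should land close to the residual standard errors reported earlier (≈0.279 versus ≈0.266):
# In-sample RMSE of each model; smaller means a tighter fit around the diagonal
sqrt(mean(residuals(model_full)^2))  # Lot.Area model
sqrt(mean(residuals(model_4)^2))     # log(Lot.Area) model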
Using log(Lot.Area) is more appropriate. The two plots above hint at this: the first has a handful of very influential points on the far right, houses that exert undue influence on the slope of the linear model. Quantitatively, moving to log(Lot.Area) lowers the residual standard error from 0.279 to 0.266 and raises R-squared from 0.56 to 0.60.
To see the difference more clearly, it helps to look directly at Lot.Area. Its distribution is strongly right-skewed because of a relatively small number of houses with very large lots. Below we plot log(price) vs Lot.Area, where the large lots are clearly influential and distort the slope of the linear model. If we instead plot log(price) vs log(Lot.Area), the transformed variable is roughly normally distributed and supports a much more reasonable linear relationship between the two variables.
ames_train %>%
  ggplot(aes(x = Lot.Area, y = log_price)) +
  geom_smooth(method = "lm") +
  geom_point() +
  ggtitle("log(price) vs Lot Area\nLinear model compromised by influential points")
ames_train %>%
  ggplot(aes(x = log_lotarea, y = log_price)) +
  geom_smooth(method = "lm") +
  geom_point() +
  ggtitle("log(price) vs log(lot area)\nNo more influential points after log transformation")
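To quantify the skew claim behind these plots (a minimal sketch; a mean far above the median indicates right skew, and the gap nearly vanishes after the transform):
# Lot.Area: mean well above median (right skew); log(Lot.Area): nearly symmetric
summary(ames_train$Lot.Area)
summary(ames_train$log_lotarea)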