Let us load the necessary packages, and then load the dataset required to perform the analysis.

library(MASS)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(devtools)
## Loading required package: usethis
library(BAS)

Load the dataset:

load("ames_train.Rdata.gz")
glimpse(ames_train)
## Rows: 1,000
## Columns: 81
## $ PID             <int> 909176150, 905476230, 911128020, 535377150, 534177230,~
## $ area            <int> 856, 1049, 1001, 1039, 1665, 1922, 936, 1246, 889, 107~
## $ price           <int> 126000, 139500, 124900, 114000, 227000, 198500, 93000,~
## $ MS.SubClass     <int> 30, 120, 30, 70, 60, 85, 20, 20, 20, 180, 120, 60, 30,~
## $ MS.Zoning       <fct> RL, RL, C (all), RL, RL, RL, RM, RL, RL, RM, RL, RL, R~
## $ Lot.Frontage    <int> NA, 42, 60, 80, 70, 64, 60, 53, 74, 35, 48, 63, 62, NA~
## $ Lot.Area        <int> 7890, 4235, 6060, 8146, 8400, 7301, 6000, 3710, 12395,~
## $ Street          <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ~
## $ Alley           <fct> NA, NA, NA, NA, NA, NA, Pave, NA, NA, NA, NA, NA, NA, ~
## $ Lot.Shape       <fct> Reg, Reg, Reg, Reg, Reg, Reg, Reg, Reg, Reg, Reg, Reg,~
## $ Land.Contour    <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Bnk, Lvl, Lvl, Lvl, Lvl,~
## $ Utilities       <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPub~
## $ Lot.Config      <fct> Corner, Inside, Inside, Corner, Inside, Corner, Inside~
## $ Land.Slope      <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Mod, Gtl, Gtl, Gtl, Gtl,~
## $ Neighborhood    <fct> SWISU, Edwards, IDOTRR, OldTown, NWAmes, Edwards, OldT~
## $ Condition.1     <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ~
## $ Condition.2     <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ~
## $ Bldg.Type       <fct> 1Fam, TwnhsE, 1Fam, 1Fam, 1Fam, 1Fam, 2fmCon, 1Fam, 1F~
## $ House.Style     <fct> 1Story, 1Story, 1Story, 2Story, 2Story, SFoyer, 1Story~
## $ Overall.Qual    <int> 6, 5, 5, 4, 8, 7, 4, 7, 5, 6, 8, 5, 4, 6, 7, 6, 7, 3, ~
## $ Overall.Cond    <int> 6, 5, 9, 8, 6, 5, 4, 5, 6, 5, 5, 5, 6, 5, 5, 5, 5, 3, ~
## $ Year.Built      <int> 1939, 1984, 1930, 1900, 2001, 2003, 1953, 2007, 1984, ~
## $ Year.Remod.Add  <int> 1950, 1984, 2007, 2003, 2001, 2003, 1953, 2008, 1984, ~
## $ Roof.Style      <fct> Gable, Gable, Hip, Gable, Gable, Gable, Gable, Gable, ~
## $ Roof.Matl       <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompShg, ~
## $ Exterior.1st    <fct> Wd Sdng, HdBoard, MetalSd, MetalSd, VinylSd, HdBoard, ~
## $ Exterior.2nd    <fct> Wd Sdng, HdBoard, MetalSd, MetalSd, VinylSd, HdBoard, ~
## $ Mas.Vnr.Type    <fct> None, BrkFace, None, None, None, BrkFace, None, BrkFac~
## $ Mas.Vnr.Area    <int> 0, 149, 0, 0, 0, 500, 0, 20, 0, 76, 196, 0, 0, 247, 11~
## $ Exter.Qual      <fct> TA, Gd, Gd, Gd, Gd, Gd, Fa, Gd, TA, TA, Gd, TA, TA, TA~
## $ Exter.Cond      <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA~
## $ Foundation      <fct> CBlock, CBlock, BrkTil, BrkTil, PConc, Slab, CBlock, P~
## $ Bsmt.Qual       <fct> TA, Gd, TA, Fa, Gd, NA, Fa, Gd, TA, Gd, Gd, Gd, Fa, Gd~
## $ Bsmt.Cond       <fct> TA, TA, TA, TA, TA, NA, TA, TA, TA, TA, TA, TA, TA, TA~
## $ Bsmt.Exposure   <fct> No, Mn, No, No, No, NA, No, Gd, No, Gd, No, No, No, No~
## $ BsmtFin.Type.1  <fct> Rec, GLQ, ALQ, Unf, GLQ, NA, Unf, Unf, ALQ, GLQ, GLQ, ~
## $ BsmtFin.SF.1    <int> 238, 552, 737, 0, 643, 0, 0, 0, 647, 467, 24, 458, 0, ~
## $ BsmtFin.Type.2  <fct> Unf, ALQ, Unf, Unf, Unf, NA, Unf, Unf, Unf, Unf, Unf, ~
## $ BsmtFin.SF.2    <int> 0, 393, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Bsmt.Unf.SF     <int> 618, 104, 100, 405, 167, 0, 936, 1146, 217, 80, 1318, ~
## $ Total.Bsmt.SF   <int> 856, 1049, 837, 405, 810, 0, 936, 1146, 864, 547, 1342~
## $ Heating         <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ~
## $ Heating.QC      <fct> TA, TA, Ex, Gd, Ex, Ex, TA, Ex, TA, Ex, Ex, Gd, TA, Gd~
## $ Central.Air     <fct> Y, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, N, ~
## $ Electrical      <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr~
## $ X1st.Flr.SF     <int> 856, 1049, 1001, 717, 810, 495, 936, 1246, 889, 1072, ~
## $ X2nd.Flr.SF     <int> 0, 0, 0, 322, 855, 1427, 0, 0, 0, 0, 0, 650, 0, 0, 0, ~
## $ Low.Qual.Fin.SF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Bsmt.Full.Bath  <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, ~
## $ Bsmt.Half.Bath  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Full.Bath       <int> 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, ~
## $ Half.Bath       <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, ~
## $ Bedroom.AbvGr   <int> 2, 2, 2, 2, 3, 4, 2, 2, 3, 2, 2, 3, 2, 3, 3, 3, 3, 2, ~
## $ Kitchen.AbvGr   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ Kitchen.Qual    <fct> TA, Gd, Gd, TA, Gd, Gd, TA, Gd, TA, Gd, Gd, TA, TA, TA~
## $ TotRms.AbvGrd   <int> 4, 5, 5, 6, 6, 7, 4, 5, 6, 5, 6, 6, 5, 6, 7, 7, 6, 5, ~
## $ Functional      <fct> Typ, Typ, Typ, Typ, Typ, Typ, Min2, Typ, Typ, Typ, Typ~
## $ Fireplaces      <int> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, ~
## $ Fireplace.Qu    <fct> Gd, NA, NA, NA, NA, Ex, NA, Gd, NA, NA, Gd, NA, NA, Gd~
## $ Garage.Type     <fct> Detchd, Attchd, Detchd, Detchd, Attchd, BuiltIn, Detch~
## $ Garage.Yr.Blt   <int> 1939, 1984, 1930, 1940, 2001, 2003, 1974, 2007, 1984, ~
## $ Garage.Finish   <fct> Unf, Fin, Unf, Unf, Fin, RFn, Unf, Fin, Unf, Fin, RFn,~
## $ Garage.Cars     <int> 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, ~
## $ Garage.Area     <int> 399, 266, 216, 281, 528, 672, 576, 428, 484, 525, 550,~
## $ Garage.Qual     <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA~
## $ Garage.Cond     <fct> TA, TA, Po, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA~
## $ Paved.Drive     <fct> Y, Y, N, N, Y, Y, Y, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, N, ~
## $ Wood.Deck.SF    <int> 0, 0, 154, 0, 0, 0, 0, 100, 0, 0, 0, 22, 0, 0, 192, 14~
## $ Open.Porch.SF   <int> 0, 105, 0, 0, 45, 0, 32, 24, 0, 44, 35, 0, 0, 76, 74, ~
## $ Enclosed.Porch  <int> 0, 0, 42, 168, 0, 177, 112, 0, 0, 0, 0, 0, 128, 0, 0, ~
## $ X3Ssn.Porch     <int> 0, 0, 86, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Screen.Porch    <int> 166, 0, 0, 111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 185, 0, 0, ~
## $ Pool.Area       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Pool.QC         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ Fence           <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, GdWo, NA, ~
## $ Misc.Feature    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ Misc.Val        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Mo.Sold         <int> 3, 2, 11, 5, 11, 7, 2, 3, 4, 5, 2, 3, 11, 7, 12, 10, 1~
## $ Yr.Sold         <int> 2010, 2009, 2007, 2009, 2009, 2009, 2009, 2008, 2008, ~
## $ Sale.Type       <fct> WD , WD , WD , WD , WD , ConLD, WD , New, WD , WD , WD~
## $ Sale.Condition  <fct> Normal, Normal, Normal, Normal, Normal, Normal, Normal~

Q1: Create a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.

year_present <- as.numeric(format(Sys.Date(), "%Y"))  # current calendar year

ames_train$Age <- year_present - ames_train$Year.Built  # age = years since construction

ggplot(ames_train, aes(x = Age)) +
  geom_histogram(bins = 30, alpha = 0.8, color = "black") +
  xlab("Age") +
  ggtitle("Age of Homes") +
  theme(plot.title = element_text(hjust = 0.5))

summary(ames_train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.0    20.0    46.0    48.8    66.0   149.0
sd(ames_train$Age)
## [1] 29.63741

The newest home in this dataset was built 11 years ago, and the oldest 149 years ago. The mean age of homes is approximately 48.8 years and the median 46 years; since the mean exceeds the median, the distribution is right-skewed. It is also multimodal: the peaks in the histogram show that newer homes dominate, with smaller clusters of older construction. Because of the skew, the median is the preferred summary of center rather than the mean. The standard deviation is about 29.6 years.
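As a quick visual check (my own addition, not part of the assignment), overlaying the mean and median on the same histogram makes the right skew apparent:

ggplot(ames_train, aes(x = Age)) +
  geom_histogram(bins = 30, alpha = 0.8, color = "black") +
  geom_vline(xintercept = mean(ames_train$Age), linetype = "dashed") +    # mean
  geom_vline(xintercept = median(ames_train$Age), linetype = "dotted") +  # median
  ggtitle("Age of Homes, with mean (dashed) and median (dotted)")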

Q2: The mantra in real estate is “Location, Location, Location!” Let’s explore this by making a graphical display that relates home price to neighborhood in Ames, Iowa. Which summary statistics are most appropriate for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods?

# Rank neighborhoods by median price, then reorder the factor levels so the
# boxplot is sorted from most to least expensive
by_median <- order(as.numeric(by(ames_train$price, ames_train$Neighborhood, median)), decreasing = TRUE)

ames_train$Neighborhood <- ordered(ames_train$Neighborhood, levels = levels(ames_train$Neighborhood)[by_median])

boxplot(price ~ Neighborhood, data = ames_train, las = 2, ylab = "Price", main = "House price by neighborhood in Ames, Iowa", col = "gold")

The median is the most appropriate statistic for identifying the most and least expensive neighborhoods, since it is robust to the outliers and skew typical of housing prices. For the most heterogeneous neighborhood, the standard deviation is the natural choice, as it directly measures the spread of prices.

See below for those values for each neighborhood.

ames_train %>% 
  group_by(Neighborhood) %>% 
  summarise(median=median(price, na.rm=TRUE), Stdev=sd(price, na.rm=TRUE)) %>%
  arrange(desc(median))
## # A tibble: 27 x 3
##    Neighborhood  median   Stdev
##    <ord>          <dbl>   <dbl>
##  1 StoneBr      340692. 123459.
##  2 NridgHt      336860  105089.
##  3 NoRidge      290000   35889.
##  4 GrnHill      280000   70711.
##  5 Timber       232500   84030.
##  6 Somerst      221650   65199.
##  7 Greens       212625   29063.
##  8 Veenker      205750   72545.
##  9 Crawfor      205000   71268.
## 10 CollgCr      195800   52786.
## # ... with 17 more rows

The most expensive neighborhood is StoneBr, with a median price of $340,691.50.

The least expensive neighborhood is MeadowV, with a median price of $85,750.

The most heterogeneous neighborhood is StoneBr, with a standard deviation of $123,459.10.
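The extremes can also be pulled programmatically rather than read off the truncated tibble; a sketch (the nbhd_stats name is mine, and slice_max/slice_min require dplyr >= 1.0):

nbhd_stats <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(median = median(price, na.rm = TRUE),
            Stdev  = sd(price, na.rm = TRUE))

nbhd_stats %>% slice_max(median, n = 1)  # most expensive neighborhood
nbhd_stats %>% slice_min(median, n = 1)  # least expensive neighborhood
nbhd_stats %>% slice_max(Stdev, n = 1)   # most heterogeneous neighborhood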

Q3: Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

sort(sapply(ames_train, function(x) sum(is.na(x))), decreasing = TRUE)  # NA count per column, descending
##         Pool.QC    Misc.Feature           Alley           Fence    Fireplace.Qu 
##             997             971             933             798             491 
##    Lot.Frontage   Garage.Yr.Blt     Garage.Qual     Garage.Cond     Garage.Type 
##             167              48              47              47              46 
##   Garage.Finish       Bsmt.Qual       Bsmt.Cond   Bsmt.Exposure  BsmtFin.Type.1 
##              46              21              21              21              21 
##  BsmtFin.Type.2    Mas.Vnr.Area    BsmtFin.SF.1    BsmtFin.SF.2     Bsmt.Unf.SF 
##              21               7               1               1               1 
##   Total.Bsmt.SF  Bsmt.Full.Bath  Bsmt.Half.Bath     Garage.Cars     Garage.Area 
##               1               1               1               1               1 
##             PID            area           price     MS.SubClass       MS.Zoning 
##               0               0               0               0               0 
##        Lot.Area          Street       Lot.Shape    Land.Contour       Utilities 
##               0               0               0               0               0 
##      Lot.Config      Land.Slope    Neighborhood     Condition.1     Condition.2 
##               0               0               0               0               0 
##       Bldg.Type     House.Style    Overall.Qual    Overall.Cond      Year.Built 
##               0               0               0               0               0 
##  Year.Remod.Add      Roof.Style       Roof.Matl    Exterior.1st    Exterior.2nd 
##               0               0               0               0               0 
##    Mas.Vnr.Type      Exter.Qual      Exter.Cond      Foundation         Heating 
##               0               0               0               0               0 
##      Heating.QC     Central.Air      Electrical     X1st.Flr.SF     X2nd.Flr.SF 
##               0               0               0               0               0 
## Low.Qual.Fin.SF       Full.Bath       Half.Bath   Bedroom.AbvGr   Kitchen.AbvGr 
##               0               0               0               0               0 
##    Kitchen.Qual   TotRms.AbvGrd      Functional      Fireplaces     Paved.Drive 
##               0               0               0               0               0 
##    Wood.Deck.SF   Open.Porch.SF  Enclosed.Porch     X3Ssn.Porch    Screen.Porch 
##               0               0               0               0               0 
##       Pool.Area        Misc.Val         Mo.Sold         Yr.Sold       Sale.Type 
##               0               0               0               0               0 
##  Sale.Condition             Age 
##               0               0
length(which(ames_train$Pool.Area==0))
## [1] 997

The variable with the greatest number of missing values is Pool.QC (pool quality). This makes sense: most homes in Ames do not have pools, and in the Ames data documentation an NA in Pool.QC denotes “no pool.” Consistent with this, the number of missing values in Pool.QC (997) matches the number of homes with Pool.Area == 0.
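As a cross-check, a one-line sketch verifying that the missing Pool.QC values fall exactly on the homes with zero pool area (this should return TRUE):

all(is.na(ames_train$Pool.QC) == (ames_train$Pool.Area == 0))  # NA pattern matches the no-pool homes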

Q4: We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

I use forward selection on adjusted R²: starting from the empty model, at each step the single predictor whose addition produces the largest increase in adjusted R² is added, and the search stops once no remaining predictor improves it. The chunk below performs that search manually, one step at a time.

ames_train <- ames_train %>% mutate(price_ln = log(price))
summary(lm(ames_train$price_ln ~ ames_train$Lot.Area))$adj.r.squared
## [1] 0.05786118
summary(lm(ames_train$price_ln ~ ames_train$Land.Slope))$adj.r.squared
## [1] 0.005950438
summary(lm(ames_train$price_ln ~ ames_train$Year.Built))$adj.r.squared
## [1] 0.3947132
summary(lm(ames_train$price_ln ~ ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.3674833
summary(lm(ames_train$price_ln ~ ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.03830868
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area))$adj.r.squared
## [1] 0.450636
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Land.Slope))$adj.r.squared
## [1] 0.4030474
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.4717006
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.4403228
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area))$adj.r.squared
## [1] 0.5239686
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Land.Slope))$adj.r.squared
## [1] 0.4803236
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.512633
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Land.Slope))$adj.r.squared
## [1] 0.5314859
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.5526062
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope))$adj.r.squared
## [1] 0.5598345
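The same stepwise search can be automated. Below is a minimal sketch of forward selection on adjusted R²; the forward_adjr2 helper and its interface are my own illustration, not part of the assignment:

# Forward selection on adjusted R^2: repeatedly add the candidate predictor
# that most increases adjusted R^2; stop when no addition improves it.
forward_adjr2 <- function(data, response, candidates) {
  selected <- character(0)
  best_adj <- -Inf
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    adj <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    if (max(adj) <= best_adj) break
    best_adj <- max(adj)
    selected <- c(selected, names(which.max(adj)))
  }
  reformulate(selected, response = response)
}

# Should recover the same five-predictor model found manually above
forward_adjr2(ames_train, "price_ln",
              c("Lot.Area", "Land.Slope", "Year.Built",
                "Year.Remod.Add", "Bedroom.AbvGr"))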
The final model from forward selection on adjusted R² is fit and stored below:

(ames_full <- lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope))
## 
## Call:
## lm(formula = ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + 
##     ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope)
## 
## Coefficients:
##               (Intercept)      ames_train$Year.Built  
##                -1.371e+01                  6.049e-03  
## ames_train$Year.Remod.Add        ames_train$Lot.Area  
##                 6.778e-03                  1.028e-05  
##  ames_train$Bedroom.AbvGr   ames_train$Land.SlopeMod  
##                 8.686e-02                  1.384e-01  
##  ames_train$Land.SlopeSev  
##                -4.567e-01
ggplot(ames_train, aes(x = ames_full$residuals)) + geom_histogram(bins = 30)  # histogram of residuals

plot(ames_full$residuals, col = "red")  # residuals vs. observation index
abline(h = 0, lty = 2)                  # reference line at zero

qqnorm(ames_full$residuals, col = "red")  # normal Q-Q plot of the residuals
qqline(ames_full$residuals)

stepAIC(ames_full, direction = "backward", trace = TRUE)  # backward elimination by AIC
## Start:  AIC=-2545.77
## ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + 
##     ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope
## 
##                             Df Sum of Sq    RSS     AIC
## <none>                                   77.322 -2545.8
## - ames_train$Land.Slope      2    1.4281 78.750 -2531.5
## - ames_train$Bedroom.AbvGr   1    5.0628 82.385 -2484.3
## - ames_train$Lot.Area        1    6.7292 84.051 -2464.3
## - ames_train$Year.Remod.Add  1   11.9642 89.286 -2403.9
## - ames_train$Year.Built      1   19.8546 97.177 -2319.2
## 
## Call:
## lm(formula = ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + 
##     ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope)
## 
## Coefficients:
##               (Intercept)      ames_train$Year.Built  
##                -1.371e+01                  6.049e-03  
## ames_train$Year.Remod.Add        ames_train$Lot.Area  
##                 6.778e-03                  1.028e-05  
##  ames_train$Bedroom.AbvGr   ames_train$Land.SlopeMod  
##                 8.686e-02                  1.384e-01  
##  ames_train$Land.SlopeSev  
##                -4.567e-01
ames_bas <- bas.lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr,
       prior="BIC",
       modelprior=uniform(),
       data=na.omit(ames_train))
ames_bas
## 
## Call:
## bas.lm(formula = ames_train$price_ln ~ ames_train$Year.Built + 
##     ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr, 
##     data = na.omit(ames_train), prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##                 Intercept      ames_train$Year.Built  
##                         1                          1  
## ames_train$Year.Remod.Add        ames_train$Lot.Area  
##                         1                          1  
##  ames_train$Bedroom.AbvGr  
##                         1
confint(coef(ames_bas))
##                                   2.5%        97.5%         beta
## Intercept                 1.200058e+01 1.203540e+01 1.201847e+01
## ames_train$Year.Built     5.250789e-03 6.745395e-03 6.018565e-03
## ames_train$Year.Remod.Add 5.781670e-03 7.936535e-03 6.888619e-03
## ames_train$Lot.Area       6.853953e-06 1.044256e-05 8.697411e-06
## ames_train$Bedroom.AbvGr  6.528917e-02 1.076242e-01 8.703660e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(ames_bas)
##                           P(B != 0 | Y)    model 1       model 2       model 3
## Intercept                             1     1.0000  1.000000e+00  1.000000e+00
## ames_train$Year.Built                 1     1.0000  1.000000e+00  1.000000e+00
## ames_train$Year.Remod.Add             1     1.0000  1.000000e+00  1.000000e+00
## ames_train$Lot.Area                   1     1.0000  1.000000e+00  0.000000e+00
## ames_train$Bedroom.AbvGr              1     1.0000  0.000000e+00  1.000000e+00
## BF                                   NA     1.0000  6.442574e-13  4.998273e-18
## PostProbs                            NA     1.0000  0.000000e+00  0.000000e+00
## R2                                   NA     0.5544  5.254000e-01  5.141000e-01
## dim                                  NA     5.0000  4.000000e+00  4.000000e+00
## logmarg                              NA -2200.4098 -2.228480e+03 -2.240247e+03
##                                 model 4       model 5
## Intercept                  1.000000e+00  1.000000e+00
## ames_train$Year.Built      1.000000e+00  1.000000e+00
## ames_train$Year.Remod.Add  0.000000e+00  1.000000e+00
## ames_train$Lot.Area        1.000000e+00  0.000000e+00
## ames_train$Bedroom.AbvGr   1.000000e+00  0.000000e+00
## BF                         5.860164e-31  2.943579e-34
## PostProbs                  0.000000e+00  0.000000e+00
## R2                         4.843000e-01  4.728000e-01
## dim                        4.000000e+00  3.000000e+00
## logmarg                   -2.270022e+03 -2.277618e+03
image(ames_bas, rotate=FALSE)
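For completeness, model-averaged fitted values can be extracted from the BAS object; a minimal sketch (the bma_fit name is mine):

bma_fit <- fitted(ames_bas, estimator = "BMA")  # fitted values averaged over models
head(bma_fit)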

For the ames_full model (our final model), the histogram of residuals is nearly normal, the scatterplot of residuals shows random scatter around 0, and the normal Q-Q plot is close to linear, so the conditions required for a valid linear model are satisfied.

The model with the highest adjusted R² is the one containing all of the variables; the same full model also has the lowest AIC under backward elimination. Under Bayesian model averaging (which here was run on the four quantitative candidates, with Land.Slope omitted from the bas.lm call), every candidate has posterior inclusion probability 1, so the top model again keeps everything offered to it.

The three methods I used to select a model are:

  1. Forward selection using adjusted R²
  2. Backward elimination with stepAIC, using AIC as the selection criterion
  3. Bayesian model averaging with bas.lm (BIC prior, uniform model prior)

All three methods point to the same result: the full model is the best model.
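As a side note, the same backward search can be run under BIC rather than AIC by setting stepAIC's per-parameter penalty k to log(n); a quick sketch, not part of the original analysis:

n <- nrow(ames_train)  # 1,000 observations
stepAIC(ames_full, direction = "backward", trace = FALSE, k = log(n))  # BIC penalty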

Q5: Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual, and why are those factors relevant)?

ames_residuals <- residuals(ames_full)             # residuals from the Q4 model
ames_residuals_squared <- ames_residuals^2         # squared residuals
ames_train$PID[which.max(ames_residuals_squared)]  # PID of the worst-fit home
## [1] 902207130
which.max(ames_residuals_squared)
## 428 
## 428
min(ames_train$price)
## [1] 12789

The home with the largest squared residual has PID 902207130. Its index is 428 in the ames_train dataset.

One salient feature of this home is its price: $12,789, the lowest in the dataset and far below the other homes. It is also one of the older homes, built in 1923; its age likely depresses its price, since older homes are less likely than newer ones to be in good condition. It has no central air and no paved drive, and its exterior is asbestos shingles, a material now known to be a health risk and no longer used. Finally, it was sold under abnormal circumstances (a trade, foreclosure, or short sale); homes sold under such conditions are plausibly more likely to be in disrepair, because funds for upkeep were unavailable.
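A sketch pulling the fields cited above for the flagged observation (the column list is my own selection):

ames_train[428, c("PID", "price", "Year.Built", "Central.Air",
                  "Paved.Drive", "Exterior.1st", "Sale.Condition")]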

Q6: Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

ames_train <- ames_train %>%
  mutate(Lot.Area.log = log(Lot.Area))
summary(lm(ames_train$price_ln ~ ames_train$Lot.Area.log))$adj.r.squared
## [1] 0.1368512
summary(lm(ames_train$price_ln ~ ames_train$Land.Slope))$adj.r.squared
## [1] 0.005950438
summary(lm(ames_train$price_ln ~ ames_train$Year.Built))$adj.r.squared
## [1] 0.3947132
summary(lm(ames_train$price_ln ~ ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.3674833
summary(lm(ames_train$price_ln ~ ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.03830868
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log))$adj.r.squared
## [1] 0.5209301
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Land.Slope))$adj.r.squared
## [1] 0.4030474
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.4717006
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.4403228
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Land.Slope))$adj.r.squared
## [1] 0.5225054
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.5900694
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.5340868
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Land.Slope + ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.5910888
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Land.Slope + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.5363921
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Land.Slope + ames_train$Year.Remod.Add + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.6032018
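Reusing the illustrative forward_adjr2 helper sketched in Question 4, with Lot.Area.log substituted for Lot.Area, should reproduce this search:

forward_adjr2(ames_train, "price_ln",
              c("Lot.Area.log", "Land.Slope", "Year.Built",
                "Year.Remod.Add", "Bedroom.AbvGr"))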
ames_full_log <- lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr)  # note: Land.Slope omitted, so the searches below start from four predictors
stepAIC(ames_full_log, direction = "backward", trace = TRUE)  # backward elimination by AIC
## Start:  AIC=-2647.16
## ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + 
##     ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr
## 
##                             Df Sum of Sq    RSS     AIC
## <none>                                   70.147 -2647.2
## - ames_train$Bedroom.AbvGr   1    2.0816 72.228 -2619.9
## - ames_train$Year.Remod.Add  1   11.9455 82.092 -2491.9
## - ames_train$Lot.Area.log    1   15.7256 85.872 -2446.9
## - ames_train$Year.Built      1   19.3032 89.450 -2406.1
## 
## Call:
## lm(formula = ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + 
##     ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr)
## 
## Coefficients:
##               (Intercept)      ames_train$Year.Built  
##                -15.574204                   0.005963  
## ames_train$Year.Remod.Add    ames_train$Lot.Area.log  
##                  0.006765                   0.247077  
##  ames_train$Bedroom.AbvGr  
##                  0.057258
(ames_log_bas <- bas.lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr,
       prior="BIC",
       modelprior=uniform(),
       data=na.omit(ames_train)))
## 
## Call:
## bas.lm(formula = ames_train$price_ln ~ ames_train$Year.Built + 
##     ames_train$Year.Remod.Add + ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr, 
##     data = na.omit(ames_train), prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##                 Intercept      ames_train$Year.Built  
##                         1                          1  
## ames_train$Year.Remod.Add    ames_train$Lot.Area.log  
##                         1                          1  
##  ames_train$Bedroom.AbvGr  
##                         1

Using log(Lot.Area) does change the final model for two of the methods: backward elimination with stepAIC and Bayesian model averaging both settle on the four-predictor model without Land.Slope (note that ames_full_log, their starting point, was specified without Land.Slope). The model from forward selection on adjusted R² did not change: it again retains all five predictors.

Q7: Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide written support, including a quantitative justification, for your answer to the first part of this question.

ames_pred <- predict(ames_full, ames_train)            # fitted log-price from the Lot.Area model
plot(ames_pred, ames_train$price)                      # predicted log(price) vs. observed price
abline(lm(ames_train$price ~ ames_pred), col = "red")  # least-squares trend line

summary(lm(ames_train$price ~ ames_pred))$r.squared
## [1] 0.4847039
ames_log_pred <- predict(ames_full_log, ames_train)        # fitted log-price from the log(Lot.Area) model
plot(ames_log_pred, ames_train$price)                      # predicted log(price) vs. observed price
abline(lm(ames_train$price ~ ames_log_pred), col = "red")  # least-squares trend line

summary(lm(ames_train$price ~ ames_log_pred))$r.squared
## [1] 0.5186156

The plot using Lot.Area.log tracks the observed prices more closely than the plot using Lot.Area.

The R² of lm(ames_train$price ~ ames_pred) is 0.4847, which is less than the R² of lm(ames_train$price ~ ames_log_pred), 0.5186. The model with Lot.Area.log therefore explains more of the variation in price, supporting the log transform.
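For a display that matches the question's wording (predicted versus observed log price), here is a sketch; the axis labels and the reference line are my own choices:

par(mfrow = c(1, 2))  # two panels side by side
plot(ames_pred, ames_train$price_ln,
     xlab = "Predicted log(price), Lot.Area model",
     ylab = "Observed log(price)")
abline(0, 1, col = "red")  # perfect-prediction line
plot(ames_log_pred, ames_train$price_ln,
     xlab = "Predicted log(price), log(Lot.Area) model",
     ylab = "Observed log(price)")
abline(0, 1, col = "red")
par(mfrow = c(1, 1))  # restore single-panel layout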