1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")
load("ames_test.Rdata")
load("ames_validation.Rdata")

Use the code block below to load any necessary packages

#library(statsr)
library(dplyr)
library(BAS)
## Warning: package 'BAS' was built under R version 3.4.4
library(ggplot2)
library(gridExtra)
library(MASS)
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.4
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
library(purrr)
## Warning: package 'purrr' was built under R version 3.4.4
library(forcats)
## Warning: package 'forcats' was built under R version 3.4.4
library(cowplot)
## Warning: package 'cowplot' was built under R version 3.4.4
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.4.4
library(reshape2)

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


I’ll start with a simple inspection of the data set encoding. The dataset codebook (https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt) explains the data include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and an observation identifier). I am tempted to recode the ordinal “quality” and “condition” variabes (Exter.Qual, Exter.Cond, Bsmt.Qual, Bsmt.Cond, and Kitchen.Qual) as numeric, just as Overall.Qual and Overall.Cond are, but I’m cautioned against assuming the numerical distance between each set of subsequent categories is equal (see https://www.theanalysisfactor.com/pros-and-cons-of-treating-ordinal-variables-as-nominal-or-continuous/).

str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  81 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...

Evaluate missing values because the linear regression model will discard observations with missing values in a predictor or response field. 25 variables have missing values. The codebook explains that many variables are coded NA for “none”. I will recode the nulls to “None”. In the case of continuous variable Lot.Frontage, I will code NA to 0 because for “no frontage” because I see no values of 0 in the data. In the case of discrete variable Garage.Yr.Built, I will code it to the median garage built year, 1979. I wil leave the remaining NAs as NA.

I defined function replace_na to apply these rules to all data sets (training, test, and validation).

#na_count <- colSums(is.na(ames_train))
#head(sort(na_count, decreasing = TRUE),5)
ames_train %>%
  summarise_all(funs(sum(is.na(.)) / n())) %>%
  gather(key = "var", value = "pct") %>%
  filter(pct>0) %>%
  arrange(desc(pct))
## # A tibble: 25 x 2
##    var              pct
##    <chr>          <dbl>
##  1 Pool.QC       0.997 
##  2 Misc.Feature  0.971 
##  3 Alley         0.933 
##  4 Fence         0.798 
##  5 Fireplace.Qu  0.491 
##  6 Lot.Frontage  0.167 
##  7 Garage.Yr.Blt 0.0480
##  8 Garage.Qual   0.0470
##  9 Garage.Cond   0.0470
## 10 Garage.Type   0.0460
## # ... with 15 more rows
replace_na <- function(df) {
  df %>%
    mutate(
    # give missing value an explicit factor level
    Pool.QC = fct_explicit_na(Pool.QC, na_level = "None"),
    Misc.Feature = fct_explicit_na(Pool.QC, na_level = "None"),
    Alley = fct_explicit_na(Pool.QC, na_level = "None"),
    Fence = fct_explicit_na(Pool.QC, na_level = "None"),
    Fireplace.Qu = fct_explicit_na(Fireplace.Qu, na_level = "None"),
    Lot.Frontage = ifelse(is.na(Lot.Frontage), 0, Garage.Yr.Blt),
    Garage.Yr.Blt = ifelse(is.na(Garage.Yr.Blt), 1979, Garage.Yr.Blt),
    Garage.Qual = fct_explicit_na(Garage.Qual, na_level = "None"),
    Garage.Cond = fct_explicit_na(Garage.Cond, na_level = "None"),
    Garage.Type = fct_explicit_na(Garage.Type, na_level = "None"),
    Garage.Finish = fct_explicit_na(Garage.Finish, na_level = "None"),
    Bsmt.Qual = fct_explicit_na(Bsmt.Qual, na_level = "None"),
    Bsmt.Cond = fct_explicit_na(Bsmt.Cond, na_level = "None"),
    Bsmt.Exposure = fct_explicit_na(Bsmt.Exposure, na_level = "None"),
    BsmtFin.Type.1 = fct_explicit_na(BsmtFin.Type.1, na_level = "None"),
    BsmtFin.Type.2 = fct_explicit_na(BsmtFin.Type.2, na_level = "None"),
    #Mas.Vnr.Area
    #BsmtFin.SF.1
    #BsmtFin.SF.2
    Bsmt.Exposure = fct_explicit_na(Bsmt.Exposure, na_level = "None"),
    #Bsmt.Unf.SF
    #Total.Bsmt.SF
    Bsmt.Full.Bath = ifelse(is.na(Bsmt.Full.Bath), 0, Bsmt.Full.Bath),
    Bsmt.Half.Bath = ifelse(is.na(Bsmt.Half.Bath), 0, Bsmt.Half.Bath)
    #Garage.Cars
    #Garage.Area
  )
}
ames_train <- ames_train %>%
  replace_na()

I will transform some data to create new variables based on industry knowledge. I will create a new variable for lot area to account for the diminishing returns to lot size, and for bathrooms to account for the differing values of half baths.

Homes built on large lots typically sell for higher prices, all else being equal. However, there are diminishing returns to additional acreage. Realtors have constructed lot size categories which correspond to approximately equal-sized increases in sale price (see http://ww2.amstat.org/publications/jse/v16n2/datasets.pardoe.html). The first pair of plots below shows the result of categorization. The data shows a more linear relationship between the explantory and response variable.

cat_lot_area <- function(df) {
  df %>%
    mutate(
      Lot.Area.Cat = ifelse(Lot.Area<3000, 1, 
            ifelse(Lot.Area<5000, 2,
            ifelse(Lot.Area<7000, 3,
            ifelse(Lot.Area<10000, 4,
            ifelse(Lot.Area<15000, 5,
            ifelse(Lot.Area<20000, 6, 
            ifelse(Lot.Area<43560, 7, 
            ifelse(Lot.Area<43560*3, 8, 
            ifelse(Lot.Area<43560*5, 9, 
            ifelse(Lot.Area<43560*10, 10, 
            ifelse(Lot.Area<43560*20, 11,12)))))))))))
    )
}
#ames_train$Lot.Cat.Area <- as.numeric(NA)
ames_train <- ames_train %>%
  cat_lot_area()
# Analysis of lot size.
p1 <- ggplot(ames_train, aes(x = Lot.Area, y = price)) +
  geom_point() +
  geom_smooth(se = FALSE, method="lm") +
  ggtitle("Raw Data")
p2 <- ggplot(ames_train, aes(x = Lot.Area.Cat, y = price)) +  
  geom_jitter() +
  geom_smooth(se = FALSE, method="lm") +
  ggtitle("Categorized")
grid.arrange(p1, p2, ncol=2)

My second transformation deals with the observation that half bathrooms add less value to a home than full baths. One way to model this is to assign half-bathrooms the value 0.1 instead of 0.5 (see Pardoe link above). The figures below each show total bathrooms (abover ground and basement), but the figure on the left assigns 0.5 for half-baths and the figure on the right assigns 0.1 for half-baths.

bath_pt_1 <- function(df) {
  df %>%
    mutate(
      Bath_1 = Full.Bath + Bsmt.Full.Bath + 0.1*Half.Bath + 0.1*Bsmt.Half.Bath
    )
}
ames_train <- ames_train %>%
  bath_pt_1()
p1 <- ggplot(ames_train, aes(x = ames_train$Full.Bath + ames_train$Bsmt.Full.Bath + 
  0.5*ames_train$Half.Bath + 0.5*ames_train$Bsmt.Half.Bath, y = price)) +
  geom_point() +
  geom_smooth(se = FALSE, method="lm") +
  ggtitle("Half-Bath = 0.5") +
  xlab("Bath_5")
p2 <- ggplot(ames_train, aes(x = Bath_1, y = price)) +  
  geom_jitter() +
  geom_smooth(se = FALSE, method="lm") +
  ggtitle("Half-Bath = 0.1")
grid.arrange(p1, p2, ncol=2)

Finally, location surely makes a difference. A summary boxplot of prices by Neighborhood shows both differences in median prices, and in price variation. Not all neighborhoods are significantly different, however. Also, I found (later) that there are other neighborhoods in the test and validation set that are not included in the test data set, and that will compromise the model when I create factor variables. I will condense the factor variables into groups, and include a catch-all for remaining not yet in the training dataset. I performed an ANOVA test of group differences and condensed the neighborhoods into 5 groups.

neigh_agg <- function(df) {
  df %>%
    mutate(
    Neighborhood2 = as.factor(
      ifelse(as.character(Neighborhood) == "Blmngtn", "Neigh_1",
  ifelse(as.character(Neighborhood) == "Blueste", "Neigh_1",
  ifelse(as.character(Neighborhood) == "ClearCr", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "CollgCr", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "Crawfor", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "Gilbert", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "Greens", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "GrnHill", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "NPkVill", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "NWAmes", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "SawyerW", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "Veenker", "Neigh_1", 
  ifelse(as.character(Neighborhood) == "BrDale", "Neigh_2", 
  ifelse(as.character(Neighborhood) == "BrkSide", "Neigh_2", 
  ifelse(as.character(Neighborhood) == "Edwards", "Neigh_2", 
  ifelse(as.character(Neighborhood) == "IDOTRR", "Neigh_2", 
  ifelse(as.character(Neighborhood) == "MeadowV", "Neigh_2", 
  ifelse(as.character(Neighborhood) == "OldTown", "Neigh_2", 
  ifelse(as.character(Neighborhood) == "Sawyer", "Neigh_2", 
  ifelse(as.character(Neighborhood) == "SWISU", "Neigh_2", 
  ifelse(as.character(Neighborhood) == "NAmes", "Neigh_3", 
  ifelse(as.character(Neighborhood) == "Mitchel", "Neigh_3", 
  ifelse(as.character(Neighborhood) == "StoneBr", "Neigh_4", 
  ifelse(as.character(Neighborhood) == "NoRidge", "Neigh_4", 
  ifelse(as.character(Neighborhood) == "NridgHt", "Neigh_4", 
  ifelse(as.character(Neighborhood) == "Somerst", "Neigh_5", 
  ifelse(as.character(Neighborhood) == "Timber", "Neigh_5", 
         "Neigh_5"))))))))))))))))))))))))))))
    )
}
ames_train <- ames_train %>%
  neigh_agg()

# run the following code to identify neighborhoods to remap (using p>.05)
#TukeyHSD(aov(ames_train$price ~ ames_train$Neighborhood2))
p1 <- ggplot(data = ames_train, aes(x = Neighborhood, y = price)) + 
  geom_boxplot() +
  ggtitle("Original") +
  coord_flip()
p2 <- ggplot(data = ames_train, aes(x = Neighborhood2, y = price)) + 
  geom_boxplot() +
  ggtitle("Condensed") +
  coord_flip()
grid.arrange(p1, p2, ncol=2)

ames_train %>% 
  group_by(Neighborhood2) %>%
  summarise(avg.price = mean(price),
            sd.price = sd(price),
            median.price = median(price),
            n = n())
## # A tibble: 5 x 5
##   Neighborhood2 avg.price sd.price median.price     n
##   <fct>             <dbl>    <dbl>        <dbl> <int>
## 1 Neigh_1          194738    51577       187500   297
## 2 Neigh_2          123188    39382       125000   306
## 3 Neigh_3          146587    31893       142000   199
## 4 Neigh_4          324646    96823       315750   105
## 5 Neigh_5          240847    70060       222000    93

I will consider a subset of variables for this analysis. I propose dropping the following variables at the outset. In some cases, variables are highly correlated, such as square footage of first floor and square footage of second floor. In other cases, the variable is clearly not relevant for a parsimonious model, such as type of alley access to property. I used a correlation matrix to find highly correlated variables to exclude.

  • PID: Parcel identification number, not relevant to modeling
  • Lot.Frontage: Linear feet of street connected to property
  • Street: Type of road access to property
  • Alley: Type of alley access to property
  • Lot.Shape: General shape of property
  • Land.Contour: Flatness of the property
  • Roof.Style: Type of roof
  • Roof.Matl: Roof material
  • Exterior.1st: Exterior covering on house
  • Exterior.2nd: Exterior covering on house (if more than one material)
  • Mas.Vnr.Type: Masonry veneer type
  • Mas.Vnr.Area: Masonry veneer area in square feet
  • Foundation: Type of foundation
  • BsmtFin.Type.1: Rating of basement finished area
  • BsmtFin.SF.1: Type 1 finished square feet
  • BsmtFin.Type.2: Rating of basement finished area (if multiple types)
  • BsmtFin.SF.2: Type 2 finished square feet
  • Bsmt.Unf.SF: Unfinished square feet of basement area
  • Electrical: Electrical system
  • X1st.Flr.SF: First Floor square feet
  • X2nd.Flr.SF: Second floor square feet
  • Low.Qual.Fin.SF: Low quality finished square feet (all floors)
  • Bsmt.Full.Bath: Basement full bathrooms
  • Bsmt.Half.Bath: Basement half bathrooms
  • Full.Bath: Full bathrooms above grade
  • Half.Bath: Half baths above grade
  • Garage.Cars: Size of garage in car capacity
  • Misc.Feature: Miscellaneous feature not covered in other categories
  • Misc.Val: Value of miscellaneous feature
  • X3Ssn.Porch: Three season porch area in square feet
  • Screen.Porch: Screen porch area in square feet
  • Mo.Sold: Month Sold
  • Yr.Sold: Year Sold
  • Pool.Area: Pool area in square feet

    discard_vars <- function(df) {
      subset(df, select = -c(PID, Lot.Frontage, Street, Alley, Lot.Shape, Land.Contour, Roof.Style,
                             Roof.Matl, Exterior.1st, Exterior.2nd, Mas.Vnr.Type, Mas.Vnr.Area,
                             Foundation, BsmtFin.Type.1, BsmtFin.SF.1, BsmtFin.Type.2, BsmtFin.SF.2,
                             Bsmt.Unf.SF, Electrical, X1st.Flr.SF, X2nd.Flr.SF, Low.Qual.Fin.SF,
                             Bsmt.Full.Bath, Bsmt.Half.Bath, Full.Bath, Half.Bath, Garage.Cars,
                             Misc.Feature, Misc.Val, X3Ssn.Porch, Screen.Porch, Mo.Sold, Yr.Sold, Pool.Area))
    }
    ames_train <- ames_train %>%
      discard_vars()
    ames_contin <- ames_train %>%
      keep(is.numeric) 
    
    corr <- rcorr(as.matrix(ames_contin))
    corrplot(corr$r, type = "upper", method = "color", p.mat = corr$P, sig.level = 0.01)


2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.2.1 Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


My initial model is price regressed on my categorized lot area (Lot.Area.Cat), home size (area), bedrooms (Bedroom.AbvGr), my modified number of bathrooms (Bath_1), home age (Year.Built), overall condition of the home (Overall.Cond), and location (Neighborhood2). That is 7 predictor variables, but technically more because Neighborhood2 is changed into several indicator variables. My model selection was based on real estate industry expectations: values are based on location, size, and quality.

ames_train.lm.full <- lm(price ~ Lot.Area.Cat + area + Overall.Cond + 
                           Bedroom.AbvGr + Bath_1 + Year.Built +
                           Neighborhood2, data = ames_train)
summary(ames_train.lm.full)
## 
## Call:
## lm(formula = price ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
##     Bath_1 + Year.Built + Neighborhood2, data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -312722  -18959   -1285   14682  234915 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.535e+06  1.287e+05 -11.928  < 2e-16 ***
## Lot.Area.Cat          1.147e+04  1.043e+03  11.000  < 2e-16 ***
## area                  8.178e+01  3.742e+00  21.852  < 2e-16 ***
## Overall.Cond          1.031e+04  1.153e+03   8.943  < 2e-16 ***
## Bedroom.AbvGr        -1.892e+04  1.772e+03 -10.677  < 2e-16 ***
## Bath_1                9.046e+03  2.008e+03   4.505 7.44e-06 ***
## Year.Built            7.686e+02  6.370e+01  12.066  < 2e-16 ***
## Neighborhood2Neigh_2 -4.597e+03  4.094e+03  -1.123    0.262    
## Neighborhood2Neigh_3 -3.489e+03  3.775e+03  -0.924    0.356    
## Neighborhood2Neigh_4  7.527e+04  4.596e+03  16.379  < 2e-16 ***
## Neighborhood2Neigh_5  3.332e+04  4.556e+03   7.312 5.42e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36550 on 989 degrees of freedom
## Multiple R-squared:  0.8029, Adjusted R-squared:  0.8009 
## F-statistic: 402.9 on 10 and 989 DF,  p-value: < 2.2e-16

This model does a pretty good job of prediction with an R-squared value of 0.8029.

I will evaluate this model for compliance with the LSF conditions of linear relationship with normally distributed variances with constant value. Violation of any of these conditions may necessitate remedial action such as transforming one or more predictors and/or the response variable.

Start with a Residuals vs fitted values plot to visually assess whether:
  • the average of the residuals is ~0 at all fitted values (linearity).
  • the spread of the residuals is ~constant at all fitted values (equal variances).
  • there are no excessively outlying points.

    ames_train$resid <- rstandard(ames_train.lm.full) 
    ames_train$fitted <- fitted(ames_train.lm.full) 
    ggplot(data=ames_train, aes(y = resid, x = fitted)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      xlab("Fitted Value") +
      ylab("Residual") +
      ggtitle("Residuals vs. Fits Plot \n(response is price)")

    The standardized residuals plot fans out, violating the equal variance condition. There are also many points >2 sd from 0. I will log transform the response variable price.

    ames_train.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2, data = ames_train)
    ames_train$resid <- rstandard(ames_train.lm.full) 
    ames_train$fitted <- fitted(ames_train.lm.full) 
    ggplot(data=ames_train, aes(y = resid, x = fitted)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      xlab("Fitted Value") +
      ylab("Residual") +
      ggtitle("Residuals vs. Fits Plot \n(response is price)")

    This plot looks much better, although there are a few potential outliers.

    Now I’ll create a Residuals vs Predictors scatterplot matrix to assess whether:
  • the average of the residuals is ~0 at all fitted values (linearity).
  • the spread of the residuals is ~constant at all fitted values (equal variances).

    p1 <- ggplot(data=ames_train, aes(y = resid, x = area)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("area") +
      ylab("Residual")
    p2 <- ggplot(data=ames_train, aes(y = resid, x = Overall.Cond)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Overall.Cond") +
      ylab("Residual")
    p3 <- ggplot(data=ames_train, aes(y = resid, x = Bedroom.AbvGr)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Bedroom.AbvGr") +
      ylab("Residual") 
    p4 <- ggplot(data=ames_train, aes(y = resid, x = Bath_1)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Bath_1") +
      ylab("Residual") 
    p5 <- ggplot(data=ames_train, aes(y = resid, x = Year.Built)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Year.Built") +
      ylab("Residual") 
    p6 <- ggplot(data=ames_train, aes(y = resid, x = Lot.Area.Cat)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Lot.Area.Cat") +
      ylab("Residual")
    grid.arrange(p1, p2, ncol=2)

    grid.arrange(p3, p4, ncol=2)

    grid.arrange(p5, p6, ncol=2)

    I do not see anything that would necessitate a data transformation.

    Finally, I’ll run the standard summary plots to evaluate leverage points and normality.

    par(mfrow = c(2,2)) # Split plot panel into 2x2 grid
    plot(ames_train.lm.full)

    The normal probability plot suggests the model may suffer from normality at the extreme values. I’ll investigate the outliers.

    plot(ames_train.lm.full, which = 1)

    ames_train.lm.full$model[c(310),]
    ##     log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
    ## 310   12.12676            7 4676            5             3    4.1
    ##     Year.Built Neighborhood2
    ## 310       2007       Neigh_2
    ames_train[c(310),]
    ## # A tibble: 1 x 52
    ##    area  price MS.SubClass MS.Zoning Lot.Area Utilities Lot.Config
    ##   <int>  <int>       <int> <fct>        <int> <fct>     <fct>     
    ## 1  4676 184750          60 RL           40094 AllPub    Inside    
    ## # ... with 45 more variables: Land.Slope <fct>, Neighborhood <fct>,
    ## #   Condition.1 <fct>, Condition.2 <fct>, Bldg.Type <fct>,
    ## #   House.Style <fct>, Overall.Qual <int>, Overall.Cond <int>,
    ## #   Year.Built <int>, Year.Remod.Add <int>, Exter.Qual <fct>,
    ## #   Exter.Cond <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
    ## #   Bsmt.Exposure <fct>, Total.Bsmt.SF <int>, Heating <fct>,
    ## #   Heating.QC <fct>, Central.Air <fct>, Bedroom.AbvGr <int>,
    ## #   Kitchen.AbvGr <int>, Kitchen.Qual <fct>, TotRms.AbvGrd <int>,
    ## #   Functional <fct>, Fireplaces <int>, Fireplace.Qu <fct>,
    ## #   Garage.Type <fct>, Garage.Yr.Blt <dbl>, Garage.Finish <fct>,
    ## #   Garage.Area <int>, Garage.Qual <fct>, Garage.Cond <fct>,
    ## #   Paved.Drive <fct>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
    ## #   Enclosed.Porch <int>, Pool.QC <fct>, Fence <fct>, Sale.Type <fct>,
    ## #   Sale.Condition <fct>, Lot.Area.Cat <dbl>, Bath_1 <dbl>,
    ## #   Neighborhood2 <fct>, resid <dbl>, fitted <dbl>
    # fit a second model w/o the outlier
    ames_train2 <- ames_train[c(1:309,311:1000),]
    ames_train.lm.full2 <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2, data = ames_train2)

    Observation 310 is a normal sale a very large home, but at a rather cheap price.

    My initial model then is log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2Neigh_2 + Neighborhood2Neigh_3 + Neighborhood2Neigh_4 + Neighborhood2Neigh_5.

    with estimated regression equation log(price) = -$1.53MM + $11,000Lot.Area.Cat + $82area + $10,000Overall.Cond - $19,000Bedroom.AbvGr + $9,000Bath_1 + $769Year.Built - $500Neighborhood2Neigh_2 -$300Neighborhood2Neigh_3 + $7,500Neighborhood2Neigh_4 + $3,300Neighborhood2Neigh_5

    The model does a good job of predicting log(price) with an adjusted R-squared of 0.8213. All predictor variables apart from Neighborhood are highly significant (p<0.0001). The Neighborhood indicator variables are significant at the \(\alpha\)=0.05 level for 3 of the 4 indicator variables.

    The parameter estimators have plausible signs and values except the Bedroom.AbvGr has a negative value. The may be due to its correlation with area and/or Bath_.

    summary(ames_train.lm.full)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.62700 -0.08985  0.00130  0.09516  0.61324 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           7.730e-01  6.261e-01   1.235   0.2173    
    ## Lot.Area.Cat          5.168e-02  5.076e-03  10.182  < 2e-16 ***
    ## area                  4.029e-04  1.821e-05  22.126  < 2e-16 ***
    ## Overall.Cond          8.729e-02  5.609e-03  15.562  < 2e-16 ***
    ## Bedroom.AbvGr        -6.208e-02  8.623e-03  -7.199 1.20e-12 ***
    ## Bath_1                5.279e-02  9.771e-03   5.403 8.22e-08 ***
    ## Year.Built            5.078e-03  3.099e-04  16.386  < 2e-16 ***
    ## Neighborhood2Neigh_2 -1.006e-01  1.992e-02  -5.052 5.20e-07 ***
    ## Neighborhood2Neigh_3 -3.182e-02  1.837e-02  -1.732   0.0835 .  
    ## Neighborhood2Neigh_4  2.228e-01  2.236e-02   9.963  < 2e-16 ***
    ## Neighborhood2Neigh_5  1.421e-01  2.217e-02   6.409 2.26e-10 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1778 on 989 degrees of freedom
    ## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8213 
    ## F-statistic:   460 on 10 and 989 DF,  p-value: < 2.2e-16

  • 2.2.2 Section 2.2 Model Selection

    Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


    I will try three model selection procedures: AIC, BIC, and BAS.

    ames_train.lm.aic <- stepAIC(ames_train.lm.full, k = 2)
    ## Start:  AIC=-3443.04
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2
    ## 
    ##                 Df Sum of Sq    RSS     AIC
    ## <none>                       31.272 -3443.0
    ## - Bath_1         1    0.9230 32.195 -3416.0
    ## - Bedroom.AbvGr  1    1.6386 32.910 -3394.0
    ## - Lot.Area.Cat   1    3.2782 34.550 -3345.4
    ## - Neighborhood2  4    4.5659 35.837 -3314.8
    ## - Overall.Cond   1    7.6572 38.929 -3226.0
    ## - Year.Built     1    8.4899 39.762 -3204.9
    ## - area           1   15.4797 46.751 -3042.9
    summary(ames_train.lm.aic)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.62700 -0.08985  0.00130  0.09516  0.61324 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           7.730e-01  6.261e-01   1.235   0.2173    
    ## Lot.Area.Cat          5.168e-02  5.076e-03  10.182  < 2e-16 ***
    ## area                  4.029e-04  1.821e-05  22.126  < 2e-16 ***
    ## Overall.Cond          8.729e-02  5.609e-03  15.562  < 2e-16 ***
    ## Bedroom.AbvGr        -6.208e-02  8.623e-03  -7.199 1.20e-12 ***
    ## Bath_1                5.279e-02  9.771e-03   5.403 8.22e-08 ***
    ## Year.Built            5.078e-03  3.099e-04  16.386  < 2e-16 ***
    ## Neighborhood2Neigh_2 -1.006e-01  1.992e-02  -5.052 5.20e-07 ***
    ## Neighborhood2Neigh_3 -3.182e-02  1.837e-02  -1.732   0.0835 .  
    ## Neighborhood2Neigh_4  2.228e-01  2.236e-02   9.963  < 2e-16 ***
    ## Neighborhood2Neigh_5  1.421e-01  2.217e-02   6.409 2.26e-10 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1778 on 989 degrees of freedom
    ## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8213 
    ## F-statistic:   460 on 10 and 989 DF,  p-value: < 2.2e-16

    AIC retained all predictor variables from the full model. The adjusted R-squared is 0.8009. Now I’ll try BIC.

    ames_train.lm.bic <- stepAIC(ames_train.lm.full, k = log(nrow(ames_train)))
    ## Start:  AIC=-3389.06
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2
    ## 
    ##                 Df Sum of Sq    RSS     AIC
    ## <none>                       31.272 -3389.1
    ## - Bath_1         1    0.9230 32.195 -3366.9
    ## - Bedroom.AbvGr  1    1.6386 32.910 -3344.9
    ## - Lot.Area.Cat   1    3.2782 34.550 -3296.3
    ## - Neighborhood2  4    4.5659 35.837 -3280.4
    ## - Overall.Cond   1    7.6572 38.929 -3176.9
    ## - Year.Built     1    8.4899 39.762 -3155.8
    ## - area           1   15.4797 46.751 -2993.8
    summary(ames_train.lm.bic)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.62700 -0.08985  0.00130  0.09516  0.61324 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           7.730e-01  6.261e-01   1.235   0.2173    
    ## Lot.Area.Cat          5.168e-02  5.076e-03  10.182  < 2e-16 ***
    ## area                  4.029e-04  1.821e-05  22.126  < 2e-16 ***
    ## Overall.Cond          8.729e-02  5.609e-03  15.562  < 2e-16 ***
    ## Bedroom.AbvGr        -6.208e-02  8.623e-03  -7.199 1.20e-12 ***
    ## Bath_1                5.279e-02  9.771e-03   5.403 8.22e-08 ***
    ## Year.Built            5.078e-03  3.099e-04  16.386  < 2e-16 ***
    ## Neighborhood2Neigh_2 -1.006e-01  1.992e-02  -5.052 5.20e-07 ***
    ## Neighborhood2Neigh_3 -3.182e-02  1.837e-02  -1.732   0.0835 .  
    ## Neighborhood2Neigh_4  2.228e-01  2.236e-02   9.963  < 2e-16 ***
    ## Neighborhood2Neigh_5  1.421e-01  2.217e-02   6.409 2.26e-10 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1778 on 989 degrees of freedom
    ## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8213 
    ## F-statistic:   460 on 10 and 989 DF,  p-value: < 2.2e-16

    BIC also retained all predictor variables from the full model. The adjusted R-squared is 0.8009. Lastly, I’ll try BAS.

    ames_train.lm.bas <- bas.lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2, data = ames_train, prior = "AIC",
                               modelprior=uniform())
    summary(ames_train.lm.bas)
    ##                      P(B != 0 | Y)    model 1       model 2       model 3
    ## Intercept                1.0000000     1.0000     1.0000000  1.000000e+00
    ## Lot.Area.Cat             1.0000000     1.0000     1.0000000  1.000000e+00
    ## area                     1.0000000     1.0000     1.0000000  1.000000e+00
    ## Overall.Cond             1.0000000     1.0000     1.0000000  1.000000e+00
    ## Bedroom.AbvGr            1.0000000     1.0000     1.0000000  1.000000e+00
    ## Bath_1                   0.9999988     1.0000     1.0000000  1.000000e+00
    ## Year.Built               1.0000000     1.0000     1.0000000  1.000000e+00
    ## Neighborhood2Neigh_2     0.9999897     1.0000     1.0000000  0.000000e+00
    ## Neighborhood2Neigh_3     0.6259502     1.0000     0.0000000  0.000000e+00
    ## Neighborhood2Neigh_4     1.0000000     1.0000     1.0000000  1.000000e+00
    ## Neighborhood2Neigh_5     1.0000000     1.0000     1.0000000  1.000000e+00
    ## BF                              NA     1.0000     0.5975676  8.492981e-06
    ## PostProbs                       NA     0.6259     0.3740000  0.000000e+00
    ## R2                              NA     0.8231     0.8225000  8.181000e-01
    ## dim                             NA    11.0000    10.0000000  9.000000e+00
    ## logmarg                         NA -1732.3554 -1732.8702398 -1.744032e+03
    ##                            model 4       model 5
    ## Intercept             1.000000e+00  1.000000e+00
    ## Lot.Area.Cat          1.000000e+00  1.000000e+00
    ## area                  1.000000e+00  1.000000e+00
    ## Overall.Cond          1.000000e+00  1.000000e+00
    ## Bedroom.AbvGr         1.000000e+00  1.000000e+00
    ## Bath_1                1.000000e+00  0.000000e+00
    ## Year.Built            1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_2  0.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_3  1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_4  1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_5  1.000000e+00  1.000000e+00
    ## BF                    7.962731e-06  1.312482e-06
    ## PostProbs             0.000000e+00  0.000000e+00
    ## R2                    8.185000e-01  8.178000e-01
    ## dim                   1.000000e+01  1.000000e+01
    ## logmarg              -1.744096e+03 -1.745899e+03

    BAS also retained all predictor variables. apart from the Neighborhood variable. BAS dropped 1 of the Neighborhood indicator variables in its top five models. The R-squared ranges from 0.8178 to 0.8231.


    2.2.3 Section 2.3 Initial Model Residuals

    One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


    The residuals versus fits plot has constant variance with a conspicuous outlier labeled for further investigation. Observation 310 is a normal sale a very large home, but at a rather cheap price. I can see no reason to drop this observation from the data set.

    plot(ames_train.lm.full, which = 1)

    ames_train.lm.full$model[c(310),]
    ##     log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
    ## 310   12.12676            7 4676            5             3    4.1
    ##     Year.Built Neighborhood2
    ## 310       2007       Neigh_2
    ames_train[c(310),]
    ## # A tibble: 1 x 52
    ##    area  price MS.SubClass MS.Zoning Lot.Area Utilities Lot.Config
    ##   <int>  <int>       <int> <fct>        <int> <fct>     <fct>     
    ## 1  4676 184750          60 RL           40094 AllPub    Inside    
    ## # ... with 45 more variables: Land.Slope <fct>, Neighborhood <fct>,
    ## #   Condition.1 <fct>, Condition.2 <fct>, Bldg.Type <fct>,
    ## #   House.Style <fct>, Overall.Qual <int>, Overall.Cond <int>,
    ## #   Year.Built <int>, Year.Remod.Add <int>, Exter.Qual <fct>,
    ## #   Exter.Cond <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
    ## #   Bsmt.Exposure <fct>, Total.Bsmt.SF <int>, Heating <fct>,
    ## #   Heating.QC <fct>, Central.Air <fct>, Bedroom.AbvGr <int>,
    ## #   Kitchen.AbvGr <int>, Kitchen.Qual <fct>, TotRms.AbvGrd <int>,
    ## #   Functional <fct>, Fireplaces <int>, Fireplace.Qu <fct>,
    ## #   Garage.Type <fct>, Garage.Yr.Blt <dbl>, Garage.Finish <fct>,
    ## #   Garage.Area <int>, Garage.Qual <fct>, Garage.Cond <fct>,
    ## #   Paved.Drive <fct>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
    ## #   Enclosed.Porch <int>, Pool.QC <fct>, Fence <fct>, Sale.Type <fct>,
    ## #   Sale.Condition <fct>, Lot.Area.Cat <dbl>, Bath_1 <dbl>,
    ## #   Neighborhood2 <fct>, resid <dbl>, fitted <dbl>

    2.2.4 Section 2.4 Initial Model RMSE

    You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


    The model RMSE (translated from log(price) to price is $38,321. Note that if I drop the outlier, the RMSE falls to $32,870.

    # Extract Predictions (convert from log)
    ames_train$predict.price <- exp(predict(ames_train.lm.full, ames_train))
    
    # Calculate RMSE
    sqrt(mean((ames_train$price - ames_train$predict.price)^2))
    ## [1] 38320.84
    # Repeat with the dataset w/o outlier
    ames_train2$predict.price <- exp(predict(ames_train.lm.full2, ames_train2))
    sqrt(mean((ames_train2$price - ames_train2$predict.price)^2))
    ## [1] 32869.89

    2.2.5 Section 2.5 Overfitting

    The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called overfitting. To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

    load("ames_test.Rdata")
    ames_test <- ames_test %>%
      replace_na() %>%
      cat_lot_area() %>%
      bath_pt_1() %>%
      neigh_agg() %>%
      discard_vars()

    Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


    In general, the better the model fit, the lower the RMSE. In this case, the RMSE of the test data RMSE ($27,080) is lower than using the training data. This is unusual, because the model should fit the data it was fitted to better than a new data set. After testing the model with the outlying observation removed, the RMSE was still lower for the test data set!

    # Extract Predictions (convert from log)
    ames_test$predict.price <- exp(predict(ames_train.lm.full, ames_test))
    
    # Calculate RMSE
    sqrt(mean((ames_test$price - ames_test$predict.price)^2))
    ## [1] 27079.7
    # Repeat with the version of the model built off the excluded outlier
    ames_test$predict.price2 <- exp(predict(ames_train.lm.full2, ames_test))
    sqrt(mean((ames_test$price - ames_test$predict.price2)^2))
    ## [1] 28005.33

    Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

    2.3 Part 3 Development of a Final Model

    Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

    Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

    2.3.1 Section 3.1 Final Model

    Provide the summary table for your model.


    I added six more variables to the final model for evaluation: Building type (Bldg.Type), Overall Condition (Overall.Qual), Central Air (Central.Air), Kitchen Quality (Kitchen.Qual), Sale Condition (Sale.Condition), and exterior quality (Exter.Qual). I also added two interaction variables, Central.Air*Year.Built and Bedroom.AbvGr*Bath_1.

    ames.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2 + Bldg.Type + Overall.Qual +
                               Central.Air + Kitchen.Qual +
                               Sale.Condition + Exter.Qual + Central.Air*Year.Built + Bedroom.AbvGr*Bath_1,
                             data = ames_train)

    I used three model selection procedures: AIC, BIC, and BAS.

    ames.lm.aic <- stepAIC(ames.lm.full, k = 2)
    ## Start:  AIC=-3815.97
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual + Sale.Condition + Exter.Qual + 
    ##     Central.Air * Year.Built + Bedroom.AbvGr * Bath_1
    ## 
    ##                          Df Sum of Sq    RSS     AIC
    ## - Exter.Qual              3    0.0516 20.744 -3819.5
    ## - Year.Built:Central.Air  1    0.0003 20.693 -3818.0
    ## <none>                                20.693 -3816.0
    ## - Kitchen.Qual            4    0.2466 20.939 -3812.1
    ## - Bldg.Type               4    0.4235 21.116 -3803.7
    ## - Bedroom.AbvGr:Bath_1    1    0.5143 21.207 -3793.4
    ## - Sale.Condition          5    0.8754 21.568 -3784.5
    ## - Neighborhood2           4    1.0692 21.762 -3773.6
    ## - Lot.Area.Cat            1    1.0614 21.754 -3767.9
    ## - Overall.Cond            1    2.2685 22.961 -3713.9
    ## - Overall.Qual            1    3.5957 24.288 -3657.8
    ## - area                    1    4.9522 25.645 -3603.4
    ## 
    ## Step:  AIC=-3819.48
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual + Sale.Condition + Year.Built:Central.Air + 
    ##     Bedroom.AbvGr:Bath_1
    ## 
    ##                          Df Sum of Sq    RSS     AIC
    ## - Year.Built:Central.Air  1    0.0003 20.745 -3821.5
    ## <none>                                20.744 -3819.5
    ## - Kitchen.Qual            4    0.4243 21.169 -3807.2
    ## - Bldg.Type               4    0.4279 21.172 -3807.1
    ## - Bedroom.AbvGr:Bath_1    1    0.5327 21.277 -3796.1
    ## - Sale.Condition          5    0.9224 21.667 -3786.0
    ## - Neighborhood2           4    1.1267 21.871 -3774.6
    ## - Lot.Area.Cat            1    1.0864 21.831 -3770.4
    ## - Overall.Cond            1    2.3303 23.075 -3715.0
    ## - Overall.Qual            1    3.9841 24.729 -3645.8
    ## - area                    1    4.9995 25.744 -3605.6
    ## 
    ## Step:  AIC=-3821.46
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual + Sale.Condition + Bedroom.AbvGr:Bath_1
    ## 
    ##                        Df Sum of Sq    RSS     AIC
    ## <none>                              20.745 -3821.5
    ## - Kitchen.Qual          4    0.4244 21.169 -3809.2
    ## - Bldg.Type             4    0.4280 21.173 -3809.0
    ## - Bedroom.AbvGr:Bath_1  1    0.5328 21.277 -3798.1
    ## - Sale.Condition        5    0.9239 21.669 -3787.9
    ## - Neighborhood2         4    1.1277 21.872 -3776.5
    ## - Central.Air           1    1.0396 21.784 -3774.6
    ## - Lot.Area.Cat          1    1.0871 21.832 -3772.4
    ## - Year.Built            1    1.9783 22.723 -3732.4
    ## - Overall.Cond          1    2.3311 23.076 -3717.0
    ## - Overall.Qual          1    4.0712 24.816 -3644.3
    ## - area                  1    5.0038 25.748 -3607.4
    summary(ames.lm.aic)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + 
    ##     Overall.Qual + Central.Air + Kitchen.Qual + Sale.Condition + 
    ##     Bedroom.AbvGr:Bath_1, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.47623 -0.06867 -0.00019  0.06982  0.50634 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.617e+00  5.910e-01   7.813 1.45e-14 ***
    ## Lot.Area.Cat           3.992e-02  5.590e-03   7.141 1.82e-12 ***
    ## area                   2.655e-04  1.733e-05  15.320  < 2e-16 ***
    ## Overall.Cond           5.372e-02  5.138e-03  10.456  < 2e-16 ***
    ## Bedroom.AbvGr          3.229e-02  1.471e-02   2.196 0.028353 *  
    ## Bath_1                 1.535e-01  2.019e-02   7.603 6.79e-14 ***
    ## Year.Built             2.851e-03  2.960e-04   9.633  < 2e-16 ***
    ## Neighborhood2Neigh_2  -5.359e-02  1.708e-02  -3.138 0.001753 ** 
    ## Neighborhood2Neigh_3   1.023e-02  1.592e-02   0.643 0.520598    
    ## Neighborhood2Neigh_4   1.133e-01  1.995e-02   5.679 1.79e-08 ***
    ## Neighborhood2Neigh_5   6.031e-02  1.903e-02   3.169 0.001579 ** 
    ## Bldg.Type2fmCon        6.148e-02  3.495e-02   1.759 0.078879 .  
    ## Bldg.TypeDuplex       -5.627e-02  2.915e-02  -1.930 0.053850 .  
    ## Bldg.TypeTwnhs        -1.102e-01  3.081e-02  -3.578 0.000364 ***
    ## Bldg.TypeTwnhsE       -2.375e-02  2.210e-02  -1.074 0.282935    
    ## Overall.Qual           8.432e-02  6.102e-03  13.819  < 2e-16 ***
    ## Central.AirY           1.742e-01  2.495e-02   6.983 5.35e-12 ***
    ## Kitchen.QualFa        -1.191e-01  4.313e-02  -2.761 0.005869 ** 
    ## Kitchen.QualGd        -8.341e-02  2.220e-02  -3.758 0.000182 ***
    ## Kitchen.QualPo         1.169e-01  1.515e-01   0.772 0.440437    
    ## Kitchen.QualTA        -1.074e-01  2.580e-02  -4.165 3.39e-05 ***
    ## Sale.ConditionAdjLand  2.195e-01  1.091e-01   2.012 0.044504 *  
    ## Sale.ConditionAlloca   1.003e-01  7.693e-02   1.304 0.192631    
    ## Sale.ConditionFamily   5.574e-03  4.096e-02   0.136 0.891783    
    ## Sale.ConditionNormal   1.090e-01  1.982e-02   5.501 4.82e-08 ***
    ## Sale.ConditionPartial  1.447e-01  2.690e-02   5.378 9.42e-08 ***
    ## Bedroom.AbvGr:Bath_1  -2.934e-02  5.869e-03  -4.999 6.84e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.146 on 973 degrees of freedom
    ## Multiple R-squared:  0.8826, Adjusted R-squared:  0.8795 
    ## F-statistic: 281.4 on 26 and 973 DF,  p-value: < 2.2e-16

    AIC dropped Exter.Qual and the Central.Air*Year.Built interaction term.

    Next I trie the BIC method.

    ames.lm.bic <- stepAIC(ames.lm.full, k = log(nrow(ames_train)))
    ## Start:  AIC=-3663.83
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual + Sale.Condition + Exter.Qual + 
    ##     Central.Air * Year.Built + Bedroom.AbvGr * Bath_1
    ## 
    ##                          Df Sum of Sq    RSS     AIC
    ## - Exter.Qual              3    0.0516 20.744 -3682.1
    ## - Kitchen.Qual            4    0.2466 20.939 -3679.6
    ## - Bldg.Type               4    0.4235 21.116 -3671.2
    ## - Year.Built:Central.Air  1    0.0003 20.693 -3670.7
    ## <none>                                20.693 -3663.8
    ## - Sale.Condition          5    0.8754 21.568 -3656.9
    ## - Bedroom.AbvGr:Bath_1    1    0.5143 21.207 -3646.2
    ## - Neighborhood2           4    1.0692 21.762 -3641.1
    ## - Lot.Area.Cat            1    1.0614 21.754 -3620.7
    ## - Overall.Cond            1    2.2685 22.961 -3566.7
    ## - Overall.Qual            1    3.5957 24.288 -3510.5
    ## - area                    1    4.9522 25.645 -3456.2
    ## 
    ## Step:  AIC=-3682.06
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual + Sale.Condition + Year.Built:Central.Air + 
    ##     Bedroom.AbvGr:Bath_1
    ## 
    ##                          Df Sum of Sq    RSS     AIC
    ## - Kitchen.Qual            4    0.4243 21.169 -3689.4
    ## - Bldg.Type               4    0.4279 21.172 -3689.3
    ## - Year.Built:Central.Air  1    0.0003 20.745 -3689.0
    ## <none>                                20.744 -3682.1
    ## - Sale.Condition          5    0.9224 21.667 -3673.1
    ## - Bedroom.AbvGr:Bath_1    1    0.5327 21.277 -3663.6
    ## - Neighborhood2           4    1.1267 21.871 -3656.8
    ## - Lot.Area.Cat            1    1.0864 21.831 -3637.9
    ## - Overall.Cond            1    2.3303 23.075 -3582.5
    ## - Overall.Qual            1    3.9841 24.729 -3513.3
    ## - area                    1    4.9995 25.744 -3473.0
    ## 
    ## Step:  AIC=-3689.44
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Sale.Condition + Year.Built:Central.Air + Bedroom.AbvGr:Bath_1
    ## 
    ##                          Df Sum of Sq    RSS     AIC
    ## - Year.Built:Central.Air  1    0.0004 21.169 -3696.3
    ## - Bldg.Type               4    0.4754 21.644 -3694.9
    ## <none>                                21.169 -3689.4
    ## - Sale.Condition          5    0.9173 22.086 -3681.6
    ## - Bedroom.AbvGr:Bath_1    1    0.5752 21.744 -3669.5
    ## - Neighborhood2           4    1.3832 22.552 -3653.8
    ## - Lot.Area.Cat            1    1.1229 22.292 -3644.7
    ## - Overall.Cond            1    2.4765 23.645 -3585.7
    ## - Overall.Qual            1    4.8728 26.042 -3489.2
    ## - area                    1    5.7290 26.898 -3456.8
    ## 
    ## Step:  AIC=-3696.33
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Sale.Condition + Bedroom.AbvGr:Bath_1
    ## 
    ##                        Df Sum of Sq    RSS     AIC
    ## - Bldg.Type             4    0.4752 21.644 -3701.8
    ## <none>                              21.169 -3696.3
    ## - Sale.Condition        5    0.9186 22.088 -3688.4
    ## - Bedroom.AbvGr:Bath_1  1    0.5751 21.744 -3676.4
    ## - Neighborhood2         4    1.3859 22.555 -3660.6
    ## - Central.Air           1    0.9721 22.141 -3658.3
    ## - Lot.Area.Cat          1    1.1235 22.293 -3651.5
    ## - Year.Built            1    2.2676 23.437 -3601.5
    ## - Overall.Cond          1    2.4775 23.647 -3592.6
    ## - Overall.Qual          1    4.9823 26.151 -3491.9
    ## - area                  1    5.7317 26.901 -3463.6
    ## 
    ## Step:  AIC=-3701.76
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + Central.Air + 
    ##     Sale.Condition + Bedroom.AbvGr:Bath_1
    ## 
    ##                        Df Sum of Sq    RSS     AIC
    ## <none>                              21.644 -3701.8
    ## - Sale.Condition        5    1.0227 22.667 -3690.1
    ## - Bedroom.AbvGr:Bath_1  1    0.6517 22.296 -3679.0
    ## - Neighborhood2         4    1.2932 22.938 -3671.4
    ## - Central.Air           1    0.9523 22.597 -3665.6
    ## - Year.Built            1    2.0802 23.724 -3616.9
    ## - Overall.Cond          1    2.6566 24.301 -3592.9
    ## - Lot.Area.Cat          1    3.3688 25.013 -3564.0
    ## - Overall.Qual          1    5.3187 26.963 -3488.9
    ## - area                  1    5.5679 27.212 -3479.7
    summary(ames.lm.bic)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + 
    ##     Central.Air + Sale.Condition + Bedroom.AbvGr:Bath_1, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.46468 -0.07093  0.00150  0.07481  0.52662 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.491e+00  5.737e-01   7.829 1.27e-14 ***
    ## Lot.Area.Cat           5.277e-02  4.271e-03  12.357  < 2e-16 ***
    ## area                   2.711e-04  1.706e-05  15.886  < 2e-16 ***
    ## Overall.Cond           5.572e-02  5.078e-03  10.973  < 2e-16 ***
    ## Bedroom.AbvGr          3.428e-02  1.480e-02   2.316 0.020754 *  
    ## Bath_1                 1.570e-01  2.017e-02   7.784 1.77e-14 ***
    ## Year.Built             2.811e-03  2.895e-04   9.710  < 2e-16 ***
    ## Neighborhood2Neigh_2  -5.118e-02  1.703e-02  -3.006 0.002712 ** 
    ## Neighborhood2Neigh_3   6.765e-03  1.574e-02   0.430 0.667468    
    ## Neighborhood2Neigh_4   1.274e-01  1.954e-02   6.518 1.14e-10 ***
    ## Neighborhood2Neigh_5   6.490e-02  1.923e-02   3.375 0.000766 ***
    ## Overall.Qual           9.283e-02  5.979e-03  15.526  < 2e-16 ***
    ## Central.AirY           1.575e-01  2.398e-02   6.570 8.16e-11 ***
    ## Sale.ConditionAdjLand  1.624e-01  1.091e-01   1.488 0.136949    
    ## Sale.ConditionAlloca   7.496e-02  7.749e-02   0.967 0.333636    
    ## Sale.ConditionFamily  -1.352e-02  4.138e-02  -0.327 0.743997    
    ## Sale.ConditionNormal   1.075e-01  2.000e-02   5.375 9.57e-08 ***
    ## Sale.ConditionPartial  1.546e-01  2.715e-02   5.695 1.63e-08 ***
    ## Bedroom.AbvGr:Bath_1  -3.194e-02  5.877e-03  -5.435 6.93e-08 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1485 on 981 degrees of freedom
    ## Multiple R-squared:  0.8775, Adjusted R-squared:  0.8753 
    ## F-statistic: 390.5 on 18 and 981 DF,  p-value: < 2.2e-16

    BIC also dropped Exter.Qual and the Central.Air*Year.Built interaction term. Additionally, BIC dropped Kitchen.Qual and Bldg.Type. Finally, I tried the BAS method.

    ames.lm.bas <- bas.lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2 + Bldg.Type + Overall.Qual +
                               Central.Air + Kitchen.Qual +
                               Sale.Condition + Exter.Qual + Central.Air*Year.Built + Bedroom.AbvGr*Bath_1, 
                          data = ames_train, prior = "AIC",
                               modelprior=uniform())
    summary(ames.lm.bas)
    ##                         P(B != 0 | Y)    model 1       model 2
    ## Intercept                   1.0000000     1.0000     1.0000000
    ## Lot.Area.Cat                1.0000000     1.0000     1.0000000
    ## area                        1.0000000     1.0000     1.0000000
    ## Overall.Cond                1.0000000     1.0000     1.0000000
    ## Bedroom.AbvGr               0.8648758     1.0000     1.0000000
    ## Bath_1                      1.0000000     1.0000     1.0000000
    ## Year.Built                  0.9905766     1.0000     1.0000000
    ## Neighborhood2Neigh_2        0.9953130     1.0000     1.0000000
    ## Neighborhood2Neigh_3        0.3009113     0.0000     0.0000000
    ## Neighborhood2Neigh_4        0.9999993     1.0000     1.0000000
    ## Neighborhood2Neigh_5        0.9841872     1.0000     1.0000000
    ## Bldg.Type2fmCon             0.6543894     1.0000     1.0000000
    ## Bldg.TypeDuplex             0.6458812     1.0000     1.0000000
    ## Bldg.TypeTwnhs              0.9945102     1.0000     1.0000000
    ## Bldg.TypeTwnhsE             0.3909613     0.0000     0.0000000
    ## Overall.Qual                1.0000000     1.0000     1.0000000
    ## Central.AirY                0.5810690     1.0000     0.0000000
    ## Kitchen.QualFa              0.9462522     1.0000     1.0000000
    ## Kitchen.QualGd              0.9864031     1.0000     1.0000000
    ## Kitchen.QualPo              0.3350022     0.0000     0.0000000
    ## Kitchen.QualTA              0.9964145     1.0000     1.0000000
    ## Sale.ConditionAdjLand       0.6618573     1.0000     1.0000000
    ## Sale.ConditionAlloca        0.4258049     0.0000     0.0000000
    ## Sale.ConditionFamily        0.2682041     0.0000     0.0000000
    ## Sale.ConditionNormal        0.9999996     1.0000     1.0000000
    ## Sale.ConditionPartial       0.9999990     1.0000     1.0000000
    ## Exter.QualFa                0.2915373     0.0000     0.0000000
    ## Exter.QualGd                0.3658573     0.0000     0.0000000
    ## Exter.QualTA                0.3571675     0.0000     0.0000000
    ## Year.Built:Central.AirY     0.5846736     0.0000     1.0000000
    ## Bedroom.AbvGr:Bath_1        0.9999944     1.0000     1.0000000
    ## BF                                 NA     1.0000     0.9891655
    ## PostProbs                          NA     0.0036     0.0036000
    ## R2                                 NA     0.8822     0.8822000
    ## dim                                NA    22.0000    22.0000000
    ## logmarg                            NA -1540.0597 -1540.0705549
    ##                               model 3       model 4       model 5
    ## Intercept                   1.0000000     1.0000000     1.0000000
    ## Lot.Area.Cat                1.0000000     1.0000000     1.0000000
    ## area                        1.0000000     1.0000000     1.0000000
    ## Overall.Cond                1.0000000     1.0000000     1.0000000
    ## Bedroom.AbvGr               1.0000000     1.0000000     1.0000000
    ## Bath_1                      1.0000000     1.0000000     1.0000000
    ## Year.Built                  1.0000000     1.0000000     1.0000000
    ## Neighborhood2Neigh_2        1.0000000     1.0000000     1.0000000
    ## Neighborhood2Neigh_3        0.0000000     0.0000000     0.0000000
    ## Neighborhood2Neigh_4        1.0000000     1.0000000     1.0000000
    ## Neighborhood2Neigh_5        1.0000000     1.0000000     1.0000000
    ## Bldg.Type2fmCon             1.0000000     1.0000000     1.0000000
    ## Bldg.TypeDuplex             1.0000000     1.0000000     1.0000000
    ## Bldg.TypeTwnhs              1.0000000     1.0000000     1.0000000
    ## Bldg.TypeTwnhsE             0.0000000     0.0000000     1.0000000
    ## Overall.Qual                1.0000000     1.0000000     1.0000000
    ## Central.AirY                1.0000000     0.0000000     1.0000000
    ## Kitchen.QualFa              1.0000000     1.0000000     1.0000000
    ## Kitchen.QualGd              1.0000000     1.0000000     1.0000000
    ## Kitchen.QualPo              0.0000000     0.0000000     0.0000000
    ## Kitchen.QualTA              1.0000000     1.0000000     1.0000000
    ## Sale.ConditionAdjLand       1.0000000     1.0000000     1.0000000
    ## Sale.ConditionAlloca        1.0000000     1.0000000     0.0000000
    ## Sale.ConditionFamily        0.0000000     0.0000000     0.0000000
    ## Sale.ConditionNormal        1.0000000     1.0000000     1.0000000
    ## Sale.ConditionPartial       1.0000000     1.0000000     1.0000000
    ## Exter.QualFa                0.0000000     0.0000000     0.0000000
    ## Exter.QualGd                0.0000000     0.0000000     0.0000000
    ## Exter.QualTA                0.0000000     0.0000000     0.0000000
    ## Year.Built:Central.AirY     0.0000000     1.0000000     0.0000000
    ## Bedroom.AbvGr:Bath_1        1.0000000     1.0000000     1.0000000
    ## BF                          0.8374305     0.8256039     0.6664186
    ## PostProbs                   0.0030000     0.0030000     0.0024000
    ## R2                          0.8824000     0.8824000     0.8823000
    ## dim                        23.0000000    23.0000000    23.0000000
    ## logmarg                 -1540.2370784 -1540.2513015 -1540.4654986

    BAS also retained all predictor variables in at least one of the top 5 models.

    I decided to drop Exter.Qual and the Central.Air*Year.Built interaction term from the full model.

    ames.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2 + Bldg.Type + Overall.Qual +
                               Central.Air + Kitchen.Qual +
                               Sale.Condition + Bedroom.AbvGr*Bath_1,
                             data = ames_train)
    summary(ames.lm.full)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + 
    ##     Overall.Qual + Central.Air + Kitchen.Qual + Sale.Condition + 
    ##     Bedroom.AbvGr * Bath_1, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.47623 -0.06867 -0.00019  0.06982  0.50634 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.617e+00  5.910e-01   7.813 1.45e-14 ***
    ## Lot.Area.Cat           3.992e-02  5.590e-03   7.141 1.82e-12 ***
    ## area                   2.655e-04  1.733e-05  15.320  < 2e-16 ***
    ## Overall.Cond           5.372e-02  5.138e-03  10.456  < 2e-16 ***
    ## Bedroom.AbvGr          3.229e-02  1.471e-02   2.196 0.028353 *  
    ## Bath_1                 1.535e-01  2.019e-02   7.603 6.79e-14 ***
    ## Year.Built             2.851e-03  2.960e-04   9.633  < 2e-16 ***
    ## Neighborhood2Neigh_2  -5.359e-02  1.708e-02  -3.138 0.001753 ** 
    ## Neighborhood2Neigh_3   1.023e-02  1.592e-02   0.643 0.520598    
    ## Neighborhood2Neigh_4   1.133e-01  1.995e-02   5.679 1.79e-08 ***
    ## Neighborhood2Neigh_5   6.031e-02  1.903e-02   3.169 0.001579 ** 
    ## Bldg.Type2fmCon        6.148e-02  3.495e-02   1.759 0.078879 .  
    ## Bldg.TypeDuplex       -5.627e-02  2.915e-02  -1.930 0.053850 .  
    ## Bldg.TypeTwnhs        -1.102e-01  3.081e-02  -3.578 0.000364 ***
    ## Bldg.TypeTwnhsE       -2.375e-02  2.210e-02  -1.074 0.282935    
    ## Overall.Qual           8.432e-02  6.102e-03  13.819  < 2e-16 ***
    ## Central.AirY           1.742e-01  2.495e-02   6.983 5.35e-12 ***
    ## Kitchen.QualFa        -1.191e-01  4.313e-02  -2.761 0.005869 ** 
    ## Kitchen.QualGd        -8.341e-02  2.220e-02  -3.758 0.000182 ***
    ## Kitchen.QualPo         1.169e-01  1.515e-01   0.772 0.440437    
    ## Kitchen.QualTA        -1.074e-01  2.580e-02  -4.165 3.39e-05 ***
    ## Sale.ConditionAdjLand  2.195e-01  1.091e-01   2.012 0.044504 *  
    ## Sale.ConditionAlloca   1.003e-01  7.693e-02   1.304 0.192631    
    ## Sale.ConditionFamily   5.574e-03  4.096e-02   0.136 0.891783    
    ## Sale.ConditionNormal   1.090e-01  1.982e-02   5.501 4.82e-08 ***
    ## Sale.ConditionPartial  1.447e-01  2.690e-02   5.378 9.42e-08 ***
    ## Bedroom.AbvGr:Bath_1  -2.934e-02  5.869e-03  -4.999 6.84e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.146 on 973 degrees of freedom
    ## Multiple R-squared:  0.8826, Adjusted R-squared:  0.8795 
    ## F-statistic: 281.4 on 26 and 973 DF,  p-value: < 2.2e-16

    The final model adjusted R-squared is .8795, an improvement over the original model adjusted R-squared of 0.8213.


    2.3.2 Section 3.2 Transformation

    Did you decide to transform any variables? Why or why not? Explain in a few sentences.


    I transformed price into log(price) to conform the model to the least square conditions. I did not transform any predictor variables to conform to model conditions.

    I did transform predictor variables to create a general and parsimonious model.
  • Condensed Neighborhoods into 5 groups
  • Created a total bathroom variable that treats half-baths as .1 to reflect their relatively low value.


  • 2.3.3 Section 3.3 Variable Interaction

    I created interaction variable Year.Built*Central.Air because newer homes are also more likely to have central air conditioning. I ultimately dropped this variable from the final model.

    I created interaction variable Bedroom.AbvGr*Bath.1 because bathrooms should increase with bedrooms or else the property value will suffer.



    2.3.4 Section 3.4 Variable Selection

    What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


    I tried the AIC, BIC, and BAS methods, and ultimately relied on the variable selection of the BIC method because it produced the most parsimonious model.

    ames_train.lm.aic <- stepAIC(ames_train.lm.full, k = 2)
    ## Start:  AIC=-3443.04
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2
    ## 
    ##                 Df Sum of Sq    RSS     AIC
    ## <none>                       31.272 -3443.0
    ## - Bath_1         1    0.9230 32.195 -3416.0
    ## - Bedroom.AbvGr  1    1.6386 32.910 -3394.0
    ## - Lot.Area.Cat   1    3.2782 34.550 -3345.4
    ## - Neighborhood2  4    4.5659 35.837 -3314.8
    ## - Overall.Cond   1    7.6572 38.929 -3226.0
    ## - Year.Built     1    8.4899 39.762 -3204.9
    ## - area           1   15.4797 46.751 -3042.9
    summary(ames_train.lm.aic)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.62700 -0.08985  0.00130  0.09516  0.61324 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           7.730e-01  6.261e-01   1.235   0.2173    
    ## Lot.Area.Cat          5.168e-02  5.076e-03  10.182  < 2e-16 ***
    ## area                  4.029e-04  1.821e-05  22.126  < 2e-16 ***
    ## Overall.Cond          8.729e-02  5.609e-03  15.562  < 2e-16 ***
    ## Bedroom.AbvGr        -6.208e-02  8.623e-03  -7.199 1.20e-12 ***
    ## Bath_1                5.279e-02  9.771e-03   5.403 8.22e-08 ***
    ## Year.Built            5.078e-03  3.099e-04  16.386  < 2e-16 ***
    ## Neighborhood2Neigh_2 -1.006e-01  1.992e-02  -5.052 5.20e-07 ***
    ## Neighborhood2Neigh_3 -3.182e-02  1.837e-02  -1.732   0.0835 .  
    ## Neighborhood2Neigh_4  2.228e-01  2.236e-02   9.963  < 2e-16 ***
    ## Neighborhood2Neigh_5  1.421e-01  2.217e-02   6.409 2.26e-10 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1778 on 989 degrees of freedom
    ## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8213 
    ## F-statistic:   460 on 10 and 989 DF,  p-value: < 2.2e-16

    2.3.5 Section 3.5 Model Testing

    How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


    The out-of-sample testing did not produce dramatically different results from the training model, so I did not change the model.


    2.4 Part 4 Final Model Assessment

    2.4.1 Section 4.1 Final Model Residual

    For your final model, create and briefly interpret an informative plot of the residuals.


    The residuals versus fits plot has constant variance with a couple outliers labeled for further investigation. Observation 428 is an abnormal sale with a very low price. Observation 310 is a normal sale a very large home, but at a rather cheap price.

    plot(ames.lm.full, which = 1)

    ames.lm.full$model[c(428, 310),]
    ##     log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
    ## 428   9.456341            4  832            2             2    1.0
    ## 310  12.126759            7 4676            5             3    4.1
    ##     Year.Built Neighborhood2 Bldg.Type Overall.Qual Central.Air
    ## 428       1923       Neigh_2      1Fam            2           N
    ## 310       2007       Neigh_2      1Fam           10           Y
    ##     Kitchen.Qual Sale.Condition
    ## 428           TA        Abnorml
    ## 310           Ex        Partial
    ames_train[c(428, 310),]
    ## # A tibble: 2 x 53
    ##    area  price MS.SubClass MS.Zoning Lot.Area Utilities Lot.Config
    ##   <int>  <int>       <int> <fct>        <int> <fct>     <fct>     
    ## 1   832  12789          30 RM            9656 AllPub    Inside    
    ## 2  4676 184750          60 RL           40094 AllPub    Inside    
    ## # ... with 46 more variables: Land.Slope <fct>, Neighborhood <fct>,
    ## #   Condition.1 <fct>, Condition.2 <fct>, Bldg.Type <fct>,
    ## #   House.Style <fct>, Overall.Qual <int>, Overall.Cond <int>,
    ## #   Year.Built <int>, Year.Remod.Add <int>, Exter.Qual <fct>,
    ## #   Exter.Cond <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
    ## #   Bsmt.Exposure <fct>, Total.Bsmt.SF <int>, Heating <fct>,
    ## #   Heating.QC <fct>, Central.Air <fct>, Bedroom.AbvGr <int>,
    ## #   Kitchen.AbvGr <int>, Kitchen.Qual <fct>, TotRms.AbvGrd <int>,
    ## #   Functional <fct>, Fireplaces <int>, Fireplace.Qu <fct>,
    ## #   Garage.Type <fct>, Garage.Yr.Blt <dbl>, Garage.Finish <fct>,
    ## #   Garage.Area <int>, Garage.Qual <fct>, Garage.Cond <fct>,
    ## #   Paved.Drive <fct>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
    ## #   Enclosed.Porch <int>, Pool.QC <fct>, Fence <fct>, Sale.Type <fct>,
    ## #   Sale.Condition <fct>, Lot.Area.Cat <dbl>, Bath_1 <dbl>,
    ## #   Neighborhood2 <fct>, resid <dbl>, fitted <dbl>, predict.price <dbl>

    2.4.2 Section 4.2 Final Model RMSE

    For your final model, calculate and briefly comment on the RMSE.


    The model RMSE (translated from log(price) to price is $32,959. This is much less than the $38,321 from the original model.

    # Extract Predictions (convert from log)
    ames_train$predict.price <- exp(predict(ames.lm.full, ames_train))
    
    # Calculate RMSE
    sqrt(mean((ames_train$price - ames_train$predict.price)^2))
    ## [1] 32959.3

    2.4.3 Section 4.3 Final Model Evaluation

    What are some strengths and weaknesses of your model?


    The model performs well for prediction (R=0.8826). It is also fairly parsimonious in its trimming the location indicator variables to 5 neighborhoods. All indicator variables were significant at the \(\alpha=0.05\) level except Neighborhood2Neigh_3, Bldg.Type2fmCon, Bldg.TypeDuplex, Bldg.TypeTwnhsE, Kitchen.QualPo, Sale.ConditionAlloca, and Sale.ConditionFamily.

    One weakness of the model is that it does not generalize to geographies other than the Ames, IA area because of the geography indicator variables.

    summary(ames.lm.full)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + 
    ##     Overall.Qual + Central.Air + Kitchen.Qual + Sale.Condition + 
    ##     Bedroom.AbvGr * Bath_1, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.47623 -0.06867 -0.00019  0.06982  0.50634 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.617e+00  5.910e-01   7.813 1.45e-14 ***
    ## Lot.Area.Cat           3.992e-02  5.590e-03   7.141 1.82e-12 ***
    ## area                   2.655e-04  1.733e-05  15.320  < 2e-16 ***
    ## Overall.Cond           5.372e-02  5.138e-03  10.456  < 2e-16 ***
    ## Bedroom.AbvGr          3.229e-02  1.471e-02   2.196 0.028353 *  
    ## Bath_1                 1.535e-01  2.019e-02   7.603 6.79e-14 ***
    ## Year.Built             2.851e-03  2.960e-04   9.633  < 2e-16 ***
    ## Neighborhood2Neigh_2  -5.359e-02  1.708e-02  -3.138 0.001753 ** 
    ## Neighborhood2Neigh_3   1.023e-02  1.592e-02   0.643 0.520598    
    ## Neighborhood2Neigh_4   1.133e-01  1.995e-02   5.679 1.79e-08 ***
    ## Neighborhood2Neigh_5   6.031e-02  1.903e-02   3.169 0.001579 ** 
    ## Bldg.Type2fmCon        6.148e-02  3.495e-02   1.759 0.078879 .  
    ## Bldg.TypeDuplex       -5.627e-02  2.915e-02  -1.930 0.053850 .  
    ## Bldg.TypeTwnhs        -1.102e-01  3.081e-02  -3.578 0.000364 ***
    ## Bldg.TypeTwnhsE       -2.375e-02  2.210e-02  -1.074 0.282935    
    ## Overall.Qual           8.432e-02  6.102e-03  13.819  < 2e-16 ***
    ## Central.AirY           1.742e-01  2.495e-02   6.983 5.35e-12 ***
    ## Kitchen.QualFa        -1.191e-01  4.313e-02  -2.761 0.005869 ** 
    ## Kitchen.QualGd        -8.341e-02  2.220e-02  -3.758 0.000182 ***
    ## Kitchen.QualPo         1.169e-01  1.515e-01   0.772 0.440437    
    ## Kitchen.QualTA        -1.074e-01  2.580e-02  -4.165 3.39e-05 ***
    ## Sale.ConditionAdjLand  2.195e-01  1.091e-01   2.012 0.044504 *  
    ## Sale.ConditionAlloca   1.003e-01  7.693e-02   1.304 0.192631    
    ## Sale.ConditionFamily   5.574e-03  4.096e-02   0.136 0.891783    
    ## Sale.ConditionNormal   1.090e-01  1.982e-02   5.501 4.82e-08 ***
    ## Sale.ConditionPartial  1.447e-01  2.690e-02   5.378 9.42e-08 ***
    ## Bedroom.AbvGr:Bath_1  -2.934e-02  5.869e-03  -4.999 6.84e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.146 on 973 degrees of freedom
    ## Multiple R-squared:  0.8826, Adjusted R-squared:  0.8795 
    ## F-statistic: 281.4 on 26 and 973 DF,  p-value: < 2.2e-16

    2.4.4 Section 4.4 Final Model Validation

    Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

    You will use the ames_validation dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
    * How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
    * From this result, does your final model properly reflect uncertainty?

    load("ames_validation.Rdata")
    ames_validation <- ames_validation %>%
      replace_na() %>%
      cat_lot_area() %>%
      bath_pt_1() %>%
      neigh_agg() %>%
      discard_vars()

    In general, the better the model fit, the lower the RMSE. In this case, the RMSE of the test data RMSE ($20,942) is lower than using the training data. This is unusual, because the model should fit the data it was fitted to better than a new data set. The outliers in the training set increased its RMSE.

    # Extract Predictions (convert from log)
    ames_validation$predict.price <- exp(predict(ames.lm.full, ames_validation))
    
    # Calculate RMSE
    sqrt(mean((ames_validation$price - ames_validation$predict.price)^2))
    ## [1] 20942.81

    97% of the 95% predictive confidence intervals contain the true price of the house in the validation data set. This result supports the model properly reflecting uncertainty in the actual sales contracts.

    # Calculate prediction interval
    ames_val.pred <- predict(ames.lm.full, ames_validation, interval = "prediction")
    
    # Get dataset of predictions and confidence intervals
    out = as.data.frame(cbind(exp(ames_val.pred),
                              price = ames_validation$price))
    
    # Fix names in dataset
    #colnames(out)[1:2] <- c("lwr", "upr")  #fix names
    
    # Get Coverage
    out %>% summarise(cover = sum(price >= lwr & price <= upr)/n())
    ##       cover
    ## 1 0.9724771
    #pred.test.HPM.coverage

    2.5 Part 5 Conclusion

    Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


    As explained in Pardoe (2008) (https://ww2.amstat.org/publications/jse/v16n2/datasets.pardoe.html), a model such as this can narrow the range of possible values for the asking price of a home. For example, a home with characteristcs such as those shown in below sold for $147,000. A 95% prediction interval based on an average of the entire data set is ($40,068, $304,460) with mean $172,264. (sample mean \(\pm\) \(t_{.05}\) × sample standard deviation). The model 95% prediction interval is much closer at ($117,358, $208,760) with predicted value $156,523.

    val <- ames_validation %>% summarise(mean=mean(price), sd=sd(price), n=n())
    error <- qt(0.975,df=val$n-1)*val$sd
    left <- mean(val$mean)-error
    right <- mean(val$mean)+error
    val$mean
    ## [1] 172264.1
    left
    ## [1] 40068.18
    right
    ## [1] 304460.1
    ames_validation[5,c('price', 'Lot.Area', 'area', 'Overall.Cond', 'Bedroom.AbvGr', 'Bath_1', 'Year.Built', 'Neighborhood', 'Overall.Qual', 'Central.Air', 'Kitchen.Qual', 'Sale.Condition', 'Exter.Qual')]
    ## # A tibble: 1 x 13
    ##    price Lot.Area  area Overall.Cond Bedroom.AbvGr Bath_1 Year.Built
    ##    <int>    <int> <int>        <int>         <int>  <dbl>      <int>
    ## 1 147000    10970  1026            6             3   2.00       1978
    ## # ... with 6 more variables: Neighborhood <fct>, Overall.Qual <int>,
    ## #   Central.Air <fct>, Kitchen.Qual <fct>, Sale.Condition <fct>,
    ## #   Exter.Qual <fct>
    exp(predict(ames.lm.full, ames_validation[5,], interval="prediction"))
    ##        fit      lwr      upr
    ## 1 156523.5 117357.5 208760.4

    The model also does a good job with characterizing the market, explaining most of the variation in sale price (R2 = 88.26%).