1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")
load("ames_test.Rdata")
load("ames_validation.Rdata")

Use the code block below to load any necessary packages

#library(statsr)
library(dplyr)
library(BAS)
## Warning: package 'BAS' was built under R version 3.4.4
library(ggplot2)
library(gridExtra)
library(MASS)
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.4

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


Homes built on large lots typically sell for higher prices, all else being equal. However, there are diminishing returns to additional acreage. Realtors have constructed lot size categories which correspond to approximately equal-sized increases in sale price (see http://ww2.amstat.org/publications/jse/v16n2/datasets.pardoe.html). The first pair of plots below shows the result of categorization. The data shows a more linear relationship between the explantory and response variable.

# Analysis of lot size.
p1 <- ggplot(ames_train, aes(x = Lot.Area, y = price)) +
  geom_point() +
  geom_smooth(se = FALSE, method="lm") +
  ggtitle("Raw Data")
ames_train$Lot.Area.Cat <- 
  (ifelse(ames_train$Lot.Area<3000, 1, 
            ifelse(ames_train$Lot.Area<5000, 2,
            ifelse(ames_train$Lot.Area<7000, 3,
            ifelse(ames_train$Lot.Area<10000, 4,
            ifelse(ames_train$Lot.Area<15000, 5,
            ifelse(ames_train$Lot.Area<20000, 6, 
            ifelse(ames_train$Lot.Area<43560, 7, 
            ifelse(ames_train$Lot.Area<43560*3, 8, 
            ifelse(ames_train$Lot.Area<43560*5, 9, 
            ifelse(ames_train$Lot.Area<43560*10, 10, 
            ifelse(ames_train$Lot.Area<43560*20, 11,12))))))))))))
# repeat for ames_test and ames_validation
ames_test$Lot.Area.Cat <- 
  (ifelse(ames_test$Lot.Area<3000, 1, 
            ifelse(ames_test$Lot.Area<5000, 2,
            ifelse(ames_test$Lot.Area<7000, 3,
            ifelse(ames_test$Lot.Area<10000, 4,
            ifelse(ames_test$Lot.Area<15000, 5,
            ifelse(ames_test$Lot.Area<20000, 6, 
            ifelse(ames_test$Lot.Area<43560, 7, 
            ifelse(ames_test$Lot.Area<43560*3, 8, 
            ifelse(ames_test$Lot.Area<43560*5, 9, 
            ifelse(ames_test$Lot.Area<43560*10, 10, 
            ifelse(ames_test$Lot.Area<43560*20, 11,12))))))))))))
ames_validation$Lot.Area.Cat <- 
  (ifelse(ames_validation$Lot.Area<3000, 1, 
            ifelse(ames_validation$Lot.Area<5000, 2,
            ifelse(ames_validation$Lot.Area<7000, 3,
            ifelse(ames_validation$Lot.Area<10000, 4,
            ifelse(ames_validation$Lot.Area<15000, 5,
            ifelse(ames_validation$Lot.Area<20000, 6, 
            ifelse(ames_validation$Lot.Area<43560, 7, 
            ifelse(ames_validation$Lot.Area<43560*3, 8, 
            ifelse(ames_validation$Lot.Area<43560*5, 9, 
            ifelse(ames_validation$Lot.Area<43560*10, 10, 
            ifelse(ames_validation$Lot.Area<43560*20, 11,12))))))))))))
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  82 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
##  $ Lot.Area.Cat   : num  4 2 3 4 4 4 3 2 5 2 ...
p2 <- ggplot(ames_train, aes(x = Lot.Area.Cat, y = price)) +  
  geom_jitter() +
  geom_smooth(se = FALSE, method="lm") +
  ggtitle("Categorized")
grid.arrange(p1, p2, ncol=2)

Half bathrooms add less value to a home than full baths. One way to model this is to assign half-bathrooms the value 0.1 instead of 0.5. The figures below each show total bathrooms (abover ground and basement), but the figure on the left assigns 0.5 for half-baths and the figure on the right assigns 0.1 for half-baths.

# Analysis of bathroom size
ames_train %>% summarise(Bsmt.Full.Bath=sum(is.na(Bsmt.Full.Bath)),
                         Bsmt.Half.Bath=sum(is.na(Bsmt.Half.Bath)),
                         Full.Bath=sum(is.na(Full.Bath)),
                         Half.Bath=sum(is.na(Half.Bath)))
## # A tibble: 1 x 4
##   Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
##            <int>          <int>     <int>     <int>
## 1              1              1         0         0
ames_test %>% summarise(Bsmt.Full.Bath=sum(is.na(Bsmt.Full.Bath)),
                         Bsmt.Half.Bath=sum(is.na(Bsmt.Half.Bath)),
                         Full.Bath=sum(is.na(Full.Bath)),
                         Half.Bath=sum(is.na(Half.Bath)))
## # A tibble: 1 x 4
##   Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
##            <int>          <int>     <int>     <int>
## 1              0              0         0         0
ames_validation %>% summarise(Bsmt.Full.Bath=sum(is.na(Bsmt.Full.Bath)),
                         Bsmt.Half.Bath=sum(is.na(Bsmt.Half.Bath)),
                         Full.Bath=sum(is.na(Full.Bath)),
                         Half.Bath=sum(is.na(Half.Bath)))
## # A tibble: 1 x 4
##   Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
##            <int>          <int>     <int>     <int>
## 1              1              1         0         0
# Relace na with 0
ames_train$Bsmt.Full.Bath[is.na(ames_train$Bsmt.Full.Bath)] <- 0
ames_train$Bsmt.Half.Bath[is.na(ames_train$Bsmt.Half.Bath)] <- 0
ames_test$Bsmt.Full.Bath[is.na(ames_test$Bsmt.Full.Bath)] <- 0
ames_test$Bsmt.Half.Bath[is.na(ames_test$Bsmt.Half.Bath)] <- 0
ames_validation$Bsmt.Full.Bath[is.na(ames_validation$Bsmt.Full.Bath)] <- 0
ames_validation$Bsmt.Half.Bath[is.na(ames_validation$Bsmt.Half.Bath)] <- 0
# Treat half baths a 0.5 bathrooms
ames_train$Bath_5 <- ames_train$Full.Bath + ames_train$Bsmt.Full.Bath + 
  0.5*ames_train$Half.Bath + 0.5*ames_train$Bsmt.Half.Bath
ames_test$Bath_5 <- ames_test$Full.Bath + ames_test$Bsmt.Full.Bath + 
  0.5*ames_test$Half.Bath + 0.5*ames_test$Bsmt.Half.Bath
ames_validation$Bath_5 <- ames_validation$Full.Bath + 
  ames_validation$Bsmt.Full.Bath + 
  0.5*ames_validation$Half.Bath + 0.5*ames_validation$Bsmt.Half.Bath
p1 <- ggplot(ames_train, aes(x = Bath_5, y = price)) +
  geom_point() +
  geom_smooth(se = FALSE, method="lm") +
  ggtitle("Half-Bath = 0.5")
# Treat half baths a 0.1 bathrooms
ames_train$Bath_1 <- ames_train$Full.Bath + ames_train$Bsmt.Full.Bath + 
  0.1*ames_train$Half.Bath + 0.1*ames_train$Bsmt.Half.Bath
ames_test$Bath_1 <- ames_test$Full.Bath + ames_test$Bsmt.Full.Bath + 
  0.1*ames_test$Half.Bath + 0.1*ames_test$Bsmt.Half.Bath
ames_validation$Bath_1 <- ames_validation$Full.Bath + 
  ames_validation$Bsmt.Full.Bath + 
  0.1*ames_validation$Half.Bath + 0.1*ames_validation$Bsmt.Half.Bath
p2 <- ggplot(ames_train, aes(x = Bath_1, y = price)) +  
  geom_jitter() +
  geom_smooth(se = FALSE, method="lm") +
  ggtitle("Half-Bath = 0.1")
grid.arrange(p1, p2, ncol=2)

Location surely makes a difference. A summary boxplot of prices by Neighborhood shows both differences in median prices, and in variation of prices. Not all neighborhoods are significantly different. After examining neighborhoods with an ANOVA test, I condensed the neighborhoods into 5 groups.

ggplot(data = ames_train, aes(x = Neighborhood, y = price)) + 
  geom_boxplot() +
  coord_flip()

ames_train %>% 
  group_by(Neighborhood) %>%
  summarise(avg.price = mean(price),
            median.price = median(price),
            sd.price = sd(price))
## # A tibble: 27 x 4
##    Neighborhood avg.price median.price sd.price
##    <fct>            <dbl>        <dbl>    <dbl>
##  1 Blmngtn         198635       191000    26455
##  2 Blueste         125800       123900    10381
##  3 BrDale           98930        98750    13338
##  4 BrkSide         122473       124000    37310
##  5 ClearCr         193154       185000    48069
##  6 CollgCr         196951       195800    52786
##  7 Crawfor         204197       205000    71268
##  8 Edwards         136322       127250    54852
##  9 Gilbert         193328       183500    41190
## 10 Greens          198562       212625    29063
## # ... with 17 more rows
ames_train$Neighborhood.chr = as.character(ames_train$Neighborhood)
ames_train$Neighborhood2 <- as.factor(
  ifelse(ames_train$Neighborhood.chr == "Blmngtn", "Neigh_1",
  ifelse(ames_train$Neighborhood.chr == "Blueste", "Neigh_1",
  ifelse(ames_train$Neighborhood.chr == "ClearCr", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "CollgCr", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "Crawfor", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "Gilbert", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "Greens", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "GrnHill", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "NPkVill", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "NWAmes", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "SawyerW", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "Veenker", "Neigh_1", 
  ifelse(ames_train$Neighborhood.chr == "BrDale", "Neigh_2", 
  ifelse(ames_train$Neighborhood.chr == "BrkSide", "Neigh_2", 
  ifelse(ames_train$Neighborhood.chr == "Edwards", "Neigh_2", 
  ifelse(ames_train$Neighborhood.chr == "IDOTRR", "Neigh_2", 
  ifelse(ames_train$Neighborhood.chr == "MeadowV", "Neigh_2", 
  ifelse(ames_train$Neighborhood.chr == "OldTown", "Neigh_2", 
  ifelse(ames_train$Neighborhood.chr == "Sawyer", "Neigh_2", 
  ifelse(ames_train$Neighborhood.chr == "SWISU", "Neigh_2", 
  ifelse(ames_train$Neighborhood.chr == "NAmes", "Neigh_3", 
  ifelse(ames_train$Neighborhood.chr == "Mitchel", "Neigh_3", 
  ifelse(ames_train$Neighborhood.chr == "StoneBr", "Neigh_4", 
  ifelse(ames_train$Neighborhood.chr == "NoRidge", "Neigh_4", 
  ifelse(ames_train$Neighborhood.chr == "NridgHt", "Neigh_4", 
  ifelse(ames_train$Neighborhood.chr == "Somerst", "Neigh_5", 
  ifelse(ames_train$Neighborhood.chr == "Timber", "Neigh_5", 
         "Neigh_5"))))))))))))))))))))))))))))
ames_test$Neighborhood.chr = as.character(ames_test$Neighborhood)
ames_test$Neighborhood2 <- as.factor(
  ifelse(ames_test$Neighborhood.chr == "Blmngtn", "Neigh_1",
  ifelse(ames_test$Neighborhood.chr == "Blueste", "Neigh_1",
  ifelse(ames_test$Neighborhood.chr == "ClearCr", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "CollgCr", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "Crawfor", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "Gilbert", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "Greens", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "GrnHill", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "NPkVill", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "NWAmes", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "SawyerW", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "Veenker", "Neigh_1", 
  ifelse(ames_test$Neighborhood.chr == "BrDale", "Neigh_2", 
  ifelse(ames_test$Neighborhood.chr == "BrkSide", "Neigh_2", 
  ifelse(ames_test$Neighborhood.chr == "Edwards", "Neigh_2", 
  ifelse(ames_test$Neighborhood.chr == "IDOTRR", "Neigh_2", 
  ifelse(ames_test$Neighborhood.chr == "MeadowV", "Neigh_2", 
  ifelse(ames_test$Neighborhood.chr == "OldTown", "Neigh_2", 
  ifelse(ames_test$Neighborhood.chr == "Sawyer", "Neigh_2", 
  ifelse(ames_test$Neighborhood.chr == "SWISU", "Neigh_2", 
  ifelse(ames_test$Neighborhood.chr == "NAmes", "Neigh_3", 
  ifelse(ames_test$Neighborhood.chr == "Mitchel", "Neigh_3", 
  ifelse(ames_test$Neighborhood.chr == "StoneBr", "Neigh_4", 
  ifelse(ames_test$Neighborhood.chr == "NoRidge", "Neigh_4", 
  ifelse(ames_test$Neighborhood.chr == "NridgHt", "Neigh_4", 
  ifelse(ames_test$Neighborhood.chr == "Somerst", "Neigh_5", 
  ifelse(ames_test$Neighborhood.chr == "Timber", "Neigh_5", 
         "Neigh_5"))))))))))))))))))))))))))))
ames_validation$Neighborhood.chr = as.character(ames_validation$Neighborhood)
ames_validation$Neighborhood2 <- as.factor(
  ifelse(ames_validation$Neighborhood.chr == "Blmngtn", "Neigh_1",
  ifelse(ames_validation$Neighborhood.chr == "Blueste", "Neigh_1",
  ifelse(ames_validation$Neighborhood.chr == "ClearCr", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "CollgCr", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "Crawfor", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "Gilbert", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "Greens", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "GrnHill", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "NPkVill", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "NWAmes", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "SawyerW", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "Veenker", "Neigh_1", 
  ifelse(ames_validation$Neighborhood.chr == "BrDale", "Neigh_2", 
  ifelse(ames_validation$Neighborhood.chr == "BrkSide", "Neigh_2", 
  ifelse(ames_validation$Neighborhood.chr == "Edwards", "Neigh_2", 
  ifelse(ames_validation$Neighborhood.chr == "IDOTRR", "Neigh_2", 
  ifelse(ames_validation$Neighborhood.chr == "MeadowV", "Neigh_2", 
  ifelse(ames_validation$Neighborhood.chr == "OldTown", "Neigh_2", 
  ifelse(ames_validation$Neighborhood.chr == "Sawyer", "Neigh_2", 
  ifelse(ames_validation$Neighborhood.chr == "SWISU", "Neigh_2", 
  ifelse(ames_validation$Neighborhood.chr == "NAmes", "Neigh_3", 
  ifelse(ames_validation$Neighborhood.chr == "Mitchel", "Neigh_3", 
  ifelse(ames_validation$Neighborhood.chr == "StoneBr", "Neigh_4", 
  ifelse(ames_validation$Neighborhood.chr == "NoRidge", "Neigh_4", 
  ifelse(ames_validation$Neighborhood.chr == "NridgHt", "Neigh_4", 
  ifelse(ames_validation$Neighborhood.chr == "Somerst", "Neigh_5", 
  ifelse(ames_validation$Neighborhood.chr == "Timber", "Neigh_5", 
         "Neigh_5"))))))))))))))))))))))))))))
aov_Neighborhood<- aov(ames_train$price ~ ames_train$Neighborhood2)
TukeyHSD(aov_Neighborhood)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ames_train$price ~ ames_train$Neighborhood2)
## 
## $`ames_train$Neighborhood2`
##                      diff         lwr       upr   p adj
## Neigh_2-Neigh_1 -71550.89  -83544.279 -59557.49 0.0e+00
## Neigh_3-Neigh_1 -48151.64  -61639.982 -34663.31 0.0e+00
## Neigh_4-Neigh_1 129907.59  113190.429 146624.75 0.0e+00
## Neigh_5-Neigh_1  46108.27   28612.427  63604.12 0.0e+00
## Neigh_3-Neigh_2  23399.24    9990.724  36807.76 2.1e-05
## Neigh_4-Neigh_2 201458.48  184805.651 218111.30 0.0e+00
## Neigh_5-Neigh_2 117659.16  100224.775 135093.54 0.0e+00
## Neigh_4-Neigh_3 178059.23  160299.438 195819.03 0.0e+00
## Neigh_5-Neigh_3  94259.92   75765.282 112754.55 0.0e+00
## Neigh_5-Neigh_4 -83799.32 -104765.473 -62833.16 0.0e+00
ggplot(data = ames_train, aes(x = Neighborhood2, y = price)) + 
  geom_boxplot() +
  coord_flip()

ames_train %>% 
  group_by(Neighborhood2) %>%
  summarise(avg.price = mean(price),
            median.price = median(price),
            sd.price = sd(price),
            n = n())
## # A tibble: 5 x 5
##   Neighborhood2 avg.price median.price sd.price     n
##   <fct>             <dbl>        <dbl>    <dbl> <int>
## 1 Neigh_1          194738       187500    51577   297
## 2 Neigh_2          123188       125000    39382   306
## 3 Neigh_3          146587       142000    31893   199
## 4 Neigh_4          324646       315750    96823   105
## 5 Neigh_5          240847       222000    70060    93

*** NOT USED - ANALYSIS OF SEASONAL EFFECT ON PRICE ***

# Analysis of season
ames_train$season <- as.factor(ifelse(ames_train$Mo.Sold<=1,ames_train$Mo.Sold,ames_train$Mo.Sold))
ggplot(ames_train, aes(x = season, y = price)) +
  geom_boxplot() 

*** NOT USED - MISSING OBS ***

# Most missing obs
na_count <- colSums(is.na(ames_train))
head(sort(na_count, decreasing = TRUE),5)
##      Pool.QC Misc.Feature        Alley        Fence Fireplace.Qu 
##          997          971          933          798          491
# For candidate variables, count na.
na_count["Lot.Area.Cat"]
## Lot.Area.Cat 
##            0
na_count["area"]
## area 
##    0
na_count["Overall.Cond"]
## Overall.Cond 
##            0
na_count["Bedroom.AbvGr"]
## Bedroom.AbvGr 
##             0
na_count["Bath_1"]
## Bath_1 
##      0
na_count["Year.Built"]
## Year.Built 
##          0
na_count["Bedroom.AbvGr"]
## Bedroom.AbvGr 
##             0
na_count["Full.Bath"]
## Full.Bath 
##         0
na_count["Bsmt.Full.Bath"]
## Bsmt.Full.Bath 
##              0
na_count["Half.Bath"]
## Half.Bath 
##         0
na_count["Bsmt.Half.Bath"]
## Bsmt.Half.Bath 
##              0

*** Data Conversion ***

#Convert Overall.Cond and Overal.Qual from int to ordered factor
#str(ames_train$Overall.Cond)
#str(ames_train$Overall.Qual)
#ames_train$Overall.Cond <- factor(ames_train$Overall.Cond, ordered = TRUE)
#ames_train$Overall.Qual <- factor(ames_train$Overall.Qual, ordered = TRUE)
#ames_test$Overall.Cond <- factor(ames_test$Overall.Cond, ordered = TRUE)
#ames_test$Overall.Qual <- factor(ames_test$Overall.Qual, ordered = TRUE)
#ames_validation$Overall.Cond <- factor(ames_validation$Overall.Cond, ordered = TRUE)
#ames_validation$Overall.Qual <- factor(ames_validation$Overall.Qual, ordered = TRUE)
*** Analysis of Candidate Variables for Model Assumptions *** Start with a Residuals vs fitted values plot to visually assess whether:
  • the average of the residuals is ~0 at all fitted values (affirms linearity).
  • the spread of the residuals is ~constant at all fitted values (affirms equal variances).
  • there are no excessively outlying points.
  • (violation of any of these three may necessitate remedial action (such as transforming one or more predictors and/or the response variable)

    The standardized residuals plot fans out, violating the equal variance condition. There are also many points >2sd from 0.

    ames_train.lm.full <- lm(price ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood, data = ames_train)
    ames_train$resid <- rstandard(ames_train.lm.full) 
    ames_train$fitted <- fitted(ames_train.lm.full) 
    ggplot(data=ames_train, aes(y = resid, x = fitted)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      xlab("Fitted Value") +
      ylab("Residual") +
      ggtitle("Residuals vs. Fits Plot \n(response is price)")

    ames_train.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood, data = ames_train)
    par(mfrow = c(2,2)) # Split plot panel into 2x2 grid
    plot(ames_train.lm.full)

    Try log-transforming price. This plot looks much better, although there are a few potential outliers.

    ames_train.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood, data = ames_train)
    ames_train$resid <- rstandard(ames_train.lm.full) 
    ames_train$fitted <- fitted(ames_train.lm.full) 
    ggplot(data=ames_train, aes(y = resid, x = fitted)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      xlab("Fitted Value") +
      ylab("Residual") +
      ggtitle("Residuals vs. Fits Plot \n(response is ln(price))")

    Create a regular residuals versus fits plot. This will label the outliers for further investigation. Observation 428 is an abnormal sale with a very low price. Observation 310 is a normal sale a very large home, but at a rather cheap price.

    plot(ames_train.lm.full, which = 1)

    ames_train.lm.full$model[c(428, 310),]
    ##     log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
    ## 428   9.456341            4  832            2             2    1.0
    ## 310  12.126759            7 4676            5             3    4.1
    ##     Year.Built Neighborhood
    ## 428       1923      OldTown
    ## 310       2007      Edwards
    ames_train[c(428, 310),]
    ## # A tibble: 2 x 89
    ##        PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
    ##      <int> <int>  <int>       <int> <fct>            <int>    <int> <fct> 
    ## 1   9.02e8   832  12789          30 RM                  68     9656 Pave  
    ## 2   9.08e8  4676 184750          60 RL                 130    40094 Pave  
    ## # ... with 81 more variables: Alley <fct>, Lot.Shape <fct>,
    ## #   Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
    ## #   Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
    ## #   Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
    ## #   Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
    ## #   Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
    ## #   Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
    ## #   Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
    ## #   Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
    ## #   Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
    ## #   BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
    ## #   Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
    ## #   Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
    ## #   X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <dbl>,
    ## #   Bsmt.Half.Bath <dbl>, Full.Bath <int>, Half.Bath <int>,
    ## #   Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
    ## #   TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
    ## #   Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
    ## #   Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
    ## #   Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
    ## #   Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
    ## #   X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
    ## #   Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
    ## #   Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>,
    ## #   Lot.Area.Cat <dbl>, Bath_5 <dbl>, Bath_1 <dbl>,
    ## #   Neighborhood.chr <chr>, Neighborhood2 <fct>, season <fct>,
    ## #   resid <dbl>, fitted <dbl>
    ames_train2 <- ames_train[c(1:309,311:427,429:1000),]
    Now create Residuals vs Predictors scatterplot matrix to assess whether:
  • the average of the residuals is ~0 at all fitted values (affirms linearity).
  • the spread of the residuals is ~constant at all fitted values (affirms equal variances).
  • (violation of any of these three may necessitate remedial action (such as transforming one or more predictors and/or the response variable)

    p1 <- ggplot(data=ames_train, aes(y = resid, x = area)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("area") +
      ylab("Residual")
    p2 <- ggplot(data=ames_train, aes(y = resid, x = Overall.Cond)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Overall.Cond") +
      ylab("Residual")
    p3 <- ggplot(data=ames_train, aes(y = resid, x = Bedroom.AbvGr)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Bedroom.AbvGr") +
      ylab("Residual") 
    p4 <- ggplot(data=ames_train, aes(y = resid, x = Bath_1)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Bath_1") +
      ylab("Residual") 
    p5 <- ggplot(data=ames_train, aes(y = resid, x = Year.Built)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Year.Built") +
      ylab("Residual") 
    p6 <- ggplot(data=ames_train, aes(y = resid, x = Lot.Area.Cat)) +
      geom_point() +
      geom_hline(yintercept = 0, linetype = 2) +
      geom_hline(yintercept = 2, linetype = 1, color="red") +
      geom_hline(yintercept = -2, linetype = 1, color="red") +
      ggtitle("Residuals vs. Predictor") +
      xlab("Lot.Area.Cat") +
      ylab("Residual")
    grid.arrange(p1, p2, ncol=2)

    grid.arrange(p3, p4, ncol=2)

    grid.arrange(p5, p6, ncol=2)

    Finally, create Histogram, boxplot, and/or normal probability plot of residuals
  • approximate normality (affirms normality).

    par(mfrow = c(2,2)) # Split plot panel into 2x2 grid
    plot(ames_train.lm.full)

    #par(mfrow = c(2,2)) # Split plot panel into 2x2 grid

  • 2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

    2.2.1 Section 2.1 An Initial Model

    In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

    Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


    Based on my EDA, I modeled log(price) on my categorized lot area (Lot.Area.Cat), home size (area), bedrooms (Bedroom.AbvGr), my modified number of bathrooms (Bath_1), home age (Year.Built), overall condition of the home (Overall.Cond), and location (Neighborhood). That is 7 predictor variables, but technically a lot more because Neighborhood is changed into 5 indicator variables.

    My model selection was based on real estate industry expectations: values are based on location, size, and quality.

    The model does a good job of predicting log(price) with an adjusted R-squared of 0.8213. All predictor variables apart from Neighborhood are highly significant (p<0.0001). The Neighborhood indicator variables are significant at the \(\alpha\)=0.05 level for 3 of the 4 indicator variables.

    #str(ames_train)
    ames_train.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2, data = ames_train)
    summary(ames_train.lm.full)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.62700 -0.08985  0.00130  0.09516  0.61324 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           7.730e-01  6.261e-01   1.235   0.2173    
    ## Lot.Area.Cat          5.168e-02  5.076e-03  10.182  < 2e-16 ***
    ## area                  4.029e-04  1.821e-05  22.126  < 2e-16 ***
    ## Overall.Cond          8.729e-02  5.609e-03  15.562  < 2e-16 ***
    ## Bedroom.AbvGr        -6.208e-02  8.623e-03  -7.199 1.20e-12 ***
    ## Bath_1                5.279e-02  9.771e-03   5.403 8.22e-08 ***
    ## Year.Built            5.078e-03  3.099e-04  16.386  < 2e-16 ***
    ## Neighborhood2Neigh_2 -1.006e-01  1.992e-02  -5.052 5.20e-07 ***
    ## Neighborhood2Neigh_3 -3.182e-02  1.837e-02  -1.732   0.0835 .  
    ## Neighborhood2Neigh_4  2.228e-01  2.236e-02   9.963  < 2e-16 ***
    ## Neighborhood2Neigh_5  1.421e-01  2.217e-02   6.409 2.26e-10 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1778 on 989 degrees of freedom
    ## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8213 
    ## F-statistic:   460 on 10 and 989 DF,  p-value: < 2.2e-16

    2.2.2 Section 2.2 Model Selection

    Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


    I used three model selection procedures: AIC, BIC, and BAS. AIC and BIC retained all predictor variables from the full model. BAS also retained all predictor variables apart from the Neighborhood variable. BAS dropped 1 of the Neighborhood indicator variables in its top five models.

    ames_train.lm.aic <- stepAIC(ames_train.lm.full, k = 2)
    ## Start:  AIC=-3443.04
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2
    ## 
    ##                 Df Sum of Sq    RSS     AIC
    ## <none>                       31.272 -3443.0
    ## - Bath_1         1    0.9230 32.195 -3416.0
    ## - Bedroom.AbvGr  1    1.6386 32.910 -3394.0
    ## - Lot.Area.Cat   1    3.2782 34.550 -3345.4
    ## - Neighborhood2  4    4.5659 35.837 -3314.8
    ## - Overall.Cond   1    7.6572 38.929 -3226.0
    ## - Year.Built     1    8.4899 39.762 -3204.9
    ## - area           1   15.4797 46.751 -3042.9
    ames_train.lm.bic <- stepAIC(ames_train.lm.full, k = log(nrow(ames_train)))
    ## Start:  AIC=-3389.06
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2
    ## 
    ##                 Df Sum of Sq    RSS     AIC
    ## <none>                       31.272 -3389.1
    ## - Bath_1         1    0.9230 32.195 -3366.9
    ## - Bedroom.AbvGr  1    1.6386 32.910 -3344.9
    ## - Lot.Area.Cat   1    3.2782 34.550 -3296.3
    ## - Neighborhood2  4    4.5659 35.837 -3280.4
    ## - Overall.Cond   1    7.6572 38.929 -3176.9
    ## - Year.Built     1    8.4899 39.762 -3155.8
    ## - area           1   15.4797 46.751 -2993.8
    ames_train.lm.bas <- bas.lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2, data = ames_train, prior = "AIC",
                               modelprior=uniform())
    summary(ames_train.lm.bas)
    ##                      P(B != 0 | Y)    model 1       model 2       model 3
    ## Intercept                1.0000000     1.0000     1.0000000  1.000000e+00
    ## Lot.Area.Cat             1.0000000     1.0000     1.0000000  1.000000e+00
    ## area                     1.0000000     1.0000     1.0000000  1.000000e+00
    ## Overall.Cond             1.0000000     1.0000     1.0000000  1.000000e+00
    ## Bedroom.AbvGr            1.0000000     1.0000     1.0000000  1.000000e+00
    ## Bath_1                   0.9999988     1.0000     1.0000000  1.000000e+00
    ## Year.Built               1.0000000     1.0000     1.0000000  1.000000e+00
    ## Neighborhood2Neigh_2     0.9999897     1.0000     1.0000000  0.000000e+00
    ## Neighborhood2Neigh_3     0.6259502     1.0000     0.0000000  0.000000e+00
    ## Neighborhood2Neigh_4     1.0000000     1.0000     1.0000000  1.000000e+00
    ## Neighborhood2Neigh_5     1.0000000     1.0000     1.0000000  1.000000e+00
    ## BF                              NA     1.0000     0.5975676  8.492981e-06
    ## PostProbs                       NA     0.6259     0.3740000  0.000000e+00
    ## R2                              NA     0.8231     0.8225000  8.181000e-01
    ## dim                             NA    11.0000    10.0000000  9.000000e+00
    ## logmarg                         NA -1732.3554 -1732.8702398 -1.744032e+03
    ##                            model 4       model 5
    ## Intercept             1.000000e+00  1.000000e+00
    ## Lot.Area.Cat          1.000000e+00  1.000000e+00
    ## area                  1.000000e+00  1.000000e+00
    ## Overall.Cond          1.000000e+00  1.000000e+00
    ## Bedroom.AbvGr         1.000000e+00  1.000000e+00
    ## Bath_1                1.000000e+00  0.000000e+00
    ## Year.Built            1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_2  0.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_3  1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_4  1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_5  1.000000e+00  1.000000e+00
    ## BF                    7.962731e-06  1.312482e-06
    ## PostProbs             0.000000e+00  0.000000e+00
    ## R2                    8.185000e-01  8.178000e-01
    ## dim                   1.000000e+01  1.000000e+01
    ## logmarg              -1.744096e+03 -1.745899e+03

    2.2.3 Section 2.3 Initial Model Residuals

    One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


    The residuals versus fits plot has constant variance with a couple outliers labeled for further investigation. Observation 428 is an abnormal sale with a very low price. Observation 310 is a normal sale a very large home, but at a rather cheap price.

    plot(ames_train.lm.full, which = 1)

    ames_train.lm.full$model[c(428, 310),]
    ##     log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
    ## 428   9.456341            4  832            2             2    1.0
    ## 310  12.126759            7 4676            5             3    4.1
    ##     Year.Built Neighborhood2
    ## 428       1923       Neigh_2
    ## 310       2007       Neigh_2
    ames_train[c(428, 310),]
    ## # A tibble: 2 x 89
    ##        PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
    ##      <int> <int>  <int>       <int> <fct>            <int>    <int> <fct> 
    ## 1   9.02e8   832  12789          30 RM                  68     9656 Pave  
    ## 2   9.08e8  4676 184750          60 RL                 130    40094 Pave  
    ## # ... with 81 more variables: Alley <fct>, Lot.Shape <fct>,
    ## #   Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
    ## #   Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
    ## #   Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
    ## #   Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
    ## #   Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
    ## #   Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
    ## #   Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
    ## #   Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
    ## #   Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
    ## #   BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
    ## #   Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
    ## #   Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
    ## #   X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <dbl>,
    ## #   Bsmt.Half.Bath <dbl>, Full.Bath <int>, Half.Bath <int>,
    ## #   Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
    ## #   TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
    ## #   Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
    ## #   Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
    ## #   Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
    ## #   Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
    ## #   X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
    ## #   Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
    ## #   Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>,
    ## #   Lot.Area.Cat <dbl>, Bath_5 <dbl>, Bath_1 <dbl>,
    ## #   Neighborhood.chr <chr>, Neighborhood2 <fct>, season <fct>,
    ## #   resid <dbl>, fitted <dbl>

    2.2.4 Section 2.4 Initial Model RMSE

    You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


    The model RMSE (translated from log(price) to price is $38,320.

    # Extract Predictions (convert from log)
    ames_train$predict.price <- exp(predict(ames_train.lm.full, ames_train))
    
    # Calculate RMSE
    rmse.full <- sqrt(mean((ames_train$price - ames_train$predict.price)^2))
    rmse.full
    ## [1] 38320.84

    2.2.5 Section 2.5 Overfitting

    The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called overfitting. To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

    # do not re-load data because already loaded and created transformed vars.
    #load("ames_test.Rdata")

    Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


    In general, the better the model fit, the lower the RMSE. In this case, the RMSE of the test data RMSE ($27,080) is lower than using the training data. This is unusual, because the model should fit the data it was fitted to better than a new data set. After testing the model with the three training set outliers removed, the RMSE was lower for the training data set than the test data set, as expected.

    # Extract Predictions (convert from log)
    ames_test$predict.price <- exp(predict(ames_train.lm.full, ames_test))
    
    # Calculate RMSE
    rmse.full <- sqrt(mean((ames_test$price - ames_test$predict.price)^2))
    rmse.full
    ## [1] 27079.7

    Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

    2.3 Part 3 Development of a Final Model

    Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

    Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

    2.3.1 Section 3.1 Final Model

    Provide the summary table for your model.


    I added six more variables to the final model for evaluation: Building type (Bldg.Type), Overall Condition (Overall.Qual), Central Air (Central.Air), Kitchen Quality (Kitchen.Qual), Sale Condition (Sale.Condition), and exterior quality (Exter.Qual). I also added two interaction variables, Central.Air*Year.Built and Bedroom.AbvGr*Bath_1.

    I used three model selection procedures: AIC, BIC, and BAS. AIC retained all predictor variables from the full model except for the Central.Air*Year.Built interaction term. BIC dropped the Central.Air*Year.Built interaction term, Bldg.Type, and Exter.Qual2. BAS also retained all predictor variables in at least one of the top 5 models.

    I decided to drop Bldg.Type and Central.Air*Year.Built from the full model.

    #na_count["Bldg.Type"]
    #na_count["Overall.Cond"]
    #na_count["Central.Air"]
    #na_count["Kitchen.Qual"]
    #na_count["Garage.Qual"]
    #na_count["Sale.Condition"]
    #na_count["Exter.Qual"]
    #na_count["Bsmt.Qual"]
    
    ames_train$Exter.Qual2 <- as.numeric(
      ifelse(ames_train$Exter.Qual == "Po", 0, 
      ifelse(ames_train$Exter.Qual == "Fa", 1, 
      ifelse(ames_train$Exter.Qual == "TA", 2, 
      ifelse(ames_train$Exter.Qual == "Gd", 3, 
      ifelse(ames_train$Exter.Qual == "Ex", 4, 5))))))
    ames_train$Kitchen.Qual2 <- as.numeric(
      ifelse(ames_train$Kitchen.Qual == "Po", 0, 
      ifelse(ames_train$Kitchen.Qual == "Fa", 1, 
      ifelse(ames_train$Kitchen.Qual == "TA", 2, 
      ifelse(ames_train$Kitchen.Qual == "Gd", 3, 
      ifelse(ames_train$Kitchen.Qual == "Ex", 4, 5))))))
    ames_test$Exter.Qual2 <- as.numeric(
      ifelse(ames_test$Exter.Qual == "Po", 0, 
      ifelse(ames_test$Exter.Qual == "Fa", 1, 
      ifelse(ames_test$Exter.Qual == "TA", 2, 
      ifelse(ames_test$Exter.Qual == "Gd", 3, 
      ifelse(ames_test$Exter.Qual == "Ex", 4, 5))))))
    ames_test$Kitchen.Qual2 <- as.numeric(
      ifelse(ames_test$Kitchen.Qual == "Po", 0, 
      ifelse(ames_test$Kitchen.Qual == "Fa", 1, 
      ifelse(ames_test$Kitchen.Qual == "TA", 2, 
      ifelse(ames_test$Kitchen.Qual == "Gd", 3, 
      ifelse(ames_test$Kitchen.Qual == "Ex", 4, 5))))))
    ames_validation$Exter.Qual2 <- as.numeric(
      ifelse(ames_validation$Exter.Qual == "Po", 0, 
      ifelse(ames_validation$Exter.Qual == "Fa", 1, 
      ifelse(ames_validation$Exter.Qual == "TA", 2, 
      ifelse(ames_validation$Exter.Qual == "Gd", 3, 
      ifelse(ames_validation$Exter.Qual == "Ex", 4, 5))))))
    ames_validation$Kitchen.Qual2 <- as.numeric(
      ifelse(ames_validation$Kitchen.Qual == "Po", 0, 
      ifelse(ames_validation$Kitchen.Qual == "Fa", 1, 
      ifelse(ames_validation$Kitchen.Qual == "TA", 2, 
      ifelse(ames_validation$Kitchen.Qual == "Gd", 3, 
      ifelse(ames_validation$Kitchen.Qual == "Ex", 4, 5))))))
    
    ames.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2 + Bldg.Type + Overall.Qual +
                               Central.Air + Kitchen.Qual2 +
                               Sale.Condition + Exter.Qual2 + Central.Air*Year.Built + Bedroom.AbvGr*Bath_1,
                             data = ames_train)
    ames.lm.aic <- stepAIC(ames.lm.full, k = 2)
    ## Start:  AIC=-3814.92
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + 
    ##     Central.Air * Year.Built + Bedroom.AbvGr * Bath_1
    ## 
    ##                          Df Sum of Sq    RSS     AIC
    ## - Year.Built:Central.Air  1    0.0011 20.924 -3816.9
    ## - Exter.Qual2             1    0.0309 20.954 -3815.4
    ## <none>                                20.923 -3814.9
    ## - Kitchen.Qual2           1    0.1406 21.063 -3810.2
    ## - Bldg.Type               4    0.4144 21.337 -3803.3
    ## - Bedroom.AbvGr:Bath_1    1    0.5291 21.452 -3791.9
    ## - Sale.Condition          5    0.8930 21.816 -3783.1
    ## - Neighborhood2           4    1.1787 22.102 -3768.1
    ## - Lot.Area.Cat            1    1.1333 22.056 -3764.2
    ## - Overall.Cond            1    2.2194 23.142 -3716.1
    ## - Overall.Qual            1    3.8546 24.777 -3647.8
    ## - area                    1    5.0222 25.945 -3601.8
    ## 
    ## Step:  AIC=-3816.86
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + 
    ##     Bedroom.AbvGr:Bath_1
    ## 
    ##                        Df Sum of Sq    RSS     AIC
    ## - Exter.Qual2           1    0.0302 20.954 -3817.4
    ## <none>                              20.924 -3816.9
    ## - Kitchen.Qual2         1    0.1413 21.065 -3812.1
    ## - Bldg.Type             4    0.4133 21.337 -3805.3
    ## - Bedroom.AbvGr:Bath_1  1    0.5279 21.452 -3793.9
    ## - Sale.Condition        5    0.8923 21.816 -3785.1
    ## - Central.Air           1    1.0193 21.943 -3771.3
    ## - Neighborhood2         4    1.1829 22.107 -3769.9
    ## - Lot.Area.Cat          1    1.1340 22.058 -3766.1
    ## - Year.Built            1    1.7971 22.721 -3736.5
    ## - Overall.Cond          1    2.2207 23.145 -3718.0
    ## - Overall.Qual          1    3.9030 24.827 -3647.8
    ## - area                  1    5.0217 25.945 -3603.8
    ## 
    ## Step:  AIC=-3817.42
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual2 + Sale.Condition + Bedroom.AbvGr:Bath_1
    ## 
    ##                        Df Sum of Sq    RSS     AIC
    ## <none>                              20.954 -3817.4
    ## - Kitchen.Qual2         1    0.2151 21.169 -3809.2
    ## - Bldg.Type             4    0.4212 21.375 -3805.5
    ## - Bedroom.AbvGr:Bath_1  1    0.5410 21.495 -3793.9
    ## - Sale.Condition        5    0.9193 21.873 -3784.5
    ## - Central.Air           1    1.0040 21.958 -3772.6
    ## - Lot.Area.Cat          1    1.1231 22.077 -3767.2
    ## - Neighborhood2         4    1.2604 22.215 -3767.0
    ## - Year.Built            1    1.8713 22.825 -3733.9
    ## - Overall.Cond          1    2.2133 23.167 -3719.0
    ## - Overall.Qual          1    4.3582 25.312 -3630.5
    ## - area                  1    5.0865 26.041 -3602.1
    summary(ames.lm.aic)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + 
    ##     Overall.Qual + Central.Air + Kitchen.Qual2 + Sale.Condition + 
    ##     Bedroom.AbvGr:Bath_1, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.44662 -0.06907 -0.00170  0.07010  0.50119 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.638e+00  5.802e-01   7.995 3.65e-15 ***
    ## Lot.Area.Cat           4.048e-02  5.597e-03   7.233 9.56e-13 ***
    ## area                   2.669e-04  1.734e-05  15.392  < 2e-16 ***
    ## Overall.Cond           5.194e-02  5.115e-03  10.153  < 2e-16 ***
    ## Bedroom.AbvGr          3.133e-02  1.473e-02   2.127 0.033635 *  
    ## Bath_1                 1.528e-01  2.021e-02   7.560 9.25e-14 ***
    ## Year.Built             2.753e-03  2.949e-04   9.336  < 2e-16 ***
    ## Neighborhood2Neigh_2  -5.128e-02  1.686e-02  -3.042 0.002415 ** 
    ## Neighborhood2Neigh_3   1.293e-02  1.571e-02   0.823 0.410526    
    ## Neighborhood2Neigh_4   1.234e-01  1.961e-02   6.292 4.72e-10 ***
    ## Neighborhood2Neigh_5   6.037e-02  1.907e-02   3.166 0.001593 ** 
    ## Bldg.Type2fmCon        5.919e-02  3.493e-02   1.695 0.090478 .  
    ## Bldg.TypeDuplex       -5.333e-02  2.912e-02  -1.831 0.067370 .  
    ## Bldg.TypeTwnhs        -1.120e-01  3.083e-02  -3.631 0.000297 ***
    ## Bldg.TypeTwnhsE       -2.746e-02  2.210e-02  -1.243 0.214316    
    ## Overall.Qual           8.653e-02  6.073e-03  14.248  < 2e-16 ***
    ## Central.AirY           1.670e-01  2.442e-02   6.838 1.41e-11 ***
    ## Kitchen.Qual2          3.376e-02  1.067e-02   3.165 0.001598 ** 
    ## Sale.ConditionAdjLand  2.155e-01  1.095e-01   1.969 0.049291 *  
    ## Sale.ConditionAlloca   9.977e-02  7.719e-02   1.292 0.196491    
    ## Sale.ConditionFamily   4.084e-03  4.104e-02   0.099 0.920764    
    ## Sale.ConditionNormal   1.075e-01  1.980e-02   5.428 7.18e-08 ***
    ## Sale.ConditionPartial  1.450e-01  2.691e-02   5.387 8.98e-08 ***
    ## Bedroom.AbvGr:Bath_1  -2.953e-02  5.883e-03  -5.020 6.14e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1465 on 976 degrees of freedom
    ## Multiple R-squared:  0.8814, Adjusted R-squared:  0.8786 
    ## F-statistic: 315.5 on 23 and 976 DF,  p-value: < 2.2e-16
    ames.lm.bic <- stepAIC(ames.lm.full, k = log(nrow(ames_train)))
    ## Start:  AIC=-3687.32
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + 
    ##     Central.Air * Year.Built + Bedroom.AbvGr * Bath_1
    ## 
    ##                          Df Sum of Sq    RSS     AIC
    ## - Bldg.Type               4    0.4144 21.337 -3695.3
    ## - Year.Built:Central.Air  1    0.0011 20.924 -3694.2
    ## - Exter.Qual2             1    0.0309 20.954 -3692.7
    ## - Kitchen.Qual2           1    0.1406 21.063 -3687.5
    ## <none>                                20.923 -3687.3
    ## - Sale.Condition          5    0.8930 21.816 -3680.1
    ## - Bedroom.AbvGr:Bath_1    1    0.5291 21.452 -3669.3
    ## - Neighborhood2           4    1.1787 22.102 -3660.1
    ## - Lot.Area.Cat            1    1.1333 22.056 -3641.5
    ## - Overall.Cond            1    2.2194 23.142 -3593.4
    ## - Overall.Qual            1    3.8546 24.777 -3525.1
    ## - area                    1    5.0222 25.945 -3479.1
    ## 
    ## Step:  AIC=-3695.33
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + Central.Air + 
    ##     Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + Year.Built:Central.Air + 
    ##     Bedroom.AbvGr:Bath_1
    ## 
    ##                          Df Sum of Sq    RSS     AIC
    ## - Year.Built:Central.Air  1    0.0000 21.337 -3702.2
    ## - Exter.Qual2             1    0.0380 21.375 -3700.5
    ## <none>                                21.337 -3695.3
    ## - Kitchen.Qual2           1    0.1743 21.512 -3694.1
    ## - Sale.Condition          5    0.9807 22.318 -3684.9
    ## - Bedroom.AbvGr:Bath_1    1    0.5906 21.928 -3674.9
    ## - Neighborhood2           4    1.0793 22.416 -3673.6
    ## - Overall.Cond            1    2.3416 23.679 -3598.1
    ## - Lot.Area.Cat            1    3.2665 24.604 -3559.8
    ## - Overall.Qual            1    3.9805 25.318 -3531.2
    ## - area                    1    4.8501 26.187 -3497.4
    ## 
    ## Step:  AIC=-3702.24
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + Central.Air + 
    ##     Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + Bedroom.AbvGr:Bath_1
    ## 
    ##                        Df Sum of Sq    RSS     AIC
    ## - Exter.Qual2           1    0.0380 21.375 -3707.4
    ## <none>                              21.337 -3702.2
    ## - Kitchen.Qual2         1    0.1744 21.512 -3701.0
    ## - Sale.Condition        5    0.9825 22.320 -3691.8
    ## - Bedroom.AbvGr:Bath_1  1    0.5927 21.930 -3681.7
    ## - Neighborhood2         4    1.0809 22.418 -3680.5
    ## - Central.Air           1    0.9884 22.326 -3663.9
    ## - Year.Built            1    1.6100 22.947 -3636.4
    ## - Overall.Cond          1    2.3425 23.680 -3605.0
    ## - Lot.Area.Cat          1    3.2763 24.614 -3566.3
    ## - Overall.Qual          1    4.0470 25.384 -3535.5
    ## - area                  1    4.8557 26.193 -3504.1
    ## 
    ## Step:  AIC=-3707.37
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + Central.Air + 
    ##     Kitchen.Qual2 + Sale.Condition + Bedroom.AbvGr:Bath_1
    ## 
    ##                        Df Sum of Sq    RSS     AIC
    ## <none>                              21.375 -3707.4
    ## - Kitchen.Qual2         1    0.2691 21.644 -3701.8
    ## - Sale.Condition        5    1.0172 22.392 -3695.4
    ## - Bedroom.AbvGr:Bath_1  1    0.6071 21.982 -3686.3
    ## - Neighborhood2         4    1.1595 22.535 -3682.2
    ## - Central.Air           1    0.9708 22.346 -3669.9
    ## - Year.Built            1    1.6830 23.058 -3638.5
    ## - Overall.Cond          1    2.3366 23.712 -3610.5
    ## - Lot.Area.Cat          1    3.2729 24.648 -3571.8
    ## - Overall.Qual          1    4.5547 25.930 -3521.1
    ## - area                  1    4.9147 26.290 -3507.3
    summary(ames.lm.bic)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual2 + Sale.Condition + Bedroom.AbvGr:Bath_1, 
    ##     data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.45617 -0.07097  0.00077  0.07238  0.53162 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.889e+00  5.815e-01   8.407  < 2e-16 ***
    ## Lot.Area.Cat           5.208e-02  4.251e-03  12.250  < 2e-16 ***
    ## area                   2.595e-04  1.728e-05  15.011  < 2e-16 ***
    ## Overall.Cond           5.290e-02  5.111e-03  10.350  < 2e-16 ***
    ## Bedroom.AbvGr          3.608e-02  1.472e-02   2.450 0.014449 *  
    ## Bath_1                 1.543e-01  2.007e-02   7.690 3.57e-14 ***
    ## Year.Built             2.589e-03  2.947e-04   8.784  < 2e-16 ***
    ## Neighborhood2Neigh_2  -5.025e-02  1.693e-02  -2.968 0.003069 ** 
    ## Neighborhood2Neigh_3   1.115e-02  1.570e-02   0.710 0.477675    
    ## Neighborhood2Neigh_4   1.171e-01  1.965e-02   5.963 3.46e-09 ***
    ## Neighborhood2Neigh_5   5.869e-02  1.920e-02   3.057 0.002293 ** 
    ## Overall.Qual           8.809e-02  6.096e-03  14.451  < 2e-16 ***
    ## Central.AirY           1.591e-01  2.385e-02   6.671 4.23e-11 ***
    ## Kitchen.Qual2          3.742e-02  1.065e-02   3.513 0.000464 ***
    ## Sale.ConditionAdjLand  1.684e-01  1.085e-01   1.553 0.120854    
    ## Sale.ConditionAlloca   7.943e-02  7.706e-02   1.031 0.302914    
    ## Sale.ConditionFamily   1.050e-03  4.135e-02   0.025 0.979738    
    ## Sale.ConditionNormal   1.120e-01  1.993e-02   5.620 2.49e-08 ***
    ## Sale.ConditionPartial  1.551e-01  2.699e-02   5.746 1.22e-08 ***
    ## Bedroom.AbvGr:Bath_1  -3.087e-02  5.851e-03  -5.276 1.63e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1477 on 980 degrees of freedom
    ## Multiple R-squared:  0.879,  Adjusted R-squared:  0.8767 
    ## F-statistic: 374.9 on 19 and 980 DF,  p-value: < 2.2e-16
    ames.lm.bas <- bas.lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2 + Bldg.Type + Overall.Qual +
                               Central.Air + Kitchen.Qual2 +
                               Sale.Condition + Exter.Qual2 + Central.Air*Year.Built + Bedroom.AbvGr*Bath_1, 
                          data = ames_train, prior = "AIC",
                               modelprior=uniform())
    summary(ames.lm.bas)
    ##                         P(B != 0 | Y)    model 1      model 2
    ## Intercept                   1.0000000     1.0000     1.000000
    ## Lot.Area.Cat                1.0000000     1.0000     1.000000
    ## area                        1.0000000     1.0000     1.000000
    ## Overall.Cond                1.0000000     1.0000     1.000000
    ## Bedroom.AbvGr               0.8409956     1.0000     1.000000
    ## Bath_1                      1.0000000     1.0000     1.000000
    ## Year.Built                  0.9899537     1.0000     1.000000
    ## Neighborhood2Neigh_2        0.9874210     1.0000     1.000000
    ## Neighborhood2Neigh_3        0.3500874     0.0000     0.000000
    ## Neighborhood2Neigh_4        0.9999999     1.0000     1.000000
    ## Neighborhood2Neigh_5        0.9804641     1.0000     1.000000
    ## Bldg.Type2fmCon             0.6438858     1.0000     1.000000
    ## Bldg.TypeDuplex             0.5881163     1.0000     1.000000
    ## Bldg.TypeTwnhs              0.9939672     1.0000     1.000000
    ## Bldg.TypeTwnhsE             0.4441211     0.0000     0.000000
    ## Overall.Qual                1.0000000     1.0000     1.000000
    ## Central.AirY                0.5838850     1.0000     0.000000
    ## Kitchen.Qual2               0.9552425     1.0000     1.000000
    ## Sale.ConditionAdjLand       0.6321541     1.0000     1.000000
    ## Sale.ConditionAlloca        0.4201733     0.0000     0.000000
    ## Sale.ConditionFamily        0.2707315     0.0000     0.000000
    ## Sale.ConditionNormal        0.9999989     1.0000     1.000000
    ## Sale.ConditionPartial       0.9999985     1.0000     1.000000
    ## Exter.Qual2                 0.4301311     0.0000     0.000000
    ## Year.Built:Central.AirY     0.5819537     0.0000     1.000000
    ## Bedroom.AbvGr:Bath_1        0.9999940     1.0000     1.000000
    ## BF                                 NA     1.0000     0.981144
    ## PostProbs                          NA     0.0068     0.006600
    ## R2                                 NA     0.8810     0.881000
    ## dim                                NA    20.0000    20.000000
    ## logmarg                            NA -1543.1431 -1543.162135
    ##                               model 3       model 4       model 5
    ## Intercept                   1.0000000     1.0000000     1.0000000
    ## Lot.Area.Cat                1.0000000     1.0000000     1.0000000
    ## area                        1.0000000     1.0000000     1.0000000
    ## Overall.Cond                1.0000000     1.0000000     1.0000000
    ## Bedroom.AbvGr               1.0000000     1.0000000     1.0000000
    ## Bath_1                      1.0000000     1.0000000     1.0000000
    ## Year.Built                  1.0000000     1.0000000     1.0000000
    ## Neighborhood2Neigh_2        1.0000000     1.0000000     1.0000000
    ## Neighborhood2Neigh_3        0.0000000     0.0000000     0.0000000
    ## Neighborhood2Neigh_4        1.0000000     1.0000000     1.0000000
    ## Neighborhood2Neigh_5        1.0000000     1.0000000     1.0000000
    ## Bldg.Type2fmCon             1.0000000     1.0000000     1.0000000
    ## Bldg.TypeDuplex             1.0000000     1.0000000     1.0000000
    ## Bldg.TypeTwnhs              1.0000000     1.0000000     1.0000000
    ## Bldg.TypeTwnhsE             1.0000000     1.0000000     0.0000000
    ## Overall.Qual                1.0000000     1.0000000     1.0000000
    ## Central.AirY                1.0000000     0.0000000     1.0000000
    ## Kitchen.Qual2               1.0000000     1.0000000     1.0000000
    ## Sale.ConditionAdjLand       1.0000000     1.0000000     1.0000000
    ## Sale.ConditionAlloca        0.0000000     0.0000000     1.0000000
    ## Sale.ConditionFamily        0.0000000     0.0000000     0.0000000
    ## Sale.ConditionNormal        1.0000000     1.0000000     1.0000000
    ## Sale.ConditionPartial       1.0000000     1.0000000     1.0000000
    ## Exter.Qual2                 0.0000000     0.0000000     0.0000000
    ## Year.Built:Central.AirY     0.0000000     1.0000000     0.0000000
    ## Bedroom.AbvGr:Bath_1        1.0000000     1.0000000     1.0000000
    ## BF                          0.8499294     0.8384008     0.8054951
    ## PostProbs                   0.0058000     0.0057000     0.0055000
    ## R2                          0.8812000     0.8812000     0.8811000
    ## dim                        21.0000000    21.0000000    21.0000000
    ## logmarg                 -1543.3057005 -1543.3193576 -1543.3593967
    ames.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2 + Overall.Qual +
                               Central.Air + Kitchen.Qual2 +
                               Sale.Condition + Exter.Qual2 + Bedroom.AbvGr*Bath_1,
                             data = ames_train)
    summary(ames.lm.full)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + 
    ##     Bedroom.AbvGr * Bath_1, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.46466 -0.07111 -0.00030  0.07329  0.52987 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.9506781  0.5831688   8.489  < 2e-16 ***
    ## Lot.Area.Cat           0.0521039  0.0042496  12.261  < 2e-16 ***
    ## area                   0.0002583  0.0000173  14.926  < 2e-16 ***
    ## Overall.Cond           0.0529747  0.0051098  10.367  < 2e-16 ***
    ## Bedroom.AbvGr          0.0368895  0.0147321   2.504  0.01244 *  
    ## Bath_1                 0.1532259  0.0200788   7.631 5.50e-14 ***
    ## Year.Built             0.0025467  0.0002963   8.595  < 2e-16 ***
    ## Neighborhood2Neigh_2  -0.0481259  0.0169999  -2.831  0.00474 ** 
    ## Neighborhood2Neigh_3   0.0137171  0.0158151   0.867  0.38597    
    ## Neighborhood2Neigh_4   0.1127654  0.0199149   5.662 1.96e-08 ***
    ## Neighborhood2Neigh_5   0.0568896  0.0192384   2.957  0.00318 ** 
    ## Overall.Qual           0.0859463  0.0063072  13.627  < 2e-16 ***
    ## Central.AirY           0.1607451  0.0238699   6.734 2.81e-11 ***
    ## Kitchen.Qual2          0.0321648  0.0113698   2.829  0.00477 ** 
    ## Sale.ConditionAdjLand  0.1656805  0.1084536   1.528  0.12692    
    ## Sale.ConditionAlloca   0.0809501  0.0770372   1.051  0.29361    
    ## Sale.ConditionFamily   0.0022861  0.0413424   0.055  0.95591    
    ## Sale.ConditionNormal   0.1112681  0.0199267   5.584 3.05e-08 ***
    ## Sale.ConditionPartial  0.1521176  0.0270766   5.618 2.52e-08 ***
    ## Exter.Qual2            0.0190427  0.0144166   1.321  0.18685    
    ## Bedroom.AbvGr:Bath_1  -0.0305291  0.0058543  -5.215 2.24e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1476 on 979 degrees of freedom
    ## Multiple R-squared:  0.8793, Adjusted R-squared:  0.8768 
    ## F-statistic: 356.5 on 20 and 979 DF,  p-value: < 2.2e-16

    The final model is

    log(price) = \(\beta_0\) + \(\beta_1\)Lot.Area.Cat + \(\beta_2\)area + \(\beta_3\)Overall.Cond + \(\beta_4\)Bedroom.AbvGr + \(\beta_5\)Bath_1 + \(\beta_6\)Year.Built + \(\beta_7\)Neighborhood2Neigh_2 + \(\beta_8\)Neighborhood2Neigh_3 + \(\beta_9\)Neighborhood2Neigh_4 + \(\beta_{10}\)Neighborhood2Neigh_5 +\(\beta_{11}\)Overall.Qual + \(\beta_{12}\)Central.AirY + \(\beta_{13}\)Kitchen.Qual2 + \(\beta_{14}\)Sale.ConditionAdjLand + \(\beta_{15}\)Sale.ConditionAlloca + \(\beta_{16}\)Sale.ConditionFamily + \(\beta_{17}\)Sale.ConditionNormal + \(\beta_{18}\)Sale.ConditionPartial + \(\beta_{19}\)Exter.Qual2 + \(\beta_{20}\)Bedroom.AbvGr*Bath_1

    with estimated regression equation log(price) = 4.95 + 0.0521Lot.Area.Cat + 0.0002area + 0.0530Overall.Cond + 0.0369Bedroom.AbvGr + 0.1532Bath_1 + 0.0025Year.Built -0.0481Neighborhood2Neigh_2 + 0.0137Neighborhood2Neigh_3 + 0.1128Neighborhood2Neigh_4 + 0.0569Neighborhood2Neigh_5 +0.0859Overall.Qual + 0.1607Central.AirY + 0.0321Kitchen.Qual2 + 0.1657Sale.ConditionAdjLand + 0.0810Sale.ConditionAlloca + 0.0023Sale.ConditionFamily + 0.1112Sale.ConditionNormal + 0.1521Sale.ConditionPartial + 0.0190Exter.Qual2 -0.0305Bedroom.AbvGr*Bath_1


    2.3.2 Section 3.2 Transformation

    Did you decide to transform any variables? Why or why not? Explain in a few sentences.


    I transformed price into log(price) to conform the model to the least square conditions. I did not transform any predictor variables to conform to model conditions.

    I did transform predictor variables to create a general and parsimonious model.
  • Condensed Neighborhoods into 5 groups
  • Created a total bathroom variable that treats half-baths as .1 to reflect their relatively low value.


  • 2.3.3 Section 3.3 Variable Interaction

    I created interaction variable Year.Built*Central.Air because newer homes are also more likely to have central air conditioning. I ultimately dropped this variable from the final model.

    I created interaction variable Bedroom.AbvGr*Bath.1 because bathrooms should increase with bedrooms or else the property value will suffer.



    2.3.4 Section 3.4 Variable Selection

    What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


    I tried the AIC, BIC, and BAS methods, and ultimately relied on the variable selection of the BIC method because it produced the most parsimonious model.

    ames_train.lm.aic <- stepAIC(ames_train.lm.full, k = 2)
    ## Start:  AIC=-3443.04
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2
    ## 
    ##                 Df Sum of Sq    RSS     AIC
    ## <none>                       31.272 -3443.0
    ## - Bath_1         1    0.9230 32.195 -3416.0
    ## - Bedroom.AbvGr  1    1.6386 32.910 -3394.0
    ## - Lot.Area.Cat   1    3.2782 34.550 -3345.4
    ## - Neighborhood2  4    4.5659 35.837 -3314.8
    ## - Overall.Cond   1    7.6572 38.929 -3226.0
    ## - Year.Built     1    8.4899 39.762 -3204.9
    ## - area           1   15.4797 46.751 -3042.9
    ames_train.lm.bic <- stepAIC(ames_train.lm.full, k = log(nrow(ames_train)))
    ## Start:  AIC=-3389.06
    ## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr + 
    ##     Bath_1 + Year.Built + Neighborhood2
    ## 
    ##                 Df Sum of Sq    RSS     AIC
    ## <none>                       31.272 -3389.1
    ## - Bath_1         1    0.9230 32.195 -3366.9
    ## - Bedroom.AbvGr  1    1.6386 32.910 -3344.9
    ## - Lot.Area.Cat   1    3.2782 34.550 -3296.3
    ## - Neighborhood2  4    4.5659 35.837 -3280.4
    ## - Overall.Cond   1    7.6572 38.929 -3176.9
    ## - Year.Built     1    8.4899 39.762 -3155.8
    ## - area           1   15.4797 46.751 -2993.8
    ames_train.lm.bas <- bas.lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
                               Bedroom.AbvGr + Bath_1 + Year.Built +
                               Neighborhood2, data = ames_train, prior = "AIC",
                               modelprior=uniform())
    summary(ames_train.lm.bas)
    ##                      P(B != 0 | Y)    model 1       model 2       model 3
    ## Intercept                1.0000000     1.0000     1.0000000  1.000000e+00
    ## Lot.Area.Cat             1.0000000     1.0000     1.0000000  1.000000e+00
    ## area                     1.0000000     1.0000     1.0000000  1.000000e+00
    ## Overall.Cond             1.0000000     1.0000     1.0000000  1.000000e+00
    ## Bedroom.AbvGr            1.0000000     1.0000     1.0000000  1.000000e+00
    ## Bath_1                   0.9999988     1.0000     1.0000000  1.000000e+00
    ## Year.Built               1.0000000     1.0000     1.0000000  1.000000e+00
    ## Neighborhood2Neigh_2     0.9999897     1.0000     1.0000000  0.000000e+00
    ## Neighborhood2Neigh_3     0.6259502     1.0000     0.0000000  0.000000e+00
    ## Neighborhood2Neigh_4     1.0000000     1.0000     1.0000000  1.000000e+00
    ## Neighborhood2Neigh_5     1.0000000     1.0000     1.0000000  1.000000e+00
    ## BF                              NA     1.0000     0.5975676  8.492981e-06
    ## PostProbs                       NA     0.6259     0.3740000  0.000000e+00
    ## R2                              NA     0.8231     0.8225000  8.181000e-01
    ## dim                             NA    11.0000    10.0000000  9.000000e+00
    ## logmarg                         NA -1732.3554 -1732.8702398 -1.744032e+03
    ##                            model 4       model 5
    ## Intercept             1.000000e+00  1.000000e+00
    ## Lot.Area.Cat          1.000000e+00  1.000000e+00
    ## area                  1.000000e+00  1.000000e+00
    ## Overall.Cond          1.000000e+00  1.000000e+00
    ## Bedroom.AbvGr         1.000000e+00  1.000000e+00
    ## Bath_1                1.000000e+00  0.000000e+00
    ## Year.Built            1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_2  0.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_3  1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_4  1.000000e+00  1.000000e+00
    ## Neighborhood2Neigh_5  1.000000e+00  1.000000e+00
    ## BF                    7.962731e-06  1.312482e-06
    ## PostProbs             0.000000e+00  0.000000e+00
    ## R2                    8.185000e-01  8.178000e-01
    ## dim                   1.000000e+01  1.000000e+01
    ## logmarg              -1.744096e+03 -1.745899e+03

    2.3.5 Section 3.5 Model Testing

    How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


    The out-of-sample testing did not produce dramatically different results from the training model, so I did not change the model.


    2.4 Part 4 Final Model Assessment

    2.4.1 Section 4.1 Final Model Residual

    For your final model, create and briefly interpret an informative plot of the residuals.


    The residuals versus fits plot has constant variance with a couple outliers labeled for further investigation. Observation 428 is an abnormal sale with a very low price. Observation 310 is a normal sale a very large home, but at a rather cheap price.

    plot(ames.lm.full, which = 1)

    ames_train.lm.full$model[c(428, 310),]
    ##     log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
    ## 428   9.456341            4  832            2             2    1.0
    ## 310  12.126759            7 4676            5             3    4.1
    ##     Year.Built Neighborhood2
    ## 428       1923       Neigh_2
    ## 310       2007       Neigh_2
    ames_train[c(428, 310),]
    ## # A tibble: 2 x 92
    ##        PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
    ##      <int> <int>  <int>       <int> <fct>            <int>    <int> <fct> 
    ## 1   9.02e8   832  12789          30 RM                  68     9656 Pave  
    ## 2   9.08e8  4676 184750          60 RL                 130    40094 Pave  
    ## # ... with 84 more variables: Alley <fct>, Lot.Shape <fct>,
    ## #   Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
    ## #   Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
    ## #   Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
    ## #   Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
    ## #   Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
    ## #   Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
    ## #   Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
    ## #   Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
    ## #   Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
    ## #   BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
    ## #   Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
    ## #   Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
    ## #   X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <dbl>,
    ## #   Bsmt.Half.Bath <dbl>, Full.Bath <int>, Half.Bath <int>,
    ## #   Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
    ## #   TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
    ## #   Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
    ## #   Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
    ## #   Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
    ## #   Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
    ## #   X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
    ## #   Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
    ## #   Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>,
    ## #   Lot.Area.Cat <dbl>, Bath_5 <dbl>, Bath_1 <dbl>,
    ## #   Neighborhood.chr <chr>, Neighborhood2 <fct>, season <fct>,
    ## #   resid <dbl>, fitted <dbl>, predict.price <dbl>, Exter.Qual2 <dbl>,
    ## #   Kitchen.Qual2 <dbl>

    2.4.2 Section 4.2 Final Model RMSE

    For your final model, calculate and briefly comment on the RMSE.


    The model RMSE (translated from log(price) to price is $32,975.

    # Extract Predictions (convert from log)
    ames_train$predict.price <- exp(predict(ames.lm.full, ames_train))
    
    # Calculate RMSE
    rmse.full <- sqrt(mean((ames_train$price - ames_train$predict.price)^2))
    rmse.full
    ## [1] 32975.17

    2.4.3 Section 4.3 Final Model Evaluation

    What are some strengths and weaknesses of your model?


    The model performs well for prediction (R=0.8793). It is also fairly parsimonious in its trimming the location indicator variables to 5 neighborhoods. All indicator variables were significant at the \(\alpha=0.05\) level except Neighborhood2Neigh_3, Sale.ConditionAdjLand, Sale.ConditionAlloca, Sale.ConditionFamily, and Exter.Qual2.

    One weakness of the model is that it does not generalize to geographies other than the Ames, IA area because of the geography indicator variables.

    summary(ames.lm.full)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond + 
    ##     Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + 
    ##     Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + 
    ##     Bedroom.AbvGr * Bath_1, data = ames_train)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.46466 -0.07111 -0.00030  0.07329  0.52987 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.9506781  0.5831688   8.489  < 2e-16 ***
    ## Lot.Area.Cat           0.0521039  0.0042496  12.261  < 2e-16 ***
    ## area                   0.0002583  0.0000173  14.926  < 2e-16 ***
    ## Overall.Cond           0.0529747  0.0051098  10.367  < 2e-16 ***
    ## Bedroom.AbvGr          0.0368895  0.0147321   2.504  0.01244 *  
    ## Bath_1                 0.1532259  0.0200788   7.631 5.50e-14 ***
    ## Year.Built             0.0025467  0.0002963   8.595  < 2e-16 ***
    ## Neighborhood2Neigh_2  -0.0481259  0.0169999  -2.831  0.00474 ** 
    ## Neighborhood2Neigh_3   0.0137171  0.0158151   0.867  0.38597    
    ## Neighborhood2Neigh_4   0.1127654  0.0199149   5.662 1.96e-08 ***
    ## Neighborhood2Neigh_5   0.0568896  0.0192384   2.957  0.00318 ** 
    ## Overall.Qual           0.0859463  0.0063072  13.627  < 2e-16 ***
    ## Central.AirY           0.1607451  0.0238699   6.734 2.81e-11 ***
    ## Kitchen.Qual2          0.0321648  0.0113698   2.829  0.00477 ** 
    ## Sale.ConditionAdjLand  0.1656805  0.1084536   1.528  0.12692    
    ## Sale.ConditionAlloca   0.0809501  0.0770372   1.051  0.29361    
    ## Sale.ConditionFamily   0.0022861  0.0413424   0.055  0.95591    
    ## Sale.ConditionNormal   0.1112681  0.0199267   5.584 3.05e-08 ***
    ## Sale.ConditionPartial  0.1521176  0.0270766   5.618 2.52e-08 ***
    ## Exter.Qual2            0.0190427  0.0144166   1.321  0.18685    
    ## Bedroom.AbvGr:Bath_1  -0.0305291  0.0058543  -5.215 2.24e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1476 on 979 degrees of freedom
    ## Multiple R-squared:  0.8793, Adjusted R-squared:  0.8768 
    ## F-statistic: 356.5 on 20 and 979 DF,  p-value: < 2.2e-16

    2.4.4 Section 4.4 Final Model Validation

    Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

    You will use the ames_validation dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
    * How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
    * From this result, does your final model properly reflect uncertainty?

    #data already loaded with transformations
    #load("ames_validation.Rdata")

    In general, the better the model fit, the lower the RMSE. In this case, the RMSE of the test data RMSE ($20,894) is lower than using the training data. This is unusual, because the model should fit the data it was fitted to better than a new data set. The outliers in the training set increased its RMSE.

    # Extract Predictions (convert from log)
    ames_validation$predict.price <- exp(predict(ames.lm.full, ames_validation))
    
    # Calculate RMSE
    rmse.full <- sqrt(mean((ames_validation$price - ames_validation$predict.price)^2))
    rmse.full
    ## [1] 20893.55

    98% of the 95% predictive confidence intervals contain the true price of the house in the validation data set. This result supports the model properly reflecting uncertainty in the actual sales contracts.

    # Calculate prediction interval
    ames_val.pred <- predict(ames.lm.full, ames_validation, interval = "prediction")
    
    # Get dataset of predictions and confidence intervals
    out = as.data.frame(cbind(exp(ames_val.pred),
                              price = ames_validation$price))
    
    # Fix names in dataset
    #colnames(out)[1:2] <- c("lwr", "upr")  #fix names
    
    # Get Coverage
    out %>% summarize(cover = sum(price >= lwr & price <= upr)/n())
    ##       cover
    ## 1 0.9803408
    #pred.test.HPM.coverage

    2.5 Part 5 Conclusion

    Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


    As explained in Pardoe (2008) (https://ww2.amstat.org/publications/jse/v16n2/datasets.pardoe.html), a model such as this can narrow the range of possible values for the asking price of a home. For example, a home with characteristcs such as those shown in below sold for $147,000. A 95% prediction interval based on an average of the entire data set is ($40,068, $304,460) with mean $172,264. (sample mean \(\pm\) \(t_{.05}\) × sample standard deviation). The model 95% prediction interval is much closer at ($116,342, $208,235) with predicted value $155,684.

    val <- ames_validation %>% summarise(mean=mean(price), sd=sd(price), n=n())
    error <- qt(0.975,df=val$n-1)*val$sd
    left <- mean(val$mean)-error
    right <- mean(val$mean)+error
    val$mean
    ## [1] 172264.1
    left
    ## [1] 40068.18
    right
    ## [1] 304460.1
    ames_validation[5,c('price', 'Lot.Area', 'area', 'Overall.Cond', 'Bedroom.AbvGr', 'Bath_1', 'Year.Built', 'Neighborhood', 'Overall.Qual', 'Central.Air', 'Kitchen.Qual', 'Sale.Condition', 'Exter.Qual')]
    ## # A tibble: 1 x 13
    ##    price Lot.Area  area Overall.Cond Bedroom.AbvGr Bath_1 Year.Built
    ##    <int>    <int> <int>        <int>         <int>  <dbl>      <int>
    ## 1 147000    10970  1026            6             3   2.00       1978
    ## # ... with 6 more variables: Neighborhood <fct>, Overall.Qual <int>,
    ## #   Central.Air <fct>, Kitchen.Qual <fct>, Sale.Condition <fct>,
    ## #   Exter.Qual <fct>
    exp(predict(ames.lm.full, ames_validation[5,], interval="prediction"))
    ##        fit      lwr      upr
    ## 1 155648.4 116341.8 208234.9

    The model also does a good job with characterizing the market, explaining most of the variation in sale price (R2 = 87.93%).