As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.
load("ames_train.Rdata")
load("ames_test.Rdata")
load("ames_validation.Rdata")
Use the code block below to load any necessary packages
#library(statsr)
library(dplyr)
library(BAS)
## Warning: package 'BAS' was built under R version 3.4.4
library(ggplot2)
library(gridExtra)
library(MASS)
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.4
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
Homes built on large lots typically sell for higher prices, all else being equal. However, there are diminishing returns to additional acreage. Realtors have constructed lot size categories which correspond to approximately equal-sized increases in sale price (see http://ww2.amstat.org/publications/jse/v16n2/datasets.pardoe.html). The first pair of plots below shows the result of categorization. The data shows a more linear relationship between the explantory and response variable.
# Analysis of lot size.
p1 <- ggplot(ames_train, aes(x = Lot.Area, y = price)) +
geom_point() +
geom_smooth(se = FALSE, method="lm") +
ggtitle("Raw Data")
ames_train$Lot.Area.Cat <-
(ifelse(ames_train$Lot.Area<3000, 1,
ifelse(ames_train$Lot.Area<5000, 2,
ifelse(ames_train$Lot.Area<7000, 3,
ifelse(ames_train$Lot.Area<10000, 4,
ifelse(ames_train$Lot.Area<15000, 5,
ifelse(ames_train$Lot.Area<20000, 6,
ifelse(ames_train$Lot.Area<43560, 7,
ifelse(ames_train$Lot.Area<43560*3, 8,
ifelse(ames_train$Lot.Area<43560*5, 9,
ifelse(ames_train$Lot.Area<43560*10, 10,
ifelse(ames_train$Lot.Area<43560*20, 11,12))))))))))))
# repeat for ames_test and ames_validation
ames_test$Lot.Area.Cat <-
(ifelse(ames_test$Lot.Area<3000, 1,
ifelse(ames_test$Lot.Area<5000, 2,
ifelse(ames_test$Lot.Area<7000, 3,
ifelse(ames_test$Lot.Area<10000, 4,
ifelse(ames_test$Lot.Area<15000, 5,
ifelse(ames_test$Lot.Area<20000, 6,
ifelse(ames_test$Lot.Area<43560, 7,
ifelse(ames_test$Lot.Area<43560*3, 8,
ifelse(ames_test$Lot.Area<43560*5, 9,
ifelse(ames_test$Lot.Area<43560*10, 10,
ifelse(ames_test$Lot.Area<43560*20, 11,12))))))))))))
ames_validation$Lot.Area.Cat <-
(ifelse(ames_validation$Lot.Area<3000, 1,
ifelse(ames_validation$Lot.Area<5000, 2,
ifelse(ames_validation$Lot.Area<7000, 3,
ifelse(ames_validation$Lot.Area<10000, 4,
ifelse(ames_validation$Lot.Area<15000, 5,
ifelse(ames_validation$Lot.Area<20000, 6,
ifelse(ames_validation$Lot.Area<43560, 7,
ifelse(ames_validation$Lot.Area<43560*3, 8,
ifelse(ames_validation$Lot.Area<43560*5, 9,
ifelse(ames_validation$Lot.Area<43560*10, 10,
ifelse(ames_validation$Lot.Area<43560*20, 11,12))))))))))))
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 82 variables:
## $ PID : int 909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
## $ area : int 856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ MS.SubClass : int 30 120 30 70 60 85 20 20 20 180 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
## $ Lot.Frontage : int NA 42 60 80 70 64 60 53 74 35 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
## $ Overall.Qual : int 6 5 5 4 8 7 4 7 5 6 ...
## $ Overall.Cond : int 6 5 9 8 6 5 4 5 6 5 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add : int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
## $ Mas.Vnr.Area : int 0 149 0 0 0 500 0 20 0 76 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
## $ BsmtFin.SF.1 : int 238 552 737 0 643 0 0 0 647 467 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 393 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 618 104 100 405 167 0 936 1146 217 80 ...
## $ Total.Bsmt.SF : int 856 1049 837 405 810 0 936 1146 864 547 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 856 1049 1001 717 810 495 936 1246 889 1072 ...
## $ X2nd.Flr.SF : int 0 0 0 322 855 1427 0 0 0 0 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : int 1 1 0 0 1 0 0 0 0 1 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 2 1 1 2 3 1 2 1 1 ...
## $ Half.Bath : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
## $ TotRms.AbvGrd : int 4 5 5 6 6 7 4 5 6 5 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
## $ Fireplaces : int 1 0 0 0 0 1 0 1 0 0 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
## $ Garage.Yr.Blt : int 1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
## $ Garage.Cars : int 2 1 1 1 2 2 2 2 2 2 ...
## $ Garage.Area : int 399 266 216 281 528 672 576 428 484 525 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 0 0 154 0 0 0 0 100 0 0 ...
## $ Open.Porch.SF : int 0 105 0 0 45 0 32 24 0 44 ...
## $ Enclosed.Porch : int 0 0 42 168 0 177 112 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 86 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 166 0 0 111 0 0 0 0 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 3 2 11 5 11 7 2 3 4 5 ...
## $ Yr.Sold : int 2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
## $ Lot.Area.Cat : num 4 2 3 4 4 4 3 2 5 2 ...
p2 <- ggplot(ames_train, aes(x = Lot.Area.Cat, y = price)) +
geom_jitter() +
geom_smooth(se = FALSE, method="lm") +
ggtitle("Categorized")
grid.arrange(p1, p2, ncol=2)
Half bathrooms add less value to a home than full baths. One way to model this is to assign half-bathrooms the value 0.1 instead of 0.5. The figures below each show total bathrooms (abover ground and basement), but the figure on the left assigns 0.5 for half-baths and the figure on the right assigns 0.1 for half-baths.
# Analysis of bathroom size
ames_train %>% summarise(Bsmt.Full.Bath=sum(is.na(Bsmt.Full.Bath)),
Bsmt.Half.Bath=sum(is.na(Bsmt.Half.Bath)),
Full.Bath=sum(is.na(Full.Bath)),
Half.Bath=sum(is.na(Half.Bath)))
## # A tibble: 1 x 4
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## <int> <int> <int> <int>
## 1 1 1 0 0
ames_test %>% summarise(Bsmt.Full.Bath=sum(is.na(Bsmt.Full.Bath)),
Bsmt.Half.Bath=sum(is.na(Bsmt.Half.Bath)),
Full.Bath=sum(is.na(Full.Bath)),
Half.Bath=sum(is.na(Half.Bath)))
## # A tibble: 1 x 4
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## <int> <int> <int> <int>
## 1 0 0 0 0
ames_validation %>% summarise(Bsmt.Full.Bath=sum(is.na(Bsmt.Full.Bath)),
Bsmt.Half.Bath=sum(is.na(Bsmt.Half.Bath)),
Full.Bath=sum(is.na(Full.Bath)),
Half.Bath=sum(is.na(Half.Bath)))
## # A tibble: 1 x 4
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## <int> <int> <int> <int>
## 1 1 1 0 0
# Relace na with 0
ames_train$Bsmt.Full.Bath[is.na(ames_train$Bsmt.Full.Bath)] <- 0
ames_train$Bsmt.Half.Bath[is.na(ames_train$Bsmt.Half.Bath)] <- 0
ames_test$Bsmt.Full.Bath[is.na(ames_test$Bsmt.Full.Bath)] <- 0
ames_test$Bsmt.Half.Bath[is.na(ames_test$Bsmt.Half.Bath)] <- 0
ames_validation$Bsmt.Full.Bath[is.na(ames_validation$Bsmt.Full.Bath)] <- 0
ames_validation$Bsmt.Half.Bath[is.na(ames_validation$Bsmt.Half.Bath)] <- 0
# Treat half baths a 0.5 bathrooms
ames_train$Bath_5 <- ames_train$Full.Bath + ames_train$Bsmt.Full.Bath +
0.5*ames_train$Half.Bath + 0.5*ames_train$Bsmt.Half.Bath
ames_test$Bath_5 <- ames_test$Full.Bath + ames_test$Bsmt.Full.Bath +
0.5*ames_test$Half.Bath + 0.5*ames_test$Bsmt.Half.Bath
ames_validation$Bath_5 <- ames_validation$Full.Bath +
ames_validation$Bsmt.Full.Bath +
0.5*ames_validation$Half.Bath + 0.5*ames_validation$Bsmt.Half.Bath
p1 <- ggplot(ames_train, aes(x = Bath_5, y = price)) +
geom_point() +
geom_smooth(se = FALSE, method="lm") +
ggtitle("Half-Bath = 0.5")
# Treat half baths a 0.1 bathrooms
ames_train$Bath_1 <- ames_train$Full.Bath + ames_train$Bsmt.Full.Bath +
0.1*ames_train$Half.Bath + 0.1*ames_train$Bsmt.Half.Bath
ames_test$Bath_1 <- ames_test$Full.Bath + ames_test$Bsmt.Full.Bath +
0.1*ames_test$Half.Bath + 0.1*ames_test$Bsmt.Half.Bath
ames_validation$Bath_1 <- ames_validation$Full.Bath +
ames_validation$Bsmt.Full.Bath +
0.1*ames_validation$Half.Bath + 0.1*ames_validation$Bsmt.Half.Bath
p2 <- ggplot(ames_train, aes(x = Bath_1, y = price)) +
geom_jitter() +
geom_smooth(se = FALSE, method="lm") +
ggtitle("Half-Bath = 0.1")
grid.arrange(p1, p2, ncol=2)
Location surely makes a difference. A summary boxplot of prices by Neighborhood shows both differences in median prices, and in variation of prices. Not all neighborhoods are significantly different. After examining neighborhoods with an ANOVA test, I condensed the neighborhoods into 5 groups.
ggplot(data = ames_train, aes(x = Neighborhood, y = price)) +
geom_boxplot() +
coord_flip()
ames_train %>%
group_by(Neighborhood) %>%
summarise(avg.price = mean(price),
median.price = median(price),
sd.price = sd(price))
## # A tibble: 27 x 4
## Neighborhood avg.price median.price sd.price
## <fct> <dbl> <dbl> <dbl>
## 1 Blmngtn 198635 191000 26455
## 2 Blueste 125800 123900 10381
## 3 BrDale 98930 98750 13338
## 4 BrkSide 122473 124000 37310
## 5 ClearCr 193154 185000 48069
## 6 CollgCr 196951 195800 52786
## 7 Crawfor 204197 205000 71268
## 8 Edwards 136322 127250 54852
## 9 Gilbert 193328 183500 41190
## 10 Greens 198562 212625 29063
## # ... with 17 more rows
ames_train$Neighborhood.chr = as.character(ames_train$Neighborhood)
ames_train$Neighborhood2 <- as.factor(
ifelse(ames_train$Neighborhood.chr == "Blmngtn", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "Blueste", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "ClearCr", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "CollgCr", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "Crawfor", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "Gilbert", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "Greens", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "GrnHill", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "NPkVill", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "NWAmes", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "SawyerW", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "Veenker", "Neigh_1",
ifelse(ames_train$Neighborhood.chr == "BrDale", "Neigh_2",
ifelse(ames_train$Neighborhood.chr == "BrkSide", "Neigh_2",
ifelse(ames_train$Neighborhood.chr == "Edwards", "Neigh_2",
ifelse(ames_train$Neighborhood.chr == "IDOTRR", "Neigh_2",
ifelse(ames_train$Neighborhood.chr == "MeadowV", "Neigh_2",
ifelse(ames_train$Neighborhood.chr == "OldTown", "Neigh_2",
ifelse(ames_train$Neighborhood.chr == "Sawyer", "Neigh_2",
ifelse(ames_train$Neighborhood.chr == "SWISU", "Neigh_2",
ifelse(ames_train$Neighborhood.chr == "NAmes", "Neigh_3",
ifelse(ames_train$Neighborhood.chr == "Mitchel", "Neigh_3",
ifelse(ames_train$Neighborhood.chr == "StoneBr", "Neigh_4",
ifelse(ames_train$Neighborhood.chr == "NoRidge", "Neigh_4",
ifelse(ames_train$Neighborhood.chr == "NridgHt", "Neigh_4",
ifelse(ames_train$Neighborhood.chr == "Somerst", "Neigh_5",
ifelse(ames_train$Neighborhood.chr == "Timber", "Neigh_5",
"Neigh_5"))))))))))))))))))))))))))))
ames_test$Neighborhood.chr = as.character(ames_test$Neighborhood)
ames_test$Neighborhood2 <- as.factor(
ifelse(ames_test$Neighborhood.chr == "Blmngtn", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "Blueste", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "ClearCr", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "CollgCr", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "Crawfor", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "Gilbert", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "Greens", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "GrnHill", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "NPkVill", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "NWAmes", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "SawyerW", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "Veenker", "Neigh_1",
ifelse(ames_test$Neighborhood.chr == "BrDale", "Neigh_2",
ifelse(ames_test$Neighborhood.chr == "BrkSide", "Neigh_2",
ifelse(ames_test$Neighborhood.chr == "Edwards", "Neigh_2",
ifelse(ames_test$Neighborhood.chr == "IDOTRR", "Neigh_2",
ifelse(ames_test$Neighborhood.chr == "MeadowV", "Neigh_2",
ifelse(ames_test$Neighborhood.chr == "OldTown", "Neigh_2",
ifelse(ames_test$Neighborhood.chr == "Sawyer", "Neigh_2",
ifelse(ames_test$Neighborhood.chr == "SWISU", "Neigh_2",
ifelse(ames_test$Neighborhood.chr == "NAmes", "Neigh_3",
ifelse(ames_test$Neighborhood.chr == "Mitchel", "Neigh_3",
ifelse(ames_test$Neighborhood.chr == "StoneBr", "Neigh_4",
ifelse(ames_test$Neighborhood.chr == "NoRidge", "Neigh_4",
ifelse(ames_test$Neighborhood.chr == "NridgHt", "Neigh_4",
ifelse(ames_test$Neighborhood.chr == "Somerst", "Neigh_5",
ifelse(ames_test$Neighborhood.chr == "Timber", "Neigh_5",
"Neigh_5"))))))))))))))))))))))))))))
ames_validation$Neighborhood.chr = as.character(ames_validation$Neighborhood)
ames_validation$Neighborhood2 <- as.factor(
ifelse(ames_validation$Neighborhood.chr == "Blmngtn", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "Blueste", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "ClearCr", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "CollgCr", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "Crawfor", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "Gilbert", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "Greens", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "GrnHill", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "NPkVill", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "NWAmes", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "SawyerW", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "Veenker", "Neigh_1",
ifelse(ames_validation$Neighborhood.chr == "BrDale", "Neigh_2",
ifelse(ames_validation$Neighborhood.chr == "BrkSide", "Neigh_2",
ifelse(ames_validation$Neighborhood.chr == "Edwards", "Neigh_2",
ifelse(ames_validation$Neighborhood.chr == "IDOTRR", "Neigh_2",
ifelse(ames_validation$Neighborhood.chr == "MeadowV", "Neigh_2",
ifelse(ames_validation$Neighborhood.chr == "OldTown", "Neigh_2",
ifelse(ames_validation$Neighborhood.chr == "Sawyer", "Neigh_2",
ifelse(ames_validation$Neighborhood.chr == "SWISU", "Neigh_2",
ifelse(ames_validation$Neighborhood.chr == "NAmes", "Neigh_3",
ifelse(ames_validation$Neighborhood.chr == "Mitchel", "Neigh_3",
ifelse(ames_validation$Neighborhood.chr == "StoneBr", "Neigh_4",
ifelse(ames_validation$Neighborhood.chr == "NoRidge", "Neigh_4",
ifelse(ames_validation$Neighborhood.chr == "NridgHt", "Neigh_4",
ifelse(ames_validation$Neighborhood.chr == "Somerst", "Neigh_5",
ifelse(ames_validation$Neighborhood.chr == "Timber", "Neigh_5",
"Neigh_5"))))))))))))))))))))))))))))
aov_Neighborhood<- aov(ames_train$price ~ ames_train$Neighborhood2)
TukeyHSD(aov_Neighborhood)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = ames_train$price ~ ames_train$Neighborhood2)
##
## $`ames_train$Neighborhood2`
## diff lwr upr p adj
## Neigh_2-Neigh_1 -71550.89 -83544.279 -59557.49 0.0e+00
## Neigh_3-Neigh_1 -48151.64 -61639.982 -34663.31 0.0e+00
## Neigh_4-Neigh_1 129907.59 113190.429 146624.75 0.0e+00
## Neigh_5-Neigh_1 46108.27 28612.427 63604.12 0.0e+00
## Neigh_3-Neigh_2 23399.24 9990.724 36807.76 2.1e-05
## Neigh_4-Neigh_2 201458.48 184805.651 218111.30 0.0e+00
## Neigh_5-Neigh_2 117659.16 100224.775 135093.54 0.0e+00
## Neigh_4-Neigh_3 178059.23 160299.438 195819.03 0.0e+00
## Neigh_5-Neigh_3 94259.92 75765.282 112754.55 0.0e+00
## Neigh_5-Neigh_4 -83799.32 -104765.473 -62833.16 0.0e+00
ggplot(data = ames_train, aes(x = Neighborhood2, y = price)) +
geom_boxplot() +
coord_flip()
ames_train %>%
group_by(Neighborhood2) %>%
summarise(avg.price = mean(price),
median.price = median(price),
sd.price = sd(price),
n = n())
## # A tibble: 5 x 5
## Neighborhood2 avg.price median.price sd.price n
## <fct> <dbl> <dbl> <dbl> <int>
## 1 Neigh_1 194738 187500 51577 297
## 2 Neigh_2 123188 125000 39382 306
## 3 Neigh_3 146587 142000 31893 199
## 4 Neigh_4 324646 315750 96823 105
## 5 Neigh_5 240847 222000 70060 93
*** NOT USED - ANALYSIS OF SEASONAL EFFECT ON PRICE ***
# Analysis of season
ames_train$season <- as.factor(ifelse(ames_train$Mo.Sold<=1,ames_train$Mo.Sold,ames_train$Mo.Sold))
ggplot(ames_train, aes(x = season, y = price)) +
geom_boxplot()
*** NOT USED - MISSING OBS ***
# Most missing obs
na_count <- colSums(is.na(ames_train))
head(sort(na_count, decreasing = TRUE),5)
## Pool.QC Misc.Feature Alley Fence Fireplace.Qu
## 997 971 933 798 491
# For candidate variables, count na.
na_count["Lot.Area.Cat"]
## Lot.Area.Cat
## 0
na_count["area"]
## area
## 0
na_count["Overall.Cond"]
## Overall.Cond
## 0
na_count["Bedroom.AbvGr"]
## Bedroom.AbvGr
## 0
na_count["Bath_1"]
## Bath_1
## 0
na_count["Year.Built"]
## Year.Built
## 0
na_count["Bedroom.AbvGr"]
## Bedroom.AbvGr
## 0
na_count["Full.Bath"]
## Full.Bath
## 0
na_count["Bsmt.Full.Bath"]
## Bsmt.Full.Bath
## 0
na_count["Half.Bath"]
## Half.Bath
## 0
na_count["Bsmt.Half.Bath"]
## Bsmt.Half.Bath
## 0
*** Data Conversion ***
#Convert Overall.Cond and Overal.Qual from int to ordered factor
#str(ames_train$Overall.Cond)
#str(ames_train$Overall.Qual)
#ames_train$Overall.Cond <- factor(ames_train$Overall.Cond, ordered = TRUE)
#ames_train$Overall.Qual <- factor(ames_train$Overall.Qual, ordered = TRUE)
#ames_test$Overall.Cond <- factor(ames_test$Overall.Cond, ordered = TRUE)
#ames_test$Overall.Qual <- factor(ames_test$Overall.Qual, ordered = TRUE)
#ames_validation$Overall.Cond <- factor(ames_validation$Overall.Cond, ordered = TRUE)
#ames_validation$Overall.Qual <- factor(ames_validation$Overall.Qual, ordered = TRUE)
*** Analysis of Candidate Variables for Model Assumptions *** Start with a Residuals vs fitted values plot to visually assess whether:
(violation of any of these three may necessitate remedial action (such as transforming one or more predictors and/or the response variable)
The standardized residuals plot fans out, violating the equal variance condition. There are also many points >2sd from 0.
ames_train.lm.full <- lm(price ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood, data = ames_train)
ames_train$resid <- rstandard(ames_train.lm.full)
ames_train$fitted <- fitted(ames_train.lm.full)
ggplot(data=ames_train, aes(y = resid, x = fitted)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
xlab("Fitted Value") +
ylab("Residual") +
ggtitle("Residuals vs. Fits Plot \n(response is price)")
ames_train.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood, data = ames_train)
par(mfrow = c(2,2)) # Split plot panel into 2x2 grid
plot(ames_train.lm.full)
Try log-transforming price
. This plot looks much better, although there are a few potential outliers.
ames_train.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood, data = ames_train)
ames_train$resid <- rstandard(ames_train.lm.full)
ames_train$fitted <- fitted(ames_train.lm.full)
ggplot(data=ames_train, aes(y = resid, x = fitted)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
xlab("Fitted Value") +
ylab("Residual") +
ggtitle("Residuals vs. Fits Plot \n(response is ln(price))")
Create a regular residuals versus fits plot. This will label the outliers for further investigation. Observation 428 is an abnormal sale with a very low price. Observation 310 is a normal sale a very large home, but at a rather cheap price.
plot(ames_train.lm.full, which = 1)
ames_train.lm.full$model[c(428, 310),]
## log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
## 428 9.456341 4 832 2 2 1.0
## 310 12.126759 7 4676 5 3 4.1
## Year.Built Neighborhood
## 428 1923 OldTown
## 310 2007 Edwards
ames_train[c(428, 310),]
## # A tibble: 2 x 89
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fct> <int> <int> <fct>
## 1 9.02e8 832 12789 30 RM 68 9656 Pave
## 2 9.08e8 4676 184750 60 RL 130 40094 Pave
## # ... with 81 more variables: Alley <fct>, Lot.Shape <fct>,
## # Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
## # Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
## # Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
## # Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
## # Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
## # Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
## # Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
## # Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <dbl>,
## # Bsmt.Half.Bath <dbl>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
## # TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
## # Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
## # Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
## # Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
## # Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>,
## # Lot.Area.Cat <dbl>, Bath_5 <dbl>, Bath_1 <dbl>,
## # Neighborhood.chr <chr>, Neighborhood2 <fct>, season <fct>,
## # resid <dbl>, fitted <dbl>
ames_train2 <- ames_train[c(1:309,311:427,429:1000),]
Now create Residuals vs Predictors scatterplot matrix to assess whether:
(violation of any of these three may necessitate remedial action (such as transforming one or more predictors and/or the response variable)
p1 <- ggplot(data=ames_train, aes(y = resid, x = area)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
ggtitle("Residuals vs. Predictor") +
xlab("area") +
ylab("Residual")
p2 <- ggplot(data=ames_train, aes(y = resid, x = Overall.Cond)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
ggtitle("Residuals vs. Predictor") +
xlab("Overall.Cond") +
ylab("Residual")
p3 <- ggplot(data=ames_train, aes(y = resid, x = Bedroom.AbvGr)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
ggtitle("Residuals vs. Predictor") +
xlab("Bedroom.AbvGr") +
ylab("Residual")
p4 <- ggplot(data=ames_train, aes(y = resid, x = Bath_1)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
ggtitle("Residuals vs. Predictor") +
xlab("Bath_1") +
ylab("Residual")
p5 <- ggplot(data=ames_train, aes(y = resid, x = Year.Built)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
ggtitle("Residuals vs. Predictor") +
xlab("Year.Built") +
ylab("Residual")
p6 <- ggplot(data=ames_train, aes(y = resid, x = Lot.Area.Cat)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_hline(yintercept = 2, linetype = 1, color="red") +
geom_hline(yintercept = -2, linetype = 1, color="red") +
ggtitle("Residuals vs. Predictor") +
xlab("Lot.Area.Cat") +
ylab("Residual")
grid.arrange(p1, p2, ncol=2)
grid.arrange(p3, p4, ncol=2)
grid.arrange(p5, p6, ncol=2)
approximate normality (affirms normality).
par(mfrow = c(2,2)) # Split plot panel into 2x2 grid
plot(ames_train.lm.full)
#par(mfrow = c(2,2)) # Split plot panel into 2x2 grid
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.
Based on your EDA, select at most 10 predictor variables from ames_train
and create a linear model for price
(or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
Based on my EDA, I modeled log(price)
on my categorized lot area (Lot.Area.Cat
), home size (area
), bedrooms (Bedroom.AbvGr
), my modified number of bathrooms (Bath_1
), home age (Year.Built
), overall condition of the home (Overall.Cond
), and location (Neighborhood
). That is 7 predictor variables, but technically a lot more because Neighborhood
is changed into 5 indicator variables.
My model selection was based on real estate industry expectations: values are based on location, size, and quality.
The model does a good job of predicting log(price) with an adjusted R-squared of 0.8213. All predictor variables apart from Neighborhood
are highly significant (p<0.0001). The Neighborhood
indicator variables are significant at the \(\alpha\)=0.05 level for 3 of the 4 indicator variables.
#str(ames_train)
ames_train.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood2, data = ames_train)
summary(ames_train.lm.full)
##
## Call:
## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond +
## Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62700 -0.08985 0.00130 0.09516 0.61324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.730e-01 6.261e-01 1.235 0.2173
## Lot.Area.Cat 5.168e-02 5.076e-03 10.182 < 2e-16 ***
## area 4.029e-04 1.821e-05 22.126 < 2e-16 ***
## Overall.Cond 8.729e-02 5.609e-03 15.562 < 2e-16 ***
## Bedroom.AbvGr -6.208e-02 8.623e-03 -7.199 1.20e-12 ***
## Bath_1 5.279e-02 9.771e-03 5.403 8.22e-08 ***
## Year.Built 5.078e-03 3.099e-04 16.386 < 2e-16 ***
## Neighborhood2Neigh_2 -1.006e-01 1.992e-02 -5.052 5.20e-07 ***
## Neighborhood2Neigh_3 -3.182e-02 1.837e-02 -1.732 0.0835 .
## Neighborhood2Neigh_4 2.228e-01 2.236e-02 9.963 < 2e-16 ***
## Neighborhood2Neigh_5 1.421e-01 2.217e-02 6.409 2.26e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1778 on 989 degrees of freedom
## Multiple R-squared: 0.8231, Adjusted R-squared: 0.8213
## F-statistic: 460 on 10 and 989 DF, p-value: < 2.2e-16
Now either using BAS
another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
I used three model selection procedures: AIC, BIC, and BAS. AIC and BIC retained all predictor variables from the full model. BAS also retained all predictor variables apart from the Neighborhood
variable. BAS dropped 1 of the Neighborhood
indicator variables in its top five models.
ames_train.lm.aic <- stepAIC(ames_train.lm.full, k = 2)
## Start: AIC=-3443.04
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2
##
## Df Sum of Sq RSS AIC
## <none> 31.272 -3443.0
## - Bath_1 1 0.9230 32.195 -3416.0
## - Bedroom.AbvGr 1 1.6386 32.910 -3394.0
## - Lot.Area.Cat 1 3.2782 34.550 -3345.4
## - Neighborhood2 4 4.5659 35.837 -3314.8
## - Overall.Cond 1 7.6572 38.929 -3226.0
## - Year.Built 1 8.4899 39.762 -3204.9
## - area 1 15.4797 46.751 -3042.9
ames_train.lm.bic <- stepAIC(ames_train.lm.full, k = log(nrow(ames_train)))
## Start: AIC=-3389.06
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2
##
## Df Sum of Sq RSS AIC
## <none> 31.272 -3389.1
## - Bath_1 1 0.9230 32.195 -3366.9
## - Bedroom.AbvGr 1 1.6386 32.910 -3344.9
## - Lot.Area.Cat 1 3.2782 34.550 -3296.3
## - Neighborhood2 4 4.5659 35.837 -3280.4
## - Overall.Cond 1 7.6572 38.929 -3176.9
## - Year.Built 1 8.4899 39.762 -3155.8
## - area 1 15.4797 46.751 -2993.8
ames_train.lm.bas <- bas.lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood2, data = ames_train, prior = "AIC",
modelprior=uniform())
summary(ames_train.lm.bas)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.000000e+00
## Lot.Area.Cat 1.0000000 1.0000 1.0000000 1.000000e+00
## area 1.0000000 1.0000 1.0000000 1.000000e+00
## Overall.Cond 1.0000000 1.0000 1.0000000 1.000000e+00
## Bedroom.AbvGr 1.0000000 1.0000 1.0000000 1.000000e+00
## Bath_1 0.9999988 1.0000 1.0000000 1.000000e+00
## Year.Built 1.0000000 1.0000 1.0000000 1.000000e+00
## Neighborhood2Neigh_2 0.9999897 1.0000 1.0000000 0.000000e+00
## Neighborhood2Neigh_3 0.6259502 1.0000 0.0000000 0.000000e+00
## Neighborhood2Neigh_4 1.0000000 1.0000 1.0000000 1.000000e+00
## Neighborhood2Neigh_5 1.0000000 1.0000 1.0000000 1.000000e+00
## BF NA 1.0000 0.5975676 8.492981e-06
## PostProbs NA 0.6259 0.3740000 0.000000e+00
## R2 NA 0.8231 0.8225000 8.181000e-01
## dim NA 11.0000 10.0000000 9.000000e+00
## logmarg NA -1732.3554 -1732.8702398 -1.744032e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## Lot.Area.Cat 1.000000e+00 1.000000e+00
## area 1.000000e+00 1.000000e+00
## Overall.Cond 1.000000e+00 1.000000e+00
## Bedroom.AbvGr 1.000000e+00 1.000000e+00
## Bath_1 1.000000e+00 0.000000e+00
## Year.Built 1.000000e+00 1.000000e+00
## Neighborhood2Neigh_2 0.000000e+00 1.000000e+00
## Neighborhood2Neigh_3 1.000000e+00 1.000000e+00
## Neighborhood2Neigh_4 1.000000e+00 1.000000e+00
## Neighborhood2Neigh_5 1.000000e+00 1.000000e+00
## BF 7.962731e-06 1.312482e-06
## PostProbs 0.000000e+00 0.000000e+00
## R2 8.185000e-01 8.178000e-01
## dim 1.000000e+01 1.000000e+01
## logmarg -1.744096e+03 -1.745899e+03
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
The residuals versus fits plot has constant variance with a couple outliers labeled for further investigation. Observation 428 is an abnormal sale with a very low price. Observation 310 is a normal sale a very large home, but at a rather cheap price.
plot(ames_train.lm.full, which = 1)
ames_train.lm.full$model[c(428, 310),]
## log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
## 428 9.456341 4 832 2 2 1.0
## 310 12.126759 7 4676 5 3 4.1
## Year.Built Neighborhood2
## 428 1923 Neigh_2
## 310 2007 Neigh_2
ames_train[c(428, 310),]
## # A tibble: 2 x 89
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fct> <int> <int> <fct>
## 1 9.02e8 832 12789 30 RM 68 9656 Pave
## 2 9.08e8 4676 184750 60 RL 130 40094 Pave
## # ... with 81 more variables: Alley <fct>, Lot.Shape <fct>,
## # Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
## # Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
## # Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
## # Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
## # Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
## # Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
## # Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
## # Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <dbl>,
## # Bsmt.Half.Bath <dbl>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
## # TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
## # Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
## # Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
## # Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
## # Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>,
## # Lot.Area.Cat <dbl>, Bath_5 <dbl>, Bath_1 <dbl>,
## # Neighborhood.chr <chr>, Neighborhood2 <fct>, season <fct>,
## # resid <dbl>, fitted <dbl>
You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
The model RMSE (translated from log(price) to price is $38,320.
# Extract Predictions (convert from log)
ames_train$predict.price <- exp(predict(ames_train.lm.full, ames_train))
# Calculate RMSE
rmse.full <- sqrt(mean((ames_train$price - ames_train$predict.price)^2))
rmse.full
## [1] 38320.84
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called overfitting. To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test
.
# do not re-load data because already loaded and created transformed vars.
#load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
In general, the better the model fit, the lower the RMSE. In this case, the RMSE of the test data RMSE ($27,080) is lower than using the training data. This is unusual, because the model should fit the data it was fitted to better than a new data set. After testing the model with the three training set outliers removed, the RMSE was lower for the training data set than the test data set, as expected.
# Extract Predictions (convert from log)
ames_test$predict.price <- exp(predict(ames_train.lm.full, ames_test))
# Calculate RMSE
rmse.full <- sqrt(mean((ames_test$price - ames_test$predict.price)^2))
rmse.full
## [1] 27079.7
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
I added six more variables to the final model for evaluation: Building type (Bldg.Type
), Overall Condition (Overall.Qual
), Central Air (Central.Air
), Kitchen Quality (Kitchen.Qual
), Sale Condition (Sale.Condition
), and exterior quality (Exter.Qual
). I also added two interaction variables, Central.Air
*Year.Built
and Bedroom.AbvGr
*Bath_1
.
I used three model selection procedures: AIC, BIC, and BAS. AIC retained all predictor variables from the full model except for the Central.Air
*Year.Built
interaction term. BIC dropped the Central.Air
*Year.Built
interaction term, Bldg.Type
, and Exter.Qual2
. BAS also retained all predictor variables in at least one of the top 5 models.
I decided to drop Bldg.Type
and Central.Air
*Year.Built
from the full model.
#na_count["Bldg.Type"]
#na_count["Overall.Cond"]
#na_count["Central.Air"]
#na_count["Kitchen.Qual"]
#na_count["Garage.Qual"]
#na_count["Sale.Condition"]
#na_count["Exter.Qual"]
#na_count["Bsmt.Qual"]
ames_train$Exter.Qual2 <- as.numeric(
ifelse(ames_train$Exter.Qual == "Po", 0,
ifelse(ames_train$Exter.Qual == "Fa", 1,
ifelse(ames_train$Exter.Qual == "TA", 2,
ifelse(ames_train$Exter.Qual == "Gd", 3,
ifelse(ames_train$Exter.Qual == "Ex", 4, 5))))))
ames_train$Kitchen.Qual2 <- as.numeric(
ifelse(ames_train$Kitchen.Qual == "Po", 0,
ifelse(ames_train$Kitchen.Qual == "Fa", 1,
ifelse(ames_train$Kitchen.Qual == "TA", 2,
ifelse(ames_train$Kitchen.Qual == "Gd", 3,
ifelse(ames_train$Kitchen.Qual == "Ex", 4, 5))))))
ames_test$Exter.Qual2 <- as.numeric(
ifelse(ames_test$Exter.Qual == "Po", 0,
ifelse(ames_test$Exter.Qual == "Fa", 1,
ifelse(ames_test$Exter.Qual == "TA", 2,
ifelse(ames_test$Exter.Qual == "Gd", 3,
ifelse(ames_test$Exter.Qual == "Ex", 4, 5))))))
ames_test$Kitchen.Qual2 <- as.numeric(
ifelse(ames_test$Kitchen.Qual == "Po", 0,
ifelse(ames_test$Kitchen.Qual == "Fa", 1,
ifelse(ames_test$Kitchen.Qual == "TA", 2,
ifelse(ames_test$Kitchen.Qual == "Gd", 3,
ifelse(ames_test$Kitchen.Qual == "Ex", 4, 5))))))
ames_validation$Exter.Qual2 <- as.numeric(
ifelse(ames_validation$Exter.Qual == "Po", 0,
ifelse(ames_validation$Exter.Qual == "Fa", 1,
ifelse(ames_validation$Exter.Qual == "TA", 2,
ifelse(ames_validation$Exter.Qual == "Gd", 3,
ifelse(ames_validation$Exter.Qual == "Ex", 4, 5))))))
ames_validation$Kitchen.Qual2 <- as.numeric(
ifelse(ames_validation$Kitchen.Qual == "Po", 0,
ifelse(ames_validation$Kitchen.Qual == "Fa", 1,
ifelse(ames_validation$Kitchen.Qual == "TA", 2,
ifelse(ames_validation$Kitchen.Qual == "Gd", 3,
ifelse(ames_validation$Kitchen.Qual == "Ex", 4, 5))))))
ames.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood2 + Bldg.Type + Overall.Qual +
Central.Air + Kitchen.Qual2 +
Sale.Condition + Exter.Qual2 + Central.Air*Year.Built + Bedroom.AbvGr*Bath_1,
data = ames_train)
ames.lm.aic <- stepAIC(ames.lm.full, k = 2)
## Start: AIC=-3814.92
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual +
## Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 +
## Central.Air * Year.Built + Bedroom.AbvGr * Bath_1
##
## Df Sum of Sq RSS AIC
## - Year.Built:Central.Air 1 0.0011 20.924 -3816.9
## - Exter.Qual2 1 0.0309 20.954 -3815.4
## <none> 20.923 -3814.9
## - Kitchen.Qual2 1 0.1406 21.063 -3810.2
## - Bldg.Type 4 0.4144 21.337 -3803.3
## - Bedroom.AbvGr:Bath_1 1 0.5291 21.452 -3791.9
## - Sale.Condition 5 0.8930 21.816 -3783.1
## - Neighborhood2 4 1.1787 22.102 -3768.1
## - Lot.Area.Cat 1 1.1333 22.056 -3764.2
## - Overall.Cond 1 2.2194 23.142 -3716.1
## - Overall.Qual 1 3.8546 24.777 -3647.8
## - area 1 5.0222 25.945 -3601.8
##
## Step: AIC=-3816.86
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual +
## Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 +
## Bedroom.AbvGr:Bath_1
##
## Df Sum of Sq RSS AIC
## - Exter.Qual2 1 0.0302 20.954 -3817.4
## <none> 20.924 -3816.9
## - Kitchen.Qual2 1 0.1413 21.065 -3812.1
## - Bldg.Type 4 0.4133 21.337 -3805.3
## - Bedroom.AbvGr:Bath_1 1 0.5279 21.452 -3793.9
## - Sale.Condition 5 0.8923 21.816 -3785.1
## - Central.Air 1 1.0193 21.943 -3771.3
## - Neighborhood2 4 1.1829 22.107 -3769.9
## - Lot.Area.Cat 1 1.1340 22.058 -3766.1
## - Year.Built 1 1.7971 22.721 -3736.5
## - Overall.Cond 1 2.2207 23.145 -3718.0
## - Overall.Qual 1 3.9030 24.827 -3647.8
## - area 1 5.0217 25.945 -3603.8
##
## Step: AIC=-3817.42
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual +
## Central.Air + Kitchen.Qual2 + Sale.Condition + Bedroom.AbvGr:Bath_1
##
## Df Sum of Sq RSS AIC
## <none> 20.954 -3817.4
## - Kitchen.Qual2 1 0.2151 21.169 -3809.2
## - Bldg.Type 4 0.4212 21.375 -3805.5
## - Bedroom.AbvGr:Bath_1 1 0.5410 21.495 -3793.9
## - Sale.Condition 5 0.9193 21.873 -3784.5
## - Central.Air 1 1.0040 21.958 -3772.6
## - Lot.Area.Cat 1 1.1231 22.077 -3767.2
## - Neighborhood2 4 1.2604 22.215 -3767.0
## - Year.Built 1 1.8713 22.825 -3733.9
## - Overall.Cond 1 2.2133 23.167 -3719.0
## - Overall.Qual 1 4.3582 25.312 -3630.5
## - area 1 5.0865 26.041 -3602.1
summary(ames.lm.aic)
##
## Call:
## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond +
## Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Bldg.Type +
## Overall.Qual + Central.Air + Kitchen.Qual2 + Sale.Condition +
## Bedroom.AbvGr:Bath_1, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.44662 -0.06907 -0.00170 0.07010 0.50119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.638e+00 5.802e-01 7.995 3.65e-15 ***
## Lot.Area.Cat 4.048e-02 5.597e-03 7.233 9.56e-13 ***
## area 2.669e-04 1.734e-05 15.392 < 2e-16 ***
## Overall.Cond 5.194e-02 5.115e-03 10.153 < 2e-16 ***
## Bedroom.AbvGr 3.133e-02 1.473e-02 2.127 0.033635 *
## Bath_1 1.528e-01 2.021e-02 7.560 9.25e-14 ***
## Year.Built 2.753e-03 2.949e-04 9.336 < 2e-16 ***
## Neighborhood2Neigh_2 -5.128e-02 1.686e-02 -3.042 0.002415 **
## Neighborhood2Neigh_3 1.293e-02 1.571e-02 0.823 0.410526
## Neighborhood2Neigh_4 1.234e-01 1.961e-02 6.292 4.72e-10 ***
## Neighborhood2Neigh_5 6.037e-02 1.907e-02 3.166 0.001593 **
## Bldg.Type2fmCon 5.919e-02 3.493e-02 1.695 0.090478 .
## Bldg.TypeDuplex -5.333e-02 2.912e-02 -1.831 0.067370 .
## Bldg.TypeTwnhs -1.120e-01 3.083e-02 -3.631 0.000297 ***
## Bldg.TypeTwnhsE -2.746e-02 2.210e-02 -1.243 0.214316
## Overall.Qual 8.653e-02 6.073e-03 14.248 < 2e-16 ***
## Central.AirY 1.670e-01 2.442e-02 6.838 1.41e-11 ***
## Kitchen.Qual2 3.376e-02 1.067e-02 3.165 0.001598 **
## Sale.ConditionAdjLand 2.155e-01 1.095e-01 1.969 0.049291 *
## Sale.ConditionAlloca 9.977e-02 7.719e-02 1.292 0.196491
## Sale.ConditionFamily 4.084e-03 4.104e-02 0.099 0.920764
## Sale.ConditionNormal 1.075e-01 1.980e-02 5.428 7.18e-08 ***
## Sale.ConditionPartial 1.450e-01 2.691e-02 5.387 8.98e-08 ***
## Bedroom.AbvGr:Bath_1 -2.953e-02 5.883e-03 -5.020 6.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1465 on 976 degrees of freedom
## Multiple R-squared: 0.8814, Adjusted R-squared: 0.8786
## F-statistic: 315.5 on 23 and 976 DF, p-value: < 2.2e-16
ames.lm.bic <- stepAIC(ames.lm.full, k = log(nrow(ames_train)))
## Start: AIC=-3687.32
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2 + Bldg.Type + Overall.Qual +
## Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 +
## Central.Air * Year.Built + Bedroom.AbvGr * Bath_1
##
## Df Sum of Sq RSS AIC
## - Bldg.Type 4 0.4144 21.337 -3695.3
## - Year.Built:Central.Air 1 0.0011 20.924 -3694.2
## - Exter.Qual2 1 0.0309 20.954 -3692.7
## - Kitchen.Qual2 1 0.1406 21.063 -3687.5
## <none> 20.923 -3687.3
## - Sale.Condition 5 0.8930 21.816 -3680.1
## - Bedroom.AbvGr:Bath_1 1 0.5291 21.452 -3669.3
## - Neighborhood2 4 1.1787 22.102 -3660.1
## - Lot.Area.Cat 1 1.1333 22.056 -3641.5
## - Overall.Cond 1 2.2194 23.142 -3593.4
## - Overall.Qual 1 3.8546 24.777 -3525.1
## - area 1 5.0222 25.945 -3479.1
##
## Step: AIC=-3695.33
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + Central.Air +
## Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + Year.Built:Central.Air +
## Bedroom.AbvGr:Bath_1
##
## Df Sum of Sq RSS AIC
## - Year.Built:Central.Air 1 0.0000 21.337 -3702.2
## - Exter.Qual2 1 0.0380 21.375 -3700.5
## <none> 21.337 -3695.3
## - Kitchen.Qual2 1 0.1743 21.512 -3694.1
## - Sale.Condition 5 0.9807 22.318 -3684.9
## - Bedroom.AbvGr:Bath_1 1 0.5906 21.928 -3674.9
## - Neighborhood2 4 1.0793 22.416 -3673.6
## - Overall.Cond 1 2.3416 23.679 -3598.1
## - Lot.Area.Cat 1 3.2665 24.604 -3559.8
## - Overall.Qual 1 3.9805 25.318 -3531.2
## - area 1 4.8501 26.187 -3497.4
##
## Step: AIC=-3702.24
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + Central.Air +
## Kitchen.Qual2 + Sale.Condition + Exter.Qual2 + Bedroom.AbvGr:Bath_1
##
## Df Sum of Sq RSS AIC
## - Exter.Qual2 1 0.0380 21.375 -3707.4
## <none> 21.337 -3702.2
## - Kitchen.Qual2 1 0.1744 21.512 -3701.0
## - Sale.Condition 5 0.9825 22.320 -3691.8
## - Bedroom.AbvGr:Bath_1 1 0.5927 21.930 -3681.7
## - Neighborhood2 4 1.0809 22.418 -3680.5
## - Central.Air 1 0.9884 22.326 -3663.9
## - Year.Built 1 1.6100 22.947 -3636.4
## - Overall.Cond 1 2.3425 23.680 -3605.0
## - Lot.Area.Cat 1 3.2763 24.614 -3566.3
## - Overall.Qual 1 4.0470 25.384 -3535.5
## - area 1 4.8557 26.193 -3504.1
##
## Step: AIC=-3707.37
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2 + Overall.Qual + Central.Air +
## Kitchen.Qual2 + Sale.Condition + Bedroom.AbvGr:Bath_1
##
## Df Sum of Sq RSS AIC
## <none> 21.375 -3707.4
## - Kitchen.Qual2 1 0.2691 21.644 -3701.8
## - Sale.Condition 5 1.0172 22.392 -3695.4
## - Bedroom.AbvGr:Bath_1 1 0.6071 21.982 -3686.3
## - Neighborhood2 4 1.1595 22.535 -3682.2
## - Central.Air 1 0.9708 22.346 -3669.9
## - Year.Built 1 1.6830 23.058 -3638.5
## - Overall.Cond 1 2.3366 23.712 -3610.5
## - Lot.Area.Cat 1 3.2729 24.648 -3571.8
## - Overall.Qual 1 4.5547 25.930 -3521.1
## - area 1 4.9147 26.290 -3507.3
summary(ames.lm.bic)
##
## Call:
## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond +
## Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Overall.Qual +
## Central.Air + Kitchen.Qual2 + Sale.Condition + Bedroom.AbvGr:Bath_1,
## data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.45617 -0.07097 0.00077 0.07238 0.53162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.889e+00 5.815e-01 8.407 < 2e-16 ***
## Lot.Area.Cat 5.208e-02 4.251e-03 12.250 < 2e-16 ***
## area 2.595e-04 1.728e-05 15.011 < 2e-16 ***
## Overall.Cond 5.290e-02 5.111e-03 10.350 < 2e-16 ***
## Bedroom.AbvGr 3.608e-02 1.472e-02 2.450 0.014449 *
## Bath_1 1.543e-01 2.007e-02 7.690 3.57e-14 ***
## Year.Built 2.589e-03 2.947e-04 8.784 < 2e-16 ***
## Neighborhood2Neigh_2 -5.025e-02 1.693e-02 -2.968 0.003069 **
## Neighborhood2Neigh_3 1.115e-02 1.570e-02 0.710 0.477675
## Neighborhood2Neigh_4 1.171e-01 1.965e-02 5.963 3.46e-09 ***
## Neighborhood2Neigh_5 5.869e-02 1.920e-02 3.057 0.002293 **
## Overall.Qual 8.809e-02 6.096e-03 14.451 < 2e-16 ***
## Central.AirY 1.591e-01 2.385e-02 6.671 4.23e-11 ***
## Kitchen.Qual2 3.742e-02 1.065e-02 3.513 0.000464 ***
## Sale.ConditionAdjLand 1.684e-01 1.085e-01 1.553 0.120854
## Sale.ConditionAlloca 7.943e-02 7.706e-02 1.031 0.302914
## Sale.ConditionFamily 1.050e-03 4.135e-02 0.025 0.979738
## Sale.ConditionNormal 1.120e-01 1.993e-02 5.620 2.49e-08 ***
## Sale.ConditionPartial 1.551e-01 2.699e-02 5.746 1.22e-08 ***
## Bedroom.AbvGr:Bath_1 -3.087e-02 5.851e-03 -5.276 1.63e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1477 on 980 degrees of freedom
## Multiple R-squared: 0.879, Adjusted R-squared: 0.8767
## F-statistic: 374.9 on 19 and 980 DF, p-value: < 2.2e-16
ames.lm.bas <- bas.lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood2 + Bldg.Type + Overall.Qual +
Central.Air + Kitchen.Qual2 +
Sale.Condition + Exter.Qual2 + Central.Air*Year.Built + Bedroom.AbvGr*Bath_1,
data = ames_train, prior = "AIC",
modelprior=uniform())
summary(ames.lm.bas)
## P(B != 0 | Y) model 1 model 2
## Intercept 1.0000000 1.0000 1.000000
## Lot.Area.Cat 1.0000000 1.0000 1.000000
## area 1.0000000 1.0000 1.000000
## Overall.Cond 1.0000000 1.0000 1.000000
## Bedroom.AbvGr 0.8409956 1.0000 1.000000
## Bath_1 1.0000000 1.0000 1.000000
## Year.Built 0.9899537 1.0000 1.000000
## Neighborhood2Neigh_2 0.9874210 1.0000 1.000000
## Neighborhood2Neigh_3 0.3500874 0.0000 0.000000
## Neighborhood2Neigh_4 0.9999999 1.0000 1.000000
## Neighborhood2Neigh_5 0.9804641 1.0000 1.000000
## Bldg.Type2fmCon 0.6438858 1.0000 1.000000
## Bldg.TypeDuplex 0.5881163 1.0000 1.000000
## Bldg.TypeTwnhs 0.9939672 1.0000 1.000000
## Bldg.TypeTwnhsE 0.4441211 0.0000 0.000000
## Overall.Qual 1.0000000 1.0000 1.000000
## Central.AirY 0.5838850 1.0000 0.000000
## Kitchen.Qual2 0.9552425 1.0000 1.000000
## Sale.ConditionAdjLand 0.6321541 1.0000 1.000000
## Sale.ConditionAlloca 0.4201733 0.0000 0.000000
## Sale.ConditionFamily 0.2707315 0.0000 0.000000
## Sale.ConditionNormal 0.9999989 1.0000 1.000000
## Sale.ConditionPartial 0.9999985 1.0000 1.000000
## Exter.Qual2 0.4301311 0.0000 0.000000
## Year.Built:Central.AirY 0.5819537 0.0000 1.000000
## Bedroom.AbvGr:Bath_1 0.9999940 1.0000 1.000000
## BF NA 1.0000 0.981144
## PostProbs NA 0.0068 0.006600
## R2 NA 0.8810 0.881000
## dim NA 20.0000 20.000000
## logmarg NA -1543.1431 -1543.162135
## model 3 model 4 model 5
## Intercept 1.0000000 1.0000000 1.0000000
## Lot.Area.Cat 1.0000000 1.0000000 1.0000000
## area 1.0000000 1.0000000 1.0000000
## Overall.Cond 1.0000000 1.0000000 1.0000000
## Bedroom.AbvGr 1.0000000 1.0000000 1.0000000
## Bath_1 1.0000000 1.0000000 1.0000000
## Year.Built 1.0000000 1.0000000 1.0000000
## Neighborhood2Neigh_2 1.0000000 1.0000000 1.0000000
## Neighborhood2Neigh_3 0.0000000 0.0000000 0.0000000
## Neighborhood2Neigh_4 1.0000000 1.0000000 1.0000000
## Neighborhood2Neigh_5 1.0000000 1.0000000 1.0000000
## Bldg.Type2fmCon 1.0000000 1.0000000 1.0000000
## Bldg.TypeDuplex 1.0000000 1.0000000 1.0000000
## Bldg.TypeTwnhs 1.0000000 1.0000000 1.0000000
## Bldg.TypeTwnhsE 1.0000000 1.0000000 0.0000000
## Overall.Qual 1.0000000 1.0000000 1.0000000
## Central.AirY 1.0000000 0.0000000 1.0000000
## Kitchen.Qual2 1.0000000 1.0000000 1.0000000
## Sale.ConditionAdjLand 1.0000000 1.0000000 1.0000000
## Sale.ConditionAlloca 0.0000000 0.0000000 1.0000000
## Sale.ConditionFamily 0.0000000 0.0000000 0.0000000
## Sale.ConditionNormal 1.0000000 1.0000000 1.0000000
## Sale.ConditionPartial 1.0000000 1.0000000 1.0000000
## Exter.Qual2 0.0000000 0.0000000 0.0000000
## Year.Built:Central.AirY 0.0000000 1.0000000 0.0000000
## Bedroom.AbvGr:Bath_1 1.0000000 1.0000000 1.0000000
## BF 0.8499294 0.8384008 0.8054951
## PostProbs 0.0058000 0.0057000 0.0055000
## R2 0.8812000 0.8812000 0.8811000
## dim 21.0000000 21.0000000 21.0000000
## logmarg -1543.3057005 -1543.3193576 -1543.3593967
ames.lm.full <- lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood2 + Overall.Qual +
Central.Air + Kitchen.Qual2 +
Sale.Condition + Exter.Qual2 + Bedroom.AbvGr*Bath_1,
data = ames_train)
summary(ames.lm.full)
##
## Call:
## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond +
## Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Overall.Qual +
## Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 +
## Bedroom.AbvGr * Bath_1, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46466 -0.07111 -0.00030 0.07329 0.52987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9506781 0.5831688 8.489 < 2e-16 ***
## Lot.Area.Cat 0.0521039 0.0042496 12.261 < 2e-16 ***
## area 0.0002583 0.0000173 14.926 < 2e-16 ***
## Overall.Cond 0.0529747 0.0051098 10.367 < 2e-16 ***
## Bedroom.AbvGr 0.0368895 0.0147321 2.504 0.01244 *
## Bath_1 0.1532259 0.0200788 7.631 5.50e-14 ***
## Year.Built 0.0025467 0.0002963 8.595 < 2e-16 ***
## Neighborhood2Neigh_2 -0.0481259 0.0169999 -2.831 0.00474 **
## Neighborhood2Neigh_3 0.0137171 0.0158151 0.867 0.38597
## Neighborhood2Neigh_4 0.1127654 0.0199149 5.662 1.96e-08 ***
## Neighborhood2Neigh_5 0.0568896 0.0192384 2.957 0.00318 **
## Overall.Qual 0.0859463 0.0063072 13.627 < 2e-16 ***
## Central.AirY 0.1607451 0.0238699 6.734 2.81e-11 ***
## Kitchen.Qual2 0.0321648 0.0113698 2.829 0.00477 **
## Sale.ConditionAdjLand 0.1656805 0.1084536 1.528 0.12692
## Sale.ConditionAlloca 0.0809501 0.0770372 1.051 0.29361
## Sale.ConditionFamily 0.0022861 0.0413424 0.055 0.95591
## Sale.ConditionNormal 0.1112681 0.0199267 5.584 3.05e-08 ***
## Sale.ConditionPartial 0.1521176 0.0270766 5.618 2.52e-08 ***
## Exter.Qual2 0.0190427 0.0144166 1.321 0.18685
## Bedroom.AbvGr:Bath_1 -0.0305291 0.0058543 -5.215 2.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1476 on 979 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8768
## F-statistic: 356.5 on 20 and 979 DF, p-value: < 2.2e-16
The final model is
log(price)
= \(\beta_0\) + \(\beta_1\)Lot.Area.Cat
+ \(\beta_2\)area
+ \(\beta_3\)Overall.Cond
+ \(\beta_4\)Bedroom.AbvGr
+ \(\beta_5\)Bath_1
+ \(\beta_6\)Year.Built
+ \(\beta_7\)Neighborhood2Neigh_2
+ \(\beta_8\)Neighborhood2Neigh_3
+ \(\beta_9\)Neighborhood2Neigh_4
+ \(\beta_{10}\)Neighborhood2Neigh_5
+\(\beta_{11}\)Overall.Qual
+ \(\beta_{12}\)Central.AirY
+ \(\beta_{13}\)Kitchen.Qual2
+ \(\beta_{14}\)Sale.ConditionAdjLand
+ \(\beta_{15}\)Sale.ConditionAlloca
+ \(\beta_{16}\)Sale.ConditionFamily
+ \(\beta_{17}\)Sale.ConditionNormal
+ \(\beta_{18}\)Sale.ConditionPartial
+ \(\beta_{19}\)Exter.Qual2
+ \(\beta_{20}\)Bedroom.AbvGr
*Bath_1
with estimated regression equation log(price)
= 4.95 + 0.0521Lot.Area.Cat
+ 0.0002area
+ 0.0530Overall.Cond
+ 0.0369Bedroom.AbvGr
+ 0.1532Bath_1
+ 0.0025Year.Built
-0.0481Neighborhood2Neigh_2
+ 0.0137Neighborhood2Neigh_3
+ 0.1128Neighborhood2Neigh_4
+ 0.0569Neighborhood2Neigh_5
+0.0859Overall.Qual
+ 0.1607Central.AirY
+ 0.0321Kitchen.Qual2
+ 0.1657Sale.ConditionAdjLand
+ 0.0810Sale.ConditionAlloca
+ 0.0023Sale.ConditionFamily
+ 0.1112Sale.ConditionNormal
+ 0.1521Sale.ConditionPartial
+ 0.0190Exter.Qual2
-0.0305Bedroom.AbvGr
*Bath_1
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
I transformed price
into log(price
) to conform the model to the least square conditions. I did not transform any predictor variables to conform to model conditions.
Created a total bathroom variable that treats half-baths as .1 to reflect their relatively low value.
I created interaction variable Year.Built
*Central.Air
because newer homes are also more likely to have central air conditioning. I ultimately dropped this variable from the final model.
I created interaction variable Bedroom.AbvGr
*Bath.1
because bathrooms should increase with bedrooms or else the property value will suffer.
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
I tried the AIC, BIC, and BAS methods, and ultimately relied on the variable selection of the BIC method because it produced the most parsimonious model.
ames_train.lm.aic <- stepAIC(ames_train.lm.full, k = 2)
## Start: AIC=-3443.04
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2
##
## Df Sum of Sq RSS AIC
## <none> 31.272 -3443.0
## - Bath_1 1 0.9230 32.195 -3416.0
## - Bedroom.AbvGr 1 1.6386 32.910 -3394.0
## - Lot.Area.Cat 1 3.2782 34.550 -3345.4
## - Neighborhood2 4 4.5659 35.837 -3314.8
## - Overall.Cond 1 7.6572 38.929 -3226.0
## - Year.Built 1 8.4899 39.762 -3204.9
## - area 1 15.4797 46.751 -3042.9
ames_train.lm.bic <- stepAIC(ames_train.lm.full, k = log(nrow(ames_train)))
## Start: AIC=-3389.06
## log(price) ~ Lot.Area.Cat + area + Overall.Cond + Bedroom.AbvGr +
## Bath_1 + Year.Built + Neighborhood2
##
## Df Sum of Sq RSS AIC
## <none> 31.272 -3389.1
## - Bath_1 1 0.9230 32.195 -3366.9
## - Bedroom.AbvGr 1 1.6386 32.910 -3344.9
## - Lot.Area.Cat 1 3.2782 34.550 -3296.3
## - Neighborhood2 4 4.5659 35.837 -3280.4
## - Overall.Cond 1 7.6572 38.929 -3176.9
## - Year.Built 1 8.4899 39.762 -3155.8
## - area 1 15.4797 46.751 -2993.8
ames_train.lm.bas <- bas.lm(log(price) ~ Lot.Area.Cat + area + Overall.Cond +
Bedroom.AbvGr + Bath_1 + Year.Built +
Neighborhood2, data = ames_train, prior = "AIC",
modelprior=uniform())
summary(ames_train.lm.bas)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.000000e+00
## Lot.Area.Cat 1.0000000 1.0000 1.0000000 1.000000e+00
## area 1.0000000 1.0000 1.0000000 1.000000e+00
## Overall.Cond 1.0000000 1.0000 1.0000000 1.000000e+00
## Bedroom.AbvGr 1.0000000 1.0000 1.0000000 1.000000e+00
## Bath_1 0.9999988 1.0000 1.0000000 1.000000e+00
## Year.Built 1.0000000 1.0000 1.0000000 1.000000e+00
## Neighborhood2Neigh_2 0.9999897 1.0000 1.0000000 0.000000e+00
## Neighborhood2Neigh_3 0.6259502 1.0000 0.0000000 0.000000e+00
## Neighborhood2Neigh_4 1.0000000 1.0000 1.0000000 1.000000e+00
## Neighborhood2Neigh_5 1.0000000 1.0000 1.0000000 1.000000e+00
## BF NA 1.0000 0.5975676 8.492981e-06
## PostProbs NA 0.6259 0.3740000 0.000000e+00
## R2 NA 0.8231 0.8225000 8.181000e-01
## dim NA 11.0000 10.0000000 9.000000e+00
## logmarg NA -1732.3554 -1732.8702398 -1.744032e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## Lot.Area.Cat 1.000000e+00 1.000000e+00
## area 1.000000e+00 1.000000e+00
## Overall.Cond 1.000000e+00 1.000000e+00
## Bedroom.AbvGr 1.000000e+00 1.000000e+00
## Bath_1 1.000000e+00 0.000000e+00
## Year.Built 1.000000e+00 1.000000e+00
## Neighborhood2Neigh_2 0.000000e+00 1.000000e+00
## Neighborhood2Neigh_3 1.000000e+00 1.000000e+00
## Neighborhood2Neigh_4 1.000000e+00 1.000000e+00
## Neighborhood2Neigh_5 1.000000e+00 1.000000e+00
## BF 7.962731e-06 1.312482e-06
## PostProbs 0.000000e+00 0.000000e+00
## R2 8.185000e-01 8.178000e-01
## dim 1.000000e+01 1.000000e+01
## logmarg -1.744096e+03 -1.745899e+03
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
The out-of-sample testing did not produce dramatically different results from the training model, so I did not change the model.
For your final model, create and briefly interpret an informative plot of the residuals.
The residuals versus fits plot has constant variance with a couple outliers labeled for further investigation. Observation 428 is an abnormal sale with a very low price. Observation 310 is a normal sale a very large home, but at a rather cheap price.
plot(ames.lm.full, which = 1)
ames_train.lm.full$model[c(428, 310),]
## log(price) Lot.Area.Cat area Overall.Cond Bedroom.AbvGr Bath_1
## 428 9.456341 4 832 2 2 1.0
## 310 12.126759 7 4676 5 3 4.1
## Year.Built Neighborhood2
## 428 1923 Neigh_2
## 310 2007 Neigh_2
ames_train[c(428, 310),]
## # A tibble: 2 x 92
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fct> <int> <int> <fct>
## 1 9.02e8 832 12789 30 RM 68 9656 Pave
## 2 9.08e8 4676 184750 60 RL 130 40094 Pave
## # ... with 84 more variables: Alley <fct>, Lot.Shape <fct>,
## # Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
## # Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
## # Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
## # Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
## # Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
## # Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
## # Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
## # Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <dbl>,
## # Bsmt.Half.Bath <dbl>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
## # TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
## # Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
## # Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
## # Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
## # Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>,
## # Lot.Area.Cat <dbl>, Bath_5 <dbl>, Bath_1 <dbl>,
## # Neighborhood.chr <chr>, Neighborhood2 <fct>, season <fct>,
## # resid <dbl>, fitted <dbl>, predict.price <dbl>, Exter.Qual2 <dbl>,
## # Kitchen.Qual2 <dbl>
For your final model, calculate and briefly comment on the RMSE.
The model RMSE (translated from log(price) to price is $32,975.
# Extract Predictions (convert from log)
ames_train$predict.price <- exp(predict(ames.lm.full, ames_train))
# Calculate RMSE
rmse.full <- sqrt(mean((ames_train$price - ames_train$predict.price)^2))
rmse.full
## [1] 32975.17
What are some strengths and weaknesses of your model?
The model performs well for prediction (R=0.8793). It is also fairly parsimonious in its trimming the location indicator variables to 5 neighborhoods. All indicator variables were significant at the \(\alpha=0.05\) level except Neighborhood2Neigh_3
, Sale.ConditionAdjLand
, Sale.ConditionAlloca
, Sale.ConditionFamily
, and Exter.Qual2
.
One weakness of the model is that it does not generalize to geographies other than the Ames, IA area because of the geography indicator variables.
summary(ames.lm.full)
##
## Call:
## lm(formula = log(price) ~ Lot.Area.Cat + area + Overall.Cond +
## Bedroom.AbvGr + Bath_1 + Year.Built + Neighborhood2 + Overall.Qual +
## Central.Air + Kitchen.Qual2 + Sale.Condition + Exter.Qual2 +
## Bedroom.AbvGr * Bath_1, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46466 -0.07111 -0.00030 0.07329 0.52987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9506781 0.5831688 8.489 < 2e-16 ***
## Lot.Area.Cat 0.0521039 0.0042496 12.261 < 2e-16 ***
## area 0.0002583 0.0000173 14.926 < 2e-16 ***
## Overall.Cond 0.0529747 0.0051098 10.367 < 2e-16 ***
## Bedroom.AbvGr 0.0368895 0.0147321 2.504 0.01244 *
## Bath_1 0.1532259 0.0200788 7.631 5.50e-14 ***
## Year.Built 0.0025467 0.0002963 8.595 < 2e-16 ***
## Neighborhood2Neigh_2 -0.0481259 0.0169999 -2.831 0.00474 **
## Neighborhood2Neigh_3 0.0137171 0.0158151 0.867 0.38597
## Neighborhood2Neigh_4 0.1127654 0.0199149 5.662 1.96e-08 ***
## Neighborhood2Neigh_5 0.0568896 0.0192384 2.957 0.00318 **
## Overall.Qual 0.0859463 0.0063072 13.627 < 2e-16 ***
## Central.AirY 0.1607451 0.0238699 6.734 2.81e-11 ***
## Kitchen.Qual2 0.0321648 0.0113698 2.829 0.00477 **
## Sale.ConditionAdjLand 0.1656805 0.1084536 1.528 0.12692
## Sale.ConditionAlloca 0.0809501 0.0770372 1.051 0.29361
## Sale.ConditionFamily 0.0022861 0.0413424 0.055 0.95591
## Sale.ConditionNormal 0.1112681 0.0199267 5.584 3.05e-08 ***
## Sale.ConditionPartial 0.1521176 0.0270766 5.618 2.52e-08 ***
## Exter.Qual2 0.0190427 0.0144166 1.321 0.18685
## Bedroom.AbvGr:Bath_1 -0.0305291 0.0058543 -5.215 2.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1476 on 979 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8768
## F-statistic: 356.5 on 20 and 979 DF, p-value: < 2.2e-16
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the ames_validation
dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
#data already loaded with transformations
#load("ames_validation.Rdata")
In general, the better the model fit, the lower the RMSE. In this case, the RMSE of the test data RMSE ($20,894) is lower than using the training data. This is unusual, because the model should fit the data it was fitted to better than a new data set. The outliers in the training set increased its RMSE.
# Extract Predictions (convert from log)
ames_validation$predict.price <- exp(predict(ames.lm.full, ames_validation))
# Calculate RMSE
rmse.full <- sqrt(mean((ames_validation$price - ames_validation$predict.price)^2))
rmse.full
## [1] 20893.55
98% of the 95% predictive confidence intervals contain the true price of the house in the validation data set. This result supports the model properly reflecting uncertainty in the actual sales contracts.
# Calculate prediction interval
ames_val.pred <- predict(ames.lm.full, ames_validation, interval = "prediction")
# Get dataset of predictions and confidence intervals
out = as.data.frame(cbind(exp(ames_val.pred),
price = ames_validation$price))
# Fix names in dataset
#colnames(out)[1:2] <- c("lwr", "upr") #fix names
# Get Coverage
out %>% summarize(cover = sum(price >= lwr & price <= upr)/n())
## cover
## 1 0.9803408
#pred.test.HPM.coverage
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
As explained in Pardoe (2008) (https://ww2.amstat.org/publications/jse/v16n2/datasets.pardoe.html), a model such as this can narrow the range of possible values for the asking price of a home. For example, a home with characteristcs such as those shown in below sold for $147,000. A 95% prediction interval based on an average of the entire data set is ($40,068, $304,460) with mean $172,264. (sample mean \(\pm\) \(t_{.05}\) × sample standard deviation). The model 95% prediction interval is much closer at ($116,342, $208,235) with predicted value $155,684.
val <- ames_validation %>% summarise(mean=mean(price), sd=sd(price), n=n())
error <- qt(0.975,df=val$n-1)*val$sd
left <- mean(val$mean)-error
right <- mean(val$mean)+error
val$mean
## [1] 172264.1
left
## [1] 40068.18
right
## [1] 304460.1
ames_validation[5,c('price', 'Lot.Area', 'area', 'Overall.Cond', 'Bedroom.AbvGr', 'Bath_1', 'Year.Built', 'Neighborhood', 'Overall.Qual', 'Central.Air', 'Kitchen.Qual', 'Sale.Condition', 'Exter.Qual')]
## # A tibble: 1 x 13
## price Lot.Area area Overall.Cond Bedroom.AbvGr Bath_1 Year.Built
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 147000 10970 1026 6 3 2.00 1978
## # ... with 6 more variables: Neighborhood <fct>, Overall.Qual <int>,
## # Central.Air <fct>, Kitchen.Qual <fct>, Sale.Condition <fct>,
## # Exter.Qual <fct>
exp(predict(ames.lm.full, ames_validation[5,], interval="prediction"))
## fit lwr upr
## 1 155648.4 116341.8 208234.9
The model also does a good job with characterizing the market, explaining most of the variation in sale price (R2 = 87.93%).