Let us load the necessary packages and then the dataset required to perform the analysis.
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(devtools)
## Loading required package: usethis
library(BAS)
Load the dataset:
load("ames_train.Rdata.gz")
glimpse(ames_train)
## Rows: 1,000
## Columns: 81
## $ PID <int> 909176150, 905476230, 911128020, 535377150, 534177230,~
## $ area <int> 856, 1049, 1001, 1039, 1665, 1922, 936, 1246, 889, 107~
## $ price <int> 126000, 139500, 124900, 114000, 227000, 198500, 93000,~
## $ MS.SubClass <int> 30, 120, 30, 70, 60, 85, 20, 20, 20, 180, 120, 60, 30,~
## $ MS.Zoning <fct> RL, RL, C (all), RL, RL, RL, RM, RL, RL, RM, RL, RL, R~
## $ Lot.Frontage <int> NA, 42, 60, 80, 70, 64, 60, 53, 74, 35, 48, 63, 62, NA~
## $ Lot.Area <int> 7890, 4235, 6060, 8146, 8400, 7301, 6000, 3710, 12395,~
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ~
## $ Alley <fct> NA, NA, NA, NA, NA, NA, Pave, NA, NA, NA, NA, NA, NA, ~
## $ Lot.Shape <fct> Reg, Reg, Reg, Reg, Reg, Reg, Reg, Reg, Reg, Reg, Reg,~
## $ Land.Contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Bnk, Lvl, Lvl, Lvl, Lvl,~
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPub~
## $ Lot.Config <fct> Corner, Inside, Inside, Corner, Inside, Corner, Inside~
## $ Land.Slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Mod, Gtl, Gtl, Gtl, Gtl,~
## $ Neighborhood <fct> SWISU, Edwards, IDOTRR, OldTown, NWAmes, Edwards, OldT~
## $ Condition.1 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ~
## $ Condition.2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ~
## $ Bldg.Type <fct> 1Fam, TwnhsE, 1Fam, 1Fam, 1Fam, 1Fam, 2fmCon, 1Fam, 1F~
## $ House.Style <fct> 1Story, 1Story, 1Story, 2Story, 2Story, SFoyer, 1Story~
## $ Overall.Qual <int> 6, 5, 5, 4, 8, 7, 4, 7, 5, 6, 8, 5, 4, 6, 7, 6, 7, 3, ~
## $ Overall.Cond <int> 6, 5, 9, 8, 6, 5, 4, 5, 6, 5, 5, 5, 6, 5, 5, 5, 5, 3, ~
## $ Year.Built <int> 1939, 1984, 1930, 1900, 2001, 2003, 1953, 2007, 1984, ~
## $ Year.Remod.Add <int> 1950, 1984, 2007, 2003, 2001, 2003, 1953, 2008, 1984, ~
## $ Roof.Style <fct> Gable, Gable, Hip, Gable, Gable, Gable, Gable, Gable, ~
## $ Roof.Matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompShg, ~
## $ Exterior.1st <fct> Wd Sdng, HdBoard, MetalSd, MetalSd, VinylSd, HdBoard, ~
## $ Exterior.2nd <fct> Wd Sdng, HdBoard, MetalSd, MetalSd, VinylSd, HdBoard, ~
## $ Mas.Vnr.Type <fct> None, BrkFace, None, None, None, BrkFace, None, BrkFac~
## $ Mas.Vnr.Area <int> 0, 149, 0, 0, 0, 500, 0, 20, 0, 76, 196, 0, 0, 247, 11~
## $ Exter.Qual <fct> TA, Gd, Gd, Gd, Gd, Gd, Fa, Gd, TA, TA, Gd, TA, TA, TA~
## $ Exter.Cond <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA~
## $ Foundation <fct> CBlock, CBlock, BrkTil, BrkTil, PConc, Slab, CBlock, P~
## $ Bsmt.Qual <fct> TA, Gd, TA, Fa, Gd, NA, Fa, Gd, TA, Gd, Gd, Gd, Fa, Gd~
## $ Bsmt.Cond <fct> TA, TA, TA, TA, TA, NA, TA, TA, TA, TA, TA, TA, TA, TA~
## $ Bsmt.Exposure <fct> No, Mn, No, No, No, NA, No, Gd, No, Gd, No, No, No, No~
## $ BsmtFin.Type.1 <fct> Rec, GLQ, ALQ, Unf, GLQ, NA, Unf, Unf, ALQ, GLQ, GLQ, ~
## $ BsmtFin.SF.1 <int> 238, 552, 737, 0, 643, 0, 0, 0, 647, 467, 24, 458, 0, ~
## $ BsmtFin.Type.2 <fct> Unf, ALQ, Unf, Unf, Unf, NA, Unf, Unf, Unf, Unf, Unf, ~
## $ BsmtFin.SF.2 <int> 0, 393, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Bsmt.Unf.SF <int> 618, 104, 100, 405, 167, 0, 936, 1146, 217, 80, 1318, ~
## $ Total.Bsmt.SF <int> 856, 1049, 837, 405, 810, 0, 936, 1146, 864, 547, 1342~
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ~
## $ Heating.QC <fct> TA, TA, Ex, Gd, Ex, Ex, TA, Ex, TA, Ex, Ex, Gd, TA, Gd~
## $ Central.Air <fct> Y, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, N, ~
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr~
## $ X1st.Flr.SF <int> 856, 1049, 1001, 717, 810, 495, 936, 1246, 889, 1072, ~
## $ X2nd.Flr.SF <int> 0, 0, 0, 322, 855, 1427, 0, 0, 0, 0, 0, 650, 0, 0, 0, ~
## $ Low.Qual.Fin.SF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Bsmt.Full.Bath <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, ~
## $ Bsmt.Half.Bath <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Full.Bath <int> 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, ~
## $ Half.Bath <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, ~
## $ Bedroom.AbvGr <int> 2, 2, 2, 2, 3, 4, 2, 2, 3, 2, 2, 3, 2, 3, 3, 3, 3, 2, ~
## $ Kitchen.AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ Kitchen.Qual <fct> TA, Gd, Gd, TA, Gd, Gd, TA, Gd, TA, Gd, Gd, TA, TA, TA~
## $ TotRms.AbvGrd <int> 4, 5, 5, 6, 6, 7, 4, 5, 6, 5, 6, 6, 5, 6, 7, 7, 6, 5, ~
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Min2, Typ, Typ, Typ, Typ~
## $ Fireplaces <int> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, ~
## $ Fireplace.Qu <fct> Gd, NA, NA, NA, NA, Ex, NA, Gd, NA, NA, Gd, NA, NA, Gd~
## $ Garage.Type <fct> Detchd, Attchd, Detchd, Detchd, Attchd, BuiltIn, Detch~
## $ Garage.Yr.Blt <int> 1939, 1984, 1930, 1940, 2001, 2003, 1974, 2007, 1984, ~
## $ Garage.Finish <fct> Unf, Fin, Unf, Unf, Fin, RFn, Unf, Fin, Unf, Fin, RFn,~
## $ Garage.Cars <int> 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, ~
## $ Garage.Area <int> 399, 266, 216, 281, 528, 672, 576, 428, 484, 525, 550,~
## $ Garage.Qual <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA~
## $ Garage.Cond <fct> TA, TA, Po, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA~
## $ Paved.Drive <fct> Y, Y, N, N, Y, Y, Y, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, N, ~
## $ Wood.Deck.SF <int> 0, 0, 154, 0, 0, 0, 0, 100, 0, 0, 0, 22, 0, 0, 192, 14~
## $ Open.Porch.SF <int> 0, 105, 0, 0, 45, 0, 32, 24, 0, 44, 35, 0, 0, 76, 74, ~
## $ Enclosed.Porch <int> 0, 0, 42, 168, 0, 177, 112, 0, 0, 0, 0, 0, 128, 0, 0, ~
## $ X3Ssn.Porch <int> 0, 0, 86, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Screen.Porch <int> 166, 0, 0, 111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 185, 0, 0, ~
## $ Pool.Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Pool.QC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ Fence <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, GdWo, NA, ~
## $ Misc.Feature <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ Misc.Val <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Mo.Sold <int> 3, 2, 11, 5, 11, 7, 2, 3, 4, 5, 2, 3, 11, 7, 12, 10, 1~
## $ Yr.Sold <int> 2010, 2009, 2007, 2009, 2009, 2009, 2009, 2008, 2008, ~
## $ Sale.Type <fct> WD , WD , WD , WD , WD , ConLD, WD , New, WD , WD , WD~
## $ Sale.Condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, Normal~
Q1: Create a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# Age of each home relative to the current calendar year
year_present <- as.numeric(format(Sys.Date(), "%Y"))
ames_train$Age <- year_present - ames_train$Year.Built
ggplot(ames_train, aes(x = Age)) +
geom_histogram(bins = 30, alpha = 0.8, color = "black") +  # 30 bins, as requested
xlab("Age") +
ggtitle("Age of Homes") +
theme(plot.title = element_text(hjust = 0.5))
summary(ames_train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.0 20.0 46.0 48.8 66.0 149.0
sd(ames_train$Age)
## [1] 29.63741
The newest home in this dataset was built 11 years ago and the oldest 149 years ago. The mean age is approximately 48.8 years and the median is 46 years; a mean larger than the median indicates a right skew. The distribution is also multimodal, with the tallest peaks at the younger ages, showing that most homes are relatively new. Because of the skew, the median is the preferred measure of center rather than the mean. The standard deviation is about 29.6 years.
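The direction of the skew can also be checked quantitatively. A minimal sketch computing the skewness coefficient by hand (a sample approximation, so no extra package is needed); a positive value confirms the right skew:
age <- ames_train$Age
mean((age - mean(age))^3) / sd(age)^3  # third standardized moment; > 0 means right-skewed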
Q2: The mantra in real estate is “Location, Location, Location!” Let’s explore this by making a graphical display that relates home price to neighborhood in Ames, Iowa. Which summary statistics are most appropriate for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods?
# Order neighborhoods by median price so the boxplot reads from highest to lowest
by_median <- order(as.numeric(by(ames_train$price, ames_train$Neighborhood, median)), decreasing = TRUE)
ames_train$Neighborhood <- ordered(ames_train$Neighborhood, levels = levels(ames_train$Neighborhood)[by_median])
boxplot(price ~ Neighborhood, data = ames_train, las = 2, ylab = "Price", main = "House price by neighborhood in Ames, Iowa", col = "gold")
Because housing prices are right-skewed, the median is the most appropriate statistic for identifying the most and the least expensive neighborhoods: unlike the mean, it is robust to extreme prices. The standard deviation is the appropriate statistic for identifying the most heterogeneous neighborhood, since it directly measures the spread of prices.
See below for these values for each neighborhood.
ames_train %>%
group_by(Neighborhood) %>%
summarise(median=median(price, na.rm=TRUE), Stdev=sd(price, na.rm=TRUE)) %>%
arrange(desc(median))
## # A tibble: 27 x 3
## Neighborhood median Stdev
## <ord> <dbl> <dbl>
## 1 StoneBr 340692. 123459.
## 2 NridgHt 336860 105089.
## 3 NoRidge 290000 35889.
## 4 GrnHill 280000 70711.
## 5 Timber 232500 84030.
## 6 Somerst 221650 65199.
## 7 Greens 212625 29063.
## 8 Veenker 205750 72545.
## 9 Crawfor 205000 71268.
## 10 CollgCr 195800 52786.
## # ... with 17 more rows
The most expensive neighborhood is StoneBr, with a median price of $340,691.50.
The least expensive neighborhood is MeadowV, with a median price of $85,750.
The most heterogeneous neighborhood is StoneBr, with a standard deviation of $123,459.10.
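These values can also be pulled out programmatically rather than read off the table. A sketch assuming dplyr >= 1.0.0 (for slice_max()/slice_min()):
prices_by_nbhd <- ames_train %>%
group_by(Neighborhood) %>%
summarise(median = median(price, na.rm = TRUE),
Stdev = sd(price, na.rm = TRUE))
prices_by_nbhd %>% slice_max(median, n = 1)  # most expensive (StoneBr)
prices_by_nbhd %>% slice_min(median, n = 1)  # least expensive (MeadowV)
prices_by_nbhd %>% slice_max(Stdev, n = 1)   # most heterogeneous (StoneBr)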
Q3: Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
sort(sapply(ames_train, function(x) sum(is.na(x))), decreasing=TRUE)
## Pool.QC Misc.Feature Alley Fence Fireplace.Qu
## 997 971 933 798 491
## Lot.Frontage Garage.Yr.Blt Garage.Qual Garage.Cond Garage.Type
## 167 48 47 47 46
## Garage.Finish Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 46 21 21 21 21
## BsmtFin.Type.2 Mas.Vnr.Area BsmtFin.SF.1 BsmtFin.SF.2 Bsmt.Unf.SF
## 21 7 1 1 1
## Total.Bsmt.SF Bsmt.Full.Bath Bsmt.Half.Bath Garage.Cars Garage.Area
## 1 1 1 1 1
## PID area price MS.SubClass MS.Zoning
## 0 0 0 0 0
## Lot.Area Street Lot.Shape Land.Contour Utilities
## 0 0 0 0 0
## Lot.Config Land.Slope Neighborhood Condition.1 Condition.2
## 0 0 0 0 0
## Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built
## 0 0 0 0 0
## Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 0 0 0 0 0
## Mas.Vnr.Type Exter.Qual Exter.Cond Foundation Heating
## 0 0 0 0 0
## Heating.QC Central.Air Electrical X1st.Flr.SF X2nd.Flr.SF
## 0 0 0 0 0
## Low.Qual.Fin.SF Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr
## 0 0 0 0 0
## Kitchen.Qual TotRms.AbvGrd Functional Fireplaces Paved.Drive
## 0 0 0 0 0
## Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
## 0 0 0 0 0
## Pool.Area Misc.Val Mo.Sold Yr.Sold Sale.Type
## 0 0 0 0 0
## Sale.Condition Age
## 0 0
length(which(ames_train$Pool.Area==0))
## [1] 997
The variable with the greatest number of missing values is Pool.QC (pool quality), with 997 NAs. This makes sense, as most homes in Ames do not have pools: the number of missing values in Pool.QC exactly matches the number of homes with Pool.Area == 0.
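Since these NAs encode “no pool” rather than “value unknown”, one option is to recode them as an explicit factor level. A minimal sketch, working on a copy so the rest of the analysis is unaffected:
pool_qc <- ames_train$Pool.QC
levels(pool_qc) <- c(levels(pool_qc), "None")  # add an explicit "None" level
pool_qc[is.na(pool_qc)] <- "None"
table(pool_qc)  # "None" should account for the 997 homes without pools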
Q4: We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
I use forward selection on adjusted R^2: starting from the empty model, at each step add the candidate variable that increases adjusted R^2 the most, and stop when no remaining variable improves it. The fits below enumerate each step.
ames_train <- ames_train %>% mutate(price_ln = log(price))
summary(lm(ames_train$price_ln ~ ames_train$Lot.Area))$adj.r.squared
## [1] 0.05786118
summary(lm(ames_train$price_ln ~ ames_train$Land.Slope))$adj.r.squared
## [1] 0.005950438
summary(lm(ames_train$price_ln ~ ames_train$Year.Built))$adj.r.squared
## [1] 0.3947132
summary(lm(ames_train$price_ln ~ ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.3674833
summary(lm(ames_train$price_ln ~ ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.03830868
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area))$adj.r.squared
## [1] 0.450636
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Land.Slope))$adj.r.squared
## [1] 0.4030474
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.4717006
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.4403228
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area))$adj.r.squared
## [1] 0.5239686
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Land.Slope))$adj.r.squared
## [1] 0.4803236
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.512633
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Land.Slope))$adj.r.squared
## [1] 0.5314859
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.5526062
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope))$adj.r.squared
## [1] 0.5598345
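For reference, the enumeration above can be automated. A minimal sketch of greedy forward selection on adjusted R^2 over the same five candidates (it repeatedly adds whichever variable raises adjusted R^2 most and stops when none does):
candidates <- c("Lot.Area", "Land.Slope", "Year.Built",
"Year.Remod.Add", "Bedroom.AbvGr")
selected <- character(0)
best_adj_r2 <- -Inf
repeat {
remaining <- setdiff(candidates, selected)
if (length(remaining) == 0) break
# adjusted R^2 for each one-variable extension of the current model
scores <- sapply(remaining, function(v) {
f <- reformulate(c(selected, v), response = "price_ln")
summary(lm(f, data = ames_train))$adj.r.squared
})
if (max(scores) <= best_adj_r2) break  # no improvement: stop
best_adj_r2 <- max(scores)
selected <- c(selected, names(scores)[which.max(scores)])
}
selected     # variables in the order they entered
best_adj_r2  # adjusted R^2 of the selected model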
Forward selection therefore stops with all five predictors included. The final model is:
(ames_full <- lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope))
##
## Call:
## lm(formula = ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add +
## ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope)
##
## Coefficients:
## (Intercept) ames_train$Year.Built
## -1.371e+01 6.049e-03
## ames_train$Year.Remod.Add ames_train$Lot.Area
## 6.778e-03 1.028e-05
## ames_train$Bedroom.AbvGr ames_train$Land.SlopeMod
## 8.686e-02 1.384e-01
## ames_train$Land.SlopeSev
## -4.567e-01
ggplot(ames_train, aes(x = ames_full$residuals)) +
geom_histogram(bins = 30)  # histogram of residuals; should look roughly normal
plot(ames_full$residuals, col = "red")    # residuals vs. observation index
abline(h = 0, lty = 2)                    # reference line at zero
qqnorm(ames_full$residuals, col = "red")  # normal Q-Q plot of the residuals
qqline(ames_full$residuals)
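As an aside, base R's plot method for lm objects produces the standard diagnostic panels in a single call; a minimal alternative sketch:
par(mfrow = c(2, 2))  # arrange the four diagnostic panels in a 2x2 grid
plot(ames_full)       # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))  # restore the default layout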
stepAIC(ames_full, direction="backward", trace=TRUE)
## Start: AIC=-2545.77
## ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add +
## ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope
##
## Df Sum of Sq RSS AIC
## <none> 77.322 -2545.8
## - ames_train$Land.Slope 2 1.4281 78.750 -2531.5
## - ames_train$Bedroom.AbvGr 1 5.0628 82.385 -2484.3
## - ames_train$Lot.Area 1 6.7292 84.051 -2464.3
## - ames_train$Year.Remod.Add 1 11.9642 89.286 -2403.9
## - ames_train$Year.Built 1 19.8546 97.177 -2319.2
##
## Call:
## lm(formula = ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add +
## ames_train$Lot.Area + ames_train$Bedroom.AbvGr + ames_train$Land.Slope)
##
## Coefficients:
## (Intercept) ames_train$Year.Built
## -1.371e+01 6.049e-03
## ames_train$Year.Remod.Add ames_train$Lot.Area
## 6.778e-03 1.028e-05
## ames_train$Bedroom.AbvGr ames_train$Land.SlopeMod
## 8.686e-02 1.384e-01
## ames_train$Land.SlopeSev
## -4.567e-01
ames_bas <- bas.lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr,
prior="BIC",
modelprior=uniform(),
data=na.omit(ames_train))
ames_bas
##
## Call:
## bas.lm(formula = ames_train$price_ln ~ ames_train$Year.Built +
## ames_train$Year.Remod.Add + ames_train$Lot.Area + ames_train$Bedroom.AbvGr,
## data = na.omit(ames_train), prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept ames_train$Year.Built
## 1 1
## ames_train$Year.Remod.Add ames_train$Lot.Area
## 1 1
## ames_train$Bedroom.AbvGr
## 1
confint(coef(ames_bas))
## 2.5% 97.5% beta
## Intercept 1.200058e+01 1.203540e+01 1.201847e+01
## ames_train$Year.Built 5.250789e-03 6.745395e-03 6.018565e-03
## ames_train$Year.Remod.Add 5.781670e-03 7.936535e-03 6.888619e-03
## ames_train$Lot.Area 6.853953e-06 1.044256e-05 8.697411e-06
## ames_train$Bedroom.AbvGr 6.528917e-02 1.076242e-01 8.703660e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(ames_bas)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1 1.0000 1.000000e+00 1.000000e+00
## ames_train$Year.Built 1 1.0000 1.000000e+00 1.000000e+00
## ames_train$Year.Remod.Add 1 1.0000 1.000000e+00 1.000000e+00
## ames_train$Lot.Area 1 1.0000 1.000000e+00 0.000000e+00
## ames_train$Bedroom.AbvGr 1 1.0000 0.000000e+00 1.000000e+00
## BF NA 1.0000 6.442574e-13 4.998273e-18
## PostProbs NA 1.0000 0.000000e+00 0.000000e+00
## R2 NA 0.5544 5.254000e-01 5.141000e-01
## dim NA 5.0000 4.000000e+00 4.000000e+00
## logmarg NA -2200.4098 -2.228480e+03 -2.240247e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## ames_train$Year.Built 1.000000e+00 1.000000e+00
## ames_train$Year.Remod.Add 0.000000e+00 1.000000e+00
## ames_train$Lot.Area 1.000000e+00 0.000000e+00
## ames_train$Bedroom.AbvGr 1.000000e+00 0.000000e+00
## BF 5.860164e-31 2.943579e-34
## PostProbs 0.000000e+00 0.000000e+00
## R2 4.843000e-01 4.728000e-01
## dim 4.000000e+00 3.000000e+00
## logmarg -2.270022e+03 -2.277618e+03
image(ames_bas, rotate=FALSE)
For the ames_full model (our final model), the histogram of residuals is near-normal, the residuals scatter randomly around zero, and the normal Q-Q plot is nearly linear. These conditions must hold for the linear model to be valid.
The three methods I used to select a model are forward selection on adjusted R^2, backward elimination by AIC (stepAIC), and Bayesian model averaging (bas.lm with the BIC prior). The model with the highest adjusted R^2 contains all of the variables; backward elimination by AIC also retains every variable; and the model with the greatest posterior probability is likewise the full model. All three methods produce the same result: the full model is the best model.
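As a cross-check, the criterion each method optimizes can be read straight from the fitted objects. A short sketch (extractAIC is the function stepAIC uses internally, so its scale matches the AIC values printed above):
summary(ames_full)$adj.r.squared  # forward-selection criterion (adjusted R^2)
extractAIC(ames_full)             # (edf, AIC) on the scale stepAIC reports
summary(ames_bas)                 # posterior model probabilities from BAS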
Q5: Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# Find the home with the largest squared residual
ames_residuals <- residuals(ames_full)
ames_residuals_squared <- ames_residuals^2
ames_train$PID[which.max(ames_residuals_squared)]
## [1] 902207130
which.max(ames_residuals_squared)
## 428
## 428
min(ames_train$price)
## [1] 12789
The home with the largest squared residual has PID 902207130, at index 428 in the ames_train dataset.
Its most salient feature is its price: $12,789, the lowest in the dataset and far below every other home. It is also one of the older homes, built in 1923, and age tends to depress price because older homes are less likely to be in good condition. It has no central air and no paved drive, and its exterior is asbestos shingles, a material now known to be a health risk and no longer in use. Finally, it was sold under abnormal circumstances (a trade, foreclosure, or short sale); homes sold this way are plausibly more likely to be in disrepair, since funds for upkeep were not available.
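The features cited above can be read directly from that row. A quick sketch (this particular column subset is illustrative):
outlier <- which.max(ames_residuals_squared)
ames_train[outlier, c("price", "Year.Built", "Central.Air",
"Paved.Drive", "Exterior.1st", "Sale.Condition")]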
Q6: Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
ames_train <- ames_train %>%
mutate(Lot.Area.log = log(Lot.Area))
summary(lm(ames_train$price_ln ~ ames_train$Lot.Area.log))$adj.r.squared
## [1] 0.1368512
summary(lm(ames_train$price_ln ~ ames_train$Land.Slope))$adj.r.squared
## [1] 0.005950438
summary(lm(ames_train$price_ln ~ ames_train$Year.Built))$adj.r.squared
## [1] 0.3947132
summary(lm(ames_train$price_ln ~ ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.3674833
summary(lm(ames_train$price_ln ~ ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.03830868
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log))$adj.r.squared
## [1] 0.5209301
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Land.Slope))$adj.r.squared
## [1] 0.4030474
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.4717006
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.4403228
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Land.Slope))$adj.r.squared
## [1] 0.5225054
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.5900694
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.5340868
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Land.Slope + ames_train$Year.Remod.Add))$adj.r.squared
## [1] 0.5910888
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Land.Slope + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.5363921
summary(lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Lot.Area.log + ames_train$Land.Slope + ames_train$Year.Remod.Add + ames_train$Bedroom.AbvGr))$adj.r.squared
## [1] 0.6032018
ames_full_log <- lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr)
stepAIC(ames_full_log, direction="backward", trace=TRUE)
## Start: AIC=-2647.16
## ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add +
## ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 70.147 -2647.2
## - ames_train$Bedroom.AbvGr 1 2.0816 72.228 -2619.9
## - ames_train$Year.Remod.Add 1 11.9455 82.092 -2491.9
## - ames_train$Lot.Area.log 1 15.7256 85.872 -2446.9
## - ames_train$Year.Built 1 19.3032 89.450 -2406.1
##
## Call:
## lm(formula = ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add +
## ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr)
##
## Coefficients:
## (Intercept) ames_train$Year.Built
## -15.574204 0.005963
## ames_train$Year.Remod.Add ames_train$Lot.Area.log
## 0.006765 0.247077
## ames_train$Bedroom.AbvGr
## 0.057258
(ames_log_bas <- bas.lm(ames_train$price_ln ~ ames_train$Year.Built + ames_train$Year.Remod.Add + ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr,
prior="BIC",
modelprior=uniform(),
data=na.omit(ames_train)))
##
## Call:
## bas.lm(formula = ames_train$price_ln ~ ames_train$Year.Built +
## ames_train$Year.Remod.Add + ames_train$Lot.Area.log + ames_train$Bedroom.AbvGr,
## data = na.omit(ames_train), prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept ames_train$Year.Built
## 1 1
## ames_train$Year.Remod.Add ames_train$Lot.Area.log
## 1 1
## ames_train$Bedroom.AbvGr
## 1
Yes: replacing Lot.Area with log(Lot.Area) changes the final model for two of the selection methods, backward elimination by AIC (stepAIC) and Bayesian model averaging. The model chosen by forward selection on adjusted R^2 did not change; it is still the full model.
Q7: Do you think it is better to log transform Lot.Area, in terms of the assumptions of linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide written support, including a quantitative justification, for your answer to the first part of this question.
ames_pred <- predict(ames_full, ames_train)  # predictions from the Lot.Area model
plot(ames_pred, ames_train$price)            # predicted log price vs. actual price
abline(lm(ames_train$price ~ ames_pred), col = "red")
summary(lm(ames_train$price ~ ames_pred))$r.squared
## [1] 0.4847039
ames_log_pred <- predict(ames_full_log, ames_train)  # predictions from the log(Lot.Area) model
plot(ames_log_pred, ames_train$price)
abline(lm(ames_train$price ~ ames_log_pred), col = "red")
summary(lm(ames_train$price ~ ames_log_pred))$r.squared
## [1] 0.5186156
Yes, it is better to log transform Lot.Area. Lot.Area is strongly right-skewed, and taking its log makes its relationship with log(price) much more linear (the simple-regression adjusted R^2 rises from 0.058 to 0.137), which better satisfies the linearity assumption of linear regression.
The plots bear this out: the R^2 of lm(ames_train$price ~ ames_pred) is 0.4847, while the R^2 of lm(ames_train$price ~ ames_log_pred) is 0.5186. The predictions from the log(Lot.Area) model therefore explain more of the variation in home prices.
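A complementary quantitative check is in-sample error on the log scale the models were actually fit on. A minimal sketch (lower RMSE is better; consistent with the R^2 comparison above, the log(Lot.Area) model comes out ahead):
# root-mean-square error of each model on log(price)
c(Lot.Area     = sqrt(mean(residuals(ames_full)^2)),
log.Lot.Area = sqrt(mean(residuals(ames_full_log)^2)))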