Cory Torrella has been hired as a statistical consultant by a real estate investment firm.
The task is to develop a model to predict the selling price of a given home in Ames, Iowa. The firm hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we load only the training data set; the others will be loaded and used later.
load("ames_train.Rdata")
Use the code block below to load any necessary packages.
library(statsr)
library(dplyr)
library(BAS)
library(tidyverse)
library(gridExtra)
library(MASS)   # note: MASS masks dplyr::select, so we call dplyr::select() explicitly later
library(knitr)
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set. Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
We begin by exploring a summary of the ames_train data:
summary(ames_train)
## PID area price MS.SubClass
## Min. :5.263e+08 Min. : 334 Min. : 12789 Min. : 20.00
## 1st Qu.:5.285e+08 1st Qu.:1092 1st Qu.:129763 1st Qu.: 20.00
## Median :5.354e+08 Median :1411 Median :159467 Median : 50.00
## Mean :7.059e+08 Mean :1477 Mean :181190 Mean : 57.15
## 3rd Qu.:9.071e+08 3rd Qu.:1743 3rd Qu.:213000 3rd Qu.: 70.00
## Max. :1.007e+09 Max. :4676 Max. :615000 Max. :190.00
##
## MS.Zoning Lot.Frontage Lot.Area Street Alley
## A (agr): 0 Min. : 21.00 Min. : 1470 Grvl: 3 Grvl: 33
## C (all): 9 1st Qu.: 57.00 1st Qu.: 7314 Pave:997 Pave: 34
## FV : 56 Median : 69.00 Median : 9317 NA's:933
## I (all): 1 Mean : 69.21 Mean : 10352
## RH : 7 3rd Qu.: 80.00 3rd Qu.: 11650
## RL :772 Max. :313.00 Max. :215245
## RM :155 NA's :167
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## IR1:338 Bnk: 33 AllPub:1000 Corner :173 Gtl:962 NAmes :155
## IR2: 30 HLS: 38 NoSeWa: 0 CulDSac: 76 Mod: 33 CollgCr: 85
## IR3: 3 Low: 20 NoSewr: 0 FR2 : 36 Sev: 5 Somerst: 74
## Reg:629 Lvl:909 FR3 : 5 OldTown: 71
## Inside :710 Sawyer : 61
## Edwards: 60
## (Other):494
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual
## Norm :875 Norm :988 1Fam :823 1Story :521 Min. : 1.000
## Feedr : 53 Feedr : 6 2fmCon: 20 2Story :286 1st Qu.: 5.000
## Artery : 23 Artery : 2 Duplex: 35 1.5Fin : 98 Median : 6.000
## RRAn : 14 PosN : 2 Twnhs : 38 SLvl : 41 Mean : 6.095
## PosN : 11 PosA : 1 TwnhsE: 84 SFoyer : 36 3rd Qu.: 7.000
## RRAe : 11 RRNn : 1 2.5Unf : 10 Max. :10.000
## (Other): 13 (Other): 0 (Other): 8
## Overall.Cond Year.Built Year.Remod.Add Roof.Style Roof.Matl
## Min. :1.000 Min. :1872 Min. :1950 Flat : 9 CompShg:984
## 1st Qu.:5.000 1st Qu.:1955 1st Qu.:1966 Gable :775 Tar&Grv: 11
## Median :5.000 Median :1975 Median :1992 Gambrel: 8 WdShake: 2
## Mean :5.559 Mean :1972 Mean :1984 Hip :204 WdShngl: 2
## 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004 Mansard: 4 Metal : 1
## Max. :9.000 Max. :2010 Max. :2010 Shed : 0 ClyTile: 0
## (Other): 0
## Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area Exter.Qual
## VinylSd:349 VinylSd:345 : 7 Min. : 0.0 Ex: 39
## HdBoard:164 HdBoard:150 BrkCmn : 8 1st Qu.: 0.0 Fa: 11
## MetalSd:147 MetalSd:148 BrkFace:317 Median : 0.0 Gd:337
## Wd Sdng:138 Wd Sdng:130 CBlock : 0 Mean : 104.1 TA:613
## Plywood: 74 Plywood: 96 None :593 3rd Qu.: 160.0
## CemntBd: 40 CmentBd: 40 Stone : 75 Max. :1290.0
## (Other): 88 (Other): 91 NA's :7
## Exter.Cond Foundation Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## Ex: 4 BrkTil:102 : 1 : 1 : 2 GLQ :294
## Fa: 19 CBlock:430 Ex : 87 Ex : 2 Av :157 Unf :279
## Gd:116 PConc :453 Fa : 28 Fa : 23 Gd : 98 ALQ :163
## Po: 0 Slab : 12 Gd :424 Gd : 44 Mn : 87 Rec :107
## TA:861 Stone : 3 Po : 1 Po : 1 No :635 BLQ : 87
## Wood : 0 TA :438 TA :908 NA's: 21 (Other): 49
## NA's: 21 NA's: 21 NA's : 21
## BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF
## Min. : 0.0 Unf :863 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 LwQ : 31 1st Qu.: 0.00 1st Qu.: 223.5
## Median : 400.0 Rec : 29 Median : 0.00 Median : 461.0
## Mean : 464.1 BLQ : 24 Mean : 48.07 Mean : 547.0
## 3rd Qu.: 773.0 ALQ : 20 3rd Qu.: 0.00 3rd Qu.: 783.0
## Max. :2260.0 (Other): 12 Max. :1526.00 Max. :2336.0
## NA's :1 NA's : 21 NA's :1 NA's :1
## Total.Bsmt.SF Heating Heating.QC Central.Air Electrical
## Min. : 0.0 Floor: 0 Ex:516 N: 55 : 0
## 1st Qu.: 797.5 GasA :988 Fa: 22 Y:945 FuseA: 54
## Median : 998.0 GasW : 8 Gd:157 FuseF: 12
## Mean :1059.2 Grav : 2 Po: 1 FuseP: 2
## 3rd Qu.:1301.0 OthW : 1 TA:304 Mix : 0
## Max. :3138.0 Wall : 1 SBrkr:932
## NA's :1
## X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Bsmt.Full.Bath
## Min. : 334.0 Min. : 0.0 Min. : 0.00 Min. :0.0000
## 1st Qu.: 876.2 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:0.0000
## Median :1080.5 Median : 0.0 Median : 0.00 Median :0.0000
## Mean :1157.1 Mean : 315.2 Mean : 4.32 Mean :0.4474
## 3rd Qu.:1376.2 3rd Qu.: 688.2 3rd Qu.: 0.00 3rd Qu.:1.0000
## Max. :3138.0 Max. :1836.0 Max. :1064.00 Max. :3.0000
## NA's :1
## Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## Min. :0.00000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.000 Median :3.000
## Mean :0.06106 Mean :1.541 Mean :0.378 Mean :2.806
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :2.00000 Max. :4.000 Max. :2.000 Max. :6.000
## NA's :1
## Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces
## Min. :0.000 Ex: 67 Min. : 2.00 Typ :935 Min. :0.000
## 1st Qu.:1.000 Fa: 20 1st Qu.: 5.00 Min2 : 24 1st Qu.:0.000
## Median :1.000 Gd:403 Median : 6.00 Min1 : 18 Median :1.000
## Mean :1.039 Po: 1 Mean : 6.34 Mod : 16 Mean :0.597
## 3rd Qu.:1.000 TA:509 3rd Qu.: 7.00 Maj1 : 4 3rd Qu.:1.000
## Max. :2.000 Max. :13.00 Maj2 : 2 Max. :4.000
## (Other): 1
## Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars
## Ex : 16 2Types : 10 Min. :1900 : 2 Min. :0.000
## Fa : 24 Attchd :610 1st Qu.:1961 Fin :247 1st Qu.:1.000
## Gd :232 Basment: 11 Median :1979 RFn :278 Median :2.000
## Po : 18 BuiltIn: 56 Mean :1978 Unf :427 Mean :1.767
## TA :219 CarPort: 1 3rd Qu.:2002 NA's: 46 3rd Qu.:2.000
## NA's:491 Detchd :266 Max. :2010 Max. :5.000
## NA's : 46 NA's :48 NA's :1
## Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## Min. : 0.0 : 1 : 1 N: 67 Min. : 0.00
## 1st Qu.: 312.0 Ex : 1 Ex : 1 P: 29 1st Qu.: 0.00
## Median : 480.0 Fa : 37 Fa : 21 Y:904 Median : 0.00
## Mean : 475.4 Gd : 7 Gd : 6 Mean : 93.84
## 3rd Qu.: 576.0 Po : 3 Po : 6 3rd Qu.:168.00
## Max. :1390.0 TA :904 TA :918 Max. :857.00
## NA's :1 NA's: 47 NA's: 47
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 28.00 Median : 0.00 Median : 0.000 Median : 0.00
## Mean : 48.93 Mean : 23.48 Mean : 3.118 Mean : 14.77
## 3rd Qu.: 74.00 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00
## Max. :742.00 Max. :432.00 Max. :508.000 Max. :440.00
##
## Pool.Area Pool.QC Fence Misc.Feature Misc.Val
## Min. : 0.000 Ex : 1 GdPrv: 43 Elev: 0 Min. : 0.00
## 1st Qu.: 0.000 Fa : 1 GdWo : 37 Gar2: 2 1st Qu.: 0.00
## Median : 0.000 Gd : 1 MnPrv:120 Othr: 1 Median : 0.00
## Mean : 1.463 TA : 0 MnWw : 2 Shed: 25 Mean : 45.81
## 3rd Qu.: 0.000 NA's:997 NA's :798 TenC: 1 3rd Qu.: 0.00
## Max. :800.000 NA's:971 Max. :15500.00
##
## Mo.Sold Yr.Sold Sale.Type Sale.Condition
## Min. : 1.000 Min. :2006 WD :863 Abnorml: 61
## 1st Qu.: 4.000 1st Qu.:2007 New : 79 AdjLand: 2
## Median : 6.000 Median :2008 COD : 27 Alloca : 4
## Mean : 6.243 Mean :2008 ConLD : 7 Family : 17
## 3rd Qu.: 8.000 3rd Qu.:2009 ConLw : 6 Normal :834
## Max. :12.000 Max. :2010 Con : 5 Partial: 82
## (Other): 13
Before modeling, we take note of values that need cleaning before they can be useful. Our first actions (a sketch of the recoding appears after this list):
Filter the data set to contain only sales with a Normal sale condition, since houses sold under abnormal conditions exhibit atypical behavior and can disproportionately influence the model.
Avoid biasing the modeling by recoding NA's to their respective categories, while acknowledging that some continuous variables (such as Lot.Frontage, with 167 NA's) are genuinely missing potentially valuable data.
Categorical variables such as Fence, Garage.Qual, and Garage.Cond have NA's that are not missing data but rather a separate category, such as "no fence" or "no garage".
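As a concrete illustration, the recoding described above might look like the following (a minimal sketch; fct_explicit_na() comes from the forcats package loaded with the tidyverse, and the replacement level names are our own):
ames_clean <- ames_train %>%
  filter(Sale.Condition == "Normal") %>%
  mutate(Fence = fct_explicit_na(Fence, na_level = "NoFence"),   # NA here means "no fence"
         Garage.Qual = fct_explicit_na(Garage.Qual, na_level = "NoGarage"),
         Garage.Cond = fct_explicit_na(Garage.Cond, na_level = "NoGarage"))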
With this information, we explore relationships and considerations of different factors:
n.value <- ames_train %>%
group_by(Neighborhood) %>%
summarize(homes.sold = n(),
avg.price = mean(price, na.rm = T),
med.price = median(price, na.rm = T),
iqr.price = IQR(price, na.rm = T),
avg.area = mean(area, na.rm = T),
med.area = median(area, na.rm = T),
iqr.area = IQR(area, na.rm = T)) %>%
arrange(desc(med.price))
n.value
## # A tibble: 27 x 8
## Neighborhood homes.sold avg.price med.price iqr.price avg.area med.area
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 StoneBr 20 339316. 340692. 151358 1950. 1870
## 2 NridgHt 57 333647. 336860 148800 2017. 1980
## 3 NoRidge 28 295845. 290000 50312. 2290 2282.
## 4 GrnHill 2 280000 280000 50000 1398. 1398.
## 5 Timber 19 265192. 232500 151200 1686. 1717
## 6 Somerst 74 234596. 221650 72684. 1603. 1569
## 7 Greens 4 198562. 212625 16438. 1136 1230.
## 8 Veenker 10 233650 205750 68125 1766. 1576.
## 9 Crawfor 29 204197. 205000 80100 1720. 1646
## 10 CollgCr 85 196951. 195800 58836 1479. 1541
## # ... with 17 more rows, and 1 more variable: iqr.area <dbl>
The variation in average price, and in the spread of prices, across neighborhoods suggests that Neighborhood will be a useful variable in a model for predicting home prices.
Now, we’ll explore the distribution of age of the houses in each neighborhood:
data.value <- ames_train
data.value$age <- 2019 - data.value$Year.Built  # age in years, relative to 2019
p1 <- ggplot(data.value, aes(x = age, y = ..density..)) +
  geom_histogram(bins = 30, fill = 'dark gray', color = 'white') +
  geom_density(size = 0.8, color = 'dark green') +
  labs(title = "Distribution of ages of the houses", x = "Age (years)", y = "Density") +
  geom_vline(xintercept = mean(data.value$age), color = "dark red", size = 1.2) +    # mean age
  geom_vline(xintercept = median(data.value$age), color = "dark blue", size = 1.2) + # median age
  theme_bw()
p1
df_price_neighbor <- data.value[, c("price", "Neighborhood")]
p2 <- ggplot(df_price_neighbor,
             aes(y = price, x = reorder(Neighborhood, price, median), fill = Neighborhood)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Price among Neighborhoods", x = "Neighborhood", y = "Price")
p2
p3 <- ggplot(data.value, aes(x = Year.Built, y = log(price), col = Overall.Qual)) +
  geom_point() +
  theme_bw() +
  labs(title = 'Year.Built Analysis', x = 'Year house was built', y = 'Logarithm of Price')
p3
As expected, newer houses sell for higher prices than older ones.
However, because some neighborhoods consist mostly of new homes with higher-than-average prices, a home's age is partly confounded with its neighborhood, which we must keep in mind when building the prediction model.
lot.data <- ames_train[which(ames_train$Lot.Area < 50000), ]  # drop extreme lots for display
p4 <- ggplot(lot.data, aes(x = Lot.Area, y = price)) +
  geom_jitter(height = 0.05) +  # jittered points (a separate geom_point layer would be redundant)
  geom_smooth(se = TRUE, method = "lm", formula = y ~ splines::bs(x, 3)) +
  ggtitle("Lot Size Analysis")
p4
Homes built on larger lots typically sell for higher prices, all else being equal.
However, there are diminishing returns to additional lot area beyond roughly 25,000 square feet. Because Lot.Area is strongly right-skewed, we will use its logarithm, along with the logarithm of price, to obtain a more linear relationship.
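As a quick check (a plot of our own, reusing the lot.data subset defined above), drawing the same relationship on log-log axes gives a visibly more linear trend:
p5 <- ggplot(lot.data, aes(x = log(Lot.Area), y = log(price))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE) +
  ggtitle("Lot Size Analysis (log-log scale)")
p5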
It is essential to explore the structure of the data and the relationships between the variables in the data set. The three or four plots above represent only a small fraction of the variables and relationships we explored.
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
We pick 10 predictor variables for price and log-transform price and area (as discussed in the EDA above) to build our initial multiple linear regression model. The variables were chosen because the EDA showed them to have strong, interpretable relationships with price.
Later, we may remove variables with high p-values, because they add little explanatory value to the model and can inflate standard errors.
i.model <- lm(log(price) ~ log(area) + Neighborhood + Lot.Config + Overall.Qual + Overall.Cond + Year.Built + TotRms.AbvGrd + Full.Bath + Half.Bath + log(Lot.Area),
data = ames_train)
summary(i.model)
##
## Call:
## lm(formula = log(price) ~ log(area) + Neighborhood + Lot.Config +
## Overall.Qual + Overall.Cond + Year.Built + TotRms.AbvGrd +
## Full.Bath + Half.Bath + log(Lot.Area), data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.44760 -0.06961 0.00200 0.08105 0.50723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2290996 0.8553047 -4.945 9e-07 ***
## log(area) 0.5660212 0.0340768 16.610 < 2e-16 ***
## NeighborhoodBlueste -0.1177475 0.0980573 -1.201 0.230124
## NeighborhoodBrDale -0.1890690 0.0679204 -2.784 0.005480 **
## NeighborhoodBrkSide -0.0714155 0.0577511 -1.237 0.216535
## NeighborhoodClearCr -0.0609548 0.0667644 -0.913 0.361480
## NeighborhoodCollgCr -0.1127382 0.0497513 -2.266 0.023671 *
## NeighborhoodCrawfor 0.0503747 0.0578107 0.871 0.383768
## NeighborhoodEdwards -0.1577755 0.0526820 -2.995 0.002816 **
## NeighborhoodGilbert -0.1424677 0.0526795 -2.704 0.006963 **
## NeighborhoodGreens 0.0531447 0.0891940 0.596 0.551427
## NeighborhoodGrnHill 0.2656380 0.1147967 2.314 0.020879 *
## NeighborhoodIDOTRR -0.2127837 0.0591085 -3.600 0.000335 ***
## NeighborhoodMeadowV -0.1798387 0.0614107 -2.928 0.003487 **
## NeighborhoodMitchel -0.0916415 0.0534282 -1.715 0.086626 .
## NeighborhoodNAmes -0.1013528 0.0515447 -1.966 0.049550 *
## NeighborhoodNoRidge 0.0018169 0.0554183 0.033 0.973853
## NeighborhoodNPkVill -0.0066369 0.0884910 -0.075 0.940229
## NeighborhoodNridgHt 0.0894957 0.0509240 1.757 0.079162 .
## NeighborhoodNWAmes -0.1542766 0.0538771 -2.863 0.004281 **
## NeighborhoodOldTown -0.1306623 0.0576586 -2.266 0.023665 *
## NeighborhoodSawyer -0.1127886 0.0534113 -2.112 0.034970 *
## NeighborhoodSawyerW -0.1648820 0.0522396 -3.156 0.001648 **
## NeighborhoodSomerst -0.0158120 0.0489374 -0.323 0.746685
## NeighborhoodStoneBr 0.1156330 0.0575870 2.008 0.044926 *
## NeighborhoodSWISU -0.0709978 0.0677891 -1.047 0.295209
## NeighborhoodTimber -0.0146329 0.0583969 -0.251 0.802195
## NeighborhoodVeenker -0.0868019 0.0691527 -1.255 0.209704
## Lot.ConfigCulDSac 0.0108483 0.0215361 0.504 0.614570
## Lot.ConfigFR2 -0.0540617 0.0280801 -1.925 0.054491 .
## Lot.ConfigFR3 -0.1371237 0.0680649 -2.015 0.044224 *
## Lot.ConfigInside -0.0092697 0.0128464 -0.722 0.470728
## Overall.Qual 0.0834528 0.0062670 13.316 < 2e-16 ***
## Overall.Cond 0.0710217 0.0048984 14.499 < 2e-16 ***
## Year.Built 0.0052037 0.0003946 13.187 < 2e-16 ***
## TotRms.AbvGrd -0.0155060 0.0054003 -2.871 0.004178 **
## Full.Bath -0.0496426 0.0145387 -3.415 0.000666 ***
## Half.Bath -0.0411220 0.0121895 -3.374 0.000772 ***
## log(Lot.Area) 0.1393604 0.0136129 10.237 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1477 on 961 degrees of freedom
## Multiple R-squared: 0.8813, Adjusted R-squared: 0.8766
## F-statistic: 187.8 on 38 and 961 DF, p-value: < 2.2e-16
The adjusted R-squared is 0.8766, indicating that these predictors together explain a large share of the variation in log price.
Using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
Now we run two different model selection methods:
Backward elimination using BIC: starting from the full set of regressors, we remove one at each step until the model's BIC can no longer be lowered.
Stepwise selection using AIC: we consider both adding and removing regressors at each step; because every variable remains in play throughout, we can be more confident that a strong model is selected.
k <- log(nrow(ames_train))  # BIC uses a penalty of log(n) per parameter
model.bic <- stepAIC(object = i.model,
                     k = k,
                     direction = "backward",
                     trace = F)
summary(model.bic)
##
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Overall.Cond +
## Year.Built + TotRms.AbvGrd + Full.Bath + Half.Bath + log(Lot.Area),
## data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.37898 -0.08691 0.00075 0.09235 0.59197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9807949 0.5854912 -8.507 < 2e-16 ***
## log(area) 0.5906024 0.0358864 16.458 < 2e-16 ***
## Overall.Qual 0.1145357 0.0057951 19.764 < 2e-16 ***
## Overall.Cond 0.0665850 0.0050939 13.072 < 2e-16 ***
## Year.Built 0.0053850 0.0002558 21.050 < 2e-16 ***
## TotRms.AbvGrd -0.0151798 0.0056370 -2.693 0.0072 **
## Full.Bath -0.0669000 0.0142865 -4.683 3.22e-06 ***
## Half.Bath -0.0670495 0.0122563 -5.471 5.68e-08 ***
## log(Lot.Area) 0.1384414 0.0106491 13.000 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1596 on 991 degrees of freedom
## Multiple R-squared: 0.8571, Adjusted R-squared: 0.8559
## F-statistic: 742.9 on 8 and 991 DF, p-value: < 2.2e-16
model.aic <- stepAIC(i.model,
                     k = 2,            # k = 2 is the standard AIC penalty
                     direction = "both",
                     trace = F)
summary(model.aic)
##
## Call:
## lm(formula = log(price) ~ log(area) + Neighborhood + Lot.Config +
## Overall.Qual + Overall.Cond + Year.Built + TotRms.AbvGrd +
## Full.Bath + Half.Bath + log(Lot.Area), data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.44760 -0.06961 0.00200 0.08105 0.50723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2290996 0.8553047 -4.945 9e-07 ***
## log(area) 0.5660212 0.0340768 16.610 < 2e-16 ***
## NeighborhoodBlueste -0.1177475 0.0980573 -1.201 0.230124
## NeighborhoodBrDale -0.1890690 0.0679204 -2.784 0.005480 **
## NeighborhoodBrkSide -0.0714155 0.0577511 -1.237 0.216535
## NeighborhoodClearCr -0.0609548 0.0667644 -0.913 0.361480
## NeighborhoodCollgCr -0.1127382 0.0497513 -2.266 0.023671 *
## NeighborhoodCrawfor 0.0503747 0.0578107 0.871 0.383768
## NeighborhoodEdwards -0.1577755 0.0526820 -2.995 0.002816 **
## NeighborhoodGilbert -0.1424677 0.0526795 -2.704 0.006963 **
## NeighborhoodGreens 0.0531447 0.0891940 0.596 0.551427
## NeighborhoodGrnHill 0.2656380 0.1147967 2.314 0.020879 *
## NeighborhoodIDOTRR -0.2127837 0.0591085 -3.600 0.000335 ***
## NeighborhoodMeadowV -0.1798387 0.0614107 -2.928 0.003487 **
## NeighborhoodMitchel -0.0916415 0.0534282 -1.715 0.086626 .
## NeighborhoodNAmes -0.1013528 0.0515447 -1.966 0.049550 *
## NeighborhoodNoRidge 0.0018169 0.0554183 0.033 0.973853
## NeighborhoodNPkVill -0.0066369 0.0884910 -0.075 0.940229
## NeighborhoodNridgHt 0.0894957 0.0509240 1.757 0.079162 .
## NeighborhoodNWAmes -0.1542766 0.0538771 -2.863 0.004281 **
## NeighborhoodOldTown -0.1306623 0.0576586 -2.266 0.023665 *
## NeighborhoodSawyer -0.1127886 0.0534113 -2.112 0.034970 *
## NeighborhoodSawyerW -0.1648820 0.0522396 -3.156 0.001648 **
## NeighborhoodSomerst -0.0158120 0.0489374 -0.323 0.746685
## NeighborhoodStoneBr 0.1156330 0.0575870 2.008 0.044926 *
## NeighborhoodSWISU -0.0709978 0.0677891 -1.047 0.295209
## NeighborhoodTimber -0.0146329 0.0583969 -0.251 0.802195
## NeighborhoodVeenker -0.0868019 0.0691527 -1.255 0.209704
## Lot.ConfigCulDSac 0.0108483 0.0215361 0.504 0.614570
## Lot.ConfigFR2 -0.0540617 0.0280801 -1.925 0.054491 .
## Lot.ConfigFR3 -0.1371237 0.0680649 -2.015 0.044224 *
## Lot.ConfigInside -0.0092697 0.0128464 -0.722 0.470728
## Overall.Qual 0.0834528 0.0062670 13.316 < 2e-16 ***
## Overall.Cond 0.0710217 0.0048984 14.499 < 2e-16 ***
## Year.Built 0.0052037 0.0003946 13.187 < 2e-16 ***
## TotRms.AbvGrd -0.0155060 0.0054003 -2.871 0.004178 **
## Full.Bath -0.0496426 0.0145387 -3.415 0.000666 ***
## Half.Bath -0.0411220 0.0121895 -3.374 0.000772 ***
## log(Lot.Area) 0.1393604 0.0136129 10.237 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1477 on 961 degrees of freedom
## Multiple R-squared: 0.8813, Adjusted R-squared: 0.8766
## F-statistic: 187.8 on 38 and 961 DF, p-value: < 2.2e-16
The two procedures disagree: BIC's heavier penalty (log n per parameter versus 2 for AIC) drops the Neighborhood and Lot.Config factors entirely, while the AIC procedure retains the full initial model. We will use the AIC stepwise model, because it has a higher adjusted R-squared (0.8766 vs. 0.8559) and a lower residual standard error (0.1477 vs. 0.1596), and so should produce more accurate predictions; the comparison below makes this explicit.
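To put the two selected models side by side, the selection criteria can be tabulated (a small convenience of our own; the adjusted R-squared values match the summaries above):
data.frame(model = c("backward BIC", "stepwise AIC"),
           adj.R2 = c(summary(model.bic)$adj.r.squared,
                      summary(model.aic)$adj.r.squared),
           AIC = AIC(model.bic, model.aic)$AIC,
           BIC = BIC(model.bic, model.aic)$BIC)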
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a grid
plot(model.aic)
Observations from the residual plots:
The residuals are centered at 0 and approximately normally distributed.
In the normal Q-Q plot the points lie close to the reference line, apart from the tails.
We will track the outliers for potential implications; see the sketch below.
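One way to track those outliers (a sketch of our own; rstandard() returns standardized residuals for an lm fit) is to rank the observations by residual magnitude:
resid.check <- data.frame(PID = ames_train$PID,
                          std.resid = rstandard(model.aic))
head(resid.check[order(-abs(resid.check$std.resid)), ], 5)  # five largest outliers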
Next we calculate the root mean squared error (RMSE) of the initial model. You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable); the value you report will be more meaningful if it is in the original units (dollars).
pred.train <- exp(predict(model.aic, ames_train))  # back-transform log(price) to dollars
model.resids <- ames_train$price - pred.train
model.rmse <- sqrt(mean(model.resids^2))
model.rmse
## [1] 29362.77
The model RMSE is $29,362.77, which may be interpreted as roughly the standard deviation of the model's prediction errors, in dollars.
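Because this back-transform-then-RMSE pattern recurs for the test and validation sets below, it can be wrapped in a small helper (rmse_dollars is our own name, not part of any package); for example, rmse_dollars(model.aic, ames_train) reproduces the value above.
rmse_dollars <- function(model, data) {
  pred <- exp(predict(model, data))               # back-transform log(price) to dollars
  sqrt(mean((data$price - pred)^2, na.rm = TRUE)) # RMSE in dollars
}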
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
# ames_test contains a neighborhood ("Landmrk") absent from ames_train, so
# predict() would fail on the unseen factor level; we drop that observation.
ames_test <- ames_test %>%
  filter(Neighborhood != "Landmrk")
initmodel_test <- exp(predict(model.aic, ames_test))
resid.test <- ames_test$price - initmodel_test
rmse.test <- sqrt(mean(resid.test^2, na.rm = TRUE))
rmse.test
## [1] 25278.85
model.rmse > rmse.test
## [1] TRUE
The lower the RMSE, the better the fit.
The test RMSE is $25,278.85.
The test RMSE is lower than the training RMSE, so the model predicts the out-of-sample test data at least as accurately as the training data and shows no sign of overfitting.
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
full.model1 <- lm(price ~ area + Overall.Cond + Year.Built + Year.Remod.Add +
                    Heating + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
                    Kitchen.Qual + TotRms.AbvGrd + Fireplaces + Garage.Yr.Blt +
                    Garage.Area + Pool.Area,
                  data = ames_test)
summary(full.model1)
##
## Call:
## lm(formula = price ~ area + Overall.Cond + Year.Built + Year.Remod.Add +
## Heating + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual +
## TotRms.AbvGrd + Fireplaces + Garage.Yr.Blt + Garage.Area +
## Pool.Area, data = ames_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98742 -17287 86 14816 221661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.623e+05 1.599e+05 -6.019 2.73e-09 ***
## area 9.122e+01 4.550e+00 20.048 < 2e-16 ***
## Overall.Cond 7.087e+03 1.291e+03 5.490 5.47e-08 ***
## Year.Built 8.151e+02 7.558e+01 10.785 < 2e-16 ***
## Year.Remod.Add -2.787e+01 8.150e+01 -0.342 0.7324
## HeatingGasW -4.941e+03 1.125e+04 -0.439 0.6608
## HeatingWall 6.226e+03 1.899e+04 0.328 0.7431
## Central.AirY -8.820e+03 6.536e+03 -1.349 0.1776
## Bedroom.AbvGr -1.291e+04 2.001e+03 -6.453 1.95e-10 ***
## Kitchen.AbvGr -4.625e+04 6.971e+03 -6.635 6.16e-11 ***
## Kitchen.QualFa -7.942e+04 1.081e+04 -7.345 5.30e-13 ***
## Kitchen.QualGd -7.146e+04 5.602e+03 -12.757 < 2e-16 ***
## Kitchen.QualTA -7.875e+04 6.074e+03 -12.966 < 2e-16 ***
## TotRms.AbvGrd 1.342e+02 1.476e+03 0.091 0.9276
## Fireplaces 1.113e+04 1.988e+03 5.600 3.00e-08 ***
## Garage.Yr.Blt -2.383e+02 8.902e+01 -2.676 0.0076 **
## Garage.Area 8.670e+01 8.673e+00 9.997 < 2e-16 ***
## Pool.Area 1.834e+01 3.743e+01 0.490 0.6243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30350 on 761 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.8266, Adjusted R-squared: 0.8227
## F-statistic: 213.3 on 17 and 761 DF, p-value: < 2.2e-16
The residual standard error of 30350 dollars on 761 degrees of freedom is high relative to typical sale prices, so this untransformed model is not a good fit for our prediction needs. (It cannot be compared directly with the 0.1477 from our first model, which is on the log-price scale; there, exp(0.1477) ≈ 1.16 corresponds to a typical multiplicative error of roughly 16%.)
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
After inspecting the residuals of the previous model, we found three houses whose sales were influential outliers (PIDs 531375070, 902207130, 908154205).
We remove these, along with all non-normal sale conditions, to clean the data. We also log-transform price, area, and Garage.Area, since all three are strongly right-skewed.
ames_train_clean <- ames_train %>%
  filter(Sale.Condition == "Normal" &
           !(PID %in% c("531375070", "902207130", "908154205"))) %>%
  dplyr::select(price, area, Overall.Cond, Year.Built, Year.Remod.Add, Heating,
                Central.Air, Bedroom.AbvGr, Kitchen.AbvGr, Kitchen.Qual,
                TotRms.AbvGrd, Fireplaces, Garage.Yr.Blt, Garage.Area, Pool.Area)
# Note: homes without a garage have Garage.Yr.Blt = NA and are dropped by lm(),
# which keeps log(Garage.Area) from evaluating log(0).
full.model2 <- lm(log(price) ~ log(area) + Overall.Cond + Year.Built + Year.Remod.Add +
                    Heating + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
                    Kitchen.Qual + TotRms.AbvGrd + Fireplaces + Garage.Yr.Blt +
                    log(Garage.Area) + Pool.Area,
                  data = ames_train_clean)
summary(full.model2)
##
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Cond + Year.Built +
## Year.Remod.Add + Heating + Central.Air + Bedroom.AbvGr +
## Kitchen.AbvGr + Kitchen.Qual + TotRms.AbvGrd + Fireplaces +
## Garage.Yr.Blt + log(Garage.Area) + Pool.Area, data = ames_train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74726 -0.08643 0.00512 0.08990 0.50311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5675243 0.7640547 -2.052 0.040544 *
## log(area) 0.5488495 0.0330454 16.609 < 2e-16 ***
## Overall.Cond 0.0524961 0.0059707 8.792 < 2e-16 ***
## Year.Built 0.0042442 0.0003880 10.938 < 2e-16 ***
## Year.Remod.Add 0.0006107 0.0003910 1.562 0.118773
## HeatingGasW 0.2119427 0.0608877 3.481 0.000528 ***
## HeatingGrav 0.3390236 0.1446595 2.344 0.019350 *
## HeatingWall 0.1171465 0.1471950 0.796 0.426357
## Central.AirY 0.1703173 0.0312530 5.450 6.78e-08 ***
## Bedroom.AbvGr -0.0439392 0.0093640 -4.692 3.19e-06 ***
## Kitchen.AbvGr -0.0988982 0.0324339 -3.049 0.002372 **
## Kitchen.QualFa -0.2413396 0.0512914 -4.705 3.00e-06 ***
## Kitchen.QualGd -0.2314970 0.0247071 -9.370 < 2e-16 ***
## Kitchen.QualPo 0.0923343 0.1480338 0.624 0.532982
## Kitchen.QualTA -0.2999732 0.0279207 -10.744 < 2e-16 ***
## TotRms.AbvGrd 0.0115864 0.0065618 1.766 0.077833 .
## Fireplaces 0.0774477 0.0094210 8.221 8.44e-16 ***
## Garage.Yr.Blt -0.0004732 0.0004587 -1.032 0.302506
## log(Garage.Area) 0.1429367 0.0185184 7.719 3.61e-14 ***
## Pool.Area -0.0001527 0.0001517 -1.006 0.314718
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1417 on 778 degrees of freedom
## (36 observations deleted due to missingness)
## Multiple R-squared: 0.8558, Adjusted R-squared: 0.8523
## F-statistic: 243 on 19 and 778 DF, p-value: < 2.2e-16
The residual standard error is 0.1417 on 778 degrees of freedom, and the adjusted R-squared is 0.8523.
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
The present model does not include any variable interactions. While exploring candidate interactions, we could not achieve a meaningfully stronger adjusted R-squared without adding complexity out of proportion to the minor improvement. As a result, we continue with our present model; a sketch of the kind of check we ran follows.
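For reference, a comparison of this kind looks like the following (a sketch; the particular interaction term here is illustrative, not one from the original analysis):
inter.model <- update(full.model2, . ~ . + log(area):Overall.Cond)  # add one interaction
anova(full.model2, inter.model)  # partial F-test: does the interaction help?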
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
We arrived at final.model in two stages. First, AIC stepwise regression (the model.aic fit from the earlier section), in which every regressor is reconsidered at each step and none is permanently removed from consideration, narrowed the candidate set. We then used Bayesian model selection with bas.lm (BIC prior, uniform model prior, MCMC sampling) to identify the highest-posterior-probability subset of those candidates.
full.model3.bas <- bas.lm(log(price) ~ log(area) + Overall.Cond + Year.Built +
                            Central.Air + Kitchen.Qual + Fireplaces,
                          prior = "BIC",
                          method = "MCMC",
                          modelprior = uniform(),
                          data = ames_train_clean)
summary(full.model3.bas)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.000000e+00 1.000000e+00
## log(area) 0.99960938 1.0000 1.000000e+00 1.000000e+00
## Overall.Cond 0.99775391 1.0000 1.000000e+00 1.000000e+00
## Year.Built 0.99902344 1.0000 1.000000e+00 1.000000e+00
## Central.AirY 0.99687500 1.0000 1.000000e+00 1.000000e+00
## Kitchen.QualFa 0.99482422 1.0000 1.000000e+00 0.000000e+00
## Kitchen.QualGd 0.99775391 1.0000 1.000000e+00 1.000000e+00
## Kitchen.QualPo 0.05195313 0.0000 1.000000e+00 0.000000e+00
## Kitchen.QualTA 0.99804688 1.0000 1.000000e+00 1.000000e+00
## Fireplaces 1.00000000 1.0000 1.000000e+00 1.000000e+00
## BF NA 1.0000 3.781668e-02 4.298596e-12
## PostProbs NA 0.9402 5.190000e-02 3.400000e-03
## R2 NA 0.8368 8.368000e-01 8.248000e-01
## dim NA 9.0000 1.000000e+01 8.000000e+00
## logmarg NA -1275.9448 -1.279220e+03 -1.302118e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## log(area) 1.000000e+00 1.000000e+00
## Overall.Cond 1.000000e+00 0.000000e+00
## Year.Built 1.000000e+00 1.000000e+00
## Central.AirY 0.000000e+00 0.000000e+00
## Kitchen.QualFa 1.000000e+00 1.000000e+00
## Kitchen.QualGd 1.000000e+00 0.000000e+00
## Kitchen.QualPo 0.000000e+00 0.000000e+00
## Kitchen.QualTA 1.000000e+00 0.000000e+00
## Fireplaces 1.000000e+00 1.000000e+00
## BF 3.992883e-06 1.387492e-61
## PostProbs 1.300000e-03 9.000000e-04
## R2 8.305000e-01 7.641000e-01
## dim 8.000000e+00 5.000000e+00
## logmarg -1.288376e+03 -1.416075e+03
image(full.model3.bas, top.models = 7)
final.model <- lm(log(price) ~ log(area) + Overall.Cond + Year.Built +
                    Central.Air + Kitchen.Qual + Fireplaces,
                  data = ames_train_clean)
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
First we compare the average predicted price in the training and test sets; then we calculate the proportion of observations whose actual price falls within its prediction interval, to compare coverage between the two sets.
model.train <- mean(exp(predict(final.model, ames_train)))  # mean predicted price, train
model.test <- mean(exp(predict(final.model, ames_test)))    # mean predicted price, test
model.train > model.test
## [1] TRUE
# Predict prices train
predict.train <- exp(predict(final.model, ames_train, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage.train <- mean(ames_train$price > predict.train[,"lwr"] &
ames_train$price < predict.train[,"upr"])
coverage.train
## [1] 0.921
predict.full <- exp(predict(final.model, ames_test, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage.test <- mean(ames_test$price > predict.full[,"lwr"] &
ames_test$price < predict.full[,"upr"])
coverage.test
## [1] 0.9497549
About 94.9% of the test-set prices fall within their 95% prediction intervals, compared with 92.1% for the training set, so the intervals are reasonably well calibrated out of sample.
For your final model, create and briefly interpret an informative plot of the residuals.
plot(final.model)
hist(final.model$residuals)
qqnorm(final.model$residuals)
qqline(final.model$residuals)
The residual plots show no obvious or major violation of the model assumptions, aside from some departure from normality in the tails. A few points show high leverage in the residuals-versus-leverage (Cook's distance) plot, but this is expected given the categorical predictors in the model.
For your final model, calculate and briefly comment on the RMSE.
Here we are going to extract the prediction values and residuals to calculate the RMSE.
predict.model.train <- exp(predict(final.model, ames_train))
resid.train.final <- ames_train$price - predict.model.train
rmse.train.final <- sqrt(mean(resid.train.final^2))
predict.model.test <- exp(predict(final.model, ames_test))
resid.test.final <- ames_test$price - predict.model.test
rmse.test.final <- sqrt(mean(resid.test.final^2))
rmse.train.final
## [1] 37910.31
rmse.test.final
## [1] 32361.88
model.rmse
## [1] 29362.77
rmse.test
## [1] 25278.85
Initial model: train RMSE $29,362.77 vs. test RMSE $25,278.85.
Final model: train RMSE $37,910.31 vs. test RMSE $32,361.88.
In both cases the test RMSE is lower than the training RMSE, so neither model shows signs of overfitting. Note, however, that the final model's RMSE is higher than the initial model's; it trades some raw predictive accuracy for a much more parsimonious specification. The table below collects these values.
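For readability, the four RMSE values can be gathered into one table with kable() from the already-loaded knitr package:
rmse.summary <- data.frame(model = c("initial (model.aic)", "final (final.model)"),
                           train.RMSE = c(model.rmse, rmse.train.final),
                           test.RMSE = c(rmse.test, rmse.test.final))
kable(rmse.summary, digits = 0)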
What are some strengths and weaknesses of your model?
A strength of the final model is that it performs well, with a high posterior probability, while remaining parsimonious. A weakness is that we cannot fully guarantee it delivers the best possible predicted house values, even with high confidence and approximately normal residuals.
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
We use the “ames_validation” data set to do some additional assessment of the final model and discuss our findings below.
load("ames_validation.Rdata")
paste(nrow(ames_validation), "observations")
## [1] "763 observations"
paste(nrow(ames_train_clean), "observations")
## [1] "834 observations"
predict.model.validation <- exp(predict(final.model, ames_validation))
resid.validation <- ames_validation$price - predict.model.validation
rmse.validation <- sqrt(mean(resid.validation^2))
rmse.validation
## [1] 29329.01
set.seed(555664)
confint_ames <- exp(confint(predict(full.model3.bas, ames_validation, se.fit = TRUE),
                            parm = "pred"))
# percentage of validation prices inside their 95% prediction intervals
mean(ames_validation$price > confint_ames[, "2.5%"] &
       ames_validation$price < confint_ames[, "97.5%"]) * 100
## [1] 95.80603
About 95.8% of the validation-set prices fall within the 95% prediction intervals derived from final.model, slightly above the nominal coverage, so the model's uncertainty estimates hold up on new data.
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
Our final.model achieved 95.8% prediction-interval coverage when predicting the selling price of homes in Ames, Iowa.
The model generalizes well to out-of-sample data, as the validation results show. We also learned that Ames, Iowa is a diverse market for real estate investing, with homes varying greatly in price, age, square footage, and much more. Given this wide variability, an effective prediction model is that much more useful.
Capstone Project: Peer Assessment II. Cory Torrella, Statistics with R Specialization, Duke University.