As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.
# set defaults: cache chunks to speed compiling subsequent edits.
knitr::opts_chunk$set(cache=TRUE, echo = TRUE)
set.seed(123456)load("ames_train.Rdata")Use the code block below to load any necessary packages
library(statsr)
library(BAS)
library(MASS)
library(glmnet)
library(ggplot2)
library(ggthemes)
library(kableExtra)
library(RColorBrewer)
library(tidyverse)
library(moments)
library(scales)
library(broom)When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
Note for marking: for the sake of readability, code required for transforming the data sets and producing the graphs has not been included inline. Any code to do with modelling etc. has been left inline.
All code used but not displayed can be found in the Appendix.
In the first project, we examined the distribution of the ages of houses which showed a tri-modal distribution with a marked peak in the 5-9 bin. Looking at the skewness, median and mean however indicated that the bulk of the mass lay in the older houses.
I wanted to revisit this and see if looking at the log of age would produce something more insightful and perhaps produce a more linear relationship with the log of sale price.
Taking the exponential of the age values, the graph shows that there was a sustained growth in housing from 90 to around 30 years prior to 2011 - i.e. from the years 1920 to 1980, with a peak in the post war years at around 55 , i.e. 1950’s. There followed quite a period of lull until a second, brief, peak at around 6 years, or 2005.
Perhaps the main peak is a reflection of urbanisation of the mid-West during those years.
The distribution, while still bimodal, is a much better fit to normal, although left skewed, and will be used for model considerations later.
There were 3 reasons for looking at this:
The sub-prime collapse happened in 2008, in many parts of the world, house prices took a dive (in my own town, prices dropped by up to two-thirds). I wanted to see if this was apparent in the Ames sales figures. There is no evidence in the data of any effect from the sub-prime collapse.
To see if there was a marked shift in prices overall for the time span covered by the data set. If so, it would need to be factored into any modelling as sales from different years would need adjusting before being able to be considered equally. Overall, prices from 2006 to 2010 seemed to be mostly static.
To see if there was any evidence of which neighbourhoods were
showing increase, and which decrease over the period to gauge which
might be better for investment. There is some movement, however caution
is needed as some points may represent few observations:
Clear Creek is worst performing ($171K to $129K) & perhaps Greens,
which shows a sharp drop from 2009 to 2010.
Timberland shows sustained strong growth over the entire period ($180K
to $300K) and would most likely represent the best area for
investment.
A little related to the previous graph, this is ideal for showing which neighbourhoods have the greatest variability in pricing. Variability (along with sales price trend) gives one of the best opportunities for investment since there is potential for buying undervalued property and selling at an over-valued price.
The neighbourhoods are sorted most variable (by coefficient of variance) top to bottom.
While standard deviation is useful, the quantity will inevitably be proportional to the mean of the value being measured (i.e. a variation of 5% is more in absolute dollar terms for higher prices than for lower). Coefficient of variance is calculated as \(\sigma / \mu\), a measure of variability in proportion to the mean. It’s useful for comparing the degree of variation from one data series to another, even if the means are drastically different from one another (ref).
This helps to give an idea of how much house prices vary as proportion of their mean.
Laying the violin plot over the boxplot helps to visualise where the weight of data points is in each suburb. Violins with a heavy left skew may be an indicator of good investment opportunities (more properties below the mean).
Combined with CV, it’s possible to identify opportunities in lower priced neighbourhoods where budget might allow for more properties to be included in the portfolio and therefore spread the risk while also increasing turnover.
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.
Based on your EDA, select at most 10 predictor variables
from “ames_train” and create a linear model for price (or a
transformed version of price) using those variables. Provide the R
code and the summary output table for your model, a
brief justification for the variables you have chosen, and a
brief discussion of the model results in context (focused on
the variables that appear to be important predictors and how they relate
to sales price).
The variables chosen follow from a mix of elicitation (prior knowledge that these factors influence house prices) and examining ANOVA outputs during EDA. I was also careful not to choose covariates.
Neighborhood: This was an obvious choice from both
methods, neighbourhood has been shown in EDA plots to show great
influence along with prior knowledge that there will always be
neighbourhoods that attract a greater price than others.Overall.Qual: The quality of a house will always be a
strong influence on sale value. Closely related to
Overall.Cond - as Overall.Qual has more
factors, it’s likely the better choice of the two for accuracy of
prediction.House.Style: Another variable related to aesthetics,
style of house is usually an influencer.area: The size of house will almost always affect sale
price all other factors held equal. EDA showed a stronger linear
relationship between the log of this variable and the log of price.Lot.Area: Similar to the above, size of lot is usually
a strong factor on sale value. Also as with area, EDA showed a stronger
linear relationship between the log of this variable and the log of
price.Year.Built: newer builds will generally attract a
higher sale value. There is the factor that some “classic” era houses
also attract a higher price, however the EDA didn’t show any significant
evidence of this. Again, the log of the house age (calculated as
log(2011 - Year.Built)) is used - the difference between a
2010 and 1990 house is more influential than between a 1910 and 1930 for
example, it makes sense to compress the larger ages.Bedroom.AbvGr: Although this variable didn’t score
highly in ANOVA tests, it’s one of the most fundamental factors in
standard house valuations and included for that reason.Full.Bath: as per #7Kitchen.Qual: as per #7MS.Zoning: This came out second strongest in EDA (after
neighbourhood) and makes sense - the type of urban planning where the
house is located is important with the low and medium density
residential zones attracting better prices.Summary:
df_model1 <- df_train %>%
mutate(log.price = log(price)) %>%
mutate(log.area = log(area)) %>%
mutate(log.lot.area = log(Lot.Area)) %>%
mutate(log.age = log(2011 - Year.Built)) %>%
mutate(remod.age = 2011 - Year.Remod.Add) %>%
select(-price, -PID, -area, -Lot.Area, -Year.Built, -Year.Remod.Add)
df_model1 <- df_model1 %>%
select(
log.price, Neighborhood, Overall.Qual, House.Style, log.area,
log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.Qual, MS.Zoning
)
ames.lm <- lm(log.price ~ ., data = df_model1)
summary(ames.lm) ##
## Call:
## lm(formula = log.price ~ ., data = df_model1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70330 -0.06573 0.00468 0.07419 0.50681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.807627 0.210739 32.304 < 2e-16 ***
## NeighborhoodBlueste 0.137118 0.091705 1.495 0.135260
## NeighborhoodBrDale -0.017130 0.074647 -0.229 0.818556
## NeighborhoodBrkSide 0.010672 0.061492 0.174 0.862260
## NeighborhoodClearCr 0.079103 0.069072 1.145 0.252463
## NeighborhoodCollgCr -0.020123 0.052439 -0.384 0.701271
## NeighborhoodCrawfor 0.079285 0.059867 1.324 0.185766
## NeighborhoodEdwards -0.108498 0.055530 -1.954 0.051070 .
## NeighborhoodGilbert -0.035503 0.055324 -0.642 0.521229
## NeighborhoodGreens 0.124856 0.080858 1.544 0.122957
## NeighborhoodGrnHill 0.378760 0.104160 3.636 0.000295 ***
## NeighborhoodIDOTRR -0.052510 0.066621 -0.788 0.430819
## NeighborhoodMeadowV -0.027349 0.065999 -0.414 0.678704
## NeighborhoodMitchel 0.026220 0.055945 0.469 0.639434
## NeighborhoodNAmes -0.007479 0.054708 -0.137 0.891296
## NeighborhoodNoRidge 0.109831 0.056489 1.944 0.052216 .
## NeighborhoodNPkVill 0.026130 0.080553 0.324 0.745732
## NeighborhoodNridgHt 0.062278 0.054285 1.147 0.251631
## NeighborhoodNWAmes 0.006905 0.056470 0.122 0.902712
## NeighborhoodOldTown -0.086906 0.061561 -1.412 0.158432
## NeighborhoodSawyer -0.016041 0.056454 -0.284 0.776380
## NeighborhoodSawyerW -0.065456 0.054738 -1.196 0.232137
## NeighborhoodSomerst 0.029861 0.063821 0.468 0.639993
## NeighborhoodStoneBr 0.043778 0.061428 0.713 0.476266
## NeighborhoodSWISU -0.109875 0.067666 -1.624 0.104825
## NeighborhoodTimber 0.022354 0.058653 0.381 0.703212
## NeighborhoodVeenker 0.060628 0.067434 0.899 0.368894
## Overall.Qual 0.081793 0.006202 13.188 < 2e-16 ***
## House.Style1.5Unf 0.024544 0.048440 0.507 0.612520
## House.Style1Story 0.069853 0.019172 3.643 0.000287 ***
## House.Style2.5Unf 0.038625 0.046604 0.829 0.407473
## House.Style2Story -0.013145 0.018923 -0.695 0.487492
## House.StyleSFoyer 0.166107 0.030818 5.390 9.32e-08 ***
## House.StyleSLvl 0.068208 0.027338 2.495 0.012800 *
## log.area 0.523887 0.029435 17.798 < 2e-16 ***
## log.lot.area 0.107185 0.013651 7.852 1.34e-14 ***
## Bedroom.AbvGr -0.014137 0.008201 -1.724 0.085140 .
## log.age -0.062538 0.011710 -5.340 1.22e-07 ***
## Full.Bath -0.017230 0.013409 -1.285 0.199168
## Kitchen.QualFa -0.239864 0.041818 -5.736 1.38e-08 ***
## Kitchen.QualGd -0.093617 0.024855 -3.767 0.000178 ***
## Kitchen.QualPo -0.412026 0.135348 -3.044 0.002411 **
## Kitchen.QualTA -0.160389 0.027593 -5.813 8.93e-09 ***
## MS.ZoningFV 0.330414 0.079478 4.157 3.57e-05 ***
## MS.ZoningI (all) -0.224493 0.143638 -1.563 0.118477
## MS.ZoningRH 0.167195 0.083253 2.008 0.044957 *
## MS.ZoningRL 0.326799 0.064845 5.040 5.79e-07 ***
## MS.ZoningRM 0.250377 0.060829 4.116 4.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1245 on 786 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8938
## F-statistic: 150.1 on 47 and 786 DF, p-value: < 2.2e-16
ames_anova <- anova(ames.lm)
ames_anova %>%
rownames_to_column(var = "Variable") %>%
arrange(`Pr(>F)`) %>%
kbl() %>%
kable_styling(bootstrap_options = c("condensed"))| Variable | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Neighborhood | 26 | 73.1056576 | 2.8117561 | 181.479670 | 0.0000000 |
| Overall.Qual | 1 | 20.8456183 | 20.8456183 | 1345.442432 | 0.0000000 |
| log.area | 1 | 10.9779894 | 10.9779894 | 708.554317 | 0.0000000 |
| log.lot.area | 1 | 1.0517136 | 1.0517136 | 67.880939 | 0.0000000 |
| log.age | 1 | 0.7823384 | 0.7823384 | 50.494608 | 0.0000000 |
| Kitchen.Qual | 4 | 0.8495418 | 0.2123854 | 13.708031 | 0.0000000 |
| House.Style | 6 | 0.8444736 | 0.1407456 | 9.084168 | 0.0000000 |
| MS.Zoning | 5 | 0.7149414 | 0.1429883 | 9.228917 | 0.0000000 |
| Bedroom.AbvGr | 1 | 0.0941711 | 0.0941711 | 6.078099 | 0.0138994 |
| Full.Bath | 1 | 0.0428519 | 0.0428519 | 2.765800 | 0.0966972 |
| Residuals | 786 | 12.1778945 | 0.0154935 | NA | NA |
Now either using BAS another stepwise selection
procedure choose the “best” model you can, using your initial model as
your starting point. Try at least two different model selection methods
and compare their results. Do they both arrive at the same model or do
they disagree? What do you think this means?
ZS.null prior and uniform model
prior, then testing against the four predictor types (BMA, HPM, MPM
& BPM).Full.Bath only.k = log(nrow(df_model1)) drops Neighbourhood,
Overall.Qual and Full.Bath.Full.Bath
only.Highest probability model finds the most likely model to have generated the data using a 0-1 loss L0.
BIC is more selective and allows for more false-positive outcomes (where positive means accepts the null hypothesis that the variable is not an influencer).
Lasso regression is a parsimonious model that performs L1 regularization. The L1 regularization adds a penalty equivalent to the absolute magnitude of regression coefficients and tries to minimize them.
#-----------------------------------------
# BAS Modelling
#-----------------------------------------
bas_ames.ZS <- bas.lm(log.price ~ .,
data = df_model1,
prior = "ZS-null",
modelprior = uniform(),
n.models=1000
)
models = data.frame(Estimator=character(),
Post.Prob=numeric(),
Vars=integer(),
Best.Vars=character()
)
#BMA
bas_ames.ZS.BMA = predict(bas_ames.ZS, estimator = "BMA")
models <- models %>% add_row(
Estimator="BMA",
Post.Prob=max(bas_ames.ZS$postprobs[1]),
Vars=length(bas_ames.ZS.BMA$best.vars[-1]),
Best.Vars=toString(bas_ames.ZS.BMA$best.vars[-1])
)
#HPM
bas_ames.ZS.HPM = predict(bas_ames.ZS, estimator = "HPM")
models <- models %>% add_row(
Estimator="HPM",
Post.Prob=max(bas_ames.ZS$postprobs[bas_ames.ZS.HPM$best]),
Vars=length(bas_ames.ZS.HPM$best.vars[-1]),
Best.Vars=toString(bas_ames.ZS.HPM$best.vars[-1])
)
#MPM
bas_ames.ZS.MPM = predict(bas_ames.ZS, estimator = "MPM", se.fit = T)
models <- models %>% add_row(
Estimator="MPM",
Post.Prob=NA,
Vars=length(bas_ames.ZS.MPM$best.vars[-1]),
Best.Vars=toString(bas_ames.ZS.MPM$best.vars[-1])
)
#BPM
bas_ames.ZS.BPM = predict(bas_ames.ZS, estimator = "BPM", se.fit = T)
models <- models %>% add_row(
Estimator="BPM",
Post.Prob=max(bas_ames.ZS$postprobs[bas_ames.ZS.HPM$best]),
Vars=length(bas_ames.ZS.BPM$best.vars[-1]),
Best.Vars=toString(bas_ames.ZS.BPM$best.vars[-1])
)
models %>%
kbl() %>%
kable_styling(bootstrap_options = c("condensed"))| Estimator | Post.Prob | Vars | Best.Vars |
|---|---|---|---|
| BMA | 0.0000000 | 47 | NeighborhoodBlueste, NeighborhoodBrDale, NeighborhoodBrkSide, NeighborhoodClearCr, NeighborhoodCollgCr, NeighborhoodCrawfor, NeighborhoodEdwards, NeighborhoodGilbert, NeighborhoodGreens, NeighborhoodGrnHill, NeighborhoodIDOTRR, NeighborhoodMeadowV, NeighborhoodMitchel, NeighborhoodNAmes, NeighborhoodNoRidge, NeighborhoodNPkVill, NeighborhoodNridgHt, NeighborhoodNWAmes, NeighborhoodOldTown, NeighborhoodSawyer, NeighborhoodSawyerW, NeighborhoodSomerst, NeighborhoodStoneBr, NeighborhoodSWISU, NeighborhoodTimber, NeighborhoodVeenker, Overall.Qual, House.Style1.5Unf, House.Style1Story, House.Style2.5Unf, House.Style2Story, House.StyleSFoyer, House.StyleSLvl, log.area, log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.QualFa, Kitchen.QualGd, Kitchen.QualPo, Kitchen.QualTA, MS.ZoningFV, MS.ZoningI (all), MS.ZoningRH, MS.ZoningRL, MS.ZoningRM |
| HPM | 0.9984024 | 23 | NeighborhoodEdwards, NeighborhoodGrnHill, NeighborhoodNoRidge, NeighborhoodNridgHt, NeighborhoodOldTown, NeighborhoodSawyerW, NeighborhoodSWISU, Overall.Qual, House.Style1Story, House.StyleSFoyer, House.StyleSLvl, log.area, log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.QualFa, Kitchen.QualGd, Kitchen.QualTA, MS.ZoningFV, MS.ZoningRH, MS.ZoningRL, MS.ZoningRM |
| MPM | NA | 23 | NeighborhoodEdwards, NeighborhoodGrnHill, NeighborhoodNoRidge, NeighborhoodNridgHt, NeighborhoodOldTown, NeighborhoodSawyerW, NeighborhoodSWISU, Overall.Qual, House.Style1Story, House.StyleSFoyer, House.StyleSLvl, log.area, log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.QualFa, Kitchen.QualGd, Kitchen.QualTA, MS.ZoningFV, MS.ZoningRH, MS.ZoningRL, MS.ZoningRM |
| BPM | 0.9984024 | 23 | NeighborhoodEdwards, NeighborhoodGrnHill, NeighborhoodNoRidge, NeighborhoodNridgHt, NeighborhoodOldTown, NeighborhoodSawyerW, NeighborhoodSWISU, Overall.Qual, House.Style1Story, House.StyleSFoyer, House.StyleSLvl, log.area, log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.QualFa, Kitchen.QualGd, Kitchen.QualTA, MS.ZoningFV, MS.ZoningRH, MS.ZoningRL, MS.ZoningRM |
#-----------------------------------------
# stepAIC with BIC
#-----------------------------------------
BICk <- log(nrow(df_model1))
m_ames_BIC1 <- lm(log.price ~ ., data = df_model1)
ames.lm.BIC1 <- stepAIC(m_ames_BIC1, direction = "both", k = BICk)## Start: AIC=-3202.13
## log.price ~ Neighborhood + Overall.Qual + House.Style + log.area +
## log.lot.area + Bedroom.AbvGr + log.age + Full.Bath + Kitchen.Qual +
## MS.Zoning
##
## Df Sum of Sq RSS AIC
## - Neighborhood 26 2.0673 14.245 -3246.2
## - Full.Bath 1 0.0256 12.204 -3207.1
## - Bedroom.AbvGr 1 0.0460 12.224 -3205.7
## <none> 12.178 -3202.1
## - MS.Zoning 5 0.7149 12.893 -3188.2
## - House.Style 6 0.8359 13.014 -3187.1
## - log.age 1 0.4419 12.620 -3179.1
## - Kitchen.Qual 4 0.8425 13.020 -3173.3
## - log.lot.area 1 0.9552 13.133 -3145.9
## - Overall.Qual 1 2.6945 14.872 -3042.2
## - log.area 1 4.9079 17.086 -2926.4
##
## Step: AIC=-3246.25
## log.price ~ Overall.Qual + House.Style + log.area + log.lot.area +
## Bedroom.AbvGr + log.age + Full.Bath + Kitchen.Qual + MS.Zoning
##
## Df Sum of Sq RSS AIC
## - Full.Bath 1 0.0827 14.328 -3248.1
## <none> 14.245 -3246.2
## - Bedroom.AbvGr 1 0.1721 14.417 -3243.0
## - Kitchen.Qual 4 0.8844 15.130 -3222.9
## - House.Style 6 1.2256 15.471 -3217.8
## + Neighborhood 26 2.0673 12.178 -3202.1
## - log.age 1 0.9066 15.152 -3201.5
## - MS.Zoning 5 1.4982 15.743 -3196.5
## - log.lot.area 1 1.5892 15.834 -3164.8
## - Overall.Qual 1 4.3581 18.603 -3030.4
## - log.area 1 6.2834 20.529 -2948.2
##
## Step: AIC=-3248.15
## log.price ~ Overall.Qual + House.Style + log.area + log.lot.area +
## Bedroom.AbvGr + log.age + Kitchen.Qual + MS.Zoning
##
## Df Sum of Sq RSS AIC
## <none> 14.328 -3248.1
## + Full.Bath 1 0.0827 14.245 -3246.2
## - Bedroom.AbvGr 1 0.2161 14.544 -3242.4
## - Kitchen.Qual 4 0.9288 15.257 -3222.7
## - House.Style 6 1.2156 15.543 -3220.6
## - log.age 1 0.8283 15.156 -3208.0
## + Neighborhood 26 2.1244 12.204 -3207.1
## - MS.Zoning 5 1.5125 15.840 -3198.1
## - log.lot.area 1 1.6300 15.958 -3165.0
## - Overall.Qual 1 4.3931 18.721 -3031.8
## - log.area 1 6.6698 20.998 -2936.1
summary(ames.lm.BIC1)##
## Call:
## lm(formula = log.price ~ Overall.Qual + House.Style + log.area +
## log.lot.area + Bedroom.AbvGr + log.age + Kitchen.Qual + MS.Zoning,
## data = df_model1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68882 -0.07352 0.00762 0.08411 0.46490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.533950 0.190390 34.319 < 2e-16 ***
## Overall.Qual 0.096188 0.006092 15.788 < 2e-16 ***
## House.Style1.5Unf 0.062797 0.050751 1.237 0.216308
## House.Style1Story 0.092483 0.018897 4.894 1.19e-06 ***
## House.Style2.5Unf 0.030900 0.048416 0.638 0.523514
## House.Style2Story 0.004048 0.018538 0.218 0.827197
## House.StyleSFoyer 0.195163 0.030453 6.409 2.48e-10 ***
## House.StyleSLvl 0.093310 0.027779 3.359 0.000819 ***
## log.area 0.538290 0.027670 19.454 < 2e-16 ***
## log.lot.area 0.109542 0.011390 9.617 < 2e-16 ***
## Bedroom.AbvGr -0.027707 0.007913 -3.502 0.000488 ***
## log.age -0.055194 0.008051 -6.856 1.40e-11 ***
## Kitchen.QualFa -0.252471 0.042531 -5.936 4.32e-09 ***
## Kitchen.QualGd -0.093961 0.023932 -3.926 9.36e-05 ***
## Kitchen.QualPo -0.291521 0.136670 -2.133 0.033222 *
## Kitchen.QualTA -0.162922 0.027213 -5.987 3.21e-09 ***
## MS.ZoningFV 0.370787 0.065836 5.632 2.45e-08 ***
## MS.ZoningI (all) -0.213860 0.151308 -1.413 0.157918
## MS.ZoningRH 0.160607 0.081277 1.976 0.048486 *
## MS.ZoningRL 0.352947 0.061121 5.775 1.10e-08 ***
## MS.ZoningRM 0.261788 0.061476 4.258 2.30e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1328 on 813 degrees of freedom
## Multiple R-squared: 0.8821, Adjusted R-squared: 0.8792
## F-statistic: 304 on 20 and 813 DF, p-value: < 2.2e-16
#-----------------------------------------
# Lasso
#-----------------------------------------
x_vars <- model.matrix(log.price~. , df_model1)[,-1]
lambda_seq <- 10^seq(2, -2, by = -.1)
# Splitting the data into test and train
ames.lm.cv <- cv.glmnet(x_vars, df_model1$log.price,
alpha = 1, lambda = lambda_seq,
nfolds = 5)
# identifying best lamda
best_lam <- ames.lm.cv$lambda.min
best_lam## [1] 0.01
# Rebuilding the model with best lamda value identified
ames.lasso <- glmnet(x_vars, df_model1$log.price, alpha = 1, lambda = best_lam)
coef(ames.lasso)## 51 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 7.6096420391
## NeighborhoodBlueste .
## NeighborhoodBrDale .
## NeighborhoodBrkSide .
## NeighborhoodClearCr .
## NeighborhoodCollgCr .
## NeighborhoodCrawfor 0.0227706109
## NeighborhoodEdwards -0.0577177872
## NeighborhoodGilbert -0.0034302508
## NeighborhoodGreens .
## NeighborhoodGrnHill 0.1878894001
## NeighborhoodIDOTRR -0.0369861529
## NeighborhoodLandmrk .
## NeighborhoodMeadowV .
## NeighborhoodMitchel 0.0032072906
## NeighborhoodNAmes .
## NeighborhoodNoRidge 0.0438913349
## NeighborhoodNPkVill .
## NeighborhoodNridgHt 0.0482616508
## NeighborhoodNWAmes .
## NeighborhoodOldTown -0.0528269100
## NeighborhoodSawyer .
## NeighborhoodSawyerW -0.0375682457
## NeighborhoodSomerst .
## NeighborhoodStoneBr .
## NeighborhoodSWISU -0.0480277202
## NeighborhoodTimber .
## NeighborhoodVeenker 0.0095768127
## Overall.Qual 0.1075474966
## House.Style1.5Unf .
## House.Style1Story 0.0290266666
## House.Style2.5Fin .
## House.Style2.5Unf .
## House.Style2Story -0.0091115590
## House.StyleSFoyer 0.0747748141
## House.StyleSLvl .
## log.area 0.4133081058
## log.lot.area 0.1072565525
## Bedroom.AbvGr -0.0001698858
## log.age -0.0614781367
## Full.Bath .
## Kitchen.QualFa -0.0811175756
## Kitchen.QualGd .
## Kitchen.QualPo .
## Kitchen.QualTA -0.0484383028
## MS.ZoningC (all) -0.1663910778
## MS.ZoningFV .
## MS.ZoningI (all) -0.2206088575
## MS.ZoningRH -0.0122199543
## MS.ZoningRL 0.0218351282
## MS.ZoningRM -0.0474788819
ames.lasso.pred <- predict(ames.lasso, s = best_lam, newx = x_vars)
head(cbind(df_model1$log.price, ames.lasso.pred), 10)## s1
## 1 11.74404 11.69928
## 2 11.84582 11.70808
## 3 11.73527 11.49212
## 4 11.64395 11.49814
## 5 12.33271 12.37559
## 6 12.19854 12.35205
## 7 11.44035 11.43092
## 8 11.83138 11.76718
## 9 11.84940 11.87828
## 10 12.29911 12.38376
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
Below are plots for residuals for all three models for comparison. The HPM model will be used moving forwards from here as it is the best performing.
The second plot adjusts the axes to remove the upper and lower 1% of values from display to allow a better visual representation and zoom in on where the bulk of the data lies.
All three models perform reasonably well within the H-spread (± 1 standard deviation represented by the area between the dotted lines). Below that, the models over-predict and, conversely, for upper prices the models all under-predict. If this problem persists after the full model, a correction factor scaled as a square of the distance from the mean can be useful. This might be tricky given the double flex characteristic of a quadratic formula.
Earlier testing with all sales conditions included produced some extreme residuals of $300K+. After removing non-normal sales, the model accuracy improved greatly.
You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
The RMSE for the HPM model is $22,506.54, the best of the 3 tried models.
RMSE <- tibble(
Observed = df_train$price,
Predicted.HPM = as.numeric(bas_ames.ZS.HPM$Ypred[1,]),
Predicted.BIC = as.numeric(ames.lm.BIC1$fitted.values),
Predicted.Lasso = as.numeric(ames.lasso.pred)
) %>%
mutate(Predicted.HPM = exp(Predicted.HPM)) %>%
mutate(Predicted.BIC = exp(Predicted.BIC)) %>%
mutate(Predicted.Lasso = exp(Predicted.Lasso)) %>%
mutate(HPM = Observed - Predicted.HPM) %>%
mutate(BIC = Observed - Predicted.BIC) %>%
mutate(Lasso = Observed - Predicted.Lasso)
tibble(
`RMSE HPM____` = dollar_format()(sqrt(mean(RMSE$HPM^2))),
`RMSE BIC____` = dollar_format()(sqrt(mean(RMSE$BIC^2))),
`RMSE Lasso__` = dollar_format()(sqrt(mean(RMSE$Lasso^2)))
)| RMSE HPM____ | RMSE BIC____ | RMSE Lasso__ |
|---|---|---|
| $22,506.54 | $23,628.92 | $25,754.17 |
The process of building a model generally involves starting with an
initial model (as you have done above), identifying its shortcomings,
and adapting the model accordingly. This process may be repeated several
times until the model fits the data reasonably well. However, the model
may do well on training data but perform poorly out-of-sample (meaning,
on a dataset other than the original training data) because the model is
overly-tuned to specifically fit the training data. This is called
“overfitting.” To determine whether overfitting is occurring on a model,
compare the performance of a model on both in-sample and out-of-sample
data sets. To look at performance of your initial model on out-of-sample
data, you will use the data set ames_test.
load("ames_test.Rdata")Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
BAS linear model was created using the model
proposed by HPM (dropping only Full.Bath).pred.bas object was created with fitted values
using this model with the train and test data and the HPM
estimator.Comparing data between train and test prediction, predictions for the training data are better than test. This was not the case before non-normal data was included in the model. The test data contains only normal sales conditions which is likely a good reason for the better fit prior to making this filter and re-modelling:
Despite the slightly poorer residuals with the test data set, this is to be expected and gives no cause for concern with regards to applicability to other data sets.
HPM.intial <- as.vector(which.matrix(bas_ames.ZS$which[bas_ames.ZS.HPM$best],
bas_ames.ZS$n.vars))
model_initial <- bas.lm(log.price ~ ., data = df_model1,
prior = "ZS-null",
modelprior = uniform(),
bestmodel = HPM.intial, n.models = 1)
predict.train.initial <- predict(model_initial, newdata = df_model1, estimator = "HPM")
predict.test.initial <- predict(model_initial, newdata = df_model2, estimator = "HPM")| Data | RMSE | mean | median | sd | max | q25 | q75 | skew |
|---|---|---|---|---|---|---|---|---|
| Test | $24,043.95 | $17,416.33 | $13,128.66 | $16,586.77 | $197,048 | $6,419.36 | $23,365.77 | 1.13 |
| Train | $22,506.54 | $15,926.17 | $12,208.95 | $15,912.42 | $172,183 | $5,504.76 | $21,208.68 | 1.37 |
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
The following code selects the best variables and produces summary
table (using coef()) for the final table.
# Drop 2 outliers found in initial model testing, remove any non-normal sales
df_train <- df_train %>%
mutate(log.age = log(2011 - Year.Built)) %>%
mutate(log.remod.age = log(2011 - Year.Remod.Add)) %>%
filter(!(PID %in% c(908154205, 533350090)))
df_test <- df_test %>%
mutate(log.age = log(2011 - Year.Built)) %>%
mutate(log.remod.age = log(2011 - Year.Remod.Add))
# Drop colinear and other variable scoring low ANOVA during EDA tests
df_model2 <- df_train %>%
mutate(log.price = log(price)) %>%
select(-price, -PID, -Year.Built, -Year.Remod.Add, -Garage.Yr.Blt, -outlier,
-Overall.Cond, -Bsmt.Cond, -Bsmt.Exposure, -BsmtFin.Type.1, -BsmtFin.SF.1,
-BsmtFin.Type.2, BsmtFin.SF.2, -Bsmt.Unf.SF, -Total.Bsmt.SF,
-Exter.Qual, -Pool.QC, -MS.SubClass, -X1st.Flr.SF, -X2nd.Flr.SF,
-Alley, -Lot.Frontage, -Land.Slope, -Condition.1, Condition.2, -Roof.Matl,
-Exterior.1st, -Exterior.2nd, -Heating, -Central.Air, -Electrical, -Functional,
-Garage.Cond, -Garage.Qual, -Misc.Feature)
# re-run ANOVA and select best 20
ames.lm <- lm(log.price ~ ., data = df_model2)
ames_anova <- anova(ames.lm)
ames_anova_tbl <- ames_anova %>%
rownames_to_column(var = "Variable") %>%
slice_min(`Pr(>F)`, n=20) %>%
arrange(`Pr(>F)`)
# create subset with only those vars
df_model2 <- df_model2 %>% select(log.price, all_of(ames_anova_tbl$Variable))
# create initial BAS model
model_final <- bas.lm(log.price ~ .,
data = df_model2,
prior = "ZS-null",
modelprior = uniform(),
n.models=1000
)
# Repeat HPM selection method until no more vars are eliminated
repeat {
prior_n <- length(attributes(model_final$x)$names)
#use HPM to find best vars
model_final.HPM <- predict(model_final, estimator = "HPM")
# create vector of best vars and re-run BAS model selecting only those
model_final$postprobs[model_final.HPM$best]
HPM <- as.vector(which.matrix(model_final$which[model_final.HPM$best],
model_final$n.vars))
model_final <- bas.lm(log.price ~ ., data = df_model2,
prior = "ZS-null",
modelprior = uniform(),
bestmodel = HPM, n.models = 1000)
if(prior_n == length(attributes(model_final$x)$names)){break}
}
model_final <- bas.lm(log.price ~ ., data = df_model2,
prior = "ZS-null",
modelprior = uniform(),
bestmodel = HPM, n.models = 1)
predict.train.final <- predict(model_final, estimator = "HPM", se.fit = T)
# final model R2
model_final$R2## [1] 0.9288975
coef(model_final, estimator="HPM")##
## Marginal Posterior Summaries of Coefficients:
##
## Using HPM
##
## Based on the top 1 models
## post mean post SD post p(B != 0)
## Intercept 1.200e+01 3.621e-03 1.000e+00
## area 3.330e-04 1.298e-05 1.000e+00
## NeighborhoodBlueste 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodBrDale -1.593e-02 4.677e-02 1.000e+00
## NeighborhoodBrkSide 2.396e-02 2.044e-02 1.000e+00
## NeighborhoodClearCr 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodCollgCr 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodCrawfor 1.039e-01 2.284e-02 1.000e+00
## NeighborhoodEdwards -6.904e-02 1.657e-02 1.000e+00
## NeighborhoodGilbert 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodGreens 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodGrnHill 6.642e-01 7.887e-02 1.000e+00
## NeighborhoodIDOTRR 5.851e-02 2.710e-02 1.000e+00
## NeighborhoodMeadowV -5.940e-02 3.353e-02 1.000e+00
## NeighborhoodMitchel 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodNAmes 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodNoRidge 6.461e-02 2.261e-02 1.000e+00
## NeighborhoodNPkVill 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodNridgHt 6.711e-02 2.174e-02 1.000e+00
## NeighborhoodNWAmes 2.272e-02 1.946e-02 1.000e+00
## NeighborhoodOldTown 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodSawyer 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodSawyerW 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodSomerst 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodStoneBr 5.315e-02 3.331e-02 1.000e+00
## NeighborhoodSWISU -8.559e-02 3.488e-02 1.000e+00
## NeighborhoodTimber 0.000e+00 0.000e+00 0.000e+00
## NeighborhoodVeenker 1.303e-01 3.658e-02 1.000e+00
## MS.ZoningFV 4.483e-01 5.312e-02 1.000e+00
## MS.ZoningI (all) 0.000e+00 0.000e+00 0.000e+00
## MS.ZoningRH 2.730e-01 6.549e-02 1.000e+00
## MS.ZoningRL 4.123e-01 4.925e-02 1.000e+00
## MS.ZoningRM 3.106e-01 4.755e-02 1.000e+00
## Overall.Qual 7.453e-02 4.912e-03 1.000e+00
## House.Style1.5Unf 0.000e+00 0.000e+00 0.000e+00
## House.Style1Story 4.330e-02 1.221e-02 1.000e+00
## House.Style2.5Unf 0.000e+00 0.000e+00 0.000e+00
## House.Style2Story -4.391e-02 1.319e-02 1.000e+00
## House.StyleSFoyer 7.478e-02 2.345e-02 1.000e+00
## House.StyleSLvl 0.000e+00 0.000e+00 0.000e+00
## Lot.ShapeIR2 1.011e-02 2.313e-02 1.000e+00
## Lot.ShapeIR3 0.000e+00 0.000e+00 0.000e+00
## Lot.ShapeReg 0.000e+00 0.000e+00 0.000e+00
## Bldg.Type2fmCon 0.000e+00 0.000e+00 0.000e+00
## Bldg.TypeDuplex -1.131e-01 2.279e-02 1.000e+00
## Bldg.TypeTwnhs -8.510e-02 2.392e-02 1.000e+00
## Bldg.TypeTwnhsE -3.776e-02 1.734e-02 1.000e+00
## Bsmt.QualFa -1.037e-01 3.298e-02 1.000e+00
## Bsmt.QualGd -5.544e-02 1.807e-02 1.000e+00
## Bsmt.QualNONE -2.872e-01 4.343e-02 1.000e+00
## Bsmt.QualPo 0.000e+00 0.000e+00 0.000e+00
## Bsmt.QualTA -5.467e-02 2.189e-02 1.000e+00
## Bsmt.Full.Bath 7.684e-02 7.931e-03 1.000e+00
## Exter.CondFa -1.544e-01 2.821e-02 1.000e+00
## Exter.CondGd 7.695e-04 1.195e-02 1.000e+00
## Exter.CondTA 0.000e+00 0.000e+00 0.000e+00
## Land.ContourHLS 0.000e+00 0.000e+00 0.000e+00
## Land.ContourLow 0.000e+00 0.000e+00 0.000e+00
## Land.ContourLvl 0.000e+00 0.000e+00 0.000e+00
## FoundationCBlock 0.000e+00 0.000e+00 0.000e+00
## FoundationPConc 0.000e+00 0.000e+00 0.000e+00
## FoundationSlab 1.433e-01 4.896e-02 1.000e+00
## FoundationStone 0.000e+00 0.000e+00 0.000e+00
## Lot.Area 2.446e-06 3.968e-07 1.000e+00
## Heating.QCFa -1.058e-01 2.687e-02 1.000e+00
## Heating.QCGd 0.000e+00 0.000e+00 0.000e+00
## Heating.QCPo -2.170e-01 1.082e-01 1.000e+00
## Heating.QCTA -2.461e-02 9.483e-03 1.000e+00
## Lot.ConfigCulDSac 0.000e+00 0.000e+00 0.000e+00
## Lot.ConfigFR2 0.000e+00 0.000e+00 0.000e+00
## Lot.ConfigFR3 0.000e+00 0.000e+00 0.000e+00
## Lot.ConfigInside 0.000e+00 0.000e+00 0.000e+00
## Kitchen.QualFa -5.771e-02 2.598e-02 1.000e+00
## Kitchen.QualGd 0.000e+00 0.000e+00 0.000e+00
## Kitchen.QualPo 0.000e+00 0.000e+00 0.000e+00
## Kitchen.QualTA 0.000e+00 0.000e+00 0.000e+00
## Garage.Cars 4.834e-02 7.300e-03 1.000e+00
## log.age -3.799e-02 9.248e-03 1.000e+00
## log.remod.age -3.100e-02 6.145e-03 1.000e+00
## Paved.DriveP 0.000e+00 0.000e+00 0.000e+00
## Paved.DriveY 8.135e-02 1.488e-02 1.000e+00
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
Code for the Transformations can be found in the Appendix
Factors - the following factor variables incorrectly
use the NA value to describe ‘Not Applicable’ or
‘None’:
Alley, BsmtFin.Type.1, BsmtFin.Type.2, Bsmt.Qual, Bsmt.Cond, Bsmt.Exposure, Fireplace.Qu, Garage.Type, Garage.Finish,
Mas.Vnr.Type, Garage.Qual, Garage.Cond, Pool.QC, Fence, Misc.FeatureEach of these has been re-factored to convert NA’s to
“None” to include the rows in regression.
Utilities has only one level and is dropped from the dataset.
Year sold, month sold and PID are dropped as they have no value in predicting prices.
1 in 7 observations are missing Lot.Frontage. This
variable has a strong linear relationship with Lot.Area.
For the sake of model exploration, the NA values are replaced with a
value calculated by lot area. It’s not used in the final model due to
it’s uncertainty and collinearity with Lot.Area.
Mas.Vnr.Area relates to Masonry veneer. Those that had
NA and veneer type None are set to 0.
In the training data, a BAS::Bayes.outlier found 3
outliers for sales price at over 97.5% probability which were removed.
No outliers were found in the test data.
In the test model, log of Lot.Area and area
were used. In the final, they proved to be more significant without
their log forms.
Year.Built and Year.Remod.Add were both
converted to years before 2011 (i.e. 2011 - x). The logs of
both of these values were used in the final model as influence of age
difference decreases with increasing age.
The log of price is used throughout although for
converted back for residual analysis.
A few observations in the test data with factors that have no
observations in the training data needed to be removed for the final
fit. The limitations of BAS linear models with factors is
that unused factors are dropped from the model, so it has no idea how to
deal with them and throws an error.
Sales not classified as “Normal” were removed from the training data before building the final model. We are predicting prices under normal sales conditions. Many of the non-normal sales types had highly irregular sales prices. Removing these improved the accuracy of the model significantly. 163 rows were removed.
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
No variable interaction used.
Variables were selected that were low in correlation. A first visual scan removed most of these, however some aliases were found still. Subsetting the data to just the top 20 on the ANOVA table got rid of the aliases and produces \(GVIF^{(1/(2*Df))}\) of 2 or less. Due to this, no variable interaction necessary.
# deselecting obvious variables related to others, generally those found higher on the ANOVA were kept,
# the co-variables of those dropped
df_corr <- df_train %>%
mutate(log.price = log(price)) %>%
select(-price, -PID, -Year.Built, -Year.Remod.Add, -Garage.Yr.Blt, -outlier,
-Overall.Cond, -Bsmt.Cond, -Bsmt.Exposure, -BsmtFin.Type.1, -BsmtFin.SF.1,
-BsmtFin.Type.2, BsmtFin.SF.2, -Bsmt.Unf.SF, -Total.Bsmt.SF,
-Exter.Qual, -Pool.QC, -MS.SubClass, -X1st.Flr.SF, -X2nd.Flr.SF)
# create lm, test for aliases - no VIF analysis possible while aliases exist
ames.lm <- lm(log.price ~ ., data = df_corr)
paste("Aliased variables:", attributes(alias(ames.lm)$Complete)$dimnames[[1]])## [1] "Aliased variables: Garage.QualNONE" "Aliased variables: Garage.CondNONE"
## [3] "Aliased variables: Garage.CondTA"
# take list of top 20 of remaining variables on ANOVA
ames_anova <- anova(ames.lm)
ames_anova_tbl <- ames_anova %>%
rownames_to_column(var = "Variable") %>%
slice_min(`Pr(>F)`, n=20) %>%
arrange(`Pr(>F)`)
# Subset top 20 and check for aliases and GVIF analysis
df_corr <- df_corr %>% select(log.price, all_of(ames_anova_tbl$Variable))
ames.lm <- glm(log.price ~ ., data = df_corr)
paste("Aliased variables:", attributes(alias(ames.lm)$Complete)$dimnames[[1]])## [1] "Aliased variables: "
as.data.frame(car::vif(ames.lm)) %>% arrange(desc(`GVIF^(1/(2*Df))`))| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| Overall.Qual | 4.015410 | 1 | 2.003849 |
| area | 3.507081 | 1 | 1.872720 |
| MS.Zoning | 59.933151 | 5 | 1.505798 |
| Central.Air | 2.067113 | 1 | 1.437746 |
| Lot.Frontage | 2.043984 | 1 | 1.429680 |
| Foundation | 16.459201 | 4 | 1.419225 |
| Bsmt.Qual | 29.174513 | 5 | 1.401201 |
| Bldg.Type | 8.830748 | 4 | 1.312955 |
| Alley | 2.725569 | 2 | 1.284885 |
| Kitchen.Qual | 6.104375 | 4 | 1.253733 |
| House.Style | 12.570433 | 6 | 1.234845 |
| Neighborhood | 19921.407882 | 26 | 1.209704 |
| Heating.QC | 4.551012 | 4 | 1.208547 |
| Bsmt.Full.Bath | 1.422089 | 1 | 1.192514 |
| Lot.Shape | 2.634474 | 3 | 1.175210 |
| Land.Contour | 2.600833 | 3 | 1.172696 |
| Heating | 2.992845 | 4 | 1.146860 |
| Lot.Config | 2.181920 | 4 | 1.102440 |
| Exter.Cond | 1.718586 | 3 | 1.094448 |
| Condition.1 | 3.861214 | 8 | 1.088104 |
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
Starting with the full set of variables, I took a 3 phase backwards elimination approach:
PID, Pool.Qual,
Year/Month Sold etc.).Final model selected the following variables:
| area | Neighborhood | MS.Zoning | Overall.Qual | House.Style | Lot.Shape | Bldg.Type | Bsmt.Qual | Bsmt.Full.Bath | Exter.Cond |
| Foundation | Lot.Area | Heating | Lot.Config | Kitchen.Qual | Garage.Cars | log.age | log.remod.age | Paved.Drive | |
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
RMSE for the training data is lower than for the testing data, however not noticeably so - out-of-sample data will normally be larger than for the data it was modelled on, but only if the difference is large is it an indicator of a poorly fitting model. No further adjustments resulting from out-of-sample testing were required.
Prior to removing non-normal sales from the training data, this was the reverse and the RMSE was actually higher for the training data as the test (and validation) set includes only normal sales - the training data was producing large residuals from the non-normal observations.
After initial results, I went back to the creation of the final model and implemented a loop to keep applying BPM selection until no further variables were eliminated. This improved RMSE for both data sets.
For your final model, create and briefly interpret an informative plot of the residuals.
Four plots relating to residuals are shown below:
For your final model, calculate and briefly comment on the RMSE.
RMSE for Train is approximately 10% lower than for Test ($18,515 vs $20,282). This suggests that overfitting is not a problem for the final model.
The initial model has an RMSE of $22,506 and $24,044 for the Test & Train data sets, the final model is a considerable improvement.
residuals.final %>%
group_by(Data) %>%
summarise(RMSE = dollar_format()(sqrt(mean((Residual*1e5)^2)))) %>%
kbl() %>% kable_styling(bootstrap_options = "condensed", full_width = F, position = "left")| Data | RMSE |
|---|---|
| Test | $20,281.73 |
| Train | $18,515.00 |
What are some strengths and weaknesses of your model?
Strengths:
Weaknesses:
bas.lm() function drops factor levels for those
levels where there is no data. When predicting with new data sets that
contain data in those dropped levels, the method fails with an error.
This is quite a limitation as those observations need to be filtered out
and cannot be predicted. They would need to be added to the training
data and the model recalibrated in order to include in future
predictions. This causes an issue where prior predictions are no longer
equitable with new predictions.Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the “ames_validation” dataset to do some additional
assessment of your final model. Discuss your findings, be sure to
mention: * What is the RMSE of your final model when applied to the
validation data?
* How does this value compare to that of the training data and/or
testing data? * What percentage of the 95% predictive confidence (or
credible) intervals contain the true price of the house in the
validation data set?
* From this result, does your final model properly reflect
uncertainty?
load("ames_validation.Rdata")RMSE for the Validation data is $19,258, compared with $18,235 & $21,120 for Train and Test. Validation performs better than Test in this respect.
93.48% of predictions for the validation data set lie in the 95% confidence level. With this result, the model would be rejected and go back for more work. The figures for Train & Test are 95.80% & 92.72% respectively.
predict.validation.final <- predict(
model_final,
newdata = df_final_validation,
estimator = "HPM", se.fit = T)residuals.final.validation %>%
group_by(Data) %>%
summarise(RMSE = dollar_format()(sqrt(mean((Residual*1e5)^2)))) %>%
kbl() %>% kable_styling(bootstrap_options = "condensed", full_width = F, position = "left")| Data | RMSE |
|---|---|
| Test | $20,281.73 |
| Train | $18,515.00 |
| Validation | $19,258.49 |
train.confint <- exp(confint(predict.train.final, parm = "pred"))/1e5
test.confint <- exp(confint(predict.test.final, parm = "pred"))/1e5
validation.confint <- exp(confint(predict.validation.final, parm = "pred"))/1e5
all_confints <- as.data.frame(Data = "Train", train.confint[,]) %>%
add_row(
as.data.frame(Data = "Test", test.confint[,])
) %>%
add_row(
as.data.frame(Data = "Validation", validation.confint[,])
)
cbind(residuals.final.validation, all_confints) %>%
mutate(in_ci = ifelse(Observed >= `2.5%` & Observed <= `97.5%`,TRUE,FALSE)) %>%
group_by(Data) %>%
summarize(`% in CI` = round(mean(in_ci)*100,2)) %>%
kbl() %>% kable_styling(bootstrap_options = "condensed", full_width = F, position = "left")| Data | % in CI |
|---|---|
| Test | 92.72 |
| Train | 95.80 |
| Validation | 93.48 |
Overall, using Bayesian Inference methods with the Highest Probability Model, the model provides a good prediction of models for most of the data, however the tails (particularly the lower tail) perform poorly when considering the residual as a proportion of the observed data.
While the Training data fits 95.8% of data within the 95% confidence interval, out-of-sample fitted data didn’t quite meet the 95% threshold indicating the model needs a bit more work to see if this could be met.
The model suffers from a one-size-fits-all approach. We’re attempting to force a linear model across the entire price range where relationships are not necessarily uniform across the price spectrum, and variables have changing significance across that range. This could be improved by splitting the datasets at (say) $350,000 and creating two models, possibly a third for houses valued less than $100,000.
Lessons learnt in this exercise:
NA to indicate
‘None’ or ‘Not Applicable’, which are very different things. Converting
NA‘s in these categories enabled a much bigger data set to
be included in training. Similarly, converting NA’s to zero
in the continuous columns that relate to these ’Not Applicable’ cases
kept a much larger training set.The following code blocks were used but not displayed inline.
df_log.age <- data.frame(age = log(2011- ames_train$Year.Built))
log.age.mu = mean(df_log.age$age)
log.age.sd = sd(df_log.age$age)
plot_age <- df_log.age %>%
ggplot(aes(x=age)) +
geom_histogram(
alpha = 0.8,
bins = 30,
aes(y = stat(density, ..scaled..)),
position="identity",
fill="orange",
col="darkgrey") +
stat_function(
fun = dnorm,
aes(colour = "Normal"),
size = 1.2,
alpha = 0.6,
args = list(
mean = log.age.mu,
sd = log.age.sd
)
) +
geom_line(
aes(y = ..density.., colour = "Observed"),
stat = 'density',
size = 1.5,
alpha = 0.8) +
theme_minimal() +
scale_color_manual(name="Distribution", values=c(Normal="blue", Observed="red")) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom") +
labs(title = "Distribution of Log of House Age at 2011", y = "Density", x="Log of House Age")
nei_colours <- colorRampPalette(brewer.pal(4, "Set1"))(27)
plot_price_year <- ames_train %>%
group_by(Neighborhood, Yr.Sold) %>%
summarise(Med = median(log(price))) %>%
ggplot(aes(x=Yr.Sold, y=Med, colour=Neighborhood)) +
geom_line(alpha = 0.8, size=1) +
theme_minimal() +
scale_color_manual(values = nei_colours) +
theme(plot.title = element_text(hjust = 0.5)) +
labs(
title = "Change in House Sales Prices by Year and Neighbourhood",
y = "Log of Sales Price",
x="Year"
)
# Calculate CV for each neighbourhood, create factor in descending order of CV
neighbourhood_price_variation <- ames_train %>%
group_by(Neighborhood) %>%
summarise(
Mean = mean(price)/1e5,
SD = sd(price)/1e5
) %>%
mutate(CV = SD/Mean) %>%
arrange(desc(CV)) %>%
mutate(Neighborhood = factor(Neighborhood, unique(Neighborhood))) %>%
arrange(desc(Neighborhood))
nei_plot <- ames_train %>%
select(Neighborhood, price) %>%
mutate(Neighborhood = factor(Neighborhood, neighbourhood_price_variation$Neighborhood)) %>%
ggplot(aes(as.factor(Neighborhood), price/1e5), fill=Neighborhood) +
theme_minimal() +
geom_boxplot(aes(fill=Neighborhood)) +
geom_violin(aes(colour=Neighborhood), width=1.5, alpha=0.3) +
coord_flip() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position="none",
plot.caption = element_text(face="italic")
) +
scale_color_manual(values = nei_colours) +
labs(
title = "Distribution of House Prices by Neighbourhood",
caption = "Neighbourhoods listed in descending order of coefficient of variability",
x = NULL,
y="Price ($'00,000)"
)# deal with factor NAs (either NA or empty string) - replace with 'NONE'
replace_factor_na <- function(x){
x <- as.character(x)
x <- if_else((is.na(x) | (x=="")), 'NONE', x)
x <- as.factor(x)
}
df_train <- ames_train %>%
mutate_at(
vars(Alley, BsmtFin.Type.1, BsmtFin.Type.2, Bsmt.Qual, Bsmt.Cond, Bsmt.Exposure,
Fireplace.Qu, Garage.Type, Garage.Finish, Mas.Vnr.Type,
Garage.Qual, Garage.Cond, Pool.QC, Fence, Misc.Feature),
replace_factor_na
)
# drop month/year sold - doesn't help predict a sale value unless using it to adjust sale price
df_train <- df_train %>% select(-Yr.Sold, -Mo.Sold)
# drop utilities, only one value in whole db
df_train <- df_train %>% select(-Utilities)
#deal with continuous na's
# 48 observations miss garage age due to not having a garage
# for purposes of EDA, assume those values are the same as house age
# revisit if garage age is significant influencer
df_train <- df_train %>%
mutate(Garage.Yr.Blt = ifelse(is.na(Garage.Yr.Blt), Year.Built, Garage.Yr.Blt))
# 1 in 7 are missing Lot.Frontage - EDA, found that
# RLot.Frontage looks to be significant influencer
# => strong liner relationship between area and frontage found (after removing outliers)
# => replace NAs with calculated value from area
# <30,000 represents roughly 98% of values
df_train %>%
filter(Lot.Frontage>0 & Lot.Area<30000) %>%
ggplot(aes(x=Lot.Area, y=Lot.Frontage)) +
geom_point() +
geom_smooth(method = "lm")
# find slope (given by covariance) and intercept
slope <- function(x, y){
return(cov(x, y)/var(x))
}
intercept <- function(x, y, m){
b <- mean(y) - (m * mean(x))
return(b)
}
lf.parameters <- df_train %>%
filter(Lot.Frontage>0 & Lot.Area<30000) %>%
summarise(
slope = slope(Lot.Area, Lot.Frontage),
intercept = intercept(Lot.Area, Lot.Frontage, slope)
)
# confirm with lm()
lot.frontage.lm <- lm(Lot.Area ~ Lot.Frontage,
data = (df_train %>% filter(Lot.Frontage>0 & Lot.Area<20000)))
summary(lot.frontage.lm)
calc_frontage <- function(LotArea) {
lf.parameters$intercept + lf.parameters$slope * LotArea
}
df_train <- df_train %>%
mutate(Lot.Frontage = ifelse(is.na(Lot.Frontage), calc_frontage(Lot.Area), Lot.Frontage))
# Mas.Vnr.Area relates to Masonry veneer. Those that have none will have 0.
df_train <- df_train %>%
replace_na(list(Mas.Vnr.Area=0))
# Only two observations remain with NAs - for the sake of EDA, set these to 0 also
df_train <- df_train %>%
replace(is.na(.), 0)
# Perform a basic ANOVA to find influencers to use on outlier test
ames.lm <- lm(log(price) ~ ., data = df_train)
ames_anova <- anova(ames.lm)
potential_vars <- (
ames_anova %>%
rownames_to_column(var = "Variable") %>%
slice_max(`Pr(>F)`, n=10)
)$Variable
n <- nrow(df_train)
ames_k <- qnorm(0.5 + 0.5 * 0.95 ^ (1 / n))
ames.lm <- lm(log(price) ~ ., data = (df_train %>% select(price, all_of(potential_vars))))
outliers <- BAS::Bayes.outlier(ames.lm, k = ames_k)
head(sort(outliers$prob.outlier, decreasing=TRUE), 5)
# 3 observations can be counted as outlier
df_train$outlier = outliers$prob.outlier
df_train %>%
filter(outlier>0.975) %>%
select(PID, price, Year.Built, Year.Remod.Add, area, Overall.Cond, Exterior.1st,
Exter.Cond, Bsmt.Cond, BsmtFin.Type.1, Central.Air, Garage.Cond, Sale.Condition) %>%
kbl() %>% kable_styling(bootstrap_options = c("condensed"), font_size = 10)
#remove outliers from training dataset
df_train <- df_train %>%
filter(!(PID %in% (df_train %>% filter(outlier>0.975))$PID))
# filter sale.condition on normal then drop, only one value in whole db
df_train <- df_train %>%
filter(Sale.Condition == "Normal") %>%
select(-Sale.Condition)Test and Validate data sets are similarly transformed.
residuals <- tibble(
Observed = df_train$price/1e5,
Predicted.HPM = as.numeric(bas_ames.ZS.HPM$Ypred[1,]),
Predicted.BIC = as.numeric(ames.lm.BIC1$fitted.values),
Predicted.Lasso = as.numeric(ames.lasso.pred)
) %>%
mutate(Predicted.HPM = exp(Predicted.HPM)/1e5) %>%
mutate(Predicted.BIC = exp(Predicted.BIC)/1e5) %>%
mutate(Predicted.Lasso = exp(Predicted.Lasso)/1e5) %>%
mutate(HPM = Observed - Predicted.HPM) %>%
mutate(BIC = Observed - Predicted.BIC) %>%
mutate(Lasso = Observed - Predicted.Lasso) %>%
pivot_longer(
cols = c(Lasso, BIC, HPM),
values_to = "Residual",
names_to = "Model"
) %>%
mutate(Model = factor(Model, levels = c("Lasso", "BIC", "HPM")))
price.mu = mean(df_train$price)/1e5
price.sd = sd(df_train$price)/1e5
g_res <- residuals %>%
ggplot(aes(x = Observed, y = Residual, colour = Model)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = price.mu, linetype="longdash", alpha=0.5) +
geom_vline(xintercept = price.mu-price.sd, linetype="dotted", alpha=0.4, size=1) +
geom_vline(xintercept = price.mu+price.sd, linetype="dotted", alpha=0.4, size=1) +
geom_point(alpha = 0.3, size=2) +
geom_smooth(se = F, size = 1.2) +
theme_minimal() +
scale_color_manual(values = c("#80B1D3", "#B3DE69", "#FF7F00")) +
theme(
plot.title = element_text(hjust = 0.5),
plot.margin = margin(20,10,10,10)
) +
labs(
title = "Residual vs Observed Values for BIC and MPM Models ($'00,000)"
)
residuals.train.HPM <- residuals %>% filter(Model=="HPM")
price98 <- c(quantile(residuals.train.HPM$Observed, .01), quantile(residuals.train.HPM$Observed, .99))
res98 <- c(quantile(residuals.train.HPM$Residual, .01), quantile(residuals.train.HPM$Residual, .99))
g_res
g_res + coord_cartesian(xlim=price98, ylim = res98) +
labs(
caption = "Axes cropped to show middle 98% of test data in both directions."
)residuals.intial <- tibble(
Data = "Train",
Observed = df_train$price/1e5,
Predicted = as.numeric(predict.train.initial$Ypred[1,])
) %>%
add_row(
Data = "Test",
Observed = df_test$price/1e5,
Predicted = as.numeric(predict.test.initial$Ypred[1,])
) %>%
mutate(Predicted = exp(Predicted)/1e5) %>%
mutate(Residual = Observed - Predicted)
residuals.intial <- tibble(
Data = "Train",
Observed = df_train$price/1e5,
Predicted = as.numeric(predict.train.initial$Ypred[1,])
) %>%
add_row(
Data = "Test",
Observed = df_test$price/1e5,
Predicted = as.numeric(predict.test.initial$Ypred[1,])
) %>%
mutate(Predicted = exp(Predicted)/1e5) %>%
mutate(Residual = Observed - Predicted)
residuals.intial %>%
mutate(Residual = (Residual)*1e5) %>%
group_by(Data) %>%
summarise(
RMSE=sqrt(mean(Residual^2)),
mean=round(mean(abs(Residual)),2),
median=median(abs(Residual)),
sd=round(sd(abs(Residual)),2),
max=max(abs(Residual)),
q25=quantile(abs(Residual), 0.25),
q75=quantile(abs(Residual),0.75),
skew=round(skewness(Residual),2)) %>%
mutate(across(RMSE:q75, dollar_format())) %>%
kbl(
caption = "<center><h4><strong>Distribution of Absolute Residuals</strong></h4></center>",
escape = F
) %>%
kable_styling(bootstrap_options = c("condensed"))
test.price98 <- c(quantile(residuals.intial$Observed, .01), quantile(residuals.intial$Observed, .99))
test.res98 <- c(quantile(residuals.intial$Residual, .01), quantile(residuals.intial$Residual, .99))
g_test_res <- residuals.intial %>%
ggplot(aes(x = Observed, y = Residual, colour=Data)) +
geom_vline(xintercept = price.mu, linetype="longdash", alpha=0.5) +
geom_vline(xintercept = price.mu-price.sd, linetype="dotted", alpha=0.4, size=1) +
geom_vline(xintercept = price.mu+price.sd, linetype="dotted", alpha=0.4, size=1) +
geom_point(alpha = 0.6, size=2) +
geom_smooth(se = F, size = 1.5) +
theme_minimal() +
geom_hline(yintercept = 0) +
scale_colour_manual(values=c("#33A02B", "#67a9d5")) +
theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(20,10,30,10)) +
labs(
title = "Residual Values for HPM Model for Train and Test Data ($'00,000)",
) +
ylab("Residual") + xlab("Observed")
g_test_res
g_test_res + coord_cartesian(xlim=test.price98, ylim = test.res98) +
labs(
caption = "Axes cropped to show middle 98% of test data in both directions."
)
residual.mu = mean(residuals.intial$Residual)
residual.sd = sd(residuals.intial$Residual)
test.res99 <- c(quantile(residuals.intial$Residual, .005), quantile(residuals.intial$Residual, .995))
residuals.intial %>%
ggplot(aes(x=Residual, fill=Data, colour=Data)) +
geom_histogram(
alpha = 0.3,
bins = 30,
aes(y = stat(density)),
position="identity",
col="darkgrey"
) +
geom_vline(
aes(xintercept = median(Residual[Data=="Test"]), color="Test"),
size = 1.2,
linetype="longdash"
) +
geom_vline(
aes(xintercept = median(Residual[Data=="Train"]), color="Train"),
size = 1.2,
linetype="longdash"
) +
stat_function(
fun = dnorm,
aes(colour = "Normal"),
size = 1.2,
linetype="longdash",
args = list(
mean = residual.mu,
sd = residual.sd
)
) +
geom_line(
aes(y = ..density.., colour = Data),
stat = 'density',
size = 1.5
) +
theme_minimal() +
scale_fill_manual(values=c("Test" = "#33A02B", "Train" = "#1F78B4")) +
scale_colour_manual(
name="Density",
values = c("Normal" = "#E3211C", "Test" = "#33A02B", "Train" = "#1F78B4")
) +
theme(
plot.title = element_text(hjust = 0.5),
legend.position="bottom"
) +
labs(
title = "Distribution of Train and Test Residuals Using HPM",
caption = "NOTE: the x-axis has been trunctated at $100K in both directions due to the excessive tails.
Verticals represent median residuals for both data sets.",
y = "Density", x="Residual ($'00,000)"
) +
coord_cartesian(xlim=c(-1,1)) # bas drops empty levels from model
df_final_test <- df_test %>%
filter(Foundation != "Wood") %>%
filter(Exter.Cond != "Po")
predict.train.final <- predict(model_final, newdata = df_train, estimator = "HPM", se.fit = T)
predict.test.final <- predict(model_final, newdata = df_final_test, estimator = "HPM", se.fit = T)
z_score <- function(log.prices){
predicts <- exp(log.prices)
mu_train <- mean(predicts)
sd_train <- sd(predicts)
(predicts - mu_train)/sd_train
}
residuals.final <- tibble(
Data = "Train",
Observed = df_train$price/1e5,
Predicted = as.numeric(predict.train.final$Ypred[1,]),
Z_Score = z_score(predict.train.final$Ypred)[1,]
) %>%
add_row(
Data = "Test",
Observed = df_final_test$price/1e5,
Predicted = as.numeric(predict.test.final$Ypred[1,]),
Z_Score = z_score(predict.test.final$Ypred)[1,]
) %>%
mutate(Predicted = exp(Predicted)/1e5) %>%
mutate(Residual = Observed - Predicted)
price98 <- c(quantile(residuals.final$Observed, .01), quantile(residuals.final$Observed, .99))
res98 <- c(quantile(residuals.final$Residual, .01), quantile(residuals.final$Residual, .99))
observed.mu = mean(residuals.final$Observed)
observed.sd = sd(residuals.final$Observed)
g_final_res <- residuals.final %>%
ggplot(aes(x = Observed, y = Residual, colour=Data)) +
geom_vline(xintercept = observed.mu, linetype="longdash", alpha=0.5) +
geom_vline(xintercept = observed.mu-observed.sd, linetype="dotted", alpha=0.4, size=1) +
geom_vline(xintercept = observed.mu+observed.sd, linetype="dotted", alpha=0.4, size=1) +
geom_point(alpha = 0.5, size=1) +
geom_smooth(se = F, size = 1.5) +
theme_minimal() +
geom_hline(yintercept = 0) +
scale_colour_manual(values=c("#33A02B", "#67a9d5")) +
theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(20,10,10,10)) +
labs(
title = "Residual Values for Final Model for Train and Test Data ($'00,000)",
) +
ylab("Residual") + xlab("Observed")
g_final_res_prop <- residuals.final %>%
ggplot(aes(x = Observed, y = Residual/Observed, colour=Data)) +
geom_vline(xintercept = observed.mu, linetype="longdash", alpha=0.5) +
geom_vline(xintercept = observed.mu-observed.sd, linetype="dotted", alpha=0.4, size=1) +
geom_vline(xintercept = observed.mu+observed.sd, linetype="dotted", alpha=0.4, size=1) +
geom_point(alpha = 0.5, size=1) +
geom_smooth(se = F, size = 1.5) +
theme_minimal() +
geom_hline(yintercept = 0) +
scale_colour_manual(values=c("#33A02B", "#67a9d5")) +
theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(20,10,10,10)) +
labs(
title = "Final Model Residual Values as Proportion of Observed Values",
) +
ylab("Residual/Observed") + xlab("Observed ($'00,000)")
hist_res_final <- residuals.final %>%
ggplot(aes(x=Residual, fill=Data, colour=Data)) +
geom_histogram(
alpha = 0.3,
bins = 30,
aes(y = stat(density)),
position="identity",
col="darkgrey"
) +
geom_vline(
aes(xintercept = median(Residual[Data=="Test"]), color="Test"),
size = 1.2,
linetype="longdash"
) +
geom_vline(
aes(xintercept = median(Residual[Data=="Train"]), color="Train"),
size = 1.2,
linetype="longdash"
) +
stat_function(
fun = dnorm,
aes(colour = "Normal"),
size = 1.2,
linetype="longdash",
args = list(
mean = residual.mu,
sd = residual.sd
)
) +
geom_line(
aes(y = ..density.., colour = Data),
stat = 'density',
size = 1.5
) +
theme_minimal() +
scale_fill_manual(values=c("Test" = "#33A02B", "Train" = "#1F78B4")) +
scale_colour_manual(
name="Density",
values = c("Normal" = "#E3211C", "Test" = "#33A02B", "Train" = "#1F78B4")
) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom") +
labs(title = "Distribution of Residuals Using Final Model", y = "Density", x="Residual ($'00,000)") +
coord_cartesian(xlim=c(-1,1))
qq_res_final <- residuals.final %>% ggplot(aes(colour=Data)) +
geom_qq(aes(sample = Residual/Z_Score), alpha = 0.4) +
scale_color_manual(values=c("Test" = "#33A02B", "Train" = "#1F78B4")) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom") +
labs(title = "QQ-Plot of Standardised Residuals") +
labs(x = "Theoretical quantiles", y = "Standardized residuals") +
coord_cartesian(ylim=c(-7.5,7.5), xlim=c(-3,3)) residuals.validation <- tibble(
Data = "Validation",
Observed = df_final_validation$price/1e5,
Predicted = as.numeric(predict.validation.final$Ypred[1,]),
Z_Score = z_score(predict.validation.final$Ypred)[1,]
) %>%
mutate(Predicted = exp(Predicted)/1e5) %>%
mutate(Residual = Observed - Predicted)
residuals.final.validation <- rbind(residuals.final, residuals.validation)
price98 <- c(quantile(residuals.validation$Observed, .01), quantile(residuals.validation$Observed, .99))
res98 <- c(quantile(residuals.validation$Residual, .01), quantile(residuals.validation$Residual, .99))
observed.mu = mean(residuals.validation$Observed)
observed.sd = sd(residuals.validation$Observed)
residuals.final.validation %>%
ggplot(aes(x = Observed, y = Residual/Observed, colour=Data)) +
geom_vline(xintercept = observed.mu, linetype="longdash", alpha=0.5) +
geom_vline(xintercept = observed.mu-observed.sd, linetype="dotted", alpha=0.4, size=1) +
geom_vline(xintercept = observed.mu+observed.sd, linetype="dotted", alpha=0.4, size=1) +
geom_point(alpha = 0.6, size=2) +
geom_smooth(se = F, size = 1.5) +
theme_minimal() +
geom_hline(yintercept = 0) +
scale_colour_manual(values=c("#33A02B", "#67a9d5", "orange")) +
theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(20,10,10,10)) +
labs(
title = "Residual Values for All Data Sets as proportion of Observed ($'00,000)",
) +
ylab("Residual") + xlab("Observed") + coord_cartesian(xlim=price98, ylim = c(-0.5,0.5)) +
labs(
caption = "Axes cropped to show middle 98% of test data in both directions.
Vertical Lines are mean and plus/minus one standard devation for the Validation Data only."
)