Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

# set defaults: cache chunks to speed compiling subsequent edits.
knitr::opts_chunk$set(cache=TRUE, echo = TRUE)
set.seed(123456)

load("ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(BAS)
library(MASS)
library(glmnet)
library(ggplot2)
library(ggthemes)
library(kableExtra)
library(RColorBrewer)
library(tidyverse)
library(moments)
library(scales)
library(broom)

Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).

Note for marking: for the sake of readability, code required for transforming the data sets and producing the graphs has not been included inline. Any code to do with modelling etc. has been left inline.

All code used but not displayed can be found in the Appendix.

1.1 Distribution of Log of House Age at 2011

In the first project, we examined the distribution of the ages of houses which showed a tri-modal distribution with a marked peak in the 5-9 bin. Looking at the skewness, median and mean however indicated that the bulk of the mass lay in the older houses.

I wanted to revisit this and see if looking at the log of age would produce something more insightful and perhaps produce a more linear relationship with the log of sale price.

Taking the exponential of the age values, the graph shows that there was a sustained growth in housing from 90 to around 30 years prior to 2011 - i.e. from the years 1920 to 1980, with a peak in the post war years at around 55 , i.e. 1950’s. There followed quite a period of lull until a second, brief, peak at around 6 years, or 2005.

Perhaps the main peak is a reflection of urbanisation of the mid-West during those years.

The distribution, while still bimodal, is a much better fit to normal, although left skewed, and will be used for model considerations later.

1.2 Change in House Sales Prices by Year and Neighbourhood

There were 3 reasons for looking at this:

The sub-prime collapse happened in 2008, in many parts of the world, house prices took a dive (in my own town, prices dropped by up to two-thirds). I wanted to see if this was apparent in the Ames sales figures. There is no evidence in the data of any effect from the sub-prime collapse.
To see if there was a marked shift in prices overall for the time span covered by the data set. If so, it would need to be factored into any modelling as sales from different years would need adjusting before being able to be considered equally. Overall, prices from 2006 to 2010 seemed to be mostly static.
To see if there was any evidence of which neighbourhoods were showing increase, and which decrease over the period to gauge which might be better for investment. There is some movement, however caution is needed as some points may represent few observations:

Clear Creek is worst performing ($171K to $129K) & perhaps Greens, which shows a sharp drop from 2009 to 2010.

Timberland shows sustained strong growth over the entire period ($180K to $300K) and would most likely represent the best area for investment.

1.3 Distribution of House Prices by Neighbourhood

A little related to the previous graph, this is ideal for showing which neighbourhoods have the greatest variability in pricing. Variability (along with sales price trend) gives one of the best opportunities for investment since there is potential for buying undervalued property and selling at an over-valued price.

The neighbourhoods are sorted most variable (by coefficient of variance) top to bottom.

While standard deviation is useful, the quantity will inevitably be proportional to the mean of the value being measured (i.e. a variation of 5% is more in absolute dollar terms for higher prices than for lower). Coefficient of variance is calculated as $\sigma / \mu$, a measure of variability in proportion to the mean. It’s useful for comparing the degree of variation from one data series to another, even if the means are drastically different from one another (ref).

This helps to give an idea of how much house prices vary as proportion of their mean.

Laying the violin plot over the boxplot helps to visualise where the weight of data points is in each suburb. Violins with a heavy left skew may be an indicator of good investment opportunities (more properties below the mean).

Combined with CV, it’s possible to identify opportunities in lower priced neighbourhoods where budget might allow for more properties to be included in the portfolio and therefore spread the risk while also increasing turnover.

Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).

The variables chosen follow from a mix of elicitation (prior knowledge that these factors influence house prices) and examining ANOVA outputs during EDA. I was also careful not to choose covariates.

Neighborhood: This was an obvious choice from both methods, neighbourhood has been shown in EDA plots to show great influence along with prior knowledge that there will always be neighbourhoods that attract a greater price than others.
Overall.Qual: The quality of a house will always be a strong influence on sale value. Closely related to Overall.Cond - as Overall.Qual has more factors, it’s likely the better choice of the two for accuracy of prediction.
House.Style: Another variable related to aesthetics, style of house is usually an influencer.
area: The size of house will almost always affect sale price all other factors held equal. EDA showed a stronger linear relationship between the log of this variable and the log of price.
Lot.Area: Similar to the above, size of lot is usually a strong factor on sale value. Also as with area, EDA showed a stronger linear relationship between the log of this variable and the log of price.
Year.Built: newer builds will generally attract a higher sale value. There is the factor that some “classic” era houses also attract a higher price, however the EDA didn’t show any significant evidence of this. Again, the log of the house age (calculated as log(2011 - Year.Built)) is used - the difference between a 2010 and 1990 house is more influential than between a 1910 and 1930 for example, it makes sense to compress the larger ages.
Bedroom.AbvGr: Although this variable didn’t score highly in ANOVA tests, it’s one of the most fundamental factors in standard house valuations and included for that reason.
Full.Bath: as per #7
Kitchen.Qual: as per #7
MS.Zoning: This came out second strongest in EDA (after neighbourhood) and makes sense - the type of urban planning where the house is located is important with the low and medium density residential zones attracting better prices.

Summary:

All variables show significance with the log of price.
Some neighbourhoods have great influence, some not. However, as the ANOVA table shows, overall, this variable has the strongest strong influence. Surprisingly, the more classic indicators of number of bedrooms and bathrooms are the weakest, although still within the probability for inclusion.
Residuals have a median of 0.00468 with an IQR of 0.015.
Adjusted R-squared: 0.8938 - 89.4% of the log of house price variation can be explained by this model.

df_model1 <- df_train %>%
    mutate(log.price = log(price)) %>%
    mutate(log.area = log(area)) %>%
    mutate(log.lot.area = log(Lot.Area)) %>%
    mutate(log.age = log(2011 - Year.Built)) %>%
    mutate(remod.age = 2011 - Year.Remod.Add) %>%
    select(-price, -PID, -area, -Lot.Area, -Year.Built, -Year.Remod.Add)

df_model1 <- df_model1 %>% 
    select(
        log.price, Neighborhood, Overall.Qual, House.Style, log.area, 
        log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.Qual, MS.Zoning
    )

ames.lm <- lm(log.price ~ ., data = df_model1)
summary(ames.lm)

## 
## Call:
## lm(formula = log.price ~ ., data = df_model1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70330 -0.06573  0.00468  0.07419  0.50681 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.807627   0.210739  32.304  < 2e-16 ***
## NeighborhoodBlueste  0.137118   0.091705   1.495 0.135260    
## NeighborhoodBrDale  -0.017130   0.074647  -0.229 0.818556    
## NeighborhoodBrkSide  0.010672   0.061492   0.174 0.862260    
## NeighborhoodClearCr  0.079103   0.069072   1.145 0.252463    
## NeighborhoodCollgCr -0.020123   0.052439  -0.384 0.701271    
## NeighborhoodCrawfor  0.079285   0.059867   1.324 0.185766    
## NeighborhoodEdwards -0.108498   0.055530  -1.954 0.051070 .  
## NeighborhoodGilbert -0.035503   0.055324  -0.642 0.521229    
## NeighborhoodGreens   0.124856   0.080858   1.544 0.122957    
## NeighborhoodGrnHill  0.378760   0.104160   3.636 0.000295 ***
## NeighborhoodIDOTRR  -0.052510   0.066621  -0.788 0.430819    
## NeighborhoodMeadowV -0.027349   0.065999  -0.414 0.678704    
## NeighborhoodMitchel  0.026220   0.055945   0.469 0.639434    
## NeighborhoodNAmes   -0.007479   0.054708  -0.137 0.891296    
## NeighborhoodNoRidge  0.109831   0.056489   1.944 0.052216 .  
## NeighborhoodNPkVill  0.026130   0.080553   0.324 0.745732    
## NeighborhoodNridgHt  0.062278   0.054285   1.147 0.251631    
## NeighborhoodNWAmes   0.006905   0.056470   0.122 0.902712    
## NeighborhoodOldTown -0.086906   0.061561  -1.412 0.158432    
## NeighborhoodSawyer  -0.016041   0.056454  -0.284 0.776380    
## NeighborhoodSawyerW -0.065456   0.054738  -1.196 0.232137    
## NeighborhoodSomerst  0.029861   0.063821   0.468 0.639993    
## NeighborhoodStoneBr  0.043778   0.061428   0.713 0.476266    
## NeighborhoodSWISU   -0.109875   0.067666  -1.624 0.104825    
## NeighborhoodTimber   0.022354   0.058653   0.381 0.703212    
## NeighborhoodVeenker  0.060628   0.067434   0.899 0.368894    
## Overall.Qual         0.081793   0.006202  13.188  < 2e-16 ***
## House.Style1.5Unf    0.024544   0.048440   0.507 0.612520    
## House.Style1Story    0.069853   0.019172   3.643 0.000287 ***
## House.Style2.5Unf    0.038625   0.046604   0.829 0.407473    
## House.Style2Story   -0.013145   0.018923  -0.695 0.487492    
## House.StyleSFoyer    0.166107   0.030818   5.390 9.32e-08 ***
## House.StyleSLvl      0.068208   0.027338   2.495 0.012800 *  
## log.area             0.523887   0.029435  17.798  < 2e-16 ***
## log.lot.area         0.107185   0.013651   7.852 1.34e-14 ***
## Bedroom.AbvGr       -0.014137   0.008201  -1.724 0.085140 .  
## log.age             -0.062538   0.011710  -5.340 1.22e-07 ***
## Full.Bath           -0.017230   0.013409  -1.285 0.199168    
## Kitchen.QualFa      -0.239864   0.041818  -5.736 1.38e-08 ***
## Kitchen.QualGd      -0.093617   0.024855  -3.767 0.000178 ***
## Kitchen.QualPo      -0.412026   0.135348  -3.044 0.002411 ** 
## Kitchen.QualTA      -0.160389   0.027593  -5.813 8.93e-09 ***
## MS.ZoningFV          0.330414   0.079478   4.157 3.57e-05 ***
## MS.ZoningI (all)    -0.224493   0.143638  -1.563 0.118477    
## MS.ZoningRH          0.167195   0.083253   2.008 0.044957 *  
## MS.ZoningRL          0.326799   0.064845   5.040 5.79e-07 ***
## MS.ZoningRM          0.250377   0.060829   4.116 4.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1245 on 786 degrees of freedom
## Multiple R-squared:  0.8998, Adjusted R-squared:  0.8938 
## F-statistic: 150.1 on 47 and 786 DF,  p-value: < 2.2e-16

ames_anova <- anova(ames.lm)
ames_anova %>%
    rownames_to_column(var = "Variable") %>%
    arrange(`Pr(>F)`) %>%
    kbl() %>% 
    kable_styling(bootstrap_options = c("condensed"))

Variable	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Neighborhood	26	73.1056576	2.8117561	181.479670	0.0000000
Overall.Qual	1	20.8456183	20.8456183	1345.442432	0.0000000
log.area	1	10.9779894	10.9779894	708.554317	0.0000000
log.lot.area	1	1.0517136	1.0517136	67.880939	0.0000000
log.age	1	0.7823384	0.7823384	50.494608	0.0000000
Kitchen.Qual	4	0.8495418	0.2123854	13.708031	0.0000000
House.Style	6	0.8444736	0.1407456	9.084168	0.0000000
MS.Zoning	5	0.7149414	0.1429883	9.228917	0.0000000
Bedroom.AbvGr	1	0.0941711	0.0941711	6.078099	0.0138994
Full.Bath	1	0.0428519	0.0428519	2.765800	0.0966972
Residuals	786	12.1778945	0.0154935	NA	NA

Section 2.2 Model Selection

Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?

Running BAS with a ZS.null prior and uniform model prior, then testing against the four predictor types (BMA, HPM, MPM & BPM).
Concentrating on HPM (highest probability model), the selected model drops Full.Bath only.
Running a BIC test with stepAIC and k = log(nrow(df_model1)) drops Neighbourhood, Overall.Qual and Full.Bath.
Running a Lasso test with glmnet drops Full.Bath only.

Highest probability model finds the most likely model to have generated the data using a 0-1 loss L₀.

BIC is more selective and allows for more false-positive outcomes (where positive means accepts the null hypothesis that the variable is not an influencer).

Lasso regression is a parsimonious model that performs L1 regularization. The L1 regularization adds a penalty equivalent to the absolute magnitude of regression coefficients and tries to minimize them.

#-----------------------------------------
# BAS Modelling
#-----------------------------------------

bas_ames.ZS <- bas.lm(log.price ~ ., 
                  data = df_model1,
                  prior = "ZS-null", 
                  modelprior = uniform(),
                  n.models=1000
)

models = data.frame(Estimator=character(), 
                    Post.Prob=numeric(), 
                    Vars=integer(),
                    Best.Vars=character()
                    )

#BMA
bas_ames.ZS.BMA = predict(bas_ames.ZS, estimator = "BMA")
models <- models %>% add_row(
    Estimator="BMA", 
    Post.Prob=max(bas_ames.ZS$postprobs[1]), 
    Vars=length(bas_ames.ZS.BMA$best.vars[-1]),
    Best.Vars=toString(bas_ames.ZS.BMA$best.vars[-1])
)

#HPM
bas_ames.ZS.HPM = predict(bas_ames.ZS, estimator = "HPM")
models <- models %>% add_row(
    Estimator="HPM", 
    Post.Prob=max(bas_ames.ZS$postprobs[bas_ames.ZS.HPM$best]), 
    Vars=length(bas_ames.ZS.HPM$best.vars[-1]),
    Best.Vars=toString(bas_ames.ZS.HPM$best.vars[-1])
)

#MPM
bas_ames.ZS.MPM = predict(bas_ames.ZS, estimator = "MPM", se.fit = T)
models <- models %>% add_row(
    Estimator="MPM", 
    Post.Prob=NA, 
    Vars=length(bas_ames.ZS.MPM$best.vars[-1]),
    Best.Vars=toString(bas_ames.ZS.MPM$best.vars[-1])
)

#BPM
bas_ames.ZS.BPM = predict(bas_ames.ZS, estimator = "BPM", se.fit = T)
models <- models %>% add_row(
    Estimator="BPM", 
    Post.Prob=max(bas_ames.ZS$postprobs[bas_ames.ZS.HPM$best]), 
    Vars=length(bas_ames.ZS.BPM$best.vars[-1]),
    Best.Vars=toString(bas_ames.ZS.BPM$best.vars[-1])
)

models %>%
    kbl() %>%
    kable_styling(bootstrap_options = c("condensed"))

Estimator	Post.Prob	Vars	Best.Vars
BMA	0.0000000	47	NeighborhoodBlueste, NeighborhoodBrDale, NeighborhoodBrkSide, NeighborhoodClearCr, NeighborhoodCollgCr, NeighborhoodCrawfor, NeighborhoodEdwards, NeighborhoodGilbert, NeighborhoodGreens, NeighborhoodGrnHill, NeighborhoodIDOTRR, NeighborhoodMeadowV, NeighborhoodMitchel, NeighborhoodNAmes, NeighborhoodNoRidge, NeighborhoodNPkVill, NeighborhoodNridgHt, NeighborhoodNWAmes, NeighborhoodOldTown, NeighborhoodSawyer, NeighborhoodSawyerW, NeighborhoodSomerst, NeighborhoodStoneBr, NeighborhoodSWISU, NeighborhoodTimber, NeighborhoodVeenker, Overall.Qual, House.Style1.5Unf, House.Style1Story, House.Style2.5Unf, House.Style2Story, House.StyleSFoyer, House.StyleSLvl, log.area, log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.QualFa, Kitchen.QualGd, Kitchen.QualPo, Kitchen.QualTA, MS.ZoningFV, MS.ZoningI (all), MS.ZoningRH, MS.ZoningRL, MS.ZoningRM
HPM	0.9984024	23	NeighborhoodEdwards, NeighborhoodGrnHill, NeighborhoodNoRidge, NeighborhoodNridgHt, NeighborhoodOldTown, NeighborhoodSawyerW, NeighborhoodSWISU, Overall.Qual, House.Style1Story, House.StyleSFoyer, House.StyleSLvl, log.area, log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.QualFa, Kitchen.QualGd, Kitchen.QualTA, MS.ZoningFV, MS.ZoningRH, MS.ZoningRL, MS.ZoningRM
MPM	NA	23	NeighborhoodEdwards, NeighborhoodGrnHill, NeighborhoodNoRidge, NeighborhoodNridgHt, NeighborhoodOldTown, NeighborhoodSawyerW, NeighborhoodSWISU, Overall.Qual, House.Style1Story, House.StyleSFoyer, House.StyleSLvl, log.area, log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.QualFa, Kitchen.QualGd, Kitchen.QualTA, MS.ZoningFV, MS.ZoningRH, MS.ZoningRL, MS.ZoningRM
BPM	0.9984024	23	NeighborhoodEdwards, NeighborhoodGrnHill, NeighborhoodNoRidge, NeighborhoodNridgHt, NeighborhoodOldTown, NeighborhoodSawyerW, NeighborhoodSWISU, Overall.Qual, House.Style1Story, House.StyleSFoyer, House.StyleSLvl, log.area, log.lot.area, Bedroom.AbvGr, log.age, Full.Bath, Kitchen.QualFa, Kitchen.QualGd, Kitchen.QualTA, MS.ZoningFV, MS.ZoningRH, MS.ZoningRL, MS.ZoningRM

#-----------------------------------------
# stepAIC with BIC
#-----------------------------------------
BICk <- log(nrow(df_model1))
m_ames_BIC1 <- lm(log.price ~ ., data = df_model1)
ames.lm.BIC1 <- stepAIC(m_ames_BIC1, direction = "both", k = BICk)

## Start:  AIC=-3202.13
## log.price ~ Neighborhood + Overall.Qual + House.Style + log.area + 
##     log.lot.area + Bedroom.AbvGr + log.age + Full.Bath + Kitchen.Qual + 
##     MS.Zoning
## 
##                 Df Sum of Sq    RSS     AIC
## - Neighborhood  26    2.0673 14.245 -3246.2
## - Full.Bath      1    0.0256 12.204 -3207.1
## - Bedroom.AbvGr  1    0.0460 12.224 -3205.7
## <none>                       12.178 -3202.1
## - MS.Zoning      5    0.7149 12.893 -3188.2
## - House.Style    6    0.8359 13.014 -3187.1
## - log.age        1    0.4419 12.620 -3179.1
## - Kitchen.Qual   4    0.8425 13.020 -3173.3
## - log.lot.area   1    0.9552 13.133 -3145.9
## - Overall.Qual   1    2.6945 14.872 -3042.2
## - log.area       1    4.9079 17.086 -2926.4
## 
## Step:  AIC=-3246.25
## log.price ~ Overall.Qual + House.Style + log.area + log.lot.area + 
##     Bedroom.AbvGr + log.age + Full.Bath + Kitchen.Qual + MS.Zoning
## 
##                 Df Sum of Sq    RSS     AIC
## - Full.Bath      1    0.0827 14.328 -3248.1
## <none>                       14.245 -3246.2
## - Bedroom.AbvGr  1    0.1721 14.417 -3243.0
## - Kitchen.Qual   4    0.8844 15.130 -3222.9
## - House.Style    6    1.2256 15.471 -3217.8
## + Neighborhood  26    2.0673 12.178 -3202.1
## - log.age        1    0.9066 15.152 -3201.5
## - MS.Zoning      5    1.4982 15.743 -3196.5
## - log.lot.area   1    1.5892 15.834 -3164.8
## - Overall.Qual   1    4.3581 18.603 -3030.4
## - log.area       1    6.2834 20.529 -2948.2
## 
## Step:  AIC=-3248.15
## log.price ~ Overall.Qual + House.Style + log.area + log.lot.area + 
##     Bedroom.AbvGr + log.age + Kitchen.Qual + MS.Zoning
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       14.328 -3248.1
## + Full.Bath      1    0.0827 14.245 -3246.2
## - Bedroom.AbvGr  1    0.2161 14.544 -3242.4
## - Kitchen.Qual   4    0.9288 15.257 -3222.7
## - House.Style    6    1.2156 15.543 -3220.6
## - log.age        1    0.8283 15.156 -3208.0
## + Neighborhood  26    2.1244 12.204 -3207.1
## - MS.Zoning      5    1.5125 15.840 -3198.1
## - log.lot.area   1    1.6300 15.958 -3165.0
## - Overall.Qual   1    4.3931 18.721 -3031.8
## - log.area       1    6.6698 20.998 -2936.1

summary(ames.lm.BIC1)

## 
## Call:
## lm(formula = log.price ~ Overall.Qual + House.Style + log.area + 
##     log.lot.area + Bedroom.AbvGr + log.age + Kitchen.Qual + MS.Zoning, 
##     data = df_model1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68882 -0.07352  0.00762  0.08411  0.46490 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.533950   0.190390  34.319  < 2e-16 ***
## Overall.Qual       0.096188   0.006092  15.788  < 2e-16 ***
## House.Style1.5Unf  0.062797   0.050751   1.237 0.216308    
## House.Style1Story  0.092483   0.018897   4.894 1.19e-06 ***
## House.Style2.5Unf  0.030900   0.048416   0.638 0.523514    
## House.Style2Story  0.004048   0.018538   0.218 0.827197    
## House.StyleSFoyer  0.195163   0.030453   6.409 2.48e-10 ***
## House.StyleSLvl    0.093310   0.027779   3.359 0.000819 ***
## log.area           0.538290   0.027670  19.454  < 2e-16 ***
## log.lot.area       0.109542   0.011390   9.617  < 2e-16 ***
## Bedroom.AbvGr     -0.027707   0.007913  -3.502 0.000488 ***
## log.age           -0.055194   0.008051  -6.856 1.40e-11 ***
## Kitchen.QualFa    -0.252471   0.042531  -5.936 4.32e-09 ***
## Kitchen.QualGd    -0.093961   0.023932  -3.926 9.36e-05 ***
## Kitchen.QualPo    -0.291521   0.136670  -2.133 0.033222 *  
## Kitchen.QualTA    -0.162922   0.027213  -5.987 3.21e-09 ***
## MS.ZoningFV        0.370787   0.065836   5.632 2.45e-08 ***
## MS.ZoningI (all)  -0.213860   0.151308  -1.413 0.157918    
## MS.ZoningRH        0.160607   0.081277   1.976 0.048486 *  
## MS.ZoningRL        0.352947   0.061121   5.775 1.10e-08 ***
## MS.ZoningRM        0.261788   0.061476   4.258 2.30e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1328 on 813 degrees of freedom
## Multiple R-squared:  0.8821, Adjusted R-squared:  0.8792 
## F-statistic:   304 on 20 and 813 DF,  p-value: < 2.2e-16

#-----------------------------------------
# Lasso
#-----------------------------------------
x_vars <- model.matrix(log.price~. , df_model1)[,-1]
lambda_seq <- 10^seq(2, -2, by = -.1)
# Splitting the data into test and train
ames.lm.cv <- cv.glmnet(x_vars, df_model1$log.price,
                       alpha = 1, lambda = lambda_seq, 
                       nfolds = 5)
# identifying best lamda
best_lam <- ames.lm.cv$lambda.min
best_lam

## [1] 0.01

# Rebuilding the model with best lamda value identified
ames.lasso <- glmnet(x_vars, df_model1$log.price, alpha = 1, lambda = best_lam)
coef(ames.lasso)

## 51 x 1 sparse Matrix of class "dgCMatrix"
##                                s0
## (Intercept)          7.6096420391
## NeighborhoodBlueste  .           
## NeighborhoodBrDale   .           
## NeighborhoodBrkSide  .           
## NeighborhoodClearCr  .           
## NeighborhoodCollgCr  .           
## NeighborhoodCrawfor  0.0227706109
## NeighborhoodEdwards -0.0577177872
## NeighborhoodGilbert -0.0034302508
## NeighborhoodGreens   .           
## NeighborhoodGrnHill  0.1878894001
## NeighborhoodIDOTRR  -0.0369861529
## NeighborhoodLandmrk  .           
## NeighborhoodMeadowV  .           
## NeighborhoodMitchel  0.0032072906
## NeighborhoodNAmes    .           
## NeighborhoodNoRidge  0.0438913349
## NeighborhoodNPkVill  .           
## NeighborhoodNridgHt  0.0482616508
## NeighborhoodNWAmes   .           
## NeighborhoodOldTown -0.0528269100
## NeighborhoodSawyer   .           
## NeighborhoodSawyerW -0.0375682457
## NeighborhoodSomerst  .           
## NeighborhoodStoneBr  .           
## NeighborhoodSWISU   -0.0480277202
## NeighborhoodTimber   .           
## NeighborhoodVeenker  0.0095768127
## Overall.Qual         0.1075474966
## House.Style1.5Unf    .           
## House.Style1Story    0.0290266666
## House.Style2.5Fin    .           
## House.Style2.5Unf    .           
## House.Style2Story   -0.0091115590
## House.StyleSFoyer    0.0747748141
## House.StyleSLvl      .           
## log.area             0.4133081058
## log.lot.area         0.1072565525
## Bedroom.AbvGr       -0.0001698858
## log.age             -0.0614781367
## Full.Bath            .           
## Kitchen.QualFa      -0.0811175756
## Kitchen.QualGd       .           
## Kitchen.QualPo       .           
## Kitchen.QualTA      -0.0484383028
## MS.ZoningC (all)    -0.1663910778
## MS.ZoningFV          .           
## MS.ZoningI (all)    -0.2206088575
## MS.ZoningRH         -0.0122199543
## MS.ZoningRL          0.0218351282
## MS.ZoningRM         -0.0474788819

ames.lasso.pred <- predict(ames.lasso, s = best_lam, newx = x_vars)
head(cbind(df_model1$log.price, ames.lasso.pred), 10)

##                   s1
## 1  11.74404 11.69928
## 2  11.84582 11.70808
## 3  11.73527 11.49212
## 4  11.64395 11.49814
## 5  12.33271 12.37559
## 6  12.19854 12.35205
## 7  11.44035 11.43092
## 8  11.83138 11.76718
## 9  11.84940 11.87828
## 10 12.29911 12.38376

Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.

Below are plots for residuals for all three models for comparison. The HPM model will be used moving forwards from here as it is the best performing.

The second plot adjusts the axes to remove the upper and lower 1% of values from display to allow a better visual representation and zoom in on where the bulk of the data lies.

All three models perform reasonably well within the H-spread (± 1 standard deviation represented by the area between the dotted lines). Below that, the models over-predict and, conversely, for upper prices the models all under-predict. If this problem persists after the full model, a correction factor scaled as a square of the distance from the mean can be useful. This might be tricky given the double flex characteristic of a quadratic formula.

Earlier testing with all sales conditions included produced some extreme residuals of $300K+. After removing non-normal sales, the model accuracy improved greatly.

Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).

The RMSE for the HPM model is $22,506.54, the best of the 3 tried models.

RMSE <- tibble(
    Observed = df_train$price,
    Predicted.HPM = as.numeric(bas_ames.ZS.HPM$Ypred[1,]),
    Predicted.BIC = as.numeric(ames.lm.BIC1$fitted.values),
    Predicted.Lasso = as.numeric(ames.lasso.pred)
    ) %>%
    mutate(Predicted.HPM = exp(Predicted.HPM)) %>%
    mutate(Predicted.BIC = exp(Predicted.BIC)) %>%
    mutate(Predicted.Lasso = exp(Predicted.Lasso)) %>%
    mutate(HPM = Observed - Predicted.HPM) %>%
    mutate(BIC = Observed - Predicted.BIC)  %>% 
    mutate(Lasso = Observed - Predicted.Lasso)

tibble(
    `RMSE HPM____` = dollar_format()(sqrt(mean(RMSE$HPM^2))),
    `RMSE BIC____` = dollar_format()(sqrt(mean(RMSE$BIC^2))),
    `RMSE Lasso__` = dollar_format()(sqrt(mean(RMSE$Lasso^2)))
       )

RMSE HPM____	RMSE BIC____	RMSE Lasso__
$22,506.54	$23,628.92	$25,754.17

Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?

A new BAS linear model was created using the model proposed by HPM (dropping only Full.Bath).
A new pred.bas object was created with fitted values using this model with the train and test data and the HPM estimator.

Comparing data between train and test prediction, predictions for the training data are better than test. This was not the case before non-normal data was included in the model. The test data contains only normal sales conditions which is likely a good reason for the better fit prior to making this filter and re-modelling:

The root mean squared error is slightly higher for the Test data ($22.5K for Train vs $24.0K).
Mean and median of the residuals is slightly lower for Train than Test, standard deviation is identical.
Maximum error for Train is $172K vs $197K. For the Test data, the maximum occurred at the extreme right tail, Train has a couple of outliers at around the $400K mark. There are no outliers in the IQR as with the training data.
IQR is slightly higher for Test.
Both data sets are right-skewed, Train more so. The skewness comes from the larger residual errors in the more expensive housing.
The training residuals are more centred around zero than the test data (median values are -$117 and -$4091 respectively). Both underestimate on average.
The fit for residual means across sales price is more linear for the test data, particularly above the median.
Both models deteriorate more than 1 standard deviation below the mean price, over-predicting the price increases below that point.

Despite the slightly poorer residuals with the test data set, this is to be expected and gives no cause for concern with regards to applicability to other data sets.

HPM.intial <- as.vector(which.matrix(bas_ames.ZS$which[bas_ames.ZS.HPM$best],
                              bas_ames.ZS$n.vars))

model_initial <- bas.lm(log.price ~ ., data = df_model1,
                      prior = "ZS-null",
                      modelprior = uniform(),
                      bestmodel = HPM.intial, n.models = 1)

predict.train.initial <- predict(model_initial, newdata = df_model1, estimator = "HPM")
predict.test.initial <- predict(model_initial, newdata = df_model2, estimator = "HPM")

**Distribution of Absolute Residuals**
Data	RMSE	mean	median	sd	max	q25	q75	skew
Test	$24,043.95	$17,416.33	$13,128.66	$16,586.77	$197,048	$6,419.36	$23,365.77	1.13
Train	$22,506.54	$15,926.17	$12,208.95	$15,912.42	$172,183	$5,504.76	$21,208.68	1.37

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

Section 3.1 Final Model

Provide the summary table for your model.

The following code selects the best variables and produces summary table (using coef()) for the final table.

# Drop 2 outliers found in initial model testing, remove any non-normal sales
df_train <- df_train %>%
    mutate(log.age = log(2011 - Year.Built)) %>%
    mutate(log.remod.age = log(2011 - Year.Remod.Add)) %>%
    filter(!(PID %in% c(908154205, 533350090)))

df_test <- df_test %>%
    mutate(log.age = log(2011 - Year.Built)) %>%
    mutate(log.remod.age = log(2011 - Year.Remod.Add))

# Drop colinear and other variable scoring low ANOVA during EDA tests
df_model2 <- df_train %>%
    mutate(log.price = log(price)) %>%
    select(-price, -PID, -Year.Built, -Year.Remod.Add, -Garage.Yr.Blt, -outlier,
           -Overall.Cond, -Bsmt.Cond, -Bsmt.Exposure, -BsmtFin.Type.1, -BsmtFin.SF.1,
           -BsmtFin.Type.2, BsmtFin.SF.2, -Bsmt.Unf.SF, -Total.Bsmt.SF,
           -Exter.Qual, -Pool.QC, -MS.SubClass, -X1st.Flr.SF, -X2nd.Flr.SF,
           -Alley, -Lot.Frontage, -Land.Slope, -Condition.1, Condition.2, -Roof.Matl,
           -Exterior.1st, -Exterior.2nd, -Heating, -Central.Air, -Electrical, -Functional,
           -Garage.Cond, -Garage.Qual, -Misc.Feature)

# re-run ANOVA and select best 20
ames.lm <- lm(log.price ~ ., data = df_model2)
ames_anova <- anova(ames.lm)
ames_anova_tbl <- ames_anova %>%
    rownames_to_column(var = "Variable") %>%
    slice_min(`Pr(>F)`, n=20) %>%
    arrange(`Pr(>F)`)

# create subset with only those vars
df_model2 <- df_model2 %>% select(log.price, all_of(ames_anova_tbl$Variable))

# create initial BAS model
model_final <- bas.lm(log.price ~ ., 
                      data = df_model2,
                      prior = "ZS-null", 
                      modelprior = uniform(),
                      n.models=1000
)

# Repeat HPM selection method until no more vars are eliminated
repeat {

    prior_n <- length(attributes(model_final$x)$names)
    #use HPM to find best vars
    model_final.HPM <- predict(model_final, estimator = "HPM")
    # create vector of best vars and re-run BAS model selecting only those
    model_final$postprobs[model_final.HPM$best]
    HPM <- as.vector(which.matrix(model_final$which[model_final.HPM$best],
                                 model_final$n.vars))
    model_final <- bas.lm(log.price ~ ., data = df_model2,
                            prior = "ZS-null",
                            modelprior = uniform(),
                            bestmodel = HPM, n.models = 1000)
    if(prior_n == length(attributes(model_final$x)$names)){break}
    
}

model_final <- bas.lm(log.price ~ ., data = df_model2,
                      prior = "ZS-null",
                      modelprior = uniform(),
                      bestmodel = HPM, n.models = 1)

predict.train.final <- predict(model_final, estimator = "HPM", se.fit = T)

# final model R2
model_final$R2

## [1] 0.9288975

coef(model_final, estimator="HPM")

## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  HPM 
## 
##  Based on the top  1 models 
##                      post mean   post SD     post p(B != 0)
## Intercept             1.200e+01   3.621e-03   1.000e+00    
## area                  3.330e-04   1.298e-05   1.000e+00    
## NeighborhoodBlueste   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodBrDale   -1.593e-02   4.677e-02   1.000e+00    
## NeighborhoodBrkSide   2.396e-02   2.044e-02   1.000e+00    
## NeighborhoodClearCr   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodCollgCr   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodCrawfor   1.039e-01   2.284e-02   1.000e+00    
## NeighborhoodEdwards  -6.904e-02   1.657e-02   1.000e+00    
## NeighborhoodGilbert   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodGreens    0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodGrnHill   6.642e-01   7.887e-02   1.000e+00    
## NeighborhoodIDOTRR    5.851e-02   2.710e-02   1.000e+00    
## NeighborhoodMeadowV  -5.940e-02   3.353e-02   1.000e+00    
## NeighborhoodMitchel   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodNAmes     0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodNoRidge   6.461e-02   2.261e-02   1.000e+00    
## NeighborhoodNPkVill   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodNridgHt   6.711e-02   2.174e-02   1.000e+00    
## NeighborhoodNWAmes    2.272e-02   1.946e-02   1.000e+00    
## NeighborhoodOldTown   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodSawyer    0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodSawyerW   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodSomerst   0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodStoneBr   5.315e-02   3.331e-02   1.000e+00    
## NeighborhoodSWISU    -8.559e-02   3.488e-02   1.000e+00    
## NeighborhoodTimber    0.000e+00   0.000e+00   0.000e+00    
## NeighborhoodVeenker   1.303e-01   3.658e-02   1.000e+00    
## MS.ZoningFV           4.483e-01   5.312e-02   1.000e+00    
## MS.ZoningI (all)      0.000e+00   0.000e+00   0.000e+00    
## MS.ZoningRH           2.730e-01   6.549e-02   1.000e+00    
## MS.ZoningRL           4.123e-01   4.925e-02   1.000e+00    
## MS.ZoningRM           3.106e-01   4.755e-02   1.000e+00    
## Overall.Qual          7.453e-02   4.912e-03   1.000e+00    
## House.Style1.5Unf     0.000e+00   0.000e+00   0.000e+00    
## House.Style1Story     4.330e-02   1.221e-02   1.000e+00    
## House.Style2.5Unf     0.000e+00   0.000e+00   0.000e+00    
## House.Style2Story    -4.391e-02   1.319e-02   1.000e+00    
## House.StyleSFoyer     7.478e-02   2.345e-02   1.000e+00    
## House.StyleSLvl       0.000e+00   0.000e+00   0.000e+00    
## Lot.ShapeIR2          1.011e-02   2.313e-02   1.000e+00    
## Lot.ShapeIR3          0.000e+00   0.000e+00   0.000e+00    
## Lot.ShapeReg          0.000e+00   0.000e+00   0.000e+00    
## Bldg.Type2fmCon       0.000e+00   0.000e+00   0.000e+00    
## Bldg.TypeDuplex      -1.131e-01   2.279e-02   1.000e+00    
## Bldg.TypeTwnhs       -8.510e-02   2.392e-02   1.000e+00    
## Bldg.TypeTwnhsE      -3.776e-02   1.734e-02   1.000e+00    
## Bsmt.QualFa          -1.037e-01   3.298e-02   1.000e+00    
## Bsmt.QualGd          -5.544e-02   1.807e-02   1.000e+00    
## Bsmt.QualNONE        -2.872e-01   4.343e-02   1.000e+00    
## Bsmt.QualPo           0.000e+00   0.000e+00   0.000e+00    
## Bsmt.QualTA          -5.467e-02   2.189e-02   1.000e+00    
## Bsmt.Full.Bath        7.684e-02   7.931e-03   1.000e+00    
## Exter.CondFa         -1.544e-01   2.821e-02   1.000e+00    
## Exter.CondGd          7.695e-04   1.195e-02   1.000e+00    
## Exter.CondTA          0.000e+00   0.000e+00   0.000e+00    
## Land.ContourHLS       0.000e+00   0.000e+00   0.000e+00    
## Land.ContourLow       0.000e+00   0.000e+00   0.000e+00    
## Land.ContourLvl       0.000e+00   0.000e+00   0.000e+00    
## FoundationCBlock      0.000e+00   0.000e+00   0.000e+00    
## FoundationPConc       0.000e+00   0.000e+00   0.000e+00    
## FoundationSlab        1.433e-01   4.896e-02   1.000e+00    
## FoundationStone       0.000e+00   0.000e+00   0.000e+00    
## Lot.Area              2.446e-06   3.968e-07   1.000e+00    
## Heating.QCFa         -1.058e-01   2.687e-02   1.000e+00    
## Heating.QCGd          0.000e+00   0.000e+00   0.000e+00    
## Heating.QCPo         -2.170e-01   1.082e-01   1.000e+00    
## Heating.QCTA         -2.461e-02   9.483e-03   1.000e+00    
## Lot.ConfigCulDSac     0.000e+00   0.000e+00   0.000e+00    
## Lot.ConfigFR2         0.000e+00   0.000e+00   0.000e+00    
## Lot.ConfigFR3         0.000e+00   0.000e+00   0.000e+00    
## Lot.ConfigInside      0.000e+00   0.000e+00   0.000e+00    
## Kitchen.QualFa       -5.771e-02   2.598e-02   1.000e+00    
## Kitchen.QualGd        0.000e+00   0.000e+00   0.000e+00    
## Kitchen.QualPo        0.000e+00   0.000e+00   0.000e+00    
## Kitchen.QualTA        0.000e+00   0.000e+00   0.000e+00    
## Garage.Cars           4.834e-02   7.300e-03   1.000e+00    
## log.age              -3.799e-02   9.248e-03   1.000e+00    
## log.remod.age        -3.100e-02   6.145e-03   1.000e+00    
## Paved.DriveP          0.000e+00   0.000e+00   0.000e+00    
## Paved.DriveY          8.135e-02   1.488e-02   1.000e+00

Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.

Code for the Transformations can be found in the Appendix

Factors - the following factor variables incorrectly use the NA value to describe ‘Not Applicable’ or ‘None’:

Alley, BsmtFin.Type.1, BsmtFin.Type.2, Bsmt.Qual, Bsmt.Cond, Bsmt.Exposure, Fireplace.Qu, Garage.Type, Garage.Finish, Mas.Vnr.Type, Garage.Qual, Garage.Cond, Pool.QC, Fence, Misc.Feature

Each of these has been re-factored to convert NA’s to “None” to include the rows in regression.

Utilities has only one level and is dropped from the dataset.

Year sold, month sold and PID are dropped as they have no value in predicting prices.

1 in 7 observations are missing Lot.Frontage. This variable has a strong linear relationship with Lot.Area. For the sake of model exploration, the NA values are replaced with a value calculated by lot area. It’s not used in the final model due to it’s uncertainty and collinearity with Lot.Area.

Mas.Vnr.Area relates to Masonry veneer. Those that had NA and veneer type None are set to 0.

In the training data, a BAS::Bayes.outlier found 3 outliers for sales price at over 97.5% probability which were removed. No outliers were found in the test data.

In the test model, log of Lot.Area and area were used. In the final, they proved to be more significant without their log forms.

Year.Built and Year.Remod.Add were both converted to years before 2011 (i.e. 2011 - x). The logs of both of these values were used in the final model as influence of age difference decreases with increasing age.

The log of price is used throughout although for converted back for residual analysis.

A few observations in the test data with factors that have no observations in the training data needed to be removed for the final fit. The limitations of BAS linear models with factors is that unused factors are dropped from the model, so it has no idea how to deal with them and throws an error.

Sales not classified as “Normal” were removed from the training data before building the final model. We are predicting prices under normal sales conditions. Many of the non-normal sales types had highly irregular sales prices. Removing these improved the accuracy of the model significantly. 163 rows were removed.

Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.

No variable interaction used.

Variables were selected that were low in correlation. A first visual scan removed most of these, however some aliases were found still. Subsetting the data to just the top 20 on the ANOVA table got rid of the aliases and produces $GVIF^{(1/(2*Df))}$ of 2 or less. Due to this, no variable interaction necessary.

# deselecting obvious variables related to others, generally those found higher on the ANOVA were kept, 
# the co-variables of those dropped
df_corr <- df_train %>%
    mutate(log.price = log(price)) %>%
    select(-price, -PID, -Year.Built, -Year.Remod.Add, -Garage.Yr.Blt, -outlier,
           -Overall.Cond, -Bsmt.Cond, -Bsmt.Exposure, -BsmtFin.Type.1, -BsmtFin.SF.1,
           -BsmtFin.Type.2, BsmtFin.SF.2, -Bsmt.Unf.SF, -Total.Bsmt.SF,
           -Exter.Qual, -Pool.QC, -MS.SubClass, -X1st.Flr.SF, -X2nd.Flr.SF)

# create lm, test for aliases - no VIF analysis possible while aliases exist
ames.lm <- lm(log.price ~ ., data = df_corr)
paste("Aliased variables:", attributes(alias(ames.lm)$Complete)$dimnames[[1]])

## [1] "Aliased variables: Garage.QualNONE" "Aliased variables: Garage.CondNONE"
## [3] "Aliased variables: Garage.CondTA"

# take list of top 20 of remaining variables on ANOVA 
ames_anova <- anova(ames.lm)
ames_anova_tbl <- ames_anova %>%
    rownames_to_column(var = "Variable") %>%
    slice_min(`Pr(>F)`, n=20) %>%
    arrange(`Pr(>F)`)

# Subset top 20 and check for aliases and GVIF analysis
df_corr <- df_corr %>% select(log.price, all_of(ames_anova_tbl$Variable))
ames.lm <- glm(log.price ~ ., data = df_corr)
paste("Aliased variables:", attributes(alias(ames.lm)$Complete)$dimnames[[1]])

## [1] "Aliased variables: "

as.data.frame(car::vif(ames.lm)) %>% arrange(desc(`GVIF^(1/(2*Df))`))

	GVIF	Df	GVIF^(1/(2*Df))
Overall.Qual	4.015410	1	2.003849
area	3.507081	1	1.872720
MS.Zoning	59.933151	5	1.505798
Central.Air	2.067113	1	1.437746
Lot.Frontage	2.043984	1	1.429680
Foundation	16.459201	4	1.419225
Bsmt.Qual	29.174513	5	1.401201
Bldg.Type	8.830748	4	1.312955
Alley	2.725569	2	1.284885
Kitchen.Qual	6.104375	4	1.253733
House.Style	12.570433	6	1.234845
Neighborhood	19921.407882	26	1.209704
Heating.QC	4.551012	4	1.208547
Bsmt.Full.Bath	1.422089	1	1.192514
Lot.Shape	2.634474	3	1.175210
Land.Contour	2.600833	3	1.172696
Heating	2.992845	4	1.146860
Lot.Config	2.181920	4	1.102440
Exter.Cond	1.718586	3	1.094448
Condition.1	3.861214	8	1.088104

Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.

Starting with the full set of variables, I took a 3 phase backwards elimination approach:

Visually going through the variables and removing those that will be co-variants (e.g. the various basement fields) and those that won’t be useful in prediction (PID, Pool.Qual, Year/Month Sold etc.).
Working through ANOVA outputs with VIF analysis to find optimum ANOVA score while eliminating high VIF scores to minimise collinearity.
Applying top 20 remaining variables determined by ANOVA score and iteratively applying HPM selection until no further variables were eliminated to create final model.

Final model selected the following variables:

area	Neighborhood	MS.Zoning	Overall.Qual	House.Style	Lot.Shape	Bldg.Type	Bsmt.Qual	Bsmt.Full.Bath	Exter.Cond
Foundation	Lot.Area	Heating	Lot.Config	Kitchen.Qual	Garage.Cars	log.age	log.remod.age	Paved.Drive

Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.

RMSE for the training data is lower than for the testing data, however not noticeably so - out-of-sample data will normally be larger than for the data it was modelled on, but only if the difference is large is it an indicator of a poorly fitting model. No further adjustments resulting from out-of-sample testing were required.

Prior to removing non-normal sales from the training data, this was the reverse and the RMSE was actually higher for the training data as the test (and validation) set includes only normal sales - the training data was producing large residuals from the non-normal observations.

After initial results, I went back to the creation of the final model and implemented a loop to keep applying BPM selection until no further variables were eliminated. This improved RMSE for both data sets.

Part 4 Final Model Assessment

Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.

Four plots relating to residuals are shown below:

The first pair plot the residuals for each observed price (train and test data sets). Left is the full data set, right is cropped to remove the most extreme 1% values from either ends of the axes. It shows there are fewer extreme errors than the initial model. Test performs better for higher priced houses (higher than $400K) than Train though this can also be put down to lack of data points in this range. Both deteriorate rapidly for houses more than 1 standard deviation below the mean price. Test data looks to be consistently over-predicting house prices across the IQR (residual is observation minus prediction, negative residual indicates predicted is greater than observed).
The second pair (full and cropped as per #1) show residuals as a proportion of the observed price. This shows that, proportional to house price, the model performs relatively well for all but lower price houses and shows the lower price deterioration more clearly.
Histogram of residuals shows distributions that are peaked with long tails. Note, this graph is clipped to only show residuals of up to $100K in either direction.
The QQ plot indicates a distribution that is both peaked and leptokurtic (fat-tailed, from the more extreme residuals).

Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.

RMSE for Train is approximately 10% lower than for Test ($18,515 vs $20,282). This suggests that overfitting is not a problem for the final model.

The initial model has an RMSE of $22,506 and $24,044 for the Test & Train data sets, the final model is a considerable improvement.

residuals.final %>%
    group_by(Data) %>%
    summarise(RMSE = dollar_format()(sqrt(mean((Residual*1e5)^2)))) %>%
    kbl() %>% kable_styling(bootstrap_options = "condensed", full_width = F, position = "left")

Data	RMSE
Test	$20,281.73
Train	$18,515.00

Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?

Strengths:

Final RMSE is lower than the initial model indicating it is a better fit to both sets of data.
The high posterior probability of the final BMA model is a very good indication of the strength of the model chosen.
The model performs reasonably well over the IQR.

Weaknesses:

The model suffers from a one-size-fits-all approach. We’re attempting to force a linear model across the entire price range where relationships are not necessarily uniform across the price spectrum, and variables have changing significance across that range. This could be improved by splitting the datasets at (say) $350,000 and creating two models, possibly a third for houses valued less than $100,000.
Outside of the IQR, the model increasingly undervalues lower priced houses, and overvalues higher priced houses.
The scale of residual error can be quite large. At the median, house prices predictions can be in error by up to 25% and by up to almost 80% at the lower extreme of prices.
While the goal of the model is to predict prices under normal sale conditions, it couldn’t be used for cases where sale conditions were not normal. A separate model for those conditions would need to be built.
The bas.lm() function drops factor levels for those levels where there is no data. When predicting with new data sets that contain data in those dropped levels, the method fails with an error. This is quite a limitation as those observations need to be filtered out and cannot be predicted. They would need to be added to the training data and the model recalibrated in order to include in future predictions. This causes an issue where prior predictions are no longer equitable with new predictions.

Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

RMSE for the Validation data is $19,258, compared with $18,235 & $21,120 for Train and Test. Validation performs better than Test in this respect.

93.48% of predictions for the validation data set lie in the 95% confidence level. With this result, the model would be rejected and go back for more work. The figures for Train & Test are 95.80% & 92.72% respectively.

predict.validation.final <- predict(
    model_final, 
    newdata = df_final_validation, 
    estimator = "HPM", se.fit = T)

residuals.final.validation %>%
    group_by(Data) %>%
    summarise(RMSE = dollar_format()(sqrt(mean((Residual*1e5)^2)))) %>%
    kbl() %>% kable_styling(bootstrap_options = "condensed", full_width = F, position = "left")

Data	RMSE
Test	$20,281.73
Train	$18,515.00
Validation	$19,258.49

train.confint <- exp(confint(predict.train.final, parm = "pred"))/1e5
test.confint <- exp(confint(predict.test.final, parm = "pred"))/1e5
validation.confint <- exp(confint(predict.validation.final, parm = "pred"))/1e5
all_confints <- as.data.frame(Data = "Train", train.confint[,]) %>%
    add_row(
        as.data.frame(Data = "Test", test.confint[,])
    ) %>%
    add_row(
        as.data.frame(Data = "Validation", validation.confint[,])
    ) 
    
cbind(residuals.final.validation, all_confints) %>% 
    mutate(in_ci = ifelse(Observed >= `2.5%` & Observed <= `97.5%`,TRUE,FALSE)) %>% 
    group_by(Data) %>%
    summarize(`% in CI` = round(mean(in_ci)*100,2)) %>%
    kbl() %>% kable_styling(bootstrap_options = "condensed", full_width = F, position = "left")

Data	% in CI
Test	92.72
Train	95.80
Validation	93.48

Part 5 Conclusion

Overall, using Bayesian Inference methods with the Highest Probability Model, the model provides a good prediction of models for most of the data, however the tails (particularly the lower tail) perform poorly when considering the residual as a proportion of the observed data.

While the Training data fits 95.8% of data within the 95% confidence interval, out-of-sample fitted data didn’t quite meet the 95% threshold indicating the model needs a bit more work to see if this could be met.

The model suffers from a one-size-fits-all approach. We’re attempting to force a linear model across the entire price range where relationships are not necessarily uniform across the price spectrum, and variables have changing significance across that range. This could be improved by splitting the datasets at (say) $350,000 and creating two models, possibly a third for houses valued less than $100,000.

Lessons learnt in this exercise:

It’s important to think of the goal for the model and only include data in the training that is relevant to that goal. For this reason, non-normal sale condition data was omitted from training the model.
Missing data needs to be considered and dealt with appropriately. Many of the categorical variables used NA to indicate ‘None’ or ‘Not Applicable’, which are very different things. Converting NA‘s in these categories enabled a much bigger data set to be included in training. Similarly, converting NA’s to zero in the continuous columns that relate to these ’Not Applicable’ cases kept a much larger training set.
Sometimes, where the influence of variables change considerably across the spectrum of data, a modelling problem may need to be split into sections. In this case, what drives a house price in the lower end of the sector is not necessarily relevant in the higher end and vice versa.
Prediction is better where there is the bulk of the mass, tails are not so reliable.
Elimination of collinear variables makes a large impact on the accuracy of the data, as does removing variables with little information that may be incorrectly identified as significant factors by modelling algorithms.
EDA showed stronger linear relationships between log of price and the logs of area and Lot.Area, however during model variable selection, applying the log of both of these significantly decreased their influence. Always test before and after log before assuming the EDA is applicable.

Appendix

Code Blocks

The following code blocks were used but not displayed inline.

Graphs in Part 1 EDA:

df_log.age <- data.frame(age = log(2011- ames_train$Year.Built)) 
log.age.mu = mean(df_log.age$age)
log.age.sd = sd(df_log.age$age)

plot_age <- df_log.age %>% 
    ggplot(aes(x=age)) +
    geom_histogram(
        alpha = 0.8, 
        bins = 30, 
        aes(y = stat(density, ..scaled..)),
        position="identity", 
        fill="orange", 
        col="darkgrey") +
    stat_function(
        fun = dnorm, 
        aes(colour = "Normal"), 
        size = 1.2, 
        alpha = 0.6,
        args = list(
            mean = log.age.mu, 
            sd = log.age.sd
            )
        ) +
    geom_line(
        aes(y = ..density.., colour = "Observed"), 
        stat = 'density', 
        size = 1.5, 
        alpha = 0.8) + 
    theme_minimal() +
    scale_color_manual(name="Distribution", values=c(Normal="blue", Observed="red")) +
    theme(plot.title = element_text(hjust = 0.5), legend.position="bottom") +
    labs(title = "Distribution of Log of House Age at 2011", y = "Density", x="Log of House Age")

nei_colours <- colorRampPalette(brewer.pal(4, "Set1"))(27)

plot_price_year <- ames_train %>% 
    group_by(Neighborhood, Yr.Sold) %>%
    summarise(Med = median(log(price)))  %>%   
    ggplot(aes(x=Yr.Sold, y=Med, colour=Neighborhood)) +
    geom_line(alpha = 0.8, size=1) +
    theme_minimal() +
    scale_color_manual(values = nei_colours)  +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(
        title = "Change in House Sales Prices by Year and Neighbourhood", 
        y = "Log of Sales Price", 
        x="Year"
        )
    
# Calculate CV for each neighbourhood, create factor in descending order of CV 
neighbourhood_price_variation <- ames_train %>% 
    group_by(Neighborhood) %>% 
    summarise(
        Mean = mean(price)/1e5, 
        SD = sd(price)/1e5
        ) %>%
    mutate(CV = SD/Mean) %>%
    arrange(desc(CV)) %>%
    mutate(Neighborhood = factor(Neighborhood, unique(Neighborhood))) %>%
    arrange(desc(Neighborhood))

nei_plot <- ames_train %>% 
    select(Neighborhood, price) %>%
    mutate(Neighborhood = factor(Neighborhood, neighbourhood_price_variation$Neighborhood)) %>% 
    ggplot(aes(as.factor(Neighborhood), price/1e5), fill=Neighborhood) +
    theme_minimal() +
    geom_boxplot(aes(fill=Neighborhood)) +
    geom_violin(aes(colour=Neighborhood), width=1.5, alpha=0.3) + 
    coord_flip() +
    theme(
        plot.title = element_text(hjust = 0.5), 
        legend.position="none", 
        plot.caption = element_text(face="italic")
    ) +
    scale_color_manual(values = nei_colours) +
    labs(
        title = "Distribution of House Prices by Neighbourhood", 
        caption = "Neighbourhoods listed in descending order of coefficient of variability",
        x = NULL, 
        y="Price ($'00,000)"
    )

Transforming the Ames data

# deal with factor NAs (either NA or empty string) - replace with 'NONE'
replace_factor_na <- function(x){
    x <- as.character(x)
    x <- if_else((is.na(x) | (x=="")), 'NONE', x)
    x <- as.factor(x)
}
df_train <- ames_train %>%
    mutate_at(
        vars(Alley, BsmtFin.Type.1, BsmtFin.Type.2, Bsmt.Qual, Bsmt.Cond, Bsmt.Exposure,
             Fireplace.Qu, Garage.Type, Garage.Finish, Mas.Vnr.Type,
             Garage.Qual, Garage.Cond, Pool.QC, Fence, Misc.Feature), 
        replace_factor_na
    ) 

# drop month/year sold - doesn't help predict a sale value unless using it to adjust sale price
df_train <- df_train %>% select(-Yr.Sold, -Mo.Sold)

# drop utilities, only one value in whole db
df_train <- df_train %>% select(-Utilities)

#deal with continuous na's 

# 48 observations miss garage age  due to not having a garage
# for purposes of EDA, assume those values are the same as house age
# revisit if garage age is significant influencer
df_train <- df_train %>%
    mutate(Garage.Yr.Blt = ifelse(is.na(Garage.Yr.Blt), Year.Built, Garage.Yr.Blt))

# 1 in 7 are missing Lot.Frontage - EDA, found that
# RLot.Frontage looks to be significant influencer
# => strong liner relationship between area and frontage found (after removing outliers)
# => replace NAs with calculated value from area
# <30,000 represents roughly 98% of values
df_train %>%
    filter(Lot.Frontage>0 & Lot.Area<30000) %>%
    ggplot(aes(x=Lot.Area, y=Lot.Frontage)) +
    geom_point() +
    geom_smooth(method = "lm")

# find slope (given by covariance) and intercept
slope <- function(x, y){
    return(cov(x, y)/var(x))
}

intercept <- function(x, y, m){
    b <- mean(y) - (m * mean(x))
    return(b)
}

lf.parameters <- df_train %>%
    filter(Lot.Frontage>0 & Lot.Area<30000) %>%
    summarise(
        slope = slope(Lot.Area, Lot.Frontage),
        intercept = intercept(Lot.Area, Lot.Frontage, slope)
    )

# confirm with lm()
lot.frontage.lm <- lm(Lot.Area ~ Lot.Frontage, 
                      data = (df_train %>% filter(Lot.Frontage>0 & Lot.Area<20000)))
summary(lot.frontage.lm)

calc_frontage <- function(LotArea) {
    lf.parameters$intercept + lf.parameters$slope * LotArea
}

df_train <- df_train %>%
    mutate(Lot.Frontage = ifelse(is.na(Lot.Frontage), calc_frontage(Lot.Area), Lot.Frontage))

# Mas.Vnr.Area relates to Masonry veneer. Those that have none will have 0.
df_train <- df_train  %>%
    replace_na(list(Mas.Vnr.Area=0))

# Only two observations remain with NAs - for the sake of EDA, set these to 0 also
df_train <- df_train  %>%
    replace(is.na(.), 0)

# Perform a basic ANOVA to find influencers to use on outlier test
ames.lm <- lm(log(price) ~ ., data = df_train)
ames_anova <- anova(ames.lm)
potential_vars <- (
        ames_anova %>%
            rownames_to_column(var = "Variable") %>%
            slice_max(`Pr(>F)`, n=10) 
    )$Variable

n <- nrow(df_train)
ames_k <- qnorm(0.5 + 0.5 * 0.95 ^ (1 / n))
ames.lm <- lm(log(price) ~ ., data = (df_train %>% select(price, all_of(potential_vars))))
outliers <- BAS::Bayes.outlier(ames.lm, k = ames_k)
head(sort(outliers$prob.outlier, decreasing=TRUE), 5)

# 3 observations can be counted as outlier
df_train$outlier = outliers$prob.outlier

df_train %>%
    filter(outlier>0.975) %>%
    select(PID, price, Year.Built, Year.Remod.Add, area, Overall.Cond, Exterior.1st, 
           Exter.Cond, Bsmt.Cond, BsmtFin.Type.1, Central.Air, Garage.Cond, Sale.Condition) %>%
    kbl() %>% kable_styling(bootstrap_options = c("condensed"), font_size = 10)

#remove outliers from training dataset
df_train <- df_train %>%
    filter(!(PID %in% (df_train %>% filter(outlier>0.975))$PID))

# filter sale.condition on normal then drop, only one value in whole db
df_train <- df_train %>% 
    filter(Sale.Condition == "Normal") %>%
    select(-Sale.Condition)

Test and Validate data sets are similarly transformed.

Section 2.3 Plotting Residuals

residuals <- tibble(
    Observed = df_train$price/1e5,
    Predicted.HPM = as.numeric(bas_ames.ZS.HPM$Ypred[1,]),
    Predicted.BIC = as.numeric(ames.lm.BIC1$fitted.values),
    Predicted.Lasso = as.numeric(ames.lasso.pred)
    ) %>%
    mutate(Predicted.HPM = exp(Predicted.HPM)/1e5) %>%
    mutate(Predicted.BIC = exp(Predicted.BIC)/1e5) %>%
    mutate(Predicted.Lasso = exp(Predicted.Lasso)/1e5) %>%
    mutate(HPM = Observed - Predicted.HPM) %>%
    mutate(BIC = Observed - Predicted.BIC)  %>% 
    mutate(Lasso = Observed - Predicted.Lasso)  %>% 
    pivot_longer(
        cols = c(Lasso, BIC, HPM), 
        values_to = "Residual", 
        names_to = "Model"
    ) %>% 
    mutate(Model = factor(Model, levels = c("Lasso", "BIC", "HPM")))

price.mu = mean(df_train$price)/1e5
price.sd = sd(df_train$price)/1e5
g_res <- residuals %>%
    ggplot(aes(x = Observed, y = Residual, colour = Model)) +
    geom_hline(yintercept = 0) +
    geom_vline(xintercept = price.mu, linetype="longdash", alpha=0.5) +
    geom_vline(xintercept = price.mu-price.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_vline(xintercept = price.mu+price.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_point(alpha = 0.3, size=2) +
    geom_smooth(se = F, size = 1.2) +
    theme_minimal() +
    scale_color_manual(values = c("#80B1D3", "#B3DE69", "#FF7F00")) +
    theme(
        plot.title = element_text(hjust = 0.5), 
        plot.margin = margin(20,10,10,10)
        ) +
    labs(
        title = "Residual vs Observed Values for BIC and MPM Models ($'00,000)"
    )

residuals.train.HPM <- residuals %>% filter(Model=="HPM")
price98 <- c(quantile(residuals.train.HPM$Observed, .01), quantile(residuals.train.HPM$Observed, .99))
res98 <- c(quantile(residuals.train.HPM$Residual, .01), quantile(residuals.train.HPM$Residual, .99))

g_res
g_res + coord_cartesian(xlim=price98, ylim = res98) +
    labs(
        caption = "Axes cropped to show middle 98% of test data in both directions."
    )

Section 2.5 Summing and Plotting Residuals Test Data

residuals.intial <- tibble(
    Data = "Train",
    Observed = df_train$price/1e5,
    Predicted = as.numeric(predict.train.initial$Ypred[1,])
) %>%
    add_row(
        Data = "Test",
        Observed = df_test$price/1e5,
        Predicted = as.numeric(predict.test.initial$Ypred[1,])
    ) %>%
    mutate(Predicted = exp(Predicted)/1e5) %>%
    mutate(Residual = Observed - Predicted) 

residuals.intial <- tibble(
    Data = "Train",
    Observed = df_train$price/1e5,
    Predicted = as.numeric(predict.train.initial$Ypred[1,])
) %>%
    add_row(
        Data = "Test",
        Observed = df_test$price/1e5,
        Predicted = as.numeric(predict.test.initial$Ypred[1,])
    ) %>%
    mutate(Predicted = exp(Predicted)/1e5) %>%
    mutate(Residual = Observed - Predicted) 

residuals.intial %>% 
    mutate(Residual = (Residual)*1e5) %>%
    group_by(Data) %>%
    summarise(
        RMSE=sqrt(mean(Residual^2)),
        mean=round(mean(abs(Residual)),2),
        median=median(abs(Residual)), 
        sd=round(sd(abs(Residual)),2),
        max=max(abs(Residual)),
        q25=quantile(abs(Residual), 0.25),
        q75=quantile(abs(Residual),0.75),
        skew=round(skewness(Residual),2)) %>%
    mutate(across(RMSE:q75, dollar_format())) %>%
    kbl(
        caption = "<center><h4><strong>Distribution of Absolute Residuals</strong></h4></center>",
        escape = F
    ) %>% 
    kable_styling(bootstrap_options = c("condensed"))

test.price98 <- c(quantile(residuals.intial$Observed, .01), quantile(residuals.intial$Observed, .99))
test.res98 <- c(quantile(residuals.intial$Residual, .01), quantile(residuals.intial$Residual, .99))

g_test_res <- residuals.intial %>%
    ggplot(aes(x = Observed, y = Residual, colour=Data)) +
    geom_vline(xintercept = price.mu, linetype="longdash", alpha=0.5) +
    geom_vline(xintercept = price.mu-price.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_vline(xintercept = price.mu+price.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_point(alpha = 0.6, size=2) +
    geom_smooth(se = F, size = 1.5) +
    theme_minimal() +
    geom_hline(yintercept = 0) +
    scale_colour_manual(values=c("#33A02B", "#67a9d5")) +
    theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(20,10,30,10)) +
    labs(
        title = "Residual Values for HPM Model for Train and Test Data ($'00,000)",
    ) + 
    ylab("Residual") + xlab("Observed") 

g_test_res
g_test_res + coord_cartesian(xlim=test.price98, ylim = test.res98) +
    labs(
        caption = "Axes cropped to show middle 98% of test data in both directions."
    )

residual.mu = mean(residuals.intial$Residual)
residual.sd = sd(residuals.intial$Residual)
test.res99 <- c(quantile(residuals.intial$Residual, .005), quantile(residuals.intial$Residual, .995))

residuals.intial %>%
    ggplot(aes(x=Residual, fill=Data, colour=Data)) +
    geom_histogram(
        alpha = 0.3, 
        bins = 30, 
        aes(y = stat(density)),
        position="identity", 
        col="darkgrey"
        ) +
    geom_vline(
        aes(xintercept = median(Residual[Data=="Test"]), color="Test"), 
        size = 1.2, 
        linetype="longdash"
    ) + 
    geom_vline(
        aes(xintercept = median(Residual[Data=="Train"]), color="Train"), 
        size = 1.2, 
        linetype="longdash"
    ) + 
    stat_function(
        fun = dnorm, 
        aes(colour = "Normal"), 
        size = 1.2, 
        linetype="longdash",
        args = list(
            mean = residual.mu, 
            sd = residual.sd
        )
    ) +
    geom_line(
        aes(y = ..density.., colour = Data), 
        stat = 'density', 
        size = 1.5
        ) + 
    theme_minimal() +
    scale_fill_manual(values=c("Test" = "#33A02B", "Train" = "#1F78B4")) +
    scale_colour_manual(
        name="Density", 
        values = c("Normal" = "#E3211C", "Test" = "#33A02B", "Train" = "#1F78B4")
        ) + 
    theme(
        plot.title = element_text(hjust = 0.5), 
        legend.position="bottom"
        ) +
    labs(
        title = "Distribution of Train and Test Residuals Using HPM", 
        caption = "NOTE: the x-axis has been trunctated at $100K in both directions due to the excessive tails.
        Verticals represent median residuals for both data sets.",
        y = "Density", x="Residual ($'00,000)"
        ) + 
    coord_cartesian(xlim=c(-1,1))

Section 4.1 - Plotting Residuals

# bas drops empty levels from model
df_final_test <- df_test %>%
    filter(Foundation != "Wood") %>%
    filter(Exter.Cond != "Po")

predict.train.final <- predict(model_final, newdata = df_train, estimator = "HPM", se.fit = T)
predict.test.final <- predict(model_final, newdata = df_final_test, estimator = "HPM", se.fit = T)

z_score <- function(log.prices){
    predicts <- exp(log.prices)
    mu_train <- mean(predicts)
    sd_train <- sd(predicts)
    (predicts - mu_train)/sd_train
}

residuals.final <- tibble(
        Data = "Train",
        Observed = df_train$price/1e5,
        Predicted = as.numeric(predict.train.final$Ypred[1,]),
        Z_Score = z_score(predict.train.final$Ypred)[1,]
    ) %>%
    add_row(
        Data = "Test",
        Observed = df_final_test$price/1e5,
        Predicted = as.numeric(predict.test.final$Ypred[1,]),
        Z_Score = z_score(predict.test.final$Ypred)[1,]
    ) %>%
    mutate(Predicted = exp(Predicted)/1e5) %>%
    mutate(Residual = Observed - Predicted) 

price98 <- c(quantile(residuals.final$Observed, .01), quantile(residuals.final$Observed, .99))
res98 <- c(quantile(residuals.final$Residual, .01), quantile(residuals.final$Residual, .99))
observed.mu = mean(residuals.final$Observed)
observed.sd = sd(residuals.final$Observed)


g_final_res <- residuals.final %>%
    ggplot(aes(x = Observed, y = Residual, colour=Data)) +
    geom_vline(xintercept = observed.mu, linetype="longdash", alpha=0.5) +
    geom_vline(xintercept = observed.mu-observed.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_vline(xintercept = observed.mu+observed.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_point(alpha = 0.5, size=1) +
    geom_smooth(se = F, size = 1.5) +
    theme_minimal() +
    geom_hline(yintercept = 0) +
    scale_colour_manual(values=c("#33A02B", "#67a9d5")) +
    theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(20,10,10,10)) +
    labs(
        title = "Residual Values for Final Model for Train and Test Data ($'00,000)",
    ) + 
    ylab("Residual") + xlab("Observed") 

g_final_res_prop <- residuals.final %>%
    ggplot(aes(x = Observed, y = Residual/Observed, colour=Data)) +
    geom_vline(xintercept = observed.mu, linetype="longdash", alpha=0.5) +
    geom_vline(xintercept = observed.mu-observed.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_vline(xintercept = observed.mu+observed.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_point(alpha = 0.5, size=1) +
    geom_smooth(se = F, size = 1.5) +
    theme_minimal() +
    geom_hline(yintercept = 0) +
    scale_colour_manual(values=c("#33A02B", "#67a9d5")) +
    theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(20,10,10,10)) +
    labs(
        title = "Final Model Residual Values as Proportion of Observed Values",
    ) + 
    ylab("Residual/Observed") + xlab("Observed ($'00,000)") 

hist_res_final <- residuals.final %>%
    ggplot(aes(x=Residual, fill=Data, colour=Data)) +
    geom_histogram(
        alpha = 0.3, 
        bins = 30, 
        aes(y = stat(density)),
        position="identity", 
        col="darkgrey"
    ) +
    geom_vline(
        aes(xintercept = median(Residual[Data=="Test"]), color="Test"), 
        size = 1.2, 
        linetype="longdash"
    ) + 
    geom_vline(
        aes(xintercept = median(Residual[Data=="Train"]), color="Train"), 
        size = 1.2, 
        linetype="longdash"
    ) + 
    stat_function(
        fun = dnorm, 
        aes(colour = "Normal"), 
        size = 1.2, 
        linetype="longdash",
        args = list(
            mean = residual.mu, 
            sd = residual.sd
        )
    ) +
    geom_line(
        aes(y = ..density.., colour = Data), 
        stat = 'density', 
        size = 1.5
    ) + 
    theme_minimal() +
    scale_fill_manual(values=c("Test" = "#33A02B", "Train" = "#1F78B4")) +
    scale_colour_manual(
        name="Density", 
        values = c("Normal" = "#E3211C", "Test" = "#33A02B", "Train" = "#1F78B4")
        ) + 
    theme(plot.title = element_text(hjust = 0.5), legend.position="bottom") +
    labs(title = "Distribution of Residuals Using Final Model", y = "Density", x="Residual ($'00,000)") + 
    coord_cartesian(xlim=c(-1,1)) 

qq_res_final <- residuals.final %>% ggplot(aes(colour=Data)) +
    geom_qq(aes(sample = Residual/Z_Score), alpha = 0.4) +
    scale_color_manual(values=c("Test" = "#33A02B", "Train" = "#1F78B4")) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5), legend.position="bottom") +
    labs(title = "QQ-Plot of Standardised Residuals") + 
    labs(x = "Theoretical quantiles", y = "Standardized residuals") + 
    coord_cartesian(ylim=c(-7.5,7.5), xlim=c(-3,3))

Section 4.4: Plotting Validation Residuals

residuals.validation <- tibble(
        Data = "Validation",
        Observed = df_final_validation$price/1e5,
        Predicted = as.numeric(predict.validation.final$Ypred[1,]),
        Z_Score = z_score(predict.validation.final$Ypred)[1,]
    ) %>%
    mutate(Predicted = exp(Predicted)/1e5) %>%
    mutate(Residual = Observed - Predicted)

residuals.final.validation <- rbind(residuals.final, residuals.validation)

price98 <- c(quantile(residuals.validation$Observed, .01), quantile(residuals.validation$Observed, .99))
res98 <- c(quantile(residuals.validation$Residual, .01), quantile(residuals.validation$Residual, .99))
observed.mu = mean(residuals.validation$Observed)
observed.sd = sd(residuals.validation$Observed)

residuals.final.validation %>%
    ggplot(aes(x = Observed, y = Residual/Observed, colour=Data)) +
    geom_vline(xintercept = observed.mu, linetype="longdash", alpha=0.5) +
    geom_vline(xintercept = observed.mu-observed.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_vline(xintercept = observed.mu+observed.sd, linetype="dotted", alpha=0.4, size=1) +
    geom_point(alpha = 0.6, size=2) +
    geom_smooth(se = F, size = 1.5) +
    theme_minimal() +
    geom_hline(yintercept = 0) +
    scale_colour_manual(values=c("#33A02B", "#67a9d5", "orange")) +
    theme(plot.title = element_text(hjust = 0.5), plot.margin = margin(20,10,10,10)) +
    labs(
        title = "Residual Values for All Data Sets as proportion of Observed ($'00,000)",
    ) + 
    ylab("Residual") + xlab("Observed") + coord_cartesian(xlim=price98, ylim = c(-0.5,0.5)) +
    labs(
        caption = "Axes cropped to show middle 98% of test data in both directions.
        Vertical Lines are mean and plus/minus one standard devation for the Validation Data only."
    )

Peer Assessment II