Info

This file will introduce you to model development, testing, and validation by using frequentist and Bayesian approaches.

In this scenario, we play the role of a consultant hired by a real estate investment firm in Ames, Iowa, a Midwestern town in the United States. Our job is to analyze data to help the firm decide how to invest for the highest profits, and to quantify and communicate to company management what types of real estate properties are good investments and why.

Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

Part 1 - Exploratory Data Analysis (EDA)


Two factors stand out: first, location; second, size (area).

On one hand, there is a clear relationship between price and location: the better the neighborhood, the higher the price.

On the other hand, we can observe that price per square foot goes down as area increases, for a given location. We should also mention that more expensive neighborhoods have larger properties. That circumstance skews any naive comparison of mean/median price per square foot across neighborhoods.

For convenience, we collect all the data transformations in one block and discuss them below.
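The transformation chunk itself is not echoed in this report. A minimal sketch of what it presumably contains is below; the AgeGroup cut points, the reference year for Age.Blt, and the Functional.Tr definition are assumptions based on the EDA that follows.

library(dplyr)

ames_train <- ames_train %>%
  mutate(price.log     = log(price),         # log of sale price
         area.log      = log(area),          # log of living area
         PSF           = price / area,       # price per square foot
         Age.Blt       = 2010 - Year.Built,  # approximate home age (assumed reference year)
         AgeGroup      = cut(Age.Blt,        # assumed group boundaries
                             breaks = c(-Inf, 40, 80, Inf),
                             labels = c("New", "Medium", "Old")),
         Functional.Tr = (Functional == "Typ"))  # assumed: typical-functionality flag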


Distributions of price, area, and price per square foot, followed by the log transforms.

## price:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129762  159467  181190  213000  615000
## area:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1092    1411    1477    1743    4676
## PSF:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.4   102.5   121.2   123.3   141.2   276.2

## price.log:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.46   11.77   11.98   12.02   12.27   13.33
## area.log:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.81    7.00    7.25    7.24    7.46    8.45

Price and area are both right-skewed. We can use derived variables such as log(price), log(area), or price/area (price per square foot, PSF) to make the distributions more symmetric.

Neighborhood

## # A tibble: 5 x 5
##   Neighborhood `median(price)` `IQR(price)` `median(PSF)` `IQR(PSF)`
##   <fct>                  <dbl>        <dbl>         <dbl>      <dbl>
## 1 Blueste               123900        10250         113.        3.55
## 2 OldTown               120000        29550          92.4      33.3 
## 3 IDOTRR                 99500        48900          86.8      32.0 
## 4 BrDale                 98750        16725          92.9      11.7 
## 5 MeadowV                85750        20150          83.5      41.2
## # A tibble: 5 x 5
##   Neighborhood `median(price)` `IQR(price)` `median(PSF)` `IQR(PSF)`
##   <fct>                  <dbl>        <dbl>         <dbl>      <dbl>
## 1 StoneBr              340692.      151358           166.       56.7
## 2 NridgHt              336860       148800           159.       45.8
## 3 NoRidge              290000        50312.          128.       19.6
## 4 GrnHill              280000        50000           199.       21.1
## 5 Timber               232500       151200           161.       41.9
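The tables above were presumably produced with a grouped summary along these lines (the ordering and slicing are assumptions; the unnamed summarise() calls explain the backticked column names):

by_nbhd <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(median(price), IQR(price), median(PSF), IQR(PSF)) %>%
  arrange(desc(`median(price)`))

tail(by_nbhd, 5)  # the five cheapest neighborhoods (first table)
head(by_nbhd, 5)  # the five most expensive neighborhoods (second table)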

The graphs show a clear relationship between home price and neighborhood. Ordering the neighborhoods from MeadowV up to StoneBr, the spread of the price distribution within each group grows along with the average price. By contrast, the distributions of home price per sq. ft. look more uniform across neighborhood groups. This points to significant differences in the area distributions between neighborhoods.

Let’s review the relationship between price and area.


As expected, we see a clear positive relationship between home price and area. We should note that the spread of the residuals grows with both price and area. On the other hand, there is no clear relationship between home price per sq. ft. and area; it is hard to say what relationship, if any, exists.

However, if we construct these plots separately for each neighborhood, we get a much clearer relationship for both variables, and the behavior differs by neighborhood. We show two illustrative cases: one of the most expensive neighborhoods, “Northridge Heights”, and one of the budget districts, “Iowa DOT and Rail Road”.

Year.Built / Age.Blt by Neighborhood


Mostly, price and PSF behave the same way: as age increases, price decreases, except for a few cases such as “OldTown”. The strength of the relationship also differs between neighborhoods: roughly, in neighborhoods with a relatively small spread of ages, price changes more steeply with age. The distribution of home age is right-skewed with three visible peaks: most homes are around 15 years old, with smaller groups centered around 60 and 100 years. The most expensive neighborhoods usually fall into the first group, while all others are roughly spread across all groups. This leads us to the same conclusion: different approaches may be needed for different price classes of properties.


Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

Section 2.1 An Initial Model


Let’s build the full models that we expect to be most promising.

# Bayesian approach (bas.lm from the BAS package)
# BIC
m.BIC.full.baslm.price.log <- bas.lm(price.log ~ Neighborhood + area.log * AgeGroup + Age.Blt * AgeGroup +
                               Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
                               Bsmt.Full.Bath + Functional.Tr + Sale.Condition, 
                             data = ames_train,
                             prior = "BIC", 
                             modelprior = uniform(),
                             method = "MCMC", MCMC.iterations = 10 ^ 5)

m.BIC.full.baslm.PSF <- bas.lm(PSF ~ Neighborhood + area * AgeGroup + Age.Blt * AgeGroup +
                             Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
                             Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
                           data = ames_train,
                           prior = "BIC", 
                           modelprior = uniform(),
                           method = "MCMC", MCMC.iterations = 10 ^ 5)
# AIC
m.AIC.full.baslm.price.log <- bas.lm(price.log ~ Neighborhood + area.log * AgeGroup + Age.Blt * AgeGroup +
                               Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
                               Bsmt.Full.Bath + Functional.Tr + Sale.Condition, 
                             data = ames_train,
                             prior = "AIC", 
                             modelprior = uniform(),
                             method = "MCMC", MCMC.iterations = 10 ^ 5)

m.AIC.full.baslm.PSF <- bas.lm(PSF ~ Neighborhood + area * AgeGroup + Age.Blt * AgeGroup +
                             Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
                             Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
                           data = ames_train,
                           prior = "AIC", 
                           modelprior = uniform(),
                           method = "MCMC", MCMC.iterations = 10 ^ 5)

# Frequentist approach
m.full.lm.price.log <- lm(price.log ~ Neighborhood + area.log * AgeGroup + Age.Blt * AgeGroup +
                        Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
                        Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
                    data = ames_train)

m.full.lm.PSF <- lm(PSF ~ Neighborhood + area * AgeGroup + Age.Blt * AgeGroup +
                      Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure + 
                      Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
                    data = ames_train)

Let’s get the summary and ANOVA output for m.full.lm.price.log.
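The chunk itself is hidden; it consists of the standard accessors (the coefficient table is printed directly below, the ANOVA table after the discussion):

summary(m.full.lm.price.log)
anova(m.full.lm.price.log)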

## 
## Call:
## lm(formula = price.log ~ Neighborhood + area.log * AgeGroup + 
##     Age.Blt * AgeGroup + Bldg.Type + House.Style + Overall.Qual + 
##     Bsmt.Exposure + Bsmt.Full.Bath + Functional.Tr + Sale.Condition, 
##     data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6608 -0.0630  0.0012  0.0648  0.5870 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              6.14919    0.46374   13.26 < 0.0000000000000002 ***
## NeighborhoodNridgHt     -0.06369    0.04115   -1.55              0.12204    
## NeighborhoodNoRidge     -0.06787    0.04933   -1.38              0.16920    
## NeighborhoodGrnHill      0.40875    0.11256    3.63              0.00030 ***
## NeighborhoodTimber       1.34936    0.46616    2.89              0.00388 ** 
## NeighborhoodSomerst     -0.08970    0.04121   -2.18              0.02975 *  
## NeighborhoodGreens       1.32242    0.46230    2.86              0.00432 ** 
## NeighborhoodVeenker      1.31285    0.46615    2.82              0.00496 ** 
## NeighborhoodCrawfor      1.06353    0.48201    2.21              0.02759 *  
## NeighborhoodCollgCr      1.29236    0.46409    2.78              0.00546 ** 
## NeighborhoodBlmngtn     -0.16960    0.05943   -2.85              0.00442 ** 
## NeighborhoodClearCr      1.40153    0.46846    2.99              0.00285 ** 
## NeighborhoodNWAmes       1.29089    0.46684    2.77              0.00580 ** 
## NeighborhoodGilbert      1.28216    0.46577    2.75              0.00602 ** 
## NeighborhoodSawyerW      1.26901    0.46493    2.73              0.00646 ** 
## NeighborhoodMitchel      1.29738    0.46323    2.80              0.00520 ** 
## NeighborhoodNPkVill      1.31048    0.46302    2.83              0.00475 ** 
## NeighborhoodNAmes        1.28394    0.46332    2.77              0.00570 ** 
## NeighborhoodSawyer       1.29058    0.46325    2.79              0.00544 ** 
## NeighborhoodSWISU        0.84533    0.48163    1.76              0.07956 .  
## NeighborhoodEdwards      0.85046    0.47870    1.78              0.07595 .  
## NeighborhoodBrkSide      0.93640    0.47784    1.96              0.05033 .  
## NeighborhoodBlueste      1.30785    0.46472    2.81              0.00499 ** 
## NeighborhoodOldTown      0.87431    0.47920    1.82              0.06839 .  
## NeighborhoodIDOTRR       0.76613    0.47771    1.60              0.10910    
## NeighborhoodBrDale       1.20077    0.45987    2.61              0.00917 ** 
## NeighborhoodMeadowV      1.11671    0.45771    2.44              0.01488 *  
## area.log                 0.77504    0.06025   12.86 < 0.0000000000000002 ***
## AgeGroupMedium                NA         NA      NA                   NA    
## AgeGroupOld                   NA         NA      NA                   NA    
## Age.Blt                 -0.00995    0.00334   -2.97              0.00301 ** 
## Bldg.Type2fmCon         -0.05199    0.03569   -1.46              0.14552    
## Bldg.TypeDuplex         -0.19317    0.02899   -6.66    0.000000000045163 ***
## Bldg.TypeTwnhs          -0.12838    0.03590   -3.58              0.00037 ***
## Bldg.TypeTwnhsE         -0.07110    0.02436   -2.92              0.00360 ** 
## House.Style1.5Unf        0.05670    0.05603    1.01              0.31183    
## House.Style1Story        0.05673    0.02054    2.76              0.00586 ** 
## House.Style2.5Unf       -0.03157    0.05074   -0.62              0.53398    
## House.Style2Story       -0.04604    0.02087   -2.21              0.02760 *  
## House.StyleSFoyer        0.08665    0.03639    2.38              0.01745 *  
## House.StyleSLvl          0.00343    0.03118    0.11              0.91245    
## Overall.Qual             0.07340    0.00654   11.23 < 0.0000000000000002 ***
## Bsmt.ExposureGd          0.06718    0.02020    3.33              0.00092 ***
## Bsmt.ExposureMn         -0.04033    0.02093   -1.93              0.05425 .  
## Bsmt.ExposureNo         -0.03234    0.01554   -2.08              0.03770 *  
## Bsmt.ExposureNoBasement -0.15282    0.03715   -4.11    0.000042261941308 ***
## Bsmt.Full.Bath           0.07775    0.01000    7.78    0.000000000000019 ***
## Functional.TrTRUE        0.07275    0.02093    3.48              0.00053 ***
## Sale.ConditionAdjLand    0.16150    0.11146    1.45              0.14770    
## Sale.ConditionAlloca     0.21191    0.07744    2.74              0.00633 ** 
## Sale.ConditionFamily    -0.05314    0.04094   -1.30              0.19453    
## Sale.ConditionNormal     0.11684    0.01970    5.93    0.000000004233025 ***
## Sale.ConditionPartial    0.13200    0.02824    4.67    0.000003387783519 ***
## area.log:AgeGroupMedium -0.21349    0.06025   -3.54              0.00041 ***
## area.log:AgeGroupOld    -0.16640    0.06294   -2.64              0.00834 ** 
## AgeGroupMedium:Age.Blt   0.00709    0.00338    2.10              0.03637 *  
## AgeGroupOld:Age.Blt      0.00744    0.00339    2.20              0.02822 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.145 on 945 degrees of freedom
## Multiple R-squared:  0.888,  Adjusted R-squared:  0.882 
## F-statistic:  139 on 54 and 945 DF,  p-value: <0.0000000000000002

We got a relatively good multiple R-squared of 0.888. Now we briefly discuss the most important predictors. As expected from the EDA section, the model shows a positive relationship between area.log and price.log. For AgeGroupMedium and AgeGroupOld, the interaction terms impose an extra penalty on the area effect (the main effects themselves are not defined because of singularities). We should mention that the log(area) variable is statistically significant, given its very small p-value. The Neighborhood variable is categorical, and most of its levels are statistically significant too.

## Analysis of Variance Table
## 
## Response: price.log
##                    Df Sum Sq Mean Sq F value               Pr(>F)    
## Neighborhood       26  106.8    4.11  196.20 < 0.0000000000000002 ***
## area.log            1   30.6   30.62 1462.23 < 0.0000000000000002 ***
## Age.Blt             1    4.0    3.98  190.00 < 0.0000000000000002 ***
## Bldg.Type           4    3.0    0.75   36.05 < 0.0000000000000002 ***
## House.Style         6    2.9    0.48   23.15 < 0.0000000000000002 ***
## Overall.Qual        1    4.9    4.93  235.33 < 0.0000000000000002 ***
## Bsmt.Exposure       4    1.4    0.36   17.26   0.0000000000001117 ***
## Bsmt.Full.Bath      1    1.3    1.32   62.85   0.0000000000000063 ***
## Functional.Tr       1    0.2    0.24   11.43              0.00075 ***
## Sale.Condition      5    1.3    0.26   12.37   0.0000000000121697 ***
## area.log:AgeGroup   2    0.3    0.14    6.63              0.00138 ** 
## AgeGroup:Age.Blt    2    0.1    0.05    2.44              0.08786 .  
## Residuals         945   19.8    0.02                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The most significant predictor is Neighborhood, which accounts for roughly 60% (106.8/176.7) of the total sum of squares. Next comes area.log, which accounts for roughly 17% of the variability in price.log.
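These shares can be read straight off the ANOVA table; a one-line sketch of the computation:

# fraction of the total sum of squares attributable to each term
a <- anova(m.full.lm.price.log)
round(a[["Sum Sq"]] / sum(a[["Sum Sq"]]), 2)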


Section 2.2 Model Selection


Let’s use Bayesian and Frequentist approaches.

# Bayesian approach

# BIC
# price.log
#BIC.BMA_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "BMA", se.fit = TRUE, nsim = 10000)
BIC.BPM_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "BPM", se.fit = TRUE, nsim = 10000)
BIC.HPM_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "HPM", se.fit = TRUE, nsim = 10000)
BIC.MPM_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "MPM", se.fit = TRUE, nsim = 10000)
# PSF
#BIC.BMA_pred_PSF <- predict(m.BIC.full.baslm.PSF, estimator = "BMA", se.fit = TRUE, nsim = 10000)
BIC.BPM_pred_PSF <- predict(m.BIC.full.baslm.PSF, estimator = "BPM", se.fit = TRUE, nsim = 10000)
BIC.HPM_pred_PSF <- predict(m.BIC.full.baslm.PSF, estimator = "HPM", se.fit = TRUE, nsim = 10000)
BIC.MPM_pred_PSF <- predict(m.BIC.full.baslm.PSF, estimator = "MPM", se.fit = TRUE, nsim = 10000)

# AIC
# price.log
#AIC.BMA_pred_price.log <- predict(m.AIC.full.baslm.price.log, estimator = "BMA", se.fit = TRUE, nsim = 10000)
AIC.BPM_pred_price.log <- predict(m.AIC.full.baslm.price.log, estimator = "BPM", se.fit = TRUE, nsim = 10000)
AIC.HPM_pred_price.log <- predict(m.AIC.full.baslm.price.log, estimator = "HPM", se.fit = TRUE, nsim = 10000)
AIC.MPM_pred_price.log <- predict(m.AIC.full.baslm.price.log, estimator = "MPM", se.fit = TRUE, nsim = 10000)
# PSF
#AIC.BMA_pred_PSF <- predict(m.AIC.full.baslm.PSF, estimator = "BMA", se.fit = TRUE, nsim = 10000)
AIC.BPM_pred_PSF <- predict(m.AIC.full.baslm.PSF, estimator = "BPM", se.fit = TRUE, nsim = 10000)
AIC.HPM_pred_PSF <- predict(m.AIC.full.baslm.PSF, estimator = "HPM", se.fit = TRUE, nsim = 10000)
AIC.MPM_pred_PSF <- predict(m.AIC.full.baslm.PSF, estimator = "MPM", se.fit = TRUE, nsim = 10000)
# ---

# Frequentist approach

# Stepwise selection (stepAIC from the MASS package)
# BIC
n <- nrow(ames_train)
m.BIC.lm.price.log <- stepAIC(m.full.lm.price.log, k = log(n), trace = F)
m.BIC.lm.PSF       <- stepAIC(m.full.lm.PSF, k = log(n), trace = F)
# AIC
m.AIC.lm.price.log <- stepAIC(m.full.lm.price.log, k = 2, trace = F)
m.AIC.lm.PSF       <- stepAIC(m.full.lm.PSF, k = 2, trace = F)

Generally, the models arrive at similar results despite the different approaches. We discuss this in more detail in the next section.

Here we should note that the BMA predictions are commented out in the code above because they are very RAM-intensive on the machine used to develop these models. We ran that method as well, and in our case the BMA models did not rank among the most accurate.


Section 2.3 Initial Model Residuals


Let’s construct credible and confidence intervals for our models’ predictions.

# Prediction

# Bayesian approach
# ---
# helper: extract credible intervals and combine them into one data frame
get.credible.intervals <- function(pred.baslm) {
  df <- data.frame(confint(pred.baslm, parm = "coef")[, 1:3],
                   confint(pred.baslm, parm = "pred")[, 1:2])
  colnames(df) <- c("low_mu", "high_mu", "pred", "low", "high")
  df[, c("low", "low_mu", "pred", "high_mu", "high")]
}
# BIC
#pred.BIC.BMA.baslm.price.log <- exp(get.credible.intervals(BIC.BMA_pred_price.log))
pred.BIC.BPM.baslm.price.log <- exp(get.credible.intervals(BIC.BPM_pred_price.log))
pred.BIC.HPM.baslm.price.log <- exp(get.credible.intervals(BIC.HPM_pred_price.log))
pred.BIC.MPM.baslm.price.log <- exp(get.credible.intervals(BIC.MPM_pred_price.log))
#pred.BIC.BMA.baslm.PSF       <- get.credible.intervals(BIC.BMA_pred_PSF)*ames_train$area
pred.BIC.BPM.baslm.PSF       <- get.credible.intervals(BIC.BPM_pred_PSF)*ames_train$area
pred.BIC.HPM.baslm.PSF       <- get.credible.intervals(BIC.HPM_pred_PSF)*ames_train$area
pred.BIC.MPM.baslm.PSF       <- get.credible.intervals(BIC.MPM_pred_PSF)*ames_train$area
# AIC
#pred.AIC.BMA.baslm.price.log <- exp(get.credible.intervals(AIC.BMA_pred_price.log))
pred.AIC.BPM.baslm.price.log <- exp(get.credible.intervals(AIC.BPM_pred_price.log))
pred.AIC.HPM.baslm.price.log <- exp(get.credible.intervals(AIC.HPM_pred_price.log))
pred.AIC.MPM.baslm.price.log <- exp(get.credible.intervals(AIC.MPM_pred_price.log))
#pred.AIC.BMA.baslm.PSF       <- get.credible.intervals(AIC.BMA_pred_PSF)*ames_train$area
pred.AIC.BPM.baslm.PSF       <- get.credible.intervals(AIC.BPM_pred_PSF)*ames_train$area
pred.AIC.HPM.baslm.PSF       <- get.credible.intervals(AIC.HPM_pred_PSF)*ames_train$area
pred.AIC.MPM.baslm.PSF       <- get.credible.intervals(AIC.MPM_pred_PSF)*ames_train$area

# Frequentist approach
# ---
pred.BIC.lm.price.log    <- data.frame(exp(predict(m.BIC.lm.price.log, ames_train, interval = "prediction")))
pred.AIC.lm.price.log    <- data.frame(exp(predict(m.AIC.lm.price.log, ames_train, interval = "prediction")))
pred.BIC.lm.PSF          <- data.frame(predict(m.BIC.lm.PSF, ames_train, interval = "prediction")*ames_train$area)
pred.AIC.lm.PSF          <- data.frame(predict(m.AIC.lm.PSF, ames_train, interval = "prediction")*ames_train$area)

We also develop a helper function for building residual plots from different perspectives.

# helper function for building residual plots
plot_resid <- function(df_ames = ames_train, 
                       df_ames_pred,
                       numeric_variables = c("price","pred","area","Age.Blt"),
                       categorical_variables = c("Neighborhood","Bldg.Type","House.Style","Overall.Qual",
                                                "Bsmt.Exposure","Bsmt.Full.Bath","Functional.Tr","Sale.Condition"),
                       indexPlot = TRUE,
                       pricepredPlot = TRUE, 
                       outliersindex) {
  
  df <- cbind(df_ames, df_ames_pred)

  if(!("low_mu" %in% colnames(df))) {
    df$low <- df$lwr
    df$low_mu <- df$lwr
    df$pred <- df$fit
    df$hight_mu <- df$upr
    df$hight <- df$upr
  }
  
  df$resid <- df$price - df$pred
  if (pricepredPlot == T) {
  gplot_price <- ggplot(data = df) +
    geom_point(aes(x = pred, y = price), color = "blue", size = 0.6) +
    geom_line(aes(x = pred, y = low_mu, lty = "second")) +
    geom_line(aes(x = pred, y = high_mu, lty = "second")) +
    geom_line(aes(x = pred, y = low, lty = "third")) +
    geom_line(aes(x = pred, y = high, lty = "third")) +
    geom_line(aes(x = pred, y = pred, color = "orange")) +
    scale_colour_manual(values = c("orange"), labels = "Posterior mean", name = "") +
    scale_linetype_manual(values = c(2, 3), labels = c("95% CI for mean", "95% CI for predictions")
                          , name = "") +
    theme_bw() +
    theme(legend.position = c(1, 0), legend.justification = c(1.5, 0)) +
    ggtitle(paste0('Model uncertainty distribution: ',substitute(df_ames_pred))) +
    geom_point(data = df[outliersindex,], aes(x = pred, y = price), color = "orange", pch = 1, cex = 6)
  plot(gplot_price)
  }
  if (indexPlot == T) {
  gplot_index <- ggplot(data = df) +
    geom_point(aes(x = as.numeric(row.names(df)), y = resid), color = "blue", size = 0.6)  +
    geom_hline(yintercept=0, linetype="dashed") +
    ggtitle(paste0("Residuals by index")) +
    geom_point(data = df[outliersindex,], aes(x = as.numeric(row.names(df[outliersindex,])), y = resid), color = "orange", pch = 1, cex = 6)
  plot(gplot_index)
  }

  if (length(numeric_variables) != 0) {
  for (variable in numeric_variables) {
    df[,"variable"] <- df[,variable]
    gplot_numeric <- ggplot(data = df) +
      geom_point(aes(x = variable, y = resid), color = "blue", size = 0.6) +
      geom_hline(yintercept=0, linetype="dashed") +
      ggtitle(paste0("Residuals by <",variable,">")) +
      xlab(variable) +
      geom_point(data = df[outliersindex,], aes(x = variable, y = resid), color = "orange", pch = 1, cex = 6)
    print(gplot_numeric)
  }
  }

  if (length(categorical_variables) != 0) {
  for (variable in categorical_variables) {
    df[,"variable"] <- df[,variable]
    gplot_categorical <- ggplot(data = df) +
      geom_boxplot(aes(y = as.factor(variable), x = resid, color = as.factor(variable))) +
      ggtitle(paste0("Residuals by <",variable,">")) +
      ylab(variable)
    print(gplot_categorical)
  }
  }
}

Let’s identify outliers. For all models, we get the same 5 outlier cases: homes with an area above 3,500 sq. ft. or a price above $550k, with prediction errors up to about ±$250k.
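They were presumably flagged with a simple filter like this sketch (thresholds taken from the description above):

ames_train %>%
  filter(area > 3500 | price > 550000) %>%
  select(area, price, Neighborhood, Overall.Qual)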

## # A tibble: 5 x 4
##    area  price Neighborhood Overall.Qual
##   <int>  <int> <fct>               <int>
## 1  2470 615000 NridgHt                10
## 2  4676 184750 Edwards                10
## 3  3672 415000 Edwards                 6
## 4  2364 611657 NridgHt                 9
## 5  2338 591587 StoneBr                 9

Let’s briefly discuss the most promising models.

pred.AIC.lm.price.log

Increased inaccuracy, roughly:

  • ±50k for the “StoneBr”, “NridgHt”, “Timber”, “Somerst”, “Veenker” and “Crawfor” neighborhoods;
  • -150k to +50k for predicted prices above 250k;
  • -150k to +50k for new homes under 15 years old;
  • ±50k for “Overall.Qual” levels 8, 9 and 10.

Underestimates, roughly:

  • -70k for actual prices above 370k, up to about -140k around 600k;
  • -25k for 2 “Bsmt.Full.Bath”.

Overestimates: none.

pred.AIC.BPM.baslm.price.log

Increased inaccuracy, roughly:

  • ±50k for the “StoneBr”, “NridgHt”, “Timber”, “Somerst”, “ClearCr” and “Crawfor” neighborhoods;
  • ±50k for new homes under 15 years old;
  • the “1Fam” Bldg.Type.

Underestimates, roughly:

  • -25k for “Overall.Qual” levels 9 and 10;
  • -25k for 2 “Bsmt.Full.Bath”;
  • -50k for actual prices above 370k, up to about -150k around 600k;
  • -75k for predicted prices above 450k.

Overestimates, roughly:

  • +15k for the “2.5Unf” House.Style.

pred.AIC.MPM.baslm.price.log

Increased inaccuracy, roughly:

  • -75k to +50k for new homes under 15 years old;
  • ±50k for the “StoneBr”, “NridgHt”, “NoRidge”, “Timber” and “Crawfor” neighborhoods;
  • ±50k for “Overall.Qual” levels 8, 9 and 10.

Underestimates, roughly:

  • -50k for actual prices above 370k, up to about -130k around 600k;
  • -75k for predicted prices above 450k;
  • -25k for 2 “Bsmt.Full.Bath”.

Overestimates, roughly:

  • +25k for the “2.5Unf” House.Style.


Section 2.4 Initial Model RMSE


Let’s calculate the root mean squared error (RMSE) in original dollar values.

df_RMSE <- data.frame(RMSE = double())

# Bayesian approach
# BIC
#df_RMSE["pred.BIC.BMA.baslm.price.log","RMSE"] <- sqrt(mean((pred.BIC.BMA.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.BIC.BPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.BIC.BPM.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.BIC.HPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.BIC.HPM.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.BIC.MPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.BIC.MPM.baslm.price.log$pred - ames_train$price)^2))
#df_RMSE["pred.BIC.BMA.baslm.PSF","RMSE"]       <- sqrt(mean((pred.BIC.BMA.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.BIC.BPM.baslm.PSF","RMSE"]       <- sqrt(mean((pred.BIC.BPM.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.BIC.HPM.baslm.PSF","RMSE"]       <- sqrt(mean((pred.BIC.HPM.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.BIC.MPM.baslm.PSF","RMSE"]       <- sqrt(mean((pred.BIC.MPM.baslm.PSF$pred - ames_train$price)^2))

# AIC
#df_RMSE["pred.AIC.BMA.baslm.price.log","RMSE"] <- sqrt(mean((pred.AIC.BMA.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.AIC.BPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.AIC.BPM.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.AIC.HPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.AIC.HPM.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.AIC.MPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.AIC.MPM.baslm.price.log$pred - ames_train$price)^2))
#df_RMSE["pred.AIC.BMA.baslm.PSF","RMSE"]       <- sqrt(mean((pred.AIC.BMA.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.AIC.BPM.baslm.PSF","RMSE"]       <- sqrt(mean((pred.AIC.BPM.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.AIC.HPM.baslm.PSF","RMSE"]       <- sqrt(mean((pred.AIC.HPM.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.AIC.MPM.baslm.PSF","RMSE"]       <- sqrt(mean((pred.AIC.MPM.baslm.PSF$pred - ames_train$price)^2))

# Frequentist approach
df_RMSE["pred.BIC.lm.price.log","RMSE"] <- sqrt(mean((pred.BIC.lm.price.log$fit - ames_train$price)^2, na.rm =T))
df_RMSE["pred.AIC.lm.price.log","RMSE"] <- sqrt(mean((pred.AIC.lm.price.log$fit - ames_train$price)^2, na.rm =T))
df_RMSE["pred.BIC.lm.PSF","RMSE"]       <- sqrt(mean((pred.BIC.lm.PSF$fit - ames_train$price)^2))
df_RMSE["pred.AIC.lm.PSF","RMSE"]       <- sqrt(mean((pred.AIC.lm.PSF$fit - ames_train$price)^2))

df_RMSE %>%
  arrange(RMSE) %>%
  head()
##                               RMSE
## pred.AIC.lm.price.log        25602
## pred.AIC.BPM.baslm.price.log 25766
## pred.AIC.MPM.baslm.price.log 25787
## pred.AIC.HPM.baslm.price.log 26206
## pred.AIC.lm.PSF              26625
## pred.AIC.MPM.baslm.PSF       26643

The first three models have relatively close RMSE values. We should also mention that, based on the residual plots, those models behave roughly similarly.


Section 2.5 Overfitting


Let’s make the needed data transformations. All our models include categorical explanatory variables, so we have to be careful about factor levels in the test data that were never seen in training.
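A sketch of the guard we assume here: drop test rows whose factor levels never appear in the training data, then apply the same transformations as for the training set (the hidden chunk may differ in details):

ames_test <- ames_test %>%
  filter(Neighborhood %in% unique(ames_train$Neighborhood)) %>%
  droplevels() %>%
  mutate(price.log = log(price),
         area.log  = log(area),
         PSF       = price / area)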

Let’s take all the steps needed to get the RMSE results for the test data set.

## Warning in predict.lm(m.AIC.lm.price.log, ames_test, interval = "prediction"):
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(m.AIC.lm.PSF, ames_test, interval = "prediction"):
## prediction from a rank-deficient fit may be misleading
df_RMSE["pred.BIC.lm.price.log","RMSE_test"] <- sqrt(mean((pred.BIC.lm.price.log_test$fit - ames_test$price)^2, na.rm =T))*k
df_RMSE["pred.AIC.lm.price.log","RMSE_test"] <- sqrt(mean((pred.AIC.lm.price.log_test$fit - ames_test$price)^2, na.rm =T))*k
df_RMSE["pred.BIC.lm.PSF","RMSE_test"]       <- sqrt(mean((pred.BIC.lm.PSF_test$fit - ames_test$price)^2))*k
df_RMSE["pred.AIC.lm.PSF","RMSE_test"]       <- sqrt(mean((pred.AIC.lm.PSF_test$fit - ames_test$price)^2))*k

# BIC
# price.log
BIC.BPM_pred_price.log_test <- predict(m.BIC.full.baslm.price.log, estimator = "BPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
BIC.HPM_pred_price.log_test <- predict(m.BIC.full.baslm.price.log, estimator = "HPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
BIC.MPM_pred_price.log_test <- predict(m.BIC.full.baslm.price.log, estimator = "MPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
# PSF
BIC.BPM_pred_PSF_test <- predict(m.BIC.full.baslm.PSF, estimator = "BPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
BIC.HPM_pred_PSF_test <- predict(m.BIC.full.baslm.PSF, estimator = "HPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
BIC.MPM_pred_PSF_test <- predict(m.BIC.full.baslm.PSF, estimator = "MPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)

# AIC
# price.log
AIC.BPM_pred_price.log_test <- predict(m.AIC.full.baslm.price.log, estimator = "BPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
AIC.HPM_pred_price.log_test <- predict(m.AIC.full.baslm.price.log, estimator = "HPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
AIC.MPM_pred_price.log_test <- predict(m.AIC.full.baslm.price.log, estimator = "MPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
# PSF
AIC.BPM_pred_PSF_test <- predict(m.AIC.full.baslm.PSF, estimator = "BPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
AIC.HPM_pred_PSF_test <- predict(m.AIC.full.baslm.PSF, estimator = "HPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
AIC.MPM_pred_PSF_test <- predict(m.AIC.full.baslm.PSF, estimator = "MPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)

# BIC
pred.BIC.BPM.baslm.price.log_test <- exp(get.credible.intervals(BIC.BPM_pred_price.log_test))
pred.BIC.HPM.baslm.price.log_test <- exp(get.credible.intervals(BIC.HPM_pred_price.log_test))
pred.BIC.MPM.baslm.price.log_test <- exp(get.credible.intervals(BIC.MPM_pred_price.log_test))
pred.BIC.BPM.baslm.PSF_test       <- get.credible.intervals(BIC.BPM_pred_PSF_test)*ames_test$area
pred.BIC.HPM.baslm.PSF_test       <- get.credible.intervals(BIC.HPM_pred_PSF_test)*ames_test$area
pred.BIC.MPM.baslm.PSF_test       <- get.credible.intervals(BIC.MPM_pred_PSF_test)*ames_test$area
# AIC
pred.AIC.BPM.baslm.price.log_test <- exp(get.credible.intervals(AIC.BPM_pred_price.log_test))
pred.AIC.HPM.baslm.price.log_test <- exp(get.credible.intervals(AIC.HPM_pred_price.log_test))
pred.AIC.MPM.baslm.price.log_test <- exp(get.credible.intervals(AIC.MPM_pred_price.log_test))
pred.AIC.BPM.baslm.PSF_test       <- get.credible.intervals(AIC.BPM_pred_PSF_test)*ames_test$area
pred.AIC.HPM.baslm.PSF_test       <- get.credible.intervals(AIC.HPM_pred_PSF_test)*ames_test$area
pred.AIC.MPM.baslm.PSF_test       <- get.credible.intervals(AIC.MPM_pred_PSF_test)*ames_test$area

# Bayesian approach
# BIC

df_RMSE["pred.BIC.BPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.BIC.BPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.HPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.BIC.HPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.MPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.BIC.MPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.BPM.baslm.PSF","RMSE_test"]       <- sqrt(mean((pred.BIC.BPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.HPM.baslm.PSF","RMSE_test"]       <- sqrt(mean((pred.BIC.HPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.MPM.baslm.PSF","RMSE_test"]       <- sqrt(mean((pred.BIC.MPM.baslm.PSF_test$pred - ames_test$price)^2))*k

# AIC
df_RMSE["pred.AIC.BPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.AIC.BPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.HPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.AIC.HPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.MPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.AIC.MPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.BPM.baslm.PSF","RMSE_test"]       <- sqrt(mean((pred.AIC.BPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.HPM.baslm.PSF","RMSE_test"]       <- sqrt(mean((pred.AIC.HPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.MPM.baslm.PSF","RMSE_test"]       <- sqrt(mean((pred.AIC.MPM.baslm.PSF_test$pred - ames_test$price)^2))*k

df_RMSE$RMSE_test_diff <- df_RMSE$RMSE_test - df_RMSE$RMSE

RMSE

##                               RMSE RMSE_test RMSE_test_diff
## pred.AIC.lm.price.log        25602     25493         -108.9
## pred.AIC.BPM.baslm.price.log 25766     25303         -463.1
## pred.AIC.MPM.baslm.price.log 25787     25346         -440.6
## pred.AIC.HPM.baslm.price.log 26206     25706         -500.0
## pred.AIC.lm.PSF              26625     28378         1753.2
## pred.AIC.MPM.baslm.PSF       26643     28379         1736.2
## pred.AIC.BPM.baslm.PSF       26652     28360         1708.1
## pred.AIC.HPM.baslm.PSF       26850     28178         1328.9
## pred.BIC.HPM.baslm.price.log 27094     26346         -748.4
## pred.BIC.MPM.baslm.price.log 27117     26835         -281.5
## pred.BIC.BPM.baslm.PSF       27274     27856          581.4
## pred.BIC.MPM.baslm.PSF       27359     27828          468.9
## pred.BIC.HPM.baslm.PSF       27359     27828          468.9
## pred.BIC.BPM.baslm.price.log 27796     27313         -482.9
## pred.BIC.lm.price.log        28582     28333         -248.9
## pred.BIC.lm.PSF              29102     28970         -131.9

We got very close results for all of our top-3 models, both among themselves and between the train and test data. We will use the pred.AIC.lm.price.log model for the next step, but we should note that in a real-life situation we would probably try all of them.

Basically, this means our top-3 models give roughly the same prediction accuracy. We built the models on the training data, applied them to the test data, and obtained roughly the same results. Based on that, we can assume they are relatively reliable within their own limitations.


Part 3 Development of a Final Model

Section 3.1 Final Model


First, we start with the ANOVA output of our pre-final model, pred.AIC.lm.price.log.

## Analysis of Variance Table
## 
## Response: price.log
##                    Df Sum Sq Mean Sq F value               Pr(>F)    
## Neighborhood       26  106.8    4.11  196.20 < 0.0000000000000002 ***
## area.log            1   30.6   30.62 1462.23 < 0.0000000000000002 ***
## Age.Blt             1    4.0    3.98  190.00 < 0.0000000000000002 ***
## Bldg.Type           4    3.0    0.75   36.05 < 0.0000000000000002 ***
## House.Style         6    2.9    0.48   23.15 < 0.0000000000000002 ***
## Overall.Qual        1    4.9    4.93  235.33 < 0.0000000000000002 ***
## Bsmt.Exposure       4    1.4    0.36   17.26   0.0000000000001117 ***
## Bsmt.Full.Bath      1    1.3    1.32   62.85   0.0000000000000063 ***
## Functional.Tr       1    0.2    0.24   11.43              0.00075 ***
## Sale.Condition      5    1.3    0.26   12.37   0.0000000000121697 ***
## area.log:AgeGroup   2    0.3    0.14    6.63              0.00138 ** 
## AgeGroup:Age.Blt    2    0.1    0.05    2.44              0.08786 .  
## Residuals         945   19.8    0.02                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using this information, we can form hypotheses for making the final model more accurate: we add some variables and transform others, as discussed in more detail below.
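The final model itself, reconstructed from the Call shown in the summary output below (the object name matches the one referenced in Part 4):

m.AIC.lm.price.log.final <- lm(
  price.log ~ Neighborhood + area.log + AgeGroup + Age.Blt + Bldg.Type +
    log(Overall.Qual) + Bsmt.Exposure + Bsmt.Full.Bath + Functional.Tr +
    Sale.Condition + log(Garage.Area + 1) + log(Total.Bsmt.SF + 1) +
    log(X1st.Flr.SF) + log(Lot.Area) + log(as.numeric(Exter.Qual)) +
    area.log:AgeGroup + AgeGroup:Age.Blt,
  data = ames_train)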

## Analysis of Variance Table
## 
## Response: price.log
##                              Df Sum Sq Mean Sq F value               Pr(>F)    
## Neighborhood                 26  106.8    4.11  203.80 < 0.0000000000000002 ***
## area.log                      1   30.6   30.62 1518.85 < 0.0000000000000002 ***
## Age.Blt                       1    4.0    3.98  197.36 < 0.0000000000000002 ***
## Bldg.Type                     4    3.0    0.75   37.45 < 0.0000000000000002 ***
## log(Overall.Qual)             1    6.3    6.25  310.14 < 0.0000000000000002 ***
## Bsmt.Exposure                 4    2.1    0.52   25.63 < 0.0000000000000002 ***
## Bsmt.Full.Bath                1    1.9    1.85   91.81 < 0.0000000000000002 ***
## Functional.Tr                 1    0.2    0.16    7.95               0.0049 ** 
## Sale.Condition                5    1.1    0.23   11.30        0.00000000013 ***
## log(Garage.Area + 1)          1    0.1    0.11    5.55               0.0187 *  
## log(Total.Bsmt.SF + 1)        1    0.5    0.51   25.15        0.00000063374 ***
## log(X1st.Flr.SF)              1    0.4    0.38   19.06        0.00001407664 ***
## log(Lot.Area)                 1    0.2    0.20    9.79               0.0018 ** 
## log(as.numeric(Exter.Qual))   1    0.2    0.20    9.75               0.0018 ** 
## area.log:AgeGroup             2    0.3    0.13    6.44               0.0017 ** 
## AgeGroup:Age.Blt              2    0.1    0.05    2.60               0.0751 .  
## Residuals                   946   19.1    0.02                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the residual sums of squares, we can conclude that the new model is more accurate (19.1 against 19.8).

## 
## Call:
## lm(formula = price.log ~ Neighborhood + area.log + AgeGroup + 
##     Age.Blt + Bldg.Type + log(Overall.Qual) + Bsmt.Exposure + 
##     Bsmt.Full.Bath + Functional.Tr + Sale.Condition + log(Garage.Area + 
##     1) + log(Total.Bsmt.SF + 1) + log(X1st.Flr.SF) + log(Lot.Area) + 
##     log(as.numeric(Exter.Qual)) + area.log:AgeGroup + AgeGroup:Age.Blt, 
##     data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5596 -0.0619  0.0024  0.0645  0.5970 
## 
## Coefficients: (2 not defined because of singularities)
##                             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)                  5.85814    0.46170   12.69 < 0.0000000000000002
## NeighborhoodNridgHt         -0.07264    0.04041   -1.80              0.07258
## NeighborhoodNoRidge         -0.07130    0.04853   -1.47              0.14213
## NeighborhoodGrnHill          0.36456    0.11087    3.29              0.00105
## NeighborhoodTimber           1.34109    0.45647    2.94              0.00338
## NeighborhoodSomerst         -0.08524    0.04073   -2.09              0.03660
## NeighborhoodGreens           1.34349    0.45257    2.97              0.00307
## NeighborhoodVeenker          1.29788    0.45598    2.85              0.00452
## NeighborhoodCrawfor          1.20087    0.46834    2.56              0.01050
## NeighborhoodCollgCr          1.27945    0.45398    2.82              0.00493
## NeighborhoodBlmngtn         -0.14087    0.05972   -2.36              0.01854
## NeighborhoodClearCr          1.36865    0.45819    2.99              0.00289
## NeighborhoodNWAmes           1.28478    0.45620    2.82              0.00496
## NeighborhoodGilbert          1.26717    0.45527    2.78              0.00549
## NeighborhoodSawyerW          1.25457    0.45443    2.76              0.00588
## NeighborhoodMitchel          1.29040    0.45264    2.85              0.00446
## NeighborhoodNPkVill          1.31238    0.45268    2.90              0.00383
## NeighborhoodNAmes            1.28849    0.45311    2.84              0.00456
## NeighborhoodSawyer           1.28931    0.45292    2.85              0.00451
## NeighborhoodSWISU            0.99566    0.46849    2.13              0.03382
## NeighborhoodEdwards          0.99443    0.46542    2.14              0.03289
## NeighborhoodBrkSide          1.07931    0.46475    2.32              0.02043
## NeighborhoodBlueste          1.32710    0.45444    2.92              0.00358
## NeighborhoodOldTown          1.01589    0.46628    2.18              0.02960
## NeighborhoodIDOTRR           0.91568    0.46531    1.97              0.04937
## NeighborhoodBrDale           1.21461    0.44964    2.70              0.00703
## NeighborhoodMeadowV          1.17403    0.44763    2.62              0.00886
## area.log                     0.60250    0.05588   10.78 < 0.0000000000000002
## AgeGroupMedium                    NA         NA      NA                   NA
## AgeGroupOld                       NA         NA      NA                   NA
## Age.Blt                     -0.00943    0.00330   -2.86              0.00432
## Bldg.Type2fmCon             -0.05330    0.03445   -1.55              0.12216
## Bldg.TypeDuplex             -0.16997    0.02839   -5.99   0.0000000030404748
## Bldg.TypeTwnhs              -0.08136    0.03868   -2.10              0.03569
## Bldg.TypeTwnhsE             -0.04256    0.02586   -1.65              0.10015
## log(Overall.Qual)            0.41014    0.03360   12.21 < 0.0000000000000002
## Bsmt.ExposureGd              0.06266    0.01994    3.14              0.00173
## Bsmt.ExposureMn             -0.04086    0.01976   -2.07              0.03893
## Bsmt.ExposureNo             -0.03168    0.01416   -2.24              0.02545
## Bsmt.ExposureNoBasement      0.03167    0.10574    0.30              0.76463
## Bsmt.Full.Bath               0.07744    0.00982    7.89   0.0000000000000084
## Functional.TrTRUE            0.07548    0.02068    3.65              0.00028
## Sale.ConditionAdjLand        0.18937    0.11011    1.72              0.08580
## Sale.ConditionAlloca         0.20694    0.07606    2.72              0.00663
## Sale.ConditionFamily        -0.05958    0.04017   -1.48              0.13829
## Sale.ConditionNormal         0.10949    0.01941    5.64   0.0000000223248549
## Sale.ConditionPartial        0.12180    0.02781    4.38   0.0000131717894850
## log(Garage.Area + 1)         0.00798    0.00409    1.95              0.05155
## log(Total.Bsmt.SF + 1)       0.02212    0.01577    1.40              0.16098
## log(X1st.Flr.SF)             0.10083    0.02627    3.84              0.00013
## log(Lot.Area)                0.04729    0.01605    2.95              0.00330
## log(as.numeric(Exter.Qual)) -0.05536    0.02263   -2.45              0.01463
## area.log:AgeGroupMedium     -0.21104    0.05912   -3.57              0.00038
## area.log:AgeGroupOld        -0.18486    0.06126   -3.02              0.00261
## AgeGroupMedium:Age.Blt       0.00631    0.00334    1.89              0.05914
## AgeGroupOld:Age.Blt          0.00709    0.00334    2.12              0.03393
##                                
## (Intercept)                 ***
## NeighborhoodNridgHt         .  
## NeighborhoodNoRidge            
## NeighborhoodGrnHill         ** 
## NeighborhoodTimber          ** 
## NeighborhoodSomerst         *  
## NeighborhoodGreens          ** 
## NeighborhoodVeenker         ** 
## NeighborhoodCrawfor         *  
## NeighborhoodCollgCr         ** 
## NeighborhoodBlmngtn         *  
## NeighborhoodClearCr         ** 
## NeighborhoodNWAmes          ** 
## NeighborhoodGilbert         ** 
## NeighborhoodSawyerW         ** 
## NeighborhoodMitchel         ** 
## NeighborhoodNPkVill         ** 
## NeighborhoodNAmes           ** 
## NeighborhoodSawyer          ** 
## NeighborhoodSWISU           *  
## NeighborhoodEdwards         *  
## NeighborhoodBrkSide         *  
## NeighborhoodBlueste         ** 
## NeighborhoodOldTown         *  
## NeighborhoodIDOTRR          *  
## NeighborhoodBrDale          ** 
## NeighborhoodMeadowV         ** 
## area.log                    ***
## AgeGroupMedium                 
## AgeGroupOld                    
## Age.Blt                     ** 
## Bldg.Type2fmCon                
## Bldg.TypeDuplex             ***
## Bldg.TypeTwnhs              *  
## Bldg.TypeTwnhsE                
## log(Overall.Qual)           ***
## Bsmt.ExposureGd             ** 
## Bsmt.ExposureMn             *  
## Bsmt.ExposureNo             *  
## Bsmt.ExposureNoBasement        
## Bsmt.Full.Bath              ***
## Functional.TrTRUE           ***
## Sale.ConditionAdjLand       .  
## Sale.ConditionAlloca        ** 
## Sale.ConditionFamily           
## Sale.ConditionNormal        ***
## Sale.ConditionPartial       ***
## log(Garage.Area + 1)        .  
## log(Total.Bsmt.SF + 1)         
## log(X1st.Flr.SF)            ***
## log(Lot.Area)               ** 
## log(as.numeric(Exter.Qual)) *  
## area.log:AgeGroupMedium     ***
## area.log:AgeGroupOld        ** 
## AgeGroupMedium:Age.Blt      .  
## AgeGroupOld:Age.Blt         *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.142 on 946 degrees of freedom
## Multiple R-squared:  0.892,  Adjusted R-squared:  0.886 
## F-statistic:  148 on 53 and 946 DF,  p-value: <0.0000000000000002

Section 3.2 Transformation


We observed increased residual inaccuracy at high levels of the “Overall.Qual” variable, which we can see on the plot below. The distribution of this variable in the ames_train data is roughly symmetric, although this is not easy to see. For model development purposes, we decided it would be helpful to shift the center of the distribution to the right via a log transform, which can theoretically improve accuracy for high values.
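The hidden chunk presumably just prints the two summaries compared here:

print("Overall.Qual summary:")
summary(ames_train$Overall.Qual)
print("log(Overall.Qual) summary:")
summary(log(ames_train$Overall.Qual))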

## [1] "Overall.Qual summary:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    6.00    6.09    7.00   10.00
## [1] "log(Overall.Qual) summary:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.61    1.79    1.78    1.95    2.30

Section 3.3 Variable Interaction


For the final model, we include interactions between area.log & AgeGroup and between Age.Blt & AgeGroup. In the EDA section, we found that the age distribution has three peaks, so we examined the relationships with area and Age.Blt separately for each age group and found different behavior between those groups. We also found an interaction between the area and price variables, which led us to create the PSF (price per square foot) variable as an alternative response; it was not included in the final model.


##        area_slope age_slope
## New        142.12   -2751.5
## Medium      85.97   -1475.1
## Old         66.93    -533.8
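A sketch of how the per-group slopes above could be obtained: simple regressions of price on area and on Age.Blt within each age group (an assumption; the original chunk is hidden):

t(sapply(split(ames_train, ames_train$AgeGroup), function(d)
  c(area_slope = coef(lm(price ~ area, data = d))[["area"]],
    age_slope  = coef(lm(price ~ Age.Blt, data = d))[["Age.Blt"]])))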

Section 3.4 Variable Selection


First, we performed exploratory data analysis to choose (or transform/create) the most reliable variables, both response and explanatory, that visually and logically make sense to include in the full models. Second, we decided to use two conceptually different approaches: Bayesian and frequentist.

In both cases we used AIC (Akaike information criterion) and BIC (Bayesian information criterion) for selection. For both criteria, the model with the smallest value is preferable. In general, using BIC leads to fewer variables in the best model compared to using adjusted R² or AIC.
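The difference comes down to the per-coefficient penalty passed to stepAIC in Section 2.2: k = 2 for AIC versus k = log(n) for BIC. With n = 1000 training rows:

log(1000)  # ~6.91, so BIC penalizes each extra coefficient about 3.5 times harder than AIC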

For the Bayesian approach, we also used four estimators:

  • BMA (Bayesian Model Averaging) - predictions obtained by averaging over all models, weighted by their posterior probabilities;
  • HPM (Highest Probability Model) - the most likely model under 0-1 loss;
  • MPM (Median Probability Model) - the model that includes all predictors whose marginal posterior inclusion probabilities are greater than 0.5;
  • BPM (Best Predictive Model) - the model whose predictions are closest to those given by BMA.

Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


We got similar RMSE values for the train and test data, which confirms our choice of the pre-final model. For the next step, we explored the summary and ANOVA output, which give information about variable significance. The most significant predictor is Neighborhood, followed by log(area); other variables carry less weight. It seems reasonable to add variables that further specify measurements of the property’s area; following this logic, we add the following variables:

  • log(Garage.Area + 1)
  • log(Total.Bsmt.SF + 1)
  • log(X1st.Flr.SF)
  • log(X2nd.Flr.SF + 1)
  • log(Lot.Area)

Also, from the residual plots, we found increased residual inaccuracy at high levels of the “Overall.Qual” variable and decided to transform it into log(Overall.Qual). From the EDA, we know there is a clear visual relationship between Exter.Qual and price, with mostly the same behavior as “Overall.Qual”. That led us to add the log(as.numeric(Exter.Qual)) variable.

For the final model, we used the same frequentist stepwise approach to select the final set of variables.


Part 4 Final Model Assessment

Section 4.1 Final Model Residual


## Warning in predict.lm(m.AIC.lm.price.log.final, ames_train, interval =
## "prediction"): prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(m.AIC.lm.price.log.final, ames_test, interval =
## "prediction"): prediction from a rank-deficient fit may be misleading

We get a mostly constant residual distribution.

Increased inaccuracy, roughly:

  • ±50k for the “Timber” and “Crawfor” neighborhoods;
  • slightly, for predicted prices above 300k;
  • ±70k for new homes under 15 years old;
  • ±40k for “Overall.Qual” levels 9 and 10.

Underestimates, roughly:

  • -70k for actual prices above 370k, with outliers around 600k;
  • -25k for 2 “Bsmt.Full.Bath”.

Overestimates: none.


Section 4.2 Final Model RMSE


We calculate the root mean squared error (RMSE) in original dollar values.

##                              RMSE RMSE_test RMSE_test_diff
## pred.AIC.lm.price.log.final 25357     24791         -566.0
## pred.AIC.lm.price.log       25602     25493         -108.9

For our final model, pred.AIC.lm.price.log.final, we got slightly better RMSE values, 25,357 and 24,791 dollars for the train and test datasets respectively, than for the pre-final pred.AIC.lm.price.log (25,602 and 25,493). The results for the train and test datasets are quite similar, which means the model is relatively stable on outside data. It is important to note that we got this result on the first try, so the test set had little opportunity to implicitly leak into the training process.

In general, this means our average prediction error stays around $25k.


Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


Among the model’s strengths, we can point to stable prediction results for most cases. Based on all the output we have, we can expect an average error in the predicted price of around 25-35k. For reference, the median sale price in the training data is:

## [1] 159467

As for weaknesses, the same 25-35k is a relatively big number compared to the typical price of around 160k. Also, the model should be applied with caution to houses younger than 15 years and to houses with predicted prices above 370k. The model cannot be applied to neighborhoods that are not present in the training data set; for those, other models should be used.


Section 4.4 Final Model Validation


Let’s make the needed data transformations.

## Warning in predict.lm(m.AIC.lm.price.log.final, ames_validation, interval =
## "prediction"): prediction from a rank-deficient fit may be misleading
##                              RMSE RMSE_test RMSE_test_diff RMSE_validation
## pred.AIC.lm.price.log.final 25357     24791           -566           25369
##                             RMSE_diff_validation
## pred.AIC.lm.price.log.final                11.59

We got an RMSE of 25,369 dollars on the validation data set. That is a very similar result compared with the RMSE for the train (25,357) and test (24,791) data.

## [1] 0.9697

About 97% of the prediction intervals contain the true price of the house in the validation data set. That means the final model properly reflects uncertainty.
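A sketch of how this coverage could be computed (the prediction object name is hypothetical):

# share of 95% prediction intervals that contain the true validation price
mean(ames_validation$price >= pred.final.validation$lwr &
     ames_validation$price <= pred.final.validation$upr)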


Part 5 Conclusion


The multiple R-squared of the final model is 0.892, which means the model explains roughly 89.2% of the variability in house prices.

The most interesting insight from the data and the model is that price behaves differently for different classes of property.

Since we have an observational study rather than an experiment, we can only establish association; we cannot draw causal conclusions.