2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.

load("ames_train.Rdata")

Loading the necessary packages and libraries.

library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(MASS)
library(broom)
library(lubridate)
library(gridExtra)
library(GGally)

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).

2.2 Explore the structure of the dataset and its variables

The initial dataset has 1000 observations of 81 variables. I computed the age of each house from its year built and added it as an additional column, House.Age.
The dataset contains both numeric and factor variables, as shown by the output of the str function on the ames_train data frame.

# Retrieve the data structure of the variables in the dataset #
ames_train <- as_tibble(ames_train)  # tbl_df() is deprecated in current dplyr
str(ames_train)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  81 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
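
The derived House.Age column is mentioned in the text, but the code that creates it does not appear in this section. A minimal sketch, assuming it follows the same formula that is applied to the test set later (current year minus Year.Built); the `toy_homes` data frame is a hypothetical stand-in for ames_train:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for ames_train with only the column needed here
toy_homes <- tibble(Year.Built = c(1939, 1984, 2007))

# Age in years as of today, mirroring the formula used later for ames_test
toy_homes <- toy_homes %>%
  mutate(House.Age = year(today()) - Year.Built)

toy_homes
```

Because this age depends on the date the code is run, the values shift from year to year. An age measured at the time of sale (Yr.Sold minus Year.Built) would be fixed, but it can be zero for new houses, which log() cannot handle without an adjustment.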

2.3 Exploratory Graphs

After exploring many graphs, I selected the following three as the most informative.

2.3.1 1. Boxplot of the `price` vs Neighborhoods and Overall Quality

The location of a house plays an important role in determining its price. The boxplot shows the price distribution, in thousands of dollars, across the neighborhoods of Ames, Iowa. MeadowV is the least expensive neighborhood, while StoneBr is the most expensive. The plot also shows that prices vary considerably within StoneBr and NridgHt, even though they are among the most expensive neighborhoods.
Overall Quality has a strong positive correlation with house prices.
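
The plotting code is not included in this report. The following is a minimal sketch of how such a boxplot could be drawn with ggplot2, using hypothetical toy data (`toy_sales`) in place of ames_train; the neighborhood names are taken from the text, but the prices are simulated:

```r
library(ggplot2)

# Hypothetical toy data; the real plot would use ames_train
set.seed(1)
toy_sales <- data.frame(
  Neighborhood = rep(c("MeadowV", "NAmes", "StoneBr"), each = 20),
  price = c(rnorm(20, 95000, 10000),
            rnorm(20, 145000, 20000),
            rnorm(20, 330000, 90000))
)

# Order neighborhoods by median price so cheap and expensive areas sit at opposite ends
p_nbhd <- ggplot(toy_sales,
                 aes(x = reorder(Neighborhood, price, FUN = median),
                     y = price / 1000)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Neighborhood", y = "Price (thousands of dollars)")

p_nbhd
```

Ordering by the median makes the ranking of neighborhoods readable at a glance, and the box heights show the within-neighborhood spread discussed above.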

2.3.2 2. Paired Association plots

Housing price has a strong positive relationship with the Overall Quality, Overall Condition, and Lot Area of the house.
Housing price has a negative relationship with the age of the house.
The number of bedrooms above ground does not seem to have a very strong effect on the housing price. However, this is better verified during variable selection at the modeling stage.
Quality seems to be an important factor when people look for houses, hence a separate analysis of all the quality-related variables.
The configuration of the lot, whether it is on a corner or well inside a street or layout, is also an important factor in determining housing prices.

EDA Summary: Overall Quality, Overall Condition, Age of the House, Lot Area, Neighborhood, and number of bedrooms appear to have a strong relationship with the price of the house. Other quality variables, such as those for the Garage, Kitchen, and Basement, and the configuration of the lot have a moderate association with the house price.

2.4 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.4.1 Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).

Create a linear model for price using the following variables, selected based on the previous assessment, intuition, and expert knowledge:

- Lot.Area
- Overall.Qual
- Overall.Cond
- House.Age
- Kitchen.Qual
- area
- Bedroom.AbvGr
- Bsmt.Qual
- Exter.Qual
- Neighborhood

ames_train_fit <- ames_train %>%
  filter(Sale.Condition == "Normal") %>%
  dplyr::select(price,Lot.Area,Overall.Qual,Overall.Cond,House.Age,Kitchen.Qual,Bsmt.Qual,
                Bedroom.AbvGr,area, Neighborhood, Exter.Qual)

ames_train_fit <- na.omit(ames_train_fit)

# Fit the initial model on all ten predictors; price, Lot.Area, area and House.Age are log-transformed #
model.fit <- lm(data = ames_train_fit, log(price) ~ log(Lot.Area) + log(area) + log(House.Age) + 
                  Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual + Exter.Qual + Neighborhood + 
                  Bedroom.AbvGr)
summary(model.fit)

## 
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + log(area) + log(House.Age) + 
##     Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual + 
##     Exter.Qual + Neighborhood + Bedroom.AbvGr, data = ames_train_fit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84933 -0.06044  0.00587  0.06465  0.45309 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.601036   0.174386  43.587  < 2e-16 ***
## log(Lot.Area)        0.140648   0.011665  12.057  < 2e-16 ***
## log(area)            0.477700   0.023065  20.711  < 2e-16 ***
## log(House.Age)      -0.161004   0.017717  -9.088  < 2e-16 ***
## Overall.Qual         0.056936   0.006027   9.447  < 2e-16 ***
## Overall.Cond         0.054356   0.004498  12.085  < 2e-16 ***
## Bsmt.QualFa         -0.120274   0.037620  -3.197 0.001445 ** 
## Bsmt.QualGd         -0.075241   0.021177  -3.553 0.000404 ***
## Bsmt.QualPo         -0.239834   0.118946  -2.016 0.044113 *  
## Bsmt.QualTA         -0.099868   0.025789  -3.873 0.000117 ***
## Kitchen.QualFa      -0.137083   0.040996  -3.344 0.000866 ***
## Kitchen.QualGd      -0.060986   0.026482  -2.303 0.021549 *  
## Kitchen.QualPo      -0.160177   0.125077  -1.281 0.200711    
## Kitchen.QualTA      -0.096330   0.028374  -3.395 0.000721 ***
## Exter.QualFa        -0.139133   0.059970  -2.320 0.020599 *  
## Exter.QualGd        -0.085073   0.036752  -2.315 0.020886 *  
## Exter.QualTA        -0.082355   0.040035  -2.057 0.040016 *  
## NeighborhoodBlueste -0.035021   0.080468  -0.435 0.663525    
## NeighborhoodBrDale  -0.132301   0.064015  -2.067 0.039095 *  
## NeighborhoodBrkSide -0.078653   0.053989  -1.457 0.145571    
## NeighborhoodClearCr -0.021881   0.062021  -0.353 0.724330    
## NeighborhoodCollgCr -0.082639   0.046975  -1.759 0.078938 .  
## NeighborhoodCrawfor  0.049345   0.054490   0.906 0.365442    
## NeighborhoodEdwards -0.117875   0.050745  -2.323 0.020445 *  
## NeighborhoodGilbert -0.128795   0.049195  -2.618 0.009017 ** 
## NeighborhoodGreens   0.179216   0.073428   2.441 0.014883 *  
## NeighborhoodGrnHill  0.387315   0.121795   3.180 0.001531 ** 
## NeighborhoodIDOTRR  -0.206114   0.055580  -3.708 0.000224 ***
## NeighborhoodMeadowV -0.148740   0.055864  -2.663 0.007918 ** 
## NeighborhoodMitchel -0.031036   0.050139  -0.619 0.536107    
## NeighborhoodNAmes   -0.028977   0.049701  -0.583 0.560040    
## NeighborhoodNoRidge  0.056164   0.050377   1.115 0.265249    
## NeighborhoodNPkVill  0.011520   0.073236   0.157 0.875051    
## NeighborhoodNridgHt  0.002868   0.049202   0.058 0.953533    
## NeighborhoodNWAmes  -0.057779   0.050885  -1.135 0.256524    
## NeighborhoodOldTown -0.177196   0.053467  -3.314 0.000962 ***
## NeighborhoodSawyer  -0.037216   0.051101  -0.728 0.466667    
## NeighborhoodSawyerW -0.138330   0.048847  -2.832 0.004748 ** 
## NeighborhoodSomerst -0.027373   0.046564  -0.588 0.556799    
## NeighborhoodStoneBr  0.031589   0.055555   0.569 0.569787    
## NeighborhoodSWISU   -0.109296   0.061956  -1.764 0.078116 .  
## NeighborhoodTimber  -0.036876   0.053633  -0.688 0.491939    
## NeighborhoodVeenker  0.033352   0.060472   0.552 0.581431    
## Bedroom.AbvGr       -0.026555   0.007433  -3.573 0.000375 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.113 on 770 degrees of freedom
## Multiple R-squared:  0.9123, Adjusted R-squared:  0.9074 
## F-statistic: 186.2 on 43 and 770 DF,  p-value: < 2.2e-16

2.4.1.1 Summary:

The R-squared of the initial model fit is 91.23% (adjusted R-squared: 90.74%).
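
Because the response is log(price), the coefficients are easier to read once converted to percentage changes in price. A small worked sketch, using two estimates copied from the summary table above; the variable names are illustrative only:

```r
# Coefficients copied from the summary output above
b_log_area     <- 0.477700   # log(area)
b_overall_qual <- 0.056936   # Overall.Qual

# Log-log term: a 10% increase in area multiplies the predicted price by 1.10^b
pct_area <- (1.10^b_log_area - 1) * 100

# Log-level term: one extra quality point multiplies the predicted price by exp(b)
pct_qual <- (exp(b_overall_qual) - 1) * 100

round(c(area_10pct = pct_area, qual_plus1 = pct_qual), 2)
```

Holding the other predictors fixed, a 10% larger living area corresponds to a predicted price roughly 4.7% higher, and each additional point of Overall.Qual to a predicted price roughly 5.9% higher.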

2.4.2 Section 2.2 Model Selection

Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?

I used backward stepwise selection by AIC and Bayesian Adaptive Sampling (BAS) as the two model selection methods.
Backward selection by AIC retains every predictor from the initial model, so the R-squared is unchanged; based on the p-values, each predictor has at least one coefficient that is significant at the 5% level.
In the BAS model, the inclusion probability graph indicates that Bsmt.Qual and Kitchen.Qual have inclusion probabilities below 40%. We can drop these two variables in the final model selection.

# Using the stepwise function #
model.fit.AIC <- step(model.fit, direction = "backward", k = 2 ,trace = FALSE)
summary(model.fit.AIC)

## 
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + log(area) + log(House.Age) + 
##     Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual + 
##     Exter.Qual + Neighborhood + Bedroom.AbvGr, data = ames_train_fit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84933 -0.06044  0.00587  0.06465  0.45309 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.601036   0.174386  43.587  < 2e-16 ***
## log(Lot.Area)        0.140648   0.011665  12.057  < 2e-16 ***
## log(area)            0.477700   0.023065  20.711  < 2e-16 ***
## log(House.Age)      -0.161004   0.017717  -9.088  < 2e-16 ***
## Overall.Qual         0.056936   0.006027   9.447  < 2e-16 ***
## Overall.Cond         0.054356   0.004498  12.085  < 2e-16 ***
## Bsmt.QualFa         -0.120274   0.037620  -3.197 0.001445 ** 
## Bsmt.QualGd         -0.075241   0.021177  -3.553 0.000404 ***
## Bsmt.QualPo         -0.239834   0.118946  -2.016 0.044113 *  
## Bsmt.QualTA         -0.099868   0.025789  -3.873 0.000117 ***
## Kitchen.QualFa      -0.137083   0.040996  -3.344 0.000866 ***
## Kitchen.QualGd      -0.060986   0.026482  -2.303 0.021549 *  
## Kitchen.QualPo      -0.160177   0.125077  -1.281 0.200711    
## Kitchen.QualTA      -0.096330   0.028374  -3.395 0.000721 ***
## Exter.QualFa        -0.139133   0.059970  -2.320 0.020599 *  
## Exter.QualGd        -0.085073   0.036752  -2.315 0.020886 *  
## Exter.QualTA        -0.082355   0.040035  -2.057 0.040016 *  
## NeighborhoodBlueste -0.035021   0.080468  -0.435 0.663525    
## NeighborhoodBrDale  -0.132301   0.064015  -2.067 0.039095 *  
## NeighborhoodBrkSide -0.078653   0.053989  -1.457 0.145571    
## NeighborhoodClearCr -0.021881   0.062021  -0.353 0.724330    
## NeighborhoodCollgCr -0.082639   0.046975  -1.759 0.078938 .  
## NeighborhoodCrawfor  0.049345   0.054490   0.906 0.365442    
## NeighborhoodEdwards -0.117875   0.050745  -2.323 0.020445 *  
## NeighborhoodGilbert -0.128795   0.049195  -2.618 0.009017 ** 
## NeighborhoodGreens   0.179216   0.073428   2.441 0.014883 *  
## NeighborhoodGrnHill  0.387315   0.121795   3.180 0.001531 ** 
## NeighborhoodIDOTRR  -0.206114   0.055580  -3.708 0.000224 ***
## NeighborhoodMeadowV -0.148740   0.055864  -2.663 0.007918 ** 
## NeighborhoodMitchel -0.031036   0.050139  -0.619 0.536107    
## NeighborhoodNAmes   -0.028977   0.049701  -0.583 0.560040    
## NeighborhoodNoRidge  0.056164   0.050377   1.115 0.265249    
## NeighborhoodNPkVill  0.011520   0.073236   0.157 0.875051    
## NeighborhoodNridgHt  0.002868   0.049202   0.058 0.953533    
## NeighborhoodNWAmes  -0.057779   0.050885  -1.135 0.256524    
## NeighborhoodOldTown -0.177196   0.053467  -3.314 0.000962 ***
## NeighborhoodSawyer  -0.037216   0.051101  -0.728 0.466667    
## NeighborhoodSawyerW -0.138330   0.048847  -2.832 0.004748 ** 
## NeighborhoodSomerst -0.027373   0.046564  -0.588 0.556799    
## NeighborhoodStoneBr  0.031589   0.055555   0.569 0.569787    
## NeighborhoodSWISU   -0.109296   0.061956  -1.764 0.078116 .  
## NeighborhoodTimber  -0.036876   0.053633  -0.688 0.491939    
## NeighborhoodVeenker  0.033352   0.060472   0.552 0.581431    
## Bedroom.AbvGr       -0.026555   0.007433  -3.573 0.000375 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.113 on 770 degrees of freedom
## Multiple R-squared:  0.9123, Adjusted R-squared:  0.9074 
## F-statistic: 186.2 on 43 and 770 DF,  p-value: < 2.2e-16

# Using BAS model #
model.fit.bas <- bas.lm(data = ames_train_fit, log(price) ~ log(Lot.Area) + log(area) +
                          log(House.Age) + Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual +
                          Exter.Qual + Bedroom.AbvGr + Neighborhood,
                        prior = "JZS", modelprior = uniform(), method = "BAS")
image(model.fit.bas, top.models = 10, rotate = TRUE)

plot(model.fit.bas, which = 4, ask = FALSE, sub.caption = " ")
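
The inclusion probabilities shown in the plot can also be read off numerically. A minimal sketch on hypothetical toy data, assuming the fitted bas object exposes the term names in `namesx` and the marginal inclusion probabilities in `probne0`:

```r
library(BAS)

# Hypothetical toy data: y depends on x1 only; x2 is pure noise
set.seed(42)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 1 + 2 * toy$x1 + rnorm(200, sd = 0.5)

toy_bas <- bas.lm(y ~ x1 + x2, data = toy,
                  prior = "JZS", modelprior = uniform())

# probne0 holds P(beta != 0 | data) for each term, in the order of namesx
pip <- data.frame(term = toy_bas$namesx, incl_prob = toy_bas$probne0)
pip[order(-pip$incl_prob), ]
```

Sorting the table makes it straightforward to apply a cutoff, such as the 40% threshold used above, when deciding which terms to drop.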

2.4.3 Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.

My preferred model is the BAS model because it uses fewer predictors than the AIC model.
In the BAS model, Exter.Qual and Bsmt.Qual are not significant predictors, as indicated by the top-ranked models and the posterior inclusion probabilities.

# Diagnostic plot (residuals vs. fitted) for the selected model #
plot(model.fit.bas, which = 1, sub.caption = " ")

The residuals vs. fitted plot flags three outlying points: observations 61, 318, and 596. The Lot.Area of row 596 is much larger than in the rest of the data. The house in row 61 was built in 1925 and remodelled in 1950. The external condition of the house in row 596 is rated “Good”, which is shown as a significant predictor in the model. It also has a pool with Pool.QC rated as “Fantastic”; in the previous assessment, this variable had the highest number of NA values.
All predictions are for the Normal sale condition. Such points are to be expected when there are categorical predictors with a limited number of levels. I do not expect any serious implications from these outliers.
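
Rows flagged in a residual plot can be pulled out for inspection by ranking the absolute residuals. A minimal sketch with an ordinary lm fit on hypothetical toy data, in which one outlier is planted on purpose:

```r
# Hypothetical toy data with one deliberately corrupted observation
set.seed(7)
toy <- data.frame(x = runif(100, 1, 10))
toy$y <- 2 + 0.5 * toy$x + rnorm(100, sd = 0.1)
toy$y[40] <- toy$y[40] - 3   # plant an outlier at row 40

toy_fit <- lm(y ~ x, data = toy)

# Rank observations by absolute residual and inspect the top three
worst <- order(abs(resid(toy_fit)), decreasing = TRUE)[1:3]
toy[worst, ]
```

Looking at the full rows behind the largest residuals, as done above for Lot.Area and Pool.QC, helps decide whether a point is a data problem or simply an unusual house.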

2.4.4 Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).

init.model.bas.fit <- exp(fitted(model.fit.bas, estimator = "BMA"))
residuals.ames.train <-  init.model.bas.fit - ames_train_fit$price
rmse.ames.train <- sqrt(mean(residuals.ames.train ^2))
message ("The RMSE value or in sample error is ", round(rmse.ames.train,2))

## The RMSE value or in sample error is 21504.41

2.4.5 Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

I select only the Normal sale condition in the testing dataset. I also found that the Neighborhood variable has a new level, “Landmrk”, which occurs just once among the Normal sales. Because the model was trained without this level, I drop it from the test dataset.

load("ames_test.Rdata")

ames_test <- as_tibble(ames_test)  # tbl_df() is deprecated in current dplyr

ames_test <- ames_test %>%
  filter(Sale.Condition == "Normal") %>%
  filter(Neighborhood != "Landmrk") %>%
  mutate(House.Age = year(today()) -  Year.Built) %>%
  dplyr::select(price,Lot.Area,Overall.Qual,Overall.Cond,House.Age,Kitchen.Qual,Bsmt.Qual,
                Bedroom.AbvGr,area, Neighborhood, Exter.Qual)

ames_test <- na.omit(ames_test)
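
A factor level that appears only in the test data can be detected programmatically rather than by eye. A minimal sketch with hypothetical toy factors standing in for the Neighborhood columns of the two data sets:

```r
# Hypothetical toy factors standing in for the Neighborhood columns
train_nbhd <- factor(c("NAmes", "StoneBr", "MeadowV", "NAmes"))
test_nbhd  <- factor(c("NAmes", "Landmrk", "StoneBr"))

# Levels that occur in the test data but were never seen during training
unseen <- setdiff(unique(as.character(test_nbhd)),
                  unique(as.character(train_nbhd)))
unseen

# Keep only test rows whose level the model knows how to handle
keep <- !(as.character(test_nbhd) %in% unseen)
droplevels(test_nbhd[keep])
```

Running such a check before calling predict() avoids errors caused by levels that have no estimated coefficient.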

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?

I used Bayesian Model Averaging to predict the house prices in the test data. The test RMSE (22865.36) is about 6% higher than the training RMSE (21504.41), an increase that is not alarmingly high. The model fits the training data better than the test data. One way to address this is to reach similar predictive accuracy with fewer variables, or to use different priors in BAS.

test.pred <-  predict(model.fit.bas, newdata = ames_test, estimator = "BMA")
test.pred.val <- exp(test.pred$fit)
residuals.ames.test <- ames_test$price - test.pred.val
rmse.ames.test <- sqrt(mean(residuals.ames.test^2))
message ("The RMSE value or out of sample error is ", round(rmse.ames.test,2))

## The RMSE value or out of sample error is 22865.36

# Check if the RMSE of test data is higher than RMSE training data #
message ("Is out of sample error more than in sample error ? Answer: ",(rmse.ames.test > rmse.ames.train))

## Is out of sample error more than in sample error ? Answer: TRUE
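
The in-sample and out-of-sample calculations follow the same recipe, so they can share one helper. A minimal sketch, using an ordinary lm fit on hypothetical toy data; `rmse_dollars` is an illustrative name, not a function from any package:

```r
# RMSE in dollars for a model whose response is log(price)
rmse_dollars <- function(actual_price, predicted_log_price) {
  sqrt(mean((actual_price - exp(predicted_log_price))^2))
}

# Hypothetical toy data split into training and test halves
set.seed(3)
toy <- data.frame(area = runif(200, 800, 3000))
toy$price <- exp(7.5 + 0.6 * log(toy$area) + rnorm(200, sd = 0.1))
train <- toy[1:100, ]
test  <- toy[101:200, ]

toy_fit <- lm(log(price) ~ log(area), data = train)

rmse_train <- rmse_dollars(train$price, fitted(toy_fit))
rmse_test  <- rmse_dollars(test$price, predict(toy_fit, newdata = test))

# A ratio well above 1 would point to overfitting
rmse_test / rmse_train
```

Reporting the ratio rather than a TRUE/FALSE comparison shows how large the gap is; for the values reported above it is 22865.36 / 21504.41, or about 1.06.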

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

2.5 Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

2.5.1 Section 3.1 Final Model

Provide the summary table for your model.

# names(ames_train_dup)

ames_train_play <- ames_train_dup %>%
  filter(Sale.Condition == "Normal") %>%
  dplyr::select(price, Lot.Area, Overall.Qual, Overall.Cond, House.Age, Kitchen.Qual, Bsmt.Qual,
                Bedroom.AbvGr, area, Neighborhood, Exter.Qual, Garage.Qual, Garage.Finish, Garage.Type,
                Garage.Cars, Garage.Cond, Garage.Area, Bldg.Type, Year.Remod.Add, Heating, Heating.QC,
                Central.Air) %>%
  mutate(House.Mod = year(today()) - Year.Remod.Add)

ames_train_play <- na.omit(ames_train_play)

# Final variable and model selection #
# a * b already expands to a + b + a:b, so the main effects are not repeated #
model.bas.final <- bas.lm(data = ames_train_play, log(price) ~ log(Lot.Area) * log(area) +
                          log(House.Age) * log(House.Mod) + Overall.Qual + Overall.Cond +
                          Bedroom.AbvGr + Neighborhood,
                          prior = "JZS",
                          modelprior = uniform(),
                          method = "BAS")

## Warning in model == got.parents: longer object length is not a multiple of
## shorter object length

## Warning in bas.lm(data = ames_train_play, log(price) ~ log(Lot.Area) *
## log(area) + : bestmodel violates heredity conditions; resetting to null
## model
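
In R's formula syntax, `a * b` is shorthand for `a + b + a:b`, and a term listed more than once is kept only once. A quick check with terms(), using the variable names from the model above; no data are needed for this:

```r
# terms() expands a formula without needing any data
f <- log(price) ~ log(Lot.Area) * log(area) + log(area) + log(Lot.Area)
attr(terms(f), "term.labels")
```

The result lists the two main effects followed by their interaction, which matches the order of the rows in the summary table that follows.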

summary(model.bas.final,n.models = 3)

##                               P(B != 0 | Y)  model 1     model 2
## Intercept                         1.0000000   1.0000   1.0000000
## log(Lot.Area)                     0.9999976   1.0000   1.0000000
## log(area)                         0.9999928   1.0000   1.0000000
## log(House.Age)                    0.3562933   0.0000   1.0000000
## log(House.Mod)                    0.9999904   1.0000   1.0000000
## Overall.Qual                      1.0000000   1.0000   1.0000000
## Overall.Cond                      1.0000000   1.0000   1.0000000
## Bedroom.AbvGr                     0.9973427   1.0000   1.0000000
## NeighborhoodBlueste               1.0000000   1.0000   1.0000000
## NeighborhoodBrDale                1.0000000   1.0000   1.0000000
## NeighborhoodBrkSide               1.0000000   1.0000   1.0000000
## NeighborhoodClearCr               1.0000000   1.0000   1.0000000
## NeighborhoodCollgCr               1.0000000   1.0000   1.0000000
## NeighborhoodCrawfor               1.0000000   1.0000   1.0000000
## NeighborhoodEdwards               1.0000000   1.0000   1.0000000
## NeighborhoodGilbert               1.0000000   1.0000   1.0000000
## NeighborhoodGreens                1.0000000   1.0000   1.0000000
## NeighborhoodGrnHill               1.0000000   1.0000   1.0000000
## NeighborhoodIDOTRR                1.0000000   1.0000   1.0000000
## NeighborhoodMeadowV               1.0000000   1.0000   1.0000000
## NeighborhoodMitchel               1.0000000   1.0000   1.0000000
## NeighborhoodNAmes                 1.0000000   1.0000   1.0000000
## NeighborhoodNoRidge               1.0000000   1.0000   1.0000000
## NeighborhoodNPkVill               1.0000000   1.0000   1.0000000
## NeighborhoodNridgHt               1.0000000   1.0000   1.0000000
## NeighborhoodNWAmes                1.0000000   1.0000   1.0000000
## NeighborhoodOldTown               1.0000000   1.0000   1.0000000
## NeighborhoodSawyer                1.0000000   1.0000   1.0000000
## NeighborhoodSawyerW               1.0000000   1.0000   1.0000000
## NeighborhoodSomerst               1.0000000   1.0000   1.0000000
## NeighborhoodStoneBr               1.0000000   1.0000   1.0000000
## NeighborhoodSWISU                 1.0000000   1.0000   1.0000000
## NeighborhoodTimber                1.0000000   1.0000   1.0000000
## NeighborhoodVeenker               1.0000000   1.0000   1.0000000
## log(Lot.Area):log(area)           1.0000000   1.0000   1.0000000
## log(House.Age):log(House.Mod)     0.9999898   1.0000   1.0000000
## BF                                       NA   1.0000   0.5534834
## PostProbs                                NA   0.6420   0.3553000
## R2                                       NA   0.9057   0.9062000
## dim                                      NA  35.0000  36.0000000
## logmarg                                  NA 811.0078 810.4163164
##                                    model 3
## Intercept                     1.000000e+00
## log(Lot.Area)                 1.000000e+00
## log(area)                     1.000000e+00
## log(House.Age)                0.000000e+00
## log(House.Mod)                1.000000e+00
## Overall.Qual                  1.000000e+00
## Overall.Cond                  1.000000e+00
## Bedroom.AbvGr                 0.000000e+00
## NeighborhoodBlueste           1.000000e+00
## NeighborhoodBrDale            1.000000e+00
## NeighborhoodBrkSide           1.000000e+00
## NeighborhoodClearCr           1.000000e+00
## NeighborhoodCollgCr           1.000000e+00
## NeighborhoodCrawfor           1.000000e+00
## NeighborhoodEdwards           1.000000e+00
## NeighborhoodGilbert           1.000000e+00
## NeighborhoodGreens            1.000000e+00
## NeighborhoodGrnHill           1.000000e+00
## NeighborhoodIDOTRR            1.000000e+00
## NeighborhoodMeadowV           1.000000e+00
## NeighborhoodMitchel           1.000000e+00
## NeighborhoodNAmes             1.000000e+00
## NeighborhoodNoRidge           1.000000e+00
## NeighborhoodNPkVill           1.000000e+00
## NeighborhoodNridgHt           1.000000e+00
## NeighborhoodNWAmes            1.000000e+00
## NeighborhoodOldTown           1.000000e+00
## NeighborhoodSawyer            1.000000e+00
## NeighborhoodSawyerW           1.000000e+00
## NeighborhoodSomerst           1.000000e+00
## NeighborhoodStoneBr           1.000000e+00
## NeighborhoodSWISU             1.000000e+00
## NeighborhoodTimber            1.000000e+00
## NeighborhoodVeenker           1.000000e+00
## log(Lot.Area):log(area)       1.000000e+00
## log(House.Age):log(House.Mod) 1.000000e+00
## BF                            2.662346e-03
## PostProbs                     1.700000e-03
## R2                            9.034000e-01
## dim                           3.400000e+01
## logmarg                       8.050793e+02

confint(coefficients(model.bas.final))

##                                       2.5%         97.5%         beta
## Intercept                     12.019673707 12.0355892298 12.027739486
## log(Lot.Area)                 -0.726071182  0.1109261585 -0.307682553
## log(area)                     -0.605260472  0.4336893201 -0.088880269
## log(House.Age)                 0.000000000  0.1640322395  0.037849711
## log(House.Mod)                 0.258729469  0.5602856510  0.378564431
## Overall.Qual                   0.059780969  0.0827756865  0.071131979
## Overall.Cond                   0.049009717  0.0687756846  0.058965346
## Bedroom.AbvGr                 -0.045604090 -0.0161037483 -0.030906841
## NeighborhoodBlueste           -0.217953248  0.1033535004 -0.054724163
## NeighborhoodBrDale            -0.300504630 -0.0391434722 -0.168296732
## NeighborhoodBrkSide           -0.125882985  0.0886555744 -0.017360333
## NeighborhoodClearCr           -0.161495720  0.0798901703 -0.039273022
## NeighborhoodCollgCr           -0.198625501 -0.0117214147 -0.103685705
## NeighborhoodCrawfor           -0.035068243  0.1804061792  0.075695184
## NeighborhoodEdwards           -0.204141606 -0.0042234591 -0.101726118
## NeighborhoodGilbert           -0.252240908 -0.0571506789 -0.152376626
## NeighborhoodGreens            -0.002187161  0.2904348977  0.145372760
## NeighborhoodGrnHill            0.096478495  0.5787071244  0.344050225
## NeighborhoodIDOTRR            -0.246346623 -0.0239325322 -0.133858015
## NeighborhoodMeadowV           -0.261449690 -0.0164226428 -0.137308703
## NeighborhoodMitchel           -0.149990790  0.0505498346 -0.049830807
## NeighborhoodNAmes             -0.118104228  0.0779343250 -0.018786905
## NeighborhoodNoRidge           -0.082343161  0.1204641449  0.018905302
## NeighborhoodNPkVill           -0.150495150  0.1400869026 -0.001852661
## NeighborhoodNridgHt           -0.027598484  0.1619977682  0.067180270
## NeighborhoodNWAmes            -0.189825898  0.0156814360 -0.087144981
## NeighborhoodOldTown           -0.211357931  0.0002734994 -0.104219174
## NeighborhoodSawyer            -0.139442487  0.0634034363 -0.037604083
## NeighborhoodSawyerW           -0.249994796 -0.0539071381 -0.151549204
## NeighborhoodSomerst           -0.112324339  0.0706678075 -0.018736424
## NeighborhoodStoneBr           -0.071295746  0.1478639944  0.039737459
## NeighborhoodSWISU             -0.155900444  0.0941154326 -0.031231479
## NeighborhoodTimber            -0.115205005  0.0942804055 -0.009403058
## NeighborhoodVeenker           -0.124866698  0.1190322180 -0.002576698
## log(Lot.Area):log(area)        0.006000627  0.1195099697  0.063977609
## log(House.Age):log(House.Mod) -0.135058764 -0.0640099912 -0.090470313
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

2.5.2 Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.

Yes. I log-transformed price, Lot.Area, area, House.Age, and House.Mod.
House.Age is a calculated variable, the number of years since Year.Built; House.Mod is the number of years since Year.Remod.Add.
An initial look at the distributions of these variables showed skewness, so I chose to log-transform them, as in the previous assessments.
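
As a minimal sketch of the kind of skewness check described above, the following uses simulated right-skewed values rather than the Ames data. The `skewness()` helper and the simulated `lot_area_sim` vector are assumptions introduced for illustration only.

```r
# Sample skewness (third standardized moment); a hypothetical helper, base R only.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
# Simulated right-skewed values standing in for a variable such as Lot.Area.
lot_area_sim <- rlnorm(1000, meanlog = 9, sdlog = 0.5)

skew_raw <- skewness(lot_area_sim)       # positive: long right tail
skew_log <- skewness(log(lot_area_sim))  # close to zero: roughly symmetric

c(raw = skew_raw, logged = skew_log)
```

The same helper could be applied to each candidate variable in ames_train before and after taking logs to see which ones benefit from the transformation.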

image(model.bas.final, top.models = 10)

plot(model.bas.final,which = c(1:3), sub.caption = "model.bas.final")

2.5.3 Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.

Yes, I included two two-way interactions: log(Lot.Area) * log(area) and log(House.Age) * log(House.Mod). Interactions of three or more variables are uncommon, so I did not include any in my model.
The price of a house plausibly depends on the lot area and the area of the house jointly. Likewise, the price could be influenced not only by when the house was first built but also by when it was last remodeled: a recently remodeled house may sell for a different price than one that has not been remodeled since it was built long ago.
Judging by the inclusion probabilities, log(House.Age) has a posterior inclusion probability below 40%, while the interaction term log(House.Age) * log(House.Mod) has a posterior inclusion probability close to 100%.
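
To illustrate what the log(Lot.Area):log(area) interaction implies, the sketch below plugs the posterior means from the confint() output above into the expression for the approximate elasticity of price with respect to area, which is the log(area) coefficient plus the interaction coefficient times log(Lot.Area). The lot sizes are illustrative assumptions, and because the posterior means are averaged over models, this is a rough reading rather than an exact marginal effect.

```r
# Posterior means copied from the confint() output of model.bas.final.
b_area       <- -0.088880269   # log(area)
b_lot_x_area <-  0.063977609   # log(Lot.Area):log(area)

# Approximate elasticity of price with respect to area at a given lot area.
area_elasticity <- function(lot_area) {
  b_area + b_lot_x_area * log(lot_area)
}

lot_sizes <- c(5000, 10000, 20000)   # illustrative lot areas, not taken from the data
elasticities <- area_elasticity(lot_sizes)
round(elasticities, 3)
```

Under these numbers the elasticity is roughly 0.5 for a lot area of 10000 and grows with the lot area, which is one way to read the positive interaction coefficient.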

plot(model.bas.final, which = 4, sub.caption = " ")

2.5.4 Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.

Based on the initial BAS model, I dropped the Exter.Qual and Bsmt.Qual variables, even though they were significant in the AIC model. Overall.Qual and Overall.Cond are stronger predictors than the other quality variables.
I added House.Mod, derived from Year.Remod.Add, which was not in my initial model; it could have a significant influence on the house price, as its posterior inclusion probability shows.
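
For readers unfamiliar with the stepwise call used below, here is a small self-contained sketch of backward elimination with step() on the built-in mtcars data, which stands in for ames_train_fit. The comparison with a BIC penalty is an addition for illustration and is not part of the original analysis.

```r
# Backward elimination on the built-in mtcars data (a stand-in for ames_train_fit).
full_fit <- lm(mpg ~ ., data = mtcars)

# k = 2 gives the AIC penalty, as in the call used in this section.
aic_fit <- step(full_fit, direction = "backward", k = 2, trace = FALSE)

# k = log(n) gives the BIC penalty, which penalizes each extra term more heavily.
bic_fit <- step(full_fit, direction = "backward",
                k = log(nrow(mtcars)), trace = FALSE)

formula(aic_fit)
formula(bic_fit)
```

Comparing the two formulas shows how sensitive the selected variable set is to the penalty, which is one reason to cross-check a stepwise result against posterior inclusion probabilities.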

# Using the stepwise function #
model.fit.AIC <- step(model.fit, direction = "backward", k = 2, trace = FALSE)
summary(model.fit.AIC)

## 
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + log(area) + log(House.Age) + 
##     Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual + 
##     Exter.Qual + Neighborhood + Bedroom.AbvGr, data = ames_train_fit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84933 -0.06044  0.00587  0.06465  0.45309 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.601036   0.174386  43.587  < 2e-16 ***
## log(Lot.Area)        0.140648   0.011665  12.057  < 2e-16 ***
## log(area)            0.477700   0.023065  20.711  < 2e-16 ***
## log(House.Age)      -0.161004   0.017717  -9.088  < 2e-16 ***
## Overall.Qual         0.056936   0.006027   9.447  < 2e-16 ***
## Overall.Cond         0.054356   0.004498  12.085  < 2e-16 ***
## Bsmt.QualFa         -0.120274   0.037620  -3.197 0.001445 ** 
## Bsmt.QualGd         -0.075241   0.021177  -3.553 0.000404 ***
## Bsmt.QualPo         -0.239834   0.118946  -2.016 0.044113 *  
## Bsmt.QualTA         -0.099868   0.025789  -3.873 0.000117 ***
## Kitchen.QualFa      -0.137083   0.040996  -3.344 0.000866 ***
## Kitchen.QualGd      -0.060986   0.026482  -2.303 0.021549 *  
## Kitchen.QualPo      -0.160177   0.125077  -1.281 0.200711    
## Kitchen.QualTA      -0.096330   0.028374  -3.395 0.000721 ***
## Exter.QualFa        -0.139133   0.059970  -2.320 0.020599 *  
## Exter.QualGd        -0.085073   0.036752  -2.315 0.020886 *  
## Exter.QualTA        -0.082355   0.040035  -2.057 0.040016 *  
## NeighborhoodBlueste -0.035021   0.080468  -0.435 0.663525    
## NeighborhoodBrDale  -0.132301   0.064015  -2.067 0.039095 *  
## NeighborhoodBrkSide -0.078653   0.053989  -1.457 0.145571    
## NeighborhoodClearCr -0.021881   0.062021  -0.353 0.724330    
## NeighborhoodCollgCr -0.082639   0.046975  -1.759 0.078938 .  
## NeighborhoodCrawfor  0.049345   0.054490   0.906 0.365442    
## NeighborhoodEdwards -0.117875   0.050745  -2.323 0.020445 *  
## NeighborhoodGilbert -0.128795   0.049195  -2.618 0.009017 ** 
## NeighborhoodGreens   0.179216   0.073428   2.441 0.014883 *  
## NeighborhoodGrnHill  0.387315   0.121795   3.180 0.001531 ** 
## NeighborhoodIDOTRR  -0.206114   0.055580  -3.708 0.000224 ***
## NeighborhoodMeadowV -0.148740   0.055864  -2.663 0.007918 ** 
## NeighborhoodMitchel -0.031036   0.050139  -0.619 0.536107    
## NeighborhoodNAmes   -0.028977   0.049701  -0.583 0.560040    
## NeighborhoodNoRidge  0.056164   0.050377   1.115 0.265249    
## NeighborhoodNPkVill  0.011520   0.073236   0.157 0.875051    
## NeighborhoodNridgHt  0.002868   0.049202   0.058 0.953533    
## NeighborhoodNWAmes  -0.057779   0.050885  -1.135 0.256524    
## NeighborhoodOldTown -0.177196   0.053467  -3.314 0.000962 ***
## NeighborhoodSawyer  -0.037216   0.051101  -0.728 0.466667    
## NeighborhoodSawyerW -0.138330   0.048847  -2.832 0.004748 ** 
## NeighborhoodSomerst -0.027373   0.046564  -0.588 0.556799    
## NeighborhoodStoneBr  0.031589   0.055555   0.569 0.569787    
## NeighborhoodSWISU   -0.109296   0.061956  -1.764 0.078116 .  
## NeighborhoodTimber  -0.036876   0.053633  -0.688 0.491939    
## NeighborhoodVeenker  0.033352   0.060472   0.552 0.581431    
## Bedroom.AbvGr       -0.026555   0.007433  -3.573 0.000375 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.113 on 770 degrees of freedom
## Multiple R-squared:  0.9123, Adjusted R-squared:  0.9074 
## F-statistic: 186.2 on 43 and 770 DF,  p-value: < 2.2e-16

# Using BAS model #
model.fit.bas <- bas.lm(data = ames_train_fit,log(price) ~ log(Lot.Area) + log(area) + 
                          log(House.Age) + Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual + 
                          Exter.Qual + Bedroom.AbvGr + Neighborhood, prior = "JZS", modelprior = uniform(),
                        method = "BAS")

## Warning in model == got.parents: longer object length is not a multiple of
## shorter object length

## Warning in bas.lm(data = ames_train_fit, log(price) ~ log(Lot.Area) +
## log(area) + : bestmodel violates heredity conditions; resetting to null
## model

image(model.fit.bas, top.models = 10, rotate = TRUE)

plot(model.fit.bas, which = 4, ask = FALSE, sub.caption = " ")

2.5.5 Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.

I tested the final model on the ames_test data and compared its in-sample and out-of-sample errors with those of the initial model.

I decided to keep the final model and test it against the validation data set, even though its RMSE is higher than the initial model's both in sample and out of sample. My next step is to check whether the RMSE on the validation data set is higher or lower than the final model's in-sample and out-of-sample errors.
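
The RMSE comparisons in this section all follow the same pattern: predict on the log scale, back-transform with exp(), and take the root mean squared error on the price scale. A minimal sketch of that pattern, using simulated data and a plain lm() fit instead of the BAS model, is shown here; the `rmse()` helper and all simulated names are assumptions for illustration.

```r
# Root mean squared error on the original (price) scale; hypothetical helper.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(7)
n <- 300
log_area <- runif(n, 6.5, 8)                               # simulated predictor
price <- exp(4.5 + 1.0 * log_area + rnorm(n, sd = 0.15))   # simulated prices
sim <- data.frame(log_area = log_area, price = price)
train <- sim[1:200, ]
test  <- sim[201:300, ]

# Fit on the log scale, as the report does for log(price).
fit <- lm(log(price) ~ log_area, data = train)

# Back-transform predictions before computing the error.
rmse_train <- rmse(train$price, exp(fitted(fit)))
rmse_test  <- rmse(test$price,  exp(predict(fit, newdata = test)))

c(in_sample = rmse_train, out_of_sample = rmse_test)
```

Wrapping the calculation in one function keeps the sign convention and the back-transformation identical across the training, test, and validation comparisons.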

load("ames_test.Rdata")
ames_test <- tbl_df(ames_test)

ames_test <- ames_test %>%
  filter(Sale.Condition == "Normal") %>%
  filter(Neighborhood != "Landmrk") %>%
  mutate(House.Age = year(today()) - Year.Built) %>%
  mutate(House.Mod = year(today()) - Year.Remod.Add) %>%
  dplyr::select(price, Lot.Area, Overall.Qual, Overall.Cond, House.Age, Kitchen.Qual, Bsmt.Qual,
                Bedroom.AbvGr, area, Neighborhood, Exter.Qual, Bldg.Type, House.Mod)

ames_test <- na.omit(ames_test)
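
The same preparation steps are applied to the test data here and to the validation data later. One possible way to keep them identical is a small helper, sketched below in base R on a toy data frame. The `prep_ames()` function, its `vars` and `ref_year` arguments, and the toy values are all hypothetical; a fixed `ref_year` is used only so that House.Age does not change with the date on which the code is run.

```r
# Hypothetical helper mirroring the filter / mutate / select / na.omit steps above.
prep_ames <- function(df, vars, ref_year) {
  keep <- df$Sale.Condition == "Normal" & df$Neighborhood != "Landmrk"
  out <- df[keep, , drop = FALSE]
  out$House.Age <- ref_year - out$Year.Built       # years since construction
  out$House.Mod <- ref_year - out$Year.Remod.Add   # years since last remodel
  na.omit(out[, c(vars, "House.Age", "House.Mod"), drop = FALSE])
}

# Toy rows standing in for ames_test; values are made up for illustration.
toy <- data.frame(
  price = c(150000, 200000, 90000, NA),
  Sale.Condition = c("Normal", "Abnorml", "Normal", "Normal"),
  Neighborhood = c("NAmes", "NAmes", "Landmrk", "Gilbert"),
  Year.Built = c(1960, 1995, 2000, 2005),
  Year.Remod.Add = c(1990, 1995, 2000, 2006),
  stringsAsFactors = FALSE
)

# ref_year = 2018 is an arbitrary illustrative choice.
toy_clean <- prep_ames(toy, vars = c("price", "Neighborhood"), ref_year = 2018)
toy_clean
```

Only the first toy row survives: the second is not a Normal sale, the third is in Landmrk, and the fourth has a missing price.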

# In-sample error of the final model using ames_train_play #

init.model.bas.fit <- exp(fitted(model.bas.final, estimator = "BMA"))
residuals.ames.train <-  init.model.bas.fit - ames_train_play$price
rmse.ames.train.final <- sqrt(mean(residuals.ames.train ^2))
message ("The RMSE value or in sample error using final model is ", round(rmse.ames.train.final,2))

## The RMSE value or in sample error using final model is 22643.59

# Out of sample error in final model using ames_test # 
test.pred <-  predict(model.bas.final, newdata = ames_test, estimator = "BMA")
test.pred.val <- exp(test.pred$fit)
residuals.ames.test <- ames_test$price - test.pred.val
rmse.ames.test.final <- sqrt(mean(residuals.ames.test^2))
message ("The RMSE value or out of sample error using final model is ", round(rmse.ames.test.final,2))

## The RMSE value or out of sample error using final model is 24059.09

# Compare the RMSE of the final model with the initial model, in sample and out of sample #
message("Is in sample error in final model more than in sample error in initial model? Answer: ", (rmse.ames.train.final > rmse.ames.train))

## Is in sample error in final model more than in sample error in initial model? Answer: TRUE

message("Is out of sample error in final model more than out of sample error in initial model ? Answer: ", (rmse.ames.test.final > rmse.ames.test))

## Is out of sample error in final model more than out of sample error in initial model ? Answer: TRUE

message("Is out of sample error in final model more than in sample error in final model ? Answer: ", (rmse.ames.test.final > rmse.ames.train.final))

## Is out of sample error in final model more than in sample error in final model ? Answer: TRUE

2.6 Part 4 Final Model Assessment

2.6.1 Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.

In the residuals vs. fitted plot for the final model, the residuals are scattered randomly, which is consistent with linearity.
A few leverage points with large residuals are visible. I will keep these points in the model to avoid any further overfitting.
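
As a sketch of how points with large residuals or high leverage can be listed rather than only read off a plot, the following uses a plain lm() fit on the built-in mtcars data. The cutoffs (absolute standardized residual above 2, leverage above twice the average) are common rules of thumb assumed here, not values taken from this analysis.

```r
# Ordinary linear model on built-in data, standing in for the final model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

diagnostics <- data.frame(
  fitted    = fitted(fit),
  std_resid = rstandard(fit),   # standardized residuals
  leverage  = hatvalues(fit)    # diagonal of the hat matrix
)

p <- length(coef(fit))
n <- nrow(mtcars)

# Flag observations using rule-of-thumb cutoffs (assumptions, see lead-in).
flagged <- diagnostics[abs(diagnostics$std_resid) > 2 |
                         diagnostics$leverage > 2 * p / n, ]

# Residuals vs. fitted plot with a reference line at zero.
plot(diagnostics$fitted, diagnostics$std_resid,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

flagged
```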

2.6.2 Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.

The in-sample RMSE of the final model is 22643.59.
The out-of-sample RMSE of the final model is 24059.09.
The in-sample error of the final model is higher than that of the initial model.
The out-of-sample error of the final model is higher than that of the initial model.
The out-of-sample error of the final model is higher than its in-sample error.

2.6.3 Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?

Strengths

Variables were selected using Bayesian Model Averaging, and the model was chosen based on rank and posterior distribution.
It is the top-ranked model according to the model rank matrix and the posterior inclusion probabilities of the variables.
The residual plot indicates that the assumptions of the linear model are not violated.
The interaction terms that are included are significant predictors.

Weaknesses

The in-sample error of the final model is higher than that of the initial model.
The out-of-sample error of the final model is higher than that of the initial model.
The out-of-sample error of the final model is higher than its in-sample error.
If a house belongs to a Neighborhood level that was not seen in training, the model cannot be used. A better model could perhaps be built from more generic variables based on the quality, condition, and age of the house.
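
The last weakness is the reason the Landmrk neighborhood is filtered out of the test and validation data. A minimal sketch of checking new data for factor levels that were not seen in training is given below; the toy vectors are assumptions for illustration.

```r
# Neighborhood levels seen in training (toy values).
train_nbhd <- factor(c("NAmes", "Gilbert", "NAmes", "OldTown"))

# Neighborhoods in a new data set, one of which was never seen in training.
new_nbhd <- c("NAmes", "Landmrk", "Gilbert")

# Levels the model has no coefficient for.
unseen <- setdiff(unique(new_nbhd), levels(train_nbhd))

# Rows of the new data that the model can score.
usable <- new_nbhd %in% levels(train_nbhd)

unseen
usable
```

Running such a check before predict() makes the dropped rows explicit instead of relying on a hard-coded filter for one neighborhood.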

2.6.4 Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")
ames_validation <- tbl_df(ames_validation)

ames_validation <- ames_validation %>%
  filter(Sale.Condition == "Normal") %>%
  filter(Neighborhood != "Landmrk") %>%
  mutate(House.Age = year(today()) - Year.Built) %>%
  mutate(House.Mod = year(today()) - Year.Remod.Add) %>%
  dplyr::select(price, Lot.Area, Overall.Qual, Overall.Cond, House.Age, Kitchen.Qual, Bsmt.Qual,
                Bedroom.AbvGr, area, Neighborhood, Exter.Qual, Bldg.Type, House.Mod)
ames_validation <- na.omit(ames_validation)

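The RMSE of the final model on the validation data set is lower than its RMSE on the test data set.

The RMSE of the final model on the validation data set is also lower than its in-sample error on the training data set.

The coverage figure computed below is about 93%, close to the nominal 95%. Note that it is obtained from a single interval, the 2.5% and 97.5% quantiles of the predicted prices, rather than from a separate predictive interval for each house, so it is only a rough indication of how well the model reflects uncertainty.

This suggests that my final model is still a better model, with better predictability of house prices in Ames, Iowa.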

validate.pred <-  predict(model.bas.final, newdata = ames_validation, estimator = "BMA")
validate.pred.val <- exp(validate.pred$fit)
residuals.ames.validate <- ames_validation$price - validate.pred.val
rmse.ames.validate <- sqrt(mean(residuals.ames.validate^2))
rmse.ames.validate

## [1] 21917.84

# Compare the validation RMSE with the test and training RMSE of the final model #
message("Is out of sample error in validation dataset in final model more than out of sample error in test dataset in final model ? Answer: ", (rmse.ames.validate > rmse.ames.test.final))

## Is out of sample error in validation dataset in final model more than out of sample error in test dataset in final model ? Answer: FALSE

message("Is out of sample error in validation dataset in final model more than in sample error in train dataset in final model ? Answer: ", (rmse.ames.validate > rmse.ames.train.final))

## Is out of sample error in validation dataset in final model more than in sample error in train dataset in final model ? Answer: FALSE

interval <- quantile(validate.pred.val, c(0.025, 0.975))
coverage.prob <- mean(ames_validation$price > interval[1] &
                            ames_validation$price < interval[2])
message("Coverage probability is = ", paste0(round(100*coverage.prob,0),"%"))

## Coverage probability is = 93%
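
The assignment asks what share of the 95% predictive intervals contain the true price, which calls for one interval per house. The sketch below shows that calculation with a plain lm() fit and predict(..., interval = "prediction") on simulated data; it is not the BAS model from this report. With BAS, per-house intervals can be requested through predict(..., se.fit = TRUE) followed by confint(..., parm = "pred"), but that call is not run here and its result for this model is unknown.

```r
set.seed(42)
n <- 400
x <- runif(n, 6, 8)                                  # simulated stand-in for log(area)
log_price <- 5 + 0.9 * x + rnorm(n, sd = 0.15)
dat <- data.frame(x = x, price = exp(log_price))
train <- dat[1:300, ]
valid <- dat[301:400, ]

fit <- lm(log(price) ~ x, data = train)

# One 95% prediction interval per validation row, on the log scale.
pi_log <- predict(fit, newdata = valid, interval = "prediction", level = 0.95)

# Back-transform the interval limits to the price scale.
lower <- exp(pi_log[, "lwr"])
upper <- exp(pi_log[, "upr"])

# Share of validation prices that fall inside their own interval.
covered  <- valid$price >= lower & valid$price <= upper
coverage <- mean(covered)
coverage
```

A coverage close to 0.95 indicates that the stated uncertainty matches the observed errors; values well below that would suggest intervals that are too narrow.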

2.7 Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.

This is an interesting data set for house price prediction, with a multitude of variables to explore.

For houses sold under the Normal sale condition, the resulting model is a good predictor and also conveys the uncertainty of its predictions.

The model was validated using the test and validation data sets, and the out-of-sample error was calculated to check for overfitting.

The RMSE on the validation data set is lower than on the training and test data sets.

There is a strong interaction between the age of the house and the number of years since it was last remodeled.

Log transformation of the variables is necessary to satisfy the regression assumptions.

In this exercise I practiced many techniques for linear modeling using the Bayesian approach, which was my key interest, coming from a frequentist background.

1 Background