As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
To better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load only the training data set; the others will be loaded and used later.
load("ames_train.Rdata")
Loading the necessary packages and libraries.
library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(MASS)
library(broom)
library(lubridate)
library(gridExtra)
library(GGally)
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
The initial dataset has 1000 observations of 81 variables. I computed the age of each house from the year it was built and added it as an additional column, House.Age.
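The chunk that adds House.Age to the training data is not shown in this report. A minimal sketch, assuming it mirrors the test-set preparation used later (current year minus Year.Built): the data frame `toy_homes` is hypothetical, and base R's Sys.Date() stands in for lubridate's year(today()).

```r
# Hypothetical toy data using the first few Year.Built values of ames_train
toy_homes <- data.frame(Year.Built = c(1939, 1984, 1930, 1900, 2001))

# Current calendar year; plays the role of lubridate::year(lubridate::today())
current_year <- as.integer(format(Sys.Date(), "%Y"))

# Age in years; positive for every house here, so log(House.Age) is defined
toy_homes$House.Age <- current_year - toy_homes$Year.Built
print(toy_homes)
```

Because the age is measured from today's date rather than from Yr.Sold, the values shift by one each calendar year in which the report is re-run.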
The dataset contains both numeric and factor variables, as shown by the output of the str function on the ames_train data frame.
# Retrieve data structure of the variables in the dataset #
ames_train <- tbl_df(ames_train)
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 81 variables:
## $ PID : int 909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
## $ area : int 856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ MS.SubClass : int 30 120 30 70 60 85 20 20 20 180 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
## $ Lot.Frontage : int NA 42 60 80 70 64 60 53 74 35 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
## $ Overall.Qual : int 6 5 5 4 8 7 4 7 5 6 ...
## $ Overall.Cond : int 6 5 9 8 6 5 4 5 6 5 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add : int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
## $ Mas.Vnr.Area : int 0 149 0 0 0 500 0 20 0 76 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
## $ BsmtFin.SF.1 : int 238 552 737 0 643 0 0 0 647 467 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 393 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 618 104 100 405 167 0 936 1146 217 80 ...
## $ Total.Bsmt.SF : int 856 1049 837 405 810 0 936 1146 864 547 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 856 1049 1001 717 810 495 936 1246 889 1072 ...
## $ X2nd.Flr.SF : int 0 0 0 322 855 1427 0 0 0 0 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : int 1 1 0 0 1 0 0 0 0 1 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 2 1 1 2 3 1 2 1 1 ...
## $ Half.Bath : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
## $ TotRms.AbvGrd : int 4 5 5 6 6 7 4 5 6 5 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
## $ Fireplaces : int 1 0 0 0 0 1 0 1 0 0 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
## $ Garage.Yr.Blt : int 1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
## $ Garage.Cars : int 2 1 1 1 2 2 2 2 2 2 ...
## $ Garage.Area : int 399 266 216 281 528 672 576 428 484 525 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 0 0 154 0 0 0 0 100 0 0 ...
## $ Open.Porch.SF : int 0 105 0 0 45 0 32 24 0 44 ...
## $ Enclosed.Porch : int 0 0 42 168 0 177 112 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 86 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 166 0 0 111 0 0 0 0 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 3 2 11 5 11 7 2 3 4 5 ...
## $ Yr.Sold : int 2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
After exploring many graphs, I found the following three to be the most informative.
price vs Neighborhoods and Overall Quality
The location of a house plays an important role in determining its price. The boxplot shows the price distribution, in multiples of 1000, across the neighborhoods of Ames, Iowa. MeadowV is the least expensive locality while StoneBr is the most expensive. The plot also indicates that the variation in house price is quite large in StoneBr and NridgHt, even though they are among the most expensive localities.
Overall Quality has a strong positive correlation with house prices.
Housing price has a strong positive relationship with the Overall Quality, Overall Condition and Lot Area of the house.
Housing price has a negative relationship with the Age of the house.
The number of bedrooms above ground does not seem to have a very strong determining effect on the housing price. However, this could be better verified during variable selection in the modeling stage.
Quality seems to be an important factor when people look for houses, hence a separate analysis of all the quality-related variables.
The configuration of the lot, whether it is on a corner or well inside a street or layout, is also an important factor in determining housing prices.
EDA Summary: Overall Quality, Overall Condition, Age of the House, Lot Area, Neighborhood and number of bedrooms appear to have a strong relationship to the price of the house. Other quality variables (garage, kitchen, basement) and the configuration of the lot have a moderate association with the house price.
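The plotting code behind the neighborhood boxplot is not included in this report. As a rough numeric counterpart, the sketch below ranks neighborhoods by median price and compares their spread; the data frame `toy` and its prices are made up for illustration and are not taken from ames_train.

```r
# Hypothetical toy data; the real analysis would use ames_train$price and ames_train$Neighborhood
toy <- data.frame(
  Neighborhood = c("MeadowV", "MeadowV", "NAmes", "NAmes", "StoneBr", "StoneBr"),
  price = c(85000, 95000, 140000, 150000, 280000, 420000)
)

# Median and interquartile range of price (in $1000s) per neighborhood
med_price <- tapply(toy$price / 1000, toy$Neighborhood, median)
iqr_price <- tapply(toy$price / 1000, toy$Neighborhood, IQR)

# Neighborhoods ordered from least to most expensive, with their spread
ranked <- names(sort(med_price))
print(med_price[ranked])
print(iqr_price[ranked])
```

A table like this makes it easy to confirm what the boxplot suggests: which localities sit at the extremes and which have the widest price range.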
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
Create a linear model for price using the following variables, selected on the basis of the previous assessment, intuition and expert knowledge:
- Lot.Area
- Overall.Qual
- Overall.Cond
- House.Age
- Kitchen.Qual
- area
- Bedroom.AbvGr
- Bsmt.Qual
- Exter.Qual
- Neighborhood
ames_train_fit <- ames_train %>%
  filter(Sale.Condition == "Normal") %>%
  dplyr::select(price, Lot.Area, Overall.Qual, Overall.Cond, House.Age, Kitchen.Qual, Bsmt.Qual,
                Bedroom.AbvGr, area, Neighborhood, Exter.Qual)
ames_train_fit <- na.omit(ames_train_fit)
# Fit the initial model: log(price) on log-transformed Lot.Area, area and House.Age plus the remaining predictors, including Neighborhood #
model.fit <- lm(data = ames_train_fit, log(price) ~ log(Lot.Area) + log(area) + log(House.Age) +
                  Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual + Exter.Qual + Neighborhood +
                  Bedroom.AbvGr)
summary(model.fit)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + log(area) + log(House.Age) +
## Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual +
## Exter.Qual + Neighborhood + Bedroom.AbvGr, data = ames_train_fit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84933 -0.06044 0.00587 0.06465 0.45309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.601036 0.174386 43.587 < 2e-16 ***
## log(Lot.Area) 0.140648 0.011665 12.057 < 2e-16 ***
## log(area) 0.477700 0.023065 20.711 < 2e-16 ***
## log(House.Age) -0.161004 0.017717 -9.088 < 2e-16 ***
## Overall.Qual 0.056936 0.006027 9.447 < 2e-16 ***
## Overall.Cond 0.054356 0.004498 12.085 < 2e-16 ***
## Bsmt.QualFa -0.120274 0.037620 -3.197 0.001445 **
## Bsmt.QualGd -0.075241 0.021177 -3.553 0.000404 ***
## Bsmt.QualPo -0.239834 0.118946 -2.016 0.044113 *
## Bsmt.QualTA -0.099868 0.025789 -3.873 0.000117 ***
## Kitchen.QualFa -0.137083 0.040996 -3.344 0.000866 ***
## Kitchen.QualGd -0.060986 0.026482 -2.303 0.021549 *
## Kitchen.QualPo -0.160177 0.125077 -1.281 0.200711
## Kitchen.QualTA -0.096330 0.028374 -3.395 0.000721 ***
## Exter.QualFa -0.139133 0.059970 -2.320 0.020599 *
## Exter.QualGd -0.085073 0.036752 -2.315 0.020886 *
## Exter.QualTA -0.082355 0.040035 -2.057 0.040016 *
## NeighborhoodBlueste -0.035021 0.080468 -0.435 0.663525
## NeighborhoodBrDale -0.132301 0.064015 -2.067 0.039095 *
## NeighborhoodBrkSide -0.078653 0.053989 -1.457 0.145571
## NeighborhoodClearCr -0.021881 0.062021 -0.353 0.724330
## NeighborhoodCollgCr -0.082639 0.046975 -1.759 0.078938 .
## NeighborhoodCrawfor 0.049345 0.054490 0.906 0.365442
## NeighborhoodEdwards -0.117875 0.050745 -2.323 0.020445 *
## NeighborhoodGilbert -0.128795 0.049195 -2.618 0.009017 **
## NeighborhoodGreens 0.179216 0.073428 2.441 0.014883 *
## NeighborhoodGrnHill 0.387315 0.121795 3.180 0.001531 **
## NeighborhoodIDOTRR -0.206114 0.055580 -3.708 0.000224 ***
## NeighborhoodMeadowV -0.148740 0.055864 -2.663 0.007918 **
## NeighborhoodMitchel -0.031036 0.050139 -0.619 0.536107
## NeighborhoodNAmes -0.028977 0.049701 -0.583 0.560040
## NeighborhoodNoRidge 0.056164 0.050377 1.115 0.265249
## NeighborhoodNPkVill 0.011520 0.073236 0.157 0.875051
## NeighborhoodNridgHt 0.002868 0.049202 0.058 0.953533
## NeighborhoodNWAmes -0.057779 0.050885 -1.135 0.256524
## NeighborhoodOldTown -0.177196 0.053467 -3.314 0.000962 ***
## NeighborhoodSawyer -0.037216 0.051101 -0.728 0.466667
## NeighborhoodSawyerW -0.138330 0.048847 -2.832 0.004748 **
## NeighborhoodSomerst -0.027373 0.046564 -0.588 0.556799
## NeighborhoodStoneBr 0.031589 0.055555 0.569 0.569787
## NeighborhoodSWISU -0.109296 0.061956 -1.764 0.078116 .
## NeighborhoodTimber -0.036876 0.053633 -0.688 0.491939
## NeighborhoodVeenker 0.033352 0.060472 0.552 0.581431
## Bedroom.AbvGr -0.026555 0.007433 -3.573 0.000375 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.113 on 770 degrees of freedom
## Multiple R-squared: 0.9123, Adjusted R-squared: 0.9074
## F-statistic: 186.2 on 43 and 770 DF, p-value: < 2.2e-16
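Because the response is log(price), the coefficients act multiplicatively on price. The sketch below converts three estimates from the summary table above into approximate percentage effects, assuming all other predictors are held fixed; the variable names are introduced here only for illustration.

```r
# Estimates copied from the summary table above
b_log_area     <- 0.477700   # log(area)
b_overall_qual <- 0.056936   # Overall.Qual
b_bedroom      <- -0.026555  # Bedroom.AbvGr

# Log-log term: a 10% increase in area multiplies the predicted price by 1.10^b
area_effect_pct <- (1.10^b_log_area - 1) * 100

# Log-linear terms: a one-unit increase multiplies the predicted price by exp(b)
qual_effect_pct    <- (exp(b_overall_qual) - 1) * 100
bedroom_effect_pct <- (exp(b_bedroom) - 1) * 100

print(round(c(area_10pct = area_effect_pct,
              quality_plus1 = qual_effect_pct,
              bedroom_plus1 = bedroom_effect_pct), 2))
```

Read this way, a 10% larger living area goes with a predicted price about 4.7% higher, one extra point of Overall.Qual with about 5.9% higher, and one extra bedroom (at fixed area) with about 2.6% lower.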
Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
# Using the stepwise function #
model.fit.AIC <- step(model.fit, direction = "backward", k = 2 ,trace = FALSE)
summary(model.fit.AIC)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + log(area) + log(House.Age) +
## Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual +
## Exter.Qual + Neighborhood + Bedroom.AbvGr, data = ames_train_fit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84933 -0.06044 0.00587 0.06465 0.45309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.601036 0.174386 43.587 < 2e-16 ***
## log(Lot.Area) 0.140648 0.011665 12.057 < 2e-16 ***
## log(area) 0.477700 0.023065 20.711 < 2e-16 ***
## log(House.Age) -0.161004 0.017717 -9.088 < 2e-16 ***
## Overall.Qual 0.056936 0.006027 9.447 < 2e-16 ***
## Overall.Cond 0.054356 0.004498 12.085 < 2e-16 ***
## Bsmt.QualFa -0.120274 0.037620 -3.197 0.001445 **
## Bsmt.QualGd -0.075241 0.021177 -3.553 0.000404 ***
## Bsmt.QualPo -0.239834 0.118946 -2.016 0.044113 *
## Bsmt.QualTA -0.099868 0.025789 -3.873 0.000117 ***
## Kitchen.QualFa -0.137083 0.040996 -3.344 0.000866 ***
## Kitchen.QualGd -0.060986 0.026482 -2.303 0.021549 *
## Kitchen.QualPo -0.160177 0.125077 -1.281 0.200711
## Kitchen.QualTA -0.096330 0.028374 -3.395 0.000721 ***
## Exter.QualFa -0.139133 0.059970 -2.320 0.020599 *
## Exter.QualGd -0.085073 0.036752 -2.315 0.020886 *
## Exter.QualTA -0.082355 0.040035 -2.057 0.040016 *
## NeighborhoodBlueste -0.035021 0.080468 -0.435 0.663525
## NeighborhoodBrDale -0.132301 0.064015 -2.067 0.039095 *
## NeighborhoodBrkSide -0.078653 0.053989 -1.457 0.145571
## NeighborhoodClearCr -0.021881 0.062021 -0.353 0.724330
## NeighborhoodCollgCr -0.082639 0.046975 -1.759 0.078938 .
## NeighborhoodCrawfor 0.049345 0.054490 0.906 0.365442
## NeighborhoodEdwards -0.117875 0.050745 -2.323 0.020445 *
## NeighborhoodGilbert -0.128795 0.049195 -2.618 0.009017 **
## NeighborhoodGreens 0.179216 0.073428 2.441 0.014883 *
## NeighborhoodGrnHill 0.387315 0.121795 3.180 0.001531 **
## NeighborhoodIDOTRR -0.206114 0.055580 -3.708 0.000224 ***
## NeighborhoodMeadowV -0.148740 0.055864 -2.663 0.007918 **
## NeighborhoodMitchel -0.031036 0.050139 -0.619 0.536107
## NeighborhoodNAmes -0.028977 0.049701 -0.583 0.560040
## NeighborhoodNoRidge 0.056164 0.050377 1.115 0.265249
## NeighborhoodNPkVill 0.011520 0.073236 0.157 0.875051
## NeighborhoodNridgHt 0.002868 0.049202 0.058 0.953533
## NeighborhoodNWAmes -0.057779 0.050885 -1.135 0.256524
## NeighborhoodOldTown -0.177196 0.053467 -3.314 0.000962 ***
## NeighborhoodSawyer -0.037216 0.051101 -0.728 0.466667
## NeighborhoodSawyerW -0.138330 0.048847 -2.832 0.004748 **
## NeighborhoodSomerst -0.027373 0.046564 -0.588 0.556799
## NeighborhoodStoneBr 0.031589 0.055555 0.569 0.569787
## NeighborhoodSWISU -0.109296 0.061956 -1.764 0.078116 .
## NeighborhoodTimber -0.036876 0.053633 -0.688 0.491939
## NeighborhoodVeenker 0.033352 0.060472 0.552 0.581431
## Bedroom.AbvGr -0.026555 0.007433 -3.573 0.000375 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.113 on 770 degrees of freedom
## Multiple R-squared: 0.9123, Adjusted R-squared: 0.9074
## F-statistic: 186.2 on 43 and 770 DF, p-value: < 2.2e-16
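The backward search above kept every term of the initial model. In step(), the argument k sets the penalty per parameter: k = 2 corresponds to AIC, and k = log(n) gives BIC, which penalizes extra terms more heavily. A small sketch of the comparison on R's built-in mtcars data (used purely for illustration, not the Ames data):

```r
# Built-in mtcars data used purely for illustration (not the Ames data)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
n <- nrow(mtcars)

# Backward elimination under AIC (k = 2) and under BIC (k = log(n))
fit_aic <- step(full_fit, direction = "backward", k = 2, trace = FALSE)
fit_bic <- step(full_fit, direction = "backward", k = log(n), trace = FALSE)

# Compare which predictors each criterion retains
aic_terms <- attr(terms(fit_aic), "term.labels")
bic_terms <- attr(terms(fit_bic), "term.labels")
print(aic_terms)
print(bic_terms)
```

Running the same pair of calls on model.fit would show whether the stricter BIC penalty drops any of the ten predictors that AIC retained.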
# Using BAS model #
model.fit.bas <- bas.lm(data = ames_train_fit,log(price) ~ log(Lot.Area) + log(area) +
log(House.Age) + Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual +
Exter.Qual + Bedroom.AbvGr + Neighborhood, prior = "JZS", modelprior = uniform(), method = "BAS")
image(model.fit.bas, top.models = 10, rotate = TRUE)
plot(model.fit.bas, which = 4, ask = FALSE, sub.caption = " ")
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
#Diagnostic Plots for the selected model #
plot(model.fit.bas, which = 1, sub.caption = " ")
The residuals vs. fitted plot flags three outlying observations: 61, 596 and 318. The Lot.Area of row 596 is significantly higher than in the rest of the data. The house in row 61 was built in 1925 and remodeled in 1950. The external condition of the house in row 596 is rated “Good”, which is shown as a significant predictor in the model. It also has a pool with Pool.QC rated “Fantastic”; in the previous assessment, this variable had the highest number of NA values.
All predictions are for homes sold under the Normal sale condition. This is expected when there are categorical predictors with limited levels. I do not expect these outliers to have any serious implications for the model.
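The row numbers of the most extreme points can also be pulled out directly instead of being read off the plot. A minimal sketch on simulated data with an ordinary lm fit (the report's model is a BAS object, so this is an analogy rather than the exact call used); the planted outliers in rows 10 and 40 are an assumption of the example.

```r
set.seed(1)
# Hypothetical toy data with two planted outliers (rows 10 and 40)
x <- runif(50, 1, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.1)
y[c(10, 40)] <- y[c(10, 40)] + c(3, -3)
toy_fit <- lm(y ~ x)

# Rows with the three largest absolute residuals
worst <- order(abs(resid(toy_fit)), decreasing = TRUE)[1:3]
print(worst)

# Inspect the flagged rows to look for unusual predictor values
print(data.frame(row = worst, x = x[worst], y = y[worst], resid = resid(toy_fit)[worst]))
```

Inspecting the flagged rows next to their predictor values is the same step described above for observations 61, 596 and 318.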
You can calculate the RMSE directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
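# Back-transform the BMA fitted values from log(price) to dollars #
init.model.bas.fit <- exp(fitted(model.fit.bas, estimator = "BMA"))
# Residuals as observed minus fitted, matching the test-set calculation #
residuals.ames.train <- ames_train_fit$price - init.model.bas.fit
rmse.ames.train <- sqrt(mean(residuals.ames.train^2))
message("The RMSE value or in sample error is ", round(rmse.ames.train, 2))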
## The RMSE value or in sample error is 21504.41
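The difference between an RMSE on the log scale and one in dollars can be seen in a small self-contained sketch. The simulated `area` and `price` values and the lm fit are assumptions for illustration; they are not the Ames data or the BAS model.

```r
set.seed(42)
# Hypothetical toy data: price depends on area on the log scale
area <- runif(200, 800, 3000)
price <- exp(7.5 + 0.6 * log(area) + rnorm(200, sd = 0.1))
toy_fit <- lm(log(price) ~ log(area))

# Back-transform fitted values from log dollars to dollars
pred_dollars <- exp(fitted(toy_fit))

# RMSE on the log scale and in the original units (dollars)
rmse_log <- sqrt(mean(resid(toy_fit)^2))
rmse_dollars <- sqrt(mean((price - pred_dollars)^2))
print(c(log_scale = rmse_log, dollars = rmse_dollars))
```

The log-scale figure is unitless and hard to communicate, whereas the back-transformed figure is in dollars. Note that exponentiating a fitted log value estimates the median rather than the mean price.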
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
ames_test <- tbl_df(ames_test)
ames_test <- ames_test %>%
  filter(Sale.Condition == "Normal") %>%
  filter(Neighborhood != "Landmrk") %>%
  mutate(House.Age = year(today()) - Year.Built) %>%
  dplyr::select(price, Lot.Area, Overall.Qual, Overall.Cond, House.Age, Kitchen.Qual, Bsmt.Qual,
                Bedroom.AbvGr, area, Neighborhood, Exter.Qual)
ames_test <- na.omit(ames_test)
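The filter on Neighborhood != "Landmrk" presumably removes a level that the model never saw during training; a fitted lm, for example, cannot predict for factor levels absent from its training data. A minimal sketch of how such levels can be detected, using made-up label vectors rather than the actual data sets:

```r
# Hypothetical neighborhood labels standing in for the training and test sets
train_nbhd <- factor(c("NAmes", "OldTown", "StoneBr", "NAmes"))
test_nbhd  <- factor(c("NAmes", "Landmrk", "StoneBr"))

# Levels that occur in the test data but never in the training data
unseen <- setdiff(unique(as.character(test_nbhd)), unique(as.character(train_nbhd)))
print(unseen)

# Keep only test rows whose level the model has seen
keep <- !(as.character(test_nbhd) %in% unseen)
print(droplevels(test_nbhd[keep]))
```

Checking every factor predictor this way before calling predict() avoids hard-coding individual level names.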
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
test.pred <- predict(model.fit.bas, newdata = ames_test, estimator = "BMA")
test.pred.val <- exp(test.pred$fit)
residuals.ames.test <- ames_test$price - test.pred.val
rmse.ames.test <- sqrt(mean(residuals.ames.test^2))
message ("The RMSE value or out of sample error is ", round(rmse.ames.test,2))
## The RMSE value or out of sample error is 22865.36
# Check if the RMSE of test data is higher than RMSE training data #
message ("Is out of sample error more than in sample error ? Answer: ",(rmse.ames.test > rmse.ames.train))
## Is out of sample error more than in sample error ? Answer: TRUE
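The TRUE/FALSE check says only that the test error is larger, not by how much. A short calculation with the two RMSE values reported above puts the gap in context; the variable names are introduced here for illustration.

```r
# RMSE values reported above, in dollars
rmse_train <- 21504.41
rmse_test  <- 22865.36

# Absolute and relative increase from in-sample to out-of-sample error
gap_dollars <- rmse_test - rmse_train
gap_pct <- 100 * gap_dollars / rmse_train
print(round(c(gap_dollars = gap_dollars, gap_pct = gap_pct), 2))
```

The out-of-sample error is about $1,361 higher, an increase of roughly 6%. Whether that counts as a significant loss of accuracy is a judgment call, but a gap of this size points to limited rather than severe overfitting.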
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
#names(ames_train_dup)
ames_train_play <- ames_train_dup %>%
  filter(Sale.Condition == "Normal") %>%
  dplyr::select(price, Lot.Area, Overall.Qual, Overall.Cond, House.Age, Kitchen.Qual, Bsmt.Qual,
                Bedroom.AbvGr, area, Neighborhood, Exter.Qual, Garage.Qual, Garage.Finish, Garage.Type,
                Garage.Cars, Garage.Cond, Garage.Area, Bldg.Type, Year.Remod.Add, Heating, Heating.QC,
                Central.Air) %>%
  mutate(House.Mod = year(today()) - Year.Remod.Add)
ames_train_play <- na.omit(ames_train_play)
# Final Variable and Model Selection #
model.bas.final <- bas.lm(data = ames_train_play, log(price) ~ log(Lot.Area) * log(area) +
                            log(House.Age) * log(House.Mod) + Overall.Qual + Overall.Cond +
                            Bedroom.AbvGr + Neighborhood + log(area) + log(Lot.Area),
                          prior = "JZS",
                          modelprior = uniform(),
                          method = "BAS")
## Warning in model == got.parents: longer object length is not a multiple of
## shorter object length
## Warning in bas.lm(data = ames_train_play, log(price) ~ log(Lot.Area) *
## log(area) + : bestmodel violates heredity conditions; resetting to null
## model
summary(model.bas.final,n.models = 3)
## P(B != 0 | Y) model 1 model 2
## Intercept 1.0000000 1.0000 1.0000000
## log(Lot.Area) 0.9999976 1.0000 1.0000000
## log(area) 0.9999928 1.0000 1.0000000
## log(House.Age) 0.3562933 0.0000 1.0000000
## log(House.Mod) 0.9999904 1.0000 1.0000000
## Overall.Qual 1.0000000 1.0000 1.0000000
## Overall.Cond 1.0000000 1.0000 1.0000000
## Bedroom.AbvGr 0.9973427 1.0000 1.0000000
## NeighborhoodBlueste 1.0000000 1.0000 1.0000000
## NeighborhoodBrDale 1.0000000 1.0000 1.0000000
## NeighborhoodBrkSide 1.0000000 1.0000 1.0000000
## NeighborhoodClearCr 1.0000000 1.0000 1.0000000
## NeighborhoodCollgCr 1.0000000 1.0000 1.0000000
## NeighborhoodCrawfor 1.0000000 1.0000 1.0000000
## NeighborhoodEdwards 1.0000000 1.0000 1.0000000
## NeighborhoodGilbert 1.0000000 1.0000 1.0000000
## NeighborhoodGreens 1.0000000 1.0000 1.0000000
## NeighborhoodGrnHill 1.0000000 1.0000 1.0000000
## NeighborhoodIDOTRR 1.0000000 1.0000 1.0000000
## NeighborhoodMeadowV 1.0000000 1.0000 1.0000000
## NeighborhoodMitchel 1.0000000 1.0000 1.0000000
## NeighborhoodNAmes 1.0000000 1.0000 1.0000000
## NeighborhoodNoRidge 1.0000000 1.0000 1.0000000
## NeighborhoodNPkVill 1.0000000 1.0000 1.0000000
## NeighborhoodNridgHt 1.0000000 1.0000 1.0000000
## NeighborhoodNWAmes 1.0000000 1.0000 1.0000000
## NeighborhoodOldTown 1.0000000 1.0000 1.0000000
## NeighborhoodSawyer 1.0000000 1.0000 1.0000000
## NeighborhoodSawyerW 1.0000000 1.0000 1.0000000
## NeighborhoodSomerst 1.0000000 1.0000 1.0000000
## NeighborhoodStoneBr 1.0000000 1.0000 1.0000000
## NeighborhoodSWISU 1.0000000 1.0000 1.0000000
## NeighborhoodTimber 1.0000000 1.0000 1.0000000
## NeighborhoodVeenker 1.0000000 1.0000 1.0000000
## log(Lot.Area):log(area) 1.0000000 1.0000 1.0000000
## log(House.Age):log(House.Mod) 0.9999898 1.0000 1.0000000
## BF NA 1.0000 0.5534834
## PostProbs NA 0.6420 0.3553000
## R2 NA 0.9057 0.9062000
## dim NA 35.0000 36.0000000
## logmarg NA 811.0078 810.4163164
## model 3
## Intercept 1.000000e+00
## log(Lot.Area) 1.000000e+00
## log(area) 1.000000e+00
## log(House.Age) 0.000000e+00
## log(House.Mod) 1.000000e+00
## Overall.Qual 1.000000e+00
## Overall.Cond 1.000000e+00
## Bedroom.AbvGr 0.000000e+00
## NeighborhoodBlueste 1.000000e+00
## NeighborhoodBrDale 1.000000e+00
## NeighborhoodBrkSide 1.000000e+00
## NeighborhoodClearCr 1.000000e+00
## NeighborhoodCollgCr 1.000000e+00
## NeighborhoodCrawfor 1.000000e+00
## NeighborhoodEdwards 1.000000e+00
## NeighborhoodGilbert 1.000000e+00
## NeighborhoodGreens 1.000000e+00
## NeighborhoodGrnHill 1.000000e+00
## NeighborhoodIDOTRR 1.000000e+00
## NeighborhoodMeadowV 1.000000e+00
## NeighborhoodMitchel 1.000000e+00
## NeighborhoodNAmes 1.000000e+00
## NeighborhoodNoRidge 1.000000e+00
## NeighborhoodNPkVill 1.000000e+00
## NeighborhoodNridgHt 1.000000e+00
## NeighborhoodNWAmes 1.000000e+00
## NeighborhoodOldTown 1.000000e+00
## NeighborhoodSawyer 1.000000e+00
## NeighborhoodSawyerW 1.000000e+00
## NeighborhoodSomerst 1.000000e+00
## NeighborhoodStoneBr 1.000000e+00
## NeighborhoodSWISU 1.000000e+00
## NeighborhoodTimber 1.000000e+00
## NeighborhoodVeenker 1.000000e+00
## log(Lot.Area):log(area) 1.000000e+00
## log(House.Age):log(House.Mod) 1.000000e+00
## BF 2.662346e-03
## PostProbs 1.700000e-03
## R2 9.034000e-01
## dim 3.400000e+01
## logmarg 8.050793e+02
confint(coefficients(model.bas.final))
## 2.5% 97.5% beta
## Intercept 12.019673707 12.0355892298 12.027739486
## log(Lot.Area) -0.726071182 0.1109261585 -0.307682553
## log(area) -0.605260472 0.4336893201 -0.088880269
## log(House.Age) 0.000000000 0.1640322395 0.037849711
## log(House.Mod) 0.258729469 0.5602856510 0.378564431
## Overall.Qual 0.059780969 0.0827756865 0.071131979
## Overall.Cond 0.049009717 0.0687756846 0.058965346
## Bedroom.AbvGr -0.045604090 -0.0161037483 -0.030906841
## NeighborhoodBlueste -0.217953248 0.1033535004 -0.054724163
## NeighborhoodBrDale -0.300504630 -0.0391434722 -0.168296732
## NeighborhoodBrkSide -0.125882985 0.0886555744 -0.017360333
## NeighborhoodClearCr -0.161495720 0.0798901703 -0.039273022
## NeighborhoodCollgCr -0.198625501 -0.0117214147 -0.103685705
## NeighborhoodCrawfor -0.035068243 0.1804061792 0.075695184
## NeighborhoodEdwards -0.204141606 -0.0042234591 -0.101726118
## NeighborhoodGilbert -0.252240908 -0.0571506789 -0.152376626
## NeighborhoodGreens -0.002187161 0.2904348977 0.145372760
## NeighborhoodGrnHill 0.096478495 0.5787071244 0.344050225
## NeighborhoodIDOTRR -0.246346623 -0.0239325322 -0.133858015
## NeighborhoodMeadowV -0.261449690 -0.0164226428 -0.137308703
## NeighborhoodMitchel -0.149990790 0.0505498346 -0.049830807
## NeighborhoodNAmes -0.118104228 0.0779343250 -0.018786905
## NeighborhoodNoRidge -0.082343161 0.1204641449 0.018905302
## NeighborhoodNPkVill -0.150495150 0.1400869026 -0.001852661
## NeighborhoodNridgHt -0.027598484 0.1619977682 0.067180270
## NeighborhoodNWAmes -0.189825898 0.0156814360 -0.087144981
## NeighborhoodOldTown -0.211357931 0.0002734994 -0.104219174
## NeighborhoodSawyer -0.139442487 0.0634034363 -0.037604083
## NeighborhoodSawyerW -0.249994796 -0.0539071381 -0.151549204
## NeighborhoodSomerst -0.112324339 0.0706678075 -0.018736424
## NeighborhoodStoneBr -0.071295746 0.1478639944 0.039737459
## NeighborhoodSWISU -0.155900444 0.0941154326 -0.031231479
## NeighborhoodTimber -0.115205005 0.0942804055 -0.009403058
## NeighborhoodVeenker -0.124866698 0.1190322180 -0.002576698
## log(Lot.Area):log(area) 0.006000627 0.1195099697 0.063977609
## log(House.Age):log(House.Mod) -0.135058764 -0.0640099912 -0.090470313
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
Yes, I log transformed the price, Lot.Area, area, House.Age and House.Mod variables.
Note that House.Age is a calculated variable based on the years since Year.Built, and House.Mod is based on the years since Year.Remod.Add.
Initial investigation of the distributions of these variables had indicated skewness, so I chose to log transform them. This was also done in the previous assessments.
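The effect of a log transform on a right-skewed variable can be checked numerically. The sketch below uses simulated lognormal values as a stand-in for a variable such as price or Lot.Area, and a hand-written `skewness` helper (an assumption of this example, not a function from the packages loaded above).

```r
set.seed(7)
# Hypothetical right-skewed values standing in for price or Lot.Area
x <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log(x))
print(round(c(raw = skew_raw, log = skew_log), 2))
```

A clearly positive skewness on the raw scale that falls close to zero after taking logs is the pattern that motivates the transformation.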
image(model.bas.final, top.models = 10)
plot(model.bas.final,which = c(1:3), sub.caption = "model.bas.final")
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
Yes, I included two two-way interactions: one between log(Lot.Area) and log(area), and one between log(House.Age) and log(House.Mod). Interactions of three or more variables are uncommon, so I did not include any in my model.
The price of a house may depend jointly on the lot area and the area of the house. The price could also be influenced not only by when the house was first built but also by when it was last remodeled. A house that was remodeled recently could fetch a different price from one that has never been remodeled since it was first constructed long ago.
Judging by the inclusion probabilities, log(House.Age) has a posterior inclusion probability of less than 40%, while the interaction term log(House.Age):log(House.Mod) has a posterior inclusion probability close to 100%.
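In an R formula, `*` between two terms stands for both main effects plus their interaction, which is why log(House.Age) and log(House.Age):log(House.Mod) appear as separate rows with separate inclusion probabilities. A short sketch with placeholder variables `a`, `b` and `y` (hypothetical names; no data are needed to inspect a formula):

```r
# Placeholder variables a and b; no data are needed to inspect a formula's terms
f_star <- y ~ log(a) * log(b)
f_long <- y ~ log(a) * log(b) + log(b) + log(a)

# `*` expands to both main effects plus their interaction
labels_star <- attr(terms(f_star), "term.labels")
labels_long <- attr(terms(f_long), "term.labels")
print(labels_star)

# Repeating the main effects adds nothing: both formulas contain the same terms
print(setequal(labels_star, labels_long))
```

Listing a main effect again alongside a `*` term is therefore harmless, since duplicate terms are dropped when the formula is expanded.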
plot(model.bas.final, which = 4, sub.caption = " ")
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
Based on the initial BAS model, I dropped the Exter.Qual and Bsmt.Qual variables, although they were all significant in the AIC model. Overall Quality and Overall Condition are more significant predictors than the other quality variables.
I included House.Mod, based on Year.Remod.Add, because it was not in my initial model but could have a significant influence on the house price; this is borne out by its posterior inclusion probability.
# Using the stepwise function #
model.fit.AIC <- step(model.fit, direction = "backward", k = 2 ,trace = FALSE)
summary(model.fit.AIC)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + log(area) + log(House.Age) +
## Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual +
## Exter.Qual + Neighborhood + Bedroom.AbvGr, data = ames_train_fit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84933 -0.06044 0.00587 0.06465 0.45309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.601036 0.174386 43.587 < 2e-16 ***
## log(Lot.Area) 0.140648 0.011665 12.057 < 2e-16 ***
## log(area) 0.477700 0.023065 20.711 < 2e-16 ***
## log(House.Age) -0.161004 0.017717 -9.088 < 2e-16 ***
## Overall.Qual 0.056936 0.006027 9.447 < 2e-16 ***
## Overall.Cond 0.054356 0.004498 12.085 < 2e-16 ***
## Bsmt.QualFa -0.120274 0.037620 -3.197 0.001445 **
## Bsmt.QualGd -0.075241 0.021177 -3.553 0.000404 ***
## Bsmt.QualPo -0.239834 0.118946 -2.016 0.044113 *
## Bsmt.QualTA -0.099868 0.025789 -3.873 0.000117 ***
## Kitchen.QualFa -0.137083 0.040996 -3.344 0.000866 ***
## Kitchen.QualGd -0.060986 0.026482 -2.303 0.021549 *
## Kitchen.QualPo -0.160177 0.125077 -1.281 0.200711
## Kitchen.QualTA -0.096330 0.028374 -3.395 0.000721 ***
## Exter.QualFa -0.139133 0.059970 -2.320 0.020599 *
## Exter.QualGd -0.085073 0.036752 -2.315 0.020886 *
## Exter.QualTA -0.082355 0.040035 -2.057 0.040016 *
## NeighborhoodBlueste -0.035021 0.080468 -0.435 0.663525
## NeighborhoodBrDale -0.132301 0.064015 -2.067 0.039095 *
## NeighborhoodBrkSide -0.078653 0.053989 -1.457 0.145571
## NeighborhoodClearCr -0.021881 0.062021 -0.353 0.724330
## NeighborhoodCollgCr -0.082639 0.046975 -1.759 0.078938 .
## NeighborhoodCrawfor 0.049345 0.054490 0.906 0.365442
## NeighborhoodEdwards -0.117875 0.050745 -2.323 0.020445 *
## NeighborhoodGilbert -0.128795 0.049195 -2.618 0.009017 **
## NeighborhoodGreens 0.179216 0.073428 2.441 0.014883 *
## NeighborhoodGrnHill 0.387315 0.121795 3.180 0.001531 **
## NeighborhoodIDOTRR -0.206114 0.055580 -3.708 0.000224 ***
## NeighborhoodMeadowV -0.148740 0.055864 -2.663 0.007918 **
## NeighborhoodMitchel -0.031036 0.050139 -0.619 0.536107
## NeighborhoodNAmes -0.028977 0.049701 -0.583 0.560040
## NeighborhoodNoRidge 0.056164 0.050377 1.115 0.265249
## NeighborhoodNPkVill 0.011520 0.073236 0.157 0.875051
## NeighborhoodNridgHt 0.002868 0.049202 0.058 0.953533
## NeighborhoodNWAmes -0.057779 0.050885 -1.135 0.256524
## NeighborhoodOldTown -0.177196 0.053467 -3.314 0.000962 ***
## NeighborhoodSawyer -0.037216 0.051101 -0.728 0.466667
## NeighborhoodSawyerW -0.138330 0.048847 -2.832 0.004748 **
## NeighborhoodSomerst -0.027373 0.046564 -0.588 0.556799
## NeighborhoodStoneBr 0.031589 0.055555 0.569 0.569787
## NeighborhoodSWISU -0.109296 0.061956 -1.764 0.078116 .
## NeighborhoodTimber -0.036876 0.053633 -0.688 0.491939
## NeighborhoodVeenker 0.033352 0.060472 0.552 0.581431
## Bedroom.AbvGr -0.026555 0.007433 -3.573 0.000375 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.113 on 770 degrees of freedom
## Multiple R-squared: 0.9123, Adjusted R-squared: 0.9074
## F-statistic: 186.2 on 43 and 770 DF, p-value: < 2.2e-16
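The role of the `k` argument in `step()` can be sketched on simulated data. The data frame `sim` and its columns are hypothetical and unrelated to the Ames variables. `k = 2` gives the AIC penalty used above, while `k = log(n)` gives the BIC penalty.

```r
# Minimal sketch on simulated data: backward elimination with AIC and BIC penalties.
set.seed(3)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))

# Only x1 and x2 truly affect the response
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

fit.full <- lm(y ~ x1 + x2 + noise1 + noise2, data = sim)

# k = 2 is the AIC penalty; k = log(n) is the BIC penalty
fit.aic <- step(fit.full, direction = "backward", k = 2, trace = FALSE)
fit.bic <- step(fit.full, direction = "backward", k = log(n), trace = FALSE)

# Terms retained by each criterion
kept.aic <- attr(terms(fit.aic), "term.labels")
kept.bic <- attr(terms(fit.bic), "term.labels")
message("Kept by AIC: ", paste(kept.aic, collapse = ", "))
message("Kept by BIC: ", paste(kept.bic, collapse = ", "))
```

Because the BIC penalty is larger for any sample of more than a few observations, it tends to retain the same or fewer terms than AIC.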
# Using BAS model #
model.fit.bas <- bas.lm(data = ames_train_fit,log(price) ~ log(Lot.Area) + log(area) +
log(House.Age) + Overall.Qual + Overall.Cond + Bsmt.Qual + Kitchen.Qual +
Exter.Qual + Bedroom.AbvGr + Neighborhood, prior = "JZS", modelprior = uniform(), method = "BAS")
## Warning in model == got.parents: longer object length is not a multiple of
## shorter object length
## Warning in bas.lm(data = ames_train_fit, log(price) ~ log(Lot.Area) +
## log(area) + : bestmodel violates heredity conditions; resetting to null
## model
image(model.fit.bas, top.models = 10, rotate = TRUE)
plot(model.fit.bas, which = 4, ask = FALSE, sub.caption = " ")
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
I tested the final model on the ames_test data and compared its in-sample and out-of-sample errors with those of the initial model.
I decided to keep the final model for testing against the validation data set, even though its RMSE is higher than that of the initial model both in sample and out of sample. My next step is to check whether the RMSE on the validation data set increases or decreases relative to the in-sample and out-of-sample errors of the final model.
load("ames_test.Rdata")
ames_test <- as_tibble(ames_test) # tbl_df() is deprecated since dplyr 1.0.0 #
ames_test <- ames_test %>%
filter(Sale.Condition == "Normal") %>%
filter(Neighborhood != "Landmrk") %>%
mutate(House.Age = year(today()) - Year.Built) %>%
mutate(House.Mod = year(today()) - Year.Remod.Add) %>%
dplyr::select(price,Lot.Area,Overall.Qual,Overall.Cond,House.Age,Kitchen.Qual,Bsmt.Qual,
Bedroom.AbvGr,area, Neighborhood, Exter.Qual,Bldg.Type, House.Mod)
ames_test <- na.omit(ames_test)
# In-sample error of the final model using ames_train_play #
init.model.bas.fit <- exp(fitted(model.bas.final, estimator = "BMA"))
residuals.ames.train <- init.model.bas.fit - ames_train_play$price
rmse.ames.train.final <- sqrt(mean(residuals.ames.train ^2))
message ("The RMSE value or in sample error using final model is ", round(rmse.ames.train.final,2))
## The RMSE value or in sample error using final model is 22643.59
# Out of sample error in final model using ames_test #
test.pred <- predict(model.bas.final, newdata = ames_test, estimator = "BMA")
test.pred.val <- exp(test.pred$fit)
residuals.ames.test <- ames_test$price - test.pred.val
rmse.ames.test.final <- sqrt(mean(residuals.ames.test^2))
message ("The RMSE value or out of sample error using final model is ", round(rmse.ames.test.final,2))
## The RMSE value or out of sample error using final model is 24059.09
# Compare the RMSE of the final model with the initial model, and test error with training error #
message("Is in sample error in final model more than in sample error in initial model? Answer: ", (rmse.ames.train.final > rmse.ames.train))
## Is in sample error in final model more than in sample error in initial model? Answer: TRUE
message("Is out of sample error in final model more than out of sample error in initial model ? Answer: ", (rmse.ames.test.final > rmse.ames.test))
## Is out of sample error in final model more than out of sample error in initial model ? Answer: TRUE
message("Is out of sample error in final model more than in sample error in final model ? Answer: ", (rmse.ames.test.final > rmse.ames.train.final))
## Is out of sample error in final model more than in sample error in final model ? Answer: TRUE
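The in-sample versus out-of-sample comparison can be sketched in a self-contained way. This example uses simulated data and a plain `lm` fit on log(price) as a stand-in for the BAS model; the helper `rmse()` and the data frame `sim` are hypothetical names.

```r
# Minimal sketch on simulated data: RMSE on the dollar scale for a model
# fitted on the log scale, computed in sample and out of sample.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(4)
n <- 600
sim <- data.frame(log.area = rnorm(n, mean = 7.3, sd = 0.3))
sim$price <- exp(7 + 0.7 * sim$log.area + rnorm(n, sd = 0.15))

# Hold out the last 200 rows as a test set
train <- sim[1:400, ]
test <- sim[401:600, ]

fit <- lm(log(price) ~ log.area, data = train)

# Back-transform predictions with exp() before computing errors in dollars
rmse.train <- rmse(train$price, exp(fitted(fit)))
rmse.test <- rmse(test$price, exp(predict(fit, newdata = test)))

message("In-sample RMSE: ", round(rmse.train, 2))
message("Out-of-sample RMSE: ", round(rmse.test, 2))
```

An out-of-sample RMSE that is much larger than the in-sample RMSE would point to overfitting; a modest gap is expected.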
For your final model, create and briefly interpret an informative plot of the residuals.
For your final model, calculate and briefly comment on the RMSE.
What are some strengths and weaknesses of your model?
Strengths
Weakness
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")
ames_validation <- as_tibble(ames_validation) # tbl_df() is deprecated since dplyr 1.0.0 #
ames_validation <- ames_validation %>%
filter(Sale.Condition == "Normal") %>%
filter(Neighborhood != "Landmrk") %>%
mutate(House.Age = year(today()) - Year.Built) %>%
mutate(House.Mod = year(today()) - Year.Remod.Add) %>%
dplyr::select(price,Lot.Area,Overall.Qual,Overall.Cond,House.Age,Kitchen.Qual,Bsmt.Qual,
Bedroom.AbvGr,area, Neighborhood, Exter.Qual,Bldg.Type, House.Mod)
ames_validation <- na.omit(ames_validation)
The RMSE of the final model on the validation data set is lower than its RMSE on the test data set.
The RMSE of the final model on the validation data set is also lower than its in-sample error on the training data set.
The coverage probability is about 93%, close to the nominal 95%, which suggests that the model reflects uncertainty reasonably well.
This supports my view that the final model is still the better model, with better predictive ability for house prices in Ames, Iowa.
validate.pred <- predict(model.bas.final, newdata = ames_validation, estimator = "BMA")
validate.pred.val <- exp(validate.pred$fit)
residuals.ames.validate <- ames_validation$price - validate.pred.val
rmse.ames.validate <- sqrt(mean(residuals.ames.validate^2))
rmse.ames.validate
## [1] 21917.84
# Compare the validation RMSE with the test and training RMSE of the final model #
message("Is out of sample error in validation dataset in final model more than out of sample error in test dataset in final model ? Answer: ", (rmse.ames.validate > rmse.ames.test.final))
## Is out of sample error in validation dataset in final model more than out of sample error in test dataset in final model ? Answer: FALSE
message("Is out of sample error in validation dataset in final model more than in sample error in train dataset in final model ? Answer: ", (rmse.ames.validate > rmse.ames.train.final))
## Is out of sample error in validation dataset in final model more than in sample error in train dataset in final model ? Answer: FALSE
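# Note: this interval is the 2.5% and 97.5% quantiles of the point predictions across all #
# validation houses, not a separate 95% credible interval for each house #
interval <- quantile(validate.pred.val, c(0.025, 0.975))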
coverage.prob <- mean(ames_validation$price > interval[1] &
ames_validation$price < interval[2])
message("Coverage probability is = ", paste0(round(100*coverage.prob,0),"%"))
## Coverage probability is = 93%
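Coverage is usually assessed with one interval per house rather than a single interval for the whole data set. The following is a minimal sketch of that idea on simulated data, using a plain `lm` fit and `predict(..., interval = "prediction")` as a stand-in for the BAS model; the data frame `sim` and its columns are hypothetical. For a `bas.lm` fit, my understanding is that comparable per-house intervals can be obtained by calling `predict()` with `se.fit = TRUE` and passing the result to `confint()` with `parm = "pred"`, but that is not run here.

```r
# Minimal sketch on simulated data: coverage of per-observation 95%
# prediction intervals, using lm as a stand-in for the BAS model.
set.seed(5)
n <- 1000
sim <- data.frame(log.area = rnorm(n, mean = 7.3, sd = 0.3))
sim$price <- exp(7 + 0.7 * sim$log.area + rnorm(n, sd = 0.15))

train <- sim[1:600, ]
valid <- sim[601:1000, ]

fit <- lm(log(price) ~ log.area, data = train)

# One 95% prediction interval per validation row, back-transformed to dollars
pred.int <- exp(predict(fit, newdata = valid,
                        interval = "prediction", level = 0.95))

# Share of validation prices that fall inside their own interval
coverage.prob <- mean(valid$price > pred.int[, "lwr"] &
                      valid$price < pred.int[, "upr"])
message("Coverage probability is = ", round(100 * coverage.prob, 0), "%")
```

A coverage close to 95% indicates that the intervals reflect the predictive uncertainty properly; a much lower value would mean the intervals are too narrow.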
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
This is an interesting data set for house price prediction, with a multitude of variables to explore.
For houses sold under the Normal sale condition, the final model is a good predictor and also quantifies the uncertainty.
The model was validated on the test and validation data sets, and the out-of-sample error was calculated to check for overfitting.
The RMSE is lower on the validation data set.
There is a strong interaction between the age of the house and the time since it was last remodeled.
Log transformation of the variables is necessary to satisfy the regression assumptions.
In this exercise I practiced many techniques for linear modeling using the Bayesian approach, which was my key interest coming from a frequentist background.