As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.
load("ames_train.Rdata")
Use the code block below to load any necessary packages
library(MASS)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(GGally)
library(statsr)
library(formattable)
library(moments)
library(BAS)
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
This training dataset has 1000 observations of 81 variables per house. Let us see what are the variables it contains, with an additional variable: Housing.Age (Current Year (2020) - Year House was built).
ames_train$Housing.Age<-2020-ames_train$Year.Built
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 82 variables:
## $ PID : int 909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
## $ area : int 856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ MS.SubClass : int 30 120 30 70 60 85 20 20 20 180 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
## $ Lot.Frontage : int NA 42 60 80 70 64 60 53 74 35 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
## $ Overall.Qual : int 6 5 5 4 8 7 4 7 5 6 ...
## $ Overall.Cond : int 6 5 9 8 6 5 4 5 6 5 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add : int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
## $ Mas.Vnr.Area : int 0 149 0 0 0 500 0 20 0 76 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
## $ BsmtFin.SF.1 : int 238 552 737 0 643 0 0 0 647 467 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 393 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 618 104 100 405 167 0 936 1146 217 80 ...
## $ Total.Bsmt.SF : int 856 1049 837 405 810 0 936 1146 864 547 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 856 1049 1001 717 810 495 936 1246 889 1072 ...
## $ X2nd.Flr.SF : int 0 0 0 322 855 1427 0 0 0 0 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : int 1 1 0 0 1 0 0 0 0 1 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 2 1 1 2 3 1 2 1 1 ...
## $ Half.Bath : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
## $ TotRms.AbvGrd : int 4 5 5 6 6 7 4 5 6 5 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
## $ Fireplaces : int 1 0 0 0 0 1 0 1 0 0 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
## $ Garage.Yr.Blt : int 1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
## $ Garage.Cars : int 2 1 1 1 2 2 2 2 2 2 ...
## $ Garage.Area : int 399 266 216 281 528 672 576 428 484 525 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 0 0 154 0 0 0 0 100 0 0 ...
## $ Open.Porch.SF : int 0 105 0 0 45 0 32 24 0 44 ...
## $ Enclosed.Porch : int 0 0 42 168 0 177 112 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 86 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 166 0 0 111 0 0 0 0 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 3 2 11 5 11 7 2 3 4 5 ...
## $ Yr.Sold : int 2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
## $ Housing.Age : num 81 36 90 120 19 17 67 13 36 15 ...
We now known the different variables that are present within the dataset, in understanding whether they are integers or factors and what are some of the entries in the dataset.
One of the most informative graphs that was done in this EDA was the boxplot of price distribution in multiples of 1000 across the various neighbourhoods beside a boxplot of the age of the houses in those neighborhoods. MeadowV was identified to be the least expensive neighborhood whereas StoneBr was the most expensive and has the more heterogeneous pricing (highest standard deviation).
p1<-ggplot(ames_train,aes(x=reorder(Neighborhood,price),y=price))+geom_boxplot()+coord_flip()+labs(title = "Neighbourhood vs Price", x = 'Neighbourhood', y = "Price (in USD$k)")
p2<-ggplot(ames_train,aes(x=reorder(Neighborhood,price),y=Housing.Age))+geom_boxplot()+coord_flip()+labs(title = "Neighborhoods vs Age of House", x = 'Neighbourhood', y = "Age of House")
grid.arrange(p1, p2, nrow = 1)
From this graph, it is clear that age of the estates is likely to be a key influence in the neighborhood housing prices, as there seems to be a clustering of older houses in the cheaper neighborhoods, with a few exceptions. Hence, if we include age of the houses, we likely can exclude neighbourhoods from the model to be built. This prompted me to do further analysis with the data, to see if there are more variables that are correlated.
To derive a initial model for the analysis, it is important to understand the correlation between the variables and extract the ones with the highest correlation to price (which is set to have an absolute R value of >0.4). Hence, the following code below will extract that information:
corrdata <- as.data.frame(sapply(ames_train, as.numeric))
corrbase= cor(corrdata, method = "s")
## Warning in cor(corrdata, method = "s"): the standard deviation is zero
corr_sortprice = as.matrix(sort(corrbase[,'price'], decreasing = TRUE))
corr_chosen = names(which(apply(corr_sortprice, 1, function(x) (x > 0.4 | x < -0.4))))
corrplot(as.matrix(corrbase[corr_chosen,corr_chosen]), type = 'upper', method='ellipse')
From this, it is clear that we need to choose between 1) Overall Quality, Exterior Quality ,Full Bath and Area as well as 2)Year.Built, Housing Age, and Foundation. Other variables such as Fireplaces and Heating.QC can be added directly to the model. This is informative for me to select variables that are less dependent on each other to reduce the size of the initial model.
We want to understand whether using log(Lot.Area) or Lot.Area would be a better predictor of log (price), so I plotted the 2 graphs using a linear model to extrapolate:
basedata<-ames_train%>%
select(Lot.Area, price)%>%
na.omit()%>%
mutate(logarea=log(Lot.Area),logprice=log(price))
linear_Eric_model<- lm(log(price) ~ Lot.Area,data=basedata)
linear_Eric_log_model<-lm(log(price) ~logarea,data=basedata)
basedf<-as.data.frame(basedata)
basedf$log_predicted_price<-log(exp(predict(linear_Eric_model)))
basedf$log_loggedArea_predicted_price<-log(exp(predict(linear_Eric_log_model)))
p3<-ggplot(basedf,aes(x=log_predicted_price,y=logprice))+geom_point()+
geom_smooth(method = "lm", linetype = 'dashed', se = FALSE) +
geom_abline(intercept = 0, slope = 1)+
labs(title = "Model without log-Lot.Area transformation", x = 'log(Predicted price)', y = "log(Actual price)")
p4<-ggplot(basedf,aes(x=log_loggedArea_predicted_price,y=logprice))+geom_point()+
geom_smooth(method = "lm", linetype = 'dashed', se = FALSE) +
geom_abline(intercept = 0, slope = 1)+
labs(title = "Model with log-Lot.Area transformation", x = 'log(Predicted price)', y = "log(Actual price)")
grid.arrange(p3, p4, nrow = 2)
From this, it is clear that the log(Lot.Area) predictor was much better, with a better spread of the random error. As a result, I would add a transformed variable, log(Lot.Area), into the initial model.
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.
Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
From my initial analysis, I am going to select the following variables for the initial model: 1. log(Area): Area has a strong correlation with price, and as determined earlier, log(Area) is a better fit to the model. 2. Housing.Age: A variable defined earlier based on Current Year - Year.built 3. Full.Bath: Number of full bathrooms in a house 4. Heating.QC: Quality of heating, seems like a good variable to be included based on correlation to price. 5. Year.Remod.Add: Remodelling could potentially help the house regain value, hence it will be included in this model. 6. Fireplaces: Has a relatively high correlation to price (likely to be a function of the neighborhood) 7. X1st.Flr.SF: Good proxy for the amount of build-up of the house, which indicates the amount of land that the house occupies. 8. TotRms.AbvGrd: Good proxy for the living space available for the owner
model.initial<-lm(log(price)~ log(Lot.Area)+Housing.Age+ Full.Bath + Heating.QC +Year.Remod.Add+Fireplaces+X1st.Flr.SF+TotRms.AbvGrd, data=ames_train)
summary(model.initial)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Housing.Age + Full.Bath +
## Heating.QC + Year.Remod.Add + Fireplaces + X1st.Flr.SF +
## TotRms.AbvGrd, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11498 -0.09676 0.00503 0.10887 0.66981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.390e+00 8.933e-01 3.795 0.000157 ***
## log(Lot.Area) 8.151e-02 1.421e-02 5.736 1.28e-08 ***
## Housing.Age -4.067e-03 3.020e-04 -13.465 < 2e-16 ***
## Full.Bath 1.784e-02 1.635e-02 1.091 0.275393
## Heating.QCFa -2.461e-01 4.591e-02 -5.361 1.03e-07 ***
## Heating.QCGd -4.344e-02 1.977e-02 -2.197 0.028243 *
## Heating.QCPo -1.529e-01 2.034e-01 -0.752 0.452375
## Heating.QCTA -1.318e-01 1.759e-02 -7.489 1.53e-13 ***
## Year.Remod.Add 3.729e-03 4.423e-04 8.432 < 2e-16 ***
## Fireplaces 1.136e-01 1.120e-02 10.135 < 2e-16 ***
## X1st.Flr.SF 2.737e-04 2.174e-05 12.586 < 2e-16 ***
## TotRms.AbvGrd 5.089e-02 5.359e-03 9.497 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2023 on 988 degrees of freedom
## Multiple R-squared: 0.7712, Adjusted R-squared: 0.7687
## F-statistic: 302.8 on 11 and 988 DF, p-value: < 2.2e-16
The model here shows an adjusted R^2 value of 0.7687, which is pretty high considering that we only have 8 variables in this model.
Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
Let’s try using AIC Stepwise-Regression to reduce the model down to a few variables:
model.initial_AIC<-step(model.initial,k=2)
## Start: AIC=-3184.11
## log(price) ~ log(Lot.Area) + Housing.Age + Full.Bath + Heating.QC +
## Year.Remod.Add + Fireplaces + X1st.Flr.SF + TotRms.AbvGrd
##
## Df Sum of Sq RSS AIC
## - Full.Bath 1 0.0487 40.482 -3184.9
## <none> 40.433 -3184.1
## - log(Lot.Area) 1 1.3467 41.780 -3153.3
## - Heating.QC 4 2.9043 43.337 -3122.7
## - Year.Remod.Add 1 2.9097 43.343 -3116.6
## - TotRms.AbvGrd 1 3.6908 44.124 -3098.8
## - Fireplaces 1 4.2038 44.637 -3087.2
## - X1st.Flr.SF 1 6.4828 46.916 -3037.4
## - Housing.Age 1 7.4195 47.853 -3017.6
##
## Step: AIC=-3184.9
## log(price) ~ log(Lot.Area) + Housing.Age + Heating.QC + Year.Remod.Add +
## Fireplaces + X1st.Flr.SF + TotRms.AbvGrd
##
## Df Sum of Sq RSS AIC
## <none> 40.482 -3184.9
## - log(Lot.Area) 1 1.3260 41.808 -3154.7
## - Heating.QC 4 2.9774 43.459 -3121.9
## - Year.Remod.Add 1 3.0853 43.567 -3113.5
## - Fireplaces 1 4.2152 44.697 -3087.8
## - TotRms.AbvGrd 1 5.0962 45.578 -3068.3
## - X1st.Flr.SF 1 6.5971 47.079 -3035.9
## - Housing.Age 1 8.3929 48.875 -2998.5
model.initial_AIC$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 988 40.43311 -3184.106
## 2 - Full.Bath 1 0.04874089 989 40.48185 -3184.901
Using AIC, we see that the Full.Bath variable is dropped, with a slight improvement in AIC.
Let us try and use BAS with a BIC prior to select the results:
narrow_data<-ames_train%>%
select(Lot.Area, Housing.Age, Full.Bath, Heating.QC, Year.Remod.Add, Fireplaces, X1st.Flr.SF, TotRms.AbvGrd,price)%>%
mutate(logarea=log(Lot.Area))
bas_Eric<- bas.lm(log(price) ~.,data=narrow_data, prior='BIC',
modelprior=uniform())
plot(bas_Eric,which=4,ask=FALSE, cex.lab=0.5)
image(bas_Eric,rotate=FALSE)
Based on the BAS inclusion probabilities, it is likely for us to exclude Heating.QC and Full.Bath in this final linear model, which differs slightlyy from the AIC selection.
Initial Model: log(price)~ log(Lot.Area)+Housing.Age+ Full.Bath + Heating.QC +Year.Remod.Add+Fireplaces+X1st.Flr.SF+TotRms.AbvGrd
Model AIC: log(price)~ log(Lot.Area)+Housing.Age + Heating.QC +Year.Remod.Add+Fireplaces+X1st.Flr.SF+TotRms.AbvGrd
Model BAS: log(price)~ log(Lot.Area)+Housing.Age+Year.Remod.Add+Fireplaces+X1st.Flr.SF+TotRms.AbvGrd
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
Let us use the BAS model for this calculation:
model.initial_BAS<-lm(log(price)~ log(Lot.Area)+Housing.Age+Year.Remod.Add+Fireplaces+X1st.Flr.SF+TotRms.AbvGrd, data=ames_train)
par(mfrow=c(1,2))
hist(model.initial_BAS$residuals, main="Residuals")
plot(model.initial_BAS,which=1,ask=FALSE)
We see that the residuals in this plot appear to be relatively symmetrical and are centred/clustered around zero. They also do not have a clear pattern or shape, indicating random distribution (of error).
When observing the Residual vs Fitted graph, the general shape of the red line is horizontal, which indicates that a linear relationship is indeed true. However, closer to the tail ends (at higher prices), there seems to be an increase in the residual values, indicating that the variance and bias in the model diverges as the predicted price increases. To avoid this issue, we can either use the model to predict prices below a certain range or have more data points for houses at higher prices to improve its predictability.
3 points stand out from the plot due to overprediction, Houses number 428, 276 and 310 as marked on the plot. The original prices vs the predicted prices and residuals are shown in the table below.
validation<-as.data.frame(ames_train)%>%
select(price,Neighborhood,Overall.Qual,X1st.Flr.SF)
validation$predicted<- exp(predict(model.initial_BAS))
validation$residuals<-residuals(model.initial_BAS)
formattable(validation[c(428,181,310),])
| price | Neighborhood | Overall.Qual | X1st.Flr.SF | predicted | residuals | |
|---|---|---|---|---|---|---|
| 428 | 12789 | OldTown | 2 | 832 | 110818.6 | -2.1593089 |
| 181 | 82500 | NWAmes | 7 | 1411 | 184226.5 | -0.8033676 |
| 310 | 184750 | Edwards | 10 | 3138 | 578247.3 | -1.1409980 |
The main commonality across the 3 data points is their location in neighborhoods with generally lower prices, which may not be fully captured by the age of the house.
You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
pred_initial<-exp(predict(model.initial_BAS,ames_train))
resid_initial<-ames_train$price-pred_initial
rmse_initial<-sqrt(mean(resid_initial^2))
paste('The RMSE for the Bayesian selected model under ames_train is US$ ',round(rmse_initial,2),sep = "")
## [1] "The RMSE for the Bayesian selected model under ames_train is US$ 39756.38"
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
ames_test$Housing.Age<-2020-ames_test$Year.Built
model_predict.values<-exp(predict(model.initial_BAS,ames_test))
ames_test$predicted=model_predict.values
ames_test$log_predicted=log(model_predict.values)
ames_test$log_price=log(ames_test$price)
ggplot(ames_test,aes(x=log_predicted,y=log_price))+geom_point()+
geom_smooth(method = "lm", linetype = 'dashed', se = FALSE) +
geom_abline(intercept = 0, slope = 1)+
labs(title = "Model Testing", x = 'log(Predicted price)', y = "log(Actual price)")
The data seems to fit this test set quite well. Let us try and use RMSE to determine the fit of the data in the training set.
pred_initial_test<-exp(predict(model.initial_BAS,ames_test))
resid_initial_test<-ames_test$price-pred_initial_test
rmse_initial_test<-sqrt(mean(resid_initial_test^2))
paste('The RMSE for the Bayesian selected model with the ames_test dataset is US$ ',round(rmse_initial_test,2),sep = "")
## [1] "The RMSE for the Bayesian selected model with the ames_test dataset is US$ 36340.71"
paste('The RMSE for the Bayesian selected model with the ames_train dataset is US$ ',round(rmse_initial,2),sep = "")
## [1] "The RMSE for the Bayesian selected model with the ames_train dataset is US$ 39756.38"
Usually, RMSE for training data will be lower than that of the testing data set because the model is more fitted to what it is trained on. However, in this case, the test data set has a better fit for the model than the training dataset, which means this model is not overfitted.
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
For this new model, I have added a few more variables for evaluation:
1) Building type(Bldg.Type) 2) Sale Condition (Sale.Condition) 3) Bedrooms above ground (Bedroom.AbvGr) 4) Total Area (area) 5) Kitchen Quality (Kitchen.Qual)
All variables are then used in the model selection below, along with the various transformations required to make the model more robust, to give the final model below.
model_final<- lm(log(price)~ log(Lot.Area)+Housing.Age+Year.Remod.Add+Fireplaces+log(X1st.Flr.SF)+Sale.Condition+Bedroom.AbvGr+log(area)+Kitchen.Qual,data=ames_train)
summary(model_final)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Housing.Age + Year.Remod.Add +
## Fireplaces + log(X1st.Flr.SF) + Sale.Condition + Bedroom.AbvGr +
## log(area) + Kitchen.Qual, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.90829 -0.08116 0.00429 0.10162 0.50829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1431525 0.8064604 2.657 0.0080 **
## log(Lot.Area) 0.0877313 0.0125356 6.999 4.78e-12 ***
## Housing.Age -0.0037095 0.0002575 -14.405 < 2e-16 ***
## Year.Remod.Add 0.0022979 0.0004034 5.696 1.62e-08 ***
## Fireplaces 0.0661465 0.0103497 6.391 2.54e-10 ***
## log(X1st.Flr.SF) 0.1800284 0.0240223 7.494 1.48e-13 ***
## Sale.ConditionAdjLand 0.1937955 0.1275409 1.519 0.1290
## Sale.ConditionAlloca 0.1556706 0.0910701 1.709 0.0877 .
## Sale.ConditionFamily 0.0296665 0.0486117 0.610 0.5418
## Sale.ConditionNormal 0.1597787 0.0235411 6.787 1.97e-11 ***
## Sale.ConditionPartial 0.2041114 0.0317566 6.427 2.02e-10 ***
## Bedroom.AbvGr -0.0385265 0.0091018 -4.233 2.52e-05 ***
## log(area) 0.4917508 0.0298413 16.479 < 2e-16 ***
## Kitchen.QualFa -0.3198098 0.0504227 -6.343 3.44e-10 ***
## Kitchen.QualGd -0.1779494 0.0247467 -7.191 1.27e-12 ***
## Kitchen.QualPo -0.1962677 0.1792331 -1.095 0.2738
## Kitchen.QualTA -0.2564860 0.0285835 -8.973 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1759 on 983 degrees of freedom
## Multiple R-squared: 0.828, Adjusted R-squared: 0.8252
## F-statistic: 295.7 on 16 and 983 DF, p-value: < 2.2e-16
The adjusted R^2 value for this model is much higher than the initial, at 0.825 with only 9 variables.
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
I decided to transform price to log(price) for the distribution to be more normal, as the prices were right skewed.
As for Lot.area, let us visualize it with a histogram. * * *
ggplot(data=ames_train,aes(x=Lot.Area))+
geom_histogram(bins = 50, fill = 'steelblue', colour = 'black')+
xlab("Lot.Area")+
ylab("Number of Houses")+
ggtitle("Distribution of the Lot.Area in the Dataset")
As we can see from the above diagran, the plot is right skewed, which means that it is not normally distributed. However, if we were to apply log(Lot.Area) instead:
ames_train$log_lotarea<-log(ames_train$Lot.Area)
ggplot(data=ames_train,aes(x=log_lotarea))+
geom_histogram(bins = 50, fill = 'steelblue', colour = 'black')+
xlab("log(Lot.Area)")+
ylab("Number of Houses")+
ggtitle("Distribution of the Log(Lot.Area) in the Dataset")
As you can visually see, this is a much better output. Hence, all area variables (Lot.Area, X1st.Flr.SF, area) are log transformed in the final model.
Housing.Age is also another variable I defined. Instead of having the Year.Built, I thought it was better to define the Housing Age (Current Year - Year.Built).
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences. * * This model does not include any variable interactions. This is to maintain parsimony, as the model is already very strong without including interactions. We need to consider designing a model that is both predictive and simple, to avoid overfitting to a single dataset. Hence, I’ve chosen not to have any variable interactions, despite the possibility of improving the predictive power of the model. * *
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
We initially performed corrplot to check the relationship between predictors and price. Then we narrowed it down the top few factors (8 variables) that form the base model using BAS, and picking the ones with the highest inclusion probabilities.
After which, we added a few more variables that may feature, and use BAS again as shown below to narrow down a few more indicators.
model_final_selection<- bas.lm(log(price)~ log(Lot.Area)+Housing.Age+Year.Remod.Add+Fireplaces+log(X1st.Flr.SF)+TotRms.AbvGrd+Bldg.Type+Sale.Condition+Bedroom.AbvGr+log(area)+Kitchen.Qual,data=ames_train,prior="BIC")
plot(model_final_selection,which=4,ask=FALSE, cex.lab=0.5)
image(model_final_selection,rotate=TRUE)
Based on this, we will now exclude Building type (3 out of the 4 factors would be excluded) and Total Rooms above ground for this analysis. * * *
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
Usually, RMSE for training data will be lower than that of the testing data set because the model is more fitted to what it is trained on. However, in this case, the test data set has a better fit for the model than the training dataset, which means this model is not overfitted. Hence no change is made.
pred_final<-exp(predict(model_final,ames_train))
resid_final<-ames_train$price-pred_final
rmse_final<-sqrt(mean(resid_final^2))
paste('The RMSE for the final Bayesian selected model with the ames_train dataset is US$ ',round(rmse_initial,2),sep = "")
## [1] "The RMSE for the final Bayesian selected model with the ames_train dataset is US$ 39756.38"
pred_final_test<-exp(predict(model_final,ames_test))
resid_final_test<-ames_test$price-pred_final_test
rmse_final_test<-sqrt(mean(resid_final_test^2))
paste('The RMSE for the final Bayesian selected model with the ames_test dataset is US$ ',round(rmse_initial_test,2),sep = "")
## [1] "The RMSE for the final Bayesian selected model with the ames_test dataset is US$ 36340.71"
For your final model, create and briefly interpret an informative plot of the residuals.
par(mfrow=c(1,2))
hist(model_final$residuals, main="Residuals")
plot(model_final,which=1,ask=FALSE)
We see that the residuals in this plot appear to be relatively symmetrical and are centred/clustered around zero. They also do not have a clear pattern or shape, indicating random distribution (of error).
When observing the Residual vs Fitted graph, the general shape of the red line is relatively horizontal, which indicates that a linear relationship is indeed true. However, closer to the tail ends (at higher prices), there seems to be an increase in the residual values, indicating that the variance and bias in the model diverges as the predicted price increases. To avoid this issue, we can either use the model to predict prices below a certain range or have more data points for houses at higher prices to improve its predictability. Alternatively, we could remove a few variables, but that would reduce the predictive power of the model.
For your final model, calculate and briefly comment on the RMSE.
pred_final<-exp(predict(model_final,ames_train))
resid_final<-ames_train$price-pred_final
rmse_final<-sqrt(mean(resid_final^2))
paste('The RMSE for the final Bayesian selected model with the ames_train dataset is US$ ',round(rmse_final,2),sep = "")
## [1] "The RMSE for the final Bayesian selected model with the ames_train dataset is US$ 33369.1"
pred_final_test<-exp(predict(model_final,ames_test))
resid_final_test<-ames_test$price-pred_final_test
rmse_final_test<-sqrt(mean(resid_final_test^2))
paste('The RMSE for the final Bayesian selected model with the ames_test dataset is US$ ',round(rmse_final_test,2),sep = "")
## [1] "The RMSE for the final Bayesian selected model with the ames_test dataset is US$ 28661.74"
The final model RMSE has not changed significantly from the initial model, although it has decreased from the initial model. * * *
What are some strengths and weaknesses of your model?
The strengths of the model are 1) relatively high R^2 score with few variables (9), 2) not overfitted, 3) uses mostly integer variables and 4) can predict most house prices correctly with a few exceptions.
The weaknesses of this model is that it was built on data with a few outliers (Houses 428,310,741), which may have skewed the coefficients. Also, the range of the data seems to be limited to houses which are cheaper, which means that this model cannot be extended beyond its test range. * * *
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")
ames_validation$Housing.Age<-2020-ames_validation$Year.Built
pred_final_val<-exp(predict(model_final,ames_validation))
resid_final_val<-ames_validation$price-pred_final_val
rmse_final_val<-sqrt(mean(resid_final_val^2))
paste('The RMSE for the final Bayesian selected model with the ames_test dataset is US$ ',round(rmse_final_val,2),sep = "")
## [1] "The RMSE for the final Bayesian selected model with the ames_test dataset is US$ 27185.02"
The RMSE value of this model is even better than that of the testing and training sets, which is very unusual. This is also a good indicator that the model produced is robust.
Compared to the primary model:
pred_initial_val<-exp(predict(model.initial_BAS,ames_validation))
resid_initial_val<-ames_validation$price-pred_initial_val
rmse_initial_val<-sqrt(mean(resid_initial_val^2))
paste('The RMSE for the initial Bayesian selected model with the ames_test dataset is US$ ',round(rmse_initial_val,2),sep = "")
## [1] "The RMSE for the initial Bayesian selected model with the ames_test dataset is US$ 36373.04"
paste('The improvement in RMSE is US$ ',round(rmse_initial_val-rmse_final_val,2),sep = "")
## [1] "The improvement in RMSE is US$ 9188.02"
Hence, the final model predicts home prices more accurately by $9.18k, which is significant considering only one more variable was added.
Let us find out the confidence interval for the ob
con_intv<- exp(predict(model_final, ames_validation, interval = "prediction", level = 0.95))
coverage_prob<- mean(ames_validation$price > con_intv[, "lwr"] & ames_validation$price < con_intv[, "upr"], na.rm = T) * 100
paste(round(coverage_prob,2),"% of the predicted prices from 95% predictive confidence intervals contain the true price of the house in the validation set.",sep = "")
## [1] "96.85% of the predicted prices from 95% predictive confidence intervals contain the true price of the house in the validation set."
The final model does reflect uncertainty as there is still less than 4% chance that the predicted values don’t fall in the range of actual values, but it is slim.
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
The final model (niodel_final) is a fairly good model with a high R^2 adjusted value of 82.5% while only using 9 variables. We observe a wide coverage probability of 96.85% as well. However, the model was built with some datapoints that appear to be outliers, which may have unintended effects on the final model. The RMSE results show that no overfitting has occurred, and the model fits both the test and validation sets better than the training set. Thank you for reading and I hope you learnt something from reading my capstone project!
-Eric, 8/2/2020 * * *