Background

The following report was produced as a capstone project for a six-month statistics and data science course. It applies both Bayesian and Frequentist methodologies to statistical inference and linear regression.

The prompt: a real-estate firm has commissioned a data science consultant to produce a predictive model for house prices. Such a model would be used to appraise homes and inform real-estate investments by comparing predicted prices to asking prices. A predicted price greater than the asking price indicates an “under-valued” home and implies a potential to profit on the investment.

My resulting models are highly predictive, with an adjusted R-squared of 91%. I develop a Frequentist model and a Bayesian model generated with a Markov Chain Monte Carlo algorithm. The latter uses Bayesian Model Averaging, which gives the analyst a more flexible way of representing parameters of interest. This makes for a complementary, dual approach alongside the Frequentist model.

load("ames_train.Rdata")
library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(ggExtra)
library(colorspace)
library(ggthemes)
library(ggfortify)


Part 1: Exploratory Data Analysis (EDA)

The histogram below plots the density of house prices; the fill colors correspond to the different levels of overall house quality.

Following the histogram from left to right, one sees a spectrum of differentiated colors indicating that house quality tends to increase with price. This relationship appears strong, so I will include house quality in my full model and expect it to remain in the final model.

#Change to factor variables (misrepresented as integers in original data set)
ames_train$Overall.Qual <- as.factor(ames_train$Overall.Qual)
ames_train$Overall.Cond <- as.factor(ames_train$Overall.Cond)

ggplot(data=ames_train, aes(x=(price/1000),
                            fill= factor(Overall.Qual,
                                         levels = rev(levels(Overall.Qual)),
                                         labels = c("Very Excellent", "Excellent", "Very Good", "Good",
                                                    "Above Average", "Average", "Below Average",
                                                    "Fair", "Poor", "Very Poor")))) +
        geom_histogram(bins=60, aes(y=..density..)) +
        labs(title= "House Quality Overlay On Price Histogram",
             x= "Price ($1k USD)", y= "Density", fill = "House Quality") +
        theme(title = element_text(face="bold"))


The plot below combines three visualizations: a scatter plot of house area against price, plus the marginal histogram distributions of both variables.

From the scatter plot, some degree of linear correlation between price and area is apparent; however, the relationship weakens as house area increases. This falling off reflects the heavy right skew present in both variables (shown in the histograms on the scatter plot margins). The outlying points reduce the linearity of the data and suggest log-transforming the numerical variables to reduce their skew and perhaps increase the model’s R-squared.

g1 <- ggplot(data=ames_train, aes(x=area, y=(price/1000))) +
        geom_point(alpha=.4) +
        labs(title= "Price and Area Visualization", x= "Area (sq. feet)", y= "Price ($1k USD)") +
        theme(axis.title.x = element_text(color="orange3"),
              axis.title.y = element_text(color="seagreen3"),
              title = element_text(face="bold"))
       
ggMarginal(g1, type= "histogram", xparams = list(fill = "orange3"), yparams = list(fill = "seagreen3"))


The plot below shows average house price and IQR by Neighborhood, similar to a standard box plot but in a modified rendering inspired by Edward Tufte’s minimalist style.

Location is indeed pertinent when it comes to predicting house prices; therefore, the neighborhood variable will be included in the full model.

ggplot(data=ames_train, aes(x=reorder(Neighborhood, price), y=price/1000)) +
        geom_tufteboxplot() +
        coord_flip() +
        theme_minimal() +
        labs(title= "Tufte Boxplot Visualizing Price and Neighborhood",
             x= "Ames Neighborhoods", y= "Price ($1k USD)") +
        theme(title = element_text(face="bold"))



Part 2: Development of Initial Model

Section 2.1: Full Initial Model

An initial model is fit to evaluate hypotheses made during exploratory data analysis, identify significant predictors, and use the results to inform variable selection in the final model.


My EDA indicates that house quality, area, and neighborhood all exhibit some degree of relation to a house’s price. So I will start with these three variables and also include the following:

  • Age of house- accounts for depreciation of house value as the house ages (a house-age variable will be created from Year.Built)

  • # of bedrooms- determines the maximum occupancy of a house (perhaps a better predictor than absolute square footage)

  • # of bathrooms- amenities that affect the livability of a house (a new variable will combine full and half bathrooms for simplicity)

  • Central air- many buyers expect central air these days; a house without it may be viewed as needing a $10k-20k update before moving in.

#**LOG TRANSFORM**
##Log-transforming area and price variables to improve their skews
ames_train <- ames_train%>%
        mutate(price = log(price), area=log(area))


#**NEW VARIABLES**
##Combining full bathrooms and half bathrooms
##Change "year house was built" into "age of house"
ames_train <- ames_train%>%
        mutate(Bathroom = Full.Bath + Half.Bath, House.Age = 2018 - Year.Built)


#**FACTOR VARIABLE MODS**
##Regroup Neighborhood into three levels, reflecting the top third, middle third, and bottom third of average house prices
levels(ames_train$Neighborhood) <- list(
        Tier_1 = c("NridgHt", "NoRidge", "GrnHill", "StoneBr", "Timber",
                   "Veenker", "Somerst", "Blmngtn", "NWAmes", "Greens"),
        Tier_2 = c("ClearCr", "Gilbert", "Crawfor", "CollgCr", "SawyerW",
                   "Mitchel", "NAmes", "NPkVill", "Sawyer"),
        Tier_3 = c("Blueste", "Edwards", "SWISU", "OldTown", "BrkSide",
                   "BrDale", "IDOTRR", "MeadowV", "Landmrk"))

##Needed as factor for EDA, now changing back to numeric
ames_train$Overall.Qual <- as.integer(ames_train$Overall.Qual)
ames_train$Overall.Cond <- as.integer(ames_train$Overall.Cond)


#**FILTERING AND SUBSETTING DATA**
##Only considering houses sold under "normal conditions"
ames_train <- ames_train%>%
        filter(ames_train$Sale.Condition == "Normal")

##Finally, subsetting the data set to the variables used in the full model:
##area, price, Neighborhood, Overall.Qual, Central.Air, Bedroom.AbvGr, Bathroom, House.Age
ames_train <- ames_train[,c(2, 3, 15, 20, 44, 53, 82, 83)]

Section 2.1.1: Frequentist Initial Model

It appears that all of the explanatory variables contained within this full model are statistically significant predictors of house price (evidenced by their low p-values).

#Full Frequentist model
full.freq.model <- lm(price ~ ., data=ames_train)
summary(full.freq.model)
## 
## Call:
## lm(formula = price ~ ., data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83750 -0.08032  0.00071  0.08480  0.68930 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.0728004  0.1795775  39.386  < 2e-16 ***
## area                0.6304187  0.0296208  21.283  < 2e-16 ***
## NeighborhoodTier_2 -0.0639473  0.0154639  -4.135 3.91e-05 ***
## NeighborhoodTier_3 -0.2112822  0.0211215 -10.003  < 2e-16 ***
## Overall.Qual        0.0954271  0.0061140  15.608  < 2e-16 ***
## Central.AirY        0.1784027  0.0239547   7.448 2.40e-13 ***
## Bedroom.AbvGr      -0.0315453  0.0084855  -3.718 0.000215 ***
## Bathroom           -0.0561835  0.0105102  -5.346 1.17e-07 ***
## House.Age          -0.0019045  0.0002835  -6.717 3.46e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1435 on 825 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.8589 
## F-statistic: 634.7 on 8 and 825 DF,  p-value: < 2.2e-16

As expected, house area and overall quality both exhibit positive coefficients, meaning that an increase in either variable corresponds to an increase in the predicted price of a house. A central air system also increases the predicted price.

Conversely, the neighborhood variable exhibits negative coefficients, and the magnitude of the Tier 3 coefficient is greater than that of Tier 2. This makes sense: the predicted price of a house decreases as you move down the neighborhood tiers. Also, as expected, increasing the age of a house results in a lower predicted price.

Two variables return unexpected coefficients: the numbers of bedrooms and bathrooms both exhibit negative coefficients. I thought that more bedrooms and bathrooms would increase the occupancy and livability of a home, thus increasing its price; however, these coefficients must be interpreted as the change in the response while all other explanatory variables are held constant. A potential interpretation: adding bedrooms and bathrooms while holding area constant should indeed lessen the value of a home, because dividing the house into more and more rooms reduces the open living space and makes for smaller rooms.
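To make the “holding all else constant” interpretation concrete, here is a hedged sketch; the hypothetical houses are illustrative, and the expected ratio follows from the coefficient table above.

#Hedged sketch: two hypothetical houses identical except for one extra bedroom
house <- ames_train[1, ]
house.plus.bedroom <- house
house.plus.bedroom$Bedroom.AbvGr <- house$Bedroom.AbvGr + 1

#Ratio of back-transformed predicted prices; with the -0.0315 coefficient above,
#this should be exp(-0.0315), i.e. roughly a 3% lower predicted price at fixed area
exp(predict(full.freq.model, house.plus.bedroom)) / exp(predict(full.freq.model, house))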


Section 2.1.2: Bayesian Initial Model

Below is the Bayesian regression output for the same data, with a Jeffreys-Zellner-Siow prior on the unknown coefficients and a uniform prior over all possible models.

#Full Bayesian model
full.bas.model <- bas.lm(price ~ ., data=ames_train, prior = "JZS", modelprior = uniform())
coefficients(full.bas.model, estimator = "BMA")
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  256 models 
##                     post mean   post SD     post p(B != 0)
## Intercept           11.9958404   0.0049691   1.0000000    
## area                 0.6290062   0.0308883   1.0000000    
## NeighborhoodTier_2  -0.0639383   0.0162909   0.9942787    
## NeighborhoodTier_3  -0.2111167   0.0216339   1.0000000    
## Overall.Qual         0.0956035   0.0061837   1.0000000    
## Central.AirY         0.1781838   0.0239877   1.0000000    
## Bedroom.AbvGr       -0.0306867   0.0098991   0.9713955    
## Bathroom            -0.0563428   0.0105555   0.9999783    
## House.Age           -0.0019088   0.0002846   1.0000000

Again, all of the variables contained within the full model appear significant (indicated by their high posterior inclusion probabilities).

All Bayesian coefficients are similar in sign and magnitude to those produced by the Frequentist regression. Note: the above coefficients are calculated using Bayesian Model Averaging (“BMA”), whose point estimates minimize squared-error loss under the posterior; they will be a useful point of comparison with the previous Frequentist approach.


Section 2.2: Initial Model Selection

Several different variable selection methodologies are applied to the full initial model to arrive at a more parsimonious model.

Section 2.2.1: Frequentist Model Selection

Backwards variable selection is applied to the full Frequentist model using both the AIC and BIC criteria.

#Stepwise BIC model selection (output shown)
##BIC penalty: k = log(n), with n the number of observations
finalBIC_freq_model <- step(full.freq.model, direction="backward", k=log(nrow(ames_train)), trace = TRUE)
## Start:  AIC=-3229.04
## price ~ area + Neighborhood + Overall.Qual + Central.Air + Bedroom.AbvGr + 
##     Bathroom + House.Age
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       16.980 -3229.0
## - Bedroom.AbvGr  1    0.2844 17.265 -3217.3
## - Bathroom       1    0.5881 17.568 -3202.7
## - House.Age      1    0.9285 17.909 -3186.7
## - Central.Air    1    1.1416 18.122 -3176.9
## - Neighborhood   2    2.4158 19.396 -3122.3
## - Overall.Qual   1    5.0140 21.994 -3015.3
## - area           1    9.3228 26.303 -2866.1
#Stepwise AIC model selection (output shown)
finalAIC_freq_model <- step(full.freq.model, direction="backward", k=2, trace = TRUE)
## Start:  AIC=-3229.76
## price ~ area + Neighborhood + Overall.Qual + Central.Air + Bedroom.AbvGr + 
##     Bathroom + House.Age
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       16.980 -3229.8
## - Bedroom.AbvGr  1    0.2844 17.265 -3217.9
## - Bathroom       1    0.5881 17.568 -3203.4
## - House.Age      1    0.9285 17.909 -3187.4
## - Central.Air    1    1.1416 18.122 -3177.5
## - Neighborhood   2    2.4158 19.396 -3122.8
## - Overall.Qual   1    5.0140 21.994 -3016.0
## - area           1    9.3228 26.303 -2866.8

The AIC and BIC methods agree with one another, and the resulting final models conserve all variables from the full model. Surviving two separate criteria lends these variables more credibility.


Section 2.2.2: Bayesian Model Selection

Bayesian regression evaluates many candidate models and ranks them by their posterior probabilities. The 5 most likely models are shown below. “Model 1” corresponds to the highest probability model (“HPM”), with a posterior probability of roughly 97%.

Bayesian model averaging does not require the explicit variable selection employed by Frequentist approaches: coefficients and their uncertainties are averaged across the model space, with weights given by the posterior model probabilities. This is a distinct advantage of Bayesian regression and will be utilized from here on out for comparison with Frequentist methods.

round(summary(full.bas.model), 3)
##                    P(B != 0 | Y) model 1 model 2 model 3 model 4 model 5
## Intercept                  1.000   1.000   1.000   1.000   1.000   1.000
## area                       1.000   1.000   1.000   1.000   1.000   1.000
## NeighborhoodTier_2         0.994   1.000   1.000   0.000   1.000   0.000
## NeighborhoodTier_3         1.000   1.000   1.000   1.000   1.000   1.000
## Overall.Qual               1.000   1.000   1.000   1.000   1.000   1.000
## Central.AirY               1.000   1.000   1.000   1.000   1.000   1.000
## Bedroom.AbvGr              0.971   1.000   0.000   1.000   1.000   0.000
## Bathroom                   1.000   1.000   1.000   1.000   0.000   1.000
## House.Age                  1.000   1.000   1.000   1.000   1.000   1.000
## BF                            NA   1.000   0.030   0.006   0.000   0.000
## PostProbs                     NA   0.966   0.029   0.006   0.000   0.000
## R2                            NA   0.860   0.858   0.857   0.855   0.854
## dim                           NA   9.000   8.000   8.000   8.000   7.000
## logmarg                       NA 788.499 784.979 783.369 777.786 775.944
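For contrast with these averaged estimates, a single selected model can also be used. A minimal sketch, using BAS’s highest probability model estimator (“HPM”):

#Hedged sketch: coefficients conditional on the single highest probability model,
#rather than averaged over the model space as above
coefficients(full.bas.model, estimator = "HPM")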

Section 2.3: Initial Model Diagnostics

Regression diagnostics are run to evaluate residuals and check that model assumptions are satisfied.

Section 2.3.1: Frequentist Model Diagnostics

The residual variance appears relatively stable across the fitted values and does not exhibit a pattern. Furthermore, the normal Q-Q plot in the first quadrant (upper-right) shows that the residuals roughly follow a normal distribution. The standardized residual plot in the third quadrant (lower-left) identifies 2-3 potential outliers.

#Create regression diagnostic plots
autoplot(finalAIC_freq_model, label.colour = "red", label.size=4)

The fourth-quadrant plot (lower-right) shows that these outliers have moderate leverage. Cook’s distance is a function of leverage and residual size and can be used to quantify relative influence; points with a Cook’s distance of 1 or more are considered atypical and highly influential. (A sketch of this calculation follows the plot below.)

The plot below shows an absence of highly influential points.

#Calculate Cook's distance
cooks.freq <- cooks.distance(finalAIC_freq_model)

#Check Cook's distance
ggplot(ames_train, aes(x=finalAIC_freq_model$fitted.values, y=cooks.freq)) + geom_point() + labs(title="Influential Points", y= "Cook's Distance", x="Fitted Values") + theme_minimal()
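As a check on the relationship described above, the hedged sketch below recomputes Cook’s distance from its definition in terms of leverage and standardized residuals; it should agree with cooks.distance() up to floating-point error.

#Hedged sketch: Cook's distance from leverage (hat values) and standardized residuals
h <- hatvalues(finalAIC_freq_model)      #leverage of each observation
r <- rstandard(finalAIC_freq_model)      #internally standardized residuals
p <- length(coef(finalAIC_freq_model))   #number of estimated coefficients
cooks.manual <- (r^2 / p) * (h / (1 - h))
all.equal(unname(cooks.manual), unname(cooks.freq))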

The residuals from the Frequentist model exhibit a root mean squared error (RMSE) of $28,286 on the dollar scale.

#Calculate frequentist model residuals 
resid.freq <- exp(ames_train$price) - exp(finalAIC_freq_model$fitted.values)

#Calculate RMSE
rmse.freq.train <- sqrt(mean(resid.freq^2))
rmse.freq.train
## [1] 28286.22

Section 2.3.2: Bayesian Model Diagnostics

Similar to frequentist diagnostics, the residual variance is stable and exhibits no pattern with fitted values from the Bayesian model.

#Obtain fitted values from averaged model ("BMA")
pred.bas.BMA.train <- predict(full.bas.model, ames_train, 
                    estimator="BMA", 
                    prediction=TRUE, se.fit=TRUE)

#Calculate residuals (observed - fitted)
resid.bas.BMA.train <- ames_train$price - pred.bas.BMA.train$fit

#Plot residuals vs fitted values, labeling the largest residuals by row index
ggplot(ames_train, aes(x=pred.bas.BMA.train$fit, y=resid.bas.BMA.train)) +
        geom_point() +
        labs(title="Residuals Vs Fitted", y= "Residuals", x="Fitted Values") +
        theme_minimal() +
        geom_text(aes(label=ifelse(abs(resid.bas.BMA.train) > .495, as.character(seq_along(resid.bas.BMA.train)), "")), hjust=0, vjust=0)

Similarly, the Bayesian model residuals exhibit an RMSE of $28,296 on the dollar scale.

rmse.bas.train <- sqrt(mean( (exp(ames_train$price) - exp(pred.bas.BMA.train$fit))^2 ))
rmse.bas.train
## [1] 28296.07


Section 2.4: Overfitting Analysis

While the model does well on training data, it may perform poorly on out-of-sample data if it is overly tuned to the data it was originally regressed on. This phenomenon is called “overfitting.” The initial model’s performance on the training data is now compared with its performance on out-of-sample “test” data.

load("ames_test.Rdata")
#**LOG TRANSFORM**
##Log-transforming area and price variables to improve their skews
ames_test <- ames_test%>%
        mutate(price = log(price), area=log(area))


#**NEW VARIABLES**
##Combining full bathrooms and half bathrooms
##Change "year house was built" into "age of house"
ames_test <- ames_test%>%
        mutate(Bathroom = Full.Bath + Half.Bath, House.Age = 2018 - Year.Built)


#**FACTOR VARIABLE MODS**
##Regroup Neighborhood into three levels, reflecting the top third, middle third, and bottom third of average house prices 
levels(ames_test$Neighborhood) <- list(
        Tier_1 = c("NridgHt", "NoRidge", "GrnHill", "StoneBr", "Timber",
                   "Veenker", "Somerst", "Blmngtn", "NWAmes", "Greens"),
        Tier_2 = c("ClearCr", "Gilbert", "Crawfor", "CollgCr", "SawyerW",
                   "Mitchel", "NAmes", "NPkVill", "Sawyer"),
        Tier_3 = c("Blueste", "Edwards", "SWISU", "OldTown", "BrkSide",
                   "BrDale", "IDOTRR", "MeadowV", "Landmrk"))


#**FILTERING DATA**
##Only considering houses sold under "normal conditions" from here on out
ames_test <- ames_test%>%
        filter(ames_test$Sale.Condition == "Normal")

Section 2.4.1: Frequentist Model Out-of-Sample Comparison

The predictions are not dramatically more accurate on the training data. I calculated a slightly greater RMSE using the test data, but this is to be expected since the model was generated from the training data. Overall, the RMSEs from the test and training data are comparable, indicating similar accuracy across both data sets.

pred.freq.test <- predict(finalAIC_freq_model, ames_test, prediction=TRUE)

rmse.freq.test <- sqrt(mean( (exp(ames_test$price) - exp(pred.freq.test))^2 ))
print(list( "Train RMSE" = rmse.freq.train, "Test RMSE" = rmse.freq.test))
## $`Train RMSE`
## [1] 28286.22
## 
## $`Test RMSE`
## [1] 29777.5

Section 2.4.2: Bayesian Model Out-of-Sample Comparison

The same holds for the Bayesian model. The RMSE generated on the test data is slightly higher than that from the training data; however, the RMSEs are comparable and indicate similar accuracy across both data sets.

pred.bas.BMA.test <- predict(full.bas.model, ames_test, 
                    estimator="BMA", 
                    prediction=TRUE, se.fit=TRUE)

rmse.bas.test <- sqrt(mean( (exp(ames_test$price) - exp(pred.bas.BMA.test$fit))^2 ))
print(list( "Train RMSE" = rmse.bas.train, "Test RMSE" = rmse.bas.test))
## $`Train RMSE`
## [1] 28296.07
## 
## $`Test RMSE`
## [1] 29783.26


Part 3: Development of Final Model

The initial model is reliable, so it can now inform the development of a final model.

Section 3.1: Final Model

All of the variables from my previous model will be carried over into this final model, since that model performed well on out-of-sample data and exhibits a high R-squared. Also, no variables were dropped through the BIC and AIC stepwise selection process. Note: please see section 2.1 for my explanation and justification for including these previously regressed variables.

The following variables will be added into this final model:

  • Type of dwelling- choices are single-family homes, duplexes, and townhouses. I would expect single-family homes to be the most expensive when controlling for other variables (such as square footage and quality).

  • Overall condition of home- related to house quality, which is an important predictor (stepwise selection should drop it if it proves too collinear)

  • Years since remodeling- this could qualify a house’s age. For example, an 80-year-old house that was recently remodeled may sell for more than a 35-year-old house that has not been remodeled (holding all else constant)

  • Rating of finished basement area- the quality of the basement and whether or not it is livable. How this square footage can be utilized should affect the price.

  • Total basement area- expected to relate positively to price, like total area (stepwise selection should drop it if it proves too collinear)

  • Cars in garage- the number of cars the garage was built for. Expecting a positive relationship

  • 1st floor area- expected to relate positively to price, like total area (stepwise selection should drop it if it proves too collinear)

  • 2nd floor area- expected to relate positively to price, like total area (stepwise selection should drop it if it proves too collinear)

#Load original data set
load("ames_train.Rdata")

#**LOG TRANSFORM**
##Log-transforming price and variables related to area to improve skew
##log(x+1) for transforming continuous variables containing 0
ames_train <- ames_train%>%
        mutate(price = log(price), area=log(area), Total.Bsmt.SF = log(Total.Bsmt.SF+1), X1st.Flr.SF = log(X1st.Flr.SF), X2nd.Flr.SF = log(X2nd.Flr.SF+1) )


#**NEW VARIABLES**
##Considering "age of house" more straightforward than "year house was built"
##Similarly, changing "remodeling data" to "years since house was remodeled"
ames_train <- ames_train%>%
        mutate(House.Age = 2018 - Year.Built, Remod.Age = 2018 - Year.Remod.Add)


#**FACTOR VARIABLE MODS**
##Regroup Neighborhood into three levels, reflecting the top third, middle third, and bottom third of average house prices 
levels(ames_train$Neighborhood) <- list(Tier_1=c("NridgHt", "NoRidge", "GrnHill", "StoneBr", "Timber", "Veenker", "Somerst", "Blmngtn", "NWAmes", "Greens"), Tier_2= c("ClearCr", "Gilbert", "Crawfor", "CollgCr", "SawyerW", "Mitchel", "NAmes", "NPkVill", "Sawyer"), Tier_3=c("Blueste", "Edwards", "SWISU", "OldTown", "BrkSide", "BrDale", "IDOTRR", "MeadowV", "Landmrk"))


##Similarly, avoid small cell size levels in Building Type
levels(ames_train$Bldg.Type) <- list("1Fam" = "1Fam", "2Fam" = c("2fmCon", "Duplex"), "TwnHouse" = c("Twnhs", "TwnhsE"))

##Create explicit level for NAs
##Combine unfinished basements and no basements (NAs)
ames_train$BsmtFin.Type.1 <- addNA(ames_train$BsmtFin.Type.1)

levels(ames_train$BsmtFin.Type.1) <- list(
        "Good Living" = "GLQ", "Avg Living" = "ALQ", "Below Avg Living" = "BLQ",
        "Avg Rec Room" = "Rec", "Low Qual" = "LwQ", "Unfinished/No Bsmt" = c(NA, "Unf", ""))


#**FILTERING AND SUBSETTING DATA**
##Only considering houses sold under "normal conditions"
ames_train <- ames_train%>%
        filter(ames_train$Sale.Condition == "Normal")

##Finally, subsetting the data set to the variables used in the full model: price, area,
##Neighborhood, Bldg.Type, Overall.Qual, Overall.Cond, BsmtFin.Type.1, Total.Bsmt.SF, Central.Air,
##X1st.Flr.SF, X2nd.Flr.SF, Full.Bath, Half.Bath, Bedroom.AbvGr, plus the new House.Age and Remod.Age
ames_train <- ames_train[,c(2, 3, 15, 18, 20, 21, 36, 41, 44, 46, 47, 51, 52, 53, 82, 83)]

Section 3.1.1: Final Frequentist Model

Model selection entails backwards elimination using both the BIC and AIC criteria. Three variables were dropped under both: the two bathroom variables and the second-floor surface area.

#Start with full model (15 explanatory variables)
full.freq.model2 <- lm(price ~ ., data=ames_train)

#Stepwise selection using BIC criterion (output NOT SHOWN)
##BIC penalty: k = log(n), with n the number of observations
finalBIC_freq_model2 <- step(full.freq.model2, direction="backward", k=log(nrow(ames_train)), trace = FALSE)

#Stepwise selection using AIC criterion (output shown)
finalAIC_freq_model2 <- step(full.freq.model2, direction="backward", k=2, trace = TRUE)
## Start:  AIC=-3551.86
## price ~ area + Neighborhood + Bldg.Type + Overall.Qual + Overall.Cond + 
##     BsmtFin.Type.1 + Total.Bsmt.SF + Central.Air + X1st.Flr.SF + 
##     X2nd.Flr.SF + Full.Bath + Half.Bath + Bedroom.AbvGr + House.Age + 
##     Remod.Age
## 
##                  Df Sum of Sq    RSS     AIC
## - Full.Bath       1   0.00092 11.187 -3553.8
## - Half.Bath       1   0.00428 11.190 -3553.5
## - X2nd.Flr.SF     1   0.01264 11.198 -3552.9
## <none>                        11.186 -3551.9
## - X1st.Flr.SF     1   0.06928 11.255 -3548.7
## - Remod.Age       1   0.18085 11.367 -3540.5
## - Central.Air     1   0.26166 11.447 -3534.6
## - Total.Bsmt.SF   1   0.29382 11.480 -3532.2
## - Bedroom.AbvGr   1   0.43813 11.624 -3521.8
## - Overall.Cond    1   0.63342 11.819 -3507.9
## - House.Age       1   0.69228 11.878 -3503.8
## - Neighborhood    2   0.72806 11.914 -3503.3
## - BsmtFin.Type.1  5   0.97392 12.160 -3492.2
## - area            1   1.12470 12.310 -3474.0
## - Bldg.Type       2   1.16570 12.351 -3473.2
## - Overall.Qual    1   2.89583 14.082 -3361.9
## 
## Step:  AIC=-3553.79
## price ~ area + Neighborhood + Bldg.Type + Overall.Qual + Overall.Cond + 
##     BsmtFin.Type.1 + Total.Bsmt.SF + Central.Air + X1st.Flr.SF + 
##     X2nd.Flr.SF + Half.Bath + Bedroom.AbvGr + House.Age + Remod.Age
## 
##                  Df Sum of Sq    RSS     AIC
## - Half.Bath       1   0.00341 11.190 -3555.5
## - X2nd.Flr.SF     1   0.01262 11.199 -3554.9
## <none>                        11.187 -3553.8
## - X1st.Flr.SF     1   0.06842 11.255 -3550.7
## - Remod.Age       1   0.18695 11.374 -3542.0
## - Central.Air     1   0.26082 11.448 -3536.6
## - Total.Bsmt.SF   1   0.29319 11.480 -3534.2
## - Bedroom.AbvGr   1   0.44227 11.629 -3523.5
## - Overall.Cond    1   0.63252 11.819 -3509.9
## - Neighborhood    2   0.73129 11.918 -3505.0
## - House.Age       1   0.76705 11.954 -3500.5
## - BsmtFin.Type.1  5   0.97328 12.160 -3494.2
## - Bldg.Type       2   1.16690 12.354 -3475.0
## - area            1   1.20727 12.394 -3470.3
## - Overall.Qual    1   2.89524 14.082 -3363.8
## 
## Step:  AIC=-3555.54
## price ~ area + Neighborhood + Bldg.Type + Overall.Qual + Overall.Cond + 
##     BsmtFin.Type.1 + Total.Bsmt.SF + Central.Air + X1st.Flr.SF + 
##     X2nd.Flr.SF + Bedroom.AbvGr + House.Age + Remod.Age
## 
##                  Df Sum of Sq    RSS     AIC
## - X2nd.Flr.SF     1   0.01210 11.202 -3556.6
## <none>                        11.190 -3555.5
## - X1st.Flr.SF     1   0.06510 11.255 -3552.7
## - Remod.Age       1   0.18597 11.376 -3543.8
## - Central.Air     1   0.26379 11.454 -3538.1
## - Total.Bsmt.SF   1   0.29589 11.486 -3535.8
## - Bedroom.AbvGr   1   0.45012 11.640 -3524.6
## - Overall.Cond    1   0.63289 11.823 -3511.7
## - Neighborhood    2   0.74829 11.938 -3505.6
## - House.Age       1   0.80812 11.998 -3499.4
## - BsmtFin.Type.1  5   0.97003 12.160 -3496.2
## - Bldg.Type       2   1.17573 12.366 -3476.2
## - area            1   1.28460 12.475 -3466.9
## - Overall.Qual    1   2.89222 14.082 -3365.8
## 
## Step:  AIC=-3556.64
## price ~ area + Neighborhood + Bldg.Type + Overall.Qual + Overall.Cond + 
##     BsmtFin.Type.1 + Total.Bsmt.SF + Central.Air + X1st.Flr.SF + 
##     Bedroom.AbvGr + House.Age + Remod.Age
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        11.202 -3556.6
## - Remod.Age       1    0.1850 11.387 -3545.0
## - Central.Air     1    0.2618 11.464 -3539.4
## - Total.Bsmt.SF   1    0.2936 11.496 -3537.1
## - Bedroom.AbvGr   1    0.4515 11.654 -3525.7
## - Overall.Cond    1    0.6286 11.831 -3513.1
## - Neighborhood    2    0.7514 11.954 -3506.5
## - House.Age       1    0.8547 12.057 -3497.3
## - BsmtFin.Type.1  5    0.9756 12.178 -3497.0
## - X1st.Flr.SF     1    0.9663 12.169 -3489.6
## - Bldg.Type       2    1.1665 12.369 -3478.0
## - Overall.Qual    1    2.8849 14.087 -3367.5
## - area            1    6.0411 17.243 -3198.9

Both selection methods yield the same model, which instills more confidence in this final model. All variables carried over from the initial model retain the same coefficient signs, and the final model is highly thorough in that it accounts for 91% of the variance in house prices.

print(summary(finalAIC_freq_model2), digits=2)
## 
## Call:
## lm(formula = price ~ area + Neighborhood + Bldg.Type + Overall.Qual + 
##     Overall.Cond + BsmtFin.Type.1 + Total.Bsmt.SF + Central.Air + 
##     X1st.Flr.SF + Bedroom.AbvGr + House.Age + Remod.Age, data = ames_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.745 -0.062 -0.003  0.065  0.492 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       6.91662    0.15913    43.5   <2e-16 ***
## area                              0.49070    0.02341    21.0   <2e-16 ***
## NeighborhoodTier_2               -0.06395    0.01322    -4.8    2e-06 ***
## NeighborhoodTier_3               -0.13466    0.01823    -7.4    4e-13 ***
## Bldg.Type2Fam                    -0.03489    0.02016    -1.7    0.084 .  
## Bldg.TypeTwnHouse                -0.13356    0.01504    -8.9   <2e-16 ***
## Overall.Qual                      0.07760    0.00536    14.5   <2e-16 ***
## Overall.Cond                      0.03273    0.00484     6.8    3e-11 ***
## BsmtFin.Type.1Avg Living         -0.01829    0.01429    -1.3    0.201    
## BsmtFin.Type.1Below Avg Living   -0.05388    0.01717    -3.1    0.002 ** 
## BsmtFin.Type.1Avg Rec Room       -0.05133    0.01682    -3.1    0.002 ** 
## BsmtFin.Type.1Low Qual           -0.08863    0.02099    -4.2    3e-05 ***
## BsmtFin.Type.1Unfinished/No Bsmt -0.10066    0.01302    -7.7    3e-14 ***
## Total.Bsmt.SF                     0.01927    0.00417     4.6    4e-06 ***
## Central.AirY                      0.09547    0.02187     4.4    1e-05 ***
## X1st.Flr.SF                       0.15155    0.01807     8.4    2e-16 ***
## Bedroom.AbvGr                    -0.04225    0.00737    -5.7    1e-08 ***
## House.Age                        -0.00216    0.00027    -7.9    1e-14 ***
## Remod.Age                        -0.00109    0.00030    -3.7    3e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.12 on 815 degrees of freedom
## Multiple R-squared:  0.91,   Adjusted R-squared:  0.91 
## F-statistic: 4.5e+02 on 18 and 815 DF,  p-value: <2e-16

Section 3.1.2: Final Bayesian Model (MCMC)

The following Bayesian regression utilizes a Markov Chain Monte Carlo (“MCMC”) algorithm to randomly step through the model space. It proceeds by evaluating posterior odds between models rather than posterior probabilities, which would require an exhaustive enumeration of the model space. Below is a plot of the Bayesian regression diagnostics.

The first-quadrant plot (upper-right) shows the cumulative probability as unique models are added through MCMC iterations. Its leveling off around 1.0 indicates to me that enough iterations were run to thoroughly represent the model space. The second-quadrant plot shows residuals with roughly constant variance and no apparent pattern.

The third-quadrant plot (lower-left) indicates that models are most likely when they include 15 predictor variables. This means that most variables from the full model should appear in the highest-probability models. The inclusion-probability plot in the fourth quadrant supports this, showing that all but three variables are significant predictors.

full.bas.model2 <- bas.lm(price ~ ., data=ames_train, prior = "JZS", method = "MCMC", initprobs="marg-eplogp", MCMC.iterations=65000)

par(mfrow = c(2,2))
plot(full.bas.model2, cex=.5)
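As a further convergence check (a hedged sketch, not part of the original output), BAS can plot the MCMC estimates of the posterior inclusion probabilities against their renormalized counterparts; points falling along the 45-degree line suggest the chain was run long enough.

#Hedged sketch: MCMC vs renormalized posterior inclusion probabilities
diagnostics(full.bas.model2, type = "pip")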

Below are the regression coefficients calculated by Bayesian model averaging, which makes for an interesting comparison with my final Frequentist model. The Frequentist model drops “Full.Bath”, “Half.Bath”, and “X2nd.Flr.SF”, whereas BMA drops nothing by design.

Instead, it assigns the three lowest posterior inclusion probabilities to these variables, and since the coefficients are weighted by these posterior probabilities, the variables also have the smallest coefficient magnitudes. This ultimately means that these three variables have a small (perhaps negligible) effect on price prediction. Predictions from the Bayesian and Frequentist models should therefore still be comparable.

print(coefficients(full.bas.model2), digits=2)
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  261 models 
##                                   post mean  post SD   post p(B != 0)
## Intercept                         11.99584    0.00407   1.00000      
## area                               0.51327    0.05959   0.99992      
## NeighborhoodTier_2                -0.06744    0.01358   0.99934      
## NeighborhoodTier_3                -0.13656    0.01870   0.99986      
## Bldg.Type2Fam                     -0.01313    0.02073   0.39329      
## Bldg.TypeTwnHouse                 -0.13500    0.01573   0.99902      
## Overall.Qual                       0.07958    0.00561   0.99952      
## Overall.Cond                       0.03214    0.00493   0.99948      
## BsmtFin.Type.1Avg Living          -0.00364    0.01049   0.26260      
## BsmtFin.Type.1Below Avg Living    -0.03667    0.02330   0.81691      
## BsmtFin.Type.1Avg Rec Room        -0.03298    0.02286   0.78152      
## BsmtFin.Type.1Low Qual            -0.07582    0.02365   0.98455      
## BsmtFin.Type.1Unfinished/No Bsmt  -0.09041    0.01410   0.99975      
## Total.Bsmt.SF                      0.01950    0.00421   0.99945      
## Central.AirY                       0.09945    0.02229   0.99886      
## X1st.Flr.SF                        0.12458    0.05391   0.89178      
## X2nd.Flr.SF                       -0.00243    0.00497   0.31277      
## Full.Bath                          0.00025    0.00523   0.18109      
## Half.Bath                          0.00084    0.00527   0.17808      
## Bedroom.AbvGr                     -0.04407    0.00744   0.99982      
## House.Age                         -0.00217    0.00029   0.99989      
## Remod.Age                         -0.00118    0.00031   0.99498


Section 3.2: Variable Transformation Discussion

I performed logarithmic transformations on 5 variables included in my full model: the price variable and four variables related to surface area. My EDA revealed that these numerical variables exhibit right skew and contain outliers with exceptionally large prices and areas. Log transformations reduce the influence of these outliers.

Also, log transformations can help alleviate heteroscedasticity, the tendency for residual variance to change with a model’s fitted values. Heteroscedasticity violates one of the assumptions of linear regression, which requires that the residuals have constant variance.
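The skew reduction can be quantified with a moment-based skewness statistic. A hedged sketch (the skew helper is illustrative, and the raw data are loaded into a separate environment so the processed working copy of ames_train is left untouched):

#Hedged sketch: skewness of price before and after the log transform
raw <- new.env()
load("ames_train.Rdata", envir = raw)

skew <- function(x) mean((x - mean(x))^3) / sd(x)^3   #moment-based skewness

#Expect a large positive skew on the raw dollar scale and a value near 0 after logging
c("raw price" = skew(raw$ames_train$price), "log price" = skew(log(raw$ames_train$price)))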


Section 3.3: Variable Interaction Discussion

Variable interaction describes a situation in which the effect of one explanatory variable on the response depends on the value of another. Without interactions, each coefficient carries the usual “all else held constant” reading. For example, since both price and area are log-transformed, the house area coefficient summarized in this report is interpreted as: “a 1% increase in house area is associated with roughly a 0.5% increase in price, while all other variables are held constant”.

An interaction would occur if the effect of increasing house area on price depended on the neighborhood the house is located in. Perhaps a 500 sq. ft. addition to a house in a Tier 1 neighborhood increases its price by 35%, whereas the same addition to a house in a Tier 3 neighborhood increases its price by only 15%. In other words, the price per square foot depends on the neighborhood because the area and neighborhood variables “interact” with one another.

The above example is plausible; however, no a priori hypotheses were made regarding the presence of variable interactions. It is difficult to accurately assess interactions unless a study was specifically designed to do so. Interaction terms have also been excluded here because “adding interaction terms makes the coefficients of the lower order terms conditional effects, not main effects. … The interaction uses up degrees of freedom, changes the meaning of the lower order coefficients, and complicates the model.” (1)
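For illustration only, a hedged sketch of what such an interaction term would look like (this model is not part of the analysis):

#Hedged sketch: an area-by-neighborhood interaction lets the (log-log)
#price-per-area slope differ across neighborhood tiers
interaction.model <- lm(price ~ area * Neighborhood, data = ames_train)
summary(interaction.model)$coefficients
#The area:NeighborhoodTier_2 and area:NeighborhoodTier_3 rows estimate how much
#the area slope shifts relative to Tier 1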


Section 3.4: Variable Selection Discussion

The BIC and AIC criteria were employed to select variables from the full model through stepwise, backwards elimination. In the end, both criteria produced the same final model.

AIC and BIC were chosen because they work towards more parsimonious models by applying a penalty term that grows with the number of predictors. Model fit and complexity contend with one another in an effort to produce strong, yet simple, models. Such models usually carry fewer variables and assumptions than models selected with the sole intent of maximizing predictive power.
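A minimal sketch of the penalty at work, computing the step()-style BIC by hand (up to an additive constant; the helper name is illustrative):

#Hedged sketch: BIC = n*log(RSS/n) + k*log(n), up to an additive constant
n <- nrow(ames_train)
bic.by.hand <- function(model) {
        k <- length(coef(model))       #number of estimated coefficients
        rss <- sum(residuals(model)^2)
        n * log(rss / n) + k * log(n)
}

#A positive difference means the criterion favors the smaller, final model
bic.by.hand(full.freq.model2) - bic.by.hand(finalAIC_freq_model2)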

Note: Please see sections 2.1 and 3.1 for a full list of the variables in the full model and justifications for their inclusion


Section 3.5: Model Testing Discussion

The initial model’s predictions performed well on the test data, as evidenced by comparable RMSEs; therefore, that model was not overfit on the training data. Also, the BIC and AIC variable selection converged to the same final model, which exhibited a high R-squared of 86%.

Considering the above results, I decided to include all of the variables from my previous model in my final model. It also prompted me to add basement, first-floor, and second-floor surface areas due to the high significance of the total house area variable in the prior model. Similarly, I felt it important to include overall house condition in the second iteration since overall house quality performed well as a predictor in the first regression.



Part 4: Final Model Assessment

Section 4.1: Final Model Residuals Analysis

Regression diagnostics are run to evaluate residuals and check that model assumptions are satisfied.

Section 4.1.1: Final Frequentist Model Residuals

The residual variance appears relatively stable across the fitted values and does not exhibit a pattern (seen in the second quadrant). The normal Q-Q plot (first quadrant) shows that the residuals approximate a normal distribution. Outliers pose little trouble: even observation 611 exhibits a Cook’s distance well below the 1.0 threshold. The two greatest outliers correspond to houses 80+ years of age; subsequent analysis might focus on more contemporary homes (say, 60 years or younger) for an improved RMSE.

#Create regression diagnostic plots
autoplot(finalAIC_freq_model2, label.colour = "red", label.size=4)

#Calculate Cook's distance
cooks.freq2 <- cooks.distance(finalAIC_freq_model2)

#Check Cook's distance
ggplot(ames_train, aes(x=finalAIC_freq_model2$fitted.values, y=cooks.freq2)) + geom_point() + labs(title="Influential Points", y= "Cook's Distance", x="Fitted Values") + theme_minimal()

The residuals from the final Frequentist model exhibit an RMSE of $23,365 on the dollar scale, a 17% decrease from the RMSE of the initial model.

#Calculate frequentist model residuals 
resid.freq2 <- exp(ames_train$price) - exp(finalAIC_freq_model2$fitted.values)

#Calculate RMSE
rmse.freq.train2 <- sqrt(mean(resid.freq2^2))
rmse.freq.train2
## [1] 23364.98

Section 4.1.2: Final Bayesian Model Residuals

Similar to Frequentist diagnostics, the residual variance is stable and exhibits no pattern with fitted values from the Bayesian model.

#Obtain fitted values from averaged model (BMA)
pred.bas.BMA.train2 <- predict(full.bas.model2, ames_train, 
                    estimator="BMA", 
                    prediction=TRUE, se.fit=TRUE)

#Calculate residuals (observed - fitted)
resid.bas.BMA.train2 <- ames_train$price - pred.bas.BMA.train2$fit

#Plot residuals vs fitted values, labeling the largest residuals by row index
ggplot(ames_train, aes(x=pred.bas.BMA.train2$fit, y=resid.bas.BMA.train2)) +
        geom_point() +
        labs(title="Residuals Vs Fitted", y= "Residuals", x="Fitted Values") +
        theme_minimal() +
        geom_text(aes(label=ifelse(abs(resid.bas.BMA.train2) > .490, as.character(seq_along(resid.bas.BMA.train2)), "")), hjust=0, vjust=0)

Similarly, the Bayesian model residuals exhibit an RMSE of $23,355 (also a 17% reduction compared to the initial model).

rmse.bas.train2 <- sqrt(mean( (exp(ames_train$price) - exp(pred.bas.BMA.train2$fit))^2 ))
rmse.bas.train2
## [1] 23354.97


Section 4.2: Final Model Evaluation

Strengths

  1. The final model did not yield sign reversals for any variables also included in the initial model

  2. BIC and AIC selection criteria resulted in the same final model

  3. The adjusted R-squared is very high: the final model accounts for 91% of house price variance

  4. The model makes intuitive sense:

It predicts that price increases with total area, house quality and condition, basement area, the presence of a central air system, and first-floor area. Conversely, it predicts that price decreases when a house is located in a lower-tier neighborhood, is not a single-family home, has a lower-quality basement, or is older in build or remodel date. As in the initial model, a price decrease accompanies additional bedrooms while other variables (like area) are held constant.


Weaknesses

  1. The RMSE is over $20k, which is about 15% of the average house price. This may be relatively high given that the main goal is predictive power and RMSE measures absolute fit. Residual error of this magnitude may compromise the accurate identification of over-valued and under-valued homes.

  2. The model was trained on data containing houses from 9 to 146 years old. Most of the largest residuals come from the older houses, indicating that the model has trouble predicting over this vast age range. Subsequent analysis might subset certain age ranges to improve predictive power.

  3. Some of the predictor variables exhibit multicollinearity, such as house quality/condition, total area/first-floor area, and house age/remodel age. Highly collinear variables remain a concern even though the selection criteria penalized added predictors (a quick check is sketched below).
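A hedged sketch of one way to quantify this, assuming the car package is available:

#Hedged sketch: (generalized) variance inflation factors for the final model;
#values well above ~5-10 would flag problematic collinearity
library(car)
vif(finalAIC_freq_model2)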


Section 4.3: Final Model Validation

Regression diagnostics show that the final model does well on training data. This section runs the final model on out-of-sample data to test for overfitting. It also tests whether the final model properly reflects uncertainty, via a 95% coverage test.

It ends with the final deliverable: identifying which houses are “over-valued” and “under-valued”.

load("ames_validation.Rdata")

#**LOG TRANSFORM**
##Log-transforming price and variables related to area to improve skew
##log(x+1) for transforming continuous variables containing 0
ames_validation <- ames_validation%>%
        mutate(price = log(price), area=log(area), Total.Bsmt.SF = log(Total.Bsmt.SF+1), X1st.Flr.SF = log(X1st.Flr.SF), X2nd.Flr.SF = log(X2nd.Flr.SF+1) )


#**NEW VARIABLES**
##Considering "age of house" more straightforward than "year house was built"
##Similarly, changing "remodeling data" to "years since house was remodeled"
ames_validation <- ames_validation%>%
        mutate(House.Age = 2018 - Year.Built, Remod.Age = 2018 - Year.Remod.Add)


#**FACTOR VARIABLE MODS**
##Regroup Neighborhood into three levels, reflecting the top third, middle third, and bottom third of average house prices 
levels(ames_validation$Neighborhood) <- list(
        Tier_1 = c("NridgHt", "NoRidge", "GrnHill", "StoneBr", "Timber",
                   "Veenker", "Somerst", "Blmngtn", "NWAmes", "Greens"),
        Tier_2 = c("ClearCr", "Gilbert", "Crawfor", "CollgCr", "SawyerW",
                   "Mitchel", "NAmes", "NPkVill", "Sawyer"),
        Tier_3 = c("Blueste", "Edwards", "SWISU", "OldTown", "BrkSide",
                   "BrDale", "IDOTRR", "MeadowV", "Landmrk"))


##Similarly, avoid small cell size levels in Building Type
levels(ames_validation$Bldg.Type) <- list("1Fam" = "1Fam", "2Fam" = c("2fmCon", "Duplex"), "TwnHouse" = c("Twnhs", "TwnhsE"))

##Create explicit level for NAs
##Combine unfinished basements and no basements (NAs)
ames_validation$BsmtFin.Type.1 <- addNA(ames_validation$BsmtFin.Type.1)

levels(ames_validation$BsmtFin.Type.1) <- list(
        "Good Living" = "GLQ", "Avg Living" = "ALQ", "Below Avg Living" = "BLQ",
        "Avg Rec Room" = "Rec", "Low Qual" = "LwQ", "Unfinished/No Bsmt" = c(NA, "Unf", ""))


#**FILTERING AND SUBSETTING DATA**
##Only considering houses sold under "normal conditions"
ames_validation <- ames_validation%>%
        filter(ames_validation$Sale.Condition == "Normal")

Section 4.3.1: Final Frequentist Model Validation

Applying the final model to the out-of-sample validation data yields an RMSE of $22,742. The RMSEs from the training and validation data are comparable, showing that the final model exhibits similar accuracy over both data sets. This lends more credibility to the final model.

pred.freq.validation <- predict(finalAIC_freq_model2, ames_validation, prediction=TRUE)

rmse.freq.validation <- sqrt(mean( (exp(ames_validation$price) - exp(pred.freq.validation))^2 ))
print(list( "Final Model RMSE" = rmse.freq.train2, "Validate RMSE" = rmse.freq.validation))
## $`Final Model RMSE`
## [1] 23364.98
## 
## $`Validate RMSE`
## [1] 22741.77

The 95% prediction intervals calculated with the final Frequentist model contain the validation data’s true prices roughly 95% of the time. This is what we expect, and it indicates that the model’s uncertainty is accurately reflected.

# Predict prices using final model on new validation data... creates upper and lower bound for each predicted value
predict.freq.validation <- predict(finalAIC_freq_model2, ames_validation, interval = "prediction")

# Calculate proportion of observations that fall within prediction intervals
coverage.prob.freq <- mean(ames_validation$price > predict.freq.validation[,"lwr"] &
                            ames_validation$price < predict.freq.validation[,"upr"])
coverage.prob.freq
## [1] 0.9541284

The scatter plot below shows which homes are estimated to be over-valued (red) and under-valued (green), as determined by the residuals: a positive residual means the asking price exceeds the predicted price (over-valued), while a negative residual means the predicted price exceeds the asking price (under-valued).

#Over-valued homes have positive residuals and under-valued homes have negative residuals
ames_train <- ames_train%>%
        mutate(valuation.freq =ifelse(resid.freq2<0, "Under-Valued", "Over-Valued"))

#Plot with hypothetical regression line overlay
ggplot(ames_train, aes(x=finalAIC_freq_model2$fitted.values, y=price, color = valuation.freq)) +
        geom_point(alpha = .3) +
        geom_smooth(method='lm', formula=y~x, se=FALSE, colour="black", size=.5, linetype="dashed") +
        theme(legend.title=element_blank()) +
        labs(title="Post-Model Valuation of Houses", x="Fitted Prices", y= "Observed Prices") +
        scale_color_manual(values = c("Under-Valued" = "green", "Over-Valued" = "red"))


Section 4.3.2: Final Bayesian Model Validation

The Bayesian model generates an RMSE of $22,703 on the validation data. This is comparable to the RMSE generated on the training data, indicating that the model was not overfit.

pred.bas.BMA.validation <- predict(full.bas.model2, ames_validation, 
                    estimator="BMA", 
                    prediction=TRUE, se.fit=TRUE)

rmse.bas.validation <- sqrt(mean( (exp(ames_validation$price) - exp(pred.bas.BMA.validation$fit))^2 ))
print(list("Final Model RMSE" = rmse.bas.train2, "Validate RMSE" = rmse.bas.validation))
## $`Final Model RMSE`
## [1] 23354.97
## 
## $`Validate RMSE`
## [1] 22702.59

95% prediction intervals generated with the BMA model contain the true price from the validation data set about 95% of the time. Again, this indicates that the final Bayesian model accurately reflects expected uncertainty.

predict.bas.validation <- predict(full.bas.model2, ames_validation, 
                    estimator="BMA", 
                    prediction=TRUE, se.fit=TRUE)

# Get data frame of predictions and prediction intervals
pred.df <- as.data.frame(cbind(confint(predict.bas.validation), price = ames_validation$price))

# Fix names in dataset
colnames(pred.df)[1:2] <- c("lwr", "upr")

# Get Coverage
coverage.prob.validation <- pred.df %>% summarize(cover = sum(price > lwr & price < upr)/n())
coverage.prob.validation
##      cover
## 1 0.948886

The plot of under-valued and over-valued houses has been omitted for the Bayesian section because it is very similar to the Frequentist one. The summary below shows that about half of the houses are over-valued and half under-valued, which is what we expect since the residuals are centered near zero.

#Over-valued homes have positive residuals and under-valued homes have negative residuals
ames_train <- ames_train%>%
        mutate(valuation.bas =ifelse(resid.bas.BMA.train2<0, "Under-Valued", "Over-Valued"))

summary(as.factor(ames_train$valuation.bas))
##  Over-Valued Under-Valued 
##          402          432


Part 5: Conclusion

I have applied both Frequentist and Bayesian regression methodologies to housing data from Ames, Iowa in order to produce highly predictive house price models with an R-squared of over 90%.

This began with an exploratory data analysis to familiarize myself with the provided data set, search for variables that might benefit from transformations, and derive preliminary hypotheses regarding variable correlation with house price. These hypotheses informed the variables included in an initial (test) model. That model exhibited a high R-squared, lost no variables during selection (using AIC and BIC), and performed well on out-of-sample data. An initial Bayesian model was also fit, and it corroborated all of the above.

For these reasons, all variables from the initial model were carried into the final full model, and other variables were selectively added in an effort to boost predictive power. For variable selection, AIC and BIC agreed and yielded the same final model. This was supported by a final Bayesian model generated by MCMC (and summarized in BMA form), in which the lowest posterior inclusion probabilities corresponded to the three variables dropped in the Frequentist approach.

Full diagnostics were run on the final models to check for outliers and points of high influence and to test the assumption of normally distributed residuals. The RMSEs decreased and the R-squareds increased going from the initial to the final model. Both final models also performed well on out-of-sample data, as seen in their comparable RMSEs, and properly reflected uncertainty in their 95% prediction intervals on the validation data.

Under-valued houses are a good investment for real-estate firms, and identifying them is the ultimate reason the firm commissioned these predictive models. Any house with an asking price lower than its model-predicted price is considered “under-valued”; however, the degree of under-valuation matters because it determines the feasibility of profiting and the profit margin itself.

With that said, I end the report by showing a list of the top 5 most under-valued houses and the amount that they are under-valued by. This would be the advised starting point for a real-estate agent in Ames, Iowa.

#Order the houses by residual so the most under-valued come first
#(ordering an index, rather than sorting the residuals alone, keeps all columns row-aligned)
under.val <- order(resid.freq2)
head(round(cbind.data.frame("Asking Price"= exp(ames_train$price), "Price Deficit" = resid.freq2, "House Age" = ames_train$House.Age, "Quality"= ames_train$Overall.Qual, "Square Feet" = exp(ames_train$area))[under.val,], digits=2), 5)
##     Asking Price Price Deficit House Age Quality Square Feet
## 96        126000     -63625.17        79       6         856
## 23        139500     -55979.43        34       5        1049
## 642       124900     -52484.26        88       5        1001
## 224       114000     -51309.43       118       4        1039
## 95        227000     -49326.73        17       8        1665

Sources: (1) Testing and Dropping Interaction Terms in Regression