The purpose of this project was to fit a series of regression models to a dataset containing housing features and a corresponding sale price as the response variable. Three models were constructed using both R and JAGS. One of the JAGS models used 3 features to predict sale prices, while the final iteration used 4 features. A number of evaluation metrics (including the deviance information criterion (DIC)) were generated to gauge the accuracy of each model. It was determined that incorporating the 4th feature reduced the DIC and, hence, produced a better fit for the data.
The Ames Housing dataset, which is available on Kaggle.com, was compiled by Dean De Cock for use in data science education. It’s an incredible dataset resource for data scientists and statisticians looking for a modernized and expanded version of the often-cited Boston Housing dataset.
The subject dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, along with the sale price of each home. For the purposes of this project, we will construct two models that leverage a subset (specifically, 3-4) of the most informative explanatory variables to understand their ability to accurately predict sale prices for homes given their features.
The features to be included are:
- LotArea - Lot size in square feet
- OverallCond - Rates the overall condition of the house (10 = Excellent, 1 = Very Poor)
- GrLivArea - Above grade (ground) living area in square feet
We will create an additional iteration of the model that incorporates the feature TotRmsAbvGrd, the total number of rooms above grade.
dat = read.csv(file="data_files/housing-price-data.csv", header=TRUE)
# Subset the raw data to features of interest and the response variable, SalePrice
dat1=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","SalePrice")]
# Code the categorical variable `HouseStyle`.
dat1$HouseStyle_coded = factor(dat1$HouseStyle)
newdat1 = dat1[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","SalePrice")]
head(newdat1)
## LotArea HouseStyle_coded OverallCond GrLivArea SalePrice
## 1 8450 2Story 5 1710 208500
## 2 9600 1Story 8 1262 181500
## 3 11250 2Story 5 1786 223500
## 4 9550 2Story 5 1717 140000
## 5 14260 2Story 5 2198 250000
## 6 14115 1.5Fin 5 1362 143000
Now, let’s look at the relationships among the features and their distributions:
library(Hmisc)
pairs(newdat1)
It looks like most, if not all, of the numeric features have a somewhat linear relationship with SalePrice.
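As a quick numeric check on this (an addition to the original analysis), we can compute the pairwise Pearson correlations among the numeric columns:
# Correlations between the numeric features and SalePrice (complete cases only)
cor(newdat1[, c("LotArea", "OverallCond", "GrLivArea", "SalePrice")], use = "complete.obs")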
hist.data.frame(newdat1)
It also appears that all of our non-categorical variables (including the response variable, SalePrice) have a somewhat normal distribution.
Since each of the features appears to have a linear relationship with SalePrice, let's start by postulating a hierarchical JAGS linear model on the descriptor features. We will use non-informative prior distributions (i.e., large prior variance) for the model coefficients. The model takes the form
\(y_i \sim N(\mu_i,\sigma^2),\quad \mu_i = \beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki},\quad \beta_j \sim N(0,\, 1\times10^6)\ \text{for each}\ j\)
where \(k\) is the number of descriptor variables in the data set and \(i = 1,\dots,n\) indexes the observations.
Equivalently, \(y_i \mid x_i,\beta,\sigma^2 \stackrel{ind}{\sim} N(\beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki},\,\sigma^2)\),
where the non-informative prior for \(\sigma^2\) is modeled using an \(\mathrm{InverseGamma}(\alpha,\beta)\) distribution.
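For a quick point of reference (an added sketch using an ordinary least-squares fit, not part of the original JAGS analysis), the same mean structure can be fit with R's lm(), which dummy-codes the factor HouseStyle_coded automatically:
# Reference OLS fit of the same mean structure (illustrative only; `ref_lm` is not used below)
ref_lm = lm(SalePrice ~ LotArea + OverallCond + HouseStyle_coded, data = newdat1)
summary(ref_lm)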
library("rjags")
newdat1 = na.omit(newdat1)
mod1_jags_string = " model {
for (i in 1:n) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1]*LotArea[i] + b[2]*OverallCond[i]+ b[3]*HouseStyle_coded1.5Unf[i]
+b[4]*HouseStyle_coded1Story[i]+b[5]*HouseStyle_coded2.5Fin[i]+ b[6]*HouseStyle_coded2.5Unf[i]
+ b[7]*HouseStyle_coded2Story[i]+ b[8]*HouseStyle_codedSFoyer[i]+ b[9]*HouseStyle_codedSLvl[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:9) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
data_jags = list(y=newdat1$SalePrice,LotArea=newdat1$LotArea,OverallCond=newdat1$OverallCond,
HouseStyle_coded1.5Unf=as.numeric(newdat1$HouseStyle_coded=="1.5Unf"),HouseStyle_coded1Story=as.numeric(newdat1$HouseStyle_coded=="1Story"),
HouseStyle_coded2.5Fin=as.numeric(newdat1$HouseStyle_coded=="2.5Fin"),HouseStyle_coded2.5Unf=as.numeric(newdat1$HouseStyle_coded=="2.5Unf"),
HouseStyle_coded2Story=as.numeric(newdat1$HouseStyle_coded=="2Story"),HouseStyle_codedSFoyer=as.numeric(newdat1$HouseStyle_coded=="SFoyer"),
HouseStyle_codedSLvl=as.numeric(newdat1$HouseStyle_coded=="SLvl"),n=nrow(newdat1))
params1 = c("b0","b", "sig")
inits1 = function() {
inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(9,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod1_jags = jags.model(textConnection(mod1_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1460
## Unobserved stochastic nodes: 11
## Total graph size: 17039
##
## Initializing model
update(mod1_jags, 1000) # burn-in
mod1_jags_sim = coda.samples(model=mod1_jags,
variable.names=params1,
n.iter=5000)
mod1_jags_csim = do.call(rbind, mod1_jags_sim) # combine multiple chains
# gelman.diag(mod1_jags_sim)
# autocorr.diag(mod1_jags_sim)
# autocorr.plot(mod1_jags_sim)
# effectiveSize(mod1_jags_sim)
plot(mod1_jags_sim)
We can observe that the above traces for the \(b[i]\) coefficients converge and the posterior densities all appear approximately normal. Since the chains have converged, we can use these samples to summarize the posterior distributions of the coefficients.
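The visual assessment can be backed up by the diagnostics commented out above; a minimal sketch using the coda functions loaded alongside rjags:
gelman.diag(mod1_jags_sim)    # potential scale reduction factors; values near 1 suggest convergence
effectiveSize(mod1_jags_sim)  # effective sample sizes, adjusted for autocorrelation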
We can get a posterior summary of the parameters in our model.
summary(mod1_jags_sim)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 4.516 0.2141 0.001748 0.002923
## b[2] 18733.066 533.1838 4.353428 7.686240
## b[3] -82.840 992.1058 8.100510 8.259117
## b[4] 1836.497 966.5476 7.891828 8.118721
## b[5] 38.119 991.7340 8.097474 8.038132
## b[6] -8.461 1005.5864 8.210578 8.210332
## b[7] 3111.074 982.3646 8.020973 8.075076
## b[8] -50.427 994.0725 8.116568 8.116447
## b[9] 41.293 1005.7823 8.212178 8.141461
## b0 4384.127 982.4641 8.021786 9.389993
## sig 87098.806 1715.7394 14.008953 17.290666
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 4.094 4.371 4.519 4.662 4.934
## b[2] 17686.155 18370.307 18729.466 19093.857 19776.835
## b[3] -1997.715 -755.499 -87.709 582.436 1869.909
## b[4] -38.028 1181.517 1826.616 2500.843 3746.310
## b[5] -1910.057 -631.316 38.970 715.903 1970.005
## b[6] -1989.969 -691.333 -3.640 671.758 1963.364
## b[7] 1205.133 2449.899 3103.954 3779.193 5046.609
## b[8] -2003.124 -723.413 -37.513 630.763 1874.435
## b[9] -1892.400 -651.491 35.083 724.979 2037.988
## b0 2470.640 3717.156 4375.585 5051.985 6312.429
## sig 83831.321 85920.419 87079.213 88242.660 90580.620
We notice that the standard deviation is quite high for the coefficients associated with the coded categorical variable HouseStyle_coded. This makes sense since the \(x_i\) for this feature can only take on values of \(0\) or \(1\).
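Counting the observations in each HouseStyle level (an added check, not in the original write-up) can also help explain this: indicator columns for rarely occurring levels carry little information, which widens the posterior spread of their coefficients.
table(newdat1$HouseStyle_coded)  # number of observations per HouseStyle level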
In a Bayesian model, we have distributions for residuals, but we’ll simplify and look only at the residuals evaluated at the posterior mean of the parameters.
X = cbind(rep(1.0, data_jags$n), data_jags$LotArea,data_jags$OverallCond,data_jags$HouseStyle_coded1.5Unf,
data_jags$HouseStyle_coded1Story,data_jags$HouseStyle_coded2.5Fin,data_jags$HouseStyle_coded2.5Unf,
data_jags$HouseStyle_coded2Story,data_jags$HouseStyle_codedSFoyer,data_jags$HouseStyle_codedSLvl)
head(X)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 8450 5 0 0 0 0 1 0 0
## [2,] 1 9600 8 0 1 0 0 0 0 0
## [3,] 1 11250 5 0 0 0 0 1 0 0
## [4,] 1 9550 5 0 0 0 0 1 0 0
## [5,] 1 14260 5 0 0 0 0 1 0 0
## [6,] 1 14115 5 0 0 0 0 0 0 0
(pm_params = colMeans(mod1_jags_csim)) # posterior mean
## b[1] b[2] b[3] b[4] b[5] b[6]
## 4.516035 18733.066495 -82.840268 1836.496662 38.119044 -8.461405
## b[7] b[8] b[9] b0 sig
## 3111.073537 -50.426905 41.293430 4384.126828 87098.805598
Next, we create the vector of fitted values, \(\hat{y}\), implied by the posterior means and examine the residuals.
yhat = drop(X %*% pm_params[c("b0", paste0("b[", 1:9, "]"))]) # order the coefficients to match the columns of X (intercept first)
resid = data_jags$y - yhat
plot(resid) # residuals against data index
plot(yhat, resid) # residuals against predicted values
qqnorm(resid) # checking normality of residuals
# rownames(dat1)[order(resid1, decreasing=TRUE)[1:5]]
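As a simple numeric summary of fit (an addition, not part of the original analysis), we can also compute the root mean squared error of these point-estimate residuals:
sqrt(mean(resid^2))  # RMSE of the posterior-mean fit, in dollars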
As mentioned previously, we will build another linear model that incorporates the numeric feature TotRmsAbvGrd, which is the total number of rooms above grade (does not include bathrooms).
#Adjust the dataset to include `TotRmsAbvGrd`.
dat2=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
dat2$HouseStyle_coded = factor(dat2$HouseStyle)
newdat2 = dat2[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
# head(newdat2)
The JAGS model becomes:
mod2_jags_string = " model {
for (i in 1:n) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1]*LotArea[i] + b[2]*OverallCond[i]+ b[3]*HouseStyle_coded1.5Unf[i]
+ b[4]*HouseStyle_coded1Story[i]+ b[5]*HouseStyle_coded2.5Fin[i]+ b[6]*HouseStyle_coded2.5Unf[i]
+ b[7]*HouseStyle_coded2Story[i]+ b[8]*HouseStyle_codedSFoyer[i]+ b[9]*HouseStyle_codedSLvl[i]
+ b[10]*TotRmsAbvGrd[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:10) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
data_jags = list(y=newdat2$SalePrice, LotArea=newdat2$LotArea,OverallCond=newdat2$OverallCond,
HouseStyle_coded1.5Unf=as.numeric(newdat2$HouseStyle_coded=="1.5Unf"),HouseStyle_coded1Story=as.numeric(newdat2$HouseStyle_coded=="1Story"),
HouseStyle_coded2.5Fin=as.numeric(newdat2$HouseStyle_coded=="2.5Fin"),HouseStyle_coded2.5Unf=as.numeric(newdat2$HouseStyle_coded=="2.5Unf"),
HouseStyle_coded2Story=as.numeric(newdat2$HouseStyle_coded=="2Story"),HouseStyle_codedSFoyer=as.numeric(newdat2$HouseStyle_coded=="SFoyer"),
HouseStyle_codedSLvl=as.numeric(newdat2$HouseStyle_coded=="SLvl"),TotRmsAbvGrd=newdat2$TotRmsAbvGrd,
n=nrow(newdat2))
params1 = c("b0","b", "sig")
inits1 = function() {
inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(10,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod2_jags = jags.model(textConnection(mod2_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1460
## Unobserved stochastic nodes: 12
## Total graph size: 18584
##
## Initializing model
update(mod2_jags, 1000) # burn-in
mod2_jags_sim = coda.samples(model=mod2_jags,
variable.names=params1,
n.iter=5000)
mod2_jags_csim = do.call(rbind, mod2_jags_sim) # combine multiple chains
# gelman.diag(mod2_jags_sim)
# autocorr.diag(mod2_jags_sim)
# autocorr.plot(mod2_jags_sim)
# effectiveSize(mod2_jags_sim)
plot(mod2_jags_sim)
summary(mod2_jags_sim)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 2.187 0.1804 0.001473 0.002536
## b[2] 6320.474 644.9932 5.266347 14.616028
## b[3] -76.500 1000.7012 8.170691 8.187354
## b[4] 1828.731 961.7537 7.852686 8.330413
## b[5] -56.935 998.6541 8.153977 7.949206
## b[6] -108.204 1007.2232 8.223943 8.222919
## b[7] 1659.996 984.3864 8.037482 8.278267
## b[8] -46.278 993.5073 8.111954 8.111943
## b[9] -40.484 989.1334 8.076240 8.076471
## b[10] 17422.713 593.2215 4.843633 13.555247
## b0 1848.780 988.5787 8.071711 9.950972
## sig 68025.482 1311.2246 10.706104 12.953728
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 1.835 2.068 2.186 2.307 2.545
## b[2] 5039.984 5883.579 6328.454 6760.770 7571.565
## b[3] -2064.627 -742.690 -70.252 601.595 1856.803
## b[4] -54.297 1176.913 1826.188 2472.133 3718.412
## b[5] -1986.068 -746.544 -54.830 622.248 1882.227
## b[6] -2114.466 -780.700 -108.644 565.399 1867.003
## b[7] -249.170 991.382 1656.688 2325.639 3610.991
## b[8] -1991.705 -714.529 -52.019 623.372 1902.876
## b[9] -1967.256 -706.296 -25.291 629.126 1889.060
## b[10] 16238.684 17027.802 17425.099 17822.135 18574.408
## b0 -79.009 1187.955 1846.208 2518.234 3778.759
## sig 65535.771 67109.584 67990.308 68897.752 70641.267
The model coefficient for TotRmsAbvGrd, b[10], is quite large compared with the other coefficients, indicating that this feature is a stronger driver of SalePrice. As with the previous model, the trace plots show convergence for all of the coefficients, and the posterior densities are approximately normal.
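Note that the raw coefficients sit on very different scales (LotArea is measured in square feet, while TotRmsAbvGrd is a room count), so as an added, rough comparison (not part of the original analysis) we can scale each posterior-mean coefficient by the standard deviation of its feature:
pm2 = colMeans(mod2_jags_csim)
# Approximate change in SalePrice per one-standard-deviation change in each numeric feature
c(LotArea      = pm2[["b[1]"]]  * sd(newdat2$LotArea),
  OverallCond  = pm2[["b[2]"]]  * sd(newdat2$OverallCond),
  TotRmsAbvGrd = pm2[["b[10]"]] * sd(newdat2$TotRmsAbvGrd))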
Let’s compare the two models using the DIC metric:
dic.samples(mod1_jags, n.iter=1e3)
## Mean deviance: 37363
## penalty 3.277
## Penalized deviance: 37366
dic.samples(mod2_jags, n.iter=1e3)
## Mean deviance: 36640
## penalty 3.526
## Penalized deviance: 36644
Upon examination of the deviance information criterion (DIC), the penalized deviance of the second model (which incorporates TotRmsAbvGrd) is lower than that of the first model. We conclude that the second model, which incorporates the additional feature, is a better fit (and predictor) for the data.
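As a convenience (an added sketch, not part of the original analysis), the two DIC objects can also be stored and differenced directly with rjags' diffdic(), which reports the difference in penalized deviance along with a sampling standard error:
dic1 = dic.samples(mod1_jags, n.iter=1e3)  # re-runs the DIC sampling for model 1
dic2 = dic.samples(mod2_jags, n.iter=1e3)  # re-runs the DIC sampling for model 2
diffdic(dic1, dic2)  # a positive difference indicates the second model has the lower penalized deviance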