Executive Summary

The purpose of this project was to fit a series of regression models to a dataset containing housing features and a corresponding sale price as the response variable. Three models were constructed: a reference linear model in R and two hierarchical Bayesian models in JAGS. The first JAGS model used 3 features to predict sale prices, while the final iteration used 4 features. Several evaluation metrics, including the deviance information criterion (DIC), were generated to gauge the fit of each model. It was determined that incorporating the 4th feature reduced the DIC and, hence, produced a better fit to the data.

Introduction

The Ames Housing dataset, which is available on Kaggle.com, was compiled by Dean De Cock for use in data science education. It is an excellent resource for data scientists and statisticians looking for a modernized and expanded version of the often-cited Boston Housing dataset.

The subject dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, along with the sale price of each home. For the purposes of this project, we will construct two models that leverage a subset (specifically, 3-4) of the most informative explanatory variables to predict sale prices for homes given their features.

The features to be included are:

LotArea: lot size in square feet
HouseStyle: style of dwelling
OverallCond: overall condition rating
GrLivArea: above-grade (ground) living area in square feet

We will create an additional iteration of the model that incorporates the feature TotRmsAbvGrd, the total number of rooms above grade.

Load and Explore the Dataset

dat = read.csv(file="data_files/housing-price-data.csv", header=TRUE)

# Subset the raw data to features of interest and the response variable, SalePrice
dat1=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","SalePrice")]
head(dat1)
##   LotArea HouseStyle OverallCond GrLivArea SalePrice
## 1    8450     2Story           5      1710    208500
## 2    9600     1Story           8      1262    181500
## 3   11250     2Story           5      1786    223500
## 4    9550     2Story           5      1717    140000
## 5   14260     2Story           5      2198    250000
## 6   14115     1.5Fin           5      1362    143000
str(dat1)
## 'data.frame':    1460 obs. of  5 variables:
##  $ LotArea    : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ HouseStyle : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallCond: int  5 8 5 5 5 5 5 6 5 6 ...
##  $ GrLivArea  : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ SalePrice  : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

We will need to encode the categorical variable HouseStyle as a factor.

dat1$HouseStyle_coded = factor(dat1$HouseStyle)
newdat1 = dat1[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","SalePrice")]

head(newdat1)
##   LotArea HouseStyle_coded OverallCond GrLivArea SalePrice
## 1    8450           2Story           5      1710    208500
## 2    9600           1Story           8      1262    181500
## 3   11250           2Story           5      1786    223500
## 4    9550           2Story           5      1717    140000
## 5   14260           2Story           5      2198    250000
## 6   14115           1.5Fin           5      1362    143000
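As a quick check, we can list the levels of the new factor (a minimal sketch using the objects defined above). The lm() summary below confirms that the first level, 1.5Fin, is absorbed into the intercept as the reference category, which is why only seven HouseStyle indicators appear in the regression models that follow.

# List the factor levels; the first (alphabetical) level serves as the reference category
levels(newdat1$HouseStyle_coded)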

Let’s examine the distribution of the response variable, SalePrice:

hist(newdat1$SalePrice,breaks=50,main="Histogram of SalePrice",xlab = "SalePrice")
abline(v = mean(newdat1$SalePrice), col='blue', lwd = 3)
#add label to mean vertical line
text(x=4e5, y=2e2, 'Mean = $181K, SD= $79K') 

It appears that the distribution of SalePrice is approximately normal with some right skew; its mean and standard deviation are calculated as follows:

mean(newdat1$SalePrice)
## [1] 180921.2
sd(newdat1$SalePrice)
## [1] 79442.5

Now, let’s look at the relationships among the features and their distributions:

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
pairs(newdat1)

It looks like most, if not all, of the numeric features have a somewhat linear relationship with SalePrice.
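To back up this visual impression with numbers, we can compute the pairwise correlations among the numeric columns (a minimal sketch; num_cols is just a helper vector defined here):

# Pairwise correlations among the numeric features and the response
num_cols = c("LotArea", "OverallCond", "GrLivArea", "SalePrice")
round(cor(newdat1[, num_cols], use="complete.obs"), 2)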

hist.data.frame(newdat1)

It also appears that our non-categorical variables (including the response variable, SalePrice) have roughly unimodal, approximately normal distributions, though some show right skew.

Postulate a Model

Let’s start off by postulating a linear model of SalePrice on the descriptor features.

mod_linear = lm(SalePrice~LotArea+HouseStyle_coded+OverallCond+GrLivArea, data=newdat1)
summary(mod_linear)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + HouseStyle_coded + OverallCond + 
##     GrLivArea, data = newdat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -549116  -22005   -1040   21462  300921 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -7.391e+04  1.004e+04  -7.364 2.97e-13 ***
## LotArea                 3.781e-01  1.402e-01   2.697 0.007069 ** 
## HouseStyle_coded1.5Unf  5.377e+04  1.429e+04   3.762 0.000175 ***
## HouseStyle_coded1Story  6.702e+04  4.621e+03  14.502  < 2e-16 ***
## HouseStyle_coded2.5Fin -8.912e+04  1.888e+04  -4.721 2.57e-06 ***
## HouseStyle_coded2.5Unf -2.873e+04  1.587e+04  -1.811 0.070345 .  
## HouseStyle_coded2Story  2.717e+04  4.885e+03   5.562 3.18e-08 ***
## HouseStyle_codedSFoyer  6.941e+04  9.463e+03   7.335 3.68e-13 ***
## HouseStyle_codedSLvl    4.864e+04  7.523e+03   6.465 1.38e-10 ***
## OverallCond             2.121e+03  1.221e+03   1.738 0.082451 .  
## GrLivArea               1.278e+02  3.227e+00  39.612  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50680 on 1449 degrees of freedom
## Multiple R-squared:  0.5958, Adjusted R-squared:  0.593 
## F-statistic: 213.6 on 10 and 1449 DF,  p-value: < 2.2e-16

We now create a hierarchical model in JAGS with the following configuration:

\(y_i \sim N(\mu_i,\sigma^2), \quad \mu_i = \beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki}, \quad \beta_j \sim N(0, 10^6), \; j = 0,\dots,k\)

We note that \(k\) is the number of descriptor variables in the data set and \(i = 1,\dots,n\) indexes the observations.

Also,

\(y_i \mid x_i,\beta,\sigma^2 \stackrel{ind}{\sim} N(\beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki},\,\sigma^2)\),

where the prior for \(\sigma^2\) is modeled using an \(InverseGamma(\alpha,\beta)\) distribution, implemented in JAGS by placing the corresponding Gamma prior on the precision \(1/\sigma^2\).
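Concretely, the precision prior used in the JAGS code below, prec ~ dgamma(5/2.0, 5*10.0/2.0), can be read as the following (a restatement under the usual Gamma/Inverse-Gamma correspondence, interpreting \(n_0 = 5\) as a prior effective sample size and \(\sigma_0^2 = 10\) as a prior guess for the variance):

\(1/\sigma^2 \sim Gamma\left(\tfrac{n_0}{2}, \tfrac{n_0\,\sigma_0^2}{2}\right) \iff \sigma^2 \sim InverseGamma\left(\tfrac{5}{2}, \tfrac{5 \cdot 10}{2}\right)\)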

library("rjags")
## Loading required package: coda
## Linked to JAGS 4.3.2
## Loaded modules: basemod,bugs
newdat1 = na.omit(newdat1)

mod1_jags_string = " model {
    for (i in 1:n) {
        y[i] ~ dnorm(mu[i], prec)
        mu[i] = b0 
                + b[1]*LotArea[i] 
                + b[2]*OverallCond[i]
                + b[3]*HouseStyle_coded1.5Unf[i]
                + b[4]*HouseStyle_coded1Story[i]
                + b[5]*HouseStyle_coded2.5Fin[i]
                + b[6]*HouseStyle_coded2.5Unf[i]
                + b[7]*HouseStyle_coded2Story[i]
                + b[8]*HouseStyle_codedSFoyer[i]
                + b[9]*HouseStyle_codedSLvl[i]
    }
    
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    
    for (i in 1:9) {
        b[i] ~ dnorm(0.0, 1.0/1.0e6)
    }
    
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
} "

set.seed(72)
data_jags = list(y=newdat1$SalePrice, 
                 LotArea=newdat1$LotArea,
                 OverallCond=newdat1$OverallCond,
                 HouseStyle_coded1.5Unf=as.numeric(newdat1$HouseStyle_coded=="1.5Unf"),
                 HouseStyle_coded1Story=as.numeric(newdat1$HouseStyle_coded=="1Story"),
                 HouseStyle_coded2.5Fin=as.numeric(newdat1$HouseStyle_coded=="2.5Fin"),
                 HouseStyle_coded2.5Unf=as.numeric(newdat1$HouseStyle_coded=="2.5Unf"),
                 HouseStyle_coded2Story=as.numeric(newdat1$HouseStyle_coded=="2Story"),
                 HouseStyle_codedSFoyer=as.numeric(newdat1$HouseStyle_coded=="SFoyer"),
                 HouseStyle_codedSLvl=as.numeric(newdat1$HouseStyle_coded=="SLvl"),
                 n=nrow(newdat1)) 

params1 = c("b0","b", "sig")

inits1 = function() {
    inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(9,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}

mod1_jags = jags.model(textConnection(mod1_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1460
##    Unobserved stochastic nodes: 11
##    Total graph size: 17039
## 
## Initializing model

Fit the Model using the Markov Chain Monte Carlo (MCMC) Sampler

update(mod1_jags, 1000) # burn-in

mod1_jags_sim = coda.samples(model=mod1_jags,
                        variable.names=params1,
                        n.iter=5000)

mod1_jags_csim = do.call(rbind, mod1_jags_sim) # combine multiple chains

Check the Model by Examining Convergence Diagnostics

plot(mod1_jags_sim)

gelman.diag(mod1_jags_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1          1
## b[2]          1          1
## b[3]          1          1
## b[4]          1          1
## b[5]          1          1
## b[6]          1          1
## b[7]          1          1
## b[8]          1          1
## b[9]          1          1
## b0            1          1
## sig           1          1
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod1_jags_sim)
##               b[1]        b[2]         b[3]         b[4]         b[5]
## Lag 0  1.000000000 1.000000000  1.000000000  1.000000000  1.000000000
## Lag 1  0.467351114 0.517387579 -0.006042575  0.032073550  0.003355981
## Lag 5  0.035142282 0.027545286 -0.006636925 -0.010659212 -0.001548847
## Lag 10 0.007007868 0.012057753 -0.008422920 -0.001790485 -0.003374652
## Lag 50 0.006696992 0.004084732  0.006054505  0.003195365  0.005494540
##                b[6]         b[7]         b[8]         b[9]           b0
## Lag 0   1.000000000  1.000000000  1.000000000  1.000000000  1.000000000
## Lag 1   0.011995885  0.036234173 -0.010341378  0.008083053  0.144759361
## Lag 5   0.002093311 -0.004502159  0.012347229  0.000930736 -0.003354070
## Lag 10 -0.013581633 -0.002436371 -0.006185099 -0.007064863  0.005869866
## Lag 50  0.003133407 -0.003571329  0.012985793  0.004794318  0.002187294
##                  sig
## Lag 0   1.0000000000
## Lag 1   0.1451967806
## Lag 5  -0.0004499572
## Lag 10  0.0017325720
## Lag 50  0.0084397105
autocorr.plot(mod1_jags_sim)

effectiveSize(mod1_jags_sim)
##      b[1]      b[2]      b[3]      b[4]      b[5]      b[6]      b[7]      b[8] 
##  5444.727  4828.779 14677.558 14067.230 14491.556 14949.060 13522.636 14706.595 
##      b[9]        b0       sig 
## 15000.000 12147.659  9766.578
dic.samples(mod1_jags, n.iter=1e3)
## Mean deviance:  37362 
## penalty 3.133 
## Penalized deviance: 37365

We can get a posterior summary of the parameters in our model.

summary(mod1_jags_sim)
## 
## Iterations = 1001:6000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean       SD  Naive SE Time-series SE
## b[1]     4.523    0.214  0.001748       0.002901
## b[2] 18715.426  538.964  4.400624       7.766745
## b[3]   -75.640  999.420  8.160231       8.250287
## b[4]  1842.948  979.917  8.000988       8.261489
## b[5]    49.500  992.786  8.106064       8.259452
## b[6]     3.195  998.208  8.150335       8.164812
## b[7]  3103.537  976.898  7.976336       8.400458
## b[8]   -75.392 1002.471  8.185144       8.271105
## b[9]    45.090  996.310  8.134838       8.134915
## b0    4386.038  987.317  8.061413       8.985172
## sig  87122.053 1749.489 14.284517      17.808990
## 
## 2. Quantiles for each variable:
## 
##         2.5%      25%       50%       75%     97.5%
## b[1]     4.1     4.38     4.521     4.667     4.946
## b[2] 17660.8 18352.10 18713.917 19082.405 19762.434
## b[3] -2045.6  -743.75   -69.981   600.002  1867.847
## b[4]  -101.3  1187.47  1845.519  2503.085  3755.609
## b[5] -1876.0  -624.02    46.494   722.721  2002.405
## b[6] -1980.5  -672.99     3.266   672.909  1951.205
## b[7]  1176.1  2451.19  3097.659  3756.873  5021.667
## b[8] -2031.6  -756.56   -66.424   601.091  1888.087
## b[9] -1890.9  -631.12    43.385   719.728  2004.057
## b0    2439.0  3716.83  4387.870  5059.510  6326.242
## sig  83744.6 85930.55 87110.946 88298.273 90621.011

Residual checks

Checking residuals (the difference between the response and the model’s prediction for that value) is important with linear models, since residuals can reveal violations of the assumptions we made when specifying the model. In particular, we are looking for any sign that the relationship is not linear, that the residuals are not normally distributed, or that the observations are not independent (conditional on covariates). We first evaluate the simple linear model proposed earlier:

plot(resid(mod_linear)) # to check independence (looks okay)

plot(predict(mod_linear), resid(mod_linear)) # to check for linearity, constant variance (looks reasonable)

qqnorm(resid(mod_linear)) # to check Normality assumption (we want this to be a straight line)

Now let’s return to our JAGS model. In a Bayesian model, we have distributions for residuals, but we’ll simplify and look only at the residuals evaluated at the posterior mean of the parameters.

X = cbind(
          rep(1.0, data_jags$n), 
          data_jags$LotArea,
          data_jags$OverallCond,
          data_jags$HouseStyle_coded1.5Unf,
          data_jags$HouseStyle_coded1Story,
          data_jags$HouseStyle_coded2.5Fin,
          data_jags$HouseStyle_coded2.5Unf,
          data_jags$HouseStyle_coded2Story,
          data_jags$HouseStyle_codedSFoyer,
          data_jags$HouseStyle_codedSLvl
          )
head(X)
##      [,1]  [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1  8450    5    0    0    0    0    1    0     0
## [2,]    1  9600    8    0    1    0    0    0    0     0
## [3,]    1 11250    5    0    0    0    0    1    0     0
## [4,]    1  9550    5    0    0    0    0    1    0     0
## [5,]    1 14260    5    0    0    0    0    1    0     0
## [6,]    1 14115    5    0    0    0    0    0    0     0
(pm_params = colMeans(mod1_jags_csim)) # posterior mean
##         b[1]         b[2]         b[3]         b[4]         b[5]         b[6] 
##     4.522740 18715.426133   -75.639783  1842.947571    49.500478     3.195325 
##         b[7]         b[8]         b[9]           b0          sig 
##  3103.537308   -75.392071    45.090052  4386.037726 87122.053475
# reorder the posterior means so that b0 lines up with the intercept column of X
yhat = drop(X %*% pm_params[c("b0", paste0("b[", 1:9, "]"))])
resid = data_jags$y - yhat
plot(resid) # residuals against data index

plot(yhat, resid) # residuals against predicted values

qqnorm(resid) # checking normality of residuals

plot(predict(mod_linear), resid(mod_linear)) # to compare with reference linear model

# rownames(newdat1)[order(resid, decreasing=TRUE)[1:5]] # rows with the largest residuals
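For completeness, here is a minimal sketch of the "distribution of residuals" mentioned above for a single observation, using the posterior draws already stored in mod1_jags_csim rather than only the posterior mean (the variable names introduced here, such as resid1_draws, are illustrative):

# Posterior draws of the residual for observation 1: each draw of (b0, b)
# yields one draw of mu[1], and hence one draw of y[1] - mu[1]
coef_draws = mod1_jags_csim[, c("b0", paste0("b[", 1:9, "]"))]
mu1_draws = drop(coef_draws %*% X[1, ])
resid1_draws = data_jags$y[1] - mu1_draws
hist(resid1_draws, main="Posterior draws of the residual for observation 1")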

Iterate with another model

As mentioned previously, we will build another linear model that incorporates the feature TotRmsAbvGrd, the total number of rooms above grade (excluding bathrooms).

We first adjust the dataset to include TotRmsAbvGrd.

dat2=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
dat2$HouseStyle_coded = factor(dat2$HouseStyle)
newdat2 = dat2[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]

head(newdat2)
##   LotArea HouseStyle_coded OverallCond GrLivArea TotRmsAbvGrd SalePrice
## 1    8450           2Story           5      1710            8    208500
## 2    9600           1Story           8      1262            6    181500
## 3   11250           2Story           5      1786            6    223500
## 4    9550           2Story           5      1717            7    140000
## 5   14260           2Story           5      2198            9    250000
## 6   14115           1.5Fin           5      1362            5    143000
mod2_jags_string = " model {
    for (i in 1:n) {
        y[i] ~ dnorm(mu[i], prec)
        mu[i] = b0 
                + b[1]*LotArea[i] 
                + b[2]*OverallCond[i]
                + b[3]*HouseStyle_coded1.5Unf[i]
                + b[4]*HouseStyle_coded1Story[i]
                + b[5]*HouseStyle_coded2.5Fin[i]
                + b[6]*HouseStyle_coded2.5Unf[i]
                + b[7]*HouseStyle_coded2Story[i]
                + b[8]*HouseStyle_codedSFoyer[i]
                + b[9]*HouseStyle_codedSLvl[i]
                + b[10]*TotRmsAbvGrd[i]
    }
    
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    
    for (i in 1:10) {
        b[i] ~ dnorm(0.0, 1.0/1.0e6)
    }
    
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
} "

set.seed(72)
data_jags = list(y=newdat2$SalePrice, 
                 LotArea=newdat2$LotArea,
                 OverallCond=newdat2$OverallCond,
                 HouseStyle_coded1.5Unf=as.numeric(newdat2$HouseStyle_coded=="1.5Unf"),
                 HouseStyle_coded1Story=as.numeric(newdat2$HouseStyle_coded=="1Story"),
                 HouseStyle_coded2.5Fin=as.numeric(newdat2$HouseStyle_coded=="2.5Fin"),
                 HouseStyle_coded2.5Unf=as.numeric(newdat2$HouseStyle_coded=="2.5Unf"),
                 HouseStyle_coded2Story=as.numeric(newdat2$HouseStyle_coded=="2Story"),
                 HouseStyle_codedSFoyer=as.numeric(newdat2$HouseStyle_coded=="SFoyer"),
                 HouseStyle_codedSLvl=as.numeric(newdat2$HouseStyle_coded=="SLvl"),
                 TotRmsAbvGrd=newdat2$TotRmsAbvGrd,
                 n=nrow(newdat2)) 

params1 = c("b0","b", "sig")

inits1 = function() {
    inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(10,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}

mod2_jags = jags.model(textConnection(mod2_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1460
##    Unobserved stochastic nodes: 12
##    Total graph size: 18584
## 
## Initializing model
update(mod2_jags, 1000) # burn-in

mod2_jags_sim = coda.samples(model=mod2_jags,
                        variable.names=params1,
                        n.iter=5000)

mod2_jags_csim = do.call(rbind, mod2_jags_sim) # combine multiple chains
plot(mod2_jags_sim)

gelman.diag(mod2_jags_sim)
## Potential scale reduction factors:
## 
##       Point est. Upper C.I.
## b[1]           1          1
## b[2]           1          1
## b[3]           1          1
## b[4]           1          1
## b[5]           1          1
## b[6]           1          1
## b[7]           1          1
## b[8]           1          1
## b[9]           1          1
## b[10]          1          1
## b0             1          1
## sig            1          1
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod2_jags_sim)
##                b[1]       b[2]          b[3]         b[4]         b[5]
## Lag 0   1.000000000 1.00000000  1.0000000000  1.000000000  1.000000000
## Lag 1   0.526815074 0.77956070 -0.0049580437  0.066959977 -0.002216822
## Lag 5   0.006298122 0.23345808  0.0006013765 -0.010520439  0.012557118
## Lag 10 -0.017168636 0.03986557  0.0108679549 -0.004269528 -0.004944846
## Lag 50  0.016443908 0.02875153  0.0098513743 -0.001455652 -0.001672352
##                b[6]         b[7]         b[8]          b[9]      b[10]
## Lag 0   1.000000000  1.000000000  1.000000000  1.0000000000 1.00000000
## Lag 1  -0.010279612  0.031166488  0.008216099  0.0008785314 0.80628827
## Lag 5   0.014107338 -0.005238457 -0.001696852  0.0015168399 0.26801525
## Lag 10  0.005440948  0.006116026 -0.007914953 -0.0131132510 0.03495348
## Lag 50  0.010518267  0.007403159 -0.002315090 -0.0054259798 0.01702556
##                  b0          sig
## Lag 0   1.000000000  1.000000000
## Lag 1   0.226336968  0.066379488
## Lag 5  -0.005610698  0.006157642
## Lag 10 -0.015752700 -0.008022802
## Lag 50 -0.018338118  0.005989576
autocorr.plot(mod2_jags_sim)

effectiveSize(mod2_jags_sim)
##      b[1]      b[2]      b[3]      b[4]      b[5]      b[6]      b[7]      b[8] 
##  5053.967  2145.525 15451.674 13312.223 15328.276 15257.529 15040.422 14663.452 
##      b[9]     b[10]        b0       sig 
## 15000.000  1953.375  9646.624 11474.412
dic.samples(mod1_jags, n.iter=1e3)
## Mean deviance:  37364 
## penalty 3.113 
## Penalized deviance: 37367
dic.samples(mod2_jags, n.iter=1e3)
## Mean deviance:  36640 
## penalty 3.576 
## Penalized deviance: 36643
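Before drawing a conclusion, the two DIC estimates can also be differenced directly. Below is a minimal sketch, assuming rjags' diffdic() helper; the resulting values are Monte Carlo estimates and will vary slightly from run to run.

# Difference the two DIC estimates; positive values favor the second model
dic1 = dic.samples(mod1_jags, n.iter=1e3)
dic2 = dic.samples(mod2_jags, n.iter=1e3)
diffdic(dic1, dic2)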

Upon examination of the deviance information criterion (DIC), the penalized deviance of the second model (which incorporates TotRmsAbvGrd) is lower than that of the first model (roughly 36,643 versus 37,367). We conclude that the second model is a better fit (and predictor) for these data.