The purpose of this project was to fit a series of regression models to a dataset containing housing features and a corresponding sale price as the response variable. Three models were constructed using R and JAGS. One of the JAGS models used 3 features to predict sale prices, while the final iteration used 4 features. A number of evaluation metrics (including the deviance information criterion, or DIC) were generated to gauge how well each model fits the data. It was determined that the model incorporating the 4th feature had a lower DIC and was therefore a better fit for the data.
The Ames Housing dataset, which is available on Kaggle.com, was compiled by Dean De Cock for use in data science education. It is an excellent resource for data scientists and statisticians looking for a modernized and expanded version of the often-cited Boston Housing dataset.
The subject dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, along with the sale price of each home. For the purposes of this project, we will construct two models that leverage a subset (specifically, 3-4) of the most informative explanatory variables to predict sale prices for homes given their features.
The features to be included are:
LotArea - Lot size in square feet
OverallCond - Rates the overall condition of the house (10 = Excellent, 1 = Very Poor)
GrLivArea - Above grade (ground) living area in square feet
We will create an additional iteration of the model that incorporates the feature TotRmsAbvGrd, the total number of rooms above grade.
dat = read.csv(file="data_files/housing-price-data.csv", header=TRUE)
# Subset the raw data to features of interest and the response variable, SalePrice
dat1=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","SalePrice")]
head(dat1)
## LotArea HouseStyle OverallCond GrLivArea SalePrice
## 1 8450 2Story 5 1710 208500
## 2 9600 1Story 8 1262 181500
## 3 11250 2Story 5 1786 223500
## 4 9550 2Story 5 1717 140000
## 5 14260 2Story 5 2198 250000
## 6 14115 1.5Fin 5 1362 143000
str(dat1)
## 'data.frame': 1460 obs. of 5 variables:
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallCond: int 5 8 5 5 5 5 5 6 5 6 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
We will need to encode the categorical variable HouseStyle as a factor.
dat1$HouseStyle_coded = factor(dat1$HouseStyle)
newdat1 = dat1[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","SalePrice")]
head(newdat1)
## LotArea HouseStyle_coded OverallCond GrLivArea SalePrice
## 1 8450 2Story 5 1710 208500
## 2 9600 1Story 8 1262 181500
## 3 11250 2Story 5 1786 223500
## 4 9550 2Story 5 1717 140000
## 5 14260 2Story 5 2198 250000
## 6 14115 1.5Fin 5 1362 143000
Let’s examine the distribution of the response variable,
SalePrice
hist(newdat1$SalePrice,breaks=50,main="Histogram of SalePrice",xlab = "SalePrice")
abline(v = mean(newdat1$SalePrice), col='blue', lwd = 3)
#add label to mean vertical line
text(x=4e5, y=2e2, 'Mean = $181K, SD= $79K')
It appears that the distribution of SalePrice is somewhat normal, though with a noticeable right skew; its mean and standard deviation are calculated as follows:
mean(newdat1$SalePrice)
## [1] 180921.2
sd(newdat1$SalePrice)
## [1] 79442.5
Now, let’s look at the relationships among the features and their distributions:
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
pairs(newdat1)
It looks like most, if not all, of the numeric features have a
somewhat linear relationship with SalePrice.
hist.data.frame(newdat1)
It also appears that all of our non-categorical variables (including
the response variable, SalePrice) have a somewhat normal
distribution.
Let’s start off by postulating a linear model of the descriptor features.
mod_linear = lm(SalePrice~LotArea+HouseStyle_coded+OverallCond+GrLivArea, data=newdat1)
summary(mod_linear)
##
## Call:
## lm(formula = SalePrice ~ LotArea + HouseStyle_coded + OverallCond +
## GrLivArea, data = newdat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -549116 -22005 -1040 21462 300921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.391e+04 1.004e+04 -7.364 2.97e-13 ***
## LotArea 3.781e-01 1.402e-01 2.697 0.007069 **
## HouseStyle_coded1.5Unf 5.377e+04 1.429e+04 3.762 0.000175 ***
## HouseStyle_coded1Story 6.702e+04 4.621e+03 14.502 < 2e-16 ***
## HouseStyle_coded2.5Fin -8.912e+04 1.888e+04 -4.721 2.57e-06 ***
## HouseStyle_coded2.5Unf -2.873e+04 1.587e+04 -1.811 0.070345 .
## HouseStyle_coded2Story 2.717e+04 4.885e+03 5.562 3.18e-08 ***
## HouseStyle_codedSFoyer 6.941e+04 9.463e+03 7.335 3.68e-13 ***
## HouseStyle_codedSLvl 4.864e+04 7.523e+03 6.465 1.38e-10 ***
## OverallCond 2.121e+03 1.221e+03 1.738 0.082451 .
## GrLivArea 1.278e+02 3.227e+00 39.612 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50680 on 1449 degrees of freedom
## Multiple R-squared: 0.5958, Adjusted R-squared: 0.593
## F-statistic: 213.6 on 10 and 1449 DF, p-value: < 2.2e-16
We now create a hierarchical model in JAGS with the
following configuration:
\(y_i \sim N(\mu_i,\sigma^2), \quad \mu_i = \beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki}, \quad \beta_j \sim N(0, 10^6) \text{ for } j = 0,\dots,k\)
We note that \(k\) is the number of descriptor variables in the data set and \(i = 1,\dots,n\) indexes the observations.
Also,
\(y_i \,|\, x_i,\beta,\sigma^2 \stackrel{ind}{\sim} N(\beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki},\;\sigma^2)\),
where the prior for \(\sigma^2\) is specified as an \(InverseGamma(\alpha,\beta)\) distribution, implemented in JAGS as a Gamma prior on the precision \(1/\sigma^2\).
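Concretely, the Gamma prior placed on the precision in the JAGS code below implies
\(1/\sigma^2 \sim Gamma\!\left(\tfrac{5}{2}, \tfrac{5 \cdot 10}{2}\right) \iff \sigma^2 \sim InverseGamma\!\left(\tfrac{5}{2}, 25\right)\),
using the standard correspondence between a Gamma prior on the precision and an Inverse-Gamma prior on the variance.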
library("rjags")
## Loading required package: coda
## Linked to JAGS 4.3.2
## Loaded modules: basemod,bugs
newdat1 = na.omit(newdat1)
mod1_jags_string = " model {
for (i in 1:n) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b0
+ b[1]*LotArea[i]
+ b[2]*OverallCond[i]
+ b[3]*HouseStyle_coded1.5Unf[i]
+ b[4]*HouseStyle_coded1Story[i]
+ b[5]*HouseStyle_coded2.5Fin[i]
+ b[6]*HouseStyle_coded2.5Unf[i]
+ b[7]*HouseStyle_coded2Story[i]
+ b[8]*HouseStyle_codedSFoyer[i]
+ b[9]*HouseStyle_codedSLvl[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:9) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
set.seed(72)
data_jags = list(y=newdat1$SalePrice,
LotArea=newdat1$LotArea,
OverallCond=newdat1$OverallCond,
HouseStyle_coded1.5Unf=as.numeric(newdat1$HouseStyle_coded=="1.5Unf"),
HouseStyle_coded1Story=as.numeric(newdat1$HouseStyle_coded=="1Story"),
HouseStyle_coded2.5Fin=as.numeric(newdat1$HouseStyle_coded=="2.5Fin"),
HouseStyle_coded2.5Unf=as.numeric(newdat1$HouseStyle_coded=="2.5Unf"),
HouseStyle_coded2Story=as.numeric(newdat1$HouseStyle_coded=="2Story"),
HouseStyle_codedSFoyer=as.numeric(newdat1$HouseStyle_coded=="SFoyer"),
HouseStyle_codedSLvl=as.numeric(newdat1$HouseStyle_coded=="SLvl"),
n=nrow(newdat1))
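As a side note, the indicator columns built above with as.numeric(... == ...) could equivalently be generated in a single call with model.matrix(); a minimal sketch (the dummies object is purely illustrative and is not used elsewhere):
# Build the same HouseStyle indicator columns in one step; dropping the first column
# removes the intercept, leaving the reference level ("1.5Fin") implicit, as above.
dummies = model.matrix(~ HouseStyle_coded, data = newdat1)[, -1]
head(dummies)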
params1 = c("b0","b", "sig")
inits1 = function() {
inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(9,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod1_jags = jags.model(textConnection(mod1_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1460
## Unobserved stochastic nodes: 11
## Total graph size: 17039
##
## Initializing model
update(mod1_jags, 1000) # burn-in
mod1_jags_sim = coda.samples(model=mod1_jags,
variable.names=params1,
n.iter=5000)
mod1_jags_csim = do.call(rbind, mod1_jags_sim) # combine multiple chains
plot(mod1_jags_sim)
gelman.diag(mod1_jags_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b[3] 1 1
## b[4] 1 1
## b[5] 1 1
## b[6] 1 1
## b[7] 1 1
## b[8] 1 1
## b[9] 1 1
## b0 1 1
## sig 1 1
##
## Multivariate psrf
##
## 1
autocorr.diag(mod1_jags_sim)
## b[1] b[2] b[3] b[4] b[5]
## Lag 0 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000
## Lag 1 0.467351114 0.517387579 -0.006042575 0.032073550 0.003355981
## Lag 5 0.035142282 0.027545286 -0.006636925 -0.010659212 -0.001548847
## Lag 10 0.007007868 0.012057753 -0.008422920 -0.001790485 -0.003374652
## Lag 50 0.006696992 0.004084732 0.006054505 0.003195365 0.005494540
## b[6] b[7] b[8] b[9] b0
## Lag 0 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000
## Lag 1 0.011995885 0.036234173 -0.010341378 0.008083053 0.144759361
## Lag 5 0.002093311 -0.004502159 0.012347229 0.000930736 -0.003354070
## Lag 10 -0.013581633 -0.002436371 -0.006185099 -0.007064863 0.005869866
## Lag 50 0.003133407 -0.003571329 0.012985793 0.004794318 0.002187294
## sig
## Lag 0 1.0000000000
## Lag 1 0.1451967806
## Lag 5 -0.0004499572
## Lag 10 0.0017325720
## Lag 50 0.0084397105
autocorr.plot(mod1_jags_sim)
effectiveSize(mod1_jags_sim)
## b[1] b[2] b[3] b[4] b[5] b[6] b[7] b[8]
## 5444.727 4828.779 14677.558 14067.230 14491.556 14949.060 13522.636 14706.595
## b[9] b0 sig
## 15000.000 12147.659 9766.578
dic.samples(mod1_jags, n.iter=1e3)
## Mean deviance: 37362
## penalty 3.133
## Penalized deviance: 37365
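The penalized deviance reported here is simply the mean deviance plus the effective-parameter penalty, i.e.
\(DIC = \bar{D} + p_D \approx 37362 + 3.1 \approx 37365\).
Since these are Monte Carlo estimates, the exact figures will vary slightly between runs.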
We can get a posterior summary of the parameters in our model.
summary(mod1_jags_sim)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 4.523 0.214 0.001748 0.002901
## b[2] 18715.426 538.964 4.400624 7.766745
## b[3] -75.640 999.420 8.160231 8.250287
## b[4] 1842.948 979.917 8.000988 8.261489
## b[5] 49.500 992.786 8.106064 8.259452
## b[6] 3.195 998.208 8.150335 8.164812
## b[7] 3103.537 976.898 7.976336 8.400458
## b[8] -75.392 1002.471 8.185144 8.271105
## b[9] 45.090 996.310 8.134838 8.134915
## b0 4386.038 987.317 8.061413 8.985172
## sig 87122.053 1749.489 14.284517 17.808990
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 4.1 4.38 4.521 4.667 4.946
## b[2] 17660.8 18352.10 18713.917 19082.405 19762.434
## b[3] -2045.6 -743.75 -69.981 600.002 1867.847
## b[4] -101.3 1187.47 1845.519 2503.085 3755.609
## b[5] -1876.0 -624.02 46.494 722.721 2002.405
## b[6] -1980.5 -672.99 3.266 672.909 1951.205
## b[7] 1176.1 2451.19 3097.659 3756.873 5021.667
## b[8] -2031.6 -756.56 -66.424 601.091 1888.087
## b[9] -1890.9 -631.12 43.385 719.728 2004.057
## b0 2439.0 3716.83 4387.870 5059.510 6326.242
## sig 83744.6 85930.55 87110.946 88298.273 90621.011
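One advantage of retaining the full set of posterior draws is that we can compute quantities not shown in the summary table. For example, a minimal sketch of the posterior probability that the OverallCond coefficient, b[2], is positive:
# Proportion of posterior draws in which the OverallCond coefficient exceeds zero
mean(mod1_jags_csim[, "b[2]"] > 0)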
Checking residuals (the difference between the response and the model’s prediction for that value) is important with linear models since residuals can reveal violations of the assumptions we made to specify the model. In particular, we are looking for any sign that the model is not linear, normally distributed, or that the observations are not independent (conditional on covariates). We first evaluate the simple linear model proposed earlier:
plot(resid(mod_linear)) # to check independence (looks okay)
plot(predict(mod_linear), resid(mod_linear)) # to check for linearity, constant variance (looks reasonable)
qqnorm(resid(mod_linear)) # to check Normality assumption (we want this to be a straight line)
Now let’s return to our JAGS model. In a Bayesian model, we have distributions for residuals, but we’ll simplify and look only at the residuals evaluated at the posterior mean of the parameters (a sketch of the full-posterior alternative follows the code below).
X = cbind(
rep(1.0, data_jags$n),
data_jags$LotArea,
data_jags$OverallCond,
data_jags$HouseStyle_coded1.5Unf,
data_jags$HouseStyle_coded1Story,
data_jags$HouseStyle_coded2.5Fin,
data_jags$HouseStyle_coded2.5Unf,
data_jags$HouseStyle_coded2Story,
data_jags$HouseStyle_codedSFoyer,
data_jags$HouseStyle_codedSLvl
)
head(X)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 8450 5 0 0 0 0 1 0 0
## [2,] 1 9600 8 0 1 0 0 0 0 0
## [3,] 1 11250 5 0 0 0 0 1 0 0
## [4,] 1 9550 5 0 0 0 0 1 0 0
## [5,] 1 14260 5 0 0 0 0 1 0 0
## [6,] 1 14115 5 0 0 0 0 0 0 0
(pm_params = colMeans(mod1_jags_csim)) # posterior mean
## b[1] b[2] b[3] b[4] b[5] b[6]
## 4.522740 18715.426133 -75.639783 1842.947571 49.500478 3.195325
## b[7] b[8] b[9] b0 sig
## 3103.537308 -75.392071 45.090052 4386.037726 87122.053475
yhat = drop(X %*% pm_params[c("b0", paste0("b[", 1:9, "]"))]) # order coefficients to match the columns of X
resid1 = data_jags$y - yhat # named resid1 to avoid masking the resid() function used below
plot(resid1) # residuals against data index
plot(yhat, resid1) # residuals against predicted values
qqnorm(resid1) # checking normality of residuals
plot(predict(mod_linear), resid(mod_linear)) # to compare with reference linear model
# rownames(dat1)[order(resid1, decreasing=TRUE)[1:5]]
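Since a Bayesian model yields an entire posterior distribution of residuals, here is a minimal sketch of how that distribution could be examined for a single observation using the MCMC draws themselves (the subsample of 1,000 draws and the choice of observation 1 are arbitrary):
# Use a subsample of the combined draws (to limit memory use) and compute, for each draw,
# the fitted values and residuals; then look at the residual distribution for observation 1.
set.seed(72)
draws = mod1_jags_csim[sample(nrow(mod1_jags_csim), 1000), ]
beta_draws = draws[, c("b0", paste0("b[", 1:9, "]"))] # same column order as X
yhat_draws = beta_draws %*% t(X) # draws x n fitted values
resid_draws = sweep(-yhat_draws, 2, data_jags$y, "+") # y[j] minus the fitted value, per draw and observation j
hist(resid_draws[, 1], breaks = 50, main = "Posterior distribution of the residual for observation 1", xlab = "Residual")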
As mentioned previously, we will build another linear model that incorporates the feature TotRmsAbvGrd, the total number of rooms above grade (not including bathrooms).
We first adjust the dataset to include TotRmsAbvGrd.
dat2=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
dat2$HouseStyle_coded = factor(dat2$HouseStyle)
newdat2 = dat2[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
head(newdat2)
## LotArea HouseStyle_coded OverallCond GrLivArea TotRmsAbvGrd SalePrice
## 1 8450 2Story 5 1710 8 208500
## 2 9600 1Story 8 1262 6 181500
## 3 11250 2Story 5 1786 6 223500
## 4 9550 2Story 5 1717 7 140000
## 5 14260 2Story 5 2198 9 250000
## 6 14115 1.5Fin 5 1362 5 143000
mod2_jags_string = " model {
for (i in 1:n) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b0
+ b[1]*LotArea[i]
+ b[2]*OverallCond[i]
+ b[3]*HouseStyle_coded1.5Unf[i]
+ b[4]*HouseStyle_coded1Story[i]
+ b[5]*HouseStyle_coded2.5Fin[i]
+ b[6]*HouseStyle_coded2.5Unf[i]
+ b[7]*HouseStyle_coded2Story[i]
+ b[8]*HouseStyle_codedSFoyer[i]
+ b[9]*HouseStyle_codedSLvl[i]
+ b[10]*TotRmsAbvGrd[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:10) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
set.seed(72)
data_jags = list(y=newdat2$SalePrice,
LotArea=newdat2$LotArea,
OverallCond=newdat2$OverallCond,
HouseStyle_coded1.5Unf=as.numeric(newdat2$HouseStyle_coded=="1.5Unf"),
HouseStyle_coded1Story=as.numeric(newdat2$HouseStyle_coded=="1Story"),
HouseStyle_coded2.5Fin=as.numeric(newdat2$HouseStyle_coded=="2.5Fin"),
HouseStyle_coded2.5Unf=as.numeric(newdat2$HouseStyle_coded=="2.5Unf"),
HouseStyle_coded2Story=as.numeric(newdat2$HouseStyle_coded=="2Story"),
HouseStyle_codedSFoyer=as.numeric(newdat2$HouseStyle_coded=="SFoyer"),
HouseStyle_codedSLvl=as.numeric(newdat2$HouseStyle_coded=="SLvl"),
TotRmsAbvGrd=newdat2$TotRmsAbvGrd,
n=nrow(newdat2))
params1 = c("b0","b", "sig")
inits1 = function() {
inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(10,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod2_jags = jags.model(textConnection(mod2_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1460
## Unobserved stochastic nodes: 12
## Total graph size: 18584
##
## Initializing model
update(mod2_jags, 1000) # burn-in
mod2_jags_sim = coda.samples(model=mod2_jags,
variable.names=params1,
n.iter=5000)
mod2_jags_csim = do.call(rbind, mod2_jags_sim) # combine multiple chains
plot(mod2_jags_sim)
gelman.diag(mod2_jags_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b[3] 1 1
## b[4] 1 1
## b[5] 1 1
## b[6] 1 1
## b[7] 1 1
## b[8] 1 1
## b[9] 1 1
## b[10] 1 1
## b0 1 1
## sig 1 1
##
## Multivariate psrf
##
## 1
autocorr.diag(mod2_jags_sim)
## b[1] b[2] b[3] b[4] b[5]
## Lag 0 1.000000000 1.00000000 1.0000000000 1.000000000 1.000000000
## Lag 1 0.526815074 0.77956070 -0.0049580437 0.066959977 -0.002216822
## Lag 5 0.006298122 0.23345808 0.0006013765 -0.010520439 0.012557118
## Lag 10 -0.017168636 0.03986557 0.0108679549 -0.004269528 -0.004944846
## Lag 50 0.016443908 0.02875153 0.0098513743 -0.001455652 -0.001672352
## b[6] b[7] b[8] b[9] b[10]
## Lag 0 1.000000000 1.000000000 1.000000000 1.0000000000 1.00000000
## Lag 1 -0.010279612 0.031166488 0.008216099 0.0008785314 0.80628827
## Lag 5 0.014107338 -0.005238457 -0.001696852 0.0015168399 0.26801525
## Lag 10 0.005440948 0.006116026 -0.007914953 -0.0131132510 0.03495348
## Lag 50 0.010518267 0.007403159 -0.002315090 -0.0054259798 0.01702556
## b0 sig
## Lag 0 1.000000000 1.000000000
## Lag 1 0.226336968 0.066379488
## Lag 5 -0.005610698 0.006157642
## Lag 10 -0.015752700 -0.008022802
## Lag 50 -0.018338118 0.005989576
autocorr.plot(mod2_jags_sim)
effectiveSize(mod2_jags_sim)
## b[1] b[2] b[3] b[4] b[5] b[6] b[7] b[8]
## 5053.967 2145.525 15451.674 13312.223 15328.276 15257.529 15040.422 14663.452
## b[9] b[10] b0 sig
## 15000.000 1953.375 9646.624 11474.412
dic.samples(mod1_jags, n.iter=1e3)
## Mean deviance: 37364
## penalty 3.113
## Penalized deviance: 37367
dic.samples(mod2_jags, n.iter=1e3)
## Mean deviance: 36640
## penalty 3.576
## Penalized deviance: 36643
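As a side check, rjags also provides diffdic() for comparing two dic.samples() results directly; a minimal sketch (dic1 and dic2 are illustrative names, and the estimates will vary slightly between runs):
# Store the DIC samples and compare the models directly; a positive difference
# (model 1 minus model 2) indicates the second model has the lower penalized deviance.
dic1 = dic.samples(mod1_jags, n.iter = 1e3)
dic2 = dic.samples(mod2_jags, n.iter = 1e3)
diffdic(dic1, dic2)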
Upon examination of the deviance information criterion (DIC), the penalized deviance of the second model (which incorporates TotRmsAbvGrd) is lower than that of the first model. We conclude that the second model is a better fit for (and predictor of) the data.
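Finally, as an illustration of using the preferred model for prediction, here is a minimal sketch that scores a hypothetical home at the posterior means of the second model's coefficients (the feature values below are invented purely for illustration):
# Hypothetical home: 10,000 sq ft lot, overall condition 5, 1Story style, 6 rooms above grade.
pm2 = colMeans(mod2_jags_csim)
beta2 = pm2[c("b0", paste0("b[", 1:10, "]"))] # intercept first, then b[1]..b[10]
x_new = c(1, # intercept
          10000, # LotArea
          5, # OverallCond
          0, 1, 0, 0, 0, 0, 0, # HouseStyle dummies: 1.5Unf, 1Story, 2.5Fin, 2.5Unf, 2Story, SFoyer, SLvl
          6) # TotRmsAbvGrd
sum(x_new * beta2) # point prediction of SalePrice at the posterior means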