The purpose of this project was to fit a series of regression models to a dataset containing housing features and a corresponding sale price as the response variable. Three models were constructed using both R and JAGS. One of the JAGS models used 3 features to predict sale prices, while the final iteration used 4 features. A number of evaluation metrics (including the deviance information criterion (DIC)) were generated to gauge the accuracy of each model. It was determined that incorporating the 4th feature reduced the DIC and, hence, produced a better fit for the data.
The Ames Housing dataset, which is available on Kaggle.com, was compiled by Dean De Cock for use in data science education. It’s an incredible dataset resource for data scientists and statisticians looking for a modernized and expanded version of the often-cited Boston Housing dataset.
The subject dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, along with the sale price of each home. For the purposes of this project, we will construct two models that leverage a subset (specifically, 3-4) of the most informative explanatory variables to understand their ability to accurately predict sale prices for homes given their features.
The features to be included are:
- LotArea - Lot size in square feet
- OverallCond - Rates the overall condition of the house (10 = Excellent, 1 = Very Poor)
- GrLivArea - Above grade (ground) living area in square feet
We will create an additional iteration of the model that incorporates the feature TotRmsAbvGrd, the total number of rooms above grade.
dat = read.csv(file="data_files/housing-price-data.csv", header=TRUE)
# Subset the raw data to features of interest and the response variable, SalePrice
dat1=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","SalePrice")]
# Code the categorical variable `HouseStyle`.
dat1$HouseStyle_coded = factor(dat1$HouseStyle)
newdat1 = dat1[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","SalePrice")]
head(newdat1)
## LotArea HouseStyle_coded OverallCond GrLivArea SalePrice
## 1 8450 2Story 5 1710 208500
## 2 9600 1Story 8 1262 181500
## 3 11250 2Story 5 1786 223500
## 4 9550 2Story 5 1717 140000
## 5 14260 2Story 5 2198 250000
## 6 14115 1.5Fin 5 1362 143000
Now, let’s look at the relationships among the features and their distributions:
library(Hmisc)
pairs(newdat1)
It looks like most, if not all, of the numeric features have a somewhat linear relationship with SalePrice.
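As a quick numeric check on this (an addition to the original analysis), we can compute the pairwise Pearson correlations among the numeric columns:
# Correlations between the numeric features and SalePrice (complete cases only)
cor(newdat1[, c("LotArea", "OverallCond", "GrLivArea", "SalePrice")], use = "complete.obs")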
hist.data.frame(newdat1)
It also appears that all of our non-categorical variables (including the response variable, SalePrice) have a somewhat normal distribution.
Since each of the features appears to have a linear relationship with SalePrice, let's start by postulating a hierarchical JAGS linear model on the descriptor features. We will use non-informative prior distributions (i.e., large prior variance) for the model coefficients. The model takes the form
\(y_i \sim N(\mu_i,\sigma^2),\quad \mu_i = \beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki},\quad \beta_j \sim N(0,\, 1\times10^6)\ \text{for each}\ j\)
where \(k\) is the number of descriptor variables in the data set and \(i = 1,\dots,n\) indexes the observations.
Equivalently, \(y_i \mid x_i,\beta,\sigma^2 \stackrel{ind}{\sim} N(\beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki},\,\sigma^2)\),
where the non-informative prior for \(\sigma^2\) is modeled using an \(\mathrm{InverseGamma}(\alpha,\beta)\) distribution.
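For a quick point of reference (an added sketch using an ordinary least-squares fit, not part of the original JAGS analysis), the same mean structure can be fit with R's lm(), which dummy-codes the factor HouseStyle_coded automatically:
# Reference OLS fit of the same mean structure (illustrative only; `ref_lm` is not used below)
ref_lm = lm(SalePrice ~ LotArea + OverallCond + HouseStyle_coded, data = newdat1)
summary(ref_lm)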
library("rjags")
newdat1 = na.omit(newdat1)
mod1_jags_string = " model {
for (i in 1:n) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1]*LotArea[i] + b[2]*OverallCond[i]+ b[3]*HouseStyle_coded1.5Unf[i]
+b[4]*HouseStyle_coded1Story[i]+b[5]*HouseStyle_coded2.5Fin[i]+ b[6]*HouseStyle_coded2.5Unf[i]
+ b[7]*HouseStyle_coded2Story[i]+ b[8]*HouseStyle_codedSFoyer[i]+ b[9]*HouseStyle_codedSLvl[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:9) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
data_jags = list(y=newdat1$SalePrice,LotArea=newdat1$LotArea,OverallCond=newdat1$OverallCond,
HouseStyle_coded1.5Unf=as.numeric(newdat1$HouseStyle_coded=="1.5Unf"),HouseStyle_coded1Story=as.numeric(newdat1$HouseStyle_coded=="1Story"),
HouseStyle_coded2.5Fin=as.numeric(newdat1$HouseStyle_coded=="2.5Fin"),HouseStyle_coded2.5Unf=as.numeric(newdat1$HouseStyle_coded=="2.5Unf"),
HouseStyle_coded2Story=as.numeric(newdat1$HouseStyle_coded=="2Story"),HouseStyle_codedSFoyer=as.numeric(newdat1$HouseStyle_coded=="SFoyer"),
HouseStyle_codedSLvl=as.numeric(newdat1$HouseStyle_coded=="SLvl"),n=nrow(newdat1))
params1 = c("b0","b", "sig")
inits1 = function() {
inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(9,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod1_jags = jags.model(textConnection(mod1_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1460
## Unobserved stochastic nodes: 11
## Total graph size: 17039
##
## Initializing model
update(mod1_jags, 1000) # burn-in
mod1_jags_sim = coda.samples(model=mod1_jags,
variable.names=params1,
n.iter=5000)
mod1_jags_csim = do.call(rbind, mod1_jags_sim) # combine multiple chains
# gelman.diag(mod1_jags_sim)
# autocorr.diag(mod1_jags_sim)
# autocorr.plot(mod1_jags_sim)
# effectiveSize(mod1_jags_sim)
plot(mod1_jags_sim)
We can observe that the above traces for the \(b[i]\) coefficients converge and the posterior densities all appear approximately normal. Since the chains have converged, we can use these samples to summarize the posterior distributions of the coefficients.
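The visual assessment can be backed up by the diagnostics commented out above; a minimal sketch using the coda functions loaded alongside rjags:
gelman.diag(mod1_jags_sim)    # potential scale reduction factors; values near 1 suggest convergence
effectiveSize(mod1_jags_sim)  # effective sample sizes, adjusted for autocorrelation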
We can get a posterior summary of the parameters in our model.
summary(mod1_jags_sim)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 4.516 0.2141 0.001748 0.002923
## b[2] 18733.066 533.1838 4.353428 7.686240
## b[3] -82.840 992.1058 8.100510 8.259117
## b[4] 1836.497 966.5476 7.891828 8.118721
## b[5] 38.119 991.7340 8.097474 8.038132
## b[6] -8.461 1005.5864 8.210578 8.210332
## b[7] 3111.074 982.3646 8.020973 8.075076
## b[8] -50.427 994.0725 8.116568 8.116447
## b[9] 41.293 1005.7823 8.212178 8.141461
## b0 4384.127 982.4641 8.021786 9.389993
## sig 87098.806 1715.7394 14.008953 17.290666
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 4.094 4.371 4.519 4.662 4.934
## b[2] 17686.155 18370.307 18729.466 19093.857 19776.835
## b[3] -1997.715 -755.499 -87.709 582.436 1869.909
## b[4] -38.028 1181.517 1826.616 2500.843 3746.310
## b[5] -1910.057 -631.316 38.970 715.903 1970.005
## b[6] -1989.969 -691.333 -3.640 671.758 1963.364
## b[7] 1205.133 2449.899 3103.954 3779.193 5046.609
## b[8] -2003.124 -723.413 -37.513 630.763 1874.435
## b[9] -1892.400 -651.491 35.083 724.979 2037.988
## b0 2470.640 3717.156 4375.585 5051.985 6312.429
## sig 83831.321 85920.419 87079.213 88242.660 90580.620
We notice that the standard deviation is quite high for the coefficients associated with the coded categorical variable HouseStyle_coded. This makes sense since the \(x_i\) for this feature can only take on values of \(0\) or \(1\).
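Counting the observations in each HouseStyle level (an added check, not in the original write-up) can also help explain this: indicator columns for rarely occurring levels carry little information, which widens the posterior spread of their coefficients.
table(newdat1$HouseStyle_coded)  # number of observations per HouseStyle level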
In a Bayesian model, we have distributions for residuals, but we’ll simplify and look only at the residuals evaluated at the posterior mean of the parameters.
X = cbind(rep(1.0, data_jags$n), data_jags$LotArea,data_jags$OverallCond,data_jags$HouseStyle_coded1.5Unf,
data_jags$HouseStyle_coded1Story,data_jags$HouseStyle_coded2.5Fin,data_jags$HouseStyle_coded2.5Unf,
data_jags$HouseStyle_coded2Story,data_jags$HouseStyle_codedSFoyer,data_jags$HouseStyle_codedSLvl)
head(X)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 8450 5 0 0 0 0 1 0 0
## [2,] 1 9600 8 0 1 0 0 0 0 0
## [3,] 1 11250 5 0 0 0 0 1 0 0
## [4,] 1 9550 5 0 0 0 0 1 0 0
## [5,] 1 14260 5 0 0 0 0 1 0 0
## [6,] 1 14115 5 0 0 0 0 0 0 0
(pm_params = colMeans(mod1_jags_csim)) # posterior mean
## b[1] b[2] b[3] b[4] b[5] b[6]
## 4.516035 18733.066495 -82.840268 1836.496662 38.119044 -8.461405
## b[7] b[8] b[9] b0 sig
## 3111.073537 -50.426905 41.293430 4384.126828 87098.805598
Next, we create the vector of fitted values, \(\hat{y}\), implied by the posterior means and examine the residuals.
yhat = drop(X %*% pm_params[c("b0", paste0("b[", 1:9, "]"))]) # order the coefficients to match the columns of X (intercept first)
resid = data_jags$y - yhat
plot(resid) # residuals against data index
plot(yhat, resid) # residuals against predicted values
qqnorm(resid) # checking normality of residuals
# rownames(dat1)[order(resid1, decreasing=TRUE)[1:5]]
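As a simple numeric summary of fit (an addition, not part of the original analysis), we can also compute the root mean squared error of these point-estimate residuals:
sqrt(mean(resid^2))  # RMSE of the posterior-mean fit, in dollars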
As mentioned previously, we will build another linear model that incorporates the numeric feature TotRmsAbvGrd, which is the total number of rooms above grade (does not include bathrooms).
#Adjust the dataset to include `TotRmsAbvGrd`.
dat2=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
dat2$HouseStyle_coded = factor(dat2$HouseStyle)
newdat2 = dat2[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
# head(newdat2)
The JAGS model becomes:
mod2_jags_string = " model {
for (i in 1:n) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1]*LotArea[i] + b[2]*OverallCond[i]+ b[3]*HouseStyle_coded1.5Unf[i]
+ b[4]*HouseStyle_coded1Story[i]+ b[5]*HouseStyle_coded2.5Fin[i]+ b[6]*HouseStyle_coded2.5Unf[i]
+ b[7]*HouseStyle_coded2Story[i]+ b[8]*HouseStyle_codedSFoyer[i]+ b[9]*HouseStyle_codedSLvl[i]
+ b[10]*TotRmsAbvGrd[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:10) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
data_jags = list(y=newdat2$SalePrice, LotArea=newdat2$LotArea,OverallCond=newdat2$OverallCond,
HouseStyle_coded1.5Unf=as.numeric(newdat2$HouseStyle_coded=="1.5Unf"),HouseStyle_coded1Story=as.numeric(newdat2$HouseStyle_coded=="1Story"),
HouseStyle_coded2.5Fin=as.numeric(newdat2$HouseStyle_coded=="2.5Fin"),HouseStyle_coded2.5Unf=as.numeric(newdat2$HouseStyle_coded=="2.5Unf"),
HouseStyle_coded2Story=as.numeric(newdat2$HouseStyle_coded=="2Story"),HouseStyle_codedSFoyer=as.numeric(newdat2$HouseStyle_coded=="SFoyer"),
HouseStyle_codedSLvl=as.numeric(newdat2$HouseStyle_coded=="SLvl"),TotRmsAbvGrd=newdat2$TotRmsAbvGrd,
n=nrow(newdat2))
params1 = c("b0","b", "sig")
inits1 = function() {
inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(10,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod2_jags = jags.model(textConnection(mod2_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1460
## Unobserved stochastic nodes: 12
## Total graph size: 18584
##
## Initializing model
update(mod2_jags, 1000) # burn-in
mod2_jags_sim = coda.samples(model=mod2_jags,
variable.names=params1,
n.iter=5000)
mod2_jags_csim = do.call(rbind, mod2_jags_sim) # combine multiple chains
# gelman.diag(mod2_jags_sim)
# autocorr.diag(mod2_jags_sim)
# autocorr.plot(mod2_jags_sim)
# effectiveSize(mod2_jags_sim)
plot(mod2_jags_sim)
summary(mod2_jags_sim)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 2.187 0.1804 0.001473 0.002536
## b[2] 6320.474 644.9932 5.266347 14.616028
## b[3] -76.500 1000.7012 8.170691 8.187354
## b[4] 1828.731 961.7537 7.852686 8.330413
## b[5] -56.935 998.6541 8.153977 7.949206
## b[6] -108.204 1007.2232 8.223943 8.222919
## b[7] 1659.996 984.3864 8.037482 8.278267
## b[8] -46.278 993.5073 8.111954 8.111943
## b[9] -40.484 989.1334 8.076240 8.076471
## b[10] 17422.713 593.2215 4.843633 13.555247
## b0 1848.780 988.5787 8.071711 9.950972
## sig 68025.482 1311.2246 10.706104 12.953728
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 1.835 2.068 2.186 2.307 2.545
## b[2] 5039.984 5883.579 6328.454 6760.770 7571.565
## b[3] -2064.627 -742.690 -70.252 601.595 1856.803
## b[4] -54.297 1176.913 1826.188 2472.133 3718.412
## b[5] -1986.068 -746.544 -54.830 622.248 1882.227
## b[6] -2114.466 -780.700 -108.644 565.399 1867.003
## b[7] -249.170 991.382 1656.688 2325.639 3610.991
## b[8] -1991.705 -714.529 -52.019 623.372 1902.876
## b[9] -1967.256 -706.296 -25.291 629.126 1889.060
## b[10] 16238.684 17027.802 17425.099 17822.135 18574.408
## b0 -79.009 1187.955 1846.208 2518.234 3778.759
## sig 65535.771 67109.584 67990.308 68897.752 70641.267
The model coefficient for TotRmsAbvGrd, b[10], is quite large compared with the other coefficients, indicating that this feature is a stronger driver of SalePrice. As with the previous model, the trace plots show convergence for all of the coefficients, and the posterior densities are approximately normal.
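Note that the raw coefficients sit on very different scales (LotArea is measured in square feet, while TotRmsAbvGrd is a room count), so as an added, rough comparison (not part of the original analysis) we can scale each posterior-mean coefficient by the standard deviation of its feature:
pm2 = colMeans(mod2_jags_csim)
# Approximate change in SalePrice per one-standard-deviation change in each numeric feature
c(LotArea      = pm2[["b[1]"]]  * sd(newdat2$LotArea),
  OverallCond  = pm2[["b[2]"]]  * sd(newdat2$OverallCond),
  TotRmsAbvGrd = pm2[["b[10]"]] * sd(newdat2$TotRmsAbvGrd))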
Let’s compare the two models using the DIC metric:
dic.samples(mod1_jags, n.iter=1e3)
## Mean deviance: 37363
## penalty 3.277
## Penalized deviance: 37366
dic.samples(mod2_jags, n.iter=1e3)
## Mean deviance: 36640
## penalty 3.526
## Penalized deviance: 36644
Upon examination of the deviance information criterion (DIC), the penalized deviance of the second model (which incorporates TotRmsAbvGrd) is lower than that of the first model. We conclude that the second model, which incorporates the additional feature, is a better fit (and predictor) for the data.
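As a convenience (an added sketch, not part of the original analysis), the two DIC objects can also be stored and differenced directly with rjags' diffdic(), which reports the difference in penalized deviance along with a sampling standard error:
dic1 = dic.samples(mod1_jags, n.iter=1e3)  # re-runs the DIC sampling for model 1
dic2 = dic.samples(mod2_jags, n.iter=1e3)  # re-runs the DIC sampling for model 2
diffdic(dic1, dic2)  # a positive difference indicates the second model has the lower penalized deviance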