library(tidyverse) # data manipulation (dplyr, purrr) and plotting
library(lubridate) # date handling
library(GGally)    # scatterplot matrices (ggpairs)
library(rjags)     # interface to the JAGS MCMC sampler
library(hpiR)      # provides the seattle_sales dataset
This is the Capstone Project (peer-graded assignment) for the Bayesian Statistics: Techniques and Models course on the Coursera platform.
For this assignment, a dataset of house prices from central Seattle, Washington was chosen. The data were gathered from the King County Assessor's FTP site (https://info.kingcounty.gov/assessor/DataDownload/default.aspx).
A number of initial data munging tasks were necessary to bring the data into its final format: the dataset was sliced and features were selected so that what remained were prices and a handful of numerical features for central Seattle home sales from 2015 and 2016, covering only detached single-family residences and townhomes.
For reproducibility, all the code, data and seeds are included.
data("seattle_sales")
ss <- seattle_sales; rm(seattle_sales)
ss$sale_date <- as.Date(ss$sale_date)
ss <- ss %>% filter(sale_date > '2014-12-31')
ss <- ss %>% filter(area %in% 13:16)
ssn <- ss %>% select(sale_price, lot_sf,bldg_grade,
tot_sf,beds,baths,age)
nm <- names(ssn)
ssn <- map_dfc(.x = (1:ncol(ssn)), .f= ~as.numeric(ssn[,.x] ))
names(ssn) <- nm
The data to be modeled has 2565 observations: one dependent variable (sale_price) and six features (independent variables).
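The resulting structure of the data frame can be checked directly, for instance with glimpse:
glimpse(ssn)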
The remaining variables are all numeric (some integer) and can be summarized as follows:
summary(ssn)
## sale_price lot_sf bldg_grade tot_sf
## Min. : 169317 Min. : 525 Min. : 5.00 Min. : 440
## 1st Qu.: 559900 1st Qu.: 1539 1st Qu.: 7.00 1st Qu.:1370
## Median : 700000 Median : 3840 Median : 8.00 Median :1710
## Mean : 867443 Mean : 4083 Mean : 8.08 Mean :1991
## 3rd Qu.: 954700 3rd Qu.: 5225 3rd Qu.: 9.00 3rd Qu.:2420
## Max. :6150000 Max. :57166 Max. :13.00 Max. :9820
## beds baths age
## Min. :0.000 Min. :0.500 Min. : 0.00
## 1st Qu.:3.000 1st Qu.:1.750 1st Qu.: 6.00
## Median :3.000 Median :2.250 Median : 60.00
## Mean :3.094 Mean :2.175 Mean : 52.07
## 3rd Qu.:4.000 3rd Qu.:2.500 3rd Qu.: 94.00
## Max. :8.000 Max. :7.750 Max. :116.00
The relationships among the variables are shown in the following scatterplot matrix with correlation coefficients:
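With GGally loaded, such a plot matrix can be generated with ggpairs; a minimal sketch:
ggpairs(ssn)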
We are going to model the price as a function of the remaining variables under a linear regression paradigm.
First, a reference (non-Bayesian) model is fitted with the lm function, to serve as a baseline for comparison:
md0 <- lm(data = ssn,
sale_price ~ lot_sf+bldg_grade+tot_sf+
beds+baths+age)
summary(md0)
##
## Call:
## lm(formula = sale_price ~ lot_sf + bldg_grade + tot_sf + beds +
## baths + age, data = ssn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1410123 -120759 -17788 101903 3256118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.167e+06 5.767e+04 -20.231 < 2e-16 ***
## lot_sf 3.059e+01 1.758e+00 17.403 < 2e-16 ***
## bldg_grade 1.643e+05 7.515e+03 21.863 < 2e-16 ***
## tot_sf 2.889e+02 1.191e+01 24.251 < 2e-16 ***
## beds -5.950e+04 7.950e+03 -7.485 9.80e-14 ***
## baths 4.621e+04 1.100e+04 4.202 2.73e-05 ***
## age 1.732e+03 1.740e+02 9.959 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276700 on 2558 degrees of freedom
## Multiple R-squared: 0.7449, Adjusted R-squared: 0.7443
## F-statistic: 1245 on 6 and 2558 DF, p-value: < 2.2e-16
Model 1 includes all features. The likelihood for the dependent variable (sale_price) is normal, and the linear coefficients and intercept receive normal priors. Note that JAGS parameterizes the normal distribution with a precision (the reciprocal of the variance), so dnorm(0.0, 1.0/1.0e6) is a normal prior with standard deviation 1000.
md_str <- 'model{
    for(i in 1:length(sale_price)){
        sale_price[i] ~ dnorm(mu[i], prec)   # normal likelihood
        mu[i] = b[1]*lot_sf[i] + b[2]*bldg_grade[i] +
                b[3]*tot_sf[i] + b[4]*beds[i] +
                b[5]*baths[i] + b[6]*age[i] + b0
    }
    for (j in 1:6) {
        b[j] ~ dnorm(0.0, 1.0/1.0e6)         # vague normal priors (sd = 1000)
    }
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    prec ~ dgamma(5/2.0, 5*10.0/2.0)         # implies an inverse-gamma prior on the variance
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
}'
dados <- as.list(ssn)
params <- c('b0','b','sig')
set.seed(1001)
md1 = jags.model(textConnection(md_str), data=dados,
n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 2565
## Unobserved stochastic nodes: 8
## Total graph size: 22356
##
## Initializing model
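One caveat on reproducibility: set.seed seeds R's random number generator, but JAGS uses its own internal RNGs, so the MCMC draws are only exactly reproducible if the chains are seeded explicitly through inits. A sketch using JAGS's reserved .RNG.name and .RNG.seed init entries:
md1 = jags.model(textConnection(md_str), data=dados, n.chains=3,
                 inits=function() list(.RNG.name="base::Mersenne-Twister",
                                       .RNG.seed=sample.int(1e6, 1)))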
update(md1, 1000) # burn-in
md1_sim = coda.samples(model=md1,
variable.names=params,
n.iter=2e3)
md1_csim <- as.mcmc(do.call(rbind, md1_sim)) # stack the three chains
plot(md1_sim)
gelman.diag(md1_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1.01
## b[2] 1 1.00
## b[3] 1 1.01
## b[4] 1 1.00
## b[5] 1 1.00
## b[6] 1 1.00
## b0 1 1.00
## sig 1 1.01
##
## Multivariate psrf
##
## 1.01
autocorr.diag(md1_sim)
## b[1] b[2] b[3] b[4] b[5]
## Lag 0 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000
## Lag 1 0.679512709 0.560195141 0.769892794 0.195691183 0.126941237
## Lag 5 0.107558823 0.001141217 0.165194637 0.009008162 -0.012554215
## Lag 10 -0.004516752 -0.029325832 0.003539843 0.015904393 -0.016367028
## Lag 50 0.033734415 -0.001322973 -0.001692438 0.027783077 0.003684408
## b[6] b0 sig
## Lag 0 1.000000000 1.000000000 1.000000000
## Lag 1 0.581458538 0.025613426 -0.028883830
## Lag 5 0.022130733 0.016810743 0.019020201
## Lag 10 0.004658540 0.006897662 -0.004167729
## Lag 50 -0.003978657 -0.006394589 -0.008912855
effectiveSize(md1_sim)
## b[1] b[2] b[3] b[4] b[5] b[6] b0 sig
## 1274.839 1931.334 1009.851 4037.058 4649.683 1828.120 5814.186 6289.260
The potential scale reduction factors are essentially 1, the autocorrelation dies out within a few lags, and the effective sample sizes are in the thousands, so the model shows good convergence.
(coeff <- colMeans(md1_csim))
## b[1] b[2] b[3] b[4] b[5] b[6]
## 28.77877 930.08898 401.89797 -1224.07611 79.97986 -880.29668
## b0 sig
## -182.29247 306311.78292
yhat <- coeff[1]*ssn$lot_sf + coeff[2]*ssn$bldg_grade +
coeff[3]*ssn$tot_sf + coeff[4]*ssn$beds +
coeff[5]*ssn$baths + coeff[6]*ssn$age + coeff[7]
resid <- ssn$sale_price - yhat
plot(resid)
plot(yhat, resid)
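The fitted values above can also be computed more compactly with matrix algebra; an equivalent sketch:
X <- as.matrix(ssn[, c('lot_sf','bldg_grade','tot_sf','beds','baths','age')])
yhat <- drop(X %*% coeff[1:6]) + coeff['b0'] # same values as the explicit sum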
summary(md1_sim)
##
## Iterations = 1001:3000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 2000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 28.78 1.923 0.02482 0.05392
## b[2] 930.09 891.573 11.51016 20.65671
## b[3] 401.90 5.714 0.07377 0.17977
## b[4] -1224.08 984.057 12.70412 15.50948
## b[5] 79.98 1001.725 12.93221 14.70540
## b[6] -880.30 136.626 1.76384 3.20743
## b0 -182.29 992.559 12.81388 13.03966
## sig 306311.78 4326.910 55.86016 54.61367
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 25.03 27.49 28.74 30.1 32.57
## b[2] -830.55 331.46 942.67 1528.2 2665.45
## b[3] 390.78 398.03 401.76 405.7 413.41
## b[4] -3116.05 -1883.92 -1229.75 -559.2 726.98
## b[5] -1901.65 -593.26 84.40 754.2 2066.27
## b[6] -1150.20 -971.68 -879.87 -787.0 -617.73
## b0 -2165.42 -846.02 -159.39 476.0 1706.75
## sig 297821.72 303347.09 306304.10 309179.5 314880.21
To identify which coefficients matter, the model is refitted with a Laplace (double-exponential) shrinkage prior on the coefficients, in the spirit of the Bayesian lasso:
md_str <- 'model{
    for(i in 1:length(sale_price)){
        sale_price[i] ~ dnorm(mu[i], prec)
        mu[i] = b[1]*lot_sf[i] + b[2]*bldg_grade[i] +
                b[3]*tot_sf[i] + b[4]*beds[i] +
                b[5]*baths[i] + b[6]*age[i] + b0
    }
    for (j in 1:6) {
        b[j] ~ ddexp(0.0, sqrt(2.0))   # Laplace prior with variance 1.0
    }
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
}'
set.seed(1002)
dados <- as.list(ssn)
params <- c('b0','b','sig')
mdcf = jags.model(textConnection(md_str), data=dados,
n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 2565
## Unobserved stochastic nodes: 8
## Total graph size: 22357
##
## Initializing model
update(mdcf, 1000) # burn-in
mdcf_sim = coda.samples(model=mdcf,
variable.names=params,
n.iter=2e3)
summary(mdcf_sim)
##
## Iterations = 2001:4000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 2000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 2.919e+01 1.7256 0.02228 0.05745
## b[2] 3.547e-02 0.9603 0.01240 0.01950
## b[3] 3.719e+02 4.4612 0.05759 0.15342
## b[4] -2.380e-02 1.0156 0.01311 0.02032
## b[5] -8.975e-03 1.0129 0.01308 0.02021
## b[6] -3.520e-02 0.9803 0.01266 0.01901
## b0 2.031e+02 1017.3173 13.13351 13.50210
## sig 3.105e+05 4323.2702 55.81318 58.27009
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 25.778 2.804e+01 2.919e+01 3.031e+01 3.269e+01
## b[2] -1.941 -4.602e-01 6.221e-03 5.059e-01 2.108e+00
## b[3] 363.038 3.690e+02 3.719e+02 3.748e+02 3.809e+02
## b[4] -2.228 -4.971e-01 -1.539e-02 4.754e-01 2.052e+00
## b[5] -2.161 -5.059e-01 -3.513e-03 4.884e-01 2.078e+00
## b[6] -2.106 -5.312e-01 -2.811e-02 4.427e-01 2.047e+00
## b0 -1805.844 -4.831e+02 2.003e+02 8.912e+02 2.180e+03
## sig 302172.157 3.075e+05 3.105e+05 3.134e+05 3.189e+05
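Before interpreting these coefficients, the same diagnostics used for Model 1 can be run on this model as well (output omitted); a sketch:
plot(mdcf_sim)
gelman.diag(mdcf_sim)
autocorr.diag(mdcf_sim)
effectiveSize(mdcf_sim)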
Under this shrinkage prior, b[1] (lot_sf) and b[3] (tot_sf) are the only coefficients whose posteriors stay clearly away from zero, so we will try a simpler model with only the lot_sf and tot_sf features.
md_str <- 'model{
for(i in 1:length(sale_price)){
sale_price[i] ~ dnorm(mu[i],prec)
mu[i] = b[1]*lot_sf[i] + b[2]*tot_sf[i] + b0
}
for (j in 1:2) {
b[j] ~ dnorm(0.0, 1.0/1.0e6)
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
}'
dados <- as.list(ssn)
params <- c('b0','b','sig')
set.seed(1003)
md2 = jags.model(textConnection(md_str), data=dados,
n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 2565
## Unobserved stochastic nodes: 4
## Total graph size: 11840
##
## Initializing model
update(md2, 1000) # burn-in
md2_sim = coda.samples(model=md2,
variable.names=params,
n.iter=2e3)
md2_csim <- as.mcmc(do.call(rbind, md2_sim)) # stack the three chains
plot(md2_sim)
gelman.diag(md2_sim)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## b[1] 1 1
## b[2] 1 1
## b0 1 1
## sig 1 1
##
## Multivariate psrf
##
## 1
autocorr.diag(md2_sim)
## b[1] b[2] b0 sig
## Lag 0 1.0000000000 1.00000000 1.000000000 1.000000000
## Lag 1 0.6012127678 0.61101459 0.041569945 0.035498502
## Lag 5 0.0772557586 0.08361840 0.013273393 -0.001790311
## Lag 10 -0.0001181435 -0.01701020 -0.010304378 0.003909176
## Lag 50 0.0207724430 0.03084526 0.002160294 -0.009620787
effectiveSize(md2_sim)
## b[1] b[2] b0 sig
## 1494.342 1448.522 7295.260 5671.562
(coeff <- colMeans(md2_csim))
## b[1] b[2] b0 sig
## 24.31447 392.80876 -383.21253 308949.54097
yhat <- coeff[1]*ssn$lot_sf + coeff[2]*ssn$tot_sf + coeff[3]
resid <- ssn$sale_price - yhat
plot(resid)
plot(yhat, resid)
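Residuals can also be plotted directly against the observed prices to see where the model predicts well; a sketch (the 1.6 million threshold referenced in the conclusion below is read off this kind of plot):
plot(ssn$sale_price, resid, xlab='sale_price', ylab='residual')
abline(v=1.6e6, lty=2) # threshold referenced in the conclusion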
In order to compare the two linear regression models, we calculate their Deviance Information Criterion (DIC), which adds a penalty for the effective number of parameters to the mean deviance; lower values are better.
The DIC for the more complex model (Model 1), the one using all features (independent variables), is:
dic.samples(md1, n.iter=1e3)
## Mean deviance: 72089
## penalty 4.033
## Penalized deviance: 72093
The DIC for the simpler model (Model 2) is:
dic.samples(md2, n.iter=1e3)
## Mean deviance: 72134
## penalty 3.037
## Penalized deviance: 72137
Both tested models converged. With a penalized deviance of 72093 against 72137, Model 1 (the more complex one) is marginally better than Model 2. Both models showed better predictions for houses priced below 1.6 million dollars.