library(tidyverse)
library(lubridate)
library(GGally)
library(rjags)
library(hpiR)

1 Introduction

This is the Capstone Project (peer graded assignment) for the Bayesian Statistics: Techniques and Models course in Coursera platform.

For this assignment it was chosen a dataset composed of houses prices from central Seattle, Washinton. For this, data was gathered from the King County Assessor’s FTP site (https://info.kingcounty.gov/assessor/DataDownload/default.aspx). A number of initial data munging tasks were necessary to bring the data into this format.

A number of initial data munging tasks were necessary to bring the data into the final format. The dataset was sliced and features selected so that in the end remained prices and some numerical features for central Seattle home sales from the years of 2015 and 2016, including only detached single family residences and townhomes.

For reproductibility, all the code, data and seeds are included.

2 Data preparation

data("seattle_sales")
ss <- seattle_sales; rm(seattle_sales)
ss$sale_date <- as.Date(ss$sale_date)
ss <- ss %>% filter(sale_date > '2014-12-31')
ss <- ss %>% filter(area %in% 13:16)
ssn <- ss %>% select(sale_price, lot_sf,bldg_grade,
                     tot_sf,beds,baths,age)
nm <- names(ssn)
ssn <- map_dfc(.x = (1:ncol(ssn)), .f= ~as.numeric(ssn[,.x] ))
names(ssn) <- nm

The data to be modeled has 2565 observations, one dependent variable (sale_price) and 6 features (independent variables). So the format of the dataframe used is:

  • sale_price: Price of the home.
  • lot_sf: Size of lot in square feet.
  • bldg_grade: Quality of the building construction (higher is better).
  • tot_sf: Size of home in square feet.
  • beds: Number of bedrooms.
  • baths: Number of bathrooms.
  • age: Age of home.

3 Exploring the data

The remaining variables are all numeric, some are integer, and canbe summarized as follows:

summary(ssn)
##    sale_price          lot_sf        bldg_grade        tot_sf    
##  Min.   : 169317   Min.   :  525   Min.   : 5.00   Min.   : 440  
##  1st Qu.: 559900   1st Qu.: 1539   1st Qu.: 7.00   1st Qu.:1370  
##  Median : 700000   Median : 3840   Median : 8.00   Median :1710  
##  Mean   : 867443   Mean   : 4083   Mean   : 8.08   Mean   :1991  
##  3rd Qu.: 954700   3rd Qu.: 5225   3rd Qu.: 9.00   3rd Qu.:2420  
##  Max.   :6150000   Max.   :57166   Max.   :13.00   Max.   :9820  
##       beds           baths            age        
##  Min.   :0.000   Min.   :0.500   Min.   :  0.00  
##  1st Qu.:3.000   1st Qu.:1.750   1st Qu.:  6.00  
##  Median :3.000   Median :2.250   Median : 60.00  
##  Mean   :3.094   Mean   :2.175   Mean   : 52.07  
##  3rd Qu.:4.000   3rd Qu.:2.500   3rd Qu.: 94.00  
##  Max.   :8.000   Max.   :7.750   Max.   :116.00

The relations between variables are shown in the following scatterplots and correlation coefficients:

4 Linear Regression Models

We are going to model the price as a function of the remaining variables under a linear regression paradigm.

First the uninformative model is generated by the lm function to be used

md0 <- lm(data = ssn,
         sale_price ~ lot_sf+bldg_grade+tot_sf+
            beds+baths+age)
   summary(md0)
## 
## Call:
## lm(formula = sale_price ~ lot_sf + bldg_grade + tot_sf + beds + 
##     baths + age, data = ssn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1410123  -120759   -17788   101903  3256118 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.167e+06  5.767e+04 -20.231  < 2e-16 ***
## lot_sf       3.059e+01  1.758e+00  17.403  < 2e-16 ***
## bldg_grade   1.643e+05  7.515e+03  21.863  < 2e-16 ***
## tot_sf       2.889e+02  1.191e+01  24.251  < 2e-16 ***
## beds        -5.950e+04  7.950e+03  -7.485 9.80e-14 ***
## baths        4.621e+04  1.100e+04   4.202 2.73e-05 ***
## age          1.732e+03  1.740e+02   9.959  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276700 on 2558 degrees of freedom
## Multiple R-squared:  0.7449, Adjusted R-squared:  0.7443 
## F-statistic:  1245 on 6 and 2558 DF,  p-value: < 2.2e-16

4.1 Model 1

Model 1 included all features. Dependent variable (sale_price), linear coefficients and intercept were modelled with normal distributions.

4.1.1 Setup of Model 1

md_str <- 'model{
   for(i in 1:length(sale_price)){
      sale_price[i] ~ dnorm(mu[i],prec)
      mu[i] = b[1]*lot_sf[i] +  b[2]*bldg_grade[i] + 
       b[3]*tot_sf[i] +  b[4]*beds[i] + 
       b[5]*baths[i] +  b[6]*age[i] + b0
   }
   for (j in 1:6) {
        b[j] ~ dnorm(0.0, 1.0/1.0e6)
    }
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
}'

dados <- as.list(ssn)
params <- c('b0','b','sig')

4.1.2 Fitting Model 1

set.seed(1001)
md1 = jags.model(textConnection(md_str), data=dados,
                    n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2565
##    Unobserved stochastic nodes: 8
##    Total graph size: 22356
## 
## Initializing model
update(md1, 1000) # burn-in

md1_sim = coda.samples(model=md1,
                         variable.names=params,
                         n.iter=2e3)

md1_csim <- md1_sim[[1]] %>% as_tibble() %>% 
  bind_rows(md1_sim[[2]] %>% as_tibble()) %>% 
  bind_rows(md1_sim[[3]] %>% as_tibble()) %>% 
  as.mcmc()

4.1.3 Checking model convergence

plot(md1_sim)

gelman.diag(md1_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1       1.01
## b[2]          1       1.00
## b[3]          1       1.01
## b[4]          1       1.00
## b[5]          1       1.00
## b[6]          1       1.00
## b0            1       1.00
## sig           1       1.01
## 
## Multivariate psrf
## 
## 1.01
autocorr.diag(md1_sim)
##                b[1]         b[2]         b[3]        b[4]         b[5]
## Lag 0   1.000000000  1.000000000  1.000000000 1.000000000  1.000000000
## Lag 1   0.679512709  0.560195141  0.769892794 0.195691183  0.126941237
## Lag 5   0.107558823  0.001141217  0.165194637 0.009008162 -0.012554215
## Lag 10 -0.004516752 -0.029325832  0.003539843 0.015904393 -0.016367028
## Lag 50  0.033734415 -0.001322973 -0.001692438 0.027783077  0.003684408
##                b[6]           b0          sig
## Lag 0   1.000000000  1.000000000  1.000000000
## Lag 1   0.581458538  0.025613426 -0.028883830
## Lag 5   0.022130733  0.016810743  0.019020201
## Lag 10  0.004658540  0.006897662 -0.004167729
## Lag 50 -0.003978657 -0.006394589 -0.008912855
effectiveSize(md1_sim)
##     b[1]     b[2]     b[3]     b[4]     b[5]     b[6]       b0      sig 
## 1274.839 1931.334 1009.851 4037.058 4649.683 1828.120 5814.186 6289.260

So the model has good convergence.

4.1.4 Residuals

(coeff <- colMeans(md1_csim))
##         b[1]         b[2]         b[3]         b[4]         b[5]         b[6] 
##     28.77877    930.08898    401.89797  -1224.07611     79.97986   -880.29668 
##           b0          sig 
##   -182.29247 306311.78292
yhat <- coeff[1]*ssn$lot_sf +  coeff[2]*ssn$bldg_grade + 
       coeff[3]*ssn$tot_sf +  coeff[4]*ssn$beds + 
       coeff[5]*ssn$baths +  coeff[6]*ssn$age + coeff[7]

resid <- ssn$sale_price - yhat

plot(resid)

plot(yhat, resid) 

summary(md1_sim)
## 
## Iterations = 1001:3000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 2000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean       SD Naive SE Time-series SE
## b[1]     28.78    1.923  0.02482        0.05392
## b[2]    930.09  891.573 11.51016       20.65671
## b[3]    401.90    5.714  0.07377        0.17977
## b[4]  -1224.08  984.057 12.70412       15.50948
## b[5]     79.98 1001.725 12.93221       14.70540
## b[6]   -880.30  136.626  1.76384        3.20743
## b0     -182.29  992.559 12.81388       13.03966
## sig  306311.78 4326.910 55.86016       54.61367
## 
## 2. Quantiles for each variable:
## 
##           2.5%       25%       50%      75%     97.5%
## b[1]     25.03     27.49     28.74     30.1     32.57
## b[2]   -830.55    331.46    942.67   1528.2   2665.45
## b[3]    390.78    398.03    401.76    405.7    413.41
## b[4]  -3116.05  -1883.92  -1229.75   -559.2    726.98
## b[5]  -1901.65   -593.26     84.40    754.2   2066.27
## b[6]  -1150.20   -971.68   -879.87   -787.0   -617.73
## b0    -2165.42   -846.02   -159.39    476.0   1706.75
## sig  297821.72 303347.09 306304.10 309179.5 314880.21

4.2 Choosing features

md_str <- 'model{
   for(i in 1:length(sale_price)){
      sale_price[i] ~ dnorm(mu[i],prec)
      mu[i] = b[1]*lot_sf[i] +  b[2]*bldg_grade[i] + 
       b[3]*tot_sf[i] +  b[4]*beds[i] + 
       b[5]*baths[i] +  b[6]*age[i] + b0
   }
   for (j in 1:6) {
        b[j] ~ ddexp(0.0, sqrt(2.0)) # has variance 1.0
    }
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
}'

set.seed(1002)
dados <- as.list(ssn)
params <- c('b0','b','sig')
mdcf = jags.model(textConnection(md_str), data=dados,
                    n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2565
##    Unobserved stochastic nodes: 8
##    Total graph size: 22357
## 
## Initializing model
update(mdcf, 1000) # burn-in

mdcf_sim = coda.samples(model=mdcf,
                         variable.names=params,
                         n.iter=2e3)
summary(mdcf_sim)
## 
## Iterations = 2001:4000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 2000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##            Mean        SD Naive SE Time-series SE
## b[1]  2.919e+01    1.7256  0.02228        0.05745
## b[2]  3.547e-02    0.9603  0.01240        0.01950
## b[3]  3.719e+02    4.4612  0.05759        0.15342
## b[4] -2.380e-02    1.0156  0.01311        0.02032
## b[5] -8.975e-03    1.0129  0.01308        0.02021
## b[6] -3.520e-02    0.9803  0.01266        0.01901
## b0    2.031e+02 1017.3173 13.13351       13.50210
## sig   3.105e+05 4323.2702 55.81318       58.27009
## 
## 2. Quantiles for each variable:
## 
##            2.5%        25%        50%       75%     97.5%
## b[1]     25.778  2.804e+01  2.919e+01 3.031e+01 3.269e+01
## b[2]     -1.941 -4.602e-01  6.221e-03 5.059e-01 2.108e+00
## b[3]    363.038  3.690e+02  3.719e+02 3.748e+02 3.809e+02
## b[4]     -2.228 -4.971e-01 -1.539e-02 4.754e-01 2.052e+00
## b[5]     -2.161 -5.059e-01 -3.513e-03 4.884e-01 2.078e+00
## b[6]     -2.106 -5.312e-01 -2.811e-02 4.427e-01 2.047e+00
## b0    -1805.844 -4.831e+02  2.003e+02 8.912e+02 2.180e+03
## sig  302172.157  3.075e+05  3.105e+05 3.134e+05 3.189e+05

As b[1] and b[3] are the only significant coefficients, we will try a simpler model with only

lot_sf and tot_sf features.

4.3 Model 2

4.3.1 Setup of Model 2

md_str <- 'model{
   for(i in 1:length(sale_price)){
      sale_price[i] ~ dnorm(mu[i],prec)
      mu[i] = b[1]*lot_sf[i] + b[2]*tot_sf[i] + b0
   }
   for (j in 1:2) {
        b[j] ~ dnorm(0.0, 1.0/1.0e6)
    }
    b0 ~ dnorm(0.0, 1.0/1.0e6)
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig2 = 1.0 / prec
    sig = sqrt(sig2)
}'

dados <- as.list(ssn)
params <- c('b0','b','sig')

4.3.2 Fitting Model 2

set.seed(1003)
md2 = jags.model(textConnection(md_str), data=dados,
                    n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2565
##    Unobserved stochastic nodes: 4
##    Total graph size: 11840
## 
## Initializing model
update(md2, 1000) # burn-in

md2_sim = coda.samples(model=md2,
                         variable.names=params,
                         n.iter=2e3)

md2_csim <- md2_sim[[1]] %>% as_tibble() %>% 
  bind_rows(md2_sim[[2]] %>% as_tibble()) %>% 
  bind_rows(md2_sim[[3]] %>% as_tibble()) %>% 
  as.mcmc()

4.3.3 Checking model convergence

plot(md2_sim)

gelman.diag(md2_sim)
## Potential scale reduction factors:
## 
##      Point est. Upper C.I.
## b[1]          1          1
## b[2]          1          1
## b0            1          1
## sig           1          1
## 
## Multivariate psrf
## 
## 1
autocorr.diag(md2_sim)
##                 b[1]        b[2]           b0          sig
## Lag 0   1.0000000000  1.00000000  1.000000000  1.000000000
## Lag 1   0.6012127678  0.61101459  0.041569945  0.035498502
## Lag 5   0.0772557586  0.08361840  0.013273393 -0.001790311
## Lag 10 -0.0001181435 -0.01701020 -0.010304378  0.003909176
## Lag 50  0.0207724430  0.03084526  0.002160294 -0.009620787
effectiveSize(md2_sim)
##     b[1]     b[2]       b0      sig 
## 1494.342 1448.522 7295.260 5671.562

4.3.4 Residuals

(coeff <- colMeans(md2_csim))
##         b[1]         b[2]           b0          sig 
##     24.31447    392.80876   -383.21253 308949.54097
yhat <- coeff[1]*ssn$lot_sf + coeff[2]*ssn$tot_sf + coeff[3]

resid <- ssn$sale_price - yhat

plot(resid)

plot(yhat, resid) 

5 Comparing models

In order to compare both linear regression models, we calculate their Deviance Information Criterion.

The Deviance Information Criterion (DIC) for the more complex model (Model 1), the one using all features (independent variables), is:

dic.samples(md1, n.iter=1e3)
## Mean deviance:  72089 
## penalty 4.033 
## Penalized deviance: 72093

The DIC for the simpler model (Model 2) is:

dic.samples(md2, n.iter=1e3)
## Mean deviance:  72134 
## penalty 3.037 
## Penalized deviance: 72137

6 Conclusion

Both tested models converged. Model 1 (more complex) is marginaly better than Model 2. Both models showed better predictions for houses with prices bellow 1.6 million.