R Simulation notebook

In this notebook we compare the predictive distribution \(y^{rep}\) to the observed data \(y\). In this way, we undertake posterior predictive checks (PPC) to evaluate uncertainty associated with the estimated parameters of the model. By definition, PPC are: “simulating replicated data under the fitted model and then comparing these to the observed data” (Gelman and Hill, 2007). In particular, we use PPC to check for any systematic differences between \(y^{rep}\) and \(y\) as indication of any potential failings of the model.

First of all we load prerequisite packages and specify the spatial FE and classical MLM setups.

suppressWarnings(library(ggthemes))
suppressWarnings(library(merTools))
suppressWarnings(library(bayesplot))
suppressWarnings(library(lme4))

### load data
data.df = read.csv('c:/Users/sgscombe/Documents//manchester_case_study/data/manchester_ldc_sorted.csv')
data.df = data.df[order(data.df$cluster_id),]

spatial.fe = lm(log(total_value) ~ 0 +cluster_id+ no_rooms +  relevel(cat_Name1, ref="Showrooms") +  total_area + car_park_spaces  
                , data=data.df)

mlm = lmer(log_val ~ no_rooms +  relevel(cat_Name1, ref="Showrooms") +  total_area + car_park_spaces  
           + (1|cluster_id) , data=data.df)

## Warning: Some predictor variables are on very different scales: consider
## rescaling

Next, we catch model uncertainty in predictions from our MLM model fit with lme4. This is achieved using the predictInterval method which works as follows:

Extract the fixed and random coefficients.
Take n draws from the multivariate normal distribution of the fixed and random coefficients separately.
Calculate the linear predictor for each row based on the fixed and random coefficient draws.
Returns mean or median of the simulated prediction along with the lower and upper limits of the prediction interval.

n.sims = 1000

features = c('log_val','no_rooms', 'car_park_spaces', 'total_area',
             'cluster_id', 'cat_Name1')

PI <- predictInterval(merMod = mlm, newdata = data.df[features], 
                      level = 0.95, n.sims = n.sims,
                      stat = "median", type="linear.prediction",
                      include.resid.var = TRUE, returnSims = TRUE)

## Warning: executing %dopar% sequentially: no parallel backend registered

pi_sims = attr(PI, "sim.results") # return 14982 x 1000 draws

y.rep.mlm = t(pi_sims) # transpose to bayesplot format

Next, we undertake a similar simulation exercise for the spatial FE model.

# 1000 relicates from the parameters in fitted model
spatial.fe.sim = sim(spatial.fe, n.sims)


# simulations used to create 1000 fake datasets of 13982 observations
n = length(data.df$total_value)
y.rep.fe = array(NA, c(n.sims,n))

rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }

for (s in 1:n.sims) {
  # fill 1000 rows with 14982 columns
  y.rep.fe[s,] = rnorm(n,  spatial.fe.sim@coef[s], spatial.fe.sim@sigma[s])
}

y.true = log(data.df$total_value)

### spatial fe
ppc_dens_overlay(y.true, y.rep.fe[1:50,])

### mlm
ppc_dens_overlay(y.true, y.rep.mlm[1:50,])

We further capture how well the posterior predictive distribution captures the mean value of the observed data. We observe that the classical MLM does a far better job at predicting a greater proportion of mean values than the spatial FE model.

ppc_stat(y.true, y.rep.fe[0:50,], binwidth=0.05)

ppc_stat(y.true, y.rep.mlm[0:50,], binwidth=0.05)

Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., Gelman, A. (2018). Visualization in Bayesian workflow. Journal of the Royal Statistical Society Series A, accepted for publication. arXiv preprint: https://arxiv.org/abs/1709.01449

Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (Analytical Methods for Social Research). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511790942