rstan and spdep for Spatial Modelling

This is a short tutorial giving a flavour of how the package Stan may be used in R to calibrate spatial regression models, and be used (in conjunction with standard R) as a Bayesian inferential tool. The data used is based on the Irish data distributed with the spdep
package, but with a more detailed set of polygons defining the counties of the Republic of Ireland, and using the Irish National Grid as a map projection. To work through this tutorial you will need to have R installed, and also the following R packages and their dependencies: spdep, rstan, ggplot2 and tmap. Note that you may need to install RStan from GitHub - there are a few extra steps here, and you can't just install via CRAN in the usual way. Full instructions, for Mac, Windows and Linux, are here: https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started
Once these are in place, you’ll need to load them into R:
library(spdep)
library(rstan)
library(ggplot2)
library(tmap)
and also set up the options for stan:
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Next, you will need to load the data. This takes the form of a SpatialPolygonsDataFrame
called ireland
which is in the binary file ireland.RData
. The file can be downloaded from this link: https://www.dropbox.com/s/vzzo2jqqiaqputo/ireland.RData?dl=0 - download this into your working folder.
Once this is done, read in the data
load('ireland.RData')
ireland.nb <- poly2nb(ireland,snap=0.0002)
The second line produces a nb
object - this contains information about which counties in Ireland are coterminous - that is, they share a common boundary. The object can be drawn, as below:
plot(ireland)
plot(ireland.nb,coords=coordinates(ireland),add=TRUE,pch=16,col='darkred')
The data part of the Ireland SpatialPolygonsDataFrame
has the following variables:
Variable | Definition |
---|---|
A | Percentage of sample with blood group A |
towns | Towns/unit area |
pale | Beyond the Pale 0, within the Pale 1 |
size | Number of blood type samples |
ROADACC | Arterial road network accessibility in 1961 |
OWNCONS | Percentage in value terms of gross agricultural output of each county consumed by itself |
POPCHG | 1961 population as percentage of 1926 |
RETSALE | Value of retail sales (£000) |
INCOME | Total personal income (£000) |
names | County names |
Sources: Upton and Fingleton (1985) and Bailey and Gatrell (1995, ch. 1) for the blood group data; Cliff and Ord (1973, p. 107) for the remaining variables (also after O'Sullivan, 1968). Polygon borders and Irish data sourced from Michael Tiefelsdorf's SPSS Saddlepoint bundle, originally hosted at http://geog-www.sbs.ohio-state.edu/faculty/tiefelsdorf/GeoStat.htm but no longer available. The definition of the English Pale is somewhat idiosyncratic, to say the least.
The main variable of interest here is the percentage of the population with blood group A. This may be mapped via tmap
:
tm_shape(ireland) + tm_polygons(col='A',title='Blood group A (% Pop)')
Note that there is quite a clear spatial trend here - with higher concentrations in the south east - in Carlow, Wexford and Wicklow particularly. Another variable of interest is that of population change:
tm_shape(ireland) + tm_polygons(col='POPCHG',title='1961 Popn as % of 1926 Popn')
Again, quite a strong pattern, bearing some relation to the previous variable. It is possible to explore the relationship between these two variables:
ggplot(data=data.frame(ireland),aes(x=POPCHG,y=A)) + geom_point() + geom_smooth(method='lm')
The above scatter plot adds a linear trend line, which shows a relationship between blood group A and population change - but there are some outliers. The regression line might be improved by using a robust regression approach - such as rlm
in the MASS
package - supplied as a standard package in R.
library(MASS)
ggplot(data=data.frame(ireland),aes(x=POPCHG,y=A)) + geom_point() + geom_smooth(method='lm',col='darkblue',fill='blue') + geom_smooth(method='rlm',col='darkred',fill='red')
Here, both the robust and standard fitted lines are shown (together with prediction intervals). It may be seen that the slope of the robust line is slightly less than the standard line - and the prediction interval is narrower.
Another issue with ordinary least squares regression is that it assumes the error terms for each observation are independent - which is often not the case for geographical data. Here, the regression residuals are mapped:
ireland$resids <- rlm(A~POPCHG,data=data.frame(ireland))$res
tm_shape(ireland) + tm_polygons(col='resids',title='Residuals') + tm_style_col_blind()
Note that it was necessary to add the term tm_style_col_blind()
to this map - the values of the residuals vary across zero, so a divergent colour scale is needed, and the default one is red/green - the combination affected by the most common form of colour blindness. Fortunately a map style to accommodate this is provided - though it would surely make sense not to have an obviously problematic palette as the default.
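As an aside, one alternative (a sketch - any colour-blind-safe diverging RColorBrewer palette could be used in place of 'RdBu') is to set the palette explicitly rather than relying on a map style:
# A red/blue diverging palette avoids the problematic red/green default
tm_shape(ireland) + tm_polygons(col='resids',title='Residuals',palette='RdBu')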
With that issue addressed, it can be observed that there does seem to be a degree of spatial association in the residuals.
Prior to any Bayesian analysis, a fairly straightforward autoregressive error model may be applied to the data, using a maximum likelihood approach. The data model here is
\[ \mathbf{Y} = \alpha + \beta \mathbf{x} + \bm{\varepsilon} \]
where \(\mathbf{Y}\) is the variable A
, \(\mathbf{x}\) is the variable POPCHG
(both column vectors), \(\alpha\) and \(\beta\) are regression coefficients, and \(\bm{\varepsilon}\) is the random error term. However, instead of the assumption that \(\bm{\varepsilon}\) consists of independent errors, we assume here that
\[ \bm{\varepsilon} = \lambda \mathbf{W} \bm{\varepsilon} + \bm{\zeta} \] and \[ \zeta_i \sim \textrm{N}(0,\sigma^2) \ \forall \ i \]
where \(\mathbf{W}\) is related to the contiguity matrix of the Irish counties - \(W_{ij}\) is zero if counties \(i\) and \(j\) are not coterminous, and \(1/d_i\) if they are, and \(d_i\) is the number of counties coterminous to county \(i\). Thus \(\mathbf{W}\) is effectively a smoothing matrix replacing the value of some quantity at county \(i\) with the mean of its neighbours. Here the smoothing occurs over the error terms, and a further error term is added. The value of \(\lambda\) determines the degree of spatial correlation - if \(\lambda = 0\) the process is equivalent to the ordinary least squares model.
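As a concrete illustration (a sketch using the neighbour list created earlier - the same matrix is built again later when fitting models in stan), the matrix \(\mathbf{W}\) described above can be obtained directly from the nb object:
# Row-standardised contiguity matrix: W[i,j] = 1/d_i if counties i and j are coterminous, 0 otherwise
W_check <- nb2mat(ireland.nb)
rowSums(W_check)   # each row sums to 1, since every county has at least one neighbour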
The spdep
package can calibrate models of this kind, with the errorsarlm
function - note that this requires a data frame and a regression model to be specified, but also needs the contiguity matrix (here as an object of type listw
- the neighbour list object is converted to this via nb2listw
):
sp_model <- errorsarlm(A~POPCHG,data=data.frame(ireland),listw=nb2listw(ireland.nb))
summary(sp_model)
##
## Call:errorsarlm(formula = A ~ POPCHG, data = data.frame(ireland),
## listw = nb2listw(ireland.nb))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.89613593 -0.99467226 -0.00087877 0.92653239 3.84756641
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 26.071477 3.063731 8.5097 <2e-16
## POPCHG 0.034060 0.027991 1.2168 0.2237
##
## Lambda: 0.79441, LR test value: 12.888, p-value: 0.0003307
## Asymptotic standard error: 0.10874
## z-value: 7.3058, p-value: 2.7556e-13
## Wald statistic: 53.374, p-value: 2.7567e-13
##
## Log likelihood: -54.08839 for error model
## ML residual variance (sigma squared): 3.0202, (sigma: 1.7379)
## Number of observations: 26
## Number of parameters estimated: 4
## AIC: 116.18, (AIC for lm: 127.06)
The summary output suggests that \(\lambda\) is significantly different from zero - and the drop in AIC from the OLS model also suggests the inclusion of spatial autocorrelation in the model has improved the fit. We can compare the regression coefficients to the OLS model:
ols_model <- lm(A~POPCHG,data=data.frame(ireland))
cbind(coef(ols_model),coef(sp_model)[-1])
## [,1] [,2]
## (Intercept) 22.12032794 26.07147736
## POPCHG 0.08401024 0.03406033
Note that the first coefficient of the spatial model is the estimate of \(\lambda\), and this is dropped when comparing the regression coefficients \(\alpha\) and \(\beta\). From this, it can be seen that allowing for spatial autocorrelation in the error term does have a notable effect - particularly on \(\beta\) (the POPCHG
coefficient) which reduces in value by a factor of around 2.5. Also note that the value of \(\beta\) is no longer significantly different from zero.
One useful technique for regression is to standardise the predictor variables, \(\mathbf{x}\) - that is, subtract their mean value and divide by their standard deviation - but not the response variable \(\mathbf{Y}\). If this is done, the units of the regression coefficient \(\beta\) will be the same as each observation in \(\mathbf{Y}\) - and represent the change in the response when the predictor changes by one standard deviation. Thus, a good idea of the influence is gained - this coefficient is the amount of change in response for a typical amount of variation in the predictor. Furthermore, if multiple regression is used, the coefficients for each of the predictors will have the same units, and be comparable - giving an indication of relative influence. Note also that since the predictors have been mean centred, the intercept term represents the value of the response when all predictors take their mean value. For the special case of OLS regression, this will be the mean value of the response.
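For reference, a minimal sketch of the conventional version of this standardisation (mean zero, unit standard deviation) is given below - the tutorial itself uses a percentile-based variant, defined next:
# Conventional standardisation: subtract the mean, divide by the standard deviation
zscale <- function(x) (x - mean(x))/sd(x)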
A further enhancement is to substitute the difference between the 2.5\({}^\textrm{th}\) and 97.5\({}^\textrm{th}\) percentiles for the standard deviation. Then the coefficient represents the variation in response over the typical range of variation (an estimate for 95 percent of the population). Here, this will be done for POPCHG
:
rescale <- function(x) (x - mean(x))/(quantile(x,0.975) - quantile(x,0.025))
sp_model_scaled <- errorsarlm(A~rescale(POPCHG),data=data.frame(ireland),listw=nb2listw(ireland.nb))
summary(sp_model_scaled)
##
## Call:
## errorsarlm(formula = A ~ rescale(POPCHG), data = data.frame(ireland),
## listw = nb2listw(ireland.nb))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.89613589 -0.99467228 -0.00087881 0.92653240 3.84756627
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 29.0740 1.6614 17.5001 <2e-16
## rescale(POPCHG) 1.9414 1.5955 1.2168 0.2237
##
## Lambda: 0.79441, LR test value: 12.888, p-value: 0.0003307
## Asymptotic standard error: 0.10874
## z-value: 7.3058, p-value: 2.7556e-13
## Wald statistic: 53.374, p-value: 2.7567e-13
##
## Log likelihood: -54.08839 for error model
## ML residual variance (sigma squared): 3.0202, (sigma: 1.7379)
## Number of observations: 26
## Number of parameters estimated: 4
## AIC: 116.18, (AIC for lm: 127.06)
Now it can be seen that if POPCHG
takes the mean value (88.2%) then the proportion with blood group A is 29.1% - and that over the typical range of the predictor, the blood group A proportion varies by 1.9% (all figures quoted to 1 d.p.) - the latter observation giving an idea of the magnitude of the influence (assuming a null hypothesis that \(\beta = 0\) is not true).
The coefficient values are compared below - for ordinary least squares regression, robust regression and spatial regression - all of which make different assumptions about the error term. All of the methods provide similar intercept values. The robust approach suggests that the proportion with blood group A varies slightly less over the typical range of POPCHG than the OLS calibration does - both estimates are over 4 - but the spatial model suggests that the variation attributable to the typical range of POPCHG is considerably less - around half of this. In this model, further variability is due to random spatial trends.
Method | Error Term | \(\alpha\) | \(\beta\) |
---|---|---|---|
OLS | Independent Gaussian | 29.5 | 4.8 |
Robust | Independent Gaussian with outliers | 29.4 | 4.3 |
Spatial | Spatially Correlated Gaussian | 29.1 | 1.9 |
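The values in this table can be reproduced from the models fitted above (a sketch, assuming the objects ireland, ireland.nb, rescale and sp_model_scaled created earlier are still in the workspace):
# Compare coefficients from the three calibrations of the rescaled model
ols_scaled <- lm(A~rescale(POPCHG),data=data.frame(ireland))
rob_scaled <- rlm(A~rescale(POPCHG),data=data.frame(ireland))
rbind(OLS=coef(ols_scaled),
      Robust=coef(rob_scaled),
      Spatial=coef(sp_model_scaled)[-1])   # drop the lambda estimate, as before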
The previous section demonstrated classical approaches to spatial regression and also robust regression. In this section, the Bayesian approach is considered. The main tool for this will be the package rstan
- a version of the Stan software with an interface to R.
There isn’t time to give a detailed description of Bayesian inference here, but the following illustrates it surprisingly well:
Bayesian Inference XKCD-Style.
Bayes theorem is a useful device to deduce the probability of statement \(\mathcal{B}\) being true given that statement \(\mathcal{A}\) is true (ie \(\textrm{P}(\mathcal{B}|\mathcal{A})\)), given \(\textrm{P}(\mathcal{A}|\mathcal{B})\) and the marginal probabilities \(\textrm{P}(\mathcal{A})\) and \(\textrm{P}(\mathcal{B})\) - they are linked by:
\[ \textrm{P}(\mathcal{B}|\mathcal{A}) = \frac{\textrm{P}(\mathcal{A}|\mathcal{B})\,\textrm{P}(\mathcal{B})}{\textrm{P}(\mathcal{A})} \] It is therefore a way of turning conditional probability statements on their head - given the probability of \(\mathcal{A}\) given \(\mathcal{B}\) and the marginal probabilities, we obtain the probability of \(\mathcal{B}\) given \(\mathcal{A}\). Now if \(\mathcal{A}\) is some data, and \(\mathcal{B}\) is a set of model parameters, this means that starting with a data model giving the probability of a data set (essentially all statistical models fit this description) and the data set, we can deduce the probability of the model given the data set. Or more generally, if we have a set of alternative models for the data, we can deduce the probability of each alternative given the data. Sets of alternative models need not be finite - for example the set of all Gaussian models can be parametrised by a mean \(\mu\) and a standard deviation \(\sigma\), and given a set of observations \(\mathcal{D}\) we can use Bayes theorem to find the probability of \(\mu\) and \(\sigma\) via
\[ \textrm{P}(\mu,\sigma | \mathcal{D}) = \frac{\textrm{P}(\mathcal{D}|\mu,\sigma) \textrm{P}(\mu,\sigma)}{\textrm{P}(\mathcal{D})} \] Note that \(P(\mathcal{D})\) is the probability of the data for all possible \(\mu\) and \(\sigma\) values, and so integrating over the possibilities we have
\[ \textrm{P}(\mathcal{D}) = \int_{\mu,\sigma} \textrm{P}(\mathcal{D}|\mu,\sigma) \textrm{P}(\mu,\sigma) \ d\mu \ d \sigma \] Noting that the above expression does not depend on \(\mu\) and \(\sigma\) - they are integrated out - and that for a given analysis the data are a fixed quantity, we can conclude it is a constant, and so the simpler expression
\[ \textrm{P}(\mu,\sigma | \mathcal{D}) \propto \textrm{P}(\mathcal{D}|\mu,\sigma) \textrm{P}(\mu,\sigma) \]
holds. More broadly, if \(\bm\theta\) is a set of parameters for a general model, then \[ \textrm{P}(\bm\theta | \mathcal{D}) \propto \textrm{P}(\mathcal{D}|\bm\theta) \textrm{P}(\bm\theta) \]
Now \(P(\mathcal{D}|\bm\theta)\) is just the likelihood of the data given \(\bm\theta\) - the data model. \(P(\bm\theta)\) is a prior probability for the values of \(\bm\theta\) - often these are just constant values representing no prior information, but they need not be - for example \(\textrm{P}(\mu,\sigma)\) might result from the outcome of a previous study using a different sample, but conducted using the same procedure. In Bayesian terminology \(P(\bm\theta)\) is the prior distribution for \(\bm\theta\) and \(\textrm{P}(\bm\theta | \mathcal{D})\) is the posterior distribution. However, in doing this, we have subtly redefined the concept of probability - rather than being the limit of the proportion of particular outcomes in an infinitely repeated experiment, it has now taken on the role of a relative degree of belief based on past experience.
A further twist here is that there is no requirement that the prior distribution be a well-defined probability distribution. The only requirement is that the posterior distribution be well defined. This is often useful in trying to represent a state of no prior knowledge - for example the elements of \(\bm\theta\) may range from \(-\infty\) to \(\infty\) but \(\textrm{P}(\bm\theta)\) could be a constant provided \(\textrm{P}(\bm\theta|\mathcal{D})\) is proportional to a well-defined distribution.
In adopting these definitions we become Bayesian rather than just users of Bayes' theorem in standard probability calculations. However, many people do adopt this definition - some from an axiomatic viewpoint, insisting that all inference should be Bayesian, whilst others do so on a circumstantial basis - applying Bayesian approaches when they deem it appropriate but using alternatives otherwise.
If you skipped the last part the key point is that a posterior distribution for some set of parameters \(\bm\theta\) given a data model \(\textrm{P}(\mathcal{D}|\bm\theta)\) and a distribution for \(\bm\theta\) representing prior belief about the parameter values is proportional to \(\textrm{P}(\mathcal{D}|\bm\theta) \textrm{P}(\bm\theta)\). A set of prior beliefs for the parameters of interest are modified in the light of the data to give a set of posterior beliefs. Note that to compute the mean or median or other moment or quantile-based statistics for the posterior distribution, we need the complete expression for the posterior, not just something proportional to it - this implies normalising the expression to integrate to 1 - so the full expression is
\[ \textrm{P}(\bm\theta|\mathcal{D}) = \displaystyle{\frac{\textrm{P}(\mathcal{D}|\bm\theta) \textrm{P}(\bm\theta)}{\int_{\bm\theta} \textrm{P}(\mathcal{D}|\bm\theta) \textrm{P}(\bm\theta)\ d\bm\theta}} \] A notable practical issue here is that the integral in the denominator is not generally easy to find, except in some standard cases - for example if \(\textrm{P}(\mathcal{D}|\bm\theta)\) is a Gaussian distribution, and \(\mathrm{P}(\bm\theta)\) is constant. In practice, a more general approach is to use Markov Chain Monte Carlo (MCMC) sampling. That is, instead of computing \(\textrm{P}(\bm\theta|\mathcal{D})\) analytically, a large number of random draws from the distribution are created, and the distribution properties are computed empirically from these. For example, to infer the mean and standard deviation of a Gaussian distribution in this way, samples from the posterior distribution for both \(\mu\) and \(\sigma\) are drawn, and their means are calculated (to obtain a point estimate), or the \(\textsf{97.5}^{\textsf{th}}\) and \(\textsf{2.5}^{\textsf{th}}\) percentiles are calculated to obtain 95% credibility intervals for the two parameters. Note that credibility intervals are the Bayesian counterparts to classical confidence intervals.
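For example, given a vector of posterior draws for a parameter, the point estimate and credibility interval are just the sample mean and sample percentiles. The sketch below uses a made-up vector of draws purely for illustration - with rstan, the real draws are extracted from the fitted model object, as shown later:
# 'mu_draws' is a stand-in for genuine MCMC output (hypothetical values)
mu_draws <- rnorm(4000,mean=10,sd=0.5)
c(mean=mean(mu_draws),quantile(mu_draws,c(0.025,0.975)))   # point estimate and 95% credibility interval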
There are a number of software tools to carry out MCMC sampling, including rstan
- an implementation of the Stan software tool designed to work in conjunction with R. Without going into the exact algorithm used for sampling, Stan uses a descriptor file - written in a mini programming language quite similar to R - that specifies the data \(\mathcal{D}\), the data model \(\textrm{P}(\mathcal{D}|\bm\theta)\) and any prior distributions \(\textrm{P}(\bm\theta)\). Then, the simulations are run - initiated by an R command if rstan is used - and the simulated values are returned. Some built-in functions in R can then be used to summarise or visualise posterior distributions based on the simulated values.
How Stan works

The operation of Stan is set out in the diagram below. When the stan(<file>.stan) function is invoked in R, a stan file (here denoted as <file>.stan) is first translated into C++ and then compiled to an object code file. This is then linked to R (this is done automatically inside the stan command, using the Rcpp library). Then the simulations are run (also automatically inside the stan function), with the output - including the simulated parameter values - stored in an R variable.
An advantage of this approach is that the multiple simulations are computed using compiled C++ code, with a notable saving in CPU time compared to using an interpreted language to specify the simulations.
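As an aside (a sketch - not needed for the rest of the tutorial), the compile and sample steps shown in the diagram can also be carried out separately with stan_model and sampling, which avoids recompiling when refitting the same model. The example assumes the lm.stan file and the variables N, x and y defined in the next section:
# Compile once (translate to C++ and build the object code) ...
lm_compiled <- stan_model('lm.stan')
# ... then run the simulations as often as required without recompiling
lm_refit <- sampling(lm_compiled,data=list(N=N,x=x,y=y))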
A basic stan task

A very basic stan task is to calibrate an ordinary least squares regression. Although this can be done using the basic lm
task is to calibrate an ordinary least squares regression. Although this can be done using the basic lm
function in R, this serves to illustrate how Stan
is used in practice. The data \(\mathcal{D}\) here consist of two vectors \(x\) and \(y\) and the parameters \(\bm\theta\) are \((\alpha,\beta,\sigma)\) with the data model (ie \(\textrm{P}(\mathcal{D}|\bm\theta)\)) being
\[
\mathbf{y} \sim \textrm{N}(\alpha + \beta \mathbf{x}, \sigma)
\] Where vectors in the specification of the distribution imply that the distribution holds independently on an element-by-element basis. Here the prior distribution for \((\alpha,\beta,\sigma)\) will be constant, noting the restriction that \(\sigma \ge 0\). The stan
file looks like this:
data {
int N;
vector[N] x;
vector[N] y;
}
parameters {
real beta;
real<lower=0,upper=100> alpha;
real<lower = 0> sigma;
}
model {
y ~ normal(alpha + x * beta, sigma);
}
In the data
section, the data used to calibrate the model is declared. Here we need the vectors x
and y
together with N
- an integer stating their length. If these variables are defined in R they will automatically be passed to stan
. Next, in the parameters
section the parameters are declared - in particular note that although all are declared as real numbers, sigma
is declared to have a lower bound of zero. Finally, in the model
section, the data model is set out - here just the basic Gaussian (normal
) distribution. Prior distributions for the parameters can also be given here - although if they are omitted as in this example, they are assumed to be constants. To test this model, type the code above into a file called lm.stan
and store it in your working directory. Then, at the R prompt, enter
set.seed(80147)
x <- rescale(ireland$POPCHG)
y <- ireland$A
N <- length(y)
lm_result <- stan('lm.stan')
This defines the variables used in the data
section of the stan
file - they are the same ones as in the regression model, with the rescaled version of POPCHG
. The set.seed
function call is to guarantee reproducibility - entering this ensures the same sequence of pseudo-random numbers will occur when the simulations are run - you may wish to use a different number. The results of running the stan model - following the process set out in the diagram earlier - are stored in lm_result
. As a default, 4000 simulations are created. To obtain a quick overview of the simulated parameters, enter
lm_result
## Inference for Stan model: lm.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta 4.72 0.03 1.89 1.01 3.49 4.72 5.98 8.37 4000 1
## alpha 29.52 0.01 0.54 28.45 29.16 29.51 29.88 30.57 4000 1
## sigma 2.72 0.01 0.42 2.06 2.43 2.67 2.96 3.70 4000 1
## lp__ -34.27 0.03 1.28 -37.56 -34.84 -33.95 -33.33 -32.81 1709 1
##
## Samples were drawn using NUTS(diag_e) at Mon Jan 23 15:58:15 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Note that the means of the estimates for \(\alpha\) and \(\beta\) are very similar to those obtained with lm
using ordinary least squares regression and calibrating in the standard way. The package rstan
also has some visualisation tools - in particular, to see the marginal distribution of each of the parameters, enter
stan_dens(lm_result)
It is also possible to extract the individual simulations from lm_result
and store in a data frame. This can then be explored with standard R tools, such as ggplot
. The following plot gives the joint marginal posterior distribution for \(\alpha\) and \(\beta\).
lm_df <- as.data.frame(lm_result)
ggplot(lm_df,aes(x=alpha,y=beta)) + geom_point(alpha=0.4,col='indianred') + geom_density2d(col='darkblue')
They seem relatively uncorrelated - this can be tested using
cor(lm_df$alpha,lm_df$beta)
## [1] -0.03388867
This suggests only a very small correlation, implying that posterior beliefs about these two parameters are more or less independent of one another - i.e. if we were to know the value of \(\alpha\) it wouldn't change our posterior belief about the value of \(\beta\), and vice versa.
Finally, it is possible to evaluate more general hypotheses using this approach. For example, suppose we were interested in the hypothesis that \(\beta \in [3.6,4.4]\). Then we simply compute the proportion of simulated \(\beta\)-values falling in this range:
length(lm_df$beta[lm_df$beta <= 4.4 & lm_df$beta >= 3.6])/length(lm_df$beta)
## [1] 0.16325
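Equivalently (a small aside), since the mean of a logical vector in R is just the proportion of TRUE values, the same quantity can be computed more compactly:
# Proportion of simulated beta values falling in the interval [3.6, 4.4]
mean(lm_df$beta >= 3.6 & lm_df$beta <= 4.4)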
Following from the classical analysis, a more robust approach to the Gaussian linear regression model can be obtained. However, the approach used in rlm
is not so easy to replicate as a Bayesian method, so an alternative approach is suggested. Here, rather than segregating outliers from other observations and re-weighting them, we simply adopt a data model with heavier tails than a Gaussian distribution - and therefore more prone to large residuals. Such a distribution is the \(t\)-distribution -
\[ f(x|\nu,\mu,\sigma) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\pi\nu\sigma^2}\Gamma\left(\frac{\nu}{2}\right)} \left(1+\frac{(x-\mu)^2}{\sigma^2\nu}\right)^{-\frac{\nu+1}{2}} \] also denoted by \(x \sim \textrm{t}(\nu,\mu,\sigma)\). The additional parameter \(\nu\) specifies the heaviness of the tails - the lower it is, the heavier the tail. As \(\nu \rightarrow \infty\) the distribution tends to Gaussian. Thus, estimating a model based on the \(t\) distribution allows heavier tails - and therefore outliers - to be considered. Estimating the value of \(\nu\) gives an indication of the tendency for outliers to occur. For the regression model, we will use:
\[ \mathbf{y} \sim \textrm{t}(\nu,\alpha + \beta \mathbf{x}, \sigma) \]
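To get a feel for the role of \(\nu\), the sketch below (not part of the modelling workflow - the value \(\nu = 5\) is chosen arbitrarily) overlays a \(t\) density with a small \(\nu\) on the standard Gaussian; the \(t\) curve places noticeably more probability in the tails:
# Compare a heavy-tailed t density (nu = 5) with the standard Gaussian
xs <- seq(-5,5,length.out=200)
dens <- data.frame(x=rep(xs,2),
                   density=c(dnorm(xs),dt(xs,df=5)),
                   dist=rep(c('Gaussian','t, nu=5'),each=length(xs)))
ggplot(dens,aes(x=x,y=density,col=dist)) + geom_line()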
The new stan
file - called lmr.stan
looks like this:
data {
int N;
vector[N] x;
vector[N] y;
}
parameters {
real beta;
real<lower=0,upper=100> alpha;
real<lower=0> nu;
real<lower = 0> sigma;
}
model {
nu ~ gamma(2,0.1);
y ~ student_t(nu,alpha + x * beta, sigma);
}
The prior distribution for \(\nu\) is a gamma distribution with parameters 2 and 0.1 – this has been found to work well in practice (see Juárez and Steel (2010) Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics 28, 52–66.) It is run as below, noting that N
, x
and y
were already set earlier:
lmr_result <- stan('lmr.stan')
and as before, the results may be checked:
lmr_result
## Inference for Stan model: lmr.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta 4.69 0.03 1.69 1.50 3.57 4.66 5.78 8.08 3454 1
## alpha 29.46 0.01 0.51 28.45 29.12 29.46 29.79 30.47 4000 1
## nu 17.64 0.28 13.46 2.27 7.58 14.24 23.78 52.76 2349 1
## sigma 2.39 0.01 0.49 1.45 2.08 2.38 2.69 3.42 1429 1
## lp__ -39.53 0.03 1.40 -42.90 -40.25 -39.22 -38.48 -37.75 1959 1
##
## Samples were drawn using NUTS(diag_e) at Mon Jan 23 15:58:19 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
… and visualised as univariate marginal posterior distributions:
stan_dens(lmr_result)
… and as a bivariate posterior distribution in \(\alpha\) and \(\beta\).
lmr_df <- as.data.frame(lmr_result)
ggplot(lmr_df,aes(x=alpha,y=beta)) + geom_point(alpha=0.4,col='indianred') + geom_density2d(col='darkblue')
There is still very little posterior correlation between the two estimates.
cor(lmr_df$alpha,lmr_df$beta)
## [1] -0.0119172
A point estimate of \(\nu\) – the posterior distribution mean – is relatively high at 17.6. To see how this compares to a Gaussian distribution, a quantile-quantile plot (qq-plot) is a helpful visual device - this shows the relationship between the quantiles of the estimated \(t\)-distribution and the Gaussian distribution. Notable discrepancies in the tails would suggest a propensity towards outliers. Some code for this using ggplot
is shown below:
nu_est <- mean(lmr_df$nu)
quant <- seq(0.01,0.99,l=99)
qqdat <- data.frame(q=quant,norm=qnorm(quant),tdist=qt(quant,nu_est))
ggplot(qqdat,aes(x=norm,y=tdist)) + geom_line(col='indianred') + geom_abline(intercept=0,slope=1,lwd=0.25,lty=2)
A reference line for \(x=y\) is added, so that it is clear where the qq-plot departs from what would be expected if the error term were Gaussian. The tails in the \(t\)-distribution do depart slightly from this - suggesting some tendency to outliers.
The spatial regression model introduced earlier will now be considered in a Bayesian context. In order to express the model in stan, it is helpful to consider the distribution of the error term \(\bm\varepsilon\) explicitly. Recall that
\[ \bm\varepsilon = \lambda\mathbf{W}\bm\varepsilon + \bm\zeta \] with \(\bm\zeta \sim \textrm{N}(\bm0,\sigma)\) so that \[ \bm\varepsilon = (\mathbf{I} - \lambda \mathbf{W})^{-1} \bm\zeta \]
and therefore \(\bm\varepsilon\) has a multivariate normal distribution: \[ \bm\varepsilon \sim \textrm{MVN}(\bm0,(\mathbf{I} - \lambda \mathbf{W})^{-1}(\mathbf{I} - \lambda \mathbf{W})^{-T}\sigma^2) \] where \((.)^{-T}\) denotes inversion followed by transposition.
Thus, the full spatial regression data model is given by
\[ \mathbf{y} \sim \textrm{MVN}(\alpha\mathbf{e} + \beta\mathbf{x},(\mathbf{I} - \lambda \mathbf{W})^{-1}(\mathbf{I} - \lambda \mathbf{W})^{-T}\sigma^2) \]
where \(\mathbf{e}\) denotes a column vector of ones. If \(\mathbf{I}\), \(\mathbf{W}\) and \(\mathbf{e}\) are pre-defined in R, then the stan
file below specifies this model.
data {
int N;
vector[N] x;
vector[N] y;
matrix<lower=0>[N,N] W;
vector<upper=1>[N] e;
matrix<lower=0,upper=1>[N,N] I;
}
parameters {
real beta;
real<lower=0,upper=100> alpha;
real<lower = 0> sigma;
real<lower=-1,upper=1> lambda;
}
model {
y ~ multi_normal_prec(alpha + x * beta, crossprod(I - lambda * W)/(sigma*sigma));
}
Note that the multivariate Gaussian probability function ends with _prec
meaning that the precision matrix - which is the inverse of the variance-covariance matrix - is specified rather than the variance-covariance matrix itself. This is useful here, as it is the precision that is directly specified in terms of \(\lambda\) and \(\mathbf{W}\) - saving a matrix inversion step in the simulations. If the code above is in a file called car.stan
then the following code will run the model - before setting it going, the extra variables are defined:
I <- diag(N)
W <- nb2mat(ireland.nb)
e <- rep(1,N)
car_res <- stan('car.stan')
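Before looking at the output, a quick sanity check (a sketch using the I and W just defined, with an arbitrary value of \(\lambda\) and \(\sigma = 1\)) confirms that crossprod(I - lambda * W) really is the precision - the inverse of the covariance matrix given in the model above:
# For lambda = 0.5: covariance is solve(A) %*% t(solve(A)), so its inverse should equal crossprod(A)
A <- I - 0.5 * W
Sigma <- solve(A) %*% t(solve(A))
max(abs(solve(Sigma) - crossprod(A)))   # effectively zero, up to rounding error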
Inspecting the results shows the estimates:
car_res
## Inference for Stan model: car.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta 1.84 0.05 1.76 -1.64 0.69 1.83 2.97 5.44 1156 1.00
## alpha 28.91 0.18 5.30 18.24 27.37 29.05 30.47 38.97 842 1.01
## sigma 1.91 0.01 0.33 1.42 1.69 1.86 2.10 2.69 944 1.00
## lambda 0.81 0.00 0.13 0.50 0.73 0.83 0.91 0.98 741 1.01
## lp__ -30.80 0.08 1.87 -35.72 -31.71 -30.35 -29.44 -28.47 583 1.01
##
## Samples were drawn using NUTS(diag_e) at Mon Jan 23 15:58:28 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
As before, a notable drop in \(\beta\) is observed. Also as before, univariate marginal posterior distributions can be viewed (left as a self-test exercise this time). A new posterior distribution of interest is the bivariate marginal distribution of \(\beta\) and \(\lambda\):
car_df <- as.data.frame(car_res)
ggplot(car_df,aes(x=lambda,y=beta)) + geom_point(alpha=0.4,col='indianred') + geom_density2d(col='darkblue')
Here is one advantage of the stan
approach - it is possible to specify a model that is both spatially autocorrelated in the error term and takes heavy distributional tails into account. This is achieved by modelling the error term with a multivariate \(t\) distribution. This is defined by
\[ t(\mathbf{x}|\nu,\bm\mu,\bm{\Sigma}) = \frac{\Gamma\left[(\nu + n)/2\right]}{\Gamma(\nu/2)(\pi\nu)^{n/2}|\bm\Sigma|^{1/2} }\left[1 + \frac{1}{\nu}(\mathbf{x} - \bm\mu)^T\bm\Sigma^{-1}(\mathbf{x} - \bm\mu)\right]^{-(\nu+n)/2} \] Using the variance-covariance matrix from above to encapsulate spatial dependency, a heavy-tailed, spatially autocorrelated data model may be obtained. The code for this is below:
data {
int N;
vector[N] x;
vector[N] y;
matrix<lower=0>[N,N] W;
vector<upper=1>[N] e;
matrix<lower=0,upper=1>[N,N] I;
}
parameters {
real beta;
real<lower=0,upper=100> alpha;
real<lower = 0> sigma;
real<lower=0> nu;
real<lower=-1,upper=1> lambda;
}
model {
nu ~ gamma(2,0.1);
y ~ multi_student_t(nu,alpha + x * beta, tcrossprod(inverse(I - lambda * W))*sigma*sigma);
}
Since there is no _prec form of the multivariate \(t\) distribution in stan, this time the inverse is calculated explicitly in the code. Below, the code is run:
car_ht_res <- stan('car_ht.stan')
This can be inspected in the usual way:
car_ht_res
## Inference for Stan model: car_ht.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta 2.00 0.05 1.89 -1.74 0.76 2.03 3.24 5.83 1257 1.00
## alpha 28.91 0.11 4.38 19.54 27.59 29.11 30.37 36.44 1582 1.00
## sigma 1.98 0.02 0.53 1.13 1.64 1.91 2.25 3.22 708 1.00
## nu 20.29 0.37 14.51 2.73 9.93 16.94 26.90 56.40 1525 1.00
## lambda 0.79 0.01 0.14 0.41 0.71 0.82 0.90 0.98 522 1.01
## lp__ -36.76 0.08 2.00 -41.63 -37.85 -36.34 -35.24 -34.08 634 1.00
##
## Samples were drawn using NUTS(diag_e) at Mon Jan 23 15:59:10 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The univariate marginal distributions may be plotted as before:
stan_dens(car_ht_res)
Here, \(\nu\) is a little higher than in the non-spatial model, with a posterior mean value of 20.3 – suggesting that perhaps the tails are less heavy when spatial autocorrelation is also allowed for. It may be of interest to inspect the joint posterior marginal distribution of \(\lambda\) and \(\nu\):
car_ht_df <- as.data.frame(car_ht_res)
ggplot(car_ht_df,aes(x=nu,y=lambda)) + geom_point(alpha=0.4,col='indianred') + geom_density2d(col='darkblue') + geom_smooth(method='loess')
Adding the smoothing line gives the conditional mean of \(\lambda\) as a function of \(\nu\) - and suggests that \(\lambda\) is fairly independent of \(\nu\) over the typical range of \(\nu\) values in the posterior distribution.
As a final demonstration of the flexibility of stan
a spatially varying coefficient model will be calibrated. In this model the value of \(\beta\) will vary for each county - effectively becoming a random vector \(\bm\beta\). This vector will have a spatially correlated distribution similar to that of \(\bm\varepsilon\) - specifically
\[ \bm\beta \sim \textrm{MVN}(\mu_\beta \mathbf{e},(I - \lambda_\beta \mathbf{W})^{-1}(I - \lambda_\beta \mathbf{W})^{-T}\sigma_\beta^2) \] where \(\mu_\beta\) is the mean value of the \(\beta\) coefficients, and \(\lambda_\beta\) is the coefficient of spatial dependency for \(\beta\). Here, a standard Gaussian spatial model is used, rather than a multivariate \(t\)-distribution. The spatially varying \(\beta\) coefficients are then used as part of the spatial regression model of \(y\) on \(x\):
\[
\mathbf{y} \sim \textrm{MVN}(\alpha \mathbf{e} + \bm\beta \otimes \mathbf{x}, (I - \lambda \mathbf{W})^{-1}(I - \lambda \mathbf{W})^{-T}\sigma^2)
\] where \(\otimes\) denotes element-wise multiplication of vectors. The stan
code for this is listed below.
data {
int N;
vector[N] x;
vector[N] y;
matrix<lower=0>[N,N] W;
vector<upper=1>[N] e;
matrix<lower=0,upper=1>[N,N] I;
}
parameters {
vector[N] beta;
real<lower=0,upper=100> alpha;
real mbeta;
real<lower = 0> sigma;
real<lower = 0> sigma_b;
real<lower=-1,upper=1> lambda;
real<lower=-1,upper=1> lambda_b;
}
model {
beta ~ multi_normal_prec(e * mbeta, tcrossprod(I - lambda_b * W)/(sigma_b*sigma_b));
y ~ multi_normal_prec(alpha + x .* beta, tcrossprod(I - lambda * W)/(sigma*sigma));
}
Note that the stan
symbol for element-wise multiplication is .*
. The simulation is run below - note the iter=5000
option - since quite a few more quantities are estimated this time, it was felt that more simulations may be needed:
svc_res <- stan('svc.stan',iter=5000)
and the results may be listed. Note that the random draws of each element of \(\bm\beta\) are all listed - making local coefficient estimates possible.
svc_res
## Inference for Stan model: svc.
## 4 chains, each with iter=5000; warmup=2500; thin=1;
## post-warmup draws per chain=2500, total post-warmup draws=10000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## beta[1] 4.00 0.10 3.32 -2.34 1.98 3.88 5.79 11.07 1091
## beta[2] 3.13 0.11 3.00 -3.34 1.39 3.27 5.04 8.92 756
## beta[3] 3.83 0.10 3.27 -2.86 1.94 3.82 5.71 10.60 1130
## beta[4] 3.56 0.10 3.41 -3.61 1.64 3.64 5.54 10.12 1235
## beta[5] 3.85 0.10 3.12 -2.56 1.93 3.88 5.75 9.98 1027
## beta[6] 3.27 0.08 2.03 -0.83 1.92 3.27 4.67 7.12 576
## beta[7] 3.66 0.09 3.34 -3.16 1.74 3.67 5.55 10.52 1476
## beta[8] 3.89 0.09 3.19 -2.43 1.94 3.88 5.80 10.50 1178
## beta[9] 2.87 0.11 3.00 -3.56 1.10 3.01 4.83 8.46 730
## beta[10] 3.87 0.10 3.45 -2.91 1.84 3.85 5.81 10.91 1132
## beta[11] 3.75 0.10 3.30 -2.86 1.82 3.67 5.67 10.37 1198
## beta[12] 2.89 0.10 3.11 -3.68 1.08 3.01 4.88 8.97 914
## beta[13] 4.05 0.09 3.24 -2.32 2.10 3.98 5.84 10.86 1277
## beta[14] 3.70 0.10 3.20 -2.89 1.81 3.69 5.57 10.21 1021
## beta[15] 4.00 0.08 2.76 -1.52 2.23 3.98 5.64 9.54 1098
## beta[16] 3.55 0.10 3.01 -2.62 1.74 3.52 5.38 9.70 974
## beta[17] 3.38 0.10 3.23 -3.30 1.50 3.43 5.30 9.71 1040
## beta[18] 3.81 0.09 2.90 -1.89 1.99 3.80 5.55 9.64 1056
## beta[19] 3.84 0.10 3.21 -2.61 1.95 3.80 5.68 10.42 968
## beta[20] 3.38 0.10 3.26 -3.57 1.52 3.50 5.30 10.12 975
## beta[21] 3.90 0.10 3.16 -2.39 1.99 3.89 5.76 10.46 1038
## beta[22] 3.74 0.10 3.52 -3.19 1.73 3.74 5.67 10.59 1335
## beta[23] 3.85 0.09 3.41 -2.91 1.91 3.74 5.75 10.89 1366
## beta[24] 3.56 0.09 3.29 -3.12 1.67 3.59 5.47 10.08 1260
## beta[25] 3.60 0.09 3.37 -3.28 1.74 3.64 5.52 10.47 1372
## beta[26] 4.41 0.11 3.27 -1.47 2.32 4.26 6.15 11.72 922
## alpha 28.62 0.03 1.12 26.35 27.90 28.63 29.38 30.70 1658
## mbeta 3.75 0.10 2.29 -0.90 2.29 3.72 5.18 8.49 534
## sigma 2.04 0.01 0.36 1.48 1.79 1.99 2.24 2.87 1651
## sigma_b 1.93 0.08 1.34 0.39 0.93 1.60 2.54 5.46 312
## lambda 0.72 0.01 0.15 0.36 0.64 0.74 0.83 0.95 926
## lambda_b 0.00 0.02 0.54 -0.92 -0.45 0.00 0.45 0.90 599
## lp__ -56.17 1.52 17.45 -87.86 -68.54 -57.13 -43.82 -21.79 132
## Rhat
## beta[1] 1.00
## beta[2] 1.00
## beta[3] 1.00
## beta[4] 1.00
## beta[5] 1.01
## beta[6] 1.01
## beta[7] 1.00
## beta[8] 1.00
## beta[9] 1.00
## beta[10] 1.01
## beta[11] 1.00
## beta[12] 1.00
## beta[13] 1.00
## beta[14] 1.00
## beta[15] 1.01
## beta[16] 1.00
## beta[17] 1.00
## beta[18] 1.00
## beta[19] 1.00
## beta[20] 1.00
## beta[21] 1.01
## beta[22] 1.00
## beta[23] 1.00
## beta[24] 1.00
## beta[25] 1.00
## beta[26] 1.01
## alpha 1.00
## mbeta 1.01
## sigma 1.00
## sigma_b 1.02
## lambda 1.00
## lambda_b 1.01
## lp__ 1.04
##
## Samples were drawn using NUTS(diag_e) at Mon Jan 23 16:00:47 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
These can also be viewed in map form - the as.matrix
function extracts the local beta samples (pars = 'beta'
selects just the \(\bm\beta\) coefficients) - taking column-wise means computes posterior mean estimates - and then the map is drawn using tmap
.
beta_sims <- as.matrix(svc_res,pars='beta')
ireland$local_beta <- apply(beta_sims,2,mean)
tm_shape(ireland) + tm_polygons(col='local_beta',title='Local Beta')
Similarly, posterior standard deviations can be computed for each of these local \(\beta\) coefficients:
ireland$local_beta_sd <- apply(beta_sims,2,sd)
tm_shape(ireland) + tm_polygons(col='local_beta_sd',title='Beta SD')
You may also have noted from the initial printout of svc_res
that the value of \(\lambda_\beta\) is quite close to zero. Here the posterior joint marginal distribution of \(\lambda\) and \(\lambda_\beta\) is inspected:
svc_df <- as.data.frame(svc_res)
ggplot(svc_df,aes(x=lambda_b,y=lambda)) + geom_point(alpha=0.4,col='indianred') + geom_density2d(col='darkblue')
The low value of \(\lambda_b\) is noted - and once again the estimated values of \(\lambda\) and \(\lambda_b\) are virtually uncorrelated - this may be verified:
cor(svc_df$lambda,svc_df$lambda_b)
## [1] -0.008299519
Finally, a posterior distribution for the standard deviation of the \(\beta\) values can be computed, to gain some idea of how much variability the local betas actually exhibit. In this case, the quantity for which we require a posterior distribution is a function of the elements of \(\bm\beta\) - their standard deviation. Note that this is achieved by applying a row-wise sd
to the simulated betas, giving a set of samples of the standard deviation of the \(\beta\) values. They are stored in a data frame (even though there is only one column) because this is how ggplot
expects data to be supplied. A density curve is then drawn.
sd_df <- data.frame(sdsim = apply(beta_sims,1,sd))
ggplot(sd_df,aes(x=sdsim)) + geom_density(fill='dodgerblue',alpha=0.3)
As a point estimate, the posterior mean can be computed -
mean(sd_df$sdsim)
## [1] 2.017964
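In the same spirit as earlier, an interval estimate can be read directly off the simulated values - for example a 95% credibility interval for the standard deviation of the local \(\beta\) coefficients:
# 2.5th and 97.5th percentiles of the simulated standard deviations
quantile(sd_df$sdsim,c(0.025,0.975))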
This tutorial has introduced ideas of spatial modelling from a classical and a Bayesian viewpoint. Probably the main advantages of the Bayesian approach here are its flexibility - models tailored to the data, such as spatially autocorrelated error terms with heavy tails or spatially varying coefficients, are easily specified - and the ability to evaluate quite general hypotheses directly from the posterior samples. The approach is perhaps more complex, though - and does of course require an entirely different definition of probability to be adopted!