Stan is a flexible modelling language that makes it straightforward to estimate a very wide range of probability models using Bayesian techniques, and there are good reasons to spend the time learning it. This tutorial aims to get you started.

By the end of this tutorial, you should be able to do the following:

  1. Understand the Bayesian workflow
  2. Know how to write out a Stan model
  3. Know how to estimate and check a model in R
  4. Know where to get help

The examples given below will use R’s interface with Stan, rstan. While Stan can also be called from Python (using PyStan), Julia (using Stan.jl) and other environments, there are some useful companion libraries that are better developed in R.

1. The Bayesian Workflow

A strong culture exists within the Bayesian community around a workflow that promotes high quality modelling. Many of the steps should be familiar to economists, yet a few are distinct to Bayesian modelling.

While at first it may seem a drag to adhere to this workflow strictly, after a while it should feel natural. It will reduce the number of errors in your work, and help you think through the details of the modelling task. Let’s walk through each step, gradually introducing Stan along the way.

A) Writing out your model in probability form

Stan estimates probability models. These are models in which we assume that all unknown parameters and the outcome variable(s) \(y\) each come from some (conditional) probability distributions.

As a first example, take the linear regression model. In this model we have an (\(N\)-long) vector of outcomes \(y\), a matrix of covariates \(X\) (\(N\times P\)), a (\(P\)-long) vector of unknown coefficients \(\beta\), and an (\(N\)-long) vector of residuals \(\epsilon\). The model is typically written out in matrix notation as

\[ y = X\beta + \epsilon \]

This is not yet a probability model. To express it as a probability model, we make an assumption about the distribution of the residuals. We’ll start by assuming they’re normally distributed, \(\epsilon \sim \mathcal{N}(0, \sigma)\), where \(\sigma\) is another parameter: the standard deviation of the residuals.

Once we have made this distributional assumption, we can use it to express the linear model above in probability notation.

\[ y = X\beta + \epsilon \mbox{ and } \epsilon \sim \mathcal{N}(0, \sigma) \implies y \sim \mathcal{N}(X\beta, \sigma) \]

(This follows from the fact that adding \(X\beta\) to \(\epsilon\) shifts the mean of the distribution from 0 to \(X\beta\), while leaving its standard deviation \(\sigma\) unchanged.)

Note that the normal distribution \(\mathcal{N}(\mu, \sigma)\) is a two-parameter distribution. It has a mean \(\mu\) and standard deviation \(\sigma\). This does not stop us from estimating a model that has a function for each parameter that itself may have several parameters—in the case above, we’re using a linear function \(\mu = X\beta\) (with \(P\) regression coefficients in \(\beta\)) for the mean model.

The model \(y \sim \mathcal{N}(X\beta, \sigma)\) is the data model. If we knew the values of the parameters \(\beta\) and \(\sigma\) for certain, we could simulate plausible values of the outcome \(\hat{y}_{i}\) for a given set of covariates \(X_{i}\) simply by generating random numbers drawn from \(\mathcal{N}(X_{i}\beta, \sigma)\). (This gets at an important aspect of Bayesian models: there is no single predicted value, but rather a predictive distribution.) Yet we don’t know \(\beta\) and \(\sigma\) for certain, and want to generate probabilistic estimates for them.
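
To make this concrete, here is a minimal R sketch (with made-up parameter and covariate values, not from the example above) of drawing plausible outcomes for a single observation from the data model:

# A minimal sketch: if beta and sigma were known, simulating plausible outcomes
# for one observation is just a matter of drawing from N(X_i * beta, sigma).
# The values below are invented for illustration.
set.seed(1)
x_i <- c(1, 2.5, -0.3)          # one row of covariates
beta_known <- c(0.5, 1.2, -2)   # "known" regression coefficients
sigma_known <- 1.5              # "known" residual standard deviation

y_i_draws <- rnorm(1000, mean = sum(x_i * beta_known), sd = sigma_known)
quantile(y_i_draws, c(0.025, 0.5, 0.975)) # a predictive distribution, not a single prediction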

In Bayesian statistics we try to learn about a posterior distribution, in this case \(p(\beta, \sigma | y, X)\)—a joint distribution of the parameters given the data. According to Bayes’ rule, this distribution is proportional to the product of the likelihood and the prior distributions given to the parameters. That is

\[ p(\beta, \sigma | y, X) \propto p(y | \beta, \sigma, X)\times p(\beta, \sigma) \]

If we assume that the prior distributions of \(\beta\) and \(\sigma\) are independent, then \(p(\beta, \sigma) = p(\beta)\times p(\sigma)\), and this becomes

\[ p(\beta, \sigma | y, X) \propto p(y | \beta, \sigma, X)\times p(\beta)\times p(\sigma) \]

We have already made a modelling assumption about \(p(y | \beta, \sigma, X)\): it is our data model \(y \sim \mathcal{N}(X\beta, \sigma)\). Now we just need to specify priors for the parameters \(\beta\) and \(\sigma\).

For regression coefficients, we often use univariate or multivariate normal priors (we will use univariate priors below). The mean and scale of the prior distributions can be used to include information that is known before observing the data, such as parameter estimates from a meta-study, or a theoretically imposed value. It is also common to use so-called shrinkage priors or regularising priors. These are priors that shrink the estimate towards 0 (or a group mean). In many prediction tasks, shrinkage priors help prevent overfitting.

An important note on prior distributions: if a prior distribution puts no weight on a particular value of a parameter, the posterior distribution cannot put weight on that value either. This property lets us use prior distributions to restrict estimates to economically meaningful values. For instance, if we are estimating a very simple cost function \(\mbox{costs} = \alpha + \beta \mbox{quantity sold} + e\), we may want to exclude the possibility of zero or negative fixed costs (that is, we think that \(\alpha> 0\)). In such a case, we could give \(\alpha\) a prior distribution that is only defined for positive values, such as a lognormal or gamma distribution.

Similarly, the prior for \(\sigma\) should be constrained to positive numbers. A convenient prior for standard deviations is the half-Cauchy distribution (a Cauchy restricted to positive numbers), which provides some prior information but still allows for potentially large standard deviations.
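
As a rough illustration (my addition, with arbitrary scale values), the snippet below checks that a lognormal prior places zero density on non-positive values, and shows one way to draw from a half-Cauchy prior by taking absolute values of Cauchy draws:

# Priors restricted to positive values: a rough illustration with arbitrary scales.
# A lognormal (or gamma) prior puts zero density on values <= 0 ...
dlnorm(c(-1, 0), meanlog = 0, sdlog = 1)  # both are 0
# ... so the posterior can never place mass on negative fixed costs.

# One way to sample from a half-Cauchy(0, 2.5) prior for sigma:
sigma_prior_draws <- abs(rcauchy(1000, location = 0, scale = 2.5))
summary(sigma_prior_draws)  # mostly moderate values, but a very heavy right tail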

Putting it all together

We now have our probability model. It is made up of two prior distributions for the parameters (one for the \(\beta\)s, one for the \(\sigma\)), and a model for the data given the parameters.

Where \(\mu_{p}, \sigma_{p}, x_{0}\mbox{ and } \gamma\) summarise prior information and are chosen by the researcher, we have the priors:

\[ \mbox{for }p\in [1 \dots P] \mbox{ }\beta_{p} \sim\mathcal{N}(\mu_{p}, \sigma_{p}) \]

\[ \sigma \sim\mathcal{Cauchy}_{+}(x_{0}, \gamma) \]

And the data model

\[ y \sim\mathcal{N}(X\beta, \sigma) \]
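
To see how these pieces fit together as a generative model, here is a sketch (my addition) that draws one set of parameters from the priors and then a simulated outcome vector from the data model. The hyperparameter values and covariates are invented for illustration:

# One pass through the generative story for the simple model. The hyperparameters
# (mu_p = 0, sigma_p = 5, x_0 = 0, gamma = 2.5) and the covariates are assumptions
# made for illustration.
set.seed(123)
N_sim <- 50; P_sim <- 3
X_sim <- matrix(rnorm(N_sim * P_sim), N_sim, P_sim)

beta_draw <- rnorm(P_sim, mean = 0, sd = 5)  # beta_p ~ N(mu_p, sigma_p)
sigma_draw <- abs(rcauchy(1, 0, 2.5))        # sigma ~ Cauchy+(x_0, gamma)

y_sim <- rnorm(N_sim, mean = X_sim %*% beta_draw, sd = sigma_draw)  # y ~ N(X * beta, sigma)
head(y_sim)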

A slightly richer model

The illustration above shows how you could put together a simple probability model, but does not motivate why you might want to. The real power of probabilistic modelling is that it is not much harder to define far richer models than simple ones.

For instance, let’s say that we want to make two changes to the model above. First, we’d like to allow for the possibility that our outcome \(y\) comes from a so-called “fat-tailed” distribution like the Student’s t distribution. Second, we want to restrict the first element of \(\beta\) to be non-negative. We can define this slightly richer model like so:

Priors:

\[ \beta_{1} \sim\mathcal{N}_{+}(\mu_{1}, \sigma_{1}) \]

\[ \mbox{for }p\in [2 \dots P] \mbox{ }\beta_{p} \sim\mathcal{N}(\mu_{p}, \sigma_{p}) \]

\[ \sigma \sim\mathcal{Cauchy}_{+}(x_{0}, \gamma) \]

And the data model

\[ y \sim\mbox{Student's t}(\nu, X\beta, \sigma) \]

Here \(\nu\) is the degrees of freedom of the Student’s t distribution. We need to give this parameter a prior too, bounded below by 1. Because the Student’s t distribution approaches a normal distribution once \(\nu\) gets much above 20, we may want to centre the prior around 7, with a fairly wide spread.

\[ \nu \sim \mathcal{Cauchy}_{>1}(7, 10) \]
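
To get a feel for what different values of \(\nu\) imply, here is a quick comparison (my addition) of tail probabilities under the Student’s t and normal distributions:

# Tail probability P(|z| > 3) under Student's t for a few degrees of freedom,
# compared with a standard normal; the t tails shrink toward the normal as nu grows.
nus <- c(2, 7, 20, 100)
rbind(
  student_t = sapply(nus, function(nu) 2 * pt(-3, df = nu)),
  normal = rep(2 * pnorm(-3), length(nus))
)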

B) Simulating the model with known parameters (an introduction to Stan)

Now that we have two models, we should generate some simulated values of \(y\) for known parameters \(\theta = \{\beta, \sigma, \nu\}\) and covariates \(X\). While this would be trivial to do in your statistics package of choice, we’ll use this task as an opportunity to introduce Stan.

A Stan model is built up from code blocks, each with its own task. The data, parameters and model blocks should be present in all Stan programs (even if, as in the simulation script below, they are left empty), and any blocks that are used must appear in the order listed here:

  1. functions, where we define functions to be used in the blocks below
  2. data, where we declare the data to be used for the model
  3. transformed data, where we make transformations of the data passed in above
  4. parameters, where we define the unknowns to be estimated, including any restrictions on their values
  5. transformed parameters, where we define transformations of the parameters and data that are often easier to work with
  6. model, where the full probability model is defined
  7. generated quantities, where we generate a range of outputs from the model (posterior predictions, forecasts, values of loss functions, etc.)

Let’s write a toy model to generate simulated data.

First we need to load some libraries. I normally use dplyr for data munging, ggplot2 for plotting the parameters, rstan, which is R’s interface with Stan, and reshape2, which I use to manipulate parameter draws.

# Load necessary libraries and set up multi-core processing for Stan

library(dplyr); library(ggplot2); library(rstan); library(reshape2)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

Next we define the model we want to estimate (or in this case, simply run). Note that you can define this as a string, as I have done below, or as a text file that is saved with a .stan suffix. The script below is annotated. A few things to notice:

  • Stan programs are translated into C++, and so require static typing (that is, declaring the data and variables by their type).
  • All of Stan’s random number generators are the distribution name followed by _rng (for example, student_t_rng below).
  • A very complete account of the Stan language is available in the Stan Modeling Language User’s Guide and Reference Manual.
model_string <- "
data {
    // Define the data
    int N; // the number of observations
    int P; // the number of covariates
    matrix[N, P] X; // The covariate matrix

    // Seeing as we're simulating the model for known parameters, we should pass these too
    vector[P] beta; // The regression coefficients
    real sigma; // The standard deviation of the regression
    real nu; // The degrees of freedom parameter
}
parameters {
  // The parameters we want to estimate would go in here
}
model {
  // This is where the probability model we want to estimate would go
}
generated quantities {
    vector[N] yhat; // Define the outcome variable

    // This is where we generate outcomes for a given draw of parameters and the Xs.
    
    for(i in 1:N){
        yhat[i] <- student_t_rng(nu, X[i,]*beta, sigma);
    }
}
"

Next we create the inputs needed to simulate the data \(y\) (that is, since we’re drawing from \(p(y|\beta, \sigma, X)\), we need values for \(\beta\), \(\sigma\), \(\nu\) and \(X\)). We turn these into a data list and compile the Stan script defined above. Compilation should take around 20 seconds (but only needs to be done once).

# Generate a matrix of random numbers, and values for beta, nu and sigma

set.seed(42)
N <- 100
P <- 10
X <- matrix(rnorm(N*P), N, P)
nu <- 10
sigma <- 5
beta <- rnorm(10)
# Make sure the first element of beta is positive
beta[1] <- abs(beta[1])


# Put the data into a list
data_list <- list(N = N, P = P, X = X, sigma = sigma, beta = beta, nu = nu)

# Compile the model
compiled_model <- stan_model(model_code = model_string)

Finally, we run the script above to generate some plausible values for \(\hat{y}\). The output below summarises the “draws” of \(\hat{y}\) just as it would parameter estimates from the model. The reason for this will be apparent later on.

simulated_values <- sampling(compiled_model, data = data_list, iter = 1, chains = 1, algorithm = "Fixed_param")
## 
## SAMPLING FOR MODEL '90567997b672b20302055f20e532fcd2' NOW (CHAIN 1).
## 
## Chain 1, Iteration: 1 / 1 [100%]  (Sampling)# 
## #  Elapsed Time: 4e-06 seconds (Warm-up)
## #                5.7e-05 seconds (Sampling)
## #                6.1e-05 seconds (Total)
## #
simulated_values
## Inference for Stan model: 90567997b672b20302055f20e532fcd2.
## 1 chains, each with iter=1; warmup=0; thin=1; 
## post-warmup draws per chain=1, total post-warmup draws=1.
## 
##             mean se_mean sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## yhat[1]     4.48      NA NA   4.48   4.48   4.48   4.48   4.48     1  NaN
## yhat[2]     0.80      NA NA   0.80   0.80   0.80   0.80   0.80     1  NaN
## yhat[3]    16.95      NA NA  16.95  16.95  16.95  16.95  16.95     1  NaN
## yhat[4]     4.60      NA NA   4.60   4.60   4.60   4.60   4.60     1  NaN
## yhat[5]    -9.70      NA NA  -9.70  -9.70  -9.70  -9.70  -9.70     1  NaN
## yhat[6]    -0.47      NA NA  -0.47  -0.47  -0.47  -0.47  -0.47     1  NaN
## yhat[7]     0.85      NA NA   0.85   0.85   0.85   0.85   0.85     1  NaN
## yhat[8]     6.25      NA NA   6.25   6.25   6.25   6.25   6.25     1  NaN
## yhat[9]     2.67      NA NA   2.67   2.67   2.67   2.67   2.67     1  NaN
## yhat[10]    4.28      NA NA   4.28   4.28   4.28   4.28   4.28     1  NaN
## yhat[11]    2.66      NA NA   2.66   2.66   2.66   2.66   2.66     1  NaN
## yhat[12]    7.28      NA NA   7.28   7.28   7.28   7.28   7.28     1  NaN
## yhat[13]   -2.26      NA NA  -2.26  -2.26  -2.26  -2.26  -2.26     1  NaN
## yhat[14]    5.72      NA NA   5.72   5.72   5.72   5.72   5.72     1  NaN
## yhat[15]   -6.06      NA NA  -6.06  -6.06  -6.06  -6.06  -6.06     1  NaN
## yhat[16]    9.20      NA NA   9.20   9.20   9.20   9.20   9.20     1  NaN
## yhat[17]   -1.79      NA NA  -1.79  -1.79  -1.79  -1.79  -1.79     1  NaN
## yhat[18]   -7.47      NA NA  -7.47  -7.47  -7.47  -7.47  -7.47     1  NaN
## yhat[19]   -5.71      NA NA  -5.71  -5.71  -5.71  -5.71  -5.71     1  NaN
## yhat[20]   -2.44      NA NA  -2.44  -2.44  -2.44  -2.44  -2.44     1  NaN
## yhat[21]    7.49      NA NA   7.49   7.49   7.49   7.49   7.49     1  NaN
## yhat[22]  -17.82      NA NA -17.82 -17.82 -17.82 -17.82 -17.82     1  NaN
## yhat[23]    1.81      NA NA   1.81   1.81   1.81   1.81   1.81     1  NaN
## yhat[24]    4.96      NA NA   4.96   4.96   4.96   4.96   4.96     1  NaN
## yhat[25]   -7.97      NA NA  -7.97  -7.97  -7.97  -7.97  -7.97     1  NaN
## yhat[26]   -4.98      NA NA  -4.98  -4.98  -4.98  -4.98  -4.98     1  NaN
## yhat[27]    6.25      NA NA   6.25   6.25   6.25   6.25   6.25     1  NaN
## yhat[28]  -10.15      NA NA -10.15 -10.15 -10.15 -10.15 -10.15     1  NaN
## yhat[29]    2.17      NA NA   2.17   2.17   2.17   2.17   2.17     1  NaN
## yhat[30]   -1.02      NA NA  -1.02  -1.02  -1.02  -1.02  -1.02     1  NaN
## yhat[31]   -1.85      NA NA  -1.85  -1.85  -1.85  -1.85  -1.85     1  NaN
## yhat[32]    6.00      NA NA   6.00   6.00   6.00   6.00   6.00     1  NaN
## yhat[33]    0.76      NA NA   0.76   0.76   0.76   0.76   0.76     1  NaN
## yhat[34]   -0.54      NA NA  -0.54  -0.54  -0.54  -0.54  -0.54     1  NaN
## yhat[35]    8.56      NA NA   8.56   8.56   8.56   8.56   8.56     1  NaN
## yhat[36]  -19.86      NA NA -19.86 -19.86 -19.86 -19.86 -19.86     1  NaN
## yhat[37]    3.49      NA NA   3.49   3.49   3.49   3.49   3.49     1  NaN
## yhat[38]    5.07      NA NA   5.07   5.07   5.07   5.07   5.07     1  NaN
## yhat[39]  -13.13      NA NA -13.13 -13.13 -13.13 -13.13 -13.13     1  NaN
## yhat[40]    3.20      NA NA   3.20   3.20   3.20   3.20   3.20     1  NaN
## yhat[41]    2.24      NA NA   2.24   2.24   2.24   2.24   2.24     1  NaN
## yhat[42]   -4.01      NA NA  -4.01  -4.01  -4.01  -4.01  -4.01     1  NaN
## yhat[43]    9.43      NA NA   9.43   9.43   9.43   9.43   9.43     1  NaN
## yhat[44]    6.55      NA NA   6.55   6.55   6.55   6.55   6.55     1  NaN
## yhat[45]   -4.83      NA NA  -4.83  -4.83  -4.83  -4.83  -4.83     1  NaN
## yhat[46]    0.52      NA NA   0.52   0.52   0.52   0.52   0.52     1  NaN
## yhat[47]    3.70      NA NA   3.70   3.70   3.70   3.70   3.70     1  NaN
## yhat[48]    9.35      NA NA   9.35   9.35   9.35   9.35   9.35     1  NaN
## yhat[49]   -3.36      NA NA  -3.36  -3.36  -3.36  -3.36  -3.36     1  NaN
## yhat[50]    3.74      NA NA   3.74   3.74   3.74   3.74   3.74     1  NaN
## yhat[51]   12.30      NA NA  12.30  12.30  12.30  12.30  12.30     1  NaN
## yhat[52]    2.78      NA NA   2.78   2.78   2.78   2.78   2.78     1  NaN
## yhat[53]   -2.19      NA NA  -2.19  -2.19  -2.19  -2.19  -2.19     1  NaN
## yhat[54]    7.84      NA NA   7.84   7.84   7.84   7.84   7.84     1  NaN
## yhat[55]   11.45      NA NA  11.45  11.45  11.45  11.45  11.45     1  NaN
## yhat[56]    4.68      NA NA   4.68   4.68   4.68   4.68   4.68     1  NaN
## yhat[57]    6.31      NA NA   6.31   6.31   6.31   6.31   6.31     1  NaN
## yhat[58]  -12.40      NA NA -12.40 -12.40 -12.40 -12.40 -12.40     1  NaN
## yhat[59]  -29.22      NA NA -29.22 -29.22 -29.22 -29.22 -29.22     1  NaN
## yhat[60]    1.55      NA NA   1.55   1.55   1.55   1.55   1.55     1  NaN
## yhat[61]   -6.72      NA NA  -6.72  -6.72  -6.72  -6.72  -6.72     1  NaN
## yhat[62]   -2.56      NA NA  -2.56  -2.56  -2.56  -2.56  -2.56     1  NaN
## yhat[63]   10.48      NA NA  10.48  10.48  10.48  10.48  10.48     1  NaN
## yhat[64]   -9.98      NA NA  -9.98  -9.98  -9.98  -9.98  -9.98     1  NaN
## yhat[65]    0.91      NA NA   0.91   0.91   0.91   0.91   0.91     1  NaN
## yhat[66]   12.30      NA NA  12.30  12.30  12.30  12.30  12.30     1  NaN
## yhat[67]   -8.02      NA NA  -8.02  -8.02  -8.02  -8.02  -8.02     1  NaN
## yhat[68]   -1.96      NA NA  -1.96  -1.96  -1.96  -1.96  -1.96     1  NaN
## yhat[69]    0.90      NA NA   0.90   0.90   0.90   0.90   0.90     1  NaN
## yhat[70]   -0.56      NA NA  -0.56  -0.56  -0.56  -0.56  -0.56     1  NaN
## yhat[71]  -11.31      NA NA -11.31 -11.31 -11.31 -11.31 -11.31     1  NaN
## yhat[72]    1.16      NA NA   1.16   1.16   1.16   1.16   1.16     1  NaN
## yhat[73]    3.88      NA NA   3.88   3.88   3.88   3.88   3.88     1  NaN
## yhat[74]   11.22      NA NA  11.22  11.22  11.22  11.22  11.22     1  NaN
## yhat[75]    0.36      NA NA   0.36   0.36   0.36   0.36   0.36     1  NaN
## yhat[76]   15.32      NA NA  15.32  15.32  15.32  15.32  15.32     1  NaN
## yhat[77]    2.10      NA NA   2.10   2.10   2.10   2.10   2.10     1  NaN
## yhat[78]   -6.64      NA NA  -6.64  -6.64  -6.64  -6.64  -6.64     1  NaN
## yhat[79]   -1.38      NA NA  -1.38  -1.38  -1.38  -1.38  -1.38     1  NaN
## yhat[80]   -1.17      NA NA  -1.17  -1.17  -1.17  -1.17  -1.17     1  NaN
## yhat[81]   -2.34      NA NA  -2.34  -2.34  -2.34  -2.34  -2.34     1  NaN
## yhat[82]   -2.94      NA NA  -2.94  -2.94  -2.94  -2.94  -2.94     1  NaN
## yhat[83]    8.74      NA NA   8.74   8.74   8.74   8.74   8.74     1  NaN
## yhat[84]    7.40      NA NA   7.40   7.40   7.40   7.40   7.40     1  NaN
## yhat[85]   -0.73      NA NA  -0.73  -0.73  -0.73  -0.73  -0.73     1  NaN
## yhat[86]   -0.60      NA NA  -0.60  -0.60  -0.60  -0.60  -0.60     1  NaN
## yhat[87]   13.00      NA NA  13.00  13.00  13.00  13.00  13.00     1  NaN
## yhat[88]    5.96      NA NA   5.96   5.96   5.96   5.96   5.96     1  NaN
## yhat[89]    8.87      NA NA   8.87   8.87   8.87   8.87   8.87     1  NaN
## yhat[90]  -11.55      NA NA -11.55 -11.55 -11.55 -11.55 -11.55     1  NaN
## yhat[91]    4.06      NA NA   4.06   4.06   4.06   4.06   4.06     1  NaN
## yhat[92]   -7.92      NA NA  -7.92  -7.92  -7.92  -7.92  -7.92     1  NaN
## yhat[93]    2.32      NA NA   2.32   2.32   2.32   2.32   2.32     1  NaN
## yhat[94]    4.87      NA NA   4.87   4.87   4.87   4.87   4.87     1  NaN
## yhat[95]   18.48      NA NA  18.48  18.48  18.48  18.48  18.48     1  NaN
## yhat[96]   -6.65      NA NA  -6.65  -6.65  -6.65  -6.65  -6.65     1  NaN
## yhat[97]    3.57      NA NA   3.57   3.57   3.57   3.57   3.57     1  NaN
## yhat[98]   -5.53      NA NA  -5.53  -5.53  -5.53  -5.53  -5.53     1  NaN
## yhat[99]   -8.26      NA NA  -8.26  -8.26  -8.26  -8.26  -8.26     1  NaN
## yhat[100]  -2.54      NA NA  -2.54  -2.54  -2.54  -2.54  -2.54     1  NaN
## lp__        0.00      NA NA   0.00   0.00   0.00   0.00   0.00     1  NaN
## 
## Samples were drawn using (diag_e) at Fri Feb 26 11:54:08 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

We can turn this somewhat strange data format into something more usable like so:

# Extract the single draw of yhat as a plain numeric vector
yhat <- plyr::ldply(extract(simulated_values, pars = "yhat", permuted = F))$V1
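
A quick sanity check (my addition) that the simulated outcomes behave sensibly relative to the linear predictor we fed in:

# Optional sanity checks on the simulated outcomes
length(yhat) == N  # should be TRUE: one simulated outcome per observation
plot(X %*% beta, yhat,
     xlab = "X * beta (true conditional mean)", ylab = "Simulated outcome")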

Recovering known parameters from simulated data

Now we have our known parameters beta, nu and sigma, our right-hand-side variables \(X\), and our simulated outcomes \(y\). Let’s estimate two models to see whether we can recover the known parameters. The two models we’ll estimate are those described above: one assumes normal errors, the other assumes Student t errors with a positive \(\beta_{1}\).

First we need to write out the two models to be estimated. These will take a similar form to the script above, although now we will specify the parameters to be estimated and the probability model.

First we define the incorrectly specified model:

incorrect_model <- "
data {
  int N; // number of observations
  int P; // number of covariates
  matrix[N, P] X; //covariate matrix
  vector[N] y; //outcome vector
}
parameters {
  vector[P] beta; // the regression coefficients
  real<lower = 0> sigma; // the residual standard deviation (note that it's restricted to be non-negative)
}
model {
  // Define the priors
  beta ~ normal(0, 5); // same prior for all betas; we could define a different one for each, or use a multivariate prior
  sigma ~ cauchy(0, 2.5);

  // The likelihood
  y ~ normal(X*beta, sigma);
}
generated quantities {
  // For model comparison, we'll want to keep the likelihood contribution of each point
  vector[N] log_lik;
  for(i in 1:N){
    log_lik[i] <- normal_log(y[i], X[i,]*beta, sigma);
  }
}
"

Next, the correctly specified model:

correct_model <- "
data {
  int N; // number of observations
  int P; // number of covariates
  matrix[N, P] X; //covariate matrix
  vector[N] y; //outcome vector
}
parameters {
  // We need to define two betas--the first is the restricted value, the next are the others. We'll join these in the next block
  real<lower = 0> beta_1;
  vector[P-1] beta_2; // the regression coefficients
  real<lower = 0> sigma; // the residual standard deviation (note that it's restricted to be non-negative)
  real<lower = 1> nu; 
}
transformed parameters {
  vector[P] beta;
  beta <- append_row(rep_vector(beta_1, 1), beta_2);
}
model {
  // Define the priors
  beta ~ normal(0, 5); // same prior for all betas; we could define a different one for each, or use a multivariate prior. The first beta will have a prior of the N+(0, 5)
  sigma ~ cauchy(0, 2.5);
  nu ~ cauchy(7, 10);

  // The likelihood
  y ~ student_t(nu, X*beta, sigma);
}
generated quantities {
  // For model comparison, we'll want to keep the likelihood contribution of each point
  vector[N] log_lik;
  for(i in 1:N){
    log_lik[i] <- student_t_log(y[i], nu, X[i,]*beta, sigma);
  }
}
"

Now estimate the models

data_list_2 <- list(X = X, N = N, y = yhat, P = P)
incorrect_fit <- stan(model_code = incorrect_model, data = data_list_2, cores = 4, verbose = F)
correct_fit <- stan(model_code = correct_model, data = data_list_2, cores = 4, verbose = F)

And print the parameter estimates so that we can compare them with the values we used to simulate the data

print(incorrect_fit, pars = c("beta", "sigma"))
## Inference for Stan model: 390850dbea08014475adce8a9dc05d89.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##           mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## beta[1]   3.65    0.01 0.59  2.50  3.26  3.66  4.04  4.80  4000    1
## beta[2]   0.46    0.01 0.69 -0.88  0.00  0.46  0.93  1.82  4000    1
## beta[3]   1.94    0.01 0.61  0.76  1.55  1.93  2.33  3.16  4000    1
## beta[4]  -0.96    0.01 0.69 -2.37 -1.42 -0.95 -0.49  0.35  4000    1
## beta[5]  -1.53    0.01 0.61 -2.77 -1.95 -1.51 -1.11 -0.34  4000    1
## beta[6]  -0.36    0.01 0.59 -1.51 -0.77 -0.36  0.04  0.78  4000    1
## beta[7]   0.59    0.01 0.62 -0.57  0.14  0.58  1.01  1.80  4000    1
## beta[8]  -3.35    0.01 0.67 -4.69 -3.81 -3.34 -2.91 -2.06  4000    1
## beta[9]  -1.51    0.01 0.59 -2.71 -1.90 -1.50 -1.11 -0.35  4000    1
## beta[10]  0.23    0.01 0.51 -0.78 -0.11  0.23  0.57  1.22  4000    1
## sigma     5.97    0.01 0.46  5.16  5.65  5.93  6.25  6.96  4000    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Feb 26 11:54:37 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
print(correct_fit, pars = c("beta", "sigma", "nu"))
## Inference for Stan model: 2394ebf854ef103bf7ccaad1d847800e.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##           mean se_mean     sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## beta[1]   3.82    0.01   0.59  2.64  3.43  3.83  4.23  4.94  3397    1
## beta[2]   0.51    0.01   0.69 -0.86  0.05  0.51  0.99  1.84  3918    1
## beta[3]   1.84    0.01   0.61  0.66  1.44  1.84  2.25  3.02  4000    1
## beta[4]  -0.87    0.01   0.71 -2.27 -1.35 -0.87 -0.40  0.52  4000    1
## beta[5]  -1.55    0.01   0.62 -2.72 -1.98 -1.56 -1.13 -0.34  4000    1
## beta[6]  -0.30    0.01   0.61 -1.48 -0.70 -0.30  0.12  0.89  4000    1
## beta[7]   0.77    0.01   0.62 -0.43  0.37  0.78  1.18  1.92  4000    1
## beta[8]  -3.39    0.01   0.65 -4.62 -3.86 -3.40 -2.95 -2.13  4000    1
## beta[9]  -1.55    0.01   0.58 -2.69 -1.94 -1.55 -1.16 -0.42  4000    1
## beta[10]  0.19    0.01   0.53 -0.85 -0.18  0.21  0.56  1.21  4000    1
## sigma     5.36    0.01   0.54  4.32  4.98  5.35  5.73  6.47  2647    1
## nu       27.80    5.48 240.10  3.89  7.71 11.43 18.96 83.66  1918    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Feb 26 11:55:07 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
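
The print() output doesn’t show the values we used to simulate the data, so here is one quick way (a sketch, using the objects created above) to line the posterior means up against the true parameters:

# Compare posterior means from each fit with the values used to simulate the data
comparison <- data.frame(
  true = c(beta, sigma),
  incorrect = colMeans(as.matrix(incorrect_fit, pars = c("beta", "sigma"))),
  correct = colMeans(as.matrix(correct_fit, pars = c("beta", "sigma")))
)
round(comparison, 2)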

Notice that the parameter estimates printed above come in the same format as the simulated values earlier. When we call print() on a stanfit object, it generates summary statistics for all parameters and generated quantities. In Markov chain Monte Carlo (MCMC) methods, the posterior distribution \(p(\theta|y)\) is estimated by generating a large number of draws from the distribution (there is no analytical form from which to calculate the relevant statistics). For each draw of the parameters, the values in the generated quantities block are recomputed, so each iteration’s generated quantities make use of that iteration’s parameter estimates. Once we have a large number of draws from the distribution, we can calculate the relevant statistics of those draws (print() returns the mean, standard deviation, and quantiles). But we also want to know whether those draws are reliable.
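
Under the hood, print() is simply summarising these draws; we can also work with them directly. For instance (a sketch using the fitted objects above):

# Work with the posterior draws directly and reproduce print()'s summaries
beta_draws <- extract(correct_fit, pars = "beta")$beta        # matrix: draws x P
apply(beta_draws, 2, mean)                                    # posterior means
apply(beta_draws, 2, quantile, probs = c(0.025, 0.5, 0.975))  # credible intervals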

The print() method also calculates a couple of statistics that help us tell when the estimates are unreliable. The first is n_eff, which tells us the effective number of independent draws that we have. Some MCMC algorithms (like Gibbs sampling or Metropolis-Hastings, which Stan does not implement) are prone to drawing serially correlated estimates of the parameters: one iteration’s estimates are highly dependent on the previous iteration’s. When this happens, the