Keywords: Stochastic Processes, Simulations, Bayesian Inference, Gibbs Sampling, Financial Applications, Markov Chain Monte Carlo methods, R Statistical Software, WinBUGS
Abstract: We illustrate the use of Markov Chains and Gibbs Sampling in modelling stochastic processes with three simple applications: (a) Obtaining the distribution of returns of a portfolio of two correlated stocks (modeled as normal distributions) (b) Using Bayesian Inference (a Beta Binomial conjugate pair model) to estimate the prevalence of spam email and (c) Fitting a t-distribution to the monthly returns of a stock (using WinBUGS).
A Stochastic Process can be defined as a sequence of random variable “steps”, \(X_1, X_2, \ldots, X_n\), where the indices represent the discrete steps. A Markov Chain is a model for stochastic processes in which each step depends only on the step immediately before it (single-step dependency). A Gibbs Sampler is a simulated Markov Chain constructed so that its limiting distribution is a desired target distribution. Especially in Bayesian Inference, Gibbs Sampling is a computational tool used where an analytical solution is not readily available. In our first example we illustrate a Gibbs Sampler without Bayesian Inference. In the second example, we illustrate the Gibbs Sampler with a Bayesian model, and finally, in our third example, we use WinBUGS to fit a t-distribution, a problem that is challenging to solve analytically.
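To make the single-step dependency concrete, here is a minimal R sketch (illustrative only; the two-state transition matrix P is an assumption, not part of the examples that follow) that simulates a Markov Chain and shows its long-run behaviour:
# Simulate a two-state Markov Chain: each step depends only on the previous one
P = matrix(c(0.9, 0.1,
             0.3, 0.7), nrow = 2, byrow = TRUE) # assumed transition probabilities
N = 10000
x = numeric(N); x[1] = 1 # initial state
set.seed(1)
for (i in 2:N) {
  x[i] = sample(1:2, 1, prob = P[x[i-1], ]) # next state depends only on x[i-1]
}
table(x)/N # long-run proportions approximate the limiting distribution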
We consider a spam filter and use a Bayesian Inference model to describe the process. Our goal is to estimate the prevalence \(\psi\) of spam without having to directly count the instances of spam. Our model is represented as follows:
Events:
\(\{S=1\}\): Email is in fact Spam. \(\{S=0\}\): Email is not Spam
\(\{R=1\}\): Email is marked as Spam. \(\{R=0\}\) Email is marked as NOT Spam
Parameters and definitions:
Sensitivity: \(\eta = P\{R=1|S=1\} = 0.90\)
Specificity: \(\theta = P\{R=0|S=0\} = 0.95\)
Prevalence: \(\psi = P\{S=1\}\). Then by the fundamental laws of probability, we can write:
\[ \tau = P\{R=1\} = P\{R=1,S=1\} + P\{R=1,S=0\} = \psi\eta + (1-\psi)(1-\theta)\]
It is easy to count the number of rejected emails to estimate the probability \(\tau\). However, it would be far more labour-intensive to count the number of spam emails directly. Noting that \(\tau = \psi(\eta+\theta-1) + (1-\theta)\), we can solve the equation above for the prevalence \(\psi\), which is usually the parameter of interest:
\[ \psi = \frac{\tau + \theta -1}{\eta +\theta -1}\]
Let’s assume, for the sake of example, that we mark 233 emails as spam out of a total of 1000 emails. In R code, this translates to:
# Assumptions and Parameters
# {S=1} email is spam; {S=0} not spam
# {R=1} email blocked; {R=0} passed
eta = 0.90 # Sensitivity
theta = 0.95 # Specificity
r = 233 # Rejected emails
n = 1000 # Total number of emails
tau = r/n # Estimate of rejection rate P{R=1}
print(tau)
## [1] 0.233
psi = (tau + theta - 1)/(eta + theta -1)
psi
## [1] 0.2152941
We can then build a traditional frequentist confidence interval (CI) as \(\tau \pm 1.96\sqrt{\tau(1-\tau)/n}\). Again, in R, this is:
tau.l = tau - 1.96 * sqrt(tau*(1-tau)/n)
tau.u = tau + 1.96 * sqrt(tau*(1-tau)/n)
round(c(tau.l, tau.u),3)
## [1] 0.207 0.259
# Transform the tau endpoints into the psi interval through the psi equation above
psi.l = (tau.l + theta - 1)/(eta + theta -1)
psi.u = (tau.u + theta - 1)/(eta + theta -1)
round(c(psi.l, psi.u),3)
## [1] 0.184 0.246
In many situations, this procedure for getting point and interval estimates of \(\psi\) works well. However, especially when n is relatively small, there are surprisingly many cases in which the displayed formula yields nonsensical point estimates of \(\psi\) outside the interval (0, 1). For example, consider:
eta = 0.99; theta = 0.97
r = 5; n = 250
tau = r/n; tau
## [1] 0.02
psi = (tau + theta - 1)/(eta + theta -1)
psi
## [1] -0.01041667
And we see that the resulting estimate of psi is not a valid probability: it is negative.
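To see how often this can happen, here is a small simulation sketch (the true prevalence psi.true is an assumed value for illustration): we repeatedly draw rejection counts and record the fraction of plug-in estimates that fall outside (0, 1).
# Simulation sketch: how often does the plug-in estimator leave (0,1)?
set.seed(1)
eta = 0.99; theta = 0.97
psi.true = 0.02 # assumed true prevalence, for illustration
tau.true = psi.true*eta + (1 - psi.true)*(1 - theta)
n = 250
r.sim = rbinom(10000, n, tau.true) # simulated rejection counts
psi.hat = (r.sim/n + theta - 1)/(eta + theta - 1)
mean(psi.hat < 0 | psi.hat > 1) # fraction of invalid estimates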
To introduce the Bayesian Inference method, we begin with a straightforward example where Gibbs Sampling is not required. We want to estimate \(\tau\), and we know from Bayes’ Theorem that
\[ Posterior \propto Prior \times Likelihood\] We begin with a non-informative prior for \(\tau\): \(U(0,1)\), that is, \(Beta(\alpha_0=1,\beta_0=1)\). The posterior for \(\tau\) can then be written as
\[ \tau \sim Beta(\alpha_0=1,\beta_0=1)\] \[p(\tau|r) \propto p(\tau) \times p(r|\tau)\]
\[ \propto \tau^{\alpha_0-1}(1-\tau)^{\beta_0-1} \times \tau^r(1-\tau)^{n-r}\]
\[ = \tau^{\alpha_0+r-1}(1-\tau)^{\beta_0+n-r-1}\]
where the right-hand side reveals itself readily as the kernel of the \(Beta(\alpha_n,\beta_n)\) density. This functional “compatibility” between the Beta (prior and posterior) and Binomial (likelihood) families is what leads them to be called a conjugate pair in Bayesian Inference. Then, if \(n=1000\) and \(r=233\), we define
\[ \alpha_n = \alpha_0 + r = 1 + 233 = 234\] \[\beta_n = \beta_0 + n - r = 1 + 1000 - 233 = 768\] We can thus use the resulting Beta density posterior for \(p(\tau|r)\) to derive the mean and CI for \(\tau\). In R:
alpha.0 = 1; beta.0 = 1 # Prior Information
n = 1000; r = 233 # Data
alpha.n = alpha.0 +r # Posterior parameters
beta.n = beta.0 + n - r
alpha.n/(alpha.n+beta.n) # Posterior density mean
## [1] 0.2335329
qbeta(c(.025,.975),alpha.n,beta.n) # 95% credible interval for tau
## [1] 0.2078629 0.2602105
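As a quick visual check (an added sketch, not part of the original analysis), we can plot the Beta posterior density and mark the 95% credible limits:
# Sketch: Beta posterior for tau with its 95% credible limits
curve(dbeta(x, alpha.n, beta.n), from = 0.15, to = 0.32, lwd = 2,
      xlab = "tau", ylab = "Posterior density")
abline(v = qbeta(c(.025, .975), alpha.n, beta.n), lty = 2)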
Next we use a Gibbs Sampler to obtain point and interval estimates for \(\psi\). Unlike the estimate for \(\tau\) (where we were able to count the \(r\) spam-marked emails), here we need a Gibbs Sampler because it is not feasible to count the spam emails directly. Define \[ p = P\{S=1|R=1\} = P\{S=1,R=1\}/P\{R=1\} = \psi\eta/\tau\] and \[ q = P\{S=1|R=0\} = \psi(1-\eta)/(1-\tau)\] Let X be the number among the r rejected emails that are indeed spam, and let Y be the number of spam emails among the n-r emails passed through. We can simulate these latent counts X and Y using
\[X|r,\psi \sim Binom(r,p)\]
\[ Y|r,\psi \sim Binom(n-r,q)\]
\[ \psi|X,Y \sim Beta(\alpha_n,\beta_n)\] where we consider \(V=X+Y\), with \(\alpha_n=\alpha_0+V\) and \(\beta_n=\beta_0+n-V\).
We use one value of \(\psi\) to find the next in this iterative manner. This is, in effect, a Markov Chain whose limiting distribution is the posterior of \(\psi\). At each iteration we use the prior distribution and the data, and this continued, recursive use of information drives the simulated values toward the appropriate posterior distribution. We implement this Gibbs Sampler below:
eta = 0.90; theta = 0.95 # Restore the original scenario
r = 233; n = 1000 # (these were overwritten in the small-n example above)
N = 10000 # Number of iterations
Nb = 2000 # Burn-in
psi = numeric(N)
psi[1] = .5 # Initial value
# Gibbs Sampler Loop
for(i in 2:N) {
tau = psi[i-1]*eta + (1-psi[i-1])*(1-theta) # P{R=1} given current psi
X = rbinom(1, r, psi[i-1]*eta/tau) # spam among the r rejected emails
Y = rbinom(1, n-r, psi[i-1]*(1-eta)/(1-tau)) # spam among the n-r passed emails
psi[i] = rbeta(1, alpha.0+X+Y, beta.0+n-X-Y) # draw psi given X and Y
}
mean(psi[(Nb+1):N]) # Posterior mean of the post burn-in draws
## [1] 0.2118245
par(mfrow=c(2,2))
hist(psi) # posterior histogram
plot(1:N, cumsum(psi)/(1:N), type="l", ylab="psi", ylim=c(0.20,0.24)) # running mean
plot(psi, type='p', pch='.', ylim=c(0.15,0.30)) # trace of the draws
acf(psi) # autocorrelation
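In addition to the plots, the effective sample size of the post burn-in draws gives a numeric check on mixing (an added sketch; effectiveSize is from the coda package, which R2WinBUGS also loads later):
# Sketch: effective sample size of the post burn-in draws
library(coda)
effectiveSize(psi[(Nb+1):N])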
The visual diagnostics (the trace and ACF plots at the bottom) show proper mixing, and the Markov model assumptions appear to be respected. The running mean of psi[i] also converges to a reasonable value. Further, we can obtain an empirical Bayesian credible interval from the post burn-in psi[i]:
psi.post <- psi[(Nb+1):N] # keep only the post burn-in draws
psi.sort <- sort(psi.post)
L = as.integer(.05/2 * length(psi.post))
U = as.integer((1 - .05/2) * length(psi.post))
psi.L = psi.sort[L]
psi.U = psi.sort[U]
round(c(psi.L,psi.U),3)
## [1] 0.184 0.222
We can see that this interval contains the point estimate obtained above through the Gibbs Sampler.
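For reference, essentially the same interval can be obtained in one line with R’s quantile() function (a minor alternative to the sorting approach above):
# Equivalent interval via quantile() on the post burn-in draws
round(quantile(psi[(Nb+1):N], c(.025, .975)), 3)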
In Section 1 we modeled monthly stock returns as Normal. Here we use the more realistic t-distribution, which better captures the outliers frequently found in stock return data (“fat tails”, excess kurtosis). We use the popular WinBUGS (Bayesian inference Using Gibbs Sampling) software:
Below we call the WinBUGS/OpenBUGS libraries from within R:
library(R2WinBUGS); library(R2OpenBUGS)
## Loading required package: coda
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked _by_ '.GlobalEnv':
##
## tau
##
## Attaching package: 'R2OpenBUGS'
## The following objects are masked from 'package:R2WinBUGS':
##
## as.bugs.array, attach.all, attach.bugs, bugs, bugs.data,
## bugs.log, detach.all, detach.bugs, monitor, read.bugs,
## write.model
data(CRSPmon, package="Ecdat")
ibm = CRSPmon[,2] # monthly IBM returns
y = as.numeric(ibm) # response vector for the BUGS model
N = length(y)
ibm_data=list("y","N")
inits = function(){ list(mu=rnorm(1,0,.3), tau=runif(1,1,10), nu=runif(1,1,30)) } # random initial values for each chain
univt.mcmc = bugs(ibm_data,inits,model.file="Tbrate_t.txt",
parameters=c("mu","tau","nu","sigma"),
n.chains=3,n.iter=2600,n.burnin=100,
n.thin=1,
bugs.seed=10)
print(univt.mcmc,digits=4)
## Inference for Bugs model at "Tbrate_t.txt",
## Current: 3 chains, each with 2600 iterations (first 100 discarded)
## Cumulative: n.sims = 7500 iterations saved
## mean sd 2.5% 25% 50% 75%
## mu 0.0094 0.0035 0.0027 0.0071 0.0094 0.0118
## tau 240.9409 24.3706 199.4000 224.0000 238.7000 255.0000
## nu 25.0819 12.0561 7.3279 14.7775 23.1500 34.7500
## sigma 0.0647 0.0032 0.0581 0.0626 0.0647 0.0668
## deviance -916.4080 2.1635 -919.1000 -917.9000 -916.9000 -915.5000
## 97.5% Rhat n.eff
## mu 0.0165 1.0020 1600
## tau 295.9050 1.0042 770
## nu 48.2152 1.0107 200
## sigma 0.0708 1.0042 770
## deviance -910.7000 1.0042 780
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = Dbar-Dhat)
## pD = 2.0350 and DIC = -914.4000
## DIC is an estimate of expected predictive error (lower deviance is better).
plot(univt.mcmc)
# WinBUGS model (Tbrate_t.txt):
# model{
#   for(i in 1:N){
#     y[i] ~ dt(mu, tau, nu)  # t likelihood: location mu, precision tau, df nu
#   }
#   mu ~ dnorm(0.0, 1.0E-6)   # vague Normal prior on the location (mean)
#   tau ~ dgamma(0.1, .01)    # Gamma prior on the precision (1/scale^2)
#   sigma <- 1/sqrt(tau)      # scale, derived from the precision
#   nu ~ dunif(2, 50)         # Uniform prior on the degrees of freedom
# }
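For reproducibility, the model file referenced by the bugs() call could be written from within R, for example (a sketch; the file name matches the one assumed in the call above):
# Sketch: write the BUGS model to the file used by bugs()
model.txt <- "
model{
  for(i in 1:N){
    y[i] ~ dt(mu, tau, nu)
  }
  mu ~ dnorm(0.0, 1.0E-6)
  tau ~ dgamma(0.1, .01)
  sigma <- 1/sqrt(tau)
  nu ~ dunif(2, 50)
}"
writeLines(model.txt, "Tbrate_t.txt")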
library(metRology)
##
## Attaching package: 'metRology'
## The following objects are masked from 'package:base':
##
## cbind, rbind
x=seq(from = -0.3, to =0.3, by = .01)
hist(ibm, freq = FALSE, col="skyblue2")
# Overlay the fitted scaled-t density on the histogram, using the posterior
# means from the WinBUGS fit: mu = .0094, df (nu) = 25, sigma = 0.0647
curve(dt.scaled(x, df = 25, mean = 0.0094, sd = 0.0647), lwd = 2, col = "red", add = TRUE)
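To visualize the heavier tails, one could additionally overlay a Normal density with the sample mean and standard deviation (an added sketch, not part of the original figure):
# Sketch: compare with a Normal fit to highlight the t-distribution's heavier tails
curve(dnorm(x, mean = mean(y), sd = sd(y)), lwd = 2, lty = 2, col = "blue", add = TRUE)
legend("topright", legend = c("fitted t", "Normal"), col = c("red", "blue"),
       lwd = 2, lty = c(1, 2))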