Bayesian models can be large and therefore confront challenges that require foundational understanding of distribution theory. Bayes’ theorem starts with jointly distributed random variables [data, parameters] and moves to a conditional distribution [parameters | data]. Conditional distributions are fundamental components of algorithms that are used to compute the posterior distribution.
Logistics
Today
- review forest inventory vignette B
- introduce basic probability concepts
Readings
- Hersh et al., Evaluating the impacts of fungal seedling pathogens on temperate forest seedling survival, Ecology, on the joint, conditional, and predictive distributions used in the fungal infection example.
Objectives
- Understand key concepts:
- discrete and continuous densities, probability mass functions, and probability density functions
- support for a distribution
- moments of a distribution
- Apply basic rules to manipulate jointly distributed random variables:
- total probability
- Bayes theorem
Why distribution theory?
‘I trust my oncologist to know the probabilities. I don’t trust him to put them together.’ –climatologist Steve Schneider on his cancer treatment
Consider the following scenario:
I get tested for covid-19, the test is positive, and I want to know if I can expect symptoms. I am told that 4% of the population in my area is infectious, 55% of covid infections result in symptoms, the false positive rate is 0.5%, and the false negative rate is 9%. And if I do not develop symptoms, was the test wrong (a false positive)?
A graph displays the key elements of the covid-testing scenario.
Probability distributions are the building blocks of hierarchical models, the boxes in the graphical model. The notion of support that a distribution provides for a random variable is needed in order to put distributions together, e.g., when \(\theta\) is random in one distribution and temporarily fixed in another, as in \([x | \theta][\theta]\). If I go no further than stitching together a model specification for rjags, I still need to know some basic distribution theory. The next two sections describe some relationships between a data model \([x | \theta]\) and a process model \([\theta| \cdot]\). The example for seedling pathogens that follows applies these concepts to unknown states and parameters.
And, by the way, the basic rules from this vignette can answer the covid-testing question, which reappears at the end.
A distribution for the data
The likelihood is a data model, i.e., observations given parameters, \([x|\theta]\). The sample space is the set of all possible values of \(x\) that could be observed. The support of a distribution is simply the set of \(x\) values having non-zero probability or density.
If my data model for observed \(x\) depends on a parameter \(\theta\), I might write that distribution generically as \([x|\theta]\). For this to be a valid probability model for \(x\) this distribution has to satisfy a few rules concerning support:
- It must allow non-zero probability or density for any possible observed \(x\), the sample space.
If \(x\) represents basal area of trees, then the probability model cannot be Gaussian (normal) or log-normal, because neither of these distributions accepts point mass (discrete values) at zero.
- It must not assign probability or density to values that could not be observed.
The probability model for basal area cannot be Gaussian, which assigns non-zero density to negative values.
- Total probability: the summed (discrete \(x\)) or integrated (continuous \(x\)) probability model must equal 1.
A Poisson distribution, \(Poi(x | \lambda)\), can be used for count data, because it supports non-negative integers, \(x \in \{0, 1, 2, \dots \}\).
A binomial distribution, \(binom(x |n, \theta)\), can be used for binary outcomes, because it supports non-negative integers up to the number of trials, \(n\), i.e., \(x \in \{0, 1, \dots, n \}\).
A Gaussian (normal) distribution, \(N(x | \mu, \sigma^2)\), supports numbers anywhere on the real line, \((-\infty, \infty)\). It cannot be used for counts, because it supports non-integer values.
An exponential distribution, \(Exp(x | \rho)\), can be used for times-to-events, because it supports numbers anywhere on the positive real line, \([0, \infty)\).
The first step in building a model is to identify a distribution that supports observed \(x\).
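The base R density functions make support easy to check: they return zero (sometimes with a warning) for values outside a distribution's support. A minimal sketch, with arbitrary parameter values:

dpois(2.5, lambda = 3)        # 0, with a warning: non-integer counts are unsupported
dexp(-1, rate = 1)            # 0: negative times-to-event are unsupported
dnorm(-1, mean = 0, sd = 1)   # positive: the normal supports negative values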
A distribution for the process
A process model links to the data model, \([x | \theta] \times [\theta | \cdot]\).
Models are constructed hierarchically. Now that I have a data model \([x | \theta]\), I need a process model for \(\theta\),
\[ [x | \theta] \times [\theta | \mbox{process} ] \]
To get specific, suppose I have counts, and I decide to model them with a Poisson distribution, \(Poi(x|\theta)\). I now require a distribution that can describe parameter \(\theta\) as the consequence of some process. At minimum, this process is simply a combination of variables that help to ‘explain’ variation in counts. It could be more elaborate, describing dynamic variables that respond in both space and time.
Note that the distribution for this process must support \(\theta\) such that it agrees with the data model. For example, a time series for individual \(i\) at time \(t\) might look like this:
\[ Poi(x_{i,t} | \theta_{i,t}) \times LN(\theta_{i,t} | \theta_{i,t-1}, \sigma^2) \]
The log-normal supports \(\theta \in (0, \infty)\), and this matches the rate parameter of the Poisson. Many considerations can influence the choices for these two distributions, but this matching support is non-negotiable. To gain a better understanding of support I provide a bit more background on continuous and discrete distributions.
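Here is a minimal simulation of this matched-support pair, assuming the log-normal is centered on \(\log \theta_{i,t-1}\) and using arbitrary values for the initial intensity and process variance:

# simulate a log-normal random walk for theta (positive) and Poisson counts x
nt        <- 50
sigmaProc <- 0.2                      # assumed process standard deviation (log scale)
theta     <- numeric(nt)
theta[1]  <- 5                        # assumed initial intensity
for(t in 2:nt){
  theta[t] <- rlnorm(1, meanlog = log(theta[t-1]), sdlog = sigmaProc)  # stays positive
}
x <- rpois(nt, theta)                 # counts are non-negative integers
plot(1:nt, x, type = 'h', xlab = 'time', ylab = 'count')
lines(1:nt, theta, col = 'brown')     # latent intensity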
Exercise 1. Look up the support for the uniform distribution, \(unif(x | a, b)\), and offer a situation where you might use it.
Exercise 2. Use the notion of support to justify why we could use this prior, \(\prod_p N(\beta_p | b, B)\), with this likelihood, \(\prod_i N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\).
Continuous probability models: distribution and density
Cumulative distribution functions describe probability for an interval. Probability density functions describe the rate of change in probability at \(x\).
The probability for any value \(x\) of a continuous variable is zero, so probability must be defined for an interval \(dx\). A normal probability density function, or PDF, looks like a bell-shaped curve. For now I use the notation \(p(x)\) to refer to the PDF of \(x\). The PDF does not show probability (it is a density), so how do I interpret it? Here is the shape of a normal distribution \(p(x) = N(x | 0, 0.1)\):
n     <- 100
xseq  <- seq(-1.5, 1.5, length = n)
sigma <- sqrt(0.1)
PDF   <- dnorm( xseq, 0, sigma )
plot( xseq, PDF, xlab='x', ylab = 'Density (1/x)', type='l' )
The normal (Gaussian) density for x has units 1/x.
I have drawn this function for a sequence of \(x\) values called xseq, which are spaced at equal intervals of dx = diff(xseq)[1] = 0.030303. This curve shows a mode at the mean = 0, where the function takes a value of dnorm(0, 0, sigma) = 1.2615663. If I was tempted to interpret the PDF as a probability, the fact that this value exceeds 1 would be a reminder that a density cannot be a probability.
Integration is like multiplication. The probability density function (PDF) for a continuous variable \(x\) has dimension \(1/x\), so it becomes a probability (dimensionless) by integration over an interval \(dx\) to give dimension \(x \times x^{-1} = x^0\).
Recall that the area under the PDF curve must equal 1,
\[ \int_{-\infty}^{\infty} p(x) dx = 1 \]
so I check this with a Riemann sum. I do this by approximating the smooth curve with a series of rectangles of width \(dx\). The area of a box is its width times its height, \(dx \times p(x)\). Here is the series of boxes,
dx <- diff(xseq)[1] # width of each box
plot( xseq, PDF, xlab='x', ylab = 'Density (1/x)', type='s' )
segments( xseq, 0, xseq, PDF )
A probability is approximated by the area of a box, not by its height.
Here is the sum of the box areas:
# add all the boxes
sum(PDF)*dx
## [1] 0.9999984
This confirms that the area under the curve is correct for a PDF. I can make this approximation as accurate as I want by increasing the number of boxes, i.e., by increasing the length argument passed to seq when constructing xseq. For the normal distribution the exact integral is given by the function pnorm(x1, mean, sd), where x1 is some value of \(x\). It is this integral, up to the value (upper limit) \(x_1\),
\[ P(x_1) = \int_{-\infty}^{x_1} p(x) dx \] A cumulative distribution function (CDF) for a continuous variable \(x\) is the area under the PDF to the left of \(x\), \(P(x) = \int_{-\infty}^x p(x') dx'\). The PDF can be recovered from the CDF by differentiation, \(p(x) = dP(x)/dx\).
Differentiation is like division: if I divide an area, say meters \(\times\) meters, by its width, I go from m\(^2\) to m, i.e., I subtract one unit from the exponent. Likewise, when I go from dimensionless \(P(x)\) (units \(x^0\)) to \(p(x)\) (units \(x^{-1}\)) I lose one unit from the exponent. Again, a PDF must be integrated to get a probability.
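A quick numerical check of this relationship, reusing xseq, sigma, and dx from above: differencing the CDF and dividing by dx approximately recovers the PDF.

# finite differences of the CDF approximate the PDF
CDFseq  <- pnorm( xseq, 0, sigma )
pApprox <- diff( CDFseq ) / dx                  # discrete version of dP/dx
plot( xseq[-1], pApprox, type = 'l', xlab = 'x', ylab = 'Density (1/x)' )
lines( xseq, dnorm( xseq, 0, sigma ), lty = 2, col = 'brown' )   # exact PDF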
If I want the probability that \(x\) lies within an interval defined by two values \((x_1, x_2)\), I integrate the PDF over that interval,
\[ [x \in (x_1, x_2)] = \int_{x_1}^{x_2} p(x) dx \] This is equivalent to subtracting the CDF at these two values, \[ \begin{align} [x \in (x_1, x_2)] &= P(x_2) - P(x_1) \\ &= \int_{-\infty}^{x_2} p(x) dx - \int_{-\infty}^{x_1} p(x) dx \end{align} \] Again, this is just the area under the curve between \(x_1\) and \(x_2\). Here it is in R,
x1 <- .2
x2 <- .6
area <- pnorm(x2, 0, sigma) - pnorm(x1, 0, sigma)
or, equivalently, diff( pnorm(c(x1,x2), 0, sigma) ) = 0.2346548.
Here I show the interior \(80\%\) probability for the normal distribution. I start by finding the values for \((x_1, x_2)\) at \(P(x_1) = 0.1\) and \(P(x_2) = 0.9\). I call these quantiles q. I then locate the portion of xseq within this interval, and assign it to an index z. Finally, I determine the polygon shape and generate a plot:
# probability that x lies in an interval
p <- c(.1, .9) # probability to locate x1, x2
q <- qnorm( p, 0, sigma) # this is x1, x2 for p
z <- which( xseq >= q[1] & xseq <= q[2] ) # the part of xseq that lies within q
# outline for the probability area, bounded above by PDF, below by zero
yshape <- c( 0, PDF[ z ], 0, 0 )
xshape <- c( xseq[ z[1] ], xseq[ z ], xseq[ max(z) ], xseq[ z[1] ] )
plot( xseq, PDF, xlab='', ylab = 'Density (1/x)', type='l', col='grey')
polygon( xshape, yshape, col = 'wheat', border='brown', lwd=2)
Probability for this interval is the area under the curve.
I can also pick the same information off a plot of the CDF. Here I draw the CDF:
CDF <- pnorm( xseq, 0, sigma )
plot( xseq, CDF, xlab='x', ylab = 'Probability', type='l', xaxt='n' )
axis(1, at = c(-1, 0, 1))
axis(1, at = q, labels = p, col = 'brown', col.ticks = 'brown',
     font = 3)
w1 <- max( which( CDF < p[1]) )
w2 <- max( which( CDF < p[2]) )
lines( xseq[ c(1,w1) ], p[c(1,1)], lty=2, col='brown')
lines( xseq[ c(1,w2) ], p[c(2,2)], lty=2, col='brown')
lines( xseq[ c(w1,w1) ], c(0, p[1]), lty=2, col='brown')
lines( xseq[ c(w2,w2) ], c(0, p[2]), lty=2, col='brown')
The area defined by two values is equivalently the area under the PDF and the difference in the CDF.
The Riemann sum makes use of the concept of limits, which was on the minds of both Newton and Leibniz when they (apparently, independently) invented calculus. The PDF is the rate of change in the CDF, i.e., its derivative, which can be viewed as the limit of a ratio as an increment \(dx\) approaches zero,
\[ p(x) = \frac{dP}{dx} = \lim_{dx \rightarrow 0}\left[ \frac{P(x+dx) - P(x)}{dx} \right] \]
i.e., division. The multiplication \(p(x) \cdot dx\) that gives us the area of a box in the Riemann sum approximates the integral over a small interval,
\[ \int_x^{x+dx} p(u) \, du \approx p(x) \cdot dx \]
an approximation that becomes exact as \(dx \rightarrow 0\).
Ironically, although multiplication is easy and division is hard, the tables are turned on their calculus counterparts: differentiation is usually easier than integration.
The idea of derivatives and integrals as limiting quantities is important for computation in Bayesian analysis. Differentiation is needed for optimization, which matters more for methods like least squares and maximum likelihood; Bayesian analysis leans on integration. Both optimization and integration are commonly approximated numerically.
In the discussion ahead I rely on some of the notation of calculus, but there are no difficult solutions.
Exercise 3. Use the R function qnorm to determine the quantiles \((x_1, x_2)\) that bound 95% of the standard normal distribution. Then use the function pnorm to verify that this interval does hold an area equal to 0.95.
Exercise 4. I want to generate a random sample from a normal distribution and see if I can ‘recover the mean’. I want to draw the PDF and CDF for the distribution I would obtain from the estimate and a confidence interval. Here are the steps I use:
- define a sample size, a mean and variance
- draw a random sample using rnorm
- estimate parameters as \(\hat{\mu} = \bar{x}\) and \(\hat{\sigma}^2 = \frac{n}{n-1} var(x)\)
- determine a 95% confidence interval using qnorm
- draw a PDF using dnorm and a CDF using pnorm based on the estimates
- determine if the true mean lies within this interval
Plots from Example 1.
Now repeat this 1000 times and count the number of times the confidence interval includes the true mean.
Density of low and high 95% CIs from Example 1.
Probability spaces for discrete variables
A probability mass function (PMF) uses the concept of point mass for the probability assigned to a discrete value. For example, a coin toss has point mass at heads and tails (zero and one).
A probability distribution assigns probability to the set of all possible outcomes over the sample space. This is the support. Recall that the sample space for a continuous probability distribution is all or part of the real line. The normal distribution applies to the full real line, \(\mathbb{R} = (-\infty, \infty)\). The gamma (including the exponential) applies to the non-negative real numbers \(\mathbb{R}_+ = [0, \infty)\), and the log-normal to the positive reals \((0, \infty)\). The beta distribution \(beta(a, b)\) applies to real numbers in the interval \([0, 1]\). The uniform distribution, \(unif(a, b)\), applies to the interval \([a, b]\). These are univariate distributions: they describe one dimension. [Note: when discussing an interval, a square bracket indicates that I am including the boundary: a uniform distribution that starts at zero assigns density at zero, but the log-normal does not.]
Three continuous distributions supported on different portions of the real line. Zero and one (grey lines) are unsupported. PDFs above and CDFs below.
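The boundary note can be checked directly in R; a quick sketch:

# boundary behavior at zero
dunif(0, min = 0, max = 1)          # 1: the uniform on [0, 1] has density at zero
dlnorm(0, meanlog = 0, sdlog = 1)   # 0: the log-normal does not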
The sample space for a discrete distribution is a set of discrete values. For the Bernoulli distribution, \(Bernoulli( \theta)\), the sample space is \(\{0, 1\}\). For the binomial distribution, \(binom(n, \theta)\), the sample space is \(\{0, \dots, n \}\). For count data, often modeled as a Poisson or negative binomial distribution, the sample space is \(\{0, 1, 2, \dots \}\). The probability mass function means that there is point mass on specific values (e.g., integers) and no support elsewhere.
Three discrete distributions, with PMFs above and CDFs below.
The figures show continuous and discrete distributions, each in two forms, as densities and as cumulative probabilities. The foregoing relationships between the PDF and CDF for continuous variables are extended here to discrete variables.
For discrete distributions the PDF is replaced with a probability mass function (PMF).
Like the CDF (and unlike the PDF), values in the PMF are (dimensionless) probabilities. To obtain the CDF from the PMF I sum, rather than integrate,
\[P(x) = \sum_{k \leq x} p(k)\] This notation says that I am summing all values in the PMF that are less than or equal to \(x\). I use the index \(k\) to represent these values.
To obtain the PMF from the CDF I difference, rather than differentiate,
\[p(x) = P(x) - P(x-1)\]
The sum over the sample space is
\[1 = \sum_{k \in \mathcal{K}} p(k)\]
where \(\mathcal{K}\) is the sample space.
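These two relationships are easy to verify in R; a minimal sketch using a Poisson PMF with an arbitrary intensity:

# summing the PMF gives the CDF; differencing the CDF recovers the PMF
kk  <- 0:10
pmf <- dpois(kk, lambda = 3)
cdf <- ppois(kk, lambda = 3)
all.equal( cumsum(pmf), cdf )        # TRUE: P(x) is the sum of p(k) for k <= x
all.equal( diff( c(0, cdf) ), pmf )  # TRUE: p(x) = P(x) - P(x - 1)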
In R there are functions for common discrete distributions just as for continuous distributions. The CDF begins with the letter p. The PDF and PMF begin with the letter d (e.g., dbinom). To obtain random values use the letter r. For quantiles use the letter q.
Suppose I want to draw the PMF for the Poisson distribution having intensity \(\lambda = 4.6\). I can generate a sequence of integer values and then use dpois:
k  <- 0:20
dk <- dpois(k, 4.6)
plot(k, dk)
segments(k, 0, k, dk)
Open symbols indicate point mass in the Poisson PMF.
I could draw the CDF with ppois:
pk <- ppois(k, 4.6)
plot(k, pk, type='s')
The Poisson CDF.
If I want to confirm the values of k having these probabilities, I invert the probabilities with qpois, which takes probability as input and returns the quantile:
qpois(pk, 4.6)
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Another common distribution is the binomial distribution, \(binom(x | n, \theta)\), where \(\theta \in (0, 1)\), and support for \(x\) is \(x = 0, \dots, n\).
We will get to multivariate distributions
Recall that a (univariate) normal distribution describes an outcome \(x\) on the real line, \(\mathbb{R} = (-\infty,\infty)\). A d-dimensional multivariate normal distribution describes a length-\(d\) random vector in \(\mathbb{R}^d\).
The (univariate) binomial distribution, which describes the number of successes in \(n\) trials, has a multivariate counterpart.
The multinomial distribution assigns probability to \(n\) trials, each of which can have \(J\) discrete outcomes, \(multinom(\mathbf{x} | n, \boldsymbol{\theta})\), where \(\boldsymbol{\theta} = (\theta_1, \dots, \theta_J)\) is a vector of probabilities assigned to each of the \(J\) outcomes \(\mathbf{x} = (x_1, \dots, x_J)\).
In other words, the multinomial has the sample space (support) \(\{0, \dots, n\}^J\), subject to the constraint that the sum over all \(J\) classes is equal to \(n\). Just like the binomial, there are only \(J-1\) free values in the vector \(\boldsymbol{\theta}\), because the elements sum to one. The binomial distribution has two classes, and one parameter.
To get specific, if the sample space for eye color were \(x \in \{\mbox{'brown', 'blue', 'hazel', 'amber', 'grey', 'green'}\}\), then five random draws, each distributing a class of \(n = 12\) students across the \(J = 6\) colors, could have these outcomes drawn from the function rmultinom:
## [,1] [,2] [,3] [,4] [,5]
## brown 10 8 10 11 8
## blue 1 2 2 1 2
## hazel 0 0 0 0 0
## amber 1 2 0 0 0
## grey 0 0 0 0 2
## green 0 0 0 0 0
Note that each column distributes the 12 students across the 6 categories.
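A call along these lines could produce such a table; the class probabilities below are assumed for illustration, and actual draws are random:

# five multinomial draws, each distributing 12 students among 6 eye colors
eyeColor <- c('brown', 'blue', 'hazel', 'amber', 'grey', 'green')
probEye  <- c(.70, .12, .06, .05, .04, .03)     # assumed probabilities, sum to 1
y <- rmultinom(5, size = 12, prob = probEye)
rownames(y) <- eyeColor
y
colSums(y)                                      # each column sums to 12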
Multivariate distributions come up when we consider an infection model below.
Moments
Moments are used to summarize the location and shape of a distribution. The mean, variance, skewness, and kurtosis are examples of moments.
Moments are expectations of powers of \(x\). I can think of the \(m^{th}\) moment of a distribution as a weighted average of \(x^m\). For a continuous distribution
\[ E[x^m] = \int_{-\infty}^{\infty} x^m p(x) dx \] The ‘weight’ in this case is \(p(x)\). The highest weight goes to \(x\) at the mode of \(p(x)\).
The first moment, \(m = 1\), is the mean.
A central moment weights not \(x\), but rather the distance from \(x\) to the mean or expectation of the distribution, \(E(x)\). This distance is \(x - E(x)\). The variance is the second central moment,
\[ var[x] = E \left[(x - E(x))^2 \right] = \int_{-\infty}^{\infty} (x - E(x))^2 p(x) dx \] For discrete variables, we replace the integral with a sum. These basic properties of moments will be valuable for describing the shapes of distributions.
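Both definitions can be checked numerically; here is a sketch that reuses xseq, PDF, and dx from the normal example above and the Poisson intensity 4.6 from earlier:

# continuous case: Riemann sums for the first moment and second central moment of N(0, 0.1)
m1 <- sum( xseq * PDF ) * dx                 # mean, approximately 0
m2 <- sum( (xseq - m1)^2 * PDF ) * dx        # variance, approximately 0.1
# discrete case: replace the integral with a sum, e.g., Poisson with intensity 4.6
kk <- 0:50
sum( kk * dpois(kk, 4.6) )                   # mean = 4.6
sum( (kk - 4.6)^2 * dpois(kk, 4.6) )         # variance = 4.6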
Discrete probability for multiple host states
Distribution theory is needed to estimate not only parameters, but also unobserved states. They can be continuous, discrete, or occur in combination.
I now want to apply probability concepts to the disease-infection problem that comes from our studies of seedling survival and fungal pathogen attack (Hersh et al.).
Hersh et al. studied the effects of experimental warming on pathogen infection that causes damping off and the leaf damage on this Quercus falcata seedling. One of the open-top warming chambers is shown at right. Warm air is distributed by blowers through tubes suspended above seedlings.
Consider the joint probability of two events, \([I,S]\) that a host plant is infected with a fungal taxon, \(I\), and that it survives, \(S\). In this example, both of these events are binary, being either true (indicated with a one) or not (indicated with a zero). For example, the probability that an individual is not infected and survives is written as \([I = 0,S = 1]\). If I write simply \([I, S]\) for these binary events, it is interpreted as the probability that both events are a ‘success’ or ‘true’, i.e., \([I=1, S=1]\).
A graphical model of relationships discussed for the fungal infection problem. Symbols are I - infected host, S - survival, D - detection. Greek letters are parameters.
This is a multinomial distribution
As previously, states (or events) and parameters are nodes in the graph, the states \(\{I, S, D\}\), and the parameters \(\{\theta, \phi, \pi_0, \pi_1 \}\). The connections are arrows or edges. Here I assign parameters to probabilities:
\[ \begin{aligned} \left[I\right] &= \theta \\ [D | I = 1] &= \phi \\ [S | I = 0] &= \pi_0 \\ [S | I = 1] &= \pi_1 \end{aligned} \]
A host can become infected with probability \([I] = \theta\). An infection can be detected with probability \([D|I=1] = \phi\). An infected individual survives with probability \([S| I = 1] = \pi_1\), and a non-infected individual survives with probability \([S| I = 0] = \pi_0\). For this example, I assume no false positives, \([D=1 | I = 0] = 0\).
Each of these outcomes is written as univariate, either as a conditional or a marginal distribution. However, if I want to put them together, I need to view them jointly. For the joint distribution \([I, S]\) there are four possible outcomes, which can be viewed as a table:
Table 1. Multinomial (joint) probabilities for infection and survival model.
| outcomes | \([I, S = 0]\) | \([I, S = 1]\) | \([I]\) |
|---|---|---|---|
| \([I = 0, S]\) | \((1 - \theta)(1 - \pi_0)\) | \((1 - \theta)\pi_0\) | \(1-\theta\) |
| \([I = 1, S]\) | \(\theta(1 - \pi_1)\) | \(\theta \pi_1\) | \(\theta\) |
| \([S]\) | \((1 - \theta)(1 - \pi_0) + \theta(1 - \pi_1)\) | \((1 - \theta)\pi_0 + \theta \pi_1\) | 1 |
The cells in Table 1 are the elements \([I, S]\). Summing across rows and columns are the marginal probabilities for \([I]\) and \([S]\). The sum of all four elements is one (lower right corner).
The multinomial distribution is \(multinom( n, \mathbf{p})\), where \(\mathbf{p}\) are the four elements of the table.
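Here is a sketch that builds this table for assumed parameter values, checks its margins, and draws from the corresponding multinomial:

# joint probabilities for (I, S), with margins, for assumed parameter values
theta <- 0.3                            # assumed probability of infection
pi0   <- 0.8                            # assumed survival, uninfected
pi1   <- 0.5                            # assumed survival, infected
pIS <- rbind( 'I = 0' = c( (1 - theta)*(1 - pi0), (1 - theta)*pi0 ),
              'I = 1' = c( theta*(1 - pi1),       theta*pi1 ) )
colnames(pIS) <- c('S = 0', 'S = 1')
rowSums(pIS)                            # marginal [I]: 1 - theta, theta
colSums(pIS)                            # marginal [S]
sum(pIS)                                # 1
rmultinom(1, size = 20, prob = c(pIS))  # one multinomial draw of 20 hosts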
Start simple
How can I learn about the effect of infection on survival, if infection is not observed?
I might be interested in estimating state \(I\), in comparing parameter estimates (‘does \(\pi_0\) differ from \(\pi_1\)?’), or both. From the joint distribution \([I, S]\) I can observe \(S\), but not \(I\). I want a model for the conditional probability of survival given that an individual is infected, \([S|I = 1] = \pi_1\), or not, \([S|I = 0] = \pi_0\). Again, the notation \([S|I]\) indicates that the event to the left of the bar is ‘conditional on’ the event to the right. I cannot condition on \(I\) if it is unknown. Rather, I want to estimate it. Progress requires Bayes theorem.
Nodes from the graphical model limited to infection status and survival.
I need Bayes’ theorem for the probability of infection (unknown) given survival (observed), \([I|S] = \frac{[S|I][I]}{[S]}\)
To make use of the model I need a relationship between conditional and joint probabilities,
\[[I, S] = [S | I] [I]\] Here I have factored the joint probability on the left-hand side into a conditional distribution and a marginal distribution on the right-hand side. I can also factor the joint distribution this way,
\[[I, S] = [I | S] [S]\] Because both are equal to the joint probability, they must be equal to each other,
\[[S|I][I] = [I|S][S]\] Rearranging, I have Bayes theorem, solved two ways,
\[[S|I] = \frac{[I|S][S]}{[I]}\]
and
\[[I|S] = \frac{[S|I][I]}{[S]}\]
The two pieces of this relationship that I have not yet defined are the marginal distributions, \([S]\) and \([I]\). I could evaluate either one conditional on the other using the law of total probability,
\[[I=0] = \sum_{j \in \{0, 1\}} [I=0 | S = j][S = j]\]
or
\[[S=1] = \sum_{j \in \{0, 1\}} [S=1 | I = j][I = j]\] How can I use these relationships to address the effect of infection on survival?
Given survival status, I first determine the probability that the individual was infected. I have four factors, all univariate, two conditional and two marginal distributions. I have defined \([S|I]\) in terms of parameter values, but I want to know \([I|S]\). For a host that survived, Bayes theorem gives me
\[
\begin{aligned}
\left[I|S=1 \right] &= \frac{[S=1|I][I]}{[S=1]} \\
&= \frac{[S=1|I][I]}{\sum_{j \in \{0, 1\}} [S=1 | I = j][I = j]} \\
&= \frac{\pi_1 \theta}{\pi_0 (1 - \theta) + \pi_1 \theta}
\end{aligned}
\] For a host that died this conditional probability is
\[ \begin{aligned} \left[I|S=0 \right] &= \frac{[S=0|I][I]}{[S=0]} \\ &= \frac{(1 - \pi_1) \theta}{(1 - \pi_0) (1 - \theta) + (1 - \pi_1) \theta} \end{aligned} \] These two expressions demonstrate that, if I knew the parameter values, then I could evaluate the conditional probability for \([I | S]\). If I do not know parameter values, then they too might be estimated.
Before going further, notice that the numerator is always the ‘unnormalized’ probability of the two events. The denominator simply normalizes them.
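A minimal sketch of these two expressions as an R function, with parameter values assumed for illustration:

# probability of infection given survival status, from the two expressions above
probInfection <- function(S, theta, pi0, pi1){
  if(S == 1) return( pi1*theta / ( pi0*(1 - theta) + pi1*theta ) )
  (1 - pi1)*theta / ( (1 - pi0)*(1 - theta) + (1 - pi1)*theta )
}
probInfection(S = 1, theta = .3, pi0 = .8, pi1 = .5)   # infected, given survival
probInfection(S = 0, theta = .3, pi0 = .8, pi1 = .5)   # infected, given death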
Exercise 5 In R: Assume that there are \(n = 100\) hosts, and that infection decreases survival probability \((\pi_1 < \pi_0)\). Define parameter values for \(\{\pi_0, \pi_1, \theta \}\) and draw a binomial distribution for \([I | S=0]\) and for \([I | S=1]\). (Use the function dbinom.) Is the infection rate estimated to be higher for survivors or for those that die? How are these two distributions affected by the underlying prevalence of infection, \(\theta\)? [Hint: write down the probabilities required and then place them in a function].
Comparison of binomial distributions of infection for survivors and deaths.
Continuous probability for parameters
The prior distribution must match the support required by the likelihood, but we can often do better than that. Conjugacy refers to a likelihood-prior pair that yields a posterior distribution with the same form as the prior.
Here I consider the problem of estimating parameters. The survival parameters are \(\pi_I = \{\pi_0, \pi_1 \}\). From Bayes theorem I need this:
\[[\pi_I | S] = \frac{[S | \pi_I ][\pi_I]}{[S]}\]
where the subscript \(I = 0\) (uninfected) or \(I = 1\) (infected). Again, assuming I know whether or not a host is infected, I can write the distribution for survival conditioned on parameters as
\[[S | \pi_I ] = \pi_I^S (1 - \pi_I)^{1 - S}\]
This is a Bernoulli distribution. The Bernoulli distribution is a special case of the binomial distribution for a single trial,
\[Bernoulli(p) = binom(1,p)\]
A typical prior distribution for this likelihood would be the beta density,
\[ beta(\pi_I | a, b) = \frac{\pi_I^{a-1}(1 - \pi_I)^{b-1}}{B(a, b)} \] where \(B(a,b)\) is a beta function, which has a numerical solution. The important thing to recognize about this likelihood-prior combination is conjugacy. Consider their product,
\[ \begin{aligned} bernoulli(S | \pi_I ) \times beta(\pi_I | a, b) & \propto \pi_I^S (1 - \pi_I)^{1 - S} \times \pi_I^{a-1}(1 - \pi_I)^{b-1} \\ &= \pi_I^{S+a-1}(1 - \pi_I)^{1 - S + b-1} \\ &= beta(\pi_I | S + a, 1 - S + b) \end{aligned} \] In the first line I just multiply two distributions. In the second line I collect terms. When I do this, I note that it has the same form as the beta prior distribution, only the parameters have changed. In the last line I update the first parameter from \(a\) to \(S+a\) and the second parameter from \(b\) to \(1 - S + b\).
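The conjugate update can be checked numerically: multiply the Bernoulli likelihood by the beta prior on a grid, normalize, and compare with the updated beta density. A sketch with assumed prior parameters:

# numerical check of the beta-Bernoulli conjugate update for one observation
a <- 2; b <- 3                           # assumed prior parameters
S <- 1                                   # one observed survival
piGrid <- seq(.001, .999, length = 500)
post   <- dbinom(S, 1, piGrid) * dbeta(piGrid, a, b)    # likelihood x prior
post   <- post / ( sum(post) * diff(piGrid)[1] )        # normalize numerically
plot(piGrid, post, type = 'l', xlab = expression(pi), ylab = 'Density')
lines(piGrid, dbeta(piGrid, S + a, 1 - S + b), lty = 2, col = 'brown')  # beta(S + a, 1 - S + b)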
A graphical model for estimating survival probabilities with a prior distribution.
A Bayesian model
When a stage can be marginalized in a hierarchical model, it can simplify and stabilize computation.
It is important to note that I do not need to know, or even estimate, the unknown infection states \(I\) in order to estimate parameters about infection. We saw above that, given parameter estimates, I can estimate infection probability for a given individual based on its observed \((S, D)\). To fit all parameters I return to the full model, shown at left:
Marginalizing \(I\) simplifies the model (removes a stage) while binding together the observed states \((S, D)\). Fitting the marginalized model does not keep us from predicting unobserved infection status from the fitted model.
Several features of this graph can help me simplify it. First, consider that \(S\) and \(D\) are conditionally independent given infection state \(I\). When I only observe \(S\) and \(D\), they each provide some information about the other, as when a positive detection suggests that I am at greater mortality risk, or when premature death is consistent with infection by a deadly disease. They inform each other because \(I\) is a stochastic node; \(S\) and \(D\) ‘communicate’ through stochastic \(I\).
However, if I know infection status, then node \(I\) is not stochastic; it is fixed. Conditional independence means that, if I know \(I\), then I can forget all about \(S\) when I am thinking about \(D\), and vice versa. In other words, \(S\) and \(D\) provide no information about one another when I know \(I\). The important thing here is that conditional independence allows me to write a (complicated) joint distribution \([S, D | I]\) as a product of two (simple) univariate distributions \([S|I] \times [D|I]\). Including parameters, I am saying this,
\[ \begin{aligned} \left[ S, D|I, \pi_0, \pi_1, \phi \right] \times [I | \theta] &= [S|I, \pi_0, \pi_1] \times [D| I, \phi] \times [I | \theta] \\ &= [S, D| \pi_0, \pi_1, \phi, \theta] \end{aligned} \] The second thing to note is that \(I\) disappeared from the second line of this equation. To move from line 1 to line 2 I used total probability, \([S, D] = \sum_{j \in \{0, 1\}} [S,D | I = j][I = j]\). There is a solution for each of the four observed states \((S, D)\), which now do not involve \(I\). For example,
\[ p_{1} = [ S = 0, D = 0] = (1 - \pi_0)(1 - \theta) + (1 - \pi_1)(1 - \phi) \theta \] I could use total probability to derive the other three elements of \(\mathbf{p}\).
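Here is a sketch of all four cell probabilities, derived by the same total-probability argument under the no-false-positive assumption, with parameter values assumed for illustration:

# marginal cell probabilities for (S, D), with infection status I summed out
theta <- .3; phi <- .9; pi0 <- .8; pi1 <- .5          # assumed values
pSD <- c( 'S=0,D=0' = (1 - pi0)*(1 - theta) + (1 - pi1)*(1 - phi)*theta,
          'S=0,D=1' = (1 - pi1)*phi*theta,
          'S=1,D=0' = pi0*(1 - theta) + pi1*(1 - phi)*theta,
          'S=1,D=1' = pi1*phi*theta )
pSD
sum(pSD)                                              # 1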
A Bayesian model to estimate parameters could now look like this:
\[ \begin{aligned} \left[ \theta, \phi, \pi_0, \pi_1|S, D \right] & \propto [S, D|\theta, \phi, \pi_0, \pi_1] \times [\theta, \phi, \pi_0, \pi_1] \\ &= \prod_{i=1}^n multinom(\mathbf{x}_i | n_i, \mathbf{p}) [\theta][ \phi][ \pi_0][ \pi_1] \end{aligned} \] Each observation has \(n_i\) subjects distributed across the four combinations. If observation \(i\) has \(n_i = 6\) subjects, then \(\mathbf{x}_i\) could be \((0,3,1,2)\), i.e., the number of subjects observed in the four combinations \((S, D)\). To make this simple I might assign independent uniform or beta densities to each of the four parameters.
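The multinomial likelihood contribution of a single observation can be evaluated with dmultinom; a sketch that reuses pSD from above and assumes the counts in \(\mathbf{x}_i\) are ordered the same way:

# likelihood contribution of one observation with n_i = 6 subjects
xi <- c(0, 3, 1, 2)                  # counts in the four (S, D) combinations, order assumed to match pSD
dmultinom(xi, size = 6, prob = pSD)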
Exercise 6 Assume you have fitted parameter values for this model. Write a function to determine \([I| S=0, D=0]\), \([I|S=1, D=0]\), and \([I| S=0, D=1]\). Interpret the differences.
Broad application
These examples of discrete states and continuous parameters are used by Clark and Hersh and Hersh et al. to evaluate the effect of co-infection of multiple pathogens that attack multiple hosts.
You might try this bonus problem:
Bonus: I get tested for covid-19, the test is positive, and I want to know if I can expect symptoms. I am told that 4% of the population in my area is infectious, 55% of covid infections result in symptoms, the false positive rate is 0.5%, and the false negative rate is 9%. I’m hoping that this vignette can provide guidance on my chances of having symptoms. Or, if I don’t have symptoms, was I really infected? Can this vignette help?
The bonus question provides conditional probabilities that can be used to determine probability of infection.
Recap
Distribution theory is needed to put distributions together with observations, with parameters, and with other distributions. A continuous distribution supports values on all or part of the real line as a density. A discrete distribution supports point mass at specific values, described by a probability mass function. Cumulative probability comes from integration (continuous) or summation (discrete). Moments, such as the mean and variance, describe the location and shape of a distribution. The multinomial distribution for multiple events can be used to estimate latent (unobserved) states and parameters. Conditional independence can allow us to simplify a model. When a stage can be marginalized, there are fewer stages, which can be an advantage when we get to computation.