Probability distributions, uncertainty, and transformations

Training overview

The goal of this training is to cover statistical concepts required for later courses on regression modeling, especially meta-regression with MR-BRT and conducting crosswalk analyses. The concepts include probability distributions, uncertainty, and converting a quantity from one probability space to another.

In the first section, we define and discuss the purpose of probability distributions. With the Normal distribution as an example, we emphasize the fact that probability distributions are imperfect but useful mathematical descriptions of real-world phenomena. We conclude by discussing how the choice of a probability distribution should reflect the key characteristics of the data it aims to represent.

The second section deals with stochastic uncertainty, or the uncertainty we expect due to finite sample sizes. We discuss what it means to take a random sample of a population and how sample size influences the uncertainty distribution around an estimate. To link with the previous section, we distinguish between “standard deviation” and “standard error” (the latter is the standard deviation of an uncertainty distribution). We then use these concepts to learn about p-values, confidence intervals, and hypothesis testing.

The final section covers the log and logit transformations, including how to convert an estimate (defined by its mean and standard error) into log or logit space with the “delta method”. We discuss the particular difficulty of transforming quantities at the extreme edge of a probability distribution, like a count of zero, and its implications for meta-regression and generalized linear models.

Section 1: Probability distributions

Introduction

The concept of a probability distribution is fundamental to the field of statistics. A probability distribution describes the possible outcomes of a random variable and how likely they are to occur. A random variable is a variable whose values are determined by the outcomes of a random process. It can be discrete, taking on specific values such as the number of heads in a series of coin flips, or continuous, like the heights of students in a school. In statistics, the term “random” does not imply a lack of order or pattern; rather, it acknowledges that outcomes are at least somewhat subject to chance.

The Normal distribution

We begin our exploration of probability distributions with the Normal distribution, also called the Gaussian distribution. The Normal distribution can be fully described by two parameters: mean and standard deviation.

  • The mean, often denoted as \(\mu\) (mu), is a measure of the central tendency of the distribution and represents the average value of all possible outcomes. It is the point around which the values of the random variable are most densely clustered.

  • The standard deviation, denoted as \(\sigma\) (sigma), quantifies the amount of variation or dispersion in the distribution. It reflects how spread out the values are around the mean. A smaller standard deviation indicates that the values are closely clustered around the mean, while a larger standard deviation suggests a wider spread.

For a given set of observations – the measured heights of students in a school, for example – we can calculate the mean and standard deviation. Let’s say a school has 500 students and their heights are measured in centimeters (cm): 143.4, 150.1, 144.1, 133.9, 149.2, etc. Here’s how it looks on a graph:

The horizontal axis defines height categories, and the vertical axis describes what proportion of students have a height in a given category.

Calculating the mean and standard deviation of measured heights, we get a mean of 140.2 and a standard deviation of 7.1. With these two pieces of information, we can display the Normal distribution that is most consistent with the observed heights.

The blue line is only a rough approximation of the observed heights. It is too low for some height categories and too high for others. However, if we are willing to believe that the blue line is a faithful representation of the orange bars, we have greatly simplified the amount of information required to describe student heights in this school. That is, rather than having to think about the heights of each and every student, we can simply refer to the mean and standard deviation. Five hundred pieces of information are now roughly described by two.
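To make this concrete, here is a minimal R sketch of the same idea. The heights are simulated rather than the original measurements, using the reported mean of 140.2 and standard deviation of 7.1, and the blue curve is the Normal density implied by the sample mean and standard deviation.

```{r}
# A minimal sketch with simulated heights (not the original measurements):
# overlay the Normal density implied by the sample mean and standard deviation
set.seed(1)
heights <- rnorm(n = 500, mean = 140.2, sd = 7.1)

mu <- mean(heights)
sigma <- sd(heights)

hist(heights, breaks = 20, freq = FALSE,
     main = "Observed heights and fitted Normal curve", xlab = "Height (cm)")
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, col = "blue", lwd = 2)
```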

Considerations for selecting a probability distribution

The field of statistics is built upon these sorts of approximations. We convert observations of the world into structured mathematical descriptions like the Normal distribution, then we proceed as though the mathematical description is indeed true. This raises the question: How well does the mathematical description reflect reality? While probability distributions provide a simplified and manageable way of understanding complex data, there is always a trade-off between the simplicity of a model and its ability to capture the full complexity of real-world phenomena. The key challenge in statistics is to find the balance where a model is sufficiently detailed to be useful, yet not so complex that it becomes impractical or loses its explanatory power.

Let’s look at another example in which the Normal distribution might not be the right choice.

The graph shows the distribution of household incomes in a hypothetical country. The vast majority of households have an income below 40000. A few lucky households have an income above 70000. None of the households have an income below 0, because income by definition cannot be negative.

Why does the Normal distribution fail to describe what is going on? As before, the blue line representing the Normal distribution (with a mean of 25907 and a standard deviation of 13824) is sometimes too high and sometimes too low; that much is expected. The real problems are that the blue line:

  • Suggests that some portion of households has a negative income, which by definition is not possible; and

  • Fails to capture the skewed, asymmetrical shape of the distribution, with most households clustered on the left side and a gradual tapering off of households on the right side.

So what went wrong? We chose the wrong probability distribution.

When the random variable of interest cannot go below zero, it is common to use the log-normal or Poisson distribution instead of the Normal distribution. Here, we show how the log-normal distribution would summarize these observations of household income:

Much better. The log-normal distribution fits the data better than the Normal distribution.
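For readers who want to reproduce this kind of comparison, here is a rough R sketch. The incomes are simulated (they are not the data shown in the graph); the point is simply that the log-normal density tracks a right-skewed histogram far better than the Normal density.

```{r}
# A minimal sketch with simulated right-skewed incomes (not the data shown above):
# compare the Normal and log-normal densities implied by the observations
set.seed(2)
incomes <- rlnorm(n = 1000, meanlog = 10, sdlog = 0.5)

hist(incomes, breaks = 40, freq = FALSE,
     main = "Simulated household incomes", xlab = "Income")
# Normal density implied by the sample mean and standard deviation
curve(dnorm(x, mean = mean(incomes), sd = sd(incomes)),
      add = TRUE, col = "blue", lwd = 2)
# log-normal density implied by the mean and standard deviation of log(income)
curve(dlnorm(x, meanlog = mean(log(incomes)), sdlog = sd(log(incomes))),
      add = TRUE, col = "red", lwd = 2)
```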

Summary and guidelines

So what have we learned? While probability distributions are an imperfect tool for summarizing data, they offer a simplified lens through which to view and understand the world. The simplicity is beneficial because it allows us to refer to the information content of the data in a way that is easier to conceptualize and handle mathematically. And whether it’s the outcomes of coin flips, household incomes, or a disease-related parameter, choosing the right distribution is key to accurately representing the data.

When should you use a certain probability distribution? Here are some commonly used defaults for various scenarios:

  • If the data have no natural bounds, like the difference between a count this year and a count last year, use the Normal distribution;

  • If the data cannot be negative by definition, like the incidence or remission rate for a disease, use the log-normal distribution (as in mrtool) or Poisson distribution (as in regmod);

  • For parameters that must exist between 0 and 1, like prevalence or a percentage, use the logit-normal distribution (as in mrtool) or the binomial distribution (as in regmod).

Alternatives to all of these are available for extenuating circumstances, but this is a good place to start.

It can be disorienting to have multiple options available; what if different probability distributions lead to different conclusions? This is a real debate in the field of statistics that cannot always be resolved cleanly. Sometimes the data can tell us which distribution fits the best through quantitative tests. Other times, the best we can do is to think carefully about the data generating process and assess whether a distribution reflects its essential characteristics. A best practice when publishing in the peer-reviewed literature is to run the analysis multiple times to determine whether the primary results of the study are robust to different model specifications. If a scientific finding is real, it should withstand this sort of sensitivity test.
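As an example of such a quantitative test, the sketch below (using simulated incomes, not real data) fits two candidate distributions by maximum likelihood and compares them with AIC; other criteria and goodness-of-fit tests could be used in the same way.

```{r}
# A sketch of one quantitative check: fit candidate distributions by maximum
# likelihood and compare AIC (lower is better); the incomes here are simulated
library(MASS)

set.seed(3)
incomes <- rlnorm(n = 1000, meanlog = 10, sdlog = 0.5)

fit_normal    <- fitdistr(incomes, densfun = "normal")
fit_lognormal <- fitdistr(incomes, densfun = "lognormal")

AIC(fit_normal, fit_lognormal)
```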

Hands-on

[XX ‘learnr’ exercise]

Section 2: Uncertainty

Distributions and standard error calculations:

  • Normal
  • Poisson
  • Binomial
  • Wilson approximation
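As a preview of the calculations this section will cover, here is a small illustrative sketch with made-up numbers; the Wilson interval is obtained from prop.test(), which uses the Wilson score method when correct = FALSE.

```{r}
# A small sketch of common standard error (SE) calculations (illustrative numbers)
set.seed(5)

# Normal: SE of a sample mean is sd / sqrt(n)
x <- rnorm(n = 200, mean = 140, sd = 7)
se_mean <- sd(x) / sqrt(length(x))

# Poisson: for an observed count y, the variance equals the mean, so SE ~ sqrt(y)
y <- rpois(n = 1, lambda = 50)
se_count <- sqrt(y)

# Binomial: SE of a proportion phat is sqrt(phat * (1 - phat) / n)
k <- 35
n <- 100
phat <- k / n
se_prop <- sqrt(phat * (1 - phat) / n)

# Wilson approximation: a confidence interval for a proportion that behaves
# better near 0 and 1
wilson_ci <- prop.test(x = k, n = n, correct = FALSE)$conf.int

c(se_mean = se_mean, se_count = se_count, se_prop = se_prop)
wilson_ci
```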

[XX binomial ‘learnr’ exercise]

```{r binomial1, exercise=TRUE}

set.seed(4)

# simulate 100 flips of a coin that lands heads with probability 0.8
n <- 100
p <- 0.8
draws1 <- rbinom(n = n, size = 1, prob = p)

# intercept-only logistic regression: the intercept estimates the log-odds of
# heads, and its standard error describes the uncertainty around that estimate
mod1 <- glm(y1 ~ 1, data = data.frame(y1 = draws1), family = "binomial")

library(dplyr)

# evaluate the Normal uncertainty distribution of the estimated intercept over
# a grid of log-odds values (q), and convert each grid value to a probability (p)
dfp <- data.frame(q = seq(-5, 5, by = 0.01)) %>%
  mutate(
    dens1 = sapply(.$q, function(x) dnorm(x = x, mean = coef(mod1), sd = summary(mod1)$coefficients[1, 2])),
    p = plogis(q)
  )


library(ggplot2)

# plot the uncertainty distribution on the probability scale
p1 <- ggplot(data = dfp) +
  geom_vline(xintercept = 0.5, linetype = 2, color = adjustcolor("darkgray", alpha.f = 0.5)) +
  geom_area(mapping = aes(x=p, y=dens1), fill = adjustcolor("darkblue", alpha.f = 0.5)) +
  theme_minimal() +
  xlab("Probability of flipping heads") +
  ylab("Density") +
  theme(
    axis.text=element_text(size=12), 
    axis.title=element_text(size=14,face="bold")
  )

p1
```

Section 3: Transforming the dependent variable

When conducting a meta-regression analysis, the model needs to account for logical bounds on the quantity of interest. An incidence rate ratio (IRR) cannot be negative, for example. For linear mixed-effects models like MR-BRT, the solution is to transform the quantity of interest (i.e. the dependent variable) into a space that is bounded by negative infinity and infinity, conduct the modeling in that space, and do the inverse of the transformation on model predictions to express the results in the original units. Operationally, we do the transformation and its inverse using an approach called the delta method.

Why is the delta method necessary? The delta method is a way to transform a random variable from one probability space into another. A random variable is a quantity that follows a probability distribution, like the distribution of students’ heights in a school. The same concept applies to the uncertainty distribution around the result of a scientific study used as input data in a meta-regression; it, too, is a quantity that follows a probability distribution. For this reason, simply log-transforming the point estimate is not sufficient: it ignores the fact that the dependent variable is a full probability distribution, whose uncertainty must also be carried into the new space.
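In brief, the first-order delta method says that if a random variable \(X\) has mean \(\mu\) and standard error \(\sigma\), and \(g\) is a differentiable transformation, then

\[
g(X) \;\approx\; \mathrm{Normal}\!\left(g(\mu),\; \big[g'(\mu)\,\sigma\big]^2\right),
\]

so the standard error in the transformed space is approximately \(|g'(\mu)|\,\sigma\). The delta method therefore carries both the point estimate and its uncertainty into the new space, not the point estimate alone.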

As a second example, consider the case-fatality ratio (CFR), defined as the proportion of disease cases that are fatal. The number of cases and the number of fatalities are both non-negative by definition, so the CFR can never be negative. In addition, the number of fatalities can never exceed the number of cases, so the CFR can never exceed 1. If we fail to build these types of constraints into the structure of a model, at a minimum it skews the final results. In extreme cases, it can lead to illogical findings, like the upper bound of a CFR uncertainty interval exceeding 1. So to model CFR in a meta-regression, we use the delta method to transform CFR and its uncertainty into a space that is bounded by negative infinity and infinity, run the model, and apply the inverse transformation to bring the results back into the natural units of CFR.

Which transformations should be used, and when?

  • When the quantity of interest is bounded by 0, use the delta method with a log transformation;
  • When the quantity of interest is bounded by 0 and 1, use the delta method with a logit transformation.

Other transformations are possible in principle, but log and logit have useful statistical properties that make them the standard in epidemiology and other fields.
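Applying the first-order delta method formula above to these two transformations gives the standard errors used in practice:

\[
\mathrm{SE}_{\log} \;\approx\; \frac{\mathrm{SE}}{\mu},
\qquad
\mathrm{SE}_{\mathrm{logit}} \;\approx\; \frac{\mathrm{SE}}{\mu\,(1-\mu)},
\]

where \(\mu\) and \(\mathrm{SE}\) are the mean and standard error in the original (linear) space. For example, in the first row of the output further below, \(0.1062 / 0.8174 \approx 0.13\) and \(0.1062 / (0.8174 \times (1 - 0.8174)) \approx 0.71\), matching the se_log and se_logit columns.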

How can the delta method be implemented? In later chapters, we will introduce the MSCA team’s statistical packages which include user-friendly functions for making delta method transformations. For now, we provide a short demonstration using the msm R package. The inputs to the delta method are the mean and standard error (i.e. standard deviation of the uncertainty around the mean estimate) for a given observation.

```{r}
library(msm)

set.seed(0)
options(digits = 4)
df_sim1 <- data.frame(
  mean1 = runif(n = 10, min = 0.1, max = 0.9),
  se1 = runif(n = 10, min = 0.1, max = 0.2)
)

# transforming from linear to logit
df_sim1$se_logit <- mapply(FUN = function(mu, sigma) {
  msm::deltamethod(g = ~log(x1/(1-x1)), mean = mu, cov = sigma^2)
}, mu = df_sim1$mean1, sigma = df_sim1$se1)

# transforming from logit to linear
# -- should get the same result as the original SE
df_sim1$se_check1 <- mapply(FUN = function(mu, sigma) {
  msm::deltamethod(g = ~exp(x1)/(1+exp(x1)), mean = mu, cov = sigma^2)
}, mu = qlogis(df_sim1$mean1), sigma = df_sim1$se_logit)

# transforming from linear to log
df_sim1$se_log <- mapply(FUN = function(mu, sigma) {
  msm::deltamethod(g = ~log(x1), mean = mu, cov = sigma^2)
}, mu = df_sim1$mean1, sigma = df_sim1$se1)

# transforming from log to linear
# -- should get the same result as the original SE
df_sim1$se_check2 <- mapply(FUN = function(mu, sigma) {
  msm::deltamethod(g = ~exp(x1), mean = mu, cov = sigma^2)
}, mu = log(df_sim1$mean1), sigma = df_sim1$se_log)

df_sim1
```

```
    mean1    se1 se_logit se_check1 se_log se_check2
1  0.8174 0.1062   0.7113    0.1062 0.1299    0.1062
2  0.3124 0.1206   0.5614    0.1206 0.3860    0.1206
3  0.3977 0.1177   0.4912    0.1177 0.2958    0.1177
4  0.5583 0.1687   0.6841    0.1687 0.3022    0.1687
5  0.8266 0.1384   0.9655    0.1384 0.1675    0.1384
6  0.2613 0.1770   0.9168    0.1770 0.6772    0.1770
7  0.8187 0.1498   1.0091    0.1498 0.1829    0.1498
8  0.8557 0.1718   1.3914    0.1718 0.2007    0.1718
9  0.6286 0.1992   0.8532    0.1992 0.3169    0.1992
10 0.6033 0.1380   0.5766    0.1380 0.2288    0.1380
```

So a log or logit transformation will ensure that the results stay within the logical bounds of the quantity of interest, but what about zeros in the input data? Log- or logit-transforming 0 yields negative infinity, and logit-transforming 1 yields infinity. Because passing infinity to a regression model will break it, zeros (and ones for logit models) need to be adjusted prior to applying the delta method. Several adjustment methods have been proposed in the epidemiological literature, and the topic remains an active area of debate. The best solution is likely to depend on factors specific to the analysis. We refer readers to Sweeting et al. (https://onlinelibrary.wiley.com/doi/10.1002/sim.1761), Bradburn et al. (https://onlinelibrary.wiley.com/doi/10.1002/sim.2528) and Yanan et al. (https://pubmed.ncbi.nlm.nih.gov/30887438/) for a discussion of options.
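As one illustration (a sketch of a single option discussed in the references above, not a universal recommendation), a common continuity correction adds 0.5 to each cell before computing the proportion and applying the delta method; the counts below are hypothetical.

```{r}
# A sketch of one common zero-adjustment: add 0.5 to each cell (cases and
# non-cases) before computing the proportion and its logit-space standard error.
library(msm)

k <- 0   # observed cases (a zero)
n <- 50  # sample size

k_adj <- k + 0.5
n_adj <- n + 1            # 0.5 added to both cases and non-cases
p_adj <- k_adj / n_adj
se_adj <- sqrt(p_adj * (1 - p_adj) / n_adj)

# the logit transformation is now finite, so the delta method can be applied
mean_logit <- qlogis(p_adj)
se_logit <- msm::deltamethod(g = ~ log(x1 / (1 - x1)), mean = p_adj, cov = se_adj^2)

c(mean_logit = mean_logit, se_logit = se_logit)
```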