Discussions

Foundational Distributions

The normal distribution is a continuous probability distribution defined by its mean and standard deviation and symmetric about the mean. It models natural variability arising from many small additive effects, and many statistical methods rely on normality because the Central Limit Theorem makes sums and averages of many independent effects approximately normal.

The binomial distribution is a discrete distribution describing the number of successes in a fixed number of independent trials with constant success probability. It is fully defined by the number of trials and the probability of success. Outcomes are integers from 0 to the number of trials, inclusive.

The Poisson distribution is a discrete distribution modeling the number of events occurring in a fixed interval of time or space given a constant average rate. It assumes that events occur independently, and its mean and variance both equal the rate parameter.
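The "many small additive effects" idea behind the normal distribution can be illustrated with a short simulation. This is only a sketch: the choice of 12 uniform draws per sum, the seed, and the name `sums` are illustrative assumptions, not part of the analysis that follows.

```r
set.seed(1)

# Each simulated value is the sum of 12 independent Uniform(0, 1) draws, minus 6.
# A single uniform draw is flat, but the sum has mean 0 and variance 12 * (1/12) = 1.
sums <- replicate(10000, sum(runif(12)) - 6)

# Sample mean and standard deviation should be close to 0 and 1
c(mean = mean(sums), sd = sd(sums))

# Share of simulated values within one sd of 0, next to the normal benchmark
c(simulated = mean(abs(sums) < 1), normal = pnorm(1) - pnorm(-1))
```

If the two shares printed on the last line are close, the sums are behaving roughly like a standard normal even though no individual draw is normal.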

library(ggplot2)

library(patchwork)

set.seed(10)

norm_data <- data.frame(
  x = rnorm(10000, mean = 0, sd = 1)
)

p_norm <- ggplot(norm_data, aes(x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 40,
                 fill = "steelblue",
                 alpha = .7) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
  labs(title = "Normal Distribution",
       x = "Value",
       y = "Density") +
  theme_minimal()

binom_data <- data.frame(
  x = rbinom(10000, size = 20, prob = .4)
)

p_binom <- ggplot(binom_data, aes(x)) +
  geom_bar(aes(y = after_stat(prop)),
           fill = "darkorange",
           alpha = .7) +
  labs(title = "Binomial Distribution",
       x = "# Success",
       y = "Probability") +
  theme_minimal()

pois_data <- data.frame(
  x = rpois(10000, lambda = 5)
)

p_pois <- ggplot(pois_data, aes(x)) +
  geom_bar(aes(y = after_stat(prop)),
           fill = "darkgreen",
           alpha = .7) +
  labs(title = "Poisson Distribution",
       x = "Event Count",
       y = "Probability") +
  theme_minimal()

p_norm | p_binom | p_pois

The probability density function describes how densely probability mass is distributed around each possible value of a continuous random variable; it does not give probabilities directly, but relative likelihoods. The cumulative distribution function gives the probability that the random variable is less than or equal to a given value, obtained by integrating the probability density function up to that point.
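The relationship between the two functions can be checked numerically. The sketch below assumes a standard normal and an arbitrary cutoff `q`; the names `area` and `cdf` are illustrative.

```r
q <- 1.96

# Area under the PDF from -Inf up to q, computed by numerical integration
area <- integrate(dnorm, lower = -Inf, upper = q)$value

# The CDF evaluated at the same point
cdf <- pnorm(q)

c(integral_of_pdf = area, pnorm = cdf)

# A density is not a probability: with a small sd it can exceed 1
dnorm(0, mean = 0, sd = 0.1)
```

The two values on the first printed line should agree up to integration error, while the last line shows a density value greater than 1, which no probability can be.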

For a normal distribution, the PDF's bell shape reflects two important ideas: values near the mean are the most likely to be observed, and likelihood decays smoothly and symmetrically as distance from the mean increases. The exponential term penalizes large deviations quadratically, while the normalization constant ensures that the density integrates to 1.
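For reference, the standard form of the normal density makes both pieces visible. Here \(\mu\) is the mean, \(\sigma\) is the standard deviation, and \(\pi\) is the mathematical constant (not the mortality proportion defined later in this report):

\[ f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]

The squared deviation \((x - \mu)^2\) inside the exponential is the quadratic penalty, and the factor \(1 / (\sigma\sqrt{2\pi})\) is the normalization constant.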

p_norm_pdf <- ggplot(norm_data, aes(x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 40,
                 fill = "steelblue",
                 alpha = 0.7) +
  stat_function(fun = dnorm,
                args = list(mean = 0, sd = 1),
                linewidth = 1) +
  labs(
    title = "Normal Distribution PDF",
    x = "Value",
    y = "Density"
  ) +
  theme_minimal()

p_norm_cdf <- ggplot(norm_data, aes(x)) +
  stat_ecdf(geom = "step", linewidth = 1) +
  stat_function(fun = pnorm,
                args = list(mean = 0, sd = 1),
                linetype = "dashed") +
  labs(
    title = "Normal Distribution CDF",
    x = "Value",
    y = "P(X ≤ x)"
  ) +
  theme_minimal()

p_norm_pdf | p_norm_cdf

As you can see, the PDF plot shows how probability mass concentrates near the mean, with symmetry and exponential tail decay; the CDF plot accumulates that mass from left to right, flattening in the tails where the PDF is near 0.

The normal distribution is defined by two parameters: the mean (location) and the standard deviation (scale). The mean determines the center of the distribution, and the standard deviation controls the spread. In R, these parameters are optional and default to mean = 0 and sd = 1.

The binomial distribution is defined by the number of trials and the probability of success. The number of trials determines the support (the integers 0 through n), while the probability controls the skew and, together with n, the variance. In R, both parameters must be supplied because no meaningful defaults exist: dbinom requires size and prob.

The Poisson distribution is defined by a single parameter, the rate (lambda), which is both the mean and the variance. This parameter controls the expected number of events per interval. In R it must be stated explicitly: dpois requires lambda.
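A quick way to confirm which parameters are optional is to call each density function with and without them. This is a small sketch; the evaluation points and the result names are arbitrary choices.

```r
# dnorm() uses mean = 0 and sd = 1 when they are omitted
default_matches <- dnorm(1.5) == dnorm(1.5, mean = 0, sd = 1)

# dbinom() and dpois() have no defaults, so omitting their parameters is an error
binom_fails <- inherits(try(dbinom(3), silent = TRUE), "try-error")
pois_fails  <- inherits(try(dpois(3), silent = TRUE), "try-error")

c(default_matches = default_matches,
  binom_fails = binom_fails,
  pois_fails = pois_fails)
```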

?dnorm

#?dbinom
#?dpois

The normal distribution is appropriate when modeling continuous measurements influenced by many small independent effects. Examples include human height, measurement errors, and standardized test scores. In these settings, extreme values are rare and deviations from the mean are symmetric.

The binomial distribution models the count of successes in a fixed number of independent trials with constant success probability. Examples include the number of free throws made out of a fixed number of attempts, the number of defective items in a batch of known size, or the number of ad clicks over a fixed set of impressions. Each trial has only two outcomes and is run under identical conditions.

The Poisson distribution applies to event counts over a fixed interval when events occur independently at a constant average rate. Examples include the number of calls received by a call center per hour, the number of earthquakes in a region per year, or the number of goals scored in a soccer game. The key feature is that the count is unbounded above and not limited to a fixed number of trials.

Because we plotted the PDF and CDF of the normal distribution in part B, we do the same in part E for the binomial and Poisson distributions, using the PMF in place of the PDF because both are discrete:

n <- 20
p <- 0.4

binom_data <- data.frame(
  x = 0:n,
  pmf = dbinom(0:n, size = n, prob = p),
  cdf = pbinom(0:n, size = n, prob = p)
)

p_binom_pmf <- ggplot(binom_data, aes(x, pmf)) +
  geom_col(fill = "darkorange", alpha = 0.7) +
  labs(
    title = "Binomial PMF",
    x = "# Successes",
    y = "P(X = k)"
  ) +
  theme_minimal()

p_binom_cdf <- ggplot(binom_data, aes(x, cdf)) +
  geom_step(linewidth = 1) +
  labs(
    title = "Binomial CDF",
    x = "# Successes",
    y = "P(X ≤ k)"
  ) +
  theme_minimal()

p_binom_pmf | p_binom_cdf

The PMF shows the probability of observing exactly k successes out of a fixed number of trials. The distribution is centered near the expected value np, is roughly symmetric because the success probability is moderate, and has finite support between 0 and n. Probabilities taper toward the extremes because there are fewer ways to achieve very low or very high success counts. The CDF accumulates the PMF from left to right, giving the probability of observing at most k successes. The steepest increase occurs near the center of the distribution, where PMF values are largest. The stepwise shape reflects the discrete nature of the binomial outcome space.
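The claims about the center and spread can be verified directly from the PMF. The sketch below reuses n = 20 and p = 0.4 from the plot; the names `n_trials`, `p_success`, `mean_k`, `var_k`, and `mode_k` are illustrative.

```r
n_trials  <- 20
p_success <- 0.4
k   <- 0:n_trials
pmf <- dbinom(k, size = n_trials, prob = p_success)

total  <- sum(pmf)                      # probabilities over the full support sum to 1
mean_k <- sum(k * pmf)                  # expected value, equals n * p
var_k  <- sum((k - mean_k)^2 * pmf)     # variance, equals n * p * (1 - p)
mode_k <- k[which.max(pmf)]             # most likely success count

c(total = total, mean = mean_k, variance = var_k, mode = mode_k)
```

With these parameters the mean is np = 8 and the variance is np(1 - p) = 4.8, so the tallest bar in the PMF plot sits at the expected value.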

lambda <- 5
x_vals <- 0:20

pois_data <- data.frame(
  x = x_vals,
  pmf = dpois(x_vals, lambda = lambda),
  cdf = ppois(x_vals, lambda = lambda)
)

p_pois_pmf <- ggplot(pois_data, aes(x, pmf)) +
  geom_col(fill = "darkgreen", alpha = 0.7) +
  labs(
    title = "Poisson PMF",
    x = "Event Count",
    y = "P(X = k)"
  ) +
  theme_minimal()

p_pois_cdf <- ggplot(pois_data, aes(x, cdf)) +
  geom_step(linewidth = 1) +
  labs(
    title = "Poisson CDF",
    x = "Event Count",
    y = "P(X ≤ k)"
  ) +
  theme_minimal()

p_pois_pmf | p_pois_cdf

The PMF shows the probability of observing k events in a fixed interval given a constant average rate. The distribution is right-skewed, with most mass concentrated near the rate parameter lambda and a long tail for larger counts. Unlike the binomial, the support is unbounded above, though probabilities decay rapidly. This makes sense intuitively: the probability that the Bruins score 7 goals tonight is far lower than the probability that they score 3. The CDF represents the cumulative probability of observing k or fewer events. It rises quickly for small counts, where probability mass is concentrated, then flattens as additional event counts become increasingly unlikely. The step function again reflects discreteness, with diminishing increments in the upper tail.
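The equality of mean and variance, and the speed of the tail decay, can be checked from the PMF. This is a sketch with lambda = 5 as in the plot; the grid 0:100 is an assumed stand-in for the unbounded support, and the names `rate`, `mean_pois`, `var_pois`, and `tail_beyond_20` are illustrative.

```r
rate <- 5
k    <- 0:100   # wide enough that the omitted tail mass is negligible
pmf  <- dpois(k, lambda = rate)

mean_pois <- sum(k * pmf)                    # should equal the rate
var_pois  <- sum((k - mean_pois)^2 * pmf)    # should also equal the rate

# Probability mass above the plotted range 0:20
tail_beyond_20 <- ppois(20, lambda = rate, lower.tail = FALSE)

c(mean = mean_pois, variance = var_pois, tail_beyond_20 = tail_beyond_20)
```

The tiny tail probability explains why truncating the plots at 20 loses essentially nothing, even though the support has no upper bound.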

Convergence of Distributions

Let:

\[ X = \text{number of deaths within 30 days} \]

\[ N = \text{number of neurosurgical procedures} \]

\[ \pi = \text{national mortality proportion} \]

We are interested in determining whether the hospital’s observed mortality proportion is higher than the national rate by more than chance alone would explain.

We consider the one-sided hypothesis test:

\[ H_0: p = \pi \]

\[ H_A: p > \pi \]

where p is the true mortality proportion at the hospital.

Evidence against the null hypothesis is summarized by the upper-tail probability:

\[ P(X \ge x \mid H_0) \]

A small value of this probability suggests that the hospital’s mortality count is unusually high relative to the national proportion.

We select values consistent with the Poisson approximation regime:

\[ N = 200 \]

\[ x = 12 \]

\[ \pi = 0.03 \]

Under the null hypothesis, the expected number of deaths is:

\[ \lambda = N\pi = 200 \times 0.03 = 6 \]

Binomial:

Under the null hypothesis:

\[ X \sim \text{Binomial}(N, \pi) \]

The one-sided tail probability is:

\[ P(X \ge x) = 1 - P(X \le x - 1) \]

Poisson:

Using the Poisson approximation to the binomial:

\[ X \sim \text{Poisson}(\lambda) \]

with:

\[ \lambda = N\pi \]

The corresponding tail probability is:

\[ P(X \ge x) = 1 - P(X \le x - 1) \]

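N  <- 200         # number of neurosurgical procedures
x  <- 12          # observed deaths within 30 days
pi <- 0.03        # national mortality proportion (masks R's built-in constant pi)
lambda <- N * pi  # expected deaths under the null hypothesis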

# Binomial model
p_binom <- pbinom(x - 1, size = N, prob = pi, lower.tail = FALSE)

# Poisson approximation
p_pois <- ppois(x - 1, lambda = lambda, lower.tail = FALSE)

list(
  N = N,
  x = x,
  pi = pi,
  lambda = lambda,
  p_value_binomial = p_binom,
  p_value_poisson  = p_pois
)

## $N
## [1] 200
## 
## $x
## [1] 12
## 
## $pi
## [1] 0.03
## 
## $lambda
## [1] 6
## 
## $p_value_binomial
## [1] 0.01841132
## 
## $p_value_poisson
## [1] 0.02009196

Using the binomial model, we obtain:

\[ P_{\text{Binomial}}(X \ge 12) \approx 0.018 \]

Using the Poisson approximation, we obtain:

\[ P_{\text{Poisson}}(X \ge 12) \approx 0.020 \]

Both probabilities are below 0.05, which is evidence at the 5% level that the hospital’s mortality count is higher than expected under the national mortality rate.
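As a cross-check on the binomial tail probability, base R's exact binomial test computes the same one-sided quantity. This is a sketch: the names `n_proc`, `deaths`, and `pi_0` are illustrative stand-ins for N, x, and the national proportion, chosen so that the built-in constant `pi` is not overwritten.

```r
n_proc <- 200    # number of procedures (N)
deaths <- 12     # observed deaths (x)
pi_0   <- 0.03   # national mortality proportion under H0

# Upper-tail probability computed directly from the binomial CDF
p_tail <- pbinom(deaths - 1, size = n_proc, prob = pi_0, lower.tail = FALSE)

# The same one-sided p-value from the exact binomial test
p_test <- binom.test(deaths, n_proc, p = pi_0, alternative = "greater")$p.value

c(pbinom_tail = p_tail, binom_test = p_test)
```

The two values should be identical, since with alternative = "greater" the exact test reports the upper-tail binomial probability at the observed count.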

The binomial and Poisson results are very similar because the Poisson distribution is a limiting approximation to the binomial distribution when:

\[ N \text{ is large}, \quad \pi \text{ is small}, \quad \text{and} \quad \lambda = N\pi \text{ is moderate}. \]

The results are not identical because the binomial distribution has finite support bounded by N, while the Poisson distribution has infinite support and replaces the binomial term (1 − π)^(N − k) with the exponential e^(−λ). When N is large and π is small, this approximation is accurate and the resulting error is minimal.
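The limiting argument can be written out in one line. Substituting \(\pi = \lambda / N\) into the binomial PMF and holding \(\lambda\) and k fixed gives:

\[ \binom{N}{k} \pi^k (1 - \pi)^{N - k} = \frac{N(N-1)\cdots(N-k+1)}{N^k} \cdot \frac{\lambda^k}{k!} \cdot \left(1 - \frac{\lambda}{N}\right)^{N - k} \longrightarrow \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{as } N \to \infty \]

The first factor tends to 1 and the last factor tends to \(e^{-\lambda}\), which leaves the Poisson PMF. For finite N both factors differ slightly from their limits, which accounts for the small gap between the two p-values.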

A direct way to compare the two distributional assumptions is to plot their probability mass functions side by side. If the Poisson approximation is appropriate, the two PMFs will closely match over the range of counts with non-negligible probability mass. Highlighting the upper-tail region where k is greater than or equal to the observed value shows the portion of probability corresponding to the one-sided p-value.

k_max <- max(x + 20, ceiling(lambda + 6 * sqrt(lambda)))
k <- 0:k_max

pmf_df <- rbind(
  data.frame(k = k, pmf = dbinom(k, size = N, prob = pi), dist = "Binomial"),
  data.frame(k = k, pmf = dpois(k, lambda = lambda), dist = "Poisson")
)

pmf_df$tail_region <- pmf_df$k >= x

ggplot(pmf_df, aes(x = k, y = pmf)) +
  geom_col(aes(fill = tail_region), alpha = 0.45) +
  facet_wrap(~ dist, ncol = 2, scales = "free_y") +
  geom_vline(xintercept = x, linetype = "dashed", linewidth = 0.8) +
  labs(
    title = "PMFs Under Binomial vs Poisson Assumptions",
    subtitle = "Red bars mark the upper-tail region k ≥ x (the one-sided p-value region)",
    x = "k (deaths within 30 days)",
    y = "P(X = k)"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c(`FALSE` = "grey70", `TRUE` = "red")) +
  guides(fill = "none")
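The visual agreement between the two panels can also be summarized numerically. The sketch below computes the largest pointwise gap between the two PMFs; the names `n_proc`, `pi_0`, `pmf_gap`, `max_abs_diff`, and `k_at_max` are illustrative and do not appear elsewhere in this report.

```r
n_proc <- 200
pi_0   <- 0.03
k_all  <- 0:n_proc   # full binomial support

# Pointwise difference between the binomial PMF and its Poisson approximation
pmf_gap <- dbinom(k_all, size = n_proc, prob = pi_0) -
  dpois(k_all, lambda = n_proc * pi_0)

max_abs_diff <- max(abs(pmf_gap))              # largest gap at any single count
k_at_max     <- k_all[which.max(abs(pmf_gap))] # count where that gap occurs

c(max_abs_diff = max_abs_diff, k_at_max = k_at_max)
```

A small maximum gap relative to the height of the bars supports using the Poisson model here.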

Sam Denomme

2026-01-16