5.8 - Multinomial and bivariate normal

Introduction

In this section we will study two “well-known” joint distributions

Multinomial: multivariate extension of binomial
Bivariate normal: bivariate extension of normal

Multinomial experiment

A multinomial experiment satisfies these properties:

Experiment consists of \(n\) identical, independent trials.

Outcome of each trial falls into one of \(k\) mutually exclusive classes/cells/buckets.

Probability that each outcome falls into cell \(i\) is \(p_i\), \(i = 1,2,...,k\) with \(\sum_{i=1}^k p_i = 1\).

Random variables of interest are \(Y_1\), \(Y_2\),…,\(Y_k\), with \(Y_i\) the number of trials for which the outcome falls into cell \(i\).

Multinomial pmf

The joint pmf of \((Y_1,Y_2,...,Y_k)\) is:

\[p(y_1,y_2,...,y_k) = P(Y_1 = y_1,Y_2=y_2,...,Y_k=y_k) = \frac{n!}{y_1!y_2!...y_k!}p_1^{y_1}p_2^{y_2}...p_k^{y_k},\]

with joint support:

\[y_i \in \{0,1,...,n\}; \sum_{i=1}^k y_i = n\]

Multinomial properties

Marginally, \(Y_i \sim BIN(n, p_i)\)
- \(\Rightarrow E(Y_i) = np_i\)
- \(\Rightarrow Var(Y_i) = np_i(1-p_i)\)
\(Cov(Y_i, Y_j) = -np_ip_j\)

Genetics example

Suppose fruit flies contain a gene for body color
- \(A\): dominant allele = black
- \(a\): recessive allele = brown

Flies inherit one allele from father and one from mother
- \(AA\), \(Aa\) = black; \(aa\) = brown
Suppose \(A\) alleles occur with probability \(p\) and \(a\) with probability \(1-p\)

If 60 fruit flies are sampled, let \((Y_1,Y_2,Y_3)\) be the number of flies with genotypes \((AA,Aa,aa)\) respectively.

\[(Y_1,Y_2,Y_3) \sim MULT(n=60, p^2, 2p(1-p), (1-p)^2)\]

Suppose the \(A\) allele occurs 60% of the time. Find \(p(20,30,10)\).

Solving genetics example

\[(Y_1,Y_2,Y_3) \sim MULT(n=60, p^2, p(1-p), (1-p)^2)\]

\(p=0.6\)
Find \(p(20,30,10)\).

\[p(20,30,10) = \frac{60!}{20!30!10!} (0.36)^{20} (0.48)^{30}(0.16)^{10}\]

dmultinom(x = c(20,30,10), prob = c(.36, .48, .16))

[1] 0.01429269

Simulating \(N\) fruit fly samples

library(tidyverse)
(thousand_fly_samples <- rmultinom(1000, size = 60, prob = c(.36, .48, .16)) 
          %>% t()
          %>% data.frame()
          %>% setNames(c('Y1','Y2','Y3'))
)  %>% head

cov(thousand_fly_samples) %>% round(3)

        Y1      Y2     Y3
Y1  13.490 -10.632 -2.859
Y2 -10.632  15.516 -4.885
Y3  -2.859  -4.885  7.744

library(GGally)
ggpairs(thousand_fly_samples, 
          lower = list(continuous = wrap("points", 
                                         alpha = 0.6,
                                         position = position_jitter(width = 0.2, 
                                                                    height = 0.2)
                                         )
                       )
        ) + 
  theme_classic(base_size = 1)

Pairs plot of fruit fly simulation study

Which \((Y_i, Y_j)\) pairs are most negatively correlated, and why?

Bivariate normal

If \((X,Y) \sim BVN(\mu_X,\mu_Y,\sigma_X^2,\sigma_Y^2,\rho)\), then:

\[\small f(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} e^{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2 + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y}\right]},x, y \in \mathbb{R}\]

Bivariate standard normal:

\[f(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}} e^{-\frac{x^2-2\rho xy + y^2}{2(1-\rho^2)}}, x, y \in \mathbb{R}\]

Bivariate standard normal surface

Fully interactive: https://www.desmos.com/3d/lugqutzqmc

Properties of bivariate normal

\(\rho = Cor(X,Y) = 0\) if and only if \(X\) and \(Y\) are independent!
\(X\) and \(Y\) are marginally normally distributed

For standard bivariate normal:

\[Y|X=x\sim N(\rho x, 1-\rho^2)\] \[X|Y=y \sim N(\rho y, 1-\rho^2)\]

For full bivariate normal:

\[Y|X=x \sim N\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X} (x-\mu_X), \sigma_Y^2 (1-\rho^2)\right)\]

Simulating bivariate normal data

Suppose we want to simulate \(N\) realizations of \((X,Y)\sim BVN(\mu_X, \mu_Y, \sigma^2_X,\sigma^2_Y, \rho)\) for a grid of parameter values.
Can use the following relationships. Given \(Z_1 \sim N(0,1)\) and \(Z_2 \sim N(0,1)\), with \(Z_1\) and \(Z_2\) independent:

\[X_{std} = Z_1\] \[Y_{std} = \rho Z_1 + \sqrt{1-\rho^2}Z_2\] \[X = \sigma_X \cdot X_{std}+ \mu_X\] \[Y = \sigma_Y \cdot Y_{std}+ \mu_Y\]

Simulating over a grid

We can use these relationships to simulate over parameter grid using familiar techniques to get a data set like this:

head(bvnorm_sim)

# A tibble: 6 × 10
   sigx  sigy   rho .trial      z1      z2   x_std  y_std     x     y
  <dbl> <dbl> <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
1   0.5   0.5  -0.8      1 -0.776   1.52   -0.776   1.54   4.61  5.77
2   0.5   0.5  -0.8      2  0.187  -2.52    0.187  -1.66   5.09  4.17
3   0.5   0.5  -0.8      3 -0.911  -0.727  -0.911   0.292  4.54  5.15
4   0.5   0.5  -0.8      4 -0.0697  1.82   -0.0697  1.15   4.97  5.57
5   0.5   0.5  -0.8      5 -1.39   -0.988  -1.39    0.520  4.30  5.26
6   0.5   0.5  -0.8      6  0.208   0.0933  0.208  -0.111  5.10  4.94

Code: practice!

Plotting over a grid

Given data shown, we can create the following:

library(ggh4x)
ggplot(data = bvnorm_sim) + 
  geom_point(aes(x = x, y = y), alpha = 0.5, size = .5) + 
  geom_point(aes(x = 5, y = 5), col='red') + 
  facet_nested(rho~sigx+sigy, labeller = label_both) + 
  theme_bw()

Simulated bivariate normal data over parameter grids