The normal distribution

M. Drew LaMar
October 5, 2020

“Standard normal deviate: Not to be confused with 'everyday ordinary pervert.' You don't often find a jargon term that seems to be both redundant and self-contradictory.”

- Whitlock & Schluter

Other types of contingency tests

Fisher’s exact test: (2 x 2 tables only) Tests the independence of two categorical variables, even when expected frequencies are small.

\( G \)-test: (any table) Derived from principles of likelihood.

     Pro: Great with complicated experimental designs with multiple explanatory variables.
     Con: Can be less accurate for small sample sizes.
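
As a rough sketch of both tests in base R, the block below uses the 2 x 3 table of counts shown later in these slides. The object name `tab` and the hand-computed statistic \( G = 2\sum O \ln(O/E) \) are illustrative assumptions, not code from the course. `fisher.test` is applied to a 2 x 2 sub-table, and \( G \) is compared to a \( \chi^2 \) distribution with (rows - 1)(columns - 1) degrees of freedom.

```r
# Counts copied from the table later in these slides (object name is hypothetical)
tab <- matrix(c(1, 10, 37,
                49, 35, 9),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Eaten", "Not eaten"),
                              c("Uninfected", "Lightly", "Highly")))

# Fisher's exact test on a 2 x 2 sub-table (Uninfected vs. Highly)
fisher_p <- fisher.test(tab[, c("Uninfected", "Highly")])$p.value

# G-test by hand: G = 2 * sum(O * log(O / E)); fine here because no count is zero
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
G <- 2 * sum(tab * log(tab / expected))
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
g_p <- pchisq(G, df = df, lower.tail = FALSE)

c(fisher_p = fisher_p, G = G, df = df, g_p = g_p)
```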

Magnitudes of association

Situation: You've found a statistically significant association between two categorical variables using the \( \chi^2 \) contingency test.

Question: Where is the association? In other words, in which levels of the categories is the association present and how large is the association?

We need to estimate the magnitude of the association, which the \( P \)-value does not give us!

We can estimate odds ratios or relative risks for 2 \( \times \) 2 sub-tables within the contingency table by either subsetting or collapsing.

Magnitudes of association

          Uninfected Lightly Highly
Eaten              1      10     37
Not eaten         49      35      9

Magnitudes of association

          Uninfected Highly
Eaten              1     37
Not eaten         49      9
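
The sub-table above can be obtained by subsetting columns; collapsing instead merges levels. A minimal sketch, assuming `parTable` (the name used in the `oddsratio()` call in these slides) is a plain matrix of counts; the `Infected` column label is made up for illustration.

```r
# Assumed structure of parTable: a 2 x 3 matrix of counts
parTable <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

# Subsetting: keep only the Uninfected and Highly columns
sub <- parTable[, c(1, 3)]

# Collapsing: merge Lightly and Highly into a single (hypothetical) Infected level
collapsed <- cbind(Uninfected = parTable[, "Uninfected"],
                   Infected   = parTable[, "Lightly"] + parTable[, "Highly"])

sub
collapsed
```

Subsetting discards the lightly infected fish, while collapsing keeps every observation but answers a coarser question.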

Magnitudes of association

oddsratio(parTable[,c(1,3)], method = "wald")
$data
          Uninfected Highly Total
Eaten              1     37    38
Not eaten         49      9    58
Total             50     46    96

$measure
                        NA
odds ratio with 95% C.I.    estimate        lower      upper
               Eaten     1.000000000           NA         NA
               Not eaten 0.004964148 0.0006020703 0.04093004

$p.value
           NA
two-sided     midp.exact fisher.exact   chi.square
  Eaten               NA           NA           NA
  Not eaten 1.110223e-16 6.861412e-17 4.140762e-15

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
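
To see where the Wald estimate and interval come from, here is a hand computation in base R. It is a sketch under the assumption that the Wald method uses the cross-product ratio and the usual standard error of the log odds ratio, \( \sqrt{1/a + 1/b + 1/c + 1/d} \); the variable names are illustrative. It should reproduce the estimate and 95% limits printed above.

```r
# 2 x 2 sub-table from the output above
sub <- matrix(c(1, 37,
                49, 9),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Eaten", "Not eaten"),
                              c("Uninfected", "Highly")))

# Odds ratio as a cross-product ratio: (a * d) / (b * c)
or_hat <- (sub[1, 1] * sub[2, 2]) / (sub[1, 2] * sub[2, 1])

# Standard error of log(OR) and the 95% Wald interval on the log scale
se_log <- sqrt(sum(1 / sub))
z_crit <- qnorm(0.975)
ci <- exp(log(or_hat) + c(-1, 1) * z_crit * se_log)

c(estimate = or_hat, lower = ci[1], upper = ci[2])
```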

The normal distribution

Definition: The normal distribution is a continuous probability distribution describing a bell-shaped curve. It is a good approximation to the frequency distributions of many biological variables.

The normal distribution - Equation

The probability density function \( f(Y) \) of a normal random variable \( Y \) is given by \[ f(Y) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(Y-\mu)^2}{2\sigma^2}}, \] where \( \mu \) and \( \sigma \) are the mean and standard deviation of \( Y \), respectively.
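
As a quick check of the formula, the sketch below codes the density by hand and compares it with R's built-in `dnorm`. The function name `normal_pdf` and the values \( \mu = 5 \), \( \sigma = 2 \) are chosen only for illustration.

```r
# Hand-coded normal density, term by term from the equation above
normal_pdf <- function(y, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(y - mu)^2 / (2 * sigma^2))
}

y <- seq(-2, 12, length.out = 5)
manual  <- normal_pdf(y, mu = 5, sigma = 2)
builtin <- dnorm(y, mean = 5, sd = 2)
all.equal(manual, builtin)

# A density must integrate to 1 over the whole real line
area <- integrate(normal_pdf, lower = -Inf, upper = Inf, mu = 5, sigma = 2)$value
area
```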

The normal distribution - In R

x <- seq(from=-2, to=12, length.out=1000)
y <- dnorm(x, mean=5, sd=2)
plot(x, y, type="l", cex.axis=1.5, cex.lab=1.5)

Summary of functions for normal dist.

Name  R command                             Uses
PDF   dnorm(x, mean, sd)                    -
CDF   pnorm(q, mean, sd, lower.tail=TRUE)   -
CCDF  pnorm(q, mean, sd, lower.tail=FALSE)  Compute \( P \)-values
QF    qnorm(p, mean, sd, lower.tail=TRUE)   -
CQF   qnorm(p, mean, sd, lower.tail=FALSE)  Compute critical values

Defaults: mean = 0 and sd = 1 (the standard normal distribution)
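
The short sketch below exercises the functions in the table with their default mean and sd. The cutoffs 1.96, 1.5 and 0.025 are illustrative choices, not values from the slides.

```r
# Upper-tail probability (a one-sided P-value for an observed Z of 1.96)
p_upper <- pnorm(1.96, lower.tail = FALSE)

# Critical value cutting off 2.5% in the upper tail
crit <- qnorm(0.025, lower.tail = FALSE)

# qnorm undoes pnorm: the quantile function is the inverse of the CDF
q_back <- qnorm(pnorm(1.5))

# Lower and upper tails at the same point add up to 1
tails <- pnorm(1.5) + pnorm(1.5, lower.tail = FALSE)

c(p_upper = p_upper, crit = crit, q_back = q_back, tails = tails)
```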

Discuss: With \( \mu=\sigma=2 \), what is \( \mathrm{Pr[} Y > 4\mathrm{]} \)?

\[ Y \sim N(\mu,\sigma^2) = N(2,4) \]

Computing probs - Greater than

Question: With \( Y \sim N(2,4) \), what is \( \mathrm{Pr[} Y > 4\mathrm{]} \)?

(prob <- pnorm(4, 
               mean=2, 
               sd=2, 
               lower.tail=FALSE))
[1] 0.1586553

Computing probs - Less than

Question: With \( Y \sim N(2,4) \), what is \( \mathrm{Pr[} Y < 4\mathrm{]} \)?

(prob <- pnorm(4, 
               mean=2, 
               sd=2, 
               lower.tail=TRUE))
[1] 0.8413447
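
The "greater than" and "less than" answers are complementary, because \( \mathrm{Pr[} Y = 4\mathrm{]} = 0 \) for a continuous variable. A small check, with illustrative variable names:

```r
# Upper and lower tail probabilities at Y = 4 for Y ~ N(2, 4)
upper <- pnorm(4, mean = 2, sd = 2, lower.tail = FALSE)
lower <- pnorm(4, mean = 2, sd = 2, lower.tail = TRUE)

# The two tails partition the whole distribution
upper + lower
```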

Computing probs - Between

Question: With \( Y \sim N(2,4) \), what is \( \mathrm{Pr[} 2 < Y < 4\mathrm{]} \)?

\[ \mathrm{Pr[} 2 < Y < 4\mathrm{]} = \mathrm{Pr[} Y > 2\mathrm{]} - \mathrm{Pr[} Y > 4\mathrm{]} \]

(prob <- 
   pnorm(2, 
         mean=2, sd=2, 
         lower.tail=FALSE) - 
   pnorm(4, 
         mean=2, sd=2, 
         lower.tail=FALSE))
[1] 0.3413447
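
The same probability can be reached in two other ways, sketched here as a cross-check: as a difference of lower-tail probabilities, and as the area under the density curve between 2 and 4 found by numerical integration. Variable names are illustrative.

```r
# Difference of CDF values: Pr[Y < 4] - Pr[Y < 2]
by_cdf <- pnorm(4, mean = 2, sd = 2) - pnorm(2, mean = 2, sd = 2)

# Area under the density between 2 and 4
by_area <- integrate(dnorm, lower = 2, upper = 4, mean = 2, sd = 2)$value

c(by_cdf = by_cdf, by_area = by_area)
```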

Standard normal deviates

\[ \begin{eqnarray*} f(Y) & = & \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(Y-\mu)^2}{2\sigma^2}} \\ & = & \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\left(\frac{Y-\mu}{\sigma}\right)^2} \end{eqnarray*} \]

Letting \( Z = \frac{Y-\mu}{\sigma} \), so that \( dY = \sigma\, dZ \) absorbs the factor \( 1/\sigma \) in front, the density of \( Z \) is

\[ f(Z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}Z^2}. \]

The mean of \( Z \) is zero and the standard deviation of \( Z \) is one.
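
A numerical sketch of this change of variables, using illustrative values \( \mu = 2 \), \( \sigma = 2 \): the density of \( Y \) equals the standard normal density at \( Z \) divided by \( \sigma \), and cumulative probabilities are unchanged by standardizing.

```r
mu <- 2
sigma <- 2
y <- c(-1, 2, 4, 7)

# Standardize: number of standard deviations from the mean
z <- (y - mu) / sigma

# Densities differ only by the factor 1 / sigma
same_density <- all.equal(dnorm(y, mean = mu, sd = sigma), dnorm(z) / sigma)

# Cumulative probabilities agree exactly
same_cdf <- all.equal(pnorm(y, mean = mu, sd = sigma), pnorm(z))

c(same_density = same_density, same_cdf = same_cdf)
```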

Standard normal deviates

Definition: The standard normal deviate

\[ Z = \frac{Y-\mu}{\sigma} \] tells us how many standard deviations \( \sigma \) a particular \( Y \) value is from the mean \( \mu \).

Standard normal tables

Question: With \( Y \sim N(\mu=2,\sigma^2=4) \), what is \( \mathrm{Pr[} 2 < Y < 4\mathrm{]} \)?

\[ \begin{eqnarray*} \mathrm{Pr[} 2 < Y < 4\mathrm{]} & = & \mathrm{Pr}\left[ \frac{2-2}{2} < Z < \frac{4-2}{2}\right] \\ & = & \mathrm{Pr[}0 < Z < 1\mathrm{]} \\ & = & \mathrm{Pr[}Z > 0\mathrm{]} - \mathrm{Pr[}Z > 1\mathrm{]} \end{eqnarray*} \]

\[ \begin{eqnarray*} \mathrm{Pr[} 2 < Y < 4\mathrm{]} & = & \mathrm{Pr[}Z > 0\mathrm{]} - \mathrm{Pr[}Z > 1\mathrm{]} \\ & = & 0.5 - 0.1587 \\ & = & 0.3413 \end{eqnarray*} \]
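
In place of a printed table, the same lookup can be sketched in R with `pnorm` at its default mean and sd. Variable names are illustrative.

```r
mu <- 2
sigma <- 2

# Convert the limits on Y to standard normal deviates
z_lo <- (2 - mu) / sigma
z_hi <- (4 - mu) / sigma

# Pr[Z > z_lo] - Pr[Z > z_hi], as in the derivation above
prob <- pnorm(z_lo, lower.tail = FALSE) - pnorm(z_hi, lower.tail = FALSE)
round(prob, 4)
```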

Normal distribution of sample means

Theorem: If a variable \( Y \) has a normal distribution in a population, then the distribution of sample means \( \bar{Y} \) is also normal.

Theorem: \( Y \sim N(\mu,\sigma^2) \Rightarrow \bar{Y} \sim N(\mu,\sigma_{\bar{Y}}^2) \), where \( \sigma_{\bar{Y}} \) is the standard error of the mean given by

\[ \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}. \]
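
A simulation sketch of this result, assuming \( Y \sim N(2, 4) \) and a sample size of \( n = 25 \) (both illustrative choices, as are the seed and the number of replicates). The standard deviation of the simulated sample means should be close to \( \sigma/\sqrt{n} \), and the theorem lets us compute probabilities for \( \bar{Y} \) directly.

```r
set.seed(1)
mu <- 2
sigma <- 2
n <- 25

# Draw many samples of size n and record each sample mean
means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

se_theory <- sigma / sqrt(n)   # standard error from the theorem
se_sim <- sd(means)            # spread of the simulated sample means

# Pr[Ybar > 2.5] using the sampling distribution N(mu, se_theory^2)
p_above <- pnorm(2.5, mean = mu, sd = se_theory, lower.tail = FALSE)

c(se_theory = se_theory, se_sim = se_sim, p_above = p_above)
```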

Central limit theorem

Central Limit Theorem: The sum or mean of a large number of measurements randomly sampled from a non-normal population is approximately normally distributed.

http://www.zoology.ubc.ca/~whitlock/kingfisher/CLT.htm
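
A small simulation sketch of the theorem, assuming a strongly skewed exponential population with rate 1 (mean 1, standard deviation 1) and samples of size 50. The population, sample size, seed and number of replicates are all illustrative choices.

```r
set.seed(42)
n <- 50

# Means of many samples drawn from a non-normal (exponential) population
means <- replicate(10000, mean(rexp(n, rate = 1)))

# If the means are approximately normal, about 95% of them should fall
# within 1.96 standard errors of the population mean
inside <- mean(abs(means - 1) < 1.96 / sqrt(n))

c(mean = mean(means), sd = sd(means), theory_sd = 1 / sqrt(n), inside = inside)
```

Plotting `hist(means)` shows a roughly bell-shaped histogram even though individual exponential observations are far from normal.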

Normal approx. to binomial dist.

Discuss: Why does the normal distribution show up so often in many apparently unrelated fields of study?

Definition: The normal distribution arises naturally from the combination of a large number of independent random events or factors.

Normal approx. to binomial dist.

  • Flip a coin at each pin: heads sends the ball right, tails sends it left
  • The number of heads determines which positive-slope “lane” the ball follows
  • Overlaying Pascal's triangle gives the number of paths to each position
  • Running the machine weights those paths by their probabilities
  • Thus, we get a binomial distribution!
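
The steps above can be sketched as a simulated Galton board. The number of pin rows, the number of balls and the seed are illustrative assumptions; each ball's final position is simply its number of heads.

```r
set.seed(7)
rows <- 10     # rows of pins (coin flips per ball)
balls <- 5000  # balls dropped through the board

# Each ball: count heads (bounces to the right) over `rows` fair coin flips
lanes <- replicate(balls, sum(sample(c(0, 1), rows, replace = TRUE)))

# Pascal's triangle: number of paths ending at each final position
paths <- choose(rows, 0:rows)

# Every path has probability (1/2)^rows, so path counts give the binomial
prob_lane <- paths / 2^rows

# Observed proportion of balls at each position
observed <- table(factor(lanes, levels = 0:rows)) / balls

round(rbind(observed = as.numeric(observed), binomial = prob_lane), 3)
```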

Normal approx. to binomial dist.

“A typical example is a person's height, which is determined by a combination of many independent factors, both genetic and environmental. Each of these factors may tend to increase or decrease a person's height, just as a ball in Galton's board may bounce to the right or the left at each level. As Galton's board shows, when you combine many chance factors, the resulting distribution is binomial. By the Central Limit Theorem, when the number of independent factors is very large, the binomial distribution is approximated by a normal curve.”

Paul Trow (http://ptrow.com/articles/Galton_June_07.htm)
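
To make the approximation concrete, the sketch below compares binomial probabilities with the normal density that has the same mean \( np \) and standard deviation \( \sqrt{np(1-p)} \). The values \( n = 100 \) and \( p = 0.5 \) are illustrative choices.

```r
n <- 100
p <- 0.5
k <- 0:n

# Exact binomial probabilities for every possible number of successes
binom <- dbinom(k, size = n, prob = p)

# Normal density with matching mean and standard deviation
approx_norm <- dnorm(k, mean = n * p, sd = sqrt(n * p * (1 - p)))

# Largest pointwise discrepancy between the two
max_diff <- max(abs(binom - approx_norm))
max_diff
```

Rerunning with a smaller `n`, or with `p` far from 0.5, should show a larger discrepancy, which is why the approximation is recommended only when both \( np \) and \( n(1-p) \) are reasonably large.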