7.4 - Sampling from normal populations: sample mean and variance

Overview

  • For the next couple of lectures, we will focus on important sampling distributions that arise when the parent population is normal:
[Figure: normal parent population]
  • This lecture: study sampling distributions of sample mean and variance

Sample mean and variance defined

  • Given a sample \(Y_1\), \(Y_2\),…,\(Y_n\):

\[\bar Y = \frac{\sum_{i=1}^n Y_i}{n}\]

\[S^2 = \frac{\sum_{i=1}^n (Y_i - \bar Y)^2}{n-1}\]
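As a quick sanity check, here is a short R snippet (the sample values are made up) confirming that base R’s mean() and var() compute exactly these formulas:

y <- c(3.1, 4.7, 2.9, 5.2, 4.1)    # hypothetical sample, n = 5
n <- length(y)
ybar <- sum(y) / n                 # sample mean by the formula
S2 <- sum((y - ybar)^2) / (n - 1)  # sample variance by the formula
all.equal(ybar, mean(y))           # TRUE
all.equal(S2, var(y))              # TRUE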

  • To study sampling distributions of these statistics (or functions thereof), let’s build up a few preliminaries.

General algebraic facts

The following hold regardless of the distribution of the parent population.

  • \(\sum_{i=1}^n Y_i = n \bar Y\)
  • Define a sample residual \(e_i = Y_i - \bar Y\). Then we can re-express \(S^2\) as:

\[S^2 = \frac{\sum_{i=1}^n (Y_i-\bar Y)^2}{n-1} = \frac{\sum_{i=1}^n e_i^2}{n-1}.\]

  • Another way to represent “sum of squared residuals”: \(\sum_{i=1}^n e_i^2=(n-1)S^2\)
  • Sum of sample residuals is 0: \(\sum_{i=1}^n (Y_i-\bar Y) =\sum_{i=1}^n e_i= 0\)

Proof:

\[\sum_{i=1}^n (Y_i-\bar Y) =\sum_{i=1}^n Y_i-\sum_{i=1}^n \bar Y= n\bar Y - n \bar Y = 0\]
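A one-line numerical illustration in R (the sample is arbitrary):

y <- rnorm(10)     # any sample will do
sum(y - mean(y))   # 0 up to floating-point error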

General algebraic facts (continued)

  • \(\sum_{i=1}^n (Y_i-\mu)^2 = \sum_{i=1}^n (Y_i-\bar Y)^2 + n(\bar Y-\mu)^2\)

Proof:

\[\sum_{i=1}^n (Y_i-\mu)^2=\sum_{i=1}^n (Y_i-\bar Y + \bar Y -\mu)^2=\sum_{i=1}^n [(Y_i-\bar Y)^2 + 2(Y_i-\bar Y)(\bar Y -\mu)+(\bar Y -\mu)^2]\]

\[=\sum_{i=1}^n (Y_i-\bar Y)^2 + 2(\bar Y -\mu)\sum_{i=1}^n (Y_i-\bar Y)+\sum_{i=1}^n(\bar Y -\mu)^2\]

\[=\sum_{i=1}^n (Y_i-\bar Y)^2 +0+n(\bar Y -\mu)^2\]
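This identity is also easy to verify numerically; a quick R check with arbitrary values for the sample and \(\mu\):

y <- rnorm(8)
mu <- 1.5                                                  # any fixed value
lhs <- sum((y - mu)^2)
rhs <- sum((y - mean(y))^2) + length(y) * (mean(y) - mu)^2
all.equal(lhs, rhs)                                        # TRUE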

Distributions of individual observations (review)

If \(Y \sim N(\mu,\sigma^2)\), then:

  • \(Z = \frac{Y - \mu}{\sigma} \sim N(0,1)\)
  • A standard normal squared is \(\chi^2_1\): \(Z^2 \sim \chi^2_1\)

Proved using the MGF method of Section 6.3.
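A quick simulation sketch comparing the empirical quantiles of \(Z^2\) to those of \(\chi^2_1\):

z2 <- rnorm(1e5)^2                 # squared standard normals
quantile(z2, c(0.5, 0.9, 0.99))    # empirical quantiles
qchisq(c(0.5, 0.9, 0.99), df = 1)  # chi-square(1) quantiles -- should be close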

Sampling distributions of sums of normals (or functions thereof)

If \(Y_1, Y_2, \ldots, Y_n \stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)\), then:

  1. \(\sum_{i=1}^n Y_i \sim N(n\mu, n\sigma^2)\)
  2. \(\bar Y = \frac{\sum_{i=1}^nY_i}{n}\sim N\left(\mu, \frac{\sigma^2}{n}\right)\)
  3. \(\frac{\bar Y - \mu}{\sigma/\sqrt{n}}\sim N(0,1)\)

Proofs: Practice!
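As a warm-up (or a check on your proof), a minimal simulation sketch of fact 2, with arbitrary parameter values:

n <- 10; mu <- 2; sigma <- 3
ybars <- replicate(1e5, mean(rnorm(n, mu, sigma)))
mean(ybars)   # close to mu = 2
var(ybars)    # close to sigma^2 / n = 0.9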

Properties of \(S^2\) when sampling from normal

If \(Y_1\), \(Y_2\),…,\(Y_n\) \(\stackrel{i.i.d.}{\sim} N(\mu, \sigma^2)\), then:

  1. \(S^2 \perp\!\!\!\perp \bar Y\)

    • Difficult to prove! Requires an \(n\rightarrow n\) Jacobian transformation; see this pdf for the gory details. We will convince ourselves with simulations.
  2. \(\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\)

    • We will prove this below.

Simulation “proof” of 1. \(\bar Y \perp\!\!\!\perp S^2\)

library(tidyverse)
library(purrrfect)

N <- 1000  # number of simulated samples per parameter combination
(many_normal_samples <- parameters(~n, ~mu, ~sigma,
           c(5, 10, 20), c(-2, 0, 2), c(2, 4)
           )
          %>% add_trials(N)
          %>% mutate(ysample = pmap(list(n, mu, sigma), \(nn, m, s) rnorm(nn, m, s)))  # draw each sample
          %>% mutate(ybar = map_dbl(ysample, mean),  # sample mean of each sample
                     S2 = map_dbl(ysample, var)      # sample variance of each sample
                     )
) %>% head()
# A tibble: 6 × 7
      n    mu sigma .trial ysample    ybar    S2
  <dbl> <dbl> <dbl>  <dbl> <list>    <dbl> <dbl>
1     5    -2     2      1 <dbl [5]> -2.33 2.70 
2     5    -2     2      2 <dbl [5]> -1.46 0.538
3     5    -2     2      3 <dbl [5]> -2.32 4.81 
4     5    -2     2      4 <dbl [5]> -1.93 5.22 
5     5    -2     2      5 <dbl [5]> -1.87 0.923
6     5    -2     2      6 <dbl [5]> -2.58 7.43 

Plotting \(\bar Y\) vs \(S^2\)

library(ggh4x)
ggplot(data = many_normal_samples) + 
  geom_point(aes(x = ybar, y = S2),
             shape='.')+ 
  labs(x = expression(bar(Y)),
       y = expression(S^2),
       title='Plots of sample mean vs sample variance')+
  facet_nested(mu~sigma+n, labeller = label_both, scale = 'free_y') + 
  theme_classic()
[Figure: faceted scatterplots of sample mean vs sample variance]
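As a rough numerical companion to the plot, the correlation between \(\bar Y\) and \(S^2\) is near zero within every parameter combination (zero correlation does not imply independence, but dependence would typically show up here):

many_normal_samples %>%
  group_by(n, mu, sigma) %>%
  summarize(cor_ybar_S2 = cor(ybar, S2), .groups = 'drop')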

\(\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\) preliminaries

A. A sum of independent chi-squares is chi-square, and the degrees of freedom add: if \(Z_1^2,Z_2^2,\ldots,Z_n^2 \stackrel{i.i.d.}{\sim} \chi^2_1\), then \(\sum_{i=1}^n Z_i^2 \sim \chi^2_n\).

Proof:

\[M_{Z^2}(t) = \left(\frac{1}{1-2t}\right)^{1/2}\]

Since the \(Z_i^2\) are independent, the MGF of the sum is the product of the individual MGFs:

\[M_{\sum_{i=1}^n Z_i^2 }(t) = \prod_{i=1}^n M_{Z_i^2}(t) = \underbrace{\left(\frac{1}{1-2t}\right)^{n/2}}_{MGF\ of\ \chi^2_n}\]
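A quick simulation check of preliminary A, using the fact that \(\chi^2_n\) has mean \(n\) and variance \(2n\) (the choice \(n = 5\) is arbitrary):

n <- 5
sums <- replicate(1e5, sum(rnorm(n)^2))
c(mean(sums), var(sums))   # close to n = 5 and 2n = 10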

B. If, for \(p>q\):

  • \(U=X+Y\)
  • \(U\sim \chi^2_p\)
  • \(Y\sim \chi^2_q\)
  • \(X\perp\!\!\!\perp Y\)

then \(X = U-Y\sim \chi^2_{p-q}\).

Proof:

\[\left(\frac{1}{1-2t}\right)^{p/2}=M_U(t) = M_X(t)M_Y(t) = M_X(t)\left(\frac{1}{1-2t}\right)^{q/2} \quad \mbox{(by independence of } X \mbox{ and } Y)\]

\[\Rightarrow M_X(t) = \underbrace{\left(\frac{1}{1-2t}\right)^{(p-q)/2}}_{MGF\ of\ \chi^2_{p-q}}\]

\(\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\) proof

If \(Y_1\), \(Y_2\),…,\(Y_n \stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)\):

\[\frac{Y_i-\mu}{\sigma} \sim N(0,1) \Rightarrow \left(\frac{Y_i-\mu}{\sigma}\right)^2 \sim \chi^2_1 \Rightarrow \sum_{i=1}^n \left(\frac{Y_i-\mu}{\sigma}\right)^2 \sim \chi^2_n\]

But:

\[\sum_{i=1}^n \left(\frac{Y_i-\mu}{\sigma}\right)^2 =\frac{1}{\sigma^2}\sum_{i=1}^n (Y_i-\mu)^2 =\frac{1}{\sigma^2}\left( \sum_{i=1}^n (Y_i-\bar Y)^2 + n(\bar Y - \mu)^2\right) \mbox{(algebraic property, slide 5)}\]

\[ = \frac{(n-1)S^2}{\sigma^2} + n\left(\frac{\bar Y - \mu}{\sigma}\right)^2 = \frac{(n-1)S^2}{\sigma^2} + \underbrace{\left(\frac{\bar Y - \mu}{\sigma/\sqrt{n}}\right)^2}_{N(0,1)^2 \equiv \chi^2_1}\]

Furthermore, since \(S^2 \perp\!\!\!\perp \bar Y\), using preliminary B on previous slide:

\[ \frac{(n-1)S^2}{\sigma^2} = \underbrace{\sum_{i=1}^n \left(\frac{Y_i-\mu}{\sigma}\right)^2}_{\sim \chi^2_n}- \underbrace{\left(\frac{\bar Y - \mu}{\sigma/\sqrt{n}}\right)^2}_{\sim \chi^2_1}\sim \chi^2_{n-1}\]
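A simulation sketch corroborating the result, comparing empirical quantiles of the statistic to \(\chi^2_{n-1}\) quantiles (parameter values arbitrary):

n <- 10; mu <- 2; sigma <- 3
stat <- replicate(1e5, (n - 1) * var(rnorm(n, mu, sigma)) / sigma^2)
quantile(stat, c(0.25, 0.5, 0.75, 0.95))      # empirical quantiles
qchisq(c(0.25, 0.5, 0.75, 0.95), df = n - 1)  # chi-square(9) quantiles -- should be close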

Application: 95% CI for \(\sigma^2\)

  • Let \(q_1\) and \(q_2\) be the \(2.5^{th}\) and \(97.5^{th}\) percentiles of a \(\chi^2_{n-1}\) distribution (use qchisq(0.025, df = n-1) and qchisq(0.975, df = n-1)):
[Figure: \(\chi^2_{n-1}\) density with the 2.5th and 97.5th percentiles marked]
  • Then \(0.95 = P\left(q_1 \le \frac{(n-1)S^2}{\sigma^2} \le q_2 \right)= P\left(\frac{(n-1)S^2}{q_1} \ge \sigma^2 \ge \frac{(n-1)S^2}{q_2} \right)\), so a 95% confidence interval for \(\sigma^2\) is \(\left(\frac{(n-1)S^2}{q_2},\ \frac{(n-1)S^2}{q_1}\right)\)
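In R, with a hypothetical sample y of n = 20 observations:

y <- rnorm(20, mean = 12, sd = 0.1)  # hypothetical data
n <- length(y)
q1 <- qchisq(0.025, df = n - 1)
q2 <- qchisq(0.975, df = n - 1)
c(lower = (n - 1) * var(y) / q2,
  upper = (n - 1) * var(y) / q1)     # 95% CI for sigma^2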

Application: hypothesis tests of \(\sigma^2\) in quality control

  • Suppose a machine is calibrated to precisely fill 12-ounce Coke bottles, but there may be slight variation from bottle to bottle
  • Suppose the distribution of actual bottle fills is intended to follow a normal distribution with mean \(\mu=12\) and variance \(\sigma^2 = 0.01\).
  • If there is evidence that \(\sigma^2 > 0.01\), the machine will need to be recalibrated. This then becomes a problem of testing:

\[H_0: \sigma^2 = 0.01\] \[H_a: \sigma^2 > 0.01\]

  • Suppose a sample of \(n=20\) bottles is taken from the production line; how large will \(S^2\) need to be to convincingly suggest the machine needs to be recalibrated?

Finding the critical region

qchisq(0.95, df = 20-1)
[1] 30.14353
[Figure: \(\chi^2_{19}\) density with the rejection region above the 95th percentile]
  • Reject \(H_0\) when \(\frac{(20-1)S^2}{0.01} > 30.14\), i.e., when \(S^2 > 30.14 \times \frac{0.01}{19} \approx 0.0159\)
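Equivalently, we can report a p-value for an observed sample variance; for instance, with a hypothetical observed \(S^2 = 0.018\):

S2_obs <- 0.018                                 # hypothetical observed sample variance
test_stat <- (20 - 1) * S2_obs / 0.01           # 34.2
pchisq(test_stat, df = 19, lower.tail = FALSE)  # p-value, about 0.017 -- reject H0 at the 5% level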