MT5762 Lecture 8

C. Donovan

housekeeping

  • reminder projects are in play

Sampling distributions of Estimates

The sampling distribution of the sample mean

  • Ultimately we are aiming for confidence intervals
  • These are one of our main inferential tools
  • We can see what these mean by simulation (so you can see where we want to get to)

(refer to confidence interval PPT)
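
A minimal simulation sketch of what a confidence interval means. The population values here (`mu = 50`, `sigma = 10`, `n = 30`) are made up for illustration and are not from the lecture data. The idea: draw many samples, build a 2-SE interval from each, and count how often the interval captures the true mean.

```r
set.seed(5762)            # arbitrary seed, for reproducibility only

mu    <- 50               # hypothetical true population mean
sigma <- 10               # hypothetical true population SD
n     <- 30               # sample size for each simulated study
n_sim <- 5000             # number of simulated studies

covered <- replicate(n_sim, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)                    # standard error of the sample mean
  lower <- mean(x) - 2 * se
  upper <- mean(x) + 2 * se
  lower <= mu && mu <= upper               # TRUE if the interval captures mu
})

coverage <- mean(covered)  # proportion of intervals containing the true mean
coverage
```

Under these assumptions the proportion should come out close to 19 in 20.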

Likely values for the parameter

We have a sample - how do we convey information about the true value? Why do it this way?

  • centre - the sample mean is our best guess at the population mean.
  • unbiased estimate - on average over repeated samples the estimate equals the true value (it does not systematically over- or under-shoot)
  • precision information should be included
  • have an idea of the shape of the distribution for the parameter estimate (CLT)
  • need some magical multiplier that makes us correct X% of the time

Standard error from our sample

  • The sample mean Calcium level for our sample was 30208.33
  • The sample standard deviation was 9069.678
  • The standard error therefore:

\[ se(\bar{x})=\frac{s_x}{\sqrt{n}}= \frac{9069.678}{\sqrt{24}}= 1851.340 \]

Distribution of estimates: CLT

  • The CLT states that for large samples the distribution of sample means is closely Normal
  • We know about this distribution:
    • 2 standard errors (we're talking distribution of means) about the mean captures the central 95%
    • Values >2 SE from the mean are getting rare: <5% or <1 in 20
    • Random \( \bar{x} \) more than 3 SE from \( \mu \), are even rarer: about 0.3%
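
The Normal facts quoted above can be checked with `pnorm()`, the standard Normal cumulative distribution function in base R. This is only a quick check of the rounded figures, not part of the original slides.

```r
# Proportion of a Normal distribution within 2 and beyond 2 or 3 standard deviations
within_2 <- pnorm(2) - pnorm(-2)   # central area within +/- 2
beyond_2 <- 1 - within_2           # both tails beyond 2
beyond_3 <- 2 * pnorm(-3)          # both tails beyond 3

round(c(within_2 = within_2, beyond_2 = beyond_2, beyond_3 = beyond_3), 4)
```

The three values are about 0.9545, 0.0455 and 0.0027, which is where "central 95%", "<5%" and "about 0.3%" come from.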

A rough Confidence Interval (CI)

So - transfer this logic to centre on the sample mean we have drawn

  • It is likely the true mean lies within a 2 standard error interval:

\[ \bar{x} \pm 2\times se(\bar{x}) \]

A two-standard-error interval

\[ \begin{align*} estimate ~\pm ~~&~~~~ 2 \times standard~ error\\ \bar{x}- 2\times se(\bar{x}), ~~&~~\bar{x}+2\times se(\bar{x})\\ 30208.33-2 \times \frac{9069.678}{\sqrt{24}},~~&~~ 30208.33+2 \times \frac{9069.678}{\sqrt{24}}\\ (26505.65,~~ &~~ 33911.01) \end{align*} \]
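
The same arithmetic as a short R sketch, using the potting mix figures from the calculation above (the variable names are illustrative only):

```r
x_bar <- 30208.33   # sample mean Calcium, potting mix
s_x   <- 9069.678   # sample standard deviation
n     <- 24         # sample size

se_mean  <- s_x / sqrt(n)                # standard error of the sample mean
interval <- x_bar + c(-2, 2) * se_mean   # lower and upper limits of the 2-SE interval

round(se_mean, 3)
round(interval, 2)
```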

A rough CI

So, the true mean Calcium level for our potting mix population, is likely (about 19/20) to be between 26505 and 33911 units.

  • This inferential approach basically applies to all means
  • This includes proportions (mean of a binary variable) and many model estimates e.g. regression coefficients
  • We do need to be more exact than 2 SE though

The sampling distribution of the sample proportion

Applying this to proportions

  • In this study, the proportion of cannabis seeds that germinated was very low - only about 46% of seeds germinated
  • How accurate is \( \hat{p}=0.46 \) as a measure of \( p \)?
  • Note: proportions can be the mean of a binary variable (here germinate=1, not-germinated=0).
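
A small sketch of the "proportion is a mean" idea, assuming the 100 seeds are coded 1 for germinated and 0 otherwise (the vector `germinated` is built here for illustration; it is not the original data file):

```r
# 46 germinated (1) and 54 not germinated (0) out of 100 seeds
germinated <- c(rep(1, 46), rep(0, 54))

p_hat <- mean(germinated)   # the mean of a 0/1 variable is the proportion of 1s
p_hat
```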

Germination of cannabis seeds - Binomial

  • We estimated this \( p \) from 100 seeds - so:
    • fixed number of trials (n = 100),
    • 46 successes
  • So lends itself to a binomial distribution

How precise is our sample proportion?

  • For large samples, the distribution of \( \hat{p} \) is approximately Normal with:

\[ mean~ = p ~\textrm{and standard deviation} = \sqrt{\frac{{p}(1-{p})}{n}} \]

but since we never know \( p \), we use the standard error of the sample proportion:

\[ \text{Std. Err.}~{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

  • The sample proportion \( \hat{p} \) is an unbiased estimate of \( {p} \)
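
One way to see these results is by simulation. The sketch below treats 0.46 as if it were the true \( p \) (an assumption made purely for illustration), simulates many studies of 100 seeds, and compares the spread of the simulated \( \hat{p} \) values with the formula.

```r
set.seed(8)                 # arbitrary seed, for reproducibility only

p     <- 0.46               # assumed true proportion (illustrative)
n     <- 100                # seeds per simulated study
n_sim <- 10000              # number of simulated studies

p_hats <- rbinom(n_sim, size = n, prob = p) / n   # one sample proportion per study

sd_simulated <- sd(p_hats)                # spread of the simulated proportions
sd_formula   <- sqrt(p * (1 - p) / n)     # theoretical standard deviation

c(simulated = sd_simulated, formula = sd_formula, mean_p_hat = mean(p_hats))
```

The two standard deviations should agree closely, and the average of the simulated \( \hat{p} \) values should sit near \( p \), consistent with unbiasedness.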

How precise is our sample proportion?

Apply the 2-SE interval:

\[ \begin{align*} estimate \pm ~~&~~2 \times standard~error\\ \hat{p}- 2\times se(\hat{p}),~~&~~\hat{p}+ 2\times se(\hat{p})\\ \hat{p}- 2\times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},~~&~~\hat{p}+2\times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\\ 0.46-2 \times \sqrt{\frac{{0.46}(1-{0.46})}{100}},~~&~~0.46+2 \times \sqrt{\frac{{0.46}(1-{0.46})}{100}}\\ (0.3603,~~&~~0.5597) \end{align*} \]

  • So quite likely (about 19/20) \( p \) lies somewhere between 36% and 56%
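
The interval above can be reproduced in a few lines of R (variable names are illustrative):

```r
p_hat <- 0.46   # sample proportion of seeds that germinated
n     <- 100    # number of seeds

se_p     <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
interval <- p_hat + c(-2, 2) * se_p         # 2-SE interval for p

round(interval, 4)
```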

Standard error for a difference between means

  • We can further extend this idea to other problems
  • The difference between two sample means also has a sampling distribution, just as a single sample mean does
  • So we can inferentially estimate differences between populations

Standard error for a difference between means

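  library(ggplot2)  # provides ggplot() and geom_boxplot()

  # Boxplots of Calcium (Ca) for each soil Group in CaData
  p <- ggplot(CaData) + geom_boxplot(aes(Group, Ca), fill = 'purple', alpha = 0.8)

  p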

[Figure: boxplots of Calcium (Ca) by soil group]

Exploratory Data Analysis

Site             SD          1st Quartile   Median   Mean    n
Blockhouse Bay   11586.907   58000          60000    63620   13
Northland         9184.830   47000          56000    54110    9
Potting mix       9069.678   22750          30000    30210   24

Summary statistics for Calcium values in Cannabis leaves

Exploratory Data Analysis

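  library(dplyr)  # provides %>%, group_by(), summarise() and n()

  # Spread, quartiles, centre and sample size of Calcium for each group
  CaSummary <- CaData %>% group_by(Group) %>%
    summarise(SD = sd(Ca), Q25 = quantile(Ca, 0.25),
              median = median(Ca), Q75 = quantile(Ca, 0.75),
              mean = mean(Ca), n = n())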

  CaSummary
# A tibble: 3 x 7
  Group     SD    Q25 median    Q75   mean     n
  <fct>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <int>
1 bhb   11587. 58000. 60000. 76000. 63615.    13
2 nth    9185. 47000. 56000. 62000. 54111.     9
3 pm     9070. 22750. 30000. 35000. 30208.    24

Exploratory Data Analysis

  • Average Calcium levels in Blockhouse Bay appear to be about double those in potting mix, and higher than those in Northland
  • The spread in each group is fairly similar
  • Potting mix and Northland appear symmetrical, whereas Blockhouse Bay is right-skewed

Do average Calcium levels differ between the soil types?

  • We'll compare average Calcium levels for potting mix and Blockhouse Bay
  • It appears Blockhouse Bay is higher than potting mix, but sampling uncertainty exists
  • The parameter we want to estimate is the true mean difference between Calcium levels in Blockhouse Bay and potting mix (i.e. \( \mu_{bhb}-\mu_{pm} \)).

Do average Calcium levels differ between the soil types?

  • We have an estimate for this parameter:

\[ \begin{align*} \bar{x}_{bhb}-\bar{x}_{pm}= &63620-30210\\ = &33410 \end{align*} \]

  • To build our 2-SE interval (i.e. plausible values of the true difference), we need an estimate of precision, the standard error of this estimate: we need \( se(\bar{x}_{bhb}-\bar{x}_{pm}) \)
  • We can find the standard errors for each estimate separately, but we need to combine them.

Do average Calcium levels differ between the soil types?

We can combine these standard errors using the formula below.

SE for a Difference in Means (Independent Samples)

\[ \begin{align*} se(\bar{x}_1-\bar{x}_2)=&\sqrt{se(\bar{x}_1)^2+se(\bar{x}_2)^2}\\ =& \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \end{align*} \]
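
As a sketch, the formula can be wrapped in a small R function. The name `se_diff_means` is hypothetical (it is not a base R or package function), and the inputs are the Blockhouse Bay and potting mix summaries from the table earlier.

```r
# Hypothetical helper: SE of a difference between two independent sample means
se_diff_means <- function(s1, n1, s2, n2) {
  se1 <- s1 / sqrt(n1)     # standard error of the first mean
  se2 <- s2 / sqrt(n2)     # standard error of the second mean
  sqrt(se1^2 + se2^2)      # combine on the variance scale, then take the root
}

se_bhb_pm <- se_diff_means(s1 = 11586.907, n1 = 13,   # Blockhouse Bay
                           s2 = 9069.678,  n2 = 24)   # potting mix
se_bhb_pm
```

The standard errors are squared before adding because it is the variances, not the standard errors, that add for independent samples.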

Do average Calcium levels differ between the soil types?

\[ 33410 \pm 2\times \sqrt{\frac{11586.90^2}{13}+\frac{9069.678^2}{24}} \]

33410 + 2*sqrt(11586.90^2/13+9069.678^2/24)
[1] 40827.51
33410 - 2*sqrt(11586.90^2/13+9069.678^2/24)
[1] 25992.49

Interpreting the interval for a difference between means

What do we notice about this interval for the population, or true, mean difference?

  • The upper limit and the lower limit are both positive - likely \( \mu_{bhb} \) > \( \mu_{pm} \)
  • This interval does not contain zero
  • If zero is not contained in the interval, then zero is not a plausible value for the true mean difference
  • If the lower limit for the interval is negative and the upper limit is positive – then zero is plausible.

So, we have a meaningful finding in the face of uncertainty

Standard error for a difference between proportions

The same logic applies for proportions (with sufficiently large samples); the main complication is the SE

  • 13 of the 33 seeds (39.4%) germinated in Blockhouse Bay soil and 24 of the 33 seeds (72.7%) germinated in potting mix, i.e. \[ \hat{p}_{bhb}=0.394,~~~~\hat{p}_{pm}=0.727 \]
  • Germination rates for Blockhouse Bay look lower than potting mix, but sampling uncertainty exists

Do germination rates differ between the soil types?

The approach is similar

  • we want to estimate \( p_{bhb}-p_{pm} \).
  • We can find the standard errors for each estimate separately but we need to combine them.

Do germination rates differ between the soil types?

We can combine these standard errors using the formula below.

SE for a Difference between Proportions (Independent Samples)

\[ \begin{align*} se(\hat{p}_1-\hat{p}_2)=&\sqrt{se(\hat{p}_1)^2+se(\hat{p}_2)^2}\\ =& \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \end{align*} \]
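
A short R sketch applying this formula to the germination data, using the rounded proportions 0.394 and 0.727 quoted earlier (variable names are illustrative):

```r
p_bhb <- 0.394; n_bhb <- 33   # Blockhouse Bay: 13 of 33 germinated
p_pm  <- 0.727; n_pm  <- 33   # potting mix: 24 of 33 germinated

# SE of the difference between two independent sample proportions
se_diff <- sqrt(p_bhb * (1 - p_bhb) / n_bhb + p_pm * (1 - p_pm) / n_pm)

estimate <- p_bhb - p_pm                   # estimated difference
interval <- estimate + c(-2, 2) * se_diff  # 2-SE interval for p_bhb - p_pm

round(interval, 3)
```

With these rounded inputs the limits come out at about -0.563 and -0.103; using the unrounded fractions 13/33 and 24/33 shifts the third decimal place slightly.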

Interpreting the interval for a difference between proportions

What do we notice about this interval for the population, or true, difference between proportions?

  • The upper and lower limits are both negative — it's likely \( p_{bhb} \) is smaller than \( p_{pm} \)
  • This interval does not contain zero
  • If zero is not contained in the interval, then zero is not a plausible value for the true difference, i.e. a difference is likely

Interpreting the interval for a difference between proportions

What do we notice about this interval for the population, or true, difference between proportions?

  • The two standard error interval for \( p_{bhb}-p_{pm} \) was [-0.563,-0.103].
  • This means we can be fairly confident that the germination rate in potting mix is between 10 and 56 percentage points higher than that in Blockhouse Bay.

Refining our intervals: the t-distribution

We don't know the true SE

  • When we sample data from a Normal (\( Z \)) distribution and we know the population standard deviation \( \sigma \), the sample mean is exactly Normally distributed about the true population mean, \( \mu \)
  • When we don't know the population SD, we introduce a new source of variability – we use the sample SD of our data (\( s_x \)) instead of a known population SD (\( \sigma \))
  • When the population SD is unknown the \( t \)-distribution is used instead

The t-distribution

  • is indexed by a parameter called the degrees of freedom (\( df \)); \( df \) is determined by the sample size (\( df=n-1 \))
  • is symmetrical about zero and has a similar shape to the Normal distribution
  • becomes more and more like the Normal as \( n \) (and thus \( df \)) increases; a \( t \)-distribution with small \( n \) (& \( df \)) has fatter tails and a flatter top compared with the Normal distribution
  • with \( df=\infty \) and the Normal (0,1) distribution are two ways of describing the same distribution
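
To illustrate how the \( t \)-distribution approaches the Normal, the sketch below compares the multiplier that captures the central 95% for a few degrees of freedom (the particular `dfs` values are arbitrary choices for illustration). `qt()` and `qnorm()` are the base R quantile functions.

```r
dfs <- c(5, 10, 30, 100, Inf)        # illustrative degrees of freedom

multipliers <- qt(0.975, df = dfs)   # value leaving 2.5% in the upper tail

data.frame(df = dfs, multiplier = round(multipliers, 3))

qnorm(0.975)                         # the Normal equivalent
```

The multiplier shrinks from about 2.57 at 5 df towards the Normal value of about 1.96, and at `df = Inf` the two agree.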

The t-distribution

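  x <- seq(-5, 5, by = 0.1)

  # t with 5 df; ylim leaves room for the taller Normal peak (about 0.4)
  plot(x, dt(x, df = 5), type = 'l', lwd = 2, col = 'slateblue4', ylim = c(0, 0.4))

  lines(x, dnorm(x), lwd = 2)  # standard Normal, in black

  lines(x, dt(x, df = 50), col = 'blue', lwd = 2)  # t with 50 df

  lines(x, dt(x, df = 90), col = 'purple')  # t with 90 df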

[Figure: densities of the t-distribution with 5, 50 and 90 df overlaid on the standard Normal]

How does this change how we build intervals for parameters?

  • For large samples, the sample mean falls within about two standard errors of the population mean for approximately 95% of samples taken
  • For small samples, the distribution of \( T \) is quite different to the Normal (\( Z \)).
  • For small samples, we must build intervals that extend more than two standard errors either side of the estimate to capture the population mean most of the time

How does this change how we build intervals for parameters?

  • For instance, with a sample size of 7, the sample mean falls within \( \pm 2 \) standard errors of the population mean only about 91% of the time (even when the data are exactly Normal), i.e. 90.8% of the \( t \)-distribution lies between \( -2 \) and \( 2 \) when \( df=n-1=7-1=6 \): \[ pr(-2 \leq T \leq 2)=0.908 \]
  • If we want to ensure intervals for parameters contain the true parameter value a set proportion of times, we often need to replace \( \pm 2 \) with a bigger number
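
The figure for \( df=6 \) can be checked with `pt()`, and `qt()` gives the larger multiplier that would be needed for 95% coverage. This is a sketch using base R only.

```r
df <- 6                                   # n - 1 for a sample of size 7

coverage_2se  <- pt(2, df) - pt(-2, df)   # pr(-2 <= T <= 2)
multiplier_95 <- qt(0.975, df)            # multiplier capturing the central 95%

round(coverage_2se, 3)
round(multiplier_95, 3)
```

So with 6 degrees of freedom, \( \pm 2 \) standard errors covers about 90.8%, and roughly \( \pm 2.45 \) standard errors is needed to reach 95%.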