MT5762 Lecture 8

C. Donovan

housekeeping

Reminder: projects are in play

Sampling distributions of Estimates

The sampling distribution of the sample mean

Ultimately we are aiming for confidence intervals
These are one of our main inferential tools
We can see what these mean by simulation (so you can see where we want to get to)

(refer to confidence interval PPT)
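As a rough illustration of what such a simulation might look like, the sketch below repeatedly samples from a Normal population and records how often a two-standard-error interval captures the true mean. The population mean, SD, sample size and number of simulations are illustrative assumptions, not values from the course data.

```r
# Sketch: how often does a 2-SE interval capture the true mean?
# mu, sigma, n and n_sims are illustrative assumptions.
set.seed(5762)
mu <- 30000      # assumed true population mean
sigma <- 9000    # assumed true population SD
n <- 50          # assumed sample size
n_sims <- 10000  # number of simulated samples

covered <- replicate(n_sims, {
  x <- rnorm(n, mean = mu, sd = sigma)   # draw one sample
  se <- sd(x) / sqrt(n)                  # standard error from that sample
  lower <- mean(x) - 2 * se
  upper <- mean(x) + 2 * se
  lower <= mu && mu <= upper             # did the interval capture mu?
})

coverage <- mean(covered)  # proportion of intervals containing mu
coverage                   # should be close to 0.95
```

Each simulated interval is centred on a different sample mean, so the intervals move around from sample to sample; the coverage is the long-run proportion that contain the fixed true value.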

Likely values for the parameter

We have a sample - how do we convey information about the true value? Why do it this way?

centre - the sample mean is our best guess at the population mean.
unbiased estimate - true value is just as likely to be above our estimate as below
precision information should be included
have an idea of the shape of the sampling distribution of the estimate (CLT)
need some magical multiplier that makes us correct X% of the time

Standard error from our sample

The sample mean Calcium level for our sample was 30208.33
The sample standard deviation was 9069.678
The standard error therefore:

\[ se(\bar{x})=\frac{s_x}{\sqrt{n}}= \frac{9069.678}{\sqrt{24}}= 1851.340 \]
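The same calculation can be sketched in R from the summary statistics on the slide. The variable names are just for illustration.

```r
# Standard error of the sample mean from summary statistics
s_x <- 9069.678  # sample standard deviation (from the slide)
n <- 24          # sample size (from the slide)

se_mean <- s_x / sqrt(n)
round(se_mean, 2)  # 1851.34
```

With the raw data in a vector `x`, the equivalent would be `sd(x) / sqrt(length(x))`.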

Distribution of estimates: CLT

The CLT states that for large samples the distribution of sample means is closely Normal
We know about this distribution:
- 2 standard errors (we're talking distribution of means) about the mean captures the central 95%
- Values >2 SE from the mean are getting rare: <5% or <1 in 20
- Random \( \bar{x} \) more than 3 SE from \( \mu \), are even rarer: about 0.3%
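These rules of thumb can be checked against the standard Normal distribution using `pnorm()`. This is a quick sketch; the variable names are illustrative.

```r
# Proportion of a Normal distribution within / beyond k standard errors
within_2 <- pnorm(2) - pnorm(-2)  # central area within 2 SE, about 0.954
beyond_2 <- 1 - within_2          # area more than 2 SE away, about 0.046
beyond_3 <- 2 * pnorm(-3)         # area more than 3 SE away, about 0.0027

round(c(within_2 = within_2, beyond_2 = beyond_2, beyond_3 = beyond_3), 4)
```

So "2 SE captures 95%" is a slight rounding: the exact multiplier for 95% under the Normal is about 1.96.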

A rough Confidence Interval (CI)

So - transfer this logic to centre on the sample mean we have drawn

It is likely the true mean lies within a 2 standard error interval:

\[ \bar{x} \pm 2\times se(\bar{x}) \]

A two-standard-error interval

\[ \begin{align*} estimate ~\pm ~~&~~~~ 2 \times standard~ error\\ \bar{x}- 2\times se(\bar{x}), ~~&~~\bar{x}+2\times se(\bar{x})\\ 30208.33-2 \times \frac{9069.678}{\sqrt{24}},~~&~~ 30208.33+2 \times \frac{9069.678}{\sqrt{24}}\\ (26505.65,~~ &~~ 33911.01) \end{align*} \]
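A minimal sketch of the same arithmetic in R. The helper `two_se_interval()` is a hypothetical name introduced here for illustration, not a function from any package.

```r
# Hypothetical helper: estimate plus or minus 2 standard errors
two_se_interval <- function(estimate, se) {
  c(lower = estimate - 2 * se, upper = estimate + 2 * se)
}

# Potting mix sample: mean 30208.33, SD 9069.678, n = 24 (from the slide)
ci_mean <- two_se_interval(30208.33, 9069.678 / sqrt(24))
round(ci_mean, 2)  # lower 26505.65, upper 33911.01
```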

A rough CI

So the true mean Calcium level for our potting mix population is likely (about 19 times in 20) to lie between 26505 and 33911 units.

This inferential approach basically applies to all means
This includes proportions (mean of a binary variable) and many model estimates e.g. regression coefficients
We do need to be more exact than 2 SE though

The sampling distribution of the sample proportion

Applying this to proportions

In this study, the proportion of cannabis seeds that germinated was low: only about 46%
How accurate is \( \hat{p}=0.46 \) as a measure of \( p \)?
Note: proportions can be the mean of a binary variable (here germinate=1, not-germinated=0).
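To see the "proportion as a mean" idea concretely, the sketch below codes the 100 seeds as 1 (germinated) or 0 (not germinated) and takes the mean. The vector is reconstructed from the counts on the slides (46 out of 100), not read from the raw data.

```r
# 46 germinated seeds coded 1, 54 non-germinated seeds coded 0
germinated <- c(rep(1, 46), rep(0, 54))

p_hat <- mean(germinated)  # mean of a 0/1 variable = sample proportion
p_hat                      # 0.46
```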

Germination of cannabis seeds - Binomial

We estimated this \( p \) from 100 seeds - so:
- fixed number of trials (n = 100),
- 46 successes
So the number of germinated seeds lends itself to a Binomial distribution

How precise is our sample proportion?

For large samples, the distribution of \( \hat{p} \) is approximately Normal with:

\[ mean~ = p ~\textrm{and standard deviation} = \sqrt{\frac{{p}(1-{p})}{n}} \]

but since we never know \( p \), we use the standard error of the sample proportion:

\[ \text{Std. Err.}~{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

The sample proportion \( \hat{p} \) is an unbiased estimate of \( {p} \)

How precise is our sample proportion?

Apply the 2-SE interval:

\[ \begin{align*} estimate \pm ~~&~~2 \times standard~error\\ \hat{p}- 2\times se(\hat{p}),~~&~~\hat{p}+ 2\times se(\hat{p})\\ \hat{p}- 2\times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},~~&~~\hat{p}+2\times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\\ 0.46-2 \times \sqrt{\frac{{0.46}(1-{0.46})}{100}},~~&~~0.46+2 \times \sqrt{\frac{{0.46}(1-{0.46})}{100}}\\ (0.3603,~~&~~0.5597) \end{align*} \]

So quite likely (about 19/20) \( p \) lies somewhere between 36% and 56%
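The interval above can be reproduced in R as follows. This is a sketch using the counts from the slide; the variable names are illustrative.

```r
# Two-standard-error interval for a single proportion
n <- 100
p_hat <- 46 / n                         # sample proportion

se_p <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
ci_p <- c(lower = p_hat - 2 * se_p, upper = p_hat + 2 * se_p)

round(se_p, 4)  # 0.0498
round(ci_p, 4)  # lower 0.3603, upper 0.5597
```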

Standard error for a difference between means

We can further extend this idea to other problems
The difference between two sample means also has a sampling distribution, and it is again approximately Normal for large samples
So we can inferentially estimate differences between populations

Standard error for a difference between means

  library(ggplot2)  # for ggplot() and geom_boxplot()
  # boxplots of Calcium (Ca) by soil group
  p <- ggplot(CaData) + geom_boxplot(aes(Group, Ca), fill = 'purple', alpha = 0.8)

  p

(Figure: boxplots of Calcium (Ca) by Group)

Exploratory Data Analysis

Site	SD	1st Quartile	Median	Mean	n
Blockhouse Bay	11586.907	58000	60000	63620	13
Northland	9184.830	47000	56000	54110	9
Potting mix	9069.678	22750	30000	30210	24

Summary statistics for Calcium values in Cannabis leaves

Exploratory Data Analysis

  library(dplyr)  # for %>%, group_by(), summarise() and n()
  CaSummary <- CaData %>% group_by(Group) %>%
    summarise(SD = sd(Ca), Q25 = quantile(Ca, 0.25), 
              median = median(Ca), Q75 = quantile(Ca, 0.75),
              mean = mean(Ca), n = n())

  CaSummary

# A tibble: 3 x 7
  Group     SD    Q25 median    Q75   mean     n
  <fct>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <int>
1 bhb   11587. 58000. 60000. 76000. 63615.    13
2 nth    9185. 47000. 56000. 62000. 54111.     9
3 pm     9070. 22750. 30000. 35000. 30208.    24

Exploratory Data Analysis

Average Calcium levels in Blockhouse Bay appear to be about double those in potting mix, and higher than those in Northland
The spread in each group is fairly similar
Potting mix and Northland appear symmetrical, whereas Blockhouse Bay is right-skewed

Do average Calcium levels differ between the soil types?

We'll compare average Calcium levels for potting mix and Blockhouse Bay
It appears Blockhouse Bay is higher than potting mix, but sampling uncertainty exists
The parameter we want to estimate is the true mean difference between Calcium levels in Blockhouse Bay and potting mix (i.e. \( \mu_{bhb}-\mu_{pm} \)).

Do average Calcium levels differ between the soil types?

We have an estimate for this parameter:

\[ \begin{align*} \bar{x}_{bhb}-\bar{x}_{pm}= &63620-30210\\ = &33410 \end{align*} \]

To build our 2-SE interval (i.e. plausible values of the true difference), we need an estimate of precision, namely the standard error of this estimate: \( se(\bar{x}_{bhb}-\bar{x}_{pm}) \)
We can find the standard errors for each estimate separately, but we need to combine them.

Do average Calcium levels differ between the soil types?

We can combine these standard errors using the formula below.

SE for a Difference in Means (Independent Samples)

\[ \begin{align*} se(\bar{x}_1-\bar{x}_2)=&\sqrt{se(\bar{x}_1)^2+se(\bar{x}_2)^2}\\ =& \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \end{align*} \]
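One way to convince yourself that standard errors combine "in quadrature" like this is a small simulation. The sketch below draws many pairs of independent samples and compares the observed spread of the difference in means with the formula. The population SDs and sample sizes are illustrative assumptions (loosely modelled on the two groups), and the known population SDs are used in place of \( s_1 \) and \( s_2 \).

```r
# Sketch: check the SE formula for a difference in means by simulation
set.seed(5762)
sigma_1 <- 11500; n_1 <- 13   # assumed SD and sample size, group 1
sigma_2 <- 9000;  n_2 <- 24   # assumed SD and sample size, group 2

# Difference in sample means for many independent pairs of samples
diffs <- replicate(20000, {
  mean(rnorm(n_1, mean = 0, sd = sigma_1)) -
    mean(rnorm(n_2, mean = 0, sd = sigma_2))
})

simulated_se <- sd(diffs)                                 # spread seen in simulation
formula_se <- sqrt(sigma_1^2 / n_1 + sigma_2^2 / n_2)     # spread from the formula

round(c(simulated = simulated_se, formula = formula_se))
```

The two values should agree closely. Note that the variances add even though we are subtracting the means, because the two samples are independent.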

Do average Calcium levels differ between the soil types?

\[ 33410 \pm 2\times \sqrt{\frac{11586.90^2}{13}+\frac{9069.678^2}{24}} \]

33410 + 2*sqrt(11586.90^2/13+9069.678^2/24)

[1] 40827.51

33410 - 2*sqrt(11586.90^2/13+9069.678^2/24)

[1] 25992.49

Interpreting the interval for a difference between means

What do we notice about this interval for the population, or true, mean difference?

The upper limit and the lower limit are both positive - likely \( \mu_{bhb} > \mu_{pm} \)
This interval does not contain zero
If zero is not contained in the interval, then zero is not a plausible value for the true mean difference
If the lower limit for the interval is negative and the upper limit is positive, then zero is plausible.

So, we have a meaningful finding in the face of uncertainty

Standard error for a difference between proportions

The same logic applies for proportions (with sufficiently large samples); the main complication is the SE

13 of the 33 seeds (39.4%) germinated in Blockhouse Bay soil and 24 of the 33 seeds (72.7%) germinated in potting mix, i.e. \[ \hat{p}_{bhb}=0.394,~~~~\hat{p}_{pm}=0.727 \]
Germination rates for Blockhouse Bay look lower than potting mix, but sampling uncertainty exists

Do germination rates differ between the soil types?

The approach is similar

We want to estimate \( p_{bhb}-p_{pm} \).
We can find the standard errors for each estimate separately but we need to combine them.

Do germination rates differ between the soil types?

We can combine these standard errors using the formula below.

SE for a Difference between Proportions (Independent Samples)

\[ \begin{align*} se(\hat{p}_1-\hat{p}_2)=&\sqrt{se(\hat{p}_1)^2+se(\hat{p}_2)^2}\\ =& \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \end{align*} \]
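Applying this formula to the germination data gives the two-standard-error interval for \( p_{bhb}-p_{pm} \). The sketch below uses the rounded sample proportions quoted on the earlier slide (0.394 and 0.727); the variable names are illustrative.

```r
# Two-standard-error interval for a difference between two proportions
p_bhb <- 0.394; n_bhb <- 33   # Blockhouse Bay: 13 of 33 germinated
p_pm  <- 0.727; n_pm  <- 33   # potting mix:    24 of 33 germinated

diff_p <- p_bhb - p_pm        # estimated difference in proportions

# Combine the two standard errors (independent samples)
se_diff <- sqrt(p_bhb * (1 - p_bhb) / n_bhb + p_pm * (1 - p_pm) / n_pm)

ci_diff <- c(lower = diff_p - 2 * se_diff, upper = diff_p + 2 * se_diff)
round(ci_diff, 3)  # lower -0.563, upper -0.103
```

Using the unrounded proportions (13/33 and 24/33) shifts the limits only slightly, in the third decimal place.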

Interpreting the interval for a difference between proportions

What do we notice about this interval for the population, or true, difference between proportions?

The upper and lower limits are both negative — it's likely \( p_{bhb} \) is smaller than \( p_{pm} \)
This interval does not contain zero
If zero is not contained in the interval, then zero is not a plausible value for the true difference, i.e. a difference is likely

Interpreting the interval for a difference between proportions

What do we notice about this interval for the population, or true, difference between proportions?

The two standard error interval for \( p_{bhb}-p_{pm} \) was [-0.563,-0.103].
This means we can be fairly confident that germination rates in potting mix are between 10 and 56 percentage points higher, on average, than those in Blockhouse Bay.

Refining our intervals: the t-distribution

We don't know the true SE

When we sample data from a Normal (\( Z \)) distribution and we know the population standard deviation \( \sigma \), the sample mean is exactly Normally distributed about the true population mean, \( \mu \)
When we don't know the population SD, we introduce a new source of variability – we use the sample SD of our data (\( s_x \)) instead of a known population SD (\( \sigma \))
When the population SD is unknown the \( t \)-distribution is used instead

The t-distribution

is indexed by a parameter called the degrees of freedom (\( df \)); \( df \) is determined by the sample size (\( df=n-1 \))
is symmetrical about zero and has a similar shape to the Normal distribution
becomes more and more like the Normal as \( n \) (and thus \( df \)) increases; a \( t \)-distribution with small \( n \) (& \( df \)) has fatter tails and a flatter top compared with the Normal distribution
with \( df=\infty \) and the Normal (0,1) distribution are two ways of describing the same distribution

The t-distribution

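  x <- seq(-5, 5, by = 0.1)  # grid of values at which to evaluate the densities

  plot(x, dt(x, df = 5), type = 'l', lwd = 2, col = 'slateblue4')  # t, df = 5

  lines(x, dnorm(x), lwd = 2)  # standard Normal, for comparison

  lines(x, dt(x, 50), col = 'blue', lwd = 2)  # t, df = 50

  lines(x, dt(x, 90), col = 'purple')  # t, df = 90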

(Figure: density curves of the t-distribution with df = 5, 50 and 90, overlaid with the standard Normal density)

How does this change how we build intervals for parameters?

For large samples, the sample mean falls within about two standard errors of the population mean for approximately 95% of samples taken
For small samples, the distribution of \( T \) is quite different to the Normal (\( Z \)).
For small samples, we must build intervals which are more than two-standard errors to capture the population mean most of the time

How does this change how we build intervals for parameters?

For instance, if we have a sample size of 7, the sample mean falls within \( \pm 2 \) standard errors of the population mean only about 91% of the time (even when the data are exactly Normal), i.e. 90.8% of the \( t \)-distribution lies within \( \pm 2 \) when \( df=n-1=7-1=6 \): \[ pr(-2 \leq T \leq 2)=0.908 \]
If we want to ensure intervals for parameters contain the true parameter value a set proportion of times, we often need to replace \( \pm 2 \) with a bigger number
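As a sketch of how that bigger number might be found in R, `pt()` gives the coverage of a \( \pm 2 \) interval under the \( t \)-distribution, and `qt()` gives the multiplier needed for a chosen coverage. The choice of 95% and the variable names are illustrative.

```r
# Coverage of +/- 2 SE, and the multiplier needed for 95%, when df = 6
df <- 6                                 # n - 1 for a sample of size 7

coverage_2se <- pt(2, df) - pt(-2, df)  # pr(-2 <= T <= 2)
round(coverage_2se, 3)                  # 0.908

t_mult <- qt(0.975, df)                 # leaves 2.5% in each tail
round(t_mult, 3)                        # 2.447

z_mult <- qnorm(0.975)                  # Normal equivalent, about 1.96
round(z_mult, 3)
```

So for \( n=7 \), roughly \( \pm 2.45 \) standard errors are needed to capture the central 95%, rather than \( \pm 2 \).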