C. Donovan
(refer to confidence interval PPT)
We have a sample - how do we convey information about the true value? Why do it this way?
\[ se(\bar{x})=\frac{s_x}{\sqrt{n}}= \frac{9069.678}{\sqrt{24}}= 1851.340 \]
So we transfer this logic and centre the interval on the sample mean we have drawn
\[ \bar{x} \pm 2\times se(\bar{x}) \]
A two-standard-error interval
\[ \begin{align*} estimate ~\pm ~~&~~~~ 2 \times standard~ error\\ \bar{x}- 2\times se(\bar{x}), ~~&~~\bar{x}+2\times se(\bar{x})\\ 30208.33-2 \times \frac{9069.678}{\sqrt{24}},~~&~~ 30208.33+2 \times \frac{9069.678}{\sqrt{24}}\\ (26505.65,~~ &~~ 33911.01) \end{align*} \]
So the true mean Calcium level for our potting mix population is likely (about 19 times in 20) to be between about 26506 and 33911 units.
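The arithmetic above can be checked in R. This is a minimal sketch that types in the potting mix summary values quoted in the text (mean, standard deviation and sample size) rather than reading them from `CaData`; the variable names are illustrative only.

```r
# Summary statistics for the potting mix sample, as quoted in the text
xbar <- 30208.33   # sample mean
s    <- 9069.678   # sample standard deviation
n    <- 24         # sample size

se <- s / sqrt(n)            # standard error of the sample mean
ci <- xbar + c(-2, 2) * se   # two-standard-error interval (lower, upper)

round(se, 3)
round(ci, 2)
```

The two printed values should agree with the standard error and interval worked out by hand above.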
Applying this to proportions
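The sampling distribution of the sample proportion \( \hat{p} \) has
\[ mean~ = p ~\textrm{and standard deviation} = \sqrt{\frac{{p}(1-{p})}{n}} \]
but since we never know \( p \), we estimate it with \( \hat{p} \) and use the standard error of the sample proportion: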
\[ \text{Std. Err.}~{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Apply the 2-SE interval:
\[ \begin{align*} estimate \pm ~~&~~2 \times standard~error\\ \hat{p}- 2\times se(\hat{p}),~~&~~\hat{p}+ 2\times se(\hat{p})\\ \hat{p}- 2\times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},~~&~~\hat{p}+2\times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\\ 0.46-2 \times \sqrt{\frac{{0.46}(1-{0.46})}{100}},~~&~~0.46+2 \times \sqrt{\frac{{0.46}(1-{0.46})}{100}}\\ (0.3603,~~&~~0.5597) \end{align*} \]
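The same interval can be reproduced in R. This is a small sketch using the values from the worked example (a sample proportion of 0.46 from n = 100); the variable names are illustrative.

```r
# Values from the worked example in the text
p_hat <- 0.46   # sample proportion
n     <- 100    # sample size

se_p <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
ci_p <- p_hat + c(-2, 2) * se_p         # two-standard-error interval (lower, upper)

round(se_p, 4)
round(ci_p, 4)
```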
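library(ggplot2)  # provides ggplot() and geom_boxplot()
# Boxplots of Calcium (Ca) for each group in CaData
p <- ggplot(CaData) + geom_boxplot(aes(Group, Ca), fill = 'purple', alpha = 0.8)
p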
Site | SD | 1st Quartile | Median | Mean | n |
---|---|---|---|---|---|
Blockhouse Bay | 11586.907 | 58000 | 60000 | 63620 | 13 |
Northland | 9184.830 | 47000 | 56000 | 54110 | 9 |
Potting mix | 9069.678 | 22750 | 30000 | 30210 | 24 |
Summary statistics for Calcium values in Cannabis leaves
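library(dplyr)  # provides %>%, group_by() and summarise()
# Per-group summary statistics for Calcium
CaSummary <- CaData %>% group_by(Group) %>%
  summarise(SD = sd(Ca), Q25 = quantile(Ca, 0.25),
            median = median(Ca), Q75 = quantile(Ca, 0.75),
            mean = mean(Ca), n = n())
CaSummary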
# A tibble: 3 x 7
Group SD Q25 median Q75 mean n
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 bhb 11587. 58000. 60000. 76000. 63615. 13
2 nth 9185. 47000. 56000. 62000. 54111. 9
3 pm 9070. 22750. 30000. 35000. 30208. 24
\[ \begin{align*} \bar{x}_{bhb}-\bar{x}_{pm}= &63620-30210\\ = &33410 \end{align*} \]
Each sample mean has its own standard error. For independent samples, we can combine these standard errors using the formula below.
SE for a Difference in Means (Independent Samples)
\[ \begin{align*} se(\bar{x}_1-\bar{x}_2)=&\sqrt{se(\bar{x}_1)^2+se(\bar{x}_2)^2}\\ =& \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \end{align*} \]
\[ 33410 \pm 2\times \sqrt{\frac{11586.90^2}{13}+\frac{9069.678^2}{24}} \]
33410 + 2*sqrt(11586.90^2/13+9069.678^2/24)
[1] 40827.51
33410 - 2*sqrt(11586.90^2/13+9069.678^2/24)
[1] 25992.49
What do we notice about this interval for the population, or true, mean difference?
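The whole interval lies well above zero, so we have a meaningful finding in the face of uncertainty: the true mean Calcium level is plausibly higher for Blockhouse Bay than for potting mix.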
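The calculation for a difference in means can be wrapped in a small helper so that it is easy to repeat for other pairs of groups. The function name `diff_means_interval` and its arguments are hypothetical (they are not part of the original notes); the inputs are the Blockhouse Bay and potting mix values used in the calculation above.

```r
# Hypothetical helper (not from the original notes): k-standard-error interval
# for a difference in means of two independent samples.
diff_means_interval <- function(m1, s1, n1, m2, s2, n2, k = 2) {
  est <- m1 - m2                       # estimated difference in means
  se  <- sqrt(s1^2 / n1 + s2^2 / n2)   # combined standard error
  c(estimate = est, se = se, lower = est - k * se, upper = est + k * se)
}

# Blockhouse Bay (group 1) versus potting mix (group 2)
res <- diff_means_interval(m1 = 63620, s1 = 11586.90, n1 = 13,
                           m2 = 30210, s2 = 9069.678, n2 = 24)
round(res, 2)
```

The `lower` and `upper` entries should match the two console results shown earlier.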
The same logic applies to proportions (with sufficiently large samples); the main complication is the SE
The approach is similar
Each sample proportion has its own standard error, and we can combine these using the formula below.
SE for a Difference between Proportions (Independent Samples)
\[ \begin{align*} se(\hat{p}_1-\hat{p}_2)=&\sqrt{se(\hat{p}_1)^2+se(\hat{p}_2)^2}\\ =& \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \end{align*} \]
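As a rough sketch, the formula translates directly into R. The helper name `diff_props_interval` is hypothetical, and the numbers passed to it are made-up illustrative values, not data from these notes.

```r
# Hypothetical helper (not from the original notes): k-standard-error interval
# for a difference between proportions from two independent samples.
diff_props_interval <- function(p1, n1, p2, n2, k = 2) {
  est <- p1 - p2                                      # estimated difference
  se  <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # combined standard error
  c(estimate = est, se = se, lower = est - k * se, upper = est + k * se)
}

# Illustrative values only (NOT taken from the notes)
res_p <- diff_props_interval(p1 = 0.46, n1 = 100, p2 = 0.30, n2 = 120)
round(res_p, 4)
```

As with the difference in means, the thing to look for is whether the interval contains zero.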
What do we notice about this interval for the population, or true, difference between proportions?
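We don't know the true SE, because it depends on the unknown population standard deviation. We estimate it from the sample, and the t distribution allows for that extra uncertainty.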
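# Compare t densities (increasing degrees of freedom) with the standard normal
x <- seq(-5, 5, by = 0.1)
# ylim is set so the taller normal curve is not clipped at its peak
plot(x, dt(x, df = 5), type = 'l', lwd = 2, col = 'slateblue4',
     ylim = c(0, 0.4), ylab = 'density')     # t, 5 df
lines(x, dnorm(x), lwd = 2)                  # standard normal
lines(x, dt(x, 50), col = 'blue', lwd = 2)   # t, 50 df
lines(x, dt(x, 90), col = 'purple')          # t, 90 df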
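The plot suggests why a multiplier of 2 is only approximate. As a sketch, the 97.5% quantiles of the t distribution can be compared with the normal value; the choice of 23 degrees of freedom (n - 1 for the potting mix sample of 24) is an illustration added here, not something stated in the notes.

```r
# Degrees of freedom to compare; 23 corresponds to n - 1 for a sample of 24
df <- c(5, 23, 50, 90)

t_mult <- qt(0.975, df)    # multipliers for a 95% interval under the t distribution
z_mult <- qnorm(0.975)     # corresponding multiplier under the normal distribution

data.frame(df = df,
           t_multiplier = round(t_mult, 3),
           normal_multiplier = round(z_mult, 3))
```

The t multipliers shrink towards the normal value as the degrees of freedom increase, which mirrors the t curves approaching the normal curve in the plot.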