The starting point is the \(t\) equation:
\[ t = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}} \]
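Rearranging for \(\sqrt{n}\) first keeps the algebra transparent:
\[ \sqrt{n} = \frac{t s}{\overline{x} - \mu} \]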
Squaring both sides then solves for \(n\):
\[ n = \frac{t^2s^2}{(\overline{x}-\mu)^2} \]
The required \(n\) increases as:
- \(s\) gets bigger,
- \(t\) gets bigger, or
- the margin of error (the whole of \(\overline{x} - \mu\)) gets smaller.
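Because the margin of error enters the formula squared, halving it roughly quadruples the required \(n\). A quick illustration (the \(s = 10\) and the margins of 4 and 2 are arbitrary choices):
z <- qnorm(0.05) # one-sided 95 percent lower bound
s <- 10
ceiling(z^2*s^2/c(4, 2)^2) # required n for margins of error of 4 and then 2
## [1] 17 68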
Where do the parameters come from? For the one- and two-sample tools that use the normal to approximate the \(t\), we need:
- \(s\) – the sample standard deviation,
- \(z\) or \(t\) – the relevant quantile derived from the level of confidence, and
- the margin of error – within what margin/level of precision do we require the answer?
From here, it is a numeric solution. For example, suppose I have some data with a mean of 55 and a standard deviation of 5, and I wish to figure out the sample size required to show that the true mean is 50 or bigger [a one-sided problem]. The one-sided \(z\) for 0.95 probability is -1.645. The calculation proceeds as follows.
z <- qnorm(0.05) # a 95 percent lower bound, not a two-sided bound
s <- 5
margin.of.error <- 5 # here, 55 - 50
n.result <- z^2*s^2/margin.of.error^2
n.result
## [1] 2.705543
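Sample sizes are whole numbers, so we round up; ceiling() does that, reusing the n.result computed above.
ceiling(n.result)
## [1] 3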
For this problem, I only need three people. With such a small sample, the difference between \(z\) and \(t\) will be important. Let’s plug it in with \(t\).
t <- qt(0.05, df=2) # 95 percent lower bound; df = n - 1 with the z-based n of 3
s <- 5
margin.of.error <- 5 # here, 55 - 50
n.result <- t^2*s^2/margin.of.error^2
n.result
## [1] 8.526316
This says that I need 9. The \(z\) was -1.645; the \(t\) with 2 degrees of freedom is -2.92. To solve it with \(t\), I have to guess an \(n\) and then check whether the solution matches up. Why do I have to guess? Because I cannot obtain the \(t\) value without knowing \(n\); \(t\) is defined by its degrees of freedom (\(n-1\)). Here, a decent guess is \(n=5\), so test it out to see if it works.
t <- qt(0.05, df=4) # 95 percent lower bound; df = n - 1 for the guess of n = 5
s <- 5
margin.of.error <- 5 # here, 55 - 50
n.result <- t^2*s^2/margin.of.error^2
n.result
## [1] 4.544771
That works: the implied \(n\) of 4.54 rounds up to 5, matching the guess. You will not always find a perfect match because \(t\) depends on the solution itself.
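The guess-and-check can be automated: start from the \(z\)-based answer and update the degrees of freedom until the implied \(n\) stops changing. A minimal sketch, assuming the same \(s\), margin of error, and one-sided 95 percent bound as above:
s <- 5
margin.of.error <- 5
n.guess <- ceiling(qnorm(0.05)^2*s^2/margin.of.error^2) # start from the z answer
for (i in 1:50) { # cap the iterations in case the updates ever cycle
  t <- qt(0.05, df=n.guess-1)
  n.new <- ceiling(t^2*s^2/margin.of.error^2)
  if (n.new == n.guess) break
  n.guess <- n.new
}
n.guess
## [1] 5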
Sample sizes for proportions are a bit harder. The reason is that, unlike the \(t\) and normal versions, the variance and standard deviation of a binomial (or of the normal approximation to the binomial) are not independent of the mean. The standard deviation of a binomial count is \(\sqrt{n*p_{0}*(1-p_{0})}\), but that describes the data – the observed number of successes. When examining the proportion, we divide by \(n\), which yields \(\sqrt{\frac{p*(1-p)}{n}}\). It is easy to show that this standard deviation is at its maximum when \(p\) or \(p_{0}\) is 0.5. We use this fact routinely because it guarantees that the sample size is large enough, though it could be too large – the estimate is conservative.
Start with the equation:
\[ z = \frac{\hat{p} - p_{0}}{\sqrt{\frac{p_{0}*(1-p_{0})}{n}}} \]
We require:
- a planning proportion – the assumed value of \(p_{0}\) that allows us to calculate the standard deviation,
- a probability/confidence level from which to derive the \(z\) quantile, and
- a margin of error [the range for the estimated probability] expressed as \(\hat{p} - p_{0}\).
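As before, it helps to solve for \(\sqrt{n}\) first:
\[ \sqrt{n} = \frac{z*\sqrt{p_{0}*(1-p_{0})}}{\hat{p} - p_{0}} \]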
Squaring both sides, we get:
\[ n = \frac{z^2*p_{0}*(1-p_{0})}{(\hat{p} - p_{0})^2} \]
Suppose that I want a survey with a margin of error of plus or minus 4 percent at 95 percent confidence, using the conservative planning proportion of 0.5.
p0 <- 0.5 # the conservative planning proportion
z <- qnorm(0.025) # two-sided 95 percent confidence, 0.025 in each tail
margin.of.error <- 0.04 # plus or minus 4 percent
solve.n <- z^2*p0*(1-p0)/margin.of.error^2
solve.n
## [1] 600.2279
We would need 601 [or fewer if her true approval proportion is above or below 0.5] to estimate the proportion of Oregon residents approving of Kate Brown’s job performance to within plus or minus four percent.
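To see how conservative the 0.5 planning value is, here is a quick sketch that recomputes the required \(n\) for a few smaller planning proportions (the particular grid of values is arbitrary):
z <- qnorm(0.025)
margin.of.error <- 0.04
p0.grid <- c(0.5, 0.4, 0.3, 0.2)
ceiling(z^2*p0.grid*(1-p0.grid)/margin.of.error^2)
## [1] 601 577 505 385
The same numbers apply to planning values above 0.5, because \(p_{0}*(1-p_{0})\) is symmetric around 0.5.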