The starting point is the \(t\) equation:
\[ t = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}} \]
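Rearranging for \(\sqrt{n}\) first keeps the algebra transparent:
\[ \sqrt{n} = \frac{t s}{\overline{x} - \mu} \]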
Squaring both sides then solves for \(n\):
\[ n = \frac{t^2s^2}{(\overline{x}-\mu)^2} \]
The required \(n\) increases as:
- \(s\) gets bigger,
- \(t\) gets bigger, or
- the margin of error (the whole of \(\overline{x} - \mu\)) gets smaller.
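Because the margin of error enters the formula squared, halving it roughly quadruples the required \(n\). A quick illustration (the \(s = 10\) and the margins of 4 and 2 are arbitrary choices):
z <- qnorm(0.05) # one-sided 95 percent lower bound
s <- 10
ceiling(z^2*s^2/c(4, 2)^2) # required n for margins of error of 4 and then 2
## [1] 17 68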
Where do the parameters come from? For the one- and two-sample tools that use the normal to approximate the \(t\), we need:
- \(s\) – the sample standard deviation,
- \(z\) or \(t\) – the relevant quantile derived from the level of confidence, and
- the margin of error – within what margin/level of precision do we require the answer?
From here, it is a numeric solution. For example, suppose I have some data with a mean of 55 and a standard deviation of 5, and I wish to figure out the sample size required to show that the true mean is 50 or bigger [a one-sided problem]. The one-sided \(z\) for 0.95 probability is -1.645. The calculation proceeds as follows.
z <- qnorm(0.05) # a 95 percent lower bound, not a two-sided bound
s <- 5
margin.of.error <- 5 # here, 55 - 50
n.result <- z^2*s^2/margin.of.error^2
n.result
## [1] 2.705543
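Sample sizes are whole numbers, so we round up; ceiling() does that, reusing the n.result computed above.
ceiling(n.result)
## [1] 3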
For this problem, I only need three people. With such a small sample, the difference between \(z\) and \(t\) will be important. Let’s plug it in with \(t\).
t <- qt(0.05, df=2) # 95 percent lower bound; df = n - 1 with the z-based n of 3
s <- 5
margin.of.error <- 5 # here, 55 - 50
n.result <- t^2*s^2/margin.of.error^2
n.result
## [1] 8.526316
This says that I need 9. The \(z\) was -1.645; the \(t\) with 2 degrees of freedom is -2.92. To solve it with \(t\), I have to guess an \(n\) and then check whether the solution matches up. Why do I have to guess? Because I cannot obtain the \(t\) value without knowing \(n\); \(t\) is defined by its degrees of freedom (\(n-1\)). Here, a decent guess is \(n=5\), so test it out to see if it works.
t <- qt(0.05, df=4) # 95 percent lower bound; df = n - 1 for the guess of n = 5
s <- 5
margin.of.error <- 5 # here, 55 - 50
n.result <- t^2*s^2/margin.of.error^2
n.result
## [1] 4.544771
That works: the implied \(n\) of 4.54 rounds up to 5, matching the guess. You will not always find a perfect match because \(t\) depends on the solution itself.
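The guess-and-check can be automated: start from the \(z\)-based answer and update the degrees of freedom until the implied \(n\) stops changing. A minimal sketch, assuming the same \(s\), margin of error, and one-sided 95 percent bound as above:
s <- 5
margin.of.error <- 5
n.guess <- ceiling(qnorm(0.05)^2*s^2/margin.of.error^2) # start from the z answer
for (i in 1:50) { # cap the iterations in case the updates ever cycle
  t <- qt(0.05, df=n.guess-1)
  n.new <- ceiling(t^2*s^2/margin.of.error^2)
  if (n.new == n.guess) break
  n.guess <- n.new
}
n.guess
## [1] 5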
Sample sizes for proportions are a bit harder. The reason is that, unlike the \(t\) and normal versions, the variance and standard deviation of a binomial (or of the normal approximation to the binomial) are not independent of the mean. The standard deviation of a binomial count is \(\sqrt{n*p_{0}*(1-p_{0})}\), but that describes the data – the observed number of successes. When examining the proportion, we divide by \(n\), which yields \(\sqrt{\frac{p*(1-p)}{n}}\). It is easy to show that this standard deviation is at its maximum when \(p\) or \(p_{0}\) is 0.5. We use this fact routinely because it guarantees that the sample size is large enough, though it could be too large – the estimate is conservative.
Start with the equation:
\[ z = \frac{\hat{p} - p_{0}}{\sqrt{\frac{p_{0}*(1-p_{0})}{n}}} \]
We require:
- a planning proportion – the assumed value of \(p_{0}\) that allows us to calculate the standard deviation,
- a probability/confidence level from which to derive the \(z\) quantile, and
- a margin of error [the range for the estimated probability] expressed as \(\hat{p} - p_{0}\).
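As before, it helps to solve for \(\sqrt{n}\) first:
\[ \sqrt{n} = \frac{z*\sqrt{p_{0}*(1-p_{0})}}{\hat{p} - p_{0}} \]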
Squaring both sides, we get:
\[ n = \frac{z^2*p_{0}*(1-p_{0})}{(\hat{p} - p_{0})^2} \]
Suppose that I want a survey with a margin of error of plus or minus 4 percent at 95 percent confidence, using the conservative planning proportion of 0.5.
p0 <- 0.5 # the conservative planning proportion
z <- qnorm(0.025) # two-sided 95 percent confidence, 0.025 in each tail
margin.of.error <- 0.04 # plus or minus 4 percent
solve.n <- z^2*p0*(1-p0)/margin.of.error^2
solve.n
## [1] 600.2279
We would need 601 [or fewer if her true approval proportion is above or below 0.5] to estimate the proportion of Oregon residents approving of Kate Brown’s job performance to within plus or minus four percent.
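To see how conservative the 0.5 planning value is, here is a quick sketch that recomputes the required \(n\) for a few smaller planning proportions (the particular grid of values is arbitrary):
z <- qnorm(0.025)
margin.of.error <- 0.04
p0.grid <- c(0.5, 0.4, 0.3, 0.2)
ceiling(z^2*p0.grid*(1-p0.grid)/margin.of.error^2)
## [1] 601 577 505 385
The same numbers apply to planning values above 0.5, because \(p_{0}*(1-p_{0})\) is symmetric around 0.5.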