Sampling error calculator

These calculations are based on “Interval Estimate of Population Proportion.”

We can predict the distribution of a variable in a population from its distribution in a sample. But even if we assume that our sample is totally random (and in that sense representative), there's a certain amount of random error in a sample. For instance, suppose that we have a population of 136,000 volumes from which we sample 2169. Of those 2169 we find 4 volumes that contain a phrase we'll call X.

(In reality some volumes might have X more than once. So technically we should be thinking about a distribution across words rather than volumes, but I don't have information about numbers of words per volume, so I'll simplify to a model where the phrase occurs only once per volume.)

Let's calculate what percentage of the volumes in our sample have X.

n <- 2169
k <- 4
pbar = k/n; pbar * 100
## [1] 0.1844168

We've calculated a sample proportion: 0.184% of the volumes in the sample have X. We could also guesstimate that 0.184% of the volumes in the larger population have X, but how confident are we about that estimate? To find out, we calculate the standard error, which is the standard deviation of the sampling distribution of that proportion.

SE = sqrt(pbar * (1 - pbar)/n); SE     # standard error 
## [1] 0.0009212333
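
One way to see what the standard error means is to simulate it. This is only a sketch: it assumes the true proportion equals pbar, and the number of simulated samples (n_sims) and the seed are arbitrary choices, not part of the original calculation.

```r
# Sketch: check the standard error formula by simulation.
# n_sims and the seed are arbitrary assumptions.
set.seed(42)
n <- 2169
k <- 4
pbar <- k / n
SE <- sqrt(pbar * (1 - pbar) / n)

n_sims <- 10000
# Each draw is the number of volumes with X in one simulated sample of size n,
# assuming the true proportion is pbar; dividing by n turns it into a proportion.
sim_props <- rbinom(n_sims, size = n, prob = pbar) / n

sd(sim_props)   # spread of the simulated sample proportions
SE              # the formula's value, which should be close to the line above
```

If the two numbers agree, the formula is doing what we want: it describes how much the sample proportion would bounce around if we drew the sample again and again.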

We can be sorta kinda confident (about 68%) that the real value is within one standard error of our estimate, but to be 95% confident, we need to go further out toward the tails of the bell curve: about 1.96 standard errors in each direction.

E = qnorm(.975) * SE; E              # margin of error 
## [1] 0.001805584
interval = pbar + c(-E, E) 
interval * 100
## [1] 0.003858377 0.364975187
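
The multiplier qnorm(.975) can be checked against the normal curve directly. A small sketch, using only base R's pnorm and qnorm; the variable names are mine:

```r
# Share of a normal distribution within 1 standard error of the mean
within_one_se <- pnorm(1) - pnorm(-1)
within_one_se              # about 0.68

# Multiplier that leaves 2.5% in each tail, i.e. 95% in the middle
z95 <- qnorm(0.975)
z95                        # about 1.96

# Share of the distribution within z95 standard errors of the mean
pnorm(z95) - pnorm(-z95)   # 0.95
```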

So the actual proportion of volumes with X could be anywhere from about 0.004% to 0.36%. That's a pretty wide variation. I mean, actually these are all tiny values, but because they're so tiny the amount of variation is large relative to the values themselves. We've sampled 2169 of 136000 volumes, so how many are left?

left <- 136000 - n; left
## [1] 133831
left * interval
## [1]   5.163705 488.449942

That's how many volumes with X there could be in the part of the population we haven't sampled. We've already found 4, so in the whole population there could be:

(left * interval) + 4
## [1]   9.163705 492.449942

Or roughly 251 ± 242.
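
The "251 ± 242" figure is just the midpoint and half-width of that range. A minimal sketch that recomputes it from scratch; N is a name I'm introducing here for the population size of 136000:

```r
n <- 2169     # volumes sampled
k <- 4        # volumes in the sample that contain X
N <- 136000   # population size (hypothetical variable name)

pbar <- k / n
E <- qnorm(0.975) * sqrt(pbar * (1 - pbar) / n)

# Same bounds as above: unsampled volumes times the interval, plus the 4 found
total <- (N - n) * (pbar + c(-E, E)) + k

midpoint <- mean(total)
half_width <- diff(total) / 2
round(c(midpoint, half_width))   # 251 242
```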

Basically, sampling error swamps everything here because we only found four occurrences of the phrase. If it had been a common word we could make a more accurate estimate. Suppose we found a word in roughly 50% of volumes (1100 out of the same 2169).

k <- 1100
pbar = k/n
SE = sqrt(pbar * (1 - pbar)/n)
E = qnorm(.975) * SE              # margin of error 
interval = pbar + c(-E, E)
left <- 136000 - n
(left * interval) + k
## [1] 66156.08 71787.67

In that case we could produce a much tighter estimate of its prevalence in the larger population: roughly 69,000 ± 2,800 volumes, or about ±4%.
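
Since both runs repeat the same steps, they can be wrapped in a function. This is a sketch, not a polished tool: estimate_total and rel_width are hypothetical names, and the function just reproduces the normal-approximation steps used above.

```r
# Hypothetical helper: range for the number of volumes in a population of
# size N that contain a phrase, given k hits in a random sample of n volumes.
estimate_total <- function(k, n, N, conf = 0.95) {
  pbar <- k / n
  SE <- sqrt(pbar * (1 - pbar) / n)      # standard error of the proportion
  E <- qnorm(1 - (1 - conf) / 2) * SE    # margin of error
  interval <- pbar + c(-E, E)
  (N - n) * interval + k                 # unsampled estimate plus hits already found
}

rare <- estimate_total(4, 2169, 136000)
common <- estimate_total(1100, 2169, 136000)

# Half-width of the range relative to its midpoint
rel_width <- function(x) (diff(x) / 2) / mean(x)
rel_width(rare)     # close to 1: the uncertainty is about as big as the estimate
rel_width(common)   # about 0.04
```

Comparing the two relative widths makes the point of the whole exercise: with the same sample size, the rare phrase leaves us uncertain by nearly 100% of the estimate, the common word by only a few percent.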