In this article we’ll look at the most basic of sampling problems: we’ve taken some sample measurements and computed their average, and now we want to put a confidence interval (“error bars”) around the average. We’re going to see that the sample values are distributed differently than the population being sampled, especially when the sample size is small. To create a meaningful confidence interval we need to base it on the appropriate distribution.
Memories
At university I was taught to always state confidence intervals on measurements. In my first-year lecture courses this advice was mostly pro forma, but in the lab courses in my sophomore year, where we actually did some measuring, the professor introduced a systematic method.
Repeat a measurement, say n = 5 times.
Calculate the mean of the n samples.
Calculate the standard deviation, making sure to divide the sum of squares by (n-1), not n.
Use a table like the one below to find the “t-value” corresponding to the confidence interval you want and for the appropriate number of “degrees of freedom”, which we were told was the number of samples minus one. We also needed to be careful to choose the correct column. Most of the time we wanted the two-tailed version because the measurements were equally likely to be too low or too high. To get a 95% confidence interval we needed to pick the \(t_{0.975}\) or \(t_{0.025}\) value. If you take 5 samples and want a 95% confidence interval on your error bars, the appropriate t-value is 2.776.
Finally, take the standard deviation you calculated before, divide it by the square root of the number of measurements, multiply by the t-value, and put a “+/-” in front to get the error bars around the mean value.
measurements <- c(4.8, 4.9, 5.0, 5.1, 5.2)
mean <- mean(measurements)
sd <- sd(measurements)
t_value <- 2.776  # Looked up from a printed table, but these days you could call qt(0.975, 4)
cat(paste(round(mean, 3), "+/-", round(t_value * sd / sqrt(length(measurements)), 3), "\n"))
5 +/- 0.196
None of this was explained. The procedure wasn’t difficult so I just learned it by rote, but it was mysterious (and you needed to keep a table of t-values handy).
Sampling From A Normal Distribution
First we’ll demonstrate through some brute-force sampling experiments how samples show a different probability mass distribution than the thing being sampled.
To keep it simple, we’ll use a standard normal distribution as the underlying population distribution. The true population mean (µ) is zero, and the true standard deviation (σ) is one.
We’re going to be taking a lot of samples from this distribution, so let’s make a function to do that. The sample_from_normal_distribution() function takes n samples from a standard normal distribution and computes the mean of the sample, the standard deviation, the standard error (sd / sqrt(n)), and two candidate confidence intervals (based on the alpha parameter): one using the normal distribution and the other using the Student’s¹ t distribution.
sample_from_normal_distribution <- function(n, alpha) {
  samples <- rnorm(n)
  mean <- mean(samples)
  # R's sd() computes the sample standard deviation, dividing the sum of squares by (n - 1), not n
  sd <- sd(samples)
  se <- sd / sqrt(n)
  # Compute the appropriate quantile values for the two distributions
  t_hi <- qt(1 - (alpha / 2), n - 1)
  t_lo <- qt(alpha / 2, n - 1)
  n_hi <- qnorm(1 - (alpha / 2))
  n_lo <- qnorm(alpha / 2)
  return(list(mean = mean,
              sd = sd,
              se = se,
              t_lo = mean + (t_lo * se),
              t_hi = mean + (t_hi * se),
              n_lo = mean + (n_lo * se),
              n_hi = mean + (n_hi * se)))
}
Single Sampling Experiment
In our first experiment we’ll take some samples and compute the confidence intervals based on two different assumptions of how the samples are distributed.
set.seed(1234554321)
sample <- sample_from_normal_distribution(10, 0.05)
cat(paste0("mean = ", round(sample$mean, 3)), "\n")
cat(paste0("confidence interval based on normal distribution: [", round(sample$n_lo, 3), ", ", round(sample$n_hi, 3), "]", "\n"))
cat(paste0("confidence interval based on t distribution: [", round(sample$t_lo, 3), ", ", round(sample$t_hi, 3), "]", "\n"))
mean = 0.088
confidence interval based on normal distribution: [-0.5, 0.675]
confidence interval based on t distribution: [-0.59, 0.766]
Note how the confidence interval based on the normal distribution is narrower than the confidence interval based on the t distribution.
What Is A Confidence Interval?
The frequentist interpretation of a 95% confidence interval states that if we repeat the same sampling procedure multiple times, the true population mean will lie within the computed confidence interval 95% of the time.
Let’s see which of our computed intervals, based on two different assumptions about the distribution of samples, agrees best with this definition.
Repeated Experiments
The sampling_experiment() function performs N repetitions of n samples each, and counts how many times we see a confidence interval based on alpha that does not include the true population mean (zero). If alpha == 0.05 we expect this to happen 5% of the time.
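The code for sampling_experiment() isn't reproduced here, but a minimal sketch consistent with that description might look like the following. The repetition count of 100000 and the exact call are assumptions inferred from the percentages in the output below.

# Sketch of sampling_experiment(); assumes sample_from_normal_distribution() from above.
# N = 100000 is an assumption inferred from the output percentages.
sampling_experiment <- function(N, n, alpha) {
  miss_normal <- 0
  miss_t <- 0
  for (i in 1:N) {
    s <- sample_from_normal_distribution(n, alpha)
    # A "miss" is an interval that fails to cover the true mean of zero
    if (s$n_lo > 0 || s$n_hi < 0) miss_normal <- miss_normal + 1
    if (s$t_lo > 0 || s$t_hi < 0) miss_t <- miss_t + 1
  }
  cat(paste0("Miss count - normal dist: ", miss_normal, " (", 100 * miss_normal / N, "%)\n"))
  cat(paste0("Miss count - t dist: ", miss_t, " (", 100 * miss_t / N, "%)\n"))
}

sampling_experiment(100000, 5, 0.05)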
Miss count - normal dist: 12018 (12.018%)
Miss count - t dist: 4922 (4.922%)
The true mean falls outside a confidence interval computed from a normal distribution far more often than it does for a confidence interval based on the t distribution. The confidence interval based on the assumption of a normal distribution is too narrow; the correct interval needs to be based on a broader sample distribution.
Increasing the Sample Size
Let’s see what happens if we increase the sample size.
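Assuming the same sketch of sampling_experiment() as above, the call would be something like:

sampling_experiment(100000, 25, 0.05)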
Miss count - normal dist: 6224 (6.224%)
Miss count - t dist: 5077 (5.077%)
Increasing the sample size from 5 to 25 brings the confidence interval computed from the normal distribution closer to the correct value.
Insights from Flipping Coins
The probability that any single sample is greater than or equal to the mean is 0.5, so we can apply simple Bernoulli statistics (coin-flipping) to understand more about how the width of the sample distribution changes with the number of samples. Say we take 5 samples. What is the probability that more than 80% of those samples (4 or 5 samples) come from the high side of the underlying normal distribution? How about for 25 samples (20 or more samples higher than the mean)?
binom_5 <- dbinom(0:5, 5, prob = 0.5)
binom_25 <- dbinom(0:25, 25, prob = 0.5)
cat(paste0("Probability of 80+ % of 5 samples being greater than the mean: ", sum(binom_5[5:6]), "\n"))
Probability of 80+ % of 5 samples being greater than the mean: 0.1875
cat(paste0("Probability of 80+ % of 25 samples being greater than the mean: ", sum(binom_25[21:26]), "\n"))
Probability of 80+ % of 25 samples being greater than the mean: 0.00203865766525269
In summary:
Extreme sampling events broaden the sample distribution.
Extreme sampling events are proportionately more common with small sample sizes.
The distribution we use to compute the confidence interval will depend on the size of the sample set: a small sample set is described by a different (broader) distribution than a larger sample set.
Visualizing Repeated Experiments
The sample_and_pivot() function performs multiple n-sample experiments and formats the results in a way suitable for plotting. The plot_samples() function does the actual plotting.
plot_samples <- function(df) {
  g <- ggplot(df, aes(x = value, y = Y, color = id))
  g <- g + xlim(-4, 4)
  g <- g + stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) + ylab("")
  g <- g + geom_point(alpha = 0.5)
  g <- g + facet_wrap(vars(id), nrow = 5, ncol = 5)
  g <- g + theme(legend.position = "none")
  g
}
sample_and_pivot <- function(N, n) {
  set.seed(12345)
  l <- lapply(1:N, function(i) {
    samples <- as.list(rnorm(n))
    samples$id <- i
    return(samples)
  })
  # Convert to a data table, then pivot into a long and skinny format suitable for graphing
  dt <- rbindlist(l)
  df <- pivot_longer(dt, !id)
  df$Y <- dnorm(df$value)
  return(list(l = l, df = df))
}
# Generate and plot a set of small samples
data_5 <- sample_and_pivot(25, 5)
plot_samples(data_5$df)
These graphs show what happens when we take 5 samples from a standard normal distribution. Notice graph 7, where all 5 samples come from the high side of the distribution, and graphs 16, 17 and 18, where 4 of the 5 samples come from the high side.
We can count the number of high-side samples in each experiment to see just how much “weight” is on the edges of the distribution.
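Here's one way that count could be done, working from the data_5$l list returned by sample_and_pivot(). This snippet is an illustrative sketch, not code from the original analysis.

# For each experiment, count the samples at or above the true mean of zero,
# skipping the "id" element of each list
high_side_counts <- sapply(data_5$l, function(s) {
  sum(unlist(s[names(s) != "id"]) >= 0)
})
table(high_side_counts)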
Let’s repeat this experiment with a sample size of 25 instead of 5. Note how these larger sample sets cluster more in the middle of the underlying normal distribution. There are still some extreme samples (e.g. graphs 6 and 21), but they’ll be balanced out by the large number of samples from the middle of the distribution.
# Generate and plot a small set of larger samples
data_25 <- sample_and_pivot(25, 25)
plot_samples(data_25$df)
As we predicted with our coin-flipping computations, with a sample size of 25 we see relatively fewer extreme sampling events (in this case, no experiments with more than 18 high-side samples or fewer than 8), leading to a narrower sampling distribution.
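To check that claim numerically, the same hypothetical counting snippet from above can be applied to data_25$l:

high_side_counts_25 <- sapply(data_25$l, function(s) {
  sum(unlist(s[names(s) != "id"]) >= 0)
})
range(high_side_counts_25)  # the plots suggest this stays between 8 and 18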
Conclusion
In this article we’ve developed some intuition for how and why values from small samples of a normal distribution are spread more broadly than those from larger samples. We’ve also shown that the Student’s t distribution, not the normal distribution, is the correct basis for confidence intervals around a sample mean.
For the curious, in another article I’ll describe how the Student’s t distribution is derived. You won’t need any of that knowledge to construct confidence intervals using the Student’s t-test, but the derivation is interesting for what it exposes of the internal plumbing of statistics.
Footnotes
Among the many things not explained in my sophomore classes was why it’s called the Student’s t-distribution. “Student” is William Sealy Gosset (https://en.wikipedia.org/wiki/William_Sealy_Gosset), a scientist and statistician working for the Guinness brewery in the first decades of the 20th century. Gosset was concerned with the problem of how to get statistically meaningful data from small numbers of samples. His most important contribution was a correction to the z-test then in use for computing confidence intervals. Guinness allowed their scientists to publish papers, but not under their own names, so Gosset published under the pseudonym “Student”. Accordingly, his approach was dubbed “Student’s z-test”. As the approach became more widespread, the new statistic got a new label, and became “Student’s t-test”.↩︎