Introduction

In statistics our aim is to discover information about a population or differences between populations. Populations can be enormous, and in most cases we cannot get data from each subject in a population. Instead, we pick representative samples from a population or populations and describe or compare data for variables for those selected subjects only. This saves time and money and makes it possible to discover the information we seek.

Inferential statistics is all about analyzing our sample data and inferring the results to the population(s). As an example, in the case of a trial comparing an active intervention and a placebo, subjects are selected from a population. It is shown that the subjects in the two groups are similar with respect to important variables, such that the only difference between them is whether they receive the active intervention or the placebo. Since the effect of the intervention is measured by a variable, we can now compare the values for this variable between the two groups. If healthcare professionals who read the results of this paper consider the population they care for to be similar to the population from which the subjects were taken for the study, then the results of the study can be inferred to the patients they care for. In this sense, we note that research informs our practice.

How do we go about showing a difference between groups, though? Data point values for a variable from samples or from a population come in patterns known as distributions. Most people are familiar with the bell-shaped curve of the normal distribution. In the plot below, we see the bell-shaped standard normal distribution. The \(x\) axis shows the possible values of a variable and the \(y\) axis shows the probability density. Values for the variable near \(0\) are more likely to occur than values further away from \(0\).

x <- seq(-4, 4, length = 100)  # Values along the x axis
hx <- dnorm(x)  # Standard normal density at each value

plot(
  x,
  hx,
  type = "l",
  main = "Standard normal distribution",
  xlab = "Variable",
  ylab = "Density",
  las = 1
)

It is not only the values for a variable that come in a pattern such as the one shown above. If we could repeat a study over and over again, each time calculating a test statistic for a variable, there would also be a pattern or distribution to these values. It is termed a sampling distribution.

Sampling distributions form the basis of inferential statistics. They allow us to express common statistical entities such as confidence levels and p values. To see this connection, we start by looking at sampling.

Sampling

Consider the image below.

(Image: a bowl of multicolored sweets)

What is the fraction (proportion) of green candies? We could count all the items to get a result. What if there are hundreds of thousands of candies, though? Instead of counting all the individual candies, it would be much easier to take a representative sample and count only those selected few. The results gained from this sampling can then be inferred to all of the candies.

By pure chance, though, our single sample could over-represent one color. The results from our counting might then not be representative of the whole.

A better solution is to repeat the sampling over and over again. After each sampling, the results are tabulated. The candies are returned to the bowl and a new random sample is taken.
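As a minimal sketch of this idea in R (the bowl contents below are hypothetical counts chosen purely for illustration), we can simulate repeated samples from a bowl and record the proportion of green candies in each:

bowl <- rep(c("green", "red", "yellow", "blue"),
            times = c(200, 300, 250, 250))  # A hypothetical bowl of 1000 candies
set.seed(42)
# Take 100 repeated samples of 50 candies, returning them to the bowl each time,
# and record the proportion of green candies in each sample
green_props <- replicate(100, mean(sample(bowl, size = 50) == "green"))
summary(green_props)  # The proportions cluster around the true value of 0.2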

In the next section, we use the R language to simulate this kind of repeated sampling from a larger population.

Simulated sampling from a population

Let’s consider a variable in a population of \(2000\) subjects. This is a small population, but to keep our calculations simple, it will suffice. The runif (short for random uniform) function allows us to choose random data point values from a uniform distribution, in which every value on a stated interval has an equal probability of being selected. Think of rolling a die: there are six elements in the sample space and each value has an equal probability of landing face up. In our example, every value on the interval from \(11\) to \(20\) has an equal chance of being selected at every turn during this creation process. We draw one value, document it, and repeat the process until we have \(2000\) values. We can then draw a histogram to show the distribution of our population variable.

set.seed(123)  # Seed the pseudo-random number generator for reproducible results
population <- runif(2000,
                    min = 11,
                    max = 20)  # Draw 2000 random values from the interval [11, 20], each equally likely
# Create a histogram with counts shown above each bin
hist(population,
     breaks = 10,
     main = "Population variable",
     xlab = "Variable",
     ylab = "Count",
     labels = TRUE,
     xlim = c(11, 20),
     ylim = c(0, 250))

The histogram shows that the values spread across the whole interval, with the counts in each bin being roughly equal.

Now we consider doing some research on our population. We select \(30\) unique subjects at random (replace = FALSE) and calculate the mean of the variable data point values.

set.seed(123)
mean(sample(population,
            size = 30,
            replace = FALSE)) # Calculate the mean of 30 random subjects
## [1] 15.25123

We note a mean of \(15.3\). Consider the following, though: what if we started our experiment a week later? We would get a different random sample of \(30\) subjects. We can simulate this by seeding the pseudo-random number generator with a different value. (Remember that we are only using the set.seed function to get reproducible results; by omitting it, we would get different values each time we run the code.)

set.seed(321)
mean(sample(population,
            size = 30,
            replace = FALSE))
## [1] 15.89853

Now we have a mean of \(15.9\). What if we ran our experiment \(10\) times, recording the mean each time? Well, we would have \(10\) random values of a statistical variable. Yes indeed, a statistic such as the mean can be a variable!

In the code chunk below, we use a for loop to run our experiment \(10\) times, appending the mean of each run of \(30\) subjects to an initially empty vector called sample_10. Then we create a histogram of the \(10\) values. The actual R code is not of interest here and you need not follow it. Rather, understand the statistical concept.

sample_10 <- c()  # Creating an empty vector
set.seed(123)
for (i in 1:10){  # Looping through the experiment 10 times
  sample_10 <- append(sample_10, mean(sample(population, size = 30, replace = FALSE)))}
hist(sample_10,
     main = "Ten repeat studies",
     xlab = "Mean",
     ylab = "Count")

Suddenly the distribution does not seem uniform! Some means occurred more frequently than others. Let’s repeat our experiment \(100\) times. Each time we select \(30\) random subjects and calculate a mean, adding it to the vector object sample_100.

sample_100 <- c()
set.seed(123)
for (i in 1:100){
  sample_100 <- append(sample_100, mean(sample(population, size = 30, replace = FALSE)))}
hist(sample_100,
     main = "One-hundred repeat studies",
     xlab = "Mean",
     ylab = "Count")

Now this is interesting. Below, we do one thousand repeats.

sample_1000 <- c()
set.seed(123)
for (i in 1:1000){
  sample_1000 <- append(sample_1000, mean(sample(population, size = 30, replace = FALSE)))}
hist(sample_1000,
     main = "One-thousand repeat studies",
     xlab = "Mean",
     ylab = "Count",
     ylim = c(0, 400),
     labels = TRUE)

What we have here is the Central Limit Theorem (CLT) in action. The sampling distribution of a statistic (the mean in this example) tends to a normal distribution as the sample size grows, and repeating the experiment many times simply reveals this pattern more clearly. This is profound. Letting the cat out of the bag: these counts translate into probabilities, leading us to p values through the foundation of the scientific method called hypothesis testing.
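The remarkable part is that this holds even when the underlying variable is far from uniform. As a minimal sketch (the exponential population below is an assumption chosen purely for illustration), we can repeat the experiment with a strongly skewed population and still watch a bell shape emerge:

set.seed(123)
skewed_population <- rexp(2000, rate = 1)  # A strongly right-skewed population variable
# One-thousand repeat studies, each taking the mean of 30 random subjects
skewed_means <- replicate(1000, mean(sample(skewed_population, size = 30, replace = FALSE)))
hist(skewed_means,
     main = "Means from a skewed population",
     xlab = "Mean",
     ylab = "Count")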

The z distribution

The normal distribution is arguably the best known distribution. Our first plot above was that of the standard normal distribution. It is bell-shaped, with a mean of \(0\) and a standard deviation of \(1\). Remember that a normal distribution in general is also bell-shaped, but can have any value for its mean and any positive value for its standard deviation.

The standard normal distribution can serve as a sampling distribution, and we then call it the z distribution. We consider a simple study where it is believed that the population mean for a certain continuous numerical variable is \(100\). We collect some data from \(50\) subjects and find a mean of \(105\) and a standard deviation of \(10\). We can use the z distribution to understand how probable it was to have found a result of \(105\) (for the mean) or a mean that is more extreme, i.e. more than \(105\).

The values on the x axis of the standard normal distribution are termed z values, and they represent the number of standard deviations a value lies away from the mean. The z distribution is recreated below, showing vertical lines for z values of \(-1\) and \(1\).

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

plot(
  x,
  hx,
  type = "l",
  main = "z distribution",
  xlab = "z values",
  ylab = "Density",
  las = 1
)

abline(v = -1, col = "green", lwd = 2)
abline(v = 1, col = "green", lwd = 2)

We can convert the difference between our sample mean and the believed mean into z values, termed a z score. This is done through the equation below, where \(\bar{x}\) is the mean of the variable from our \(50\) subjects, \(\mu_{0}\) is the believed mean, and \(s\) is the sample standard deviation for the variable. Finally, we have \(n\), which is the sample size.

\[z = \frac{\bar{x} - \mu_{0}}{\frac{s}{\sqrt{n}}}\]

The denominator, in which the sample standard deviation is divided by the square root of the sample size, is termed the standard error. We use it because the z distribution here describes the behaviour of a sample mean rather than of individual values: the spread of sample means is the standard deviation divided by \(\sqrt{n}\). We do this calculation below.

my_z = (105 - 100) / (10 / sqrt(50))
my_z
## [1] 3.535534

We note that \(105\) is about \(3.54\) standard errors away from the believed mean.

In the plot below, we show this z score as a vertical line on the z (standard normal) distribution.

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

plot(
  x,
  hx,
  type = "l",
  main = "z distribution and z score",
  xlab = "z values",
  ylab = "Density",
  las = 1
)
abline(v = my_z, col = "green", lwd = 2)

If this curve indicates how likely various differences are to occur, then we see that our z score was rather unlikely.

The curve in the plot has an area between it and the x axis. Just as a square or a circle has an equation for its geometrical area, there is a mathematical equation for the area under this curve. The total area under the curve is \(1\). If we want to know the probability of having found a mean of more than \(105\), it is the area under the curve to the right of the green line.

The equation for the area under the curve accumulates from the negative infinity side. If we want the area under the curve from our green line towards positive infinity, we simply subtract the area to the left of the line from \(1\). The pnorm function calculates the area under the curve up to a given value. If we pass our z score of about \(3.54\) as an argument, it shows us the area under the curve from negative infinity up to the green line. Below, then, we subtract it from \(1\).

1 - pnorm(my_z)
## [1] 0.000203476

We see that, given the statistics of the variable from our \(50\) subjects, there was a probability of about \(0.0002\) of finding a mean of \(105\) or one that is more extreme.

Just like that, we have our first p value.

The z distribution strictly applies only when we know the standard deviation of the variable in the whole population. We used an approximation of this parameter by substituting the sample standard deviation when calculating the standard error. Fortunately for us, as the sample size increases, this approximation becomes more and more valid. Later in this chapter, we will learn about the t distribution, which we use when we do not know the population standard deviation, especially if our sample size is small and the sample standard deviation is a poor approximation of the population value.
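As a small sketch of this point (using the t distribution a little ahead of its formal introduction below), we can compare the tail area under the z distribution with that under a t distribution with \(50 - 1 = 49\) degrees of freedom. The two are already close at this sample size:

1 - pnorm(my_z)  # Tail area under the z distribution
1 - pt(my_z, df = 49)  # Tail area under a t distribution with 49 degrees of freedom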

We have learned so far that our results are but one of many possible results. We can simulate these possible results or we can use a distribution and consider the area under its curve when dealing with probabilities and p values.

Now, we have to put our knowledge onto a firmer footing.

Hypothesis testing

The goal of inferential statistics is to infer the results from a sample to a larger population. To do this, we need to formulate a research question. It must be formulated in such a way that it refers to variables and to statistics that we can calculate from their values. A simple example would be: “Is there a difference in the test statistic between two groups?” This is indeed a proper question, since we can translate it back to variables for which we have collected data point values.

As scientists, we set a null hypothesis, which we accept as the truth unless gathered evidence shows the contrary. The null hypothesis always assumes that the answer to our research question is negative. In this case: “No, there is no difference in the test statistic between the two groups.” We state that the difference between the test statistics is \(0\).

Now, we need to specify an alternative hypothesis. This is the hypothesis that we accept when evidence shows that we can reject the null hypothesis. This is what happens when we state that there is a statistically significant difference in the test statistic between the groups.

This decision is based on a p value and an arbitrary cut-off, known as the \(\alpha\) value. It is customary to set this at \(0.05\), such that a p value of less than \(0.05\) leads us to reject our null hypothesis and accept our alternative hypothesis. When the p value is not less than \(0.05\), we fail to reject our null hypothesis and state that there is no statistically significant difference in the test statistic between the two groups.
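As a minimal sketch of this decision rule in R, using the p value from our z score example above:

p_value <- 1 - pnorm(my_z)  # The p value from our earlier z score example
p_value < 0.05  # TRUE, so at an alpha of 0.05 we reject the null hypothesis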

Reassignment under the null hypothesis

In most research, we do not know the values for a variable in a population. At best, we only record the values for each subject in our sample taken from the population.

We have seen that a statistic that we calculate from the variable values (such as the mean) is one of many that could possibly occur in the setting of repeating the research many times over. As such, our statistic is one of many possible statistics. We need a way of understanding how probable it was to have found our statistic given the distribution of all other possible values.

Since we do not know what the variable values are in the population, we must find a way of using only the values known for our sample subjects to put our statistic value into perspective. We do this along similar lines as in the previous section. In short, we use resampling.

In the code cell below, we generate an example of results from a study, where a continuous numerical variable has a sample space on the interval from \(10\) to \(20\). To make things interesting, we use the uniform distribution again. We create values for two groups, A and B, with \(30\) subjects in group A and \(35\) subjects in group B.

set.seed(28)
var_A_values <- runif(30, min = 11, max = 20)
var_B_values <- runif(35, min = 10, max = 19)

We will consider our test statistic to be the difference in means.

mean_diff <- mean(var_A_values) - mean(var_B_values)
mean_diff
## [1] 0.9714045

We see a difference of 0.9714045. If we reverse the order of subtraction, we would get -0.9714045. How probable was it for us to have found this test statistic (difference in means)?

To answer this question, we consider the null hypothesis, which states that there is no difference in the means. If that is true, then it does not matter which group a subject belongs to, and we can randomly reassign each subject to one of the groups without affecting anything of importance.

To do this computationally, we combine all \(65\) values and reshuffle them. We can then assign the first \(30\) values to group A and the last \(35\) to group B, and calculate the difference in means.

new_order = sample(c(var_A_values, var_B_values))
new_order
##  [1] 13.98553 19.52103 13.30278 11.98398 13.12078 11.46096 15.57401 18.81732
##  [9] 15.76989 11.79532 13.89153 13.90584 15.24110 19.70502 16.59963 15.01165
## [17] 14.63039 13.03675 12.60561 11.29688 15.26931 11.51048 17.02486 18.35545
## [25] 18.66150 16.87280 15.69365 13.02051 13.24109 13.48939 17.37378 11.69770
## [33] 15.16518 11.57653 12.40480 12.26223 17.32185 15.13122 15.84570 17.67434
## [41] 11.30942 16.08150 11.82410 10.54122 15.06653 18.01117 10.05374 11.28373
## [49] 18.96840 11.25718 17.81611 17.73774 16.94471 13.46968 19.42212 16.53145
## [57] 10.43593 16.54581 14.02405 18.21101 14.40019 12.77090 13.81173 12.98458
## [65] 13.61315

Each time the code chunk above is run, a new order will be created. Below, we use this to repeat this process \(5000\) times. Each time we use the first \(30\) values as the reassigned group A and the rest as group B. We then calculate the difference in means and append it to an initially empty vector.

mean_null_hypothesis <- c()

for (i in 1:5000){
  new_reassignment <- sample(c(var_A_values, var_B_values))
  new_A = new_reassignment[1:30]
  new_B = new_reassignment[31:65]
  mean_null_hypothesis <- append(mean_null_hypothesis, mean(new_A) - mean(new_B))}

We can now view a histogram of all the possible mean differences under the null hypothesis. Our own mean difference is marked as well; since the order of subtraction has no meaning, we mark both its negative and positive values with vertical lines.

hist(mean_null_hypothesis,
     main = "Five-thousand mean differences under the null hypothesis",
     xlab = "Differences in means",
     ylab = "Count")
abline(v = -mean_diff, col = "blue", lwd = 2)
abline(v = mean_diff, col = "blue", lwd = 2)

Our single mean difference is one of many. To consider how probable it was to have found our difference in means, we look at the fraction of the \(5000\) differences that are less than -0.9714045 and the fraction of mean differences that are more than 0.9714045.

less <- sum(mean_null_hypothesis < -mean_diff) / 5000
more <- sum(mean_null_hypothesis > mean_diff) / 5000
less + more
## [1] 0.1282

For an \(\alpha\) value of \(0.05\), the simulated probability of about \(0.13\) indicates that our test statistic is one that would commonly be found under the null hypothesis. We therefore fail to reject the null hypothesis.

In the next chapter, we will see that we can conduct a t test to look at the differences in means for two groups. Below, we use the t.test function to calculate a p value so that we can compare it to our resampling technique above.

t.test(var_A_values, var_B_values, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  var_A_values and var_B_values
## t = 1.5006, df = 63, p-value = 0.1384
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3221735  2.2649825
## sample estimates:
## mean of x mean of y 
##  15.19944  14.22804

We see that our resampled result (\(0.128\)) is close to the p value of Student’s t test (\(0.138\)). Using reassignment, we now understand how a p value works.

The tests that we are going to use are based on the parameters of sampling distributions. Below, we look at some of these sampling distributions.

The t distribution

The t tests that we are going to encounter in the next chapter are based on the t distribution. It is one of the most commonly used sampling distributions. If we knew the standard deviation for a variable in a whole population (the standard deviation of the difference in means for two groups, for instance), we could use the z distribution.

Since we hardly ever know the standard deviation for a variable in a whole population, we make do with the t distribution. This sampling distribution requires only one parameter from our data, which is derived from the sample size.

We use the t distribution when comparing a numerical variable between two groups. It forms the basis for the famous Student’s t test, developed by William Gosset while working at the Guinness brewery. The parameter we take from our sample is the sample size minus the number of groups. This is called the degrees of freedom (DOF). Since we are comparing two groups, with sample sizes \(n_1\) and \(n_2\), we have a DOF value of \(n_1 + n_2 - 2\).
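For the two groups from our reassignment example above, this gives exactly the degrees of freedom reported in the t.test output:

length(var_A_values) + length(var_B_values) - 2  # 30 + 35 - 2
## [1] 63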

The t distribution is also bell-shaped. For larger sample sizes, it closely approximates the z distribution.

In the graph below, we see a dark line representing a DOF value of \(30\). Considering the height of the curves at the mean of \(0\), we see successive curves for DOF values of \(1, 2, 3, 5\), and \(10\) (from the bottom up).

curve(dt(x, df = 30),
      from = -3, to = 3,
      lwd = 3,
      ylab = "Distribution",
      xlab = "t statistic")
ind <- c(1, 2, 3, 5, 10)
for (i in ind) curve(dt(x, df = i), -3, 3, add = TRUE)

The \(\chi^2\) distribution

Another common sampling distribution is the \(\chi^2\) distribution. It also takes one parameter from our sample, being the DOF. It is calculated a bit differently, though.

We commonly use the \(\chi^2\) distribution when dealing with categorical variables. In the graph below, we see a dark line representing a DOF value of \(1\). Successive curves have DOF values of \(2, 3, 5\), and \(10\).

curve(dchisq(x,
             df = 1),
      from = 0,
      to = 10,
      lwd = 3,
      ylab = "Distribution",
      xlab = "Statistic")
ind <- c(2, 3, 5, 10)
for (i in ind) curve(dchisq(x,
                            df = i),
                     0,
                     10,
                     add = TRUE)
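Just as pnorm gave us an area under the z distribution, the pchisq function gives the area under a \(\chi^2\) curve. A small usage sketch, with the statistic value chosen purely for illustration:

1 - pchisq(3.84, df = 1)  # Tail area beyond a chi-squared statistic of 3.84, roughly 0.05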

The F distribution

We are going to use the F distribution when we compare the means of a numerical variable for more than two groups. In this case, we will use one-way analysis of variance, where we are interested in the between-group and within-group variances. The result is going to include two degrees of freedom values. The first is one less than the number of groups in our sample, and the second is the difference between the number of subjects in our sample and the number of groups.
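As a quick hypothetical example (the group and subject counts below are chosen purely for illustration), a one-way analysis of variance comparing \(3\) groups with \(60\) subjects in total would have the following two degrees of freedom values:

k <- 3  # Hypothetical number of groups
n <- 60  # Hypothetical total number of subjects
c(k - 1, n - k)  # Numerator and denominator degrees of freedom
## [1]  2 57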

In the plot below, we see a few curves for various combinations of these two degrees of freedom values.

curve(df(x, 3, 1),
      from = 0,
      to = 4,
      lwd = 3,
      main = "The F distribution",
      ylab = "Distribution",
      xlab = "Statistic")
dof <- c(5, 7, 10)
for (i in dof) curve(df(x, i, 2),
                     from = 0,
                     to = 4,
                     add = TRUE)
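As with the other sampling distributions, the pf function returns the area under the F curve up to a given statistic, from which a p value follows. A small usage sketch, with the statistic and degrees of freedom chosen purely for illustration:

1 - pf(3.2, df1 = 2, df2 = 57)  # Tail area beyond an F statistic of 3.2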

Definitions

Below is a list of some common terms used in statistics.

The population or study population includes all individuals to whom research pertains. This can be all diabetics or all mice with a certain gene sequence or all mice in the world for that matter. The population size is denoted by \(N\).

Any value we calculate from all the data point values for a variable among everyone in the population is termed a parameter. In the case of the mean, we use the symbol \(\mu\) and in the case of the standard deviation, we use the symbol \(\sigma\) (or \(\sigma^{2}\) for variance).

A census is the meticulous examination of all individuals in a population so as to measure and capture the data for a variable.

A sample is a random selection of individuals taken from a population. Data is collected for a variable only from these selected individuals. A sample size is denoted by \(n\).

Any value calculated from the data point values for a variable in such a sample is termed a statistic. In the case of the mean, we use the symbol \(\bar{x}\) and in the case of the standard deviation, we use the symbol \(s\) (or \(s^{2}\) for variance).

In representative sampling, the values for a variable are roughly in line with those of the whole population. Any statistic is then a good approximation of the parameter for a specific variable, and results from such a sample are generalizable. If certain individuals have a higher likelihood of being chosen, then such a sample is biased and results from it would not be representative of the population.

Conclusion

Sampling distributions are the bedrock of inferential statistics, and while we do not need intimate knowledge of them, it is good to be aware of them, as we use them to calculate the p values that we report in journal articles. It is a good idea to return to this chapter when you conduct t tests, \(\chi^2\) tests, and analysis of variance. Their meaning will become crystal clear then.

The important idea in this chapter, though, is that of the CLT. Our statistical analysis of data collected for a research project will yield results that are one of many possible results. Some occur commonly and some, given the data, occur infrequently. This distribution of outcomes allows us to judge the results of the analysis of our data.

In the next chapter, we will build on the concept of hypothesis testing while conducting some of the most common tests in inferential statistics. It is through hypothesis testing that we make the judgement between common and rare findings, allowing us to express p values.

The tests that we choose to use in our analysis depend on the distributions of the sample data and the results of these (parametric) tests depend on sampling distributions.