In statistics our aim is to discover information about a population or differences between populations. Populations can be enormous, and in most cases we cannot get data from every subject in a population. Instead, we pick representative samples from a population or populations and describe or compare data for variables for those subjects only. This saves time and money and makes it feasible to discover the information we seek.
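As a minimal sketch of this idea, the code below simulates a large population and draws a simple random sample from it. The population (100,000 simulated systolic blood pressure values with a mean of 120 and a standard deviation of 15), the sample size of 50, and all variable names are assumptions made purely for illustration.

```r
set.seed(42)  # arbitrary seed so that the example is reproducible

# Hypothetical population: 100,000 simulated systolic blood pressure values
population <- rnorm(100000, mean = 120, sd = 15)

# Simple random sample of 50 subjects, drawn without replacement
sample_values <- sample(population, size = 50)

# Compare the population mean with the mean calculated from the sample only
population_mean <- mean(population)
sample_mean <- mean(sample_values)
c(population = population_mean, sample = sample_mean)
```

The sample mean will usually be close to, but not exactly equal to, the population mean. In real research the population mean is unknown, which is why we rely on the sample.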
Inferential statistics is all about analyzing our sample data and inferring those results to the population(s). As an example, in a trial comparing an active intervention and a placebo, subjects are selected from a population. It is shown that the subjects in the two groups are similar with respect to important variables, such that the only difference between them is whether they receive the active intervention or the placebo. Since the effect of the intervention is measured by a variable, we can now compare the values for this variable between the two groups. If healthcare professionals who look at the results of the study consider the population they care for to be similar to the population from which the study subjects were taken, then the results of the study can be inferred to the patients they care for. In this sense, research informs our practice.
How do we go about showing a difference between groups, though? The values for a variable, whether from samples or from a population, come in patterns known as distributions. Most people are familiar with the bell-shaped curve of the normal distribution. In the plot below, we see the bell-shaped standard normal distribution. The \(x\) axis shows the possible values of a variable and the \(y\) axis shows the probability density. Values for the variable near \(0\) are more likely to occur than values further away from \(0\).
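# Grid of 100 values from -4 to 4 at which to evaluate the density
x <- seq(-4, 4, length.out = 100)
# Standard normal density (mean 0, standard deviation 1) at each grid value
hx <- dnorm(x)
# Draw the density as a line
plot(
  x,
  hx,
  type = "l",
  main = "Standard normal distribution",
  xlab = "Variable",
  ylab = "Density",
  las = 1
)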
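To put numbers to the claim that values near \(0\) are more likely, the short sketch below uses the cumulative distribution function `pnorm` to compare two intervals of equal width under the standard normal curve. The choice of intervals and the variable names are assumptions made for illustration only.

```r
# Probability that a standard normal value falls between -1 and 1
p_near <- pnorm(1) - pnorm(-1)

# Probability that it falls between 2 and 4, an interval of the same width
# that lies further away from 0
p_far <- pnorm(4) - pnorm(2)

round(c(near_zero = p_near, far_from_zero = p_far), 4)
```

Roughly 68% of values fall in the interval around \(0\), whereas only about 2% fall in the equally wide interval from \(2\) to \(4\).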
It is not only the values for a variable that can come in a pattern as shown above. If we could repeat a study over and over again, each time calculating a test statistic for a variable, there would also be a pattern or distribution to these values. This is termed a sampling distribution.
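Repeating a real study thousands of times is not feasible, but it is easy to simulate. The sketch below assumes a hypothetical population that follows an exponential distribution with a mean of 1, repeats a study of 30 subjects 2,000 times, and records the sample mean (the statistic chosen for this illustration) each time. The population, the number of repetitions, the sample size, and the variable names are all assumptions.

```r
set.seed(1)  # arbitrary seed so that the example is reproducible

n_studies <- 2000   # hypothetical number of repeated studies
sample_size <- 30   # hypothetical number of subjects per study

# For each study, draw a fresh sample from the assumed population
# (exponential with mean 1) and calculate the sample mean
sample_means <- replicate(n_studies, mean(rexp(sample_size, rate = 1)))

# The pattern of the 2,000 sample means approximates the sampling distribution
hist(
  sample_means,
  main = "Simulated sampling distribution of the mean",
  xlab = "Sample mean",
  las = 1
)

# Centre and spread of the simulated sampling distribution
c(mean = mean(sample_means), sd = sd(sample_means))
```

Although the assumed population is strongly skewed, the histogram of sample means is roughly bell-shaped and centred near the population mean of 1. Its standard deviation is close to \(1 / \sqrt{30} \approx 0.18\), far smaller than the population standard deviation of 1.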
Sampling distributions form the basis of inferential statistics. They allow us to express common statistical entities such as confidence levels and p values. To see this connection, we start by looking at sampling.
Consider the image below.