How do we test for group differences?

Consider R’s sleep dataset:

?sleep

Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients. A data frame with 20 observations on 3 variables.

tapply(sleep$extra, sleep$group, summary)
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.600  -0.175   0.350   0.750   1.700   3.700 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.100   0.875   1.750   2.330   4.150   5.500
library(ggplot2)
ggplot(sleep, aes(group, extra)) + 
  geom_boxplot(color = "lightgrey") + 
  geom_point(color = "blue", size = 3)

How do we know if the differences between these groups are statistically significant?

To answer this question, we need to understand:

  - variables and random variables
  - populations versus samples
  - the mean, variance, and standard deviation
  - the central limit theorem
  - the normal distribution
  - Student's t distribution
  - hypothesis testing

Dataset variables

A variable is an attribute of the data. Variables fall into two broad categories:

  - quantitative (numeric) variables, measured on a numeric scale
  - categorical (qualitative) variables, which take one of a limited set of values or levels

A closely related idea is that of random variables:

A random variable is a numerical description of the outcome of a statistical experiment. A random variable that may assume only a finite number or an infinite sequence of values is said to be discrete; one that may assume any value in some interval on the real number line is said to be continuous.

From Encyclopedia Britannica.

In the sleep dataset:

  - extra (the increase in hours of sleep) is a quantitative, continuous variable
  - group (which drug was given) and ID (the patient) are categorical variables
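We can confirm how R stores these variables with str():

# extra is numeric; group and ID are factors.
str(sleep)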

Population vs. Sample

When we conduct a study, we define the population of interest. This is the entire set of elements that possess the characteristics of interest. For example, for sleep, this might be “all people”, “all old women” (if all subjects could be described in this way), “all Maryland high school students who need soporifics”, etc.

We generally cannot observe or measure all the ‘elements of interest’. By definition, populations are large and frequently inaccessible due to time, money, logistics, etc. And so we randomly sample from the population.

This is important, because the nomenclature, statistical formulas, notation, and scope of inference vary depending on whether you're analyzing a population or a sample.
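As a toy illustration (simulated numbers, not part of the sleep study), we can treat a large vector of random values as the "population" and draw a small random sample from it:

set.seed(42)
population <- rnorm(1e6, mean = 2, sd = 1)   # the (simulated) population
my_sample  <- sample(population, size = 30)  # a random sample of 30 elements

mean(population)  # a population parameter
mean(my_sample)   # a sample statistic that estimates it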

Mean, variance, and standard deviation

We can compute several statistics from our sample(s) that are both informative and fundamental to statistical tests.

The mean of the sample is its average:

\[ \huge\bar{x} = \frac{\sum_{i = 1}^{n}{x_i}}{n} \]

To understand a variable’s distribution, we use measures of variability. Variance measures the dispersion of the data from the mean:

\[ \huge s^2 = \frac{\sum_{i = 1}^{n}{(x_i - \bar{x})^2}}{n - 1} \]

That n - 1 in the denominator constitutes the degrees of freedom: the number of independent pieces of information that go into the estimate of a test statistic.

Finally, the standard deviation is the square root of the variance:

\[ \huge s = \sqrt{s^2} \]

By taking the square root, we return the measure to the original units of x. The standard deviation indicates how close the data typically are to the mean.
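For example, we can compute these by hand for the extra column and check them against base R's mean(), var(), and sd():

x <- sleep$extra
n <- length(x)

x_bar <- sum(x) / n                    # sample mean
s2    <- sum((x - x_bar)^2) / (n - 1)  # sample variance (n - 1 degrees of freedom)
s     <- sqrt(s2)                      # standard deviation

c(mean = x_bar, variance = s2, sd = s)
c(mean(x), var(x), sd(x))              # the built-in functions agree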

The central limit theorem

As our sample size increases, the sample mean and variance approach those of the overall population, and the distribution of the sample mean approaches a normal distribution. That second result is the central limit theorem, the core of much of modern statistics.

Watch how the distribution (density) of these random data changes as the sample size increases, getting closer and closer to the theoretical normal distribution (dashed line):
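The original animation is not reproduced here, but a sketch along these lines (using rnorm() and the ggplot2 attached above) draws increasingly large samples from a standard normal distribution and overlays the theoretical density:

set.seed(1)

# Random samples of increasing size from a standard normal distribution.
sizes   <- c(10, 100, 1000, 10000)
samples <- do.call(rbind, lapply(sizes, function(n) data.frame(n = n, x = rnorm(n))))

# Empirical density (solid) vs. the theoretical N(0, 1) density (dashed).
ggplot(samples, aes(x)) +
  geom_density() +
  stat_function(fun = dnorm, linetype = "dashed") +
  facet_wrap(~n)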

The normal distribution of a population

Many statistical tests assume a normal distribution, meaning that the data approximate a bell-shaped curve. In a normal distribution, about 68% of the data fall within ±1 standard deviation of the mean, about 95% within ±2 standard deviations, and about 99.7% within ±3.
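We can check these proportions with pnorm(), the cumulative distribution function of the normal distribution:

# Proportion of a normal distribution falling within 1, 2, and 3
# standard deviations of the mean (roughly 0.68, 0.95, and 0.997).
sapply(1:3, function(k) pnorm(k) - pnorm(-k))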

Student’s t distribution

From Wikipedia:

In probability and statistics, Student’s t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and the population’s standard deviation is unknown.
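A quick sketch (using dt(), dnorm(), and the ggplot2 attached above) shows how the t density compares with the normal density: it has heavier tails at small degrees of freedom and approaches the normal as the degrees of freedom grow.

# t densities (solid) at 2 and 30 degrees of freedom vs. the
# standard normal density (dashed).
ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dt, args = list(df = 2), color = "blue") +
  stat_function(fun = dt, args = list(df = 30), color = "lightgrey") +
  stat_function(fun = dnorm, linetype = "dashed")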

Hypothesis testing

Hypothesis testing is the process by which we reject or fail to reject statistical hypotheses. The two types of statistical hypotheses are:

  - the null hypothesis: there is no effect or no difference between groups; any pattern in the sample arose by chance
  - the alternative hypothesis: there is an effect or a difference between groups

Analyses always test the null hypothesis. This means, in science, that we never prove hypotheses; we can only disprove them.

To test a statistical hypothesis, you:

  1. State the null and alternative hypotheses, in such a way that they are mutually exclusive. The null hypothesis is usually the opposite of the result that you hope to find.
  2. Select the significance level. This value (for example, 0.01, 0.05, or 0.10) represents the probability of obtaining a significant result if the null hypothesis is actually true; in other words, the chance of rejecting the null hypothesis when you should have failed to reject it (a “false positive”).
  3. Determine which statistical analysis to conduct.
  4. Analyze the data by calculating the test statistic and determining the probability of observing that statistic. If that probability is larger than the significance level you selected, you fail to reject the null hypothesis; if it is smaller, you reject it (see the sketch after this list).
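As a minimal sketch of step 4, using the sleep data and a significance level of 0.05 (our choice here, not something fixed by the data):

alpha <- 0.05

# Two-sided test of the difference in extra sleep between the two groups
# (the same comparison carried out in detail below).
p <- t.test(extra ~ group, data = sleep)$p.value

if (p < alpha) "reject the null hypothesis" else "fail to reject the null hypothesis"

With these data p is about 0.079, so at the 0.05 level we fail to reject the null hypothesis, as the full t.test() output below confirms.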

The Student’s t test

Time to bring together all these concepts!

A t-test measures the difference in group means divided by the standard error of that difference. It then computes, based on Student’s t-distribution, how likely it would be to observe a difference at least that large if there were actually no difference between the groups.

It is a parametric test of difference, meaning that it assumes your data:

  - are independent
  - are approximately normally distributed
  - have a similar amount of variance within each group (an assumption that Welch’s t-test, R’s default, relaxes)

group1 <- sleep[sleep$group == 1,]$extra
group2 <- sleep[sleep$group == 2,]$extra
t.test(group1, group2)
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean of x mean of y 
##      0.75      2.33

t.test helpfully prints out:

  - the t statistic and its degrees of freedom
  - the p-value
  - the alternative hypothesis being tested
  - a 95 percent confidence interval for the difference in means
  - the two sample means
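To connect this output back to the definition above, we can recompute the t statistic by hand (a sketch using base R’s var() and length(); this mirrors the Welch formula, which uses each group’s own variance rather than a pooled one):

# t = (difference in sample means) / (standard error of that difference)
se <- sqrt(var(group1) / length(group1) + var(group2) / length(group2))
(mean(group1) - mean(group2)) / se  # matches the t = -1.8608 reported above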

We might have a narrower, one-sided alternative hypothesis, however: for example, that group 1’s mean is less than group 2’s (equivalently, that group 2’s mean is greater than group 1’s):

t.test(group1, group2, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = -1.8608, df = 17.776, p-value = 0.0397
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.1066185
## sample estimates:
## mean of x mean of y 
##      0.75      2.33

Traditionally, this would be interpreted as moderate evidence against the null hypothesis.

Paired versus unpaired

But wait! Let’s look again at sleep:

head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
library(tidyr)
sleep %>% pivot_wider(names_from = "group", values_from = "extra")
## # A tibble: 10 × 3
##    ID      `1`   `2`
##    <fct> <dbl> <dbl>
##  1 1       0.7   1.9
##  2 2      -1.6   0.8
##  3 3      -0.2   1.1
##  4 4      -1.2   0.1
##  5 5      -0.1  -0.1
##  6 6       3.4   4.4
##  7 7       3.7   5.5
##  8 8       0.8   1.6
##  9 9       0     4.6
## 10 10      2     3.4

sleep does not consist of two separate random samples drawn from a population; rather, it is 10 random subjects (the ID column; see the help page) measured twice, once for each drug. Visually:
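One way to draw this is to connect each subject’s two measurements (a sketch using the ggplot2 attached above; the original figure may have been produced differently):

# One line per subject (ID), linking their response under each drug.
ggplot(sleep, aes(group, extra, group = ID)) +
  geom_line(color = "lightgrey") +
  geom_point(color = "blue", size = 3)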

This means that we can use a paired t-test, which is more powerful than the unpaired t-test because pairing reduces inter-subject variability: each comparison is made within the same subject.

t.test(group1, group2, paired = TRUE)
## 
##  Paired t-test
## 
## data:  group1 and group2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4598858 -0.7001142
## sample estimates:
## mean of the differences 
##                   -1.58

This is strong evidence against the null hypothesis: we would observe a difference at least this large only about 0.28% of the time if there were actually no difference between the two drugs.
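Equivalently, a paired t-test is just a one-sample t-test on the per-subject differences, which offers a quick sanity check:

# Same t, degrees of freedom, and p-value as the paired test above.
t.test(group1 - group2)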

The End

Some of the structure and language in this file follows Microsoft’s Statistics Primer: A Brief Overview of Basic Statistical and Probability Principles.

The repository for this document is here.

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_1.2.0   ggplot2_3.3.5 tibble_3.1.6 
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.9        pillar_1.7.0     bslib_0.3.1      compiler_4.1.3  
##  [5] jquerylib_0.1.4  tools_4.1.3      digest_0.6.29    jsonlite_1.8.0  
##  [9] evaluate_0.15    lifecycle_1.0.1  gtable_0.3.0     pkgconfig_2.0.3 
## [13] rlang_1.0.2      cli_3.2.0        rstudioapi_0.13  yaml_2.3.5      
## [17] xfun_0.30        fastmap_1.1.0    withr_2.5.0      stringr_1.4.0   
## [21] dplyr_1.0.8      knitr_1.38       generics_0.1.2   sass_0.4.1      
## [25] vctrs_0.4.0      grid_4.1.3       tidyselect_1.1.2 glue_1.6.2      
## [29] R6_2.5.1         fansi_1.0.3      rmarkdown_2.13   farver_2.1.0    
## [33] purrr_0.3.4      magrittr_2.0.3   scales_1.1.1     ellipsis_0.3.2  
## [37] htmltools_0.5.2  colorspace_2.0-3 labeling_0.4.2   utf8_1.2.2      
## [41] stringi_1.7.6    munsell_0.5.0    crayon_1.5.1