Consider R’s sleep
dataset:
?sleep
Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients. A data frame with 20 observations on 3 variables.
tapply(sleep$extra, sleep$group, summary)
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.600 -0.175 0.350 0.750 1.700 3.700
##
## $`2`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.100 0.875 1.750 2.330 4.150 5.500
ggplot(sleep, aes(group, extra)) +
geom_boxplot(color = "lightgrey") +
geom_point(color = "blue", size = 3)
How do we know if the differences between these groups are statistically significant?
To answer this question, we need to understand a few foundational concepts, starting with variables.
A variable is an attribute of the data. Variables fall into two categories: numerical and categorical.
A closely related idea is that of random variables:
A random variable is a numerical description of the outcome of a statistical experiment. A random variable that may assume only a finite number or an infinite sequence of values is said to be discrete; one that may assume any value in some interval on the real number line is said to be continuous.
From Encyclopedia Britannica.
In the sleep dataset, extra is a numerical variable and group is a categorical variable. What about ID?
When we conduct a study, we define the population of
interest. This is the entire set of elements that possess the
characteristics of interest. For example, for sleep
, this
might be “all people”, “all old women” (if all subjects could be
described in this way), “all Maryland high school students who need
soporifics”, etc.
We generally cannot observe or measure all the ‘elements of interest’. By definition, populations are large and frequently inaccessible due to time, money, logistics, etc., and so we randomly sample from the population.
This is important, because the nomenclature, statistical formulas, notation, and scope of inference vary depending on whether you're analyzing a population or a sample.
We can compute several statistics from our sample(s) that are both informative and fundamental to statistical tests.
The mean of the sample is its average:
\[ \huge\bar{x} = \frac{\sum_{i = 1}^{n}{x_i}}{n} \]
To understand a variable’s distribution, we use measures of variability. Variance measures the dispersion of the data from the mean:
\[
\huge s^2 = \frac{\sum_{i = 1}^{n}{(x_i - \bar{x})^2}}{n - 1}
\]
That n - 1 in the denominator constitutes the degrees of freedom: the number of independent pieces of information that go into the estimate of a test statistic.
Finally, the standard deviation is the square root of the variance:
\[ \huge s = \sqrt{s^2} \]
By taking the square root, we return the measure to the original units of x. The standard deviation indicates how close the data are to the mean.
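As a quick sanity check (my own sketch, not part of the original text), these formulas can be verified against base R's mean(), var(), and sd() using the extra column:
x <- sleep$extra
n <- length(x)
sum(x) / n                       # mean by hand
mean(x)                          # built-in mean
sum((x - mean(x))^2) / (n - 1)   # variance by hand; note the n - 1
var(x)                           # built-in variance
sqrt(var(x))                     # standard deviation as the square root of the variance
sd(x)                            # built-in standard deviation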
As our sample size increases, the sample mean and variance approach those of the overall population, and the distribution of the sample mean approaches a normal distribution. The latter result is the central limit theorem, the core of much of modern statistics.
Watch how the distribution (density) of these random data changes as the sample size increases, getting closer and closer to the theoretical normal distribution (dashed line):
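The animated figure is not reproduced here, but a minimal sketch of the idea (my own code, assuming the figure compared the empirical density of a random normal sample to the theoretical curve) would be:
set.seed(1)
for (n in c(10, 100, 10000)) {
  plot(density(rnorm(n)), main = paste("n =", n))       # empirical density of n random normal values
  curve(dnorm, from = -4, to = 4, add = TRUE, lty = 2)  # theoretical normal distribution (dashed line)
}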
Many statistical tests assume a normal distribution, meaning that the data approximate a bell-shaped curve. In normal distributions, 68% of the data fall within ±1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3.
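These proportions are easy to check with pnorm (a quick verification, not from the original):
pnorm(1) - pnorm(-1)   # ~0.68: within 1 standard deviation
pnorm(2) - pnorm(-2)   # ~0.95: within 2 standard deviations
pnorm(3) - pnorm(-3)   # ~0.997: within 3 standard deviations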
From Wikipedia:
In probability and statistics, Student’s t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and the population’s standard deviation is unknown.
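One way to see what “small sample size” means here (my own sketch): for small degrees of freedom the t-distribution has heavier tails than the normal, and its quantiles converge to the normal ones as the sample size grows:
qt(0.975, df = c(2, 5, 10, 30, 100))   # 97.5% t quantiles shrink toward ...
qnorm(0.975)                           # ... the normal value of about 1.96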
Hypothesis testing is the process by which we reject or fail to reject statistical hypotheses. The two types of statistical hypotheses are the null hypothesis (there is no effect or difference) and the alternative hypothesis (there is an effect or difference).
Analyses always test the null hypothesis. This means, in science, that we never prove hypotheses; we only disprove them.
To test a statistical hypothesis, you state the null and alternative hypotheses, choose a significance level, compute a test statistic and its p-value from the sample, and then reject or fail to reject the null hypothesis.
Time to bring together all these concepts!
A t-test measures the difference in group means divided by the pooled standard error of the two group means. It then computes, based on the Student’s t-distribution, how likely it would be to observe that difference if there is actually no difference between the groups.
It is a parametric test of difference, meaning that it assumes your data are independent, are approximately normally distributed, and have similar variance within each group (quick checks are sketched below).
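An informal way to eyeball the last two assumptions in R (a sketch of my own; shapiro.test and the variance comparison are not part of the original analysis):
tapply(sleep$extra, sleep$group, shapiro.test)   # Shapiro-Wilk normality test per group
tapply(sleep$extra, sleep$group, var)            # compare the two group variances
With that in mind, we can run the test itself: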
group1 <- sleep[sleep$group == 1,]$extra
group2 <- sleep[sleep$group == 2,]$extra
t.test(group1, group2)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean of x mean of y
## 0.75 2.33
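To connect this output back to the description above, here is a by-hand computation of the t statistic (a sketch of my own, using Welch's unpooled standard error as t.test does by default); it should reproduce t = -1.8608:
m1 <- mean(group1); m2 <- mean(group2)
se <- sqrt(var(group1) / length(group1) + var(group2) / length(group2))   # standard error of the difference in means
(m1 - m2) / se                                                            # about -1.86, matching the output above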
t.test helpfully prints out the test statistic, the degrees of freedom, the p-value, the alternative hypothesis being tested, a 95 percent confidence interval for the difference in means, and the two sample means.
We might, however, have a narrower (one-sided) alternative hypothesis, for example that group 2’s mean is greater than that of group 1:
t.test(group1, group2, alternative = "less")
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -1.8608, df = 17.776, p-value = 0.0397
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.1066185
## sample estimates:
## mean of x mean of y
## 0.75 2.33
Traditionally, this would be interpreted as moderate evidence against the null hypothesis.
But wait! Let’s look again at sleep:
head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
library(tidyr)
sleep %>% pivot_wider(names_from = "group", values_from = "extra")
## # A tibble: 10 × 3
## ID `1` `2`
## <fct> <dbl> <dbl>
## 1 1 0.7 1.9
## 2 2 -1.6 0.8
## 3 3 -0.2 1.1
## 4 4 -1.2 0.1
## 5 5 -0.1 -0.1
## 6 6 3.4 4.4
## 7 7 3.7 5.5
## 8 8 0.8 1.6
## 9 9 0 4.6
## 10 10 2 3.4
sleep
does not consist of two separate random
samples drawn from a population; rather, it is 10 random
subjects (the ID
column; see the help page) measured twice,
once for each drug. Visually:
This means that we can use a paired t-test, which is more powerful than the unpaired t-test because pairing reduces inter-subject variability (comparisons are made within the same subject).
t.test(group1, group2, paired = TRUE)
##
## Paired t-test
##
## data: group1 and group2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58
This is strong evidence against the null hypothesis: if there were actually no difference between the two drugs, we would observe a difference at least this large only about 0.28% of the time.
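Equivalently (a quick check of my own, not from the original), the paired t-test is just a one-sample t-test on the within-subject differences:
t.test(group1 - group2)   # same t = -4.0621, df = 9, and p-value as the paired test above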
Some of the structure and language in this file follows Microsoft’s Statistics Primer: A Brief Overview of Basic Statistical and Probability Principles.
The repository for this document is here.
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_1.2.0 ggplot2_3.3.5 tibble_3.1.6
##
## loaded via a namespace (and not attached):
## [1] highr_0.9 pillar_1.7.0 bslib_0.3.1 compiler_4.1.3
## [5] jquerylib_0.1.4 tools_4.1.3 digest_0.6.29 jsonlite_1.8.0
## [9] evaluate_0.15 lifecycle_1.0.1 gtable_0.3.0 pkgconfig_2.0.3
## [13] rlang_1.0.2 cli_3.2.0 rstudioapi_0.13 yaml_2.3.5
## [17] xfun_0.30 fastmap_1.1.0 withr_2.5.0 stringr_1.4.0
## [21] dplyr_1.0.8 knitr_1.38 generics_0.1.2 sass_0.4.1
## [25] vctrs_0.4.0 grid_4.1.3 tidyselect_1.1.2 glue_1.6.2
## [29] R6_2.5.1 fansi_1.0.3 rmarkdown_2.13 farver_2.1.0
## [33] purrr_0.3.4 magrittr_2.0.3 scales_1.1.1 ellipsis_0.3.2
## [37] htmltools_0.5.2 colorspace_2.0-3 labeling_0.4.2 utf8_1.2.2
## [41] stringi_1.7.6 munsell_0.5.0 crayon_1.5.1