Now that we have introduced the idea of a random variable, a null distribution, and a p-value, we are ready to describe the mathematical theory that permits us to compute p-values in practice. We will also learn about confidence intervals and power calculations.
A first step in statistical inference is to understand what population you are interested in. In the mouse weight example, we have two populations: female mice on control diets and female mice on high fat diets, with weight being the outcome of interest. We consider this population to be fixed, and the randomness comes from the sampling. One reason we have been using this dataset as an example is because we happen to have the weights of all the mice of this type. We download this file to our working directory and read in to R:
dat <- read.csv("mice_pheno.csv")
We can then access the population values and determine, for example, how many we have. Here we compute the size of the control population:
library(dplyr)
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%
select(Bodyweight) %>% unlist
length(controlPopulation)
## [1] 225
We usually denote these values as \(x_1,\dots,x_m\). In this case, \(m\) is the number computed above. We can do the same for the high fat diet population:
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%
select(Bodyweight) %>% unlist
length(hfPopulation)
## [1] 200
and denote with \(y_1,\dots,y_n\).
We can then define summaries of interest for these populations, such as the mean and variance.
The mean:
\[\mu_X = \frac{1}{m}\sum_{i=1}^m x_i \mbox{ and } \mu_Y = \frac{1}{n} \sum_{i=1}^n y_i\]
The variance:
\[\sigma_X^2 = \frac{1}{m}\sum_{i=1}^m (x_i-\mu_X)^2 \mbox{ and } \sigma_Y^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\mu_Y)^2\]
with the standard deviation being the square root of the variance. We refer to such quantities that can be obtained from the population as population parameters. The question we started out asking can now be written mathematically: is \(\mu_Y - \mu_X = 0\) ?
Although in our illustration we have all the values and can check if this is true, in practice we do not. For example, in practice it would be prohibitively expensive to buy all the mice in a population. Here we learn how taking a sample permits us to answer our questions. This is the essence of statistical inference.
In the previous exercise, we obtained samples of 12 mice from each population. We represent data from samples with capital letters to indicate that they are random. This is common practice in statistics, although it is not always followed. So the samples are \(X_1,\dots,X_M\) and \(Y_1,\dots,Y_N\) and, in this case, \(N=M=12\). In contrast and as we saw above, when we list out the values of the population, which are set and not random, we use lower-case letters.
Since we want to know if \(\mu_Y - \mu_X\) is 0, we consider the sample version: \(\bar{Y}-\bar{X}\) with:
\[ \bar{X}=\frac{1}{M} \sum_{i=1}^M X_i \mbox{ and }\bar{Y}=\frac{1}{N} \sum_{i=1}^N Y_i. \]
Note that this difference of averages is also a random variable. Previously, we learned about the behavior of random variables with an exercise that involved repeatedly sampling from the original distribution. Of course, this is not an exercise that we can execute in practice. In this particular case it would involve buying 24 mice over and over again. Here we described the mathematical theory that mathematically relates \(\bar{X}\) to \(\mu_X\) and \(\bar{Y}\) to \(\mu_Y\), that will in turn help us understand the relationship between \(\bar{Y}-\bar{X}\) and \(\mu_Y - \mu_X\). Specifically, we will describe how the Central Limit Theorem permits us to use an approximation to answer this question, as well as motivate the widely used t-distribution.
Below we will discuss the Central Limit Theorem (CLT) and the t-distribution, both of which help us make important calculations related to probabilities. Both are frequently used in science to test statistical hypotheses. To use these, we have to make different assumptions from those for the CLT and the t-distribution. However, if the assumptions are true, then we are able to calculate the exact probabilities of events through the use of mathematical formula.
The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average \(\bar{Y}\) of a random sample follows a normal distribution centered at the population average \(\mu_Y\) and with standard deviation equal to the population standard deviation \(\sigma_Y\), divided by the square root of the sample size \(N\). We refer to the standard deviation of the distribution of a random variable as the random variable’s standard error.
Please note that if we subtract a constant from a random variable, the mean of the new random variable shifts by that constant. Mathematically, if \(X\) is a random variable with mean \(\mu\) and \(a\) is a constant, the mean of \(X - a\) is \(\mu-a\). A similarly intuitive result holds for multiplication and the standard deviation (SD). If \(X\) is a random variable with mean \(\mu\) and SD \(\sigma\), and \(a\) is a constant, then the mean and SD of \(aX\) are \(a \mu\) and \(\mid a \mid \sigma\) respectively. To see how intuitive this is, imagine that we subtract 10 grams from each of the mice weights. The average weight should also drop by that much. Similarly, if we change the units from grams to milligrams by multiplying by 1000, then the spread of the numbers becomes larger.
This implies that if we take many samples of size \(N\), then the quantity:
\[ \frac{\bar{Y} - \mu}{\sigma_Y/\sqrt{N}} \]
is approximated with a normal distribution centered at 0 and with standard deviation 1.
Now we are interested in the difference between two sample averages. Here again a mathematical result helps. If we have two random variables \(X\) and \(Y\) with means \(\mu_X\) and \(\mu_Y\) and variance \(\sigma_X\) and \(\sigma_Y\) respectively, then we have the following result: the mean of the sum \(Y + X\) is the sum of the means \(\mu_Y + \mu_X\). Using one of the facts we mentioned earlier, this implies that the mean of \(Y - X = Y + aX\) with \(a = -1\) , which implies that the mean of \(Y - X\) is \(\mu_Y - \mu_X\). This is intuitive. However, the next result is perhaps not as intuitive. If \(X\) and \(Y\) are independent of each other, as they are in our mouse example, then the variance (SD squared) of \(Y + X\) is the sum of the variances \(\sigma_Y^2 + \sigma_X^2\). This implies that variance of the difference \(Y - X\) is the variance of \(Y + aX\) with \(a = -1\) which is \(\sigma^2_Y + a^2 \sigma_X^2 = \sigma^2_Y + \sigma_X^2\). So the variance of the difference is also the sum of the variances. If this seems like a counterintuitive result, remember that if \(X\) and \(Y\) are independent of each other, the sign does not really matter. It can be considered random: if \(X\) is normal with certain variance, for example, so is \(-X\). Finally, another useful result is that the sum of normal variables is again normal.
All this math is very helpful for the purposes of our study because we have two sample averages and are interested in the difference. Because both are normal, the difference is normal as well, and the variance (the standard deviation squared) is the sum of the two variances. Under the null hypothesis that there is no difference between the population averages, the difference between the sample averages \(\bar{Y}-\bar{X}\), with \(\bar{X}\) and \(\bar{Y}\) the sample average for the two diets respectively, is approximated by a normal distribution centered at 0 (there is no difference) and with standard deviation \(\sqrt{\sigma_X^2 +\sigma_Y^2}/\sqrt{N}\).
This suggests that this ratio:
\[ \frac{\bar{Y}-\bar{X}}{\sqrt{\frac{\sigma_X^2}{M} + \frac{\sigma_Y^2}{N}}} \]
is approximated by a normal distribution centered at 0 and standard deviation 1. Using this approximation makes computing p-values simple because we know the proportion of the distribution under any value. For example, only 5% of these values are larger than 2 (in absolute value):
pnorm(-2) + (1 - pnorm(2))
## [1] 0.04550026
We don’t need to buy more mice, 12 and 12 suffice.
However, we can’t claim victory just yet because we don’t know the population standard deviations: \(\sigma_X\) and \(\sigma_Y\). These are unknown population parameters, but we can get around this by using the sample standard deviations, call them \(s_X\) and \(s_Y\). These are defined as:
\[ s_X^2 = \frac{1}{M-1} \sum_{i=1}^M (X_i - \bar{X})^2 \mbox{ and } s_Y^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar{Y})^2 \]
Note that we are dividing by \(M-1\) and \(N-1\), instead of by \(M\) and \(N\). There is a theoretical reason for doing this which we do not explain here. You can read about it here. But to get an intuition, think of the case when you just have 2 numbers. The average distance to the mean is basically 1/2 the difference between the two numbers. So you really just have information from one number. This is somewhat of a minor point. The main point is that \(s_X\) and \(s_Y\) serve as estimates of \(\sigma_X\) and \(\sigma_Y\)
So we can redefine our ratio as
\[ \sqrt{N} \frac{\bar{Y}-\bar{X}}{\sqrt{s_X^2 +s_Y^2}} \]
if \(M=N\) or in general,
\[ \frac{\bar{Y}-\bar{X}}{\sqrt{\frac{s_X^2}{M} + \frac{s_Y^2}{N}}} \]
The CLT tells us that when \(M\) and \(N\) are large, this random variable is normally distributed with mean 0 and SD 1. Thus we can compute p-values using the function pnorm.
The CLT relies on large samples, what we refer to as asymptotic results. When the CLT does not apply, there is another option that does not rely on asymptotic results. When the original population from which a random variable, say \(Y\), is sampled is normally distributed with mean 0, then we can calculate the distribution of:
\[ \sqrt{N} \frac{\bar{Y}}{s_Y} \]
This is the ratio of two random variables so it is not necessarily normal. The fact that the denominator can be small by chance increases the probability of observing large values. William Sealy Gosset, an employee of the Guinness brewing company, deciphered the distribution of this random variable and published a paper under the pseudonym “Student”. The distribution is therefore called Student’s t-distribution. Later we will learn more about how this result is used.
Here we will use the mice phenotype data as an example. We start by creating two vectors, one for the control population and one for the high-fat diet population:
library(dplyr)
dat <- read.csv("mice_pheno.csv") #We downloaded this file in a previous section
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%
select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%
select(Bodyweight) %>% unlist
It is important to keep in mind that what we are assuming to be normal here is the distribution of \(y_1,y_2,\dots,y_n\), not the random variable \(\bar{Y}\). Although we can’t do this in practice, in this illustrative example, we get to see this distribution for both controls and high fat diet mice:
library(rafalib)
mypar(1,2)
hist(hfPopulation)
hist(controlPopulation)
Histograms of all weights for both populations.
We can use qq-plots to confirm that the distributions are relatively close to being normally distributed. We will explore these plots in more depth in a later section, but the important thing to know is that it compares data (on the y-axis) against a theoretical distribution (on the x-axis). If the points fall on the identity line, then the data is close to the theoretical distribution.
mypar(1,2)
qqnorm(hfPopulation)
qqline(hfPopulation)
qqnorm(controlPopulation)
qqline(controlPopulation)
Quantile-quantile plots of all weights for both populations.
The larger the sample, the more forgiving the result is to the weakness of this approximation. In the next section, we will see that for this particular dataset the t-distribution works well even for sample sizes as small as 3.
Let’s use our data to see how well the central limit theorem approximates sample averages from our data. We will leverage our entire population dataset to compare the results we obtain by actually sampling from the distribution to what the CLT predicts.
dat <- read.csv("mice_pheno.csv") #file was previously downloaded
head(dat)
## Sex Diet Bodyweight
## 1 F hf 31.94
## 2 F hf 32.48
## 3 F hf 22.82
## 4 F hf 19.92
## 5 F hf 32.22
## 6 F hf 27.50
Start by selecting only female mice since males and females have different weights. We will select three mice from each population.
library(dplyr)
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%
select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%
select(Bodyweight) %>% unlist
We can compute the population parameters of interest using the mean function.
mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
## [1] 2.375517
We can compute the population standard deviations of, say, a vector \(x\) as well. However, we do not use the R function sd because this function actually does not compute the population standard deviation \(\sigma_x\). Instead, sd assumes the main argument is a random sample, say \(X\), and provides an estimate of \(\sigma_x\), defined by \(s_X\) above. As shown in the equations above the actual final answer differs because one divides by the sample size and the other by the sample size minus one. We can see that with R code:
x <- controlPopulation
N <- length(x)
populationvar <- mean((x-mean(x))^2)
identical(var(x), populationvar)
## [1] FALSE
identical(var(x)*(N-1)/N, populationvar)
## [1] TRUE
So to be mathematically correct, we do not use sd or var. Instead, we use the popvar and popsd function in rafalib:
library(rafalib)
sd_hf <- popsd(hfPopulation)
sd_control <- popsd(controlPopulation)
Remember that in practice we do not get to compute these population parameters. These are values we never see. In general, we want to estimate them from samples.
N <- 12
hf <- sample(hfPopulation, 12)
control <- sample(controlPopulation, 12)
As we described, the CLT tells us that for large \(N\), each of these is approximately normal with average population mean and standard error population variance divided by \(N\). We mentioned that a rule of thumb is that \(N\) should be 30 or more. However, that is just a rule of thumb since the preciseness of the approximation depends on the population distribution. Here we can actually check the approximation and we do that for various values of \(N\).
Now we use sapply and replicate instead of for loops, which makes for cleaner code (we do not have to pre-allocate a vector, R takes care of this for us):
Ns <- c(3,12,25,50)
B <- 10000 #number of simulations
res <- sapply(Ns,function(n) {
replicate(B,mean(sample(hfPopulation,n))-mean(sample(controlPopulation,n)))
})
Now we can use qq-plots to see how well CLT approximations works for these. If in fact the normal distribution is a good approximation, the points should fall on a straight line when compared to normal quantiles. The more it deviates, the worse the approximation. In the title, we also show the average and SD of the observed distribution, which demonstrates how the SD decreases with \(\sqrt{N}\) as predicted.
mypar(2,2)
for (i in seq(along=Ns)) {
titleavg <- signif(mean(res[,i]),3)
titlesd <- signif(popsd(res[,i]),3)
title <- paste0("N=",Ns[i]," Avg=",titleavg," SD=",titlesd)
qqnorm(res[,i],main=title)
qqline(res[,i],col=2)
}
Quantile versus quantile plot of simulated differences versus theoretical normal distribution for four different sample sizes.
Here we see a pretty good fit even for 3. Why is this? Because the population itself is relatively close to normally distributed, the averages are close to normal as well (the sum of normals is also a normal). In practice, we actually calculate a ratio: we divide by the estimated standard deviation. Here is where the sample size starts to matter more.
Ns <- c(3,12,25,50)
B <- 10000 #number of simulations
##function to compute a t-stat
computetstat <- function(n) {
y <- sample(hfPopulation,n)
x <- sample(controlPopulation,n)
(mean(y)-mean(x))/sqrt(var(y)/n+var(x)/n)
}
res <- sapply(Ns,function(n) {
replicate(B,computetstat(n))
})
mypar(2,2)
for (i in seq(along=Ns)) {
qqnorm(res[,i],main=Ns[i])
qqline(res[,i],col=2)
}
Quantile versus quantile plot of simulated ratios versus theoretical normal distribution for four different sample sizes.
So we see that for \(N=3\), the CLT does not provide a usable approximation. For \(N=12\), there is a slight deviation at the higher values, although the approximation appears useful. For 25 and 50, the approximation is spot on.
This simulation only proves that \(N=12\) is large enough in this case, not in general. As mentioned above, we will not be able to perform this simulation in most situations. We only use the simulation to illustrate the concepts behind the CLT and its limitations. In future sections, we will describe the approaches we actually use in practice.
We will now demonstrate how to obtain a p-value in practice. We begin by loading experimental data and walking you through the steps used to form a t-statistic and compute a p-value. We can perform this task with just a few lines of code as we will see later. However, to understand the concepts, we will construct a t-statistic from “scratch”.
We start by reading in the data. A first important step is to identify which rows are associated with treatment and control, and to compute the difference in mean.
library(dplyr)
dat <- read.csv("femaleMiceWeights.csv") #previously downloaded
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
diff <- mean(treatment) - mean(control)
print(diff)
## [1] 3.020833
We are asked to report a p-value. What do we do? We learned that diff, referred to as the observed effect size, is a random variable. We also learned that under the null hypothesis, the mean of the distribution of diff is 0. What about the standard error? We also learned that the standard error of this random variable is the population standard deviation divided by the square root of the sample size:
\[ SE(\bar{X}) = \sigma / \sqrt{N}\]
We use the sample standard deviation as an estimate of the population standard deviation. In R, we simply use the sd function and the SE is:
sd(control)/sqrt(length(control))
## [1] 0.8725323
This is the SE of the sample average, but we actually want the SE of diff. We saw how statistical theory tells us that the variance of the difference of two random variables is the sum of its variances, so we compute the variance and take the square root:
se <- sqrt(
var(treatment)/length(treatment) +
var(control)/length(control)
)
Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1.
tstat <- diff/se
This ratio is what we call the t-statistic. It’s the ratio of two random variables and thus a random variable. Once we know the distribution of this random variable, we can then easily compute a p-value.
As explained in the previous section, the CLT tells us that for large sample sizes, both sample averages mean(treatment) and mean(control) are normal. Statistical theory tells us that the difference of two normally distributed random variables is again normal, so CLT tells us that tstat is approximately normal with mean 0 (the null hypothesis) and SD 1 (we divided by its SE).
So now to calculate a p-value all we need to do is ask: how often does a normally distributed random variable exceed diff? R has a built-in function, pnorm, to answer this specific question. pnorm(a) returns the probability that a random variable following the standard normal distribution falls below a. To obtain the probability that it is larger than a, we simply use 1-pnorm(a). We want to know the probability of seeing something as extreme as diff: either smaller (more negative) than -abs(diff) or larger than abs(diff). We call these two regions “tails” and calculate their size:
righttail <- 1 - pnorm(abs(tstat))
lefttail <- pnorm(-abs(tstat))
pval <- lefttail + righttail
print(pval)
## [1] 0.0398622
In this case, the p-value is smaller than 0.05 and using the conventional cutoff of 0.05, we would call the difference statistically significant.
Now there is a problem. CLT works for large samples, but is 12 large enough? A rule of thumb for CLT is that 30 is a large enough sample size (but this is just a rule of thumb). The p-value we computed is only a valid approximation if the assumptions hold, which do not seem to be the case here. However, there is another option other than using CLT.
As described earlier, statistical theory offers another useful result. If the distribution of the population is normal, then we can work out the exact distribution of the t-statistic without the need for the CLT. This is a big “if” given that, with small samples, it is hard to check if the population is normal. But for something like weight, we suspect that the population distribution is likely well approximated by normal and that we can use this approximation. Furthermore, we can look at a qq-plot for the samples. This shows that the approximation is at least close:
library(rafalib)
mypar(1,2)
qqnorm(treatment)
qqline(treatment,col=2)
qqnorm(control)
qqline(control,col=2)
Quantile-quantile plots for sample against theoretical normal distribution.
If we use this approximation, then statistical theory tells us that the distribution of the random variable tstat follows a t-distribution. This is a much more complicated distribution than the normal. The t-distribution has a location parameter like the normal and another parameter called degrees of freedom. R has a nice function that actually computes everything for us.
t.test(treatment, control)
##
## Welch Two Sample t-test
##
## data: treatment and control
## t = 2.0552, df = 20.236, p-value = 0.053
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04296563 6.08463229
## sample estimates:
## mean of x mean of y
## 26.83417 23.81333
To see just the p-value, we can use the $ extractor:
result <- t.test(treatment,control)
result$p.value
## [1] 0.05299888
The p-value is slightly bigger now. This is to be expected because our CLT approximation considered the denominator of tstat practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is a random variable. The smaller the sample size, the more the denominator varies.
It may be confusing that one approximation gave us one p-value and another gave us another, because we expect there to be just one answer. However, this is not uncommon in data analysis. We used different assumptions, different approximations, and therefore we obtained different results.
Later, when we discuss power calculations, we will describe type I and type II errors. As a preview, we will point out that the test based on the CLT approximation is more likely to incorrectly reject the null hypothesis (a false positive), while the t-distribution is more likely to incorrectly accept the null hypothesis (false negative).
Now that we have gone over the concepts, we can show the relatively simple code that one would use to actually compute a t-test:
library(dplyr)
dat <- read.csv("mice_pheno.csv")
control <- filter(dat,Diet=="chow") %>% select(Bodyweight)
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight)
t.test(treatment,control)
##
## Welch Two Sample t-test
##
## data: treatment and control
## t = 7.1932, df = 735.02, p-value = 1.563e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.231533 3.906857
## sample estimates:
## mean of x mean of y
## 30.48201 27.41281
The arguments to t.test can be of type data.frame and thus we do not need to unlist them into numeric objects.