This chapter introduces the statistical concepts necessary to understand p-values and confidence intervals. These terms are ubiquitous in the life science literature. Let’s use this paper as an example.
Note that the abstract has this statement:
“Body weight was higher in mice fed the high-fat diet already after the first week, due to higher dietary intake in combination with lower metabolic efficiency.” To support this claim they provide the following in the results section:
“Already during the first week after introduction of high-fat diet, body weight increased significantly more in the high-fat diet-fed mice (\(+\) 1.6 \(\pm\) 0.1 g) than in the normal diet-fed mice (\(+\) 0.2 \(\pm\) 0.1 g; P < 0.001).” What does P < 0.001 mean? What do the \(\pm\) values represent? We will learn what these mean and how to compute them in R. The first step is to understand random variables. To do this, we will use data from a mouse database (provided by Karen Svenson via Gary Churchill and Dan Gatti and partially funded by P50 GM070683). We will import the data into R and explain random variables and null distributions using R programming.
If you already downloaded the femaleMiceWeights file into your working directory, you can read it into R with just one line:
dat <- read.csv("femaleMiceWeights.csv")
Remember that a quick way to read the data without downloading it is to use the URL:
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "femaleMiceWeights.csv"
url <- paste0(dir, filename)
dat <- read.csv(url)
We are interested in determining whether following a given diet makes mice heavier after several weeks. These data were produced by ordering 24 mice from The Jackson Laboratory and randomly assigning each mouse to either the chow or the high-fat (hf) diet. After several weeks, the scientists weighed each mouse and obtained these data (head shows just the first 6 rows):
head(dat)
## Diet Bodyweight
## 1 chow 21.51
## 2 chow 28.14
## 3 chow 24.04
## 4 chow 23.45
## 5 chow 23.68
## 6 chow 19.79
In RStudio, you can view the entire dataset with:
View(dat)
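The random assignment described above can be sketched in a few lines. This is a minimal illustration only, not the procedure the scientists actually used; the names `mouse_id`, `diet`, and `assignment`, the seed, and the equal split of 12 mice per group are assumptions made for the example.

```r
# Hypothetical illustration: randomly assign 24 mouse IDs to two diets.
# The 12/12 split and the seed are assumptions, not taken from the study.
set.seed(1)                                      # arbitrary seed for reproducibility
mouse_id <- 1:24
diet <- sample(rep(c("chow", "hf"), each = 12))  # shuffle the 24 diet labels
assignment <- data.frame(mouse_id, diet)
table(assignment$diet)                           # counts per diet group
```

Because the labels are shuffled rather than chosen by the experimenter, no characteristic of a mouse can systematically influence which diet it receives.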
So are the hf mice heavier? Mouse 24 at 20.73 grams is one of the lightest mice, while Mouse 21 at 34.02 grams is one of the heaviest. Both are on the hf diet. Just from looking at the data, we see there is variability. Claims such as the one above usually refer to the averages. So let’s look at the average of each group:
library(dplyr)
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
print( mean(treatment) )
## [1] 26.83417
print( mean(control) )
## [1] 23.81333
obsdiff <- mean(treatment) - mean(control)
print(obsdiff)
## [1] 3.020833
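It can help to express the observed difference relative to the control mean. The sketch below simply reuses the two means printed above; the names `mean_treatment`, `mean_control`, and `rel_diff` are introduced here for illustration.

```r
# Means copied from the printed output above
mean_treatment <- 26.83417
mean_control   <- 23.81333
obsdiff  <- mean_treatment - mean_control
rel_diff <- obsdiff / mean_control   # difference as a fraction of the control mean
round(100 * rel_diff, 1)             # expressed as a percentage
## [1] 12.7
```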
So the hf diet mice are about 13% heavier (a difference of 3.02 grams on a control average of 23.81 grams). Are we done? Why do we need p-values and confidence intervals? The reason is that these averages are random variables. They can take many values.
If we repeat the experiment, we obtain 24 new mice from The Jackson Laboratory and, after randomly assigning them to each diet, we get a different mean. Every time we repeat this experiment, we get a different value. We call this type of quantity a random variable.
Let’s explore random variables further. Imagine that we actually have the weight of all control female mice and can load them into R. In statistics, we refer to this as the population. These are all the control mice available from which we sampled 24. Note that in practice we do not have access to the population. We have a special dataset that we are using here to illustrate concepts.
The first step is to download the data from here into your working directory and then read it into R:
population <- read.csv("femaleControlsPopulation.csv")
##use unlist to turn it into a numeric vector
population <- unlist(population)
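To see why `unlist` is needed, consider a toy one-column data frame. This is a small sketch that reuses the first three weights shown earlier; `toy` and `weights` are hypothetical names used only for this example.

```r
# Toy one-column data frame built from the first three weights shown earlier
toy <- data.frame(Bodyweight = c(21.51, 28.14, 24.04))
class(toy)              # a data frame, which mean() and sample() do not treat as numbers
weights <- unlist(toy)  # flatten the single column into a numeric vector
class(weights)          # now a plain numeric vector
mean(weights)           # the average can be computed directly
```

`read.csv` always returns a data frame, even when the file has a single column, so the same conversion applies to the population file.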
Now let’s sample 12 mice three times and see how the average changes.
control <- sample(population,12)
mean(control)
## [1] 23.47667
control <- sample(population,12)
mean(control)
## [1] 25.12583
control <- sample(population,12)
mean(control)
## [1] 23.72333
Note how the average varies. We can continue to do this repeatedly and start learning something about the distribution of this random variable.
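Repeating the sampling many times can be sketched with `replicate`. To keep the example self-contained, the block below does not read femaleControlsPopulation.csv; it uses a simulated stand-in population whose size, mean, and standard deviation are assumptions, so the numbers it produces are illustrative only.

```r
# Hypothetical stand-in population: simulated weights, NOT the real data.
# The size (200), mean (24) and sd (3.5) are assumptions for illustration.
set.seed(1)
population_sim <- rnorm(200, mean = 24, sd = 3.5)

# Draw 12 mice without replacement and average them, 1,000 times
n_rep <- 1000
averages <- replicate(n_rep, mean(sample(population_sim, 12)))

summary(averages)  # range and center of the sample averages
sd(averages)       # how much the average varies from sample to sample
```

With the real data, replacing `population_sim` by `population` would give the corresponding collection of averages for the control mice.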