Introduction to Random Variables

Introduction

This chapter introduces the statistical concepts necessary to understand p-values and confidence intervals. These terms are ubiquitous in the life science literature. Let’s use this paper as an example.

Note that the abstract has this statement:

“Body weight was higher in mice fed the high-fat diet already after the first week, due to higher dietary intake in combination with lower metabolic efficiency.”

To support this claim they provide the following in the results section:

“Already during the first week after introduction of high-fat diet, body weight increased significantly more in the high-fat diet-fed mice (\(+\) 1.6 \(\pm\) 0.1 g) than in the normal diet-fed mice (\(+\) 0.2 \(\pm\) 0.1 g; P < 0.001).”

What does P < 0.001 mean? What do the \(\pm\) values represent? We will learn what these mean and how to compute them in R. The first step is to understand random variables. To do this, we will use data from a mouse database (provided by Karen Svenson via Gary Churchill and Dan Gatti and partially funded by P50 GM070683). We will import the data into R and explain random variables and null distributions using R programming.

If you already downloaded the femaleMiceWeights file into your working directory, you can read it into R with just one line:

dat <- read.csv("femaleMiceWeights.csv")

Remember that a quick way to read the data without downloading it is to use the URL:

dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "femaleMiceWeights.csv"
url <- paste0(dir, filename)
dat <- read.csv(url)

Our first look at data

We are interested in determining whether following a given diet makes mice heavier after several weeks. This data was produced by ordering 24 mice from The Jackson Lab and randomly assigning each mouse to either the chow or the high-fat (hf) diet. After several weeks, the scientists weighed each mouse and obtained this data (head just shows us the first 6 rows):

head(dat)

##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79

In RStudio, you can view the entire dataset with:

View(dat)
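The random assignment described above can be sketched in R. This is only an illustration of the idea, not the procedure the scientists actually used; the names n_mice and assignment, the mouse column, and the seed are hypothetical choices made for this example.

```r
# Illustrative sketch of randomly assigning 24 mice to two diets.
# ASSUMPTION: n_mice, assignment, the "mouse" column and the seed are
# hypothetical; the source does not say how the assignment was done.
set.seed(2022)  # arbitrary seed so the shuffle is reproducible

n_mice <- 24

# Build 12 "chow" and 12 "hf" labels, then shuffle their order
diet <- sample(rep(c("chow", "hf"), each = n_mice / 2))

assignment <- data.frame(mouse = seq_len(n_mice), Diet = diet)

# Each diet should end up with exactly 12 mice
table(assignment$Diet)
head(assignment)
```

Shuffling a fixed set of labels guarantees equal group sizes, which matches the 12 and 12 split in the dataset.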

So are the hf mice heavier? Mouse 24 at 20.73 grams is one of the lightest mice, while Mouse 21 at 34.02 grams is one of the heaviest. Both are on the hf diet. Just from looking at the data, we see there is variability. Claims such as the one above usually refer to the averages. So let’s look at the average of each group:

library(dplyr)
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
print( mean(treatment) )

## [1] 26.83417

print( mean(control) )

## [1] 23.81333

obsdiff <- mean(treatment) - mean(control)
print(obsdiff)

## [1] 3.020833

So the hf diet mice are about 13% heavier (3.02 grams on a control average of 23.81 grams). Are we done? Why do we need p-values and confidence intervals? The reason is that these averages are random variables. They can take many values.
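The relative difference can be checked from the two group means printed above. The snippet below types those means in by hand so that it runs without the data file; the variable names treatment_mean, control_mean, and percent_increase are introduced here only for illustration.

```r
# Group means copied from the output shown earlier in this chapter
treatment_mean <- 26.83417  # hf diet
control_mean   <- 23.81333  # chow diet

# Absolute difference in grams
obsdiff <- treatment_mean - control_mean

# Difference expressed as a percentage of the control mean
percent_increase <- 100 * obsdiff / control_mean
round(percent_increase, 1)
## [1] 12.7
```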

If we repeat the experiment, we obtain 24 new mice from The Jackson Laboratory and, after randomly assigning them to each diet, we get a different mean. Every time we repeat this experiment, we get a different value. We call this type of quantity a random variable.

Random Variables

Let’s explore random variables further. Imagine that we actually have the weights of all control female mice and can load them into R. In Statistics, we refer to this as the population. These are all the control mice available, from which our 12 control mice were sampled. Note that in practice we do not have access to the population. We have a special dataset that we are using here to illustrate concepts.

The first step is to download the femaleControlsPopulation.csv file into your working directory and then read it into R:

population <- read.csv("femaleControlsPopulation.csv")
##use unlist to turn it into a numeric vector
population <- unlist(population)
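To see what unlist does here, consider a small sketch. read.csv returns a data frame even when the file has a single column, while functions such as sample and mean are easiest to use on a plain numeric vector. The toy data frame below is built by hand from three weights shown earlier in this chapter; the names df and weights are hypothetical.

```r
# Toy one-column data frame (three chow weights from the head(dat) output)
df <- data.frame(Bodyweight = c(21.51, 28.14, 24.04))

is.data.frame(df)   # TRUE: still a data frame, not a vector

# unlist flattens the single column into a named numeric vector
weights <- unlist(df)

is.numeric(weights)  # TRUE
mean(weights)        # now works directly on the values

# Extracting the column by name gives the same values without names
identical(unname(weights), df$Bodyweight)
```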

Now let’s sample 12 mice three times and see how the average changes.

control <- sample(population,12)
mean(control)

## [1] 23.47667

control <- sample(population,12)
mean(control)

## [1] 25.12583

control <- sample(population,12)
mean(control)

## [1] 23.72333

Note how the average varies. We can continue to do this repeatedly and start learning something about the distribution of this random variable.
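Repeating the sampling many times can be sketched with replicate. To keep the example runnable without the data file, it uses a synthetic population drawn from a normal distribution. That is an assumption made purely for illustration: the size, mean, and standard deviation below are made-up values, not properties of the real mouse data. With the real data loaded, the population vector from above would take its place.

```r
set.seed(1)  # arbitrary seed for reproducibility

# ASSUMPTION: synthetic stand-in for the real population of control mice.
# The size (225), mean (24) and sd (3.5) are hypothetical values.
population <- rnorm(225, mean = 24, sd = 3.5)

# Draw 12 mice and record their average, 1000 times over
n_repeats <- 1000
sample_means <- replicate(n_repeats, mean(sample(population, 12)))

# The averages vary from sample to sample, but less than individual weights
summary(sample_means)
sd(sample_means)
sd(population)
```

Each entry of sample_means is one realization of the random variable "average weight of 12 sampled mice", so the vector as a whole gives a first picture of its distribution.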

Source

Tural Naghiyev

2022-11-09