The probability of an event X can be written as \(P(X)\). Probability is difficult to define precisely, but it can be thought of as a measure of uncertainty. There are two common interpretations of probability:

* the frequentist interpretation, in which a probability is the long-run relative frequency of an event over many repeated trials; and
* the Bayesian interpretation, in which a probability is a degree of belief that an event will occur.
Probabilities satisfy Kolmogorov’s properties:

* the probability of any event A is non-negative: \(P(A) \ge 0\);
* the probability of the entire sample space S is 1: \(P(S) = 1\);
* for mutually exclusive events A and B, \(P(A \cup B) = P(A) + P(B)\).
A conditional probability is the probability that an event occurs given that another event has occurred. The probability that A occurs given that B has occurred is written as \(P(A|B)\); it can be calculated as:

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]
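As a small worked example (made up for illustration, not from the original text), consider rolling a fair six-sided die, with A = “roll a 6” and B = “roll an even number”:

# P(B): three of the six faces are even
p_B <- 3/6
# P(A and B): only the face "6" is both a 6 and even
p_A_and_B <- 1/6
# P(A|B) = P(A and B) / P(B)
p_A_and_B / p_B  # 1/3: knowing the roll is even raises the probability of a 6 from 1/6 to 1/3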
A random variable is a variable that takes on random values.1 The probability that a random variable takes on a particular value depends on the probability distribution from which the random variable is drawn. The most commonly used distributions belong to the exponential family. Every probability distribution can be described by either of two functions:
* The probability mass function (PMF; for discrete distributions) or the probability density function (PDF; for continuous distributions). In R, these functions begin with d, such as dbinom() for the PMF of a binomial distribution or dnorm() for the PDF of a normal distribution.
* The cumulative distribution function (CDF). In R, these functions begin with p, such as pbinom() for a binomial distribution or pnorm() for a normal distribution.

The PMF/PDF provides the most familiar shape of a distribution. For example, the PDF of the standard normal distribution is the bell curve shown in the left panel below. On the right, I have plotted the CDF. The height of the CDF of the standard normal distribution shows that, for example, the probability that the random variable takes a value \(\le 0\) is 50%.
opar <- par(mfrow = c(1, 2), pty = "s", cex.axis = 0.8)
curve(dnorm(x), from = -3, to = 3, main = "Standard normal PDF")
curve(pnorm(x), from = -3, to = 3, main = "Standard normal CDF")
par(opar)
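The 50% figure can be confirmed numerically with the CDF function itself:

# The CDF of the standard normal distribution evaluated at 0 gives P(X <= 0)
pnorm(0)
## [1] 0.5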
The expected value of a random variable, \(E(Y)\), is the mean of its probability distribution. The mean of a probability distribution is its “center of mass”; i.e., the value at which it would balance.

* Other measures of centrality include the median (the middle value) and the mode (the most common value).
* The spread of a distribution is generally summarized as a distance from the center (for example, the variance is the average squared deviation from the mean).
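As a quick illustration (using a small made-up sample), these summaries can be computed in R as follows. Note that R has no built-in function for the statistical mode; mode() returns the storage mode of an object instead.

# A small made-up sample to illustrate measures of center and spread
y <- c(2, 3, 3, 4, 4, 4, 5, 9)
mean(y)   # estimate of the expected value (center of mass): 4.25
median(y) # middle value: 4
var(y)    # average squared deviation from the mean (n - 1 denominator): 4.5
sd(y)     # square root of the variance: about 2.12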
R also provides quantile functions (which begin with q)
that allow you to look up the quantiles of a distribution, such as
qbinom() for the binomial distribution and
qnorm() for the normal distribution. For example, if a
random variable is distributed according to the standard normal
distribution, it will be \(\le -1.645\) about
5% of the time, \(\le 0\) 50% of the
time, and \(\le 1.645\) about 95% of the
time:
qnorm(c(0.05, 0.5, 0.95))
## [1] -1.644854 0.000000 1.644854
Finally, R also provides facilities to draw random values from a
distribution using functions that begin with r, such as
rbinom() for the binomial distribution and
rnorm() for the normal distribution. For example, to sample
10 random values from a normal distribution with a mean of 100 and a
standard deviation of 15:
# I am also using the round() function to reduce the digits after the decimal place
round(rnorm(10, mean = 100, sd = 15), digits = 1)
## [1] 118.9 95.1 119.9 119.1 106.2 76.9 86.1 95.6 99.9 136.1
If you type help(distributions), you will get a list of
distributions that are available in base R. For more advanced features
related to distributions (e.g., mixture distributions or joint
probability distributions like the multivariate normal distribution),
you can try the packages distr and
mvtnorm.
A Bernoulli trial is a single experiment that results either in “success” or “failure” (with a particular probability of success). For example, if we toss a coin, we may define heads as “success” and tails as “failure” (or vice versa). The binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. For example, if you flip a fair coin ten times, the binomial distribution gives the probability that you will obtain 0 heads, 1 head, 2 heads … or 10 heads:
# Probability of obtaining 0-10 heads if you flip a fair coin 10 times:
probs <- dbinom(0:10, size = 10, prob = 0.5)
names(probs) <- 0:10
round(probs, digits = 3)
## 0 1 2 3 4 5 6 7 8 9 10
## 0.001 0.010 0.044 0.117 0.205 0.246 0.205 0.117 0.044 0.010 0.001
The binomial distribution is parameterized by two values: the probability of “success” (p) and the number of trials (n). The following PMFs show how the binomial distribution appears for ten trials under different probabilities of success:
opar <- par(mfrow = c(2, 2))
plot(0:10, dbinom(0:10, size = 10, prob = 0.2), main = "p=0.2", xlim = c(0, 10), ylim = c(0, 0.4), xlab = "", ylab = "")
plot(0:10, dbinom(0:10, size = 10, prob = 0.5), main = "p=0.5", xlim = c(0, 10), ylim = c(0, 0.4), xlab = "", ylab = "")
plot(0:10, dbinom(0:10, size = 10, prob = 0.7), main = "p=0.7", xlim = c(0, 10), ylim = c(0, 0.4), xlab = "", ylab = "")
plot(0:10, dbinom(0:10, size = 10, prob = 0.9), main = "p=0.9", xlim = c(0, 10), ylim = c(0, 0.4), xlab = "", ylab = "")
par(opar)
The binomial distribution forms the basis of the binomial test, which is useful for determining whether observed count data for two categories (male vs. female, infected vs. non-infected, etc.) are consistent with the frequencies expected under a theoretical proportion.
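As a sketch of how this looks in R (with made-up counts, not data from the original text), the binom.test() function in base R performs the test:

# Hypothetical example: 14 of 20 sampled individuals are female.
# Is this consistent with a theoretical 50:50 sex ratio?
binom.test(x = 14, n = 20, p = 0.5)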
The Poisson distribution describes the number of independent occurrences in a particular unit (of time, space, etc.). It is commonly used to model count data of rare events/objects in biology and other fields. For example, the Poisson distribution might be used to describe:

* the number of weevil larvae per bean;
* the number of occurrences of a rare species per quadrat;
* the number of chocolate chips per cookie;
* the number of telephone calls received by a call centre per hour.
A Poisson distribution is parameterized by a single value: \(\lambda\) (lambda). \(\lambda\) is the average rate of occurrence (e.g., on average, 0.4 weevil larvae per bean, 1 occurrence per quadrat, 10 chips per cookie, or 20 calls per hour). \(\lambda\) also determines the variance among beans, quadrats, cookies, or hours. Here are Poisson distributions for each of these \(\lambda\) values. Each panel represents a simulation in which I sampled 100,000 beans, 100,000 quadrats, 100,000 cookies, or 100,000 call-centre hours and counted the number of larvae, occurrences, chocolate chips, or received telephone calls, respectively.
opar <- par(mfrow = c(2, 2))
plot(table(rpois(100000, lambda = 0.4)), xlim = c(0, 40), main = "larvae (lambda = 0.4)", ylab = "")
plot(table(rpois(100000, lambda = 1)), xlim = c(0, 40), main = "occurrences (lambda = 1)", ylab = "")
plot(table(rpois(100000, lambda = 10)), xlim = c(0, 40), main = "chips (lambda = 10)", ylab = "")
plot(table(rpois(100000, lambda = 20)), xlim = c(0, 40), main = "calls (lambda = 20)", ylab = "")
par(opar)
Importantly, \(\lambda\) is equal to both the mean and the variance of the distribution: if data are exactly Poisson distributed, then \(D=\frac{\sigma^2}{\mu}=1\), where \(D\) is the coefficient of dispersion, \(\sigma^2\) is the variance and \(\mu\) is the mean.
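An informal check of this property (a simulation sketch using the \(\lambda = 10\) chips-per-cookie example above): for a large Poisson sample, the mean and variance should both be close to \(\lambda\), so D should be close to 1.

# Simulate counts from a Poisson distribution with lambda = 10
counts <- rpois(100000, lambda = 10)
mean(counts)               # close to 10
var(counts)                # also close to 10
var(counts) / mean(counts) # coefficient of dispersion D, close to 1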
The uniform distribution provides a “completely random” number: all values in the supported range are equally likely. The uniform distribution is often what we have in mind when we think of “randomness”. A uniform distribution is parameterized by two values: the minimum and maximum supported values. Here is the PDF of a uniform distribution with support over the range 0-1. All values between 0 and 1 are equally likely. All values outside this range are impossible.
curve(dunif(x, min = 0, max = 1), ylim = c(0, 1))
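A quick numerical check of these statements: dunif() returns the same density for every value inside the support and a density of 0 outside it.

# Density of Uniform(0, 1) at points outside and inside its support
dunif(c(-0.5, 0.25, 0.75, 1.5), min = 0, max = 1)
## [1] 0 1 1 0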
The normal distribution (also called the Gaussian distribution) is the familiar “bell curve”. The normal distribution is ubiquitous in biology because of the Central Limit Theorem. According to the Central Limit Theorem, the sum (or average) of a large number of i.i.d. (independent and identically distributed) variables is approximately normally distributed. This means, for example, that if human height is the result of many genes and/or environmental influences that each have a very small additive effect, we would expect height to be normally distributed.
Here are some examples of the Central Limit Theorem in action. In each of these figures, I have sampled 10,000 random values from a particular distribution and taken the average. I have repeated this random sampling/averaging process 100,000 times and plotted the results as a histogram. As you can see, regardless of the distribution from which the values are originally drawn, the average value is approximately normally distributed:
opar <- par(mfrow = c(2, 2))
hist(replicate(100000, mean(runif(10000))), breaks = "scott", xlab = "", main = "Uniform")
hist(replicate(100000, mean(rnorm(10000))), breaks = "scott", xlab = "", main = "Normal")
hist(replicate(100000, mean(rlnorm(10000))), breaks = "scott", xlab = "", main = "Log-normal")
hist(replicate(100000, mean(rexp(10000))), breaks = "scott", xlab = "", main = "Exponential")
par(opar)
A normal distribution is characterized by two parameters: \(\mu\) (mu; the mean) and \(\sigma\) (sigma; the standard deviation, s.d.). \(\mu\) controls the location of the distribution, while \(\sigma\) controls its dispersion (or spread). Here are three different normal distributions overlaid:
curve(dnorm(x, mean = 100, sd = 5), from = 50, to = 150, ylab = "Density")
curve(dnorm(x, mean = 100, sd = 15), col = "red", add = TRUE)
curve(dnorm(x, mean = 70, sd = 5), col = "blue", add = TRUE)
legend("topright", lty = 1, legend = c("mu=100, sd=5", "mu=100, sd=15", "mu=70, sd=5"),
col = c("black", "red", "blue"))
A normally distributed variable follows the 68-95-99.7 rule:

* approximately 68% of values fall within 1 s.d. of the mean;
* approximately 95% of values fall within 2 s.d. of the mean;
* approximately 99.7% of values fall within 3 s.d. of the mean.
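These percentages can be verified with pnorm():

# Area under the standard normal PDF within 1, 2, and 3 s.d. of the mean
round(pnorm(1:3) - pnorm(-(1:3)), digits = 3)
## [1] 0.683 0.954 0.997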
Many techniques in parametric statistics depend on normality in some way. So we often want to examine our data (or the residuals of an analysis) to see if they are normally distributed. There are two main ways in which we can do this graphically.
The first way is to plot our data as a histogram and overlay a normal distribution. To do this, you standardize your data and compare it with the z-distribution (also known as the standard normal distribution). The z-distribution is a special form of the normal distribution in which the mean is 0 and the s.d. is 1. (It represents the probability distribution of a random variable that is the difference between a sample statistic and the population parameter, relative to its population standard deviation.)
To standardize a sample, you center it by
subtracting its mean, and scale it by dividing by its
standard deviation. The scale() function in R will do both
of these steps for you. For example, imagine I have sampled 1000
individuals and measured their IQ (saved as the R object
IQs). Then, I could compare my sample with a normal
distribution as follows:
IQs.standardized <- scale(IQs) # By default, this function centers and scales the data
hist(IQs.standardized, breaks = "scott", right = FALSE, freq = FALSE, ylim = c(0, 0.5))
curve(dnorm(x), from = -4, to = 4, add = TRUE, col = "red")
The second way to graphically compare our data with a normal distribution is to use a normal QQ plot (“quantile-quantile plot”). In R’s normal QQ plots, the x-axis represents the quantiles expected from a normal distribution, and the y-axis represents our actual values. If our actual values are normally distributed, then they should fall exactly on the line of expectation.
qqnorm(IQs)
qqline(IQs)
When you have small sample sizes, your QQ plots will not usually be as clean as the one above. For example, all of the following QQ plots are for samples of normally distributed variables (sample size = 30). Although we might expect that they should all be exactly on the line, they are noisy:
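The figure itself is not reproduced here, but a sketch of the kind of code that generates it (my own illustration: four samples of 30 values drawn from a standard normal distribution, each shown as a normal QQ plot) is:

opar <- par(mfrow = c(2, 2), pty = "s")
for (i in 1:4) {
  x <- rnorm(30) # a new sample of 30 normally distributed values each time
  qqnorm(x, main = paste("Sample", i))
  qqline(x)
}
par(opar)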
The t-distribution represents the probability distribution of a random variable that is the difference between a sample statistic and the population parameter, relative to its sample standard deviation. In a nutshell, the t-distribution tells you how likely it is that a sample mean will be a certain number of sample standard errors away from the true population mean. The t-distribution is commonly used to estimate confidence intervals for the mean, and it is used for t-tests. The t-distribution takes a single parameter: the number of degrees of freedom (d.f.).2
To plot the probability density of the t-distribution for various d.f., we can use the following code. Notice that the t-distribution is always symmetrical around 0.
curve(dt(x, df = 1), from = -5, to = 5, ylim = c(0, 0.40), col = "red", ylab = "Density")
curve(dt(x, df = 2), add = TRUE, col = "orange")
curve(dt(x, df = 5), add = TRUE, col = "green")
curve(dt(x, df = 10), add = TRUE, col = "blue")
curve(dt(x, df = 30), add = TRUE, col = "purple")
curve(dt(x, df = Inf), add = TRUE) # default colour is black
legend("topright", title = "d.f.", lty = 1, legend = c("1", "2", "5", "10", "30", "Inf"),
col = c("red", "orange", "green", "blue", "purple", "black"))
We need different t-distributions for different degrees of freedom because the formula used to estimate the standard error of the mean is \(\frac{s}{\sqrt{n}}\), where \(s\) is the sample s.d. and \(n\) is the sample size. The sample s.d. is a downwardly biased estimator: it tends to underestimate the true population s.d., especially when the sample size is small.3 Therefore, when the sample size is small, the t-distribution has fat tails: if the sample s.d. underestimates the population s.d., then the sample mean may be many standard errors away from the population mean. As the sample size increases, the population s.d. is estimated more accurately, and probability becomes concentrated in the centre of the t-distribution. When there are infinite degrees of freedom, the population s.d. is estimated perfectly and the t-distribution is identical to the z-distribution.
For example, imagine that you randomly sample two units (\(n=2\)). The appropriate t-distribution has \(n-1=1\) degree of freedom. When d.f. = 1, the probability that a sample mean will be within 1 standard error of the true population mean is 50%:
# This formula represents the area under the t-distribution (with d.f. = 1)
# The first term represents the area under the curve from negative infinity to +1
# The second term represents the area under the curve from negative infinity to -1
# By subtracting, we obtain the area under the curve between -1 and 1
pt(1, df = 1) - pt(-1, df = 1)
## [1] 0.5
Now watch what happens when we increase the sample size: we become more and more sure that a sample mean will be located within 1 standard error of the population mean:
results <- pt(1, df = c(2, 5, 10, 30, Inf)) - pt(-1, df = c(2, 5, 10, 30, Inf))
names(results) <- c("n=3", "n=6", "n=11", "n=31", "n=Infinity")
round(results, digits = 3)
## n=3 n=6 n=11 n=31 n=Infinity
## 0.577 0.637 0.659 0.675 0.683
Among other purposes, the t-distribution is used as the null distribution of the t-test.
To calculate confidence intervals for the mean using the t-distribution, use the formula \(\bar{x} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}\). For example, if you sampled the five human heights below, the mean would be 183.2 cm (95% CI: 178.1-188.3 cm):
alpha <- 0.05 # alpha value for significance (1 - 0.95)
heights <- c(186, 188, 178, 180, 184) # the sampled heights (in cm)
n <- length(heights) # sample size
height.se <- sd(heights)/sqrt(n) # standard error = sd/sqrt(n)
mean(heights) # print the mean
mean(heights) - qt(alpha/2, df = n - 1, lower.tail = FALSE) * height.se # lower limit
mean(heights) + qt(alpha/2, df = n - 1, lower.tail = FALSE) * height.se # upper limit
## [1] 183.2
## [1] 178.0505
## [1] 188.3495
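As a cross-check (reusing the heights vector defined above), the t.test() function reports the same 95% confidence interval for the mean:

# The confidence interval from t.test() should match the limits calculated above
t.test(heights)$conf.int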
If you take n independent random values from a z-distribution, square them, and sum these squares, the result is a random variable that is theoretically distributed according to the \(\chi^2\)-distribution (chi-square distribution) with n degrees of freedom. (A graphical proof is left as an exercise for the reader.) Therefore, the \(\chi^2\)-distribution is appropriate for modelling sums of squares (or variances, after scaling). The \(\chi^2\)-distribution is bounded by 0 and its shape changes based on the degrees of freedom. When the \(\chi^2\)-distribution is applied to the sum of squares estimated from a sample, the degrees of freedom equals the sample size minus 1.
curve(dchisq(x, df = 1), from = 0, to = 20, ylim = c(0, 0.40), col = "red", ylab = "Density")
curve(dchisq(x, df = 2), add = TRUE, col = "orange")
curve(dchisq(x, df = 5), add = TRUE, col = "green")
curve(dchisq(x, df = 8), add = TRUE, col = "blue")
curve(dchisq(x, df = 10), add = TRUE, col = "purple")
legend("topright", title = "d.f.", lty = 1, legend = c("1", "2", "5", "8", "10"),
col = c("red", "orange", "green", "blue", "purple"))
The \(\chi^2\)-distribution is the null distribution used in the \(\chi^2\) test, and it also appears in other, sometimes surprising, places, such as the asymptotic null distribution of the likelihood ratio test.
To calculate confidence intervals for the population variance using the \(\chi^2\)-distribution, use \(\frac{(n-1)s^2}{{\chi^2}_{\alpha/2,n-1}}\) to calculate the lower limit and \(\frac{(n-1)s^2}{{\chi^2}_{1-(\alpha/2),n-1}}\) to calculate the upper limit. For example, for the five human heights discussed in the section on the t-distribution above, the variance is 17.2 cm\(^2\) (95% CI: 6.2-142.0 cm\(^2\)):
alpha <- 0.05 # alpha value for significance (1 - 0.95)
heights <- c(186, 188, 178, 180, 184) # the sampled heights (in cm)
n <- length(heights) # sample size
var(heights) # print the variance
(n - 1) * var(heights) / qchisq(alpha/2, df = n - 1, lower.tail = FALSE) # lower limit
(n - 1) * var(heights) / qchisq(1 - alpha/2, df = n - 1, lower.tail = FALSE) # upper limit
## [1] 17.2
## [1] 6.174121
## [1] 142.0259
Confidence intervals for the standard deviation are simply the square roots of the confidence intervals for the variance. Note that, unlike the confidence intervals for the mean, confidence intervals for the variance (and standard deviation) are not symmetrical.
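Continuing the example above (reusing heights, n, and alpha), the limits for the standard deviation are the square roots of the variance limits, roughly 2.5 and 11.9 cm:

# Square roots of the variance limits give the limits for the s.d. (in cm);
# the first value is the lower limit and the second is the upper limit
sqrt((n - 1) * var(heights) / qchisq(c(alpha/2, 1 - alpha/2), df = n - 1, lower.tail = FALSE))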
The ratio of two sample variances (independently taken from two identical normally distributed populations) is theoretically distributed according to the F-distribution. The F-distribution takes two parameters: the numerator and denominator degrees of freedom. The following figure shows how the F-distribution changes shape based on the degrees of freedom. For example, the green line in the figure below (\(F_{2,15}\)) represents the expected distribution of the ratio of the variance of a sample of 3 values to the variance of a sample of 16 values, both drawn from the same normally distributed population.
curve(df(x, df1 = 1, df2 = 1), from = 0, to = 5, ylim = c(0, 1), col = "red", ylab = "Density")
curve(df(x, df1 = 1, df2 = 10), col = "orange", add = TRUE)
curve(df(x, df1 = 2, df2 = 15), col = "green", add = TRUE)
curve(df(x, df1 = 15, df2 = 2), col = "blue", add = TRUE)
curve(df(x, df1 = 5, df2 = 25), col = "purple", add = TRUE)
legend("topright", title = "d.f.", lty = 1,
legend = c("1, 1", "1, 10", "2, 15", "15, 2", "5, 25"),
col = c("red", "orange", "green", "blue", "purple"))
Notice that the \(F_{2,15}\) distribution is not the same as the \(F_{15,2}\) distribution (i.e., the green and blue lines in the figure are not the same). Confusing the numerator and denominator degrees of freedom is a common error.
The F-distribution is used as a statistical null distribution in the analysis of variance (ANOVA). Because it is a ratio of variances (which are always positive), the F-distribution is bounded below by 0.
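As an informal demonstration (a simulation sketch of my own, not from the original text), the ratio of the variances of two samples drawn from the same normal population closely follows the corresponding F-distribution:

# Variance ratio for samples of size 3 and 16 from the same normal population;
# the ratio should follow an F-distribution with 2 and 15 degrees of freedom
ratios <- replicate(100000, var(rnorm(3)) / var(rnorm(16)))
hist(ratios, breaks = seq(0, ceiling(max(ratios)), by = 0.1), freq = FALSE,
     xlim = c(0, 6), main = "Ratio of sample variances vs. F(2, 15)", xlab = "Variance ratio")
curve(df(x, df1 = 2, df2 = 15), add = TRUE, col = "red", lwd = 2)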
Answer the following questions:
In Canada, the sex ratio at birth is approximately 1.06 males/female (i.e., ~51.4% male). All else being equal, what is the probability that of the next 100 births in Canada, exactly 50 newborns will be female? What is the probability that more than 60 of the next 100 births will be female? Assume that all births can be categorized as either male or female and that the sex of every birth is independent of all others (e.g., there are no identical twins.)
According to the most commonly used scale, IQ is approximately normally distributed with a mean of 100 and a standard deviation of 15.
Approximately what proportion of individuals have an IQ between 110 and 125?
What are the 1st and 99th percentiles for IQ?
You are conducting an agricultural study on cow milk yield. You have recorded the following milk yields (in gallons/day) of 15 cows: 6.0, 4.7, 5.6, 5.2, 5.3, 7.2, 3.0, 5.6, 4.2, 5.1, 5.1, 5.5, 5.6, 5.2, 5.1.
Does milk yield appear to be normally distributed? Explain.
Assume that milk yield is normally distributed. What are the mean and standard deviation (and their 95% confidence intervals)?
You are examining tick density on moose. You have counted the number of ticks on 30 infested moose and obtained: 70, 122, 104, 86, 214, 54, 58, 89, 57, 82, 84, 113, 85, 59, 93, 129, 28, 132, 88, 81, 63, 82, 56, 55, 95, 65, 58, 162, 103, 127. Does the number of ticks per moose appear to follow a Poisson distribution? (A statistical test is not necessary… Just give your general impression and your reasoning.)
What does the probability density function for a log-normal distribution look like? Plot one.
Demonstrate graphically that summing \(n\) independent standard normal deviates
results in a variable that is \(\chi^2\)-distributed with \(n\) degrees of freedom. (Hint: You will
need to use a combination of replicate(),
rnorm(), hist(), dchisq(), and
curve().)
Demonstrate graphically that if \(X=\frac{U_1/\nu_1}{U_2/\nu_2}\) where \(U_1\) and \(U_2\) are independent \(\chi^2\)-distributed variables (with \(\nu_1\) and \(\nu_2\) degrees of freedom respectively), then \(X\) is a random variable distributed according to the \(F_{\nu_1,\nu_2}\)-distribution.
For a more formal and less circular definition, see Blitzstein & Hwang, 2014.↩︎
Degrees of freedom are the number of observations that are “free to vary” when estimating a parameter. They represent the number of independent pieces of information used to calculate a statistic. The number of degrees of freedom is the sample size (n) minus the number of other parameters that need to be estimated from the same data (p) before you can calculate the statistic, i.e., \(d.f. = n-p\). For instance, the sample mean has n d.f.; the sample variance has n-1 d.f.; sample skew and sample kurtosis have n-2 d.f.↩︎
For more information, see: https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation↩︎