The probability of an event X can be written as \(P(X)\). Probability is difficult to define precisely, but it can be thought of as a measure of uncertainty. There are two common interpretations of probability:

* the frequentist interpretation, in which a probability is the long-run relative frequency of an event over many repeated trials; and
* the Bayesian interpretation, in which a probability is a degree of belief that an event will occur.
Probabilities satisfy Kolmogorov’s properties:

* the probability of any event A is non-negative: \(P(A) \ge 0\);
* the probability of the entire sample space S is 1: \(P(S) = 1\);
* for mutually exclusive events A and B, \(P(A \cup B) = P(A) + P(B)\).
A conditional probability is the probability that an event occurs given that another event has occurred. The probability that A occurs given that B has occurred is written as \(P(A|B)\); it can be calculated as:

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]
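As a small worked example (made up for illustration, not from the original text), consider rolling a fair six-sided die, with A = “roll a 6” and B = “roll an even number”:

# P(B): three of the six faces are even
p_B <- 3/6
# P(A and B): only the face "6" is both a 6 and even
p_A_and_B <- 1/6
# P(A|B) = P(A and B) / P(B)
p_A_and_B / p_B  # 1/3: knowing the roll is even raises the probability of a 6 from 1/6 to 1/3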
A random variable is a variable that takes on random values.1 The probability that a random variable takes on a particular value depends on the probability distribution from which the random variable is drawn. The most commonly used distributions belong to the exponential family. Every probability distribution can be described by either of two functions:
* The probability mass function (PMF; for discrete distributions) or the probability density function (PDF; for continuous distributions). In R, these functions begin with d, such as dbinom() for the PMF of a binomial distribution or dnorm() for the PDF of a normal distribution.
* The cumulative distribution function (CDF). In R, these functions begin with p, such as pbinom() for a binomial distribution or pnorm() for a normal distribution.

The PMF/PDF provides the most familiar shape of a distribution. For example, the PDF of the standard normal distribution is the bell curve shown in the left panel below. On the right, I have plotted the CDF. The height of the CDF of the standard normal distribution shows that, for example, the probability that the random variable takes a value \(\le 0\) is 50%.
opar <- par(mfrow = c(1, 2), pty = "s", cex.axis = 0.8)
curve(dnorm(x), from = -3, to = 3, main = "Standard normal PDF")
curve(pnorm(x), from = -3, to = 3, main = "Standard normal CDF")
par(opar)
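The 50% figure can be confirmed numerically with the CDF function itself:

# The CDF of the standard normal distribution evaluated at 0 gives P(X <= 0)
pnorm(0)
## [1] 0.5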
The expected value of a random variable, \(E(Y)\), is the mean of its probability distribution. The mean of a probability distribution is its “center of mass”; i.e., the value at which it would balance.

* Other measures of centrality include the median (the middle value) and the mode (the most common value).
* The spread of a distribution is generally summarized as a distance from the center (for example, the variance is the average squared deviation from the mean).
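As a quick illustration (using a small made-up sample), these summaries can be computed in R as follows. Note that R has no built-in function for the statistical mode; mode() returns the storage mode of an object instead.

# A small made-up sample to illustrate measures of center and spread
y <- c(2, 3, 3, 4, 4, 4, 5, 9)
mean(y)   # estimate of the expected value (center of mass): 4.25
median(y) # middle value: 4
var(y)    # average squared deviation from the mean (n - 1 denominator): 4.5
sd(y)     # square root of the variance: about 2.12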
R also provides quantile functions (which begin with q)
that allow you to look up the quantiles of a distribution, such as
qbinom() for the binomial distribution and
qnorm() for the normal distribution. For example, if a
random variable is distributed according to the standard normal
distribution, it will be \(\le -1.645\) about
5% of the time, \(\le 0\) 50% of the
time, and \(\le 1.645\) about 95% of the
time:
qnorm(c(0.05, 0.5, 0.95))
## [1] -1.644854 0.000000 1.644854
Finally, R also provides facilities to draw random values from a
distribution using functions that begin with r, such as
rbinom() for the binomial distribution and
rnorm() for the normal distribution. For example, to sample
10 random values from a normal distribution with a mean of 100 and a
standard deviation of 15:
# I am also using the round() function to reduce the digits after the decimal place
round(rnorm(10, mean = 100, sd = 15), digits = 1)
## [1] 118.9 95.1 119.9 119.1 106.2 76.9 86.1 95.6 99.9 136.1
If you type help(distributions), you will get a list of
distributions that are available in base R. For more advanced features
related to distributions (e.g., mixture distributions or joint
probability distributions like the multivariate normal distribution),
you can try the packages distr and
mvtnorm.
A Bernoulli trial is a single experiment that results either in “success” or “failure” (with a particular probability of success). For example, if we toss a coin, we may define heads as “success” and tails as “failure” (or vice versa). The binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. For example, if you flip a fair coin ten times, the binomial distribution gives the probability that you will obtain 0 heads, 1 head, 2 heads … or 10 heads:
# Probability of obtaining 0-10 heads if you flip a fair coin 10 times:
probs <- dbinom(0:10, size = 10, prob = 0.5)
names(probs) <- 0:10
round(probs, digits = 3)
## 0 1 2 3 4 5 6 7 8 9 10
## 0.001 0.010 0.044 0.117 0.205 0.246 0.205 0.117 0.044 0.010 0.001
The binomial distribution is parameterized by two values: the probability of “success” (p) and the number of trials (n). The following PMFs show how the binomial distribution appears for ten trials under different probabilities of success:
opar <- par(mfrow = c(2, 2))
plot(0:10, dbinom(0:10, size = 10, prob = 0.2), main = "p=0.2", xlim = c(0, 10), ylim = c(0, 0.4), xlab = "", ylab = "")
plot(0:10, dbinom(0:10, size = 10, prob = 0.5), main = "p=0.5", xlim = c(0, 10), ylim = c(0, 0.4), xlab = "", ylab = "")
plot(0:10, dbinom(0:10, size = 10, prob = 0.7), main = "p=0.7", xlim = c(0, 10), ylim = c(0, 0.4), xlab = "", ylab = "")
plot(0:10, dbinom(0:10, size = 10, prob = 0.9), main = "p=0.9", xlim = c(0, 10), ylim = c(0, 0.4), xlab = "", ylab = "")
par(opar)
The binomial distribution forms the basis of the binomial test, which is useful for determining whether observed count data for two categories (male vs. female, infected vs. non-infected, etc.) are consistent with the frequencies expected under a theoretical proportion.
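As a sketch of how this looks in R (with made-up counts, not data from the original text), the binom.test() function in base R performs the test:

# Hypothetical example: 14 of 20 sampled individuals are female.
# Is this consistent with a theoretical 50:50 sex ratio?
binom.test(x = 14, n = 20, p = 0.5)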
The Poisson distribution describes the number of independent occurrences in a particular unit (of time, space, etc.). It is commonly used to model count data of rare events/objects in biology and other fields. For example, the Poisson distribution might be used to describe:

* the number of weevil larvae per bean;
* the number of occurrences of a rare species per quadrat;
* the number of chocolate chips per cookie;
* the number of telephone calls received by a call centre per hour.
A Poisson distribution is parameterized by a single value: \(\lambda\) (lambda). \(\lambda\) is the average rate of occurrence (e.g., on average, 0.4 weevil larvae per bean, 1 occurrence per quadrat, 10 chips per cookie, or 20 calls per hour). \(\lambda\) also determines the variance among beans, quadrats, cookies, or hours. Here are Poisson distributions for each of these \(\lambda\) values. Each panel represents a simulation in which I sampled 100,000 beans, 100,000 quadrats, 100,000 cookies, or 100,000 call-centre hours and counted the number of larvae, occurrences, chocolate chips, or received telephone calls, respectively.
opar <- par(mfrow = c(2, 2))
plot(table(rpois(100000, lambda = 0.4)), xlim = c(0, 40), main = "larvae (lambda = 0.4)", ylab = "")
plot(table(rpois(100000, lambda = 1)), xlim = c(0, 40), main = "occurrences (lambda = 1)", ylab = "")
plot(table(rpois(100000, lambda = 10)), xlim = c(0, 40), main = "chips (lambda = 10)", ylab = "")
plot(table(rpois(100000, lambda = 20)), xlim = c(0, 40), main = "calls (lambda = 20)", ylab = "")
par(opar)
Importantly, \(\lambda\) is equal to both the mean and the variance of the distribution: if data are exactly Poisson distributed, then \(D=\frac{\sigma^2}{\mu}=1\), where \(D\) is the coefficient of dispersion, \(\sigma^2\) is the variance and \(\mu\) is the mean.
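An informal check of this property (a simulation sketch using the \(\lambda = 10\) chips-per-cookie example above): for a large Poisson sample, the mean and variance should both be close to \(\lambda\), so D should be close to 1.

# Simulate counts from a Poisson distribution with lambda = 10
counts <- rpois(100000, lambda = 10)
mean(counts)               # close to 10
var(counts)                # also close to 10
var(counts) / mean(counts) # coefficient of dispersion D, close to 1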
The uniform distribution provides a “completely random” number: all values in the supported range are equally likely. The uniform distribution is often what we have in mind when we think of “randomness”. A uniform distribution is parameterized by two values: the minimum and maximum supported values. Here is the PDF of a uniform distribution with support over the range 0-1. All values between 0 and 1 are equally likely. All values outside this range are impossible.
curve(dunif(x, min = 0, max = 1), ylim = c(0, 1))
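A quick numerical check of these statements: dunif() returns the same density for every value inside the support and a density of 0 outside it.

# Density of Uniform(0, 1) at points outside and inside its support
dunif(c(-0.5, 0.25, 0.75, 1.5), min = 0, max = 1)
## [1] 0 1 1 0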
The normal distribution (also called the Gaussian distribution) is the familiar “bell curve”. The normal distribution is ubiquitous in biology because of the Central Limit Theorem. According to the Central Limit Theorem, the sum (or average) of a large number of i.i.d. (independent and identically distributed) variables is approximately normally distributed. This means, for example, that if human height is the result of many genes and/or environmental influences that each have a very small additive effect, we would expect height to be normally distributed.
Here are some examples of the Central Limit Theorem in action. In each of these figures, I have sampled 10,000 random values from a particular distribution and taken the average. I have repeated this random sampling/averaging process 100,000 times and plotted the results as a histogram. As you can see, regardless of the distribution from which the values are originally drawn, the average value is approximately normally distributed:
opar <- par(mfrow = c(2, 2))
hist(replicate(100000, mean(runif(10000))), breaks = "scott", xlab = "", main = "Uniform")
hist(replicate(100000, mean(rnorm(10000))), breaks = "scott", xlab = "", main = "Normal")
hist(replicate(100000, mean(rlnorm(10000))), breaks = "scott", xlab = "", main = "Log-normal")
hist(replicate(100000, mean(rexp(10000))), breaks = "scott", xlab = "", main = "Exponential")
par(opar)
A normal distribution is characterized by two parameters: \(\mu\) (mu; the mean) and \(\sigma\) (sigma; the standard deviation, s.d.). \(\mu\) controls the location of the distribution, while \(\sigma\) controls its dispersion (or spread). Here are three different normal distributions overlaid:
curve(dnorm(x, mean = 100, sd = 5), from = 50, to = 150, ylab = "Density")
curve(dnorm(x, mean = 100, sd = 15), col = "red", add = TRUE)
curve(dnorm(x, mean = 70, sd = 5), col = "blue", add = TRUE)
legend("topright", lty = 1, legend = c("mu=100, sd=5", "mu=100, sd=15", "mu=70, sd=5"),
col = c("black", "red", "blue"))
A normally distributed variable follows the 68-95-99.7 rule:

* approximately 68% of values fall within 1 s.d. of the mean;
* approximately 95% of values fall within 2 s.d. of the mean;
* approximately 99.7% of values fall within 3 s.d. of the mean.
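These percentages can be verified with pnorm():

# Area under the standard normal PDF within 1, 2, and 3 s.d. of the mean
round(pnorm(1:3) - pnorm(-(1:3)), digits = 3)
## [1] 0.683 0.954 0.997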
Many techniques in parametric statistics depend on normality in some way. So we often want to examine our data (or the residuals of an analysis) to see if they are normally distributed. There are two main ways in which we can do this graphically.
The first way is to plot our data as a histogram and overlay a normal distribution. To do this, you standardize your data and compare it with the z-distribution (also known as the standard normal distribution). The z-distribution is a special form of the normal distribution in which the mean is 0 and the s.d. is 1. (It represents the probability distribution of a random variable that is the difference between a sample statistic and the population parameter, relative to its population standard deviation.)
To standardize a sample, you center it by
subtracting its mean, and scale it by dividing by its
standard deviation. The scale() function in R will do both
of these steps for you. For example, imagine I have sampled 1000
individuals and measured their IQ (saved as the R object
IQs). Then, I could compare my sample with a normal
distribution as follows:
IQs.standardized <- scale(IQs) # By default, this function centers and scales the data
hist(IQs.standardized, breaks = "scott", right = FALSE, freq = FALSE, ylim = c(0, 0.5))
curve(dnorm(x), from = -4, to = 4, add = TRUE, col = "red")
The second way to graphically compare our data with a normal distribution is to use a normal QQ plot (“quantile-quantile plot”). In R’s normal QQ plots, the x-axis represents the quantiles expected from a normal distribution, and the y-axis represents our actual values. If our actual values are normally distributed, then they should fall exactly on the line of expectation.
qqnorm(IQs)
qqline(IQs)
When you have small sample sizes, your QQ plots will not usually be as clean as the one above. For example, all of the following QQ plots are for samples of normally distributed variables (sample size = 30). Although we might expect that they should all be exactly on the line, they are noisy:
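The figure itself is not reproduced here, but a sketch of the kind of code that generates it (my own illustration: four samples of 30 values drawn from a standard normal distribution, each shown as a normal QQ plot) is:

opar <- par(mfrow = c(2, 2), pty = "s")
for (i in 1:4) {
  x <- rnorm(30) # a new sample of 30 normally distributed values each time
  qqnorm(x, main = paste("Sample", i))
  qqline(x)
}
par(opar)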
The t-distribution represents the probability distribution of a random variable that is the difference between a sample statistic and the population parameter, relative to its sample standard deviation. In a nutshell, the t-distribution tells you how likely it is that a sample mean will be a certain number of sample standard errors away from the true population mean. The t-distribution is commonly used to estimate confidence intervals for the mean, and it is used for t-tests. The t-distribution takes a single parameter: the number of degrees of freedom (d.f.).2
To plot the probability density of the t-distribution for various d.f., we can use the following code. Notice that the t-distribution is always symmetrical around 0.
curve(dt(x, df = 1), from = -5, to = 5, ylim = c(0, 0.40), col = "red", ylab = "Density")
curve(dt(x, df = 2), add = TRUE, col = "orange")
curve(dt(x, df = 5), add = TRUE, col = "green")
curve(dt(x, df = 10), add = TRUE, col = "blue")
curve(dt(x, df = 30), add = TRUE, col = "purple")
curve(dt(x, df = Inf), add = TRUE) # default colour is black
legend("topright", title = "d.f.", lty = 1, legend = c("1", "2", "5", "10", "30", "Inf"),
col = c("red", "orange", "green", "blue", "purple", "black"))
We need different t-distributions for different degrees of freedom because the formula used to estimate the standard error of the mean is \(\frac{s}{\sqrt{n}}\), where \(s\) is the sample s.d. and \(n\) is the sample size. The sample s.d. is a downwardly biased estimator: it tends to underestimate the true population s.d., especially when the sample size is small.3 Therefore, when the sample size is small, the t-distribution has fat tails: if the sample s.d. underestimates the population s.d., then the sample mean may be many standard errors away from the population mean. As the sample size increases, the population s.d. is estimated more accurately, and probability becomes concentrated in the centre of the t-distribution. When there are infinite degrees of freedom, the population s.d. is estimated perfectly and the t-distribution is identical to the z-distribution.
For example, imagine that you randomly sample two units (\(n=2\)). The appropriate t-distribution has \(n-1=1\) degree of freedom. When d.f. = 1, the probability that a sample mean will be within 1 standard error of the true population mean is 50%:
# This formula represents the area under the t-distribution (with d.f. = 1)
# The first term represents the area under the curve from negative infinity to +1
# The second term represents the area under the curve from negative infinity to -1
# By subtracting, we obtain the area under the curve between -1 and 1
pt(1, df = 1) - pt(-1, df = 1)
## [1] 0.5
Now watch what happens when we increase the sample size: we become more and more sure that a sample mean will be located within 1 standard error of the population mean:
results <- pt(1, df = c(2, 5, 10, 30, Inf)) - pt(-1, df = c(2, 5, 10, 30, Inf))
names(results) <- c("n=3", "n=6", "n=11", "n=31", "n=Infinity")
round(results, digits = 3)
## n=3 n=6 n=11 n=31 n=Infinity
## 0.577 0.637 0.659 0.675 0.683
Among other purposes, the t-distribution is used as the null distribution of the t-test.
To calculate confidence intervals for the mean using the t-distribution, use the formula \(\bar{x} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}\). For example, if you sampled the five human heights below, the mean would be 183.2 cm (95% CI: 178.1-188.3 cm):
alpha <- 0.05 # alpha value for significance (1 - 0.95)
heights <- c(186, 188, 178, 180, 184) # the sampled heights (in cm)
n <- length(heights) # sample size
height.se <- sd(heights)/sqrt(n) # standard error = sd/sqrt(n)
mean(heights) # print the mean
mean(heights) - qt(alpha/2, df = n - 1, lower.tail = FALSE) * height.se # lower limit
mean(heights) + qt(alpha/2, df = n - 1, lower.tail = FALSE) * height.se # upper limit
## [1] 183.2
## [1] 178.0505
## [1] 188.3495
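As a cross-check (reusing the heights vector defined above), the t.test() function reports the same 95% confidence interval for the mean:

# The confidence interval from t.test() should match the limits calculated above
t.test(heights)$conf.int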
If you take n independent random values from a z-distribution, square them, and sum these squares, the result is a random variable that is theoretically distributed according to the \(\chi^2\)-distribution (chi-square distribution) with n degrees of freedom. (A graphical proof is left as an exercise for the reader.) Therefore, the \(\chi^2\)-distribution is appropriate for modelling sums of squares (or variances, after scaling). The \(\chi^2\)-distribution is bounded by 0 and its shape changes based on the degrees of freedom. When the \(\chi^2\)-distribution is applied to the sum of squares estimated from a sample, the degrees of freedom equals the sample size minus 1.
curve(dchisq(x, df = 1), from = 0, to = 20, ylim = c(0, 0.40), col = "red", ylab = "Density")
curve(dchisq(x, df = 2), add = TRUE, col = "orange")
curve(dchisq(x, df = 5), add = TRUE, col = "green")
curve(dchisq(x, df = 8), add = TRUE, col = "blue")
curve(dchisq(x, df = 10), add = TRUE, col = "purple")
legend("topright", title = "d.f.", lty = 1, legend = c("1", "2", "5", "8", "10"),
col = c("red", "orange", "green", "blue", "purple"))
The \(\chi^2\)-distribution is the null distribution used in the \(\chi^2\) test, and it also appears in other, sometimes surprising, places, such as the asymptotic null distribution of the likelihood ratio test.
To calculate confidence intervals for the population variance using the \(\chi^2\)-distribution, use \(\frac{(n-1)s^2}{{\chi^2}_{\alpha/2,n-1}}\) to calculate the lower limit and \(\frac{(n-1)s^2}{{\chi^2}_{1-(\alpha/2),n-1}}\) to calculate the upper limit. For example, for the five human heights discussed in the section on the t-distribution above, the variance is 17.2 cm\(^2\) (95% CI: 6.2-142.0 cm\(^2\)):
alpha <- 0.05 # alpha value for significance (1 - 0.95)
heights <- c(186, 188, 178, 180, 184) # the sampled heights (in cm)
n <- length(heights) # sample size
var(heights) # print the variance
(n - 1) * var(heights) / qchisq(alpha/2, df = n - 1, lower.tail = FALSE) # lower limit
(n - 1) * var(heights) / qchisq(1 - alpha/2, df = n - 1, lower.tail = FALSE) # upper limit
## [1] 17.2
## [1] 6.174121
## [1] 142.0259
Confidence intervals for the standard deviation are simply the square roots of the confidence intervals for the variance. Note that, unlike the confidence intervals for the mean, confidence intervals for the variance (and standard deviation) are not symmetrical.
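Continuing the example above (reusing heights, n, and alpha), the limits for the standard deviation are the square roots of the variance limits, roughly 2.5 and 11.9 cm:

# Square roots of the variance limits give the limits for the s.d. (in cm);
# the first value is the lower limit and the second is the upper limit
sqrt((n - 1) * var(heights) / qchisq(c(alpha/2, 1 - alpha/2), df = n - 1, lower.tail = FALSE))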
The ratio of two sample variances (independently taken from two identical normally distributed populations) is theoretically distributed according to the F-distribution. The F-distribution takes two parameters: the numerator and denominator degrees of freedom. The following figure shows how the F-distribution changes shape based on the degrees of freedom. For example, the green line in the figure below (\(F_{2,15}\)) represents the expected distribution of the ratio of the variance of a sample of 3 values to the variance of a sample of 16 values, both drawn from the same normally distributed population.
curve(df(x, df1 = 1, df2 = 1), from = 0, to = 5, ylim = c(0, 1), col = "red", ylab = "Density")
curve(df(x, df1 = 1, df2 = 10), col = "orange", add = TRUE)
curve(df(x, df1 = 2, df2 = 15), col = "green", add = TRUE)
curve(df(x, df1 = 15, df2 = 2), col = "blue", add = TRUE)
curve(df(x, df1 = 5, df2 = 25), col = "purple", add = TRUE)
legend("topright", title = "d.f.", lty = 1,
legend = c("1, 1", "1, 10", "2, 15", "15, 2", "5, 25"),
col = c("red", "orange", "green", "blue", "purple"))
Notice that the \(F_{2,15}\) distribution is not the same as the \(F_{15,2}\) distribution (i.e., the green and blue lines in the figure are not the same). Confusing the numerator and denominator degrees of freedom is a common error.
The F-distribution is used as a statistical null distribution in the analysis of variance (ANOVA). Because it is a ratio of variances (which are always positive), the F-distribution is bounded below by 0.
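As an informal demonstration (a simulation sketch of my own, not from the original text), the ratio of the variances of two samples drawn from the same normal population closely follows the corresponding F-distribution:

# Variance ratio for samples of size 3 and 16 from the same normal population;
# the ratio should follow an F-distribution with 2 and 15 degrees of freedom
ratios <- replicate(100000, var(rnorm(3)) / var(rnorm(16)))
hist(ratios, breaks = seq(0, ceiling(max(ratios)), by = 0.1), freq = FALSE,
     xlim = c(0, 6), main = "Ratio of sample variances vs. F(2, 15)", xlab = "Variance ratio")
curve(df(x, df1 = 2, df2 = 15), add = TRUE, col = "red", lwd = 2)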
Answer the following questions:
In Canada, the sex ratio at birth is approximately 1.06 males/female (i.e., ~51.4% male). All else being equal, what is the probability that of the next 100 births in Canada, exactly 50 newborns will be female? What is the probability that more than 60 of the next 100 births will be female? Assume that all births can be categorized as either male or female and that the sex of every birth is independent of all others (e.g., there are no identical twins.)
According to the most commonly used scale, IQ is approximately normally distributed with a mean of 100 and a standard deviation of 15.
Approximately what proportion of individuals have an IQ between 110 and 125?
What are the 1st and 99th percentiles for IQ?
You are conducting an agricultural study on cow milk yield. You have recorded the following milk yields (in gallons/day) of 15 cows: 6.0, 4.7, 5.6, 5.2, 5.3, 7.2, 3.0, 5.6, 4.2, 5.1, 5.1, 5.5, 5.6, 5.2, 5.1.
Does milk yield appear to be normally distributed? Explain.
Assume that milk yield is normally distributed. What are the mean and standard deviation (and their 95% confidence intervals)?
You are examining tick density on moose. You have counted the number of ticks on 30 infested moose and obtained: 70, 122, 104, 86, 214, 54, 58, 89, 57, 82, 84, 113, 85, 59, 93, 129, 28, 132, 88, 81, 63, 82, 56, 55, 95, 65, 58, 162, 103, 127. Does the number of ticks per moose appear to follow a Poisson distribution? (A statistical test is not necessary… Just give your general impression and your reasoning.)
What does the probability density function for a log-normal distribution look like? Plot one.
Demonstrate graphically that summing \(n\) independent standard normal deviates
results in a variable that is \(\chi^2\)-distributed with \(n\) degrees of freedom. (Hint: You will
need to use a combination of replicate(),
rnorm(), hist(), dchisq(), and
curve().)
Demonstrate graphically that if \(X=\frac{U_1/\nu_1}{U_2/\nu_2}\) where \(U_1\) and \(U_2\) are independent \(\chi^2\)-distributed variables (with \(\nu_1\) and \(\nu_2\) degrees of freedom respectively), then \(X\) is a random variable distributed according to the \(F_{\nu_1,\nu_2}\)-distribution.
For a more formal and less circular definition, see Blitzstein & Hwang, 2014.↩︎
Degrees of freedom are the number of observations that are “free to vary” when estimating a parameter. They represent the number of independent pieces of information used to calculate a statistic. The number of degrees of freedom is the sample size (n) minus the number of other parameters that need to be estimated from the same data (p) before you can calculate the statistic, i.e., \(d.f. = n-p\). For instance, the sample mean has n d.f.; the sample variance has n-1 d.f.; sample skew and sample kurtosis have n-2 d.f.↩︎
For more information, see: https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation↩︎