Probability review

Big Data Summer Institute 2022

June 21, 2022

Soumik Purkayastha
[https://soumikp.github.io]




Learning objectives

At the end of this review, you will be able to

  1. Recall basic notions of probability
    • Coding skill: estimating probabilities through simulated repetitions of an experiment.
  2. Understand notions of conditional probability.
    • Coding skill: implementing control structures through use of if statements and simulating populations based on conditional probabilities.
  3. Link random variables to concepts of probability.
    • Coding skill: generating random samples from a given distribution and using Monte Carlo methods to estimate distribution parameters.
  4. Analyse basic properties of some common random variables.
    • Coding skill: exploring the relationships between prevalence, sensitivity, specificity and positive predictive value through a real-world example.

1A: Introduction

People often colloquially refer to probability…

Formalizing concepts and terminology around probability is essential for improved understanding.

1B: Random experiments

1C: Formalising our notion of probability (somewhat)

Probability is used to assign a level of uncertainty to the outcomes of phenomena that either happen randomly (e.g., rolling dice) or appear random because of a lack of understanding about exactly how the phenomenon occurs (e.g., patient response to medical treatment).

The probability of an outcome is the proportion of times the outcome would occur if the random phenomenon could be observed an infinite number of times.
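The long-run-proportion idea can be sketched with a short simulation. The fair six-sided die, the seed, and the variable names below are assumptions made for illustration; they are not part of the original slides.

```r
# Illustrative sketch: running proportion of 1s in repeated rolls of a fair die
set.seed(1)                       # arbitrary seed, used only for reproducibility
n.rolls = 100000
rolls = sample(1:6, size = n.rolls, replace = TRUE)
# proportion of rolls showing a 1, recomputed after every roll
running.prop = cumsum(rolls == 1)/(1:n.rolls)
# the running proportion settles near 1/6 as the number of rolls grows
running.prop[c(10, 100, 1000, n.rolls)]
1/6
```

An infinite number of repetitions cannot be observed, so the simulation only approximates the limiting proportion; more rolls give a closer approximation.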

1D: Disjoint events

Two events or outcomes are called disjoint or mutually exclusive if they cannot both happen at the same time.

If \(A\) and \(B\) represent two disjoint events, then the probability that either occurs is \[P(A \cup B) = P(A \text{ or } B) = P(A) + P(B).\]

A and B are disjoint; B and D are disjoint but A and D are not.

What is \(P(A \cup B)\)?

1E: Addition rule for disjoint outcomes and inclusion-exclusion principle.

If there are \(k\) disjoint events \(A_1, A_2, \ldots, A_k\), then the probability that at least one of these outcomes will occur is

\[P(A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_k) = \sum_{i=1}^k P(A_i).\]

What if the events are not disjoint, but ‘overlapping’?

Let \(A\):= drawing a diamond and \(B\):= drawing a face card. What is \(P(A \cup B)?\)

Have to account for double-counting!


Thus, for two events \(A\) and \(B\), the probability that either occurs is

\[P(A \cup B) = P(A) + P(B) - P(A \cap B),\] where the ‘cup’ \(\cup\) symbol denotes the union of two events and the ‘cap’ \(\cap\) symbol denotes the intersection of two events. This is the general addition rule.
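The card example can be checked by enumerating a deck. This is a minimal sketch that assumes a standard 52-card deck in which the face cards are the jack, queen and king; the data frame layout and variable names are illustrative.

```r
# Build a standard 52-card deck (an assumed representation for illustration)
deck = expand.grid(rank = c(2:10, "J", "Q", "K", "A"),
                   suit = c("clubs", "diamonds", "hearts", "spades"),
                   stringsAsFactors = FALSE)
A = deck$suit == "diamonds"            # event A: drawing a diamond
B = deck$rank %in% c("J", "Q", "K")    # event B: drawing a face card
p.A = mean(A)                          # 13/52
p.B = mean(B)                          # 12/52
p.A.and.B = mean(A & B)                # 3/52, the cards counted twice
# general addition rule versus a direct count of the union
p.A + p.B - p.A.and.B
mean(A | B)
```

Both expressions give 22/52: the three diamond face cards are counted once in \(P(A)\) and once in \(P(B)\), so they are subtracted once.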

1F: Complement of an event.

Let \(D = \{2, 3 \}\) represent the event that the outcome of a die roll is 2 or 3.

The complement of an event \(D\), denoted by \(D^C\) represents all outcomes of the experiment that are NOT in \(D\).

\(D\) and \(D^C\) are related in the following probabilistic sense:

\(P(D) + P(D^C) = 1\).
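For the die example, the complement rule can be verified directly. The sketch below assumes a fair six-sided die, so that every outcome has probability 1/6; the variable names are illustrative.

```r
die = 1:6                    # outcomes of a six-sided die (fairness is assumed)
D = c(2, 3)                  # event D: the roll is 2 or 3
D.comp = setdiff(die, D)     # complement of D: outcomes 1, 4, 5, 6
p.D = length(D)/length(die)
p.D.comp = length(D.comp)/length(die)
# the two probabilities add to 1
p.D + p.D.comp
```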

1H: Independence

If \(A\) and \(B\) represent events from two different and independent processes, then the probability that both \(A\) and \(B\) occur is given by

\[P(A \cap B) = P(A) \times P(B).\]

A blue die and a green die are rolled separately; what is the probability of rolling two 1’s?
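A sketch of the two-dice question, assuming both dice are fair: the multiplication rule gives the exact answer, and a simulation (with an arbitrary seed and number of rolls) gives an estimate to compare against.

```r
set.seed(2)                   # arbitrary seed for reproducibility
n.rolls = 100000
blue = sample(1:6, size = n.rolls, replace = TRUE)
green = sample(1:6, size = n.rolls, replace = TRUE)
# exact answer from the multiplication rule for independent events
p.exact = (1/6)*(1/6)
# simulated proportion of rolls in which both dice show 1
p.sim = mean(blue == 1 & green == 1)
p.exact
p.sim
```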

Lab 1: Basic concepts of probability

Notes for review may be found here.

Question: Suppose that a biased coin is tossed 5 times; the coin is weighted such that the probability of obtaining a heads is 0.6. What is the probability of obtaining exactly 3 heads?

Exercise 1 of 2: sample()

Click here for more details

Before writing the code for a simulation to estimate probability, it is helpful to clearly define the experiment and event of interest. In this case, the experiment is tossing a biased coin 5 times and the event of interest is observing exactly 3 heads.

The following code illustrates the use of the sample() command to simulate the result for one set of 5 coin tosses.

#define parameters
prob.heads =  ## FILL THIS IN!! 
number.tosses =   ## FILL THIS IN!! 
#simulate the coin tosses
outcomes = sample(c(0, 1), 
                  size = number.tosses,
                  prob = c(1 - prob.heads, prob.heads), replace = TRUE)
#view the results
table(outcomes)
#store the results as a single number
total.heads = sum(outcomes)
total.heads
  1. Using the information given about the experiment, set the parameters for prob.heads and number.tosses and run the code chunk.

    Click here for answer

    prob.heads = 0.6

    number.tosses = 5

  2. To generate outcomes, the sample() command draws from the values 0 and 1 with probabilities corresponding to those specified by the argument prob. Which number corresponds to heads, and which corresponds to tails?

    Click here for answer

    The number 0 corresponds to tails and 1 corresponds to heads. This information comes from the structure of the vectors used in the sample() command; the elements in c(0, 1) are sampled with probabilities c(1 - prob.heads, prob.heads), respectively.

  3. Why is it important to sample with replacement?

    Click here for answer

    There are only two numbers being sampled from: 0 and 1. From a coding perspective, sampling without replacement would only allow for two coin tosses, since the vector contains only two elements. From a statistical perspective, sampling with replacement allows any coin to have the outcome 0 (tails) or 1 (heads), even if another coin was observed to be heads or tails; i.e., sampling with replacement allows the coin tosses to be independent.

  4. What is the advantage of representing the outcomes with the values 0 and 1, rather than with letters like “T” and “H”?

    Click here for answer

    Representing the outcomes with 0 and 1 allows for the use of sum() to return the number of heads. Note that sum() can also be used if the outcomes are represented with letters, but the syntax is slightly more complex; this method is shown in the Notes section below.

  5. Run the code chunk again to simulate another set of 5 coin tosses. Is it reasonable to expect that the results might differ from the first set of tosses? Explain your answer.

    Click here for answer

    Yes, it is reasonable to expect that the number of heads in a set of 5 coin tosses will not always be the same, even as the probability of a single heads remains constant. The inherent randomness makes it possible to observe anywhere between 0 and 5 heads (inclusive) in any set of 5 tosses.


Exercise 2 of 2: for

Click here for more details

The following code uses a for loop to repeat (i.e., replicate) the experiment and record the results of each replicate. The term k is an index, used to keep track of each iteration of the loop; think of it as similar to the index of summation \(k\) (or \(i\)) in sigma notation (\(\sum_{k = 1}^n\)). The value number.replicates is set to 50, specifying that the experiment is to be repeated 50 times. The command set.seed() is used to draw a reproducible random sample; i.e., re-running the chunk will produce the same set of outcomes.

#define parameters
prob.heads = 0.6
number.tosses = 5
number.replicates = 50
#create empty vector to store outcomes
outcomes = vector("numeric", number.replicates)
#set the seed for a pseudo-random sample
set.seed(20220621)
#simulate the coin tosses
for(k in 1:number.replicates){
  
  outcomes.replicate = sample(c(0, 1), size = number.tosses,
                      prob = c(1 - prob.heads, prob.heads), replace = TRUE)
  
  outcomes[k] = sum(outcomes.replicate)
}
  1. The parameters of the experiment have already been filled in; the probability of heads remains 0.6 and the number of tosses is set to 5. This code repeats the experiment 50 times, as specified by number.replicates. Run the code chunk.
    Click here for answer
     #view the results
     outcomes
     ##  [1] 2 3 2 2 4 4 3 3 1 4 2 4 5 5 3 0 3 4 4 3 3 4 4 4 4 3 3 3 4 2 4 1 2 4 3 1 4 2
     ## [39] 3 3 0 3 2 4 5 3 3 3 3 2
     addmargins(table(outcomes))
     ## outcomes
     ##   0   1   2   3   4   5 Sum 
     ##   2   3   9  18  15   3  50
     heads.3 = (outcomes == 3)
     table(heads.3)
     ## heads.3
     ## FALSE  TRUE 
     ##    32    18
  2. How many heads were observed in the fourth replicate of the experiment? Hint: look at outcomes
    Click here for answer
     outcomes[4]
     ## [1] 2
    In the fourth replicate of the experiment, 2 heads were observed.
  3. Out of the 50 replicates, how often were exactly 3 heads observed in a single experiment?
    Click here for answer
     sum(outcomes == 3)
     ## [1] 18
    Out of the 50 replicates, exactly 3 heads were observed in 18 of the replicates.
  4. From the tabled results of the simulation, calculate an estimate of the probability of observing exactly 3 heads when the biased coin is tossed 5 times.
    Click here for answer
     sum(outcomes == 3)/length(outcomes)
     ## [1] 0.36
    The estimate of the probability is 18/50 = 0.36.
  5. Re-run the simulation with 10,000 replicates of the experiment; calculate a new estimate of the probability of observing exactly 3 heads when the biased coin is tossed 5 times.
    Click here for answer
    sum(outcomes == 3)/length(outcomes)
    ## [1] 0.3443
    The (more reliable) estimate of the probability is 0.3443.
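The simulated estimate can be compared with the exact probability. The sketch below is one possible variant, not the lab's own solution: it swaps the for loop for replicate(), and it uses dbinom(), which is only introduced later in section 3D.

```r
prob.heads = 0.6
number.tosses = 5
number.replicates = 10000
set.seed(20220621)
# replicate() repeats the expression and collects the number of heads each time
outcomes = replicate(number.replicates,
                     sum(sample(c(0, 1), size = number.tosses,
                                prob = c(1 - prob.heads, prob.heads),
                                replace = TRUE)))
# simulation-based estimate of P(exactly 3 heads)
mean(outcomes == 3)
# exact probability: choose(5, 3) * 0.6^3 * 0.4^2 = 0.3456
dbinom(3, size = number.tosses, prob = prob.heads)
```

The simulated value should fall close to 0.3456, and the gap typically shrinks as number.replicates grows.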

2: Conditional probability and independence

Lab 2A: Bayes’ theorem through an example: Motivating problem

Breast cancer screening

Lab 2B: Bayes’ theorem through an example: Formulating the problem

Lab 2C: Bayes’ theorem through an example: Defining events in diagnostic testing

  • Events of interest include

    -\(A\) = {disease present}

    -\(A^C\) = {disease absent}

    -\(B\) = {positive test result}

    -\(B^C\) = {negative test result}.

  • Based on the quantities above, we derive characteristics of a diagnostic test.

    -Sensitivity = \(P(B|A)\)

    -Specificity = \(P(B^{C} | A^C)\)

Lab 2D: Bayes’ theorem through an example: Calculating PPV using Bayes’ theorem
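With the events defined in Lab 2C, the positive predictive value is \(P(A \mid B)\), the probability that the disease is present given a positive test result. A sketch of the calculation, assuming prevalence means \(P(A)\) and expanding the denominator with the law of total probability:

\[\begin{align*} \text{PPV} = P(A \mid B) &= \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^C)P(A^C)} \\ &= \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})} \end{align*}\]

The second line uses \(P(B \mid A^C) = 1 - P(B^C \mid A^C) = 1 - \text{specificity}\) and \(P(A^C) = 1 - \text{prevalence}\).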

Exercise 1 of 1

Using R as a calculator: one advantage R has over a standard hand calculator is the ability to easily store values as named variables. This can make numerical computations like the calculation of positive predictive value much less error-prone. Using a short script like the following to do the computation, rather than directly entering numbers into a calculator, is more efficient. We focus on the relationships between prevalence, sensitivity, specificity, positive predictive value, and negative predictive value.

#define variables
prevalence = 0.10
sensitivity = 0.98
specificity = 0.95
#calculate ppv
ppv.num = prevalence*sensitivity
ppv.den = ppv.num + (1 - specificity)*(1 - prevalence)
ppv = ppv.num/ppv.den
ppv

Question: The script above can also be re-run with different starting values. How do we obtain \(PPV\) for starting prevalence values \(P(A) \in \{0.01, 0.05, 0.1, 0.25\}\), which correspond to 1%, 5%, 10% and 25% prevalences of the disease (breast cancer in our example)? We want to graphically compare how \(PPV\) changes as we vary \(P(A)\).

Click here for answer
#define variables
prevalence = c(0.01, 0.05, 0.1, 0.25)
sensitivity = rep(0.98, 4)
specificity = rep(0.95, 4)
#calculate ppv
ppv.num = prevalence*sensitivity
ppv.den = ppv.num + (1 - specificity)*(1 - prevalence)
ppv = ppv.num/ppv.den
plot(x = prevalence, y = ppv, type = "o", pch = 20, xlab = "Prevalence P(A)", ylab = "Positive predictive value (PPV)")
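The Bayes' theorem calculation can also be checked by simulating a population with if statements, in the spirit of the coding skill listed under learning objective 2. This is a sketch under stated assumptions: the population size, seed and variable names are illustrative, and the negative predictive value is taken to be \(P(A^C \mid B^C)\), its usual definition, which the slides mention but do not compute.

```r
prevalence = 0.10
sensitivity = 0.98
specificity = 0.95
population.size = 50000          # assumed size, chosen for illustration
set.seed(3)                      # arbitrary seed for reproducibility
# disease status: 1 = disease present, 0 = disease absent
disease = sample(c(0, 1), size = population.size,
                 prob = c(1 - prevalence, prevalence), replace = TRUE)
# test result: 1 = positive, 0 = negative
test = vector("numeric", population.size)
for(k in 1:population.size){
  if(disease[k] == 1){
    # P(positive | disease present) = sensitivity
    test[k] = sample(c(0, 1), size = 1, prob = c(1 - sensitivity, sensitivity))
  } else {
    # P(positive | disease absent) = 1 - specificity
    test[k] = sample(c(0, 1), size = 1, prob = c(specificity, 1 - specificity))
  }
}
# simulated PPV: proportion with disease among those testing positive
ppv.sim = sum(disease == 1 & test == 1)/sum(test == 1)
# simulated NPV: proportion without disease among those testing negative
npv.sim = sum(disease == 0 & test == 0)/sum(test == 0)
# exact values from Bayes' theorem
ppv.exact = prevalence*sensitivity/
  (prevalence*sensitivity + (1 - specificity)*(1 - prevalence))
npv.exact = specificity*(1 - prevalence)/
  (specificity*(1 - prevalence) + (1 - sensitivity)*prevalence)
c(ppv.sim, ppv.exact)
c(npv.sim, npv.exact)
```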

3A: Distributions of random variables: Introduction

A random variable is a function that maps each outcome in a sample space to a number.

Suppose \(X\) is the number of heads in 3 tosses of a fair coin.

3B: Distribution of a discrete random variable

The distribution of a discrete random variable is the collection of its values and the probabilities associated with those values.

To each value \(x\) that the random variable \(X\) can take, we assign a probability \(P(X = x)\).

For example, consider the following probability distribution:

\(x_i\) 0 1 2 3
\(P(X = x_i)\) 1/8 3/8 3/8 1/8

We note the following

  1. \(0 \leq P(X = x_i) \leq 1\) for \(x_i \in \{0, 1, 2, 3\}\)
  2. \(\sum_{x_i \in \{0, 1, 2, 3\}} P(X = x_i) = 1\).
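The table above matches the distribution of \(X\), the number of heads in 3 tosses of a fair coin. A minimal sketch that derives it by listing all equally likely sequences (the column and variable names are illustrative):

```r
# all 2^3 = 8 equally likely sequences of three fair coin tosses (1 = heads)
tosses = expand.grid(toss1 = c(0, 1), toss2 = c(0, 1), toss3 = c(0, 1))
# number of heads in each sequence
X = rowSums(tosses)
# probability of each value: count of sequences divided by 8
pmf = table(X)/nrow(tosses)
pmf
# the probabilities add to 1
sum(pmf)
```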

We examine the bar-graph which summarises the distribution above.

Distributions of random variables that arise in science can be more complex. We will learn more about this soon!

3C: Expectation and variance

If \(X\) has outcomes \(x_1\), …, \(x_k\) with probabilities \(P(X=x_1)\), …, \(P(X=x_k)\), the expected value of \(X\) is the sum of each outcome multiplied by its corresponding probability: \[E(X) = x_1 P(X=x_1) + \cdots + x_k P(X=x_k) = \sum_{i=1}^{k}x_iP(X=x_i)\] The Greek letter \(\mu\) may be used in place of the notation \(E(X)\) and is sometimes written \(\mu_X\).

The variance of \(X\), denoted by \(\text{Var}(X)\) or \(\sigma^2\), is \[\begin{align*} \text{Var}(X) &= (x_1-\mu)^2 P(X=x_1) + \cdots+ (x_k-\mu)^2 P(X=x_k) \\ &= \sum_{j=1}^{k} (x_j - \mu)^2 P(X=x_j) \end{align*}\] The standard deviation of \(X\), written as \(\text{SD}(X)\) or \(\sigma\), is the square root of the variance. It is sometimes written \(\sigma_X\).


Exercise: recall the probability distribution:
\(x_i\) 0 1 2 3
\(P(X = x_i)\) 1/8 3/8 3/8 1/8

and compute \(\mu_X\) and \(\sigma^2_X\).

Click here for answer.

\[\begin{align*} \mu_X &= 0P(X=0) + 1P(X=1) + 2P(X=2) + 3P(X = 3) \\ &= (0)(1/8) + (1)(3/8) + (2)(3/8) + (3)(1/8) \\ &= 12/8 \\ &= 1.5 \end{align*}\]

\[\begin{align*} \sigma_X^2 &= (x_1-\mu_X)^2P(X=x_1) + \cdots+ (x_4-\mu_X)^2 P(X=x_4) \\ &= (0- 1.5)^2(1/8) + (1 - 1.5)^2 (3/8) + (2 -1.5)^2 (3/8) + (3-1.5)^2 (1/8) \\ &= 3/4 \end{align*}\]
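The same arithmetic can be carried out in R with vectors. This is a sketch; the variable names are illustrative.

```r
x = 0:3                         # values of X
p = c(1, 3, 3, 1)/8             # P(X = x) for each value
mu = sum(x*p)                   # expected value: sum of value times probability
sigma2 = sum((x - mu)^2*p)      # variance: probability-weighted squared deviations
sigma = sqrt(sigma2)            # standard deviation
mu
sigma2
sigma
```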

3D: Binomial random variables

One specific type of discrete random variable is a binomial random variable.

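\(X\) is a binomial random variable if it represents the number of successes in \(n\) independent replications of an experiment where each replication has two possible outcomes (success or failure) and the probability of success \(p\) is the same in every replication.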

A binomial random variable takes on values \(0, 1, 2, \dots, n\).


Click here for details on binomial coefficient

The Binomial Coefficient

\(\binom{n}{x}\) is the number of ways to choose \(x\) items from a set of size \(n\), where the order of the choice is ignored.

Mathematically,

\[\binom{n} {x} = \frac{n!}{x!(n-x)!}\]


Formulation: Let \(X\) be a random variable modeling the number of successes in \(n\) trials

\[P(X = x)=\binom{n}{x} p^x (1-p)^{n-x},\: x= 0, 1, 2, \dots, n\]

Parameters of the distribution: \(n\), the number of trials, and \(p\), the probability of success in a single trial.

Shorthand notation: \(X\sim \text{Bin}(n,p).\) The number of heads in 3 tosses of a fair coin is a binomial random variable with parameters \(n = 3\) and \(p = 0.5\).
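As a sketch, the binomial formula can be written out with choose() and compared with R's built-in dbinom() for the coin example, \(n = 3\) and \(p = 0.5\) (variable names are illustrative):

```r
n = 3
p = 0.5
x = 0:n
# binomial formula written out term by term
pmf.formula = choose(n, x)*p^x*(1 - p)^(n - x)
# the same probabilities from the built-in function
pmf.dbinom = dbinom(x, size = n, prob = p)
pmf.formula
all.equal(pmf.formula, pmf.dbinom)
```

The result reproduces the distribution 1/8, 3/8, 3/8, 1/8 from section 3B.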


Mean and SD for a binomial random variable

For a binomial distribution with parameters \(n\) and \(p\), it can be shown that the mean is \(\mu = np\) and the standard deviation is \(\sigma = \sqrt{np(1-p)}\).
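The standard results for a binomial random variable, mean \(np\) and standard deviation \(\sqrt{np(1-p)}\), can be checked with a Monte Carlo sketch. The choice of \(n = 3\) and \(p = 0.5\), the seed, and the number of draws are assumptions made for illustration.

```r
n = 3
p = 0.5
set.seed(4)                                   # arbitrary seed for reproducibility
draws = rbinom(100000, size = n, prob = p)    # random sample from Bin(3, 0.5)
# formula versus simulation for the mean
c(formula = n*p, simulated = mean(draws))
# formula versus simulation for the standard deviation
c(formula = sqrt(n*p*(1 - p)), simulated = sd(draws))
```

For these parameter values the formulas give 1.5 and \(\sqrt{0.75}\), matching the exercise in section 3C.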


Click here for details on calculating binomial probabilities in R

The function dbinom() is used to calculate \(P(X = k)\).

The function pbinom() is used to calculate \(P(X \leq k)\) or \(P(X > k)\).

The function rbinom() is used to generate random samples from a \(\text{Bin}(n, p)\) distribution. [will be used for simulation studies!]
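A short sketch of the three functions in use. The distribution \(\text{Bin}(10, 0.3)\) and the cutoff 2 are arbitrary values chosen for illustration.

```r
# X ~ Bin(10, 0.3); parameter values are arbitrary
n = 10
p = 0.3
dbinom(2, size = n, prob = p)                       # P(X = 2)
pbinom(2, size = n, prob = p)                       # P(X <= 2)
pbinom(2, size = n, prob = p, lower.tail = FALSE)   # P(X > 2)
# P(X <= 2) is the sum of P(X = 0), P(X = 1) and P(X = 2)
sum(dbinom(0:2, size = n, prob = p))
set.seed(5)                                         # arbitrary seed
rbinom(5, size = n, prob = p)                       # five random draws
```

Note that lower.tail = FALSE gives the strict inequality \(P(X > k)\), so \(P(X \geq k)\) requires pbinom(k - 1, ..., lower.tail = FALSE).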

3E: Continuous random variables

A discrete random variable takes on a finite (or countably infinite) number of values.

A continuous random variable can take on any real value in an interval.

A general distinction to keep in mind: discrete random variables are counted, but continuous random variables are measured.

3F: Probability density functions

Recall: for a discrete random variable \(X\), we assign probabilities to each of the values that \(X\) can take. This assignment yields a probability mass function (pmf): input a value \(x\) and get \(P(X = x)\) as output.

In comparison, a continuous random variable takes any exact prescribed value \(c\) with probability zero, but there is a positive probability that its value will lie in particular intervals, which can be arbitrarily small.

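Continuous random variables usually admit probability density functions (PDFs), which share two important features: the total area under the density curve is 1, and the probability that the variable falls in an interval equals the area under the curve over that interval.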

When working with continuous random variables, probability is found for intervals of values rather than individual values.
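Probabilities as areas can be sketched with numerical integration. The example assumes the standard normal density dnorm() (the normal distribution is the topic of section 3G); the intervals are arbitrary.

```r
# total area under the standard normal density
total.area = integrate(dnorm, lower = -Inf, upper = Inf)$value
# P(-1 <= X <= 1) as the area under the density between -1 and 1
area.within.1 = integrate(dnorm, lower = -1, upper = 1)$value
# areas over ever narrower intervals centred at 1 shrink towards 0
widths = c(0.1, 0.01, 0.001)
narrow.areas = sapply(widths,
                      function(w) integrate(dnorm, lower = 1 - w/2, upper = 1 + w/2)$value)
total.area
area.within.1
narrow.areas
```

The shrinking areas illustrate why a single exact value has probability zero for a continuous random variable.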

3G: Normal distribution

3H: Calculating normal probabilities using R
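A minimal sketch of normal probability calculations with pnorm() and qnorm(). The cutoffs and the mean of 100 and standard deviation of 10 are illustrative values, not taken from the slides. Note that the sd argument is the standard deviation, not the variance.

```r
# standard normal Z (mean 0, sd 1 are the defaults)
pnorm(1.96)                        # P(Z <= 1.96)
pnorm(1.96, lower.tail = FALSE)    # P(Z > 1.96)
pnorm(1) - pnorm(-1)               # P(-1 <= Z <= 1)
# normal with other parameters: pass mean and sd explicitly
pnorm(110, mean = 100, sd = 10)    # P(X <= 110) for X with mean 100 and sd 10
# qnorm() inverts pnorm(): the value z with P(Z <= z) = 0.975
qnorm(0.975)
```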

Lab 3: Distributions of random variables

Exercise 1 of 2: _binom()

Click here for more details

The US Centers for Disease Control and Prevention (CDC) estimates that 90% of Americans have had chickenpox by the time they reach adulthood. Let \(X\) represent the number of individuals in a sample who had chickenpox during childhood.

  1. Using the information given about the experiment, calculate the probability that exactly 97 out of 100 randomly sampled American adults had chickenpox during childhood.
    Click here for answer
    dbinom(97, 100, 0.9)
    ## [1] 0.005891602
    Hence, the probability that exactly 97 out of 100 randomly sampled American adults had chickenpox during childhood is 0.0058916.
  2. Calculate the probability that exactly 3 out of a new sample of 100 American adults have not had chickenpox in their childhood.
    Click here for answer
    dbinom(97, 100, 0.9)
    ## [1] 0.005891602
    The event that exactly 3 out of 100 adults did not have chickenpox during childhood is equivalent to the event that exactly 97 out of 100 did have chickenpox during childhood; thus, the probability is 0.0058916.
  3. What is the probability that at least 1 out of 10 randomly sampled American adults has had chickenpox?
    Click here for answer
    pbinom(0, size = 10, prob = 0.90, lower.tail = FALSE)
    ## [1] 1
    Hence, the probability that at least 1 out of 10 randomly sampled American adults has had chickenpox is almost 1.
  4. What is the probability that at most 3 out of 10 randomly sampled American adults have not had chickenpox?
    Click here for answer
    pbinom(3, size = 10, prob = 0.10)
    ## [1] 0.9872048
    Hence, the probability that at most 3 out of 10 randomly sampled American adults have not had chickenpox is 0.9872048.
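Two of the answers above can be reached by an alternative route. This is a sketch rather than the lab's own solution: question 2 counts the adults who did not have chickenpox, using success probability 0.1, and question 3 uses the complement rule.

```r
# Question 2: number of adults who did NOT have chickenpox follows Bin(100, 0.1)
dbinom(3, size = 100, prob = 0.1)
# same probability, counted from the other side
dbinom(97, size = 100, prob = 0.9)
# Question 3: P(at least 1) = 1 - P(none), by the complement rule
1 - dbinom(0, size = 10, prob = 0.9)
```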

Exercise 2 of 2: _norm()

Click here for more details

People are classified as hypertensive if their systolic blood pressure (SBP) is higher than a specified level for their age group.

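Assume SBP is normally distributed within each age group. The answers below use mean 105 and standard deviation 5 with a hypertension cutoff of 115 for ages 1-14, and mean 125 and standard deviation 10 with a cutoff of 140 for ages 15-44. Define a family as a group of two people in age group 1-14 and two people in age group 15-44. A family is classified as hypertensive if at least one adult and at least one child are hypertensive.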

  1. What proportion of 1- to 14-year-olds are hypertensive?
    Click here for answer
    pnorm(115, mean = 105, sd = 5, lower.tail = FALSE)
    ## [1] 0.02275013
    Let \(X\) be the SBP for 1-14 year olds, with \(X \sim N(105, 5)\), where the second parameter is the standard deviation. We compute \(P(X \geq 115)\).
  2. What proportion of 15- to 44-year-olds are hypertensive?
    Click here for answer
    pnorm(140, mean = 125, sd = 10, lower.tail = FALSE)
    ## [1] 0.0668072
    Let \(Y\) be the SBP for 15-44 year olds, with \(Y \sim N(125, 10)\), where the second parameter is the standard deviation. We compute \(P(Y \geq 140)\).
  3. What is the probability that a family is hypertensive? Assume that the hypertensive statuses of different members of a family are independent random variables. (highly unrealistic but we’ll still soldier through!)
    Click here for answer
    p1 <- pnorm(115, mean = 105, sd = 5, lower.tail = FALSE)
    p2 <- pnorm(140, mean = 125, sd = 10, lower.tail = FALSE)
    (1 - pbinom(0, size = 2, p1))*(1 - pbinom(0, size = 2, p2))
    ## [1] 0.005809569
    Let \(C\) be a binomial random variable modeling the number of children in a family that are hypertensive, and \(A\) be a binomial random variable modeling the number of hypertensive adults in a family; \(C \sim {Bin}(2, 0.0228)\), \(A \sim {Bin}(2, 0.0668)\). A family is considered hypertensive if at least one adult and at least one child are hypertensive. Let \(H\) represent the event that a family is hypertensive. Assuming that \(C\) and \(A\) are independent: \(P(H) = P(C \geq 1) \times P(A \geq 1) = 0.0058\)
  4. Consider a community of 1,000 families. What is the probability that between one and five families (inclusive) are hypertensive?
    Click here for answer
    p1 <- pnorm(115, mean = 105, sd = 5, lower.tail = FALSE)
    p2 <- pnorm(140, mean = 125, sd = 10, lower.tail = FALSE)
    p3 <- (1 - pbinom(0, size = 2, p1))*(1 - pbinom(0, size = 2, p2))
    pbinom(5, 1000, p3) - dbinom(0, 1000, p3)
    ## [1] 0.4733927
    Let \(K\) be a binomial random variable modeling the number of hypertensive families in the community. \(P(1 \leq K \leq 5) = P(K \leq 5) - P(K = 0) = 0.473.\) The probability that between one and five families (inclusive) are hypertensive is 0.4733927.
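The family-level probability from question 3 can be checked with a Monte Carlo sketch. It simulates SBP values directly, using the distributions and cutoffs from the answers above and the same independence assumption; the number of simulated families, the seed and the variable names are illustrative.

```r
set.seed(6)                       # arbitrary seed for reproducibility
n.families = 200000               # assumed number of simulated families
# SBP for two children and two adults per family, drawn independently
child1 = rnorm(n.families, mean = 105, sd = 5)
child2 = rnorm(n.families, mean = 105, sd = 5)
adult1 = rnorm(n.families, mean = 125, sd = 10)
adult2 = rnorm(n.families, mean = 125, sd = 10)
# at least one hypertensive child, and at least one hypertensive adult
any.child = (child1 > 115) | (child2 > 115)
any.adult = (adult1 > 140) | (adult2 > 140)
# simulated proportion of hypertensive families
p.sim = mean(any.child & any.adult)
# exact value from the binomial calculation in question 3
p1 = pnorm(115, mean = 105, sd = 5, lower.tail = FALSE)
p2 = pnorm(140, mean = 125, sd = 10, lower.tail = FALSE)
p.exact = (1 - pbinom(0, size = 2, p1))*(1 - pbinom(0, size = 2, p2))
c(p.sim, p.exact)
```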
