Big Data Summer Institute 2022
June 21, 2022
Soumik Purkayastha
[https://soumikp.github.io]
[Press ‘k’ on
your keyboard and use arrows/space bar to navigate!]
At the end of this review, you will be able to
if statements and simulating populations based on
conditional probabilities.People often colloquially refer to probability…
Formalizing concepts and terminology around probability is essential for improved understanding.
Probability is used to assign a level of uncertainty to the outcomes of phenomena that either happen randomly (e.g. rolling dice) or appear random because of a lack of understanding about exactly how the phenomenon occurs (e.g., patient response to medical treatment).
The probability of an outcome is the proportion of times the outcome would occur if the random phenomenon could be observed an infinite number of times.
Two events or outcomes are called disjoint or mutually exclusive if they cannot both happen at the same time.
If \(A\) and \(B\) represent two disjoint events, then the probability that either occurs is \[P(A \cup B) = P(A \text{ or } B) = P(A) + P(B).\]
\(A:\) rolling a number less than three.
\(B:\) rolling an even number more than two.
\(D:\) rolling a two or a three
A and B are disjoint; B and D are disjoint but A and D are not.
What is \(P(A \cup B)\)?
If there are \(k\) disjoint events \(A_1, A_2, \ldots, A_k\), then the probability that at least one of these outcomes will occur is
\[P(A_1 \text{ or } A_2 \text{ or } \ldots A_k) = \sum_{i=1}^k P(A_i).\]Let \(A\):= drawing a diamond and \(B\):= drawing a face card. What is \(P(A \cup B)?\)
Have to account for double-counting!
Thus, for two events \(A\) and \(B\), the probability that either occurs is
\[P(A \cup B) = P(A) + P(B) - P(A \cap B),\] where the ‘cup’ \(\cup\) symbol denotes the union of two events and the ‘cap’ \(\cap\) symbol denotes the intersection of two events. This is the general addition rule.The complement of an event \(D\), denoted by \(D^C\) represents all outcomes of the experiment that are NOT in \(D\).
\(D\) and \(D^C\) are related in the following probabilistic sense:
If \(A\) and \(B\) represent events from two different and independent processes, then the probability that both \(A\) and \(B\) occur is given by
\[P(A \cap B) = P(A) \times P(B).\]
Notes for review may be found here.
Question: Suppose that a biased coin is tossed 5 times; the coin is weighted such that the probability of obtaining a heads is 0.6. What is the probability of obtaining exactly 3 heads?
sample()Before writing the code for a simulation to estimate probability, it is helpful to clearly define the experiment and event of interest. In this case, the experiment is tossing a biased coin 5 times and the event of interest is observing exactly 3 heads.
The following code illustrates the use of the sample()
command to simulate the result for one set of 5 coin tosses.
#define parameters
prob.heads = ## FILL THIS IN!!
number.tosses = ## FILL THIS IN!!
#simulate the coin tosses
outcomes = sample(c(0, 1),
size = number.tosses,
prob = c(1 - prob.heads, prob.heads), replace = TRUE)
#view the results
table(outcomes)
#store the results as a single number
total.heads = sum(outcomes)
total.headsUsing the information given about the experiment, set the
parameters for prob.heads and number.tosses
and run the code chunk.
Click here for answer
prob.heads = 0.6
number.tosses = 5
To generate outcomes, the sample()
command draws from the values 0 and 1 with
probabilites corresponding to those specified by the argument
prob. Which number corresponds to heads, and which
corresponds to tails?
Click here for answer
The number 0 corresponds to tails and 1 corresponds to heads. This information comes from the structure of the vectors used in the sample() command; the elements in c(0, 1) are sampled with probabilities c(1 - prob.heads, prob.heads), respectively.
Why is it important to sample with replacement?
Click here for answer
There are only two numbers being sampled from: 0 and 1. From a coding perspective, sampling without replacement would only allow for two coin tosses, since the vector contains only two elements. From a statistical perspective, sampling with replacement allows any coin to have the outcome 0 (tails) or 1 (heads), even if another coin was observed to be heads or tails; i.e., sampling with replacement allows the coin tosses to be independent.
What is the advantage of representing the outcomes with the
values 0 and 1, rather than with letters like
“T” and “H”?
Click here for answer
Representing the outcomes with 0 and 1 allows for the use of sum() to return the number of heads. Note that sum() can also be used if the outcomes are represented with letters, but the syntax is slightly more complex; this method is shown in the Notes section below.
Run the code chunk again to simulate another set of 5 coin tosses. Is it reasonable to expect that the results might differ from the first set of tosses? Explain your answer.
Click here for answer
Yes, it is reasonable to expect that the number of heads in a set 5 coin tosses will not always be the same, even as the probability of a single heads remains constant. The inherent randomness makes it possible to observe anywhere between 0 and 5 heads (inclusive) in any set of 5 tosses.
forThe following code uses a for loop to repeat (i.e.,
replicate) the experiment and record the results of each replicate. The
term k is an index, used to keep track of each iteration of
the loop; think of it as similar to the index of summation \(k\) (or \(i\)) in sigma notation (\(\sum_{k = 1}^n\)). The value
num.replicates is set to 50, specifying that the experiment
is to be repeated 50 times. The command set.seed()is used
to draw a reproducible random sample; i.e., re-running the chunk will
produce the same set of outcomes.
#define parameters
prob.heads = 0.6
number.tosses = 5
number.replicates = 50
#create empty vector to store outcomes
outcomes = vector("numeric", number.replicates)
#set the seed for a pseudo-random sample
set.seed(20220621)
#simulate the coin tosses
for(k in 1:number.replicates){
outcomes.replicate = sample(c(0, 1), size = number.tosses,
prob = c(1 - prob.heads, prob.heads), replace = TRUE)
outcomes[k] = sum(outcomes.replicate)
}number.replicates. Run the code chunk.
## [1] 2 3 2 2 4 4 3 3 1 4 2 4 5 5 3 0 3 4 4 3 3 4 4 4 4 3 3 3 4 2 4 1 2 4 3 1 4 2
## [39] 3 3 0 3 2 4 5 3 3 3 3 2
## outcomes
## 0 1 2 3 4 5 Sum
## 2 3 9 18 15 3 50
## heads.3
## FALSE TRUE
## 32 18
outcomes
Events of interest include
-\(A\) = {disease present}
-\(A^C\) = {disease absent}
-\(B\) = {positive test result}
-\(B^C\) = {negative test result}.
Based on the quantities above, we derive characteristics of a diagnostic test.
-Sensitivity = \(P(B|A)\)
-Specificity = \(P(B^{C} | A^C)\)
Using R as a calculator: one advantage R
has over a standard hand calculator is the ability to easily store
values as named variables. This can make numerical computations like the
calculation of positive predictive value much less error-prone. Using a
short script like the following to do the computation, rather than
directly entering numbers into a calculator, is more efficient.
We focus on the relationships between prevalence, sensitivity,
specificity, positive predictive value, and negative predictive
value.
#define variables
prevalence = 0.10
sensitivity = 0.98
specificity = 0.95
#calculate ppv
ppv.num = prevalence*sensitivity
ppv.den = ppv.num + (1 - specificity)*(1 - prevalence)
ppv = ppv.num/ppv.den
ppvQuestion: The script above can also be re-run with different starting values. How do we obtain \(PPV\) for starting prevalence values \(P(A) = \{0.01, 0.05, 0.1, 0.25\}\) which corresponds to 1%, 5%, 10% and 25% prevalences of the diseases (breast cancer for our example)? We want to graphically compare how \(PPV\) changes as we vary \(P(A)\).
#define variables
prevalence = c(0.01, 0.05, 0.1, 0.2)
sensitivity = rep(0.98, 4)
specificity = rep(0.95, 4)
#calculate ppv
ppv.num = prevalence*sensitivity
ppv.den = ppv.num + (1 - specificity)*(1 - prevalence)
ppv = ppv.num/ppv.den
plot(x = prevalence, y = ppv, type = "o", pch = 20, xlab = "Prevalence P(A)", ylab = "Positive predictive value (PPV)")A random variable is a function that maps each event in a sample space to a number.
Suppose \(X\) is the number of heads in 3 tosses of a fair coin.
The distribution of a discrete random variable is the collection of its values and the probabilities associated with those values.
To each value \(x\) that the random variable \(X\) can take, we assign a probability \(P(X = x)\).
For example, consider the following probability distribution:
| \(x_i\) | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| \(P(X = x_i)\) | 1/8 | 3/8 | 3/8 | 1/8 |
We note the following
We examine the bar-graph which summarises the distribution above.
Distributions of random variables that arise in science can be more complex. We will learn more about this soon!
If \(X\) has outcomes \(x_1\), …, \(x_k\) with probabilities \(P(X=x_1)\), …, \(P(X=x_k)\), the expected value of \(X\) is the sum of each outcome multiplied by its corresponding probability: \[E(X) = x_1 P(X=x_1) + \cdots + x_k P(X=x_k) = \sum_{i=1}^{k}x_iP(X=x_i)\] The Greek letter \(\mu\) may be used in place of the notation \(E(X)\) and is sometimes written \(\mu_X\).
The variance of \(X\), denoted by \(\text{Var}(X)\) or \(\sigma^2\), is \[\begin{align*} \text{Var}(X) &= (x_1-\mu)^2 P(X=x_1) + \cdots+ (x_k-\mu)^2 P(X=x_k) \\ &= \sum_{j=1}^{k} (x_j - \mu)^2 P(X=x_j) \end{align*}\] The standard deviation of \(X\), written as \(\text{SD}(X)\) or \(\sigma\), is the square root of the variance. It is sometimes written \(\sigma_X\).
| \(x_i\) | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| \(P(X = x_i)\) | 1/8 | 3/8 | 3/8 | 1/8 |
and compute \(\mu_X\) and \(\sigma^2_X\).
\[\begin{align*} \mu_X &= 0P(X=0) + 1P(X=1) + 2P(X=2) + 3P(X = 3) \\ &= (0)(1/8) + (1)(3/8) + (2)(3/8) + (3)(1/8) \\ &= 12/8 \\ &= 1.5 \end{align*}\]
\[\begin{align} \sigma_X^2 &= (x_1-\mu_X)^2P(X=x_1) + \cdots+ (x_4-\mu)^2 P(X=x_4) \notag \\ &= (0- 1.5)^2(1/8) + (1 - 1.5)^2 (3/8) + (2 -1.5)^2 (3/8) + (3-1.5)^2 (1/8) \notag \\ &= 3/4 \notag \end{align}\]One specific type of discrete random variable is a binomial random variable.
\(X\) is a binomial random variable if it represents the number of successes in \(n\) independent replications of an experiment where
A binomial random variable takes on values \(0, 1, 2, \dots, n\).
The Binomial Coefficient
\(\binom{n}{x}\) is the number of ways to choose \(x\) items from a set of size \(n\), where the order of the choice is ignored.
Mathematically,
\[\binom{n} {x} = \frac{n!}{x!(n-x)!}\]
\(n = 1, 2, \ldots\)
\(x = 0, 1, 2, \ldots, n\)
For any integer \(m\), \(m! = (m)(m-1)(m-2)\cdots(1)\)
\[P(X = x)=\binom{n}{x} p^x (1-p)^{n-x},\: x= 0, 1, 2, \dots, n\]
Parameters of the distribution:
\(n\) = number of trials
\(p\) = probability of success
Shorthand notation: \(X\sim \text{Bin}(n,p).\) The number of heads in 3 tosses of a fair coin is a binomial random variable with parameters \(n = 3\) and \(p = 0.5\).
Mean and SD for a binomial random variable
For a binomial distribution with parameters \(n\) and \(p\), it can be shown that:
Mean = \(np\)
Standard Deviation = \(\sqrt{np(1-p)}\)
R
The function dbinom() is used to calculate \(P(X = k)\).
dbinom(k, n, p): \(P(X =
k)\)The function pbinom() is used to calculate \(P(X \leq k)\) or \(P(X > k)\).
pbinom(k, n, p): \(P(X \leq
k)\)pbinom(k, n, p, lower.tail = FALSE): \(P(X > k)\)rbinom() is used to generate random samples
from \(Bin(n, p)\) distribution. [will
be used for simulation studies!]
A discrete random variable takes on a finite number of values.
Number of heads in a set of coin tosses
Number of people who have had chicken pox in a random sample
A continuous random variable can take on any real value in an interval.
Height in a population
Blood pressure in a population
A general distinction to keep in mind: discrete random variables are counted, but continuous random variables are measured.
Recall: for a discrete random variable \(X\), we assign probabilities to each of the values that \(X\) can take. This assignment yields a probability mass function (pmf): input a value \(x\) and get \(P(X = x)\) as output.
In comparison, continuous random variables almost never take an exact prescribed value \(c\), but there is a positive probability that its value will lie in particular intervals which can be arbitrarily small.
Continuous random variables usually admit probability density functions (PDF) which share two important features
The total area under the density curve is 1.
The probability that a variable has a value within a specified interval is the area under the curve over that interval.
When working with continuous random variables, probability is found for intervals of values rather than individual values.
The probability that a continuous r.v. \(X\) takes on any single individual value is 0. That is, \(P(X = x) = 0\).
Thus, \(P(a < X < b)\) is equivalent to \(P(a \leq X \leq b)\).
RQuestion: What is the percentile rank for a student who scores an 1800 on the SAT for a year in which the scores are N(1500,300)? \(100 \times P(X \leq 1800)\) where \(X \sim N(1500, 300)\)
## [1] 0.8413447
Note: must multiply output by 100 to obtain `meaningful’ percentile rank.
Question: What score on the SAT would put a student in the 99th percentile? Want to find \(x\) such that \(P(X \leq x) = 0.99\) where \(X \sim N(1500, 300)\).
## [1] 2197.904Question: What is the probability that someone scores more than 1200 but less than 1600 on the SAT?? \(P(1200 \leq X \leq 1600)\) where \(X \sim N(1500, 300)\).
## [1] 0.4719034_binom()The US Centers for Disease Control and Prevention (CDC) estimates that 90% of Americans have had chickenpox by the time they reach adulthood. Let \(X\) represent the number of individuals in a sample who had chickenpox during childhood.
_norm()People are classified as hypertensive if their systolic blood pressure (SBP) is higher than a specified level for their age group.
Assume SBP is normally distributed. Define a family as a group of two people in age group 1-14 and two people in age group 15-44. A family is classified as hypertensive if at least one adult and at least one child are hypertensive.
p1 <- pnorm(115, mean = 105, sd = 5, lower.tail = FALSE)
p2 <- pnorm(140, mean = 125, sd = 10, lower.tail = FALSE)
(1 - pbinom(0, size = 2, p1))*(1 - pbinom(0, size = 2, p2))## [1] 0.005809569
Let \(C\) be
a binomial random variable modeling the number of children in a family
that are hypertensive, and \(A\) be a
binomial random variable modeling the number of hypertensive adults in a
family; \(C \sim {Bin}(2, 0.0228)\),
\(A \sim {Bin}(2, 0.0668)\). A family
is considered hypertensive if at least one adult and at least one child
are hypertensive. Let \(H\) represent
the event that a family is hypertensive. Assuming that \(C\) and \(A\) are independent: \(P(H) = P(C \geq 1) \times P(A \geq 1) =
0.0058\)
p1 <- pnorm(115, mean = 105, sd = 5, lower.tail = FALSE)
p2 <- pnorm(140, mean = 125, sd = 10, lower.tail = FALSE)
p3 <- (1 - pbinom(0, size = 2, p1))*(1 - pbinom(0, size = 2, p2))
pbinom(5, 1000, p3) - dbinom(0, 1000, p3)## [1] 0.4733927
Let \(K\) be
a binomial random variable modeling the number of hypertensive families
in the community. \(P(1 \leq K \leq 5) = P(K
\leq 5) - P(K = 0) = 0.475.\) The probability that between one
and five families are hypertensive is 0.4733927