library(ggplot2)
A probability distribution is a mathematical description of the possible outcomes of a random variable
\[ X_{1...n} \sim Binomial(size, p) \]
coin_flips <- rbinom(1000, 1000, .5)
qplot(coin_flips, geom = "histogram") +
ggtitle("Binomial distribution") +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Probability of get exact number of occurrences from 10 attempts
## [1] 0.2460938
Probability of obtain 5 occurrences from 10 attempts
pbinom(5,10, 0.5)
## [1] 0.6230469
# Calculate the probability that at least five coins are heads
1 - pbinom(4,10, 0.3)
## [1] 0.1502683
# Confirm your answer with a simulation of 10,000 trials
mean(rbinom(10000, 10, 0.3) >= 5)
## [1] 0.1458
\[ \mu = np \]
\[ Var(X) = np(1-p) \]
# Calculate the variance using the exact formula
25 * 0.3 * 0.7
## [1] 5.25
# Confirm with a simulation using rbinom
var(rbinom(10000, 25, 0.3))
## [1] 5.133001
\[ Pr(A \space and \space B) = Pr(A) \space Pr(B) \]
a <- rbinom(100000, 1, .5)
b <- rbinom(100000, 1, .5)
mean(a & b)
## [1] 0.24981
\[ Pr(A \space or \space B) = Pr(A) + Pr(B) - Pr(A) \space Pr(B) \]
a <- rbinom(100000, 1, .5)
b <- rbinom(100000, 1, .5)
mean(a | b)
## [1] 0.75037
# Use rbinom to simulate 100,000 draws from each of X and Y
X <- rbinom(100000, 10, .6)
Y <- rbinom(100000, 10, .7)
# Estimate the probability either X or Y is <= to 4
mean(X <= 4 | Y <= 4)
## [1] 0.20591
# Use pbinom to calculate the probabilities separately
prob_X_less <- pbinom(4, 10, .6)
prob_Y_less <- pbinom(4, 10, .7)
# Combine these to calculate the exact probability either <= 4
prob_X_less + prob_Y_less - prob_X_less * prob_Y_less
## [1] 0.2057164
\[ E[k \cdot X] = k \cdot E[X] \] \[ Var(k \cdot X) = k^2 \cdot Var(X) \]
\[ E[X + Y] = E[X] + E[Y] \]
\[ Var[X + Y] = Var[X] + Var[Y] \]
Probability of coin was fair given the result was 11
# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom(50000, 20, 0.5)
biased <- rbinom(50000, 20, 0.75)
# How many fair cases, and how many biased, led to exactly 11 heads?
fair_11 <- sum(fair == 11)
biased_11 <- sum(biased == 11)
# Find the fraction of fair coins that are 11 out of all coins that were 11
fair_11 /(fair_11 + biased_11)
## [1] 0.8595864
Probability of coin was fair given the result was 16
# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom(50000, 20, 0.5)
biased <- rbinom(50000, 20, 0.75)
# How many fair cases, and how many biased, led to exactly 16 heads?
fair_16 <- sum(fair == 16)
biased_16 <- sum(biased == 16)
# Find the fraction of fair coins that are 16 out of all coins that were 16
fair_16 / (fair_16 + biased_16)
## [1] 0.02312436
We see 14 out of 20 flips are heads, and start with a 80% chance the coin is fair and a 20% chance it is biased to 75%.
You’ll solve this case with simulation, by starting with a “bucket” of 10,000 coins, where 8,000 are fair and 2,000 are biased, and flipping each of them 20 times.
# Simulate 8000 cases of flipping a fair coin, and 2000 of a biased coin
fair_flips <- rbinom(8000, 20, 0.5)
biased_flips <- rbinom(2000, 20, 0.75)
# Find the number of cases from each coin that resulted in 14/20
fair_14 <- sum(fair_flips == 14)
biased_14 <- sum(biased_flips == 14)
# Use these to estimate the posterior probability
fair_14 / (fair_14 + biased_14)
## [1] 0.4533133
With three coins
# Simulate 80,000 draws from fair coin, 10,000 from each of high and low coins
flips_fair <- rbinom(80000, 20, 0.5)
flips_high <- rbinom(10000, 20, 0.75)
flips_low <- rbinom(10000, 20, 0.25)
# Compute the number of coins that resulted in 14 heads from each of these piles
fair_14 <- sum(flips_fair == 14)
high_14 <- sum(flips_high == 14)
low_14 <- sum(flips_low == 14)
# Compute the posterior probability that the coin was fair
fair_14 / (fair_14 + high_14 + low_14)
## [1] 0.6399217
\[ Pr(A|B) = \frac{Pr(B|A) \space Pr(A)}{Pr(B|A) \space Pr(A) + Pr(B|not \space A)Pr(not \space A)} \]
Probability that a coin is fair given the result was 14 and 18 heads in 20 attempts
# Find the probability that a coin resulting in 14/20 is fair
probability_fair <- dbinom(14, 20, .5)
probability_biased <- dbinom(14, 20, .75)
probability_fair / (probability_fair + probability_biased)
## [1] 0.179811
# Find the probability that a coin resulting in 18/20 is fair
probability_fair <- dbinom(18, 20, .5)
probability_biased <- dbinom(18, 20, .75)
probability_fair / (probability_fair + probability_biased)
## [1] 0.002699252
Bayes theorem application:
Suppose we see 16 heads out of 20 flips, which would normally be strong evidence that the coin is biased. However, suppose we had set a prior probability of a 99% chance that the coin is fair (50% chance of heads), and only a 1% chance that the coin is biased (75% chance of heads).
You’ll solve this exercise by finding the
# Use dbinom to find the probability of 16/20 from a fair or biased coin
probability_16_fair <- dbinom(16, 20, 0.5)
probability_16_biased <- dbinom(16, 20, 0.75)
# Use Bayes' theorem to find the posterior probability that the coin is fair
probability_16_fair * 0.99 / (probability_16_fair * 0.99 + probability_16_biased * 0.01)
## [1] 0.7068775
Normal approximation to the binomial
\[ \mu = size \cdot p \]
\[ \sigma \sqrt{size \cdot p \cdot (1-p)} \]
compare_histograms <- function(variable1, variable2) {
x <- data.frame(value = variable1, variable = "Variable 1")
y <- data.frame(value = variable2, variable = "Variable 2")
ggplot(rbind(x, y), aes(value)) +
geom_histogram() +
facet_wrap(~ variable, nrow = 2)
}
binomial <- rbinom(100000, 1000, .5)
expected_value <- 1000 * .5
variance <- 1000 * .5
stdev <- sqrt(variance)
# Very similar distributions to binomial
normal <- rnorm(100000, expected_value, stdev)
compare_histograms(binomial, normal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
\[ X \sim Poisson(\lambda) \] \[ E[X] = \lambda \]
\[ Var(X) = \lambda \]
binomial <- rbinom(100000, 1000, 1 /1000)
poisson <- rpois(100000, 1)
compare_histograms(binomial, poisson)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Another example
# Draw a random sample of 100,000 from the Binomial(1000, .002) distribution
binom_sample <- rbinom(100000,1000, .002)
# Draw a random sample of 100,000 from the Poisson approximation
poisson_sample <- rpois(100000, 2)
# Compare the two distributions with the compare_histograms function
compare_histograms(binom_sample, poisson_sample)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set { 1, 2, 3, … }
\[ X \sim Geom(p) \]
\[ E[X] = \frac{1}{p} - 1 \]
How many draws we need to get a head with 20% of probability
# Simulate 100 instances of flipping a 20% coin
flips <- rbinom(100, 1, 0.2)
# Use which to find the first case of 1 ("heads")
which(flips == 1)[1]
## [1] 7
# Existing code for finding the first instance of heads
which(rbinom(100, 1, 0.2) == 1)[1]
## [1] 2
# Replicate this 100,000 times using replicate()
replications <- replicate(100000, which(rbinom(100, 1, 0.2) == 1)[1])
# Histogram the replications with qplot
qplot(replications)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Replications from the last exercise
replications <- replicate(100000, which(rbinom(100, 1, .2) == 1)[1])
# Generate 100,000 draws from the corresponding geometric distribution
geom_sample <- rgeom(100000, 0.2)
# Compare the two distributions with compare_histograms
compare_histograms(replications, geom_sample)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Examples:
# Find the probability the machine breaks on 5th day or earlier
pgeom(4, .1)
## [1] 0.40951
# Find the probability the machine is still working on 20th day
1 - pgeom(19, .1)
## [1] 0.1215767
# Calculate the probability of machine working on day 1-30
still_working <- 1 - pgeom(0:29, .1)
# Plot the probability for days 1 to 30
qplot(1:30, still_working)