library(ggplot2)

Binomial distribution

A probability distribution is a mathematical description of the possible outcomes of a random variable and how likely each one is.

\[ X_{1...n} \sim Binomial(size, p) \]

coin_flips <- rbinom(1000, 1000, .5)
qplot(coin_flips, geom = "histogram") +
  ggtitle("Binomial distribution") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Probability density

Probability of getting exactly 5 heads (successes) in 10 flips

dbinom(5, 10, 0.5)
## [1] 0.2460938
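
As a quick check, this matches the binomial probability formula computed directly with choose():

choose(10, 5) * 0.5^5 * (1 - 0.5)^5   # 0.2460938, same as dbinom(5, 10, 0.5)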

Cumulative density

Probability of obtaining at most 5 heads in 10 attempts

pbinom(5, 10, 0.5)
## [1] 0.6230469
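
Equivalently, the cumulative probability is the sum of the individual probabilities of 0 through 5 heads:

sum(dbinom(0:5, 10, 0.5))   # 0.6230469, same as pbinom(5, 10, 0.5)
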
# Calculate the probability that at least five of ten flips of a 30% coin are heads

1 - pbinom(4, 10, 0.3)
## [1] 0.1502683
# Confirm your answer with a simulation of 10,000 trials
mean(rbinom(10000, 10, 0.3) >= 5)
## [1] 0.1458

Mean and variance of the binomial distribution

\[ \mu = np \]

\[ Var(X) = np(1-p) \]

# Calculate the variance using the exact formula
25 * 0.3 * 0.7
## [1] 5.25
# Confirm with a simulation using rbinom
var(rbinom(10000, 25, 0.3))
## [1] 5.133001
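
The mean formula can be checked the same way; the simulated value will vary slightly from run to run:

# Calculate the mean using the exact formula
25 * 0.3   # 7.5
# Confirm with a simulation using rbinom
mean(rbinom(10000, 25, 0.3))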

Probability of A and B (independent events)

\[ Pr(A \space and \space B) = Pr(A) \space Pr(B) \]

a <- rbinom(100000, 1, .5)
b <- rbinom(100000, 1, .5)

mean(a & b)
## [1] 0.24981

Probability of A or B

For independent events A and B, where Pr(A and B) = Pr(A) Pr(B):

\[ Pr(A \space or \space B) = Pr(A) + Pr(B) - Pr(A) \space Pr(B) \]

a <- rbinom(100000, 1, .5)
b <- rbinom(100000, 1, .5)

mean(a | b)
## [1] 0.75037
# Use rbinom to simulate 100,000 draws from each of X and Y
X <- rbinom(100000, 10, .6)
Y <- rbinom(100000, 10, .7)

# Estimate the probability that either X or Y is <= 4
mean(X <= 4 | Y <= 4)
## [1] 0.20591
# Use pbinom to calculate the probabilities separately
prob_X_less <- pbinom(4, 10, .6)
prob_Y_less <- pbinom(4, 10, .7)

# Combine these to calculate the exact probability either <= 4
prob_X_less + prob_Y_less - prob_X_less * prob_Y_less
## [1] 0.2057164

Random variable distribution properties

\[ E[k \cdot X] = k \cdot E[X] \] \[ Var(k \cdot X) = k^2 \cdot Var(X) \]

Adding two independent random variables

\[ E[X + Y] = E[X] + E[Y] \]

\[ Var(X + Y) = Var(X) + Var(Y) \]

The expectation rule holds for any two random variables; the variance rule requires X and Y to be independent.
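
As a rough simulation check of these rules (the numbers will vary a little from run to run):

X <- rbinom(100000, 10, .5)   # E[X] = 5,  Var(X) = 2.5
Y <- rbinom(100000, 10, .3)   # E[Y] = 3,  Var(Y) = 2.1

mean(3 * X)   # close to 3 * 5 = 15
var(3 * X)    # close to 9 * 2.5 = 22.5
mean(X + Y)   # close to 5 + 3 = 8
var(X + Y)    # close to 2.5 + 2.1 = 4.6 (X and Y are independent)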

Bayesian statistics

Probability that the coin was fair, given that the result was 11 heads out of 20

# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom(50000, 20, 0.5)
biased <- rbinom(50000, 20, 0.75)

# How many fair cases, and how many biased, led to exactly 11 heads?
fair_11 <- sum(fair == 11)
biased_11 <- sum(biased == 11)

# Find the fraction of fair coins among all coins that resulted in 11 heads
fair_11 / (fair_11 + biased_11)
## [1] 0.8595864
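
The same posterior can be computed exactly with dbinom (this is the approach formalized under Bayes' theorem below), and it lands close to the simulated estimate:

dbinom(11, 20, 0.5) / (dbinom(11, 20, 0.5) + dbinom(11, 20, 0.75))   # about 0.855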

Probability that the coin was fair, given that the result was 16 heads out of 20

# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom(50000, 20, 0.5)
biased <- rbinom(50000, 20, 0.75)

# How many fair cases, and how many biased, led to exactly 16 heads?
fair_16 <- sum(fair == 16)
biased_16 <- sum(biased == 16)

# Find the fraction of fair coins among all coins that resulted in 16 heads
fair_16 / (fair_16 + biased_16)
## [1] 0.02312436

Prior probability

We see 14 out of 20 flips are heads, and start with an 80% chance the coin is fair and a 20% chance it is biased towards 75% heads.

You'll solve this case with simulation: start with a “bucket” of 10,000 coins, 8,000 fair and 2,000 biased, and flip each of them 20 times.

# Simulate 8000 cases of flipping a fair coin, and 2000 of a biased coin
fair_flips <- rbinom(8000, 20, 0.5)
biased_flips <- rbinom(2000, 20, 0.75)

# Find the number of cases from each coin that resulted in 14/20
fair_14 <- sum(fair_flips == 14)
biased_14 <- sum(biased_flips == 14)

# Use these to estimate the posterior probability
fair_14 / (fair_14 + biased_14)
## [1] 0.4533133
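
For comparison, weighting the exact dbinom probabilities by the same 80%/20% prior (as done under Bayes' theorem below) gives a posterior of roughly 0.47:

prob_14_fair <- dbinom(14, 20, 0.5)
prob_14_biased <- dbinom(14, 20, 0.75)
prob_14_fair * 0.8 / (prob_14_fair * 0.8 + prob_14_biased * 0.2)   # roughly 0.47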

With three coins

# Simulate 80,000 draws from fair coin, 10,000 from each of high and low coins
flips_fair <- rbinom(80000, 20, 0.5)
flips_high <- rbinom(10000, 20, 0.75)
flips_low <-  rbinom(10000, 20, 0.25)

# Compute the number of coins that resulted in 14 heads from each of these piles
fair_14 <- sum(flips_fair == 14)
high_14 <- sum(flips_high == 14)
low_14 <- sum(flips_low == 14)

# Compute the posterior probability that the coin was fair
fair_14 / (fair_14 + high_14 + low_14)
## [1] 0.6399217

Bayes' theorem

\[ Pr(A|B) = \frac{Pr(B|A) \space Pr(A)}{Pr(B|A) \space Pr(A) + Pr(B|not \space A)Pr(not \space A)} \]

Probability that a coin is fair given a result of 14 heads, and then 18 heads, in 20 attempts (with fair and biased equally likely a priori)

# Find the probability that a coin resulting in 14/20 is fair
probability_fair <- dbinom(14, 20, .5)
probability_biased <- dbinom(14, 20, .75)
probability_fair / (probability_fair + probability_biased)
## [1] 0.179811
# Find the probability that a coin resulting in 18/20 is fair
probability_fair <- dbinom(18, 20, .5)
probability_biased <- dbinom(18, 20, .75)
probability_fair / (probability_fair + probability_biased)
## [1] 0.002699252

Bayes' theorem application:

Suppose we see 16 heads out of 20 flips, which would normally be strong evidence that the coin is biased. However, suppose we had set a prior probability of a 99% chance that the coin is fair (50% chance of heads), and only a 1% chance that the coin is biased (75% chance of heads).

You'll solve this exercise by finding the posterior probability that the coin is fair, weighting each dbinom likelihood by its prior probability.

# Use dbinom to find the probability of 16/20 from a fair or biased coin
probability_16_fair <- dbinom(16, 20, 0.5)
probability_16_biased <- dbinom(16, 20, 0.75)

# Use Bayes' theorem to find the posterior probability that the coin is fair

probability_16_fair * 0.99 / (probability_16_fair * 0.99 + probability_16_biased * 0.01)
## [1] 0.7068775
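
This could also be confirmed with a simulation in the style of the earlier exercises, using a bucket of 100,000 coins in the same 99%/1% proportions (the estimate is noisy because few coins reach 16 heads):

fair_flips <- rbinom(99000, 20, 0.5)
biased_flips <- rbinom(1000, 20, 0.75)
sum(fair_flips == 16) / (sum(fair_flips == 16) + sum(biased_flips == 16))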

Normal distribution

Normal approximation to the binomial

\[ \mu = size \cdot p \]

\[ \sigma = \sqrt{size \cdot p \cdot (1-p)} \]

# Helper that stacks two samples into one data frame and plots their
# histograms in separate facets so the shapes can be compared directly
compare_histograms <- function(variable1, variable2) {
  x <- data.frame(value = variable1, variable = "Variable 1")
  y <- data.frame(value = variable2, variable = "Variable 2")
  ggplot(rbind(x, y), aes(value)) +
    geom_histogram() +
    facet_wrap(~ variable, nrow = 2)
}

binomial <- rbinom(100000, 1000, .5)

expected_value <- 1000 * .5
variance <- 1000 * .5 * (1 - .5)
stdev <- sqrt(variance)

# A normal distribution with the same mean and standard deviation is very similar to the binomial
normal <- rnorm(100000, expected_value, stdev)

compare_histograms(binomial, normal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
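
The approximation also holds for tail probabilities; as a quick sketch, the exact binomial CDF and its normal approximation give similar values at the same point:

# Probability of 480 or fewer heads in 1000 flips of a fair coin
pbinom(480, 1000, .5)
# Normal approximation with mean 500 and sd sqrt(250)
pnorm(480, 500, sqrt(250))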

Poisson distribution

\[ X \sim Poisson(\lambda) \] \[ E[X] = \lambda \]

\[ Var(X) = \lambda \]

# A binomial with large size and small p is approximately Poisson with lambda = size * p
binomial <- rbinom(100000, 1000, 1 / 1000)

poisson <- rpois(100000, 1)

compare_histograms(binomial, poisson)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another example

# Draw a random sample of 100,000 from the Binomial(1000, .002) distribution
binom_sample <- rbinom(100000, 1000, .002)

# Draw a random sample of 100,000 from the Poisson approximation
poisson_sample <- rpois(100000, 2)

# Compare the two distributions with the compare_histograms function
compare_histograms(binom_sample, poisson_sample)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
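
The agreement also shows up in the exact probabilities; the binomial and Poisson densities for the first few counts are nearly identical:

# Exact binomial probabilities for 0-5 events vs the Poisson(2) approximation
dbinom(0:5, 1000, .002)
dpois(0:5, 2)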

Geometric distribution

The probability distribution of the number X of failures before the first success in a sequence of Bernoulli trials (this is R's parameterization), supported on the set { 0, 1, 2, … }

\[ X \sim Geom(p) \]

\[ E[X] = \frac{1}{p} - 1 \]

How many flips it takes to get the first head when each flip has a 20% chance of landing heads

# Simulate 100 instances of flipping a 20% coin
flips <- rbinom(100, 1, 0.2)

# Use which to find the first case of 1 ("heads")
which(flips == 1)[1]
## [1] 7
# Existing code for finding the first instance of heads
which(rbinom(100, 1, 0.2) == 1)[1]
## [1] 2
# Replicate this 100,000 times using replicate()
replications <- replicate(100000, which(rbinom(100, 1, 0.2) == 1)[1])

# Histogram the replications with qplot
qplot(replications)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Replications from the last exercise
replications <- replicate(100000, which(rbinom(100, 1, .2) == 1)[1])

# Generate 100,000 draws from the corresponding geometric distribution
geom_sample <- rgeom(100000, 0.2)

# Compare the two distributions with compare_histograms
compare_histograms(replications, geom_sample)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
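
To line the two histograms up exactly, shift the replications down by one, since which() returns the trial number of the first success while rgeom counts only the failures before it:

compare_histograms(replications - 1, geom_sample)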

Examples:

# Find the probability the machine breaks on the 5th day or earlier
pgeom(4, .1)
## [1] 0.40951
# Find the probability the machine is still working on the 20th day
1 - pgeom(19, .1)
## [1] 0.1215767
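
As a sanity check, still working on the 20th day simply means 20 consecutive days without a breakdown:

.9 ^ 20   # 0.1215767, same as 1 - pgeom(19, .1)
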
# Calculate the probability of the machine still working on each of days 1 to 30
still_working <- 1 - pgeom(0:29, .1)

# Plot the probability for days 1 to 30
qplot(1:30, still_working)