library( dplyr )
library( ggplot2 )
library( gridExtra )
library( openintro )
library( tidyr )
library(hrbrthemes)
The Binomial Distribution
Flipping Coins in R
Simulating a coin flip in R with rbinom( numdraws, numcoins, prob(1) )
#the rbinom() function:
rbinom( 1,1,.5 )
## [1] 0
Flipping many coins:
#returns a vector of flips
rbinom( 10,1,0.5 )
## [1] 1 1 1 1 0 0 0 1 0 0
#return the sum of heads (1) from the flips
rbinom( 1,10,0.5 )
## [1] 6
#return a vector of sums of multiple multi coin flip outcomes
rbinom( 10,10,0.5 )
## [1] 5 3 6 5 5 6 1 4 5 5
Unfair coin tosses:
rbinom( 10,10,0.8 )
## [1] 8 7 7 8 8 8 8 7 7 7
rbinom( 10,10,0.2 )
## [1] 0 0 3 3 2 0 4 2 0 1
Binomial Distributions \[X_{1 \cdots n} \sim \mbox{Binomial}(\mbox{size},p)\]
# Generate 100 occurrences of flipping 10 coins, each with 30% probability
manyflips <- data.frame( tenflipssum = rbinom(100000,10,0.3) )
ggplot( manyflips, aes( x = tenflipssum ) ) +
geom_bar( binwidth = 1 ) +
scale_x_continuous( limits = c( 1,10 )) +
ggtitle( 'Flipping 10 coins many times' )
## Warning: Ignoring unknown parameters: binwidth
## Warning: Removed 2814 rows containing non-finite values (stat_count).
## Warning: Removed 2 rows containing missing values (geom_bar).
Density and Cumulatice Density
What is the probability that the sum of 10 flips will be 5? \[P(flips=5)\]
flips <- data.frame( flip10 = rbinom( 100000,10,0.3 ) )
ggplot( flips, aes( x = flip10 ) ) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There were ~25000 flips of 10 with a sum of 5. \[\sim \frac{25000}{100000} = \frac{1}{4}\] We can easily get the exact number:
mean( flips$flip10 == 5 )
## [1] 0.10096
Calculating the exact probability density: dbinom( outcome to find estimate at, the number of coins, the probability of heads )
#probability at 5 for a fair coin
dbinom( 5, 10, 0.5 )
## [1] 0.2460938
#probability at 6 for a fair coin
dbinom( 6, 10, 0.5 )
## [1] 0.2050781
#probability at 10 for a fair coin
dbinom( 10, 10, 0.5 )
## [1] 0.0009765625
Cumulative Density
the probability that the outcome is \(\gt,\lt,\geq,\leq\) a point on the probability outcomes distribution. \[X \sim \mbox{Binomial}(10,.5)\] \[Pr(X \leq 4 )\] programatically calculate the simulated probability:
mean( flips <= 4 )
## [1] 0.85062
use the function pbinom() to find the exactvalue
pbinom( 4, 10, 0.5 )
## [1] 0.3769531
# Calculate the probability that 2 of 10 coins are heads using dbinom
dbinom( 2, 10, 0.3 )
## [1] 0.2334744
# Confirm your answer with a simulation using rbinom
mean( rbinom( 10000, 10, 0.3 ) == 2 )
## [1] 0.2299
# Calculate the probability that at least five coins are heads
pbinom( 5, 10, 0.3 )
## [1] 0.952651
# Confirm your answer with a simulation of 10,000 trials
mean( rbinom( 10000, 10, 0.3 ) >= 5 )
## [1] 0.1486
Expected Value and Variance
Properties of a distribution:
- Expected Value: Where is the distribution centered? (mean value)
- \(E[X] = \mbox{size}\cdot p\)
- Variance: How spread out is it?
- \(\mbox{Var}(X) = \mbox{size} \cdot p \cdot (1-p)\)
#find the expected value:
flips <- rbinom( 100000, 10, .5 )
mean( flips )
## [1] 5.00041
mean( rbinom( 100000, 100, 0.2 ) )
## [1] 20.00098
#calculate the variance
var( flips )
## [1] 2.507735
#What is the expected value of a binomial distribution where 25 coins are flipped, each having a 30% chance of heads?
# Calculate the expected value using the exact formula
25*.3
## [1] 7.5
# Confirm with a simulation using rbinom
mean(rbinom(10000,25,.3))
## [1] 7.4895
# Calculate the variance using the exact formula
25*.3*(1-.3)
## [1] 5.25
# Confirm with a simulation using rbinom
var(rbinom(10000,25,.3))
## [1] 5.222522
Laws of Probability
Probability of Event A & Event B
The probability of A & B is the P(A)*P(B): \[\mbox{Pr}( A \mbox{ and } B ) = \mbox{Pr}( A ) \cdot \mbox{Pr}( B )\]
treeDiag(c("Event A","Event B"),
c(0.5, 0.5), list(c(0.5, 0.5), c(0.5, 0.5)))
Simulate this by collecting data on lots of coin flips
#collect 100000 trials of two coin flips
A <- rbinom( 100000, 1, 0.5 )
B <- rbinom( 100000,1, 0.5 )
#compare results from the two coins to see how many trials resulted in 2 heads
mean( A & B )
## [1] 0.24994
another example with unequal probabilities
# Simulate 100,000 flips of a coin with a 40% chance of heads
A <- rbinom(100000, 1, 0.4)
# Simulate 100,000 flips of a coin with a 20% chance of heads
B <- rbinom(100000, 1, 0.2)
# Estimate the probability both A and B are heads
mean( A & B )
## [1] 0.08138
using multiple events
# You've already simulated 100,000 flips of coins A and B
A <- rbinom(100000, 1, .4)
B <- rbinom(100000, 1, .2)
# Simulate 100,000 flips of coin C (70% chance of heads)
C <- rbinom(100000, 1, .7)
# Estimate the probability A, B, and C are all heads
mean( A & B & C )
## [1] 0.05513
#0.4*0.2*0.7
Probability of A or B
\[\mbox{Pr}( A \mbox{ or } B ) = \mbox{Pr}( A ) + \mbox{Pr}( B ) - \mbox{Pr}( A ) \cdot \mbox{Pr}( B )\]
A <- rbinom(100000, 1, .5)
B <- rbinom(100000, 1, .5)
mean( A | B )
## [1] 0.74788
What about more complicated scenarios?
Three coins:
\[\mbox{Pr}( A \mbox{ or } B \mbox{ or } C) = \mbox{Pr}( A \mbox{ and } B) - \mbox{Pr}( A \mbox{ and } C) - \mbox{Pr}( A \mbox{ and } B) + \mbox{Pr}( A \mbox{ and } B \mbox{ and } C )\]
A <- rbinom(100000, 1, .5)
B <- rbinom(100000, 1, .5)
C <- rbinom(100000, 1, .5)
mean( A | B | C )
## [1] 0.87628
A <- rbinom(100000, 1, .6)
B <- rbinom(100000, 1, .1)
mean( A | B )
## [1] 0.63818
# Use rbinom to simulate 100,000 draws from each of X and Y
X <- rbinom(100000, 10, .6)
Y <- rbinom(100000, 10, .7)
# Estimate the probability either X or Y is <= to 4
mean(X <= 4 | Y <= 4)
## [1] 0.20517
# Use pbinom to calculate the probabilities separately
prob_X_less <- pbinom(4,10,.6)
prob_Y_less <- pbinom(4,10,.7)
# Combine these to calculate the exact probability either <= 4
prob_X_less + prob_Y_less - prob_X_less*prob_Y_less
## [1] 0.2057164
Multiplying Random Variables
Rules for manipulating random variables:
- \(E[k\cdot X] = k \cdot E[X]\)
- \(\mbox{Var}(k \cdot Y) = k^2 \cdot \mbox{Var}(X)\)
size <- 50
mult <- 4
mean( rbinom( 100000, size, 0.4 ) )
## [1] 19.993
#what happens is the size is increased by 3x
mean( rbinom( 100000, size*mult, 0.4 ) )
## [1] 80.00665
# Simulate 100,000 draws of a binomial with size 20 and p = .1
X <- rbinom(100000, 20, .1)
# Estimate the expected value of X
mean( X )
## [1] 1.99496
# Estimate the expected value of 5 * X
mean( 5*X )
## [1] 9.9748
# X is simulated from 100,000 draws of a binomial with size 20 and p = .1
X <- rbinom(100000, 20, .1)
# Estimate the variance of X
var(X)
## [1] 1.801832
# Estimate the variance of 5 * X
var(5*X)
## [1] 45.0458
Adding Two Random Variables
\[E[X+Y] = E[X] + E[Y]\] (even if X and Y are not independent) \[\mbox{Var}[X+Y] = \mbox{Var}[X] + \mbox{Var}[Y]\] (only if X and Y are independent)
X <- rbinom( 100000, 10, .5 )
Y <- rbinom( 100000, 100, .2 )
Z <- X + Y
res <- paste( 'E[X]=', mean(X), ', E[Y]=', mean(Y), '\nE[Z]=', mean(Z),
'\nVar[X]=', mean(X), ', Var[Y]=', mean(Y), '\nVar[Z]=', mean(Z) )
cat( res, sep = '\n')
## E[X]= 4.99939 , E[Y]= 20.00542
## E[Z]= 25.00481
## Var[X]= 4.99939 , Var[Y]= 20.00542
## Var[Z]= 25.00481
X <- rbinom(100000, 20, .3 )
Y <- rbinom(100000, 40, .1 )
mean( X + Y )
## [1] 9.98302
Bayesian Statistics
Updating with Evidence
How can we tell if a coin is fair or biased? If we run an experiment by flipping the coin, we can tell the likelihood the coin is biased from these results.
Updating beliefs after seeing evidence: this process is at the heart of Bayesian statistics
numtrials <- 50000
numcoins <- 20
fairP <- 0.5
biasedP <- 0.75
val <- 14
#flip 50,000 fair coins 20 times
X <- rbinom(numtrials, numcoins, fairP )
#do the same with a biased coin
Y <- rbinom(numtrials, numcoins, biasedP )
df <- data.frame( 'X' = X, 'Y' = Y ) %>%
mutate( condX = X == val,
condY = Y == val )
Xhist <- df %>% ggplot( aes( x = X, fill = condX ) ) +
geom_histogram( binwidth = 1 ) +
scale_fill_manual( values = c( 'blue','red')) +
xlim( c( 1,numcoins ) )
Yhist <- df %>% ggplot( aes( x = Y, fill = condY ) ) +
geom_histogram( binwidth = 1) +
scale_fill_manual( values = c( 'blue','red')) +
xlim( c( 1,numcoins ) )
grid.arrange( Xhist, Yhist, ncol = 1 )
## Warning: Removed 4 rows containing missing values (geom_bar).
## Warning: Removed 4 rows containing missing values (geom_bar).
What is the probability of getting 14 heads with the fair and biased coin?
p14fair <- dbinom( val, numcoins, fairP )
p14biased <- dbinom( val, numcoins, biasedP )
#results for simulation
p14fexp <- sum( X == val )/numtrials
p14bexp <- sum( Y == val )/numtrials
res <-paste( 'True Prop fair coin:', p14fair,
'\nTrue Prop biased coin:', p14biased,
'\nSimulated Prob fair coin:', p14fexp,
'\nSimulated Prob biased coin:', p14bexp)
cat( res )
## True Prop fair coin: 0.0369644165039063
## True Prop biased coin: 0.168609293214104
## Simulated Prob fair coin: 0.03632
## Simulated Prob biased coin: 0.1707
The biased coin has a much higher probability of a 14heads result than the fair coin. However, it is still possible for the fair coin to have a 14 result, albeit much less likely than the biased coin.
Conditional Probability: What is the probability that the coin is biased given that we get the result of 14 heads? \[\mbox{Pr}(\mbox{ Biased}|14 \mbox{ Heads }) = \frac{\mbox{ # total w/} 14 \mbox{ Heads}}{\mbox{# total w/} 14 \mbox{ Heads} }\]
p14tot <- p14fexp + p14bexp
Pbiased_given_14 <- p14bexp / p14tot
res <- paste0( 'Conditional Probability the coin is biased given 14 heads: ', Pbiased_given_14,
'\nThere is a ', round( Pbiased_given_14*100,2 ), '% chance the coin is biased' )
cat( res,sep = '\n' )
## Conditional Probability the coin is biased given 14 heads: 0.824558013718481
## There is a 82.46% chance the coin is biased
# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom( 50000, 20, .5 )
biased <- rbinom( 50000, 20, .75 )
# How many fair cases, and how many biased, led to exactly 11 heads?
fair_11 <- sum( fair == 11)
biased_11 <- sum( biased == 11 )
# Find the fraction of fair coins that are 11 out of all coins that were 11
# This is the posterior probability that the coin with 11/20 is fair
fair_11/(fair_11+biased_11)
## [1] 0.8621654
# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom( 50000, 20, .5 )
biased <- rbinom( 50000, 20, .75 )
# How many fair cases, and how many biased, led to exactly 16 heads?
fair_16 <- sum( fair == 16)
biased_16 <- sum( biased == 16 )
# Find the fraction of fair coins that are 16 out of all coins that were 16
# This is the posterior probability that the coin with 16/20 is fair
fair_16/(fair_16+biased_16)
## [1] 0.02352206
Prior Probability
The previous example compared two different samples of coins with equal sizes. But what if there are different numbers of types of coin?
numtrialsf <- 90000
numtrialsb <- 10000
numcoins <- 20
fairP <- 0.5
biasedP <- 0.75
val <- 14
#flip 90,000 fair coins 20 times
X <- rbinom(numtrialsf, numcoins, fairP )
#flip 10,000 biased coins 20 times
Y <- rbinom(numtrialsb, numcoins, biasedP )
dfX <- data.frame( 'X' = X ) %>%
mutate( condX = X == val )
dfY <- data.frame( 'Y' = Y ) %>%
mutate( condY = Y == val )
Xhist <- dfX %>% ggplot( aes( x = X, fill = condX ) ) +
geom_histogram( binwidth = 1 ) +
scale_fill_manual( values = c( 'blue','red')) +
xlim( c( 1,numcoins ) ) +
ylim( c( 0, 16000 ) )
Yhist <- dfY %>% ggplot( aes( x = Y, fill = condY ) ) +
geom_histogram( binwidth = 1) +
scale_fill_manual( values = c( 'blue','red')) +
xlim( c( 1,numcoins ) ) +
ylim( c( 0, 16000 ) )
grid.arrange( Xhist, Yhist, ncol = 1 )
## Warning: Removed 4 rows containing missing values (geom_bar).
## Warning: Removed 4 rows containing missing values (geom_bar).
In this simulation, a result of 14 heads is more likely to be from fair coin even though it is more probably for a given biased coin tiral to result in 14 than a given fair coin; this is because the fair coins are so much more numerous.
fair <- sum( X == 14 )
biased <- sum( Y == 14 )
Cond_bg14 <- biased / (biased + fair )
res <- paste( 'The conditional probability of the coin being biased given that the result was 14\nP(biased|14) =', round( Cond_bg14,4 ) )
cat( res, sep = '\n')
## The conditional probability of the coin being biased given that the result was 14
## P(biased|14) = 0.3267
#PRIOR BELIEF: a 80% chance the coin is fair and a 20% chance it is biased to 75%.
# Simulate 8000 cases of flipping a fair coin, and 2000 of a biased coin
fair_flips <- rbinom( 8000, 20, .5 )
biased_flips <- rbinom( 2000, 20, .75)
# Find the number of cases from each coin that resulted in 14/20
fair_14 <- sum( fair_flips == 14 )
biased_14 <- sum( biased_flips == 14 )
# Use these to estimate the posterior probability
fair_14/(fair_14 + biased_14)
## [1] 0.4809524
Three coins
# Simulate 80,000 draws from fair coin, 10,000 from each of high and low coins
flips_fair <- rbinom( 80000, 20, .5 )
flips_high <- rbinom( 10000, 20, .75 )
flips_low <- rbinom( 10000, 20, .25 )
# Compute the number of coins that resulted in 14 heads from each of these piles
fair_14 <- sum( flips_fair == 14 )
high_14 <- sum( flips_high == 14 )
low_14 <- sum( flips_low == 14 )
# Compute the posterior probability that the coin was fair
fair_14/(fair_14 + high_14 + low_14)
## [1] 0.6371453
#visualize the distributions
df_fair <- data.frame( 'fair' = flips_fair ) %>%
mutate( condF = flips_fair == val )
df_high <- data.frame( 'high' = flips_high ) %>%
mutate( condH = flips_high == val )
df_low <- data.frame( 'low' = flips_low ) %>%
mutate( condL = flips_low == val )
fhist <- df_fair %>% ggplot( aes( x = fair, fill = condF ) ) +
geom_histogram( binwidth = 1 ) +
scale_fill_manual( values = c( 'blue','red')) +
xlim( c( 1,numcoins ) ) +
ylim( c( 0, 16000 ) )
hhist <- df_high %>% ggplot( aes( x = high, fill = condH ) ) +
geom_histogram( binwidth = 1) +
scale_fill_manual( values = c( 'blue','red')) +
xlim( c( 1,numcoins ) ) +
ylim( c( 0, 16000 ) )
lhist <- df_low %>% ggplot( aes( x = low, fill = condL ) ) +
geom_histogram( binwidth = 1) +
scale_fill_manual( values = c( 'blue','red')) +
xlim( c( 1,numcoins ) ) +
ylim( c( 0, 16000 ) )
grid.arrange( fhist, hhist, lhist, ncol = 1 )
Bayes’ Theorem
Finding the exact conditional probability: 1) \[P(\mbox{biased & }14)=P(14|\mbox{biased})\cdot P(\mbox{biased})\] also, 2) \[P(\mbox{fair & }14)=P(14|\mbox{fair})\cdot P(\mbox{fair})\] knowing from before 3) \[P(\mbox{biased}|14)=\frac{P(14 \mbox{ Heads & Biased})}{P(14 \mbox{ Heads & Biased})+P(14 \mbox{ Heads & Fair})}\] we can substitute in 1 & 2 to 3 4) \[P(\mbox{biased}|14)= \frac{P(14|\mbox{biased})\cdot P(\mbox{biased})}{P(14|\mbox{biased})\cdot P(\mbox{biased}) + P(14|\mbox{fair})\cdot P(\mbox{fair})}\] This is an example of an application of Bayes’ Theorem: \[P(B|A)= \frac{P(B|A)\cdot P(A)}{P(B|A)\cdot P(A) + P(B|!A)\cdot P(!A)}\] where event A = biased and event B = 14 heads
Let’s try is: Calculate some conditional probabilities…
# Use dbinom to calculate the probability of 11/20 heads with fair or biased coin
probability_fair <- dbinom( 11, 20, .5)
probability_biased <- dbinom( 11, 20, .75)
# Calculate the posterior probability that the coin is fair
pf_11 <- probability_fair / (probability_fair + probability_biased)
# Find the probability that a coin resulting in 14/20 is fair
probability_fair <- dbinom( 14, 20, .5)
probability_biased <- dbinom( 14, 20, .75)
pf_14 <- probability_fair / (probability_fair + probability_biased)
# Find the probability that a coin resulting in 18/20 is fair
probability_fair <- dbinom( 18, 20, .5)
probability_biased <- dbinom( 18, 20, .75)
pf_18<- probability_fair / (probability_fair + probability_biased)
res <-paste( 'P(fair & 11):', round(pf_11,4),
'\nP(fair & 14):', round(pf_14,4),
'\nP(fair & 18):', round(pf_18,4))
cat( res, sep = '\n')
## P(fair & 11): 0.8555
## P(fair & 14): 0.1798
## P(fair & 18): 0.0027
We now learn that there is a 99% probability the coin is fair. Here we incorporate this prior into the calculation. We would like to know the probability that a coin is fair given the result of the trial was 16:
# Use dbinom to find the probability of 16/20 from a fair or biased coin
probability_16_fair <- dbinom( 16,20,.5)
probability_16_biased <- dbinom( 16,20,.75)
# Use Bayes' theorem to find the posterior probability that the coin is fair
pF <- 0.99
pB <- 0.01
probability_16_fair*pF/(probability_16_fair*pF + probability_16_biased*pB)
## [1] 0.7068775