Foundations of Probability in `R`

DataCamp: Statistics with `R`

Bonnie Cooper

library( dplyr )
library( ggplot2 )
library( gridExtra )
library( openintro )
library( tidyr )
library(hrbrthemes)

The Binomial Distribution

Flipping Coins in `R`

Simulating a coin flip in R with rbinom( numdraws, numcoins, prob(1) )

#the rbinom() function:
rbinom( 1,1,.5 )

## [1] 0

Flipping many coins:

#returns a vector of flips
rbinom( 10,1,0.5 )

##  [1] 1 1 1 1 0 0 0 1 0 0

#return the sum of heads (1) from the flips
rbinom( 1,10,0.5 )

## [1] 6

#return a vector of sums of multiple multi coin flip outcomes
rbinom( 10,10,0.5 )

##  [1] 5 3 6 5 5 6 1 4 5 5

Unfair coin tosses:

rbinom( 10,10,0.8 )

##  [1] 8 7 7 8 8 8 8 7 7 7

rbinom( 10,10,0.2 )

##  [1] 0 0 3 3 2 0 4 2 0 1

Binomial Distributions \[X_{1 \cdots n} \sim \mbox{Binomial}(\mbox{size},p)\]

# Generate 100 occurrences of flipping 10 coins, each with 30% probability
manyflips <- data.frame( tenflipssum = rbinom(100000,10,0.3) )
ggplot( manyflips, aes( x = tenflipssum ) ) +
  geom_bar( binwidth = 1 ) +
  scale_x_continuous( limits = c( 1,10 )) +
  ggtitle( 'Flipping 10 coins many times' )

## Warning: Ignoring unknown parameters: binwidth

## Warning: Removed 2814 rows containing non-finite values (stat_count).

## Warning: Removed 2 rows containing missing values (geom_bar).

Density and Cumulatice Density

What is the probability that the sum of 10 flips will be 5? \[P(flips=5)\]

flips <- data.frame( flip10 = rbinom( 100000,10,0.3 ) )
ggplot( flips, aes( x = flip10 ) ) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There were ~25000 flips of 10 with a sum of 5. \[\sim \frac{25000}{100000} = \frac{1}{4}\] We can easily get the exact number:

mean( flips$flip10 == 5 )

## [1] 0.10096

Calculating the exact probability density: dbinom( outcome to find estimate at, the number of coins, the probability of heads )

#probability at 5 for a fair coin
dbinom( 5, 10, 0.5 )

## [1] 0.2460938

#probability at 6 for a fair coin
dbinom( 6, 10, 0.5 )

## [1] 0.2050781

#probability at 10 for a fair coin
dbinom( 10, 10, 0.5 )

## [1] 0.0009765625

Cumulative Density
the probability that the outcome is \(\gt,\lt,\geq,\leq\) a point on the probability outcomes distribution. \[X \sim \mbox{Binomial}(10,.5)\] \[Pr(X \leq 4 )\] programatically calculate the simulated probability:

mean( flips <= 4 )

## [1] 0.85062

use the function pbinom() to find the exactvalue

pbinom( 4, 10, 0.5 )

## [1] 0.3769531

# Calculate the probability that 2 of 10 coins are heads using dbinom
dbinom( 2, 10, 0.3 )

## [1] 0.2334744

# Confirm your answer with a simulation using rbinom
mean( rbinom( 10000, 10, 0.3 ) == 2 )

## [1] 0.2299

# Calculate the probability that at least five coins are heads
pbinom( 5, 10, 0.3 )

## [1] 0.952651

# Confirm your answer with a simulation of 10,000 trials
mean( rbinom( 10000, 10, 0.3 ) >= 5 )

## [1] 0.1486

Expected Value and Variance

Properties of a distribution:

Expected Value: Where is the distribution centered? (mean value)
- \(E[X] = \mbox{size}\cdot p\)
Variance: How spread out is it?
- \(\mbox{Var}(X) = \mbox{size} \cdot p \cdot (1-p)\)

#find the expected value:
flips <- rbinom( 100000, 10, .5 )
mean( flips )

## [1] 5.00041

mean( rbinom( 100000, 100, 0.2 ) )

## [1] 20.00098

#calculate the variance
var( flips )

## [1] 2.507735

#What is the expected value of a binomial distribution where 25 coins are flipped, each having a 30% chance of heads?

# Calculate the expected value using the exact formula
25*.3

## [1] 7.5

# Confirm with a simulation using rbinom
mean(rbinom(10000,25,.3))

## [1] 7.4895

# Calculate the variance using the exact formula
25*.3*(1-.3)

## [1] 5.25

# Confirm with a simulation using rbinom
var(rbinom(10000,25,.3))

## [1] 5.222522

Laws of Probability

Probability of Event A & Event B

The probability of A & B is the P(A)*P(B): \[\mbox{Pr}( A \mbox{ and } B ) = \mbox{Pr}( A ) \cdot \mbox{Pr}( B )\]

treeDiag(c("Event A","Event B"),
         c(0.5, 0.5), list(c(0.5, 0.5), c(0.5, 0.5)))

Simulate this by collecting data on lots of coin flips

#collect 100000 trials of two coin flips
A <- rbinom( 100000, 1, 0.5 )
B <- rbinom( 100000,1, 0.5 )
#compare results from the two coins to see how many trials resulted in 2 heads
mean( A & B )

## [1] 0.24994

another example with unequal probabilities

# Simulate 100,000 flips of a coin with a 40% chance of heads
A <- rbinom(100000, 1, 0.4)

# Simulate 100,000 flips of a coin with a 20% chance of heads
B <- rbinom(100000, 1, 0.2)

# Estimate the probability both A and B are heads
mean( A & B )

## [1] 0.08138

using multiple events

# You've already simulated 100,000 flips of coins A and B
A <- rbinom(100000, 1, .4)
B <- rbinom(100000, 1, .2)

# Simulate 100,000 flips of coin C (70% chance of heads)
C <- rbinom(100000, 1, .7)

# Estimate the probability A, B, and C are all heads
mean( A & B & C )

## [1] 0.05513

#0.4*0.2*0.7

Probability of A or B

\[\mbox{Pr}( A \mbox{ or } B ) = \mbox{Pr}( A ) + \mbox{Pr}( B ) - \mbox{Pr}( A ) \cdot \mbox{Pr}( B )\]

A <- rbinom(100000, 1, .5)
B <- rbinom(100000, 1, .5)
mean( A | B )

## [1] 0.74788

What about more complicated scenarios?
Three coins:
\[\mbox{Pr}( A \mbox{ or } B \mbox{ or } C) = \mbox{Pr}( A \mbox{ and } B) - \mbox{Pr}( A \mbox{ and } C) - \mbox{Pr}( A \mbox{ and } B) + \mbox{Pr}( A \mbox{ and } B \mbox{ and } C )\]

A <- rbinom(100000, 1, .5)
B <- rbinom(100000, 1, .5)
C <- rbinom(100000, 1, .5)
mean( A | B | C )

## [1] 0.87628

A <- rbinom(100000, 1, .6)
B <- rbinom(100000, 1, .1)
mean( A | B )

## [1] 0.63818

# Use rbinom to simulate 100,000 draws from each of X and Y
X <- rbinom(100000, 10, .6)
Y <- rbinom(100000, 10, .7)

# Estimate the probability either X or Y is <= to 4
mean(X <= 4 | Y <= 4)

## [1] 0.20517

# Use pbinom to calculate the probabilities separately
prob_X_less <- pbinom(4,10,.6)
prob_Y_less <- pbinom(4,10,.7)

# Combine these to calculate the exact probability either <= 4
prob_X_less + prob_Y_less - prob_X_less*prob_Y_less

## [1] 0.2057164

Multiplying Random Variables

Rules for manipulating random variables:

\(E[k\cdot X] = k \cdot E[X]\)
\(\mbox{Var}(k \cdot Y) = k^2 \cdot \mbox{Var}(X)\)

size <- 50
mult <- 4
mean( rbinom( 100000, size, 0.4 ) )

## [1] 19.993

#what happens is the size is increased by 3x
mean( rbinom( 100000, size*mult, 0.4 ) )

## [1] 80.00665

# Simulate 100,000 draws of a binomial with size 20 and p = .1
X <- rbinom(100000, 20, .1)

# Estimate the expected value of X
mean( X )

## [1] 1.99496

# Estimate the expected value of 5 * X
mean( 5*X )

## [1] 9.9748

# X is simulated from 100,000 draws of a binomial with size 20 and p = .1
X <- rbinom(100000, 20, .1)

# Estimate the variance of X
var(X)

## [1] 1.801832

# Estimate the variance of 5 * X
var(5*X)

## [1] 45.0458

Adding Two Random Variables

\[E[X+Y] = E[X] + E[Y]\] (even if X and Y are not independent) \[\mbox{Var}[X+Y] = \mbox{Var}[X] + \mbox{Var}[Y]\] (only if X and Y are independent)

X <- rbinom( 100000, 10, .5 )
Y <- rbinom( 100000, 100, .2 )
Z <- X + Y 

res <- paste( 'E[X]=', mean(X), ',  E[Y]=', mean(Y), '\nE[Z]=', mean(Z), 
       '\nVar[X]=', mean(X), ',  Var[Y]=', mean(Y), '\nVar[Z]=', mean(Z) )

cat( res, sep = '\n')

## E[X]= 4.99939 ,  E[Y]= 20.00542 
## E[Z]= 25.00481 
## Var[X]= 4.99939 ,  Var[Y]= 20.00542 
## Var[Z]= 25.00481

X <- rbinom(100000, 20, .3 )
Y <- rbinom(100000, 40, .1 )
mean( X + Y )

## [1] 9.98302

Bayesian Statistics

Updating with Evidence

How can we tell if a coin is fair or biased? If we run an experiment by flipping the coin, we can tell the likelihood the coin is biased from these results.
Updating beliefs after seeing evidence: this process is at the heart of Bayesian statistics

numtrials <- 50000
numcoins <- 20
fairP <- 0.5
biasedP <- 0.75
val <- 14
#flip 50,000 fair coins 20 times
X <- rbinom(numtrials, numcoins, fairP )
#do the same with a biased coin
Y <- rbinom(numtrials, numcoins, biasedP )

df <- data.frame( 'X' = X, 'Y' = Y ) %>%
  mutate( condX = X == val,
          condY = Y == val )
Xhist <- df %>% ggplot( aes( x = X, fill = condX ) ) +
  geom_histogram( binwidth = 1 ) +
  scale_fill_manual( values = c( 'blue','red')) +
  xlim( c( 1,numcoins ) )
Yhist <- df %>% ggplot( aes( x = Y, fill = condY ) ) +
  geom_histogram( binwidth = 1) +
  scale_fill_manual( values = c( 'blue','red')) +
  xlim( c( 1,numcoins ) )
grid.arrange( Xhist, Yhist, ncol = 1 )

## Warning: Removed 4 rows containing missing values (geom_bar).

## Warning: Removed 4 rows containing missing values (geom_bar).

What is the probability of getting 14 heads with the fair and biased coin?

p14fair <- dbinom( val, numcoins, fairP )
p14biased <- dbinom( val, numcoins, biasedP )
#results for simulation
p14fexp <- sum( X == val )/numtrials
p14bexp <- sum( Y == val )/numtrials

res <-paste( 'True Prop fair coin:', p14fair,
       '\nTrue Prop biased coin:', p14biased,
       '\nSimulated Prob fair coin:', p14fexp,
       '\nSimulated Prob biased coin:', p14bexp)
cat( res )

## True Prop fair coin: 0.0369644165039063 
## True Prop biased coin: 0.168609293214104 
## Simulated Prob fair coin: 0.03632 
## Simulated Prob biased coin: 0.1707

The biased coin has a much higher probability of a 14heads result than the fair coin. However, it is still possible for the fair coin to have a 14 result, albeit much less likely than the biased coin.

Conditional Probability: What is the probability that the coin is biased given that we get the result of 14 heads? \[\mbox{Pr}(\mbox{ Biased}|14 \mbox{ Heads }) = \frac{\mbox{ # total w/} 14 \mbox{ Heads}}{\mbox{# total w/} 14 \mbox{ Heads} }\]

p14tot <- p14fexp + p14bexp
Pbiased_given_14 <- p14bexp / p14tot

res <- paste0( 'Conditional Probability the coin is biased given 14 heads: ', Pbiased_given_14,
              '\nThere is a ', round( Pbiased_given_14*100,2 ), '% chance the coin is biased' )

cat( res,sep = '\n' )

## Conditional Probability the coin is biased given 14 heads: 0.824558013718481
## There is a 82.46% chance the coin is biased

# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom( 50000, 20, .5 )
biased <- rbinom( 50000, 20, .75 )

# How many fair cases, and how many biased, led to exactly 11 heads?
fair_11 <- sum( fair == 11)
biased_11 <- sum( biased == 11 )

# Find the fraction of fair coins that are 11 out of all coins that were 11
# This is the posterior probability that the coin with 11/20 is fair
fair_11/(fair_11+biased_11)

## [1] 0.8621654

# Simulate 50000 cases of flipping 20 coins from fair and from biased
fair <- rbinom( 50000, 20, .5 )
biased <- rbinom( 50000, 20, .75 )

# How many fair cases, and how many biased, led to exactly 16 heads?
fair_16 <- sum( fair == 16)
biased_16 <- sum( biased == 16 )

# Find the fraction of fair coins that are 16 out of all coins that were 16
# This is the posterior probability that the coin with 16/20 is fair
fair_16/(fair_16+biased_16)

## [1] 0.02352206

Prior Probability

The previous example compared two different samples of coins with equal sizes. But what if there are different numbers of types of coin?

numtrialsf <- 90000
numtrialsb <- 10000
numcoins <- 20
fairP <- 0.5
biasedP <- 0.75
val <- 14
#flip 90,000 fair coins 20 times
X <- rbinom(numtrialsf, numcoins, fairP )
#flip 10,000 biased coins 20 times
Y <- rbinom(numtrialsb, numcoins, biasedP )

dfX <- data.frame( 'X' = X ) %>%
  mutate( condX = X == val )
dfY <- data.frame( 'Y' = Y ) %>%
  mutate( condY = Y == val )
Xhist <- dfX %>% ggplot( aes( x = X, fill = condX ) ) +
  geom_histogram( binwidth = 1 ) +
  scale_fill_manual( values = c( 'blue','red')) +
  xlim( c( 1,numcoins ) ) +
  ylim( c( 0, 16000 ) )
Yhist <- dfY %>% ggplot( aes( x = Y, fill = condY ) ) +
  geom_histogram( binwidth = 1) +
  scale_fill_manual( values = c( 'blue','red')) +
  xlim( c( 1,numcoins ) ) +
  ylim( c( 0, 16000 ) )
grid.arrange( Xhist, Yhist, ncol = 1 )

## Warning: Removed 4 rows containing missing values (geom_bar).

## Warning: Removed 4 rows containing missing values (geom_bar).

In this simulation, a result of 14 heads is more likely to be from fair coin even though it is more probably for a given biased coin tiral to result in 14 than a given fair coin; this is because the fair coins are so much more numerous.

fair <- sum( X == 14 )
biased <- sum( Y == 14 )
Cond_bg14 <- biased / (biased + fair )
res <- paste( 'The conditional probability of the coin being biased given that the result was 14\nP(biased|14) =', round( Cond_bg14,4 ) )
cat( res, sep = '\n')

## The conditional probability of the coin being biased given that the result was 14
## P(biased|14) = 0.3267

#PRIOR BELIEF: a 80% chance the coin is fair and a 20% chance it is biased to 75%.
# Simulate 8000 cases of flipping a fair coin, and 2000 of a biased coin
fair_flips <- rbinom( 8000, 20, .5 )
biased_flips <- rbinom( 2000, 20, .75)

# Find the number of cases from each coin that resulted in 14/20
fair_14 <- sum( fair_flips == 14 )
biased_14 <- sum( biased_flips == 14 )

# Use these to estimate the posterior probability
fair_14/(fair_14 + biased_14)

## [1] 0.4809524

Three coins

# Simulate 80,000 draws from fair coin, 10,000 from each of high and low coins
flips_fair <- rbinom( 80000, 20, .5 )
flips_high <- rbinom( 10000, 20, .75 )
flips_low <- rbinom( 10000, 20, .25 )

# Compute the number of coins that resulted in 14 heads from each of these piles
fair_14 <- sum( flips_fair == 14 )
high_14 <- sum( flips_high == 14 )
low_14 <- sum( flips_low == 14 )

# Compute the posterior probability that the coin was fair
fair_14/(fair_14 + high_14 + low_14)

## [1] 0.6371453

#visualize the distributions
df_fair <- data.frame( 'fair' = flips_fair ) %>%
  mutate( condF = flips_fair == val )
df_high <- data.frame( 'high' = flips_high ) %>%
  mutate( condH = flips_high == val )
df_low <- data.frame( 'low' = flips_low ) %>%
  mutate( condL = flips_low == val )
fhist <- df_fair %>% ggplot( aes( x = fair, fill = condF ) ) +
  geom_histogram( binwidth = 1 ) +
  scale_fill_manual( values = c( 'blue','red')) +
  xlim( c( 1,numcoins ) ) +
  ylim( c( 0, 16000 ) )
hhist <- df_high %>% ggplot( aes( x = high, fill = condH ) ) +
  geom_histogram( binwidth = 1) +
  scale_fill_manual( values = c( 'blue','red')) +
  xlim( c( 1,numcoins ) ) +
  ylim( c( 0, 16000 ) )
lhist <- df_low %>% ggplot( aes( x = low, fill = condL ) ) +
  geom_histogram( binwidth = 1) +
  scale_fill_manual( values = c( 'blue','red')) +
  xlim( c( 1,numcoins ) ) +
  ylim( c( 0, 16000 ) )
grid.arrange( fhist, hhist, lhist, ncol = 1 )

Bayes’ Theorem

Finding the exact conditional probability: 1) \[P(\mbox{biased & }14)=P(14|\mbox{biased})\cdot P(\mbox{biased})\] also, 2) \[P(\mbox{fair & }14)=P(14|\mbox{fair})\cdot P(\mbox{fair})\] knowing from before 3) \[P(\mbox{biased}|14)=\frac{P(14 \mbox{ Heads & Biased})}{P(14 \mbox{ Heads & Biased})+P(14 \mbox{ Heads & Fair})}\] we can substitute in 1 & 2 to 3 4) \[P(\mbox{biased}|14)= \frac{P(14|\mbox{biased})\cdot P(\mbox{biased})}{P(14|\mbox{biased})\cdot P(\mbox{biased}) + P(14|\mbox{fair})\cdot P(\mbox{fair})}\] This is an example of an application of Bayes’ Theorem: \[P(B|A)= \frac{P(B|A)\cdot P(A)}{P(B|A)\cdot P(A) + P(B|!A)\cdot P(!A)}\] where event A = biased and event B = 14 heads

Let’s try is: Calculate some conditional probabilities…

# Use dbinom to calculate the probability of 11/20 heads with fair or biased coin
probability_fair <- dbinom( 11, 20, .5)
probability_biased <- dbinom( 11, 20, .75)
# Calculate the posterior probability that the coin is fair
pf_11 <- probability_fair / (probability_fair + probability_biased)

# Find the probability that a coin resulting in 14/20 is fair
probability_fair <- dbinom( 14, 20, .5)
probability_biased <- dbinom( 14, 20, .75)
pf_14 <- probability_fair / (probability_fair + probability_biased)

# Find the probability that a coin resulting in 18/20 is fair
probability_fair <- dbinom( 18, 20, .5)
probability_biased <- dbinom( 18, 20, .75)
pf_18<- probability_fair / (probability_fair + probability_biased)

res <-paste( 'P(fair & 11):', round(pf_11,4),
             '\nP(fair & 14):', round(pf_14,4),
             '\nP(fair & 18):', round(pf_18,4))
cat( res, sep = '\n')

## P(fair & 11): 0.8555 
## P(fair & 14): 0.1798 
## P(fair & 18): 0.0027

We now learn that there is a 99% probability the coin is fair. Here we incorporate this prior into the calculation. We would like to know the probability that a coin is fair given the result of the trial was 16:

# Use dbinom to find the probability of 16/20 from a fair or biased coin
probability_16_fair <- dbinom( 16,20,.5)
probability_16_biased <- dbinom( 16,20,.75)

# Use Bayes' theorem to find the posterior probability that the coin is fair
pF <- 0.99
pB <- 0.01
probability_16_fair*pF/(probability_16_fair*pF + probability_16_biased*pB)

## [1] 0.7068775

Related Distributions

The Normal Distribution

run a typical coin flip, but with many coins for each trial. Observe the difference in distribution envelopes:

d10 <- data.frame( 'X' = rbinom(100000, 10, .5) )
d100 <- data.frame( 'X' = rbinom(100000, 100, .5) )
d1000 <- data.frame( 'X' = rbinom(100000, 1000, .5) )

p10 <- ggplot( d10, aes( x = X ) ) +
  geom_histogram(binwidth=1)
p100 <- ggplot( d100, aes( x = X ) ) +
  geom_histogram(binwidth=1)
p1000 <- ggplot( d1000, aes( x = X ) ) +
  geom_histogram(binwidth=1)

grid.arrange( p10, p100, p1000, ncol = 1 )

With increasing size of sample, the distribution of the binomial coin flipping process more approximates a Normal distribution \[X \sim \mbox{Normal}(\mu,\sigma)\]

\(\mu\): the mean of the distribution
\(\sigma = \sqrt{\mbox{Var}X} =\) the square root of the average squared distance of each measure from the mean, \(\mu\)

finding the normal approximation to the binomial:

size <- 1000
p <- 0.5
binomial <- rbinom( 100000,size,p )
mu <- size*p
stdev <- sqrt( size*p*( 1-p ) )

normal <- rnorm( 100000,mu,stdev )

visualize the normal and binomial distributions. are they similar?

df <- data.frame( Dist = factor( rep( c('Binomial','Normal'), each = 100000)), 
                  Val = c( rbinom( 100000,size,p ), rnorm( 100000,mu,stdev ) ) )
glimpse( df )

## Rows: 200,000
## Columns: 2
## $ Dist <fct> Binomial, Binomial, Binomial, Binomial, Binomial, Binomial, Bino…
## $ Val  <dbl> 483, 497, 495, 494, 479, 501, 493, 494, 504, 511, 491, 509, 488,…

p1 <- df %>%
  ggplot( aes( x = Val, fill = Dist ) ) +
  geom_histogram( position = 'identity', bins = 30, alpha = 0.4) +
  scale_fill_manual(values=c("blue", "red")) +
  scale_color_manual(values=c("#blue", "red"))
  
p1

# Draw a random sample of 100,000 from the Binomial(1000, .2) distribution
binom_sample <- rbinom(100000,1000,.2)

# Draw a random sample of 100,000 from the normal approximation
normal_sample <- rnorm(100000,1000*0.2,sqrt(1000*.2*(1-.2)))

df = data.frame( 'binomial' = binom_sample, 'normal' = normal_sample ) %>%
  pivot_longer( cols = everything(), names_to = 'Dist', values_to = 'Val' )

p1 <- df %>%
  ggplot( aes( x = Val, fill = Dist ) ) +
  geom_histogram( position = 'identity', bins = 30, alpha = 0.4) +
  scale_fill_manual(values=c("blue", "red")) +
  scale_color_manual(values=c("#blue", "red"))
  
p1

# Simulations from the normal and binomial distributions
binom_sample <- rbinom(100000, 1000, .2)
normal_sample <- rnorm(100000, 200, sqrt(160))

# Use binom_sample to estimate the probability of <= 190 heads
mean( binom_sample <= 190 )

## [1] 0.22662

# Use normal_sample to estimate the probability of <= 190 heads
mean( normal_sample <= 190)

## [1] 0.21253

# Calculate the probability of <= 190 heads with pbinom
pbinom( 190, 1000, .2)

## [1] 0.2273564

# Calculate the probability of <= 190 heads with pnorm
pnorm( 190, 200, sqrt(160) )

## [1] 0.2145977

# Draw a random sample of 100,000 from the Binomial(10, .2) distribution
binom_sample <- rbinom( 100000, 10, .2 )

# Draw a random sample of 100,000 from the normal approximation
normal_sample <- rnorm( 100000, 10*.2, sqrt( 10*.2*(1-.2)))

df = data.frame( 'binomial' = binom_sample, 'normal' = normal_sample ) %>%
  pivot_longer( cols = everything(), names_to = 'Dist', values_to = 'Val' ) %>%
  ggplot( aes( x = Val, fill = Dist ) ) +
  geom_histogram( position = 'identity', bins = 30, alpha = 0.4) +
  scale_fill_manual(values=c("blue", "red")) +
  scale_color_manual(values=c("#blue", "red"))
df

The Poisson Distribution

Considering when a rare event happens out of many possibilities.
look at the binomial distribution where the size is large but the probability is very small

binom <- data.frame( 'binom' = rbinom( 100000, 1000, .001 ) ) %>%
  ggplot( aes( x = binom ) ) +
  geom_histogram(binwidth = 0.5)
binom

The Poisson Distribution is described by only 1 parameter, \(\lambda\), the mean \[X \sim \mbox{Poisson}(\lambda)\] the Expectation, \(E[X] = \lambda\)
The Variance, \(\mbox{Var}(X) = \lambda\)

poisson <- data.frame( 'vals' = rpois( 100000, 1 ) ) %>%
  ggplot( aes( vals ) ) +
  geom_histogram(binwidth = 0.5)
poisson

Compare the binomial and poisson distribution

compare <- data.frame('binom'=rbinom( 100000, 1000, .001 ), 'vals'=rpois( 100000, 1 ) ) %>%
  pivot_longer( cols = everything(), names_to = 'Dist', values_to = 'Val' ) %>%
  ggplot( aes( x = Val, fill = Dist ) ) +
  geom_histogram( position = 'identity', alpha = 0.4, binwidth = 1) +
  scale_fill_manual(values=c("blue", "red")) +
  scale_color_manual(values=c("#blue", "red"))
compare

The poisson distribution can have any \(\lambda\) so long as it is a positive value

compare <- data.frame( 'binom' = rbinom( 100000, 1000, .002 ), 
                       'poisson' = rpois( 100000, 2 ) ) %>%
  pivot_longer( cols = everything(), names_to = 'Dist', values_to = 'Val' ) %>%
  ggplot( aes( x = Val, fill = Dist ) ) +
  geom_histogram( position = 'identity', alpha = 0.4, binwidth = 1) +
  scale_fill_manual(values=c("blue", "red")) +
  scale_color_manual(values=c("#blue", "red"))
compare

# Simulate 100,000 draws from Poisson(2)
poisson_sample <- rpois( 100000, 2 )

# Find the percentage of simulated values that are 0
mean( poisson_sample == 0 )

## [1] 0.13583

# Use dpois to find the exact probability that a draw is 0
dpois( 0, 2 )

## [1] 0.1353353

compare <- data.frame( 'X' = rpois( 100000, 1 ),
                       'Y' = rpois( 100000, 2 ),
                       'Z' = rpois( 100000, 1 ) + rpois( 100000, 2 ),
                       'W' = rpois( 100000, 3 ) ) %>%
  pivot_longer( cols = everything(), names_to = 'Dist', values_to = 'Val' ) %>%
  filter( Dist %in% c( 'Z', 'W' )) %>%
  ggplot( aes( x = Val, fill = Dist ) ) +
  geom_histogram( position = 'identity', alpha = 0.4, binwidth = 1) +
  scale_fill_manual(values=c("blue", "red")) +
  scale_color_manual(values=c("#blue", "red"))
compare

The Geometric Distribution

Waiting for a particular event with some probability \[X \sim \mbox{Geom}(p)\]

flips <- rbinom( 100, 1, .1 )
which( flips == 1 )[1]

## [1] 19

#replicate this process 10 times
replicate( 10, which( rbinom( 100, 1, .1 ) == 1 )[ 1 ] )

##  [1] 36 38 11  7 10  7  2 10  1  1

geom <- data.frame( 'geom' = rgeom( 100000, .1 ) ) %>%
  ggplot( aes( x = geom ) ) +
  geom_histogram(binwidth = 1)
geom

The largest value is at geom = 1, where no head has been flipped because no coin has. The distribution steadily decreases to represent the coins that have not flipped heads yet.

#find the mean
round( mean( geom$data$geom ), 0 )

## [1] 9

Expected Value: The average number of tails before the first heads. \[E[X] = \frac{1}{p} - 1\]

#A new machine arrives in a factory. This type of machine is very unreliable: every day, it has a 10% chance of breaking permanently. How long would you expect it to last?
# Find the probability the machine breaks on 5th day or earlier
pgeom(4, .1 )

## [1] 0.40951

# Find the probability the machine is still working on 20th day
1-pgeom( 19,.1)

## [1] 0.1215767

# Calculate the probability of machine working on day 1-30
still_working <- 1-pgeom( c(0:29), .1)

# Plot the probability for days 1 to 30
plot(1:30, still_working)

Foundations of Probability in R

DataCamp: Statistics with R