Bernoulli Process: A Poisson Data Generator

Introduction
Understanding Bernoulli’s process (BP)
Problem with the above model
Understanding the underlying assumptions of Poisson process
Relationship with exponential distribution

set.seed(100)

Introduction

Poisson distribution is a probability distribution function, which is commonly used to model discrete events. A Poisson process generator is a generator which emits out the events based on Poisson distribution. It can be viewed as a recurrent Bernoulli’s process with occurrence of success indicating occurrence of event.

Understanding Bernoulli’s process (BP)

Bernoulli’s process is the fundamental building block above which Poisson process is built.

A simple example of BP is coin tossing (coin may be unfair also). We suppose that occurrence of HEAD (H) is a success and that an event has occurred. Let us assume that the coin is massively unfair and chances that it will fall HEAD is only 10% (p = 0.1).

We assume that the coin is tossed 100 times in an hour at a constant rate of one toss every 1/100th of hour.

We run a simulation of the whole process 10000 times (meaning simulation is for 10000 hours). The printed runs of “H and”T represent the occurrence of HEAD and TAIL over 1 hour for first 5 hours.

samp_100 <- rerun(10000, sample(c(1, 0), 100, replace = TRUE, prob = c(0.1, 0.9)))

samp_100[1:5]

## [[1]]
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
##  [36] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
##  [71] 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
## 
## [[2]]
##   [1] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [71] 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
## 
## [[3]]
##   [1] 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## 
## [[4]]
##   [1] 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
## 
## [[5]]
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
##  [36] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0
##  [71] 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0

Number of successes (1) in 1 hour

We can calculate the sum of occurrence of success (event) in 1 hour for each hour. We will have 10000 values. We show the values for first 100 one hour interval.

sum_success <- samp_100 %>% map_dbl(~ sum(.))
sum_success[1:100]

##   [1]  9 12 10 11 14 11 10 11 11 14 13 14  8  9 12 10 12 16  7 12  2  8  9
##  [24]  7 10 12 10  7 10 11 10  9 11 13 10 11 10  9 12 13  6 11 10 15 10 12
##  [47]  8  6  8  7 12  7 11 10 11  6  8 12 11  7  9  7  8 10 10 15  9 14 17
##  [70] 11  7 11  4 11  9  7 11  5 14 12  6 13 10  8 13  6 12  9 12 11  5  9
##  [93] 11  7 13 15  9  9  5 12

Average of number of events in 1 hour

Before, proceeding further, we can safely assume, that the average of the number of events occurring over 1 hour is 100 (number of coin tosses in an hour) multiplied by 0.1 (chances of success per coin toss) = 10.

Problem with the above model

The above model is the approximation of Poisson Process. The problem with the above model is that there is a smallest time interval (1/100 of an hour), below which we do not have a value for occurrence of the event.

We can better approximate the Poisson Process by subdividing the 1 hour more times (lets say, 1000) (more the subdivisions, better approximation it is). It means, that we are tossing a different coin 1000 times in 1 hour, which has the chance of success of 0.01 (10/1000), which leads to the same average number of events occurring over 1 hour (1000 * 0.01) of 10.

Lets run the simulation again

samp_1000 <- rerun(10000, sample(c(1, 0), 1000, replace = TRUE, prob = c(0.01, 0.99)))

samp_1000[1]

## [[1]]
##    [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##   [35] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##   [69] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [103] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [137] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [171] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [205] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [239] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [273] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [307] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [341] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [375] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [409] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
##  [443] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [477] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [511] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [545] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [579] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [613] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [647] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [681] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [715] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [749] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [783] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [817] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [851] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [885] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [919] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [953] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [987] 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Lets calculate the number of events occurring in 1 hour

sum_success_1000 <- samp_1000 %>% map_dbl(~ sum(.))
sum_success_1000[1:100]

##   [1]  6 13  8 13  8  9 10 15  8 13  8 13  9 11 12  4  8 12 16 16  6 14 10
##  [24]  6 10  5 11 14  7 13 10 10  9  6 11  7 13  7  4 11  8 11 14 13 13 15
##  [47] 19 10 13  7  4 10 15 12  7  9 13  8  7 11 11  8 12  9  8  9  7 16  5
##  [70] 14  8  4  9 12 12 11  3 19 12 11 13  6 14 10 14  7 15  9 12  8  5  9
##  [93] 10 12  6  9  7  9 15  8

Below is the histogram of the number of events occurring in 1 hour

hist(sum_success_1000, xlab = "Number of events occuring in 1 hour")

Comparision with Poisson Process with average number of events in 1 hour as 10

data_frame(binom = sum_success_1000, pois = rpois(10000, lambda = 10)) %>% gather("process", "num_event") %>% ggplot(aes(x = num_event, colour  = process)) + geom_density() + xlab("Number of events occurring in 1 hour")

BOTH APPEARS ALMOST SAME.

Understanding the underlying assumptions of Poisson process

From above we can intuitively derive the assumptions for us to tell that any event generating process is Poisson process.

The underlying process is random, which means that the outcome (success or failure) cannot be predicted till the time the outcome is observed.
The probability that success will occur remains constant over trials.
At a particular time point the probability of success happening is 0.
At a particular time point only one event can occur.
Number of events that occur inside a particular time interval increases with the increasing time duration.

Relationship with exponential distribution

When a process follows Poisson distribution, the time to event occurence follows exponential distribution.

We are concerned about the inter-event time interval.

time <- samp_1000 %>% map(~ diff(c(0, which(. == 1)))) %>% unlist() %>% `*`(0.001)

data_frame(bernoulli = time, exp = rexp(length(time), rate = 10)) %>% gather("process", "inter_event_time") %>% ggplot(aes(x = inter_event_time, colour  = process)) + geom_density() + xlab("Inter-event time interval")

The above simulation shows that the distribution generated by bernoulli’s process is same as exponential distribution with rate = 10.

One important property of the distribution is that, the time that has elapsed without any event occurring has got no bearing on the time after which next event will occur. All processes, which generate events in a predictable manner do not follow exponential distribution and are not Poisson processes.

Bye.