The normal distribution is famous for being bell-shaped and symmetrical. Its characteristic statistics are the mean, which represents the average value of the data, and the standard deviation, which represents how spread out the data is. The mean sits at the exact center of the curve and has the same value as the median and mode. Roughly 68% of the data falls within one standard deviation of the mean, increasing to approximately 95% within two standard deviations and 99.7% within three standard deviations.
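As a quick check of those proportions (a minimal sketch; the same fractions hold for any mean and standard deviation), pnorm recovers the empirical rule for the standard normal:

pnorm(1) - pnorm(-1)   # ~0.683 of the data within one standard deviation of the mean
pnorm(2) - pnorm(-2)   # ~0.954 within two standard deviations
pnorm(3) - pnorm(-3)   # ~0.997 within three standard deviations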
The binomial distribution models probabilities for independent, repeated trials in which the probability of success stays constant from trial to trial (each such trial is called a Bernoulli trial). It gives the probability of a particular number of successes occurring in a set number of Bernoulli trials by multiplying the probability of one specific sequence with that many successes by the number of ways those successes can be arranged among the trials. If p is the probability of success on one trial and q is the probability of failure such that p + q = 1, N is the number of trials being performed, and r is the desired number of successes, then the binomial probability is given by the following: \({N \choose r}p^r q^{N-r}\)
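As a sanity check of the formula (a small sketch using hypothetical values for p, N, and r), computing it by hand matches R's built-in dbinom:

p <- 0.5; q <- 1 - p                 # hypothetical success and failure probabilities
N <- 10; r <- 3                      # hypothetical number of trials and desired successes
choose(N, r) * p^r * q^(N - r)       # formula computed directly
dbinom(x = r, size = N, prob = p)    # same value from R's built-in function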
The Poisson distribution is useful for measuring the probability of a number of random, independent events happening within a set interval (usually of time, but it can also be of space). It was famously used in the late 1800s to model the number of soldiers per corps who died from horse kicks each year. The only parameter that must be defined ahead of time for a Poisson distribution is the mean number of events that occur in a given interval. If that mean is denoted by the Greek letter \(\lambda\) and the desired number of successes is k, then the Poisson probability is given by the following: \(\frac{\lambda^k e^{-\lambda}}{k!}\)
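The same kind of check works for the Poisson formula (again with hypothetical values for \(\lambda\) and k):

lambda <- 4; k <- 2                      # hypothetical mean rate and target count
lambda^k * exp(-lambda) / factorial(k)   # formula computed directly
dpois(x = k, lambda = lambda)            # same value from R's built-in function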
For a probability distribution, pdf refers to the probability density function and cdf refers to the cumulative distribution function. The probability density function describes the relative likelihood of the random variable taking values near a given target (for a discrete distribution, the analogous probability mass function gives the probability of the variable equaling the target exactly). The cumulative distribution function of a probability distribution is used to calculate the probability that a random variable will be less than or equal to a target value. Since both the Poisson and binomial distributions use discrete random variables (and therefore have probability mass functions instead of probability density functions), I will look at the pdf of the normal distribution, which, for a mean \(\mu\) and standard deviation \(\sigma\), is the following: \(f(x) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2 \sigma^2}}\). At a glance, we can tell that this makes sense: as the distance between the target value and the mean increases, the exponential term approaches (but never equals) zero, making the overall value of the function smaller. This matches what we know about the density of data in a normal distribution, where the density is greatest at the mean and decreases as the value moves above or below it.
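Evaluating that pdf by hand and with dnorm (a sketch using an arbitrary mean and standard deviation) shows the density shrinking as x moves away from the mean:

mu <- 0; sigma <- 1                     # arbitrary mean and standard deviation
f <- function(x) (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
f(0); dnorm(0, mean = mu, sd = sigma)   # density at the mean (~0.399)
f(2); dnorm(2, mean = mu, sd = sigma)   # much smaller density two sd away (~0.054)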
Normal distribution: the key parameters are the mean \(\mu\) and the variance \(\sigma^2\) (the square root of which is the standard deviation \(\sigma\)). R does not require these parameters to be declared; if they are omitted, it assumes a mean of 0 and a standard deviation (and therefore variance) of 1. For example:
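# with no mean or sd supplied, dnorm (and pnorm, rnorm) defaults to the standard normal
dnorm(0)                     # same result as the explicit call below (~0.399)
dnorm(0, mean = 0, sd = 1)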
Binomial distribution: the key parameters are the number of trials n and the probability of an individual trial’s success p. R requires both of these parameters to be declared.
Poisson distribution: the only key parameter is the average rate of occurrence \(\lambda\), which R also requires to be declared (both requirements are illustrated in the short sketch below).
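A minimal sketch (with hypothetical parameter values) of the required arguments for the two discrete distributions:

dbinom(x = 2, size = 5, prob = 0.3)   # both size and prob must be supplied
dpois(x = 1, lambda = 2.5)            # lambda must be supplied
# omitting size, prob, or lambda produces an "argument ... is missing" error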
Normal distribution: the birth weight of newborn babies, height of the adult population, and the total values for multiple dice being rolled at the same time.
Binomial distribution: the number of fraudulent transactions among a fixed number of credit card transactions at a bank, the number of people in a trial who experience side effects from a medication, and the number of images a machine learning model correctly identifies out of a fixed test set.
Poisson distribution: the number of babies that will be born at a hospital (per day/hour/etc.), the number of people who will eat at a restaurant, and hourly website traffic.
Normal Distribution
mu <- 10.5        # mean of 3d6 (three six-sided dice): 3 * 3.5
sigma <- 105/12   # variance of 3d6: 3 * 35/12 (this variable holds the variance, not the sd)
sd <- sqrt(sigma) # standard deviation
x <- seq(from = mu-3*sd, to = mu+3*sd, length.out = 1000)   # grid covering +/- 3 sd around the mean
pdf <- dnorm(x=x, mean=mu, sd=sd)                           # normal density at each grid point
plot(x=x, y=pdf, type="l", col="blue", lwd=2, xlab="3d6 Value", ylab="Density", main = "Normal Distribution of 3d6 (as a Continuous Distribution)")
Binomial Distribution
p <- 0.1666   # probability of success on a single trial (roughly 1/6, e.g. rolling a 6)
n <- 8        # number of independent trials
x <- 0:n      # possible numbers of successes
probs <- dbinom(x=x, size=n, prob=p)   # binomial probability for each count
barplot(height = probs, names.arg=x, col = "red", main="Binomial Distribution (p=0.1666)", xlab="# Successes", ylab="Probability")
Poisson Distribution
lambda <- 3      # average number of events per interval
x_vals <- 1:15   # counts to evaluate (note: omits the possible count of 0)
pmf_vals <- dpois(x=x_vals, lambda=lambda)   # Poisson probability for each count
plot(x=x_vals,y=pmf_vals,type="h",lwd=2,col="green",xlab="Monsters in a Random Encounter", ylab="Probability", main="Distribution of Monsters in a Random Encounter")
Binomial:
pi <- 0.15         # probability of death for a single surgery (note: this masks R's built-in constant pi)
surgeries <- 128   # number of surgeries performed
deaths <- 21       # number of deaths of interest
# P(X >= 21): upper tail of the binomial, evaluated strictly above deaths - 1
pbinom(q=deaths-1, size=surgeries, prob=pi, lower.tail=FALSE)
## [1] 0.3641103
Poisson:
# approximate lambda by multiplying the earlier values of pi and surgeries (lambda = 19.2)
lambda <- 0.15*128
deaths <- 21
# P(X >= 21) under the Poisson approximation
ppois(q=deaths-1, lambda=lambda, lower.tail = FALSE)
## [1] 0.3702241
My answers differed by only about 0.006. This makes sense, since I chose a large number of surgeries (over 100) and a fairly small probability of death for each one. As discussed in our class notes, a large number of trials combined with a small probability of success is exactly the situation in which the Poisson distribution works well as a limiting approximation of a binomial distribution.
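For reference, the gap between the two answers can be computed directly (a small sketch reusing the values above):

n <- 128; p <- 0.15; deaths <- 21
exact  <- pbinom(q = deaths - 1, size = n, prob = p, lower.tail = FALSE)   # exact binomial tail
approx <- ppois(q = deaths - 1, lambda = n * p, lower.tail = FALSE)        # Poisson approximation
abs(approx - exact)   # roughly 0.006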