The normal distribution is famous for being bell-shaped and symmetrical. Its characteristic statistics are the mean, which represents the average value of the data, and the standard deviation, which represents how spread out the data is. The mean sits at the exact center of the curve and has the same value as the median and mode. Roughly 68% of the data falls within one standard deviation of the mean, increasing to approximately 95% within two standard deviations and 99.7% within three standard deviations.
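As a quick check of those proportions (a minimal sketch; the same fractions hold for any mean and standard deviation), pnorm recovers the empirical rule for the standard normal:

pnorm(1) - pnorm(-1)   # ~0.683 of the data within one standard deviation of the mean
pnorm(2) - pnorm(-2)   # ~0.954 within two standard deviations
pnorm(3) - pnorm(-3)   # ~0.997 within three standard deviations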
The binomial distribution models probabilities for independent, repeated trials in which the probability of success stays constant from trial to trial (each such trial is called a Bernoulli trial). It gives the probability of a particular number of successes occurring in a set number of Bernoulli trials by multiplying the probability of one specific sequence with that many successes by the number of ways those successes can be arranged among the trials. If p is the probability of success on one trial and q is the probability of failure such that p + q = 1, N is the number of trials being performed, and r is the desired number of successes, then the binomial probability is given by the following: \({N \choose r}p^r q^{N-r}\)
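As a sanity check of the formula (a small sketch using hypothetical values for p, N, and r), computing it by hand matches R's built-in dbinom:

p <- 0.5; q <- 1 - p                 # hypothetical success and failure probabilities
N <- 10; r <- 3                      # hypothetical number of trials and desired successes
choose(N, r) * p^r * q^(N - r)       # formula computed directly
dbinom(x = r, size = N, prob = p)    # same value from R's built-in function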
The Poisson distribution is useful for measuring the probability of a number of random, independent events happening within a set interval (usually of time, but it can also be of space). It was famously used in the late 1800s to model the number of soldiers per corps who died from horse kicks each year. The only parameter that must be defined ahead of time for a Poisson distribution is the mean number of events that occur in a given interval. If that mean is denoted by the Greek letter \(\lambda\) and the desired number of successes is k, then the Poisson probability is given by the following: \(\frac{\lambda^k e^{-\lambda}}{k!}\)
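The same kind of check works for the Poisson formula (again with hypothetical values for \(\lambda\) and k):

lambda <- 4; k <- 2                      # hypothetical mean rate and target count
lambda^k * exp(-lambda) / factorial(k)   # formula computed directly
dpois(x = k, lambda = lambda)            # same value from R's built-in function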
For a probability distribution, pdf refers to the probability density function and cdf refers to the cumulative distribution function. The probability density function describes the relative likelihood of the random variable taking values near a given target (for a discrete distribution, the analogous probability mass function gives the probability of the variable equaling the target exactly). The cumulative distribution function of a probability distribution is used to calculate the probability that a random variable will be less than or equal to a target value. Since both the Poisson and binomial distributions use discrete random variables (and therefore have probability mass functions instead of probability density functions), I will look at the pdf of the normal distribution, which, for a mean \(\mu\) and standard deviation \(\sigma\), is the following: \(f(x) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2 \sigma^2}}\). At a glance, we can tell that this makes sense: as the distance between the target value and the mean increases, the exponential term approaches (but never equals) zero, making the overall value of the function smaller. This matches what we know about the density of data in a normal distribution, where the density is greatest at the mean and decreases as the value moves above or below it.
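Evaluating that pdf by hand and with dnorm (a sketch using an arbitrary mean and standard deviation) shows the density shrinking as x moves away from the mean:

mu <- 0; sigma <- 1                     # arbitrary mean and standard deviation
f <- function(x) (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
f(0); dnorm(0, mean = mu, sd = sigma)   # density at the mean (~0.399)
f(2); dnorm(2, mean = mu, sd = sigma)   # much smaller density two sd away (~0.054)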
Normal distribution: the key parameters are the mean \(\mu\) and the variance \(\sigma^2\) (the square root of which is the standard deviation \(\sigma\)). R does not require these parameters to be declared; if they are omitted, it assumes a mean of 0 and a standard deviation (and therefore variance) of 1. For example:
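# with no mean or sd supplied, dnorm (and pnorm, rnorm) defaults to the standard normal
dnorm(0)                     # same result as the explicit call below (~0.399)
dnorm(0, mean = 0, sd = 1)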
Binomial distribution: the key parameters are the number of trials n and the probability of an individual trial’s success p. R requires both of these parameters to be declared.
Poisson distribution: the only key parameter is the average rate of occurrence \(\lambda\), which R also requires to be declared (both requirements are illustrated in the short sketch below).
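A minimal sketch (with hypothetical parameter values) of the required arguments for the two discrete distributions:

dbinom(x = 2, size = 5, prob = 0.3)   # both size and prob must be supplied
dpois(x = 1, lambda = 2.5)            # lambda must be supplied
# omitting size, prob, or lambda produces an "argument ... is missing" error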
Normal distribution: the birth weight of newborn babies, height of the adult population, and the total values for multiple dice being rolled at the same time.
Binomial distribution: the number of fraudulent transactions among a fixed number of credit card transactions at a bank, the number of people in a trial who experience side effects from a medication, and the number of images a machine learning model correctly identifies out of a fixed test set.
Poisson distribution: the number of babies that will be born at a hospital (per day/hour/etc.), the number of people who will eat at a restaurant, and hourly website traffic.
Normal Distribution
mu <- 10.5        # mean of 3d6 (three six-sided dice): 3 * 3.5
sigma <- 105/12   # variance of 3d6: 3 * 35/12 (this variable holds the variance, not the sd)
sd <- sqrt(sigma) # standard deviation
x <- seq(from = mu-3*sd, to = mu+3*sd, length.out = 1000)   # grid covering +/- 3 sd around the mean
pdf <- dnorm(x=x, mean=mu, sd=sd)                           # normal density at each grid point
plot(x=x, y=pdf, type="l", col="blue", lwd=2, xlab="3d6 Value", ylab="Density", main = "Normal Distribution of 3d6 (as a Continuous Distribution)")
Binomial Distribution
p <- 0.1666   # probability of success on a single trial (roughly 1/6, e.g. rolling a 6)
n <- 8        # number of independent trials
x <- 0:n      # possible numbers of successes
probs <- dbinom(x=x, size=n, prob=p)   # binomial probability for each count
barplot(height = probs, names.arg=x, col = "red", main="Binomial Distribution (p=0.1666)", xlab="# Successes", ylab="Probability")
Poisson Distribution
lambda <- 3      # average number of events per interval
x_vals <- 1:15   # counts to evaluate (note: omits the possible count of 0)
pmf_vals <- dpois(x=x_vals, lambda=lambda)   # Poisson probability for each count
plot(x=x_vals,y=pmf_vals,type="h",lwd=2,col="green",xlab="Monsters in a Random Encounter", ylab="Probability", main="Distribution of Monsters in a Random Encounter")
Binomial:
pi <- 0.15         # probability of death for a single surgery (note: this masks R's built-in constant pi)
surgeries <- 128   # number of surgeries performed
deaths <- 21       # number of deaths of interest
# P(X >= 21): upper tail of the binomial, evaluated strictly above deaths - 1
pbinom(q=deaths-1, size=surgeries, prob=pi, lower.tail=FALSE)
## [1] 0.3641103
Poisson:
# approximate lambda by multiplying the earlier values of pi and surgeries (lambda = 19.2)
lambda <- 0.15*128
deaths <- 21
# P(X >= 21) under the Poisson approximation
ppois(q=deaths-1, lambda=lambda, lower.tail = FALSE)
## [1] 0.3702241
My answers differed by only about 0.006. This makes sense, since I chose a large number of surgeries (over 100) and a fairly small probability of death for each one. As discussed in our class notes, a large number of trials combined with a small probability of success is exactly the situation in which the Poisson distribution works well as a limiting approximation of a binomial distribution.
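For reference, the gap between the two answers can be computed directly (a small sketch reusing the values above):

n <- 128; p <- 0.15; deaths <- 21
exact  <- pbinom(q = deaths - 1, size = n, prob = p, lower.tail = FALSE)   # exact binomial tail
approx <- ppois(q = deaths - 1, lambda = n * p, lower.tail = FALSE)        # Poisson approximation
abs(approx - exact)   # roughly 0.006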