Week 3 Discussion - Binomial and Poisson

Part 1

A. Please explain each of the 3 distributions in less than 4 sentences.

The Normal Distribution is what we commonly call the bell curve. It has a symmetric curved shape with most of the data clustered in the middle. For this reason, the mean, the mode, and the median are all the same number.
The Binomial Distribution counts the number of successes in a fixed number of independent trials, where each trial ends in either success or failure. The probability of success is the same in every trial. The number of trials is set in advance.
The Poisson Distribution models the number of events that occur in a fixed interval of time, and it is often used for rare events. The average rate at which these events occur is our parameter lambda. Although individual events occur at random, lambda tells us how many to expect on average.

B. Explain what the pdf and cdf of a distribution measures. Pick any of the three distributions (or a distribution from the list above that we have not covered in class), and provide some intuition as to if the pdf formula makes sense or not.

The PDF is the probability density function of a distribution. It is useful for finding the probability that a continuous random variable falls within a range of values rather than equaling a specific value; that probability is the area under the PDF over the range. If the PDF is high near a value, x, the probability that our random variable is close to x is high.
The CDF is the cumulative distribution function. It gives the probability that a random variable is less than or equal to our chosen value, x. It is often denoted as F(x).
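
The link between the two functions can be checked numerically. The sketch below uses the standard normal purely as an illustrative choice: it integrates the density up to a cutoff and compares the result with `pnorm`. The cutoff and the variable names are made up for this example.

```r
# Illustrative check: the CDF is the area under the PDF up to x.
# The standard normal and the cutoff 1.5 are arbitrary choices for this sketch.
x_cut = 1.5

# Area under the density from -Inf to x_cut (numerical integration)
area_under_pdf = integrate(dnorm, lower = -Inf, upper = x_cut)$value

# CDF evaluated at the same point
cdf_value = pnorm(x_cut)

# The two numbers should agree up to numerical integration error
all.equal(area_under_pdf, cdf_value, tolerance = 1e-4)

# For a continuous variable, the probability of a range is a difference of CDF values
prob_between = pnorm(1) - pnorm(-1)
prob_between
```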

Because the Binomial Distribution is discrete, its PDF is more precisely called a probability mass function (PMF). It gives the probability that the number of successes across our n trials is exactly equal to a chosen value, x. The CDF of the Binomial Distribution gives the probability that the number of successes across our n trials is x or less. Both make sense for the Binomial Distribution: since the number of successes is a whole number, we can ask for the probability of exactly x successes, or of at most x successes.
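
As a rough check on the intuition behind the formula, the binomial probability mass function (the discrete counterpart of a PDF), P(X = x) = choose(n, x) * p^x * (1 - p)^(n - x), can be written out by hand and compared with `dbinom`. The values n = 10 and p = 0.3 are arbitrary, and `manual_pmf` is a name introduced only for this sketch.

```r
# Arbitrary illustrative values (not from the assignment)
n = 10
p = 0.3
x = 0:n

# choose(n, x): number of ways to place x successes among n trials
# p^x * (1 - p)^(n - x): probability of any one such arrangement
manual_pmf = choose(n, x) * p^x * (1 - p)^(n - x)

# Should match R's built-in function
all.equal(manual_pmf, dbinom(x, size = n, prob = p))

# The CDF is the running total of the individual probabilities
all.equal(cumsum(manual_pmf), pbinom(x, size = n, prob = p))

# The probabilities over all possible outcomes sum to 1
sum(manual_pmf)
```

Read this way, the formula counts the arrangements that give x successes and multiplies by the probability of each one, which is why it seems sensible for a fixed number of independent trials.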

C. What are the key parameters that define the 3 distributions above (or a distribution from the list above)? Does R require these key parameters to be declared?

Normal Distribution: X ~ norm(mean = \(\mu\), sd = \(\sigma\)) where mean = \(\mu\) is the mean of the distribution, and sd = \(\sigma\) is the standard deviation. Together these two parameters completely determine the distribution. They tell us the center and spread and dictate what our distribution looks like. Increasing the mean shifts our curve to the right, and changing the standard deviation causes our distribution to spread out or scrunch closer to the center. R does not require them to be declared: if they are not specified, R uses the default values of mean = 0 and sd = 1, which give the standard normal distribution.
Binomial Distribution: X ~ binom(size = n, prob = p) where size = n is the number of trials and prob = p is the probability of success. R has no default values for size or prob, so both must be declared. If size is not an integer, NaN is returned with a warning.
Poisson Distribution: X ~ pois(lambda = \(\lambda\)) where lambda = \(\lambda\) is the average number of events that occur in a time period. This shows us the expected rate of occurrences. R has no default value for lambda, so it must be declared; without it, we would not know how often to expect our rare event to occur.
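
A short sketch of how R treats omitted parameters is shown here. The inputs are arbitrary, and the helper names (`binom_missing`, `pois_missing`, `non_integer_size`) are introduced only for illustration.

```r
# dnorm falls back to mean = 0 and sd = 1 when they are omitted
dnorm(0) == dnorm(0, mean = 0, sd = 1)

# dbinom and dpois have no defaults, so omitting a parameter raises an error;
# tryCatch converts that error into a simple label so the script keeps running
binom_missing = tryCatch(dbinom(2), error = function(e) "error")
pois_missing = tryCatch(dpois(2), error = function(e) "error")
c(binom_missing, pois_missing)

# A non-integer size gives NaN (R also issues a warning, silenced here)
non_integer_size = suppressWarnings(dbinom(2, size = 2.5, prob = 0.5))
is.nan(non_integer_size)
```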

D. Give a few examples of situations that can be modeled with each of the 3 distributions above.

Grades in an undergraduate class, IQ scores, heights, and shoe sizes can be modeled with the normal distribution.
The number of heads that appeared in 50 coin tosses, the number of strikes thrown by a pitcher in 20 pitches, and the number of doors that are answered when you ring the doorbell of 12 houses can be modeled with the binomial distribution.
The number of people admitted to an emergency room for heart attacks in a month, the number of flight cancellations during the week of Thanksgiving, and the number of phone calls made to your small business in a weekend can be modeled with the Poisson distribution.

E. Plot the distribution in part B (3 if you stick close to class notes, or 1 if you venture out).

set.seed(10)
# Normal Distribution

mu_1 = 25
sd_1 = 1

x_1 = seq(from = mu_1 - 3 * sd_1, to = mu_1 + 3 * sd_1, length.out = 100)

# PDF plot
pdf_1 = dnorm(x = x_1, mean = mu_1, sd = sd_1)
plot(x = x_1, y = pdf_1, type = 'l', ylab = "Density", xlab = "x", main = "PDF of Normal Distribution with Mean 25 and SD 1", col = "skyblue")

# CDF plot
cdf_1 = pnorm(q = x_1, mean = mu_1, sd = sd_1)
plot(x = x_1, y = cdf_1, type = 'l', ylab = "Cumulative Probability", xlab = "x", main = "CDF of Normal Distribution with Mean 25 and SD 1", col = "skyblue")

set.seed(10)
# Binomial Distribution - rolling a 6 on a fair die, 50 rolls

n_binom = 50
p_binom = 1/6
x_binom = 0:n_binom

mu_binom = n_binom * p_binom 
sd_binom = sqrt(n_binom * p_binom * (1 - p_binom))

probabilities_binom = dbinom(x = x_binom, size = n_binom, prob = p_binom)

# PMF plot (a discrete distribution has a probability mass function)
barplot(height = probabilities_binom, names.arg = x_binom, col = "skyblue", main = "PMF of Number of 6s Rolled on 50 Rolls of a Fair Die", xlab = "Number of Successes", ylab = "Probability")

# CDF plot (type = "s" draws the step function of a discrete CDF)
cdf_binom = pbinom(x_binom, size = n_binom, prob = p_binom)
plot(x_binom, cdf_binom, type = "s", xlab = "Number of Successes", ylab = "Cumulative Probability", main = "CDF of Number of 6s Rolled on 50 Rolls of a Fair Die", col = "skyblue")

set.seed(10)
# Poisson Distribution 

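# On average, I receive 4 marketing texts in a day

# PMF (vertical lines, since only whole-number counts are possible)
lambda = 4
x = 0:15
pois_probs = dpois(x, lambda)
plot(x, pois_probs, type = "h", col = "skyblue", xlab = "Number of Occurrences", ylab = "Probability", main = "Poisson PMF with Lambda of 4")

# CDF (x is passed explicitly so the horizontal axis shows counts 0 to 15, not index positions)
plot(x, ppois(x, lambda = lambda), type = "s", col = "skyblue", xlab = "Number of Occurrences", ylab = "Cumulative Probability", main = "Poisson CDF with Lambda of 4")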

Part 2 - Convergence of Distributions

Let’s assume that a hospital’s neurosurgical team performed N procedures for in-brain bleeding last year. x of these procedures resulted in death within 30 days. If the national proportion for death in these cases is \(\pi\), then is there evidence to suggest that your hospital’s proportion of deaths is more extreme than the national proportion?

Pick your own values of N, x, and \(\pi\). x is necessarily less than or equal to N, and \(\pi\) is a fixed probability of success. The probability to calculate should be that of a count greater than or equal to x.

Then model both as a binomial and a Poisson, and provide your R code solutions. Do you get similar answers or not under the two different distributional assumptions, and can you guess why?

# My chosen national rate for death in these cases is 12%, N is 30, and x is 8
part_2_N = 30
part_2_pi = 0.12
part_2_x = 8

# As Binomial: P(X >= 8 | n = 30, pi = 0.12)
round(sum(dbinom(x = part_2_x:part_2_N, size = part_2_N, prob = part_2_pi)), digits = 4)

## [1] 0.0221

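# As Poisson: 0.12 is equivalent to 12 deaths out of 100 cases. Our sample is 30. Number of deaths is 8.
pois_t = 30 / 100 # 30 sample / 100 cases
lambda_t = pois_t * 12 # 12 deaths in 100 cases, so we expect 3.6 deaths in 30 cases

# P(X >= 8) = P(X > 7); the upper tail is used because a Poisson count has no upper limit of 30
round(ppois(q = 8 - 1, lambda = lambda_t, lower.tail = FALSE), digits = 4)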

## [1] 0.0308

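My answers are similar but not identical when I model the situation with a Binomial Distribution vs a Poisson Distribution. If our hospital's true death rate matched the national rate of 12%, the probability of seeing 8 or more deaths in 30 procedures would be 0.0221 under Binomial and 0.0308 under Poisson. Both values are below 0.05, so either model gives some evidence that our hospital's proportion of deaths is higher than the national proportion. It makes sense that the answers are close: both models have the same expected number of deaths (30 * 0.12 = 3.6), and the Poisson approximates the Binomial when the number of trials is large and the probability of success is small. It also makes sense that they are not identical. The Binomial Distribution assumes a fixed number of trials, so the count can never exceed 30, while a Poisson Distribution does not have a predetermined number of trials and places no upper limit on the count. With only 30 trials and a rate of 12%, the approximation is rough, and the Poisson has the larger variance (3.6 versus 30 * 0.12 * 0.88, which is about 3.17), which puts more probability in the upper tail. We should use Binomial when we know the number of trials that will take place, and each has the same chance of success. We should use Poisson when we have a specific time frame, want to know how many times our event occurred, and events occur at a constant average rate.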
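
One way to see how the two models relate is to hold the expected number of deaths fixed at 3.6 and let the number of procedures grow while the death rate shrinks. The sketch below does this for P(X >= 8). Only N = 30 comes from the example above; the other values of N are hypothetical, and the names `target_mean`, `n_values`, and `comparison` are introduced just for this illustration.

```r
# Upper-tail probability P(X >= 8) under both models, holding the mean
# n * p fixed at 3.6 while n grows and p shrinks.
# Only n = 30 comes from the example above; the other n values are hypothetical.
target_mean = 3.6
n_values = c(30, 100, 1000, 10000)

# P(X >= 8) = P(X > 7), so take the upper tail above 7
binom_tail = pbinom(7, size = n_values, prob = target_mean / n_values,
                    lower.tail = FALSE)
pois_tail = ppois(7, lambda = target_mean, lower.tail = FALSE)

comparison = data.frame(n = n_values,
                        p = target_mean / n_values,
                        binomial = round(binom_tail, 4),
                        poisson = round(pois_tail, 4))
comparison

# Variances at n = 30: binomial n * p * (1 - p) versus Poisson lambda
c(binomial = 30 * 0.12 * (1 - 0.12), poisson = target_mean)
```

In the first row the two columns reproduce 0.0221 and 0.0308, and the binomial column should move toward the Poisson value as n increases.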