Part I

A. Please explain each of the 3 distributions in less than 4 sentences.

  • Normal Distribution: A bell-shaped, symmetric, continuous distribution defined by two parameters, the mean (μ) and the standard deviation (σ). The mean (μ) locates the center of the peak, while the standard deviation (σ) determines the spread (i.e., how squished or spread out the curve is). It is ubiquitous because many real-world measurements are approximately normal.

  • Binomial Distribution: Models the number of successes in a fixed number of trials. Four requirements must be met: the number of trials is fixed, each outcome is binary (success or failure), the probability of success is the same for every trial, and the trials are independent. If all four conditions hold, the count of successes follows a Binomial Distribution. The probability of exactly k successes in n trials, each with success probability p, is:
    \[ P(k) = \binom{n}{k} p^k (1-p)^{n-k} \]

  • Poisson Distribution: Models the number of events occurring in a fixed interval of time or space. The events must occur independently and at a constant average rate.
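
For comparison with the binomial formula above, the standard Poisson probability of exactly k events, with λ the average number of events per interval, can be written as shown here. The second line states the usual link between the two models: with p = λ/n and n growing large, the binomial probability approaches the Poisson one.

```latex
\[ P(k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots \]
\[ \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k e^{-\lambda}}{k!} \]
```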

B. Explain what the pdf and cdf of a distribution measures. Pick any of the three distributions (or a distribution from the list above that we have not covered in class), and provide some intuition as to if the pdf formula makes sense or not.

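  • PDF (Probability Density Function): For continuous distributions, the PDF gives the relative likelihood of the variable taking a value near a given point; the height of the curve is a density, not a probability. The area under the PDF curve over an interval gives the probability of the variable falling in that interval. For discrete distributions such as the Binomial and Poisson, the analogous function is the probability mass function (PMF), which gives the probability of each exact value.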

  • CDF (Cumulative Distribution Function): Gives the probability that the random variable is less than or equal to a specific value. It accumulates probability from the left tail up to that point, so it never decreases as it rises from 0 to 1.

  • (Note: I tried this with the Poisson Distribution formula and my head exploded. Not sure why.) In a Binomial Distribution, the formula makes sense because \(\binom{n}{k}\) counts the ways that k successes can be arranged over n trials. Then \(p^k\) is the probability of k successes in one particular arrangement, and \((1-p)^{n-k}\) is the probability of the remaining \((n-k)\) failures. Because the trials are independent, these pieces are multiplied together to give the probability of exactly k successes.
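
The pieces of the binomial formula, and the relationship between the PDF/PMF and the CDF, can be checked numerically. This is a minimal sketch, assuming illustrative values n = 10, k = 3 and p = 0.5 that are not part of the assignment; it uses only base R functions (choose(), dbinom(), pbinom(), dnorm(), pnorm(), integrate()).

```r
# Illustrative values (assumptions, not from the assignment)
n <- 10
k <- 3
p <- 0.5

# Binomial formula assembled piece by piece
ways      <- choose(n, k)        # arrangements of k successes in n trials
p_success <- p^k                 # k successes in one particular arrangement
p_failure <- (1 - p)^(n - k)     # the remaining n - k failures
by_hand   <- ways * p_success * p_failure

# Should agree with R's built-in binomial probability
print(all.equal(by_hand, dbinom(k, size = n, prob = p)))

# CDF = accumulated probability: summing P(0), ..., P(k) matches pbinom()
cdf_by_sum <- sum(dbinom(0:k, size = n, prob = p))
print(all.equal(cdf_by_sum, pbinom(k, size = n, prob = p)))

# Continuous case: the area under the normal PDF between -1 and 1
# equals the difference of the CDF at the two endpoints
area <- integrate(dnorm, lower = -1, upper = 1)$value
print(all.equal(area, pnorm(1) - pnorm(-1), tolerance = 1e-6))
```

Each all.equal() call compares a hand-built quantity with the built-in one, so a result other than TRUE would point to a mistake in how the formula was assembled.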

C. What are the key parameters that define the 3 distributions above (or a distribution from the list above)? Does R require these key parameters to be declared?

  • Normal Distribution: Parameters are the mean (μ) and the standard deviation (σ). R does not require these parameters to be declared: functions such as dnorm() and pnorm() default to mean = 0 and sd = 1 (the standard normal) when they are omitted.
  • Binomial Distribution: Parameters are the number of trials (n) and probability of success (p). Yes, R requires both; functions such as dbinom() take them as size and prob and have no defaults for them.
  • Poisson Distribution: Parameter is the average rate of events (λ). Yes, R requires the value for λ; functions such as dpois() take it as lambda with no default.
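
The difference between optional and required parameters can be seen by calling the density functions with the parameters left out. This is a small sketch using base R only; the evaluation points 1.5 and 3 are arbitrary illustrative values.

```r
# dnorm() has defaults mean = 0 and sd = 1, so these two calls agree
same_default <- identical(dnorm(1.5), dnorm(1.5, mean = 0, sd = 1))
print(same_default)

# dbinom() and dpois() have no defaults for size/prob or lambda,
# so leaving them out raises an error; tryCatch() captures the message
binom_msg <- tryCatch(dbinom(3), error = function(e) conditionMessage(e))
pois_msg  <- tryCatch(dpois(3),  error = function(e) conditionMessage(e))
print(binom_msg)
print(pois_msg)
```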

D. Give a few examples of situations that can be modeled with each of the 3 distributions above.

  • Normal Distribution: Examples include heights of men and women in a population, blood pressure readings, test scores on a standardized exam, etc.
  • Binomial Distribution: Examples include flipping a coin 100 times and counting the heads, shooting 20 free throws and counting the successes, rolling a pair of dice a fixed number of times and counting how often they add to 7, etc.
  • Poisson Distribution: Examples include number of EMS calls per hour, number of work orders generated in the facilities department of Boston College over a fixed period, number of students walking through the quad per day, etc.

E. Plot the distribution in part B (3 if you stick close to class notes, or 1 if you venture out). You can begin by reading up on the plot() function, and seeing the coded lecture examples - https://rpubs.com/sharmaar2/Distributions

# Normal PDF
# Parameters: mean = 0, sd = 1 (standard normal)
mu <- 0
sigma <- 1
x_norm <- seq(-3, 3, length.out = 100)

# Calculate PDF values
pdf_norm <- dnorm(x_norm, mean = mu, sd = sigma)

# Plot
plot(x_norm, pdf_norm, type = "l", lwd = 2, col = "green",
     main = "Normal PDF \n(μ=0, σ=1)",
     xlab = "x", ylab = "Density")

# Normal CDF
# Parameters: mean = 0, sd = 1 (standard normal)
mu <- 0
sigma <- 1
x_norm <- seq(-4, 4, length.out = 100)

# Calculate CDF values
cdf_norm <- pnorm(x_norm, mean = mu, sd = sigma)

# Plot
plot(x_norm, cdf_norm, type = "l", lwd = 2, col = "orange",
     main = "Normal CDF \n(μ=0, σ=1)",
     xlab = "x", ylab = "Cumulative Probability")

# Binomial PMF (the discrete counterpart of a PDF)
# Parameters: n = 25 free throws, p = 0.6 probability of success
n <- 25
p <- 0.6
x_binom <- 0:n

# Calculate PMF values
pmf_binom <- dbinom(x_binom, size = n, prob = p)

# Plot
plot(x_binom, pmf_binom, type = "h", lwd = 2, col = "blue",
     main = "Binomial PMF (n=25, p=0.6)",
     xlab = "Number of Successes", ylab = "Probability",
     ylim = c(0, max(pmf_binom) * 1.1))
points(x_binom, pmf_binom, pch = 20, col = "blue")
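
A Poisson plot in the same style would complete the set of three distributions. This is a minimal sketch, assuming an illustrative rate of lambda = 4 events per interval and an x range of 0 to 15; neither value comes from the assignment, and the names lambda_demo, x_pois and pmf_pois are chosen here for illustration.

```r
# Poisson PMF
# Parameter: lambda = 4 events per interval (illustrative assumption)
lambda_demo <- 4
x_pois <- 0:15

# Calculate PMF values
pmf_pois <- dpois(x_pois, lambda = lambda_demo)

# Plot (vertical bars plus points, matching the binomial plot)
plot(x_pois, pmf_pois, type = "h", lwd = 2, col = "purple",
     main = "Poisson PMF (lambda = 4)",
     xlab = "Number of Events", ylab = "Probability",
     ylim = c(0, max(pmf_pois) * 1.1))
points(x_pois, pmf_pois, pch = 20, col = "purple")
```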

Part II

Binomial Approach

# My selected parameters
N <- 60               # Number of trials
x <- 12               # Brain bleeds and death within 30 days
pi <- 0.01            # National proportion (1% mortality); note this masks R's built-in constant pi

# Binomial Model
# Task: Compute P(X >= 12) when the national proportion is 0.01
# pbinom(x - 1, ...) on its own is the lower tail P(X <= 11);
# lower.tail = FALSE returns the upper tail P(X >= 12) instead
prob_binomial <- pbinom(x - 1, size = N, prob = pi, lower.tail = FALSE)

cat("Binomial Analysis:\n")
## Binomial Analysis:
cat("Expected deaths:", N * pi, "\n")
## Expected deaths: 0.6
cat("Observed deaths:", x, "\n")
## Observed deaths: 12
cat("P(X >= 12 | pi = 0.01):", signif(prob_binomial, 3), "\n")
## P(X >= 12 | pi = 0.01): 8.97e-13
# Compare to alpha (significance level) of 0.05. This is a concept I only just
# read about now in my quest to write this code. It gives a threshold for
# deciding whether our results are significant.
# Display a conclusion: Yes significance or No significance
if (prob_binomial < 0.05) {
  cat("There is evidence to suggest hospital proportion is higher than national (p < 0.05)\n")
} else {
  cat("There is no strong evidence hospital differs from national proportion\n")
}
## There is evidence to suggest hospital proportion is higher than national (p < 0.05)

Poisson Approach

# My selected parameters (from above)
N <- 60               # Number of trials
x <- 12               # Brain bleeds and death within 30 days
pi <- 0.01            # National proportion (1% mortality)

# Poisson Model: Lambda = expected number of deaths. In other words, we expect
# 1 death out of every 100 trials, or 0.6 deaths in 60 trials
lambda <- N * pi

# Probability P(X >= 12) when lambda = 0.6
prob_poisson <- 1 - ppois(x - 1, lambda = lambda)

cat("\nPoisson Analysis:\n")
## 
## Poisson Analysis:
cat("Lambda (expected deaths):", lambda, "\n")
## Lambda (expected deaths): 0.6
cat("Observed deaths:", x, "\n")
## Observed deaths: 12
cat("P(X >= 12 | lambda = 0.6):", prob_poisson, "\n")
## P(X >= 12 | lambda = 0.6): 2.614131e-12
if (prob_poisson < 0.05) {
  cat("Evidence suggests hospital proportion is higher than national (p < 0.05)\n")
} else {
  cat("No strong evidence hospital differs from national proportion\n")
}
## Evidence suggests hospital proportion is higher than national (p < 0.05)

Conclusion

Both approaches lead to the same conclusion. With N = 60 and pi = 0.01 only about 0.6 deaths are expected, so 12 observed deaths is extremely unlikely under either model: P(X >= 12) is about 9e-13 with the Binomial Model and about 2.6e-12 with the Poisson Model, both far below 0.05. I kept toying with the parameters along the way, beginning with high values of N (>1000) and then landing on a lower N of 60. The two probabilities are not identical because the Poisson Model only approximates the Binomial Model; the approximation is best when N is large and pi is small.
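
One way to see how the two models relate is to compute the same upper-tail probability, P(X >= 12), under both for several values of N. The sketch below is illustrative only: the helper name upper_tail_compare is hypothetical, N = 60 is the value used above, and the larger values 1000 and 5000 are assumptions chosen to echo the earlier experiments with high N.

```r
# Upper-tail probability P(X >= x) under both models
# (helper name is hypothetical, introduced for illustration)
upper_tail_compare <- function(N, p, x) {
  c(N        = N,
    lambda   = N * p,   # Poisson rate matched to the binomial mean
    binomial = pbinom(x - 1, size = N, prob = p, lower.tail = FALSE),
    poisson  = ppois(x - 1, lambda = N * p, lower.tail = FALSE))
}

# One row per value of N, with the proportion and threshold held fixed
results <- t(sapply(c(60, 1000, 5000), upper_tail_compare, p = 0.01, x = 12))
print(signif(results, 4))
```

Both columns use lower.tail = FALSE so that the same event is compared in each row; mixing a lower tail in one column with an upper tail in the other would make the models look far more different than they are.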