Part 1

A. Please explain each of the 3 distributions in less than 4 sentences.

Normal Distribution: The Normal distribution is one of the most commonly encountered distributions. It is unimodal, symmetric, and bell-shaped. It is described by its mean, the point around which the distribution is centered, and its standard deviation, which determines how “flat” the curve is. Many real-life variables, such as test scores and height, approximate a normal distribution.
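As a rough illustration of how the mean and standard deviation pin down the curve, the sketch below checks how much of a normal distribution lies within 1, 2, and 3 standard deviations of the mean. The values of `mu` and `sigma` are illustrative assumptions; the resulting shares (roughly 68%, 95%, and 99.7%) are the same for any normal distribution.

```r
# Illustrative parameters (assumed for this sketch)
mu <- 1100
sigma <- 200

# Share of the distribution within 1, 2, and 3 standard deviations of the mean
k <- 1:3
within_k_sd <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)

print(round(within_k_sd, 3))
```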

Binomial Distribution: The Binomial distribution describes the probability of seeing exactly k successes in n trials given a probability of success p. For it to be applicable, 4 conditions must be met: the trials must be independent, the number of trials n must be fixed, each trial must be classified as a success or a failure, and the probability of success p must be the same for every trial. When n is large enough that both np and n(1-p) exceed about 10, it approximates a normal distribution.
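The description above corresponds to the formula \(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\). A minimal sketch, using hypothetical values of n, p, and k, checks that this formula agrees with R's `dbinom()`:

```r
# Hypothetical values chosen only for illustration
n <- 10   # number of trials
p <- 0.3  # probability of success on each trial
k <- 4    # number of successes

# Probability of exactly k successes, computed from the formula and from dbinom()
by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)
by_dbinom <- dbinom(k, size = n, prob = p)

print(all.equal(by_formula, by_dbinom))
```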

Poisson Distribution: The Poisson distribution describes the likelihood of observing k occurrences of an event in a fixed interval, given an average rate of occurrence \(\lambda\). For this distribution to be applicable, the population must be relatively large and fixed, and the events should occur independently of one another.
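The corresponding formula is \(P(X = k) = \lambda^k e^{-\lambda} / k!\). The sketch below, with a hypothetical rate and count, checks it against R's `dpois()`:

```r
# Hypothetical values chosen only for illustration
lambda <- 4.4  # average number of occurrences per interval
k <- 3         # number of occurrences of interest

# Probability of exactly k occurrences, computed from the formula and from dpois()
by_formula <- lambda^k * exp(-lambda) / factorial(k)
by_dpois <- dpois(k, lambda)

print(all.equal(by_formula, by_dpois))
```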

B. Explain what the pdf and cdf of a distribution measures. Pick any of the three distributions (or a distribution from the list above that we have not covered in class), and provide some intuition as to if the pdf formula makes sense or not.

Cumulative Distribution Function (CDF): The CDF gives the probability of a variable being equal to or less than a given value. For instance, in the textbook example of Ann’s SAT score of 1300, the normal CDF gives us 0.841, meaning that 84.1% of all scores are equal to or less than 1300. It can be used for either continuous or discrete data.

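Probability Density Function (PDF): The PDF, in contrast, describes how likely a variable is to fall at or near a specific value. For a discrete variable it gives the probability of exactly that value (strictly speaking, a probability mass function); for a continuous variable it gives a density, and probabilities come from the area under the curve over a range of values. One can integrate the PDF up to a certain value to find the CDF at that value. To stick with our SAT example, the PDF at 1300 gives the height of the curve at that score, which indicates how common scores near 1300 are relative to other scores.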

The distribution of SAT scores is roughly normal, so the example above shows that both the PDF and the CDF make sense for normal distributions.
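The link between the two functions can be checked numerically. The sketch below assumes the SAT scores are normal with mean 1100 and standard deviation 200 (the values used for the plot in part E), integrates the PDF up to 1300, and compares the result with the CDF. The lower integration bound of 10 standard deviations below the mean is an assumption standing in for negative infinity; the area below it is negligible.

```r
mu <- 1100    # assumed mean SAT score
sigma <- 200  # assumed standard deviation
score <- 1300

# Area under the PDF from far in the left tail up to the score
area <- integrate(dnorm, lower = mu - 10 * sigma, upper = score,
                  mean = mu, sd = sigma)$value

# CDF evaluated at the same score (about 0.841)
cdf <- pnorm(score, mean = mu, sd = sigma)

# Height of the PDF at the score: a density (about 0.0012), not a probability
density_at_score <- dnorm(score, mean = mu, sd = sigma)

print(c(area = area, cdf = cdf, density = density_at_score))
```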

C. What are the key parameters that define the 3 distributions above (or a distribution from the list above)? Does R require these key parameters to be declared ? Type the “?distribution” command in R to find out.

Normal Distribution: The key parameters for the normal distribution are the mean and the standard deviation. Though they can be defined explicitly in R, they do not have to be. By default, the normal distribution functions set the mean to 0 and the sd to 1. One could also pass in the Z score directly, which accounts for both parameters.
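A short sketch of both points, using the SAT values from part B (the mean of 1100 and standard deviation of 200 are assumptions taken from that example):

```r
# Omitting mean and sd is the same as passing mean = 0 and sd = 1
default_call <- pnorm(0.5)
explicit_call <- pnorm(0.5, mean = 0, sd = 1)

# Passing a Z score to the default (standard normal) functions
# gives the same answer as passing the raw score with explicit parameters
z <- (1300 - 1100) / 200
p_from_z <- pnorm(z)
p_from_raw <- pnorm(1300, mean = 1100, sd = 200)

print(c(default_call, explicit_call, p_from_z, p_from_raw))
```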

Binomial Distribution: The key parameters for the binomial distribution are the number of trials (size) and the probability of success (prob). These also define the mean and sd. In R they do have to be explicitly defined, since neither has a default value.
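To make the link to the mean and sd concrete: the mean is np and the sd is \(\sqrt{np(1-p)}\). The sketch below uses illustrative values of n and p and compares these formulas with the mean and sd computed directly from the `dbinom()` probabilities.

```r
# Illustrative values (assumed for this sketch)
n <- 100  # number of trials
p <- 0.6  # probability of success

# Mean and sd from the closed-form expressions
mean_formula <- n * p
sd_formula <- sqrt(n * p * (1 - p))

# Mean and sd computed from the full set of probabilities
k <- 0:n
pmf <- dbinom(k, size = n, prob = p)
mean_pmf <- sum(k * pmf)
sd_pmf <- sqrt(sum((k - mean_pmf)^2 * pmf))

print(c(mean_formula, mean_pmf, sd_formula, sd_pmf))
```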

Poisson Distribution: The key parameter for the Poisson distribution is the rate of occurrence (\(\lambda\)). This is in absolute terms (e.g. 4.4 heart attacks in NYC a day on average), so the population size, which should be stable, is already accounted for. You do have to define this explicitly in R.

E. Plot the distribution in part B (3 if you stick close to class notes, or 1 if you venture out). You can begin by reading up on the plot() function, and seeing the coded lecture examples - https://rpubs.com/sharmaar2/Distributions

Normal Distribution:

Below is a plot of the distribution of SAT scores with a mean of 1100, a standard deviation of 200, and assuming it is perfectly normal:

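x <- seq(600, 1600, 10) # SAT scores from 600 to 1600 in steps of 10
px <- dnorm(x, mean = 1100, sd = 200) # normal density at each score

# dnorm() returns a density rather than a probability, so the y-axis is labeled accordingly
barplot(px, names.arg = x, xlab = "SAT Score", ylab = "Density", main = "Normal Distribution of SAT Scores")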

Binomial Distribution:

Let's say there is a medical procedure with a success rate of 60%. If there are n=100 patients undergoing the procedure, what is the probability of seeing k successes? Below you can see a bar plot with the # of successes along the x-axis, and the probability of seeing that many successes along the y-axis.

n <- 100 # Sample size
k <- 10:n # Number of successful trials (whole numbers from 10 to 100)
p <- 0.6 # Probability of success

pkn <- dbinom(k, n, p) # Get vector of probabilities
barplot(pkn, names.arg = k, ylab = "Probability", xlab = "# of Successes", main = "Binomial Distribution of Procedure's Success")

Poisson Distribution:

Below is a rough approximation of the number of quintuplet births in the United States each year. About 10 quintuplet births occur in a given year.

k <- 0:20 # number of occurrences (a count of 0 is also possible)
lambda <- 10 # rate of occurrences
pk <- dpois(k, lambda) # probability of k occurrences

barplot(pk, names.arg = k, xlab = "Number of Occurrences", ylab = "Probability", main = "Poisson Distribution of Quintuplet Births")

Part 2

BACKGROUND: Often, we can model processes using several different probability distributions. For example, we might use the Poisson instead of the binomial (if n>20 and np<10, i.e. large n and small p) as we did in class, the binomial instead of the geometric (both are repetitions of independent Bernoulli trials), or the normal approximation instead of the binomial (if np>10 and nq>10, i.e. n is large). If the assumptions are understood, then the probability results will be nearly identical.

Let’s assume that a hospital’s neurosurgical team performed N procedures for in-brain bleeding last year. x of these procedures resulted in death within 30 days. If the national proportion for death in these cases is pi, then is there evidence to suggest that your hospital’s proportion of deaths is more extreme than the national proportion?

Pick your own values of N, x, and pi. x is necessarily less than or equal to N, and pi is a fixed probability of success. The probability should be greater than or equal to x.

A. Then model both as a binomial and a Poisson, and provide your R code solutions.

N <- 500 # number of procedures performed last year
x <- 50 # number of last year's procedures that resulted in death
pi <- 0.05 # national proportion of deaths from procedure (note: this masks R's built-in constant pi)

#Binomial:
dbinom(x,N,pi) # probability of exactly x deaths
## [1] 1.943414e-06
k <- 0:60 # range of death counts to plot
bpk <- dbinom(k, N, pi) # Probability of k deaths out of N procedures given a national proportion of deaths pi

barplot(bpk, names.arg = k, ylab = "Probability", xlab = "# of Deaths", main = "Binomial")

#Poisson
lambda <- pi*N # find lambda based on national proportion and sample size.

dpois(x, lambda) # probability of exactly x deaths
## [1] 3.602164e-06
ppk <- dpois(k, lambda) # probability of k deaths given rate of occurrence lambda

barplot(ppk, names.arg = k, ylab = "Probability", xlab = "# of Deaths", main = "Poisson")
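Since the question asks whether the hospital's result is more extreme than the national proportion, it may also help to look at the probability of seeing x or more deaths rather than exactly x. The sketch below is one way to do that under the same N, x, and national proportion; the name `p_national` is introduced here only to avoid masking R's built-in `pi`.

```r
N <- 500            # number of procedures performed last year
x <- 50             # number of those procedures that resulted in death
p_national <- 0.05  # national proportion of deaths (hypothetical name for pi above)

# P(X >= x) = 1 - P(X <= x - 1), taken from the upper tail of each distribution
binom_tail <- pbinom(x - 1, size = N, prob = p_national, lower.tail = FALSE)
pois_tail <- ppois(x - 1, lambda = N * p_national, lower.tail = FALSE)

print(c(binomial = binom_tail, poisson = pois_tail))
```

Under both models the tail probability is very small, which points in the same direction as the single-value probabilities above.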

B. Do you get similar answers or not under the two different distributional assumptions, and can you guess why?

In both cases, we see that our local death rate is more extreme than the national proportion. Given the national proportion of 0.05 deaths per procedure, we would expect to see about 25 deaths for 500 procedures, which both graphs demonstrate visually, but at this hospital we see 50, twice the expected number. Though the two methods did not produce the exact same probability of this outcome, both were on the same extremely low (\(10^{-6}\)) order of magnitude. I suspect the difference arises because np = 25 here, which is outside the np<10 guideline given in the background, so the Poisson distribution is not as accurate as the Binomial one. The Poisson also has a slightly larger variance (25 versus np(1-p) = 23.75), which puts a little more probability in the tails.
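One way to probe where the gap between the two models comes from is to compare them across all possible counts, once for the proportion used above and once for a smaller one. This is a sketch under assumed values: the second proportion (0.01) and the helper `max_gap` are hypothetical and introduced only for the comparison.

```r
N <- 500   # number of procedures
k <- 0:N   # every possible number of deaths

# Largest absolute difference between the binomial and Poisson probabilities
max_gap <- function(p) max(abs(dbinom(k, N, p) - dpois(k, N * p)))

gap_p05 <- max_gap(0.05)  # np = 25, outside the np < 10 guideline
gap_p01 <- max_gap(0.01)  # np = 5, inside the guideline

# Variances of the two models for the proportion used above
var_binom <- N * 0.05 * (1 - 0.05)  # 23.75
var_pois <- N * 0.05                # 25

print(c(gap_p05 = gap_p05, gap_p01 = gap_p01))
```

The gap should shrink for the smaller proportion, consistent with the Poisson being a better stand-in for the binomial when p is small.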