Normal Distribution: The Normal distribution is the most common distribution we see. It is unimodal, symmetric, and shaped like a bell curve. It is described by the mean, the point around which the distribution is centered, and the standard deviation, which determines how spread out (“flat”) the curve is. Many real-life variables, such as test scores and height, approximate a normal distribution.
Binomial Distribution: The Binomial distribution describes the probability of seeing exactly k successes in n trials given a probability of success p. For it to be applicable, four conditions must be met: each trial must be independent, the number of trials n must be fixed, each trial must be classified as a success or failure, and the probability of success p must be constant. When the number of trials n is large enough, it approximates a normal distribution.
Poisson Distribution: The Poisson distribution describes the likelihood of observing k occurrences of an event given an average rate of occurrence \(\lambda\) over a fixed interval. For this distribution to be applicable, the population must be relatively large and fixed.
Cumulative Distribution Function (CDF): The CDF gives the probability of a variable being equal to or less than a given value. For instance, in the textbook example of Ann’s SAT score of 1300, the normal CDF gives us 0.841, meaning that 84.1% of all scores are equal to or less than 1300. It can be used for either continuous or discrete data.
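Probability Density Function (PDF): The PDF, in contrast, describes how likely a variable is to fall near a specific value. For a continuous variable it returns a density rather than a probability, since the probability of any single exact value is zero; for a discrete variable the analogous probability mass function gives the probability of exactly that value. One can integrate the PDF up to a certain value to find the CDF at that value. To stick with our SAT example, the PDF would describe the density of scores at 1300.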
The distribution of SAT scores is roughly normal, so the examples above show that both the PDF and the CDF make sense for normal distributions.
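To make the CDF and PDF descriptions above concrete, here is a minimal sketch using R's built-in `pnorm`, `dnorm`, and `integrate`. It assumes the SAT parameters used in the plot later in this post (mean 1100, sd 200); the variable names are my own and purely illustrative.

```r
# Assumed SAT parameters (same values as the plot later in this post)
sat_mean <- 1100
sat_sd <- 200
score <- 1300

# CDF: probability of a score at or below 1300
cdf_at_score <- pnorm(score, mean = sat_mean, sd = sat_sd)
round(cdf_at_score, 3)  # 0.841

# PDF: density (not a probability) at a score of 1300
pdf_at_score <- dnorm(score, mean = sat_mean, sd = sat_sd)

# Integrating the standard normal PDF up to the Z score recovers the CDF value
z <- (score - sat_mean) / sat_sd
area <- integrate(dnorm, lower = -Inf, upper = z)$value
all.equal(area, cdf_at_score, tolerance = 1e-4)
```

The integration is done on the standardized scale because numerical integration over an infinite range tends to behave best when the bulk of the density sits near zero.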
Normal Distribution: The key parameters for the normal distribution are the mean and standard deviation. Though they can be defined explicitly in R, they do not have to be. The normal distribution functions will default to a mean of 0 and an sd of 1. One could also pass in the Z score directly, which accounts for both parameters.
Binomial Distribution: The key parameters for the binomial distribution are the number of trials and the probability of success. These also define the mean and sd. In R they do have to be explicitly defined.
Poisson Distribution: The key parameter for the Poisson distribution is the rate of occurrence (\(\lambda\)). This is in absolute terms (e.g. 4.4 heart attacks in NYC a day on average), so the population size, which should be stable, is already accounted for. You do have to define this explicitly in R.
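The parameter behaviour described above can be sketched as follows. The numbers are borrowed from examples elsewhere in this post (the SAT score of 1300, a procedure with n = 100 and p = 0.6, and a rate of 4.4 events per day); the variable names are illustrative only.

```r
# Normal: mean and sd default to 0 and 1, so a Z score can be passed directly
z <- (1300 - 1100) / 200                        # Z score of the SAT example
p_default <- pnorm(z)                           # uses mean = 0, sd = 1
p_explicit <- pnorm(1300, mean = 1100, sd = 200)
all.equal(p_default, p_explicit)                # TRUE

# Binomial: size (n) and prob (p) have no defaults and must be supplied
n <- 100
p <- 0.6
binom_mean <- n * p                             # mean implied by n and p
binom_sd <- sqrt(n * p * (1 - p))               # sd implied by n and p
dbinom(60, size = n, prob = p)                  # probability of exactly 60 successes

# Poisson: lambda has no default and must be supplied
dpois(4, lambda = 4.4)                          # probability of exactly 4 events
```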
Normal Distribution: The normal distribution can be used to model many different situations. Above I used the SAT score example from the OpenStats book. Other examples include height, IQ, and measurement errors.
Binomial Distribution: An example of the binomial distribution I like is from the 3Blue1Brown video on the subject. In it they discuss a set of Amazon book reviews. Each has a different sample size and a different satisfaction rate. The video uses the binomial distribution to describe how likely it is that the given rate is close to the “true” satisfaction score given the composition of the sample.
Poisson Distribution: For the Poisson distribution we would want to look for the occurrence of an event among a large and relatively static population. A good candidate would be the number of car accidents per day in the United States.
Normal Distribution:
Below is a plot of the distribution of SAT scores with a mean of 1100, a standard deviation of 200, and assuming it is perfectly normal:
x <- seq(600, 1600, 10)
px <- dnorm(x, mean = 1100, sd = 200)
barplot(px, names.arg = x, xlab = "SAT Score", ylab = "Density", main = "Normal Distribution of SAT Scores")
Binomial Distribution:
Let’s say there is a medical procedure with a success rate of 60%. If n=100 patients undergo the procedure, what is the probability of seeing exactly k successes? Below you can see a bar plot with the number of successes along the x-axis and the probability of that many successes along the y-axis.
n <- 100 # number of trials (patients)
k <- 10:100 # numbers of successful trials to evaluate
p <- 0.6 # probability of success
pkn <- dbinom(k, n, p) # vector of probabilities of exactly k successes
barplot(pkn, names.arg = k, ylab = "Probability", xlab = "# of Successes", main = "Binomial Distribution of Procedure's Success")
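Since the binomial is said above to approximate a normal distribution for large n, the procedure example offers a quick check. The sketch below compares the exact binomial probability of at most 50 successes with a normal approximation; the cutoff of 50 and the continuity correction of 0.5 are my own choices for illustration, not part of the original example.

```r
n <- 100   # number of trials (patients)
p <- 0.6   # probability of success

# Both n*p and n*(1-p) should be comfortably large for the approximation
c(np = n * p, nq = n * (1 - p))

# Exact binomial probability of 50 or fewer successes
exact <- pbinom(50, size = n, prob = p)

# Normal approximation with the same mean and sd, using a continuity correction
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
approx <- pnorm(50.5, mean = mu, sd = sigma)

c(exact = exact, normal_approx = approx)
```

The two values should agree closely here because both np and n(1-p) are well above 10.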
Poisson Distribution:
Below is a rough model of the number of quintuplet births in the United States each year. About 10 quintuplet births occur in a given year.
k <- 1:20 # numbers of occurrences to evaluate
lambda <- 10 # rate of occurrences
pk <- dpois(k, lambda) # probability of k occurrences
barplot(pk, names.arg = k, xlab = "Number of Occurrences", ylab = "Probability", main = "Poisson Distribution of Quintuplet Births")
BACKGROUND: Often, we can model processes using several different probability distributions. For example, we might use the Poisson instead of the binomial (if n>20 and np<10, i.e. large n and small p) as we did in class, the binomial instead of the geometric (both are repetitions of independent Bernoulli trials), or the normal approximation instead of the binomial (if np>10 and nq>10, i.e. n is large). If the assumptions are understood, then the probability results will be nearly identical.
Let’s assume that a hospital’s neurosurgical team performed N procedures for in-brain bleeding last year. x of these procedures resulted in death within 30 days. If the national proportion for death in these cases is pi, then is there evidence to suggest that your hospital’s proportion of deaths is more extreme than the national proportion?
Pick your own values of N, x, and pi. x is necessarily less than or equal to N, and pi is a fixed probability of success. The probability to compute is that of a count greater than or equal to x.
N <- 500 # number of procedures performed last year
x <- 50 # number of last year's procedures that resulted in death
pi <- 0.05 #national proportion of deaths from procedure
#Binomial:
dbinom(x,N,pi) #probability of x deaths
## [1] 1.943414e-06
k <- 1:60 # numbers of deaths to evaluate
bpk <- dbinom(k, N, pi) # Probability of k deaths out of N procedures given a national proportion of deaths pi
barplot(bpk, names.arg = k, ylab = "Probability", xlab = "# of Deaths", main = "Binomial")
#Poisson
lambda <- pi*N # find lambda based on national proportion and sample size.
dpois(x, lambda) # probability of x deaths
## [1] 3.602164e-06
ppk <- dpois(k, lambda) # probability of k deaths given rate of occurrence lambda
barplot(ppk, names.arg = k, ylab = "Probability", xlab = "# of Deaths", main = "Poisson")
In both cases, we see that our local death count is far more extreme than the national proportion would predict. Given the national proportion of 0.05 deaths per procedure, we would expect to see about 25 deaths for 500 procedures, which both graphs demonstrate visually, but at this hospital we see 50, twice the expected number. Though the two methods did not produce exactly the same probability for this outcome, both were on the same extremely low (\(10^{-6}\)) order of magnitude. I suspect the Poisson distribution is not as accurate as the Binomial one here: n = 500 is large, but np = 25 is above the np<10 guideline from the background, so the approximation is being used outside its recommended range.
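The probabilities above are for exactly 50 deaths. If "more extreme" is read as 50 or more deaths (a one-sided tail, which is my assumption about the intended question), the comparison can be sketched as below. The name `p_national` stands in for pi so that R's built-in constant is not masked, and the normal version uses a continuity correction of 0.5 that I have added for illustration.

```r
N <- 500             # number of procedures performed last year
x <- 50              # number of procedures that resulted in death
p_national <- 0.05   # national proportion (called pi in the text above)

# P(X >= x) is the upper tail beyond x - 1 for a discrete count
tail_binom <- pbinom(x - 1, size = N, prob = p_national, lower.tail = FALSE)
tail_pois <- ppois(x - 1, lambda = N * p_national, lower.tail = FALSE)

# Normal approximation with matching mean and sd (np and nq both exceed 10 here)
mu <- N * p_national
sigma <- sqrt(N * p_national * (1 - p_national))
tail_norm <- pnorm(x - 0.5, mean = mu, sd = sigma, lower.tail = FALSE)

c(binomial = tail_binom, poisson = tail_pois, normal = tail_norm)
```

Under these assumptions all three tail probabilities are tiny, so the conclusion that this hospital's count is unusual does not depend on which distribution is used, even though the three values differ from one another this far out in the tail.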