B. PDF/PMF: Probability Density/Mass Function.
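The PMF/PDF describes how probability is distributed over the values of a random variable. For discrete variables it is called the Probability Mass Function (PMF), and it gives the probability that the variable takes on a specific value. For continuous variables it is called the Probability Density Function (PDF); its value at a point is a density rather than a probability, because the probability of any single exact value is zero. The PDF is harder to use because it must be integrated over an interval to obtain a probability, whereas the PMF can be evaluated directly at the discrete values in the data set.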
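As a minimal sketch of this distinction in R, the snippet below evaluates a PMF and a PDF at a single point and then integrates the PDF over an interval. The parameter values (n = 10, p = 0.3, sd = 0.1) and the variable names are arbitrary choices for illustration, not taken from a data set.

```r
# PMF: dbinom() returns an actual probability, P(X = 3) for n = 10, p = 0.3
p_exact <- dbinom(x = 3, size = 10, prob = 0.3)

# PDF: dnorm() returns a density (curve height), not a probability.
# With a small sd the density at the mean is larger than 1.
dens_at_0 <- dnorm(x = 0, mean = 0, sd = 0.1)

# A probability for a continuous variable comes from integrating the density
# over an interval, here one standard deviation either side of the mean
p_interval <- integrate(dnorm, lower = -0.1, upper = 0.1,
                        mean = 0, sd = 0.1)$value

print(c(pmf = p_exact, density = dens_at_0, interval = p_interval))
```

A density above 1 is not a mistake: only the area under the curve has to equal 1, so a narrow curve must be tall.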
CDF: Cumulative Distribution Function (sometimes called the Cumulative Density/Mass Function).
The CDF is used to calculate the probability that a random variable will take on a value less than or equal to x. For a continuous variable it is the area under the PDF curve to the left of x; for a discrete variable it is the sum of the PMF over all values up to and including x. The standard name, Cumulative Distribution Function, applies to both cases. This function is very useful for finding the probability that a random variable falls in a given interval (especially for continuous variables), since \(P(a < X \le b) = F(b) - F(a)\).
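The interval calculation can be sketched in R by subtracting two cumulative probabilities. The distributions and endpoints below are illustrative assumptions; the point is that a discrete interval needs care about which endpoint is included.

```r
# Continuous: P(-20 < X <= 20) for a Normal(mean = 0, sd = 20)
p_norm <- pnorm(q = 20, mean = 0, sd = 20) - pnorm(q = -20, mean = 0, sd = 20)

# Discrete: P(2 <= X <= 4) for a Binomial(n = 10, p = 0.3).
# Subtract the cumulative probability at 1 (not 2) so that X = 2 is kept.
p_binom <- pbinom(q = 4, size = 10, prob = 0.3) -
  pbinom(q = 1, size = 10, prob = 0.3)

# Cross-check the discrete result by summing the PMF over 2, 3, 4
p_binom_check <- sum(dbinom(x = 2:4, size = 10, prob = 0.3))

print(c(normal = p_norm, binomial = p_binom, check = p_binom_check))
```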
Binomial PMF:
\(P(X = x) = \binom{n}{x} p^{x} q^{n-x} = \frac{n!}{(n-x)!\,x!} p^{x} q^{n-x}\)
\(n = \text{number of trials}\)
\(x = \text{number of successes}\)
\(p = \text{probability of success}\)
\(q = 1 - p = \text{probability of failure}\)
This formula makes sense. Remember that this function is used to find the probability that the random variable (X) takes a specific value (x). First, the combination formula nCx counts the number of ways that we can get x successes in n trials. That count is then multiplied by the joint probability of one particular sequence of x successes and n-x failures, which is \(p^{x} q^{n-x}\) because the trials are independent.
In other words: multiply the probability of getting x successes and n-x failures in one particular order by the number of orders in which that can happen.
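The two pieces of the formula can be computed separately in R and compared with the built-in dbinom(). This is a small sketch with assumed values (n = 10, x = 3, p = 0.3); the variable names are chosen only for illustration.

```r
n <- 10      # trials
x <- 3       # successes
p <- 0.3     # probability of success
q <- 1 - p   # probability of failure

# Number of ways to place x successes among n trials (nCx)
ways <- choose(n, x)

# Probability of one particular sequence with x successes and n - x failures
one_sequence <- p^x * q^(n - x)

# Binomial PMF by hand, and the same value from R's built-in function
by_hand  <- ways * one_sequence
built_in <- dbinom(x = x, size = n, prob = p)

print(all.equal(by_hand, built_in))
```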
C. Parameters.
Normal Distribution: mean \((\mu)\) and standard deviation \((\sigma)\). Both must be specified in R, along with the x or q argument (the value of the random variable).
Binomial: pi \((\pi)\), the probability of success in a trial, and \(n\), the number of trials performed. Both must be specified in R, along with the x or q argument.
Poisson: lambda \((\lambda)\), the expected number of occurrences over the interval. Strictly this is \(\lambda t\) (rate times interval length), but R assumes that the lambda you supply already accounts for t. You must specify it in R along with the x or q argument.
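As a rough sketch, these parameters map onto R's d (density/mass) and p (cumulative) functions as follows. All numeric values are assumptions for illustration, and the hourly rate in the Poisson example is hypothetical.

```r
# Normal: mean and sd, plus x (for the density) or q (for the CDF)
norm_density <- dnorm(x = 1.5, mean = 0, sd = 1)
norm_cdf     <- pnorm(q = 1.5, mean = 0, sd = 1)

# Binomial: size (n) and prob (pi)
binom_pmf <- dbinom(x = 2, size = 10, prob = 0.3)
binom_cdf <- pbinom(q = 2, size = 10, prob = 0.3)

# Poisson: lambda must already include the interval length t.
# Hypothetical example: 1.2 events per hour observed over t = 2 hours
rate_per_hour <- 1.2
t_hours <- 2
lambda <- rate_per_hour * t_hours   # expected events in the whole interval
pois_pmf <- dpois(x = 2, lambda = lambda)
pois_cdf <- ppois(q = 2, lambda = lambda)

print(c(norm_density, norm_cdf, binom_pmf, binom_cdf, pois_pmf, pois_cdf))
```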
D. Example Uses.
Normal: Heights, weights, other biological anthropology. Oceanography (ocean temperatures), quality control, education (SAT Scores).
IN SPORTS: Time on ice (for a given game), pitches thrown (for a game), goals scored (for a game), etc.
Binomial: Medical trials (cured or not cured), toxicity tests (develop cancer or not), ecology (seed germinates or not), quality control (passes or not).
IN SPORTS: On base percentage, free throw percentage, shooting percentage, wins/losses.
Poisson: Ecology (seedlings emerging), computer programming (bugs occurring), quality control (defects along an interval), genetics (mutations), traffic flow (cars arriving at an intersection), number of work-related injuries per time period, number of storms per time period.
IN SPORTS: Goals scored in a hockey game, fouls in the NBA, injuries per season or game.
E. Plot PDF Distributions of the Normal, Binomial, and Poisson.
# Binomial PMF: size = 10 and prob = .3
plot(x = 0:10,
     y = dbinom(x = 0:10,
                size = 10,
                prob = .3),
     type = 'h',
     main = 'Binomial Distribution (n = 10, p = .3)',
     xlab = '# of Successes',
     ylab = 'Probability',
     lwd = 3)
This plot shows the Binomial Distribution with n = 10 trials and a probability of success of .3.
# Poisson PMF (lambda = 2.4)
plot(x = 0:10,
     y = dpois(x = 0:10,
               lambda = 2.4),
     type = 'h',
     main = 'Poisson Distribution (lambda = 2.4)',
     xlab = '# of Events',
     ylab = 'Probability',
     lwd = 3)
This is a Poisson Distribution with lambda = 2.4. Its shape is similar to the binomial plot above: both are right-skewed, with most of the probability on small counts.
# Normal Distribution PDF (the y-axis is a density, not a probability)
plot(x = -100:100,
     y = dnorm(x = -100:100,
               mean = 0,
               sd = 20),
     type = 'l',
     main = 'Normal Distribution (mean = 0, sd = 20)',
     xlab = 'Random Variable',
     ylab = 'Density',
     lwd = 3)
This is the PDF of a Normal Distribution. Notice the symmetry about the mean and the bell-curve shape.
\(N = 100\)
\(\pi = .07\)
\(\lambda t = N\pi = 7\)
\(x = 5\)
# Graph the Binomial Distribution
plot(x = 0:25,
     y = dbinom(x = 0:25,
                size = 100,
                prob = .07),
     type = 'h',
     main = 'Binomial Distribution of Deaths per year (n = 100, prob = .07)',
     xlab = 'Deaths',
     ylab = 'Probability',
     lwd = 3)
# Calculate the probability of 5 or more deaths: P(X >= 5) = 1 - P(X <= 4). (Hospital of interest averages 5 deaths in 100 operations.)
1 - pbinom(q = 4, size = 100, prob = .07)
## [1] 0.8368358
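The choice of q = 4 rather than q = 5 matters for a discrete variable. The sketch below computes the same upper-tail probability three ways and shows the off-by-one version for contrast. The variable names are illustrative; the lower.tail argument is part of base R's pbinom().

```r
# National death rate and number of operations, taken from the example above
n_ops <- 100
death_rate <- 0.07

# P(X >= 5) as the complement of P(X <= 4)
upper_complement <- 1 - pbinom(q = 4, size = n_ops, prob = death_rate)

# Same quantity via lower.tail = FALSE, which returns P(X > q) directly
upper_direct <- pbinom(q = 4, size = n_ops, prob = death_rate,
                       lower.tail = FALSE)

# Same quantity again by summing the PMF over 5, 6, ..., 100
upper_summed <- sum(dbinom(x = 5:n_ops, size = n_ops, prob = death_rate))

# A common slip: q = 5 drops P(X = 5) and answers P(X >= 6) instead
off_by_one <- 1 - pbinom(q = 5, size = n_ops, prob = death_rate)

print(c(upper_complement, upper_direct, upper_summed, off_by_one))
```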
# Plot Poisson Distribution
plot(x = 0:25,
     y = dpois(x = 0:25,
               lambda = 7),
     type = 'h',
     main = 'Poisson Distribution of Deaths per year (lambda = 7)',
     xlab = 'Deaths',
     ylab = 'Probability',
     lwd = 3)
# Calculate the probability of 5 or more deaths under the Poisson model. (Hospital of interest rate is 5 deaths in 100 operations.)
1 - ppois(q = 4, lambda = 7)
## [1] 0.8270084
These distributions look nearly identical.
We used 1 - CDF to obtain the area under the curve to the right of and including 5. This gives the probability that the hospital of interest has 5 or more deaths per 100 operations when the national average death rate of .07 (7/100) applies. The two distributions gave nearly the same answer (.8368 from the Binomial, about .8270 from the Poisson) because, as a rule of thumb, the Poisson and Binomial distributions are nearly identical for a large sample size N (N > 20) and a small probability pi (N*pi < 10). Here we have a large sample size of 100 (100 > 20) and a small pi (pi = .07), so N*pi = 7, which is less than 10.
This makes sense: when N is large and pi is relatively small, the expected count N*pi can be interpreted as a long-run average rate (Poisson). It is also a success/failure experiment (Binomial), in this case over a relatively large number of trials. When N is large and pi is small, the two are essentially modeling the same thing and can be interpreted the same way. The Binomial goes trial by trial, whereas the Poisson goes interval by interval, but with the rate matched (lambda = N*pi) and enough trials, we would expect the two distributions to be nearly the same.
The Binomial gave a probability of .8368 and the Poisson about .8270 that a hospital would experience 5 or more deaths in 100 operations at the national average death rate (.07) for that operation. Since both probabilities are far above 5% (.05), there is no evidence that the hospital’s death rate is higher than the national average. Its observed rate (5 per 100) is in fact below the national rate of 7 per 100, although this calculation alone does not show that the hospital is significantly better.
In other words, there is about an 83 to 84% chance that a hospital operating at the true national average of 7 deaths per 100 operations would experience 5 or more deaths during its next 100 operations. That is a high probability, so such a result does not stand out as unlikely; in fact it is very likely and expected.
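How close the two models are can be checked numerically rather than by eye. The following is a minimal sketch using the values from this example (N = 100, pi = .07) and assuming the Poisson rate is matched as lambda = N*pi; the variable names are illustrative.

```r
n_ops <- 100
death_rate <- 0.07
lambda <- n_ops * death_rate   # matched Poisson rate, 7 deaths per 100 operations

# Evaluate both PMFs over the same range of death counts used in the plots
deaths <- 0:25
binom_probs <- dbinom(x = deaths, size = n_ops, prob = death_rate)
pois_probs  <- dpois(x = deaths, lambda = lambda)

# Largest pointwise difference between the two PMFs
max_gap <- max(abs(binom_probs - pois_probs))

# Upper-tail probability P(X >= 5) under each model
tail_binom <- 1 - pbinom(q = 4, size = n_ops, prob = death_rate)
tail_pois  <- 1 - ppois(q = 4, lambda = lambda)

print(c(max_gap = max_gap, binomial = tail_binom, poisson = tail_pois))
```

The two tail probabilities differ by about one percentage point, which is why either model leads to the same conclusion here.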