BACKGROUND: These distributions are foundational in probability and statistics, each serving different modeling needs with specific parameters determining their characteristics.
Normal Distribution (Gaussian Distribution):
Description: Symmetric bell-shaped curve with a mean and standard deviation.
Parameters: Mean (μ) defines the center, and standard deviation (σ) controls the spread.
Example: Heights of a population; IQ scores.
Uniform Distribution:
Description: All values are equally likely within a specified range.
Parameters: Minimum and maximum values define the range.
Example: Random selection from a deck of cards; rolling a fair die.
Bernoulli Distribution:
Description: Simplest discrete distribution; models a single trial with two possible outcomes.
Parameters: Probability of success (p).
Example: Outcome of a coin flip; success or failure of a single event.
Binomial Distribution:
Description: Models the number of successes in a fixed number of independent Bernoulli trials.
Parameters: Probability of success (p) and number of trials (n).
Example: Number of successful free throws in a fixed number of attempts.
Poisson Distribution:
Description: Models the number of events in a fixed interval of time or space, assuming events occur independently at a constant average rate.
Parameters: Average event rate (λ) within the given time or space.
Example: Number of phone calls at a call center in a minute.
Exponential Distribution:
Description: Describes the time between events in a Poisson process.
Parameters: Rate parameter (λ) controls the event occurrence rate.
Example: Time between arrivals at a service point (e.g., customers at a counter).
Gamma Distribution:
Description: Generalizes the exponential distribution; used for waiting times.
Parameters: Shape (k) and scale (θ) parameters.
Example: Time until a light bulb fails; time until a computer system crashes.
Logistic Distribution:
Description: Symmetric, bell-shaped density with heavier tails than the normal; its CDF is the S-shaped logistic function used in logistic regression.
Parameters: Location (μ) and scale (s) parameters.
Example: Modeling growth or decline in populations; logistic regression.
Weibull Distribution:
Description: Used in reliability engineering; can model various shapes.
Parameters: Shape (k) and scale (λ) parameters.
Example: Modeling the time until a component fails.
Chi-Square Distribution:
Description: Distribution of the sum of the squares of k independent standard normal variables.
Parameters: Degrees of freedom (k) define the shape.
Example: Testing the independence of categorical variables in a contingency table.
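As a rough orientation, the parameters listed above can be matched to the density and mass functions in R's stats package. The sketch below evaluates each one at a single point. All parameter values are illustrative assumptions, not values from the text; note that R names the exponential parameter `rate` and the gamma and Weibull scale parameters `scale`.

```r
# Illustrative parameter values only (assumptions, not taken from the text)
densities <- c(
  normal      = dnorm(0, mean = 0, sd = 1),
  uniform     = dunif(0.5, min = 0, max = 1),
  bernoulli   = dbinom(1, size = 1, prob = 0.3),    # Bernoulli = binomial with one trial
  binomial    = dbinom(3, size = 10, prob = 0.3),
  poisson     = dpois(2, lambda = 4),
  exponential = dexp(1, rate = 2),
  gamma       = dgamma(1, shape = 1, scale = 0.5),  # shape = 1 reduces to exponential with rate = 1/scale
  logistic    = dlogis(0, location = 0, scale = 1),
  weibull     = dweibull(1, shape = 1.5, scale = 2),
  chi_square  = dchisq(3, df = 4)
)
print(round(densities, 4))
```

The gamma entry with shape = 1 matches the exponential entry, which is one way to see that the gamma generalizes the exponential.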
Please explain each of the 3 distributions in less than 4 sentences.
There are three types of distributions discussed in this week’s lecture: binomial, normal, and Poisson. Binomial and Poisson distributions are discrete probability distributions, meaning the random variable takes countable values (0, 1, 2, ...); normal distributions, on the other hand, are continuous, meaning the variable can take any real value. The binomial distribution describes the number of successes in a fixed number of independent trials, the Poisson distribution describes the number of events in a given interval (and approximates the binomial when the number of trials is large and the probability of success is small), and the normal distribution describes a continuous measurement that clusters symmetrically around a mean.
Explain what the pdf and cdf of a distribution measure. Pick any of the three distributions (or a distribution from the list above that we have not covered in class), and provide some intuition as to whether the pdf formula makes sense or not.
PDF stands for probability density function, whereas CDF stands for cumulative distribution function. pdf(x), or p(x), gives the probability density at a given point x; it is not itself a probability, since probabilities for a continuous random variable come from the area under the density over an interval. The pdf is therefore defined for continuous random variables, such as those following a normal, logistic, or gamma distribution. Its discrete counterpart is the probability mass function (PMF), which gives the probability of each possible outcome x of a discrete random variable. The cumulative distribution function, written as cdf(x) or F(x), gives the probability of drawing a value less than or equal to a given x (i.e., the total area or mass to the left of x) and is defined for both continuous and discrete variables.
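A minimal sketch of these definitions in R, using a standard normal and a binomial with size = 10 and prob = 0.5 (both chosen only for illustration). It also checks the binomial PMF formula by hand: the number of ways to arrange x successes, times the probability of any one such arrangement.

```r
# Continuous case: the density is a height, the CDF is an area
dnorm(0)                                    # density at x = 0, not a probability
area <- integrate(dnorm, lower = -Inf, upper = 1.96)$value
c(area = area, cdf = pnorm(1.96))           # integrating the pdf reproduces the cdf

# Discrete case: the PMF gives probabilities, and the CDF is their running sum
x   <- 0:10
pmf <- dbinom(x, size = 10, prob = 0.5)
cdf <- pbinom(x, size = 10, prob = 0.5)
sum(pmf)                                    # PMF values sum to 1
all.equal(cumsum(pmf), cdf)                 # CDF(x) = P(X <= x)

# PMF formula by hand: choose(n, x) * p^x * (1 - p)^(n - x)
manual_pmf <- choose(10, x) * 0.5^x * (1 - 0.5)^(10 - x)
all.equal(manual_pmf, pmf)
```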
What are the key parameters that define the 3 distributions above (or a distribution from the list above)? Does R require these key parameters to be declared? Type the “?distribution” command in R to find out.
For a binomial distribution, the number of trials (N) and the probability of success (𝞹) are needed; for a normal distribution, the mean (μ) and standard deviation (σ) are required; for a Poisson distribution, the average rate at which a rare event occurs in the interval (λt) is needed. In R, each density function also takes a vector of quantiles (x) as its first argument. dbinom() requires size and prob to be declared, and dpois() requires lambda; dnorm() falls back to mean = 0 and sd = 1 when they are omitted. The cumulative versions (pbinom(), pnorm(), ppois()) additionally accept lower.tail, which defaults to TRUE and selects the left tail.
?Distributions # overview help page; see also ?dbinom, ?dnorm, ?dpois
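Besides the help pages, one way to see which parameters R insists on is to inspect the function signatures directly. The sketch below uses args() and then calls each density with only its first argument; the tryCatch() wrapper is there only so the expected error is captured as text instead of stopping the script.

```r
# Signatures: arguments without a default value must be supplied by the caller
args(dnorm)   # function (x, mean = 0, sd = 1, log = FALSE)
args(dbinom)  # function (x, size, prob, log = FALSE)
args(dpois)   # function (x, lambda, log = FALSE)

# The normal density works with its defaults (standard normal)
std_density <- dnorm(0)
std_density

# The binomial mass function does not: size and prob have no defaults
missing_msg <- tryCatch(dbinom(3), error = function(e) conditionMessage(e))
missing_msg
```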
Give a few examples of situations that can be modeled with each of the 3 distributions above. You can try reading Chapter 1.3, Parametric Families of Distributions, in Introduction to Statistical Thought by Michael Lavine (recommended textbook).
Much of the work in my field (educational/developmental psychology) relies on the assumption of a normal distribution. For example, we assume that height, reading ability, and standardized test scores are approximately normally distributed, i.e., follow a bell curve. The Bernoulli distribution can model whether a medication intervention was successful for a single patient, e.g., “0” if symptoms did not improve or worsened and “1” if there were improvements. Extending this example, a binomial distribution would represent the number of patients who improved out of a fixed number of patients treated independently. A Poisson distribution could model the number of times a student is tardy over an academic year.
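The link between the Bernoulli and binomial examples can be sketched with a small simulation. The improvement probability, number of patients, and seed below are hypothetical values chosen for illustration; they do not come from any study.

```r
set.seed(42)            # arbitrary seed so the simulation is reproducible
p_improve  <- 0.6       # hypothetical probability that one patient improves
n_patients <- 25        # hypothetical number of patients treated

# Bernoulli: one 0/1 outcome per patient (a binomial draw with size = 1)
outcomes <- rbinom(n_patients, size = 1, prob = p_improve)
outcomes

# Binomial: the number of patients who improved is the sum of the Bernoulli outcomes
n_improved <- sum(outcomes)
n_improved

# Probability of exactly that many improvements under the binomial model
dbinom(n_improved, size = n_patients, prob = p_improve)
```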
Plot the distribution in part B (3 if you stick close to class notes, or 1 if you venture out). You can begin by reading up on the plot() function and looking at the coded lecture examples - https://rpubs.com/sharmaar2/Distributions.
# Parameters
n <- 30  # total trials (days in 1 month)
p <- 0.5 # probability that the student is on time on a given day
# Generate x values
x <- 0:n # possible numbers of on-time days (successes)
# Calculate the probability mass function (PMF) for each x value
pmf_binomial <- dbinom(x, size = n, prob = p)
# Plot the binomial distribution
plot(x, pmf_binomial, type = "h", lwd = 10, col = "lavender", xlab = "Number of Days On Time", ylab = "Probability",
     main = "Number of Days a Student Is on Time in a Given Month")
# Create data: simulated district-wide student scores on a math test
set.seed(1) # arbitrary seed, added only so the simulated scores are reproducible
data <- rnorm(1000, mean = 67, sd = 6.7)
# Parameters estimated from the simulated sample
data_mean <- mean(data)
data_sd <- sd(data)
# Density values for the fitted normal distribution
x <- seq(min(data), max(data), length.out = 100)
y <- dnorm(x, mean = data_mean, sd = data_sd)
# Visual
plot(x, y, type = "l", col = "pink", lwd = 2, main = "District-wide Student Scores on a Math Test", xlab = "Scores", ylab = "Density")
# Parameters
lambda_t <- 4 # a student's average number of tardies over an academic year
# Generate x values
x <- 0:9 # possible numbers of tardies in a year
# PMF for each possible number of tardies
pmf_pois <- dpois(x, lambda_t)
# Visual
plot(x, pmf_pois, type = "h", lwd = 10, col = "purple", xlab = "Number of Tardies", ylab = "Probability", main = "Probability of Each Number of Tardies in an Academic Year")
BACKGROUND: Often, we can model processes using several different probability distributions. For example, we might use the Poisson instead of the binomial (if n>20 and np<10, i.e., large n and small p) as we did in class, the binomial instead of the geometric (both are repetitions of independent Bernoulli trials), or the normal approximation instead of the binomial (if np>10 and nq>10, i.e., n is large). If the assumptions hold, then the probability results will be nearly identical.
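A rough sketch of the two rules of thumb above, comparing exact binomial probabilities with their approximations. The values of n, p, and the cutoffs are assumptions chosen to satisfy each rule, and the +0.5 continuity correction in the normal case is a standard refinement that the text does not mention.

```r
# Poisson approximation: large n, small p (here n = 200, p = 0.02, so np = 4)
n1 <- 200; p1 <- 0.02
exact_small_p <- pbinom(3, size = n1, prob = p1)   # exact P(X <= 3)
pois_approx   <- ppois(3, lambda = n1 * p1)        # Poisson with lambda = np

# Normal approximation: np and nq both above 10 (here n = 100, p = 0.5)
n2 <- 100; p2 <- 0.5
exact_mid_p <- pbinom(55, size = n2, prob = p2)    # exact P(X <= 55)
norm_approx <- pnorm(55 + 0.5,                     # +0.5 is the continuity correction
                     mean = n2 * p2,
                     sd   = sqrt(n2 * p2 * (1 - p2)))

round(c(exact_small_p = exact_small_p, pois_approx = pois_approx,
        exact_mid_p   = exact_mid_p,   norm_approx = norm_approx), 4)
```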
Let’s assume that a hospital’s neuro-surgical team performed N procedures for in-brain bleeding last year. x of these procedures resulted in death within 30 days. If the national proportion for death in these cases is 𝞹, then is there evidence to suggest that your hospital’s proportion of deaths is more extreme than the national proportion?
Pick your own values of N, x, and 𝞹. x is necessarily less than or equal to N, and 𝞹 is a fixed probability of success. The probability of interest is that of observing a count greater than or equal to x, i.e., P(X ≥ x). Then model the count both as a binomial and as a Poisson, and provide your R code solutions. Hint: Build your code from Week 3 Inclass.R (attached below for convenience) and skim Key Distributions.html to brush up on your basic concepts.
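n <- 60 # number of procedures in the last year
X <- 8 # number of deaths within 30 days of a procedure
range <- X:n # counts at least as extreme as X (note: reuses the name of base::range)
pi <- .12 # national proportion of deaths (note: masks R's built-in constant pi)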
There were 60 procedures in the past year, 8 of which resulted in death within 30 days, and the national fatality rate for this in-brain bleeding procedure is 12%.
# P(X >= 8 | n = 60, pi = .12)
sum(dbinom(8:60, size = 60, prob = .12)) # brute force
## [1] 0.4328228
1 - pbinom(7, 60, .12) # P(X >= 8) = 1 - P(X <= 7): adjust for discrete variable !!!
## [1] 0.4328228
pbinom(7, 60, .12, lower.tail = FALSE) # upper tail P(X > 7) = P(X >= 8)
## [1] 0.4328228
The binomial probability is 43.28%.
lambda_t <- n * pi # expected number of deaths: 60 * .12 = 7.2
sum(dpois(x = 8:60, lambda = lambda_t)) # brute force
## [1] 0.4310588
1 - ppois(q = 7, lambda = lambda_t, lower.tail = TRUE) # P(X >= 8) = 1 - P(X <= 7): adjust for discrete variable !!!
## [1] 0.4310588
ppois(q = 7, lambda = lambda_t, lower.tail = FALSE) # upper tail P(X > 7) = P(X >= 8)
## [1] 0.4310588
The Poisson probability is 43.11%.
Yes, we get nearly identical probabilities via both methods. I believe this is because the number of procedures is large (n = 60 > 20) and the expected number of deaths is small (np = 7.2 < 10), so the Poisson distribution closely mimics the binomial. In other words, with many trials and a small probability of success, the binomial count over a finite number of trials behaves much like a Poisson count with rate λ = np.
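To connect these tail probabilities back to the original question (is there evidence that the hospital's proportion is more extreme than the national one?), the same upper-tail probability P(X ≥ 8) can be read as a one-sided p-value. A minimal sketch using binom.test() and poisson.test() from the stats package follows; the variable names are my own, and the 0.05 threshold is a conventional assumption, not something the assignment specifies.

```r
n_proc   <- 60    # procedures last year (value chosen in the write-up)
deaths   <- 8     # deaths within 30 days
nat_prop <- 0.12  # national proportion of deaths

# One-sided exact tests: is the hospital's death count unusually high?
bt <- binom.test(deaths, n_proc, p = nat_prop, alternative = "greater")
pt <- poisson.test(deaths, T = 1, r = n_proc * nat_prop, alternative = "greater")

# Each p-value equals the corresponding upper-tail probability P(X >= 8)
c(binomial = bt$p.value, poisson = pt$p.value)
```

Both p-values are far above 0.05, so under these assumptions there is no evidence that the hospital's death proportion exceeds the national proportion.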
Code by Liu
x_axis <- 0:n
binom_distr <- dbinom(x_axis, n, pi)
plot(x_axis, binom_distr, type="h", lwd=2, col="orange", main="Binomial Distribution", xlab="Number of Deaths", ylab="Probability")
poisson_distr <- dpois(x_axis, lambda_t)
plot(x_axis, poisson_distr, type="h", lwd=2, col="turquoise", main="Poisson Distribution", xlab="Number of Deaths", ylab="Probability")
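Since the two plots above look almost the same, it can help to overlay them and measure the gap directly. The sketch below is self-contained and re-declares the inputs under its own names (n_proc, nat_prop) so that it does not rely on the masked `pi`. It is one possible way to compare the two mass functions, not part of the original code.

```r
n_proc   <- 60
nat_prop <- 0.12
counts   <- 0:n_proc

binom_pmf <- dbinom(counts, size = n_proc, prob = nat_prop)
pois_pmf  <- dpois(counts, lambda = n_proc * nat_prop)

# Largest pointwise gap between the two mass functions
max_gap <- max(abs(binom_pmf - pois_pmf))
max_gap

# Overlay: binomial as vertical bars, Poisson as points
plot(counts, binom_pmf, type = "h", lwd = 2, col = "orange",
     main = "Binomial vs. Poisson", xlab = "Number of Deaths", ylab = "Probability")
points(counts, pois_pmf, pch = 16, col = "turquoise")
legend("topright", legend = c("Binomial", "Poisson"),
       col = c("orange", "turquoise"), lty = c(1, NA), pch = c(NA, 16))
```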