# Clearing the global environment
rm(list = ls())

Q1

Please explain each of the 3 distributions in less than 4 sentences.

  1. Normal Distribution: The normal distribution, also referred to as the bell curve, is a probability distribution in which the majority of values are found closest to the mean (average) and fewer values are found farther away from the mean. It is frequently used to describe things like individuals’ heights or test results. Its curve is symmetric.
  2. Binomial Distribution: The binomial distribution applies when each trial has just two possible outcomes, such as success or failure, or heads or tails. It gives the probability of obtaining a specific number of successes in a fixed number of independent trials. For example, it can be used to find the probability that a coin lands heads a certain number of times in a set number of flips.
  3. Poisson Distribution: The Poisson distribution describes the number of rare events that occur in a fixed interval of time or in a particular location. When the events are rare and independent of one another, it gives the probability of observing a particular number of events. It is frequently used for counts such as how many accidents happen at a certain crossroads each day.
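
The three descriptions above can be checked numerically. The following is a minimal sketch using base R's density functions; all parameter values (mean 170, sd 10, a fair coin flipped 10 times, a rate of 3 events) are illustrative assumptions, not part of the assignment.

```r
# Normal: the density is symmetric around the mean (170 and 10 are illustrative)
left  <- dnorm(160, mean = 170, sd = 10)
right <- dnorm(180, mean = 170, sd = 10)
isTRUE(all.equal(left, right))

# Binomial: the probabilities of 0..10 heads in 10 fair coin flips add up to 1
binom_total <- sum(dbinom(0:10, size = 10, prob = 0.5))
binom_total

# Poisson: the average count equals the rate lambda (3 events per interval, illustrative)
counts <- 0:100
pois_mean <- sum(counts * dpois(counts, lambda = 3))
pois_mean
```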

Q2

Explain what the pdf and cdf of a distribution measure. Pick any of the three distributions and provide some intuition as to whether the pdf formula makes sense or not.

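  • Probability Density Function (PDF): A distribution’s PDF describes how likely each value of a random variable is. For a discrete distribution such as the binomial, it gives the probability that the variable takes exactly a particular value (strictly speaking, this is a probability mass function). For a continuous distribution such as the normal, it gives a density, and probabilities are obtained as the area under the curve over an interval. For the binomial distribution: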

    \(P(X = k) = \binom{n}{k} \cdot p^k \cdot (1 - p)^{n - k}\)

  • Cumulative Distribution Function (CDF): A distribution’s CDF calculates the likelihood that a random variable will have a value that is less than or equal to a given value. It offers a cumulative perspective on the likelihood that the variable will go below a specific threshold. It answers: “What is the probability of having at most k successes in n trials?”

    \(P(X \leq k) = \sum_{i=0}^{k} \binom{n}{i} \cdot p^i \cdot (1 - p)^{n - i}\)

  • When we have a known probability of success (written as “p”), the Probability Density Function (PDF) for a binomial distribution makes sense since it helps us understand the likelihood of receiving a certain number of successes (typically denoted as “k”) in a fixed number of trials (denoted as “n”).

  • The binomial PDF formula makes sense term by term: \(p^k \cdot (1 - p)^{n - k}\) is the probability of one specific sequence containing k successes and n − k failures, and \(\binom{n}{k}\) counts how many such sequences exist. Multiplying the two gives the probability of exactly k successes in n trials. The PDF also lets us see how the probabilities of different outcomes change as we vary k, n, or p.
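
The relationship between the two formulas above (the CDF is the running sum of the PDF) can be sketched in R. The values n = 10 and p = 0.7 are illustrative assumptions chosen only for this check.

```r
# Illustrative parameters (assumed for this sketch)
n <- 10
p <- 0.7
k_values <- 0:n

# PDF: probability of exactly k successes, for every k from 0 to n
pmf <- dbinom(k_values, size = n, prob = p)

# CDF: probability of at most k successes, for every k from 0 to n
cdf <- pbinom(k_values, size = n, prob = p)

# Summing the PDF up to k should reproduce the CDF at k
cdf_from_pmf <- cumsum(pmf)
max(abs(cdf - cdf_from_pmf))  # numerically negligible difference
```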

Q3

What are the key parameters that define the 3 distributions above (or a distribution from the list above)? Does R require these key parameters to be declared?

Parameter of Chi-square distribution:

Degrees of Freedom (df or n): This is the single parameter that determines the shape of the chi-square distribution. It is the number of independent standard normal random variables whose squares are summed to produce a chi-square random variable. The chi-square distribution is widely used in statistics for hypothesis testing, confidence intervals, and modeling sample data variability.

# Calculating the cumulative probability for a chi-square distribution

# Illustrative values (not from the assignment): quantile 3.84 with 1 degree of freedom
probability <- pchisq(3.84, df = 1)

?pchisq  # opens the help page for the chi-square distribution functions

As the help page shows, the chi-square density with n degrees of freedom is:

\(f(x; n) = \frac{1}{2^{n/2} \Gamma(n/2)} x^{(n/2) - 1} e^{-x/2}\)

R requires the degrees of freedom (df, the n above) and a vector of quantiles (q for pchisq, x for dchisq) to be supplied; neither has a default value.

pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) #the other params are optional
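
Whether R insists on the key parameters differs by distribution, which can be checked directly. This is a minimal sketch: `try_call` is a hypothetical helper written for this example, and the input values are illustrative.

```r
# Hypothetical helper: returns "ok" if the call runs, "error" if it fails
try_call <- function(expr) tryCatch({ expr; "ok" }, error = function(e) "error")

# dnorm has defaults (mean = 0, sd = 1), so its parameters may be omitted
normal_defaults_match <- dnorm(0) == dnorm(0, mean = 0, sd = 1)
normal_without_params <- try_call(dnorm(0))

# dbinom, dpois and pchisq have no defaults for size/prob, lambda and df,
# so leaving them out raises a "missing argument" error
binom_without_params <- try_call(dbinom(7))
pois_without_params  <- try_call(dpois(12))
chisq_without_params <- try_call(pchisq(3.84))

c(normal = normal_without_params, binomial = binom_without_params,
  poisson = pois_without_params, chisq = chisq_without_params)
```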

Q4

Give a few examples of situations that can be modeled with each of the 3 distributions above.

Examples:

  1. Normal Distribution:
  • Scores on standardized tests, such as the SAT or IQ.

  • Measurement inaccuracies, such as instrument readings or observational errors.

  • Heights of individuals

  2. Binomial Distribution:
  • A group of students’ pass/fail results on a standardized exam.

  • The effectiveness or ineffectiveness of a set of medical therapies on patients.

  • The number of faulty items in a batch of production.

  3. Poisson Distribution:
  • The number of customers who enter a store in a particular hour.

  • The number of phone calls received by a call center in a certain period of time.

  • The number of accidents at a specific intersection in a given day.

Q5

Plot the distribution in part B

BINOMIAL

Question: In a basketball game, a player has a free throw success rate of 70%. If the player attempts 10 free throws, what is the probability of making exactly 7 of them?

We can use the binomial distribution to answer this question. The number of trials (free throws) is 10, and the probability of success (making a free throw) is 0.7 (70%).

Now, let’s calculate and visualize the probability using R, making the plot colorful:

# Parameters
n <- 10  # Number of free throws
p <- 0.7  # Probability of making a free throw

# Number of successful free throws
k <- 7

# Calculate the probability of making exactly 7 out of 10 free throws
probability <- dbinom(k, size = n, prob = p)

# Create a colorful bar plot
barplot(dbinom(0:n, size = n, prob = p), names.arg = 0:n, col = rainbow(n + 1), 
        xlab = "Number of Successes", ylab = "Probability", 
        main = "Binomial Distribution (n=10, p=0.7)",
        sub = sprintf("P(X = %d) = %.4f", k, probability))

The code calculates the probability for each number of successes from 0 to 10 and draws them as a colorful bar plot. The probability of making exactly 7 out of 10 free throws, about 0.2668, is shown in the plot subtitle.
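
As a cross-check, the same probability can be computed by hand from the binomial formula. This is a minimal sketch; the variable names `sequences`, `one_sequence` and `by_hand` are introduced here for illustration.

```r
n <- 10   # Number of free throws
p <- 0.7  # Probability of making a free throw
k <- 7    # Number of successful free throws

# Number of distinct orderings of 7 makes and 3 misses
sequences <- choose(n, k)

# Probability of any one specific ordering
one_sequence <- p^k * (1 - p)^(n - k)

# Binomial formula: orderings times probability per ordering
by_hand <- sequences * one_sequence

# Should agree with R's built-in binomial density
isTRUE(all.equal(by_hand, dbinom(k, size = n, prob = p)))
```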

NORMAL

Suppose the heights of a group of people follow a normal distribution with a mean height of 170 cm and a standard deviation of 10 cm. What is the probability that a randomly selected person from this group is taller than 180 cm?

# Parameters for the normal distribution
mean_height <- 170  # Mean height
sd_height <- 10      # Standard deviation of height

# Generate a range of height values
height_values <- seq(140, 200, by = 1)

# Calculate the PDF values using the normal distribution
pdf_values <- dnorm(height_values, mean = mean_height, sd = sd_height)

# Probability of being taller than 180 cm (upper tail of the normal distribution)
prob_taller <- pnorm(180, mean = mean_height, sd = sd_height, lower.tail = FALSE)
prob_taller

# Bar plot of the PDF: heights above 180 cm in blue, all other heights in red
barplot(pdf_values, names.arg = height_values,
        col = ifelse(height_values > 180, "blue", "red"),
        xlab = "Height (cm)", ylab = "Probability Density",
        main = "Probability Density Function of Heights (Normal Distribution)",
        sub = sprintf("P(X > 180) = %.4f", prob_taller),
        border = "black")

The plot shows the density of heights for a group with a mean height of 170 cm and a standard deviation of 10 cm, with heights above 180 cm in blue and all other heights in red. The probability that a randomly selected person is taller than 180 cm is the total blue area under the curve, not the height of any single bar. Because 180 cm is exactly one standard deviation above the mean, this probability is about 0.1587, which pnorm returns and the plot subtitle displays.
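
The probability of being taller than 180 cm can also be obtained as an area under the density curve. The following is a minimal sketch that compares numerical integration of the PDF with the CDF-based answer; the variable names are introduced here for illustration.

```r
mean_height <- 170  # Mean height
sd_height <- 10     # Standard deviation of height

# Upper-tail probability from the CDF
tail_from_cdf <- pnorm(180, mean = mean_height, sd = sd_height, lower.tail = FALSE)

# The same probability as the area under the PDF from 180 cm upward
tail_from_area <- integrate(dnorm, lower = 180, upper = Inf,
                            mean = mean_height, sd = sd_height)$value

# 180 cm is one standard deviation above the mean, so the z-score is 1
z <- (180 - mean_height) / sd_height
tail_from_z <- pnorm(z, lower.tail = FALSE)

c(cdf = tail_from_cdf, area = tail_from_area, z_score = tail_from_z)
```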

POISSON

Question: In a call center, on average, 10 customer calls are received every 15 minutes. What is the probability of receiving 12 customer calls in the next 15 minutes?

# Average number of calls per 15-minute interval
lambda <- 10

# Probability of receiving exactly 12 calls in the next 15 minutes
prob_12 <- dpois(12, lambda)

# Range of call counts to plot (0 to 20 calls)
calls_needed <- 0:20
probability_poisson <- dpois(calls_needed, lambda)

# Create a colorful bar plot to visualize the probabilities
barplot(probability_poisson, col = rainbow(length(calls_needed)),
        names.arg = calls_needed,
        xlab = "Number of Calls in 15 Minutes", ylab = "Probability",
        main = "Probability of Receiving Calls in 15 Minutes (lambda = 10)",
        sub = sprintf("P(X = 12) = %.4f", prob_12),
        border = "black")

Part II

Let’s choose N = 1000, x = 20, and p = 0.02. This means the hospital performed 1000 procedures, 20 resulted in death within 30 days, and the national proportion of death in these cases is 2% (0.02).

A. Binomial
# Parameters
N <- 1000  # Number of procedures
x <- 20    # Number of deaths
p <- 0.02  # Probability of death within 30 days

# Probability using the binomial distribution
prob_binomial <- dbinom(x, size = N, prob = p)
prob_binomial
## [1] 0.08973707
A. Poisson
# Calculate the mean for Poisson distribution
lambda <- N * p

# Probability using the Poisson distribution
prob_poisson <- dpois(x, lambda)
prob_poisson
## [1] 0.08883532
B.

Under the Poisson assumption, we will test whether the observed number of deaths follows a Poisson distribution with the mean based on the national proportion.
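
One way such a test might be carried out is with an upper-tail probability from ppois. This is only a sketch under an assumption: it treats the question as one-sided (are 20 or more deaths unusually many?), which the text above does not state explicitly, and `p_value_upper` is a name introduced here.

```r
N <- 1000  # Number of procedures
x <- 20    # Number of deaths
p <- 0.02  # National proportion of deaths within 30 days

# Expected number of deaths under the national proportion
lambda <- N * p

# P(X >= x) under Poisson(lambda); ppois(x - 1, ...) with lower.tail = FALSE
# gives P(X > x - 1), which is the same as P(X >= x) for integer counts
p_value_upper <- ppois(x - 1, lambda, lower.tail = FALSE)
p_value_upper
```

A small value would suggest more deaths than the national proportion predicts; a large value would be consistent with it.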

When N is large and p is small, the Poisson distribution with mean lambda = N * p approximates the binomial distribution well. That is the case here: with N = 1000 and p = 0.02 (lambda = 20), the binomial probability (0.0897) and the Poisson probability (0.0888) differ by less than 0.001.

If N is insufficiently large or p is insufficiently small, the Poisson distribution may not be a fair approximation to the binomial distribution, and the results may diverge.
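
The quality of the approximation can be compared across settings. This is a minimal sketch: `max_gap` is a hypothetical helper written for this example, and the second setting (N = 10, p = 0.5) is an illustrative assumption chosen to show a poor fit.

```r
# Hypothetical helper: largest pointwise gap between the binomial
# probabilities and their Poisson approximation with lambda = N * p
max_gap <- function(N, p) {
  counts <- 0:N
  max(abs(dbinom(counts, size = N, prob = p) - dpois(counts, lambda = N * p)))
}

# Large N, small p (the hospital setting): the two distributions nearly coincide
gap_hospital <- max_gap(N = 1000, p = 0.02)

# Small N, large p (illustrative): the approximation is visibly worse
gap_small_n <- max_gap(N = 10, p = 0.5)

c(hospital = gap_hospital, small_n = gap_small_n)
```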

?ppois