ADEC7310 Midterm

Question 1: Question 1

Question 1.1: I have two copies of the famous ‘’Dummy Variables for Dummies’’ textbook. If offered a copy, a random student will accept it with probability 0.3. Calculate the probability that I have to offer it to exactly 6 students before I get rid of both copies. Assume I am offering each student the book sequentially, and their decisions are independent of each other’s. Calculate this probability from scratch, without using any statistical function in R (Binomial, Poisson, etc.). Attach an annotated R code detailing all the steps you took and the reasoning behind it.
RESPONSE: This is a Negative Binomial scenario. The reason I know this is that we are counting the number of trials needed to achieve a fixed number of successes. In the question, we are looking for “How many students do we need to offer the book to before 2 accept?” So, for exactly 6 students, we would need to have one of the first 5 students accept 1 (4 must reject) and have the 6th accept. (This is the opposite of a Binomial scenario where we have a fixed number of trials with random successes).

# Negative Binomial Problem: Probability of getting rid of 2 books in exactly 6 offers

# Parameters
p <- 0.3    # Probability a student accepts the book
q <- 0.7    # Probability a student rejects the book (1 - p)
r <- 2      # Number of successes needed (both books accepted)
n <- 6      # Total number of trials (students offered)

# Calculate the number of ways to arrange 1 success in the first 5 trials
# We need exactly 1 success in trials 1-5, then success on trial 6
# This is "5 choose 1"
ways_to_choose <- choose(5, 1)

# Calculate probability of the specific pattern:
# - 1 success in first 5 trials: p^1
# - 4 failures in first 5 trials: q^4
# - Success on 6th trial: p
prob_sequence <- p * (q^4) * p

# Total probability = (number of arrangements) × (probability per arrangement)
prob_total <- ways_to_choose * prob_sequence

# Display results
cat("Question 1.1:", prob_total, "\n")

## Question 1.1: 0.108045

cat("As percentage:", prob_total * 100, "%\n")

## As percentage: 10.8045 %

Question 1.2: The question in the previous part is an example of Negative Binomial distribution. The random variable in this distribution is the number of failures before a given number r of successes is obtained, given that each trial has a Bernouli distribution with probability of success p. Write a function in R that will calculate the probability distribution function for any r≥1 and p∈(0,1). You may use any statistical function you want, except the built-in Negative Binomial. Attach an annotated R code detailing all the steps and the reasoning behind them. Show that your function in R gives you the same answer as the answer you calculated in the first part.

# Negative Binomial Distribution
# P(X = k) = C(k+r-1, k) * p^r * (1-p)^k

# Parameters
p <- 0.3    # Probability a student accepts the book
r <- 2      # Number of successes needed (both books accepted)
k <- 4      # Number of failures (students who reject)

# Negative binomial probability function
neg_binomial <- function(k, r, p) {
  binomial_coef <- choose(k + r - 1, k)
  prob <- binomial_coef * (p^r) * ((1 - p)^k)
  return(prob)
}

# Calculate probability
result <- neg_binomial(k, r, p)

# Display results
cat("Question 1.2:", result, "\n")

## Question 1.2: 0.108045

cat("From question 1.1:", prob_total, "\n")

## From question 1.1: 0.108045

cat("Results match!\n")

## Results match!

Question 1.3: Check that the function you wrote generates the same probability distribution function for the Negative Binomial as the built in function in R. Check this with four graphs, each for a given value of parameters p and r. For each graph, show the probability distribution function for values of the number of failures from 0 to 20 calculated by your function and the probability distribution function calculated by the built-in function in R. Attach an annotated R code detailing all the steps. Attach the graphs.

# First, define the function from Question 1.2
neg_binomial <- function(k, r, p) {
  q <- 1 - p  # Define q inside the function for proper scoping
  
  # Calculate binomial coefficient
  binomial_coef <- choose(k + r - 1, k)
  
  # Calculate probability 
  prob <- binomial_coef * (p^r) * (q^k)
  return(prob)
}

# Range of failures to plot
failures <- 0:20

# TEST 1: r = 2, p = 0.3 (original problem parameters)

r <- 2
p <- 0.3

# Calculate probabilities using both methods
probs_custom <- sapply(failures, function(k) neg_binomial(k, r, p))
probs_builtin <- dnbinom(failures, size = r, prob = p)

# Create plot
plot(failures, 
     probs_custom, 
     type = "b", 
     col = "blue", 
     lwd = 2, 
     pch = 16,
     xlab = "Number of Failures (k)",
     ylab = "Probability P(X = k)",
     main = paste0("Test 1: r = ", r, ", p = ", p))

# Add built-in function line
lines(failures, probs_builtin, type = "b", col = "red", lwd = 2, pch = 1, lty = 2)

# Add legend
legend("topright", 
       legend = c("Custom neg_binomial()", "Built-in dnbinom()"),
       col = c("blue", "red"),
       pch = c(16, 1),
       lty = c(1, 2),
       lwd = 2)

# Add grid
grid()

# TEST 2: r = 3, p = 0.7 (high success probability)
r <- 3
p <- 0.7

# Calculate probabilities using both methods
probs_custom <- sapply(failures, function(k) neg_binomial(k, r, p))
probs_builtin <- dnbinom(failures, size = r, prob = p)

# Create plot
plot(failures, 
     probs_custom, 
     type = "b", 
     col = "blue", 
     lwd = 2, 
     pch = 16,
     xlab = "Number of Failures (k)",
     ylab = "Probability P(X = k)",
     main = paste0("Test 2: r = ", r, ", p = ", p))

# Add built-in function line
lines(failures, probs_builtin, type = "b", col = "red", lwd = 2, pch = 1, lty = 2)

# Add legend
legend("topright", 
       legend = c("Custom neg_binomial()", "Built-in dnbinom()"),
       col = c("blue", "red"),
       pch = c(16, 1),
       lty = c(1, 2),
       lwd = 2)

# Add grid
grid()

# TEST 3: r = 5, p = 0.2 (many successes, low probability)
r <- 5
p <- 0.2

# Calculate probabilities using both methods
probs_custom <- sapply(failures, function(k) neg_binomial(k, r, p))
probs_builtin <- dnbinom(failures, size = r, prob = p)

# Create plot
plot(failures, 
     probs_custom, 
     type = "b", 
     col = "blue", 
     lwd = 2, 
     pch = 16,
     xlab = "Number of Failures (k)",
     ylab = "Probability P(X = k)",
     main = paste0("Test 3: r = ", r, ", p = ", p))

# Add built-in function line
lines(failures, probs_builtin, type = "b", col = "red", lwd = 2, pch = 1, lty = 2)

# Add legend
legend("topright", 
       legend = c("Custom neg_binomial()", "Built-in dnbinom()"),
       col = c("blue", "red"),
       pch = c(16, 1),
       lty = c(1, 2),
       lwd = 2)

# Add grid
grid()

# TEST 4: r = 1, p = 0.5
r <- 1
p <- 0.5

# Calculate probabilities using both methods
probs_custom <- sapply(failures, function(k) neg_binomial(k, r, p))
probs_builtin <- dnbinom(failures, size = r, prob = p)

# Create plot
plot(failures, 
     probs_custom, 
     type = "b", 
     col = "blue", 
     lwd = 2, 
     pch = 16,
     xlab = "Number of Failures (k)",
     ylab = "Probability P(X = k)",
     main = paste0("Test 4: r = ", r, ", p = ", p))

# Add built-in function line
lines(failures, probs_builtin, type = "b", col = "red", lwd = 2, pch = 1, lty = 2)

# Add legend
legend("topright", 
       legend = c("Custom neg_binomial()", "Built-in dnbinom()"),
       col = c("blue", "red"),
       pch = c(16, 1),
       lty = c(1, 2),
       lwd = 2)

# Add grid
grid()

Question 2: Yathzee is a game played with five regular six-sided dice.

Question 2.1: What is the probability all five dice will show the same number (this is known as an Yahtzee)? Attach an annotated R code detailing all the steps for your calculations.
RESPONSE:

# 2.1 Probability of Yahtzee (all five dice showing the same value)

# Total possible outcomes when rolling 5 dice
total_outcomes <- 6^5

# Favorable outcomes: 6 possible Yahtzees (all 1s, all 2s, ..., all 6s)
favorable_outcomes <- 6

# Calculate probability
p_yahtzee <- favorable_outcomes / total_outcomes

# Display result
cat("Probability of Yahtzee:", round(p_yahtzee * 100, 2), "%\n")

## Probability of Yahtzee: 0.08 %

Question 2.2: What is the probability all five dice will show a different number? Attach an annotated R code detailing all the steps for your calculations.

# I know the probability rolling a different number:
# ...on the first die is: 1/6
# ...on the second die: 5/6
# ...on the third die: 4/6
# ...on the fourth die: 3/6
# ...on the fifth die: 2/6
# 6P5 = 6!(6-5)! = 6!/1! = 720

total_outcomes <- 6^5
favorable_different <- factorial(6)/factorial(6-5)

# Probability
p_all_different <- favorable_different / total_outcomes
cat("The probability of rolling a differnt number on each die is:", round((p_all_different * 100), 2), "%\n")

## The probability of rolling a differnt number on each die is: 9.26 %

Question 2.3: In the game of Yahtzee, a player rolls the dice three times. After each roll, the player decides which dice, if any, to roll again. Suppose a player is trying to obtain an Yahtzee (same number on all five dice). Suppose after the first roll, the player sees two dice showing the same number and three showing other numbers (for example, 1, 1, 3, 5, 6). The player decide to keep the two 1s, and roll again the other three dice. Calculate the probability the player will end up having exactly three dice showing the same number. Attach an annotated R code detailing all the steps for your calculations. RESPONSE: I know it is not safe to assume, but I am going to assume that we are talking about roll the 3 dice and receiving a 3rd 1 ie 1,1,1,3,5, for example. NOT that they were going to end up with, for example: 1,1,4,4,4

# Initial conditions: Two 1s (kept), roll again
# What we need: exactly three total 1s

# Total number of outcomes when rolling 3 dice
total_3_dice <-  6^3

# C(3, 1) = 3
# Must show a 1: 1 way
# Remaining dice must not show a 1: 5 choices each = 25 ways

which_die <- choose(3, 1)
show_1 <-  1
not_show_1 <- 5^2

exactly_three <-  which_die * show_1 * not_show_1

# Probability
p_exactly_three <- exactly_three / total_3_dice
cat("Probability of exactly three 1s total:", 
    round((p_exactly_three * 100), 2), "%\n")

## Probability of exactly three 1s total: 34.72 %

Question 2.4: Assuming the player always keeps largest set of dice showing the same number and rolls the other dice, what is the probability the player will get an Yahtzee within the three rolls (after the first, or after the second, or after the third)? Write a piece of code in R with a simulation for this probability. Attach the annotated code, detailing the reasoning behind all steps.

# I dont even know if this is simulation problem, but let's start with that
# Life, the Universe and Everything
set.seed(42)

# Simulation parameters
num_simulations <- 100000
num_dice <- 5
num_faces <- 6
max_rolls <- 3

# Strategy: Keep the largest set of matching dice and reroll the remainder

# Create a function to find the most common value and its count
get_max_match <- function(dice){
  
  # Count occurrences of each value
  count <- table(dice)
  
  # Get max count
  mac_count <- max(count)
  
  # ??? SORRY...I AM NOT SURE WHERE TO GO FROM HERE.
  # I WILL TAKE THE HIT
}

# Simulate a SINGLE game of Yahtzee
# ......Roll once all 5 dice

# ......Check for Yahtzee Yahtzee

Question 3: In 1986, the Challenger space shuttle exploded during “throttle up” due to catastrophic failure of o-rings (seals) around the rocket booster. The data (real) on all space shuttle launches prior to the Challenger disaster are in the file “challenger.csv”. The variables in the data set are defined as follows:

launch: this numbers the temperature-sorted observations from 1 to 23.
temp: temperature in degrees Fahrenheit at the time of launch.
incident: If there was an incident with an O-Ring, then it is coded “Yes”.
o_ring_prob: counts the number of O-ring partial failures experienced on the flight. Load the data into R and answer the following questions. Include all R code. Thoughts: I remember this event very clearly. “Challenger, you are go for throttle up” still haunts my brain to this day.

Question 3.1: Print the measures of center (like mean, median, mode,…), spread (like sd,min,max,…) and shape (skewness,kurtosis,…) for the variables in the data. HINT: You can use the describe function in “psych” package for this.

# Include psych for describe function
library(psych)

# Import data
challenger <- read.csv("challenger.csv")

# Descriptive Statistics for each of the variables
# Launch
cat("VARIABLE: launch\n")

## VARIABLE: launch

cat("  Mean:              ", mean(challenger$launch), "\n")

##   Mean:               12

cat("  Median:            ", median(challenger$launch), "\n")

##   Median:             12

cat("  Standard Deviation:", sd(challenger$launch), "\n")

##   Standard Deviation: 6.78233

cat("  Variance:          ", var(challenger$launch), "\n")

##   Variance:           46

cat("  Min:               ", min(challenger$launch), "\n")

##   Min:                1

cat("  Max:               ", max(challenger$launch), "\n")

##   Max:                23

cat("  Range:             ", max(challenger$launch) - min(challenger$launch), "\n")

##   Range:              22

cat("  IQR:               ", IQR(challenger$launch), "\n")

##   IQR:                11

cat("  Skewness:          ", skew(challenger$launch), "\n")

##   Skewness:           0

cat("  Kurtosis:          ", kurtosi(challenger$launch), "\n\n")

##   Kurtosis:           -1.357278

cat(rep("=",30),"\n")

## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

# Temp
cat("VARIABLE: temp (F)\n")

## VARIABLE: temp (F)

cat("  Mean:              ", round(mean(challenger$temp), 2), "F\n")

##   Mean:               69.02 F

cat("  Median:            ", median(challenger$temp), "F\n")

##   Median:             69.8 F

cat("  Standard Deviation:", round(sd(challenger$temp), 2), "F\n")

##   Standard Deviation: 6.97 F

cat("  Variance:          ", round(var(challenger$temp), 2), "\n")

##   Variance:           48.55

cat("  Min:               ", min(challenger$temp), "F\n")

##   Min:                53.6 F

cat("  Max:               ", max(challenger$temp), "F\n")

##   Max:                80.6 F

cat("  Range:             ", max(challenger$temp) - min(challenger$temp), "F\n")

##   Range:              27 F

cat("  IQR:               ", IQR(challenger$temp), "F\n")

##   IQR:                8.1 F

cat("  Skewness:          ", round(skew(challenger$temp), 3), "\n")

##   Skewness:           -0.399

cat("  Kurtosis:          ", round(kurtosi(challenger$temp), 3), "\n\n")

##   Kurtosis:           -0.443

cat(rep("=",30),"\n")

## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

# Incident
# Variable: incident (categorical, YES/NO)
cat("VARIABLE: incident\n")

## VARIABLE: incident

cat("  Frequency table:\n")

##   Frequency table:

print(table(challenger$incident))

## 
##  No Yes 
##  16   7

cat("\n  Proportions:\n")

## 
##   Proportions:

print(prop.table(table(challenger$incident)))

## 
##        No       Yes 
## 0.6956522 0.3043478

cat("\n")

cat(rep("=",30),"\n")

## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

# O-Ring Problems
# Variable: o_ring_prob
cat("VARIABLE: o_ring_prob (count of O-ring failures)\n")

## VARIABLE: o_ring_prob (count of O-ring failures)

cat("  Mean:              ", round(mean(challenger$o_ring_prob), 3), "\n")

##   Mean:               0.435

cat("  Median:            ", median(challenger$o_ring_prob), "\n")

##   Median:             0

cat("  Standard Deviation:", round(sd(challenger$o_ring_prob), 3), "\n")

##   Standard Deviation: 0.788

cat("  Variance:          ", round(var(challenger$o_ring_prob), 3), "\n")

##   Variance:           0.621

cat("  Min:               ", min(challenger$o_ring_prob), "\n")

##   Min:                0

cat("  Max:               ", max(challenger$o_ring_prob), "\n")

##   Max:                3

cat("  Range:             ", max(challenger$o_ring_prob) - min(challenger$o_ring_prob), "\n")

##   Range:              3

cat("  IQR:               ", IQR(challenger$o_ring_prob), "\n")

##   IQR:                1

cat("  Skewness:          ", round(skew(challenger$o_ring_prob), 3), "\n")

##   Skewness:           1.806

cat("  Kurtosis:          ", round(kurtosi(challenger$o_ring_prob), 3), "\n\n")

##   Kurtosis:           2.689

cat(rep("=",30),"\n")

## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

Question 3.2: Second, what are the levels of measurement of these 4 variables? Discuss/Justify.

Launch Level of Measurement: Ordinal, because it ranks launches based on temperature and provides ordering information.
Temp Level of Measurement: Interval, because temperature has equal intervals between values. The interval means the same anywhere on the scale. Also, 0 degrees is strictly arbitrary and has no real meaning.
Incident Level of Measurement: Nominal, because it simply is a yes or a no response.
O-Ring Problems Level of Measurement: Ratio, because it is simply counting with a true 0 point. Also, ratios are relevant (i.e. 4 failures is 4 times as many as 1 failure, etc. )

Question 3.3: Third, provide an appropriate graph for the variable o_ring_prob. Interpret. Boxplot is acceptable, though histogram would be better.

# Histogram
hist(challenger$o_ring_prob, 
     main = "Distribution of O-Ring Failures",
     xlab = "Number of O-Ring Problems",
     ylab = "Frequency (Number of Launches)",
     col = "coral",
     border = "black",
     breaks = seq(-0.5, max(challenger$o_ring_prob) + 0.5, by = 1),
     xlim = c(-0.5, max(challenger$o_ring_prob) + 0.5),
     las = 1)

skewness <- 
+     round(skew(challenger$o_ring_prob), 3)
cat("Skewness: ", skewness)

## Skewness:  1.806

mean <- round(mean(challenger$o_ring_prob), 3)
cat("Mean: ", mean)

## Mean:  0.435

mode <- round(median(challenger$o_ring_prob), 3)
cat("Median: ", mode)

## Median:  0

Interpretation The distribution is heavily right-skewed (skewness = 1.806). Most launches, 16 out of 23 had no O-ring failures. Few launches had multiple O-Ring failures. But, in the real world, where lives are on the line, having 16 failures out of 23 is very concerning. It shows me that even though the mean is “low” (0.435) and mode is 0, we still need to look at what an acceptable risk is.

Question 3.4: The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?

# Create side-by-side boxplots
boxplot(temp ~ incident, 
        data = challenger,
        main = "Temperature Distribution by O-Ring Incident Status\nChallenger Launch Temp: 36 F",
        xlab = "O-Ring Incident (YES/NO)?",
        ylab = "Temperature (F)",
        col = c("lightgreen", "red"),
        border = "black",
        las = 1,
        outline = TRUE)

# Overlay grid
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted")

Thoughts: Data shows that Challenger launched at 36 F despite the fact that there was a history of O-ring failures at lower temperatures. They did not even need to be as low as it was before the risk to failure was evident. Form the boxplot, we can see that even at approx 57 degrees F, failures began to happen. Question 3.5: In the already temperature-sorted dataset ( order(mydata$temp) ), find on which observation the first successful launch occurred (one with no incident). Answer using a command (instead of eyeballing the data).
Thoughts: Find where the first observation where incident = NO

# Find first launch with no O-ring incident (data sorted by temperature)

first_no_incident <- which(challenger$incident == "No")[1]

# Display results
cat("RESULT:\n")

## RESULT:

cat("Observation number:", first_no_incident, "\n")

## Observation number: 5

cat("Launch number:", challenger$launch[first_no_incident], "\n")

## Launch number: 5

cat("Temperature:", challenger$temp[first_no_incident], "F\n")

## Temperature: 66.2 F

cat("O-ring problems:", challenger$o_ring_prob[first_no_incident], "\n\n")

## O-ring problems: 0

Question 3.6: How many incidents occurred above 65 degrees F? Answer using a command (instead of eyeballing the data).

incidents_above_65 <- sum(challenger$incident == "Yes" & challenger$temp > 65)
cat("Number of incidents that occurred above 65 F: ",incidents_above_65)

## Number of incidents that occurred above 65 F:  3

Question 4: High-density lipoprotein (HDL) is sometimes called the “good cholesterol” because low values are associated with a higher risk of heart disease. According to the American Health Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL. United States women aged 20 and over have a mean HDL of 47 mg/dl with a standard deviation of 3.46. Assume that the distribution is Normal. For each question below, attach an annotated script in R.

Question 4.1: What percent of women in the United States have low values of HDL (40 mg/dl) or less?

# What I know
# Mean = 47 mg/dl
# Standard Deviation = 3.46 mg/dl
# Distribution is Normal

# Parameters
mu <- 47
sigma <- 3.46

# What we are looking for
# P(X <= 40)
p_low_hdl <- pnorm(40, mu, sigma)
percent_low_hdl <- round((p_low_hdl * 100), 2)

cat("The PERCENT of women with HDL <= 40: ", percent_low_hdl,"%")

## The PERCENT of women with HDL <= 40:  2.15 %

Question 4.2: What percent of women in the United States have a value of HDL larger than 50 mg/dl?

# What I am looking for: P(X > 50)
# This is 1 - P(X ≤ 50) which means I can use lower.tail = FALSE in pnorm()

p_high_hdl <- pnorm(50, mu, sigma, lower.tail = FALSE)
percent_high_hdl <- round((p_high_hdl * 100), 2)

cat("The PERCENT of women with HDL >= 50: ", percent_high_hdl,"%")

## The PERCENT of women with HDL >= 50:  19.3 %

Question 4.3: What percent of women in the United States have a value of HDL between 46 and 50 mg/dl?

# What I need: P(46 < X < 50)
# This is the same as P(X ≤ 50) - P(X ≤ 46)

p_50_or_less <- pnorm(50, mu, sigma)
p_46_or_less <- pnorm(46, mu, sigma)
p_between <- p_50_or_less - p_46_or_less
percent_between <- round((p_between * 100), 2)

cat("The PERCENT of women with HDL between 46 and 50: ", percent_between,"%")

## The PERCENT of women with HDL between 46 and 50:  42.08 %

Question 4.4: What values of HDL would place a woman in the United States in the lowest 15% of the distribution?

# I need to find the value x where P(X ≤ x) = 0.15
# This is the 15th percentile
# Using qnorm() which finds quantiles (inverse of pnorm)

hdl_percentile <- round(qnorm(0.15, mu, sigma), 2)

cat("The value of HDL that would place a woman in the LOWEST 15% is: ", hdl_percentile)

## The value of HDL that would place a woman in the LOWEST 15% is:  43.41

Question 5: Question 5

Question 5.1: Similarly to the definition of outliers in a sample, we can define an outlier for a population as on observation which is smaller than Q1−1.5⋅IQR or larger than Q3+1.5⋅IQR where IQR is the inter-quartile range (IQR=Q3−Q1). What proportion of observations in a Normal distribution are outliers acording to this definition? Attach an R script with details for your calculations.

q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
iqr <- q3 - q1
lower <- q1 - (1.5 * iqr)
upper <- q3 + (1.5 * iqr)

# This is so I understand what I am trying to do...
# pnorm(upper) = P(X <= upper) "Left tail"
# 1 - pnorm(upper) = P(X > upper) "Right tail"
# Outliers are either too small or too large
proportion_outliers <- pnorm(lower) + (1 - pnorm(upper))

cat("Proportion of outliers:", round(proportion_outliers, 6), "\n")

## Proportion of outliers: 0.006977

cat("Percentage of outliers:", round(proportion_outliers * 100, 2), "%\n")

## Percentage of outliers: 0.7 %

Question 5.2: The distribution of the natural log of family income in the US in 2017 was Normal with mean 11.237 and standard deviation 0.747, according to data from the St. Louis Fed. For all questions below attach an R script with details for your calculations.

Question 5.2.1: What is the median family income in the US in 2017?

mu_log <- 11.237
sigma_log <- 0.747

# calculate e^(log mean)
median_income <- exp(mu_log)
cat("Median family income: $", median_income)

## Median family income: $ 75886.95

Question 5.2.2: Calculate the interquartile range for the distribution of natural log of family income in the US in 2017.

q1_log <- qnorm(0.25, mu_log, sigma_log)
q3_log <- qnorm(0.75, mu_log, sigma_log)
iqr_log <- q3_log - q1_log

cat("The IQR of log(income) is: ", iqr_log )

## The IQR of log(income) is:  1.007688

Question 5.2.3: What is the interquartile range for the distribution of family income in the US in 2017.

q1_log <- qnorm(0.25, mu_log, sigma_log)
q3_log <- qnorm(0.75, mu_log, sigma_log)

q1_income <- exp(q1_log)
q3_income <- exp(q3_log)
iqr_income <- q3_income - q1_income
cat("The IQR of income is:", iqr)

## The IQR of income is: 1.34898

Question 5.2.4: In 2017, families earning more than $84,500 were subject to Alternative Minimum Tax. The Tax Reform Act passed in 2017 raised that threshold to $109,400 for 2018. Assuming the distribution of income didn’t change from 2017 to 2018, what percentage of families are no longer subject to AMT in 2018? 5.

threshold_2017 <- 84500
threshold_2018 <- 109400

mu_log <- 11.237
sigma_log <- 0.747

cat("2017 AMT threshold: $", format(threshold_2017, big.mark = ","), "\n")

## 2017 AMT threshold: $ 84,500

cat("2018 AMT threshold: $", format(threshold_2018, big.mark = ","), "\n\n")

## 2018 AMT threshold: $ 109,400

# lognormal distribution
# 3 groups: 1. Below 84,500: never paid AMT, 2. Between 84,500 and 109,400 are the families who benefit because they paid AMT in 2017 and did not pay AMT in 2018, and 3. Above 109,400 still pay AMT.

# Convert those thresholds to the log scale
log_threshold_2017 <- log(threshold_2017)
log_threshold_2018 <- log(threshold_2018)

cat("log($84,500) =", round(log_threshold_2017, 4), "\n")

## log($84,500) = 11.3445

cat("log($109,400) =", round(log_threshold_2018, 4), "\n\n")

## log($109,400) = 11.6028

# Now we can use pnorm on ths log scale
prop_above_2017 <- 1 - pnorm(log_threshold_2017, mu_log, sigma_log)
prop_above_2018 <- 1 - pnorm(log_threshold_2018, mu_log, sigma_log)

cat("Proportion earning > $84,500 (paid AMT in 2017):", 
    round(prop_above_2017, 4), "\n")

## Proportion earning > $84,500 (paid AMT in 2017): 0.4428

cat("Proportion earning > $109,400 (pay AMT in 2018):", 
    round(prop_above_2018, 4), "\n\n")

## Proportion earning > $109,400 (pay AMT in 2018): 0.3122

prop_no_longer_AMT <- prop_above_2017 - prop_above_2018

cat("Percentage no longer subject to AMT:", round(prop_no_longer_AMT * 100, 2), "%\n")

## Percentage no longer subject to AMT: 13.06 %

Question 6: Your organization owns an expensive Magnetic Resonance Imaging machine (MRI). This machine has a manufacturer’s expected lifetime of 10 years i.e. the machine fails once in 10 years, or the probability of the machine failing in any given year is 110. For each question, attach an annotated R code. Thoughts: Setup includes knowing:

Expected life of machine: 10 years
Failure rate (lambda): 0.1 fails/year (1 failure every 10 years)
What is the probability the machine DOES NOT fail for 8 years?

Question 6.1: What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don’t forget to useλ⋅t rather just λ).

# Poisson Model
lambda <- 0.1 # Failure rate per year
t <-  8 # years 

# Thank you for the hint!
lambda_t <- lambda * t

# P(X = 0) = probability of 0 failures in 8 years
p_poisson <-  dpois(0, lambda_t)

cat("P(machine survives 8 years) =", round(p_poisson, 4)*100, "%\n")

## P(machine survives 8 years) = 44.93 %

Question 6.2: What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If X is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.)

# Binomial Model
n <- 8 # of trials
p <- 1/10 # probability of failure in any given year

# P(X = 0) = probability of 0 failures in 8 INDEPENDENT years
p_binomial <- dbinom(0, size = n, prob = p)

cat("P(machine survives 8 years) =", round(p_binomial, 4)*100, "%\n")

## P(machine survives 8 years) = 43.05 %

CONCLUSION: The results are similar but not identical (Poisson vs Binomial). The key reason for this difference is that the Binomial distribution models the number of successes in INDEPENDENT trials. In the question above, each year is a trial with probability p = 0.1 failure. This is the more accurate way to see it. Contrast this to Poisson, which is a continuous time approximation

ADEC7310 Midterm

Dan Leone

2026-02-05

Question 1: Question 1

Question 2: Yathzee is a game played with five regular six-sided dice.

Question 5: Question 5