Question 1.1: I have two copies of the famous
‘’Dummy Variables for Dummies’’ textbook. If offered a copy, a random
student will accept it with probability 0.3. Calculate the probability
that I have to offer it to exactly 6 students before I get rid of both
copies. Assume I am offering each student the book sequentially, and
their decisions are independent of each other’s. Calculate this
probability from scratch, without using any statistical function in R
(Binomial, Poisson, etc.). Attach an annotated R code detailing all the
steps you took and the reasoning behind it.
RESPONSE: This is a Negative Binomial scenario. The
reason I know this is that we are counting the number of trials needed
to achieve a fixed number of successes. In the question, we are looking
for “How many students do we need to offer the book to before 2 accept?”
So, for exactly 6 students, we would need to have one of the first 5
students accept 1 (4 must reject) and have the 6th accept. (This is the
opposite of a Binomial scenario where we have a fixed number of trials
with random successes).
# Negative Binomial Problem: Probability of getting rid of 2 books in exactly 6 offers
# Parameters
p <- 0.3 # Probability a student accepts the book
q <- 0.7 # Probability a student rejects the book (1 - p)
r <- 2 # Number of successes needed (both books accepted)
n <- 6 # Total number of trials (students offered)
# Calculate the number of ways to arrange 1 success in the first 5 trials
# We need exactly 1 success in trials 1-5, then success on trial 6
# This is "5 choose 1"
ways_to_choose <- choose(5, 1)
# Calculate probability of the specific pattern:
# - 1 success in first 5 trials: p^1
# - 4 failures in first 5 trials: q^4
# - Success on 6th trial: p
prob_sequence <- p * (q^4) * p
# Total probability = (number of arrangements) × (probability per arrangement)
prob_total <- ways_to_choose * prob_sequence
# Display results
cat("Question 1.1:", prob_total, "\n")
## Question 1.1: 0.108045
cat("As percentage:", prob_total * 100, "%\n")
## As percentage: 10.8045 %
Question 1.2: The question in the previous part is an example of Negative Binomial distribution. The random variable in this distribution is the number of failures before a given number r of successes is obtained, given that each trial has a Bernouli distribution with probability of success p. Write a function in R that will calculate the probability distribution function for any r≥1 and p∈(0,1). You may use any statistical function you want, except the built-in Negative Binomial. Attach an annotated R code detailing all the steps and the reasoning behind them. Show that your function in R gives you the same answer as the answer you calculated in the first part.
# Negative Binomial Distribution
# P(X = k) = C(k+r-1, k) * p^r * (1-p)^k
# Parameters
p <- 0.3 # Probability a student accepts the book
r <- 2 # Number of successes needed (both books accepted)
k <- 4 # Number of failures (students who reject)
# Negative binomial probability function
neg_binomial <- function(k, r, p) {
binomial_coef <- choose(k + r - 1, k)
prob <- binomial_coef * (p^r) * ((1 - p)^k)
return(prob)
}
# Calculate probability
result <- neg_binomial(k, r, p)
# Display results
cat("Question 1.2:", result, "\n")
## Question 1.2: 0.108045
cat("From question 1.1:", prob_total, "\n")
## From question 1.1: 0.108045
cat("Results match!\n")
## Results match!
Question 1.3: Check that the function you wrote generates the same probability distribution function for the Negative Binomial as the built in function in R. Check this with four graphs, each for a given value of parameters p and r. For each graph, show the probability distribution function for values of the number of failures from 0 to 20 calculated by your function and the probability distribution function calculated by the built-in function in R. Attach an annotated R code detailing all the steps. Attach the graphs.
# First, define the function from Question 1.2
neg_binomial <- function(k, r, p) {
q <- 1 - p # Define q inside the function for proper scoping
# Calculate binomial coefficient
binomial_coef <- choose(k + r - 1, k)
# Calculate probability
prob <- binomial_coef * (p^r) * (q^k)
return(prob)
}
# Range of failures to plot
failures <- 0:20
# TEST 1: r = 2, p = 0.3 (original problem parameters)
r <- 2
p <- 0.3
# Calculate probabilities using both methods
probs_custom <- sapply(failures, function(k) neg_binomial(k, r, p))
probs_builtin <- dnbinom(failures, size = r, prob = p)
# Create plot
plot(failures,
probs_custom,
type = "b",
col = "blue",
lwd = 2,
pch = 16,
xlab = "Number of Failures (k)",
ylab = "Probability P(X = k)",
main = paste0("Test 1: r = ", r, ", p = ", p))
# Add built-in function line
lines(failures, probs_builtin, type = "b", col = "red", lwd = 2, pch = 1, lty = 2)
# Add legend
legend("topright",
legend = c("Custom neg_binomial()", "Built-in dnbinom()"),
col = c("blue", "red"),
pch = c(16, 1),
lty = c(1, 2),
lwd = 2)
# Add grid
grid()
# TEST 2: r = 3, p = 0.7 (high success probability)
r <- 3
p <- 0.7
# Calculate probabilities using both methods
probs_custom <- sapply(failures, function(k) neg_binomial(k, r, p))
probs_builtin <- dnbinom(failures, size = r, prob = p)
# Create plot
plot(failures,
probs_custom,
type = "b",
col = "blue",
lwd = 2,
pch = 16,
xlab = "Number of Failures (k)",
ylab = "Probability P(X = k)",
main = paste0("Test 2: r = ", r, ", p = ", p))
# Add built-in function line
lines(failures, probs_builtin, type = "b", col = "red", lwd = 2, pch = 1, lty = 2)
# Add legend
legend("topright",
legend = c("Custom neg_binomial()", "Built-in dnbinom()"),
col = c("blue", "red"),
pch = c(16, 1),
lty = c(1, 2),
lwd = 2)
# Add grid
grid()
# TEST 3: r = 5, p = 0.2 (many successes, low probability)
r <- 5
p <- 0.2
# Calculate probabilities using both methods
probs_custom <- sapply(failures, function(k) neg_binomial(k, r, p))
probs_builtin <- dnbinom(failures, size = r, prob = p)
# Create plot
plot(failures,
probs_custom,
type = "b",
col = "blue",
lwd = 2,
pch = 16,
xlab = "Number of Failures (k)",
ylab = "Probability P(X = k)",
main = paste0("Test 3: r = ", r, ", p = ", p))
# Add built-in function line
lines(failures, probs_builtin, type = "b", col = "red", lwd = 2, pch = 1, lty = 2)
# Add legend
legend("topright",
legend = c("Custom neg_binomial()", "Built-in dnbinom()"),
col = c("blue", "red"),
pch = c(16, 1),
lty = c(1, 2),
lwd = 2)
# Add grid
grid()
# TEST 4: r = 1, p = 0.5
r <- 1
p <- 0.5
# Calculate probabilities using both methods
probs_custom <- sapply(failures, function(k) neg_binomial(k, r, p))
probs_builtin <- dnbinom(failures, size = r, prob = p)
# Create plot
plot(failures,
probs_custom,
type = "b",
col = "blue",
lwd = 2,
pch = 16,
xlab = "Number of Failures (k)",
ylab = "Probability P(X = k)",
main = paste0("Test 4: r = ", r, ", p = ", p))
# Add built-in function line
lines(failures, probs_builtin, type = "b", col = "red", lwd = 2, pch = 1, lty = 2)
# Add legend
legend("topright",
legend = c("Custom neg_binomial()", "Built-in dnbinom()"),
col = c("blue", "red"),
pch = c(16, 1),
lty = c(1, 2),
lwd = 2)
# Add grid
grid()
Question 2.1: What is the probability all five dice
will show the same number (this is known as an Yahtzee)? Attach an
annotated R code detailing all the steps for your calculations.
RESPONSE:
# 2.1 Probability of Yahtzee (all five dice showing the same value)
# Total possible outcomes when rolling 5 dice
total_outcomes <- 6^5
# Favorable outcomes: 6 possible Yahtzees (all 1s, all 2s, ..., all 6s)
favorable_outcomes <- 6
# Calculate probability
p_yahtzee <- favorable_outcomes / total_outcomes
# Display result
cat("Probability of Yahtzee:", round(p_yahtzee * 100, 2), "%\n")
## Probability of Yahtzee: 0.08 %
Question 2.2: What is the probability all five dice will show a different number? Attach an annotated R code detailing all the steps for your calculations.
# I know the probability rolling a different number:
# ...on the first die is: 1/6
# ...on the second die: 5/6
# ...on the third die: 4/6
# ...on the fourth die: 3/6
# ...on the fifth die: 2/6
# 6P5 = 6!(6-5)! = 6!/1! = 720
total_outcomes <- 6^5
favorable_different <- factorial(6)/factorial(6-5)
# Probability
p_all_different <- favorable_different / total_outcomes
cat("The probability of rolling a differnt number on each die is:", round((p_all_different * 100), 2), "%\n")
## The probability of rolling a differnt number on each die is: 9.26 %
Question 2.3: In the game of Yahtzee, a player rolls the dice three times. After each roll, the player decides which dice, if any, to roll again. Suppose a player is trying to obtain an Yahtzee (same number on all five dice). Suppose after the first roll, the player sees two dice showing the same number and three showing other numbers (for example, 1, 1, 3, 5, 6). The player decide to keep the two 1s, and roll again the other three dice. Calculate the probability the player will end up having exactly three dice showing the same number. Attach an annotated R code detailing all the steps for your calculations. RESPONSE: I know it is not safe to assume, but I am going to assume that we are talking about roll the 3 dice and receiving a 3rd 1 ie 1,1,1,3,5, for example. NOT that they were going to end up with, for example: 1,1,4,4,4
# Initial conditions: Two 1s (kept), roll again
# What we need: exactly three total 1s
# Total number of outcomes when rolling 3 dice
total_3_dice <- 6^3
# C(3, 1) = 3
# Must show a 1: 1 way
# Remaining dice must not show a 1: 5 choices each = 25 ways
which_die <- choose(3, 1)
show_1 <- 1
not_show_1 <- 5^2
exactly_three <- which_die * show_1 * not_show_1
# Probability
p_exactly_three <- exactly_three / total_3_dice
cat("Probability of exactly three 1s total:",
round((p_exactly_three * 100), 2), "%\n")
## Probability of exactly three 1s total: 34.72 %
Question 2.4: Assuming the player always keeps largest set of dice showing the same number and rolls the other dice, what is the probability the player will get an Yahtzee within the three rolls (after the first, or after the second, or after the third)? Write a piece of code in R with a simulation for this probability. Attach the annotated code, detailing the reasoning behind all steps.
# I dont even know if this is simulation problem, but let's start with that
# Life, the Universe and Everything
set.seed(42)
# Simulation parameters
num_simulations <- 100000
num_dice <- 5
num_faces <- 6
max_rolls <- 3
# Strategy: Keep the largest set of matching dice and reroll the remainder
# Create a function to find the most common value and its count
get_max_match <- function(dice){
# Count occurrences of each value
count <- table(dice)
# Get max count
mac_count <- max(count)
# ??? SORRY...I AM NOT SURE WHERE TO GO FROM HERE.
# I WILL TAKE THE HIT
}
# Simulate a SINGLE game of Yahtzee
# ......Roll once all 5 dice
# ......Check for Yahtzee Yahtzee
Question 3.1: Print the measures of center (like mean, median, mode,…), spread (like sd,min,max,…) and shape (skewness,kurtosis,…) for the variables in the data. HINT: You can use the describe function in “psych” package for this.
# Include psych for describe function
library(psych)
# Import data
challenger <- read.csv("challenger.csv")
# Descriptive Statistics for each of the variables
# Launch
cat("VARIABLE: launch\n")
## VARIABLE: launch
cat(" Mean: ", mean(challenger$launch), "\n")
## Mean: 12
cat(" Median: ", median(challenger$launch), "\n")
## Median: 12
cat(" Standard Deviation:", sd(challenger$launch), "\n")
## Standard Deviation: 6.78233
cat(" Variance: ", var(challenger$launch), "\n")
## Variance: 46
cat(" Min: ", min(challenger$launch), "\n")
## Min: 1
cat(" Max: ", max(challenger$launch), "\n")
## Max: 23
cat(" Range: ", max(challenger$launch) - min(challenger$launch), "\n")
## Range: 22
cat(" IQR: ", IQR(challenger$launch), "\n")
## IQR: 11
cat(" Skewness: ", skew(challenger$launch), "\n")
## Skewness: 0
cat(" Kurtosis: ", kurtosi(challenger$launch), "\n\n")
## Kurtosis: -1.357278
cat(rep("=",30),"\n")
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
# Temp
cat("VARIABLE: temp (F)\n")
## VARIABLE: temp (F)
cat(" Mean: ", round(mean(challenger$temp), 2), "F\n")
## Mean: 69.02 F
cat(" Median: ", median(challenger$temp), "F\n")
## Median: 69.8 F
cat(" Standard Deviation:", round(sd(challenger$temp), 2), "F\n")
## Standard Deviation: 6.97 F
cat(" Variance: ", round(var(challenger$temp), 2), "\n")
## Variance: 48.55
cat(" Min: ", min(challenger$temp), "F\n")
## Min: 53.6 F
cat(" Max: ", max(challenger$temp), "F\n")
## Max: 80.6 F
cat(" Range: ", max(challenger$temp) - min(challenger$temp), "F\n")
## Range: 27 F
cat(" IQR: ", IQR(challenger$temp), "F\n")
## IQR: 8.1 F
cat(" Skewness: ", round(skew(challenger$temp), 3), "\n")
## Skewness: -0.399
cat(" Kurtosis: ", round(kurtosi(challenger$temp), 3), "\n\n")
## Kurtosis: -0.443
cat(rep("=",30),"\n")
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
# Incident
# Variable: incident (categorical, YES/NO)
cat("VARIABLE: incident\n")
## VARIABLE: incident
cat(" Frequency table:\n")
## Frequency table:
print(table(challenger$incident))
##
## No Yes
## 16 7
cat("\n Proportions:\n")
##
## Proportions:
print(prop.table(table(challenger$incident)))
##
## No Yes
## 0.6956522 0.3043478
cat("\n")
cat(rep("=",30),"\n")
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
# O-Ring Problems
# Variable: o_ring_prob
cat("VARIABLE: o_ring_prob (count of O-ring failures)\n")
## VARIABLE: o_ring_prob (count of O-ring failures)
cat(" Mean: ", round(mean(challenger$o_ring_prob), 3), "\n")
## Mean: 0.435
cat(" Median: ", median(challenger$o_ring_prob), "\n")
## Median: 0
cat(" Standard Deviation:", round(sd(challenger$o_ring_prob), 3), "\n")
## Standard Deviation: 0.788
cat(" Variance: ", round(var(challenger$o_ring_prob), 3), "\n")
## Variance: 0.621
cat(" Min: ", min(challenger$o_ring_prob), "\n")
## Min: 0
cat(" Max: ", max(challenger$o_ring_prob), "\n")
## Max: 3
cat(" Range: ", max(challenger$o_ring_prob) - min(challenger$o_ring_prob), "\n")
## Range: 3
cat(" IQR: ", IQR(challenger$o_ring_prob), "\n")
## IQR: 1
cat(" Skewness: ", round(skew(challenger$o_ring_prob), 3), "\n")
## Skewness: 1.806
cat(" Kurtosis: ", round(kurtosi(challenger$o_ring_prob), 3), "\n\n")
## Kurtosis: 2.689
cat(rep("=",30),"\n")
## = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Question 3.2: Second, what are the levels of measurement of these 4 variables? Discuss/Justify.
Question 3.3: Third, provide an appropriate graph for the variable o_ring_prob. Interpret. Boxplot is acceptable, though histogram would be better.
# Histogram
hist(challenger$o_ring_prob,
main = "Distribution of O-Ring Failures",
xlab = "Number of O-Ring Problems",
ylab = "Frequency (Number of Launches)",
col = "coral",
border = "black",
breaks = seq(-0.5, max(challenger$o_ring_prob) + 0.5, by = 1),
xlim = c(-0.5, max(challenger$o_ring_prob) + 0.5),
las = 1)
skewness <-
+ round(skew(challenger$o_ring_prob), 3)
cat("Skewness: ", skewness)
## Skewness: 1.806
mean <- round(mean(challenger$o_ring_prob), 3)
cat("Mean: ", mean)
## Mean: 0.435
mode <- round(median(challenger$o_ring_prob), 3)
cat("Median: ", mode)
## Median: 0
Interpretation The distribution is heavily right-skewed (skewness = 1.806). Most launches, 16 out of 23 had no O-ring failures. Few launches had multiple O-Ring failures. But, in the real world, where lives are on the line, having 16 failures out of 23 is very concerning. It shows me that even though the mean is “low” (0.435) and mode is 0, we still need to look at what an acceptable risk is.
Question 3.4: The temperature on the day of the Challenger launch was 36 degrees Fahrenheit. Provide side-by-side boxplots for temperature by incident (temp~incident in formula). Why might this have been a concern?
# Create side-by-side boxplots
boxplot(temp ~ incident,
data = challenger,
main = "Temperature Distribution by O-Ring Incident Status\nChallenger Launch Temp: 36 F",
xlab = "O-Ring Incident (YES/NO)?",
ylab = "Temperature (F)",
col = c("lightgreen", "red"),
border = "black",
las = 1,
outline = TRUE)
# Overlay grid
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted")
Thoughts: Data shows that Challenger launched at 36 F
despite the fact that there was a history of O-ring failures at lower
temperatures. They did not even need to be as low as it was before the
risk to failure was evident. Form the boxplot, we can see that even at
approx 57 degrees F, failures began to happen. Question 3.5:
In the already temperature-sorted dataset ( order(mydata$temp)
), find on which observation the first successful launch occurred (one
with no incident). Answer using a command (instead of eyeballing the
data).
Thoughts: Find where the first observation where
incident = NO
# Find first launch with no O-ring incident (data sorted by temperature)
first_no_incident <- which(challenger$incident == "No")[1]
# Display results
cat("RESULT:\n")
## RESULT:
cat("Observation number:", first_no_incident, "\n")
## Observation number: 5
cat("Launch number:", challenger$launch[first_no_incident], "\n")
## Launch number: 5
cat("Temperature:", challenger$temp[first_no_incident], "F\n")
## Temperature: 66.2 F
cat("O-ring problems:", challenger$o_ring_prob[first_no_incident], "\n\n")
## O-ring problems: 0
Question 3.6: How many incidents occurred above 65 degrees F? Answer using a command (instead of eyeballing the data).
incidents_above_65 <- sum(challenger$incident == "Yes" & challenger$temp > 65)
cat("Number of incidents that occurred above 65 F: ",incidents_above_65)
## Number of incidents that occurred above 65 F: 3
Question 4.1: What percent of women in the United States have low values of HDL (40 mg/dl) or less?
# What I know
# Mean = 47 mg/dl
# Standard Deviation = 3.46 mg/dl
# Distribution is Normal
# Parameters
mu <- 47
sigma <- 3.46
# What we are looking for
# P(X <= 40)
p_low_hdl <- pnorm(40, mu, sigma)
percent_low_hdl <- round((p_low_hdl * 100), 2)
cat("The PERCENT of women with HDL <= 40: ", percent_low_hdl,"%")
## The PERCENT of women with HDL <= 40: 2.15 %
Question 4.2: What percent of women in the United States have a value of HDL larger than 50 mg/dl?
# What I am looking for: P(X > 50)
# This is 1 - P(X ≤ 50) which means I can use lower.tail = FALSE in pnorm()
p_high_hdl <- pnorm(50, mu, sigma, lower.tail = FALSE)
percent_high_hdl <- round((p_high_hdl * 100), 2)
cat("The PERCENT of women with HDL >= 50: ", percent_high_hdl,"%")
## The PERCENT of women with HDL >= 50: 19.3 %
Question 4.3: What percent of women in the United States have a value of HDL between 46 and 50 mg/dl?
# What I need: P(46 < X < 50)
# This is the same as P(X ≤ 50) - P(X ≤ 46)
p_50_or_less <- pnorm(50, mu, sigma)
p_46_or_less <- pnorm(46, mu, sigma)
p_between <- p_50_or_less - p_46_or_less
percent_between <- round((p_between * 100), 2)
cat("The PERCENT of women with HDL between 46 and 50: ", percent_between,"%")
## The PERCENT of women with HDL between 46 and 50: 42.08 %
Question 4.4: What values of HDL would place a woman in the United States in the lowest 15% of the distribution?
# I need to find the value x where P(X ≤ x) = 0.15
# This is the 15th percentile
# Using qnorm() which finds quantiles (inverse of pnorm)
hdl_percentile <- round(qnorm(0.15, mu, sigma), 2)
cat("The value of HDL that would place a woman in the LOWEST 15% is: ", hdl_percentile)
## The value of HDL that would place a woman in the LOWEST 15% is: 43.41
Question 5.1: Similarly to the definition of outliers in a sample, we can define an outlier for a population as on observation which is smaller than Q1−1.5⋅IQR or larger than Q3+1.5⋅IQR where IQR is the inter-quartile range (IQR=Q3−Q1). What proportion of observations in a Normal distribution are outliers acording to this definition? Attach an R script with details for your calculations.
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
iqr <- q3 - q1
lower <- q1 - (1.5 * iqr)
upper <- q3 + (1.5 * iqr)
# This is so I understand what I am trying to do...
# pnorm(upper) = P(X <= upper) "Left tail"
# 1 - pnorm(upper) = P(X > upper) "Right tail"
# Outliers are either too small or too large
proportion_outliers <- pnorm(lower) + (1 - pnorm(upper))
cat("Proportion of outliers:", round(proportion_outliers, 6), "\n")
## Proportion of outliers: 0.006977
cat("Percentage of outliers:", round(proportion_outliers * 100, 2), "%\n")
## Percentage of outliers: 0.7 %
Question 5.2: The distribution of the natural log of family income in the US in 2017 was Normal with mean 11.237 and standard deviation 0.747, according to data from the St. Louis Fed. For all questions below attach an R script with details for your calculations.
Question 5.2.1: What is the median family income in the US in 2017?
mu_log <- 11.237
sigma_log <- 0.747
# calculate e^(log mean)
median_income <- exp(mu_log)
cat("Median family income: $", median_income)
## Median family income: $ 75886.95
Question 5.2.2: Calculate the interquartile range for the distribution of natural log of family income in the US in 2017.
q1_log <- qnorm(0.25, mu_log, sigma_log)
q3_log <- qnorm(0.75, mu_log, sigma_log)
iqr_log <- q3_log - q1_log
cat("The IQR of log(income) is: ", iqr_log )
## The IQR of log(income) is: 1.007688
Question 5.2.3: What is the interquartile range for the distribution of family income in the US in 2017.
q1_log <- qnorm(0.25, mu_log, sigma_log)
q3_log <- qnorm(0.75, mu_log, sigma_log)
q1_income <- exp(q1_log)
q3_income <- exp(q3_log)
iqr_income <- q3_income - q1_income
cat("The IQR of income is:", iqr)
## The IQR of income is: 1.34898
Question 5.2.4: In 2017, families earning more than $84,500 were subject to Alternative Minimum Tax. The Tax Reform Act passed in 2017 raised that threshold to $109,400 for 2018. Assuming the distribution of income didn’t change from 2017 to 2018, what percentage of families are no longer subject to AMT in 2018? 5.
threshold_2017 <- 84500
threshold_2018 <- 109400
mu_log <- 11.237
sigma_log <- 0.747
cat("2017 AMT threshold: $", format(threshold_2017, big.mark = ","), "\n")
## 2017 AMT threshold: $ 84,500
cat("2018 AMT threshold: $", format(threshold_2018, big.mark = ","), "\n\n")
## 2018 AMT threshold: $ 109,400
# lognormal distribution
# 3 groups: 1. Below 84,500: never paid AMT, 2. Between 84,500 and 109,400 are the families who benefit because they paid AMT in 2017 and did not pay AMT in 2018, and 3. Above 109,400 still pay AMT.
# Convert those thresholds to the log scale
log_threshold_2017 <- log(threshold_2017)
log_threshold_2018 <- log(threshold_2018)
cat("log($84,500) =", round(log_threshold_2017, 4), "\n")
## log($84,500) = 11.3445
cat("log($109,400) =", round(log_threshold_2018, 4), "\n\n")
## log($109,400) = 11.6028
# Now we can use pnorm on ths log scale
prop_above_2017 <- 1 - pnorm(log_threshold_2017, mu_log, sigma_log)
prop_above_2018 <- 1 - pnorm(log_threshold_2018, mu_log, sigma_log)
cat("Proportion earning > $84,500 (paid AMT in 2017):",
round(prop_above_2017, 4), "\n")
## Proportion earning > $84,500 (paid AMT in 2017): 0.4428
cat("Proportion earning > $109,400 (pay AMT in 2018):",
round(prop_above_2018, 4), "\n\n")
## Proportion earning > $109,400 (pay AMT in 2018): 0.3122
prop_no_longer_AMT <- prop_above_2017 - prop_above_2018
cat("Percentage no longer subject to AMT:", round(prop_no_longer_AMT * 100, 2), "%\n")
## Percentage no longer subject to AMT: 13.06 %
Question 6.1: What is the probability that the machine will fail after 8 years? Model as a Poisson. (Hint: Don’t forget to useλ⋅t rather just λ).
# Poisson Model
lambda <- 0.1 # Failure rate per year
t <- 8 # years
# Thank you for the hint!
lambda_t <- lambda * t
# P(X = 0) = probability of 0 failures in 8 years
p_poisson <- dpois(0, lambda_t)
cat("P(machine survives 8 years) =", round(p_poisson, 4)*100, "%\n")
## P(machine survives 8 years) = 44.93 %
Question 6.2: What is the probability that the machine will fail after 8 years? Model as a binomial. (Hint: If X is a random variable measuring counts of failure, then we want to find the probability of 0 success in 8 years.)
# Binomial Model
n <- 8 # of trials
p <- 1/10 # probability of failure in any given year
# P(X = 0) = probability of 0 failures in 8 INDEPENDENT years
p_binomial <- dbinom(0, size = n, prob = p)
cat("P(machine survives 8 years) =", round(p_binomial, 4)*100, "%\n")
## P(machine survives 8 years) = 43.05 %
CONCLUSION: The results are similar but not identical (Poisson vs Binomial). The key reason for this difference is that the Binomial distribution models the number of successes in INDEPENDENT trials. In the question above, each year is a trial with probability p = 0.1 failure. This is the more accurate way to see it. Contrast this to Poisson, which is a continuous time approximation