# ---- Set up: define the event (rain over .5 inches) ----
# create binary variable: 1 if rain > 0.5 inches, 0 otherwise
ch_rainfall_binary_05 <- ch_rainfall |> mutate(heavy_rain = ifelse(prec_in > 0.5, 1, 0))
# calculate empirical probability of heavy rain (>.5 in) on any given day
total_days <- length(ch_rainfall_binary_05$date)
total_heavy_rain_days <- sum(ch_rainfall_binary_05$heavy_rain)
p_heavy_rain <- total_heavy_rain_days / total_days
p_heavy_rain## [1] 0.09131249
# ---- Question 1: Most likely number of days (out of 10) with heavy rain ----
# binomial distribution over 10 days
binom_days <- tibble(
num_days = 0:10,
probability = dbinom(x = 0:10, size = 10, prob = p_heavy_rain)
)
# plot the distribution
ggplot(binom_days, aes(x = num_days, y = probability)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 0:10) +
labs(title = "Binomial Distribution: Days with Rain > .5 inches (n = 10)",
x = "Number of Days",
y = "Probability")# identify the most likely number of days (mode of the distribution)
most_likely_days <- binom_days$num_days[which.max(binom_days$probability)]
most_likely_days## [1] 1
# ---- Question 2: Cumulative probability of at least 3 days with heavy rain ----
# P(X < 3) = P(0, 1, or 2 days)
prob_less_than_3 <- pbinom(2, size = 10, prob = p_heavy_rain)
# P(X >= 3) = complement
prob_at_least_3 <- 1 - prob_less_than_3
prob_at_least_3## [1] 0.0560366
Over 10 days in Chapel Hill the most likely number of days with rain over .5 inches is 1 day (probability of about 0.3857) though 0 days is nearly identical in likelihood (probability of about 0.3838). These two outcomes are essentially a statistical tie, which makes sense given how rare heavy rain (>.5 inches) is on any given day.
The cumulative probability that Chapel Hill will get at least 3 days of rain over .5 inches in the next 10 days is 0.0560 (about 5.6%). This relatively low probability lines up with the shape of the binomial distribution plotted above.
# ---- Theoretical probability using the normal distribution ----
mean_temp <- mean(rdu_jan$max_temp)
sd_temp <- sd(rdu_jan$max_temp)
theoretical_prob <- 1 - pnorm(83, mean = mean_temp, sd = sd_temp)
theoretical_prob## [1] 0.002999614
# ---- Empirical probability using the observed data ----
total_days <- length(rdu_jan$max_temp)
days_83_or_above <- sum(rdu_jan$max_temp >= 83)
empirical_prob <- days_83_or_above / total_days
empirical_prob## [1] 0
# ---- Compare the two side by side ----
comparison <- tibble(
method = c("Theoretical (Normal)", "Empirical"),
probability = c(theoretical_prob, empirical_prob)
)
comparisonThe theoretical (normal) probability is about .3% while the empirical probably is exactly 0%. The normal model predicts a 83F January day which is rare but possible, while the historical data shows it has never actually happened.
83F is an extreme outlier for January in this region, so it’s not surprising that a rare tail event shows up as a zero occurences in a finite sample. It’s also likely that real January temperatures have a thinner, more physically constrained righ tail than the symmetric normal distribution assumes, causing the model to overestimate the cause of extreme heat.
The potential problems in applying the normal distribution to estimate rare events in this dataset is that normal distrubtions can meaningfully misjudge tail probabilities even when it fits the center of the data well. Relying on it for rare/extreme events risks overestimating the true likelihood so emperical data is generally more trustworthy for this kind of analysis.