Stats200Q Book

Author

Eshaan Kothari

Stats200Q Book: Theoretical Statistics Methods Applied to Earthquake Interarrival Times Modeled as Exponential Distribution

For this class, I was tasked with examining data that follows a exponential distribution (with a Gamma prior). Toggling between Bayesian and Frequentist methods (and underscoring the difference in philosophies between them including the pros and cons for each), this book covers deriving estimators, evaluating estimators, asymptotic distributions of estimators, confidence/credible interval estimation, and hypothesis testing.

Chapter 1: Finding and Exploring Dataset

For my data, I decided to look at historical earthquake data in California from 2000 to 2025 from USGS. I modeled the inter-arrival days between magnitude 3.5+ earthquakes as an exponential distribution, which we can see below. There is 1115 data points within this data sets, meaning in the past century there have been around 1115 earthquakes of magnitude 3.5 or greater in California.

library(readr)
earthquakes <- read_csv("/Users/eshaankothari/Downloads/earthquakes.csv")

Rows: 1115 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (3): mag, interarrival_hours, interarrival_days
dttm (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data <- earthquakes
interarrival_days <- data$interarrival_days

hist(interarrival_days,
     main="Days Betweeen Magnitude 3.5+ Earthquakes in Cali (2000-2025)",
     xlab="Interarrival Days",
     breaks=25,
     freq=FALSE)

curve(dgamma(x, shape=1, rate=0.1218418), # used the MLE estimate to approximate expo curve 
      add=TRUE)

Chapter 2: Estimation

Statistics is concerned with determining the underlying model of data. We can use estimators to help estimate the parameter(s) that our dataset follows.

Frequentist Estimator - Maximum Likelihood

Estimator Calculation

I started by getting the MLE for \(\lambda\) using the principal of maximum likelihood. Below is how I derived the MLE:

\[ L(\lambda) = \prod_{i=1}^n\lambda e^{-\lambda x_i} \\ = \lambda^ne^{-\lambda\sum_{i=1}^nx_i} \]\[ l(\lambda) = n\text{log}\lambda - \lambda \sum_{i=1}^n x_i \]\[ \frac{dl(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \]\[ \widehat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{X}}. \]

Estimate for Data

Using the MLE estimator, I then calculated an estimate for my earthquake dataset.

n = length(data$interarrival_days)
lambda.MLE <- (1 / mean(interarrival_days, na.rm = TRUE))
print(paste("MLE estimate:", lambda.MLE))

[1] "MLE estimate: 0.121841846221153"

Bias and Variance for Estimator

Using Jensen’s inequality, since \(\frac{1}{x}\) is a convex function, I know that the MLE estimator is positively biased for \(\lambda\). However, in order to calculate the exact bias and variance, I used the property that the sum of exponential random variables is a gamma distribution. From this, I found that the bias is \(\frac{\lambda}{n-1}\) and variance is \(\frac{\lambda^2n^2}{(n-1)^2(n-2)}\).

I also found the bias and variance via simulation by generating a bunch of new data from an underlying exponential distribution, calculating the MLE estimates, and then examining the mean and variance of the resulting distribution of estimates over all the simulated trials.

num_sims = 10000

sims <- replicate(num_sims, {
  x <- rexp(n, rate = lambda.MLE)
  lambda.hat = 1 / mean(x)
})

hist(sims, breaks=30, freq=FALSE)
# underlying actual value of lambda
abline(v = lambda.MLE, col = "red", lwd = 2)
# expectation of distrubution of MLE estimator
abline(v = mean(sims), col = "blue", lwd = 2)

bias_MLE <- mean(sims) - lambda.MLE
actual_bias <- lambda.MLE / (n - 1)
print(paste("Simulated bias:", bias_MLE))

[1] "Simulated bias: 9.7699766840792e-05"

print(paste("Calculate bias:", actual_bias))

[1] "Calculate bias: 0.000109373291042327"

# they are pretty close

variance_MLE <- var(sims)
actual_variance <- (lambda.MLE^2 * n^2) / ((n - 1)^2 * (n - 2))
print(paste("Simulated variance:",variance_MLE))

[1] "Simulated variance: 1.35318364875699e-05"

print(paste("Calculated variance:", actual_variance))

[1] "Calculated variance: 1.3362174250235e-05"

We can see that the bias and variance via simulation are very close to that via calculation, and that the MLE estimator is in fact positively biased.

Bayesian Estimator - Posterior Mean

Estimator Calculation

I then calculated the Bayesian estimator using Bayes theorem, modelling the prior distribution of lambda as a Gamma random variable, meaning \(f(\lambda) \sim Gamma(\alpha, c)\):

\[ f(\lambda \mid x_1, \dots, x_n) \propto f(x_1, \dots, x_n)\, f(\lambda)\]\[\propto \left( \prod_{i=1}^n \lambda e^{-\lambda x_i} \right)\cdot \frac{c^\alpha}{\Gamma(\alpha)} e^{-c\lambda} \lambda^{\alpha - 1}\]\[\propto \lambda^n e^{-\lambda \sum_{i=1}^n x_i} \cdot e^{-c\lambda} \lambda^{\alpha - 1}\]\[\propto \lambda^{\alpha + n - 1} \, e^{-\lambda \left(c + \sum_{i=1}^n x_i \right)}\]\[\Rightarrow \lambda \mid x_1, \dots, x_n \sim \text{Gamma}(\alpha + n,\; c + \sum_{i=1}^n x_i) \]

Since the posterior distribution follows a gamma, we know that the posterior mean estimator is \(\widehat{\lambda}_{Bayes}\) =\(\frac{\alpha + n}{c + \sum_{i=1}^n x_1}\).

Estimate for Data

Using the posterior mean estimator, I then calculated an estimate for my earthquake dataset, using the prior \(Gamma(1, 1.5598)\). I chose this prior because, after some quick research, the average rate of magnitude 3.5+ earthquakes in California is about 1 earthquake every 1.5 days in a year since, on average, there are about 234 3.0-4.0 magnitude earthquakes in one year around California (according to the LA Times).

prior.a = 1
prior.c = (365/234)

posterior_mean <- ((n + prior.a) 
                  / (prior.c + 
                       sum(data$interarrival_days, na.rm = TRUE)))

print(paste("Posterior Mean Estimate", posterior_mean))

[1] "Posterior Mean Estimate 0.122039772374122"

We can see that this estimator is a bit larger than the MLE since it is biasing the estimate toward the prior mean of 1 more. Because of this, I expect this estimator to be more biased than the MLE, which we can calculate via simulation.

Bias and Variance of Estimator

num_sims = 10000

sims_bayes <- replicate(num_sims, {
  x <- rexp(n, rate = posterior_mean)
  lambda.hat = (n + prior.a) / ((prior.c) + sum(x))
})

hist(sims_bayes, breaks=30, freq=FALSE)

abline(v = posterior_mean, col = "red", lwd = 2)
abline(v = mean(sims_bayes), col = "blue", lwd = 2)

bias_bayes <- mean(sims_bayes) - posterior_mean
print(paste("Simulated bias (Bayes):", bias_bayes))

[1] "Simulated bias (Bayes): 0.000234043388641386"

variance_bayes <- var(sims_bayes)
print(paste("Simulated variance (Bayes):",variance_bayes))

[1] "Simulated variance (Bayes): 1.36750463758218e-05"

We can see that the posterior mean is slightly more biased, but the variance seems to be slightly smaller when compared to the MLE. We can look at the difference more closely between the bayesian and frequentist estimators below:

Comparing Bias and Variance Across The Estimators

n = 10 # for small n
lambda_actual = 0.125

sims <- replicate(num_sims, {
  x <- rexp(n, rate = lambda_actual)
  lambda.hat = 1 / mean(x)
})

sims_bayes <- replicate(num_sims, {
  x <- rexp(n, rate = lambda_actual)
  lambda.hat = (n + 1) / ((365/234) + sum(x)) # prior Gamma(1, 1.559) as before
})

plot(density(sims_bayes),
     col = "blue",
     lwd = 2,
     main = "Bayes vs MLE Estimators (with rate = 0.125 for small n)",
     xlab = "Estimate")

lines(density(sims),
      col = "red",
      lwd = 2)

legend("topright",
       legend = c("Bayes", "MLE"),
       col = c("blue", "red"),
       lwd = 2)

abline(v = lambda_actual, col = "black", lwd = 1)
abline(v = mean(sims), col = "red", lwd = 1, lty = 2)
abline(v = mean(sims_bayes), col = "blue", lwd = 1, lty = 2)

We can see that the MLE and Bayes estimators are both slightly positively biased since their means are greater than the actual \(\lambda = 0.125\). However, we can see that the Bayesian estimator is more biased to the prior mean, which explains why it is slightly less variable and speaks to the importance of choosing a good prior since that could bias the estimator towards the true value of \(\lambda\). Ultimately, the Bayesian estimator is a weighted combination of the prior mean and the MLE, so for smaller values of n (which I chose above), the Bayesian estimator weights to the prior mean, and can have a smaller MSE, assuming the prior mean is accurate. As n increases, the difference between the estimators becomes more negligible because the Bayesian estimator starts to weight the MLE more. We will explore MSE in the next chapter further.

Chapter 3: Evaluating and Comparing Estimators

We want to compare the frequentist and bayesian estimators that we derived above through frequentist mean-squared error (MSE), which we can do by looking at their respective bias and variance through the bias-variance decomposition of MSE. This chapter also briefly discusses Bayesian MSE as a metric for evaluating estimators.

Proof of \(\begin{align}\mathrm{Var}(\ell'(\theta)) &= - \mathbb{E}(\ell''(\theta))\end{align}\)

Before calculating MSEs, I proved the following property above the variance of the score, which is important to understanding the theoretical variance lower bound for unbiased estimators using the Cramer-Rao Inequality. See the proof below:

\[ \begin{align} \mathrm{Var}(\ell'(\theta)) &= \mathbb{E}\left[(\ell'(\theta) - \mathbb{E}(\ell'(\theta)))^2\right] \\[10pt]&= \mathbb{E}[(\ell'(\theta))^2] \quad \text{since } \mathbb{E}(\ell'(\theta)) = 0 \\[15pt]\ell'(\theta) &= \frac{f'(x)}{f(x)} \\[10pt]\frac{d}{d\theta} \ell'(\theta) &= \frac{f''(x)f(x) - (f'(x))^2}{(f(x))^2}\\\mathbb{E}\left(\frac{f''(x)f(x) - (f'(x))^2}{(f(x))^2}\right)&= \mathbb{E}\left(\frac{f''(x)}{f(x)}\right) - \mathbb{E}\left(\left(\frac{f'(x)}{f(x)}\right)^2\right) = \mathbb{E}(\ell''(\theta)) \\[15pt]\int_{-\infty}^{\infty} \frac{f''(x)}{f(x)} f(x)\,dx - \mathbb{E}[(\ell'(\theta))^2]&= \mathbb{E}(\ell''(\theta)) \\[10pt]\int_{-\infty}^{\infty} f''(x)\,dx&= \frac{d}{d\theta} \left( \frac{d}{d\theta} \int_{-\infty}^{\infty} f(x)\,dx \right) \\[10pt]&= \frac{d}{d\theta} \left( \frac{d}{d\theta} (1) \right) \\[10pt]&= 0 \\ \therefore \ \mathbb{E}[(\ell'(\theta))^2] &= -\mathbb{E}(\ell''(\theta)) = \mathrm{Var}(\ell'(\theta)) \end{align} \]

MSE of Frequentist Estimator - MLE

From above (where we found the exact distribution of the MLE using a gamma distribution), we know that the bias of the MLE estimator is \(\frac{\lambda}{n-1}\) and variance is \(\frac{\lambda^2n^2}{(n-1)^2(n-2)}\) (see appendix for calculations). Therefore, the MSE of the MLE estimator is \(\frac{\lambda^2(n+2)}{(n-1)(n-2)}\), applying the bias-variance decomposition. It does not make sense to calculate MSE for my specific data set since MSE looks at how an estimator varies from the true underlying value, and we do not have multiple estimates nor a true underlying lambda for the estimate of my data.

But, I ran simulations on what the MSE would look like for varying \(\lambda\) and \(n\), comparing it to the theoretical MSE that we calculated above.

n <- 10 # small n
lambdas <- seq(0.05, 1000, length.out = 100)

MLE_mses = numeric(100)
for (i in seq_along(lambdas)) {
  MLEs <- replicate(1000, {
    x <- rexp(n, lambdas[i]) # varying underlying lambdas
    n / sum(x)
  })
  MLE_mses[i] <- mean((MLEs - lambdas[i])^2)
}

plot(lambdas, MLE_mses,
     type = "l",
     xlab = "Lambda",
     ylab = "MSE",
     main = "MSE of MLE Estimator with Varying Lambda")

theory <- (lambdas^2 * (n + 2)) / ((n - 1) * (n - 2))
lines(lambdas, theory, col = "red", lwd = 1)

We can see that with larger and larger lambdas as the true underlying value, the MSE of the MLE estimator continues to increase quadratically.

lambda_true <- 1/8
ns <- seq(5, 30, length.out = 100)

MLE_mses_2 = numeric(100)
for (i in seq_along(ns)) {
  MLEs <- replicate(10000, {
    x <- rexp(ns[i], lambda_true)
    1 / mean(x)
  })
  MLE_mses_2[i] <- mean((MLEs - lambda_true)^2)
}

plot(ns, MLE_mses_2,
     type = "l",
     xlab = "n",
     ylab = "MSE",
     main = "MSE of MLE Estimator for Increasing n")


theory <- (lambda_true^2 * (ns + 2)) / ((ns - 1) * (ns - 2))
lines(ns, theory, col = "red", lwd = 1)

The fact that the MSE approaches zero as n goes to infinity makes sense since the MLE estimator for the exponential distribution is consistent and the variance will continue to shrink.

MSE of Bayesian Estimator - Posterior Mean

There is no way to derive the exact MSE of the Bayesian estimator since we could not derive the exact bias and variance, so we can run the simulations that we did above, but instead choosing a prior as well.

n <- 10
lambdas <- seq(0.05, 1000, length.out = 100)

bayes_mses = numeric(100)
for (i in seq_along(lambdas)) {
  bayes_estimates <- replicate(1000, {
    x <- rexp(n, lambdas[i])
    (n + 1) / (sum(x) + (1 / lambdas[i])) # changing prior
  })
  bayes_mses[i] <- mean((bayes_estimates - lambdas[i])^2)
}

plot(lambdas, bayes_mses,
     type = "l",
     xlab = "Lambda",
     ylab = "MSE",
     main = "MSE of Bayesian Estimator with Varying Lambda")

prior_a = 1
prior_c = 8
lambda_true <- 1/8
ns <- seq(5, 30, length.out = 100)

bayes_mses_2 = numeric(100)
for (i in seq_along(ns)) {
  MLEs <- replicate(10000, {
    x <- rexp(ns[i], lambda_true)
    (ns[i] + prior_a) / (sum(x) + prior_c)
  })
  bayes_mses_2[i] <- mean((MLEs - lambda_true)^2)
}

plot(ns, bayes_mses_2,
     type = "l",
     xlab = "n",
     ylab = "MSE",
     main = "MSE of Bayesian Estimator for Increasing n")

Both graphs follow similar trends to the frequentist estimator as expected, but let’s more closely see how they compare.

Comparing Frequentist and Bayesian Estimator Using Frequentist MSE

plot(lambdas, MLE_mses,
     type = "l",
     col = "red",
     xlab = "Lambda",
     ylab = "MSE",
     main = "MLE vs Posterior Mean MSE (varying lambda)")

lines(lambdas, bayes_mses,
      col = "blue",
      lwd = 1)

legend("topright",
       legend = c("MLE", "Bayes"),
       col = c("red", "blue"),
       lty = 1,
       lwd = 2)

plot(lambdas, MLE_mses_2,
     type = "l",
     col = "black",
     xlab = "n",
     ylab = "MSE",
     main = "MLE vs Posterior Mean MSE (varying n)")

lines(lambdas, bayes_mses_2,
      col = "blue",
      lwd = 1)

legend("topright",
       legend = c("MLE", "Bayes"),
       col = c("black", "blue"),
       lty = 1,
       lwd = 2)

The Bayesian estimator seems to do better than the MLE in terms of MSE in both graphs. I believe this is only because I chose a prior in both graphs such that it would model the underlying lambda. That would bias the estimator closer to the true value and reduce the MSE more. However, this is also very contingent on having this accurate prior knowledge, which speaks to a disadvantage of Bayesian estimation if this information is not readily available. In fact, when I tried to choose a constant prior for each iteration, the Bayesian MSE would perform a lot worse than the MLE estimator. This also speaks to how we are comparing the Bayesian estimator on the Frequentist playing field, using its definition of MSE. If we instead used Bayesian MSE, which takes the weighted average of all the MSE from each lambda over the prior, then the posterior mean, by definition, would perform a lot better.

Finally, we can construct an unbiased estimator for \(\lambda\).

MSE of Unbiased Estimator for \(\lambda\)

We can also derive an unbiased estimator for \(\lambda\) by multiplying the MLE estimator by an “unbiasing” factor. In fact, since the expectation of \(\widehat{\lambda}_{MLE}\) is \(\frac{n}{n-1}\lambda\), we can construct an unbiased estimator \(\widehat{\lambda}_{unbiased}\) to be \(\frac{n-1}{n}\widehat{\lambda}_{MLE}\) or \(\frac{n-1}{n}(\frac{1}{\bar{X}})\). To evaluate this estimator, we can look at its variance (since the bias is 0). Since we derived the variance of the \(\widehat{\lambda}_{MLE}\) above, we can multiply that by our “unbiasing” coefficient to get the variance of \(\widehat{\lambda}_{unbiased}\) as \(\frac{\lambda^2}{n-2}\).

Cramer-Rao Inequality

Now, we can use the Cramer-Rao inequality to calculate what the lower bound of variance for an unbiased estimator of \(\lambda\) would be and see how our unbiased estimator’s variance holds up to this. We cannot use the Cramer-Rao lower bound (CRLB) to assess the other estimators since those are biased. Here is how we derived the CRLB using the score that we calculated above during the MLE calculations:

\[ \begin{aligned}\\\operatorname{Var}(\ell'(\theta)) \\= \operatorname{Var}\!\left(\frac{n}{\lambda} - \sum_{i=1}^n X_i \right) \\= \operatorname{Var}\!\left(\sum_{i=1}^n X_i \right) \\= \frac{n}{\lambda^2} \\\frac{1}{n/\lambda^2} = \frac{\lambda^2}{n}\end{aligned} \]

Since the variance of \(\widehat{\lambda}_{unbiased} = \frac{\lambda}{n-2}\) is always greater than the CRLB variance value (for \(n >2\)), we know that this is not the lowest possible variance for an unbiased estimator, meaning there theoretically exist better unbiased estimators with lower MSE compared to \(\widehat{\lambda}_{unbiased}\).

Comparing All Estimators and the CRLB

Let’s compare the bias and variance to the other two estimators, and show that this unbiased estimator is in fact unbiased.

n = 10 # for small n
lambda_actual = 0.125

sims <- replicate(num_sims, {
  x <- rexp(n, rate = lambda_actual)
  lambda.hat = 1 / mean(x)
})

sims_bayes <- replicate(num_sims, {
  x <- rexp(n, rate = lambda_actual)
  lambda.hat = (n + 1) / (8 + sum(x))  # Gamma(1, 8) prior as before
})

sims_unbiased <- replicate(num_sims, {
  x <- rexp(n, rate = lambda_actual)
  lambda.hat = (n - 1) / (sum(x))
})

plot(density(sims_bayes),
     col = "blue",
     lwd = 2,
     main = "Bayes vs MLE Estimators (with rate = 0.125 for small n)",
     xlab = "Estimate")

lines(density(sims),
      col = "red",
      lwd = 2)

lines(density(sims_unbiased),
      col = "green",
      lwd = 2)

legend("topright",
       legend = c("Bayes", "MLE", "Unbiased"),
       col = c("blue", "red", "green"),
       lwd = 2)

abline(v = lambda_actual, col = "black", lwd = 1)
abline(v = mean(sims), col = "red", lwd = 1, lty = 2)
abline(v = mean(sims_bayes), col = "blue", lwd = 1, lty = 2)
abline(v = mean(sims_unbiased), col = "green", lwd = 1, lty = 2)

Now, let’s compare the MSEs of all the estimators for varying lambda and compare that against the theoretical lower bound from Cramer-Rao.

n <- 10
lambdas <- seq(0.05, 1000, length.out = 100)

unbiased_mses = numeric(100)
for (i in seq_along(lambdas)) {
  unbiased_estimates <- replicate(1000, {
    x <- rexp(n, lambdas[i])
    (n - 1) / (sum(x)) # unbiased estimator
  })
  unbiased_mses[i] <- mean((unbiased_estimates - lambdas[i])^2)
}

n <- 10
plot(lambdas, MLE_mses,
     type = "l",
     col = "red",
     xlab = "Lambda",
     ylab = "MSE",
     main = "MLE vs Bayesian MSE vs Unbiased vs CRLB (varying lambda)")

lines(lambdas, bayes_mses,
      col = "blue",
      lwd = 1)

lines(lambdas, unbiased_mses,
      col = "green",
      lwd = 1)

legend("topright",
       legend = c("MLE", "Bayes", "Unbiased"),
       col = c("red", "blue", "green"),
       lty = 1,
       lwd = 2)

theory <- (lambdas^2 / n) # CRLB
lines(lambdas, theory, col = "black", lwd = 1)

We can see that the unbiased estimator does better than the MLE (which is similar to the German Tank problem in how adding an unbiasing factor to the MLE actually produces a better estimate), but that none reach the Cramer-Rao Lower Bound MSE.

Chapter 4: Asymptotic Distribution of Estimators

Now, we can assess our estimators asymptotically, or as we get more and more data. What would we expect to happen to our estimators?

One thought is that we would expect our estimators to do better since there is more information for the estimates to more closely model the underlying true parameter. We can see through consistency and asymptotic distributions how this is the case.

Frequentist Estimator - MLE

Consistency

We proved in class that all MLE estimators are consistent, meaning the estimator approaches the actual value of the paramater. We can show that is the case for the MLE of \(\lambda\) for the exponential distribution using the Law of Large Numbers and Continuous Mapping Theorem:

\[ \hat{\lambda}_{\text{MLE}} = \frac{1}{\bar{X}}\]\[\text{By LLN: } \bar{X} \xrightarrow{p} \mu\]\[\text{By CMT: } \hat{\lambda}_{\text{MLE}} = \frac{1}{\bar{X}} \xrightarrow{p} \frac{1}{\mathbb{E}[X]} = \frac{1}{1/\lambda} = \lambda\]\[\text{where } g(x) = \frac{1}{x}, \quad g(\hat{\theta}_{\text{MLE}}) \to g(\mu)\]\[\hat{\lambda}_{\text{MLE}} \xrightarrow{p} \lambda \]

Through simulation with increasing n, we can see how as n goes to infinity, the estimates increasingly converge to the true value of \(\lambda\), showing consistency of the \(\widehat{\lambda}_{MLE}\) which we just proved.

par(mfrow = c(2, 3))  # 2 rows, 3 columns

lambda_actual <- 1/8
num_sims <- 5000

for (n in c(20, 100, 500, 1000, 2000, 10000)) {
  sims <- replicate(num_sims, {
    x <- rexp(n, rate = lambda_actual)
    1 / mean(x)
  })
  
  hist(sims, breaks=30, freq=FALSE,
       main=paste("n =", n), xlim=c(0, 0.3))
  abline(v = lambda_actual, col = "red", lwd = 1)
}

Consistency is essentially showing how our MLE estimator is unbiased asymptotically since the estimates get closer and closer to the actual value (even though the MLE estimator is postively biased in finite statistics for the exponential \(\lambda\)), which we can see in the following graph.

lambda_true <- 1/8
ns <- seq(5, 50, length.out = 100)

bias_mse = numeric(100)
for (i in seq_along(ns)) {
  MLEs <- replicate(10000, {
    x <- rexp(ns[i], lambda_true)
    1 / mean(x)
  })
  bias_mse[i] <- mean(MLEs) - lambda_true
}

plot(ns, bias_mse,
     type = "l",
     xlab = "Lambda",
     ylab = "Bias",
     main = "Bias of MLE Esimator for Increasing n")

Asymptotic Distribution of Estimator

Now, we want to examine what the asymptotic distribution of the MLE estimator looks like. We proved in class that every MLE estimator looks asymptotically normal around the true underlying value of the parameter with variance that is equal to the Cramer-Rao Lower Bound.

For the exponential random variable, we showed that the CRLB was \(\lambda^2 / n\), so we can see that \(\sqrt{n}(\lambda_{MLE} - \lambda_{0}) \xrightarrow{d} Normal(0, {\lambda_0^2})\) or \(\widehat{\lambda}_{MLE} \sim Normal(\lambda_0, \frac{\lambda_0^2}{n})\).

I also confirmed this calculation using the Delta Method (see appendix), which yielded the same distribution.

n = 1000
lambda_actual <- 1/8
num_sims <- 10000

sims <- replicate(num_sims, {
  x <- rexp(n, rate = lambda_actual)
  1 / mean(x)
  })

hist(sims, breaks=50, freq=FALSE, main=paste('Distribution of MLE estimator'))
curve(
  dnorm(x, lambda_actual, sqrt(lambda_actual^2 / n)), add=TRUE, col='red', lwd=2
)

We can see how our MLE estimator very closely models the derived distribution asymptotically through simulation.

par(mfrow = c(2, 3))  # 2 rows, 3 columns

lambda_actual <- 1/8
num_sims <- 5000

for (n in c(5, 20, 50, 100, 1000)) {
  sims <- replicate(num_sims, {
    x <- rexp(n, rate = lambda_actual)
    1 / mean(x)
  })
  
  hist(sims, breaks=30, freq=FALSE,
       main=paste("n =", n), xlim=c(0, 0.3))
  curve(
  dnorm(x, lambda_actual, sqrt(lambda_actual^2 / n)), 
        add=TRUE, col='red', lwd=2
  )
}

As we increase n, the asymptotic distribution fits the MLE estimator better and better.

Chapter 5: Confidence and Credible Intervals

Instead of a point estimate for \(\lambda\), now we can work to develop a range of plausible values where the underlying parameter may fall.

Frequentist Methods - Confidence Intervals

Since we cannot approximate the distribution of the MLE estimate for \(\lambda\) in finite statistics, we can use asymptotic confidence intervals using the asymptotic distribution of the MLE estimator that we derived last time, which was \(\widehat{\lambda}_{MLE} \sim Normal(\lambda_0, \frac{\lambda_0^2}{n})\).

From this, we can calculate a 95% confidence interval for \(\lambda_0\), which is a range of plausible values in which we are 95% confident \(\lambda_0\) lies between. There are two main ways to calculate these asymptotic intervals, one is using a consistent estimate for the Fisher information, which is just the MLE itself (Wald Interval). Another is solved for a generic \(\lambda\) fully using the Fisher information (Wilson Interval). I derived both below:

Wald Interval

\[ \hat{\lambda}_{\text{MLE}} \sim \mathcal{N}\left(\lambda, \frac{\lambda^2}{n}\right) \\ \]

\[ \mathbb{P}\left( \Phi^{-1}(0.025) \le Z \le \Phi^{-1}(0.975) \right) = 0.95 \\\\ \]

\[ \mathbb{P}\left( \Phi^{-1}(0.025) \le \frac{\hat{\lambda} - \lambda}{\sqrt{\lambda^2 / n}} \le \Phi^{-1}(0.975) \right) = 0.95 \\\\ \]

\[ C(\vec{X}) = \hat{\lambda} \pm \Phi^{-1}(0.975)\sqrt{\frac{\hat\lambda^2}{n}} \]

Wilson Interval

\[ \begin{aligned}\frac{\hat{\lambda} - \lambda}{\sqrt{\lambda^2 / n}} &\sim \mathcal{N}(0,1) \\[6pt]\hat{\lambda} - \lambda &= z \cdot \frac{\lambda}{\sqrt{n}} \\[6pt]\hat{\lambda} &= \lambda \left(1 + \frac{z}{\sqrt{n}}\right) \\[6pt]\lambda &= \frac{\hat{\lambda}}{1 + \frac{z}{\sqrt{n}}} \\[10pt]\text{95\% confidence interval:} \\[6pt]\left(\frac{\hat{\lambda}}{1 + \frac{\Phi^{-1}(0.975)}{\sqrt{n}}}, \;\frac{\hat{\lambda}}{1 - \frac{\Phi^{-1}(0.975)}{\sqrt{n}}}\right)\end{aligned} \]

Calculations for My Data

For my data, the Wald and Wilson confidence intervals would be:

lambda.MLE = (1 / mean(interarrival_days, na.rm = TRUE))

wald_lower_bound <- lambda.MLE - qnorm(.975) * sqrt((lambda.MLE)^2 / n)
wald_upper_bound <- lambda.MLE + qnorm(.975) * sqrt((lambda.MLE)^2 / n)

wilson_lower_bound <- lambda.MLE / (1 + (qnorm(.975)/sqrt(n)))
wilson_upper_bound <- lambda.MLE / (1 - (qnorm(.975)/sqrt(n)))

print(paste("Point Estimate:", lambda.MLE))

[1] "Point Estimate: 0.121841846221153"

print(paste0("Wald Confidence interval for Dataset:
             [", wald_lower_bound, ", ", wald_upper_bound, "]"))

[1] "Wald Confidence interval for Dataset:\n             [0.114290149119684, 0.129393543322621]"

print(paste0("Wilson Confidence interval for Dataset: 
             [", wilson_lower_bound, ", ", wilson_upper_bound, "]"))

[1] "Wilson Confidence interval for Dataset: \n             [0.114730883082507, 0.129892520089662]"

Comparing Wilson and Wald Confidence Intervals

We can see from the earthquake data that the Wald interval is a little tighter than the Wilson. But does that come at a cost of accuracy? Via simulation, we can see the coverage probably of the confidence interval for each type and for varying lambda, and see how close it is to 0.95.

n <- 30
Nsim <- 1000
lambdas <- seq(0.05, 1000, by=50)

coverage <- numeric(length(lambdas))

for (j in seq_along(lambdas)) {
  lambda <- lambdas[j]
  covered <- 0
  
  for (i in 1:Nsim) {
    x <- rexp(n, rate=lambda)
    
    lambda.hat <- 1 / mean(x)
    
    se <- lambda.hat / sqrt(n)

    ci.l <- lambda.hat - qnorm(.975) * se
    ci.u <- lambda.hat + qnorm(.975) * se
      
    if (ci.l <= lambda && lambda <= ci.u) {
      covered <- covered + 1
    }
  }
  
  coverage[j] <- covered / Nsim
}

plot(lambdas, coverage, type="l",
     xlab="lambda", ylab="Coverage Probability",
     main="Wald Confidence Interval: Coverage vs Lambda")

abline(h=0.95, lty=2)  # ideal line

n <- 30
Nsim <- 1000
lambdas <- seq(0.05,1000, by=50)

coverage <- numeric(length(lambdas))

for (j in seq_along(lambdas)) {
  lambda <- lambdas[j]
  covered <- 0
  
  for (i in 1:Nsim) {
    x <- rexp(n, rate=lambda)
    
    lambda.hat <- 1 / mean(x)
    
    # wilson confidence interval
    ci.l <- lambda.hat / (1 + (qnorm(.975)/sqrt(n)))
    ci.u <- lambda.hat / (1 - (qnorm(.975)/sqrt(n)))
      
    if (ci.l <= lambda && lambda <= ci.u) {
      covered <- covered + 1
    }
  }
  
  coverage[j] <- covered / Nsim
}

# plot coverage vs lambda
plot(lambdas, coverage, type="l",
     xlab="lambda", ylab="Coverage Probability",
     main="Wilson Confidence Interval: Coverage vs Lambda")

abline(h=0.95, lty=2)  # ideal line

We can see that the Wilson interval often is actually achieving coverage that is below the 95% mark with the Wald does the opposite and performs well. For our specific dataset we could actually see that the Wald interval does better for the exponential distribution compared to the Wilson (producing a tighter and clearly more accurate range), which is surprising because the Wilson does not involved plugging in an estimate for the Fisher information.

Bayesian Methods - Credible Intervals

We can use the posterior distribution of \(\lambda\) given a prior to get the 95% credible interval for \(\lambda\), which is the range of plausible values for which the probability of \(\lambda\) falling within is 95%. Recall that the posterior distribution is \(S \sim \text{Gamma}(n +\alpha, \sum_{i=1}^{n}{x_i} + \beta)\).

This is the credible interval: \((S^{-1}_{0.025}, S^{-1}_{0.975})\), assuming the posterior distribution is roughly symmetrical although we could use the highest posterior density (HPD) method as well.

For my data, here is the credible interval:

alpha = 1
beta = 8
n = length(interarrival_days)
x = interarrival_days

alpha_post <- alpha + n
beta_post  <- beta + sum(x, na.rm = TRUE)

credible_lower <- qgamma(0.025, shape = alpha_post, rate = beta_post)
credible_upper <- qgamma(0.975, shape = alpha_post, rate = beta_post)

sprintf("Credible interval for dataset: [%.4f, %.4f]", 
        credible_lower, credible_upper)

[1] "Credible interval for dataset: [0.1149, 0.1292]"

curve(dgamma(x, shape = alpha_post, rate = beta_post),
      from=0.1, to=0.3,
      xlab = "lambda", ylab = "Density",
      main = "Gamma Posterior with 95% Credible Interval For Dataset")

abline(v = credible_lower, lty = 2)
abline(v = credible_upper, lty = 2)

We can see that the credible interval has the tightest of intervals since it biased toward the prior mean more.

Comparing Confidence and Credible Intervals

We can assess the accuracy of the 95% credible interval by seeing the coverage of the credible interval for varying lambdas—that is, the percentage of times that the credible interval has the actual lambda within it.

set.seed(1)

n <- 30
Nsim <- 1000
lambdas <- seq(0.05,1000, by=50)

# prior
alpha <- 1
beta  <- 0

coverage <- numeric(length(lambdas))

for (j in seq_along(lambdas)) {
  lambda <- lambdas[j]
  covered <- 0
  
  for (i in 1:Nsim) {
    x <- rexp(n, rate=lambda)
    
    # posterior
    alpha_post <- alpha + n
    beta_post  <- beta + sum(x)
    
    # credible interval
    ci.l <- qgamma(0.025, shape=alpha_post, rate=beta_post)
    ci.u <- qgamma(0.975, shape=alpha_post, rate=beta_post)
    
    if (ci.l <= lambda && lambda <= ci.u) {
      covered <- covered + 1
    }
  }
  
  coverage[j] <- covered / Nsim
}

# plot coverage vs lambda
plot(lambdas, coverage, type="l",
     xlab="lambda", ylab="Coverage Probability",
     main="Credible Interval: Coverage vs Lambda")

abline(h=0.95, lty=2)  # ideal line

We can see that the credible interval does pretty good to provide a range of values which the lambda value is within 95% of the time. Although the Wald interval might do a bit better in terms of accuracy across the lambda values, the credible interval still provides a tighter range. However, it is important to note that this is based on the prior distribution that we place on \(\lambda\).

Chapter 6: Simulating Confidence and Credible Intervals

Last time, we got the theoretical confidence and credible intervals using either the asymptotic distribution of the MLE estimator or the posterior distribution of \(\lambda\). However, what if these methods are not possible or clean?

Now, we are going to simulate these intervals. These resampling techniques are especially useful when we have limited sample data or don’t know the underlying distribution of the data that we have.

Frequentist Methods - Confidence Intervals via Bootstrapping

Parametric Bootstrap

The parametric approach simulates data based on the MLE estimate of our parameter.

Thus, using our earthquake data set, we can resample our data based on the MLE estimate of \(\lambda\).

x <- interarrival_days
n <- length(x)

lambda_MLE <- 1 / mean(interarrival_days, na.rm = TRUE)

lambdas_boot <- replicate(10000, {
  x <- rexp(n, lambda_MLE)
  lambda_MLE_boot = 1 / mean(x)
})

hist(lambdas_boot, breaks=20)

From this distribution of lambda estimates that we derived from sampling from an exponential distribution that follows the one suggested by our one dataset, we can generate a 95% confidence interval:

param <- quantile(lambdas_boot, c(0.025, 0.975))
print(param)

     2.5%     97.5% 
0.1149992 0.1294217

Nonparametric Bootstrap

We can use the nonparametric bootstrap to make fewer assumptions about the distribution of the one sample data that we have. Instead of sampling from an exponential with the distribution that is determined by our MLE estimate for \(\lambda\), we can resample from the data itself with replacement.

sample_data <- interarrival_days[!is.na(interarrival_days)]
n <- length(x)

lambdas_boot_nonparam <- replicate(1000, {
  x <- sample(sample_data, n, replace=TRUE)
  lambda_MLE_boot = 1 / mean(x)
})

hist(lambdas_boot_nonparam, xlab="lambda", breaks=20)

From this distribution, we can get another 95% confidence interval.

non_param <- quantile(lambdas_boot_nonparam, c(0.025, 0.975))
print(non_param)

     2.5%     97.5% 
0.1115204 0.1322504

Bayesian Methods - Credible Intervals via Metropolis-Hastings (MCMC)

Now, we can also simulate the credible intervals using the Metropolis-Hastings algorithm. Although we were able to find a conjugate prior distribution for exponential with a gamma prior, this method is useful when we cannot match a name distribution to a posterior and do not know the normalizing constant.

Recall, that the posterior distribution of lambda given the data with prior \(\sim Gamma(\alpha, \beta)\):

\[ f(\lambda | x_1, ... x_n) \propto \lambda^n e^{-\lambda \sum{x_i}}\lambda^{\alpha - 1}e^{-\beta\lambda} \\ \propto \lambda^{n+\alpha-1}e^{-\lambda(\beta + \sum{x_i})} \]

To sample from a distribution \(f(\lambda)\) when you don’t know the normalizing constant, we can use the Metropolis-Hastings algorithm:

Propose \(\lambda^{(i+1)}\) from some distribution \(g(\lambda)\).
Accept the proposal with probability \(\max(\frac{f(\lambda^{(i+1)}) g(\lambda^{(i)})}{f(\lambda^{(i)}) g(\lambda^{(i+1)})}, 1)\). Otherwise, keep \(\lambda^{(i)}\).
Repeat for many steps. This is a Markov chain whose distribution converges to \(f\).

I chose \(\sim Expo(1)\) as my proposal distribution.

alpha = 1
beta = 8

dataset <- interarrival_days[!is.na(interarrival_days)]
n <- length(dataset)

log.post <- function(lambda) {
  log.post <- (((n + alpha - 1) * log(lambda)) 
            - (lambda * (beta + sum(dataset))))
}

accepted = 0
lambda.post <- c(1.5)

for (i in 1:100000) {
  lambda.curr <- lambda.post[length(lambda.post)]
  
  lambda.prop <- rexp(1, rate=10)

  log.ratio <- (log.post(lambda.prop) - log.post(lambda.curr) 
              - lambda.curr + lambda.prop)

  prob <- min(1, (exp(log.ratio)))
  
  lambda.new <- sample(c(lambda.prop, lambda.curr), 
                       size=1, prob=c(prob, 1-prob))
  lambda.post <- c(lambda.post, lambda.new)
  if (lambda.new == lambda.prop) {
    accepted = accepted + 1
  }
}

lambda.post <- lambda.post[50001:100000]
hist(lambda.post, breaks=50, probability=TRUE)

curve(dgamma(x, alpha + n, beta + sum(dataset)), add=TRUE)

paste("Simulated Posterior Mean:", mean(lambda.post))

[1] "Simulated Posterior Mean: 0.121577014372497"

MH <- quantile(lambda.post, c(.025, .975))
print(MH)

     2.5%     97.5% 
0.1145124 0.1290612

We can see that this distribution looks very similar to the gamma posterior that we got from calculating the actual credible intervals in chapter 5. The posterior mean is also relatively close to the posterior mean that we calculated before. However, the acceptance rate is still pretty low, which I would be curious to see why and is typically a sign of a poor application of Metropolis-Hastings after some light research. Let’s compare all the intervals now for our dataset from the previous chapter as well.

Comparing All Intervals

Confidence Intervals:

df <- data.frame(
  interval_type = c("Wald Interval", "Wilson Interval", 
                    "Parametric Bootstrap", "Nonparametric bootstrap"),
  lower_bound = c(wald_lower_bound, wilson_lower_bound, 
                  param["2.5%"], non_param["2.5%"]),
  upper_bound = c(wald_upper_bound, wilson_upper_bound, 
                  param["97.5%"], non_param["97.5%"]),
  difference = c(
    wald_upper_bound - wald_lower_bound,
    wilson_upper_bound - wilson_lower_bound,
    param["97.5%"] - param["2.5%"],
    non_param["97.5%"] - non_param["2.5%"]
  ))
print("Confidence Intervals:")

[1] "Confidence Intervals:"

print(df)

            interval_type lower_bound upper_bound difference
1           Wald Interval   0.1142901   0.1293935 0.01510339
2         Wilson Interval   0.1147309   0.1298925 0.01516164
3    Parametric Bootstrap   0.1149992   0.1294217 0.01442255
4 Nonparametric bootstrap   0.1115204   0.1322504 0.02073007

As we can see the Wilson results in the largest range of values while the parametric bootstrap results in the tightest interval.

Credible Intervals:

df <- data.frame(
  interval_type = c("Actual Posterior", "Metropolis-Hastings"),
  lower_bound = c(credible_lower, MH["2.5%"]),
  upper_bound = c(credible_upper, MH["97.5%"]),
  difference = c(
    credible_upper - credible_lower,
    MH["97.5%"] - MH["2.5%"]
  ))
print("Credible Intervals:")

[1] "Credible Intervals:"

print(df)

           interval_type lower_bound upper_bound difference
        Actual Posterior   0.1149029   0.1292119 0.01430894
2.5% Metropolis-Hastings   0.1145124   0.1290612 0.01454878

lower <- c(
  wald_lower_bound,
  wilson_lower_bound,
  param["2.5%"],
  non_param["2.5%"],
  credible_lower,
  MH["2.5%"]
)

upper <- c(
  wald_upper_bound,
  wilson_upper_bound,
  param["97.5%"],
  non_param["97.5%"],
  credible_upper,
  MH["97.5%"]
)

labels <- c(
  "Wald",
  "Wilson",
  "Param",
  "Nonpar",
  "Poster.",
  "MCMC"
)

# empty plot
plot(
  NULL,
  xlim = range(c(lower, upper)),
  ylim = c(1, 6),
  yaxt = "n",
  ylab = "",
  xlab = "Interval",
  main = "Interval Comparison"
)

# add labels
axis(2, at = 1:6, labels = labels, las = 1)

# draw intervals
for (i in 1:6) {
  segments(lower[i], i, upper[i], i, lwd = 2)
  
  # endpoints
  points(c(lower[i], upper[i]), c(i, i), pch = 16)
}

We can see that the credible intervals do the best, with the Metropolis-Hastings interval being one of the narrowest, which not only speaks to the power of this algorithm in simulating the posterior distribution but also the power of Bayesian methods if you choose a prior that is reflective of the data. The parametric bootstrap is the best frequentist interval, and the Wald interval actually does better than the Wilson as we saw in the previous chapter, which is a little unintuitive since we are plugging in an estimate for the Fisher information in the Wald interval.

Chapter 7: Uniformly Powerful Hypothesis Testing

Now, what if we want to conduct a study about the underlying distribution of a dataset and the value of its parameters? That’s where hypothesis testing comes into play.

The maximum power test is one that maximizes the probability that we accept the alternative hypothesis given the truth is the alternative hypothesis (power) while keeping our Type I error below a threshold (typically, \(\alpha = 0.05\)). For our data, let \(H_0: \lambda = 0.65\) and let \(H_A: \lambda > 0.65\). I chose these hypothesis since the average number of magnitude 3.5+ earthquakes in California was 0.65 a day or 234 in the past year. I am interested in seeing if our earthquake data from the decades before suggest that the rate of earthquakes might be higher.

Uniquely Most Powerful (UMP) Test

We know that the maximum power test is uniquely most powerful for all values of \(\lambda_A > 0.65\), so I will do my calculations for one such, namely \(\lambda_A = 1\).

\[ H_0:\lambda = 0.65 \qquad \text{vs.} \qquad H_A:\lambda > 0.65 \]

\[ X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Exp}(\lambda) \]

\[ \frac{f_A(\mathbf X)}{f_0(\mathbf X)} = \frac{\prod_{i=1}^n \lambda e^{-\lambda X_i}} {\prod_{i=1}^n 0.65\, e^{-0.65 X_i}} \]

\[ = \left(\frac{\lambda}{0.65}\right)^n e^{-(\lambda-0.65)\sum_{i=1}^n X_i} \]

Choosing for example \(\lambda\)=1,

\[ \frac{f_A(\mathbf X)}{f_0(\mathbf X)} \approx e^{-0.35\sum X_i} \]

Since the likelihood ratio becomes larger when \(\sum X_i\) is small, the most powerful test rejects the null hypothesis for small values of \(\sum X_i\):

\[ \boxed{ \text{Reject } H_0 \text{ if } \sum_{i=1}^n X_i \le c } \]

The uniquely most powerful test ultimately tells us to reject the null hypothesis for small values of \(\sum{X_i}\). We can find the precise cutoff value by modelling the distribution of the sum of \(X_i\) as \(\sim Gamma(n, \lambda_0)\), which we can use the inverse CDF of the gamma function to get the c value.

Applied to Earthquake Dataset

Now, let’s see what this insight means for our specified dataset.

dataset <- interarrival_days[!is.na(interarrival_days)]
n = length(data)

# c = (n + sqrt(n)*qnorm(0.05)) / 0.65
c = qgamma(0.05, n, 0.65)

print(paste("value of c:", c))

[1] "value of c: 2.10202830269205"

print(paste("sum of data:", sum(x, na.rm=TRUE)))

[1] "sum of data: 9143"

We can see that \(\sum{X_i} > c\) in this case, which means we cannot reject the null hypothesis for our dataset since the rate cannot be greater than 0.65 for our data. That suggests that the rate of 0.65 that I had above was too high for the rate of magnitude 3.5+ earthquakes every day.

Power Curve

We can derive what a power curve would look like for varying \(\lambda_A\) values, using the fact that the power is \(P_A(\text{reject} \ H_0)\).

n = length(data)
c = qgamma(0.05, n, 0.65)

lambdas <- seq(from = 0.05, to = 10, by = 0.5)
results <- numeric(length(lambdas))

for (i in seq_along(lambdas)) {
  lambda_A <- lambdas[i]
  
  power <- pgamma(c, shape = n, rate = lambda_A)
  
  results[i] <- power
}

plot(lambdas, results, type = "b",
     xlab = "lambda_A",
     ylab = "Power",
     main = "Power Curve")

As we can see, the power of this test continues to increase as lambda_A gets farther from lambda_0, which makes sense since we have less change of a Type II error. The power is exactly 0.05 when lambda_A = lambda_0 and really low when lambda_A is less than lambda_0 since we are testing greater than here.

Chapter 8: General Hypothesis Testing

Last chapter, we derived the uniformly most powerful test. Now, we can examine a composite null hypothesis that is not equal to \(\lambda_A\) for my data, not just if it is greater than or less than the null hypothesis.

Let’s use the same test as last time, which looks at the rate of magnitude 3.5+ earthquakes in the past year and sees how that relates to the average over years prior until 2000.

Generalized Likelihood Ratio Test

\[ H_0:\lambda = 0.65 \qquad \text{vs.} \qquad H_A:\lambda \neq 0.65 \]

Like before, we want to reject the null hypothesis for large values of the generalized likelihood ratio now (\(\Lambda = \frac{max \ L(\lambda)}{L(\lambda_0)}\)).

\[ \text{reject for } \Lambda \ge c \]

\[ \frac{\max_{\lambda} L_{x_1,\ldots,x_n}(\lambda)} {L_{x_1,\ldots,x_n}(\lambda_0)} \]

\[ = \frac{\prod_{i=1}^n \hat{\lambda} e^{-\hat{\lambda}x_i}} {\prod_{i=1}^n \lambda_0 e^{-\lambda_0 x_i}} \]

\[ = \frac{(\hat{\lambda})^n e^{-\hat{\lambda}\sum x_i}} {(\lambda_0)^n e^{-\lambda_0\sum x_i}} \]

\[ = \left(\frac{\hat{\lambda}}{\lambda_0}\right)^n e^{\sum x_i(\lambda_0-\hat{\lambda})} \]

\[ \log \Lambda = n\log \hat{\lambda} - n\log \lambda_0 + \sum x_i(\lambda_0-\hat{\lambda}) \]

There seems to be no way to simplify this expression anymore, so we can uses Wilk’s theorem which gives us the asymptotic distribution of \(2 log \Lambda\) under the null hypothesis, which can help us calculate the value of c for which to reject until.

\[ 2\left( n\log \hat{\lambda} - n\log \lambda_0 + \sum x_i(\lambda_0-\hat{\lambda}) \right) \overset{.}{\sim} \chi_1^2 \]

\[ P_0(2\log \Lambda \ge c)=0.05 \]

\[ c = \chi^2_{1,\,0.95} \]

For Earthquake Data

Now, we can apply this test for our specific earthquake test by seeing how the test_statistic compares to the value of c.

x <- interarrival_days[!is.na(interarrival_days)]
n <- length(x)
c <- qchisq(0.95, 1)
lambda_0 <- 0.65

lambda_hat <- 1 / mean(x)

test_stat <- 2 * (n * log(lambda_hat) - n * 
                    log (lambda_0) + sum(x) * lambda_0 - n)

print(c)

[1] 3.841459

print(test_stat)

[1] 5927.674

Clearly, we are going to reject the null hypothesis here since the test statistic is greater than c, and we reject for large values of the generalized likelihood ratio. This means that the historical rate of earthquakes does not follow the past years’.

Power Curve

For varying lambda_A values, here is the power at each (which we calculate by simulating a bunch of expo data from the alternative hypothesis and seeing the number of times that you reject the null), resulting in the power curve. I also compared this Wilks’ method to if we derived a c value exactly by getting the 95 quantile of our test statistic under the null distribution:

n <- 50
B <- 10000
lambda0 <- 0.65
alpha <- 0.05

# using Wilks
c <- qchisq(0.95, 1)

lambdas <- seq(0.2, 2, length.out = 40)

power_wilks <- numeric(length(lambdas))

for (j in seq_along(lambdas)) {

  lambda_A <- lambdas[j]

  rejections <- replicate(B, {
    x <- rexp(n, rate=lambda_A)
    lambda_hat <- 1 / mean(x)
    test_stat <- 2 * (n * log(lambda_hat) - n * log (lambda0) + sum(x) * lambda0 - n)
    
    test_stat > c # binary decision
  })
  
  # because P_A(reject H_0) --> frequentist def
  power_wilks[j] <- mean(rejections, na.rm = TRUE) 
}

# calculate the test stat a lot of times under null and find its 95 quantile
c <- quantile(replicate(B, {
    x <- rexp(n, rate=lambda_0)
    2 * (n * log(1 / mean(x)) - n * log (lambda0) + sum(x) * lambda0 - n)
  }), c(0.95))

lambdas <- seq(0.2, 2, length.out = 40)

power_sim <- numeric(length(lambdas))

for (j in seq_along(lambdas)) {

  lambda_A <- lambdas[j]

  rejections <- replicate(B, {
    x <- rexp(n, rate=lambda_A)
    lambda_hat <- 1 / mean(x)
    test_stat <- 2 * (n * log(lambda_hat) - n * 
                        log (lambda0) + sum(x) * lambda0 - n)
    
    test_stat > c
  })
  
  power_sim[j] <- mean(rejections, na.rm = TRUE) # because P_A(reject H_0)
}

plot(lambdas, power_wilks,
     lwd = 2,
     xlab = expression(lambda),
     ylab = "Power", col='BLUE')

points(lambdas, power_sim,
      lwd = 2,
      lty = 2, col='RED')

abline(v = lambda0, lty = 2)

This makes sense since the power is the lowest at the value of the null hypothesis parameter and get larger as you get away from that. And, we can see when comparing the simulation versus the Wilks Theorem method (blue in the graph) that simulation generally gets us marginally better power. This helps to show how asymptotics really are imprecise, especially for smaller n, since we are never dealing with infinite data.

Chapter 9: Bayesian Hypothesis Testing

The last two chapters dealt with Frequentist hypothesis testing, but how would a Bayesian do it?

In this chapter, we are going to examine the Bayesian equivalent of hypothesis testing, which involves placing a prior probability on the null and alternative hypothesis and having a prior distribution in the parameter. I first started by calculating the Bayes factor using 1/2 prior probabilities on both the hypothesis and a Gamma prior:

\[ H_0:\lambda=\lambda_0, \qquad H_A:\lambda\neq\lambda_0 \]

\[ P(H_0)=P(H_A)=\frac{1}{2} \]

\[ \lambda \mid H_A \sim \mathrm{Gamma}(\alpha,\beta) \]

The Bayes Factor is

\[ \frac{P(H_A\mid x)}{P(H_0\mid x)} = \frac{P(H_A)}{P(H_0)} \times \frac{f(x\mid H_A)}{f(x\mid H_0)} \]

Since the prior probabilities are equal,

\[ \frac{P(H_A)}{P(H_0)} = \frac{1/2}{1/2} = 1 \]

Thus, the Bayes factor becomes

\[ BF = \frac{f(x\mid H_A)}{f(x\mid H_0)} = \frac{ \displaystyle \int_0^\infty \left( \prod_{i=1}^n \lambda e^{-\lambda x_i} \right) f(\lambda)\,d\lambda }{ \displaystyle \prod_{i=1}^n \lambda_0 e^{-\lambda_0 x_i} } \]

Substitute the Gamma prior density,

\[ f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \]

to obtain

\[ BF = \frac{ \displaystyle \int_0^\infty \lambda^n e^{-\lambda \sum_{i=1}^n x_i} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \, d\lambda }{ \lambda_0^n e^{-\lambda_0 \sum_{i=1}^n x_i} } \]

Combining terms gives

\[ BF = \frac{ \displaystyle \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^\infty \lambda^{n+\alpha-1} e^{-(\beta+\sum_{i=1}^n x_i)\lambda} \, d\lambda }{ \lambda_0^n e^{-\lambda_0 \sum_{i=1}^n x_i} } \]

Recognize the integral as the kernel of a Gamma density:

\[ \mathrm{Gamma} \left( n+\alpha,\, \beta+\sum_{i=1}^n x_i \right) \]

so that

\[ \int_0^\infty \lambda^{n+\alpha-1} e^{-(\beta+\sum x_i)\lambda} \, d\lambda = \frac{ \Gamma(n+\alpha) }{ \left( \beta+\sum_{i=1}^n x_i \right)^{n+\alpha} } \]

Therefore,

\[ BF = \frac{ \beta^\alpha \Gamma(n+\alpha) }{ \Gamma(\alpha) \left( \beta+\sum_{i=1}^n x_i \right)^{n+\alpha} \lambda_0^n e^{-\lambda_0 \sum_{i=1}^n x_i} } \]

Equivalently,

\[ BF = \frac{ \beta^\alpha \Gamma(n+\alpha) e^{\lambda_0 \sum_{i=1}^n x_i} }{ \Gamma(\alpha) \left( \beta+\sum_{i=1}^n x_i \right)^{n+\alpha} \lambda_0^n } \]

For Earthquake Data

For my earthquake data, I continued with the \(H_0: \lambda = 0.65\) test as before and use the prior \(\sim Gamma(1,1)\), which does not reveal much information about the rate of earthquake data.

x <- interarrival_days[!is.na(interarrival_days)]

data <- sample(x, 100, replace=FALSE)
n <- length(data)

lambda0 <- 0.65
alpha <- 1
beta <- 1

log_bf <- alpha * log(beta) +
          lgamma(n + alpha) +
          lambda0 * sum(data) -
          lgamma(alpha) -
          (n + alpha) * log(beta + sum(data)) -
          n * log(lambda0)

bf <- exp(log_bf)

print(bf)

[1] 2.542952e+88

Clearly, the Bayes factor is very large since it approaches a quantity close to infinity. In fact, I had to use logs in order to prevent numeric overflow. By Kass and Raftery (1995), this suggests to with strong evidence reject the null hypothesis.

This is exactly what the frequentist method suggested, meaning we do not reach Lindley’s paradox, but this is because the proposed null hypothesis seems to be very off for this data. I imagine there would be more discrepancy as we approach the true value of \(\lambda\), especially considering the weak weights and posterior distribution that we assigned. I would be interested to see for which values of \(\lambda_0\) the paradox arises.

Chapter 10: Sufficiency

We conclude with an elegant property of statistics.

By the sufficiency principle, we know that a sufficient statistics holds all the necessary evidence in a dataset. This helps us reduce the amount of dimensions and complexity that we are dealing with. Here is the derivation for the sufficient statistic for my data:

\[ X_1,\ldots,X_n \overset{\text{iid}}{\sim} \operatorname{Exp}(\lambda) \]

\[ f_\theta(x_1,\ldots,x_n) = \prod_{i=1}^n \lambda e^{-\lambda x_i} \]

\[ = \lambda^n e^{-\lambda \sum_{i=1}^n x_i} \]

To show that something is a sufficient statistic, we want to factor the density in the following form:

\[ f_\theta(x_1,\ldots,x_n) = g\!\left(\theta,\sum_{i=1}^n x_i\right) \,h(x_1,\ldots,x_n), \]

where

\[ g\!\left(\theta,\sum_{i=1}^n x_i\right) = \lambda^n e^{-\lambda \sum_{i=1}^n x_i} \]

and

\[ h(x_1,\ldots,x_n)=1. \]

Since the joint density can be factored in this form, we know that

\[ T(X_1,\ldots,X_n) = \sum_{i=1}^n X_i \]

is a sufficient statistic for \[\lambda\]This is supported by the theorem that states that all MLE estimators are a function of their sufficient statistic. We proved before that the MLE for our data is \[\frac{1}{\bar{X}}\]which is a one-to-one transformation of the sufficient statistic we derived above: \[\sum_{i=1}^n X_i\]

Conclusion

This book is an exploration of Frequentist and Bayesian methods when applied to an exponentially distributed dataset. Stats200Q and the creation of this book has not only made me more adept in the mathmatical and computational foundations of statistics, but through this exploration, the underlying philosophical arguments behind both sides of one of the defining debates in mathematics. This quarter makes me excited to continue in my statistics education with a much better feel for simulations and the theory behind statistical studies, papers, and arguments.

Thank you, Dennis (and Jack), for a great quarter!

Stats200Q Book: Theoretical Statistics Methods Applied to Earthquake Interarrival Times Modeled as Exponential Distribution

Chapter 1: Finding and Exploring Dataset

Chapter 2: Estimation

Frequentist Estimator - Maximum Likelihood

Bayesian Estimator - Posterior Mean

Comparing Bias and Variance Across The Estimators

Chapter 3: Evaluating and Comparing Estimators

Proof of \(\begin{align*}\mathrm{Var}(\ell'(\theta)) &= - \mathbb{E}(\ell''(\theta))\end{align*}\)

MSE of Frequentist Estimator - MLE

MSE of Bayesian Estimator - Posterior Mean

Comparing Frequentist and Bayesian Estimator Using Frequentist MSE

MSE of Unbiased Estimator for \(\lambda\)

Comparing All Estimators and the CRLB

Chapter 4: Asymptotic Distribution of Estimators

Frequentist Estimator - MLE

Chapter 5: Confidence and Credible Intervals

Frequentist Methods - Confidence Intervals

Bayesian Methods - Credible Intervals

Chapter 6: Simulating Confidence and Credible Intervals

Frequentist Methods - Confidence Intervals via Bootstrapping

Bayesian Methods - Credible Intervals via Metropolis-Hastings (MCMC)

Comparing All Intervals

Chapter 7: Uniformly Powerful Hypothesis Testing

Uniquely Most Powerful (UMP) Test

Applied to Earthquake Dataset

Power Curve

Chapter 8: General Hypothesis Testing

Generalized Likelihood Ratio Test

For Earthquake Data

Power Curve

Chapter 9: Bayesian Hypothesis Testing

For Earthquake Data

Chapter 10: Sufficiency

Conclusion

Proof of \(\begin{align}\mathrm{Var}(\ell'(\theta)) &= - \mathbb{E}(\ell''(\theta))\end{align}\)