MLE Problem Set 2

#1. The exponential distribution

The exponential probability density function is defined as: \(f_e(x;\theta) = \theta\exp(-\theta x)\).

  1. Loosely speaking, the “support” of a random variable is the set of admissible values, or values that have positive probability. More formally, the support of a function is the subset of the function’s domain that are not mapped to zero. If \(X \sim f_e(x;\theta)\) what is the support of \(X\)?

The support of \(X\) is the set of non-negative reals, \(x \in [0, \infty)\): every \(x \ge 0\) is mapped to a positive density. (The condition \(\theta > 0\) describes the parameter space, not the support.)

  2. Derive the log-likelihood for \(n\) independent observations under the probability model \(f_e\).

\(\ell(\theta) = n \ln(\theta) - \theta \sum_{i=1}^{n} x_i\)
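This follows because independence lets the joint density factor into a product, and the log turns that product into a sum: \[ L(\theta) = \prod_{i=1}^{n} \theta e^{-\theta x_i} = \theta^{n} e^{-\theta \sum_{i=1}^{n} x_i} \quad\Longrightarrow\quad \ell(\theta) = \ln L(\theta) = n\ln(\theta) - \theta \sum_{i=1}^{n} x_i \]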

  3. Derive an analytic expression for \(\hat{\theta}\), the MLE.

Setting the derivative of the log-likelihood to zero gives \[ \frac{d\ell}{d\theta} = \frac{n}{\theta} - \sum_{i=1}^{n} x_i = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}} \]

Type the following into R:

library(formatR)
set.seed(5)
x <- rexp(10000, 5)

Program the exponential log likelihood function into R; call this function exp.ll.

exp.ll <- function(theta, x){
  n <- length(x)
  log_likelihood <- n * log(theta) - theta * sum(x)   # ell(theta) = n*log(theta) - theta*sum(x_i)
  return(log_likelihood)
}
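
A quick sanity check of the function (the exact values depend on the simulated x above): the log-likelihood should be larger near the true rate of 5 than far away from it.

exp.ll(5, x)   # evaluated near the true rate used in rexp()
exp.ll(1, x)   # should be substantially smaller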

Plot the log-likelihood as a function of \(\theta\) given x. Eyeballing the graph, what is the approximate value of the maximizer?

values_theta <- seq(0.1, 6, length.out = 100)
log_ll_val <- exp.ll(values_theta, x)


plot(values_theta,log_ll_val, xlab = "Theta_Values", ylab = "Log_Likelihood_Values")

Eyeballing the plot, the approximate value of the maximizer is 5.

Calculate \(\hat{\theta}\) given x using the analytic result you derived in part (c).

mle <- function(x){
  n <- length(x)
  theta_hat <- n / sum(x)   # analytic MLE: theta_hat = n / sum(x_i) = 1 / mean(x)
  return(theta_hat)
}

theta_hat <- mle(x)
theta_hat
## [1] 4.979927
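
Equivalently, since \(\hat{\theta} = n / \sum_i x_i = 1/\bar{x}\), the same estimate is just the reciprocal of the sample mean:

1 / mean(x)   # identical to theta_hat above, close to the true rate of 5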

Let \(\tilde{\theta} = 6\). Calculate the likelihood ratio for \(\hat{\theta}|x\) and \(\tilde{\theta}|x\).

val_1_log <- exp.ll(theta_hat, x)
val_2_log <- exp.ll(6, x)



lh_ratio <- val_1_log - val_2_log
lh_ratio
## [1] 184.9271
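
Because exp.ll returns log-likelihoods, the difference above is the log of the likelihood ratio; the ratio itself is recovered by exponentiating (no output shown; it is an enormous number, roughly exp(184.93), on the order of 1e80):

exp(lh_ratio)   # likelihood ratio L(theta_hat | x) / L(6 | x)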

Calculate an approximation to \(\hat{\theta}\) as well as the standard error around \(\hat{\theta}\) using optim. Use 1 as your starting value. You will do this twice, once using the BFGS algorithm and once using SANN. Report the total computational time required for each.

run_mle <- function(fn, x, init_params = c(1), method = "BFGS", max_iter = 1000) {
  start_time <- Sys.time()
  
  mle.fit <- optim(
    par = init_params,
    fn = fn,
    x = x,                    # extra argument passed through to the log-likelihood function
    method = method,
    control = list(trace = TRUE, maxit = max_iter, fnscale = -1),  # fnscale = -1: maximize
    hessian = TRUE            # keep the Hessian so standard errors can be computed
  )
  
  completed_time <- difftime(Sys.time(), start_time, units = "secs")
  message("Total time for ", method, " Method: ", completed_time, " seconds")
  
  return(mle.fit)
}


mle_result <- run_mle(exp.ll, x)
## initial  value 2008.061622
## Warning in log(theta): NaNs produced
## Warning in log(theta): NaNs produced
## Warning in log(theta): NaNs produced
## Warning in log(theta): NaNs produced
## Warning in log(theta): NaNs produced
## Warning in log(theta): NaNs produced
## Warning in log(theta): NaNs produced
## iter  10 value -6054.151950
## iter  10 value -6054.152023
## final  value -6054.152023 
## converged
## Total time for BFGS Method: 0.00171780586242676 seconds
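
The NaN warnings come from trial steps at theta <= 0, where log(theta) is undefined; BFGS recovers from them and still converges. To complete the exercise, the same helper can be rerun with SANN, and a standard error can be recovered from the Hessian that optim returns (a sketch only; SANN is stochastic, so the estimate and timing vary from run to run and no output is shown):

# Simulated annealing with the same helper and starting value
mle_result_sann <- run_mle(exp.ll, x, method = "SANN")

# Standard error from the observed information: variance = inverse of (minus the Hessian of the
# log-likelihood at the maximum); for this model it should be close to theta_hat / sqrt(n), about 0.05
se_bfgs <- sqrt(diag(solve(-mle_result$hessian)))
se_sann <- sqrt(diag(solve(-mle_result_sann$hessian)))
c(BFGS = se_bfgs, SANN = se_sann)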

#2. Maximizing a multivariate function

In this section we will optimize a function of two variables: \[ f(x,y)= \mathrm{e}^{-\frac{1}{2}[(x-2)^2+(y-1)^2]}\] which (up to a constant) happens to be the joint density of two independent normal random variables \(X\) and \(Y\) with \(\mu_x=2\), \(\mu_y=1\), and equal variance \(\sigma^2_x=\sigma^2_y=1\).

  1. Use the following code to implement and visualize this function. Just eye-balling it, at what values of \(x\) and \(y\) does it look like the function achieves a maximum? Use ChatGPT or a similar tool to walk you through the code if you do not understand what the code is doing.
mvn <- function(xy) {
    x <- xy[1]
    y <- xy[2]
    z <- exp(-0.5 * ((x - 2)^2 + (y - 1)^2))
    return(z)
}

# install.packages('lattice')
library(lattice)
## Warning: package 'lattice' was built under R version 4.2.3
y <- x <- seq(-5, 5, by = 0.1)
grid <- expand.grid(x, y)
names(grid) <- c("x", "y")
grid$z <- apply(grid, 1, mvn)

wireframe(z ~ x + y, data = grid, shade = TRUE, light.source = c(10, 0, 10), scales = list(arrows = FALSE))

It looks like the function achieves its maximum at approximately x = 2 and y = 1.

#2b Use optim() to find the values \((x^{\star},y^{\star})\) that maximize this joint density. Use c(1,0) as your starting values and method="BFGS". Then find the maximum again with the starting value c(5,5), still using method="BFGS". Compare the performance of optim() for these two sets of starting values.

# c(1, 0), using method = "BFGS"

mle.fit_b1 <- optim(
  par = c(1, 0),              # Starting values for (x, y)
  fn = mvn,                   # Function to maximize (see fnscale = -1 below)
  method = "BFGS",
  control = list(
    trace = TRUE,
    maxit = 1000,
    fnscale = -1              # Negate the function so optim maximizes instead of minimizes
  ),
  hessian = TRUE)
## initial  value -0.367879 
## final  value -1.000000 
## converged
mle.fit_b1$par
## [1] 2 1

# c(5, 5), still using method = "BFGS"

mle.fit_b2 <- optim(
  par = c(5, 5),              # Start from (5,5)
  fn = mvn,
  method = "BFGS",
  control = list(
    trace = TRUE,
    maxit = 1000,
    fnscale = -1
  ),
  hessian = TRUE
)
## initial  value -0.000004 
## iter  10 value -0.000004
## iter  20 value -0.000004
## ... (value unchanged every 10 iterations; intermediate lines omitted) ...
## iter 990 value -0.000004
## iter1000 value -0.000004
## final  value -0.000004 
## stopped after 1000 iterations
mle.fit_b2$par 
## [1] 4.988302 4.984402

#2c Alter the provided R function so that it returns the (natural) logarithm of \(f(x,y)\), and alter the code to plot the log-likelihood. Then repeat the two procedures you performed in section (b), only as applied to the log-likelihood. Describe any differences in your results, and then provide a brief explanation for those differences.

log_mvn <- function(xy) {
    x <- xy[1]
    y <- xy[2]
    z <- -0.5 * ((x - 2)^2 + (y - 1)^2)   # log(exp(u)) = u, so return the exponent directly
    return(z)
}


y <- x <- seq(-5, 5, by = 0.1)
grid <- expand.grid(x, y)
names(grid) <- c("x", "y")
grid$z <- apply(grid, 1, log_mvn)

wireframe(z ~ x + y, data = grid, shade = TRUE, light.source = c(10, 0, 10), scales = list(arrows = FALSE))

# MLE of the log-function, starting from c(1, 0)

mle.fit_b3 <- optim(
  par = c(1, 0),              # Starting values for the parameters
  fn = log_mvn,               # Function to maximize (see fnscale = -1 below)
  method = "BFGS",            # Optimization method (Broyden–Fletcher–Goldfarb–Shanno algorithm)
  control = list(
    trace = TRUE,             # Show optimization progress
    maxit = 1000,             # Maximum number of iterations
    fnscale = -1              # Negate the function to maximize instead of minimize
  ),
  hessian = TRUE              # Return the Hessian matrix
)
## initial  value 1.000000 
## final  value -0.000000 
## converged
mle.fit_b3$par
## [1] 2 1

# MLE of the log-function, starting from c(5, 5)

mle.fit_b4 <- optim(
  par = c(5, 5),              # Starting values for the parameters
  fn = log_mvn,               # Function to maximize (see fnscale = -1 below)
  method = "BFGS",            # Optimization method (Broyden–Fletcher–Goldfarb–Shanno algorithm)
  control = list(
    trace = TRUE,             # Show optimization progress
    maxit = 1000,             # Maximum number of iterations
    fnscale = -1              # Negate the function to maximize instead of minimize
  ),
  hessian = TRUE              # Return the Hessian matrix
)
## initial  value 12.500000 
## final  value -0.000000 
## converged
mle.fit_b4$par
## [1] 2 1

Starting from c(1,0), which lies close to the maximum, optim converges to (2, 1) for both the original function and its logarithm. Starting from c(5,5), the results diverge: the log version still converges to (2, 1), but the original function barely moves from the starting point and stops after 1000 iterations. The reason is that at c(5,5) the original density is essentially flat: its value, and therefore its gradient, are on the order of \(e^{-12.5}\), so BFGS receives almost no directional information. Taking the logarithm replaces the nearly flat surface with the simple quadratic bowl \(-\tfrac{1}{2}[(x-2)^2+(y-1)^2]\), whose gradient remains large far from the maximum while the location of the maximum is unchanged. This allows the optimizer to reach the true maximum even from a distant starting point.
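
A quick numerical illustration of this point (a sketch; it assumes the numDeriv package is installed, and the printed values are approximate):

# install.packages('numDeriv')
library(numDeriv)
grad(mvn, c(5, 5))      # roughly c(-1.1e-05, -1.5e-05): essentially no slope for BFGS to follow
grad(log_mvn, c(5, 5))  # c(-3, -4): a clear direction back toward the maximum at (2, 1)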

#3. The normal variance

Derive \(\widehat{\sigma^2}\), the MLE of the variance of the normal distribution. Hint: begin with the log-likelihood of the Normal distribution given in the text.

Setting the derivative of the log-likelihood with respect to \(\sigma^2\) equal to zero: \[ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2}{2(\sigma^2)^2} = 0 \quad\Longrightarrow\quad \widehat{\sigma^2} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
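
For reference, the Normal log-likelihood this derivative comes from (written with the regression mean \(\beta_0 + \beta_1 x_i\) used above) is \[ \ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2 \]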

I have attached a PDF of my handwritten notes.

Use of ChatGPT and other generative AI tools

I used ChatGPT to help me with this problem set; since I have never taken calculus, it broke down a lot of the math for this assignment. Here is the link:

Link to Chat Thread