(Instructor: Nishant Panda)

Optimization in Statistics: Maximum likelihood estimate

We now consider the use of optimization in maximizing likelihood functions.

Example 1: Your friend challenged you to guess the probability of landing Heads with a coin he gave you. You cannot start without data. So, you tossed the coin 10 times and noted the following sequence: “T” “T” “T” “T” “H” “T” “T” “H” “T” “T”. How would you proceed?

First step: A statistical model. Before doing any estimation we need a statistical model. Here we will assume that \(X\) is a Bernoulli random variable with parameter \(p\), i.e. \(X = 1\) (Heads) with probability \(p\) and \(X = 0\) (Tails) with probability \(1-p\). Moreover, let \(X_1, X_2, X_3, \dots, X_{10}\) be a random sample of \(X\) (i.i.d.). Thus our data: \(x_1, x_2, x_3, x_4, x_6, x_7, x_9, x_{10}\) are all \(0\) and \(x_5, x_8\) are \(1\). Our goal is to estimate \(p\) and predict the next outcome using this estimated value.
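To make the model concrete, here is a minimal simulation sketch (not part of the estimation itself): R's built-in rbinom with size = 1 draws i.i.d. Bernoulli values. The value of p used below is an arbitrary illustrative choice, not the unknown p we are trying to estimate.

# illustration only: draw 10 i.i.d. Bernoulli(p) tosses with rbinom
# (p = 0.2 here is an arbitrary choice, not the unknown parameter)
set.seed(42)
sim_tosses <- rbinom(n = 10, size = 1, prob = 0.2)   # 1 = Heads, 0 = Tails
sim_tosses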

(Home Assignment!) Convince yourself that \(X \sim f\), where \(f(x\lvert p) = p^{x}(1-p)^{1-x}\). Note that because \(X\) is discrete, \(f(x)\) is the probability that \(X\) takes on the value \(x\).
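As a quick sanity check (it does not replace the assignment), note that \(f(1\vert p) = p\) and \(f(0\vert p) = 1-p\); in R we can compare \(f\) against the built-in dbinom with size = 1, which is the Bernoulli probability mass function:

# sanity check: f(x | p) should agree with dbinom(x, size = 1, prob = p)
f <- function(x, p) p^x * (1 - p)^(1 - x)
c(f(0, 0.3), dbinom(0, size = 1, prob = 0.3))   # both equal 1 - 0.3 = 0.7
c(f(1, 0.3), dbinom(1, size = 1, prob = 0.3))   # both equal 0.3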

Second step: An estimation method. Here we will use a popular estimation method: the Maximum Likelihood method. Note that we only need to estimate one parameter. Thus our parameter to estimate is \(\theta = p\). Let us construct the likelihood function:

\[ L(p) = \prod\limits_{i=1}^{n}f(x_i\vert p). \] Plugging in \(f(x\vert p)\) and \(n = 10\), we get

\[ L(p) = \prod\limits_{i=1}^{10}p^{x_i}(1-p)^{1-x_i}. \] This is a real-valued function, i.e. the output space is a subset of the real line. What is the input space of our function? Let us consider the log likelihood.

\[ l(p) = log(L(p)) = \sum\limits_{i = 1}^{10}\left(log(p^{x_i}(1-p)^{1-x_i})\right) \]

Our maximum likelihood estimate \(p_{ML}\) is the point in the input space where \(l(p)\) is maximized. We have done all the work here!

With notation from Lecture \(3\), our function \(g\) is given by \[ g(p) = \sum\limits_{i = 1}^{10}\left(log(p^{x_i}(1-p)^{1-x_i})\right) \]

and our goal is to find the maximum value of \(g\).

Let us plot this function.

# observed data
data <- c(0,0,0,0,1,0,0,1,0,0)

# Note that a general likelihood function should take
# parameters and data as argument.

# our log likelihood
g <- function(p, data){
  # p is our parameter
  # data is our observed data
  
  # note the sapply on the observed data!
  logp <- sapply(data, function(x) log((p^x) * (1 - p)^(1- x)))
  
  # return the sum of the above
  sum(logp)
}
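As a quick usage check (the values in the comments follow directly from the formula applied to the observed data), we can evaluate g at a couple of candidate values of p:

# quick usage check of g on the observed data
g(0.5, data)   # each term is log(0.5), so this is 10 * log(0.5) (about -6.93)
g(0.2, data)   # 2 * log(0.2) + 8 * log(0.8) (about -5.00, i.e. larger)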

# for plotting: create the input values for g.
# p lies in (0, 1); we use a grid strictly inside the
# interval so that the log likelihood stays finite.
p.seq <- seq(from = 0.01, to = 0.90, length.out = 101)

# note the sapply on p.seq to get values of g.
gp <- sapply(p.seq, g, data = data)
library(ggplot2)
# put  p and g(p) values in a data frame
plot_data <- data.frame("p" = p.seq, "g" = gp)


# create the base layer of ggplot with geom_line
plot_f <- ggplot(plot_data, aes(x = p, y = g)) + 
  geom_line(color = "red", size = 1.0) + 
  labs(x = "p", y = "log likelihood")

plot_f

It looks like the maximum occurs around 0.2. Is that intuitive?

Let us look at the R optimize function for finding the optimum of this log likelihood function.

# do ?optimize to understand the arguments passed

max_p <- optimize(g, c(0.05, 0.35), data = data, maximum = TRUE)
round(max_p$maximum, 3)
## [1] 0.2

Can you explain this? HINT: What is the proportion of Heads in our data?
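To confirm the hint analytically: since every \(x_i\) is either \(0\) or \(1\), the log likelihood collapses to two terms, and setting its derivative to zero recovers the sample proportion of Heads.

\[ g(p) = \left(\sum\limits_{i=1}^{10}x_i\right)log(p) + \left(10 - \sum\limits_{i=1}^{10}x_i\right)log(1-p) = 2\,log(p) + 8\,log(1-p), \]

\[ g'(p) = \frac{2}{p} - \frac{8}{1-p} = 0 \quad\Longrightarrow\quad p_{ML} = \frac{2}{10} = 0.2. \]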

BONUS: Home Assignments will not be graded. In HW1 you will implement your own Newton's method. Your Newton's method should take as arguments a first derivative function and a second derivative function. Do not use grad or hessian functions.

(BONUS: Home Assignment!) Write your own function that implements the bisection method to confirm this result. Your function should take as arguments: the function g to optimize, the lower end a, and the upper end b. You may use a stopping criterion, or you may just run the iteration 1000 times.

(BONUS: Home Assignment!) Write your own function that implements Newton's method to confirm this result. Your function should take as arguments: the function g to optimize, a starting value, a gradient (first derivative) function, and a Hessian (second derivative) function. You may use a stopping criterion, or you may just run the iteration 1000 times. Do not use grad and hessian from packages.

Wrapping up optimization

We have barely scratched the surface when it comes to optimization. There are many Newton-like methods used in optimization. The update equations for such Newton-like methods are generally of the form: \[ \mathbf{x}^{(n+1)} = \mathbf{x}^{(n)} - \left(M_{p\times p}^{(n)}\right)^{-1}\nabla g\left(\mathbf{x}^{(n)}\right), \] where \(M_{p\times p}^{(n)}\) is an approximation of the Hessian matrix at \(\mathbf{x}^{(n)}\).

Some examples are:

  1. Discrete Newton method: Approximates the Hessian using a finite difference method.

  2. Ascent (or descent) algorithms: At every step we go uphill (or downhill). For example, \(M_{p\times p}^{(n)} = -I\) will guarantee ascent (no need for the Hessian matrix!). See the Wikipedia article on coordinate descent: https://en.wikipedia.org/wiki/Coordinate_descent. A minimal sketch of such an ascent step appears after this list.

  3. Quasi-Newton methods: Generate an approximation to the inverse of the Hessian matrix. These are popular methods and are commonly used in packages.
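To make the general update concrete, here is a minimal sketch of idea 2 above applied to the Example 1 log likelihood: a scaled steepest-ascent iteration. The closed-form derivative dg and the step size are choices made just for this illustration.

# steepest-ascent sketch for the Example 1 log likelihood
# update: p <- p + step * g'(p), i.e. M = -(1/step) * I in the general formula
dg <- function(p, data) sum(data) / p - (length(data) - sum(data)) / (1 - p)

p_cur <- 0.5     # starting value (arbitrary choice inside (0, 1))
step  <- 0.001   # fixed step size (illustrative, not tuned)
for (i in 1:1000) {
  p_cur <- p_cur + step * dg(p_cur, data)
}
p_cur            # ends up close to the sample proportion 0.2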

Optimization in R

We used the optimize method for finding the optimum in Example 1. This method uses the golden section algorithm (very similar to the bisection method).

The optim method is a general purpose algorithm for multivariate functions. By default it uses the Nelder-Mead algorithm (see the recommended textbook if you are interested). You can pass different methods. For example, BFGS is a quasi-Newton method.

optim(c(150, 100), g2, method = "BFGS", control = list(fnscale = -1),
      data = weight.cookie$weights, log.density = log.gaussd)
## $par
## [1] 176.0047 154.4307
## 
## $value
## [1] -137.929
## 
## $counts
## function gradient 
##       40       38 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

By default both optim and optimize look for a minimum. The control = list(fnscale = -1) argument sets the optim method to look for a maximum. The maximum = TRUE argument sets the optimize method to look for a maximum.
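As an illustration of these arguments on Example 1 (a sketch, assuming g and data from above are still in the workspace): since p must stay inside (0, 1), the box-constrained L-BFGS-B method is one convenient choice, with the bounds 0.01 and 0.99 picked arbitrarily to keep the log finite.

# sketch: maximizing the Example 1 log likelihood with optim
optim(par = 0.5, fn = g, data = data,
      method = "L-BFGS-B", lower = 0.01, upper = 0.99,
      control = list(fnscale = -1))$par   # should be close to 0.2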

nlminb is a function that allows you to do constrained optimization. This is useful if your input space is restricted.
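For example (again a sketch assuming g and data from Example 1 are available): nlminb minimizes its objective, so we hand it the negative log likelihood and bound p away from 0 and 1.

# sketch: constrained maximization of the Example 1 log likelihood via nlminb
nlminb(start = 0.5,
       objective = function(p) -g(p, data),
       lower = 0.01, upper = 0.99)$par   # should be close to 0.2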

For a quick guide to optimization see https://neos-guide.org/Optimization-Guide