(Instructor: Nishant Panda)

Optimization in Statistics: Maximum likelihood estimate

We now consider the use of optimization in maximizing likelihood functions.

Example 1: Your friend challenged you to guess the probability of landing Heads with a coin he gave you. You cannot start without data. So, you tossed the coin 10 times and noted the following sequence: “T” “T” “T” “T” “H” “T” “T” “H” “T” “T”. How would you proceed?

First step: A statistical model. Before doing any estimation we need a statistical model. Here we will assume that \(X\) is a Bernoulli random variable with parameter \(p\), i.e. \(X = 1\) (Heads) with probability \(p\) and \(X = 0\) (Tails) with probability \(1-p\). Moreover, let \(X_1, X_2, X_3, \dots, X_{10}\) be a random sample of \(X\) (i.i.d.). Thus our data: \(x_1, x_2, x_3, x_4, x_6, x_7, x_9, x_{10}\) are all \(0\) and \(x_5, x_8\) are \(1\). Our goal is to estimate \(p\) and predict the next outcome using this estimated value.
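To make the model concrete, here is a minimal simulation sketch (not part of the estimation itself): R's built-in rbinom with size = 1 draws i.i.d. Bernoulli values. The value of p used below is an arbitrary illustrative choice, not the unknown p we are trying to estimate.

# illustration only: draw 10 i.i.d. Bernoulli(p) tosses with rbinom
# (p = 0.2 here is an arbitrary choice, not the unknown parameter)
set.seed(42)
sim_tosses <- rbinom(n = 10, size = 1, prob = 0.2)   # 1 = Heads, 0 = Tails
sim_tosses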

(Home Assignment!) Convince yourself that \(X \sim f\), where \(f(x\lvert p) = p^{x}(1-p)^{1-x}\). Note that because \(X\) is discrete, \(f(x)\) is the probability that \(X\) takes on the value \(x\).
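As a quick sanity check (it does not replace the assignment), note that \(f(1\vert p) = p\) and \(f(0\vert p) = 1-p\); in R we can compare \(f\) against the built-in dbinom with size = 1, which is the Bernoulli probability mass function:

# sanity check: f(x | p) should agree with dbinom(x, size = 1, prob = p)
f <- function(x, p) p^x * (1 - p)^(1 - x)
c(f(0, 0.3), dbinom(0, size = 1, prob = 0.3))   # both equal 1 - 0.3 = 0.7
c(f(1, 0.3), dbinom(1, size = 1, prob = 0.3))   # both equal 0.3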

Second step: An estimation method. Here we will use a popular estimation method: the Maximum Likelihood method. Note that we only need to estimate one parameter. Thus our parameter to estimate is \(\theta = p\). Let us construct the likelihood function:

\[ L(p) = \prod\limits_{i=1}^{n}f(x_i\vert p). \] Plugging in \(f(x\vert p)\) and \(n = 10\), we get

\[ L(p) = \prod\limits_{i=1}^{10}p^{x_i}(1-p)^{1-x_i}. \] This is a real-valued function, i.e. the output space is a subset of the real line. What is the input space of our function? Let us consider the log likelihood.

\[ l(p) = log(L(p)) = \sum\limits_{i = 1}^{10}\left(log(p^{x_i}(1-p)^{1-x_i})\right) \]

Our maximum likelihood estimate \(p_{ML}\) is the point in the input space where \(l(p)\) is maximized. We have done all the work here!

With notation from Lecture \(3\), our function \(g\) is given by \[ g(p) = \sum\limits_{i = 1}^{10}\left(log(p^{x_i}(1-p)^{1-x_i})\right) \]

and our goal is to find the maximum value of \(g\).

Let us plot this function.

# observed data
data <- c(0,0,0,0,1,0,0,1,0,0)

# Note that a general likelihood function should take
# parameters and data as argument.

# our log likelihood
g <- function(p, data){
  # p is our parameter
  # data is our observed data
  
  # note the sapply on the observed data!
  logp <- sapply(data, function(x) log((p^x) * (1 - p)^(1- x)))
  
  # return the sum of the above
  sum(logp)
}
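As a quick usage check (the values in the comments follow directly from the formula applied to the observed data), we can evaluate g at a couple of candidate values of p:

# quick usage check of g on the observed data
g(0.5, data)   # each term is log(0.5), so this is 10 * log(0.5) (about -6.93)
g(0.2, data)   # 2 * log(0.2) + 8 * log(0.8) (about -5.00, i.e. larger)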

# for plotting: create the input values for g.
# p lies in (0, 1); we use a grid strictly inside the
# interval so that the log likelihood stays finite.
p.seq <- seq(from = 0.01, to = 0.90, length.out = 101)

# note the sapply on p.seq to get values of g.
gp <- sapply(p.seq, g, data = data)
library(ggplot2)
# put  p and g(p) values in a data frame
plot_data <- data.frame("p" = p.seq, "g" = gp)


# create the base layer of ggplot with geom_line
plot_f <- ggplot(plot_data, aes(x = p, y = g)) + 
  geom_line(color = "red", size = 1.0) + 
  labs(x = "p", y = "log likelihood")

plot_f

It looks like the maximum occurs around 0.2. Is that intuitive?

Let us look at the R optimize function for finding the optimum of this log likelihood function.

# do ?optimize to understand the arguments passed

max_p <- optimize(g, c(0.05, 0.35), data = data, maximum = TRUE)
round(max_p$maximum, 3)
## [1] 0.2

Can you explain this? HINT: What is the proportion of Heads in our data?
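To confirm the hint analytically: since every \(x_i\) is either \(0\) or \(1\), the log likelihood collapses to two terms, and setting its derivative to zero recovers the sample proportion of Heads.

\[ g(p) = \left(\sum\limits_{i=1}^{10}x_i\right)log(p) + \left(10 - \sum\limits_{i=1}^{10}x_i\right)log(1-p) = 2\,log(p) + 8\,log(1-p), \]

\[ g'(p) = \frac{2}{p} - \frac{8}{1-p} = 0 \quad\Longrightarrow\quad p_{ML} = \frac{2}{10} = 0.2. \]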

BONUS: Home Assignments will not be graded. In HW1 you will implement your own Newton's method. Your Newton's method should take as arguments a first derivative function and a second derivative function. Do not use grad or hessian functions.

(BONUS: Home Assignment!) Write your own function that implements the bisection method to confirm this result. Your function should take as arguments: the function g to optimize, the lower end a, and the upper end b. You may use a stopping criterion, or you may just run the iteration 1000 times.

(BONUS: Home Assignment!) Write your own function that implements Newton's method to confirm this result. Your function should take as arguments: the function g to optimize, a starting value, a gradient (first derivative) function, and a Hessian (second derivative) function. You may use a stopping criterion, or you may just run the iteration 1000 times. Do not use grad and hessian from packages.

Wrapping up optimization

We have barely scratched the surface when it comes to optimization. There are many Newton-like methods used in optimization. The update equations for such Newton-like methods are generally of the form: \[ \mathbf{x}^{(n+1)} = \mathbf{x}^{(n)} - \left(M_{p\times p}^{(n)}\right)^{-1}\nabla g\left(\mathbf{x}^{(n)}\right), \] where \(M_{p\times p}^{(n)}\) is an approximation of the Hessian matrix at \(\mathbf{x}^{(n)}\).

Some examples are:

  1. Discrete Newton method: Approximates the Hessian using a finite difference method.

  2. Ascent (or descent) algorithms: At every step we go uphill (or downhill). For example, \(M_{p\times p}^{(n)} = -I\) will guarantee ascent (no need for the Hessian matrix!). See the Wikipedia article on coordinate descent: https://en.wikipedia.org/wiki/Coordinate_descent. A minimal sketch of such an ascent step appears after this list.

  3. Quasi-Newton methods: Generate an approximation to the inverse of the Hessian matrix. These are popular methods and are commonly used in packages.
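To make the general update concrete, here is a minimal sketch of idea 2 above applied to the Example 1 log likelihood: a scaled steepest-ascent iteration. The closed-form derivative dg and the step size are choices made just for this illustration.

# steepest-ascent sketch for the Example 1 log likelihood
# update: p <- p + step * g'(p), i.e. M = -(1/step) * I in the general formula
dg <- function(p, data) sum(data) / p - (length(data) - sum(data)) / (1 - p)

p_cur <- 0.5     # starting value (arbitrary choice inside (0, 1))
step  <- 0.001   # fixed step size (illustrative, not tuned)
for (i in 1:1000) {
  p_cur <- p_cur + step * dg(p_cur, data)
}
p_cur            # ends up close to the sample proportion 0.2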

Optimization in R

We used the optimize method for finding the optimum in Example 1. This method uses the golden section algorithm (very similar to the bisection method).

The optim method is a general purpose algorithm for multivariate functions. By default it uses the Nelder-Mead algorithm (see the recommended textbook if you are interested). You can pass different methods. For example, BFGS is a quasi-Newton method.

optim(c(150, 100), g2, method = "BFGS", control = list(fnscale = -1),
      data = weight.cookie$weights, log.density = log.gaussd)
## $par
## [1] 176.0047 154.4307
## 
## $value
## [1] -137.929
## 
## $counts
## function gradient 
##       40       38 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

By default both optim and optimize look for a minimum. The control = list(fnscale = -1) argument sets the optim method to look for a maximum. The maximum = TRUE argument sets the optimize method to look for a maximum.
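As an illustration of these arguments on Example 1 (a sketch, assuming g and data from above are still in the workspace): since p must stay inside (0, 1), the box-constrained L-BFGS-B method is one convenient choice, with the bounds 0.01 and 0.99 picked arbitrarily to keep the log finite.

# sketch: maximizing the Example 1 log likelihood with optim
optim(par = 0.5, fn = g, data = data,
      method = "L-BFGS-B", lower = 0.01, upper = 0.99,
      control = list(fnscale = -1))$par   # should be close to 0.2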

nlminb is a function that allows you to do constrained optimization. This is useful if your input space is restricted.
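For example (again a sketch assuming g and data from Example 1 are available): nlminb minimizes its objective, so we hand it the negative log likelihood and bound p away from 0 and 1.

# sketch: constrained maximization of the Example 1 log likelihood via nlminb
nlminb(start = 0.5,
       objective = function(p) -g(p, data),
       lower = 0.01, upper = 0.99)$par   # should be close to 0.2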

For a quick guide to optimization see https://neos-guide.org/Optimization-Guide