(Instructor : Nishant Panda)
We now consider the use of optimization in maximizing likelihood functions.
First step: A statistical model. Before doing any estimation we need a statistical model. Here we assume that \(X\) is a Bernoulli random variable with parameter \(p\), i.e. \(X = 1\) (HEADS) with probability \(p\) and \(X = 0\) (TAILS) with probability \(1-p\). Moreover, let \(X_1, X_2, X_3, \dots, X_{10}\) be a random sample of \(X\) (i.i.d.). Thus our data: \(x_1, x_2, x_3, x_4, x_6, x_7, x_9, x_{10}\) are all \(0\) and \(x_5, x_8\) are \(1\). Our goal is to estimate \(p\) and predict the next outcome using this estimated value.
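As an aside, if \(p\) were known we could simulate a sample like ours directly. A minimal sketch (the value p.true = 0.2 below is only an illustrative assumption, not something we have estimated yet):
# illustrative only: simulate 10 Bernoulli draws for an assumed p
set.seed(1)
p.true <- 0.2   # assumed value, not estimated from the data
rbinom(n = 10, size = 1, prob = p.true)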
Second step: An estimation method. Here, we will use a popular estimation method, the Maximum Likelihood method. Note that we only need to estimate one parameter, so our parameter to estimate is \(\theta = p\). Let us construct the likelihood function,
\[ L(p) = \prod\limits_{i=1}^{n}f(x_i\vert p), \] which, when we plug in \(f(x\vert p)\) and \(n = 10\), becomes
\[ L(p) = \prod\limits_{i=1}^{10}p^{x_i}(1-p)^{1-x_i}. \] This is a real-valued function, i.e. the output space is a subset of the real line. What is the input space of our function? Let us consider the log likelihood.
\[ l(p) = log(L(p)) = \sum\limits_{i = 1}^{10}\left(log(p^{x_i}(1-p)^{1-x_i})\right) \]
Our maximum likelihood estimate \(p_{ML}\) is the point in the input space where \(l(p)\) is maximized. We have done all the work here!
With notation from Lecture \(3\), our function \(g\) is given by \[ g(p) = \sum\limits_{i = 1}^{10}\left(log(p^{x_i}(1-p)^{1-x_i})\right) \]
and our goal is to find the maximum value of \(g\).
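Since each \(x_i\) is either \(0\) or \(1\), each term contributes either \(\log(p)\) or \(\log(1-p)\). With two heads and eight tails in our data, \(g\) simplifies to \[ g(p) = 2\log(p) + 8\log(1-p). \]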
Let us plot this function.
# observed data
data <- c(0,0,0,0,1,0,0,1,0,0)
# Note that a general likelihood function should take
# parameters and data as arguments.
# our log likelihood
g <- function(p, data){
# p is our parameter
# data is our observed data
# note the sapply on the observed data!
logp <- sapply(data, function(x) log((p^x) * (1 - p)^(1- x)))
# return the sum of the above
sum(logp)
}
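As a quick sanity check before plotting, we can evaluate g at a couple of points (the values in the comments are hand-computed from the simplified formula above, not output copied from the lecture):
# sanity check the log likelihood at a few values of p
g(0.2, data)   # 2*log(0.2) + 8*log(0.8), roughly -5.00
g(0.5, data)   # 10*log(0.5), roughly -6.93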
# for plotting: create the input values for g.
# In this case it is p, which lies between 0 and 1
# (we avoid the endpoints, where the log is undefined).
p.seq <- seq(from = 0.01, to = 0.90, length.out = 101)
# note the sapply on p.seq to get values of g.
gp <- sapply(p.seq, g, data = data)
library(ggplot2)
# put p and g(p) values in a data frame
plot_data <- data.frame("p" = p.seq, "g" = gp)
# create the base layer of ggplot with geom_line
plot_f <- ggplot(plot_data, aes(x = p, y = g)) +
geom_line(color = "red", size = 1.0) +
labs(x = "p", y = "log likelihood")
plot_f
It looks like the maximum occurs around 0.2. Is that intuitive?
Let us look at the R optimize function for finding the optimum of this log likelihood function.
# do ?optimize to understand the arguments passed
max_p <- optimize(g, c(0.05, 0.35), data = data, maximum = TRUE)
round(max_p$maximum, 3)
## [1] 0.2
Can you explain this? HINT: What is the proportion of Heads in our data?
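Setting the derivative of the log likelihood (for our two heads and eight tails) to zero makes this explicit: \[ g'(p) = \frac{2}{p} - \frac{8}{1-p} = 0 \quad\Longrightarrow\quad 2(1-p) = 8p \quad\Longrightarrow\quad p_{ML} = \frac{2}{10} = 0.2, \] which is exactly the proportion of heads in the data.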
BONUS: Home assignments will not be graded. In HW1 you will implement your own Newton's method. Your Newton's method should take as arguments a first derivative function and a second derivative function. Do not use grad or hessian functions.
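For orientation only (not a solution template), a one-dimensional Newton iteration built from a first and a second derivative might look like the sketch below; the names newton1d, dg and d2g are placeholders, not part of the assignment.
# minimal sketch of a 1-D Newton iteration (illustrative only)
# dg:  function returning the first derivative at p
# d2g: function returning the second derivative at p
newton1d <- function(p0, dg, d2g, tol = 1e-8, max.iter = 100){
  p <- p0
  for(i in seq_len(max.iter)){
    step <- dg(p) / d2g(p)
    p <- p - step
    if(abs(step) < tol) break
  }
  p
}
# for the coin example: newton1d(0.5, dg = function(p) 2/p - 8/(1 - p),
#                                d2g = function(p) -2/p^2 - 8/(1 - p)^2) converges to 0.2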
We have barely scratched the surface when it comes to optimization. There are many Newton-like methods used in optimization. The update equations for such methods are generally of the form: \[ \mathbf{x}^{(n+1)} = \mathbf{x}^{(n)} - \left(M_{p\times p}^{(n)}\right)^{-1}\nabla g\left(\mathbf{x}^{(n)}\right), \] where \(M_{p\times p}^{(n)}\) is an approximation of the Hessian matrix at \(\mathbf{x}^{(n)}\).
Some examples are:
Discrete Newton method: Approximates the Hessian using a finite difference method.
Ascent (or descent) algorithms: At every step we go uphill (or downhill). For example, \(M_{p\times p}^{(n)} = -I\) will guarantee ascent (no need for the Hessian matrix!); a sketch of this idea follows the list below. See coordinate descent on Wikipedia: https://en.wikipedia.org/wiki/Coordinate_descent
Quasi-Newton methods: Generate an approximation to the inverse of the Hessian matrix. This is a popular approach and is what most packages use.
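As promised in the ascent item above, here is a minimal sketch of fixed-step gradient ascent, i.e. the update with \(M^{(n)} = -\tfrac{1}{\alpha} I\). The function name grad_ascent and the step size alpha are illustrative choices, not part of the lecture.
# minimal sketch of fixed-step gradient ascent (illustrative only)
# grad_g: function returning the gradient of g at x
grad_ascent <- function(x0, grad_g, alpha = 0.01, tol = 1e-8, max.iter = 1000){
  x <- x0
  for(i in seq_len(max.iter)){
    step <- alpha * grad_g(x)
    x <- x + step
    if(sqrt(sum(step^2)) < tol) break
  }
  x
}
# for the coin example: grad_ascent(0.5, function(p) 2/p - 8/(1 - p)) should approach 0.2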
We used the optimize function to find the optimum in Example 1. This function uses the golden section algorithm (very similar to the bisection method).
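For the curious, a minimal sketch of golden-section search for the maximum of a unimodal function on \([a, b]\); the name golden_max is illustrative, and R's actual optimize also mixes in successive parabolic interpolation.
# minimal sketch of golden-section search for a maximum (illustrative only)
golden_max <- function(f, a, b, tol = 1e-6, ...){
  phi <- (sqrt(5) - 1) / 2            # inverse golden ratio, about 0.618
  x1 <- b - phi * (b - a)
  x2 <- a + phi * (b - a)
  while((b - a) > tol){
    if(f(x1, ...) < f(x2, ...)){      # maximum lies in [x1, b]
      a <- x1
      x1 <- x2
      x2 <- a + phi * (b - a)
    } else {                          # maximum lies in [a, x2]
      b <- x2
      x2 <- x1
      x1 <- b - phi * (b - a)
    }
  }
  (a + b) / 2
}
# golden_max(g, 0.05, 0.35, data = data) should again return roughly 0.2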
The optim function is a general purpose optimizer for multivariate functions. By default it uses the Nelder-Mead algorithm (see the recommended textbook if you are interested). You can pass different methods; for example, BFGS is a quasi-Newton method.
optim(c(150, 100), g2, method = "BFGS", control = list(fnscale = -1), data = weight.cookie$weights,
      log.density = log.gaussd)
## $par
## [1] 176.0047 154.4307
##
## $value
## [1] -137.929
##
## $counts
## function gradient
## 40 38
##
## $convergence
## [1] 0
##
## $message
## NULL
By default both optim and optimize look for a minimum. The control = list(fnscale = -1) argument tells optim to look for a maximum, and the maximum = TRUE argument tells optimize to look for a maximum.
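Applied to our one-parameter coin example, the same fnscale trick looks like this (method = "Brent" is optim's method for one-dimensional problems and requires finite lower/upper bounds; this call is a sketch, not taken from the lecture):
# maximize the coin log likelihood with optim instead of optimize
optim(par = 0.5, fn = g, data = data,
      method = "Brent", lower = 0.01, upper = 0.99,
      control = list(fnscale = -1))$par   # should again be close to 0.2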
nlminb is a function that allows you to do optimization subject to bounds (box constraints). This is useful if your input space is restricted.
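Since nlminb minimizes, we pass it the negative log likelihood; a sketch for our coin example (the bounds 0.01 and 0.99 are arbitrary choices to keep p inside (0, 1)):
# nlminb minimizes, so hand it the negative log likelihood
nlminb(start = 0.5,
       objective = function(p, data) -g(p, data),
       data = data,
       lower = 0.01, upper = 0.99)$par   # should again be close to 0.2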
For a quick guide to optimization see https://neos-guide.org/Optimization-Guide