1 General Optimization

In mathematics, nonlinear programming (NLP) is the process of solving an optimization problem where some of the constraints or the objective function are nonlinear. An optimization problem consists of calculating the extrema (maxima, minima or stationary points) of an objective function over a set of unknown real variables, subject to a system of equalities and inequalities collectively termed constraints. It is the sub-field of mathematical optimization that deals with problems that are not linear.

1.1 Two-dimensional function

The blue region (shown in the diagram below) is the feasible region. The tangency of the line with the feasible region represents the solution. The line is the best achievable contour line (the set of points with a given value of the objective function).

2-dimensional example

This simple problem can be defined by the constraints:

\[ \begin{eqnarray} x_1 & \ge & 0 \\ x_2 & \ge & 0 \\ x_1^2 + x_2^2 & \ge & 1 \\ x_1^2 + x_2^2 & \le & 2 \end{eqnarray} \]

with an objective function to be maximized

\[ f(x) = x_1 + x_2 , \text{where } x = (x_1, x_2). \]
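As a quick check of where the tangency occurs, the maximizer of this small problem can be worked out by hand. The following is only a sketch; it relies on the Cauchy-Schwarz inequality, which the original example does not mention. For any feasible point,

\[ x_1 + x_2 \le \sqrt{2}\,\sqrt{x_1^2 + x_2^2} \le \sqrt{2}\cdot\sqrt{2} = 2, \]

with equality exactly when \(x_1 = x_2\) and \(x_1^2 + x_2^2 = 2\). This gives \(x = (1, 1)\), which also satisfies the remaining constraints, so the maximum value is \(f(1, 1) = 2\).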

1.2 Three-dimensional function

The tangency (see diagram below) of the top surface with the constrained space in the center represents the solution.

3-dimensional example

This simple problem can be defined by the constraints:

\[ \begin{eqnarray} x_1^2 - x_2^2 + x_3^2 & \le & 2 \\ x_1^2 + x_2^2 + x_3^2 & \le & 10 \end{eqnarray} \]

with an objective function to be maximized

\[f(x) = x_1x_2 + x_2x_3, \text{where } x = (x_1, x_2, x_3).\]

2 Steepest Descent (SD)

The method of steepest descent works on differentiable functions and needs only first derivatives. It is used most often in problems involving more than one variable. The essential idea of steepest descent is that the function decreases most quickly in the direction of the negative gradient. Let’s assume we have the following function:

\[f(x)=f(x_1,x_2,\cdots, x_n)\]

The objective is to find the maximum or minimum value (according to our purpose).

2.1 SD Algorithm

  • The method starts at an initial guess \(x\).
  • The next guess is made by moving in the direction of the negative gradient. The location of the minimum along this line can then be found by using a one-dimensional search algorithm such as golden section search.
  • The nth update is then

\[x_n=x_{n-1}-\alpha f'(x_{n-1})\]

where \(\alpha\) is chosen to minimize the one-dimensional function:

\[g(\alpha)=f(x_{n-1}-\alpha f'(x_{n-1}))\]

In order to use golden section, we need to assume that \(\alpha\) is in an interval. So in this case, we take the interval to be \([0,h]\) where \(h\) is a value that we need to choose.
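The steps above can be sketched in R as follows. This is a minimal illustration rather than the code behind this document: the function names `golden_section` and `steepest_descent`, the quadratic test function, the tolerances, and the choice \(h = 1\) are all assumptions made for the example.

```r
# Golden section search for the minimizer of a unimodal function g on [a, b].
golden_section <- function(g, a, b, tol = 1e-8) {
  ratio <- (sqrt(5) - 1) / 2              # inverse golden ratio, about 0.618
  x1 <- b - ratio * (b - a)
  x2 <- a + ratio * (b - a)
  g1 <- g(x1)
  g2 <- g(x2)
  while (b - a > tol) {
    if (g1 < g2) {                        # minimum lies in [a, x2]
      b <- x2
      x2 <- x1; g2 <- g1
      x1 <- b - ratio * (b - a); g1 <- g(x1)
    } else {                              # minimum lies in [x1, b]
      a <- x1
      x1 <- x2; g1 <- g2
      x2 <- a + ratio * (b - a); g2 <- g(x2)
    }
  }
  (a + b) / 2
}

# Steepest descent: move along the negative gradient, choosing the step
# alpha in [0, h] by golden section search on g(alpha).
steepest_descent <- function(f, grad, x0, h = 1, tol = 1e-8, max_iter = 1000) {
  x <- x0
  for (n in seq_len(max_iter)) {
    d <- grad(x)
    if (sqrt(sum(d^2)) < tol) break       # gradient is (numerically) zero
    g <- function(alpha) f(x - alpha * d)
    alpha <- golden_section(g, 0, h)
    x <- x - alpha * d
  }
  x
}

# Illustrative quadratic with its minimum at (1, -2).
f_quad <- function(x) (x[1] - 1)^2 + 4 * (x[2] + 2)^2
grad_quad <- function(x) c(2 * (x[1] - 1), 8 * (x[2] + 2))
x_min <- steepest_descent(f_quad, grad_quad, x0 = c(0, 0), h = 1)
print(x_min)
```

If \(h\) is too small, the line search cannot reach the minimum along the search line and each step is shorter than it could be; if it is too large, \(g\) may no longer be unimodal on \([0,h]\), which golden section search assumes.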

2.3 SD Algorithm in Visualization

The following graphic visualizes the steepest descent process.

3-dimensional example

2.4 Two-dimensional Example

A simple example of a function of 2 variables to be minimized is

\[f(x)=f(x_1,x_2)=\frac{(2-x_1)^2}{2x_2^2} + \frac{(3-x_1)^2}{2x_2^2}+\log(x_2)\]

Note that \(x_2\) should be positive, so we might need to protect against negative values of \(x_2\).

  • First: We need R functions for both the objective function and its gradient.
  • Second: The minimum along each search line is located with a one-dimensional search algorithm such as golden section search.
  • Third: Let’s try a starting value of \(x = (0.1, 0.1)\).
## [1] "Warning: Maximum number of iterations reached"
## Minimizer1 Minimizer2 
##  2.4992967  0.7123675

We haven’t converged yet.

  • Fourth: One possibility is to run the procedure again, using the most recent result as our starting guess.
## Minimizer1 Minimizer2 
##  2.5000000  0.7071068
## Minimizer1 Minimizer2 
##  2.5000000  0.7071068

Done. The iteration has converged.
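The objective function and gradient needed in the first step can be sketched as below, together with a check that the reported minimizer is a stationary point. The names `f`, `grad_f` and `x_star` are illustrative, and returning `Inf` for non-positive \(x_2\) is just one assumed way of protecting against invalid values.

```r
# Objective from the example; x[2] must be positive, so return Inf otherwise.
f <- function(x) {
  if (x[2] <= 0) return(Inf)
  (2 - x[1])^2 / (2 * x[2]^2) + (3 - x[1])^2 / (2 * x[2]^2) + log(x[2])
}

# Analytic gradient: partial derivatives with respect to x[1] and x[2].
grad_f <- function(x) {
  c(-(2 - x[1]) / x[2]^2 - (3 - x[1]) / x[2]^2,
    -((2 - x[1])^2 + (3 - x[1])^2) / x[2]^3 + 1 / x[2])
}

# The output above suggests the minimizer (2.5, 1/sqrt(2)); the gradient
# should vanish there.
x_star <- c(2.5, 1 / sqrt(2))
print(grad_f(x_star))

# Compare the analytic gradient with central finite differences at another point.
x_test <- c(1, 2)
eps <- 1e-6
grad_num <- c((f(x_test + c(eps, 0)) - f(x_test - c(eps, 0))) / (2 * eps),
              (f(x_test + c(0, eps)) - f(x_test - c(0, eps))) / (2 * eps))
print(cbind(analytic = grad_f(x_test), numeric = grad_num))
```

Setting the gradient to zero by hand gives \(x_1 = 2.5\) and \(x_2^2 = 0.5\), which agrees with the converged values 2.5000000 and 0.7071068.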

2.5 Multivariate Normal

One can use steepest descent to compute the maximum likelihood estimate of the mean in a multivariate Normal density, given a sample of data. However, when the data are highly correlated, as they are in the simulated example below, the log-likelihood surface can become difficult to optimize. In such cases, a very narrow ridge develops in the log-likelihood that can be difficult for the steepest descent algorithm to navigate.

In the example below, we actually compute the negative log-likelihood because the algorithm is designed to minimize functions.

Note that in the figure above the surface is highly stretched and that the minimum (1,2) lies in the middle of a narrow valley. For the steepest descent algorithm we will start at the point (−5,−2) and track the path of the algorithm.

We can see that the path of the algorithm is rather winding as it traverses the narrow valley. We have fixed the step length in this case, which is probably not optimal. Even so, one can see that the algorithm has difficulty navigating the surface because the direction of steepest descent never points directly towards the minimum.
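A fixed-step version of this experiment can be sketched as follows. The simulated data, the known covariance matrix with correlation 0.9, the step length 0.15, and all variable names are assumptions chosen for illustration; they are not the settings used to produce the figures in this section.

```r
set.seed(1)
n <- 100
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)        # assumed known, highly correlated
z <- matrix(rnorm(2 * n), n, 2)
x <- sweep(z %*% chol(Sigma), 2, c(1, 2), "+")  # sample with true mean (1, 2)
xbar <- colMeans(x)
Sigma_inv <- solve(Sigma)

# Average negative log-likelihood as a function of the mean mu,
# up to an additive constant that does not depend on mu.
nll <- function(mu) 0.5 * drop(t(xbar - mu) %*% Sigma_inv %*% (xbar - mu))
nll_grad <- function(mu) drop(Sigma_inv %*% (mu - xbar))

# Steepest descent with a fixed step length, starting at (-5, -2).
mu <- c(-5, -2)
step <- 0.15
n_iter <- 500
path <- matrix(NA_real_, nrow = n_iter + 1, ncol = 2)
path[1, ] <- mu
for (i in seq_len(n_iter)) {
  mu <- mu - step * nll_grad(mu)
  path[i + 1, ] <- mu
}

print(mu)     # final iterate
print(xbar)   # the maximum likelihood estimate of the mean is the sample mean
```

Plotting the rows of `path` shows the zig-zag: with this step length the iterates overshoot across the narrow valley while creeping slowly along it.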

3 The Newton Direction

The Newton method is a fundamental tool in optimization. Its behavior depends on the initial point used, and it extends to n dimensions. In the steepest descent method, only first-order derivative information is used to determine the search direction. If second derivatives are also available, they can be used to represent the cost surface more accurately, and a better search direction can be found.

Overview of the Newton Method:

Newton Method

Newton Method 2

Newton Method 3

Newton Method 4

Newton’s method uses the Hessian of the function to calculate the search direction and has a quadratic rate of convergence, which means that it converges very rapidly when the design point is within a certain radius of the minimum point. There are several variants of Newton’s method, including the classical Newton’s method, the modified Newton’s method, and the Marquardt modification.

We can approximate \(f\) with a quadratic polynomial for small \(p\),

\[f(x_n + p) \approx f(x_n) + p'f'(x_n) + \frac{1}{2} p'f''(x_n)p.\]

Minimizing the right-hand side with respect to \(p\) gives the Newton direction

\[p_n = f''(x_n)^{-1}[-f'(x_n)].\]

This formula is the steepest descent direction twisted by the inverse of the Hessian matrix. The Newton update is:

\[x_{n+1} = x_n - f''(x_n)^{-1} f'(x_n)\]

Newton’s method makes a quadratic approximation to the target function \(f\) at each step of the algorithm. Each Newton step replaces the complicated function \(f\) with a simpler function \(g\) and optimizes \(g\) instead; this is repeated until the iterates converge to the solution.
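The update can be sketched in R for a small two-variable function. The test function, its hand-derived gradient and Hessian, the starting point, and the name `newton` are assumptions made for this illustration.

```r
# Illustrative smooth function with its minimum at (0, 0).
f <- function(x) exp(x[1]) - x[1] + (x[2] - x[1])^2
grad <- function(x) c(exp(x[1]) - 1 - 2 * (x[2] - x[1]),
                      2 * (x[2] - x[1]))
hess <- function(x) matrix(c(exp(x[1]) + 2, -2,
                             -2,             2), 2, 2)

# Plain Newton iteration: x_{n+1} = x_n - f''(x_n)^{-1} f'(x_n).
newton <- function(grad, hess, x0, tol = 1e-10, max_iter = 50) {
  x <- x0
  for (n in seq_len(max_iter)) {
    g <- grad(x)
    if (sqrt(sum(g^2)) < tol) break
    p <- solve(hess(x), -g)   # Newton direction; avoids forming the inverse
    x <- x + p
  }
  x
}

x_hat <- newton(grad, hess, x0 = c(1, -1))
print(x_hat)
```

Solving the linear system with `solve(hess(x), -g)` is generally preferable to computing the inverse Hessian explicitly and multiplying by it.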

3.1 Quadratic Approximation to the Target Function in One Dimension

The next iterate \(x_{n+1}\) can end up further away from the minimum than the previous iterate \(x_n\), so we can conclude that the quadratic approximation Newton’s method makes to \(f\) is not guaranteed to be good.

The successive iterations that Newton’s method produces are not guaranteed to be improvements in the sense that each iterate is closer to the truth. The tradeoff here is that while Newton’s method is very fast, it can be unstable at times.

Taking one more step gives the next approximation, \(x_{n+2}\), which is close to the true minimum.

In the special case where \(f\) is a quadratic polynomial, Newton’s method will converge in a single step because the quadratic approximation that it makes to \(f\) will be exact.
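The single-step convergence for a quadratic can be checked numerically. In this sketch the matrix `A`, the vector `b`, and the starting point are arbitrary illustrative choices; the function is \(f(x) = \frac{1}{2}x'Ax - b'x\), whose minimizer solves \(Ax = b\).

```r
A <- matrix(c(3, 1,
              1, 2), 2, 2)          # symmetric positive definite Hessian
b <- c(1, -1)

f <- function(x) 0.5 * drop(t(x) %*% A %*% x) - sum(b * x)
grad <- function(x) drop(A %*% x) - b   # gradient of f; the Hessian is A everywhere

x0 <- c(10, -7)                     # arbitrary starting point
x1 <- x0 - solve(A, grad(x0))       # one Newton step

print(x1)
print(solve(A, b))                  # exact minimizer, for comparison
```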

3.2 Generalized Linear Models

The extension of the standard linear model to non-Normal response distributions is called the generalized linear model (GLM). These distributions come from an exponential family, whose density functions share some common characteristics. In a GLM, the response is distributed as \(y_i \sim p(y_i \mid \mu_i)\), where \(p\) is an exponential family distribution and \(E[y_i] = \mu_i\).

\[g(\mu_i) = x_i'\beta,\] where \(g\) is a nonlinear link function and \(Var(y_i) = V(\mu_i)\).

The Fisher scoring algorithm uses a linear approximation to the nonlinear link function \(g\), which can be written as \[g(y_i) \approx g(\mu_i) + (y_i-\mu_i)g'(\mu_i).\] The working response can then be written as \(z_i = g(\mu_i) + (y_i-\mu_i)g'(\mu_i)\).

The Fisher scoring algorithm works as follows:

The Fisher scoring Algorithm

3.2.1 Example: Poisson Regression

We can draw a connection between the usual Fisher scoring algorithm for fitting GLMs and Newton’s method using Poisson regression as an example. In Poisson regression, we have \(y_i \sim \text{Poisson}(\mu_i)\), where \(g(\mu_i) = \log \mu_i = x_i'\beta\) because the log is the canonical link function for the Poisson distribution. We also have \(g'(\mu_i) = \frac{1}{\mu_i}\) and \(V(\mu_i) = \mu_i\). The Fisher scoring algorithm is

  1. Initialize \(\hat{\mu}_i\), perhaps using \(y_i + 1\) (to avoid zeros).

  2. Let \(z_i = \log\hat{\mu}_i + (y_i - \hat{\mu}_i) \frac{1}{\hat{\mu}_i}\)

  3. Regress \(z\) on \(X\) using the weights \[w_{ii} = \left[\frac{1}{\hat{\mu}_i^2}\hat{\mu}_i\right]^{-1} = \hat{\mu}_i.\]

The Newton updating scheme is \[\beta_{n+1} = \beta_n + \ell''(\beta_n)^{-1} [-\ell'(\beta_n)].\] The log-likelihood for a Poisson regression model can be written in vector/matrix form as \[\ell(\beta) = y'X\beta - \exp(X\beta)'1\] where the exponential is taken component-wise on the vector \(X\beta\). The gradient function is \[\ell'(\beta) = X'y - X'\exp(X\beta) = X'(y-\mu)\] and the Hessian is \[\ell''(\beta) = -X'WX\]

where \(W\) is a diagonal matrix with the values \(w_{ii} = \exp(x_i'\beta)\) on the diagonal.

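The Newton iteration is then \[\begin{aligned} \beta_{n+1} &= \beta_n + (-X'WX)^{-1}(-X'(y-\mu)) \\ &= \beta_n + (X'WX)^{-1}X'W(z - X\beta_n) \\ &= (X'WX)^{-1}X'Wz + \beta_n - (X'WX)^{-1}X'WX\beta_n \\ &= (X'WX)^{-1}X'Wz \end{aligned}\]

where the second line uses \(y - \mu = W(z - X\beta_n)\), which follows from the working response \(z_i = \log \mu_i + (y_i - \mu_i)/\mu_i\) with \(\log \mu_i = x_i'\beta_n\).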

The iteration is exactly the same as the Fisher scoring algorithm. In general, Newton’s method and Fisher scoring coincide for any generalized linear model that uses an exponential family with a canonical link function.
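The three steps of the Fisher scoring algorithm for Poisson regression can be sketched as a weighted least squares loop and compared with R’s built-in `glm()` fit. The simulated data, the true coefficients 0.5 and 0.8, the fixed count of 25 iterations, and the variable names are assumptions made for this illustration.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
X <- cbind(1, x)                          # design matrix with an intercept
y <- rpois(n, exp(0.5 + 0.8 * x))         # simulated Poisson responses

mu <- y + 1                               # step 1: initialize, avoiding zeros
for (iter in 1:25) {
  z <- log(mu) + (y - mu) / mu            # step 2: working response
  w <- mu                                 # step 3: weights w_ii = mu_i
  # Weighted least squares: beta = (X'WX)^{-1} X'Wz
  beta <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
  mu <- drop(exp(X %*% beta))             # update the fitted means
}
beta_fs <- drop(beta)

beta_glm <- unname(coef(glm(y ~ x, family = poisson)))
print(rbind(fisher_scoring = beta_fs, glm = beta_glm))
```

A fixed iteration count keeps the sketch short; in practice one would stop when the change in the coefficients or in the deviance falls below a tolerance.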

3.3 Newton’s Method in R

The nlm() function in R uses a Newton-type method to minimize a function given a vector of starting values. By default, one does not need to supply the gradient or Hessian functions; they will be estimated numerically by the algorithm. To improve the accuracy of the algorithm, both the gradient and Hessian can be supplied as attributes of the target function.
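A minimal call, relying on numerically estimated derivatives, might look like the following. The target function `f` and the starting values are illustrative assumptions; its minimum is at \((0, 0)\).

```r
# Illustrative target function with its minimum at (0, 0).
f <- function(x) exp(x[1]) - x[1] + (x[2] - x[1])^2

# nlm() takes the function and a vector of starting values; the gradient
# and Hessian are estimated numerically because none are supplied.
res <- nlm(f, p = c(1, -1))

print(res$estimate)     # location of the minimum found
print(res$minimum)      # function value at that location
print(res$gradient)     # numerical gradient at the estimate
print(res$code)         # termination code (see ?nlm)
print(res$iterations)   # number of iterations performed
```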

As an example, we will use the nlm() function to fit a simple logistic regression model for binary data. This model specifies that \(y_i \sim \text{Bernoulli}(p_i)\) where \[\log \frac{p_i}{1-p_i} = \beta_0 + x_i\beta_1\]

and the goal is to estimate \(\beta\) via maximum likelihood. Given the assumed Bernoulli distribution, we can write the log-likelihood for the whole sample as \[\log L(\beta) = \sum_{i=1}^n \left[ y_i(\beta_0 + x_i\beta_1)-\log(1+e^{(\beta_0+x_i\beta_1)}) \right].\] Taking a single element inside the sum, we have

\[\ell_i(\beta) = y_i(\beta_0+x_i\beta_1) - \log(1 + e^{(\beta_0+x_i\beta_1)}).\]

We will need the gradient and Hessian of this with respect to \(β\). Because the sum and the derivative are exchangeable, we can then sum each of the individual gradients and Hessians to get the full gradient and Hessian for the entire sample, so that \[ℓ'(β) = ∑_{i=1}^n ℓ_i'(β)\]

and \[ℓ''(β) = ∑_{i=1}^n ℓ_i''(β).\]

R provides an automated way to do symbolic differentiation so that manual work can be avoided. The deriv() function computes the gradient and Hessian of an expression symbolically so that it can be used in minimization routines. It cannot compute gradients of arbitrary expressions, but it does support a wide range of common statistical functions.
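As a sketch of how this could be put together, the code below uses deriv() to build the per-observation negative log-likelihood with its gradient, sums over the sample, and passes the result to nlm(). The simulated data, the true coefficients, and the names `nll_i` and `nll` are assumptions made for this illustration. Only the gradient is attached here; a Hessian requested with `hessian = TRUE` in deriv() could be summed and attached in the same way.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # simulated binary responses

# Negative log-likelihood of each observation, differentiated symbolically
# with respect to b0 and b1. function.arg makes deriv() return a function.
nll_i <- deriv(~ -(y * (b0 + x * b1) - log(1 + exp(b0 + x * b1))),
               c("b0", "b1"),
               function.arg = c("b0", "b1", "x", "y"))

# Sum the values and the gradients over the sample; nlm() picks up the
# "gradient" attribute of the returned value.
nll <- function(beta) {
  v <- nll_i(beta[1], beta[2], x, y)
  total <- sum(v)
  attr(total, "gradient") <- colSums(attr(v, "gradient"))
  total
}

fit <- nlm(nll, c(0, 0))
beta_glm <- unname(coef(glm(y ~ x, family = binomial)))
print(rbind(nlm = fit$estimate, glm = beta_glm))
```

The negative sign appears because nlm() minimizes, so maximizing the log-likelihood is done by minimizing its negative.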

