In mathematics, nonlinear programming (NLP) is the process of solving an optimization problem where some of the constraints or the objective function are nonlinear. An optimization problem is one of calculation of the extrema (maxima, minima or stationary points) of an objective function over a set of unknown real variables and conditional to the satisfaction of a system of equalities and inequalities, collectively termed constraints. It is the sub-field of mathematical optimization that deals with problems that are not linear.
The blue region shown in the diagram below is the feasible region. The tangency of the line with the feasible region represents the solution: the line is the best achievable contour line, that is, the set of points sharing a given value of the objective function.
2-dimensional example
This simple problem can be defined by the constraints:
\[ \begin{eqnarray} x_1 & \ge & 0 \\ x_2 & \ge & 0 \\ x_1^2 + x_2^2 & \ge & 1 \\ x_1^2 + x_2^2 & \le & 2 \end{eqnarray} \] with an objective function to be maximized
\[ f(x) = x_1 + x_2 , \text{where } x = (x_1, x_2). \]
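We can sanity-check this solution numerically with a crude grid search over the feasible region (a quick sketch; the grid spacing of 0.01 is an arbitrary choice):
grid <- expand.grid(x1 = seq(0, 2, by = 0.01), x2 = seq(0, 2, by = 0.01))
r2 <- grid$x1^2 + grid$x2^2
feasible <- grid[r2 >= 1 & r2 <= 2, ]   # keep points in the ring between the circles
best <- feasible[which.max(feasible$x1 + feasible$x2), ]
best   # close to x1 = x2 = 1, where the outer constraint is active and f = 2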
The tangency (see the diagram below) of the top surface with the constrained space in the center represents the solution.
3-dimensional example
This simple problem can be defined by the constraints:
\[ \begin{eqnarray} x_1^2 - x_2^2 + x_3^2 & \le & 2 \\ x_1^2 + x_2^2 + x_3^2 & \le & 10 \end{eqnarray} \]
with an objective function to be maximized
\[f(x) = x_1x_2 + x_2x_3, \text{where } x = (x_1, x_2, x_3).\]
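As in the 2-dimensional case, a crude random search can approximate the solution numerically (a sketch; the sample size, seed, and search box are arbitrary choices):
set.seed(1)
cand <- matrix(runif(3e5, -sqrt(10), sqrt(10)), ncol = 3)  # random candidate points
ok <- (cand[, 1]^2 - cand[, 2]^2 + cand[, 3]^2 <= 2) &
  (cand[, 1]^2 + cand[, 2]^2 + cand[, 3]^2 <= 10)          # feasibility check
fval <- cand[, 1] * cand[, 2] + cand[, 2] * cand[, 3]
cand[ok, ][which.max(fval[ok]), ]                          # approximate maximizer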
The method of steepest descent applies to differentiable functions and is used most often in problems involving more than one variable. The essential idea of steepest descent is that the function decreases most quickly in the direction of the negative gradient. Let’s assume we have the following function:
\[f(x)=f(x_1,x_2,\cdots, x_n)\]
The objective is to find the maximum or minimum value (according to our purpose). Starting from a point \(x_{n-1}\), each steepest descent step moves along the negative gradient:
\[x_n = x_{n-1} - \alpha f'(x_{n-1})\]
where \(\alpha\) is chosen to minimize the one-dimensional function:
\[g(\alpha)=f(x_{n-1}-\alpha f'(x_{n-1}))\]
In order to use the golden section search, we need to assume that \(\alpha\) lies in an interval. In this case, we take the interval to be \([0,h]\), where \(h\) is a value that we need to choose.
steepestdescent <- function(f, fprime, start, h,
                            tol=1e-7, maxiter=100) {
  x <- start
  fpx <- fprime(x)
  ## One-dimensional function minimized by the golden section line search
  g <- function(alpha) { f(x - alpha*fpx) }
  niter <- 0
  while (niter < maxiter && sum(abs(fpx)) > tol) {
    alpha <- golden(g, 0, h)   # line search for the step length
    x <- x - alpha*fpx         # take the steepest descent step
    fpx <- fprime(x)           # recompute the gradient at the new point
    niter <- niter + 1
  }
  if (niter == maxiter) {
    print("Warning: Maximum number of iterations reached")
  }
  c("Minimizer" = x)
}
You can see the steepest descent process visualized in the following graphic; another instructive example is the steepest descent method applied to a quadratic function.
2-variable example
A simple example of a function of 2 variables to be minimized is
\[f(x)=f(x_1,x_2)=\frac{(2-x_1)^2}{2x_2^2} + \frac{(3-x_1)^2}{2x_2^2}+\log(x_2)\]
Note that \(x_2\) should be positive, so we might need to protect against negative values of \(x_2\).
f1 <- function(x) {
  (2 - x[1])^2 / (2 * x[2]^2) + (3 - x[1])^2 / (2 * x[2]^2) + log(x[2])
}
f1prime <- function(x) {
  c(-(2 - x[1]) / x[2]^2 - (3 - x[1]) / x[2]^2,                  # df/dx1
    -(2 - x[1])^2 / x[2]^3 - (3 - x[1])^2 / x[2]^3 + 1 / x[2])   # df/dx2
}
golden <- function(f, a, b, tol = 1e-7) {
  ratio <- 2 / (sqrt(5) + 1)   # inverse golden ratio
  x1 <- b - ratio * (b - a)
  x2 <- a + ratio * (b - a)
  f1 <- f(x1)
  f2 <- f(x2)
  while (abs(b - a) > tol) {
    if (f2 > f1) {
      ## Minimum lies in [a, x2]: shrink the interval from the right
      b <- x2
      x2 <- x1
      f2 <- f1
      x1 <- b - ratio * (b - a)
      f1 <- f(x1)
    } else {
      ## Minimum lies in [x1, b]: shrink the interval from the left
      a <- x1
      x1 <- x2
      f1 <- f2
      x2 <- a + ratio * (b - a)
      f2 <- f(x2)
    }
  }
  return((a + b) / 2)
}
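The results below come from running the steepest descent routine on f1. The exact call is not shown in the original; presumably it was along these lines (the start value and h here are illustrative guesses):
steepestdescent(f1, f1prime, start = c(0, 1), h = 1)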
## [1] "Warning: Maximum number of iterations reached"
## Minimizer1 Minimizer2
## 2.4992967 0.7123675
We haven’t converged yet, so we run the algorithm again, allowing more iterations:
## Minimizer1 Minimizer2
## 2.5000000 0.7071068
Done: the algorithm has converged.
One can use steepest descent to compute the maximum likelihood estimate of the mean in a multivariate Normal density, given a sample of data. However, when the data are highly correlated, as they are in the simulated example below, the log-likelihood surface can become difficult to optimize. In such cases, a very narrow ridge develops in the log-likelihood that can be difficult for the steepest descent algorithm to navigate.
In the example below, we actually compute the negative log-likelihood because the algorithm is designed to minimize functions.
set.seed(2020-09-30)
mu <- c(1, 2)
S <- rbind(c(1, .9), c(.9, 1))
x <- MASS::mvrnorm(500, mu, S)
nloglike <- function(mu1, mu2) {
  dmv <- mvtnorm::dmvnorm(x, c(mu1, mu2), S, log = TRUE)
  -sum(dmv)
}
nloglike <- Vectorize(nloglike, c("mu1", "mu2"))
nx <- 40
ny <- 40
xg <- seq(-5, 5, len = nx)
yg <- seq(-5, 6, len = ny)
g <- expand.grid(xg, yg)
nLL <- nloglike(g[, 1], g[, 2])
z <- matrix(nLL, nx, ny)
par(mar = c(4.5, 4.5, 1, 1))
contour(xg, yg, z, nlevels = 40, xlab = expression(mu[1]),
ylab = expression(mu[2]))
abline(h = 0, v = 0, lty = 2)
Note that in the figure above the surface is highly stretched and that the minimum, near (1, 2), lies in the middle of a narrow valley. For the steepest descent algorithm we will start at the point (-5, -2) and track the path of the algorithm.
library(dplyr)   # for the %>% pipe used below

norm <- function(x) x / sqrt(sum(x^2))   # normalize a vector to unit length
Sinv <- solve(S) ## I know I said not to do this!
step1 <- function(mu, alpha = 1) {
  D <- sweep(x, 2, mu, "-")        # residuals x_i - mu
  score <- colSums(D) %>% norm     # normalized score direction
  mu + alpha * drop(Sinv %*% score)
}
steep <- function(mu, n = 10, ...) {
  results <- vector("list", length = n)
  for (i in seq_len(n)) {
    results[[i]] <- step1(mu, ...)
    mu <- results[[i]]
  }
  results
}
m <- do.call("rbind", steep(c(-5, -2), 8))
m <- rbind(c(-5, -2), m)
par(mar = c(4.5, 4.5, 1, 1))
contour(xg, yg, z, nlevels = 40, xlab = expression(mu[1]),
ylab = expression(mu[2]))
abline(h = 0, v = 0, lty = 2)
points(m, pch = 20, type = "b")
We can see that the path of the algorithm is rather winding as it traverses the narrow valley. Now, we have fixed the step length in this case, which is probably not optimal. However, one can still see that the algorithm has some difficulty navigating the surface because the direction of steepest descent never points directly toward the minimum.
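Since the fixed step length is probably not optimal, one natural variation is to choose alpha at each step with a line search, for example by reusing the golden() routine from earlier (a sketch, not the author's code; the search interval [0, 5] is an arbitrary choice):
step1_ls <- function(mu) {
  D <- sweep(x, 2, mu, "-")
  score <- norm(colSums(D))
  dir <- drop(Sinv %*% score)                        # search direction
  g <- function(alpha) nloglike(mu[1] + alpha * dir[1],
                                mu[2] + alpha * dir[2])
  alpha <- golden(g, 0, 5)                           # line search for step length
  mu + alpha * dir
}
This trades extra function evaluations at each step for better-chosen step lengths.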
The Newton method is a fundamental tool in optimization. It depends on the initial condition used and extends to \(n\) dimensions. In the steepest descent method, only first-order derivative information is used to determine the search direction; Newton's method also uses second derivatives, which represent the cost surface more accurately, so a better search direction can be found.
Overview of the Newton method:
The Newton method uses the Hessian to calculate the search direction and has a quadratic rate of convergence, which means that it converges very rapidly when the design point is within a certain radius of the minimum point. There are several variants of Newton's method: the classical Newton's method, the modified Newton's method, and the Marquardt modification.
We can approximate \(f\) with a quadratic polynomial for small \(p\),
\[f(x_n + p) \approx f(x_n) + p'f'(x_n) + \frac{1}{2}p'f''(x_n)p.\]
Minimizing this quadratic with respect to \(p\) gives the Newton step
\[p_n = f''(x_n)^{-1}[-f'(x_n)].\]
This is the steepest descent direction twisted by the inverse of the Hessian matrix. The Newton iteration is then
\[x_{n+1} = x_n - f''(x_n)^{-1}f'(x_n).\]
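As a sketch of how this iteration looks in code (assuming user-supplied first and second derivative functions fp and fpp, which are hypothetical names, not from the text):
newton <- function(fp, fpp, x0, tol = 1e-8, maxiter = 100) {
  x <- x0
  for (i in seq_len(maxiter)) {
    step <- solve(fpp(x), fp(x))   # the Newton step f''(x)^{-1} f'(x)
    x <- x - step
    if (sum(abs(step)) < tol) break
  }
  x
}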
Newton’s method makes a quadratic approximation to the target function \(f\) at each step of the algorithm. A Newton step replaces the complicated function \(f\) with a simpler quadratic function \(g\), optimizes \(g\), and repeats until convergence to the solution.
curve(-dnorm(x), -2, 3, lwd = 2, ylim = c(-0.55, .1))
xn <- -1.2
abline(v = xn, lty = 2)
axis(3, xn, expression(x[n]))
g <- function(x) {
  ## Quadratic approximation to f(x) = -dnorm(x) around the current point xn
  -dnorm(xn) + (x - xn) * xn * dnorm(xn) -
    0.5 * (x - xn)^2 * (dnorm(xn) - xn * (xn * dnorm(xn)))
}
curve(g, -2, 3, add = TRUE, col = 4)
op <- optimize(g, c(0, 3))
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+1]))
The next iterate \(x_{n+1}\) is further away from the true minimum than \(x_n\) was, so we can conclude that the quadratic approximation Newton’s method makes to \(f\) is not guaranteed to be good.
The successive iterations that Newton’s method produces are not guaranteed to be improvements, in the sense that each iterate is closer to the truth. The tradeoff here is that while Newton’s method is very fast, it can be unstable at times.
Repeating the step from \(x_{n+1}\), as shown below, gives a next approximation \(x_{n+2}\) that is close to the true minimum.
curve(-dnorm(x), -2, 3, lwd = 2, ylim = c(-0.55, .1))
xn <- -1.2
op <- optimize(g, c(0, 3))       # minimize the approximation at x[n]
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+1]))
xn <- op$minimum                 # re-center the approximation at x[n+1]
curve(g, -2, 3, add = TRUE, col = 4)
op <- optimize(g, c(0, 3))
abline(v = op$minimum, lty = 2)
axis(3, op$minimum, expression(x[n+2]))
In the rare occasion that \(f\) is itself a quadratic polynomial, Newton’s method will converge in a single step because the quadratic approximation that it makes to \(f\) will be exact.
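We can verify the one-step behavior directly. For a quadratic such as \(f(x) = (x-3)^2\), with \(f'(x) = 2(x-3)\) and \(f''(x) = 2\), a single Newton update from any starting point lands exactly on the minimizer (a small check, not in the original text):
xn <- 10                            # arbitrary starting point
x_next <- xn - (2 * (xn - 3)) / 2   # x - f'(x)/f''(x)
x_next                              # exactly 3, the minimizer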
The extension of the standard linear model to non-Normal response distributions is called the generalized linear model (GLM). These distributions come from an exponential family, whose density functions share some common characteristics. With a GLM, the distribution is \(y_i \sim p(y_i \mid μ_i)\), where \(p\) is an exponential family distribution with \(E[y_i]=μ_i\),
\[ g(μ_i)=x_i'β \]
where \(g\) is a nonlinear link function, and \(\text{Var}(y_i)=V(μ_i)\).
The Fisher scoring algorithm uses a linear approximation to the nonlinear link function \(g\), which can be written as \[g(y_i) \approx g(μ_i) + (y_i-μ_i)g'(μ_i).\] The right-hand side, \(z_i = g(μ_i) + (y_i-μ_i)g'(μ_i)\), is called the working response.
The Fisher scoring algorithm works as follows:
Initialize \(\hat{μ}_i\), for example with \(\hat{μ}_i = y_i\) (adjusted to avoid boundary values).
Form the working response \(z_i = g(\hat{μ}_i) + (y_i - \hat{μ}_i)g'(\hat{μ}_i)\) and the weights \(w_{ii} = [g'(\hat{μ}_i)^2 V(\hat{μ}_i)]^{-1}\).
Regress \(z\) on \(X\) using the weights \(w_{ii}\) to obtain an updated \(\hat{β}\), set \(\hat{μ}_i = g^{-1}(x_i'\hat{β})\), and repeat until convergence.
We can draw a connection between the usual Fisher scoring algorithm for fitting GLMs and Newton’s method using a Poisson regression example. In Poisson regression, we have \(y_i \sim \text{Poisson}(μ_i)\), where \(g(μ_i) = \log μ_i = x_i'β\) because the log is the canonical link function for the Poisson distribution. We also have \(g'(μ_i) = \frac{1}{μ_i}\) and \(V(μ_i) = μ_i\). The Fisher scoring algorithm is
Initialize \(\hat{μ}_i\), perhaps using \(y_i + 1\) (to avoid zeros).
Let \(z_i = \log\hat{μ}_i + (y_i - \hat{μ}_i)\frac{1}{\hat{μ}_i}\).
Regress \(z\) on \(X\) using the weights \[w_{ii} = \left[\frac{1}{\hat{μ}_i^2}\hat{μ}_i\right]^{-1} = \hat{μ}_i.\]
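These steps, iterated until convergence, can be written out directly in R (a minimal sketch assuming a design matrix X with an intercept column and a count response y; fisher_poisson is a hypothetical helper, not from the original text):
fisher_poisson <- function(X, y, maxit = 25, tol = 1e-8) {
  mu <- y + 1                               # step 1: initialize, avoiding zeros
  beta <- rep(0, ncol(X))
  for (i in seq_len(maxit)) {
    z <- log(mu) + (y - mu) / mu            # step 2: working response
    w <- mu                                 # weights w_ii = mu_i
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
    mu <- exp(drop(X %*% beta))             # step 3: update the means and repeat
  }
  beta
}
For comparison, glm(y ~ X - 1, family = poisson) should give essentially the same coefficients.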
The Newton updating scheme is \[β_{n+1} = β_n + ℓ''(β_n)^{-1} [-ℓ'(β_n)].\] The log-likelihood for a Poisson regression model can be written in vector/matrix form as \[ℓ(β) = y'Xβ - \exp(Xβ)'1\] where the exponential is taken component-wise on the vector \(Xβ\). The gradient is \[ℓ'(β) = X'y - X'\exp(Xβ) = X'(y-μ)\] and the Hessian is \[ℓ''(β)=-X'WX\]
where \(W\) is a diagonal matrix with the values \(w_{ii} = \exp(x_i'β)\) on the diagonal.
The Newton iteration is then \[β_{n+1} = β_n + (-X'WX)^{-1}(-X'(y-μ))\] \[= β_n + (X'WX)^{-1}X'W(z - Xβ_n)\]
\[=(X'WX)^{-1}X'Wz + β_n - (X'WX)^{-1}X'WXβ_n\] \[=(X'WX)^{-1}X'Wz\]
The iteration is exactly the same as the Fisher scoring algorithm. In general, Newton’s method and Fisher scoring coincide for any generalized linear model using an exponential family with a canonical link function.
The nlm() function in R implements Newton’s method for minimizing a function given a vector of starting values. By default, one does not need to supply the gradient or Hessian functions; they will be estimated numerically by the algorithm. To improve the accuracy of the algorithm, both the gradient and Hessian can be supplied as attributes of the target function.
As an example, we will use the nlm() function to fit a simple logistic regression model for binary data. This model specifies that \(y_i \sim \text{Bernoulli}(p_i)\) where \[\log \frac{p_i}{1-p_i} = β_0 + x_iβ_1\]
and the goal is to estimate \(β\) via maximum likelihood. Given the assumed Bernoulli distribution, we can write the log-likelihood as \[\log L(β) = \sum_{i=1}^n y_i(β_0 + x_iβ_1)-\log(1+e^{β_0+x_iβ_1}).\] Taking a single element inside the sum, we have
\[ℓ_i(β) = y_i(β_0+x_iβ_1) - \log(1 + e^{β_0+x_iβ_1}).\]
We will need the gradient and Hessian of this with respect to \(β\). Because the sum and the derivative are exchangeable, we can then sum each of the individual gradients and Hessians to get the full gradient and Hessian for the entire sample, so that \[ℓ'(β) = ∑_{i=1}^n ℓ_i'(β)\]
and \[ℓ''(β) = ∑_{i=1}^n ℓ_i''(β).\]
R provides an automated way to do symbolic differentiation so that manual work can be avoided. The deriv() function computes the gradient and Hessian of an expression symbolically so that they can be used in minimization routines. It cannot compute gradients of arbitrary expressions, but it does support a wide range of common statistical functions.
The tidypvals package written by Jeff Leek contains datasets taken from the literature collecting p-values associated with various publications along with some information about those publications (i.e. journal, year, DOI). One question that comes up is whether there has been any trend over time in the claimed statistical significance of publications, where “statistical significance” is defined as having a p-value less than \(0.05\). The tidypvals package is available from GitHub and can be installed using the install_github() function in the remotes package.
Once installed, we will make use of the jager2014 dataset. In particular, we are interested in creating an indicator of whether a p-value is less than 0.05 and regressing it on the year variable.
library(tidypvals)
library(dplyr)
jager <- mutate(tidypvals::jager2014,
                pvalue = as.numeric(as.character(pvalue)),
                y = ifelse(pvalue < 0.05
                           | (pvalue == 0.05 & operator == "lessthan"),
                           1, 0),
                x = year - 2000) %>%
  as_tibble
Note here that we have subtracted the year 2000 off of the year variable so that \(x = 0\) corresponds to year == 2000.
Next we compute the gradient and Hessian of the negative log-likelihood with respect to \(β_0\) and \(β_1\) using the deriv() function. Below, we specify function.arg = TRUE in the call to deriv() because we want deriv() to return a function whose arguments are b0 and b1.
nll_one <- deriv(~ -(y * (b0 + x * b1) - log(1 + exp(b0 + b1 * x))),
c("b0", "b1"), function.arg = TRUE, hessian = TRUE)
Here’s what that function looks like.
## function (b0, b1)
## {
## .expr6 <- exp(b0 + b1 * x)
## .expr7 <- 1 + .expr6
## .expr11 <- .expr6/.expr7
## .expr15 <- .expr7^2
## .expr18 <- .expr6 * x
## .expr19 <- .expr18/.expr7
## .value <- -(y * (b0 + x * b1) - log(.expr7))
## .grad <- array(0, c(length(.value), 2L), list(NULL, c("b0",
## "b1")))
## .hessian <- array(0, c(length(.value), 2L, 2L), list(NULL,
## c("b0", "b1"), c("b0", "b1")))
## .grad[, "b0"] <- -(y - .expr11)
## .hessian[, "b0", "b0"] <- .expr11 - .expr6 * .expr6/.expr15
## .hessian[, "b0", "b1"] <- .hessian[, "b1", "b0"] <- .expr19 -
## .expr6 * .expr18/.expr15
## .grad[, "b1"] <- -(y * x - .expr19)
## .hessian[, "b1", "b1"] <- .expr18 * x/.expr7 - .expr18 *
## .expr18/.expr15
## attr(.value, "gradient") <- .grad
## attr(.value, "hessian") <- .hessian
## .value
## }
The function nll_one() produced by deriv() evaluates the negative log-likelihood for each data point. The output from nll_one() will have attributes “gradient” and “hessian”, which represent the gradient and Hessian, respectively. For example, using the data from the jager dataset, we can evaluate the negative log-likelihood at \(β_0=0, β_1=0\).
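The call that produced the output below is not shown in the original; presumably x and y were first pulled out of the jager data frame, along these lines:
x <- jager$x
y <- jager$y
str(nll_one(0, 0))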
## num [1:15653] 0.693 0.693 0.693 0.693 0.693 ...
## - attr(*, "gradient")= num [1:15653, 1:2] -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "b0" "b1"
## - attr(*, "hessian")= num [1:15653, 1:2, 1:2] 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 ...
## ..- attr(*, "dimnames")=List of 3
## .. ..$ : NULL
## .. ..$ : chr [1:2] "b0" "b1"
## .. ..$ : chr [1:2] "b0" "b1"
The nll_one() function evaluates the negative log-likelihood at each data point, but does not sum the points up, as would be required to evaluate the full negative log-likelihood. Therefore, we will write a separate function that does that for the negative log-likelihood, gradient, and Hessian.
nll <- function(b) {
  v <- nll_one(b[1], b[2])
  f <- sum(v)                                      ## full negative log-likelihood
  gr <- colSums(attr(v, "gradient"))               ## gradient vector
  hess <- apply(attr(v, "hessian"), c(2, 3), sum)  ## Hessian matrix
  attributes(f) <- list(gradient = gr,
                        hessian = hess)
  f
}
Now, we can evaluate the full negative log-likelihood with the nll() function. Note that nll() takes a single numeric vector as input, as this is what the nlm() function is expecting.
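The evaluation shown below presumably came from a call of this form (a guess at the elided code):
nll(c(0, 0))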
## [1] 10849.83
## attr(,"gradient")
## b0 b1
## -4586.5 -21854.5
## attr(,"hessian")
## b0 b1
## b0 3913.25 19618.25
## b1 19618.25 137733.75
Using \(β_0=0, β_1=0\) as the initial value, we can call nlm() to minimize the negative log-likelihood.
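The call that produced the output below is not shown in the original; presumably it was something like:
res <- nlm(nll, c(0, 0))
res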
## $minimum
## [1] 7956.976
##
## $estimate
## [1] 1.57032807 -0.04416515
##
## $gradient
## [1] -1.451746e-06 -2.782241e-06
##
## $code
## [1] 1
##
## $iterations
## [1] 4
Note first in the output that there is a code with the value 4 and that the number of iterations is 100. Whenever the number of iterations in an optimization algorithm is a nice round number, the chances are good that it hit some preset iteration limit. This in turn usually means the algorithm didn’t converge.
In the help for nlm() we also learn that the code value of 4 means “iteration limit exceeded”, which is generally not good. Luckily, the solution is simple: we can increase the iteration limit and let the algorithm run longer.
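Again the exact call is not shown; presumably it raised the iterlim argument of nlm(), along these lines:
nlm(nll, c(0, 0), iterlim = 1000)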
## $minimum
## [1] 7956.976
##
## $estimate
## [1] 1.57032807 -0.04416515
##
## $gradient
## [1] -1.451746e-06 -2.782241e-06
##
## $code
## [1] 1
##
## $iterations
## [1] 4
Here we see that the number of iterations used was 260, which is well below the iteration limit. Now we get code equal to 2 which means that “successive iterates within tolerance, current iterate is probably solution”. Sounds like good news!
Lastly, most optimization algorithms have an option to scale your parameter values so that they roughly vary on the same scale. If your target function has parameters that vary on wildly different scales, this can cause a practical problem for the computer (it’s not a problem for the theory). The way to deal with this in nlm() is to use the typsize argument, which is a vector equal in length to the parameter vector which provides the relative sizes of the parameters.
Here, I give typsize = c(1, 0.1), which indicates to nlm() that the first parameter, \(β_0\), should be roughly 10 times larger than the second parameter, \(β_1\), when the target function is at its minimum.
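Presumably the call was along these lines (the original is not shown):
nlm(nll, c(0, 0), iterlim = 1000, typsize = c(1, 0.1))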
## $minimum
## [1] 7956.976
##
## $estimate
## [1] 1.57032807 -0.04416515
##
## $gradient
## [1] -1.451745e-06 -2.782238e-06
##
## $code
## [1] 1
##
## $iterations
## [1] 4
Running this call to nlm() you’ll notice that the solution is the same but the number of iterations is actually much less than before (4 iterations) which means the algorithm ran faster. Generally speaking, scaling the parameter vector appropriately (if possible) improves the performance of all optimization algorithms and in my experience is almost always a good idea. The specific values given to the typsize argument are not important; rather their relationships to each other (i.e. orders of magnitude) are what matter.