(Instructor : Nishant Panda)

Newton’s method

We continue with root finding techniques. Today we will cover the most widely used root finding technique, called Newton’s method (Newton-Raphson).

We assume that \(f\) is smooth (differentiable). Our goal is to find a root of \(f\), i.e., an \(x\) value for which \(f(x) = 0\). We illustrate the method with a simple example for which you already know the root.

Example 1: Find the root of \(f(x) = x^2 - 9\) for \(x > 0\) .

First step : Plot.

x.seq <- seq(from = 0.1, to = 10, length.out = 101)
f <- sapply(x.seq, function(x) x^2 - 9)
library(ggplot2)
# put the x and y values in a data frame
plot_data <- data.frame("x" = x.seq, "f" = f)


# create the base layer of ggplot with geom_line
plot_f <- ggplot(plot_data, aes(x = x, y = f)) + 
  geom_line(color = "red", size = 1.0) + 
  labs(x = "x", y = "f(x) = x^2 - 9")


# beautify : Increase font size in axes 
plot_f + theme(axis.text.x = 
                 element_text(face = "bold",size = 12),
               axis.text.y = 
                 element_text(face = "bold", size = 12),
               axis.line = element_line(colour = "black", 
                      size = 1, linetype = "solid"))

Looks like the function is 0 around 3. This matches up with what we expect. Let us zoom in.

plot_f + 
  geom_hline(yintercept = 0.00, color = "green", size = 1) +
  coord_cartesian(xlim = c(2,4), ylim = c(-1, 3))

Newton’s method starts with an initial guess and updates that guess using the derivative of \(f\). For our example, the derivative of \(f\) is given by \[ f'(x) = 2x \]

The equation of the tangent line to \(f\) at a point \(a\) is given by \[ y = f'(a)(x - a) + f(a). \] Can you derive this equation using the point-slope form of a line?

(Home Assignment!) At what \(x\) value is \(y = 0\)?

This is how Newton’s method works.

  1. Let \(x^{(0)}\) be our first guess at the root of \(f\). Near \(x^{(0)}\), \(f\) is approximately equal to its tangent line at \(x^{(0)}\), given by \[ y = f'(x^{(0)})(x - x^{(0)}) + f(x^{(0)}). \]
  2. Since \(f\) is approximately equal to its tangent line \(y\), \(f\) is approximately \(0\) where \(y\) is \(0\). Setting \(y\) equal to \(0\) gives our updated guess \[ x^{(1)} = x^{(0)} - \frac{f\left(x^{(0)}\right)}{f'\left(x^{(0)}\right)} \]

  3. In general starting at guess \(x^{(n)}\), our updated guess is \[ x^{(n+1)} = x^{(n)} - \frac{f\left(x^{(n)}\right)}{f'\left(x^{(n)}\right)} \]

  4. We stop if \[ \lvert x^{(n+1)} - x^{(n)} \rvert < \epsilon, \] or after a pre-set maximum number of iterations \(M\), in case our guesses diverge (see the code sketch below).
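These steps translate almost directly into code. Here is a minimal sketch of the iteration in R; the function name newton_1d and its arguments are my own choices and not part of any package.

# a minimal sketch of Newton's method in one dimension
# f        : the function whose root we want
# df       : its derivative
# x0       : initial guess
# eps      : stop when successive guesses differ by less than eps
# max.iter : give up after this many iterations (in case we diverge)
newton_1d <- function(f, df, x0, eps = 1e-8, max.iter = 100){
  x <- x0
  for(i in seq_len(max.iter)){
    x_new <- x - f(x)/df(x)   # the Newton update
    if(abs(x_new - x) < eps) return(x_new)
    x <- x_new
  }
  warning("did not converge within max.iter iterations")
  x
}

# for f(x) = x^2 - 9 starting at x0 = 4 this returns a value very close to 3
newton_1d(function(x) x^2 - 9, function(x) 2*x, x0 = 4)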

Let us visualize this to find the root of our function \(f(x) = x^2 - 9\).

First we need to create a function that returns the tangent line to the function \(f\) at some point \(a\).

# Let us create our tangent line function at some point a
f_tangent <- function(x, f, df, a){
  # We need the function and its derivative df
  # a is the point where we create the tangent to the function
  # returns the tangent line 
  df(a)*(x - a) + f(a)
}

We plot the tangent line to \(f\) at \(x^{(0)} = 4\) which will be our first guess of the root.

# add the tangent line at 4 to our data frame
plot_data$f.tan <- sapply(plot_data$x, f_tangent, f = function(x) x^2 - 9, df = function(x) 2*x, a = 4)

# create the base layer of ggplot with geom_line
plot_tangent_line <- ggplot(plot_data, aes(x = x)) + 
  geom_line(aes(y = f), color = "red", size = 0.95) +
  geom_line(aes(y = f.tan), color = "blue", size = 0.95) +
  labs(x = "x", y = "f(x) and tangent line at 4") 
plot_tangent_line

Now let us zoom in on \(2\leq x \leq 4\) and try to see where the tangent line is equal to \(0\).

# zoom in between 2 and 4
plot_tangent_line + 
  # add a horizontal line at 0
  geom_hline(yintercept = 0.00, color = "green", size = 1) +
  coord_cartesian(xlim = c(2,4), ylim = c(-5, 10))

Looks like the tangent line is \(0\) around \(3.1\). This will be our updated guess at the root. Understand the rationale behind this. If \(f\) is approximately equal to its tangent line at \(4\), then the root of \(f\) is approximately equal to the root of the tangent line. Can you see this in the plot above?

The exact value at which the tangent line is \(0\) is given by \[ x^{(1)} = 4 - \frac{f\left(4\right)}{f'\left(4\right)} = 3.125 \]
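We can verify this step with a quick computation (plain arithmetic, using the same \(f\) and \(f'\) as above):

# one Newton step starting from x0 = 4
x0 <- 4
x0 - (x0^2 - 9)/(2*x0)   # equals 4 - 7/8 = 3.125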

# add the tangent line at 3.125 to our data frame
plot_data$f.tan <- sapply(plot_data$x, f_tangent, f = function(x) x^2 - 9, df = function(x) 2*x, a = 3.125)

# create the base layer of ggplot with geom_line
plot_tangent_line <- ggplot(plot_data, aes(x = x)) + 
  geom_line(aes(y = f), color = "red", size = 0.95) +
  geom_line(aes(y = f.tan), color = "blue", size = 0.95) +
  labs(x = "x", y = "f(x) and tangent line at 3.125") 
plot_tangent_line

Now let us zoom in around \(3\) and try to see where the tangent line is equal to \(0\).

# zoom in between 2.9 and 3.1
plot_tangent_line + 
  # add a horizontal line at 0
  geom_hline(yintercept = 0.00, color = "green", size = 1) +
  coord_cartesian(xlim = c(2.9,3.1), ylim = c(-2.5, 2.5))

Looks like the tangent line is \(0\) around \(3\). In just two steps we are very close. This will be our updated guess at the root. The exact value at which the tangent line is \(0\) is given by \[ x^{(2)} = 3.125 - \frac{f\left(3.125\right)}{f'\left(3.125\right)} = 3.0025 \]
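Again, a quick check of the arithmetic:

# second Newton step, starting from 3.125
x1 <- 3.125
x1 - (x1^2 - 9)/(2*x1)   # equals 3.125 - 0.765625/6.25 = 3.0025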

We can continue like this until the change between successive guesses becomes smaller than our tolerance \(\epsilon\).

# Newton's method from package NLRoot
library(NLRoot)
# first argument is the function whose root we need to find
# second argument is the derivative of the function whose 
# root we need to find.
# third argument is an initial guess
NIMfzero(function(x) x^2-9, function(x) 2*x, x0 = 4)
## [1] 3
## [1] 1.081801e-12
## [1] "finding root is successful"

(Home Assignment!) Use the NLRoot package and the NIMfzero function to find the root of the function \(f(x) = x^2 - 9\) with the starting value \(x0 = -2\). Is the answer intuitive?

Problems with Newton’s method.

Please read “Failure Analysis” on the Wikipedia page for Newton’s method: https://en.wikipedia.org/wiki/Newton%27s_method

Back to optimization and multivariate root finding

Recall that our motivation for studying root finding was to determine maxima. In other words, if \(g(x)\) is the function we need to optimize, then we look for the roots of the function \(f(x) = g'(x)\). In order to use Newton’s method we need \(f'(x) = g''(x)\), and thus we need our function \(g\) to be at least twice differentiable.

With this in mind, our update equation for the bisection method is \[ x^{(n+1)} = \frac{a_{n+1} + b_{n+1}}{2}, \] where \[ \left[a_{n+1}, b_{n+1}\right] = \begin{cases} \left[a_{n}, x^{(n)}\right], & \text{if } g'(a_{n})\,g'(x^{(n)}) \leq 0 \\ \left[x^{(n)}, b_{n}\right], & \text{if } g'(a_{n})\,g'(x^{(n)}) > 0 \end{cases} \] and our update equation for Newton’s method is \[ x^{(n+1)} = x^{(n)} - \frac{g'\left(x^{(n)}\right)}{g''\left(x^{(n)}\right)} \]
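To make the optimization update concrete, here is a small sketch on a made-up example (deliberately not the home-assignment function below): maximize \(g_1(x) = x e^{-x}\), whose maximum is at \(x = 1\). The derivatives are written out by hand, and all names here are my own choices.

# Newton's method for 1-D optimization on a made-up example:
# maximize g1(x) = x * exp(-x), whose maximum is at x = 1
g1   <- function(x) x*exp(-x)
dg1  <- function(x) (1 - x)*exp(-x)   # g1'(x)
d2g1 <- function(x) (x - 2)*exp(-x)   # g1''(x)

x <- 0.5                              # starting guess
for(i in 1:10){
  x <- x - dg1(x)/d2g1(x)             # the optimization update above
}
x                                     # very close to 1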

(Home Assignment!) Use the NLRoot package and the NIMfzero function to find the maximum of the function \(g(x) = \frac{\log(x)}{1 + x}\). Does the answer match the one from the previous lecture where we used the bisection method?

(Home Assignment!) Mimic the plots from this lecture to draw three plots showing the guessed root when using Newton’s method to find the maximum of the function \(g(x) = \frac{\log(x)}{1 + x}\).

So far we have talked about optimizing functions whose input space is a subset of the real numbers. We now extend this to real valued functions that take a vector as input.

Let \(g\) be a function that takes an input vector \(\mathbf{x} = (x_1, x_2, \dots, x_p)\) in \(\mathbb{R}^{p}\) and gives a real number as output. We wish to find the input vector \(\mathbf{x}^{\ast}\) where \(g\) is maximum (or minimum!). Let \(\nabla{g}(\mathbf{x})_{p\times 1}\) be the gradient vector (the first derivative, \(g'(\mathbf{x})\)) and let \(H_{p\times p}(\mathbf{x})\) be the Hessian matrix (the second derivative, \(g''(\mathbf{x})\)). If \(p = 1\), we are back in the familiar setting of a real valued input space.

The following steps outline Newton’s method for finding the maximum \(\mathbf{x}^{\ast}\) (the optimum) of \(g(\mathbf{x})\).

  1. Let \(\mathbf{x}^{(n)}\) be the current guess for the optimum \(\mathbf{x}^{\ast}\). Approximate \(g(\mathbf{x}^{\ast})\) by the second-order Taylor expansion \[ g(\mathbf{x}^{\ast}) \approx g(\mathbf{x^{(n)}}) + (\mathbf{x^{\ast}} - \mathbf{x}^{(n)})^{T}\nabla{g}(\mathbf{x^{(n)}})_{p\times 1} + \frac{1}{2}(\mathbf{x^{\ast}} - \mathbf{x}^{(n)})^{T}H_{p\times p}(\mathbf{x^{(n)}})(\mathbf{x^{\ast}} - \mathbf{x}^{(n)}). \]

  2. Take the gradient of this approximation with respect to \(\mathbf{x}^{\ast}\); it is approximately \(0\) (why?): \[ \nabla{g}(\mathbf{x^{(n)}})_{p\times 1} + H_{p\times p}(\mathbf{x^{(n)}})(\mathbf{x^{\ast}} - \mathbf{x}^{(n)}) \approx 0 \]

  3. Solve for \(\mathbf{x}^{(n+1)}\) by setting the expression above exactly to \(0\) and replacing \(\mathbf{x}^{\ast}\) by \(\mathbf{x}^{(n+1)}\). This amounts to solving the linear system for \(\mathbf{x}^{(n+1)}\), \[ A_{p\times p}\mathbf{x^{(n+1)}} = b_{p\times 1}, \] where \(A_{p\times p}\) is the Hessian matrix at \(\mathbf{x}^{(n)}\), i.e. \(A_{p\times p} = H_{p\times p}(\mathbf{x^{(n)}})\), and \(b_{p\times 1}\) is the vector \(H_{p\times p}(\mathbf{x^{(n)}})\mathbf{x}^{(n)} - \nabla{g}(\mathbf{x^{(n)}})_{p\times 1}\).

Thus your updated guess \(\mathbf{x}^{(n + 1)}\) is \[ \mathbf{x}^{(n + 1)} = \mathbf{x}^{(n)} - H_{p\times p}^{-1}(\mathbf{x^{(n)}})\nabla{g}(\mathbf{x^{(n)}})_{p\times 1} \] (Note that in real life we never compute the inverse of a matrix unless \(p\) is very small; see the short solve() illustration after these steps.)

  4. Stop when the distance between two successive guesses is smaller than some tolerance level, \(d(\mathbf{x^{(n+1)}}, \mathbf{x^{(n)}}) < \epsilon\). There are many distance functions. The simplest one that you may know is the Euclidean norm \[ d_E(\mathbf{x^{(n+1)}}, \mathbf{x^{(n)}}) = \lvert\lvert \mathbf{x^{(n+1)}}-\mathbf{x^{(n)}} \rvert\rvert_{2}, \] where the Euclidean norm of a vector \(\mathbf{x} = (x_1, x_2, \dots, x_p)\) in \(\mathbb{R}^{p}\) is given by \(\sqrt{x_1^2 + x_2^2 + \dots + x_p^2}\).
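As noted above, we solve the linear system \(A\mathbf{x}^{(n+1)} = b\) directly rather than forming the inverse of the Hessian. A tiny illustration with R’s solve() (the matrix and vector here are made up for the example):

# solving A x = b directly is preferred to forming the inverse
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a made-up 2 x 2 "Hessian"
b <- c(1, 2)
solve(A, b)        # solves A x = b without computing the inverse of A
# solve(A) %*% b   # same answer, but forms the inverse explicitly (avoid)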

Example 2: Find the optimum value of \(g(x_1, x_2) = 10 - (x_1^2 + x_2 -11)^2 - (x_1 + x_2^2 - 7)^2\).

Note that \(g\) is a constant minus a sum of two squares. Squares are always non-negative. This means that an upper bound for \(g\) is \(10\). Now \(g\) can attain \(10\) if and only if both square terms are \(0\) at the same vector \(\mathbf{x} = (x_1, x_2)\). Trial and error shows that the two square terms are both \(0\) at \((3,2)\). Hence, without using optimization, we found a maximum!
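A quick check of this claim in R:

# both squared terms vanish at (3, 2), so g(3, 2) = 10
(3^2 + 2 - 11)^2   # first term:  0
(3 + 2^2 - 7)^2    # second term: 0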

First step: Plot!

Use your functions (contour2d, surface3d) from the Home Assignment in Lecture 1 to plot the contours and the surface.

contour2d(10 - (x^2 + y -11)^2 - (x + y^2 - 7)^2, -5, 5, -5, 5, 101)

surface3d(10 - (x^2 + y -11)^2 - (x + y^2 - 7)^2, -5, 5, -5, 5, 101, phi = 30, theta = 30 )

Now, we write our own function to find a maxima. Our function would need

  1. The function \(g\) whose maxima (or minima) we need to find.
  2. The Hessian and gradient functions.
  3. The distance function.
  4. Stopping criteria : \(\epsilon\) value and a max iteration number \(M\) so that we don’t exceed this number if our guesses diverge.
  5. The starting vector x0.

We will use R package functions for these (why reinvent the wheel?).

  1. Let us create our function \(g\). It should take a vector and return a real number.
g <- function(x){
  # x is a vector in R2
  10 - (x[1]^2 + x[2] -11)^2 - (x[1] + x[2]^2 - 7)^2
}
  2. We know that Hessian and gradient functions can be evaluated approximately using the numDeriv package. Let us check.
library(numDeriv)
# Let us check the gradient and Hessian at (3,2)
a <- c(3,2)
round(grad(g, a), 5)
## [1] 0 0
round(hessian(g, a), 5)
##      [,1] [,2]
## [1,]  -74  -20
## [2,]  -20  -34

Please check these by hand! (Note that we found by inspection that the maximum is at \((3,2)\); thus the gradient is \(0\) there!)

  3. For the distance function we will use the Euclidean norm (also called the 2-norm; note the subscript 2 when we introduced the Euclidean norm).
# Let us check the "two" norm
# We know that the "two" norm of (1,1) is sqrt(2). 
norm(c(1,1), type = "2")
## [1] 1.414214
  4. Let us fix \(\epsilon\) = 1e-5. Take \(M = 10,000\).
# our optimize function
# by default we will have eps = 1e-5 and norm type is Euclidean
# by default our max iteration number M will be 10000

my_optim_newtons <- function(g, norm.type = "2", eps = 1e-5, x0, max.iter = 1e4){
  library(numDeriv)
  # g is the function to optimize
  # x0 is our starting guess
  
  
  # set your iteration counter to 0
  iter.count <- 0
  # set your initial error to a large number
  error <- 100
  # set your current guess to x0
  x <- x0
  
  while(error > eps){

    # HW. incorporate this check 
    # if current iteration is above max.iter
    # if (iter.count >= max.iter) break  # (break takes no arguments in R)
    
    # update your guess
    A <- hessian(g,x)
    b <- A %*% x - grad(g, x)
    x_new <- solve(A, b)
  
    # get the error
    error <- norm((x_new - x), type = norm.type)
    
    # update your iteration counter
    iter.count <- iter.count + 1
    
    # set the current x to new 
    x <- x_new
    
    
  }
  # How do you modify the return statement if you run out of iterations (iter.max is reached)
  return(x)
  
}
# Let us try a starting value x0 = (2,3)
x <- my_optim_newtons(g, x0 = c(2,3))
print(x)
##      [,1]
## [1,]    3
## [2,]    2
g(x)
## [1] 10

This is indeed the maximum we discovered earlier.

# what if we try x0 = (-3,2)
x <- my_optim_newtons(g, x0 = c(-3,2))
print(x)
##           [,1]
## [1,] -2.805118
## [2,]  3.131313
g(x)
## [1] 10

We found a different maximum! The answer depends on our starting guess x0.