Introduction

Numerical analysis is an evolving branch of mathematics. As computers evolve, so too do the algorithms that they use to solve problems. Of course, this branch of mathematics is far too deep to properly cover in a few paragraphs and subsections. So, we will restrict ourselves to only a few select topics.

Newton’s Method

Newton’s method allows us to solve (approximately) an equation of the form \(f(x) = 0\). Suppose we have a curve \(y = f(x)\). We take a first guess at the solution to \(f(x) = 0\) with some value \(x_1\). We generate a new guess \(x_2\) by simply finding the derivative (gradient) of \(f\) at \(x_1\) and seeing where the linear approximation of \(f\) at \(x_1\) intersects the \(x\)-axis. This will be our next guess, \(x_2\). Analytically, we can see that \[x_2 = x_1 - \dfrac{f(x_1)}{f'(x_1)},\] as \(y - y_1 = m(x-x_1)\) with \(m = f'(x_1)\), \(y = 0\), \(y_1 = f(x_1)\) and solving for \(x = x_2\), our next guess.

If we iterate through this process, we can see that \[x_n = x_{n-1} - \dfrac{f(x_{n-1})}{f'(x_{n-1})}\] and we can calculate through this algorithm until the value \(y(x_i)\) gets as close to 0 as we reasonable need.

Note that we can extend this idea to higher dimensions to solve \(f(x_1,x_2, \ldots x_n)\) by looking at the gradient at a particular initial guess \(\mathbf{x}_0\) and moving along the gradient of \(f\) to where the linear approximation of \(f\) is 0. Computer algebra systems do exactly this when users want a solution to an equation but do not need it solved analytically.

Now, there are a number of drawbacks to this method. This algorithm is strongly dependent on the initial guess. The initial guess should be fairly close to the actual solution nor should it be near a critical number. However, as a positive, Newton’s method has quadratic convergence which means that the number of accurate decimal places doubles (at least) after each iteration.

An Example

Consider \(f(x) = x^2 - 2\). Clearly the equation \(f(x) = 0\) has roots of \(\pm\sqrt{2}\). Consider an initial guess of \(x_1 = 1\). Then, we set up our equation for \(x_n\) as \[x_n = x_{n-1} - \dfrac{f(x_{n-1})}{f'(x_{n-1})} = x_{n-1} - \dfrac{x_{n-1}^2 - 2}{2x_{n-1}}.\] The solutions are

\(n\)	\(x_{n}\)	\(f(x_n)\)
1	1	-1
2	1.5	0.25
3	1.416667	0.006944
4	1.414216	0.000006
5	1.414214	0.000000

Gradient Descent

We generalize Newton’s method to multi-dimensional functions in order to find the maxima and minima.

Consider the function \(f(x,y) = x^2y\). Define the vector of gradients of \(f\) called Grad \(f\) \[\nabla f = \begin{pmatrix} f_x\\f_y \end{pmatrix}.\] This vector points in the direction of largest increase. Thus, if we are at some arbitrary point on the curve, \(s_{n}\), and we want to move to a point that is higher, we simply move a little in the direction of the gradient vector. For example, \(s_{n+1} = s_n + \alpha\nabla f(s_n)\), where \(\alpha\) is some little step (called the learning rate). Of course, if we are trying to find a minimum, we subtract rather than add.

This is how most optimization algorithms work in real life. We will not get into specifics; however, it is good to see the ideas surrounding these sorts of methods in the general case.

If we consider a function such as \[y = 1.2(x-2)^2 + 3.2\], with a minimum of 2. We can also find \(\nabla y = 2.4(x-2)\). Then take \(\alpha = 0.1\) and an initial guess of \(x = 0.1\), we can see from the graph below the path that we take to get to the minimum point simply by using the formula \[x_{new} = x_{old} - \alpha\left(2.4(x_{old}-2)\right).\]

The implementation of this is fairly simple.

#  Define the function and derivative
f <-  function(x) {
  1.2 * (x-2)^2 + 3.2
}

dydx <- function(x){
  1.2 * 2 * (x-2)
}

# Gradient descent implementation
 
grad <- function(x = 0.1, alpha = 0.6, j = 1000) {
  
  xtrace <- x
  ftrace <- f(x)
  
  for (i in 1:j) {
    
    x <- x - alpha * dydx(x)
    
    xtrace <- c(xtrace,x)
    ftrace <- c(ftrace,f(x))
    
    }
  
  data.frame(
    "x" = xtrace,
    "f_x" = ftrace
    )
}

#  Implementation

grad(x = 0.1, alpha = 0.1, j = 30)

##           x      f_x
## 1  0.100000 7.532000
## 2  0.556000 5.702163
## 3  0.902560 4.645249
## 4  1.165946 4.034776
## 5  1.366119 3.682167
## 6  1.518250 3.478499
## 7  1.633870 3.360861
## 8  1.721741 3.292913
## 9  1.788523 3.253667
## 10 1.839278 3.230998
## 11 1.877851 3.217904
## 12 1.907167 3.210342
## 13 1.929447 3.205973
## 14 1.946380 3.203450
## 15 1.959248 3.201993
## 16 1.969029 3.201151
## 17 1.976462 3.200665
## 18 1.982111 3.200384
## 19 1.986404 3.200222
## 20 1.989667 3.200128
## 21 1.992147 3.200074
## 22 1.994032 3.200043
## 23 1.995464 3.200025
## 24 1.996553 3.200014
## 25 1.997380 3.200008
## 26 1.998009 3.200005
## 27 1.998487 3.200003
## 28 1.998850 3.200002
## 29 1.999126 3.200001
## 30 1.999336 3.200001
## 31 1.999495 3.200000

Lagrange Multipliers

If we want to find the minima or maxima of a function subject to a given constraint, we can use the method of Lagrange multipliers. We again look at \(f(x,y) = x^2y\). We want to find the maximum and minimum values along the circle \(g(x,y) = x^2 + y^2 = a^2\). We notice that if we find the gradient of \(f\) and the gradient of the curve \(g\) at a point where the gradient is perpendicular both to \(f\) and the curve, then the point in question is a maximum or minimum value of \(f\) along the curve \(g\).

Now, if we solve \(\nabla f = \lambda \nabla g\), then we will be able to solve for the point of maximum and minimum. So, \(\nabla f = \begin{pmatrix} 2xy\\x^2 \end{pmatrix} = \lambda \nabla g = \lambda \begin{pmatrix} 2x \\ 2y \end{pmatrix}\). This gives a system of equations \[\begin{aligned} 2xy &= \lambda 2x && \Rightarrow y = \lambda\\\\ x^2 &= \lambda 2y = 2y^2 && \Rightarrow x = \pm \sqrt{2}y\\\\ x^2 + y^2 =3y^2 &= a^2 && \Rightarrow y = \pm a /\sqrt{3}. \end{aligned}\] Our solutions then become \[\dfrac{a}{\sqrt{3}} \begin{pmatrix} \sqrt{2}\\1 \end{pmatrix}, \dfrac{a}{\sqrt{3}}\begin{pmatrix} \sqrt{2}\\-1 \end{pmatrix}, \dfrac{a}{\sqrt{3}}\begin{pmatrix} -\sqrt{2}\\1 \end{pmatrix}, \dfrac{a}{\sqrt{3}}\begin{pmatrix} -\sqrt{2}\\-1 \end{pmatrix}, \] which gives function values of \[f(x,y) = \dfrac{2a^3}{3\sqrt{3}}, \dfrac{-2a^3}{3\sqrt{3}}, \dfrac{2a^3}{3\sqrt{3}}, \dfrac{-2a^3}{3\sqrt{3}}.\] Thus, we have 2 maximum and 2 minimum values and the corresponding \(x\) and \(y\) values.

Series Approximations

A power series is a polynomial of the form \[g(x) = \sum_{i=0}^{\infty} a_ix^i.\] Often, if a power series approximates a function \(g\), only a few terms of the power series are required to get a value of \(g\) at a given point. If that is the case, we can discuss \(g_n\) which is an approximation of \(g\) by a polynomial of degree \(n\). Hence \[g_n(x) = \sum_{i=0}^n a_ix^i.\] The graphic below shows the function \(f(x) = e^{x^2}\) approximated by a number of polynomials of different degrees. Note that they are all very similar near \(x = 1\). Thus, for numbers close to 1, we may chose to use a polynomial approximation for \(f\) rather than evaluating \(e^{x^2}\).

To re-express functions, we can sometimes use a polynomial to approximate it. For example, \[e^x = \sum_{n=0}^{\infty} \dfrac{x^n}{n!}.\] Polynomial approximations like this are only available for well-behaved functions. In general, we can find the Taylor series approximation to a function \(f\) at a value \(a\), using the formula \[f(x) = \sum_{n=0}^{\infty} \dfrac{f^{(n)}(a)}{n!}(x-a)^n,\] where \(f^{(n)}(a)\) is the value of the \(n^{th}\) derivative of \(f\) at \(a\). A Taylor series at \(a = 0\) is given a special name called a MacLaurin series.

Some interesting examples of Taylor series include \[ \begin{aligned} \dfrac{1}{1-x} &= \sum_{x=0}^{\infty} x^n && x \in (-1,1)\\\\ e^x &= \sum_{x=0}^{\infty} \dfrac{x^n}{n!}&& x \in \mathbf{R}\\\\ \cos(x) &= \sum_{x=0}^{\infty} \dfrac{(-1)^n}{(2n)!}x^{2n}&& x \in \mathbf{R}\\\\ \sin(x) &= \sum_{x=0}^{\infty} \dfrac{(-1)^{n}}{(2n+1)!}x^{2n+1}&& x \in \mathbf{R}\\\\ \ln(1+x) &= \sum_{x=1}^{\infty} \dfrac{(-1)^{n+1}}{n}x^n&& x \in (-1,1]\\\\ \arctan(x) &= \sum_{x=0}^{\infty} \dfrac{(-1)^n}{(2n+1)!}x^{2n+1}&& x \in [-1,1]. \end{aligned}\] Each of these can be verified using the Taylor series formula above.

There is a discussion to be had on the region over which the approximation converges to the actual function. These regions are noted above and can be verified using the ratio test from calculus. We will not get into details about convergence here.

Linearization

We will discuss the error we get using an approximation in this section. An approximation of \(f\) near \(a\) (given by the first order Taylor polynomial) is given by \[f(a + \Delta x) \approx f(a) + f'(a)\Delta x,\] which says that if we move \(\Delta x\) units away from \(a\), we move away from \(f(a)\) by approximately \(f'(a)\Delta x\).

The error between \(f(x)\) and our approximation increases as we move away from the point at which the functions touch (here, \(x = 1\)). This is because \[f(x + \Delta x) = \sum_{n=0}^{\infty} \dfrac{f^{(n)}(x)}{n!}\Delta x^n.\] Therefore, \[\text{Error} = f(x + \Delta x) - f(x) - f'(x)\Delta x = \sum_{n=2}^{\infty} \dfrac{f^{(n)}(x)}{n!}\Delta x^n = O(\Delta x^2).\] This is a fancy way to say that if we vary \(x\) by a small amount (ie, \(\Delta x\) is small), then the error is even smaller (proportional to \(\Delta x^2\)). Thus, we can approximate, even particularly nasty functions, by straight lines.

Multivariate Taylor Series

We can expand our understanding of Taylor series to multiple dimensions, we can do so by using the formula \[\begin{aligned}f(x + \Delta x,y + \Delta y) &= f(x,y) + f_x(x,y)\Delta x + f_y(x,y)\Delta y + (1/2)(f_{xx}(x,y)\Delta x^2 + 2f_{xy}\Delta x \Delta y + f_{yy} f(x,y) \Delta y^2) + \ldots\\\\ &= f(x,y) + J_f \mathbf{\Delta x} + \dfrac{1}{2} \mathbf{\Delta x}^T \mathbf{H}_f \mathbf{\Delta x} + \ldots, \end{aligned}\] where \(J_f\) represents the Jacobian vector and the \(H_f\) represents the Hessian matrix. Of course, this generalizes to higher dimensions quite easily in the obvious way.

Citations

Boas, Mary L. Mathematical Methods in the Physical Sciences. 2nd ed. New York: Wiley, 1983.

Stewart, James. Single Variable Calculus. 7th ed., Student ed. Australia; Belmont, CA: Brooks/Cole Cengage Learning, 2012.

Approximation Algorithms

OC Data Science