We’ll start by minimizing a simple function:
\[f(x) = (x - 2)^2\]
Since the whole term is squared, the minimum value of \(f(x)\) is 0, which occurs when \(x = 2\).
We can also graph it:
But what can we do if we can’t graph the function, or if the answer can’t be calculated directly?
We can use gradient descent!
The main idea behind finding the minimums and maximums of a function is that they are located where the derivative of the function equals 0. So we’ll be using the derivative of the function, \(f'(x)\), to help find them.
So how does gradient descent work? If we are trying to find the minimum of a function, we’ll start with an initial guess, then update the guess with multiple iterations using the process below:
\[x^{r+1} = x^r - \alpha f'(x^r) \]
Where \(x^r\) is the value of \(x\) on the \(r^{th}\) iteration of the loop, \(\alpha\) is a pre-chosen multiplier, and \(f'(x)\) is the first derivative of the function to be optimized.
To find a minimum, if the derivative of the function is negative, then we need to move our guess to the right. If the derivative is positive, we need to move the guess to the left. Hence, we subtract the derivative.
But how do we find the value of the derivative?
We can find the derivative of \((x - 2)^2\) using a little bit of calculus:
\[f(x) = (x-2)^2 \rightarrow f'(x) = 2(x - 2)\]
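For example, starting from an initial guess of \(x^1 = 5\) with \(\alpha = 0.1\) (an illustrative choice): \(f'(5) = 2(5 - 2) = 6\), so the next guess is \(x^2 = 5 - 0.1 \cdot 6 = 4.4\), one step closer to the minimum at \(x = 2\).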
But what do we do if we can’t calculate the derivative? We can use a numerical estimate:
\[f'(x) \approx \frac{f(x + \delta) - f(x - \delta)}{2\delta}\]
The smaller the choice of \(\delta\), the more accurate the estimate for the value of the derivative at \(x\).
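To make that concrete, here is a small R helper for the central-difference estimate (the function name and the default \(\delta\) are just illustrative choices):

```r
# Central-difference estimate of f'(x)
numeric_deriv <- function(f, x, delta = 1e-6) {
  (f(x + delta) - f(x - delta)) / (2 * delta)
}

f <- function(x) (x - 2)^2
numeric_deriv(f, 5)   # close to the exact value 2 * (5 - 2) = 6
```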
The algorithm ends when the absolute value of the derivative falls below a small predetermined tolerance chosen by the person writing the algorithm.
Let’s write the code for gradient descent below, starting with an initial value \(x = 5\)
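Something like the loop below would do the job; since the step size \(\alpha\) and the stopping tolerance here are assumed choices, the exact iteration count will depend on them.

```r
f       <- function(x) (x - 2)^2
f_prime <- function(x) 2 * (x - 2)   # analytic derivative (the numerical estimate would also work)

x     <- 5      # initial guess
alpha <- 0.05   # step-size multiplier (assumed)
tol   <- 1e-6   # stop once |f'(x)| is this small (assumed)
iters <- 0

while (abs(f_prime(x)) > tol) {
  x     <- x - alpha * f_prime(x)   # step against the derivative
  iters <- iters + 1
}

cat("optimal value of x:", round(x, 5), "\n")
cat("value of y:", round(f(x), 5), "\n")
cat("number of iterations:", iters, "\n")
```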
## value
## optimal value of x: 2
## value of y: 0
## number of iterations: 117
Looks like it found it in 117 iterations (not bad!)
Let’s look at how gradient descent works with the graph below and a starting value of \(x = -4\)
The previous section found the minimum value of \((x - 2)^2\), but gradient descent also works for functions we want to maximize. The only difference is that, to maximize, we add the derivative instead of subtracting it:
\[x^{r+1} = x^r + \alpha f'(x^r)\]
Let’s use a binomial probability mass function (pmf), dropping the constant \(\binom{n}{y}\) since it doesn’t change where the maximum is:
\[f(p) = p^y(1-p)^{n-y}\]
and find the value of \(p\) that will maximize \(f(p)\) for known values of \(y\) and \(n\).
With a little bit of calculus, we can show that the value of \(p\) that maximizes \(f(p)\) is
\[p = \frac{y}{n}\]
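Filling in that bit of calculus: take the derivative, set it equal to 0, and solve for \(p\):
\[f'(p) = y\,p^{y-1}(1-p)^{n-y} - (n-y)\,p^{y}(1-p)^{n-y-1} = 0\]
Dividing through by \(p^{y-1}(1-p)^{n-y-1}\) (for \(0 < p < 1\)) gives
\[y(1-p) - (n-y)\,p = 0 \quad\Rightarrow\quad y = np \quad\Rightarrow\quad p = \frac{y}{n}\]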
Let’s graph it again:
We can see when \(n = 10\) and \(y = 5\), the value of \(p\) that maximizes the function is \(0.5\), but let’s use gradient descent to find it!
## value
## optimal value of p: 0.50000
## value of y: 0.00098
## number of iterations: 483.00000
In practice, though, gradient descent is usually written only as a minimizer: if we want to maximize \(f(x)\), we instead minimize \(-f(x)\):
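As a rough sketch of that idea in R (the starting value, step size, and tolerance below are assumed choices):

```r
n <- 10
y <- 5
neg_f      <- function(p) -(p^y * (1 - p)^(n - y))                 # negate the function we want to maximize
neg_fprime <- function(p) -(y * p^(y - 1) * (1 - p)^(n - y) -
                              (n - y) * p^y * (1 - p)^(n - y - 1)) # derivative of the negated function

p     <- 0.2   # initial guess (assumed)
alpha <- 10    # large step size, since this function's derivative is tiny (assumed)

while (abs(neg_fprime(p)) > 1e-8) {
  p <- p - alpha * neg_fprime(p)
}
round(p, 5)
```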
## value
## optimal value of p: 0.50000
## value of y: 0.00098
## number of iterations: 483.00000
You can toy around with the values of \(n\) and \(y\) to see how the value of \(p\) changes!
The code chunk below will create a function to perform gradient descent
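A sketch of what such a function might look like in R is below; the name `gradient_descent()`, its arguments, and its default settings are illustrative assumptions rather than the original code. It uses the numerical derivative from earlier and returns the stopping point, the function value there, the iteration count, and a tibble recording every step.

```r
library(tibble)

# Sketch of a reusable gradient-descent function (names and defaults assumed)
gradient_descent <- function(f, x0, alpha = 0.01, tol = 1e-6,
                             max_iter = 10000, delta = 1e-6) {
  # central-difference estimate of f'(x), so no hand-derived derivative is needed
  dfdx <- function(x) (f(x + delta) - f(x - delta)) / (2 * delta)

  x  <- x0
  xs <- fs <- ds <- numeric(0)

  for (i in seq_len(max_iter)) {
    d  <- dfdx(x)
    xs <- c(xs, x); fs <- c(fs, f(x)); ds <- c(ds, d)
    if (abs(d) < tol) break       # stop once the slope is (nearly) flat
    x <- x - alpha * d            # step against the derivative
  }

  list(
    opt_x        = x,
    min_y        = f(x),
    iterations   = length(xs),
    step_results = tibble(x = xs, f_x = fs, dx = ds)
  )
}
```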
Gradient descent is often used in practice when the derivative of a function is either very tedious or impossible to derive by hand.
One such function is:
\[f(x) = |x| - \cos(x)\]
Let’s take a look at the graph of the function itself:
So we can see that the minimum value of \(|x| - \cos(x)\) occurs at \(x = 0\). But let’s use gradient descent to find it, starting with \(x = -5\)!
## $opt_x
## [1] 3.587686e-13
##
## $min_y
## [1] -1
##
## $iterations
## [1] 1044
It took quite a few iterations to find it, but 1000 iterations isn’t a ton of steps for a function that is easy to evaluate!
Sometimes functions will have multiple locations where the derivative is 0, called local optimums. Let’s look at a fourth order polynomial:
\[f(x) = (x + 3)(2x - 2)(x +5)(x - 4)\]
It will have two minimums: a local (fake) minimum, and a global minimum. But where?
Let’s start by graphing the function:
We can see there is a local minimum around -4 and a global minimum around +3.
Let’s see what gradient descent finds:
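Sketching that call with the `gradient_descent()` helper outlined above (the \(\alpha = 0.01\) is inferred from the step sizes in the output, so treat it as an assumption):

```r
f_poly <- function(x) (x + 3) * (2 * x - 2) * (x + 5) * (x - 4)
gradient_descent(f_poly, x0 = -5, alpha = 0.01)
```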
## $opt_x
## [1] -4.153029
##
## $min_y
## [1] -82.05789
##
## $iterations
## [1] 126
##
## $step_results
## # A tibble: 126 × 3
## x f_x dx
## <dbl> <dbl> <dbl>
## 1 -5 0 -216.
## 2 -2.84 18.5 114.
## 3 -3.98 -79.5 28.8
## 4 -4.27 -80.8 -21.9
## 5 -4.05 -81.1 17.9
## 6 -4.23 -81.6 -14.0
## 7 -4.09 -81.7 11.4
## 8 -4.20 -81.9 -9.01
## 9 -4.11 -81.9 7.31
## 10 -4.18 -82.0 -5.83
## # ℹ 116 more rows
Hmm, seems we found a local minimum, which makes sense since -5 is close to the local minimum. Let’s try a different guess of \(x = 5\).
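(Again using the sketched helper and the assumed \(\alpha = 0.01\).)

```r
fit_x5 <- gradient_descent(f_poly, x0 = 5, alpha = 0.01)
fit_x5[c("opt_x", "min_y", "iterations")]
```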
## $opt_x
## [1] -4.153029
##
## $min_y
## [1] -82.05789
##
## $iterations
## [1] 128
Wait, what happened? Why did we have the same results when \(x = 5\) despite the initial value being closer to the global minimum? Let’s look at the values of \(x\) used during the gradient descent algorithm:
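One way to peek at them, assuming the `fit_x5` object from the sketch above:

```r
head(fit_x5$step_results, 3)   # x, f(x), and the derivative for the first few steps
```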
We had a very large jump from \(5\) to \(-4.4\) during the first iteration. Why? Because the value of the derivative is large, about 945, which makes our first step move a long way from our initial guess. The second step has a derivative of only about \(-60\), which means the next guess stays much closer to the previous one.
So what do we do? We can choose a smaller value for \(\alpha\)! Let’s try \(\alpha = 0.001\)
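(Same sketched call as before, just with the smaller step size.)

```r
gradient_descent(f_poly, x0 = 5, alpha = 0.001)[c("opt_x", "min_y", "iterations")]
```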
## $opt_x
## [1] 2.820216
##
## $min_y
## [1] -195.4849
##
## $iterations
## [1] 123
The drawback of using a smaller \(\alpha\) is that it can greatly increase the number of iterations needed to find a minimum.
The table below compares two different \(\alpha\) values:
| alpha   | iterations |
|---------|------------|
| 0.00100 | 123        |
| 0.00001 | 13649      |
Is there a way to guarantee that we find a global minimum instead of a local minimum?
The bad news is, there isn’t :(
But what we can do is run the algorithm multiple times with different starting values, then keep the value of x that has the smallest minimum. While this method isn’t guaranteed to work, it is more likely to find the global extreme!
Let’s run the gradient descent algorithm 20 times and see what we get!
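Here is a sketch of one way to do that, reusing the `gradient_descent()` helper from earlier; the seed, the range of random starting values, and the purrr/dplyr approach are all assumptions:

```r
library(purrr)
library(dplyr)

set.seed(42)                              # assumed seed
starts <- runif(20, min = -10, max = 10)  # 20 random starting values (assumed range)

results <- map_dfr(starts, function(x0) {
  fit <- gradient_descent(f_poly, x0 = x0, alpha = 0.001)
  tibble(x = fit$opt_x, y = fit$min_y)
})

results |> slice_min(y, n = 1)   # keep the run that found the smallest minimum
```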
## # A tibble: 1 × 2
## x y
## <dbl> <dbl>
## 1 2.82 -195.
By running the gradient descent algorithm from 20 different starting values, we are able to find the global minimum!
A common question asked when working with data is why we square values (deviations, residuals, etc…) instead of just taking the absolute value. See the standard deviation formula and least squares regression for examples.
While there are theoretical statistical reasons to square them, let’s look at a more practical example: finding the minimum of
\[f(x) = |x|\]
Again, let’s start by graphing it:
So what’s the problem?
The good news is that the sign of the derivative of \(|x|\) still tells us which direction to shift our guess: if it’s negative, we move to the right; if it’s positive, we move to the left.
The major downside is that we lose any sense of how far to shift our guess, since the derivative always has the same magnitude; only its sign changes:
\[f'(x) = \left\{\begin{array}{rcl} -1 & \mbox{if} & x < 0 \\ +1 & \mbox{if} & x > 0 \end{array} \right.\]
Using gradient descent to find the minimum of \(|x|\) will get caught in a loop once we cross over 0, repeating the same two guesses until the iteration limit is reached!
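(A sketch of the call; the starting value and the \(\alpha = 0.75\) are inferred from the steps shown in the output, so treat them as assumptions.)

```r
gradient_descent(abs, x0 = -2, alpha = 0.75)
```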
## $opt_x
## [1] 0.25
##
## $min_y
## [1] 0.25
##
## $iterations
## [1] 10000
##
## $step_results
## # A tibble: 10,000 × 3
## x f_x dx
## <dbl> <dbl> <dbl>
## 1 -2 2 -1.00
## 2 -1.25 1.25 -1.00
## 3 -0.500 0.500 -1
## 4 0.250 0.250 1
## 5 -0.500 0.500 -1
## 6 0.250 0.250 1
## 7 -0.500 0.500 -1
## 8 0.250 0.250 1
## 9 -0.500 0.500 -1
## 10 0.250 0.250 1
## # ℹ 9,990 more rows
The one, rare exception is if the difference between our initial guess and the true minimum is an exact multiple of \(\alpha\): \[\frac{x^1 - \mbox{argmin}_{x} |x|}{\alpha} \in \mathbb{Z}\]
For example, if we have \(\alpha = 0.05\) and \(x = -2\), then it should find the minimum in roughly
\[k = \left|\frac{-2 - 0}{0.05}\right| + 1 = 41\]
steps.
## $opt_x
## [1] -4.052314e-15
##
## $min_y
## [1] 4.052314e-15
##
## $iterations
## [1] 42
##
## $step_results
## # A tibble: 42 × 3
## x f_x dx
## <dbl> <dbl> <dbl>
## 1 -2 2 -1.00
## 2 -1.95 1.95 -1.00
## 3 -1.9 1.9 -1.00
## 4 -1.85 1.85 -1.00
## 5 -1.8 1.8 -1.00
## 6 -1.75 1.75 -1.00
## 7 -1.7 1.7 -1.00
## 8 -1.65 1.65 -1.00
## 9 -1.6 1.6 -1.00
## 10 -1.55 1.55 -1.00
## # ℹ 32 more rows