2026-04-12

Running/Walking Up/Down Hills Into/Out of Valleys: exploring curves with Gradient Descent

As we learned in DAT 300, Gradient Descent (GD) and ascent algorithms allow us to find local extrema of functions.

They work by following the directional gradient vector at a particular point on the graph of a function, in discrete steps of size alpha (also known as the learning rate), until they reach a user-defined threshold gradient range centered around zero. In predictive models, cost functions calculate error, and finding their local minima and maxima allows us to optimize our models.

GD is of particular import to machine learning, as it is used to optimize a function “to minimize the sum of squares of the errors” (Cormen 1035). In other words, it aids in linear regression.

Unconstrained Gradient Descent

GD <- function(f, x_0, gamma, T) {
# Where f is the cost function, x_0 is the starting point, 
# gamma is the learning rate, and T is the number of steps.
  sum = 0
  for(t in 0:(T-1)) {
    sum += x_t # x_t = the value of x at index t
     # grad return the gradient of f at x_t
    x_t_plus_1 = x_t - gamma * grad(f, x_t) 
  return(sum / T)
}

Source: Cormen, Thomas H., et al. Introduction to Algorithms. 4th ed., MIT Press, 2022, p. 1025. Converted from pseudocode to R code by me.

Literal Pitfalls (Graphically Speaking)

GD works best with smooth and fully-continuous functions, at least along the path the algorithm is hopping. Lack of continuity of a function at a point means it is not differentiable at that point, and therefore cannot be navigated using a gradient-based descent.

If the purpose of the algorithm is to find the global minimum of a function, the function must be convex (a necessary condition of a non-negative second derivative) where there is only a single minima (unless we employ stochastic gradient descent (SGD), which is discussed later in this presentation.)

Fully-Continuous but Not Convex


The graph of \(f(x) = x^3 - 5\) is fully-continuous (i.e., its left and right limits exist and agree at all points). \(\textbf{However it is not fully-convex}\) over its entire domain.

A Not-So-Fully-Coninuous Function


The graph of \(f(x) = \frac{1}{x}\) is clearly neither fully continuous nor convex. Imagine what would happen as a GD algorithm approaches zero from either side.

Your hops may be too big!

Above: the graph of the function \(f(x) = x^2\). The red line segments represent the path of the gradient descent from point (3,9), directed by the gradient value at x. Observe how the red line “jumps over” and completely misses the minimum on the first step, while there is a lot more of the graph detail captured by the green line. The learning rate of the green line is 33% that of the red line.

Batch Gradient Descent

Batch GD is a favored approach, as it computes the gradient using the entire data set after every step, and results in accurate estimates and works well for smooth surfaces; however, because it updates the entire data set every iteration, large data sets are not conducive to efficient convergence over a local minima, requiring more memory than other methods.

(Source: GeeksforGeeks. “Difference between Batch Gradient Descent and Stochastic Gradient Descent.” GeeksforGeeks, 30 Sep. 2025, www.geeksforgeeks.org/machine-learning/difference-between-batch-gradient-descent-and-stochastic-gradient-descent/. Accessed 12 Apr. 2026.)

 

Use Case: For smaller data sets it is ideal, as it leads to accurate results.

Stochastic Gradient Descent (SGD)

SGD processes a only subset of the data set on every iteration, usually comprising of a lone training example. For this reason it is much less computationally expensive than the batch method. It can also escape local minima in order to discover a global mimimum, though it can lead to noisy estimates. It can also be slow to converge.

(Source: GeeksforGeeks. “Difference between Batch Gradient Descent and Stochastic Gradient Descent.” GeeksforGeeks, 30 Sep. 2025, www.geeksforgeeks.org/machine-learning/difference-between-batch-gradient-descent-and-stochastic-gradient-descent/. Accessed 12 Apr. 2026.)

 

Use Case: For large data sets, this can work a lot faster, though result in less accurate estimates.