Why the Gradient Points Toward Increasing Loss

The gradient of a loss function \(L\) with respect to model parameters \(W\) (e.g., weights) points in the direction of steepest ascent. To minimize the loss, we move in the opposite direction of the gradient. Here’s the mathematical intuition.


1. Gradient Direction and Taylor Expansion

Near a point \(W\), the loss \(L\) can be approximated using a first-order Taylor expansion:

\[ L(W + \Delta W) \approx L(W) + \nabla_W L \cdot \Delta W \]

  • \(\nabla_W L = \left( \frac{\partial L}{\partial W_1}, \frac{\partial L}{\partial W_2}, \dots \right)^T\) is the gradient.
  • \(\Delta W\) is a small step in parameter space.

Key Insight:

  • If \(\Delta W = \alpha \nabla_W L\) (move along the gradient): \[ L(W + \alpha \nabla_W L) \approx L(W) + \alpha \|\nabla_W L\|^2 \] Since \(\alpha > 0\) and \(\|\nabla_W L\|^2 > 0\) whenever the gradient is nonzero, the loss increases.
  • If \(\Delta W = -\alpha \nabla_W L\) (move against the gradient): \[ L(W - \alpha \nabla_W L) \approx L(W) - \alpha \|\nabla_W L\|^2 \] The loss decreases.
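We can check this insight numerically on a toy two-parameter quadratic loss (a minimal sketch; the loss \(\sum_i W_i^2\), the point `c(1, -2)`, and the step size 0.01 are illustrative choices, not from the derivation above):

# Toy loss L(W) = Σ W_i² with gradient 2W (illustrative example)
L_toy    <- function(W) sum(W^2)
grad_toy <- function(W) 2 * W

W <- c(1, -2)   # current parameters
a <- 0.01       # small step size

cat("L(W):", L_toy(W), "\n")
## L(W): 5
cat("Along gradient:", L_toy(W + a * grad_toy(W)), "\n")
## Along gradient: 5.202
cat("Against gradient:", L_toy(W - a * grad_toy(W)), "\n")
## Against gradient: 4.802

Stepping along the gradient raises the loss from 5 to 5.202; stepping against it lowers the loss to 4.802, exactly as the first-order expansion predicts.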

2. Visualizing Gradient Descent in R

Let’s simulate a quadratic loss function \(L(w) = (w - 2)^2\) (convex, single parameter \(w\)).

library(ggplot2)  # required for the plot below

# Define loss function and its gradient
L <- function(w) (w - 2)^2
grad_L <- function(w) 2 * (w - 2)

# Plot the loss function
w_values <- seq(-1, 5, by = 0.1)
df <- data.frame(w = w_values, Loss = L(w_values))

ggplot(df, aes(w, Loss)) +
  geom_line(color = "blue", linewidth = 1) +
  labs(title = "Loss Function L(w) = (w - 2)²",
       x = "Parameter (w)",
       y = "Loss") +
  theme_minimal()
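To connect the plot to the discussion that follows, we can mark the starting point \(w = 0\) and the direction of descent on the same curve (a minimal sketch; the arrow endpoint 0.8 is chosen only for visibility):

# Mark w = 0 and the descent direction on the loss curve (sketch)
ggplot(df, aes(w, Loss)) +
  geom_line(color = "blue", linewidth = 1) +
  annotate("point", x = 0, y = L(0), color = "red", size = 3) +
  annotate("segment", x = 0, y = L(0), xend = 0.8, yend = L(0),
           arrow = grid::arrow(length = grid::unit(0.2, "cm")),
           color = "red") +
  labs(title = "Descent direction at w = 0") +
  theme_minimal()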

Gradient Direction:

At \(w = 0\):

  • Gradient: \(\nabla_w L = 2(0 - 2) = -4\). The negative value means the gradient points left (toward negative \(w\)), and that is indeed the direction of steepest ascent: the loss grows as \(w\) moves further from the minimum at \(w = 2\).
  • To decrease the loss, we move opposite to the gradient (to the right): \(w \gets w - \alpha \nabla_w L\).

# Current point (w = 0)
w_current <- 0
alpha <- 0.1

# Update rule: w_new = w_current - α * gradient
w_new <- w_current - alpha * grad_L(w_current)

cat("Initial w:", w_current, "| Loss:", L(w_current), "\n")
## Initial w: 0 | Loss: 4
cat("After gradient step (α = 0.1):", w_new, "| Loss:", L(w_new))
## After gradient step (α = 0.1): 0.4 | Loss: 2.56

  • The loss decreased from 4 to 2.56 by moving against the gradient.
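Iterating this update drives \(w\) toward the minimizer at \(w = 2\). A short loop continuing from the definitions above (a sketch; the 20-iteration count is arbitrary):

# Run repeated gradient steps from w = 0 (20 iterations, chosen arbitrarily)
w <- 0
for (i in 1:20) {
  w <- w - alpha * grad_L(w)   # move against the gradient each step
}
cat("After 20 steps: w =", round(w, 4), "| Loss:", round(L(w), 6), "\n")
## After 20 steps: w = 1.9769 | Loss: 0.000532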

3. Formal Justification

For a small step \(\alpha\), the change in loss is: \[ \Delta L = L(W + \Delta W) - L(W) \approx \nabla_W L \cdot \Delta W \]

  • If \(\Delta W = -\alpha \nabla_W L\): \[ \Delta L \approx -\alpha \|\nabla_W L\|^2 \leq 0 \] For sufficiently small \(\alpha\), the loss decreases whenever the gradient is nonzero. Note that this first-order argument requires no convexity; convexity matters for reaching a global minimum, not for the descent direction itself.
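This prediction can be checked numerically on the quadratic example, reusing `L`, `grad_L`, and `alpha` from Section 2 (a sketch):

# Compare predicted vs. actual change in loss for one step from w = 0
w0 <- 0
predicted <- -alpha * grad_L(w0)^2               # first-order prediction: -α‖∇L‖²
actual    <- L(w0 - alpha * grad_L(w0)) - L(w0)  # exact change after the step

cat("Predicted ΔL:", predicted, "| Actual ΔL:", actual, "\n")
## Predicted ΔL: -1.6 | Actual ΔL: -1.44

The 0.16 gap between prediction and reality is the second-order term that the linear approximation drops; it shrinks quadratically as \(\alpha \to 0\).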

Summary

  1. The gradient \(\nabla_W L\) points toward steepest ascent (increasing loss).
  2. To minimize loss, update parameters in the opposite direction:
    \[ W \gets W - \alpha \nabla_W L \]
  3. The learning rate \(\alpha\) controls the step size (illustrated in the sketch below).
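To make point 3 concrete, here is a single update from \(w = 0\) at several learning rates (a sketch; the specific rates, including the deliberately too-large 1.1, are illustrative):

# One gradient step from w = 0 at several learning rates (illustrative values)
for (a in c(0.01, 0.1, 0.5, 1.1)) {
  w_step <- 0 - a * grad_L(0)
  cat("alpha =", a, "-> w =", w_step, "| Loss:", L(w_step), "\n")
}
## alpha = 0.01 -> w = 0.04 | Loss: 3.8416
## alpha = 0.1 -> w = 0.4 | Loss: 2.56
## alpha = 0.5 -> w = 2 | Loss: 0
## alpha = 1.1 -> w = 4.4 | Loss: 5.76

With \(\alpha = 0.5\) a single step lands exactly on the minimum (a coincidence of this particular quadratic), while \(\alpha = 1.1\) overshoots so far that the loss rises above its starting value of 4.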