The gradient of a loss function \(L\) with respect to model parameters \(W\) (e.g., weights) points in the direction of steepest ascent. To minimize the loss, we move in the opposite direction of the gradient. Here’s the mathematical intuition.
Near a point \(W\), the loss \(L\) can be approximated using a first-order Taylor expansion:
\[ L(W + \Delta W) \approx L(W) + \nabla_W L \cdot \Delta W \]
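For instance, for a quadratic loss of the form used in the example below, the expansion is exact up to a second-order remainder:

\[ L(w + \Delta w) = (w - 2)^2 + 2(w - 2)\,\Delta w + (\Delta w)^2 \]

so the first-order term \(2(w - 2)\,\Delta w = \nabla_w L \cdot \Delta w\) captures the change in loss when \(\Delta w\) is small.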
Let’s illustrate this with a quadratic loss function \(L(w) = (w - 2)^2\) (convex, single parameter \(w\), minimum at \(w = 2\)).
# Load plotting library
library(ggplot2)

# Define the loss function and its gradient
L <- function(w) (w - 2)^2
grad_L <- function(w) 2 * (w - 2)

# Plot the loss function over a range of parameter values
w_values <- seq(-1, 5, by = 0.1)
df <- data.frame(w = w_values, Loss = L(w_values))

ggplot(df, aes(w, Loss)) +
  geom_line(color = "blue", linewidth = 1) +
  labs(title = "Loss Function L(w) = (w - 2)²",
       x = "Parameter (w)",
       y = "Loss") +
  theme_minimal()
At \(w = 0\):

- Gradient \(\nabla_w L = 2(0 - 2) = -4\). The negative sign means the loss increases as \(w\) decreases (steepest ascent is to the left) and decreases as \(w\) increases toward the minimum at \(w = 2\).
- To decrease the loss, we move opposite to the gradient: \(w \gets w - \alpha \nabla_w L\), which here moves \(w\) to the right, toward 2.
# Current point (w = 0)
w_current <- 0
alpha <- 0.1
# Update rule: w_new = w_current - α * gradient
w_new <- w_current - alpha * grad_L(w_current)
cat("Initial w:", w_current, "| Loss:", L(w_current), "\n")
## Initial w: 0 | Loss: 4
cat("After gradient step (α = 0.1):", w_new, "| Loss:", L(w_new))
## After gradient step (α = 0.1): 0.4 | Loss: 2.56
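Repeating this update drives \(w\) toward the minimizer. Here is a minimal sketch reusing `L`, `grad_L`, and the same learning rate; the iteration count of 25 is an arbitrary choice for illustration.

# Sketch: repeat the update w <- w - alpha * grad_L(w) for several steps
w <- 0
alpha <- 0.1
for (step in 1:25) {
  w <- w - alpha * grad_L(w)
}
cat("After 25 steps: w =", round(w, 4), "| Loss:", round(L(w), 6), "\n")

Each iteration shrinks the distance to the minimum by a constant factor (here \(1 - 2\alpha = 0.8\)), so \(w\) approaches 2 and the loss approaches 0.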
For a small step \(\alpha\), the change in loss is: \[ \Delta L = L(W + \Delta W) - L(W) \approx \nabla_W L \cdot \Delta W \]
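We can check this approximation numerically for the single step taken above; this sketch reuses `L`, `grad_L`, `w_current`, and `w_new` from the earlier chunk.

# Compare the actual change in loss with the first-order prediction
delta_w      <- w_new - w_current            # the step we took: -alpha * gradient
actual_dL    <- L(w_new) - L(w_current)      # true change in loss
predicted_dL <- grad_L(w_current) * delta_w  # first-order Taylor estimate
cat("Actual ΔL:   ", actual_dL, "\n")
cat("Predicted ΔL:", predicted_dL, "\n")

Because the first-order approximation ignores the curvature of \(L\), the prediction slightly overestimates the decrease; the two values agree in the limit as \(\alpha \to 0\).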