
1 Preliminaries

This document builds upon the foundational concepts introduced in previous sections on neural network architectures (Llinás, 2026) and activation functions (Llinás, 2026), extending the analysis toward the mechanisms that enable learning.

Conceptual roadmap.

This document follows a structured progression that connects the fundamental components of learning in neural networks:

\[ \text{NN} \;\Rightarrow\; \text{Loss} \;\Rightarrow\; \text{Gradient} \;\Rightarrow\; \text{LR} \;\Rightarrow\; \text{Opt. dynamics} \;\Rightarrow\; \text{Backpropagation} \;\Rightarrow\; \text{Curvature} \;\Rightarrow\; \text{Newton} \;\Rightarrow\; \text{AM} \]

  • Neural networks (NN): artificial neurons and network architectures define the computational structure.

  • Loss: specifies the objective function to be minimized.

  • Gradient: provides the direction of parameter updates.

  • Learning rate (LR): controls the magnitude and stability of these updates.

  • Optimization dynamics: explains behaviors such as convergence, oscillation, and divergence.

  • Backpropagation: enables efficient gradient computation in multilayer networks.

  • Curvature: reveals how the loss surface bends across directions.

  • Newton’s method: incorporates second-order information via the Hessian.

  • Adaptive methods (AM): approximate curvature effects efficiently in large-scale settings.

This progression reflects a transition from basic modeling concepts to increasingly refined optimization strategies, providing a unified geometric perspective on how neural networks learn.

2 Neural networks

2.0.1 Biological neuron

Artificial neural networks (ANNs) constitute one of the most influential paradigms in modern machine learning and artificial intelligence. Their conceptual origin is loosely inspired by the basic functional principles of biological neurons, which process and transmit information through interconnected networks (Kandel et al., 2013).

In the nervous system, neurons receive electrical or chemical signals through dendrites. These signals are integrated within the cell body (the soma), and if the accumulated stimulus surpasses a certain threshold, the neuron generates an electrical impulse known as an action potential. This signal then travels along the axon and is transmitted to other neurons through synaptic connections.

As illustrated in Figure 2.1, the biological neuron is composed of several key components, including dendrites, the soma, the axon, and the synaptic terminals, each playing a specific role in signal transmission.


Figure 2.1: Human neuron. Source: Created by the author with ChatGPT (OpenAI).

Although artificial neurons are only simplified mathematical abstractions of this biological mechanism, the analogy provides an intuitive conceptual foundation. In essence, both systems combine multiple inputs, evaluate their relative influence, and produce an output response according to a transformation rule.

This biological structure motivates the simplified mathematical abstraction introduced in the next section.

2.0.2 Mathematical model of an artificial neuron

Modern artificial neurons translate the biological intuition of neural signal processing into a mathematical framework suitable for computation and learning. Instead of discrete electrical spikes, artificial neurons compute a weighted linear combination of inputs and then transform this signal through a nonlinear activation function.

Formally, the internal signal of a neuron is defined as

\[ z = \mathbf{w}^\top \mathbf{x} + b, \]

where \(\mathbf{x} \in \mathbb{R}^d\) represents the input vector, \(\mathbf{w} \in \mathbb{R}^d\) denotes the vector of weights, and \(b\) is a bias parameter. The neuron output is then obtained by applying an activation function

\[ h = \varphi(z). \]

The quantity \(z\) can be interpreted as a continuous analogue of synaptic integration, while the function \(\varphi(\cdot)\) generalizes the biological threshold mechanism into a smooth and differentiable transformation.
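As a minimal computational sketch of these two equations (with a sigmoid chosen as the activation and arbitrary example values for \(\mathbf{x}\), \(\mathbf{w}\), and \(b\)), a single artificial neuron can be evaluated as follows:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Example values (chosen arbitrarily for illustration)
x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.4, 0.1, -0.6])   # weight vector
b = 0.2                          # bias

z = w @ x + b      # internal signal: z = w^T x + b
h = sigmoid(z)     # neuron output:   h = phi(z)

print(z, h)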

This transition from discrete logical models to continuous optimization was essential for the development of modern neural networks. By allowing gradients to be computed and propagated through the model, neural networks can be trained efficiently using gradient-based learning algorithms.

As a result, ANNs are capable of approximating complex nonlinear mappings and learning expressive internal representations from data. Through iterative adjustments of weights and biases, the model progressively refines its representation of the input space. Consequently, neural networks have become fundamental tools for tasks such as classification, regression, and representation learning in high-dimensional environments (Goodfellow et al., 2016).

As illustrated in Figure 2.2, a neural network is constructed by stacking multiple artificial neurons into layers. The output of each neuron becomes the input of neurons in the next layer, allowing the model to progressively learn more complex representations of the data.


Figure 2.2: Basic architecture of a feedforward neural network. Source: Created by the author with ChatGPT (OpenAI).

Each neuron in the hidden layer computes a transformation of the form \(h = \varphi(\mathbf{w}^\top \mathbf{x} + b)\), illustrating how nonlinear activation functions operate within layered architectures.

Each neuron in the network performs the same basic operation described above, namely computing a weighted sum followed by a nonlinear activation. By composing many such units, the network is able to model highly complex functions.

3 Smooth functions and the class \(C^{\infty}\)

3.0.1 Definition (smooth functions and class \(C^{\infty}\))

In many areas of mathematics and machine learning, it is important to work with functions that are not only continuous, but also sufficiently smooth. This smoothness ensures that derivatives exist and behave well, which is essential for optimization algorithms such as gradient descent.

A function \(f: \mathbb{R} \to \mathbb{R}\) is said to belong to the class \(C^{\infty}\) if it is infinitely differentiable; that is, all derivatives of any order exist and are continuous. Formally,

\[ f \in C^{\infty} \quad \Longleftrightarrow \quad f^{(k)} \text{ exists and is continuous for all } k \in \mathbb{N}_0. \]

Functions in \(C^{\infty}\) are often referred to as smooth functions, meaning that they can be differentiated infinitely many times without any discontinuities or irregularities in their derivatives.

3.0.2 Examples (smooth functions)

Example: exponential function (smooth case).

A function that belongs to \(C^{\infty}\) is

\[ f(x) = e^x\quad \Longrightarrow \quad f'(x) = e^x \]

This function is infinitely differentiable, and all its derivatives are equal to \(e^x\). In particular, every derivative is continuous over \(\mathbb{R}\).

library(ggplot2)

# Data
x <- seq(-3, 3, length.out = 500)

df <- data.frame(
  x = x,
  f = exp(x),
  df = exp(x)
)

# Plot
ggplot(df, aes(x = x)) +
  
  # Function
  geom_line(
    aes(y = f, color = "func"),
    linewidth = 1.2
  ) +
  
  # Derivative
  geom_line(
    aes(y = df, color = "deriv"),
    linewidth = 1.2,
    linetype = "dashed"
  ) +
  
  # Axes
  geom_hline(yintercept = 0, color = "gray40", linewidth = 0.4) +
  geom_vline(xintercept = 0, color = "gray40", linewidth = 0.4) +
  
  # Labels
  scale_color_manual(
    values = c("func" = "blue", "deriv" = "red"),
    labels = c(
      "func"  = expression(f(x) == e^x),
      "deriv" = expression("Derivative " * f*minute(x) == e^x)
    )
  ) +
  
  labs(
    x = expression(x),
    y = expression(f(x)),
    color = NULL
  ) +
  
  coord_cartesian(xlim = c(-3, 3), ylim = c(0, exp(3))) +
  
  theme_minimal(base_size = 14) +
  theme(
    axis.title = element_text(size = 16),
    legend.position = c(0.2, 0.8),
    legend.background = element_rect(fill = "white", color = "gray75"),
    panel.grid.minor = element_line(color = "gray85", linewidth = 0.3),
    panel.grid.major = element_line(color = "gray80", linewidth = 0.5)
  )

Figure 3.1: Exponential function \(f(x)=e^x\) and its derivative \(f'(x)=e^x\). Both coincide, illustrating smoothness (\(C^{\infty}\)).

As shown in Figure 3.1, the exponential function coincides with its derivative. This illustrates a fundamental property of smooth functions: not only are they differentiable, but all their derivatives are continuous and well-defined across the entire domain.

Example: smooth functions (common families).

In addition to the exponential function discussed earlier, other important examples of smooth functions include trigonometric functions (such as \(\sin x\) and \(\cos x\)), sigmoid-type functions, and the hyperbolic tangent.

Typical examples are:

  • Sine function \[ f(x)=\sin x, \qquad f'(x)=\cos x. \]

  • Cosine function \[ f(x)=\cos x, \qquad f'(x)=-\sin x. \]

  • Sigmoid function \[ \sigma(x)=\frac{1}{1+e^{-x}}, \qquad \sigma'(x)=\sigma(x)\bigl(1-\sigma(x)\bigr). \]

  • Hyperbolic tangent \[ \tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}, \qquad \frac{d}{dx}\tanh(x)=1-\tanh^2(x). \]

These functions are infinitely differentiable and therefore belong to \(C^\infty\). Figure 3.2 displays each function together with its derivative.

  • In panels (a) and (b), the sine and cosine functions illustrate smooth oscillatory behavior, with derivatives that remain continuous and periodic.

  • In panels (c) and (d), the sigmoid and hyperbolic tangent illustrate smooth saturation, where both the function and its derivative vary continuously, although the derivative becomes small for large values of \(|x|\).

  • In all cases, both the function and its derivative are well-defined and continuous for all \(x\), which is a defining property of smooth functions.

These examples highlight how smoothness leads to stable and predictable gradient behavior, which is particularly important in gradient-based learning.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 600)

# Functions
sinx = np.sin(x)
cosx = np.cos(x)
sig = 1 / (1 + np.exp(-x))
tanh = np.tanh(x)

# Derivatives
dsinx = np.cos(x)
dcosx = -np.sin(x)
dsig = sig * (1 - sig)
dtanh = 1 - tanh**2

fig, axes = plt.subplots(4, 2, figsize=(10, 8))

# ---------------- (a) sine ----------------
axes[0,0].plot(x, sinx)
axes[0,0].set_title(r"(a) $f(x)=\sin x$")

axes[0,1].plot(x, dsinx, linestyle='--')
axes[0,1].set_title(r"$f'(x)=\cos x$")

# ---------------- (b) cosine ----------------
axes[1,0].plot(x, cosx)
axes[1,0].set_title(r"(b) $f(x)=\cos x$")

axes[1,1].plot(x, dcosx, linestyle='--')
axes[1,1].set_title(r"$f'(x)=-\sin x$")

# ---------------- (c) sigmoid ----------------
axes[2,0].plot(x, sig)
axes[2,0].set_title(r"(c) $\sigma(x)=\frac{1}{1+e^{-x}}$")

axes[2,1].plot(x, dsig, linestyle='--')
axes[2,1].set_title(r"$\sigma'(x)=\sigma(x)(1-\sigma(x))$")

# ---------------- (d) tanh ----------------
axes[3,0].plot(x, tanh)
axes[3,0].set_title(r"(d) $\tanh(x)$")

axes[3,1].plot(x, dtanh, linestyle='--')
axes[3,1].set_title(r"$\frac{d}{dx}\tanh(x)=1-\tanh^2(x)$")

# Formatting
for ax in axes.flat:
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-6, 6)
    ax.set_xlabel("x")

for ax in axes[:,0]:
    ax.set_ylabel("function")

for ax in axes[:,1]:
    ax.set_ylabel("derivative")

plt.tight_layout()
plt.show()

Figure 3.2: Smooth functions and their derivatives. Left: functions. Right: derivatives. Trigonometric functions exhibit smooth oscillatory behavior, while sigmoid and hyperbolic tangent illustrate smooth saturation effects.

While the previous examples illustrate functions that are smooth and exhibit well-behaved derivatives across their entire domain, not all functions share this property. In particular, some functions may be continuous but fail to be differentiable at specific points. The following example provides a simple and instructive case.

3.0.3 Examples (non-smooth functions)

Example: absolute value function (non-smooth case).

A simple example of a function that is not smooth is the absolute value function:

\[ |x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases} \]

Its derivative is given by

\[ f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases} \]

However, at \(x = 0\), the derivative is not defined, since the left and right limits do not coincide.

library(ggplot2)

# Data
x <- seq(-3, 3, length.out = 500)

df <- data.frame(
  x = x,
  f = abs(x),
  df = ifelse(x > 0, 1, ifelse(x < 0, -1, NA)),
  group = ifelse(x < 0, "left", ifelse(x > 0, "right", NA))
)

# Plot
ggplot(df, aes(x = x)) +
  
  # Function
  geom_line(
    aes(y = f, color = "func"),
    linewidth = 1.2
  ) +
  
  # Derivative (separated groups)
  geom_line(
    aes(y = df, color = "deriv", group = group),
    linewidth = 1.2,
    linetype = "dashed",
    na.rm = TRUE
  ) +
  
  # Open circles (discontinuity)
  geom_point(aes(x = 0, y = 1), shape = 1, size = 4, color = "red", stroke = 1) +
  geom_point(aes(x = 0, y = -1), shape = 1, size = 4, color = "red", stroke = 1) +
  
  # Axes
  geom_hline(yintercept = 0, color = "gray40", linewidth = 0.4) +
  geom_vline(xintercept = 0, color = "gray40", linewidth = 0.4) +
  
  # Labels
  scale_color_manual(
    values = c("func" = "blue", "deriv" = "red"),
    labels = c(
      "func"  = expression(f(x) == abs(x)),
      "deriv" = expression("Derivative " * f*minute(x))
    )
  ) +
  
  labs(
    x = expression(x),
    y = expression(f(x)),
    color = NULL
  ) +
  
  coord_cartesian(xlim = c(-3, 3), ylim = c(-1.5, 3)) +
  
  theme_minimal(base_size = 14)

Figure 3.3: Absolute value function \(f(x)=|x|\) and its derivative. The derivative is discontinuous at \(x=0\), illustrating a non-smooth point.

As shown in Figure 3.3, the absolute value function is continuous but exhibits a sharp corner at \(x=0\). The derivative presents a discontinuity, jumping from \(−1\) to \(1\), and is therefore not defined at that point.

This example contrasts directly with the smooth functions presented in Figure 3.2. While those functions exhibit continuous and well-defined derivatives across the entire domain, the absolute value function introduces a point where the derivative is undefined. This lack of smoothness has important implications for optimization and learning algorithms.

In the next section, we explore additional examples of non-smooth functions that frequently arise in practice.

Example: non-smooth functions (other common cases).

In contrast to smooth functions, several commonly used functions exhibit non-smooth behavior. These functions may fail to be differentiable at certain points or may even be discontinuous.

Typical examples include:

  • ReLU function. \[ f(x)=\max(0,x), \qquad f'(x)= \begin{cases} 0, & x<0,\\ 1, & x>0, \end{cases} \] The derivative is not defined at \(x=0\). In practice, a subgradient is often used.

  • Leaky ReLU function. \[ f(x)= \begin{cases} 0.1x, & x<0,\\ x, & x\ge 0, \end{cases} \qquad f'(x)= \begin{cases} 0.1, & x<0,\\ 1, & x>0, \end{cases} \] This function is continuous but still not differentiable at \(x=0\).

  • Sign function. \[ \mathrm{sign}(x)= \begin{cases} -1, & x<0,\\ 1, & x>0, \end{cases} \qquad f'(x)=0 \text{ for } x\neq 0,\ \text{undefined at } x=0. \]

This function is discontinuous at the origin and therefore not differentiable at \(x=0\).

  • Step (Heaviside) function. \[ H(x)= \begin{cases} 0, & x<0,\\ 1, & x\ge 0, \end{cases} \qquad f'(x)=0 \text{ for } x\neq 0,\ \text{undefined at } x=0. \]

This function is also discontinuous at \(x=0\), and thus not differentiable.

Figure 3.4 illustrates these functions together with their graphical behavior:

  • In panels (a) and (b), ReLU and Leaky ReLU exhibit sharp corners at \(x=0\), where the derivative changes abruptly and is not defined.

  • In panels (c) and (d), the sign and step functions exhibit jump discontinuities, where the function itself is not continuous, and therefore no derivative exists at the origin.

These examples highlight two fundamentally different sources of non-smoothness:

  • lack of differentiability (kinks), and

  • discontinuity (jumps).

Both phenomena have important implications for optimization, particularly in gradient-based learning methods.

In practice, non-differentiable points are typically handled using subgradients or convenient definitions, allowing gradient-based optimization methods to remain applicable.

import numpy as np
import matplotlib.pyplot as plt

x_left  = np.linspace(-3, 0, 250, endpoint=False)
x_right = np.linspace(0, 3, 250, endpoint=False)

fig, axes = plt.subplots(4, 2, figsize=(10, 8))

# ---------------- (a) ReLU ----------------
axes[0,0].plot(x_left, np.zeros_like(x_left))
axes[0,0].plot(x_right, x_right)
axes[0,0].scatter([0], [0], s=35, color="C0")
axes[0,0].set_title(r"(a) $f(x)=\max(0,x)$")

axes[0,1].plot(x_left, np.zeros_like(x_left), linestyle="--")
axes[0,1].plot(x_right, np.ones_like(x_right), linestyle="--")
axes[0,1].scatter([0], [0], facecolors="white", edgecolors="black", s=45)
axes[0,1].scatter([0], [1], facecolors="white", edgecolors="black", s=45)
axes[0,1].set_title(r"$f'(x)$ (undefined at $x=0$)")

# ---------------- (b) Leaky ReLU ----------------
axes[1,0].plot(x_left, 0.1*x_left)
axes[1,0].plot(x_right, x_right)
axes[1,0].scatter([0], [0], s=35, color="C0")
axes[1,0].set_title(r"(b) Leaky ReLU")

axes[1,1].plot(x_left, 0.1*np.ones_like(x_left), linestyle="--")
axes[1,1].plot(x_right, np.ones_like(x_right), linestyle="--")
axes[1,1].scatter([0], [0.1], facecolors="white", edgecolors="black", s=45)
axes[1,1].scatter([0], [1], facecolors="white", edgecolors="black", s=45)
axes[1,1].set_title(r"$f'(x)$ (undefined at $x=0$)")

# ---------------- (c) sign ----------------
axes[2,0].plot(x_left, -np.ones_like(x_left))
axes[2,0].plot(x_right, np.ones_like(x_right))
axes[2,0].scatter([0], [-1], facecolors="white", edgecolors="black", s=45)
axes[2,0].scatter([0], [1], facecolors="white", edgecolors="black", s=45)
axes[2,0].scatter([0], [0], color="C0", s=35)
axes[2,0].set_title(r"(c) $\mathrm{sign}(x)$")

axes[2,1].plot(x_left, np.zeros_like(x_left), linestyle="--")
axes[2,1].plot(x_right, np.zeros_like(x_right), linestyle="--")
axes[2,1].set_title(r"$f'(x)$ (no classical derivative)")
axes[2,1].annotate("Dirac-like behavior", (0,0), xytext=(10,10),
                   textcoords="offset points")

# ---------------- (d) step function ----------------
axes[3,0].plot(x_left, np.zeros_like(x_left))
axes[3,0].plot(x_right, np.ones_like(x_right))
axes[3,0].scatter([0], [0], facecolors="white", edgecolors="black", s=45)
axes[3,0].scatter([0], [1], color="C0", s=35)
axes[3,0].set_title(r"(d) step function")

axes[3,1].plot(x_left, np.zeros_like(x_left), linestyle="--")
axes[3,1].plot(x_right, np.zeros_like(x_right), linestyle="--")
axes[3,1].set_title(r"$f'(x)$ (Dirac delta at $x=0$)")
axes[3,1].annotate("impulse", (0,0), xytext=(10,10),
                   textcoords="offset points")

# Formatting
for ax in axes.flat:
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-3, 3)
    ax.set_ylim(-1.5, 1.5)
    ax.set_xlabel("x")

plt.tight_layout()
plt.show()

Figure 3.4: Non-smooth functions and their derivatives. Left: functions. Right: derivatives (when defined). Non-differentiability is reflected either as discontinuities in the derivative or as the absence of a classical derivative.

Remark: generalized derivatives and Dirac delta

Although the sign and step functions do not admit classical derivatives at \(x=0\), they can be interpreted within the framework of generalized functions (or distributions).

In this setting, their derivatives are expressed in terms of the Dirac delta function \(\delta(x)\):

\[ \frac{d}{dx} H(x) = \delta(x), \qquad \frac{d}{dx} \,\mathrm{sign}(x) = 2\,\delta(x). \]

The Dirac delta is not a function in the usual sense, but a generalized object that captures an impulse concentrated at a single point, with the property

\[ \int_{-\infty}^{\infty} \delta(x)\,dx = 1. \]

This interpretation provides a useful conceptual bridge between discontinuous functions and their “derivatives,” and plays an important role in advanced topics such as signal processing, differential equations, and theoretical aspects of optimization.

3.0.4 Remark (Neural networks context)

In the context of neural networks, smooth activation functions are particularly useful because they allow gradients to be computed reliably during training. However, not all commonly used activation functions belong to \(C^{\infty}\). For instance, the ReLU function is not differentiable at zero, yet it remains widely used due to its practical advantages.
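In practice, the missing derivative at zero is simply assigned a conventional value. The following minimal sketch illustrates one such convention for ReLU, assigning the subgradient 0 at \(x=0\) (the value 1 would be an equally valid choice):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_subgrad(x):
    # Convention: subgradient 0 at x = 0 (any value in [0, 1] would be admissible)
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))          # [0. 0. 3.]
print(relu_subgrad(x))  # [0. 0. 1.]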

Throughout this document, we will frequently refer to functions in \(C^{\infty}\) when discussing theoretical properties of activation functions and optimization.

4 Activation functions

4.0.1 Motivation and definition

Activation functions are a fundamental component of artificial neural networks because they introduce nonlinearity into the model. Without nonlinear activation functions, a neural network composed of multiple layers would reduce to an equivalent linear transformation, regardless of the number of layers.
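This collapse can be verified numerically. The following minimal sketch composes two purely linear (affine) layers with randomly chosen parameters, using a generic (outputs \(\times\) inputs) weight layout for brevity, and checks that the result equals a single affine map:

import numpy as np

rng = np.random.default_rng(0)

# Two affine layers with no activation in between (weights chosen at random)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Layer-by-layer computation
h = W1 @ x + b1
y = W2 @ h + b2

# Equivalent single affine transformation
W = W2 @ W1
b = W2 @ b1 + b2

print(np.allclose(y, W @ x + b))  # True: the composition is still affine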

Recall that an artificial neuron computes a linear combination of its inputs

\[ z = \mathbf{w}^\top\mathbf{x} + b, \]

which represents the internal signal of the neuron. The final output is obtained by applying a nonlinear transformation

\[ h = \varphi(z), \]

where \(\varphi(\cdot)\) is called the activation function.

4.0.2 Role of the activation function

The role of the activation function is to transform the internal signal of the neuron into a response that can capture complex relationships between variables. By introducing nonlinear transformations, neural networks can construct nonlinear decision boundaries and learn rich internal representations of the data.

This capability is supported by classical universal approximation results, which show that neural networks equipped with suitable activation functions can approximate a wide class of functions on compact domains.

From a geometric perspective, activation functions distort the feature space in a controlled way, allowing the model to separate patterns that would otherwise be inseparable using only linear transformations.

4.0.3 Classification of activation functions

Monotonic and periodic functions.

Activation functions can be organized into different categories according to their mathematical properties. One common classification distinguishes between monotonic and periodic activation functions.

  • Monotonic activation functions are functions whose output consistently increases or decreases with respect to the input. These functions have historically been the most widely used in neural networks and include examples such as the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU).

  • Periodic activation functions exhibit oscillatory behavior and are particularly useful in models designed to capture periodic or high-frequency patterns. These functions are commonly used in specialized neural architectures for representing signals, implicit functions, or spatial fields.

Periodic activation functions can be further divided into:

  • Sinusoidal activation functions.

  • Non-sinusoidal periodic functions.

A detailed and comprehensive treatment of activation functions, including their mathematical definitions, graphical behavior, and practical implications, is provided in the following companion document: Activation functions (Llinás, 2026).

From activation functions to learning.

While activation functions define how information is transformed within the network, they do not specify how the parameters of the model are learned. This naturally leads to the question of how neural networks are trained from data.

From a mathematical perspective, learning in neural networks consists of adjusting the parameters \(\mathbf{w}\) and \(b\) so as to minimize a loss function defined over the training data.

4.0.4 Monotonic activation functions

Intuition and graphical interpretation.

We begin with monotonic activation functions, which have historically played a central role in the development of neural networks. As discussed in the previous section, these functions preserve the ordering of inputs and are especially useful when the model is expected to respond in a consistent increasing or decreasing manner.

In mathematics, a function is said to be monotonic when it is either nondecreasing or nonincreasing over its domain.

  • From a graphical perspective, monotonicity means that the curve evolves in a single overall direction: it either increases as the input grows or decreases without reversing its global trend.

This idea is illustrated in Figure 4.1:

  • Two of the curves exhibit monotonic behavior: one shows a steady increase (blue curve), while another shows a steady decrease over the entire domain (red curve).

  • In contrast, the third curve oscillates (green curve), alternating between increasing and decreasing intervals, and therefore does not satisfy the monotonicity property.

This comparison highlights that monotonic functions preserve the ordering of inputs, whereas non-monotonic functions may reverse their direction.


Figure 4.1: Monotonic (increasing and decreasing) curves compared with a non-monotonic (oscillating) curve.

Formal definition.

A function \(f\) is nondecreasing if for all \(x,y \in \mathbb{R}\), whenever \(x \ge y\), it follows that \(f(x) \ge f(y)\).

Likewise, \(f\) is nonincreasing if for all \(x,y \in \mathbb{R}\), whenever \(x \ge y\), it follows that \(f(x) \le f(y)\).
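For instance, the sigmoid function is nondecreasing: since \(0 < \sigma(x) < 1\) for all \(x\), its derivative satisfies

\[ \sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr) > 0, \]

so \(x \ge y\) implies \(\sigma(x) \ge \sigma(y)\).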

4.0.5 Common monotonic activation functions

In the context of neural networks, many commonly used activation functions exhibit monotonic behavior. The following list summarizes several important examples that will be studied in detail:

  1. Linear function

  2. Identity function

  3. Piecewise linear function

  4. Threshold (Heaviside) function

  5. Sigmoid function

  6. Bipolar sigmoid function

  7. ReLU and its variants (Leaky ReLU, PReLU, RReLU)

  8. ELU and SELU

  9. SoftMax function

  10. Sign function

  11. Maxout function

  12. Softsign function

  13. Elliot function

  14. Hyperbolic tangent (tanh) function

  15. Arctangent function

  16. Lecun’s hyperbolic tangent function

  17. Complementary log-log function

  18. Softplus function

  19. Bent identity function

  20. Soft exponential function

Each of these functions has different properties in terms of smoothness, differentiability, and practical performance in neural networks.

A more detailed discussion of these activation functions, including their mathematical properties and visual interpretation, is available in the companion document: Activation functions (Llinás, 2026).

4.0.6 Periodic activation functions: motivation

While monotonic activation functions have historically dominated neural network design due to their stability and well-behaved optimization properties, many real-world phenomena exhibit inherently oscillatory or cyclic behavior. Signals in physics, audio processing, time-series analysis, and spatial modeling often contain repeating patterns that cannot be adequately captured using strictly monotonic transformations.

Periodic activation functions address this limitation by producing oscillatory outputs. Rather than preserving the ordering of inputs, they allow the activation response to vary cyclically, enabling neural networks to model wave-like structures, harmonic relationships, and repeating local patterns.

These functions are particularly relevant in architectures designed for signal representation, implicit neural fields, and temporal dynamics. A defining characteristic of periodic activations is that their derivatives also oscillate. While this increases expressive power, it may introduce additional optimization challenges, such as sensitivity to initialization, learning rate, and training stability.

4.0.7 Classification of periodic activation functions

The periodic activation functions considered in this section can be naturally grouped into two main categories, depending on the nature of their oscillatory behavior.

Some of these functions (e.g., Fourier or wavelet-based representations) are not activation functions in the strict sense, but rather functional transformations that can be incorporated into neural architectures to capture periodic structure.

Sinusoidal functions.

  • Sine wave function.

  • Cardinal sine function (Sinc).

  • Fourier transform (FT, DFT/FFT).

  • Short-time Fourier transform (STFT, Gabor transform).

  • Wavelet transform.

Non-sinusoidal periodic functions.

  • Gaussian (normal distribution) function.

  • Square wave function.

  • Triangle wave function.

  • Sawtooth wave function.

  • S-shaped rectified linear unit (SReLU).

  • Adaptive piecewise linear unit (APLU).

A more detailed discussion of these activation functions, including their mathematical properties and visual interpretation, is available in the companion document: Activation functions (Llinás, 2026).

While activation functions determine how information is transformed within a neural network, they do not specify how the parameters of the model are learned from data. This naturally leads to the question of how neural networks adjust their weights and biases during training.

5 Mathematical preliminaries for learning in neural networks

Before introducing the learning process in neural networks, it is useful to briefly review a few mathematical concepts that will be used throughout this section. In particular, we will rely on derivatives, partial derivatives, and the gradient operator.

These tools allow us to quantify how a function changes with respect to its inputs, which is essential for optimizing model parameters.

5.0.1 Derivatives and partial derivatives

In one dimension, the derivative of a function measures how the output changes as the input varies:

\[ \frac{d}{dx} f(x). \]

In neural networks, however, functions typically depend on multiple variables. Let

\[ \mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^n \]

denote the input vector. Each component \(x_i\) can be interpreted as an individual input feature, which is particularly relevant in neural networks where inputs are naturally represented as vectors.

In this setting, we use partial derivatives, which measure how a function changes with respect to one variable while keeping the others fixed:

\[ \frac{\partial f}{\partial x_i}. \]

This notation allows us to analyze the sensitivity of a function with respect to each individual component of its input.

5.0.2 Gradient vector

For functions of several variables, partial derivatives can be grouped into a single object known as the gradient.

\[ \nabla f(\mathbf{x}) \;=\; \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix} \;\in\; \mathbb{R}^n. \]

Thus, the gradient can be interpreted as a vector field defined over \(\mathbb{R}^n\), assigning to each point \(\mathbf{x}\) the direction of steepest ascent.

The gradient is a vector that points in the direction of maximum increase of the function.

This concept plays a central role in optimization, as it tells us how to adjust parameters to reduce a loss function.

Intuitively, while the gradient tells us in which direction to move, it does not indicate how the landscape bends in different directions. This limitation motivates the introduction of second-order derivatives.

5.0.3 Second-order derivatives and curvature

While the gradient describes the direction of maximum increase of a function, it does not capture how this direction changes across the input space.

To describe this behavior, we introduce second-order derivatives, which measure how the gradient itself varies.

Before introducing the general multivariable case, it is useful to recall the one-dimensional setting.

For a function of a single variable, the second derivative

\[ f''(x) \]

measures how the slope of the function changes, and therefore describes the curvature of the function.

In this case, the Hessian reduces to the \(1 \times 1\) matrix

\[ H_f(x)\;=\; \begin{pmatrix} f''(x) \end{pmatrix} \;=\; f''(x). \]

This shows that the Hessian generalizes the notion of second derivative from one dimension to multiple dimensions.

For a function of several variables, these second-order derivatives are organized into the Hessian matrix:

\[ H_f(\mathbf{x}) \;=\; \left[ \frac{\partial^2 f}{\partial x_i \partial x_j} \right]_{i,j=1}^n \;=\; \underbrace{\begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}}_{n\times n}. \]

The Hessian provides information about the local curvature of the function.

  • If the curvature is similar in all directions, optimization is relatively stable.

  • If curvature varies significantly across directions, optimization may become difficult.

This concept will be essential later when discussing more advanced optimization methods.

This notion of curvature will become especially important when analyzing optimization difficulties such as narrow valleys and saddle points, where the Hessian reveals strong differences across directions.

5.0.4 Simple illustrative example

Function definition.

Consider the function:

\[ L(x,y) = x^2 + y^2 \]

Gradient computation.

Its partial derivatives are:

\[ \frac{\partial L}{\partial x} = 2x, \quad \frac{\partial L}{\partial y} = 2y \]

Thus, the gradient is:

\[ \nabla L(x,y) = (2x, 2y) \]

This vector describes how the function changes at any point \((x,y)\) and, in particular, determines the direction used to adjust parameters during learning.

Hessian and curvature.

The Hessian of this function is:

\[ H_L(x,y)= \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}. \]

This matrix is constant and isotropic, meaning that curvature is the same in all directions. As a result, the function behaves like a perfectly symmetric bowl, and gradient-based optimization proceeds smoothly toward the minimum without oscillations.

Graphical representation.

The contour lines represent the level sets of the function \(L(x,y)\), while the arrows correspond to the gradient vector \(\nabla L(x,y)\) at each point. As shown in Figure 5.1, the gradient vectors point outward from the origin, indicating the direction of maximum increase of the function. Consequently, moving in the opposite direction leads toward the minimum, which in this case is located at \((0,0)\).

Figure 5.1 provides two complementary views of this behavior: a surface representation and its corresponding level sets, both combined with the gradient field.

This observation is fundamental for optimization, as gradient-based methods update parameters by moving in the opposite direction of the gradient.

# ==========================================================
# Function and gradient
# ==========================================================
import numpy as np
import matplotlib.pyplot as plt

# Global font configuration 
plt.rcParams.update({
    "font.size": 14,
    "axes.titlesize": 15,
    "axes.labelsize": 14,
    "xtick.labelsize": 13,
    "ytick.labelsize": 13
})

def L(x, y):
    return x**2 + y**2

def grad_L(x, y):
    return 2*x, 2*y

# Grid
x = np.linspace(-3, 3, 25)
y = np.linspace(-3, 3, 25)
X, Y = np.meshgrid(x, y)
Z = L(X, Y)
U, V = grad_L(X, Y)

# Shared color scale for both plots
vmin = Z.min()
vmax = Z.max()

# ==========================================================
# Figure (compact and balanced layout)
# ==========================================================
fig = plt.figure(figsize=(14, 6.5))

# Make the 3D plot slightly larger than the contour plot
gs = fig.add_gridspec(1, 2, width_ratios=[1.3, 1], wspace=0.07)


# ----------------------------------------------------------
# (a) Surface plot with gradient field
# ----------------------------------------------------------
ax1 = fig.add_subplot(gs[0, 0], projection='3d')

surf = ax1.plot_surface(
    X, Y, Z,
    cmap='viridis',
    edgecolor='k',
    linewidth=0.6,
    vmin=vmin, vmax=vmax,
    alpha=0.9
)

# Gradient vectors projected onto the plane z = 0
ax1.quiver(
    X, Y, 0,
    U, V, 0,
    color='black',
    length=0.22,
    normalize=True
)

# Slightly more separation from axes
ax1.set_title("(a) Surface of $L(x,y)$ with gradient field", pad=8)

ax1.set_xlabel("x", labelpad=3)
ax1.set_ylabel("y", labelpad=3)
ax1.set_zlabel("$L(x,y)$", labelpad=3)

# Adjust viewing angle for better perception
ax1.view_init(elev=30, azim=-60)

# ----------------------------------------------------------
# (b) Contour plot with gradient field
# ----------------------------------------------------------
ax2 = fig.add_subplot(gs[0, 1])

levels = np.linspace(vmin, vmax, 10)

cont = ax2.contour(
    X, Y, Z,
    levels=levels,
    cmap='viridis',
    vmin=vmin, vmax=vmax,
    linewidths=1.4
)

# Gradient vectors
ax2.quiver(
    X, Y, U, V,
    color='black',
    angles='xy',
    scale_units='xy',
    scale=25,
    width=0.0035
)

#  space above subtitle
ax2.set_title("(b) Level sets of $L(x,y)$ with gradient field", pad=8)

ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.set_aspect('equal', adjustable='box')

# ----------------------------------------------------------
#  Centered horizontal colorbar
# ----------------------------------------------------------

# width: [left, bottom, width, height]
cbar_ax = fig.add_axes([0.40, 0.085, 0.30, 0.025])  

cbar = fig.colorbar(
    surf,
    cax=cbar_ax,
    orientation='horizontal'
)
cbar.set_label("$L(x,y)$", labelpad=2)

# ----------------------------------------------------------
# Main title with improved spacing
# ----------------------------------------------------------
fig.suptitle(
    "Function $L(x,y)=x^2+y^2$ and its gradient field",
    fontsize=16,
    y=0.975   # more space from subplots
)

# Fine-tuned layout adjustments
fig.subplots_adjust(
    top=0.82,     # pushes plots down (more separation)
    bottom=0.17,
    left=0.05,
    right=0.98
)

plt.show()

Figure 5.1: Two complementary visualizations of \(L(x,y)=x^2+y^2\) and its gradient field. (a) Surface representation with gradient vectors projected onto the input plane. (b) Level sets with gradient vectors indicating the direction of maximum increase.
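To complement the figure, the following minimal sketch (assuming a fixed step size of \(0.1\) and an arbitrary starting point) repeatedly moves against the gradient of \(L(x,y)=x^2+y^2\); the iterates approach the minimum at \((0,0)\):

import numpy as np

def grad_L(p):
    # Gradient of L(x, y) = x^2 + y^2, i.e. (2x, 2y)
    return 2.0 * p

p = np.array([2.5, -1.5])   # arbitrary starting point
alpha = 0.1                 # step size (assumed for illustration)

for k in range(20):
    p = p - alpha * grad_L(p)   # move in the direction opposite to the gradient

print(p)  # close to (0, 0), the minimum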

5.0.5 Why this matters for learning

Learning in neural networks consists of minimizing a loss function with respect to the model parameters (weights and biases).

To achieve this, we need to compute how the loss changes when the parameters change:

\[ \frac{\partial L}{\partial \mathbf{w}}, \quad \frac{\partial L}{\partial b} \]

These quantities are used to iteratively update the parameters in a direction that reduces the loss.

This idea will be formalized in the next section, where we introduce the learning process in neural networks and show how these derivatives are used to update the model parameters.

6 Learning in neural networks

While activation functions determine how neurons transform their inputs, they do not by themselves provide a mechanism for learning from data. The learning process requires a way to evaluate prediction quality and a systematic procedure to adjust model parameters.

In supervised learning settings, this process is guided by a loss function, which quantifies the discrepancy between the predicted output of the network and the true target values.

To understand how this optimization process works, we first introduce a few essential mathematical tools that describe how functions change with respect to their inputs.

With these tools in place, we are now ready to formalize how neural networks learn from data.

6.0.1 Forward propagation (conceptual)

Before introducing optimization, it is useful to formalize how information flows through a neural network.

Figure 6.1 illustrates the forward propagation mechanism in a multilayer neural network. At each layer, the input signal undergoes an affine transformation followed by a nonlinear activation.

Figure 6.1: Forward propagation in a multilayer neural network. Source: [Nuñez (2026)](https://medium.com/@nuguzdani/understanding-xavier-initialization-part-i-why-weights-matter-in-neural-networks-7a3ad5511ea0).

To formalize the flow of information through a neural network, it is useful to introduce a layer-wise notation.

Consider a neural network composed of layers indexed by \(j = 1, \dots, N\). Let \(n_j\) denote the number of neurons in layer \(j\).

For each layer, we define:

  • \(\mathbf{h}^{(j)} \in \mathbb{R}^{n_j}\): the output (activation vector) of layer \(j\). Each component corresponds to the output of one neuron.

  • \(\mathbf{z}^{(j)} \in \mathbb{R}^{n_j}\): the pre-activation signal at layer \(j\), obtained before applying the activation function.

  • \(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{\,n_{j-1} \times n_j}\): the weight matrix connecting layer \(j-1\) to layer \(j\). Each column contains the weights associated with one neuron in layer \(j\).

  • \(\mathbf{b}^{(j)} \in \mathbb{R}^{n_j}\): the bias vector of layer \(j\).

By convention, the input layer is denoted by \[ \mathbf{h}^{(0)} = \mathbf{x} \in \mathbb{R}^{n_0}. \]

where \(\mathbf{x}\) is the input vector.

At each layer, the network performs two steps:

  1. A linear transformation: \[ \mathbf{z}^{(j)} = \mathbf{W}^{(j-1,j)\top} \mathbf{h}^{(j-1)} + \mathbf{b}^{(j)} \]

  2. A nonlinear transformation: \[ \mathbf{h}^{(j)} = f^{(j)}\left(\mathbf{z}^{(j)}\right) \]

Combining both steps, the forward propagation at layer \(j\) can be written as: \[ \mathbf{h}^{(j)} = f^{(j)}\left(\mathbf{W}^{(j-1,j)\top} \mathbf{h}^{(j-1)} + \mathbf{b}^{(j)}\right) \]

This recursive structure shows that the output of each layer becomes the input of the next, allowing the network to build increasingly complex representations of the data.

As illustrated in Figure 6.1, each layer applies the same fundamental operation: a linear combination of inputs followed by a nonlinear activation.
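A minimal NumPy sketch of this recursion is given below, assuming tanh activations in the hidden layers, an identity activation at the output, randomly initialized weights, and arbitrary layer sizes:

import numpy as np

rng = np.random.default_rng(1)

# Layer sizes n_0, n_1, ..., n_N (chosen arbitrarily)
sizes = [3, 4, 2]

# W[j] has shape (n_{j-1}, n_j), matching z^(j) = W^T h^(j-1) + b^(j)
W = [rng.normal(scale=0.5, size=(sizes[j - 1], sizes[j])) for j in range(1, len(sizes))]
b = [np.zeros(sizes[j]) for j in range(1, len(sizes))]

def forward(x):
    h = x  # h^(0) = x
    for j in range(len(W)):
        z = W[j].T @ h + b[j]                     # linear step
        h = np.tanh(z) if j < len(W) - 1 else z   # nonlinear step (identity at the output)
    return h

print(forward(np.array([0.2, -0.7, 1.0])))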

6.0.2 Loss functions

A loss function quantifies the discrepancy between the predicted output of the network and the true target values.

Using the notation introduced above, let:

  • \(\mathbf{x} = \mathbf{h}^{(0)} \in \mathbb{R}^{n_0}\): input vector, where \(n_0\) denotes the number of input features (i.e., the number of neurons in the input layer)

  • \(\mathbf{h}^{(N)} = \hat{\mathbf{y}} \in \mathbb{R}^{n_N}\): network output, where \(n_N\) denotes the number of output neurons in the final layer

  • \(\mathbf{y} \in \mathbb{R}^{n_N}\): true target vector, defined in the same output space as the prediction

Thus, the loss function is defined as:

\[ L(\mathbf{y}, \hat{\mathbf{y}}) \]

6.0.3 Common loss functions (vector form)

Mean squared error (MSE).

For regression problems, a common choice is the mean squared error, defined as:

\[ L(\mathbf{y}, \hat{\mathbf{y}}) \quad = \quad \frac{1}{2} \|\mathbf{y} \; -\; \hat{\mathbf{y}}\|_2^2 \quad =\quad \frac{1}{2} \sum_{i=1}^{n_N} (y_i \;-\; \hat{y}_i)^2 \]

  • \(\mathbf{y}, \hat{\mathbf{y}} \in \mathbb{R}^{n_N}\).

  • Each component corresponds to one output neuron.

The gradient with respect to the prediction is:

\[ \frac{\partial L}{\partial \hat{\mathbf{y}}} = \hat{\mathbf{y}} - \mathbf{y} \]

This vector has dimension \(\mathbb{R}^{n_N}\) and is the starting point for backpropagation.
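A minimal numerical check of these two formulas, using arbitrary example vectors, is sketched below:

import numpy as np

y     = np.array([1.0, 0.0, 2.0])   # true targets (example values)
y_hat = np.array([0.8, 0.3, 1.5])   # network outputs (example values)

loss = 0.5 * np.sum((y - y_hat) ** 2)   # L = (1/2) ||y - y_hat||^2
grad = y_hat - y                        # dL / d(y_hat)

print(loss)   # 0.19
print(grad)   # [-0.2  0.3 -0.5]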

Remark on scaling.

In the literature, the MSE is sometimes defined without the factor \(\tfrac{1}{2}\) or averaged over the output dimension (e.g., dividing by \(n_N\)).

All these versions differ only by a constant factor, which does not affect the optimal solution, but only rescales the gradient.

The factor \(\tfrac{1}{2}\) is used here because it simplifies the derivative, leading to cleaner expressions in backpropagation.

Cross-entropy loss.

For classification problems (e.g., with softmax output), a common choice is:

\[ L(\mathbf{y}, \hat{\mathbf{y}}) \quad = \quad - \sum_{i=1}^{n_N} y_i \log(\hat{y}_i) \]

where:

  • \(\mathbf{y}\) is typically a one-hot encoded vector.

  • \(\hat{\mathbf{y}}\) represents predicted probabilities.

7 The XOR problem

Before introducing the error measures, we consider a classical benchmark in neural networks: the exclusive OR (XOR) problem.

The XOR function is defined as:

Table 7.1: The XOR function: output is 1 if the inputs differ, and 0 otherwise.

| \(x_1\) | \(x_2\) | \(y\) | Interpretation        |
|---------|---------|-------|-----------------------|
| 0       | 0       | 0     | Same inputs → 0       |
| 0       | 1       | 1     | Different inputs → 1  |
| 1       | 0       | 1     | Different inputs → 1  |
| 1       | 1       | 0     | Same inputs → 0       |

The output is 1 if the inputs are different, and 0 otherwise.

7.0.1 Why XOR is important

The XOR problem is fundamental because:

  • It is not linearly separable.

  • A single-layer network cannot solve it.

  • It requires at least one hidden layer.

This makes XOR a minimal example to illustrate how neural networks learn nonlinear patterns.

7.0.2 Geometric intuition

The following figures illustrate the XOR dataset in \(\mathbb{R}^2\). Points belonging to different classes cannot be separated by a single straight line.

We first present the XOR dataset using its four canonical observations.

library(ggplot2)
library(patchwork)

xor_simple <- data.frame(
  x1 = c(0, 0, 1, 1),
  x2 = c(0, 1, 0, 1),
  y  = factor(c(0, 1, 1, 0))
)

base_theme <- theme_minimal(base_size = 20) +
  theme(
    panel.grid.minor = element_line(color = "gray92"),
    panel.grid.major = element_line(color = "gray85"),
    
    #  Axis
    axis.title = element_text(size = 26), # Titles
    axis.text  = element_text(size = 16), # Numbers
    
    #  Titles (a), (b)
    plot.title = element_text(size = 22, face = "bold", hjust = 0),
    
    #  Legends
    legend.title = element_text(size = 20),
    legend.text  = element_text(size = 18),
    
    # Spaces
    plot.margin = margin(2, 2, 2, 2),
    legend.margin = margin(0, 0, 0, 0),
    legend.box.margin = margin(0, 0, 0, 0)
  )

#  Plot (a) 
p1 <- ggplot(xor_simple, aes(x = x1, y = x2, color = y)) +
  geom_point(size = 9) +
  geom_abline(intercept = 0.5, slope = 0,
              linetype = "dashed", color = "darkorange2", linewidth = 1.1) +
  geom_abline(intercept = 0, slope = 1,
              linetype = "dashed", color = "purple", linewidth = 1.1) +
  scale_color_manual(
    values = c("0" = "#C0392B", "1" = "#2E86C1"),
    labels = c("Class 0", "Class 1")
  ) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
  coord_fixed() +
  labs(
    title = "(a) Single linear boundaries",
    x = expression(x[1]),
    y = expression(x[2]),
    color = "Output"
  ) +
  base_theme +
  theme(plot.margin = margin(2, 17, 2, 2))   # More space (right): margin(top, right, bottom, left)

# Plot (b) 
p2 <- ggplot(xor_simple, aes(x = x1, y = x2, color = y)) +
  geom_point(size = 9) +
  geom_abline(intercept = 0.5, slope = -1,
              linetype = "dashed", color = "darkgreen", linewidth = 1.1) +
  geom_abline(intercept = 1.5, slope = -1,
              linetype = "dashed", color = "darkgreen", linewidth = 1.1) +
  scale_color_manual(
    values = c("0" = "#C0392B", "1" = "#2E86C1"),
    labels = c("Class 0", "Class 1")
  ) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
  coord_fixed() +
  labs(
    title = "(b) Multiple-boundary partition",
    x = expression(x[1]),
    y = expression(x[2]),
    color = "Output"
  ) +
  base_theme +
  theme(plot.margin = margin(2, 2, 2, 17))   # More space (left): margin(top, right, bottom, left)

# Final combination
(p1 + p2) +
  plot_layout(guides = "collect") &
  # Legends
  theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.title = element_text(size = 20),
    legend.text = element_text(size = 18)
  )

Figure 7.1: Geometric intuition for XOR. (a) Single linear boundaries fail to separate the classes. (b) Multiple linear boundaries can partition the input space, anticipating the role of hidden neurons.

In Figure 7.1, we observe two distinct situations:

  • Panel (a). The dashed lines represent examples of single linear decision boundaries.

    • Each line divides the input space into two half-planes, but neither one is able to separate the two classes correctly.

    • In both cases, points from different classes remain on the same side of the boundary.

    • This illustrates why a single linear classifier cannot solve the XOR problem.

  • Panel (b). The dashed lines should not be interpreted as a single linear classifier, but rather as a combination of linear boundaries that partition the input space into regions. This anticipates the role of hidden neurons in a multilayer network:

    • Each hidden unit can define a linear boundary, and their combined effect can produce a nonlinear decision rule.

To reinforce this geometric intuition, Figure 7.2 shows a denser version of the same XOR pattern.

  • The additional points do not represent new logical cases. They are sampled from the unit square and labeled according to the same XOR rule in order to make the spatial structure more visible.

  • The shaded regions are included only as a visual guide:

    • Pink regions correspond to Class 0.

    • Light blue regions correspond to Class 1.

library(ggplot2)

set.seed(123)

n <- 240
x1 <- runif(n)
x2 <- runif(n)

y <- ifelse((x1 <= 0.5 & x2 > 0.5) | (x1 > 0.5 & x2 <= 0.5), 1, 0)

xor_dense <- data.frame(
  x1 = x1,
  x2 = x2,
  y  = factor(y)
)

ggplot(xor_dense, aes(x = x1, y = x2, color = y)) +
  annotate("rect", xmin = 0.00, xmax = 0.50, ymin = 0.50, ymax = 1.00,
           fill = "#FADBD8", alpha = 0.35, color = NA) +
  annotate("rect", xmin = 0.50, xmax = 1.00, ymin = 0.00, ymax = 0.50,
           fill = "#FADBD8", alpha = 0.35, color = NA) +
  annotate("rect", xmin = 0.00, xmax = 0.50, ymin = 0.00, ymax = 0.50,
           fill = "#D6EAF8", alpha = 0.35, color = NA) +
  annotate("rect", xmin = 0.50, xmax = 1.00, ymin = 0.50, ymax = 1.00,
           fill = "#D6EAF8", alpha = 0.35, color = NA) +
  geom_point(size = 2.2, alpha = 0.9) +
  scale_color_manual(
    values = c("0" = "#C0392B", "1" = "#2E86C1"),
    labels = c("Class 0", "Class 1")
  ) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) +
  coord_fixed() +
  labs(
    x = expression(x[1]),
    y = expression(x[2]),
    color = "Output"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "right",
    panel.grid.minor = element_line(color = "gray93"),
    panel.grid.major = element_line(color = "gray85")
  )

Figure 7.2: A denser visualization of the XOR pattern in the input space.

Together, these figures make the structure of XOR clear. The points \((0,0)\) and \((1,1)\) belong to Class 0, whereas the points \((0,1)\) and \((1,0)\) belong to Class 1. Because the classes occupy alternating diagonal positions in the input space, no single boundary of the form

\[ w_1 x_1 + w_2 x_2 + b = 0 \]

can separate them.

This confirms that XOR is not linearly separable and explains why solving it requires a model capable of combining multiple linear boundaries through nonlinear transformations.
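A short argument makes this explicit. Suppose, for contradiction, that some weights \(w_1, w_2\) and bias \(b\) separated the classes, with Class 1 on the strictly positive side of the boundary. The four XOR points would then require

\[ w_2 + b > 0, \qquad w_1 + b > 0, \qquad b \le 0, \qquad w_1 + w_2 + b \le 0. \]

Adding the first two inequalities gives \(w_1 + w_2 + 2b > 0\), so \(w_1 + w_2 + b > -b \ge 0\), which contradicts the last inequality. Hence no such linear boundary exists.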

Geometric interpretation

The XOR dataset exhibits an alternating diagonal class structure in the input space. Because this arrangement cannot be separated by a single straight line, a linear model is insufficient. This limitation motivates the introduction of hidden layers and nonlinear activation functions, which allow the network to transform the input space into a representation where the classes become separable.

7.0.3 Network architecture for XOR

To overcome the limitation of linear separability observed in the previous section (see Figure 7.1), we introduce a neural network with a hidden layer. This additional layer allows the model to construct nonlinear transformations of the input space, making it possible to separate patterns such as XOR.

To solve the XOR problem, we consider a simple feedforward architecture composed of:

  • \(n_0 = 2\) input neurons.

  • One hidden layer (e.g., \(n_1 = 2\) neurons).

  • \(n_N = 1\) output neuron.

Thus: \[ \mathbf{x} \in \mathbb{R}^2, \quad \hat{y} \in \mathbb{R}, \quad y \in \mathbb{R} \]

Figure 7.3 illustrates this architecture. The input variables \((x_1, x_2)\) are first mapped into a hidden representation through two intermediate neurons \((h_1, h_2)\). Each hidden neuron defines a linear transformation of the input, and their combined effect enables the network to partition the space into multiple regions.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6,3))

# Node positions
positions = {
    "x1": (0,1),
    "x2": (0,0),
    "h1": (1,1),
    "h2": (1,0),
    "y":  (2,0.5)
}

# Draw connections
connections = [
    ("x1","h1"), ("x1","h2"),
    ("x2","h1"), ("x2","h2"),
    ("h1","y"), ("h2","y")
]

for a,b in connections:
    x1,y1 = positions[a]
    x2,y2 = positions[b]
    ax.arrow(x1,y1, x2-x1, y2-y1,
             length_includes_head=True,
             head_width=0.05, color="black")

# Draw nodes
for node,(x,y) in positions.items():
    ax.scatter(x,y, s=1200, color="#D6EAF8", edgecolors="black")
    label = "ŷ" if node=="y" else node
    ax.text(x,y,label, ha='center', va='center', fontsize=12, fontweight='bold')

ax.set_xlim(-0.2, 2.2)
ax.set_ylim(-0.1, 1.1)
ax.axis('off')

plt.tight_layout(pad=0.2)
plt.show()

Figure 7.3: A simple neural network architecture for solving the XOR problem.

These intermediate representations are then aggregated by the output neuron \(\hat{y}\), producing the final prediction. As discussed in the geometric interpretation, the key idea is that multiple linear transformations, when combined, can induce a nonlinear decision boundary.

This structure will be essential in the following sections, where we show how these transformations are computed and how the network learns them from data.
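To make the geometric idea concrete before any training takes place, the following sketch evaluates this \(2\)–\(2\)–\(1\) architecture with hand-picked (not learned) weights and a Heaviside step activation: the first hidden unit acts as an OR detector, the second as an AND detector, and the output unit combines them. This is an illustrative construction only; the next sections show how such parameters are learned from data.

import numpy as np

def step(z):
    # Heaviside step activation: 1 if z >= 0, else 0
    return (z >= 0).astype(float)

# Hand-picked parameters (illustrative, not obtained by training)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # each column holds the weights of one hidden unit
b1 = np.array([-0.5, -1.5])
w2 = np.array([1.0, -1.0])    # output weights
b2 = -0.5

def xor_net(x):
    h = step(W1.T @ x + b1)    # hidden layer: (OR, AND)
    return step(w2 @ h + b2)   # output: OR and (not AND)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, int(xor_net(np.array(x, dtype=float))))
# (0, 0) -> 0, (0, 1) -> 1, (1, 0) -> 1, (1, 1) -> 0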

7.0.4 Error measures used in the XOR example

In the manual training example (using a single XOR observation), we track different quantities to understand how the network learns.

In this section, we focus on the scalar case to simplify the exposition. These quantities serve different purposes:

  • Some are useful for interpretation.

  • Others are required for optimization.

We consider the following:

  • Absolute error (interpretation).

\[ |\hat{y} - y| \]

This quantity measures the magnitude of the prediction error in a direct and intuitive way. It is useful for monitoring performance but is not typically used for optimization due to its non-smooth derivative.

  • Squared error (loss function for a single observation).

\[ \text{cost} = \frac{1}{2}(\hat{y} - y)^2 \]

This is the loss function minimized during training (for a single observation).

  • Gradient of the loss with respect to the prediction.

\[ \frac{\partial \, \text{cost}}{\partial \hat{y}} = \hat{y} - y \]

This quantity measures how the loss changes with respect to the prediction.

To update the model parameters, we ultimately need the derivative of the loss with respect to each parameter \(w\). This is obtained by combining:

  • The gradient of the loss with respect to the prediction (e.g., \(\hat{y} - y\) for the squared error), and

  • The derivative of the prediction with respect to the parameter, \(\frac{\partial \hat{y}}{\partial w}\).

Together, these terms determine how the error signal propagates backward through the network.

A detailed explanation of this process (backpropagation) will be provided in a later section.
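To make this concrete, the following minimal sketch (assuming a linear prediction \(\hat{y} = w\,x\) for a single observation; the specific numbers are illustrative) shows how the two factors combine through the chain rule:

# Minimal sketch: chain rule for one observation, assuming y_hat = w * x.
x, y = 1.0, 0.0        # one illustrative observation
w = 0.8                # assumed current parameter value

y_hat = w * x                        # prediction
cost = 0.5 * (y_hat - y) ** 2        # squared error for this observation
dcost_dyhat = y_hat - y              # gradient of the loss w.r.t. the prediction
dyhat_dw = x                         # derivative of the prediction w.r.t. w
dcost_dw = dcost_dyhat * dyhat_dw    # chain rule

print(cost, dcost_dw)                # 0.32 0.8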

7.0.5 Important distinction between the gradient of the loss and the gradient of the prediction

  • The gradient of the loss is not the same as the gradient of the prediction.

  • The former measures how the loss changes with respect to the prediction, whereas the latter measures how the prediction changes with respect to the model parameters, through derivatives of the form

\[ \frac{\partial \hat{y}}{\partial w} \]

which measure how the network output changes with respect to its parameters.

  • Both are combined through the chain rule during backpropagation.

7.0.6 Key idea

  • The absolute error helps us interpret the prediction.

  • The loss function defines what we optimize.

  • The gradient determines how we update the model.

These concepts naturally lead to the question of how parameter updates are controlled during training.

8 Learning rate

8.0.1 Definition (learning rate in gradient descent)

The learning rate, denoted by \(\alpha > 0\), controls the magnitude of parameter updates during optimization.

  • At each iteration, the model parameters are adjusted in the direction opposite to the gradient, and the learning rate determines the size of this step.

  • Formally, for a parameter \(w\), the update rule in gradient descent is given by:

\[ w \;\leftarrow\; w \;-\; \alpha \, \frac{\partial \text{cost}}{\partial w}. \]

Here,

\[ \frac{\partial \, \text{cost}}{\partial w} \; =\; \frac{\partial \, \text{cost}}{\partial \hat{y}} \cdot \frac{\partial \, \hat{y}}{\partial w} \]

denotes the derivative of the loss (or cost) function with respect to the parameter \(w\). By the chain rule, it combines the sensitivity of the loss to the prediction \(\hat{y}\) with the sensitivity of the prediction to the parameter \(w\).

A more detailed derivation of these gradients will be presented in later sections.

8.0.2 Effect on optimization

In general, the learning rate plays a central role in optimization:

  • It determines the step size in parameter space

  • It controls the speed of convergence

  • It affects the stability of the learning process

Choosing an appropriate learning rate is therefore essential for effective training.

8.0.3 Analytical formulation

The learning rate \(\alpha\) directly influences how quickly the model learns from the error signal:

  • If \(\alpha\) is too large, the updates may overshoot the minimum, leading to oscillations or divergence.

  • If \(\alpha\) is too small, the updates become very slow, resulting in inefficient learning.

This trade-off highlights the importance of selecting an appropriate learning rate to balance stability and convergence speed.

8.0.4 Analytical example

A simple quadratic example.

To illustrate the effect of the learning rate, consider the simple loss function

\[ L(w) = w^2, \]

whose graph is shown in Figure 8.1. This function has a single global minimum at \(w = 0\) and increases symmetrically as \(|w|\) grows.

# Plot of the quadratic loss function

import numpy as np
import matplotlib.pyplot as plt

def loss(w):
    return w**2

w = np.linspace(-4, 4, 400)
L = loss(w)

plt.figure(figsize=(6, 4))
plt.plot(w, L, linewidth=2)
plt.axhline(0, color="gray", linewidth=1, alpha=0.6)
plt.axvline(0, color="gray", linewidth=1, alpha=0.6)

plt.xlabel("Parameter $w$")
plt.ylabel("Loss $L(w)$")

plt.grid(True)
plt.tight_layout()
plt.show()

Figure 8.1: Quadratic loss function \(L(w) = w^2\). The minimum is located at \(w=0\).

Its gradient is

\[ \frac{dL}{dw} = 2w. \] Figure 8.2 shows both the loss function and its gradient.

import numpy as np
import matplotlib.pyplot as plt

def loss(w):
    return w**2

def grad(w):
    return 2*w

w = np.linspace(-4, 4, 400)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# -------------------
# (a) Loss
# -------------------
axes[0].plot(w, loss(w), linewidth=2)
axes[0].axhline(0, color="gray", alpha=0.6)
axes[0].axvline(0, color="gray", alpha=0.6)

axes[0].set_title("(a) Loss function $L(w)=w^2$")
axes[0].set_xlabel("Parameter $w$")
axes[0].set_ylabel("Loss")

axes[0].grid(True)

# -------------------
# (b) Gradient
# -------------------
axes[1].plot(w, grad(w), linewidth=2, color="tab:red")
axes[1].axhline(0, color="gray", alpha=0.6)
axes[1].axvline(0, color="gray", alpha=0.6)

axes[1].set_title("(b) Gradient $\\frac{dL}{dw}=2w$")
axes[1].set_xlabel("Parameter $w$")
axes[1].set_ylabel("Gradient")

axes[1].grid(True)

plt.tight_layout()
plt.show()

Figure 8.2: (a) Quadratic loss function \(L(w)=w^2\). (b) Its gradient \(\frac{dL}{dw}=2w\).

Interpretation of the graphical results.

The gradient is proportional to the distance from the minimum: updates are large when the parameter is far from the optimum and become progressively smaller as it approaches \(w = 0\). This property is what produces the exponential convergence of gradient descent under appropriate learning rates.

Applying the gradient descent update rule

\[ w_{t+1} \;=\; w_t \,- \, \alpha \frac{dL}{dw}, \]

and substituting \(\frac{dL}{dw} = 2w\), we obtain

\[ w_{t+1} \;= \, w_t \,-\, \alpha \, 2w_t. \]

At this stage, the update rule is introduced only to illustrate how the learning rate affects the parameter updates. A more detailed treatment of the gradient descent algorithm will be provided in a later section.
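Even so, for this particular loss the recursion can be written in closed form, which makes the role of \(\alpha\) explicit:

\[ w_{t+1} \;=\; (1 - 2\alpha)\, w_t \quad\Longrightarrow\quad w_t \;=\; (1 - 2\alpha)^t \, w_0. \]

The iterates therefore contract toward zero whenever \(|1 - 2\alpha| < 1\), that is, for \(0 < \alpha < 1\), and diverge when \(\alpha > 1\). This is the exponential behavior referred to above.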

This simple structure makes it possible to clearly observe how different learning rates affect the trajectory of the parameter as it moves toward the minimum, as illustrated in the next section.

Graphical analysis of learning rate effects.

Figure 8.3 illustrates the behavior of gradient descent under different learning rates:

  • When the learning rate is too small, the parameter approaches the minimum very slowly.

  • When the learning rate is moderate, convergence is stable and efficient.

  • When the learning rate is too large, the updates overshoot the minimum and may oscillate or even diverge.

import numpy as np
import matplotlib.pyplot as plt

# Loss and gradient
def loss(w):
    return w**2

def grad(w):
    return 2*w

def simulate_gd(alpha, w0=2, n_iter=12):
    ws = [w0]
    ls = [loss(w0)]
    w = w0
    for _ in range(n_iter):
        w = w - alpha * grad(w)
        ws.append(w)
        ls.append(loss(w))
    return np.array(ws), np.array(ls)

configs = [
    (0.05, "Small α"),
    (0.30, "Moderate α"),
    (1.10, "Large α")
]

colors = ["tab:green", "tab:red", "tab:blue"]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Common curve for trajectory plots
w_grid = np.linspace(-10, 10, 400)
loss_grid = loss(w_grid)

# Avoid exact zero for log-scale panels
loss_grid_log = loss_grid.copy()
loss_grid_log[loss_grid_log < 1e-1] = np.nan

# -----------------------------
# (a) Trajectories - linear scale
# -----------------------------
axes[0, 0].plot(w_grid, loss_grid, color="gray", linewidth=1.5, alpha=0.4)

for (alpha, label), color in zip(configs, colors):
    ws, ls = simulate_gd(alpha)
    axes[0, 0].plot(ws, ls, marker="o", label=label, color=color, linewidth=2)

axes[0, 0].set_title("(a) Gradient descent trajectories (linear scale)")
axes[0, 0].set_xlabel("Parameter w")
axes[0, 0].set_ylabel("Loss L(w)")
axes[0, 0].set_ylim(0, 100);
axes[0, 0].legend()

# -----------------------------
# (b) Loss by iteration - linear scale
# -----------------------------
for (alpha, label), color in zip(configs, colors):
    ws, ls = simulate_gd(alpha)
    axes[0, 1].plot(range(len(ls)), ls, marker="o", label=label, color=color, linewidth=2)

axes[0, 1].set_title("(b) Loss value across iterations (linear scale)")
axes[0, 1].set_xlabel("Iteration")
axes[0, 1].set_ylabel("Loss L(w)")
axes[0, 1].set_ylim(0, 100);
axes[0, 1].legend()

# -----------------------------
# (c) Trajectories - log scale
# -----------------------------
axes[1, 0].plot(w_grid, loss_grid_log, color="gray", linewidth=1.5, alpha=0.4)

for (alpha, label), color in zip(configs, colors):
    ws, ls = simulate_gd(alpha)
    ls_plot = ls.copy()
    ls_plot[ls_plot < 1e-10] = np.nan
    axes[1, 0].plot(ws, ls_plot, marker="o", label=label, color=color, linewidth=2)

axes[1, 0].set_title("(c) Gradient descent trajectories (log scale)")
axes[1, 0].set_xlabel("Parameter w")
axes[1, 0].set_ylabel("Loss L(w)")
axes[1, 0].set_ylim(1e-1, 10);
axes[1, 0].set_yscale("log")
axes[1, 0].legend()

# -----------------------------
# (d) Loss by iteration - log scale
# -----------------------------
for (alpha, label), color in zip(configs, colors):
    ws, ls = simulate_gd(alpha)
    axes[1, 1].plot(range(len(ls)), ls, marker="o", label=label, color=color, linewidth=2)

axes[1, 1].set_title("(d) Loss value across iterations (log scale)")
axes[1, 1].set_xlabel("Iteration")
axes[1, 1].set_ylabel("Loss L(w)")
axes[1, 1].set_ylim(1e-2, 10);
axes[1, 1].set_yscale("log")
axes[1, 1].legend()

plt.tight_layout()
plt.show()

Figure 8.3: Effect of the learning rate on gradient descent under linear and logarithmic vertical scales. Panels (a) and (b) use a linear scale to show global behavior, while panels (c) and (d) use a logarithmic scale to reveal convergence patterns near zero.

Figure 8.3 provides a complementary perspective by combining linear and logarithmic vertical scales.

  • In panels (a) and (b), which use a linear scale, the trajectories corresponding to small and moderate learning rates become compressed near zero. As a result, differences in their convergence behavior are difficult to distinguish once the loss becomes small. In contrast, the large learning rate dominates the scale due to divergence.

  • In panels (c) and (d), which use a logarithmic scale, the behavior near zero is magnified. This transformation reveals important differences that are not visible in the linear scale:

    • The small learning rate (green) exhibits a smooth and gradual exponential decay.

    • The moderate learning rate (red) converges faster, showing a steeper linear trend in the log scale, which is characteristic of efficient exponential convergence.

    • The large learning rate (blue) clearly oscillates and fails to stabilize, confirming its instability.

Thus, the logarithmic scale provides a more informative view of the convergence dynamics, especially when the loss approaches very small values.

For visualization purposes, two complementary strategies are used:

  • A linear scale highlights the global behavior of the trajectories, including divergence and large oscillations.

  • A logarithmic scale emphasizes the fine-grained behavior near zero, allowing a clearer comparison of convergence rates.

This dual representation is particularly useful because convergence in gradient-based optimization is often exponential, which appears approximately linear under a logarithmic transformation.

9 Adaptive learning rate methods

9.0.1 Classical adaptive methods

To address the limitations of a fixed learning rate, several adaptive optimization methods have been proposed:

  • Adaptive Gradient Algorithm (AdaGrad) (Duchi et al., 2011).
    Accumulates all past squared gradients, leading to progressively smaller learning rates.

  • Root Mean Square Propagation (RMSProp) (Tieleman and Hinton, 2012).
    Uses an exponentially decaying average of squared gradients to avoid excessive shrinkage.

  • ADADELTA (Zeiler, 2012).
    Eliminates the need to manually set a global learning rate by adapting updates based on past gradients.

  • Adaptive Moment Estimation (Adam) (Kingma and Ba, 2014).
    Combines momentum and adaptive scaling to achieve fast and stable convergence.

9.0.2 More recent variants

While more recent methods have been proposed, most of them are extensions or refinements of Adam rather than entirely new paradigms. In practice, Adam remains a central and widely used optimizer in modern deep learning.

In other words, Adam is not obsolete: it continues to serve as the foundation for many current optimization methods.

These methods adjust the effective learning rate during training based on past gradients. They can be seen as extensions of gradient descent that adapt the step size dynamically during training.

In the following sections, we focus on a subset of these methods to illustrate their main ideas. The remaining variants follow similar principles and are left for further exploration.

9.0.3 Adaptive Gradient Algorithm (AdaGrad)

Update rule and mechanism.

AdaGrad (Duchi et al., 2011) adapts the learning rate for each parameter by scaling it inversely proportional to the accumulated squared gradients. Parameters that receive large gradients get smaller updates over time.

\[ G_t \;= \; \sum_{i=1}^t \,g_i^2, \quad\quad w_{t+1} \; =\; w_t \,-\, \frac{\alpha}{\sqrt{G_t \,+\, \epsilon}} \, g_t \]

Here,

  • \(g_t = \frac{\partial \text{cost}}{\partial w}(w_t)\) denotes the gradient evaluated at the parameter value \(w_t\).

  • \(G_t = \sum_{i=1}^t g_i^2\) is the accumulated sum of past squared gradients, which keeps track of how large the gradients have been over time and is used to scale the update step.

  • \(\epsilon\) is a small constant for numerical stability (e.g., \(10^{-8}\)).

Unlike RMSProp, this quantity grows continuously over time, which causes the effective learning rate to shrink.

Core intuition.

AdaGrad rescales each parameter update according to the history of its gradients.

Parameters with consistently large gradients accumulate large values in \(G_t\), which leads to smaller effective learning rates.

This mechanism naturally slows down updates in frequently changing directions while allowing larger steps in directions with less activity.

Frequent updates → smaller steps  

Rare updates     → larger steps

Main limitation.

Since \(G_t\) keeps growing over time, the effective learning rate can become extremely small, causing the updates to slow down significantly or even stop.

Asymptotic behavior of the effective step size.

Suppose the gradient remains approximately constant, \(g_t \approx 1\). Then

\[ G_t \; = \; \sum_{i=1}^t g_i^2 \; \approx \; t, \]

so the effective step size behaves as

\[ \text{step size} \; \approx \; \frac{\alpha}{\sqrt{t + \epsilon}} \]

Since \(\epsilon\) is a very small constant (e.g., \(10^{-8}\)), it has a negligible effect for \(t \ge 1\). Therefore, for intuition, we can approximate:

\[ \text{step size} \; \approx \; \frac{\alpha}{\sqrt{t}}. \]

As \(t\) increases, the step size decreases:

  • \(t = 1 \quad \Rightarrow \quad \frac{\alpha}{\sqrt{1}} = \alpha\).

  • \(t = 100 \quad \Rightarrow \quad \frac{\alpha}{10}\).

  • \(t = 10{,}000 \quad \Rightarrow \quad \frac{\alpha}{100}\).

Thus, the step size decreases progressively, eventually slowing down learning.

This shows that AdaGrad continuously shrinks the learning rate, even when this may no longer be desirable.
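As a minimal sketch, assuming the quadratic loss \(L(w) = w^2\) used earlier (with illustrative values for \(\alpha\) and the starting point), the update rule can be applied directly; the accumulated term grows at every iteration and the steps shrink:

import numpy as np

# Minimal AdaGrad sketch on the quadratic loss L(w) = w^2 (illustrative values).
alpha, epsilon = 0.5, 1e-8
w, G = 3.0, 0.0

for t in range(1, 6):
    g = 2 * w                            # gradient of L(w) = w^2
    G += g ** 2                          # accumulate squared gradients
    step = alpha / np.sqrt(G + epsilon)  # effective step size
    w -= step * g
    print(f"t={t}: step={step:.3f}, w={w:.3f}")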

Numerical illustration.

This behavior is illustrated in Figure 9.1.

  • Panel (a) shows that the accumulated term \(G_t\) grows continuously over time.

  • Panel (b) shows that the effective step size decreases accordingly.

Thus, the simplified expression \(\frac{\alpha}{\sqrt{t}}\) captures the essential behavior observed in the figure.

import numpy as np
import matplotlib.pyplot as plt

# Parameters
alpha = 1.0
epsilon = 1e-8
T = 10000

# Constant gradient
g = 1.0

# Storage
t_vals = np.arange(1, T + 1)
G_vals = np.zeros(T)
step_vals = np.zeros(T)

# AdaGrad recursion
G = 0.0
for i in range(T):
    G = G + g**2
    G_vals[i] = G
    step_vals[i] = alpha / np.sqrt(G + epsilon)

# Create figure
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# -----------------------------
# (a) Accumulated term G_t
# -----------------------------
axes[0].plot(t_vals, G_vals, linewidth=2.5)
axes[0].set_title(r"(a) Accumulated term $G_t$", fontsize=21)
axes[0].set_xlabel(r"Iteration $t$", fontsize=19)
axes[0].set_ylabel(r"$G_t$", fontsize=19)
axes[0].tick_params(axis='both', labelsize=15)
axes[0].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)

for spine in ["top", "right"]:
    axes[0].spines[spine].set_visible(False)
axes[0].spines["left"].set_color("gray")
axes[0].spines["bottom"].set_color("gray")
axes[0].spines["left"].set_linewidth(0.8)
axes[0].spines["bottom"].set_linewidth(0.8)

# -----------------------------
# (b) Effective step size
# -----------------------------
axes[1].plot(t_vals, step_vals, linewidth=2.5)

# Highlight specific points
points_t = [1, 100, 10000]
points_y = alpha / np.sqrt(np.array(points_t) + epsilon)
axes[1].scatter(points_t, points_y, s=80, zorder=3)

# Labels for key points
offsets = [(60, 0.02), (200, 0.01), (200, 0.002)]
for (x, y), (dx, dy) in zip(zip(points_t, points_y), offsets):
    axes[1].text(x + dx, y + dy, f"t={x}", fontsize=15)

axes[1].set_title(r"(b) Effective step size", fontsize=21)
axes[1].set_xlabel(r"Iteration $t$", fontsize=19)
axes[1].set_ylabel(r"$\alpha / \sqrt{G_t + \epsilon}$", fontsize=18)
axes[1].tick_params(axis='both', labelsize=15)
axes[1].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)

for spine in ["top", "right"]:
    axes[1].spines[spine].set_visible(False)
axes[1].spines["left"].set_color("gray")
axes[1].spines["bottom"].set_color("gray")
axes[1].spines["left"].set_linewidth(0.8)
axes[1].spines["bottom"].set_linewidth(0.8)

plt.tight_layout()
plt.show()

Figure 9.1: Illustration of AdaGrad under an approximately constant gradient \(g_t \approx 1\). Panel (a) shows that the accumulated term \(G_t\) grows over time. Panel (b) shows that the effective step size \(\alpha / \sqrt{G_t + \epsilon}\) decreases steadily.

9.0.4 Root Mean Square Propagation (RMSProp)

Update rule and mechanism.

RMSProp (Tieleman and Hinton, 2012) addresses the limitation of AdaGrad by replacing the cumulative sum with an exponentially decaying average of past squared gradients. This allows the method to “forget” old gradients and maintain a more stable learning rate.

\[ v_t \;=\; \beta v_{t-1} + (1 - \beta) \, g_t^2, \quad\quad w_{t+1} \;=\; w_t \;-\; \frac{\alpha}{\sqrt{v_t + \epsilon}} \, g_t \]

Here,

  • \(g_t = \frac{\partial \text{cost}}{\partial w}(w_t)\) denotes the gradient evaluated at the parameter value \(w_t\).

  • \(v_t\) is an exponentially decaying average of past squared gradients, which captures the scale of recent gradients and is used to normalize the update step.

  • \(\beta \in (0,1)\) controls how fast past gradients are forgotten.

  • \(\epsilon\) is a small constant for numerical stability.

These quantities are hyperparameters:

  • They are not learned from the data, but instead are chosen beforehand.

  • In practice, commonly used values are \(\beta \approx 0.9\) and \(\epsilon \approx 10^{-8}\).

They can be viewed as knobs that control how the algorithm behaves during training.

Core intuition.

Focus on recent gradients instead of all history.

Main limitation.

Requires tuning of the decay parameter \(\beta\), and may still exhibit instability if the hyperparameters are not well chosen.

Asymptotic behavior under constant gradients.

Suppose the gradient remains approximately constant, \(g_t \approx 1\). Then the moving average satisfies

\[ v_t \;=\; \beta v_{t-1} \,+\, (1 - \beta). \]

Rewriting,

\[ v_t - 1 \;=\; \beta (v_{t-1} - 1). \]

By iterating this recursion,

\[ v_t - 1 \;=\; \beta^t (v_0 - 1). \]

Since \(0 < \beta < 1\), it follows that \(\beta^t \to 0\), and therefore

\[ v_t \;\to\; 1. \]

Thus, unlike AdaGrad, the accumulated quantity does not grow indefinitely, but instead converges to a constant value.

As a consequence, the effective step size

\[ \frac{\alpha}{\sqrt{v_t + \epsilon}} \]

stabilizes over time rather than continuously decreasing.

This prevents the learning rate from vanishing and allows the algorithm to keep making meaningful updates during training.
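As a minimal sketch under the same assumed quadratic loss \(L(w) = w^2\) and illustrative hyperparameter values, the rule above can be applied directly; because the moving average forgets old gradients, the effective step size does not shrink toward zero:

import numpy as np

# Minimal RMSProp sketch on the quadratic loss L(w) = w^2 (illustrative values).
alpha, beta, epsilon = 0.1, 0.9, 1e-8
w, v = 3.0, 0.0

for t in range(1, 6):
    g = 2 * w                              # gradient of L(w) = w^2
    v = beta * v + (1 - beta) * g ** 2     # exponentially decaying average
    step = alpha / np.sqrt(v + epsilon)    # effective step size
    w -= step * g
    print(f"t={t}: step={step:.3f}, w={w:.3f}")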

Numerical illustration.

This behavior is illustrated in Figure 9.2.

  • Panel (a) shows that the moving average \(v_t\) approaches a stable value.

  • Panel (b) shows that the effective step size stabilizes instead of shrinking indefinitely.

This contrasts with AdaGrad, where the step size decays continuously over time.

import numpy as np
import matplotlib.pyplot as plt

# Parameters
alpha = 1.0
beta = 0.9
epsilon = 1e-8
T = 50

# Constant gradient
g = 1.0

# Storage
t_vals = np.arange(1, T + 1)
v_vals = np.zeros(T)
step_vals = np.zeros(T)

# RMSProp recursion
v = 0.0
for t in range(T):
    v = beta * v + (1 - beta) * g**2
    v_vals[t] = v
    step_vals[t] = alpha / np.sqrt(v + epsilon)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# (a) v_t
axes[0].plot(t_vals, v_vals, linewidth=2)
axes[0].axhline(1.0, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[0].set_title("(a) Moving average $v_t$", fontsize=15)
axes[0].set_xlabel("Iteration $t$", fontsize=14)
axes[0].set_ylabel("$v_t$", fontsize=14)
axes[0].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
axes[0].tick_params(axis='both', labelsize=11)

# soften borders
for spine in ["top", "right"]:
    axes[0].spines[spine].set_visible(False)
axes[0].spines["left"].set_color("gray")
axes[0].spines["bottom"].set_color("gray")

# (b) effective step size
axes[1].plot(t_vals, step_vals, linewidth=2)
axes[1].axhline(1.0, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[1].set_title("(b) Effective step size", fontsize=15)
axes[1].set_xlabel("Iteration $t$", fontsize=14)
axes[1].set_ylabel(r"$\alpha / \sqrt{v_t + \epsilon}$", fontsize=14)
axes[1].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
axes[1].tick_params(axis='both', labelsize=11)

for spine in ["top", "right"]:
    axes[1].spines[spine].set_visible(False)
axes[1].spines["left"].set_color("gray")
axes[1].spines["bottom"].set_color("gray")

plt.tight_layout()
plt.show()

Figure 9.2: Illustration of RMSProp under an approximately constant gradient \(g_t \approx 1\). Panel (a) shows that the exponentially decaying average \(v_t\) approaches a stable value. Panel (b) shows that the effective step size \(\alpha / \sqrt{v_t + \epsilon}\) also stabilizes over time.

This stabilization of the effective step size is the key mechanism that allows RMSProp to maintain consistent learning dynamics over time.

Interpretation of the moving average.

For the constant-gradient case with \(g_t = 1\) and \(\beta=0.9\):

  • \(v_t\) tends to \(1\).

  • The denominator \(\sqrt{v_t \,+ \, \epsilon}\) stabilizes over time.

  • Therefore, \(\frac{\alpha}{\sqrt{v_t \, +\, \epsilon}}\) tends to \(\alpha\).

This shows clearly that RMSProp does not collapse in the same way as AdaGrad. For a constant gradient, the moving average approaches a finite limit, so the effective step size does not keep shrinking indefinitely.

9.0.5 Adaptive Moment Estimation (Adam)

Method description (update rule).

Adam (Kingma and Ba, 2014) combines ideas from momentum and RMSProp by maintaining exponentially decaying averages of both the gradients and their squared values. This allows the method to adapt the learning rate while also incorporating information about the direction of past updates.

\[ m_t \;=\; \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \quad\quad v_t \;=\; \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \]

\[ w_{t+1} \;=\; w_t \;-\; \frac{\alpha}{\sqrt{v_t + \epsilon}} \, m_t \]

Here,

  • \(g_t = \frac{\partial \text{cost}}{\partial w}(w_t)\) denotes the gradient at iteration \(t\).

  • \(m_t\) is an exponentially decaying average of past gradients, capturing the direction of the updates (momentum).

  • \(v_t\) is an exponentially decaying average of squared gradients, capturing the scale of recent gradients.

  • \(\beta_1, \beta_2 \in (0,1)\) control how fast past information is forgotten.

  • \(\epsilon\) is a small constant for numerical stability.

These quantities are hyperparameters. In practice, commonly used values are:

\[ \beta_1 \approx 0.9, \qquad \beta_2 \approx 0.999, \qquad \epsilon \approx 10^{-8} \]

Core intuition.

  • Combine momentum (direction) and adaptive scaling (variance) to produce stable and efficient updates.

  • Frequent and consistent gradients → stable direction (via \(m_t\))

  • Large gradient variability → smaller steps (via \(v_t\)).

Limitation.

  • Requires tuning of multiple hyperparameters and may sometimes lead to suboptimal generalization compared to simpler methods such as SGD with momentum.

  • In addition, the interaction between \(m_t\) and \(v_t\) may introduce implicit bias in the optimization dynamics.

Asymptotic behavior under constant gradients.

Suppose the gradient remains approximately constant, \(g_t \approx 1\). Then, both \(m_t\) and \(v_t\) follow exponential moving average recursions of a constant value.

  • Behavior of the first moment (\(m_t\)).

It holds:

\[ m_t \;=\; \beta_1 m_{t-1} \,+\, (1 \,-\, \beta_1) g_t \quad \approx\quad \beta_1 m_{t-1} \,+\, (1 \,-\, \beta_1) \]

Rewriting,

\[ m_t \,-\, 1 \;=\; \beta_1 (m_{t-1} \,-\, 1). \]

This recursive relation can be expanded step by step:

\[ m_t \,-\, 1 \;=\; \beta_1^2 (m_{t-2} \,- \, 1) \;=\;\cdots \;=\; \beta_1^t (m_0 \,-\, 1), \]

where \(m_0\) is the initial value of the moving average.

Since \(0 < \beta_1 < 1\), it follows that \(\beta_1^t \to 0\), and therefore \(m_t \to 1\).

  • Behavior of the second moment (\(v_t\)).

A similar argument applies to \(v_t\), yielding \(v_t \to 1\), since \(g_t^2 \approx 1\) implies that \(v_t\) is also an exponential moving average of a constant value.

In summary, as \(t\) increases:

\[ m_t \;\to\; 1, \quad \text{and}\quad v_t \;\to\; 1 \]

In this case, the update rule becomes approximately

\[ w_{t+1} \;\approx\; w_t \;-\; \alpha. \]

This shows that Adam stabilizes both the direction and the scale of the updates, avoiding the continuous shrinkage observed in AdaGrad and complementing the variance normalization of RMSProp with momentum-based smoothing.

Unlike AdaGrad, the effective step size does not vanish over time, and unlike RMSProp, the update direction is smoothed through momentum.
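As a minimal sketch under the same assumed quadratic loss \(L(w) = w^2\) and illustrative hyperparameter values, the simplified update above can be applied directly (the original formulation in Kingma and Ba (2014) also applies a bias correction to \(m_t\) and \(v_t\), which is omitted here to match the simplified rule):

import numpy as np

# Minimal Adam sketch (simplified, without bias correction) on L(w) = w^2.
alpha, beta1, beta2, epsilon = 0.1, 0.9, 0.999, 1e-8
w, m, v = 3.0, 0.0, 0.0

for t in range(1, 6):
    g = 2 * w                              # gradient of L(w) = w^2
    m = beta1 * m + (1 - beta1) * g        # first moment (direction)
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment (scale)
    w -= alpha / np.sqrt(v + epsilon) * m
    print(f"t={t}: m={m:.3f}, v={v:.3f}, w={w:.3f}")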

Numerical illustration.

This behavior is illustrated in Figure 9.3.

  • Panel (a) shows that the first moment estimate \(m_t\) converges to 1.

  • Panel (b) shows that the second moment estimate \(v_t\) also converges to 1.

  • Panel (c) shows that the effective step size stabilizes instead of vanishing over time.

This confirms that Adam maintains a stable learning dynamics by combining momentum and adaptive scaling.

import numpy as np
import matplotlib.pyplot as plt

# Parameters
alpha = 1.0
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
T = 300

# Constant gradient
g = 1.0

# Storage
t_vals = np.arange(1, T + 1)
m_vals = np.zeros(T)
v_vals = np.zeros(T)
step_vals = np.zeros(T)

# Initialization
m = 0.0
v = 0.0

# Adam recursion
for t in range(T):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    
    m_vals[t] = m
    v_vals[t] = v
    step_vals[t] = alpha / np.sqrt(v + epsilon)

# Create figure
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# -----------------------------
# (a) m_t
# -----------------------------
axes[0].plot(t_vals, m_vals, linewidth=2.5)
axes[0].axhline(1, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[0].set_title(r"(a) First moment $m_t$", fontsize=20)
axes[0].set_xlabel(r"Iteration $t$", fontsize=16)
axes[0].set_ylabel(r"$m_t$", fontsize=16)
axes[0].tick_params(axis='both', labelsize=14)
axes[0].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)

# soften borders
for spine in ["top", "right"]:
    axes[0].spines[spine].set_visible(False)
axes[0].spines["left"].set_color("gray")
axes[0].spines["bottom"].set_color("gray")

# -----------------------------
# (b) v_t
# -----------------------------
axes[1].plot(t_vals, v_vals, linewidth=2.5)
axes[1].axhline(1, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[1].set_title(r"(b) Second moment $v_t$", fontsize=20)
axes[1].set_xlabel(r"Iteration $t$", fontsize=16)
axes[1].set_ylabel(r"$v_t$", fontsize=16)
axes[1].tick_params(axis='both', labelsize=14)
axes[1].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)

for spine in ["top", "right"]:
    axes[1].spines[spine].set_visible(False)
axes[1].spines["left"].set_color("gray")
axes[1].spines["bottom"].set_color("gray")

# -----------------------------
# (c) Step size
# -----------------------------
axes[2].plot(t_vals, step_vals, linewidth=2.5)
axes[2].axhline(alpha, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[2].set_title(r"(c) Effective step size", fontsize=20)
axes[2].set_xlabel(r"Iteration $t$", fontsize=16)
axes[2].set_ylabel(r"$\alpha / \sqrt{v_t + \epsilon}$", fontsize=16)
axes[2].tick_params(axis='both', labelsize=14)
axes[2].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)

for spine in ["top", "right"]:
    axes[2].spines[spine].set_visible(False)
axes[2].spines["left"].set_color("gray")
axes[2].spines["bottom"].set_color("gray")

plt.tight_layout()
plt.show()

Figure 9.3: Illustration of Adam under an approximately constant gradient \(g_t \approx 1\). Panel (a) shows that the first moment estimate \(m_t\) converges to 1. Panel (b) shows that the second moment estimate \(v_t\) also converges to 1. Panel (c) shows that the effective step size stabilizes instead of vanishing.

Comparative perspective: AdaGrad, RMSProp, and Adam.

Figure 9.4 provides a visual summary of the main behavior of the three adaptive methods. Table 9.1 then compares their key ideas, update rules, advantages, and limitations.

import numpy as np
import matplotlib.pyplot as plt

# Parameters
alpha = 1.0
epsilon = 1e-8
beta = 0.9
beta1 = 0.9
beta2 = 0.999
T = 300
g = 1.0

t_vals = np.arange(1, T + 1)

# AdaGrad
G = 0.0
adagrad_step = np.zeros(T)
for i in range(T):
    G += g**2
    adagrad_step[i] = alpha / np.sqrt(G + epsilon)

# RMSProp
v_rms = 0.0
rmsprop_step = np.zeros(T)
for i in range(T):
    v_rms = beta * v_rms + (1 - beta) * g**2
    rmsprop_step[i] = alpha / np.sqrt(v_rms + epsilon)

# Adam
m = 0.0
v_adam = 0.0
adam_step = np.zeros(T)
for i in range(T):
    m = beta1 * m + (1 - beta1) * g
    v_adam = beta2 * v_adam + (1 - beta2) * g**2
    adam_step[i] = alpha / np.sqrt(v_adam + epsilon)

# Plot
plt.figure(figsize=(8, 5))
plt.plot(t_vals, adagrad_step, linewidth=2.5, label="AdaGrad")
plt.plot(t_vals, rmsprop_step, linewidth=2.5, label="RMSProp")
plt.plot(t_vals, adam_step, linewidth=2.5, label="Adam")

plt.xlabel("Iteration $t$", fontsize=15)
plt.ylabel("Effective step size", fontsize=15)
plt.title("Summary behavior of adaptive methods", fontsize=18)
plt.tick_params(axis='both', labelsize=13)
plt.grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
plt.legend(fontsize=13)

ax = plt.gca()
for spine in ["top", "right"]:
    ax.spines[spine].set_visible(False)
ax.spines["left"].set_color("gray")
ax.spines["bottom"].set_color("gray")
ax.spines["left"].set_linewidth(0.8)
ax.spines["bottom"].set_linewidth(0.8)

plt.tight_layout()
plt.show()

Figure 9.4: Conceptual comparison of the effective step size in AdaGrad, RMSProp, and Adam under a simplified constant-gradient setting. AdaGrad shows a continuously decreasing step size, whereas RMSProp and Adam stabilize over time.

The main differences among these adaptive methods are summarized in Table 9.1.

Table 9.1: Comparative summary: AdaGrad, RMSProp, and Adam.
| Method | Key idea | Update rule | Advantage | Limitation |
|---|---|---|---|---|
| AdaGrad (Duchi et al., 2011) | Accumulates all past squared gradients | \(w_{t+1} = w_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} g_t\) | Simple, works well for sparse data | Learning rate shrinks continuously |
| RMSProp (Tieleman and Hinton, 2012) | Uses exponential moving average of squared gradients | \(w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} g_t\) | Prevents learning rate from vanishing too quickly | Requires tuning of decay parameter |
| Adam (Kingma and Ba, 2014) | Combines momentum and adaptive scaling | \(w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} m_t\) | Fast and stable convergence in practice | More complex, more hyperparameters |

In the next section, we examine how the learning rate is incorporated into the gradient descent algorithm and how it interacts with the loss function and its gradient.

10 Gradient Descent

10.0.1 Definition and update rule

To minimize the loss function, neural networks rely on gradient-based optimization. The gradient indicates the direction of steepest increase of the loss function, so moving in the opposite direction reduces its value.

Vector formulation.

For a parameter vector \(\mathbf{w}\), the update rule is given by

\[ \mathbf{w}^{(t+1)} \;=\; \mathbf{w}^{(t)} \,-\, \alpha \, \nabla \,L(\mathbf{w}^{(t)}), \]

where \(\alpha > 0\) is the learning rate, and \(\nabla L(\mathbf{w}^{(t)})\) is the gradient of the loss function evaluated at the current point.

The process starts from an initial parameter vector \(\mathbf{w}^{(0)}\) (typically chosen randomly) and is iteratively updated for \(t = 0, 1, 2, \dots\).

The gradient \(\nabla L(\mathbf{w})\) generalizes the derivative to multiple dimensions and indicates the direction of steepest increase in the loss.

This vector formulation emphasizes that all parameters are updated simultaneously in the direction of steepest descent.

Scalar case (intuition).

For a single parameter \(w\), the update rule reduces to

\[ w^{(t+1)} \;=\; w^{(t)} \,- \,\alpha \,\frac{dL}{dw}(w^{(t)}). \]

Here, \(\frac{dL}{dw}\) denotes the standard derivative of the loss function with respect to the scalar parameter \(w\).

This scalar version is consistent with the vector formulation and is useful for building intuition in simple examples.

Practical remark.

In practice, neural networks involve many parameters, so the vector formulation is more natural.
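As a minimal sketch, assuming the simple quadratic loss \(L(\mathbf{w}) = \|\mathbf{w}\|^2\) (whose gradient is \(2\mathbf{w}\)), the vector update can be written in a few lines; the values below match the setting used in Figures 10.1 and 10.2:

import numpy as np

# Minimal vectorized gradient descent sketch on L(w) = ||w||^2 (illustrative values).
def grad_L(w):
    return 2 * w                 # gradient of ||w||^2

alpha = 0.35
w = np.array([3.0, 3.0])         # initial parameter vector w^(0)

for t in range(10):
    w = w - alpha * grad_L(w)    # all parameters updated simultaneously

print(w)                         # close to the minimizer (0, 0)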

10.0.2 Geometric interpretation of gradient descent

This process can be interpreted geometrically as movement over a loss surface toward regions of lower values (see Figures 10.1 or 10.2).

Loss surface.

The loss function defines a surface over the parameter space.

Example: quadratic surface.

In this example, the function

\[ L(x,y) = x^2 + y^2 \]

produces a convex, bowl-shaped surface with a unique global minimum at the origin.

Figure 10.1: Gradient descent simulation on \(f(x,y)=x^2+y^2\) from initial point \((3,3)\), learning rate \(\alpha=0.35\), and 10 steps. Adapted from the interactive visualization developed by [Llinás and Llinás (2026)](https://gradientdescent-jkz8f8lzb49vc9efxjye5h.streamlit.app/).

Figure 10.1 provides an initial geometric interpretation of gradient descent on the surface defined by \(L(x,y)=x^2+y^2\).

  • The blue point represents the initial parameter values \((3,3)\), while the red point represents the final iterate obtained after several iterations of gradient descent.

  • The trajectory between these points shows how the algorithm progressively updates the parameters in the direction opposite to the gradient, moving toward the minimum at \((0,0)\).

  • Each update is scaled by the learning rate \(\alpha = 0.35\), which determines the magnitude of the step. A suitable choice of \(\alpha\) allows efficient convergence, whereas poor choices may lead to slow convergence or instability.

  • The original visualization also includes a table of iteration values, which is presented separately in Table 10.1.

Figure 10.2 provides a complementary view by combining the descent trajectory with the descent vector field and the corresponding level sets. The numerical values reported in Table 10.1 show the evolution of the parameters, the loss, and the partial derivatives at each iteration. Together, these visual and numerical representations illustrate how the updates become progressively smaller as the algorithm approaches the minimum.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patheffects as pe

# ==========================================================
# Global font configuration
# ==========================================================
plt.rcParams.update({
    "font.size": 30,
    "axes.titlesize": 30,
    "axes.labelsize": 30,
    "xtick.labelsize": 26,
    "ytick.labelsize": 26
})

# ==========================================================
# Function and gradient
# ==========================================================
def L(x, y):
    return x**2 + y**2

def grad(x, y):
    return np.array([2*x, 2*y])

# Negative gradient: direction of descent
def neg_grad(x, y):
    return np.array([-2*x, -2*y])

# ==========================================================
# Gradient descent parameters
# ==========================================================
alpha = 0.35
steps = 10
x0, y0 = 3.0, 3.0

# ==========================================================
# Gradient descent simulation
# ==========================================================
rows = []
xs = [x0]
ys = [y0]
zs = [L(x0, y0)]

x, y = x0, y0
for n in range(steps + 1):
    gx, gy = grad(x, y)
    rows.append([n, x, y, L(x, y), gx, gy])
    if n < steps:
        x = x - alpha * gx
        y = y - alpha * gy
        xs.append(x)
        ys.append(y)
        zs.append(L(x, y))

# ==========================================================
# Surface grid
# ==========================================================
grid = np.linspace(-4.5, 4.5, 120)
X, Y = np.meshgrid(grid, grid)
Z = L(X, Y)

vmin = Z.min()
vmax = Z.max()

# ==========================================================
# Vector field grid (coarser for arrows)
# ==========================================================
g = np.linspace(-4.5, 4.5, 19)
Xg, Yg = np.meshgrid(g, g)
Ug, Vg = neg_grad(Xg, Yg)   # arrows point toward the minimum

# ==========================================================
# Figure layout
# ==========================================================
fig = plt.figure(figsize=(25, 17))
gs = fig.add_gridspec(1, 2, width_ratios=[2.4, 2.4], wspace=0.2)

# ==========================================================
# (a) 3D surface + descent field + trajectory
# ==========================================================
ax1 = fig.add_subplot(gs[0, 0], projection='3d')

surf = ax1.plot_surface(
    X, Y, Z,
    cmap="viridis",
    alpha=0.4,
    edgecolor="k",
    linewidth=0.35,
    vmin=vmin,
    vmax=vmax
)

# Descent vector field projected onto z = 0
ax1.quiver(
    Xg, Yg, 0,
    Ug, Vg, 0,
    color="black",
    length=0.35,
    normalize=True
)

# Trajectory
ax1.plot(xs, ys, zs, color="darkred", linewidth=6)
ax1.scatter(xs, ys, zs, color="red", s=36)

# Highlight initial and final points
ax1.scatter(xs[0], ys[0], zs[0], color="blue", s=295)
ax1.scatter(xs[-1], ys[-1], zs[-1], color="red", s=295)

# Label selected iterations
label_idx = [0, 1, 2, 5, 10]
for i in label_idx:
    ax1.text(
        xs[i] + 0.08, ys[i] + 0.08, zs[i] + 0.8, f"{i}",
        fontsize=10, color="black"
    )
                 
# Initial / final labels
ax1.text( # 🔵 Initial point
    xs[0] + 0.7, ys[0] + 0.6, zs[0] - 7.0,    
    "Initial \npoint", color="blue", fontsize=29, fontweight="bold"
)
ax1.text( # 🔴 Final point
    xs[-1] - 0.7, ys[-1] - 0.8, zs[-1] - 11,  
    "Final \npoint", color="red", fontsize=29, fontweight="bold"
)

ax1.set_title("(a) Surface of $L(x,y)$ with descent field and trajectory", pad=10)
ax1.set_xlabel("x", labelpad=15)
ax1.set_ylabel("y", labelpad=15)
ax1.set_zlabel("$L(x,y)$", labelpad=20)

ax1.tick_params(axis='x', pad=5)
ax1.tick_params(axis='y', pad=5)
ax1.zaxis.set_tick_params(pad=10)

ax1.view_init(elev=24, azim=-63)

# ==========================================================
# (b) Contours + descent field + trajectory
# ==========================================================
ax2 = fig.add_subplot(gs[0, 1])

levels = np.linspace(vmin, vmax, 10)

cont = ax2.contour(
    X, Y, Z,
    levels=levels,
    cmap="viridis",
    linewidths=1.6,
    vmin=vmin,
    vmax=vmax
)

# Descent vector field in 2D
ax2.quiver(
    Xg, Yg, Ug, Vg,
    color="black",
    angles="xy",
    scale_units="xy",
    scale=28,
    width=0.0035
)

# Trajectory
ax2.plot(xs, ys, color="red", linewidth=6)
ax2.scatter(xs, ys, color="red", s=42, zorder=3)

# Initial and final points
ax2.scatter(xs[0], ys[0], color="blue", s=295, zorder=4)
ax2.scatter(xs[-1], ys[-1], color="red", s=295, zorder=4)

# Label selected iterations
for i in label_idx:
    ax2.text(
        xs[i] + 0.10, ys[i] + 0.10, f"{i}",
        fontsize=10, color="black",
        path_effects=[pe.withStroke(linewidth=2, foreground="white")]
    )

# Initial / final labels
ax2.text(
    xs[0] + 0.35, ys[0] + 0.25, "Initial \npoint",
    color="blue", fontsize=29, fontweight="bold"
)
ax2.text(
    xs[-1] - 0.55, ys[-1] - 0.85, "Final \npoint",
    color="red", fontsize=29, fontweight="bold"
)

ax2.set_title("(b) Level sets of $L(x,y)$ with descent field and trajectory", pad=10)
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.set_aspect("equal", adjustable="box")

# ==========================================================
# Shared horizontal colorbar
# ==========================================================
cbar_ax = fig.add_axes([0.36, 0.09, 0.28, 0.025]) #[left, bottom, width, height]
cbar = fig.colorbar(surf, cax=cbar_ax, orientation="horizontal")
cbar.set_label("$L(x,y)$", labelpad=2)

# ==========================================================
# Main title and spacing
# ==========================================================
fig.suptitle(
    "Gradient descent on $L(x,y)=x^2+y^2$",
    fontsize=34,
    y=0.97
)

fig.subplots_adjust(
    top=0.86,
    bottom=0.18,
    left=0.05,
    right=0.98
)

plt.show()

Figure 10.2: Gradient descent on \(L(x,y)=x^2+y^2\) from \((3,3)\) with learning rate \(\alpha=0.35\) and \(10\) steps. The left panel shows the surface together with the descent trajectory and a descent vector field projected onto the plane. The right panel shows the corresponding level sets, the same trajectory, and arrows pointing toward the minimum.

The corresponding numerical values are shown in the following table, which reports the evolution of the parameters, the loss, and the partial derivatives at each iteration.

import pandas as pd

df_descenso = pd.DataFrame(
    rows,
    columns=["n", "x", "y", "L(x,y)", "∂L/∂x", "∂L/∂y"]
).round(3)
# R chunk (kableExtra): render the Python data frame df_descenso via reticulate
library(kableExtra)

df_descenso_r <- py$df_descenso

kable(
  df_descenso_r,
  align = "cccccc",
  escape = FALSE,
  col.names = c("n", "x", "y",  "$L(x,y)$", 
                "$\\partial L / \\partial x$", "$\\partial L / \\partial y$"
                )) %>%
  
  kable_styling() %>%
  kable_classic_2(full_width = FALSE) %>%
  
  row_spec(0, background = "lightgreen", bold = TRUE) %>% 
  row_spec(1, color = "Darkblue", bold = TRUE) %>%           # 🔵 Initial point (n = 0)
  row_spec(11, color = "red", bold = TRUE)                   # 🔴 Final point (n = 10)
Table 10.1: Gradient descent iterations on \(L(x,y)=x^2+y^2\) from \((3,3)\), \(\alpha=0.35\), and \(10\) steps.
n x y \(L(x,y)\) \(\partial L / \partial x\) \(\partial L / \partial y\)
0 3.000 3.000 18.000 6.000 6.000
1 0.900 0.900 1.620 1.800 1.800
2 0.270 0.270 0.146 0.540 0.540
3 0.081 0.081 0.013 0.162 0.162
4 0.024 0.024 0.001 0.049 0.049
5 0.007 0.007 0.000 0.015 0.015
6 0.002 0.002 0.000 0.004 0.004
7 0.001 0.001 0.000 0.001 0.001
8 0.000 0.000 0.000 0.000 0.000
9 0.000 0.000 0.000 0.000 0.000
10 0.000 0.000 0.000 0.000 0.000

11 From gradient descent to backpropagation

Although gradient descent provides the general mechanism for updating parameters, neural networks typically involve a large number of interconnected weights organized in layers. The forward propagation structure illustrated in Figure 6.1 also serves as the computational graph over which gradients are propagated backward through successive applications of the chain rule.

To formalize this, we recall the layer-wise notation introduced earlier:

  • \(n_j\): number of neurons in layer \(j\).

  • \(\mathbf{h}^{(j)} \in \mathbb{R}^{n_j}\): the output (activation vector) of layer \(j\). Each component corresponds to the output of one neuron.

  • \(\mathbf{z}^{(j)} \in \mathbb{R}^{n_j}\): the pre-activation signal at layer \(j\), obtained before applying the activation function.

Each layer computes:

\[ \mathbf{z}^{(j)} \;=\; \mathbf{W}^{(j-1,j)\top} \,\mathbf{h}^{(j-1)} \,+\, \mathbf{b}^{(j)}, \qquad \mathbf{h}^{(j)} \;=\; f^{(j)}(\mathbf{z}^{(j)}), \]

where:

  • \(f^{(j)}\) denotes the activation function applied element-wise.

  • \(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{\,n_{j-1} \times n_j}\): the weight matrix connecting layer \(j-1\) to layer \(j\). Each column contains the weights associated with one neuron in layer \(j\).

  • \(\mathbf{b}^{(j)} \in \mathbb{R}^{n_j}\): the bias vector of layer \(j\).

This formulation generalizes the single-neuron model introduced earlier, where \(z = \mathbf{w}^\top \mathbf{x} + b\).
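As a minimal sketch of this layer-wise computation (assuming a sigmoid activation and illustrative layer sizes and parameter values), the forward pass can be written as a loop over layers:

import numpy as np

# Minimal layer-wise forward pass following the notation above.
# W[j] has shape (n_{j-1}, n_j); b[j] has shape (n_j,). Values are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Return h^(N), starting from h^(0) = x."""
    h = x
    for W, b in zip(weights, biases):
        z = W.T @ h + b      # pre-activation z^(j)
        h = f(z)             # activation h^(j)
    return h

rng = np.random.default_rng(1)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]   # layers: 2 -> 3 -> 1
biases = [np.zeros(3), np.zeros(1)]

x = np.array([0.5, -1.0])
print(forward(x, weights, biases))    # network output y_hat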

12 Backpropagation (conceptual)

12.0.1 The local error

Backpropagation is the algorithm used to compute gradients efficiently in multilayer neural networks. It applies the chain rule to propagate error information from the output layer back to earlier layers.

For each layer \(j\), define the local error:

\[ \boldsymbol{\delta}^{(j)} = \frac{\partial L}{\partial \mathbf{z}^{(j)}} \in \mathbb{R}^{n_j}. \]

The term \(\boldsymbol{\delta}^{(j)}\) represents how sensitive the loss function is to changes in the pre-activation signal at layer \(j\).

Recall that the loss function \(L(\mathbf{y}, \hat{\mathbf{y}})\) was introduced earlier to quantify the discrepancy between predictions and targets. The local error \(\boldsymbol{\delta}^{(j)}\) is therefore derived from this loss and propagates its influence through the network.

For the output layer \(j = N\), the local error depends directly on the loss function. For example, in the case of mean squared error (MSE),

\[ \boldsymbol{\delta}^{(N)} = (\hat{\mathbf{y}} - \mathbf{y}) \odot f^{(N)\prime}(\mathbf{z}^{(N)}). \]

This provides the starting point for the backward propagation of errors.

Conceptually, backpropagation can be viewed as a systematic application of the chain rule over the composition of functions defined by the network.

12.0.2 Backward propagation

For hidden layers, the error is propagated backward according to

\[ \boldsymbol{\delta}^{(j-1)} \; =\; \underbrace{\left(\mathbf{W}^{(j-1,j)} \boldsymbol{\delta}^{(j)}\right)}_{\in \,\mathbb{R}^{n_{j-1}}} \;\odot\; \underbrace{f^{(j-1)\prime}(\mathbf{z}^{(j-1)})}_{\in \,\mathbb{R}^{n_{j-1}}}. \]

where:

  • \(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{n_{j-1} \times n_j}\)

  • \(\boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_j}\)

  • \(f^{(j-1)\prime}\) denotes the derivative of the activation function with respect to its input, applied element-wise. That is,

\[ f^{(j-1)\prime}(\mathbf{z}^{(j-1)}) \; =\; \begin{pmatrix} \frac{d}{dz}f^{(j-1)}(z_1) \\ \vdots \\ \frac{d}{dz}f^{(j-1)}(z_{n_{j-1}}) \end{pmatrix}. \]

  • \(\odot\) denotes element-wise (Hadamard) multiplication. For two matrices (or vectors) of the same dimension (e.g., \(r \times s\)), it is defined as

\[ \mathbf{A} \odot \mathbf{B} = \left( a_{ij} b_{ij} \right)_{1 \le i \le r,\; 1 \le j \le s} \;\in\; \mathbb{R}^{r \times s} \]

where \(\mathbf{A}, \mathbf{B} \in \mathbb{R}^{r \times s}\). That is, each entry of the result is obtained by multiplying the corresponding entries of \(\mathbf{A}\) and \(\mathbf{B}\). This operation is only defined when both operands have the same dimensions.

This contrasts with matrix multiplication, which combines rows and columns rather than corresponding entries.

Example: element-wise (Hadamard) multiplication.

Let

\[ \mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}. \]

Then

\[ \mathbf{a} \odot \mathbf{b} = \begin{pmatrix} a_1 b_1 \\ a_2 b_2 \\ a_3 b_3 \end{pmatrix}. \]

That is, each component of the resulting vector is obtained by multiplying the corresponding components.
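In NumPy, the element-wise product of two arrays with the same shape is simply the * operator, for example:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a * b)   # element-wise (Hadamard) product: [ 4. 10. 18.]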

Remark (dimensional consistency).

The dimensions of the variables involved are:

  • \(\boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_j}\)
  • \(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{n_{j-1} \times n_j}\)

Thus,

\[ \mathbf{W}^{(j-1,j)} \boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_{j-1}}, \]

which matches the dimension of \(\boldsymbol{\delta}^{(j-1)}\).

This dimensional consistency guarantees that the backward propagation of errors is well-defined across layers.

12.0.3 Gradients with respect to the weights and biases

Using the local error terms, the gradients with respect to the model parameters are:

\[ \frac{\partial L}{\partial \mathbf{W}^{(j-1,j)}} = \mathbf{h}^{(j-1)} \boldsymbol{\delta}^{(j)\top}, \qquad \frac{\partial L}{\partial \mathbf{b}^{(j)}} = \boldsymbol{\delta}^{(j)}. \]

where:

  • \(\mathbf{h}^{(j-1)} \in \mathbb{R}^{n_{j-1}}\)
  • \(\boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_j}\)

Thus,

\[ \mathbf{h}^{(j-1)} \boldsymbol{\delta}^{(j)\top} \in \mathbb{R}^{n_{j-1} \times n_j}, \]

which matches the dimension of \(\mathbf{W}^{(j-1,j)}\).

This recursive computation allows gradients to be obtained efficiently, making training feasible even for deep neural networks.
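As a minimal sketch of these formulas for a small 2–2–1 network (assuming sigmoid activations, the squared-error loss used earlier, and illustrative parameter values), the local errors and parameter gradients can be computed as follows:

import numpy as np

# Minimal backpropagation sketch for a 2-2-1 network (illustrative values).
# Conventions follow the text: z^(j) = W^{(j-1,j)T} h^(j-1) + b^(j), h^(j) = f(z^(j)).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

W01 = rng.normal(size=(2, 2))    # connects layer 0 (input) to layer 1 (hidden)
b1 = np.zeros(2)
W12 = rng.normal(size=(2, 1))    # connects layer 1 (hidden) to layer 2 (output)
b2 = np.zeros(1)

# One XOR-style observation
x = np.array([1.0, 0.0])
y = np.array([1.0])

# Forward pass
z1 = W01.T @ x + b1
h1 = sigmoid(z1)
z2 = W12.T @ h1 + b2
y_hat = sigmoid(z2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: local errors
delta2 = (y_hat - y) * sigmoid_prime(z2)        # output layer
delta1 = (W12 @ delta2) * sigmoid_prime(z1)     # hidden layer

# Gradients with respect to weights and biases
grad_W12 = np.outer(h1, delta2)   # shape (2, 1), matches W12
grad_b2 = delta2
grad_W01 = np.outer(x, delta1)    # shape (2, 2), matches W01
grad_b1 = delta1

# Gradient descent update (learning rate as an assumed value)
alpha = 0.5
W12 -= alpha * grad_W12
b2 -= alpha * grad_b2
W01 -= alpha * grad_W01
b1 -= alpha * grad_b1

print(f"loss = {loss:.4f}")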

12.0.4 Connection with the single-neuron model

The expressions above generalize the single-neuron formulation introduced earlier:

\[ z = \mathbf{w}^\top \mathbf{x} + b. \]

In this case:

  • \(\mathbf{x} \in \mathbb{R}^d\)

  • \(\mathbf{w} \in \mathbb{R}^d\)

  • \(b \in \mathbb{R}\)

This corresponds to a network with:

  • \(n_{j-1} = d\)

  • \(n_j = 1\)

Thus, backpropagation can be interpreted as a systematic extension of gradient computation from a single neuron to multilayer architectures.

12.0.5 Gradient-based optimization (remark)

Gradient-based methods rely on first-order information. That is, they use only the gradient of the loss function, without explicitly considering curvature.

More advanced methods incorporate second-order information (e.g., Hessians), but they are often computationally expensive in high-dimensional models.

13 Challenges in gradient-based optimization

Although gradient descent is conceptually simple, its practical behavior is strongly influenced by the geometry of the loss surface associated with neural networks.

In high-dimensional models, the loss function defines a complex landscape where each point corresponds to a particular configuration of the model parameters. This surface is typically highly non-convex, which gives rise to several important challenges during optimization.

13.0.1 Geometric structure of the loss surface

The loss surface may contain a variety of geometric features, including:

  • Local minima, where the gradient vanishes but the solution is not globally optimal.

  • Saddle points, where the gradient is zero but there exist directions of both ascent and descent.

  • Plateaus, where the gradient is very small and learning progresses slowly.

  • Narrow valleys, where curvature differs significantly across directions.

In high-dimensional settings, saddle points are often more prevalent than local minima and can significantly slow down the training process.

These phenomena can be visualized as movement across a complex surface, where the optimization trajectory may oscillate, stagnate, or progress unevenly depending on the local geometry.

13.0.2 Illustrative examples of common geometric difficulties

The following examples provide simple visual illustrations of some of the most common geometric difficulties encountered in gradient-based optimization. In particular, Figures 13.1-13.4 show representative examples of a local minimum, a saddle point, a plateau region, and a narrow valley, respectively. In each case, the function and its derivative or gradient are displayed to help connect the analytical expression with its geometric behavior.

Example: local minima and maxima.

Consider the function

\[ f(x) \;=\; x^4 - 2x^2, \qquad f'(x) \;=\; 4x^3 - 4x. \]

Its second derivative is

\[ f''(x)\;= \; 12x^2-4. \]

Since this is a one-dimensional function, the Hessian reduces to the \(1\times 1\) matrix

\[ H_f(x) \; =\; \begin{pmatrix} f''(x) \end{pmatrix} \; =\; \begin{pmatrix} 12x^2-4 \end{pmatrix} \; =\; 12x^2-4. \]

Graphical interpretation (local minima and maxima).

Figure 13.1 illustrates both the function and its derivative, highlighting the locations where the gradient vanishes.

  • This function has several critical points where \(f'(x)=0\).

  • The second derivative helps classify those critical points:

    • if \(f''(x)>0\), the function is locally convex and the point is a local minimum;

    • if \(f''(x)<0\), the function is locally concave and the point is a local maximum.

  • This illustrates how first-order information identifies critical points, while second-order information helps determine their local nature, as worked out explicitly below.
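Concretely, solving \(f'(x) = 4x(x^2 - 1) = 0\) gives the critical points \(x = -1,\, 0,\, 1\), and the second derivative classifies them:

\[ f''(\pm 1) = 8 > 0 \;\Rightarrow\; \text{local minima at } x = \pm 1 \;(f(\pm 1) = -1), \qquad f''(0) = -4 < 0 \;\Rightarrow\; \text{local maximum at } x = 0. \]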

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({
    "font.size": 9,
    "axes.titlesize": 11,
    "axes.labelsize": 9,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8
})

x = np.linspace(-2, 2, 500)
f = x**4 - 2*x**2
df = 4*x**3 - 4*x

fig, axes = plt.subplots(1, 2, figsize=(8, 3.5))

# Function
axes[0].plot(x, f, linewidth=1.5)
axes[0].axhline(0, linestyle='--', linewidth=0.7)
axes[0].axvline(0, linestyle='--', linewidth=0.7)
axes[0].set_title(r"(a) $f(x)=x^4-2x^2$")
axes[0].set_xlabel("x")
axes[0].set_ylabel("f(x)")
axes[0].grid(True, alpha=0.3)

# Gradient
axes[1].plot(x, df, linewidth=1.5)
axes[1].axhline(0, linestyle='--', linewidth=0.7)
axes[1].axvline(0, linestyle='--', linewidth=0.7)
axes[1].set_title(r"(b) $f'(x)=4x^3-4x$")
axes[1].set_xlabel("x")
axes[1].set_ylabel("f'(x)")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Figure 13.1: Local minimum example: (a) Function \(f(x)=x^4-2x^2\) and (b) its derivative \(f'(x)=4x^3-4x\).

Example: saddle point.

Consider the function

\[ f(x,y)\; =\; x^2-y^2, \qquad \nabla f(x,y)\; =\; \begin{pmatrix} 2x \\ -2y \end{pmatrix}. \]

The Hessian matrix is

\[ H_f(x,y)\; =\; \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}. \]

Graphical interpretation (saddle point).

Figure 13.2 illustrates both the surface of the function and its gradient field, making clear how the directions of ascent and descent vary around the saddle point.

  • The gradient vanishes at \((0,0)\).

  • The Hessian has eigenvalues of opposite signs (\(2\) and \(-2\)), indicating that curvature is positive in one direction and negative in another.

  • Along the \(x\)-direction, the function increases (convex behavior).

  • Along the \(y\)-direction, the function decreases (concave behavior).

  • Therefore, \((0,0)\) is neither a minimum nor a maximum, but a saddle point.

  • This example shows that a zero gradient does not guarantee optimality; second-order information is required to determine the nature of the critical point.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

plt.rcParams.update({
    "font.size": 9,
    "axes.titlesize": 11,
    "axes.labelsize": 9,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8
})

# Grid
x = np.linspace(-2, 2, 80)
y = np.linspace(-2, 2, 80)
X, Y = np.meshgrid(x, y)
Z = X**2 - Y**2

# Gradient field
xq = np.linspace(-2, 2, 13)
yq = np.linspace(-2, 2, 13)
Xq, Yq = np.meshgrid(xq, yq)
U = 2*Xq
V = -2*Yq

# Side-by-side layout
fig = plt.figure(figsize=(8, 3.8))

# --- Surface ---
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot_surface(X, Y, Z, alpha=0.85)
ax1.set_title(r"(a) $f(x,y)=x^2-y^2$")
ax1.set_xlabel("x", labelpad=2)
ax1.set_ylabel("y", labelpad=2)
ax1.set_zlabel("f(x,y)", labelpad=2)
ax1.tick_params(axis='both', labelsize=7)

# --- Gradient field ---
ax2 = fig.add_subplot(1, 2, 2)
ax2.contour(X, Y, Z, levels=20)
ax2.quiver(Xq, Yq, U, V, color="black", angles="xy",
           scale_units="xy", scale=12, width=0.004)
ax2.set_title(r"(b) $\nabla f(x,y)=(2x,-2y)^\top$")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Figure 13.2: Saddle point example: (a) Surface \(f(x,y)=x^2-y^2\) and (b) gradient field \(\nabla f(x,y)=(2x,-2y)^\top\).
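
The same conclusion can be verified numerically. The following minimal sketch computes the eigenvalues of the constant Hessian; one positive and one negative eigenvalue confirm the saddle structure.

import numpy as np

# Minimal sketch: eigenvalues of the (constant) Hessian of f(x,y) = x^2 - y^2.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)                 # symmetric matrix -> real eigenvalues
print("Eigenvalues of the Hessian:", eigvals)   # mixed signs -> saddle point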

Example: plateau region.

Consider the function

\[ f(x)\;=\; \tanh(x), \qquad f'(x)\; =\; 1-\tanh^2(x). \]

Its second derivative is

\[ f''(x) \; =\; -2\,\tanh(x)\, \bigl(1\,-\,\tanh^2(x)\bigr). \]

Since this is a one-dimensional function, the Hessian reduces to

\[ H_f(x)\;=\; \begin{pmatrix} f''(x) \end{pmatrix} \;=\; -2\,\tanh(x)\, \bigl(1\,-\,\tanh^2(x)\bigr). \]

Graphical interpretation (plateau region).

Figure 13.3 illustrates the function and its derivative, highlighting the regions where the gradient becomes very small.

  • For large values of \(|x|\), the function becomes nearly flat.

  • In those regions, both the first and second derivatives are close to zero.

  • This indicates that the function has very low curvature in those regions.

  • As a result, parameter updates become very small, which slows down the learning process.

  • This behavior is characteristic of plateau regions, where both gradient and curvature provide little information for optimization.

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({
    "font.size": 9,
    "axes.titlesize": 11,
    "axes.labelsize": 9,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8
})

x = np.linspace(-5, 5, 500)
f = np.tanh(x)
df = 1 - np.tanh(x)**2

fig, axes = plt.subplots(1, 2, figsize=(8, 3.5))

# Function
axes[0].plot(x, f, linewidth=1.5)
axes[0].axhline(0, linestyle='--', linewidth=0.7)
axes[0].axvline(0, linestyle='--', linewidth=0.7)
axes[0].set_title(r"(a) $f(x)=\tanh(x)$")
axes[0].set_xlabel("x")
axes[0].set_ylabel("f(x)")
axes[0].grid(True, alpha=0.3)

# Gradient
axes[1].plot(x, df, linewidth=1.5)
axes[1].axhline(0, linestyle='--', linewidth=0.7)
axes[1].axvline(0, linestyle='--', linewidth=0.7)
axes[1].set_title(r"(b) $f'(x)=1-\tanh^2(x)$")
axes[1].set_xlabel("x")
axes[1].set_ylabel("f'(x)")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Figure 13.3: Plateau region example: (a) Function \(f(x)=\tanh(x)\) and (b) derivative \(f'(x)=1-\tanh^2(x)\).
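
To make the flatness of the plateau concrete, the short numeric sketch below evaluates \(f'\) and \(f''\) at a few sample points (chosen only for illustration); both quantities decay rapidly as \(|x|\) grows, so a gradient step of size \(\alpha\, f'(x)\) becomes negligible.

import numpy as np

# Minimal sketch: gradient and curvature of f(x) = tanh(x) at a few points.
for x in [0.0, 2.0, 4.0, 6.0]:
    df = 1 - np.tanh(x)**2                       # f'(x)
    d2f = -2*np.tanh(x)*(1 - np.tanh(x)**2)      # f''(x)
    print(f"x = {x:.0f}: f'(x) = {df:.2e}, f''(x) = {d2f:.2e}")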

Example: narrow valley.

Consider the function

\[ f(x,y)\; =\; 100x^2+y^2, \qquad \nabla f(x,y)\; =\; \begin{pmatrix} 200x \\ 2y \end{pmatrix}. \]

The Hessian matrix is

\[ H_f(x,y)\; =\; \begin{pmatrix} 200 & 0 \\ 0 & 2 \end{pmatrix}. \]

Graphical interpretation (narrow valley).

Figure 13.4 illustrates both the surface of the function and its gradient field, highlighting the strong difference in curvature between the two coordinate directions.

  • The Hessian reveals that curvature is much larger in the \(x\)-direction than in the \(y\)-direction.

  • In particular, the eigenvalues of the Hessian are \(200\) and \(2\), indicating a strong anisotropy in the curvature of the function.

  • As a result, the optimization path may oscillate across the steep direction while progressing slowly along the shallow one.

  • This produces a narrow valley and makes optimization more difficult.

  • This example shows how second-order information helps explain optimization difficulties that are not evident from the gradient alone.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

plt.rcParams.update({
    "font.size": 9,
    "axes.titlesize": 11,
    "axes.labelsize": 9,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8
})

# Grid for the surface
x = np.linspace(-1.5, 1.5, 80)
y = np.linspace(-3, 3, 80)
X, Y = np.meshgrid(x, y)
Z = 100*X**2 + Y**2

# Grid for the gradient field
xq = np.linspace(-1.5, 1.5, 13)
yq = np.linspace(-3, 3, 13)
Xq, Yq = np.meshgrid(xq, yq)
U = 200*Xq
V = 2*Yq

# Side-by-side layout
fig = plt.figure(figsize=(8, 3.8))

# Surface
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot_surface(X, Y, Z, alpha=0.85)
ax1.set_title(r"(a) $f(x,y)=100x^2+y^2$")
ax1.set_xlabel("x", labelpad=2)
ax1.set_ylabel("y", labelpad=2)
ax1.set_zlabel("f(x,y)", labelpad=2)
ax1.tick_params(axis='both', labelsize=7)

# Contour + gradient field
ax2 = fig.add_subplot(1, 2, 2)
ax2.contour(X, Y, Z, levels=20)
ax2.quiver(Xq, Yq, U, V, color="black",
           angles="xy", scale_units="xy", scale=150, width=0.004)
ax2.set_title(r"(b) $\nabla f(x,y)=(200x,2y)^\top$")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Figure 13.4: Narrow valley example: (a) Surface \(f(x,y)=100x^2+y^2\) and (b) gradient field \(\nabla f(x,y)=(200x,2y)^\top\).
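
The anisotropy of the valley can be summarized by the condition number of the Hessian, that is, the ratio between its largest and smallest eigenvalues. A minimal numeric sketch:

import numpy as np

# Minimal sketch: condition number of the Hessian of f(x,y) = 100x^2 + y^2.
H = np.array([[200.0, 0.0],
              [0.0, 2.0]])

eigvals = np.linalg.eigvalsh(H)                        # [2., 200.]
print("Eigenvalues:", eigvals)
print("Condition number:", eigvals[-1] / eigvals[0])   # 100.0

A large condition number indicates that a step size suitable for the steepest direction is far too small for the shallow one, which is precisely what makes the valley difficult for gradient descent.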

13.0.3 Vanishing and exploding gradients

A fundamental challenge in deep neural networks arises from the propagation of gradients through multiple layers.

Because gradients are computed through repeated application of the chain rule, their magnitude may either decrease or increase exponentially as they propagate backward through the network.

  • In the vanishing gradient case, gradients become extremely small, preventing early layers from learning effectively.

  • In the exploding gradient case, gradients grow excessively large, leading to numerical instability.

Both situations can hinder the training process and affect convergence.
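
The exponential effect of repeated multiplication can be illustrated with a toy computation (a minimal sketch; the per-layer factors 0.5 and 1.5 are illustrative values, not derived from an actual network).

# Minimal sketch: a backpropagated gradient behaves roughly like a product of
# per-layer factors, so factors below 1 shrink it and factors above 1 amplify it.
n_layers = 30
for factor in [0.5, 1.5]:
    magnitude = factor ** n_layers
    print(f"per-layer factor {factor}: product over {n_layers} layers = {magnitude:.3e}")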

Example: vanishing gradient.

Consider the sigmoid function

\[ \sigma(x) \;=\; \frac{1}{1 \,+\, e^{-x}}, \qquad \sigma'(x) \;=\; \frac{e^{-x}}{\bigl(1 \,+\, e^{-x}\bigr)^{2}} \;=\; \sigma(x)\, \bigl(1 \,-\, \sigma(x)\bigr). \]

Its second derivative is

\[ \sigma''(x) \;=\; \sigma'(x) \, \bigl(1 \,-\, 2\sigma(x)\bigr). \]

Since this is a one-dimensional function, the Hessian reduces to

\[ H_\sigma(x)= \begin{pmatrix} \sigma''(x) \end{pmatrix}. \]

Graphical interpretation (vanishing gradient).

Figure 13.5 illustrates the sigmoid function and its derivative, highlighting how the gradient becomes very small for large values of \(|x|\).

  • For large positive or negative values of \(x\), the derivative approaches zero.

  • In these regions, the second derivative is also close to zero, indicating very low curvature.

  • As a result, both first- and second-order information become weak.

  • When such small derivatives are multiplied across many layers, the gradient can vanish.

  • This explains why early layers in deep networks may learn very slowly.

  • This example highlights how regions with near-zero curvature can severely limit the effectiveness of gradient-based optimization.

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({
    "font.size": 9,
    "axes.titlesize": 11,
    "axes.labelsize": 9,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8
})

x = np.linspace(-8, 8, 500)
sig = 1 / (1 + np.exp(-x))
dsig = sig * (1 - sig)

fig, axes = plt.subplots(1, 2, figsize=(8, 3.5))

# Sigmoid
axes[0].plot(x, sig, linewidth=1.5)
axes[0].set_title(r"(a) $\sigma(x)$")
axes[0].set_xlabel("x")
axes[0].set_ylabel("σ(x)")
axes[0].grid(True, alpha=0.3)

# Derivative
axes[1].plot(x, dsig, linewidth=1.5)
axes[1].set_title(r"(b) $\sigma'(x)$")
axes[1].set_xlabel("x")
axes[1].set_ylabel("σ'(x)")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Figure 13.5: Vanishing gradient example: (a) Sigmoid function \(\sigma(x)\) and (b) derivative \(\sigma'(x)=\sigma(x)(1-\sigma(x))\).
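
A short numeric sketch makes the layer-by-layer shrinkage explicit. The pre-activation values below are randomly generated and purely illustrative; since \(\sigma'(x)\le 0.25\), the product of such factors across many layers becomes vanishingly small.

import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch: product of sigmoid derivatives across 20 hypothetical layers.
def dsigma(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

z = rng.normal(0, 2, size=20)        # assumed pre-activations, one per layer
factors = dsigma(z)
print("Largest per-layer factor:", factors.max())       # at most 0.25
print("Product across 20 layers:", np.prod(factors))    # extremely small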

Example: exploding gradient.

Consider repeated multiplication of a value greater than 1:

\[ g(n) = (1.5)^n. \]

Graphical interpretation (exploding gradient).

Figure 13.6 shows how this quantity grows rapidly as the number of multiplications increases.

  • As \(n\) increases, \(g(n)\) grows exponentially.

  • In deep networks, repeated multiplication of gradients can lead to very large values.

  • This can cause numerical instability during training.

  • This behavior is not primarily explained by local curvature (as captured by the Hessian), but rather by the repeated application of the chain rule, which can amplify gradients across layers.

  • In such cases, even moderate local derivatives can lead to exponential growth when composed many times.

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({
    "font.size": 9,
    "axes.titlesize": 11,
    "axes.labelsize": 9,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8
})

n = np.arange(0, 20)
g = 1.5 ** n

plt.figure(figsize=(6, 3.5))
plt.plot(n, g, marker='o', linewidth=1.5)
plt.title(r"$g(n) = (1.5)^n$")
plt.xlabel("n")
plt.ylabel("g(n)")
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Figure 13.6: Exploding gradient example: exponential growth \(g(n)=(1.5)^n\).

Taken together, these examples highlight how both the geometry of the loss surface and the behavior of gradients play a crucial role in the effectiveness of gradient-based optimization methods, particularly in deep neural networks.

13.0.4 Summary of illustrative examples

Table 13.1 provides a concise summary of the main geometric and gradient-related phenomena discussed in this section, along with their visual interpretation and associated optimization challenges. For clarity, these concepts are summarized below.

Table 13.1: Summary of illustrative examples used to explain geometric and gradient-related challenges in neural network optimization.

| Type | Example | Main visual idea | Main difficulty |
|---|---|---|---|
| Geometric structure | Local minimum | Function and derivative | Critical points may complicate optimization |
| Geometric structure | Saddle point | Surface and gradient field | Zero gradient does not imply optimum |
| Geometric structure | Plateau region | Function and derivative | Very small gradients slow learning |
| Geometric structure | Narrow valley | Surface and gradient field | Anisotropic curvature causes oscillation and slow progress |
| Gradient propagation | Vanishing gradient | Function and derivative | Gradients shrink across many layers |
| Gradient propagation | Exploding gradient | Iterative exponential growth | Gradients grow excessively across many layers |

The phenomena summarized above have direct implications for the design and training of neural networks.

13.0.5 Practical implications

Several strategies have been developed to mitigate these issues, including:

  • The use of activation functions that better preserve gradient flow (e.g., ReLU and its variants).

  • Careful weight initialization methods (e.g., Xavier or He initialization).

  • Normalization techniques such as Batch Normalization.

These approaches help stabilize training and improve convergence behavior in deep neural networks.
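
As one concrete instance of the initialization strategies listed above, the following minimal sketch draws weights with the He scheme for a fully connected layer (the layer sizes are illustrative); the standard deviation \(\sqrt{2/n_{\text{in}}}\), where \(n_{\text{in}}\) is the number of inputs to the layer, is chosen to preserve the variance of activations under ReLU.

import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of He initialization for a fully connected layer
# (layer sizes are illustrative; scale = sqrt(2 / fan_in) targets ReLU).
fan_in, fan_out = 256, 128
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

print("Target standard deviation:", np.sqrt(2.0 / fan_in))
print("Sample standard deviation:", W.std())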

Overall, the effectiveness of gradient-based optimization depends not only on the optimization algorithm itself, but also on the interplay between model architecture, activation functions, and the geometry of the loss surface.

These challenges motivate the development of more advanced optimization methods that incorporate additional information about the curvature of the loss surface.

14 Advanced optimization methods

14.0.1 Motivation

While gradient descent provides a simple and effective mechanism for training neural networks, it relies exclusively on first-order information, namely the gradient of the loss function. Although this approach is widely used in practice, it does not fully exploit the geometric structure of the optimization problem.

In many settings, additional information about the curvature of the loss surface can be leveraged to accelerate convergence and improve stability. This leads to the study of second-order optimization methods, which incorporate not only gradients but also higher-order derivatives.

14.0.2 Gradient and curvature information

In previous sections, we introduced the gradient as a tool for describing how a function changes with respect to its inputs:

\[ \nabla L(\boldsymbol{\theta}) = \begin{pmatrix} \frac{\partial L}{\partial \theta_1} \\ \vdots \\ \frac{\partial L}{\partial \theta_N} \end{pmatrix} \]

While the gradient provides first-order information about the direction of steepest increase, it does not capture how this direction changes across the parameter space.

To address this, we introduced the Hessian matrix, which contains second-order derivatives and describes the local curvature of the loss function:

\[ H(\boldsymbol{\theta}) \;=\; \left[ \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} \right]_{i,j=1}^N \;=\; \begin{pmatrix} \frac{\partial^2 L}{\partial \theta_1^2} & \frac{\partial^2 L}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 L}{\partial \theta_1 \partial \theta_N} \\ \frac{\partial^2 L}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 L}{\partial \theta_2^2} & \cdots & \frac{\partial^2 L}{\partial \theta_2 \partial \theta_N} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 L}{\partial \theta_N \partial \theta_1} & \frac{\partial^2 L}{\partial \theta_N \partial \theta_2} & \cdots & \frac{\partial^2 L}{\partial \theta_N^2} \end{pmatrix}. \]

Each entry of the Hessian measures how the gradient with respect to one parameter changes as another parameter varies, thereby encoding interactions between parameters.

As illustrated in earlier examples, the curvature of the loss surface plays a critical role in optimization:

  • In functions with uniform curvature, optimization is smooth and stable.

  • In functions with highly anisotropic curvature (e.g., narrow valleys), optimization becomes more challenging.

This additional information allows us to design optimization methods that adapt to the geometry of the problem.

If the loss function is sufficiently smooth, the Hessian is symmetric, since mixed partial derivatives satisfy

\[ \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} \;=\; \frac{\partial^2 L}{\partial \theta_j \partial \theta_i}. \]
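
To connect the definition of the Hessian with an explicit computation, the following minimal sketch approximates each entry by central finite differences for an illustrative quadratic with a cross term; the off-diagonal entries come out equal, consistent with the symmetry property above.

import numpy as np

# Minimal sketch: finite-difference Hessian of an illustrative function
# f(x, y) = 100x^2 + y^2 + 4xy, whose analytic Hessian is [[200, 4], [4, 2]].
def f(v):
    x, y = v
    return 100*x**2 + y**2 + 4*x*y

def numerical_hessian(f, v, h=1e-4):
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i]*h, np.eye(n)[j]*h
            H[i, j] = (f(v + e_i + e_j) - f(v + e_i - e_j)
                       - f(v - e_i + e_j) + f(v - e_i - e_j)) / (4*h*h)
    return H

print(np.round(numerical_hessian(f, np.array([0.3, -0.7])), 3))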

Example: gradient descent vs Newton (one-dimensional).

Consider the function

\[ f(x) = x^2, \qquad f'(x)=2x, \qquad f''(x)=2. \]

  • Gradient descent update: \[ x_{t+1} = x_t - \alpha \cdot 2x_t \]

  • Newton update: \[ x_{t+1} = x_t - \frac{2x_t}{2} = 0 \]

Thus, Newton’s method reaches the minimum in a single step for this quadratic function.

This illustrates how second-order information allows for more efficient optimization.
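
A minimal numeric sketch of this comparison (the learning rate and starting point are illustrative):

# Minimal sketch: gradient descent versus a single Newton step on f(x) = x^2.
alpha, x0 = 0.1, 2.0

x_gd = x0
for t in range(5):
    x_gd = x_gd - alpha * 2 * x_gd     # gradient descent: x <- x - alpha * f'(x)
print("Gradient descent after 5 steps:", x_gd)     # 2 * (0.8)^5 = 0.65536

x_newton = x0 - (2 * x0) / 2.0         # Newton: x <- x - f'(x) / f''(x)
print("Newton after 1 step:", x_newton)            # exactly 0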

While the previous example considered a one-dimensional function, we now move to a two-dimensional setting.

For simplicity, we illustrate these ideas using a function \(f(x,y)\), although in neural networks the objective is typically a loss function \(L(\boldsymbol{\theta})\).

Example: gradient descent vs Newton in a narrow valley.

We revisit the function

\[ f(x,y) \;=\; 100x^2 + y^2, \qquad \nabla f(x,y) \; =\; \begin{pmatrix} 200x \\ 2y \end{pmatrix}, \qquad H_f(x,y) \;=\; \begin{pmatrix} 200 & 0 \\ 0 & 2 \end{pmatrix}. \]

Graphical interpretation (gradient descent vs Newton in a narrow valley).

Figure 14.1 compares the trajectories of gradient descent and Newton’s method starting from the same initial point.

  • Even with a learning rate small enough to remain stable, gradient descent oscillates across the steep direction and progresses only slowly toward the minimum.

  • Newton’s method rescales the update using curvature information, effectively normalizing the step size across directions with different curvature.

  • This illustrates how second-order information adapts to the geometry of the problem.

import numpy as np
import matplotlib.pyplot as plt

# Function and derivatives
def f(x, y):
    return 100*x**2 + y**2

def grad(x, y):
    return np.array([200*x, 2*y])

H_inv = np.array([[1/200, 0],
                  [0, 1/2]])

# Initial point
x0 = np.array([1.2, 1.2])

# Gradient descent
alpha = 0.009
gd_path = [x0.copy()]
x = x0.copy()
for _ in range(20):
    x = x - alpha * grad(x[0], x[1])
    gd_path.append(x.copy())
gd_path = np.array(gd_path)

# Newton method
newton_path = [x0.copy()]
x = x0.copy()
for _ in range(3):
    x = x - H_inv @ grad(x[0], x[1])
    newton_path.append(x.copy())
newton_path = np.array(newton_path)

# Contour grid
x_vals = np.linspace(-1.5, 1.5, 400)
y_vals = np.linspace(-1.5, 1.5, 400)
X, Y = np.meshgrid(x_vals, y_vals)
Z = 100*X**2 + Y**2

# Plot
plt.figure(figsize=(8,4.5)); 

# Contours
plt.contour(X, Y, Z, levels=25); 

# Paths
plt.plot(gd_path[:,0], gd_path[:,1], 'o-', label="Gradient Descent");
plt.plot(newton_path[:,0], newton_path[:,1], 's-', label="Newton");

# Initial point and minimum
plt.scatter([x0[0]], [x0[1]], s=50);
plt.annotate(r"$(x_0,y_0)$", (x0[0], x0[1]), xytext=(6, 6), textcoords="offset points");

plt.scatter([0], [0], s=40);
plt.annotate(r"$(0,0)$", (0, 0), xytext=(6, 6), textcoords="offset points");


# Formatting
plt.xlabel("x"); 
plt.ylabel("y"); 
plt.title("Optimization trajectories"); 
plt.legend(); 
plt.grid(True, alpha=0.2)
plt.xlim(-1.5, 1.5);
plt.ylim(-1.5, 1.5);

plt.tight_layout()
plt.show()

Figure 14.1: Optimization trajectories on a narrow valley: gradient descent (zig-zag path) versus Newton’s method (direct convergence), illustrating the role of curvature in second-order optimization.

The behavior of gradient descent depends strongly on the learning rate. In directions of high curvature, large learning rates may cause divergence, whereas smaller values produce the characteristic zig-zag trajectory of slow convergence. Newton’s method avoids this issue by incorporating curvature information through the inverse Hessian.

Together with the one-dimensional example, this result highlights how second-order methods incorporate curvature information to significantly improve convergence behavior.

The behavior observed above is closely related to the choice of the learning rate in gradient descent.

Example: effect of the learning rate in a narrow valley.

To better understand the behavior of gradient descent in ill-conditioned problems, it is useful to compare two different learning rates on the same function:

\[ f(x,y)=100x^2+y^2, \qquad \nabla f(x,y)= \begin{pmatrix} 200x\\ 2y \end{pmatrix}. \]

Graphical interpretation (effect of the learning rate in a narrow valley).

Figure 14.2 shows two optimization trajectories starting from the same initial point. In panel (a), a large learning rate causes divergence, whereas in panel (b), a smaller learning rate produces the characteristic zig-zag pattern associated with slow but stable convergence.

import numpy as np
import matplotlib.pyplot as plt

# ----------------------------------------------------------
# Function and gradient
# ----------------------------------------------------------
def f(x, y):
    return 100*x**2 + y**2

def grad(x, y):
    return np.array([200*x, 2*y])

# ----------------------------------------------------------
# Gradient descent path
# ----------------------------------------------------------
def gradient_descent_path(x0, alpha, n_steps):
    path = [x0.copy()]
    x = x0.copy()
    for _ in range(n_steps):
        x = x - alpha * grad(x[0], x[1])
        path.append(x.copy())
    return np.array(path)

# ----------------------------------------------------------
# Initial point
# ----------------------------------------------------------
x0 = np.array([1.2, 1.2])

# Two learning rates
path_div = gradient_descent_path(x0, alpha=0.05, n_steps=6)
path_stable = gradient_descent_path(x0, alpha=0.009, n_steps=20)

# ----------------------------------------------------------
# Contour grid
# ----------------------------------------------------------
x_vals = np.linspace(-1.5, 1.5, 400)
y_vals = np.linspace(-1.5, 1.5, 400)
X, Y = np.meshgrid(x_vals, y_vals)
Z = 100*X**2 + Y**2

# ----------------------------------------------------------
# Plot
# ----------------------------------------------------------
plt.rcParams.update({
    "font.size": 9,
    "axes.titlesize": 11,
    "axes.labelsize": 9,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8,
    "legend.fontsize": 8
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4.5))

# ---------------- Panel (a): Divergence ----------------
ax = axes[0]
ax.contour(X, Y, Z, levels=25); 
ax.plot(path_div[:,0], path_div[:,1], 'o-', label="Gradient Descent"); 
ax.scatter([0], [0], s=35); 
ax.annotate(r"$(0,0)$", (0, 0), xytext=(6, 6), textcoords="offset points"); 
ax.scatter([x0[0]], [x0[1]], s=45); 
ax.annotate(r"$(x_0,y_0)$", (x0[0], x0[1]), xytext=(6, 6), textcoords="offset points"); 
ax.set_title(r"(a) $\alpha=0.05$: divergence"); 
ax.set_xlabel("x"); 
ax.set_ylabel("y"); 
ax.set_xlim(-1.5, 1.5); 
ax.set_ylim(-1.5, 1.5); 
ax.grid(True, alpha=0.2); 

# ---------------- Panel (b): Stable zig-zag ----------------
ax = axes[1]
ax.contour(X, Y, Z, levels=25); 
ax.plot(path_stable[:,0], path_stable[:,1], 'o-', label="Gradient Descent"); 
ax.scatter([0], [0], s=35); 
ax.annotate(r"$(0,0)$", (0, 0), xytext=(6, 6), textcoords="offset points"); 
ax.scatter([x0[0]], [x0[1]], s=45); 
ax.annotate(r"$(x_0,y_0)$", (x0[0], x0[1]), xytext=(6, 6), textcoords="offset points"); 
ax.set_title(r"(b) $\alpha=0.009$: stable zig-zag"); 
ax.set_xlabel("x"); 
ax.set_ylabel("y"); 
ax.set_xlim(-1.5, 1.5); 
ax.set_ylim(-1.5, 1.5); 
ax.grid(True, alpha=0.2); 

plt.tight_layout()
plt.show()

Figure 14.2: Effect of the learning rate on gradient descent in a narrow valley. (a) With \(\alpha=0.05\), the method diverges due to the high curvature in the \(x\)-direction. (b) With \(\alpha=0.009\), the method remains stable but exhibits a zig-zag trajectory and slow convergence.

Figure 14.2 shows that the performance of gradient descent depends strongly on the choice of learning rate.

  • In panel (a), the learning rate is too large relative to the curvature in the \(x\)-direction, so the method becomes unstable and diverges.

  • In panel (b), the smaller learning rate prevents divergence, but the trajectory still oscillates across the steep direction, producing slow convergence.

This example reinforces the importance of curvature in optimization and helps explain why second-order methods, such as Newton’s method, can be more effective in anisotropic landscapes.
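
This threshold can be made precise for quadratic functions. Along a direction with curvature \(\lambda\), gradient descent multiplies the corresponding coordinate by \(1-\alpha\lambda\) at every step, so it converges in that direction only if \(|1-\alpha\lambda|<1\), that is, \(\alpha<2/\lambda\). For the steep direction of this example (\(\lambda=200\)), the threshold is \(\alpha<0.01\), which explains why \(\alpha=0.05\) diverges while \(\alpha=0.009\) remains stable. A minimal numeric sketch:

# Minimal sketch: per-direction contraction factor (1 - alpha*lambda) of
# gradient descent on f(x,y) = 100x^2 + y^2.
for alpha in [0.05, 0.009]:
    for lam in [200.0, 2.0]:            # curvatures in the x- and y-directions
        factor = 1 - alpha * lam
        status = "converges" if abs(factor) < 1 else "diverges"
        print(f"alpha = {alpha:5.3f}, curvature = {lam:5.1f}: "
              f"factor = {factor:+7.2f} -> {status}")

The negative factor in the steep direction for \(\alpha=0.009\) also explains the zig-zag: the \(x\)-coordinate changes sign at every iteration while shrinking in magnitude.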

14.0.3 Newton-type methods

One classical approach that uses second-order information is the Newton-Raphson method, which updates parameters according to

\[ \boldsymbol{\theta}^{(t+1)} \;=\; \boldsymbol{\theta}^{(t)} \;-\; H\big(\boldsymbol{\theta}^{(t)}\big)^{-1} \, \nabla L\big(\boldsymbol{\theta}^{(t)}\big). \]

Geometrically, this update can be interpreted as rescaling the parameter space using the inverse Hessian, allowing the optimization step to adapt to the local curvature of the loss surface.

Unlike gradient descent, where the step size is controlled by a scalar learning rate, Newton-type methods adapt the update direction using curvature information. This often leads to faster convergence, particularly near optimal points.

However, computing and inverting the Hessian matrix can be computationally expensive in high-dimensional problems, which limits the direct applicability of these methods in large neural networks.
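
In practice, the inverse Hessian is rarely formed explicitly; the Newton direction is obtained by solving the linear system \(H\,\mathbf{d} = -\nabla L\). The following minimal sketch implements a single (undamped) Newton step in this way, using the narrow-valley quadratic from the previous section for illustration.

import numpy as np

# Minimal sketch of one Newton step: solve H d = -grad rather than inverting H.
def newton_step(theta, grad_fn, hess_fn):
    g = grad_fn(theta)
    H = hess_fn(theta)
    d = np.linalg.solve(H, -g)          # Newton direction
    return theta + d

# Illustrative quadratic f(x, y) = 100x^2 + y^2
grad_fn = lambda th: np.array([200*th[0], 2*th[1]])
hess_fn = lambda th: np.array([[200.0, 0.0], [0.0, 2.0]])

theta0 = np.array([1.2, 1.2])
print(newton_step(theta0, grad_fn, hess_fn))   # reaches the minimum (0, 0) in one step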

Interpretation via local quadratic approximation.

Newton’s method can be understood as minimizing a local quadratic approximation of the loss function around the current iterate.

Let \(\mathbf{d} = \boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}\). Then, using a second-order Taylor expansion,

\[ L(\boldsymbol{\theta}^{(t)} + \mathbf{d}) \;\approx\; L\big(\boldsymbol{\theta}^{(t)}\big) \;+\; \nabla L\big(\boldsymbol{\theta}^{(t)}\big)^\top \mathbf{d} \;+\; \frac{1}{2} \mathbf{d}^\top H\big(\boldsymbol{\theta}^{(t)}\big)\, \mathbf{d}. \]

Minimizing this quadratic approximation with respect to \(\mathbf{d}\) yields

\[ \mathbf{d} \;=\; - H\big(\boldsymbol{\theta}^{(t)}\big)^{-1}\, \nabla L\big(\boldsymbol{\theta}^{(t)}\big), \]

which leads directly to the Newton update.

This shows that Newton’s method selects the step that minimizes a local quadratic model of the loss, incorporating both gradient and curvature information.

Graphical interpretation (interpretation via local quadratic approximation).

In Figure 14.3, the quadratic approximation is constructed around the point \(x_0=1\).

import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x**4

def df(x):
    return 4*x**3

def d2f(x):
    return 12*x**2

x0 = 1.0
x = np.linspace(-1.5, 1.5, 400)

# Quadratic approximation at x0
f0 = f(x0)
df0 = df(x0)
d2f0 = d2f(x0)

quad = f0 + df0*(x - x0) + 0.5*d2f0*(x - x0)**2

plt.figure(figsize=(6,3.5));
plt.plot(x, f(x), label="f(x)");
plt.plot(x, quad, '--', label="Quadratic approximation");
plt.scatter([x0], [f0], s=40);
plt.annotate(r"$x_0$", (x0, f0), xytext=(8, 8), textcoords="offset points");

# Newton step
x_newton = x0 - df0 / d2f0
y_newton_quad = f0 + df0*(x_newton - x0) + 0.5*d2f0*(x_newton - x0)**2

plt.scatter([x_newton], [y_newton_quad], color='red', s=40);
plt.annotate(r"$x_1$", (x_newton, y_newton_quad),
             xytext=(8, -12), textcoords="offset points");
             
plt.arrow(x0, f0,
          x_newton - x0,
          y_newton_quad - f0,
          head_width=0.05,
          length_includes_head=True,
          color='black');            
             
plt.legend();
plt.grid(True, alpha=0.2);
plt.xlabel("x");
plt.ylabel("y");
plt.xlim(-1.5, 1.5);
plt.ylim(-0.2, 3);

plt.tight_layout()
plt.show()

Figure 14.3: Local quadratic approximation underlying Newton’s method: the quadratic model (dashed) approximates the function near the current point \(x_0\), and its minimizer determines the Newton update \(x_1\).

The point \(x_1\) corresponds to the minimum of the quadratic approximation, not of the original function.

Thus, Newton’s method can be interpreted as replacing the original function by a local quadratic model and then moving to the minimizer of that model. In this way, the update incorporates both slope and curvature information, allowing the method to adapt more effectively to the local geometry of the loss surface than standard gradient descent.

14.0.4 First-order vs second-order methods

Gradient descent can be interpreted as a simplified optimization strategy that only uses first-order information. While this makes it computationally efficient and scalable, it may require many iterations to converge.

Second-order methods, on the other hand, incorporate curvature information, allowing for more informed parameter updates. In practice, modern optimization algorithms often seek a balance between these two approaches, combining efficiency with improved convergence properties.

15 Practical perspective

15.0.1 Conceptual perspective

In deep learning, first-order methods such as gradient descent and its variants (e.g., Adam and RMSProp) remain the dominant choice due to their scalability and ease of implementation.

Nevertheless, second-order ideas continue to play an important role in theoretical analysis and in specialized applications where curvature information can be exploited effectively.

This perspective highlights that optimization in neural networks is not limited to a single method, but rather involves a spectrum of approaches with different trade-offs between computational cost and convergence behavior.

Despite their effectiveness, first-order methods such as gradient descent may struggle in the presence of ill-conditioned loss surfaces, where curvature varies significantly across directions.

This limitation motivates the development of adaptive optimization methods, such as AdaGrad, RMSProp, and Adam, which adjust the learning rate dynamically during training.

These methods aim to partially compensate for curvature effects without explicitly computing second-order information, providing a practical compromise between efficiency and robustness in large-scale neural networks.

15.0.2 Graphical interpretation

For intuition, Figure 15.1 illustrates how adaptive methods may modify the optimization trajectory compared to standard gradient descent.

  • Gradient descent follows a zig-zag path due to anisotropic curvature.

  • Adaptive methods rescale the step size across directions, producing a smoother and more efficient trajectory toward the minimum.

This visual comparison highlights how adaptive methods implicitly account for curvature effects without explicitly computing second-order derivatives.

import numpy as np
import matplotlib.pyplot as plt

# ----------------------------------------------------------
# Function and gradient
# ----------------------------------------------------------
def f(x, y):
    return 100*x**2 + y**2

def grad(x, y):
    return np.array([200*x, 2*y], dtype=float)

# ----------------------------------------------------------
# Initial point
# ----------------------------------------------------------
x0 = np.array([1.2, 1.2], dtype=float)

# ----------------------------------------------------------
# Standard gradient descent
# ----------------------------------------------------------
alpha = 0.009
gd_path = [x0.copy()]
x = x0.copy()

for _ in range(20):
    x = x - alpha * grad(x[0], x[1])
    gd_path.append(x.copy())

gd_path = np.array(gd_path)

# ----------------------------------------------------------
# Simulated adaptive behavior (for intuition only)
# ----------------------------------------------------------
def adaptive_step(x_vec):
    gx, gy = grad(x_vec[0], x_vec[1])
    return np.array([gx/200, gy/2], dtype=float)

adapt_path = [x0.copy()]
x = x0.copy()

for _ in range(15):
    x = x - 0.2 * adaptive_step(x)
    adapt_path.append(x.copy())

adapt_path = np.array(adapt_path)

# ----------------------------------------------------------
# Contour grid
# ----------------------------------------------------------
x_vals = np.linspace(-1.5, 1.5, 400)
y_vals = np.linspace(-1.5, 1.5, 400)
X, Y = np.meshgrid(x_vals, y_vals)
Z = 100*X**2 + Y**2

# ----------------------------------------------------------
# Plot
# ----------------------------------------------------------
plt.figure(figsize=(8, 4.5));

plt.contour(X, Y, Z, levels=25);
plt.plot(gd_path[:,0], gd_path[:,1], 'o-', label="Gradient Descent");
plt.plot(adapt_path[:,0], adapt_path[:,1], 's-', label="Adaptive-like path");

plt.scatter([x0[0]], [x0[1]], s=50);
plt.annotate(r"$(x_0,y_0)$", (x0[0], x0[1]), xytext=(6, 6), textcoords="offset points");

plt.scatter([0], [0], s=40);
plt.annotate(r"$(0,0)$", (0, 0), xytext=(6, 6), textcoords="offset points");

plt.xlabel("x");
plt.ylabel("y");
plt.title("Optimization trajectories");
plt.legend();
plt.grid(True, alpha=0.2)
plt.xlim(-1.5, 1.5);
plt.ylim(-1.5, 1.5);

plt.tight_layout()
plt.show()

Figure 15.1: Illustrative comparison between standard gradient descent and an adaptive-like trajectory on a narrow valley. The adaptive-like path rescales directions differently, producing a smoother trajectory.

15.0.3 Practical remark

The adaptive-like path shown here is only a simplified visual illustration. Actual adaptive methods such as AdaGrad, RMSProp, and Adam rely on running statistics of past gradients rather than explicit curvature rescaling.

The goal of this figure is not to represent the exact update rules, but to provide geometric intuition about how adaptive methods modify optimization trajectories in ill-conditioned settings.
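
For readers who want to see an actual adaptive update rule, the following minimal sketch applies an RMSProp-style iteration to the same narrow-valley function; the hyperparameter values are illustrative. The running average of squared gradients rescales each coordinate separately, without ever forming a Hessian.

import numpy as np

# Minimal sketch of an RMSProp-style update on f(x, y) = 100x^2 + y^2.
def grad(theta):
    return np.array([200*theta[0], 2*theta[1]])

alpha, rho, eps = 0.1, 0.9, 1e-8
theta = np.array([1.2, 1.2])
v = np.zeros(2)                          # running average of squared gradients

for t in range(50):
    g = grad(theta)
    v = rho * v + (1 - rho) * g**2       # accumulate per-coordinate statistics
    theta = theta - alpha * g / (np.sqrt(v) + eps)

# Both coordinates end up much closer to the minimum (0, 0) than the start.
print("Parameters after 50 steps:", theta)

This per-coordinate rescaling is what produces the smoother, less oscillatory trajectory sketched in Figure 15.1, even though the actual rule relies on gradient statistics rather than on the Hessian.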

16 Summary

In this document, we developed the fundamental mathematical framework that enables learning in artificial neural networks. Starting from the definition of loss functions, we established how learning can be formulated as an optimization problem over a high-dimensional parameter space.

We then introduced gradient-based optimization methods, emphasizing how the gradient provides directional information that guides parameter updates toward regions of lower loss. This perspective allowed us to interpret learning geometrically as movement over a loss surface.

Building on this foundation, we examined the role of the learning rate and showed how its interaction with the curvature of the loss surface critically affects the behavior of gradient descent. In particular, we observed that ill-conditioned problems may lead to slow convergence, oscillatory trajectories, or even divergence.

To address these limitations, we introduced second-order methods, focusing on Newton’s method. By incorporating curvature information through the Hessian matrix, these methods adapt the optimization step to the local geometry of the problem, leading to more efficient convergence.

We also provided a geometric interpretation of Newton’s method as the minimization of a local quadratic approximation of the loss function, highlighting the role of both gradient and curvature in determining the update direction.

Finally, we discussed the practical implications of these ideas in modern deep learning. While second-order methods are often computationally expensive, their underlying principles motivate the development of adaptive optimization algorithms, such as AdaGrad, RMSProp, and Adam, which approximate curvature effects in a scalable manner.

Overall, this chapter connects first-order methods, second-order insights, and adaptive strategies within a unified geometric perspective of optimization, providing a deeper understanding of how neural networks learn in complex loss landscapes.

17 Learning activity

17.0.1 Objective

The purpose of this activity is to consolidate both conceptual and mathematical understanding of the learning process in neural networks. In particular, it aims to develop intuition about loss functions, gradient-based optimization, curvature effects, and the role of backpropagation in computing parameter updates.

17.0.2 Instructions

Consider a feedforward neural network with multiple layers and a differentiable loss function \(L\).

Answer the following questions:

17.0.3 Required sections

1. Conceptual understanding.

  1. What is the role of a loss function in a neural network?

  2. Why can the training of a neural network be interpreted as an optimization problem?

2. Gradient interpretation.

  1. Define the gradient of a function \(L(\boldsymbol{\theta})\).

  2. Explain why the negative gradient direction is used to update parameters.

  3. Describe geometrically how gradient descent behaves on different loss surfaces (e.g., isotropic vs anisotropic curvature).

3. Learning rate.

  1. What is the role of the learning rate \(\alpha\) in gradient descent?

  2. What happens if \(\alpha\) is too large?

  3. What happens if \(\alpha\) is too small?

  4. How does the appropriate choice of \(\alpha\) depend on the curvature of the loss surface?

4. Backpropagation intuition.

  1. What is the meaning of the local error \(\boldsymbol{\delta}^{(j)}\)?

  2. Why is backpropagation necessary in deep neural networks?

  3. Explain in your own words what it means to propagate the error backward.

  4. How is backpropagation related to the chain rule?

5. Dimensional reasoning.

Consider the expression:

\[ \boldsymbol{\delta}^{(j-1)} = \left(\mathbf{W}^{(j-1,j)\top}\boldsymbol{\delta}^{(j)}\right) \odot f'(\mathbf{z}^{(j-1)}). \]

  1. Explain why the transpose of \(\mathbf{W}^{(j-1,j)}\) is required.

  2. Justify why the dimensions of all terms are consistent.

  3. What role does the element-wise product play in this expression?

6. Critical reflection.

  1. Why would training be infeasible without backpropagation?

  2. Explain why deep networks require efficient gradient computation.

  3. In your opinion, which component is most critical for successful learning: the loss function, the optimization method, or backpropagation? Discuss how they interact.

7. Curvature and optimization.

  1. What does the Hessian matrix represent in the context of optimization?

  2. How does curvature affect the behavior of gradient descent?

  3. Explain why narrow valleys lead to oscillatory trajectories.

  4. Why can second-order methods improve convergence in such settings?

8. Forward pass.

Consider a simple neural network with two layers.

  1. Describe step by step how the forward pass is computed.

  2. Then describe conceptually how the backward pass is performed.

  3. Identify where the gradient is computed and how it is used.

9. Conceptual design.

Propose an intuitive explanation (without formulas) of how a neural network learns from its mistakes.

  • What information is propagated?

  • How is it used to adjust the model?

  • Why does this process improve performance over time?

17.0.4 Reproducibility Requirement

  • The R Markdown document must be fully reproducible.

  • All code chunks must execute without errors and regenerate the reported outputs when the document is compiled.

  • All random seeds (if applicable) must be set to ensure deterministic results.

  • All library versions used should be clearly reported.

  • All code must be self-contained within the document. External dependencies are not allowed unless explicitly included.

  • The use of external functions not included in the document prevents reproducibility, as the analysis cannot be executed independently.

17.0.5 Submission Guidelines

To facilitate the review process and maintain the academic quality of submissions, the following requirements must be met:

  • The assignment must be submitted in both formats:

    • PDF

    • HTML

  • The document must be fully reproducible (see Reproducibility Requirement).

  • If additional files are used (e.g., modules, datasets, or auxiliary functions), they must:

    • Be included in the submission,

    • Be clearly organized, and

    • Be properly documented.

  • Multiple versions of the same assignment must not be submitted unless the final version is clearly identified.

17.0.6 Submission Timing and Late Policy

Timely submission of assignments is strictly required.

  • Assignments must be submitted by the deadlines specified for each activity.

  • A grace period of 30 minutes is allowed after the deadline.

  • Submissions made after the grace period will incur penalties.

  • Penalty: −1.0 point for each hour or fraction of an hour of delay.

  • Submissions significantly past the deadline may receive a substantially reduced grade or may not be accepted.

References