22/04/26
Abstract
Other related documents can be found at Rpubs:: toc.
This document builds upon the foundational concepts introduced in previous sections on neural network architectures (Llinás, 2026) and activation functions (Llinás, 2026), extending the analysis toward the mechanisms that enable learning.
Artificial neural networks (ANNs) constitute one of the most influential paradigms in modern machine learning and artificial intelligence. Their conceptual origin is loosely inspired by the basic functional principles of biological neurons, which process and transmit information through interconnected networks (Kandel et al., 2013).
In the nervous system, neurons receive electrical or chemical signals through dendrites. These signals are integrated within the cell body (the soma), and if the accumulated stimulus surpasses a certain threshold, the neuron generates an electrical impulse known as an action potential. This signal then travels along the axon and is transmitted to other neurons through synaptic connections.
As illustrated in Figure 2.1, the biological neuron is composed of several key components, including dendrites, the soma, the axon, and the synaptic terminals, each playing a specific role in signal transmission.
Figure 2.1: Human neuron. Source: Created by the author with ChatGPT (OpenAI).
Although artificial neurons are only simplified mathematical abstractions of this biological mechanism, the analogy provides an intuitive conceptual foundation. In essence, both systems combine multiple inputs, evaluate their relative influence, and produce an output response according to a transformation rule.
This biological structure motivates the simplified mathematical abstraction introduced in the next section.
Modern artificial neurons translate the biological intuition of neural signal processing into a mathematical framework suitable for computation and learning. Instead of discrete electrical spikes, artificial neurons compute a weighted linear combination of inputs and then transform this signal through a nonlinear activation function.
Formally, the internal signal of a neuron is defined as
\[ z = \mathbf{w}^\top \mathbf{x} + b, \]
where \(\mathbf{x} \in \mathbb{R}^d\) represents the input vector, \(\mathbf{w} \in \mathbb{R}^d\) denotes the vector of weights, and \(b\) is a bias parameter. The neuron output is then obtained by applying an activation function
\[ h = \varphi(z). \]
The quantity \(z\) can be interpreted as a continuous analogue of synaptic integration, while the function \(\varphi(\cdot)\) generalizes the biological threshold mechanism into a smooth and differentiable transformation.
This transition from discrete logical models to continuous optimization was essential for the development of modern neural networks. By allowing gradients to be computed and propagated through the model, neural networks can be trained efficiently using gradient-based learning algorithms.
As a result, ANNs are capable of approximating complex nonlinear mappings and learning expressive internal representations from data. Through iterative adjustments of weights and biases, the model progressively refines its representation of the input space. Consequently, neural networks have become fundamental tools for tasks such as classification, regression, and representation learning in high-dimensional environments (Goodfellow et al., 2016).
As illustrated in Figure 2.2, a neural network is constructed by stacking multiple artificial neurons into layers. The output of each neuron becomes the input of neurons in the next layer, allowing the model to progressively learn more complex representations of the data.
Figure 2.2: Basic architecture of a feedforward neural network. Source: Created by the author with ChatGPT (OpenAI).
Each neuron in the hidden layer computes a transformation of the form \(h = \varphi(\mathbf{w}^\top \mathbf{x} + b)\), illustrating how nonlinear activation functions operate within layered architectures.
Each neuron in the network performs the same basic operation described above, namely computing a weighted sum followed by a nonlinear activation. By composing many such units, the network is able to model highly complex functions.
In many areas of mathematics and machine learning, it is important to work with functions that are not only continuous, but also sufficiently smooth. This smoothness ensures that derivatives exist and behave well, which is essential for optimization algorithms such as gradient descent.
A function \(f: \mathbb{R} \to \mathbb{R}\) is said to belong to the class \(C^{\infty}\) if it is infinitely differentiable; that is, all derivatives of any order exist and are continuous. Formally,
\[ f \in C^{\infty} \quad \Longleftrightarrow \quad f^{(k)} \text{ exists and is continuous for all } k \in \mathbb{N}_0. \]
Functions in \(C^{\infty}\) are often referred to as smooth functions, meaning that they can be differentiated infinitely many times without any discontinuities or irregularities in their derivatives.
A function that belongs to \(C^{\infty}\) is
\[ f(x) = e^x\quad \Longrightarrow \quad f'(x) = e^x \]
This function is infinitely differentiable, and all its derivatives are equal to \(e^x\). In particular, every derivative is continuous over \(\mathbb{R}\).
(#fig:f_exp_derivative)Exponential function \(f(x)=e^x\) and its derivative \(f'(x)=e^x\). Both coincide, illustrating smoothness (\(C^{\infty}\)).
As shown in Figure @ref(fig:f_exp_derivative), the exponential function coincides with its derivative. This illustrates a fundamental property of smooth functions: not only are they differentiable, but all their derivatives are continuous and well-defined across the entire domain.
In general, examples of smooth functions include exponential, trigonometric, and sigmoid-type functions.
A function that does not belong to \(C^{\infty}\) is
\[ f(x) = |x| \quad \Longrightarrow \quad f'(x) = \begin{cases} -1, & x < 0, \\ 1, & x > 0. \end{cases} \]
The derivative is not defined at \(x=0\). Therefore, although the function is continuous, it is not differentiable at the origin, since the left and right derivatives do not coincide. Consequently, it is not smooth.
(#fig:f_abs_derivative)Absolute value function \(f(x)=|x|\) and its derivative. The derivative is discontinuous at \(x=0\).
In contrast, Figure @ref(fig:f_abs_derivative) shows that the function \(f(x)=|x|\) is continuous but not differentiable at \(x=0\). The function itself is well-defined at this point, but the derivative is not, since the left and right derivatives do not coincide.
These examples highlight the contrast between smooth and non-smooth functions.
In the context of neural networks, smooth activation functions are particularly useful because they allow gradients to be computed reliably during training. However, not all commonly used activation functions belong to \(C^{\infty}\). For instance, the ReLU function is not differentiable at zero, yet it remains widely used due to its practical advantages.
Throughout this document, we will frequently refer to functions in \(C^{\infty}\) when discussing theoretical properties of activation functions and optimization.
Activation functions are a fundamental component of artificial neural networks because they introduce nonlinearity into the model. Without nonlinear activation functions, a neural network composed of multiple layers would reduce to an equivalent linear transformation, regardless of the number of layers.
Recall that an artificial neuron computes a linear combination of its inputs
\[ z = \mathbf{w}^\top\mathbf{x} + b, \]
which represents the internal signal of the neuron. The final output is obtained by applying a nonlinear transformation
\[ h = \varphi(z), \]
where \(\varphi(\cdot)\) is called the activation function.
The role of the activation function is to transform the internal signal of the neuron into a response that can capture complex relationships between variables. By introducing nonlinear transformations, neural networks can construct nonlinear decision boundaries and learn rich internal representations of the data.
This capability is supported by classical universal approximation results, which show that neural networks equipped with suitable activation functions can approximate a wide class of functions on compact domains .
From a geometric perspective, activation functions distort the feature space in a controlled way, allowing the model to separate patterns that would otherwise be inseparable using only linear transformations .
Activation functions can be organized into different categories according to their mathematical properties. One common classification distinguishes between monotonic and periodic activation functions.
Monotonic activation functions are functions whose output consistently increases or decreases with respect to the input. These functions have historically been the most widely used in neural networks and include examples such as the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU).
Periodic activation functions exhibit oscillatory behavior and are particularly useful in models designed to capture periodic or high-frequency patterns. These functions are commonly used in specialized neural architectures for representing signals, implicit functions, or spatial fields.
Periodic activation functions can be further divided into:
Sinusoidal activation functions
Non-sinusoidal periodic functions
In the following sections, we examine in detail several activation functions commonly used in neural networks, discussing their mathematical form, key properties, and implications for learning algorithms.
We begin by examining monotonic activation functions, which have historically played a central role in the development of neural networks.
While activation functions define how information is transformed within the network, they do not specify how the parameters of the model are learned. This naturally leads to the question of how neural networks are trained from data.
From a mathematical perspective, learning in neural networks consists of adjusting the parameters \(\mathbf{w}\) and \(b\) so as to minimize a loss function defined over the training data.
Before introducing the learning process in neural networks, it is useful to briefly review a few mathematical concepts that will be used throughout this section. In particular, we will rely on derivatives, partial derivatives, and the gradient operator.
These tools allow us to quantify how a function changes with respect to its inputs, which is essential for optimizing model parameters.
In one dimension, the derivative of a function measures how the output changes as the input varies:
\[ \frac{d}{dx} f(x) \]
In neural networks, however, functions typically depend on multiple variables. In this setting, we use partial derivatives, which measure how a function changes with respect to one variable while keeping the others fixed.
\[ \frac{\partial f}{\partial x_i} \]
This notion allows us to analyze the sensitivity of a function with respect to each individual component of its input.
For functions of several variables, partial derivatives can be grouped into a single object known as the gradient.
\[ \nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right) \]
The gradient is a vector that points in the direction of maximum increase of the function.
This concept plays a central role in optimization, as it tells us how to adjust parameters to reduce a loss function.
Consider the function:
\[ L(x,y) = x^2 + y^2 \]
Its partial derivatives are:
\[ \frac{\partial L}{\partial x} = 2x, \quad \frac{\partial L}{\partial y} = 2y \]
Thus, the gradient is:
\[ \nabla L(x,y) = (2x, 2y) \]
This vector describes how the function changes at any point \((x,y)\) and, in particular, determines the direction used to adjust parameters during learning.
The contour lines represent the level sets of the function \(L(x,y)\), while the arrows correspond to the gradient vector \(\nabla L(x,y)\) at each point. As shown in Figure 3.1, the gradient vectors point outward from the origin, indicating the direction of maximum increase of the function. Consequently, moving in the opposite direction leads toward the minimum, which in this case is located at \((0,0)\).
Figure 3.1 provides two complementary views of this behavior: a surface representation and its corresponding level sets, both combined with the gradient field.
This observation is fundamental for optimization, as gradient-based methods update parameters by moving in the opposite direction of the gradient.
# ==========================================================
# Function and gradient
# ==========================================================
import numpy as np
import matplotlib.pyplot as plt
# Global font configuration
plt.rcParams.update({
"font.size": 14,
"axes.titlesize": 15,
"axes.labelsize": 14,
"xtick.labelsize": 13,
"ytick.labelsize": 13
})
def L(x, y):
return x**2 + y**2
def grad_L(x, y):
return 2*x, 2*y
# Grid
x = np.linspace(-3, 3, 25)
y = np.linspace(-3, 3, 25)
X, Y = np.meshgrid(x, y)
Z = L(X, Y)
U, V = grad_L(X, Y)
# Shared color scale for both plots
vmin = Z.min()
vmax = Z.max()
# ==========================================================
# Figure (compact and balanced layout)
# ==========================================================
fig = plt.figure(figsize=(14, 6.5))
# Make the 3D plot slightly larger than the contour plot
gs = fig.add_gridspec(1, 2, width_ratios=[1.3, 1], wspace=0.07)
# ----------------------------------------------------------
# (a) Surface plot with gradient field
# ----------------------------------------------------------
ax1 = fig.add_subplot(gs[0, 0], projection='3d')
surf = ax1.plot_surface(
X, Y, Z,
cmap='viridis',
edgecolor='k',
linewidth=0.6,
vmin=vmin, vmax=vmax,
alpha=0.9
)
# Gradient vectors projected onto the plane z = 0
ax1.quiver(
X, Y, 0,
U, V, 0,
color='black',
length=0.22,
normalize=True
)
# Slightly more separation from axes
ax1.set_title("(a) Surface of $L(x,y)$ with gradient field", pad=8)
ax1.set_xlabel("x", labelpad=3)
ax1.set_ylabel("y", labelpad=3)
ax1.set_zlabel("$L(x,y)$", labelpad=3)
# Adjust viewing angle for better perception
ax1.view_init(elev=30, azim=-60)
# ----------------------------------------------------------
# (b) Contour plot with gradient field
# ----------------------------------------------------------
ax2 = fig.add_subplot(gs[0, 1])
levels = np.linspace(vmin, vmax, 10)
cont = ax2.contour(
X, Y, Z,
levels=levels,
cmap='viridis',
vmin=vmin, vmax=vmax,
linewidths=1.4
)
# Gradient vectors
ax2.quiver(
X, Y, U, V,
color='black',
angles='xy',
scale_units='xy',
scale=25,
width=0.0035
)
# space above subtitle
ax2.set_title("(b) Level sets of $L(x,y)$ with gradient field", pad=8)
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.set_aspect('equal', adjustable='box')
# ----------------------------------------------------------
# Centered horizontal colorbar
# ----------------------------------------------------------
# width: [left, bottom, width, height]
cbar_ax = fig.add_axes([0.40, 0.085, 0.30, 0.025])
cbar = fig.colorbar(
surf,
cax=cbar_ax,
orientation='horizontal'
)
cbar.set_label("$L(x,y)$", labelpad=2)
# ----------------------------------------------------------
# Main title with improved spacing
# ----------------------------------------------------------
fig.suptitle(
"Function $L(x,y)=x^2+y^2$ and its gradient field",
fontsize=16,
y=0.975 # more space from subplots
)
# Fine-tuned layout adjustments
fig.subplots_adjust(
top=0.82, # pushes plots down (more separation)
bottom=0.17,
left=0.05,
right=0.98
)
plt.show()
Figure 3.1: Two complementary visualizations of \(L(x,y)=x^2+y^2\) and its gradient field. (a) Surface representation with gradient vectors projected onto the input plane. (b) Level sets with gradient vectors indicating the direction of maximum increase.
Learning in neural networks consists of minimizing a loss function with respect to the model parameters (weights and biases).
To achieve this, we need to compute how the loss changes when the parameters change:
\[ \frac{\partial L}{\partial \mathbf{w}}, \quad \frac{\partial L}{\partial b} \]
These quantities are used to iteratively update the parameters in a direction that reduces the loss.
This idea will be formalized in the next section, where we introduce the learning process in neural networks and show how these derivatives are used to update the model parameters.
While activation functions determine how neurons transform their inputs, they do not by themselves provide a mechanism for learning from data. The learning process requires a way to evaluate prediction quality and a systematic procedure to adjust model parameters.
In supervised learning settings, this process is guided by a loss function, which quantifies the discrepancy between the predicted output of the network and the true target values.
To understand how this optimization process works, we first introduce a few essential mathematical tools that describe how functions change with respect to their inputs.
With these tools in place, we are now ready to formalize how neural networks learn from data.
Before introducing optimization, it is useful to formalize how information flows through a neural network.
Figure 4.1 illustrates the forward propagation mechanism in a multilayer neural network. At each layer, the input signal undergoes an affine transformation followed by a nonlinear activation.
Figure 4.1: Forward propagation in a multilayer neural network. Source: Nuñez (2026)
To formalize the flow of information through a neural network, it is useful to introduce a layer-wise notation.
Consider a neural network composed of layers indexed by \(j = 1, \dots, N\). Let \(n_j\) denote the number of neurons in layer \(j\).
For each layer, we define:
\(\mathbf{h}^{(j)} \in \mathbb{R}^{n_j}\): the output (activation vector) of layer \(j\). Each component corresponds to the output of one neuron.
\(\mathbf{z}^{(j)} \in \mathbb{R}^{n_j}\): the pre-activation signal at layer \(j\), obtained before applying the activation function.
\(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{\,n_{j-1} \times n_j}\): the weight matrix connecting layer \(j-1\) to layer \(j\). Each column contains the weights associated with one neuron in layer \(j\).
\(\mathbf{b}^{(j)} \in \mathbb{R}^{n_j}\): the bias vector of layer \(j\).
By convention, the input layer is denoted by \[ \mathbf{h}^{(0)} = \mathbf{x} \in \mathbb{R}^{n_0}. \]
where \(\mathbf{x}\) is the input vector.
At each layer, the network performs two steps:
A linear transformation: \[ \mathbf{z}^{(j)} = \mathbf{W}^{(j-1,j)\top} \mathbf{h}^{(j-1)} + \mathbf{b}^{(j)} \]
A nonlinear transformation: \[ \mathbf{h}^{(j)} = f^{(j)}\left(\mathbf{z}^{(j)}\right) \]
Combining both steps, the forward propagation at layer \(j\) can be written as: \[ \mathbf{h}^{(j)} = f^{(j)}\left(\mathbf{W}^{(j-1,j)\top} \mathbf{h}^{(j-1)} + \mathbf{b}^{(j)}\right) \]
This recursive structure shows that the output of each layer becomes the input of the next, allowing the network to build increasingly complex representations of the data.
As illustrated in Figure 4.1, each layer applies the same fundamental operation: a linear combination of inputs followed by a nonlinear activation.
A loss function quantifies the discrepancy between the predicted output of the network and the true target values.
Using the notation introduced above, let:
\(\mathbf{x} = \mathbf{h}^{(0)} \in \mathbb{R}^{n_0}\): input vector, where \(n_0\) denotes the number of input features (i.e., the number of neurons in the input layer)
\(\mathbf{h}^{(N)} = \hat{\mathbf{y}} \in \mathbb{R}^{n_N}\): network output, where \(n_N\) denotes the number of output neurons in the final layer
\(\mathbf{y} \in \mathbb{R}^{n_N}\): true target vector, defined in the same output space as the prediction
Thus, the loss function is defined as:
\[ L(\mathbf{y}, \hat{\mathbf{y}}) \]
For regression problems, a common choice is the mean squared error, defined as:
\[ L(\mathbf{y}, \hat{\mathbf{y}}) \quad = \quad \frac{1}{2} \|\mathbf{y} \; -\; \hat{\mathbf{y}}\|_2^2 \quad =\quad \frac{1}{2} \sum_{i=1}^{n_N} (y_i \;-\; \hat{y}_i)^2 \]
\(\mathbf{y}, \hat{\mathbf{y}} \in \mathbb{R}^{n_N}\).
Each component corresponds to one output neuron.
The gradient with respect to the prediction is:
\[ \frac{\partial L}{\partial \hat{\mathbf{y}}} = \hat{\mathbf{y}} - \mathbf{y} \]
This vector has dimension \(\mathbb{R}^{n_N}\) and is the starting point for backpropagation.
In the literature, the MSE is sometimes defined without the factor \(\tfrac{1}{2}\) or averaged over the output dimension (e.g., dividing by \(n_N\)).
All these versions differ only by a constant factor, which does not affect the optimal solution, but only rescales the gradient.
The factor \(\tfrac{1}{2}\) is used here because it simplifies the derivative, leading to cleaner expressions in backpropagation.
For classification problems (e.g., with softmax output), a common choice is:
\[ L(\mathbf{y}, \hat{\mathbf{y}}) \quad = \quad - \sum_{i=1}^{n_N} y_i \log(\hat{y}_i) \]
where:
\(\mathbf{y}\) is typically a one-hot encoded vector.
\(\hat{\mathbf{y}}\) represents predicted probabilities.
Before introducing the error measures, we consider a classical benchmark in neural networks: the exclusive OR (XOR) problem.
The XOR function is defined as:
| \(x_1\) | \(x_2\) | \(y\) | Interpretation |
|---|---|---|---|
| 0 | 0 | 0 | Same inputs → 0 |
| 0 | 1 | 1 | Different inputs → 1 |
| 1 | 0 | 1 | Different inputs → 1 |
| 1 | 1 | 0 | Same inputs → 0 |
The output is 1 if the inputs are different, and 0 otherwise.
The XOR problem is fundamental because:
It is not linearly separable.
A single-layer network cannot solve it.
It requires at least one hidden layer.
This makes XOR a minimal example to illustrate how neural networks learn nonlinear patterns.
The following figures illustrate the XOR dataset in \(\mathbb{R}^2\). Points belonging to different classes cannot be separated by a single straight line.
We first present the XOR dataset using its four canonical observations.
library(ggplot2)
library(patchwork)
xor_simple <- data.frame(
x1 = c(0, 0, 1, 1),
x2 = c(0, 1, 0, 1),
y = factor(c(0, 1, 1, 0))
)
base_theme <- theme_minimal(base_size = 20) +
theme(
panel.grid.minor = element_line(color = "gray92"),
panel.grid.major = element_line(color = "gray85"),
# Axis
axis.title = element_text(size = 26), # Titles
axis.text = element_text(size = 16), # Numbers
# Titles (a), (b)
plot.title = element_text(size = 22, face = "bold", hjust = 0),
# Legends
legend.title = element_text(size = 20),
legend.text = element_text(size = 18),
# Spaces
plot.margin = margin(2, 2, 2, 2),
legend.margin = margin(0, 0, 0, 0),
legend.box.margin = margin(0, 0, 0, 0)
)
# Plot (a)
p1 <- ggplot(xor_simple, aes(x = x1, y = x2, color = y)) +
geom_point(size = 9) +
geom_abline(intercept = 0.5, slope = 0,
linetype = "dashed", color = "darkorange2", linewidth = 1.1) +
geom_abline(intercept = 0, slope = 1,
linetype = "dashed", color = "purple", linewidth = 1.1) +
scale_color_manual(
values = c("0" = "#C0392B", "1" = "#2E86C1"),
labels = c("Class 0", "Class 1")
) +
scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
coord_fixed() +
labs(
title = "(a) Single linear boundaries",
x = expression(x[1]),
y = expression(x[2]),
color = "Output"
) +
base_theme +
theme(plot.margin = margin(2, 17, 2, 2)) # More space (right): margin(top, right, bottom, left)
# Plot (b)
p2 <- ggplot(xor_simple, aes(x = x1, y = x2, color = y)) +
geom_point(size = 9) +
geom_abline(intercept = 0.5, slope = -1,
linetype = "dashed", color = "darkgreen", linewidth = 1.1) +
geom_abline(intercept = 1.5, slope = -1,
linetype = "dashed", color = "darkgreen", linewidth = 1.1) +
scale_color_manual(
values = c("0" = "#C0392B", "1" = "#2E86C1"),
labels = c("Class 0", "Class 1")
) +
scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
coord_fixed() +
labs(
title = "(b) Multiple-boundary partition",
x = expression(x[1]),
y = expression(x[2]),
color = "Output"
) +
base_theme +
theme(plot.margin = margin(2, 2, 2, 17)) # More space (left): margin(top, right, bottom, left)
# Final combination
(p1 + p2) +
plot_layout(guides = "collect") &
# Legends
theme(
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 20),
legend.text = element_text(size = 18)
)
Figure 5.1: Geometric intuition for XOR. (a) Single linear boundaries fail to separate the classes. (b) Multiple linear boundaries can partition the input space, anticipating the role of hidden neurons.
In Figure 5.1, we observe two distinct situations:
Panel (a). The dashed lines represent examples of single linear decision boundaries.
Each line divides the input space into two half-planes, but neither one is able to separate the two classes correctly.
In both cases, points from different classes remain on the same side of the boundary.
This illustrates why a single linear classifier cannot solve the XOR problem.
Panel (b). The dashed lines should not be interpreted as a single linear classifier, but rather as a combination of linear boundaries that partition the input space into regions. This anticipates the role of hidden neurons in a multilayer network:
To reinforce this geometric intuition, Figure 5.2 shows a denser version of the same XOR pattern.
The additional points do not represent new logical cases. They are sampled from the unit square and labeled according to the same XOR rule in order to make the spatial structure more visible.
The shaded regions are included only as a visual guide:
Pink regions regions correspond to Class 0.
Light blue regions regions correspond to Class 1.
library(ggplot2)
set.seed(123)
n <- 240
x1 <- runif(n)
x2 <- runif(n)
y <- ifelse((x1 <= 0.5 & x2 > 0.5) | (x1 > 0.5 & x2 <= 0.5), 1, 0)
xor_dense <- data.frame(
x1 = x1,
x2 = x2,
y = factor(y)
)
ggplot(xor_dense, aes(x = x1, y = x2, color = y)) +
annotate("rect", xmin = 0.00, xmax = 0.50, ymin = 0.50, ymax = 1.00,
fill = "#FADBD8", alpha = 0.35, color = NA) +
annotate("rect", xmin = 0.50, xmax = 1.00, ymin = 0.00, ymax = 0.50,
fill = "#FADBD8", alpha = 0.35, color = NA) +
annotate("rect", xmin = 0.00, xmax = 0.50, ymin = 0.00, ymax = 0.50,
fill = "#D6EAF8", alpha = 0.35, color = NA) +
annotate("rect", xmin = 0.50, xmax = 1.00, ymin = 0.50, ymax = 1.00,
fill = "#D6EAF8", alpha = 0.35, color = NA) +
geom_point(size = 2.2, alpha = 0.9) +
scale_color_manual(
values = c("0" = "#C0392B", "1" = "#2E86C1"),
labels = c("Class 0", "Class 1")
) +
scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) +
coord_fixed() +
labs(
x = expression(x[1]),
y = expression(x[2]),
color = "Output"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "right",
panel.grid.minor = element_line(color = "gray93"),
panel.grid.major = element_line(color = "gray85")
)
Figure 5.2: A denser visualization of the XOR pattern in the input space.
Together, these figures make the structure of XOR clear. The points \((0,0)\) and \((1,1)\) belong to Class 0, whereas the points \((0,1)\) and \((1,0)\) belong to Class 1. Because the classes occupy alternating diagonal positions in the input space, no single boundary of the form
\[ w_1 x_1 + w_2 x_2 + b = 0 \]
can separate them.
This confirms that XOR is not linearly separable and explains why solving it requires a model capable of combining multiple linear boundaries through nonlinear transformations.
The XOR dataset exhibits an alternating diagonal class structure in the input space. Because this arrangement cannot be separated by a single straight line, a linear model is insufficient. This limitation motivates the introduction of hidden layers and nonlinear activation functions, which allow the network to transform the input space into a representation where the classes become separable.
To overcome the limitation of linear separability observed in the previous section (see Figure 5.1), we introduce a neural network with a hidden layer. This additional layer allows the model to construct nonlinear transformations of the input space, making it possible to separate patterns such as XOR.
To solve the XOR problem, we consider a simple feedforward architecture composed of:
\(n_0 = 2\) input neurons.
One hidden layer (e.g., \(n_1 = 2\) neurons).
\(n_N = 1\) output neuron.
Thus: \[ \mathbf{x} \in \mathbb{R}^2, \quad \hat{y} \in \mathbb{R}, \quad y \in \mathbb{R} \]
Figure 5.3 illustrates this architecture. The input variables \((x_1, x_2)\) are first mapped into a hidden representation through two intermediate neurons \((h_1, h_2)\). Each hidden neuron defines a linear transformation of the input, and their combined effect enables the network to partition the space into multiple regions.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6,3))
# Posiciones
positions = {
"x1": (0,1),
"x2": (0,0),
"h1": (1,1),
"h2": (1,0),
"y": (2,0.5)
}
# Dibujar conexiones
connections = [
("x1","h1"), ("x1","h2"),
("x2","h1"), ("x2","h2"),
("h1","y"), ("h2","y")
]
for a,b in connections:
x1,y1 = positions[a]
x2,y2 = positions[b]
ax.arrow(x1,y1, x2-x1, y2-y1,
length_includes_head=True,
head_width=0.05, color="black")
# Dibujar nodos
for node,(x,y) in positions.items():
ax.scatter(x,y, s=1200, color="#D6EAF8", edgecolors="black")
label = "ŷ" if node=="y" else node
ax.text(x,y,label, ha='center', va='center', fontsize=12, fontweight='bold')
ax.set_xlim(-0.2, 2.2);
ax.set_ylim(-0.1, 1.1);
ax.axis('off');
plt.tight_layout(pad=0.2)
plt.show()
Figure 5.3: A simple neural network architecture for solving the XOR problem.
These intermediate representations are then aggregated by the output neuron \(\hat{y}\), producing the final prediction. As discussed in the geometric interpretation, the key idea is that multiple linear transformations, when combined, can induce a nonlinear decision boundary.
This structure will be essential in the following sections, where we show how these transformations are computed and how the network learns them from data.
In the manual training example (using a single XOR observation), we track different quantities to understand how the network learns.
In this section, we focus on the scalar case to simplify the exposition. These quantities serve different purposes:
Some are useful for interpretation.
Others are required for optimization.
We consider the following:
\[ |\hat{y} - y| \]
This quantity measures the magnitude of the prediction error in a direct and intuitive way. It is useful for monitoring performance but is not typically used for optimization due to its non-smooth derivative.
\[ \text{cost} = \frac{1}{2}(\hat{y} - y)^2 \]
This is the loss function minimized during training (for a single observation).
\[ \frac{\partial \, \text{cost}}{\partial \hat{y}} = \hat{y} - y \]
This quantity measures how the loss changes with respect to the prediction.
To update the model parameters, we ultimately need the derivative of the loss with respect to each parameter \(w\). This is obtained by combining:
The gradient of the loss with respect to the prediction (e.g., \(\hat{y} - y\) for the squared error), and
The derivative of the prediction with respect to the parameter, \(\frac{\partial \hat{y}}{\partial w}\).
Together, these terms determine how the error signal propagates backward through the network.
A detailed explanation of this process (backpropagation) will be provided in a later section.
The gradient of the loss is not the same as the gradient of the prediction.
The former measures how the loss changes with respect to the prediction, whereas the latter measures how the prediction changes with respect to the model parameters, through derivatives of the form
\[ \frac{\partial \hat{y}}{\partial w} \]
which measure how the network output changes with respect to its parameters.
The absolute error helps us interpret the prediction.
The loss function defines what we optimize.
The gradient determines how we update the model.
These concepts naturally lead to the question of how parameter updates are controlled during training.
The learning rate, denoted by \(\alpha > 0\), controls the magnitude of parameter updates during optimization.
At each iteration, the model parameters are adjusted in the direction opposite to the gradient, and the learning rate determines the size of this step.
Formally, for a parameter \(w\), the update rule in gradient descent is given by:
\[ w \;\leftarrow\; w \;-\; \alpha \, \frac{\partial \text{cost}}{\partial w}. \]
Here,
\[ \frac{\partial \, \text{cost}}{\partial w} \; =\; \frac{\partial \, \text{cost}}{\partial \hat{y}} \cdot \frac{\partial \, \hat{y}}{\partial w} \]
denotes the derivative of the loss (or cost) function with respect to the parameter \(w\). This quantity measures how the loss changes with respect to the prediction \(\hat{y}\).
A more detailed derivation of these gradients will be presented in later sections.
In general, the learning rate plays a central role in optimization:
It determines the step size in parameter space
It controls the speed of convergence
It affects the stability of the learning process
Choosing an appropriate learning rate is therefore essential for effective training.
The learning rate \(\alpha\) directly influences how quickly the model learns from the error signal:
If \(\alpha\) is too large, the updates may overshoot the minimum, leading to oscillations or divergence.
If \(\alpha\) is too small, the updates become very slow, resulting in inefficient learning.
This trade-off highlights the importance of selecting an appropriate learning rate to balance stability and convergence speed.
To illustrate the effect of the learning rate, consider the simple loss function
\[ L(w) = w^2, \]
whose graph is shown in Figure 6.1. This function has a single global minimum at \(w = 0\) and increases symmetrically as \(|w|\) grows.
# Plot of the quadratic loss function
import numpy as np
import matplotlib.pyplot as plt
def loss(w):
return w**2
w = np.linspace(-4, 4, 400)
L = loss(w)
plt.figure(figsize=(6, 4))
plt.plot(w, L, linewidth=2)
plt.axhline(0, color="gray", linewidth=1, alpha=0.6)
plt.axvline(0, color="gray", linewidth=1, alpha=0.6)
plt.xlabel("Parameter $w$")
plt.ylabel("Loss $L(w)$")
plt.grid(True)
plt.tight_layout()
plt.show()
Figure 6.1: Quadratic loss function \(L(w) = w^2\). The minimum is located at \(w=0\).
Its gradient is
\[ \frac{dL}{dw} = 2w. \] Figure 6.2 shows both the loss function and its gradient.
import numpy as np
import matplotlib.pyplot as plt
def loss(w):
return w**2
def grad(w):
return 2*w
w = np.linspace(-4, 4, 400)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# -------------------
# (a) Loss
# -------------------
axes[0].plot(w, loss(w), linewidth=2)
axes[0].axhline(0, color="gray", alpha=0.6)
axes[0].axvline(0, color="gray", alpha=0.6)
axes[0].set_title("(a) Loss function $L(w)=w^2$")
axes[0].set_xlabel("Parameter $w$")
axes[0].set_ylabel("Loss")
axes[0].grid(True)
# -------------------
# (b) Gradient
# -------------------
axes[1].plot(w, grad(w), linewidth=2, color="tab:red")
axes[1].axhline(0, color="gray", alpha=0.6)
axes[1].axvline(0, color="gray", alpha=0.6)
axes[1].set_title("(b) Gradient $\\frac{dL}{dw}=2w$")
axes[1].set_xlabel("Parameter $w$")
axes[1].set_ylabel("Gradient")
axes[1].grid(True)
plt.tight_layout()
plt.show()
Figure 6.2: (a) Quadratic loss function \(L(w)=w^2\). (b) Its gradient \(\frac{dL}{dw}=2w\).
Note that the gradient is proportional to the distance from the minimum, which leads to exponential convergence under appropriate learning rates.
While the loss function has a minimum at \(w=0\), the gradient increases linearly with the distance from the minimum. This means that updates are larger when the parameter is far from the optimum and become smaller as it approaches the minimum.
This property explains why gradient descent exhibits exponential convergence under appropriate learning rates.
Applying gradient descent gives the update rule
\[ w_{t+1} = w_t - \alpha \, 2w_t. \]
At this stage, the update rule is introduced only to illustrate how the learning rate affects the parameter updates. A more detailed treatment of the gradient descent algorithm will be provided in a later section.
This simple structure makes it possible to clearly observe how different learning rates affect the trajectory of the parameter as it moves toward the minimum, as illustrated in the next section.
Figure 6.3 illustrates the behavior of gradient descent under different learning rates:
When the learning rate is too small, the parameter approaches the minimum very slowly.
When the learning rate is moderate, convergence is stable and efficient.
When the learning rate is too large, the updates overshoot the minimum and may oscillate or even diverge.
import numpy as np
import matplotlib.pyplot as plt
# Loss and gradient
def loss(w):
return w**2
def grad(w):
return 2*w
def simulate_gd(alpha, w0=2, n_iter=12):
ws = [w0]
ls = [loss(w0)]
w = w0
for _ in range(n_iter):
w = w - alpha * grad(w)
ws.append(w)
ls.append(loss(w))
return np.array(ws), np.array(ls)
configs = [
(0.05, "Small α"),
(0.30, "Moderate α"),
(1.10, "Large α")
]
colors = ["tab:green", "tab:red", "tab:blue"]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Common curve for trajectory plots
w_grid = np.linspace(-10, 10, 400)
loss_grid = loss(w_grid)
# Avoid exact zero for log-scale panels
loss_grid_log = loss_grid.copy()
loss_grid_log[loss_grid_log < 1e-1] = np.nan
# -----------------------------
# (a) Trajectories - linear scale
# -----------------------------
axes[0, 0].plot(w_grid, loss_grid, color="gray", linewidth=1.5, alpha=0.4)
for (alpha, label), color in zip(configs, colors):
ws, ls = simulate_gd(alpha)
axes[0, 0].plot(ws, ls, marker="o", label=label, color=color, linewidth=2)
axes[0, 0].set_title("(a) Gradient descent trajectories (linear scale)")
axes[0, 0].set_xlabel("Parameter w")
axes[0, 0].set_ylabel("Loss L(w)")
axes[0, 0].set_ylim(0, 100);
axes[0, 0].legend()
# -----------------------------
# (b) Loss by iteration - linear scale
# -----------------------------
for (alpha, label), color in zip(configs, colors):
ws, ls = simulate_gd(alpha)
axes[0, 1].plot(range(len(ls)), ls, marker="o", label=label, color=color, linewidth=2)
axes[0, 1].set_title("(b) Loss value across iterations (linear scale)")
axes[0, 1].set_xlabel("Iteration")
axes[0, 1].set_ylabel("Loss L(w)")
axes[0, 1].set_ylim(0, 100);
axes[0, 1].legend()
# -----------------------------
# (c) Trajectories - log scale
# -----------------------------
axes[1, 0].plot(w_grid, loss_grid_log, color="gray", linewidth=1.5, alpha=0.4)
for (alpha, label), color in zip(configs, colors):
ws, ls = simulate_gd(alpha)
ls_plot = ls.copy()
ls_plot[ls_plot < 1e-10] = np.nan
axes[1, 0].plot(ws, ls_plot, marker="o", label=label, color=color, linewidth=2)
axes[1, 0].set_title("(c) Gradient descent trajectories (log scale)")
axes[1, 0].set_xlabel("Parameter w")
axes[1, 0].set_ylabel("Loss L(w)")
axes[1, 0].set_ylim(1e-1, 10);
axes[1, 0].set_yscale("log")
axes[1, 0].legend()
# -----------------------------
# (d) Loss by iteration - log scale
# -----------------------------
for (alpha, label), color in zip(configs, colors):
ws, ls = simulate_gd(alpha)
axes[1, 1].plot(range(len(ls)), ls, marker="o", label=label, color=color, linewidth=2)
axes[1, 1].set_title("(d) Loss value across iterations (log scale)")
axes[1, 1].set_xlabel("Iteration")
axes[1, 1].set_ylabel("Loss L(w)")
axes[1, 1].set_ylim(1e-2, 10);
axes[1, 1].set_yscale("log")
axes[1, 1].legend()
plt.tight_layout()
plt.show()
Figure 6.3: Effect of the learning rate on gradient descent under linear and logarithmic vertical scales. Panels (a) and (b) use a linear scale to show global behavior, while panels (c) and (d) use a logarithmic scale to reveal convergence patterns near zero.
Figure 6.3 provides a complementary perspective by combining linear and logarithmic vertical scales.
In panels (a) and (b), which use a linear scale, the trajectories corresponding to small and moderate learning rates become compressed near zero. As a result, differences in their convergence behavior are difficult to distinguish once the loss becomes small. In contrast, the large learning rate dominates the scale due to divergence.
In panels (c) and (d), which use a logarithmic scale, the behavior near zero is magnified. This transformation reveals important differences that are not visible in the linear scale:
The small learning rate (green) exhibits a smooth and gradual exponential decay.
The moderate learning rate (red) converges faster, showing a steeper linear trend in the log scale, which is characteristic of efficient exponential convergence.
The large learning rate (blue) clearly oscillates and fails to stabilize, confirming its instability.
Thus, the logarithmic scale provides a more informative view of the convergence dynamics, especially when the loss approaches very small values.
For visualization purposes, two complementary strategies are used:
A linear scale highlights the global behavior of the trajectories, including divergence and large oscillations.
A logarithmic scale emphasizes the fine-grained behavior near zero, allowing a clearer comparison of convergence rates.
This dual representation is particularly useful because convergence in gradient-based optimization is often exponential, which appears approximately linear under a logarithmic transformation.
To address the limitations of a fixed learning rate, several adaptive optimization methods have been proposed:
Adaptive Gradient Algorithm (AdaGrad) (Duchi et al., 2011).
Accumulates all past squared gradients, leading to progressively smaller learning rates.
Root Mean Square Propagation (RMSProp) (Tieleman and Hinton, 2012).
Uses an exponentially decaying average of squared gradients to avoid excessive shrinkage.
Optional reading: here.
ADADELTA (Zeiler, 2012).
Eliminates the need to manually set a global learning rate by adapting updates based on past gradients.
Adaptive Moment Estimation (Adam) (Kingma and Ba, 2014).
Combines momentum and adaptive scaling to achieve fast and stable convergence.
While more recent methods have been proposed, most of them are extensions or refinements of Adam rather than entirely new paradigms. In practice, Adam remains a central and widely used optimizer in modern deep learning.
In other words, Adam is not obsolete: it continues to serve as the foundation for many current optimization methods.
AdamW (Loshchilov and Hutter, 2019).
Decouples weight decay from the gradient update, improving generalization.
Nadam (Dozat, 2016).
Incorporates Nesterov momentum into Adam for improved convergence.
AMSGrad (Reddi et al., 2018).
Modifies Adam to ensure better theoretical convergence guarantees.
These methods adjust the effective learning rate during training based on past gradients. They can be seen as extensions of gradient descent that adapt the step size dynamically during training.
In the following sections, we focus on a subset of these methods to illustrate their main ideas. The remaining variants follow similar principles and are left for further exploration.
AdaGrad (Duchi et al., 2011) adapts the learning rate for each parameter by scaling it inversely proportional to the accumulated squared gradients. Parameters that receive large gradients get smaller updates over time.
\[ G_t \;= \; \sum_{i=1}^t \,g_i^2, \quad\quad w_{t+1} \; =\; w_t \,-\, \frac{\alpha}{\sqrt{G_t \,+\, \epsilon}} \, g_t \]
Here,
\(g_t = \frac{\partial \text{cost}}{\partial w}(w_t)\) denotes the gradient evaluated at the parameter value \(w_t\).
\(G_t = \sum_{i=1}^t g_i^2\) is the accumulated sum of past squared gradients, which keeps track of how large the gradients have been over time and is used to scale the update step.
\(\epsilon\) is a small constant for numerical stability (e.g., \(10^{-8}\)).
Unlike RMSProp, this quantity grows continuously over time, which causes the effective learning rate to shrink.
Frequent updates → smaller steps
Since \(G_t\) keeps growing over time, the effective learning rate can become extremely small, causing the updates to slow down significantly or even stop.
Suppose the gradient remains approximately constant, \(g_t \approx 1\). Then
\[ G_t \; = \; \sum_{i=1}^t g_i^2 \; \approx \; t, \]
In this case, the effective step size behaves as
\[ \text{step size} \; \approx \; \frac{\alpha}{\sqrt{t + \epsilon}} \]
However, since \(\epsilon\) is a very small constant (e.g., \(10^{-8}\)), it has negligible effect for \(t \ge 1\). Therefore, for intuition, we can approximate:
\[ \text{step size} \; \approx \; \frac{\alpha}{\sqrt{t}}. \]
As \(t\) increases, the step size decreases:
\(t = 1 \quad \Rightarrow \quad \frac{\alpha}{\sqrt{1}} = \alpha\).
\(t = 100 \quad \Rightarrow \quad \frac{\alpha}{10}\).
\(t = 10{,}000 \quad \Rightarrow \quad \frac{\alpha}{100}\).
This means that updates become progressively smaller, eventually slowing down learning significantly. In other words, AdaGrad keeps shrinking the learning rate even when it may no longer be desirable.
This behavior is illustrated in Figure 7.1. Panel (a) shows that the accumulated term \(G_t\) grows continuously over time, while panel (b) shows that the effective step size decreases accordingly.
Thus, the simplified expression \(\frac{\alpha}{\sqrt{t}}\) captures the essential behavior observed in the figure.
This limitation motivates the use of methods such as RMSProp, which avoid the continuous shrinkage of the effective learning rate.
import numpy as np
import matplotlib.pyplot as plt
# Parameters
alpha = 1.0
epsilon = 1e-8
T = 10000
# Constant gradient
g = 1.0
# Storage
t_vals = np.arange(1, T + 1)
G_vals = np.zeros(T)
step_vals = np.zeros(T)
# AdaGrad recursion
G = 0.0
for i in range(T):
G = G + g**2
G_vals[i] = G
step_vals[i] = alpha / np.sqrt(G + epsilon)
# Create figure
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# -----------------------------
# (a) Accumulated term G_t
# -----------------------------
axes[0].plot(t_vals, G_vals, linewidth=2.5)
axes[0].set_title(r"(a) Accumulated term $G_t$", fontsize=21)
axes[0].set_xlabel(r"Iteration $t$", fontsize=19)
axes[0].set_ylabel(r"$G_t$", fontsize=19)
axes[0].tick_params(axis='both', labelsize=15)
axes[0].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
for spine in ["top", "right"]:
axes[0].spines[spine].set_visible(False)
axes[0].spines["left"].set_color("gray")
axes[0].spines["bottom"].set_color("gray")
axes[0].spines["left"].set_linewidth(0.8)
axes[0].spines["bottom"].set_linewidth(0.8)
# -----------------------------
# (b) Effective step size
# -----------------------------
axes[1].plot(t_vals, step_vals, linewidth=2.5)
# Highlight specific points
points_t = [1, 100, 10000]
points_y = alpha / np.sqrt(np.array(points_t) + epsilon)
axes[1].scatter(points_t, points_y, s=80, zorder=3)
# Labels for key points
offsets = [(60, 0.02), (200, 0.01), (200, 0.002)]
for (x, y), (dx, dy) in zip(zip(points_t, points_y), offsets):
axes[1].text(x + dx, y + dy, f"t={x}", fontsize=15)
axes[1].set_title(r"(b) Effective step size", fontsize=21)
axes[1].set_xlabel(r"Iteration $t$", fontsize=19)
axes[1].set_ylabel(r"$\alpha / \sqrt{G_t + \epsilon}$", fontsize=18)
axes[1].tick_params(axis='both', labelsize=15)
axes[1].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
for spine in ["top", "right"]:
axes[1].spines[spine].set_visible(False)
axes[1].spines["left"].set_color("gray")
axes[1].spines["bottom"].set_color("gray")
axes[1].spines["left"].set_linewidth(0.8)
axes[1].spines["bottom"].set_linewidth(0.8)
plt.tight_layout()
plt.show()
Figure 7.1: Illustration of AdaGrad under an approximately constant gradient \(g_t \approx 1\). Panel (a) shows that the accumulated term \(G_t\) grows over time. Panel (b) shows that the effective step size \(\alpha / \sqrt{G_t + \epsilon}\) decreases steadily.
RMSProp (Tieleman and Hinton, 2012) addresses the limitation of AdaGrad by replacing the cumulative sum with an exponentially decaying average of past squared gradients. This allows the method to “forget” old gradients and maintain a more stable learning rate.
\[ v_t \;=\; \beta v_{t-1} + (1 - \beta) \, g_t^2, \quad\quad w_{t+1} \;=\; w_t \;-\; \frac{\alpha}{\sqrt{v_t + \epsilon}} \, g_t \]
Here,
\(g_t = \frac{\partial \text{cost}}{\partial w}(w_t)\) denotes the gradient evaluated at the parameter value \(w_t\).
\(v_t\) is an exponentially decaying average of past squared gradients, which captures the scale of recent gradients and is used to normalize the update step.
\(\beta \in (0,1)\) controls how fast past gradients are forgotten.
\(\epsilon\) is a small constant for numerical stability.
These quantities are hyperparameters:
They are not learned from the data, but instead are chosen beforehand.
In practice, commonly used values are \(\beta \approx 0.9\) and \(\epsilon \approx 10^{-8}\).
They can be viewed as knobs that control how the algorithm behaves during training.
Focus on recent gradients instead of all history.
Requires tuning of the decay parameter \(\beta\), and may still exhibit instability if the hyperparameters are not well chosen.
Suppose the gradient remains approximately constant, \(g_t \approx 1\). Then the moving average becomes
\[ v_t \;=\; \beta v_{t-1} \,+\, (1 - \beta) \cdot 1 \]
In this case, \(v_t\) converges to a constant value instead of growing indefinitely, as in AdaGrad.
As a result, the effective step size
\[ \frac{\alpha}{\sqrt{v_t}} \]
stabilizes over time rather than continuously decreasing.
This prevents the learning rate from vanishing and allows the algorithm to keep making meaningful updates during training. This behavior is illustrated in Figure 7.2. Panel (a) shows that the moving average \(v_t\) approaches a stable value, while panel (b) shows that the effective step size also stabilizes instead of shrinking indefinitely.
import numpy as np
import matplotlib.pyplot as plt
# Parameters
alpha = 1.0
beta = 0.9
epsilon = 1e-8
T = 50
# Constant gradient
g = 1.0
# Storage
t_vals = np.arange(1, T + 1)
v_vals = np.zeros(T)
step_vals = np.zeros(T)
# RMSProp recursion
v = 0.0
for t in range(T):
v = beta * v + (1 - beta) * g**2
v_vals[t] = v
step_vals[t] = alpha / np.sqrt(v + epsilon)
# Plot
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# (a) v_t
axes[0].plot(t_vals, v_vals, linewidth=2)
axes[0].axhline(1.0, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[0].set_title("(a) Moving average $v_t$", fontsize=15)
axes[0].set_xlabel("Iteration $t$", fontsize=14)
axes[0].set_ylabel("$v_t$", fontsize=14)
axes[0].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
axes[0].tick_params(axis='both', labelsize=11)
# soften borders
for spine in ["top", "right"]:
axes[0].spines[spine].set_visible(False)
axes[0].spines["left"].set_color("gray")
axes[0].spines["bottom"].set_color("gray")
# (b) effective step size
axes[1].plot(t_vals, step_vals, linewidth=2)
axes[1].axhline(1.0, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[1].set_title("(b) Effective step size", fontsize=15)
axes[1].set_xlabel("Iteration $t$", fontsize=14)
axes[1].set_ylabel(r"$\alpha / \sqrt{v_t + \epsilon}$", fontsize=14)
axes[1].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
axes[1].tick_params(axis='both', labelsize=11)
for spine in ["top", "right"]:
axes[1].spines[spine].set_visible(False)
axes[1].spines["left"].set_color("gray")
axes[1].spines["bottom"].set_color("gray")
plt.tight_layout()
plt.show()
Figure 7.2: Illustration of RMSProp under an approximately constant gradient \(g_t \approx 1\). Panel (a) shows that the exponentially decaying average \(v_t\) approaches a stable value. Panel (b) shows that the effective step size \(\alpha / \sqrt{v_t + \epsilon}\) also stabilizes over time.
For the constant-gradient case with \(g_t = 1\) and \(\beta=0.9\):
\(v_t\) tends to \(1\).
Therefore, \(\frac{\alpha}{\sqrt{v_t \, +\, \epsilon}}\) tends to \(\alpha\).
This shows clearly that RMSProp does not collapse in the same way as AdaGrad. For a constant gradient, the moving average approaches a finite limit, so the effective step size does not keep shrinking indefinitely.
Adam (Kingma and Ba, 2014) combines ideas from momentum and RMSProp by maintaining exponentially decaying averages of both the gradients and their squared values. This allows the method to adapt the learning rate while also incorporating information about the direction of past updates.
\[ m_t \;=\; \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \quad\quad v_t \;=\; \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \]
\[ w_{t+1} \;=\; w_t \;-\; \frac{\alpha}{\sqrt{v_t + \epsilon}} \, m_t \]
Here,
\(g_t = \frac{\partial \text{cost}}{\partial w}(w_t)\) denotes the gradient at iteration \(t\).
\(m_t\) is an exponentially decaying average of past gradients, capturing the direction of the updates (momentum).
\(v_t\) is an exponentially decaying average of squared gradients, capturing the scale of recent gradients.
\(\beta_1, \beta_2 \in (0,1)\) control how fast past information is forgotten.
\(\epsilon\) is a small constant for numerical stability.
These quantities are hyperparameters. In practice, commonly used values are:
\[ \beta_1 \approx 0.9, \qquad \beta_2 \approx 0.999, \qquad \epsilon \approx 10^{-8} \]
Combine momentum (direction) and adaptive scaling (variance) to produce stable and efficient updates.
Requires tuning of multiple hyperparameters and may sometimes lead to suboptimal generalization compared to simpler methods such as SGD with momentum.
Suppose the gradient remains approximately constant, \(g_t \approx 1\). Then, both \(m_t\) and \(v_t\) follow exponential moving average recursions of a constant value.
\[ m_t \;=\; \beta_1 m_{t-1} \,+\, (1 \,-\, \beta_1) g_t \quad \approx\quad \beta_1 m_{t-1} \,+\, (1 \,-\, \beta_1) \]
Rewriting,
\[ m_t \,-\, 1 \;=\; \beta_1 (m_{t-1} \,-\, 1). \]
This recursive relation can be expanded step by step:
\[ m_t \,-\, 1 \;=\; \beta_1^2 (m_{t-2} \,- \, 1) \;=\;\cdots \;=\; \beta_1^t (m_0 \,-\, 1), \]
where \(m_0\) is the initial value of the moving average.
Since \(0 < \beta_1 < 1\), it follows that \(\beta_1^t \to 0\), and therefore \(m_t \to 1\).
In summary, as \(t\) increases:
\[ m_t \;\to\; 1, \quad \text{and}\quad v_t \;\to\; 1 \]
In this case, the update rule becomes approximately
\[ w_{t+1} \;\approx\; w_t \;-\; \alpha. \]
This shows that the effective step size stabilizes around \(\alpha\) instead of shrinking over time.
This shows that Adam stabilizes both the direction and the scale of the updates, avoiding the continuous shrinkage observed in AdaGrad.
Unlike AdaGrad, the effective step size does not vanish over time, and unlike RMSProp, the update direction is smoothed through momentum.
This behavior is illustrated in Figure 7.3. Panel (a) shows that the first moment estimate \(m_t\) converges to 1, while panel (b) shows that the second moment estimate \(v_t\) also converges to 1. Panel (c) shows that the effective step size stabilizes instead of vanishing over time.
import numpy as np
import matplotlib.pyplot as plt
# Parameters
alpha = 1.0
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
T = 300
# Constant gradient
g = 1.0
# Storage
t_vals = np.arange(1, T + 1)
m_vals = np.zeros(T)
v_vals = np.zeros(T)
step_vals = np.zeros(T)
# Initialization
m = 0.0
v = 0.0
# Adam recursion
for t in range(T):
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_vals[t] = m
v_vals[t] = v
step_vals[t] = alpha / np.sqrt(v + epsilon)
# Create figure
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# -----------------------------
# (a) m_t
# -----------------------------
axes[0].plot(t_vals, m_vals, linewidth=2.5)
axes[0].axhline(1, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[0].set_title(r"(a) First moment $m_t$", fontsize=20)
axes[0].set_xlabel(r"Iteration $t$", fontsize=16)
axes[0].set_ylabel(r"$m_t$", fontsize=16)
axes[0].tick_params(axis='both', labelsize=14)
axes[0].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
# soften borders
for spine in ["top", "right"]:
axes[0].spines[spine].set_visible(False)
axes[0].spines["left"].set_color("gray")
axes[0].spines["bottom"].set_color("gray")
# -----------------------------
# (b) v_t
# -----------------------------
axes[1].plot(t_vals, v_vals, linewidth=2.5)
axes[1].axhline(1, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[1].set_title(r"(b) Second moment $v_t$", fontsize=20)
axes[1].set_xlabel(r"Iteration $t$", fontsize=16)
axes[1].set_ylabel(r"$v_t$", fontsize=16)
axes[1].tick_params(axis='both', labelsize=14)
axes[1].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
for spine in ["top", "right"]:
axes[1].spines[spine].set_visible(False)
axes[1].spines["left"].set_color("gray")
axes[1].spines["bottom"].set_color("gray")
# -----------------------------
# (c) Step size
# -----------------------------
axes[2].plot(t_vals, step_vals, linewidth=2.5)
axes[2].axhline(alpha, color="gray", linestyle="--", linewidth=1, alpha=0.7)
axes[2].set_title(r"(c) Effective step size", fontsize=20)
axes[2].set_xlabel(r"Iteration $t$", fontsize=16)
axes[2].set_ylabel(r"$\alpha / \sqrt{v_t + \epsilon}$", fontsize=16)
axes[2].tick_params(axis='both', labelsize=14)
axes[2].grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
for spine in ["top", "right"]:
axes[2].spines[spine].set_visible(False)
axes[2].spines["left"].set_color("gray")
axes[2].spines["bottom"].set_color("gray")
plt.tight_layout()
plt.show()
Figure 7.3: Illustration of Adam under an approximately constant gradient \(g_t \approx 1\). Panel (a) shows that the first moment estimate \(m_t\) converges to 1. Panel (b) shows that the second moment estimate \(v_t\) also converges to 1. Panel (c) shows that the effective step size stabilizes instead of vanishing.
Figure 7.4 provides a visual summary of the main behavior of the three adaptive methods. Table 7.1 then compares their key ideas, update rules, advantages, and limitations.
import numpy as np
import matplotlib.pyplot as plt
# Parameters
alpha = 1.0
epsilon = 1e-8
beta = 0.9
beta1 = 0.9
beta2 = 0.999
T = 300
g = 1.0
t_vals = np.arange(1, T + 1)
# AdaGrad
G = 0.0
adagrad_step = np.zeros(T)
for i in range(T):
G += g**2
adagrad_step[i] = alpha / np.sqrt(G + epsilon)
# RMSProp
v_rms = 0.0
rmsprop_step = np.zeros(T)
for i in range(T):
v_rms = beta * v_rms + (1 - beta) * g**2
rmsprop_step[i] = alpha / np.sqrt(v_rms + epsilon)
# Adam
m = 0.0
v_adam = 0.0
adam_step = np.zeros(T)
for i in range(T):
m = beta1 * m + (1 - beta1) * g
v_adam = beta2 * v_adam + (1 - beta2) * g**2
adam_step[i] = alpha / np.sqrt(v_adam + epsilon)
# Plot
plt.figure(figsize=(8, 5))
plt.plot(t_vals, adagrad_step, linewidth=2.5, label="AdaGrad")
plt.plot(t_vals, rmsprop_step, linewidth=2.5, label="RMSProp")
plt.plot(t_vals, adam_step, linewidth=2.5, label="Adam")
plt.xlabel("Iteration $t$", fontsize=15)
plt.ylabel("Effective step size", fontsize=15)
plt.title("Summary behavior of adaptive methods", fontsize=18)
plt.tick_params(axis='both', labelsize=13)
plt.grid(True, linestyle="--", linewidth=0.5, alpha=0.5)
plt.legend(fontsize=13)
ax = plt.gca()
for spine in ["top", "right"]:
ax.spines[spine].set_visible(False)
ax.spines["left"].set_color("gray")
ax.spines["bottom"].set_color("gray")
ax.spines["left"].set_linewidth(0.8)
ax.spines["bottom"].set_linewidth(0.8)
plt.tight_layout()
plt.show()
Figure 7.4: Conceptual comparison of the effective step size in AdaGrad, RMSProp, and Adam under a simplified constant-gradient setting. AdaGrad shows a continuously decreasing step size, whereas RMSProp and Adam stabilize over time.
The main differences among these adaptive methods are summarized in Table 7.1.
| Method | Key idea | Update rule | Advantage | Limitation |
|---|---|---|---|---|
| AdaGrad (Duchi et al., 2011) | Accumulates all past squared gradients | \(w_{t+1} = w_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} g_t\) | Simple, works well for sparse data | Learning rate shrinks continuously |
| RMSProp (Tieleman and Hinton, 2012) | Uses exponential moving average of squared gradients | \(w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} g_t\) | Prevents learning rate from vanishing too quickly | Requires tuning of decay parameter |
| Adam (Kingma and Ba, 2014) | Combines momentum and adaptive scaling | \(w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} m_t\) | Fast and stable convergence in practice | More complex, more hyperparameters |
In the next section, we examine how the learning rate is incorporated into the gradient descent algorithm and how it interacts with the loss function and its gradient.
To minimize the loss function, neural networks rely on gradient-based optimization. The gradient indicates the direction of steepest increase of the loss function, so moving in the opposite direction reduces its value.
For a parameter vector \(\mathbf{w}\), the update rule is given by
\[ \mathbf{w}^{(t+1)} \;=\; \mathbf{w}^{(t)} \,-\, \alpha \, \nabla \,L(\mathbf{w}^{(t)}), \]
where \(\alpha > 0\) is the learning rate, and \(\nabla L(\mathbf{w}^{(t)})\) is the gradient of the loss function evaluated at the current point.
The process starts from an initial parameter vector \(\mathbf{w}^{(0)}\) (typically chosen randomly) and is iteratively updated for \(t = 0, 1, 2, \dots\).
The gradient \(\nabla L(\mathbf{w})\) generalizes the derivative to multiple dimensions and indicates the direction of steepest increase in the loss.
This vector formulation emphasizes that all parameters are updated simultaneously in the direction of steepest descent.
For a single parameter \(w\), the update rule reduces to
\[ w^{(t+1)} \;=\; w^{(t)} \,- \,\alpha \,\frac{dL}{dw}(w^{(t)}). \]
Here, \(\frac{dL}{dw}\) denotes the standard derivative of the loss function with respect to the scalar parameter \(w\).
This scalar version is consistent with the vector formulation and is useful for building intuition in simple examples.
In practice, neural networks involve many parameters, so the vector formulation is more natural.
This process can be interpreted geometrically as movement over a loss surface toward regions of lower values (see Figures 8.1 or 8.2).
The loss function defines a surface over the parameter space. In this example, the function
\[ L(x,y) = x^2 + y^2 \]
produces a convex, bowl-shaped surface with a unique global minimum at the origin.
Figure 8.1: Gradient descent simulation on \(f(x,y)=x^2+y^2\) from initial point (3,3), learning rate \(\alpha=0.35\), and 10 steps. Adapted from the interactive visualization developed by Llinás and Llinás (2026).
Figure 8.1 provides an initial geometric interpretation of gradient descent on the surface defined by \(L(x,y)=x^2+y^2\).
The blue point represents the initial parameter values \((3,3)\), while the red point represents the final iterate obtained after several iterations of gradient descent.
The trajectory between these points shows how the algorithm progressively updates the parameters in the direction opposite to the gradient, moving toward the minimum at \((0,0)\).
Each update is scaled by the learning rate \(\alpha = 0.35\), which determines the magnitude of the step. A suitable choice of \(\alpha\) allows efficient convergence, whereas poor choices may lead to slow convergence or instability.
The original visualization also includes a table of iteration values, which is presented separately in Table 8.1.
Figure 8.2 provides a complementary view by combining the descent trajectory with the descent vector field and the corresponding level sets. The numerical values reported in Table 8.1 show the evolution of the parameters, the loss, and the partial derivatives at each iteration. Together, these visual and numerical representations illustrate how the updates become progressively smaller as the algorithm approaches the minimum.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
# ==========================================================
# Global font configuration
# ==========================================================
plt.rcParams.update({
"font.size": 30,
"axes.titlesize": 30,
"axes.labelsize": 30,
"xtick.labelsize": 26,
"ytick.labelsize": 26
})
# ==========================================================
# Function and gradient
# ==========================================================
def L(x, y):
return x**2 + y**2
def grad(x, y):
return np.array([2*x, 2*y])
# Negative gradient: direction of descent
def neg_grad(x, y):
return np.array([-2*x, -2*y])
# ==========================================================
# Gradient descent parameters
# ==========================================================
alpha = 0.35
steps = 10
x0, y0 = 3.0, 3.0
# ==========================================================
# Gradient descent simulation
# ==========================================================
rows = []
xs = [x0]
ys = [y0]
zs = [L(x0, y0)]
x, y = x0, y0
for n in range(steps + 1):
gx, gy = grad(x, y)
rows.append([n, x, y, L(x, y), gx, gy])
if n < steps:
x = x - alpha * gx
y = y - alpha * gy
xs.append(x)
ys.append(y)
zs.append(L(x, y))
# ==========================================================
# Surface grid
# ==========================================================
grid = np.linspace(-4.5, 4.5, 120)
X, Y = np.meshgrid(grid, grid)
Z = L(X, Y)
vmin = Z.min()
vmax = Z.max()
# ==========================================================
# Vector field grid (coarser for arrows)
# ==========================================================
g = np.linspace(-4.5, 4.5, 19)
Xg, Yg = np.meshgrid(g, g)
Ug, Vg = neg_grad(Xg, Yg) # arrows point toward the minimum
# ==========================================================
# Figure layout
# ==========================================================
fig = plt.figure(figsize=(25, 17))
gs = fig.add_gridspec(1, 2, width_ratios=[2.4, 2.4], wspace=0.2)
# ==========================================================
# (a) 3D surface + descent field + trajectory
# ==========================================================
ax1 = fig.add_subplot(gs[0, 0], projection='3d')
surf = ax1.plot_surface(
X, Y, Z,
cmap="viridis",
alpha=0.4,
edgecolor="k",
linewidth=0.35,
vmin=vmin,
vmax=vmax
)
# Descent vector field projected onto z = 0
ax1.quiver(
Xg, Yg, 0,
Ug, Vg, 0,
color="black",
length=0.35,
normalize=True
)
# Trajectory
ax1.plot(xs, ys, zs, color="darkred", linewidth=6)
ax1.scatter(xs, ys, zs, color="red", s=36)
# Highlight initial and final points
ax1.scatter(xs[0], ys[0], zs[0], color="blue", s=295)
ax1.scatter(xs[-1], ys[-1], zs[-1], color="red", s=295)
# Label selected iterations
label_idx = [0, 1, 2, 5, 10]
for i in label_idx:
ax1.text(
xs[i] + 0.08, ys[i] + 0.08, zs[i] + 0.8, f"{i}",
fontsize=10, color="black"
)
# Initial / final labels
ax1.text( # 🔵 Initial point
xs[0] + 0.7, ys[0] + 0.6, zs[0] - 7.0,
"Initial \npoint", color="blue", fontsize=29, fontweight="bold"
)
ax1.text( # 🔴 Final point
xs[-1] - 0.7, ys[-1] - 0.8, zs[-1] - 11,
"Final \npoint", color="red", fontsize=29, fontweight="bold"
)
ax1.set_title("(a) Surface of $L(x,y)$ with descent field and trajectory", pad=10)
ax1.set_xlabel("x", labelpad=15)
ax1.set_ylabel("y", labelpad=15)
ax1.set_zlabel("$L(x,y)$", labelpad=20)
ax1.tick_params(axis='x', pad=5)
ax1.tick_params(axis='y', pad=5)
ax1.zaxis.set_tick_params(pad=10)
ax1.view_init(elev=24, azim=-63)
# ==========================================================
# (b) Contours + descent field + trajectory
# ==========================================================
ax2 = fig.add_subplot(gs[0, 1])
levels = np.linspace(vmin, vmax, 10)
cont = ax2.contour(
X, Y, Z,
levels=levels,
cmap="viridis",
linewidths=1.6,
vmin=vmin,
vmax=vmax
)
# Descent vector field in 2D
ax2.quiver(
Xg, Yg, Ug, Vg,
color="black",
angles="xy",
scale_units="xy",
scale=28,
width=0.0035
)
# Trajectory
ax2.plot(xs, ys, color="red", linewidth=6)
ax2.scatter(xs, ys, color="red", s=42, zorder=3)
# Initial and final points
ax2.scatter(xs[0], ys[0], color="blue", s=295, zorder=4)
ax2.scatter(xs[-1], ys[-1], color="red", s=295, zorder=4)
# Label selected iterations
for i in label_idx:
ax2.text(
xs[i] + 0.10, ys[i] + 0.10, f"{i}",
fontsize=10, color="black",
path_effects=[pe.withStroke(linewidth=2, foreground="white")]
)
# Initial / final labels
ax2.text(
xs[0] + 0.35, ys[0] + 0.25, "Initial \npoint",
color="blue", fontsize=29, fontweight="bold"
)
ax2.text(
xs[-1] - 0.55, ys[-1] - 0.85, "Final \npoint",
color="red", fontsize=29, fontweight="bold"
)
ax2.set_title("(b) Level sets of $L(x,y)$ with descent field and trajectory", pad=10)
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.set_aspect("equal", adjustable="box")
# ==========================================================
# Shared horizontal colorbar
# ==========================================================
cbar_ax = fig.add_axes([0.36, 0.09, 0.28, 0.025]) #[left, bottom, width, height]
cbar = fig.colorbar(surf, cax=cbar_ax, orientation="horizontal")
cbar.set_label("$L(x,y)$", labelpad=2)
# ==========================================================
# Main title and spacing
# ==========================================================
fig.suptitle(
"Gradient descent on $L(x,y)=x^2+y^2$",
fontsize=34,
y=0.97
)
fig.subplots_adjust(
top=0.86,
bottom=0.18,
left=0.05,
right=0.98
)
plt.show()
Figure 8.2: Gradient descent on \(L(x,y)=x^2+y^2\) from \((3,3)\) with learning rate \(\alpha=0.35\) and \(10\) steps. The left panel shows the surface together with the descent trajectory and a descent vector field projected onto the plane. The right panel shows the corresponding level sets, the same trajectory, and arrows pointing toward the minimum.
The corresponding numerical values are shown in the following table, which reports the evolution of the parameters, the loss, and the partial derivatives at each iteration.
import pandas as pd
df_descenso = pd.DataFrame(
rows,
columns=["n", "x", "y", "L(x,y)", "∂L/∂x", "∂L/∂y"]
).round(3)
library(kableExtra)
df_descenso_r <- py$df_descenso
kable(
df_descenso_r,
align = "cccccc",
escape = FALSE,
col.names = c("n", "x", "y", "$L(x,y)$",
"$\\partial L / \\partial x$", "$\\partial L / \\partial y$"
)) %>%
kable_styling() %>%
kable_classic_2(full_width = FALSE) %>%
row_spec(0, background = "lightgreen", bold = TRUE) %>%
row_spec(1, color = "Darkblue", bold = TRUE) %>% # 🔵 Initial point
row_spec(9, color = "red", bold = TRUE) # 🔴 Final point
| n | x | y | \(L(x,y)\) | \(\partial L / \partial x\) | \(\partial L / \partial y\) |
|---|---|---|---|---|---|
| 0 | 3.000 | 3.000 | 18.000 | 6.000 | 6.000 |
| 1 | 0.900 | 0.900 | 1.620 | 1.800 | 1.800 |
| 2 | 0.270 | 0.270 | 0.146 | 0.540 | 0.540 |
| 3 | 0.081 | 0.081 | 0.013 | 0.162 | 0.162 |
| 4 | 0.024 | 0.024 | 0.001 | 0.049 | 0.049 |
| 5 | 0.007 | 0.007 | 0.000 | 0.015 | 0.015 |
| 6 | 0.002 | 0.002 | 0.000 | 0.004 | 0.004 |
| 7 | 0.001 | 0.001 | 0.000 | 0.001 | 0.001 |
| 8 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 9 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 10 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Although gradient descent provides the general mechanism for updating parameters, neural networks typically involve a large number of interconnected weights organized in layers. The forward propagation structure illustrated in Figure 4.1 also serves as the computational graph over which gradients are propagated backward through successive applications of the chain rule.
To formalize this, we recall the layer-wise notation introduced earlier:
\(n_j\): number of neurons in layer \(j\).
\(\mathbf{h}^{(j)} \in \mathbb{R}^{n_j}\): the output (activation vector) of layer \(j\). Each component corresponds to the output of one neuron.
\(\mathbf{z}^{(j)} \in \mathbb{R}^{n_j}\): the pre-activation signal at layer \(j\), obtained before applying the activation function.
Each layer computes:
\[ \mathbf{z}^{(j)} \;=\; \mathbf{W}^{(j-1,j)\top} \,\mathbf{h}^{(j-1)} \,+\, \mathbf{b}^{(j)}, \qquad \mathbf{h}^{(j)} \;=\; f^{(j)}(\mathbf{z}^{(j)}), \]
where:
\(f^{(j)}\) denotes the activation function applied element-wise.
\(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{\,n_{j-1} \times n_j}\): the weight matrix connecting layer \(j-1\) to layer \(j\). Each column contains the weights associated with one neuron in layer \(j\).
\(\mathbf{b}^{(j)} \in \mathbb{R}^{n_j}\): the bias vector of layer \(j\).
This formulation generalizes the single-neuron model introduced earlier, where \(z = \mathbf{w}^\top \mathbf{x} + b\).
Backpropagation is the algorithm used to compute gradients efficiently in multilayer neural networks. It applies the chain rule to propagate error information from the output layer back to earlier layers.
For each layer \(j\), define the local error:
\[ \boldsymbol{\delta}^{(j)} = \frac{\partial L}{\partial \mathbf{z}^{(j)}} \in \mathbb{R}^{n_j}. \]
The term \(\boldsymbol{\delta}^{(j)}\) represents how sensitive the loss function is to changes in the pre-activation signal at layer \(j\).
Recall that the loss function \(L(\mathbf{y}, \hat{\mathbf{y}})\) was introduced earlier to quantify the discrepancy between predictions and targets. The local error \(\boldsymbol{\delta}^{(j)}\) is therefore derived from this loss and propagates its influence through the network.
For the output layer \(j = N\), the local error depends directly on the loss function. For example, in the case of mean squared error (MSE),
\[ \boldsymbol{\delta}^{(N)} = (\hat{\mathbf{y}} - \mathbf{y}) \odot f^{(N)\prime}(\mathbf{z}^{(N)}). \]
This provides the starting point for the backward propagation of errors.
Conceptually, backpropagation can be viewed as a systematic application of the chain rule over the composition of functions defined by the network.
For hidden layers, the error is propagated backward according to
\[ \boldsymbol{\delta}^{(j-1)} \; =\; \underbrace{\left(\mathbf{W}^{(j-1,j)} \boldsymbol{\delta}^{(j)}\right)}_{\in \,\mathbb{R}^{n_{j-1}}} \;\odot\; \underbrace{f^{(j-1)\prime}(\mathbf{z}^{(j-1)})}_{\in \,\mathbb{R}^{n_{j-1}}}. \]
where:
\(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{n_{j-1} \times n_j}\)
\(\boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_j}\)
\(f^{(j-1)\prime}\) denotes the derivative of the activation function with respect to its input, applied element-wise. That is,
\[ f^{(j-1)\prime}(\mathbf{z}^{(j-1)}) \; =\; \begin{pmatrix} \frac{d}{dz}f^{(j-1)}(z_1) \\ \vdots \\ \frac{d}{dz}f^{(j-1)}(z_{n_{j-1}}) \end{pmatrix}. \]
\[ \mathbf{A} \odot \mathbf{B} = \left( a_{ij} b_{ij} \right)_{1 \le i \le r,\; 1 \le j \le s} \;\in\; \mathbb{R}^{r \times s} \]
where \(\mathbf{A}, \mathbf{B} \in \mathbb{R}^{r \times s}\). That is, each entry of the result is obtained by multiplying the corresponding entries of \(\mathbf{A}\) and \(\mathbf{B}\). This operation is only defined when both operands have the same dimensions.
This contrasts with matrix multiplication, which combines rows and columns rather than corresponding entries.
Let
\[ \mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}. \]
Then
\[ \mathbf{a} \odot \mathbf{b} = \begin{pmatrix} a_1 b_1 \\ a_2 b_2 \\ a_3 b_3 \end{pmatrix}. \]
That is, each component of the resulting vector is obtained by multiplying the corresponding components.
The dimensions of the variables involved are:
Thus,
\[ \mathbf{W}^{(j-1,j)} \boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_{j-1}}, \]
which matches the dimension of \(\boldsymbol{\delta}^{(j-1)}\).
This dimensional consistency guarantees that the backward propagation of errors is well-defined across layers.
Using the local error terms, the gradients with respect to the model parameters are:
\[ \frac{\partial L}{\partial \mathbf{W}^{(j-1,j)}} = \mathbf{h}^{(j-1)} \boldsymbol{\delta}^{(j)\top}, \qquad \frac{\partial L}{\partial \mathbf{b}^{(j)}} = \boldsymbol{\delta}^{(j)}. \]
where:
Thus,
\[ \mathbf{h}^{(j-1)} \boldsymbol{\delta}^{(j)\top} \in \mathbb{R}^{n_{j-1} \times n_j}, \]
which matches the dimension of \(\mathbf{W}^{(j-1,j)}\).
This recursive computation allows gradients to be obtained efficiently, making training feasible even for deep neural networks.
The expressions above generalize the single-neuron formulation introduced earlier:
\[ z = \mathbf{w}^\top \mathbf{x} + b. \]
In this case:
This corresponds to a network with:
Thus, backpropagation can be interpreted as a systematic extension of gradient computation from a single neuron to multilayer architectures.
Gradient-based methods rely on first-order information. That is, they use only the gradient of the loss function, without explicitly considering curvature.
More advanced methods incorporate second-order information (e.g., Hessians), but they are often computationally expensive in high-dimensional models.
Although gradient descent is conceptually simple, its practical behavior is strongly influenced by the geometry of the loss surface associated with neural networks.
In high-dimensional models, the loss function defines a complex landscape where each point corresponds to a particular configuration of the model parameters. This surface is typically highly non-convex, which gives rise to several important challenges during optimization.
The loss surface may contain a variety of geometric features, including:
Local minima, where the gradient vanishes but the solution is not globally optimal.
Saddle points, where the gradient is zero but there exist directions of both ascent and descent.
Plateaus, where the gradient is very small and learning progresses slowly.
Narrow valleys, where curvature differs significantly across directions.
In high-dimensional settings, saddle points are often more prevalent than local minima and can significantly slow down the training process.
These phenomena can be visualized as movement across a complex surface, where the optimization trajectory may oscillate, stagnate, or progress unevenly depending on the local geometry.
A fundamental challenge in deep neural networks arises from the propagation of gradients through multiple layers.
Because gradients are computed through repeated application of the chain rule, their magnitude may either decrease or increase exponentially as they propagate backward through the network.
In the vanishing gradient case, gradients become extremely small, preventing early layers from learning effectively.
In the exploding gradient case, gradients grow excessively large, leading to numerical instability.
Both situations can hinder the training process and affect convergence.
Several strategies have been developed to mitigate these issues, including:
The use of activation functions that preserve gradient flow (e.g., ReLU and its variants).
Careful weight initialization methods (e.g., Xavier or He initialization).
Normalization techniques such as Batch Normalization.
These approaches help stabilize training and improve convergence behavior in deep neural networks.
Overall, the effectiveness of gradient-based optimization depends not only on the algorithm itself, but also on the interplay between model architecture, activation functions, and the geometry of the loss surface.
These challenges motivate the development of more advanced optimization methods, which incorporate additional information about the curvature of the loss surface.
While gradient descent provides a simple and effective mechanism for training neural networks, it relies exclusively on first-order information, namely the gradient of the loss function. Although this approach is widely used in practice, it does not fully exploit the geometric structure of the optimization problem.
In many settings, additional information about the curvature of the loss surface can be used to accelerate convergence and improve stability. This leads to the study of second-order optimization methods, which incorporate not only gradients but also higher-order derivatives.
In previous sections, parameters were introduced individually (e.g., \(w_n\)). In more general settings, it is convenient to represent all model parameters collectively as a vector \(\boldsymbol{\theta}\). Let \(L(\boldsymbol{\theta})\) denote a differentiable loss function, where \(\boldsymbol{\theta}\) represents the vector of model parameters.
To formalize this idea, we introduce the notion of the gradient, which generalizes partial derivatives to vector-valued parameter spaces. The gradient of the loss function is defined as
\[ \nabla L(\boldsymbol{\theta}) = \left( \frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_N} \right)^\top, \]
and provides first-order information about the direction of steepest increase.
In contrast, the Hessian matrix captures second-order information:
\[ H(\boldsymbol{\theta}) = \left[ \frac{\partial^2 L}{\partial \theta_i \, \partial \theta_j} \right]_{i,j=1}^N. \]
The Hessian describes the local curvature of the loss surface and plays a central role in more advanced optimization methods.
While gradient descent uses only first-order information, it does not account for how the gradient changes across directions. This limitation motivates the use of second-order methods.
One classical approach that uses second-order information is the Newton-Raphson method, which updates parameters according to
\[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - H(\boldsymbol{\theta}^{(t)})^{-1} \nabla L(\boldsymbol{\theta}^{(t)}). \]
Unlike gradient descent, where the step size is controlled by a scalar learning rate, Newton-type methods adapt the update direction using curvature information. This often leads to faster convergence, particularly near optimal points.
However, computing and inverting the Hessian matrix can be computationally expensive in high-dimensional problems, which limits the direct applicability of these methods in large neural networks.
Gradient descent can be interpreted as a simplified optimization strategy that only uses first-order information. While this makes it computationally efficient and scalable, it may require many iterations to converge.
Second-order methods, on the other hand, incorporate curvature information, allowing for more informed parameter updates. In practice, modern optimization algorithms often seek a balance between these two approaches, combining efficiency with improved convergence properties.
In deep learning, first-order methods such as gradient descent and its variants (e.g., Adam, RMSProp) remain the dominant choice due to their scalability and ease of implementation.
Nevertheless, second-order ideas continue to play an important role in theoretical analysis and in specialized applications where curvature information can be exploited effectively.
This perspective highlights that optimization in neural networks is not limited to a single method, but rather involves a spectrum of approaches with different trade-offs between computational cost and convergence behavior.
This topic will be explored in greater depth in a dedicated document on optimization methods in machine learning.
In this document, we developed the fundamental mathematical framework that enables learning in artificial neural networks. Starting from the definition of loss functions, we established how learning can be formulated as an optimization problem over a high-dimensional parameter space.
We then introduced gradient-based optimization methods, emphasizing how the gradient provides directional information that guides parameter updates toward regions of lower loss. This perspective allowed us to interpret learning geometrically as movement over a loss surface.
A central component of this process is the backpropagation algorithm, which enables the efficient computation of gradients in multilayer architectures. By recursively propagating local error signals across layers, backpropagation makes it possible to train deep neural networks in a computationally feasible manner.
We also examined the role of dimensional consistency and matrix operations, highlighting how vectorized representations and element-wise operations contribute to both the efficiency and clarity of the learning process.
Overall, these components (loss functions, gradient-based optimization, and backpropagation) form the core of modern neural network training. Together, they establish the bridge between mathematical theory and practical learning algorithms in machine learning.
The purpose of this activity is to consolidate the conceptual and mathematical understanding of the learning process in neural networks. In particular, it aims to develop intuition about loss functions, gradient-based optimization, and the role of backpropagation in computing parameter updates.
Consider a feedforward neural network with multiple layers and a differentiable loss function \(L\).
Answer the following questions:
Conceptual understanding
What is the role of a loss function in a neural network?
Why can the training of a neural network be interpreted as an optimization problem?
Gradient interpretation
Define the gradient of a function \(L(\boldsymbol{\theta})\).
Explain why the negative gradient direction is used to update parameters.
Describe geometrically what happens during gradient descent.
Learning rate
What is the role of the learning rate \(\alpha\) in gradient descent?
What happens if \(\alpha\) is too large?
What happens if \(\alpha\) is too small?
Backpropagation intuition
What is the meaning of the local error \(\boldsymbol{\delta}^{(j)}\)?
Why is backpropagation necessary in deep neural networks?
Explain in your own words what it means to “propagate the error backward”.
Dimensional reasoning
Consider the expression:
\[ \boldsymbol{\delta}^{(j-1)} = \left(\mathbf{W}^{(j-1,j)\top}\boldsymbol{\delta}^{(j)}\right) \odot f'(\mathbf{z}^{(j-1)}). \]
Explain why the transpose of \(\mathbf{W}^{(j-1,j)}\) is required.
Justify why the dimensions of all terms are consistent.
What role does the element-wise product play in this expression?
Critical reflection
Why would training be infeasible without backpropagation?
Explain why deep networks require efficient gradient computation.
In your opinion, what is the most important component of the learning process: the loss function, the optimization method, or backpropagation? Justify your answer.
Forward pass
Consider a simple neural network with two layers.
Describe step by step how the forward pass is computed.
Then describe conceptually how the backward pass is performed.
Identify where the gradient is computed and how it is used.
Conceptual design
Propose an intuitive explanation (without formulas) of how a neural network “learns from its mistakes”.
What information is propagated?
How is it used to adjust the model?
Why does this process improve performance over time?
The R Markdown document must be fully reproducible.
All code chunks must execute without errors and regenerate the reported outputs when the document is compiled.
All random seeds (if applicable) must be set to ensure deterministic results.
All library versions used should be clearly reported.
All code must be self-contained within the document. External dependencies are not allowed unless explicitly included.
The use of external functions not included in the document prevents reproducibility, as the analysis cannot be executed independently.
To facilitate the review process and maintain the academic quality of submissions, the following requirements must be met:
The assignment must be submitted in both formats:
HTML
The document must be fully reproducible (see Reproducibility Requirement).
If additional files are used (e.g., modules, datasets, or auxiliary functions), they must:
Be included in the submission,
Be clearly organized, and
Be properly documented.
Multiple versions of the same assignment must not be submitted unless the final version is clearly identified. Submitting multiple unclear versions may hinder the evaluation process.
Timely submission of assignments is strictly required.
Assignments must be submitted by the deadlines specified for each activity.
A grace period of 30 minutes is allowed after the deadline.
Submissions made after the grace period will incur penalties.
Penalty: −1.0 point for each hour or fraction of an hour of delay.
Submissions significantly past the deadline may receive a substantially reduced grade or may not be accepted.