24/04/26
Abstract
Other related documents can be found at Rpubs:: toc.
The previous sections introduced the fundamental components of neural networks. In document An introduction to neural networks (LLinás, 2026), we described the structure of a neural network as a composition of linear transformations across layers. Document activation functions (LLinás, 2026) emphasized the role of activation functions in introducing nonlinearity, enabling the model to capture complex relationships. Finally, document Learning and optimization in neural networks (LLinás, 2026) presented the learning process, where parameters are updated through gradient-based optimization to minimize a loss function.
While these elements may appear conceptually simple when considered independently, their interaction gives rise to rich and sometimes non-intuitive behavior. To build a deeper understanding, it is useful to analyze a concrete example in detail.
A classical benchmark for illustrating the expressive power of neural networks is the exclusive-OR (XOR) problem. The XOR function is not linearly separable, meaning that it cannot be correctly modeled by a single-layer perceptron. This limitation highlights the necessity of introducing hidden layers and nonlinear activation functions.
In this document, we examine how a small neural network can learn the XOR function through a fully explicit and step-by-step computation. The objective is to:
Perform the forward propagation manually,
Compute the loss function,
Derive the gradients using backpropagation, and
Update the parameters iteratively.
This detailed walkthrough provides insight into how the theoretical components introduced earlier operate together in practice.
Artificial neural networks (ANNs) constitute one of the most influential paradigms in modern machine learning and artificial intelligence. Their conceptual origin is loosely inspired by the basic functional principles of biological neurons, which process and transmit information through interconnected networks (Kandel et al., 2013).
In the nervous system, neurons receive electrical or chemical signals through dendrites. These signals are integrated within the cell body (the soma), and if the accumulated stimulus surpasses a certain threshold, the neuron generates an electrical impulse known as an action potential. This signal then travels along the axon and is transmitted to other neurons through synaptic connections.
As illustrated in Figure 2.1, the biological neuron is composed of several key components, including dendrites, the soma, the axon, and the synaptic terminals, each playing a specific role in signal transmission.
Figure 2.1: Human neuron. Source: Created by the author with ChatGPT (OpenAI).
Although artificial neurons are only simplified mathematical abstractions of this biological mechanism, the analogy provides an intuitive conceptual foundation. In essence, both systems combine multiple inputs, evaluate their relative influence, and produce an output response according to a transformation rule.
This biological structure motivates the simplified mathematical abstraction introduced in the next section.
Modern artificial neurons translate the biological intuition of neural signal processing into a mathematical framework suitable for computation and learning. Instead of discrete electrical spikes, artificial neurons compute a weighted linear combination of inputs and then transform this signal through a nonlinear activation function.
Formally, the internal signal of a neuron is defined as
\[ z = \mathbf{w}^\top \mathbf{x} + b, \]
where \(\mathbf{x} \in \mathbb{R}^d\) represents the input vector, \(\mathbf{w} \in \mathbb{R}^d\) denotes the vector of weights, and \(b\) is a bias parameter. The neuron output is then obtained by applying an activation function
\[ h = \varphi(z). \]
The quantity \(z\) can be interpreted as a continuous analogue of synaptic integration, while the function \(\varphi(\cdot)\) generalizes the biological threshold mechanism into a smooth and differentiable transformation. Further details on activation functions can be found in the companion document: Activation functions (LLinás, 2026).
This transition from discrete logical models to continuous optimization was essential for the development of modern neural networks. By allowing gradients to be computed and propagated through the model, neural networks can be trained efficiently using gradient-based learning algorithms.
As a result, ANNs are capable of approximating complex nonlinear mappings and learning expressive internal representations from data. Through iterative adjustments of weights and biases, the model progressively refines its representation of the input space. Consequently, neural networks have become fundamental tools for tasks such as classification, regression, and representation learning in high-dimensional environments (Goodfellow et al., 2016).
As illustrated in Figure 2.2, a neural network is constructed by stacking multiple artificial neurons into layers. The output of each neuron becomes the input of neurons in the next layer, allowing the model to progressively learn more complex representations of the data.
Figure 2.2: Basic architecture of a feedforward neural network. Source: Created by the author with ChatGPT (OpenAI).
Each neuron in the hidden layer computes a transformation of the form \(h = \varphi(\mathbf{w}^\top \mathbf{x} + b)\), illustrating how nonlinear activation functions operate within layered architectures.
Each neuron in the network performs the same basic operation described above, namely computing a weighted sum followed by a nonlinear activation. By composing many such units, the network is able to model highly complex functions.
Activation functions are a fundamental component of artificial neural networks because they introduce nonlinearity into the model. Without nonlinear activation functions, a neural network composed of multiple layers would reduce to an equivalent linear transformation, regardless of the number of layers.
Recall that an artificial neuron computes a linear combination of its inputs
\[ z = \mathbf{w}^\top\mathbf{x} + b, \]
which represents the internal signal of the neuron. The final output is obtained by applying a nonlinear transformation
\[ h = \varphi(z), \]
where \(\varphi(\cdot)\) is called the activation function.
The role of the activation function is to transform the internal signal of the neuron into a response that can capture complex relationships between variables. By introducing nonlinear transformations, neural networks can construct nonlinear decision boundaries and learn rich internal representations of the data.
This capability is supported by classical universal approximation results, which show that neural networks equipped with suitable activation functions can approximate a wide class of functions on compact domains.
From a geometric perspective, activation functions distort the feature space in a controlled way, allowing the model to separate patterns that would otherwise be inseparable using only linear transformations.
Activation functions can be organized into different categories according to their mathematical properties. One common classification distinguishes between monotonic and periodic activation functions.
Monotonic activation functions are functions whose output consistently increases or decreases with respect to the input. These functions have historically been the most widely used in neural networks and include examples such as the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU).
Periodic activation functions exhibit oscillatory behavior and are particularly useful in models designed to capture periodic or high-frequency patterns. These functions are commonly used in specialized neural architectures for representing signals, implicit functions, or spatial fields.
A detailed and comprehensive treatment of activation functions, including their mathematical definitions, graphical behavior, and practical implications, is provided in the following companion document: Activation functions (LLinás, 2026).
In the context of neural networks, many commonly used activation functions exhibit monotonic behavior. The following list summarizes several important examples that will be studied in detail:
Linear function
Identity function
Piecewise linear function
Threshold (Heaviside) function
Sigmoid function
Bipolar sigmoid function
ReLU and its variants (Leaky ReLU, PReLU, RReLU)
ELU and SELU
SoftMax function
Sign function
Maxout function
Softsign function
Elliot function
Hyperbolic tangent (tanh) function
Arctangent function
Lecun’s hyperbolic tangent function
Complementary log-log function
Softplus function
Bent identity function
Soft expopnential function
Each of these functions has different properties in terms of smoothness, differentiability, and practical performance in neural networks.
A more detailed discussion of these activation functions, including their mathematical properties and visual interpretation, is available in the companion document: Activation functions (LLinás, 2026).
The periodic activation functions considered in this section can be naturally grouped into two main categories, depending on the nature of their oscillatory behavior.
Some of these functions (e.g., Fourier or wavelet-based representations) are not activation functions in the strict sense, but rather functional transformations that can be incorporated into neural architectures to capture periodic structure.
Sine wave function.
Cardinal sine function (Sinc).
Fourier transform (FT, DFT/FFT).
Short-time Fourier transform (STFT, Gabor transform).
Wavelet transform.
Gaussian (normal distribution) function.
Square wave function.
Triangle wave function.
Sawtooth wave function.
S-shaped rectified linear unit (SReLU).
Adaptive piecewise linear unit (APLU).
A more detailed discussion of these activation functions, including their mathematical properties and visual interpretation, is available in the companion document: Activation functions (LLinás, 2026).
While activation functions determine how information is transformed within a neural network, they do not specify how the parameters of the model are learned from data. This naturally leads to the question of how neural networks adjust their weights and biases during training.
Before introducing the learning process in neural networks, it is useful to briefly review a few mathematical concepts that will be used throughout this section. In particular, we will rely on derivatives, partial derivatives, and the gradient operator.
These tools allow us to quantify how a function changes with respect to its inputs, which is essential for optimizing model parameters.
In one dimension, the derivative of a function measures how the output changes as the input varies:
\[ \frac{d}{dx} f(x). \]
In neural networks, however, functions typically depend on multiple variables. Let
\[ \mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^n \]
denote the input vector. Each component \(x_i\) can be interpreted as an individual input feature, which is particularly relevant in neural networks where inputs are naturally represented as vectors.
In this setting, we use partial derivatives, which measure how a function changes with respect to one variable while keeping the others fixed:
\[ \frac{\partial f}{\partial x_i}. \]
This notation allows us to analyze the sensitivity of a function with respect to each individual component of its input.
For functions of several variables, partial derivatives can be grouped into a single object known as the gradient.
\[ \nabla f(\mathbf{x}) \in \mathbb{R}^n \quad = \quad \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}. \]
Thus, the gradient can be interpreted as a vector field defined over \(\mathbb{R}^n\), assigning to each point \(\mathbf{x}\) the direction of steepest ascent.
The gradient is a vector that points in the direction of maximum increase of the function.
This concept plays a central role in optimization, as it tells us how to adjust parameters to reduce a loss function.
Intuitively, while the gradient tells us in which direction to move, it does not indicate how the landscape bends in different directions. This limitation motivates the introduction of second-order derivatives.
While the gradient describes the direction of maximum increase of a function, it does not capture how this direction changes across the input space.
To describe this behavior, we introduce second-order derivatives, which measure how the gradient itself varies.
Before introducing the general multivariable case, it is useful to recall the one-dimensional setting.
For a function of a single variable, the second derivative
\[ f''(x) \]
measures how the slope of the function changes, and therefore describes the curvature of the function.
In this case, the Hessian reduces to the \(1 \times 1\) matrix
\[ H_f(x)\;=\; \begin{pmatrix} f''(x) \end{pmatrix} \;=\; f''(x). \]
This shows that the Hessian generalizes the notion of second derivative from one dimension to multiple dimensions.
For a function of several variables, these second-order derivatives are organized into the Hessian matrix:
\[ H_f(\mathbf{x}) \;=\; \left[ \frac{\partial^2 f}{\partial x_i \partial x_j} \right]_{i,j=1}^n \;=\; \underbrace{\begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}}_{n\times n}. \]
The Hessian provides information about the local curvature of the function.
If the curvature is similar in all directions, optimization is relatively stable.
If curvature varies significantly across directions, optimization may become difficult.
This concept will be essential later when discussing more advanced optimization methods.
This notion of curvature will become especially important when analyzing optimization difficulties such as narrow valleys and saddle points, where the Hessian reveals strong differences across directions.
Consider the function:
\[ L(x,y) = x^2 + y^2 \]
Its partial derivatives are:
\[ \frac{\partial L}{\partial x} = 2x, \quad \frac{\partial L}{\partial y} = 2y \]
Thus, the gradient is:
\[ \nabla L(x,y) = (2x, 2y) \]
This vector describes how the function changes at any point \((x,y)\) and, in particular, determines the direction used to adjust parameters during learning.
The Hessian of this function is:
\[ H_L(x,y)= \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}. \]
This matrix is constant and isotropic, implying that the curvature is identical in all directions. As a result, the function behaves like a perfectly symmetric bowl, and gradient-based optimization proceeds smoothly toward the minimum without oscillations.
The contour lines represent the level sets of the function \(L(x,y)\), while the arrows correspond to the gradient vector \(\nabla L(x,y)\) at each point. As shown in Figure 4.1, the gradient vectors point outward from the origin, indicating the direction of maximum increase of the function. Consequently, moving in the opposite direction leads toward the minimum, which in this case is located at \((0,0)\).
Figure 4.1 provides two complementary views of this behavior: a surface representation and its corresponding level sets, both combined with the gradient field.
This observation is fundamental for optimization, as gradient-based methods update parameters by moving in the opposite direction of the gradient.
# ==========================================================
# Function and gradient
# ==========================================================
import numpy as np
import matplotlib.pyplot as plt
# Global font configuration
plt.rcParams.update({
"font.size": 14,
"axes.titlesize": 15,
"axes.labelsize": 14,
"xtick.labelsize": 13,
"ytick.labelsize": 13
})
def L(x, y):
return x**2 + y**2
def grad_L(x, y):
return 2*x, 2*y
# Grid
x = np.linspace(-3, 3, 25)
y = np.linspace(-3, 3, 25)
X, Y = np.meshgrid(x, y)
Z = L(X, Y)
U, V = grad_L(X, Y)
# Shared color scale for both plots
vmin = Z.min()
vmax = Z.max()
# ==========================================================
# Figure (compact and balanced layout)
# ==========================================================
fig = plt.figure(figsize=(14, 6.5))
# Make the 3D plot slightly larger than the contour plot
gs = fig.add_gridspec(1, 2, width_ratios=[1.3, 1], wspace=0.07)
# ----------------------------------------------------------
# (a) Surface plot with gradient field
# ----------------------------------------------------------
ax1 = fig.add_subplot(gs[0, 0], projection='3d')
surf = ax1.plot_surface(
X, Y, Z,
cmap='viridis',
edgecolor='k',
linewidth=0.6,
vmin=vmin, vmax=vmax,
alpha=0.9
)
# Gradient vectors projected onto the plane z = 0
ax1.quiver(
X, Y, 0,
U, V, 0,
color='black',
length=0.22,
normalize=True
)
# Slightly more separation from axes
ax1.set_title("(a) Surface of $L(x,y)$ with gradient field", pad=8)
ax1.set_xlabel("x", labelpad=3)
ax1.set_ylabel("y", labelpad=3)
ax1.set_zlabel("$L(x,y)$", labelpad=3)
# Adjust viewing angle for better perception
ax1.view_init(elev=30, azim=-60)
# ----------------------------------------------------------
# (b) Contour plot with gradient field
# ----------------------------------------------------------
ax2 = fig.add_subplot(gs[0, 1])
levels = np.linspace(vmin, vmax, 10)
cont = ax2.contour(
X, Y, Z,
levels=levels,
cmap='viridis',
vmin=vmin, vmax=vmax,
linewidths=1.4
)
# Gradient vectors
ax2.quiver(
X, Y, U, V,
color='black',
angles='xy',
scale_units='xy',
scale=25,
width=0.0035
)
# space above subtitle
ax2.set_title("(b) Level sets of $L(x,y)$ with gradient field", pad=8)
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.set_aspect('equal', adjustable='box')
# ----------------------------------------------------------
# Centered horizontal colorbar
# ----------------------------------------------------------
# width: [left, bottom, width, height]
cbar_ax = fig.add_axes([0.40, 0.085, 0.30, 0.025])
cbar = fig.colorbar(
surf,
cax=cbar_ax,
orientation='horizontal'
)
cbar.set_label("$L(x,y)$", labelpad=2)
# ----------------------------------------------------------
# Main title with improved spacing
# ----------------------------------------------------------
fig.suptitle(
"Function $L(x,y)=x^2+y^2$ and its gradient field",
fontsize=16,
y=0.975 # more space from subplots
)
# Fine-tuned layout adjustments
fig.subplots_adjust(
top=0.82, # pushes plots down (more separation)
bottom=0.17,
left=0.05,
right=0.98
)
plt.show()
Figure 4.1: Two complementary visualizations of \(L(x,y)=x^2+y^2\) and its gradient field. (a) Surface representation with gradient vectors projected onto the input plane. (b) Level sets with gradient vectors indicating the direction of maximum increase.
Learning in neural networks consists of minimizing a loss function with respect to the model parameters (weights and biases).
To achieve this, we need to compute how the loss changes when the parameters change:
\[ \frac{\partial L}{\partial \mathbf{w}}, \quad \frac{\partial L}{\partial b} \]
These quantities are used to iteratively update the parameters in a direction that reduces the loss.
This idea will be formalized in the next section, where we introduce the learning process in neural networks and show how these derivatives are used to update the model parameters.
While activation functions determine how neurons transform their inputs, they do not by themselves provide a mechanism for learning from data. The learning process requires a way to evaluate prediction quality and a systematic procedure to adjust model parameters.
In supervised learning settings, this process is guided by a loss function, which quantifies the discrepancy between the predicted output of the network and the true target values.
To understand how this optimization process works, we first introduce a few essential mathematical tools that describe how functions change with respect to their inputs.
With these tools in place, we are now ready to formalize how neural networks learn from data.
Before introducing optimization, it is useful to formalize how information flows through a neural network.
Figure 5.1 illustrates the forward propagation mechanism in a multilayer neural network. At each layer, the input signal undergoes an affine transformation followed by a nonlinear activation.
Figure 5.1: Forward propagation in a multilayer neural network. Source: Nuñez (2026)
To formalize the flow of information through a neural network, it is useful to introduce a layer-wise notation.
Consider a neural network composed of layers indexed by \(j = 1, \dots, N\). Let \(n_j\) denote the number of neurons in layer \(j\).
For each layer, we define:
\(\mathbf{h}^{(j)} \in \mathbb{R}^{n_j}\): the output (activation vector) of layer \(j\). Each component corresponds to the output of one neuron.
\(\mathbf{z}^{(j)} \in \mathbb{R}^{n_j}\): the pre-activation signal at layer \(j\), obtained before applying the activation function.
\(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{\,n_{j-1} \times n_j}\): the weight matrix connecting layer \(j-1\) to layer \(j\). Each column contains the weights associated with one neuron in layer \(j\).
\(\mathbf{b}^{(j)} \in \mathbb{R}^{n_j}\): the bias vector of layer \(j\).
By convention, the input layer is denoted by \[ \mathbf{h}^{(0)} = \mathbf{x} \in \mathbb{R}^{n_0}. \]
where \(\mathbf{x}\) is the input vector.
At each layer, the network performs two steps:
A linear transformation: \[ \mathbf{z}^{(j)} = \mathbf{W}^{(j-1,j)\top} \mathbf{h}^{(j-1)} + \mathbf{b}^{(j)} \]
A nonlinear transformation: \[ \mathbf{h}^{(j)} = f^{(j)}\left(\mathbf{z}^{(j)}\right) \]
Combining both steps, the forward propagation at layer \(j\) can be written as: \[ \mathbf{h}^{(j)} = f^{(j)}\left(\mathbf{W}^{(j-1,j)\top} \mathbf{h}^{(j-1)} + \mathbf{b}^{(j)}\right) \]
This recursive structure shows that the output of each layer becomes the input of the next, allowing the network to build increasingly complex representations of the data.
As illustrated in Figure 5.1, each layer applies the same fundamental operation: a linear combination of inputs followed by a nonlinear activation.
A loss function quantifies the discrepancy between the predicted output of the network and the true target values.
Using the notation introduced above, let:
\(\mathbf{x} = \mathbf{h}^{(0)} \in \mathbb{R}^{n_0}\): input vector, where \(n_0\) denotes the number of input features (i.e., the number of neurons in the input layer)
\(\mathbf{h}^{(N)} = \hat{\mathbf{y}} \in \mathbb{R}^{n_N}\): network output, where \(n_N\) denotes the number of output neurons in the final layer
\(\mathbf{y} \in \mathbb{R}^{n_N}\): true target vector, defined in the same output space as the prediction
Thus, the loss function is defined as:
\[ L(\mathbf{y}, \hat{\mathbf{y}}) \]
For regression problems, a common choice is the mean squared error, defined as:
\[ L(\mathbf{y}, \hat{\mathbf{y}}) \quad = \quad \frac{1}{2} \|\mathbf{y} \; -\; \hat{\mathbf{y}}\|_2^2 \quad =\quad \frac{1}{2} \sum_{i=1}^{n_N} (y_i \;-\; \hat{y}_i)^2 \]
\(\mathbf{y}, \hat{\mathbf{y}} \in \mathbb{R}^{n_N}\).
Each component corresponds to one output neuron.
The gradient with respect to the prediction is:
\[ \frac{\partial L}{\partial \hat{\mathbf{y}}} = \hat{\mathbf{y}} - \mathbf{y} \]
This vector has dimension \(\mathbb{R}^{n_N}\) and is the starting point for backpropagation.
In the literature, the MSE is sometimes defined without the factor \(\tfrac{1}{2}\) or averaged over the output dimension (e.g., dividing by \(n_N\)).
All these versions differ only by a constant factor, which does not affect the optimal solution, but only rescales the gradient.
The factor \(\tfrac{1}{2}\) is used here because it simplifies the derivative, leading to cleaner expressions in backpropagation.
For classification problems (e.g., with softmax output), a common choice is:
\[ L(\mathbf{y}, \hat{\mathbf{y}}) \quad = \quad - \sum_{i=1}^{n_N} y_i \log(\hat{y}_i) \]
where:
\(\mathbf{y}\) is typically a one-hot encoded vector.
\(\hat{\mathbf{y}}\) represents predicted probabilities.
The learning rate, denoted by \(\alpha > 0\), controls the magnitude of parameter updates during optimization.
At each iteration, the model parameters are adjusted in the direction opposite to the gradient, and the learning rate determines the size of this step.
Formally, for a parameter \(w\), the update rule in gradient descent is given by:
\[ w \;\leftarrow\; w \;-\; \alpha \, \frac{\partial \text{cost}}{\partial w}. \]
Here,
\[ \frac{\partial \, \text{cost}}{\partial w} \; =\; \frac{\partial \, \text{cost}}{\partial \hat{y}} \cdot \frac{\partial \, \hat{y}}{\partial w} \]
denotes the derivative of the loss (or cost) function with respect to the parameter \(w\). This quantity measures how the loss changes with respect to the prediction \(\hat{y}\).
A more detailed derivation of these gradients will be presented in later sections.
In general, the learning rate plays a central role in optimization:
It determines the step size in parameter space
It controls the speed of convergence
It affects the stability of the learning process
Choosing an appropriate learning rate is therefore essential for effective training.
The learning rate \(\alpha\) directly influences how quickly the model learns from the error signal:
If \(\alpha\) is too large, the updates may overshoot the minimum, leading to oscillations or divergence.
If \(\alpha\) is too small, the updates become very slow, resulting in inefficient learning.
This trade-off highlights the importance of selecting an appropriate learning rate to balance stability and convergence speed.
To illustrate the effect of the learning rate, consider the simple loss function
\[ L(w) = w^2, \]
whose graph is shown in Figure 6.1. This function has a single global minimum at \(w = 0\) and increases symmetrically as \(|w|\) grows.
# Plot of the quadratic loss function
import numpy as np
import matplotlib.pyplot as plt
def loss(w):
return w**2
w = np.linspace(-4, 4, 400)
L = loss(w)
plt.figure(figsize=(6, 4))
plt.plot(w, L, linewidth=2)
plt.axhline(0, color="gray", linewidth=1, alpha=0.6)
plt.axvline(0, color="gray", linewidth=1, alpha=0.6)
plt.xlabel("Parameter $w$")
plt.ylabel("Loss $L(w)$")
plt.grid(True)
plt.tight_layout()
plt.show()
Figure 6.1: Quadratic loss function \(L(w) = w^2\). The minimum is located at \(w=0\).
Its gradient is
\[ \frac{dL}{dw} = 2w. \] Figure 6.2 shows both the loss function and its gradient.
import numpy as np
import matplotlib.pyplot as plt
def loss(w):
return w**2
def grad(w):
return 2*w
w = np.linspace(-4, 4, 400)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# -------------------
# (a) Loss
# -------------------
axes[0].plot(w, loss(w), linewidth=2)
axes[0].axhline(0, color="gray", alpha=0.6)
axes[0].axvline(0, color="gray", alpha=0.6)
axes[0].set_title("(a) Loss function $L(w)=w^2$")
axes[0].set_xlabel("Parameter $w$")
axes[0].set_ylabel("Loss")
axes[0].grid(True)
# -------------------
# (b) Gradient
# -------------------
axes[1].plot(w, grad(w), linewidth=2, color="tab:red")
axes[1].axhline(0, color="gray", alpha=0.6)
axes[1].axvline(0, color="gray", alpha=0.6)
axes[1].set_title("(b) Gradient $\\frac{dL}{dw}=2w$")
axes[1].set_xlabel("Parameter $w$")
axes[1].set_ylabel("Gradient")
axes[1].grid(True)
plt.tight_layout()
plt.show()
Figure 6.2: (a) Quadratic loss function \(L(w)=w^2\). (b) Its gradient \(\frac{dL}{dw}=2w\).
Note that the gradient is proportional to the distance from the minimum, which leads to exponential convergence under appropriate learning rates.
While the loss function has a minimum at \(w=0\), the gradient increases linearly with the distance from the minimum. This means that updates are larger when the parameter is far from the optimum and become smaller as it approaches the minimum.
This property explains why gradient descent exhibits exponential convergence under appropriate learning rates.
Applying the gradient descent update rule
\[ w_{t+1} \;=\; w_t \,- \, \alpha \frac{dL}{dw}, \]
and substituting \(\frac{dL}{dw} = 2w\), we obtain
\[ w_{t+1} \;= \, w_t \,-\, \alpha \, 2w_t. \]
At this stage, the update rule is introduced only to illustrate how the learning rate affects the parameter updates. A more detailed treatment of the gradient descent algorithm will be provided in a later section.
This simple structure makes it possible to clearly observe how different learning rates affect the trajectory of the parameter as it moves toward the minimum, as illustrated in the next section.
Figure 6.3 illustrates the behavior of gradient descent under different learning rates:
When the learning rate is too small, the parameter approaches the minimum very slowly.
When the learning rate is moderate, convergence is stable and efficient.
When the learning rate is too large, the updates overshoot the minimum and may oscillate or even diverge.
import numpy as np
import matplotlib.pyplot as plt
# Loss and gradient
def loss(w):
return w**2
def grad(w):
return 2*w
def simulate_gd(alpha, w0=2, n_iter=12):
ws = [w0]
ls = [loss(w0)]
w = w0
for _ in range(n_iter):
w = w - alpha * grad(w)
ws.append(w)
ls.append(loss(w))
return np.array(ws), np.array(ls)
configs = [
(0.05, "Small α"),
(0.30, "Moderate α"),
(1.10, "Large α")
]
colors = ["tab:green", "tab:red", "tab:blue"]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Common curve for trajectory plots
w_grid = np.linspace(-10, 10, 400)
loss_grid = loss(w_grid)
# Avoid exact zero for log-scale panels
loss_grid_log = loss_grid.copy()
loss_grid_log[loss_grid_log < 1e-1] = np.nan
# -----------------------------
# (a) Trajectories - linear scale
# -----------------------------
axes[0, 0].plot(w_grid, loss_grid, color="gray", linewidth=1.5, alpha=0.4)
for (alpha, label), color in zip(configs, colors):
ws, ls = simulate_gd(alpha)
axes[0, 0].plot(ws, ls, marker="o", label=label, color=color, linewidth=2)
axes[0, 0].set_title("(a) Gradient descent trajectories (linear scale)")
axes[0, 0].set_xlabel("Parameter w")
axes[0, 0].set_ylabel("Loss L(w)")
axes[0, 0].set_ylim(0, 100);
axes[0, 0].legend()
# -----------------------------
# (b) Loss by iteration - linear scale
# -----------------------------
for (alpha, label), color in zip(configs, colors):
ws, ls = simulate_gd(alpha)
axes[0, 1].plot(range(len(ls)), ls, marker="o", label=label, color=color, linewidth=2)
axes[0, 1].set_title("(b) Loss value across iterations (linear scale)")
axes[0, 1].set_xlabel("Iteration")
axes[0, 1].set_ylabel("Loss L(w)")
axes[0, 1].set_ylim(0, 100);
axes[0, 1].legend()
# -----------------------------
# (c) Trajectories - log scale
# -----------------------------
axes[1, 0].plot(w_grid, loss_grid_log, color="gray", linewidth=1.5, alpha=0.4)
for (alpha, label), color in zip(configs, colors):
ws, ls = simulate_gd(alpha)
ls_plot = ls.copy()
ls_plot[ls_plot < 1e-10] = np.nan
axes[1, 0].plot(ws, ls_plot, marker="o", label=label, color=color, linewidth=2)
axes[1, 0].set_title("(c) Gradient descent trajectories (log scale)")
axes[1, 0].set_xlabel("Parameter w")
axes[1, 0].set_ylabel("Loss L(w)")
axes[1, 0].set_ylim(1e-1, 10);
axes[1, 0].set_yscale("log")
axes[1, 0].legend()
# -----------------------------
# (d) Loss by iteration - log scale
# -----------------------------
for (alpha, label), color in zip(configs, colors):
ws, ls = simulate_gd(alpha)
axes[1, 1].plot(range(len(ls)), ls, marker="o", label=label, color=color, linewidth=2)
axes[1, 1].set_title("(d) Loss value across iterations (log scale)")
axes[1, 1].set_xlabel("Iteration")
axes[1, 1].set_ylabel("Loss L(w)")
axes[1, 1].set_ylim(1e-2, 10);
axes[1, 1].set_yscale("log")
axes[1, 1].legend()
plt.tight_layout()
plt.show()
Figure 6.3: Effect of the learning rate on gradient descent under linear and logarithmic vertical scales. Panels (a) and (b) use a linear scale to show global behavior, while panels (c) and (d) use a logarithmic scale to reveal convergence patterns near zero.
Figure 6.3 provides a complementary perspective by combining linear and logarithmic vertical scales.
In panels (a) and (b), which use a linear scale, the trajectories corresponding to small and moderate learning rates become compressed near zero. As a result, differences in their convergence behavior are difficult to distinguish once the loss becomes small. In contrast, the large learning rate dominates the scale due to divergence.
In panels (c) and (d), which use a logarithmic scale, the behavior near zero is magnified. This transformation reveals important differences that are not visible in the linear scale:
The small learning rate (green) exhibits a smooth and gradual exponential decay.
The moderate learning rate (red) converges faster, showing a steeper linear trend in the log scale, which is characteristic of efficient exponential convergence.
The large learning rate (blue) clearly oscillates and fails to stabilize, confirming its instability.
Thus, the logarithmic scale provides a more informative view of the convergence dynamics, especially when the loss approaches very small values.
For visualization purposes, two complementary strategies are used:
A linear scale highlights the global behavior of the trajectories, including divergence and large oscillations.
A logarithmic scale emphasizes the fine-grained behavior near zero, allowing a clearer comparison of convergence rates.
This dual representation is particularly useful because convergence in gradient-based optimization is often exponential, which appears approximately linear under a logarithmic transformation.
To address the limitations of a fixed learning rate, several adaptive optimization methods have been proposed:
Adaptive Gradient Algorithm (AdaGrad) (Duchi et al., 2011).
Accumulates all past squared gradients, leading to progressively smaller learning rates.
Root Mean Square Propagation (RMSProp) (Tieleman and Hinton, 2012).
Uses an exponentially decaying average of squared gradients to avoid excessive shrinkage.
Optional reading: here.
ADADELTA (Zeiler, 2012).
Eliminates the need to manually set a global learning rate by adapting updates based on past gradients.
Adaptive Moment Estimation (Adam) (Kingma and Ba, 2014).
Combines momentum and adaptive scaling to achieve fast and stable convergence.
A brief overview of these adaptive optimization methods, including their main ideas and practical intuition, is provided in the companion document: Learning and optimization in neural networks (LLinás, 2026).
While more recent methods have been proposed, most of them are extensions or refinements of Adam rather than entirely new paradigms. In practice, Adam remains a central and widely used optimizer in modern deep learning.
In other words, Adam is not obsolete: it continues to serve as the foundation for many current optimization methods.
AdamW (Loshchilov and Hutter, 2019).
Decouples weight decay from the gradient update, improving generalization.
Nadam (Dozat, 2016).
Incorporates Nesterov momentum into Adam for improved convergence.
AMSGrad (Reddi et al., 2018).
Modifies Adam to ensure better theoretical convergence guarantees.
These methods adjust the effective learning rate during training based on past gradients. They can be seen as extensions of gradient descent that adapt the step size dynamically during training.
In the following sections, we focus on a subset of these methods to illustrate their main ideas. The remaining variants follow similar principles and are left for further exploration.
To minimize the loss function, neural networks rely on gradient-based optimization. The gradient indicates the direction of steepest increase of the loss function, so moving in the opposite direction reduces its value.
For a parameter vector \(\mathbf{w}\), the update rule is given by
\[ \mathbf{w}^{(t+1)} \;=\; \mathbf{w}^{(t)} \,-\, \alpha \, \nabla \,L(\mathbf{w}^{(t)}), \]
where \(\alpha > 0\) is the learning rate, and \(\nabla L(\mathbf{w}^{(t)})\) is the gradient of the loss function evaluated at the current point.
The process starts from an initial parameter vector \(\mathbf{w}^{(0)}\) (typically chosen randomly) and is iteratively updated for \(t = 0, 1, 2, \dots\).
The gradient \(\nabla L(\mathbf{w})\) generalizes the derivative to multiple dimensions and indicates the direction of steepest increase in the loss.
This vector formulation emphasizes that all parameters are updated simultaneously in the direction of steepest descent.
For a single parameter \(w\), the update rule reduces to
\[ w^{(t+1)} \;=\; w^{(t)} \,- \,\alpha \,\frac{dL}{dw}(w^{(t)}). \]
Here, \(\frac{dL}{dw}\) denotes the standard derivative of the loss function with respect to the scalar parameter \(w\).
This scalar version is consistent with the vector formulation and is useful for building intuition in simple examples.
In practice, neural networks involve many parameters, so the vector formulation is more natural.
This process can be interpreted geometrically as movement over a loss surface toward regions of lower values (see Figures 8.1.
The loss function defines a surface over the parameter space.
:::sangria3 In this example, the function
\[ L(x,y) = x^2 + y^2 \]
produces a convex, bowl-shaped surface with a unique global minimum at the origin.
Figure 8.1: Gradient descent simulation on \(f(x,y)=x^2+y^2\) from initial point (3,3), learning rate \(\alpha=0.35\), and 10 steps. Adapted from the interactive visualization developed by LLinás and LLinás (2026).
Figure 8.1 provides an initial geometric interpretation of gradient descent on the surface defined by \(L(x,y)=x^2+y^2\).
The blue point represents the initial parameter values \((3,3)\), while the red point represents the final iterate obtained after several iterations of gradient descent.
The trajectory between these points shows how the algorithm progressively updates the parameters in the direction opposite to the gradient, moving toward the minimum at \((0,0)\).
Each update is scaled by the learning rate \(\alpha = 0.35\), which determines the magnitude of the step. A suitable choice of \(\alpha\) allows efficient convergence, whereas poor choices may lead to slow convergence or instability.
The original visualization also includes a table of iteration values.
Additional details on this example, including the full set of iterations and further visualizations, are provided in the companion document:
Learning and optimization in neural networks (LLinás, 2026).
Although gradient descent provides the general mechanism for updating parameters, neural networks typically involve a large number of interconnected weights organized in layers. The forward propagation structure illustrated in Figure 5.1 also serves as the computational graph over which gradients are propagated backward through successive applications of the chain rule.
To formalize this, we recall the layer-wise notation introduced earlier:
\(n_j\): number of neurons in layer \(j\).
\(\mathbf{h}^{(j)} \in \mathbb{R}^{n_j}\): the output (activation vector) of layer \(j\). Each component corresponds to the output of one neuron.
\(\mathbf{z}^{(j)} \in \mathbb{R}^{n_j}\): the pre-activation signal at layer \(j\), obtained before applying the activation function.
Each layer computes:
\[ \mathbf{z}^{(j)} \;=\; \mathbf{W}^{(j-1,j)\top} \,\mathbf{h}^{(j-1)} \,+\, \mathbf{b}^{(j)}, \qquad \mathbf{h}^{(j)} \;=\; f^{(j)}(\mathbf{z}^{(j)}), \]
where:
\(f^{(j)}\) denotes the activation function applied element-wise.
\(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{\,n_{j-1} \times n_j}\): the weight matrix connecting layer \(j-1\) to layer \(j\). Each column contains the weights associated with one neuron in layer \(j\).
\(\mathbf{b}^{(j)} \in \mathbb{R}^{n_j}\): the bias vector of layer \(j\).
This formulation generalizes the single-neuron model introduced earlier, where \(z = \mathbf{w}^\top \mathbf{x} + b\).
Backpropagation is the algorithm used to compute gradients efficiently in multilayer neural networks. It applies the chain rule to propagate error information from the output layer back to earlier layers.
For each layer \(j\), define the local error:
\[ \boldsymbol{\delta}^{(j)} = \frac{\partial L}{\partial \mathbf{z}^{(j)}} \in \mathbb{R}^{n_j}. \]
The term \(\boldsymbol{\delta}^{(j)}\) represents how sensitive the loss function is to changes in the pre-activation signal at layer \(j\).
Recall that the loss function \(L(\mathbf{y}, \hat{\mathbf{y}})\) was introduced earlier to quantify the discrepancy between predictions and targets. The local error \(\boldsymbol{\delta}^{(j)}\) is therefore derived from this loss and propagates its influence through the network.
For the output layer \(j = N\), the local error depends directly on the loss function. For example, in the case of mean squared error (MSE),
\[ \boldsymbol{\delta}^{(N)} = (\hat{\mathbf{y}} - \mathbf{y}) \odot f^{(N)\prime}(\mathbf{z}^{(N)}). \]
This provides the starting point for the backward propagation of errors.
Conceptually, backpropagation can be viewed as a systematic application of the chain rule over the composition of functions defined by the network.
For hidden layers, the error is propagated backward according to
\[ \boldsymbol{\delta}^{(j-1)} \; =\; \underbrace{\left(\mathbf{W}^{(j-1,j)} \boldsymbol{\delta}^{(j)}\right)}_{\in \,\mathbb{R}^{n_{j-1}}} \;\odot\; \underbrace{f^{(j-1)\prime}(\mathbf{z}^{(j-1)})}_{\in \,\mathbb{R}^{n_{j-1}}}. \]
where:
\(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{n_{j-1} \times n_j}\)
\(\boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_j}\)
\(f^{(j-1)\prime}\) denotes the derivative of the activation function with respect to its input, applied element-wise. That is,
\[ f^{(j-1)\prime}(\mathbf{z}^{(j-1)}) \; =\; \begin{pmatrix} \frac{d}{dz}f^{(j-1)}(z_1) \\ \vdots \\ \frac{d}{dz}f^{(j-1)}(z_{n_{j-1}}) \end{pmatrix}. \]
\[ \mathbf{A} \odot \mathbf{B} = \left( a_{ij} b_{ij} \right)_{1 \le i \le r,\; 1 \le j \le s} \;\in\; \mathbb{R}^{r \times s} \]
where \(\mathbf{A}, \mathbf{B} \in \mathbb{R}^{r \times s}\). That is, each entry of the result is obtained by multiplying the corresponding entries of \(\mathbf{A}\) and \(\mathbf{B}\). This operation is only defined when both operands have the same dimensions.
This contrasts with matrix multiplication, which combines rows and columns rather than corresponding entries.
Let
\[ \mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}. \]
Then
\[ \mathbf{a} \odot \mathbf{b} = \begin{pmatrix} a_1 b_1 \\ a_2 b_2 \\ a_3 b_3 \end{pmatrix}. \]
That is, each component of the resulting vector is obtained by multiplying the corresponding components.
The dimensions of the variables involved are:
Thus,
\[ \mathbf{W}^{(j-1,j)} \boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_{j-1}}, \]
which matches the dimension of \(\boldsymbol{\delta}^{(j-1)}\).
This dimensional consistency guarantees that the backward propagation of errors is well-defined across layers.
Using the local error terms, the gradients with respect to the model parameters are:
\[ \frac{\partial L}{\partial \mathbf{W}^{(j-1,j)}} = \mathbf{h}^{(j-1)} \boldsymbol{\delta}^{(j)\top}, \qquad \frac{\partial L}{\partial \mathbf{b}^{(j)}} = \boldsymbol{\delta}^{(j)}. \]
where:
Thus,
\[ \mathbf{h}^{(j-1)} \boldsymbol{\delta}^{(j)\top} \in \mathbb{R}^{n_{j-1} \times n_j}, \]
which matches the dimension of \(\mathbf{W}^{(j-1,j)}\).
This recursive computation allows gradients to be obtained efficiently, making training feasible even for deep neural networks.
Further details on backpropagation and gradient-based optimization, including step-by-step derivations and practical insights, can be found in the companion document: Learning and optimization in neural networks (LLinás, 2026).
Before introducing the error measures, we consider a classical benchmark in neural networks: the exclusive OR (XOR) problem.
The XOR function is defined as:
| \(x_1\) | \(x_2\) | \(y\) | Interpretation |
|---|---|---|---|
| 0 | 0 | 0 | Same inputs → 0 |
| 0 | 1 | 1 | Different inputs → 1 |
| 1 | 0 | 1 | Different inputs → 1 |
| 1 | 1 | 0 | Same inputs → 0 |
The output is 1 if the inputs are different, and 0 otherwise.
The XOR problem is fundamental because:
It is not linearly separable.
A single-layer network cannot solve it.
It requires at least one hidden layer.
This makes XOR a minimal example to illustrate how neural networks learn nonlinear patterns.
The following figures illustrate the XOR dataset in \(\mathbb{R}^2\). Points belonging to different classes cannot be separated by a single straight line.
We first present the XOR dataset using its four canonical observations.
library(ggplot2)
library(patchwork)
xor_simple <- data.frame(
x1 = c(0, 0, 1, 1),
x2 = c(0, 1, 0, 1),
y = factor(c(0, 1, 1, 0))
)
base_theme <- theme_minimal(base_size = 20) +
theme(
panel.grid.minor = element_line(color = "gray92"),
panel.grid.major = element_line(color = "gray85"),
# Axis
axis.title = element_text(size = 26), # Titles
axis.text = element_text(size = 16), # Numbers
# Titles (a), (b)
plot.title = element_text(size = 22, face = "bold", hjust = 0),
# Legends
legend.title = element_text(size = 20),
legend.text = element_text(size = 18),
# Spaces
plot.margin = margin(2, 2, 2, 2),
legend.margin = margin(0, 0, 0, 0),
legend.box.margin = margin(0, 0, 0, 0)
)
# Plot (a)
p1 <- ggplot(xor_simple, aes(x = x1, y = x2, color = y)) +
geom_point(size = 9) +
geom_abline(intercept = 0.5, slope = 0,
linetype = "dashed", color = "darkorange2", linewidth = 1.1) +
geom_abline(intercept = 0, slope = 1,
linetype = "dashed", color = "purple", linewidth = 1.1) +
scale_color_manual(
values = c("0" = "#C0392B", "1" = "#2E86C1"),
labels = c("Class 0", "Class 1")
) +
scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
coord_fixed() +
labs(
title = "(a) Single linear boundaries",
x = expression(x[1]),
y = expression(x[2]),
color = "Output"
) +
base_theme +
theme(plot.margin = margin(2, 17, 2, 2)) # More space (right): margin(top, right, bottom, left)
# Plot (b)
p2 <- ggplot(xor_simple, aes(x = x1, y = x2, color = y)) +
geom_point(size = 9) +
geom_abline(intercept = 0.5, slope = -1,
linetype = "dashed", color = "darkgreen", linewidth = 1.1) +
geom_abline(intercept = 1.5, slope = -1,
linetype = "dashed", color = "darkgreen", linewidth = 1.1) +
scale_color_manual(
values = c("0" = "#C0392B", "1" = "#2E86C1"),
labels = c("Class 0", "Class 1")
) +
scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(-0.05, 1.05)) +
coord_fixed() +
labs(
title = "(b) Multiple-boundary partition",
x = expression(x[1]),
y = expression(x[2]),
color = "Output"
) +
base_theme +
theme(plot.margin = margin(2, 2, 2, 17)) # More space (left): margin(top, right, bottom, left)
# Final combination
(p1 + p2) +
plot_layout(guides = "collect") &
# Legends
theme(
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 20),
legend.text = element_text(size = 18)
)
Figure 11.1: Geometric intuition for XOR. (a) Single linear boundaries fail to separate the classes. (b) Multiple linear boundaries can partition the input space, anticipating the role of hidden neurons.
In Figure 11.1, we observe two distinct situations:
Panel (a). The dashed lines represent examples of single linear decision boundaries.
Each line divides the input space into two half-planes, but neither one is able to separate the two classes correctly.
In both cases, points from different classes remain on the same side of the boundary.
This illustrates why a single linear classifier cannot solve the XOR problem.
Panel (b). The dashed lines should not be interpreted as a single linear classifier, but rather as a combination of linear boundaries that partition the input space into regions. This anticipates the role of hidden neurons in a multilayer network:
To reinforce this geometric intuition, Figure 11.2 shows a denser version of the same XOR pattern.
The additional points do not represent new logical cases. They are sampled from the unit square and labeled according to the same XOR rule in order to make the spatial structure more visible.
The shaded regions are included only as a visual guide:
Pink regions regions correspond to Class 0.
Light blue regions regions correspond to Class 1.
library(ggplot2)
set.seed(123)
n <- 240
x1 <- runif(n)
x2 <- runif(n)
y <- ifelse((x1 <= 0.5 & x2 > 0.5) | (x1 > 0.5 & x2 <= 0.5), 1, 0)
xor_dense <- data.frame(
x1 = x1,
x2 = x2,
y = factor(y)
)
ggplot(xor_dense, aes(x = x1, y = x2, color = y)) +
annotate("rect", xmin = 0.00, xmax = 0.50, ymin = 0.50, ymax = 1.00,
fill = "#FADBD8", alpha = 0.35, color = NA) +
annotate("rect", xmin = 0.50, xmax = 1.00, ymin = 0.00, ymax = 0.50,
fill = "#FADBD8", alpha = 0.35, color = NA) +
annotate("rect", xmin = 0.00, xmax = 0.50, ymin = 0.00, ymax = 0.50,
fill = "#D6EAF8", alpha = 0.35, color = NA) +
annotate("rect", xmin = 0.50, xmax = 1.00, ymin = 0.50, ymax = 1.00,
fill = "#D6EAF8", alpha = 0.35, color = NA) +
geom_point(size = 2.2, alpha = 0.9) +
scale_color_manual(
values = c("0" = "#C0392B", "1" = "#2E86C1"),
labels = c("Class 0", "Class 1")
) +
scale_x_continuous(breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) +
coord_fixed() +
labs(
x = expression(x[1]),
y = expression(x[2]),
color = "Output"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "right",
panel.grid.minor = element_line(color = "gray93"),
panel.grid.major = element_line(color = "gray85")
)
Figure 11.2: A denser visualization of the XOR pattern in the input space.
Together, these figures make the structure of XOR clear. The points \((0,0)\) and \((1,1)\) belong to Class 0, whereas the points \((0,1)\) and \((1,0)\) belong to Class 1. Because the classes occupy alternating diagonal positions in the input space, no single boundary of the form
\[ w_1 x_1 + w_2 x_2 + b = 0 \]
can separate them.
This confirms that XOR is not linearly separable and explains why solving it requires a model capable of combining multiple linear boundaries through nonlinear transformations.
The XOR dataset exhibits an alternating diagonal class structure in the input space. Because this arrangement cannot be separated by a single straight line, a linear model is insufficient. This limitation motivates the introduction of hidden layers and nonlinear activation functions, which allow the network to transform the input space into a representation where the classes become separable.
To overcome the limitation of linear separability observed in the previous section (see Figure 11.1), we introduce a neural network with a hidden layer. This additional layer allows the model to construct nonlinear transformations of the input space, making it possible to separate patterns such as XOR.
To solve the XOR problem, we consider a simple feedforward architecture composed of:
\(n_0 = 2\) input neurons.
One hidden layer (e.g., \(n_1 = 2\) neurons).
\(n_N = 1\) output neuron.
Thus: \[ \mathbf{x} \in \mathbb{R}^2, \quad \hat{y} \in \mathbb{R}, \quad y \in \mathbb{R} \]
Figure 11.3 illustrates this architecture. The input variables \((x_1, x_2)\) are first mapped into a hidden representation through two intermediate neurons \((h_1, h_2)\). Each hidden neuron defines a linear transformation of the input, and their combined effect enables the network to partition the space into multiple regions.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6,3))
# Posiciones
positions = {
"x1": (0,1),
"x2": (0,0),
"h1": (1,1),
"h2": (1,0),
"y": (2,0.5)
}
# Dibujar conexiones
connections = [
("x1","h1"), ("x1","h2"),
("x2","h1"), ("x2","h2"),
("h1","y"), ("h2","y")
]
for a,b in connections:
x1,y1 = positions[a]
x2,y2 = positions[b]
ax.arrow(x1,y1, x2-x1, y2-y1,
length_includes_head=True,
head_width=0.05, color="black")
# Dibujar nodos
for node,(x,y) in positions.items():
ax.scatter(x,y, s=1200, color="#D6EAF8", edgecolors="black")
label = "ŷ" if node=="y" else node
ax.text(x,y,label, ha='center', va='center', fontsize=12, fontweight='bold')
ax.set_xlim(-0.2, 2.2);
ax.set_ylim(-0.1, 1.1);
ax.axis('off');
plt.tight_layout(pad=0.2)
plt.show()
Figure 11.3: A simple neural network architecture for solving the XOR problem.
These intermediate representations are then aggregated by the output neuron \(\hat{y}\), producing the final prediction. As discussed in the geometric interpretation, the key idea is that multiple linear transformations, when combined, can induce a nonlinear decision boundary.
This structure will be essential in the following sections, where we show how these transformations are computed and how the network learns them from data.
In the manual training example (using a single XOR observation), we track different quantities to understand how the network learns.
In this section, we focus on the scalar case to simplify the exposition. These quantities serve different purposes:
Some are useful for interpretation.
Others are required for optimization.
We consider the following:
\[ |\hat{y} - y| \]
This quantity measures the magnitude of the prediction error in a direct and intuitive way. It is useful for monitoring performance but is not typically used for optimization due to its non-smooth derivative.
\[ \text{cost} = \frac{1}{2}(\hat{y} - y)^2 \]
This is the loss function minimized during training (for a single observation).
\[ \frac{\partial \, \text{cost}}{\partial \hat{y}} = \hat{y} - y \]
This quantity measures how the loss changes with respect to the prediction.
To update the model parameters, we ultimately need the derivative of the loss with respect to each parameter \(w\). This is obtained by combining:
The gradient of the loss with respect to the prediction (e.g., \(\hat{y} - y\) for the squared error), and
The derivative of the prediction with respect to the parameter, \(\frac{\partial \hat{y}}{\partial w}\).
Together, these terms determine how the error signal propagates backward through the network.
A detailed explanation of this process (backpropagation) will be provided in a later section.
The gradient of the loss is not the same as the gradient of the prediction.
The former measures how the loss changes with respect to the prediction, whereas the latter measures how the prediction changes with respect to the model parameters, through derivatives of the form
\[ \frac{\partial \hat{y}}{\partial w} \]
which measure how the network output changes with respect to its parameters.
The absolute error helps us interpret the prediction.
The loss function defines what we optimize.
The gradient determines how we update the model.
These concepts naturally lead to the question of how parameter updates are controlled during training.
The previous section introduced the XOR problem, its geometric interpretation, and the reason why a hidden layer is needed to model its nonlinear structure. We now move from the general architecture to a concrete numerical example.
In this first example, we focus on a single training observation from the XOR table:
\[ x_1 = 0, \qquad x_2 = 1, \qquad y = 1. \]
The goal is to compute, by hand, how this observation is propagated through a small neural network before applying backpropagation. This step is important because the backward pass depends directly on the intermediate quantities produced during forward propagation.
For this example, we consider a neural network with two input neurons, one hidden layer with two neurons, and one output neuron. The logistic function is used as the activation function in both the hidden layer and the output layer.
Figure 12.1 shows the structure of the network used in this manual example. The inputs \((x_1,x_2)\) are first transformed into the hidden activations \((h_1,h_2)\), and these hidden activations are then combined to produce the prediction \(\hat{y}\).
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(9, 6))
# -----------------------------
# Positions
# -----------------------------
pos = {
"x1": (0, 2),
"x2": (0, 0),
"h1": (3, 2),
"h2": (3, 0),
"y": (6, 1),
"b_hidden": (1.6, 4.5),
"b_output": (4.2, 4.5)
}
node_radius = 0.35
# -----------------------------
# Draw nodes
# -----------------------------
for key in ["x1", "x2", "h1", "h2", "y"]:
x, y = pos[key]
circle = plt.Circle((x, y), node_radius, color='crimson')
ax.add_patch(circle)
# -----------------------------
# Draw bias triangles
# -----------------------------
def draw_triangle(center):
x, y = center
triangle = plt.Polygon([
(x, y),
(x - 0.28, y - 0.50),
(x + 0.28, y - 0.50)
], color='gray')
ax.add_patch(triangle)
ax.text(x, y - 0.30, "1", ha='center', va='center',
fontsize=12, color='black')
draw_triangle(pos["b_hidden"])
draw_triangle(pos["b_output"])
# -----------------------------
# Helper: shorten arrows so they do not enter node centers
# -----------------------------
def shorten(p1, p2, start_offset=0.36, end_offset=0.42):
p1 = np.array(p1, dtype=float)
p2 = np.array(p2, dtype=float)
v = p2 - p1
d = np.linalg.norm(v)
u = v / d
return p1 + start_offset * u, p2 - end_offset * u
def arrow(p1, p2, color='black', lw=1.5):
if color == 'green':
q1, q2 = shorten(p1, p2, start_offset=0.55, end_offset=0.50) # arrow size
mscale = 25 # arrow point
width = 1
else:
q1, q2 = shorten(p1, p2, start_offset=0.36, end_offset=0.42)
mscale = 18
width = lw
ax.annotate(
"",
xy=q2, xytext=q1,
arrowprops=dict(
arrowstyle="-|>",
color=color,
lw=width,
mutation_scale=mscale,
shrinkA=0,
shrinkB=0
)
)
# -----------------------------
# Connections
# -----------------------------
arrow(pos["x1"], pos["h1"])
arrow(pos["x1"], pos["h2"])
arrow(pos["x2"], pos["h1"])
arrow(pos["x2"], pos["h2"])
arrow(pos["h1"], pos["y"])
arrow(pos["h2"], pos["y"])
arrow(pos["b_hidden"], pos["h1"], color='green')
arrow(pos["b_hidden"], pos["h2"], color='green')
arrow(pos["b_output"], pos["y"], color='green')
# -----------------------------
# Node labels
# -----------------------------
ax.text(*pos["x1"], r"$X_1$", ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(*pos["x2"], r"$X_2$", ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(*pos["y"], r"$\hat{y}$",
ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(pos["y"][0], pos["y"][1] - 0.6,
r"$y$",
ha='center', va='center',
fontsize=18,
color='black',
bbox=dict(facecolor='#FFF59D', edgecolor='none', boxstyle='round,pad=0.3'))
ax.text(*pos["h1"], r"$h_1$", ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(*pos["h2"], r"$h_2$", ha='center', va='center',
fontsize=18, fontweight='bold')
# -----------------------------
# Weight labels adjusted
# -----------------------------
ax.text(1.45, 2.1, r"$w_1$", fontsize=15, fontweight='bold')
ax.text(0.50, 1.22, r"$w_2$", fontsize=15, fontweight='bold')
ax.text(0.68, 0.7, r"$w_3$", fontsize=15, fontweight='bold')
ax.text(1.45, -0.2, r"$w_4$", fontsize=15, fontweight='bold')
ax.text(4.15, 1.70, r"$w_5$", fontsize=15, fontweight='bold')
ax.text(4.15, 0.12, r"$w_6$", fontsize=15, fontweight='bold')
# -----------------------------
# Bias labels adjusted
# -----------------------------
ax.text(2.4, 3.3, r"$b_1$", color='green',
fontsize=15, fontweight='bold')
ax.text(2.8, 1, r"$b_2$", color='green',
fontsize=15, fontweight='bold')
ax.text(5.05, 2.92, r"$b_3$", color='green',
fontsize=15, fontweight='bold')
# -----------------------------
# Layer labels
# -----------------------------
ax.text(0, -1, r"Input Layer $\in \mathbb{R}^2$", ha='center',
fontsize=15, color='gray')
ax.text(3, -1, r"Hidden Layer $\in \mathbb{R}^2$", ha='center',
fontsize=15, color='gray')
ax.text(6, -1, r"Output Layer $\in \mathbb{R}^1$", ha='center',
fontsize=15, color='gray')
# -----------------------------
# Layout
# -----------------------------
ax.set_xlim(-1, 7);
ax.set_ylim(-1.5, 4.5);
plt.subplots_adjust(top=0.90);
ax.axis('off');
plt.tight_layout()
plt.show()
Figure 12.1: Neural network architecture used for the manual XOR example with two inputs, two hidden neurons, one output neuron, weights, and bias terms.
We will apply the backpropagation framework to adjust the weights and biases so that the network learns to map the input pattern to the correct XOR output. However, before updating the parameters, we first need to determine what the current network predicts for the observation \((x_1,x_2)=(0,1)\).
To do this, we perform a forward pass. First, we compute the net input to each hidden neuron. Then, we apply the logistic activation function to obtain the hidden outputs. Finally, we repeat the same process at the output neuron to obtain the predicted value \(\hat{y}\).
Figure 12.2 presents the initial weights, biases, and training values used in this example. These values define the starting point for the first forward pass of the network.
Mathematically, the input vector and target output are given by
\[ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \qquad y = 1. \]
The initial parameters of the network are defined as follows:
\[ \mathbf{W}^{(0,1)} = \begin{pmatrix} w_1 & w_3 \\ w_2 & w_4 \end{pmatrix} = \begin{pmatrix} 0.1 & -0.7 \\ 0.5 & 0.3 \end{pmatrix}, \qquad \mathbf{b}^{(1)} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \]
\[ \mathbf{W}^{(1,2)} = \begin{pmatrix} w_5 \\ w_6 \end{pmatrix} = \begin{pmatrix} 0.2 \\ 0.4 \end{pmatrix}, \qquad b^{(2)} = b_3 = 0. \]
import matplotlib.pyplot as plt
import numpy as np
w1 = 0.1
w2 = 0.5
w3 = -0.7
w4 = 0.3
w5 = 0.2
w6 = 0.4
b1 = 0
b2 = 0
b3 = 0
x1 = 0
x2 = 1
y_target = 1
size1 = 13 # font size (parameters)
size2 = 13 # font size (legends)
fig, ax = plt.subplots(figsize=(9, 6))
# -----------------------------
# Positions
# -----------------------------
pos = {
"x1": (0, 2),
"x2": (0, 0),
"h1": (3, 2),
"h2": (3, 0),
"y": (6, 1),
"b_hidden": (1.6, 4.5),
"b_output": (4.2, 4.5)
}
node_radius = 0.35
# -----------------------------
# Draw nodes
# -----------------------------
for key in ["x1", "x2", "h1", "h2", "y"]:
x, y = pos[key]
circle = plt.Circle((x, y), node_radius, color='crimson')
ax.add_patch(circle)
# -----------------------------
# Draw bias triangles
# -----------------------------
def draw_triangle(center):
x, y = center
triangle = plt.Polygon([
(x, y),
(x - 0.28, y - 0.50),
(x + 0.28, y - 0.50)
], color='gray')
ax.add_patch(triangle)
ax.text(x, y - 0.30, "1", ha='center', va='center',
fontsize=12, color='black')
draw_triangle(pos["b_hidden"])
draw_triangle(pos["b_output"])
# -----------------------------
# Helper: shorten arrows so they do not enter node centers
# -----------------------------
def shorten(p1, p2, start_offset=0.36, end_offset=0.42):
p1 = np.array(p1, dtype=float)
p2 = np.array(p2, dtype=float)
v = p2 - p1
d = np.linalg.norm(v)
u = v / d
return p1 + start_offset * u, p2 - end_offset * u
def arrow(p1, p2, color='black', lw=1.5):
if color == 'green':
q1, q2 = shorten(p1, p2, start_offset=0.55, end_offset=0.50) # arrow size
mscale = 25 # arrow point
width = 1
else:
q1, q2 = shorten(p1, p2, start_offset=0.36, end_offset=0.42)
mscale = 18
width = lw
ax.annotate(
"",
xy=q2, xytext=q1,
arrowprops=dict(
arrowstyle="-|>",
color=color,
lw=width,
mutation_scale=mscale,
shrinkA=0,
shrinkB=0
)
)
# -----------------------------
# Connections
# -----------------------------
arrow(pos["x1"], pos["h1"])
arrow(pos["x1"], pos["h2"])
arrow(pos["x2"], pos["h1"])
arrow(pos["x2"], pos["h2"])
arrow(pos["h1"], pos["y"])
arrow(pos["h2"], pos["y"])
arrow(pos["b_hidden"], pos["h1"], color='green')
arrow(pos["b_hidden"], pos["h2"], color='green')
arrow(pos["b_output"], pos["y"], color='green')
# -----------------------------
# Node labels
# -----------------------------
ax.text(*pos["x1"], rf"$x_1$" + f"\n{x1}",
ha='center', va='center',
fontsize=16, fontweight='bold')
ax.text(*pos["x2"], rf"$x_2$" + f"\n{x2}",
ha='center', va='center',
fontsize=16, fontweight='bold')
ax.text(*pos["y"], r"$\hat{y}$",
ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(pos["y"][0], pos["y"][1] - 0.6,
r"$y = 1$",
ha='center', va='center',
fontsize=18,
color='black',
bbox=dict(facecolor='#FFF59D', edgecolor='none', boxstyle='round,pad=0.3'))
ax.text(*pos["h1"], r"$h_1$", ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(*pos["h2"], r"$h_2$", ha='center', va='center',
fontsize=18, fontweight='bold')
# -----------------------------
# Weight labels adjusted
# -----------------------------
ax.text(1, 2.1, rf"$w_1 = {w1}$", fontsize=size1, fontweight='bold')
ax.text(-0.1, 1.3, rf"$w_2 = {w2}$", fontsize=size1, fontweight='bold')
ax.text(-0.2, 0.6, rf"$w_3 = {w3}$", fontsize=size1, fontweight='bold')
ax.text(1.3, 0.08, rf"$w_4 = {w4}$", fontsize=size1, fontweight='bold')
ax.text(4.15, 1.70, rf"$w_5 = {w5}$", fontsize=size1, fontweight='bold')
ax.text(4.15, 0.12, rf"$w_6 = {w6}$", fontsize=size1, fontweight='bold')
# -----------------------------
# Bias labels adjusted
# -----------------------------
ax.text(2.4, 3.3, rf"$b_1 = {b1}$", color='green',
fontsize=size1, fontweight='bold')
ax.text(2.8, 1, rf"$b_2 = {b2}$", color='green',
fontsize=size1, fontweight='bold')
ax.text(5.05, 2.92, rf"$b_3 = {b3}$", color='green',
fontsize=size1, fontweight='bold')
# -----------------------------
# Layer labels
# -----------------------------
ax.text(0, -1, r"Input Layer $\in \mathbb{R}^2$", ha='center',
fontsize=size2, color='gray')
ax.text(3, -1, r"Hidden Layer $\in \mathbb{R}^2$", ha='center',
fontsize=size2, color='gray')
ax.text(6, -1, r"Output Layer $\in \mathbb{R}^1$", ha='center',
fontsize=size2, color='gray')
# -----------------------------
# Layout
# -----------------------------
ax.set_xlim(-1, 7);
ax.set_ylim(-1.5, 4.5);
plt.subplots_adjust(top=0.90);
ax.axis('off');
plt.tight_layout()
plt.show()
Figure 12.2: Initial values for the manual XOR example with input values, weights, biases, and the target output.
As shown in the figure, this example uses a single training observation. Given the inputs
\[ x_1 = 0, \qquad x_2 = 1, \]
the target output is
\[ y = 1. \]
Thus, the purpose of the first forward pass is to compute the initial prediction \(\hat{y}\) generated by the network from these inputs, before any parameter update is performed.
We begin by propagating the input values through the network.
Hidden neuron \(h_1\):
Net input:
\[ z_1 = w_1 x_1 + w_3 x_2 + b_1 = (0.1)(0) + (-0.7)(1) + 0 = -0.7. \]
Activation output:
\[ h_1 = f(z_1) = \frac{1}{1+e^{-z_1}} = \frac{1}{1+e^{0.7}} = 0.3318. \]
Hidden neuron \(h_2\):
Net input:
\[ z_2 = w_2 x_1 + w_4 x_2 + b_2 = (0.5)(0) + (0.3)(1) + 0 = 0.3. \]
Activation output:
\[ h_2 = f(z_2) = \frac{1}{1+e^{-z_2}} = \frac{1}{1+e^{-0.3}} = 0.5744. \]
Output neuron \(\hat{y}\):
Net input:
\[ z_3 = w_5 h_1 + w_6 h_2 + b_3 = (0.2)(0.3318) + (0.4)(0.5744) + 0 = 0.2961. \]
Activation output:
\[ \hat{y} = f(z_3) = \frac{1}{1+e^{-z_3}} = \frac{1}{1+e^{-0.2961}} = 0.5735. \]
Therefore, before any training update, the network predicts approximately
\[ \hat{y} = 0.5735, \]
while the target value is
\[ y = 1. \]
For this single observation, the squared-error loss is computed as
\[ E = \frac{1}{2}(y-\hat{y})^2. \]
Substituting the target value and the network prediction gives
\[ E = \frac{1}{2}(1-0.5735)^2 = 0.09095. \]
This value quantifies how far the current prediction is from the desired XOR output for the observation \((x_1,x_2)=(0,1)\).
The next step is to update the weights and biases so that the prediction \(\hat{y}\) moves closer to the target value \(y=1\). Backpropagation provides an efficient way to compute how each parameter contributes to the total error.
In the following calculations, we apply the chain rule to propagate the error backward from the output neuron to the hidden layer, obtaining the gradients needed to update the parameters.
We begin with the weights connecting the hidden layer to the output neuron. For example, consider \(w_5\). We want to determine how a small change in \(w_5\) affects the total error. That is, we need to compute
\[ \frac{\partial E}{\partial w_5}. \]
Using the chain rule,
\[\begin{eqnarray*} \frac{\partial E}{\partial z_3} &=& \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_3} \\[4pt] &=& (\hat{y}-y)\,\hat{y}(1-\hat{y}) \\[4pt] &=& (0.5735-1)(0.5735)(1-0.5735) \\[4pt] &=& -0.10432. \end{eqnarray*}\]
Therefore,
\[\begin{eqnarray*} \frac{\partial E}{\partial w_5} &=& \frac{\partial E}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_5} \\[4pt] &=& (-0.10432)h_1 \\[4pt] &=& (-0.10432)(0.3318) \\[4pt] &=& -0.03462. \end{eqnarray*}\]
Similarly, for \(w_6\),
\[\begin{eqnarray*} \frac{\partial E}{\partial w_6} &=& \frac{\partial E}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_6} \\[4pt] &=& (-0.10432)h_2 \\[4pt] &=& (-0.10432)(0.5744) \\[4pt] &=& -0.05993. \end{eqnarray*}\]
Using the gradient descent update rule,
\[ w_i^{new} = w_i - \alpha \frac{\partial E}{\partial w_i}, \]
with learning rate \(\alpha=0.25\), we obtain
\[\begin{eqnarray*} w_5^{new} &=& 0.2 - (0.25)(-0.03462) = 0.20865, \\[4pt] w_6^{new} &=& 0.4 - (0.25)(-0.05993) = 0.41498. \end{eqnarray*}\]
Thus, both output-layer weights increase because the prediction \(\hat{y}=0.5735\) is below the target value \(y=1\).
After completing the backward pass for this single observation, the updated weights are summarized as follows:
\[\begin{eqnarray*} w_1^{new} &=& 0.10000, \qquad w_2^{new} = 0.50000, \\[4pt] w_3^{new} &=& -0.69884, \qquad w_4^{new} = 0.30255, \\[4pt] w_5^{new} &=& 0.20865, \qquad w_6^{new} = 0.41498. \end{eqnarray*}\]
Figure 12.3 displays the neural network with the updated parameter values after one step of backpropagation.
import matplotlib.pyplot as plt
import numpy as np
# Updated weights after the second backpropagation step
w1 = 0.1
w2 = 0.5
w3 = -0.69884
w4 = 0.30255
w5 = 0.20865
w6 = 0.41498
b1 = 0
b2 = 0
b3 = 0
x1 = 0
x2 = 1
y_target = 1
size1 = 13
size2 = 13
fig, ax = plt.subplots(figsize=(9, 6))
# -----------------------------
# Positions
# -----------------------------
pos = {
"x1": (0, 2),
"x2": (0, 0),
"h1": (3, 2),
"h2": (3, 0),
"y": (6, 1),
"b_hidden": (1.6, 4.5),
"b_output": (4.2, 4.5)
}
node_radius = 0.35
# -----------------------------
# Draw nodes
# -----------------------------
for key in ["x1", "x2", "h1", "h2", "y"]:
x, y = pos[key]
circle = plt.Circle((x, y), node_radius, color='crimson')
ax.add_patch(circle)
def draw_triangle(center):
x, y = center
triangle = plt.Polygon([
(x, y),
(x - 0.28, y - 0.50),
(x + 0.28, y - 0.50)
], color='gray')
ax.add_patch(triangle)
ax.text(x, y - 0.30, "1", ha='center', va='center',
fontsize=12, color='black')
draw_triangle(pos["b_hidden"])
draw_triangle(pos["b_output"])
# -----------------------------
# Helper: shorten arrows so they do not enter node centers
# -----------------------------
def shorten(p1, p2, start_offset=0.36, end_offset=0.42):
p1 = np.array(p1, dtype=float)
p2 = np.array(p2, dtype=float)
v = p2 - p1
d = np.linalg.norm(v)
u = v / d
return p1 + start_offset * u, p2 - end_offset * u
def arrow(p1, p2, color='black', lw=1.5):
if color == 'green':
q1, q2 = shorten(p1, p2, start_offset=0.55, end_offset=0.50)
mscale = 25
width = 1
else:
q1, q2 = shorten(p1, p2, start_offset=0.36, end_offset=0.42)
mscale = 18
width = lw
ax.annotate(
"",
xy=q2, xytext=q1,
arrowprops=dict(
arrowstyle="-|>",
color=color,
lw=width,
mutation_scale=mscale,
shrinkA=0,
shrinkB=0
)
)
# -----------------------------
# Connections
# -----------------------------
arrow(pos["x1"], pos["h1"])
arrow(pos["x1"], pos["h2"])
arrow(pos["x2"], pos["h1"])
arrow(pos["x2"], pos["h2"])
arrow(pos["h1"], pos["y"])
arrow(pos["h2"], pos["y"])
arrow(pos["b_hidden"], pos["h1"], color='green')
arrow(pos["b_hidden"], pos["h2"], color='green')
arrow(pos["b_output"], pos["y"], color='green')
# -----------------------------
# Node labels
# -----------------------------
ax.text(*pos["x1"], rf"$x_1$" + f"\n{x1}",
ha='center', va='center',
fontsize=16, fontweight='bold')
ax.text(*pos["x2"], rf"$x_2$" + f"\n{x2}",
ha='center', va='center',
fontsize=16, fontweight='bold')
ax.text(*pos["y"], r"$\hat{y}$",
ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(pos["y"][0], pos["y"][1] - 0.6,
r"$y = 1$",
ha='center', va='center',
fontsize=18,
color='black',
bbox=dict(facecolor='#FFF59D', edgecolor='none', boxstyle='round,pad=0.3'))
ax.text(*pos["h1"], r"$h_1$", ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(*pos["h2"], r"$h_2$", ha='center', va='center',
fontsize=18, fontweight='bold')
# -----------------------------
# Weight labels adjusted
# -----------------------------
ax.text(1, 2.1, rf"$w_1 = {w1:.5f}$", fontsize=size1, fontweight='bold')
ax.text(-0.1, 1.3, rf"$w_2 = {w2:.5f}$", fontsize=size1, fontweight='bold')
ax.text(-0.5, 0.6, rf"$w_3 = {w3:.5f}$", fontsize=size1, fontweight='bold')
ax.text(1, 0.08, rf"$w_4 = {w4:.5f}$", fontsize=size1, fontweight='bold')
ax.text(4.15, 1.70, rf"$w_5 = {w5:.5f}$", fontsize=size1, fontweight='bold')
ax.text(4.15, 0.12, rf"$w_6 = {w6:.5f}$", fontsize=size1, fontweight='bold')
# -----------------------------
# Bias labels adjusted
# -----------------------------
ax.text(2.4, 3.3, rf"$b_1 = {b1}$", color='green',
fontsize=size1, fontweight='bold')
ax.text(2.8, 1, rf"$b_2 = {b2}$", color='green',
fontsize=size1, fontweight='bold')
ax.text(5.05, 2.92, rf"$b_3 = {b3}$", color='green',
fontsize=size1, fontweight='bold')
# -----------------------------
# Layer labels
# -----------------------------
ax.text(0, -1, r"Input Layer $\in \mathbb{R}^2$", ha='center',
fontsize=size2, color='gray')
ax.text(3, -1, r"Hidden Layer $\in \mathbb{R}^2$", ha='center',
fontsize=size2, color='gray')
ax.text(6, -1, r"Output Layer $\in \mathbb{R}^1$", ha='center',
fontsize=size2, color='gray')
# -----------------------------
# Layout
# -----------------------------
ax.set_xlim(-1, 7);
ax.set_ylim(-1.5, 4.5);
plt.subplots_adjust(top=0.90);
ax.axis('off');
plt.tight_layout()
plt.show()
Figure 12.3: Neural network after one backpropagation step showing the updated weights for the XOR example.
We now proceed to the second epoch, using the updated weights obtained from the previous backpropagation step.
Using the updated parameters, we compute a new forward pass for the same training observation \((x_1, x_2) = (0,1)\).
Neuron \(h_1\):
\[ z_1 = (0.1)(0)+(-0.69884)(1)+0=-0.69884 \]
\[ h_1=\frac{1}{1+e^{0.69884}}=0.3321 \]
Neuron \(h_2\):
\[ z_2=(0.5)(0)+(0.30255)(1)+0=0.30255 \]
\[ h_2=\frac{1}{1+e^{-0.30255}}=0.5751 \]
Output neuron \(\hat{y}\):
\[ z_3=(0.20865)(0.3321)+(0.41498)(0.5751)+0=0.30793 \]
\[ \hat{y}=\frac{1}{1+e^{-0.30793}}=0.5764 \]
\[ E_{total} = \frac{1}{2}(1-0.5764)^2 = 0.08973 \]
The previous error was approximately \(0.09095\), while the current error is approximately \(0.08973\). Thus, after one backpropagation update, the error decreases slightly.
This indicates that, for this observation and this learning rate, the parameter update moved the prediction \(\hat{y}\) closer to the target value \(y=1\).
We now compute the gradients again to perform a new update step.
For the output neuron, we compute:
\[ \frac{\partial E}{\partial z_3} = (\hat{y}-y)\hat{y}(1-\hat{y}) = (0.5764-1)(0.5764)(1-0.5764) = -0.10343. \]
Then,
\[ \frac{\partial E}{\partial w_5} = \frac{\partial E}{\partial z_3} h_1 = (-0.10343)(0.3321) = -0.03435. \]
Similarly,
\[ \frac{\partial E}{\partial w_6} = \frac{\partial E}{\partial z_3} h_2 = (-0.10343)(0.5751) = -0.05948. \]
Updating with \(\alpha = 0.25\):
\[ w_5^{new} = 0.20865 - 0.25(-0.03435) = 0.21724, \]
\[ w_6^{new} = 0.41498 - 0.25(-0.05948) = 0.42985. \]
After completing the second forward and backward pass, the updated weights are:
\[\begin{eqnarray*} w_1^{new} &=& 0.10000, \qquad w_2^{new} = 0.50000, \\[4pt] w_3^{new} &=& -0.69764, \qquad w_4^{new} = 0.30517, \\[4pt] w_5^{new} &=& 0.21724, \qquad w_6^{new} = 0.42985. \end{eqnarray*}\]
Figure 12.4 shows the network after the second parameter update.
import matplotlib.pyplot as plt
import numpy as np
w1 = 0.1
w2 = 0.5
w3 = -0.69764
w4 = 0.30517
w5 = 0.21724
w6 = 0.42985
b1 = 0
b2 = 0
b3 = 0
x1 = 0
x2 = 1
y_target = 1
size1 = 13
size2 = 13
fig, ax = plt.subplots(figsize=(9, 6))
# -----------------------------
# Positions
# -----------------------------
pos = {
"x1": (0, 2),
"x2": (0, 0),
"h1": (3, 2),
"h2": (3, 0),
"y": (6, 1),
"b_hidden": (1.6, 4.5),
"b_output": (4.2, 4.5)
}
node_radius = 0.35
# -----------------------------
# Draw nodes
# -----------------------------
for key in ["x1", "x2", "h1", "h2", "y"]:
x, y = pos[key]
circle = plt.Circle((x, y), node_radius, color='crimson')
ax.add_patch(circle)
# -----------------------------
# Draw bias triangles
# -----------------------------
def draw_triangle(center):
x, y = center
triangle = plt.Polygon([
(x, y),
(x - 0.28, y - 0.50),
(x + 0.28, y - 0.50)
], color='gray')
ax.add_patch(triangle)
ax.text(x, y - 0.30, "1", ha='center', va='center',
fontsize=12, color='black')
draw_triangle(pos["b_hidden"])
draw_triangle(pos["b_output"])
def shorten(p1, p2, start_offset=0.36, end_offset=0.42):
p1 = np.array(p1, dtype=float)
p2 = np.array(p2, dtype=float)
v = p2 - p1
d = np.linalg.norm(v)
u = v / d
return p1 + start_offset * u, p2 - end_offset * u
# -----------------------------
# Helper: shorten arrows so they do not enter node centers
# -----------------------------
def arrow(p1, p2, color='black', lw=1.5):
if color == 'green':
q1, q2 = shorten(p1, p2, start_offset=0.55, end_offset=0.50)
mscale = 25
width = 1
else:
q1, q2 = shorten(p1, p2, start_offset=0.36, end_offset=0.42)
mscale = 18
width = lw
ax.annotate(
"",
xy=q2, xytext=q1,
arrowprops=dict(
arrowstyle="-|>",
color=color,
lw=width,
mutation_scale=mscale,
shrinkA=0,
shrinkB=0
)
)
# -----------------------------
# Connections
# -----------------------------
arrow(pos["x1"], pos["h1"])
arrow(pos["x1"], pos["h2"])
arrow(pos["x2"], pos["h1"])
arrow(pos["x2"], pos["h2"])
arrow(pos["h1"], pos["y"])
arrow(pos["h2"], pos["y"])
arrow(pos["b_hidden"], pos["h1"], color='green')
arrow(pos["b_hidden"], pos["h2"], color='green')
arrow(pos["b_output"], pos["y"], color='green')
# -----------------------------
# Node labels
# -----------------------------
ax.text(*pos["x1"], rf"$x_1$" + f"\n{x1}",
ha='center', va='center',
fontsize=16, fontweight='bold')
ax.text(*pos["x2"], rf"$x_2$" + f"\n{x2}",
ha='center', va='center',
fontsize=16, fontweight='bold')
ax.text(*pos["y"], r"$\hat{y}$",
ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(pos["y"][0], pos["y"][1] - 0.6,
r"$y = 1$",
ha='center', va='center',
fontsize=18,
color='black',
bbox=dict(facecolor='#FFF59D', edgecolor='none', boxstyle='round,pad=0.3'))
ax.text(*pos["h1"], r"$h_1$", ha='center', va='center',
fontsize=18, fontweight='bold')
ax.text(*pos["h2"], r"$h_2$", ha='center', va='center',
fontsize=18, fontweight='bold')
# -----------------------------
# Weight labels adjusted
# -----------------------------
ax.text(1, 2.1, rf"$w_1 = {w1:.5f}$", fontsize=size1, fontweight='bold')
ax.text(-0.1, 1.3, rf"$w_2 = {w2:.5f}$", fontsize=size1, fontweight='bold')
ax.text(-0.5, 0.6, rf"$w_3 = {w3:.5f}$", fontsize=size1, fontweight='bold')
ax.text(1, 0.08, rf"$w_4 = {w4:.5f}$", fontsize=size1, fontweight='bold')
ax.text(4.15, 1.70, rf"$w_5 = {w5:.5f}$", fontsize=size1, fontweight='bold')
ax.text(4.15, 0.12, rf"$w_6 = {w6:.5f}$", fontsize=size1, fontweight='bold')
# -----------------------------
# Bias labels adjusted
# -----------------------------
ax.text(2.4, 3.3, rf"$b_1 = {b1}$", color='green',
fontsize=size1, fontweight='bold')
ax.text(2.8, 1, rf"$b_2 = {b2}$", color='green',
fontsize=size1, fontweight='bold')
ax.text(5.05, 2.92, rf"$b_3 = {b3}$", color='green',
fontsize=size1, fontweight='bold')
# -----------------------------
# Layer labels
# -----------------------------
ax.text(0, -1, r"Input Layer $\in \mathbb{R}^2$", ha='center',
fontsize=size2, color='gray')
ax.text(3, -1, r"Hidden Layer $\in \mathbb{R}^2$", ha='center',
fontsize=size2, color='gray')
ax.text(6, -1, r"Output Layer $\in \mathbb{R}^1$", ha='center',
fontsize=size2, color='gray')
# -----------------------------
# Layout
# -----------------------------
ax.set_xlim(-1, 7);
ax.set_ylim(-1.5, 4.5);
plt.subplots_adjust(top=0.90);
ax.axis('off');
plt.tight_layout()
plt.show()
Figure 12.4: Neural network after the second backpropagation step showing the updated weights for the XOR example.
At this point, we stop the manual calculations. The purpose of this example was to illustrate, step by step, how a single observation is propagated forward through the network and how the resulting error is propagated backward to update the weights.
In practice, this procedure must be repeated over many epochs and across all training observations until the network identifies parameter values that yield predictions sufficiently close to the target outputs. Because these computations quickly become lengthy and repetitive, neural network training is typically carried out using automated algorithms rather than manual calculations.
The prediction increased from \(\hat{y}=0.5735\) to \(\hat{y}=0.5764\).
The error decreased from \(0.09095\) to \(0.08973\).
Since the target is \(y=1\), the update moved the network output in the correct direction.
This confirms that the gradient-based updates are steering the model toward better performance, even though convergence requires multiple iterations.
In this document, we developed the fundamental mathematical framework that enables learning in artificial neural networks. Starting from the definition of loss functions, we established how learning can be formulated as an optimization problem over a high-dimensional parameter space.
We then introduced gradient-based optimization methods, emphasizing how the gradient provides directional information that guides parameter updates toward regions of lower loss. This perspective allowed us to interpret learning geometrically as movement over a loss surface.
A central component of this process is the backpropagation algorithm, which enables the efficient computation of gradients in multilayer architectures. By recursively propagating local error signals across layers, backpropagation makes it possible to train deep neural networks in a computationally feasible manner.
We also examined the role of dimensional consistency and matrix operations, highlighting how vectorized representations and element-wise operations contribute to both the efficiency and clarity of the learning process.
Overall, these components (loss functions, gradient-based optimization, and backpropagation) form the core of modern neural network training. Together, they establish the bridge between mathematical theory and practical learning algorithms in machine learning.
The purpose of this activity is to consolidate the mathematical and computational understanding of neural network training through a manual and reproducible implementation of the XOR example. In particular, students will automate the forward pass, loss computation, gradient calculation, backpropagation, and parameter updates across several epochs.
This activity is based on the one-dimensional XOR example developed in the class notes. Students must implement the training process in Python or R, without using predefined neural network libraries or automatic differentiation tools.
The implementation must be fully reproducible and must show all intermediate calculations.
Using the one-dimensional XOR example developed in the class notes,
\[ x_1 = 0, \qquad x_2 = 1, \qquad y = 1, \]
continue the manual training process from the parameter values obtained at the end of the second epoch.
Students do not need to repeat the first and second epochs. Instead, they must compute the third epoch step by step.
Your answer must include:
The updated parameter values at the beginning of the third epoch.
The hidden pre-activation values \(z_1\) and \(z_2\).
The hidden activations \(h_1\) and \(h_2\).
The output pre-activation value \(z_3\).
The predicted value \(\hat{y}\).
The loss or cost value.
The local errors used in backpropagation.
The gradients with respect to the weights and biases.
The updated parameters after completing the third epoch.
A brief comparison with the results from the first two epochs reported in the notes.
Automate the previous XOR example using Python or R.
The code must run between 10 and 20 epochs and must update the parameters using gradient descent.
The implementation must be done from scratch. Students are not allowed to use predefined neural network functions, automatic differentiation tools, or machine learning libraries for model training. In particular, functions from libraries such as keras, tensorflow, torch, sklearn.neural_network, or similar tools are not allowed.
All forward passes, loss computations, gradients, backpropagation steps, and parameter updates must be explicitly programmed by the student.
For each epoch, present a table including at least:
Epoch number.
Values of \(z_1\), \(z_2\), and \(z_3\).
Values of \(h_1\) and \(h_2\).
Predicted value \(\hat{y}\).
Target value \(y\).
Absolute error \(|y-\hat{y}|\).
Cost or loss value.
Gradient of the cost.
Updated weights.
Updated biases.
The table must be clearly formatted and interpreted.
Using the results from the previous section, construct and interpret the following graphs:
Cost function across epochs.
Absolute error across epochs.
Gradient magnitude across epochs.
Evolution of selected parameters across epochs.
Explain how the behavior of these graphs reflects the learning process of the network.
Repeat the training process using a different set of initial weights and biases.
Compare the results with the original experiment in terms of:
Initial prediction.
Speed of convergence.
Final cost.
Evolution of the gradients.
Stability of the parameter updates.
Discuss whether the new initialization improves or worsens the learning process.
Extend the implementation to train the network using two observations from the XOR table.
The implementation must be done using a vectorized or matrix-based approach.
Your answer must include:
The input matrix \(\mathbf{X}\).
The target vector \(\mathbf{y}\).
The matrix form of the forward pass.
The matrix form of the backward pass.
The parameter updates.
A table summarizing the results by epoch.
Graphs showing cost, error, and gradient behavior across epochs.
Compare the results obtained from:
One XOR observation.
The same observation with different initial parameters.
Two XOR observations.
Discuss:
How the number of observations affects the training process.
How the initialization affects convergence.
Why the XOR problem requires a hidden layer.
How backpropagation allows the network to adjust all parameters.
What limitations remain when only part of the XOR table is used.
Explain, in your own words, how the network “learns from its mistakes” in this example.
Your explanation must connect the following ideas:
Prediction \(\hat{y}\).
Target value \(y\).
Error.
Loss function.
Gradient.
Backpropagation.
Parameter updates.
Improvement across epochs.
Extend the implementation to the complete XOR dataset:
\[ (0,0) \mapsto 0,\qquad (0,1) \mapsto 1,\qquad (1,0) \mapsto 1,\qquad (1,1) \mapsto 0. \]
Train the network for a larger number of epochs and analyze whether the model learns the XOR pattern.
Include:
Final predictions for the four observations.
Final classification results.
Cost evolution.
Error evolution.
A brief discussion of whether the model successfully represents the XOR function.
The R Markdown document must be fully reproducible.
All code chunks must execute without errors and regenerate the reported outputs when the document is compiled.
All random seeds, if applicable, must be set to ensure deterministic results.
All library versions used should be clearly reported.
The code must be self-contained within the document.
Neural network libraries or automatic differentiation tools are not allowed.
The use of external functions not included in the document prevents reproducibility.
The assignment must be submitted in both formats:
The document must include code, tables, graphs, and interpretations.
Tables must be clearly formatted.
Figures must include captions and be explicitly discussed in the text.
If additional files are used, they must be included, organized, and documented.
Multiple unclear versions of the same assignment must not be submitted.
Timely submission of assignments is strictly required.
Assignments must be submitted by the deadline specified for the activity.
A grace period of 30 minutes is allowed after the deadline.
Submissions made after the grace period will incur penalties.
Penalty: −1.0 point for each hour or fraction of an hour of delay.
Submissions significantly past the deadline may receive a substantially reduced grade or may not be accepted.