23/03/26
Abstract
Other related documents can be found at RPubs: toc.
This document builds upon the foundational concepts introduced in previous sections on neural network architectures (Llinás, 2026) and activation functions (Llinás, 2026), extending the analysis toward the mechanisms that enable learning.
Artificial neural networks (ANNs) constitute one of the most influential paradigms in modern machine learning and artificial intelligence. Their conceptual origin is loosely inspired by the basic functional principles of biological neurons, which process and transmit information through interconnected networks (Kandel et al., 2013).
In the nervous system, neurons receive electrical or chemical signals through dendrites. These signals are integrated within the cell body (the soma), and if the accumulated stimulus surpasses a certain threshold, the neuron generates an electrical impulse known as an action potential. This signal then travels along the axon and is transmitted to other neurons through synaptic connections.
As illustrated in Figure 2.1, the biological neuron is composed of several key components, including dendrites, the soma, the axon, and the synaptic terminals, each playing a specific role in signal transmission.
Figure 2.1: Human neuron. Source: Created by the author with ChatGPT (OpenAI).
Although artificial neurons are only simplified mathematical abstractions of this biological mechanism, the analogy provides an intuitive conceptual foundation. In essence, both systems combine multiple inputs, evaluate their relative influence, and produce an output response according to a transformation rule.
This biological structure motivates the simplified mathematical abstraction introduced in the next section.
Modern artificial neurons translate the biological intuition of neural signal processing into a mathematical framework suitable for computation and learning. Instead of discrete electrical spikes, artificial neurons compute a weighted linear combination of inputs and then transform this signal through a nonlinear activation function.
Formally, the internal signal of a neuron is defined as
\[ z = \mathbf{w}^\top \mathbf{x} + b, \]
where \(\mathbf{x} \in \mathbb{R}^d\) represents the input vector, \(\mathbf{w} \in \mathbb{R}^d\) denotes the vector of weights, and \(b\) is a bias parameter. The neuron output is then obtained by applying an activation function
\[ h = \varphi(z). \]
The quantity \(z\) can be interpreted as a continuous analogue of synaptic integration, while the function \(\varphi(\cdot)\) generalizes the biological threshold mechanism into a smooth and differentiable transformation.
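The two formulas above can be traced with a small numerical sketch. The input, weights, and bias below are illustrative values chosen for this example, and the logistic sigmoid is only one possible choice for \(\varphi\).

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, phi):
    """Return (z, h) for one neuron: z = w^T x + b and h = phi(z)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length d")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, phi(z)

# Illustrative values (assumptions, not taken from the text): d = 3 inputs.
x = [1.0, 2.0, -1.0]
w = [0.5, -0.25, 1.0]
b = 1.0

z, h = neuron(x, w, b, sigmoid)
print(f"z = {z:.4f}, h = {h:.4f}")
```

With these values the weighted sum cancels to \(z = 0\), so the sigmoid returns its midpoint value \(h = 0.5\).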
This transition from discrete logical models to continuous optimization was essential for the development of modern neural networks. By allowing gradients to be computed and propagated through the model, neural networks can be trained efficiently using gradient-based learning algorithms.
As a result, ANNs are capable of approximating complex nonlinear mappings and learning expressive internal representations from data. Through iterative adjustments of weights and biases, the model progressively refines its representation of the input space. Consequently, neural networks have become fundamental tools for tasks such as classification, regression, and representation learning in high-dimensional environments (Goodfellow et al., 2016).
As illustrated in Figure 2.2, a neural network is constructed by stacking multiple artificial neurons into layers. The output of each neuron becomes the input of neurons in the next layer, allowing the model to progressively learn more complex representations of the data.
Figure 2.2: Basic architecture of a feedforward neural network. Source: Created by the author with ChatGPT (OpenAI).
Each neuron in the hidden layer computes a transformation of the form \(h = \varphi(\mathbf{w}^\top \mathbf{x} + b)\), illustrating how nonlinear activation functions operate within layered architectures.
Each neuron in the network performs the same basic operation described above, namely computing a weighted sum followed by a nonlinear activation. By composing many such units, the network is able to model highly complex functions.
In many areas of mathematics and machine learning, it is important to work with functions that are not only continuous, but also sufficiently smooth. This smoothness ensures that derivatives exist and behave well, which is essential for optimization algorithms such as gradient descent.
A function \(f: \mathbb{R} \to \mathbb{R}\) is said to belong to the class \(C^{\infty}\) if it is infinitely differentiable; that is, all derivatives of any order exist and are continuous. Formally,
\[ f \in C^{\infty} \quad \Longleftrightarrow \quad f^{(k)} \text{ exists and is continuous for all } k \in \mathbb{N}_0. \]
Functions in \(C^{\infty}\) are often referred to as smooth functions, meaning that they can be differentiated infinitely many times without any discontinuities or irregularities in their derivatives.
A function that belongs to \(C^{\infty}\) is
\[ f(x) = e^x\quad \Longrightarrow \quad f'(x) = e^x \]
This function is infinitely differentiable, and all its derivatives are equal to \(e^x\). In particular, every derivative is continuous over \(\mathbb{R}\).
Figure: Exponential function \(f(x)=e^x\) and its derivative \(f'(x)=e^x\). Both coincide, illustrating smoothness (\(C^{\infty}\)).
As shown in the figure above, the exponential function coincides with its derivative. More generally, smooth functions are not only differentiable: all their derivatives are continuous and well-defined across the entire domain.
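The identity \(f'(x) = e^x\) can also be checked numerically. The following sketch approximates the derivative with a central difference; the step size `eps` and the evaluation points are arbitrary choices made for illustration.

```python
import math

def central_difference(f, x, eps=1e-5):
    """Approximate f'(x) by the symmetric difference quotient."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Evaluation points chosen only for illustration.
points = [-2.0, 0.0, 1.0, 3.0]
errors = [abs(central_difference(math.exp, x) - math.exp(x)) for x in points]
max_error = max(errors)
print(f"largest gap between the numerical derivative and e^x: {max_error:.2e}")
```

The gap reflects only the discretization and rounding error of the approximation, not a difference between the function and its derivative.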
In general, examples of smooth functions include the exponential function, sine and cosine, and sigmoid-type functions such as the logistic function and the hyperbolic tangent.
A function that does not belong to \(C^{\infty}\) is
\[ f(x) = |x| \quad \Longrightarrow \quad f'(x) = \begin{cases} -1, & x < 0, \\ 1, & x > 0. \end{cases} \]
The derivative is not defined at \(x=0\). Therefore, although the function is continuous, it is not differentiable at the origin, since the left and right derivatives do not coincide. Consequently, it is not smooth.
Figure: Absolute value function \(f(x)=|x|\) and its derivative. The derivative is undefined at \(x=0\), where it jumps from \(-1\) to \(1\).
In contrast, the figure above shows that the function \(f(x)=|x|\) is continuous but not differentiable at \(x=0\). The function itself is well-defined at this point, but the derivative is not, since the left and right derivatives do not coincide.
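The mismatch between the left and right derivatives can be made concrete with one-sided difference quotients. This is a minimal sketch; the helper name `one_sided_slopes` and the step size are assumptions introduced here.

```python
def one_sided_slopes(f, x, eps=1e-6):
    """Return the (left, right) difference quotients of f at x."""
    left = (f(x) - f(x - eps)) / eps
    right = (f(x + eps) - f(x)) / eps
    return left, right

left_at_zero, right_at_zero = one_sided_slopes(abs, 0.0)
left_at_two, right_at_two = one_sided_slopes(abs, 2.0)

print("at x = 0:", left_at_zero, right_at_zero)   # slopes disagree
print("at x = 2:", left_at_two, right_at_two)     # slopes agree
```

At the origin the two quotients have opposite signs, so no single slope exists there, while away from the origin they agree.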
These examples highlight the contrast between smooth and non-smooth functions.
In the context of neural networks, smooth activation functions are particularly useful because they allow gradients to be computed reliably during training. However, not all commonly used activation functions belong to \(C^{\infty}\). For instance, the ReLU function is not differentiable at zero, yet it remains widely used due to its practical advantages.
Throughout this document, we will frequently refer to functions in \(C^{\infty}\) when discussing theoretical properties of activation functions and optimization.
Activation functions are a fundamental component of artificial neural networks because they introduce nonlinearity into the model. Without nonlinear activation functions, a neural network composed of multiple layers would reduce to an equivalent linear transformation, regardless of the number of layers.
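The collapse of stacked layers without nonlinearities can be verified numerically. In the sketch below, the matrices `A1`, `A2` and vectors `c1`, `c2` are made-up values, and each layer is written as `A x + c` with the identity as activation (so `A` plays the role of a transposed weight matrix).

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

# Two layers with identity activation (illustrative values).
A1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # maps R^2 to R^3
c1 = [0.5, -0.5, 1.0]
A2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # maps R^3 to R^2
c2 = [0.0, 1.0]

def two_layers(x):
    return add(matvec(A2, add(matvec(A1, x), c1)), c2)

# Equivalent single layer: A = A2 A1 and c = A2 c1 + c2.
A = matmul(A2, A1)
c = add(matvec(A2, c1), c2)

def one_layer(x):
    return add(matvec(A, x), c)

print(two_layers([1.0, 1.0]), one_layer([1.0, 1.0]))
```

Both functions return the same output for every input, so the extra layer adds no expressive power unless a nonlinear activation is placed between the two transformations.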
Recall that an artificial neuron computes a linear combination of its inputs
\[ z = \mathbf{w}^\top\mathbf{x} + b, \]
which represents the internal signal of the neuron. The final output is obtained by applying a nonlinear transformation
\[ h = \varphi(z), \]
where \(\varphi(\cdot)\) is called the activation function.
The role of the activation function is to transform the internal signal of the neuron into a response that can capture complex relationships between variables. By introducing nonlinear transformations, neural networks can construct nonlinear decision boundaries and learn rich internal representations of the data.
This capability is supported by classical universal approximation results, which show that neural networks equipped with suitable activation functions can approximate a wide class of functions on compact domains.
From a geometric perspective, activation functions distort the feature space in a controlled way, allowing the model to separate patterns that would otherwise be inseparable using only linear transformations.
Activation functions can be organized into different categories according to their mathematical properties. One common classification distinguishes between monotonic and periodic activation functions.
Monotonic activation functions are functions whose output consistently increases or decreases with respect to the input. These functions have historically been the most widely used in neural networks and include examples such as the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU).
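As a rough illustration, the three examples just mentioned can be written in a few lines and checked for monotonicity on a grid. The grid and the helper `is_non_decreasing` are assumptions made for this sketch; a grid check is only a sanity test, not a proof. The sine function is included as a non-monotonic contrast.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def is_non_decreasing(f, grid):
    """Check f(z_1) <= f(z_2) <= ... on an increasing grid of inputs."""
    values = [f(z) for z in grid]
    return all(a <= b for a, b in zip(values, values[1:]))

# Grid from -5 to 5 in steps of 0.1 (an arbitrary illustrative choice).
grid = [i / 10.0 for i in range(-50, 51)]

checks = {
    "sigmoid": is_non_decreasing(sigmoid, grid),
    "tanh": is_non_decreasing(math.tanh, grid),
    "relu": is_non_decreasing(relu, grid),
}
sine_check = is_non_decreasing(math.sin, grid)

print(checks, "sin:", sine_check)
```

Note that ReLU is non-decreasing rather than strictly increasing, since it is constant for negative inputs.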
Periodic activation functions exhibit oscillatory behavior and are particularly useful in models designed to capture periodic or high-frequency patterns. These functions are commonly used in specialized neural architectures for representing signals, implicit functions, or spatial fields.
Periodic activation functions can be further divided into:
Sinusoidal activation functions
Non-sinusoidal periodic functions
In the following sections, we examine in detail several activation functions commonly used in neural networks, discussing their mathematical form, key properties, and implications for learning algorithms.
We begin by examining monotonic activation functions, which have historically played a central role in the development of neural networks.
While activation functions define how information is transformed within the network, they do not specify how the parameters of the model are learned. This naturally leads to the question of how neural networks are trained from data.
From a mathematical perspective, learning in neural networks consists of adjusting the parameters \(\mathbf{w}\) and \(b\) so as to minimize a loss function defined over the training data.
While activation functions determine how neurons transform their inputs, they do not by themselves provide a mechanism for learning from data. The learning process requires a way to evaluate prediction quality and a systematic procedure to adjust model parameters.
In supervised learning settings, this process is guided by a loss function, which quantifies the discrepancy between the predicted output of the network and the true target values.
Before introducing optimization, it is useful to formalize how information flows through a neural network.
Figure 3.1 illustrates the forward propagation mechanism in a multilayer neural network. At each layer, the input signal undergoes an affine transformation followed by a nonlinear activation.
Figure 3.1: Forward propagation in a multilayer neural network. Source: Nuñez (2026)
To formalize the flow of information through a neural network, it is useful to introduce a layer-wise notation.
Consider a neural network composed of layers indexed by \(j = 1, \dots, N\). Let \(n_j\) denote the number of neurons in layer \(j\).
For each layer, we define:
\(\mathbf{h}^{(j)} \in \mathbb{R}^{n_j}\): the output (activation vector) of layer \(j\). Each component corresponds to the output of one neuron.
\(\mathbf{z}^{(j)} \in \mathbb{R}^{n_j}\): the pre-activation signal at layer \(j\), obtained before applying the activation function.
\(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{\,n_{j-1} \times n_j}\): the weight matrix connecting layer \(j-1\) to layer \(j\). Each column contains the weights associated with one neuron in layer \(j\).
\(\mathbf{b}^{(j)} \in \mathbb{R}^{n_j}\): the bias vector of layer \(j\).
By convention, the input layer is denoted by \[ \mathbf{h}^{(0)} = \mathbf{x} \in \mathbb{R}^{n_0}, \]
where \(\mathbf{x}\) is the input vector.
At each layer, the network performs two steps:
A linear transformation: \[ \mathbf{z}^{(j)} = \mathbf{W}^{(j-1,j)\top} \mathbf{h}^{(j-1)} + \mathbf{b}^{(j)} \]
A nonlinear transformation: \[ \mathbf{h}^{(j)} = f^{(j)}\left(\mathbf{z}^{(j)}\right) \]
Combining both steps, the forward propagation at layer \(j\) can be written as: \[ \mathbf{h}^{(j)} = f^{(j)}\left(\mathbf{W}^{(j-1,j)\top} \mathbf{h}^{(j-1)} + \mathbf{b}^{(j)}\right) \]
This recursive structure shows that the output of each layer becomes the input of the next, allowing the network to build increasingly complex representations of the data.
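The recursion above can be sketched in plain Python. The weight matrices follow the convention stated earlier, with \(n_{j-1}\) rows and \(n_j\) columns, so that each layer computes \(\mathbf{W}^{\top}\mathbf{h} + \mathbf{b}\). The network size, the numerical values, and the choice of ReLU and identity activations are assumptions made for illustration.

```python
def affine(W, h_prev, b):
    """Compute z = W^T h_prev + b, with W stored as n_prev rows by n_cur columns."""
    n_prev, n_cur = len(W), len(b)
    return [sum(W[i][k] * h_prev[i] for i in range(n_prev)) + b[k]
            for k in range(n_cur)]

def forward(x, layers):
    """Return [h^(0), h^(1), ..., h^(N)] for layers given as (W, b, f) triples."""
    activations = [x]
    for W, b, f in layers:
        z = affine(W, activations[-1], b)
        activations.append([f(zk) for zk in z])   # f is applied element-wise
    return activations

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

# Illustrative network: 2 inputs, 3 hidden neurons, 1 output.
layers = [
    ([[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]], [0.0, -1.0, 0.5], relu),   # 2 x 3
    ([[1.0], [1.0], [-2.0]], [0.25], identity),                      # 3 x 1
]

activations = forward([1.0, 2.0], layers)
print(activations)
```

Storing every intermediate activation, rather than only the final output, is a deliberate choice: the same values are needed later when gradients are computed.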
As illustrated in Figure 3.1, each layer applies the same fundamental operation: a linear combination of inputs followed by a nonlinear activation.
A loss function measures how far the network output is from the desired target. Let \(y\) denote the true value and \(\hat{y}\) the predicted output. A loss function is written as \(L(y,\hat{y})\).
For example:
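For regression, a common choice is the squared error, which yields the mean squared error (MSE) when averaged over the training samples; the factor \(\tfrac{1}{2}\) simplifies the derivative: \[ L(y,\hat{y}) = \frac{1}{2}(y-\hat{y})^2 \]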
For classification, a widely used option is cross-entropy: \[ L(y,\hat{y}) = -\sum_i y_i \log(\hat{y}_i) \]
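Both example losses can be evaluated directly. In this sketch the target and prediction values are invented, and the small constant `eps` used to avoid \(\log(0)\) is an added numerical safeguard that is not part of the formula above.

```python
import math

def squared_error(y, y_hat):
    """Half squared error for a single scalar prediction."""
    return 0.5 * (y - y_hat) ** 2

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy -sum_i y_i log(y_hat_i) for one sample.

    y is a one-hot (or probability) vector, y_hat a vector of predicted
    probabilities. Clipping at eps is an assumption added for stability.
    """
    return -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(y, y_hat))

# Illustrative values.
se = squared_error(3.0, 2.5)
ce = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])

print(f"squared error = {se:.4f}, cross-entropy = {ce:.4f}")
```

With a one-hot target, the cross-entropy reduces to the negative log-probability assigned to the correct class, here \(-\log(0.7)\).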
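Loss functions can be broadly categorized according to the task, with regression losses such as the squared error and classification losses such as cross-entropy.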
The learning rate, denoted by \(\alpha > 0\), controls the size of the parameter updates during optimization. It determines how far the model moves in the direction opposite to the gradient at each iteration.
If \(\alpha\) is too large, the optimization process may overshoot the minimum and become unstable.
If \(\alpha\) is too small, convergence may be excessively slow.
Adaptive methods such as Adam, RMSProp, and AdaGrad modify effective learning rates during training to improve stability.
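The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below minimizes \(L(w) = w^2\), whose gradient is \(2w\), with a fixed learning rate; the function, the starting point, and the three values of \(\alpha\) are assumptions chosen to show the slow, well-behaved, and unstable regimes.

```python
def run_gradient_descent(alpha, w0=1.0, steps=20):
    """Minimize L(w) = w^2 with a fixed learning rate and return the final w."""
    w = w0
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - alpha * gradient
    return w

# Distance from the minimum at w = 0 after 20 steps, for three learning rates.
results = {alpha: abs(run_gradient_descent(alpha)) for alpha in (0.01, 0.4, 1.1)}

for alpha, distance in results.items():
    print(f"alpha = {alpha:<4}  |w| after 20 steps = {distance:.3e}")
```

For this function each update multiplies \(w\) by \(1 - 2\alpha\), so the iterates shrink only when \(|1 - 2\alpha| < 1\); with \(\alpha = 1.1\) the factor has magnitude \(1.2\) and the iterates move away from the minimum.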
The learning rate therefore plays a crucial role in balancing stability and speed during training. In the next section, we examine how this parameter is used within the gradient descent algorithm.
To minimize the loss function, neural networks rely on gradient-based optimization. The gradient indicates the direction of steepest increase, so moving in the opposite direction reduces the loss.
For a parameter \(w_n\), the update rule is:
\[ w_n^{(t+1)} = w_n^{(t)} - \alpha \, \frac{\partial L}{\partial w_n}, \]
where \(\alpha>0\) is the learning rate introduced in the previous section.
This process can be interpreted geometrically as movement over a loss surface toward regions of lower values (see Figure 3.2). The interactive gradient descent visualization employed in this study was developed by Llinás and Llinás (2026a) and is available online. Additional implementation details and documentation are provided in the corresponding GitHub repository (Llinás and Llinás, 2026b).
Figure 3.2: Gradient descent simulation on \(f(x,y)=x^2+y^2\). Adapted from the interactive visualization developed by Llinás and Llinás (2026).
As shown in Figure 3.2, the loss function defines a surface over the parameter space. In this example, the function \[L(x,y) = x^2 + y^2\]
produces a convex bowl-shaped surface with a unique global minimum at the origin.
The blue point represents the initial parameter values, while the red point corresponds to the final solution obtained after several iterations of gradient descent. The trajectory between these points illustrates how the algorithm progressively updates the parameters in the direction opposite to the gradient.
Each step in this path is scaled by the learning rate \(\alpha\), which determines how large the updates are. A suitable choice of \(\alpha\) allows the algorithm to converge efficiently toward the minimum, whereas poor choices may lead to slow convergence or instability.
The table shown in the figure provides the numerical evolution of the parameters and gradients at each iteration, illustrating how the magnitude of the updates decreases as the algorithm approaches the minimum.
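A table of this kind can be reproduced with a short script. The following is a minimal sketch, not the code behind the interactive visualization: the starting point \((3, -2)\), the learning rate \(\alpha = 0.1\), and the number of iterations are assumptions chosen for illustration.

```python
def loss(x, y):
    return x * x + y * y

def gradient(x, y):
    return 2.0 * x, 2.0 * y

def gradient_descent(x0, y0, alpha, steps):
    """Return the list of (x, y, loss) values visited, including the start."""
    x, y = x0, y0
    path = [(x, y, loss(x, y))]
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x, y = x - alpha * gx, y - alpha * gy   # step against the gradient
        path.append((x, y, loss(x, y)))
    return path

path = gradient_descent(3.0, -2.0, alpha=0.1, steps=25)

for t, (x, y, value) in enumerate(path[:5]):
    print(f"t={t}  x={x:.4f}  y={y:.4f}  L={value:.4f}")
```

Because the gradient is proportional to the distance from the origin, each update is smaller than the previous one, which matches the shrinking step sizes described above.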
Although gradient descent provides the general mechanism for updating parameters, neural networks typically involve a large number of interconnected weights. In such cases, computing the required gradients efficiently becomes essential.
This is achieved through the backpropagation algorithm, which systematically applies the chain rule to compute gradients layer by layer.
Backpropagation is the algorithm used to compute gradients efficiently in multilayer neural networks. It applies the chain rule to propagate error information from the output layer back to earlier layers. The structure illustrated in Figure 3.1 also serves as the basis for understanding how gradients are propagated backward through the network.
For each layer \(j\), define the local error:
\[ \boldsymbol{\delta}^{(j)} = \frac{\partial L}{\partial \mathbf{z}^{(j)}} \in \mathbb{R}^{n_j}. \]
Here, \(n_j\) denotes the number of neurons in layer \(j\), ensuring that all vectors and matrices are dimensionally consistent across layers.
The term \(\boldsymbol{\delta}^{(j)}\) represents how sensitive the loss function is to changes in the pre-activation signal at layer \(j\), and plays a central role in propagating error information backward through the network.
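The recursion starts at the output layer, where, for an element-wise output activation, \(\boldsymbol{\delta}^{(N)} = \nabla_{\mathbf{h}^{(N)}} L \odot f^{(N)\prime}(\mathbf{z}^{(N)})\), the component-wise product of the loss gradient with respect to the network output and the derivative of the output activation. For hidden layers, the error is propagated backward according to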
\[ \boldsymbol{\delta}^{(j-1)} = \left(\mathbf{W}^{(j-1,j)}\; \boldsymbol{\delta}^{(j)}\right) \odot f^{(j-1)\prime}(\mathbf{z}^{(j-1)}), \]
where \(\odot\) denotes element-wise (Hadamard) multiplication.
Let \[ \mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}. \]
Then
\[ \mathbf{a} \odot \mathbf{b} = \begin{pmatrix} a_1 b_1 \\ a_2 b_2 \\ a_3 b_3 \end{pmatrix}. \]
That is, each component of the resulting vector is obtained by multiplying the corresponding components of the original vectors.
Here, \(\boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_j}\) and \(\mathbf{W}^{(j-1,j)} \in \mathbb{R}^{n_{j-1} \times n_j}\), so that the product \(\mathbf{W}^{(j-1,j)} \boldsymbol{\delta}^{(j)}\) lies in \(\mathbb{R}^{n_{j-1}}\), matching the dimension of \(\boldsymbol{\delta}^{(j-1)}\). Because the forward pass multiplies by \(\mathbf{W}^{(j-1,j)\top}\), the backward pass multiplies by the transpose of that matrix, which is \(\mathbf{W}^{(j-1,j)}\) itself.
This dimensional consistency guarantees that the backward propagation of errors is well-defined across layers. Moreover, this recursive structure highlights how error signals flow backward through the network, enabling efficient gradient computation for all parameters.
These expressions form the basis for efficient gradient computation in deep neural networks and are central to modern machine learning algorithms.
These gradient expressions will be used in conjunction with optimization algorithms, such as gradient descent, to update the model parameters during training.
Using these local error terms, the gradients with respect to the model parameters (weights and biases) are given by: \[ \frac{\partial L}{\partial \mathbf{W}^{(j-1,j)}} = \mathbf{h}^{(j-1)} \boldsymbol{\delta}^{(j)\top}, \qquad \frac{\partial L}{\partial \mathbf{b}^{(j)}} = \boldsymbol{\delta}^{(j)}, \] where \(\mathbf{h}^{(j-1)} \in \mathbb{R}^{n_{j-1}}\) and \(\boldsymbol{\delta}^{(j)} \in \mathbb{R}^{n_j}\), so that the outer product \(\mathbf{h}^{(j-1)} \boldsymbol{\delta}^{(j)\top}\) has the same shape \(n_{j-1} \times n_j\) as \(\mathbf{W}^{(j-1,j)}\).
This recursive computation allows gradients to be obtained efficiently, making training feasible even for deep networks.
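The following sketch applies these formulas to a small network and compares one backpropagated gradient entry with a finite-difference estimate. Everything specific here is an assumption made for illustration: a 2-3-1 network, a tanh hidden layer, an identity output, the half squared error loss, and the numerical values. With \(\mathbf{W}^{(j-1,j)}\) stored as \(n_{j-1} \times n_j\), the only dimensionally valid backward product is \(\mathbf{W}^{(j-1,j)} \boldsymbol{\delta}^{(j)}\), which is what the code uses.

```python
import math

def forward(x, W1, b1, W2, b2):
    """z1 = W1^T x + b1, h1 = tanh(z1), z2 = W2^T h1 + b2 (identity output)."""
    z1 = [sum(W1[i][k] * x[i] for i in range(len(x))) + b1[k]
          for k in range(len(b1))]
    h1 = [math.tanh(z) for z in z1]
    z2 = [sum(W2[i][k] * h1[i] for i in range(len(h1))) + b2[k]
          for k in range(len(b2))]
    return h1, z2

def loss(x, y, W1, b1, W2, b2):
    _, z2 = forward(x, W1, b1, W2, b2)
    return 0.5 * (y - z2[0]) ** 2

def backprop(x, y, W1, b1, W2, b2):
    h1, z2 = forward(x, W1, b1, W2, b2)
    # Output layer: dL/dz2 for identity activation and half squared error.
    delta2 = [z2[0] - y]
    # Hidden layer: (W2 delta2) element-wise times tanh'(z1) = 1 - h1^2.
    delta1 = [sum(W2[i][k] * delta2[k] for k in range(len(delta2)))
              * (1.0 - h1[i] ** 2) for i in range(len(h1))]
    # Weight gradients are outer products h^(j-1) delta^(j)^T.
    grad_W2 = [[h1[i] * d for d in delta2] for i in range(len(h1))]
    grad_W1 = [[x[i] * d for d in delta1] for i in range(len(x))]
    return grad_W1, delta1, grad_W2, delta2   # bias gradients equal the deltas

# Illustrative data and parameters.
x, y = [0.5, -1.0], 0.3
W1 = [[0.1, -0.2, 0.4], [0.3, 0.5, -0.6]]   # 2 x 3
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7], [-0.8], [0.9]]                 # 3 x 1
b2 = [0.05]

grad_W1, grad_b1, grad_W2, grad_b2 = backprop(x, y, W1, b1, W2, b2)

# Finite-difference check on the single entry W1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (loss(x, y, W1_plus, b1, W2, b2)
           - loss(x, y, W1_minus, b1, W2, b2)) / (2.0 * eps)

print(f"backprop: {grad_W1[0][1]:.8f}   finite difference: {numeric:.8f}")
```

A finite-difference check of this kind needs two loss evaluations per parameter, whereas backpropagation obtains all gradients from one forward and one backward pass.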
Gradient-based methods rely on first-order information. That is, they use only the gradient of the loss function, without explicitly considering curvature. More advanced methods incorporate second-order information (e.g., Hessians), but they are often computationally expensive in high-dimensional models.
Although gradient descent is conceptually simple, its practical behavior is strongly influenced by the geometry of the loss surface associated with neural networks.
In high-dimensional models, the loss function defines a complex landscape where each point corresponds to a particular configuration of the model parameters. This surface is typically highly non-convex, which gives rise to several important challenges during optimization.
The loss surface may contain a variety of geometric features, including:
Local minima, where the gradient vanishes but the solution is not globally optimal.
Saddle points, where the gradient is zero but there exist directions of both ascent and descent.
Plateaus, where the gradient is very small and learning progresses slowly.
Narrow valleys, where curvature differs significantly across directions.
In high-dimensional settings, saddle points are often more prevalent than local minima and can significantly slow down the training process.
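A saddle point can be illustrated with the toy function \(f(x,y) = x^2 - y^2\), which is not a neural network loss and is used here only as an assumption-laden example. Its gradient \((2x, -2y)\) vanishes at the origin, yet the function increases along \(x\) and decreases along \(y\).

```python
def gradient(x, y):
    """Gradient of the toy function f(x, y) = x^2 - y^2."""
    return 2.0 * x, -2.0 * y

def descend(x, y, alpha=0.1, steps=100):
    """Run plain gradient descent and return the final point."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

# Start exactly on the x-axis: the iterates slide into the saddle and stay there.
on_axis = descend(1.0, 0.0)

# Start with a tiny offset in y: the iterates eventually leave the saddle region.
perturbed = descend(1.0, 1e-6)

print("on axis:  ", on_axis)
print("perturbed:", perturbed)
```

The perturbed run stays close to the saddle for many iterations before moving away, which mirrors the slow progress near saddle points described above. Because this toy function is unbounded below, the second run keeps moving away rather than settling at a minimum.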
These phenomena can be visualized as movement across a complex surface, where the optimization trajectory may oscillate, stagnate, or progress unevenly depending on the local geometry.
A fundamental challenge in deep neural networks arises from the propagation of gradients through multiple layers.
Because gradients are computed through repeated application of the chain rule, their magnitude may either decrease or increase exponentially as they propagate backward through the network.
In the vanishing gradient case, gradients become extremely small, preventing early layers from learning effectively.
In the exploding gradient case, gradients grow excessively large, leading to numerical instability.
Both situations can hinder the training process and affect convergence.
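Both regimes can be reproduced with a chain of scalar layers \(h_j = f(w\,h_{j-1})\), for which the chain rule contributes one factor \(w\,f'(z_j)\) per layer. The depth, the weights, and the activations in this sketch are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # never larger than 0.25

def chain_gradient(w, depth, f, f_prime, x=0.5):
    """Return |d h_depth / d x| for scalar layers h_j = f(w * h_{j-1})."""
    h, product = x, 1.0
    for _ in range(depth):
        z = w * h
        product *= w * f_prime(z)   # chain rule: one factor per layer
        h = f(z)
    return abs(product)

# Sigmoid layers with w = 1: every factor is at most 0.25.
vanishing = chain_gradient(1.0, 20, sigmoid, sigmoid_prime)

# Identity layers with w = 1.5: every factor equals 1.5.
exploding = chain_gradient(1.5, 20, lambda z: z, lambda z: 1.0)

print(f"vanishing: {vanishing:.3e}   exploding: {exploding:.3e}")
```

After 20 layers the first product is bounded by \(0.25^{20}\), while the second equals \(1.5^{20}\), which shows how quickly the magnitude can change with depth.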
Several strategies have been developed to mitigate these issues, including:
The use of activation functions that preserve gradient flow (e.g., ReLU and its variants).
Careful weight initialization methods (e.g., Xavier or He initialization).
Normalization techniques such as Batch Normalization.
These approaches help stabilize training and improve convergence behavior in deep neural networks.
Overall, the effectiveness of gradient-based optimization depends not only on the algorithm itself, but also on the interplay between model architecture, activation functions, and the geometry of the loss surface.
These challenges motivate the development of more advanced optimization methods, which incorporate additional information about the curvature of the loss surface.
While gradient descent provides a simple and effective mechanism for training neural networks, it relies exclusively on first-order information, namely the gradient of the loss function. Although this approach is widely used in practice, it does not fully exploit the geometric structure of the optimization problem.
In many settings, additional information about the curvature of the loss surface can be used to accelerate convergence and improve stability. This leads to the study of second-order optimization methods, which incorporate not only gradients but also higher-order derivatives.
In previous sections, parameters were introduced individually (e.g., \(w_n\)). In more general settings, it is convenient to represent all model parameters collectively as a vector \(\boldsymbol{\theta} \in \mathbb{R}^N\), where \(N\) here denotes the total number of parameters rather than the number of layers. Let \(L(\boldsymbol{\theta})\) denote a twice-differentiable loss function of this parameter vector.
To formalize this idea, we introduce the notion of the gradient, which collects the partial derivatives of a scalar function of several variables into a single vector. The gradient of the loss function is defined as
\[ \nabla L(\boldsymbol{\theta}) = \left( \frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_N} \right)^\top, \]
and provides first-order information about the direction of steepest increase.
In contrast, the Hessian matrix captures second-order information:
\[ H(\boldsymbol{\theta}) = \left[ \frac{\partial^2 L}{\partial \theta_i \, \partial \theta_j} \right]_{i,j=1}^N. \]
The Hessian describes the local curvature of the loss surface and plays a central role in more advanced optimization methods.
Because gradient descent uses only first-order information, it does not account for how the gradient changes across directions. This limitation motivates the use of second-order methods.
One classical approach that uses second-order information is the Newton-Raphson method, which updates parameters according to
\[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - H(\boldsymbol{\theta}^{(t)})^{-1} \nabla L(\boldsymbol{\theta}^{(t)}). \]
Unlike gradient descent, where the step size is controlled by a scalar learning rate, Newton-type methods adapt the update direction using curvature information. This often leads to faster convergence, particularly near optimal points.
However, computing and inverting the Hessian matrix can be computationally expensive in high-dimensional problems, which limits the direct applicability of these methods in large neural networks.
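The difference between the two update rules is easiest to see on a quadratic. The sketch below uses the illustrative loss \(L(\boldsymbol{\theta}) = \theta_1^2 + 10\,\theta_2^2\), chosen here as an assumption because its Hessian is the constant diagonal matrix \(\mathrm{diag}(2, 20)\), so the inverse reduces to element-wise division. The starting point and the learning rate are also illustrative.

```python
def gradient(theta):
    """Gradient of L(theta) = theta_1^2 + 10 * theta_2^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]

# The Hessian of this loss is constant and diagonal: diag(2, 20).
HESSIAN_DIAGONAL = [2.0, 20.0]

def newton_step(theta):
    """theta - H^{-1} grad L, using the diagonal Hessian above."""
    g = gradient(theta)
    return [t - gi / hi for t, gi, hi in zip(theta, g, HESSIAN_DIAGONAL)]

def gradient_step(theta, alpha):
    """theta - alpha * grad L."""
    g = gradient(theta)
    return [t - alpha * gi for t, gi in zip(theta, g)]

theta0 = [4.0, -3.0]
after_newton = newton_step(theta0)
after_gradient = gradient_step(theta0, alpha=0.04)

print("Newton step:          ", after_newton)
print("gradient descent step:", after_gradient)
```

For a quadratic loss a single Newton step lands exactly on the minimum, whereas one gradient step only moves part of the way and treats the two directions unevenly. In a general network the Hessian is neither diagonal nor constant, which is where the computational cost discussed above arises.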
Gradient descent can be interpreted as a simplified optimization strategy that only uses first-order information. While this makes it computationally efficient and scalable, it may require many iterations to converge.
Second-order methods, on the other hand, incorporate curvature information, allowing for more informed parameter updates. In practice, modern optimization algorithms often seek a balance between these two approaches, combining efficiency with improved convergence properties.
In deep learning, first-order methods such as gradient descent and its variants (e.g., Adam, RMSProp) remain the dominant choice due to their scalability and ease of implementation.
Nevertheless, second-order ideas continue to play an important role in theoretical analysis and in specialized applications where curvature information can be exploited effectively.
This perspective highlights that optimization in neural networks is not limited to a single method, but rather involves a spectrum of approaches with different trade-offs between computational cost and convergence behavior.
This topic will be explored in greater depth in a dedicated document on optimization methods in machine learning.
In this document, we developed the fundamental mathematical framework that enables learning in artificial neural networks. Starting from the definition of loss functions, we established how learning can be formulated as an optimization problem over a high-dimensional parameter space.
We then introduced gradient-based optimization methods, emphasizing how the gradient provides directional information that guides parameter updates toward regions of lower loss. This perspective allowed us to interpret learning geometrically as movement over a loss surface.
A central component of this process is the backpropagation algorithm, which enables the efficient computation of gradients in multilayer architectures. By recursively propagating local error signals across layers, backpropagation makes it possible to train deep neural networks in a computationally feasible manner.
We also examined the role of dimensional consistency and matrix operations, highlighting how vectorized representations and element-wise operations contribute to both the efficiency and clarity of the learning process.
Overall, these components (loss functions, gradient-based optimization, and backpropagation) form the core of modern neural network training. Together, they establish the bridge between mathematical theory and practical learning algorithms in machine learning.
The purpose of this activity is to consolidate the conceptual and mathematical understanding of the learning process in neural networks. In particular, it aims to develop intuition about loss functions, gradient-based optimization, and the role of backpropagation in computing parameter updates.
Consider a feedforward neural network with multiple layers and a differentiable loss function \(L\).
Answer the following questions:
Conceptual understanding
What is the role of a loss function in a neural network?
Why can the training of a neural network be interpreted as an optimization problem?
Gradient interpretation
Define the gradient of a function \(L(\boldsymbol{\theta})\).
Explain why the negative gradient direction is used to update parameters.
Describe geometrically what happens during gradient descent.
Learning rate
What is the role of the learning rate \(\alpha\) in gradient descent?
What happens if \(\alpha\) is too large?
What happens if \(\alpha\) is too small?
Backpropagation intuition
What is the meaning of the local error \(\boldsymbol{\delta}^{(j)}\)?
Why is backpropagation necessary in deep neural networks?
Explain in your own words what it means to “propagate the error backward”.
Dimensional reasoning
Consider the expression:
\[ \boldsymbol{\delta}^{(j-1)} = \left(\mathbf{W}^{(j-1,j)}\boldsymbol{\delta}^{(j)}\right) \odot f^{(j-1)\prime}(\mathbf{z}^{(j-1)}). \]
Explain why \(\mathbf{W}^{(j-1,j)}\) appears here without a transpose, whereas the forward pass uses \(\mathbf{W}^{(j-1,j)\top}\).
Justify why the dimensions of all terms are consistent.
What role does the element-wise product play in this expression?
Critical reflection
Why would training be infeasible without backpropagation?
Explain why deep networks require efficient gradient computation.
In your opinion, what is the most important component of the learning process: the loss function, the optimization method, or backpropagation? Justify your answer.
Forward pass
Consider a simple neural network with two layers.
Describe step by step how the forward pass is computed.
Then describe conceptually how the backward pass is performed.
Identify where the gradient is computed and how it is used.
Conceptual design
Propose an intuitive explanation (without formulas) of how a neural network “learns from its mistakes”.
What information is propagated?
How is it used to adjust the model?
Why does this process improve performance over time?