1 Preliminaries

Biological neuron.

Artificial neural networks (ANNs) are among the most influential developments in modern machine learning and artificial intelligence. Inspired by fundamental functional principles of biological neurons (Kandel et al., 2013), ANNs abstract the core mechanism of neural information processing: multiple input signals are integrated, their relative importance is modulated, and an output response is generated through a nonlinear transformation.

In a biological neuron, signals arrive through dendrites and are integrated in the soma; if the integrated signal exceeds a threshold, an action potential is propagated along the axon to other neurons. This threshold-based firing mechanism constitutes one of the foundational biological inspirations for artificial neural computation.

Figure 1.1: Human neuron. Source: Created by the author with ChatGPT (OpenAI)

Historical origin of the threshold neuron.

The first formal abstraction of neural computation was introduced by McCulloch and Pitts (1943), who proposed a logical model of a neuron based on threshold activation. Their formulation described neural firing through three fundamental principles:

(a) Threshold Principle: A neuron possesses a threshold that excitation must exceed to trigger firing.
(b) All-or-None Law: A neuron either fires completely or does not fire at all; there is no graded response.
(c) Spatial Summation of Excitatory Inputs: A sufficient number of excitatory inputs must be spatially aggregated within a short temporal window.

These principles are explicitly stated in the original paper (see Figure 1.2), where neural behavior is described as a threshold-based logical process.

Figure 1.2: Principles of a threshold-based logical neuron. Source: [McCulloch and Pitts (1943)](https://www.cs.cmu.edu/~epxing/Class/10715/reading/McCulloch.and.Pitts.pdf)

Although the McCulloch–Pitts neuron was binary and propositional rather than continuous and differentiable, it established the essential computational paradigm of neural processing: input aggregation followed by threshold activation.
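The aggregation-then-threshold idea can be sketched in a few lines of Python. This is a minimal illustration, not the original formalism: the function name `threshold_neuron`, the use of `>=` for the firing condition, and the omission of inhibitory inputs are all assumptions made here for brevity.

```python
def threshold_neuron(inputs, threshold):
    """Binary threshold unit in the spirit of McCulloch and Pitts.

    inputs    -- sequence of 0/1 excitatory signals
    threshold -- number of active inputs required to fire
    The unit fires (returns 1) or stays silent (returns 0): all-or-none.
    Firing when the count reaches the threshold (>=) is an assumed convention.
    """
    return 1 if sum(inputs) >= threshold else 0


def logical_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return threshold_neuron([a, b], threshold=2)


def logical_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return threshold_neuron([a, b], threshold=1)


pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_table = [logical_and(a, b) for a, b in pairs]
or_table = [logical_or(a, b) for a, b in pairs]
print("AND:", and_table)
print("OR: ", or_table)
```

Changing only the threshold turns the same unit from an OR gate into an AND gate, which is the sense in which networks of such units can implement logical functions.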

From logical neurons to continuous models.

Modern artificial neurons generalize this mechanism into a continuous framework. Instead of counting active synapses, they compute a weighted linear combination of inputs and apply a nonlinear activation function:

\[ z = \mathbf{w}^\top \mathbf{x} + b, \qquad \hat{y} = \varphi(z) \]

Here, the weighted sum \(z\) serves as a continuous analogue of synaptic summation, while the activation function \(\varphi(\cdot)\) extends the discrete threshold rule into a differentiable transformation. This transition from logical thresholding to continuous optimization was crucial for enabling large-scale learning.
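The contrast between a hard threshold and a differentiable activation can be made concrete with a small sketch. The input, weights, and bias below are arbitrary illustrative values, and the logistic sigmoid is only one possible choice of \(\varphi\); none of them come from the text.

```python
import math


def pre_activation(x, w, b):
    """Weighted sum z = w^T x + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def step(z):
    """Discrete threshold rule: output jumps from 0 to 1 at z = 0."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Smooth, differentiable alternative to the step function."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values (assumptions, not taken from the text).
x = [1.0, 0.0, 1.0]
w = [0.5, -0.25, 0.25]
b = -0.25

z = pre_activation(x, w, b)
y_step = step(z)
y_sigmoid = sigmoid(z)
print(z, y_step, y_sigmoid)
```

The step output carries no information about how far \(z\) is from the threshold, whereas the sigmoid output changes smoothly with \(z\), which is what makes gradient-based weight adjustment possible.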

Composing many such units into layers is what allows artificial neural networks to approximate nonlinear mappings and to learn expressive internal representations through iterative parameter adjustment. By optimizing weights and biases against a training objective, a network progressively refines its representation of the input space. Consequently, ANNs have become fundamental tools for classification, regression, and representation learning on high-dimensional and structured data (Goodfellow et al., 2016).

Historical Evolution of Neural Networks.

The development of artificial neural networks reflects a gradual transition from logical abstraction to statistical learning.

In the 1940s, McCulloch and Pitts (1943) demonstrated that networks of simple threshold units could implement basic logical functions, representing one of the earliest attempts to formalize cognitive processes using mathematical structures.

Later, Rosenblatt (1958) introduced the Perceptron, the first neural architecture capable of learning directly from data by adjusting connection weights. This development demonstrated that machines could adapt their internal parameters based on experience, generating significant enthusiasm in early artificial intelligence research. Figure 1.3 shows the original schematic representation of the perceptron as presented in the 1958 paper.

Figure 1.3: Schematic representation of the Perceptron. Source: [Rosenblatt (1958)](https://psycnet.apa.org/record/1959-09865-001)

However, approximately a decade later, Minsky and Papert (1969) revealed fundamental theoretical limitations of single-layer perceptrons. In particular, they showed that these models cannot represent functions that are not linearly separable, such as the XOR classification problem illustrated in Figure 1.4. This result substantially slowed neural network research and contributed to what is often referred to as the first “AI winter”.

Figure 1.4: XOR classification problem. No single linear boundary can correctly separate the two classes, illustrating the limitation of single-layer perceptrons.
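The limitation can be observed directly with a small experiment. The sketch below uses the classical error-driven perceptron update as an assumed learning rule; the function names, the learning rate, and the epoch budget are illustrative choices. It trains the same single-layer unit on AND (linearly separable) and on XOR (not linearly separable).

```python
def predict(w, b, x):
    """Single-layer perceptron with a hard threshold at zero."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def train_perceptron(data, epochs=50, lr=1.0):
    """Error-driven weight updates; stops early once an epoch has no mistakes."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in data:
            delta = target - predict(w, b, x)
            if delta != 0:
                w[0] += lr * delta * x[0]
                w[1] += lr * delta * x[1]
                b += lr * delta
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def accuracy(w, b, data):
    hits = sum(predict(w, b, x) == target for x, target in data)
    return hits / len(data)


and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(*train_perceptron(and_data), and_data)
xor_acc = accuracy(*train_perceptron(xor_data), xor_data)
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

On AND the updates settle on a separating line and every point is classified correctly. On XOR no choice of weights and bias can classify all four points, so the accuracy stays below 1 regardless of how long training runs.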

A decisive breakthrough occurred with the popularization of the backpropagation algorithm by Rumelhart et al. (1986), which enabled the effective training of multi-layer neural networks and addressed the limitations of shallow architectures. By allowing gradients to be propagated through multiple layers, this method made it possible to learn complex nonlinear representations directly from data.

Figure 1.5: Multilayer neural network used to illustrate the backpropagation learning procedure. Source: [Rumelhart et al. (1986)](https://www.researchgate.net/publication/375595601_Learning_Representations_by_Back-Propagating_Errors)

Figure 1.5 illustrates a multilayer neural network similar to the one used by Rumelhart et al. (1986) to demonstrate the backpropagation learning procedure. The network consists of three main components: a layer of input units at the bottom, a set of hidden units in the middle, and an output unit at the top. The numerical values along the connections represent the weights learned during training. Information flows forward from the input layer to the output layer, while the backpropagation algorithm adjusts these weights by propagating the error gradient backward through the network. This iterative adjustment allows the network to progressively improve its internal representations and minimize prediction error.
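The forward and backward passes described above can be sketched for a tiny network with two inputs, two hidden units, and one output. This is an illustration under stated assumptions, not the configuration from the 1986 paper: the sigmoid activations, the squared-error loss, the initial weights, and the learning rate are all chosen here for convenience.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass: return hidden activations h and output y."""
    W1, b1, w2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
         for row, bj in zip(W1, b1)]
    y = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
    return h, y


def loss(params, x, t):
    """Squared error for one example (an assumed choice of loss)."""
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2


def backprop(params, x, t):
    """Backward pass: gradient of the loss for every parameter."""
    W1, b1, w2, b2 = params
    h, y = forward(params, x)
    # Error signal at the output pre-activation (chain rule through sigmoid).
    delta_out = (y - t) * y * (1.0 - y)
    grad_w2 = [delta_out * hj for hj in h]
    grad_b2 = delta_out
    # Error signal propagated back to each hidden pre-activation.
    delta_hid = [delta_out * w * hj * (1.0 - hj) for w, hj in zip(w2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    grad_b1 = list(delta_hid)
    return grad_W1, grad_b1, grad_w2, grad_b2


def sgd_step(params, x, t, lr=0.1):
    """One gradient-descent update; returns new parameters."""
    W1, b1, w2, b2 = params
    gW1, gb1, gw2, gb2 = backprop(params, x, t)
    new_W1 = [[w - lr * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, gW1)]
    new_b1 = [b - lr * g for b, g in zip(b1, gb1)]
    new_w2 = [w - lr * g for w, g in zip(w2, gw2)]
    return new_W1, new_b1, new_w2, b2 - lr * gb2


# Arbitrary starting weights and a single training example (assumptions).
params = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)
x, t = [1.0, 0.0], 1.0

before = loss(params, x, t)
after = loss(sgd_step(params, x, t), x, t)
print("loss before:", before)
print("loss after: ", after)
```

A common sanity check for code like this is to compare each entry returned by `backprop` with a finite-difference estimate of the same derivative; agreement to several decimal places indicates that the backward pass matches the forward pass.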

This development marked a turning point in the history of neural networks and laid the conceptual and algorithmic foundation for what is now known as deep learning. Since then, advances in computational power, the availability of large-scale datasets, and improved optimization techniques have propelled neural networks to the forefront of modern artificial intelligence.

2 Foundations of Neural Networks

To understand how neural networks learn and generalize from data, it is necessary to examine their underlying mathematical and computational structure. This section introduces the fundamental components used to define and train neural networks. While some of these elements will be treated in greater detail in subsequent documents, they are briefly introduced here to establish the conceptual framework required to study neural architectures.

The main components include:

The notation used to describe internal network operations,
The principal neural architectures used for representation learning,
Activation functions that introduce nonlinearity,
Loss functions that quantify prediction error, and
Optimization procedures such as gradient descent and backpropagation.

Together, these elements provide the theoretical and algorithmic foundations required to understand how modern neural networks are trained and applied in practice.

2.0.1 Notation and Preliminaries

Let \(\mathbb{R}\) denote the set of real numbers, and let \(\mathbb{R}^n\) be the \(n\)-dimensional real vector space. Vectors are written in bold lowercase, e.g., \(\mathbf{x} = (x_1, x_2, \dots, x_n)^\top \in \mathbb{R}^n\), and matrices in bold uppercase, e.g., \(\mathbf{W} \in \mathbb{R}^{m \times n}\).

An artificial neuron can be viewed as a function that maps an input vector to a scalar (or vector) output through a linear combination followed by a nonlinear activation function. Formally,

\[ \hat{y} = \varphi(z), \qquad z = \mathbf{w}^\top \mathbf{x} + b, \]

where

\(\mathbf{x} \in \mathbb{R}^n\) is the input vector,
\(\mathbf{w} \in \mathbb{R}^n\) is the weight vector,
\(b \in \mathbb{R}\) is the bias term,
\(z\) is the linear response (also called the pre-activation),
\(\varphi:\mathbb{R} \to \mathbb{R}\) is the activation function.

This model captures the core role of a neuron in an ANN: it receives multiple inputs, assigns them relative importance through weights, and produces an output whose form is controlled by the chosen activation. When \(\varphi(\cdot)\) is differentiable (or piecewise differentiable), network parameters can be learned using gradient-based optimization, which underlies modern deep learning (Goodfellow et al., 2016).
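As a worked illustration of why differentiability matters, consider a single training pair \((\mathbf{x}, y)\) and a squared-error loss. Both the loss and the learning rate \(\eta\) are assumptions introduced here for illustration; the text has not yet fixed either. Applying the chain rule to \(\hat{y} = \varphi(z)\) with \(z = \mathbf{w}^\top \mathbf{x} + b\) gives

\[ L = \tfrac{1}{2}(\hat{y} - y)^2, \qquad \frac{\partial L}{\partial \mathbf{w}} = (\hat{y} - y)\,\varphi'(z)\,\mathbf{x}, \qquad \frac{\partial L}{\partial b} = (\hat{y} - y)\,\varphi'(z), \]

and a gradient-descent step updates the parameters as

\[ \mathbf{w} \leftarrow \mathbf{w} - \eta\,\frac{\partial L}{\partial \mathbf{w}}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}. \]

The factor \(\varphi'(z)\) appears in every gradient. For a hard threshold it is zero wherever it is defined, so no learning signal reaches the weights, whereas a smooth activation yields a usable gradient.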

In practice, the expressive power of neural networks depends not only on the mathematical form of the neuron but also on the architectural structure used to organize multiple neurons into layers and modules.

3 Types of Neural Networks (Architectures)

Neural networks can be categorized according to how they process information and which inductive biases are embedded in their architecture. These architectural choices determine how the model represents patterns in the data.

Below is a practical taxonomy commonly used in representation learning:

Perceptron (single-layer)
A single unit that computes a weighted sum of its inputs and applies a threshold activation, yielding a linear classifier. Although simple, it represents the historical starting point of neural network research.
Multi-Layer Perceptron (MLP)
A stack of fully connected layers capable of learning nonlinear mappings. MLPs are general-purpose models widely used for tabular data and as building blocks in larger systems.
Convolutional Neural Networks (CNNs)
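Architectures designed for grid-like data such as images or spectrograms. Convolutional layers exploit spatial locality and weight sharing, which makes them translation-equivariant (a shifted input produces a correspondingly shifted feature map) and enables efficient feature extraction.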
Recurrent Neural Networks (RNNs)
Models designed for sequential data by maintaining a hidden state across time steps. Variants such as LSTM and GRU improve stability and allow modeling of long-range dependencies.
Autoencoders
Encoder-decoder architectures trained to reconstruct their inputs. They are widely used for compression, denoising, and unsupervised representation learning (including variational autoencoders).
Graph Neural Networks (GNNs)
Architectures designed for graph-structured data, where node representations are computed by aggregating information from neighboring nodes.
Transformers
Architectures based on attention mechanisms that learn contextual representations by weighting interactions among tokens in parallel. Transformers have become the dominant architecture in natural language processing and are increasingly used in vision and multimodal learning.
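To make the idea of "weighting interactions among tokens" more tangible, the following sketch computes attention weights for a handful of toy token vectors. It assumes the scaled dot-product form with a softmax, which is one common choice and is not specified in the text; learned query, key, and value projections are omitted, and the token vectors are arbitrary.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` describes how strongly one token attends to every token in the sequence. The rows do not depend on one another, which is why the computation can run in parallel across tokens.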

In modern machine learning systems, these architectures are often combined within larger pipelines. For example, CNN-based feature extractors may feed Transformer encoders, or graph neural networks may be integrated into broader predictive models, depending on the structure of the data and the representation objectives.

NEURAL NETWORKS

Neural Networks for Representation Learning

Dr. rer. nat. Humberto LLinás Solano