20/03/26
Abstract
Other related documents can be found at Rpubs:: toc.
Artificial neural networks (ANNs) constitute one of the most influential paradigms in modern machine learning and artificial intelligence. Their conceptual origin is loosely inspired by the basic functional principles of biological neurons, which process and transmit information through interconnected networks (Kandel et al., 2013).
In the nervous system, neurons receive electrical or chemical signals through dendrites. These signals are integrated within the cell body (the soma), and if the accumulated stimulus surpasses a certain threshold, the neuron generates an electrical impulse known as an action potential. This signal then travels along the axon and is transmitted to other neurons through synaptic connections.
As illustrated in Figure 1.1, the biological neuron is composed of several key components, including dendrites, the soma, the axon, and the synaptic terminals, each playing a specific role in signal transmission.
Figure 1.1: Human neuron. Source: Created by the author with ChatGPT (OpenAI).
Although artificial neurons are only simplified mathematical abstractions of this biological mechanism, the analogy provides an intuitive conceptual foundation. In essence, both systems combine multiple inputs, evaluate their relative influence, and produce an output response according to a transformation rule.
Modern artificial neurons translate the biological intuition of neural signal processing into a mathematical framework suitable for computation and learning. Instead of discrete electrical spikes, artificial neurons compute a weighted linear combination of inputs and then transform this signal through a nonlinear activation function.
Formally, the internal signal of a neuron is defined as
\[ z = \mathbf{w}^\top \mathbf{x} + b, \]
where \(\mathbf{x}\) represents the input vector, \(\mathbf{w}\) denotes the vector of weights, and \(b\) is a bias parameter. The neuron output is then obtained by applying an activation function
\[ h = \varphi(z). \]
The quantity \(z\) can be interpreted as a continuous analogue of synaptic integration, while the function \(\varphi(\cdot)\) generalizes the biological threshold mechanism into a smooth and differentiable transformation.
This transition from discrete logical models to continuous optimization was essential for the development of modern neural networks. By allowing gradients to be computed and propagated through the model, neural networks can be trained efficiently using gradient-based learning algorithms.
As a result, ANNs are capable of approximating complex nonlinear mappings and learning expressive internal representations from data. Through iterative adjustments of weights and biases, the model progressively refines its representation of the input space. Consequently, neural networks have become fundamental tools for tasks such as classification, regression, and representation learning in high-dimensional environments (Goodfellow et al., 2016).
As illustrated in Figure 1.2, a neural network is constructed by stacking multiple artificial neurons into layers. The output of each neuron becomes the input of neurons in the next layer, allowing the model to progressively learn more complex representations of the data.
Figure 1.2: Basic architecture of a feedforward neural network. Source: Created by the author with ChatGPT (OpenAI).
Each neuron in the network performs the same basic operation described above, namely computing a weighted sum followed by a nonlinear activation. By composing many such units, the network is able to model highly complex functions.
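The weighted sum followed by an activation, and the stacking of neurons into layers, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names `neuron` and `layer`, the choice of tanh as \(\varphi\), and all numerical weights are hypothetical.

```python
import math

def neuron(x, w, b, phi=math.tanh):
    """Single artificial neuron: h = phi(w^T x + b)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # internal signal z
    return phi(z)                                     # output h

def layer(x, W, b, phi=math.tanh):
    """One layer: every row of W (with its bias) defines one neuron."""
    return [neuron(x, w_row, b_k, phi) for w_row, b_k in zip(W, b)]

# Illustrative values only: 2 inputs -> 2 hidden neurons -> 1 output neuron.
x = [0.5, -1.0]
hidden = layer(x, W=[[1.0, 2.0], [-1.0, 0.5]], b=[0.0, 0.1])
output = neuron(hidden, w=[1.0, -1.0], b=0.0)
```

The output of the hidden layer is passed unchanged as the input of the next neuron, which is the composition described above.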
In many areas of mathematics and machine learning, it is important to work with functions that are not only continuous, but also sufficiently smooth. This smoothness ensures that derivatives exist and behave well, which is essential for optimization algorithms such as gradient descent.
A function \(f: \mathbb{R} \to \mathbb{R}\) is said to belong to the class \(C^{\infty}\) if it is infinitely differentiable; that is, all derivatives of any order exist and are continuous. Formally,
\[ f \in C^{\infty} \quad \Longleftrightarrow \quad f^{(k)} \text{ exists and is continuous for all } k \in \mathbb{N}. \]
Functions in \(C^{\infty}\) are often referred to as smooth functions, meaning that they can be differentiated infinitely many times without any discontinuities or irregularities in their derivatives.
A function that belongs to \(C^{\infty}\):
\[ f(x) = e^x \]
This function is infinitely differentiable, and all its derivatives are equal to \(e^x\), which are continuous everywhere.
In general, examples of smooth functions include exponential, trigonometric, and sigmoid-type functions.
A function that does not belong to \(C^{\infty}\):
\[ f(x) = |x| \]
Although this function is continuous, it is not differentiable at \(x = 0\), and therefore it is not smooth.
In the context of neural networks, smooth activation functions are particularly useful because they allow gradients to be computed reliably during training. However, not all commonly used activation functions belong to \(C^{\infty}\). For instance, the ReLU function is not differentiable at zero, yet it remains widely used due to its practical advantages.
Throughout this document, we will frequently refer to functions in \(C^{\infty}\) when discussing theoretical properties of activation functions and optimization.
Activation functions are a fundamental component of artificial neural networks because they introduce nonlinearity into the model. Without nonlinear activation functions, a neural network composed of multiple layers would reduce to an equivalent linear transformation, regardless of the number of layers.
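To see why, consider a minimal worked case with two layers and the identity in place of the activation. The symbols \(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2\) are illustrative and are introduced only for this sketch:

\[ \mathbf{W}_2\left(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1\right) + \mathbf{b}_2 = \left(\mathbf{W}_2\mathbf{W}_1\right)\mathbf{x} + \left(\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2\right) = \tilde{\mathbf{W}}\mathbf{x} + \tilde{\mathbf{b}}. \]

The composition is again a single affine map, and the same argument applies by induction to any number of layers.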
Recall that an artificial neuron computes a linear combination of its inputs
\[ z = \mathbf{w}^\top\mathbf{x} + b, \]
which represents the internal signal of the neuron. The final output is obtained by applying a nonlinear transformation
\[ h = \varphi(z), \]
where \(\varphi(\cdot)\) is called the activation function.
The role of the activation function is to transform the internal signal of the neuron into a response that can capture complex relationships between variables. By introducing nonlinear transformations, neural networks can construct nonlinear decision boundaries and learn rich internal representations of the data.
This capability is supported by classical universal approximation results, which show that neural networks equipped with suitable activation functions can approximate a wide class of functions on compact domains.
From a geometric perspective, activation functions distort the feature space in a controlled way, allowing the model to separate patterns that would otherwise be inseparable using only linear transformations.
Activation functions can be organized into different categories according to their mathematical properties. One common classification distinguishes between monotonic and periodic activation functions.
Monotonic activation functions are functions whose output consistently increases or decreases with respect to the input. These functions have historically been the most widely used in neural networks and include examples such as the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU).
Periodic activation functions exhibit oscillatory behavior and are particularly useful in models designed to capture periodic or high-frequency patterns. These functions are commonly used in specialized neural architectures for representing signals, implicit functions, or spatial fields.
Periodic activation functions can be further divided into:
Sinusoidal activation functions
Non-sinusoidal periodic functions
In the following sections, we examine in detail several activation functions commonly used in neural networks, discussing their mathematical form, key properties, and implications for learning algorithms.
We begin by examining monotonic activation functions, which have historically played a central role in the development of neural networks.
As discussed in the previous section, monotonic activation functions preserve the ordering of inputs and are especially useful when the model is expected to respond in a consistently increasing or decreasing manner.
In mathematics, a function is said to be monotonic when it is either nondecreasing or nonincreasing over its domain (Royden and Fitzpatrick, 1998). From a graphical perspective, monotonicity means that the curve evolves in a single overall direction: it either increases as the input grows or decreases without reversing its global trend.
This idea is illustrated in Figure 2.1. Two of the curves exhibit monotonic behavior: one shows a steady increase, while another shows a steady decrease over the entire domain. In contrast, the third curve oscillates, alternating between increasing and decreasing intervals, and therefore does not satisfy the monotonicity property. This comparison highlights that monotonic functions preserve the ordering of inputs, whereas non-monotonic functions may reverse their direction.
Figure 2.1: Monotonic function.
A function \(f\) is nondecreasing if for all \(x,y \in \mathbb{R}\), whenever \(x \ge y\), it follows that \(f(x) \ge f(y)\).
Likewise, \(f\) is nonincreasing if for all \(x,y \in \mathbb{R}\), whenever \(x \ge y\), it follows that \(f(x) \le f(y)\).
In the context of neural networks, many commonly used activation functions exhibit monotonic behavior. The following list summarizes several important examples that will be studied in detail:
Linear function
Identity function
Piecewise linear function
Threshold (Heaviside) function
Sigmoid function
Bipolar sigmoid function
ReLU and its variants (Leaky ReLU, PReLU, RReLU)
ELU and SELU
Softplus function
Hyperbolic tangent (tanh) function
Arctangent function
Softsign function
Maxout function
Each of these functions has different properties in terms of smoothness, differentiability, and practical performance in neural networks.
Among the simplest monotonic transformations, the linear function is often the first candidate to consider when inputs are combined through weighted sums before entering a neuron. If the inputs are already shaped by weights, whether specified manually or learned from data, a linear transformation provides the most direct mapping from input to output.
However, this function has two major limitations in the context of neural networks. First, its derivative is constant, which means that gradient-based optimization does not benefit from any input-dependent curvature. As a result, the gradient conveys no richer structure than a fixed slope. Second, when the function is used in backpropagation, error corrections remain proportional to a constant term, so the update dynamics do not meaningfully adapt to changes in the input. In this sense, the function lacks the expressive nonlinearity needed for deep learning.
The general form of the linear activation is
\[ \begin{equation} f(x) = \alpha x \tag{2.1} \end{equation} \]
where \(\alpha \in \mathbb{R}\). Its domain is \((-\infty, \infty)\), it belongs to \(C^{\infty}\), and both the function and its first derivative are monotonic:
\[ f'(x)=\alpha \]
When \(\alpha = 1\), the function reduces to the identity function. Both cases are displayed in Figure 2.2.
The function is monotonic increasing if \(\alpha > 0\), monotonic decreasing if \(\alpha < 0\), and constant if \(\alpha = 0\).
Figure 2.2: Linear and Identity functions.
The corresponding derivatives, which remain constant for each case, are shown in Figure 2.3.
Figure 2.3: Linear and Identity functions (derivative).
When \(\alpha = 1\), the linear function becomes the identity function:
\[ f(x)=x \]
\[ f'(x)=1 \]
At first glance, and as noted earlier, this function may seem uninformative because it leaves the input unchanged. Nevertheless, in neural computation it still plays a role, since it passes the weighted sum directly to the next stage without additional distortion.
In that sense, the identity function acts as a direct transmitter of the summation term \(\sum_j w_{kj} x_j\), sometimes described as a replicator or duplicator of the neuron’s internal linear combination (Haykin, 2001; Rice, 1953). The corresponding expression is also represented in Figure 2.2.
A more flexible alternative is the piecewise linear function, defined in Equation (2.2). This function constrains the input between two thresholds, \(\alpha_{min}\) and \(\alpha_{max}\), so that the output remains between 0 and 1 (Zeng et al., 2010). Inputs below the lower threshold are mapped to 0, whereas inputs above the upper threshold are mapped to 1, as illustrated in Figure 2.4.
\[ \begin{equation} f(x) = \left\{ \begin{array}{ll} 0, & \text{if } x < \alpha_{min}, \\ mx + b, & \text{if } \alpha_{min} \le x \le \alpha_{max},\\ 1, & \text{if } x > \alpha_{max}, \end{array} \right. \tag{2.2} \end{equation} \]
Where the slope \(m\) is given by
\[ \begin{equation} m = \frac{1}{\alpha_{max} - \alpha_{min}} \tag{2.3} \end{equation} \]
and the intercept \(b\) is given by
\[ \begin{equation} b=-m \alpha_{min} = 1 - m \alpha_{max} \tag{2.4} \end{equation} \]
Its domain is \((-\infty, \infty)\); it is continuous on \(\mathbb{R}\) and monotonic, but it does not belong to \(C^{\infty}\) due to the nondifferentiability at the threshold points.
Figure 2.4: Piecewise Linear function.
\[ \begin{equation} f'(x) = \left\{ \begin{array}{cl} 0 & \text{if } x < \alpha_{min} \\ m & \text{if } \alpha_{min} < x < \alpha_{max}\\ 0 & \text{if } x > \alpha_{max} \end{array} \right. \end{equation} \]
The derivative is piecewise constant: it equals \(m\) on the open interval \((\alpha_{min}, \alpha_{max})\) and 0 outside the closed interval \([\alpha_{min}, \alpha_{max}]\). At the two threshold points the left and right slopes differ (0 versus \(m\)), so the derivative does not exist there.
Therefore, the function is not differentiable at these points and, consequently, does not belong to \(C^{\infty}\).
The behavior of the derivative across regions is illustrated in Figure 2.5.
Figure 2.5: Piecewise Linear function (derivative).
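Equations (2.2) to (2.4) and the derivative above can be sketched as follows. The function names and the default thresholds \(-1\) and \(1\) are assumptions made for illustration; the value returned by the derivative exactly at the thresholds is a convention of this sketch, since the derivative does not exist there.

```python
def piecewise_linear(x, a_min=-1.0, a_max=1.0):
    """Piecewise linear activation of Equation (2.2)."""
    m = 1.0 / (a_max - a_min)  # slope, Equation (2.3)
    b = -m * a_min             # intercept, Equation (2.4)
    if x < a_min:
        return 0.0
    if x > a_max:
        return 1.0
    return m * x + b

def piecewise_linear_grad(x, a_min=-1.0, a_max=1.0):
    """Derivative: m strictly inside the thresholds, 0 elsewhere.

    At x == a_min and x == a_max the derivative is undefined;
    returning 0.0 there is only a convention chosen for this sketch.
    """
    if a_min < x < a_max:
        return 1.0 / (a_max - a_min)
    return 0.0
```

With the default thresholds, the midpoint \(x = 0\) is mapped to 0.5 and the slope inside the interval is 0.5.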
The threshold function is one of the earliest and most intuitive activation mechanisms. It is also known as the unit Heaviside function, binary function, or step function (Batres-Estrada, 2015; Osher and Fedkiw, 2003; Cox, 1992), and is defined in Equation (2.5).
In econometric language, this function resembles a dummy variable, which may be used alone or combined with other terms. In neural networks, its usefulness lies in its filtering capacity: it decides whether an input signal is strong enough to activate the neuron. In that sense, it behaves somewhat like a gate that alters the final prediction by changing whether the signal is passed forward.
The threshold function is
\[ \begin{equation} f(x) = \left\{ \begin{array}{cc} 1, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0. \end{array} \right. \tag{2.5} \end{equation} \]
The function is discontinuous at \(x=0\) and monotonic nondecreasing. Its range is \(\{0,1\}\). This discontinuous behavior is illustrated in Figure 2.6, where the abrupt jump at the origin can be clearly observed.
Figure 2.6: Threshold (Heaviside) function illustrating its discontinuity at x = 0.
The derivative of the threshold function is not defined at \(x=0\) due to the discontinuity.
For all \(x \neq 0\), the function is locally constant on each side of the origin, and therefore its derivative is zero:
\[ f'(x) = 0 \quad \text{for } x \neq 0. \]
Thus, from a classical perspective, the function is not differentiable at the origin.
In more advanced settings, such as distribution theory, the derivative of the Heaviside function can be represented by the Dirac delta function. However, this interpretation goes beyond the scope of standard neural network models.
In practice, this lack of differentiability makes the threshold function unsuitable for gradient-based optimization methods, which is why it has been largely replaced by smooth or piecewise-linear activation functions.
The sigmoid function is among the most widely used activation functions in neural networks, especially in classification settings (Friedman et al., 2001; Batres-Estrada, 2015). Its main role is to transform inputs smoothly into values between 0 and 1, making it particularly suitable for probabilistic interpretation of outputs.
Its general form is given by Equation (2.6) and illustrated in Figure 2.7:
\[ \begin{equation} f(x) = \frac{1}{1 + e^{- \alpha x}} \tag{2.6} \end{equation} \]
It has domain \(\mathbb{R}\) and range \((0,1)\); it belongs to \(C^{\infty}\) and, for \(\alpha > 0\), is strictly increasing on \(\mathbb{R}\).
Historically, the sigmoid was introduced by Verhulst (1838) in the study of population growth and later became known as the logistic function or logistic curve (Verhulst, 1977). In econometrics, when \(\alpha = 1\), it appears as the standard logistic link used for dichotomous outcomes (Greene, 2003). Because of its smoothness and its point symmetry about \((0, 1/2)\), it has also been associated with more stable convergence in backpropagation procedures (Haykin, 2001; LeCun et al., 2012).
Figure 2.7 shows the sigmoid function for different values of \(\alpha\), illustrating how this parameter controls the steepness of the transition.
Figure 2.7: Sigmoid function.
The derivative of the sigmoid function, given in Equation (2.7), can be expressed directly as a function of \(f(x)\) itself.
\[ \begin{equation} f'(x)=\alpha\,f(x)\bigl(1-f(x)\bigr) \tag{2.7} \end{equation} \]
In particular, this closed-form expression lets backpropagation reuse the activation value \(f(x)\) already computed in the forward pass, so the exponential does not need to be evaluated a second time.
Its behavior is illustrated in Figure 2.8, where the derivative attains its maximum at the origin and decreases symmetrically as \(|x|\) increases for different values of \(\alpha\).
Figure 2.8: Sigmoid function (derivative).
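Equations (2.6) and (2.7) can be checked numerically with a short sketch. The helper `central_diff`, the step size, and the test points are illustrative assumptions; the code is not guarded against overflow of `math.exp` for very large \(|\alpha x|\).

```python
import math

def sigmoid(x, alpha=1.0):
    """Sigmoid of Equation (2.6)."""
    return 1.0 / (1.0 + math.exp(-alpha * x))

def sigmoid_grad(x, alpha=1.0):
    """Derivative of Equation (2.7), written in terms of f(x) itself."""
    s = sigmoid(x, alpha)
    return alpha * s * (1.0 - s)

def central_diff(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Compare the closed form with a numerical derivative at a few points.
for alpha in (0.5, 1.0, 2.0):
    for x in (-2.0, 0.0, 1.5):
        numeric = central_diff(lambda t: sigmoid(t, alpha), x)
        assert abs(numeric - sigmoid_grad(x, alpha)) < 1e-6
```

At the origin the derivative equals \(\alpha/4\), which is its maximum and explains why larger \(\alpha\) produces a steeper transition.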
The bipolar sigmoid is closely related to the standard sigmoid, but its output range is shifted to \((-1,1)\) instead of \((0,1)\). Because of this, it is not directly suitable for probability estimation, although it can be advantageous in other learning contexts. Panicker and Babu (2012) report that bipolar sigmoid functions may perform more efficiently than other sigmoidal variants in some settings.
Its form is given by Equation (2.8) and illustrated in Figure 2.9:
\[ \begin{equation} f(x) = \frac{1 - e^{- \alpha x}}{1 + e^{- \alpha x}} \tag{2.8} \end{equation} \]
The function has domain \(\mathbb{R}\) and range \((-1,1)\); it belongs to \(C^{\infty}\) and, for \(\alpha > 0\), is strictly increasing.
Figure 2.9 shows the bipolar sigmoid for different values of \(\alpha\), illustrating how this parameter controls the steepness of the transition.
Figure 2.9: Bipolar Sigmoid function.
This function can be written as a scaled and shifted version of the standard sigmoid:
\[ f(x) = 2\,\sigma(\alpha x) - 1, \]
where \(\sigma(x)\) denotes the standard sigmoid function defined as
\[ \begin{equation} \sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.9} \end{equation} \]
The derivative of the bipolar sigmoid function, given in Equation (2.10), can be expressed directly as a function of \(f(x)\) itself.
\[ \begin{equation} f'(x)=\frac{\alpha}{2}\left(1-f(x)^2\right) \tag{2.10} \end{equation} \]
Its behavior is illustrated in Figure 2.10, where the derivative attains its maximum at the origin and decreases symmetrically as \(|x|\) increases for different values of \(\alpha\).
Figure 2.10: Bipolar Sigmoid function (derivative).
This function is closely related to the hyperbolic tangent, which will be discussed later. In fact, it is equivalent to the hyperbolic tangent function up to a scaling factor in the input, which provides a centered alternative to the standard sigmoid, namely:
\[ f(x) = \tanh\left(\frac{\alpha x}{2}\right) \]
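The three equivalent forms of the bipolar sigmoid, namely Equation (2.8), the rescaled sigmoid \(2\,\sigma(\alpha x) - 1\), and \(\tanh(\alpha x / 2)\), can be compared numerically. This is a small sanity check under illustrative parameter values, not a proof.

```python
import math

def bipolar_sigmoid(x, alpha=1.0):
    """Bipolar sigmoid of Equation (2.8)."""
    e = math.exp(-alpha * x)
    return (1.0 - e) / (1.0 + e)

def standard_sigmoid(x):
    """Standard sigmoid of Equation (2.9)."""
    return 1.0 / (1.0 + math.exp(-x))

for alpha in (0.5, 1.0, 3.0):
    for x in (-2.0, -0.1, 0.0, 0.7, 4.0):
        f = bipolar_sigmoid(x, alpha)
        # Scaled and shifted standard sigmoid.
        assert abs(f - (2.0 * standard_sigmoid(alpha * x) - 1.0)) < 1e-12
        # Hyperbolic tangent with the input scaled by alpha / 2.
        assert abs(f - math.tanh(alpha * x / 2.0)) < 1e-12
```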
The Rectified Linear Unit (ReLU) is a more recent activation function that has become extremely influential in modern deep learning. In the literature, it is also related to the ramp function and other advanced rectifiers (Nair and Hinton, 2010). Although simple in form, it can mimic the effect of multiple sigmoidal units while using the same learned weights and biases (Batres-Estrada, 2015).
A general representation for the ReLU family (including LReLU and PReLU) is given by
\[ \begin{equation} f(x) = \left\{ \begin{array}{lc} \alpha x, & \text{if } x<0, \\ x, & \text{if } x \geq 0. \end{array} \right. \tag{2.11} \end{equation} \]
Standard ReLU corresponds to \(\alpha=0\):
\[ \begin{equation} f(x) = \left\{ \begin{array}{lc} 0, & \text{if } x < 0, \\ x, & \text{if } x \geq 0. \end{array} \right. \tag{2.12} \end{equation} \]
The function is continuous on \(\mathbb{R}\), but it is not differentiable at \(x=0\). It is monotonic nondecreasing when \(\alpha \ge 0\). Its behavior for different values of \(\alpha\) is illustrated in Figure 2.11.
Figure 2.11: Rectified Linear Unit (ReLU) Family.
The derivative of the function defined in Equation (2.11) is given by
\[ \begin{equation} f'(x) = \left\{ \begin{array}{lc} \alpha, & \text{if } x<0, \\ 1, & \text{if } x>0. \end{array} \right. \end{equation} \]
The derivative is not defined at \(x=0\). Its behavior across different values of \(\alpha\) is illustrated in Figure 2.12.
Figure 2.12: Rectified Linear Unit (ReLU) Family (derivative).
As Equation (2.11) suggests, the parameter \(\alpha\) is included mainly to describe the broader ReLU family, even though the standard ReLU sets \(\alpha = 0\); as shown in Figure 2.11, it controls the slope of the negative branch. In large networks, ReLU-type units may be combined with other activation functions such as tanh, softplus, sinusoidal, sigmoid, and Gaussian functions. For \(\alpha > 0\), the function avoids the “dying ReLU” problem by allowing a nonzero gradient for negative inputs.
This family was introduced to accelerate learning and extend the advantages of simpler linear models into deeper nonlinear settings (Nair and Hinton, 2010; Goodfellow et al., 2016).
The first member of the ReLU family is the Leaky ReLU (LReLU). It corresponds to Equation (2.11) with \(\alpha = 0.01\). Unlike the standard ReLU, it allows a small nonzero slope for negative inputs, reducing the risk of inactive neurons (Maas et al., 2013). It has domain \(\mathbb{R}\) and range \(\mathbb{R}\), is continuous on \(\mathbb{R}\), and is strictly increasing. It is shown in Figure 2.11.
The second member is the Parametric ReLU (PReLU), which uses the same form as Equation (2.11) but leaves \(\alpha\) unrestricted. This parameter is learned from the data. It has domain \(\mathbb{R}\), is continuous on \(\mathbb{R}\), and is monotonic nondecreasing when \(\alpha \ge 0\); its range is \(\mathbb{R}\) whenever \(\alpha > 0\). It is also shown in Figure 2.11.
The Randomized ReLU (RReLU) assigns \(\alpha\) randomly within a prescribed interval, as indicated by Equation (2.11) without fixing the parameter (Xu et al., 2015). One limitation is that, during backpropagation, its distinction from PReLU may become less transparent at the derivative level. It has domain \(\mathbb{R}\), is continuous on \(\mathbb{R}\), and is monotonic nondecreasing when \(\alpha \ge 0\); its range is \(\mathbb{R}\) whenever \(\alpha > 0\). It is illustrated in Figure 2.11.
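The whole family of Equation (2.11) can be sketched with a single function in which \(\alpha\) selects the variant. The function names are hypothetical, and the sampling interval used for RReLU below is an illustrative assumption, not a value taken from the text. Returning 1 for the derivative at \(x = 0\) is a convention of this sketch, since the derivative is undefined there when \(\alpha \neq 1\).

```python
import random

def relu_family(x, alpha=0.0):
    """Equation (2.11): alpha = 0 gives ReLU, alpha = 0.01 gives Leaky ReLU."""
    return x if x >= 0 else alpha * x

def relu_family_grad(x, alpha=0.0):
    """Slope of Equation (2.11); the value at x == 0 is a convention."""
    return 1.0 if x >= 0 else alpha

def rrelu(x, lower=0.125, upper=1.0 / 3.0, rng=random):
    """Randomized ReLU: alpha is drawn uniformly from [lower, upper].

    The bounds are assumptions chosen only for illustration.
    """
    alpha = rng.uniform(lower, upper)
    return relu_family(x, alpha)
```

In PReLU the same `alpha` would be treated as a trainable parameter and updated together with the weights, rather than fixed or sampled.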
The Exponential Linear Unit (ELU) is another member of the broader rectifier family, but its negative branch differs substantially from ReLU and its variants (Nair and Hinton, 2010). It has been shown to outperform standard ReLU in several settings (Clevert et al., 2015; Trottier et al., 2016), particularly because it pushes the mean activation closer to zero and may reduce training time in tasks such as computer vision.
Its definition is
\[ \begin{equation} f(x)= \left\{ \begin{array}{ll} \alpha(e^x-1), & \text{if } x<0,\\ x, & \text{if } x\ge 0. \end{array} \right. \tag{2.13} \end{equation} \]
Figure 2.13: Exponential Linear Unit (ELU).
\[ \begin{equation} f'(x)= \left\{ \begin{array}{ll} \alpha e^x, & \text{if } x<0,\\ 1, & \text{if } x\ge 0. \end{array} \right. \end{equation} \]
Figure 2.14: Exponential Linear Unit (ELU) (derivative).
For \(\alpha > 0\), the function takes values in \((-\alpha,+\infty)\). It belongs to \(C^1\) when \(\alpha = 1\), because the one-sided derivatives at the origin (\(\alpha\) on the left, 1 on the right) then coincide; otherwise it is only \(C^0\). The function is monotonic if and only if \(\alpha \ge 0\), and its derivative is monotonic for \(\alpha \in [0,1]\). As shown in Figure 2.13, ELU approaches \(-\alpha\) as \(x \to -\infty\).
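Equation (2.13) and its derivative can be sketched as below. The function names are hypothetical; `math.expm1` is used for \(e^x - 1\) because it is more accurate for small \(|x|\). The derivative at \(x = 0\) follows the \(x \ge 0\) branch, which matches the left-hand slope only when \(\alpha = 1\).

```python
import math

def elu(x, alpha=1.0):
    """ELU of Equation (2.13)."""
    return x if x >= 0 else alpha * math.expm1(x)  # expm1(x) = e^x - 1

def elu_grad(x, alpha=1.0):
    """Derivative of ELU; the left-hand slope at the origin is alpha."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

# Slopes just left and right of the origin for two values of alpha.
for alpha in (1.0, 0.5):
    left = elu_grad(-1e-9, alpha)   # close to alpha
    right = elu_grad(1e-9, alpha)   # exactly 1
    print(alpha, left, right)
```

For very negative inputs the output saturates at \(-\alpha\), which is what pushes the mean activation toward zero.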
The Scaled Exponential Linear Unit (SELU) extends ELU by introducing a multiplicative factor \(\lambda\), which may be specified externally or arise from the neuron’s self-normalizing dynamics (Klambauer et al., 2017). This function enables recursive normalization within the network and helps mitigate vanishing-gradient issues (Clevert et al., 2015). SELU is shown in Figure 2.15 and defined by Equation (2.14):
\[ \begin{equation} f(x)=\lambda \left\{ \begin{array}{ll} \alpha(e^x-1), & \text{if } x<0,\\ x, & \text{if } x\ge 0. \end{array} \right. \tag{2.14} \end{equation} \]
\[ \begin{equation} f'(x)=\lambda \left\{ \begin{array}{ll} \alpha e^x, & \text{if } x<0,\\ 1, & \text{if } x\ge 0. \end{array} \right. \end{equation} \]
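A minimal sketch of Equation (2.14) follows. The numerical constants are the values reported by Klambauer et al. (2017) for the self-normalizing case; they are not stated in the text above and are included here as an assumption about the intended parameterization.

```python
import math

# Constants reported by Klambauer et al. (2017); assumed here, not from the text.
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def selu(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    """SELU of Equation (2.14): lambda times ELU."""
    return lam * (x if x >= 0 else alpha * math.expm1(x))

def selu_grad(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    """Derivative of SELU: lambda times the ELU derivative."""
    return lam * (1.0 if x >= 0 else alpha * math.exp(x))
```

Because \(\lambda > 1\), the positive branch has slope slightly above one, while the negative branch saturates at \(-\lambda\alpha\).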
As noted earlier, when the output variable is dichotomous, the sigmoid or logistic function is appropriate for computing probabilities. When the output consists of multiple discrete classes, however, the corresponding generalization is the multinomial logistic function, which in machine learning is commonly implemented as the SoftMax function (Friedman et al., 2001).
SoftMax transforms a vector of raw scores into a vector of probabilities across multiple classes. Its standard form is
\[ \begin{equation} f_i(\mathbf{x})=\frac{e^{x_i}}{\sum\limits_{j=1}^{J} e^{x_j}} \tag{2.15} \end{equation} \]
Each component \(f_i\) takes values in \((0,1)\), the components sum to one, and the function belongs to \(C^{\infty}\).
SoftMax is widely used not only in artificial neural networks, but also in Naïve Bayes, linear discriminant analysis, and multinomial logistic regression. A key statistical requirement is that the output categories should be mutually exclusive, with no dependence structure among the classes represented by the SoftMax outputs.
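Equation (2.15) can be sketched as follows. Subtracting the largest score before exponentiating is a common numerical-stability device that is not mentioned in the text; it leaves the result unchanged because the common factor cancels in the ratio. The example scores are illustrative.

```python
import math

def softmax(scores):
    """SoftMax of Equation (2.15), computed in a numerically stable way."""
    m = max(scores)                           # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scores]  # the shift cancels in the ratio
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The resulting values are positive, preserve the ordering of the raw scores, and sum to one, so they can be read as class probabilities.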
The odd activation function, also known as the signum or sign function, is defined in Equation (2.16) and displayed in Figure 2.15:
\[ \begin{equation} f(x) = \left\{ \begin{array}{lc} -1, & \text{if } x<0, \\ 0, & \text{if } x=0, \\ 1, & \text{if } x > 0. \end{array} \right. \tag{2.16} \end{equation} \]
Its range is \(\{-1,0,1\}\). The function is discontinuous at \(x=0\) and monotonic nondecreasing.
Figure 2.15: Odd Activation (Signum / Sign) function.
Hinton et al. (2012) introduced Dropout as a regularization strategy that reduces overfitting by approximating the averaging of an ensemble of neural architectures, and this later motivated the development of Maxout by Goodfellow et al. (2013). In a Maxout neuron, the activation is simply the maximum among a set of input values, which makes the function especially compatible with Dropout regularization.
Its form is
\[ \begin{equation} f(\mathbf{x})=\max_i x_i \tag{2.17} \end{equation} \]
with domain \((-\infty,+\infty)\). Note that it is continuous and piecewise linear, though not differentiable everywhere.
Goodfellow et al. (2013) report that when training extends beyond roughly 30 epochs, Maxout maintains substantial improvements on large validation sets. In addition, model averaging with Maxout on MNIST achieved better results than several alternative rectifiers.
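Equation (2.17) and its piecewise-linear nature can be sketched as below. The function names are hypothetical, and the inputs are taken as already-computed values, as in the description above. When two inputs tie for the maximum the function is not differentiable, and picking the first maximal index is only a convention of this sketch.

```python
def maxout(values):
    """Maxout of Equation (2.17): the largest of the input values."""
    return max(values)

def maxout_grad(values):
    """Gradient with respect to each input: 1 for the winner, 0 otherwise.

    On ties the first maximal index is used, which is a convention only.
    """
    winner = max(range(len(values)), key=values.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(values))]
```

Only the winning input receives gradient, so the unit behaves locally like a linear function of that input.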
Some activation functions are useful because they act as smooth compromises between other well-known functions. The softsign is one such example. It resembles a smoothed version of tanh, with some similarity to sigmoid and even to the odd activation function (Aghdam and Heravi, 2017). Near the origin, it also approximates the identity function, since its derivative equals 1 at \(x = 0\).
Its form is
\[ \begin{equation} f(x)=\frac{x}{1+|x|} \tag{2.18} \end{equation} \]
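with range \((-1,+1)\). It is strictly increasing and belongs to \(C^1\) but not to \(C^{\infty}\), because its second derivative is discontinuous at the origin. Its derivative is
\[ \begin{equation} f'(x)=\frac{1}{(1+|x|)^2} \end{equation} \]
This expression follows from direct differentiation for \(x \neq 0\); at \(x = 0\) both one-sided derivatives equal 1, so the formula holds on all of \(\mathbb{R}\).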
A particularly important feature of softsign is the way it saturates as \(|x|\) becomes large. Softsign approaches its asymptotes polynomially, whereas tanh approaches them exponentially, so its saturation is milder. It also requires no exponential, which makes it cheap to compute and sometimes a practical alternative to more sharply clipped activations.
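The difference in saturation can be made concrete by measuring how far each function is from its upper asymptote 1. The sketch below uses hypothetical function names and arbitrary sample points.

```python
import math

def softsign(x):
    """Softsign of Equation (2.18)."""
    return x / (1.0 + abs(x))

def softsign_grad(x):
    """Derivative of softsign, valid for every real x."""
    return 1.0 / (1.0 + abs(x)) ** 2

# Distance to the asymptote 1 for softsign and for tanh.
for x in (1.0, 5.0, 20.0):
    gap_softsign = 1.0 - softsign(x)  # equals 1 / (1 + x) for x >= 0
    gap_tanh = 1.0 - math.tanh(x)     # shrinks exponentially in x
    print(x, gap_softsign, gap_tanh)
```

For positive \(x\) the softsign gap is \(1/(1+x)\), so it is still clearly above zero at inputs where tanh is already numerically indistinguishable from 1.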
The Elliot function (Elliott, 1993) is another sigmoidal function that maps inputs into the interval \((0,+1)\). It has been reported to perform well in practice and, according to Matlab-based experiments, does not suffer severely from vanishing-neuron issues (Ploskas and Samaras, 2016). Its form is given below and illustrated in Figure 2.16.
\[ \begin{equation} f(x)=\frac{0.5x}{1+|x|}+0.5 \tag{2.19} \end{equation} \]
Its range is \((0,+1)\); it is strictly increasing and belongs to \(C^1\) but not to \(C^{\infty}\), because the term \(|x|\) makes its second derivative discontinuous at the origin.
\[ \begin{equation} f'(x)=\frac{0.5}{(1+|x|)^2} \end{equation} \]
The hyperbolic tangent, or tanh, was first introduced by Sauri (1774) and later became a standard activation function in machine learning. Although it is derived from \(\frac{\sinh(x)}{\cosh(x)}\), its graph resembles both sigmoid and softsign. In fact, tanh can be interpreted as a rescaled sigmoid.
Its main expression and equivalent rescalings are given in Equations (2.20), (2.21), and (2.22), and are shown in Figure 2.16:
\[ \begin{equation} f(x)=\tanh(x)=\frac{2}{1+e^{-2x}}-1 \tag{2.20} \end{equation} \]
Its range is \((-1,+1)\), it belongs to \(C^{\infty}\), and it is strictly increasing. It can also be derived from a sigmoid function:
\[ \begin{equation} \tanh(x) = 2 \cdot \sigma (2x) - 1, \tag{2.21} \end{equation} \]
where \(\sigma(x)\) is the standard sigmoid, written here in a form equivalent to Equation (2.9):
\[ \begin{equation} \sigma(x) = \frac{e^x}{1 + e^x}. \tag{2.22} \end{equation} \]
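The identity in Equation (2.21) follows from the definition of tanh by multiplying numerator and denominator by \(e^{-x}\); the intermediate steps, which are not spelled out above, are:

\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{2}{1 + e^{-2x}} - 1 = 2\,\sigma(2x) - 1. \]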
One of the main contributions of tanh and sigmoid to the machine learning literature is their ability to map real-valued inputs from \((-\infty,+\infty)\) into bounded ranges: \((-1,+1)\) for tanh and \((0,1)\) for sigmoid. These bounded mappings help stabilize learning and are closely related to normalization effects within the network. In particular, they support procedures such as batch normalization by promoting more regular activation distributions, which can improve backpropagation dynamics (LeCun et al., 2012).
Figure 2.16: Softsign, Hyperbolic Tangent and Elliot function.
\[ \begin{equation} f'(x)=1-\tanh^2(x)=\frac{1}{\cosh^2(x)} \end{equation} \]
Figure 2.17: Softsign, Hyperbolic Tangent and Elliot function (derivative).
When the upper and lower bounds provided by tanh, sigmoid, or softsign are not appropriate, the arc tangent or arctan function offers another sigmoidal alternative (Aghdam and Heravi, 2017). Its output saturates symmetrically around the origin at \(\pm \frac{\pi}{2}\), making it useful as a normalizer.
Its form is shown in Figure 2.18 and defined by Equation (2.23):
\[ \begin{equation} f(x)=\tan^{-1}(x) \tag{2.23} \end{equation} \]
Figure 2.18: Arc Tangent function.
\[ \begin{equation} f'(x)=\frac{1}{1+x^2} \end{equation} \]
Figure 2.19: Arc Tangent function (derivative).
Its range is \(\left( - \frac{\pi}{2}, \frac{\pi}{2} \right)\), it belongs to \(C^{\infty}\), and it is strictly increasing.
LeCun proposed a scaled form of the hyperbolic tangent that may also be used as a normalizing activation. Its graph appears in Figure 2.18, and its expression is
\[ \begin{equation} f(x)=1.7159\,\tanh\left(\frac{2}{3}x\right) \tag{2.24} \end{equation} \]
Its range is \((-1.7159,\,1.7159)\). It belongs to \(C^{\infty}\) and is strictly increasing.
\[ \begin{equation} f'(x)=1.7159\cdot \frac{2}{3}\left[1-\tanh^2\left(\frac{2}{3}x\right)\right] \end{equation} \]
The complementary log-log function is the inverse of the cumulative distribution function of the extreme-value (or log-Weibull) distribution. It is widely used in statistical modeling of hazard-type responses and resembles members of the sigmoid family.
Like sigmoid, it produces outputs between 0 and 1, but these outputs can be interpreted in terms of hazard effects associated with reverse extreme-value errors. Gomes and Ludermir (2008) report that it may outperform logit and tanh activations in multilayer perceptrons when evaluated through mean squared error. The function, often abbreviated cloglog, is shown in Figure 2.20 and defined by Equation (2.25).
\[ \begin{equation} f(x)=1-\exp(-\exp(x)) \tag{2.25} \end{equation} \]
Its range is \((0,1)\); the function belongs to \(C^{\infty}\) and is strictly increasing.
Equation (2.25) is the inverse of the cloglog link \(\ln(-\ln(1-p))\); in practice, it is this inverse form that is commonly used as an activation function for classification tasks.
Figure 2.20: Complementary Log-Log function.
\[ \begin{equation} f'(x)=\exp(x-\exp(x)) \end{equation} \]
Figure 2.21: Complementary Log-Log function (derivative).
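The following is a minimal sketch of Equation (2.25) and its derivative in Python. The use of `math.expm1` and the cutoff at `x > 700` are implementation choices made here to avoid precision loss and overflow; they are assumptions of this sketch, not part of the cited work.

```python
import math

def cloglog_activation(x: float) -> float:
    """f(x) = 1 - exp(-exp(x)), written as -expm1(-exp(x)) for accuracy when f is near 0."""
    if x > 700.0:
        # exp(x) would overflow a double; the output has already saturated at 1.
        return 1.0
    return -math.expm1(-math.exp(x))

def cloglog_grad(x: float) -> float:
    """f'(x) = exp(x - exp(x))."""
    if x > 700.0:
        return 0.0
    return math.exp(x - math.exp(x))

if __name__ == "__main__":
    for x in (-5.0, 0.0, 2.0):
        print(x, cloglog_activation(x), cloglog_grad(x))
```

Unlike the logistic sigmoid, this curve is asymmetric: it approaches 0 slowly on the left and saturates at 1 quickly on the right.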
The softplus function gained importance after Glorot et al. (2011) emphasized its relevance, and later work by Zheng et al. (2015) showed improvements in deep neural networks obtained through its use. Softplus can be understood as a smooth version of the ReLU, especially on the negative side.
\[ \begin{equation} f(x)=\ln(1+e^x) \tag{2.26} \end{equation} \]
Its range is \((0,+\infty)\); the function belongs to \(C^{\infty}\), and both the function and its derivative are strictly increasing.
Figure 2.22: Softplus function.
\[ \begin{equation} f'(x)=\frac{e^x}{1+e^x} \end{equation} \]
This is exactly the logistic function.
Figure 2.23: Softplus function (derivative).
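The relationship between softplus and the logistic function can be checked numerically. The sketch below assumes a rewritten but algebraically equivalent form of Equation (2.26), \(\max(x,0)+\ln(1+e^{-|x|})\), chosen here only to avoid overflow for large inputs; the helper names are illustrative.

```python
import math

def softplus(x: float) -> float:
    """ln(1 + e^x), evaluated as max(x, 0) + log1p(e^{-|x|}) to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def logistic(x: float) -> float:
    """1 / (1 + e^{-x}), evaluated without overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

if __name__ == "__main__":
    for x in (-3.0, 0.0, 2.0):
        print(x, central_difference(softplus, x), logistic(x))
```

The two printed columns should agree to several decimal places, which is the numerical counterpart of the identity stated above.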
Although bent functions were originally defined in the 1960s, they were formally published by Rothaus (1976). Around the same period, related ideas were also used in Soviet cryptography by V.A. Eliseev and O.P. Stepchenkov (Tokareva, 2015). Bent functions are generally classified within the Boolean function family (Çeşmelioğlu et al., 2016; Savický, 1994).
The bent identity function is given by Equation (2.27) and displayed in Figure 2.24:
\[ \begin{equation} f(x)=\frac{\sqrt{x^2+1}-1}{2}+x \tag{2.27} \end{equation} \]
Its domain is \((-\infty,+\infty)\); the function belongs to \(C^{\infty}\), and both the function and its derivative are strictly increasing.
Bent identity is interesting because it adds a smooth, nonnegative correction to the linear identity: \(f(x)\ge x\) for every \(x\), with equality only at \(x=0\), while the slope approaches \(1/2\) for large negative inputs and \(3/2\) for large positive inputs. At the time of writing, it has even been mentioned in applications involving Ethereum-related optimization problems when standard rectifiers are insufficient.
Figure 2.24: Bent Identity function.
\[ \begin{equation} f'(x)=\frac{x}{2\sqrt{x^2+1}}+1 \end{equation} \]
Figure 2.25: Bent Identity function (derivative).
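A short Python sketch of Equation (2.27) and its derivative makes the shape of the function easy to inspect. The function names are illustrative; the limiting slopes printed at the end follow directly from the derivative formula.

```python
import math

def bent_identity(x: float) -> float:
    """f(x) = (sqrt(x^2 + 1) - 1) / 2 + x."""
    return (math.sqrt(x * x + 1.0) - 1.0) / 2.0 + x

def bent_identity_grad(x: float) -> float:
    """f'(x) = x / (2 sqrt(x^2 + 1)) + 1, which lies strictly between 1/2 and 3/2."""
    return x / (2.0 * math.sqrt(x * x + 1.0)) + 1.0

if __name__ == "__main__":
    for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
        print(x, bent_identity(x), bent_identity_grad(x))
    # Slopes far from the origin approach 1/2 (left) and 3/2 (right).
    print(bent_identity_grad(-1e6), bent_identity_grad(1e6))
```

Because the derivative never reaches zero, the function does not saturate in either direction.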
The soft exponential function, defined in Equation (2.28), was proposed by Godfrey and Gashler (2015), who argued that it could serve as a useful activation function for neural networks, although with limited empirical evidence at the time. Different values of \(\alpha\) generate different functional behaviors, as shown in Figure 2.26:
\[ \begin{equation} f(\alpha,x)= \left\{ \begin{array}{ll} -\dfrac{\ln\!\left(1-\alpha(x+\alpha)\right)}{\alpha}, & \text{if } \alpha<0,\\[6pt] x, & \text{if } \alpha=0,\\[6pt] \dfrac{e^{\alpha x}-1}{\alpha}+\alpha, & \text{if } \alpha>0. \end{array} \right. \tag{2.28} \end{equation} \]
Figure 2.26: Soft Exponential function.
The derivative of the soft exponential function with respect to \(x\) is
\[ \begin{equation} \frac{\partial}{\partial x}f(\alpha,x)= \left\{ \begin{array}{ll} \dfrac{1}{1-\alpha(x+\alpha)}, & \text{if } \alpha<0,\\[6pt] 1, & \text{if } \alpha=0,\\[6pt] e^{\alpha x}, & \text{if } \alpha>0. \end{array} \right. \end{equation} \]
Figure 2.27: Soft Exponential function (derivative).
For \(\alpha \ge 0\) the domain of the function is \(\mathbb{R}\); for \(\alpha<0\) the logarithmic branch requires \(1-\alpha(x+\alpha)>0\), that is, \(x > 1/\alpha - \alpha\). The function is smooth (\(C^{\infty}\)) in \(x\) wherever it is defined. The parameter \(\alpha\) controls the functional regime: negative values produce logarithmic-like behavior, \(\alpha=0\) yields a linear mapping, and positive values generate exponential growth.
Godfrey and Gashler (2015) describe the SoftExponential activation as a unifying function that combines the characteristics of logarithmic, linear, and exponential transformations within a single parameterized family. Because the function is smooth, differentiable, and computationally tractable, it can adapt its curvature to different learning scenarios. Moreover, the flexibility introduced by the parameter \(\alpha\) may help mitigate certain optimization issues, including vanishing-gradient effects in specific settings.
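The three branches of Equation (2.28) and of its derivative can be sketched as follows. This is a minimal scalar implementation for illustration; the explicit domain check on the logarithmic branch and the function names are assumptions of this sketch, and \(\alpha\) is treated as a fixed number rather than a trained parameter.

```python
import math

def soft_exponential(alpha: float, x: float) -> float:
    """Soft exponential activation, Equation (2.28)."""
    if alpha < 0.0:
        arg = 1.0 - alpha * (x + alpha)
        if arg <= 0.0:
            raise ValueError("x lies outside the domain of the logarithmic branch")
        return -math.log(arg) / alpha
    if alpha == 0.0:
        return x
    return math.expm1(alpha * x) / alpha + alpha

def soft_exponential_grad(alpha: float, x: float) -> float:
    """Derivative of the soft exponential with respect to x."""
    if alpha < 0.0:
        return 1.0 / (1.0 - alpha * (x + alpha))
    if alpha == 0.0:
        return 1.0
    return math.exp(alpha * x)

if __name__ == "__main__":
    print(soft_exponential(-1.0, math.e))  # alpha = -1 reduces to ln(x)
    print(soft_exponential(0.0, 2.5))      # alpha = 0 is the identity
    print(soft_exponential(1.0, 2.0))      # alpha = 1 reduces to exp(x)
```

The special cases \(\alpha=-1\) and \(\alpha=1\) recover \(\ln x\) and \(e^{x}\) exactly, which illustrates how a single parameter moves between the logarithmic, linear, and exponential regimes.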
Despite these advantages, monotonic activation functions (including SoftExponential) share a common structural limitation: they preserve the ordering of inputs and therefore cannot directly represent oscillatory or cyclic relationships.
While monotonic activation functions have historically dominated neural network design due to their stability and well-behaved optimization properties, many real-world phenomena exhibit inherently oscillatory or cyclic behavior. Signals in physics, audio processing, time-series analysis, and spatial modeling often contain repeating patterns that cannot be adequately captured using strictly monotonic transformations.
Periodic activation functions address this limitation by producing oscillatory outputs. Rather than preserving the ordering of inputs, they allow the activation response to vary cyclically, enabling neural networks to model wave-like structures, harmonic relationships, and repeating local patterns.
These functions are particularly relevant in architectures designed for signal representation, implicit neural fields, and temporal dynamics. A defining characteristic of periodic activations is that their derivatives also oscillate. While this increases expressive power, it may introduce additional optimization challenges, such as sensitivity to initialization, learning rate, and training stability.
In the following subsections, periodic activation functions are organized into two main groups: sinusoidal and non-sinusoidal functions.
A natural starting point for periodic activations is the class of sinusoidal functions. Unlike monotonic transformations, sinusoidal functions are characterized by repeated oscillations over a fixed period. These oscillations may differ in amplitude, phase, or frequency, but their defining feature is periodic repetition.
From a mathematical perspective, sinusoidal functions are fundamental because they serve as the building blocks of many signal-processing techniques, including Fourier analysis and its extensions. Their smoothness and infinite differentiability make them especially suitable for neural models that aim to represent continuous and structured patterns.
The sine wave is the canonical example of a periodic function. It oscillates smoothly around a central axis and produces values that repeat over time or space. In its normalized form, it takes values in the interval \([-1,1]\), yielding a characteristic wave-like pattern (Parascandolo et al., 2016).
A general sinusoidal function can be expressed as
\[ \begin{equation} f(x,t) = A \sin(kx \pm \omega t + \varphi) + D \tag{4.1} \end{equation} \]
where:
\(A\) is the amplitude,
\(k\) is the wave number (spatial frequency),
\(\omega\) is the angular frequency,
\(t\) denotes time,
\(\varphi\) is the phase shift, and
\(D\) is a vertical offset.
This function is smooth, belongs to \(C^{\infty}\), and is inherently non-monotonic. The sign \(-\) typically represents rightward propagation, while \(+\) corresponds to leftward propagation.
A key property is that differentiation preserves oscillatory structure:
\[ \frac{d}{dx} \sin(x) = \cos(x), \qquad \frac{d}{dx} \cos(x) = -\sin(x). \]
This makes sinusoidal functions particularly suitable for representing periodic dependencies in neural networks. They are widely used in modeling temporal signals, recurrent dynamics, and continuous structured data.
However, due to their non-monotonic and non-convex nature, sinusoidal activations may introduce optimization challenges such as gradient instability, slower convergence, and sensitivity to learning rates (Lapedes and Farber, 1987). Nevertheless, they have shown strong performance in certain recurrent architectures, particularly for short-term prediction tasks (Sopena and Alquezar, 1994; Alquezar and Sanfeliu, 1994).
An additional useful property is that the mean value of sine and cosine over a full period is zero, which is advantageous in several statistical and signal-processing contexts.
The cardinal sine function, or sinc function, is another important function in signal processing. It is oscillatory rather than strictly periodic: it crosses the horizontal axis at regular intervals with decreasing amplitude, producing a distinctive shape.
Its standard definition is
\[ \begin{equation} f(x) = \frac{\sin(x)}{x}, \qquad x \neq 0 \tag{4.2} \end{equation} \]
with the continuous extension
\[ f(0) = 1. \]
A commonly used normalized version is
\[ \begin{equation} f(x) = \frac{\sin(\pi x)}{\pi x}, \qquad x \neq 0 \tag{4.3} \end{equation} \]
again with \(f(0)=1\).
With this definition, the sinc function is continuous and infinitely differentiable on \(\mathbb{R}\). Its range is bounded, although not easily described in closed form.
A fundamental property is that the normalized sinc function is the Fourier transform of a rectangular function, making it essential in sampling theory, interpolation, and band-limited signal reconstruction.
Figure 4.1: Sine and Cosine Wave functions.
Figure 4.2: Sinc function.
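Both sinc variants, Equations (4.2) and (4.3), can be sketched in Python with the continuous extension at the origin handled explicitly. The function names are illustrative and not taken from the source.

```python
import math

def sinc(x: float) -> float:
    """Unnormalized sinc, sin(x)/x, with the continuous extension sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def sinc_normalized(x: float) -> float:
    """Normalized sinc, sin(pi x)/(pi x), with value 1 at x = 0."""
    return sinc(math.pi * x)

if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(x, sinc(x), sinc_normalized(x))
```

The normalized version vanishes (up to rounding) at every nonzero integer, which is the property exploited in sampling and interpolation.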
For discrete-time signals \(\{x[n]\}\), the discrete-time Fourier transform (DTFT) is defined as
\[ \begin{equation} X(\omega) = \sum_{n=-\infty}^{\infty} x[n] e^{- i \omega n} \tag{4.4} \end{equation} \]
where \(\omega\) is the angular frequency.
If \(x[n] = x(nT)\) corresponds to sampled data, then
\[ X(f) = \sum_{n=-\infty}^{\infty} x(nT)e^{-i2\pi f nT}. \]
The DTFT is periodic in frequency and is closely related to the Shannon–Nyquist sampling theorem, which governs signal reconstruction.
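For a finite-length sequence, the sum in Equation (4.4) has finitely many nonzero terms and can be evaluated directly. The sketch below assumes that all samples outside the supplied list are zero; the function name `dtft` and the `start` argument (the index assigned to the first sample) are hypothetical conveniences introduced here.

```python
import cmath
import math

def dtft(x, omega: float, start: int = 0) -> complex:
    """Evaluate X(omega) = sum_n x[n] exp(-i omega n) for a finite sequence.

    x     : list of samples; samples outside the list are assumed to be zero
    start : index n of the first element of x
    """
    return sum(v * cmath.exp(-1j * omega * (start + n)) for n, v in enumerate(x))

if __name__ == "__main__":
    pulse = [1.0, 1.0, 1.0]                   # samples at n = -1, 0, 1
    w = 0.7
    print(dtft(pulse, w, start=-1))           # close to 1 + 2 cos(w)
    print(dtft(pulse, w + 2 * math.pi, -1))   # same value: period 2*pi in omega
```

The second printed value illustrates the \(2\pi\)-periodicity in \(\omega\) mentioned above.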
The Short-Time Fourier Transform (STFT) extends the Fourier transform by introducing local time information:
\[ \begin{equation} \mathrm{STFT}\{x[n]\}(m,\omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n-m]\, e^{-i\omega n} \tag{4.5} \end{equation} \]
where \(w[\cdot]\) is a window function.
STFT allows analysis of how frequency content evolves over time but is subject to the uncertainty principle:
\[ \Delta f \, \Delta t \ge \frac{1}{4\pi}. \]
The wavelet transform provides a time-frequency representation using localized basis functions:
\[ \begin{equation} W_\psi x(a,b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi^*\!\left(\frac{t-b}{a}\right)\, dt \tag{4.6} \end{equation} \]
where \(a\) is the scale and \(b\) is the translation.
Wavelets capture both temporal localization and frequency information, making them particularly suitable for analyzing nonstationary signals.
Figure 4.3: Fourier and its variants by domain.
Not all periodic, or potentially periodic, activation functions are sinusoidal. Some functions do not explicitly involve sine or cosine terms, yet they can still produce repeating patterns or be adapted to periodic settings through suitable transformations.
In this section, we examine several non-sinusoidal activation functions. Some of them are intrinsically periodic, whereas others can acquire periodic behavior through repetition, wrapping, or piecewise construction.
The Gaussian, or normal, distribution is one of the most fundamental functions in statistics (Stigler, 1986). In its standard form, it is not periodic; instead, it defines a single bell-shaped curve centered at a mean value.
Its density function is given by
\[ \begin{equation} f(x \mid \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\left(- \frac{(x - \mu)^2}{2 \sigma^2}\right) \tag{5.1} \end{equation} \]
which is smooth and belongs to \(C^{\infty}\) on \(\mathbb{R}\).
Figure 5.1: Periodic Gaussian function.
Although the ordinary Gaussian function is not periodic, a periodic version can be obtained by wrapping the argument over a bounded interval. One such construction is
\[ \begin{equation} \rho(x) = f \left(\left(\left( x + \frac{N}{2} \right) \text{mod} \ N \right) - \frac{N}{2} \right) \tag{5.2} \end{equation} \]
This transformation produces a periodic repetition of the Gaussian profile over intervals of length \(N\).
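The wrapping in Equation (5.2) can be sketched in Python as follows. The sketch assumes \(\mu=0\) by default so that each copy of the bell curve is centered on a multiple of \(N\); the function names and default arguments are illustrative.

```python
import math

def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Gaussian density from Equation (5.1)."""
    var = sigma * sigma
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def periodic_gaussian(x: float, period: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Equation (5.2): wrap x into [-N/2, N/2) and evaluate the Gaussian there."""
    wrapped = ((x + period / 2.0) % period) - period / 2.0
    return gaussian_pdf(wrapped, mu, sigma)

if __name__ == "__main__":
    N = 4.0
    for x in (-3.7, 0.3, 4.3, 8.3):
        print(x, periodic_gaussian(x, N))
```

All four inputs differ by multiples of \(N\), so the printed values coincide up to rounding. Python's `%` operator returns a result with the sign of the divisor, which is what makes the wrap work for negative inputs as well.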
Another important periodic analogue is the wrapped normal distribution, which is defined on the unit circle. Closely related to it is the von Mises distribution, a widely used model in directional statistics.
A square wave can be interpreted as a periodic extension of the Heaviside step function. Instead of switching once between two levels, it alternates repeatedly, making it a natural model for binary transmission, switching systems, and certain types of audio or electrical distortion.
A common representation is
\[ \begin{equation} x(t) = \text{sgn}(\sin t), \quad v(t) = \text{sgn}(\cos t) \tag{5.3} \end{equation} \]
These functions are periodic but discontinuous, and therefore they do not belong to \(C^{\infty}\).
Another representation uses shifted step functions:
\[ \begin{equation} x(t) = \sum^{\infty}_{n = - \infty} \Pi (t - nT) = \sum^{\infty}_{n = - \infty} \left[u \left(t - nT + \frac{1}{2} \right) - u\left(t - nT - \frac{1}{2} \right)\right] \tag{5.4} \end{equation} \]
where \(\Pi(\cdot)\) denotes the unit-width rectangular function, \(u(\cdot)\) denotes the unit step function, and \(T\) is the period.
Figure 5.2: Square Wave function.
The triangle wave takes its name from its repeated triangular shape around the horizontal axis (Tansel et al., 1991). It is periodic, continuous, and piecewise linear.
Foresee and Hagan (1997) employed triangle-wave-type functions in the context of Gauss-Newton approximations to Bayesian regularization, reporting favorable error behavior in several applications, including regression, time-series estimation, and chaotic signal modeling.
One possible expression for a triangle wave is
\[ \begin{equation} x(t) = \frac{2}{a} \left( t - a \left[ \frac{t}{a} + \frac{1}{2} \right] \right) (-1)^{\left[ \frac{t}{a} + \frac{1}{2} \right]} \tag{5.5} \end{equation} \]
where \(a\) controls the scale of the oscillation (the period is \(2a\)) and the square brackets \([\cdot]\) denote the floor function.
The triangle wave is continuous, but it is not differentiable at its corner points.
The integral of a square wave is a triangle wave, up to an additive constant:
\[ \begin{equation} \int \text{sgn} \left( \sin x \right) dx = \arccos(\cos x) + C \tag{5.6} \end{equation} \]
Figure 5.3: Triangle Wave function.
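Equation (5.5) can be sketched directly in Python. The sketch reads the square brackets in the formula as the floor function, which is an assumption consistent with the standard form of this expression; the function name is illustrative.

```python
import math

def triangle_wave(t: float, a: float = 1.0) -> float:
    """Triangle wave of Equation (5.5), with values in [-1, 1] and period 2a."""
    k = math.floor(t / a + 0.5)          # index of the nearest multiple of a
    sign = -1.0 if k % 2 else 1.0        # (-1)^k, also correct for negative k
    return (2.0 / a) * (t - a * k) * sign

if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 1.0, 1.5, 2.0):
        print(t, triangle_wave(t))
```

With \(a=1\) the wave rises from 0 at \(t=0\) to 1 at \(t=0.5\), falls to \(-1\) at \(t=1.5\), and returns to 0 at \(t=2\), after which the pattern repeats.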
The sawtooth wave is another non-sinusoidal periodic function. It resembles the triangle wave, but instead of rising and falling symmetrically, it changes linearly in one direction and then resets abruptly.
Sawtooth waves have been used in engineering applications such as power electronics and motor drives (Bose, 2007), as well as in neural systems for biomedical pattern recognition (Wang et al., 2017).
A standard sawtooth representation can be written as
\[ \begin{equation} x(t) = 2 \left( \frac{t}{a} - \left[ \frac{t}{p} + \frac{1}{2} \right] \right) \tag{5.7} \end{equation} \]
where \(a\) and \(p\) control scale and period, respectively.
This function is periodic and piecewise linear, but it is not differentiable at its jump discontinuities. In many constructions, the absolute value of a sawtooth wave produces a triangle wave.
Figure 5.4: Sawtooth Wave function.
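A minimal sketch of the sawtooth in Python is given below. It makes two simplifying assumptions that are not stated in the source: the scale and period coincide (\(a=p\)), and the square brackets denote the floor function. Under these assumptions the expression reduces to the usual sawtooth with values in \([-1,1)\).

```python
import math

def sawtooth_wave(t: float, p: float = 1.0) -> float:
    """Sawtooth 2 * (t/p - floor(t/p + 1/2)), assuming a = p in Equation (5.7)."""
    return 2.0 * (t / p - math.floor(t / p + 0.5))

if __name__ == "__main__":
    for t in (0.0, 0.25, 0.49, 0.5, 0.75, 1.0):
        # abs() of the sawtooth gives a triangle-shaped wave with values in [0, 1]
        print(t, sawtooth_wave(t), abs(sawtooth_wave(t)))
```

The jump from values near \(+1\) to \(-1\) at \(t=p/2\) is the discontinuity at which the function is not differentiable.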
The S-shaped Rectified Linear Unit (SReLU) was introduced by Jin et al. (2016a) as a flexible activation function capable of learning both convex and non-convex response patterns. It is a piecewise linear function controlled by four learnable parameters, which define two transition points, or “knuckles”.
Because these parameters are not known in advance, Jin et al. (2016a, 2016b) proposed initialization strategies to improve training stability. In practice, poor initialization may lead to weak performance, making parameter selection especially important during the early stages of learning.
One proposed strategy is to freeze the SReLU parameters during the initial epochs so that the network first behaves like a simpler rectifier. After this stage, the right threshold may be set adaptively as
\[ \begin{equation} t^r_i = \operatorname{supp}(X_i, k) \tag{5.8} \end{equation} \]
where \(\operatorname{supp}(X_i,k)\) denotes the \(k\)-th largest value in the set \(X_i\), and \(X_i\) contains all input values associated with a given SReLU unit.
The activation function is defined as
\[ \begin{equation} f_{t_l,a_l,t_r,a_r}(x) = \left\{ \begin{array} {ll} t_l + a_l (x - t_l), & \text{if} \ x \le t_l\\ x, & \text{if} \ t_l < x < t_r \\ t_r + a_r (x - t_r), & \text{if} \ x \ge t_r \end{array} \right. \tag{5.9} \end{equation} \]
The function is continuous on \(\mathbb{R}\), and therefore belongs to \(C^0\), but it is generally not differentiable at the two transition points. The parameters \(t_l\), \(a_l\), \(t_r\), and \(a_r\) may either be specified in advance or learned during training.
Its main limitation lies in the difficulty of choosing suitable initial values. Even so, Jin et al. (2016a) reported that SReLU can outperform several rectifier-based activation functions in deep neural networks.
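The piecewise definition in Equation (5.9) translates directly into code. The sketch below is a scalar illustration with fixed parameters, assuming \(t_l < t_r\); the parameter values in the example are hypothetical and are not taken from Jin et al.

```python
def srelu(x: float, t_l: float, a_l: float, t_r: float, a_r: float) -> float:
    """S-shaped ReLU of Equation (5.9); assumes t_l < t_r."""
    if x <= t_l:
        return t_l + a_l * (x - t_l)   # left linear piece with slope a_l
    if x >= t_r:
        return t_r + a_r * (x - t_r)   # right linear piece with slope a_r
    return x                           # identity between the two thresholds

if __name__ == "__main__":
    params = dict(t_l=0.0, a_l=0.1, t_r=1.0, a_r=2.0)  # hypothetical values
    for x in (-2.0, 0.0, 0.5, 1.0, 3.0):
        print(x, srelu(x, **params))
```

Setting `t_l = 0`, `a_l = 0`, and a very large `t_r` makes the unit behave like a standard ReLU over the inputs of interest, which is consistent with the freezing strategy described above.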
The Adaptive Piecewise Linear Unit (APLU) is another flexible activation designed to learn its shape directly from data. Unlike the standard ReLU, APLU incorporates additional hinge-like terms, allowing each neuron to develop its own adaptive piecewise linear response.
Agostinelli et al. (2014) proposed this unit as a generalization of simpler rectifier functions. Its form combines a standard ReLU term with a sum of shifted hinge components:
\[ \begin{equation} f(x) = \max(0,x) + \sum^{S}_{s=1} a^{(s)}_i \max\left(0,-x+b^{(s)}_i\right) \tag{5.10} \end{equation} \]
where the coefficients \(a_i^{(s)}\) and shifts \(b_i^{(s)}\) are learned from the data.
Because the shape of the activation is updated independently for each neuron, APLU can adapt to local patterns more flexibly than fixed activation functions. In this sense, it acts as a neuron-specific activation that is learned jointly with the other model parameters.
Agostinelli et al. (2014) reported that APLU improves predictive performance in experiments such as CIFAR-10 and CIFAR-100. On the other hand, this flexibility may also make the network more fragile, since the overall behavior of the model depends on the stability of many independently learned nonlinearities.
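Equation (5.10) can be sketched for a single neuron as follows. The coefficients and shifts are passed as plain lists of length \(S\); in a real network they would be trainable parameters, and the values used in the example are hypothetical.

```python
def aplu(x: float, a, b) -> float:
    """Adaptive piecewise linear unit of Equation (5.10) for one neuron.

    a : sequence of S hinge coefficients a^(s)
    b : sequence of S hinge shifts b^(s)
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length S")
    hinges = sum(a_s * max(0.0, -x + b_s) for a_s, b_s in zip(a, b))
    return max(0.0, x) + hinges

if __name__ == "__main__":
    a, b = [0.5, -0.2], [0.0, 1.0]     # hypothetical learned values, S = 2
    for x in (-2.0, -1.0, 0.0, 0.5, 2.0):
        print(x, aplu(x, a, b))
```

With empty lists the unit reduces to the standard ReLU, and each additional hinge adds one more breakpoint located at \(x=b^{(s)}\).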
In this document, we examined the role of activation functions in artificial neural networks, emphasizing their importance as nonlinear transformations that determine how neurons respond to incoming signals.
We first distinguished between monotonic and periodic activation functions. Monotonic activations, such as the sigmoid, hyperbolic tangent, ReLU, ELU, and SoftExponential families, were presented as core tools for stable and efficient learning in standard neural network architectures.
We then turned to periodic activation functions, which are particularly useful for representing oscillatory, wave-like, and structured patterns. Within this group, we discussed both sinusoidal functions, such as the sine wave and sinc function, and non-sinusoidal functions, including Gaussian-based periodic constructions, square waves, triangle waves, sawtooth waves, SReLU, and APLU.
Throughout the chapter, attention was given not only to the analytical form of each activation function, but also to its smoothness, differentiability, geometric behavior, and practical relevance for learning algorithms.
Overall, the discussion highlights that no single activation function is universally optimal. Instead, the choice of activation should depend on the structure of the data, the architecture of the model, and the type of relationships the network is expected to learn.
This activity is designed to integrate and apply the concepts introduced in this chapter related to word embeddings, Word2Vec, and semantic similarity. The reader will explore how contextual word representations are structured, queried, and used to compare words and short texts in a numerical vector space.
To build a fully reproducible workflow that:
explores pretrained word embeddings,
analyzes semantic similarity between words,
performs analogy-style queries, and
compares short texts using embedding-based distances.
Select one pretrained word embedding model available through a standard NLP library (e.g., gensim).
Work with:
a small set of common words, and
a small collection of short sentences (2–4 sentences).
Create an R Markdown (.Rmd) document that compiles successfully to HTML (or PDF).
The document must include:
the code, and
the generated output (tables, printed objects, or numerical results).
Briefly describe:
the selected pretrained embedding model,
the source of the training corpus, and
the dimensionality of the word vectors.
Explain why a pretrained model is appropriate for this activity.
Select 10–15 common tokens (e.g., nouns or verbs).
For each token:
verify its presence in the embedding vocabulary, and
report the dimensionality of its vector representation.
Briefly comment on why some tokens may be missing.
Choose three query words and:
retrieve their top 5 most similar words using cosine similarity,
present the results in a clear table.
Interpret the semantic relationships observed in the results.
Construct at least two analogy-style queries using the form:
\[ \text{word}_A - \text{word}_B + \text{word}_C \approx \text{word}_D \]
For each analogy:
specify the positive and negative sets,
report the top predicted result(s), and
discuss whether the analogy is semantically reasonable.
Select four words and compute pairwise cosine similarity scores between their vectors.
Present the results as:
a similarity table, or
a similarity matrix.
Optionally, convert similarity to distance using \(1-\cos(\theta)\). Explain how numerical distance reflects semantic proximity.
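As a starting point, the similarity matrix and the \(1-\cos(\theta)\) distance can be sketched in Python with the standard library only. The three-dimensional vectors in `toy_vectors` are hypothetical placeholders, not values from any pretrained model; in the activity they would be replaced by vectors retrieved from the selected embedding.

```python
import math

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def cosine_distance(u, v) -> float:
    """Distance 1 - cos(theta); smaller values mean closer directions."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical toy vectors used only to illustrate the computation.
toy_vectors = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
    "pear": [0.15, 0.10, 0.95],
}

words = list(toy_vectors)
similarity_matrix = [
    [cosine_similarity(toy_vectors[r], toy_vectors[c]) for c in words]
    for r in words
]

if __name__ == "__main__":
    for word, row in zip(words, similarity_matrix):
        print(f"{word:>6}", " ".join(f"{s:6.3f}" for s in row))
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle needs to be interpreted.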
Define three short sentences (one or two lines each).
Using an embedding-based similarity or distance measure (e.g., Word Mover’s Distance or cosine similarity applied to averaged word vectors):
compute the distance between each pair of sentences,
identify the most similar and most dissimilar sentence pairs.
If both WMD and cosine similarity are computed, compare their rankings and comment on any discrepancies. Interpret the results in terms of semantic content.
Before computing similarities, verify that all selected tokens are present in the embedding vocabulary. Briefly report any out-of-vocabulary (OOV) words and explain their potential impact.
Write a concise reflection (6–10 lines) discussing:
how word embeddings differ from Bag-of-Words and TF-IDF representations,
what semantic information embeddings capture, and
one limitation of static word embeddings such as Word2Vec.
The R Markdown document must be fully reproducible.
All code chunks must execute without errors and regenerate the reported outputs when the document is compiled.
All random seeds (if applicable) must be set to ensure deterministic results.
All library versions used should be clearly reported.