1 Neural networks

1.0.1 Biological neuron

Artificial neural networks (ANNs) constitute one of the most influential paradigms in modern machine learning and artificial intelligence. Their conceptual origin is loosely inspired by the basic functional principles of biological neurons, which process and transmit information through interconnected networks (Kandel et al., 2013).

In the nervous system, neurons receive electrical or chemical signals through dendrites. These signals are integrated within the cell body (the soma), and if the accumulated stimulus surpasses a certain threshold, the neuron generates an electrical impulse known as an action potential. This signal then travels along the axon and is transmitted to other neurons through synaptic connections.

As illustrated in Figure 1.1, the biological neuron is composed of several key components, including dendrites, the soma, the axon, and the synaptic terminals, each playing a specific role in signal transmission.

Figure 1.1: Human neuron. Source: Created by the author with ChatGPT (OpenAI).

Although artificial neurons are only simplified mathematical abstractions of this biological mechanism, the analogy provides an intuitive conceptual foundation. In essence, both systems combine multiple inputs, evaluate their relative influence, and produce an output response according to a transformation rule.

1.0.2 Mathematical model of an artificial neuron

Modern artificial neurons translate the biological intuition of neural signal processing into a mathematical framework suitable for computation and learning. Instead of discrete electrical spikes, artificial neurons compute a weighted linear combination of inputs and then transform this signal through a nonlinear activation function.

Formally, the internal signal of a neuron is defined as

\[ z = \mathbf{w}^\top \mathbf{x} + b, \]

where \(\mathbf{x}\) represents the input vector, \(\mathbf{w}\) denotes the vector of weights, and \(b\) is a bias parameter. The neuron output is then obtained by applying an activation function

\[ h = \varphi(z). \]

The quantity \(z\) can be interpreted as a continuous analogue of synaptic integration, while the function \(\varphi(\cdot)\) generalizes the biological threshold mechanism into a smooth and differentiable transformation.

This transition from discrete logical models to continuous optimization was essential for the development of modern neural networks. By allowing gradients to be computed and propagated through the model, neural networks can be trained efficiently using gradient-based learning algorithms.

As a result, ANNs are capable of approximating complex nonlinear mappings and learning expressive internal representations from data. Through iterative adjustments of weights and biases, the model progressively refines its representation of the input space. Consequently, neural networks have become fundamental tools for tasks such as classification, regression, and representation learning in high-dimensional environments (Goodfellow et al., 2016).
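The single-neuron computation defined above, \(z = \mathbf{w}^\top \mathbf{x} + b\) followed by \(h = \varphi(z)\), can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the choice of the logistic activation, and the numerical values of the inputs, weights, and bias are assumptions made only for the example.

```python
import math

def neuron(x, w, b, phi):
    """Return (z, h) with z = w^T x + b and h = phi(z)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, phi(z)

def logistic(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values (assumptions, not taken from the text).
x = [1.0, 2.0, -1.0]
w = [0.5, -0.25, 1.0]
b = 1.0

z, h = neuron(x, w, b, logistic)
print("internal signal z =", z)
print("output h =", h)
```

With these values the weighted sum cancels exactly, so the internal signal is zero and the logistic output sits at its midpoint.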

As illustrated in Figure 1.2, a neural network is constructed by stacking multiple artificial neurons into layers. The output of each neuron becomes the input of neurons in the next layer, allowing the model to progressively learn more complex representations of the data.

Figure 1.2: Basic architecture of a feedforward neural network. Source: Created by the author with ChatGPT (OpenAI).

Each neuron in the network performs the same basic operation described above, namely computing a weighted sum followed by a nonlinear activation. By composing many such units, the network is able to model highly complex functions.

For a more detailed introduction to neural networks and their architectures, the reader is referred to the companion document An introduction to neural networks (Llinás, 2026), which provides a comprehensive overview of neural network models and their variants.

1.0.3 Smooth functions and the class \(C^{\infty}\)

Definition.

In many areas of mathematics and machine learning, it is important to work with functions that are not only continuous, but also sufficiently smooth. This smoothness ensures that derivatives exist and behave well, which is essential for optimization algorithms such as gradient descent.

A function \(f: \mathbb{R} \to \mathbb{R}\) is said to belong to the class \(C^{\infty}\) if it is infinitely differentiable; that is, all derivatives of any order exist and are continuous. Formally,

\[ f \in C^{\infty} \quad \Longleftrightarrow \quad f^{(k)} \text{ exists and is continuous for all } k \in \mathbb{N}. \]

Functions in \(C^{\infty}\) are often referred to as smooth functions, meaning that they can be differentiated infinitely many times without any discontinuities or irregularities in their derivatives.

Example 1.

A function that belongs to \(C^{\infty}\):

\[ f(x) = e^x \]

This function is infinitely differentiable, and all its derivatives are equal to \(e^x\), which is continuous everywhere.

Example 2.

In general, examples of smooth functions include the exponential function, the sine and cosine functions, and sigmoid-type functions.

Example 3.

A function that does not belong to \(C^{\infty}\):

\[ f(x) = |x| \]

Although this function is continuous, it is not differentiable at \(x = 0\), and therefore it is not smooth.
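The failure of differentiability at the origin can be verified directly from the one-sided limits of the difference quotient (a short worked check, using only the definition of the derivative):

\[ \lim_{h \to 0^{+}} \frac{|0+h| - |0|}{h} = \lim_{h \to 0^{+}} \frac{h}{h} = 1, \qquad \lim_{h \to 0^{-}} \frac{|0+h| - |0|}{h} = \lim_{h \to 0^{-}} \frac{-h}{h} = -1. \]

Since the two limits differ, \(f'(0)\) does not exist.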

Remark (Neural networks context).

In the context of neural networks, smooth activation functions are particularly useful because they allow gradients to be computed reliably during training. However, not all commonly used activation functions belong to \(C^{\infty}\). For instance, the ReLU function is not differentiable at zero, yet it remains widely used due to its practical advantages.

Throughout this document, we will frequently refer to functions in \(C^{\infty}\) when discussing theoretical properties of activation functions and optimization.

1.0.4 Activation functions

Definition.

Activation functions are a fundamental component of artificial neural networks because they introduce nonlinearity into the model. Without nonlinear activation functions, a neural network composed of multiple layers would reduce to an equivalent linear transformation, regardless of the number of layers.
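To see why, consider two consecutive layers with no activation function. The symbols \(\mathbf{W}_1, \mathbf{W}_2\) (weight matrices) and \(\mathbf{b}_1, \mathbf{b}_2\) (bias vectors) are notation introduced here only for this illustration:

\[ \mathbf{W}_2\left(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1\right) + \mathbf{b}_2 = \left(\mathbf{W}_2\mathbf{W}_1\right)\mathbf{x} + \left(\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2\right) = \widetilde{\mathbf{W}}\mathbf{x} + \widetilde{\mathbf{b}}. \]

The composition is again a single affine map with \(\widetilde{\mathbf{W}} = \mathbf{W}_2\mathbf{W}_1\) and \(\widetilde{\mathbf{b}} = \mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2\), and the same argument applies inductively to any number of layers.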

Recall that an artificial neuron computes a linear combination of its inputs

\[ z = \mathbf{w}^\top\mathbf{x} + b, \]

which represents the internal signal of the neuron. The final output is obtained by applying a nonlinear transformation

\[ h = \varphi(z), \]

where \(\varphi(\cdot)\) is called the activation function.

Role of the activation function.

The role of the activation function is to transform the internal signal of the neuron into a response that can capture complex relationships between variables. By introducing nonlinear transformations, neural networks can construct nonlinear decision boundaries and learn rich internal representations of the data.

This capability is supported by classical universal approximation results, which show that neural networks equipped with suitable activation functions can approximate a wide class of functions on compact domains.

From a geometric perspective, activation functions distort the feature space in a controlled way, allowing the model to separate patterns that would otherwise be inseparable using only linear transformations.

1.0.5 Classification of activation functions

Monotonic and periodic functions.

Activation functions can be organized into different categories according to their mathematical properties. One common classification distinguishes between monotonic and periodic activation functions.

Monotonic activation functions are functions whose output consistently increases or decreases with respect to the input. These functions have historically been the most widely used in neural networks and include examples such as the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU).
Periodic activation functions exhibit oscillatory behavior and are particularly useful in models designed to capture periodic or high-frequency patterns. These functions are commonly used in specialized neural architectures for representing signals, implicit functions, or spatial fields.

Classification of periodic activation functions.

Periodic activation functions can be further divided into:

Sinusoidal activation functions
Non-sinusoidal periodic functions

In the following sections, we examine in detail several activation functions commonly used in neural networks, discussing their mathematical form, key properties, and implications for learning algorithms.

We begin by examining monotonic activation functions, which have historically played a central role in the development of neural networks.

2 Monotonic activation functions

Intuition and graphical interpretation.

As discussed in the previous section, the output of these functions moves in a single direction as the input grows, which makes them especially useful when the model is expected to respond in a consistently increasing or decreasing manner.

In mathematics, a function is said to be monotonic when it is either nondecreasing or nonincreasing over its domain (Royden and Fitzpatrick, 1998). From a graphical perspective, monotonicity means that the curve evolves in a single overall direction: it either increases as the input grows or decreases without reversing its global trend.

This idea is illustrated in Figure 2.1. Two of the curves exhibit monotonic behavior: one shows a steady increase, while another shows a steady decrease over the entire domain. In contrast, the third curve oscillates, alternating between increasing and decreasing intervals, and therefore does not satisfy the monotonicity property. This comparison highlights that monotonic functions never reverse direction (nondecreasing functions preserve the ordering of inputs, and nonincreasing functions consistently reverse it), whereas non-monotonic functions may change direction.

Figure 2.1: Monotonic function.

Formal definition.

A function \(f\) is nondecreasing if for all \(x,y \in \mathbb{R}\), whenever \(x \ge y\), it follows that \(f(x) \ge f(y)\).

Likewise, \(f\) is nonincreasing if for all \(x,y \in \mathbb{R}\), whenever \(x \ge y\), it follows that \(f(x) \le f(y)\).
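The two definitions above can be probed numerically on a finite grid of points. The sketch below is only an illustration: the helper names and the grid are assumptions, and a grid check can refute monotonicity but cannot prove it, since it inspects finitely many points.

```python
import math

def is_nondecreasing(f, xs):
    """True if f(x_i) <= f(x_{i+1}) for consecutive points of the sorted grid."""
    ys = [f(x) for x in sorted(xs)]
    return all(a <= b for a, b in zip(ys, ys[1:]))

def is_nonincreasing(f, xs):
    """True if f(x_i) >= f(x_{i+1}) for consecutive points of the sorted grid."""
    ys = [f(x) for x in sorted(xs)]
    return all(a >= b for a, b in zip(ys, ys[1:]))

# Sample points in [-5, 5] (an arbitrary illustrative grid).
grid = [i / 10.0 for i in range(-50, 51)]

print("tanh nondecreasing:", is_nondecreasing(math.tanh, grid))
print("-2x nonincreasing:", is_nonincreasing(lambda x: -2.0 * x, grid))
print("sin nondecreasing:", is_nondecreasing(math.sin, grid))
print("sin nonincreasing:", is_nonincreasing(math.sin, grid))
```

On this grid the hyperbolic tangent and the decreasing line pass their respective checks, while the sine fails both, in line with the oscillating curve described for Figure 2.1.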

Common monotonic activation functions.

In the context of neural networks, many commonly used activation functions exhibit monotonic behavior. The following list summarizes several important examples that will be studied in detail:

Linear function
Identity function
Piecewise linear function
Threshold (Heaviside) function
Sigmoid function
Bipolar sigmoid function
ReLU and its variants (Leaky ReLU, PReLU, RReLU)
ELU and SELU
SoftMax function
Sign function
Maxout function
Softsign function
Elliot function
Hyperbolic tangent (tanh) function
Arctangent function
LeCun’s hyperbolic tangent function
Complementary log-log function
Softplus function
Bent identity function
Soft exponential function

Each of these functions has different properties in terms of smoothness, differentiability, and practical performance in neural networks.

2.0.1 Linear function

Among the simplest monotonic transformations, the linear function is often the first candidate to consider when inputs are combined through weighted sums before entering a neuron. If the inputs are already shaped by weights, whether specified manually or learned from data, a linear transformation provides the most direct mapping from input to output.

However, this function has two major limitations in the context of neural networks. First, its derivative is constant, which means that gradient-based optimization does not benefit from any input-dependent curvature. As a result, the gradient conveys no richer structure than a fixed slope. Second, when the function is used in backpropagation, error corrections remain proportional to a constant term, so the update dynamics do not meaningfully adapt to changes in the input. In this sense, the function lacks the expressive nonlinearity needed for deep learning.

The general form of the linear activation is

\[ \begin{equation} f(x) = \alpha x \tag{2.1} \end{equation} \]

where \(\alpha \in \mathbb{R}\). Its domain is \((-\infty, \infty)\), it belongs to \(C^{\infty}\), and it is monotonic. Its first derivative is constant:

\[ f'(x)=\alpha \]

When \(\alpha = 1\), the function reduces to the identity function. Both cases are displayed in Figure 2.2.

The function is monotonic increasing if \(\alpha > 0\), monotonic decreasing if \(\alpha < 0\), and constant if \(\alpha = 0\).

Figure 2.2: Linear and Identity functions.

The corresponding derivatives, which remain constant for each case, are shown in Figure 2.3.

Figure 2.3: Linear and Identity functions (derivative).

2.0.2 Identity function

The function.

When \(\alpha = 1\), the linear function becomes the identity function:

\[ f(x)=x \]

Derivative.

\[ f'(x)=1 \]

At first glance, and as noted earlier, this function may seem uninformative because it leaves the input unchanged. Nevertheless, in neural computation it still plays a role, since it passes the weighted sum directly to the next stage without additional distortion.

In that sense, the identity function acts as a direct transmitter of the summation term \(\sum_j w_{kj} x_j\), sometimes described as a replicator or duplicator of the neuron’s internal linear combination (Haykin, 2001; Rice, 1953). The corresponding expression is also represented in Figure 2.2.

2.0.3 Piecewise linear function

The function.

A more flexible alternative is the piecewise linear function, defined in Equation (2.2). This function clips the output to the interval \([0,1]\) by means of two input thresholds, \(\alpha_{min}\) and \(\alpha_{max}\), with \(\alpha_{min} < \alpha_{max}\) (Zeng et al., 2010). Inputs below the lower threshold are mapped to 0, inputs above the upper threshold are mapped to 1, and inputs between the thresholds are mapped linearly, as illustrated in Figure 2.4.

\[ \begin{equation} f(x) = \left\{ \begin{array}{ll} 0, & \text{if } x < \alpha_{min}, \\ mx + b, & \text{if } \alpha_{min} \le x \le \alpha_{max},\\ 1, & \text{if } x > \alpha_{max}, \end{array} \right. \tag{2.2} \end{equation} \]

where the slope \(m\) is given by

\[ \begin{equation} m = \frac{1}{\alpha_{max} - \alpha_{min}} \tag{2.3} \end{equation} \]

and the intercept \(b\) is given by

\[ \begin{equation} b=-m \alpha_{min} = 1 - m \alpha_{max} \tag{2.4} \end{equation} \]

Its domain is \((-\infty, \infty)\); it is continuous on \(\mathbb{R}\) and monotonic, but it does not belong to \(C^{\infty}\) due to the nondifferentiability at the threshold points.
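Equations (2.2) to (2.4) translate directly into code. The following sketch is illustrative only: the function name, the argument names `a_min` and `a_max` (standing for \(\alpha_{min}\) and \(\alpha_{max}\)), and the threshold values in the usage lines are assumptions.

```python
def piecewise_linear(x, a_min, a_max):
    """Piecewise linear activation of Equation (2.2).

    a_min and a_max play the role of alpha_min and alpha_max.
    """
    if a_min >= a_max:
        raise ValueError("a_min must be strictly smaller than a_max")
    m = 1.0 / (a_max - a_min)   # slope, Equation (2.3)
    b = -m * a_min              # intercept, Equation (2.4)
    if x < a_min:
        return 0.0
    if x > a_max:
        return 1.0
    return m * x + b

# Illustrative thresholds (assumptions): alpha_min = -1, alpha_max = 3.
for x in (-2.0, -1.0, 1.0, 3.0, 5.0):
    print(x, piecewise_linear(x, -1.0, 3.0))
```

The linear branch meets the two constant branches at the thresholds, which is the continuity property stated above.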

Figure 2.4: Piecewise Linear function.

Derivative.

\[ f'(x) = \left\{ \begin{array}{cl} 0 & \text{if } x < \alpha_{min} \\ m & \text{if } \alpha_{min} < x < \alpha_{max}\\ 0 & \text{if } x > \alpha_{max} \end{array} \right. \]

The derivative is piecewise constant: it equals \(m\) on the open interval \((\alpha_{min}, \alpha_{max})\) and 0 outside the closed interval \([\alpha_{min}, \alpha_{max}]\). It is undefined at the two threshold points, where the left and right slopes differ (0 versus \(m\)); consequently, the function does not belong to \(C^{\infty}\).

The behavior of the derivative across regions is illustrated in Figure 2.5.

Figure 2.5: Piecewise Linear function (derivative).

2.0.4 Threshold (unit Heaviside, binary, step) function

The function.

The threshold function is one of the earliest and most intuitive activation mechanisms. It is also known as the unit Heaviside function, binary function, or step function (Batres-Estrada, 2015; Osher and Fedkiw, 2003; Cox, 1992), and is defined in Equation (2.5).

In econometric language, this function resembles a dummy variable, which may be used alone or combined with other terms. In neural networks, its usefulness lies in its filtering capacity: it decides whether an input signal is strong enough to activate the neuron. In that sense, it behaves somewhat like a gate that alters the final prediction by changing whether the signal is passed forward.

The threshold function is

\[ \begin{equation} f(x) = \left\{ \begin{array}{cc} 1, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0. \end{array} \right. \tag{2.5} \end{equation} \]

The function is discontinuous at \(x=0\) and monotonic nondecreasing. Its range is \(\{0,1\}\). This discontinuous behavior is illustrated in Figure 2.6, where the abrupt jump at the origin can be clearly observed.

Figure 2.6: Threshold (Heaviside) function illustrating its discontinuity at x = 0.

Derivative.

The derivative of the threshold function is not defined at \(x=0\) due to the discontinuity.

For all \(x \neq 0\), the function is locally constant on each side of the origin, and therefore its derivative is zero:

\[ f'(x) = 0 \quad \text{for } x \neq 0. \]

Thus, from a classical perspective, the function is not differentiable at the origin.

In more advanced settings, such as distribution theory, the derivative of the Heaviside function can be represented by the Dirac delta function. However, this interpretation goes beyond the scope of standard neural network models.

In practice, the threshold function is unsuitable for gradient-based optimization methods: its derivative is zero wherever it exists, so no error signal can propagate through it, and it is undefined at the origin. For this reason, it has been largely replaced by smooth or piecewise-linear activation functions.

2.0.5 Sigmoid function

The function.

The sigmoid function is among the most widely used activation functions in neural networks, especially in classification settings (Friedman et al., 2001; Batres-Estrada, 2015). Its main role is to transform inputs smoothly into values between 0 and 1, making it particularly suitable for probabilistic interpretation of outputs.

Its general form is given by Equation (2.6) and illustrated in Figure 2.7:

\[ \begin{equation} f(x) = \frac{1}{1 + e^{- \alpha x}} \tag{2.6} \end{equation} \]

It has domain \(\mathbb{R}\) and range \((0,1)\); it belongs to \(C^{\infty}\) and, for \(\alpha > 0\), is strictly increasing on \(\mathbb{R}\).

Historically, the sigmoid was introduced by Verhulst (1838) in the study of population growth and later became known as the logistic function or logistic curve (Verhulst, 1977). In econometrics, when \(\alpha = 1\), it appears as the standard logistic link used for dichotomous outcomes (Llinás, 2026). Because of its smoothness and its point symmetry about \((0, 1/2)\), it has also been associated with more stable convergence in backpropagation procedures (Haykin, 2001; LeCun et al., 2012).

Figure 2.7 shows the sigmoid function for different values of \(\alpha\), illustrating how this parameter controls the steepness of the transition.

Figure 2.7: Sigmoid function.

Derivative.

The derivative of the sigmoid function, given in Equation (2.7), can be expressed directly as a function of \(f(x)\) itself.

\[ \begin{equation} f'(x)=\alpha\,f(x)\bigl(1-f(x)\bigr) \tag{2.7} \end{equation} \]

In particular, once \(f(x)\) has been computed in the forward pass, the derivative follows from simple arithmetic on that value, with no further exponential evaluations during backpropagation.

Its behavior is illustrated in Figure 2.8, where the derivative attains its maximum at the origin and decreases symmetrically as \(|x|\) increases for different values of \(\alpha\).

Figure 2.8: Sigmoid function (derivative).
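As a numerical sanity check of Equation (2.7), the closed-form derivative can be compared with a central finite difference of Equation (2.6). The sketch below is illustrative: the function names, the step size, the sample points, and the value \(\alpha = 2\) are assumptions, and the plain `math.exp` call is not protected against overflow for very large negative \(\alpha x\).

```python
import math

def sigmoid(x, alpha=1.0):
    """f(x) = 1 / (1 + exp(-alpha * x)), Equation (2.6)."""
    return 1.0 / (1.0 + math.exp(-alpha * x))

def sigmoid_derivative_from_output(f_x, alpha=1.0):
    """f'(x) = alpha * f(x) * (1 - f(x)), Equation (2.7), using only f(x)."""
    return alpha * f_x * (1.0 - f_x)

def central_difference(f, x, h=1e-6):
    """Second-order finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

alpha = 2.0  # illustrative steepness (assumption)
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    closed_form = sigmoid_derivative_from_output(sigmoid(x, alpha), alpha)
    numeric = central_difference(lambda t: sigmoid(t, alpha), x)
    print(x, closed_form, numeric)
```

At the origin the closed form gives \(\alpha/4\), which is the peak value of the derivative for \(\alpha > 0\).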

2.0.6 Bipolar sigmoid function

The function.

The bipolar sigmoid is closely related to the standard sigmoid, but its output range is shifted to \((-1,1)\) instead of \((0,1)\). Because of this, it is not directly suitable for probability estimation, although it can be advantageous in other learning contexts. Panicker and Babu (2012) report that bipolar sigmoid functions may perform more efficiently than other sigmoidal variants in some settings.

Its form is given by Equation (2.8) and illustrated in Figure 2.9:

\[ \begin{equation} f(x) = \frac{1 - e^{- \alpha x}}{1 + e^{- \alpha x}} \tag{2.8} \end{equation} \]

The function has domain \(\mathbb{R}\) and range \((-1,1)\); it belongs to \(C^{\infty}\) and is strictly increasing.

Figure 2.9 shows the bipolar sigmoid for different values of \(\alpha\), illustrating how this parameter controls the steepness of the transition.

Figure 2.9: Bipolar Sigmoid function.

This function can be written as a scaled and shifted version of the standard sigmoid:

\[ f(x) = 2\,\sigma(\alpha x) - 1, \]

where \(\sigma(x)\) denotes the standard sigmoid function defined as

\[ \begin{equation} \sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.9} \end{equation} \]

Derivative.

The derivative of the bipolar sigmoid function, given in Equation (2.10), can be expressed directly as a function of \(f(x)\) itself.

\[ \begin{equation} f'(x)=\frac{\alpha}{2}\left(1-f(x)^2\right) \tag{2.10} \end{equation} \]

Its behavior is illustrated in Figure 2.10, where the derivative attains its maximum at the origin and decreases symmetrically as \(|x|\) increases for different values of \(\alpha\).

Figure 2.10: Bipolar Sigmoid function (derivative).

This function is closely related to the hyperbolic tangent, which will be discussed later. In fact, it coincides with the hyperbolic tangent evaluated at a rescaled input, and it therefore provides a zero-centered alternative to the standard sigmoid:

\[ f(x) = \tanh\left(\frac{\alpha x}{2}\right) \]
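The three expressions for the bipolar sigmoid, namely Equation (2.8), the form \(2\,\sigma(\alpha x) - 1\), and \(\tanh(\alpha x / 2)\), can be checked against each other numerically. The sketch below is an illustration under assumed sample points and an assumed value of \(\alpha\); the function names are hypothetical.

```python
import math

def bipolar_sigmoid(x, alpha=1.0):
    """Equation (2.8): (1 - exp(-alpha x)) / (1 + exp(-alpha x))."""
    e = math.exp(-alpha * x)
    return (1.0 - e) / (1.0 + e)

def standard_sigmoid(x):
    """Equation (2.9): 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid_derivative(x, alpha=1.0):
    """Equation (2.10): (alpha / 2) * (1 - f(x)^2)."""
    f_x = bipolar_sigmoid(x, alpha)
    return 0.5 * alpha * (1.0 - f_x * f_x)

alpha = 1.5  # illustrative value (assumption)
for x in (-4.0, -1.0, 0.0, 0.5, 3.0):
    direct = bipolar_sigmoid(x, alpha)
    via_sigmoid = 2.0 * standard_sigmoid(alpha * x) - 1.0
    via_tanh = math.tanh(alpha * x / 2.0)
    print(x, direct, via_sigmoid, via_tanh)

print("derivative at 0:", bipolar_sigmoid_derivative(0.0, alpha))
```

The three columns agree up to floating-point rounding, and the derivative at the origin equals \(\alpha/2\), as Equation (2.10) predicts when \(f(0) = 0\).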

2.0.7 Rectified linear unit (ReLU)

The function.

The Rectified Linear Unit (ReLU) is a more recent activation function that has become extremely influential in modern deep learning. In the literature, it is also related to the ramp function and other advanced rectifiers (Nair and Hinton, 2010). Although simple in form, it can mimic the effect of multiple sigmoidal units while using the same learned weights and biases (Batres-Estrada, 2015).

A general representation for the ReLU family (including LReLU and PReLU) is given by

\[ \begin{equation} f(x) = \left\{ \begin{array}{lc} \alpha x, & \text{if } x<0, \\ x, & \text{if } x \geq 0. \end{array} \right. \tag{2.11} \end{equation} \]

Standard ReLU corresponds to \(\alpha=0\):

\[ \begin{equation} f(x) = \left\{ \begin{array}{lc} 0, & \text{if } x < 0, \\ x, & \text{if } x \geq 0. \end{array} \right. \tag{2.12} \end{equation} \]

The function is continuous on \(\mathbb{R}\), but it is not differentiable at \(x=0\) unless \(\alpha = 1\), in which case it reduces to the identity. It is monotonic nondecreasing when \(\alpha \ge 0\). Its behavior for different values of \(\alpha\) is illustrated in Figure 2.11.

Figure 2.11: Rectified Linear Unit (ReLU) Family.

Derivative.

The derivative of the function defined in Equation (2.11) is given by

\[ \begin{equation} f'(x) = \left\{ \begin{array}{lc} \alpha, & \text{if } x<0, \\ 1, & \text{if } x>0. \end{array} \right. \tag{2.13} \end{equation} \]

The derivative is not defined at \(x=0\). Its behavior across different values of \(\alpha\) is illustrated in Figure 2.12.

Figure 2.12: Rectified Linear Unit (ReLU) Family (derivative).
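Equations (2.11) and (2.13) can be written as two short functions. This is a minimal sketch with hypothetical function names. Returning `None` at the kink is a choice made here to mirror the mathematical statement that the derivative is undefined there. Practical implementations usually pick a conventional value instead.

```python
def relu_family(x, alpha=0.0):
    """Equation (2.11): alpha * x for x < 0, and x for x >= 0.

    alpha = 0 gives the standard ReLU of Equation (2.12).
    """
    return x if x >= 0 else alpha * x

def relu_family_derivative(x, alpha=0.0):
    """Equation (2.13); returns None at x = 0, where the derivative is undefined.

    The only exception is alpha = 1, where the function is the identity.
    """
    if x > 0:
        return 1.0
    if x < 0:
        return alpha
    return 1.0 if alpha == 1.0 else None

# Standard ReLU (alpha = 0) and a leaky variant (alpha = 0.01).
for x in (-2.0, 0.0, 3.0):
    print(x, relu_family(x), relu_family(x, alpha=0.01),
          relu_family_derivative(x), relu_family_derivative(x, alpha=0.01))
```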

2.0.8 Variants of ReLU (LReLU, PReLU, RReLU)

As Equation (2.11) suggests, the parameter \(\alpha\) describes the broader ReLU family, of which the standard ReLU is the special case \(\alpha = 0\); its effect on the shape of the function was shown in Figure 2.11. In large networks, its behavior may interact with other activation functions such as tanh, softplus, sinusoidal, sigmoid, and Gaussian functions. For \(\alpha > 0\), the function mitigates the “dying ReLU” problem by allowing a nonzero gradient for negative inputs.

This family was introduced to accelerate learning and extend the advantages of simpler linear models into deeper nonlinear settings (Nair and Hinton, 2010; Goodfellow et al., 2016).

Leaky rectified linear unit (LReLU).

The first member of the ReLU family is the Leaky ReLU (LReLU) (Maas et al., 2013). It corresponds to Equation (2.11) with a small fixed slope, typically \(\alpha = 0.01\). Unlike the standard ReLU, it allows a small nonzero slope for negative inputs, reducing the risk of inactive neurons. It has domain \(\mathbb{R}\) and range \(\mathbb{R}\), is continuous on \(\mathbb{R}\), and is strictly increasing because \(\alpha > 0\). It is shown in Figure 2.11.

Parametric rectified linear unit (PReLU).

The second member is the Parametric ReLU (PReLU), which uses the same form as Equation (2.11) but treats \(\alpha\) as a free parameter that is learned from the data. It has domain \(\mathbb{R}\), is continuous on \(\mathbb{R}\), and is monotonic nondecreasing when \(\alpha \ge 0\). Its range is \(\mathbb{R}\) when \(\alpha > 0\) and \([0,\infty)\) when \(\alpha \le 0\). It is also shown in Figure 2.11.

Randomized rectified linear unit (RReLU).

The Randomized ReLU (RReLU) uses the form of Equation (2.11) but draws \(\alpha\) at random from a prescribed interval instead of fixing or learning it (Xu et al., 2015). One limitation is that, during backpropagation, its distinction from PReLU may become less transparent at the derivative level. It has domain \(\mathbb{R}\), is continuous on \(\mathbb{R}\), and, when the interval contains only positive values, has range \(\mathbb{R}\) and is strictly increasing. It is illustrated in Figure 2.11.
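The random choice of \(\alpha\) in RReLU can be sketched as follows. Several details here are assumptions rather than statements from the text: the uniform sampling distribution, the interval bounds `lower` and `upper`, the function name, and the convention of replacing the random slope by the midpoint of the interval outside training.

```python
import random

def rrelu(x, lower, upper, training=True, rng=random):
    """Randomized ReLU sketch based on Equation (2.11).

    Assumptions: alpha ~ Uniform(lower, upper) while training, and the
    interval midpoint is used as a fixed slope otherwise.
    """
    if not 0.0 <= lower <= upper:
        raise ValueError("expected 0 <= lower <= upper")
    if x >= 0:
        return x
    if training:
        alpha = rng.uniform(lower, upper)
    else:
        alpha = (lower + upper) / 2.0
    return alpha * x

# Illustrative interval (assumption) and a seeded generator for repeatability.
generator = random.Random(0)
samples = [rrelu(-1.0, 0.1, 0.3, training=True, rng=generator) for _ in range(5)]
print("training samples for x = -1:", samples)
print("evaluation value for x = -1:", rrelu(-1.0, 0.1, 0.3, training=False))
```

Each training call draws a fresh slope, so repeated evaluations at the same negative input generally differ, whereas the evaluation branch is deterministic.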

2.0.9 Exponential linear unit (ELU)

The function.

The Exponential Linear Unit (ELU) is another member of the broader rectifier family, but its negative branch differs substantially from ReLU and its variants (Nair and Hinton, 2010). It has been shown to outperform standard ReLU in several settings (Clevert et al., 2015; Trottier et al., 2016), particularly because it pushes the mean activation closer to zero and may reduce training time in tasks such as computer vision.

Its definition is given by Equation (2.14):

\[ \begin{equation} f(x)= \left\{ \begin{array}{ll} \alpha(e^x-1), & \text{if } x<0,\\ x, & \text{if } x\ge 0. \end{array} \right. \tag{2.14} \end{equation} \]

The parameter \(\alpha\) controls the saturation level for negative inputs. Figure 2.13 illustrates the ELU function for different parameter values, together with its scaled variant (SELU), which will be described in the following section.

Figure 2.13: Exponential Linear Unit (ELU) and Scaled Exponential Linear Unit (SELU).

The function has domain \(\mathbb{R}\) and, for \(\alpha > 0\), range \((-\alpha,+\infty)\). It is continuous on \(\mathbb{R}\) and is continuously differentiable if and only if \(\alpha = 1\). It is strictly increasing if \(\alpha > 0\) and nondecreasing if \(\alpha = 0\). As shown in Figure 2.13, as \(x \to -\infty\), the function approaches the asymptote \(-\alpha\).

Derivative.

The derivative of the ELU function, given in Equation (2.15), is

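\[ \begin{equation} f'(x)= \left\{ \begin{array}{ll} \alpha e^x, & \text{if } x<0,\\ 1, & \text{if } x> 0. \end{array} \right. \tag{2.15} \end{equation} \]

At \(x=0\), the one-sided derivatives are \(\alpha\) from the left and 1 from the right, so the derivative exists there (and equals 1) only when \(\alpha = 1\). The derivative is monotonic nondecreasing for \(\alpha \in [0,1]\). Its behavior is illustrated in Figure 2.14.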

Figure 2.14: ELU and SELU (derivatives).
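The role of \(\alpha\) in the smoothness of ELU at the origin can be seen by evaluating the derivative just to the left of zero and at zero. The sketch below is illustrative: the function names, the sample values of \(\alpha\), and the offset `eps` are assumptions, and the derivative function returns the right-hand value 1 at \(x = 0\) purely as a convention.

```python
import math

def elu(x, alpha=1.0):
    """Equation (2.14); math.expm1(x) computes exp(x) - 1 accurately near 0."""
    return x if x >= 0 else alpha * math.expm1(x)

def elu_derivative(x, alpha=1.0):
    """Equation (2.15); at x = 0 the right-hand value 1 is returned by convention."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

eps = 1e-9  # small offset used to probe the left-hand side of the origin
for alpha in (0.5, 1.0, 2.0):
    left = elu_derivative(-eps, alpha)    # close to alpha
    right = elu_derivative(0.0, alpha)    # equal to 1
    print("alpha =", alpha, "left slope =", left, "right slope =", right)

# Saturation: for large negative inputs the output approaches -alpha.
print(elu(-1000.0, alpha=2.0))
```

The left and right slopes agree only for \(\alpha = 1\), which matches the continuous-differentiability condition stated for this function.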

2.0.10 Scaled exponential linear unit (SELU)

The function.

The Scaled Exponential Linear Unit (SELU) extends ELU by introducing a scaling factor \(\lambda\) applied to the entire function. Here, \(\alpha\) retains the same role as in ELU, controlling the saturation for negative inputs, while \(\lambda\) scales the overall output. This parameter is typically fixed to values that ensure self-normalizing properties in neural networks (Klambauer et al., 2017).

This function enables recursive normalization within the network and helps mitigate vanishing-gradient issues (Clevert et al., 2015). SELU is shown in Figure 2.13 and defined by Equation (2.16):

\[ \begin{equation} f(x)=\lambda \left\{ \begin{array}{ll} \alpha(e^x-1), & \text{if } x<0,\\ x, & \text{if } x\ge 0. \end{array} \right. \tag{2.16} \end{equation} \]

The parameters \(\alpha\) and \(\lambda\) are typically set to fixed values (e.g., \(\alpha \approx 1.673\) and \(\lambda \approx 1.051\)) to ensure self-normalizing properties in deep neural networks. The function has domain \(\mathbb{R}\) and range \((-\lambda\alpha,+\infty)\).

Derivative.

The derivative of the SELU function follows directly from Equation (2.16):

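\[ \begin{equation} f'(x)=\lambda \left\{ \begin{array}{ll} \alpha e^x, & \text{if } x<0,\\ 1, & \text{if } x> 0. \end{array} \right. \tag{2.17} \end{equation} \]

At \(x=0\), the one-sided derivatives are \(\lambda\alpha\) from the left and \(\lambda\) from the right, so the derivative exists there only when \(\alpha = 1\).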
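Equations (2.16) and (2.17) can be sketched in the same way as ELU. The constants below are the rounded values quoted in the text (\(\alpha \approx 1.673\), \(\lambda \approx 1.051\)), not higher-precision ones; the function names are hypothetical, and the derivative returns the right-hand value \(\lambda\) at \(x = 0\) as a convention.

```python
import math

# Rounded constants as quoted in the text (approximations).
SELU_ALPHA = 1.673
SELU_LAMBDA = 1.051

def selu(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    """Equation (2.16): lambda times the ELU expression."""
    return lam * (x if x >= 0 else alpha * math.expm1(x))

def selu_derivative(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    """Equation (2.17); at x = 0 the right-hand value lambda is returned by convention."""
    return lam * (1.0 if x >= 0 else alpha * math.exp(x))

print("selu(2)   =", selu(2.0))
print("selu(-50) =", selu(-50.0), "(close to -lambda * alpha)")
print("slope just left of 0 =", selu_derivative(-1e-9))
print("slope at 0 (right)   =", selu_derivative(0.0))
```

With these constants the left-hand slope at the origin is about \(\lambda\alpha\) and the right-hand slope is \(\lambda\), so the derivative has a jump at zero.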

2.0.11 SoftMax function

The function.

As noted earlier, when the output variable is dichotomous, the sigmoid or logistic function is appropriate for computing probabilities. When the output consists of multiple discrete classes, however, the corresponding generalization is the multinomial logistic function (Llinás, 2026), which in machine learning is commonly implemented as the SoftMax function (Friedman et al., 2001).

SoftMax transforms a vector of raw scores into a vector of probabilities across multiple classes. Its standard form is

\[ \begin{equation} f_r(\mathbf{x})=\frac{e^{x_r}}{\sum\limits_{s=1}^{R} e^{x_s}} \tag{2.18} \end{equation} \]

where \(r=1,\ldots,R\), and \(R\) denotes the total number of classes. Its range is \((0,1)\) for each component, and the outputs satisfy

\[ \sum_{r=1}^{R} f_r(\mathbf{x}) = 1. \]

Thus, SoftMax converts arbitrary real-valued scores into a valid probability distribution over the classes. The function is infinitely differentiable, that is, it belongs to \(C^{\infty}\).
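Equation (2.18) can be sketched as follows. One detail is an addition not discussed in the text: the maximum score is subtracted before exponentiation. This leaves the result unchanged, because the common factor cancels between numerator and denominator, and it avoids overflow for large scores. The function name and the example scores are assumptions.

```python
import math

def softmax(scores):
    """Equation (2.18), computed with the scores shifted by their maximum.

    The shift is a numerical safeguard (an addition to the text's formula);
    mathematically, softmax(x + c) equals softmax(x) for any constant c.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for R = 3 classes (assumption).
p = softmax([2.0, 1.0, 0.1])
print("probabilities:", p)
print("sum:", sum(p))

# The same scores shifted by 1000 would overflow without the safeguard.
q = softmax([1002.0, 1001.0, 1000.1])
print("shifted scores:", q)
```

The two calls return essentially the same probabilities, which illustrates that only differences between scores matter.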

SoftMax is widely used in artificial neural networks, especially in multiclass classification problems. It is also closely related to multinomial logistic regression. A key statistical requirement is that the output categories should be mutually exclusive, so that each observation belongs to one and only one class.

SoftMax probabilities for three classes.

Figure 2.15: SoftMax output probabilities for three classes as a function of a single input score x. The three probabilities are coupled and always sum to 1.

Figure 2.15 illustrates the behavior of the SoftMax function for three classes. As the input score increases, the probability associated with one class increases while the probabilities of the remaining classes decrease.

This behavior reflects the competitive nature of SoftMax: the outputs are not independent, since they are constrained to sum to one. In particular, when one class becomes more likely, the others are automatically penalized.

The central curve (Class 2) reaches its maximum when its associated score dominates relative to the others, while the extreme classes dominate when the input strongly favors one direction.

Unlike scalar activation functions such as the sigmoid, where each output depends only on its own input, SoftMax introduces interactions among outputs. This coupling is essential in multiclass classification, as it ensures that predictions form a coherent probability distribution. It also implies that the gradient for one class cannot be computed independently of the others, which is a fundamental difference from element-wise activation functions.

This structure is particularly important during training, as it ensures that increasing the probability of the correct class necessarily decreases the probabilities of the competing classes.

Derivative.

Unlike scalar activation functions, SoftMax maps a vector into another vector, so its derivative is expressed through partial derivatives. For each pair of components \(r\) and \(s\),

\[ \frac{\partial f_r}{\partial x_s}= f_r(\mathbf{x})\left(\delta_{rs}-f_s(\mathbf{x})\right), \]

where \(\delta_{rs}\) is the Kronecker delta, equal to 1 when \(r=s\) and 0 otherwise.

Equivalently,

\[ \frac{\partial f_r}{\partial x_s}= \begin{cases} f_r(\mathbf{x})\left(1-f_r(\mathbf{x})\right), & \text{if } r=s,\\[6pt] -f_r(\mathbf{x})f_s(\mathbf{x}), & \text{if } r\neq s. \end{cases} \]

This expression shows that the derivative of each output depends not only on its own input component, but also on the other components of the vector. For this reason, the derivative of SoftMax is naturally represented by a Jacobian matrix rather than by a single scalar derivative.
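The Jacobian described above can be assembled entry by entry from the formula \(f_r(\delta_{rs} - f_s)\). The sketch below is illustrative: the function names and the example scores are assumptions, and the `softmax` helper is repeated so that the block is self-contained.

```python
import math

def softmax(scores):
    """SoftMax of Equation (2.18), with a max shift for numerical safety."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(scores):
    """Matrix J with J[r][s] = f_r * (delta_rs - f_s)."""
    p = softmax(scores)
    n = len(p)
    return [[p[r] * ((1.0 if r == s else 0.0) - p[s]) for s in range(n)]
            for r in range(n)]

# Illustrative scores for three classes (assumption).
J = softmax_jacobian([2.0, 1.0, 0.1])
for row in J:
    print(row, "row sum:", sum(row))
```

Each row sums to zero up to rounding, because the outputs always sum to one: any gain in one probability is offset by losses in the others. The diagonal entries are positive and the off-diagonal entries are negative, consistent with the signs described for the partial derivatives.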

Figure 2.16: Selected partial derivatives of the SoftMax function for three classes. The diagonal term is positive, whereas the off-diagonal terms are negative, reflecting the competition among classes.

Figure 2.16 shows selected partial derivatives of the SoftMax function for Class 1.

The diagonal term (Class 1 with respect to itself) is positive, indicating that increasing its own score increases its probability. In contrast, the off-diagonal terms (cross effects) are negative, showing that increasing the score of competing classes reduces the probability assigned to Class 1.

This illustrates a key property of SoftMax: learning is inherently competitive. The model does not adjust each class independently; instead, increasing one class necessarily decreases others. This interaction is captured by the Jacobian structure of the derivative.

2.0.12 Sign activation (signum) function

The function.

The sign activation function, also known as the signum function, is defined in Equation (2.19) and displayed in Figure 2.17:

\[ \begin{equation} f(x) = \left\{ \begin{array}{lc} -1, & \text{if } x<0, \\ 0, & \text{if } x=0, \\ 1, & \text{if } x>0. \end{array} \right. \tag{2.19} \end{equation} \]

Its range is \(\{-1,0,1\}\). The function is discontinuous at \(x=0\) and monotonic nondecreasing. As shown in Figure 2.17, it assigns negative inputs to \(-1\), positive inputs to \(1\), and the origin to \(0\).

Figure 2.17: Sign activation (signum) function.

Derivative.

The sign function is constant on the intervals \((-\infty,0)\) and \((0,\infty)\), so its derivative is

\[ f'(x) = 0, \quad x \ne 0. \]

At \(x=0\), the function is not continuous and therefore is not differentiable.

Because its derivative is zero almost everywhere and undefined at the origin, the sign function is not suitable for gradient-based optimization methods such as backpropagation. For this reason, it is rarely used in modern deep learning, although it remains conceptually important in threshold-based models, binary decision rules, and related classification settings.

2.0.13 Maxout function

The function.

Hinton et al. (2012) introduced Dropout as a strategy for improving generalization in neural networks, and this later motivated the development of the Maxout activation by Goodfellow et al. (2013). In a Maxout neuron, the activation is defined as the maximum over a set of affine transformations of the input, which makes the function especially compatible with Dropout regularization.

Its general form is

\[ \begin{equation} f(\mathbf{x})=\max_{i=1,\ldots,k}\left(\mathbf{w}_i^{\top}\mathbf{x}+b_i\right) \tag{2.20} \end{equation} \]

where each \(\mathbf{w}_i^{\top}\mathbf{x}+b_i\) is an affine component, and \(k\) denotes the number of such components. The function is continuous and piecewise linear, although it is not differentiable at the transition points where two or more components attain the same maximum value.

Goodfellow et al. (2013) reported that Maxout units, particularly when combined with Dropout, achieved strong empirical performance on benchmark datasets such as MNIST.

To illustrate the idea in one dimension, consider the three affine components

\[ z_1(x)=x, \qquad z_2(x)=-x, \qquad z_3(x)=0.5x-1. \]

In this case, the Maxout function becomes

\[ f(x)=\max\{x,\,-x,\,0.5x-1\}. \]

Figure 2.18 illustrates this example. Each dashed line represents one affine component, while the solid curve corresponds to their pointwise maximum. This representation highlights that Maxout acts as the upper envelope of several linear functions. Different components become active in different regions of the input space, and the switching points between them are precisely the points where the function is not differentiable.

Figure 2.18: Maxout activation as the maximum of several linear functions.

Derivative.

The derivative of the Maxout function depends on which affine component attains the maximum value.

For the general form in Equation (2.20), let

\[ k^{*} = \arg\max_{i=1,\ldots,k} \left( \mathbf{w}_i^{\top}\mathbf{x} + b_i \right). \]

Whenever the maximum is attained by a unique component, the gradient with respect to the input vector is

\[ \nabla f(\mathbf{x}) = \mathbf{w}_{k^{*}}. \]

Thus, the gradient of the Maxout unit is simply the gradient of the active affine component. At points where two or more components share the maximum value, the function is not differentiable. In practice, subgradient methods are used, and one of the maximizing components is selected.
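
The rule above can be sketched in a few lines of Python. The names `maxout` and `maxout_grad`, and the tie-breaking choice of taking the first maximizing component, are assumptions made for illustration; any maximizing component would give a valid subgradient.

```python
def maxout(x, weights, biases):
    """Return (value, index of the active affine component)."""
    values = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    # max() returns the first maximizer, which settles ties deterministically
    k_star = max(range(len(values)), key=values.__getitem__)
    return values[k_star], k_star

def maxout_grad(x, weights, biases):
    """(Sub)gradient with respect to x: weight vector of the active component."""
    _, k_star = maxout(x, weights, biases)
    return list(weights[k_star])

# One-dimensional example from the text: z1 = x, z2 = -x, z3 = 0.5x - 1
W = [[1.0], [-1.0], [0.5]]
B = [0.0, 0.0, -1.0]

for x in (-3.0, -0.5, 0.0, 2.0):
    value, active = maxout([x], W, B)
    print(x, value, active, maxout_grad([x], W, B))
```

For this one-dimensional example the third component is never selected, so the returned value equals \(|x|\) at every test point.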

For the one-dimensional example shown above,

\[ f(x)=\max\{x,\,-x,\,0.5x-1\}, \]

the third affine component never attains the maximum. Therefore, in this case,

\[ f(x)=|x|, \]

and its derivative is

\[ f'(x)= \begin{cases} -1, & x<0,\\[6pt] 1, & x>0. \end{cases} \]

At \(x=0\), the function is not differentiable because the active component changes abruptly.

Figure 2.19 shows this derivative for the illustrative one-dimensional case. The jump occurs at \(x=0\), the only switching point of this example, where the active component changes from \(-x\) to \(x\) and the Maxout function is not differentiable.

Figure 2.19: Derivative of the illustrative one-dimensional Maxout example f(x)=max{x,-x,0.5x-1}. In this case, the third component never becomes active, so the derivative coincides with that of |x|.

2.0.14 Softsign function

The function.

Some activation functions are useful because they act as smooth compromises between other well-known functions. The softsign is one such example. It resembles a smoothed version of the sign function and has a shape similar to the hyperbolic tangent, although its saturation is more gradual (Aghdam and Heravi, 2017). Near the origin, it behaves approximately like the identity function.

Its form is given by Equation (2.21):

\[ \begin{equation} f(x)=\frac{x}{1+|x|} \tag{2.21} \end{equation} \]

Its domain is \(\mathbb{R}\) and its range is \((-1,1)\). The function is continuous and continuously differentiable on \(\mathbb{R}\), and monotonic increasing.

As shown in Figure 2.20, the softsign function (green curve) exhibits a smoother transition than tanh, with slower saturation toward its asymptotic values.

Derivative.

The derivative is given by

\[ \begin{equation} f'(x)=\frac{1}{(1+|x|)^2} \tag{2.22} \end{equation} \]

This expression holds for all \(x\), including \(x=0\), where \(f'(0)=1\).

Figure 2.21 shows that the derivative (green curve) reaches its maximum at the origin and decreases gradually as \(|x|\) increases, reflecting the mild saturation behavior characteristic of softsign.

2.0.15 Elliot function

The function.

The Elliot function (Elliott, 1993) is a sigmoidal activation that maps its input into the interval \((0,1)\). Because it involves only an absolute value, an addition, and a division, it avoids the exponential evaluation required by the logistic sigmoid, which makes it a computationally cheaper alternative.

Its form is given by Equation (2.23):

\[ \begin{equation} f(x)=\frac{0.5x}{1+|x|}+0.5 \tag{2.23} \end{equation} \]

Its domain is \(\mathbb{R}\) and its range is \((0,1)\). The function is continuous and continuously differentiable on \(\mathbb{R}\), and monotonic increasing.

As shown in Figure 2.20, the Elliot function (red curve) resembles a shifted sigmoid, but with a simpler algebraic form.

Derivative.

The derivative is given by

\[ \begin{equation} f'(x)=\frac{0.5}{(1+|x|)^2} \tag{2.24} \end{equation} \]

As illustrated in Figure 2.21, the derivative (red curve) is symmetric around the origin and decreases toward zero as \(|x|\) increases, similarly to other sigmoidal functions but with reduced curvature.
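
The following minimal sketch implements Equations (2.21) to (2.24) and compares the closed-form derivatives with a central finite difference. The function names and the helper `central_difference` are illustrative assumptions. It also makes explicit that the Elliot function is the softsign rescaled by one half and shifted by one half.

```python
def softsign(x):
    return x / (1.0 + abs(x))

def softsign_prime(x):
    return 1.0 / (1.0 + abs(x)) ** 2

def elliot(x):
    return 0.5 * x / (1.0 + abs(x)) + 0.5

def elliot_prime(x):
    return 0.5 / (1.0 + abs(x)) ** 2

def central_difference(f, x, h=1e-6):
    """Symmetric finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -0.5, 0.7, 4.0):
    print(x,
          softsign_prime(x), central_difference(softsign, x),
          elliot_prime(x), central_difference(elliot, x))
```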

2.0.16 Hyperbolic tangent (tanh) function

The function.

The hyperbolic tangent, or tanh, is a classical activation function widely used in neural networks. It can be interpreted as a centered and rescaled version of the sigmoid function.

Its main expression is given by Equation (2.25):

\[ \begin{equation} f(x)=\tanh(x)=\frac{2}{1+e^{-2x}}-1 \tag{2.25} \end{equation} \]

Its range is \((-1,1)\), it belongs to \(C^{\infty}\), and it is monotonic increasing. It can also be written in terms of the sigmoid function:

\[ \begin{equation} \tanh(x) = 2 \cdot \sigma (2x) - 1, \tag{2.26} \end{equation} \]

where \(\sigma(x)\) is the logistic sigmoid,

\[ \begin{equation} \sigma(x) = \frac{e^x}{1 + e^x}. \tag{2.27} \end{equation} \]
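
The identity in Equation (2.26) can be checked numerically. The sketch below is only an illustration; the names `sigmoid` and `tanh_via_sigmoid` are assumptions, and the sigmoid is written in two branches simply to avoid overflow for large \(|x|\).

```python
import math

def sigmoid(x):
    """Logistic sigmoid, split into two branches to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh_via_sigmoid(x):
    """tanh(x) written as 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, tanh_via_sigmoid(x), math.tanh(x))
```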

As shown in Figure 2.20, the tanh function (blue curve) shares the unit slope of softsign at the origin, is twice as steep there as Elliot (slope \(0.5\)), and approaches its asymptotes \(\pm1\) much more rapidly than either of them approaches its own bounds.

Figure 2.20: Softsign, Hyperbolic Tangent and Elliot function.

Derivative.

The derivative is given by

\[ \begin{equation} f'(x)=1-\tanh^2(x)=\frac{1}{\cosh^2(x)} \tag{2.28} \end{equation} \]

As illustrated in Figure 2.21, the derivative (blue curve) attains its maximum at \(x=0\) and decays rapidly toward zero as \(|x|\) increases. This behavior reflects the saturation of the tanh function and explains why it may suffer from vanishing-gradient effects in deep neural networks.

Figure 2.21: Softsign, Hyperbolic Tangent and Elliot function (derivative).

Compared to softsign, tanh saturates more quickly, which results in faster gradient decay in extreme regions.

2.0.17 Arc tangent function

The function.

When the upper and lower bounds provided by tanh, sigmoid, or softsign are not appropriate, the arc tangent or arctan function offers another sigmoidal alternative (Aghdam and Heravi, 2017). Its output saturates symmetrically around the origin at \(\pm \frac{\pi}{2}\), making it useful as a normalizer.

Its form is shown in Figure 2.22 and defined by Equation (2.29):

\[ \begin{equation} f(x)=\arctan(x) \tag{2.29} \end{equation} \]

Its range is \(\left( - \frac{\pi}{2}, \frac{\pi}{2} \right)\), it belongs to \(C^{\infty}\), and it is strictly increasing. Unlike tanh, whose outputs lie in \((-1,1)\), the arc tangent maps inputs to \((-\pi/2, \pi/2)\), which may require additional scaling in practical neural network applications.

Figure 2.22: Arc Tangent function.

Derivative.

The derivative is given by

\[ \begin{equation} f'(x)=\frac{1}{1+x^2} \end{equation} \]

This derivative is positive for all \(x\), which confirms that the function is strictly increasing.

As illustrated in Figure 2.23, the derivative attains its maximum at \(x=0\) and decreases smoothly as \(|x|\) increases. The decay is only polynomial, of order \(1/x^2\), whereas the derivative of tanh decays exponentially; this milder saturation may help mitigate vanishing-gradient effects.
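
To make the difference in saturation concrete, the short sketch below tabulates the derivatives of tanh and arctan at a few points. The function names and the chosen evaluation points are illustrative assumptions.

```python
import math

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2

def arctan_prime(x):
    return 1.0 / (1.0 + x * x)

for x in (0.0, 1.0, 3.0, 5.0, 10.0):
    print(f"x={x:5.1f}  tanh'={tanh_prime(x):.3e}  arctan'={arctan_prime(x):.3e}")
```

Both derivatives equal 1 at the origin, but away from it the arctan derivative stays larger, since \(\cosh^2(x) \ge 1 + x^2\) for all \(x\).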

Figure 2.23: Arc Tangent function (derivative).

2.0.18 LeCun’s hyperbolic tangent function

The function.

LeCun proposed a scaled form of the hyperbolic tangent designed to improve learning dynamics in neural networks. Its graph appears in Figure 2.22, and its expression is

\[ \begin{equation} f(x)=1.7159\,\tanh\left(\frac{2}{3}x\right) \tag{2.30} \end{equation} \]

Its range is \((-1.7159,\,1.7159)\). It belongs to \(C^{\infty}\) and is strictly increasing. As shown in Figure 2.22, LeCun’s tanh (green curve) exhibits a steeper slope near the origin than the standard tanh, while preserving a symmetric saturation behavior with a larger output range.

Derivative.

The derivative is given by

\[ \begin{equation} f'(x)=1.7159\cdot \frac{2}{3}\left[1-\tanh^2\left(\frac{2}{3}x\right)\right] \end{equation} \]

This expression follows directly from the derivative of the standard tanh function via the chain rule.

As illustrated in Figure 2.23, the derivative (green curve) reaches its maximum at the origin and decreases as \(|x|\) increases. Compared to the standard tanh, this scaling leads to stronger gradients near zero, which can improve learning dynamics during training.
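
A small numerical sketch of Equation (2.30) and its derivative is given below. The constant names `A` and `S` and the function names are illustrative assumptions. Evaluating the slope at the origin gives \(1.7159 \cdot 2/3 \approx 1.144\), and evaluating the function at \(x=1\) gives a value very close to 1.

```python
import math

A = 1.7159       # output scale in Equation (2.30)
S = 2.0 / 3.0    # input scale in Equation (2.30)

def lecun_tanh(x):
    return A * math.tanh(S * x)

def lecun_tanh_prime(x):
    """Chain rule: A * S * (1 - tanh^2(S * x))."""
    return A * S * (1.0 - math.tanh(S * x) ** 2)

print("slope at 0:", lecun_tanh_prime(0.0))
print("f(1):      ", lecun_tanh(1.0))
print("f(-1):     ", lecun_tanh(-1.0))
```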

2.0.19 Complementary log-log function

The function.

The complementary log-log activation is the inverse of the complementary log-log link \(\ln(-\ln(1-p))\) and coincides with the cumulative distribution function of the (minimum) extreme-value (Gumbel) distribution. It is widely used in statistical modeling of hazard-type responses and resembles other sigmoidal activation functions.

Like the sigmoid function, it produces outputs between 0 and 1, but these outputs admit an interpretation in terms of hazard effects associated with reverse extreme-value errors. Gomes and Ludermir (2008) report that it may outperform logit and tanh activations in multilayer perceptrons when evaluated through mean squared error. The function, often abbreviated cloglog, is shown in Figure 2.24 and defined by Equation (2.31).

\[ \begin{equation} f(x)=1-\exp(-\exp(x)) \tag{2.31} \end{equation} \]

Its range is \((0,1)\), it belongs to \(C^{\infty}\), and it is strictly increasing.

Unlike the logistic sigmoid, the complementary log-log function is not symmetric around the origin. Instead, it exhibits a skewed transition, reaching its upper bound more rapidly than its lower bound.

In practice, the complementary log-log function is commonly used as a link function in classification models, particularly when modeling asymmetric response behavior.

Figure 2.24: Complementary Log-Log function.

Derivative.

The derivative is given by

\[ \begin{equation} f'(x)=\exp(x-\exp(x)) \tag{2.32} \end{equation} \]

As illustrated in Figure 2.25, the derivative is strictly positive for all \(x\), confirming that the function is strictly increasing. Unlike symmetric activations such as tanh or sigmoid, the derivative of the cloglog function is asymmetric, with a sharper decay for positive values of \(x\).

This asymmetry reflects the underlying extreme-value distribution and makes the function particularly suitable for modeling processes with asymmetric growth patterns or hazard-type behavior.
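
The asymmetry can be illustrated with a minimal sketch of Equations (2.31) and (2.32). The function names are assumptions, and `math.expm1` is used only to keep precision when \(e^x\) is very small. Since the exponent \(x-e^x\) has derivative \(1-e^x\), the derivative of cloglog peaks at \(x=0\), where \(f(0)=1-e^{-1}\) rather than \(0.5\).

```python
import math

def cloglog(x):
    """f(x) = 1 - exp(-exp(x)), computed as -expm1(-exp(x))."""
    return -math.expm1(-math.exp(x))

def cloglog_prime(x):
    """f'(x) = exp(x - exp(x))."""
    return math.exp(x - math.exp(x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:5.1f}  f={cloglog(x):.6f}  f'={cloglog_prime(x):.3e}")
```

The derivative at \(x=3\) is several orders of magnitude smaller than at \(x=-3\), which is the sharper decay on the positive side described above.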

Figure 2.25: Complementary Log-Log function (derivative).

2.0.20 Softplus function

The function.

The softplus function gained importance after Glorot et al. (2011) emphasized its relevance, and later work by Zheng et al. (2015) showed improvements in deep neural networks obtained through its use. Softplus can be understood as a smooth approximation of the ReLU activation that replaces the kink at zero with a gradual transition.

Its form is given by Equation (2.33):

\[ \begin{equation} f(x)=\ln(1+e^x) \tag{2.33} \end{equation} \]

Its range is \((0,+\infty)\), it belongs to \(C^{\infty}\), and it is strictly increasing, with a strictly positive derivative.

As shown in Figure 2.26, the softplus function exhibits a smooth transition from near-zero values for large negative inputs to an approximately linear behavior for large positive inputs, closely resembling the ReLU function without introducing non-differentiability at the origin.

Figure 2.26: Softplus function.

Derivative.

The derivative is given by

\[ \begin{equation} f'(x)=\frac{e^x}{1+e^x} \end{equation} \]

which corresponds exactly to the logistic sigmoid function (Llinás, 2026).

As illustrated in Figure 2.27, the derivative is bounded between 0 and 1 and increases smoothly with \(x\). For large negative values, the derivative approaches zero, while for large positive values it approaches one, reflecting the gradual transition from a flat to a linear regime.
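
The relation between softplus and the logistic sigmoid can be verified with a short sketch. The rearrangement \(\ln(1+e^x)=\max(x,0)+\ln(1+e^{-|x|})\) used here is a standard way to avoid overflow and is an implementation choice of this example, not something prescribed by the text; the function names are likewise assumptions.

```python
import math

def softplus(x):
    """ln(1 + e^x), rearranged so that exp() never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Logistic sigmoid, split into two branches to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def central_difference(f, x, h=1e-5):
    """Symmetric finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-4.0, -1.0, 0.0, 2.0, 6.0):
    print(x, central_difference(softplus, x), sigmoid(x))
```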

Figure 2.27: Softplus function (derivative).

2.0.21 Bent identity function

The function.

Although bent functions were originally defined in the 1960s, they were formally published by Rothaus (1976). Around the same period, related ideas were also used in Soviet cryptography by V.A. Eliseev and O.P. Stepchenkov (Tokareva, 2015). Bent functions are generally classified within the Boolean function family (Çeşmelioğlu et al., 2016; Savický, 1994).

The bent identity function is given by Equation (2.34) and displayed in Figure 2.28:

\[ \begin{equation} f(x)=\frac{\sqrt{x^2+1}-1}{2}+x \tag{2.34} \end{equation} \]

Its domain is \(\mathbb{R}\), it belongs to \(C^{\infty}\), it is strictly increasing, and its derivative is also strictly positive.

Bent identity is interesting because it smoothly adjusts the linear identity mapping by introducing a mild nonlinear correction. As shown in Figure 2.28, the function behaves almost linearly, while bending slightly upward for positive inputs and softening the slope for negative inputs.

Figure 2.28: Bent identity function.

Derivative.

The derivative is given by

\[ \begin{equation} f'(x)=\frac{x}{2\sqrt{x^2+1}}+1 \tag{2.35} \end{equation} \]

As illustrated in Figure 2.29, the derivative increases smoothly from approximately 0.5 for large negative values of \(x\) to approximately 1.5 for large positive values. At \(x=0\), the derivative equals 1. This behavior shows that the bent identity function remains close to a linear mapping, while introducing only a mild nonlinear adjustment through its curvature.
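
The limiting slopes mentioned above can be checked with a minimal sketch of Equations (2.34) and (2.35). The function names and the evaluation points are illustrative assumptions.

```python
import math

def bent_identity(x):
    return (math.sqrt(x * x + 1.0) - 1.0) / 2.0 + x

def bent_identity_prime(x):
    return x / (2.0 * math.sqrt(x * x + 1.0)) + 1.0

for x in (-1e6, -2.0, 0.0, 2.0, 1e6):
    print(f"x={x:>10}  f'={bent_identity_prime(x):.6f}")
```

The printed slopes stay strictly between 0.5 and 1.5, which is why the function remains strictly increasing.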

Figure 2.29: Bent Identity function (derivative).

2.0.22 Soft exponential function

The function.

The soft exponential function, defined in Equation (2.36), was proposed by Godfrey and Gashler (2015) as a parameterized activation that continuously interpolates among logarithmic, linear, and exponential behaviors.

\[ \begin{equation} f(\alpha,x)= \left\{ \begin{array}{ll} -\dfrac{\ln\!\left(1-\alpha(x+\alpha)\right)}{\alpha}, & \text{if } \alpha<0,\\[6pt] x, & \text{if } \alpha=0,\\[6pt] \dfrac{e^{\alpha x}-1}{\alpha}+\alpha, & \text{if } \alpha>0. \end{array} \right. \tag{2.36} \end{equation} \]

As shown in Figure 2.30, different values of \(\alpha\) produce distinct functional regimes: negative values yield logarithmic-like curves, \(\alpha=0\) recovers the identity function, and positive values lead to exponential growth.

For \(\alpha<0\), the logarithmic branch imposes a restriction on the admissible values of \(x\) through the condition \[ 1-\alpha(x+\alpha)>0, \] that is, \(x > 1/\alpha - \alpha\). Thus, the domain is all of \(\mathbb{R}\) only in the cases \(\alpha \ge 0\); wherever it is defined, the function is smooth and belongs to \(C^{\infty}\).

Figure 2.30: Soft exponential function for several values of the parameter \(\alpha\).

Derivative.

The derivative of the soft exponential function with respect to \(x\) is

\[ \begin{equation} \frac{\partial}{\partial x}f(\alpha,x)= \left\{ \begin{array}{ll} \dfrac{1}{1-\alpha(x+\alpha)}, & \text{if } \alpha<0,\\[6pt] 1, & \text{if } \alpha=0,\\[6pt] e^{\alpha x}, & \text{if } \alpha>0. \end{array} \right. \tag{2.37} \end{equation} \]

This expression shows that the derivative depends strongly on the value of \(\alpha\): it remains constant when \(\alpha=0\), follows a rational form when \(\alpha<0\), and grows exponentially when \(\alpha>0\).

Figure 2.31 illustrates these three regimes for several values of \(\alpha\), showing how the parameter controls the local sensitivity of the activation.
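
A minimal sketch of Equations (2.36) and (2.37) is given below. The function names, the helper `_log_argument`, and the decision to raise a `ValueError` outside the admissible domain are assumptions of this example rather than part of the original definition.

```python
import math

def _log_argument(alpha, x):
    """Argument of the logarithm in the alpha < 0 branch; must be positive."""
    arg = 1.0 - alpha * (x + alpha)
    if arg <= 0.0:
        raise ValueError("x lies outside the domain for this negative alpha")
    return arg

def soft_exponential(alpha, x):
    if alpha < 0:
        return -math.log(_log_argument(alpha, x)) / alpha
    if alpha == 0:
        return x
    return (math.exp(alpha * x) - 1.0) / alpha + alpha

def soft_exponential_prime(alpha, x):
    """Derivative with respect to x, as in Equation (2.37)."""
    if alpha < 0:
        return 1.0 / _log_argument(alpha, x)
    if alpha == 0:
        return 1.0
    return math.exp(alpha * x)

for alpha in (-0.5, 0.0, 0.5):
    print(alpha, soft_exponential(alpha, 1.0), soft_exponential_prime(alpha, 1.0))
```

For \(\alpha=-1\), for instance, the domain condition reduces to \(x>0\), so evaluating the function at \(x=-1\) raises the error.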

Figure 2.31: Derivatives of the soft exponential function for several values of the parameter \(\alpha\).

Godfrey and Gashler (2015) described the soft exponential activation as a unifying family that combines logarithmic, linear, and exponential transformations within a single parameterized expression. Because the function is smooth and adaptable, it can modify its curvature according to the learning scenario. However, like other monotonic activations, it preserves the ordering of inputs and therefore cannot directly represent oscillatory or cyclic relationships.

3 Comparative monotonic activation functions

Table 3.1 summarizes the main characteristics of the monotonic activation functions discussed in this section. It highlights how these functions differ in terms of smoothness, range, symmetry, and suitability for different learning tasks. In particular, modern deep learning architectures tend to favor functions that combine smoothness with non-saturating gradients, such as ReLU variants and Softplus, while more specialized functions (e.g., cloglog or soft exponential) are useful in contexts where asymmetry or adaptive curvature is required.

Table 3.1: Comparative monotonic functions.
Function	Expression	Range	Key properties	Typical use	Reference
Linear / Identity	\(\alpha x\)	\(( -\infty,\infty )\)	Linear	Baseline	Classical
Piecewise linear	Piecewise linear	\(( -\infty,\infty )\)	Simple nonlinearity	Simple models	Classical
Threshold (Heaviside)	\(\mathbf{1}_{\{x \ge 0\}}\)	\(\{0,1\}\)	Non-differentiable	Binary logic
Sigmoid	\(\frac{1}{1+e^{-x}}\)	\((0,1)\)	Smooth	Classification
Bipolar sigmoid	\(\frac{1-e^{-x}}{1+e^{-x}}\)	\((-1,1)\)	Zero-centered	Centered nets	Derived
ReLU	\(\max(0,x)\)	\([0,\infty)\)	Sparse	Deep learning
Leaky ReLU	\(\max(\alpha x, x)\)	\(( -\infty,\infty )\)	Nonzero slope for \(x<0\)	Improved ReLU
ELU	\(\max(0,x)+\min\left(0,\alpha (e^{x}-1)\right)\)	\(( -\alpha,\infty )\)	Smooth ReLU	Deep nets
SELU	\(\lambda\left[\max(0,x)+\min\left(0,\alpha (e^{x}-1)\right)\right]\)	\(( -\lambda\alpha,\infty )\)	Self-normalizing	Self-normalizing nets
Softmax	\(\frac{e^{x_r}}{\sum_s e^{x_s}}\)	\((0,1)\)	Multi-class	Output layer
Sign	\(\mathrm{sign}(x)\)	\(\{-1,0,1\}\)	Discrete	Binary output	Classical
Maxout	\(\max_i(\mathbf{w}_i^\top \mathbf{x}+b_i)\)	\(( -\infty,\infty )\)	Learnable	Adaptive models
Softsign	\(\frac{x}{1+\lvert x \rvert}\)	\((-1,1)\)	Smooth	Alternative sigmoid
Elliot	\(\frac{0.5x}{1+\lvert x \rvert}+0.5\)	\((0,1)\)	Fast approximation	Fast computation
tanh	\(\frac{2}{1+ e^{-2 x}}-1\)	\((-1,1)\)	Symmetric	Hidden layers	Classical
Arctan	\(\arctan(x)\)	\(\left(-\frac{\pi}{2},\frac{\pi}{2}\right)\)	Mild saturation	Alternative tanh	Classical
LeCun tanh	\(1.7159\,\tanh\!\left(\frac{2}{3}x\right)\)	\((-1.7159,1.7159)\)	Stronger gradients	Improved tanh
Cloglog	\(1-e^{-e^x}\)	\((0,1)\)	Asymmetric	Hazard models
Softplus	\(\ln(1+e^x)\)	\((0,\infty)\)	Smooth ReLU	Deep learning
Bent identity	\(\frac{\sqrt{x^2+1}-1}{2}+x\)	\(( -\infty,\infty )\)	Near-linear	Experimental	Recent
Soft exponential	\(f(\alpha,x)\)	Depends on \(\alpha\)	Adaptive	Adaptive activations

4 Periodic activation functions

4.0.1 Motivation

While monotonic activation functions have historically dominated neural network design due to their stability and well-behaved optimization properties, many real-world phenomena exhibit inherently oscillatory or cyclic behavior. Signals in physics, audio processing, time-series analysis, and spatial modeling often contain repeating patterns that cannot be adequately captured using strictly monotonic transformations.

Periodic activation functions address this limitation by producing oscillatory outputs. Rather than preserving the ordering of inputs, they allow the activation response to vary cyclically, enabling neural networks to model wave-like structures, harmonic relationships, and repeating local patterns.

These functions are particularly relevant in architectures designed for signal representation, implicit neural fields, and temporal dynamics. A defining characteristic of periodic activations is that their derivatives also oscillate. While this increases expressive power, it may introduce additional optimization challenges, such as sensitivity to initialization, learning rate, and training stability.

4.0.2 Classification of periodic activation functions

The periodic activation functions considered in this section can be naturally grouped into two main categories, depending on the nature of their oscillatory behavior.

Some of these functions (e.g., Fourier or wavelet-based representations) are not activation functions in the strict sense, but rather functional transformations that can be incorporated into neural architectures to capture periodic structure.

Sinusoidal functions

Sine wave function.
Cardinal sine function (Sinc).
Fourier transform (FT, DFT/FFT).
Short-time Fourier transform (STFT, Gabor transform).
Wavelet transform.

Non-sinusoidal periodic functions

Gaussian (normal distribution) function.
Square wave function.
Triangle wave function.
Sawtooth wave function.
S-shaped rectified linear unit (SReLU).
Adaptive piecewise linear unit (APLU).

6 Periodic activation functions: sinusoidals

A natural starting point for periodic activations is the class of sinusoidal functions. Unlike monotonic transformations, sinusoidal functions are characterized by repeated oscillations over a fixed period. These oscillations may differ in amplitude, phase, or frequency, but their defining feature is periodic repetition.

From a mathematical perspective, sinusoidal functions are fundamental because they serve as the building blocks of many signal-processing techniques, including Fourier analysis and its extensions. Their smoothness and infinite differentiability make them especially suitable for neural models that aim to represent continuous and structured patterns.

Unlike monotonic activation functions, periodic functions do not preserve the ordering of inputs and exhibit oscillatory derivatives. This increases representational power and makes them particularly suitable for modeling periodic or structured signals, but it also introduces multiple local extrema in the loss landscape, which may complicate optimization. These behaviors will be addressed later in the context of neural network training.

6.0.1 Sine wave function

The sine wave is the canonical example of a periodic function. It oscillates smoothly around a central axis and produces values that repeat over time or space. In its normalized form, it takes values in the interval \([-1,1]\), yielding a characteristic wave-like pattern (Parascandolo et al., 2016).

A general sinusoidal function can be expressed as

\[ \begin{equation} f(x,t) = A \sin(kx - \omega t + \varphi) + D \tag{6.1} \end{equation} \]

Here,

\(A\) is the amplitude,
\(k\) is the wave number (spatial frequency),
\(\omega\) is the angular frequency,
\(t\) denotes time,
\(\varphi\) is the phase shift, and
\(D\) is a vertical offset.

This function is smooth (it belongs to \(C^{\infty}\)) and is inherently non-monotonic. The sign in front of \(\omega t\) determines the direction of propagation: for positive \(k\) and \(\omega\), the negative sign used in Equation (6.1) represents rightward propagation (toward increasing \(x\)), while a positive sign corresponds to leftward propagation.

A key property is that differentiation preserves oscillatory structure:

\[ \frac{d}{dx} \sin(x) = \cos(x), \qquad \frac{d}{dx} \cos(x) = -\sin(x). \]

This makes sinusoidal functions particularly suitable for representing periodic dependencies in neural networks. They are widely used in modeling temporal signals, recurrent dynamics, and continuous structured data.

However, due to their non-monotonic and non-convex nature, sinusoidal activations may introduce optimization challenges such as gradient instability, slower convergence, and sensitivity to learning rates (Lapedes and Farber, 1987). Nevertheless, they have shown strong performance in certain recurrent architectures, particularly for short-term prediction tasks (Sopena and Alquezar, 1994; Alquezar and Sanfeliu, 1994).

An additional useful property is that the mean value of sine and cosine over a full period is zero, which is advantageous in several statistical and signal-processing contexts.
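
The zero-mean property and the role of the parameters in Equation (6.1) can be illustrated with a short sketch. The function names, default parameter values, and the number of samples are assumptions made for this example.

```python
import math

def sinusoid(x, t=0.0, A=1.0, k=1.0, omega=1.0, phi=0.0, D=0.0):
    """General sinusoid A * sin(k x - omega t + phi) + D."""
    return A * math.sin(k * x - omega * t + phi) + D

def mean_over_period(f, period, n=1000):
    """Average of n equally spaced samples covering one full period."""
    return sum(f(i * period / n) for i in range(n)) / n

two_pi = 2.0 * math.pi
print(mean_over_period(math.sin, two_pi))
print(mean_over_period(math.cos, two_pi))
# With a vertical offset D, the mean over a period is D itself.
print(mean_over_period(lambda x: sinusoid(x, D=0.3), two_pi))
```

With \(k=\omega=1\), the crest located at \(x=\pi/2\) when \(t=0\) is found at \(x=\pi/2+t\) at a later time \(t\), which is the rightward propagation associated with the negative sign.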

The behavior of sine and cosine functions is illustrated in Figure 6.1.

Figure 6.1: Sine and Cosine Wave functions.

6.0.2 Cardinal sine function (Sinc)

The cardinal sine function, or sinc function, is an important oscillatory function widely used in signal processing and harmonic analysis. Unlike pure sinusoidal functions, the sinc function exhibits oscillations with decreasing amplitude, producing a characteristic damped waveform.

Its standard definition is given by Equation (6.2):

\[ \begin{equation} f(x) = \frac{\sin(x)}{x}, \qquad x \neq 0 \tag{6.2} \end{equation} \]

with the continuous extension \[ f(0) = 1. \]

A commonly used normalized version is defined by Equation (6.3):

\[ \begin{equation} f(x) = \frac{\sin(\pi x)}{\pi x}, \qquad x \neq 0 \tag{6.3} \end{equation} \]

again with \(f(0)=1\).

With this definition, the removable singularity at \(x=0\) is filled in, and the sinc function is continuous and infinitely differentiable on \(\mathbb{R}\). Its range is bounded, although it does not admit a simple closed-form characterization. The envelope of the oscillations decays proportionally to \(1/|x|\), which explains the diminishing amplitude observed away from the origin.

Although the sinc function is not strictly periodic, its oscillatory structure makes it closely related to sinusoidal functions.

A fundamental property is that the normalized sinc function is the Fourier transform of a rectangular function, making it essential in sampling theory, interpolation, and band-limited signal reconstruction.
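
Both versions of the sinc function, including the continuous extension at the origin, can be sketched as follows. The function names are illustrative assumptions.

```python
import math

def sinc(x):
    """Unnormalized sinc of Equation (6.2), with f(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x

def sinc_normalized(x):
    """Normalized sinc of Equation (6.3): sin(pi x) / (pi x)."""
    return sinc(math.pi * x)

for x in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"x={x:3.1f}  sinc={sinc(x): .6f}  normalized={sinc_normalized(x): .6f}")
```

The normalized version vanishes (up to rounding error) at every nonzero integer, while the unnormalized version vanishes at nonzero multiples of \(\pi\).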

The behavior of both the standard and normalized versions is illustrated in Figure 6.2.

Figure 6.2: Standard and normalized sinc functions.

6.0.3 Fourier transform (FT, DFT/FFT)

The Fourier transform provides a global frequency-domain representation of a signal by decomposing it into sinusoidal components. Unlike pointwise activation functions, it acts as a transformation that maps a signal from the time domain into the frequency domain.

For discrete-time signals \(\{x[n]\}\), the Discrete-Time Fourier Transform (DTFT) is defined by Equation (6.4):

\[ \begin{equation} X(\omega) = \sum_{n=-\infty}^{\infty} x[n] e^{- i \omega n} \tag{6.4} \end{equation} \]

where \(\omega\) is the angular frequency.

If \(x[n] = x(nT)\) corresponds to sampled data, the transform can also be expressed as

\[ X(f) = \sum_{n=-\infty}^{\infty} x(nT)e^{-i2\pi f nT}. \]

Figure 6.3 shows an example of a nonstationary signal in the time domain.

Figure 6.3: Time-domain signal.

As shown in Figure 6.3, the signal is nonstationary: its oscillatory pattern changes over time, combining low-frequency structure with faster local oscillations. This makes it a useful example for illustrating why a purely global frequency representation may be insufficient.

The DTFT is periodic in frequency and is closely related to sampling theory, particularly the Shannon-Nyquist theorem, which establishes the conditions required for perfect signal reconstruction. Figure 6.4 displays the corresponding frequency-domain representation.

Figure 6.4: Frequency-domain representation (FFT magnitude).

The main peaks indicate the dominant frequency components present in the signal. However, this representation does not reveal when these components occur, since the Fourier transform summarizes the signal globally over the entire time interval.
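
For a finite record of \(N\) samples, the DTFT in Equation (6.4) is evaluated in practice at the frequencies \(\omega_k = 2\pi k/N\), which gives the discrete Fourier transform (DFT). The sketch below uses a direct \(O(N^2)\) evaluation rather than an FFT, purely for readability; the test signal, its two frequencies, and all names are hypothetical choices for illustration.

```python
import cmath
import math

def dft(x):
    """Direct DFT: X[k] = sum_t x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

N = 64
# Hypothetical signal: 4 and 10 cycles per record, amplitudes 1 and 0.5
signal = [math.sin(2 * math.pi * 4 * t / N) + 0.5 * math.sin(2 * math.pi * 10 * t / N)
          for t in range(N)]

magnitudes = [abs(c) for c in dft(signal)]
half = magnitudes[: N // 2]  # a real signal has a mirrored spectrum
peaks = sorted(range(len(half)), key=half.__getitem__, reverse=True)[:2]
print("dominant bins:", sorted(peaks))
```

The two largest magnitudes identify which frequencies are present, but nothing in the result indicates when they occur in the record.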

6.0.4 Short-time Fourier transform (Gabor, STFT)

The Short-Time Fourier Transform (STFT) extends the Fourier transform by introducing local time information:

\[ \begin{equation} \mathrm{STFT}\{x[n]\}(m,\omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n-m]\, e^{-i\omega n} \tag{6.5} \end{equation} \]

where \(w[\cdot]\) is a window function centered at time index \(m\).

Unlike the standard Fourier transform, STFT provides a joint time-frequency representation, allowing the analysis of how frequency content evolves over time. However, this comes at the cost of a trade-off between time and frequency resolution, governed by the uncertainty principle:

\[ \Delta f \, \Delta t \ge \frac{1}{4\pi}. \]

Figure 6.5 shows a spectrogram-style time-frequency representation of the signal.

Figure 6.5: Short-time Fourier transform.

In Figure 6.5, the STFT provides a joint time-frequency representation of the signal. The horizontal axis represents time, the vertical axis represents frequency, and the shading intensity indicates the local spectral energy. Unlike the global Fourier transform, this representation makes it possible to identify when different frequency components become more prominent.
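The following sketch shows one common way to approximate Equation (6.5) numerically: the signal is cut into overlapping frames, each frame is multiplied by a window, and an FFT is taken per frame. The function name `stft_magnitude`, the Hann window, the frame and hop lengths, and the two-tone test signal are all illustrative assumptions. Because each frame is indexed locally, the result differs from Equation (6.5) by a phase factor, which does not affect the magnitudes used here.

```python
import numpy as np

def stft_magnitude(x, fs, frame_len=128, hop=32):
    """Magnitude of a framed STFT: Hann window, one-sided FFT per frame."""
    window = np.hanning(frame_len)
    starts = np.arange(0, len(x) - frame_len + 1, hop)
    frames = np.stack([x[s:s + frame_len] * window for s in starts])
    spec = np.abs(np.fft.rfft(frames, axis=1))          # shape: (frames, freqs)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)      # frequencies in Hz
    times = (starts + frame_len / 2) / fs               # frame centres in seconds
    return times, freqs, spec

fs = 1000.0
t = np.arange(0, 2.0, 1.0 / fs)
# Hypothetical test signal: 50 Hz during the first second, 200 Hz afterwards.
x = np.where(t < 1.0, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))

times, freqs, spec = stft_magnitude(x, fs)
dominant = freqs[np.argmax(spec, axis=1)]   # strongest frequency in each frame
early = dominant[times < 0.8]               # frames lying inside the first second
late = dominant[times > 1.2]                # frames lying inside the second second
```

With these settings the frequency bins are about 7.8 Hz apart, so the dominant bin in each frame can only match the true tone up to that resolution. A longer frame would sharpen the frequency estimate and blur the time at which the tone changes, which is the trade-off expressed by the uncertainty principle.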

6.0.5 Wavelet transform

The wavelet transform provides a time-frequency representation using localized basis functions. It is defined by Equation (6.6):

\[ \begin{equation} W_\psi x(a,b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi^*\!\left(\frac{t-b}{a}\right)\, dt \tag{6.6} \end{equation} \]

where \(\psi\) is the mother wavelet, \(\psi^*\) denotes its complex conjugate, \(a \neq 0\) is the scale parameter, and \(b\) is the translation parameter.

Wavelets overcome some limitations of STFT by using adaptive resolution: they provide good time localization at high frequencies and good frequency resolution at low frequencies, making them particularly suitable for analyzing nonstationary signals. Figure 6.6 presents a wavelet-based time-scale representation of the signal.

Figure 6.6: Wavelet transform.

Figure 6.6 shows a wavelet-based time-scale representation of the same signal. The horizontal axis corresponds to time, while the vertical axis represents scale (which is inversely related to frequency). The color intensity reflects the local power of the signal at each time-scale combination. Compared with STFT, the wavelet transform provides a more flexible representation, offering finer time localization at high frequencies and finer frequency resolution at low frequencies.

This multi-resolution property makes wavelets particularly effective for analyzing signals with localized and transient features.
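A direct, unoptimized discretization of Equation (6.6) can make the role of the scale parameter concrete. The sketch below assumes a complex Morlet mother wavelet with centre frequency \(\omega_0 = 6\) (a common choice, not one specified in the text) and replaces the integral with a Riemann sum; the names `morlet` and `cwt` are hypothetical. For a pure tone of frequency \(f\), the wavelet power is expected to peak near the scale \(a \approx \omega_0 / (2\pi f)\).

```python
import numpy as np

def morlet(u, omega0=6.0):
    """Complex Morlet wavelet (the small admissibility correction term is omitted)."""
    return np.pi ** -0.25 * np.exp(1j * omega0 * u) * np.exp(-0.5 * u ** 2)

def cwt(x, t, scales, omega0=6.0):
    """Riemann-sum approximation of W(a, b) for b on the grid t."""
    dt = t[1] - t[0]
    out = np.empty((len(scales), len(t)), dtype=complex)
    for i, a in enumerate(scales):
        u = (t[None, :] - t[:, None]) / a          # u[b, n] = (t_n - t_b) / a
        out[i] = (np.conj(morlet(u, omega0)) @ x) * dt / np.sqrt(a)
    return out

fs = 200.0
t = np.arange(0, 2.0, 1.0 / fs)
f0 = 5.0                                        # hypothetical test tone in Hz
x = np.cos(2 * np.pi * f0 * t)

scales = np.linspace(0.05, 0.5, 46)             # scales in seconds
W = cwt(x, t, scales)
power_mid = np.abs(W[:, len(t) // 2]) ** 2      # power at the centre of the record
best_scale = scales[np.argmax(power_mid)]
expected_scale = 6.0 / (2 * np.pi * f0)         # a ~ omega0 / (2*pi*f)
```

Because the wavelet is stretched by \(a\), small scales use short analysis windows and large scales use long ones, which is the adaptive resolution described above.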

These transformations illustrate how periodic structures can be represented at different levels of resolution, from global frequency decomposition (Fourier) to localized time-frequency analysis (STFT) and adaptive multi-scale representations (wavelets).

7 Periodic activation functions: non-sinusoidals

Not all periodic, or potentially periodic, activation functions are sinusoidal. Some functions do not explicitly involve sine or cosine terms, yet they can still produce repeating patterns or be adapted to periodic settings through suitable transformations.

In this section, we examine several non-sinusoidal activation functions. Some of them are intrinsically periodic, whereas others can acquire periodic behavior through repetition, wrapping, or piecewise construction.

Unlike sinusoidal activations, these functions often rely on discontinuities, piecewise definitions, or localized structures to generate periodic or quasi-periodic behavior.

7.0.1 Gaussian (normal) distribution function

The Gaussian, or normal, distribution is one of the most fundamental functions in statistics (Stigler, 1986). In its standard form, it is not periodic; instead, it defines a single bell-shaped curve centered at a mean value.

Its density function is given by

\[ \begin{equation} f(x \mid \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\left(- \frac{(x - \mu)^2}{2 \sigma^2}\right) \tag{7.1} \end{equation} \]

which is smooth and belongs to \(C^{\infty}\) on \(\mathbb{R}\). The Gaussian profile and its periodic extension are illustrated in Figure 7.1.

Figure 7.1: Periodic extension of a Gaussian function.

As shown in Figure 7.1, the Gaussian function exhibits a single localized peak and decays rapidly away from its mean. When extended periodically, this localized structure repeats across intervals, producing a smooth oscillatory pattern without sharp discontinuities.

Although the ordinary Gaussian function is not periodic, a periodic version can be obtained by wrapping the argument over a bounded interval. One such construction is

\[ \begin{equation} \rho(x) = f \left(\left(\left( x + \frac{N}{2} \right) \text{mod} \ N \right) - \frac{N}{2} \right) \tag{7.2} \end{equation} \]

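This transformation produces a periodic repetition of the Gaussian profile over intervals of length \(N\). When the profile is centered (\(\mu = 0\)), the result is continuous, but its derivative changes sign at the wrap points \(x = N/2 + kN\), \(k \in \mathbb{Z}\); the resulting kink is negligible when \(N \gg \sigma\).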
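Equation (7.2) translates almost literally into code. The sketch below is illustrative only: the function names and the choice \(N = 8\), \(\mu = 0\), \(\sigma = 1\) are assumptions. It relies on the fact that `np.mod` returns values in \([0, N)\) for positive \(N\), so the wrapped argument always lies in \([-N/2, N/2)\).

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density of Equation (7.1)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def periodic_gaussian(x, N, mu=0.0, sigma=1.0):
    """Periodic Gaussian of Equation (7.2) with period N."""
    # Wrap the argument into [-N/2, N/2) before evaluating the density.
    wrapped = np.mod(np.asarray(x, dtype=float) + N / 2, N) - N / 2
    return gaussian_pdf(wrapped, mu, sigma)

N = 8.0
x = np.linspace(-3.0, 3.0, 13)

# Shifting the input by a whole number of periods leaves the output unchanged.
same_after_shift = np.allclose(periodic_gaussian(x, N), periodic_gaussian(x + 3 * N, N))

# The peak at x = 0 reappears at x = N.
peak_repeats = np.isclose(periodic_gaussian(N, N), gaussian_pdf(0.0))
```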

Another important periodic analogue is the wrapped normal distribution, which is defined on the unit circle. Closely related to it is the von Mises distribution, a widely used model in directional statistics.

7.0.2 Square wave function

A square wave can be interpreted as a periodic extension of the Heaviside step function. Instead of switching once between two levels, it alternates repeatedly, making it a natural model for binary transmission, switching systems, and certain types of audio or electrical distortion.

A common representation is

\[ \begin{equation} x(t) = \operatorname{sgn}(\sin t), \quad v(t) = \operatorname{sgn}(\cos t) \tag{7.3} \end{equation} \]

Both functions have period \(2\pi\) and take the values \(\pm 1\) (and \(0\) at the jump points). They are discontinuous, and therefore they do not belong to \(C^{\infty}\).

Another representation uses shifted step functions:

\[ \begin{equation} x(t) = \sum_{n=-\infty}^{\infty} \left[ u\left(t - nT\right) - u\left(t - nT - \frac{T}{2}\right) \right] \tag{7.4} \end{equation} \]

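where \(u(\cdot)\) denotes the unit step function and \(T\) is the period. In this form the wave equals \(1\) during the first half of each period and \(0\) during the second half, whereas the functions in Equation (7.3) alternate between \(-1\) and \(1\). The resulting periodic switching behavior is illustrated in Figure 7.2.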

Figure 7.2: Square Wave function.

As shown in Figure 7.2, the square wave alternates abruptly between two levels, producing discontinuities at transition points. This lack of smoothness distinguishes it from sinusoidal activations.
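The two representations in Equations (7.3) and (7.4) can be compared numerically. In the sketch below, `square_steps` uses the modulo operation as a closed form of the infinite step sum, under the assumed convention \(u(0) = 1\); the function names and sample points are illustrative. With \(T = 2\pi\), the step-based wave equals \((1 + \operatorname{sgn}(\sin t))/2\) away from the jump points.

```python
import numpy as np

def square_sgn(t):
    """Bipolar square wave sgn(sin t): levels -1 and +1, period 2*pi."""
    return np.sign(np.sin(t))

def square_steps(t, T):
    """Unipolar square wave: 1 on [nT, nT + T/2) and 0 elsewhere."""
    phase = np.mod(t, T)          # position inside the current period
    return np.where(phase < T / 2, 1.0, 0.0)

T = 2 * np.pi
t = np.array([0.5, 2.0, 4.0, 6.0, 7.0, -1.0])   # sample points away from the jumps
bipolar = square_sgn(t)
unipolar = square_steps(t, T)

# Rescaling the bipolar wave to the levels 0 and 1 recovers the step-based form.
agree = np.allclose(unipolar, (1.0 + bipolar) / 2.0)
```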

7.0.3 Triangle wave function

The triangle wave takes its name from its repeated triangular shape around the horizontal axis (Tansel et al., 1991). It is periodic, continuous, and piecewise linear.

Foresee and Hagan (1997) employed triangle-wave-type functions in the context of Gauss-Newton approximations to Bayesian regularization, reporting favorable error behavior in several applications, including regression, time-series estimation, and chaotic signal modeling.

One possible expression for a triangle wave is

\[ \begin{equation} x(t) = \frac{2}{a} \left( t - a \left\lfloor \frac{t}{a} + \frac{1}{2} \right\rfloor \right) (-1)^{\left\lfloor \frac{t}{a} + \frac{1}{2} \right\rfloor} \tag{7.5} \end{equation} \]

where \(\lfloor \cdot \rfloor\) denotes the floor function and \(a\) is the half-period: the wave has period \(2a\) and takes values in \([-1, 1]\).

The triangle wave is continuous, but it is not differentiable at its corner points. It can also be obtained by integrating a square wave:

\[ \begin{equation} \int_0^t \operatorname{sgn} \left(\sin s \right) ds = \arccos(\cos t) \tag{7.6} \end{equation} \]

The right-hand side is a triangle wave of period \(2\pi\) with values in \([0, \pi]\).

An example of the triangle wave is shown in Figure 7.3.

Figure 7.3: Triangle Wave function.

As illustrated in Figure 7.3, the triangle wave evolves linearly between peaks, producing a continuous but non-smooth signal with sharp corners at turning points.
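A short sketch of Equation (7.5) is given below, reading the exponent of \((-1)\) as the floor of \(t/a + 1/2\). The function name and the value \(a = 2\) are illustrative assumptions. Evaluating the wave at multiples of \(a/2\) shows the corner values \(\pm 1\), the zero crossings, and the period \(2a\).

```python
import numpy as np

def triangle_wave(t, a):
    """Triangle wave of Equation (7.5): period 2*a, values in [-1, 1]."""
    t = np.asarray(t, dtype=float)
    k = np.floor(t / a + 0.5)
    sign = 1.0 - 2.0 * np.mod(k, 2)     # equals (-1)**k for integer-valued k
    return (2.0 / a) * (t - a * k) * sign

a = 2.0
knots = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
values = triangle_wave(knots, a)

# Shifting by one full period (2*a) leaves the wave unchanged.
t = np.linspace(-5.0, 5.0, 101)
is_periodic = np.allclose(triangle_wave(t, a), triangle_wave(t + 2 * a, a))
```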

7.0.4 Sawtooth Wave function

The sawtooth wave is another non-sinusoidal periodic function. It resembles the triangle wave, but instead of rising and falling symmetrically, it changes linearly in one direction and then resets abruptly.

Sawtooth waves have been used in engineering applications such as power electronics and motor drives (Bose, 2007), as well as in neural systems for biomedical pattern recognition (Wang et al., 2017).

A standard sawtooth representation can be written as

\[ \begin{equation} x(t) = 2 \left( \frac{t}{T} - \left\lfloor \frac{t}{T} + \frac{1}{2} \right\rfloor \right) \tag{7.7} \end{equation} \]

Here \(T\) is the period, and the function takes values in \([-1, 1)\). It is piecewise linear, but it is discontinuous, and hence not differentiable, at the jump points \(t = (n + \tfrac{1}{2})T\), \(n \in \mathbb{Z}\). For this centered form, taking the absolute value of the sawtooth produces a triangle wave. The sawtooth waveform is illustrated in Figure 7.4.

Figure 7.4: Sawtooth Wave function.

Figure 7.4 shows the characteristic linear rise followed by a sharp drop, which introduces discontinuities and strong high-frequency components.
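The sawtooth of Equation (7.7) and its relation to the triangle wave can be checked with a few sample points. The function name and the period \(T = 2\) below are illustrative assumptions.

```python
import numpy as np

def sawtooth_wave(t, T):
    """Sawtooth of Equation (7.7): rises linearly, then drops at t = (n + 1/2) * T."""
    t = np.asarray(t, dtype=float)
    return 2.0 * (t / T - np.floor(t / T + 0.5))

T = 2.0
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
saw = sawtooth_wave(t, T)

# Folding the sawtooth with the absolute value removes the jump
# and leaves a triangle-shaped wave with values in [0, 1].
folded = np.abs(saw)
```

At these sample points the sawtooth rises from \(0\) to \(0.5\), resets to \(-1\) at \(t = T/2\), and returns to \(0\) at \(t = T\), while the folded version rises to \(1\) and falls back symmetrically.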

7.0.5 S-shaped rectified linear unit (SReLU)

The S-shaped Rectified Linear Unit (SReLU) was introduced by Jin et al. (2016a) as a flexible activation function capable of learning both convex and non-convex response patterns. It is a piecewise linear function controlled by four learnable parameters, which define two transition points, or “knuckles”.

Because these parameters are not known in advance, Jin et al. (2016a, 2016b) proposed initialization strategies to improve training stability. In practice, poor initialization may lead to weak performance, making parameter selection especially important during the early stages of learning.

One proposed strategy is to freeze the SReLU parameters during the initial epochs so that the network first behaves like a simpler rectifier. After this stage, the right threshold may be set adaptively as

\[ \begin{equation} t^r_i = \operatorname{supp}(X_i, k) \tag{7.8} \end{equation} \]

where \(\operatorname{supp}(X_i,k)\) denotes the \(k\)-th largest value in the set \(X_i\), \(X_i\) contains all input values associated with the \(i\)-th SReLU unit, and \(t^r_i\) is the right threshold of that unit (written \(t_r\) below, with the unit index omitted). This structure allows the activation to approximate both convex and non-convex shapes, adapting to different regions of the input space.
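The selection rule in Equation (7.8) amounts to picking the \(k\)-th largest input seen by a unit. A minimal sketch follows; the helper name `kth_largest` and the sample values are hypothetical, and the code is not taken from the original SReLU implementation.

```python
import numpy as np

def kth_largest(values, k):
    """Return the k-th largest entry of values (k = 1 gives the maximum)."""
    values = np.asarray(values, dtype=float).ravel()
    if not 1 <= k <= values.size:
        raise ValueError("k must satisfy 1 <= k <= number of values")
    # np.partition places the element of sorted rank (size - k) at that index.
    return np.partition(values, values.size - k)[values.size - k]

inputs_of_unit = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # illustrative X_i
t_r_init = kth_largest(inputs_of_unit, k=3)
```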

The activation function is defined as

\[ \begin{equation} f_{t_l,a_l,t_r,a_r}(x) = \left\{ \begin{array}{ll} t_l + a_l (x - t_l), & \text{if} \ x \le t_l\\ x, & \text{if} \ t_l < x < t_r \\ t_r + a_r (x - t_r), & \text{if} \ x \ge t_r \end{array} \right. \tag{7.9} \end{equation} \]

The function is continuous on \(\mathbb{R}\), and therefore belongs to \(C^0\), but it is generally not differentiable at the two transition points. The parameters \(t_l\), \(a_l\), \(t_r\), and \(a_r\) may either be specified in advance or learned during training.
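Equation (7.9) can be written as a vectorized function. The sketch below assumes \(t_l < t_r\) and treats the four parameters as fixed numbers rather than learned tensors; the function name and parameter values are illustrative.

```python
import numpy as np

def srelu(x, t_l, a_l, t_r, a_r):
    """SReLU of Equation (7.9); assumes t_l < t_r."""
    x = np.asarray(x, dtype=float)
    left = t_l + a_l * (x - t_l)      # branch used for x <= t_l
    right = t_r + a_r * (x - t_r)     # branch used for x >= t_r
    return np.where(x <= t_l, left, np.where(x >= t_r, right, x))

x = np.array([-3.0, -1.0, 0.0, 2.0, 4.0])
y = srelu(x, t_l=-1.0, a_l=0.1, t_r=2.0, a_r=0.5)

# With t_l = 0, a_l = 0 and a_r = 1 the unit reduces to the ordinary ReLU.
relu_like = srelu(x, t_l=0.0, a_l=0.0, t_r=1.0, a_r=1.0)
```

The outputs at \(x = t_l\) and \(x = t_r\) coincide with the identity branch, which is the continuity property stated above; the slopes change there, so the derivative is not defined at those two points unless \(a_l = 1\) or \(a_r = 1\).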

Its main limitation lies in the difficulty of choosing suitable initial values. Even so, Jin et al. (2016a) reported that SReLU can outperform several rectifier-based activation functions in deep neural networks.

7.0.6 Adaptive piecewise linear unit (APLU)

The Adaptive Piecewise Linear Unit (APLU) is another flexible activation designed to learn its shape directly from data. Unlike the standard ReLU, APLU incorporates additional hinge-like terms, allowing each neuron to develop its own adaptive piecewise linear response.

Agostinelli et al. (2014) proposed this unit as a generalization of simpler rectifier functions. Its form combines a standard ReLU term with a sum of shifted hinge components:

\[ \begin{equation} f(x) = \max(0,x) + \sum^{S}_{s=1} a_i^{(s)} \max\left(0,-x+b^{(s)}_i\right) \tag{7.10} \end{equation} \]

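where \(S\) is the number of hinge components (a hyperparameter fixed in advance), \(i\) indexes the neuron, and the coefficients \(a_i^{(s)}\) and shifts \(b_i^{(s)}\) are learned from the data.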

Because the shape of the activation is updated independently for each neuron, APLU can adapt to local patterns more flexibly than fixed activation functions. In this sense, it acts as a neuron-specific activation that is learned jointly with the other model parameters.

Agostinelli et al. (2014) reported that APLU improves predictive performance on benchmarks such as CIFAR-10 and CIFAR-100. On the other hand, this flexibility may also make the network more fragile, since the overall behavior of the model depends on the stability of many independently learned nonlinearities.

The behavior of APLU can be interpreted as a superposition of multiple hinge functions, allowing the activation to approximate complex nonlinear patterns.
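The superposition in Equation (7.10) can be sketched for a single neuron as follows. The function name, the choice \(S = 2\), and the coefficient values are illustrative assumptions; in a trained network the coefficients and shifts would be learned parameters.

```python
import numpy as np

def aplu(x, a, b):
    """APLU of Equation (7.10) for one neuron with S = len(a) hinge terms."""
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    hinges = np.maximum(0.0, -x[..., None] + b)     # shape: x.shape + (S,)
    return np.maximum(0.0, x) + hinges @ a

x = np.array([-1.0, 0.5, 2.0])
y = aplu(x, a=[-0.2, 0.1], b=[0.0, 1.0])

# Setting every coefficient to zero removes the hinges and leaves the plain ReLU.
relu_only = aplu(x, a=[0.0, 0.0], b=[0.0, 1.0])
```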

In contrast to sinusoidal activations, non-sinusoidal periodic functions often rely on discontinuities, piecewise structures, or localized responses. This leads to richer representational flexibility, but may also introduce challenges related to smoothness and optimization.

To synthesize the main ideas discussed in this section, Tables 8.1, 9.1, and 10.1 provide three complementary perspectives. The first summarizes the periodic functions examined individually, the second contrasts sinusoidal and non-sinusoidal periodic families, and the third compares periodic activations with the monotonic functions discussed earlier.

8 Comparative periodic activation functions

Table 8.1 summarizes the main periodic and periodic-related functions presented in this document. It highlights their mathematical type, smoothness, periodicity behavior, and typical applications. In contrast to monotonic activations, these functions are designed to represent oscillatory, localized, or multi-scale patterns rather than preserving input order.

Table 8.1: Comparative periodic activation functions.
Function	Type	Expression	Smoothness	Periodicity	Typical use	Reference
Sine	Sinusoidal	\(\sin(x)\)	Smooth	Strict	Signal modeling	Parascandolo et al. (2016)
Sinc	Sinusoidal-related	\(\frac{\sin x}{x}\)	Smooth	Quasi-periodic	Sampling theory	Shannon (1949)
Fourier transform	Transform	\(X(\omega)\)	Smooth	Global	Spectral analysis	Fourier (1822)
STFT	Transform	\(\mathrm{STFT}(x)\)	Smooth	Local	Time-frequency analysis	Gabor (1946)
Wavelet	Transform	\(W_\psi(x)\)	Smooth	Multi-scale	Nonstationary signals	Daubechies (1992)
Gaussian (periodic)	Non-sinusoidal	\(\exp(-x^2)\)	Smooth	Wrapped	Kernel methods	Stigler (1986)
Square wave	Non-sinusoidal	\(\operatorname{sgn}(\sin x)\)	Discontinuous	Strict	Binary systems	Oppenheim & Schafer (1999)
Triangle wave	Non-sinusoidal	Piecewise linear	Continuous (non-smooth)	Strict	Approximation	Tansel et al. (1991)
Sawtooth wave	Non-sinusoidal	Piecewise linear	Discontinuous	Strict	Signal synthesis	Bose (2007)
SReLU	Adaptive	Piecewise linear	Continuous	Learned	Deep learning	Jin et al. (2016)
APLU	Adaptive	Adaptive sum	Continuous	Learned	Adaptive networks	Agostinelli et al. (2014)

Table 8.1 shows that the periodic functions discussed here are mathematically heterogeneous. Some are strictly periodic and smooth, such as the sine wave, whereas others are discontinuous or piecewise linear, such as the square and sawtooth waves. In addition, transforms such as Fourier, STFT, and wavelets are not pointwise activations in the usual sense, but they are included because they provide essential representations of oscillatory structure.

9 Sinusoidal versus non-sinusoidal periodic functions

Table 9.1 contrasts the two broad families of periodic functions considered in this chapter. Sinusoidal functions are typically smooth and analytically convenient, while non-sinusoidal periodic functions often introduce discontinuities or sharp transitions that enrich representational flexibility.

Table 9.1: Comparison between sinusoidal and non-sinusoidal periodic functions.
Property	Sinusoidal	Non-sinusoidal
Mathematical structure	Based on sine/cosine	Piecewise / constructed
Smoothness	Smooth	May be discontinuous
Differentiability	Infinitely differentiable	Not always differentiable
Frequency behavior	Single or harmonic frequencies	Rich harmonic content
Representation power	Good for global patterns	Good for sharp/local features
Typical use	Signal modeling, physics	Engineering, DL activations

As indicated in Table 9.1, sinusoidal functions are especially suitable for representing smooth and globally structured oscillations. By contrast, non-sinusoidal periodic functions are often better suited to signals with abrupt changes, localized features, or richer harmonic content. This distinction is useful when selecting mathematical representations for different classes of neural or signal-processing models.

10 Monotonic versus periodic activation functions

Finally, Table 10.1 compares the monotonic activation functions discussed earlier with the periodic functions presented in this section. This broader contrast helps clarify why the two families play different roles in neural modeling.

Table 10.1: Comparison between monotonic and periodic activation functions.
Property	Monotonic	Periodic
Monotonicity	Preserved order	Not preserved
Output behavior	Non-oscillatory	Oscillatory
Smoothness	Often smooth	Smooth (sinusoidal) or discontinuous (square, sawtooth)
Gradient behavior	Stable gradients	Oscillatory gradients
Representation	Hierarchical features	Cyclic/structured patterns
Typical use	Deep learning standard	Signals, implicit models

Table 10.1 emphasizes the main conceptual difference between these two families. Monotonic functions are usually preferred in standard feedforward architectures because they preserve ordering and often provide more stable optimization behavior. Periodic functions, in contrast, sacrifice monotonicity in order to capture cyclic, oscillatory, or structured dependencies that monotonic activations cannot represent directly.
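The contrast in gradient behavior listed in Table 10.1 can be illustrated with one monotonic and one periodic activation. The sketch below uses \(\tanh\) and \(\sin\) as representatives (an illustrative choice) and counts how often each derivative changes sign on the interval \([-6, 6]\).

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 1201)
grad_tanh = 1.0 - np.tanh(x) ** 2     # derivative of tanh: positive everywhere
grad_sin = np.cos(x)                  # derivative of sin: changes sign periodically

# Count sign changes of each derivative along the grid.
tanh_sign_changes = np.count_nonzero(np.diff(np.sign(grad_tanh)))
sin_sign_changes = np.count_nonzero(np.diff(np.sign(grad_sin)))
```

The derivative of \(\tanh\) never changes sign, so a gradient step always moves the pre-activation in a consistent direction, whereas the derivative of \(\sin\) flips sign at every odd multiple of \(\pi/2\) inside the interval.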

These three tables collectively show that periodic functions should not be viewed merely as alternatives to monotonic activations. Rather, they constitute a complementary family of tools that are especially valuable when the data exhibit repeated, localized, or multi-scale patterns.

11 Summary

In this document, we examined the role of activation functions in artificial neural networks, emphasizing their importance as nonlinear transformations that determine how neurons respond to incoming signals.

We began by distinguishing between monotonic and periodic activation functions. Monotonic activations, including sigmoid, hyperbolic tangent, ReLU, ELU, and SoftExponential families, were presented as fundamental components for stable and efficient learning in conventional neural network architectures.

We then explored periodic activation functions, which are particularly suited for modeling oscillatory, wave-like, and structured patterns. Within this group, we analyzed both sinusoidal functions (such as the sine wave and the sinc function) and non-sinusoidal constructions, including periodic Gaussian variants, square, triangle, and sawtooth waves, as well as adaptive activations such as SReLU and APLU.

Throughout the chapter, attention was given not only to the analytical expressions of these functions, but also to their smoothness, differentiability, geometric behavior, and practical implications for learning.

The comparative analysis developed across sections highlights that activation functions provide complementary modeling capabilities rather than interchangeable alternatives. Monotonic functions are well suited for hierarchical feature extraction and stable optimization, whereas periodic functions enable the representation of cyclic, structured, and multi-scale patterns.

Overall, no single activation function is universally optimal. Instead, the appropriate choice depends on the structure of the data, the architecture of the model, and the type of relationships the network is expected to learn.

12 Learning activity

12.0.1 Objective

The purpose of this activity is to consolidate the conceptual understanding of activation functions and their role in artificial neural networks. In particular, it aims to develop intuition about the differences between monotonic and periodic activations, their geometric properties, and their suitability for modeling different types of patterns.

12.0.2 Instructions

Consider the following set of activation functions:

Sigmoid.
ReLU.
Hyperbolic tangent (tanh).
Sine function.
Square wave.

Answer the following questions:

1. Classification

Classify each function as monotonic or periodic. Briefly justify your answer.

2. Geometric interpretation

Describe the qualitative shape of each function. In particular:
- Is it bounded or unbounded?
- Is it symmetric?
- Does it saturate?

3. Smoothness and differentiability

For each function, discuss whether it is:
- Continuous.
- Differentiable everywhere.
- Piecewise differentiable.

4. Modeling capability

For each function, indicate what type of patterns it is better suited to model:
- Linear trends.
- Binary decisions.
- Smooth nonlinear relationships.
- Oscillatory or periodic patterns.

5. Critical reflection

Suppose you are designing a neural network to model:
- A time series with seasonal behavior.
- A classification problem with sharp decision boundaries.

Which activation functions would you choose in each case? Justify your reasoning.

6. Conceptual synthesis

Based on your analysis, explain in your own words why no single activation function is universally optimal.

12.0.3 Optional extension

1. Propose a new activation function by combining two of the functions discussed in this document. Describe its expected behavior and potential advantages.

2. Design a conceptual activation function that combines monotonic and periodic behavior. Describe:
- Its qualitative mathematical form.
- Its expected shape.
- Its advantages compared to standard activation functions.

NEURAL NETWORKS FOR REPRESENTATION LEARNING

Activation functions

Dr. rer. nat. Humberto LLinás Solano
