A function is a relationship between one or more input variables and an output. We often write expressions like \(f(x) = x^2 + 3\) and interpret that as \(f\) is a function of an input variable \(x\). An expression like \(g(x,y,z)\) means that \(g\) is a function of three input variables \(x\), \(y\) and \(z\). Calculus, at its essence, is the study of how these functions change. It is a good idea to refresh your algebra before proceeding. In this paper, however, we take a graphical approach to calculus.
The gradient of a function at a point is the slope of the function at that point. For a function of one variable, we usually refer to the gradient as the derivative. Formally, we define the derivative of a function \(f\) by \[f'(x) = \lim_{h \rightarrow 0} \dfrac{f(x+h) - f(x)}{h}.\] The quotient is the slope of the secant line through the points \((x, f(x))\) and \((x+h, f(x+h))\), and the derivative is the value this slope approaches as \(h\) shrinks to 0, that is, the slope of the tangent line at \(x\).
For example, consider the function \(f(x) = x^2\). Then \[ \begin{aligned}f'(x) &= \lim_{h \rightarrow 0} \dfrac{f(x+h) - f(x)}{h}\\ &= \lim_{h \rightarrow 0} \dfrac{(x+h)^2 - x^2}{h} \\ &= \lim_{h \rightarrow 0} \dfrac{2xh + h^2}{h}\\ &= \lim_{h \rightarrow 0} 2x +h\\ &= 2x.\end{aligned}\]
For a second example, consider the function \(f(x) = 1/x\). Then \[ \begin{aligned}f'(x) &= \lim_{h \rightarrow 0} \dfrac{f(x+h) - f(x)}{h}\\ &= \lim_{h \rightarrow 0} \dfrac{1/(x+h) - 1/x}{h} \\ &= \lim_{h \rightarrow 0} \dfrac{\dfrac{x - (x+h)}{x(x+h)}}{h}\\ &= \lim_{h \rightarrow 0} \dfrac{-h}{hx(x+h)}\\ &= \lim_{h \rightarrow 0} \dfrac{-1}{x(x+h)}\\ &= \dfrac{-1}{x^2}.\end{aligned}\]
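The limit in these two examples can also be watched numerically. The following is a minimal sketch, not part of the original derivation: the helper name `difference_quotient` and the sample points \(x = 3\) and \(x = 2\) are our own choices for illustration. It evaluates the secant slope for smaller and smaller \(h\), which should settle toward \(2x = 6\) and \(-1/x^2 = -0.25\).

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


def reciprocal(x):
    return 1 / x


# Shrink h and watch the secant slopes settle toward 2x and -1/x^2.
# The sample points x = 3 and x = 2 are arbitrary illustrative choices.
for h in (1e-1, 1e-3, 1e-5):
    print(h, difference_quotient(square, 3.0, h), difference_quotient(reciprocal, 2.0, h))
```

Because \(h\) never actually reaches 0, the printed values only approximate the derivative; the algebra above is what gives the exact answer.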
We have a number of rules to avoid the algebra involved with the limit definition of the derivative. \[\begin{aligned} \dfrac{d}{dx}c &= 0 & \dfrac{d}{dx}e^x &= e^x\\\\ \dfrac{d}{dx} cf(x) &= c\dfrac{d}{dx}f(x) & \dfrac{d}{dx}\sin(x) &= \cos(x)\\\\ \dfrac{d}{dx} (f(x) \pm g(x)) &= \dfrac{d}{dx}f(x) \pm \dfrac{d}{dx}g(x) & \dfrac{d}{dx}\cos(x) &= -\sin(x)\\\\ \dfrac{d}{dx}x^n &= nx^{n-1}&\dfrac{d}{dx}\ln(x) &= \dfrac{1}{x}\\\\ \end{aligned}\]
The product rule helps differentiate products of differentiable functions. \[\dfrac{d}{dx}(f(x) g(x)) = f(x)\dfrac{d}{dx} g(x) + g(x) \dfrac{d}{dx}f(x).\]
For example, \[\begin{aligned}\dfrac{d}{dx} x^3\sin(x) &= x^3 \dfrac{d}{dx}\sin(x) + \sin(x) \dfrac{d}{dx} x^3\\\\ &=x^3\cos(x) + \sin(x)(3x^2). \end{aligned}\]
The quotient rule helps differentiate quotients of differentiable functions. \[\dfrac{d}{dx}\dfrac{f(x)}{g(x)} = \dfrac{g(x)\dfrac{d}{dx}f(x) - f(x) \dfrac{d}{dx}g(x)}{g(x)^2}.\]
For example, \[\begin{aligned}\dfrac{d}{dx} \dfrac{e^x }{4x^6} &= \dfrac{4x^6\dfrac{d}{dx}e^x - e^x\dfrac{d}{dx}4x^6}{(4x^6)^2}\\\\ &=\dfrac{4x^6 e^x - e^x(24x^5)}{16x^{12}}\\\\ &=\dfrac{e^x(4x^5)(x - 6)}{16x^{12}}\\\\ &=\dfrac{e^x(x - 6)}{4x^{7}}. \end{aligned}\]
The chain rule allows us to differentiate composite functions. The chain rule states \[\dfrac{d}{dx} f(g(x)) = f'(g(x))g'(x).\]
As an example, we can compute \(F'(x)\) when \(F(x) = \sin(x^4)\). Here \(f(x) = \sin(x)\) and \(g(x) = x^4\). Thus, \[\begin{aligned} F'(x) &= \cos(x^4) \dfrac{d}{dx} x^4 \\\\ &= 4x^3 \cos(x^4). \end{aligned}\]
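As a sanity check on the product, quotient and chain rule examples, we can compare each closed-form answer with a numerical slope. This is only a sketch under our own assumptions: the helper `central_difference`, the step size and the test point \(x_0 = 1.3\) are illustrative choices, not part of the original text.

```python
import math


def central_difference(f, x, h=1e-6):
    """Symmetric secant slope, a numerical stand-in for f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.3  # arbitrary test point, chosen away from x = 0

# Each entry pairs a function with the derivative obtained from the rule.
checks = {
    "product": (lambda x: x**3 * math.sin(x),
                lambda x: x**3 * math.cos(x) + 3 * x**2 * math.sin(x)),
    "quotient": (lambda x: math.exp(x) / (4 * x**6),
                 lambda x: math.exp(x) * (x - 6) / (4 * x**7)),
    "chain": (lambda x: math.sin(x**4),
              lambda x: 4 * x**3 * math.cos(x**4)),
}

errors = {}
for name, (f, f_prime) in checks.items():
    # Gap between the numerical slope and the rule-based derivative.
    errors[name] = abs(central_difference(f, x0) - f_prime(x0))
    print(name, errors[name])
```

Small gaps suggest the rule was applied correctly at that point; they do not prove the formula for every \(x\).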
We now generalize the idea of calculus to multiple variables.
Consider the following expression relating force \(F\) generated by a car’s engine to its mass \(m\), acceleration \(a\), aerodynamic drag \(d\) and velocity \(v\), \[F = ma + dv^2.\] In the context of driving the car, the \(a\) and \(v\) are variables that depend on how you apply force to the gas pedal and we can think of \(m\) and \(d\) as constant in this respect. On the other hand, if you design cars, then \(m\) and \(d\) are variables that you may wish to change in order to produce a different force for your customers. Thus, context is important when it comes to problems with many variables and the reality is that one can differentiate any variable with respect to any other variable.
As an example, consider the idea of creating a soup can. The amount of metal required to create the can depends on 2 round parts (top and bottom of the can) of radius \(r\) plus the body of the can of height \(h\). The total metal is the sum of the areas of these pieces times the thickness (\(t\)) times the density (\(\rho\)). The total amount of metal required, \(m\), is given by \[m = 2\pi r^2t\rho + 2\pi r h t \rho.\] All of these factors are variable when designing a soup can.
When we differentiate with respect to a variable, we treat the other variables as constants. So, to find the change in \(m\) as a result of changing \(h\), we find \(\dfrac{\partial m}{\partial h}\). We find that \[\dfrac{\partial m}{\partial h} = 0 + 2\pi r t \rho= 2\pi r t \rho.\] Finding \(\dfrac{\partial m}{\partial t}\), we get \[\dfrac{\partial m}{\partial t} = 2 \pi r^2 \rho + 2\pi r h \rho.\] Finding \(\dfrac{\partial m}{\partial r}\), we get \[\dfrac{\partial m}{\partial r} = 4 \pi r t \rho + 2\pi h t \rho.\] Finding \(\dfrac{\partial m}{\partial \rho}\), we get \[\dfrac{\partial m}{\partial \rho} = 2 \pi r^2 t + 2\pi r h t.\]
Consider \(f(x,y,z) = \sin(x)e^{yz^2}\). Here we have \[\begin{aligned}\dfrac{\partial f}{\partial x} &= \cos(x)e^{yz^2}\\ \\ \dfrac{\partial f}{\partial y} &= \sin(x)e^{yz^2}z^2\\ \\ \dfrac{\partial f}{\partial z} &= \sin(x)e^{yz^2}2yz. \end{aligned}\] Suppose \(x,y\) and \(z\) were all functions of a single parameter \(t\). We may then want to differentiate \(f\) with respect to that parameter. This is where the total derivative comes in. We use the chain rule to find the total derivative \(\dfrac{d f(x,y,z)}{dt}\), which is given by \[\dfrac{d f(x,y,z)}{dt} = \dfrac{\partial f}{\partial x} \dfrac{dx}{dt} + \dfrac{\partial f}{\partial y} \dfrac{dy}{dt} + \dfrac{\partial f}{\partial z} \dfrac{dz}{dt}.\] Thus, the total derivative is a sum of three chain-rule terms, one for each variable.
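To make "treat the other variables as constant" concrete, the sketch below nudges one soup-can variable at a time and compares the resulting slope with the partial derivatives of \(m = 2\pi r^2t\rho + 2\pi r h t \rho\). The helper `partial` and the sample dimensions are hypothetical, chosen only for illustration.

```python
import math


def metal_mass(r, h, t, rho):
    """m = 2*pi*r^2*t*rho + 2*pi*r*h*t*rho (two lids plus the side wall)."""
    return 2 * math.pi * r**2 * t * rho + 2 * math.pi * r * h * t * rho


def partial(f, args, index, step=1e-6):
    """Central-difference estimate of the partial derivative of f with
    respect to args[index], holding every other argument fixed."""
    up = list(args)
    down = list(args)
    up[index] += step
    down[index] -= step
    return (f(*up) - f(*down)) / (2 * step)


# Hypothetical dimensions for r, h, t, rho (units are illustrative only).
point = (3.0, 10.0, 0.02, 7.8)
r, h, t, rho = point

# Partial derivatives obtained by differentiating m term by term.
analytic = {
    "r": 4 * math.pi * r * t * rho + 2 * math.pi * h * t * rho,
    "h": 2 * math.pi * r * t * rho,
    "t": 2 * math.pi * r**2 * rho + 2 * math.pi * r * h * rho,
    "rho": 2 * math.pi * r**2 * t + 2 * math.pi * r * h * t,
}
numeric = {name: partial(metal_mass, point, i)
           for i, name in enumerate(["r", "h", "t", "rho"])}

for name in analytic:
    print(name, analytic[name], numeric[name])
```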
For the sake of example, suppose \(x = t-1\), \(y = t^2\) and \(z = \dfrac{1}{t}\). Then \(\dfrac{dx}{dt} = 1\), \(\dfrac{dy}{dt} = 2t\) and \(\dfrac{dz}{dt} = -1/t^2\). Therefore, \[\begin{aligned}\dfrac{d f(x,y,z)}{dt} &=\dfrac{\partial f}{\partial x} \dfrac{dx}{dt} + \dfrac{\partial f}{\partial y} \dfrac{dy}{dt} + \dfrac{\partial f}{\partial z} \dfrac{dz}{dt} \\\\ &= \cos(x)e^{yz^2}(1) + z^2\sin(x)e^{yz^2}(2t) + 2yz\sin(x)e^{yz^2}(-t^{-2})\\\\ &= \cos(t-1)e + 2t^{-1}\sin(t-1)e - 2t^{-1}\sin(t-1)e\\ \\ &= \cos(t-1)e. \end{aligned}\] Note that this is exactly the expression for \(df/dt\) that we would have found if we had substituted \(t\) in for \(x, y\) and \(z\) in the original function and differentiated as if we had a single variable.
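The agreement between the chain-rule sum and direct substitution can be checked numerically. This is a minimal sketch: the function names and the test value \(t_0 = 2\) are our own illustrative choices. It computes the total derivative in three ways, which should all agree up to rounding.

```python
import math


def f(x, y, z):
    return math.sin(x) * math.exp(y * z**2)


def total_derivative(t):
    """Chain-rule sum: f_x * dx/dt + f_y * dy/dt + f_z * dz/dt."""
    x, y, z = t - 1, t**2, 1 / t
    dx, dy, dz = 1.0, 2 * t, -1 / t**2
    e = math.exp(y * z**2)
    f_x = math.cos(x) * e
    f_y = math.sin(x) * e * z**2
    f_z = math.sin(x) * e * 2 * y * z
    return f_x * dx + f_y * dy + f_z * dz


def substituted(t):
    """f written as a function of t alone."""
    return f(t - 1, t**2, 1 / t)


t0 = 2.0  # arbitrary test value, chosen away from t = 0
h = 1e-6
numeric = (substituted(t0 + h) - substituted(t0 - h)) / (2 * h)
closed_form = math.e * math.cos(t0 - 1)

print(total_derivative(t0), numeric, closed_form)
```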
Now, consider a vector \(\mathbf{x} = \begin{pmatrix} x_1, & x_2, &\ldots, &x_n \end{pmatrix}\) consisting of variables \(x_1, x_2\), etc., where each of those variables is a function of \(t\). We want to find \(\dfrac{df}{dt}\). Since \[\dfrac{\partial f}{\partial \mathbf{x}} = \begin{pmatrix} \partial f / \partial x_1 \\ \partial f/ \partial x_2\\ \vdots \\\partial f / \partial x_n \end{pmatrix}\] and \[\dfrac{d \mathbf{x}}{dt} = \begin{pmatrix} dx_1/dt \\ dx_2/dt\\ \vdots \\ dx_n/dt \end{pmatrix}\] and the total derivative is the sum of the products of the entries in the same position, the total derivative is the dot product of the two vectors, and so \[\dfrac{d f}{dt} = \dfrac{\partial f}{\partial \mathbf{x}} \cdot \dfrac{d\mathbf{x}}{dt}.\] Therefore, we now have an expression for the chain rule in many dimensions.
For functions with more than one layer of dependency, we can still apply the chain rule. For example, consider \(f(\mathbf{x}(\mathbf{u}(t)))\), with \(f(\mathbf{x}) = f(x_1,x_2)\), \(\mathbf{x}(\mathbf{u}) = \begin{pmatrix} x_1(u_1,u_2)\\ x_2(u_1,u_2) \end{pmatrix}\) and \(\mathbf{u}(t) = \begin{pmatrix}u_1(t)\\u_2(t) \end{pmatrix}\). In this case, we have \[ \begin{aligned}\dfrac{df}{dt} &= \dfrac{\partial f}{\partial \mathbf{x}} \dfrac{\partial \mathbf{x}}{\partial \mathbf{u}} \dfrac{d \mathbf{u}}{dt}\\\\ &= \begin{pmatrix} \dfrac{\partial f}{\partial x_1},&\dfrac{\partial f}{\partial x_2} \end{pmatrix} \begin{pmatrix} \dfrac{\partial x_1}{\partial u_1}&\dfrac{\partial x_1}{\partial u_2}\\ \dfrac{\partial x_2}{\partial u_1}&\dfrac{\partial x_2}{\partial u_2}\end{pmatrix} \begin{pmatrix} \dfrac{d u_1}{d t}\\\dfrac{d u_2}{d t}\end{pmatrix}. \end{aligned}\]
The idea of the Jacobian combines ideas from calculus and linear algebra. Suppose we have a function of several variables, \(f(x_1,x_2, \ldots, x_n)\). Then the Jacobian is a row vector consisting of the partial derivatives of \(f\) with respect to each variable, \[J = \begin{pmatrix} \dfrac{\partial f}{\partial x_1}, & \dfrac{\partial f}{\partial x_2}, & \dots, & \dfrac{\partial f}{\partial x_n}\end{pmatrix}. \]
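A small worked instance may help here. The functions below are hypothetical, picked only because the answer is easy to confirm by hand: \(f = x_1x_2\), \(x_1 = u_1 + u_2\), \(x_2 = u_1u_2\), \(u_1 = t\) and \(u_2 = t^2\). Substituting gives \(f = t^4 + t^5\), so \(df/dt = 4t^3 + 5t^4\). The sketch multiplies the row vector, the matrix and the column vector in that order.

```python
def chain_rule_derivative(t):
    """df/dt as (row vector) x (2x2 matrix) x (column vector)."""
    # Hypothetical example: u1 = t, u2 = t^2, x1 = u1 + u2, x2 = u1 * u2, f = x1 * x2.
    u1, u2 = t, t**2
    x1, x2 = u1 + u2, u1 * u2

    grad_f = [x2, x1]                 # (df/dx1, df/dx2)
    jac_x = [[1.0, 1.0], [u2, u1]]    # entry [i][j] is dx_i/du_j
    du_dt = [1.0, 2 * t]              # (du1/dt, du2/dt)

    # Row vector times matrix gives df/du as a row vector.
    df_du = [sum(grad_f[i] * jac_x[i][j] for i in range(2)) for j in range(2)]
    # Row vector times column vector gives the scalar df/dt.
    return sum(df_du[j] * du_dt[j] for j in range(2))


def direct_derivative(t):
    """Derivative of the substituted expression f = t^4 + t^5."""
    return 4 * t**3 + 5 * t**4


print(chain_rule_derivative(1.5), direct_derivative(1.5))
```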
For example, for \(f(x,y,z) = x^2y + 3z\), we have \[J = \begin{pmatrix}2xy, & x^2, & 3 \end{pmatrix}.\]
The Jacobian, when evaluated at a particular point, will give us a vector pointing in the direction of the steepest slope of the function. Moreover, the size of this slope (rate of change) is the length of the Jacobian. For example, the Jacobian for the previous example at the origin is \(J(0,0,0) = (0,0,3)\) and so the direction of the steepest slope is entirely in the \(z\) direction and that rate of change is 3.
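The Jacobian of \(f(x,y,z) = x^2y + 3z\) and its length can be estimated numerically as a check on the values quoted above. This is a sketch only: the helper names and the second evaluation point \((1, 2, 0)\) are our own assumptions.

```python
import math


def f(x, y, z):
    return x**2 * y + 3 * z


def numerical_jacobian(func, point, step=1e-6):
    """Row vector of central-difference partial derivatives of func at point."""
    row = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += step
        down[i] -= step
        row.append((func(*up) - func(*down)) / (2 * step))
    return row


def length(vector):
    """Euclidean length, read here as the steepest rate of change."""
    return math.sqrt(sum(component**2 for component in vector))


at_origin = numerical_jacobian(f, (0.0, 0.0, 0.0))
at_point = numerical_jacobian(f, (1.0, 2.0, 0.0))  # arbitrary second point

print(at_origin, length(at_origin))
print(at_point, length(at_point))
```

At \((1, 2, 0)\) the formula \(J = (2xy, x^2, 3)\) gives \((4, 1, 3)\), which the numerical estimate should reproduce closely.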
The Jacobian is also orthogonal (perpendicular) to the level set \(f(x_1,x_2, \ldots, x_n) = k\) (a level curve when there are two variables) at the point \((x_1, x_2, \ldots, x_n)\).
As another example, consider \[f(x,y) = e^{-(x^2+y^2)}.\] This has Jacobian given by \[J = \begin{pmatrix} -2xe^{-(x^2+y^2)},& -2ye^{-(x^2+y^2)} \end{pmatrix}.\] If we evaluate the Jacobian at the origin \((0,0)\), we notice that the Jacobian is 0. This means that this point is a maximum, a minimum or a saddle point. To investigate further, we have included a vector field for \(f\) below created simply by plotting Jacobian vectors for various values of \(x\) and \(y\).
We can see that the origin is a maximum value. This fact is confirmed with a contour plot and a 3D graph.
One last note is that the expression for \(\partial f/ \partial \mathbf{x}\), from our examination of the chain rule and total derivative, is exactly the same as the Jacobian vector. Thus, we can rewrite our expression for the chain rule as \[\dfrac{df}{dt} = J_f \dfrac{d \mathbf{x}}{dt}.\]
To explain the idea of a Jacobian matrix, consider two functions \[ \begin{aligned} u(x,y) &= x-2y \\ v(x,y) &= 3y - 2x \end{aligned}.\] We can make Jacobian vectors for both \(u\) and \(v\) \[ \begin{aligned} J_u &= \begin{pmatrix} \partial u/ \partial x & \partial u/ \partial y \end{pmatrix} \\ J_v &= \begin{pmatrix} \partial v/ \partial x & \partial v/ \partial y \end{pmatrix} \end{aligned}.\] However, it makes far more sense to put these into a single matrix, \(J\) where \[J = \begin{pmatrix} \partial u/ \partial x & \partial u/ \partial y \\ \partial v/ \partial x & \partial v/ \partial y \end{pmatrix}.\]
For \[ \begin{aligned} u(x,y) &= x-2y \\ v(x,y) &= 3y - 2x \end{aligned},\] the Jacobian is \[J = \begin{pmatrix} 1 & -2 \\ -2 & 3 \end{pmatrix}.\] Of course, most examples will be more complicated than this example.
Another example is given by the transition between polar and Cartesian coordinate systems. In this example, \[ \begin{aligned} x(r, \theta) &= r\cos(\theta) \\ y(r, \theta) &= r\sin(\theta) \end{aligned}\] and the Jacobian matrix is given by \[J = \begin{pmatrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{pmatrix},\] with \(|J| = r(\cos^2\theta + \sin^2\theta) = r\).
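The determinant \(|J| = r\) is easy to confirm at a sample point. In this short sketch the function names and the point \((r, \theta) = (2.5, 0.7)\) are illustrative assumptions; the last line applies the same determinant to the linear example with \(u\) and \(v\).

```python
import math


def polar_jacobian(r, theta):
    """Jacobian matrix of (x, y) = (r cos(theta), r sin(theta)) with respect to (r, theta)."""
    return [[math.cos(theta), -r * math.sin(theta)],
            [math.sin(theta), r * math.cos(theta)]]


def determinant_2x2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


r0, theta0 = 2.5, 0.7  # arbitrary sample point
polar_det = determinant_2x2(polar_jacobian(r0, theta0))
linear_det = determinant_2x2([[1, -2], [-2, 3]])  # Jacobian of u = x - 2y, v = 3y - 2x

print(polar_det)   # should be close to r0
print(linear_det)  # 1*3 - (-2)*(-2)
```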
Optimization is important in data science. There are many examples where optimization problems arise in everyday life. For example, we may look for the quickest travel route from one point to another.
Mathematically, a critical point is a point where the rate of change of a function is 0 or undefined. Local extreme values (maxima or minima) occur at such points. The largest of all of the maxima is called the global maximum and the other maxima are called local maxima. The smallest of the minima is called the global minimum while the other minima are called local minima. A point which has a rate of change of 0 but is neither a maximum nor a minimum is called a saddle point.
Recall that the Jacobian vector always points uphill. Jacobian vectors do not always point toward the top of the tallest peak, but they do point in an upward direction. Simply following Jacobian vectors may therefore lead you to a local maximum rather than the global maximum.
The Hessian is a simple extension of the Jacobian. The Hessian matrix is the matrix consisting of all second order partial derivatives with respect to each pair of variables. Rather than using a notation like \(\partial f/ \partial x\), we use the \(f_x\) notation to denote the first derivative of \(f\) with respect to \(x\) and \(f_{xy}\) to denote the second derivative of \(f\) first with respect to \(x\) and then with respect to \(y\). Thus, the Hessian is \[H = \begin{pmatrix} f_{x_1x_1} & f_{x_1x_2} & \cdots & f_{x_1x_n} \\ f_{x_2x_1} & f_{x_2x_2} & \cdots & f_{x_2x_n} \\ \vdots & \vdots & \ddots & \vdots \\ f_{x_nx_1} & f_{x_nx_2} & \cdots & f_{x_nx_n} \end{pmatrix}.\]
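One way to picture "following Jacobian vectors" is gradient ascent with a fixed step size, sketched below for the earlier example \(f(x,y) = e^{-(x^2+y^2)}\). The step size, the number of iterations and the starting point are our own assumptions, and this is an illustration of the idea rather than a recommended optimizer.

```python
import math


def f(x, y):
    return math.exp(-(x**2 + y**2))


def jacobian(x, y):
    """Jacobian vector (df/dx, df/dy) of f."""
    value = f(x, y)
    return (-2 * x * value, -2 * y * value)


def gradient_ascent(x, y, step_size=0.1, iterations=200):
    """Repeatedly take a small step in the direction of the Jacobian vector."""
    for _ in range(iterations):
        dx, dy = jacobian(x, y)
        x, y = x + step_size * dx, y + step_size * dy
    return x, y


# Assumed starting point; any nearby point climbs toward the same peak.
x_final, y_final = gradient_ascent(1.0, -0.5)
print(x_final, y_final, f(x_final, y_final))
```

This function has a single peak, so the climb ends at the global maximum at the origin. On a surface with several peaks, the same procedure stops at whichever local maximum lies uphill from the starting point.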
For the function, \(f(x,y,z) = x^2yz\), the Jacobian is \(J = \begin{pmatrix} 2xyz, & x^2z, & x^2y \end{pmatrix}\) and so the Hessian matrix is \[\begin{pmatrix} 2yz & 2xz & 2xy\\ 2xz & 0 & x^2\\ 2xy & x^2 & 0 \end{pmatrix}.\]
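The Hessian of \(f(x,y,z) = x^2yz\) can be approximated with second-order finite differences and compared against the matrix above. This is a sketch under stated assumptions: the helper `numerical_hessian`, the step size and the evaluation point \((1, 2, 3)\) are illustrative choices. At that point the analytic Hessian has rows \((12, 6, 4)\), \((6, 0, 1)\) and \((4, 1, 0)\).

```python
def f(x, y, z):
    return x**2 * y * z


def numerical_hessian(func, point, step=1e-4):
    """Matrix of central-difference second partial derivatives of func at point."""
    n = len(point)
    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            # Combine the four shifted evaluations with signs (+, -, -, +).
            for sign_i in (1, -1):
                for sign_j in (1, -1):
                    shifted = list(point)
                    shifted[i] += sign_i * step
                    shifted[j] += sign_j * step
                    total += sign_i * sign_j * func(*shifted)
            hessian[i][j] = total / (4 * step**2)
    return hessian


hessian = numerical_hessian(f, (1.0, 2.0, 3.0))  # arbitrary evaluation point
for row in hessian:
    print([round(entry, 4) for entry in row])
```

The output should be symmetric up to rounding, reflecting that the order of differentiation does not matter for this function.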
Consider the function \(f(x,y) = x^2 + y^2\) with Jacobian \(J = (2x, 2y)\) and Hessian \(H = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}\). Note that \((0,0)\) is a critical point of \(f\). However, without knowing something about \(f\), we don’t know if this is a maximum or minimum (of course, we have \(f\), so pretend that we didn’t have \(f\) for now). The determinant of the Hessian matrix is 4. Now, for a function of two variables, if the determinant of the Hessian matrix is positive, the critical point is either a maximum or a minimum. (If the determinant of the Hessian matrix is negative, the critical point is neither a maximum nor a minimum but what is referred to as a saddle point.) If, additionally, \(f_{x x} > 0\), then the critical point is a minimum (if it is negative, the point is a maximum). Here \(f_{xx} = 2 > 0\), so \((0,0)\) is a minimum.
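The test just described for two-variable functions can be written as a short decision procedure. This is a sketch, not a library routine: the function name is our own, it assumes \(f_{xy} = f_{yx}\), and the third call uses a hypothetical function \(x^2 - y^2\) that does not appear elsewhere in this paper. A determinant of exactly 0 is reported as inconclusive, since the test gives no verdict in that case.

```python
def classify_critical_point(f_xx, f_xy, f_yy):
    """Second-derivative test at a critical point of a two-variable function.

    Assumes the mixed partials agree, so the Hessian is [[f_xx, f_xy], [f_xy, f_yy]].
    """
    determinant = f_xx * f_yy - f_xy**2
    if determinant > 0:
        return "minimum" if f_xx > 0 else "maximum"
    if determinant < 0:
        return "saddle point"
    return "inconclusive"  # the test gives no verdict when the determinant is 0


print(classify_critical_point(2, 0, 2))    # f = x^2 + y^2 at (0, 0)
print(classify_critical_point(-2, 0, -2))  # f = exp(-(x^2 + y^2)) at (0, 0)
print(classify_critical_point(2, 0, -2))   # hypothetical f = x^2 - y^2 at (0, 0)
```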