Introduction

Machine learning is a set of powerful mathematical tools that enable us to represent, interpret, and control the complex world around us. The purpose of this section is to understand the math underlying these ideas. We want to introduce the notational conventions, operations, and ideas of linear algebra. This will allow you to manipulate large systems of equations conveniently.

Vectors and Matrices

Consider the problem of solving the system of equations given by \[3x + 4y + 2z= 10\] \[2x - y + 5z= 3\] \[x-5y + 4z = -3.\] It is often difficult to solve large systems of equations by hand, so we may want to use a computer algebra system such as MAPLE or a language such as R. Solving a system of equations is an example of a linear algebra problem. We have constant linear coefficients that relate the input variables \(x\), \(y\) and \(z\) to the outputs 10, 3 and -3.

We can think about a vector \(\begin{pmatrix}x \\ y \\ z \end{pmatrix}\) that describes the values of \(x\), \(y\) and \(z\). The constant terms can be described in a similar way \(\begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix}\). We can also write down a matrix of coefficients \(\begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix}\). This system can be expressed as \[\begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix},\] using matrix notation.
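
Since R was just mentioned as a tool for this kind of problem, here is a minimal sketch of how the coefficient matrix and the constant vector could be entered; the names A and b are our own choices.

```r
A <- matrix(c(3, 4, 2,
              2, -1, 5,
              1, -5, 4), nrow = 3, byrow = TRUE)  # matrix of coefficients, filled row by row
b <- c(10, 3, -3)                                 # vector of constants
A
b
```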

We can think of a vector as an object that moves about space. This could be a physical space like \(\mathbb{R}^n\), or a space of data. In data science, we think of a vector as a list of attributes of an object, whilst in mathematics and physics we normally think of a vector as moving around physical space.

As an example, consider the attributes of a house. The house could be 2500 square feet, it might have three bedrooms, it might have two bathrooms and it might be worth $500,000. Thus, the vector representing this house would be \[\begin{pmatrix} 2500 \text{ft}^2\\ 3\text{ bedrooms}\\ 2 \text{ bathrooms}\\ \$500,000 \end{pmatrix} .\] Therefore, the generalized idea of moving about space includes describing the attributes of an object.

Vectors follow two basic rules: addition and multiplication by a scalar number. If we think about a vector as a geometric object starting at the origin, then vector addition is done tip to tail.

Here, \(r = \begin{pmatrix}4\\2 \end{pmatrix}\) while \(s = \begin{pmatrix} 1 \\ 3 \end{pmatrix}\). We shift \(s\) so that the tail of \(s\) is at the tip of \(r\), and the resulting vector from the origin to the tip of the new location of \(s\) is defined to be \(r + s\). If you compute \(s + r\), you will get the same vector and thus \(r+s=s+r\).

Scalar multiplication is defined by repeated addition of a vector.

We could also create vectors that are half as long or twice as long, etc. What we mean by \(-a\) is a vector with the same length as \(a\) going in the opposite direction.

We define space by two vectors: the one of length 1 that moves left to right is called \(i\), and the one of length 1 that moves vertically is called \(j\). Thus, the vector \(r = \begin{pmatrix} 4\\2 \end{pmatrix} = 4i + 2j\).

As another example, consider \(a = \begin{pmatrix} 2\\1 \end{pmatrix}\) and \(b = \begin{pmatrix}3 \\-4\end{pmatrix}\). We would then say \[a + b = \begin{pmatrix} 2\\1 \end{pmatrix}+ \begin{pmatrix}3 \\-4\end{pmatrix} = (2i+1j) + (3i - 4j) = 5i-3j = \begin{pmatrix}5\\-3 \end{pmatrix}.\]

Note that one can show, using this definition, that associativity applies. Addition of vectors can be done in any order with as many vectors as you want. For scalar multiplication, we can see that \[ -2a = -2 \begin{pmatrix} 2\\1 \end{pmatrix} = -2(2i+1j) = -4i-2j = \begin{pmatrix} -4\\-2 \end{pmatrix}.\] It is also worth noting that this defines subtraction: subtracting a vector is simply adding \(-1\) times that vector.
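
As a quick check of these rules, here is a small R sketch using the vectors a and b from the example above.

```r
a <- c(2, 1)
b <- c(3, -4)
a + b     # componentwise addition gives (5, -3)
b + a     # same result: vector addition is commutative
-2 * a    # scalar multiplication gives (-4, -2)
a - b     # subtraction is addition of -1 times b: (-1, 5)
```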

Referring to the house example, if we added two house vectors together, we would have 5000 square feet, 6 bedrooms, 4 bathrooms and it would cost us $1,000,000. That is what we mean by twice a house - or simply two houses next to each other.

Vectors

We define two things: the length of a vector, also called its size, and the dot product of two vectors, also called the inner, scalar, or projection product.

The Modulus

When we defined a vector, we did it without reference to any coordinate system. In fact, the geometric object has just two properties: its length and its direction. So irrespective of the coordinate system we decide to use, we want to know how to calculate these two properties of length and direction. If the coordinate system is constructed out of two unit vectors that are orthogonal to each other, like \(i = \begin{pmatrix}1 \\ 0 \end{pmatrix}\) and \(j = \begin{pmatrix}0 \\ 1 \end{pmatrix}\) in \(\mathbb{R}^2\), then we can say that a vector \(r\) is a linear combination of \(i\) and \(j\). That is, \[r =a \times i + b \times j.\] When we say \(i\) and \(j\) are unit vectors, we mean their length is 1.

By the Pythagorean Theorem, the length of \(r\) is given by the hypotenuse. So, if we draw a triangle, the horizontal side is \(ai\), which has length \(a\) because \(i\) is of length one. The perpendicular side is \(bj\) with length \(b\). So \(|r|^2 = a^2 + b^2\).

Rather than dealing with \(i\) and \(j\), we can write \(r = ai + bj = \begin{pmatrix}a \\ b \end{pmatrix}\). For example, \[r = 3i + 5j = \begin{pmatrix}3 \\ 5 \end{pmatrix}.\]

The analysis so far has been for two spatial directions defined by unit vectors \(i\) and \(j\) that are at right angles to each other. However, the definition works more generally. The length of a vector \(r\), denoted \(|r|\), is always the square root of the sum of the squares of the components, \[|r| = \left|\begin{pmatrix} a \\ b \\ \vdots \\ n \end{pmatrix} \right| = \sqrt{a^2 + b^2 + \cdots + n^2}.\] In our example, the length of \(r\) is \[|r| = \sqrt{3^2+5^2} = \sqrt{34}.\]
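
In R, this length can be computed directly from the definition; a minimal sketch using the example vector:

```r
r <- c(3, 5)
sqrt(sum(r^2))            # |r| = sqrt(3^2 + 5^2) = sqrt(34), about 5.831
drop(sqrt(crossprod(r)))  # the same length computed via t(r) %*% r
```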

The Dot Product

The dot product is one way to multiply two vectors. Given two vectors, \(r = \begin{pmatrix} r_i \\ r_j \end{pmatrix}\) and \(s = \begin{pmatrix} s_i \\ s_j \end{pmatrix}\), then \[r \cdot s = r_is_i + r_js_j.\] For longer vectors with the same number of components, the dot product is defined similarly. For example, if \(r = \begin{pmatrix} 2 \\ 1 \end{pmatrix}\) and \(s = \begin{pmatrix} 4 \\ 3 \end{pmatrix}\), then \[r \cdot s = 2 \times 4 + 1\times 3 = 11.\]

Note that the dot product is a scalar number. It is also commutative, which means that \(r \cdot s = s \cdot r\). Next, the dot product is distributive over addition; thus \(r \cdot (s + t) = r \cdot s + r \cdot t\). The dot product is also associative over scalar multiplication: for a scalar \(a\) and vectors \(r\) and \(s\), we know that \(r \cdot(as) = a(r \cdot s)\). Finally, we note that \(r \cdot r = |r|^2\).

The law of cosines from trigonometry states that \[c^2 = a^2 + b^2 - 2ab\cos \theta,\] where \(\theta\) is the angle between sides \(a\) and \(b\).

We translate that into a vector notation with vectors \(r\) and \(s\). In terms of sizes, we can then say that \[|r-s|^2 = |r|^2 + |s|^2 - 2|r||s|\cos \theta,\] where \(\theta\) is the angle between \(r\) and \(s\).

We can multiply this out using our dot-product. Here \[|r - s|^2 = (r-s) \cdot (r-s) = r \cdot r - 2s \cdot r + s \cdot s = |r|^2 - 2 s \cdot r + |s|^2.\] Comparing that to the cosine law, we see that \[s \cdot r = |r||s| \cos \theta.\] Thus, we find out something about the extent to which the vectors go in the same direction. If the vectors are perpendicular to each other, \(\cos \theta = 0\), so \(s \cdot r = 0\). If the vectors are parallel, \(\cos \theta = 1\), so \(s \cdot r = |s||r|\). If \(s\) and \(r\) are in opposite directions, then \(\cos \theta = -1\) and \(s \cdot r = -|s||r|\).
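
As a small illustration, the following R sketch recovers the angle between the two example vectors r and s from this relationship; the name cos_theta is our own.

```r
r <- c(2, 1)
s <- c(4, 3)
dot <- sum(r * s)                                    # r . s = 11
cos_theta <- dot / (sqrt(sum(r^2)) * sqrt(sum(s^2))) # cos of the angle between r and s
acos(cos_theta) * 180 / pi                           # the angle in degrees, about 10.3
```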

Projections

Consider the right triangle formed when we take vectors \(r\) and \(s\) and drop a line from the tip of \(s\) perpendicular to \(r\). The projection can be thought of as the shadow of \(s\) onto \(r\). The length of that shadow is called the scalar projection of \(s\) onto \(r\), and we can determine it by \[\text{scalar projection of } s \text{ onto } r = \dfrac{r \cdot s}{|r|}.\] To determine the corresponding vector, called the vector projection of \(s\) onto \(r\), we multiply the scalar projection by a unit vector in the direction of \(r\): \[\text{vector projection of } s \text{ onto } r = \dfrac{r \cdot s}{|r|} \times \dfrac{r}{|r|}.\]

Changing the Basis

What we haven’t talked about is the coordinate system that we use to describe space. Consider the coordinate system defined by unit vectors \(e_1\) and \(e_2\).

Here, \(r = 3e_1 + 4e_2 = \begin{pmatrix}3 \\ 4 \end{pmatrix}\). However, the selection of \(e_1\) and \(e_2\) is a little arbitrary. For example, we could set up a green coordinate system with vectors \(b_1 = \begin{pmatrix}2 \\ 1 \end{pmatrix}\) and \(b_2 = \begin{pmatrix} -2\\4\end{pmatrix}\).

We can now describe \(r\) in terms of the vectors \(b_1\) and \(b_2\). The vectors we use to define the space (be they \(e\)’s, \(b\)’s or something else) are called basis vectors. Thus, the numbers used to describe \(r\) only have meaning if we know the basis vectors. In the basis defined by the \(e\) vectors, we describe \(r\) with the vector \(r_e= \begin{pmatrix} 3 \\ 4\end{pmatrix}\). In the basis described by the \(b\) vectors, we would describe \(r\) as \(r_b= \begin{pmatrix} 2 \\ 1/2 \end{pmatrix}\). The vector \(r\) itself exists completely independently of the coordinate system we use to write down its list of components.

Now, if the new basis vectors, \(b\), are perpendicular to each other, then the dot product has a nice application. We can determine the value of \(r_b\) if we can describe the \(b\) vectors in terms of the \(e\) vectors. Here, \(b_1 = \begin{pmatrix}2 \\ 1 \end{pmatrix}\) and \(b_2 = \begin{pmatrix} -2\\4\end{pmatrix}\) in terms of the \(e\) vectors. (In this case, the \(b\) vectors are orthogonal; this can be checked by taking their dot product.) By projecting \(r\) onto \(b_1\) and \(b_2\), we find two vectors in the directions of \(b_1\) and \(b_2\), each with a length representing how much of that basis vector is required to create \(r\). Looking at the diagram below, it appears we need 2 \(b_1\) vectors and only 1/2 of a \(b_2\) vector.

Mathematically, we calculate \(r_b\) by taking the scalar projection of \(r\) onto each basis vector and dividing by that vector's length. The component of \(r\) along \(b_1\) is \[\dfrac{r_e \cdot b_1}{|b_1|^2} = \dfrac{3(2) + 4(1)}{\sqrt{2^2+1^2}^2} = \dfrac{10}{5} = 2.\] The component of \(r\) along \(b_2\) is \[\dfrac{r_e \cdot b_2}{|b_2|^2} = \dfrac{3(-2) + 4(4)}{\sqrt{(-2)^2+4^2}^2} = \dfrac{10}{20} = \dfrac{1}{2}.\]

Thus, we can see that \(r_b\) is simply the vector that has first component 2 and second component 1/2, \(r_b = \begin{pmatrix} 2 \\ 1/2 \end{pmatrix}\), as illustrated in the diagram. Therefore, vectors can be re-described using a basis that is not the traditional basis provided that the new basis vectors are orthogonal to each other.
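
A short R sketch of this change of basis, reusing the vectors above (the names r_e, b1, b2 and r_b are our own):

```r
r_e <- c(3, 4)
b1  <- c(2, 1)
b2  <- c(-2, 4)
sum(b1 * b2)                         # 0, so b1 and b2 are orthogonal
r_b <- c(sum(r_e * b1) / sum(b1^2),  # component along b1: 2
         sum(r_e * b2) / sum(b2^2))  # component along b2: 0.5
r_b
r_b[1] * b1 + r_b[2] * b2            # reconstructs r_e = (3, 4)
```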

Linear Independence

A basis is a set of \(n\) vectors that are not linear combinations of each other (they are said to be linearly independent) and they span (describe in terms of linear combinations) all points in the space. The space is then \(n\) dimensional.

For example, consider the vectors \(b_1\) and \(b_2\). By taking linear combinations of \(b_1\) and \(b_2\), we can create any vector in \(\mathbb{R}^2\). If we consider a third vector \(b_3\), it must not be a linear combination of \(b_1\) and \(b_2\) to be considered part of the basis. Thus, \[b_3 \not = a_1 b_1 + a_2 b_2,\] for any real numbers \(a_1\) and \(a_2\). If this holds, we say \(b_3\) is linearly independent of \(b_1\) and \(b_2\).

Note that our basis vectors need not be unit vectors, nor do they have to be orthogonal. However, calculations are easier if they are, so a good trick is to use unit vectors that are orthogonal to each other.

An Application

Consider a neural network in machine learning that recognizes faces. This sort of neural network may want to make some transformation of all the pixels into a new basis that describes the nose shape, the skin tone, the distance between the eyes, etc. Thus, the goal of the learning process of the neural network is going to be to derive a set of basis vectors that extract the most information-rich features of the faces.

Matrices

We have seen the problem of solving the system of equations given by \[3x + 4y + 2z= 10\] \[2x - y + 5z= 3\] \[x-5y + 4z = -3.\] We begin by constructing the matrix equation \[\begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix},\] representing the problem. We multiply the matrix by the vector as follows: \[\begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}3x+ 4y +2z\\ 2x -1y +5z\\ 1x-5y +4z \end{pmatrix}.\] So, we take the entries in row one, multiply each one by the corresponding entry in the vector, and add them up. We do the same with rows 2 and 3. This gives us back the original system of equations.

Using Matrices to Transform Space

Notice what happens if we take a matrix \(\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\) and multiply it by the unit vector \(\begin{pmatrix} 1 \\ 0 \end{pmatrix}\). The result is \[\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 3 \end{pmatrix}.\] Similarly, if we multiply by the other unit vector, we get \[\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}.\] Thus, multiplying a matrix by a standard unit vector simply picks out the corresponding column of the matrix.

Some important properties of matrices are that for a matrix \(A\), vectors \(r\) and \(s\), and a scalar \(n\), \[A(nr) = n(Ar)\] \[A(r + s) = Ar + As.\] Playing around with a few quick examples will help you verify these properties.
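
A quick R sketch that checks these two properties for the matrix used above (the particular values of r, s and n are our own choices):

```r
A <- matrix(c(1, 2,
              3, 4), nrow = 2, byrow = TRUE)
r <- c(1, 0); s <- c(0, 1); n <- 3
A %*% r                                   # picks out the first column: (1, 3)
all(A %*% (n * r) == n * (A %*% r))       # TRUE
all(A %*% (r + s) == A %*% r + A %*% s)   # TRUE
```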

Special Transformations

We begin by defining the identity matrix \(I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\). Multiplying a vector or matrix by \(I\) leaves the vector or matrix unchanged. That is, \(\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix}\) for any vector \(\begin{pmatrix} x \\ y \end{pmatrix}\). To scale space by \(a\) in the \(x\) direction and \(b\) in the \(y\) direction, we replace \(I\) by the matrix \(\begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}\). For example, \[\begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} ax \\ by \end{pmatrix}.\] Thus, whatever we had as our \(\begin{pmatrix} x \\ y \end{pmatrix}\) will be stretched (or shrunk) \(a\) times in the \(x\) direction and \(b\) times in the \(y\) direction. In essence, we have changed the size of our grid squares from \(1 \times 1\) to \(a \times b\).

If we want to flip space in the negative direction, we simply make the entries in the identity matrix negative rather than positive. Here, \(\begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -x \\ -y \end{pmatrix}\). Combining this type of transformation with an expansion or contraction allows us to create grids of any size in any direction.

Another matrix to consider is \(\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\). This matrix flips everything about the 45 degree line \(y = x\). The same matrix with \(-1\) in place of each 1 flips everything about the line \(y = -x\).

While the goal here is not to teach linear algebra, we do need some of these ideas for data science. For example, for facial recognition, we cannot assume that every picture has a face that faces forward, is exactly centered, and has exactly the same level of focus. Thus, we need to stretch, shear or rotate the face using these sorts of transformations.

Matrix Composition

If you want to do any shape alteration, such as to all the pixels in an image of a face or something like that, then you can always make that shape change out of some combination of rotations, shears, stretches and inverses. That is, one can apply one transformation \(A_1\) to a vector \(r\) and then apply a second transformation \(A_2\) to the result; one has then, in essence, performed a composition of two transformations.

Consider an example in which we want to rotate our vector 90 degrees clockwise. In this case, our first standard basis vector \(e_1\) would move to \(\begin{pmatrix} 0 \\ -1 \end{pmatrix}\) and our second standard basis vector \(e_2\) would be rotated to become \(\begin{pmatrix} 1\\0 \end{pmatrix}\). So, our first transformation \(A_1 = \begin{pmatrix} 0 &1\\ -1 & 0 \end{pmatrix}\).

The second transformation we want to make is to take the original basis vectors and mirror them across the \(y\)-axis. Thus, the first basis vector \(e_1\) becomes \(\begin{pmatrix} -1\\0 \end{pmatrix}\) and the second basis vector \(e_2\) is unchanged. Thus, \(A_2 = \begin{pmatrix} -1 & 0\\0 & 1 \end{pmatrix}\).

Putting these transformations together, we get \(A_2 A_1 = \begin{pmatrix} -1 & 0\\0 & 1 \end{pmatrix} \begin{pmatrix} 0 &1\\ -1 & 0 \end{pmatrix}= \begin{pmatrix}0 & -1\\ -1 & 0 \end{pmatrix}\). Without thinking about this geometrically, the multiplication is done by taking the dot product of each row of \(A_2\) with each column of \(A_1\), covering every row-column combination.

Note that applying \(A_2\) after \(A_1\) is not the same thing as applying \(A_1\) after \(A_2\). Thus, matrix multiplication (composition of matrix transformations) is not commutative. We can show this by noting that \(A_1 A_2 = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}\). So, while matrix multiplication is associative, \[A_3(A_2A_1) = (A_3A_2)A_1,\] it is not commutative and \(A_1A_2 \not = A_2A_1,\) in general.
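
Here is a brief R sketch of this composition, which also shows that reversing the order gives a different matrix:

```r
A1 <- matrix(c(0, 1,
               -1, 0), nrow = 2, byrow = TRUE)  # 90-degree clockwise rotation
A2 <- matrix(c(-1, 0,
               0, 1), nrow = 2, byrow = TRUE)   # mirror across the y-axis
A2 %*% A1   # A1 first, then A2: the matrix with 0, -1 / -1, 0
A1 %*% A2   # the other order gives a different matrix, so the product is not commutative
```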

Gaussian Elimination

Finally, we can solve our system of equations problem given by \[3x + 4y + 2z= 10\] \[2x - y + 5z= 3\] \[x-5y + 4z = -3.\]

We rewrote this as \[\begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix}.\]

A matrix is in row echelon form if it has the form \[\begin{pmatrix}1 & a & b\\0 & 1 & c \\ 0 & 0 & 1 \end{pmatrix}.\] By adding, subtracting, or scaling rows of a matrix, you can convert a matrix to row echelon form without changing what the matrix tells you. In essence, you would be solving a system of equations by elimination - though, in a matrix form.

\[\begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix}\] First, we replace row 1 by row 1 minus row 2 (leave the xyz vector alone since that just holds the places of the variables).

\[\begin{pmatrix}1 & 5 & -3\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}7 \\ 3 \\ -3 \end{pmatrix}\]

Next, we can now replace row 2 with 2 times row 1 minus row 2 and row 3 with row 1 minus row 3.

\[\begin{pmatrix}1 & 5 & -3\\ 0 & 11 & -11\\ 0 & 10 & -7 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}7 \\ 11 \\ 10 \end{pmatrix}\] Next, we divide row 2 by 11.

\[\begin{pmatrix}1 & 5 & -3\\ 0 & 1 & -1\\ 0 & 10 & -7 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}7 \\ 1 \\ 10 \end{pmatrix}\]

We subtract 10 times row 2 from row 3 and replace row 3 with that row.

\[\begin{pmatrix}1 & 5 & -3\\ 0 & 1 & -1\\ 0 & 0 & 3 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}7 \\ 1 \\ 0 \end{pmatrix}\] Finally, we divide row 3 by 3 to get row echelon form.

\[\begin{pmatrix}1 & 5 & -3\\ 0 & 1 & -1\\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}7 \\ 1 \\ 0 \end{pmatrix}\]

So, we know that \(z = 0\), and by back substituting in row 2 (essentially adding rows 2 and 3), we know \(y = 1\).

\[\begin{pmatrix}1 & 5 & -3\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}7 \\ 1 \\ 0 \end{pmatrix}\]

By back substituting in row 1 (adding 3 row 3 and -5 row 2 to row 1), we know that \(x = 2\).

\[\begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}2 \\ 1 \\ 0 \end{pmatrix}\]

Thus, we have solved this particular case and by doing so, we have created an identity matrix on the left side. The general case will require matrix inversion but will rely heavily on our ability to create this identity matrix.
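
In practice, we would let R do this elimination for us; a minimal sketch for the system above:

```r
A <- matrix(c(3, 4, 2,
              2, -1, 5,
              1, -5, 4), nrow = 3, byrow = TRUE)
b <- c(10, 3, -3)
solve(A, b)   # returns 2, 1, 0, matching the elimination above
```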

Matrix Inversion

Now, consider a matrix \(A^{-1}\) with the property that \(A^{-1}A = I\), where \(A = \begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix}\). In this scenario, we could solve this matrix equation by noting that \[\begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = \begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix}\] \[A^{-1}\begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix} \begin{pmatrix}x \\ y \\ z \end{pmatrix} = A^{-1}\begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix}\] \[I \begin{pmatrix}x \\ y \\ z \end{pmatrix} = A^{-1}\begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix}\] \[\begin{pmatrix}x \\ y \\ z \end{pmatrix} = A^{-1}\begin{pmatrix}10 \\ 3 \\ -3 \end{pmatrix}\]

In order to find this matrix \(A^{-1} = \begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix}\), we write \[ \begin{pmatrix}3 & 4 & 2\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix}\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix} = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}, \] with the identity matrix on the right side of the equals sign. What we’re going to do now is reduce \(A\) all the way to the identity matrix while applying the same row operations to the identity matrix on the right simultaneously.

As before, row 1 will be row 1 minus row 2. \[ \begin{pmatrix}1 & 5 & -3\\ 2 & -1 & 5\\ 1 & -5 & 4 \end{pmatrix}\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix} = \begin{pmatrix}1 & -1 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\]

Next, row 2 becomes 2 times row 1 minus row 2 and row 3 becomes row 1 minus row 3.

\[ \begin{pmatrix}1 & 5 & -3\\ 0 & 11 & -11\\ 0 & 10 & -7 \end{pmatrix}\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix} = \begin{pmatrix}1 & -1 & 0\\ 2 & -3 & 0\\ 1 & -1 & -1 \end{pmatrix}\]

Next, we replace row 2 by row 2 minus row 3.

\[ \begin{pmatrix}1 & 5 & -3\\ 0 & 1 & -4\\ 0 & 10 & -7 \end{pmatrix}\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix} = \begin{pmatrix}1 & -1 & 0\\ 1 & -2 & 1\\ 1 & -1 & -1 \end{pmatrix}\]

We replace row 3 by 10 row 2 minus row 3.

\[ \begin{pmatrix}1 & 5 & -3\\ 0 & 1 & -4\\ 0 & 0 & -33 \end{pmatrix}\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix} = \begin{pmatrix}1 & -1 & 0\\ 1 & -2 & 1\\ 9 & -19 & 11 \end{pmatrix}\] Divide row 3 by -33.

\[ \begin{pmatrix}1 & 5 & -3\\ 0 & 1 & -4\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix} = \begin{pmatrix}1 & -1 & 0\\ 1 & -2 & 1\\ -3/11 & 19/33 & -1/3 \end{pmatrix}\] We add 4 row 3 to row 2.

\[ \begin{pmatrix}1 & 5 & -3\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix} = \begin{pmatrix}1 & -1 & 0\\ -1/11 & 10/33 & -1/3\\ -3/11 & 19/33 & -1/3 \end{pmatrix}\]

Finally, we subtract 5 times row 2 from row 1 and add 3 times row 3 to row 1.

\[ \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{pmatrix} = \begin{pmatrix}7/11 & -26/33 & 2/3\\ -1/11 & 10/33 & -1/3\\ -3/11 & 19/33 & -1/3 \end{pmatrix}\]

The inverse matrix for \(A\) is the matrix on the right side \(A^{-1} = \begin{pmatrix}7/11 & -26/33 & 2/3\\ -1/11 & 10/33 & -1/3\\ -3/11 & 19/33 & -1/3 \end{pmatrix}\).
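
In R, the same inverse can be found numerically with solve(); a brief sketch that also checks the result:

```r
A <- matrix(c(3, 4, 2,
              2, -1, 5,
              1, -5, 4), nrow = 3, byrow = TRUE)
A_inv <- solve(A)            # with a single argument, solve() returns the inverse
round(A_inv %*% A, 10)       # the identity matrix, up to rounding error
A_inv %*% c(10, 3, -3)       # recovers the solution (2, 1, 0) of the original system
```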

Determinants

Consider \(\begin{pmatrix} a & 0 \\ 0 & d \end{pmatrix}\). This matrix scales space. If our original basis vectors are the standard \(e_1\) and \(e_2\), the matrix scales them to \(e_1' = \begin{pmatrix}a \\ 0 \end{pmatrix}\) and \(e_2'=\begin{pmatrix} 0 \\ d \end{pmatrix}\). As a result, the space is scaled by a factor \(ad\). This is called the determinant of the matrix.

If we have the matrix \(\begin{pmatrix} a & b \\ 0 & d \end{pmatrix}\), we still scale \(e_1\) by the same factor of \(a\) to become \(e_1' = \begin{pmatrix} a \\ 0 \end{pmatrix}\); however, \(e_2\) is transformed to \(e_2' = \begin{pmatrix} b \\ d \end{pmatrix}\). Thus, the unit square is no longer a square - it is a parallelogram. However, the determinant (scaling factor) is still \(ad\).

Now, if we instead change the matrix to \(\begin{pmatrix} a & b \\ c & d \end{pmatrix}\), we end up with a general parallelogram. One can prove that the area of this parallelogram is \(ad - bc\). That is the general equation for the determinant of a 2 x 2 matrix \(A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\), \[det(A) = ad-bc.\]

For a 2 x 2 matrix \(A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\), the inverse of the matrix can be computed as \[A^{-1} = \dfrac{1}{det(A)}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.\] For a matrix larger than 2 x 2, simply use a built-in R function to find the determinant.
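
A short R sketch of these two computations for a sample 2 x 2 matrix (the particular matrix is our own choice):

```r
A <- matrix(c(1, 2,
              3, 4), nrow = 2, byrow = TRUE)
det(A)      # ad - bc = 1*4 - 2*3 = -2
solve(A)    # the inverse, equal to (1/det(A)) times the matrix with entries d, -b / -c, a
```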

One important property of the determinant is that if your matrix has linearly dependent rows (or columns), the determinant is 0. In essence, having a linearly dependent row in the matrix means we lose a dimension. Thus, the transformation given by the matrix cannot be undone because we have lost some information by losing the dimension. Thus, it is worth checking that the determinant is non-zero before attempting to invert a matrix.

Linear Mappings

Changing Basis

Recall that the columns of a transformation matrix are the new basis vectors of the mapping, written in our coordinate system. We now discuss how to change from one basis system to another.

Consider two worlds, one for Alice and one for Bob. In Alice’s world, she has basis vectors given by our standard unit vectors \(e_1\) and \(e_2\), and Bob has basis vectors (written in Alice’s world) \(\begin{pmatrix} 3\\ 1 \end{pmatrix}\) and \(\begin{pmatrix} 1\\1 \end{pmatrix}\). Thus Bob’s transformation matrix is given by \(\begin{pmatrix} 3&1\\ 1&1 \end{pmatrix}\) in Alice’s world.
In Bob’s world, he views his basis vectors as our standard basis vectors \(e_1\) and \(e_2\).

Now, consider a vector \(\begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}\) in Bob’s world. We can map this into Alice’s world by calculating \[\begin{pmatrix} 3 & 1\\ 1&1 \end{pmatrix} \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix} = \begin{pmatrix} 5 \\ 2 \end{pmatrix}.\] Thus, Bob’s basis vectors, given in Alice’s world, multiplied by a vector given in terms of Bob’s world will yield the coordinates for the vector in Alice’s world.

So, we know how to transform vectors in Bob’s world to Alice’s world. We want to reverse the process and, given the information at hand, transform a vector from Alice’s world to Bob’s world. If \(B\) is the transformation matrix for Bob given in terms of Alice’s world, we can find \(B^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 3/2 \end{pmatrix}\). This represents the matrix of Alice’s basis vectors given in Bob’s world (Alice’s transformation matrix in Bob’s world). Now, if we take the vector \(\begin{pmatrix} 5 \\ 2 \end{pmatrix}\) in Alice’s world, we can map it into Bob’s world simply by computing \[ \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 3/2 \end{pmatrix}\begin{pmatrix} 5 \\ 2 \end{pmatrix} = \begin{pmatrix} 3/2 \\ 1/2 \end{pmatrix}.\]

Another Example

Consider another situation in which Alice has unit basis vectors \(e_1\) and \(e_2\) in her world and Bob has basis vectors \(\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}\) and \(\begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}\), in Alice’s world. A quick dot product will show that Bob’s unit vectors are orthogonal to each other.

Bob’s transformation matrix is \(B = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2}\\ 1/\sqrt{2} & 1/\sqrt{2}\end{pmatrix}\) in Alice’s world. Now, consider a vector \(\begin{pmatrix} 2\\1\end{pmatrix}\) in Bob’s world. We can map that into Alice’s world \[\begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2}\\ 1/\sqrt{2} & 1/\sqrt{2}\end{pmatrix} \begin{pmatrix} 2\\1\end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} \\ 3/\sqrt{2}\end{pmatrix}.\] To do the reverse transformation, we find the inverse of \(B\) given by \(B^{-1} = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2}\\ -1/\sqrt{2} & 1/\sqrt{2}\end{pmatrix}\). We map the vector \(\begin{pmatrix} 1/\sqrt{2}\\3/\sqrt{2}\end{pmatrix}\) from Alice’s world into Bob’s world as follows \[ \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2}\\ -1/\sqrt{2} & 1/\sqrt{2}\end{pmatrix} \begin{pmatrix} 1/\sqrt{2}\\3/\sqrt{2}\end{pmatrix} = \begin{pmatrix} 2\\1\end{pmatrix}.\]

Interestingly, since Bob’s basis vectors are orthogonal in Alice’s world, we can do this same calculation with projections. We can take the vector \(\begin{pmatrix} 1/\sqrt{2}\\3/\sqrt{2}\end{pmatrix}\) in Alice’s world and compute the dot product of that vector with Bob’s unit vectors (in Alice’s world). Thus, we get \[\begin{pmatrix} 1/\sqrt{2}\\3/\sqrt{2}\end{pmatrix} \cdot \begin{pmatrix} 1/\sqrt{2}\\1/\sqrt{2}\end{pmatrix} = 2.\] This gives the first component of the vector in Bob’s world (since this was the first of Bob’s unit vectors). Moreover, \[\begin{pmatrix} 1/\sqrt{2}\\3/\sqrt{2}\end{pmatrix} \cdot \begin{pmatrix} -1/\sqrt{2}\\1/\sqrt{2}\end{pmatrix} = 1,\] the second component of the vector given in Bob’s world.

Transformations in a Changed Basis

Suppose Alice wants to take a vector in Bob’s world and rotate it 45 degrees in Bob’s world. Unfortunately, she doesn’t know how to do that in Bob’s world - only in her own. How does she take the vector \(\begin{pmatrix} x \\ y\end{pmatrix}\) (in Bob’s coordinate world) and rotate it (or more generally, how does she apply a transformation when she only knows transformations in her world)? First, Alice takes the vector and maps it into her own world using Bob’s transformation matrix \(B = \begin{pmatrix} 3 & 1 \\ 1 & 1\end{pmatrix}\) by computing \(B\begin{pmatrix} x \\ y\end{pmatrix}\). This is now a vector in Alice’s world, and we can apply the 45 degree rotation in Alice’s world to that vector. To rotate the vector, she applies the rotation matrix (from Alice’s world) \(R = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2}\end{pmatrix}\) to the vector in Alice’s world. Thus, the rotated vector in Alice’s world is \(RB\begin{pmatrix} x \\ y\end{pmatrix}\). To map the vector back into Bob’s world, we apply the transformation matrix that transforms vectors from Alice’s world into Bob’s world, \(B^{-1}\). And so, we end up with \(B^{-1}RB\begin{pmatrix} x \\ y\end{pmatrix}\), which gives us the rotated vector in Bob’s coordinate world. Therefore, a rotation in Bob’s coordinate world can be computed as \(B^{-1}RB\). This works for any transformation.

One can compute the rotation matrix in Bob’s world by \[B^{-1}RB = \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 3/2\end{pmatrix}\begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2}\end{pmatrix}\begin{pmatrix} 3 & 1 \\ 1 & 1\end{pmatrix} = \begin{pmatrix} -1/\sqrt{2} & -1/\sqrt{2} \\ 5/\sqrt{2} & 3/\sqrt{2}\end{pmatrix}.\] The point here is that we can perform the same sort of matrix transformations in different basis systems; however, the matrix that we use to transform will change depending on the system and the new transformation will take the form \(B^{-1}RB\), where \(B\) is the matrix of the new basis in our standard coordinate system and \(R\) is the transformation in our standard coordinate system.
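
As a check, the following R sketch reproduces this \(B^{-1}RB\) computation numerically:

```r
B <- matrix(c(3, 1,
              1, 1), nrow = 2, byrow = TRUE)             # Bob's basis, in Alice's world
R <- matrix(c(1, -1,
              1,  1), nrow = 2, byrow = TRUE) / sqrt(2)  # 45-degree rotation in Alice's world
solve(B) %*% R %*% B   # the same rotation expressed in Bob's world, matching the matrix above
```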

Orthogonal Matrices

It is useful to be able to create a transformation matrix whose column vectors form a basis and are orthogonal to each other.

We begin with the idea of a matrix transpose. In short, the transpose of a matrix \(A\) is denoted \(A^T\) and is defined to be \(A_{ij}^T = A_{ji}\). Or, \(\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}\) as an example.

Consider a matrix \(A\) that is \(n \times n\) with the properties that \(A = \begin{pmatrix} \begin{pmatrix} \; \\ a_1\\ \; \end{pmatrix} & \begin{pmatrix} \; \\ a_2 \\ \;\end{pmatrix} & \cdots & \begin{pmatrix} \; \\ a_n \\ \; \end{pmatrix} \end{pmatrix},\) and \(a_{i} \cdot a_j = \begin{cases} 0 & i \not = j\\ 1 & i = j \end{cases}\). That is, the column vectors of \(A\) are of unit length and are orthogonal to each other. For such an \(A\), we can verify that \(A^TA = I\), since this multiplication is a series of dot products. In other words, \(A^T\) is the inverse of \(A\).

A set of vectors that satisfy the property of \(a_{i} \cdot a_j = \begin{cases} 0 & i \not = j\\ 1 & i = j \end{cases}\) are called an orthonormal basis set and a matrix composed of them is called an orthogonal matrix. The determinant of an orthogonal matrix must be \(\pm 1\) and thus it scales space by a factor of one. Note that the rows of an orthogonal matrix must also be orthonormal. This will be the most convenient set of basis vectors, for data science calculations, due to the properties that we have discussed above.

The Gram-Schmidt Process

Suppose \(V = \{v_1, v_2, \ldots, v_n\}\) spans an \(n\)-dimensional space. If we want to check linear independence, we simply check that the determinant of the matrix whose columns are these vectors is non-zero. Suppose also that the vectors are not necessarily orthogonal, nor are they of unit length. The goal here is to construct an orthonormal basis from them, and the procedure for doing this is the Gram-Schmidt process.

First, we define \(e_1 = \dfrac{v_1}{|v_1|}\). This will be our first basis vector. Now, think of \(v_2\) as being composed of two components - the vector projection of \(v_2\) in the direction of \(e_1\) and some component \(u_2\) that is perpendicular to \(e_1\). Thus, \(v_2 = (v_2 \cdot e_1)e_1 + u_2\) or equivalently, \(u_2 = v_2 - (v_2 \cdot e_1)e_1\). We then define \(e_2 = \dfrac{u_2}{|u_2|}\). In general, defining \(u_i = v_i - \sum_{j=1}^{i-1} (v_i \cdot e_j)e_j\) and \(e_i = \dfrac{u_i}{|u_i|}\) creates a set of vectors \(e_i\) which are all orthogonal to each other and of unit length. Thus, the set \(\{e_i\}_{i=1}^n\) forms an orthonormal basis.
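
Here is a small R sketch of the Gram-Schmidt process as described above; the function name gram_schmidt is our own, and the columns of the input matrix are the starting vectors \(v_i\).

```r
gram_schmidt <- function(V) {
  E <- matrix(0, nrow = nrow(V), ncol = ncol(V))
  for (i in seq_len(ncol(V))) {
    u <- V[, i]
    for (j in seq_len(i - 1)) {
      u <- u - sum(V[, i] * E[, j]) * E[, j]  # remove the projections onto earlier e_j
    }
    E[, i] <- u / sqrt(sum(u^2))              # normalize u_i to unit length
  }
  E
}

gram_schmidt(cbind(c(1, 1, 1), c(2, 0, 1), c(3, 1, -1)))  # reproduces the matrix E in the example that follows
```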

An Example

Consider two vectors \(v_1 = \begin{pmatrix} 1 \\ 1 \\ 1\end{pmatrix}\) and \(v_2 = \begin{pmatrix} 2 \\ 0 \\ 1\end{pmatrix}\), both lying in a plane, and a vector \(v_3 = \begin{pmatrix} 3 \\ 1 \\ -1\end{pmatrix}\) that is not in the plane described by \(v_1\) and \(v_2\). We will first find an orthonormal basis for these vectors.

Now, \(e_1 = \dfrac{v_1}{|v_1|} = \dfrac{1}{\sqrt{3}}\begin{pmatrix} 1 \\ 1 \\ 1\end{pmatrix}\).

Next we find \[u_2 = v_2 - (v_2 \cdot e_1)e_1 = \begin{pmatrix} 2 \\ 0 \\ 1\end{pmatrix} - \begin{pmatrix} 1 \\ 1 \\ 1\end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ 0\end{pmatrix}.\] Normalizing, we find that \(e_2 = \dfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \\ 0\end{pmatrix}\).

Finally, \[u_3 = v_3 - (v_3 \cdot e_1)e_1 - (v_3 \cdot e_2)e_2 = \begin{pmatrix} 3 \\ 1 \\ -1\end{pmatrix} - \begin{pmatrix} 1\\ 1 \\ 1\end{pmatrix} - \begin{pmatrix} 1\\ -1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ -2\end{pmatrix}.\] Normalizing, we find that \(e_3 = \dfrac{1}{\sqrt{6}} \begin{pmatrix} 1 \\ 1 \\ -2\end{pmatrix}\).

Therefore, the transformation matrix is \(E = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{2} & 1/\sqrt{6} \\ 1/\sqrt{3} & -1/\sqrt{2} & 1/\sqrt{6}\\ 1/\sqrt{3} & 0 & -2/\sqrt{6}\end{pmatrix}\).

Now, consider a vector \(r = \begin{pmatrix} 2 \\ 3 \\ 5 \end{pmatrix}\) in \(\mathbb{R}^3\). We want to reflect \(r\) through the plane created by \(e_1\) and \(e_2\) (or \(v_1\) and \(v_2\)) and out the opposite side to where it currently sits. We can think of \(r\) as being composed of a portion that is in the \(e_1\), \(e_2\) plane and some part that is perpendicular to the plane. The part that is perpendicular to the plane (i.e., the part formed by \(e_3\)), we will simply make negative to create this reflection. Therefore, the transformation matrix (in terms of the \(E\) basis) is \(T_E = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}\). So, if we take \(r\) and map it to the \(E\) basis, reflect it using \(T_E\), then map it back to \(\mathbb{R}^3\), we will have the reflection. Mathematically, we map \(r\) into basis \(E\) using \(E^{-1}\), transform (reflect) it using \(T_E\), then move it back to \(\mathbb{R}^3\) using \(E\). Our final answer will therefore be \[E T_E E^{-1} r = \begin{pmatrix} 1/3 + 1/2 - 1/6 & 1/3 - 1/2 - 1/6 & 1/3 + 0 + 2/6\\ 1/3-1/2-1/6 & 1/3 + 1/2 - 1/6 & 1/3 + 0 + 2/6\\ 1/3 + 0 + 2/6 & 1/3 + 0 + 2/6 & 1/3 + 0 - 4/6 \end{pmatrix}\begin{pmatrix} 2 \\ 3 \\5 \end{pmatrix} = \dfrac{1}{3} \begin{pmatrix} 2 & -1 & 2\\ -1 & 2 & 2 \\ 2 & 2 & -1\end{pmatrix} \begin{pmatrix} 2 \\ 3 \\5 \end{pmatrix} = \dfrac{1}{3}\begin{pmatrix} 11 \\ 14 \\ 5 \end{pmatrix}, \] where \(E^{-1} = E^T\). This is a vector that is reflected through the \(e_1\), \(e_2\) plane.
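
The following R sketch reproduces this reflection numerically, using the orthonormal basis found above:

```r
E <- cbind(c(1, 1, 1) / sqrt(3),
           c(1, -1, 0) / sqrt(2),
           c(1, 1, -2) / sqrt(6))   # the orthonormal basis from the example
T_E <- diag(c(1, 1, -1))            # flip the e3 component only
r <- c(2, 3, 5)
E %*% T_E %*% t(E) %*% r            # the reflected vector, (11, 14, 5) / 3
```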

Eigenvalues and Eigenvectors

The word eigen means characteristic. An eigenproblem is the problem of finding the characteristic properties.

We know that we can express linear transformations using matrices. These operations include scalings, rotations, and shears. For example, we may begin with a square (in black) centered at the origin, then apply a transformation (a shear) and change the shape (in blue).

Now, changing this shape essentially shifts points within the original square. However, some do not shift at all. Consider 3 vectors (red, yellow, green) in the original square.

We scale by a factor of 2 in the vertical direction and we get the following diagram.

We can see that the horizontal (green) vector remains unchanged. The direction of the vertical (red) vector remains the same but it is doubled in length. The angle (direction) and size of the diagonal (yellow) vector have both changed. Since the horizontal and vertical vectors did not change direction, we refer to them as eigenvectors under this transformation. Since the horizontal vector did not change length, it has a corresponding eigenvalue of 1. Since the vertical eigenvector doubled in size, it has an eigenvalue of 2. In essence, we apply a transformation to a space and look for vectors whose direction has not changed (eigenvectors) and then we measure the change in their length (eigenvalues).

As a second example, consider what happens when we take the same square and shear it without changing the area. As you can see, only one of the example vectors is an eigenvector (the green) and it has an eigenvalue of 1.

A third example is a rotation. As you can imagine, a rotation that is not a multiple of 180 degrees will move every vector off its original span and thus there will be no (real) eigenvectors.

Eigenvector Examples

We know that an eigenvector can be thought of as a vector having the same span both before and after a transformation and the eigenvalue is the amount by which the eigenvector is stretched due to the transformation. Here we highlight some special cases.

First, consider a uniform scaling in which we scale every direction by the same amount. In this case, every vector is an eigenvector.

Next, we consider a rotation of 180 degrees. Our three example vectors all keep the same span but now point in the opposite direction. This is the only nontrivial rotation (below 360 degrees) that has eigenvectors. Indeed, you can see that all vectors are eigenvectors with eigenvalue -1.

Finally, consider a case in which we combine a horizontal shear and a vertical scaling. In this case, our horizontal (green) vector is again an eigenvector (with eigenvalue 1) and a second eigenvector exists (in purple).

In machine learning, most of our data is \(n\) dimensional and so scaling and shearing become more complex. Moreover, when we move to three dimensions, an eigenvector of a rotation gives the axis of rotation.

Calculating Eigenvectors

Consider a transformation \(A\). If this transformation has an eigenvector \(x\), we can say that \[Ax = \lambda x,\] which simply means that after applying the transformation \(A\) to \(x\), the vector \(x\) remains on the same span (but possibly stretched by a factor of \(\lambda \in \mathbb{R}\)). We rewrite this as \[(A - \lambda I) x = 0,\] where \(I\) is the \(n \times n\) identity matrix. The solution where \(x = 0\) is not interesting, so we want \(A - \lambda I\) to send some non-zero vector to 0, which happens exactly when \(A - \lambda I\) is singular. Thus, we want to find \(\lambda\) such that \[det(A - \lambda I) = 0.\]

Now, consider \(A = \begin{pmatrix} a & b\\c & d \end{pmatrix}\). Then \[det(A - \lambda I) = det\left(\begin{pmatrix} a & b\\c & d \end{pmatrix} - \begin{pmatrix} \lambda & 0\\0 & \lambda \end{pmatrix}\right) = det\begin{pmatrix} a - \lambda & b\\c & d - \lambda \end{pmatrix}.\] The determinant here is called the characteristic polynomial and is \[\lambda^2 - (a+d) \lambda + ad-bc = 0.\] Our eigenvalues are simply the solutions of the characteristic polynomial, and these can be used to find the eigenvectors.

As an example, we will apply a vertical scaling of 2 and find the eigenvalues and eigenvectors. Here, \(A = \begin{pmatrix}1 & 0 \\ 0 & 2 \end{pmatrix}\) and so our characteristic polynomial is \[\lambda^2 - 3 \lambda + 2 = (\lambda-1)(\lambda - 2) = 0.\] So, \(\lambda = 1,2\).

When \(\lambda = 1\), \((A - \lambda I)x = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix}x_1 \\ x_2 \end{pmatrix}= \begin{pmatrix}0 \\ x_2 \end{pmatrix}= 0\). When \(\lambda = 2\), \((A - \lambda I)x = \begin{pmatrix} -1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix}x_1 \\ x_2 \end{pmatrix}= \begin{pmatrix}-x_1 \\ 0 \end{pmatrix}= 0\). So, when our eigenvalue \(\lambda = 1\), we have an eigenvector in which \(x_2 = 0\). Thus, any vector with \(x_2 = 0\) is an eigenvector for this system. For example, any vector \(\begin{pmatrix} t \\ 0 \end{pmatrix}\). When our eigenvalue is \(\lambda = 2\), we have an eigenvector in which \(-x_1 = 0\). Thus, any vector with \(x_1 = 0\) is an eigenvector for the system such as \(\begin{pmatrix} 0 \\ t \end{pmatrix}\).

Consider a second example in which we rotate the system counterclockwise 90 degrees. Here \(A = \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}\). The characteristic polynomial in this case is \[\lambda^2 + 1 = 0.\] This polynomial equation has no real solutions and thus we conclude that there are no real eigenvalues (and hence no eigenvectors).

Normally, we will want to find eigenvalues and eigenvectors by computer, since most problems involve large matrices and finding the roots of complicated polynomials.
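
In R, the built-in eigen() function does this for us; a brief sketch using the two examples above:

```r
A <- matrix(c(1, 0,
              0, 2), nrow = 2, byrow = TRUE)            # vertical scaling by 2
eigen(A)                                                # eigenvalues 2 and 1, eigenvectors along the axes
eigen(matrix(c(0, -1,
               1,  0), nrow = 2, byrow = TRUE))$values  # 90-degree rotation: complex values, no real eigenvectors
```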

Changing the Eigenbasis

Here we combine the idea of eigenvectors with the concept of changing basis in a procedure called diagonalization, which helps make matrix operations more efficient.

Consider a particle with initial location \(v_0 = \begin{pmatrix} 1/2 \\ 1 \end{pmatrix}\). The matrix \(T = \begin{pmatrix} 0.9 & 0.8 \\ -1 & 0.35 \end{pmatrix}\) represents the change in position of the particle after a single time step. Hence, \(v_1 = Tv_0\) represents the position of the particle after 1 step. After 2 time steps, we get \(v_2 = Tv_1 = T^2v_0\). After \(n\) steps, we get \(v_n = T^nv_0\).

A matrix whose entries are all 0, except possibly on the main diagonal, is called a diagonal matrix. In the case of a diagonal matrix \(T = \begin{pmatrix} a & 0&0\\0&b&0\\0&0&c \end{pmatrix}\), we compute \(T^n = \begin{pmatrix} a^n & 0&0\\0&b^n&0\\0&0&c^n \end{pmatrix}\).

Let \(C\) be a matrix consisting of columns of eigenvectors and \(D\) the diagonal matrix with eigenvalues along the main diagonal. Then, \(T = CDC^{-1}\). However, we will note that \(T^2 = CDC^{-1}CDC^{-1} = CD^2C^{-1}\). In general, \(T^n = CD^nC^{-1}\). However, since \(D\) is diagonal, computing \(D^n\) is quite easy.

Eigenbasis Example

Here we examine the matrix \(T = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}\). This matrix performs a vertical stretch and a shear.

Now consider the vector \(\begin{pmatrix} -1 \\ 1 \end{pmatrix}\). Multiplying by \(T\), we get \(\begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 2 \end{pmatrix}\). If we multiply the result by \(T\) again, we get \(\begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} 0\\ 2 \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}\). (You can check that this is the same as \(T^2 \begin{pmatrix} -1 \\ 1 \end{pmatrix}\)).

Notice that the eigenvalues and eigenvectors are \(v_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}\), \(\lambda_1 = 1\), and \(v_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}\), \(\lambda_2 = 2\). So, we can do the same computation using the diagonalization method. Here \(C = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\) and \(C^{-1} = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}\). Also, \(D = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}\). Thus, \[T^2 = CD^2C^{-1} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}^2\begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 3 \\ 0 & 4 \end{pmatrix}.\] Therefore, we arrive at our answer \[\begin{pmatrix} 1 & 3 \\ 0 & 4 \end{pmatrix}\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}.\]
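
A short R sketch of this diagonalization (the name T_mat is our own, chosen to avoid R's shorthand T for TRUE):

```r
T_mat <- matrix(c(1, 1,
                  0, 2), nrow = 2, byrow = TRUE)
C <- matrix(c(1, 1,
              0, 1), nrow = 2, byrow = TRUE)       # columns are the eigenvectors
n <- 2
C %*% diag(c(1, 2)^n) %*% solve(C)                 # equals T^2, the matrix with rows 1, 3 / 0, 4
(C %*% diag(c(1, 2)^n) %*% solve(C)) %*% c(-1, 1)  # gives (2, 4), as above
```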

The PageRank Algorithm

The PageRank algorithm was published by and named after Google founder Larry Page and colleagues in 1998. Google uses it to decide the order of websites it displays after a search. The basic idea is that the importance of a website is related to the number and importance of the websites that link to it.

Consider 4 websites and the extent to which each website contains a link to another website. We can visualize this as a graph.

We can write down the matrix that governs the links by displaying a 1 if the website given by the column links to the website in the given row. Hence we have \(\begin{pmatrix}0&1&0&0\\1&0&0&1\\1&0&0&1\\1&1&1&0 \end{pmatrix}\). Now, define \(L\) as the previous matrix, but with each column vector divided by the number of outgoing links on that website. Hence \(L = \begin{pmatrix}0&1/2&0&0\\1/3&0&0&1/2\\1/3&0&0&1/2\\1/3&1/2&1&0 \end{pmatrix}\). This matrix \(L\) gives the probability of ending up on a page (by clicking a link at random), given that we are on a particular page.

To determine the rank of page \(A\), we need to know three things: (a) what is your rank, (b) do you link to page \(A\), and (c) how many outgoing links do you have on your page. Here, \(r_A = \sum_{j=1}^n L_{A,j}r_j\) since the rank of \(A\) is the sum, over all pages that link to \(A\), of the probability of following that link times the rank of that page. In total, \(r = Lr\) because this is simply a matrix multiplication problem. If we initially assume that each rank is equal, then for 4 webpages, \(r_0 = \begin{pmatrix}1/4\\1/4\\1/4\\1/4 \end{pmatrix}\) and then we multiply by \(L\) repeatedly, updating \(r\) each time, to find the asymptotic solution. Thus \(r_{i+1} = Lr_i = L^{i+1}r_0\). Eventually, \(r_{i+1} \approx r_{i}\), or \(r = Lr\). However, this means that \(r\) is an eigenvector of \(L\) with eigenvalue 1. We simulate this to find (roughly) that \(r_{10} = \begin{pmatrix}0.12\\0.24\\0.24\\0.40 \end{pmatrix}\). Thus, website \(D\) is most important and would randomly be viewed 40% of the time. \(A\) is the least important and would be viewed randomly about 12% of the time. The important point here is that each website is ranked in terms of importance, and websites are thus ranked in terms of how they should be displayed once they are searched for.
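
A minimal R sketch of this power iteration for the four-page example (the number of iterations is our own choice):

```r
L <- matrix(c(0,   1/2, 0, 0,
              1/3, 0,   0, 1/2,
              1/3, 0,   0, 1/2,
              1/3, 1/2, 1, 0), nrow = 4, byrow = TRUE)
r <- rep(1/4, 4)                 # start with all four pages ranked equally
for (i in 1:50) r <- L %*% r     # repeatedly apply L, updating r each time
round(r, 2)                      # settles near 0.12, 0.24, 0.24, 0.40
```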
