intro_matrices

Motivation

\[\text{softmax}\left(\frac{KQ^T}{\sqrt{d}} \right)V\]
What is a matrix and why is it useful for modeling data?
What is an embedding and how are they used in deep learning models?
What is the self-attention mechanism inside a transformer model?

\(5x = 15\)

\(x = 3\)

\(5x = 15\)

\(x = 3\)

\(5x + y = 15\)

\((x, 15 - 5x)\)

\(5x + y = 15\)

\((1, 10)\)

\(5x + y = 15\)

\((1, 10)\)

\((2, 5)\)

\(5x + y = 15\)

\((1, 10)\)

\((2, 5)\)

\((3, 0)\)

\(5x + y = 15\)

\(3x + 2y = 16\)

\(5x + y = 15\)

\(3x + 2y = 16\)

\(x = 2, y = 5\)

\(5x + y = 15\) \(3x + 2y = 16\)

\(ax + by = e\) \(cx + dy = f\)

\[\begin{pmatrix} a & b\\ c & d \end{pmatrix} \begin{pmatrix} x\\y \end{pmatrix} = \begin{pmatrix} e\\f \end{pmatrix} \]

\[ A\mathbf{x} = \mathbf{y} \]

What is a matrix?

A way of generalizing the previous example to any number of dimensions and variables
A representation of a linear map for a fixed basis

What is a basis?

What is a linear map?

A map from one vector space \(U\) to a vector space \(V\) . Usually denoted \(T: U \to V\)
With additivity: \(\mathbf{u},\mathbf{v} \in U\), \(T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})\)
and scalar multiplication: \(c \in \mathbb{R}, u \in U\), \(T(c\mathbf{u}) = cT(\mathbf{u})\)

Why are linear maps useful for machine learning?

Matrices as linear maps

Linear map \(T : U \to V\)
Basis exists for \(U\), \(B_u\) and for \(V\), \(B_v\)
There is a unique matrix \(A\) for the linear map \(T\) for the above bases
i.e. Matrix multiplication \(A\mathbf{x}\) is the same thing as applying a linear map \(T(\mathbf{u})\) as long as you know the coordinate system in both spaces

Examples of matrices

\[\begin{pmatrix} a & b & c & d \end{pmatrix} \begin{pmatrix} w\\x\\y\\z \end{pmatrix} = aw + bx + cy + dz \]

Examples of matrices

\[\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} \begin{pmatrix} x\\y \end{pmatrix} = \begin{pmatrix} ax + by \\ cx + dy \\ ex + fy \\ gx + hy \end{pmatrix} \]

Transpose of a matrix

\[\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \]

Transpose of a matrix

\((A^T)^T = A\)
\(A^TA = I\) if \(A\) is a rotation
Importantly, if \(A\) maps \(U\) onto \(V\), then \(A^T\) maps \(V\) onto \(U\)

Linear maps are composable

Some properties of matrices

Associative \((AB)C = A(BC)\)
Non-commutative \(AB \neq BA\)
Transpose \((AB)^T = B^TA^T\)
Inverse \(A^{-1}A = I\)

Application to retrieval systems

Start with a query e.g. “dog”, “cat”
Items (videos) have corresponding keywords e.g. “food blog”, “pet tips”
Items have values associated with the keywords by how much they correspond with each one.
Think of each of these as linear maps. Can map from queries to keywords and from keywords to values.

Query matrix \(Q\)

Keys matrix \(K\)

Query x Keys matrix \(QK^T\)

mutliplying matrices might produce large values, which can make training models hard
Can control this by scaling with the number of dimensions of the matrix \(QK^T/\sqrt{d}\)

Scaled Query x Keys matrix \(QK^T/\sqrt{d}\)

Softmax values

\[\text{softmax}(\mathbf{x}) = \frac{\exp(x_i)}{\sum_i \exp(x_i)}\]

The scores between queries and keys aren’t relative to each other making it hard to compare especially when we have a large number of keys
Want to pick just a few keywords for each query otherwise results won’t be specific enough
Can use the softmax function to do this (it’s also differentiable which is nice!)

Softmax Query x Keys matrix \(\text{softmax}(QK^T/\sqrt{d})\)

Values matrix \(V\)

Attention \(\text{softmax}(QK^T/\sqrt{d})V\)

Self-attention

Instead of going from one domain to another (e.g. search keywords to video titles), can using the attention mechanism from the same domain
For example “The animal didn’t cross the street because it was too tired.”
What does it refer to? What did the animal not do? Why didn’t it do it?

Motivation

What is a matrix?

What is a basis?

What is a basis?

What is a basis?

What is a basis?

What is a basis?

What is a basis?

What is a basis?

What is a linear map?

What is a linear map?

Why are linear maps useful for machine learning?

Why are linear maps useful for machine learning?

Matrices as linear maps

Examples of matrices

Examples of matrices

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

Linear maps are composable

Some properties of matrices

Application to retrieval systems

Application to retrieval systems

Application to retrieval systems

Query matrix \(Q\)

Keys matrix \(K\)

Query x Keys matrix \(QK^T\)

Query x Keys matrix \(QK^T\)

Scaled Query x Keys matrix \(QK^T/\sqrt{d}\)

Softmax values

Softmax Query x Keys matrix \(\text{softmax}(QK^T/\sqrt{d})\)

Values matrix \(V\)

Attention \(\text{softmax}(QK^T/\sqrt{d})V\)

Self-attention

Self-attention

Multi-headed self-attention

Attention is all you need

Some further reading