Some math behind large language models

Mike Irvine

24 May, 2024

Motivation

  • \[\text{softmax}\left(\frac{KQ^T}{\sqrt{d}} \right)V\]

  • What is a matrix and why is it useful for modeling data?

  • What is an embedding and how are they used in deep learning models?

  • What is the self-attention mechanism inside a transformer model?

\(5x = 15\)

\(5x = 15\)

\(x = 3\)

\(5x = 15\)

\(x = 3\)

\(5x + y = 15\)

\(5x + y = 15\)

\((x, 15 - 5x)\)

\(5x + y = 15\)

\(5x + y = 15\)

\((1, 10)\)

\(5x + y = 15\)

\((1, 10)\)

\((2, 5)\)

\(5x + y = 15\)

\((1, 10)\)

\((2, 5)\)

\((3, 0)\)

\(5x + y = 15\)

\(5x + y = 15\)

\(3x + 2y = 16\)

\(5x + y = 15\)

\(3x + 2y = 16\)

\(x = 2, y = 5\)

\(5x + y = 15\) \(3x + 2y = 16\)

\(ax + by = e\) \(cx + dy = f\)

\[\begin{pmatrix} a & b\\ c & d \end{pmatrix} \begin{pmatrix} x\\y \end{pmatrix} = \begin{pmatrix} e\\f \end{pmatrix} \]

\[ A\mathbf{x} = \mathbf{y} \]

What is a matrix?

  • A way of generalizing the previous example to any number of dimensions and variables

  • A representation of a linear map for a fixed basis

What is a basis?

What is a basis?

What is a basis?

What is a basis?

What is a basis?

What is a basis?

What is a basis?

What is a linear map?

What is a linear map?

  • A map from one vector space \(U\) to a vector space \(V\) . Usually denoted \(T: U \to V\)

  • With additivity: \(\mathbf{u},\mathbf{v} \in U\), \(T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})\)

  • and scalar multiplication: \(c \in \mathbb{R}, u \in U\), \(T(c\mathbf{u}) = cT(\mathbf{u})\)

Why are linear maps useful for machine learning?

Why are linear maps useful for machine learning?

Matrices as linear maps

  • Linear map \(T : U \to V\)
  • Basis exists for \(U\), \(B_u\) and for \(V\), \(B_v\)
  • There is a unique matrix \(A\) for the linear map \(T\) for the above bases
  • i.e. Matrix multiplication \(A\mathbf{x}\) is the same thing as applying a linear map \(T(\mathbf{u})\) as long as you know the coordinate system in both spaces

Examples of matrices

\[\begin{pmatrix} a & b & c & d \end{pmatrix} \begin{pmatrix} w\\x\\y\\z \end{pmatrix} = aw + bx + cy + dz \]

Examples of matrices

\[\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} \begin{pmatrix} x\\y \end{pmatrix} = \begin{pmatrix} ax + by \\ cx + dy \\ ex + fy \\ gx + hy \end{pmatrix} \]

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

Transpose of a matrix

\[\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \]

Transpose of a matrix

  • \((A^T)^T = A\)
  • \(A^TA = I\) if \(A\) is a rotation
  • Importantly, if \(A\) maps \(U\) onto \(V\), then \(A^T\) maps \(V\) onto \(U\)

Linear maps are composable

Some properties of matrices

  • Associative \((AB)C = A(BC)\)
  • Non-commutative \(AB \neq BA\)
  • Transpose \((AB)^T = B^TA^T\)
  • Inverse \(A^{-1}A = I\)

Application to retrieval systems

Application to retrieval systems

Application to retrieval systems

  • Start with a query e.g. “dog”, “cat”
  • Items (videos) have corresponding keywords e.g. “food blog”, “pet tips”
  • Items have values associated with the keywords by how much they correspond with each one.
  • Think of each of these as linear maps. Can map from queries to keywords and from keywords to values.

Query matrix \(Q\)

Keys matrix \(K\)

Query x Keys matrix \(QK^T\)

Query x Keys matrix \(QK^T\)

  • mutliplying matrices might produce large values, which can make training models hard
  • Can control this by scaling with the number of dimensions of the matrix \(QK^T/\sqrt{d}\)

Scaled Query x Keys matrix \(QK^T/\sqrt{d}\)

Softmax values

\[\text{softmax}(\mathbf{x}) = \frac{\exp(x_i)}{\sum_i \exp(x_i)}\]

  • The scores between queries and keys aren’t relative to each other making it hard to compare especially when we have a large number of keys
  • Want to pick just a few keywords for each query otherwise results won’t be specific enough
  • Can use the softmax function to do this (it’s also differentiable which is nice!)

Softmax Query x Keys matrix \(\text{softmax}(QK^T/\sqrt{d})\)

Values matrix \(V\)

Attention \(\text{softmax}(QK^T/\sqrt{d})V\)

Self-attention

  • Instead of going from one domain to another (e.g. search keywords to video titles), can using the attention mechanism from the same domain
  • For example “The animal didn’t cross the street because it was too tired.”
  • What does it refer to? What did the animal not do? Why didn’t it do it?

Self-attention

Multi-headed self-attention

Attention is all you need

Some further reading

Questions?