Some math behind large language models
Mike Irvine
24 May, 2024
\[\text{softmax}\left(\frac{QK^T}{\sqrt{d}} \right)V\]
What is a matrix and why is it useful for modeling data?
What is an embedding and how are they used in deep learning models?
What is the self-attention mechanism inside a transformer model?
One equation in one unknown has a unique solution:
\(5x = 15\)
\(x = 3\)
One equation in two unknowns has infinitely many solutions:
\(5x + y = 15\)
Every point of the form \((x, 15 - 5x)\) works, for example \((1, 10)\), \((2, 5)\), and \((3, 0)\).
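As a quick illustration (a toy Python check of my own, not from the original slides), we can enumerate a few members of this solution family:

```python
# Each choice of x gives a solution (x, 15 - 5x) to 5x + y = 15.
for x in (1, 2, 3):
    print((x, 15 - 5 * x))   # -> (1, 10), (2, 5), (3, 0)
```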
Adding a second equation pins down a unique solution again:
\(5x + y = 15\)
\(3x + 2y = 16\)
\(x = 2, y = 5\)
In general, a system \(ax + by = e\), \(cx + dy = f\) can be written as a matrix equation:
\[\begin{pmatrix} a & b\\ c & d \end{pmatrix} \begin{pmatrix} x\\y \end{pmatrix} = \begin{pmatrix} e\\f \end{pmatrix} \]
\[ A\mathbf{x} = \mathbf{y} \]
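As a sanity check, here is a minimal NumPy sketch (my illustration, not part of the deck) that solves the earlier system in exactly this matrix form:

```python
import numpy as np

# The system from the previous slides:
#   5x +  y = 15
#   3x + 2y = 16
A = np.array([[5.0, 1.0],
              [3.0, 2.0]])
y = np.array([15.0, 16.0])

x = np.linalg.solve(A, y)   # solves A @ x = y for x
print(x)                    # -> [2. 5.], i.e. x = 2 and y = 5
```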
A way of generalizing the previous example to any number of dimensions and variables
A representation of a linear map for a fixed basis
A linear map is a map from one vector space \(U\) to a vector space \(V\), usually denoted \(T: U \to V\),
with additivity: for all \(\mathbf{u},\mathbf{v} \in U\), \(T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})\),
and scalar multiplication: for all \(c \in \mathbb{R}\) and \(\mathbf{u} \in U\), \(T(c\mathbf{u}) = cT(\mathbf{u})\).
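Matrix multiplication satisfies both properties; here is a small NumPy sketch (with random values of my own choosing) that checks them numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))   # T(u) = A @ u maps R^2 -> R^4
u = rng.normal(size=2)
v = rng.normal(size=2)
c = 3.0

print(np.allclose(A @ (u + v), A @ u + A @ v))  # additivity: True
print(np.allclose(A @ (c * u), c * (A @ u)))    # scalar multiplication: True
```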
\[\begin{pmatrix} a & b & c & d \end{pmatrix} \begin{pmatrix} w\\x\\y\\z \end{pmatrix} = aw + bx + cy + dz \]
\[\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} \begin{pmatrix} x\\y \end{pmatrix} = \begin{pmatrix} ax + by \\ cx + dy \\ ex + fy \\ gx + hy \end{pmatrix} \]
\[\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \]
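A shape-oriented NumPy sketch of the same two operations (the values are my own, not from the slides):

```python
import numpy as np

B = np.arange(8.0).reshape(4, 2)   # a 4x2 matrix like the one above
x = np.array([10.0, 1.0])          # a 2-vector

print(B @ x)       # -> [ 1. 23. 45. 67.]: each row of B dotted with x
print(B.T)         # the transpose: rows and columns swapped
print(B.T.shape)   # -> (2, 4)
```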
We can use the softmax function to turn a vector of arbitrary scores into positive weights that sum to one (it's also differentiable, which is nice!):
\[\text{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\]
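Here is a minimal, numerically stable softmax in NumPy, plus a toy use of it in the title's attention formula (both are my sketches with made-up values, not code from the deck):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max doesn't change the result (softmax is
    # shift-invariant) but avoids overflow in exp for large scores.
    z = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return z / np.sum(z, axis=axis, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))  # positive weights that sum to 1

# Tying it back to the title formula softmax(QK^T / sqrt(d)) V:
rng = np.random.default_rng(0)
d = 8                                    # embedding dimension
Q = rng.normal(size=(5, d))              # 5 query vectors
K = rng.normal(size=(5, d))              # 5 key vectors
V = rng.normal(size=(5, d))              # 5 value vectors
out = softmax(Q @ K.T / np.sqrt(d)) @ V  # each row: a weighted average of V
print(out.shape)                         # -> (5, 8), one output per token
```

Questions?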