2025-11-08

Introduction

  • The cosine similarity formula is a powerful technique for measuring the similarity between two digital items.
  • These items could include:
    • Photos
    • Raw Text
    • Vector Data etc.
  • In this presentation we will focus on raw text.
  • We will build an intuition of how it works through data visualization.
  • But first we need to go over the vectorization of data.

Vectorization

  • What is the vectorization of data?
  • The easiest way to explain is with an example.
  • Let’s say we have two text files containing the following:
    • raw1.txt: hello goodbye
    • raw2.txt: hello hello goodbye
  • We need to convert these files into vectors \(\overset{\rightharpoonup}v_1\) and \(\overset{\rightharpoonup}v_2\).
  • We can do this by counting how often each unique word occurs and storing each count at that word’s own index in a vector.
  • With hello at the first index and goodbye at the second, we end up with the following: \[ \overset{\rightharpoonup}v_1= \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \overset{\rightharpoonup}v_2= \begin{bmatrix} 2 \\ 1 \end{bmatrix} \]
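  • The counting step above can be sketched in base R. This is a minimal sketch, not a definitive implementation: it assumes the file contents are already loaded as the strings raw1 and raw2, and the names words_1, words_2, and vocab are illustrative.

```r
# Assumed inputs: the contents of raw1.txt and raw2.txt as plain strings.
raw1 <- "hello goodbye"
raw2 <- "hello hello goodbye"

# Split each document on whitespace to get its individual words.
words_1 <- strsplit(raw1, "\\s+")[[1]]
words_2 <- strsplit(raw2, "\\s+")[[1]]

# Fix a shared vocabulary so both vectors use the same index for each word.
vocab <- c("hello", "goodbye")

# Count how often each vocabulary word occurs in each document.
v_1 <- as.vector(table(factor(words_1, levels = vocab)))
v_2 <- as.vector(table(factor(words_2, levels = vocab)))

v_1
## [1] 1 1
v_2
## [1] 2 1
```

  • Fixing the vocabulary order up front matters: both documents must map hello and goodbye to the same indices, or the vectors cannot be compared.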

Calculating Similarity

  • Now that we have our vectors we can use them to calculate the similarity by measuring the angle between them.
  • For our example we could use the Law of Cosines, since we’re dealing with a simple triangle, where \(a\) and \(b\) are the lengths of the two vectors and \(c\) is the distance between their tips. \[ c^2=a^2+b^2-2ab \cos \theta \\ \cos \theta = \frac{a^2+b^2-c^2}{2ab} \]
  • However, this becomes complicated for documents with more than two unique words.
  • Instead, we’ll use the Dot Product, since it works directly on vectors. \[ \overset{\rightharpoonup}v_1 \cdot \overset{\rightharpoonup}v_2 = |\overset{\rightharpoonup}v_1| |\overset{\rightharpoonup}v_2| \cos \theta \\ \cos \theta =\frac{\overset{\rightharpoonup}v_1 \cdot \overset{\rightharpoonup}v_2}{|\overset{\rightharpoonup}v_1||\overset{\rightharpoonup}v_2|} \]
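  • As a quick check, both formulas give the same value for our example vectors. This sketch assumes the triangle is formed by the origin and the two vector tips; the names a, b, c_len, cos_law, and cos_dot are illustrative.

```r
v_1 <- c(1, 1)
v_2 <- c(2, 1)

# Side lengths of the triangle: the two vector lengths and the
# distance between the vector tips.
a <- sqrt(sum(v_1^2))
b <- sqrt(sum(v_2^2))
c_len <- sqrt(sum((v_1 - v_2)^2))

# Law of Cosines, solved for cos(theta).
cos_law <- (a^2 + b^2 - c_len^2) / (2 * a * b)

# Dot Product formula, solved for cos(theta).
cos_dot <- sum(v_1 * v_2) / (a * b)

# Compare with a tolerance, since these are floating-point values.
all.equal(cos_law, cos_dot)
## [1] TRUE
```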

Visualizing Similarity

  • Enough math; let’s plot \(\overset{\rightharpoonup}v_1\) and \(\overset{\rightharpoonup}v_2\) so we can see what we’re talking about.

Visualizing Similarity Continued

  • What if we doubled the number of words in \(\overset{\rightharpoonup}v_2\) but kept the same proportion?
  • The angle stays the same. One of the main benefits of cosine similarity is that it doesn’t depend on document length.
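  • The length independence can be checked numerically. In this sketch, cosine_similarity is a hypothetical helper name (not a built-in R function), and doubling is modeled as multiplying every count in the second vector by 2.

```r
# Hypothetical helper: cosine of the angle between two numeric vectors.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

v_1 <- c(1, 1)
v_2 <- c(2, 1)

# Original vectors.
cosine_similarity(v_1, v_2)
## [1] 0.9486833

# Second document doubled in length, same word proportions.
cosine_similarity(v_1, 2 * v_2)
## [1] 0.9486833
```

  • Scaling a vector multiplies both the dot product and the vector’s length by the same factor, so the factor cancels in the ratio.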

Visualizing Similarity Continued

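  • If you added the word bird to each document, the vectors would gain a third dimension and look like this.
  • Notice the angle is nearly the same: adding the same count of bird to each document shifts it only slightly. With one bird in each, \(\cos \theta\) goes from about 0.949 to about 0.943.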
  • You can interact with the plot below with your cursor.
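  • The three-word case can be checked the same way. This sketch assumes exactly one bird is added to each document; the names v_1_bird, v_2_bird, cos_2d, and cos_3d are illustrative.

```r
# Original two-word vectors (hello, goodbye).
v_1 <- c(1, 1)
v_2 <- c(2, 1)

# Assumed counts: one "bird" appended to each document as a third index.
v_1_bird <- c(v_1, 1)
v_2_bird <- c(v_2, 1)

cos_2d <- sum(v_1 * v_2) / (sqrt(sum(v_1^2)) * sqrt(sum(v_2^2)))
cos_3d <- sum(v_1_bird * v_2_bird) /
  (sqrt(sum(v_1_bird^2)) * sqrt(sum(v_2_bird^2)))

# Rounded for display: before and after adding bird.
round(c(cos_2d, cos_3d), 3)
## [1] 0.949 0.943
```

  • The two values are close but not identical: a shared word adds the same amount to both vectors, which changes the angle between them slightly.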

Computing Cosine Similarity in R

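# Word-count vectors for raw1.txt and raw2.txt (hello, goodbye)
v_1 <- c(1, 1)
v_2 <- c(2, 1)
# Dot product divided by the product of the two vector lengths
cos_sim <- sum(v_1 * v_2) / (sqrt(sum(v_1^2)) * sqrt(sum(v_2^2)))
cos_sim
## [1] 0.9486833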