Word vectors rely on a deep-learning-based word-correlation analysis of texts containing billions of words, and they yield interesting linear structures in the word-vector space that convey relationships between words. This analysis explores some of the “canonical” examples of word-vector relations - mostly as an exercise to see if I can reproduce known results before embarking on my own analysis.
An interesting finding is that while the scalar product of two vectors gives the cosine of the angle between them, there appears to be more information in looking at the similarity across a spectrum of related words.
The word-vector analysis relies on the “pre-trained” GloVe vectors from Jeffrey Pennington, Richard Socher, and Christopher D. Manning (2014).
The first step is to load the GloVe word vectors.
number.of.words <- 9900
word.vector.file <- "glove.6B.300d.txt"
The analysis uses the first 9900 of the GloVe word vectors in glove.6B.300d.txt. It took 23 seconds to process the data into a data frame.
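The loading code itself isn’t shown here; a minimal sketch of how the file might be read into the data frame word.vector.df (read.table() and the word.list assignment are assumptions on my part) is:

word.vector.df <- read.table( word.vector.file, header = FALSE, quote = "",
                              comment.char = "", nrows = number.of.words,
                              stringsAsFactors = FALSE )    # one word plus 300 numeric columns per row
word.list <- word.vector.df[ , 1]                           # the words themselves, used by grep() below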
Before delving into the full analysis, it’s helpful to look at vector representations of familiar word pairs to build an intuitive sense of how vector representations of words facilitate their comparison and analysis.
Vectors of individual words are easily extracted from the word vector data frame using simple regular-expression syntax.
man.vec <- vector.normalize( colSums( word.vector.df[ grep( "^man$", word.list), -1] ) )
The vector man.vec is normalized to unit length and represents the word man.
There is an embedded colSums command, irrelevant in this specific case, to ensure all selected vectors are added together when a more general regular expression matches multiple words. This generalization isn’t useful right now, but may be useful later. For example, selecting strings starting with “govern”
govern.vec <- vector.normalize( colSums( word.vector.df[ grep( "^govern", word.list), -1] ) )
creates a unit vector representing the following 8 words: government, governor, governments, governing, governors, governmental, governance, and governed.
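Both snippets use vector.normalize(), which isn’t defined in this excerpt; a minimal sketch of what it is assumed to do is:

vector.normalize <- function( v ) {
  v <- as.numeric( v )       # colSums() returns a named numeric vector
  v / sqrt( sum( v^2 ) )     # scale to unit length
}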
Before diving into comparisons, it’s helpful to reproduce a well-known and intuitively appealing example of word-vector comparisons. Looking at the vectors of multiple words with a male-female relationship, we can see how the relationships develop.
The example below approximately reproduces well-known results. The position of each word represents the “vector” of that word (noting that this is a two-dimensional representation of a 300-element vector).
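The projection used for the plot isn’t shown; one way it might be produced (the choice of prcomp() over the handful of word vectors is an assumption) is:

pair.words <- c( "man", "woman", "male", "female", "king", "queen" )
pair.vecs  <- t( sapply( pair.words, function( w )
                 vector.normalize( colSums( word.vector.df[ grep( paste0( "^", w, "$" ), word.list ), -1] ) ) ) )
pair.2d    <- prcomp( pair.vecs )$x[ , 1:2]     # first two principal components
plot( pair.2d, type = "n" )
text( pair.2d, labels = pair.words )            # place each word at its 2-D position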
Recalling basic properties of vector addition, we can see from the above that the following approximate expressions seem to hold:
\begin{aligned} \textbf{male} + \textbf{woman} - \textbf{female} &\approx \textbf{man} \\ \textbf{man} + \textbf{queen} - \textbf{woman} &\approx \textbf{king} \end{aligned}
While these identities are not fully satisfied in the two-dimensional representation of the 300-element word vectors, we can quantify the differences more precisely by taking the quantity
\begin{aligned} \cos{\theta_{d}} = \frac{ \textbf{pseudo.d} \cdot \textbf{d} }{ \| \textbf{pseudo.d} \| \, \| \textbf{d} \| } \end{aligned}
where
\begin{aligned} \textbf{pseudo.d} \equiv \textbf{a} + \textbf{b} - \textbf{c} \approx \textbf{d} \end{aligned}
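A sketch of this computation in R (assuming queen.vec, woman.vec, and king.vec have been extracted the same way as man.vec above):

pseudo.king    <- vector.normalize( man.vec + queen.vec - woman.vec )
cos.theta.king <- sum( pseudo.king * king.vec )   # both are unit length, so the dot product is cos(theta)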
In this case \cos{\theta_{king}} = 0.655 and \cos{\theta_{man}} = 0.709. The question is: is this good or not?
To get an idea of the significance of compounding words, let’s compare the similarity of the pseudo-vector to the vector of the word itself.
As a first stab we can look at the vectors themselves as shown below. The plot simply encodes the value of each vector component. While the vectors for \textbf{king} and \textbf{pseudo.king} are similar, it’s hard, at least for me, to see that the two vectors have very much in common. For instance, is \textbf{king} any closer to \textbf{woman} than to \textbf{man}? In retrospect this is not surprising. The vectors are highly abstracted representations that make sense only in the context of other words.
It turns out it’s much more interesting to look at the vectors not in the abstract, but in the context of actual words. In the plots below I’ve taken the scalar product of the vectors \textbf{king} and \textbf{pseudo.king} with \textbf{king} and some neighboring words (for all intents and purposes, random).
Here we take just a sample of the nearest 88 neighboring words. Note that both \textbf{king} and \textbf{pseudo.king} have a high scalar product with the vector representing \textbf{king}, whereas the root vector \textbf{man} does not.
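A sketch of how such a spectrum might be computed (the matrix form and the selection of the 88 neighbors are assumptions, not the original plotting code):

word.mat <- as.matrix( word.vector.df[ , -1] )
word.mat <- word.mat / sqrt( rowSums( word.mat^2 ) )                  # unit-length rows
neighbors <- order( word.mat %*% king.vec, decreasing = TRUE )[ 1:88 ]
word.list[ neighbors ]                                                # the neighboring words themselves
spectrum.king        <- word.mat[ neighbors, ] %*% king.vec           # scalar products with king
spectrum.pseudo.king <- word.mat[ neighbors, ] %*% pseudo.king        # ... with pseudo.king
spectrum.man         <- word.mat[ neighbors, ] %*% man.vec            # ... and with the root vector man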
Another point is that the relatively higher scalar products of the pseudo-vector with words like “britain” and “george” also stand out. This is important to recognize in relation to the “compounded meaning” of text around specific words.
As a comparison, the plot below shows the scalar products for \textbf{king} and \textbf{man} with neighboring words. This reveals the scope of the change in the vector and, in some sense, the meaning of the word.
This was just a trial to test how capital and state pairs lined up. It’s interesting that California and Sacramento are “upside down”, showing a limitation of this technique.
It’s relatively easy to reproduce the “canonical” examples of word-vector relations. They work as advertised.
The value of the scalar product of two word vectors seems to convey less information than the context of that scalar product, i.e. there appears to be more information in comparing similarities across a spectrum of words. This suggests a metric of goodness might be synthesized from something like this nearest-neighbor comparison. Something to explore.
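One candidate, purely as a sketch of the idea (the correlation metric is my assumption, not something computed above):

# how similarly do king and pseudo.king "see" the neighboring words?
spectrum.similarity <- cor( as.numeric( spectrum.king ), as.numeric( spectrum.pseudo.king ) )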