Introduction

We will use Kernel Density Estimation (KDE) to compute a form of smoothed word co-occurrence, so we can see which words occur in similar contexts.

Loading data

We mostly use tidy packages, plus ggraph and ggiraph for plotting the network:

library(tidyverse)
library(tidytext)
library(tidygraph)
library(ggraph)
library(ggiraph)

We will use the text of War and Peace as a test case. First, read in the book, tokenize, and remove stop words:

# Read in 'War and Peace'
fn = "book-war-and-peace.txt"
source = "https://raw.githubusercontent.com/mmcky/nyu-econ-370/master/notebooks/data/book-war-and-peace.txt"
if (!file.exists(fn)) download.file(source, destfile = fn)
# Tokenize and remove stop words (the top-n words are extracted later)
t = tibble(text=read_file(fn))
tokens = unnest_tokens(t, word, text) |>
  anti_join(filter(stop_words, lexicon=="snowball"), by="word")
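As a quick sanity check (a hedged example; the exact counts depend on the tokenizer and the snowball stop-word list), we can peek at the resulting token stream:

# How many tokens survived stop-word removal, and what they look like
nrow(tokens)
head(tokens$word, 10)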

Kernel Density Estimation

We define a helper function that estimates the PDF of a word's positions in the text, using the built-in density function:

#' Compute the KDE smoothing of the occurrence of 'target' in the 'tokens'
#' 
#' @param tokens a character vector of words in the corpus
#' @param target a word to look for in the corpus
#' @param n the number of points to sample
#' @param bw the bandwidth of the smoothing
kde <- function(tokens, target, n=1000, bw=5000) {
  d = density(which(tokens == target), from=1, to=length(tokens), 
              n=n, kernel = "gaussian", bw = bw)
  # Rescale so the sampled values sum to (approximately) 1
  d$y / n * length(tokens)
}
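
To get a feel for what this does, here is a small illustrative sketch; the toy vector and the word "cat" are made up for illustration:

# Positions of "cat" in a toy token vector
toy <- c("the", "cat", "sat", "on", "the", "cat", "mat", "cat")
which(toy == "cat")   # 2 6 8
# kde() smooths these positions into a density sampled at n points
kde(toy, "cat", n=10, bw=1)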

Now, let’s define some parameters:

TOPWORDS=500    # number of most frequent words to keep
SAMPLES=1000    # number of points at which to sample each KDE
BANDWIDTH=2000  # KDE bandwidth, in token positions
SKIM=10         # number of most-similar neighbours to keep per word

Get the top-N most frequent words:

topwords <- tokens |> 
  group_by(word) |> 
  summarize(n=n()) |> 
  arrange(-n) |> 
  pull(word) |>
  head(n=TOPWORDS)
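
As a quick check (the exact ordering will depend on the stop-word list), we can look at the first few entries:

# The most frequent non-stop words in the book
head(topwords)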

And compute the PDFs for each word, collected into a long-format data frame:

kde_row <- function(target) {
  p = kde(tokens$word, target, n=SAMPLES, bw=BANDWIDTH)
  tibble(offset=1:SAMPLES, word=target, p=p)
}
kdes <- map(topwords, kde_row, .progress=TRUE) |> list_rbind()
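
Each word's sampled density should sum to approximately 1 (apart from a little mass smoothed past the ends of the book); the distance normalization below relies on this, so a quick check is worthwhile:

# Every word's probability vector should sum to roughly 1
kdes |>
  group_by(word) |>
  summarize(total=sum(p)) |>
  summarize(min=min(total), max=max(total))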

Smoothed word occurrence

We can reproduce the graph of Napoleon-related words:

options(scipen = 999)
kdes |>
  filter(word %in% c("napoleon", "war", "military", "order", "general")) |>
  ggplot(aes(x=offset, y=p, color=word)) + geom_line() + 
  scale_color_manual(values=c("napoleon"="blue", war="darkgreen", military="red", order="cyan", general="magenta"), name="") + 
  theme_classic() +
  scale_y_continuous() + xlab("Word Offset") + ylab("Occurrence Density") + ggtitle("War") 

Conceptual mapping

We calculate the similarity between words using the built-in dist function. Since dist operates on matrices, we first convert our long-format tibble to a wide matrix, compute the distances, and then convert the result back to a long-format tibble.

distances <- kdes |> 
  pivot_wider(names_from=offset, values_from=p) |>
  column_to_rownames("word") |>
  as.matrix() |>
  dist(method='manhattan') |>
  as.matrix() |> 
  as_tibble(rownames="word1") |> 
  pivot_longer(-word1, names_to = "word2", values_to="sim") |>
  mutate(sim=1-sim/2)

(Note: we divide by two because each vector sums to 1, so the Manhattan distance ranges from 0 to 2 and the resulting similarity ranges from 0 to 1.)
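
A quick hedged check that the similarities behave as expected: they should fall in [0, 1], and every word should be maximally similar to itself:

# Similarities should lie in [0, 1], with sim = 1 on the diagonal
range(distances$sim)
distances |> filter(word1 == word2) |> summarize(min=min(sim))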

Compute Edges

For each word, we keep the top SKIM most similar terms:

edges <- distances |> 
  filter(word1 != word2) |> 
  group_by(word1) |> 
  slice_max(sim, n=SKIM) |>
  rename(from=word1, to=word2, weight=sim) |>
  ungroup()

Let’s check which words Napoleon is most similar to:

edges |> filter(from == "napoleon")
## # A tibble: 10 × 3
##    from     to       weight
##    <chr>    <chr>     <dbl>
##  1 napoleon military  0.760
##  2 napoleon war       0.740
##  3 napoleon given     0.736
##  4 napoleon beyond    0.733
##  5 napoleon clear     0.731
##  6 napoleon contrary  0.727
##  7 napoleon ordered   0.720
##  8 napoleon men       0.715
##  9 napoleon many      0.712
## 10 napoleon orders    0.711

Plot the graph

To create a tidygraph object, we first list the nodes and create a tbl_graph:

nodes <- distances |> select(name=word1) |> unique()
g <- tbl_graph(nodes=nodes, edges = edges)
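
Printing the graph shows the node and edge tables and is a quick way to confirm the character from/to columns were matched to node names (the exact counts depend on TOPWORDS and SKIM):

# tbl_graph prints its node and edge tables, with counts
g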

Now, we can plot this. We add a color column to the nodes and edges to highlight Napoleon and its connections. Note that we use ggiraph to render the plot as a large SVG so that the labels remain readable:

# Index of the "napoleon" node, and the ids/names of the nodes it links to
napoleon=which(nodes$name=="napoleon")
highlight_ids=g |> activate("edges") |> filter(from==napoleon) |> pull(to)
highlights = nodes$name[highlight_ids]

ggobj <- g |> activate("edges") |> mutate(color=if_else(from==napoleon, "darkred", "grey")) |>
  activate("nodes") |> mutate(color=if_else(name %in% highlights, "darkred", "black")) |>
  ggraph(layout = "igraph", algorithm="fr") + 
  geom_edge_link(aes(edge_color=color)) + 
  geom_node_text(aes(label=name, color=color), repel=F) +  
  scale_edge_color_identity() +
  scale_color_identity() +
  theme_graph()
girafe(ggobj=ggobj, width_svg = 25, height_svg = 25,
       options = list(opts_sizing(rescale = FALSE)))
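
If you want to keep the interactive version outside of the notebook, one option (an assumption on my part; any htmlwidgets-based workflow would do) is to save the girafe widget to a standalone HTML file:

# Save the interactive graph to a standalone HTML file
# ("napoleon-graph.html" is a hypothetical output filename)
widget <- girafe(ggobj=ggobj, width_svg = 25, height_svg = 25,
                 options = list(opts_sizing(rescale = FALSE)))
htmlwidgets::saveWidget(widget, "napoleon-graph.html")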