Words such as “the”, “is”, and “for” tend to appear very frequently in a corpus. One way to deal with these low-context words is to remove them using a list of stop words, as we did in an earlier tutorial.
Another approach is to use a term’s inverse document frequency (idf). This approach decreases the weight for commonly used words and increases the weight of less common words. We define idf as \[\text{idf}(t,D) = \log\left( \dfrac{N}{n_t} \right),\] where the idf of a given term \(t\) in the set of documents \(D\) is a function of the total number of documents being assessed \(N\) and the number of documents where the term \(t\) appears, \(n_t\).
The idf statistic can be combined with the term frequency, tf, to form a single statistic, tf-idf, where \[\text{tf-idf}(t,d,D) = \text{tf}(t,d) \cdot \text{idf}(t,D).\] This measures the frequency of a term adjusted for how rarely it is used. Note that the ratio inside the idf’s log function is always greater than or equal to 1, so idf is always non-negative. However, as a term appears in more documents, that ratio approaches 1, bringing the idf and tf-idf closer to 0. The statistic therefore helps identify high-frequency words that provide particularly important context to a single document within a group of documents.
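To make these definitions concrete, below is a small hand calculation in R. The three-document corpus is made up purely for illustration, and we use the natural logarithm, which is what produces the idf values shown later in this tutorial.

# Toy corpus of N = 3 documents (made up for illustration)
docs <- c(doc1 = "the cat sat on the mat",
          doc2 = "the dog sat on the log",
          doc3 = "the cat chased the dog")

N   <- length(docs)                    # total number of documents: 3
n_t <- sum(grepl("\\bcat\\b", docs))   # documents containing "cat": 2

idf_cat <- log(N / n_t)                # log(3/2), about 0.405

tf_cat_doc1 <- 1 / 6                   # "cat" is 1 of the 6 words in doc1

tf_cat_doc1 * idf_cat                  # tf-idf of "cat" in doc1, about 0.068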
As before, we will take advantage of the harrypotter text that we cleaned in a previous tutorial.
library(tidyverse)
library(stringr)
library(tidytext)
library(harrypotter)

titles <- c("Philosopher's Stone",
            "Chamber of Secrets",
            "Prisoner of Azkaban",
            "Goblet of Fire",
            "Order of the Phoenix",
            "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone,
              chamber_of_secrets,
              prisoner_of_azkaban,
              goblet_of_fire,
              order_of_the_phoenix,
              half_blood_prince,
              deathly_hallows)

series <- tibble()

for(i in seq_along(titles)) {
  # tokenize each book into one word per row, keeping chapter and book title
  clean <- tibble(chapter = seq_along(books[[i]]),
                  text = books[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())

  series <- rbind(series, clean)
}

series$book <- factor(series$book, levels = rev(titles))
series
## # A tibble: 1,089,386 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
## 7 Philosopher's Stone 1 mrs
## 8 Philosopher's Stone 1 dursley
## 9 Philosopher's Stone 1 of
## 10 Philosopher's Stone 1 number
## # ... with 1,089,376 more rows
We can now create a word count for each of the books.
book_words <- series %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

series_words <- book_words %>%
  group_by(book) %>%
  summarise(total = sum(n))

book_words <- left_join(book_words, series_words)

book_words
## # A tibble: 67,881 x 4
## book word n total
## <fct> <chr> <int> <int>
## 1 Order of the Phoenix the 11740 258763
## 2 Deathly Hallows the 10335 198906
## 3 Goblet of Fire the 9305 191882
## 4 Half-Blood Prince the 7508 171284
## 5 Order of the Phoenix to 6518 258763
## 6 Order of the Phoenix and 6189 258763
## 7 Deathly Hallows and 5510 198906
## 8 Order of the Phoenix of 5332 258763
## 9 Prisoner of Azkaban the 4990 105275
## 10 Goblet of Fire and 4959 191882
## # ... with 67,871 more rows
This list shows that the non-contextual words (the words that don’t really add anything to the analysis) are the most common.
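One alternative, mentioned at the start, is simply to drop these words. A quick sketch of that approach, using the stop_words data frame that ships with tidytext (not part of this tutorial’s workflow), looks like this:

# Drop standard English stop words with an anti-join, then re-rank the counts
book_words %>%
  anti_join(stop_words, by = "word") %>%
  arrange(desc(n))

Here, though, we keep every word and let tf-idf down-weight the common ones.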
We now graph each word’s frequency by book. Since the books are of different lengths, we divide each count by the total number of words in that book. Plotted on a raw scale, the data are clustered near 0, so we apply a log scale to the x-axis to spread them out.
book_words %>%
  mutate(ratio = n / total) %>%
  ggplot(aes(ratio, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  scale_x_log10() +
  facet_wrap(~ book, ncol = 2)
Zipf’s law is an empirical law stating that, for many kinds of data studied in the physical and social sciences, the rank-frequency distribution follows an inverse relation. The law originated in quantitative linguistics, where it was observed that, given a corpus of natural language, the most common word occurs roughly twice as often as the second most common word, three times as often as the third most common word, and so on.
Formally, we can say that the \(r^{th}\) most frequent word has frequency \(f(r)\) that scales according to \(f(r) \propto \dfrac{1}{r^\alpha}\) where \(\alpha \approx 1\). The actual observed frequency will depend on the size of the corpus. However, we can see that the most common word has frequency proportional to 1 (since \(r = 1\)). The second most common word has frequency proportional to \(1/2^{\alpha}\), etc.
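As a quick illustration of this scaling (with made-up numbers, not the Harry Potter counts), if the most frequent word in a corpus occurred 1,000 times, Zipf’s law with \(\alpha = 1\) would predict counts of roughly 500, 333, 250, and 200 for the next four words:

# Predicted counts under Zipf's law with alpha = 1, assuming (purely for
# illustration) that the most frequent word occurs 1,000 times
rank <- 1:5
1000 / rank^1     # roughly 1000, 500, 333, 250, 200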
We will demonstrate Zipf’s law by plotting the data on a log-log graph, with the axes being log(rank order) and log(term frequency).
On a log-log graph, power law relationships of the form \(y = kx^n\) (such as the one described by Zipf’s law) appear linear with slope equal to \(n\). In the case of Zipf’s law, \(n = -1\), and the intercept equals the base-10 log of the relative frequency of the most frequent term.
Notice that if \[y = kx^n,\] with \(k\) a constant proportional to the frequency of the most frequent term in our corpus, then taking base-10 logs gives \[\log y = \log k + n\log x.\] Setting \(Y = \log y\), \(X = \log x\), \(b = \log k\), and \(m = n\), we have \[Y = mX + b,\] a linear relationship between \(Y\) and \(X\). Therefore, if we have a log-log graph with slope \(-1\), we know that \[y = \dfrac{10^b}{x},\] with \(b\) the intercept of the graph.
freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         term_freq = n / total)

ggplot(freq_by_rank, aes(rank, term_freq, color = book)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10()
The graph illustrates that the distribution is similar across the seven books. Fitting a regression line to the higher-frequency words (ranks below 500), we see that the tails of the distribution deviate from it, suggesting our corpus doesn’t follow Zipf’s law perfectly; however, it is close enough to say that the law approximately holds for this corpus of text.
lower_rank <- freq_by_rank %>%
  filter(rank < 500)

lm(log10(term_freq) ~ log10(rank), data = lower_rank)
##
## Call:
## lm(formula = log10(term_freq) ~ log10(rank), data = lower_rank)
##
## Coefficients:
## (Intercept) log10(rank)
## -0.9414 -0.9694
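Following the earlier derivation, these coefficients can be read back as an approximate power law; the sketch below simply plugs the fitted slope and intercept into \(y = 10^{b} x^{m}\). The plot that follows overlays this fitted line on the empirical distributions.

# Read the fit back as a power law: term_freq ~ 10^(-0.9414) * rank^(-0.9694),
# i.e. a slope close to Zipf's -1
zipf_fit <- function(rank) 10^(-0.9414) * rank^(-0.9694)
zipf_fit(1)     # implied relative frequency of the top-ranked word, about 0.11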
freq_by_rank %>%
  ggplot(aes(rank, term_freq, color = book)) +
  geom_abline(intercept = -0.9414,
              slope = -0.9694,
              color = "gray50",
              linetype = 2) +
  geom_line(size = 1.2, alpha = 0.8) +
  scale_x_log10() +
  scale_y_log10()
The tf-idf statistic attempts to find the important words for the content of each document. It does so by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection of documents. By calculating tf-idf, we attempt to spot words that are important (i.e., common), but not too common. The tidytext package has a bind_tf_idf function to make this calculation easy.
book_words <- book_words %>%
  bind_tf_idf(word, book, n)

book_words
## # A tibble: 67,881 x 7
## book word n total tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Order of the Phoenix the 11740 258763 0.0454 0 0
## 2 Deathly Hallows the 10335 198906 0.0520 0 0
## 3 Goblet of Fire the 9305 191882 0.0485 0 0
## 4 Half-Blood Prince the 7508 171284 0.0438 0 0
## 5 Order of the Phoenix to 6518 258763 0.0252 0 0
## 6 Order of the Phoenix and 6189 258763 0.0239 0 0
## 7 Deathly Hallows and 5510 198906 0.0277 0 0
## 8 Order of the Phoenix of 5332 258763 0.0206 0 0
## 9 Prisoner of Azkaban the 4990 105275 0.0474 0 0
## 10 Goblet of Fire and 4959 191882 0.0258 0 0
## # ... with 67,871 more rows
What we note from the first few rows of the table is that the tf values are high for the high-frequency words, while the idf values are all 0 since these words show up in all seven books and \(\log(1) = 0\). Thus, the product tf-idf is also 0.
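We can verify this by hand: “the” appears in all seven books, so the ratio inside the logarithm is \(7/7 = 1\) and the idf is 0. A quick check using the book_words data frame from above:

# "the" occurs in every book, so n_t = N = 7 and idf = log(7/7) = 0
N_books <- n_distinct(book_words$book)
n_t     <- book_words %>% filter(word == "the") %>% nrow()
log(N_books / n_t)
## [1] 0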
If we look at the words with the highest tf-idf values, these are mostly names of characters unique to each book.
book_words %>%
  arrange(desc(tf_idf))
## # A tibble: 67,881 x 7
## book word n total tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Half-Blood Prince slughorn 335 171284 0.00196 1.25 0.00245
## 2 Deathly Hallows c 1300 198906 0.00654 0.336 0.00220
## 3 Order of the Phoenix umbridge 496 258763 0.00192 0.847 0.00162
## 4 Goblet of Fire bagman 208 191882 0.00108 1.25 0.00136
## 5 Chamber of Secrets lockhart 197 85401 0.00231 0.560 0.00129
## 6 Prisoner of Azkaban lupin 369 105275 0.00351 0.336 0.00118
## 7 Goblet of Fire winky 145 191882 0.000756 1.25 0.000947
## 8 Goblet of Fire champions 84 191882 0.000438 1.95 0.000852
## 9 Deathly Hallows xenophilius 79 198906 0.000397 1.95 0.000773
## 10 Half-Blood Prince mclaggen 65 171284 0.000379 1.95 0.000738
## # ... with 67,871 more rows
Therefore, we can get a snapshot of each book by looking at its ten highest tf-idf words.
book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))),
         book = factor(book, levels = titles)) %>%
  group_by(book) %>%
  top_n(10, wt = tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_bar(stat = "identity",
           alpha = .8,
           show.legend = FALSE) +
  labs(title = "Highest tf-idf words in the Harry Potter series",
       x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()
“AFIT Data Science Lab R Programming Guide.” Accessed August 3, 2021. Available here.
“NRC Emotion Lexicon.” Accessed August 6, 2021. Available here.
“Introduction to Tidytext.” Accessed August 10, 2021. Available here.
Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach, 2017. Available here.
“Text Mining: Creating Tidy Text · UC Business Analytics R Programming Guide.” Accessed August 3, 2021. Available here.
“Zipf Law.” In Encyclopedia of Mathematics. EMS Press, 2001. Available here.