Words such as “the”, “is”, and “for” tend to appear very frequently in a corpus. One way to deal with these low-context words is to remove them using a list of stop words, as we did in an earlier tutorial.
Another approach is to use a term’s inverse document frequency (idf). This approach decreases the weight for commonly used words and increases the weight of less common words. We define idf as \[\text{idf}(t,D) = \log\left( \dfrac{N}{n_t} \right),\] where the idf of a given term \(t\) in the set of documents \(D\) is a function of the total number of documents being assessed \(N\) and the number of documents where the term \(t\) appears, \(n_t\).
The idf statistic can be combined with the term frequency, tf, to form a single statistic, tf-idf, where \[\text{tf-idf}(t,d,D) = \text{tf}(t,d) \cdot \text{idf}(t,D).\] This measures the frequency of a term adjusted for how rarely it is used. Note that the ratio inside the idf’s log function is always greater than or equal to 1, so idf is always non-negative. However, as a term appears in more documents, that ratio approaches 1, bringing the idf and tf-idf closer to 0. The statistic therefore helps identify high-frequency words that provide particularly important context to a single document within a group of documents.
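To make these definitions concrete, below is a small hand calculation in R. The three-document corpus is made up purely for illustration, and we use the natural logarithm, which is what produces the idf values shown later in this tutorial.

# Toy corpus of N = 3 documents (made up for illustration)
docs <- c(doc1 = "the cat sat on the mat",
          doc2 = "the dog sat on the log",
          doc3 = "the cat chased the dog")

N   <- length(docs)                    # total number of documents: 3
n_t <- sum(grepl("\\bcat\\b", docs))   # documents containing "cat": 2

idf_cat <- log(N / n_t)                # log(3/2), about 0.405

tf_cat_doc1 <- 1 / 6                   # "cat" is 1 of the 6 words in doc1

tf_cat_doc1 * idf_cat                  # tf-idf of "cat" in doc1, about 0.068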
As before, we will take advantage of the harrypotter text that we cleaned in a previous tutorial.
library(tidyverse)
library(stringr)
library(tidytext)
library(harrypotter)

titles <- c("Philosopher's Stone",
            "Chamber of Secrets",
            "Prisoner of Azkaban",
            "Goblet of Fire",
            "Order of the Phoenix",
            "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone,
              chamber_of_secrets,
              prisoner_of_azkaban,
              goblet_of_fire,
              order_of_the_phoenix,
              half_blood_prince,
              deathly_hallows)

series <- tibble()

for(i in seq_along(titles)) {
  # tokenize each book into one word per row, keeping chapter and book title
  clean <- tibble(chapter = seq_along(books[[i]]),
                  text = books[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())

  series <- rbind(series, clean)
}

series$book <- factor(series$book, levels = rev(titles))
series
## # A tibble: 1,089,386 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
## 7 Philosopher's Stone 1 mrs
## 8 Philosopher's Stone 1 dursley
## 9 Philosopher's Stone 1 of
## 10 Philosopher's Stone 1 number
## # ... with 1,089,376 more rows
We can now create a word count for each of the books.
book_words <- series %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

series_words <- book_words %>%
  group_by(book) %>%
  summarise(total = sum(n))

book_words <- left_join(book_words, series_words)

book_words
## # A tibble: 67,881 x 4
## book word n total
## <fct> <chr> <int> <int>
## 1 Order of the Phoenix the 11740 258763
## 2 Deathly Hallows the 10335 198906
## 3 Goblet of Fire the 9305 191882
## 4 Half-Blood Prince the 7508 171284
## 5 Order of the Phoenix to 6518 258763
## 6 Order of the Phoenix and 6189 258763
## 7 Deathly Hallows and 5510 198906
## 8 Order of the Phoenix of 5332 258763
## 9 Prisoner of Azkaban the 4990 105275
## 10 Goblet of Fire and 4959 191882
## # ... with 67,871 more rows
This list shows that the non-contextual words (the words that don’t really add anything to the analysis) are the most common.
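One alternative, mentioned at the start, is simply to drop these words. A quick sketch of that approach, using the stop_words data frame that ships with tidytext (not part of this tutorial’s workflow), looks like this:

# Drop standard English stop words with an anti-join, then re-rank the counts
book_words %>%
  anti_join(stop_words, by = "word") %>%
  arrange(desc(n))

Here, though, we keep every word and let tf-idf down-weight the common ones.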
We now graph each word’s frequency by book. Since the books are of different lengths, we divide each count by the total number of words in that book. Plotted on a raw scale, the data are clustered near 0, so we apply a log scale to the x-axis to spread them out.
book_words %>%
  mutate(ratio = n / total) %>%
  ggplot(aes(ratio, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  scale_x_log10() +
  facet_wrap(~ book, ncol = 2)
Zipf’s law is an empirical law stating that, for many kinds of data studied in the physical and social sciences, the rank-frequency distribution follows an inverse relation. The law originated in quantitative linguistics, where it was observed that, given a corpus of natural language, the most common word occurs roughly twice as often as the second most common word, three times as often as the third most common word, and so on.
Formally, we can say that the \(r^{th}\) most frequent word has frequency \(f(r)\) that scales according to \(f(r) \propto \dfrac{1}{r^\alpha}\) where \(\alpha \approx 1\). The actual observed frequency will depend on the size of the corpus. However, we can see that the most common word has frequency proportional to 1 (since \(r = 1\)). The second most common word has frequency proportional to \(1/2^{\alpha}\), etc.
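As a quick illustration of this scaling (with made-up numbers, not the Harry Potter counts), if the most frequent word in a corpus occurred 1,000 times, Zipf’s law with \(\alpha = 1\) would predict counts of roughly 500, 333, 250, and 200 for the next four words:

# Predicted counts under Zipf's law with alpha = 1, assuming (purely for
# illustration) that the most frequent word occurs 1,000 times
rank <- 1:5
1000 / rank^1     # roughly 1000, 500, 333, 250, 200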
We will demonstrate Zipf’s law by plotting the data on a log-log graph, with the axes being log(rank order) and log(term frequency).
On a log-log graph, power law relationships of the form \(y = kx^n\) (such as the one described by Zipf’s law) appear linear with slope equal to \(n\). In the case of Zipf’s law, \(n = -1\), and the intercept equals the base-10 log of the relative frequency of the most frequent term.
Notice that if \[y = kx^n,\] with \(k\) a constant proportional to the frequency of the most frequent term in our corpus, then taking base-10 logs gives \[\log y = \log k + n\log x.\] Setting \(Y = \log y\), \(X = \log x\), \(b = \log k\), and \(m = n\), we have \[Y = mX + b,\] a linear relationship between \(Y\) and \(X\). Therefore, if we have a log-log graph with slope \(-1\), we know that \[y = \dfrac{10^b}{x},\] with \(b\) the intercept of the graph.
freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         term_freq = n / total)

ggplot(freq_by_rank, aes(rank, term_freq, color = book)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10()
The graph illustrates that the distribution is similar across the seven books. Fitting a regression line to the higher-frequency words (ranks below 500), we see that the tails of the distribution deviate from it, suggesting our corpus doesn’t follow Zipf’s law perfectly; however, it is close enough to say that the law approximately holds for this corpus of text.
lower_rank <- freq_by_rank %>%
  filter(rank < 500)

lm(log10(term_freq) ~ log10(rank), data = lower_rank)
##
## Call:
## lm(formula = log10(term_freq) ~ log10(rank), data = lower_rank)
##
## Coefficients:
## (Intercept) log10(rank)
## -0.9414 -0.9694
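Following the earlier derivation, these coefficients can be read back as an approximate power law; the sketch below simply plugs the fitted slope and intercept into \(y = 10^{b} x^{m}\). The plot that follows overlays this fitted line on the empirical distributions.

# Read the fit back as a power law: term_freq ~ 10^(-0.9414) * rank^(-0.9694),
# i.e. a slope close to Zipf's -1
zipf_fit <- function(rank) 10^(-0.9414) * rank^(-0.9694)
zipf_fit(1)     # implied relative frequency of the top-ranked word, about 0.11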
freq_by_rank %>%
  ggplot(aes(rank, term_freq, color = book)) +
  geom_abline(intercept = -0.9414,
              slope = -0.9694,
              color = "gray50",
              linetype = 2) +
  geom_line(size = 1.2, alpha = 0.8) +
  scale_x_log10() +
  scale_y_log10()
The tf-idf statistic attempts to find the important words for the content of each document. It does so by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection of documents. By calculating tf-idf, we attempt to spot words that are important (i.e., common), but not too common. The tidytext package has a bind_tf_idf function to make this calculation easy.
book_words <- book_words %>%
  bind_tf_idf(word, book, n)

book_words
## # A tibble: 67,881 x 7
## book word n total tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Order of the Phoenix the 11740 258763 0.0454 0 0
## 2 Deathly Hallows the 10335 198906 0.0520 0 0
## 3 Goblet of Fire the 9305 191882 0.0485 0 0
## 4 Half-Blood Prince the 7508 171284 0.0438 0 0
## 5 Order of the Phoenix to 6518 258763 0.0252 0 0
## 6 Order of the Phoenix and 6189 258763 0.0239 0 0
## 7 Deathly Hallows and 5510 198906 0.0277 0 0
## 8 Order of the Phoenix of 5332 258763 0.0206 0 0
## 9 Prisoner of Azkaban the 4990 105275 0.0474 0 0
## 10 Goblet of Fire and 4959 191882 0.0258 0 0
## # ... with 67,871 more rows
What we note from the first few rows of the table is that the tf values are high for the high-frequency words, while the idf values are all 0 since these words show up in all seven books and \(\log(1) = 0\). Thus, the product tf-idf is also 0.
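We can verify this by hand: “the” appears in all seven books, so the ratio inside the logarithm is \(7/7 = 1\) and the idf is 0. A quick check using the book_words data frame from above:

# "the" occurs in every book, so n_t = N = 7 and idf = log(7/7) = 0
N_books <- n_distinct(book_words$book)
n_t     <- book_words %>% filter(word == "the") %>% nrow()
log(N_books / n_t)
## [1] 0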
If we look at the words with the highest tf-idf values, these are mostly names of characters unique to each book.
book_words %>%
  arrange(desc(tf_idf))
## # A tibble: 67,881 x 7
## book word n total tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Half-Blood Prince slughorn 335 171284 0.00196 1.25 0.00245
## 2 Deathly Hallows c 1300 198906 0.00654 0.336 0.00220
## 3 Order of the Phoenix umbridge 496 258763 0.00192 0.847 0.00162
## 4 Goblet of Fire bagman 208 191882 0.00108 1.25 0.00136
## 5 Chamber of Secrets lockhart 197 85401 0.00231 0.560 0.00129
## 6 Prisoner of Azkaban lupin 369 105275 0.00351 0.336 0.00118
## 7 Goblet of Fire winky 145 191882 0.000756 1.25 0.000947
## 8 Goblet of Fire champions 84 191882 0.000438 1.95 0.000852
## 9 Deathly Hallows xenophilius 79 198906 0.000397 1.95 0.000773
## 10 Half-Blood Prince mclaggen 65 171284 0.000379 1.95 0.000738
## # ... with 67,871 more rows
Therefore, we can get a snapshot of each book by looking at its ten highest tf-idf words.
book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))),
         book = factor(book, levels = titles)) %>%
  group_by(book) %>%
  top_n(10, wt = tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_bar(stat = "identity",
           alpha = .8,
           show.legend = FALSE) +
  labs(title = "Highest tf-idf words in the Harry Potter series",
       x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()
“AFIT Data Science Lab R Programming Guide.” Accessed August 3, 2021. Available here.
“NRC Emotion Lexicon.” Accessed August 6, 2021. Available here.
“Introduction to Tidytext.” Accessed August 10, 2021. Available here.
Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach, 2017. Available here.
“Text Mining: Creating Tidy Text · UC Business Analytics R Programming Guide.” Accessed August 3, 2021. Available here.
“Zipf Law.” In Encyclopedia of Mathematics. EMS Press, 2001. Available here.