The intent of this report is to demonstrate a simple example of a natural language processing technique known as the tf-idf statistic. The tf-idf statistic is useful in a variety of contexts, ranging from anomaly detection to information retrieval. In this example, it is used to extract words that are characteristic of each of two sports: baseball and cricket.
To proceed, the Wikipedia page for each sport was downloaded as a .txt file and cleaned beforehand to remove text that was part of the page's code. An intermediate knowledge of the tidyverse package is recommended in order to follow the code. I first learned of this topic through the book Text Mining with R by Julia Silge and David Robinson, which can be read at https://www.tidytextmining.com/.
# packages
library(tidyverse)   # data manipulation and plotting
library(tidytext)    # tokenization and tf-idf helpers
library(jpeg)
library(kableExtra)  # table formatting

# read each Wikipedia page, one line of raw text per row
cricket <- read.delim2('C://Users//Owner//Documents//Github//tf_idfpresentationonbaseballandcricket//cricket - Wikipedia.txt', header = FALSE, fill = FALSE, col.names = 'cricket.words', stringsAsFactors = FALSE)
baseball <- read.delim2('C://Users//Owner//Documents//Github//tf_idfpresentationonbaseballandcricket//Baseball - Wikipedia.txt', header = FALSE, fill = FALSE, col.names = 'baseball.words', stringsAsFactors = FALSE)

# color palette used in the plots
pal <- c("#4F628E","#7887AB", "#2E4272", "#AA8439")
In the context of text analysis, the term document describes a unit of grouping (most often an actual document) for analysis, a token is the unit into which the text is segmented, and a corpus is a collection of documents. Tokenization is the process of breaking text into tokens [1]. Below are some examples with possible use cases, followed by a short sketch of what tokenization looks like. Note that the use cases are not necessarily simple; that is, the tf-idf statistic would probably be used alongside other data science techniques!
| Corpus of Documents | Token Structure | Possible Use Case |
|---|---|---|
| A collection of applications for insurance that vary in structure | single words | Speedy informational retrieval (locations, doctors, hospitals) |
| A collection of daily credit card transactions | codes that represent properties of the transactions | Detect fraudulent transactions |
| A collection of student essays | single words | Search for atypical vocabulary that may indicate plagiarism |
| A collection of a user’s tweets | single words | View how hobbies or interests change over time |
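To make the idea of token structure concrete, here is a minimal sketch of how unnest_tokens can segment the same text into different tokens. The one-row txt data frame is made up purely for illustration.

# illustrative one-sentence 'document'
txt <- data.frame(text = 'The batter hit the ball', stringsAsFactors = FALSE)

# single words: the token structure used throughout this report
txt %>% unnest_tokens(word, text, token = 'words')

# bigrams: pairs of consecutive words
txt %>% unnest_tokens(bigram, text, token = 'ngrams', n = 2)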
A peek at the baseball data shows that each row contains a different amount of text. The data is currently not tidy.
head(baseball)
## baseball.words
## 1 Baseball
## 2 From Wikipedia, the free encyclopedia
## 3 This article is about the sport. For the ball used in the sport, see Baseball
## 4 (ball). For other uses, see Baseball (disambiguation).
## 5 Base ball redirects here. For old time baseball, see Vintage base ball.
## 6 game
The unnest_tokens function from the tidytext package tidies the text data through tokenization. In this case, the text is tokenized into single words (token = 'words').
# tokenize the baseball text into one word per row
baseball.tokenized <-
  baseball %>%
  unnest_tokens(output = word, input = baseball.words, token = 'words') %>%
  mutate(sport = 'baseball') %>%  # label each token with its document
  select(sport, word)             # just reorders the columns
head(baseball.tokenized)
## sport word
## 1 baseball baseball
## 2 baseball from
## 3 baseball wikipedia
## 4 baseball the
## 5 baseball free
## 6 baseball encyclopedia
The same is done for the cricket dataset and then the two datasets are combined.
# tokenize the cricket text in the same way
cricket.tokenized <-
  cricket %>%
  unnest_tokens(word, cricket.words) %>%
  mutate(sport = 'cricket')

# combine the two documents into a single tidy data frame
sports <- bind_rows(baseball.tokenized, cricket.tokenized)
In text analysis, raw relative frequencies are often not meaningful on their own, because the most frequent words tend to be the same in every document. Observe the most frequent words for each sport.
# count each word within each sport and compute each sport's total word count
wordfreq <-
  sports %>%
  group_by(sport, word) %>%
  summarise(N = n()) %>%   # leaves the result grouped by sport
  mutate(total = sum(N)) %>%
  ungroup()
# plot the 15 most frequent words for each sport
wordfreq %>%
  mutate(freq = N/total) %>%
  arrange(desc(freq)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%  # fix bar order
  group_by(sport) %>%
  top_n(15, freq) %>%
  ungroup() %>%
  ggplot(aes(word, freq, fill = sport)) +
  scale_fill_manual(values = pal[c(1,4)]) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sport, scales = 'free') +
  coord_flip()
There are a few words that seem important (such as baseball, cricket, league, base, and ball), but this is not particularly useful for describing the properties of each sport. Many of the most frequent words are known as stop words and carry little meaning for analysis. One option is to simply remove them (a sketch of that approach follows below), but the tf-idf statistic offers a more principled alternative.
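For reference, here is a minimal sketch of the stop-word approach, using the stop_words lexicon that ships with tidytext; the name sports.nostop is just for illustration and is not used elsewhere in this report.

# drop every token that appears in the stop-word lexicon
data(stop_words)
sports.nostop <-
  sports %>%
  anti_join(stop_words, by = 'word')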
The tf-idf statistic solves this problem by giving those frequent words less weight, known as the inverse document frequency, idf. It is calculated by:

$$idf(\text{term}) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing the term}}}\right)$$

As this ratio approaches 1, the idf converges to 0, which results in lower idf values for commonly occurring words.

The full statistic is the idf multiplied by the term frequency, tf, the ratio of occurrences of the term to the total number of terms in the document:

$$\text{tf-idf} = tf \times idf$$

This is the simplest implementation of the statistic, and there are quite a few variations.
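As a quick sanity check of the formulas, here is a minimal sketch for a toy two-document corpus; the counts and frequencies are made up purely for illustration.

# toy corpus of 2 documents: 'ball' appears in both, 'wicket' in only one
n_documents <- 2
idf_ball   <- log(n_documents / 2)  # ln(1) = 0: shared words carry no weight
idf_wicket <- log(n_documents / 1)  # ln(2) ~ 0.693

# if 'wicket' makes up 1% of the terms in its document:
tf_wicket <- 0.01
tf_wicket * idf_wicket              # tf-idf ~ 0.0069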
Before continuing with a tf-idf analysis, it is recommended to view the distribution of terms in the data, because it will affect the results. The word frequencies in the cricket and baseball data appear to follow the heavily skewed, long-tailed distribution typical of natural language, which means it should be relatively easy to pick out the characteristic terms. If the data were closer to uniform, the tf-idf statistic would most likely pick up only on characteristic outliers.
# histogram of term frequencies for each sport
ggplot(wordfreq, aes(N/total, fill = sport)) +
  geom_histogram(show.legend = FALSE) +
  scale_fill_manual(values = pal[c(2,3)]) +
  xlim(NA, 0.01) +
  facet_wrap(~sport, ncol = 2, scales = 'free_y')
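A complementary way to see this long-tailed behavior is a rank-frequency plot on log-log axes, where an approximately straight line is the signature of Zipf's law. A minimal sketch, reusing the wordfreq data frame from above:

# rank words by frequency within each sport and plot on log-log axes
wordfreq %>%
  group_by(sport) %>%
  arrange(desc(N)) %>%
  mutate(rank = row_number(), freq = N/total) %>%
  ungroup() %>%
  ggplot(aes(rank, freq, color = sport)) +
  geom_line(show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()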
The tidytext package contains a function that will append the tf, idf, and tf-idf statistics to the data frame.
# append the tf, idf, and tf-idf columns
sports_tfidf <-
  wordfreq %>%
  bind_tf_idf(term = word, document = sport, n = N)
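Note that with only two documents, any word that appears in both pages receives an idf, and therefore a tf-idf, of exactly 0; only words unique to one page can score highly. A quick way to spot-check a few terms (the words chosen here are just examples):

# inspect the statistic for a few hand-picked terms
sports_tfidf %>%
  filter(word %in% c('ball', 'pitchers', 'bowlers')) %>%
  arrange(word, sport)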
Those tokens with the highest tf-idf are those that are characteristic of each sport. These results are informative: for example, baseball uses 'pitchers' while cricket uses 'bowlers'. It also seems that baseball may be more popular in America (mlb, american), while cricket may be more popular worldwide (icc, nations).
# plot the 15 highest tf-idf words for each sport
sports_tfidf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%  # fix bar order
  group_by(sport) %>%
  top_n(15, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = sport)) +
  scale_fill_manual(values = pal[c(1,4)]) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sport, scales = 'free') +
  coord_flip()
It should be noted that in practice it may still be necessary to remove stop words and to decide how punctuation should be handled. For example, 'batter's' is similar to 'batter', and 'needed' could be classified as a stop word. Different variations of a word, such as 'bowlers' and 'bowler', could also be adjusted for, a process known as stemming or lemmatization.
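One way to collapse such variants is stemming. Below is a minimal sketch using the SnowballC package (which is not loaded elsewhere in this report); the name sports.stemmed is just for illustration.

# stem each token so that variants like 'bowlers' and 'bowler'
# both reduce to 'bowler'
library(SnowballC)
sports.stemmed <-
  sports %>%
  mutate(word = wordStem(word, language = 'english'))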
[1] Silge, J., & Robinson, D. (2019, March 23). Text Mining with R. Retrieved from https://www.tidytextmining.com/