Overview

This R script calculates and graphs term frequency-inverse document frequency (TF-IDF) scores for terms extracted from three text sources. In this example, the three sources are collections of tweets, text messages, and emails produced by Sandy Hook Promise, a nonprofit organization that advocates for gun safety.

The output from each block of code appears below the block. This script was developed using R version 4.2.1 (2022-06-23 ucrt).

Required packages

The script requires the tidytext package and four supporting packages: ggplot2, dplyr, readr, and stringr. This code checks whether each package is already installed and installs any that are missing. The code then loads each package into memory. The stringr package has no separate installation check because it is installed as part of the tidyverse package.

if (!require("tidytext")) install.packages("tidytext")
## Loading required package: tidytext
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("dplyr")) install.packages("dplyr")
if (!require("readr")) install.packages("readr")

library(tidytext)
library(ggplot2)
library(dplyr)
library(readr)
library(stringr)
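
The check-then-install pattern above can also be written as a loop over package names. The following is a minimal sketch, not part of the original script. The helper name missing_packages() is hypothetical, and it relies on base R's requireNamespace() to test for a package without attaching it. The install and load lines are commented out so that running the sketch changes nothing on your system.

```r
# Hypothetical helper (not in the original script): return the names of
# the packages in `pkgs` that are not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidytext", "ggplot2", "dplyr", "readr", "stringr", "forcats")
to_install <- missing_packages(needed)
to_install  # empty when every package is already installed

# Uncomment to install whatever is missing, then load every package:
# if (length(to_install) > 0) install.packages(to_install)
# invisible(lapply(needed, library, character.only = TRUE))
```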

Importing the data

The text analyzed in this demonstration is stored in a comma-separated values file named AllMessages.csv. The file has two columns, “source” and “text,” and 942 rows. The “source” column indicates whether a given row contains text from a tweet, an email, or a text message. The “text” column contains the text of that tweet, email, or text message. The code reads the .csv file and stores its contents in a data frame called text_df, then counts the rows from each source.

The script works with any .csv file organized this way, whatever source names and text it contains.

text_df <- read_csv("AllMessages.csv")
## Rows: 942 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): source, text
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
text_df %>% count(source)
## # A tibble: 3 × 2
##   source      n
##   <chr>   <int>
## 1 Email      46
## 2 SMS        38
## 3 Twitter   858
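
To try the script on your own data, it may help to see the expected layout in miniature. The sketch below writes a three-row file to a temporary location and reads it back. The rows are made up for illustration and are not taken from AllMessages.csv. It also declares the column types up front, which is one way to quiet the column specification message shown above.

```r
library(readr)

# Made-up rows in the same two-column layout as AllMessages.csv.
# Text that contains a comma must be wrapped in double quotes.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  'source,text',
  'Email,"Thank you for your support, friend."',
  'SMS,Reply STOP to quit',
  'Twitter,Honor with action.'
), toy_path)

# Declaring both columns as character also suppresses the
# column specification message that read_csv() prints by default.
toy_df <- read_csv(
  toy_path,
  col_types = cols(source = col_character(), text = col_character())
)
toy_df
```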

Tokenizing the text

This code breaks the text into individual words, or “tokens,” counts the number of times each word appears in each source, and stores the per-source word counts in a new data frame called tidy_text.

tidy_text <- text_df %>% 
  unnest_tokens(word,text) %>% 
  count(source,word, sort = TRUE)
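
The same two steps can be seen on a small scale. This sketch runs them on two made-up messages (the data frame toy_df and its contents are assumptions for illustration). With its default settings, unnest_tokens() converts words to lowercase and drops punctuation, so “Protect” and “protect” are counted together.

```r
library(dplyr)
library(tidytext)

# Two made-up messages, one per source.
toy_df <- tibble(
  source = c("Email", "SMS"),
  text = c("Protect our kids. Protect our schools!", "Text STOP to quit")
)

toy_tokens <- toy_df %>%
  unnest_tokens(word, text) %>%      # one lowercase word per row
  count(source, word, sort = TRUE)   # occurrences of each word per source
toy_tokens
```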

Omitting words

Are there any words you want to omit from the analysis? You can specify them here. The script will delete them from the tidy_text data frame. You can skip this code if you want to.

my_stopwords <- tibble(word = c("stop2quit",
                                "it’s",
                                "we’re",
                                "shp",
                                "matthew"))
tidy_text <- tidy_text %>% 
  anti_join(my_stopwords)
## Joining, by = "word"
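
The “Joining” message above reports that anti_join() guessed which column to match on. As a small illustration with made-up counts (the toy_counts and toy_stopwords data frames are assumptions, not part of the script), the sketch below names the column explicitly with by = "word", which removes the guesswork and the message.

```r
library(dplyr)

# Made-up word counts and a made-up list of words to omit.
toy_counts <- data.frame(
  source = c("SMS", "SMS", "Email"),
  word = c("stop2quit", "vote", "shp"),
  n = c(4L, 2L, 3L),
  stringsAsFactors = FALSE
)
toy_stopwords <- data.frame(
  word = c("stop2quit", "shp"),
  stringsAsFactors = FALSE
)

# Keep only the rows of toy_counts whose word is NOT in toy_stopwords.
kept <- anti_join(toy_counts, toy_stopwords, by = "word")
kept
```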

Adding source word totals

This code adds a column showing the total number of words in each source. The column isn’t needed for calculating TF-IDF, but it can be helpful for checking how much text each source contributes. You can skip this code if you like.

total_words <- tidy_text %>% 
  group_by(source) %>% 
  summarize(total = sum(n))
tidy_text <- left_join(tidy_text,total_words)
## Joining, by = "source"

Calculating TF-IDF scores

This code calculates a TF-IDF score for each word within each source, stores the results in a data frame called tidy_text_tf_idf, then sorts the data frame in descending order by TF-IDF score.

tidy_text_tf_idf <- tidy_text %>%
  bind_tf_idf(word, source, n) %>% 
  arrange(desc(tf_idf))
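
To make the score less of a black box, here is the arithmetic worked by hand in base R on made-up counts. It is a sketch of the standard definition, which is what bind_tf_idf() is understood to implement: term frequency is a word’s count divided by the total words in its source, inverse document frequency is the natural log of the number of sources divided by the number of sources containing the word, and TF-IDF is their product. The counts data frame is an assumption for illustration.

```r
# Made-up per-source word counts: one row per source-word pair.
counts <- data.frame(
  source = c("Email", "Email", "SMS", "SMS", "Twitter", "Twitter"),
  word = c("donate", "the", "reply", "the", "vote", "the"),
  n = c(2, 8, 1, 3, 5, 15),
  stringsAsFactors = FALSE
)

# Term frequency: a word's share of all words in its source.
counts$total <- ave(counts$n, counts$source, FUN = sum)
counts$tf <- counts$n / counts$total

# Inverse document frequency: log(number of sources / sources containing the word).
n_sources <- length(unique(counts$source))
sources_with_word <- table(counts$word)
counts$idf <- log(n_sources / as.numeric(sources_with_word[counts$word]))

# TF-IDF: the product of the two.
counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

A word such as “the,” which appears in every source, gets an inverse document frequency of log(3/3) = 0 and therefore a TF-IDF score of zero, no matter how often it occurs. This is why common words tend to drop out of the results without being listed as stop words.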

Graphing top TF-IDF words by source

This code produces one bar chart per source showing the dozen largest TF-IDF scores and their associated words. If several words tie at a TF-IDF score that is among the dozen largest, the chart includes a bar for each of the tied words. Thus, a source’s chart can, at times, have more than a dozen bars.

Change 12 in n = 12 to display fewer or more TF-IDF scores per source. Change 2 in ncol = 2 to set how many columns of panels the chart is arranged in.

library(forcats)

tidy_text_tf_idf %>%
  group_by(source) %>%
  slice_max(tf_idf, n = 12) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~source, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)
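
The tie behavior described above comes from slice_max(), which keeps tied rows by default. If you would rather cap each chart at exactly n bars, one option is the with_ties = FALSE argument. The sketch below shows the difference on made-up scores (the toy data frame is an assumption for illustration) and asks for the top two.

```r
library(dplyr)

# Made-up scores in which "b" and "c" tie for second place.
toy <- data.frame(
  source = "SMS",
  word = c("a", "b", "c", "d"),
  tf_idf = c(0.3, 0.2, 0.2, 0.1),
  stringsAsFactors = FALSE
)

# Default behavior: both tied words are kept, so 3 rows come back.
top_with_ties <- toy %>%
  group_by(source) %>%
  slice_max(tf_idf, n = 2) %>%
  ungroup()

# with_ties = FALSE: exactly 2 rows come back.
top_without_ties <- toy %>%
  group_by(source) %>%
  slice_max(tf_idf, n = 2, with_ties = FALSE) %>%
  ungroup()

nrow(top_with_ties)
nrow(top_without_ties)
```

The trade-off is that dropping ties makes the choice among identically scored words arbitrary, which is a reason to keep the default when the exact number of bars doesn’t matter.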

Exporting the data

This code exports the tidy_text_tf_idf data frame as a comma-separated values file called TFIDFScores.csv. Because the filename is a relative path, the file is saved in R’s current working directory, which is usually, but not always, the folder containing the script. You can use a different filename if you like. If a file named TFIDFScores.csv already exists there, the script will overwrite it without warning.

write_excel_csv(tidy_text_tf_idf, file = "TFIDFScores.csv")
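
If silently replacing an earlier export is a concern, one possible guard is to check for the file first and pick a different name when it exists. The helper below is a hypothetical sketch in base R, not part of the original script. The function name safe_output_path() and the timestamp format are assumptions.

```r
# Hypothetical helper (not in the original script): if `path` already exists,
# return a timestamped variant of it; otherwise return `path` unchanged.
safe_output_path <- function(path) {
  if (!file.exists(path)) return(path)
  stamp <- format(Sys.time(), "%Y%m%d-%H%M%S")
  ext <- tools::file_ext(path)
  base <- tools::file_path_sans_ext(path)
  paste0(base, "_", stamp, if (nzchar(ext)) paste0(".", ext) else "")
}

# Demonstration with temporary files, so nothing in your project is touched.
new_file <- tempfile(fileext = ".csv")   # does not exist yet
old_file <- tempfile(fileext = ".csv")
file.create(old_file)                    # now it exists

safe_output_path(new_file)   # returned unchanged
safe_output_path(old_file)   # returned with a timestamp before ".csv"
```

With this helper defined, the export line could be written as write_excel_csv(tidy_text_tf_idf, file = safe_output_path("TFIDFScores.csv")).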