An R script for calculating and graphing term frequency-inverse document frequency (TF-IDF) scores for terms extracted from three text sources. In this example, the three sources are collections of tweets, text messages, and emails produced by Sandy Hook Promise, a nonprofit organization that advocates for gun safety.
The output from each block of code appears below the block. This script was developed using R version 4.2.1 (2022-06-23 ucrt).
The tidytext package is required, as are four supplementary packages.
This code checks whether each package is already installed and installs
the package if it isn’t. The code then loads each package into memory.
The stringr package is part of the tidyverse.
if (!require("tidytext")) install.packages("tidytext")
## Loading required package: tidytext
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("dplyr")) install.packages("dplyr")
if (!require("readr")) install.packages("readr")
library(tidytext)
library(ggplot2)
library(dplyr)
library(readr)
library(stringr)
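The check-then-install pattern above can also be written as a loop over package names. The following is a minimal sketch, not part of the original script: `find_missing_packages` is a hypothetical helper name, and it relies on base R's `requireNamespace()`, which tests whether a package is available without attaching it. The install call is left commented out so the sketch only reports what is missing.

```r
# Hypothetical helper: return the names of packages that are not installed.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The five packages the script checks for
required_pkgs <- c("tidytext", "tidyverse", "ggplot2", "dplyr", "readr")
missing_pkgs <- find_missing_packages(required_pkgs)

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install everything that is missing in one call:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```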
The text analyzed in this demonstration is stored in a comma-separated
values file named AllMessages.csv. The file has two columns, “source”
and “text,” and 942 rows. The “source” column indicates whether a given
row contains text from a tweet, an email, or a text message. The “text”
column contains the text of the tweet, email, or text message. The code
reads the .csv file, stores its contents in a data frame called text_df,
and then counts the rows from each source.
The script works on any .csv file organized as described above, whatever source names and text it contains.
text_df <- read_csv("AllMessages.csv")
## Rows: 942 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): source, text
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
text_df %>% count(source)
## # A tibble: 3 × 2
## source n
## <chr> <int>
## 1 Email 46
## 2 SMS 38
## 3 Twitter 858
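To try the script on other data, a file only needs the two columns described above. The sketch below builds a tiny stand-in file and reads it back the same way. The rows, the `toy_` names, and the temporary path are all invented for illustration and are not taken from AllMessages.csv.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for AllMessages.csv: same two columns, invented rows.
toy_messages <- tibble(
  source = c("Email", "SMS", "Twitter", "Twitter"),
  text   = c("Please donate today.",
             "Reply YES to join us.",
             "Know the signs.",
             "Thank you for your support!")
)

# Write to a temporary path so no real file is touched
toy_file <- tempfile(fileext = ".csv")
write_csv(toy_messages, toy_file)

# Read it back; show_col_types = FALSE quiets the column specification message
toy_df <- read_csv(toy_file, show_col_types = FALSE)
toy_df %>% count(source)
```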
This code breaks the text into individual terms, or “tokens,” counts
the number of times each word appears in each source, and stores the
per-source word counts in a new data frame called tidy_text.
tidy_text <- text_df %>%
  unnest_tokens(word, text) %>%
  count(source, word, sort = TRUE)
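To see what the tokenizing step does, the same two calls can be run on a couple of invented messages. This is a small sketch with hypothetical data (`toy_df` and its text are not from the original file). By default `unnest_tokens()` lowercases the text and strips punctuation, so “STOP” and “Stop” are counted as the same word.

```r
library(dplyr)
library(tidytext)

# Hypothetical messages, invented for illustration
toy_df <- tibble(
  source = c("SMS", "Twitter"),
  text   = c("Reply STOP to quit.",
             "Stop gun violence. Stop it now!")
)

toy_tokens <- toy_df %>%
  unnest_tokens(word, text) %>%        # one row per word, lowercased
  count(source, word, sort = TRUE)     # one row per source-word pair

toy_tokens
```

The first row of `toy_tokens` is the Twitter count for “stop”, which appears twice in that source. Every other source-word pair appears once.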
Are there any words you want to omit from the analysis? You can
specify them here. The script will delete them from the tidy_text
data frame. You can skip this code if you want to.
my_stopwords <- tibble(word = c("stop2quit",
                                "it’s",
                                "we’re",
                                "shp",
                                "matthew"))
tidy_text <- tidy_text %>%
  anti_join(my_stopwords)
## Joining, by = "word"
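The removal step works because `anti_join()` keeps only the rows of the first data frame that have no match in the second. The sketch below shows this on invented counts (the `toy_` objects are hypothetical). Naming the join column with `by = "word"` is optional, but it suppresses the “Joining” message.

```r
library(dplyr)

# Hypothetical word counts and stopword list, invented for illustration
toy_counts <- tibble(
  source = c("SMS", "SMS", "Email"),
  word   = c("stop2quit", "safety", "shp"),
  n      = c(5, 2, 3)
)
toy_stopwords <- tibble(word = c("stop2quit", "shp"))

# Keep only rows whose word is NOT in the stopword list
toy_kept <- toy_counts %>%
  anti_join(toy_stopwords, by = "word")

toy_kept
```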
This code adds a column showing the total number of words in each source. The column isn’t needed to calculate TF-IDF, but it can be helpful, for example when checking term frequencies by hand. You can skip this code if you like.
total_words <- tidy_text %>%
  group_by(source) %>%
  summarize(total = sum(n))
tidy_text <- left_join(tidy_text, total_words)
## Joining, by = "source"
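The summarize-then-join step can also be done in one call. The sketch below, on invented counts, compares the two-step approach with dplyr's `add_count()`, using `wt = n` to sum the counts rather than count rows. It is offered as an alternative and is not the original script's method.

```r
library(dplyr)

# Hypothetical word counts, invented for illustration
toy_counts <- tibble(
  source = c("Email", "Email", "SMS"),
  word   = c("donate", "safety", "reply"),
  n      = c(3, 1, 2)
)

# Two-step version, following the script
toy_totals <- toy_counts %>%
  group_by(source) %>%
  summarize(total = sum(n))
two_step <- left_join(toy_counts, toy_totals, by = "source")

# One-step alternative: weighted count added as a new column
one_step <- toy_counts %>%
  add_count(source, wt = n, name = "total")
```

Both versions give each Email row a total of 4 and the SMS row a total of 2.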
This code calculates a TF-IDF score for each word within each source,
stores the results in a data frame called tidy_text_tf_idf, then sorts
the data frame in descending order by TF-IDF score.
tidy_text_tf_idf <- tidy_text %>%
  bind_tf_idf(word, source, n) %>%
  arrange(desc(tf_idf))
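To make the arithmetic concrete, the scores can be worked out by hand on a toy table. The sketch below uses only dplyr and invented counts. It follows the definitions given in the tidytext documentation for `bind_tf_idf()`: term frequency is a word's count divided by the source's total count, and inverse document frequency is the natural log of the number of sources divided by the number of sources containing the word. It is an illustration of the formula, not a replacement for `bind_tf_idf()`.

```r
library(dplyr)

# Hypothetical counts: one row per (source, word) pair
toy_counts <- tibble(
  source = c("Email", "Email", "SMS", "SMS", "Twitter", "Twitter"),
  word   = c("donate", "safety", "reply", "safety", "retweet", "safety"),
  n      = c(3, 1, 2, 2, 4, 1)
)

n_sources <- n_distinct(toy_counts$source)   # 3 sources in this toy table

toy_tf_idf <- toy_counts %>%
  group_by(source) %>%
  mutate(tf = n / sum(n)) %>%                             # share of the source's words
  group_by(word) %>%
  mutate(idf = log(n_sources / n_distinct(source))) %>%   # natural log
  ungroup() %>%
  mutate(tf_idf = tf * idf) %>%
  arrange(desc(tf_idf))

toy_tf_idf
```

Because “safety” appears in all three toy sources, its idf is log(3/3) = 0 and its TF-IDF score is 0 everywhere. With only three sources, any word used by all of them drops out of the rankings in this way.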
This code produces one bar chart per source showing the dozen largest TF-IDF scores and their associated words. If several words share the same TF-IDF score, and that score is among the dozen largest, the chart includes a bar for each of the tied words. Thus, a source’s chart can, at times, have more than a dozen bars.
Change 12 in n = 12 to display fewer or more TF-IDF scores per source.
Change 2 in ncol = 2 to change the number of columns in which the
per-source panels are arranged.
library(forcats)
tidy_text_tf_idf %>%
  group_by(source) %>%
  slice_max(tf_idf, n = 12) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~source, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)
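The extra bars come from `slice_max()`, which keeps tied rows by default (`with_ties = TRUE`). The sketch below, on invented scores, shows the default behavior alongside `with_ties = FALSE`, which caps the result at exactly `n` rows per group. Which tied word survives in that case depends on row order, so treat it as an option to weigh rather than a recommended setting.

```r
library(dplyr)

# Hypothetical scores with a tie at 0.5, invented for illustration
toy_scores <- tibble(
  source = "SMS",
  word   = c("a", "b", "c", "d"),
  tf_idf = c(0.9, 0.5, 0.5, 0.1)
)

# Default: ties are kept, so asking for 2 rows returns 3
top_with_ties <- toy_scores %>%
  group_by(source) %>%
  slice_max(tf_idf, n = 2) %>%
  ungroup()

# with_ties = FALSE returns exactly 2 rows
top_no_ties <- toy_scores %>%
  group_by(source) %>%
  slice_max(tf_idf, n = 2, with_ties = FALSE) %>%
  ungroup()

nrow(top_with_ties)
nrow(top_no_ties)
```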
This code exports the tidy_text_tf_idf data frame as a comma-separated
values file called TFIDFScores.csv and stores the file in R’s current
working directory, which is usually the folder containing the script.
You can use a different filename if you like. If a file named
TFIDFScores.csv already exists there, the script will overwrite it
without warning or apology.
write_excel_csv(tidy_text_tf_idf, file = "TFIDFScores.csv")
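If silent overwriting is a concern, one option is to check for the file before writing. The sketch below is an assumption about how that might look and is not part of the original script. It writes invented data to a hypothetical path in a temporary folder so that no real file is touched.

```r
library(readr)

# Hypothetical output path in a temporary folder, for illustration only
out_file <- file.path(tempdir(), "TFIDFScores_demo.csv")

# Invented scores standing in for tidy_text_tf_idf
toy_scores <- data.frame(word = c("donate", "safety"), tf_idf = c(0.82, 0))

if (file.exists(out_file)) {
  message("Not overwriting existing file: ", out_file)
} else {
  write_excel_csv(toy_scores, file = out_file)
}

# A bare filename such as "TFIDFScores.csv" is written to this directory
getwd()
```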