In order to perform any text mining, it is imperative that you first tidy the text. Because text is often unstructured, even the most basic analysis is difficult without some initial cleaning. Here, we discuss basic text cleaning and how to complete a simple frequency analysis.
Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. As described by Hadley Wickham (Wickham 2014), tidy data has a specific structure:

- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.
This tutorial uses the harrypotter package, which is available on GitHub. Assuming you have devtools version 1.6 or later, you can install it with the following code.
devtools::install_github("bradleyboehmke/harrypotter")
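If you are unsure which version of devtools you have, packageVersion() (a base R utility) reports it:

# Check the installed devtools version; install_github() needs >= 1.6
packageVersion("devtools")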
We also load some necessary libraries.
library(tidyverse)
library(stringr)
library(tidytext)
library(harrypotter)
The seven novels in the harrypotter package are:

- philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
- chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
- prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
- goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
- order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
- half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
- deathly_hallows: Harry Potter and the Deathly Hallows (2007)
Each text is stored as a character vector, with each element representing a single chapter. For example, chapter 1 of Harry Potter and the Philosopher’s Stone can be viewed using the following command.
philosophers_stone[1]
[1] "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense….
To analyze this text properly, we want to turn it into a data frame or tibble. We will create a two-column tibble for the philosophers_stone data, with the chapter number in the first column and the chapter text in the second.
text_tb <- tibble(chapter = seq_along(philosophers_stone),
                  text = philosophers_stone)
head(text_tb)
## # A tibble: 6 x 2
## chapter text
## <int> <chr>
## 1 1 "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Driv~
## 2 2 "THE VANISHING GLASS Nearly ten years had passed since the Dursleys ~
## 3 3 "THE LETTERS FROM NO ONE The escape of the Brazilian boa constrictor~
## 4 4 "THE KEEPER OF THE KEYS BOOM. They knocked again. Dudley jerked awak~
## 5 5 "DIAGON ALLEY Harry woke early the next morning. Although he could t~
## 6 6 "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS Harry's last mont~
This format is not yet conducive to analysis; we want to break the text apart into individual words, one per row. This is called unnesting.
text_tb %>%
  unnest_tokens(word, text) %>%
  head()
## # A tibble: 6 x 2
## chapter word
## <int> <chr>
## 1 1 the
## 2 1 boy
## 3 1 who
## 4 1 lived
## 5 1 mr
## 6 1 and
The unnest_tokens() function does the following:

- It splits the text column into tokens (single words, by default), putting one token per row.
- Other columns, such as the chapter number, are retained alongside each token.
- Punctuation is stripped.
- Tokens are converted to lowercase (the to_lower = FALSE argument turns this off).
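To see the punctuation stripping and lowercasing in action, here is a minimal sketch on a made-up one-row tibble:

# A made-up example: punctuation is dropped and tokens are lowercased
tibble(chapter = 1, text = "The Boy Who Lived!") %>%
  unnest_tokens(word, text)
# yields the tokens "the", "boy", "who", "lived"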
We will replicate this process over all of the novels.
titles <- c("Philosopher's Stone",
            "Chamber of Secrets",
            "Prisoner of Azkaban",
            "Goblet of Fire",
            "Order of the Phoenix",
            "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone,
              chamber_of_secrets,
              prisoner_of_azkaban,
              goblet_of_fire,
              order_of_the_phoenix,
              half_blood_prince,
              deathly_hallows)
series <- tibble()
for(i in seq_along(titles)) {
  clean <- tibble(chapter = seq_along(books[[i]]),
                  text = books[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())
  series <- rbind(series, clean)
}
# set factor to keep books in order of publication
series$book <- factor(series$book, levels = rev(titles))
head(series)
## # A tibble: 6 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
We now have a tidy tibble with every individual word by chapter by book.
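As an aside, the same tibble can be built without growing it inside a loop. Here is a sketch using purrr::map2_dfr() (purrr is part of the tidyverse), which iterates over the books and titles in parallel and row-binds the results:

# Equivalent, loop-free construction of the series tibble above
series <- map2_dfr(books, titles, function(text, title) {
  tibble(chapter = seq_along(text), text = text) %>%
    unnest_tokens(word, text) %>%
    mutate(book = title) %>%
    select(book, everything())
})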
The simplest word frequency analysis is finding common words across texts.
series %>%
  count(word, sort = TRUE) %>%
  head()
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 the 51593
## 2 and 27430
## 3 to 26985
## 4 of 21802
## 5 a 20966
## 6 he 20322
Of course, we could have guessed that the most common words are also words we do not particularly care about. Mostly, these are articles, prepositions, pronouns, and other short words that are common in everyday language. We refer to these as stop words; they are not the important words in the Harry Potter series.
We can remove the stop words from our tibble with anti_join and the built-in stop_words data set provided by tidytext.
head(stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
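The stop_words data set bundles three lexicons (SMART, snowball, and onix); a quick count shows how many words each contributes:

# stop_words combines the SMART, snowball, and onix lexicons
count(stop_words, lexicon)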
series %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  head()
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 harry 16557
## 2 ron 5750
## 3 hermione 4912
## 4 dumbledore 2873
## 5 looked 2344
## 6 professor 2006
This is a far more interesting data set to work with.
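If series-specific words also needed removing, the built-in lexicon can be extended before the anti_join. A sketch, using a couple of made-up additions purely for illustration:

# Extend stop_words with custom (hypothetical) entries before filtering
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("harry", "ron"), lexicon = "custom")
)

series %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)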
Next, we find the five most common words used in each book.
series %>%
  anti_join(stop_words) %>%
  group_by(book) %>%
  count(word, sort = TRUE) %>%
  top_n(5)
## # A tibble: 36 x 3
## # Groups: book [7]
## book word n
## <fct> <chr> <int>
## 1 Order of the Phoenix harry 3730
## 2 Goblet of Fire harry 2936
## 3 Deathly Hallows harry 2770
## 4 Half-Blood Prince harry 2581
## 5 Prisoner of Azkaban harry 1824
## 6 Chamber of Secrets harry 1503
## 7 Order of the Phoenix hermione 1220
## 8 Philosopher's Stone harry 1213
## 9 Order of the Phoenix ron 1189
## 10 Deathly Hallows hermione 1077
## # ... with 26 more rows
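Note that top_n() keeps ties (which is why 36 rows are returned rather than 35) and has been superseded in recent dplyr releases. Assuming dplyr 1.0.0 or later, slice_max() is the more current spelling:

# Equivalent ranking with slice_max(): top five words per book by count
series %>%
  anti_join(stop_words) %>%
  count(book, word, sort = TRUE) %>%
  group_by(book) %>%
  slice_max(n, n = 5)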
These might be easier to make sense of using a visualization.
series %>%
  anti_join(stop_words) %>%
  group_by(book) %>%
  count(word, sort = TRUE) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(book = factor(book, levels = titles),
         text_order = nrow(.):1) %>%
  ggplot(aes(reorder(word, text_order), n, fill = book)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ book, scales = "free_y") +
  labs(x = NULL, y = "Frequency") +
  coord_flip() +
  theme(legend.position = "none")
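The text_order trick above is one way to order bars within facets. tidytext also provides reorder_within() and scale_x_reordered() for exactly this purpose; a sketch of the same plot using them:

# Same faceted plot, ordering bars within each facet via reorder_within()
series %>%
  anti_join(stop_words) %>%
  count(book, word) %>%
  group_by(book) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, book)) %>%
  ggplot(aes(word, n, fill = book)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~ book, scales = "free_y") +
  labs(x = NULL, y = "Frequency") +
  coord_flip()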
We can also look at these top words across the whole series in a single panel, without faceting by book.
series %>%
  anti_join(stop_words) %>%
  group_by(book) %>%
  count(word, sort = TRUE) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(book = factor(book, levels = titles),
         text_order = nrow(.):1) %>%
  ggplot(aes(reorder(word, text_order), n, fill = book)) +
  geom_bar(stat = "identity") +
  labs(x = "", y = "Frequency") +
  coord_flip()
In this section, we explored what we mean by tidy data when it comes to text, and how tidy data principles can be applied to natural language processing. When text is organized in a format with one token per row, tasks like removing stop words or calculating word frequencies are natural applications of familiar operations within the tidy tool ecosystem.
AFIT Data Science Lab R Programming Guide. Accessed August 3, 2021.
Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, 2017.
"Text Mining: Creating Tidy Text." UC Business Analytics R Programming Guide. Accessed August 3, 2021.
Wickham, Hadley. "Tidy Data." Journal of Statistical Software 59 (10), 2014.