In order to perform any text mining, it is imperative that you first tidy the text. Because text is often unstructured, even the most basic analysis is difficult without some initial cleaning. Here, we discuss basic text cleaning and how to complete a simple frequency analysis.
Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. As described by Hadley Wickham (Wickham 2014), tidy data has a specific structure:

- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.
This tutorial uses the harrypotter package, which is available on GitHub. Assuming you have devtools version 1.6 or later, you can install it with the following code.
devtools::install_github("bradleyboehmke/harrypotter")
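If you are unsure which version of devtools you have, packageVersion() (a base R utility) reports it:

# Check the installed devtools version; install_github() needs >= 1.6
packageVersion("devtools")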
We also load some necessary libraries.
library(tidyverse)
library(stringr)
library(tidytext)
library(harrypotter)
The seven novels in the harrypotter package are:

- philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
- chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
- prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
- goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
- order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
- half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
- deathly_hallows: Harry Potter and the Deathly Hallows (2007)
Each text is stored as a character vector, with each element representing a single chapter. For example, chapter 1 of Harry Potter and the Philosopher’s Stone can be viewed using the following command.
philosophers_stone[1]
[1] "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense….
To analyze this text properly, we want to turn it into a data frame or tibble. We will create a two-column tibble for the philosophers_stone data, with the chapter number in the first column and the chapter text in the second.
text_tb <- tibble(chapter = seq_along(philosophers_stone),
                  text = philosophers_stone)
head(text_tb)
## # A tibble: 6 x 2
## chapter text
## <int> <chr>
## 1 1 "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Driv~
## 2 2 "THE VANISHING GLASS Nearly ten years had passed since the Dursleys ~
## 3 3 "THE LETTERS FROM NO ONE The escape of the Brazilian boa constrictor~
## 4 4 "THE KEEPER OF THE KEYS BOOM. They knocked again. Dudley jerked awak~
## 5 5 "DIAGON ALLEY Harry woke early the next morning. Although he could t~
## 6 6 "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS Harry's last mont~
This format is not yet conducive to analysis; we want to break the text apart into individual words, one per row. This is called unnesting.
text_tb %>%
  unnest_tokens(word, text) %>%
  head()
## # A tibble: 6 x 2
## chapter word
## <int> <chr>
## 1 1 the
## 2 1 boy
## 3 1 who
## 4 1 lived
## 5 1 mr
## 6 1 and
The unnest_tokens() function does the following:

- It splits the text column into tokens (single words, by default), putting one token per row.
- Other columns, such as the chapter number, are retained alongside each token.
- Punctuation is stripped.
- Tokens are converted to lowercase (the to_lower = FALSE argument turns this off).
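To see the punctuation stripping and lowercasing in action, here is a minimal sketch on a made-up one-row tibble:

# A made-up example: punctuation is dropped and tokens are lowercased
tibble(chapter = 1, text = "The Boy Who Lived!") %>%
  unnest_tokens(word, text)
# yields the tokens "the", "boy", "who", "lived"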
We will replicate this process over all of the novels.
titles <- c("Philosopher's Stone",
            "Chamber of Secrets",
            "Prisoner of Azkaban",
            "Goblet of Fire",
            "Order of the Phoenix",
            "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone,
              chamber_of_secrets,
              prisoner_of_azkaban,
              goblet_of_fire,
              order_of_the_phoenix,
              half_blood_prince,
              deathly_hallows)
series <- tibble()
for(i in seq_along(titles)) {
  clean <- tibble(chapter = seq_along(books[[i]]),
                  text = books[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())
  series <- rbind(series, clean)
}
# set factor to keep books in order of publication
series$book <- factor(series$book, levels = rev(titles))
head(series)
## # A tibble: 6 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
We now have a tidy tibble with every individual word by chapter by book.
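As an aside, the same tibble can be built without growing it inside a loop. Here is a sketch using purrr::map2_dfr() (purrr is part of the tidyverse), which iterates over the books and titles in parallel and row-binds the results:

# Equivalent, loop-free construction of the series tibble above
series <- map2_dfr(books, titles, function(text, title) {
  tibble(chapter = seq_along(text), text = text) %>%
    unnest_tokens(word, text) %>%
    mutate(book = title) %>%
    select(book, everything())
})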
The simplest word frequency analysis is finding common words across texts.
series %>%
  count(word, sort = TRUE) %>%
  head()
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 the 51593
## 2 and 27430
## 3 to 26985
## 4 of 21802
## 5 a 20966
## 6 he 20322
Of course, we could have guessed that the most common words are also words we do not particularly care about. Mostly, these are articles, prepositions, pronouns, and other short words that are common in everyday language. We refer to these as stop words; they are not the important words in the Harry Potter series.
We can remove the stop words from our tibble with anti_join and the built-in stop_words data set provided by tidytext.
head(stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
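The stop_words data set bundles three lexicons (SMART, snowball, and onix); a quick count shows how many words each contributes:

# stop_words combines the SMART, snowball, and onix lexicons
count(stop_words, lexicon)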
series %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  head()
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 harry 16557
## 2 ron 5750
## 3 hermione 4912
## 4 dumbledore 2873
## 5 looked 2344
## 6 professor 2006
This is a far more interesting data set to work with.
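If series-specific words also needed removing, the built-in lexicon can be extended before the anti_join. A sketch, using a couple of made-up additions purely for illustration:

# Extend stop_words with custom (hypothetical) entries before filtering
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("harry", "ron"), lexicon = "custom")
)

series %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)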
Next, we find the five most common words used in each book.
series %>%
  anti_join(stop_words) %>%
  group_by(book) %>%
  count(word, sort = TRUE) %>%
  top_n(5)
## # A tibble: 36 x 3
## # Groups: book [7]
## book word n
## <fct> <chr> <int>
## 1 Order of the Phoenix harry 3730
## 2 Goblet of Fire harry 2936
## 3 Deathly Hallows harry 2770
## 4 Half-Blood Prince harry 2581
## 5 Prisoner of Azkaban harry 1824
## 6 Chamber of Secrets harry 1503
## 7 Order of the Phoenix hermione 1220
## 8 Philosopher's Stone harry 1213
## 9 Order of the Phoenix ron 1189
## 10 Deathly Hallows hermione 1077
## # ... with 26 more rows
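Note that top_n() keeps ties (which is why 36 rows are returned rather than 35) and has been superseded in recent dplyr releases. Assuming dplyr 1.0.0 or later, slice_max() is the more current spelling:

# Equivalent ranking with slice_max(): top five words per book by count
series %>%
  anti_join(stop_words) %>%
  count(book, word, sort = TRUE) %>%
  group_by(book) %>%
  slice_max(n, n = 5)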
These might be easier to make sense of using a visualization.
series %>%
  anti_join(stop_words) %>%
  group_by(book) %>%
  count(word, sort = TRUE) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(book = factor(book, levels = titles),
         text_order = nrow(.):1) %>%
  ggplot(aes(reorder(word, text_order), n, fill = book)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ book, scales = "free_y") +
  labs(x = NULL, y = "Frequency") +
  coord_flip() +
  theme(legend.position = "none")
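The text_order trick above is one way to order bars within facets. tidytext also provides reorder_within() and scale_x_reordered() for exactly this purpose; a sketch of the same plot using them:

# Same faceted plot, ordering bars within each facet via reorder_within()
series %>%
  anti_join(stop_words) %>%
  count(book, word) %>%
  group_by(book) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, book)) %>%
  ggplot(aes(word, n, fill = book)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~ book, scales = "free_y") +
  labs(x = NULL, y = "Frequency") +
  coord_flip()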
We can also look at these top words across the whole series in a single panel, without faceting by book.
series %>%
  anti_join(stop_words) %>%
  group_by(book) %>%
  count(word, sort = TRUE) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(book = factor(book, levels = titles),
         text_order = nrow(.):1) %>%
  ggplot(aes(reorder(word, text_order), n, fill = book)) +
  geom_bar(stat = "identity") +
  labs(x = "", y = "Frequency") +
  coord_flip()
In this section, we explored what we mean by tidy data when it comes to text, and how tidy data principles can be applied to natural language processing. When text is organized in a format with one token per row, tasks like removing stop words or calculating word frequencies are natural applications of familiar operations within the tidy tool ecosystem.
AFIT Data Science Lab R Programming Guide. Accessed August 3, 2021.
Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, 2017.
"Text Mining: Creating Tidy Text." UC Business Analytics R Programming Guide. Accessed August 3, 2021.
Wickham, Hadley. "Tidy Data." Journal of Statistical Software 59 (10), 2014.