Preparing Textual Data in R

Libraries Needed

We’ll use several R packages in this section:

readr will help in importing the .csv file into R.
tidyverse is a collection of R packages designed for data science, including dplyr with a set of verbs for common data manipulations and ggplot2 for visualization.
tidytext provides specific functions for a “tidy” approach to working with textual data, where one row represents one “token” or meaningful unit of text, for example a word.
readtext provides a function well suited to reading textual data from a large number of formats into R, including metadata.
SnowballC provides wordStem(), which we use later to reduce words to their stems.

# Install any missing packages, then load them all
packages <- c("tidyverse", "tidytext", "readtext", "readr", "SnowballC")
for (pkg in packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}

Reading text into R

First, let’s look at the speech data sample. Once it is imported, we can inspect it either by typing the object’s name or by using functions like glimpse() or str().

# Import the dataset
speech <- read_csv("speech_data_sample.csv",
                   show_col_types = FALSE) %>%
  # Remove rows with a missing value (an index without a speech)
  na.omit()


# Preview the data
str(speech)

## tibble [12,880 x 2] (S3: tbl_df/tbl/data.frame)
##  $ index : num [1:12880] 1.06e+09 9.60e+08 1.10e+09 1.04e+09 9.90e+08 ...
##  $ speech: chr [1:12880] "mr. president. can i have order. please?" "mr. president. i suggest the absence of a quorum." "madam speaker. february 1 is an extremely important date for us in terms of american security. you might wonder"| __truncated__ "mr. president. i call up my amendment no. 2528. the conradlieberman amendment." ...
##  - attr(*, "na.action")= 'omit' Named int [1:10] 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456
##   ..- attr(*, "names")= chr [1:10] "3447" "3448" "3449" "3450" ...

This sample contains 12,880 speeches once the 10 rows with missing values have been dropped.
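
To try the inspection step without the CSV file, here is a minimal sketch on a small stand-in tibble. The object names toy and toy_clean and the example rows are made up for illustration; only the two column names come from the data above.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the speech data: same two columns, three rows
toy <- tibble(
  index  = c(1, 2, 3),
  speech = c("mr. president. i suggest the absence of a quorum.",
             NA,
             "madam speaker. i yield back.")
)

# na.omit() drops every row that contains a missing value
toy_clean <- na.omit(toy)

# glimpse() prints one line per column, a compact alternative to str()
glimpse(toy_clean)
nrow(toy_clean)
```

The row with the missing speech is removed, which mirrors what happens to the rows recorded in the na.action attribute of the real data.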

Cleaning the Speech

Here we remove special characters (digits and newlines), split the speeches into words, remove stop words, and reduce each word to its stem.

Replacing and removing characters

Now let’s take a look at text cleaning. We first strip digits with gsub(), then use str_replace_all() to replace every newline character (\n) with a single space " ", and finally call str_squish() to trim leading and trailing whitespace and collapse repeated spaces. The pattern is written "\\n" because the backslash must itself be escaped inside an R string for the regular expression to receive \n.

# Remove characters
speech <- speech %>%
  mutate(
    # remove digits
    speech = gsub(pattern = "[0-9]", replacement = "", x = speech),
    # replace newline characters with a space
    speech = str_replace_all(speech, "\\n", " "),
    # trim the ends and collapse repeated white space
    speech = str_squish(speech)
  )
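
To see what each cleaning step does, the same three calls can be applied to a single string. This is a small sketch; the string raw is a made-up example, not a row from the dataset.

```r
library(stringr)

# Hypothetical raw text with digits, a newline, and extra spaces
raw <- "mr. president.\ni call up   amendment no. 2528. "

# Step 1: remove digits
no_digits <- gsub(pattern = "[0-9]", replacement = "", x = raw)

# Step 2: replace the newline with a space
no_newlines <- str_replace_all(no_digits, "\\n", " ")

# Step 3: trim the ends and collapse runs of spaces
cleaned <- str_squish(no_newlines)

cleaned
# "mr. president. i call up amendment no. ."
```

The stray periods left where the number used to be are not a problem, because punctuation is stripped during tokenization.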

Tokenize the text

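unnest_tokens() splits each speech into one word per row. It keeps the other columns (here, index), so every word stays linked to its speech, and by default it strips punctuation and converts all words to lowercase.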

speech <- speech %>%
  unnest_tokens(word, speech)
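
The effect is easiest to see on a couple of short sentences. A minimal sketch, assuming a made-up two-row tibble named toy with the same column names as the speech data:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical two-speech example with mixed case and punctuation
toy <- tibble(
  index  = c(1, 2),
  speech = c("Mr. President, I suggest the absence of a quorum.",
             "Madam Speaker, I yield back.")
)

# One row per word; the index column is carried along for each token
tokens <- toy %>%
  unnest_tokens(word, speech)

tokens
```

Each row of tokens holds one lowercase word together with the index of the speech it came from, and the original speech column is dropped.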

Remove Stop Words

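Stop words are very common words, such as “the”, “of”, and “a”, that carry little information about the content of a text. The tidytext package ships a stop_words data set, and anti_join() keeps only the rows whose word does not appear in it.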

speech <- speech %>%
  anti_join(stop_words, by = "word")
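
As a quick illustration of the anti-join, here is a sketch on a handful of tokens. The tokens tibble is a made-up example; stop_words is the data set provided by tidytext.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tokens from one short speech
tokens <- tibble(word = c("i", "suggest", "the", "absence", "of", "a", "quorum"))

# Keep only the rows whose word is NOT listed in stop_words
kept <- tokens %>%
  anti_join(stop_words, by = "word")

kept$word
```

Naming the join column with by = "word" makes the match explicit and avoids the message that dplyr prints when it has to guess the key.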

Word Stemming

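Next we reduce each word to its stem or root form, for example reducing fishing and fished to the stem fish. We use wordStem() from the SnowballC package and store the result in a new word_stem column, leaving the original word in place.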

# Stem the words
speech <- speech %>%
  mutate(word_stem = wordStem(word))
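
To check what the stemmer does before running it on the whole corpus, wordStem() can be called on a few words directly. A small sketch with made-up input; it relies on the default stemmer that wordStem() uses when no language is given.

```r
library(SnowballC)

# Hypothetical words to stem
words <- c("fishing", "fished", "amendment", "amendments")

# wordStem() is vectorised: one stem per input word
stems <- wordStem(words)

data.frame(word = words, stem = stems)
```

Stems are not always dictionary words, so keeping the original word column alongside word_stem makes the results easier to read.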

Word Frequency

Since our unit of analysis at this point is the word, we can count how often each word occurs in the corpus as a whole. The code below draws a bar graph of the words that occur more than 4,000 times.

speech %>%
  count(word) %>%
  filter(n > 4000) %>%
  mutate(word = reorder(word, n)) %>%  # reorder words by frequency
  ggplot(aes(word, n)) +
  geom_col(fill = "steelblue") +
  coord_flip()  # flip x and y coordinates so we can read the words better
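
A fixed cutoff such as 4,000 depends on the size of the corpus. One alternative, sketched here on made-up tokens, is to keep a set number of the most frequent words with slice_max(). The tokens tibble and the choice of two words are illustrative assumptions.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens standing in for the cleaned speech data
tokens <- tibble(word = c("senate", "bill", "senate", "vote", "bill", "senate"))

# Count each word, most frequent first, then keep the top two
top_words <- tokens %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 2)

top_words
```

Piping top_words into the same ggplot() call as above would give a chart with a predictable number of bars whatever the corpus size.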