We’ll use several R packages in this section:
readr will help in importing the .csv
file into R.
tidyverse is a collection of R packages designed for
data science, including dplyr with a set of verbs for
common data manipulations and ggplot2 for
visualization.
tidytext provides specific functions for a “tidy”
approach to working with textual data, where one row represents one
“token” or meaningful unit of text, for example a word.
readtext provides a function well suited to reading
textual data from a large number of formats into R,
including metadata.
SnowballC provides wordStem(), which we use below to
reduce words to their stems.
# Load the libraries, installing any that are missing
for (pkg in c("tidyverse", "tidytext", "readtext", "readr", "SnowballC")) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    # require() failed above, so attach the package after installing it
    library(pkg, character.only = TRUE)
  }
}
First, let’s look at the speech data sample.
We can inspect an object either by typing its name or by using a function
such as glimpse() or str().
# Import the dataset
speech <- read_csv("speech_data_sample.csv",
                   show_col_types = FALSE) %>%
  # Remove rows with a missing value (an index without a speech)
  na.omit()
# Preview the data
str(speech)
## tibble [12,880 x 2] (S3: tbl_df/tbl/data.frame)
## $ index : num [1:12880] 1.06e+09 9.60e+08 1.10e+09 1.04e+09 9.90e+08 ...
## $ speech: chr [1:12880] "mr. president. can i have order. please?" "mr. president. i suggest the absence of a quorum." "madam speaker. february 1 is an extremely important date for us in terms of american security. you might wonder"| __truncated__ "mr. president. i call up my amendment no. 2528. the conradlieberman amendment." ...
## - attr(*, "na.action")= 'omit' Named int [1:10] 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456
## ..- attr(*, "names")= chr [1:10] "3447" "3448" "3449" "3450" ...
This sample contains 12,880 speeches after the 10 rows with missing values are removed.
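The two attr lines in the str() output are left behind by na.omit(), which records the rows it dropped. A minimal sketch on a made-up three-row data frame (the values and the names toy and toy_clean are hypothetical, not part of the speech data) shows the mechanism:

```r
# Hypothetical toy data: the second row has no speech
toy <- data.frame(
  index  = c(101, 102, 103),
  speech = c("mr. president.", NA, "madam speaker."),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)

# Number of rows kept
print(nrow(toy_clean))

# Positions of the dropped rows, stored in the "na.action" attribute
print(as.integer(attr(toy_clean, "na.action")))
```

Because na.omit() checks every column, a row would also be dropped if its index were missing, not only its speech.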
Now let’s take a look at text ‘cleaning’. In the steps that follow we remove
numbers, special characters, and stop words, and reduce each word to its stem.
We start at the character level. Numbers are removed with gsub(). Newline
characters (\n) are replaced with a single white space “ ” using the
str_replace_all function. In the pattern we write the backslash twice ("\\n"):
the first backslash escapes the second, so the regular expression engine
receives \n and interprets it as a newline. Finally, str_squish() trims
leading and trailing white space and collapses repeated spaces into one.
# Remove characters
speech <- speech %>%
  mutate(
    # remove numbers
    speech = gsub(pattern = "[0-9]", replacement = "", x = speech),
    # replace newlines with a space
    speech = str_replace_all(speech, "\\n", " "),
    # trim and collapse repeated white space
    speech = str_squish(speech)
  )
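To see what each cleaning step does, here is a small sketch that applies the same three operations, one at a time, to a single made-up string. The string and the intermediate variable names are assumptions for illustration only.

```r
library(stringr)

# Hypothetical input string, not taken from the speech data
raw <- "mr. president.\nthe vote   was 52 to 48.\n"

# Step 1: drop digits
no_digits <- gsub(pattern = "[0-9]", replacement = "", x = raw)

# Step 2: replace each newline with a space
no_newlines <- str_replace_all(no_digits, "\\n", " ")

# Step 3: trim the ends and collapse runs of white space
cleaned <- str_squish(no_newlines)

print(cleaned)
```

Removing digits can leave stray punctuation and extra spaces behind, which is one reason str_squish() runs last.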
Tokenizing the text with unnest_tokens() splits each speech into one row per word. The other columns (here, index) are retained, punctuation is removed, and all words are converted to lowercase by default.
speech <- speech %>%
  unnest_tokens(word, speech)
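A minimal sketch of tokenization on two made-up speeches, assuming the tibble and tidytext packages are installed. The toy data and the names toy and toy_tokens are hypothetical.

```r
library(tibble)
library(tidytext)

# Hypothetical two-speech sample
toy <- tibble(
  index  = c(1, 2),
  speech = c("Mr. President, I object!", "The Senate will come to order.")
)

# One row per word; the index column is carried along,
# punctuation is stripped, and words are lowercased
toy_tokens <- unnest_tokens(toy, word, speech)

print(toy_tokens)
```

Each output row keeps the index of the speech it came from, so word counts can later be grouped by speech.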
Stop words are very common words, such as “the” or “of”, that carry little information about the content of a text. We remove them with an anti-join against the stop_words data set that ships with tidytext.
speech <- speech %>%
  # keep only words that do not appear in stop_words
  anti_join(stop_words, by = "word")
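The anti-join keeps only the rows whose word has no match in stop_words. A small sketch on made-up tokens, assuming dplyr, tibble, and tidytext are installed (toy_tokens and toy_kept are hypothetical names):

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tokens: "the", "of", and "a" are stop words, "quorum" is not
toy_tokens <- tibble(word = c("the", "absence", "of", "a", "quorum"))

# Keep only rows whose word has no match in the stop_words data set
toy_kept <- anti_join(toy_tokens, stop_words, by = "word")

print(toy_kept)
```

Naming the join column with by = "word" avoids the message dplyr prints when it has to guess the key.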
Next, we reduce the words to their word stem or root form, for example reducing fishing, fished, and fishes to the stem fish, so that different forms of the same word can be treated as one term.
# Stem the words (wordStem() comes from the SnowballC package)
speech <- speech %>%
  mutate(word_stem = wordStem(word))
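As a quick check of what stemming does, the sketch below calls wordStem() directly on a few example words. It assumes the SnowballC package is installed; the word list is chosen only for illustration.

```r
library(SnowballC)

# Three inflected forms of the same word
words <- c("fishing", "fished", "fishes")

# wordStem() is vectorised; its default language is "porter"
stems <- wordStem(words)

print(stems)
```

Stems are not always dictionary words, so it can help to keep the original word column next to word_stem, as the pipeline above does.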
Since our unit of analysis at this point is a word, let’s count the words to determine which occur most frequently in the corpus as a whole. The bar graph below shows the words that occur more than 4,000 times.
speech %>%
  count(word) %>%
  filter(n > 4000) %>%
  mutate(word = reorder(word, n)) %>% # reorder values by frequency
  ggplot(aes(word, n)) +
  geom_col(fill = "steelblue") +
  coord_flip() # flip x and y coordinates so we can read the words better
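The counting step can be tried on a handful of made-up tokens. The sketch below also uses slice_max() to take a fixed number of top words, which is one possible alternative to a hard-coded threshold such as n > 4000. This is a suggestion rather than part of the original workflow, and the toy data and variable names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens
toy_tokens <- tibble(
  word = c("order", "senate", "order", "vote", "senate", "order")
)

# Count occurrences of each word, most frequent first
toy_counts <- toy_tokens %>%
  count(word, sort = TRUE)

# Keep the two most frequent words (ties, if any, are all kept by default)
top_two <- toy_counts %>%
  slice_max(order_by = n, n = 2)

print(top_two)
```

A fixed top-n keeps the plot readable when the corpus size changes, whereas a count threshold may return too many or too few words.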