text <- c("Because I could not stop for Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality",
"Girls go to college to get more knowldege -",
"Boys go to Jupiter to get more stupider -")
print(text)
[1] "Because I could not stop for Death -"
[2] "He kindly stopped for me -"
[3] "The Carriage held but just Ourselves -"
[4] "and Immortality"
[5] "Girls go to college to get more knowldege -"
[6] "Boys go to Jupiter to get more stupider -"
That’s a renowned poem by Emily Dickinson, followed by a funny couplet from my 7-year-old daughter. The text of this poem is stored as a character vector named "text". To be able to manipulate this data, I have to turn it into a data frame. I can do so by using the dplyr library.
library(dplyr)
df_text <- tibble(line = 1:6, text = text)
df_text
# A tibble: 6 x 2
line text
<int> <chr>
1 1 Because I could not stop for Death -
2 2 He kindly stopped for me -
3 3 The Carriage held but just Ourselves -
4 4 and Immortality
5 5 Girls go to college to get more knowledge -
6 6 Boys go to Jupiter to get more stupider -
Based on the output, we can see the difference between the original vector and this new object: the latter is a "tibble" with two columns, line and text.
One of the key aspects of text mining is called "tokenizing". Silge & Robinson (2017) write that "A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens" (p. 3).
Now, I am going to break the poem into tokens and change the tibble into a tidy data structure. We can achieve this goal using the unnest_tokens() function available in the tidytext package.
library(tidytext)
df_text %>%                  # take the text tibble and
  unnest_tokens(word, text)  # break each line into one token (word) per row
# A tibble: 36 x 2
line word
<int> <chr>
1 1 because
2 1 i
3 1 could
4 1 not
5 1 stop
6 1 for
7 1 death
8 2 he
9 2 kindly
10 2 stopped
# ... with 26 more rows
Once again, we get a tibble, but this time it is 36 x 2. The rows are now broken up by word. For example, there are 7 words, i.e., tokens, in the first line. Simply put, unnest_tokens() is the function that unnests the text into the designated tokens. In addition, here are some of the features of this function: the other columns, such as line, are retained; punctuation is stripped; and, by default, the tokens are converted to lowercase.
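To double-check the claim about tokens per line, a quick count() on the unnested tibble works. This is a minimal sketch that reuses the df_text tibble defined above and assumes dplyr and tidytext are still loaded; the first row should show n = 7.
df_text %>%
  unnest_tokens(word, text) %>%  # one token per row, as before
  count(line)                    # number of tokens produced by each line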
If we look at the tibble, we can see that there are many function words, i.e., words that do not contribute to the meaning of a text. They are often used to connect ideas/expressions together, like: of, to, for, an, a, etc. The stop_words data set in the tidytext package has a list of the expressions that we often want to avoid while analyzing a text.
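For a quick peek at what that data set looks like, here is a one-line sketch; tidytext exports stop_words directly, as a tibble with a word column and the lexicon each word comes from.
head(tidytext::stop_words)  # first few stop words and their source lexicon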
I am now going to get rid of such words. We can do so by using the anti_join() function.
library(tibble)
data(stop_words) # load the data set that contains the stop words
trimmed_data <- tibble(value = text) %>% # turn the character vector into a one-column tibble (column "value")
  unnest_tokens(word, value) %>%         # one token per row
  anti_join(stop_words)                  # drop every token that also appears in the stop_words data set
trimmed_data
# A tibble: 13 x 1
word
<chr>
1 stop
2 death
3 kindly
4 stopped
5 carriage
6 held
7 immortality
8 girls
9 college
10 knowledge
11 boys
12 jupiter
13 stupider
Originally, we had 36 tokens. After removing the stop words, we are left with 13 words that contribute to the meaning of the text. These are the content words.
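In this tiny text every content word happens to appear exactly once, so a frequency table is not very exciting here, but for longer texts it is usually the first thing to look at. A minimal sketch using the trimmed_data tibble from above:
trimmed_data %>%
  count(word, sort = TRUE)  # word frequencies, most frequent first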
Our text was pretty small, i.e., 36 words, so it is not difficult to dissect by hand. However, text mining tools can analyze much longer texts, and even multiple texts at the same time. A word cloud is one of the most effective tools in qualitative data analysis. Let's create a simple word cloud using our content words.
library(wordcloud)
trimmed_data %>%
  count(word) %>%            # count how often each word occurs
  with(wordcloud(word, n))   # draw the cloud, sized by frequency
Here we go.
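With a text this small the default settings are fine, but for larger texts it usually helps to cap the number of words and add a color palette. Here is one possible variation, just a sketch: max.words and colors are standard wordcloud() arguments, and RColorBrewer supplies the palette.
library(RColorBrewer)
trimmed_data %>%
  count(word) %>%
  with(wordcloud(word, n,
                 max.words = 100,                   # draw at most 100 words
                 colors = brewer.pal(8, "Dark2")))  # shade words by frequency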
My next project will be much harder than this one. I am going to dissect my Ph.D. dissertation from a text mining perspective and see what I come up with. Here’s the link: https://rpubs.com/nirmal/813873