Introduction

In previous tutorials, text mining topics such as word relationships, sentiment analysis, and inverse document frequencies were discussed in detail. Alongside those approaches sits another useful package that allows researchers to gain insights from text. The tm package was created by Ingo Feinerer and enables novice researchers (like me) to harness the power of R without an in-depth understanding of the programming language. With that in mind, let’s explore some of the practical applications of the tm package.

Prerequisites

This tutorial leverages the harrypotter package created by Dr. Bradley Boehmke, which is available on GitHub. In addition to the harrypotter package, the following packages should be installed and loaded prior to proceeding.

# Packages
library(knitr)       # used to make kable tables
library(harrypotter) # harry potter book series
library(tm)          # text mining package
library(SnowballC)   # applies Porter's stemming algorithm (discussed later)
library(magrittr)    # allows pipe operator
library(tidytext)    # tidy text for plots
library(ggplot2)     # used for plots
library(dplyr)       # Manipulate data frames 

What’s a Corpus?

The tm package utilizes the Corpus as its main structure. A corpus is simply a collection of documents, but like most things in R, the corpus has specific attributes that enable certain types of analysis. Corpora in R exist in two ways:

  • Volatile Corpus (VCorpus) is a temporary object within R and is the default when assigning documents to a corpus.
  • Permanent Corpus (PCorpus) is a permanent object that can be stored outside of R.

It is important to note that the differences between a VCorpus and a PCorpus will not affect the operations covered in this tutorial; they differ only in how the corpus is stored.
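
As a minimal sketch of the difference, the VCorpus call below is the pattern used throughout this tutorial, while the PCorpus call (commented out, and assuming the filehash package is installed) shows how a disk-backed corpus would be requested:

# Volatile corpus: held in memory only
vc <- VCorpus(VectorSource(c("document one", "document two")))

# Permanent corpus: stored in a database on disk (requires the filehash package)
# pc <- PCorpus(VectorSource(c("document one", "document two")),
#               dbControl = list(dbName = "corpus_db", dbType = "DB1"))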

The Source of the Problem

Once the type of corpus is defined, it is then necessary to identify a source. To see what sources are available for the tm package, enter getSources() in the console.

getSources()
## [1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"   
## [5] "XMLSource"       "ZipSource"

As you would imagine, documents can exist in many different forms. Three of the most utilized sources are:

  • VectorSource a vector of characters (treats each component as a document)
  • DataframeSource a data frame containing text (like CSV files)
  • DirSource for use with file directories

For the purpose of this tutorial, only VectorSource will be used.
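
For reference, the other two sources follow the same constructor pattern. The sketch below is illustrative only: the directory path is hypothetical, and recent versions of tm expect a DataframeSource data frame to contain doc_id and text columns.

# DirSource: every matching file in the folder becomes a document
# dir_corpus <- VCorpus(DirSource("path/to/texts", pattern = "\\.txt$"))

# DataframeSource: every row of the data frame becomes a document
df <- data.frame(doc_id = c("doc1", "doc2"),
                 text   = c("first document", "second document"),
                 stringsAsFactors = FALSE)
df_corpus <- VCorpus(DataframeSource(df))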

Create a Corpus

To create a corpus we need to define the source and the object. We will use philosophers_stone as the object.

# Takes book one of the harry potter series and creates a corpus
my_corpus <- VCorpus(VectorSource(philosophers_stone))

From the example above, the corpus constructor (VCorpus) takes the source (VectorSource) as its first argument. There is also a second argument, readerControl, which must be a list containing a reader and a language. The readerControl argument configures how documents are read into the corpus, much like readr options configure how files are read in. To view the available readers, type getReaders() into the console.

getReaders()
##  [1] "readDOC"                 "readPDF"                
##  [3] "readPlain"               "readRCV1"               
##  [5] "readRCV1asPlain"         "readReut21578XML"       
##  [7] "readReut21578XMLasPlain" "readTabular"            
##  [9] "readTagged"              "readXML"

The above readers each have different arguments which may or may not be used depending on the type of document to be transformed into a corpus. If no reader is defined, the default is readPlain, and the default language is English. The readerControl argument would even enable the creation of a corpus from a PDF version of the Bible written in Dutch, if one so desired, but further exploration is beyond the scope of this tutorial.
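
As a small sketch, the defaults described above can be spelled out explicitly; this produces the same corpus as the earlier call.

# Explicitly state the default reader and language
my_corpus_explicit <- VCorpus(VectorSource(philosophers_stone),
                              readerControl = list(reader = readPlain,
                                                   language = "en"))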

Once the corpus is created, we may need to see some details of the corpus without viewing the entire text. To achieve this, simply type the corpus name into the console.

my_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 17

The output shows that this corpus has 17 documents, which in this scenario represent the 17 chapters of philosophers_stone. To further highlight how the tm package interprets documents, consider the following example.

text <- c("this is a string", "this is a different string", "this is yet a different string")
my_corpus2 <- VCorpus(VectorSource(text))
my_corpus2
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3

Notice how each string is treated as a document. This is important to remember when creating a corpus, since you may want to retain metadata that identifies where the text occurred. Documents like philosophers_stone carry metadata that parses the text into chapters. A complete review of metadata is beyond the scope of this tutorial, but some metadata manipulation may be required when working with the tm package. You can do so by using the meta() command and indicating a “tag” as the second argument. The default “tag” is a local index (it applies to individual documents).

# Indexes the three strings as "named" documents under the title "From"
meta(my_corpus2, tag = "From") <- c("String1", "String2", "String3")
meta(my_corpus2)
my_corpus2
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 1
## Content:  documents: 3

Notice that the document level (indexed) now shows “1”, indicating that an indexing scheme exists for these documents. If we would like to make a classification to all of the documents in this corpus, we can do so by changing the “type” of tag in the meta() command to “corpus”.

# Adds a corpus level index also called "strings"
meta(my_corpus2, tag ="strings", type = "corpus") <- "corpus_test"
my_corpus2
## <<VCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 1
## Content:  documents: 3

Now we see that the corpus specific metadata has changed to “1”. The corpus now contains a tag at both the document level and the corpus level. This is important to remember since unstructured data requires “tags” to keep it organized; the tags can tell you in which corpus and document an observation occurred. If you ever need to see what metadata is associated with a particular document, use the meta() command on that document.

# view metadata of the first document
meta(my_corpus2[[1]])
##   author       : character(0)
##   datetimestamp: 2017-08-25 12:32:09
##   description  : character(0)
##   heading      : character(0)
##   id           : 1
##   language     : en
##   origin       : character(0)

Metadata manipulation is used primarily for large corpora with multiple documents. More information can be found by typing vignette("tm") in the console.
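
As one small, hedged example of how organized text pays off, tm_filter() can subset a corpus with any function that returns TRUE or FALSE per document; the pattern searched for below is purely illustrative.

# Keep only the documents whose text contains the word "yet"
only_yet <- tm_filter(my_corpus2,
                      FUN = function(doc) any(grepl("yet", content(doc))))
only_yet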

Corpus Transformations

One of the best features of the tm package is the ability to transform text into workable data without a great deal of code. To do this, we can use Transformations which are available in the tm package. To see available Transformations enter getTransformations() in the console.

getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

Referring back to our philosophers_stone corpus named “my_corpus”, we may wish to transform our data using some of the transformation commands. It is highly recommended to check your data after each transformation to ensure that the transformation had the desired effect. Before we start transformations, we should view the data. We could use the base R command writeLines(), but on its own it would write out the entire chapter. To write, say, the first 15 lines, we can use a combination of strwrap() (to coerce the corpus document into wrapped lines of character text) and head() (to keep only the first 15 of those lines).

# Write 15 lines of chapter two as characters.
writeLines(head(strwrap(my_corpus[[2]]), 15)) 
## THE VANISHING GLASS  Nearly ten years had passed since the
## Dursleys had woken up to find their nephew on the front step, but
## Privet Drive had hardly changed at all. The sun rose on the same
## tidy front gardens and lit up the brass number four on the
## Dursleys' front door; it crept into their living room, which was
## almost exactly the same as it had been on the night when Mr.
## Dursley had seen that fateful news report about the owls. Only the
## photographs on the mantelpiece really showed how much time had
## passed. Ten years ago, there had been lots of pictures of what
## looked like a large pink beach ball wearing different-colored
## bonnets -- but Dudley Dursley was no longer a baby, and now the
## photographs showed a large blond boy riding his first bicycle, on
## a carousel at the fair, playing a computer game with his father,
## being hugged and kissed by his mother. The room held no sign at
## all that another boy lived in the house, too.  Yet Harry Potter

We see the capitalization, punctuation, and spacing that are typical of text. Intuitively, it makes sense to use the removePunctuation() content transformer immediately (and you may, depending on your text), but what is the impact? Looking at chapter two, we see the text “different-colored bonnets”. If we applied the removePunctuation() content transformer we would get the word “differentcolored”, which is not really a word. We need to add a space between “different” and “colored”. To achieve this, we can create a content_transformer function to replace any pattern with a space.

# Create a function called "addspace" that finds a user specified pattern and substitutes the pattern with a space.
addspace <- content_transformer(function(x, pattern) {
  return(gsub(pattern, " ", x))
  })

As mentioned before, we want to use the newly created addspace function prior to using the removePunctuation() content transformer. In addition, the tm package uses a specific interface, tm_map(), to apply functions to corpora. Let’s try it out.

# Replace "-" with space in ALL chapters of the philosopher's stone.
my_corpus <- tm_map(my_corpus, addspace, "-")
writeLines(head(strwrap(my_corpus[[2]]), 15))
## THE VANISHING GLASS  Nearly ten years had passed since the
## Dursleys had woken up to find their nephew on the front step, but
## Privet Drive had hardly changed at all. The sun rose on the same
## tidy front gardens and lit up the brass number four on the
## Dursleys' front door; it crept into their living room, which was
## almost exactly the same as it had been on the night when Mr.
## Dursley had seen that fateful news report about the owls. Only the
## photographs on the mantelpiece really showed how much time had
## passed. Ten years ago, there had been lots of pictures of what
## looked like a large pink beach ball wearing different colored
## bonnets but Dudley Dursley was no longer a baby, and now the
## photographs showed a large blond boy riding his first bicycle, on
## a carousel at the fair, playing a computer game with his father,
## being hugged and kissed by his mother. The room held no sign at
## all that another boy lived in the house, too.  Yet Harry Potter

Now we can remove the remaining punctuation and the numbers, convert all words to lower case, remove stop words, and strip extra white space.

my_corpus <- tm_map(my_corpus, removePunctuation)
my_corpus <- tm_map(my_corpus, removeNumbers)
my_corpus <- tm_map(my_corpus, removeWords, stopwords("english"))

# Transform to lower case (need to wrap in content_transformer)
my_corpus <- tm_map(my_corpus,content_transformer(tolower))
my_corpus <- tm_map(my_corpus, stripWhitespace)

writeLines(head(strwrap(my_corpus[[2]]), 15))
## the vanishing glass  nearly ten years passed since dursleys
## woken find nephew front step privet drive hardly changed the sun
## rose tidy front gardens lit brass number four dursleys front door
## crept living room almost exactly night mr dursley seen fateful
## news report owls only photographs mantelpiece really showed much
## time passed ten years ago lots pictures looked like large pink
## beach ball wearing different colored bonnets dudley dursley longer
## baby now photographs showed large blond boy riding first bicycle
## carousel fair playing computer game father hugged kissed mother
## the room held sign another boy lived house   yet harry potter
## still asleep moment long his aunt petunia awake shrill voice made
## first noise day  up get now  harry woke start his aunt rapped
## door   up screeched harry heard walking toward kitchen sound
## frying pan put stove he rolled onto back tried remember dream it
## good one there flying motorcycle he funny feeling hed dream

The output of stripWhitespace() can look a little quirky, but since the document term matrices built later work on words rather than spacing, any leftover whitespace is purely cosmetic.
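
Since magrittr is loaded, the same cleaning steps can also be chained with the pipe operator. This is just an equivalent sketch of the calls above, applied to a fresh copy of the corpus; it is not a different method.

# The same transformations, chained on a fresh copy of the corpus
clean_corpus <- VCorpus(VectorSource(philosophers_stone)) %>%
  tm_map(addspace, "-") %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(stripWhitespace)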

To Stem or Not To Stem

You may have noticed that we have not used the stemDocument() command yet. Whether to stem is a judgement call by the researcher and depends entirely on the type of text. For example, if your text contains the words “serve”, “services”, “servicing”, and “server”, then the stemDocument() command will reduce these words to their root and lump them all together under the stemmed word “serv”. To do this, the SnowballC package utilizes an algorithm developed by Dr. Martin Porter which hunts for suffixes and strips them to create a root. Dr. Porter’s algorithm has become the de facto stemmer in natural language processing, although other word stemmers using different algorithms exist. As a cautionary tale, some context may be lost or distorted by stemming. As a rule of thumb, the end justifies the means.
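
If you want to preview what the stemmer will do before touching the corpus, SnowballC exposes it directly through wordStem(). The words below are the hypothetical examples from the paragraph above; exact stems can vary slightly between stemmer versions, but the expectation here is that all four collapse to “serv”.

# Preview the stems for the example words before stemming the whole corpus
wordStem(c("serve", "services", "servicing", "server"), language = "english")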

To see the effect of stemming on our corpus we can run stemDocument().

# Applies Porter's word stemmer 
my_corpus <- tm_map(my_corpus, stemDocument)

writeLines(head(strwrap(my_corpus[[2]]), 15))
## the vanish glass  near ten year pass sinc dursley woken find
## nephew front step privet drive hard chang the sun rose tidi front
## garden lit brass number four dursley front door crept live room
## almost exact night mr dursley seen fate news report owl onli
## photograph mantelpiec realli show much time pass ten year ago lot
## pictur look like larg pink beach ball wear differ color bonnet
## dudley dursley longer babi now photograph show larg blond boy ride
## first bicycl carousel fair play comput game father hug kiss mother
## the room held sign anoth boy live hous   yet harri potter still
## asleep moment long his aunt petunia awak shrill voic made first
## nois day  up get now  harri woke start his aunt rap door
##   up screech harri heard walk toward kitchen sound fri pan put
## stove he roll onto back tri rememb dream it good one there fli
## motorcycl he funni feel hed dream   his aunt back outsid
## door  ar yet demanded  near said harry  wel get move i want

You can see that the algorithm treated the trailing “y” in “harry” the way it treats adverbs like “clearly”, converting it to an “i” and leaving “harri”. Another example is the word “chang”: the algorithm most likely found related forms such as “changed”, “changing”, or “changer” and reduced all of them to “chang”. The advantage of this process is that redundant words with similar meanings are collapsed together. The disadvantage is that the word “everyth” is no longer “everything”. As long as you’re familiar with the data, you will know that “harri” = “harry”. Be advised.

For the sake of preference, we can convert “harri” back to “harry”.

my_corpus <- tm_map(my_corpus, content_transformer(gsub), pattern = "harri", replacement = "harry")

writeLines(head(strwrap(my_corpus[[2]]), 15))
## the vanish glass  near ten year pass sinc dursley woken find
## nephew front step privet drive hard chang the sun rose tidi front
## garden lit brass number four dursley front door crept live room
## almost exact night mr dursley seen fate news report owl onli
## photograph mantelpiec realli show much time pass ten year ago lot
## pictur look like larg pink beach ball wear differ color bonnet
## dudley dursley longer babi now photograph show larg blond boy ride
## first bicycl carousel fair play comput game father hug kiss mother
## the room held sign anoth boy live hous   yet harry potter still
## asleep moment long his aunt petunia awak shrill voic made first
## nois day  up get now  harry woke start his aunt rap door
##   up screech harry heard walk toward kitchen sound fri pan put
## stove he roll onto back tri rememb dream it good one there fli
## motorcycl he funni feel hed dream   his aunt back outsid
## door  ar yet demanded  near said harry  wel get move i want

Document Term Matrix

Getting all of our text into a workable format does little by itself to enable quantitative analysis. We can create a document term matrix to get an idea of the counts of words in each chapter. To visualize what’s going on, let’s make a small corpus, “my_corpus3”, convert it into a document term matrix using the DocumentTermMatrix() command, and view it using the inspect() command.

text2 <- c("bananas are good", "bananas are yellow")
my_corpus3 <- VCorpus(VectorSource(text2))
dtm1 <- DocumentTermMatrix(my_corpus3) # coerces my_corpus3 into a Document Term Matrix
inspect(dtm1)
## <<DocumentTermMatrix (documents: 2, terms: 4)>>
## Non-/sparse entries: 6/2
## Sparsity           : 25%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs are bananas good yellow
##    1   1       1    1      0
##    2   1       1    0      1

There are four terms across these two documents (strings), giving eight document-term cells in total. The longest term is seven characters long (bananas). Sparsity refers to the proportion of cells that are zero, i.e. terms that appear in one document but not the other. Since two of the eight cells are zero: \[Sparsity = 2/8 = .25\]
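
We can check that arithmetic directly; this short sketch simply counts the zero cells in the matrix version of dtm1.

# Sparsity is the share of zero cells in the document term matrix
m1 <- as.matrix(dtm1)
sum(m1 == 0) / length(m1)   # 2 zero cells out of 8 cells = 0.25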

If we so desired, we could create a term document matrix (a transposed version of the document term matrix) by using the same steps above except with the TermDocumentMatrix() command (+1 for those that figured that out in advance); a quick sketch on the toy corpus appears below, after which we convert our Harry Potter corpus into a document term matrix for further analysis.
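
Here is that sketch on the toy corpus; the term document matrix simply swaps the roles of rows and columns (terms as rows, documents as columns).

# Terms become rows and documents become columns
tdm1 <- TermDocumentMatrix(my_corpus3)
inspect(tdm1)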

# coerces my_corpus into a Document Term Matrix
dtm_potter <- DocumentTermMatrix(my_corpus)

# inspects chapters 1:5, terms 10:17
inspect(dtm_potter[1:5, 10:17]) 
## <<DocumentTermMatrix (documents: 5, terms: 8)>>
## Non-/sparse entries: 5/35
## Sparsity           : 88%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs anyway are as at aunt bane be befor
##    1      0   0  1  0    0    0  0     0
##    2      0   0  0  0    1    0  0     0
##    3      0   0  1  0    0    0  0     0
##    4      1   0  0  0    0    0  0     0
##    5      0   0  0  1    0    0  0     0

As with many documents, there is a high proportion of sparse terms. To view the details of the entire first book, enter the document term matrix name in the console.

dtm_potter
## <<DocumentTermMatrix (documents: 17, terms: 6304)>>
## Non-/sparse entries: 18094/89074
## Sparsity           : 83%
## Maximal term length: 29
## Weighting          : term frequency (tf)

There are 6304 (stemmed) terms across the 17 chapters, and 83% of the document-term entries are sparse (zero). The longest term is 29 characters (likely a result of two or more words being run together).
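
If that sparsity becomes a problem for later analysis, tm provides removeSparseTerms() to drop the rarest terms. This is a sketch only, and the 0.4 threshold is an arbitrary illustration rather than a recommendation.

# Drop terms with sparsity above 0.4 (i.e., keep terms appearing in most chapters)
dtm_potter_dense <- removeSparseTerms(dtm_potter, sparse = 0.4)
dtm_potter_dense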

Word Frequency

Now that the basic structure of the data has been defined, cleaned, and stemmed, we can get some basic information from the data after converting it into a matrix.

# Sum all columns(words) to get frequency
words_frequency <- colSums(as.matrix(dtm_potter)) 

This is a good time to check that the newly created object is accurate. We can check the length of the words_frequency vector and verify that the number of terms is still equal to 6304.

# verify that the terms are still equal to dtm_potter
length(words_frequency) 
## [1] 6304

We can look at the highest counts to gain some insight.

# create sort order (descending) for matrix
ord <- order(words_frequency, decreasing=TRUE)

# get the top 10 words by frequency of appearance
words_frequency[head(ord, 10)] %>% 
  kable()
harry    1010
said      775
look      394
ron       350
hagrid    302
back      253
one       248
know      232
get       229
like      207

No real surprises here. Names like harry, ron, and hagrid provide some insight, but words like “said”, “look”, “one” don’t tell us much. Alternatively, we could use the findFreqTerms() command, but it won’t tell us precisely how many times the words appeared.

# Return (unordered) frequent terms that appeared at least 200 times, with no upper limit.
findFreqTerms(dtm_potter, lowfreq = 200, highfreq = Inf)
##  [1] "back"   "get"    "hagrid" "harry"  "know"   "like"   "look"  
##  [8] "one"    "ron"    "said"
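
Before moving on, and since ggplot2 is already loaded, the top ten counts can also be plotted. This is a minimal sketch built from the words_frequency vector created above.

# Plot the ten most frequent (stemmed) terms
top_terms <- sort(words_frequency, decreasing = TRUE)[1:10]

data.frame(word = names(top_terms), freq = top_terms) %>%
  ggplot(aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
  coord_flip() +
  labs(title = "Most Frequent Terms in Philosopher's Stone",
       x = "Words", y = "Frequency")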

Maybe if we knew which words were associated with each other (much like bigrams) we could gain more understanding. Luckily, we can use the findAssocs() command. To do this we need to specify the document term matrix, the term, and the correlation limit that we seek. The higher the correlation limit, the fewer associated terms are returned. If you set the limit to 1, only words whose counts are perfectly correlated with the term will be returned.

# Find words that are correlated with "ron" with a coefficient > .70
findAssocs(dtm_potter, "ron", .70) %>% 
  kable()
youv     0.78
three    0.77
afford   0.75
either   0.74
push     0.74
okay     0.72
she      0.71
away     0.71

In this example, “ron” and “youv” have a correlation coefficient of 0.78. In other words, the “ron” column of the document term matrix is compared to the “youv” column, and their correlation coefficient of 0.78 indicates a fairly strong linear relationship between the two words’ chapter counts. Recall the splendid explanation of correlation from Daniel Finney’s tutorial. Using the cor() command we can verify the findAssocs() output.

# verify the correlation by comparing the two term columns of the dtm (as a matrix)
cor(as.matrix(dtm_potter)[, "ron"], as.matrix(dtm_potter)[, "youv"])
## [1] 0.7769164

TF:IDF

From the TF:IDF tutorial, the application of Zipf’s law to a corpus of documents was used to identify which words were common in the individual documents versus the words that were common to all documents in the corpus. Thus far, we’ve looked at the entire corpus (one book) and neglected words common to each document (chapter). The tm package permits tf:idf weights to be assigned when the document term matrix is created. Recall the method for viewing document term matrix information.

(dtm_potter)
## <<DocumentTermMatrix (documents: 17, terms: 6304)>>
## Non-/sparse entries: 18094/89074
## Sparsity           : 83%
## Maximal term length: 29
## Weighting          : term frequency (tf)

The weighting is defined as term frequency. We can adjust the weighting to reflect the tf:idf with the DocumentTermMatrix() command.

# convert our "clean" corpus into a tfidf weighted dtm
DocumentTermMatrix(my_corpus, control = list(weighting = weightTfIdf)) -> dtm_potter_tfidf

# View details of tfidf weighted dtm
dtm_potter_tfidf
## <<DocumentTermMatrix (documents: 17, terms: 6304)>>
## Non-/sparse entries: 16819/90349
## Sparsity           : 84%
## Maximal term length: 29
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Notice that our weighting is now defined as “term frequency - inverse document frequency”. Since tf:idf up-weights words that are common within a document but not too common across documents, we now have a dtm that is weighted by tf-idf instead of just tf. We can inspect our newly minted tf-idf document term matrix to see some of the assigned weights.

# view details of chapters 1-17, terms 15:19
inspect(dtm_potter_tfidf[1:17, 15:19]) 
## <<DocumentTermMatrix (documents: 17, terms: 5)>>
## Non-/sparse entries: 5/80
## Sparsity           : 94%
## Maximal term length: 8
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##     Terms
## Docs     bane          be       befor         bet      better  
##   1  0.00000000 0.000000000 0.000000000 0.000000000 0.000000000
##   11 0.00000000 0.000000000 0.002220241 0.000000000 0.002220241
##   15 0.00147296 0.000000000 0.000000000 0.000000000 0.000000000
##   2  0.00000000 0.000000000 0.000000000 0.000000000 0.000000000
##   3  0.00000000 0.000000000 0.000000000 0.000000000 0.000000000
##   4  0.00000000 0.000000000 0.000000000 0.000000000 0.000000000
##   5  0.00000000 0.000000000 0.000000000 0.000000000 0.000000000
##   6  0.00000000 0.000000000 0.000000000 0.000000000 0.000000000
##   7  0.00000000 0.000000000 0.000000000 0.000000000 0.000000000
##   9  0.00000000 0.001530312 0.000000000 0.001530312 0.000000000

It is important to note the weighting is by term (word) as it occurs in a particular document (chapter). Although this provides some insight, it may be more useful to transform the document term matrix into a data frame so we can see many insights at once.

# convert the dtm into a tidy data frame (one row per document-term pair)
df_potter <- tidy(dtm_potter)

# take the product of tf and idf and create new column labeled "tf_idf". Graph it. 
bind_tf_idf(df_potter, term_col = term, document_col = document, n_col = count) %>% 
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(term, levels = rev(unique(term))),
               chapter = factor(document, levels = 1:17)) %>%  
  group_by(document) %>% 
  top_n(6, wt = tf_idf) %>% 
  ungroup() %>% 
  ggplot(aes(word, tf_idf, fill = document)) +
        geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
        labs(title = "Highest tf-idf words in Philosopher's Stone by Chapter",
             x = "Words", y = "tf-idf") +
        facet_wrap(~chapter, ncol = 2, scales = "free") +
        coord_flip()

External Validation

To verify that the results are somewhat practical, we would have to have read the first book in the Harry Potter series to know whether the chapter one “words” are a good representation of what chapter one was about. Often the researcher does not possess the domain knowledge to independently assess the practicality of the results; hence the need for external validation.

Looking at chapter one, we see the top six words in order of tf-idf are:

  • street
  • dursley
  • cat
  • drill
  • mrs
  • son

By viewing the chapter one summary we can see that these words were descriptive of the topics covered in chapter one.

Exercises

  1. Use the tm package to create a volatile corpus from the Prisoner of Azkaban. How many chapters are in this book? (A starter sketch follows this list.)

  2. Use the content transformers to remove punctuation, numbers, and white space, then inspect the first 15 lines. Are there any additional cleaning steps required?

  3. Transform the clean corpus to a Document Term Matrix. What is the sparsity of this book?
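
As a starting point for exercise 1, a minimal sketch is shown below. It assumes the prisoner_of_azkaban object from the harrypotter package; the document count printed for the corpus answers the chapter question.

# Starter for exercise 1: each chapter becomes one document in the corpus
azkaban_corpus <- VCorpus(VectorSource(prisoner_of_azkaban))
azkaban_corpus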

Summary

The tm package is a good tool for novice researchers to conduct basic text analysis. As you’ve probably observed, some techniques are “black box” in nature and require a great deal of trust; their value is amplified by domain knowledge of the data. Without that knowledge, some external validation is usually required before the results can be used in a practical setting.