Text Mining: Preprocessing & Transformation

Preliminary

We will use two new packages, the tm package and the wordcloud package. If you do not already have these packages installed, first install them using the install.packages() function.

install.packages(c("tm", "wordcloud"))

We will also use the DescTools package to obtain high-level information about our data; install it the same way if it is not already available. Next, we load these packages for use in the session.

library(DescTools)
library(tm)
library(wordcloud)

In the lesson that follows, we use the imdb_reviews.csv file, which contains 1000 movie reviews from IMDB and an assigned polarity value (positive_flag) indicating the sentiment of the review (0 = negative, 1 = positive). Each review has a unique identifier, doc_id, and the review text (text).

We use the read.csv() function to import the CSV file into R as a dataframe named imdb. We set stringsAsFactors = FALSE to keep any character columns as-is. We also use the na.strings argument to specify which character strings (in the text column/variable) should be treated as NA, or missing values. We use na.strings = c("", " ") to specify that empty text documents ("") and documents consisting of a single space (" ") should be converted to NA values in our imdb dataframe.

imdb <- read.csv(file = "imdb_reviews.csv",
                 stringsAsFactors = FALSE,
                 na.strings = c("", " "))
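As a minimal sketch of what the na.strings argument does, the example below reads a small, hypothetical CSV supplied inline through the text argument of read.csv() rather than from a file. The three rows and their contents are invented for illustration; only the read.csv() arguments mirror the call above.

```r
# Hypothetical three-row CSV, supplied inline instead of as a file
csv_text <- 'doc_id,text,positive_flag
1,Great film,1
2,,0
3, ,1'

demo <- read.csv(text = csv_text,
                 stringsAsFactors = FALSE,
                 na.strings = c("", " "))

# Row 2 has an empty review and row 3 a single-space review
is.na(demo$text)

# Number of missing values in each column
colSums(is.na(demo))
```

Running colSums(is.na(imdb)) on the real data in the same way gives a quick count of missing reviews per column.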

Data Exploration & Preparation

First, we can obtain high-level information about the imdb dataframe to look at the variable types and to check for missing (NA) values.

Abstract(imdb)

## ------------------------------------------------------------------------------ 
## imdb
## 
## data frame:  1000 obs. of  3 variables
##      1000 complete cases (100.0%)
## 
##   Nr  ColName        Class      NAs  Levels
##   1   doc_id         integer    .          
##   2   text           character  .          
##   3   positive_flag  integer    .

We can also obtain the structure of our data using the str() function to preview our variables.

str(imdb)

## 'data.frame':    1000 obs. of  3 variables:
##  $ doc_id       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ text         : chr  "A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  " "Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  " "Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridi"| __truncated__ "Very little music or anything to speak of.  " ...
##  $ positive_flag: int  0 0 0 0 1 0 0 1 0 1 ...

The positive_flag variable would be our variable of interest in a predictive model. First, we can convert it to a nominal factor variable.

imdb$positive_flag <- factor(imdb$positive_flag)

We can use the plot() function on our factor variable to obtain a bar plot of the distribution of the sentiment (positive_flag) in our document collection (imdb).

plot(imdb$positive_flag, 
     main = "Review Sentiment",
     xlab = "Positive Flag")

As shown, our positive_flag variable is balanced, with 500 positive and 500 negative reviews.

We use the tm package to convert our text data to a Corpus, to which we will apply preprocessing transformations. We build the corpus using the Corpus() function from the tm package. Corpora will primarily be created from a VectorSource() or DataframeSource() object.

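Note: To create a corpus using DataframeSource(), the first column of the dataframe must be named “doc_id” and contain a unique document identifier, and the second column must be named “text” and contain the document text. Any additional columns are kept as document-level metadata.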

colnames(imdb)

## [1] "doc_id"        "text"          "positive_flag"

As shown, the dataframe was created to be a compatible DataframeSource, and has the necessary columns/column names. We use the Corpus() function to create our corpus, named corpus, from our DataframeSource.

corpus <- Corpus(DataframeSource(x = imdb))

The object created with the Corpus() function is a special type of R object:

class(corpus)

## [1] "SimpleCorpus" "Corpus"

It is a list object with

length(corpus)

## [1] 1000

equal to the number of text documents (observations) in the data.

We can view individual documents by using the inspect() function and using list subsetting ([[]]). To view the first document, we can use

inspect(corpus[[1]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 87
## 
## A very, very, very slow-moving, aimless movie about a distressed, drifting young man.
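To make the DataframeSource() requirements concrete, here is a small sketch that builds a corpus from a hypothetical three-row dataframe. The object names toy and toy_corpus and the review texts are invented for illustration; the column layout (doc_id, then text, then any extra columns) follows the layout of the imdb dataframe.

```r
library(tm)

# Hypothetical dataframe: doc_id first, text second, extra columns after
toy <- data.frame(doc_id = c("a", "b", "c"),
                  text = c("Loved it.", "Hated it.", "It was fine."),
                  positive_flag = c(1, 0, 1),
                  stringsAsFactors = FALSE)

toy_corpus <- Corpus(DataframeSource(x = toy))

length(toy_corpus)             # one element per row of toy
as.character(toy_corpus[[2]])  # text of the second document
meta(toy_corpus)               # remaining columns kept as metadata
```

When the text lives in a plain character vector instead of a dataframe, Corpus(VectorSource(x)) is the equivalent starting point.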

Text Data Preprocessing

Next, we need to standardize and cleanse our data. We use the tm_map() function from the tm package to successively apply transformations to our corpus.

Let’s view a document in our corpus, Document 41, to compare the before and after of our cleaning.

inspect(corpus[[41]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 166
## 
## The very idea of it was lame - take a minor character from a mediocre PG-13 film, and make a complete non-sequel while changing its tone to a PG-rated family movie.

Case Conversion

We convert all of the characters to lower case using the tolower() function.

corpus <- tm_map(x = corpus, # apply to all documents
                 FUN = tolower) # tolower() function

We can view the effect on Document 41:

inspect(corpus[[41]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 166
## 
## the very idea of it was lame - take a minor character from a mediocre pg-13 film, and make a complete non-sequel while changing its tone to a pg-rated family movie.
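The corpus in this lesson is a SimpleCorpus, for which passing tolower directly to tm_map() works. If the corpus were instead a VCorpus, a base R function such as tolower() would typically be wrapped in content_transformer() so that each element remains a document object. The sketch below assumes a hypothetical two-document VCorpus named vc.

```r
library(tm)

# Hypothetical two-document VCorpus (not the SimpleCorpus used above)
vc <- VCorpus(VectorSource(c("The Very Idea", "PG-13 Film")))

# content_transformer() wraps a character function so that tm_map()
# returns documents rather than bare character vectors
vc_lower <- tm_map(vc, content_transformer(tolower))

content(vc_lower[[1]])
class(vc_lower[[1]])
```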

We can also visualize our corpus using the wordcloud() function in the wordcloud package. Since the function randomly generates the wordcloud, we will set a seed to create a reproducible plot.

set.seed(1)
wordcloud(corpus, # corpus object
          random.order = FALSE, # most frequent in center
          colors = brewer.pal(8, "Dark2"), # color schema
          max.words = 150) # top 150 terms

Number Removal

To remove numbers from the text documents, we use the removeNumbers() function from the tm package.

corpus <- tm_map(x = corpus, # apply to all documents
                 FUN = removeNumbers) # removeNumbers() function

We can view the effect on Document 41:

inspect(corpus[[41]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 164
## 
## the very idea of it was lame - take a minor character from a mediocre pg- film, and make a complete non-sequel while changing its tone to a pg-rated family movie.

Stop Word Removal

Two popular stop word lists are Snowball (“en”) and SMART. We can use the stopwords() function to view each list. As the output below shows, the “en” list (174 words) is less restrictive than the “SMART” list (571 words).

stopwords(kind = "en")

##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"

stopwords(kind = "SMART")

##   [1] "a"             "a's"           "able"          "about"        
##   [5] "above"         "according"     "accordingly"   "across"       
##   [9] "actually"      "after"         "afterwards"    "again"        
##  [13] "against"       "ain't"         "all"           "allow"        
##  [17] "allows"        "almost"        "alone"         "along"        
##  [21] "already"       "also"          "although"      "always"       
##  [25] "am"            "among"         "amongst"       "an"           
##  [29] "and"           "another"       "any"           "anybody"      
##  [33] "anyhow"        "anyone"        "anything"      "anyway"       
##  [37] "anyways"       "anywhere"      "apart"         "appear"       
##  [41] "appreciate"    "appropriate"   "are"           "aren't"       
##  [45] "around"        "as"            "aside"         "ask"          
##  [49] "asking"        "associated"    "at"            "available"    
##  [53] "away"          "awfully"       "b"             "be"           
##  [57] "became"        "because"       "become"        "becomes"      
##  [61] "becoming"      "been"          "before"        "beforehand"   
##  [65] "behind"        "being"         "believe"       "below"        
##  [69] "beside"        "besides"       "best"          "better"       
##  [73] "between"       "beyond"        "both"          "brief"        
##  [77] "but"           "by"            "c"             "c'mon"        
##  [81] "c's"           "came"          "can"           "can't"        
##  [85] "cannot"        "cant"          "cause"         "causes"       
##  [89] "certain"       "certainly"     "changes"       "clearly"      
##  [93] "co"            "com"           "come"          "comes"        
##  [97] "concerning"    "consequently"  "consider"      "considering"  
## [101] "contain"       "containing"    "contains"      "corresponding"
## [105] "could"         "couldn't"      "course"        "currently"    
## [109] "d"             "definitely"    "described"     "despite"      
## [113] "did"           "didn't"        "different"     "do"           
## [117] "does"          "doesn't"       "doing"         "don't"        
## [121] "done"          "down"          "downwards"     "during"       
## [125] "e"             "each"          "edu"           "eg"           
## [129] "eight"         "either"        "else"          "elsewhere"    
## [133] "enough"        "entirely"      "especially"    "et"           
## [137] "etc"           "even"          "ever"          "every"        
## [141] "everybody"     "everyone"      "everything"    "everywhere"   
## [145] "ex"            "exactly"       "example"       "except"       
## [149] "f"             "far"           "few"           "fifth"        
## [153] "first"         "five"          "followed"      "following"    
## [157] "follows"       "for"           "former"        "formerly"     
## [161] "forth"         "four"          "from"          "further"      
## [165] "furthermore"   "g"             "get"           "gets"         
## [169] "getting"       "given"         "gives"         "go"           
## [173] "goes"          "going"         "gone"          "got"          
## [177] "gotten"        "greetings"     "h"             "had"          
## [181] "hadn't"        "happens"       "hardly"        "has"          
## [185] "hasn't"        "have"          "haven't"       "having"       
## [189] "he"            "he's"          "hello"         "help"         
## [193] "hence"         "her"           "here"          "here's"       
## [197] "hereafter"     "hereby"        "herein"        "hereupon"     
## [201] "hers"          "herself"       "hi"            "him"          
## [205] "himself"       "his"           "hither"        "hopefully"    
## [209] "how"           "howbeit"       "however"       "i"            
## [213] "i'd"           "i'll"          "i'm"           "i've"         
## [217] "ie"            "if"            "ignored"       "immediate"    
## [221] "in"            "inasmuch"      "inc"           "indeed"       
## [225] "indicate"      "indicated"     "indicates"     "inner"        
## [229] "insofar"       "instead"       "into"          "inward"       
## [233] "is"            "isn't"         "it"            "it'd"         
## [237] "it'll"         "it's"          "its"           "itself"       
## [241] "j"             "just"          "k"             "keep"         
## [245] "keeps"         "kept"          "know"          "knows"        
## [249] "known"         "l"             "last"          "lately"       
## [253] "later"         "latter"        "latterly"      "least"        
## [257] "less"          "lest"          "let"           "let's"        
## [261] "like"          "liked"         "likely"        "little"       
## [265] "look"          "looking"       "looks"         "ltd"          
## [269] "m"             "mainly"        "many"          "may"          
## [273] "maybe"         "me"            "mean"          "meanwhile"    
## [277] "merely"        "might"         "more"          "moreover"     
## [281] "most"          "mostly"        "much"          "must"         
## [285] "my"            "myself"        "n"             "name"         
## [289] "namely"        "nd"            "near"          "nearly"       
## [293] "necessary"     "need"          "needs"         "neither"      
## [297] "never"         "nevertheless"  "new"           "next"         
## [301] "nine"          "no"            "nobody"        "non"          
## [305] "none"          "noone"         "nor"           "normally"     
## [309] "not"           "nothing"       "novel"         "now"          
## [313] "nowhere"       "o"             "obviously"     "of"           
## [317] "off"           "often"         "oh"            "ok"           
## [321] "okay"          "old"           "on"            "once"         
## [325] "one"           "ones"          "only"          "onto"         
## [329] "or"            "other"         "others"        "otherwise"    
## [333] "ought"         "our"           "ours"          "ourselves"    
## [337] "out"           "outside"       "over"          "overall"      
## [341] "own"           "p"             "particular"    "particularly" 
## [345] "per"           "perhaps"       "placed"        "please"       
## [349] "plus"          "possible"      "presumably"    "probably"     
## [353] "provides"      "q"             "que"           "quite"        
## [357] "qv"            "r"             "rather"        "rd"           
## [361] "re"            "really"        "reasonably"    "regarding"    
## [365] "regardless"    "regards"       "relatively"    "respectively" 
## [369] "right"         "s"             "said"          "same"         
## [373] "saw"           "say"           "saying"        "says"         
## [377] "second"        "secondly"      "see"           "seeing"       
## [381] "seem"          "seemed"        "seeming"       "seems"        
## [385] "seen"          "self"          "selves"        "sensible"     
## [389] "sent"          "serious"       "seriously"     "seven"        
## [393] "several"       "shall"         "she"           "should"       
## [397] "shouldn't"     "since"         "six"           "so"           
## [401] "some"          "somebody"      "somehow"       "someone"      
## [405] "something"     "sometime"      "sometimes"     "somewhat"     
## [409] "somewhere"     "soon"          "sorry"         "specified"    
## [413] "specify"       "specifying"    "still"         "sub"          
## [417] "such"          "sup"           "sure"          "t"            
## [421] "t's"           "take"          "taken"         "tell"         
## [425] "tends"         "th"            "than"          "thank"        
## [429] "thanks"        "thanx"         "that"          "that's"       
## [433] "thats"         "the"           "their"         "theirs"       
## [437] "them"          "themselves"    "then"          "thence"       
## [441] "there"         "there's"       "thereafter"    "thereby"      
## [445] "therefore"     "therein"       "theres"        "thereupon"    
## [449] "these"         "they"          "they'd"        "they'll"      
## [453] "they're"       "they've"       "think"         "third"        
## [457] "this"          "thorough"      "thoroughly"    "those"        
## [461] "though"        "three"         "through"       "throughout"   
## [465] "thru"          "thus"          "to"            "together"     
## [469] "too"           "took"          "toward"        "towards"      
## [473] "tried"         "tries"         "truly"         "try"          
## [477] "trying"        "twice"         "two"           "u"            
## [481] "un"            "under"         "unfortunately" "unless"       
## [485] "unlikely"      "until"         "unto"          "up"           
## [489] "upon"          "us"            "use"           "used"         
## [493] "useful"        "uses"          "using"         "usually"      
## [497] "uucp"          "v"             "value"         "various"      
## [501] "very"          "via"           "viz"           "vs"           
## [505] "w"             "want"          "wants"         "was"          
## [509] "wasn't"        "way"           "we"            "we'd"         
## [513] "we'll"         "we're"         "we've"         "welcome"      
## [517] "well"          "went"          "were"          "weren't"      
## [521] "what"          "what's"        "whatever"      "when"         
## [525] "whence"        "whenever"      "where"         "where's"      
## [529] "whereafter"    "whereas"       "whereby"       "wherein"      
## [533] "whereupon"     "wherever"      "whether"       "which"        
## [537] "while"         "whither"       "who"           "who's"        
## [541] "whoever"       "whole"         "whom"          "whose"        
## [545] "why"           "will"          "willing"       "wish"         
## [549] "with"          "within"        "without"       "won't"        
## [553] "wonder"        "would"         "would"         "wouldn't"     
## [557] "x"             "y"             "yes"           "yet"          
## [561] "you"           "you'd"         "you'll"        "you're"       
## [565] "you've"        "your"          "yours"         "yourself"     
## [569] "yourselves"    "z"             "zero"

We can use the intersect() function to find words that are common to both stop lists.

intersect(x = stopwords(kind = "en"),
          y = stopwords(kind = "SMART"))

##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "it's"       "we're"      "they're"    "i've"       "you've"    
##  [66] "we've"      "they've"    "i'd"        "you'd"      "we'd"      
##  [71] "they'd"     "i'll"       "you'll"     "we'll"      "they'll"   
##  [76] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [81] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [86] "won't"      "wouldn't"   "shouldn't"  "can't"      "cannot"    
##  [91] "couldn't"   "let's"      "that's"     "who's"      "what's"    
##  [96] "here's"     "there's"    "where's"    "a"          "an"        
## [101] "the"        "and"        "but"        "if"         "or"        
## [106] "because"    "as"         "until"      "while"      "of"        
## [111] "at"         "by"         "for"        "with"       "about"     
## [116] "against"    "between"    "into"       "through"    "during"    
## [121] "before"     "after"      "above"      "below"      "to"        
## [126] "from"       "up"         "down"       "in"         "out"       
## [131] "on"         "off"        "over"       "under"      "again"     
## [136] "further"    "then"       "once"       "here"       "there"     
## [141] "when"       "where"      "why"        "how"        "all"       
## [146] "any"        "both"       "each"       "few"        "more"      
## [151] "most"       "other"      "some"       "such"       "no"        
## [156] "nor"        "not"        "only"       "own"        "same"      
## [161] "so"         "than"       "too"        "very"

We will use the more restrictive, “SMART” list. We use the removeWords() function from the tm package to remove the “SMART” stop words. The removeWords() function can also be used to remove custom stop words.

corpus <- tm_map(x = corpus, # apply to all documents
                 FUN = function(x) removeWords(x, # use removeWords() function to
                                               stopwords("SMART"))) # remove SMART stopwords

We can view the effect on Document 41:

inspect(corpus[[41]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 122
## 
##   idea    lame -   minor character   mediocre pg- film,  make  complete -sequel  changing  tone   pg-rated family movie.
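Since removeWords() accepts any character vector of words, custom stop words can be appended to a standard list. The sketch below applies it to a single string; the extra words "movie" and "film" and the example sentence are hypothetical choices for illustration, not part of the lesson's pipeline.

```r
library(tm)

# "movie" and "film" are hypothetical, domain-specific additions
custom_stops <- c(stopwords("SMART"), "movie", "film")

x <- "the plot of this movie was better than the last film"
cleaned <- removeWords(x, custom_stops)
cleaned
```

The same vector could be passed inside tm_map(), in place of stopwords("SMART"), to apply the extended list to every document in a corpus.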

Again, we can visualize our corpus using the wordcloud() function in the wordcloud package.

set.seed(1)
wordcloud(corpus, # corpus object
          random.order = FALSE, # most frequent in center
          colors = brewer.pal(8, "Dark2"), # color schema
          max.words = 150) # top 150 terms

Remove Punctuation

We use the removePunctuation() function to remove punctuation. We set the function’s arguments to keep intra-word dashes (as in pg-rated) but to strip the apostrophes inside contractions.

corpus <- tm_map(x = corpus, # apply to all documents
                 FUN = removePunctuation, # removePunctuation() function
                 preserve_intra_word_contractions = FALSE, # strip apostrophes in contractions
                 preserve_intra_word_dashes = TRUE) # keep dashes

We can view the effect on Document 41:

inspect(corpus[[41]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 117
## 
##   idea    lame    minor character   mediocre pg film  make  complete sequel  changing  tone   pg-rated family movie
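The output above still contains runs of spaces where words and punctuation were removed. These do not affect tokenization later on, but for readability they can be collapsed. The sketch below works on a plain string shaped like Document 41 and uses stripWhitespace() from tm together with base R's trimws(); adding this step is an optional suggestion rather than part of the lesson's pipeline.

```r
library(tm)

x <- "  idea    lame -   mediocre pg- film,  pg-rated family movie."

no_punct <- removePunctuation(x,
                              preserve_intra_word_contractions = FALSE,
                              preserve_intra_word_dashes = TRUE)

# Collapse runs of whitespace, then trim the ends (base R trimws())
tidy <- trimws(stripWhitespace(no_punct))
tidy
```

Note that the dash in "pg-rated" survives because it sits between two word characters, while the free-standing dash and the one in "pg- " are removed.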

Apply Stemming (or Lemmatization)

We can either apply stemming or lemmatization to reduce the number of terms based on their root words. We will apply stemming, using the stemDocument() function in the tm package. We save the stemmed corpus as a new corpus, named corpus_stem.

corpus_stem <- tm_map(x = corpus, # apply to all documents
                      FUN = stemDocument, # stemDocument() function
                      language = "english") # English language stems

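The stemmed documents are stored in corpus_stem, so inspecting the original corpus, as below, still shows the unstemmed Document 41. To view the stemmed text, inspect corpus_stem[[41]] instead.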

inspect(corpus[[41]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 117
## 
##   idea    lame    minor character   mediocre pg film  make  complete sequel  changing  tone   pg-rated family movie
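To see what stemming does to individual words, stemDocument() can also be applied directly to a character vector. The words below are taken from Document 41; the vector name tokens is just for illustration.

```r
library(tm)  # stemDocument() relies on the SnowballC package

tokens <- c("character", "movie", "family", "complete", "changing")
stems <- stemDocument(tokens, language = "english")
stems
```

The resulting stems are truncated roots rather than dictionary words, which is why terms such as "charact" and "movi" appear in the document-term matrices later in the lesson.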

Again, we can visualize our corpus using the wordcloud() function in the wordcloud package.

set.seed(1)
wordcloud(corpus_stem, # stemmed corpus object
          random.order = FALSE, # most frequent in center
          colors = brewer.pal(8, "Dark2"), # color schema
          max.words = 150) # top 150 terms

Document-Term (or Term-Document) Representation

To create a Document-Term Matrix (DTM) we use the DocumentTermMatrix() function from the tm package. We will use the stemmed corpus. To create a Term-Document Matrix (TDM), the TermDocumentMatrix() function can be used.

dtm <- DocumentTermMatrix(corpus_stem)

We can view high-level information about our dtm object, including the number of documents and terms, sparsity, maximal term length and term weighting (term frequency by default), by printing the object, that is, by running its name on a line by itself. To also see a sample of the DTM, we can use the inspect() function.

inspect(dtm)

## <<DocumentTermMatrix (documents: 1000, terms: 2194)>>
## Non-/sparse entries: 5664/2188336
## Sparsity           : 100%
## Maximal term length: 25
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  act bad charact film good great love movi time watch
##   244   0   0       1    1    0     1    0    0    0     0
##   376   1   0       0    0    2     0    0    1    0     0
##   391   0   0       1    0    0     0    0    0    0     0
##   422   0   0       0    0    0     0    0    2    0     0
##   429   1   2       0    0    0     0    0    1    0     0
##   470   0   0       0    0    0     0    0    1    0     0
##   477   0   0       0    0    0     0    0    0    0     0
##   621   0   0       0    0    0     0    1    0    0     0
##   622   0   0       1    0    0     0    0    0    1     0
##   805   0   0       0    1    0     0    0    0    0     0
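A toy example may help in reading this output. The sketch below builds a DTM from three hypothetical one-line documents and converts it to an ordinary matrix, which is practical only for small corpora. It also computes the share of zero cells, which is the quantity reported as sparsity; in the output above that share is 2188336 / 2194000, roughly 99.7%, which is displayed as 100% after rounding.

```r
library(tm)

# Three hypothetical documents
docs <- c("good movie good cast", "bad movie", "good plot")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Dense view: one row per document, one column per term
toy_mat <- as.matrix(toy_dtm)
toy_mat

# Proportion of zero cells, the quantity reported as "Sparsity"
mean(toy_mat == 0)
```

By default, DocumentTermMatrix() keeps only terms of at least three characters, so a two-letter token such as "pg" would not appear as a column.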

Dimension Reduction

Our next step is to reduce the dimensionality of our DTM. This can be achieved either by setting a minimum document frequency threshold or setting a threshold for the allowable amount of sparsity.

Minimum Document Frequency

We can use the bounds argument in the control list of the DocumentTermMatrix() function to set the minimum number of documents in which a term must appear to be kept. Below, we create a new DTM, dtm_m5d, with a minimum document frequency of 5. We can then view high-level information about the DTM by printing the object (dtm_m5d).

dtm_m5d <- DocumentTermMatrix(x = corpus_stem, 
                              control = list(bounds = list(global = c(5, Inf))))
dtm_m5d

## <<DocumentTermMatrix (documents: 1000, terms: 226)>>
## Non-/sparse entries: 2760/223240
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency (tf)
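The effect of the global bounds can be checked by hand on a small example. The sketch below uses three hypothetical documents, computes each term's document frequency from the dense matrix, and then rebuilds the DTM with a minimum document frequency of 2 (chosen only because the toy corpus is so small).

```r
library(tm)

docs <- c("good movie good cast", "bad movie", "good plot")
toy_corpus <- Corpus(VectorSource(docs))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Document frequency: number of documents containing each term
doc_freq <- colSums(as.matrix(toy_dtm) > 0)
doc_freq

# Keep only terms found in at least 2 documents
toy_dtm_m2d <- DocumentTermMatrix(toy_corpus,
                                  control = list(bounds = list(global = c(2, Inf))))
Terms(toy_dtm_m2d)
```

Note that "good" occurs twice in the first document but counts only once toward its document frequency.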

Sparsity Reduction

With real-world data sets, we will often see reported sparsity of 99% or even 100% (the latter because of rounding). Instead of thresholding on document frequency, we can remove terms based on their sparsity, that is, the proportion of documents in which a term does not appear. The removeSparseTerms() function can be applied to a DTM or TDM and removes every term whose sparsity is at least the value given in its sparse argument.

Below, we use removeSparseTerms() with sparse = .999 to remove the most infrequently occurring terms; with 1000 documents, this keeps only terms that appear in more than one document. We can then view high-level information about the resulting DTM by printing the object (dtm_sr).

dtm_sr <- removeSparseTerms(dtm, .999)
dtm_sr

## <<DocumentTermMatrix (documents: 1000, terms: 818)>>
## Non-/sparse entries: 4288/813712
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency (tf)
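The sparse threshold is easier to interpret with per-term sparsity written out. The sketch below uses three hypothetical documents and a threshold of 0.5, chosen only to suit the tiny corpus; it computes the share of documents in which each term is absent and then applies removeSparseTerms().

```r
library(tm)

docs <- c("good movie good cast", "bad movie", "good plot")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Per-term sparsity: share of documents in which the term is absent
term_sparsity <- 1 - colMeans(as.matrix(toy_dtm) > 0)
round(term_sparsity, 2)

# Terms whose sparsity is 0.5 or higher are dropped
toy_dtm_sr <- removeSparseTerms(toy_dtm, sparse = 0.5)
Terms(toy_dtm_sr)
```

Here the terms that appear in two of the three documents have a sparsity of about 0.33 and are kept, while terms that appear in a single document have a sparsity of about 0.67 and are removed.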

We can view the (stemmed) terms in our minimum document-frequency bounded DTM using the Terms() function.

Terms(dtm_m5d)

##   [1] "man"            "movi"           "audienc"        "charact"       
##   [5] "half"           "act"            "attempt"        "black"         
##   [9] "camera"         "clever"         "disappoint"     "line"          
##  [13] "plot"           "poor"           "ridicul"        "white"         
##  [17] "music"          "find"           "scene"          "song"          
##  [21] "art"            "guess"          "lack"           "work"          
##  [25] "hour"           "wast"           "good"           "kid"           
##  [29] "thought"        "bit"            "predict"        "cast"          
##  [33] "love"           "lot"            "made"           "show"          
##  [37] "hilari"         "cool"           "deliv"          "face"          
##  [41] "budget"         "film"           "long"           "review"        
##  [45] "singl"          "cinematographi" "direct"         "edit"          
##  [49] "put"            "perfect"        "cinema"         "histori"       
##  [53] "minut"          "level"          "word"           "imagin"        
##  [57] "simpli"         "amount"         "beauti"         "creat"         
##  [61] "pictur"         "piec"           "short"          "game"          
##  [65] "part"           "seri"           "deserv"         "strong"        
##  [69] "money"          "kind"           "time"           "crap"          
##  [73] "fun"            "enjoy"          "play"           "flick"         
##  [77] "complet"        "famili"         "lame"           "make"          
##  [81] "interest"       "entir"          "give"           "moment"        
##  [85] "funni"          "talent"         "peopl"          "star"          
##  [89] "stori"          "effect"         "real"           "worst"         
##  [93] "cost"           "lead"           "screen"         "written"       
##  [97] "girl"           "life"           "recommend"      "excel"         
## [101] "perform"        "believ"         "total"          "convinc"       
## [105] "utter"          "portray"        "actor"          "tom"           
## [109] "annoy"          "feel"           "absolut"        "dialogu"       
## [113] "bad"            "found"          "general"        "great"         
## [117] "thing"          "worth"          "suspens"        "write"         
## [121] "amaz"           "live"           "big"            "shot"          
## [125] "year"           "pace"           "gave"           "classic"       
## [129] "pretti"         "turn"           "problem"        "script"        
## [133] "touch"          "end"            "watch"          "back"          
## [137] "joy"            "bore"           "happen"         "horror"        
## [141] "stupid"         "director"       "night"          "nice"          
## [145] "brilliant"      "rent"           "world"          "fact"          
## [149] "leav"           "understand"     "move"           "rate"          
## [153] "experi"         "flaw"           "high"           "relat"         
## [157] "incred"         "terribl"        "wors"           "horribl"       
## [161] "suck"           "cartoon"        "emot"           "set"           
## [165] "tortur"         "reason"         "sound"          "job"           
## [169] "john"           "hitchcock"      "thriller"       "full"          
## [173] "danc"           "hole"           "recent"         "pathet"        
## [177] "talk"           "action"         "care"           "master"        
## [181] "fail"           "drama"          "visual"         "actress"       
## [185] "call"           "cheap"          "spoiler"        "fan"           
## [189] "solid"          "surpris"        "felt"           "child"         
## [193] "eye"            "continu"        "expect"         "day"           
## [197] "place"          "start"          "final"          "subtl"         
## [201] "mention"        "wonder"         "intellig"       "human"         
## [205] "entertain"      "memor"          "special"        "scare"         
## [209] "role"           "top"            "product"        "impress"       
## [213] "garbag"         "involv"         "style"          "produc"        
## [217] "open"           "comedi"         "superb"         "fine"          
## [221] "mess"           "documentari"    "origin"         "avoid"         
## [225] "begin"          "fast"

We can use the findFreqTerms() function in the tm package to find the terms that appear at least n times.

Terms that occur at least 25 times

findFreqTerms(x = dtm_m5d, lowfreq = 25)

##  [1] "movi"    "charact" "act"     "plot"    "scene"   "work"    "good"   
##  [8] "cast"    "love"    "made"    "film"    "time"    "play"    "make"   
## [15] "stori"   "actor"   "bad"     "great"   "thing"   "script"  "watch"

Terms that occur at least 50 times

findFreqTerms(x = dtm_m5d, lowfreq = 50)

## [1] "movi"    "charact" "good"    "film"    "bad"
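findFreqTerms() thresholds on the total number of occurrences of a term across the corpus, which differs from the document frequency used by the bounds argument. As a sketch, the totals can be computed directly from a small dense matrix and compared with the function's result; the three documents are hypothetical.

```r
library(tm)

docs <- c("good movie good cast", "bad movie", "good plot")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Total occurrences of each term across all documents
term_freq <- sort(colSums(as.matrix(toy_dtm)), decreasing = TRUE)
term_freq

# Same threshold expressed through findFreqTerms()
frequent <- findFreqTerms(x = toy_dtm, lowfreq = 2)
frequent
```

The sorted totals also give the actual counts, which findFreqTerms() does not return.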

Weighting

Since term frequency does not indicate term importance, we need to apply weighting to our DTM. Before applying weighting, we should identify and remove any empty documents (documents that do not contain any of the terms in our DTM following preprocessing).

nTerms(dtm_m5d)

## [1] 226

We can use the apply() function to identify empty documents. First, we can obtain the sum for each of the rows (documents) in our DTM.

rowsums <- apply(X = dtm_m5d, # DTM to apply the function to
                 MARGIN = 1, # apply to the rows
                 FUN = sum) # apply the sum() function

From there, we can subset our DTM to only retain those documents (rows/observations) that have a sum greater than 0. We save this as a new DTM object named dtm_red.

dtm_red <- dtm_m5d[rowsums > 0,]
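The apply() call converts the DTM to a dense matrix behind the scenes, which is fine for 1000 documents but can become memory-hungry for large corpora. One possible alternative, sketched below on a hypothetical four-document corpus, is row_sums() from the slam package (installed alongside tm as a dependency), which works on the sparse representation directly.

```r
library(tm)

docs <- c("good movie good cast", "bad movie", "good plot", "awful")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)),
                              control = list(bounds = list(global = c(2, Inf))))

# Row totals computed on the sparse representation
toy_rowsums <- slam::row_sums(toy_dtm)
toy_rowsums

# Document 4 contains none of the retained terms, so it is dropped
toy_dtm_red <- toy_dtm[toy_rowsums > 0, ]
nDocs(toy_dtm_red)
```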

Now, we can apply normalized TF-IDF weighting to our dtm_red DTM object using the weightTfIdf() function (normalization is its default). We can then view high-level information about the weighted DTM by printing the object (dtm_red_tfidf).

dtm_red_tfidf <- weightTfIdf(dtm_red)
dtm_red_tfidf

## <<DocumentTermMatrix (documents: 933, terms: 226)>>
## Non-/sparse entries: 2760/208098
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
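The weighting itself can be reproduced in a few lines of base R. The sketch below assumes the scheme described in the weightTfIdf() documentation for normalize = TRUE: each count is divided by the total number of terms in its document, then multiplied by the base-2 log of the number of documents divided by the term's document frequency. The count matrix is hypothetical.

```r
# Hypothetical term counts for three documents
counts <- matrix(c(2, 1, 1, 0, 0,
                   0, 1, 0, 1, 0,
                   1, 0, 0, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs = c("1", "2", "3"),
                                 Terms = c("good", "movie", "cast", "bad", "plot")))

tf  <- counts / rowSums(counts)                  # length-normalized term frequency
idf <- log2(nrow(counts) / colSums(counts > 0))  # rarer terms get larger weights
tfidf <- sweep(tf, MARGIN = 2, STATS = idf, FUN = "*")

round(tfidf, 3)
```

The division by rowSums(counts) also shows why empty documents are removed first: a document with no retained terms has a row sum of zero, so its normalized term frequencies would be undefined.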

We can use the document IDs (Docs) to create a subset of our original dataframe, imdb, which includes only the non-empty observations (to match our DTM’s document dimension).

imdb_sub <- imdb[imdb$doc_id %in% Docs(dtm_red_tfidf),]

Finally, we can combine our dimension-reduced, TF-IDF-weighted DTM and dataframe, imdb_sub, so that we have the predictor variables (terms) and target (positive_flag) in the same dataframe for continued analysis (classification).

imdb_df <- data.frame(as.matrix(dtm_red_tfidf),
                      positive_flag = factor(imdb_sub$positive_flag))
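Two details are worth checking when binding a matrix and a target column this way: that the rows are in the same order, and that data.frame() may alter term names that are not syntactically valid R names. The sketch below illustrates both with a hypothetical two-document weighted matrix; the names toy_w, toy_meta and toy_df are invented for the example.

```r
# Hypothetical weighted matrix for two retained documents
toy_w <- matrix(c(0.3, 0.0,
                  0.0, 0.8),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("1", "3"), c("good", "pg-rated")))
toy_meta <- data.frame(doc_id = c(1, 3), positive_flag = factor(c(0, 1)))

# Guard against misaligned rows before binding the target column
stopifnot(identical(rownames(toy_w), as.character(toy_meta$doc_id)))

toy_df <- data.frame(toy_w, positive_flag = toy_meta$positive_flag)
names(toy_df)   # "pg-rated" is stored as "pg.rated"
```

An analogous check on the real objects would compare Docs(dtm_red_tfidf) with as.character(imdb_sub$doc_id) before creating imdb_df.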

We will export the prepared data as a CSV file for further use in classification analysis.

write.csv(x = imdb_df, 
          file = "imdb_df.csv",
          row.names = FALSE)

Dr. Chelsey Hill
