We will use two new packages, the tm package and the wordcloud package. If you do not already have these packages installed, first install them using the install.packages() function.
install.packages(c("tm", "wordcloud"))
We will also use the DescTools package to obtain high-level information about our data (if it is not already installed, it can be installed the same way). Next, we load these packages for use in the session.
library(DescTools)
library(tm)
library(wordcloud)
In the lesson that follows, we use the imdb_reviews.csv file, which contains 1000 movie reviews from IMDB and an assigned polarity value (positive_flag) indicating the sentiment of the review (0 = negative, 1 = positive). Each review has a unique identifier, doc_id, and the review text (text).
We use the read.csv() function to import the CSV file into R as a dataframe named imdb. We set stringsAsFactors = FALSE to keep any character columns as-is. We also use the na.strings argument to specify which character strings (in the text column/variable) should be treated as NA, or missing values. We use na.strings = c("", " ") to specify that empty text documents ("") and documents consisting of a single white space (" ") should be converted to NA values in our imdb dataframe.
imdb <- read.csv(file = "imdb_reviews.csv",
stringsAsFactors = FALSE,
na.strings = c("", " "))
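The effect of na.strings can be checked on a small inline example. This is only a sketch: the three-row CSV text and the object names csv_text and toy are made up for illustration and are not part of imdb_reviews.csv.

```r
# Hypothetical three-row CSV: row 2 has an empty text field,
# row 3 has a text field containing a single space
csv_text <- "doc_id,text,positive_flag
1,Great film,1
2,,0
3, ,0"

toy <- read.csv(text = csv_text,
                stringsAsFactors = FALSE,
                na.strings = c("", " "))

# TRUE marks the documents whose text was read in as NA
is.na(toy$text)

# Number of missing values per column
colSums(is.na(toy))
```

The same colSums(is.na(...)) check can be run on imdb as a quick complement to the summary that follows.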
First, we can obtain high-level information about the imdb dataframe to look at the variable types and to check for missing (NA) values.
Abstract(imdb)
## ------------------------------------------------------------------------------
## imdb
##
## data frame: 1000 obs. of 3 variables
## 1000 complete cases (100.0%)
##
## Nr  ColName        Class      NAs  Levels
## 1   doc_id         integer    .
## 2   text           character  .
## 3   positive_flag  integer    .
We can also obtain the structure of our data using the str() function to preview our variables.
str(imdb)
## 'data.frame': 1000 obs. of 3 variables:
## $ doc_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ text : chr "A very, very, very slow-moving, aimless movie about a distressed, drifting young man. " "Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. " "Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridi"| __truncated__ "Very little music or anything to speak of. " ...
## $ positive_flag: int 0 0 0 0 1 0 0 1 0 1 ...
The positive_flag variable would be our variable of interest in a predictive model. First, we can convert it to a nominal factor variable.
imdb$positive_flag <- factor(imdb$positive_flag)
We can use the plot() function on our factor variable to obtain a bar plot of the distribution of the sentiment (positive_flag) in our document collection (imdb).
plot(imdb$positive_flag,
main = "Review Sentiment",
xlab = "Positive Flag")
As shown, our positive_flag variable is balanced, with 500 positive and 500 negative reviews.
We use the tm package to convert our text data to a corpus, to which we will apply preprocessing transformations. We build the corpus using the Corpus() function from the tm package. Corpora will primarily be created from a VectorSource() or DataframeSource() object.
Note: To create a corpus using DataframeSource(), there must be a column named “text”, containing text and a column named “doc_id”, containing a unique document identifier.
colnames(imdb)
## [1] "doc_id" "text" "positive_flag"
As shown, the dataframe was created to be a compatible DataframeSource, and has the necessary columns/column names. We use the Corpus() function to create our corpus, named corpus, from our DataframeSource.
corpus <- Corpus(DataframeSource(x = imdb))
The object created with the Corpus() function is a special type of R object:
class(corpus)
## [1] "SimpleCorpus" "Corpus"
It is a list-like object whose length is equal to the number of text documents (observations) in the data.
length(corpus)
## [1] 1000
We can view individual documents by using the inspect() function and using list subsetting ([[]]). To view the first document, we can use
inspect(corpus[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 87
##
## A very, very, very slow-moving, aimless movie about a distressed, drifting young man.
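For comparison, a corpus can also be built from a plain character vector with VectorSource(). The sketch below uses a made-up three-document dataframe (toy_df) rather than the imdb data; the object names are illustrative only.

```r
library(tm)

# Hypothetical toy data with the two columns DataframeSource() requires
toy_df <- data.frame(doc_id = 1:3,
                     text = c("Great film.", "Dull plot.", "Fine cast."),
                     stringsAsFactors = FALSE)

# Corpus from a dataframe: doc_id and text are picked up by column name
corpus_df <- Corpus(DataframeSource(x = toy_df))

# Corpus from a character vector: only the text is supplied
corpus_vec <- Corpus(VectorSource(x = toy_df$text))

length(corpus_df)    # number of documents
length(corpus_vec)
```

A DataframeSource is convenient when the document identifiers need to travel with the text, as they do later when the DTM is matched back to the dataframe.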
Next, we need to standardize and cleanse our data. We use the tm_map() function from the tm package to successively apply transformations to our corpus.
Let’s view a document in our corpus, Document 41, to compare the before and after of our cleaning.
inspect(corpus[[41]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 166
##
## The very idea of it was lame - take a minor character from a mediocre PG-13 film, and make a complete non-sequel while changing its tone to a PG-rated family movie.
We convert the text to lowercase using the tolower() function.
corpus <- tm_map(x = corpus, # apply to all documents
FUN = tolower) # tolower() function
We can view the effect on Document 41:
inspect(corpus[[41]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 166
##
## the very idea of it was lame - take a minor character from a mediocre pg-13 film, and make a complete non-sequel while changing its tone to a pg-rated family movie.
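Passing tolower directly works here, as the output above shows, because our corpus is a SimpleCorpus. For a VCorpus, base functions such as tolower() are usually wrapped in content_transformer() so that the documents keep their structure. A minimal sketch with a made-up one-document corpus (toy_vc is a hypothetical name):

```r
library(tm)

# Hypothetical one-document volatile corpus
toy_vc <- VCorpus(VectorSource("A Mediocre PG-13 Film"))

# content_transformer() adapts a character function for use with tm_map()
toy_vc <- tm_map(toy_vc, content_transformer(tolower))

# Extract the text of the first document
content(toy_vc[[1]])
```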
We can also visualize our corpus using the wordcloud() function in the wordcloud package. Since the function randomly generates the wordcloud, we will set a seed to create a reproducible plot.
set.seed(1)
wordcloud(corpus, # corpus object
random.order = FALSE, # most frequent in center
colors = brewer.pal(8, "Dark2"), # color schema
max.words = 150) # top 150 terms
We remove numbers using the removeNumbers() function from the tm package.
corpus <- tm_map(x = corpus, # apply to all documents
FUN = removeNumbers) # removeNumbers() function
We can view the effect on Document 41:
inspect(corpus[[41]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 164
##
## the very idea of it was lame - take a minor character from a mediocre pg- film, and make a complete non-sequel while changing its tone to a pg-rated family movie.
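The transformation can also be tried on a single string, which makes it easier to see why "pg-13" becomes "pg-" (only the digits are dropped, and the surrounding characters are left in place). The input string below is made up for illustration.

```r
library(tm)

# removeNumbers() also accepts a plain character vector
x <- "a mediocre pg-13 film from 2005"

# Digits are deleted; the dash and spaces around them remain
removeNumbers(x)
```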
Before removing stop words, we can use the stopwords() function to view the available stop word lists. The “en” list is less restrictive than the “SMART” stop word list.
stopwords(kind = "en")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
stopwords(kind = "SMART")
## [1] "a" "a's" "able" "about"
## [5] "above" "according" "accordingly" "across"
## [9] "actually" "after" "afterwards" "again"
## [13] "against" "ain't" "all" "allow"
## [17] "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always"
## [25] "am" "among" "amongst" "an"
## [29] "and" "another" "any" "anybody"
## [33] "anyhow" "anyone" "anything" "anyway"
## [37] "anyways" "anywhere" "apart" "appear"
## [41] "appreciate" "appropriate" "are" "aren't"
## [45] "around" "as" "aside" "ask"
## [49] "asking" "associated" "at" "available"
## [53] "away" "awfully" "b" "be"
## [57] "became" "because" "become" "becomes"
## [61] "becoming" "been" "before" "beforehand"
## [65] "behind" "being" "believe" "below"
## [69] "beside" "besides" "best" "better"
## [73] "between" "beyond" "both" "brief"
## [77] "but" "by" "c" "c'mon"
## [81] "c's" "came" "can" "can't"
## [85] "cannot" "cant" "cause" "causes"
## [89] "certain" "certainly" "changes" "clearly"
## [93] "co" "com" "come" "comes"
## [97] "concerning" "consequently" "consider" "considering"
## [101] "contain" "containing" "contains" "corresponding"
## [105] "could" "couldn't" "course" "currently"
## [109] "d" "definitely" "described" "despite"
## [113] "did" "didn't" "different" "do"
## [117] "does" "doesn't" "doing" "don't"
## [121] "done" "down" "downwards" "during"
## [125] "e" "each" "edu" "eg"
## [129] "eight" "either" "else" "elsewhere"
## [133] "enough" "entirely" "especially" "et"
## [137] "etc" "even" "ever" "every"
## [141] "everybody" "everyone" "everything" "everywhere"
## [145] "ex" "exactly" "example" "except"
## [149] "f" "far" "few" "fifth"
## [153] "first" "five" "followed" "following"
## [157] "follows" "for" "former" "formerly"
## [161] "forth" "four" "from" "further"
## [165] "furthermore" "g" "get" "gets"
## [169] "getting" "given" "gives" "go"
## [173] "goes" "going" "gone" "got"
## [177] "gotten" "greetings" "h" "had"
## [181] "hadn't" "happens" "hardly" "has"
## [185] "hasn't" "have" "haven't" "having"
## [189] "he" "he's" "hello" "help"
## [193] "hence" "her" "here" "here's"
## [197] "hereafter" "hereby" "herein" "hereupon"
## [201] "hers" "herself" "hi" "him"
## [205] "himself" "his" "hither" "hopefully"
## [209] "how" "howbeit" "however" "i"
## [213] "i'd" "i'll" "i'm" "i've"
## [217] "ie" "if" "ignored" "immediate"
## [221] "in" "inasmuch" "inc" "indeed"
## [225] "indicate" "indicated" "indicates" "inner"
## [229] "insofar" "instead" "into" "inward"
## [233] "is" "isn't" "it" "it'd"
## [237] "it'll" "it's" "its" "itself"
## [241] "j" "just" "k" "keep"
## [245] "keeps" "kept" "know" "knows"
## [249] "known" "l" "last" "lately"
## [253] "later" "latter" "latterly" "least"
## [257] "less" "lest" "let" "let's"
## [261] "like" "liked" "likely" "little"
## [265] "look" "looking" "looks" "ltd"
## [269] "m" "mainly" "many" "may"
## [273] "maybe" "me" "mean" "meanwhile"
## [277] "merely" "might" "more" "moreover"
## [281] "most" "mostly" "much" "must"
## [285] "my" "myself" "n" "name"
## [289] "namely" "nd" "near" "nearly"
## [293] "necessary" "need" "needs" "neither"
## [297] "never" "nevertheless" "new" "next"
## [301] "nine" "no" "nobody" "non"
## [305] "none" "noone" "nor" "normally"
## [309] "not" "nothing" "novel" "now"
## [313] "nowhere" "o" "obviously" "of"
## [317] "off" "often" "oh" "ok"
## [321] "okay" "old" "on" "once"
## [325] "one" "ones" "only" "onto"
## [329] "or" "other" "others" "otherwise"
## [333] "ought" "our" "ours" "ourselves"
## [337] "out" "outside" "over" "overall"
## [341] "own" "p" "particular" "particularly"
## [345] "per" "perhaps" "placed" "please"
## [349] "plus" "possible" "presumably" "probably"
## [353] "provides" "q" "que" "quite"
## [357] "qv" "r" "rather" "rd"
## [361] "re" "really" "reasonably" "regarding"
## [365] "regardless" "regards" "relatively" "respectively"
## [369] "right" "s" "said" "same"
## [373] "saw" "say" "saying" "says"
## [377] "second" "secondly" "see" "seeing"
## [381] "seem" "seemed" "seeming" "seems"
## [385] "seen" "self" "selves" "sensible"
## [389] "sent" "serious" "seriously" "seven"
## [393] "several" "shall" "she" "should"
## [397] "shouldn't" "since" "six" "so"
## [401] "some" "somebody" "somehow" "someone"
## [405] "something" "sometime" "sometimes" "somewhat"
## [409] "somewhere" "soon" "sorry" "specified"
## [413] "specify" "specifying" "still" "sub"
## [417] "such" "sup" "sure" "t"
## [421] "t's" "take" "taken" "tell"
## [425] "tends" "th" "than" "thank"
## [429] "thanks" "thanx" "that" "that's"
## [433] "thats" "the" "their" "theirs"
## [437] "them" "themselves" "then" "thence"
## [441] "there" "there's" "thereafter" "thereby"
## [445] "therefore" "therein" "theres" "thereupon"
## [449] "these" "they" "they'd" "they'll"
## [453] "they're" "they've" "think" "third"
## [457] "this" "thorough" "thoroughly" "those"
## [461] "though" "three" "through" "throughout"
## [465] "thru" "thus" "to" "together"
## [469] "too" "took" "toward" "towards"
## [473] "tried" "tries" "truly" "try"
## [477] "trying" "twice" "two" "u"
## [481] "un" "under" "unfortunately" "unless"
## [485] "unlikely" "until" "unto" "up"
## [489] "upon" "us" "use" "used"
## [493] "useful" "uses" "using" "usually"
## [497] "uucp" "v" "value" "various"
## [501] "very" "via" "viz" "vs"
## [505] "w" "want" "wants" "was"
## [509] "wasn't" "way" "we" "we'd"
## [513] "we'll" "we're" "we've" "welcome"
## [517] "well" "went" "were" "weren't"
## [521] "what" "what's" "whatever" "when"
## [525] "whence" "whenever" "where" "where's"
## [529] "whereafter" "whereas" "whereby" "wherein"
## [533] "whereupon" "wherever" "whether" "which"
## [537] "while" "whither" "who" "who's"
## [541] "whoever" "whole" "whom" "whose"
## [545] "why" "will" "willing" "wish"
## [549] "with" "within" "without" "won't"
## [553] "wonder" "would" "would" "wouldn't"
## [557] "x" "y" "yes" "yet"
## [561] "you" "you'd" "you'll" "you're"
## [565] "you've" "your" "yours" "yourself"
## [569] "yourselves" "z" "zero"
We can use the intersect() function to find words that are common to both stop lists.
intersect(x = stopwords(kind = "en"),
y = stopwords(kind = "SMART"))
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "it's" "we're" "they're" "i've" "you've"
## [66] "we've" "they've" "i'd" "you'd" "we'd"
## [71] "they'd" "i'll" "you'll" "we'll" "they'll"
## [76] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [81] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [86] "won't" "wouldn't" "shouldn't" "can't" "cannot"
## [91] "couldn't" "let's" "that's" "who's" "what's"
## [96] "here's" "there's" "where's" "a" "an"
## [101] "the" "and" "but" "if" "or"
## [106] "because" "as" "until" "while" "of"
## [111] "at" "by" "for" "with" "about"
## [116] "against" "between" "into" "through" "during"
## [121] "before" "after" "above" "below" "to"
## [126] "from" "up" "down" "in" "out"
## [131] "on" "off" "over" "under" "again"
## [136] "further" "then" "once" "here" "there"
## [141] "when" "where" "why" "how" "all"
## [146] "any" "both" "each" "few" "more"
## [151] "most" "other" "some" "such" "no"
## [156] "nor" "not" "only" "own" "same"
## [161] "so" "than" "too" "very"
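The complement of this comparison can be obtained with setdiff(). Going by the output printed above (174 words in the “en” list, 164 of them shared), this sketch should return the ten contractions, such as "she's" and "shan't", that appear only in the “en” list. The object names en, smart and only_en are illustrative.

```r
library(tm)

en    <- stopwords(kind = "en")
smart <- stopwords(kind = "SMART")

# Words in the "en" list that the "SMART" list does not contain
only_en <- setdiff(x = en, y = smart)
only_en

# Size of each list (duplicates removed) and of the overlap
c(en    = length(unique(en)),
  smart = length(unique(smart)),
  both  = length(intersect(en, smart)))
```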
We will use the more restrictive, “SMART” list. We use the removeWords() function from the tm package to remove the “SMART” stop words. The removeWords() function can also be used to remove custom stop words.
corpus <- tm_map(x = corpus, # apply to all documents
FUN = function(x) removeWords(x, # use removeWords() function to
stopwords("SMART"))) # remove SMART stopwords
We can view the effect on Document 41:
inspect(corpus[[41]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 122
##
## idea lame - minor character mediocre pg- film, make complete -sequel changing tone pg-rated family movie.
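Custom stop words can be supplied by appending them to the standard list. The sketch below assumes two hypothetical domain-specific stop words (custom_stops) and a made-up sentence. It also shows that removed words leave their surrounding spaces behind, which is why the character count above is larger than the visible text suggests; stripWhitespace() from tm collapses those runs.

```r
library(tm)

# Hypothetical custom stop words for a movie-review corpus
custom_stops <- c("movie", "film")

x <- "the movie was a total waste"
no_stops <- removeWords(x, c(stopwords("SMART"), custom_stops))
no_stops

# Collapse runs of whitespace into one space, then trim the ends
trimws(stripWhitespace(no_stops))
```

Whether words such as "movie" and "film" should be removed depends on the analysis; here they are only placeholders.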
Again, we can visualize our corpus using the wordcloud() function in the wordcloud package.
set.seed(1)
wordcloud(corpus, # corpus object
random.order = FALSE, # most frequent in center
colors = brewer.pal(8, "Dark2"), # color schema
max.words = 150) # top 150 terms
We use the removePunctuation() function to remove punctuation. We use the function’s arguments to preserve intra-word dashes, but not the apostrophes in contractions.
corpus <- tm_map(x = corpus, # apply to all documents
FUN = removePunctuation, # removePunctuation() function
preserve_intra_word_contractions = FALSE, # strip apostrophes in contractions
preserve_intra_word_dashes = TRUE) # keep intra-word dashes
We can view the effect on Document 41:
inspect(corpus[[41]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 117
##
## idea lame minor character mediocre pg film make complete sequel changing tone pg-rated family movie
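The two arguments can be compared on a single made-up string. Only dashes and apostrophes that sit between two word characters are eligible to be preserved, which is why the dangling dashes in "pg-" and "-sequel" were removed above while "pg-rated" kept its dash.

```r
library(tm)

x <- "pg-rated, non-sequel; don't!"

# Settings used in this lesson: keep intra-word dashes, drop apostrophes
removePunctuation(x,
                  preserve_intra_word_contractions = FALSE,
                  preserve_intra_word_dashes = TRUE)

# For comparison: also keep the apostrophe inside contractions
removePunctuation(x,
                  preserve_intra_word_contractions = TRUE,
                  preserve_intra_word_dashes = TRUE)
```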
We stem the terms using the stemDocument() function in the tm package. We save the stemmed corpus as a new corpus, named corpus_stem.
corpus_stem <- tm_map(x = corpus, # apply to all documents
FUN = stemDocument, # stemDocument() function
language = "english") # English language stems
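Because the stemmed text was saved to corpus_stem, the original corpus is unchanged, so inspecting Document 41 in corpus still shows the unstemmed text. To view the stemmed version, inspect corpus_stem[[41]] instead.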
inspect(corpus[[41]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 117
##
## idea lame minor character mediocre pg film make complete sequel changing tone pg-rated family movie
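To see what stemming does to individual words, stemDocument() can be applied to a character vector. The four words below are taken from Document 41; the vector name words_in is illustrative.

```r
library(tm)   # stemDocument() relies on the SnowballC package being installed

words_in <- c("character", "family", "movie", "complete")

# Each word is reduced to its stem, which need not be a dictionary word
stemDocument(words_in, language = "english")
```

The resulting stems ("charact", "famili", "movi", "complet") are the forms that appear as terms in the DTM later on.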
Again, we can visualize our corpus using the wordcloud() function in the wordcloud package.
set.seed(1)
wordcloud(corpus_stem, # stemmed corpus object
random.order = FALSE, # most frequent in center
colors = brewer.pal(8, "Dark2"), # color schema
max.words = 150) # top 150 terms
To create a Document-Term Matrix (DTM) we use the DocumentTermMatrix() function from the tm package. We will use the stemmed corpus. To create a Term-Document Matrix (TDM), the TermDocumentMatrix() function can be used.
dtm <- DocumentTermMatrix(corpus_stem)
We can view high-level information about our dtm object, including the number of documents and terms, sparsity, maximal term length and term weighting (which by default is term frequency), by printing the object (running its name as a line of code). To also view a preview of the DTM, we can use the inspect() function.
inspect(dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 2194)>>
## Non-/sparse entries: 5664/2188336
## Sparsity : 100%
## Maximal term length: 25
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs act bad charact film good great love movi time watch
## 244 0 0 1 1 0 1 0 0 0 0
## 376 1 0 0 0 2 0 0 1 0 0
## 391 0 0 1 0 0 0 0 0 0 0
## 422 0 0 0 0 0 0 0 2 0 0
## 429 1 2 0 0 0 0 0 1 0 0
## 470 0 0 0 0 0 0 0 1 0 0
## 477 0 0 0 0 0 0 0 0 0 0
## 621 0 0 0 0 0 0 1 0 0 0
## 622 0 0 1 0 0 0 0 0 1 0
## 805 0 0 0 1 0 0 0 0 0 0
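The structure of a DTM is easier to see on a tiny example converted to a dense matrix. The two-document corpus below is made up, and the object names are illustrative. Note that, by default, DocumentTermMatrix() keeps only terms of at least three characters (the wordLengths control option).

```r
library(tm)

# Hypothetical two-document corpus of already-stemmed text
toy_corpus <- Corpus(VectorSource(c("good movi", "bad movi bad")))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Dense view: one row per document, one column per term, cells are counts
toy_m <- as.matrix(toy_dtm)
toy_m
```

Converting to a dense matrix is fine at this size, but for large corpora the sparse DTM object is far more memory-efficient.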
Our next step is to reduce the dimensionality of our DTM. This can be achieved either by setting a minimum document frequency threshold or setting a threshold for the allowable amount of sparsity.
We can use the bounds argument in the control list of the DocumentTermMatrix() function to set a lower bound on the number of documents a term must appear in to be included as a term. Below, we create a new DTM, dtm_m5d, which sets a minimum document frequency of 5. We can then view high-level information about the DTM by printing the object (dtm_m5d).
dtm_m5d <- DocumentTermMatrix(x = corpus_stem,
control = list(bounds = list(global = c(5, Inf))))
dtm_m5d
## <<DocumentTermMatrix (documents: 1000, terms: 226)>>
## Non-/sparse entries: 2760/223240
## Sparsity : 99%
## Maximal term length: 14
## Weighting : term frequency (tf)
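Document frequency is the number of documents in which a term occurs at least once, regardless of how often it occurs within each. The base R sketch below illustrates the idea on a made-up count matrix; it is not how tm computes the bound internally, only an equivalent calculation on a dense matrix.

```r
# Hypothetical count matrix: 4 documents (rows) x 3 terms (columns)
m <- matrix(c(3, 0, 1,
              0, 0, 1,
              2, 0, 1,
              0, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(Docs = as.character(1:4),
                            Terms = c("bad", "rare", "film")))

# Document frequency: number of documents in which each term occurs
doc_freq <- colSums(m > 0)
doc_freq

# Keep terms that occur in at least 2 documents
# (analogous to bounds = list(global = c(2, Inf)))
m_bounded <- m[, doc_freq >= 2, drop = FALSE]
m_bounded
```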
The removeSparseTerms() function can be applied to a DTM or TDM and removes all terms whose proportion of sparse (zero) entries is at least the specified threshold. We can use it to remove infrequently occurring terms, then view high-level information about the resulting DTM by printing the object (dtm_sr).
dtm_sr <- removeSparseTerms(dtm, .999)
dtm_sr
## <<DocumentTermMatrix (documents: 1000, terms: 818)>>
## Non-/sparse entries: 4288/813712
## Sparsity : 99%
## Maximal term length: 14
## Weighting : term frequency (tf)
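The threshold can be translated into a minimum document frequency. The sketch below assumes, following the description in the tm documentation, that a term is retained only when its sparse factor is below the threshold; the document frequencies are made up for illustration.

```r
n_docs   <- 1000              # documents in the DTM
doc_freq <- c(1, 2, 5, 50)    # hypothetical document frequencies

# Share of documents in which each term does NOT occur
term_sparsity <- 1 - doc_freq / n_docs

# Terms are retained only when their sparsity is below the threshold
kept <- term_sparsity < 0.999
data.frame(doc_freq, term_sparsity, kept)
```

Under that assumption, a threshold of .999 with 1000 documents removes only the terms that occur in a single document, which is a looser filter than the minimum document frequency of 5 used for dtm_m5d.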
We can view the (stemmed) terms in our minimum document-frequency bounded DTM using the Terms() function.
Terms(dtm_m5d)
## [1] "man" "movi" "audienc" "charact"
## [5] "half" "act" "attempt" "black"
## [9] "camera" "clever" "disappoint" "line"
## [13] "plot" "poor" "ridicul" "white"
## [17] "music" "find" "scene" "song"
## [21] "art" "guess" "lack" "work"
## [25] "hour" "wast" "good" "kid"
## [29] "thought" "bit" "predict" "cast"
## [33] "love" "lot" "made" "show"
## [37] "hilari" "cool" "deliv" "face"
## [41] "budget" "film" "long" "review"
## [45] "singl" "cinematographi" "direct" "edit"
## [49] "put" "perfect" "cinema" "histori"
## [53] "minut" "level" "word" "imagin"
## [57] "simpli" "amount" "beauti" "creat"
## [61] "pictur" "piec" "short" "game"
## [65] "part" "seri" "deserv" "strong"
## [69] "money" "kind" "time" "crap"
## [73] "fun" "enjoy" "play" "flick"
## [77] "complet" "famili" "lame" "make"
## [81] "interest" "entir" "give" "moment"
## [85] "funni" "talent" "peopl" "star"
## [89] "stori" "effect" "real" "worst"
## [93] "cost" "lead" "screen" "written"
## [97] "girl" "life" "recommend" "excel"
## [101] "perform" "believ" "total" "convinc"
## [105] "utter" "portray" "actor" "tom"
## [109] "annoy" "feel" "absolut" "dialogu"
## [113] "bad" "found" "general" "great"
## [117] "thing" "worth" "suspens" "write"
## [121] "amaz" "live" "big" "shot"
## [125] "year" "pace" "gave" "classic"
## [129] "pretti" "turn" "problem" "script"
## [133] "touch" "end" "watch" "back"
## [137] "joy" "bore" "happen" "horror"
## [141] "stupid" "director" "night" "nice"
## [145] "brilliant" "rent" "world" "fact"
## [149] "leav" "understand" "move" "rate"
## [153] "experi" "flaw" "high" "relat"
## [157] "incred" "terribl" "wors" "horribl"
## [161] "suck" "cartoon" "emot" "set"
## [165] "tortur" "reason" "sound" "job"
## [169] "john" "hitchcock" "thriller" "full"
## [173] "danc" "hole" "recent" "pathet"
## [177] "talk" "action" "care" "master"
## [181] "fail" "drama" "visual" "actress"
## [185] "call" "cheap" "spoiler" "fan"
## [189] "solid" "surpris" "felt" "child"
## [193] "eye" "continu" "expect" "day"
## [197] "place" "start" "final" "subtl"
## [201] "mention" "wonder" "intellig" "human"
## [205] "entertain" "memor" "special" "scare"
## [209] "role" "top" "product" "impress"
## [213] "garbag" "involv" "style" "produc"
## [217] "open" "comedi" "superb" "fine"
## [221] "mess" "documentari" "origin" "avoid"
## [225] "begin" "fast"
We can use the findFreqTerms() function in the tm package to find the terms that appear at least n times.
Terms that occur at least 25 times:
findFreqTerms(x = dtm_m5d, lowfreq = 25)
## [1] "movi" "charact" "act" "plot" "scene" "work" "good"
## [8] "cast" "love" "made" "film" "time" "play" "make"
## [15] "stori" "actor" "bad" "great" "thing" "script" "watch"
Terms that occur at least 50 times:
findFreqTerms(x = dtm_m5d, lowfreq = 50)
## [1] "movi" "charact" "good" "film" "bad"
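Note that findFreqTerms() works with the total number of occurrences across the collection, not the number of documents a term appears in. In the made-up corpus below, "bad" occurs in only one document but three times in total, so it passes a lowfreq of 2.

```r
library(tm)

# Hypothetical two-document corpus
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(c("bad bad bad film",
                                                    "good film"))))

# Total occurrences per term across all documents
colSums(as.matrix(toy_dtm))

# Terms with a collection-wide count of at least 2
freq_terms <- findFreqTerms(x = toy_dtm, lowfreq = 2)
freq_terms
```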
Since raw term frequency alone does not indicate how important a term is, we need to apply weighting to our DTM. Before applying weighting, we should identify and remove any empty documents (documents that do not contain any of the terms in our DTM following preprocessing).
nTerms(dtm_m5d)
## [1] 226
We can use the apply() function to identify empty documents. First, we can obtain the sum for each of the rows (documents) in our DTM.
rowsums <- apply(X = dtm_m5d, # DTM to apply the function to
MARGIN = 1, # apply to the rows
FUN = sum) # apply the sum() function
From there, we can subset our DTM to only retain those documents (rows/observations) that have a sum greater than 0. We save this as a new DTM object named dtm_red.
dtm_red <- dtm_m5d[rowsums > 0,]
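As an alternative to apply(), which first converts the DTM to a dense matrix, the row sums can be computed on the sparse representation with row_sums() from the slam package (installed alongside tm as one of its dependencies). The sketch below uses a made-up corpus in which one document ends up with no terms; the object names are illustrative.

```r
library(tm)
library(slam)   # sparse matrix utilities

# Hypothetical corpus: document 2 contains only two-letter words, which
# DocumentTermMatrix() drops by default, leaving that row empty
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(c("good film",
                                                    "an it",
                                                    "bad plot"))))

# Row sums computed without densifying the matrix
rs <- row_sums(toy_dtm)
rs

# Retain only the non-empty documents
toy_red <- toy_dtm[rs > 0, ]
nDocs(toy_red)
```

For a DTM of this lesson's size either approach is fine; the sparse version mainly matters for much larger collections.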
Now, we can apply normalized TF-IDF weighting to our dtm_red DTM object using the weightTfIdf() function. We can then view high-level information about the weighted DTM by printing the object (dtm_red_tfidf).
dtm_red_tfidf <- weightTfIdf(dtm_red)
dtm_red_tfidf
## <<DocumentTermMatrix (documents: 933, terms: 226)>>
## Non-/sparse entries: 2760/208098
## Sparsity : 99%
## Maximal term length: 14
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
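The weighting can be reproduced by hand on a small count matrix. This base R sketch assumes the normalized scheme in which each count is divided by the document's total term count and multiplied by log2(N / df), where N is the number of documents and df is the term's document frequency. The matrix and object names are made up for illustration.

```r
# Hypothetical count matrix: 2 documents x 3 terms
m <- matrix(c(1, 0, 1,
              0, 2, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("good", "bad", "film")))

tf  <- m / rowSums(m)                    # term share within each document
idf <- log2(nrow(m) / colSums(m > 0))    # rarer terms get larger weights

# Multiply every column of tf by that term's idf
tfidf <- sweep(tf, MARGIN = 2, STATS = idf, FUN = "*")
round(tfidf, 3)
```

The term "film" occurs in every document, so its weight is zero. The division by rowSums(m) also shows why empty documents are removed first: a document with no terms would require dividing by zero.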
We can use the document IDs (Docs) to create a subset of our original dataframe, imdb, which includes only the non-empty observations (to match our DTM’s document dimension).
imdb_sub <- imdb[imdb$doc_id %in% Docs(dtm_red_tfidf),]
Finally, we can combine our dimension-reduced, TF-IDF-weighted DTM and our dataframe, imdb_sub, so that we have the predictor variables (terms) and the target (positive_flag) in the same dataframe for continued analysis (classification).
imdb_df <- data.frame(as.matrix(dtm_red_tfidf),
positive_flag = factor(imdb_sub$positive_flag))
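One detail to be aware of when converting a DTM to a dataframe is that data.frame() adjusts column names that are not syntactically valid in R (its check.names argument is TRUE by default). The sketch below uses three made-up term names to show the renaming; none of this changes the values, only the labels.

```r
# Hypothetical term names as they might appear in a DTM
terms <- c("film", "pg-rated", "next")

# make.names() is what data.frame() applies when check.names = TRUE
make.names(terms)

# Toy 1 x 3 matrix to confirm the renaming
toy <- matrix(1:3, nrow = 1, dimnames = list(NULL, terms))
colnames(data.frame(toy))                        # adjusted names
colnames(data.frame(toy, check.names = FALSE))   # original names kept
```

If the original term labels are needed later, check.names = FALSE preserves them, at the cost of column names that must be quoted with backticks in formulas.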
We will export the prepared data as a CSV file for further use in classification analysis.
write.csv(x = imdb_df,
file = "imdb_df.csv",
row.names = FALSE)