The goal of this document is to show how to perform different analyses of text documents at a word level, mainly using the tm (text mining) package in R.
I cannot claim full authorship of this document, since I have taken code snippets from, and have been inspired by, multiple books and documents on the Web. Thanks to everyone for sharing.
Check the working directory with getwd. If it is not the one where your data are located, change it with setwd.
getwd()
## [1] "/Users/raul/ownCloud/Trabajo/Docencia/2015 Intelligent Systems/R"
setwd("~/ownCloud/Trabajo/Docencia/2015 Intelligent Systems/R")
Now we load the required libraries.
library(tm)
library(ggplot2)
library(wordcloud)
library(RWeka)
library(reshape2)
We are going to use the Movie review data version 2.0, created by Bo Pang and Lillian Lee.
Once unzipped, the data splits the different documents into positive and negative opinions. In this script we are going to use the positive opinions located in ./txt_sentoken/pos.
source.pos = DirSource("../Corpus/review_polarity/txt_sentoken/pos", encoding = "UTF-8")
corpus = Corpus(source.pos)
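For readers who want to experiment before downloading the dataset, the following is a minimal sketch that builds a small in-memory corpus. The three review texts and the name toy.corpus are made up for illustration, and VCorpus with VectorSource is used here as a convenient assumption; it is not how the corpus in this document is loaded.

```r
library(tm)

# Three made-up reviews, used only for illustration
reviews = c("the film is good , really good .",
            "a dark and bleak film , but the acting is solid .",
            "great supporting roles and a big surprise .")

# VectorSource treats each element of the vector as one document
toy.corpus = VCorpus(VectorSource(reviews))

length(toy.corpus)          # number of documents in the toy corpus
content(toy.corpus[[2]])    # text of the second document
```

The same inspection functions used in the rest of this document (length, inspect, meta) can be tried on toy.corpus.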
Let’s see how many entries there are in our corpus just by checking its length.
length(corpus)
## [1] 1000
Taking a look at the first three entries, we can see that they are not simple character strings. If we show the first entry, we can see that it contains the document and some metadata.
summary(corpus[1:3])
## Length Class Mode
## cv000_29590.txt 2 PlainTextDocument list
## cv001_18431.txt 2 PlainTextDocument list
## cv002_15918.txt 2 PlainTextDocument list
inspect(corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## cv000_29590.txt
## films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
\nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
\neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content
Let’s take a look at the document in the first entry.
inspect(corpus[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 4226
##
## films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .
## for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
## to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .
## the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .
## in other words , don't dismiss this film because of its source .
## if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
## getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?
## the ghetto in question is , of course , whitechapel in 1888 london's east end .
## it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .
## when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case .
## abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium .
## upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .
## i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .
## in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end .
## it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .
## and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) .
## don't worry - it'll all make sense when you see it .
## now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .
## the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic .
## oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place .
## even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent .
## ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham .
## i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad .
## the film , however , is all good .
## 2 : 00 - r for strong violence/gore , sexuality , language and drug content
And at its metadata. Note that we can also access metadata items individually.
meta(corpus[[1]])
## author : character(0)
## datetimestamp: 2017-10-11 16:30:41
## description : character(0)
## heading : character(0)
## id : cv000_29590.txt
## language : en
## origin : character(0)
meta(corpus[[1]])$id
## [1] "cv000_29590.txt"
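As a small sketch of individual metadata access, the meta function also accepts a tag name, and the same form can be used to set a value. The toy corpus and the author value below are hypothetical and only serve as an example.

```r
library(tm)

# Toy corpus with made-up texts, only for illustration
toy = VCorpus(VectorSource(c("first made-up review", "second made-up review")))
doc = toy[[1]]

meta(doc, "id")                      # read a single metadata item by its tag
meta(doc, "author") = "anonymous"    # set a metadata item (hypothetical value)
meta(doc, "author")                  # read it back
```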
To create a term document matrix (TDM), we just invoke the TermDocumentMatrix function.
tdm = TermDocumentMatrix(corpus)
Let’s take a look at the summary of the TDM. The summary informs us about the high sparsity of the TDM (i.e., most of the entries in the matrix are zeroes).
tdm
## <<TermDocumentMatrix (terms: 29924, documents: 1000)>>
## Non-/sparse entries: 325821/29598179
## Sparsity : 99%
## Maximal term length: 19
## Weighting : term frequency (tf)
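The sparsity figure can be checked by hand from the numbers in the summary. The sketch below only redoes that arithmetic; the variable names are ours.

```r
# Figures copied from the TDM summary
non.sparse = 325821      # cells with a non-zero count
sparse     = 29598179    # cells equal to zero
n.terms    = 29924
n.docs     = 1000

# The two counts add up to the total number of cells (terms x documents)
non.sparse + sparse == n.terms * n.docs

# Sparsity is the share of cells that are zero
sparsity = sparse / (n.terms * n.docs)
round(100 * sparsity)
```

In other words, only about 1% of the term-document pairs actually occur in the corpus.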
Let’s take a look at a subset of the TDM for four documents and four terms. There we can see an example of the sparsity of the matrix.
inspect(tdm[2000:2003,100:103])
## <<TermDocumentMatrix (terms: 4, documents: 4)>>
## Non-/sparse entries: 1/15
## Sparsity : 94%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms cv099_10534.txt cv100_11528.txt cv101_10175.txt
## eventually 0 0 0
## exciting 0 0 0
## fei 0 0 0
## fighting 0 0 0
## Docs
## Terms cv102_7846.txt
## eventually 0
## exciting 1
## fei 0
## fighting 0
How many terms have been identified in the TDM? We can see it using the length function.
length(dimnames(tdm)$Terms)
## [1] 29924
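As an alternative sketch, tm also provides the helpers nTerms, nDocs and Terms, and dim works on a TDM as well. The toy corpus below is made up so that the example runs on its own.

```r
library(tm)

# Toy corpus with made-up texts, only for illustration
toy = VCorpus(VectorSource(c("good film", "bleak film")))
toy.tdm = TermDocumentMatrix(toy)

nTerms(toy.tdm)    # number of terms (rows)
nDocs(toy.tdm)     # number of documents (columns)
dim(toy.tdm)       # both at once: terms, documents
Terms(toy.tdm)     # the terms themselves
```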
How frequently do those terms appear? Let’s sum each row of the TDM (one row per term) and look at the frequencies of the first and last ten terms.
freq=rowSums(as.matrix(tdm))
head(freq,10)
## 102 1888 500 80s abberline ably about
## 2 2 10 27 2 9 1721
## absinthe accent acting
## 1 37 322
tail(freq,10)
## obscuring obstructions overflying paneled powaqqatsi
## 1 1 1 1 1
## snoots tangerine timbre vainly westworld
## 1 1 1 1 1
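Note that as.matrix converts the sparse TDM into a dense matrix, which can use a lot of memory for larger corpora. A possible alternative, sketched below on a made-up toy corpus, is row_sums from the slam package (a package tm depends on), which works on the sparse representation directly. This is a suggestion on our side, not what the rest of this document uses.

```r
library(tm)
library(slam)

# Toy corpus with made-up texts, only for illustration
toy = VCorpus(VectorSource(c("good film good acting",
                             "bleak film",
                             "good surprise")))
toy.tdm = TermDocumentMatrix(toy)

# row_sums works on the sparse representation, without a dense copy
freq.sparse = row_sums(toy.tdm)

# Same values as the dense approach used in this document
freq.dense = rowSums(as.matrix(toy.tdm))
all(freq.sparse == freq.dense)
```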
If we plot the frequencies in decreasing order, we can see that the corpus roughly follows Zipf’s law: a term’s frequency is approximately inversely proportional to its rank, so a few terms are very frequent while most appear only a handful of times.
plot(sort(freq, decreasing = TRUE), col = "blue", main = "Word frequencies", xlab = "Frequency-based rank", ylab = "Frequency")
We can also list the ten most frequent terms and check that 11240 of the 29924 terms appear only once in our corpus.
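Zipf’s law is easier to judge on log-log axes, where a frequency proportional to 1/rank shows up as a straight line with slope close to -1. The sketch below uses hypothetical frequencies (toy.freq) that follow the law exactly; for the real corpus one would presumably use sort(freq, decreasing = TRUE) instead.

```r
# Hypothetical frequencies that follow Zipf's law exactly: f(r) = C / r
rank = 1:1000
toy.freq = 40000 / rank

# On log-log axes a Zipfian distribution appears as a straight line
plot(rank, toy.freq, log = "xy", col = "blue",
     main = "Word frequencies (log-log)",
     xlab = "Frequency-based rank", ylab = "Frequency")

# The slope of the fitted line estimates the Zipf exponent
fit = lm(log(toy.freq) ~ log(rank))
coef(fit)[["log(rank)"]]
```

With real data the fitted slope will only be approximately -1, and the fit is usually worst at the very top and the very bottom of the ranking.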
# Ten most frequent terms
tail(sort(freq),n=10)
## are but this film for his with that and the
## 3714 4492 4648 5232 5260 5588 5851 8121 19897 41498
# Number of terms only appearing once
sum(freq == 1)
## [1] 11240
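tm also offers findFreqTerms, which returns the terms whose total frequency lies within given bounds. The sketch below, on a made-up toy corpus, shows how it could be used both for frequent terms and for terms that appear only once.

```r
library(tm)

# Toy corpus with made-up texts, only for illustration
toy = VCorpus(VectorSource(c("good film good acting",
                             "bleak film",
                             "good surprise")))
toy.tdm = TermDocumentMatrix(toy)

# Terms whose total frequency is at least 2
frequent = findFreqTerms(toy.tdm, lowfreq = 2)
frequent

# Terms that appear exactly once in the whole corpus
once = findFreqTerms(toy.tdm, lowfreq = 1, highfreq = 1)
once
```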
We can see the different transformations that can be applied to a document by invoking the getTransformations function.
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
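The list does not include every transformation we might want, for instance conversion to lower case. As a sketch, any function that operates on character strings can be wrapped with content_transformer and passed to tm_map. The text below is made up; the movie review data is already in lower case, so this step is not needed for our corpus.

```r
library(tm)

# Toy corpus with a made-up, mixed-case text
toy = VCorpus(VectorSource("The Film , However , Is ALL Good ."))

# content_transformer wraps an ordinary string function so that
# tm_map can apply it to the content of each document
toy = tm_map(toy, content_transformer(tolower))
content(toy[[1]])
```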
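Let’s take the first document in the corpus and apply some of these transformations. Which ones we apply in practice depends on our use case.
Let’s first take a look at the content of the document before any transformation. Each document is stored as a single character string, so indexing the content with [1] returns the whole text.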
doc=corpus[1]
doc[[1]]$content[1]
## [1] "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or \" graphic novel , \" if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called \" unfortunates \" ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
\nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( \" who holds back the electric car/who made steve guttenberg a star ? \" ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
\neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content "
First, we remove stop words. We can list the default (English) stop words with the stopwords function.
stopwords()
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
doc = tm_map(doc,removeWords,stopwords())
doc[[1]]$content[1]
## [1] "films adapted comic books plenty success , whether superheroes ( batman , superman , spawn ) , geared toward kids ( casper ) arthouse crowd ( ghost world ) , never really comic book like hell . \n starters , created alan moore ( eddie campbell ) , brought medium whole new level mid '80s 12-part series called watchmen . \n say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd . \n book ( \" graphic novel , \" will ) 500 pages long includes nearly 30 consist nothing footnotes . \n words , dismiss film source . \n can get past whole comic book thing , might find another stumbling block hell's directors , albert allen hughes . \ngetting hughes brothers direct seems almost ludicrous casting carrot top , well , anything , riddle : better direct film set ghetto features really violent street crime mad geniuses behind menace ii society ? \n ghetto question , course , whitechapel 1888 london's east end . \n filthy , sooty place whores ( called \" unfortunates \" ) starting get little nervous mysterious psychopath carving profession surgical precision . \n first stiff turns , copper peter godley ( robbie coltrane , world enough ) calls inspector frederick abberline ( johnny depp , blow ) crack case . \nabberline , widower , prophetic dreams unsuccessfully tries quell copious amounts absinthe opium . \nupon arriving whitechapel , befriends unfortunate named mary kelly ( heather graham , say ) proceeds investigate horribly gruesome crimes even police surgeon stomach . \n think anyone needs briefed jack ripper , go particulars , say moore campbell unique interesting theory identity killer reasons chooses slay . \n comic , bother cloaking identity ripper , screenwriters terry hayes ( vertical limit ) rafael yglesias ( les mis ? rables ) good job keeping hidden viewers end . \n funny watch locals blindly point finger blame jews indians , , englishman never capable committing ghastly acts . 
\n hell's ending whistling stonecutters song simpsons days ( \" holds back electric car/ made steve guttenberg star ? \" ) . \n worry - 'll make sense see . \nnow onto hell's appearance : certainly dark bleak enough , surprising see much looks like tim burton film planet apes ( times , seems like sleepy hollow 2 ) . \n print saw completely finished ( color music finalized , comments marilyn manson ) , cinematographer peter deming ( say word ) ably captures dreariness victorian-era london helped make flashy killing scenes remind crazy flashbacks twin peaks , even though violence film pales comparison black--white comic . \noscar winner martin childs' ( shakespeare love ) production design turns original prague surroundings one creepy place . \neven acting hell solid , dreamy depp turning typically strong performance deftly handling british accent . \nians holm ( joe gould's secret ) richardson ( 102 dalmatians ) log great supporting roles , big surprise graham . \n cringed first time opened mouth , imagining attempt irish accent , actually half bad . \n film , however , good . \n2 : 00 - r strong violence/gore , sexuality , language drug content "
Then, we remove punctuation symbols.
doc = tm_map(doc,removePunctuation)
doc[[1]]$content[1]
## [1] "films adapted comic books plenty success whether superheroes batman superman spawn geared toward kids casper arthouse crowd ghost world never really comic book like hell \n starters created alan moore eddie campbell brought medium whole new level mid 80s 12part series called watchmen \n say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd \n book graphic novel will 500 pages long includes nearly 30 consist nothing footnotes \n words dismiss film source \n can get past whole comic book thing might find another stumbling block hells directors albert allen hughes \ngetting hughes brothers direct seems almost ludicrous casting carrot top well anything riddle better direct film set ghetto features really violent street crime mad geniuses behind menace ii society \n ghetto question course whitechapel 1888 londons east end \n filthy sooty place whores called unfortunates starting get little nervous mysterious psychopath carving profession surgical precision \n first stiff turns copper peter godley robbie coltrane world enough calls inspector frederick abberline johnny depp blow crack case \nabberline widower prophetic dreams unsuccessfully tries quell copious amounts absinthe opium \nupon arriving whitechapel befriends unfortunate named mary kelly heather graham say proceeds investigate horribly gruesome crimes even police surgeon stomach \n think anyone needs briefed jack ripper go particulars say moore campbell unique interesting theory identity killer reasons chooses slay \n comic bother cloaking identity ripper screenwriters terry hayes vertical limit rafael yglesias les mis rables good job keeping hidden viewers end \n funny watch locals blindly point finger blame jews indians englishman never capable committing ghastly acts \n hells ending whistling stonecutters song simpsons days holds back electric car made steve guttenberg star \n worry ll make sense see \nnow onto hells appearance certainly 
dark bleak enough surprising see much looks like tim burton film planet apes times seems like sleepy hollow 2 \n print saw completely finished color music finalized comments marilyn manson cinematographer peter deming say word ably captures dreariness victorianera london helped make flashy killing scenes remind crazy flashbacks twin peaks even though violence film pales comparison blackwhite comic \noscar winner martin childs shakespeare love production design turns original prague surroundings one creepy place \neven acting hell solid dreamy depp turning typically strong performance deftly handling british accent \nians holm joe goulds secret richardson 102 dalmatians log great supporting roles big surprise graham \n cringed first time opened mouth imagining attempt irish accent actually half bad \n film however good \n2 00 r strong violencegore sexuality language drug content "
Then, we remove numbers.
doc = tm_map(doc,removeNumbers)
doc[[1]]$content[1]
## [1] "films adapted comic books plenty success whether superheroes batman superman spawn geared toward kids casper arthouse crowd ghost world never really comic book like hell \n starters created alan moore eddie campbell brought medium whole new level mid s part series called watchmen \n say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd \n book graphic novel will pages long includes nearly consist nothing footnotes \n words dismiss film source \n can get past whole comic book thing might find another stumbling block hells directors albert allen hughes \ngetting hughes brothers direct seems almost ludicrous casting carrot top well anything riddle better direct film set ghetto features really violent street crime mad geniuses behind menace ii society \n ghetto question course whitechapel londons east end \n filthy sooty place whores called unfortunates starting get little nervous mysterious psychopath carving profession surgical precision \n first stiff turns copper peter godley robbie coltrane world enough calls inspector frederick abberline johnny depp blow crack case \nabberline widower prophetic dreams unsuccessfully tries quell copious amounts absinthe opium \nupon arriving whitechapel befriends unfortunate named mary kelly heather graham say proceeds investigate horribly gruesome crimes even police surgeon stomach \n think anyone needs briefed jack ripper go particulars say moore campbell unique interesting theory identity killer reasons chooses slay \n comic bother cloaking identity ripper screenwriters terry hayes vertical limit rafael yglesias les mis rables good job keeping hidden viewers end \n funny watch locals blindly point finger blame jews indians englishman never capable committing ghastly acts \n hells ending whistling stonecutters song simpsons days holds back electric car made steve guttenberg star \n worry ll make sense see \nnow onto hells appearance certainly dark bleak enough 
surprising see much looks like tim burton film planet apes times seems like sleepy hollow \n print saw completely finished color music finalized comments marilyn manson cinematographer peter deming say word ably captures dreariness victorianera london helped make flashy killing scenes remind crazy flashbacks twin peaks even though violence film pales comparison blackwhite comic \noscar winner martin childs shakespeare love production design turns original prague surroundings one creepy place \neven acting hell solid dreamy depp turning typically strong performance deftly handling british accent \nians holm joe goulds secret richardson dalmatians log great supporting roles big surprise graham \n cringed first time opened mouth imagining attempt irish accent actually half bad \n film however good \n r strong violencegore sexuality language drug content "
Then, we remove extra whitespace.
doc = tm_map(doc,stripWhitespace)
doc[[1]]$content[1]
## [1] "films adapted comic books plenty success whether superheroes batman superman spawn geared toward kids casper arthouse crowd ghost world never really comic book like hell starters created alan moore eddie campbell brought medium whole new level mid s part series called watchmen say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd book graphic novel will pages long includes nearly consist nothing footnotes words dismiss film source can get past whole comic book thing might find another stumbling block hells directors albert allen hughes getting hughes brothers direct seems almost ludicrous casting carrot top well anything riddle better direct film set ghetto features really violent street crime mad geniuses behind menace ii society ghetto question course whitechapel londons east end filthy sooty place whores called unfortunates starting get little nervous mysterious psychopath carving profession surgical precision first stiff turns copper peter godley robbie coltrane world enough calls inspector frederick abberline johnny depp blow crack case abberline widower prophetic dreams unsuccessfully tries quell copious amounts absinthe opium upon arriving whitechapel befriends unfortunate named mary kelly heather graham say proceeds investigate horribly gruesome crimes even police surgeon stomach think anyone needs briefed jack ripper go particulars say moore campbell unique interesting theory identity killer reasons chooses slay comic bother cloaking identity ripper screenwriters terry hayes vertical limit rafael yglesias les mis rables good job keeping hidden viewers end funny watch locals blindly point finger blame jews indians englishman never capable committing ghastly acts hells ending whistling stonecutters song simpsons days holds back electric car made steve guttenberg star worry ll make sense see now onto hells appearance certainly dark bleak enough surprising see much looks like tim burton film 
planet apes times seems like sleepy hollow print saw completely finished color music finalized comments marilyn manson cinematographer peter deming say word ably captures dreariness victorianera london helped make flashy killing scenes remind crazy flashbacks twin peaks even though violence film pales comparison blackwhite comic oscar winner martin childs shakespeare love production design turns original prague surroundings one creepy place even acting hell solid dreamy depp turning typically strong performance deftly handling british accent ians holm joe goulds secret richardson dalmatians log great supporting roles big surprise graham cringed first time opened mouth imagining attempt irish accent actually half bad film however good r strong violencegore sexuality language drug content "
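The four transformations applied so far also work on plain character vectors, which makes it easy to try them on a short string. The sketch below uses a made-up string s loosely based on the end of the review; it is only meant to show the effect of each step in isolation.

```r
library(tm)

# Made-up string, loosely based on the end of the review
s = "the film , however , is all good . 2 : 00 - r"

s = removeWords(s, stopwords("en"))   # drops "the", "is", "all"
s = removePunctuation(s)              # drops the commas, full stop, colon and dash
s = removeNumbers(s)                  # drops "2" and "00"
s = stripWhitespace(s)                # collapses runs of whitespace into one space
s

# Remaining words, ignoring any leading or trailing space
words = strsplit(trimws(s), " ")[[1]]
words
```

Note that stripWhitespace collapses repeated whitespace but does not trim the ends of the string, hence the trimws call before splitting.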
And, finally, we can stem the document.
doc = tm_map(doc,stemDocument)
doc[[1]]$content[1]
## [1] "film adapt comic book plenti success whether superhero batman superman spawn gear toward kid casper arthous crowd ghost world never realli comic book like hell starter creat alan moor eddi campbel brought medium whole new level mid s part seri call watchmen say moor campbel thorough research subject jack ripper like say michael jackson start look littl odd book graphic novel will page long includ near consist noth footnot word dismiss film sourc can get past whole comic book thing might find anoth stumbl block hell director albert allen hugh get hugh brother direct seem almost ludicr cast carrot top well anyth riddl better direct film set ghetto featur realli violent street crime mad genius behind menac ii societi ghetto question cours whitechapel london east end filthi sooti place whore call unfortun start get littl nervous mysteri psychopath carv profess surgic precis first stiff turn copper peter godley robbi coltran world enough call inspector frederick abberlin johnni depp blow crack case abberlin widow prophet dream unsuccess tri quell copious amount absinth opium upon arriv whitechapel befriend unfortun name mari kelli heather graham say proceed investig horribl gruesom crime even polic surgeon stomach think anyon need brief jack ripper go particular say moor campbel uniqu interest theori ident killer reason choos slay comic bother cloak ident ripper screenwrit terri hay vertic limit rafael yglesia les mis rabl good job keep hidden viewer end funni watch local blind point finger blame jew indian englishman never capabl commit ghast act hell end whistl stonecutt song simpson day hold back electr car made steve guttenberg star worri ll make sens see now onto hell appear certain dark bleak enough surpris see much look like tim burton film planet ape time seem like sleepi hollow print saw complet finish color music final comment marilyn manson cinematograph peter deme say word abli captur dreari victorianera london help make flashi kill scene remind 
crazi flashback twin peak even though violenc film pale comparison blackwhit comic oscar winner martin child shakespear love product design turn origin pragu surround one creepi place even act hell solid dreami depp turn typic strong perform deft handl british accent ian holm joe gould secret richardson dalmatian log great support role big surpris graham cring first time open mouth imagin attempt irish accent actual half bad film howev good r strong violencegor sexual languag drug content"
Let’s create another term document matrix but now after applying transformations to our document.
tdm = TermDocumentMatrix(corpus,
control=list(stopwords = T,
removePunctuation = T,
removeNumbers = T,
stemming = T))
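To see what the entries of the control list do, it can help to try the same kind of call on a tiny corpus. The following is a minimal sketch, assuming the tm package is installed; the two sentences and the names toy.docs, toy.corpus and toy.tdm are made up for illustration. Stemming is left out here because stemming = TRUE additionally needs the SnowballC package.

```r
library(tm)

# Two invented sentences standing in for movie reviews (illustrative only)
toy.docs = c("The films were great, 10 out of 10!",
             "A great movie about films.")
toy.corpus = VCorpus(VectorSource(toy.docs))

# Same kind of control list as above, minus stemming
toy.tdm = TermDocumentMatrix(toy.corpus,
                             control = list(stopwords = TRUE,
                                            removePunctuation = TRUE,
                                            removeNumbers = TRUE))

# Terms that survive the transformations, and the full (tiny) matrix
Terms(toy.tdm)
as.matrix(toy.tdm)
```

Stop words such as "the", the punctuation and the numbers should no longer show up among the terms.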
Let’s take a look at the summary of the new TDM.
tdm
## <<TermDocumentMatrix (terms: 19064, documents: 1000)>>
## Non-/sparse entries: 261750/18802250
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency (tf)
And let’s also take a look at a subset of the new TDM.
inspect(tdm[2030:2035,100:103])
## <<TermDocumentMatrix (terms: 6, documents: 4)>>
## Non-/sparse entries: 1/23
## Sparsity : 96%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms cv099_10534.txt cv100_11528.txt cv101_10175.txt cv102_7846.txt
## brizzi 0 0 0 0
## broach 0 0 0 0
## broad 0 0 0 0
## broadbent 0 0 0 0
## broadcast 0 1 0 0
## broaden 0 0 0 0
We can see how many terms have been identified in the TDM using the length function again.
length(dimnames(tdm)$Terms)
## [1] 19064
head(dimnames(tdm)$Terms,10)
## [1] "aaaahhh" "aah" "aamir" "aardman" "aaron" "abandon"
## [7] "abb" "abba" "abberlin" "abbi"
tail(dimnames(tdm)$Terms,10)
## [1] "zuehlk" "zuko" "zukovski" "zundel" "zurg" "zus"
## [7] "zweibel" "zwick" "zwigoff" "zyci"
How frequently do those terms appear? Let’s sum the content of all terms (i.e., rows) and see the frequency of the terms just shown.
freq=rowSums(as.matrix(tdm))
head(freq,10)
## aaaahhh aah aamir aardman aaron abandon abb abba
## 1 1 1 2 14 51 3 2
## abberlin abbi
## 2 14
tail(freq,10)
## zuehlk zuko zukovski zundel zurg zus zweibel zwick
## 2 3 1 2 1 1 1 14
## zwigoff zyci
## 1 2
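Note that as.matrix turns the sparse TDM into a dense matrix with roughly 19 million cells (19064 terms by 1000 documents), almost all of them zero. As a possible alternative, the slam package (on which tm builds its sparse matrices) offers row_sums, which works on the sparse representation directly. The sketch below uses a made-up toy corpus to check that both routes give the same totals; toy.corpus, toy.tdm, freq.dense and freq.sparse are illustrative names.

```r
library(tm)
library(slam)

# Three invented mini documents (illustrative only)
toy.corpus = VCorpus(VectorSource(c("good plot good cast",
                                    "weak plot",
                                    "good ending")))
toy.tdm = TermDocumentMatrix(toy.corpus)

# Dense route, as used in this document
freq.dense = rowSums(as.matrix(toy.tdm))

# Sparse route: no dense matrix is created
freq.sparse = row_sums(toy.tdm)

freq.dense
freq.sparse
```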
We can again plot those frequencies in decreasing order.
plot(sort(freq, decreasing = T),col="blue",main="Word frequencies", xlab="Frequency-based rank", ylab = "Frequency")
We can also look at the ten most frequent terms and check that 6388 of the 19064 terms appear only once in our corpus.
# Ten most frequent terms
tail(sort(freq),n=10)
## scene can get time make like charact one movi
## 1365 1429 1518 1606 1693 2035 2066 3156 3163
## film
## 6195
# Number of terms only appearing once
sum(freq == 1)
## [1] 6388
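Put differently, 6388 of the 19064 terms, about 33.5%, occur exactly once. Here is a small base R sketch of how this share, and more generally the number of terms occurring once, twice and so on, can be computed from any frequency vector. toy.freq is an invented example, not data from the corpus.

```r
# Invented term frequencies (illustrative only)
toy.freq = c(film = 12, plot = 5, cast = 5, zurg = 1, aah = 1, zus = 1)

# Share of terms that occur exactly once
hapax.share = mean(toy.freq == 1)
hapax.share

# How many terms occur once, five times, ... ("frequency of frequencies")
freq.of.freq = table(toy.freq)
freq.of.freq

# The same share for the figures reported above
6388 / 19064
```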
We can see that the two most frequent stems are “film” and “movi” (from “movie”). Since our corpus consists of movie reviews, these two terms appear very frequently but add no valuable information about the individual documents.
In such cases, we usually define custom stop words by extending the predefined list returned by stopwords.
doc = corpus[1]
doc[[1]]$content[1]
## [1] "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or \" graphic novel , \" if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called \" unfortunates \" ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
\nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( \" who holds back the electric car/who made steve guttenberg a star ? \" ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
\neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content "
myStopwords = c(stopwords(),"film","films","movie","movies")
doc = tm_map(corpus[1],removeWords,myStopwords)
doc[[1]]$content[1]
## [1] " adapted comic books plenty success , whether superheroes ( batman , superman , spawn ) , geared toward kids ( casper ) arthouse crowd ( ghost world ) , never really comic book like hell . \n starters , created alan moore ( eddie campbell ) , brought medium whole new level mid '80s 12-part series called watchmen . \n say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd . \n book ( \" graphic novel , \" will ) 500 pages long includes nearly 30 consist nothing footnotes . \n words , dismiss source . \n can get past whole comic book thing , might find another stumbling block hell's directors , albert allen hughes . \ngetting hughes brothers direct seems almost ludicrous casting carrot top , well , anything , riddle : better direct set ghetto features really violent street crime mad geniuses behind menace ii society ? \n ghetto question , course , whitechapel 1888 london's east end . \n filthy , sooty place whores ( called \" unfortunates \" ) starting get little nervous mysterious psychopath carving profession surgical precision . \n first stiff turns , copper peter godley ( robbie coltrane , world enough ) calls inspector frederick abberline ( johnny depp , blow ) crack case . \nabberline , widower , prophetic dreams unsuccessfully tries quell copious amounts absinthe opium . \nupon arriving whitechapel , befriends unfortunate named mary kelly ( heather graham , say ) proceeds investigate horribly gruesome crimes even police surgeon stomach . \n think anyone needs briefed jack ripper , go particulars , say moore campbell unique interesting theory identity killer reasons chooses slay . \n comic , bother cloaking identity ripper , screenwriters terry hayes ( vertical limit ) rafael yglesias ( les mis ? rables ) good job keeping hidden viewers end . \n funny watch locals blindly point finger blame jews indians , , englishman never capable committing ghastly acts . 
\n hell's ending whistling stonecutters song simpsons days ( \" holds back electric car/ made steve guttenberg star ? \" ) . \n worry - 'll make sense see . \nnow onto hell's appearance : certainly dark bleak enough , surprising see much looks like tim burton planet apes ( times , seems like sleepy hollow 2 ) . \n print saw completely finished ( color music finalized , comments marilyn manson ) , cinematographer peter deming ( say word ) ably captures dreariness victorian-era london helped make flashy killing scenes remind crazy flashbacks twin peaks , even though violence pales comparison black--white comic . \noscar winner martin childs' ( shakespeare love ) production design turns original prague surroundings one creepy place . \neven acting hell solid , dreamy depp turning typically strong performance deftly handling british accent . \nians holm ( joe gould's secret ) richardson ( 102 dalmatians ) log great supporting roles , big surprise graham . \n cringed first time opened mouth , imagining attempt irish accent , actually half bad . \n , however , good . \n2 : 00 - r strong violence/gore , sexuality , language drug content "
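Notice that removeWords only removes whole words, which is why both the singular and the plural forms are listed in myStopwords. A minimal sketch on a plain character string, assuming tm is installed; the sentence is made up, and toy.text and toy.clean are illustrative names.

```r
library(tm)

toy.text = "this film and other films are filmed like a movie"

# Only exact word matches are removed; "filmed" is left untouched
toy.clean = removeWords(toy.text, c("film", "films", "movie", "movies"))
toy.clean

# Removal leaves extra blanks behind; stripWhitespace collapses them
stripWhitespace(toy.clean)
```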
Now let’s create another TDM with the transformations and the custom stop words.
tdm = TermDocumentMatrix(corpus,
control=list(stopwords = myStopwords,
removePunctuation = T,
removeNumbers = T,
stemming = T))
Let’s take a look at the summary of the new TDM.
tdm
## <<TermDocumentMatrix (terms: 19063, documents: 1000)>>
## Non-/sparse entries: 260110/18802890
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency (tf)
We can also show the most frequent terms and their frequencies in a bar plot.
freq=rowSums(as.matrix(tdm))
high.freq=tail(sort(freq),n=10)
# Explicit column names, so that ggplot reads both variables from the data frame
hfp.df=data.frame(names=names(high.freq), freq=high.freq)
ggplot(hfp.df, aes(reorder(names,freq), freq)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Terms") + ylab("Frequency") +
ggtitle("Term frequencies")
Let’s create a TDM applying TF-IDF weighting instead of term frequency. This can be done as in previous cases but passing the weighting = weightTfIdf parameter.
tdm.tfidf = TermDocumentMatrix(corpus,
control = list(weighting = weightTfIdf,
stopwords = myStopwords,
removePunctuation = T,
removeNumbers = T,
stemming = T))
Let’s take a look at the summary of the new TDM.
tdm.tfidf
## <<TermDocumentMatrix (terms: 19063, documents: 1000)>>
## Non-/sparse entries: 260110/18802890
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
And let’s also take a look at a subset of the TDM.
inspect(tdm.tfidf[2030:2035,100:103])
## <<TermDocumentMatrix (terms: 6, documents: 4)>>
## Non-/sparse entries: 1/23
## Sparsity : 96%
## Maximal term length: 9
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Docs
## Terms cv099_10534.txt cv100_11528.txt cv101_10175.txt cv102_7846.txt
## brizzi 0 0.00000000 0 0
## broach 0 0.00000000 0 0
## broad 0 0.00000000 0 0
## broadbent 0 0.00000000 0 0
## broadcast 0 0.01814074 0 0
## broaden 0 0.00000000 0 0
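As far as tm's documentation describes it, the normalized weightTfIdf divides each term count by the total number of terms in its document and multiplies the result by log2 of the number of documents over the number of documents containing the term. The following base R sketch reproduces that calculation on a made-up count matrix with three terms and three documents. toy.counts, tf, idf and tfidf are illustrative names, and the sketch is not meant to replace the tm implementation.

```r
# Invented term-by-document counts (rows = terms, columns = documents)
toy.counts = matrix(c(2, 1, 1,    # "plot" appears in every document
                      1, 0, 0,    # "alien" appears only in d1
                      0, 3, 1),   # "war" appears in d2 and d3
                    nrow = 3, byrow = TRUE,
                    dimnames = list(Terms = c("plot", "alien", "war"),
                                    Docs = c("d1", "d2", "d3")))

# Term frequency normalized by document length (column sums)
tf = sweep(toy.counts, 2, colSums(toy.counts), "/")

# Inverse document frequency: log2(number of docs / docs containing the term)
idf = log2(ncol(toy.counts) / rowSums(toy.counts > 0))

# idf has one value per row, so it is recycled down each column
tfidf = tf * idf
round(tfidf, 3)
```

A term that appears in every document, such as "plot" here, gets an IDF of 0, which is why very common terms drop out of a TF-IDF ranking.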
We can plot the TF-IDF values ordered.
freq=rowSums(as.matrix(tdm.tfidf))
plot(sort(freq, decreasing = T),col="blue",main="Word TF-IDF frequencies", xlab="TF-IDF-based rank", ylab = "TF-IDF")
And we can analyse the ten terms with the highest TF-IDF.
tail(sort(freq),n=10)
## star will stori action comedi war famili love
## 2.824053 2.835222 2.889615 2.901084 2.918187 2.923040 2.970478 3.022230
## life alien
## 3.059757 3.343716
We can also analyse which terms are most strongly associated with a given term.
Let’s analyse those terms frequently associated with “star”.
asoc.star = as.data.frame(findAssocs(tdm,"star", 0.5))
asoc.star$names <- rownames(asoc.star)
asoc.star
## star names
## trek 0.62 trek
## enterpris 0.57 enterpris
## picard 0.56 picard
## insurrect 0.55 insurrect
We can also put them in a bar graph.
ggplot(asoc.star, aes(reorder(names,star), star)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Terms") + ylab("Correlation") +
ggtitle("\"star\" associations")
And now those terms frequently associated with “indiana”.
asoc.indi = as.data.frame(findAssocs(tdm,"indiana", 0.5))
asoc.indi$names <- rownames(asoc.indi)
asoc.indi
## indiana names
## ark 0.72 ark
## archeologist 0.70 archeologist
## diarrhea 0.70 diarrhea
## engrav 0.70 engrav
## fudg 0.70 fudg
## hieroglyph 0.70 hieroglyph
## registr 0.70 registr
## sallah 0.70 sallah
## swordsman 0.70 swordsman
## indi 0.65 indi
## selleck 0.61 selleck
## shorten 0.57 shorten
## snake 0.53 snake
And the same terms in a bar graph.
ggplot(asoc.indi, aes(reorder(names,indiana), indiana)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Terms") + ylab("Correlation") +
ggtitle("\"indiana\" associations")
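The scores reported by findAssocs are correlations between rows of the TDM: two terms score high when their counts rise and fall together across documents. A base R sketch of the same idea on invented counts follows. toy.counts and assoc are hypothetical names, and findAssocs additionally rounds the values and keeps only those at or above the lower limit given as its third argument.

```r
# Invented counts: rows = terms, columns = five documents
toy.counts = rbind(star  = c(3, 0, 2, 0, 1),
                   trek  = c(2, 0, 2, 0, 0),
                   snake = c(0, 1, 0, 2, 0))
colnames(toy.counts) = paste0("d", 1:5)

# Correlation of every term with "star" across documents
assoc = cor(t(toy.counts))[, "star"]
assoc = assoc[names(assoc) != "star"]

# Keep only associations at or above a lower limit of 0.5
round(assoc[assoc >= 0.5], 2)
```

In this toy example only "trek" passes the limit, since "snake" tends to appear in documents where "star" does not.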
Now let’s make a word-document frequency graph that shows the frequency of each term in each document.
The first thing that we need to do, since we have a highly sparse TDM, is to remove sparse terms using the removeSparseTerms function.
tdm.small = removeSparseTerms(tdm,0.5)
dim(tdm.small)
## [1] 28 1000
tdm.small
## <<TermDocumentMatrix (terms: 28, documents: 1000)>>
## Non-/sparse entries: 17194/10806
## Sparsity : 39%
## Maximal term length: 7
## Weighting : term frequency (tf)
This way, instead of 19063 terms, we keep only the 28 terms that appear in more than half of the documents.
We can clearly see how our new TDM is less sparse.
inspect(tdm.small[1:4,1:4])
## <<TermDocumentMatrix (terms: 4, documents: 4)>>
## Non-/sparse entries: 7/9
## Sparsity : 56%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms cv000_29590.txt cv001_18431.txt cv002_15918.txt cv003_11664.txt
## also 0 0 0 0
## can 2 1 0 3
## charact 0 1 0 0
## come 0 2 1 3
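The second argument of removeSparseTerms is a sparsity threshold: a term is kept only if the share of documents in which it does not occur is below that value. A base R sketch of this rule on invented counts (toy.counts, term.sparsity and toy.small are illustrative names):

```r
# Invented counts: rows = terms, columns = four documents
toy.counts = rbind(can      = c(2, 1, 0, 3),
                   come     = c(0, 2, 1, 3),
                   abberlin = c(2, 0, 0, 0),
                   also     = c(0, 0, 1, 1))
colnames(toy.counts) = paste0("d", 1:4)

# Sparsity of a term = share of documents where its count is zero
term.sparsity = rowMeans(toy.counts == 0)
term.sparsity

# Keep terms whose sparsity is strictly below the threshold
toy.small = toy.counts[term.sparsity < 0.5, , drop = FALSE]
toy.small
```

Here "also" is dropped as well: it is missing from exactly half of the documents, and the threshold is strict.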
We create a data frame where each row holds the count of one term in one document.
matrix.tdm = melt(as.matrix(tdm.small), value.name = "count")
head(matrix.tdm)
## Terms Docs count
## 1 also cv000_29590.txt 0
## 2 can cv000_29590.txt 2
## 3 charact cv000_29590.txt 0
## 4 come cv000_29590.txt 0
## 5 end cv000_29590.txt 3
## 6 even cv000_29590.txt 3
And we plot the word-document frequency graph. Grey means that the term does not appear in the document, and a stronger red indicates a higher frequency.
ggplot(matrix.tdm, aes(x = Docs, y = Terms, fill = log10(count))) +
geom_tile(colour = "white") +
scale_fill_gradient(high="#FF0000" , low="#FFFFFF")+
ylab("Terms") +
theme(panel.background = element_blank()) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
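The grey tiles are a side effect of the log10 transformation: log10(0) is -Inf, which has no place on the colour gradient, so ggplot2 draws those tiles in the scale's missing-value colour (grey by default). A tiny base R sketch of the transformation on invented counts follows. Replacing -Inf by NA explicitly, as done at the end, is optional and only makes the intent visible.

```r
# Invented counts (illustrative only)
toy.count = c(0, 1, 10, 100)

# log10(0) is -Inf, so absent terms have no finite position on the scale
toy.log = log10(toy.count)
toy.log

# Optional: mark absent terms as missing explicitly
toy.log[is.infinite(toy.log)] = NA
toy.log
```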
Let’s choose a nice range of blue colors for the wordcloud. You can invoke the display.brewer.all function to see the whole palette.
Let’s also set the random number generator seed to some value (this way, we will always get the same word cloud).
pal=brewer.pal(8,"Blues")
pal=pal[-(1:3)]
set.seed(1234)
Due to an issue with the newest versions of the tm package (0.7 and 0.7-1), VCorpus must be used instead of Corpus in order to create n-grams. Another option is to go back to version 0.6-2 of the tm package.
corpus.ngrams = VCorpus(source.pos)
tdm.unigram = TermDocumentMatrix(corpus.ngrams,
control=list(stopwords = c(myStopwords,"s","ve"),
removePunctuation = T,
removeNumbers = T))
Now we extract the frequency of each term.
freq = sort(rowSums(as.matrix(tdm.unigram)), decreasing = T)
Finally, we invoke the wordcloud function to make the wordcloud with those terms that appear at least 400 times.
word.cloud=wordcloud(words=names(freq), freq=freq,
min.freq=400, random.order=F, colors=pal)
To create a bigram wordcloud, we apply transformations to the original corpus. In this case, we add “s” and “ve” (the remains of “’s” and “’ve”) to the stop word list.
Then, we use Weka’s n-gram tokenizer to create a TDM that uses as terms the bigrams that appear in the corpus.
corpus.ngrams = tm_map(corpus.ngrams,removeWords,c(myStopwords,"s","ve"))
corpus.ngrams = tm_map(corpus.ngrams,removePunctuation)
corpus.ngrams = tm_map(corpus.ngrams,removeNumbers)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram = TermDocumentMatrix(corpus.ngrams,
control = list (tokenize = BigramTokenizer))
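For readers who want to see what an n-gram tokenizer produces without RWeka (which needs a working Java installation), here is a minimal base R sketch. make_ngrams is a hypothetical helper written for this illustration and is not part of tm or RWeka. It splits on whitespace only, so its output may differ from that of Weka's tokenizer on real text.

```r
# Hypothetical helper: all runs of n consecutive words in a string
make_ngrams = function(text, n = 2) {
  words = unlist(strsplit(trimws(text), "\\s+"))
  if (length(words) < n) return(character(0))
  starts = seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Bigrams and trigrams of an invented phrase
make_ngrams("star wars special effects", n = 2)
make_ngrams("star wars special effects", n = 3)
```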
We extract the frequency of each bigram and analyse the twenty most frequent ones.
freq = sort(rowSums(as.matrix(tdm.bigram)),decreasing = TRUE)
freq.df = data.frame(word=names(freq), freq=freq)
head(freq.df, 20)
## word freq
## special effects special effects 171
## star wars star wars 133
## new york new york 131
## even though even though 120
## one best one best 115
## science fiction science fiction 84
## star trek star trek 84
## high school high school 81
## pulp fiction pulp fiction 75
## takes place takes place 72
## ever seen ever seen 68
## one day one day 68
## supporting cast supporting cast 68
## one thing one thing 62
## jackie chan jackie chan 61
## much like much like 59
## years ago years ago 59
## seems like seems like 57
## motion picture motion picture 56
## truman show truman show 56
And we plot the wordcloud.
wordcloud(freq.df$word,freq.df$freq,max.words=100,random.order = F, colors=pal)
We could also plot the most frequent bigrams in a bar graph.
ggplot(head(freq.df,15), aes(reorder(word,freq), freq)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Bigrams") + ylab("Frequency") +
ggtitle("Most frequent bigrams")
To create a trigram wordcloud, the approach is the same but this time we tell the n-gram tokenizer to find trigrams.
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm.trigram = TermDocumentMatrix(corpus.ngrams,
control = list(tokenize = TrigramTokenizer))
We extract the frequency of each trigram and analyse the twenty most frequent ones.
freq = sort(rowSums(as.matrix(tdm.trigram)),decreasing = TRUE)
freq.df = data.frame(word=names(freq), freq=freq)
head(freq.df, 20)
## word freq
## saving private ryan saving private ryan 39
## good will hunting good will hunting 34
## new york city new york city 29
## robert de niro robert de niro 25
## jay silent bob jay silent bob 22
## tommy lee jones tommy lee jones 22
## thin red line thin red line 21
## know last summer know last summer 20
## babe pig city babe pig city 18
## samuel l jackson samuel l jackson 17
## world war ii world war ii 16
## blair witch project blair witch project 15
## one best year one best year 15
## american history x american history x 14
## william h macy william h macy 13
## dusk till dawn dusk till dawn 12
## little known facts little known facts 12
## natural born killers natural born killers 12
## star trek insurrection star trek insurrection 12
## based true story based true story 11
And we plot the wordcloud.
wordcloud(freq.df$word,freq.df$freq,max.words=100,random.order = F, colors=pal)
We could also plot the most frequent trigrams in a bar graph.
ggplot(head(freq.df,15), aes(reorder(word,freq), freq)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Trigrams") + ylab("Frequency") +
ggtitle("Most frequent trigrams")