Introduction

The goal of this document is to show how to perform different analyses of text documents at a word level, mainly using the tm (text mining) package in R.

I cannot claim full authorship of this document, since I have taken code snippets from, and been inspired by, multiple books and documents on the Web. Thanks to everyone for sharing.

Preparation

Check working directory

Check the working directory with getwd. If it is not the one where your data are located, change it with setwd.

getwd()
## [1] "/Users/raul/ownCloud/Trabajo/Docencia/2015 Intelligent Systems/R"
setwd("~/ownCloud/Trabajo/Docencia/2015 Intelligent Systems/R")

Load libraries

Now we load the required libraries.

library(tm)
library(ggplot2)
library(wordcloud)
library(RWeka)
library(reshape2)
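If any of the library calls above fails, the package is probably not installed. The following is a minimal sketch for checking this up front; it is not part of the original workflow, and the names `required`, `installed` and `not_installed` are only illustrative. Note that RWeka also needs a working Java installation.

```r
# Packages loaded in this document
required <- c("tm", "ggplot2", "wordcloud", "RWeka", "reshape2")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!installed]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)  # uncomment to install from CRAN
}
```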

Load corpus

We are going to use the Movie review data version 2.0, created by Bo Pang and Lillian Lee.

Once unzipped, the data set separates the documents into positive and negative opinions. In this script we are going to use the positive opinions, located in ./txt_sentoken/pos.

source.pos = DirSource("../Corpus/review_polarity/txt_sentoken/pos", encoding = "UTF-8")
corpus = Corpus(source.pos)
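To experiment without downloading the data set, a small corpus can be built from strings held in memory. This is only a sketch: it assumes tm's `VCorpus` and `VectorSource`, and the two sentences and the name `toy` are made up for illustration.

```r
library(tm)

# Two invented mini-reviews standing in for the files under txt_sentoken/pos
texts <- c("the film is good , really good .",
           "the acting was not bad at all .")

# VectorSource treats each element of the vector as one document
toy <- VCorpus(VectorSource(texts))

length(toy)          # number of documents in the toy corpus
content(toy[[1]])    # text of the first document
```

A toy corpus like this is handy for trying out the steps below on input small enough to check by eye.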

Inspect corpus

Let’s see how many entries there are in our corpus just by checking its length.

length(corpus)
## [1] 1000

Taking a look at the first three entries, we can see that they are not simple documents. If we show the first entry, we can see that it contains the document and some metadata.

summary(corpus[1:3])
##                 Length Class             Mode
## cv000_29590.txt 2      PlainTextDocument list
## cv001_18431.txt 2      PlainTextDocument list
## cv002_15918.txt 2      PlainTextDocument list
inspect(corpus[1])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## cv000_29590.txt 
## films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
\nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
\neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content

Let’s take a look at the document in the first entry.

inspect(corpus[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 4226
## 
## films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
## for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
## to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
## the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
## in other words , don't dismiss this film because of its source . 
## if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
## getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? 
## the ghetto in question is , of course , whitechapel in 1888 london's east end . 
## it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . 
## when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . 
## abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
## upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . 
## i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . 
## in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . 
## it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . 
## and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . 
## don't worry - it'll all make sense when you see it . 
## now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . 
## the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . 
## oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
## even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . 
## ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . 
## i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . 
## the film , however , is all good . 
## 2 : 00 - r for strong violence/gore , sexuality , language and drug content

And at its metadata. Note that we can also access metadata items individually.

meta(corpus[[1]])
##   author       : character(0)
##   datetimestamp: 2017-10-11 16:30:41
##   description  : character(0)
##   heading      : character(0)
##   id           : cv000_29590.txt
##   language     : en
##   origin       : character(0)
meta(corpus[[1]])$id
## [1] "cv000_29590.txt"
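Metadata fields can also be read by tag and filled in. The sketch below uses an invented two-document corpus; the author value is a placeholder, and the change is made on a copy of the document (`doc1`), not on the corpus itself.

```r
library(tm)

# Invented corpus, only to have a document to work on
toy <- VCorpus(VectorSource(c("a first short text", "a second short text")))
doc1 <- toy[[1]]

meta(doc1, "language")                       # read a single field by its tag
meta(doc1, "author") <- "unknown reviewer"   # fill in a field (placeholder value)
meta(doc1, "author")
```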

Create a default term document matrix

To create a term document matrix (TDM), we just invoke the TermDocumentMatrix function.

tdm = TermDocumentMatrix(corpus)

Let’s take a look at the summary of the TDM. The summary informs us about the high sparsity of the TDM (i.e., most of the entries in the matrix are zeros).

tdm
## <<TermDocumentMatrix (terms: 29924, documents: 1000)>>
## Non-/sparse entries: 325821/29598179
## Sparsity           : 99%
## Maximal term length: 19
## Weighting          : term frequency (tf)
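The sparsity figure can be reproduced by hand from the numbers in the summary. A small sketch using only base R (the variable names are illustrative):

```r
# Figures printed by the TDM summary in this document
n_terms    <- 29924
n_docs     <- 1000
non_sparse <- 325821                   # cells holding a non-zero count

total_cells  <- n_terms * n_docs       # every term/document combination
sparse_cells <- total_cells - non_sparse
sparsity     <- sparse_cells / total_cells

sparse_cells             # 29598179, the "sparse entries" value in the summary
round(100 * sparsity)    # 99
```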

Let’s take a look at a subset of the TDM for four documents and four terms. There we can see an example of the sparsity of the matrix.

inspect(tdm[2000:2003,100:103])
## <<TermDocumentMatrix (terms: 4, documents: 4)>>
## Non-/sparse entries: 1/15
## Sparsity           : 94%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## Sample             :
##             Docs
## Terms        cv099_10534.txt cv100_11528.txt cv101_10175.txt
##   eventually               0               0               0
##   exciting                 0               0               0
##   fei                      0               0               0
##   fighting                 0               0               0
##             Docs
## Terms        cv102_7846.txt
##   eventually              0
##   exciting                1
##   fei                     0
##   fighting                0

How many terms have been identified in the TDM? We can see it using the length function.

length(dimnames(tdm)$Terms)
## [1] 29924

How frequently do those terms appear? Let’s sum each row (i.e., each term) across all documents and look at the frequencies of the first and last ten terms.

freq=rowSums(as.matrix(tdm))
head(freq,10)
##       102      1888       500       80s abberline      ably     about 
##         2         2        10        27         2         9      1721 
##  absinthe    accent    acting 
##         1        37       322
tail(freq,10)
##    obscuring obstructions   overflying      paneled   powaqqatsi 
##            1            1            1            1            1 
##       snoots    tangerine       timbre       vainly    westworld 
##            1            1            1            1            1
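Converting the TDM with as.matrix creates a dense matrix with one cell per term and document (about 30 million cells here). As a hedged alternative, the slam package, which tm builds on, can sum the rows of the sparse representation directly. The sketch below checks on an invented corpus that both approaches give the same counts.

```r
library(tm)

# Invented corpus; "film" and "good" occur twice overall
toy <- VCorpus(VectorSource(c("good film good cast", "bad film")))
tdm_toy <- TermDocumentMatrix(toy)

freq_dense  <- rowSums(as.matrix(tdm_toy))   # builds the full terms x documents matrix
freq_sparse <- slam::row_sums(tdm_toy)       # sums the sparse entries directly

# Both vectors are named by term, so they can be compared term by term
all(freq_dense == freq_sparse[names(freq_dense)])
```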

If we plot the frequencies in decreasing order, we can see that the corpus roughly follows Zipf’s law: the frequency of a term is approximately inversely proportional to its rank.

plot(sort(freq, decreasing = TRUE),col="blue",main="Word frequencies", xlab="Frequency-based rank", ylab = "Frequency")
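Zipf’s law is easier to judge on logarithmic axes, where an inverse relationship between frequency and rank shows up as a straight line. The sketch below uses synthetic, idealised frequencies rather than the corpus, so the names `rnk` and `freq_zipf` and the constant 1000 are assumptions made for illustration.

```r
# Idealised Zipfian data: frequency proportional to 1 / rank (synthetic, not the corpus)
rnk <- 1:200
freq_zipf <- 1000 / rnk

# On log-log axes the curve becomes a straight line
plot(rnk, freq_zipf, log = "xy", col = "blue",
     main = "Synthetic Zipf distribution",
     xlab = "Frequency-based rank", ylab = "Frequency")

# The slope of a linear fit in log-log space estimates the Zipf exponent
fit <- lm(log(freq_zipf) ~ log(rnk))
slope <- unname(coef(fit)[2])
slope   # -1 for this synthetic data, up to rounding error
```

To apply the same idea to the corpus, `freq_zipf` would be replaced by `sort(freq, decreasing = TRUE)` and `rnk` by `seq_along(freq)`; real text usually gives a slope only roughly close to -1.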

We can also look at the ten most frequent terms and check that 11240 of the 29924 terms (about 38%) appear only once in our corpus.

# Ten most frequent terms
tail(sort(freq),n=10)
##   are   but  this  film   for   his  with  that   and   the 
##  3714  4492  4648  5232  5260  5588  5851  8121 19897 41498
# Number of terms only appearing once
sum(freq == 1)
## [1] 11240
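tm also offers findFreqTerms to list the terms whose total frequency lies within given bounds, which avoids sorting the whole frequency vector. A minimal sketch on an invented corpus:

```r
library(tm)

# Invented corpus: "good" and "film" occur twice, "cast" and "bad" once
toy <- VCorpus(VectorSource(c("good film good cast", "bad film")))
tdm_toy <- TermDocumentMatrix(toy)

# Terms whose total count lies between lowfreq and highfreq (inclusive)
hapaxes  <- findFreqTerms(tdm_toy, lowfreq = 1, highfreq = 1)
frequent <- findFreqTerms(tdm_toy, lowfreq = 2)

hapaxes
frequent
```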

Create a TDM after applying transformations to the corpus

Corpus transformations

We can see the different transformations that can be applied to a document by invoking the getTransformations function.

getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"
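Functions that are not in this list, such as base R's tolower, can also be applied by wrapping them with content_transformer. The following sketch converts an invented corpus to lower case; the movie review data is already lower case, so this step is shown only as an example of a custom transformation.

```r
library(tm)

# Invented corpus with mixed case
toy <- VCorpus(VectorSource(c("A GOOD Film", "Not BAD At All")))

# content_transformer() wraps a plain character function for use with tm_map()
toy_lower <- tm_map(toy, content_transformer(tolower))

content(toy_lower[[1]])
```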

Let’s take the first document in the corpus and apply some of these transformations. Which transformations we apply depends on our use case.

Let’s take a look at the content of the document before applying any transformation.

doc=corpus[1]
doc[[1]]$content[1]
## [1] "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or \" graphic novel , \" if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called \" unfortunates \" ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
\nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( \" who holds back the electric car/who made steve guttenberg a star ? \" ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
\neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content "

First, we remove stop words. We can list the stop words that will be removed by calling the stopwords function.

stopwords()
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
doc = tm_map(doc,removeWords,stopwords())
doc[[1]]$content[1]
## [1] "films adapted  comic books   plenty  success , whether   superheroes ( batman , superman , spawn ) ,  geared toward kids ( casper )   arthouse crowd ( ghost world ) ,   never really   comic book like  hell  . \n starters ,   created  alan moore (  eddie campbell ) ,  brought  medium   whole new level   mid '80s   12-part series called  watchmen . \n say moore  campbell thoroughly researched  subject  jack  ripper   like saying michael jackson  starting  look  little odd . \n book (  \" graphic novel , \"   will )   500 pages long  includes nearly 30   consist  nothing  footnotes . \n  words ,  dismiss  film    source . \n  can get past  whole comic book thing ,  might find another stumbling block   hell's directors , albert  allen hughes . \ngetting  hughes brothers  direct  seems almost  ludicrous  casting carrot top  , well , anything ,  riddle   :  better  direct  film  set   ghetto  features really violent street crime   mad geniuses behind menace ii society ? \n ghetto  question  ,  course , whitechapel  1888 london's east end . \n  filthy , sooty place   whores ( called \" unfortunates \" )  starting  get  little nervous   mysterious psychopath    carving   profession  surgical precision . \n  first stiff turns  , copper peter godley ( robbie coltrane ,  world   enough ) calls  inspector frederick abberline ( johnny depp , blow )  crack  case . \nabberline ,  widower ,  prophetic dreams  unsuccessfully tries  quell  copious amounts  absinthe  opium . \nupon arriving  whitechapel ,  befriends  unfortunate named mary kelly ( heather graham , say    )  proceeds  investigate  horribly gruesome crimes  even  police surgeon  stomach . \n  think anyone needs   briefed  jack  ripper ,    go   particulars  ,    say moore  campbell   unique  interesting theory    identity   killer   reasons  chooses  slay . \n  comic ,   bother cloaking  identity   ripper ,  screenwriters terry hayes ( vertical limit )  rafael yglesias ( les mis ? 
rables )   good job  keeping  hidden  viewers    end . \n funny  watch  locals blindly point  finger  blame  jews  indians  ,   ,  englishman  never  capable  committing  ghastly acts . \n  hell's ending   whistling  stonecutters song   simpsons  days ( \"  holds back  electric car/ made steve guttenberg  star ? \" ) . \n worry - 'll  make sense   see  . \nnow onto  hell's appearance :  certainly dark  bleak enough ,   surprising  see  much   looks like  tim burton film  planet   apes  (  times ,  seems like sleepy hollow 2 ) . \n print  saw  completely finished (  color  music    finalized ,   comments  marilyn manson ) ,  cinematographer peter deming (  say  word ) ably captures  dreariness  victorian-era london  helped make  flashy killing scenes remind    crazy flashbacks  twin peaks , even though  violence   film pales  comparison     black--white comic . \noscar winner martin childs' ( shakespeare  love ) production design turns  original prague surroundings  one creepy place . \neven  acting   hell  solid ,   dreamy depp turning   typically strong performance  deftly handling  british accent . \nians holm ( joe gould's secret )  richardson ( 102 dalmatians ) log  great supporting roles ,   big surprise   graham . \n cringed  first time  opened  mouth , imagining  attempt   irish accent ,   actually  half bad . \n film , however ,   good . \n2 : 00 - r  strong violence/gore , sexuality , language  drug content "
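The output above contains a stray 'll: "it" is a stop word but "it'll" is not in the list, so only the first half of the contraction is removed. The transformation functions also accept plain character vectors, which makes such effects easy to reproduce. The sketch below uses an invented sentence and a hypothetical three-word stop list (`small_list`) instead of the full list.

```r
library(tm)

x <- "the film is good , it'll make sense ."
small_list <- c("the", "is", "it")   # hypothetical mini stop list for illustration

step1 <- removeWords(x, small_list)        # "it" is removed, "'ll" is left behind
step2 <- removePunctuation(step1)          # drops the comma, apostrophe and full stop
cleaned <- trimws(stripWhitespace(step2))  # collapse the gaps left by removed tokens

cleaned
```

If such fragments matter for the analysis, one option is to add the missing contractions to the word list, for example `c(stopwords(), "it'll")`.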

Then, we remove punctuation symbols.

doc = tm_map(doc,removePunctuation)
doc[[1]]$content[1]
## [1] "films adapted  comic books   plenty  success  whether   superheroes  batman  superman  spawn    geared toward kids  casper    arthouse crowd  ghost world     never really   comic book like  hell   \n starters    created  alan moore   eddie campbell    brought  medium   whole new level   mid 80s   12part series called  watchmen  \n say moore  campbell thoroughly researched  subject  jack  ripper   like saying michael jackson  starting  look  little odd  \n book    graphic novel     will    500 pages long  includes nearly 30   consist  nothing  footnotes  \n  words   dismiss  film    source  \n  can get past  whole comic book thing   might find another stumbling block   hells directors  albert  allen hughes  \ngetting  hughes brothers  direct  seems almost  ludicrous  casting carrot top   well  anything   riddle     better  direct  film  set   ghetto  features really violent street crime   mad geniuses behind menace ii society  \n ghetto  question    course  whitechapel  1888 londons east end  \n  filthy  sooty place   whores  called  unfortunates    starting  get  little nervous   mysterious psychopath    carving   profession  surgical precision  \n  first stiff turns   copper peter godley  robbie coltrane   world   enough  calls  inspector frederick abberline  johnny depp  blow   crack  case  \nabberline   widower   prophetic dreams  unsuccessfully tries  quell  copious amounts  absinthe  opium  \nupon arriving  whitechapel   befriends  unfortunate named mary kelly  heather graham  say      proceeds  investigate  horribly gruesome crimes  even  police surgeon  stomach  \n  think anyone needs   briefed  jack  ripper     go   particulars      say moore  campbell   unique  interesting theory    identity   killer   reasons  chooses  slay  \n  comic    bother cloaking  identity   ripper   screenwriters terry hayes  vertical limit   rafael yglesias  les mis  rables    good job  keeping  hidden  viewers    end  \n funny  watch  locals blindly point  finger  blame  
jews  indians       englishman  never  capable  committing  ghastly acts  \n  hells ending   whistling  stonecutters song   simpsons  days    holds back  electric car made steve guttenberg  star     \n worry  ll  make sense   see   \nnow onto  hells appearance   certainly dark  bleak enough    surprising  see  much   looks like  tim burton film  planet   apes    times   seems like sleepy hollow 2   \n print  saw  completely finished   color  music    finalized    comments  marilyn manson    cinematographer peter deming   say  word  ably captures  dreariness  victorianera london  helped make  flashy killing scenes remind    crazy flashbacks  twin peaks  even though  violence   film pales  comparison     blackwhite comic  \noscar winner martin childs  shakespeare  love  production design turns  original prague surroundings  one creepy place  \neven  acting   hell  solid    dreamy depp turning   typically strong performance  deftly handling  british accent  \nians holm  joe goulds secret   richardson  102 dalmatians  log  great supporting roles    big surprise   graham  \n cringed  first time  opened  mouth  imagining  attempt   irish accent    actually  half bad  \n film  however    good  \n2  00  r  strong violencegore  sexuality  language  drug content "

Then, we remove numbers.

doc = tm_map(doc,removeNumbers)
doc[[1]]$content[1]
## [1] "films adapted  comic books   plenty  success  whether   superheroes  batman  superman  spawn    geared toward kids  casper    arthouse crowd  ghost world     never really   comic book like  hell   \n starters    created  alan moore   eddie campbell    brought  medium   whole new level   mid s   part series called  watchmen  \n say moore  campbell thoroughly researched  subject  jack  ripper   like saying michael jackson  starting  look  little odd  \n book    graphic novel     will     pages long  includes nearly    consist  nothing  footnotes  \n  words   dismiss  film    source  \n  can get past  whole comic book thing   might find another stumbling block   hells directors  albert  allen hughes  \ngetting  hughes brothers  direct  seems almost  ludicrous  casting carrot top   well  anything   riddle     better  direct  film  set   ghetto  features really violent street crime   mad geniuses behind menace ii society  \n ghetto  question    course  whitechapel   londons east end  \n  filthy  sooty place   whores  called  unfortunates    starting  get  little nervous   mysterious psychopath    carving   profession  surgical precision  \n  first stiff turns   copper peter godley  robbie coltrane   world   enough  calls  inspector frederick abberline  johnny depp  blow   crack  case  \nabberline   widower   prophetic dreams  unsuccessfully tries  quell  copious amounts  absinthe  opium  \nupon arriving  whitechapel   befriends  unfortunate named mary kelly  heather graham  say      proceeds  investigate  horribly gruesome crimes  even  police surgeon  stomach  \n  think anyone needs   briefed  jack  ripper     go   particulars      say moore  campbell   unique  interesting theory    identity   killer   reasons  chooses  slay  \n  comic    bother cloaking  identity   ripper   screenwriters terry hayes  vertical limit   rafael yglesias  les mis  rables    good job  keeping  hidden  viewers    end  \n funny  watch  locals blindly point  finger  blame  jews  
indians       englishman  never  capable  committing  ghastly acts  \n  hells ending   whistling  stonecutters song   simpsons  days    holds back  electric car made steve guttenberg  star     \n worry  ll  make sense   see   \nnow onto  hells appearance   certainly dark  bleak enough    surprising  see  much   looks like  tim burton film  planet   apes    times   seems like sleepy hollow    \n print  saw  completely finished   color  music    finalized    comments  marilyn manson    cinematographer peter deming   say  word  ably captures  dreariness  victorianera london  helped make  flashy killing scenes remind    crazy flashbacks  twin peaks  even though  violence   film pales  comparison     blackwhite comic  \noscar winner martin childs  shakespeare  love  production design turns  original prague surroundings  one creepy place  \neven  acting   hell  solid    dreamy depp turning   typically strong performance  deftly handling  british accent  \nians holm  joe goulds secret   richardson   dalmatians  log  great supporting roles    big surprise   graham  \n cringed  first time  opened  mouth  imagining  attempt   irish accent    actually  half bad  \n film  however    good  \n    r  strong violencegore  sexuality  language  drug content "

Then, we remove extra whitespace.

doc = tm_map(doc,stripWhitespace)
doc[[1]]$content[1]
## [1] "films adapted comic books plenty success whether superheroes batman superman spawn geared toward kids casper arthouse crowd ghost world never really comic book like hell starters created alan moore eddie campbell brought medium whole new level mid s part series called watchmen say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd book graphic novel will pages long includes nearly consist nothing footnotes words dismiss film source can get past whole comic book thing might find another stumbling block hells directors albert allen hughes getting hughes brothers direct seems almost ludicrous casting carrot top well anything riddle better direct film set ghetto features really violent street crime mad geniuses behind menace ii society ghetto question course whitechapel londons east end filthy sooty place whores called unfortunates starting get little nervous mysterious psychopath carving profession surgical precision first stiff turns copper peter godley robbie coltrane world enough calls inspector frederick abberline johnny depp blow crack case abberline widower prophetic dreams unsuccessfully tries quell copious amounts absinthe opium upon arriving whitechapel befriends unfortunate named mary kelly heather graham say proceeds investigate horribly gruesome crimes even police surgeon stomach think anyone needs briefed jack ripper go particulars say moore campbell unique interesting theory identity killer reasons chooses slay comic bother cloaking identity ripper screenwriters terry hayes vertical limit rafael yglesias les mis rables good job keeping hidden viewers end funny watch locals blindly point finger blame jews indians englishman never capable committing ghastly acts hells ending whistling stonecutters song simpsons days holds back electric car made steve guttenberg star worry ll make sense see now onto hells appearance certainly dark bleak enough surprising see much looks like tim burton film 
planet apes times seems like sleepy hollow print saw completely finished color music finalized comments marilyn manson cinematographer peter deming say word ably captures dreariness victorianera london helped make flashy killing scenes remind crazy flashbacks twin peaks even though violence film pales comparison blackwhite comic oscar winner martin childs shakespeare love production design turns original prague surroundings one creepy place even acting hell solid dreamy depp turning typically strong performance deftly handling british accent ians holm joe goulds secret richardson dalmatians log great supporting roles big surprise graham cringed first time opened mouth imagining attempt irish accent actually half bad film however good r strong violencegore sexuality language drug content "

And, finally, we can stem the document.

doc = tm_map(doc,stemDocument)
doc[[1]]$content[1]
## [1] "film adapt comic book plenti success whether superhero batman superman spawn gear toward kid casper arthous crowd ghost world never realli comic book like hell starter creat alan moor eddi campbel brought medium whole new level mid s part seri call watchmen say moor campbel thorough research subject jack ripper like say michael jackson start look littl odd book graphic novel will page long includ near consist noth footnot word dismiss film sourc can get past whole comic book thing might find anoth stumbl block hell director albert allen hugh get hugh brother direct seem almost ludicr cast carrot top well anyth riddl better direct film set ghetto featur realli violent street crime mad genius behind menac ii societi ghetto question cours whitechapel london east end filthi sooti place whore call unfortun start get littl nervous mysteri psychopath carv profess surgic precis first stiff turn copper peter godley robbi coltran world enough call inspector frederick abberlin johnni depp blow crack case abberlin widow prophet dream unsuccess tri quell copious amount absinth opium upon arriv whitechapel befriend unfortun name mari kelli heather graham say proceed investig horribl gruesom crime even polic surgeon stomach think anyon need brief jack ripper go particular say moor campbel uniqu interest theori ident killer reason choos slay comic bother cloak ident ripper screenwrit terri hay vertic limit rafael yglesia les mis rabl good job keep hidden viewer end funni watch local blind point finger blame jew indian englishman never capabl commit ghast act hell end whistl stonecutt song simpson day hold back electr car made steve guttenberg star worri ll make sens see now onto hell appear certain dark bleak enough surpris see much look like tim burton film planet ape time seem like sleepi hollow print saw complet finish color music final comment marilyn manson cinematograph peter deme say word abli captur dreari victorianera london help make flashi kill scene remind 
crazi flashback twin peak even though violenc film pale comparison blackwhit comic oscar winner martin child shakespear love product design turn origin pragu surround one creepi place even act hell solid dreami depp turn typic strong perform deft handl british accent ian holm joe gould secret richardson dalmatian log great support role big surpris graham cring first time open mouth imagin attempt irish accent actual half bad film howev good r strong violencegor sexual languag drug content"
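
To see what stemming does to individual words, here is a minimal sketch that stems three words taken from the review above. It assumes the SnowballC package is installed (to my knowledge, it is the package stemDocument relies on); the word list is only an illustrative choice.

```r
library(SnowballC)

# A few words taken from the first review
words = c("films", "adapted", "plenty")

# Stem them with the English Snowball stemmer
stems = wordStem(words, language = "english")
print(stems)
```

The result should match the stems visible in the output above (film, adapt, plenti).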

Create a TDM with transformations

Let’s create another term-document matrix, this time applying the transformations to the whole corpus through the control parameter.

tdm = TermDocumentMatrix(corpus,
                         control=list(stopwords = T,
                                      removePunctuation = T, 
                                      removeNumbers = T,
                                      stemming = T))

Let’s take a look at the summary of the new TDM.

tdm
## <<TermDocumentMatrix (terms: 19064, documents: 1000)>>
## Non-/sparse entries: 261750/18802250
## Sparsity           : 99%
## Maximal term length: 17
## Weighting          : term frequency (tf)
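
The sparsity figure in this summary can be reproduced from the other numbers it reports. A small sketch using the values printed above (the variable names are only illustrative):

```r
# Values taken from the TDM summary above
n.terms = 19064
n.docs = 1000
non.sparse = 261750

# Every cell that is not a non-zero entry is a sparse (zero) entry
n.cells = n.terms * n.docs
sparse = n.cells - non.sparse

# Share of zero cells, as a rounded percentage
sparsity = round(100 * sparse / n.cells)
print(c(sparse = sparse, sparsity = sparsity))
```

This gives 18802250 sparse entries and a sparsity of 99%, in agreement with the summary.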

And let’s also take a look at a subset of the new TDM.

inspect(tdm[2030:2035,100:103])
## <<TermDocumentMatrix (terms: 6, documents: 4)>>
## Non-/sparse entries: 1/23
## Sparsity           : 96%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##            Docs
## Terms       cv099_10534.txt cv100_11528.txt cv101_10175.txt cv102_7846.txt
##   brizzi                  0               0               0              0
##   broach                  0               0               0              0
##   broad                   0               0               0              0
##   broadbent               0               0               0              0
##   broadcast               0               1               0              0
##   broaden                 0               0               0              0

We can see how many terms have been identified in the TDM using the length function again.

length(dimnames(tdm)$Terms)
## [1] 19064
head(dimnames(tdm)$Terms,10)
##  [1] "aaaahhh"  "aah"      "aamir"    "aardman"  "aaron"    "abandon" 
##  [7] "abb"      "abba"     "abberlin" "abbi"
tail(dimnames(tdm)$Terms,10)
##  [1] "zuehlk"   "zuko"     "zukovski" "zundel"   "zurg"     "zus"     
##  [7] "zweibel"  "zwick"    "zwigoff"  "zyci"

How frequently do those terms appear? Let’s sum each row (i.e., each term) across all documents and see the frequency of the terms just shown.

freq=rowSums(as.matrix(tdm))
head(freq,10)
##  aaaahhh      aah    aamir  aardman    aaron  abandon      abb     abba 
##        1        1        1        2       14       51        3        2 
## abberlin     abbi 
##        2       14
tail(freq,10)
##   zuehlk     zuko zukovski   zundel     zurg      zus  zweibel    zwick 
##        2        3        1        2        1        1        1       14 
##  zwigoff     zyci 
##        1        2
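
Note that as.matrix builds a dense matrix, which for this corpus holds about 19 million cells. As a possible alternative, the slam package (which, as far as I know, tm uses for its sparse matrices) offers row_sums, which works directly on the sparse representation. A minimal sketch on a made-up three-document corpus:

```r
library(tm)

# Tiny made-up corpus, only for illustration
toy = VCorpus(VectorSource(c("good film", "bad film", "good good plot")))
toy.tdm = TermDocumentMatrix(toy)

# Dense route used in this document
freq.dense = rowSums(as.matrix(toy.tdm))

# Sparse route: no dense matrix is created
freq.sparse = slam::row_sums(toy.tdm)

print(freq.dense)
print(freq.sparse)
```

Both routes should return the same totals per term.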

We can plot those frequencies again, sorted in decreasing order.

plot(sort(freq, decreasing = T),col="blue",main="Word frequencies", xlab="Frequency-based rank", ylab = "Frequency")

And we can look at the ten most frequent terms, as well as check that 6388 of the 19064 terms (about a third) appear only once in our corpus.

# Ten most frequent terms
tail(sort(freq),n=10)
##   scene     can     get    time    make    like charact     one    movi 
##    1365    1429    1518    1606    1693    2035    2066    3156    3163 
##    film 
##    6195
# Number of terms only appearing once
sum(freq == 1)
## [1] 6388
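
The same two checks can be tried on a small vector to see what each call returns. The frequencies below are made up for illustration.

```r
# Made-up named vector of term frequencies
toy.freq = c(film = 5, plot = 1, actor = 3, scene = 1, star = 2)

# The two most frequent terms, most frequent first
top = head(sort(toy.freq, decreasing = TRUE), 2)
print(top)

# Share of terms that appear exactly once
once.share = sum(toy.freq == 1) / length(toy.freq)
print(once.share)
```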

Create TDM with transformations and custom stopwords

We can see that the two most frequent stems are “film” and “movi” (from movie). Since our corpus consists of movie reviews, these two terms appear very frequently but add no valuable information about any particular document.

In these cases, we usually define custom stop words by adding new words to the predefined list returned by stopwords.

doc = corpus[1]
doc[[1]]$content[1]
## [1] "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or \" graphic novel , \" if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called \" unfortunates \" ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
\nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( \" who holds back the electric car/who made steve guttenberg a star ? \" ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
\neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content "
myStopwords = c(stopwords(),"film","films","movie","movies")
doc = tm_map(corpus[1],removeWords,myStopwords)
doc[[1]]$content[1]
## [1] " adapted  comic books   plenty  success , whether   superheroes ( batman , superman , spawn ) ,  geared toward kids ( casper )   arthouse crowd ( ghost world ) ,   never really   comic book like  hell  . \n starters ,   created  alan moore (  eddie campbell ) ,  brought  medium   whole new level   mid '80s   12-part series called  watchmen . \n say moore  campbell thoroughly researched  subject  jack  ripper   like saying michael jackson  starting  look  little odd . \n book (  \" graphic novel , \"   will )   500 pages long  includes nearly 30   consist  nothing  footnotes . \n  words ,  dismiss      source . \n  can get past  whole comic book thing ,  might find another stumbling block   hell's directors , albert  allen hughes . \ngetting  hughes brothers  direct  seems almost  ludicrous  casting carrot top  , well , anything ,  riddle   :  better  direct    set   ghetto  features really violent street crime   mad geniuses behind menace ii society ? \n ghetto  question  ,  course , whitechapel  1888 london's east end . \n  filthy , sooty place   whores ( called \" unfortunates \" )  starting  get  little nervous   mysterious psychopath    carving   profession  surgical precision . \n  first stiff turns  , copper peter godley ( robbie coltrane ,  world   enough ) calls  inspector frederick abberline ( johnny depp , blow )  crack  case . \nabberline ,  widower ,  prophetic dreams  unsuccessfully tries  quell  copious amounts  absinthe  opium . \nupon arriving  whitechapel ,  befriends  unfortunate named mary kelly ( heather graham , say    )  proceeds  investigate  horribly gruesome crimes  even  police surgeon  stomach . \n  think anyone needs   briefed  jack  ripper ,    go   particulars  ,    say moore  campbell   unique  interesting theory    identity   killer   reasons  chooses  slay . \n  comic ,   bother cloaking  identity   ripper ,  screenwriters terry hayes ( vertical limit )  rafael yglesias ( les mis ? 
rables )   good job  keeping  hidden  viewers    end . \n funny  watch  locals blindly point  finger  blame  jews  indians  ,   ,  englishman  never  capable  committing  ghastly acts . \n  hell's ending   whistling  stonecutters song   simpsons  days ( \"  holds back  electric car/ made steve guttenberg  star ? \" ) . \n worry - 'll  make sense   see  . \nnow onto  hell's appearance :  certainly dark  bleak enough ,   surprising  see  much   looks like  tim burton   planet   apes  (  times ,  seems like sleepy hollow 2 ) . \n print  saw  completely finished (  color  music    finalized ,   comments  marilyn manson ) ,  cinematographer peter deming (  say  word ) ably captures  dreariness  victorian-era london  helped make  flashy killing scenes remind    crazy flashbacks  twin peaks , even though  violence    pales  comparison     black--white comic . \noscar winner martin childs' ( shakespeare  love ) production design turns  original prague surroundings  one creepy place . \neven  acting   hell  solid ,   dreamy depp turning   typically strong performance  deftly handling  british accent . \nians holm ( joe gould's secret )  richardson ( 102 dalmatians ) log  great supporting roles ,   big surprise   graham . \n cringed  first time  opened  mouth , imagining  attempt   irish accent ,   actually  half bad . \n  , however ,   good . \n2 : 00 - r  strong violence/gore , sexuality , language  drug content "
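
Conceptually, removing stop words just drops every word found in the list. A rough base R sketch of the idea follows; note that removeWords works on the text itself rather than on a token vector (which is why the extra spaces remain in the output above), and that the short list here is a made-up stand-in for myStopwords.

```r
# Made-up stand-in for the stop word list
toy.stopwords = c("the", "is", "a", "film", "movie")

# Tokens of a short sentence, split on spaces
tokens = strsplit("the film is a solid movie", " ", fixed = TRUE)[[1]]

# Keep only the tokens that are not stop words
kept = tokens[!tokens %in% toy.stopwords]
print(kept)
```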

Now let’s create another TDM with the transformations and the custom stop words.

tdm = TermDocumentMatrix(corpus,
                         control=list(stopwords = myStopwords,
                                      removePunctuation = T, 
                                      removeNumbers = T,
                                      stemming = T))

Let’s take a look at the summary of the new TDM.

tdm
## <<TermDocumentMatrix (terms: 19063, documents: 1000)>>
## Non-/sparse entries: 260110/18802890
## Sparsity           : 99%
## Maximal term length: 17
## Weighting          : term frequency (tf)

We can also show the most frequent terms and their frequencies in a bar plot.

freq=rowSums(as.matrix(tdm))
high.freq=tail(sort(freq),n=10)
# Build the data frame with explicit column names so that aes() uses its columns
hfp.df=data.frame(names=names(high.freq), freq=high.freq)
ggplot(hfp.df, aes(reorder(names,freq), freq)) +
  geom_bar(stat="identity") + coord_flip() + 
  xlab("Terms") + ylab("Frequency") +
  ggtitle("Term frequencies")

Create a TDM with TF-IDF weights

Let’s create a TDM applying TF-IDF weighting instead of term frequency. This can be done as in previous cases but passing the weighting = weightTfIdf parameter.

tdm.tfidf = TermDocumentMatrix(corpus,
                               control = list(weighting = weightTfIdf,
                                              stopwords = myStopwords, 
                                              removePunctuation = T,
                                              removeNumbers = T,
                                              stemming = T))

Let’s take a look at the summary of the new TDM.

tdm.tfidf
## <<TermDocumentMatrix (terms: 19063, documents: 1000)>>
## Non-/sparse entries: 260110/18802890
## Sparsity           : 99%
## Maximal term length: 17
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
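
To make the weighting concrete, here is a hand computation on a made-up count matrix. It assumes the usual normalised definition, which to my understanding is the one weightTfIdf applies by default: each count is divided by the total number of terms in its document and multiplied by log2(N / df), where N is the number of documents and df the number of documents containing the term.

```r
# Made-up counts: rows are terms, columns are documents
counts = matrix(c(2, 0, 1,
                  1, 1, 1,
                  0, 3, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(Terms = c("alien", "good", "war"),
                                Docs = c("d1", "d2", "d3")))

# Term frequency normalised by document length (column totals)
tf = sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: log2(number of documents / documents containing the term)
idf = log2(ncol(counts) / rowSums(counts > 0))

# Each row (term) is multiplied by its own idf value
tfidf = tf * idf
print(round(tfidf, 3))
```

A term that appears in every document ("good" here) gets an idf of 0, so its weight vanishes, while a term concentrated in a single document ("war") gets the highest weight.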

And let’s also take a look at a subset of the TDM.

inspect(tdm.tfidf[2030:2035,100:103])
## <<TermDocumentMatrix (terms: 6, documents: 4)>>
## Non-/sparse entries: 1/23
## Sparsity           : 96%
## Maximal term length: 9
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##            Docs
## Terms       cv099_10534.txt cv100_11528.txt cv101_10175.txt cv102_7846.txt
##   brizzi                  0      0.00000000               0              0
##   broach                  0      0.00000000               0              0
##   broad                   0      0.00000000               0              0
##   broadbent               0      0.00000000               0              0
##   broadcast               0      0.01814074               0              0
##   broaden                 0      0.00000000               0              0

We can sum the TF-IDF values of each term over all documents and plot them in decreasing order.

freq=rowSums(as.matrix(tdm.tfidf))

plot(sort(freq, decreasing = T),col="blue",main="Word TF-IDF frequencies", xlab="TF-IDF-based rank", ylab = "TF-IDF")

And we can analyse the ten terms with the highest summed TF-IDF.

tail(sort(freq),n=10)
##     star     will    stori   action   comedi      war   famili     love 
## 2.824053 2.835222 2.889615 2.901084 2.918187 2.923040 2.970478 3.022230 
##     life    alien 
## 3.059757 3.343716

Make an association analysis

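We can also analyse which terms tend to appear together with a given term. The findAssocs function returns the terms whose correlation with the given term across documents is at least the given lower limit (0.5 here).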

Let’s analyse those terms frequently associated with “star”.

asoc.star = as.data.frame(findAssocs(tdm,"star", 0.5))
asoc.star$names <- rownames(asoc.star) 
asoc.star
##           star     names
## trek      0.62      trek
## enterpris 0.57 enterpris
## picard    0.56    picard
## insurrect 0.55 insurrect
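
The numbers in this table are correlations between rows of the TDM. The following base R sketch mimics the idea on made-up counts, assuming a plain Pearson correlation rounded to two decimals; it is not the tm implementation itself.

```r
# Made-up counts: rows are terms, columns are documents
m = matrix(c(1, 0, 2, 0,
             1, 0, 3, 0,
             0, 2, 0, 1),
           nrow = 3, byrow = TRUE,
           dimnames = list(c("star", "trek", "war"), paste0("d", 1:4)))

# Correlation of "star" with every term, computed across documents
assoc = cor(t(m))["star", ]

# Drop the term itself, round, and keep correlations of at least 0.5
assoc = round(assoc[names(assoc) != "star"], 2)
assoc = sort(assoc[assoc >= 0.5], decreasing = TRUE)
print(assoc)
```

Only "trek" survives the 0.5 limit here, because "war" is negatively correlated with "star" in this toy matrix.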

We can also put them in a bar graph.

ggplot(asoc.star, aes(reorder(names,star), star)) +   
  geom_bar(stat="identity") + coord_flip() + 
  xlab("Terms") + ylab("Correlation") +
  ggtitle("\"star\" associations")

And now those terms frequently associated with “indiana”.

asoc.indi = as.data.frame(findAssocs(tdm,"indiana", 0.5))
asoc.indi$names <- rownames(asoc.indi) 
asoc.indi
##              indiana        names
## ark             0.72          ark
## archeologist    0.70 archeologist
## diarrhea        0.70     diarrhea
## engrav          0.70       engrav
## fudg            0.70         fudg
## hieroglyph      0.70   hieroglyph
## registr         0.70      registr
## sallah          0.70       sallah
## swordsman       0.70    swordsman
## indi            0.65         indi
## selleck         0.61      selleck
## shorten         0.57      shorten
## snake           0.53        snake

And the same terms in a bar graph.

ggplot(asoc.indi, aes(reorder(names,indiana), indiana)) +   
  geom_bar(stat="identity") + coord_flip() + 
  xlab("Terms") + ylab("Correlation") +
  ggtitle("\"indiana\" associations")

Create a word-document frequency graph

Now let’s make a word-document frequency graph, which shows graphically how often each term appears in each document.

The first thing that we need to do, since we have a highly sparse TDM, is to remove sparse terms using the removeSparseTerms function.

tdm.small = removeSparseTerms(tdm,0.5)
dim(tdm.small)
## [1]   28 1000
tdm.small
## <<TermDocumentMatrix (terms: 28, documents: 1000)>>
## Non-/sparse entries: 17194/10806
## Sparsity           : 39%
## Maximal term length: 7
## Weighting          : term frequency (tf)

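This way, instead of 19063 terms we keep only the 28 terms that appear in more than half of the documents (with a threshold of 0.5, every term that is absent from 50% or more of the documents is removed).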
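
The filtering rule can be reproduced by hand on a small matrix. This sketch assumes that removeSparseTerms keeps the terms whose share of zero entries is below the threshold; the counts are made up.

```r
# Made-up counts: rows are terms, columns are documents
m = matrix(c(1, 1, 0, 2,
             0, 0, 0, 1,
             3, 0, 1, 1),
           nrow = 3, byrow = TRUE,
           dimnames = list(c("can", "zurg", "one"), paste0("d", 1:4)))

# Share of documents in which each term does not appear
term.sparsity = rowMeans(m == 0)
print(term.sparsity)

# Keep only the terms below the 0.5 threshold
kept = m[term.sparsity < 0.5, , drop = FALSE]
print(rownames(kept))
```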

We can clearly see how our new TDM is less sparse.

inspect(tdm.small[1:4,1:4])
## <<TermDocumentMatrix (terms: 4, documents: 4)>>
## Non-/sparse entries: 7/9
## Sparsity           : 56%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##          Docs
## Terms     cv000_29590.txt cv001_18431.txt cv002_15918.txt cv003_11664.txt
##   also                  0               0               0               0
##   can                   2               1               0               3
##   charact               0               1               0               0
##   come                  0               2               1               3

We reshape the matrix into a long-format data frame where each row holds the number of appearances of one term in one document.

matrix.tdm = melt(as.matrix(tdm.small), value.name = "count")
head(matrix.tdm)
##     Terms            Docs count
## 1    also cv000_29590.txt     0
## 2     can cv000_29590.txt     2
## 3 charact cv000_29590.txt     0
## 4    come cv000_29590.txt     0
## 5     end cv000_29590.txt     3
## 6    even cv000_29590.txt     3
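
For readers unfamiliar with melt, a similar long format can be obtained in base R. This is only a sketch on a made-up 2 x 2 matrix, not a replacement for the reshape2 call above.

```r
# Made-up counts with named dimensions, as in a TDM converted with as.matrix
m = matrix(c(0, 2, 1, 3), nrow = 2,
           dimnames = list(Terms = c("also", "can"),
                           Docs = c("doc1", "doc2")))

# One row per (term, document) pair, with the count in its own column
long = as.data.frame(as.table(m), responseName = "count")
print(long)
```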

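And we plot the word-document frequency graph. A grey tile means that the term does not appear in the document: its count is 0, so log10(count) is -Inf, which has no place on the gradient and is drawn with the default colour for missing values. A stronger red indicates a higher frequency.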

ggplot(matrix.tdm, aes(x = Docs, y = Terms, fill = log10(count))) +
  geom_tile(colour = "white") +
  scale_fill_gradient(high="#FF0000" , low="#FFFFFF")+
  ylab("Terms") +
  theme(panel.background = element_blank()) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
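
The effect of taking log10 of a zero count can be checked directly. The sketch below also shows one optional way (an assumption, not what this document does) to mark absent terms explicitly as NA before plotting.

```r
# Made-up counts, including a term that does not appear
count = c(0, 1, 10, 100)

# The zero count becomes -Inf
fill.value = log10(count)
print(fill.value)

# Optional: make the "term absent" case explicit with NA
fill.explicit = ifelse(count > 0, log10(count), NA)
print(fill.explicit)
```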

Create a word cloud

Let’s choose a nice range of blue colors for the wordcloud. You can invoke the display.brewer.all function to see all the available palettes.

Let’s also set the random number generator seed to some value (this way, we will always get the same word cloud).

pal=brewer.pal(8,"Blues")
pal=pal[-(1:3)]

set.seed(1234)

Due to an issue with the newest versions of the tm package (0.7 and 0.7-1), VCorpus must be used instead of Corpus in order to create n-grams. Another option is to go back to version 0.6-2 of the tm package.

corpus.ngrams = VCorpus(source.pos)

tdm.unigram = TermDocumentMatrix(corpus.ngrams,
                                 control=list(stopwords = c(myStopwords,"s","ve"),
                                              removePunctuation = T, 
                                              removeNumbers = T)) 

Now we extract the frequency of each term.

freq = sort(rowSums(as.matrix(tdm.unigram)), decreasing = T)

Finally, we invoke the wordcloud function to make the wordcloud with those terms that appear at least 400 times.

word.cloud=wordcloud(words=names(freq), freq=freq,
                     min.freq=400, random.order=F, colors=pal)

Create a bigram wordcloud

To create a bigram wordcloud, we apply transformations to the original corpus. In this case, we also add “s” and “ve” (the remains of the “’s” and “’ve” contractions) to the stop word list.

Then, we use Weka’s n-gram tokenizer to create a TDM that uses as terms the bigrams that appear in the corpus.

corpus.ngrams = tm_map(corpus.ngrams,removeWords,c(myStopwords,"s","ve"))
corpus.ngrams = tm_map(corpus.ngrams,removePunctuation)
corpus.ngrams = tm_map(corpus.ngrams,removeNumbers)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram = TermDocumentMatrix(corpus.ngrams,
                                control = list (tokenize = BigramTokenizer))
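
To make clear what an n-gram tokenizer produces, here is a small base R sketch that builds n-grams from a vector of tokens. The function name ngrams.base is hypothetical and this is not how RWeka implements it; it only illustrates the sliding window.

```r
# Build all n-grams of consecutive tokens (hypothetical helper, for illustration)
ngrams.base = function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts = seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens = strsplit("saving private ryan", " ", fixed = TRUE)[[1]]

bigrams = ngrams.base(tokens, 2)
trigrams = ngrams.base(tokens, 3)
print(bigrams)
print(trigrams)
```

With min = 2 and max = 2 the tokenizer above yields only the bigrams; a three-word text therefore contributes two terms.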

We extract the frequency of each bigram and analyse the twenty most frequent ones.

freq = sort(rowSums(as.matrix(tdm.bigram)),decreasing = TRUE)
freq.df = data.frame(word=names(freq), freq=freq)
head(freq.df, 20)
##                            word freq
## special effects special effects  171
## star wars             star wars  133
## new york               new york  131
## even though         even though  120
## one best               one best  115
## science fiction science fiction   84
## star trek             star trek   84
## high school         high school   81
## pulp fiction       pulp fiction   75
## takes place         takes place   72
## ever seen             ever seen   68
## one day                 one day   68
## supporting cast supporting cast   68
## one thing             one thing   62
## jackie chan         jackie chan   61
## much like             much like   59
## years ago             years ago   59
## seems like           seems like   57
## motion picture   motion picture   56
## truman show         truman show   56

And we plot the wordcloud.

wordcloud(freq.df$word,freq.df$freq,max.words=100,random.order = F, colors=pal)

We could also plot the most frequent bigrams in a bar graph.

ggplot(head(freq.df,15), aes(reorder(word,freq), freq)) +   
  geom_bar(stat="identity") + coord_flip() + 
  xlab("Bigrams") + ylab("Frequency") +
  ggtitle("Most frequent bigrams")

Create a trigram wordcloud

To create a trigram wordcloud, the approach is the same but this time we tell the n-gram tokenizer to find trigrams.

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm.trigram = TermDocumentMatrix(corpus.ngrams,
                                control = list(tokenize = TrigramTokenizer))

We extract the frequency of each trigram and analyse the twenty most frequent ones.

freq = sort(rowSums(as.matrix(tdm.trigram)),decreasing = TRUE)
freq.df = data.frame(word=names(freq), freq=freq)
head(freq.df, 20)
##                                          word freq
## saving private ryan       saving private ryan   39
## good will hunting           good will hunting   34
## new york city                   new york city   29
## robert de niro                 robert de niro   25
## jay silent bob                 jay silent bob   22
## tommy lee jones               tommy lee jones   22
## thin red line                   thin red line   21
## know last summer             know last summer   20
## babe pig city                   babe pig city   18
## samuel l jackson             samuel l jackson   17
## world war ii                     world war ii   16
## blair witch project       blair witch project   15
## one best year                   one best year   15
## american history x         american history x   14
## william h macy                 william h macy   13
## dusk till dawn                 dusk till dawn   12
## little known facts         little known facts   12
## natural born killers     natural born killers   12
## star trek insurrection star trek insurrection   12
## based true story             based true story   11

And we plot the wordcloud.

wordcloud(freq.df$word,freq.df$freq,max.words=100,random.order = F, colors=pal)

We could also plot the most frequent trigrams in a bar graph.

ggplot(head(freq.df,15), aes(reorder(word,freq), freq)) +   
  geom_bar(stat="identity") + coord_flip() + 
  xlab("Trigrams") + ylab("Frequency") +
  ggtitle("Most frequent trigrams")