Example of creating term document matrices with TF-IDF weights

Load required libraries.

library(tm)
library(ggplot2)

Set the working directory to the location of the script and data.

setwd("~/Youtube")

Load corpus from local files.

Use the sentiment polarity dataset version 2.0 from the Movie Review Data collection.

Once the archive is unzipped, read the positive reviews from its pos/ folder.

path = "./review_polarity/txt_sentoken/"

dir = DirSource(paste(path,"pos/",sep=""), encoding = "UTF-8")
corpus = Corpus(dir)

Check how many documents have been loaded.

length(corpus)

## [1] 1000

Access the document in the first entry.

corpus[[1]]

## <<PlainTextDocument (metadata: 7)>>
## films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
## for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
## to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
## the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
## in other words , don't dismiss this film because of its source . 
## if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
## getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? 
## the ghetto in question is , of course , whitechapel in 1888 london's east end . 
## it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . 
## when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . 
## abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
## upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . 
## i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . 
## in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . 
## it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . 
## and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . 
## don't worry - it'll all make sense when you see it . 
## now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . 
## the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . 
## oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
## even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . 
## ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . 
## i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . 
## the film , however , is all good . 
## 2 : 00 - r for strong violence/gore , sexuality , language and drug content

Define custom stop words for our corpus.

myStopwords = c(stopwords(),"film","films","movie","movies")
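
The effect of extending the stop word list can be sketched in base R. This is only an illustration: base_stopwords is a hypothetical stand-in for the much longer list returned by stopwords(), and the tokens are made up.

```r
# Hypothetical stand-in for stopwords(): a few common English function words.
base_stopwords <- c("the", "a", "is", "of", "and")

# Extend the list with domain terms, mirroring myStopwords.
my_stopwords <- c(base_stopwords, "film", "films", "movie", "movies")

# Made-up tokens from a review sentence.
tokens <- c("the", "film", "is", "a", "triumph", "of", "production", "design")

# Keep only the tokens that are not stop words.
kept <- tokens[!tokens %in% my_stopwords]
kept
```

Words such as "film" occur in almost every review, so dropping them up front keeps them out of the vocabulary altogether.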

Create a TDM that uses TF-IDF weights instead of raw term frequencies.

The call is the same as in the previous examples, except that weighting = weightTfIdf is passed in the control list.

tdm = TermDocumentMatrix(corpus,
                         control = list(weighting = weightTfIdf,
                                        stopwords = myStopwords,
                                        removePunctuation = TRUE,
                                        removeNumbers = TRUE,
                                        stemming = TRUE))
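
To make the weighting concrete, here is a minimal base-R sketch of a normalized TF-IDF computation. It assumes the definition that tm documents for weightTfIdf with normalization: the term count divided by the document length, multiplied by log2 of the number of documents over the number of documents containing the term. The toy matrix, its terms and its counts are made up for illustration.

```r
# Toy term-document matrix of raw counts (rows = terms, columns = documents).
counts <- matrix(c(2, 0, 1,
                   1, 1, 1,
                   0, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Terms = c("alien", "stori", "war"),
                                 Docs  = c("d1", "d2", "d3")))

# Term frequency normalized by document length (the column sums).
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: log2(number of docs / docs containing the term).
idf <- log2(ncol(counts) / rowSums(counts > 0))

# TF-IDF: each term's row is scaled by that term's idf
# (the vector recycles down the columns, so idf[i] multiplies row i).
tfidf <- tf * idf
round(tfidf, 3)
```

A term that occurs in every document ("stori" here) gets an idf of 0 and therefore a weight of 0 everywhere, while a term concentrated in one document ("war") gets the highest weight.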

Take a look at the summary of the TDM.

tdm

## <<TermDocumentMatrix (terms: 22445, documents: 1000)>>
## Non-/sparse entries: 257056/22187944
## Sparsity           : 99%
## Maximal term length: 61
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Take a look at a subset of the TDM.

inspect(tdm[2005:2015,100:103])

## <<TermDocumentMatrix (terms: 11, documents: 4)>>
## Non-/sparse entries: 2/42
## Sparsity           : 95%
## Maximal term length: 10
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## 
##             Docs
## Terms        cv099_10534.txt cv100_11528.txt cv101_10175.txt
##   bobbitt          0.0000000     0.000000000               0
##   bobbl            0.0000000     0.000000000               0
##   bobcat           0.0000000     0.000000000               0
##   bodaci           0.0000000     0.000000000               0
##   boddi            0.0000000     0.000000000               0
##   bodi             0.0307017     0.009952296               0
##   bodili           0.0000000     0.000000000               0
##   bodyguard        0.0000000     0.000000000               0
##   bodystrewn       0.0000000     0.000000000               0
##   bodythem         0.0000000     0.000000000               0
##   bof              0.0000000     0.000000000               0
##             Docs
## Terms        cv102_7846.txt
##   bobbitt                 0
##   bobbl                   0
##   bobcat                  0
##   bodaci                  0
##   boddi                   0
##   bodi                    0
##   bodili                  0
##   bodyguard               0
##   bodystrewn              0
##   bodythem                0
##   bof                     0

Score each term by summing its TF-IDF weights across all documents (i.e., the row sums of the TDM). The results are aggregate weights, not raw frequencies.

freq=rowSums(as.matrix(tdm))
head(freq,10)

##    aaaahhh        aah      aamir    aardman      aaron    abandon 
## 0.03047640 0.02442594 0.02204820 0.02777936 0.44531043 0.61178662 
##        abb       abba   abberlin       abbi 
## 0.19413865 0.08339568 0.05204065 0.38470023

tail(freq,10)

## zuckerabrahamszuck             zuehlk               zuko 
##         0.03460342         0.05020546         0.09833795 
##           zukovski             zundel               zurg 
##         0.01566947         0.09401683         0.02111395 
##            zweibel              zwick            zwigoff 
##         0.02466778         0.25530615         0.04169784 
##               zyci 
##         0.07274295
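
Note that as.matrix(tdm) materializes all 22445 x 1000 cells, although 99% of them are zero. A possible alternative, sketched here on a made-up toy matrix, is row_sums from the slam package, which works directly on the sparse representation. The sketch assumes that slam is installed and that the TDM is a simple_triplet_matrix, as it is in the tm versions known to the editor.

```r
# Toy sparse matrix in triplet form (rows = terms, columns = documents).
# Terms, documents and weights are made up for illustration.
toy <- slam::simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L),        # row (term) index of each non-zero cell
  j = c(1L, 3L, 2L, 2L),        # column (document) index of each non-zero cell
  v = c(0.4, 0.1, 0.3, 0.2),    # the non-zero weights
  nrow = 3, ncol = 3,
  dimnames = list(Terms = c("alien", "stori", "war"),
                  Docs  = c("d1", "d2", "d3")))

# Row sums computed on the sparse representation, without densifying.
sparse_sums <- slam::row_sums(toy)

# The dense route gives the same values.
dense_sums <- rowSums(as.matrix(toy))
all.equal(unname(sparse_sums), unname(dense_sums))
```

Under those assumptions, the equivalent call on the real data would be slam::row_sums(tdm).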

Plot the summed TF-IDF weights in decreasing order.

plot(sort(freq, decreasing = TRUE), col = "blue", main = "Word TF-IDF frequencies", xlab = "TF-IDF-based rank", ylab = "TF-IDF")

See the ten terms with the highest summed TF-IDF weights.

tail(sort(freq),n=10)

##     star   action   comedi      war     will    stori   famili     love 
## 2.845973 2.848084 2.886999 2.891485 2.901405 2.950756 2.995873 3.091364 
##     life    alien 
## 3.110522 3.396233

Show the top-scoring terms and their summed TF-IDF weights in a bar plot.

high.freq = tail(sort(freq), n = 10)
# Build the data frame with explicit columns so that aes() maps to its own data
hfp.df = data.frame(names = names(high.freq), score = high.freq)

ggplot(hfp.df, aes(reorder(names, score), score)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Terms") + ylab("Summed TF-IDF") +
  ggtitle("Terms with the highest summed TF-IDF")

Raúl García-Castro (rgarcia@fi.upm.es)