Load required libraries.
library(tm)
library(ggplot2)
Set the working directory to the location of the script and data.
setwd("~/Youtube")
Load the corpus from local files.
Download the sentiment polarity dataset version 2.0 from the Movie Review Data collection.
Once it is unzipped, read the positive reviews from the dataset.
path = "./review_polarity/txt_sentoken/"
dir = DirSource(paste(path,"pos/",sep=""), encoding = "UTF-8")
corpus = Corpus(dir)
Check how many documents have been loaded.
length(corpus)
## [1] 1000
Access the document in the first entry.
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .
## for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
## to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .
## the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .
## in other words , don't dismiss this film because of its source .
## if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
## getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?
## the ghetto in question is , of course , whitechapel in 1888 london's east end .
## it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .
## when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case .
## abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium .
## upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .
## i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .
## in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end .
## it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .
## and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) .
## don't worry - it'll all make sense when you see it .
## now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .
## the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic .
## oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place .
## even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent .
## ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham .
## i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad .
## the film , however , is all good .
## 2 : 00 - r for strong violence/gore , sexuality , language and drug content
Define custom stop words for our corpus.
myStopwords = c(stopwords(),"film","films","movie","movies")
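To make the effect of the custom list concrete, here is a minimal sketch in base R of how stop words filter a tokenised review. The tokens vector and the short basicStopwords list are made up for illustration. In the script above, the full list comes from stopwords() and the filtering happens inside TermDocumentMatrix.

```r
# Hypothetical stand-in for the list returned by stopwords(); illustration only.
basicStopwords <- c("the", "is", "a", "of", "this")
customStopwords <- c(basicStopwords, "film", "films", "movie", "movies")

# A tokenised toy sentence (not taken from the dataset).
tokens <- c("this", "film", "is", "a", "triumph", "of", "the", "genre")

# Keep only the tokens that are not stop words.
kept <- tokens[!tokens %in% customStopwords]
print(kept)
```

Words such as "film" and "movie" appear in almost every review, so dropping them keeps them from crowding out more informative terms.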
Create a term-document matrix (TDM), applying the custom stop words, punctuation and number removal, and stemming.
tdm = TermDocumentMatrix(corpus,
                         control = list(stopwords = myStopwords,
                                        removePunctuation = TRUE,
                                        removeNumbers = TRUE,
                                        stemming = TRUE))
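As a rough picture of what a term-document matrix holds, the sketch below builds one by hand in base R. The three toy documents are invented, and this is not how tm stores the matrix internally (tm uses a sparse representation); it only shows the layout, with terms as rows and documents as columns.

```r
# Three toy documents, already lower-cased and stripped of punctuation.
docs <- c(doc1 = "star trek crew",
          doc2 = "star wars",
          doc3 = "indiana ark snake ark")

# Split each document into tokens.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct term, sorted.
terms <- sort(unique(unlist(tokens)))

# Count each term in each document: rows are terms, columns are documents.
toyTdm <- vapply(tokens,
                 function(tk) tabulate(match(tk, terms), nbins = length(terms)),
                 integer(length(terms)))
rownames(toyTdm) <- terms
print(toyTdm)
```

Each cell is the number of times the row's term occurs in the column's document, so "ark" gets a count of 2 in doc3.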
Analyse which terms tend to occur together with a given term.
For “star”, findAssocs returns the terms whose correlation with it across documents is at least 0.5.
asoc.star = as.data.frame(findAssocs(tdm,"star", 0.5))
asoc.star$names <- rownames(asoc.star)
asoc.star
##                         star                   names
## trek                    0.63                    trek
## enterpris               0.57               enterpris
## picard                  0.57                  picard
## insurrect               0.56               insurrect
## jeanluc                 0.54                 jeanluc
## androidwishingtobehuman 0.50 androidwishingtobehuman
## anij                    0.50                    anij
## bubblebath              0.50              bubblebath
## crewmat                 0.50                 crewmat
## dougherti               0.50               dougherti
## everbut                 0.50                 everbut
## harkonnen               0.50               harkonnen
## homeworld               0.50               homeworld
## iith                    0.50                    iith
## indefin                 0.50                 indefin
## mountaintop             0.50             mountaintop
## plunder                 0.50                 plunder
## reassum                 0.50                 reassum
## ruafro                  0.50                  ruafro
## soran                   0.50                   soran
## unsuspens               0.50               unsuspens
## verdant                 0.50                 verdant
## youthrestor             0.50             youthrestor
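The scores above are correlations between term rows of the matrix, taken across documents. The sketch below reproduces that idea in base R on a toy matrix. The counts are invented, and the code is an illustration of the computation rather than tm's actual implementation.

```r
# Toy term-document counts (rows: terms, columns: documents); values are made up.
toyTdm <- rbind(
  star  = c(3, 0, 1, 0),
  trek  = c(2, 0, 1, 0),
  snake = c(0, 2, 0, 1)
)
colnames(toyTdm) <- paste0("doc", 1:4)

# Correlate the "star" row with every row, across documents.
assoc <- apply(toyTdm, 1, function(row) cor(row, toyTdm["star", ]))

# Drop the term itself, keep correlations at or above the limit, sort descending.
corlimit <- 0.5
assoc <- assoc[names(assoc) != "star"]
assoc <- sort(round(assoc[assoc >= corlimit], 2), decreasing = TRUE)
print(assoc)
```

Only "trek" survives the cut here, since "snake" never shares a document with "star". Correlation ignores scale, so all terms that occur only in the same single document get the same score against any target term. That is a likely reason so many rare stems above share exactly 0.50.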
Plot them as a bar chart.
ggplot(asoc.star, aes(reorder(names, star), star)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Terms") + ylab("Correlation") +
  ggtitle("\"star\" associations")
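With many tied terms the chart gets crowded, so it can help to plot only the strongest associations. The sketch below shows one way to do that in base R. The data frame assocDf is a hypothetical stand-in with the same shape as asoc.star, filled with a few values copied from the output above, and the cutoff n is an arbitrary choice.

```r
# Hypothetical association table in the same shape as asoc.star above.
assocDf <- data.frame(
  star  = c(0.63, 0.57, 0.57, 0.56, 0.54, 0.50, 0.50),
  names = c("trek", "enterpris", "picard", "insurrect", "jeanluc", "anij", "soran"),
  stringsAsFactors = FALSE
)

# Order by correlation (descending) and keep the strongest n rows.
n <- 5
topAssoc <- head(assocDf[order(-assocDf$star), ], n)
print(topAssoc)
```

The resulting data frame can be passed to the same ggplot call in place of the full table.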
Analyse those terms frequently associated with “indiana”.
asoc.indi = as.data.frame(findAssocs(tdm,"indiana", 0.5))
asoc.indi$names <- rownames(asoc.indi)
asoc.indi
##                       indiana                 names
## ark                      0.74                   ark
## actionmovi               0.70            actionmovi
## brawn                    0.70                 brawn
## diarrhea                 0.70              diarrhea
## engrav                   0.70                engrav
## hieroglyph               0.70            hieroglyph
## hotfudgerockin           0.70        hotfudgerockin
## minecart                 0.70              minecart
## obcpo                    0.70                 obcpo
## professorarcheologist    0.70 professorarcheologist
## registr                  0.70               registr
## sallah                   0.70                sallah
## salsa                    0.70                 salsa
## swordsman                0.70             swordsman
## indi                     0.68                  indi
## selleck                  0.61               selleck
## shorten                  0.57               shorten
## snake                    0.53                 snake
Plot them as a bar chart.
ggplot(asoc.indi, aes(reorder(names, indiana), indiana)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Terms") + ylab("Correlation") +
  ggtitle("\"indiana\" associations")