Load required libraries.
# Raise the JVM heap limit to avoid "OutOfMemoryError: Java heap space"
# (this has to run before openNLP starts the JVM)
library(rJava)
.jinit(parameters="-Xmx4g")
# If memory problems persist, call gc() after the POS tagging
library(openNLP)
library(openNLPmodels.en)
library(tm)
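The heap size can also be requested through the java.parameters option, provided it is set before any package starts the JVM. A minimal sketch, assuming a fresh R session in which rJava has not been loaded yet:

```r
# Set the JVM options before rJava (or a package that depends on it,
# such as openNLP) is loaded; a JVM that is already running ignores them.
options(java.parameters = "-Xmx4g")

# Confirm the value that will be handed to the JVM at start-up.
getOption("java.parameters")
```

The value 4g mirrors the .jinit call above and should be adjusted to the memory available on the machine.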
Set the working directory to the location of the script and data.
setwd("~/Youtube")
Load the corpus from local files.
The example uses the sentiment polarity dataset version 2.0 from the Movie Review Data collection.
Once the archive is unzipped, the positive reviews can be read from the pos/ subdirectory.
path = "./review_polarity_small/txt_sentoken/"
dir = DirSource(paste(path,"pos/",sep=""), encoding = "UTF-8")
corpus = Corpus(dir)
Check how many documents have been loaded.
length(corpus)
## [1] 10
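If the count is not what you expect, it can help to inspect the directory before building the corpus. A small sketch in base R; the helper name countReviewFiles is made up for this example, and it assumes the reviews are stored as .txt files:

```r
# Hypothetical helper: count the .txt files in a directory and fail early
# with a readable message when the directory does not exist.
countReviewFiles <- function(dirPath) {
  if (!dir.exists(dirPath)) {
    stop("Directory not found: ", dirPath)
  }
  length(list.files(dirPath, pattern = "\\.txt$"))
}

# Usage with the directory read above:
# countReviewFiles("./review_polarity_small/txt_sentoken/pos")
```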
Access the document in the first entry.
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .
## for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
## to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .
## the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .
## in other words , don't dismiss this film because of its source .
## if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
## getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?
## the ghetto in question is , of course , whitechapel in 1888 london's east end .
## it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .
## when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case .
## abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium .
## upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .
## i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .
## in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end .
## it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .
## and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) .
## don't worry - it'll all make sense when you see it .
## now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .
## the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic .
## oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place .
## even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent .
## ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham .
## i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad .
## the film , however , is all good .
## 2 : 00 - r for strong violence/gore , sexuality , language and drug content
getAnnotationsFromDocument returns annotations for the text document: word, sentence, part-of-speech, and Penn Treebank parse annotations.
As an alternative, the koRpus package uses TreeTagger for POS tagging.
getAnnotationsFromDocument = function(doc){
  # Convert the document to an NLP String object
  x = as.String(doc)
  # Sentence and word annotations must exist before POS tagging
  sent_token_annotator <- Maxent_Sent_Token_Annotator()
  word_token_annotator <- Maxent_Word_Token_Annotator()
  pos_tag_annotator <- Maxent_POS_Tag_Annotator()
  y1 <- annotate(x, list(sent_token_annotator, word_token_annotator))
  y2 <- annotate(x, pos_tag_annotator, y1)
  # Add the Penn Treebank parse annotations
  parse_annotator <- Parse_Annotator()
  y3 <- annotate(x, parse_annotator, y2)
  return(y3)
}
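When only POS tags are needed, the parse step can be left out, and the annotators can be created once and reused for every document. A minimal sketch under those assumptions; makePosPipeline and getPosAnnotationsFromDocument are names invented here, not part of openNLP, and documents annotated this way give parsed_sents nothing to work with:

```r
# Hypothetical helper: build the sentence, word and POS annotators once.
# Nothing from openNLP is touched until this function is actually called.
makePosPipeline <- function() {
  list(
    openNLP::Maxent_Sent_Token_Annotator(),
    openNLP::Maxent_Word_Token_Annotator(),
    openNLP::Maxent_POS_Tag_Annotator()
  )
}

# Hypothetical variant of getAnnotationsFromDocument without the parser:
# the annotators in 'pipeline' are applied to the document in order.
getPosAnnotationsFromDocument <- function(doc, pipeline) {
  NLP::annotate(NLP::as.String(doc), pipeline)
}

# Usage (needs openNLP, the English models and a working Java setup):
# pipeline <- makePosPipeline()
# annotations <- lapply(corpus, getPosAnnotationsFromDocument, pipeline = pipeline)
```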
Apply the getAnnotationsFromDocument function to every document in the corpus.
This step may take a long time, depending on the size of the corpus and on the annotations requested.
annotations = lapply(corpus, getAnnotationsFromDocument)
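For larger corpora it can be useful to see how long each document takes and to free memory along the way, as the gc() hint at the top suggests. A sketch in base R; annotateWithProgress is a hypothetical wrapper, not a function from tm or openNLP:

```r
# Hypothetical wrapper: apply an annotation function to each document in turn,
# report the elapsed time per document, and call gc() after each one.
annotateWithProgress <- function(docs, annotateFun) {
  results <- vector("list", length(docs))
  for (i in seq_along(docs)) {
    elapsed <- system.time(results[[i]] <- annotateFun(docs[[i]]))[["elapsed"]]
    message(sprintf("Document %d of %d: %.1f s", i, length(docs), elapsed))
    gc()
  }
  # Keep the document names, as lapply would.
  names(results) <- names(docs)
  results
}

# Usage as a replacement for the lapply call:
# annotations <- annotateWithProgress(corpus, getAnnotationsFromDocument)
```

Calling gc() after every document costs a little time; it is only worth it when the Java heap is under pressure.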
getAnnotatedPlainTextDocument returns the text document along with its annotations in an AnnotatedPlainTextDocument.
getAnnotatedPlainTextDocument = function(doc,annotations){
  x=as.String(doc)
  a = AnnotatedPlainTextDocument(x,annotations)
  return(a)
}
Create AnnotatedPlainTextDocuments that attach the annotations to each document.
Store the annotated corpus in a separate variable, since this conversion discards the corpus metadata.
corpus.tagged = Map(getAnnotatedPlainTextDocument, corpus, annotations)
corpus.tagged[[1]]
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: 1, length(s): 849
## Content: chars: 4226
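Map calls the function on the first document together with the first set of annotations, then on the second pair, and so on, so corpus and annotations must be in the same order. A toy illustration in base R with made-up inputs:

```r
# Two named "documents" and one value per document, in the same order.
docs <- list(a = "first text", b = "second text")
counts <- list(2, 2)

# Map pairs the elements by position and keeps the names of the first list.
paired <- Map(function(doc, n) paste(doc, "->", n, "words"), docs, counts)
paired$a
names(paired)
```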
Access annotated documents.
doc = corpus.tagged[[1]]
doc
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: 1, length(s): 849
## Content: chars: 4226
Access the text representation of the document.
as.character(doc)
## films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .
## for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
## to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .
## the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .
## in other words , don't dismiss this film because of its source .
## if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
## getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?
## the ghetto in question is , of course , whitechapel in 1888 london's east end .
## it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .
## when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case .
## abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium .
## upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .
## i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .
## in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end .
## it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .
## and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) .
## don't worry - it'll all make sense when you see it .
## now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .
## the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic .
## oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place .
## even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent .
## ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham .
## i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad .
## the film , however , is all good .
## 2 : 00 - r for strong violence/gore , sexuality , language and drug content
Access its words.
head(words(doc))
## [1] "films" "adapted" "from" "comic" "books" "have"
Access its sentences.
head(sents(doc),3)
## [[1]]
## [1] "films" "adapted" "from" "comic" "books"
## [6] "have" "had" "plenty" "of" "success"
## [11] "," "whether" "they" "'re" "about"
## [16] "superheroes" "(" "batman" "," "superman"
## [21] "," "spawn" ")" "," "or"
## [26] "geared" "toward" "kids" "(" "casper"
## [31] ")" "or" "the" "arthouse" "crowd"
## [36] "(" "ghost" "world" ")" ","
## [41] "but" "there" "'s" "never" "really"
## [46] "been" "a" "comic" "book" "like"
## [51] "from" "hell" "before" "."
##
## [[2]]
## [1] "for" "starters" "," "it" "was" "created"
## [7] "by" "alan" "moore" "(" "and" "eddie"
## [13] "campbell" ")" "," "who" "brought" "the"
## [19] "medium" "to" "a" "whole" "new" "level"
## [25] "in" "the" "mid" "'80s" "with" "a"
## [31] "12-part" "series" "called" "the" "watchmen" "."
##
## [[3]]
## [1] "to" "say" "moore" "and" "campbell"
## [6] "thoroughly" "researched" "the" "subject" "of"
## [11] "jack" "the" "ripper" "would" "be"
## [16] "like" "saying" "michael" "jackson" "is"
## [21] "starting" "to" "look" "a" "little"
## [26] "odd" "."
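Because sents returns one character vector per sentence, sentence lengths follow directly. A sketch on a hand-built list that mimics that structure (the two short sentences are invented for the example):

```r
# Same shape as the result of sents(doc): a list of token vectors.
sentences <- list(
  c("the", "film", "is", "good", "."),
  c("i", "liked", "it", ".")
)

# Number of tokens per sentence, and the average sentence length.
sentenceLengths <- lengths(sentences)
sentenceLengths
mean(sentenceLengths)

# With the annotated document: lengths(sents(doc))
```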
Access its tagged words.
head(tagged_words(doc))
## films/NNS
## adapted/VBD
## from/IN
## comic/JJ
## books/NNS
## have/VBP
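A common next step is to count tags or to filter tokens by tag. How tagged_words stores its result internally depends on the NLP version, so the sketch below works only on the printed token/TAG form shown above, copied into a character vector:

```r
# Tagged words copied from the printed output above, as "token/TAG" strings.
tagged <- c("films/NNS", "adapted/VBD", "from/IN",
            "comic/JJ", "books/NNS", "have/VBP")

# Split on the last slash so tokens that contain "/" stay intact.
tokens <- sub("/[^/]*$", "", tagged)
tags <- sub("^.*/", "", tagged)

# Count how often each tag occurs.
tagCounts <- table(tags)
tagCounts

# Keep only the nouns (Penn Treebank tags starting with "NN").
nouns <- tokens[startsWith(tags, "NN")]
nouns
```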
Access its tagged sentences.
head(tagged_sents(doc),3)
## [[1]]
## films/NNS
## adapted/VBD
## from/IN
## comic/JJ
## books/NNS
## have/VBP
## had/VBN
## plenty/NN
## of/IN
## success/NN
## ,/,
## whether/IN
## they/PRP
## 're/VBP
## about/IN
## superheroes/NNS
## (/-LRB-
## batman/NN
## ,/,
## superman/NN
## ,/,
## spawn/NN
## )/-RRB-
## ,/,
## or/CC
## geared/VBN
## toward/IN
## kids/NNS
## (/-LRB-
## casper/NN
## )/-RRB-
## or/CC
## the/DT
## arthouse/NN
## crowd/NN
## (/-LRB-
## ghost/NN
## world/NN
## )/-RRB-
## ,/,
## but/CC
## there/EX
## 's/VBZ
## never/RB
## really/RB
## been/VBN
## a/DT
## comic/JJ
## book/NN
## like/IN
## from/IN
## hell/NN
## before/IN
## ./.
##
## [[2]]
## for/IN
## starters/NNS
## ,/,
## it/PRP
## was/VBD
## created/VBN
## by/IN
## alan/NN
## moore/NN
## (/-LRB-
## and/CC
## eddie/JJ
## campbell/NN
## )/-RRB-
## ,/,
## who/WP
## brought/VBD
## the/DT
## medium/NN
## to/TO
## a/DT
## whole/JJ
## new/JJ
## level/NN
## in/IN
## the/DT
## mid/JJ
## '80s/NNS
## with/IN
## a/DT
## 12-part/JJ
## series/NN
## called/VBN
## the/DT
## watchmen/NNS
## ./.
##
## [[3]]
## to/TO
## say/VB
## moore/NN
## and/CC
## campbell/NN
## thoroughly/RB
## researched/VBD
## the/DT
## subject/NN
## of/IN
## jack/NN
## the/DT
## ripper/NN
## would/MD
## be/VB
## like/IN
## saying/VBG
## michael/NN
## jackson/NN
## is/VBZ
## starting/VBG
## to/TO
## look/VB
## a/DT
## little/JJ
## odd/JJ
## ./.
Access the parse trees of its sentences.
head(parsed_sents(doc),3)
## [[1]]
## (TOP
##   (S
##     (S
##       (NP
##         (NP (NNS films))
##         (VP
##           (VBN adapted)
##           (PP (IN from) (NP (JJ comic) (NNS books)))))
##       (VP
##         (VP
##           (VBP have)
##           (VP
##             (VBN had)
##             (NP (NP (NN plenty)) (PP (IN of) (NP (NN success))))
##             (, ,)
##             (SBAR
##               (IN whether)
##               (S
##                 (NP (PRP they))
##                 (VP
##                   (VBP 're)
##                   (PP
##                     (IN about)
##                     (NP
##                       (NP (NNS superheroes))
##                       (PRN
##                         (-LRB- -LRB-)
##                         (NP
##                           (NP (NN batman))
##                           (, ,)
##                           (NP (NN superman))
##                           (, ,)
##                           (NP (NN spawn)))
##                         (-RRB- -RRB-)))))))))
##         (, ,)
##         (CC or)
##         (VP
##           (VBN geared)
##           (PP
##             (IN toward)
##             (NP
##               (NP
##                 (NP (NNS kids))
##                 (PRN (-LRB- -LRB-) (NP (NN casper)) (-RRB- -RRB-)))
##               (CC or)
##               (NP
##                 (NP (DT the) (NN arthouse) (NN crowd))
##                 (PRN
##                   (-LRB- -LRB-)
##                   (NP (FW ghost) (NN world))
##                   (-RRB- -RRB-))))))))
##     (, ,)
##     (CC but)
##     (S
##       (NP (EX there))
##       (VP
##         (VBZ 's)
##         (ADVP (RB never))
##         (ADVP (RB really))
##         (VP
##           (VBN been)
##           (NP
##             (NP (DT a) (JJ comic) (NN book))
##             (PP (IN like) (PP (IN from) (NP (NN hell)))))
##           (ADVP (RB before)))))
##     (. .)))
##
## [[2]]
## (TOP
##   (S
##     (PP (IN for) (NP (NNS starters)))
##     (, ,)
##     (NP (PRP it))
##     (VP
##       (VBD was)
##       (VP
##         (VBN created)
##         (PP
##           (IN by)
##           (NP
##             (NP (JJR alan) (NN moore))
##             (PRN
##               (-LRB- -LRB-)
##               (CC and)
##               (VP (VB eddie) (NP (NN campbell)))
##               (-RRB- -RRB-))
##             (, ,)
##             (SBAR
##               (WHNP (WP who))
##               (S
##                 (VP
##                   (VBD brought)
##                   (NP (DT the) (NN medium))
##                   (PP
##                     (TO to)
##                     (NP
##                       (NP
##                         (NP (DT a) (JJ whole) (JJ new) (NN level))
##                         (PP
##                           (IN in)
##                           (NP (DT the) (JJ mid) (NNS '80s))))
##                       (PP
##                         (IN with)
##                         (NP
##                           (NP (DT a) (JJ 12-part) (NN series))
##                           (VP
##                             (VBN called)
##                             (S (NP (DT the) (NNS watchmen)))))))))))))))
##     (. .)))
##
## [[3]]
## (TOP
##   (VP
##     (TO to)
##     (VP
##       (VB say)
##       (SBAR
##         (S
##           (NP (NN moore) (CC and) (NN campbell))
##           (VP
##             (ADVP (RB thoroughly))
##             (VBD researched)
##             (NP
##               (NP (DT the) (NN subject))
##               (PP
##                 (IN of)
##                 (NP
##                   (NP (NN jack))
##                   (SBAR
##                     (S
##                       (NP (DT the) (NN ripper))
##                       (VP
##                         (MD would)
##                         (VP
##                           (VB be)
##                           (PP
##                             (IN like)
##                             (S
##                               (VP
##                                 (VBG saying)
##                                 (SBAR
##                                   (S
##                                     (NP (NN michael) (NN jackson))
##                                     (VP
##                                       (VBZ is)
##                                       (VP
##                                         (VBG starting)
##                                         (S
##                                           (VP
##                                             (TO to)
##                                             (VP
##                                               (VB look)
##                                               (NP
##                                                 (DT a)
##                                                 (JJ little)
##                                                 (JJ odd)))))))))))))))))))))))
##     (. .)))
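The bracketed notation can be inspected with plain string operations, because literal parentheses in the text are written as -LRB- and -RRB- and every remaining bracket belongs to the tree. A minimal sketch in base R on a short hand-written tree (not taken from the corpus) that measures the nesting depth and lists the leaves:

```r
# A short bracketed parse, written by hand in the same notation as above.
tree <- "(TOP (S (NP (PRP it)) (VP (VBD was) (ADJP (JJ good)))))"

# Nesting depth after each character: +1 for "(", -1 for ")".
chars <- strsplit(tree, "")[[1]]
depth <- cumsum((chars == "(") - (chars == ")"))
maxDepth <- max(depth)               # deepest level of nesting
balanced <- tail(depth, 1) == 0      # TRUE when every bracket is closed

# Leaves are the innermost "(TAG word)" pairs.
leaves <- regmatches(tree, gregexpr("\\([^() ]+ [^() ]+\\)", tree))[[1]]
leaves
```

The same depth count shows why the third sentence above looks so lopsided: the parser nests each clause inside the previous one instead of attaching it higher up.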