The goal of this document is to show how to perform different annotations (word, sentence, part-of-speech, and Penn Treebank parse) over text documents using the openNLP (natural language processing) and the tm (text mining) packages in R.
I cannot claim full authorship of this document, since I have taken code snippets from, and have been inspired by, multiple books and documents on the Web. Thanks to everyone for sharing.
Check the working directory with getwd. If it is not the one where your data are located, change it with setwd.
getwd()
## [1] "/Users/alvaro.arranz/Universidad/Intelligent Systems/HandsOn-2"
setwd("~/Universidad/Intelligent Systems/HandsOn-2")
Now we load the required libraries. Only a couple of things to mention:
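The annotate function used with openNLP (it is defined in the NLP package) may need to be called with an explicit package prefix (i.e., NLP::annotate) due to a name clash with ggplot2.
The memory allocated to Java has to be increased to avoid out-of-memory errors.
# Needed for OutOfMemoryError: Java heap space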
library(rJava)
.jinit(parameters="-Xmx4g")
# If there are more memory problems, invoke gc() after the POS tagging
# The openNLPmodels.en library is not in CRAN; it has to be installed from another repository
#install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at")
library(NLP)
library(openNLP)
library(openNLPmodels.en)
library(tm)
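The heap-size setting above only takes effect if the Java virtual machine has not been started yet in the session. A minimal sketch of an alternative, assuming rJava reads the java.parameters option when it first initialises the JVM (the 4 GB figure is simply the value used above, and jvm_options is an illustrative name):

```r
# Register the JVM options before any Java-backed package is loaded.
options(java.parameters = "-Xmx4g")

# Confirm that the option is set in this R session.
jvm_options <- getOption("java.parameters")
print(jvm_options)

# Java-backed packages should only be loaded after this point, e.g.:
# library(rJava)
# library(openNLP)
```

If the JVM is already running, restarting the R session before changing the value is the safest route.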
getAnnotationsFromDocument returns annotations for the text document: word, sentence, part-of-speech, and Penn Treebank parse annotations.
As an alternative, the koRpus package uses TreeTagger for POS tagging.
getAnnotationsFromDocument = function(doc){
x=as.String(doc)
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
y1 <- annotate(x, list(sent_token_annotator, word_token_annotator))
y2 <- annotate(x, pos_tag_annotator, y1)
parse_annotator <- Parse_Annotator()
y3 <- annotate(x, parse_annotator, y2)
return(y3)
}
getAnnotatedMergedDocument returns the text document merged with the annotations.
getAnnotatedMergedDocument = function(doc,annotations){
x=as.String(doc)
y2w <- subset(annotations, type == "word")
tags <- sapply(y2w$features, '[[', "POS")
r1 <- sprintf("%s/%s", x[y2w], tags)
r2 <- paste(r1, collapse = " ")
return(r2)
}
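The merging step can be tried in isolation. The sketch below uses two hand-written vectors (tokens and tags are illustrative names, with values copied from the start of the second sentence of the review) in place of the real annotations, so it runs without Java:

```r
# Toy tokens and tags standing in for x[y2w] and the POS features.
tokens <- c("one", "day", ",", "american", "movie")
tags   <- c("CD", "NN", ",", "JJ", "NN")

# Pair each token with its tag, then join the pairs into a single string.
tagged <- sprintf("%s/%s", tokens, tags)
merged <- paste(tagged, collapse = " ")
print(merged)
```

sprintf is vectorised, so no explicit loop over the tokens is needed.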
getAnnotatedPlainTextDocument returns the text document along with its annotations in an AnnotatedPlainTextDocument.
getAnnotatedPlainTextDocument = function(doc,annotations){
x=as.String(doc)
a = AnnotatedPlainTextDocument(x,annotations)
return(a)
}
We are going to use the Movie review data version 2.0, created by Bo Pang and Lillian Lee.
Once unzipped, the data splits the different documents into positive and negative opinions. In this script we are going to use the positive opinion file cv873_18636.txt located in txt_sentoken/pos.
source.pos = DirSource("./Corpus/review_polarity_small/txt_sentoken/pos", encoding = "UTF-8")
corpus = Corpus(source.pos)
Let’s take a look at the document in the first entry.
inspect(corpus[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 5503
##
## in roger michell's romantic comedy notting hill , william thacker ( hugh grant ) leads a rather dreary life maintaining his flagging travel bookshop in the quaint section of london which lends it's name to the film's title .
## one day , american movie superstar anna scott ( julia roberts ) walks in to purchase a book on turkey .
## quickly enamored of each other , the two embark upon an on-again , off-again love affair replete with romance , humor , and the occasional lump in the throat .
## the film opens with a non-verbal cue to anna's stardom as the title credits appear over a montage of slow motion sequences featuring the actress's appearances in films and at premieres - coming out of limousines , walking the red carpets and such .
## without words , this sequence gives us a background to her character .
## following , however , is a set-up narration by william indicating what he does and where he lives .
## i don't know why the filmmakers chose to go with a narration which tells us nothing we couldn't have figured out by watching the first ten minutes of film , and which never resurfaces after the movie's beginning , but there it is .
## if there were ever a clear case for " less is more , " this would be it .
## the film is told nearly first person from william's point of view , as he is in every scene .
## by nature of this arrangement , we get a very definite sense of what he is all about , and his nice guy personality wins us over easily .
## in fact , much of notting hill's strength lies in the great dialog written for this character by richard curtis .
## a scene where william is still in shock over the fact that he's even talking to a silver screen goddess is made golden by the way bumbles through his attempt to offer her some honey-soaked apricots from his refrigerator .
## or take an instance where anna kisses william and asks him never to tell anyone for fear of the incident hurting her image .
## william assures her he wouldn't say a word , then adds , " well , i'll probably tell myself now and then , but i'd never believe it . "
## great stuff .
## the downside to spending so much time with william is that we don't get to see enough of anna to make their relationship whole and plausible .
## we're constantly exposed to william's thoughts , feelings , actions and desires , but don't actually get the sense of how much anna really feels for him .
## there are a couple of instances where she declares her obvious interest , but they nearly come out of nowhere due to the fact that we're not sure what she's been thinking all the times in between .
## this , combined with the sheer iniquity of screen time between the two , makes this hugh grant's film hands down .
## he gets the great scenes ( look for one in which he has to portray an interviewer from horse and hound magazine in order to speak with anna ) , the great lines , and gives an overall wonderful performance .
## julia roberts fans will probably be disappointed by the actress's top billing and subsequent lack of involvement in the film ( ala sandra bullock in a time to kill ) along with her detached performance which is only worsened by her character's unpredictable behavior .
## anna doesn't get a lot of our compassion .
## this romantic comedy leans a little more toward the comedy than the romance , much of it supplied by grant himself , but with considerable help from the supporting cast .
## most notable is rhys ifans as spike , william's eccentric roommate , who is in the film for no other purpose than to make us laugh .
## hugh bonneville , emma chambers , james dreyfus , and gina mckee bring up the guard as william's friends and family , particularly shining in a scene where william brings anna to his sister's birthday dinner , and we get to see how these common folks react to the presence of a movie star in their midst .
## it's a scene most of us will probably think would play out in our own living rooms were we faced with a similar situation .
## roger michell's use of visuals doesn't sweep us off our feet , but does give us more than your typical movie of this type .
## for example , there are a couple of instances in this film where large amounts of time pass .
## whereas some films are content to simply put in a caption saying " eight months later , " michell presents us with more interesting cues , such as william's walk though his neighborhood while the seasons change around him .
## another memorable shot occurs in a park where the camera is lifted from ground level to a couple of hundred feet in the air .
## we're generally used to scenes where our point of view is lifted from the earth to treetop level or so , but in this case , the camera just keeps going up and up until we have a bird's eye view of the ground below .
## music is used rather glaringly as an enhancement to many of the film's scenes , and some of this might have been better toned down , but in other areas it works to full effect .
## it's kind of a mixed bag , but still fares better than many of today's lighthearted movies which are so influenced by the mtv fare that the film becomes one long music video .
## at least this film has some pretty good music that for the most part remains relevant and appropriate .
## notting hill's grant and roberts will not go down in history as one of the all-time greatest film pairings , but the chemistry is decent and the comedic aspects of the movie more than make up for it .
## for a couple of hours , you should expect to laugh more than cry , and that's not so bad , now is it ?
We just apply the getAnnotationsFromDocument function to every document in the corpus using lapply.
This step may take a long time, depending on the size of the corpus and on the annotations that we want to identify.
annotations = lapply(corpus, getAnnotationsFromDocument)
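Because this step can be slow, it may help to time it on a few documents first and to keep one failing document from aborting the whole run. The following is only a sketch: annotate_safely is a hypothetical helper, and toy_annotator is a stand-in for getAnnotationsFromDocument so that the example runs without Java.

```r
# Hypothetical helper: run an annotator on one document and return NULL
# (with a warning) instead of stopping the whole run when it fails.
annotate_safely <- function(doc, annotator) {
  tryCatch(
    annotator(doc),
    error = function(e) {
      warning("annotation failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in annotator: returns the number of characters, fails on empty input.
toy_annotator <- function(doc) {
  if (nchar(doc) == 0) stop("empty document")
  nchar(doc)
}

docs <- list("a short review", "", "another review")

# Time the run over the (toy) documents.
elapsed <- system.time(
  results <- suppressWarnings(
    lapply(docs, annotate_safely, annotator = toy_annotator)
  )
)

# Documents whose annotation failed come back as NULL.
failed <- vapply(results, is.null, logical(1))
print(failed)
```

With the real corpus, the same wrapper could be passed getAnnotationsFromDocument as the annotator argument.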
The first annotations are sentence annotations. They indicate where each sentence starts and where it ends. The constituents feature holds the ids of the tokens in the sentence (so its length is the number of tokens), and the parse feature holds the parse tree.
head(annotations[[1]])
## id type start end features
## 1 sentence 1 224 constituents=<<integer,42>>,
## parse=<<character,1>>
## 2 sentence 227 329 constituents=<<integer,21>>,
## parse=<<character,1>>
## 3 sentence 332 490 constituents=<<integer,30>>,
## parse=<<character,1>>
## 4 sentence 493 740 constituents=<<integer,46>>,
## parse=<<character,1>>
## 5 sentence 743 812 constituents=<<integer,13>>,
## parse=<<character,1>>
## 6 sentence 815 913 constituents=<<integer,19>>,
## parse=<<character,1>>
Word annotations are also defined. They indicate where each word starts, where it ends, and its part-of-speech tag.
tail(annotations[[1]])
## id type start end features
## 1122 word 5486 5488 POS=JJ
## 1123 word 5490 5490 POS=,
## 1124 word 5492 5494 POS=RB
## 1125 word 5496 5497 POS=VBZ
## 1126 word 5499 5500 POS=PRP
## 1127 word 5502 5502 POS=.
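Once the POS tags are available, a quick sanity check is to look at how often each tag occurs. The sketch below hard-codes the tags of the second sentence of the review so that it runs on its own; the comment shows how the same vector could be pulled from the real annotations.

```r
# POS tags of the second sentence of the review, as produced by the tagger.
# With the real annotations, an equivalent vector would come from:
#   sapply(subset(annotations[[1]], type == "word")$features, `[[`, "POS")
tags <- c("CD", "NN", ",", "JJ", "NN", "NN", "NN", "NN", "-LRB-", "NN",
          "NNS", "-RRB-", "VBZ", "IN", "TO", "VB", "DT", "NN", "IN", "NN", ".")

# Count how often each tag occurs, most frequent first.
tag_counts <- sort(table(tags), decreasing = TRUE)
print(head(tag_counts, 3))
```

An unexpectedly skewed distribution (for instance, almost no proper nouns in lowercased text like this one) is a hint that the tagger is struggling with the input.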
We can create AnnotatedPlainTextDocument objects that attach the annotations to each document, and store the annotated corpus in another variable (since this destroys the corpus metadata).
corpus.tagged = Map(getAnnotatedPlainTextDocument, corpus, annotations)
corpus.tagged[[1]]
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 1127
## Content: chars: 5503
We can also store all the annotations inline with the text and store the annotated corpus in another variable (since we destroy the corpus metadata).
corpus.taggedText = Map(getAnnotatedMergedDocument, corpus, annotations)
corpus.taggedText[[1]]
## [1] "in/IN roger/NN michell/NN 's/POS romantic/JJ comedy/NN notting/NN hill/NN ,/, william/IN thacker/NN (/-LRB- hugh/JJ grant/NN )/-RRB- leads/VBZ a/DT rather/RB dreary/JJ life/NN maintaining/VBG his/PRP$ flagging/JJ travel/NN bookshop/NN in/IN the/DT quaint/JJ section/NN of/IN london/PRP$ which/WDT lends/VBZ it/PRP 's/VBZ name/NN to/TO the/DT film/NN 's/POS title/NN ./. one/CD day/NN ,/, american/JJ movie/NN superstar/NN anna/NN scott/NN (/-LRB- julia/NN roberts/NNS )/-RRB- walks/VBZ in/IN to/TO purchase/VB a/DT book/NN on/IN turkey/NN ./. quickly/RB enamored/VBD of/IN each/DT other/JJ ,/, the/DT two/CD embark/NN upon/IN an/DT on-again/JJ ,/, off-again/JJ love/NN affair/NN replete/NN with/IN romance/NN ,/, humor/NN ,/, and/CC the/DT occasional/JJ lump/NN in/IN the/DT throat/NN ./. the/DT film/NN opens/VBZ with/IN a/DT non-verbal/JJ cue/NN to/TO anna/VB 's/POS stardom/NN as/IN the/DT title/NN credits/NNS appear/VBP over/IN a/DT montage/NN of/IN slow/JJ motion/NN sequences/NNS featuring/VBG the/DT actress/NN 's/POS appearances/NNS in/IN films/NNS and/CC at/IN premieres/NNS -/: coming/VBG out/IN of/IN limousines/NNS ,/, walking/VBG the/DT red/JJ carpets/NNS and/CC such/JJ ./. without/IN words/NNS ,/, this/DT sequence/NN gives/VBZ us/PRP a/DT background/NN to/TO her/PRP$ character/NN ./. following/VBG ,/, however/RB ,/, is/VBZ a/DT set-up/NN narration/NN by/IN william/NN indicating/VBG what/WP he/PRP does/VBZ and/CC where/WRB he/PRP lives/VBZ ./. i/PRP do/VBP n't/RB know/VB why/WRB the/DT filmmakers/NNS chose/VBD to/TO go/VB with/IN a/DT narration/NN which/WDT tells/VBZ us/PRP nothing/NN we/PRP could/MD n't/RB have/VB figured/VBN out/RP by/IN watching/VBG the/DT first/JJ ten/CD minutes/NNS of/IN film/NN ,/, and/CC which/WDT never/RB resurfaces/VBZ after/IN the/DT movie/NN 's/POS beginning/NN ,/, but/CC there/RB it/PRP is/VBZ ./. if/IN there/EX were/VBD ever/RB a/DT clear/JJ case/NN for/IN \"/`` less/JJR is/VBZ more/JJR ,/, \"/'' this/DT would/MD be/VB it/PRP ./. 
the/DT film/NN is/VBZ told/VBN nearly/RB first/JJ person/NN from/IN william/NN 's/POS point/NN of/IN view/NN ,/, as/IN he/PRP is/VBZ in/IN every/DT scene/NN ./. by/IN nature/NN of/IN this/DT arrangement/NN ,/, we/PRP get/VBP a/DT very/RB definite/JJ sense/NN of/IN what/WP he/PRP is/VBZ all/DT about/IN ,/, and/CC his/PRP$ nice/JJ guy/NN personality/NN wins/VBZ us/PRP over/RB easily/RB ./. in/IN fact/NN ,/, much/JJ of/IN notting/NN hill/NN 's/POS strength/NN lies/VBZ in/IN the/DT great/JJ dialog/NN written/VBN for/IN this/DT character/NN by/IN richard/JJ curtis/NN ./. a/DT scene/NN where/WRB william/NN is/VBZ still/RB in/IN shock/NN over/IN the/DT fact/NN that/IN he/PRP 's/VBZ even/RB talking/VBG to/TO a/DT silver/NN screen/NN goddess/NN is/VBZ made/VBN golden/JJ by/IN the/DT way/NN bumbles/NNS through/IN his/PRP$ attempt/NN to/TO offer/VB her/PRP$ some/DT honey-soaked/JJ apricots/NNS from/IN his/PRP$ refrigerator/NN ./. or/CC take/VB an/DT instance/NN where/WRB anna/DT kisses/NNS william/, and/CC asks/VBZ him/PRP never/RB to/TO tell/VB anyone/NN for/IN fear/NN of/IN the/DT incident/NN hurting/VBG her/PRP$ image/NN ./. william/NN assures/VBZ her/PRP$ he/PRP would/MD n't/RB say/VB a/DT word/NN ,/, then/RB adds/VBZ ,/, \"/`` well/UH ,/, i/FW 'll/MD probably/RB tell/VB myself/PRP now/RB and/CC then/RB ,/, but/CC i/PRP 'd/MD never/RB believe/VB it/PRP ./. \"/`` great/JJ stuff/NN ./. the/DT downside/NN to/TO spending/VBG so/RB much/JJ time/NN with/IN william/NN is/VBZ that/IN we/PRP do/VBP n't/RB get/VB to/TO see/VB enough/RB of/IN anna/NN to/TO make/VB their/PRP$ relationship/NN whole/JJ and/CC plausible/JJ ./. we/PRP 're/VBP constantly/RB exposed/VBN to/TO william/MD 's/POS thoughts/NNS ,/, feelings/NNS ,/, actions/NNS and/CC desires/NNS ,/, but/CC do/VBP n't/RB actually/RB get/VB the/DT sense/NN of/IN how/WRB much/JJ anna/NN really/RB feels/VBZ for/IN him/PRP ./. 
there/EX are/VBP a/DT couple/NN of/IN instances/NNS where/WRB she/PRP declares/VBZ her/PRP$ obvious/JJ interest/NN ,/, but/CC they/PRP nearly/RB come/VBP out/RB of/IN nowhere/RB due/JJ to/TO the/DT fact/NN that/IN we/PRP 're/VBP not/RB sure/JJ what/WP she/PRP 's/VBZ been/VBN thinking/VBG all/PDT the/DT times/NNS in/IN between/IN ./. this/DT ,/, combined/VBN with/IN the/DT sheer/JJ iniquity/NN of/IN screen/NN time/NN between/IN the/DT two/CD ,/, makes/VBZ this/DT hugh/JJ grant/NN 's/POS film/NN hands/NNS down/RB ./. he/PRP gets/VBZ the/DT great/JJ scenes/NNS (/-LRB- look/NN for/IN one/CD in/IN which/WDT he/PRP has/VBZ to/TO portray/VB an/DT interviewer/NN from/IN horse/NN and/CC hound/NN magazine/NN in/IN order/NN to/TO speak/VB with/IN anna/DT )/-RRB- ,/, the/DT great/JJ lines/NNS ,/, and/CC gives/VBZ an/DT overall/JJ wonderful/JJ performance/NN ./. julia/NNP roberts/NNS fans/NNS will/MD probably/RB be/VB disappointed/VBN by/IN the/DT actress/NN 's/POS top/JJ billing/NN and/CC subsequent/JJ lack/NN of/IN involvement/NN in/IN the/DT film/NN (/-LRB- ala/JJR sandra/NN bullock/NN in/IN a/DT time/NN to/TO kill/VB )/-RRB- along/RB with/IN her/PRP$ detached/JJ performance/NN which/WDT is/VBZ only/RB worsened/VBN by/IN her/PRP$ character/NN 's/POS unpredictable/JJ behavior/NN ./. anna/VB does/VBZ n't/RB get/VB a/DT lot/NN of/IN our/PRP$ compassion/NN ./. this/DT romantic/JJ comedy/NN leans/VBZ a/DT little/RB more/RBR toward/IN the/DT comedy/NN than/IN the/DT romance/NN ,/, much/RB of/IN it/PRP supplied/VBD by/IN grant/NN himself/PRP ,/, but/CC with/IN considerable/JJ help/NN from/IN the/DT supporting/JJ cast/NN ./. most/RBS notable/JJ is/VBZ rhys/VBG ifans/NNS as/IN spike/NN ,/, william/PRP 's/VBZ eccentric/JJ roommate/NN ,/, who/WP is/VBZ in/IN the/DT film/NN for/IN no/DT other/JJ purpose/NN than/IN to/TO make/VB us/PRP laugh/VB ./. 
hugh/JJ bonneville/NN ,/, emma/NN chambers/NNS ,/, james/NNS dreyfus/VBD ,/, and/CC gina/NN mckee/NN bring/VBP up/RP the/DT guard/NN as/IN william/NN 's/POS friends/NNS and/CC family/NN ,/, particularly/RB shining/VBG in/IN a/DT scene/NN where/WRB william/NN brings/VBZ anna/VB to/TO his/PRP$ sister/NN 's/POS birthday/NN dinner/NN ,/, and/CC we/PRP get/VBP to/TO see/VB how/WRB these/DT common/JJ folks/NNS react/VBP to/TO the/DT presence/NN of/IN a/DT movie/NN star/NN in/IN their/PRP$ midst/NN ./. it/PRP 's/VBZ a/DT scene/NN most/RBS of/IN us/PRP will/MD probably/RB think/VB would/MD play/VB out/RP in/IN our/PRP$ own/JJ living/NN rooms/NNS were/VBD we/PRP faced/VBN with/IN a/DT similar/JJ situation/NN ./. roger/NN michell/NN 's/POS use/NN of/IN visuals/NNS does/VBZ n't/RB sweep/VB us/PRP off/IN our/PRP$ feet/NNS ,/, but/CC does/VBZ give/VB us/PRP more/JJR than/IN your/PRP$ typical/JJ movie/NN of/IN this/DT type/NN ./. for/IN example/NN ,/, there/EX are/VBP a/DT couple/NN of/IN instances/NNS in/IN this/DT film/NN where/WRB large/JJ amounts/NNS of/IN time/NN pass/NN ./. whereas/IN some/DT films/NNS are/VBP content/JJ to/TO simply/RB put/VB in/IN a/DT caption/NN saying/VBG \"/`` eight/CD months/NNS later/RB ,/, \"/`` michell/NN presents/VBZ us/PRP with/IN more/RBR interesting/JJ cues/NNS ,/, such/JJ as/IN william/NN 's/POS walk/NN though/IN his/PRP$ neighborhood/NN while/IN the/DT seasons/NNS change/NN around/IN him/PRP ./. another/DT memorable/JJ shot/NN occurs/VBZ in/IN a/DT park/NN where/WRB the/DT camera/NN is/VBZ lifted/VBN from/IN ground/NN level/NN to/TO a/DT couple/NN of/IN hundred/CD feet/NNS in/IN the/DT air/NN ./. 
we/PRP 're/VBP generally/RB used/VBN to/TO scenes/NNS where/WRB our/PRP$ point/NN of/IN view/NN is/VBZ lifted/VBN from/IN the/DT earth/NN to/TO treetop/VB level/NN or/CC so/RB ,/, but/CC in/IN this/DT case/NN ,/, the/DT camera/NN just/RB keeps/VBZ going/VBG up/RB and/CC up/RB until/IN we/PRP have/VBP a/DT bird/NN 's/POS eye/NN view/NN of/IN the/DT ground/NN below/RB ./. music/NN is/VBZ used/VBN rather/RB glaringly/RB as/IN an/DT enhancement/NN to/TO many/JJ of/IN the/DT film/NN 's/POS scenes/NNS ,/, and/CC some/DT of/IN this/DT might/MD have/VB been/VBN better/RB toned/VBN down/RB ,/, but/CC in/IN other/JJ areas/NNS it/PRP works/VBZ to/TO full/JJ effect/NN ./. it/PRP 's/VBZ kind/NN of/IN a/DT mixed/JJ bag/NN ,/, but/CC still/RB fares/NNS better/JJR than/IN many/JJ of/IN today/NN 's/POS lighthearted/JJ movies/NNS which/WDT are/VBP so/RB influenced/VBN by/IN the/DT mtv/NN fare/NN that/IN the/DT film/NN becomes/VBZ one/CD long/JJ music/NN video/NN ./. at/IN least/JJS this/DT film/NN has/VBZ some/DT pretty/RB good/JJ music/NN that/IN for/IN the/DT most/JJS part/NN remains/VBZ relevant/JJ and/CC appropriate/JJ ./. notting/NN hill/NN 's/POS grant/NN and/CC roberts/NNS will/MD not/RB go/VB down/RP in/IN history/NN as/IN one/CD of/IN the/DT all-time/JJ greatest/JJS film/NN pairings/NNS ,/, but/CC the/DT chemistry/NN is/VBZ decent/JJ and/CC the/DT comedic/JJ aspects/NNS of/IN the/DT movie/NN more/RBR than/IN make/VB up/RP for/IN it/PRP ./. for/IN a/DT couple/NN of/IN hours/NNS ,/, you/PRP should/MD expect/VB to/TO laugh/VB more/JJR than/IN cry/NN ,/, and/CC that/DT 's/VBZ not/RB so/RB bad/JJ ,/, now/RB is/VBZ it/PRP ?/."
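The inline word/TAG format is easy to read but awkward to query. As a sketch, the string can be split back into a two-column data frame with base R; tagged_text here is a short hand-written sample, and tagged_df is an illustrative name.

```r
# A short sample in the same word/TAG format as the output above.
tagged_text <- "in/IN roger/NN michell/NN 's/POS his/PRP$ ,/, ./."

# Split on whitespace, then cut each pair at its last slash.
pairs <- strsplit(tagged_text, "\\s+")[[1]]
tagged_df <- data.frame(
  word = sub("/[^/]*$", "", pairs),  # everything before the last slash
  tag  = sub("^.*/", "", pairs),     # everything after the last slash
  stringsAsFactors = FALSE
)
print(tagged_df)
```

Cutting at the last slash rather than the first keeps tokens that themselves contain a slash intact.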
There are functions for accessing parts of an AnnotatedPlainTextDocument.
doc = corpus.tagged[[1]]
doc
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 1127
## Content: chars: 5503
For accessing the text representation of the document.
as.character(doc)
## in roger michell's romantic comedy notting hill , william thacker ( hugh grant ) leads a rather dreary life maintaining his flagging travel bookshop in the quaint section of london which lends it's name to the film's title .
## one day , american movie superstar anna scott ( julia roberts ) walks in to purchase a book on turkey .
## quickly enamored of each other , the two embark upon an on-again , off-again love affair replete with romance , humor , and the occasional lump in the throat .
## the film opens with a non-verbal cue to anna's stardom as the title credits appear over a montage of slow motion sequences featuring the actress's appearances in films and at premieres - coming out of limousines , walking the red carpets and such .
## without words , this sequence gives us a background to her character .
## following , however , is a set-up narration by william indicating what he does and where he lives .
## i don't know why the filmmakers chose to go with a narration which tells us nothing we couldn't have figured out by watching the first ten minutes of film , and which never resurfaces after the movie's beginning , but there it is .
## if there were ever a clear case for " less is more , " this would be it .
## the film is told nearly first person from william's point of view , as he is in every scene .
## by nature of this arrangement , we get a very definite sense of what he is all about , and his nice guy personality wins us over easily .
## in fact , much of notting hill's strength lies in the great dialog written for this character by richard curtis .
## a scene where william is still in shock over the fact that he's even talking to a silver screen goddess is made golden by the way bumbles through his attempt to offer her some honey-soaked apricots from his refrigerator .
## or take an instance where anna kisses william and asks him never to tell anyone for fear of the incident hurting her image .
## william assures her he wouldn't say a word , then adds , " well , i'll probably tell myself now and then , but i'd never believe it . "
## great stuff .
## the downside to spending so much time with william is that we don't get to see enough of anna to make their relationship whole and plausible .
## we're constantly exposed to william's thoughts , feelings , actions and desires , but don't actually get the sense of how much anna really feels for him .
## there are a couple of instances where she declares her obvious interest , but they nearly come out of nowhere due to the fact that we're not sure what she's been thinking all the times in between .
## this , combined with the sheer iniquity of screen time between the two , makes this hugh grant's film hands down .
## he gets the great scenes ( look for one in which he has to portray an interviewer from horse and hound magazine in order to speak with anna ) , the great lines , and gives an overall wonderful performance .
## julia roberts fans will probably be disappointed by the actress's top billing and subsequent lack of involvement in the film ( ala sandra bullock in a time to kill ) along with her detached performance which is only worsened by her character's unpredictable behavior .
## anna doesn't get a lot of our compassion .
## this romantic comedy leans a little more toward the comedy than the romance , much of it supplied by grant himself , but with considerable help from the supporting cast .
## most notable is rhys ifans as spike , william's eccentric roommate , who is in the film for no other purpose than to make us laugh .
## hugh bonneville , emma chambers , james dreyfus , and gina mckee bring up the guard as william's friends and family , particularly shining in a scene where william brings anna to his sister's birthday dinner , and we get to see how these common folks react to the presence of a movie star in their midst .
## it's a scene most of us will probably think would play out in our own living rooms were we faced with a similar situation .
## roger michell's use of visuals doesn't sweep us off our feet , but does give us more than your typical movie of this type .
## for example , there are a couple of instances in this film where large amounts of time pass .
## whereas some films are content to simply put in a caption saying " eight months later , " michell presents us with more interesting cues , such as william's walk though his neighborhood while the seasons change around him .
## another memorable shot occurs in a park where the camera is lifted from ground level to a couple of hundred feet in the air .
## we're generally used to scenes where our point of view is lifted from the earth to treetop level or so , but in this case , the camera just keeps going up and up until we have a bird's eye view of the ground below .
## music is used rather glaringly as an enhancement to many of the film's scenes , and some of this might have been better toned down , but in other areas it works to full effect .
## it's kind of a mixed bag , but still fares better than many of today's lighthearted movies which are so influenced by the mtv fare that the film becomes one long music video .
## at least this film has some pretty good music that for the most part remains relevant and appropriate .
## notting hill's grant and roberts will not go down in history as one of the all-time greatest film pairings , but the chemistry is decent and the comedic aspects of the movie more than make up for it .
## for a couple of hours , you should expect to laugh more than cry , and that's not so bad , now is it ?
For accessing its words.
head(words(doc))
## [1] "in" "roger" "michell" "'s" "romantic" "comedy"
For accessing its sentences.
head(sents(doc),2)
## [[1]]
## [1] "in" "roger" "michell" "'s" "romantic"
## [6] "comedy" "notting" "hill" "," "william"
## [11] "thacker" "(" "hugh" "grant" ")"
## [16] "leads" "a" "rather" "dreary" "life"
## [21] "maintaining" "his" "flagging" "travel" "bookshop"
## [26] "in" "the" "quaint" "section" "of"
## [31] "london" "which" "lends" "it" "'s"
## [36] "name" "to" "the" "film" "'s"
## [41] "title" "."
##
## [[2]]
## [1] "one" "day" "," "american" "movie"
## [6] "superstar" "anna" "scott" "(" "julia"
## [11] "roberts" ")" "walks" "in" "to"
## [16] "purchase" "a" "book" "on" "turkey"
## [21] "."
For accessing its tagged words.
head(tagged_words(doc))
## in/IN
## roger/NN
## michell/NN
## 's/POS
## romantic/JJ
## comedy/NN
For accessing its tagged sentences.
head(tagged_sents(doc),2)
## [[1]]
## in/IN
## roger/NN
## michell/NN
## 's/POS
## romantic/JJ
## comedy/NN
## notting/NN
## hill/NN
## ,/,
## william/IN
## thacker/NN
## (/-LRB-
## hugh/JJ
## grant/NN
## )/-RRB-
## leads/VBZ
## a/DT
## rather/RB
## dreary/JJ
## life/NN
## maintaining/VBG
## his/PRP$
## flagging/JJ
## travel/NN
## bookshop/NN
## in/IN
## the/DT
## quaint/JJ
## section/NN
## of/IN
## london/PRP$
## which/WDT
## lends/VBZ
## it/PRP
## 's/VBZ
## name/NN
## to/TO
## the/DT
## film/NN
## 's/POS
## title/NN
## ./.
##
## [[2]]
## one/CD
## day/NN
## ,/,
## american/JJ
## movie/NN
## superstar/NN
## anna/NN
## scott/NN
## (/-LRB-
## julia/NN
## roberts/NNS
## )/-RRB-
## walks/VBZ
## in/IN
## to/TO
## purchase/VB
## a/DT
## book/NN
## on/IN
## turkey/NN
## ./.
For accessing the parse trees of its sentences.
head(parsed_sents(doc),2)
## [[1]]
## (TOP
## (S
## (PP
## (IN in)
## (NP
## (NP (NN roger) (NN michell) (POS 's))
## (JJ romantic)
## (NN comedy)
## (NN notting)
## (NN hill)))
## (, ,)
## (PP
## (IN william)
## (NP
## (NP (NN thacker))
## (PRN (-LRB- -LRB-) (NP (JJ hugh) (NN grant)) (-RRB- -RRB-))))
## (VP
## (VBZ leads)
## (S
## (NP (DT a) (ADJP (RB rather) (JJ dreary)) (NN life))
## (VP
## (VBG maintaining)
## (NP (PRP$ his) (JJ flagging) (NN travel) (NN bookshop))
## (PP
## (IN in)
## (NP
## (NP (DT the) (JJ quaint) (NN section))
## (PP
## (IN of)
## (NP
## (NP (NN london))
## (SBAR
## (WHNP (WDT which))
## (S
## (VP
## (VBZ lends)
## (SBAR
## (S
## (NP (PRP it))
## (VP
## (VBZ 's)
## (NP (NN name))
## (PP
## (TO to)
## (NP
## (NP (DT the) (NN film) (POS 's))
## (NN title))))))))))))))))
## (. .)))
##
## [[2]]
## (TOP
## (S
## (NP (CD one) (NN day))
## (, ,)
## (NP (DT american) (NN movie) (NN superstar))
## (VP
## (VB anna)
## (S
## (NP
## (NP (NN scott))
## (PRN
## (-LRB- -LRB-)
## (NP (NNP julia) (NNS roberts))
## (-RRB- -RRB-)))
## (VP
## (VBZ walks)
## (ADVP (IN in))
## (S
## (VP
## (TO to)
## (VP
## (VB purchase)
## (NP (DT a) (NN book))
## (PP (IN on) (NP (NN turkey)))))))))
## (. .)))
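The bracketed form of a parse tree is plain text (it is stored as a character string in the parse feature of each sentence annotation), so simple queries can be run with regular expressions. The sketch below pulls the (TAG word) leaves out of an abridged, hand-copied version of the second tree; leaf_pattern and the other names are illustrative.

```r
# Abridged bracketed parse of the second sentence (verb phrase left out).
tree <- paste(
  "(TOP (S (NP (CD one) (NN day)) (, ,)",
  "(NP (DT american) (NN movie) (NN superstar)) (. .)))"
)

# A leaf is a parenthesised pair with no nested parentheses: (TAG word).
leaf_pattern <- "\\(([^() ]+) ([^() ]+)\\)"
leaves <- regmatches(tree, gregexpr(leaf_pattern, tree))[[1]]

# Separate the tag and the word of every leaf.
leaf_tags  <- sub(leaf_pattern, "\\1", leaves)
leaf_words <- sub(leaf_pattern, "\\2", leaves)
print(data.frame(tag = leaf_tags, word = leaf_words))
```

This only recovers the leaves; walking the nested phrase structure needs a proper tree representation such as the one returned by parsed_sents.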
In this section, we check the tagger's results for the first two sentences against the Penn Treebank tagset. From this comparison we can calculate the precision for each sentence and the overall recall.
| WORD | POS Tag By Me | POS Tag Penn Treebank |
|---|---|---|
| in | IN | IN |
| roger | NN | NP |
| michell | NN | NP |
| ’s | POS | POS |
| romantic | JJ | JJ |
| comedy | NN | JJ |
| notting | NN | NP |
| hill | NN | NP |
| , | , | , |
| william | IN | NP |
| thacker | NN | NP |
| ( | -LRB- | ( |
| hugh | JJ | NP |
| grant | NN | NP |
| ) | -RRB- | ) |
| leads | VBZ | VBZ |
| a | DT | DT |
| rather | RB | RB |
| dreary | JJ | JJ |
| life | NN | NN |
| maintaining | VBG | VBG |
| his | PRP$ | PP$ |
| flagging | JJ | JJ |
| travel | NN | JJ |
| bookshop | NN | NN |
| in | IN | IN |
| the | DT | DT |
| quaint | JJ | JJ |
| section | NN | NN |
| of | IN | IN |
| london | PRP$ | NP |
| which | WDT | WDT |
| lends | VBZ | VBZ |
| it | PRP | PP |
| ’s | VBZ | VBZ |
| name | NN | NN |
| to | TO | IN |
| the | DT | DT |
| film | NN | NN |
| ’s | POS | POS |
| title | NN | NN |

| WORD | POS Tag By Me | POS Tag Penn Treebank |
|---|---|---|
| one | CD | CD |
| day | NN | NN |
| , | , | , |
| american | JJ | JJ |
| movie | NN | NN |
| superstar | NN | NN |
| anna | NN | NP |
| scott | NN | NP |
| ( | -LRB- | ( |
| julia | NN | NP |
| roberts | NNS | NP |
| ) | -RRB- | ) |
| walks | VBZ | VBZ |
| in | IN | IN |
| to | TO | TO |
| purchase | VB | VB |
| a | DT | DT |
| book | NN | NN |
| on | IN | IN |
| turkey | NN | NN |
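# Tokens compared per sentence (the final period is left out): 41 and 20.
# Mismatches against the reference tags in the tables above: 16 and 6.
metrics <- data.frame(matrix(NA, ncol = 3, nrow = 1))
names(metrics) <- c("PrecisionSentence_1", "PrecisionSentence_2", "Recall")
metrics$PrecisionSentence_1 = (41 - 16) / 41
metrics$PrecisionSentence_2 = (20 - 6) / 20
# Correctly tagged tokens in both sentences over all the words in the document
metrics$Recall = (61 - 22) / length(words(doc))
metrics
## PrecisionSentence_1 PrecisionSentence_2 Recall
## 1 0.6097561 0.7 0.03568161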
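Counting mismatches by hand is error-prone, so the per-sentence figure can also be computed from the two tag columns directly. The sketch below copies both columns of the second table into vectors (tagger and reference are illustrative names) and compares them position by position.

```r
# Second sentence: tags produced by the tagger and the reference tags,
# copied from the table above (final period left out).
tagger <- c("CD", "NN", ",", "JJ", "NN", "NN", "NN", "NN", "-LRB-", "NN",
            "NNS", "-RRB-", "VBZ", "IN", "TO", "VB", "DT", "NN", "IN", "NN")
reference <- c("CD", "NN", ",", "JJ", "NN", "NN", "NP", "NP", "(", "NP",
               "NP", ")", "VBZ", "IN", "TO", "VB", "DT", "NN", "IN", "NN")

# Position-by-position agreement between the two columns.
matches <- tagger == reference
mismatches <- sum(!matches)
precision_sentence_2 <- mean(matches)
print(c(mismatches = mismatches, precision = precision_sentence_2))
```

Note that some of the mismatches are differences in notation only (for example -LRB- versus an opening parenthesis), so the result depends on how strictly the two tag columns are compared.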