This post began as a group lab project in which our small group tried to determine house affiliations in the Harry Potter book corpus. We focused on students and faculty who are active at Hogwarts in the first book. Although we did not have time to work through this as a group, I continued with the lab on my own to develop my understanding of the concepts.
The four houses with which characters can be affiliated are Gryffindor, Hufflepuff, Ravenclaw, and Slytherin.
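First, I created the corpus from the harrypotter library:
# Packages used throughout this post
library(harrypotter)
library(quanteda)
library(stringr)
library(dplyr)
library(cleanNLP)
library(spacyr)
philosophers_stone_corpus <- corpus(philosophers_stone)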
philosophers_stone_summary <- summary(philosophers_stone_corpus)
philosophers_stone_summary$book <- "Philosopher's Stone"
philosophers_stone_summary$chapter <- as.numeric(str_extract(philosophers_stone_summary$Text, "[0-9]+"))
philosophers_stone_summary
## Corpus consisting of 17 documents, showing 17 documents:
##
## Text Types Tokens Sentences book chapter
## text1 1273 5643 349 Philosopher's Stone 1
## text2 1067 4128 237 Philosopher's Stone 2
## text3 1226 4630 297 Philosopher's Stone 3
## text4 1198 4761 321 Philosopher's Stone 4
## text5 1820 8372 563 Philosopher's Stone 5
## text6 1567 7949 566 Philosopher's Stone 6
## text7 1379 5445 351 Philosopher's Stone 7
## text8 1096 3594 198 Philosopher's Stone 8
## text9 1426 6131 410 Philosopher's Stone 9
## text10 1294 5207 334 Philosopher's Stone 10
## text11 1113 4152 276 Philosopher's Stone 11
## text12 1511 6729 447 Philosopher's Stone 12
## text13 1078 3930 261 Philosopher's Stone 13
## text14 1112 4354 308 Philosopher's Stone 14
## text15 1386 6437 459 Philosopher's Stone 15
## text16 1581 8277 591 Philosopher's Stone 16
## text17 1490 7101 506 Philosopher's Stone 17
docvars(philosophers_stone_corpus) <- philosophers_stone_summary
philosophers_stone_tokens <- tokens(philosophers_stone_corpus,
                                    remove_punct = TRUE,
                                    remove_numbers = TRUE)
print(philosophers_stone_tokens)
## Tokens consisting of 17 documents and 6 docvars.
## text1 :
## [1] "THE" "BOY" "WHO" "LIVED" "Mr" "and" "Mrs"
## [8] "Dursley" "of" "number" "four" "Privet"
## [ ... and 4,579 more ]
##
## text2 :
## [1] "THE" "VANISHING" "GLASS" "Nearly" "ten" "years"
## [7] "had" "passed" "since" "the" "Dursleys" "had"
## [ ... and 3,433 more ]
##
## text3 :
## [1] "THE" "LETTERS" "FROM" "NO" "ONE"
## [6] "The" "escape" "of" "the" "Brazilian"
## [11] "boa" "constrictor"
## [ ... and 3,827 more ]
##
## text4 :
## [1] "THE" "KEEPER" "OF" "THE" "KEYS" "BOOM" "They"
## [8] "knocked" "again" "Dudley" "jerked" "awake"
## [ ... and 3,674 more ]
##
## text5 :
## [1] "DIAGON" "ALLEY" "Harry" "woke" "early" "the"
## [7] "next" "morning" "Although" "he" "could" "tell"
## [ ... and 6,543 more ]
##
## text6 :
## [1] "THE" "JOURNEY" "FROM" "PLATFORM"
## [5] "NINE" "AND" "THREE-QUARTERS" "Harry's"
## [9] "last" "month" "with" "the"
## [ ... and 6,270 more ]
##
## [ reached max_ndoc ... 11 more documents ]
Next, I remove the English stopwords from my tokens object:
length(stopwords("en"))
## [1] 174
philosophers_stone_tokens <- tokens_select(philosophers_stone_tokens,
pattern = stopwords("en"),
selection = "remove")
length(philosophers_stone_tokens)
## [1] 17
print(philosophers_stone_tokens)
## Tokens consisting of 17 documents and 6 docvars.
## text1 :
## [1] "BOY" "LIVED" "Mr" "Mrs" "Dursley" "number"
## [7] "four" "Privet" "Drive" "proud" "say" "perfectly"
## [ ... and 2,308 more ]
##
## text2 :
## [1] "VANISHING" "GLASS" "Nearly" "ten" "years" "passed"
## [7] "since" "Dursleys" "woken" "find" "nephew" "front"
## [ ... and 1,783 more ]
##
## text3 :
## [1] "LETTERS" "ONE" "escape" "Brazilian" "boa"
## [6] "constrictor" "earned" "Harry" "longest-ever" "punishment"
## [11] "time" "allowed"
## [ ... and 2,059 more ]
##
## text4 :
## [1] "KEEPER" "KEYS" "BOOM" "knocked" "Dudley" "jerked"
## [7] "awake" "cannon" "said" "stupidly" "crash" "behind"
## [ ... and 1,972 more ]
##
## text5 :
## [1] "DIAGON" "ALLEY" "Harry" "woke" "early" "next"
## [7] "morning" "Although" "tell" "daylight" "kept" "eyes"
## [ ... and 3,702 more ]
##
## text6 :
## [1] "JOURNEY" "PLATFORM" "NINE" "THREE-QUARTERS"
## [5] "Harry's" "last" "month" "Dursleys"
## [9] "fun" "True" "Dudley" "now"
## [ ... and 3,284 more ]
##
## [ reached max_ndoc ... 11 more documents ]
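The output above shows that the upper-case "THE" in the chapter titles was removed along with the lower-case stopwords. A minimal sketch of why, assuming the default case-insensitive matching in tokens_select(); the toy sentence is my own and is not taken from the corpus:

```r
library(quanteda)

# Toy sentence (invented for illustration) with an upper-case stopword
toy_tokens <- tokens("THE boy who lived and THE owl", remove_punct = TRUE)

# Default matching ignores case, so "THE" is dropped with "who" and "and"
insensitive <- tokens_select(toy_tokens, pattern = stopwords("en"),
                             selection = "remove")

# Case-sensitive matching keeps "THE", because the stopword list is lower-case
sensitive <- tokens_select(toy_tokens, pattern = stopwords("en"),
                           selection = "remove", case_insensitive = FALSE)

as.list(insensitive)[[1]]
as.list(sensitive)[[1]]
```

This matters later in the post, where case is exactly what separates the Sorting Hat's shouted house names from ordinary mentions.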
If I engage the ‘stemming’ feature, I can see how many of the tokens are affected. For this text, a first review suggested that the stemmed tokens were less informative than the originals, so I will not use stemming here.
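For reference, a minimal sketch of how stemming could be checked on a small example, assuming quanteda's tokens_wordstem() with the English stemmer. The toy sentence and the object names are mine, not part of the lab:

```r
library(quanteda)

# Toy sentence, invented for illustration
toy_tokens <- tokens("The hat sorted the students into their houses",
                     remove_punct = TRUE)
stemmed <- tokens_wordstem(toy_tokens, language = "english")

original_words <- as.list(toy_tokens)[[1]]
stemmed_words  <- as.list(stemmed)[[1]]

# Side-by-side view of each token and its stem
data.frame(token = original_words, stem = stemmed_words)

# Number of tokens that stemming altered
n_changed <- sum(original_words != stemmed_words)
n_changed
```

Comparing the two columns makes it easy to judge whether the stems are still meaningful for the question at hand.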
# Next, pull the corpus as a character vector, which works with cleanNLP; a corpus object does not.
philosophers_stone_char_vector <- as.character(philosophers_stone_corpus)
# Return a data frame of the document-level variables
hp_data <- docvars(philosophers_stone_corpus)
# Add the text to the data frame for the annotation tool; the column must be named `text`
hp_data$text <- philosophers_stone_char_vector
Before I can use the cleanNLP package, I need to initialize a back end. I initialize the “spacyr” installation and then run “cnlp_init_udpipe()”, which sets udpipe as the back end for the annotations that follow:
spacy_initialize()
cnlp_init_udpipe()
# Now I can annotate the corpus:
annotated <- cnlp_annotate(hp_data)
## Processed document 10 of 17
head(annotated$token)
## # A tibble: 6 x 11
## doc_id sid tid token token_with_ws lemma upos xpos feats tid_source
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 1 THE "THE " the DET DT Definite=~ 2
## 2 1 1 2 BOY "BOY " Boy NOUN NN Number=Si~ 18
## 3 1 1 3 WHO "WHO " who PRON WP PronType=~ 4
## 4 1 1 4 LIVED "LIVED " live VERB VBD Mood=Ind|~ 2
## 5 1 1 5 Mr. "Mr. " Mr. PROPN NNP Number=Si~ 4
## 6 1 1 6 and "and " and CCONJ CC <NA> 7
## # ... with 1 more variable: relation <chr>
head(annotated$document)
## Text Types Tokens Sentences book chapter doc_id
## 1 text1 1273 5643 349 Philosopher's Stone 1 1
## 2 text2 1067 4128 237 Philosopher's Stone 2 2
## 3 text3 1226 4630 297 Philosopher's Stone 3 3
## 4 text4 1198 4761 321 Philosopher's Stone 4 4
## 5 text5 1820 8372 563 Philosopher's Stone 5 5
## 6 text6 1567 7949 566 Philosopher's Stone 6 6
# Join the document-level and token-level data frames into one data frame for analysis.
anno_hpdata <- left_join(annotated$document, annotated$token, by = "doc_id")
head(anno_hpdata)
## Text Types Tokens Sentences book chapter doc_id sid tid token
## 1 text1 1273 5643 349 Philosopher's Stone 1 1 1 1 THE
## 2 text1 1273 5643 349 Philosopher's Stone 1 1 1 2 BOY
## 3 text1 1273 5643 349 Philosopher's Stone 1 1 1 3 WHO
## 4 text1 1273 5643 349 Philosopher's Stone 1 1 1 4 LIVED
## 5 text1 1273 5643 349 Philosopher's Stone 1 1 1 5 Mr.
## 6 text1 1273 5643 349 Philosopher's Stone 1 1 1 6 and
## token_with_ws lemma upos xpos feats tid_source
## 1 THE the DET DT Definite=Def|PronType=Art 2
## 2 BOY Boy NOUN NN Number=Sing 18
## 3 WHO who PRON WP PronType=Rel 4
## 4 LIVED live VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 2
## 5 Mr. Mr. PROPN NNP Number=Sing 4
## 6 and and CCONJ CC <NA> 7
## relation
## 1 det
## 2 nsubj
## 3 nsubj
## 4 acl:relcl
## 5 obj
## 6 cc
Now I can really start to look for affiliations to the houses. I’ll start by exploring what the annotations offer. Filtering by part of speech, I find that the lemma “sort” appears 27 times as a noun. The same approach, applied to verbs, gives the most frequent verb lemmas:
anno_hpdata %>%
  filter(upos == "VERB") %>%
  group_by(lemma) %>%
  summarize(count = n()) %>%
  # Rank by count explicitly; ties at the cutoff can return more than 250 rows
  top_n(n = 250, wt = count) %>%
  arrange(desc(count))
## # A tibble: 254 x 2
## lemma count
## <chr> <int>
## 1 say 921
## 2 get 442
## 3 have 417
## 4 go 387
## 5 look 380
## 6 be 312
## 7 know 305
## 8 see 304
## 9 think 226
## 10 do 222
## # ... with 244 more rows
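To check how a single lemma such as “sort” is split across parts of speech, something like the following could be used. This is only a sketch: toy_anno is an invented stand-in for anno_hpdata, and its rows are not real counts from the book.

```r
library(dplyr)

# Toy stand-in for anno_hpdata; the rows are invented for illustration
toy_anno <- data.frame(
  lemma = c("sort", "sort", "sort", "say", "hat"),
  upos  = c("NOUN", "NOUN", "VERB", "VERB", "NOUN"),
  stringsAsFactors = FALSE
)

# Count how often the lemma "sort" occurs under each part-of-speech tag
sort_by_pos <- toy_anno %>%
  filter(lemma == "sort") %>%
  count(upos, name = "count") %>%
  arrange(desc(count))
sort_by_pos
```

Swapping toy_anno for the joined annotation data frame should give the noun and verb counts for the real text.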
Looking at keywords in context with the “kwic” function, it seems that the bulk of the conversation about sorting students into houses takes place in chapter 7:
sorted <- kwic(philosophers_stone_corpus, pattern = "sort*")
sorted %>%
group_by(docname)
## # A tibble: 44 x 7
## # Groups: docname [13]
## docname from to pre keyword post pattern
## <chr> <int> <int> <chr> <chr> <chr> <fct>
## 1 text2 2859 2859 "Behind the glass , a~ sorts of lizards and sn~ sort*
## 2 text4 757 757 ", and began taking a~ sorts of things out of ~ sort*
## 3 text4 4192 4192 "letters and he needs~ sorts of rubbish - spel~ sort*
## 4 text4 4281 4281 "with youngsters of h~ sort , fer a change , sort*
## 5 text5 2619 2619 "you . \" \" What" sort of magic do you t~ sort*
## 6 text5 5241 5241 "of him . He's a" sort of servant , isn'~ sort*
## 7 text5 5278 5278 ". I heard he's a" sort of savage - lives~ sort*
## 8 text5 5427 5427 "they should let the ~ sort in , do you ? sort*
## 9 text5 5818 5818 "and there's four bal~ sorta hard ter explain ~ sort*
## 10 text6 1883 1883 "one another in a dis~ sort of way over the b~ sort*
## # ... with 34 more rows
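Grouping alone does not show which chapter has the most matches, so a per-document tally helps. A minimal sketch on an invented two-document corpus follows. It passes a tokens object to kwic(), which I believe newer quanteda releases expect in place of a corpus; treat that as an assumption about your installed version.

```r
library(quanteda)
library(dplyr)

# Toy corpus, invented for illustration
toy_corpus <- corpus(c(
  text1 = "All sorts of lizards sat behind the glass.",
  text2 = "The Sorting Hat sorted them. The sorting took all evening."
))

# Glob pattern, case-insensitive by default
sort_hits <- kwic(tokens(toy_corpus), pattern = "sort*")

# Tally the matches in each document, most matches first
hits_per_doc <- as.data.frame(sort_hits) %>%
  count(docname, name = "matches") %>%
  arrange(desc(matches))
hits_per_doc
```

On the real corpus, the row for text7 in this kind of tally would confirm or contradict the impression that chapter 7 dominates.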
Looking at the names of the houses, starting with Gryffindor, I can first see the relevant context of the results. A pattern begins to emerge: when the Sorting Hat declares a student part of a house, the house name is in all capital letters.
kwic(philosophers_stone_corpus, pattern = "Gryffindor*")
## Keyword-in-context with 102 matches.
## [text6, 5835] and I hope I'm in | Gryffindor |
## [text6, 5961] " asked Harry." | Gryffindor |
## [text7, 296] The four houses are called | Gryffindor |
## [text7, 1418] . You might belong in | Gryffindor |
## [text7, 1435] nerve, and chivalry Set | Gryffindors |
## [text7, 1880] " became the first new | Gryffindor |
## [text7, 2050] the hat declared him a | Gryffindor |
## [text7, 2074] on her head." | GRYFFINDOR |
## [text7, 2190] it finally shouted," | GRYFFINDOR |
## [text7, 2522] you're sure - better be | GRYFFINDOR |
## [text7, 2548] and walked shakily toward the | Gryffindor |
## [text7, 2789] , joined Harry at the | Gryffindor |
## [text7, 2832] hat had shouted," | GRYFFINDOR |
## [text7, 3282] service. Resident ghost of | Gryffindor |
## [text7, 3450] ," So - new | Gryffindors |
## [text7, 3466] house championship this year? | Gryffindors |
## [text7, 4742] you trot!" The | Gryffindor |
## [text7, 5164] and found themselves in the | Gryffindor |
## [text8, 268] always happy to point new | Gryffindors |
## [text8, 1323] Professor McGonagall was head of | Gryffindor |
## [text8, 2382] point will be taken from | Gryffindor |
## [text8, 2396] Things didn't improve for the | Gryffindors |
## [text8, 2673] another point you've lost for | Gryffindor |
## [text8, 2744] He'd lost two points for | Gryffindor |
## [text9, 31] Malfoy. Still, first-year | Gryffindors |
## [text9, 65] notice pinned up in the | Gryffindor |
## [text9, 83] starting on Thursday - and | Gryffindor |
## [text9, 646] , who was passing the | Gryffindor |
## [text9, 756] Ron, and the other | Gryffindors |
## [text9, 2642] " Wood's captain of the | Gryffindor |
## [text9, 3544] of the points you'll lose | Gryffindor |
## [text9, 3765] staircase, and into the | Gryffindor |
## [text9, 3941] " Don't you care about | Gryffindor |
## [text9, 4057] Hermione was locked out of | Gryffindor |
## [text9, 4958] got to get back to | Gryffindor |
## [text10, 1443] the Keeper -I'm Keeper for | Gryffindor |
## [text10, 4539] of winning fifty points for | Gryffindor |
## [text10, 4875] points will be taken from | Gryffindor |
## [text10, 4903] you'd better get off to | Gryffindor |
## [text10, 4954] troll. You each win | Gryffindor |
## [text11, 91] after weeks of training: | Gryffindor |
## [text11, 96] Gryffindor versus Slytherin. If | Gryffindor |
## [text11, 515] me. Five points from | Gryffindor |
## [text11, 561] said Ron bitterly. The | Gryffindor |
## [text11, 881] take any more points from | Gryffindor |
## [text11, 1355] , had done a large | Gryffindor |
## [text11, 1508] This is the best team | Gryffindor's |
## [text11, 1752] immediately by Angelina Johnson of | Gryffindor |
## [text11, 1877] by an excellent move by | Gryffindor |
## [text11, 1882] Gryffindor Keeper Wood and the | Gryffindors |
## [text11, 1892] that's Chaser Katie Bell of | Gryffindor |
## [text11, 1965] - nice play by the | Gryffindor |
## [text11, 2016] Bletchley dives - misses - | GRYFFINDORS |
## [text11, 2020] - GRYFFINDORS SCORE!" | Gryffindor |
## [text11, 2492] of rage echoed from the | Gryffindors |
## [text11, 2523] Foul!" screamed the | Gryffindors |
## [text11, 2542] at the goal posts for | Gryffindor |
## [text11, 2719] . Flint nearly kills the | Gryffindor |
## [text11, 2735] , so a penalty to | Gryffindor |
## [text11, 2754] and we continue play, | Gryffindor |
## [text11, 2859] to turn back toward the | Gryffindor |
## [text11, 3693] happily shouting the results - | Gryffindor |
## [text12, 99] to start. While the | Gryffindor |
## [text12, 621] ." Five points from | Gryffindor |
## [text12, 3298] to the fire in the | Gryffindor |
## [text12, 3371] Fred and George all over | Gryffindor |
## [text13, 414] excuse to knock points off | Gryffindor |
## [text13, 569] headed straight back to the | Gryffindor |
## [text13, 736] If I back out, | Gryffindor |
## [text13, 797] all the way up to | Gryffindor |
## [text13, 963] brave enough to be in | Gryffindor |
## [text13, 1036] Sorting Hat chose you for | Gryffindor |
## [text13, 2481] they choose people for the | Gryffindor |
## [text13, 2914] won! We've won! | Gryffindor |
## [text13, 2975] lasted five minutes. As | Gryffindors |
## [text13, 3137] was a happy blur: | Gryffindors |
## [text13, 3193] in the setting sun. | Gryffindor |
## [text15, 500] . Potter, I thought | Gryffindor |
## [text15, 552] points will be taken from | Gryffindor |
## [text15, 635] never been more ashamed of | Gryffindor |
## [text15, 648] points lost. That put | Gryffindor |
## [text15, 661] , they'd ruined any chance | Gryffindor |
## [text15, 741] happen when the rest of | Gryffindor |
## [text15, 751] done? At first, | Gryffindors |
## [text16, 2210] up to something. And | Gryffindor |
## [text16, 2519] take another fifty points from | Gryffindor |
## [text16, 2786] and your families alone if | Gryffindor |
## [text16, 3066] them; none of the | Gryffindors |
## [text16, 3401] you'll be caught again. | Gryffindor |
## [text17, 286] Snape was trying to stop | Gryffindor |
## [text17, 1469] won the house cup for | Gryffindor |
## [text17, 5534] Ron and Hermione at the | Gryffindor |
## [text17, 5662] : In fourth place, | Gryffindor |
## [text17, 5843] many years, I award | Gryffindor |
## [text17, 5849] house fifty points." | Gryffindor |
## [text17, 5923] of fire, I award | Gryffindor |
## [text17, 5946] had burst into tears. | Gryffindors |
## [text17, 5992] outstanding courage, I award | Gryffindor |
## [text17, 6014] yelling themselves hoarse knew that | Gryffindor |
## [text17, 6138] noise that erupted from the | Gryffindor |
## [text17, 6179] much as a point for | Gryffindor |
## [text17, 6278] serpent vanished and a towering | Gryffindor |
##
## , it sounds by far
## ," said Ron.
## , Hufflepuff, Ravenclaw,
## , Where dwell the brave
## apart; You might belong
## , and the table on
## ." Granger, Hermione
## !" shouted the hat
## ," Neville ran off
## !" Harry heard the
## table. He was so
## table." Turpin,
## !" Harry clapped loudly
## Tower."" I
## ! I hope you're going
## have never gone so long
## first years followed Percy through
## common room, a cozy
## in the right direction,
## House, but it hadn't
## House for your cheek,
## as the Potions lesson continued
## ." This was so
## in his very first week
## only had Potions with the
## common room that made them
## and Slytherin would be learning
## table, snatched the Remembrall
## hurried down the front steps
## team," Professor McGonagall
## if you're caught, and
## common room. A few
## , do you only care
## tower." Now what
## tower," said Ron
## . I have to fly
## faded quickly from Harry's mind
## for this," said
## tower. Students are finishing
## five points. Professor Dumbledore
## versus Slytherin. If Gryffindor
## won, they would move
## ."" He's just
## common room was very noisy
## . He sprinted back upstairs
## lion underneath. Then Hermione
## had in years. We're
## - what an excellent Chaser
## Keeper Wood and the Gryffindors
## take the Quaffle - that's
## there, nice dive around
## Beater, anyway, and
## SCORE!" Gryffindor cheers
## cheers filled the cold air
## below - Marcus Flint had
## . Madam Hooch spoke angrily
## . But in all the
## Seeker, which could happen
## , taken by Spinner,
## still in possession."
## goal- posts - he had
## had won by one hundred
## common room and the Great
## , Weasley, and be
## common room, where Harry
## tower because they'd stolen his
## !" George Weasley really
## common room, where he
## can't play at all.
## tower. Everyone fell over
## , Malfoy's already done that
## , didn't it? And
## team?" said Malfoy
## is in the lead!
## came spilling onto the field
## running to lift him onto
## in the lead. He'd
## meant more to you than
## ."" Fifty?
## students." A hundred
## in last place. In
## had had for the house
## found out what they'd done
## passing the giant hourglasses that
## really can't afford to lose
## ! Yes, Weasley,
## wins the house cup?
## had anything to say to
## will be in even more
## from winning, he did
## ." Quirrell cursed again
## table and tried to ignore
## , with three hundred and
## house fifty points."
## cheers nearly raised the bewitched
## house fifty points."
## up and down the table
## house sixty points."
## now had four hundred and
## table. Harry, Ron
## before. Harry, still
## lion took its place.
kwic(
  philosophers_stone_corpus,
  pattern = "GRYFFINDOR",
  case_insensitive = FALSE
)
## Keyword-in-context with 4 matches.
## [text7, 2074] on her head." | GRYFFINDOR | !" shouted the hat
## [text7, 2190] it finally shouted," | GRYFFINDOR | ," Neville ran off
## [text7, 2522] you're sure - better be | GRYFFINDOR | !" Harry heard the
## [text7, 2832] hat had shouted," | GRYFFINDOR | !" Harry clapped loudly
kwic(
  philosophers_stone_corpus,
  pattern = "SLYTHERIN",
  case_insensitive = FALSE
)
## Keyword-in-context with 1 match.
## [text7, 2246] when it screamed," | SLYTHERIN | !" Malfoy went to
kwic(
  philosophers_stone_corpus,
  pattern = "HUFFLEPUFF",
  case_insensitive = FALSE
)
## Keyword-in-context with 3 matches.
## [text7, 1762] A moments pause -" | HUFFLEPUFF | !" shouted the hat
## [text7, 1808] , Susan!"" | HUFFLEPUFF | !" shouted the hat
## [text7, 1991] , Justin!"" | HUFFLEPUFF | !" Sometimes, Harry
kwic(
  philosophers_stone_corpus,
  pattern = "RAVENCLAW",
  case_insensitive = FALSE
)
## Keyword-in-context with 1 match.
## [text7, 1833] , Terry!"" | RAVENCLAW | !" The table second
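The four searches above could probably be collapsed into one call, since kwic() accepts a vector of patterns. A sketch under that assumption, using an invented toy corpus and a tokens object as input:

```r
library(quanteda)

houses <- c("GRYFFINDOR", "HUFFLEPUFF", "RAVENCLAW", "SLYTHERIN")

# Toy corpus, invented for illustration
toy_corpus <- corpus(c(
  text1 = "\"GRYFFINDOR!\" shouted the hat. She sat at the Gryffindor table.",
  text2 = "The hat screamed, \"SLYTHERIN!\" The Slytherin table cheered."
))

# Exact, case-sensitive matching keeps only the all-caps announcements
announcements <- kwic(tokens(toy_corpus), pattern = houses,
                      valuetype = "fixed", case_insensitive = FALSE)
announcements_df <- as.data.frame(announcements)

# Number of announcements per house
table(announcements_df$keyword)
```

The mixed-case mentions of the house tables are skipped, which is the behavior the separate searches relied on.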
Although this is very promising, I next need to find an effective way to look at the entire sentence for each of these declaratory sorting statements. There is much more for me to learn about natural language processing!
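One possible starting point, which I have not yet tried on the full book, is to reshape the corpus into sentences and keep those that contain an all-caps house name. The sketch below assumes quanteda's corpus_reshape() and a toy corpus of my own. Sentence splitting around quotation marks may separate a shout from its speaker tag, so the results would need checking by hand.

```r
library(quanteda)

# Toy corpus, invented for illustration
toy_corpus <- corpus(c(
  text1 = paste("The hat paused.",
                "\"HUFFLEPUFF!\" shouted the hat.",
                "Susan went to sit at the Hufflepuff table.")
))

# One document per sentence
toy_sentences <- corpus_reshape(toy_corpus, to = "sentences")
sentence_text <- as.character(toy_sentences)

# grepl() is case-sensitive by default, so only the shouted names match
house_pattern <- "GRYFFINDOR|HUFFLEPUFF|RAVENCLAW|SLYTHERIN"
declarations <- sentence_text[grepl(house_pattern, sentence_text)]
declarations
```

An alternative would be to group the annotated tokens by doc_id and sid and paste the tokens back together, which would reuse the sentence boundaries already found by the annotator.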