Lab Session Project

This post began as a lab project in which our small group set out to determine house affiliations in the Harry Potter book corpus. We focused on students and faculty who are active at Hogwarts in the first book. We did not have time to work through the project together, so I continued the lab on my own to develop my understanding of the concepts.

The four houses with which characters can be affiliated are Gryffindor, Hufflepuff, Ravenclaw, and Slytherin.

First, I created the corpus from the harrypotter library:

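library(harrypotter)  # provides philosophers_stone
library(quanteda)     # corpus(), tokens(), kwic()
library(stringr)      # str_extract()
library(dplyr)        # %>%, filter(), left_join()
library(spacyr)       # spacy_initialize()
library(cleanNLP)     # cnlp_init_udpipe(), cnlp_annotate()

philosophers_stone_corpus <- corpus(philosophers_stone)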
philosophers_stone_summary <- summary(philosophers_stone_corpus) 
philosophers_stone_summary$book <- "Philosopher's Stone"
philosophers_stone_summary$chapter <- as.numeric(str_extract(philosophers_stone_summary$Text, "[0-9]+"))
philosophers_stone_summary
## Corpus consisting of 17 documents, showing 17 documents:
## 
##    Text Types Tokens Sentences                book chapter
##   text1  1273   5643       349 Philosopher's Stone       1
##   text2  1067   4128       237 Philosopher's Stone       2
##   text3  1226   4630       297 Philosopher's Stone       3
##   text4  1198   4761       321 Philosopher's Stone       4
##   text5  1820   8372       563 Philosopher's Stone       5
##   text6  1567   7949       566 Philosopher's Stone       6
##   text7  1379   5445       351 Philosopher's Stone       7
##   text8  1096   3594       198 Philosopher's Stone       8
##   text9  1426   6131       410 Philosopher's Stone       9
##  text10  1294   5207       334 Philosopher's Stone      10
##  text11  1113   4152       276 Philosopher's Stone      11
##  text12  1511   6729       447 Philosopher's Stone      12
##  text13  1078   3930       261 Philosopher's Stone      13
##  text14  1112   4354       308 Philosopher's Stone      14
##  text15  1386   6437       459 Philosopher's Stone      15
##  text16  1581   8277       591 Philosopher's Stone      16
##  text17  1490   7101       506 Philosopher's Stone      17
docvars(philosophers_stone_corpus) <- philosophers_stone_summary
philosophers_stone_tokens <- tokens(philosophers_stone_corpus, 
    remove_punct = TRUE,
    remove_numbers = TRUE)
print(philosophers_stone_tokens)
## Tokens consisting of 17 documents and 6 docvars.
## text1 :
##  [1] "THE"     "BOY"     "WHO"     "LIVED"   "Mr"      "and"     "Mrs"    
##  [8] "Dursley" "of"      "number"  "four"    "Privet" 
## [ ... and 4,579 more ]
## 
## text2 :
##  [1] "THE"       "VANISHING" "GLASS"     "Nearly"    "ten"       "years"    
##  [7] "had"       "passed"    "since"     "the"       "Dursleys"  "had"      
## [ ... and 3,433 more ]
## 
## text3 :
##  [1] "THE"         "LETTERS"     "FROM"        "NO"          "ONE"        
##  [6] "The"         "escape"      "of"          "the"         "Brazilian"  
## [11] "boa"         "constrictor"
## [ ... and 3,827 more ]
## 
## text4 :
##  [1] "THE"     "KEEPER"  "OF"      "THE"     "KEYS"    "BOOM"    "They"   
##  [8] "knocked" "again"   "Dudley"  "jerked"  "awake"  
## [ ... and 3,674 more ]
## 
## text5 :
##  [1] "DIAGON"   "ALLEY"    "Harry"    "woke"     "early"    "the"     
##  [7] "next"     "morning"  "Although" "he"       "could"    "tell"    
## [ ... and 6,543 more ]
## 
## text6 :
##  [1] "THE"            "JOURNEY"        "FROM"           "PLATFORM"      
##  [5] "NINE"           "AND"            "THREE-QUARTERS" "Harry's"       
##  [9] "last"           "month"          "with"           "the"           
## [ ... and 6,270 more ]
## 
## [ reached max_ndoc ... 11 more documents ]

Removing the stopwords from my tokens object:

length(stopwords("en"))
## [1] 174
philosophers_stone_tokens <- tokens_select(philosophers_stone_tokens, 
                                           pattern = stopwords("en"),
                                           selection = "remove")

length(philosophers_stone_tokens)
## [1] 17
print(philosophers_stone_tokens)
## Tokens consisting of 17 documents and 6 docvars.
## text1 :
##  [1] "BOY"       "LIVED"     "Mr"        "Mrs"       "Dursley"   "number"   
##  [7] "four"      "Privet"    "Drive"     "proud"     "say"       "perfectly"
## [ ... and 2,308 more ]
## 
## text2 :
##  [1] "VANISHING" "GLASS"     "Nearly"    "ten"       "years"     "passed"   
##  [7] "since"     "Dursleys"  "woken"     "find"      "nephew"    "front"    
## [ ... and 1,783 more ]
## 
## text3 :
##  [1] "LETTERS"      "ONE"          "escape"       "Brazilian"    "boa"         
##  [6] "constrictor"  "earned"       "Harry"        "longest-ever" "punishment"  
## [11] "time"         "allowed"     
## [ ... and 2,059 more ]
## 
## text4 :
##  [1] "KEEPER"   "KEYS"     "BOOM"     "knocked"  "Dudley"   "jerked"  
##  [7] "awake"    "cannon"   "said"     "stupidly" "crash"    "behind"  
## [ ... and 1,972 more ]
## 
## text5 :
##  [1] "DIAGON"   "ALLEY"    "Harry"    "woke"     "early"    "next"    
##  [7] "morning"  "Although" "tell"     "daylight" "kept"     "eyes"    
## [ ... and 3,702 more ]
## 
## text6 :
##  [1] "JOURNEY"        "PLATFORM"       "NINE"           "THREE-QUARTERS"
##  [5] "Harry's"        "last"           "month"          "Dursleys"      
##  [9] "fun"            "True"           "Dudley"         "now"           
## [ ... and 3,284 more ]
## 
## [ reached max_ndoc ... 11 more documents ]
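
Note that length() on a tokens object counts documents (17 here), not tokens. To see how many tokens the stopword removal actually drops, ntoken() can be compared before and after. The following is a minimal sketch on a made-up sentence; `toy_tokens` and `toy_no_stop` are illustrative names, not objects from the lab.

```r
library(quanteda)

# Toy sentence standing in for a chapter (not taken from the harrypotter package)
toy_tokens <- tokens("The hat shouted and the table cheered", remove_punct = TRUE)

# Same call as in the post: drop English stopwords
toy_no_stop <- tokens_select(toy_tokens,
                             pattern = stopwords("en"),
                             selection = "remove")

# ntoken() returns the number of tokens in each document
ntoken(toy_tokens)
ntoken(toy_no_stop)
```

Running the same two ntoken() calls on the real tokens object, before and after tokens_select(), would give a per-chapter view of how much was removed.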

If I engage the ‘stemming’ feature, I can see how many of the tokens are affected. For this text, a first review suggested that the stemmed tokens were less useful than the original tokens, so I will not use stemming here.
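
As a rough illustration of that comparison, the sketch below stems a few made-up tokens with quanteda's tokens_wordstem() and reports which ones change. The toy input and the names `original`, `stemmed`, and `comparison` are assumptions for the example only.

```r
library(quanteda)

# Toy tokens (illustrative input, not the lab corpus)
toy_tokens <- tokens("Dursleys knocked stupidly", remove_punct = TRUE)

# Apply the stemmer (English is the default language)
toy_stems <- tokens_wordstem(toy_tokens)

# Pull the tokens of the single document out as plain character vectors
original <- as.list(toy_tokens)[[1]]
stemmed  <- as.list(toy_stems)[[1]]

# Side-by-side view, plus the share of tokens that stemming altered
comparison <- data.frame(original, stemmed, changed = original != stemmed)
comparison
mean(comparison$changed)
```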

# Then pull the corpus as a character vector (which works with cleanNLP) rather than as a corpus object (which does not).

philosophers_stone_char_vector <- as.character(philosophers_stone_corpus)
#Return a data frame of the document-level variables

hp_data <- docvars(philosophers_stone_corpus)

#Now add the text to my data frame for running the annotation tool; column must be named `text`

hp_data$text <- philosophers_stone_char_vector

Before I can use the cleanNLP package, I need to initialize a back end. I first ran “spacy_initialize()” from the spacyr package and then “cnlp_init_udpipe()”, which sets up the udpipe back end that the annotation below relies on. Then I can begin annotating and analyzing:

spacy_initialize()

cnlp_init_udpipe()

#Now I can start analyzing the corpus:

annotated <- cnlp_annotate(hp_data)
## Processed document 10 of 17
head(annotated$token)
## # A tibble: 6 x 11
##   doc_id   sid tid   token token_with_ws lemma upos  xpos  feats      tid_source
##    <int> <int> <chr> <chr> <chr>         <chr> <chr> <chr> <chr>      <chr>     
## 1      1     1 1     THE   "THE "        the   DET   DT    Definite=~ 2         
## 2      1     1 2     BOY   "BOY "        Boy   NOUN  NN    Number=Si~ 18        
## 3      1     1 3     WHO   "WHO "        who   PRON  WP    PronType=~ 4         
## 4      1     1 4     LIVED "LIVED  "   live  VERB  VBD   Mood=Ind|~ 2         
## 5      1     1 5     Mr.   "Mr. "        Mr.   PROPN NNP   Number=Si~ 4         
## 6      1     1 6     and   "and "        and   CCONJ CC    <NA>       7         
## # ... with 1 more variable: relation <chr>
head(annotated$document)
##    Text Types Tokens Sentences                book chapter doc_id
## 1 text1  1273   5643       349 Philosopher's Stone       1      1
## 2 text2  1067   4128       237 Philosopher's Stone       2      2
## 3 text3  1226   4630       297 Philosopher's Stone       3      3
## 4 text4  1198   4761       321 Philosopher's Stone       4      4
## 5 text5  1820   8372       563 Philosopher's Stone       5      5
## 6 text6  1567   7949       566 Philosopher's Stone       6      6
# I will join the two data frames into one for analyzing all patterns.

anno_hpdata <- left_join(annotated$document, annotated$token, by = "doc_id")

head(anno_hpdata)
##    Text Types Tokens Sentences                book chapter doc_id sid tid token
## 1 text1  1273   5643       349 Philosopher's Stone       1      1   1   1   THE
## 2 text1  1273   5643       349 Philosopher's Stone       1      1   1   2   BOY
## 3 text1  1273   5643       349 Philosopher's Stone       1      1   1   3   WHO
## 4 text1  1273   5643       349 Philosopher's Stone       1      1   1   4 LIVED
## 5 text1  1273   5643       349 Philosopher's Stone       1      1   1   5   Mr.
## 6 text1  1273   5643       349 Philosopher's Stone       1      1   1   6   and
##   token_with_ws lemma  upos xpos                            feats tid_source
## 1          THE    the   DET   DT        Definite=Def|PronType=Art          2
## 2          BOY    Boy  NOUN   NN                      Number=Sing         18
## 3          WHO    who  PRON   WP                     PronType=Rel          4
## 4       LIVED    live  VERB  VBD Mood=Ind|Tense=Past|VerbForm=Fin          2
## 5          Mr.    Mr. PROPN  NNP                      Number=Sing          4
## 6          and    and CCONJ   CC                             <NA>          7
##    relation
## 1       det
## 2     nsubj
## 3     nsubj
## 4 acl:relcl
## 5       obj
## 6        cc

Now I can really start to look for affiliations to the houses. I’ll start by looking at what the annotation data offers. Filtering by part of speech, I found that the lemma “sort” occurs 27 times as a noun. The same kind of filter, applied here to verbs, lists the most frequent verb lemmas:

anno_hpdata %>% 
  filter(upos == "VERB") %>%
  group_by(lemma) %>% 
  summarize(count = n()) %>%
  top_n(n = 250, wt = count) %>%
  arrange(desc(count))
## # A tibble: 254 x 2
##    lemma count
##    <chr> <int>
##  1 say     921
##  2 get     442
##  3 have    417
##  4 go      387
##  5 look    380
##  6 be      312
##  7 know    305
##  8 see     304
##  9 think   226
## 10 do      222
## # ... with 244 more rows
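
To check how a single lemma such as “sort” is distributed across parts of speech, the filter can be applied to the lemma instead. This is a small sketch on an invented data frame, `toy_anno`, which only mimics the `lemma` and `upos` columns of anno_hpdata; its counts are not results from the book.

```r
library(dplyr)

# Toy stand-in for anno_hpdata: one row per token (values invented)
toy_anno <- data.frame(
  lemma = c("sort", "sort", "sort", "say", "hat"),
  upos  = c("NOUN", "NOUN", "VERB", "VERB", "NOUN"),
  stringsAsFactors = FALSE
)

# Keep only the lemma of interest, then count rows per part of speech
sort_by_pos <- toy_anno %>%
  filter(lemma == "sort") %>%
  count(upos, name = "count") %>%
  arrange(desc(count))

sort_by_pos
```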

Searching the corpus with the “kwic” (keyword-in-context) function suggests that the bulk of the conversation about sorting students into houses takes place in chapter 7:

sorted <- kwic(philosophers_stone_corpus, pattern = "sort*")

sorted %>% 
  group_by(docname)
## # A tibble: 44 x 7
## # Groups:   docname [13]
##    docname  from    to pre                    keyword post               pattern
##    <chr>   <int> <int> <chr>                  <chr>   <chr>              <fct>  
##  1 text2    2859  2859 "Behind the glass , a~ sorts   of lizards and sn~ sort*  
##  2 text4     757   757 ", and began taking a~ sorts   of things out of ~ sort*  
##  3 text4    4192  4192 "letters and he needs~ sorts   of rubbish - spel~ sort*  
##  4 text4    4281  4281 "with youngsters of h~ sort    , fer a change ,   sort*  
##  5 text5    2619  2619 "you . \" \" What"     sort    of magic do you t~ sort*  
##  6 text5    5241  5241 "of him . He's a"      sort    of servant , isn'~ sort*  
##  7 text5    5278  5278 ". I heard he's a"     sort    of savage - lives~ sort*  
##  8 text5    5427  5427 "they should let the ~ sort    in , do you ?      sort*  
##  9 text5    5818  5818 "and there's four bal~ sorta   hard ter explain ~ sort*  
## 10 text6    1883  1883 "one another in a dis~ sort    of way over the b~ sort*  
## # ... with 34 more rows
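
Grouping alone does not show which chapter has the most matches. One way to make that visible is to count kwic rows per document. This is a hedged sketch on a two-document toy corpus; `toy_corpus`, `toy_kwic`, and `matches_per_doc` are illustrative names and the text is invented.

```r
library(quanteda)
library(dplyr)

# Toy corpus with invented text (not the book)
toy_corpus <- corpus(c(
  text1 = "All sorts of lizards. What sort of magic?",
  text2 = "The Sorting Hat sorted them. Sorting took a while. It sorts everyone."
))

# Tokenize first, then search; the glob pattern matches any token starting with "sort"
toy_kwic <- kwic(tokens(toy_corpus), pattern = "sort*")

# One row per match, so counting rows by docname gives matches per document
matches_per_doc <- as.data.frame(toy_kwic) %>%
  count(docname, sort = TRUE)

matches_per_doc
```

Applied to the `sorted` object above, the same count() call would rank the chapters by number of matches.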

Looking at the names of the houses, starting with Gryffindor, I can first see the context around each match. A pattern begins to emerge: when the Sorting Hat declares a student part of a house, the house name appears in all capital letters.

kwic(philosophers_stone_corpus, pattern = "Gryffindor*")
## Keyword-in-context with 102 matches.                                                                    
##   [text6, 5835]                   and I hope I'm in |  Gryffindor  |
##   [text6, 5961]                     " asked Harry." |  Gryffindor  |
##    [text7, 296]          The four houses are called |  Gryffindor  |
##   [text7, 1418]               . You might belong in |  Gryffindor  |
##   [text7, 1435]             nerve, and chivalry Set | Gryffindors  |
##   [text7, 1880]              " became the first new |  Gryffindor  |
##   [text7, 2050]              the hat declared him a |  Gryffindor  |
##   [text7, 2074]                       on her head." |  GRYFFINDOR  |
##   [text7, 2190]                it finally shouted," |  GRYFFINDOR  |
##   [text7, 2522]             you're sure - better be |  GRYFFINDOR  |
##   [text7, 2548]       and walked shakily toward the |  Gryffindor  |
##   [text7, 2789]               , joined Harry at the |  Gryffindor  |
##   [text7, 2832]                   hat had shouted," |  GRYFFINDOR  |
##   [text7, 3282]          service. Resident ghost of |  Gryffindor  |
##   [text7, 3450]                         ," So - new | Gryffindors  |
##   [text7, 3466]       house championship this year? | Gryffindors  |
##   [text7, 4742]                      you trot!" The |  Gryffindor  |
##   [text7, 5164]         and found themselves in the |  Gryffindor  |
##    [text8, 268]           always happy to point new | Gryffindors  |
##   [text8, 1323]    Professor McGonagall was head of |  Gryffindor  |
##   [text8, 2382]            point will be taken from |  Gryffindor  |
##   [text8, 2396]       Things didn't improve for the | Gryffindors  |
##   [text8, 2673]       another point you've lost for |  Gryffindor  |
##   [text8, 2744]            He'd lost two points for |  Gryffindor  |
##     [text9, 31]           Malfoy. Still, first-year | Gryffindors  |
##     [text9, 65]             notice pinned up in the |  Gryffindor  |
##     [text9, 83]          starting on Thursday - and |  Gryffindor  |
##    [text9, 646]               , who was passing the |  Gryffindor  |
##    [text9, 756]                  Ron, and the other | Gryffindors  |
##   [text9, 2642]             " Wood's captain of the |  Gryffindor  |
##   [text9, 3544]           of the points you'll lose |  Gryffindor  |
##   [text9, 3765]             staircase, and into the |  Gryffindor  |
##   [text9, 3941]              " Don't you care about |  Gryffindor  |
##   [text9, 4057]          Hermione was locked out of |  Gryffindor  |
##   [text9, 4958]                  got to get back to |  Gryffindor  |
##  [text10, 1443]          the Keeper -I'm Keeper for |  Gryffindor  |
##  [text10, 4539]         of winning fifty points for |  Gryffindor  |
##  [text10, 4875]           points will be taken from |  Gryffindor  |
##  [text10, 4903]             you'd better get off to |  Gryffindor  |
##  [text10, 4954]                 troll. You each win |  Gryffindor  |
##    [text11, 91]            after weeks of training: |  Gryffindor  |
##    [text11, 96]     Gryffindor versus Slytherin. If |  Gryffindor  |
##   [text11, 515]                me. Five points from |  Gryffindor  |
##   [text11, 561]              said Ron bitterly. The |  Gryffindor  |
##   [text11, 881]           take any more points from |  Gryffindor  |
##  [text11, 1355]                  , had done a large |  Gryffindor  |
##  [text11, 1508]               This is the best team | Gryffindor's |
##  [text11, 1752]  immediately by Angelina Johnson of |  Gryffindor  |
##  [text11, 1877]             by an excellent move by |  Gryffindor  |
##  [text11, 1882]      Gryffindor Keeper Wood and the | Gryffindors  |
##  [text11, 1892]         that's Chaser Katie Bell of |  Gryffindor  |
##  [text11, 1965]                  - nice play by the |  Gryffindor  |
##  [text11, 2016]          Bletchley dives - misses - | GRYFFINDORS  |
##  [text11, 2020]               - GRYFFINDORS SCORE!" |  Gryffindor  |
##  [text11, 2492]             of rage echoed from the | Gryffindors  |
##  [text11, 2523]                 Foul!" screamed the | Gryffindors  |
##  [text11, 2542]               at the goal posts for |  Gryffindor  |
##  [text11, 2719]            . Flint nearly kills the |  Gryffindor  |
##  [text11, 2735]                   , so a penalty to |  Gryffindor  |
##  [text11, 2754]               and we continue play, |  Gryffindor  |
##  [text11, 2859]             to turn back toward the |  Gryffindor  |
##  [text11, 3693]      happily shouting the results - |  Gryffindor  |
##    [text12, 99]                 to start. While the |  Gryffindor  |
##   [text12, 621]                 ." Five points from |  Gryffindor  |
##  [text12, 3298]                  to the fire in the |  Gryffindor  |
##  [text12, 3371]            Fred and George all over |  Gryffindor  |
##   [text13, 414]          excuse to knock points off |  Gryffindor  |
##   [text13, 569]         headed straight back to the |  Gryffindor  |
##   [text13, 736]                      If I back out, |  Gryffindor  |
##   [text13, 797]                   all the way up to |  Gryffindor  |
##   [text13, 963]               brave enough to be in |  Gryffindor  |
##  [text13, 1036]           Sorting Hat chose you for |  Gryffindor  |
##  [text13, 2481]          they choose people for the |  Gryffindor  |
##  [text13, 2914]                     won! We've won! |  Gryffindor  |
##  [text13, 2975]             lasted five minutes. As | Gryffindors  |
##  [text13, 3137]                   was a happy blur: | Gryffindors  |
##  [text13, 3193]                 in the setting sun. |  Gryffindor  |
##   [text15, 500]                 . Potter, I thought |  Gryffindor  |
##   [text15, 552]           points will be taken from |  Gryffindor  |
##   [text15, 635]          never been more ashamed of |  Gryffindor  |
##   [text15, 648]               points lost. That put |  Gryffindor  |
##   [text15, 661]          , they'd ruined any chance |  Gryffindor  |
##   [text15, 741]             happen when the rest of |  Gryffindor  |
##   [text15, 751]                     done? At first, | Gryffindors  |
##  [text16, 2210]                up to something. And |  Gryffindor  |
##  [text16, 2519]      take another fifty points from |  Gryffindor  |
##  [text16, 2786]          and your families alone if |  Gryffindor  |
##  [text16, 3066]                   them; none of the | Gryffindors  |
##  [text16, 3401]             you'll be caught again. |  Gryffindor  |
##   [text17, 286]            Snape was trying to stop |  Gryffindor  |
##  [text17, 1469]               won the house cup for |  Gryffindor  |
##  [text17, 5534]             Ron and Hermione at the |  Gryffindor  |
##  [text17, 5662]                  : In fourth place, |  Gryffindor  |
##  [text17, 5843]                 many years, I award |  Gryffindor  |
##  [text17, 5849]                house fifty points." |  Gryffindor  |
##  [text17, 5923]                    of fire, I award |  Gryffindor  |
##  [text17, 5946]               had burst into tears. | Gryffindors  |
##  [text17, 5992]        outstanding courage, I award |  Gryffindor  |
##  [text17, 6014] yelling themselves hoarse knew that |  Gryffindor  |
##  [text17, 6138]         noise that erupted from the |  Gryffindor  |
##  [text17, 6179]                 much as a point for |  Gryffindor  |
##  [text17, 6278]     serpent vanished and a towering |  Gryffindor  |
##                                    
##  , it sounds by far                
##  ," said Ron.                      
##  , Hufflepuff, Ravenclaw,          
##  , Where dwell the brave           
##  apart; You might belong           
##  , and the table on                
##  ." Granger, Hermione              
##  !" shouted the hat                
##  ," Neville ran off                
##  !" Harry heard the                
##  table. He was so                  
##  table." Turpin,                   
##  !" Harry clapped loudly           
##  Tower."" I                        
##  ! I hope you're going             
##  have never gone so long           
##  first years followed Percy through
##  common room, a cozy               
##  in the right direction,           
##  House, but it hadn't              
##  House for your cheek,             
##  as the Potions lesson continued   
##  ." This was so                    
##  in his very first week            
##  only had Potions with the         
##  common room that made them        
##  and Slytherin would be learning   
##  table, snatched the Remembrall    
##  hurried down the front steps      
##  team," Professor McGonagall       
##  if you're caught, and             
##  common room. A few                
##  , do you only care                
##  tower." Now what                  
##  tower," said Ron                  
##  . I have to fly                   
##  faded quickly from Harry's mind   
##  for this," said                   
##  tower. Students are finishing     
##  five points. Professor Dumbledore 
##  versus Slytherin. If Gryffindor   
##  won, they would move              
##  ."" He's just                     
##  common room was very noisy        
##  . He sprinted back upstairs       
##  lion underneath. Then Hermione    
##  had in years. We're               
##  - what an excellent Chaser        
##  Keeper Wood and the Gryffindors   
##  take the Quaffle - that's         
##  there, nice dive around           
##  Beater, anyway, and               
##  SCORE!" Gryffindor cheers         
##  cheers filled the cold air        
##  below - Marcus Flint had          
##  . Madam Hooch spoke angrily       
##  . But in all the                  
##  Seeker, which could happen        
##  , taken by Spinner,               
##  still in possession."             
##  goal- posts - he had              
##  had won by one hundred            
##  common room and the Great         
##  , Weasley, and be                 
##  common room, where Harry          
##  tower because they'd stolen his   
##  !" George Weasley really          
##  common room, where he             
##  can't play at all.                
##  tower. Everyone fell over         
##  , Malfoy's already done that      
##  , didn't it? And                  
##  team?" said Malfoy                
##  is in the lead!                   
##  came spilling onto the field      
##  running to lift him onto          
##  in the lead. He'd                 
##  meant more to you than            
##  ."" Fifty?                        
##  students." A hundred              
##  in last place. In                 
##  had had for the house             
##  found out what they'd done        
##  passing the giant hourglasses that
##  really can't afford to lose       
##  ! Yes, Weasley,                   
##  wins the house cup?               
##  had anything to say to            
##  will be in even more              
##  from winning, he did              
##  ." Quirrell cursed again          
##  table and tried to ignore         
##  , with three hundred and          
##  house fifty points."              
##  cheers nearly raised the bewitched
##  house fifty points."              
##  up and down the table             
##  house sixty points."              
##  now had four hundred and          
##  table. Harry, Ron                 
##  before. Harry, still              
##  lion took its place.
kwic(
  philosophers_stone_corpus,
  pattern = "GRYFFINDOR",
  case_insensitive = FALSE
)
## Keyword-in-context with 4 matches.                                                                             
##  [text7, 2074]           on her head." | GRYFFINDOR | !" shouted the hat     
##  [text7, 2190]    it finally shouted," | GRYFFINDOR | ," Neville ran off     
##  [text7, 2522] you're sure - better be | GRYFFINDOR | !" Harry heard the     
##  [text7, 2832]       hat had shouted," | GRYFFINDOR | !" Harry clapped loudly
kwic(
  philosophers_stone_corpus,
  pattern = "SLYTHERIN",
  case_insensitive = FALSE
)
## Keyword-in-context with 1 match.                                                                 
##  [text7, 2246] when it screamed," | SLYTHERIN | !" Malfoy went to
kwic(
  philosophers_stone_corpus,
  pattern = "HUFFLEPUFF",
  case_insensitive = FALSE
)
## Keyword-in-context with 3 matches.                                                                    
##  [text7, 1762] A moments pause -" | HUFFLEPUFF | !" shouted the hat 
##  [text7, 1808]         , Susan!"" | HUFFLEPUFF | !" shouted the hat 
##  [text7, 1991]        , Justin!"" | HUFFLEPUFF | !" Sometimes, Harry
kwic(
  philosophers_stone_corpus,
  pattern = "RAVENCLAW",
  case_insensitive = FALSE
)
## Keyword-in-context with 1 match.                                                           
##  [text7, 1833] , Terry!"" | RAVENCLAW | !" The table second
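
The four separate searches can also be combined, since kwic() accepts a vector of patterns. The sketch below runs one case-sensitive search for all four house names on an invented snippet and counts the matches per house. The student names and text are made up, and `houses`, `house_kwic`, and `house_counts` are illustrative names.

```r
library(quanteda)
library(dplyr)

houses <- c("GRYFFINDOR", "HUFFLEPUFF", "RAVENCLAW", "SLYTHERIN")

# Invented text in the style of the Sorting ceremony (not quoted from the book)
toy_tokens <- tokens(c(
  text7 = "Doe, Jane! HUFFLEPUFF! shouted the hat. She ran to the Hufflepuff table. Roe, Sam! GRYFFINDOR! The Gryffindor table cheered."
))

# case_insensitive = FALSE keeps only the all-caps declarations
house_kwic <- kwic(toy_tokens, pattern = houses, case_insensitive = FALSE)

# Number of all-caps declarations per house
house_counts <- as.data.frame(house_kwic) %>%
  count(keyword)

house_counts
```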

Although this is very promising, I next need to find an effective way to look at the entire sentence for each of these declaratory sorting statements. There is much more for me to learn about natural language processing!
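
One possible direction, which I have not tested on the full corpus: the annotated token table already carries a sentence id (`sid`), so the tokens of any sentence containing an all-caps house name could be pasted back together. The sketch below uses an invented token table, `toy_anno_tokens`, that only mimics the `doc_id`, `sid`, `token`, and `token_with_ws` columns of annotated$token.

```r
library(dplyr)

houses <- c("GRYFFINDOR", "HUFFLEPUFF", "RAVENCLAW", "SLYTHERIN")

# Toy stand-in for annotated$token (values invented for illustration)
toy_anno_tokens <- data.frame(
  doc_id = rep(7, 10),
  sid    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  token  = c("HUFFLEPUFF", "!", "shouted", "the", "hat", ".",
             "Harry", "sat", "down", "."),
  token_with_ws = c("HUFFLEPUFF", "! ", "shouted ", "the ", "hat", ". ",
                    "Harry ", "sat ", "down", "."),
  stringsAsFactors = FALSE
)

# Keep sentences that contain an all-caps house name, then rebuild their text
sorting_sentences <- toy_anno_tokens %>%
  group_by(doc_id, sid) %>%
  filter(any(token %in% houses)) %>%
  summarize(
    house    = token[token %in% houses][1],
    sentence = trimws(paste(token_with_ws, collapse = "")),
    .groups  = "drop"
  )

sorting_sentences
```

Whether this recovers the student's name depends on how the annotator splits sentences around the quoted declarations, so neighboring sentences might need to be included as well.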