We performed a textual analysis of the Harry Potter series using word frequencies, sentiment analysis, and clustering. We obtained the books from the GitHub account of bradleyboehmke, which gave us the complete text of each book. We broke the text into individual words for most of our analyses. Some analyses also required sentences, which we produced with the tokenize_sentences function from the tokenizers package. We scraped three lists from the web: the characters from the Wikipedia page “List of Harry Potter characters”, the locations from the Wikipedia page “Places in Harry Potter”, and the spells from a hypable.com list titled “Harry Potter List of Spells”.
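The tokenization in this report was done in R. As a rough illustration of the two granularities (words and sentences), here is a minimal Python sketch that uses only regular expressions. The function names and splitting rules are assumptions made for this example, and they are much cruder than tokenize_sentences: abbreviations such as “Mr.” would be split incorrectly.

```python
import re

def split_words(text):
    """Lowercase the text and return runs of letters (apostrophes allowed inside a word)."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())

def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Made-up sample text, not a quotation from the books.
sample = "Harry looked at Ron. Hermione didn't smile! Where was Hagrid?"
words = split_words(sample)
sentences = split_sentences(sample)
```

The word list feeds the frequency and sentiment steps, while the sentence list is what a multi-word search (such as the location search) would run against.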

Most frequently used words in the series

To find the frequency of every word in the series, we used the table function to count how many times each word appears across all of the books.

word	frequency
the	51593
and	27430
to	26985
of	21802
a	20966
he	20322
harry	16557
was	15631
said	14398
his	14264

On its own, this initial step is not very informative, since the list is dominated by very common words such as “the” and “and”. The one notable entry is “harry”, the only name to make the top ten.
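The counting step itself is simple. A minimal Python sketch of the same idea as R's table followed by a sort is shown here; the function name and the sample tokens are made up for illustration.

```python
from collections import Counter

def word_frequencies(words):
    """Count occurrences of each word and return (word, count) pairs, most frequent first."""
    return Counter(words).most_common()

# Made-up tokens standing in for the full series.
tokens = "the boy and the owl and the wand".split()
top = word_frequencies(tokens)
```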

Displaying frequency of non “stop words” in the whole series

We thought that our analysis would become more interesting if we removed stop words. Stop words are words that carry very little meaning on their own. For example, the word “the” conveys almost nothing by itself, whereas the word “death” carries significant meaning.

word	frequency
harry	16557
ron	5750
hermione	4912
dumbledore	2873
looked	2344
professor	2006
hagrid	1732
time	1713
wand	1639
eyes	1604
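Filtering stop words amounts to dropping every token that appears in a fixed list before counting. The Python sketch below assumes a tiny hand-written stop word set purely for illustration; the report does not say which list was used, and real lists contain hundreds of entries.

```python
from collections import Counter

# Tiny illustrative stop word set (an assumption, not the list used in the report).
STOP_WORDS = {"the", "and", "to", "of", "a", "he", "was", "his"}

def content_word_counts(words, stop_words=STOP_WORDS):
    """Count only the words that are not stop words."""
    return Counter(word for word in words if word not in stop_words)

tokens = "harry raised his wand and the wand glowed".split()
counts = content_word_counts(tokens)
```

Names such as “harry” survive the filter because they are not in the stop word list, which is why the filtered table is dominated by characters.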

Bar plot of the most frequently used words in the whole series

Plotting word frequency by book

The plots for books 2-6 are included at the end of the document.

Character frequency of whole series

First names

word	frequency
harry	16557
ron	5750
hermione	4912
sirius	1002
fred	870
george	711
ginny	701
neville	666
vernon	434
sir	419

Last names

word	frequency
dumbledore	2873
hagrid	1732
snape	1532
weasley	1460
malfoy	1155
potter	1109
voldemort	980
black	929
lupin	732
mcgonagall	670

Spell frequency of whole series

To find which spells are used most often, we compared our data frame of word counts against a list containing the first word of each spell. We did this on the basis that the first word of a spell appears in the text only when a character says the entire spell. This allowed us to use inner_join to get the frequency of each spell. It is interesting to note that expecto patronum is the most used spell in the series, while the least used is a four-way tie between confundo, liberacorpus, morsmordre, and rictusempra.

word	frequency
expecto patronum	43
accio	33
expelliarmus	27
stupefy	27
avada kedavra	23
lumos	22
riddikulus	17
crucio	15
impedimenta	13
disillusionment	11
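The join on the first word of each spell can be sketched in a few lines of Python. The word counts and the spell list below are made-up placeholders, and the function is an illustration of an inner join, not the R code used for the table above.

```python
# Hypothetical inputs for illustration only.
word_counts = {"expecto": 3, "accio": 2, "wand": 9, "harry": 20}
spells = ["expecto patronum", "accio", "lumos"]

def spell_frequencies(word_counts, spells):
    """Inner join: keep only spells whose first word appears in word_counts."""
    rows = []
    for spell in spells:
        first_word = spell.split()[0]
        if first_word in word_counts:
            rows.append((spell, word_counts[first_word]))
    # Most frequent spell first.
    return sorted(rows, key=lambda row: row[1], reverse=True)

result = spell_frequencies(word_counts, spells)
```

Here “lumos” is dropped because it has no match in the counts, which is the defining behavior of an inner join. The result is only as reliable as the assumption that a spell's first word never occurs outside the spell.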

Location frequency of whole series

Finding the frequency of the locations was slightly more complicated. Since not all of the locations can be reduced to one word, we compared the list of locations against a data frame of sentences. We used regular expressions to extract the locations from each sentence, then used table to count how often each location was mentioned. As you would expect, Hogwarts is the most frequently mentioned location. It is also interesting that Diagon Alley comes up fairly often (56 mentions) even though it doesn’t seem like much of the series is set there.

location	occurrences
Hogwarts	856
Azkaban	177
Hogsmeade	151
Ministry of Magic	148
Durmstrang	65
Burrow	56
Diagon Alley	56
Godric’s Hollow	54
Beauxbatons	53
Honeydukes	28
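Because some location names span several words, the search has to run over sentences rather than single tokens. The following Python sketch shows one way to do this with a single alternation pattern; the sample sentences are invented and the function name is an assumption for this example.

```python
import re
from collections import Counter

def count_locations(sentences, locations):
    """Count how often each location name occurs across the sentences."""
    # Try longer names first so a multi-word name wins over a shorter name it contains.
    ordered = sorted(locations, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b")
    counts = Counter()
    for sentence in sentences:
        counts.update(pattern.findall(sentence))
    return counts

# Made-up sentences, not quotations from the books.
sentences = [
    "They left Hogwarts and walked to Hogsmeade.",
    "The Ministry of Magic sent an owl to Hogwarts.",
    "Diagon Alley was crowded.",
]
locations = ["Hogwarts", "Hogsmeade", "Ministry of Magic", "Diagon Alley"]
counts = count_locations(sentences, locations)
```

Escaping each name with re.escape keeps punctuation inside a name (such as an apostrophe) from being read as regular expression syntax.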

Sentiment Analysis

Sentiment by Chapter Analysis

We performed sentiment analysis on each chapter in the series using the SentimentAnalysis package. We first found the sentiment of every word, then used convertToBinaryResponse to categorize each word as positive or negative. Next, we computed the average sentiment of each chapter, giving positive words the value 1 and negative words the value -1. Overall, the sixth book has the most positive average sentiment and the last book has the lowest average sentiment.
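The averaging step can be illustrated with a small Python sketch. The positive and negative word sets are hypothetical stand-ins for a real sentiment dictionary, and the sketch drops words that carry no sentiment; the report does not state how such words were handled, so that choice is an assumption.

```python
# Hypothetical word lists for illustration; a real sentiment dictionary is far larger.
POSITIVE = {"happy", "smiled", "laughed", "friend"}
NEGATIVE = {"dead", "pain", "screamed", "dark"}

def binary_sentiment(word):
    """Return 1 for a positive word, -1 for a negative word, None otherwise."""
    if word in POSITIVE:
        return 1
    if word in NEGATIVE:
        return -1
    return None

def chapter_average(words):
    """Mean of the +1/-1 values over the words that carry a sentiment."""
    values = [value for value in map(binary_sentiment, words) if value is not None]
    return sum(values) / len(values) if values else 0.0

# Made-up chapters keyed by their position in the series.
chapters = {
    1: "harry smiled and laughed with his friend".split(),
    2: "in the dark room his friend screamed".split(),
}
averages = {number: chapter_average(words) for number, words in chapters.items()}
```

An average of 1 means every sentiment-bearing word in the chapter was positive, and -1 means every one was negative, so chapter scores always fall between those two values.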

As you can see below, the sentiment gradually becomes more negative as the series progresses. This was unsurprising to us. The most negative chapter in the entire series is the 93rd, which corresponds to the 34th chapter of the 4th book. In this chapter, Harry wins a duel against Voldemort, and Voldemort tortures Harry using crucio. It is unsurprising that this is the most negative chapter in the series, since Cedric has just been murdered by Voldemort and Voldemort tortures Harry. The most positive chapter in the series is the 74th, which corresponds to the 15th chapter of the 4th book. In this chapter, the Hogwarts students welcome the students of the other schools participating in the Triwizard Tournament. We didn’t find it implausible that this is the most positive chapter in the series; however, it is not the one we would have guessed.

Sentiment Analysis using AFINN dictionary

We also performed sentiment analysis of each chapter using the AFINN sentiment dictionary, which is available in R (for example, through the tidytext package). In this analysis, each word is assigned an integer value that indicates how positive or negative that word is according to AFINN. After assigning the values to each word of the series, we plotted the average score of each chapter. The plots below show a pattern similar to our previous sentiment analysis: the series seems to get darker as it progresses.

Trials and Tribulations

There were a few things that we tried that didn’t work or didn’t give us any additional insights. The main one was named entity recognition using natural language processing. We found the R packages NLP and openNLP, which together were supposed to predict the part of speech of every word in a sentence and tell us whether each word is a name, a location, an organization, or none of the above. This, however, did not work for us. It recognized only very common names (like John), but not slightly less common ones (like Anna), so it failed on the names in Harry Potter. We aren’t certain whether we implemented it incorrectly or whether the underlying method performs poorly on this kind of text, but we couldn’t get it to work.

Another thing that we tried was clustering, to see whether there were any underlying patterns. We started with k-means clustering, which didn’t yield anything interesting. Next, we used DBSCAN, since it does not require choosing the number of clusters in advance. This was slightly more interesting, but the process was very time consuming, and we weren’t sure what radius to pick or how many points to require for a cluster.
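One common heuristic for choosing the DBSCAN radius is to compute each point's distance to its k-th nearest neighbor (with k set to the minimum number of points), sort those distances, and look for a sharp bend in the resulting curve. This was not part of our analysis; the Python sketch below only illustrates the idea, and the two-dimensional points are placeholders, since the features used for clustering are not described here.

```python
import math

def k_distances(points, k):
    """For each point, the distance to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Placeholder data: four points in a tight square and one far-away outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
curve = k_distances(points, k=2)
```

In this toy data the curve stays flat for the four clustered points and jumps for the outlier, so a radius just above the flat part would keep the square together and leave the outlier as noise. This brute-force version compares every pair of points, so it would be slow on a large data set.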

Lastly, our data is located under ~/Mscs 264 F18/Project/Jack_Abe/HarryPotterDataScience/Data and our work is under ~/Mscs 264 F18/Project/Jack_Abe/HarryPotterDataScience/Work.

Harry Potter Text Analysis

Abe Eyman Casey and Jack Welsh

12/9/2018