We performed a textual analysis of the Harry Potter series using word frequencies, sentiment analysis, and clustering. We acquired the Harry Potter books from the GitHub account of bradleyboehmke, which gave us the complete text of each book. We then broke the text up into individual words for most of our analyses. Some of our analyses also required sentences, which we produced with the tokenize_sentences function from the tokenizers package. To get our list of characters, we scraped the Wikipedia page “List of Harry Potter characters”. To obtain the locations in the Harry Potter universe, we scraped the Wikipedia page “Places in Harry Potter”. Lastly, for the spells used in Harry Potter, we scraped a hypable.com list titled “Harry Potter List of Spells”.
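
A minimal sketch of the tokenization step, assuming the tokenizers package is installed. The variable chapter_text is a made-up stand-in for one chapter of a book, not the real data.

```r
# Sketch: split a text into words and sentences with the tokenizers package.
library(tokenizers)

# Hypothetical stand-in for the text of one chapter
chapter_text <- "Harry looked at Ron. Ron looked back at Harry!"

# Both functions return a list with one element per input string.
# tokenize_words() lowercases and strips punctuation by default.
words <- tokenize_words(chapter_text)[[1]]
sentences <- tokenize_sentences(chapter_text)[[1]]

print(words)
print(sentences)
```

Keeping both the word-level and the sentence-level version of the text is useful because single-word counts cannot capture multi-word names such as Diagon Alley.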

Most frequently used words in the series

To find the frequency of every word in the series, we used the table function to count how many times each word appears across all of the books.

word frequency
the 51593
and 27430
to 26985
of 21802
a 20966
he 20322
harry 16557
was 15631
said 14398
his 14264

As you can see, this initial step is not very informative, since the list is dominated by common function words. The one notable entry is “harry”, the only name to make the top ten.
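
The counting step can be sketched in base R as follows. The words vector here is a tiny invented example standing in for all of the words in the series, and the column names word and frequency are chosen to match the tables in this report.

```r
# Hypothetical stand-in for the vector of all words in the series
words <- c("the", "boy", "who", "lived", "the", "cupboard",
           "under", "the", "stairs", "boy")

# table() counts each distinct value; sort() puts the most frequent first
counts <- sort(table(words), decreasing = TRUE)

# Convert to a two-column data frame for later filtering and joins
word_freq <- data.frame(
  word = names(counts),
  frequency = as.integer(counts),
  stringsAsFactors = FALSE
)

print(head(word_freq, 10))
```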

Frequency of non-stopwords in the whole series

We expected our analysis to become more interesting once we removed “stopwords”, which are words that carry very little meaning on their own. For example, the word “the” carries almost no meaning by itself, whereas the word “death” carries significant meaning.

word frequency
harry 16557
ron 5750
hermione 4912
dumbledore 2873
looked 2344
professor 2006
hagrid 1732
time 1713
wand 1639
eyes 1604
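
A minimal sketch of the stopword filtering, in base R. The stop_words vector below is a short hand-picked list used only for illustration; it is an assumption, not the list used in this analysis, and the counts are invented.

```r
# Hypothetical word counts
word_freq <- data.frame(
  word = c("the", "and", "harry", "said", "wand"),
  frequency = c(9, 5, 4, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical stopword list; a real list would be much longer
stop_words <- c("the", "and", "to", "of", "a", "he", "was", "said", "his")

# Keep only the rows whose word is not a stopword
content_freq <- word_freq[!(word_freq$word %in% stop_words), ]

print(content_freq)
```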

Bar plot of the most frequently used words in the whole series

Plotting word frequency by book

Books 2-6 are also included at the end of the document.

Character frequency of whole series

First names

word frequency
harry 16557
ron 5750
hermione 4912
sirius 1002
fred 870
george 711
ginny 701
neville 666
vernon 434
sir 419

Last names

word frequency
dumbledore 2873
hagrid 1732
snape 1532
weasley 1460
malfoy 1155
potter 1109
voldemort 980
black 929
lupin 732
mcgonagall 670

Spell frequency of whole series

To find which spells are used most often, we compared our data frame of word counts with a list of the first word of each spell. We did this on the assumption that the first word of a spell appears only when a character says the entire spell. This allowed us to use inner_join to get the frequency of each spell. It is interesting to note that expecto patronum is the most used spell in the series, while the least used is a four-way tie between confundo, liberacorpus, morsmordre, and rictusempra.

word frequency
expecto patronum 43
accio 33
expelliarmus 27
stupefy 27
avada kedavra 23
lumos 22
riddikulus 17
crucio 15
impedimenta 13
disillusionment 11
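
The join described above can be sketched with dplyr as follows. The data frames and their counts are invented for illustration, and keying each spell by its first word with sub() is one possible way to build the join column.

```r
library(dplyr)

# Hypothetical word counts
word_freq <- data.frame(
  word = c("harry", "expecto", "accio", "wand", "lumos"),
  frequency = c(120, 4, 3, 15, 2),
  stringsAsFactors = FALSE
)

# Hypothetical spell list
spells <- data.frame(
  spell = c("expecto patronum", "accio", "lumos", "crucio"),
  stringsAsFactors = FALSE
)

# Key each spell by its first word (everything before the first space)
spells$word <- sub(" .*", "", spells$spell)

# inner_join() keeps only the spells whose first word appears in the counts
spell_freq <- inner_join(spells, word_freq, by = "word")
spell_freq <- spell_freq[order(-spell_freq$frequency), c("spell", "frequency")]

print(spell_freq)
```

Because this is an inner join, a spell whose first word never appears in the text (crucio in this toy data) is dropped rather than reported with a count of zero.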

Location frequency of whole series

To find the frequency of the locations, we needed a slightly more complicated process. Since not all of the locations can be reduced to a single word, we matched the list of locations against a data frame of sentences. We used regular expressions to extract the locations from each sentence, then used table to count how often each location was mentioned. As you would expect, Hogwarts is the most frequently mentioned location. It is also interesting that Diagon Alley comes up often even though little of the series seems to be set there.

location occurrences
Hogwarts 856
Azkaban 177
Hogsmeade 151
Ministry of Magic 148
Durmstrang 65
Burrow 56
Diagon Alley 56
Godric’s Hollow 54
Beauxbatons 53
Honeydukes 28
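
A minimal sketch of the sentence-level matching, in base R. The sentences and the location list are invented examples, and building a single alternation pattern is an assumption about how the regular expression could be written.

```r
# Hypothetical sentences and location names
sentences <- c(
  "Harry took the train to Hogwarts.",
  "They shopped in Diagon Alley before returning to Hogwarts.",
  "Ron stayed home."
)
locations <- c("Hogwarts", "Diagon Alley", "Ministry of Magic")

# One alternation pattern; \\b restricts matches to word boundaries
pattern <- paste0("\\b(", paste(locations, collapse = "|"), ")\\b")

# gregexpr() finds every match in each sentence; regmatches() extracts them
matches <- regmatches(sentences, gregexpr(pattern, sentences, perl = TRUE))

# Flatten the per-sentence matches and count each location
location_counts <- sort(table(unlist(matches)), decreasing = TRUE)

print(location_counts)
```

Matching against sentences rather than single words is what allows multi-word names such as Diagon Alley to be counted.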

Sentiment Analysis

Sentiment by Chapter Analysis

We performed sentiment analysis on each chapter in the series using the SentimentAnalysis package. We first found the sentiment of every word, then used convertToBinaryResponse to categorize each word as positive or negative. Next, we found the average sentiment of each chapter, with positive words given the value 1 and negative words given the value -1. As you can see, the sixth book has the most positive average sentiment overall and the last book has the lowest.
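
The averaging step can be sketched in base R as follows. The sketch assumes the per-word positive/negative labels have already been produced (for instance by convertToBinaryResponse); the data frame word_sentiment and its contents are invented for illustration.

```r
# Hypothetical per-word labels, one row per word
word_sentiment <- data.frame(
  chapter = c(1, 1, 1, 1, 2, 2, 2),
  label = c("positive", "positive", "negative", "positive",
            "negative", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Positive words score 1, negative words score -1
word_sentiment$score <- ifelse(word_sentiment$label == "positive", 1, -1)

# Average score per chapter
chapter_sentiment <- aggregate(score ~ chapter, data = word_sentiment, FUN = mean)

print(chapter_sentiment)
```

With this scoring, a chapter average near 1 means almost every word was labeled positive, and an average near -1 means almost every word was labeled negative.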

As you can see below, the sentiment gradually becomes more negative as the series progresses. This was unsurprising to us. The most negative chapter in the entire series is the 93rd chapter in the series, which corresponds to the 34th chapter of the 4th book. In this chapter, Harry and Voldemort duel, which Harry survives, and Harry is tortured by Voldemort using crucio. It is unsurprising that this is the most negative chapter in the series, since Cedric has just been murdered and Voldemort tortures Harry. The most positive chapter in the series is the 74th, which corresponds to the 15th chapter of the 4th book. In this chapter, the Hogwarts students welcome the students of the other schools participating in the Triwizard Tournament. We didn’t find it implausible that this is the most positive chapter in the series; however, it is not the one that we would have guessed.

Sentiment Analysis using AFINN dictionary

We also performed sentiment analysis of each chapter using the AFINN sentiment dictionary, which is available in R. In this analysis, each word is assigned an integer value that indicates how positive or negative that word is according to AFINN. After assigning the values to each word of the series, we plotted the average score of each chapter. From the plots below, you can see a pattern similar to that of our previous sentiment analysis; the series seems to get darker as it progresses.
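
A minimal sketch of dictionary-based scoring, in base R. The lexicon below is a toy table in the AFINN style (one integer score per word); its words and values are invented for illustration and are not taken from AFINN.

```r
# Toy lexicon: invented words and integer scores, not real AFINN values
lexicon <- data.frame(
  word = c("happy", "friend", "dark", "murder"),
  value = c(3, 2, -2, -4),
  stringsAsFactors = FALSE
)

# Hypothetical words tagged with the chapter they come from
chapter_words <- data.frame(
  chapter = c(1, 1, 1, 2, 2, 2),
  word = c("happy", "friend", "train", "dark", "murder", "friend"),
  stringsAsFactors = FALSE
)

# merge() does an inner join by default, so words without a score drop out
scored <- merge(chapter_words, lexicon, by = "word")

# Average score per chapter
chapter_score <- aggregate(value ~ chapter, data = scored, FUN = mean)

print(chapter_score)
```

Unlike the binary approach, this scoring keeps the strength of each word, so a few strongly negative words can pull a chapter average down sharply.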

Trials and Tribulations

There were a few things that we tried that didn’t work or didn’t give us any additional insights. The main one was named entity recognition using natural language processing. We found the R packages NLP and openNLP, which together were supposed to predict the part of speech of every word in a sentence and tell us whether each word is a name, a location, an organization, or none of the above. This, however, did not work for us. It recognized only some very common names (like John), but not slightly less common ones (like Anna), so it failed on the names in Harry Potter. We aren’t certain whether we implemented it incorrectly or whether the underlying method simply performs poorly, but we couldn’t get it to work.

Another thing that we tried was clustering, to see if there were any underlying patterns. We started with k-means clustering, which didn’t yield anything interesting. Next, we used DBSCAN, since it doesn’t require choosing the number of clusters in advance. The results were slightly more interesting, but the process was very time consuming, and we weren’t sure what radius to pick or how many points to require to form a cluster.
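
The two clustering calls can be sketched as follows. The feature matrix is randomly generated toy data, since the features used in this analysis are not described here, and the use of the dbscan package is an assumption. The values of 4 clusters, a radius of 0.28, and 5 points come from the plot captions at the end of this document.

```r
library(dbscan)  # assumption: one of several R packages that implement DBSCAN

set.seed(264)  # arbitrary seed so the toy example is reproducible

# Hypothetical feature matrix: 60 observations with two numeric features
features <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(60, mean = 1, sd = 0.1), ncol = 2)
)

# k-means needs the number of clusters chosen up front
km <- kmeans(features, centers = 4, nstart = 10)

# DBSCAN instead needs a neighborhood radius (eps) and a minimum
# number of points required to form a cluster (minPts)
db <- dbscan(features, eps = 0.28, minPts = 5)

print(table(km$cluster))
print(table(db$cluster))  # in the dbscan package, cluster 0 marks noise points
```

The contrast between the two calls shows the trade-off: k-means always returns exactly the requested number of clusters, while the number of clusters DBSCAN finds depends on the chosen radius and point threshold.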

Lastly, our data is located under ~/Mscs 264 F18/Project/Jack_Abe/HarryPotterDataScience/Data and our work is under ~/Mscs 264 F18/Project/Jack_Abe/HarryPotterDataScience/Work.

KMeans

KMeans clustering with 4 clusters.

DBSCAN

DBSCAN with a radius of 0.28 and 5 points needed to form a cluster.

Top 10 most frequent words in each book