We performed a textual analysis of the Harry Potter series using word frequencies, sentiment analysis, and clustering. We acquired the Harry Potter books from the GitHub account of bradleyboehmke, which gave us the complete text of each book. From there, we broke the text into individual words for most of our analyses. For some analyses we also needed to break the books into sentences, which we did using the tokenize_sentences function from the tokenizers package. To get our list of characters, we scraped the Wikipedia page List of Harry Potter characters. To obtain the locations in the Harry Potter universe, we scraped the Wikipedia page Places in Harry Potter. Lastly, for the spells used in Harry Potter, we scraped a hypable.com list titled Harry Potter List of Spells.
Most frequently used words in the series
To find the frequency of every word in the series, we used the table function to count how many times each word appears across all of the books.
word | frequency |
---|---|
the | 51593 |
and | 27430 |
to | 26985 |
of | 21802 |
a | 20966 |
he | 20322 |
harry | 16557 |
was | 15631 |
said | 14398 |
his | 14264 |
As you can see, this initial list is dominated by common function words, so it is not very informative. It is notable, though, that "harry" makes the top ten.
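The counting step described above can be sketched in base R as follows. This is a minimal illustration, not our original script: the sample text, the splitting rule, and the variable names are assumptions made for the example.

```r
# Toy stand-in for the full text of the series (hypothetical sample).
text <- "Harry looked at the owl. The owl looked at Harry and the cat."

# Lowercase, then split on anything that is not a letter or an apostrophe.
words <- strsplit(tolower(text), "[^a-z']+")[[1]]
words <- words[words != ""]

# table() counts each distinct word; sort so the most frequent come first.
freq <- sort(table(words), decreasing = TRUE)

# Convert to a data frame with the same two columns as the table above.
word_freq <- data.frame(word = names(freq),
                        frequency = as.integer(freq),
                        stringsAsFactors = FALSE)
head(word_freq, 3)
```

Even in this tiny sample, "the" ends up at the top of the list, which mirrors what happens in the full series.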
Displaying frequency of non “stop words” in the whole series
We thought that our analysis would become more interesting if we removed “stopwords”. Stopwords are words that don’t carry much meaning on their own. For example, the word “the” carries almost no meaning by itself, whereas the word “death” carries significant meaning.
word | frequency |
---|---|
harry | 16557 |
ron | 5750 |
hermione | 4912 |
dumbledore | 2873 |
looked | 2344 |
professor | 2006 |
hagrid | 1732 |
time | 1713 |
wand | 1639 |
eyes | 1604 |
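Removing stopwords amounts to filtering the word counts against a list of words to ignore. The sketch below uses the counts from the first table and a deliberately tiny, hypothetical stopword list; the real list is much longer, and the name stop_words is an assumption of this example.

```r
# Word counts copied from the first table in this report.
word_freq <- data.frame(
  word = c("the", "and", "to", "of", "a", "he", "harry", "was", "said", "his"),
  frequency = c(51593, 27430, 26985, 21802, 20966, 20322, 16557, 15631,
                14398, 14264),
  stringsAsFactors = FALSE
)

# Hypothetical, deliberately tiny stopword list (a real list is far longer).
stop_words <- c("the", "and", "to", "of", "a", "he", "was", "said", "his")

# Keep only the rows whose word is not in the stopword list.
content_words <- word_freq[!(word_freq$word %in% stop_words), ]
content_words
```

Of the original top ten, only "harry" survives the filter, which is why it moves to the top of the second table.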
Bar plot of the most frequently used words in the whole series
Plotting word frequency by book
Books 2-6 are also included at the end of the document.
Character frequency of whole series
First names
word | frequency |
---|---|
harry | 16557 |
ron | 5750 |
hermione | 4912 |
sirius | 1002 |
fred | 870 |
george | 711 |
ginny | 701 |
neville | 666 |
vernon | 434 |
sir | 419 |
Last names
word | frequency |
---|---|
dumbledore | 2873 |
hagrid | 1732 |
snape | 1532 |
weasley | 1460 |
malfoy | 1155 |
potter | 1109 |
voldemort | 980 |
black | 929 |
lupin | 732 |
mcgonagall | 670 |
Spell frequency of whole series
To find which spells are used most often, we compared our dataframe of word counts with a list of the first word of each spell. We did this because the first word of a spell only appears in the text when a character says the entire spell. This allowed us to use inner_join to get the frequency of each spell. It is interesting to note that expecto patronum is the most used spell in the series, while the least used spells are a four-way tie between confundo, liberacorpus, morsmordre, and rictusempra.
word | frequency |
---|---|
expecto patronum | 43 |
accio | 33 |
expelliarmus | 27 |
stupefy | 27 |
avada kedavra | 23 |
lumos | 22 |
riddikulus | 17 |
crucio | 15 |
impedimenta | 13 |
disillusionment | 11 |
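The join on the first word of each spell can be sketched as below. To keep the example free of package dependencies it uses base R's merge(), which performs an inner join by default, as a stand-in for inner_join. The data frames are small hypothetical samples that reuse a few counts from the tables in this report.

```r
# Sample word counts; in our analysis these come from the full series.
word_freq <- data.frame(
  word = c("harry", "expecto", "accio", "wand", "avada"),
  frequency = c(16557, 43, 33, 1639, 23),
  stringsAsFactors = FALSE
)

# Spell list, keyed by the first word of each incantation.
spells <- data.frame(
  spell = c("expecto patronum", "accio", "avada kedavra", "lumos"),
  stringsAsFactors = FALSE
)
spells$word <- sub(" .*$", "", spells$spell)  # drop everything after a space

# merge() keeps only keys present in both data frames (an inner join).
spell_freq <- merge(spells, word_freq, by = "word")
spell_freq <- spell_freq[order(-spell_freq$frequency), c("spell", "frequency")]
spell_freq
```

In this sample, lumos drops out because its key has no row in the sample counts, and non-spell words such as "wand" drop out because they have no row in the spell list.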
Location Frequency Full Series
To find the frequency of the locations, we used a slightly more complicated process. Since not all of the locations can be boiled down to one word, we compared the list of locations to a dataframe of sentences. We used regular expressions to extract the locations from each sentence, then used table to count how often each location was mentioned. As you would expect, Hogwarts is the most frequently mentioned location. It is also interesting that Diagon Alley comes up often even though not much of the series seems to be set there.
location | occurrences |
---|---|
Hogwarts | 856 |
Azkaban | 177 |
Hogsmeade | 151 |
Ministry of Magic | 148 |
Durmstrang | 65 |
Burrow | 56 |
Diagon Alley | 56 |
Godric’s Hollow | 54 |
Beauxbatons | 53 |
Honeydukes | 28 |
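The sentence-level matching can be sketched in base R with gregexpr and regmatches. The sentences below are invented for illustration, and the single alternation pattern is one possible way to do the extraction, not necessarily the exact expression we used.

```r
# Hypothetical sentences standing in for the tokenized books.
sentences <- c(
  "Harry returned to Hogwarts after a trip to Diagon Alley.",
  "The Ministry of Magic sent an owl to Hogwarts.",
  "Nothing happened on Privet Drive."
)

# A few single-word and multi-word locations from the scraped list.
locations <- c("Hogwarts", "Diagon Alley", "Ministry of Magic", "Azkaban")

# One pattern that matches any of the location names.
pattern <- paste(locations, collapse = "|")

# Pull every match out of every sentence, then count them with table().
matches <- unlist(regmatches(sentences, gregexpr(pattern, sentences)))
location_freq <- sort(table(matches), decreasing = TRUE)
location_freq
```

Working on whole sentences rather than single words is what allows multi-word names such as Ministry of Magic to be counted as one location.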
Sentiment Analysis
Sentiment by Chapter Analysis
We performed sentiment analysis on each chapter in the series using the SentimentAnalysis package. We first found the sentiment of every word, then used convertToBinaryResponse to categorize each word as positive or negative. Next we found the average sentiment of each chapter (positive words were given the value 1 and negative words the value -1). Overall, the sixth book has the most positive average sentiment and the last book has the lowest.
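The averaging step can be illustrated with a small base R sketch. The scores are hypothetical, and the ifelse call is only a stand-in for the binary conversion; treating a score of exactly 0 as positive is an assumption of this example.

```r
# Hypothetical per-word sentiment scores, tagged with a chapter number.
scores <- data.frame(
  chapter = c(1, 1, 1, 2, 2, 2, 2),
  word = c("happy", "friend", "dark", "death", "pain", "fear", "hope"),
  score = c(0.8, 0.5, -0.6, -0.9, -0.7, -0.4, 0.6),
  stringsAsFactors = FALSE
)

# Collapse each score to +1 (positive) or -1 (negative).
# Assumption of this sketch: a score of exactly 0 counts as positive.
scores$binary <- ifelse(scores$score >= 0, 1, -1)

# Average the +1/-1 values within each chapter.
chapter_sentiment <- tapply(scores$binary, scores$chapter, mean)
chapter_sentiment
```

A chapter average near 1 means almost every scored word was positive, and an average near -1 means almost every scored word was negative.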
As you can see below, the sentiment gradually becomes more negative as the series progresses. This was unsurprising to us. The most negative chapter is the 93rd chapter of the series, which corresponds to the 34th chapter of the 4th book. In this chapter, Harry and Voldemort duel, which Harry survives, and Harry is tortured by Voldemort using crucio. It is unsurprising that this is the most negative chapter, since Cedric has just been murdered by Voldemort and Voldemort tortures Harry. The most positive chapter is the 74th, which corresponds to the 15th chapter of the 4th book. In this chapter, the Hogwarts students welcome the students of the other schools participating in the Triwizard Tournament. We didn’t find it unfathomable that this is the most positive chapter in the series; however, it is not the one we would have guessed.
Sentiment Analysis using AFINN dictionary
We also performed sentiment analysis of each chapter using the AFINN sentiment dictionary, which is available in R. In this analysis, each word is assigned an integer value that determines how positive or negative that word is according to AFINN. After assigning the values to each word of the series, we plotted the average score of each chapter. From the plots below, you can see a pattern similar to the one in our previous sentiment analysis: the series seems to get darker as it progresses.
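A dictionary-based score of this kind comes down to joining the words against the lexicon and averaging by chapter. In the sketch below the lexicon is hypothetical: the words and integer values are illustrative and are not taken from the real AFINN dictionary.

```r
# Hypothetical integer lexicon in the style of AFINN (values are illustrative).
lexicon <- data.frame(
  word = c("good", "love", "bad", "kill"),
  value = c(3, 3, -3, -3),
  stringsAsFactors = FALSE
)

# Hypothetical words tagged with the chapter they appear in.
tokens <- data.frame(
  chapter = c(1, 1, 1, 2, 2, 2),
  word = c("good", "love", "wand", "bad", "kill", "good"),
  stringsAsFactors = FALSE
)

# Inner join: words missing from the lexicon (here "wand") drop out.
scored <- merge(tokens, lexicon, by = "word")

# Average lexicon value per chapter.
chapter_score <- aggregate(value ~ chapter, data = scored, FUN = mean)
chapter_score
```

Unlike the +1/-1 approach, this average also reflects how strongly positive or negative each word is, not just its direction.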
Trials and Tribulations
There were a few things that we tried that didn’t work or didn’t give us any additional insights. The main one was Named Entity Recognition using natural language processing. We found the R packages NLP and openNLP, which together were supposed to predict the part of speech of every word in a sentence and tell us whether each word is a name, location, organization, or none of the above. This, however, did not work for us. It recognized only very common names (like John), but not slightly less common ones (like Anna), so it failed on the names in Harry Potter. We aren’t certain whether we implemented it incorrectly or whether the underlying method performs poorly, but we couldn’t get it to work.
Another thing that we tried was clustering, to see if there were any underlying patterns. We started with k-means clustering, which didn’t yield anything interesting. Next, we used DBSCAN, since it doesn’t require choosing how many clusters to force the data into. This was slightly more interesting, but the process was very time consuming, and we weren’t sure what radius to pick or how many points to require in each cluster.
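For reference, a k-means run with 4 clusters looks roughly like the sketch below, using the kmeans function from base R's stats package. The feature matrix here is simulated purely for illustration and does not represent the features from our analysis; the seed and the nstart value are also arbitrary choices for the example.

```r
set.seed(264)  # arbitrary seed so the sketch is reproducible

# Hypothetical two-column feature matrix: four loose groups of 25 points each.
centers <- rbind(c(0, 0), c(0, 5), c(5, 0), c(5, 5))
features <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(25, centers[i, 1], 0.3), rnorm(25, centers[i, 2], 0.3))
}))

# k-means with 4 clusters; nstart reruns the algorithm from several random
# starts and keeps the solution with the lowest within-cluster sum of squares.
fit <- kmeans(features, centers = 4, nstart = 10)

table(fit$cluster)  # number of points assigned to each cluster
```

The number of clusters has to be chosen up front with k-means, which is the constraint that led us to try DBSCAN next.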
Lastly, our data is located under ~/Mscs 264 F18/Project/Jack_Abe/HarryPotterDataScience/Data and our work is under ~/Mscs 264 F18/Project/Jack_Abe/HarryPotterDataScience/Work.
KMeans
KMeans clustering with 4 clusters.
DBSCAN
DBSCAN with a radius of .28 and 5 points needed to form a cluster.