Harry Potter Text Analysis

Thesis

My thesis is that the Harry Potter books decline in sentiment value as the series goes on: the first book is the most positive, and each later book is less positive than the ones before it.

library(textdata)
library(harrypotter)
library(tidytext)
library(plyr)
library(ggthemes)
library(wordcloud2)
library(tidyverse)
library(plotly)
library(knitr)

# each book in the harrypotter package is a character vector, one element per chapter
hp_words <- list(
  philosophers_stone = philosophers_stone,
  chamber_of_secrets = chamber_of_secrets,
  prisoner_of_azkaban = prisoner_of_azkaban,
  goblet_of_fire = goblet_of_fire,
  order_of_the_phoenix = order_of_the_phoenix,
  half_blood_prince = half_blood_prince, # label must match the book filter used later
  deathly_hallows = deathly_hallows) %>%
  ldply(rbind) %>% # bind all chapter text to dataframe columns
  mutate(book = factor(seq_along(.id), labels = .id)) %>% # identify associated book
  select(-.id) %>% # remove ID column
  gather(key = 'chapter', value = 'text', -book) %>% # gather chapter columns to rows
  filter(!is.na(text)) %>% # delete the rows/chapters without text
  mutate(chapter = as.integer(chapter)) %>% # chapter id to numeric
  unnest_tokens(word, text, token = 'words') # tokenize data frame
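
For readers without plyr, the same one-row-per-word shape can be built with purrr and dplyr. The following is only a sketch on made-up text: `toy_books` and `toy_tokens` are hypothetical names, and the two "books" stand in for the chapter vectors from the harrypotter package.

```r
library(dplyr)
library(purrr)
library(tidytext)

# Made-up stand-in for the harrypotter vectors: one string per chapter.
toy_books <- list(
  book_a = c("Harry looked up.", "The wand glowed."),
  book_b = c("Ron laughed.")
)

toy_tokens <- toy_books %>%
  imap(function(chapters, title) {
    # one row per chapter, numbered in reading order
    tibble(book = title, chapter = seq_along(chapters), text = chapters)
  }) %>%
  bind_rows() %>%
  mutate(book = factor(book, levels = names(toy_books))) %>% # keep series order
  unnest_tokens(word, text) # one lowercase word per row
```

Setting the factor levels from the list names keeps the books in series order in later tables and plots.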

The first analysis finds the most common words that count toward the sentiment. The top word in each book may give some insight into that book’s sentiment value.

## Joining, by = "word"

book	word	top_word
deathly_hallows	wand	577
order_of_the_phoenix	voice	433
goblet_of_fire	eyes	316
half_blood_prince	time	313
prisoner_of_azkaban	eyes	176
chamber_of_secrets	time	148
philosophers_stone	time	120
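
The chunk that produced the table above is not shown in the rendered output. A minimal sketch of how such a table could be computed follows, on made-up data. `toy_words`, `toy_stop_words`, and the contents of `chars` are assumptions for illustration; the real `chars` vector is not shown anywhere in this report.

```r
library(dplyr)

# Made-up stand-in for hp_words: one row per token.
toy_words <- tibble(
  book = c(rep("book_a", 6), rep("book_b", 5)),
  word = c("harry", "the", "wand", "wand", "eyes", "harry",
           "ron", "time", "time", "the", "voice")
)

# Hypothetical filter lists (assumed, not the originals).
toy_stop_words <- tibble(word = c("the"))
chars <- c("harry", "ron", "hermione")

top_word_by_book <- toy_words %>%
  anti_join(toy_stop_words, by = "word") %>%     # drop stop words
  filter(!word %in% chars) %>%                   # drop character names
  count(book, word, name = "freq") %>%           # word frequency per book
  group_by(book) %>%
  slice_max(freq, n = 1, with_ties = FALSE) %>%  # most frequent word; ties are broken arbitrarily
  ungroup() %>%
  rename(top_word = freq)
```

Passing `by = "word"` to `anti_join()` also silences the `Joining, by = "word"` message.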

Next I filter out the character names, because they don’t count toward the sentiment value. I wrote two versions of this code. The version below keeps the character names except for the three main characters and "professor", since those would overwhelm the counts and leave nothing useful to analyze.

mainCharacters <- c("harry", "ron", "hermione", "professor")

hp_words %>% 
  group_by(book) %>%
  count(word, sort = TRUE) %>% 
  anti_join(stop_words) %>%
  filter(!word %in% mainCharacters) %>%
  mutate(top_word = max(n)) %>% 
  ungroup() %>% 
  filter(n == top_word) %>% 
  subset(select = -c(n)) -> top_secCharacters_by_book

## Joining, by = "word"

kable(top_secCharacters_by_book)

book	word	top_word
half_blood_prince	dumbledore	873
order_of_the_phoenix	sirius	588
deathly_hallows	wand	577
goblet_of_fire	dumbledore	529
prisoner_of_azkaban	lupin	369
philosophers_stone	hagrid	336
chamber_of_secrets	malfoy	202

Wordclouds

Here I make a word cloud for each book as a first look at its 100 most-used words. The word clouds suggest which books are likely to have the most negative sentiment.

hp_words %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% chars) %>%
  filter(book == 'philosophers_stone') %>% 
  dplyr::count(word, sort = TRUE) %>% 
  head(100) %>% 
  wordcloud2()

## Joining, by = "word"

The word cloud for the first book makes sense. No words stand out as obviously negative or strongly positive, which makes me think this book won’t be negative but will probably be fairly neutral.

hp_words %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% chars) %>%
  filter(book == 'chamber_of_secrets') %>% 
  dplyr::count(word, sort = TRUE) %>% 
  head(100) %>% 
  wordcloud2()

## Joining, by = "word"

hp_words %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% chars) %>%
  filter(book == 'prisoner_of_azkaban') %>% 
  dplyr::count(word, sort = TRUE) %>% 
  head(100) %>% 
  wordcloud2()

## Joining, by = "word"

This word cloud stands out to me because the word "black" appears far more often than any other word. That makes sense for Prisoner of Azkaban, because the story is about Sirius Black’s escape from Azkaban.

hp_words %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% chars) %>%
  filter(book == 'goblet_of_fire') %>% 
  dplyr::count(word, sort = TRUE) %>% 
  head(100) %>% 
  wordcloud2()

## Joining, by = "word"

hp_words %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% chars) %>%
  filter(book == 'order_of_the_phoenix') %>% 
  dplyr::count(word, sort = TRUE) %>% 
  head(100) %>% 
  wordcloud2()

## Joining, by = "word"

hp_words %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% chars) %>%
  filter(book == 'half_blood_prince') %>% 
  dplyr::count(word, sort = TRUE) %>% 
  head(100) %>% 
  wordcloud2()

## Joining, by = "word"

## Warning in max(dataOut$freq): no non-missing arguments to max; returning -Inf

hp_words %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% chars) %>%
  filter(book == 'deathly_hallows') %>% 
  dplyr::count(word, sort = TRUE) %>% 
  head(100) %>% 
  wordcloud2()

## Joining, by = "word"

This word cloud is noticeably more negative than the others: words like "death", "dark", and "darkness" stand out as common throughout the book. This makes me think that the last book is going to be the most negative.
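
The seven word cloud chunks differ only in the book name, so the shared steps could be wrapped in a small helper. This is a sketch under assumptions: `top_words_for` is a hypothetical function name, and `toy_words` is made-up data standing in for `hp_words`.

```r
library(dplyr)

# Hypothetical helper: the `n_words` most frequent words of one book,
# as a word/freq data frame (the shape wordcloud2() expects).
top_words_for <- function(words, book_name, drop = character(), n_words = 100) {
  words %>%
    filter(book == book_name, !word %in% drop) %>%
    count(word, sort = TRUE, name = "freq") %>%
    head(n_words)
}

# Made-up stand-in for hp_words.
toy_words <- tibble(
  book = c(rep("book_a", 6), rep("book_b", 3)),
  word = c("harry", "the", "wand", "wand", "eyes", "harry",
           "time", "time", "voice")
)

top_a <- top_words_for(toy_words, "book_a", drop = c("harry", "the"), n_words = 1)

# With the real data (requires tidytext and wordcloud2), something like:
# top_words_for(hp_words, "deathly_hallows",
#               drop = c(chars, tidytext::stop_words$word)) %>%
#   wordcloud2::wordcloud2()
```

A helper like this also makes a mistyped book name easy to catch, because the result has zero rows when no book matches.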

Sentiment Values per Book

The next thing I want to do is look at the sentiment value for each book.

book	average
philosophers_stone	0.0577468
chamber_of_secrets	-0.0438945
prisoner_of_azkaban	-0.0651840
goblet_of_fire	-0.0173904
order_of_the_phoenix	-0.0543027
half_blood_prince	0.0354648
deathly_hallows	-0.2126561

This table shows the average sentiment value for each book. All of them are close to neutral except the last, which leans far more negative than the rest of the books.
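
The code that builds `hp_averages` is not shown in the rendered output. One plausible way to get a per-book average is to join the words to a lexicon that gives each word a numeric score and then take the mean per book. The sketch below assumes that approach and uses an invented lexicon: `toy_lexicon` and its scores are made up for illustration and are not taken from a real sentiment lexicon.

```r
library(dplyr)

# Made-up stand-in for hp_words.
toy_words <- tibble(
  book = c("book_a", "book_a", "book_a", "book_b", "book_b", "book_b"),
  word = c("happy", "dark", "wand", "death", "dark", "friend")
)

# Invented scores, one numeric value per word (assumption).
toy_lexicon <- tibble(
  word  = c("happy", "friend", "dark", "death"),
  value = c(3, 2, -2, -3)
)

toy_averages <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>% # keep only words that have a score
  group_by(book) %>%
  summarise(average = mean(value), .groups = "drop")
```

Words without a score, such as "wand" here, drop out of the average, so the result reflects only the scored vocabulary of each book.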

hp_averages %>% 
  ggplot(aes(average, book))+
  geom_col() -> test

ggplotly(test)

This graph plots the same data as the table and clearly shows that the last book is the most negative.

Conclusion

We can easily see from this graph that while the first book is the most positive and the last is by far the most negative, there is really no correlation for the books in between. Most of them have a slightly negative sentiment. The most common words in each book make sense. In The Deathly Hallows, "wand" and "Voldemort" are two of the most common words, which fits a story about the death of Voldemort and the return of the Elder Wand. My thesis wasn’t exactly right: The Half-Blood Prince, the second-to-last book, actually leans more positive than negative.
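
How closely sentiment follows book order can be checked with a rank correlation. The sketch below copies the averages from the table in the previous section and computes Spearman's rho against the book number. The variable names are hypothetical, and this check is an addition to the original analysis, not part of it.

```r
# Averages copied from the table above, in series order (book 1 to book 7).
hp_trend <- data.frame(
  book_number = 1:7,
  average = c(0.0577468, -0.0438945, -0.0651840, -0.0173904,
              -0.0543027, 0.0354648, -0.2126561)
)

# Spearman's rho: -1 would mean every book is less positive than the one before.
rho <- cor(hp_trend$book_number, hp_trend$average, method = "spearman")

# Book-to-book change; negative entries are drops in sentiment.
changes <- diff(hp_trend$average)
n_drops <- sum(changes < 0)
```

Here `rho` comes out at roughly -0.46 and sentiment drops in four of the six book-to-book steps, which matches the conclusion: the overall direction is downward, but the decline is not steady.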

Samuel Goldberg

2022-11-11
