My thesis is that the sentiment of the Harry Potter books declines as the series goes on: the first book is the most positive, and each later book is less positive than the one before it.
library(textdata)
library(harrypotter)
library(tidytext)
library(plyr)
library(ggthemes)
library(wordcloud2)
library(tidyverse)
library(plotly)
library(knitr)
hp_words <- list(
philosophers_stone = philosophers_stone,
chamber_of_secrets = chamber_of_secrets,
prisoner_of_azkaban = prisoner_of_azkaban,
goblet_of_fire = goblet_of_fire,
order_of_the_phoenix = order_of_the_phoenix,
half_blood_prince = half_blood_prince,
deathly_hallows = deathly_hallows) %>%
ldply(rbind) %>% # bind all chapter text to dataframe columns
mutate(book = factor(seq_along(.id), labels = .id)) %>% # identify associated book
select(-.id) %>% # remove ID column
gather(key = 'chapter', value = 'text', -book) %>% # gather chapter columns to rows
filter(!is.na(text)) %>% # delete the rows/chapters without text
mutate(chapter = as.integer(chapter)) %>% # chapter id to numeric
unnest_tokens(word, text, token = 'words') # tokenize data frame
My first analysis finds the most common words that will count towards the sentiment. The top word in each book may give us some insight into that book's sentiment value.
## Joining, by = "word"
| book | word | top_word |
|---|---|---|
| deathly_hallows | wand | 577 |
| order_of_the_phoenix | voice | 433 |
| goblet_of_fire | eyes | 316 |
| half_blood_prince | time | 313 |
| prisoner_of_azkaban | eyes | 176 |
| chamber_of_secrets | time | 148 |
| philosophers_stone | time | 120 |
Next I filter out character names, because they won't count towards the sentiment value. I wrote two versions of this code. The version below removes only the three main characters and the word "professor", because those would otherwise overwhelm the counts and leave nothing useful to analyze. All other character names stay in.
mainCharacters <- c("harry", "ron", "hermione", "professor")
hp_words %>%
group_by(book) %>%
count(word, sort = TRUE) %>%
anti_join(stop_words) %>%
filter(!word %in% mainCharacters) %>%
mutate(top_word = max(n)) %>%
ungroup() %>%
filter(n == top_word) %>%
subset(select = -c(n)) -> top_secCharacters_by_book
## Joining, by = "word"
kable(top_secCharacters_by_book)
| book | word | top_word |
|---|---|---|
| half_blood_prince | dumbledore | 873 |
| order_of_the_phoenix | sirius | 588 |
| deathly_hallows | wand | 577 |
| goblet_of_fire | dumbledore | 529 |
| prisoner_of_azkaban | lupin | 369 |
| philosophers_stone | hagrid | 336 |
| chamber_of_secrets | malfoy | 202 |
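The word cloud code filters on a vector named `chars`, whose definition does not appear in this write-up. The sketch below shows how the broader name filter could work, assuming `chars` is simply a hand-written vector of character names. The names in it and the toy data are illustrative only, not the actual list or the real book text.

```r
library(dplyr)

# Hypothetical name list; the original `chars` vector is not shown in this write-up.
chars <- c("harry", "ron", "hermione", "professor", "dumbledore", "hagrid")

# Toy stand-in for hp_words: one row per token, tagged with its book.
toy_words <- data.frame(
  book = c("book_a", "book_a", "book_a", "book_a", "book_b", "book_b", "book_b"),
  word = c("harry", "wand", "wand", "hagrid", "dumbledore", "eyes", "eyes"),
  stringsAsFactors = FALSE
)

top_words <- toy_words %>%
  filter(!word %in% chars) %>%                  # drop every listed name
  dplyr::count(book, word, sort = TRUE) %>%     # frequency of each remaining word per book
  group_by(book) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%    # keep the single most frequent word per book
  ungroup()

print(top_words)
```

Compared with `mainCharacters`, a longer list like this also removes secondary characters, so the top word per book is an ordinary word rather than a name.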
Here I make a word cloud for each book as a first look at its 100 most used words. From the word clouds we can make a first guess about which books will have the most negative sentiment.
hp_words %>%
anti_join(stop_words) %>%
filter(!word %in% chars) %>%
filter(book == 'philosophers_stone') %>%
dplyr::count(word, sort = TRUE) %>%
head(100) %>%
wordcloud2()
## Joining, by = "word"
This word cloud for the first book makes sense. No words stick out as obviously negative or even significantly positive, which makes me think that this book won't be negative but will probably be fairly neutral.
hp_words %>%
anti_join(stop_words) %>%
filter(!word %in% chars) %>%
filter(book == 'chamber_of_secrets') %>%
dplyr::count(word, sort = TRUE) %>%
head(100) %>%
wordcloud2()
## Joining, by = "word"
hp_words %>%
anti_join(stop_words) %>%
filter(!word %in% chars) %>%
filter(book == 'prisoner_of_azkaban') %>%
dplyr::count(word, sort = TRUE) %>%
head(100) %>%
wordcloud2()
## Joining, by = "word"
This word cloud stands out to me because the word "black" appears far more often than any other word. This makes sense for the Prisoner of Azkaban, because the story is about Sirius Black's escape from Azkaban.
hp_words %>%
anti_join(stop_words) %>%
filter(!word %in% chars) %>%
filter(book == 'goblet_of_fire') %>%
dplyr::count(word, sort = TRUE) %>%
head(100) %>%
wordcloud2()
## Joining, by = "word"
hp_words %>%
anti_join(stop_words) %>%
filter(!word %in% chars) %>%
filter(book == 'order_of_the_phoenix') %>%
dplyr::count(word, sort = TRUE) %>%
head(100) %>%
wordcloud2()
## Joining, by = "word"
hp_words %>%
anti_join(stop_words) %>%
filter(!word %in% chars) %>%
filter(book == 'half_blood_prince') %>%
dplyr::count(word, sort = TRUE) %>%
head(100) %>%
wordcloud2()
## Joining, by = "word"
hp_words %>%
anti_join(stop_words) %>%
filter(!word %in% chars) %>%
filter(book == 'deathly_hallows') %>%
dplyr::count(word, sort = TRUE) %>%
head(100) %>%
wordcloud2()
## Joining, by = "word"
This word cloud is significantly more negative than the rest of the word clouds with words like death, dark, darkness standing out as pretty common words throughout the book. This makes me think that the last book is going to be the most negative.
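The seven word cloud chunks differ only in the book name, so the shared steps could be wrapped in one function. The sketch below is one possible way to do that. The helper name `top_n_words` and its arguments are hypothetical, and the toy data stands in for `hp_words`. It also stops with an error when a book name matches no rows, because a misspelled name would otherwise pass an empty table on to the plot.

```r
library(dplyr)

# Hypothetical helper: the n_top most frequent words of one book,
# ignoring anything listed in `exclude` (stop words, character names, ...).
top_n_words <- function(words, book_name, exclude = character(), n_top = 100) {
  out <- words %>%
    filter(book == book_name, !word %in% exclude) %>%
    dplyr::count(word, sort = TRUE) %>%
    head(n_top)
  if (nrow(out) == 0) {
    stop("No words found for book '", book_name, "'; check the spelling.")
  }
  out
}

# Toy stand-in for hp_words.
toy_words <- data.frame(
  book = c(rep("book_a", 8), "book_b"),
  word = c("the", "harry", "wand", "wand", "wand", "cloak", "cloak", "owl", "dark"),
  stringsAsFactors = FALSE
)

print(top_n_words(toy_words, "book_a", exclude = c("the", "harry"), n_top = 2))

# With the real data, a call might look like this (not run here):
# top_n_words(hp_words, "deathly_hallows", exclude = c(stop_words$word, chars)) %>%
#   wordcloud2()
```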
The next thing I want to do is look at the sentiment value for each book.
| book | average |
|---|---|
| philosophers_stone | 0.0577468 |
| chamber_of_secrets | -0.0438945 |
| prisoner_of_azkaban | -0.0651840 |
| goblet_of_fire | -0.0173904 |
| order_of_the_phoenix | -0.0543027 |
| half_blood_prince | 0.0354648 |
| deathly_hallows | -0.2126561 |
This table shows the average sentiment value for each book as a whole. All of them are close to neutral except for the last, which leans much further to the negative side than the rest of the books.
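The code that builds `hp_averages` is not shown, and neither is the lexicon behind it. The sketch below shows one way such per-book averages could be computed. It assumes a lexicon with a numeric `value` column (the AFINN lexicon from `tidytext::get_sentiments("afinn")` has that shape) and averages only over words the lexicon scores. The miniature lexicon and its scores are made up for illustration.

```r
library(dplyr)

# Hypothetical miniature lexicon with made-up scores; a real run might instead use
# tidytext::get_sentiments("afinn"), which also has `word` and `value` columns.
toy_lexicon <- data.frame(
  word  = c("happy", "love", "dark", "death"),
  value = c(3, 3, -2, -3),
  stringsAsFactors = FALSE
)

# Toy stand-in for hp_words.
toy_words <- data.frame(
  book = c("book_a", "book_a", "book_a", "book_b", "book_b", "book_b"),
  word = c("happy", "wand", "dark", "death", "dark", "love"),
  stringsAsFactors = FALSE
)

toy_averages <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%              # keep only words the lexicon scores
  group_by(book) %>%
  summarise(average = mean(value), .groups = "drop")    # mean score per book

print(toy_averages)
```

Averaging over scored words only means a book's length does not affect its value. Dividing by all words instead would pull every average closer to zero.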
hp_averages %>%
ggplot(aes(average, book))+
geom_col() -> test
ggplotly(test)
This graph plots the averages from the table above and clearly shows the last book being the most negative.
We can easily see from this graph that while the first book is the most positive and the last is by far the most negative, there is no clear trend for the books in between. Most of them have a slightly negative sentiment. The most frequent words in each book also make sense. In The Deathly Hallows, "wand" and "Voldemort" are two of the most common words, which fits a final story about the death of Voldemort and the return of the Elder Wand. My thesis wasn't exactly right: The Half-Blood Prince, the second-to-last book, actually leans more positive than negative.
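One rough way to put a number on the trend is a rank correlation between a book's position in the series and its average sentiment. The sketch below uses Spearman's correlation on the averages from the table. The second calculation, which leaves out the last book, is an added check and not part of the original analysis. With only seven books, either number is at most a hint.

```r
# Average sentiment per book, copied from the table above, in publication order.
average <- c(0.0577468, -0.0438945, -0.0651840, -0.0173904,
             -0.0543027, 0.0354648, -0.2126561)
book_order <- seq_along(average)

# Spearman's rank correlation: a negative value means later books tend to score lower.
rho <- cor(book_order, average, method = "spearman")

# Same check without the final book, to see how much it drives the trend.
rho_without_last <- cor(book_order[-7], average[-7], method = "spearman")

print(c(all_books = rho, without_last = rho_without_last))
```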