For my project, I looked at the most common words across all seven Harry Potter books and analyzed their sentiment. My hypothesis is that the most common words in the seven books will be the main characters' names. I am interested to see which other words are most common and whether they are dark or positive. I will look at sentiment in three ways using the AFINN lexicon: first, the average sentiment of each book; second, the average sentiment by chapter number across all seven books combined; and third, the average sentiment by chapter within each book. I am looking forward to seeing which parts are the most negative and the most positive.

Wordcloud for all 7 books - the most common words:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(wordcloud2)
library(tidytext)
library(ggthemes)
hp_words <- read_csv("hp_words.csv")
## New names:
## • `` -> `...1`
## Rows: 1089386 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): book, word
## dbl (2): ...1, chapter
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The word cloud shows the 50 most common words across all seven Harry Potter books, after removing stop words.

# Drop stop words, count each remaining word, and plot the top 50
hp_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  head(50) %>%
  wordcloud2(size = 1.6)
## Joining, by = "word"
# Same counts as a table instead of a word cloud
hp_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  head(50)
## Joining, by = "word"
## # A tibble: 50 × 2
##    word           n
##    <chr>      <int>
##  1 harry      16557
##  2 ron         5750
##  3 hermione    4912
##  4 dumbledore  2873
##  5 looked      2344
##  6 professor   2006
##  7 hagrid      1732
##  8 time        1713
##  9 wand        1639
## 10 eyes        1604
## # … with 40 more rows
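
The counting step above can be illustrated on a tiny made-up example. This is only a sketch: `toy_words` and `toy_stops` are hypothetical stand-ins for `hp_words` and `stop_words`, and the words in them are invented for illustration.

```r
library(dplyr)

# Made-up words standing in for the `word` column of hp_words
toy_words <- data.frame(
  word = c("harry", "the", "ron", "harry", "and", "wand", "harry", "ron"),
  stringsAsFactors = FALSE
)

# Hypothetical stop word list standing in for tidytext's stop_words
toy_stops <- data.frame(
  word = c("the", "and"),
  stringsAsFactors = FALSE
)

toy_top <- toy_words %>%
  anti_join(toy_stops, by = "word") %>%  # keep only words not in the stop list
  count(word, sort = TRUE) %>%           # one row per word, most frequent first
  head(2)                                # keep the two most common words

print(toy_top)
```

Passing `by = "word"` explicitly also avoids the `Joining, by = "word"` message that appears in the output above.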

I am not surprised that the main characters' names are the most common words.

Sentiment:

I found the average AFINN sentiment for each of the seven Harry Potter books. Based on the results, Deathly Hallows and Prisoner of Azkaban had the most negative average sentiment, so the third and the last books were the most negative overall. I was not surprised by these findings, as I personally thought these were the darkest of the seven books.

# Average AFINN score per book (only words found in the lexicon are scored)
hp_words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  group_by(book) %>%
  summarise(mean = mean(value)) %>%
  ggplot(aes(mean, book)) +
  geom_col() +
  xlim(-0.6, 0.6)
## Joining, by = "word"
## Joining, by = "word"
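
To make the averaging concrete, here is a small sketch of the same steps on made-up data. `toy_words` and `toy_afinn` are hypothetical: the rows are invented, and the scores are illustrative values in the AFINN style (integers from -5 to 5), not looked up from the real lexicon.

```r
library(dplyr)

# Made-up word table with the same kind of columns as hp_words
toy_words <- data.frame(
  book = c("Book A", "Book A", "Book A", "Book B", "Book B", "Book B"),
  word = c("happy", "the", "death", "pain", "death", "wand"),
  stringsAsFactors = FALSE
)

# Hypothetical mini lexicon standing in for get_sentiments("afinn")
toy_afinn <- data.frame(
  word  = c("happy", "death", "pain"),
  value = c(3, -2, -2),
  stringsAsFactors = FALSE
)

toy_means <- toy_words %>%
  inner_join(toy_afinn, by = "word") %>%  # words without a score are dropped
  group_by(book) %>%
  summarise(mean = mean(value), .groups = "drop")

print(toy_means)
```

Note that `inner_join()` keeps only words that appear in the lexicon, so each book's mean is the average over its scored words, not over every word in the book.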

Sentiment by chapter from all 7 books:

I found the average AFINN sentiment by chapter number across all seven books combined. The graph shows that the least negative chapters are towards the beginning and the most negative chapters are towards the end. This matches the falling action in the hero’s journey of Harry Potter: the early chapters of the books are more positive, while the end of the graph shows the downturn in the hero’s journey.

Source: https://nicholasgilmore.medium.com/the-heros-journey-in-each-of-the-7-harry-potter-books-8b5bccf75be6

# Average AFINN score per chapter number, pooled over all seven books
hp_words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  group_by(chapter) %>%
  summarise(mean = mean(value)) %>%
  ggplot(aes(chapter, mean)) +
  geom_smooth()
## Joining, by = "word"
## Joining, by = "word"
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Sentiment by chapter and book:

I found the average AFINN sentiment by chapter again, but this time separated by book so that you can see how each book progresses. I removed the confidence interval for the sake of readability. The big takeaway from this graph is that Half-Blood Prince reaches more negative values than any other book. In fact, its chapters include both the most negative and the least negative chapters of all seven books.

# Average AFINN score per chapter within each book, one smoothed line per book
hp_words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  group_by(chapter, book) %>%
  summarise(mean = mean(value)) %>%
  ggplot(aes(chapter, mean, color = book)) +
  geom_smooth(se = FALSE)
## Joining, by = "word"
## Joining, by = "word"
## `summarise()` has grouped output by 'chapter'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
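
Because `geom_smooth()` draws a loess-smoothed curve, the highest and lowest points of each line are not necessarily the highest and lowest chapter means. One possible way to check the extremes directly is to rank the per-chapter means themselves. The sketch below assumes a table with one row per book and chapter; `toy_chapter_means` and all of its values are made up for illustration and are not results from the books.

```r
library(dplyr)

# Made-up per-chapter means standing in for the summarise() output above
toy_chapter_means <- data.frame(
  book    = c("Book A", "Book A", "Book A", "Book B", "Book B", "Book B"),
  chapter = c(1, 2, 3, 1, 2, 3),
  mean    = c(0.2, -0.9, 0.4, -0.1, -0.3, 0.1),
  stringsAsFactors = FALSE
)

# Row with the lowest mean score (most negative chapter)
most_negative <- toy_chapter_means %>% slice_min(mean, n = 1)

# Row with the highest mean score (least negative chapter)
least_negative <- toy_chapter_means %>% slice_max(mean, n = 1)

print(most_negative)
print(least_negative)
```

In this toy table both extremes fall in the same book, which is the kind of pattern the graph suggests for Half-Blood Prince.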

Conclusion:

In conclusion, the most significant finding of the project was that Half-Blood Prince has both the most and the least negative chapters of all seven books. I also found that, overall, the sentiment of the books follows the hero's journey of Harry Potter, rising in the beginning and falling at the end. In addition, of the seven books, Prisoner of Azkaban and Deathly Hallows were the most negative.