CNN did a special report on it: a few weeks ago, it was twenty years since the first Harry Potter novel, the Sorcerer’s/Philosopher’s Stone, was published. In honour of the series, I decided to start a text analysis and visualization project, which my other half wittily dubbed Harry Plotter. With this project, I intend to demonstrate how Hadley Wickham’s tidyverse and packages such as tidytext (free book), which build on the tidyverse principles, have taken programming in R to an all-new level. Moreover, I just enjoy making pretty graphs : )
I hope you will enjoy them as much as I do.
First, we need to set up our environment in RStudio. We will need several packages for our analyses. Most importantly, Bradley Boehmke was nice enough to gather all Harry Potter books in his harrypotter package on GitHub. We need devtools to install that package the first time, but from then on we can load it as usual. Next, we load the tidytext package, which automates and tidies a lot of the functionality provided by earlier, more complex text mining packages such as tm. We also need plyr for one specific function (i.e., ldply()). The core tidyverse packages, including ggplot2, dplyr, and tidyr, which I use in almost all of my professional and personal projects, can be loaded in a single bundle. Finally, we load the wordcloud visualization package, which draws on the earlier mentioned tm.
After loading the packages, I set some default options, the details of which are in the comments.
# SETUP ####
# LOAD IN PACKAGES
# library(devtools)
# devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
library(tidytext)
library(plyr)
library(tidyverse)
library(wordcloud)
# SET WORKING DIRECTORY AND OPTIONS
setwd("C:/Users/u1235656/stack/PhD/VDLogic/projects/harry plotter")
options(stringsAsFactors = F, # do not convert upon loading
scipen = 999, # do not convert numbers to e-values
max.print = 200) # stop printing after 200 values
# VISUALIZATION SETTINGS
theme_set(theme_light()) # set default ggplot theme to light
w = 12 # default plot width
h = 8 # default plot height
fs = 12 # default plot font size
With RStudio all set, it is time to load the data. From harrypotter, we retrieve the text of each book, which we then “pipe” (%>%, another magical function from the tidyverse, specifically magrittr) along to bind all objects into a single dataframe. Here, each row represents a book, with the text of each chapter stored in a separate column. We want tidy data, so we use tidyr’s gather() function to turn each column into grouped rows: a long data format. After some additional cleaning, we use tidytext’s unnest_tokens() function to retrieve the tokens from the chapters, in this case single words.
# LOAD IN BOOK CHAPTERS
# TRANSFORM TO TOKENIZED DATASET
hp_words <- list(
philosophers_stone = philosophers_stone,
chamber_of_secrets = chamber_of_secrets,
prisoner_of_azkaban = prisoner_of_azkaban,
goblet_of_fire = goblet_of_fire,
order_of_the_phoenix = order_of_the_phoenix,
half_blood_prince = half_blood_prince,
deathly_hallows = deathly_hallows
) %>%
ldply(rbind) %>% # bind all chapter text to dataframe columns
mutate(book = factor(seq_along(.id), labels = .id)) %>% # identify associated book
select(-.id) %>% # remove ID column
gather(key = 'chapter', value = 'text', -book) %>% # gather chapter columns to rows
filter(!is.na(text)) %>% # delete the rows/chapters without text
mutate(chapter = as.integer(chapter)) %>% # chapter id to numeric
unnest_tokens(word, text, token = 'words') # tokenize data frame
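To see what gather() and unnest_tokens() do in isolation, here is a minimal sketch on a made-up data frame. The object names (toy, toy_words) and the chapter texts are hypothetical stand-ins and not part of the harrypotter package; the shape merely mimics the one-row-per-book, one-column-per-chapter layout described above.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# hypothetical toy data: one row per book, one column per chapter
toy <- data.frame(
  book = c("book_a", "book_b"),
  `1`  = c("The boy who lived", "Dobby warns Harry"),
  `2`  = c("The vanishing glass", NA),   # book_b has no second chapter
  check.names = FALSE,
  stringsAsFactors = FALSE
)

toy_words <- toy %>%
  gather(key = "chapter", value = "text", -book) %>%  # wide to long: one row per book-chapter
  filter(!is.na(text)) %>%                            # drop chapters that do not exist
  mutate(chapter = as.integer(chapter)) %>%           # column names "1", "2" become integers
  unnest_tokens(word, text, token = "words")          # one row per lowercased word

toy_words
```

Each of the three existing chapters is split into its words, so the result has one row per word with the book and chapter carried along.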
Let’s have a look at our current data format.
head() prints the first rows (default n = 6) of any dataframe piped in whereas tail() returns the last rows.
# EXAMINE FIRST AND LAST WORDS OF SAGA
hp_words %>% head()
## book chapter word
## 1 philosophers_stone 1 the
## 1.1 philosophers_stone 1 boy
## 1.2 philosophers_stone 1 who
## 1.3 philosophers_stone 1 lived
## 1.4 philosophers_stone 1 mr
## 1.5 philosophers_stone 1 and
hp_words %>% tail()
## book chapter word
## 200.7629 order_of_the_phoenix 38 dudley
## 200.7630 order_of_the_phoenix 38 hurrying
## 200.7631 order_of_the_phoenix 38 along
## 200.7632 order_of_the_phoenix 38 in
## 200.7633 order_of_the_phoenix 38 his
## 200.7634 order_of_the_phoenix 38 wake
Okay let’s really explore these data. A good first step would be to examine word frequencies. Any guesses what the most frequent words are? Place your bets…
# PLOT WORD FREQUENCY PER BOOK
hp_words %>%
group_by(book, word) %>%
anti_join(stop_words, by = "word") %>% # delete stopwords
count() %>% # summarize count per word per book
arrange(desc(n)) %>% # highest freq on top
group_by(book) %>% # regroup by book only
mutate(top = seq_along(word)) %>% # identify rank within group
filter(top <= 15) %>% # retain top 15 frequent words
# create barplot
ggplot(aes(x = -top, fill = book)) +
geom_bar(aes(y = n), stat = 'identity', col = 'black') +
# make sure words are printed either in or next to bar
geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
label = word), size = fs/3, hjust = "left") +
theme(legend.position = 'none', # get rid of legend
text = element_text(size = fs), # determine fontsize
axis.text.x = element_text(angle = 45, hjust = 1, size = fs/1.5), # rotate x text
axis.ticks.y = element_blank(), # remove y ticks
axis.text.y = element_blank()) + # remove y text
labs(y = "Word count", x = "", # add labels
title = "Harry Plotter: Most frequent words throughout the saga") +
facet_grid(. ~ book) + # separate plot for each book
coord_flip() # flip axes
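The counting and ranking steps of this pipeline can be tried out without the plotting code. The sketch below uses a tiny made-up word list (toy_words is hypothetical) and swaps in dplyr’s count(..., sort = TRUE) and row_number() as a more compact alternative to the group_by/count/arrange/seq_along combination used above; it is one possible formulation, not the only one.

```r
library(dplyr)
library(tidytext)  # provides the stop_words data frame

# hypothetical toy tokens for two books
toy_words <- data.frame(
  book = c("a", "a", "a", "a", "b", "b", "b"),
  word = c("harry", "harry", "the", "wand", "ron", "ron", "and"),
  stringsAsFactors = FALSE
)

toy_top <- toy_words %>%
  anti_join(stop_words, by = "word") %>%  # remove stopwords such as "the" and "and"
  count(book, word, sort = TRUE) %>%      # count per book and word, highest first
  group_by(book) %>%
  mutate(top = row_number()) %>%          # rank within each book
  filter(top <= 1) %>%                    # keep only the most frequent word per book
  ungroup()

toy_top
```

Raising the threshold in filter() to 15 would mirror the top 15 used in the plot.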
Unsurprisingly, Harry is the most common word in every single book. Dumbledore’s role as an (irresponsible) mentor seems to become greater as the storyline progresses. The plot also nicely depicts the key characters per book.
Finally, J.K. seems to write somewhat obsessively about eyes looking at doors?
In this first post of the Harry Plotter project, we will additionally examine the sentiment over the course of the books. tidytext includes three famous sentiment dictionaries: AFINN, bing, and nrc.
I identified all words in the Harry Potter books that occur in any of these dictionaries and I bound them into a long dataframe, which looks something like this:
# EXTRACT SENTIMENT WITH THREE DICTIONARIES
hp_senti <- bind_rows(
# 1 AFINN
hp_words %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
filter(score != 0) %>% # delete neutral words
mutate(sentiment = ifelse(score < 0, 'negative', 'positive')) %>% # identify sentiment
mutate(score = abs(score)) %>% # all scores to positive
group_by(book, chapter, sentiment) %>%
mutate(dictionary = 'afinn'), # create dictionary identifier
# 2 BING
hp_words %>%
inner_join(get_sentiments("bing"), by = "word") %>%
group_by(book, chapter, sentiment) %>%
mutate(dictionary = 'bing'), # create dictionary identifier
# 3 NRC
hp_words %>%
inner_join(get_sentiments("nrc"), by = "word") %>%
group_by(book, chapter, sentiment) %>%
mutate(dictionary = 'nrc') # create dictionary identifier
)
# EXAMINE FIRST SENTIMENT WORDS
hp_senti %>% head()
## # A tibble: 6 x 6
## # Groups: book, chapter, sentiment [2]
## book chapter word score sentiment dictionary
## <fctr> <int> <chr> <dbl> <chr> <chr>
## 1 philosophers_stone 1 proud 2 positive afinn
## 2 philosophers_stone 1 perfectly 3 positive afinn
## 3 philosophers_stone 1 thank 2 positive afinn
## 4 philosophers_stone 1 strange 1 negative afinn
## 5 philosophers_stone 1 nonsense 2 negative afinn
## 6 philosophers_stone 1 big 1 positive afinn
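The dictionary lookup itself is just an inner join followed by a recode. Here is a minimal sketch with a hand-made mini-lexicon (toy_lexicon and toy_words are hypothetical) so that it runs without downloading any dictionary. It assumes the AFINN-style numeric column is called score, as in the output above; depending on the installed tidytext version, that column may be named value instead, in which case the code above would need the corresponding rename.

```r
library(dplyr)

# hypothetical mini-lexicon mimicking an AFINN-style dictionary
toy_lexicon <- data.frame(
  word  = c("proud", "strange", "nonsense"),
  score = c(2, -1, -2),
  stringsAsFactors = FALSE
)

# hypothetical tokens; "the" has no sentiment entry
toy_words <- data.frame(
  chapter = c(1, 1, 1, 2),
  word    = c("proud", "the", "strange", "nonsense"),
  stringsAsFactors = FALSE
)

toy_senti <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%  # keep only words found in the lexicon
  mutate(sentiment = ifelse(score < 0, "negative", "positive"),  # sign becomes a label
         score = abs(score))                # magnitude only

toy_senti
```

Words without a dictionary entry simply drop out of the join, which is why hp_senti is much shorter than hp_words.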
Let’s see which words carry most sentiment. Wordclouds are not my favorite visualization tool but they do allow for a quick display of frequencies among a large body of words.
hp_senti %>%
group_by(word) %>%
count() %>% # summarize count per word
mutate(sqrt_n = sqrt(n)) %>% # take root to decrease outlier impact
with(wordcloud(word, sqrt_n, max.words = 100))
It appears we need to correct for some words that occur in the sentiment dictionaries but have a different, unsentimental meaning in J.K. Rowling’s books. The least we can do is get rid of the two character names.
# DELETE SENTIMENT FOR CHARACTER NAMES
hp_senti_sel <- hp_senti %>% filter(!word %in% c("harry","moody"))
Let’s quickly sketch the remaining words per sentiment.
# VISUALIZE MOST FREQUENT WORDS PER SENTIMENT
hp_senti_sel %>% # NAMES EXCLUDED
group_by(word, sentiment) %>%
count() %>% # summarize count per word per sentiment
group_by(sentiment) %>%
arrange(sentiment, desc(n)) %>% # most frequent on top
mutate(top = seq_along(word)) %>% # identify rank within group
filter(top <= 15) %>% # keep top 15 frequent words
ggplot(aes(x = -top, fill = factor(sentiment))) +
# create barplot
geom_bar(aes(y = n), stat = 'identity', col = 'black') +
# make sure words are printed either in or next to bar
geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
label = word), size = fs/3, hjust = "left") +
theme(legend.position = 'none', # remove legend
text = element_text(size = fs), # determine fontsize
axis.text.x = element_text(angle = 45, hjust = 1), # rotate x text
axis.ticks.y = element_blank(), # remove y ticks
axis.text.y = element_blank()) + # remove y text
labs(y = "Word count", x = "", # add manual labels
title = "Harry Plotter: Words carrying sentiment as counted throughout the saga",
subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
facet_grid(. ~ sentiment) + # separate plot for each sentiment
coord_flip() # flip axes
Ok, that looks pretty good. Let’s continue to the final two visualisations.
Positive and negative sentiment is measured in each of the three dictionaries, which allows us to compare and contrast scores. Let’s examine what the bipolar sentiment looks like throughout the Harry Potter books.
# VISUALIZE POSITIVE/NEGATIVE SENTIMENT OVER TIME
plot_sentiment <- hp_senti_sel %>% # NAMES EXCLUDED
group_by(dictionary, sentiment, book, chapter) %>%
summarize(score = sum(score), # summarize AFINN scores
count = n(), # summarize bing and nrc counts
# move bing and nrc counts to score
score = ifelse(is.na(score), count, score)) %>%
filter(sentiment %in% c('positive','negative')) %>% # only retain bipolar sentiment
mutate(score = ifelse(sentiment == 'negative', -score, score)) %>% # reverse negative values
# create area plot
ggplot(aes(x = chapter, y = score)) +
geom_area(aes(fill = score > 0),stat = 'identity') +
scale_fill_manual(values = c('red','green')) + # change colors
# add black smoothed line without standard error
geom_smooth(method = "loess", se = F, col = "black") +
theme(legend.position = 'none', # remove legend
text = element_text(size = fs)) + # change font size
labs(x = "Chapter", y = "Sentiment score", # add labels
title = "Harry Plotter: Sentiment during the saga",
subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
# separate plot per book and dictionary and free up x-axes
facet_grid(dictionary ~ book, scales = "free_x")
plot_sentiment
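The summarizing step before the plot does two things: it falls back to word counts where no numeric score exists (bing and nrc), and it flips the sign for negative sentiment. A minimal sketch on made-up rows (toy_senti and toy_scores are hypothetical) shows the resulting numbers; it splits the logic over summarize() and mutate() for readability and assumes a dplyr version that supports the .groups argument.

```r
library(dplyr)

# hypothetical sentiment rows: AFINN has scores, bing only has labels
toy_senti <- data.frame(
  dictionary = c("afinn", "afinn", "afinn", "bing", "bing"),
  chapter    = c(1, 1, 1, 1, 1),
  sentiment  = c("positive", "negative", "negative", "negative", "negative"),
  score      = c(2, 1, 3, NA, NA),
  stringsAsFactors = FALSE
)

toy_scores <- toy_senti %>%
  group_by(dictionary, sentiment, chapter) %>%
  summarize(score = sum(score),   # NA for dictionaries without scores
            count = n(),          # number of sentiment words
            .groups = "drop") %>%
  mutate(score = ifelse(is.na(score), count, score),                # fall back to counts
         score = ifelse(sentiment == "negative", -score, score))   # negative below zero

toy_scores
```

For these rows, AFINN yields -4 (negative) and 2 (positive), while bing yields -2 from its two negative words.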
Let’s draw that same plot again but zoomed in on the smoothed average.
plot_sentiment + coord_cartesian(ylim = c(-100,50)) # zoom in plot
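A short note on why coord_cartesian() is used for zooming: it only changes the visible window, whereas setting limits on the scale (for example with ylim()) removes out-of-range values before the smoother is fitted. The sketch below illustrates the difference on made-up numbers; toy, p, p_zoom, and p_cut are hypothetical names.

```r
library(ggplot2)

# hypothetical data with two extreme values
toy <- data.frame(
  x = 1:10,
  y = c(-300, -50, -20, 10, 20, -10, 5, 30, -40, 400)
)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, col = "black")

# zoom only: the smoother still uses all ten points
p_zoom <- p + coord_cartesian(ylim = c(-100, 50))

# scale limits: the two extreme points are dropped before smoothing
p_cut <- p + ylim(-100, 50)
```

Printing p_zoom and p_cut side by side should show different smoothed lines, because only the first one is fitted on the full data.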
Sentiment seems overly negative throughout the series. Particularly salient is that every book seems to end on a down note, with the exception of the Prisoner of Azkaban. Moreover, sentiment seems to become more volatile in books four through six: all have a sad start, after which the stories brighten up, only to end in misery again. In her final book, J.K. Rowling depicts a world that is about to be conquered by the Dark Lord, and the average sentiment clearly reflects this grim outlook.
On a separate note, maybe it’s this specific text, but the bing sentiment dictionary seems to be the dimmest of the three.
As a final visualization, let’s examine how the other emotions, which are included in the nrc dictionary, are reflected in the books.
# VISUALIZE EMOTIONAL SENTIMENT OVER TIME
hp_senti_sel %>% # NAMES EXCLUDED
filter(!sentiment %in% c('negative','positive')) %>% # only retain other sentiments (nrc)
group_by(sentiment, book, chapter) %>%
count() %>% # summarize count
# create area plot
ggplot(aes(x = chapter, y = n)) +
geom_area(aes(fill = sentiment), stat = 'identity') +
# add black smoothing line without standard error
geom_smooth(aes(fill = sentiment), method = "loess", se = F, col = 'black') +
theme(legend.position = 'none', # remove legend
text = element_text(size = fs)) + # change font size
labs(x = "Chapter", y = "Emotion score", # add labels
title = "Harry Plotter: Emotions during the saga",
subtitle = "Using tidytext and the nrc sentiment dictionary") +
# separate plots per sentiment and book and free up x-axes
facet_grid(sentiment ~ book, scales = "free_x")
Unfortunately, this plot is less insightful. For starters, either the eight emotions draw on similar words or J.K. Rowling combines them all in her writing simultaneously: patterns seem highly similar across emotions; take, for example, the Chamber of Secrets. It would be interesting to examine these emotions, the words behind them, and the statistical differences along the storyline in more detail, but let’s leave this for a later post.
I hope you enjoyed this and please subscribe or come back to see any subsequent analyses.