CNN did a special report on it: a few weeks ago, it was twenty years since the first Harry Potter novel (the Sorcerer’s/Philosopher’s Stone) was published. In honour of the series, I decided to start a text analysis and visualization project, which my other half wittily dubbed Harry Plotter. With this project, I intend to demonstrate how Hadley Wickham’s tidyverse and packages such as tidytext (free book), which build on the tidyverse principles, have taken programming in R to an all-new level. Moreover, I just enjoy making pretty graphs : )

I hope you will enjoy them as much as I do.

Setup

First, we need to set up our environment in RStudio. We will be needing several packages for our analyses. Most importantly, Bradley Boehmke was nice enough to gather all Harry Potter books in his harrypotter package on GitHub. We need devtools to install that package the first time, but from then on we can load it normally. Next, we load the tidytext package, which automates and tidies a lot of the functionality provided by prior, more complex text mining packages such as tm. We also need plyr for one specific function (i.e., ldply()). Note that plyr is loaded before the tidyverse, so that dplyr’s functions take precedence where the two overlap. We can load the other tidyverse packages in a single bundle, including ggplot2, dplyr, and tidyr, which I use in almost all of my professional and personal projects. Finally, we load the wordcloud visualization package, which draws on the earlier mentioned tm.

After loading the packages, I set some default options, the details of which are in the comments.

# SETUP ####

# LOAD IN PACKAGES
# library(devtools)
# devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
library(tidytext)
library(plyr)
library(tidyverse)
library(wordcloud)

# SET WORKING DIRECTORY AND OPTIONS
setwd("C:/Users/u1235656/stack/PhD/VDLogic/projects/harry plotter")
options(stringsAsFactors = FALSE, # do not convert strings to factors upon loading
        scipen = 999, # do not convert numbers to e-values
        max.print = 200) # stop printing after 200 values

# VISUALIZATION SETTINGS
theme_set(theme_light()) # set default ggplot theme to light
w <- 12 # default plot width
h <- 8 # default plot height
fs <- 12 # default plot font size
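
Before loading everything, it can help to check which of these packages are actually installed. The snippet below is a minimal sketch using only base R; the needed vector simply mirrors the packages named above, and is_installed and missing_pkgs are names I made up for illustration.

```r
# packages used in this post (harrypotter comes from GitHub, the rest from CRAN)
needed <- c("harrypotter", "tidytext", "plyr", "tidyverse", "wordcloud")

# requireNamespace() returns TRUE/FALSE without attaching the package
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# hypothetical helper vector: the packages that still need installing
missing_pkgs <- needed[!is_installed]
print(missing_pkgs)
```

Anything that shows up in missing_pkgs can then be installed once, after which the library() calls above should run without errors.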

Data preparation

With RStudio all set, it is time to load the data. From harrypotter, we retrieve the text of each book, which we then “pipe” (%>%, another magical function from the tidyverse, specifically from magrittr) along to bind all objects into a single dataframe. Here, each row represents a book, with the text of each chapter stored in a separate column. We want tidy data, so we use tidyr’s gather() function to turn each column into grouped rows: a long data format. After some additional cleaning, we use tidytext’s unnest_tokens() function to retrieve the tokens from the chapters, in this case single words.

# LOAD IN BOOK CHAPTERS
# TRANSFORM TO TOKENIZED DATASET
hp_words <- list(
 philosophers_stone = philosophers_stone,
 chamber_of_secrets = chamber_of_secrets,
 prisoner_of_azkaban = prisoner_of_azkaban,
 goblet_of_fire = goblet_of_fire,
 order_of_the_phoenix = order_of_the_phoenix,
 half_blood_prince = half_blood_prince,
 deathly_hallows = deathly_hallows
) %>%
 ldply(rbind) %>% # bind all chapter text to dataframe columns
 mutate(book = factor(seq_along(.id), labels = .id)) %>% # identify associated book
 select(-.id) %>% # remove ID column
 gather(key = 'chapter', value = 'text', -book) %>% # gather chapter columns to rows
 filter(!is.na(text)) %>% # delete the rows/chapters without text
 mutate(chapter = as.integer(chapter)) %>% # chapter id to numeric
 unnest_tokens(word, text, token = 'words') # tokenize data frame

Let’s have a look at our current data format.

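head() prints the first rows (default n = 6) of any dataframe piped in, whereas tail() returns the last rows. Note that gather() stacks the data chapter by chapter, so the final rows belong to chapter 38 of the Order of the Phoenix rather than to the end of the last book.

# EXAMINE FIRST AND LAST ROWS OF THE DATA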
hp_words %>% head()

##                   book chapter  word
## 1   philosophers_stone       1   the
## 1.1 philosophers_stone       1   boy
## 1.2 philosophers_stone       1   who
## 1.3 philosophers_stone       1 lived
## 1.4 philosophers_stone       1    mr
## 1.5 philosophers_stone       1   and

hp_words %>% tail()

##                          book chapter     word
## 200.7629 order_of_the_phoenix      38   dudley
## 200.7630 order_of_the_phoenix      38 hurrying
## 200.7631 order_of_the_phoenix      38    along
## 200.7632 order_of_the_phoenix      38       in
## 200.7633 order_of_the_phoenix      38      his
## 200.7634 order_of_the_phoenix      38     wake
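
To make the reshaping and tokenizing steps above a bit more concrete, here is a small sketch on made-up data. The toy dataframe and its chapter texts are purely hypothetical stand-ins for the real books; the functions are the same gather() and unnest_tokens() used above.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# hypothetical toy data: one row per book, one column per chapter
toy <- data.frame(
  book = c("book_a", "book_b"),
  `1` = c("The boy who lived", "Dobby warns Harry"),
  `2` = c("The vanishing glass", NA), # book_b has no second chapter
  check.names = FALSE,
  stringsAsFactors = FALSE
)

toy_words <- toy %>%
  gather(key = "chapter", value = "text", -book) %>% # wide to long
  filter(!is.na(text)) %>% # drop chapters without text
  mutate(chapter = as.integer(chapter)) %>% # chapter id to integer
  unnest_tokens(word, text, token = "words") # one lowercased word per row

print(toy_words)
```

Each row of toy_words now holds a single word together with its book and chapter, which is the same long format as hp_words.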

Word frequency

Okay let’s really explore these data. A good first step would be to examine word frequencies. Any guesses what the most frequent words are? Place your bets…

# PLOT WORD FREQUENCY PER BOOK
hp_words %>%
  group_by(book, word) %>%
  anti_join(stop_words, by = "word") %>% # delete stopwords
  count() %>% # summarize count per word per book
  arrange(desc(n)) %>% # highest freq on top
  group_by(book) %>% # regroup by book only
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # retain top 15 frequent words
  # create barplot
  ggplot(aes(x = -top, fill = book)) + 
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # get rid of legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1, size = fs/1.5), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add labels
       title = "Harry Plotter: Most frequent words throughout the saga") +
  facet_grid(. ~ book) + # separate plot for each book
  coord_flip() # flip axes

Unsurprisingly, Harry is the most common word in every single book. Dumbledore’s role as an (irresponsible) mentor seems to become greater as the storyline progresses. The plot also nicely depicts the key characters per book:

Lockhart and Dobby in book 2,
Lupin in book 3,
Moody and Crouch in book 4,
Umbridge in book 5,
Ginny in book 6,
and the final confrontation with He Who Must Not Be Named in book 7.

Finally, J.K. seems to write somewhat obsessively about eyes looking at doors?
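
For those curious about the ranking step behind the frequency plot, the sketch below repeats it on a tiny made-up word list. The toy_words data are hypothetical, and I use count() with sort = TRUE plus row_number() as a compact alternative to the arrange() and seq_along() combination above; it is one way to do it, not the only one.

```r
library(dplyr)
library(tidytext)

# hypothetical toy data: one word per row
toy_words <- tibble(
  book = c("a", "a", "a", "a", "b", "b", "b"),
  word = c("harry", "the", "harry", "wand", "ron", "the", "ron")
)

top_words <- toy_words %>%
  anti_join(stop_words, by = "word") %>% # delete stopwords such as "the"
  count(book, word, sort = TRUE) %>% # count per word per book, highest first
  group_by(book) %>%
  mutate(top = row_number()) %>% # rank within each book
  filter(top <= 1) %>% # keep only the most frequent word per book
  ungroup()

print(top_words)
```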

Estimating sentiment

In this first post of the Harry Plotter project, we will additionally examine the sentiment over the course of the books. tidytext includes three famous sentiment dictionaries:

AFINN: including bipolar sentiment scores ranging from -5 to 5
bing: including binary sentiment labels (positive or negative)
nrc: including sentiment scores for many different emotions (e.g., anger, joy, and surprise)
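
To get a feel for what such a dictionary looks like, here is a quick sketch that peeks at bing, which ships with tidytext itself. As far as I know, recent tidytext versions fetch AFINN and nrc through the textdata package and ask you to confirm a download first, so I leave those out of this example.

```r
library(dplyr)
library(tidytext)

# the bing lexicon: one row per word with a positive/negative label
bing <- get_sentiments("bing")

head(bing)

# number of words per label
bing %>% count(sentiment)
```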

I identified all words in the Harry Potter books that occur in any of these dictionaries and I bound them into a long dataframe, which looks something like this:

# EXTRACT SENTIMENT WITH THREE DICTIONARIES
hp_senti <- bind_rows(
  # 1 AFINN 
  hp_words %>% 
    inner_join(get_sentiments("afinn"), by = "word") %>%
    # note: newer tidytext versions name the AFINN column `value` instead of `score`
    filter(score != 0) %>% # delete neutral words
    mutate(sentiment = ifelse(score < 0, 'negative', 'positive')) %>% # identify sentiment
    mutate(score = abs(score)) %>% # all scores to positive
    group_by(book, chapter, sentiment) %>% 
    mutate(dictionary = 'afinn'), # create dictionary identifier
  # 2 BING 
  hp_words %>% 
    inner_join(get_sentiments("bing"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'bing'), # create dictionary identifier
  # 3 NRC 
  hp_words %>% 
    inner_join(get_sentiments("nrc"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'nrc') # create dictionary identifier
)

# EXAMINE FIRST SENTIMENT WORDS
hp_senti %>% head()

## # A tibble: 6 x 6
## # Groups:   book, chapter, sentiment [2]
##                 book chapter      word score sentiment dictionary
##               <fctr>   <int>     <chr> <dbl>     <chr>      <chr>
## 1 philosophers_stone       1     proud     2  positive      afinn
## 2 philosophers_stone       1 perfectly     3  positive      afinn
## 3 philosophers_stone       1     thank     2  positive      afinn
## 4 philosophers_stone       1   strange     1  negative      afinn
## 5 philosophers_stone       1  nonsense     2  negative      afinn
## 6 philosophers_stone       1       big     1  positive      afinn
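
The AFINN branch above boils down to a join followed by a sign split. Below is a minimal sketch of that logic with a hand-made lexicon; toy_lexicon is hypothetical and only borrows three words and their scores from the output shown above, so it is not the real AFINN dictionary.

```r
library(dplyr)

# hypothetical toy data
toy_words <- tibble(word = c("proud", "strange", "wand", "nonsense"))
toy_lexicon <- tibble(
  word = c("proud", "strange", "nonsense"),
  score = c(2, -1, -2)
)

toy_senti <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>% # keep only words in the lexicon
  mutate(sentiment = ifelse(score < 0, "negative", "positive"), # sign becomes label
         score = abs(score)) # magnitude stays as score

print(toy_senti)
```

Words without an entry in the lexicon ("wand" here) simply drop out of the result, which is why hp_senti is so much shorter than hp_words.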

Wordcloud

Let’s see which words carry the most sentiment. Wordclouds are not my favorite visualization tool, but they do allow for a quick display of frequencies among a large body of words.

hp_senti %>%
  group_by(word) %>%
  count() %>% # summarize count per word
  mutate(sqrt_n = sqrt(n)) %>% # take square root to decrease outlier impact
  with(wordcloud(word, sqrt_n, max.words = 100))

It appears we need to correct for some words that occur in the sentiment dictionaries but have a different, unsentimental meaning in J.K. Rowling’s books. The least we can do is get rid of the two character names, Harry and Moody.

# DELETE SENTIMENT FOR CHARACTER NAMES
hp_senti_sel <- hp_senti %>% filter(!word %in% c("harry","moody"))

Words per sentiment

Let’s quickly sketch the remaining words per sentiment.

# VISUALIZE MOST FREQUENT WORDS PER SENTIMENT
hp_senti_sel %>% # NAMES EXCLUDED
  group_by(word, sentiment) %>%
  count() %>% # summarize count per word per sentiment
  group_by(sentiment) %>%
  arrange(sentiment, desc(n)) %>% # most frequent on top
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # keep top 15 frequent words
  ggplot(aes(x = -top, fill = factor(sentiment))) + 
  # create barplot
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add manual labels
       title = "Harry Plotter: Words carrying sentiment as counted throughout the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
  facet_grid(. ~ sentiment) + # separate plot for each sentiment
  coord_flip() # flip axes

Ok, that looks pretty good. Let’s continue to the final two visualisations.

Positive and negative sentiment throughout the series

Positive and negative sentiment is measured in each of the three dictionaries, which allows us to compare and contrast scores. Let’s examine what the bipolar sentiment looks like throughout the Harry Potter books.

# VISUALIZE POSITIVE/NEGATIVE SENTIMENT OVER TIME
plot_sentiment <- hp_senti_sel %>% # NAMES EXCLUDED
  group_by(dictionary, sentiment, book, chapter) %>%
  summarize(score = sum(score), # summarize AFINN scores
            count = n(), # summarize bing and nrc counts
            # move bing and nrc counts to score 
            score = ifelse(is.na(score), count, score))  %>%
  filter(sentiment %in% c('positive','negative')) %>%   # only retain bipolar sentiment
  mutate(score = ifelse(sentiment == 'negative', -score, score)) %>% # reverse negative values
  # create area plot
  ggplot(aes(x = chapter, y = score)) +    
  geom_area(aes(fill = score > 0), stat = 'identity') +
  scale_fill_manual(values = c('red', 'green')) + # change colors
  # add black smoothed line without standard error
  geom_smooth(method = "loess", se = FALSE, col = "black") + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Sentiment score", # add labels
       title = "Harry Plotter: Sentiment during the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
  # separate plot per book and dictionary and free up x-axes
  facet_grid(dictionary ~ book, scale = "free_x")
plot_sentiment
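
The summarize() step in the plot code does two things at once: it sums the AFINN scores and falls back to plain word counts for dictionaries without scores. Here is a rough sketch of that idea on hypothetical toy data; I store the sum in a separate total column (a name I made up) to make the fallback easier to follow.

```r
library(dplyr)

# hypothetical toy data: afinn rows carry scores, bing rows do not
toy_senti <- tibble(
  dictionary = c("afinn", "afinn", "afinn", "bing", "bing", "bing"),
  chapter = c(1, 1, 1, 1, 1, 1),
  sentiment = c("positive", "negative", "negative",
                "positive", "positive", "negative"),
  score = c(2, 1, 2, NA, NA, NA)
)

toy_scores <- toy_senti %>%
  group_by(dictionary, sentiment, chapter) %>%
  summarize(total = sum(score), # NA for dictionaries without scores
            count = n(), # number of sentiment words
            .groups = "drop") %>%
  mutate(score = ifelse(is.na(total), count, total), # fall back to counts
         score = ifelse(sentiment == "negative", -score, score)) # flip negatives

print(toy_scores)
```

Keep in mind that this puts summed AFINN scores and raw bing/nrc counts on the same axis, so the rows of the facet grid are not strictly comparable in magnitude.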

Let’s draw that same plot again but zoomed in on the smoothed average.

plot_sentiment + coord_cartesian(ylim = c(-100,50)) # zoom in plot
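
A small aside on why I zoom with coord_cartesian() rather than ylim(): the former only changes the visible window, whereas the latter treats values outside the limits as missing before anything is drawn, which would also change a smoothed line. The sketch below illustrates the difference on a hypothetical toy dataset.

```r
library(ggplot2)

# hypothetical toy data with one extreme chapter
toy <- data.frame(
  chapter = 1:10,
  score = c(-5, -200, 10, 20, -30, 40, -10, 15, -20, 5)
)

p <- ggplot(toy, aes(x = chapter, y = score)) + geom_line()

# zooms the view: all ten points are still part of the plot data
p_zoom <- p + coord_cartesian(ylim = c(-100, 50))

# sets scale limits: the -200 value falls outside and is treated as missing
p_clip <- p + ylim(-100, 50)

# number of rows that reach the line layer in the zoomed plot
nrow(ggplot_build(p_zoom)$data[[1]])
```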

Sentiment seems predominantly negative throughout the series. Particularly salient is that every book seems to end on a down note, with the exception of the Prisoner of Azkaban. Moreover, sentiment seems to become more volatile in books four through six: all have a sad start after which the stories brighten up, just to end in misery again. In her final book, J.K. Rowling depicts a world that is about to be conquered by the Dark Lord, and the average sentiment clearly reflects this grim outlook.

On a separate note, maybe it’s this specific text, but the bing sentiment dictionary seems to be the gloomiest of the three.

Other emotions throughout the series

As a final visualization, let’s examine how other emotions, which are included in the nrc dictionary, are represented in the books.

# VISUALIZE EMOTIONAL SENTIMENT OVER TIME
hp_senti_sel %>% # NAMES EXCLUDED 
  filter(!sentiment %in% c('negative','positive')) %>% # only retain other sentiments (nrc)
  group_by(sentiment, book, chapter) %>%
  count() %>% # summarize count
  # create area plot
  ggplot(aes(x = chapter, y = n)) +
  geom_area(aes(fill = sentiment), stat = 'identity') + 
  # add black smoothing line without standard error
  geom_smooth(aes(fill = sentiment), method = "loess", se = FALSE, col = 'black') + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Emotion score", # add labels
       title = "Harry Plotter: Emotions during the saga",
       subtitle = "Using tidytext and the nrc sentiment dictionary") +
  # separate plots per sentiment and book and free up x-axes
  facet_grid(sentiment ~ book, scale = "free_x")

Unfortunately, this plot is less insightful. For starters, either the eight emotions draw on similar words, or J.K. Rowling combines them all in her writing simultaneously: patterns seem highly similar across emotions; take, for example, the Chamber of Secrets. It would be interesting to examine these emotions, the words behind them, and the statistical differences along the storyline in more detail, but let’s leave this for a later post.
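
One way to start checking the "similar words" explanation would be to count how many emotions each word is tagged with. The sketch below does this on a hypothetical toy lexicon (toy_nrc is made up, not the real nrc dictionary); with the real data, one could swap in the nrc rows of hp_senti_sel.

```r
library(dplyr)

# hypothetical toy lexicon: a word can carry several emotions
toy_nrc <- tibble(
  word = c("dark", "dark", "dark", "scream", "scream", "smile"),
  sentiment = c("fear", "sadness", "anger", "fear", "anger", "joy")
)

overlap <- toy_nrc %>%
  count(word, name = "n_emotions") %>% # number of emotions per word
  arrange(desc(n_emotions)) # most shared words on top

print(overlap)
```

If most frequent words turn out to carry several emotions at once, that alone could explain why the panels look so alike.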

I hope you enjoyed this and please subscribe or come back to see any subsequent analyses.

About the author

Paul van der Laken is a Ph.D. student at Tilburg University, sponsored by Shell. Paul has nearly five years of experience in People Analytics / HR Analytics, and his Ph.D. research focuses on how organizations may leverage their HR data to improve the effectiveness of their global mobility policy. Nevertheless, he has a broader interest in all things data and works on diverse dashboarding and data visualization projects in his spare time. Under the label of VDLogic, Paul furthermore provides post-graduate and in-house courses in data analysis as well as (pro bono) statistical consulting. Please visit his blog (www.paulvanderlaken.com) for more information.

Harry Plotter

Paul van der Laken

2 August 2017