Game of Thrones Book Analysis

Winter is here. Season 7 of Game of Thrones is out, so we thought it would be an opportune time to do a fun text analysis of the books!

Much of this analysis is done with the tidytext R package and the cleanNLP package (using a spaCy backend).

Data collection

To get the data into R, we'll read in the text files with the readLines() function and combine them into a single data frame.

# Load the packages we'll need
library(dplyr)
library(tidytext)

df <- tibble()

# Read data
for (i in 1:5) {
  
  # Read the text file for book i
  assign(paste0("book", i), readLines(paste0("got", i, ".txt")))
  
  # Convert it into a single-column data frame
  assign(paste0("book", i), tibble(get(paste0("book", i))))
  
  # Append the book's lines to the combined data frame
  df <- rbind(df, get(paste0("book", i)))

}

Now that we have the text of the books in a single dataframe, we need to remove lines in which there is no text.

# Column names
colnames(df) <- 'text'

# Remove rows that are empty
df <- df %>% filter(text != "")

Tidy the text

To get this data into a tidy format, we need each word (token) to have its own row. We can use the unnest_tokens() function to do this for us.

# Unnest the tokens
text_df <- df %>%
  unnest_tokens(word, text)

Nice! Now we want to remove words like "a" and "the" that appear frequently but don't provide much value. We'll remove these stop words by "anti-joining" them with our tidy data frame, thus making sure that all stop words are excluded.

# Get stop words
data(stop_words)

# Anti join stop words
text_df <- text_df %>%
  anti_join(stop_words, by = "word")

Now we can list and visualize the most frequently occurring words in the series.
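Here's a minimal sketch of one way to make that plot with ggplot2 (the original post shows the chart itself, so the top-20 cutoff and styling are assumptions):

# Count the words and plot the 20 most common
library(ggplot2)

text_df %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")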

At first glance, we can see that titles like "lord", "ser", and "king", as well as character names, occur most frequently. However, we don't have a good understanding of the context in which these words occur. One way we can address this is by finding and analyzing n-grams.

N-grams

An n-gram is a contiguous sequence of n words from a text; for example, a bigram is a pair of words (n = 2). We will use the unnest_tokens() function from the tidytext package to identify all the bigrams in the five Game of Thrones books and transform them into a tidy dataset.

# Get bigrams
got_bigrams <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

Now that we have our bigrams, we can visualize the most popular ones.
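As a quick sketch of that step (the post shows only the resulting chart), we can simply count the bigrams:

# Count the raw bigrams (stopwords are still included at this point)
got_bigrams %>%
  count(bigram, sort = TRUE)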

Oops, this list is full of stopwords! We can remove them by separating the two words in the bigram, removing stopwords, and reuniting the bigrams.

library(tidyr)

# Separate the two words
bigrams_separated <- got_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Filter out stopwords
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# Count the new bigrams
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

# Unite the bigrams to form words
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

Now we can plot the most common bigrams, excluding stopwords.
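The plot follows the same pattern as the earlier word-frequency sketch; again, the top-20 cutoff and styling are assumptions, since the post shows only the chart:

# Count and plot the 20 most common bigrams, stopwords removed
bigrams_united %>%
  count(bigram, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")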

Most of these bigrams are titles or names, which makes sense! Also hot pie. :)

Gendered verbs

This study by Matthew Jockers and Gabi Kirilloff utilizes text mining to examine 19th century novels and explore how gendered pronouns like he/she/him/her are associated with different verbs.

These researchers used the Stanford CoreNLP library to parse dependencies in sentences and find which verbs are connected to which pronouns, but we can also use a tidytext approach to find the most commonly occurring verbs that appear after these gendered pronouns. The two pronouns we'll examine here are "he" and "she".

Let's find "gendered" bigrams by finding all bigrams in which the first word is "he" or "she".

# Define our pronouns
pronouns <- c("he", "she")

# Get bigrams where the first word is a pronoun
gender_bigrams <- got_bigrams %>%
    count(bigram, sort = TRUE) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word1 %in% pronouns) %>%
    count(word1, word2, wt = n, sort = TRUE, name = "total")

Now let's visualize the most common gendered bigrams.
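A minimal sketch of one way to do this, faceting by pronoun (the top-10 cutoff and styling are assumptions; the post shows only the chart):

# Plot the ten most common words following each pronoun
gender_bigrams %>%
  group_by(word1) %>%
  top_n(10, total) %>%
  ungroup() %>%
  mutate(word2 = reorder(word2, total)) %>%
  ggplot(aes(word2, total, fill = word1)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ word1, scales = "free_y") +
  labs(x = NULL, y = "Number of occurrences")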

These are the most common bigrams that start with "he" and "she" in the Game of Thrones series. The most common bigrams are similar for the male and female characters in these books.

We can use a log odds ratio to find the words that exhibit the biggest differences in relative use after "she" compared to "he".

# Calculate the log odds ratio for each word
word_ratios <- gender_bigrams %>%
    group_by(word2) %>%
    filter(sum(total) > 50) %>%   # keep words that appear more than 50 times in total
    ungroup() %>%
    spread(word1, total, fill = 0) %>%   # one column of counts per pronoun
    mutate_if(is.numeric, ~ (. + 1) / sum(. + 1)) %>%   # add-one smoothing, then convert to proportions
    mutate(logratio = log2(she / he)) %>%
    arrange(desc(logratio))

So which words have about the same likelihood of following "he" or "she" in the series?

# Arrange by logratio
word_ratios %>% 
    arrange(abs(logratio))
## # A tibble: 117 x 4
##      word2          he         she     logratio
##      <chr>       <dbl>       <dbl>        <dbl>
##  1  caught 0.001656759 0.001662001  0.004558188
##  2     sat 0.003995712 0.004023793  0.010103468
##  3   found 0.010232921 0.010321903  0.012491048
##  4  looked 0.012035864 0.012246326  0.025009301
##  5   stood 0.004824091 0.004723583 -0.030375602
##  6    said 0.062664458 0.061056683 -0.037498185
##  7   tried 0.007163045 0.007435269  0.053812107
##  8 stopped 0.002095312 0.002011896 -0.058609283
##  9  turned 0.010866387 0.010409377 -0.061988621
## 10 laughed 0.003752071 0.003586424 -0.065141020
## # ... with 107 more rows

Words like "caught", "sat", and "found" are about as likely to come after "she" as after "he". Now let's look at the words that exhibit the largest differences in appearing after "she" compared to "he".
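One way to visualize those differences is a diverging bar chart of the log ratios; this is a sketch, with the 15-word cutoff per side an assumption:

# Plot the 15 words most skewed toward "she" and the 15 most skewed toward "he"
word_ratios %>%
  mutate(abslogratio = abs(logratio)) %>%
  group_by(logratio < 0) %>%
  top_n(15, abslogratio) %>%
  ungroup() %>%
  mutate(word2 = reorder(word2, logratio)) %>%
  ggplot(aes(word2, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = NULL,
       y = "Relative appearance after 'she' compared to 'he' (log2 ratio)")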

Men are more than twice as likely to fall, die, do, seem, or bring, whereas women are more than twice as likely to whisper, scream, throw, pray, and cry. This doesn't paint a pretty picture of gender roles in the Game of Thrones series.

More positive, action-oriented verbs like "drew", "shouted", and "can" seem to appear more often for men, while more passive, victim-like verbs like "didn't", "screamed", and "cried" appear more often for women in the series.

Character dependencies

We can also use the cleanNLP package to parse dependencies in sentences and find which words are connected to our main characters.

# Load libraries
library(cleanNLP); library(reticulate)

# Set up the spaCy NLP backend
init_spaCy()

# Get text
text <- paste(df$text, collapse = " ")

Because our input is a text string, we set as_strings to TRUE (the default is to assume that we are giving the function paths to where the input data sits on the local machine):

obj <- run_annotators(text, as_strings = TRUE)

Here, we used the spaCy backend. The returned annotation object is nothing more than a list of data frames (and one matrix), similar to a set of tables within a database.
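For example (a quick sketch, assuming the same cleanNLP version used above), we can pull the token table out of the annotation object:

# Extract the token table from the annotation object
tokens <- get_token(obj)
head(tokens)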

Named entities

Named entity recognition is the task of finding entities that can be defined by proper names, categorizing them, and standardizing their formats. Let's use this approach to get the names of the main characters in the series.

# Find the named entities in our text
people <- get_entity(obj) %>% 
  filter(entity_type == "PERSON" & entity != "Hand" & entity != "Father") %>%
  group_by(entity) %>%
  count %>%
  arrange(desc(n))

# Show the top 20 characters by mention
people[1:20,]
## # A tibble: 20 x 2
## # Groups:   entity [20]
##        entity     n
##         <chr> <int>
##  1        Jon  1826
##  2      Jaime  1195
##  3       Arya   973
##  4       Robb   847
##  5        Sam   800
##  6        Ned   796
##  7     Robert   767
##  8      Sansa   761
##  9       Dany   751
## 10    Catelyn   559
## 11     Cersei   537
## 12    Brienne   522
## 13    "Jon\t"   475
## 14 Your Grace   461
## 15      Grace   348
## 16 Lord Tywin   339
## 17      Stark   339
## 18    Stannis   338
## 19       "\t"   302
## 20     Tyrion   269

Cool! Now that we have the names of the main characters, let's look at the relationship between these names and certain key words.
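One thing the code below relies on is a main_characters data frame, which isn't shown in the original post. Here is one plausible way to build it from the people table; the specific names excluded and the cutoff of ten characters are assumptions:

# Clean up the entity names, merge duplicates, and keep the most-mentioned
# characters in lowercase so they match lemma_target in the dependency table
main_characters <- people %>%
  ungroup() %>%
  mutate(entity = tolower(trimws(entity))) %>%
  filter(!entity %in% c("your grace", "grace", "stark", "lord tywin")) %>%
  group_by(entity) %>%
  summarise(n = sum(n)) %>%
  top_n(10, n)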

Dependencies

Dependencies give the grammatical relationship between pairs of words within a sentence. We'll use the get_dependency() function to find dependencies between words in the books.

# Get the dependencies
dependencies <- get_dependency(obj, get_token = TRUE)

Let's see what the dependencies look like.

head(dependencies)
## # A tibble: 6 x 10
##      id   sid   tid tid_target relation relation_full  word lemma
##   <int> <int> <int>      <int>    <chr>         <chr> <chr> <chr>
## 1     1     1     2          1      det          <NA>  GAME  game
## 2     1     1     0          2     ROOT          <NA>  ROOT  ROOT
## 3     1     1     2          3     prep          <NA>  GAME  game
## 4     1     1     3          4     pobj          <NA>    OF    of
## 5     1     1     6          5 compound          <NA>  Song  song
## 6     1     1     2          6    appos          <NA>  GAME  game
## # ... with 2 more variables: word_target <chr>, lemma_target <chr>

The word is related to the target word, and the relationship between them is defined in the relation column. We're most interested in the lemma and lemma_target columns, as they have been standardized for us. The relationship we're most interested in is the direct dependency nsubj (nominal subject). Let's filter the results to show words that are directly dependent on one of our main characters' names.

# Sub out some names
dependencies$lemma_target <- gsub("daenerys", "dany", dependencies$lemma_target)
dependencies$lemma_target <- gsub("jon snow", "jon", dependencies$lemma_target)

# Find direct dependencies on our main characters
subject_dependencies <- dependencies %>% 
  filter(lemma_target %in% main_characters$entity & relation == 'nsubj') %>%
  group_by(lemma_target, word, relation) %>%
  count

head(subject_dependencies)
## # A tibble: 6 x 4
## # Groups:   lemma_target, word, relation [6]
##   lemma_target      word relation     n
##          <chr>     <chr>    <chr> <int>
## 1         arya  admitted    nsubj     3
## 2         arya    agreed    nsubj     1
## 3         arya     alive    nsubj     1
## 4         arya  ambushed    nsubj     1
## 5         arya announced    nsubj     1
## 6         arya    answer    nsubj     1

Cool! At this point, we may be interested in seeing words that appear more frequently for certain characters than for others. To do that, we can calculate each term's inverse document frequency (idf), defined as:

idf(term) = ln(documents / documents containing term)

A term's inverse document frequency decreases the weight for commonly used words and increases the weight for words that are not used very much. It can be combined with term frequency to calculate a term's tf-idf (the two quantities multiplied together), which is the frequency of a term adjusted for how rarely it is used.

The idea of tf-idf is to find the words that are important to each document, in this case the set of words associated with each character, by decreasing the weight for commonly used words and increasing the weight for words that are rarely used across the entire collection of documents.

The bind_tf_idf() function takes a tidy text dataset as input, with one row per term per document. One column (word here) contains the terms, one column contains the documents (lemma_target here), and the last necessary column contains the counts (n here): how many times each document contains each term.

# Calculate tf-idf
book_words <- dependencies %>%
  filter(lemma_target %in% main_characters$entity & relation == 'nsubj') %>%
  select(lemma_target, word) %>%
  group_by(lemma_target, word) %>% 
  summarise(n = n()) %>%
  bind_tf_idf(word, lemma_target, n)

Now that we've calculated tf-idf for each of our main characters, we can visualize the words that are closely associated with them, relative to the word's association with the other characters.
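Here's a minimal sketch of one way to produce that visualization (the ten-word cutoff per character and the faceting are assumptions; the post shows only the chart):

# Plot the ten highest tf-idf words for each main character
book_words %>%
  ungroup() %>%
  group_by(lemma_target) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = lemma_target)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ lemma_target, scales = "free") +
  labs(x = NULL, y = "tf-idf")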

These are fun to see!