For my project I wanted to use the text mining skills I have learned throughout the semester to analyze how different the dialogue of the characters is between George R. R. Martin’s books and the HBO TV show Game of Thrones. My main goal was to identify similarities in the dialogue, which could give me insight into how accurate the HBO TV show is to the original book series. I have never read the Game of Thrones book series and have only recently been exposed to what is arguably the world’s greatest TV show. However, many books with corresponding TV series have huge plot twists and character deviations that essentially make the TV show a whole new story in itself. With this in mind, I realized I would have to go beyond just the characters and plots of Game of Thrones in my analysis, which is where the character dialogues come in. \(~\)

Given all of this, I hypothesized that if the TV show Game of Thrones and George R. R. Martin’s books differ in characters, plot, and settings, there should be clear deviations between what the characters say to each other in the two similar but separate worlds. In other words, HBO steering the TV show away from the books should lead to the characters we love having different conversations and interactions altogether.
\(~\)

My first step was finding PDF copies of the books online. I found the first three books: A Game of Thrones, A Clash of Kings, and A Storm of Swords. To import the files into RStudio I needed the pdftools package, plus a few other packages for the textual analysis. Here is where I loaded the libraries of the packages I installed in RStudio.

library(httr)
library(readr)
library(readtext)
library(stringr)
library(magrittr)
library(ggplot2)
library(ggthemes)
library(extrafont)
library(plyr)
library(scales)
library(tidyr)
library(dplyr)
library(RColorBrewer)
library(wordcloud)
library(rvest)
library(tidyverse)
library(tidytext)
library(xml2)
library(pdftools)
library(tm)
library(topicmodels)
library(wordVectors)
library(qdapRegex)
library(igraph)
library(ggraph)

With the packages loaded, I needed to set my working directory to Downloads, since that’s where I downloaded the books. Then I used the pdftools package to read the books into RStudio, as shown below.

setwd("~/Downloads")

# pdf_text() returns one character string per page of each book
Book1 <- pdf_text("GOT Book1.pdf")
Book2 <- pdf_text("GOT Book2.pdf")
Book3 <- pdf_text("GOT Book3.pdf")

# Convert each book to a one-column data frame and drop empty pages
Book1 <- as.data.frame(Book1)
Book1 <- Book1 %>% filter(Book1 != "")
Book2 <- as.data.frame(Book2)
Book2 <- Book2 %>% filter(Book2 != "")
Book3 <- as.data.frame(Book3)
Book3 <- Book3 %>% filter(Book3 != "")

After setting my working directory and loading the books, I needed the character dialogue from the TV show. To make this a one-to-one comparison, I only use the first three books of the series and the first four seasons of the TV show, since the first four seasons are supposed to be based on the first three books. To obtain the text of the four seasons, I needed a website with all the scripts from the TV show so I could scrape the text into a data frame with one season per column and the 10 episodes in individual rows. The website I found was SpringfieldSpringfield, which has scripts for a huge number of movies and TV shows. I set a base URL pointing at the Game of Thrones scripts on SpringfieldSpringfield. Then, using a nested for loop and read_html() from the rvest package, I read each episode of each season into a matrix, ordered column-wise. Finally I saved the matrix as a data frame, relabeling the columns as the seasons.

# One row per episode, one column per season
m = matrix(nrow = 10, ncol = 4)

baseurl <- "https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=game-of-thrones&episode="

# Build URLs of the form ...episode=s01e01 and scrape each script's text
for (j in 1:4){
  for (i in 1:10) {
    season <- paste0("s0", j)
    episode <- ifelse(i < 10, paste0("e0", i), paste0("e", i))
    url <- paste0(baseurl, season, episode)
    webpage <- read_html(url)
    script <- webpage %>% html_node(".scrolling-script-container")
    m[i, j] <- html_text(script, trim = TRUE)
  }
}

Seasons <- as.data.frame(m)

colnames(Seasons) <- c("Season 1","Season 2","Season 3", "Season 4")

With all the text for the books and the seasons in hand, I needed to subset the book text so that I only had the dialogue the characters speak in each book. To do this I used the rm_between() function on the string form of each row of the book data frames and looked for the books’ distinctive quotation marks. These curly quotes are different from the regular straight quotes in RStudio, so to match the quotes in the story I had to copy the quote characters from the story itself into the rm_between() call. Afterwards, I put all the dialogue from the stories in a matrix and dropped all empty rows where there was no dialogue. At that point each row of the matrix still held a concatenated list of quotes, so I used another loop to split each row into its own data frame of individual dialogue lines, combined all of those into one big data frame of character dialogue, and relabeled the column as text.
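
As a quick illustration of the quote issue, here is what rm_between() from the qdapRegex package does on a made-up line (the sentence is hypothetical, not from the book):

# The PDFs use curly quotes “ ”, not the straight ASCII quotes R prints by default
rm_between("Ned said, “Winter is coming.” He turned away.", "“", "”", extract = TRUE)
## [[1]]
## [1] "Winter is coming."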

# 753 rows of Book 1 (one per non-empty page); extract everything between the curly quotes
Book1_dialog = matrix(nrow = 753, ncol = 1)
for(i in 1:753){
  Book1_dialog[i] <- (rm_between(as.String(Book1$Book1[i]), '“', '”', extract=TRUE))
}
# Drop pages with no dialogue (688 pages remain)
Book1_dialog = subset(Book1_dialog, Book1_dialog != "NA")

# Flatten the per-page lists of quotes into one data frame of dialogue lines
full_chap1 = data.frame()
for(i in 1:688){
  jkfnm = do.call(data.frame, Book1_dialog[i])
  names(jkfnm)[1] <- "Dialog_Book1"
  full_chap1 = rbind(full_chap1, jkfnm)
}
names(full_chap1)[1] <- "text"


Book2_dialog = matrix(nrow = 532, ncol = 1)
for(i in 1:532){
  Book2_dialog[i] <- (rm_between(as.String(Book2$Book2[i]), '“', '”', extract=TRUE))
}

Book2_dialog = subset(Book2_dialog, Book2_dialog != "NA")

full_chap2 = data.frame()
for(i in 1:506){
  jkfnm = do.call(data.frame, Book2_dialog[i])
  names(jkfnm)[1] <- "Dialog_Book2"
  full_chap2 = rbind(full_chap2, jkfnm)
}
names(full_chap2)[1] <- "text"


Book3_dialog = matrix(nrow = 864, ncol = 1)
for(i in 1:864){
  Book3_dialog[i] <- (rm_between(as.String(Book3$Book3[i]), '“', '”', extract=TRUE))
}

Book3_dialog = subset(Book3_dialog, Book3_dialog != "NA")

full_chap3 = data.frame()
for(i in 1:794){
  jkfnm = do.call(data.frame, Book3_dialog[i])
  names(jkfnm)[1] <- "Dialog_Book3"
  full_chap3 = rbind(full_chap3, jkfnm)
}
names(full_chap3)[1] <- "text"

However, the comparison of the book dialogue and the TV show was still not one-to-one, since each TV episode’s entire script was concatenated into a single cell of its season’s column. To split the scripts into individual lines, I put each season into its own data frame just like the books, split the episode strings into individual character lines, and bound the resulting data frames together in a loop. After this step I was done organizing my data, and I went on to analyze the seasons and books.

Season1 = Seasons[1]

# Split each episode's script at punctuation followed by whitespace and a
# capital letter, which roughly separates the individual spoken lines
Season1_Split = data.frame()
for(i in 1:10){
  x = unlist(strsplit(as.String(Season1[i,]), "(?<=[[:punct:]])\\s(?=[A-Z])", perl=T))
  Season1_Split = rbind(Season1_Split, as.data.frame(x))
}
names(Season1_Split)[1] <- "text"


Season2 = Seasons[2]
Season2_Split = data.frame()
for(i in 1:10){
  x = unlist(strsplit(as.String(Season2[i,]), "(?<=[[:punct:]])\\s(?=[A-Z])", perl=T))
  Season2_Split = rbind(Season2_Split, as.data.frame(x))
}
names(Season2_Split)[1] <- "text"


Season3 = Seasons[3]
Season3_Split = data.frame()
for(i in 1:10){
  x = unlist(strsplit(as.String(Season3[i,]), "(?<=[[:punct:]])\\s(?=[A-Z])", perl=T))
  Season3_Split = rbind(Season3_Split, as.data.frame(x))
}
names(Season3_Split)[1] <- "text"


Season4 = Seasons[4]
Season4_Split = data.frame()
for(i in 1:10){
  x = unlist(strsplit(as.String(Season4[i,]), "(?<=[[:punct:]])\\s(?=[A-Z])", perl=T))
  Season4_Split = rbind(Season4_Split, as.data.frame(x))
}
names(Season4_Split)[1] <- "text"
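
The four blocks above differ only in which season column they read, so a small helper function could do the same job with less repetition. Here is a sketch, assuming the same Seasons data frame built earlier (I kept the explicit blocks above so each season’s data frame is easy to inspect):

# Hypothetical helper: split one season's scripts into individual lines
split_season <- function(season_col) {
  season_split <- data.frame()
  for (i in 1:10) {
    x <- unlist(strsplit(as.String(season_col[i, ]),
                         "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))
    season_split <- rbind(season_split, as.data.frame(x))
  }
  names(season_split)[1] <- "text"
  season_split
}
# e.g. Season1_Split <- split_season(Seasons[1])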

The first tool I used to analyze the two texts was sentiment analysis, which lets me see quantitatively whether the dialogue of the characters in the books or the TV show is more negative than positive, or vice versa. Sentiment analysis lets me gauge how positive or negative the characters’ dialogue is in the books versus the show. It also lets me see the overall similarities and differences between each season and its corresponding book, to understand how much HBO changed the character dialogue.

# unnest_tokens() needs character columns, not factors
full_chap1$text = as.character(full_chap1$text)
Season1_Split$text = as.character(Season1_Split$text)

full_chap2$text = as.character(full_chap2$text)
Season2_Split$text = as.character(Season2_Split$text)

full_chap3$text = as.character(full_chap3$text)
Season3_Split$text = as.character(Season3_Split$text)

Season4_Split$text = as.character(Season4_Split$text)


# Tokenize each dialogue set into words, keeping only words used more than 10 times
tidy_Book1 =  full_chap1 %>%
  unnest_tokens(input = text, output = word, token = "words", format = "text") %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

tidy_Season1 =  Season1_Split %>%
  unnest_tokens(input = text, output = word, token = "words", format = "text") %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

tidy_Book2 =  full_chap2 %>%
  unnest_tokens(input = text, output = word, token = "words", format = "text") %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

tidy_Season2 =  Season2_Split %>%
  unnest_tokens(input = text, output = word, token = "words", format = "text") %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

tidy_Book3 =  full_chap3 %>%
  unnest_tokens(input = text, output = word, token = "words", format = "text") %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

tidy_Season3 =  Season3_Split %>%
  unnest_tokens(input = text, output = word, token = "words", format = "text") %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

tidy_Season4 =  Season4_Split %>%
  unnest_tokens(input = text, output = word, token = "words", format = "text") %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

# Looking at the sentiment outputs
tidy_Book1 %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>% 
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) 
## Joining, by = "word"
## # A tibble: 1 x 3
##   negative positive sentiment
##      <dbl>    <dbl>     <dbl>
## 1      413      532       119
tidy_Book2 %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## # A tibble: 1 x 3
##   negative positive sentiment
##      <dbl>    <dbl>     <dbl>
## 1      470      663       193
tidy_Book3 %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>% 
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) 
## Joining, by = "word"
## # A tibble: 1 x 3
##   negative positive sentiment
##      <dbl>    <dbl>     <dbl>
## 1      545      929       384
tidy_Season1 %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## # A tibble: 1 x 3
##   negative positive sentiment
##      <dbl>    <dbl>     <dbl>
## 1      554      732       178
tidy_Season2 %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative) 
## Joining, by = "word"
## # A tibble: 1 x 3
##   negative positive sentiment
##      <dbl>    <dbl>     <dbl>
## 1      716     1086       370
tidy_Season3 %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>%
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative) 
## Joining, by = "word"
## # A tibble: 1 x 3
##   negative positive sentiment
##      <dbl>    <dbl>     <dbl>
## 1      684     1080       396
tidy_Season4 %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## # A tibble: 1 x 3
##   negative positive sentiment
##      <dbl>    <dbl>     <dbl>
## 1      598      951       353

Upon running the code, I see that the overall sentiment of the books and the TV show has not changed and stays consistently positive, even as we progress through the various seasons and books. This may seem counterintuitive at first, since we are analyzing the Game of Thrones series, but the results indicate that what the characters say isn’t very negative; it leans positive. This makes sense because, as we see in the show, much of the negative content comes through the characters’ actions, which aren’t transcribed in the scripts. In addition, the book series does describe the characters’ actions, and when I ran the sentiment analysis on the entire books I noticed that the outputs (not shown in this markdown) were a lot more negative. However, since the books describe the characters’ actions and the TV scripts don’t, that wouldn’t be a fair comparison. \(~\)
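
For reference, that whole-book run looks roughly like this (a sketch for Book 1, assuming Book1 still holds the full page text read in by pdf_text()):

# Sentiment of the entire Book 1 text, narration included
Book1 %>%
  mutate(text = as.character(Book1)) %>%
  unnest_tokens(input = text, output = word) %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)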

Thus, comparing just what the characters say, we can see that HBO kept the overall sentiment of the characters similar to George R. R. Martin’s text. What is even more interesting is that both HBO and George R. R. Martin increased the overall positive sentiment of the characters’ dialogue throughout the seasons and the books respectively. Humorously, it seems the characters become more positive in how they talk to one another as the individual stories develop, in contrast to their actions, which become more and more violent and negative. This could have been a writing strategy used by George R. R. Martin that HBO essentially adapted, or HBO may have reused much of the dialogue from the books even while changing characters, settings, and plots, which could account for the similarity in the trends. \(~\)

Continuing my analysis of the two texts, I went on to create topic models for each of the books and seasons to see what topics the characters were talking about, and how HBO might have changed the overall topics of the TV show through character dialogue. To do this I split up the topic models by books and seasons. Below are the topic models for the books:

# Give every dialogue line its own id so each line is treated as a document
full_chap1_info = as.data.frame((full_chap1$text))
full_chap1_info <- data.frame(full_chap1_info, ad_id = c(1:5135))
full_chap1_info$X.full_chap1.text. <- as.character(full_chap1_info$X.full_chap1.text.)

# Tokenize, drop stop words, and cast to a document-term matrix
full_chap1_info %>% mutate(doc_id = as.character(ad_id))%>%
  unnest_tokens(input = X.full_chap1.text., output=word, format="text", token="words") %>% 
  group_by(doc_id, word) %>%
  summarise(n=n()) %>% 
  anti_join(stop_words, by="word") %>%
  cast_dtm(document = doc_id ,term=word, value=n) -> dtm_full_chap1

# Fit LDA models with k = 2, 4, ..., 12 topics; the best k is chosen by perplexity below
k_vec = c(2,4,6,8,10,12)
mod_chap1 = list()
for (j in 1:6) {
  cat(j, "\n")
  k = k_vec[j]
  mod_chap1[[j]] = LDA(x=dtm_full_chap1, k=k, method="Gibbs", control=list(alpha=1))
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6
p1 = sapply(mod_chap1, function(x) {perplexity(x, dtm_full_chap1)})
our_model1 = mod_chap1[[3]]  # third candidate (k = 6) has the lowest perplexity


#Book 2
full_chap2_info = as.data.frame((full_chap2$text))
full_chap2_info <- data.frame(full_chap2_info, ad_id = c(1:5546))
full_chap2_info$X.full_chap2.text. <- as.character(full_chap2_info$X.full_chap2.text.)

full_chap2_info %>% mutate(doc_id = as.character(ad_id))%>%
  unnest_tokens(input = X.full_chap2.text., output=word, format="text", token="words") %>% 
  group_by(doc_id, word) %>%
  summarise(n=n()) %>% 
  anti_join(stop_words, by="word") %>%
  cast_dtm(document = doc_id ,term=word, value=n) -> dtm_full_chap2

k_vec = c(2,4,6,8,10,12)
mod_chap2 = list()
for (j in 1:6) {
  cat(j, "\n")
  k = k_vec[j]
  mod_chap2[[j]] = LDA(x=dtm_full_chap2, k=k, method="Gibbs", control=list(alpha=1))
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6
p2 = sapply(mod_chap2, function(x) {perplexity(x, dtm_full_chap2)})
our_model2 = mod_chap2[[2]]


#Book 3
full_chap3_info = as.data.frame((full_chap3$text))
full_chap3_info <- data.frame(full_chap3_info, ad_id = c(1:6967))
full_chap3_info$X.full_chap3.text. <- as.character(full_chap3_info$X.full_chap3.text.)

full_chap3_info %>% mutate(doc_id = as.character(ad_id))%>%
  unnest_tokens(input = X.full_chap3.text., output=word, format="text", token="words") %>% 
  group_by(doc_id, word) %>%
  summarise(n=n()) %>% 
  anti_join(stop_words, by="word") %>%
  cast_dtm(document = doc_id ,term=word, value=n) -> dtm_full_chap3

k_vec = c(2,4,6,8,10,12)
mod_chap3 = list()
for (j in 1:6) {
  cat(j, "\n")
  k = k_vec[j]
  mod_chap3[[j]] = LDA(x=dtm_full_chap3, k=k, method="Gibbs", control=list(alpha=1))
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6
p3 = sapply(mod_chap3, function(x) {perplexity(x, dtm_full_chap3)})
our_model3 = mod_chap3[[2]]

Next I ran the code for the topic models of the seasons, shown below, to see whether the conversations between the characters in the books and the TV series cover the same topics, and what that might imply.

#Season 1
Season1_Split_info = as.data.frame((Season1_Split$text))
Season1_Split_info <- data.frame(Season1_Split_info, ad_id = c(1:5842))
Season1_Split_info$X.Season1_Split.text. <- as.character(Season1_Split_info$X.Season1_Split.text.)

Season1_Split_info %>% mutate(doc_id = as.character(ad_id))%>%
  unnest_tokens(input = X.Season1_Split.text., output=word, format="text", token="words") %>% 
  group_by(doc_id, word) %>%
  summarise(n=n()) %>% 
  anti_join(stop_words, by="word") %>%
  cast_dtm(document = doc_id ,term=word, value=n) -> dtm_Season1_Split

k_vec = c(2,4,6,8,10,12)
mod_Season1 = list()
for (j in 1:6) {
  cat(j, "\n")
  k = k_vec[j]
  mod_Season1[[j]] = LDA(x=dtm_Season1_Split, k=k, method="Gibbs", control=list(alpha=1))
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6
pS1 = sapply(mod_Season1, function(x) {perplexity(x, dtm_Season1_Split)})

our_modelS1 = mod_Season1[[3]]


#Season 2
Season2_Split_info = as.data.frame((Season2_Split$text))
Season2_Split_info <- data.frame(Season2_Split_info, ad_id = c(1:7476))
Season2_Split_info$X.Season2_Split.text. <- as.character(Season2_Split_info$X.Season2_Split.text.)

Season2_Split_info %>% mutate(doc_id = as.character(ad_id))%>%
  unnest_tokens(input = X.Season2_Split.text., output=word, format="text", token="words") %>% 
  group_by(doc_id, word) %>%
  summarise(n=n()) %>% 
  anti_join(stop_words, by="word") %>%
  cast_dtm(document = doc_id ,term=word, value=n) -> dtm_Season2_Split

k_vec = c(2,4,6,8,10,12)
mod_Season2 = list()
for (j in 1:6) {
  cat(j, "\n")
  k = k_vec[j]
  mod_Season2[[j]] = LDA(x=dtm_Season2_Split, k=k, method="Gibbs", control=list(alpha=1))
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6
pS2 = sapply(mod_Season2, function(x) {perplexity(x, dtm_Season2_Split)})

our_modelS2 = mod_Season2[[3]]


#Season 3
Season3_Split_info = as.data.frame((Season3_Split$text))
Season3_Split_info <- data.frame(Season3_Split_info, ad_id = c(1:7614))
Season3_Split_info$X.Season3_Split.text. <- as.character(Season3_Split_info$X.Season3_Split.text.)

Season3_Split_info %>% mutate(doc_id = as.character(ad_id))%>%
  unnest_tokens(input = X.Season3_Split.text., output=word, format="text", token="words") %>% 
  group_by(doc_id, word) %>%
  summarise(n=n()) %>% 
  anti_join(stop_words, by="word") %>%
  cast_dtm(document = doc_id ,term=word, value=n) -> dtm_Season3_Split

k_vec = c(2,4,6,8,10,12)
mod_Season3 = list()
for (j in 1:6) {
  cat(j, "\n")
  k = k_vec[j]
  mod_Season3[[j]] = LDA(x=dtm_Season3_Split, k=k, method="Gibbs", control=list(alpha=1))
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6
pS3 = sapply(mod_Season3, function(x) {perplexity(x, dtm_Season3_Split)})

our_modelS3 = mod_Season3[[3]]


#Season 4
Season4_Split_info = as.data.frame((Season4_Split$text))
Season4_Split_info <- data.frame(Season4_Split_info, ad_id = c(1:7058))
Season4_Split_info$X.Season4_Split.text. <- as.character(Season4_Split_info$X.Season4_Split.text.)

Season4_Split_info %>% mutate(doc_id = as.character(ad_id))%>%
  unnest_tokens(input = X.Season4_Split.text., output=word, format="text", token="words") %>% 
  group_by(doc_id, word) %>%
  summarise(n=n()) %>% 
  anti_join(stop_words, by="word") %>%
  cast_dtm(document = doc_id ,term=word, value=n) -> dtm_Season4_Split

k_vec = c(2,4,6,8,10,12)
mod_Season4 = list()
for (j in 1:6) {
  cat(j, "\n")
  k = k_vec[j]
  mod_Season4[[j]] = LDA(x=dtm_Season4_Split, k=k, method="Gibbs", control=list(alpha=1))
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6
pS4 = sapply(mod_Season4, function(x) {perplexity(x, dtm_Season4_Split)})

our_modelS4 = mod_Season4[[3]]

After running all the code for the topic models, I looked at the outputs for the topics. First, to choose the number of topics, I plotted the perplexity of the six candidate models for each text and looked for the lowest point. The perplexity plot for the Book 1 models is shown below; I won’t show the rest since they take up a lot of space. From the graph we can see that perplexity is lowest at k = 6, the third candidate, so that is the model used above for Book 1. I repeated this step for all the books and seasons. The outputs (the topics) for each book and season are shown below:

plot(x = k_vec, y = p1, type = "o", pch = 16,
     xlab = "Number of topics (k)", ylab = "Perplexity")
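
Since the remaining perplexity plots are omitted, the seven curves can also be compared on one set of axes with something like this (a sketch using the p1 through pS4 vectors computed above):

# All seven perplexity curves on shared axes
perp <- cbind(p1, p2, p3, pS1, pS2, pS3, pS4)
matplot(k_vec, perp, type = "o", pch = 16, lty = 1, col = 1:7,
        xlab = "Number of topics (k)", ylab = "Perplexity")
legend("topright", legend = colnames(perp), col = 1:7, pch = 16)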

terms(our_model1, k=10)
##       Topic 1  Topic 2 Topic 3  Topic 4      Topic 5    Topic 6    
##  [1,] "don’t"  "boy"   "it’s"   "lord"       "ser"      "lady"     
##  [2,] "i’m"    "gods"  "son"    "stark"      "king’s"   "brother"  
##  [3,] "father" "king"  "m’lord" "hand"       "stop"     "lannister"
##  [4,] "you’re" "dead"  "corn"   "grace"      "watch"    "time"     
##  [5,] "bran"   "hear"  "kill"   "leave"      "love"     "sweet"    
##  [6,] "dragon" "i’ll"  "fear"   "eddard"     "true"     "maester"  
##  [7,] "he’s"   "khal"  "woman"  "winterfell" "call"     "honor"    
##  [8,] "arya"   "told"  "i’ve"   "jon"        "hurt"     "ride"     
##  [9,] "blood"  "can’t" "left"   "father"     "queen"    "sister"   
## [10,] "sword"  "jon"   "city"   "snow"       "children" "fool"
terms(our_model2, k=10)
##       Topic 1   Topic 2  Topic 3      Topic 4
##  [1,] "lord"    "lady"   "ser"        "don’t"
##  [2,] "grace"   "i’m"    "brother"    "i’ll" 
##  [3,] "king"    "dead"   "prince"     "boy"  
##  [4,] "that’s"  "you’re" "there’s"    "hodor"
##  [5,] "father"  "stark"  "m’lord"     "gods" 
##  [6,] "true"    "he’s"   "can’t"      "it’s" 
##  [7,] "bring"   "won’t"  "winterfell" "wolf" 
##  [8,] "told"    "time"   "sister"     "bran" 
##  [9,] "stannis" "die"    "fool"       "blood"
## [10,] "hand"    "girl"   "i’ve"       "hear"
terms(our_model3, k=10)
##       Topic 1   Topic 2   Topic 3  Topic 4 
##  [1,] "it’s"    "ser"     "lady"   "lord"  
##  [2,] "hodor"   "grace"   "you’re" "i’m"   
##  [3,] "there’s" "don’t"   "king"   "father"
##  [4,] "aye"     "mother"  "snow"   "son"   
##  [5,] "that’s"  "brother" "dead"   "won’t" 
##  [6,] "i’ll"    "sword"   "jon"    "hear"  
##  [7,] "gold"    "kill"    "told"   "boy"   
##  [8,] "true"    "killed"  "die"    "hand"  
##  [9,] "bloody"  "fire"    "wall"   "leave" 
## [10,] "we’ll"   "king’s"  "queen"  "woman"
terms(our_modelS1, k=10)
##       Topic 1     Topic 2      Topic 3  Topic 4    Topic 5      Topic 6 
##  [1,] "brother"   "lord"       "king"   "dead"     "king's"     "father"
##  [2,] "ser"       "stark"      "boy"    "day"      "life"       "sword" 
##  [3,] "son"       "lady"       "time"   "girl"     "l'm"        "wall"  
##  [4,] "honor"     "watch"      "hand"   "khal"     "hear"       "kill"  
##  [5,] "war"       "jon"        "north"  "dothraki" "l'll"       "house" 
##  [6,] "lannister" "ned"        "killed" "sister"   "lannisters" "robert"
##  [7,] "gold"      "winterfell" "queen"  "kingdoms" "landing"    "blood" 
##  [8,] "family"    "death"      "true"   "people"   "fight"      "grace" 
##  [9,] "lt's"      "wrong"      "stop"   "world"    "bring"      "told"  
## [10,] "bran"      "arryn"      "throne" "mother"   "wife"       "head"
terms(our_modelS2, k=10)
##       Topic 1   Topic 2   Topic 3      Topic 4     Topic 5    Topic 6   
##  [1,] "stannis" "king"    "lord"       "grace"     "stark"    "night"   
##  [2,] "city"    "father"  "boy"        "people"    "lady"     "day"     
##  [3,] "dead"    "mother"  "girl"       "ser"       "grace"    "queen"   
##  [4,] "time"    "love"    "winterfell" "iron"      "war"      "fight"   
##  [5,] "fire"    "brother" "theon"      "lannister" "home"     "king's"  
##  [6,] "die"     "north"   "gods"       "hear"      "son"      "gold"    
##  [7,] "ships"   "dragons" "light"      "life"      "killed"   "children"
##  [8,] "qarth"   "joffrey" "brother"    "house"     "robb"     "leave"   
##  [9,] "heard"   "hand"    "battle"     "throne"    "woman"    "told"    
## [10,] "gates"   "sister"  "starks"     "jaime"     "fighting" "watch"
terms(our_modelS3, k=10)
##       Topic 1   Topic 2     Topic 3   Topic 4   Topic 5      Topic 6 
##  [1,] "girl"    "time"      "lady"    "father"  "king"       "lord"  
##  [2,] "day"     "ser"       "stark"   "grace"   "north"      "father"
##  [3,] "die"     "son"       "king's"  "people"  "wall"       "killed"
##  [4,] "castle"  "life"      "leave"   "boy"     "dead"       "hodor" 
##  [5,] "black"   "kill"      "landing" "hand"    "love"       "jon"   
##  [6,] "watch"   "lannister" "blood"   "told"    "house"      "mother"
##  [7,] "night"   "family"    "sansa"   "pay"     "winterfell" "bed"   
##  [8,] "stop"    "gold"      "head"    "city"    "children"   "snow"  
##  [9,] "hear"    "child"     "joffrey" "master"  "mother"     "world" 
## [10,] "brother" "fight"     "wedding" "friends" "coming"     "water"
terms(our_modelS4, k=10)
##       Topic 1   Topic 2     Topic 3    Topic 4   Topic 5  Topic 6   
##  [1,] "king"    "lord"      "father"   "ser"     "time"   "wall"    
##  [2,] "kill"    "lannister" "lady"     "dead"    "die"    "people"  
##  [3,] "king's"  "tyrion"    "killed"   "hodor"   "leave"  "love"    
##  [4,] "joffrey" "house"     "boy"      "grace"   "girl"   "gods"    
##  [5,] "watch"   "hear"      "stark"    "world"   "family" "north"   
##  [6,] "hand"    "snow"      "brother"  "protect" "head"   "remember"
##  [7,] "life"    "safe"      "children" "gold"    "war"    "told"    
##  [8,] "landing" "jon"       "sister"   "fucking" "live"   "gate"    
##  [9,] "son"     "prince"    "mother"   "stop"    "told"   "fight"   
## [10,] "city"    "friend"    "sansa"    "truth"   "wine"   "army"

As you can see from the outputs of the topic models, the key observation is that the characters’ dialogue covers very similar topics across the books and the show, with recurring themes such as lords and references to the crown. It seems that both the HBO show and the books invoke the crown whenever characters interact, possibly to keep attention on the Iron Throne and the significance of whoever sits on it. \(~\)
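
One rough way to quantify that similarity is to count how many top terms the book and season models share. For example, for Book 1 and Season 1 (a sketch using the models fit above, not a formal topic alignment):

# Terms in the top 20 of any topic in both the Book 1 and Season 1 models
shared_terms <- intersect(as.vector(terms(our_model1, 20)),
                          as.vector(terms(our_modelS1, 20)))
length(shared_terms)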

In addition to the topic models, I also built a naive Bayes classifier on the text from the books and seasons, training on 80% of the original text. This lets me see which words are likely to appear in the TV show and which are more likely to appear in the book series. Looking at the probabilities should tell me, overall, how similar or different the dialogue is between the two data sets. The code for the classifier is below.

book_text = rbind(full_chap1, full_chap2, full_chap3)
# Note: only Seasons 1-3 go into the classifier; the 38580-row total used
# below corresponds to Books 1-3 plus Seasons 1-3
tv_text = rbind(Season1_Split, Season2_Split, Season3_Split)

book_text$group = "Book"
tv_text$group = "Tv"
names(book_text)[1] <- "text"
names(tv_text)[1] <- "text"


Book_Tv = rbind(book_text, tv_text)

Book_Tv$group = factor(Book_Tv$group)

Book_Tv = Book_Tv[sample(nrow(Book_Tv)), ]

Book_Tv_corpus <- Corpus(VectorSource(Book_Tv$text))

Book_Tv_dtm = DocumentTermMatrix(Book_Tv_corpus, 
                                 list(tolower = TRUE, removeNumbers = TRUE, stopwords = TRUE, 
                                      removePunctuation = TRUE, stemming = TRUE))

# 38580 total lines of dialogue: the first 80% (rows 1-30864) train the
# model, the remaining 20% are held out for testing
Book_Tv_dtm_train <- Book_Tv_dtm[1:(.8*38580), ]

Book_Tv_dtm_test <- Book_Tv_dtm[(.8*38580 + 1):38580, ]


Book_Tv_train_labels = Book_Tv[1:(.8*38580), ]$group
Book_Tv_test_labels = Book_Tv[(.8*38580 + 1):38580, ]$group

 
Book_Tv_freq_words <- findFreqTerms(Book_Tv_dtm_train, 5)

Book_Tv_dtm_freq_train<- Book_Tv_dtm_train[ , Book_Tv_freq_words]

Book_Tv_dtm_freq_test <- Book_Tv_dtm_test[ , Book_Tv_freq_words]


# Naive Bayes works with categorical features; convert counts to presence/absence
convert_counts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}


Book_Tv_train <- apply(Book_Tv_dtm_freq_train, MARGIN = 2, convert_counts)
Book_Tv_test <- apply(Book_Tv_dtm_freq_test, MARGIN = 2, convert_counts)


library(e1071) 
## Warning: package 'e1071' was built under R version 3.5.2
Book_Tv_classifier = naiveBayes(Book_Tv_train, Book_Tv_train_labels, laplace = 1)

summary(Book_Tv_classifier)
##         Length Class  Mode     
## apriori    2   table  numeric  
## tables  2035   -none- list     
## levels     2   -none- character
## call       4   -none- call
Book_Tv_test_pred = predict(Book_Tv_classifier, Book_Tv_test)
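
The predictions above aren’t scored in this markdown; a quick way to check them against the held-out labels would be something like this (a sketch, not run here):

# Confusion matrix and overall accuracy on the held-out 20%
table(predicted = Book_Tv_test_pred, actual = Book_Tv_test_labels)
mean(Book_Tv_test_pred == Book_Tv_test_labels)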

The output of the classifier (not shown, as it is a large list) shows that many words have almost an equal probability of appearing in the TV show and in the books. It seems that HBO didn’t just match the sentiment of the characters in the show; they also used many of the same words George R. R. Martin used in conversations between the characters. \(~\)

Finally, I used n-grams to look at the similarities and differences between the most common four-word phrases in the books and the seasons. Two- and three-word phrases are very common in any text, but phrases of four or more words rarely recur across two texts unless the texts genuinely share material. For that reason I used 4-grams to see how similar the phrases in the dialogue of the books and seasons are, as well as how frequently they show up. The code to get the n-grams for all the books and seasons is shown below.

#Book 1
full_chap1$text = as.character(full_chap1$text)

# Most frequent four-word phrases (4-grams) in the book
got_B1_bigrams <- full_chap1 %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 4) %>%
  count(bigram, sort = TRUE)

got_B1_bigrams <- na.omit(got_B1_bigrams)


# Book2
full_chap2$text = as.character(full_chap2$text)

# Most frequent four-word phrases (4-grams) in the book
got_B2_bigrams <- full_chap2 %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 4) %>%
  count(bigram, sort = TRUE)

got_B2_bigrams <- na.omit(got_B2_bigrams)


# Book3
full_chap3$text = as.character(full_chap3$text)

# Most frequent four-word phrases (4-grams) in the book
got_B3_bigrams <- full_chap3 %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 4) %>%
  count(bigram, sort = TRUE)

got_B3_bigrams <- na.omit(got_B3_bigrams)


#Season1
Season1_Split$text = as.character(Season1_Split$text)

# Most frequent four-word phrases (4-grams) in the season
got_S1_bigrams <- Season1_Split %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 4) %>%
  count(bigram, sort = TRUE)

got_S1_bigrams <- na.omit(got_S1_bigrams)


#Season2
Season2_Split$text = as.character(Season2_Split$text)

# Most frequent four-word phrases (4-grams) in the season
got_S2_bigrams <- Season2_Split %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 4) %>%
  count(bigram, sort = TRUE)

got_S2_bigrams <- na.omit(got_S2_bigrams)


#Season3
Season3_Split$text = as.character(Season3_Split$text)

# Most frequent four-word phrases (4-grams) in the season
got_S3_bigrams <- Season3_Split %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 4) %>%
  count(bigram, sort = TRUE)

got_S3_bigrams <- na.omit(got_S3_bigrams)


#Season4
Season4_Split$text = as.character(Season4_Split$text)

# Most frequent four-word phrases (4-grams) in the season
got_S4_bigrams <- Season4_Split %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 4) %>%
  count(bigram, sort = TRUE)

got_S4_bigrams <- na.omit(got_S4_bigrams)

After running the code to obtain the four-word phrases in the books and seasons, we can see the differences in the top phrases by making frequency plots of each set below:

#Book1
ggplot(head(got_B1_bigrams,15), aes(reorder(bigram,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("4-grams") + ylab("Frequency") +
  ggtitle("Most frequent 4-grams Book1")

#Book2
ggplot(head(got_B2_bigrams,15), aes(reorder(bigram,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("4-grams") + ylab("Frequency") +
  ggtitle("Most frequent 4-grams Book2")

#Book3
ggplot(head(got_B3_bigrams,15), aes(reorder(bigram,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("4-grams") + ylab("Frequency") +
  ggtitle("Most frequent 4-grams Book3")

#Season1
ggplot(head(got_S1_bigrams,15), aes(reorder(bigram,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("4-grams") + ylab("Frequency") +
  ggtitle("Most frequent 4-grams Season1")

#Season2
ggplot(head(got_S2_bigrams,15), aes(reorder(bigram,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("4-grams") + ylab("Frequency") +
  ggtitle("Most frequent 4-grams Season2")

#Season3
ggplot(head(got_S3_bigrams,15), aes(reorder(bigram,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("4-grams") + ylab("Frequency") +
  ggtitle("Most frequent 4-grams Season3")

#Season4
ggplot(head(got_S4_bigrams,15), aes(reorder(bigram,n), n)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("4-grams") + ylab("Frequency") +
  ggtitle("Most frequent 4-grams Season4")

The n-grams show that the TV show’s dialogue references titles and people’s names more often than the books’ dialogue does. However, many of the same four-word phrases appear in all of the texts, which again supports the idea that HBO reused a lot of the textual dialogue from the books. \(~\)
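
To make that overlap concrete, the shared phrases can be listed directly. For example, for Book 1 and Season 1 (a sketch using the n-gram tables built above):

# Four-word phrases that occur in both Book 1 and Season 1 dialogue
shared_4grams <- got_B1_bigrams %>%
  inner_join(got_S1_bigrams, by = "bigram", suffix = c("_book", "_season")) %>%
  arrange(desc(n_book + n_season))
head(shared_4grams, 10)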

Finally, using the n-grams I obtained from the books and seasons, I created a network graph of the phrases with frequencies greater than 3. This lets us see how the phrases relate to one another: it groups related phrases, where the relationships are based on the context and subject of the dialogue the phrases come from.

#Book1 
bigram_graph_B1 <- got_B1_bigrams %>%
  filter(n > 3) %>%
  graph_from_data_frame()

set.seed(123)

# Arrowheads for the directed edges in the network graph
a <- grid::arrow(type = "closed", length = unit(.20, "inches"))

ggraph(bigram_graph_B1, layout = "fr") +
  geom_edge_link(arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

#Book2
bigram_graph_B2 <- got_B2_bigrams %>%
  filter(n > 3) %>%
  graph_from_data_frame()

set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph_B2, layout = "fr") +
  geom_edge_link(arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

#Book3 
bigram_graph_B3 <- got_B3_bigrams %>%
  filter(n > 3) %>%
  graph_from_data_frame()

set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph_B3, layout = "fr") +
  geom_edge_link(arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

#Season1
bigram_graph_S1 <- got_S1_bigrams %>%
  filter(n > 3) %>%
  graph_from_data_frame()

set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph_S1, layout = "fr") +
  geom_edge_link(arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

#Season2
bigram_graph_S2 <- got_S2_bigrams %>%
  filter(n > 3) %>%
  graph_from_data_frame()

set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph_S2, layout = "fr") +
  geom_edge_link(arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

#Season3
bigram_graph_S3 <- got_S3_bigrams %>%
  filter(n > 3) %>%
  graph_from_data_frame()

set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph_S3, layout = "fr") +
  geom_edge_link(arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

#Season4
bigram_graph_S4 <- got_S4_bigrams %>%
  filter(n > 3) %>%
  graph_from_data_frame()

set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph_S4, layout = "fr") +
  geom_edge_link(arrow = a) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

In conclusion, it seems to me that HBO reused a lot of the dialogue for their characters, even when the characters weren’t the same. The context of the character dialogue was very similar, as the topic models showed. This indicates that the TV show had its characters imitate the dialogue of the books. However, my analysis does not show which characters were connected to which dialogue, so I can’t say for sure whether HBO gave their TV characters dialogue specific to the book characters they are supposed to portray, or a montage of text from various characters in the books. What my analysis does show is that, regardless of the TV seasons’ visual content, the context and tone of the characters’ dialogue are nearly synonymous with those of the books.

\(~\)

To improve my textual analysis, I would try to incorporate more methods for identifying key differences between the texts, such as coreNLP and named entity recognition, to identify characters and group each line of dialogue with the character who said it, for both the seasons and the books. That way I could have created a map of character dialogue interactions, which could have surfaced more differences in my analysis, since many of the characters in the books don’t appear in the show, or appear very little.

\(~\)

Due to time constraints, I was unable to build those relationships between characters via the dialogue. The majority of my time went into cleaning the data to obtain the dialogue from the books, since I was unfamiliar with the process. In addition, I was learning R Markdown syntax concurrently with the project, which also took time. Finally, working with several data sets to compare was difficult, since I had to make sure each data set for the books and the seasons was in the right format for each type of analysis. However, by working through all of the coding issues and time constraints, I developed my ability to manage multiple data sets and analyze them concurrently. Overall, I feel I have learned a lot of useful skills from this class that I know will come in handy in the future.