Project Description

Game of Thrones is a popular fantasy drama series. In this project, I analyze the text data from the script of the first five seasons of the series. I first downloaded the data file from Kaggle, which contains all the dialogue and characters from the five seasons in an Excel file. Before importing the data set here, I filtered out rows with ambiguous character names such as 'men', 'someone', etc. in Excel.
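
The filtering was done in Excel before the import, but the same clean-up could also be reproduced in R after reading the file. A minimal sketch, assuming a hypothetical list of ambiguous names (the actual names removed in Excel may differ) and a shortened file path:

#hypothetical clean-up of ambiguous speaker names in R (done in Excel for this project)
library(readxl)
library(dplyr)
ambiguous_names <- c("men", "someone", "man", "all") # assumed list, not exhaustive
raw_script <- read_excel("Game of thrones script.xlsx") # placeholder path
got_script_clean <- raw_script %>%
  filter(!tolower(Characters) %in% ambiguous_names)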

I have used various text-mining tools and packages to analyze this data. The main steps of this project are as follows.
1. Importing the data set
2. Refining the data with tidytext
3. Plotting the characters with the highest number of dialogue lines
4. Counting the number of characters and finding the top 10 characters
5. Plotting the percentage of dialogue spoken by the top characters
6. Converting the dialogue data to a tidy text structure and creating word tokens
7. Refining the words by removing stop words
8. Plotting the most frequent words in the dialogue
9. Creating a word cloud of the most frequent words
10. Sentiment analysis using the bing and nrc lexicons
- Making word clouds of positive and negative words using bing
- Plotting the different types of sentiment using nrc
- Top 10 words for each sentiment in the dialogue
- nrc sentiment analysis of the top 10 characters
11. Creating a corpus, refining it, building term-document matrices and finding the most frequent words
12. Topic modeling: clustering similar topics and cluster analysis
Installing all the required packages and importing libraries
library('rvest') # web scraping
library('dplyr') #data manipulation
library(readxl) #read excel files
library(tidyverse) # data manipulation
library(tm) # text mining
library(wordcloud) # word cloud generator
library(wordcloud2) # word cloud generator
library(tidytext) # text mining for word processing and sentiment analysis
library(reshape2) # reshapes a data frame
library(knitr) # dynamic report generation
library(gridExtra) # miscellaneous Functions for "Grid" Graphics
library(grid) # add Grid to a Plot
library(igraph) # creating and manipulating graphs and analyzing networks
library(ggraph) # graphs and Networks
library(SnowballC) # for stemming the text
library("textdata")
library(remotes) #to get nrc lexicon from github
library(topicmodels)#for topic modeling 

I loaded all the required libraries for text mining, data manipulation, plotting, word clouds, term-document matrices, etc. From now on I will only be using functions from base R and from the libraries mentioned above.

#Importing the data set
library(readxl)
got_script <- read_excel("E:/Data Science/Statistics with R/My projects/Game of thrones script.xlsx")
head(got_script)
## # A tibble: 6 x 5
##   Season   Episode   `Episode Title`  Characters   Dialog                       
##   <chr>    <chr>     <chr>            <chr>        <chr>                        
## 1 Season 1 Episode 1 Winter is Coming waymar royce What do you expect? They're ~
## 2 Season 1 Episode 1 Winter is Coming will         I've never seen wildlings do~
## 3 Season 1 Episode 1 Winter is Coming waymar royce How close did you get?       
## 4 Season 1 Episode 1 Winter is Coming will         Close as any man would.      
## 5 Season 1 Episode 1 Winter is Coming gared        We should head back to the w~
## 6 Season 1 Episode 1 Winter is Coming royce        Do the dead frighten you?

I imported the Excel data file from my device using the read_excel() function of the readxl library. The data contains the characters, episodes and dialogue from all 5 seasons.

Total number of characters and top 20 characters
#Total number of characters in 5 seasons of Game of Thrones
length(unique(got_script$Characters))
## [1] 565
#Top twenty characters in whole 5 seasons
top_characters <- got_script %>% 
  count(Characters) %>%
  arrange(desc(n)) %>%
  slice(1:20)
Top 20 characters with the most dialogue.
#Top 20 Characters with most amount of dialog
got_script %>% 
  count(Characters) %>%
  arrange(desc(n)) %>%
  slice(1:20) %>%
  ggplot(aes(y=reorder(Characters, n), x=n)) +
  geom_bar(stat="identity", aes(fill=n), show.legend=FALSE) + 
  geom_label(aes(label=n)) +
  scale_fill_gradient(low="dodgerblue", high="dodgerblue4") +
  labs(x="Character", y="Lines of dialogue",
       title="Lines of dialogue per character") +  
  theme_bw()

First I counted how many lines each character has, arranged the counts in descending order and sliced the first 20 rows to get the top 20 characters.

Then I plotted the result with ggplot() from ggplot2, adding labels and a color gradient to make the graph clearer.

I decided to go with the top 20 characters because the series is vast, with many parallel story lines and a large number of characters.

Percentage of the dialog for top characters
#Top 20 characters with most percentage of dialog
got_script %>% 
  count(Characters) %>%
  arrange(desc(n)) %>%
  slice(1:20) %>%
  mutate(Percentage=n/nrow(got_script)) %>%
  ggplot(aes(x=reorder(Characters, Percentage), y=Percentage)) +
  geom_bar(stat="identity", aes(fill=Percentage), show.legend=FALSE) +
  geom_label(aes(label=paste0(round(Percentage*100, 2), "%"))) +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_gradient(low="paleturquoise", high="paleturquoise4") +
  labs(x="Character", y="Lines of dialogue (%)", 
       title="Lines of dialogue per character (relative values)") + 
  coord_flip() +
  theme_bw()

This is almost the same as the previous plot, but here I calculated each top character's share of the total dialogue using mutate(). I then plotted the result with ggplot() and added labels and formatting to make the graph more expressive.

Top 10 characters
#Top 10 characters list
top_10_characters <- got_script %>% 
  count(Characters) %>%
  arrange(desc(n)) %>%
  slice(1:10)
Making a tidy data structure from the Dialog column of got_script
# Transforming the dialog to tidy data structure
got_tokens <- got_script %>%  
  mutate(dialogue = as.character(Dialog)) %>% # keep the full dialogue line alongside each token
  unnest_tokens(word, Dialog) # one row per word

#Removing stop words
data(stop_words)
got_tokens <- got_tokens %>%
anti_join(stop_words)

I created word tokens from the dialogue with unnest_tokens() and stored them in got_tokens.

Then I removed the stop words from got_tokens with anti_join() against the stop_words data set from tidytext.

Counting the words in the tokens and plotting the most frequent words
#counting the words in the dialog
head(got_tokens %>%
  count(word, sort = TRUE)) 
## # A tibble: 6 x 2
##   word       n
##   <chr>  <int>
## 1 lord    1109
## 2 king     812
## 3 father   668
## 4 grace    530
## 5 time     523
## 6 lady     492
#Plotting the frequent words

got_tokens %>%
  count(word, sort = TRUE) %>%
  filter(n > 280) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  geom_label(aes(label=n)) +
  labs(x="Word frequency", y="Words)", 
      title="Most used words in game of thrones") +
  theme_bw()

I counted the words in the word column of got_tokens.

I then kept only the words with a frequency above 280, reordered them by frequency and plotted them with ggplot(), adding count labels with geom_label().

Making a word cloud of the top words
# defining color palette
pal <- brewer.pal(8,"Dark2")

#Creating word cloud of most frequent words
got_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 200, rot.per = 0.25, colors = pal))

I created the word cloud with the wordcloud() function from the refined words in got_tokens, limited it to the top 200 words and used a color palette to make the cloud more colorful.

Sentiment Analysis

Using ‘bing’ and ‘nrc’ Lexicons (for sentiment classification)
#positive and negative words
got_tokens %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort=TRUE) %>%
  acast(word ~ sentiment, value.var="n", fill=0) %>%
  comparison.cloud(colors=c("#F8766D", "#00BFC4"), max.words=200)

For this sentiment analysis I used the bing lexicon, created by Bing Liu and collaborators, which categorizes words as positive or negative. I joined the tokens with the lexicon to separate the positive and negative words and used the comparison.cloud() function to visualize them in a single word cloud.
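
For reference, the structure of the bing lexicon can be inspected directly; a short sketch (the exact word counts depend on the installed version of the lexicon):

#peek at the bing lexicon: one row per word with a positive/negative label
get_sentiments("bing") %>% head()
get_sentiments("bing") %>% count(sentiment)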

Frequency of each sentiment using nrc
# Frequency of the sentiments of each word in dialog  
sentiments <- got_tokens %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort=TRUE) 

# Visualization
ggplot(data=sentiments, aes(x=reorder(sentiment, n), y=n)) + 
  geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
  labs(x="Sentiment", y="Frequency", 
       title="Game of Thrones Sentiment Analysis(NRC lexicon)") +
  coord_flip() +
  theme_bw()

nrc is another lexicon, created by Saif Mohammad and Peter Turney. It categorizes words into a range of sentiments and emotions: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust. I categorized the words of got_tokens with the nrc lexicon and visualized the counts with ggplot().

Finding the most frequent terms for each sentiment
#Most frequent terms for each sentiments
#using nrc
get_sentiments("nrc")
## # A tibble: 13,875 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,865 more rows
#add lexicon metadata (lexicon name and number of distinct words)
nrc <- get_sentiments("nrc")%>% 
  mutate(lexicon = "nrc", 
         words_in_lexicon = n_distinct(word))

#top 10 words from each sentiments 
got_tokens %>% 
  inner_join(nrc, "word") %>%
  count(sentiment, word, sort=T) %>%
  group_by(sentiment) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  ggplot(aes(x=reorder(word, n), y=n)) +
  geom_col(aes(fill=sentiment), show.legend=FALSE) +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  facet_wrap(~sentiment, scales="free_y") +
  labs(y="Frequency", x="Words", 
       title="Most frequent terms for each sentiment (NRC lexicon)") +
  coord_flip() +
  theme_bw()

Again using the nrc lexicon, I listed the most frequent terms for each emotion, arranged them in descending order, sliced the top 10 words per sentiment and visualized them with ggplot(), faceted by sentiment.

Sentiment Analysis of top 10 characters
#Sentiment analysis of the top 10 characters using nrc
got_tokens %>%
  filter(Characters %in% c("tyrion lannister","jon snow","daenerys targaryen", "cersei lannister", 
                     "jaime lannister","sansa stark","arya stark","davos",            
                     "theon greyjoy","petyr baelish")) %>%
  inner_join(nrc, "word") %>%
  count(Characters, sentiment, sort=TRUE) %>%
  ggplot(aes(x=sentiment, y=n)) +
  geom_col(aes(fill=sentiment), show.legend=FALSE) +
  facet_wrap(~Characters, scales="free_x") +
  labs(x="Sentiment", y="Frequency", 
       title="Sentiment Analysis for each character (NRC lexicon)") +
  coord_flip() +
  theme_bw()

Here I analyzed the sentiments of the top 10 characters. I filtered got_tokens down to the dialogue of those characters, categorized their words with the nrc lexicon and visualized the counts for each character with ggplot().
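
The character names are typed out by hand above; as an alternative, the top_10_characters table built earlier could drive the same filter, which keeps the list in sync with the data. A minimal sketch:

#same filter, driven by the top_10_characters table created earlier
got_tokens %>%
  filter(Characters %in% top_10_characters$Characters) %>%
  inner_join(nrc, "word") %>%
  count(Characters, sentiment, sort=TRUE)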

Making Corpus, Term Document Matrix and Topic Modeling

my_corpus <- Corpus(VectorSource(got_tokens$word))
inspect(my_corpus[1:4])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 4
## 
## [1] expect  savages lot     steals
#cleaning the corpus with the functions of tm library
mystopwords <- c(stopwords("english"), "will", "one", "thats", "weve", "hes", "theres", "ive", "im",
                 "can", "cant", "dont", "youve", "us",
                 "youre", "youll", "theyre", "whats", "didnt", "now", "know", "like", "back", "never")
my_corpus <- tm_map(my_corpus, removeWords, mystopwords)
inspect(my_corpus[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] expect  savages lot

I created the corpus with the tm library. Since I had already created word tokens in got_tokens, I built the corpus from those same tokens. Most of the cleaning was already done with tidytext, so here I only removed some additional custom stop words to make the later modeling more effective.

Term Document Matrix and Frequency of words
#making term document matrix
myTdm <- TermDocumentMatrix(my_corpus, control=list(wordLengths=c(1,Inf)))

#checking the frequency of the words
freq.terms <- findFreqTerms(myTdm, lowfreq=20) #with lowest frequency of 20
head(freq.terms)
## [1] "expect"    "lot"       "wildlings" "life"      "close"     "head"

Then I created a term-document matrix from the corpus and looked up the terms that appear at least 20 times with findFreqTerms(). There are far too many terms to display here, so I used head() to show just a few of them.
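
The project outline also lists topic modeling and cluster analysis, and the topicmodels library is loaded above, but that step is not shown in detail. Below is a minimal sketch of how it could be done with LDA, under the assumption that each episode is treated as a document; the choice of k = 4 topics is arbitrary and only for illustration.

#Topic modeling sketch: treat each episode as a document
#(the single-word documents in my_corpus are too short for LDA)
episode_dtm <- got_tokens %>%
  count(`Episode Title`, word) %>%
  cast_dtm(`Episode Title`, word, n)

#fit an LDA model; k = 4 topics is an arbitrary choice
got_lda <- LDA(episode_dtm, k = 4, control = list(seed = 1234))

#top 10 terms per topic, by per-topic word probability (beta)
tidy(got_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)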

Summary of Text Mining and its relationship to NLP, Machine Learning and AI

Text mining is a subtype of data mining that deals with unstructured text data, i.e. natural language. It uses Natural Language Processing to analyze text data and draw valuable insights from it. There are various ways to perform text mining. Basic tasks include frequency analysis (checking how often certain words recur), collocation (finding sequences of words that commonly occur together, as sketched below) and concordance (recognizing the context in which words appear). We can also perform more advanced operations such as topic analysis, sentiment analysis, language detection, intent detection, etc.
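
Collocation was not part of the analysis above, but as an illustration it could be run on the same data with bigram tokens; a minimal sketch using tidytext and tidyr (the n > 20 cutoff is arbitrary):

#collocation sketch: most common word pairs (bigrams) in the dialogue
got_script %>%
  unnest_tokens(bigram, Dialog, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!is.na(word1),
         !word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n > 20)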

Artificial Intelligence is a branch of computer science that deals with simulating human intelligence, or cognitive ability, in machines and computer systems. It covers a wide range of techniques; Machine Learning and NLP are both part of it.

NLP is the area of AI concerned with how computers interpret and understand human language; it helps computers make sense of our languages. That is why we use NLP in text mining for tasks such as topic modeling, intent analysis and sentiment analysis, and we use machine learning to automate these processes. Machine learning applies algorithms that allow machines to learn and improve automatically, without being explicitly programmed. This is how AI, machine learning and NLP are related to text mining; sometimes these terms are even used interchangeably.

END