INTRODUCTION

The aim of this project is to gain some insights into The Vampire Diaries using several great R packages. The main hypothesis of the analysis is that the main characters become sadder with each new season. I also expect The Vampire Diaries to carry a mainly negative sentiment across all seasons and all characters, as most supernatural series do. The article covers the following: data retrieval, cleaning and exploration; word frequency analysis; sentiment analysis; n-gram analysis; and topic modelling.

DATA

Firstly, I scraped the transcripts of all episodes from the Forever Dreaming transcripts website. Please find the code below.

#libraries used throughout the analysis
library(rvest)
library(stringr)
library(data.table)
library(pbapply)
library(dplyr)
library(ggplot2)

#scraper function: collect episode names and links from one forum page
get_episodes <- function(my_link){
  t <- read_html(my_link)
  series <- list()
  series[['name']] <- t %>% html_nodes(".topictitle") %>% html_text()
  series[['link']] <- t %>% html_nodes(".topictitle") %>% html_attr('href')
  return(series)
}

#scrape all pages
url <- paste0("https://transcripts.foreverdreaming.org/viewforum.php?f=18&start=", seq(0, 150, 25))
episode_links <- data.table(rbindlist(lapply(url, get_episodes)))

#exclude links that are not related to episodes
episode_links <- episode_links[name != "The Vampire Diaries Transcript Index",]
episode_links <- episode_links[name != "Updates: (05/10/22) **Summer 2022 Challenge**",]

#get transcripts for all seasons
get_transcripts <- function(link) {
  t <- read_html(paste0("https://transcripts.foreverdreaming.org", str_sub(link, start = 2)))
  transcript <- t %>% html_nodes("#pagecontent p") %>% html_text()
  tinfo <- t %>% html_nodes('h2') %>% html_text()
  #drop stage directions and scene descriptions
  transcript <- str_subset(transcript, "^(?!\\[)")
  transcript <- str_subset(transcript, "^(?!\\()")
  transcript <- str_subset(transcript, "^(?!Scene)")
  #keep only lines of the form "Character: text"
  transcript <- transcript[grepl(':', transcript, fixed = T)]
  #split each line into the speaking character and their text
  textdf <- 
    rbindlist(
      lapply(transcript, function(x){
        t_pieces <- strsplit(x, ':')[[1]]
        data.table('character' = t_pieces[1], 'text' = trimws(paste(t_pieces[2:length(t_pieces)], collapse = " ")))
      })
    )
  #season, episode and title come from the page header (format like "01x01 - Title")
  textdf$season <- substring(tinfo, 1, 2)
  textdf$episode <- substring(tinfo, 4, 5)
  textdf$title <- substring(tinfo, 9, nchar(tinfo))
  return(textdf)
}

t_list <- pblapply(episode_links$link, get_transcripts)

#save df
df <- rbindlist(t_list, fill = T)
saveRDS(df, file = "VD_data.rds")

Data Cleaning and Exploration

As the next step, I checked the number of lines, episodes and seasons. According to the descriptive statistics, the average number of lines per episode is around 200 and the first quartile (Q1) is around 120 lines, so I decided to filter out episodes with fewer lines than Q1.
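A minimal sketch of how this filter might look, assuming the df built above (the object names episode_lines, q1 and keep are illustrative, not the ones used in the original script):

## count lines per episode and inspect the distribution
episode_lines <- df %>%
  group_by(season, episode) %>%
  summarise(n_lines = n(), .groups = "drop")

summary(episode_lines$n_lines)

## keep only episodes with at least Q1 lines
q1 <- quantile(episode_lines$n_lines, 0.25)
keep <- episode_lines %>% filter(n_lines >= q1)
df <- df %>% semi_join(keep, by = c("season", "episode"))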

After that, I checked the number of characters and identified the main characters by their number of lines. According to the dataset, more than 700 distinct characters are identified. I filtered the data down to the ten main characters. Please see below the graph of the number of lines of the top 10 characters, with a sketch of this step right after the character count.

## [1] 713
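A sketch of how this character count and the top-10 filter could be done (the object name top10 is illustrative; the actual list of main characters follows from the line counts in the data):

## number of distinct speakers
length(unique(df$character))

## ten characters with the most lines
top10 <- df %>%
  count(character, sort = TRUE) %>%
  head(10)

## keep only lines spoken by the ten main characters
df <- df %>% filter(character %in% top10$character)

## plot the number of lines per main character
ggplot(top10, aes(x = reorder(character, n), y = n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(x = "", y = "Number of lines", title = "Top 10 characters by number of lines")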

WORD FREQUENCY

As the first step of the text analysis, I checked the word frequency across episodes. To do so, we first need to tokenize the lines, i.e. have a single word per row of the data frame, following the tidytext approach. Next, I loaded the stop words, together with other words that I knew I did not want to include in the analysis, and filtered them out of the data frame. Please find below the code chunk showing which words have been filtered out.

##libraries for the text analysis
library(tidytext)
library(tidyr)

##tokenize transcripts: one word per row
df_tokens <- df %>% unnest_tokens(word, text)

##import stopwords
data(stop_words)

##filter stopwords from the tokens
df_tokens <- df_tokens %>%
  anti_join(stop_words)

##get top 10 words
df_tokens %>%
  count(word, sort = TRUE) %>% 
  head(10) #mostly includes names of other characters, lets exclude them too
## # A tibble: 10 x 2
##    word       n
##    <chr>  <int>
##  1 elena   1349
##  2 stefan  1203
##  3 damon   1167
##  4 yeah    1163
##  5 ll      1146
##  6 gonna   1089
##  7 hey      878
##  8 bonnie   709
##  9 klaus    579
## 10 time     546
##filter other characters' names from the text
df_tokens <- df_tokens %>%
  anti_join(data.frame(word = tolower(unique(df$character)))) ##still some words left that we want to ignore

##filter words that we know we want to ignore
words_to_ignore <- tibble(word = c("ll", "yeah", "hey", "klaus", "gonna", "jenna", "uh", "vampire", "vampires", "um", "isobel", "elijah", "kai", "jo", "enzo", "silas", "vicky", "lily", "jacob", "valerie", "hmm", "julian", "6"))

df_tokens <- df_tokens %>%
  anti_join(words_to_ignore)

##get top 10 words after applied filters
df_tokens %>%
  count(word, sort = TRUE) %>% 
  head(10)
## # A tibble: 10 x 2
##    word       n
##    <chr>  <int>
##  1 time     546
##  2 life     440
##  3 stop     420
##  4 blood    392
##  5 dead     388
##  6 love     388
##  7 talk     350
##  8 fine     345
##  9 hell     345
## 10 people   335

In the plot below, I visualized the most common words across all seasons and episodes.
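A rough sketch of how such a frequency plot can be built directly from df_tokens (the top-20 cut-off is an arbitrary choice for the sketch):

## 20 most frequent words across all episodes
df_tokens %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(x = "", y = "Frequency", title = "Most common words across all episodes")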

I also checked the frequency of the most common words by season and by the top 10 characters. Please find the graphs below.
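A sketch of the per-season variant, using tidytext's reorder_within() to order the bars inside each facet (the top-10 cut-off per season is illustrative; the by-character version is analogous, faceting on character instead of season):

## ten most frequent words per season, ordered within each facet
df_tokens %>%
  count(season, word, sort = TRUE) %>%
  group_by(season) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  ggplot(aes(x = reorder_within(word, n, season), y = n, fill = season)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~season, scales = "free_y") +
  labs(x = "", y = "Frequency", title = "Most common words by season")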

The plot below, built with the wordcloud package, shows the top 100 words that appear most frequently throughout all episodes of the series.

SENTIMENT ANALYSIS

As the next step of the analysis, I investigated the sentiment of the lines. I used three general-purpose sentiment lexicons: NRC, Bing and Loughran. These lexicons assign sentiment to words in different ways: the NRC lexicon has several sentiment categories such as joy or anger, the Bing lexicon only distinguishes positive and negative words, and the Loughran lexicon adds a few finance-flavoured categories such as uncertainty and litigious. First, I joined the tokens with the lexicons to see which sentiment categories they are assigned to. From the tables below it can be seen that, in each lexicon, the negative category contains the highest number of words.
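The tables below come from joins along these lines; here is a sketch, assuming the lexicons are fetched with tidytext::get_sentiments() (the NRC and Loughran lexicons may require the textdata package and a one-time download; the names nrc and loughran are mine, while bing and df_loughran are reused by the code further down):

## load the three lexicons
nrc      <- get_sentiments("nrc")
bing     <- get_sentiments("bing")
loughran <- get_sentiments("loughran")

## count how many tokens fall into each sentiment category
df_tokens %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, sort = TRUE, name = "num_words")

df_tokens %>%
  inner_join(bing, by = "word") %>%
  count(sentiment, sort = TRUE, name = "num_words")

df_loughran <- df_tokens %>% inner_join(loughran, by = "word")
df_loughran %>% count(sentiment, sort = TRUE, name = "num_words")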

## # A tibble: 10 x 2
##    sentiment    num_words
##    <chr>            <int>
##  1 negative          9161
##  2 positive          9005
##  3 trust             5898
##  4 fear              5286
##  5 anticipation      5237
##  6 sadness           5015
##  7 joy               4500
##  8 anger             4264
##  9 disgust           3301
## 10 surprise          2466

## # A tibble: 2 x 2
##   sentiment num_words
##   <chr>         <int>
## 1 negative       9517
## 2 positive       5016

## # A tibble: 5 x 2
##   sentiment    num_words
##   <chr>            <int>
## 1 negative          3661
## 2 positive           885
## 3 constraining       305
## 4 uncertainty        289
## 5 litigious          103

Using the Bing lexicon, I also built graphs showing the per-episode sentiment of the 3 main characters: Damon, Stefan and Elena. As can be seen from the plots below, all of them have mostly negative sentiment.

## compare sentiments across the top 3 main characters with the Bing lexicon
## get the names of the top 3 main characters 
top3 <- df %>% 
  count(character, sort = T) %>% 
  head(3)

## calculate per-episode sentiment for their lines
top3_sentiment <- df_tokens %>% 
  filter(character %in% top3$character) %>% 
  inner_join(bing, by = "word") %>%
  count(character, index = title, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

## plot sentiment
ggplot(top3_sentiment, aes(index, sentiment, fill = character)) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_d() +
  facet_wrap(~character, ncol = 1, scales = "free_x") +
  labs(x = "", y = 'Sentiment', title = 'Sentiments of 3 main characters') +
  theme(axis.text.x = element_blank(),
        axis.ticks = element_blank())

I also used the Bing lexicon to check the most common negative and positive words across the whole series. As can be seen from the graphs below, the most common negative word is “dead” and the most common positive word is “love”.
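A sketch of how such a comparison plot can be put together from the Bing join (the top-10 per sentiment cut-off and the name bing_word_counts are illustrative):

## ten most common words per Bing sentiment
bing_word_counts <- df_tokens %>%
  inner_join(bing, by = "word") %>%
  count(word, sentiment, sort = TRUE)

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  ggplot(aes(x = reorder_within(word, n, sentiment), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "", y = "Frequency", title = "Most common positive and negative words")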

Using the Loughran lexicon, I built word clouds of the most common negative and positive words across all episodes. Please see them below.

##build word clouds for negative and positive sentiments with the Loughran lexicon
library(wordcloud)
library(RColorBrewer)

negative_wordcloud <- df_loughran %>% 
        filter(sentiment == 'negative') %>% 
        count(word, sort = TRUE) 

wordcloud(words = negative_wordcloud$word, 
          freq = negative_wordcloud$n, 
          min.freq = 1, max.words = 100, 
          colors = brewer.pal(9, 'PuBu'),
          random.order = FALSE)

positive_wordcloud <- df_loughran %>%
        filter(sentiment=='positive') %>% 
        count(word, sort = TRUE) 

wordcloud(words = positive_wordcloud$word, 
          freq = positive_wordcloud$n,
          min.freq = 1, max.words = 100,
          colors = brewer.pal(9, 'RdPu'),
          random.order = FALSE)

SENTIMENT TREND ANALYSIS FOR THE 3 MAIN CHARACTERS

Let's check the sentiment trend across all seasons for the 3 main characters: Damon, Elena and Stefan. According to the graph below, no clear conclusion can be drawn at the episode level, so let's look at the trends aggregated by season.

After aggregating by season, some conclusions can be drawn from the plot below. There is a marked decrease in sentiment for Damon and Elena, which is well explained by their break-up in season 5.
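A sketch of the per-season aggregation behind a plot like this, reusing top3 and bing from above (the name season_sentiment is illustrative):

## net Bing sentiment per character and season
season_sentiment <- df_tokens %>%
  filter(character %in% top3$character) %>%
  inner_join(bing, by = "word") %>%
  count(character, season, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(season_sentiment, aes(x = season, y = sentiment, colour = character, group = character)) +
  geom_line() +
  geom_point() +
  labs(x = "Season", y = "Net sentiment", title = "Sentiment trend by season for the 3 main characters")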

N-GRAM ANALYSIS

In this section I wanted to find out which words frequently appear together. To do this, we can tokenize the lines into bigrams and count how many times each pair of words appears. The graph below shows the pairs that appear together more than 15 times. Some really meaningful pairs can be noticed, such as “Miss Mystic Falls” (Mystic Falls being their hometown), “leave town” (as they always needed to leave town), “hunter”, “vampire”, “blood”, “human” and others.

##tokenize using bigrams
df_bigrams <- df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

df_bigrams <- df_bigrams %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  count(word1, word2, sort = TRUE)

##add stopwords
vd_stopwords <- tibble(word = c("k", "ll", "ha", "yeah", 'ooh', 'whoa', 'aah', 'hey', 'ah', 'eh', 'uh', 'huh'))

##filter out stopwords
bigrams_filtered <- df_bigrams %>% 
  filter(!(word1 %in% vd_stopwords$word) & !(word2 %in% vd_stopwords$word)) %>% 
  filter(!(word1 %in% stop_words$word) & !(word2 %in% stop_words$word))

##bigram network graph
library(igraph)
library(ggraph)

bigram_graph <- bigrams_filtered %>% 
        select(from = word1, to = word2, n = n) %>% 
        filter(n > 15) %>% 
        graph_from_data_frame()

set.seed(2018)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
        geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                       arrow = a, end_cap = circle(.07, 'inches')) +
        geom_node_point(color = "lightblue", size = 5) +
        geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
        labs(x="",y="", title = "Network of Bigrams")+
        theme_bw()+
        theme(legend.position="none",
              plot.title=element_text(size=20, hjust=0.1, colour="#58D68D", vjust=1))

TOPIC MODELLING

Lastly, I performed topic modelling with an LDA model in order to determine the main topics discussed by the 3 main characters and how they changed over the seasons. It can be noticed that the words are largely shared by all three characters, but Elena has some distinct topics such as love, family and feel, which was expected, as she is a very infantile character in the series.
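A sketch of how the LDA step could be set up with the topicmodels package, here with one document per main character and an illustrative number of topics (k = 4); the analysis described above would instead group documents by character and season:

library(topicmodels)

## build a document-term matrix, one document per main character
dtm <- df_tokens %>%
  filter(character %in% top3$character) %>%
  count(character, word) %>%
  cast_dtm(document = character, term = word, value = n)

## fit an LDA model with an illustrative number of topics
vd_lda <- LDA(dtm, k = 4, control = list(seed = 1234))

## per-topic word probabilities, top ten terms per topic
tidy(vd_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)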

CONCLUSION

Text analysis has been performed on the Vampire Diaries series using some powerful R packages. We identified the most frequent words, the most negative and positive words, and the sentiments of the main characters across all episodes and seasons. From the analysis it can be concluded that both the main characters and the series as a whole carry a negative sentiment in every season. That conclusion met my initial expectations, as The Vampire Diaries really knows how to break fans’ hearts, with some of the saddest and most tragic episodes. Dramas of this kind, especially supernatural ones, come with tragedy, and The Vampire Diaries is no exception. I initially expected the three main characters to become sadder with each new season, as there were always new tragedies and dramas, but as the sentiment trend analysis shows, there is no such pattern; in some of the later seasons Damon and Stefan were even less negative than in the first ones.