Introduction

If someone asked what my favorite TV show was, I would definitely hesitate for a while, but most likely I would say Better Call Saul! And since the final season is airing right now, I thought it was a great time to analyze its famous lines.

My original idea was to analyse only the main character’s (Jimmy McGill/Saul Goodman) lines and compare them to Saul Goodman’s lines in Breaking Bad (Better Call Saul is the prequel of Breaking Bad; the two shows share several characters, most importantly Jimmy McGill/Saul Goodman). However, after some scraping I had to realize that there simply are not enough such lines to analyse, so instead I decided to compare the full transcripts of the original show (Breaking Bad) and its prequel.

I assume that Better Call Saul has an overall more positive tone; however, I expect it to slowly become more and more similar to Breaking Bad as Jimmy McGill, step by step, takes on his final form known from the original series.

Data

I managed to find a website (please see the code for the URL) that has full transcripts for all episodes of both shows. I scraped all Breaking Bad episodes and the first five seasons of Better Call Saul. For the purpose of further analysis I added the episode and season numbers. Even though the scripts for both shows were scraped from the same website, the texts were formatted slightly differently, probably because different people edited them, so I scraped the two shows separately.

# packages used throughout this post
library(tidyverse)   # dplyr, tidyr, ggplot2, readr, stringr, tibble
library(rvest)       # read_html(), html_nodes(), html_attr(), html_text()
library(data.table)  # rbindlist(), as.data.table()
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments(), cast_dtm(), tidy()
library(igraph)      # graph_from_data_frame()
library(ggraph)      # ggraph(), geom_edge_link(), geom_node_point(), geom_node_text()

# Better Call Saul

# getting all the links to the episode pages
links <- tibble()
for (page_result in c(2, 27)) {
  link <- paste0('https://transcripts.foreverdreaming.org/viewforum.php?f=205&start=', page_result)
  person_link <- read_html(link) %>% html_nodes('.topictitle') %>% html_attr('href')
  person_link <- paste0('https://transcripts.foreverdreaming.org', str_sub(person_link, 2, -1))
  links <-  rbind(links, person_link)
}
links_vector <- as.vector(t(links))  ## transforming links to vector
links_vector <- links_vector %>% str_replace("&sid(.)+", "")
links_vector <- unique(links_vector)
links_vector <- links_vector[-1]     ## removing first item, because it's not a link for an episode

# function for scraping transcripts from the individual episode pages
scripts <- function(url) {
  url %>% read_html() %>% html_nodes('.postbody') %>% html_text()
}

test <- lapply(links_vector, scripts)
bcs_raw <- rbindlist(lapply(test, as.data.table), fill = T)

bcs_raw <- bcs_raw %>% mutate(
  show = "Better Call Saul",
  episode = row_number(),
  text = V1
) %>% select(-c(V1))

# Adding seasons
bcs_raw <- bcs_raw %>% mutate(
  season = case_when(episode <= 10 ~ 1, episode <= 20 ~ 2, episode <= 30 ~ 3,
                     episode <= 40 ~ 4, TRUE ~ 5)
)

bcs_raw <- bcs_raw[, c(1, 4, 2, 3)]

# write_csv(x = bcs_raw, "data/bcs_raw.csv")
# write_rds(x = bcs_raw, "data/bcs_raw.rds")

# Breaking Bad

# getting all the links to the episode pages
links <- tibble()
for (page_result in c(0,25,50)) {
  link <- paste0('https://transcripts.foreverdreaming.org/viewforum.php?f=165&start=', page_result)
  person_link <- read_html(link) %>% html_nodes('.topictitle') %>% html_attr('href')
  person_link <- paste0('https://transcripts.foreverdreaming.org', str_sub(person_link, 2, -1))
  links <-  rbind(links, person_link)
}
links_vector <- as.vector(t(links))  ## transforming links to vector
links_vector <- links_vector %>% str_replace("&sid(.)+", "")
links_vector <- unique(links_vector)
links_vector <- links_vector[-c(1,2)]     ## removing first items, because they are not links for an episode


# function for scraping transcripts from the individual episode pages
scripts <- function(url) {
  url %>% read_html() %>% html_nodes('.postbody') %>% html_text()
}

test <- lapply(as.list(links_vector), scripts)
bb_raw <- rbindlist(lapply(test, as.data.table), fill = T)

bb_raw <- bb_raw %>% mutate(
  show = "Breaking Bad",
  episode = row_number(),
  text = V1
) %>% select(-c(V1))

# Adding seasons - season lengths are not equal: the first season has only 7 episodes,
# most of the others have 13 and the final season has 16 episodes
bb_raw <- bb_raw %>% mutate(
  season = case_when(episode <= 7 ~ 1, episode <= 19 ~ 2, episode <= 33 ~ 3,
                     episode <= 46 ~ 4, TRUE ~ 5)
)

bb_raw <- bb_raw[, c(1, 4, 2, 3)]

# write_csv(x = bb_raw, "data/bb_raw.csv")
# write_rds(x = bb_raw, "data/bb_raw.rds")
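
As a quick sanity check (not part of the original write-up), counting the scraped episodes per show and season can catch pagination or off-by-one problems in the scraping loops; a minimal look using the two data frames created above:

# number of scraped episodes per show and season
bind_rows(bcs_raw, bb_raw) %>% count(show, season)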

Further cleaning with regular expressions

After acquiring the raw datasets, I cleaned them using regular expressions. I removed parentheses and square brackets together with the text inside them, since they do not contain actual speech. In addition, a character’s name sometimes appears before a line of dialogue; I removed those instances too, as well as all digits. Lastly, the two datasets were combined using rbind.
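
To make the patterns below easier to follow, here is a small illustration on an invented sample line (not taken from the actual transcripts); it mirrors the substitutions applied to the real data:

# invented example line, not from the real transcripts
sample_line <- "[Door opens] Jimmy: (sighs) We open at 9. Call me."
sample_line %>%
  str_replace_all("\\[[^][]*]", "") %>%         # drop [stage directions]
  str_replace_all("\\s*\\([^\\)]+\\)", "") %>%  # drop (parentheticals)
  str_replace_all("[A-Z]?[a-z]+:", "") %>%      # drop "Name:" speaker tags
  str_replace_all("\\d+", "")                   # drop digits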

# Better Call Saul

#bcs_raw <- read_rds("data/bcs_raw.rds")

# Removing unnecessary text from the beginning
bcs_raw <- bcs_raw %>% mutate(text = str_remove(text, ".*;"))

# Removing square brackets and text within, because they aren't part of the dialogues
bcs_raw <- bcs_raw %>% mutate(text = gsub("\\[[^][]*]", "", text))

# Removing parentheses and text within, because they aren't part of the dialogues
bcs_raw <- bcs_raw %>% mutate(text = str_replace_all(text, "\\s*\\([^\\)]+\\)", ""))

# Removing names of characters (Saul:, Jimmy:, Mike: etc) from the beginning of the lines
bcs_raw <- bcs_raw %>% mutate(text = str_replace_all(text, "[A-Z]?[a-z]+:", ""))

# Removing digits
bcs_raw <- bcs_raw %>% mutate(text = str_replace_all(text, "(\\d)*", ""))


#write_csv(x = bcs_raw, "data/asd_raw.csv")


# Breaking Bad

#bb_raw <- read_rds("data/bb_raw.rds")

# Removing unnecessary text from the beginning
bb_raw1 <- bb_raw %>% mutate(text = str_remove(text, ".*;"))

# Removing square brackets and text within, because they aren't part of the dialogues
bb_raw1 <- bb_raw1 %>% mutate(text = gsub("\\[[^][]*]", "", text))

# Removing parentheses and text within, because they aren't part of the dialogues
bb_raw1 <- bb_raw1 %>% mutate(text = str_replace_all(text, "\\s*\\([^\\)]+\\)", ""))

# Removing names of characters (Saul:, Jimmy:, Mike: etc) from the beginning of the lines
bb_raw1 <- bb_raw1 %>% mutate(text = str_replace_all(text, "[A-Z]?[a-z]+:", ""))

# Removing digits
bb_raw1 <- bb_raw1 %>% mutate(text = str_replace_all(text, "(\\d)*", ""))

#write_csv(x = bb_raw1, "data/bb_raw1.csv")

# Creating one dataset
data <- rbind(bb_raw1, bcs_raw)

Creating tidy text format

The next step is to create a tidy text dataset, meaning that each row contains one token, which at this stage is a single word. For this purpose I used the tidytext library’s unnest_tokens function. Then I removed stopwords using tidytext’s built-in stop_words list, which I extended with some custom words.

data_word <- data %>%  unnest_tokens(word, text)

# Removing stopwords with some additions to the built in stop_words df
# I chose these words after taking a look at the most frequent words
custom_stop_words <- bind_rows(
  tibble(word = c("yeah", "uh", "mm", "gonna", "mnh", "zu", "shoo", "amc's breaking", "la",
                  "hey", "huh", "hmm", "ah", "ll", "em", "ohh", "yo", 
                  "um", "sh"), lexicon = c("custom")), stop_words)

data_word <- data_word %>%
  anti_join(custom_stop_words)

#write_rds(x = data, "data/data_tidy.rds")

Word frequency

The two combined scripts contain more than 100 000 words, with more than 20 000 unique words (after removing stopwords). First, let’s visualize the differences. Words that are close to the line in this plot have similar frequencies in both sets of texts, for example time, call, money and guy. Words that are far from the line (Walter, Hank, Chuck, Verde) are found much more often in one set of texts than in the other.
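
The word counts quoted above can be verified directly from the tidy data frame (a quick check, not part of the original post):

# total and unique word counts after stopword removal
data_word %>% summarise(total_words = n(), unique_words = n_distinct(word))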

data_word %>% count(show, word, sort = T) %>% group_by(show) %>% 
  mutate(proportion = n / sum(n)) %>%  
  select(-n) %>% 
  spread(show, proportion) %>% 
  ggplot(aes(x = `Better Call Saul`, y = `Breaking Bad`, label = word)) +
  geom_point(alpha = 0.3) +
  geom_text(aes(label=ifelse((`Breaking Bad`>0.0015)|(`Better Call Saul`>0.002),
                             as.character(word),'')),hjust=0,vjust=0, alpha = 0.4) +
  geom_abline(alpha = 0.5) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(labels = scales::percent) +
  theme_bw()

Now let’s take a look at the most frequent words per TV show.

Some of them are common in both shows, but I’m surprised that “talk” is only frequent in Breaking Bad and not in Better Call Saul, even though the latter’s main character is a lawyer. On the other hand, “call” is more frequent in Better Call Saul.
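
Out of curiosity, the two words mentioned above can also be checked directly (a quick look, not part of the original post):

# frequency of "talk" and "call" in each show
data_word %>% filter(word %in% c("talk", "call")) %>% count(show, word)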

data_word %>%
  count(show, word, sort = TRUE) %>% 
  top_n(20) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = show)) + geom_bar(stat="identity", position="dodge") + 
  labs(
    title="Word frequency by TV show",
    y="Word",
    x="Frequency"
  ) +
  coord_flip() +
  theme_bw()

The next plots show the frequency of words per season (with season numbers pooled across the two shows). We can see that season 5 accounts for most of the frequent words, which might mean that the two series are getting closer to each other in some way.

data_word %>%
  count(season, word, sort = TRUE) %>% 
  top_n(30) %>% 
  mutate(word = reorder(word, n))  %>% 
  ggplot(aes(word, n, fill = season)) + geom_bar(stat="identity", position="dodge", show.legend = F) + 
  facet_wrap(~season) +
  labs(
    title="Word frequency by Season",
    y="Word",
    x="Frequency"
  ) +
  coord_flip() +
  theme_bw()

Bigrams

Sometimes an individual word has a different or even opposite meaning in context. Analyzing n-grams can help with this issue to some extent. First I visualized the most frequent bigrams per TV show. A few of these bigrams make sense for those who are familiar with the shows, for example “car wash”, “Uncle Hank” or “Mesa Verde”.

data_bigrams <- data %>%  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams_separated <- data_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ", remove = F)

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% custom_stop_words$word) %>%
  filter(!word2 %in% custom_stop_words$word)

bigrams_filtered %>% count(show, bigram, sort = TRUE) %>%
  filter(n > 30) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n, fill = show)) + geom_bar(stat="identity", position="dodge") + 
  labs(
    title="Bigram frequency by TV show",
    y="Bigram",
    x="Frequency"
  ) +
  coord_flip() +
  theme_bw()

Misidentified Sentiment

Let’s now look at the cases where analyzing the text word by word is misleading. The plot below shows the most common bigrams whose first word is “not”. “Not like” and “not good” were the largest causes of misidentification, making the text seem more positive than it is.

not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  count(show, word2, value, sort = TRUE) %>%
  ungroup()

not_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(30) %>%
  mutate(word2 = reorder(word2, contribution)) %>% 
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip() +
  theme_bw()

Visualizing a network of bigrams with ggraph

I thought there would be some connections between the nodes of the two shows, but none show up on this graph. On the other hand, we can observe some of the more common names and phrases from both shows.

bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_graph <- bigram_counts  %>% 
  filter(n > 20) %>% 
  select(from=word1, to=word2, n) %>% 
  graph_from_data_frame()

set.seed(2022)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

Sentiment analysis

In the next part I conduct a lexicon-based sentiment analysis on the transcripts using tidytext tools and sentiment lexicons.

Bing Lexicon

First I use the Bing lexicon to categorize the words in a binary fashion into positive and negative categories for both shows. Just by looking at the most frequent words, Breaking Bad seems to have more negative words, while the counts of the positive words are about equal.
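
The aggregate counts of positive and negative words per show are a useful complement to the word-level plot below (a quick check, not part of the original post):

# total number of positive and negative words per show (Bing lexicon)
data_word %>% inner_join(get_sentiments("bing")) %>% count(show, sentiment)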

data_word %>% inner_join(get_sentiments("bing")) %>% count(show, word, sentiment, sort = TRUE) %>%  
  group_by(sentiment) %>% top_n(20) %>%
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = show)) + 
  geom_bar(stat="identity", position="dodge") + 
  labs(
    y="Contribution to the Sentiment",
    x="Frequency"
  ) +
  coord_flip() +
  theme_bw() +
  theme(legend.title=element_blank()) +
  facet_wrap(~sentiment, scales = "free_y")

AFINN Lexicon

The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. This one is pretty interesting: it looks like Better Call Saul started off on a more negative note, its score then gets more positive by the end of the second season, and at the end of season three it dips down to a pretty negative value. This makes sense, because at that point one of the major characters dies. Better Call Saul also reaches a more negative value than Breaking Bad ever does, which is very surprising to me.
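
Season-level averages make the turning points described above easier to pin down (a quick check, not part of the original post):

# average AFINN score per word, by show and season
data_word %>% inner_join(get_sentiments("afinn")) %>% 
  group_by(show, season) %>% 
  summarise(avg_score = mean(value), .groups = "drop")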

data_word %>% inner_join(get_sentiments("afinn")) %>% group_by(episode) %>% 
  summarise(show, sentiment = sum(value)) %>% 
  ggplot(aes(episode, sentiment, fill = show)) +
  geom_col(show.legend = FALSE) +
  labs(y = 'Sum of AFINN sentiment score per episode',
       x = 'Episode #') +
  facet_wrap(~show, ncol = 1, scales = "fixed") +
  theme_bw()

NRC Lexicon

The NRC lexicon contains a wider variety of sentiments. In the next plot I visualize these sentiments per show. Other than trust, Breaking Bad achieves a higher frequency in all remaining seven sentiments, although the differences are not huge. Still, based on the NRC lexicon we can say that it displays a somewhat wider range of emotions than its prequel, Better Call Saul.

data_word  %>% 
  inner_join(get_sentiments("nrc")) %>% 
  filter(!(sentiment  %in% c("positive","negative"))) %>% 
  group_by(show) %>%
  count(sentiment) %>% 
  arrange(desc(n)) %>% 
  ggplot(aes(n, show , fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment) +
  labs(y = 'TV Show',
       x = '') +
  theme_bw() +
  theme(
  axis.text.x = element_blank())

Topic Modeling

Topic modeling is a method for unsupervised classification of text documents, similar to clustering numeric data, which finds natural groups of items even when we’re not sure what we’re looking for. For my analysis I used Latent Dirichlet Allocation (LDA), a particularly popular method for fitting a topic model. In the plot below, β is the per-topic-per-word probability, extracted from the fitted model with the tidytext package’s tidy() function.

I decided to go with a two-topic model, thinking that there would be two distinguishable topics because of the two shows, or perhaps because of the two worlds portrayed by both shows, namely the legal and the criminal one. However, the two topics look pretty similar to me, which could mean that the two shows are in fact similar in terms of topics.

library(topicmodels)

# episode numbers repeat across the two shows, so the document id combines show and episode
topic <- data_word %>% mutate(document = paste(show, episode, sep = " - ")) %>% 
  count(document, word) %>% cast_dtm(document, word, n)

ap_lda <- LDA(topic, k = 2, control = list(seed = 1234))
ap_topics <- tidy(ap_lda, matrix = "beta")

ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  xlab('') +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
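
As a follow-up (not part of the original analysis), the per-document-per-topic probabilities (γ) can show whether episodes of the two shows lean towards different topics; a minimal sketch, relying on the “Show - episode” document ids built above:

# per-document topic proportions (gamma); document ids are "Show - episode"
ap_gamma <- tidy(ap_lda, matrix = "gamma")

# average topic share per show; similar shares for both topics would support the
# impression that the model does not separate the two shows
ap_gamma %>%
  mutate(show = str_remove(document, " - \\d+$")) %>%
  group_by(show, topic) %>%
  summarise(mean_gamma = mean(gamma), .groups = "drop")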

Conclusion

My assumption before conducting this analysis was that Better Call Saul would have an overall more positive tone than Breaking Bad; however, after conducting the analysis, I cannot confirm that. As a matter of fact, the analysis with the AFINN lexicon suggests it could even be the other way around, with the prequel hitting more negative notes. Based on the various methods I used in this project, I would say that the two shows are pretty close as far as the sentiments of their transcripts go, which makes sense given the large overlap between them. Still, for me as a fan of both series, Better Call Saul seems more lighthearted without a doubt.