Introduction

For this assignment I will be doing some text analysis on the show How I Met Your Mother, using packages like tidytext. I started off by scraping the transcripts from the Forever Dreaming website; the code for this can be found below.

My main focus will be on the character of Ted, and most of the analysis will compare his character with the other main characters. My hypothesis is that Ted became more negative and pessimistic as time passed and he still could not find "the one".

Scraping the Data

Below is the code for scraping the data from the website. It scraped all the episode data correctly, with the right number of seasons and episodes (9 seasons and 208 episodes). However, I then dropped the last 3 seasons and focused my analysis on the first 6. The primary reason is a simple one: I have only watched the first 6 seasons so far :p. I really want to finish the show because I love it, and hopefully I will get the time once I am done with all the assignments :D. I have commented out the scraping code, as I saved the data to a csv and an rds file and now read it directly from GitHub.

# Get the links for the episodes ------------------------------------------

#### Links for the 9 pages where the links for episodes can be found
#pages <- paste0("https://transcripts.foreverdreaming.org/viewforum.php?f=177&start=", seq(0, 200, 25))

#### Get the links for all episodes in one table
#process_one_link <- function(my_link){
#  t <- read_html(my_link)
#  episodes <- list()
#  episodes[['name']] <- t %>% html_nodes(".topictitle") %>% html_text()
#  episodes[['link']] <- t %>% html_nodes(".topictitle") %>% html_attr('href')
#  return(episodes)
#}

#episode_links <- data.table(rbindlist(lapply(pages, process_one_link)))


#### There is a link to a pinned announcement topic that is not an episode,
#### so let's get rid of it
#episode_links <- episode_links[name != "Updates: (05/10/22) **Summer 2022 Challenge**",]

## 208 links remain, which is perfect, as that is exactly how many episodes there are in the series


# Get the transcript for all episodes -------------------------------------

# link <- episode_links$link[2]
#get_transcript <- function(link) {
  # print(link)
#  t <- read_html(paste0("https://transcripts.foreverdreaming.org", str_sub(link, start = 2) ))
#  transcript <- t %>% html_nodes("#pagecontent p") %>% html_text()
#  tinfo <- t %>% html_nodes('h2') %>% html_text()
#  transcript <- str_subset(transcript, "^(?!\\[)")
#  transcript <- str_subset(transcript, "^(?!\\()")
#  transcript <- str_subset(transcript, "^(?!Scene)")
#  transcript<- transcript[grepl(':', transcript, fixed = T)]
#  textdf <- 
#    rbindlist(
#      lapply(transcript, function(x){
#        t_piaces <- strsplit(x, ':')[[1]]
#        data.table('actor' = t_piaces[1], 'text' = trimws(paste(t_piaces[2:length(t_piaces)], collapse = " " )) )
#      })
#    )
#  textdf$season <- substring(tinfo, 1, 2)
#  textdf$episode <- substring(tinfo, 4, 5)
#  textdf$title <- substring(tinfo, 9,nchar(tinfo))
#  return(textdf)
#}

#t_list <- pblapply(episode_links$link, get_transcript)
#full_df <- rbindlist(t_list, fill = T)


#saveRDS(full_df, "HIMYM_data.rds")

#write.csv(full_df, "HIMYM_data.csv", row.names = F)

data <- read.csv("https://raw.githubusercontent.com/Rauhannazir/Data-Science-3/main/HIMYM_data.csv")

Data Cleaning and Exploratory Data Analysis

For data cleaning, I mainly dropped the unwanted seasons and cast members: I only kept the main 5 cast members and the first 6 seasons. The major cleaning was done on the text itself, so that it was in the right format for all kinds of analysis; for instance, removing punctuation, lower-casing the text, removing stop words and expanding contractions. I also manually listed words that were messing up the results at later stages so they could be excluded. Another major step was tokenization, a technique through which sentences are broken down into single words so that they can be analyzed and their sentiments figured out. I also merged overlapping names, since different spellings were used for the same characters, so that we do not lose data when we select the top 5 cast members.
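
Just to illustrate what tokenization does, here is a tiny sketch on a made-up sentence (not the actual show data); unnest_tokens() lower-cases the text, drops the punctuation and returns one word per row.

# Tiny tokenization sketch (hypothetical sentence, not the show data)
library(dplyr)
library(tidytext)

tibble(text = "Kids, I am going to tell you an incredible story.") %>% 
  unnest_tokens(word, text)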

For EDA, I created visualizations to explore the number of lines, seasons and episodes, and to confirm that the main cast members really were the top 5 in terms of the number of lines they spoke.

#Visualizing the number of episodes in each season
data %>% 
  group_by(season) %>% 
  summarise(num_episodes = length(unique(episode))) %>% 
  ggplot(aes(as.factor(season), num_episodes)) +
  geom_col(fill = "cyan4", alpha = 0.8, width = 2/3) + 
  geom_text(aes(label = num_episodes), size = 4, position = position_stack(vjust = 0.5)) +
  labs(title = "Number of episodes in each season",
       x = "Season", y = "Number of episodes") +
  theme_light()

# A total of 9 seasons and 208 episodes, which is the correct count.
data <- data %>% 
  filter(season != "7" & season !="8" & season != "9")

#Now I will look at the number of lines in each season, just to make sure that nothing went wrong and we are not missing a huge number of lines. 

data %>% 
  group_by(season) %>% 
  summarise(num_lines = n()) %>% 
  ggplot(aes(as.factor(season), num_lines)) +
  geom_col(fill = "cyan4", alpha = 0.8, width = 2/3) + 
  geom_text(aes(label = num_lines), size = 4, position = position_stack(vjust = 0.5)) +
  labs(title = "Number of lines in each season",
       subtitle = "Lines refer to sentences spoken by the characters",
       x = "Season", y = "Number of lines") +
  theme_light()

#This seems about right because on average each episode had around 200 lines 


#Adding a column for number of lines for each episode 
data <- data %>% 
  group_by(season, episode) %>% 
  mutate(lines = n()) %>% 
  ungroup()

#Converting the string type to numeric so we can calculate the mean
data$lines <- as.numeric(data$lines)
#mean(data$lines)

#The mean of lines is 233 to be precise

#we will also convert season and episode number to numeric as well
data$season <- as.numeric(data$season)
data$episode <- as.numeric(data$episode)

I want to see the top 10 actors in terms of number of lines; if nothing is wrong, the top 5 should be the main characters. As expected, we get the right top 5 cast members. However, as mentioned in the data cleaning part, I had to manually rename some cast members that appeared multiple times under different labels, such as Ted (2030).

data %>% 
  group_by(Character = actor) %>% 
  summarise(Total.Lines = n()) %>% 
  arrange(desc(Total.Lines)) %>%
  head(10)
## # A tibble: 10 x 2
##    Character     Total.Lines
##    <chr>               <int>
##  1 Ted                  6274
##  2 Barney               4889
##  3 Marshall             4369
##  4 Robin                3887
##  5 Lily                 3721
##  6 Ted (2030)            403
##  7 Ted from 2030         383
##  8 Man                   239
##  9 Stella                225
## 10 Woman                 215
#we will also visualize this 
data %>% group_by(actor) %>% 
  summarize(lines = n()) %>% 
  arrange(desc(lines)) %>% 
  top_n(10) %>%
  ggplot(aes(reorder(actor, lines), lines)) +
  geom_col(fill = 'cyan4', alpha = 0.8) +
  geom_text(aes(label = lines), size = 4, position = position_stack(vjust = 0.5)) +
  labs(title = 'Number of lines by characters (top 10)',
       x = NULL, y = NULL) +
  coord_flip() +
  theme_light()

#Ted
data$actor[data$actor=="Ted (2030)"] <- "Ted"
data$actor[data$actor=="Ted from 2030"] <- "Ted"
data$actor[data$actor=="Ted (voix off)"] <- "Ted"
data$actor[data$actor=="Robin Sparkles"] <- "Robin"

These are the major variants; the remaining instances mostly occur only once, so there is no point changing their spellings as well. Now we will look at the counts again, for just the top 5. I am going to drop every actor apart from these 5, as we do not need them; after doing so we still have over 85% of the observations left. Moreover, we can see that Ted has the highest number of lines, as he is kind of the "main" character, being the one telling the story to his kids.

data %>% group_by(actor) %>% 
  summarize(lines = n()) %>% 
  arrange(desc(lines)) %>% 
  top_n(5) %>%
  ggplot(aes(reorder(actor, lines), lines)) +
  geom_col(fill = 'cyan4', alpha = 0.8) +
  geom_text(aes(label = lines), size = 4, position = position_stack(vjust = 0.5)) +
  labs(title = 'Number of lines by characters (top 5)',
       x = NULL, y = NULL) +
  coord_flip() +
  theme_light()

data <- data %>% filter(actor %in% c("Ted", "Barney", "Marshall", "Robin", "Lily"))

data <- data %>% 
  mutate(actor = as.factor(actor),
         season = as.factor(season))

I further explored this by counting the number of lines season by season.

s1_lines <- data %>% 
  filter(season == 1) %>% 
  group_by(Character = actor) %>% 
  summarise(Total.Lines.S1 = n()) %>% 
  arrange(desc(Total.Lines.S1)) %>%
  head(5) 
s1_lines
## # A tibble: 5 x 2
##   Character Total.Lines.S1
##   <fct>              <int>
## 1 Ted                 1487
## 2 Barney               777
## 3 Marshall             767
## 4 Lily                 762
## 5 Robin                685
s2_lines <- data %>% 
  filter(season == 2) %>% 
  group_by(Character = actor) %>% 
  summarise(Total.Lines.S2 = n()) %>% 
  arrange(desc(Total.Lines.S2)) %>%
  head(5) 
s2_lines
## # A tibble: 5 x 2
##   Character Total.Lines.S2
##   <fct>              <int>
## 1 Ted                 1208
## 2 Barney               726
## 3 Marshall             716
## 4 Robin                675
## 5 Lily                 635
s3_lines <- data %>% 
  filter(season == 3) %>% 
  group_by(Character = actor) %>% 
  summarise(Total.Lines.S3 = n()) %>% 
  arrange(desc(Total.Lines.S3)) %>%
  head(5) 
s3_lines
## # A tibble: 5 x 2
##   Character Total.Lines.S3
##   <fct>              <int>
## 1 Ted                  820
## 2 Barney               740
## 3 Marshall             492
## 4 Lily                 434
## 5 Robin                433
s4_lines <- data %>% 
  filter(season == 4) %>% 
  group_by(Character = actor) %>% 
  summarise(Total.Lines.S4 = n()) %>% 
  arrange(desc(Total.Lines.S4)) %>%
  head(5) 
s4_lines
## # A tibble: 5 x 2
##   Character Total.Lines.S4
##   <fct>              <int>
## 1 Ted                 1243
## 2 Barney               948
## 3 Marshall             798
## 4 Robin                707
## 5 Lily                 574
s5_lines <- data %>% 
  filter(season == 5) %>% 
  group_by(Character = actor) %>% 
  summarise(Total.Lines.S5 = n()) %>% 
  arrange(desc(Total.Lines.S5)) %>%
  head(5) 
s5_lines
## # A tibble: 5 x 2
##   Character Total.Lines.S5
##   <fct>              <int>
## 1 Ted                 1276
## 2 Barney               861
## 3 Marshall             859
## 4 Robin                743
## 5 Lily                 671
s6_lines <- data %>% 
  filter(season == 6) %>% 
  group_by(Character = actor) %>% 
  summarise(Total.Lines.S6 = n()) %>% 
  arrange(desc(Total.Lines.S6)) %>%
  head(5) 
s6_lines
## # A tibble: 5 x 2
##   Character Total.Lines.S6
##   <fct>              <int>
## 1 Ted                 1096
## 2 Barney               837
## 3 Marshall             737
## 4 Robin                661
## 5 Lily                 645

These results are better depicted with a graph, shown below. One interesting insight is that the ordering by number of lines spoken stays almost exactly the same, with only Robin and Lily swapping the last two spots between them.

data %>% 
  group_by(season, actor) %>% 
  summarize(Count = n()) %>% 
  arrange(desc(Count)) %>% 
  group_by(season) %>% 
  top_n(5) %>% 
  ungroup() %>% 
  mutate(actor = reorder_within(actor, Count, season)) %>% 
  ggplot(aes(actor, Count, fill = season)) +
  geom_col(show.legend = F, fill = "cyan4") +
  labs(title = 'Number of lines by characters in each season',
       x = NULL, y = NULL) +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~season, scales = 'free') +
  theme_light()

Another interesting thing is that there is a general downward trend in the number of lines spoken as the seasons progress. To explore this further, I will look at the number of lines per episode over the course of the show.

Barney and Robin are almost constant, while Lily and Ted show a downward trend. The most significant drop is for Ted, who also has the highest share of lines overall.

data <- data %>% 
  group_by(season, episode) %>% 
  mutate(episode_num = cur_group_id()) %>% 
  ungroup() 
data %>% 
  group_by(episode_num, actor) %>% 
  summarise(num_lines = n()) %>% 
  ggplot(aes(episode_num, num_lines, color = actor)) +
  geom_line(show.legend = F) +
  geom_smooth(method = lm, show.legend = F) +
  facet_wrap(~actor) + 
  labs(title = "Number of lines through the episodes by people",
       x = "Episode", y = "Number of lines") +
  theme_light() 

Before we can start performing sentiment analysis, we need to make sure that the text is in the right form. By that I mean we need to convert the text to lower case, remove punctuation, remove stop words and expand contractions.

## Convert the text to lower case

data$text <- tolower(data$text)

# Removing contractions and replacing with full words
data$text <- lapply(data$text, replace_contraction)

# Removing any punctuation from the lines
data$text <- lapply(data$text, removePunctuation)
# Removing — from the script
for (i in 1:length(data$text)) {
  data$text[i] <- gsub("—","",data$text[i])
}


#Tokenizing and removing stop words 
tokens <- data %>% 
  mutate(text = as.character(text)) %>%  
  group_by(actor) %>% 
  unnest_tokens(word, text) 
tokens %>% head(5)
## # A tibble: 5 x 2
## # Groups:   actor [1]
##   actor    word 
##   <fct>    <chr>
## 1 Marshall opens
## 2 Marshall ring 
## 3 Marshall will 
## 4 Marshall you  
## 5 Marshall marry
tokens <-  anti_join(tokens,stop_words)

Sentiment Analysis

Now, moving on to the actual sentiment analysis, I will begin by seeing what emotions drive Ted.

Overall, it shows that Ted is more positive than negative, although it is pretty close. That is exactly what I expected from a guy like Ted, who had his phases of ups and downs. The third most significant emotion is anticipation.

to_plot <- tokens %>% 
  inner_join(get_sentiments("nrc")) %>% 
  filter(actor == "Ted") %>% 
  count(sentiment, actor) %>% 
  group_by(actor, sentiment) %>% 
  summarise(sentiment_sum = sum(n)) %>% 
  ungroup() %>% 
  arrange(sentiment_sum)
myColors = c("Ted" = "#f5f505", "anger" = "#FA8072", "anticipation" = "#04700A", "disgust" = "#99F1EB", "fear" = "#F39C12", "joy" = "#D7DBDD", "negative" = "#06060a", "positive" = "#062D82", "sadness" = "#546e3f", "surprise" = "#2ef24f", "trust" = "#3e2ef2" )


# The Chord Diagram  
circos.clear()
circos.par(gap.after = c(rep(2, length(unique(to_plot[[1]])) - 1), 15,
                         rep(2, length(unique(to_plot[[2]])) - 1), 15), gap.degree=3.5)
chordDiagram(to_plot,grid.col = myColors, transparency = 0.5, annotationTrack = c("name", "grid"),
             annotationTrackHeight = c(0.1, 0.02))
title("What emotions is Ted driven by?")

I will dig deeper into these three emotions by looking at the top words Ted used belonging to each. One issue I see with this is that some words overlap multiple emotions, so the context in which a word was used is not captured; for instance, the words good, wait and god.
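
As a quick check of this overlap, here is a small sketch (assuming the NRC lexicon has already been downloaded via the textdata package): the lexicon has one row per word-emotion pair, so counting rows per word shows how many categories each of these words falls into.

library(dplyr)
library(tidytext)

# How many NRC categories do "good", "wait" and "god" each belong to?
get_sentiments("nrc") %>% 
  filter(word %in% c("good", "wait", "god")) %>% 
  count(word, name = "n_categories")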

tokens %>% 
  inner_join(get_sentiments("nrc")) %>%
  filter(actor == "Ted") %>% 
  filter(sentiment %in% c("anticipation", "positive", "negative")) %>% 
  count(sentiment, word, sort=T) %>% 
  group_by(sentiment) %>% 
  arrange(desc(n)) %>% 
  slice(1:7) %>% 
  
  #Plot:
  ggplot(aes(x=reorder(word, n), y=n)) +
  geom_col(aes(fill=sentiment), show.legend = F) +
  facet_wrap(~sentiment, scales = "free_y") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  coord_flip() +
  theme_bw() +
  labs(x="Word", y="Frequency", title="Sentiment split by most frequent words for Fear, Negative, Positive emotions") 

Still, to test my hypothesis that Ted became more negative as the seasons progressed, I need to split these counts by season and see whether it holds.

As we saw in the overall sentiment graph, the negative and positive emotions were pretty close. If we now look at the season-wise breakdown, we see that in the first season Ted used words related to both emotions fairly evenly, and right after that season the negative starts to outweigh the positive. The gap remains throughout and even increases a bit in the sixth season, which supports my hypothesis.

tokens_2 <- data %>% 
  mutate(line = as.character(text)) %>%  
  group_by(actor, season) %>% 
  unnest_tokens(word, line) 
tokens_2 <-  anti_join(tokens_2,stop_words)
sentiment_season <- tokens_2 %>% 
  inner_join(sentiments) %>% 
  filter(actor == "Ted") %>% 
  count(sentiment, season) %>% 
  group_by(actor, sentiment, season) %>% 
  summarise(sentiment_sum = sum(n)) %>% 
  ungroup()
ggplot(sentiment_season) +
  aes(x = sentiment, weight = sentiment_sum) +
  geom_bar(fill = "#112446") +
  theme_minimal() +
  facet_grid(rows = vars(season)) +
  labs(
    x = "Emotion",
    y = "Words per emotion",
    title = "Emotions by Season"
  ) +  
  ggthemes::theme_economist() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(size = 15L,
                              face = "bold",
                              hjust = 0.5
    ),
    panel.grid.minor = element_line(color = "white",
                                    size = 0.2,
                                    linetype = 1),
    legend.background = element_rect(color = "black", fill = "#9bff16", linetype = "solid"),
    axis.title.y = element_text(margin=margin(r=5)))

## N-Grams

As one word alone is not enough to accurately depict the emotion with which it was used, I will now create N-grams, which should help gauge the emotions more accurately, and then look at the top word combinations for each character.

In this case the bi-grams are not that useful either, because the most frequent ones are mostly just character names used in the show. So for my later analysis I will remove these names as well.

As for the tri-grams: even though these words may fall into the category of stop words and not represent much by themselves, in this case they do actually represent Ted's character and his negative and confused emotions, usually reluctant to take any initiative and not knowing what to do in many situations.

#Bi-Gram

data %>% 
  group_by(actor) %>% 
  unnest_tokens(bigram, text, token = 'ngrams', n = 2) %>% 
  ungroup() %>% 
  separate(bigram, c('word_1', 'word_2'), sep = ' ') %>%
  filter((!word_1 %in% stop_words$word) & 
           (!word_2 %in% stop_words$word) & (word_1 != word_2)) %>% 
  unite(bigram, word_1, word_2, sep = ' ') %>% 
  count(actor, bigram, sort = T) %>% 
  group_by(actor) %>% 
  top_n(5, wt = n) %>% 
  ungroup() %>% 
  mutate(actor = as.factor(actor),
         bigram = reorder_within(bigram, n, actor)) %>% 
  ggplot(aes(bigram, n, fill = actor)) +
  geom_col(show.legend = F) +
  labs(title = "",
       x = NULL, y = NULL) +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~actor, scales = 'free') +
  theme_light()

#Tri Gram for TED
Ted_trigrams <- data %>%
  filter(actor == "Ted") %>% 
  unnest_tokens( output = "trigrams", input = "text",token = "ngrams", n = 3) %>% 
  count(trigrams, sort = TRUE) %>%
  head(11) 
Ted_trigrams <- Ted_trigrams[-1,]
ggplot(Ted_trigrams) +
  aes(x = trigrams, weight = n) +
  geom_bar(fill = "#112446") +
  labs(
    x = "Count",
    y = "Trigrams",
    title = "Trigrams"
  ) +  
  ggthemes::theme_economist() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(size = 15L,
                              face = "bold",
                              hjust = 0.5
    ),
    panel.grid.minor = element_line(color = "white",
                                    size = 0.2,
                                    linetype = 1),
    legend.background = element_rect(color = "black", fill = "#9bff16", linetype = "solid"),
    axis.title.y = element_text(margin=margin(r=5))) +
  coord_flip()

Below is the process for manually removing these words. I stored the result in a csv file and read it directly from GitHub, so the code is commented out.

#tokens <- data %>% unnest_tokens(word, text, token = "ngrams", n = 1)
 
#fwrite(tokens, "tokens.csv")
#tokens <- read.csv("tokens.csv")
#names(tokens)[names(tokens) == 'bigram'] <- 'word'
 
#tokens_filtered <- tokens %>%
#filter(!word %in% stop_words$word)
#tokens_filtered
 
#Removing undesirable words and stop words

#undesirable_words <- c("ted","lily","robin","hey", "yeah", "uh", "Ted","gonna", "i’m","Barney", "Lily","Marshall", "it’s", "y’know", "guys","that’s", "you’re","ooh", "umm", "huh", "um", "don’t", "god","y'know", "guy","can’t","ohh","i’ll", "didn’t","she’s","we’re","gotta", "wanna","ben","i’ve", "there’s", "what’s","doesn’t", "lot","he’s", "let’s","hey","barney","marshal")

#tokens_tidy <- tokens_filtered %>%
#filter(!word %in% undesirable_words) %>%
#filter(!nchar(word) < 3) %>%
#anti_join(stop_words)
 
 
# Removing digits
#tokens_tidy$word <- gsub("\\d", "", tokens_tidy$word)
 
#fwrite(tokens_tidy, "tokens_tidy.csv")
#tokens_tidy <- read.csv("tokens_tidy.csv")


#Now that I have tokenized I will comment out the above code.

tokens_tidy <- read.csv("https://raw.githubusercontent.com/Rauhannazir/Data-Science-3/main/tokens_tidy.csv")

I have already done the NRC analysis, which describes the sentiment of a word in terms of emotions. Now I will use AFINN, which scores each word according to the sentiment it expresses, and Bing, which categorizes each word as positive or negative.
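
For reference, here is a quick peek (a minimal sketch, assuming both lexicons have been downloaded via textdata) at how the two lexicons are structured: AFINN has a numeric value column ranging from -5 to 5, while Bing has a sentiment column with positive/negative labels.

# Structure of the two lexicons used below
get_sentiments("afinn") %>% head(3)
get_sentiments("bing") %>% head(3)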

We can see that, compared to the scores for the overall show, Ted's positive side is less positive and his negative side is more negative.

afinn <- get_sentiments("afinn")
tokensafinn <- tokens_tidy %>% inner_join(afinn)



tokens_viz <- tokens_tidy %>%  inner_join(afinn, by = c(word = "word")) %>%
  dplyr::count(word, value, sort = TRUE) %>%
  ungroup()

tokens_Ted <- tokens_tidy %>%  filter(actor %in%  "Ted")  %>% inner_join(afinn, by = c(word = "word")) %>%
  dplyr::count(word, value, sort = TRUE) %>%
  ungroup()
# tokens_viz

all_pop <- tokens_viz %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(25) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Top 25 popular words from HIMYM") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()+
  theme_light()

Ted_pop <- tokens_Ted %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(25) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Top 25 popular words spoken by Ted") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()+
  theme_light()
all_pop

Ted_pop

Now I will look at the positive and negative words for the show and compare them with the ones Ted spoke, after which I will dig deeper into the evolution of Ted over the course of the 6 seasons.

I remember the show having a lot of graphic words like whore and bitch, and that is confirmed here as well. Ted was no different in this respect and used similar words.

all_neg <- tokens_viz %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  filter(value<= -4) %>% 
  head(10) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Top 10 negative words used") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()+
  theme_light()


all_pos <- tokens_viz %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  filter(value>=4) %>% 
  head(10) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE, fill="#00BFC4") +
  xlab("Top 10 positive words used") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()+
  theme_light()

Ted_neg <- tokens_Ted %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  filter(value<= -4) %>% 
  head(10) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Top 10 negative words by Ted") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()+
  theme_light()


Ted_pos <- tokens_Ted %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  filter(value>=4) %>% 
  head(10) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE, fill="#00BFC4") +
  xlab("Top 10 positive words by Chandler") +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()+
  theme_light()
all_neg

Ted_neg

all_pos

Ted_pos

Tf-idf

This will allow me to find words that are unique to each character. With this technique we get much closer to finding identifying words for the characters; the results are very personalized, with unique words for every character, for instance Marshall's use of meow and mustard.
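
To make the intuition behind bind_tf_idf() concrete before applying it to the show, here is a toy sketch with made-up counts (speakers "A" and "B" are hypothetical): a word used by only one of the "documents" (characters) gets a positive idf and therefore a non-zero tf-idf, while a word everyone uses gets an idf of zero.

library(dplyr)
library(tidytext)

# Toy tf-idf sketch with made-up counts (not the show data):
# "meow" appears only for speaker B, so idf = log(2/1) > 0 and tf-idf > 0,
# while "love" is used by both speakers, so idf = log(2/2) = 0.
toy <- tibble::tribble(
  ~speaker, ~word,  ~n,
  "A",      "love",  5,
  "B",      "love",  4,
  "B",      "meow",  3
)
toy %>% bind_tf_idf(term = word, document = speaker, n = n)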

a <- data %>% 
  group_by(actor) %>% 
  unnest_tokens(word, text) %>% 
  ungroup() %>% 
  mutate(word = lemmatize_words(word)) %>% 
  count(actor, word, sort = T) %>% 
  ungroup() %>% 
  bind_tf_idf(term = word, document = actor, n = n) %>% 
  group_by(actor) %>% 
  top_n(7, wt = tf_idf) %>%
  ungroup() %>% 
  mutate(word = reorder_within(word, tf_idf, actor)) %>% 
  ggplot(aes(word, tf_idf, fill = actor)) +
  geom_col(show.legend = F) +
  labs(title = "Typical words used by characters",
       x = NULL, y = NULL) +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~actor, scales = 'free') +
  theme_light()
a

Bing Lexicon

I did this to compare Ted with the other characters as well. We can see that all of the main cast lean more negative, with Ted contributing the most.

bing <- get_sentiments("bing")

tokensbing <- tokens_tidy %>% inner_join(bing)

tokensbing %>% 
  ggplot(aes(sentiment, fill = actor))+
  geom_bar(show.legend = FALSE)+
  facet_wrap(actor~.)+
  theme_light()+
  theme(
    strip.text = element_text(),
    plot.title = element_text(hjust = 0.5, size = 20)
  )+
  labs(fill = NULL, x = NULL, y = "Sentiment Frequency", title = "Sentiments of each character using the Bing lexicon")

This is an excellent graph for gauging Ted's sentiment over the seasons. There was a huge drop in positivity right after season 1 and a further drop after season 4. He became slightly more positive in season 5, but this changed again in the 6th season.

##Over seasons analysis of Ted 
tidy_afinn <- tokens_tidy  %>% inner_join(afinn)
tidy_afinn %>% 
  filter(actor %in%  "Ted")  %>% 
  group_by(season, actor) %>% 
  summarise(total = sum(value), .groups = 'drop') %>% 
  ungroup() %>% 
  mutate(Neg = if_else(total < 0, TRUE, FALSE)) %>% 
  ggplot()+
  geom_path(aes(season, total, color = actor), size = 1.2)+
  theme_light()+
  theme(legend.position = "bottom")+
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))+
  labs(x = "Season", color = NULL, y = "Total Sentiment Score")

This one, however, is not as straightforward to read. Still, compared to the first 3 seasons it shows that Ted became relatively more negative and his fear also grew, which is again in line with my hypothesis.

nrc <- get_sentiments("nrc")

tokensnrc <- tokens_tidy %>% inner_join(nrc)  

tokensnrc %>% 
  filter(actor %in%  "Ted")  %>% 
  group_by(season) %>% 
  ggplot(aes(sentiment, fill = sentiment))+
  geom_bar(show.legend = TRUE)+
  facet_wrap(~season)+
  theme_light()+
  theme(
    strip.text = element_text(),
    plot.title = element_text(hjust = 0.5, size = 20)
  )+
  labs(fill = NULL, x = NULL, y = "Sentiment Frequency", title = "Sentiments of Ted")

Another interesting analysis is to see whether any character is similar to Ted, and with what probability they would be assigned to the same cluster. We can see that every character is completely different and they do not overlap.

top_words <- data %>% 
  group_by(actor) %>% 
  unnest_tokens(word, text) %>% 
  ungroup() %>% 
  select(actor, word) %>% 
  mutate(word = lemmatize_words(word)) %>% 
  anti_join(stop_words) %>% 
  count(actor, word, sort = T) %>% 
  ungroup()
top_words_dtm <- top_words %>% cast_dtm(actor, word, n)
top_words_dtm_lda <- top_words_dtm %>% LDA(k = 5, control = list(seed = 8080))
top_words_dtm_lda_gammas <- tidy(top_words_dtm_lda, matrix = 'gamma')
top_words_dtm_lda_gammas %>%  
  rename('actor' = 'document') %>% 
  mutate(topic = as.factor(topic)) %>% 
  ggplot(aes(topic, gamma, fill = actor)) + 
  geom_point(show.legend = F) +
  facet_wrap(~actor, scales = 'free') + 
  labs(title = "Really significant differences in vocabulary used by characters",
       x = '5 topics (clusters) from LDA algo',
       y = '% of being assigned to one cluster') +
  theme_light()
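
As an optional follow-up to the gamma plot above, one could also look at the word-topic (beta) matrix to see which words define each of the five topics; below is a minimal sketch that reuses the top_words_dtm_lda object created in the chunk above.

# Sketch: top 5 defining words per LDA topic, from the word-topic (beta) matrix
tidy(top_words_dtm_lda, matrix = "beta") %>% 
  group_by(topic) %>% 
  slice_max(beta, n = 5) %>% 
  ungroup() %>% 
  arrange(topic, desc(beta))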

# Word Clouds

Finally, I will make word clouds for Ted for every season. This might also give some insight into the development of Ted's character. One thing that stays common across all the word clouds is that he talks about love, his ultimate goal, although in season 6 he talks about it less frequently.

wordcloud1 <- tokens_tidy %>%
  filter(actor %in%  "Ted")%>%
  filter(season %in%  "1")%>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, main="Season 1"))

wordcloud2 <- tokens_tidy %>%
  filter(actor %in%  "Ted")%>%
  filter(season %in%  "2")%>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, main="Season 2"))

wordcloud3 <- tokens_tidy %>%
  filter(actor %in%  "Ted")%>%
  filter(season %in%  "3")%>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, main="Season 3"))

wordcloud4 <- tokens_tidy %>%
  filter(actor %in%  "Ted")%>%
  filter(season %in%  "4")%>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, main="Season 4"))

wordcloud5 <- tokens_tidy %>%
  filter(actor %in%  "Ted")%>%
  filter(season %in%  "5")%>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, main="Season 5"))

wordcloud6 <- tokens_tidy %>%
  filter(actor %in%  "Ted")%>%
  filter(season %in%  "6")%>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, main="Season 6"))
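
The six chunks above differ only in the season being filtered, so a more compact alternative (just a sketch, producing the same clouds from the same tokens_tidy data) would be to loop over the seasons.

library(dplyr)
library(tidytext)
library(wordcloud)

# Same six word clouds for Ted, generated in a loop over seasons 1 to 6
for (s in 1:6) {
  tokens_tidy %>%
    filter(actor == "Ted", season == s) %>%
    anti_join(stop_words, by = "word") %>%
    count(word) %>%
    with(wordcloud(word, n, max.words = 100))
}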

Conclusion

This was a pretty fun exercise in which I was able to uncover a lot of information about Ted and the other cast members. Overall, the data do not give me enough evidence to reject my hypothesis that he became more negative and pessimistic; in fact, a lot of the metrics and words suggest that he actually did. We were also able to see how powerful, and at the same time user friendly, these text analysis tools are.