Introduction


Network analysis is a collection of techniques for describing the connections between units of study. A network is made up of nodes and the edges that connect them. If we apply this framework to text data, the result is a text network, which lets us read off the relationships between the entities that appear in the text. Since this technique can detect relationships between co-occurring words, in this work I will use it to analyze a very popular show: How I Met Your Mother.

The series centers on the romantic experiences of Ted Mosby, a 27-year-old architect residing in New York City. Finding the right person and starting a family are the protagonist’s top priorities in life. The story revolves mainly around his group of closest friends: Marshall Eriksen and his college sweetheart and now fiancée, Lily Aldrin; Barney Stinson, an eternal bachelor and ladies’ man; and Canadian news reporter Robin Scherbatsky, with whom Ted falls in love. The lives of the characters are intertwined. The show follows a variety of plot threads, such as the “will they or won’t they” relationship between Robin and Ted, the ups and downs of the characters’ jobs, and Ted’s struggles with finding true love.

My paper focuses on creating a text network that lets us see the most important elements of the story and discover the relationships between characters. More specifically, it should uncover the connections within the group of friends (Ted, Lily, Marshall, Robin and Barney), the romantic relationship between Lily and Marshall, the most important facts about the main character, Ted (for instance his job as an architect and his main desire in life), and the place where they live, New York. As the plot is very complex, the sky is the limit. However, the analysis will be considered a success if the text network recovers the most important of these facts.

Data


The data was downloaded from https://genius.com/, a website that publishes transcripts of popular TV shows. The data set includes 2 seasons (44 episodes), approximately 20 hours of the show in the form of transcripts. It includes the characters’ dialogue and the narrator’s comments. The untransformed text looks as follows:


Lines
Lily: Gosh, you are unbelievable, Marshall. No, you open it….
Screen splits with Lily and Marshall arguing on top while Yasmin and Ted are still talking at the bar on the bottom of the screen. Future Ted, the narrator, is narrating.
Narrator [during split screen]: There are two big questions a man has to ask in life. One, you plan out for months; the other just slips out when you’re half drunk at some bar.
Marshall [to Lily]: Will you marry me?
Ted [to Yasmin]: D’you want to go out sometime?
Split screen ends. Scene returns to the kitchen with Lily and Marshall.

As we can see, the text is not ready for analysis. Because it is everyday spoken language, it contains a lot of special characters, information about the setting, comments, filler expressions and more. For cleaning and modeling purposes, the following libraries were used:

library(tm)
library(tidytext)
library(plotly)
library(wordcloud)
library(dplyr)
library(tidyr)
library(igraph)
library(glue)
library(networkD3)
library(magrittr)
library(purrr)
library(ngram)
library(cowplot)

In order to clean the text, the following steps were applied:

  1. English contractions such as “n’t”, “’s”, “’ll”, etc. were removed.
  2. Filler words and titles that carry no meaning in everyday conversation were removed, for instance: “ughh”, “Uh”, “Mr”.
  3. Phrases that the narrator repeats in every episode and that have no significance for the story were removed: “Ted from 2030: Previously on How I Met Your Mother…”.
  4. Special characters were removed.
  5. Information about the setting in square brackets was removed.
  6. Information about the protagonists in round brackets was removed.
  7. Text was transformed to lower case.
  8. Numbers and stopwords were removed.
  9. Words were stemmed.
  10. Extra white space was removed.

After cleaning, the same lines look as follows:

Lines
lili gosh unbeliev marshal open
screen split lili marshal argu top yasmin ted still talk bar bottom screen futur ted narrat narrat
two big question man ask life one plan month slip half drunk bar
marshal will marri
ted want go sometim
split screen end scene return kitchen lili marshal
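
The cleaning steps listed above could be sketched roughly as follows. This is only an illustration under assumptions, not the exact code behind the table: the helper name clean_lines, the regular expressions and the order of the steps are my own guesses, stemDocument() additionally requires the SnowballC package, and a recent version of tm is assumed so that stemDocument() stems each word of a character string.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

# Hypothetical helper: the patterns and step order are illustrative assumptions.
clean_lines <- function(x) {
  x <- gsub("\\[[^]]*\\]", " ", x)             # drop notes in [square brackets]
  x <- gsub("\\([^)]*\\)", " ", x)             # drop notes in (round brackets)
  x <- gsub("\u2019", "'", x, fixed = TRUE)    # normalise curly apostrophes
  x <- tolower(x)
  # drop the recurring narrator phrase
  x <- gsub("ted from 2030: previously on how i met your mother", " ", x, fixed = TRUE)
  x <- gsub("n't|'s|'ll", " ", x)              # strip common contractions
  x <- gsub("[^a-z ]", " ", x)                 # keep letters and spaces only (drops numbers too)
  x <- removeWords(x, c(stopwords("english"), "ughh", "uh", "mr"))
  x <- stemDocument(x)                         # Porter stemming of every word
  trimws(stripWhitespace(x))                   # collapse and trim white space
}

raw <- c("Marshall [to Lily]: Will you marry me?",
         "Ted [to Yasmin]: D\u2019you want to go out sometime?")
cleaned <- clean_lines(raw)
print(cleaned)
```

The exact tokens that survive depend on the stopword list, so the output may differ slightly from the table above.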

The text is now ready for processing. Let’s first look at its most important single words and then at co-occurring words (bigrams, trigrams, etc.).

Unsurprisingly, the names of the main characters have the highest counts in the text. In order to gain more insight into the story, a word cloud was created and plotted below:

Apart from the main protagonists’ names, we can see words like “want”, “guy”, “get”, “call”, “love”, “man”, “night”, “Victoria”, etc. Since the story is mainly focused on Ted trying to find love, and he meets Victoria on his way to happiness, it makes sense for these words to be included in the cloud. However, it is very difficult to tell what the story is about just by looking at this visualization.
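
As a rough illustration of the word counts behind such a cloud, the sketch below counts single words in a few cleaned lines using base R only. The toy input is taken from the cleaned sample shown earlier; the commented-out wordcloud() call is an assumption about how the cloud could be drawn, not necessarily how it was drawn here.

```r
# Toy input: three cleaned lines from the sample above
cleaned <- c("lili gosh unbeliev marshal open",
             "marshal will marri",
             "ted want go sometim")

# Split every line on spaces and count how often each word occurs
words <- unlist(strsplit(cleaned, " ", fixed = TRUE))
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 5))

# Possible plotting call (assumes the wordcloud package is installed):
# wordcloud::wordcloud(names(word_counts), as.integer(word_counts), min.freq = 1)
```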

In order to see the associations between characters and/or threads in the story, it is useful to look at bigrams. They show co-occurrences of words, which enables us to see which elements are related to each other.

word associated_word weight
lili marshal 358
marshal lili 265
get marri 196
ted robin 166
ted mother 156
ted marshal 147
first time 142
let go 138
robin ted 134
ted father 132
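
A bigram table of this shape could be produced along the following lines. This is a minimal sketch that assumes the cleaned lines sit in a data frame column called text (a name I chose for illustration); it uses unnest_tokens() from tidytext with token = "ngrams", and bigrams never span two lines in this version.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy input; the column name `text` is an assumption
lines <- tibble(text = c("lili marshal open",
                         "ted robin talk bar",
                         "lili marshal argu"))

bigrams <- lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # adjacent word pairs
  filter(!is.na(bigram)) %>%                                # lines shorter than 2 words give NA
  separate(bigram, into = c("word", "associated_word"), sep = " ") %>%
  count(word, associated_word, name = "weight", sort = TRUE)

print(bigrams)
```

Because the pairs are ordered, “lili marshal” and “marshal lili” are counted separately, as in the table above.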

This table captures the essence of the story surprisingly well. One of the main threads in the series is Marshall (Ted’s best friend) and Lily (Marshall’s first girlfriend ever) getting married. Ted is looking for love and a mother for his kids; he falls for Robin, but she rejects him and he needs to let go. This is the first two seasons in a nutshell, and the table shows exactly that, which is amazing!

By plotting how many bigrams have each weight (where the weight is the number of times the bigram occurs in the corpus), we can see which weights are most typical and get an idea of what threshold to set. As the distribution is very skewed, the weights were transformed for visualization purposes as follows: log(weight + 1).

As we can see here, the most frequent weight is one. The most frequent bigrams are so few that their bars cannot be made out in this chart. Apart from this, it is clear that there are many bigrams with low weights and that the number of bigrams falls off very quickly as the weight grows. This will probably be good for our analysis, since the many non-relevant bigrams should be easy to remove with a threshold.
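
To make the transformation concrete, here is a small base R sketch that tabulates how many bigrams share each weight and adds the log(weight + 1) value used for plotting. The weights are toy values, not the real distribution.

```r
# Toy bigram weights (illustrative only)
weights <- c(1, 1, 1, 1, 2, 2, 3, 80, 358)

# Number of bigrams per distinct weight
weight_counts <- as.data.frame(table(weight = weights), stringsAsFactors = FALSE)
weight_counts$weight <- as.numeric(weight_counts$weight)

# Transformation applied for visualization: log(weight + 1)
weight_counts$log_weight <- log(weight_counts$weight + 1)
print(weight_counts)
```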

In order to get the ‘essence’ of the story, the threshold is set to 80: every bigram with a lower count is excluded from the weighted network, so that only the relevant elements remain.
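
The thresholding step and the construction of the weighted network might look roughly like this with igraph. The toy table and the choice of a directed graph are assumptions for illustration; graph_from_data_frame() treats the first two columns as edge endpoints and stores the remaining columns as edge attributes.

```r
library(igraph)

# Toy bigram table in the same shape as the one above
bigrams <- data.frame(
  word            = c("lili", "marshal", "get", "ted", "um"),
  associated_word = c("marshal", "lili", "marri", "robin", "yeah"),
  weight          = c(358, 265, 196, 166, 3)
)

threshold <- 80
edges <- bigrams[bigrams$weight >= threshold, ]   # drop bigrams below the threshold

# Directed graph is an assumption; `weight` becomes an edge attribute
g <- graph_from_data_frame(edges, directed = TRUE)
print(g)
```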

This network presents the story in a decent way. However, it is not very user-friendly, since it is difficult to tell which nodes and edges are important or which belong together.

By scaling the sizes of the nodes and the widths of the edges, additional information becomes available in the visualization.
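
One possible way to encode this information is sketched below: node size follows the weighted degree (strength) and edge width follows the bigram weight. The scaling constants are arbitrary choices of mine, and the edge list is a toy example.

```r
library(igraph)

edges <- data.frame(
  from   = c("ted", "ted", "lili"),
  to     = c("robin", "marshal", "marshal"),
  weight = c(166, 147, 358)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Node size grows with strength (sum of incident edge weights)
node_strength <- strength(g)
V(g)$size <- 5 + 20 * node_strength / max(node_strength)

# Edge width grows with the bigram weight
E(g)$width <- 1 + 4 * E(g)$weight / max(E(g)$weight)

if (interactive()) plot(g, vertex.label.cex = 0.8)
```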

Again, the network conveys the story to some extent, but the visualization could be more informative: because of its circular layout, it is difficult to tell exactly which elements are grouped together.

In order to gain more insight, only the largest connected component of the network will be kept.
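
Extracting the largest connected component can be sketched with igraph as follows. The edge list is a toy example in which the pair “first time” is disconnected from the main group and is therefore dropped.

```r
library(igraph)

edges <- data.frame(
  from = c("ted", "ted", "lili", "first"),
  to   = c("robin", "marshal", "marshal", "time")
)
g <- graph_from_data_frame(edges, directed = FALSE)

comp <- components(g)                 # membership and size of each component
largest <- which.max(comp$csize)      # index of the biggest component
g_lcc <- induced_subgraph(g, which(comp$membership == largest))

print(V(g_lcc)$name)
```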

This plot is much better and easier to read. Not only are the node sizes presented in a straightforward way, but the way the nodes are connected is also easier to follow. Let’s make it interactive so that it is easier to navigate.

We can now see the story of the first two seasons of the show presented in a very informative way. The most important elements of the network are Ted, Marshall and Lily, which makes a lot of sense. The attributes of Ted, the main character, are: future, robin, yes, barney, father, mother, go get married, let go, marshal, etc. That is very accurate, as it describes his main objectives and events in life. It would be slightly better if his job, architect, were added to this list. We can also see that Marshall and Lily have a strong connection to their last names, which is another success. Marshall’s attributes are mostly associated with the fact that Lily was his first girlfriend. Lily’s attributes, in turn, indicate that the protagonists are living a very adventurous life in New York. All in all, this is a decent presentation of the story. However, some of the attributes are not very relevant and could be omitted, while some others that should be included are missing.

Let’s now look at the network created from skip-grams. The technique is very similar to the previous one, with one difference: the two words of a pair may be separated by one skipped word.

word1 word2 weight
marshal lili 555
lili marshal 519
ted robin 263
ted marshal 253
ted go 241
robin ted 233
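
To make the idea of skip-grams concrete, here is a small hand-rolled base R sketch. It is not necessarily the implementation used for the table above: the helper skip_bigrams and its parameter k (the maximum number of skipped words) are hypothetical names, and each word is paired with the next word and with the word after it when k = 1.

```r
# Hypothetical helper: counts ordered word pairs that are at most k words apart
skip_bigrams <- function(lines, k = 1) {
  pairs <- lapply(strsplit(lines, " ", fixed = TRUE), function(tokens) {
    n <- length(tokens)
    if (n < 2) return(NULL)
    do.call(rbind, lapply(seq_len(n - 1), function(i) {
      j <- (i + 1):min(i + 1 + k, n)   # the next word plus up to k words after it
      data.frame(word1 = tokens[i], word2 = tokens[j], stringsAsFactors = FALSE)
    }))
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))
  pairs$weight <- 1
  counts <- aggregate(weight ~ word1 + word2, data = pairs, FUN = sum)
  counts[order(-counts$weight), ]
}

skipgrams <- skip_bigrams(c("lili love marshal", "lili marshal"), k = 1)
print(skipgrams)
```

With k = 0 the function reduces to ordinary bigrams, which is why skip-gram weights are always at least as large as the corresponding bigram weights.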

The skip-gram table did not change much. The network, on the other hand, grew significantly and now includes more information. Essentially, it presents the same story with some small additions: we can now see that Ted wants a kid, and that Marshall and Lily have a “get back” attribute, which indicates that they broke up and Marshall wanted her back. Marshall is also linked to personal growth and evolution at the beginning of the 21st century. Again, this is a decent presentation of the story, which could be perfect with some additions.

Let’s now investigate 3 measures: degree, closeness and betweenness:

word degree closeness betweenness
ted 4271 9.09e-05 512
lili 3539 9.76e-05 435
marshal 3477 9.63e-05 425

Only three words are shown here, as all of the measures are significantly larger for them than for the other words. It comes as no surprise that Ted, Lily and Marshall are the top 3 words on every measure, as they are the most important characters in the story. Degree indicates the number of connections a word has in the network: the bigger the value, the more ‘attributes’ the object has. Closeness tells us about a node’s distance to the others in the network. Compared to less central nodes, more central nodes can reach the other nodes more rapidly and readily: they do not need to “travel” as far along paths, so their total distance to the rest of the network is small. Under the usual definition, in which closeness is the reciprocal of that total distance, a higher score therefore means a more central node. Betweenness gauges how frequently a node lies on the shortest path between two other nodes. Nodes with high betweenness centrality scores are frequently seen as information keepers.
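
The three measures can be computed with igraph as in the sketch below. The graph is a small unweighted toy example, chosen because igraph treats a weight attribute as a distance when computing closeness and betweenness; how the weights were handled for the table above is not stated, so this is only an illustration of the measures themselves.

```r
library(igraph)

# Toy, unweighted friendship network (illustrative only)
edges <- data.frame(
  from = c("ted", "ted", "ted", "lili"),
  to   = c("robin", "barney", "marshal", "marshal")
)
g <- graph_from_data_frame(edges, directed = FALSE)

centrality <- data.frame(
  word        = V(g)$name,
  degree      = degree(g),        # number of connections
  closeness   = closeness(g),     # 1 / sum of shortest-path distances
  betweenness = betweenness(g)    # number of shortest paths passing through the node
)
centrality <- centrality[order(-centrality$degree), ]
print(centrality)
```

In this toy graph “ted” is connected to three of the four other nodes, so he comes out on top on all three measures.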

As we can see here, Ted is the top scorer on degree and betweenness, which means that he:

  1. is connected to the largest number of attributes,
  2. is the keeper of the largest amount of information.

His closeness value is slightly lower than Lily’s and Marshall’s, so if closeness is read as the reciprocal of total distance, he is marginally farther from the rest of the network than they are, although all three are far more central than any other word. These results make a lot of sense, since the show is mainly centered around Ted and his life. This is a good indicator that the network captures the angle from which the story is being told.

In order to further improve the network, nodes will be assigned to clusters so that it is clearer which parts of the story are closely tied together. For this purpose, the Louvain method, a technique for finding communities in large networks, is used together with correlation analysis. The method searches for the partition that maximizes the modularity score, which measures how well the nodes are assigned to communities: it compares how densely the nodes inside each community are connected with how densely they would be connected in a random network.

## IGRAPH clustering multi level, groups: 5, mod: 0.41
## + groups:
##   $`1`
##   [1] "marshal"    "lili"       "eriksen"    "easi"       "realli"    
##   [6] "aldrin"     "dude"       "hasselhoff" "skywalk"   
##   
##   $`2`
##    [1] "ted"    "robin"  "barney" "futur"  "mother" "want"   "father" "guy"   
##    [9] "think"  "yes"    "gonna"  "one"    "good"   "thank"  "call"   "got"   
##   [17] "kid"    "love"   "um"    
##   
##   $`3`
##   + ... omitted several groups/vertices
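
Output of this kind is what igraph’s cluster_louvain() prints. The sketch below runs it on a toy weighted network made of two tightly knit triangles joined by a single weak edge; the edge list and the seed are assumptions for illustration, and the function requires an undirected graph.

```r
library(igraph)

# Two triangles joined by one weak edge (toy data)
edges <- data.frame(
  from   = c("lili", "lili", "marshal", "ted", "ted", "robin", "marshal"),
  to     = c("marshal", "eriksen", "eriksen", "robin", "barney", "barney", "ted"),
  weight = c(10, 5, 5, 8, 6, 4, 1)
)
g <- graph_from_data_frame(edges, directed = FALSE)

set.seed(1)                            # the algorithm may visit nodes in random order
communities <- cluster_louvain(g)      # uses the `weight` edge attribute by default

groups <- membership(communities)      # community id per node
print(groups)
print(modularity(communities))         # quality of the partition found
```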

With this plot we can figure out what the main threads of the story are. One of them, the orange one, is clearly focused on love life: get married, let go, get back. The green one is all about living an eventful life in New York. The darker blue one uncovers the relationship between Lily Aldrin and Marshall Eriksen, Ted’s best friends, who are about to get married. The yellow cluster provides additional information about their relationship: Lily is Marshall’s first girlfriend ever. The light blue group is a good summary of what is on the main character’s mind: finding love, a future mother for his kids, and his group of friends.

Now, let’s plot separate networks for all of our main characters:

## [1] "marshal, lili, eriksen, easi, realli, aldrin, dude, hasselhoff, skywalk"                        
## [2] "ted, robin, barney, futur, mother, want, father, guy, think, yes, gonna, one, good, thank, call"
## [3] "get, let, go, marri, back"                                                                      
## [4] "first, time, ladi, man, ever, gentlemen"                                                        
## [5] "last, whole, new, kind, way, evolv, share, stcenturi, total, night, york, littl"
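
One way to obtain a separate network for each character is to take the character’s ego network, that is, the node together with its direct neighbours. The sketch below uses igraph’s make_ego_graph() on a toy edge list; whether the per-character plots were produced this way is an assumption on my part.

```r
library(igraph)

# Toy edge list (illustrative only)
edges <- data.frame(
  from = c("ted", "ted", "marshal", "marshal", "lili"),
  to   = c("robin", "barney", "lili", "law", "paint")
)
g <- graph_from_data_frame(edges, directed = FALSE)

characters <- c("ted", "marshal")

# One subgraph per character: the node plus everything one step away
ego_nets <- make_ego_graph(g, order = 1, nodes = characters)
names(ego_nets) <- characters

print(V(ego_nets$marshal)$name)
```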

Ted’s network is accurate but very limited. Surprisingly, it does not say anything about his love life. While there is room for improvement here, Marshall’s network is actually very good: it shows that he cares a lot about his law degree, that he is closely related to Lily, and it even mentions his old car, the Fiero. Lily’s network is also close to perfect: it says a lot about her career, as she is a kindergarten teacher who wants to be a painter and moves to San Francisco for this very reason. Robin’s attributes are likewise closely tied to her life: she is a reporter at Metro News One, used to be a teenage pop star in Canada known as Robin Sparkles, and is considered to be Ted’s future girlfriend. Last but not least is Barney, the heartbreaker and bachelor who only cares about picking up girls, money, suits and his legendary life. All in all, this part of the analysis is very successful for all of the characters except Ted.

Evaluation


The analysis presents the main threads of How I Met Your Mother successfully. By looking at the tables, plots and networks, one can get an idea of who the main characters are and what their life goals, important events and relationships are, which can definitely be considered a job well done. While some parts of the research could still be more accurate, the overall results are more than satisfactory and the story is well represented by the text network.


Summary


In my paper, I concentrated on building a text network that helps in understanding the key plot points and character interactions in the How I Met Your Mother series. More specifically, it revealed the relationships within the group of friends (Ted, Lily, Marshall, Robin and Barney), as well as the romance between Lily and Marshall. It also revealed the most crucial information about the main character, Ted, such as his greatest aspiration in life, which is finding love and starting a family. Sadly, it failed to detect his occupation and passion, architecture.

Conclusions

Based on the evidence presented above, the analysis can be concluded with the following findings:

  1. The analysis accurately reveals who the main characters are.
  2. The text network presents the most important relationships: the friendship of Ted, Marshall, Lily, Barney and Robin.
  3. It was found that Marshall and his first girlfriend ever wanted to get married.
  4. Ted’s failed romantic plans for Robin were shown.
  5. The text analysis reveals the location: New York City.
  6. The main characters’ occupations (except Ted’s) are presented.
  7. The network reveals some additional information about the characters and their attributes.

Possible improvements:

  1. Adding data from other seasons.
  2. Using more advanced techniques.
  3. Different approach to removing parts of the text.