Text Analysis: Friends

Introduction


As a quick reminder, Friends is an iconic American television series from the 1990s and early 2000s. It follows the adventures of six friends living in New York, through their loves, careers and family relationships. With engaging characters and storylines that are both funny and moving, Friends has become a cult series that has captivated millions of viewers around the world. The show has also been praised for its depiction of friendship and adulthood, and continues to have a significant cultural impact today.

The goal of this project is to use scraping techniques to recover the important elements of every script of the series. The resulting dataframes may contain some errors and may not be exhaustive. The essential elements in building this dataset were the names of the speakers and their lines of dialogue, which are needed to carry out a text analysis.
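As a minimal sketch of the kind of parsing this involves, the snippet below turns raw script text into rows of scene, speaker and line. It assumes a transcript layout of `Speaker: dialogue` lines with scene headings in square brackets. The real page layout and the project's scraping code are not shown here, so `parse_script` and both regular expressions are hypothetical.

```python
import re

# Assumed layout: "Speaker: dialogue" lines and "[Scene: place, ...]" headings.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Za-z .'&]*?):\s*(.+)$")
SCENE_LINE = re.compile(r"^\[Scene:\s*([^,.\]]+)")

def parse_script(text):
    """Return a list of {"scene", "speaker", "line"} rows from raw script text."""
    rows, scene = [], None
    for raw in text.splitlines():
        raw = raw.strip()
        scene_match = SCENE_LINE.match(raw)
        if scene_match:
            # Remember the current place for the dialogue lines that follow.
            scene = scene_match.group(1).strip()
            continue
        speaker_match = SPEAKER_LINE.match(raw)
        if speaker_match:
            speaker, line = speaker_match.groups()
            # Drop stage directions in parentheses, e.g. "(smiling)".
            line = re.sub(r"\([^)]*\)", "", line).strip()
            rows.append({"scene": scene, "speaker": speaker.strip(), "line": line})
    return rows

# Illustrative sample text in the assumed layout.
sample = """[Scene: Central Perk, everyone is there.]
Monica: There's nothing to tell!
Joey: (smiling) C'mon, you're going out with the guy!
"""
rows = parse_script(sample)
```

Rows in this shape can then be loaded into a dataframe with one column per key.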

Minor Characters


In this part we focus on the secondary characters of Friends, leaving out the six main characters: Chandler, Joey, Monica, Phoebe, Rachel and Ross. The analysis relies mainly on the dataframe of single words rather than the one of bigrams. Below are the top 10 secondary characters by number of words spoken.


Speaker    Occurrences

Mike       878
Charlie    659
Carol      426
David      423
Erica      360
Janice     329
Tag        295
Emily      288
Paul       277
Frank      262


For more detail, we also look at the top 10 secondary characters by number of unique words.


Speaker    Occurrences

Mike       410
Charlie    364
David      245
Carol      235
Erica      218
Tag        192
Janice     188
Richard    176
Emily      172
Paul       170
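The two rankings above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the data is available as (speaker, line) pairs and uses a simple regex tokenizer, whereas the project's own normalization is not described in detail. `tokenize` and `word_stats` are hypothetical helpers.

```python
import re
from collections import Counter, defaultdict

MAIN_CHARACTERS = {"chandler", "joey", "monica", "phoebe", "rachel", "ross"}

def tokenize(line):
    """Lowercase a line and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())

def word_stats(rows):
    """Return ({speaker: total_words}, {speaker: unique_words}) for minor characters."""
    totals, vocab = Counter(), defaultdict(set)
    for speaker, line in rows:
        if speaker.lower() in MAIN_CHARACTERS:
            continue  # keep secondary characters only
        tokens = tokenize(line)
        totals[speaker] += len(tokens)
        vocab[speaker].update(tokens)
    return dict(totals), {s: len(words) for s, words in vocab.items()}

# Illustrative rows only, not taken from the dataset.
rows = [
    ("Mike", "I play the piano"),
    ("Mike", "I love the piano"),
    ("Ross", "We were on a break"),
    ("Janice", "Oh my god"),
]
totals, uniques = word_stats(rows)
```

Sorting either dictionary by value in descending order and keeping the first ten entries gives a ranking of the same form as the tables.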


The first two characters, Mike and Charlie, keep their positions, while David overtakes Carol for third place. Overall, the characters have a fairly small unique vocabulary, a sign of many repetitions. Note that Richard enters the ranking at the expense of Frank.

To go further, we show the top 10 words for the secondary characters.


minor_characters_top_10_words

These words are not very meaningful: they are mostly interjections or very short words. To support the analysis a little more, we build a word cloud that excludes the following words: ‘oh’, ‘yeah’, ‘okay’, ‘hey’, ‘ok’, ‘hi’.

Given that the series is a short-format show (a sitcom) and that situational comedy is very present in the Friends universe, it is not surprising that interjections are the most frequently spoken words.
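The filtering step can be sketched as below, assuming the words are available as a flat list of lowercase tokens. `top_words` is a hypothetical helper, and the sample tokens are made up for illustration.

```python
from collections import Counter

# Exclusion list quoted in the text above.
EXCLUDED = {"oh", "yeah", "okay", "hey", "ok", "hi"}

def top_words(tokens, n=10, excluded=EXCLUDED):
    """Return the n most frequent tokens after dropping the excluded ones."""
    counts = Counter(t for t in tokens if t not in excluded)
    return counts.most_common(n)

# Illustrative tokens only, not taken from the dataset.
tokens = ["oh", "know", "oh", "well", "know", "hey", "go", "know", "well"]
result = top_words(tokens, n=2)
```

The filtered frequencies are also the natural input for a word cloud; the `wordcloud` package, for instance, can build one from a word-to-frequency dictionary.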


word_cloud_minor_characters

The words that stand out the most are very simple and basic (‘know’, ‘well’, ‘go’, ‘think’), along with the names of the main characters.

We will not go further in the analysis of secondary characters; instead we will focus on the main characters through bigrams, that is, combinations of two words. Before that, we take a quick look at the places.

Places

Overview of the places in Friends

This section focuses on the different places in the series, in an attempt to find patterns or other interesting information. Although the data cleaning process was long and tedious (starting from more than 700 different locations), a few places have emerged as essential to the series! Scenes ping-pong between the same locations, so despite the variety of places, we often find the same ones. The top 10 locations that appear the most in the series are as follows:

Scene_place                         count_overall

Central Perk                        481
Monica and Rachel's apartment       421
Monica and Chandler's bedroom       175
Chandler and Joey's apartment       154
Joey and Rachel's                   85
Ross's apartment                    65
Monica's                            58
A Restaurant                        27
Joey's apartment                    27
The Hallway                         22
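Much of the cleaning mentioned above consists of merging spelling and formatting variants of the same place before counting. The following is a minimal sketch of that idea. The specific rules in `normalize_place` are assumptions chosen for illustration, not the project's actual cleaning rules.

```python
import re
from collections import Counter

def normalize_place(raw):
    """Collapse case, spacing, apostrophe and spelling variants of a scene location."""
    place = raw.strip().lower()
    place = place.replace("\u2019", "'")                   # curly to straight apostrophe
    place = re.sub(r"\s+", " ", place)                     # collapse repeated spaces
    place = re.sub(r"\bapartement\b", "apartment", place)  # common misspelling
    place = re.sub(r"^the ", "", place)                    # drop a leading article
    return place

# Illustrative raw labels only, not taken from the dataset.
raw_places = [
    "Central Perk",
    "central perk ",
    "The Central Perk",
    "Monica and Rachel's apartement",
    "Monica and Rachel\u2019s  apartment",
]
counts = Counter(normalize_place(p) for p in raw_places)
```

Here five raw labels collapse into two places. Each extra rule of this kind reduces the number of locations that appear only once.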


The first four places are overrepresented and well ahead of the others. Note that many of the places appearing only once are artifacts of the scraping and of incomplete cleaning; with the necessary data cleaning, some of them could well be part of the top 30. The counts therefore carry a small margin of error.

What is quite surprising in this top 10 is how often Monica’s and Joey’s homes appear (3 times out of 10 each), and yet (!! SPOILER ALERT !!) they are not the characters we see the most in the series.


Top 5 places per season

We now look more closely at the distribution of places by season, to see whether any striking patterns emerge.

top5s1 top5s2 top5s3 top5s4 top5s5 top5s6 top5s7 top5s8 top5s9 top5s10

On average, the top 3 places of each season are far ahead of the last two: there is a significant gap. “Central Perk” always ranks among the top two places, regardless of the season. This becomes even more evident as the series progresses, with Central Perk emerging as the number one location (e.g. in season 10).
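A per-season ranking of this kind can be sketched as follows, assuming each scene is available as a (season, place) pair. `top_places_per_season` is a hypothetical helper and the sample pairs are made up.

```python
from collections import Counter, defaultdict

def top_places_per_season(scenes, n=5):
    """Map each season to its n most frequent places, most frequent first."""
    per_season = defaultdict(Counter)
    for season, place in scenes:
        per_season[season][place] += 1
    return {season: counts.most_common(n) for season, counts in sorted(per_season.items())}

# Illustrative (season, place) pairs only, not taken from the dataset.
scenes = [
    (1, "Central Perk"), (1, "Central Perk"), (1, "Monica and Rachel's apartment"),
    (10, "Central Perk"), (10, "Joey's apartment"), (10, "Central Perk"),
]
top = top_places_per_season(scenes, n=1)
```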

Main Characters

First look at the data


The main characters of the series Friends are Chandler, Joey, Monica, Phoebe, Rachel and Ross. It is on them that we will now focus.

First, we look at who speaks the most words over the whole series.


Speaker    Occurrences

Rachel      19531
Ross        19194
Joey        18171
Chandler    17811
Monica      17434
Phoebe      15848


Rachel and Ross speak the most words in the series, with more than 19,000 words each, while Phoebe, in last position, speaks fewer than 16,000.

Not surprisingly, Rachel says the most words: as her character herself says, she talks a lot and sometimes relieves her stress by talking.

However, it is also interesting to see who uses the most unique words, that is, which character has the most diverse vocabulary in the series.


Speaker    Occurrences

Chandler    3216
Ross        3191
Joey        3004
Phoebe      2878
Monica      2859
Rachel      2809


On this measure, Chandler has the most varied vocabulary, while Rachel, the character who speaks the most words in the series, is also the main character with the least varied vocabulary, with about 2,800 unique words.

Again, this fits well with the character of Chandler, who sometimes plays the role of a sarcastic and erudite person.

Note that Ross, the second character with the most words, is also the second character with the most unique words.

Now it’s time to look at the most frequently spoken words in the series.


word_cloud_minor_characters

The top 20 words spoken by the main characters are not very useful for drawing conclusions.

The word “oh” is spoken more than 3,500 times, and the top 20 is mainly composed of simple words and the first names of the characters, reflecting the simple nature of the series.

As mentioned before, the format of the series and the types of characters encourage short answers and many reactions, so this is not very surprising.

Since this top 20 sheds little light on the words used, a list of words to remove from the analysis was created. It contains short expressions, used mainly in speech, as well as the names of the main characters.


['oh', 'yeah', 'okay', 'hey', 'ok', 'hi', 'ross', 'uh', 'joey', 'chandler', 'monica', 'got', 'phoebe', 'rachel', 'one', 'yes']


Here is a word cloud that excludes the words in this list:


This word cloud is once again made up of quite simple words, but it represents Friends well: a slice-of-life series with no specific theme other than the adventures of a group of friends.


Bigram


Since this first look at single words was inconclusive, we now turn to bigrams, combinations of two words that give us more context.


As for single words, a list of words to exclude was created to give a more meaningful result.

In addition, bigrams that repeat the same word twice were removed from the visualization (e.g. “Yeah Yeah”, “Well Well”).


['oh', 'yeah', 'hi', 'hey', 'okay', 'uh', 'huh', 'whoa']


The results here are more interesting: the bigram “let go” is spoken more than 80 times. It almost certainly corresponds to “Let’s go”, whose apostrophe and “s” were removed during the normalization of the dataset.
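The bigram construction described in this section can be sketched as follows. The sketch assumes each line has already been normalized into a list of lowercase tokens, and it drops any pair containing an excluded word; whether the project filtered before or after pairing is not stated, so that choice is an assumption. `bigram_counts` is a hypothetical helper.

```python
from collections import Counter

# Exclusion list quoted in the text above.
EXCLUDED = {"oh", "yeah", "hi", "hey", "okay", "uh", "huh", "whoa"}

def bigram_counts(lines, excluded=EXCLUDED):
    """Count adjacent word pairs per line, skipping excluded and repeated-word pairs."""
    counts = Counter()
    for tokens in lines:
        for first, second in zip(tokens, tokens[1:]):
            if first in excluded or second in excluded:
                continue
            if first == second:  # e.g. "well well"
                continue
            counts[(first, second)] += 1
    return counts

# Illustrative, already-normalized token lists; not taken from the dataset.
lines = [
    ["let", "go", "home"],
    ["oh", "let", "go"],
    ["well", "well", "look"],
]
counts = bigram_counts(lines)
```

Pairs are built within a single line only, so the last word of one line is never paired with the first word of the next.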


Sentiment Analysis


The bigram analysis showed the rather neutral character of the series, with frequent repetition of basic expressions. We now focus on the sentiment of the words in the scripts. For this, we created a new column in which the VADER algorithm assigns a coefficient to each bigram. This coefficient, between -1 and 1, makes it possible to label each bigram as positive, negative or neutral.
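The labeling step can be sketched as below. The cutoff of 0.05 on either side of zero is the conventional one for VADER's compound score, but the threshold actually used in this project is not stated, so treat it as an assumption. The scorer relies on the `vaderSentiment` package, which is imported lazily so that the labeling function can be used on its own; `label_from_compound` and `make_vader_scorer` are hypothetical helpers.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def make_vader_scorer():
    """Build a text -> compound score function (requires vaderSentiment installed)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]

# Usage, assuming vaderSentiment is installed:
#   score = make_vader_scorer()
#   label = label_from_compound(score("let go"))
```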


This bar chart shows how dominant neutrality is across all the bigrams of the main characters, which accentuates the universal character of the series. The low rates of negativity, and even of positivity, show a lack of divisive words; this may also help explain the success of a series that takes few risks with its text.

It is also interesting to look at this sentiment with more granularity, split by main character.


The result is quite striking: all the main characters show sentiment rates in the same proportions, namely about 60% neutral, 10% negative and 30% positive. In terms of sentiment, the main characters are written in the same way, and come across as quite smooth because the bigrams they use are overwhelmingly neutral.
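The per-character split amounts to computing, for each speaker, the share of bigrams carrying each label. A minimal sketch, assuming the data is available as (speaker, label) pairs; `sentiment_shares` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter, defaultdict

def sentiment_shares(rows):
    """Return {speaker: {label: share}} where shares sum to 1 for each speaker."""
    per_speaker = defaultdict(Counter)
    for speaker, label in rows:
        per_speaker[speaker][label] += 1
    shares = {}
    for speaker, counts in per_speaker.items():
        total = sum(counts.values())
        shares[speaker] = {label: count / total for label, count in counts.items()}
    return shares

# Illustrative labels only, not taken from the dataset.
rows = [
    ("Ross", "neutral"), ("Ross", "neutral"), ("Ross", "positive"), ("Ross", "negative"),
    ("Rachel", "neutral"), ("Rachel", "positive"),
]
shares = sentiment_shares(rows)
```

Comparing shares rather than raw counts is what makes the characters comparable, since they do not all speak the same number of bigrams.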


Enrichment


Following the text analysis detailed above, we decided to enrich our dataframe in order to explore other avenues of analysis. Using a dataframe listing the IMDb ratings of the episodes, we looked for possible correlations between an episode's rating and the number of words spoken by the main characters.


We observe two trends here. First, the episodes of season nine that exceed 1,000 words (the average being around 500 words) were less appreciated by the public. Conversely, season 10 episodes exceeding 1,000 spoken words have slightly higher ratings, although this judgment may be colored by the series coming to an end.
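One way to put a number on such a trend is a Pearson correlation coefficient between words per episode and rating, computed season by season since the trends differ between seasons. The sketch below is only an illustration of that calculation: the numbers are invented and are not the project's word counts or IMDb ratings, and `pearson` is a hypothetical helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative numbers only; NOT the project's word counts or IMDb ratings.
words_per_episode = [400, 500, 600, 1000, 1100]
ratings = [8.6, 8.5, 8.4, 8.0, 7.9]
r = pearson(words_per_episode, ratings)
```

A coefficient close to -1 means ratings fall as word counts rise, a value close to 1 means the opposite, and a value near 0 means no linear relationship.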