Insights for a new season of Friends


Introduction


In the context of a possible new season of Friends, our team has gathered as much information as possible about the series and, in particular, its writing. The objective of this report is to give a precise, quantified description of the writing of the show.


The database used for the following analysis is the result of scraping the scripts of more than 200 episodes of Friends. All of our data are stored on our own servers and are not public for the moment.


Our database consists of 5 datasets (titles, opening scenes, scene descriptions, didascalia and dialogue). A preview of the dialogue dataset is shown below:

speaker    episode   words
monica     0101      guy
monica     0101      work
joey       0101      guy
joey       0101      gotta
joey       0101      wrong
chandler   0101      joey
chandler   0101      nice
chandler   0101      hump
chandler   0101      hump
chandler   0101      hairpiece

## [1] 159385      3
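
For reference, here is a minimal sketch of how one of these datasets can be pulled back from the server for analysis; it assumes the connection settings and the table name dialtbl shown in the technical section at the end of this report:

library(DBI)
library(odbc)

# connection parameters are the censored placeholders from the technical section
con <- dbConnect(odbc(),
                 Driver   = "{SQL Server}",
                 Server   = "***censored***.database.windows.net",
                 Database = "secret",
                 UID      = "blablabla",
                 PWD      = "bliblubli")

# pull the dialogue dataset and check its size (159,385 rows x 3 columns, as shown above)
df_dial <- dbReadTable(con, "dialtbl")
dim(df_dial)
head(df_dial, 10)

dbDisconnect(con)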

The remainder of this report is divided into three parts:

  • Character’s language
  • Episode structure
  • Character’s behavior and actions

Character’s language and personality

What first makes Friends singular is its characters and the way they interact with each other.

The goal of this part is to analyse, through their language, the behavior and talking habits of the main Friends characters.

Knowing the way they talk is the best way to recreate the feel of the TV show. To do so, we propose taking a look at each character's most used words.
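
The word clouds in this part are built from simple per-character word counts. Here is a minimal sketch of the underlying computation, assuming the df_dial dataset from the introduction (columns speaker, episode, words):

library(dplyr)

# top 20 most frequent words for one character, e.g. Monica
top_words_monica <- df_dial %>%
  filter(speaker == "monica") %>%
  count(words, sort = TRUE, name = "Freq") %>%
  slice_head(n = 20)

top_words_monica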


Talking habits


Monica

On the word cloud above, we can see that Monica's dialogue is mostly oriented towards Chandler. To make her dialogue lines feel authentic, it will be very important to emphasise her relationship with Chandler.


Chandler

Chandler, like the other characters, puts a lot of emphasis on affirmation with "yeah". Since that is not a very relevant insight on its own, it is important to look at the other words in this visualization. We can see that he interacts a lot with Joey and Ross. He also shares talking habits with Monica, such as "y'know" and "god".


Joey

Joey often starts his sentences with "Hey"; it is one of the talking habits that characterizes him the most. In contrast to the previous characters, his interactions are much more evenly distributed: he is often in contact with every character.


Ross

Unsurprisingly, Ross's lines are strongly oriented towards Rachel: hers is one of the names he pronounces the most. He also interacts a lot with Chandler and Joey. As talking habits, he says "great", "y'know" and even "huh" a lot. These are important criteria to take into consideration.


Phoebe

Phoebe is one of the most difficult characters to analyse in terms of language. No strong talking habit comes out of our analysis, although she is the only one who uses "ooh" and "wait" a lot. She also seems to pronounce the names of the other characters a lot.


Rachel

Rachel talks a lot about Ross; his is the name she pronounces the most. She also uses the term "honey" a lot.


Relationships




The first thing we notice is that the couples formed in the series mention their partner's name the most: Monica talks a lot about Chandler and Chandler about Monica, and the same goes for Ross and Rachel.

More generally, the stronger the link between characters (friends, love interests, couples), the more often their names are quoted. Joey and Chandler are friends and roommates, and they mention each other's names much more than those of the girls or even Ross.

It is funny to note that Ross and Joey are the two characters who pronounce first names the most, a verbal tic that is important to keep in mind when writing dialogue for them.
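
As a rough illustration, the relationship figures above can be quantified with a cross-tabulation of how often each main character pronounces the others' names. A sketch, again assuming df_dial with speaker and words columns:

library(dplyr)
library(tidyr)

main_characters <- c("monica", "chandler", "joey", "ross", "phoebe", "rachel")

# rows: who is speaking; columns: whose name is spoken; cells: number of mentions
name_mentions <- df_dial %>%
  filter(speaker %in% main_characters, words %in% main_characters) %>%
  count(speaker, words) %>%
  pivot_wider(names_from = words, values_from = n, values_fill = 0)

name_mentions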

Episode structure


In this part, a closer look is taken at the episode structure. The first step is to analyse the opening scene: we focus on the most frequent words of the opening scenes in order to know where they take place and which characters are most present.


Opening scenes



Opening scenes of Friends are mostly alike. An average episode starts with the main characters sitting in Central Perk or in one of their apartments, talking about a situation, when another character enters the room. If you want people to feel like your episode is an episode of Friends, you should definitely follow these criteria.
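
This observation comes from the most frequent words in the opening-scene dataset. A minimal sketch of that count, assuming df_opening_scene with a words column as built by the scraping code at the end of this report:

library(dplyr)

# most frequent words across all opening scenes
opening_words <- df_opening_scene %>%
  count(words, sort = TRUE, name = "Freq") %>%
  slice_head(n = 20)

opening_words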


Scenes analysis

Scene descriptions are the core of the episode structure: they allow us to identify which characters act in which place. Studying them is the best way to understand how the episodes are constructed. In the table below, you will find the most frequent words in the scene descriptions, which reveal the most common places in Friends and the most involved characters.


words Freq
cut 397
monica 216
ross 179
joey 171
rachel 168
chandler 157
phoebe 113
back 83
room 72
time 62
lapse 55
apartment 49
door 49
enters 43
living 39
bedroom 37
central 31
perk 31
cast 27
chandler’s 27
conan 26
flashback 25
walks 25
couch 24
monica’s 24
rachel’s 24
table 24
inside 23
scene 22
sitting 21


Of course, most of the scene descriptions involve the main characters. Time lapses happen a lot in Friends too, and most of the actions are triggered by the entry of a character into a place. Across our analysis, 25 scenes occur in the past (flashbacks); those scenes are important too. The two main places are Central Perk and the three main apartments.
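
The flashback and location figures above can be checked directly against the scene dataset. A small sketch, assuming df_scenes with words and episode columns:

library(dplyr)

# scene descriptions mentioning a flashback, per episode
flashback_scenes <- df_scenes %>%
  filter(words == "flashback") %>%
  count(episode, name = "n_mentions")

# frequency of the main location words in scene descriptions
location_counts <- df_scenes %>%
  filter(words %in% c("central", "perk", "apartment", "bedroom", "room")) %>%
  count(words, sort = TRUE, name = "Freq")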

Character’s behavior and actions

In order to get a more precise idea of the characters' ways of talking, it is essential to dive into the sentiment that comes out of their language; this is the first part of this analysis. The second part looks at the didascalia (stage directions) that drive their actions in the script. With these insights, we will know which actions are performed most often in the show, which will make it possible to write more plausible actions into a future script.

Character’s sentiment analysis

[Figures: "Repartition of Emotion" bar charts for each of the six main characters]

It is really interesting to notice that the largest share of the words used by the Friends characters are positive words. Yet some of the main characters deviate a little from this trend: Ross and Monica show more balance between negative and positive words.

It is when we take a deeper look at the various emotions expressed that we can identify more specific traits for each character:

  • Phoebe’s emotions (in order of importance) are Trust, Fear, Anger
  • Chandler’s emotions are Trust, Fear, Anticipation
  • Ross’s emotions are Trust, Fear, Anticipation
  • Monica’s emotions are Trust, Fear, Joy
  • Joey’s emotions are Trust, Anticipation, Fear
  • Rachel’s emotions are Trust, Anticipation, Sadness

Those are the feelings that should come through in each main character's dialogue in a potential new season. A sketch of how these rankings can be derived is shown below, followed by a table of the most frequent words in the didascalia.
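
A minimal sketch, assuming df_dial with speaker and words columns and the syuzhet package used in the technical section; the helper function top_emotions is introduced here for illustration only:

library(syuzhet)

# rank the eight NRC emotions for one character and keep the top three
top_emotions <- function(df, who, n = 3) {
  words  <- df$words[df$speaker == who]
  scores <- colSums(get_nrc_sentiment(words))
  # drop the overall positive/negative columns, keep only the emotions
  scores <- scores[!names(scores) %in% c("positive", "negative")]
  sort(scores, decreasing = TRUE)[seq_len(n)]
}

top_emotions(df_dial, "rachel")   # per the rankings above: trust, anticipation, sadness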

words Freq
ross 500
rachel 468
chandler 463
joey 459
monica 453
cut 398
phoebe 283
door 204
back 180
credits 178
enters 145
starts 140
break 130
room 127
end 123
commercial 122
walks 106
time 86
apartment 82
start 81
opening 78
turns 78
kiss 65
closing 61
opens 61
couch 60
bedroom 58
lapse 56
living 56
phone 54
runs 54
table 54

There is no essential outcome of this analysis: there is no typical didascalia in Friends. Our dataframe could perhaps be explored more deeply to identify more precise information about the general sentiment of the didascalia.
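
If the didascalia were to be explored further, the same NRC scoring used for the dialogue could be applied to them. A hedged sketch, assuming df_didascalia with a words column:

library(syuzhet)

# aggregate NRC sentiment and emotion scores over all stage-direction words
sentim_didascalia <- get_nrc_sentiment(df_didascalia$words)
colSums(sentim_didascalia)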

In general, our dataset provides a wide range of opportunities in terms of analysis and manipulation.

Technical details and specification


Code to generate the word cloud images used in the report:


# libraries used below: dplyr (top_n), wordcloud2, htmlwidgets (saveWidget) and webshot
library(dplyr)
library(wordcloud2)
library(htmlwidgets)
library(webshot)

# Subset the top 20 most repeated words for each speaker
#-----Monica---
df_monica25 = df_monica %>% top_n(20)
df_monica25 <- subset (df_monica25, select = -speaker)
w1 = wordcloud2(data = df_monica25,color = "random-light")
saveWidget(w1, '1.html', selfcontained = F)
webshot('1.html', '1.png', vwidth=1000,vheight=800, delay = 5)
#-----Chandler---
df_chandler25 = df_chandler %>% top_n(20)
df_chandler25 <- subset (df_chandler25, select = -speaker)
w2 = wordcloud2(data = df_chandler25,color = "random-light")
saveWidget(w2, '2.html', selfcontained = F)
webshot('2.html', '2.png', vwidth=700,vheight=600, delay = 5)
#-----Joey---
df_joey25 = df_joey %>% top_n(20)
df_joey25 <- subset (df_joey25, select = -speaker)
w3 = wordcloud2(data = df_joey25,color = "random-light")
saveWidget(w3, '3.html', selfcontained = F)
webshot('3.html', '3.png', vwidth=400,vheight=400, delay = 5)
#-----Ross---
df_ross25 = df_ross %>% top_n(20)
df_ross25 <- subset (df_ross25, select = -speaker)
w4 = wordcloud2(data = df_ross25,color = "random-light")
saveWidget(w4, '4.html', selfcontained = F)
webshot('4.html', '4.png', vwidth=400,vheight=400, delay = 5)
#-----Phoebe---
df_phoebe25 = df_phoebe %>% top_n(20)
df_phoebe25 <- subset (df_phoebe25, select = -speaker)
w5 = wordcloud2(data = df_phoebe25,color = "random-light")
saveWidget(w5, '5.html', selfcontained = F)
webshot('5.html', '5.png', vwidth=400,vheight=400, delay = 5)
#-----Rachel---
df_rachel25 = df_rachel %>% top_n(20)
df_rachel25 <- subset (df_rachel25, select = -speaker)
w6 = wordcloud2(data = df_rachel25,color = "random-light")
saveWidget(w6, '6.html', selfcontained = F)
webshot('6.html', '6.png', vwidth=400,vheight=400, delay = 5)


Code to import the database onto our server:

library(odbc)
library(DBI)


con <- dbConnect(odbc(),
                 Driver = "{SQL Server}",
                 Server = "***censored***.database.windows.net",
                 Database = "secret",
                 UID = "blablabla",
                 PWD = "bliblubli")
# write each dataframe to its own table on the server
dbWriteTable(conn = con, 
             name = "titletbl", 
             value = df_title)          # episode titles

dbWriteTable(conn = con, 
             name = "openingtbl", 
             value = df_opening_scene)  # opening scenes

dbWriteTable(conn = con, 
             name = "scenestbl", 
             value = df_scenes)         # scene descriptions

dbWriteTable(conn = con, 
             name = "didascaliatbl", 
             value = df_didascalia)     # didascalia (stage directions)

dbWriteTable(conn = con, 
             name = "dialtbl", 
             value = df_dial)           # dialogue


Code to run the sentiment analysis, build the sentiment graphs and store them as images*:


# libraries used below: syuzhet (get_nrc_sentiment) and ggplot2
library(syuzhet)
library(ggplot2)

## sentiment monica 
sentim_monica = get_nrc_sentiment(df_monica$words)
emo_bar = colSums(sentim_monica)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))

sentiment_monica_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) + 
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c( "positive"= "green",
                                "negative"= "red",
                                "trust"= "purple",
                                "anticipation"= "blue",
                                "fear"= "orange",
                                "joy"= "steelblue",
                                "anger"= "grey",
                                "sadness"= "black",
                                "disgust"= "yellow",
                                "surprise"= "brown")) +
  xlab("emotion")+
  ggtitle("Repartition of Emotion for Monica") +
  theme(panel.background = element_blank(),plot.title = element_text(size=28))


## sentiment joey
sentim_joey = get_nrc_sentiment(df_joey$words)
emo_bar = colSums(sentim_joey)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))

sentiment_joey_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) + 
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c( "positive"= "green",
                                 "negative"= "red",
                                 "trust"= "purple",
                                 "anticipation"= "blue",
                                 "fear"= "orange",
                                 "joy"= "steelblue",
                                 "anger"= "grey",
                                 "sadness"= "black",
                                 "disgust"= "yellow",
                                 "surprise"= "brown")) +
  xlab("emotion")+
  ggtitle("Repartition of Emotion for Joey") +
  theme(panel.background = element_blank(),plot.title = element_text(size=28))


## sentiment ross
sentim_ross = get_nrc_sentiment(df_ross$words)
emo_bar = colSums(sentim_ross)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))

sentiment_ross_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) + 
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c( "positive"= "green",
                                "negative"= "red",
                                "trust"= "purple",
                                "anticipation"= "blue",
                                "fear"= "orange",
                                "joy"= "steelblue",
                                "anger"= "grey",
                                "sadness"= "black",
                                "disgust"= "yellow",
                                "surprise"= "brown")) +
  xlab("emotion")+
  ggtitle("Repartition of Emotion for Ross") +
  theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment Phoebe 
sentim_phoebe = get_nrc_sentiment(df_phoebe$words)
emo_bar = colSums(sentim_phoebe)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))

sentiment_phoebe_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) + 
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c( "positive"= "green",
                                "negative"= "red",
                                "trust"= "purple",
                                "anticipation"= "blue",
                                "fear"= "orange",
                                "joy"= "steelblue",
                                "anger"= "grey",
                                "sadness"= "black",
                                "disgust"= "yellow",
                                "surprise"= "brown")) +
  xlab("emotion")+
  ggtitle("Repartition of Emotion for phoebe") +
  theme(panel.background = element_blank(),plot.title = element_text(size=28))

## sentiment chandler
sentim_chandler = get_nrc_sentiment(df_chandler$words)
emo_bar = colSums(sentim_chandler)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))

sentiment_chandler_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) + 
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c( "positive"= "green",
                                "negative"= "red",
                                "trust"= "purple",
                                "anticipation"= "blue",
                                "fear"= "orange",
                                "joy"= "steelblue",
                                "anger"= "grey",
                                "sadness"= "black",
                                "disgust"= "yellow",
                                "surprise"= "brown")) +
  xlab("emotion")+
  ggtitle("Repartition of Emotion for chandler") +
  theme(panel.background = element_blank(),plot.title = element_text(size=28))

## sentiment Rachel
sentim_rachel = get_nrc_sentiment(df_rachel$words)
emo_bar = colSums(sentim_rachel)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))

sentiment_rachel_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) + 
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c( "positive"= "green",
                                "negative"= "red",
                                "trust"= "purple",
                                "anticipation"= "blue",
                                "fear"= "orange",
                                "joy"= "steelblue",
                                "anger"= "grey",
                                "sadness"= "black",
                                "disgust"= "yellow",
                                "surprise"= "brown")) +
  xlab("emotion")+
  ggtitle("Repartition of Emotion for Rachel") +
  theme(panel.background = element_blank(),plot.title = element_text(size=28))

ggsave("sentiment_rachel_graph.png",plot=sentiment_rachel_graph,width = 11, height = 8)
ggsave("sentiment_joey_graph.png",plot=sentiment_joey_graph,width = 11, height = 8)
ggsave("sentiment_monica_graph.png",plot=sentiment_monica_graph,width = 11, height = 8)
ggsave("sentiment_ross_graph.png",plot=sentiment_ross_graph,width = 11, height = 8)
ggsave("sentiment_chandler_graph.png",plot=sentiment_chandler_graph,width = 11, height = 8)
ggsave("sentiment_phoebe_graph.png",plot=sentiment_phoebe_graph,width = 11, height = 8)


*This was necessary because the code takes a long time to run, so the graphs are stored as pictures.

Code to generate our database:

# load the libraries

library(rvest)
library(tidyverse)
library(tidytext)
library(SnowballC)
library(xml2)
library(urltools)
library(plumber)
library(stringr)


setwd("C:/Users/rapha/Desktop/ALL_Seasons")


dir<-"C:/Users/rapha/Desktop/ALL_Seasons"

# get a list of all files with in directory

files<-list.files(path=dir, pattern='html', full.names = TRUE)

# create empty lists that will store the dataframes we will merge afterwards

dialog_list = list()
didascalia_list = list()
opening_scene_list = list()
title_list = list()
scene_list = list()

# loop over the files in our directory

for (file in files){
    
    # load the html file
  
    htmlencod = read_html(x = file, encoding = "UTF-8")
    
    # scrape the body paragraph text
    
    dialog = htmlencod %>%
      html_nodes("body p") %>%
      html_text()
    
    # scrape the title of each file
    
    title = htmlencod %>%
      html_nodes("h1") %>%
      html_text()
    
    # create the titles dataframe and store the number of the episode as a key to join to other tables
    
    title = data.frame(title,stringsAsFactors = FALSE)
    title["episode"]=str_sub(file, start= -9,end= -6)
    title = title %>% unnest_tokens(output = "words",input = title, token = "words" )
    title_list[[file]] = title
    
    # create the dialogs dataframe and store the number of the episode as a key to join to other tables
    
    dial = data.frame(dialog, stringsAsFactors = FALSE)
    dial = separate(data = dial, col = dialog, into = c("speaker", "script"), sep = "\\:")
    dial["episode"] = str_sub(file, start= -9,end= -6)
    
    # create the opening scene dataframe and store the number of the episode as a key to join to other tables
    # this table is created from the dial dataframe
    
    opening_scene = dial[2,]
    opening_scene$script = str_c(opening_scene$speaker, opening_scene$script)
    opening_scene = select(opening_scene,-1)
    opening_scene = opening_scene %>% unnest_tokens(output = "words",input = script, token = "words" )
    opening_scene_list[[file]] = data.frame(opening_scene)
    
    # create the didascalia dataframe and store the number of the episode as a key to join to other tables
    # these are the rows of the dial dataframe with no "speaker:" prefix (script is NA after the split)
    
    didascalia = data.frame(dial[is.na(dial$script),]$"speaker",stringsAsFactors = FALSE)
    didascalia["episode"] = str_sub(file, start= -9,end= -6)
    colnames(didascalia)= c("didascalia","episode")
    
    # all the scene descriptions are enclosed in square brackets, so this table is extracted
    # from the didascalia dataframe
    
    scene = didascalia[str_detect(didascalia$didascalia, "\\["), ]
    colnames(scene) = c("scene_description", "episode")
    scene = scene %>% unnest_tokens(output = "words",input = scene_description, token = "words" )
    scene_list[[file]] = scene
    
    # The scene rows are then removed from the didascalia table to avoid duplicate entries.
    
    didascalia = didascalia[!str_detect(didascalia$didascalia, "\\["), ]
    didascalia = didascalia %>% unnest_tokens(output = "words",input = didascalia, token = "words" )
    didascalia_list[[file]] = didascalia
    
    # Clean the dialogue dataframe: drop rows without dialogue and the first two lines, then tokenize into words.
    
    dial = drop_na(dial)
    dial = dial[-c(1,2),]
    dial = dial %>% unnest_tokens(output = "words",input = script, token = "words" )
    dialog_list[[file]]= dial

}



# All the tables from our lists are then merged to form one big dataset per category, all joinable by episode.

df_title = bind_rows(title_list)
df_dial = bind_rows(dialog_list)
df_opening_scene = bind_rows(opening_scene_list)
df_didascalia = bind_rows(didascalia_list)
df_scenes = bind_rows(scene_list)

# load the stop words (SMART lexicon) from tidytext

stop_words = stop_words %>%
  filter(lexicon == "SMART")

# anti-join by word to remove meaningless stop words

df_title = df_title %>%
  anti_join(y = stop_words,
            by = c("words" = "word"))
df_dial = df_dial %>%
  anti_join(y = stop_words,
            by = c("words" = "word"))
df_opening_scene = df_opening_scene %>%
  anti_join(y = stop_words,
            by = c("words" = "word"))
df_didascalia = df_didascalia %>%
  anti_join(y = stop_words,
            by = c("words" = "word"))
df_scenes = df_scenes %>%
  anti_join(y = stop_words,
            by = c("words" = "word"))

# --------------------------------------------------------------------------------

# identify the words that pollute our dataset

unique(df_dial$words)

df_dial %>% count(words)
df_didascalia %>% count(words)
df_opening_scene %>% count(words)
df_title %>% count(words)
df_scenes %>% count(words)

# numbers
# words of one letter
# weird spelling
# combination of letters and numbers
# ...

# -------------------------------------------------------------------------------

# flag tokens containing digits by replacing them with NA (dropped just below)

df_dial$words = gsub('[[:digit:]]+',NA , df_dial$words)
df_didascalia$words = gsub('[[:digit:]]+',NA , df_didascalia$words)
df_opening_scene$words = gsub('[[:digit:]]+',NA , df_opening_scene$words)
df_scenes$words = gsub('[[:digit:]]+',NA , df_scenes$words)
df_title$words = gsub('[[:digit:]]+',NA , df_title$words)

# drop empty rows generated before

df_dial=drop_na(df_dial)
df_opening_scene=drop_na(df_opening_scene)
df_didascalia=drop_na(df_didascalia)
df_scenes=drop_na(df_scenes)
df_title=drop_na(df_title)

# drop all the rows with two-character entries

df_dial = filter(df_dial,nchar(df_dial$words)!=2)
df_didascalia = filter(df_didascalia,nchar(df_didascalia$words)!=2)
df_opening_scene = filter(df_opening_scene,nchar(df_opening_scene$words)!=2)
df_scenes = filter(df_scenes,nchar(df_scenes$words)!=2)
df_title = filter(df_title,nchar(df_title$words)!=2)

df_dial=df_dial[!str_detect(df_dial$speaker, "\\["), ]
df_dial=df_dial[!str_detect(df_dial$speaker, "\\("), ]
df_dial$speaker =  tolower(df_dial$speaker)

# What's next?


# The cleaning of our dataset is well advanced. We can now use it to perform analysis.
# The next step will be to analyse our dataset and provide useful insights to someone who would write
# another season of Friends.