
In the context of a possible new season of Friends, our team has gathered as much information as possible about the series and, in particular, its writing. The objective of this report is to give a precise and quantified description of the writing of the show.
The database used for the following analysis is the result of scraping the scripts of more than 200 episodes of Friends. All of our data are stored on our own servers and are not public for the moment.
Our database consists of 5 datasets (dialogue, titles, opening scenes, scene descriptions and didascalia, i.e. stage directions), all joinable by episode. Below is a sample of the dialogue dataset:
| speaker | episode | words |
|---|---|---|
| monica | 0101 | guy |
| monica | 0101 | work |
| joey | 0101 | guy |
| joey | 0101 | gotta |
| joey | 0101 | wrong |
| chandler | 0101 | joey |
| chandler | 0101 | nice |
| chandler | 0101 | hump |
| chandler | 0101 | hump |
| chandler | 0101 | hairpiece |
The dialogue dataset alone contains 159,385 rows and 3 columns.
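Since every dataset shares the episode key, the tables can be combined. A minimal sketch, assuming the df_dial and df_title data frames produced by the scraping code at the end of this report:

```r
library(dplyr)

# Attach each episode's title words to its dialogue words through the shared "episode" key
# (df_dial and df_title come from the scraping code in the appendix of this report)
df_dial %>%
  inner_join(df_title, by = "episode", suffix = c("_dialogue", "_title")) %>%
  head()
```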
The rest of this report is divided into three parts: the main characters and their language, the structure of the episodes, and the sentiment carried by the dialogue and stage directions.
What makes Friends singular in the first place is its characters, their interactions, and the way they talk to each other.
The goal of this first part is to analyse, through their language, the behaviour and speech habits of the main characters of Friends.
Knowing the way they talk is the best way to recreate the feeling of the TV show. To do so, we propose to look at the words each character uses most, shown as one word cloud per character.
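As a minimal sketch of how these counts can be obtained (again assuming the df_dial data frame built by the scraping code at the end of this report):

```r
library(dplyr)

# Most frequent words spoken by Monica (sketch; any other speaker works the same way)
df_dial %>%
  filter(speaker == "monica") %>%
  count(words, sort = TRUE) %>%
  head(20)
```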
On Monica’s word cloud, we can see that her dialogue is mostly directed towards Chandler. In order to make her lines feel authentic, it will be very important to emphasise her relationship with Chandler.
Chandler, like the other characters, puts a lot of emphasis on affirmations with “yeah”. Since that is not a very distinctive insight, it is important to look at the other words in this visualisation. We can see that he interacts a lot with Joey and Ross. He also shares the same speech habits as Monica, with “y’know” and “god”.
Joey often starts his sentences with “Hey”; it is one of the speech habits that characterises him the most. In contrast with the previous characters, his interactions are much more evenly distributed: he is often in contact with every character.
Unsurprisingly, Ross’s lines are strongly oriented towards Rachel: hers is one of the names he pronounces the most. He also interacts a lot with Chandler and Joey. As for speech habits, he says “great”, “y’know” or even “huh” a lot. These are important criteria to take into consideration.
Phoebe is one of the most difficult characters to analyse in terms of language: no strongly distinctive speech habit comes out of our analysis. She is the only one who uses “ooh” and “wait” a lot, and she seems to pronounce the other characters’ names frequently.
Rachel talks a lot about Ross: his is the name she pronounces the most. She also uses the term “honey” a lot.
The first thing we notice is that the couples formed in the series mention their partner’s name first: Monica talks a lot about Chandler and Chandler about Monica, and the same goes for Ross and Rachel.
More generally, the stronger the link between two characters (friendship, love interest, couple), the more often their names are quoted. Joey and Chandler are friends and roommates and mention each other’s names much more than those of the girls or even Ross.
It is amusing to note that Ross and Joey are the two characters who pronounce their own first names the most, a verbal tic that is important to keep in mind when writing dialogue for them.
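A minimal sketch of how such name-mention counts can be computed, assuming the df_dial data frame from the appendix:

```r
library(dplyr)

main_names <- c("monica", "chandler", "joey", "ross", "rachel", "phoebe")

# How often each main character pronounces each main character's name
df_dial %>%
  filter(speaker %in% main_names, words %in% main_names) %>%
  count(speaker, words, sort = TRUE)
```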
In this part, we take a closer look at the episode structure. The first step is to analyse the opening scene: we focus on the most frequent words of the opening scene descriptions in order to know where they take place and which characters are most present.
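A minimal sketch of the underlying word counts, assuming the df_opening_scene data frame built by the scraping code at the end of this report:

```r
library(dplyr)

# Most frequent words across all opening scene descriptions
df_opening_scene %>%
  count(words, sort = TRUE) %>%
  head(25)
```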
Opening scenes of Friends are mostly the same: an average episode starts with the main characters sitting in Central Perk or in one of their apartments, talking about a situation, when another character enters the room. If you want people to feel like your episode is an episode of Friends, you should definitely follow these criteria.
Scene descriptions are the core of the episode structure: they identify which characters act in which place, and studying them is the best way to understand how the episodes are constructed. In the table below, you will find the most frequent words of the scene descriptions, covering both the most recurring places in Friends and the most involved characters.
| words | Freq |
|---|---|
| cut | 397 |
| monica | 216 |
| ross | 179 |
| joey | 171 |
| rachel | 168 |
| chandler | 157 |
| phoebe | 113 |
| back | 83 |
| room | 72 |
| time | 62 |
| lapse | 55 |
| apartment | 49 |
| door | 49 |
| enters | 43 |
| living | 39 |
| bedroom | 37 |
| central | 31 |
| perk | 31 |
| cast | 27 |
| chandler’s | 27 |
| conan | 26 |
| flashback | 25 |
| walks | 25 |
| couch | 24 |
| monica’s | 24 |
| rachel’s | 24 |
| table | 24 |
| inside | 23 |
| scene | 22 |
| sitting | 21 |
Of course, most scene descriptions involve the main characters. Time lapses are also frequent in Friends, and many scenes are triggered by a character entering a place. Across our corpus, 25 scenes take place in the past (flashbacks); those scenes matter too. The two main locations are Central Perk and the three main apartments.
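The table above can be reproduced with a simple frequency count on the scene-description dataset (a sketch, assuming df_scenes from the appendix):

```r
library(dplyr)

# Word frequencies of the scene descriptions (source of the table above)
df_scenes %>%
  count(words, sort = TRUE, name = "Freq") %>%
  head(30)
```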
In order to get a more precise idea of how the characters talk, it is essential to dive into the sentiment that comes out of their language; this is the first part of this analysis. The second part deals with the didascalia, the stage directions that drive their actions in the script. With these insights, we will know which actions are performed most often in the show, which will make it possible to write more plausible actions into a future script.
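Before the full emotion breakdown, here is a minimal sketch of the positive/negative tally for a single character, using the NRC lexicon from the syuzhet package as in the code at the end of this report (df_monica is assumed to be Monica's subset of the dialogue data):

```r
library(syuzhet)

# Positive vs negative word counts for Monica under the NRC lexicon (sketch)
nrc_monica <- get_nrc_sentiment(df_monica$words)
colSums(nrc_monica[, c("positive", "negative")])
```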





(Figures: distribution of positive/negative words and emotions for each main character, generated by the sentiment-analysis code at the end of this report.)
It is really interesting to notice that the largest share of the words used by the Friends characters are positive words. Yet some of the main characters deviate a little from this trend: Ross and Monica show more balance between negative and positive words.
It is when we look more closely at the individual emotions expressed that we can identify more specific traits for each character. Those are the feelings that should come through in the dialogue of a potential new season for each main character.
The second part of this analysis concerns the didascalia. The table below lists the most frequent words appearing in the stage directions:
| words | Freq |
|---|---|
| ross | 500 |
| rachel | 468 |
| chandler | 463 |
| joey | 459 |
| monica | 453 |
| cut | 398 |
| phoebe | 283 |
| door | 204 |
| back | 180 |
| credits | 178 |
| enters | 145 |
| starts | 140 |
| break | 130 |
| room | 127 |
| end | 123 |
| commercial | 122 |
| walks | 106 |
| time | 86 |
| apartment | 82 |
| start | 81 |
| opening | 78 |
| turns | 78 |
| kiss | 65 |
| closing | 61 |
| opens | 61 |
| couch | 60 |
| bedroom | 58 |
| lapse | 56 |
| living | 56 |
| phone | 54 |
| runs | 54 |
| table | 54 |
There is no essential outcome of this analysis: there is no typical didascalia in Friends. Our data frame could perhaps be explored further to identify more precise information about the general sentiment of the didascalia.
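For instance (a sketch, assuming the df_didascalia data frame built by the scraping code below), the same NRC sentiment scoring can be applied to the didascalia words:

```r
library(syuzhet)

# Overall emotion counts over all didascalia (stage direction) words
sentim_didascalia <- get_nrc_sentiment(df_didascalia$words)
sort(colSums(sentim_didascalia), decreasing = TRUE)
```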
In general, our dataset provides a wide range of opportunities in terms of analysis and manipulation.
Code to generate the word cloud images used in the markdown:
# Required packages for the word clouds (assumed to be installed)
library(wordcloud2)
library(htmlwidgets)  # saveWidget()
library(webshot)      # webshot()
# df_monica, df_chandler, df_joey, df_ross, df_phoebe and df_rachel are the
# per-speaker word-frequency tables built from df_dial (not shown here)
# Keep the 20 most frequent words for each speaker
#-----Monica---
df_monica25 = df_monica %>% top_n(20)
df_monica25 <- subset(df_monica25, select = -speaker)
w1 = wordcloud2(data = df_monica25, color = "random-light")
saveWidget(w1, '1.html', selfcontained = F)
webshot('1.html', '1.png', vwidth = 1000, vheight = 800, delay = 5)
#-----Chandler---
df_chandler25 = df_chandler %>% top_n(20)
df_chandler25 <- subset (df_chandler25, select = -speaker)
w2 = wordcloud2(data = df_chandler25,color = "random-light")
saveWidget(w2, '2.html', selfcontained = F)
webshot('2.html', '2.png', vwidth=700,vheight=600, delay = 5)
#-----Joey---
df_joey25 = df_joey %>% top_n(20)
df_joey25 <- subset (df_joey25, select = -speaker)
w3 = wordcloud2(data = df_joey25,color = "random-light")
saveWidget(w3, '3.html', selfcontained = F)
webshot('3.html', '3.png', vwidth=400,vheight=400, delay = 5)
#-----Ross---
df_ross25 = df_ross %>% top_n(20)
df_ross25 <- subset (df_ross25, select = -speaker)
w4 = wordcloud2(data = df_ross25,color = "random-light")
saveWidget(w4, '4.html', selfcontained = F)
webshot('4.html', '4.png', vwidth=400,vheight=400, delay = 5)
#-----Phoebe---
df_phoebe25 = df_phoebe %>% top_n(20)
df_phoebe25 <- subset (df_phoebe25, select = -speaker)
w5 = wordcloud2(data = df_phoebe25,color = "random-light")
saveWidget(w5, '5.html', selfcontained = F)
webshot('5.html', '5.png', vwidth=400,vheight=400, delay = 5)
#-----Rachel---
df_rachel25 = df_rachel %>% top_n(20)
df_rachel25 <- subset (df_rachel25, select = -speaker)
w6 = wordcloud2(data = df_rachel25,color = "random-light")
saveWidget(w6, '6.html', selfcontained = F)
webshot('6.html', '6.png', vwidth=400,vheight=400, delay = 5)
Code to import the database onto our server:
library(odbc)
library(DBI)
con <- dbConnect(odbc(),
Driver = "{SQL Server}",
Server = "***censored***.database.windows.net",
Database = "secret",
UID = "blablabla",
PWD = "bliblubli")
# write each data frame to its own table on the server
dbWriteTable(conn = con, name = "titletbl",      value = df_title)
dbWriteTable(conn = con, name = "openingtbl",    value = df_opening_scene)
dbWriteTable(conn = con, name = "scenestbl",     value = df_scenes)
dbWriteTable(conn = con, name = "didascaliatbl", value = df_didascalia)
dbWriteTable(conn = con, name = "dialtbl",       value = df_dial)
Code to run the sentiment analysis, build the sentiment graphs, and save them as images*:
# Required packages for the sentiment analysis (assumed to be installed)
library(syuzhet)   # get_nrc_sentiment()
library(ggplot2)
## sentiment monica
sentim_monica = get_nrc_sentiment(df_monica$words)
emo_bar = colSums(sentim_monica)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_monica_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for Monica") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment joey
sentim_joey = get_nrc_sentiment(df_joey$words)
emo_bar = colSums(sentim_joey)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_joey_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for Joey") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment ross
sentim_ross = get_nrc_sentiment(df_ross$words)
emo_bar = colSums(sentim_ross)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_ross_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for Ross") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment Phoebe
sentim_phoebe = get_nrc_sentiment(df_phoebe$words)
emo_bar = colSums(sentim_phoebe)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_phoebe_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for phoebe") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment chandler
sentim_chandler = get_nrc_sentiment(df_chandler$words)
emo_bar = colSums(sentim_chandler)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_chandler_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for chandler") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment Rachel
sentim_rachel = get_nrc_sentiment(df_rachel$words)
emo_bar = colSums(sentim_rachel)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_rachel_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for Rachel") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
ggsave("sentiment_rachel_graph.png",plot=sentiment_rachel_graph,width = 11, height = 8)
ggsave("sentiment_joey_graph.png",plot=sentiment_joey_graph,width = 11, height = 8)
ggsave("sentiment_monica_graph.png",plot=sentiment_monica_graph,width = 11, height = 8)
ggsave("sentiment_ross_graph.png",plot=sentiment_ross_graph,width = 11, height = 8)
ggsave("sentiment_chandler_graph.png",plot=sentiment_chandler_graph,width = 11, height = 8)
ggsave("sentiment_phoebe_graph.png",plot=sentiment_phoebe_graph,width = 11, height = 8)
*This step was necessary because the code takes a long time to run, so the graphs are stored as pre-rendered images.
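The saved PNG files can then be embedded back into the report without re-running the analysis, for example (a sketch, using the file names generated above):

```r
# Embed a pre-rendered sentiment graph into the R Markdown report
knitr::include_graphics("sentiment_monica_graph.png")
```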
Code to generate our database:
# load the libraries
library(rvest)
library(tidyverse)
library(tidytext)
library(SnowballC)
library(xml2)
library(urltools)
library(plumber)
library(stringr)
setwd("C:/Users/rapha/Desktop/ALL_Seasons")
dir<-"C:/Users/rapha/Desktop/ALL_Seasons"
# get a list of all the HTML files in the directory
files<-list.files(path=dir, pattern='html', full.names = TRUE)
# create empty lists that will store the data frames we will merge afterwards
dialog_list = list()
didascalia_list = list()
opening_scene_list = list()
title_list = list()
scene_list = list()
# loop over the files in our directory
for (file in files){
#load the file html
htmlencod = read_html(x = file, encoding = "UTF-8")
# scrape the body paragraph text
dialog = htmlencod %>%
html_nodes("body p") %>%
html_text()
# scrape the title of the episode
title = htmlencod %>%
html_nodes("h1") %>%
html_text()
# create the titles dataframe and store the number of the episode as a key to join to other tables
title = data.frame(title,stringsAsFactors = FALSE)
title["episode"]=str_sub(file, start= -9,end= -6)
title = title %>% unnest_tokens(output = "words",input = title, token = "words" )
title_list[[file]] = title
# create the dialogs dataframe and store the number of the episode as a key to join to other tables
dial = data.frame(dialog, stringsAsFactors = FALSE)
dial = separate(data = dial, col = dialog, into = c("speaker", "script"), sep = "\\:")
dial["episode"] = str_sub(file, start= -9,end= -6)
# create the opening scene dataframe and store the number of the episode as a key to join to other tables
# this table is created from the dial dataframe
opening_scene = dial[2,]
opening_scene$script = str_c(opening_scene$speaker, opening_scene$script)
opening_scene = select(opening_scene,-1)
opening_scene = opening_scene %>% unnest_tokens(output = "words",input = script, token = "words" )
opening_scene_list[[file]] = data.frame(opening_scene)
# create the didascalia dataframe and store the number of the episode as a key to join to other tables
# this table is built from the rows of the dial dataframe whose script column is NA (paragraphs without a "speaker:" separator)
didascalia = data.frame(dial[is.na(dial$script),]$"speaker",stringsAsFactors = FALSE)
didascalia["episode"] = str_sub(file, start= -9,end= -6)
colnames(didascalia)= c("didascalia","episode")
# all the scenes are separated from the rest of the body text by square brackets, so this table is
# extracted from the didascalia data frame
scene = didascalia[str_detect(didascalia$didascalia, "\\["), ]
colnames(scene) = c("scene_description", "episode")
scene = scene %>% unnest_tokens(output = "words",input = scene_description, token = "words" )
scene_list[[file]] = scene
# the scene-related rows are then removed from the didascalia table to avoid double entries
didascalia = didascalia[!str_detect(didascalia$didascalia, "\\["), ]
didascalia = didascalia %>% unnest_tokens(output = "words",input = didascalia, token = "words" )
didascalia_list[[file]] = didascalia
# clean the dialogue dataframe
dial = drop_na(dial)
dial = dial[-c(1,2),]
dial = dial %>% unnest_tokens(output = "words",input = script, token = "words" )
dialog_list[[file]]= dial
}
# All the tables from our lists are then merged to form one large dataset per category, all joinable by episode.
df_title = bind_rows(title_list)
df_dial = bind_rows(dialog_list)
df_opening_scene = bind_rows(opening_scene_list)
df_didascalia = bind_rows(didascalia_list)
df_scenes = bind_rows(scene_list)
# keep the SMART stop-word lexicon from tidytext's stop_words
stop_words = stop_words %>%
filter(lexicon == "SMART")
# anti-join by word to remove non-meaningful (stop) words
df_title = df_title %>%
anti_join(y = stop_words,
by = c("words" = "word"))
df_dial = df_dial %>%
anti_join(y = stop_words,
by = c("words" = "word"))
df_opening_scene = df_opening_scene %>%
anti_join(y = stop_words,
by = c("words" = "word"))
df_didascalia = df_didascalia %>%
anti_join(y = stop_words,
by = c("words" = "word"))
df_scenes = df_scenes %>%
anti_join(y = stop_words,
by = c("words" = "word"))
# --------------------------------------------------------------------------------
# identification of the words that pollute our dataset
unique(df_dial$words)
df_dial %>% count(words)
df_didascalia %>% count(words)
df_opening_scene %>% count(words)
df_title %>% count(words)
df_scenes %>% count(words)
# numbers
# words of one letter
# weird spelling
# combination of letters and numbers
# ...
# -------------------------------------------------------------------------------
# replace entries that contain digits with NA
df_dial$words = gsub('[[:digit:]]+',NA , df_dial$words)
df_didascalia$words = gsub('[[:digit:]]+',NA , df_didascalia$words)
df_opening_scene$words = gsub('[[:digit:]]+',NA , df_opening_scene$words)
df_scenes$words = gsub('[[:digit:]]+',NA , df_scenes$words)
df_title$words = gsub('[[:digit:]]+',NA , df_title$words)
# drop empty rows generated before
df_dial=drop_na(df_dial)
df_opening_scene=drop_na(df_opening_scene)
df_didascalia=drop_na(df_didascalia)
df_scenes=drop_na(df_scenes)
df_title=drop_na(df_title)
# drop all the rows with two-character entries (treated as noise)
df_dial = filter(df_dial,nchar(df_dial$words)!=2)
df_didascalia = filter(df_didascalia,nchar(df_didascalia$words)!=2)
df_opening_scene = filter(df_opening_scene,nchar(df_opening_scene$words)!=2)
df_scenes = filter(df_scenes,nchar(df_scenes$words)!=2)
df_title = filter(df_title,nchar(df_title$words)!=2)
# remove dialogue rows whose speaker field is actually a stage direction, then normalise speaker names to lowercase
df_dial=df_dial[!str_detect(df_dial$speaker, "\\["), ]
df_dial=df_dial[!str_detect(df_dial$speaker, "\\("), ]
df_dial$speaker = tolower(df_dial$speaker)
# what's next?
# The cleaning of our dataset is well advanced; we can now use it to perform analysis.
# The next step is to analyse our dataset and provide useful insights to someone who would write
# another season of Friends.