
In the context of a possible new season of Friends, our team has gathered as much information as possible about the series and, in particular, its writing. The objective of this report is to give a precise and quantified description of the writing of the show.
The database used for the following analysis is the result of scraping the scripts of more than 200 episodes of Friends. All of our data are stored on our own servers and are not public for the moment.
Our database consists of 5 datasets (dialogue, titles, opening scenes, scene descriptions and didascalia, i.e. stage directions), all joinable by episode. Below is a sample of the dialogue dataset:
| speaker | episode | words |
|---|---|---|
| monica | 0101 | guy |
| monica | 0101 | work |
| joey | 0101 | guy |
| joey | 0101 | gotta |
| joey | 0101 | wrong |
| chandler | 0101 | joey |
| chandler | 0101 | nice |
| chandler | 0101 | hump |
| chandler | 0101 | hump |
| chandler | 0101 | hairpiece |
The dialogue dataset alone contains 159,385 rows and 3 columns.
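Since every dataset shares the episode key, the tables can be combined. A minimal sketch, assuming the df_dial and df_title data frames produced by the scraping code at the end of this report:

```r
library(dplyr)

# Attach each episode's title words to its dialogue words through the shared "episode" key
# (df_dial and df_title come from the scraping code in the appendix of this report)
df_dial %>%
  inner_join(df_title, by = "episode", suffix = c("_dialogue", "_title")) %>%
  head()
```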
The rest of this report is divided into three parts: the main characters and their language, the structure of the episodes, and the sentiment carried by the dialogue and stage directions.
What makes Friends singular in the first place is its characters, their interactions, and the way they talk to each other.
The goal of this first part is to analyse, through their language, the behaviour and speech habits of the main characters of Friends.
Knowing the way they talk is the best way to recreate the feeling of the TV show. To do so, we propose to look at the words each character uses most, shown as one word cloud per character.
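As a minimal sketch of how these counts can be obtained (again assuming the df_dial data frame built by the scraping code at the end of this report):

```r
library(dplyr)

# Most frequent words spoken by Monica (sketch; any other speaker works the same way)
df_dial %>%
  filter(speaker == "monica") %>%
  count(words, sort = TRUE) %>%
  head(20)
```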
On Monica’s word cloud, we can see that her dialogue is mostly directed towards Chandler. In order to make her lines feel authentic, it will be very important to emphasise her relationship with Chandler.
Chandler, like the other characters, puts a lot of emphasis on affirmations with “yeah”. Since that is not a very distinctive insight, it is important to look at the other words in this visualisation. We can see that he interacts a lot with Joey and Ross. He also shares the same speech habits as Monica, with “y’know” and “god”.
Joey often starts his sentences with “Hey”; it is one of the speech habits that characterises him the most. In contrast with the previous characters, his interactions are much more evenly distributed: he is often in contact with every character.
Unsurprisingly, Ross’s lines are strongly oriented towards Rachel: hers is one of the names he pronounces the most. He also interacts a lot with Chandler and Joey. As for speech habits, he says “great”, “y’know” or even “huh” a lot. These are important criteria to take into consideration.
Phoebe is one of the most difficult characters to analyse in terms of language: no strongly distinctive speech habit comes out of our analysis. She is the only one who uses “ooh” and “wait” a lot, and she seems to pronounce the other characters’ names frequently.
Rachel talks a lot about Ross: his is the name she pronounces the most. She also uses the term “honey” a lot.
The first thing we notice is that the couples formed in the series mention their partner’s name first: Monica talks a lot about Chandler and Chandler about Monica, and the same goes for Ross and Rachel.
More generally, the stronger the link between two characters (friendship, love interest, couple), the more often their names are quoted. Joey and Chandler are friends and roommates and mention each other’s names much more than those of the girls or even Ross.
It is amusing to note that Ross and Joey are the two characters who pronounce their own first names the most, a verbal tic that is important to keep in mind when writing dialogue for them.
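A minimal sketch of how such name-mention counts can be computed, assuming the df_dial data frame from the appendix:

```r
library(dplyr)

main_names <- c("monica", "chandler", "joey", "ross", "rachel", "phoebe")

# How often each main character pronounces each main character's name
df_dial %>%
  filter(speaker %in% main_names, words %in% main_names) %>%
  count(speaker, words, sort = TRUE)
```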
In this part, we take a closer look at the episode structure. The first step is to analyse the opening scene: we focus on the most frequent words of the opening scene descriptions in order to know where they take place and which characters are most present.
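A minimal sketch of the underlying word counts, assuming the df_opening_scene data frame built by the scraping code at the end of this report:

```r
library(dplyr)

# Most frequent words across all opening scene descriptions
df_opening_scene %>%
  count(words, sort = TRUE) %>%
  head(25)
```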
Opening scenes of Friends are mostly the same: an average episode starts with the main characters sitting in Central Perk or in one of their apartments, talking about a situation, when another character enters the room. If you want people to feel like your episode is an episode of Friends, you should definitely follow these criteria.
Scene descriptions are the core of the episode structure: they identify which characters act in which place, and studying them is the best way to understand how the episodes are constructed. In the table below, you will find the most frequent words of the scene descriptions, covering both the most recurring places in Friends and the most involved characters.
| words | Freq |
|---|---|
| cut | 397 |
| monica | 216 |
| ross | 179 |
| joey | 171 |
| rachel | 168 |
| chandler | 157 |
| phoebe | 113 |
| back | 83 |
| room | 72 |
| time | 62 |
| lapse | 55 |
| apartment | 49 |
| door | 49 |
| enters | 43 |
| living | 39 |
| bedroom | 37 |
| central | 31 |
| perk | 31 |
| cast | 27 |
| chandler’s | 27 |
| conan | 26 |
| flashback | 25 |
| walks | 25 |
| couch | 24 |
| monica’s | 24 |
| rachel’s | 24 |
| table | 24 |
| inside | 23 |
| scene | 22 |
| sitting | 21 |
Of course, most scene descriptions involve the main characters. Time lapses are also frequent in Friends, and many scenes are triggered by a character entering a place. Across our corpus, 25 scenes take place in the past (flashbacks); those scenes matter too. The two main locations are Central Perk and the three main apartments.
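The table above can be reproduced with a simple frequency count on the scene-description dataset (a sketch, assuming df_scenes from the appendix):

```r
library(dplyr)

# Word frequencies of the scene descriptions (source of the table above)
df_scenes %>%
  count(words, sort = TRUE, name = "Freq") %>%
  head(30)
```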
In order to get a more precise idea of how the characters talk, it is essential to dive into the sentiment that comes out of their language; this is the first part of this analysis. The second part deals with the didascalia, the stage directions that drive their actions in the script. With these insights, we will know which actions are performed most often in the show, which will make it possible to write more plausible actions into a future script.
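Before the full emotion breakdown, here is a minimal sketch of the positive/negative tally for a single character, using the NRC lexicon from the syuzhet package as in the code at the end of this report (df_monica is assumed to be Monica's subset of the dialogue data):

```r
library(syuzhet)

# Positive vs negative word counts for Monica under the NRC lexicon (sketch)
nrc_monica <- get_nrc_sentiment(df_monica$words)
colSums(nrc_monica[, c("positive", "negative")])
```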





(Figures: distribution of positive/negative words and emotions for each main character, generated by the sentiment-analysis code at the end of this report.)
It is really interesting to notice that the largest share of the words used by the Friends characters are positive words. Yet some of the main characters deviate a little from this trend: Ross and Monica show more balance between negative and positive words.
It is when we look more closely at the individual emotions expressed that we can identify more specific traits for each character. Those are the feelings that should come through in the dialogue of a potential new season for each main character.
The second part of this analysis concerns the didascalia. The table below lists the most frequent words appearing in the stage directions:
| words | Freq |
|---|---|
| ross | 500 |
| rachel | 468 |
| chandler | 463 |
| joey | 459 |
| monica | 453 |
| cut | 398 |
| phoebe | 283 |
| door | 204 |
| back | 180 |
| credits | 178 |
| enters | 145 |
| starts | 140 |
| break | 130 |
| room | 127 |
| end | 123 |
| commercial | 122 |
| walks | 106 |
| time | 86 |
| apartment | 82 |
| start | 81 |
| opening | 78 |
| turns | 78 |
| kiss | 65 |
| closing | 61 |
| opens | 61 |
| couch | 60 |
| bedroom | 58 |
| lapse | 56 |
| living | 56 |
| phone | 54 |
| runs | 54 |
| table | 54 |
There is no essential outcome of this analysis: there is no typical didascalia in Friends. Our data frame could perhaps be explored further to identify more precise information about the general sentiment of the didascalia.
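For instance (a sketch, assuming the df_didascalia data frame built by the scraping code below), the same NRC sentiment scoring can be applied to the didascalia words:

```r
library(syuzhet)

# Overall emotion counts over all didascalia (stage direction) words
sentim_didascalia <- get_nrc_sentiment(df_didascalia$words)
sort(colSums(sentim_didascalia), decreasing = TRUE)
```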
In general, our dataset provides a wide range of opportunities in terms of analysis and manipulation.
Code to generate the word cloud images used in the markdown:
# Required packages for the word clouds (assumed to be installed)
library(wordcloud2)
library(htmlwidgets)  # saveWidget()
library(webshot)      # webshot()
# df_monica, df_chandler, df_joey, df_ross, df_phoebe and df_rachel are the
# per-speaker word-frequency tables built from df_dial (not shown here)
# Keep the 20 most frequent words for each speaker
#-----Monica---
df_monica25 = df_monica %>% top_n(20)
df_monica25 <- subset(df_monica25, select = -speaker)
w1 = wordcloud2(data = df_monica25, color = "random-light")
saveWidget(w1, '1.html', selfcontained = F)
webshot('1.html', '1.png', vwidth = 1000, vheight = 800, delay = 5)
#-----Chandler---
df_chandler25 = df_chandler %>% top_n(20)
df_chandler25 <- subset (df_chandler25, select = -speaker)
w2 = wordcloud2(data = df_chandler25,color = "random-light")
saveWidget(w2, '2.html', selfcontained = F)
webshot('2.html', '2.png', vwidth=700,vheight=600, delay = 5)
#-----Joey---
df_joey25 = df_joey %>% top_n(20)
df_joey25 <- subset (df_joey25, select = -speaker)
w3 = wordcloud2(data = df_joey25,color = "random-light")
saveWidget(w3, '3.html', selfcontained = F)
webshot('3.html', '3.png', vwidth=400,vheight=400, delay = 5)
#-----Ross---
df_ross25 = df_ross %>% top_n(20)
df_ross25 <- subset (df_ross25, select = -speaker)
w4 = wordcloud2(data = df_ross25,color = "random-light")
saveWidget(w4, '4.html', selfcontained = F)
webshot('4.html', '4.png', vwidth=400,vheight=400, delay = 5)
#-----Phoebe---
df_phoebe25 = df_phoebe %>% top_n(20)
df_phoebe25 <- subset (df_phoebe25, select = -speaker)
w5 = wordcloud2(data = df_phoebe25,color = "random-light")
saveWidget(w5, '5.html', selfcontained = F)
webshot('5.html', '5.png', vwidth=400,vheight=400, delay = 5)
#-----Rachel---
df_rachel25 = df_rachel %>% top_n(20)
df_rachel25 <- subset (df_rachel25, select = -speaker)
w6 = wordcloud2(data = df_rachel25,color = "random-light")
saveWidget(w6, '6.html', selfcontained = F)
webshot('6.html', '6.png', vwidth=400,vheight=400, delay = 5)
Code to import the database onto our server:
library(odbc)
library(DBI)
con <- dbConnect(odbc(),
Driver = "{SQL Server}",
Server = "***censored***.database.windows.net",
Database = "secret",
UID = "blablabla",
PWD = "bliblubli")
# write each data frame to its own table on the server
dbWriteTable(conn = con, name = "titletbl",      value = df_title)
dbWriteTable(conn = con, name = "openingtbl",    value = df_opening_scene)
dbWriteTable(conn = con, name = "scenestbl",     value = df_scenes)
dbWriteTable(conn = con, name = "didascaliatbl", value = df_didascalia)
dbWriteTable(conn = con, name = "dialtbl",       value = df_dial)
Code to run the sentiment analysis, build the sentiment graphs, and save them as images*:
# Required packages for the sentiment analysis (assumed to be installed)
library(syuzhet)   # get_nrc_sentiment()
library(ggplot2)
## sentiment monica
sentim_monica = get_nrc_sentiment(df_monica$words)
emo_bar = colSums(sentim_monica)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_monica_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for Monica") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment joey
sentim_joey = get_nrc_sentiment(df_joey$words)
emo_bar = colSums(sentim_joey)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_joey_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for Joey") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment ross
sentim_ross = get_nrc_sentiment(df_ross$words)
emo_bar = colSums(sentim_ross)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_ross_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for Ross") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment Phoebe
sentim_phoebe = get_nrc_sentiment(df_phoebe$words)
emo_bar = colSums(sentim_phoebe)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_phoebe_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for phoebe") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment chandler
sentim_chandler = get_nrc_sentiment(df_chandler$words)
emo_bar = colSums(sentim_chandler)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_chandler_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for chandler") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
## sentiment Rachel
sentim_rachel = get_nrc_sentiment(df_rachel$words)
emo_bar = colSums(sentim_rachel)
emo_sum = data.frame(count = emo_bar, emotion = names(emo_bar))
sentiment_rachel_graph = ggplot(emo_sum, aes(x = reorder(emotion, -count),y = count, fill = emotion)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c( "positive"= "green",
"negative"= "red",
"trust"= "purple",
"anticipation"= "blue",
"fear"= "orange",
"joy"= "steelblue",
"anger"= "grey",
"sadness"= "black",
"disgust"= "yellow",
"surprise"= "brown")) +
xlab("emotion")+
ggtitle("Repartition of Emotion for Rachel") +
theme(panel.background = element_blank(),plot.title = element_text(size=28))
ggsave("sentiment_rachel_graph.png",plot=sentiment_rachel_graph,width = 11, height = 8)
ggsave("sentiment_joey_graph.png",plot=sentiment_joey_graph,width = 11, height = 8)
ggsave("sentiment_monica_graph.png",plot=sentiment_monica_graph,width = 11, height = 8)
ggsave("sentiment_ross_graph.png",plot=sentiment_ross_graph,width = 11, height = 8)
ggsave("sentiment_chandler_graph.png",plot=sentiment_chandler_graph,width = 11, height = 8)
ggsave("sentiment_phoebe_graph.png",plot=sentiment_phoebe_graph,width = 11, height = 8)
*This step was necessary because the code takes a long time to run, so the graphs are stored as pre-rendered images.
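The saved PNG files can then be embedded back into the report without re-running the analysis, for example (a sketch, using the file names generated above):

```r
# Embed a pre-rendered sentiment graph into the R Markdown report
knitr::include_graphics("sentiment_monica_graph.png")
```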
Code to generate our database:
# load the libraries
library(rvest)
library(tidyverse)
library(tidytext)
library(SnowballC)
library(xml2)
library(urltools)
library(plumber)
library(stringr)
setwd("C:/Users/rapha/Desktop/ALL_Seasons")
dir<-"C:/Users/rapha/Desktop/ALL_Seasons"
# get a list of all the HTML files in the directory
files<-list.files(path=dir, pattern='html', full.names = TRUE)
# create empty lists that will store the data frames we will merge afterwards
dialog_list = list()
didascalia_list = list()
opening_scene_list = list()
title_list = list()
scene_list = list()
# loop over the files in our directory
for (file in files){
#load the file html
htmlencod = read_html(x = file, encoding = "UTF-8")
# scrape the body paragraph text
dialog = htmlencod %>%
html_nodes("body p") %>%
html_text()
# scrape the title of the episode
title = htmlencod %>%
html_nodes("h1") %>%
html_text()
# create the titles dataframe and store the number of the episode as a key to join to other tables
title = data.frame(title,stringsAsFactors = FALSE)
title["episode"]=str_sub(file, start= -9,end= -6)
title = title %>% unnest_tokens(output = "words",input = title, token = "words" )
title_list[[file]] = title
# create the dialogs dataframe and store the number of the episode as a key to join to other tables
dial = data.frame(dialog, stringsAsFactors = FALSE)
dial = separate(data = dial, col = dialog, into = c("speaker", "script"), sep = "\\:")
dial["episode"] = str_sub(file, start= -9,end= -6)
# create the opening scene dataframe and store the number of the episode as a key to join to other tables
# this table is created from the dial dataframe
opening_scene = dial[2,]
opening_scene$script = str_c(opening_scene$speaker, opening_scene$script)
opening_scene = select(opening_scene,-1)
opening_scene = opening_scene %>% unnest_tokens(output = "words",input = script, token = "words" )
opening_scene_list[[file]] = data.frame(opening_scene)
# create the didascalia dataframe and store the number of the episode as a key to join to other tables
# this table is built from the rows of the dial dataframe whose script column is NA (paragraphs without a "speaker:" separator)
didascalia = data.frame(dial[is.na(dial$script),]$"speaker",stringsAsFactors = FALSE)
didascalia["episode"] = str_sub(file, start= -9,end= -6)
colnames(didascalia)= c("didascalia","episode")
# all the scenes are separated from the rest of the body text by square brackets, so this table is
# extracted from the didascalia data frame
scene = didascalia[str_detect(didascalia$didascalia, "\\["), ]
colnames(scene) = c("scene_description", "episode")
scene = scene %>% unnest_tokens(output = "words",input = scene_description, token = "words" )
scene_list[[file]] = scene
# the scene-related rows are then removed from the didascalia table to avoid double entries
didascalia = didascalia[!str_detect(didascalia$didascalia, "\\["), ]
didascalia = didascalia %>% unnest_tokens(output = "words",input = didascalia, token = "words" )
didascalia_list[[file]] = didascalia
# clean the dialogue dataframe
dial = drop_na(dial)
dial = dial[-c(1,2),]
dial = dial %>% unnest_tokens(output = "words",input = script, token = "words" )
dialog_list[[file]]= dial
}
# All the tables from our lists are then merged to form one large dataset per category, all joinable by episode.
df_title = bind_rows(title_list)
df_dial = bind_rows(dialog_list)
df_opening_scene = bind_rows(opening_scene_list)
df_didascalia = bind_rows(didascalia_list)
df_scenes = bind_rows(scene_list)
# keep the SMART stop-word lexicon from tidytext's stop_words
stop_words = stop_words %>%
filter(lexicon == "SMART")
# anti-join by word to remove non-meaningful (stop) words
df_title = df_title %>%
anti_join(y = stop_words,
by = c("words" = "word"))
df_dial = df_dial %>%
anti_join(y = stop_words,
by = c("words" = "word"))
df_opening_scene = df_opening_scene %>%
anti_join(y = stop_words,
by = c("words" = "word"))
df_didascalia = df_didascalia %>%
anti_join(y = stop_words,
by = c("words" = "word"))
df_scenes = df_scenes %>%
anti_join(y = stop_words,
by = c("words" = "word"))
# --------------------------------------------------------------------------------
# identification of the words that pollute our dataset
unique(df_dial$words)
df_dial %>% count(words)
df_didascalia %>% count(words)
df_opening_scene %>% count(words)
df_title %>% count(words)
df_scenes %>% count(words)
# numbers
# words of one letter
# weird spelling
# combination of letters and numbers
# ...
# -------------------------------------------------------------------------------
# replace entries that contain digits with NA
df_dial$words = gsub('[[:digit:]]+',NA , df_dial$words)
df_didascalia$words = gsub('[[:digit:]]+',NA , df_didascalia$words)
df_opening_scene$words = gsub('[[:digit:]]+',NA , df_opening_scene$words)
df_scenes$words = gsub('[[:digit:]]+',NA , df_scenes$words)
df_title$words = gsub('[[:digit:]]+',NA , df_title$words)
# drop empty rows generated before
df_dial=drop_na(df_dial)
df_opening_scene=drop_na(df_opening_scene)
df_didascalia=drop_na(df_didascalia)
df_scenes=drop_na(df_scenes)
df_title=drop_na(df_title)
# drop all the rows with two-character entries (treated as noise)
df_dial = filter(df_dial,nchar(df_dial$words)!=2)
df_didascalia = filter(df_didascalia,nchar(df_didascalia$words)!=2)
df_opening_scene = filter(df_opening_scene,nchar(df_opening_scene$words)!=2)
df_scenes = filter(df_scenes,nchar(df_scenes$words)!=2)
df_title = filter(df_title,nchar(df_title$words)!=2)
# remove dialogue rows whose speaker field is actually a stage direction, then normalise speaker names to lowercase
df_dial=df_dial[!str_detect(df_dial$speaker, "\\["), ]
df_dial=df_dial[!str_detect(df_dial$speaker, "\\("), ]
df_dial$speaker = tolower(df_dial$speaker)
# what's next?
# The cleaning of our dataset is well advanced; we can now use it to perform analysis.
# The next step is to analyse our dataset and provide useful insights to someone who would write
# another season of Friends.