We chose this topic for our final project because we thought it would be fun to learn a type of data analysis that we haven’t had experience with in our classes: sentiment analysis. We were originally inspired by some projects on Kaggle, particularly a Star Wars script analysis by Kaggle user Xavier (see 7. References). We thought it would be interesting to perform a gender analysis on a trilogy or movie series. However, we quickly realized that most of the options we were considering (Star Wars, Lord of the Rings, Harry Potter, etc.) were heavily male-dominated stories. We chose the television show Friends because, unlike many television shows of its time (and even up to today), the lead cast was evenly split between female and male characters. Friends is an American sitcom that premiered in 1994, airing for ten seasons until its finale in 2004. It stars Jennifer Aniston, Courteney Cox, and Lisa Kudrow as lead gals Rachel, Monica, and Phoebe, alongside David Schwimmer, Matthew Perry, and Matt LeBlanc as lead guys Ross, Chandler, and Joey, young adults navigating chaotic and comedic lives in New York City. However, despite its promising gender balance, fans and critics alike can agree that many aspects of the show are problematic or did not age well, particularly regarding its treatment of women and gender roles. By applying sentiment and text analysis to the scripts of this beloved sitcom, we hope to determine whether or not the script of Friends is sexist.
Friends Script source: https://www.kaggle.com/blessondensil294/friends-tv-series-screenplay-script?select=S10E01+Joey+and+Rachel+Kiss.txt
Sentiment analysis lexicon source: https://www.kaggle.com/andradaolteanu/bing-nrc-afinn-lexicons
Unlike many of the data projects we have discussed throughout this semester, sentiment analysis requires more than one data set. At minimum, it requires a text data set and a sentiment lexicon data set. Due to the episodic nature of Friends, we have 41 script data sets (for the 24 episodes of season 1 and the 17 episodes of season 10), each containing 3 variables/features: Character, Line, and Scene. These were originally text files, which we formatted and cleaned in Excel before loading them into R. We have also utilized 3 lexicon data sets throughout our work: Bing, in which each unique word is paired with a binary sentiment (positive or negative); AFinn, which has 2477 unique words/observations and a second column holding values ranging from -5 to 5, corresponding to the intensity of the emotion; and NRC, which has 13.9K words/observations and a second column holding the associated sentiment (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, or trust). The first step of our process was cleaning the text files in Excel and saving them as csv files to load into R. This was a rather tedious process, but due to the utter messiness of certain episode files, cleaning in R would have been complicated beyond the abilities we have developed this semester. In addition to our data sets, we used various R packages for our analysis. Here we call our packages, load our data, and define functions which we adapted from Xavier on Kaggle:
# PACKAGES
library(tidyverse) # data manipulation
library(tm) # text mining
library(wordcloud) # word cloud generator
library(wordcloud2) # word cloud generator
library(tidytext) # text mining for word processing and sentiment analysis
library(reshape2) # reshapes a data frame
library(radarchart) # drawing the radar chart from a data frame
library(RWeka) # data mining tasks
library(scales) # formatting data as percent
library(gridExtra) #for ensemble graphic
library(grid) #for ensemble graphic
# SCRIPT DATA
# season 1 scripts (creates episode1 through episode24)
for (i in 1:24) {
  assign(paste0("episode", i),
         read.csv(paste0("/Users/nikohellman/Desktop/friends scripts/FriendsEp",
                         i, "withScene.csv")))
}
# season 10 scripts (creates s10episode1 through s10episode17)
for (i in 1:17) {
  assign(paste0("s10episode", i),
         read.csv(paste0("/Users/nikohellman/Desktop/friends scripts/S10FriendsEp",
                         i, "withScene.csv")))
}
# Fix an error found in one episode: its scene numbers loaded into a column named X
s10episode11 <- s10episode11 %>%
  mutate(Scene = X) %>%
  select(Character, Line, Scene)
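# Illustrative peek (output omitted) at the script structure described above:
# each row is one line of dialogue with Character, Line, and Scene columns
head(episode1, 3)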
# LEXICONS
bing <- read_csv("/Users/nikohellman/Desktop/sentiment analysis sets/Bing.csv")
nrc <- read_csv("/Users/nikohellman/Desktop/sentiment analysis sets/NRC.csv")
afinn <- read_csv("/Users/nikohellman/Desktop/sentiment analysis sets/Afinn.csv")
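# Quick illustrative spot-check of how the three lexicons differ, assuming each
# stores its terms in a `word` column as in the tidytext versions of these lexicons
bing %>% filter(word == "abandon")   # binary sentiment: "negative"
afinn %>% filter(word == "abandon")  # integer intensity between -5 and 5
nrc %>% filter(word == "abandon")    # one row per associated NRC sentiment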
# FUNCTIONS
# Text transformations
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  # standard English stopwords plus the contractions left behind once punctuation is removed
  v_stopwords <- c(stopwords("english"), c("thats","weve","hes","theres","ive","im",
                                           "will","can","cant","dont","youve","us",
                                           "youre","youll","theyre","whats","didnt"))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, v_stopwords)
  corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
  return(corpus.tmp)
}
# Most frequent terms
frequentTerms <- function(text){
  s.cor <- Corpus(VectorSource(text))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.999)
  m <- as.matrix(s.tdm)
  word_freqs <- sort(rowSums(m), decreasing=TRUE)
  dm <- data.frame(word=names(word_freqs), freq=word_freqs)
  return(dm)
}
# Define bigram tokenizer
tokenizer <- function(x){
  NGramTokenizer(x, Weka_control(min=2, max=2))
}
# Most frequent bigrams
frequentBigrams <- function(text){
  s.cor <- VCorpus(VectorSource(text))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl, control=list(tokenize=tokenizer))
  s.tdm <- removeSparseTerms(s.tdm, 0.999)
  m <- as.matrix(s.tdm)
  word_freqs <- sort(rowSums(m), decreasing=TRUE)
  dm <- data.frame(word=names(word_freqs), freq=word_freqs)
  return(dm)
}
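As a quick sanity check of these helper functions (an illustrative call; output omitted here), they can be run directly on a single episode’s lines:
# Illustrative: top 10 terms and top 10 bigrams in the pilot episode
head(frequentTerms(episode1$Line), 10)
head(frequentBigrams(episode1$Line), 10)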
With our data loaded, we began exploring it through visualizations. We truly could not have done this without the guidance of Xavier on Kaggle and Heather Kitada Smalley, who provided valuable sentiment analysis EDA examples. Included in this report is a selection of our many, many EDA graphics, which were primarily produced for a subset of season 1 episodes: episodes 1, 4, and 8.
# Top 20 characters by number of lines
top.episode1.chars <- as.data.frame(sort(table(episode1$Character), decreasing=TRUE))[1:20,]
top.episode1.chars <- top.episode1.chars %>%
  filter(!(str_detect(Var1, "\\[")))
# flag the top speaker (Monica) for highlighting
ep1.top <- top.episode1.chars %>%
  mutate(topspeak=ifelse(Var1=="Monica", "1", "0"))
ggplot(data=ep1.top, aes(x=Var1, y=Freq, fill=topspeak)) +
scale_fill_manual( values = c( "1"="red", "0"="darkgray" ), guide = "none" )+
geom_bar(stat="identity", colour="black") +
theme_classic()+
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Character(s)", y="Number of lines")
# Most frequent bigrams
episode1.bigrams <- frequentBigrams(episode1$Line)[1:20,]
ggplot(data=episode1.bigrams, aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="chocolate2", colour="black") +
labs(x="Bigram", y="Frequency")+
theme_classic()+
theme(axis.text.x=element_text(angle=45, hjust=1))+
geom_text(x=5, y=13, label="\"Cut cut\" \n is in reference to Rachel \n cutting up her dad's credit cards \n as a sign of independence", fontface='italic', size=3)
# Cumulative lines
episode1char <- episode1
episode1char$linenum <- 1:length(episode1$Line)
episode1char$one <- rep(1, length(episode1$Line))
# Cumulative line counts for each of the six leads
timeline <- episode1char %>%
  filter(Character %in% c("Rachel", "Monica", "Phoebe",
                          "Ross", "Chandler", "Joey")) %>%
  select(-Line) %>%
  group_by(Character) %>%
  mutate(count = cumsum(one)) %>%
  ungroup()
# Graph
ggplot(timeline, aes(x=linenum, y=count, color=Character)) +
  geom_line()
# dialogue from the 3 selected episodes
epset3 <- rbind(episode1, episode4, episode8)
# Top 20 characters by number of lines
top.epset3.chars <- as.data.frame(sort(table(epset3$Character), decreasing=TRUE))[1:20,]
# Visualization
ggplot(data=top.epset3.chars, aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#56B4E9", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Character", y="Number of lines")
# Transform the text to a tidy data structure with one token per row
tokens <- epset3 %>%
mutate(line=as.character(epset3$Line)) %>%
unnest_tokens(word, line)
# Sentiments and frequency associated with each word - nrc
sentiments <- tokens %>%
inner_join(nrc, "word") %>%
count(word, sentiment, sort=TRUE)
# Frequency of each sentiment
ggplot(data=sentiments, aes(x=reorder(sentiment, -n, sum), y=n)) +
geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
labs(x="Sentiment", y="Frequency")+
theme(axis.text.x = element_text(angle=45, hjust=1))
# Top 10 terms for each sentiment
sentiments %>%
group_by(sentiment) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col(aes(fill=sentiment), show.legend=FALSE) +
facet_wrap(~sentiment, scales="free_y") +
labs(y="Frequency", x="Terms") +
coord_flip()
# Sentiment analysis for the main characters
tokens %>%
filter(Character %in% c("Monica","Rachel","Phoebe","Ross","Chandler",
"Joey")) %>%
inner_join(nrc, "word") %>%
count(Character, sentiment, sort=TRUE) %>%
ggplot(aes(x=sentiment, y=n)) +
geom_col(aes(fill=sentiment), show.legend=FALSE) +
facet_wrap(~Character, scales="free_x") +
labs(x="Sentiment", y="Frequency") +
coord_flip()
# Stopwords
mystopwords <- tibble(word=c(stopwords("english"),
c("thats","weve","hes","theres","ive","im",
"will","can","cant","dont","youve","us",
"youre","youll","theyre","whats","didnt")))
# Tokens without stopwords
top.chars.tokens <- epset3 %>%
mutate(line=as.character(epset3$Line)) %>%
filter(Character %in% c("Monica","Rachel","Phoebe","Ross","Chandler",
"Joey")) %>%
unnest_tokens(word, line) %>%
anti_join(mystopwords, by="word")
# Most relevant words for each character
top.chars.tokens %>%
count(Character, word) %>%
bind_tf_idf(word, Character, n) %>%
group_by(Character) %>%
arrange(desc(tf_idf)) %>%
slice(1:10) %>%
ungroup() %>%
mutate(word2=factor(paste(word, Character, sep="__"),
levels=rev(paste(word, Character, sep="__"))))%>%
ggplot(aes(x=word2, y=tf_idf)) +
geom_col(aes(fill=Character), show.legend=FALSE) +
facet_wrap(~Character, scales="free_y") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(y="tf–idf", x="Sentiment") +
scale_x_discrete(labels=function(x) gsub("__.+$", "", x)) +
coord_flip()
Based upon our initial exploratory data analysis and the feedback we received from professors and peers, we cleaned and uploaded the rest of the season 1 episodes for a more complete picture to analyze. Furthermore, we loaded in the season 10 data to see if the show improved upon any sexist issues over its ten years on the air; this also enabled a time-based analysis in certain areas. Lastly, we realized we needed to subset the data by gender, as this was extremely relevant to our research question, and, most importantly, to refine our research by identifying particular measures of “sexism.” Implementing some of these changes, we produced additional EDA visuals:
# all season 1 episode dialogue
epset <- rbind(episode1, episode2, episode3, episode4, episode5, episode6, episode7,
episode8, episode9, episode10, episode11, episode12, episode13, episode14,
episode15, episode16, episode17, episode18, episode19, episode20, episode21,
episode22, episode23, episode24)
# How many lines of dialogue?
length(epset$Line)
## [1] 6369
# How many characters?
length(levels(as.factor(epset$Character)))
## [1] 183
# Top 20 characters by number of lines
top.epset.chars <- as.data.frame(sort(table(epset$Character), decreasing=TRUE))[1:20,]
top.epset.chars<- top.epset.chars%>%
filter(!(str_detect(Var1, "\\[")))
# Visualization
ggplot(data=top.epset.chars, aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#56B4E9", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Character", y="Number of lines")
# Most frequent bigrams
epset.bigrams <- frequentBigrams(epset$Line)[1:20,]
ggplot(data=epset.bigrams, aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="chocolate2", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Bigram", y="Frequency")
# all season 10 episode dialogue
s10epset <- rbind(s10episode1, s10episode2, s10episode3, s10episode4, s10episode5, s10episode6,
s10episode7, s10episode8, s10episode9, s10episode10, s10episode11, s10episode12,
s10episode13, s10episode14, s10episode15, s10episode16, s10episode17)
# How many lines of dialogue?
length(s10epset$Line)
## [1] 5566
# How many characters?
length(levels(as.factor(s10epset$Character)))
## [1] 139
# Top 20 characters by number of lines
top.s10epset.chars <- as.data.frame(sort(table(s10epset$Character), decreasing=TRUE))[1:20,]
top.s10epset.chars<- top.s10epset.chars%>%
filter(!(str_detect(Var1, "\\[")))
# Visualization
ggplot(data=top.s10epset.chars, aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="#56B4E9", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Character", y="Number of lines")
# Most frequent bigrams
s10epset.bigrams <- frequentBigrams(s10epset$Line)[1:20,]
ggplot(data=s10epset.bigrams, aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="chocolate2", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Bigram", y="Frequency")
# SEASON 1
girls<- epset%>%
filter(Character %in% c("Monica","Rachel","Phoebe"))
#how many lines for lead girls in s1?
length(girls$Line)
## [1] 2359
boys<- epset%>%
filter(Character %in% c("Joey", "Chandler", "Ross"))
#how many lines for lead boys in s1?
length(boys$Line)
## [1] 2410
#girls most frequent bigrams
girls.bigrams <- frequentBigrams(girls$Line)[1:20,]
ggplot(data=girls.bigrams, aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="chocolate2", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Bigram", y="Frequency")
#boys most freq bigrams
boys.bigrams <- frequentBigrams(boys$Line)[1:20,]
ggplot(data=boys.bigrams, aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="chocolate2", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Bigram", y="Frequency")
#girls positive/negative word cloud
girlstokens <- girls %>%
mutate(line=as.character(girls$Line)) %>%
unnest_tokens(word, line)
girlstokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("#F8766D", "#00BFC4"), max.words=100)
## Joining, by = "word"
# boys positive/negative word cloud
boystokens <- boys %>%
mutate(line=as.character(boys$Line)) %>%
unnest_tokens(word, line)
boystokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("#F8766D", "#00BFC4"), max.words=100)
## Joining, by = "word"
# girls sentiments frequency
girlssentiments <- girlstokens %>%
inner_join(nrc, "word") %>%
count(word, sentiment, sort=TRUE)
ggplot(data=girlssentiments, aes(x=reorder(sentiment, -n, sum), y=n)) +
geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
labs(x="Sentiment", y="Frequency")+
theme(axis.text.x = element_text(angle=45, hjust=1))
#boys sentiments frequency
boyssentiments <- boystokens %>%
inner_join(nrc, "word") %>%
count(word, sentiment, sort=TRUE)
ggplot(data=boyssentiments, aes(x=reorder(sentiment, -n, sum), y=n)) +
geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
labs(x="Sentiment", y="Frequency")+
theme(axis.text.x = element_text(angle=45, hjust=1))
#girls top words associated w each sentiment
girlssentiments %>%
group_by(sentiment) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col(aes(fill=sentiment), show.legend=FALSE) +
facet_wrap(~sentiment, scales="free_y") +
labs(y="Frequency", x="Terms") +
coord_flip()
#boys top words associated w each sentiment
boyssentiments %>%
group_by(sentiment) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col(aes(fill=sentiment), show.legend=FALSE) +
facet_wrap(~sentiment, scales="free_y") +
labs(y="Frequency", x="Terms") +
coord_flip()
# SEASON 10
s10girls<- s10epset%>%
filter(Character %in% c("Monica","Rachel","Phoebe"))
# how many lines for lead girls in s10?
length(s10girls$Line)
## [1] 2157
s10boys<- s10epset%>%
filter(Character %in% c("Joey", "Chandler", "Ross"))
# how many lines for lead boys in s10?
length(s10boys$Line)
## [1] 2295
#girls most frequent bigrams
s10girls.bigrams <- frequentBigrams(s10girls$Line)[1:20,]
ggplot(data=s10girls.bigrams, aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="chocolate2", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Bigram", y="Frequency")
#boys most freq bigrams
s10boys.bigrams <- frequentBigrams(s10boys$Line)[1:20,]
ggplot(data=s10boys.bigrams, aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="chocolate2", colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Bigram", y="Frequency")
#girls positive/negative word cloud
s10girlstokens <- s10girls %>%
mutate(line=as.character(s10girls$Line)) %>%
unnest_tokens(word, line)
s10girlstokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("#F8766D", "#00BFC4"), max.words=100)
## Joining, by = "word"
# boys positive/negative word cloud
s10boystokens <- s10boys %>%
mutate(line=as.character(s10boys$Line)) %>%
unnest_tokens(word, line)
s10boystokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("#F8766D", "#00BFC4"), max.words=100)
## Joining, by = "word"
# girls sentiments frequency
s10girlssentiments <- s10girlstokens %>%
inner_join(nrc, "word") %>%
count(word, sentiment, sort=TRUE)
ggplot(data=s10girlssentiments, aes(x=reorder(sentiment, -n, sum), y=n)) +
geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
labs(x="Sentiment", y="Frequency")+
theme(axis.text.x = element_text(angle=45, hjust=1))
#boys sentiments frequency
s10boyssentiments <- s10boystokens %>%
inner_join(nrc, "word") %>%
count(word, sentiment, sort=TRUE)
ggplot(data=s10boyssentiments, aes(x=reorder(sentiment, -n, sum), y=n)) +
geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
labs(x="Sentiment", y="Frequency")+
theme(axis.text.x = element_text(angle=45, hjust=1))
#girls top words associated w each sentiment
s10girlssentiments %>%
group_by(sentiment) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col(aes(fill=sentiment), show.legend=FALSE) +
facet_wrap(~sentiment, scales="free_y") +
labs(y="Frequency", x="Terms") +
coord_flip()
#boys top words associated w each sentiment
s10boyssentiments %>%
group_by(sentiment) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col(aes(fill=sentiment), show.legend=FALSE) +
facet_wrap(~sentiment, scales="free_y") +
labs(y="Frequency", x="Terms") +
coord_flip()
Broadly speaking, our research question is simple: Are the Friends scripts sexist? However, sexism is a complex issue, and there are many different ways to assess whether a piece of media, particularly something as multidimensional as a television show, is sexist. To refine our project, we identified the following four metrics to assess the treatment of women and gender roles in the first and last seasons of this series:
Who gets more lines, the guys or the gals?
Are the frequent terms spoken by the guys different from those spoken by the gals? Are the differences inherently gendered?
How does the show describe masculine and feminine identities?
A case study: Despite their similar emotional journeys in season 1, how do Ross and Rachel’s emotions compare?
We explore and assess the script based upon these metrics in the following graphics.
sgdata <- tibble(
  Gender = c("Gals", "Guys", "Gals", "Guys"),
  Season = c("Season 1", "Season 1", "Season 10", "Season 10"),
  LineCount = c(length(girls$Line), length(boys$Line),
                length(s10girls$Line), length(s10boys$Line)),
  LineProp = c(length(girls$Line)/length(epset$Line),
               length(boys$Line)/length(epset$Line),
               length(s10girls$Line)/length(s10epset$Line),
               length(s10boys$Line)/length(s10epset$Line)))
ggplot(sgdata) +
aes(x = as.factor(Season), y = LineProp,
group = Gender, color=Gender) +
geom_line(size=2)+
geom_label(aes(label= scales::percent(LineProp)), size=5)+
theme(axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(size=10, face='bold'),
plot.background = element_blank(),
panel.background = element_blank(),
legend.position = 'none',
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust=0.35, vjust=130, face='italic')) +
scale_color_manual(values=c("indianred","steelblue3"))+
ylab("")+
xlab("")+
ylim(.37,.42)+
labs(title="What percent of the lines do the guys and gals have?"
,
caption="Our data shows that the women leads \n have less lines than their male counterparts \n in propotion to the total lines in the season, \n with the gap widening from season 1 to 10."
)
To explore the question of who gets more lines, we use a slope graph from season 1 to season 10. Observe that in season 1 the lead guys hold a greater proportion of the total lines than the gals do, and this disparity widens by season 10.
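As a supplementary check that was not part of our original analysis, a simple binomial test can ask whether the guys’ share of the leads’ season 1 lines is plausibly an even split; a minimal sketch:
# Sketch (supplementary): among season 1 lines spoken by the six leads,
# does the guys' share differ from a 50/50 split?
binom.test(length(boys$Line),
           length(boys$Line) + length(girls$Line),
           p = 0.5)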
#girls positive/negative word cloud
girlstokens <- girls %>%
mutate(line=as.character(girls$Line)) %>%
unnest_tokens(word, line)
girlstokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
bind_tf_idf(word, sentiment, n) %>%
filter(word!='like' & word!='well')%>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("brown4", "red"), max.words=100)
## Joining, by = "word"
# boys positive/negative word cloud
boystokens <- boys %>%
mutate(line=as.character(boys$Line)) %>%
unnest_tokens(word, line)
boystokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
bind_tf_idf(word, sentiment, n) %>%
filter(word!='like' & word!='well')%>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("midnightblue", "steelblue2"), max.words=100)
## Joining, by = "word"
#girls positive/negative word cloud season 10
s10girlstokens <- s10girls %>%
mutate(line=as.character(s10girls$Line)) %>%
unnest_tokens(word, line)
s10girlstokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
bind_tf_idf(word, sentiment, n) %>%
filter(word!='like' & word!='well')%>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("brown4", "red"), max.words=100)
## Joining, by = "word"
# boys positive/negative word cloud season 10
s10boystokens <- s10boys %>%
mutate(line=as.character(s10boys$Line)) %>%
unnest_tokens(word, line)
s10boystokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
bind_tf_idf(word, sentiment, n) %>%
filter(word!='like' & word!='well')%>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("midnightblue", "steelblue2"), max.words=100)
## Joining, by = "word"
To explore the question of how the guys and gals use different language, we used positive/negative sentiment word clouds. While language is generally consistent across the genders of the leads, the most prominent difference is that the gals say “sorry” much more than the guys. This is inherently gendered in our society, as women are typically expected to apologize and to take on more emotional responsibility than men under traditional gender roles.
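As a rough check behind this word-cloud observation (a sketch counting raw token occurrences, with no normalization for total lines):
# How many times do the gals versus the guys say "sorry" in season 1?
sum(girlstokens$word == "sorry")
sum(boystokens$word == "sorry")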
totaleps<- rbind(epset,s10epset)
# all bigrams for masc
totalmascbigrams <- totaleps %>%
  unnest_tokens(bigram, Line, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(word2 %in% c("boy","Boy","boys","Boys","man","Man","men","Men",
                      "guy","Guy","guys","Guys")) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  # drop bigrams hand-identified as noise rather than descriptors
  filter(!bigram %in% c("hey guys","kinda guy","wrong guy","guys guys","300 guys",
                        "alright boys","bed guys","boys boys","huh guy","bye guys",
                        "issue guys","live boy","win guy","whoah boy",
                        "what're'you guys","lives guy","ooh boy","listen guys"))
# all bigrams for fem
totalfembigrams <- totaleps %>%
  unnest_tokens(bigram, Line, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(word2 %in% c("girl","Girl","girls","Girls","woman","Woman","women","Women",
                      "lady","Lady","ladies","Ladies")) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  # drop bigrams hand-identified as noise rather than descriptors
  filter(!bigram %in% c("likes women","close women","leaving girls","kills women",
                        "mind women","listen lady","slash woman","woman ladies",
                        "ready ladies","people woman","mother ladies","girls girls"))
totalfembigrams <- totalfembigrams[1:6,]
totalmascbigrams <- totalmascbigrams[1:10,]
#plotting
a<-ggplot(data=totalfembigrams, aes(x=reorder(bigram,n), y=n)) +
geom_bar(stat="identity", fill="indianred") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
labs(x="Bigram", y="Frequency")+
coord_flip()+
theme_classic()+
theme(
axis.title=element_blank()
)
b<-ggplot(data=totalmascbigrams, aes(x=reorder(bigram,n), y=n)) +
geom_bar(stat="identity", fill="steelblue3") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
coord_flip()+
theme_classic()+
theme(
axis.title=element_blank()
)
plots = list(a,b)
#ensemble
# plotmath expressions
top<- textGrob(expression(bold("Gendered Bigrams")), gp = gpar(fontsize = 20) )
yleft <- textGrob("Bigram",
rot = 90, gp = gpar(fontsize = 20))
bottom <- textGrob("Total Times Used", gp = gpar(fontsize = 20))
# Lay out plots
g<-grid.arrange(grobs=plots, ncol = 2, nrow = 1,
left=yleft, bottom = bottom, top=top)
To explore the question of how the scripts describe individuals based on their gender, we filtered the script for bigrams in which the second word was a masculine or feminine noun (e.g., boy/guy/man versus girl/lady/woman). We found that the men are described with far more diverse adjectives than the women, with nearly double the number of distinct bigrams. The men’s descriptors were either negative (creepy, fat, ugly) or career-based (“wine guy” is the wine supplier at Monica’s restaurant, etc.), whereas the majority of the women’s descriptors were appearance-based.
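For transparency, the counts behind the ensemble graphic can be inspected directly (output omitted here):
# Raw bigram counts underlying the gendered-bigrams graphic
totalmascbigrams
totalfembigrams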
Our final metric involved looking through season 1 for climactic moments between Ross and Rachel. We identified episodes 1, 2, 5, 7, 12, 14, 19, and 24 for our case study.
#####EP 1
# Transform the text to a tidy data structure with one token per row
ftokens <- episode1 %>%
mutate(line=as.character(episode1$Line)) %>%
unnest_tokens(word, line)
# Positive and negative words - bing
ftokens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
acast(word ~ sentiment, value.var="n", fill=0) %>%
comparison.cloud(colors=c("#F8766D", "#00BFC4"), max.words=100)
# Sentiments and frequency associated with each word - nrc
fsentiments <- ftokens %>%
inner_join(nrc, "word") %>%
count(word, sentiment, sort=TRUE)
# Frequency of each sentiment
ggplot(data=fsentiments, aes(x=reorder(sentiment, -n, sum), y=n)) +
geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
labs(x="Sentiment", y="Frequency")+
theme(axis.text.x = element_text(angle=45, hjust=1))
fsentiments %>%
group_by(sentiment) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col(aes(fill=sentiment), show.legend=FALSE) +
facet_wrap(~sentiment, scales="free_y") +
labs(y="Frequency", x="Terms") +
coord_flip()
# Sentiment analysis for the main characters
ftokens %>%
filter(Character %in% c("Rachel", "Ross")) %>%
inner_join(nrc, "word") %>%
count(Character, sentiment, sort=TRUE) %>%
ggplot(aes(x=sentiment, y=n)) +
geom_col(aes(fill=sentiment), show.legend=FALSE) +
facet_wrap(~Character, scales="free_x") +
labs(x="Sentiment", y="Frequency") +
coord_flip()
casestudysentiments1<- ftokens%>%
filter(Character %in% c("Rachel", "Ross")) %>%
inner_join(nrc, "word") %>%
count(Character, sentiment, sort=TRUE)%>%
mutate(ep="1")
# Stopwords
mystopwords <- tibble(word=c(stopwords("english"),
c("thats","weve","hes","theres","ive","im",
"will","can","cant","dont","youve","us",
"youre","youll","theyre","whats","didnt")))
# Tokens without stopwords
Randr.ftokens <- episode1 %>%
mutate(line=as.character(episode1$Line)) %>%
filter(Character %in% c("Rachel", "Ross")) %>%
unnest_tokens(word, line) %>%
anti_join(mystopwords, by="word")
# Most frequent words for each character
Randr.ftokens %>%
count(Character, word) %>%
group_by(Character) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
ungroup() %>%
mutate(word2=factor(paste(word, Character, sep="__"),
levels=rev(paste(word, Character, sep="__"))))%>%
ggplot(aes(x=word2, y=n)) +
geom_col(aes(fill=Character), show.legend=FALSE) +
facet_wrap(~Character, scales="free_y") +
labs(x="Sentiment", y="Frequency") +
scale_x_discrete(labels=function(x) gsub("__.+$", "", x)) +
coord_flip()
##### EPISODES 2, 5, 7, 12, 14, 19, 24
# The remaining case-study episodes repeat the episode 1 pipeline exactly, so we
# wrap it in a helper: draw the bing word cloud and nrc sentiment plots, then
# return Ross and Rachel's nrc sentiment counts tagged with the episode number
episodeCaseStudy <- function(episode, epnum){
  # Transform the text to a tidy data structure with one token per row
  ftokens <- episode %>%
    mutate(line=as.character(Line)) %>%
    unnest_tokens(word, line)
  # Positive and negative words - bing
  ftokens %>%
    inner_join(get_sentiments("bing")) %>%
    count(word, sentiment, sort=TRUE) %>%
    acast(word ~ sentiment, value.var="n", fill=0) %>%
    comparison.cloud(colors=c("#F8766D", "#00BFC4"), max.words=100)
  # Sentiments and frequency associated with each word - nrc
  fsentiments <- ftokens %>%
    inner_join(nrc, "word") %>%
    count(word, sentiment, sort=TRUE)
  # Frequency of each sentiment
  print(ggplot(data=fsentiments, aes(x=reorder(sentiment, -n, sum), y=n)) +
          geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
          labs(x="Sentiment", y="Frequency") +
          theme(axis.text.x = element_text(angle=45, hjust=1)))
  # Top 10 terms for each sentiment
  print(fsentiments %>%
          group_by(sentiment) %>%
          arrange(desc(n)) %>%
          slice(1:10) %>%
          ggplot(aes(x=reorder(word, n), y=n)) +
          geom_col(aes(fill=sentiment), show.legend=FALSE) +
          facet_wrap(~sentiment, scales="free_y") +
          labs(y="Frequency", x="Terms") +
          coord_flip())
  # Sentiment analysis for Ross and Rachel
  randr <- ftokens %>%
    filter(Character %in% c("Rachel", "Ross")) %>%
    inner_join(nrc, "word") %>%
    count(Character, sentiment, sort=TRUE)
  print(ggplot(randr, aes(x=sentiment, y=n)) +
          geom_col(aes(fill=sentiment), show.legend=FALSE) +
          facet_wrap(~Character, scales="free_x") +
          labs(x="Sentiment", y="Frequency") +
          coord_flip())
  # Return Ross and Rachel's sentiment counts, tagged with the episode number
  randr %>% mutate(ep=epnum)
}
casestudysentiments2 <- episodeCaseStudy(episode2, "2")
casestudysentiments5 <- episodeCaseStudy(episode5, "5")
casestudysentiments7 <- episodeCaseStudy(episode7, "7")
casestudysentiments12 <- episodeCaseStudy(episode12, "12")
casestudysentiments14 <- episodeCaseStudy(episode14, "14")
casestudysentiments19 <- episodeCaseStudy(episode19, "19")
casestudysentiments24 <- episodeCaseStudy(episode24, "24")
casestudysentiments <- rbind(casestudysentiments1, casestudysentiments2,
                             casestudysentiments5, casestudysentiments7,
                             casestudysentiments12, casestudysentiments14,
                             casestudysentiments19, casestudysentiments24)
#make it easier to read
level_order <- c("1", "2", "5","7","12","14","19","24")
#plot
casestudysentiments<-casestudysentiments%>%
filter(sentiment %in% c("fear", "joy","trust", "positive"))
emotions <- c(
fear = "Fear",
joy = "Joy",
trust = "Trust",
positive = "Positive"
)
ggplot(casestudysentiments,aes(x=factor(ep, level = level_order), y=n, color=Character))+
geom_line(data=casestudysentiments, aes(group=Character), size=1)+
facet_grid(sentiment ~ ., labeller = as_labeller(emotions))+
labs(title="Sentiment Analysis of Rachel and Ross in Season 1", x="Episode Number", y="Frequency")+
theme_classic()+
scale_color_manual(values = c("Rachel" = "indianred", "Ross" = "steelblue3"))+
theme(
title=element_text(face="bold") )
For our last metric, we chose a specific narrative to explore the dynamics between the leading male and female characters: Ross and Rachel. Their relationship is a central plot line throughout the ten seasons of the show. In the first season, they follow comparable emotional arcs: in the first episode, Rachel has just run away from her wedding to a man she didn’t love, and Ross has recently divorced his wife after discovering she was gay. They both develop feelings for one another while also navigating their complex romantic lives. However, this graphic shows that despite their similar emotional plot lines, Rachel’s lines express more intense emotion than Ross’s. This feeds into the societal expectation that women are generally more emotional than men.
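To summarize this case study numerically (a sketch over the four sentiments retained above), we can total each character’s emotion-word counts across the eight episodes:
# Total emotion-word counts per character across the case-study episodes
casestudysentiments %>%
  group_by(Character, sentiment) %>%
  summarise(total = sum(n)) %>%
  ungroup() %>%
  arrange(sentiment, desc(total))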
Based upon our four metrics, we can determine that the Friends script is sexist. The leading guys speak more than the gals, the gals say “sorry” (and other gendered terms) more than the guys, the women are typically described by their appearance while the men are more frequently described by their profession, and the female lead Rachel is depicted as more emotional than her counterpart, Ross.
If you’ve seen the show, you probably aren’t surprised. However, regardless of whether you like the show or not, it is important to analyze problematic aspects of pop culture in order to do better in the future. Unfortunately, many sitcoms followed in Friends’ sexist footsteps. By analyzing and identifying these script issues, we can inform and improve future sitcom scripts, as well as encourage audiences to be aware of the implicit messaging put forth by seemingly harmless sexist comedy.
While we learned a lot throughout this project, there are important limitations worth noting.
Sentiment analysis, at least at this level, is not capable of picking up on sarcasm and other tone-related comedy. This is an important limitation, particularly for Friends, as much of the show’s humor relies on sarcasm and tone.
Due to the high-dimensional nature of television and movies, many factors beyond the script determine the representation of gender, such as costumes, physicality, soundtrack, set, line delivery, etc. The script is just one factor, and it’s the only one we could examine through sentiment and text analysis.
Despite these prominent limitations, we believe that sentiment analysis provides valuable insight. We hope that more projects can be done in the future, expanding gender analysis to other shows and movies as well as looking closer at Friends, particularly its treatment of LGBTQ characters, as this is also crucial to holistic gender analysis.
https://www.kaggle.com/andradaolteanu/bing-nrc-afinn-lexicons
https://www.kaggle.com/xvivancos/analyzing-star-wars-movie-scripts/data
https://www.kaggle.com/andradaolteanu/sentiment-analysis-rick-and-morty-scripts
Lastly, thank you to our professors and peers for your feedback, tips, and tricks.