The question asked throughout this project was “Does what you say reflect who you are?” The goal is to examine what characters say in each of the three Lord of the Rings movies and how their language changes. Language is a medium through which characters communicate their personalities and motives. Does what they say accurately depict who they are? For example: is Sam a good character? Many would say yes, but does this match his speech? A related question is whether a character’s speech is one way to signal their character development. One hope for this project is to develop a program that tracks a character’s growth through the sentiment of their language.
The data for this project was found at https://www.kaggle.com/datasets/paultimothymooney/lord-of-the-rings-data?select=lotr_characters.csv. Kaggle is a data competition website, where another data scientist collected and analyzed this data and shared his dataset for others’ use. In literature classes we talk about joining a conversation and responding to what we are reading, and this is not confined to one subject! In mathematics we do the same thing: we look at what others have learned, and their observations spark questions and elicit responses worthy of research and thought. We have a tendency to draw distinctions between fields, but boiled down we are more alike than we think.
library(dplyr)    # %>%, mutate, filter, group_by, summarise
library(stringr)  # str_replace_all
data <- read.csv('https://raw.githubusercontent.com/nicabey/LOTR/main/lotr_characters.csv?token=GHSAT0AAAAAABS2BA62KNN5FIJYTLUO2HLAYSEV5XA')
scriptsdata <- read.csv('https://raw.githubusercontent.com/nicabey/LOTR/main/lotr_scripts.csv?token=GHSAT0AAAAAABS2BA63TGRMJVAXOAI52E46YSEV6IA')
In the movie scripts data, there is some unnecessary differentiation in the speaker labels. Movie scripts sometimes mark a character’s voice-over separately from the character, e.g. “Frodo Voice-over” versus Frodo speaking to another character. Both are things the character is saying, so by removing unnecessary identifiers such as “voice over” we attribute more lines to each character.
# Strip " VOICE OVER", " VOICE-OVER", " VOICEOVER" and " VOICE" from the speaker labels in one pass
scriptsdata <- scriptsdata %>%
  mutate(char = str_replace_all(char, " VOICE([ -]?OVER)?", ""))
Whether a line is good or bad is relative, but there is a way to automate the judgment: sentiment analysis. Sentiment analysis is a technique that looks at strings of text and assigns them a positive, negative, or neutral value. Businesses often use it to examine consumer feedback, e.g. social media posts and product reviews. The goal in this section is to examine the lines each character says in “The Fellowship of the Ring”, “The Two Towers”, and “The Return of the King” and determine each character’s sentiment.
library(SentimentAnalysis)  # provides analyzeSentiment()
statement <- "I am very nervous about presenting in class"
statement2 <- "I am excited for graduation next week"
t(analyzeSentiment(c(statement, statement2))[,1:4])
## [,1] [,2]
## WordCount 3.0000000 4.0
## SentimentGI -0.3333333 0.5
## NegativityGI 0.3333333 0.0
## PositivityGI 0.0000000 0.5
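We can see in the above example two statements with different sentiments. The first statement receives a negative SentimentGI score (about -0.33), while the second receives a positive one (0.5).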
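As a rough illustration of how a dictionary-based score of this kind can arise, the sketch below counts positive and negative words and divides by the number of words kept after dropping stop words. The word lists and the stop-word list are invented for this example and chosen only so that the arithmetic reproduces the numbers above; the real package relies on a much larger dictionary and its own preprocessing, so `toy_sentiment` is a hypothetical stand-in, not the package's implementation.

```r
# Toy word lists, invented for this illustration (not the real dictionary).
positive_words <- c("excited", "graduation")
negative_words <- c("nervous")
stop_words     <- c("i", "am", "very", "about", "in", "for")

toy_sentiment <- function(text) {
  # Lower-case, split on anything that is not a letter, drop empty strings.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stop_words)]
  n   <- length(words)
  pos <- sum(words %in% positive_words) / n   # share of positive words
  neg <- sum(words %in% negative_words) / n   # share of negative words
  c(WordCount = n, Sentiment = pos - neg, Negativity = neg, Positivity = pos)
}

toy_sentiment("I am very nervous about presenting in class")
toy_sentiment("I am excited for graduation next week")
```

The score is a ratio, so a short line containing a single dictionary word can reach +1 or -1, which is worth keeping in mind when reading the one-word lines later on.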
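# Assumption: sentiment1 starts as the per-line scores of the dialog column (this step is not shown in the original)
sentiment1 <- analyzeSentiment(scriptsdata$dialog)
sentiment1 <- cbind(scriptsdata, sentiment1)
t(sentiment1[10,])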
## 10
## X "9"
## char "SMEAGOL"
## dialog " My precious. "
## movie "The Return of the King "
## WordCount "1"
## SentimentGI "1"
## NegativityGI "0"
## PositivityGI "1"
## SentimentHE "0"
## NegativityHE "0"
## PositivityHE "0"
## SentimentLM "0"
## NegativityLM "0"
## PositivityLM "0"
## RatioUncertaintyLM "0"
## SentimentQDAP "1"
## NegativityQDAP "0"
## PositivityQDAP "1"
head(sentiment1[sentiment1$movie == "The Fellowship of the Ring ", ])
## X char
## 1401 1400 MERRY
## 1402 1401 STRIDER
## 1403 1402 FRODO
## 1404 1403 STRIDER
## 1405 1404 FRODO
## 1406 1405 STRIDER
## dialog
## 1401 What are they eating when they can't get hobbit ?
## 1402
## 1403 Who is she ? This woman you sing of ?
## 1404 Tis the Lady of L'thien. The Elf Maiden who gave her love to Beren ... a mortal
## 1405 What happened to her?
## 1406 She died.
## movie WordCount SentimentGI NegativityGI
## 1401 The Fellowship of the Ring 4 -0.2500000 0.25
## 1402 The Fellowship of the Ring 0 NaN NaN
## 1403 The Fellowship of the Ring 2 0.0000000 0.00
## 1404 The Fellowship of the Ring 9 0.1111111 0.00
## 1405 The Fellowship of the Ring 1 0.0000000 0.00
## 1406 The Fellowship of the Ring 1 -1.0000000 1.00
## PositivityGI SentimentHE NegativityHE PositivityHE SentimentLM
## 1401 0.0000000 0 0 0 0
## 1402 NaN NaN NaN NaN NaN
## 1403 0.0000000 0 0 0 0
## 1404 0.1111111 0 0 0 0
## 1405 0.0000000 0 0 0 0
## 1406 0.0000000 0 0 0 0
## NegativityLM PositivityLM RatioUncertaintyLM SentimentQDAP NegativityQDAP
## 1401 0 0 0 0.0000000 0.0
## 1402 NaN NaN NaN NaN NaN
## 1403 0 0 0 -0.5000000 0.5
## 1404 0 0 0 0.1111111 0.0
## 1405 0 0 0 0.0000000 0.0
## 1406 0 0 0 -1.0000000 1.0
## PositivityQDAP
## 1401 0.0000000
## 1402 NaN
## 1403 0.0000000
## 1404 0.1111111
## 1405 0.0000000
## 1406 0.0000000
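Row 1402 has an empty dialog, so its word count is 0 and every score is a 0/0 division, which R reports as NaN. The small sketch below, using made-up numbers, shows why the averages that follow pass `na.rm = TRUE`.

```r
# Made-up counts for three lines; the second line has no scorable words.
positive_hits <- c(1, 0, 0)
negative_hits <- c(0, 0, 1)
word_count    <- c(4, 0, 2)

scores <- (positive_hits - negative_hits) / word_count
scores                       # the second element is NaN because 0 / 0 is undefined

mean(scores)                 # NaN: one undefined score makes the whole mean undefined
mean(scores, na.rm = TRUE)   # averages only the defined scores
```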
sentiment1 %>%
filter(movie == "The Fellowship of the Ring ") %>%
group_by(char)%>%
summarise(mean(SentimentGI, na.rm = TRUE)) %>%
head(10)
## # A tibble: 10 × 2
## char `mean(SentimentGI, na.rm = TRUE)`
## <chr> <dbl>
## 1 ARAGORN 0.0494
## 2 ARWEN 0.00725
## 3 BARLIMAN 0.0374
## 4 BILBO 0.105
## 5 BOROMIR 0.00343
## 6 CHILDREN HOBBITS 0
## 7 CROWD 0
## 8 ELROND -0.0309
## 9 FARMER MAGGOT -0.333
## 10 FRODO 0.0634
Now that each line has been scored as positive or negative, can we see whether the characters change over time?
sentiment1 %>%
group_by(char,movie)%>%
summarise(mean(SentimentGI, na.rm = TRUE)) %>%
arrange(char,match(movie, c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))) %>%
head(7)
## `summarise()` has grouped output by 'char'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 3
## # Groups: char [5]
## char movie `mean(SentimentGI, na.rm = TRUE)`
## <chr> <chr> <dbl>
## 1 " GANDALF" "The Return of the King " -0.417
## 2 "(GOLLUM" "The Return of the King " 0
## 3 "ARAGORN" "The Fellowship of the Ring " 0.0494
## 4 "ARAGORN" "The Two Towers " -0.00507
## 5 "ARAGORN" "The Return of the King " 0.0864
## 6 "ARAGORN " "The Two Towers " 0.125
## 7 "ARGORN" "The Two Towers " 0.125
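The output above lists the same character under several labels (" GANDALF", "ARAGORN ", "ARGORN", "(GOLLUM"), which splits their lines across groups. One possible cleanup, sketched here in base R, trims whitespace, strips stray punctuation, and maps known misspellings to a canonical name. The helper `clean_char` and the `fixes` lookup are hypothetical names introduced for this illustration, and only the misspelling visible in the output is listed; a full pass would need to inspect every distinct label.

```r
# Hypothetical lookup of misspelled labels -> canonical names.
fixes <- c("ARGORN" = "ARAGORN")

clean_char <- function(x) {
  x <- toupper(trimws(x))              # drop leading/trailing spaces
  x <- gsub("[^A-Z' -]", "", x)        # remove stray characters such as "("
  x <- trimws(x)
  known <- x %in% names(fixes)
  x[known] <- unname(fixes[x[known]])  # replace known misspellings
  x
}

labels <- c(" GANDALF", "(GOLLUM", "ARAGORN", "ARAGORN ", "ARGORN")
clean_char(labels)
table(clean_char(labels))
```

Applied to the real data, this would be something like `scriptsdata$char <- clean_char(scriptsdata$char)` before the lines are grouped by character.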
frodosarc <- sentiment1 %>%
arrange(match(movie, c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))) %>%
filter(char == "FRODO") %>%
select(SentimentGI) %>%
filter(!is.na(SentimentGI))
frodosarc$TalkingSequence <- seq_len(nrow(frodosarc))  # number the lines without hard-coding the row count
library(ggplot2)
frodosarc %>%
ggplot(mapping = aes(y = SentimentGI, x = TalkingSequence)) +
geom_line()
This is all over the place. Why? It looks like the scripts data is not as clean as I’d hoped. Some of the lines are out of order: the entire Fellowship of the Ring script is spliced throughout the Return of the King rows. In a perfect world, we would hope to see each character’s sentiment shift, in either a negative or a positive direction, as the scripts progress.
# The comparison is vectorized, so rowwise() is not needed
sentiment1 <- sentiment1 %>%
  mutate(Positive = SentimentGI > 0)
samschange <- sentiment1 %>%
arrange(match(movie, c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))) %>%
filter(char == "SAM") %>%
select(SentimentGI) %>%
filter(!is.na(SentimentGI))
samschange$TalkingSequence <- seq_len(nrow(samschange))  # number the lines without hard-coding the row count
samschange %>%
ggplot(mapping = aes(y = SentimentGI, x = TalkingSequence)) +
geom_line()
Having categorized each character’s statements as positive or negative, I am curious to see how they change over the course of Tolkien’s story. Because no conclusion could be drawn from the previous section, let’s approach things in a new way. Since the data is out of order, we cannot follow the change within a character line by line. Instead, let’s look at things more generally. I’ve chosen to take the mean sentiment of a specific character in each of the three movies. From this I can draw a conclusion such as “Frodo has a strong positive sentiment in The Fellowship of the Ring” rather than reading wayward data. This is illustrated graphically for Frodo, Sam, and our beloved Boromir below.
frodoavg <- sentiment1 %>%
filter(char == 'FRODO') %>%
arrange(match(movie, c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))) %>%
group_by(movie) %>%
summarise(meanSGI = mean(SentimentGI, na.rm = TRUE))
frodoavg$movie <- factor(frodoavg$movie, levels = c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))
ggplot(frodoavg, aes(x = movie, y = meanSGI)) +
geom_col()
samavg <- sentiment1 %>%
filter(char == 'SAM') %>%
arrange(match(movie, c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))) %>%
group_by(movie) %>%
summarise(meanSGI = mean(SentimentGI, na.rm = TRUE))
samavg$movie <- factor(samavg$movie, levels = c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))
ggplot(samavg, aes(x = movie, y = meanSGI)) +
geom_col()
boroavg <- sentiment1 %>%
filter(char == 'BOROMIR') %>%
arrange(match(movie, c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))) %>%
group_by(movie) %>%
summarise(meanSGI = mean(SentimentGI, na.rm = TRUE))
boroavg$movie <- factor(boroavg$movie, levels = c("The Fellowship of the Ring ","The Two Towers ","The Return of the King " ))
ggplot(boroavg, aes(x = movie, y = meanSGI)) +
geom_col()
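The three blocks above differ only in the character name, so the pattern could be wrapped in one function. The sketch below uses base R on a tiny made-up data frame; `movie_means`, `movie_order`, and the toy values are hypothetical, and the real `sentiment1` table would take the place of `toy`.

```r
movie_order <- c("The Fellowship of the Ring ", "The Two Towers ",
                 "The Return of the King ")

# Mean SentimentGI per movie for one character, in release order.
movie_means <- function(df, who) {
  rows <- df[df$char == who & !is.na(df$SentimentGI), ]   # is.na() is also TRUE for NaN
  rows$movie <- factor(rows$movie, levels = movie_order)
  out <- aggregate(SentimentGI ~ movie, data = rows, FUN = mean)
  names(out)[2] <- "meanSGI"
  out
}

# Made-up scores, only to show the shape of the result.
toy <- data.frame(
  char = c("FRODO", "FRODO", "FRODO", "SAM", "FRODO"),
  movie = movie_order[c(1, 1, 2, 1, 3)],
  SentimentGI = c(0.2, 0.0, NaN, 0.5, -0.1)
)
movie_means(toy, "FRODO")
```

The result has one row per movie in which the character has at least one scored line, and it can be passed to `ggplot(aes(x = movie, y = meanSGI)) + geom_col()` in the same way as the blocks above.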
We can see the change in Frodo’s language over the course of the three movies. He begins with a high positive sentiment in the first movie, and his sentiment decreases over time. Knowing that Frodo becomes less positive, the question of why comes to mind. It could be due to the many strains of the journey, such as exhaustion, or to another influence: as the Ring-bearer, he has carried the Ring longest by The Return of the King. More opportunity for response! This is a conversation, but grounding it in measured results helps set aside conclusions driven by personal feelings.
Had there been more time, I would have liked to clean the data further, reducing the errors caused by discrepancies between names and other identifiers, and to reorganize the scripts data into its correct order. Additionally, I found a second data set that identifies each character’s race; with more time we could have classified good and bad races based on their sentiment. In this project I learned much about the importance of data entry and the practice of cleaning up a dirty data set. I also gained more practice with sentiment analysis and with applying it to something I hadn’t in the past. Language matters. It teaches us not only about the characters in this series; in seeing how their language reveals who they are, we can see ourselves in the same way. What we say matters: it has the power to communicate who we are, so speak wisely.