Trends in the arts come and go over time, and can occasionally reflect changes in the national mood as well. By examining the scripts of Oscar Best Picture winners and attaching sentiment values to them, I hope to answer the question: can we see a trend in the sentiment of Oscar-winning screenplays over time? Our data consists of 30 films, the years they were released, and their scripts.
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.4.3
library(textdata)
## Warning: package 'textdata' was built under R version 4.4.3
library(sentimentr)
## Warning: package 'sentimentr' was built under R version 4.4.3
library(reshape2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(stringr)
Data sources:
Read The Screenplay Series from Deadline (scraped)
Oscar Best Picture Movies (dataset)
I initially chose to manually scrape the data from these websites, assuming it would be fairly easy since the scripts are clearly marked. However, I learned after “gathering” all of the scripts that an Excel cell holds only roughly 32,000 characters (the actual limit is 32,767), so each script was truncated on paste. I then had to go back to the scripts, find where each was cut off, and paste the remainder into a new cell every roughly 32,000 characters. If I were to repeat this, I would use a web scraping tool to simplify the process.
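For reference, here is a minimal sketch of what that scraping might look like with rvest. The URL and CSS selector are hypothetical placeholders; the real page structure would need to be inspected first.
# Hypothetical rvest sketch: the URL and selector are placeholders,
# not the actual Deadline page structure
library(rvest)
scrape_script <- function(url) {
  read_html(url) |>
    html_elements("p") |>        # selector would depend on the page layout
    html_text2() |>
    paste(collapse = " ")
}
# script_text <- scrape_script("https://deadline.com/...")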
# Load the script and Oscars metadata
script_data <- read.csv("https://github.com/scrummett/DATA607/raw/refs/heads/main/Script%20test.csv")
oscar_data <- read.csv("https://raw.githubusercontent.com/scrummett/DATA607/refs/heads/main/oscars_df.csv")
colnames(script_data)
## [1] "Film" "Script"
tibble(oscar_data)
Presented are the two separate datasets, both loaded from GitHub. The first contains the manually scraped data: films in one column and their scripts in another, with as much of each script as could fit into a single observation. This reflects the raw data as scraped. The second dataset lists all Oscar Best Picture nominees with details spread over different columns; we will focus on which films won between 1990 and 2020, their Oscar year, and their genre.
At least one character within the data I scraped causes a problem when printing to pdf. I was able to determine that “tibble(script_data)” was the culprit, so I have opted to show “colnames” to confirm the data loaded, along with a screenshot of “tibble(script_data)” from a normal R session to show how the data is formatted. I would otherwise use head() or another preview, but doing so crashed my pdf, I imagine from attempting to show and format so much text.
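If I needed a printable preview anyway, one possible workaround, assuming the offending character is non-ASCII, would be stripping such characters before display. A sketch:
# Sketch of a workaround (assumes the problem character is non-ASCII):
# strip non-ASCII characters so the tibble can print safely
script_preview <- script_data |>
  mutate(Script = iconv(Script, to = "ASCII", sub = ""))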
# Clean and format the Oscars data
oscar_data <- oscar_data |>
mutate(
Oscar.Year = as.numeric(Oscar.Year),
Film = reorder(Film, Oscar.Year)
)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Oscar.Year = as.numeric(Oscar.Year)`.
## Caused by warning:
## ! NAs introduced by coercion
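The coercion warning means at least one “Oscar.Year” entry was not numeric. A quick check, as a sketch, would show which rows became NA:
# Inspect rows whose Oscar.Year failed numeric coercion
oscar_data |>
  filter(is.na(Oscar.Year)) |>
  select(Film, Oscar.Year)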
# Filter to only the categories we need
oscar_data <- oscar_data |>
select(Film,
Oscar.Year,
Movie.Genre
)
oscar_data <- oscar_data |>
separate_rows(Movie.Genre, sep = ",")
tibble(oscar_data)
# Merge the script data with Oscar data by film title
script_data <- merge(script_data, oscar_data, by = "Film")
# Tokenize the script into individual words
script_data <- script_data |>
unnest_tokens(Word, Script) |>
mutate(Film = reorder(Film, Oscar.Year)) |>
arrange(Oscar.Year)
tibble(script_data)
To clean the data, I first made sure the “Oscar.Year” column was numeric so the list could be reordered in ascending order. I then dropped all columns except “Film”, “Oscar.Year”, and “Movie.Genre”, as these are the variables we plan to analyze further on. Additionally, the “Movie.Genre” column lists multiple genres in a single observation, so I separated these into longer format with each genre in a row of its own.
I followed this by merging the two datasets by “Film”. This left only the films whose scripts I scraped, those that won Best Picture from 1990 to 2020. From here, our data is nearly tidy, but the “Script” column is still incredibly long. I therefore unnested the “Script” column so that each individual word becomes its own observation in a new “Word” column.
I chose not to filter out any observations yet as attaching sentiment will filter out most of the superfluous words, and I can then examine each movie for any words that are acting as potential outliers that are not relevant through a word count. Genres now have their own individual row per observation, as do all words from the film’s scripts. As such, I would consider this data nearly tidy, but we must add sentiment values to the data to complete it.
Here I have attached values to each word using three different lexicons, which we will compare. Bing labels words in its lexicon “positive” or “negative”, and NRC does the same with the addition of more descriptive emotions such as “joy” or “anger”. Afinn scores words on an integer scale from -5 (most negative) to 5 (most positive).
bing_lex <- get_sentiments("bing") |> rename(Bing = sentiment)
afinn_lex <- get_sentiments("afinn") |> rename(Afinn = value)
nrc_lex <- get_sentiments("nrc") |> rename(Nrc = sentiment)
script_data <- script_data |>
left_join(bing_lex, by = c("Word" = "word")) |>
left_join(afinn_lex, by = c("Word" = "word")) |>
left_join(nrc_lex, by = c("Word" = "word")) |>
filter(!is.na(Bing) | !is.na(Afinn) | !is.na(Nrc))
## Warning in left_join(script_data, bing_lex, by = c(Word = "word")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 723499 of `x` matches multiple rows in `y`.
## ℹ Row 3946 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## Warning in left_join(left_join(left_join(script_data, bing_lex, by = c(Word = "word")), : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 6 of `x` matches multiple rows in `y`.
## ℹ Row 10480 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
script_data <- script_data |>
mutate(
Bing = as.character(Bing),
Afinn = as.character(Afinn),
Nrc = as.character(Nrc)) |>
pivot_longer(cols = c("Bing",
"Afinn",
"Nrc"),
names_to = "Lexicon",
values_to = "Sentiment")
tibble(script_data)
After attaching values to each word, I joined the three lexicons to the data by the column “Word” and filtered out the words that matched none of them. With the lexicons and their sentiments in one dataset, we can pivot it from wide to long: the lexicon names (Bing, Afinn, NRC) are now stored in one column, with their values in another. With the data long, it is now tidy; however, it is not quite clean despite filtering out the words with no sentiment attached. Next we will filter out any words that might be outliers.
In order to find any outliers, we can conduct a word count to see if there are any words that are disproportionately affecting sentiments. While some words may occur in abundance naturally, I fear that names and proper nouns might be considered “sentimental” while not truly affecting the tone of a film.
bing_filter <- script_data |>
filter(Lexicon == "Bing" & !is.na(Sentiment))
bing_count <- bing_filter |>
group_by(Film, Word, Sentiment) |>
summarise(Count = n(), .groups = "drop")
bing_count |>
group_by(Film) |>
slice_max(order_by = Count, n = 5) |>
ungroup() |>
mutate(Word = reorder_within(Word, Count, Film)) |>
ggplot(aes(x = Word, y = Count, fill = Sentiment)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_x_reordered() +
geom_text(aes(label = Count), hjust = -0.1, size = 3) +
facet_wrap(~Film, scales = "free_y") +
labs(
title = "Top 5 Sentiment Words per Film (Bing Lexicon)",
x = "Word",
y = "Contribution to Sentiment"
)
afinn_filter <- script_data |>
filter(Lexicon == "Afinn" & !is.na(Sentiment))
afinn_count <- afinn_filter |>
group_by(Film, Word) |>
summarise(
n = n(),
value = unique(as.numeric(Sentiment)),
.groups = "drop"
)
afinn_count |>
group_by(Film) |>
slice_max(order_by = n, n = 5, with_ties = FALSE) |>
ungroup() |>
mutate(Word = reorder_within(Word, n, Film)) |>
ggplot(aes(n, Word, fill = value)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = n), hjust = -0.1, size = 3) +
facet_wrap(~Film, scales = "free_y") +
scale_y_reordered() +
labs(
title = "Top 5 Sentiment Words per Film (AFINN Lexicon)",
x = "Contribution to Sentiment",
y = "Word"
)
nrc_filter <- script_data |>
filter(Lexicon == "Nrc" & !is.na(Sentiment))
nrc_count <- nrc_filter |>
group_by(Film, Word, Sentiment) |>
summarise(Count = n(), .groups = "drop")
nrc_count |>
distinct(Film, Word, .keep_all = TRUE) |>
group_by(Film) |>
slice_max(order_by = Count, n = 5) |>
ungroup() |>
mutate(Word = reorder_within(Word, Count, Film)) |>
ggplot(aes(x = Word, y = Count, fill = Sentiment)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_x_reordered() +
geom_text(aes(label = Count), hjust = -0.1, size = 3) +
facet_wrap(~Film, scales = "free_y") +
labs(
title = "Top 5 Sentiment Words per Film (NRC Lexicon)",
x = "Word",
y = "Contribution to Sentiment"
)
As mentioned above, we can see several outliers in the form of names and proper nouns. Cleaning our data will involve removing these, as they are not pertinent to our sentiment analysis.
# Remove names and proper nouns that inflate sentiment scores
script_data <- script_data |>
filter(!grepl("fist|woo|stern|merry|smiles|john|colonel|princess|kicking|don|sheriff|inspector|nurse",
Word, ignore.case = TRUE))
tibble(script_data)
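One caveat: grepl() matches substrings, so a pattern like “don” also removes tokens such as “abandon” or “pardon”. A stricter sketch, anchoring the pattern so only exact tokens are dropped (note this would change the counts reported below):
# Stricter alternative (sketch): anchor the pattern to whole tokens
# pattern <- "^(fist|woo|stern|merry|smiles|john|colonel|princess|kicking|don|sheriff|inspector|nurse)$"
# script_data <- script_data |>
#   filter(!grepl(pattern, Word, ignore.case = TRUE))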
With our data now long and containing all sentiments, I would consider it tidy. With non-sentiment words removed, and with proper nouns that inflated sentiment removed, I would consider it clean. We can now begin analysis.
bing_sentiment <- script_data |>
filter(Lexicon == "Bing" & !is.na(Sentiment)) |>
select(Film, Oscar.Year, Sentiment, Movie.Genre) |>
count(Film, Oscar.Year, Sentiment) |>
pivot_wider(names_from = Sentiment, values_from = n, values_fill = 0) |>
mutate(Sentiment = positive - negative)
afinn_sentiment <- script_data |>
filter(Lexicon == "Afinn" & !is.na(Sentiment)) |>
mutate(Sentiment = as.numeric(Sentiment)) |>
group_by(Film, Oscar.Year, Movie.Genre) |>
summarise(Sentiment = sum(Sentiment), .groups = "drop")
nrc_sentiment <- script_data |>
filter(Lexicon == "Nrc" & !is.na(Sentiment)) |>
select(Film, Oscar.Year, Sentiment, Movie.Genre) |>
count(Film, Oscar.Year, Sentiment) |>
pivot_wider(names_from = Sentiment, values_from = n, values_fill = 0) |>
mutate(Sentiment = positive - negative)
merge1 <- merge(bing_sentiment, afinn_sentiment, by = "Film")
all_sentiment <- merge(merge1, nrc_sentiment, by = "Film")
all_sentiment <- all_sentiment |>
select(Film,
Oscar.Year.x,
Sentiment,
Sentiment.x,
Sentiment.y,
Movie.Genre) |>
rename("Bing.Sentiment" = "Sentiment.x",
"Nrc.Sentiment" = "Sentiment",
"Afinn.Sentiment" = "Sentiment.y",
"Oscar.Year" = "Oscar.Year.x") |>
mutate("Mean.Sentiment" = round(rowMeans(cbind(Bing.Sentiment,
Nrc.Sentiment,
Afinn.Sentiment)))) |>
arrange(Oscar.Year)
tibble(all_sentiment)
Here I have taken the original dataset and subdivided it into three separate datasets, each containing a sentiment value per film (NRC contains many different emotions, not just a total value). For Bing and NRC, the score is the total number of positive words minus the total number of negative words. For Afinn, words already carry numeric values, so the values of all words in a film were summed to find its total sentiment score. Once broken into their own datasets, I combined them into a new one, keeping only a film’s title, Oscar year, movie genre, and the three sentiment scores. The sentiment score and Oscar year columns had to be renamed after the merge, and once everything was cleaned, I took the average of the three scores per movie, saved as “Mean.Sentiment”. This was done in order to create graphs showing how the scores of Best Picture winners have changed over time.
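The Sentiment.x/Sentiment.y renaming could be avoided by naming each score column before merging; here is a sketch using purrr::reduce() over the three per-lexicon tables built above:
# Sketch: name the score columns up front, then fold the tables
# together with left_join so no .x/.y renaming is needed
library(purrr)
all_sentiment2 <- list(
  bing_sentiment |> select(Film, Oscar.Year, Bing.Sentiment = Sentiment),
  afinn_sentiment |> select(Film, Oscar.Year, Afinn.Sentiment = Sentiment),
  nrc_sentiment |> select(Film, Oscar.Year, Nrc.Sentiment = Sentiment)
) |>
  reduce(left_join, by = c("Film", "Oscar.Year")) |>
  distinct()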
all_sentiment |>
ggplot(aes(x = Film, y = Bing.Sentiment, fill = Film)) +
geom_col(show.legend = FALSE, width = 0.7) +
geom_text(aes(label = Bing.Sentiment), vjust = -0.5, size = 3) +
theme(
axis.text.x = element_text(angle = 50, hjust = 1, size = 6),
axis.title.y = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold")
) +
labs(
x = "Film",
y = "Sentiment Score",
title = "Bing Sentiment Scores by Film"
)
all_sentiment |>
ggplot(aes(x = Film, y = Afinn.Sentiment, fill = Film)) +
geom_col(show.legend = FALSE, width = 0.7) +
geom_text(aes(label = Afinn.Sentiment), vjust = -0.5, size = 3) +
theme(
axis.text.x = element_text(angle = 50, hjust = 1, size = 6),
axis.title.y = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold")
) +
labs(
x = "Film",
y = "Sentiment Score",
title = "Afinn Sentiment Scores by Film"
)
all_sentiment |>
ggplot(aes(x = Film, y = Nrc.Sentiment, fill = Film)) +
geom_col(show.legend = FALSE, width = 0.7) +
geom_text(aes(label = Nrc.Sentiment), vjust = -0.5, size = 3) +
theme(
axis.text.x = element_text(angle = 50, hjust = 1, size = 6),
axis.title.y = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold")
) +
labs(
x = "Film",
y = "Sentiment Score",
title = "NRC Sentiment Scores by Film"
)
all_sentiment |>
ggplot(aes(x = Film, y = Mean.Sentiment, fill = Film)) +
geom_col(show.legend = FALSE, width = 0.7) +
geom_text(aes(label = Mean.Sentiment), vjust = -0.5, size = 3) +
theme(
axis.text.x = element_text(angle = 50, hjust = 1, size = 6),
axis.title.y = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold")
) +
labs(
x = "Film",
y = "Sentiment Score",
title = "Average Sentiment Scores by Film"
)
After plotting sentiment scores by film in chronological order of release, I find it difficult to see a clear linear relationship. Using the Bing and Afinn lexicons there is a mild pattern: negative scores in the early 90s, a rise in the late 90s, low scores again in the 2000s, and fairly neutral scores in the 2010s. NRC, by contrast, presents mostly positive scores across the 30 films. I have plotted the average sentiment scores as well, in the hope of seeing a new linear trend, but what I mostly see is a more muted version of the pattern just described. Fitting a linear model can perhaps show whether there are trends.
all_sentiment |>
ggplot(aes(x = Oscar.Year, y = Bing.Sentiment)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_text(aes(label = Bing.Sentiment), vjust = -0.5, size = 3) +
labs(title = "Bing Sentiment vs. Oscar Year",
x = "Oscar Year",
y = "Filtered Bing Sentiment")
## `geom_smooth()` using formula = 'y ~ x'
all_sentiment |>
ggplot(aes(x = Oscar.Year, y = Afinn.Sentiment)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_text(aes(label = Afinn.Sentiment), vjust = -0.5, size = 3) +
labs(title = "Afinn Sentiment vs. Oscar Year",
x = "Oscar Year",
y = "Filtered Afinn Sentiment")
## `geom_smooth()` using formula = 'y ~ x'
all_sentiment |>
ggplot(aes(x = Oscar.Year, y = Nrc.Sentiment)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_text(aes(label = Nrc.Sentiment), vjust = -0.5, size = 3) +
labs(title = "NRC Sentiment vs. Oscar Year",
x = "Oscar Year",
y = "Filtered NRC Sentiment")
## `geom_smooth()` using formula = 'y ~ x'
all_sentiment |>
ggplot(aes(x = Oscar.Year, y = Mean.Sentiment)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_text(aes(label = round(Mean.Sentiment, 1)), vjust = -0.5, size = 3) +
labs(title = "Average Sentiment vs. Oscar Year",
x = "Oscar Year",
y = "Average Sentiment")
## `geom_smooth()` using formula = 'y ~ x'
Across all graphs, broken down by lexicon and their average, we see a slight upward trend in scores. While NRC is fairly flat, it still has a slightly upward trajectory. Whether this trend is statistically significant, however, remains to be seen.
m1 <- lm(Bing.Sentiment ~ Oscar.Year, data = all_sentiment)
summary(m1)
##
## Call:
## lm(formula = Bing.Sentiment ~ Oscar.Year, data = all_sentiment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3709.8 -951.2 251.6 1083.2 1687.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -106419.23 30853.38 -3.449 0.000926 ***
## Oscar.Year 52.58 15.39 3.416 0.001028 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1193 on 75 degrees of freedom
## Multiple R-squared: 0.1347, Adjusted R-squared: 0.1231
## F-statistic: 11.67 on 1 and 75 DF, p-value: 0.001028
Using Bing, we can see that there is a statistically significant upward trend with a p-value of 0.0010. However, our R-squared indicates that our model is not a very good fit. (The 75 residual degrees of freedom also show that, because films appear once per genre, the model is fit on 77 film-genre rows rather than 30 distinct films.)
m2 <- lm(Afinn.Sentiment ~ Oscar.Year, data = all_sentiment)
summary(m2)
##
## Call:
## lm(formula = Afinn.Sentiment ~ Oscar.Year, data = all_sentiment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3195.6 -577.1 189.6 660.5 1528.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64948.60 23572.81 -2.755 0.00736 **
## Oscar.Year 32.15 11.76 2.734 0.00780 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 911.2 on 75 degrees of freedom
## Multiple R-squared: 0.09063, Adjusted R-squared: 0.07851
## F-statistic: 7.475 on 1 and 75 DF, p-value: 0.007801
Using Afinn, we can see that there is a statistically significant relationship with a p-value of 0.0078; however, once again our model is not a good fit.
m3 <- lm(Nrc.Sentiment ~ Oscar.Year, data = all_sentiment)
summary(m3)
##
## Call:
## lm(formula = Nrc.Sentiment ~ Oscar.Year, data = all_sentiment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1677.31 -402.92 21.14 434.76 2066.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17483.134 18037.703 -0.969 0.336
## Oscar.Year 8.968 8.999 0.997 0.322
##
## Residual standard error: 697.2 on 75 degrees of freedom
## Multiple R-squared: 0.01307, Adjusted R-squared: -8.892e-05
## F-statistic: 0.9932 on 1 and 75 DF, p-value: 0.3222
While using NRC does show a slightly positive trend upward over the years, it is not statistically significant, with a p-value of 0.3222.
m4 <- lm(Mean.Sentiment ~ Oscar.Year, data = all_sentiment)
summary(m4)
##
## Call:
## lm(formula = Mean.Sentiment ~ Oscar.Year, data = all_sentiment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2290.4 -411.2 217.6 620.5 1074.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62940.90 21735.93 -2.896 0.00495 **
## Oscar.Year 31.23 10.84 2.880 0.00518 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 840.2 on 75 degrees of freedom
## Multiple R-squared: 0.09958, Adjusted R-squared: 0.08758
## F-statistic: 8.295 on 1 and 75 DF, p-value: 0.005181
And finally, there is a statistically significant trend for average sentiment scores across years, with a p-value of 0.0052. Still, as with all of these models, the R-squared value indicates that this is not a good fit for our data.
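The four fits can also be compared side by side; a quick sketch with broom::glance(), assuming the broom and purrr packages are available:
# Collect fit statistics for all four models in one table (sketch)
library(broom)
library(purrr)
list(Bing = m1, Afinn = m2, NRC = m3, Mean = m4) |>
  map_dfr(glance, .id = "Lexicon") |>
  select(Lexicon, r.squared, adj.r.squared, p.value)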
# Group by genre and calculate the mean sentiment for each
genre_summary <- all_sentiment |>
group_by(Movie.Genre) |>
summarise(Mean.Sentiment = mean(Mean.Sentiment), .groups = "drop")
# Create a plot showing average sentiment by genre
genre_summary |>
ggplot(aes(x = reorder(Movie.Genre, Mean.Sentiment), y = Mean.Sentiment)) +
geom_col(fill = "blue") +
coord_flip() +
labs(title = "Average Sentiment by Genre",
x = "Genre",
y = "Average Sentiment")
Here we can see that more genres skew negative than positive in our data, and with greater magnitude, with Action and Western films driving the negative skew the most. I thought that excluding these extreme cases might reveal a trend in sentiment, but I was disappointed.
films_with_western_or_action <- all_sentiment |>
filter(Movie.Genre %in% c("Western", "Action")) |>
pull(Film) |>
unique()
data_no_western_or_action <- all_sentiment |>
filter(!Film %in% films_with_western_or_action)
data_no_western_or_action <- data_no_western_or_action |>
group_by(Film, Oscar.Year) |>
summarise(Mean.Sentiment = mean(Mean.Sentiment), .groups = "drop") |>
arrange(Oscar.Year)
data_no_western_or_action |>
ggplot(aes(x = Oscar.Year, y = Mean.Sentiment)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_text(aes(label = round(Mean.Sentiment, 1)), vjust = -0.7, size = 3) +
labs(
title = "Average Sentiment vs. Oscar Year (No Western or Action)",
x = "Oscar Year",
y = "Average Sentiment"
)
## `geom_smooth()` using formula = 'y ~ x'
m5 <- lm(Mean.Sentiment ~ Oscar.Year, data = data_no_western_or_action)
summary(m5)
##
## Call:
## lm(formula = Mean.Sentiment ~ Oscar.Year, data = data_no_western_or_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1320.9 -308.6 185.6 352.1 772.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34831.08 26504.92 -1.314 0.201
## Oscar.Year 17.31 13.21 1.310 0.203
##
## Residual standard error: 585.7 on 24 degrees of freedom
## Multiple R-squared: 0.06671, Adjusted R-squared: 0.02782
## F-statistic: 1.715 on 1 and 24 DF, p-value: 0.2027
Again our model is a very poor fit, with a high p-value (0.2027). Instead of only removing the genres that acted as outliers, I then checked how removing each genre in turn would affect the trend.
data_with_romance <- all_sentiment |>
filter(Movie.Genre %in% c("Romance")) |>
pull(Film) |>
unique()
data_with_no_romance <- all_sentiment |>
filter(!Film %in% data_with_romance)
data_with_no_romance <- data_with_no_romance |>
group_by(Film, Oscar.Year) |>
summarise(Mean.Sentiment = mean(Mean.Sentiment), .groups = "drop") |>
arrange(Oscar.Year)
data_with_no_romance |>
ggplot(aes(x = Oscar.Year, y = Mean.Sentiment)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_text(aes(label = round(Mean.Sentiment, 1)), vjust = -0.7, size = 3) +
labs(
title = "Average Sentiment vs. Oscar Year (No Romance)",
x = "Oscar Year",
y = "Average Sentiment"
)
## `geom_smooth()` using formula = 'y ~ x'
m6 <- lm(Mean.Sentiment ~ Oscar.Year, data = data_with_no_romance)
summary(m6)
##
## Call:
## lm(formula = Mean.Sentiment ~ Oscar.Year, data = data_with_no_romance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2090.2 -389.8 180.2 624.5 1246.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84017.97 37982.01 -2.212 0.0372 *
## Oscar.Year 41.69 18.94 2.201 0.0380 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 867.9 on 23 degrees of freedom
## Multiple R-squared: 0.174, Adjusted R-squared: 0.1381
## F-statistic: 4.845 on 1 and 23 DF, p-value: 0.03804
data_with_war <- all_sentiment |>
filter(Movie.Genre %in% c("War")) |>
pull(Film) |>
unique()
data_with_no_war <- all_sentiment |>
filter(!Film %in% data_with_war)
data_with_no_war <- data_with_no_war |>
group_by(Film, Oscar.Year) |>
summarise(Mean.Sentiment = mean(Mean.Sentiment), .groups = "drop") |>
arrange(Oscar.Year)
data_with_no_war |>
ggplot(aes(x = Oscar.Year, y = Mean.Sentiment)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_text(aes(label = round(Mean.Sentiment, 1)), vjust = -0.7, size = 3) +
labs(
title = "Average Sentiment vs. Oscar Year (No War)",
x = "Oscar Year",
y = "Average Sentiment"
)
## `geom_smooth()` using formula = 'y ~ x'
m7 <- lm(Mean.Sentiment ~ Oscar.Year, data = data_with_no_war)
summary(m7)
##
## Call:
## lm(formula = Mean.Sentiment ~ Oscar.Year, data = data_with_no_war)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2229.3 -349.7 190.1 568.5 1128.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76425.44 34940.49 -2.187 0.0379 *
## Oscar.Year 37.96 17.43 2.178 0.0386 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 839.2 on 26 degrees of freedom
## Multiple R-squared: 0.1544, Adjusted R-squared: 0.1218
## F-statistic: 4.746 on 1 and 26 DF, p-value: 0.03863
After sifting through all the genres, I found that only removing “Romance” or “War” films resulted in a statistically significant relationship, and even then neither model was a very good fit. While this was interesting, it felt more like picking and choosing what data to include, and as such, I cannot say with confidence that this showed a valuable relationship between sentiment score and time.
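For completeness, that sifting could be automated rather than done by hand; a sketch that drops the films of each genre in turn, refits the model, and collects the slope p-values:
# Sketch: drop films of each genre in turn and collect slope p-values
library(purrr)
genre_pvals <- unique(all_sentiment$Movie.Genre) |>
  map_dfr(function(g) {
    drop_films <- all_sentiment |>
      filter(Movie.Genre == g) |>
      pull(Film) |>
      unique()
    d <- all_sentiment |>
      filter(!Film %in% drop_films) |>
      group_by(Film, Oscar.Year) |>
      summarise(Mean.Sentiment = mean(Mean.Sentiment), .groups = "drop")
    m <- lm(Mean.Sentiment ~ Oscar.Year, data = d)
    tibble(Genre.Removed = g,
           P.Value = summary(m)$coefficients["Oscar.Year", "Pr(>|t|)"])
  })
genre_pvals |> arrange(P.Value)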
Manually scraping data can be tiresome, tedious, and ineffective; however, I am pleased with the outcome of this project, as it gave me a more complete appreciation of how data is gathered. If I were to do this again, I would use a scraping tool to cut down on the front-end work, allowing more time to focus on the analysis.
Additionally, I am pleased with the end result of my data and analysis. It was disappointing to find that, despite statistically significant relationships, none of the models fit the data well. I did find it interesting, however, to see how these lexicons differ from each other in a quantifiable way, and how someone might use that to adjust their data to fit a predetermined conclusion.
Lastly, the idea for this project came from how films “changed” after 9/11, becoming darker and grittier. I think it is interesting and relevant to point out that in our graphs, following the Oscars at which Chicago (2002) won, we see a drop in sentiment scores, a reflection of these grittier films.