library(tidyverse)
library(janitor)
library(jsonlite)
library(qdapDictionaries)
library(tidytext)
library(rvest)
library(vader)
library(ggplot2)
Final Project
Introduction
We signed up for the Article Search API from the New York Times. This project fetches JSON data from that API and converts it into a clean, tidy R data frame using tidyverse tools and other packages. By merging this data with average movie ratings from Letterboxd, we compare critics’ reviews with audience reactions.
Letterboxd is an app where users review movies, providing numerical average scores. The New York Times reviews are critics’ reviews consisting of paragraphs. We use sentiment analysis to quantify these reviews.
Appreciating how movies are judged is crucial for fully grasping how people react to them. This project looks closely at quantifying how movies are critiqued. By studying both critics’ opinions and audience ratings, the project uncovers the different views that shape how movies are assessed.
Connect to NYT API
First, we sign up for an API key on the New York Times website. Here, we use the rstudioapi::askForPassword() function to keep our API key private when running the code in RStudio. To run and render all code for this markdown document, we alternatively save the API key in an R chunk that we elect not to include in this published version.
The base HTTP request URL is defined below as well.
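Outside RStudio, askForPassword() is unavailable, so another way to keep the key out of the published code is to read it from an environment variable. The sketch below is only an illustration of that idea: the helper get_api_key() and the variable name NYT_API_KEY are hypothetical and are not part of the project.

```r
# Hypothetical helper: read the API key from an environment variable.
# The variable name NYT_API_KEY is an assumption made for this sketch.
get_api_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}
```

With the variable defined (for example in a user-level .Renviron file), calling get_api_key() would return the key without it ever appearing in the rendered document.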
api_key <- rstudioapi::askForPassword("Authorization Key")
Error: RStudio not running
base_url <-
"https://api.nytimes.com/svc/search/v2/articlesearch.json?"
Here, we create a function to pull data based on the filter and page number parameters, starting from the newest articles.
Note that the Article Search API returns a maximum of 10 results at a time. We use the page query parameter to paginate through results (page=0 for results 1-10, page=1 for 11-20, etc.). You can paginate through up to 100 pages (1,000 results). Also note that when a request fails, we wait before retrying so that we do not keep hitting the rate limit and getting a 429 error.
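As a small illustration of the pagination note above, the sketch below maps a zero-based page number to the range of results it covers and assembles a request URL with each query parameter separated by "&". The helper names page_to_results() and build_url() and the placeholder key are assumptions made for this example only.

```r
# Hypothetical helpers illustrating zero-based pagination (names are ours).
base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json?"

# Range of results (1-based) covered by a zero-based page of 10 results
page_to_results <- function(page, per_page = 10) {
  c(first = page * per_page + 1, last = (page + 1) * per_page)
}

# Assemble a request URL; every query parameter is separated by "&"
build_url <- function(filter, page, api_key) {
  paste0(base_url, "fq=", filter, "&sort=newest", "&page=", page,
         "&api-key=", api_key)
}

page_to_results(0)   # first page
page_to_results(99)  # last page the API allows
build_url("section_name%3A%22Movies%22", 0, "YOUR_KEY")
```

Under these assumptions, pages 0 through 99 cover results 1 through 1,000, which is the ceiling described above.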
get_movies <- function(filter, num_pages, timeout) {
  # initialize data frame
  df <- tibble()
  # the API's page parameter is zero-based, so request pages 0 to num_pages - 1
  for (page in seq_len(num_pages) - 1) {
    # set url
    url <- paste0(base_url,
                  "fq=",
                  filter,
                  "&sort=newest&page=",
                  page,
                  "&api-key=",
                  api_key)
    # initialize success as false before we get a success
    success <- FALSE
    while (success == FALSE) {
      # while success is false, keep retrying the request
      tryCatch({
        docs <- fromJSON(url, flatten = TRUE)$response$docs |>
          clean_names()
        df <- bind_rows(df, docs) # append to df
        success <- TRUE # Set success to TRUE if no error occurs
      },
      # if error,
      error = function(e) {
        # Add a delay before retrying to avoid hitting the rate limit
        Sys.sleep(timeout)
      })
    }
  }
  # Return the resulting data frame
  return(df)
}
Load JSON NYT data into R data frame
We call the function, iterating through num_pages pages of JSON data, appending them together as the data frame movies_df. We will be using the Article Search API to get New York Times movie reviews, so we define the filter query accordingly.
filter <-
'section_name%3A%22Movies%22%20AND%20type_of_material%3A%22Review%22'
num_pages <- 100
num_seconds <- 9
if(file.exists("nyt_movies_raw.rdat")) {
load("nyt_movies_raw.rdat")
} else {
nyt_movies_raw <- get_movies(filter, num_pages, num_seconds)
save(nyt_movies_raw, file = "nyt_movies_raw.rdat")
}
Here is the R data frame of NYT JSON movie review data loaded from the NYT API:
head(nyt_movies_raw)
abstract
1 Housebond voyeur suspects neighbor of murder. Tasteful, engaging update of Hitchcock classic.
2 LIKE the young German martyrs whose World War II story it tells, Michael Verhoeven's ''White Rose'' is resoundingly decent and sincere. The intrinsically stirring story of these dissidents, who risked their lives to disseminate anti-Nazi literature in the Munich of 1942 and 1943, is enough to give this film an almost automatic emotional power. While Mr. Verhoeven's pacing isn't always galvanizing, and while the individual characters seem somewhat faceless at times (despite the camera's habit of photographing them at intimately close range), ''The White Rose'' has an honesty and urgency that override its occasional colorlessness. It's a forthright tale of heroism, plainly and deliberately told.
3 Not even necrophilia and the imaginative deployment of leeches can relieve this exercise in unrelenting dullness.
4 THE annual flood of holiday albums has poured onto record-store shelves. Stars are ready to personalize the old songs or put a wintertime twist on their usual styles; unknowns hope familiar material will help them find new listeners. Pious or raunchy, ethnic or homogenized, blue or sentimental, unabashedly corny or determinedly hip, they all try to pack holiday feelings into cozy commodities. At the same time the holidays also bring out music packages that won't be obsolete on Dec. 26: compilations, greatest-hits albums and reissued albums. Below, the rock and jazz critics of The Times survey the seasonal bounty. (Individual CD's range from $11.97 to $28.97; two-CD sets range from $19.97 to $31.97.)
5 PAUL THEROUX'S adventure novel ''The Mosquito Coast'' has a much bleaker vision than the movie directed by Peter Weir and starring Harrison Ford. But even with some of the more acerbic qualities of the book's antihero, Allie Fox, toned down, it took more than three years for the producer, Jerome Hellman, to raise the money to shoot a movie that most studio executives found simply not sufficiently ''high-concept.'' What is high-concept? ''It has absolutely no meaning, as far as I can tell,'' Mr. Hellman said. ''I guess it's an idea that can be expressed in a sentence so simplistic that there is no chance of confusion, and it probably entails an upbeat ending. If we had been willing to have Allie emerge a changed man who learned from his experience and took his family back to New England, we would have had no trouble.'' The story is that of a man horrified at what he believes is the degeneration of America into a society of fast food, crime, hypocrisy and lack of initiative. He decides to take his family to the Mosquito Coast, the wild jungle land of Central America, and start over.
6 “Love in the Time of Cholera” is faithful to the outline of the novel but emotionally and spiritually anemic.
web_url
1 https://www.nytimes.com/1983/10/09/movies/and-james-stewart-recalls-hitch.html
2 https://www.nytimes.com/1983/05/06/movies/the-white-rose-students-against-nazis.html
3 https://www.nytimes.com/2006/06/08/movies/08psyc.html
4 https://www.nytimes.com/2000/12/08/movies/reissues-albums-wishing-you-a-merry-with-a-ha-ha-and-a-ho-ho-ho.html
5 https://www.nytimes.com/1986/11/21/movies/at-the-movies.html
6 https://www.nytimes.com/2007/11/16/movies/16chol.html
snippet
1 Housebond voyeur suspects neighbor of murder. Tasteful, engaging update of Hitchcock classic.
2
3 Not even necrophilia and the imaginative deployment of leeches can relieve this exercise in unrelenting dullness.
4
5
6 “Love in the Time of Cholera” is faithful to the outline of the novel but emotionally and spiritually anemic.
lead_paragraph
1 L.B. Jeffries, the inquiring photographer whose broken leg prevents him from doing anything more strenuous than gazing out his window, is back. And he's in very good company, or soon will be. He will be joined by John ''Scottie'' Ferguson, who loves one woman because she reminds him of another and who is paralyzed by his fear of heights; Dr. Ben McKenna, whose child is kidnapped as part of an international assassination plot; and Rupert Cadell, a professor who attends a party and solves a murder mystery in the same apartment, and on the very same evening.
2 LIKE the young German martyrs whose World War II story it tells, Michael Verhoeven's ''White Rose'' is resoundingly decent and sincere. The intrinsically stirring story of these dissidents, who risked their lives to disseminate anti-Nazi literature in the Munich of 1942 and 1943, is enough to give this film an almost automatic emotional power.
3 Anyone popping in to "Psychopathia Sexualis" hoping for a classier "Basic Instinct" or perhaps a sneak peek at Woody Allen's memoirs will be sorely disappointed. Named for the notorious 1886 text by Richard Freiherr von Krafft-Ebing, a German psychiatrist whose specialty was sexual perversion, the movie reconstructs four of the book's colorful case histories. But applying the scientific method to carnal behavior is one thing; applying it to moviemaking is quite another.
4 THE annual flood of holiday albums has poured onto record-store shelves. Stars are ready to personalize the old songs or put a wintertime twist on their usual styles; unknowns hope familiar material will help them find new listeners. Pious or raunchy, ethnic or homogenized, blue or sentimental, unabashedly corny or determinedly hip, they all try to pack holiday feelings into cozy commodities. At the same time the holidays also bring out music packages that won't be obsolete on Dec. 26: compilations, greatest-hits albums and reissued albums. Below, the rock and jazz critics of The Times survey the seasonal bounty. (Individual CD's range from $11.97 to $28.97; two-CD sets range from $19.97 to $31.97.)
5 PAUL THEROUX'S adventure novel ''The Mosquito Coast'' has a much bleaker vision than the movie directed by Peter Weir and starring Harrison Ford. But even with some of the more acerbic qualities of the book's antihero, Allie Fox, toned down, it took more than three years for the producer, Jerome Hellman, to raise the money to shoot a movie that most studio executives found simply not sufficiently ''high-concept.'' What is high-concept? ''It has absolutely no meaning, as far as I can tell,'' Mr. Hellman said. ''I guess it's an idea that can be expressed in a sentence so simplistic that there is no chance of confusion, and it probably entails an upbeat ending. If we had been willing to have Allie emerge a changed man who learned from his experience and took his family back to New England, we would have had no trouble.''
6 “Love in the Time of Cholera” sets itself the elusive task of translating Gabriel García Márquez’s masterpiece of magical realism into an upscale art film with popular appeal. Faithful to the outline of the novel but emotionally and spiritually anemic, it slides into the void between art and entertainment, where well-intended would-be screen epics often land with a thud.
print_section print_page source
1 2 21 The New York Times
2 C 13 The New York Times
3 E 5 The New York Times
4 E 30 The New York Times
5 C 8 The New York Times
6 E 1 The New York Times
multimedia
1 NULL
2 NULL
3 NULL
4 NULL
5 NULL
6 0, 0, 0, xlarge, popup, thumbnail, NA, NA, NA, NA, NA, NA, image, image, image, images/2007/11/16/arts/16cholera-600.jpg, images/2007/11/16/arts/16cholera-600.jpg, images/2007/11/16/arts/16cholera-75.jpg, 330, 330, 75, 600, 600, 75, xlarge, popup, thumbnail, articleLarge, popup, thumbStandard, images/2007/11/16/arts/16cholera-600.jpg, NA, NA, 600, NA, NA, 330, NA, NA, NA, NA, images/2007/11/16/arts/16cholera-75.jpg, NA, NA, 75, NA, NA, 75
keywords
1 subject, subject, creative_works, MOTION PICTURES, REVIEW, Rear Window (Movie), 1, 2, 3, N, N, N
2 subject, subject, creative_works, MOTION PICTURES, REVIEW, WHITE ROSE, THE (MOVIE), 1, 2, 3, N, N, N
3 subject, MOTION PICTURES, 1, N
4 NULL
5 organizations, persons, persons, persons, persons, persons, persons, subject, subject, subject, creative_works, creative_works, ISLAND PICTURES, BALLHAUS, MICHAEL, Fassbinder, Rainer Werner, BROKAW, CARY, Ford, Harrison, SCHWARTZ, RUSSELL, Weir, Peter, Suspensions, Dismissals and Resignations, Appointments and Executive Changes, MOTION PICTURES, GLASS MENAGERIE, THE (MOVIE), MOSQUITO COAST, THE (MOVIE), 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, N, N, N, N, N, N, N, N, N, N, N, N
6 subject, persons, persons, persons, persons, MOTION PICTURES, Newell, Mike, Harwood, Ronald, Bardem, Javier, Bratt, Benjamin, 1, 2, 3, 4, 5, N, N, N, N, N
pub_date document_type news_desk
1 1983-10-09T05:00:00+0000 article Arts and Leisure Desk
2 1983-05-06T05:00:00+0000 article Weekend Desk
3 2006-06-08T04:00:00+0000 article Culture
4 2000-12-08T05:00:00+0000 article Movies, Performing Arts/Weekend Desk
5 1986-11-21T05:00:00+0000 article Weekend Desk
6 2007-11-16T05:00:00+0000 article Weekend
section_name type_of_material
1 Movies Review
2 Movies Review
3 Movies Review
4 Movies Review
5 Movies Review
6 Movies Review
id word_count
1 nyt://article/08bc53b8-7807-509a-ae15-bbf3862ce15f 1314
2 nyt://article/08c254dd-9973-56cb-b43f-f1b458d2b115 875
3 nyt://article/08c2f145-05b6-5da4-959a-1e2a5736037c 331
4 nyt://article/08c37cf0-dc6a-5da8-99ba-6016989f0454 248
5 nyt://article/08c710e7-61aa-55ee-94ff-e56d53b15ce0 1627
6 nyt://article/08c97d6c-026c-5269-b609-a13b47e97a6a 1006
uri
1 nyt://article/08bc53b8-7807-509a-ae15-bbf3862ce15f
2 nyt://article/08c254dd-9973-56cb-b43f-f1b458d2b115
3 nyt://article/08c2f145-05b6-5da4-959a-1e2a5736037c
4 nyt://article/08c37cf0-dc6a-5da8-99ba-6016989f0454
5 nyt://article/08c710e7-61aa-55ee-94ff-e56d53b15ce0
6 nyt://article/08c97d6c-026c-5269-b609-a13b47e97a6a
headline_main
1 ...AND JAMES STEWART RECALLS 'HITCH'
2 'THE WHITE ROSE,' STUDENTS AGAINST NAZIS
3 Sexual Deviations on Parade in 'Psychopathia Sexualis'
4 Reissues; Albums Wishing You a Merry With a Ha-Ha and a Ho-Ho-Ho
5 AT THE MOVIES
6 50 Years and 600 Women Later, True Love
headline_kicker headline_content_kicker
1 <NA> NA
2 <NA> NA
3 Movie Review NA
4 <NA> NA
5 <NA> NA
6 Movie Review | 'Love in the Time of Cholera' NA
headline_print_headline
1 ...AND JAMES STEWART RECALLS 'HITCH'
2 'THE WHITE ROSE,' STUDENTS AGAINST NAZIS
3 Sexual Deviations on Parade, Yet Strangely Unstimulating
4 Reissues; Albums Wishing You a Merry With a Ha-Ha and a Ho-Ho-Ho
5 AT THE MOVIES
6 50 Years and 600 Women Later, True Love
headline_name headline_seo headline_sub byline_original
1 NA NA NA By Janet Maslin
2 NA NA NA By Janet Maslin
3 NA NA NA By Jeannette Catsoulis
4 NA NA NA By Jon Pareles
5 NA NA NA By Nina Darnton
6 NA NA NA By Stephen Holden
byline_person byline_organization
1 Janet, NA, Maslin, NA, NA, reported, , 1 <NA>
2 Janet, NA, Maslin, NA, NA, reported, , 1 <NA>
3 Jeannette, NA, Catsoulis, NA, NA, reported, , 1 <NA>
4 Jon, NA, Pareles, NA, NA, reported, , 1 <NA>
5 Nina, NA, Darnton, NA, NA, reported, , 1 <NA>
6 Stephen, NA, Holden, NA, NA, reported, , 1 <NA>
Cleaning NYT data
Before we finalize this data set, we subset the data frame to our variables of interest and do some data transformation operations:
- The keywords variable in the original data frame was a list-column. The name of the movie is tagged as a keyword in each review article. We unnest this list-column variable to extract the movie name.
- Clean the movie name by
  - converting all characters to lowercase
  - removing parentheticals in the string for matching using REGEX
  - removing any leading “the” or “a” and any trailing “, the” or “, a” from movie names in the NYT data
- Convert NYT publication date column pub_date to a datetime format.
# Define REGEX patterns
media_pattern <-
".*\\(([^\\(\\)]+)\\)$" # to get media type (e.g., "movie", "play", etc.), pull text in final parenthetical of movie title
name_pattern <-
"(.*) \\(([^\\(\\)]+)\\)$" # to get movie title on its own, remove final parenthetical
the_pattern <- "^the\\s" # to remove "the" from beginning of movie title
comma_the_pattern <- ",\\sthe$" # to remove ", the" from end of movie title
a_pattern <- "^a\\s" # to remove "a" from beginning of movie title
comma_a_pattern <- ",\\sa$" # to remove ", a" from end of movie title
nyt_movies <- nyt_movies_raw |>
# Unnest `keywords` list variable
unnest(keywords, keep_empty = TRUE) |>
# Correct date column format
mutate(
pub_date = as_datetime(pub_date),
media = str_match(value, media_pattern)[, 2] |>
tolower(),
# Clean movie titles, removing "the"
name = str_match(value, name_pattern)[, 2] |>
tolower() |>
str_replace(comma_the_pattern, "") |>
str_replace(the_pattern, "") |>
str_replace(a_pattern, "") |>
str_replace(comma_a_pattern, "")
) |>
# Filter data to movies only (removing keyword rows for crew, other types of media, etc.)
filter(media == "movie") |>
# Reorder reviews by publication date
arrange(desc(pub_date)) |>
# Select variables of interest: columns w/ text for sentiment analysis and merging
select(
abstract,
web_url,
name,
pub_date,
headline_main,
headline_kicker,
headline_print_headline
)
head(nyt_movies)
# A tibble: 6 × 7
abstract web_url name pub_date headline_main headline_kicker
<chr> <chr> <chr> <dttm> <chr> <chr>
1 Sex, death an… https:… prin… 2024-05-09 11:00:09 ‘A Prince’ R… Critic’s Pick
2 Ryusuke Hamag… https:… evil… 2024-05-02 16:40:16 ‘Evil Does N… Critic’s Pick
3 The second fe… https:… slow 2024-05-02 13:51:05 ‘Slow’ Revie… <NA>
4 An outstandin… https:… i sa… 2024-05-02 09:03:05 ‘I Saw the T… Critic’s Pick
5 Guy Ritchie’s… https:… mini… 2024-04-18 17:17:18 ‘The Ministr… <NA>
6 A college pro… https:… lous… 2024-03-28 11:00:11 ‘Lousy Carte… <NA>
# ℹ 1 more variable: headline_print_headline <chr>
Above is our cleaned NYT data set.
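To make the title-cleaning steps concrete, the sketch below applies the same regular expressions to three keyword values of the kind seen in the raw data. The wrapper clean_title() is a hypothetical convenience for this example and is not part of the project code.

```r
library(stringr)

# Same patterns as in the cleaning step
name_pattern <- "(.*) \\(([^\\(\\)]+)\\)$"
media_pattern <- ".*\\(([^\\(\\)]+)\\)$"

# Hypothetical wrapper around the title-cleaning pipeline
clean_title <- function(x) {
  str_match(x, name_pattern)[, 2] |>
    tolower() |>
    str_replace(",\\sthe$", "") |>
    str_replace("^the\\s", "") |>
    str_replace("^a\\s", "") |>
    str_replace(",\\sa$", "")
}

keywords <- c("Rear Window (Movie)", "WHITE ROSE, THE (MOVIE)", "MOTION PICTURES")

# Media type from the final parenthetical; NA when there is none
tolower(str_match(keywords, media_pattern)[, 2])

# Cleaned titles; keywords without a parenthetical become NA
clean_title(keywords)
```

Keywords such as "MOTION PICTURES" have no trailing parenthetical, so both the media type and the cleaned name come back as NA, which is why filtering on the media type removes them.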
Load Letterboxd data and clean
We load the Letterboxd data and apply the same cleaning transformations to the movie name variable that we applied to the variable of the same name in the NYT data above. We also drop movies released more than a year before the earliest review in the NYT data (1981-03-27 05:00:00); we don’t want to merge these older Letterboxd movies since it’s unlikely that an NYT critic would review a movie more than a year after its release.
# load letterboxd data
letterboxd_movies <-
read_csv(
"https://raw.githubusercontent.com/naomibuell/DATA607_FinalProject/main/movies_trimmed.csv"
) |>
drop_na(minute) |>
mutate(
name = tolower(name) |> # switching movie names to lower case
# Apply the same str_replace() cleaning patterns used for the NYT titles
str_replace(comma_the_pattern, "") |>
str_replace(the_pattern, "") |>
str_replace(a_pattern, "") |>
str_replace(comma_a_pattern, "")
) |>
select(-c(id)) |>
# only keep Letterboxd data within a similar date range to the NYT data
filter(date >= year(min(nyt_movies$pub_date)) - 1)
Rows: 85614 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): name, tagline, description
dbl (4): id, date, minute, rating
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(letterboxd_movies)
# A tibble: 6 × 6
name date tagline description minute rating
<chr> <dbl> <chr> <chr> <dbl> <dbl>
1 barbie 2023 She's every… "Barbie an… 114 3.91
2 parasite 2019 Act like yo… "All unemp… 133 4.57
3 everything everywhere all at once 2022 The univers… "An aging … 140 4.32
4 fight club 1999 Mischief. M… "A ticking… 139 4.27
5 interstellar 2014 Mankind was… "The adven… 169 4.32
6 joker 2019 Put on a ha… "During th… 122 3.83
Merge NYT and Letterboxd Data
Below, we merge this NYT data with Letterboxd data based on movie name. We also pick the best matches based on when the NYT critic reviewed the movie and when Letterboxd says the movie was released, assuming a true match would show the movie reviewed by NYT right when it came out (since Letterboxd only posts the year of release and not the full date, we assume all movies came out on Christmas day for the purposes of this calculation).
# First, merge datasets by name. This is a many-to-many join, so movie-review rows will not be unique, and some matches will be erroneous.
merged <- inner_join(nyt_movies, letterboxd_movies, by = "name", relationship = "many-to-many") |>
# For movies w/ multiple matched rows, keep the row with the smallest absolute difference in dates between sources.
group_by(name) |>
mutate(
# assuming all movies came out on Christmas (popular date for movie release),
release_date = as_datetime(paste0(date, "-12-25")),
# calculate absolute diff btwn review and release date
dates_diff = abs(difftime(pub_date, release_date, units = "days"))
) |>
filter(dates_diff == min(dates_diff)) |> # for each unique name, get movie row w/ shortest diff between review and release date
ungroup() |>
# remove any duplicated rows
distinct()
head(merged)
# A tibble: 6 × 14
abstract web_url name pub_date headline_main headline_kicker
<chr> <chr> <chr> <dttm> <chr> <chr>
1 Sex, death an… https:… prin… 2024-05-09 11:00:09 ‘A Prince’ R… Critic’s Pick
2 Sex, death an… https:… prin… 2024-05-09 11:00:09 ‘A Prince’ R… Critic’s Pick
3 Ryusuke Hamag… https:… evil… 2024-05-02 16:40:16 ‘Evil Does N… Critic’s Pick
4 The second fe… https:… slow 2024-05-02 13:51:05 ‘Slow’ Revie… <NA>
5 An outstandin… https:… i sa… 2024-05-02 09:03:05 ‘I Saw the T… Critic’s Pick
6 In this white… https:… femme 2024-03-21 11:00:06 ‘Femme’ Revi… <NA>
# ℹ 8 more variables: headline_print_headline <chr>, date <dbl>, tagline <chr>,
# description <chr>, minute <dbl>, rating <dbl>, release_date <dttm>,
# dates_diff <drtn>
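The closest-date rule used in the merge can be seen on a toy case. In the sketch below, one review matches two Letterboxd entries with the same cleaned name, and only the entry whose assumed release date (December 25 of the release year) is nearest to the review date is kept. The years and ratings are made up for illustration.

```r
library(dplyr)

# Toy data; the years and ratings below are made up for illustration
nyt_toy <- tibble(name = "land", pub_date = as.Date("2021-02-11"))
lb_toy <- tibble(
  name = c("land", "land"),
  date = c(2021, 1987),
  rating = c(3.2, 2.9)
)

closest <- inner_join(nyt_toy, lb_toy, by = "name") |>
  group_by(name) |>
  mutate(
    # assume a December 25 release, as in the merge
    release_date = as.Date(paste0(date, "-12-25")),
    # absolute gap in days between review and assumed release
    dates_diff = abs(as.numeric(pub_date - release_date))
  ) |>
  filter(dates_diff == min(dates_diff)) |>
  ungroup()

closest
```

Here the 2021 entry is kept and the 1987 entry is dropped, since its assumed release date is decades away from the review.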
Describe and Validate Analysis Data set
Now that we have our analysis data set, we can perform a few statistical analyses to describe and validate the merged data. The following section checks our data set that we’ll use for analysis for outliers, consistency, completeness, and any discrepancies between the merged data sets that would indicate merge errors.
Descriptive statistics
First, we generate summary statistics and compare the distribution of the dates reported by both data sets:
merged |>
select(pub_date, date) |>
summary()
pub_date date
Min. :1981-03-27 05:00:00.00 Min. :1980
1st Qu.:1995-03-07 17:00:00.00 1st Qu.:1995
Median :2002-09-13 05:00:00.00 Median :2002
Mean :2005-03-26 09:47:11.39 Mean :2005
3rd Qu.:2017-07-06 18:50:22.00 3rd Qu.:2017
Max. :2024-05-09 11:00:09.00 Max. :2024
# compare dates for NYT and Letterboxd data
merged |>
ggplot() +
geom_histogram(
aes(x = pub_date, fill = "Merged Review publication date (source: NYT)"),
alpha = .5,
bins = 44
) +
geom_histogram(
aes(x = release_date, fill = "Merged Movie release year (source: Letterboxd)"),
alpha = .5,
bins = 44
) +
theme(legend.position = "bottom") +
labs(title = "Dates, by data source",
x = "Date",
y = "Frequency")
# examine the difference between review and release dates
merged |>
ggplot(aes(x = dates_diff)) +
geom_histogram(bins = 100) +
labs(title = "Most movies have very little difference between timing of review and release.",
x = "Difference between release and review date in days")
merged |>
mutate(dates_diff = as.numeric(dates_diff)) |>
select(dates_diff) |>
summary()
dates_diff
Min. : 0.086
1st Qu.: 68.062
Median : 132.101
Mean : 444.391
3rd Qu.: 255.758
Max. :12761.792
Both movie release and review publication dates align well between data sources. They start and end around the same time. Both have peaks in the early 2000s and around the 2020s, with a major dip in the late 2000s. This lack of data around the late 2000s stems from the limited NYT data, not the Letterboxd data.
Next we’ll check the distribution and outliers of the other numerical data, all from the Letterboxd data set:
summary(merged$minute)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.0 92.0 102.0 107.3 115.0 924.0
merged |>
ggplot(aes(x = minute/60)) +
geom_histogram(bins = 60) +
labs(title = "Movie length",
x = "Length in hours (source: Letterboxd)")
The distribution is centered around about 1.7 hours. There are many high outliers in terms of movie length (max about 15 hours), likely due to full television or movie series being included in the database. We elect not to remove these outliers because they can be genuine matches. For example, the longest film in the data, “Heimat”, was 924 minutes long according to Letterboxd and was reviewed by the NYT, which confirms that this chronicle of Germany “will be broadcast for the first time on cable on Bravo in eight parts.”
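If we wanted to list such outliers explicitly before reviewing them by hand, one conventional option is the 1.5 × IQR rule. This is only a sketch of that rule, not a step the analysis performs: the helper is_outlier() is hypothetical, and apart from the minimum (2) and maximum (924) the runtimes below are made up.

```r
# Illustrative runtimes in minutes; 2 and 924 match the min and max above,
# the remaining values are made up
minutes <- c(2, 92, 98, 100, 102, 110, 115, 924)

# Hypothetical helper: flag values outside the 1.5 * IQR fences
is_outlier <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Runtimes that would be flagged for manual review
minutes[is_outlier(minutes)]
```

Flagged rows would still need the kind of manual check described above, since a long runtime can be a genuine match.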
Next we check user ratings for outliers:
summary(merged$rating)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.310 3.000 3.310 3.257 3.570 4.400
merged |>
ggplot(aes(x = rating)) +
geom_histogram() +
labs(title = "User ratings",
x = "Average movie user rating (source: Letterboxd)")
Average movie ratings in this data set range from 1.31 to 4.4. The mean of these average ratings is about 3.26. The distribution looks roughly normal but is slightly left-skewed, which is expected.
merged |>
ggplot(aes(x = rating)) +
geom_boxplot()
Data Quality Assessment/Overlap Analysis
Here, we check the number of records from each source that were merged. The Letterboxd data was much larger than what we pulled from NYT using the API (67984 vs. 1000 movies, respectively), so the NYT data was the limiting factor for the merge in terms of the length of our final data set.
perc_merged <- nrow(merged)/nrow(nyt_movies)
However, 100 percent of the NYT data could be merged with a Letterboxd rating.
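Because a merged data set can hold several rows per title, a ratio of row counts may overstate the overlap. A complementary check, sketched here on toy data with placeholder titles and made-up ratings, is to count distinct NYT titles that have at least one Letterboxd match and to list those with none.

```r
library(dplyr)

# Toy data; titles are placeholders and ratings are made up
nyt_toy <- tibble(name = c("slow", "femme", "untitled example"))
lb_toy <- tibble(name = c("slow", "slow", "femme"), rating = c(3.1, 2.8, 3.6))

# NYT titles with at least one match, and NYT titles with none
matched <- semi_join(nyt_toy, lb_toy, by = "name")
unmatched <- anti_join(nyt_toy, lb_toy, by = "name")

# Share of distinct NYT titles that found a match
match_rate <- n_distinct(matched$name) / n_distinct(nyt_toy$name)
match_rate
unmatched
```

Unlike a row-count ratio, this rate cannot exceed 1 and is unaffected by duplicate matches.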
Here, we identify any duplicate records:
duplicates <- merged |>
group_by(name) |>
mutate(duplicate = n() > 1) |>
filter(duplicate)
head(duplicates)
# A tibble: 6 × 15
# Groups: name [3]
abstract web_url name pub_date headline_main headline_kicker
<chr> <chr> <chr> <dttm> <chr> <chr>
1 Sex, death an… https:… prin… 2024-05-09 11:00:09 ‘A Prince’ R… Critic’s Pick
2 Sex, death an… https:… prin… 2024-05-09 11:00:09 ‘A Prince’ R… Critic’s Pick
3 In her featur… https:… land 2021-02-11 12:00:05 ‘Land’ Revie… Critic’s Pick
4 In her featur… https:… land 2021-02-11 12:00:05 ‘Land’ Revie… Critic’s Pick
5 This Bob Nels… https:… conf… 2016-03-18 00:54:44 Review: In ‘… <NA>
6 This Bob Nels… https:… conf… 2016-03-18 00:54:44 Review: In ‘… <NA>
# ℹ 9 more variables: headline_print_headline <chr>, date <dbl>, tagline <chr>,
# description <chr>, minute <dbl>, rating <dbl>, release_date <dttm>,
# dates_diff <drtn>, duplicate <lgl>
num_duplicated_rows <- duplicates |>
nrow()/2
There are 7 duplicated movies, where one movie name from the NYT matched two observations with the same movie name (and release year) in the Letterboxd database. We can manually review these to determine which observation is a true match, then drop the other observation(s). A good way to automate the checking of true matches, which may be necessary with larger data sets, could be to group by additional columns along with name, such as pub_date or minute (movie length). In this case, each of these is a true match, and therefore all duplicates can be removed safely.
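The idea of checking duplicates on several columns at once can be sketched as follows. The toy rows and runtimes are illustrative only; rows that agree on name, review date, and runtime are treated as the same film.

```r
library(dplyr)

# Toy data; the runtimes here are illustrative values
toy <- tibble(
  name = c("land", "land", "slow"),
  pub_date = as.Date(c("2021-02-11", "2021-02-11", "2024-05-02")),
  minute = c(89, 89, 108)
)

# Groups that share name, review date, and runtime are likely the same film
dupes <- toy |>
  count(name, pub_date, minute) |>
  filter(n > 1)

# Keep one row per (name, pub_date, minute) combination
deduped <- toy |>
  distinct(name, pub_date, minute, .keep_all = TRUE)

dupes
deduped
```

If two rows shared a name but differed in runtime, they would survive this step and could then be reviewed manually.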
merged_no_dupl <- merged |>
distinct(name, .keep_all = TRUE)
head(merged_no_dupl)
# A tibble: 6 × 14
abstract web_url name pub_date headline_main headline_kicker
<chr> <chr> <chr> <dttm> <chr> <chr>
1 Sex, death an… https:… prin… 2024-05-09 11:00:09 ‘A Prince’ R… Critic’s Pick
2 Ryusuke Hamag… https:… evil… 2024-05-02 16:40:16 ‘Evil Does N… Critic’s Pick
3 The second fe… https:… slow 2024-05-02 13:51:05 ‘Slow’ Revie… <NA>
4 An outstandin… https:… i sa… 2024-05-02 09:03:05 ‘I Saw the T… Critic’s Pick
5 In this white… https:… femme 2024-03-21 11:00:06 ‘Femme’ Revi… <NA>
6 Zhang Lu’s qu… https:… shad… 2024-03-14 09:02:21 ‘The Shadowl… <NA>
# ℹ 8 more variables: headline_print_headline <chr>, date <dbl>, tagline <chr>,
# description <chr>, minute <dbl>, rating <dbl>, release_date <dttm>,
# dates_diff <drtn>
Web Scraping Full NYT Reviews
As the API only provides the first paragraph/snippet from the movie review, we’ll need to scrape the HTML hosted on the NYT website for the full text. Luckily, the API provides the URL via the web_url field, and every element containing review text carries the class StoryBodyCompanionColumn. This allows us to easily pull the text from the HTML elements as a character vector and collapse it into a single string, adding it as a new column in our data frame.
We do all of this after the merging and cleaning to limit the scraping to only the movie reviews we will be analyzing, reducing our HTTP requests by almost half and memory utilization by over 2 megabytes. That may seem small, but for larger data sets, a 50% reduction is substantial.
We have a NYT digital subscription, preventing any license agreement issues concerning the text.
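The extract-and-collapse step can be tried without any network request. The sketch below runs the same class selector against a small made-up HTML fragment (not a real NYT page) and joins the matching elements into one string.

```r
library(rvest)

# Made-up HTML standing in for a review page
page <- minimal_html('
  <div class="StoryBodyCompanionColumn"><p>First paragraph.</p></div>
  <div class="other"><p>Unrelated text.</p></div>
  <div class="StoryBodyCompanionColumn"><p>Second paragraph.</p></div>
')

review_text <- page |>
  html_elements(".StoryBodyCompanionColumn") |>  # keep only review-body elements
  html_text() |>                                 # one string per element
  paste(collapse = " ")                          # join into a single string

review_text
```

Only the two elements with the review-body class contribute text; the unrelated element is ignored by the selector.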
reviewScrape <- function(url) {
  success <- FALSE
  while (success == FALSE) {
    tryCatch({
      htmlText <- read_html(url) |>
        html_nodes(".StoryBodyCompanionColumn") |>
        html_text() |>
        paste(collapse = " ") # collapse paragraphs into a single string
      success <- TRUE # Set success to TRUE if no error occurs
    }, error = function(e) {
      print(paste0("Error: ", conditionMessage(e)))
      Sys.sleep(num_seconds)
    })
  }
  # return review text
  return(htmlText)
}
if(file.exists("reviews.rdat")) {
load("reviews.rdat")
} else {
reviews <- lapply(merged_no_dupl$web_url, reviewScrape)
save(reviews, file = "reviews.rdat")
}
merged_no_dupl$review <- reviews
Sentiment and Regression Analysis of NYT Review
Now that we have our merged and tidied data, we can perform some deeper analysis of these reviews provided by the NYT API. When performing sentiment analysis, choosing the right dictionary, or lexicon, is important. There already exist analyses of different lexicons, but we will replicate some of those thoughts and analyses here. In general, it’s important to choose domain-specific lexicons to ensure proper sentiment analysis of words used and highest possible coverage of words in the text being analyzed.
The three lexicons we’ll be using on our movie reviews are AFINN, a lexicon available through the tidytext package that we’ve used previously; VADER, a lexicon specifically attuned to sentiments expressed in social media; and labMT, whose dictionary was partially built directly from New York Times publications. All three use a numerical scale for rating the positivity, valence, or happiness of a word.
labMT and AFINN require us to load their dictionaries and join them with our tokenized reviews, but VADER provides a useful get_vader() function that returns the score of each word in the text as well as an overall compound score.
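The dictionary-join approach can be illustrated on a toy scale. The sketch below tokenizes two short made-up texts, joins them to a tiny made-up lexicon (the words and scores are invented and are not AFINN or labMT values), and averages the scores of matched words per review.

```r
library(dplyr)
library(tidytext)

# Made-up texts and a made-up lexicon, for illustration only
toy_reviews <- tibble(
  id = 1:2,
  review = c("Tasteful, engaging update.", "Unrelenting dullness.")
)
toy_lexicon <- tibble(
  word = c("tasteful", "engaging", "dullness"),
  value = c(1, 2, -2)
)

toy_scores <- toy_reviews |>
  unnest_tokens(word, review) |>             # one lowercase word per row
  left_join(toy_lexicon, by = "word") |>     # unmatched words get NA
  group_by(id) |>
  summarize(score = mean(value, na.rm = TRUE))  # average over matched words

toy_scores
```

Words missing from the lexicon contribute nothing to the average, which is why lexicon coverage matters for the resulting scores.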
data(labMT)
afinn <- get_sentiments("afinn")
# Getting sentiments without a function. Get afinn and labMT sentiment by word
sa_afinn_labMT <- merged_no_dupl |>
unnest(review) |>
unnest_tokens(word, review, token = "words") |>
anti_join(stop_words) |>
left_join(afinn) |>
mutate(afinn = value) |>
left_join(labMT)
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
# get sentiment column names
sa_cols <- colnames(sa_afinn_labMT)[17:length(colnames(sa_afinn_labMT))]
# get average sentiment per review
sa_collapsed <- sa_afinn_labMT |>
group_by(web_url) |>
summarize(across(all_of(sa_cols), \(x) mean(x, na.rm = TRUE))) |>
full_join(merged_no_dupl)
Joining with `by = join_by(web_url)`
# get vader sentiment
sa <- sa_collapsed |>
pull(review) |>
sapply(get_vader) |>
t() |>
as_tibble() |>
bind_cols(sa_collapsed)
par(mfrow = c(1, 3))
cor(as.numeric(sa$rating),sa$afinn)
[1] 0.01296589
cor(as.numeric(sa$rating),sa$happiness_average)
[1] -0.003418552
cor(as.numeric(sa$rating),as.numeric(sa$compound))
[1] NA
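By default, cor() returns NA whenever either input contains a missing value, so one possible explanation for the NA above (which we have not verified for this data) is that some reviews have no usable VADER compound score. The sketch below shows the behavior on made-up vectors and how the use argument restricts the calculation to complete pairs.

```r
# Made-up vectors; the NA stands in for a review that could not be scored
rating <- c(3.1, 3.6, 2.8, 4.0)
compound <- c(0.4, 0.9, NA, 0.7)

cor(rating, compound)                        # NA under the default use = "everything"
sum(is.na(compound))                         # number of missing scores
cor(rating, compound, use = "complete.obs")  # drops incomplete pairs first
```

Counting the missing values first helps decide whether dropping them is reasonable or whether the scoring step itself needs attention.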
# AFINN
plot(sa$afinn, sa$rating, main="AFINN", xlab="NYT Sentiment Mean", ylab="Letterboxd Rating")
abline(lm(sa$rating ~ sa$afinn), col="darkred")
# LabMT
plot(sa$happiness_average, sa$rating, main="LabMT", xlab="NYT Sentiment Mean", ylab="Letterboxd Rating")
abline(lm(sa$rating ~ sa$happiness_average), col="darkred")
# VADER
plot(sa$compound, sa$rating, main="VADER", xlab="NYT Sentiment Mean", ylab="Letterboxd Rating")
abline(lm(sa$rating ~ sa$compound), col="darkred")
We would expect a somewhat linear relationship between the sentiment scores of the NYT critics’ reviews and the Letterboxd user ratings, but there appears to be none. Of these three lexicons, only VADER shows any kind of relationship, albeit the opposite of what we would expect. For such a niche medium as movie reviews, it would likely be necessary to use a specialized lexicon built around the specific terms used by movie critics, or even better, to train a machine learning model to infer the Letterboxd rating based on sentiment.
Conclusion analysis and graphic: Comparison of NYT review sentiments vs. Letterboxd numerical rating for matched movies. How do professional movie critics differ from Letterboxd users?
# Heatmap of rating, release year, and happiness average
ggplot(sa, aes(x = date, y = round(rating))) +
geom_tile(aes(fill = happiness_average), color = "white") +
scale_fill_gradient(low = "navy", high = "pink", name = "NYT review sentiment (labMT)") +
labs(title = "Sentiment Scores vs. Ratings Heatmap",
x = "Release Year (source: Letterboxd)",
y = "Letterboxd User Rating (rounded)")
Reflection on Project
The lack of a strong linear relationship between the sentiment of the critics’ reviews and the Letterboxd user ratings across all lexicons suggests that the sentiment of a review alone may not be a strong predictor of how a movie is rated. This indicates that other factors beyond sentiment, such as plot, acting, direction, and thematic depth, likely shape these evaluations.
The comparison between NYT review sentiments and Letterboxd ratings highlights potential differences in how professional movie critics and general users perceive and evaluate movies. Professional critics may focus more on aspects like cinematography, storytelling, and thematic depth, while user ratings on platforms like Letterboxd may be influenced by personal preferences, entertainment value, and emotional resonance.
While VADER performed slightly better than AFINN and labMT, it still did not show a strong correlation with the Letterboxd ratings. This suggests that using generic sentiment lexicons may not capture the nuances of movie reviews accurately. Utilizing specialized lexicons tailored to movie reviews or employing machine learning models to infer ratings based on sentiment may yield better results. We initially considered the NRC and Loughran lexicons as well, but they proved redundant in the results and analysis and are not included in the final version of this project.
To better understand the differences between professional critics and Letterboxd users, future analyses could involve developing specialized sentiment lexicons tailored to movie reviews that capture industry-specific terminology and critical language, and exploring additional features beyond sentiment, such as genre, directorial style, and production budget, to understand their impact on ratings.
Conclusion
Overall, the project underscores the complexity of movie evaluation and the importance of considering multiple factors beyond sentiment alone when analyzing critical responses. It also highlights the fact that critics and audiences do not share one opinion on movies. The conclusion graphics suggest that critics’ reviews alone may not reliably predict audience opinions on films.
To effectively communicate these findings, the comparative analysis graphic illustrates the sentiment scores derived from the NYT reviews alongside the corresponding Letterboxd ratings for a set of matched movies. This visual representation can highlight the divergence between professional critics’ sentiments and user ratings, providing insights into the distinct perspectives and evaluation criteria of each group.
By incorporating specialized lexicons and exploring additional features, future research can enhance our understanding of how different audiences perceive and rate movies, ultimately contributing to more nuanced and accurate analyses of film criticism.
References
- Dodds, Peter Sheridan, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth. “Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter.” PLoS ONE 6, no. 12 (2011).