Sentiment analysis uses natural language processing and machine learning techniques to identify and extract subjective information from text data. The main goal of my project is to practice working with web APIs, data cleaning, sentiment analysis, and data visualization.
The Rotten Tomatoes datasets were collected from Kaggle, and additional reviews were extracted via The Movie Database (TMDB) API. The project aims to answer two questions:
1. Will movie categories affect viewer reviews?
2. What are the most common words for each category?
Additionally, the reviews from The Movie Database will be compared with the Rotten Tomatoes reviews.
library(tidyverse) # dplyr, tidyr, readr, stringr, ggplot2
library(tidytext)  # tokenization and sentiment lexicons
library(jsonlite)  # parse JSON returned by the API
library(httr)      # HTTP requests
library(wordcloud) # comparison.cloud()
library(reshape2)  # acast()
The Rotten Tomatoes dataset was collected from Kaggle and downloaded to a local directory. You can find the datasets at https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset.
The dataset has been scraped from the publicly available website https://www.rottentomatoes.com as of 2020-10-31.
movies <- read_csv("C:\\Users\\tonyl\\OneDrive\\Documents\\fina_project\\rotten_tomatoes_movies.csv")
reviews <- read_csv("C:\\Users\\tonyl\\OneDrive\\Documents\\fina_project\\rotten_tomatoes_critic_reviews.csv")
head(movies)
## # A tibble: 6 × 22
## rotten_tomatoes_link movie_title movie_info critics_consensus content_rating
## <chr> <chr> <chr> <chr> <chr>
## 1 m/0814255 Percy Jack… Always tr… Though it may se… PG
## 2 m/0878835 Please Give Kate (Cat… Nicole Holofcene… R
## 3 m/10 10 A success… Blake Edwards' b… R
## 4 m/1000013-12_angry_men 12 Angry M… Following… Sidney Lumet's f… NR
## 5 m/1000079-20000_leagu… 20,000 Lea… In 1866, … One of Disney's … G
## 6 m/10000_bc 10,000 B.C. Mammoth h… With attention s… PG-13
## # ℹ 17 more variables: genres <chr>, directors <chr>, authors <chr>,
## # actors <chr>, original_release_date <date>, streaming_release_date <date>,
## # runtime <dbl>, production_company <chr>, tomatometer_status <chr>,
## # tomatometer_rating <dbl>, tomatometer_count <dbl>, audience_status <chr>,
## # audience_rating <dbl>, audience_count <dbl>,
## # tomatometer_top_critics_count <dbl>, tomatometer_fresh_critics_count <dbl>,
## # tomatometer_rotten_critics_count <dbl>
head(reviews)
## # A tibble: 6 × 8
## rotten_tomatoes_link critic_name top_critic publisher_name review_type
## <chr> <chr> <lgl> <chr> <chr>
## 1 m/0814255 Andrew L. Urban FALSE Urban Cinefile Fresh
## 2 m/0814255 Louise Keller FALSE Urban Cinefile Fresh
## 3 m/0814255 <NA> FALSE FILMINK (Australi… Fresh
## 4 m/0814255 Ben McEachen FALSE Sunday Mail (Aust… Fresh
## 5 m/0814255 Ethan Alter TRUE Hollywood Reporter Rotten
## 6 m/0814255 David Germain TRUE Associated Press Rotten
## # ℹ 3 more variables: review_score <chr>, review_date <date>,
## # review_content <chr>
The dataset consists of two CSV files, which are imported as “movies” and “reviews”. “reviews” contains the review content, and “movies” contains the movie title, audience rating, and genres. The two data frames can be merged on their common column, “rotten_tomatoes_link”.
# Extract the columns that will be used
review_2 <- reviews |>
  select(rotten_tomatoes_link, review_content)
movies_2 <- movies |>
  select(rotten_tomatoes_link, movie_title, audience_rating, genres)
# Merge the two datasets on the shared key
merged <- merge(movies_2, review_2, by = "rotten_tomatoes_link")
# Remove rows that contain NA values, then drop the rotten_tomatoes_link column
merged <- merged |>
  na.omit() |>
  select(-rotten_tomatoes_link)
# Split the genres column so that each row holds a single genre
merged_genres <- merged |>
  separate_rows(genres, sep = ",\\s*")
head(merged_genres)
## # A tibble: 6 × 4
## movie_title audience_rating genres review_content
## <chr> <dbl> <chr> <chr>
## 1 Percy Jackson & the Olympians: The Ligh… 53 Actio… The pleasant …
## 2 Percy Jackson & the Olympians: The Ligh… 53 Comedy The pleasant …
## 3 Percy Jackson & the Olympians: The Ligh… 53 Drama The pleasant …
## 4 Percy Jackson & the Olympians: The Ligh… 53 Scien… The pleasant …
## 5 Percy Jackson & the Olympians: The Ligh… 53 Actio… ...great fun …
## 6 Percy Jackson & the Olympians: The Ligh… 53 Comedy ...great fun …
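As the output shows, splitting the genres column repeats each review once per genre, so anything later computed per movie from this table counts a review several times. A minimal sketch of the effect, using made-up rows (the titles and review texts are illustrative and not from the dataset), together with one possible way to get back to one row per review:

```r
library(dplyr)
library(tidyr)

# Toy data: two hypothetical movies with one review each (not from the dataset)
toy <- tibble(
  movie_title    = c("Movie A", "Movie B"),
  genres         = c("Comedy, Drama, Romance", "Horror"),
  review_content = c("great fun", "dull and slow")
)

# One row per (review, genre): the Movie A review now appears three times
toy_genres <- toy |>
  separate_rows(genres, sep = ",\\s*")

rows_per_movie <- toy_genres |>
  count(movie_title, name = "rows")
rows_per_movie

# For per-movie statistics, selecting the non-genre columns and de-duplicating
# restores one row per review (identical reviews of one movie would also collapse)
toy_reviews <- toy_genres |>
  distinct(movie_title, review_content)
toy_reviews
```

Whether to de-duplicate depends on the question: the per-genre breakdown needs the repeated rows, while per-movie totals arguably do not.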
First, we tokenize the cleaned data in “merged_genres” and remove stop words. Then, we use the AFINN lexicon to compute a sentiment score by summing the values of the matched words. Two results are obtained: one by movie title and one by genre.
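To make the scoring step concrete, here is a small sketch on toy data. The mini-lexicon below is hypothetical: it only mimics the shape of the AFINN table (a `word` column and a numeric `value` column), and its scores are made up for illustration. Stop-word removal is left out to keep the example short.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-lexicon in the same shape as get_sentiments("afinn");
# the scores are invented for this example
toy_lexicon <- tibble(
  word  = c("fun", "great", "dull", "boring"),
  value = c(3, 3, -2, -3)
)

# Made-up reviews for two hypothetical movies
toy_review_text <- tibble(
  movie_title    = c("Movie A", "Movie A", "Movie B"),
  review_content = c("Great fun for the family", "Fun but a bit dull", "Boring and dull")
)

toy_scores <- toy_review_text |>
  unnest_tokens(output = word, input = review_content) |> # one lowercase word per row
  inner_join(toy_lexicon, by = "word") |>                 # keep only words with a score
  group_by(movie_title) |>
  summarise(sentiment = sum(value), .groups = "drop")     # total score per movie
toy_scores
```

Words that are not in the lexicon simply drop out at the join, so a review only contributes through the words the lexicon happens to know.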
merged_tokens <- merged_genres |>
  unnest_tokens(output = "word", token = "words", input = review_content) |>
  anti_join(stop_words, by = "word")
# Score each movie with the AFINN lexicon (word values are summed)
m_afinn_by_movie <- merged_tokens |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(movie_title, audience_rating) |>
  summarise(sentiment = sum(value)) |>
  arrange(desc(sentiment))
m_afinn_by_movie
## # A tibble: 17,319 × 3
## # Groups: movie_title [16,742]
## movie_title audience_rating sentiment
## <chr> <dbl> <dbl>
## 1 Spider-Man: Into the Spider-Verse 93 8568
## 2 Spider-Man: Homecoming 87 6906
## 3 Spider-Man: Far From Home 95 5432
## 4 Star Wars: The Last Jedi 43 4644
## 5 Shrek 2 69 4528
## 6 Ralph Breaks the Internet 65 4488
## 7 Toy Story 4 94 4416
## 8 Shazam! 82 4380
## 9 Ant-Man 86 4028
## 10 Captain Marvel 47 3988
## # ℹ 17,309 more rows
m_afinn_by_genres <- merged_tokens |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(genres) |>
  summarise(sentiment = sum(value)) |>
  arrange(desc(sentiment))
m_afinn_by_genres
## # A tibble: 21 × 2
## genres sentiment
## <chr> <dbl>
## 1 Comedy 349464
## 2 Drama 344137
## 3 Action & Adventure 159363
## 4 Science Fiction & Fantasy 132861
## 5 Romance 114955
## 6 Kids & Family 103481
## 7 Animation 85021
## 8 Art House & International 68267
## 9 Documentary 50113
## 10 Musical & Performing Arts 45825
## # ℹ 11 more rows
Will movie categories affect viewer reviews?
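Yes. Reviewers tend to use positive words for categories such as Comedy and Drama, which have the highest total AFINN scores, and more negative words for categories such as Horror. Note that these scores are sums over all reviews, so they also depend on how many reviews each category has.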
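Because the genre scores are sums, a genre with many reviews can rank high even when its typical review is only mildly positive. One possible alternative, sketched here on toy numbers, is to score each review first and then average per genre. The `review_id` column is an assumption: the merged data in this project has no such identifier, so one would have to be added before tokenizing.

```r
library(dplyr)

# Toy word-level scores: one row per scored word (all numbers are made up)
toy_words <- tibble(
  genres    = c("Comedy", "Comedy", "Comedy", "Comedy", "Documentary", "Documentary"),
  review_id = c(1, 1, 2, 3, 4, 4), # hypothetical review identifier
  value     = c(2, 1, 2, 1, 3, 1)
)

genre_summary <- toy_words |>
  group_by(genres, review_id) |>
  summarise(review_score = sum(value), .groups = "drop") |> # score per review
  group_by(genres) |>
  summarise(
    total_sentiment = sum(review_score),  # the kind of total reported in this project
    n_reviews       = n(),
    mean_sentiment  = mean(review_score), # does not grow with the number of reviews
    .groups = "drop"
  )
genre_summary
```

In this toy case Comedy has the larger total (6 versus 4) but the lower mean per review (2 versus 4), which is the kind of reversal a sum can hide. Reviews without any scored word are not counted in `n_reviews` here.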
What are the most common words for each category?
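We picked five genres (Comedy, Drama, Romance, Horror, and Documentary) and plotted the most common words for each. From the charts, we can see that “fun” and “love” are the most common positive words. However, “funny” is the most common negative word among these genres. The most likely reason is that the Bing lexicon labels “funny” as negative, even though in a movie review it is usually praise; sarcastic reviews are another possible factor.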
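The labels behind this result can be checked directly in the lexicon. The sketch below looks up “funny” and “plot” in the Bing lexicon and then removes them before counting. Treating these two words as neutral is an assumption made for illustration, and `domain_neutral` is a hypothetical name, not part of tidytext.

```r
library(dplyr)
library(tidytext)

bing <- get_sentiments("bing")

# How does the lexicon label these two review-specific words?
bing |>
  filter(word %in% c("funny", "plot"))

# Hypothetical list of words to treat as neutral in movie reviews
domain_neutral <- tibble(word = c("funny", "plot"))

# Lexicon without the domain-neutral words; use it in place of
# get_sentiments("bing") in the joins to see how the rankings change
bing_adjusted <- bing |>
  anti_join(domain_neutral, by = "word")
```

Dropping a word is the conservative choice. Relabeling “funny” as positive would be a stronger claim about how reviewers use it.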
# Sentiment score sorted by genre
ggplot(m_afinn_by_genres, aes(x = reorder(genres, sentiment), y = sentiment)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Genres", y = "Sentiment Score") +
  coord_flip()
# Most common positive and negative words for all movie reviews
# Bing Lexicon is used here
bing_word_counts <- merged_tokens |>
inner_join(get_sentiments("bing")) |>
count(word, sentiment, sort = TRUE) |>
ungroup()
bing_word_counts
## # A tibble: 6,039 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 fun positive 63987
## 2 funny negative 54486
## 3 love positive 51058
## 4 entertaining positive 42071
## 5 plot negative 40089
## 6 bad negative 39542
## 7 hard negative 31886
## 8 humor positive 26372
## 9 fans positive 25575
## 10 classic positive 24948
## # ℹ 6,029 more rows
bing_word_counts |>
group_by(sentiment) |>
top_n(10) |>
ungroup() |>
mutate(word = reorder(word, n)) |>
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(
y = "Contribution to sentiment",
x = NULL
) +
coord_flip()
# Word cloud
merged_tokens |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) |>
  acast(word ~ sentiment, value.var = "n", fill = 0) |>
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
# Most common positive and negative words by genre
bing_word_counts2 <- merged_tokens |>
inner_join(get_sentiments("bing")) |>
group_by(genres) |>
count(word, sentiment, sort = TRUE)
bing_word_counts2
## # A tibble: 82,329 × 4
## # Groups: genres [21]
## genres word sentiment n
## <chr> <chr> <chr> <int>
## 1 Comedy funny negative 20629
## 2 Drama love positive 15024
## 3 Action & Adventure fun positive 12846
## 4 Comedy fun positive 11786
## 5 Drama funny negative 9545
## 6 Drama plot negative 9503
## 7 Drama fun positive 9240
## 8 Drama entertaining positive 8894
## 9 Comedy love positive 8590
## 10 Drama hard negative 8552
## # ℹ 82,319 more rows
genres_list <- c("Comedy", "Drama", "Romance", "Horror", "Documentary")
for (genre in genres_list) {
  # Top 10 positive and top 10 negative words for this genre (ties are kept)
  word_count1 <- bing_word_counts2 |>
    filter(genres == genre) |>
    group_by(sentiment) |>
    slice_max(order_by = n, n = 10) |>
    ungroup() |>
    mutate(word = reorder(word, n))
  # Named genre_plot to avoid masking base R's plot()
  genre_plot <- ggplot(word_count1, aes(x = word, y = n, fill = sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(
      y = "Contribution to sentiment",
      x = NULL,
      title = paste("Top 10 Words by Genre:", genre)
    ) +
    coord_flip()
  print(genre_plot)
}
The Movie Database (TMDB) API is used to extract reviews for “Spider-Man: Into the Spider-Verse”. The reviews are tokenized, and the most common positive and negative words are found in the same way as for the Rotten Tomatoes data.
“Marvel” appears as a positive word in both data sources, although in these reviews it most likely refers to the Marvel brand rather than to a feeling. “Plot” is the most common negative word in the TMDB reviews.
One drawback is that the number of reviews from the TMDB API is only 53, and most matched words occur only once, so the sample is too small to support firm conclusions.
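The request in this section is a single call, which returns one page of results. If more reviews are available on later pages, they could be collected with a loop like the sketch below. It rests on two assumptions that should be checked against the TMDB documentation: that the endpoint accepts a `page` query parameter, and that the JSON response reports `total_pages`. Both function names are hypothetical helpers, not part of any package.

```r
# Build the request URL for one page of reviews.
# Assumption: the endpoint accepts a `page` query parameter.
tmdb_reviews_url <- function(movie_id, api_key, page = 1) {
  paste0(
    "https://api.themoviedb.org/3/movie/", movie_id,
    "/reviews?api_key=", api_key, "&page=", page
  )
}

# Fetch every page and stack the review rows (hypothetical helper).
# Assumption: the parsed response has `results` and `total_pages` fields.
fetch_all_reviews <- function(movie_id, api_key) {
  pages <- list()
  page <- 1
  repeat {
    response <- httr::GET(tmdb_reviews_url(movie_id, api_key, page))
    httr::stop_for_status(response) # stop on a bad key or a failed request
    parsed <- jsonlite::fromJSON(
      httr::content(response, as = "text", encoding = "UTF-8"),
      flatten = TRUE # turn nested fields into plain columns
    )
    pages[[page]] <- parsed$results
    if (page >= parsed$total_pages) break
    page <- page + 1
  }
  dplyr::bind_rows(pages)
}
```

A call such as `fetch_all_reviews(324857, Sys.getenv("TMDB_API_KEY"))` would then return one row per review, where the environment variable name is again only an example.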
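# Read the API key from an environment variable instead of hard-coding it
# (the name TMDB_API_KEY is an assumption; set it in .Renviron, for example)
api_key <- Sys.getenv("TMDB_API_KEY")
movie_id <- 324857 # Spider-Man: Into the Spider-Verse
url <- paste0("https://api.themoviedb.org/3/movie/", movie_id, "/reviews?api_key=", api_key)
response <- GET(url)
stop_for_status(response) # fail early on a bad key or request
text_content <- content(response, as = "text", encoding = "UTF-8")
data <- fromJSON(text_content)
df <- as.data.frame(data)
# Keep only the review text
df_review <- df |>
  select(results.content)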
df_review_tokens <- df_review |>
unnest_tokens(output = "word", token = "words", input = results.content) |>
anti_join(stop_words)
df_word_counts <- df_review_tokens |>
inner_join(get_sentiments("bing")) |>
count(word, sentiment, sort = TRUE)
# All matched words are plotted (no top-10 filter), since the sample is small
df_word_counts_plot <- df_word_counts |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    title = "Spider-Man: Into the Spider-Verse",
    y = "Word count",
    x = NULL
  ) +
  coord_flip()
# Most common positive and negative words for "Spider-Man: Into the Spider-Verse"
bing_word_counts3 <- merged_tokens |>
filter(movie_title == "Spider-Man: Into the Spider-Verse") |>
inner_join(get_sentiments("bing")) |>
count(word, sentiment, sort = TRUE)
bing_word_counts3
## # A tibble: 398 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 fun positive 200
## 2 fresh positive 168
## 3 marvel positive 152
## 4 amazing positive 136
## 5 humor positive 128
## 6 funny negative 120
## 7 spectacular positive 120
## 8 dazzling positive 80
## 9 super positive 80
## 10 fast positive 72
## # ℹ 388 more rows
Spider_plot <- bing_word_counts3 |>
group_by(sentiment) |>
top_n(10) |>
ungroup() |>
mutate(word = reorder(word, n)) |>
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(
y = "Spider-Man: Into the Spider-Verse",
x = NULL
) +
coord_flip()
Spider_plot
df_word_counts
## word sentiment n
## 1 love positive 4
## 2 plot negative 4
## 3 proud positive 3
## 4 enjoy positive 2
## 5 enjoyed positive 2
## 6 hilarious positive 2
## 7 humour positive 2
## 8 marvel positive 2
## 9 masterpiece positive 2
## 10 perfect positive 2
## 11 recommend positive 2
## 12 safe positive 2
## 13 amazing positive 1
## 14 approval positive 1
## 15 awesome positive 1
## 16 bad negative 1
## 17 bored negative 1
## 18 childish negative 1
## 19 cool positive 1
## 20 dead negative 1
## 21 death negative 1
## 22 died negative 1
## 23 disappointed negative 1
## 24 doubt negative 1
## 25 easy positive 1
## 26 engaging positive 1
## 27 excellent positive 1
## 28 faith positive 1
## 29 fantastic positive 1
## 30 favorite positive 1
## 31 fresh positive 1
## 32 fun positive 1
## 33 glad positive 1
## 34 hard negative 1
## 35 hell negative 1
## 36 hype negative 1
## 37 illusion negative 1
## 38 impressed positive 1
## 39 killed negative 1
## 40 loves positive 1
## 41 nice positive 1
## 42 passion positive 1
## 43 perfectly positive 1
## 44 realistic positive 1
## 45 risks negative 1
## 46 sad negative 1
## 47 satisfy positive 1
## 48 slowed negative 1
## 49 smear negative 1
## 50 spectacular positive 1
## 51 struggles negative 1
## 52 stylish positive 1
## 53 tragic negative 1
## 54 twists negative 1
## 55 unbelievable negative 1
## 56 worth positive 1
df_word_counts_plot
Using the AFINN and Bing lexicons, we found that lighter movie categories, such as Comedy, have the highest total sentiment scores. “Fun” and “love” are the most common positive words.
Reviews from The Movie Database can also be extracted through its API and put through the same sentiment analysis.