Final Project

Introduction

Sentiment analysis uses natural language processing and machine learning techniques to identify and extract subjective information from text. The main goal of my project is to practice working with web APIs, data cleaning, sentiment analysis, and data visualization.

The Rotten Tomatoes datasets were downloaded from Kaggle, and additional reviews were extracted via The Movie Database (TMDB) API. The project aims to answer two questions:

1. Will movie categories affect viewer reviews?

2. What are the most common words for each category?

Additionally, reviews from The Movie Database will be compared with the Rotten Tomatoes reviews.

Library

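library(tidyverse)   # dplyr, tidyr, ggplot2, readr, stringr
library(tidytext)    # unnest_tokens(), get_sentiments(), stop_words
                     # (get_sentiments("afinn") also needs the textdata package installed)
library(jsonlite)    # fromJSON()
library(httr)        # GET(), content()
library(wordcloud)   # comparison.cloud()
library(reshape2)    # acast()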

Loading the dataset

The Rotten Tomatoes dataset was collected from Kaggle and downloaded to a local directory. You can find the datasets at https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset.

The dataset has been scraped from the publicly available website https://www.rottentomatoes.com as of 2020-10-31.

movies <- read_csv("C:\\Users\\tonyl\\OneDrive\\Documents\\fina_project\\rotten_tomatoes_movies.csv")
reviews <- read_csv("C:\\Users\\tonyl\\OneDrive\\Documents\\fina_project\\rotten_tomatoes_critic_reviews.csv")

head(movies)

## # A tibble: 6 × 22
##   rotten_tomatoes_link   movie_title movie_info critics_consensus content_rating
##   <chr>                  <chr>       <chr>      <chr>             <chr>         
## 1 m/0814255              Percy Jack… Always tr… Though it may se… PG            
## 2 m/0878835              Please Give Kate (Cat… Nicole Holofcene… R             
## 3 m/10                   10          A success… Blake Edwards' b… R             
## 4 m/1000013-12_angry_men 12 Angry M… Following… Sidney Lumet's f… NR            
## 5 m/1000079-20000_leagu… 20,000 Lea… In 1866, … One of Disney's … G             
## 6 m/10000_bc             10,000 B.C. Mammoth h… With attention s… PG-13         
## # ℹ 17 more variables: genres <chr>, directors <chr>, authors <chr>,
## #   actors <chr>, original_release_date <date>, streaming_release_date <date>,
## #   runtime <dbl>, production_company <chr>, tomatometer_status <chr>,
## #   tomatometer_rating <dbl>, tomatometer_count <dbl>, audience_status <chr>,
## #   audience_rating <dbl>, audience_count <dbl>,
## #   tomatometer_top_critics_count <dbl>, tomatometer_fresh_critics_count <dbl>,
## #   tomatometer_rotten_critics_count <dbl>

head(reviews)

## # A tibble: 6 × 8
##   rotten_tomatoes_link critic_name     top_critic publisher_name     review_type
##   <chr>                <chr>           <lgl>      <chr>              <chr>      
## 1 m/0814255            Andrew L. Urban FALSE      Urban Cinefile     Fresh      
## 2 m/0814255            Louise Keller   FALSE      Urban Cinefile     Fresh      
## 3 m/0814255            <NA>            FALSE      FILMINK (Australi… Fresh      
## 4 m/0814255            Ben McEachen    FALSE      Sunday Mail (Aust… Fresh      
## 5 m/0814255            Ethan Alter     TRUE       Hollywood Reporter Rotten     
## 6 m/0814255            David Germain   TRUE       Associated Press   Rotten     
## # ℹ 3 more variables: review_score <chr>, review_date <date>,
## #   review_content <chr>

Tidy the Data

The dataset consists of two CSV files, imported as “reviews” and “movies”. “reviews” contains the review content, and “movies” contains the movie title, audience rating, and genres. The two data frames can be merged on their common column, “rotten_tomatoes_link”.

# Extract the columns that will be used
# Merge the two datasets
# Remove rows that contain NA values
# Drop the rotten_tomatoes_link column
# Build a second data frame with one row per genre by splitting the comma-separated genres column

review_2 <- reviews |> 
                  select(rotten_tomatoes_link, review_content) 
movies_2 <- movies |> 
                  select(rotten_tomatoes_link, movie_title, audience_rating, genres)

merged <- merge(movies_2, review_2, by = "rotten_tomatoes_link")

merged <- merged |> 
            na.omit() |> 
            select(-rotten_tomatoes_link)

merged_genres <- merged |> 
                    separate_rows(genres, sep = ",\\s*") 

head(merged_genres)

## # A tibble: 6 × 4
##   movie_title                              audience_rating genres review_content
##   <chr>                                              <dbl> <chr>  <chr>         
## 1 Percy Jackson & the Olympians: The Ligh…              53 Actio… The pleasant …
## 2 Percy Jackson & the Olympians: The Ligh…              53 Comedy The pleasant …
## 3 Percy Jackson & the Olympians: The Ligh…              53 Drama  The pleasant …
## 4 Percy Jackson & the Olympians: The Ligh…              53 Scien… The pleasant …
## 5 Percy Jackson & the Olympians: The Ligh…              53 Actio… ...great fun …
## 6 Percy Jackson & the Olympians: The Ligh…              53 Comedy ...great fun …
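
The effect of `separate_rows()` on row counts is easier to see on a toy data frame. The sketch below uses made-up titles and reviews (not rows from the Kaggle data) and shows that a review of a film with three genres becomes three rows, which matters whenever scores are later summed per movie.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two films, one review each
toy <- tibble(
  movie_title    = c("Film A", "Film B"),
  genres         = c("Comedy, Drama, Romance", "Horror"),
  review_content = c("A charming and funny story", "Dull and predictable")
)

# One row per (review, genre) pair, the same shape as merged_genres
toy_genres <- toy |>
  separate_rows(genres, sep = ",\\s*")

# How many copies of each film's review exist after the split
dup_counts <- toy_genres |>
  count(movie_title, name = "copies")

print(toy_genres)
print(dup_counts)
```

If per-movie totals should count each review once, one option is to compute them from the data frame before the genre split.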

Sentiment Analysis

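First, we tokenize the cleaned data in “merged_genres” and remove stop words. Then, we use the AFINN lexicon, which scores words from -5 to +5, to compute a sentiment score as the sum of the word values. Two results are obtained: one by movie title and one by genre. Because “merged_genres” holds one row per review and genre, a review of a multi-genre movie is counted once per genre, so the per-movie totals are inflated accordingly.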

merged_tokens <- merged_genres |> 
                  unnest_tokens(output = "word", token = "words", input = review_content) |> 
                  anti_join(stop_words)

# The afinn lexicon
m_afinn_by_movie  <- merged_tokens |> 
                        inner_join(get_sentiments("afinn")) |> 
                        group_by(movie_title, audience_rating) |> 
                        summarise(sentiment = sum(value)) |> 
                        arrange(desc(sentiment))
m_afinn_by_movie

## # A tibble: 17,319 × 3
## # Groups:   movie_title [16,742]
##    movie_title                       audience_rating sentiment
##    <chr>                                       <dbl>     <dbl>
##  1 Spider-Man: Into the Spider-Verse              93      8568
##  2 Spider-Man: Homecoming                         87      6906
##  3 Spider-Man: Far From Home                      95      5432
##  4 Star Wars: The Last Jedi                       43      4644
##  5 Shrek 2                                        69      4528
##  6 Ralph Breaks the Internet                      65      4488
##  7 Toy Story 4                                    94      4416
##  8 Shazam!                                        82      4380
##  9 Ant-Man                                        86      4028
## 10 Captain Marvel                                 47      3988
## # ℹ 17,309 more rows

m_afinn_by_genres <- merged_tokens |> 
                        inner_join(get_sentiments("afinn")) |> 
                        group_by(genres) |> 
                        summarise(sentiment = sum(value)) |> 
                        arrange(desc(sentiment))
m_afinn_by_genres

## # A tibble: 21 × 2
##    genres                    sentiment
##    <chr>                         <dbl>
##  1 Comedy                       349464
##  2 Drama                        344137
##  3 Action & Adventure           159363
##  4 Science Fiction & Fantasy    132861
##  5 Romance                      114955
##  6 Kids & Family                103481
##  7 Animation                     85021
##  8 Art House & International     68267
##  9 Documentary                   50113
## 10 Musical & Performing Arts     45825
## # ℹ 11 more rows

Will movie categories affect viewer reviews?

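Yes. Judging by the summed AFINN scores, reviewers mostly use positive words for categories such as Comedy and Drama, and negative words for categories such as Horror. Note that these scores are totals, so genres with more reviews also accumulate larger scores.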
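
Since a summed score grows with the number of reviews, it can help to look at an average per review as well. The following is a minimal sketch on made-up reviews; `toy_reviews`, `review_id`, and `toy_lexicon` are hypothetical, and the small lexicon only stands in for AFINN (same `word` and `value` columns) so the example runs without downloading anything.

```r
library(dplyr)
library(tidytext)

# Hypothetical reviews: three for Comedy, one for Documentary
toy_reviews <- tibble(
  review_id      = 1:4,
  genres         = c("Comedy", "Comedy", "Comedy", "Documentary"),
  review_content = c("good fun", "good jokes", "good but bad pacing", "superb and good")
)

# Hypothetical stand-in for the AFINN lexicon
toy_lexicon <- tibble(
  word  = c("good", "fun", "bad", "superb"),
  value = c(3, 4, -3, 5)
)

genre_scores <- toy_reviews |>
  unnest_tokens(output = "word", input = review_content) |>
  inner_join(toy_lexicon, by = "word") |>
  group_by(genres) |>
  summarise(
    total_score     = sum(value),
    # counts only reviews that contain at least one lexicon word
    n_reviews       = n_distinct(review_id),
    mean_per_review = total_score / n_reviews
  )

print(genre_scores)
```

In this toy data Comedy has the larger total only because it has more reviews; the per-review mean ranks Documentary higher.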

What are the most common words for each category?

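We picked 5 genres (Comedy, Drama, Romance, Horror, and Documentary) and charted the most common words for each. From the charts, we can see that “fun” and “love” are the most common positive words. However, “funny” is the most common negative word among these genres. One possible reason is that reviewers are being sarcastic; another is that the Bing lexicon assigns each word a fixed label regardless of context, so “funny” (like “plot”) counts as negative even when it is used as praise.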
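
To check how the lexicon itself labels these words, independent of any review, the Bing lexicon can be filtered directly. This is a small sketch; the four words are simply the ones discussed here.

```r
library(dplyr)
library(tidytext)

# Look up the fixed label the Bing lexicon gives to a few review words
bing_labels <- get_sentiments("bing") |>
  filter(word %in% c("fun", "funny", "love", "plot"))

print(bing_labels)
```

The labels match the `sentiment` column in the word-count tables in this section: the lexicon has no notion of context, so a film-review use of “funny” or “plot” is scored the same as any other use.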

# Sentiment score sorted by genre

ggplot(m_afinn_by_genres, aes(x = reorder(genres, sentiment), y = sentiment)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Genres", y = "Sentiment Score") +
  coord_flip()

# Most common positive and negative words for all movie reviews
# Bing Lexicon is used here

bing_word_counts <- merged_tokens |> 
  inner_join(get_sentiments("bing")) |> 
  count(word, sentiment, sort = TRUE) |> 
  ungroup()

bing_word_counts

## # A tibble: 6,039 × 3
##    word         sentiment     n
##    <chr>        <chr>     <int>
##  1 fun          positive  63987
##  2 funny        negative  54486
##  3 love         positive  51058
##  4 entertaining positive  42071
##  5 plot         negative  40089
##  6 bad          negative  39542
##  7 hard         negative  31886
##  8 humor        positive  26372
##  9 fans         positive  25575
## 10 classic      positive  24948
## # ℹ 6,029 more rows

bing_word_counts |> 
  group_by(sentiment) |> 
  top_n(10) |> 
  ungroup() |> 
  mutate(word = reorder(word, n)) |> 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

# Word cloud
merged_tokens |> 
  inner_join(get_sentiments("bing")) |> 
  count(word, sentiment, sort = TRUE) |> 
  acast(word ~ sentiment, value.var = "n", fill = 0) |> 
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)

# Most common positive and negative words by genre

bing_word_counts2 <- merged_tokens |> 
  inner_join(get_sentiments("bing")) |>
  group_by(genres) |> 
  count(word, sentiment, sort = TRUE) 

bing_word_counts2

## # A tibble: 82,329 × 4
## # Groups:   genres [21]
##    genres             word         sentiment     n
##    <chr>              <chr>        <chr>     <int>
##  1 Comedy             funny        negative  20629
##  2 Drama              love         positive  15024
##  3 Action & Adventure fun          positive  12846
##  4 Comedy             fun          positive  11786
##  5 Drama              funny        negative   9545
##  6 Drama              plot         negative   9503
##  7 Drama              fun          positive   9240
##  8 Drama              entertaining positive   8894
##  9 Comedy             love         positive   8590
## 10 Drama              hard         negative   8552
## # ℹ 82,319 more rows

genres_list <- c("Comedy", "Drama", "Romance", "Horror", "Documentary")

for (genre in genres_list) {
  word_count1 <- bing_word_counts2 |>
    filter(genres == genre) |>
    group_by(sentiment) |>
    top_n(10, n) |>
    ungroup() |>
    mutate(word = reorder(word, n))
  
  # Named genre_plot to avoid masking base R's plot()
  genre_plot <- ggplot(word_count1, aes(x = word, y = n, fill = sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(
      y = "Contribution to sentiment",
      x = NULL,
      title = paste("Top 10 Words by Genre:", genre)
    ) +
    coord_flip()
  
  print(genre_plot)
}

Comparing with The Movie Database

The Movie Database API is used to extract reviews for “Spider-Man: Into the Spider-Verse”. The reviews are tokenized, and the most common negative and positive words are identified.

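“Marvel” is a frequent positive word in both data sources, although it is not the top positive word in either: that is “fun” in the Rotten Tomatoes reviews and “love” in the Movie Database reviews. “Plot” is the most common negative word in the Movie Database reviews, while in the Rotten Tomatoes reviews that position is taken by “funny”.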

One drawback is that the number of reviews from the Movie Database API is only 53, and most words occur only once. The sample is too small to support statistically meaningful conclusions.
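
If the endpoint splits its reviews across several pages, a single request sees only part of them. The sketch below shows one way to collect every page, assuming the parsed response is a list with a `results` data frame and a `total_pages` count. `fetch_all_reviews()` and `fetch_page` are hypothetical names, and `fake_fetch()` is an offline stand-in for the API so the example runs without a key.

```r
# Collect the `results` of every page into one data frame.
# `fetch_page` is a hypothetical helper: a function(page) that returns the
# parsed JSON for one page as a list with `results` and `total_pages`.
fetch_all_reviews <- function(fetch_page) {
  first <- fetch_page(1)
  pages <- list(first$results)
  if (first$total_pages > 1) {
    for (p in 2:first$total_pages) {
      pages[[p]] <- fetch_page(p)$results
    }
  }
  do.call(rbind, pages)
}

# Offline stand-in for the API: three pages with two reviews each
fake_fetch <- function(page) {
  list(
    results     = data.frame(content = paste("review", (page - 1) * 2 + 1:2)),
    total_pages = 3
  )
}

all_reviews <- fetch_all_reviews(fake_fetch)
print(nrow(all_reviews))
```

With the real API, `fetch_page` would wrap the `GET()` and `fromJSON()` calls and pass the page number as a query parameter (assumed here to be named `page`). For real responses with nested columns, `jsonlite::rbind_pages()` may be a better fit than `rbind()`.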

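# Read the API key from an environment variable instead of hard-coding it
# (the variable name TMDB_API_KEY is an assumption; use whichever name you set)
api_key <- Sys.getenv("TMDB_API_KEY")
movie_id <- 324857 # Spider-Man: Into the Spider-Verse
url <- paste0("https://api.themoviedb.org/3/movie/", movie_id, "/reviews?api_key=", api_key)
response <- GET(url)
stop_for_status(response) # fail early on a bad key or network error
text_content <- content(response, as = "text", encoding = "UTF-8")

data <- fromJSON(text_content)
df <- as.data.frame(data)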

df_review <- df |> 
              select(results.content)

df_review_tokens <- df_review |> 
                        unnest_tokens(output = "word", token = "words", input = results.content) |> 
                        anti_join(stop_words)

df_word_counts <- df_review_tokens |>
                        inner_join(get_sentiments("bing")) |>
                        count(word, sentiment, sort = TRUE) 

df_word_counts_plot <- df_word_counts |> 
                        mutate(word = reorder(word, n)) |> 
                        ggplot(aes(word, n, fill = sentiment)) +
                        geom_col(show.legend = FALSE) +
                        facet_wrap(~sentiment, scales = "free_y") +
                        labs(
                          y = "Spider-Man: Into the Spider-Verse",
                          x = NULL
                        ) +
                        coord_flip()

# Most common positive and negative words for "Spider-Man: Into the Spider-Verse"

bing_word_counts3 <- merged_tokens |>
                        filter(movie_title == "Spider-Man: Into the Spider-Verse") |> 
                        inner_join(get_sentiments("bing")) |>
                        count(word, sentiment, sort = TRUE) 

bing_word_counts3

## # A tibble: 398 × 3
##    word        sentiment     n
##    <chr>       <chr>     <int>
##  1 fun         positive    200
##  2 fresh       positive    168
##  3 marvel      positive    152
##  4 amazing     positive    136
##  5 humor       positive    128
##  6 funny       negative    120
##  7 spectacular positive    120
##  8 dazzling    positive     80
##  9 super       positive     80
## 10 fast        positive     72
## # ℹ 388 more rows

Spider_plot <- bing_word_counts3 |> 
                        group_by(sentiment) |> 
                        top_n(10) |> 
                        ungroup() |> 
                        mutate(word = reorder(word, n)) |> 
                        ggplot(aes(word, n, fill = sentiment)) +
                        geom_col(show.legend = FALSE) +
                        facet_wrap(~sentiment, scales = "free_y") +
                        labs(
                          y = "Spider-Man: Into the Spider-Verse",
                          x = NULL
                        ) +
                        coord_flip()

Spider_plot

df_word_counts

##            word sentiment n
## 1          love  positive 4
## 2          plot  negative 4
## 3         proud  positive 3
## 4         enjoy  positive 2
## 5       enjoyed  positive 2
## 6     hilarious  positive 2
## 7        humour  positive 2
## 8        marvel  positive 2
## 9   masterpiece  positive 2
## 10      perfect  positive 2
## 11    recommend  positive 2
## 12         safe  positive 2
## 13      amazing  positive 1
## 14     approval  positive 1
## 15      awesome  positive 1
## 16          bad  negative 1
## 17        bored  negative 1
## 18     childish  negative 1
## 19         cool  positive 1
## 20         dead  negative 1
## 21        death  negative 1
## 22         died  negative 1
## 23 disappointed  negative 1
## 24        doubt  negative 1
## 25         easy  positive 1
## 26     engaging  positive 1
## 27    excellent  positive 1
## 28        faith  positive 1
## 29    fantastic  positive 1
## 30     favorite  positive 1
## 31        fresh  positive 1
## 32          fun  positive 1
## 33         glad  positive 1
## 34         hard  negative 1
## 35         hell  negative 1
## 36         hype  negative 1
## 37     illusion  negative 1
## 38    impressed  positive 1
## 39       killed  negative 1
## 40        loves  positive 1
## 41         nice  positive 1
## 42      passion  positive 1
## 43    perfectly  positive 1
## 44    realistic  positive 1
## 45        risks  negative 1
## 46          sad  negative 1
## 47      satisfy  positive 1
## 48       slowed  negative 1
## 49        smear  negative 1
## 50  spectacular  positive 1
## 51    struggles  negative 1
## 52      stylish  positive 1
## 53       tragic  negative 1
## 54       twists  negative 1
## 55 unbelievable  negative 1
## 56        worth  positive 1

df_word_counts_plot
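
Because the two sources differ so much in size, raw counts are hard to compare side by side. One option is to convert counts to shares within each source and join on the words they have in common. The sketch below uses a few rows copied from the two printed tables above, so the shares are relative to these truncated tables only; `rt`, `tmdb`, and `shared` are illustrative names.

```r
library(dplyr)

# A few rows copied from the printed Rotten Tomatoes table (bing_word_counts3)
rt <- tibble(
  word = c("fun", "fresh", "marvel", "amazing", "funny"),
  n    = c(200, 168, 152, 136, 120)
)

# A few rows copied from the printed Movie Database table (df_word_counts)
tmdb <- tibble(
  word = c("love", "plot", "marvel", "amazing", "fresh", "fun"),
  n    = c(4, 4, 2, 1, 1, 1)
)

# Share of each word within its own table, kept only for words in both sources
shared <- inner_join(
  rt   |> mutate(rt_share = n / sum(n)) |> select(word, rt_share),
  tmdb |> mutate(tmdb_share = n / sum(n)) |> select(word, tmdb_share),
  by = "word"
)

print(shared)
```

Applied to the full `bing_word_counts3` and `df_word_counts` tables, the same join would list every sentiment word the two sources share.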

Conclusion

Using the AFINN and Bing lexicons, we found that lighter movie categories, such as Comedy, usually have a positive sentiment score. “Fun” and “love” are the most common positive words.

Reviews from The Movie Database can also be extracted through its API and put through the same sentiment analysis, although the sample for a single movie is small.