IMDB Scraping for Hunger Games Series

Author

Mick Rathbone

Introduction

The Hunger Games is a three-book series written by Suzanne Collins that was adapted into four movies: The Hunger Games, Catching Fire, Mockingjay Part 1, and Mockingjay Part 2. Each of the movies is listed on IMDB, an online database for films, television shows, and similar media. The database also hosts user reviews for every title, so each of the four movies in the Hunger Games series has many reviews available. As someone who thoroughly enjoys this series, my goal was to scrape the reviews and see whether the sentiment scores show any trends as the series goes on.

Analysis

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.3
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Rows: 100 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): reviewer_name, reviewer_title, reviewer_content, movie_title
dbl  (2): reviewer_rating, spoiler_warning
date (1): reviewer_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What is the best movie according to reviewer ratings?

After loading the dataset, the first thing worth examining is the average reviewer rating for each movie. IMDB uses a scale from 1 to 10, with 1 being the lowest possible score and 10 being the highest. This dataset consists of 25 reviews for each of the four movies, for 100 reviews overall.

HGS %>% 
  group_by(movie_title) %>% 
  summarize(AvgRating = mean(reviewer_rating, na.rm = TRUE)) %>% 
  arrange(desc(AvgRating))
# A tibble: 4 × 2
  movie_title       AvgRating
  <chr>                 <dbl>
1 Catching Fire          7.62
2 Hunger Games           7.18
3 Mockingjay Part 2      6.68
4 Mockingjay Part 1      6.52

The above table shows that Catching Fire was the clear favorite among IMDB reviewers with an average rating of 7.62, while Mockingjay Part 1 was the least favorite with an average rating of 6.52.
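To gauge how clear the gap between movies is, the same summary can also report how many ratings went into each average and how spread out they are. The following is a minimal sketch on made-up ratings, not the real dataset: `toy_ratings`, `n_rated`, and `SdRating` are hypothetical names, and only the `movie_title` and `reviewer_rating` column names come from the original data.

```r
library(dplyr)
library(tibble)

# Made-up ratings for two placeholder movies (NOT the scraped IMDB data).
toy_ratings <- tibble(
  movie_title     = c("Movie A", "Movie A", "Movie A",
                      "Movie B", "Movie B", "Movie B"),
  reviewer_rating = c(8, 7, 9, 6, NA, 7)
)

rating_summary <- toy_ratings %>%
  group_by(movie_title) %>%
  summarize(
    n_rated   = sum(!is.na(reviewer_rating)),        # reviews that include a rating
    AvgRating = mean(reviewer_rating, na.rm = TRUE), # same average as in the analysis
    SdRating  = sd(reviewer_rating, na.rm = TRUE)    # spread of the ratings
  ) %>%
  arrange(desc(AvgRating))

print(rating_summary)
```

With only 25 reviews per movie, a small difference in averages combined with a large spread would be weaker evidence of a favorite than the averages alone suggest.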

Which movie had the most talked about scenes/plot?

HGS %>% 
  group_by(movie_title) %>% 
  summarize(Total_Spoilers = sum(spoiler_warning, na.rm = TRUE)) %>% 
  arrange(desc(Total_Spoilers))
# A tibble: 4 × 2
  movie_title       Total_Spoilers
  <chr>                      <dbl>
1 Hunger Games                  10
2 Mockingjay Part 1             10
3 Mockingjay Part 2              7
4 Catching Fire                  6

To answer this question, I counted the spoiler warnings attached to each movie's reviews and found that the original Hunger Games movie and Mockingjay Part 1 tied for the most, with 10 each. Since each movie has 25 reviews, 40% of the reviews for these two movies carry a spoiler warning, which suggests that the plot or specific scenes were discussed in those reviews, whether positively or negatively.
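The 40% figure is simply 10 flagged reviews out of 25. As a minimal sketch, assuming `spoiler_warning` is coded as 0/1 (which the column summary suggests but does not confirm), the share can be computed alongside the total by taking the mean of the flag. The `toy_spoilers` data and the `Share_Spoilers` column are hypothetical stand-ins.

```r
library(dplyr)
library(tibble)

# Toy stand-in for one movie: 25 reviews, 10 of them flagged as spoilers.
# (Hypothetical rows; only the column names mirror the real dataset.)
toy_spoilers <- tibble(
  movie_title     = rep("Hunger Games", 25),
  spoiler_warning = c(rep(1, 10), rep(0, 15))
)

spoiler_share <- toy_spoilers %>%
  group_by(movie_title) %>%
  summarize(
    Total_Spoilers = sum(spoiler_warning, na.rm = TRUE),  # count of flagged reviews
    Share_Spoilers = mean(spoiler_warning, na.rm = TRUE)  # flagged / all = 10 / 25
  )

print(spoiler_share)
```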

What were the reviewers' thoughts on the individual movies and the series as a whole based on Sentiment Analysis?

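library(tidytext)  # provides unnest_tokens(), stop_words, and get_sentiments()

# Tokenize each movie title so that title words can be removed from the reviews
tidy_word <-
  HGS %>% 
  unnest_tokens(word, movie_title)

# One row per review word, minus stop words and words from that review's movie title
tidy_HGS <- 
  HGS %>%
  unnest_tokens(word, reviewer_content) %>%
  anti_join(stop_words) %>% 
  anti_join(tidy_word)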
Joining with `by = join_by(word)`
Joining with `by = join_by(reviewer_name, reviewer_date, reviewer_rating,
reviewer_title, spoiler_warning, word)`
bing <- 
  get_sentiments("bing")

HGS_counts <- 
  tidy_HGS %>% 
  group_by(movie_title, word) %>% 
  summarize(n = n()) %>% 
  inner_join(bing)
`summarise()` has grouped output by 'movie_title'. You can override using the
`.groups` argument.
Joining with `by = join_by(word)`
HGS_counts %>% 
  group_by(movie_title) %>%
  filter(!word == 'plot',!word == 'intense', !word == 'fans') %>% 
  filter(n>5) %>% 
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~movie_title, ncol = 4) +
  geom_text(aes(label = signif(n, digits = 3)), nudge_y = 8) +
  labs(title = "Positive and Negative Words for the Hunger Games Series by Movie",
       subtitle = "Only words appearing more than 5 times are shown",
       x = "Words",
       y = "Number of Times Word Appears")

To answer this question, I first looked at the individual movies and the scorable words from their reviews. Since there are only 25 reviews per movie, I kept only words that appear more than five times in a movie's reviews (n > 5). This left between 5 and 7 words for each movie after filtering out ambiguous or uninformative words (plot, intense, and fans). Looking at the results, The Hunger Games and Catching Fire had the most positive words, which fits with the reviewer ratings from earlier. On the other end of the spectrum, Mockingjay Part 1 had only negative words appear in this visualization, suggesting that reviewers felt largely negative about this movie.
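The steps behind this plot (tokenize the review text, drop stop words, keep only words found in a sentiment lexicon, then give negative words a negative count) can be sketched on a tiny example. This is an illustration under stated assumptions, not the analysis itself: `toy_reviews`, `toy_stop`, and `toy_lexicon` are hypothetical stand-ins for the scraped reviews, `stop_words`, and the bing lexicon.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two made-up reviews (NOT from the scraped data).
toy_reviews <- tibble(
  movie_title      = c("Catching Fire", "Mockingjay Part 1"),
  reviewer_content = c("I love this amazing sequel", "A boring and slow movie")
)

# Hypothetical miniature stop-word list and sentiment lexicon.
toy_stop    <- tibble(word = c("i", "this", "a", "and"))
toy_lexicon <- tibble(
  word      = c("love", "amazing", "boring", "slow"),
  sentiment = c("positive", "positive", "negative", "negative")
)

toy_counts <- toy_reviews %>%
  unnest_tokens(word, reviewer_content) %>%      # one lowercase word per row
  anti_join(toy_stop, by = "word") %>%           # drop stop words
  inner_join(toy_lexicon, by = "word") %>%       # keep only scorable words
  count(movie_title, word, sentiment) %>%        # occurrences per movie and word
  mutate(signed_n = if_else(sentiment == "negative", -n, n))  # negatives point left

print(toy_counts)
```

Words such as "sequel" and "movie" drop out at the `inner_join()` step because the lexicon has no entry for them, which is why only a handful of words per movie survive in the plot.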

tidy_HGS %>%
  inner_join(bing) %>% 
  filter(!word == 'plot',!word == 'intense', !word == 'fans') %>%
  group_by(movie_title, sentiment) %>% 
  summarize(n = n()) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative) %>% 
  ggplot(aes(x = movie_title, y = sentiment)) +
  geom_col(position = "dodge") +
  labs(title = "Movie Positivity Scores",
       subtitle = "Positivity score is the total number of positive words minus total negative words",
       y = "Total Positivity Score",
       x = "Movie Title")
Joining with `by = join_by(word)`
`summarise()` has grouped output by 'movie_title'. You can override using the
`.groups` argument.

This visualization is notable because only one movie has a positive score based on all scorable words: Catching Fire. Unsurprisingly, Mockingjay Part 1 still scored the worst by far, but the low overall positivity score for Hunger Games was surprising given that it had the second-highest average rating.
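The positivity score is the count of positive words minus the count of negative words for each movie. A minimal sketch of that arithmetic on made-up data follows; it uses `pivot_wider()`, the current tidyr replacement for `spread()`, and `toy_scored` and the `positivity` column are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up scorable words for two placeholder movies (NOT the real reviews).
toy_scored <- tibble(
  movie_title = c("Movie A", "Movie A", "Movie A",
                  "Movie B", "Movie B", "Movie B"),
  sentiment   = c("positive", "positive", "negative",
                  "negative", "negative", "positive")
)

toy_scores <- toy_scored %>%
  count(movie_title, sentiment) %>%                 # words per movie and sentiment
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%                  # one column per sentiment
  mutate(positivity = positive - negative)          # positive minus negative

print(toy_scores)
```

Because this is a raw difference, it depends on how many scorable words a movie's reviews contain, so longer reviews can push the score further from zero in either direction.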

HGS_counts1 <- 
  tidy_HGS %>% 
  group_by(word) %>% 
  summarize(n = n()) %>% 
  inner_join(bing)
Joining with `by = join_by(word)`
HGS_counts1 %>%
  filter(!word == 'plot', !word == 'intense', !word == 'fans') %>%
  filter(n > 10) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette="Set1") +
  labs(title = "Hunger Games Series Sentiment Scores by Word",
       subtitle = "Scorable words appearing more than 10 times",
       x = "Words",
       y = "Number of Times Word Appears")

The final visualization shows the sentiment scores of scorable words across all of the movies combined, allowing for analysis of the series as a whole. This graphic shows that the two most used scorable words across all of the reviews were love and amazing, so the sentiments reviewers expressed most often were positive. Slightly more positive words than negative words also appear in the chart, in keeping with the positive theme.

Conclusion

After analyzing the data, there was a clear favorite among reviewers on IMDB: Catching Fire. Mockingjay Part 1 was just as clearly the least favorite, with both the lowest positivity score and the lowest average rating of the four movies. The analysis was also interesting because it showed that the original movie is not always the best and that sequels do not always get progressively worse, as neither was the case here.