BAIS 462 Final Project

Author

Jack Griffith

Movie Analysis

Introduction

I have been a fan of movies for as long as I can remember, and many iconic scenes are burned into my brain. Movies are an art form that can stir all sorts of emotions. As fascinating as watching movies is, analyzing them has become another passion of mine. For this project, I analyze the highest-rated movies on TMDB (The Movie Database) using a dataset of its top-rated titles.
After exploring trends in ratings, popularity, and vote counts, I supplemented the dataset with an API call to OMDb to obtain Metascore values.

Primary Dataset

The primary dataset comes from TMDB and contains fields such as title, rating (vote_average), vote total (vote_count), popularity, and release date.

library(readr)
library(ggpmisc)

Warning: package 'ggpmisc' was built under R version 4.4.3

Loading required package: ggpp

Loading required package: ggplot2

Registered S3 methods overwritten by 'ggpp':
  method                  from   
  heightDetails.titleGrob ggplot2
  widthDetails.titleGrob  ggplot2


Attaching package: 'ggpp'

The following object is masked from 'package:ggplot2':

    annotate

library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(httr)
library(jsonlite) 
library(ggplot2) 
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

movie_data <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/griffithj6_xavier_edu/IQBtSHUA2dm9SpeVjdwyUDz3AdAyMx4WH6D4P4PPPgoaLkI?download=1")

New names:
• `` -> `...1`

Rows: 9980 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): title
dbl  (5): ...1, id, vote_average, vote_count, popularity
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive Analysis

Graph 1 — Distribution of TMDB Ratings

ggplot(movie_data, aes(x = vote_average)) +
geom_histogram(bins = 20, fill = "blue", color = "black") +
labs(title = "Distribution of Movie Ratings",
x = "TMDB Vote Average",
y = "Count")

This histogram shows the distribution of TMDB vote averages for the movies in the dataset. Most movies cluster between roughly 6 and 7.5, which indicates that the dataset consists mainly of well-rated films. Far fewer movies have low ratings, which suggests that the collection focuses on good movies more than bad ones. The histogram gives a quick picture of the overall quality level in the dataset.
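The "most movies fall between 6 and 7.5" reading can be backed by a number. The sketch below is only an illustration: the `ratings` vector is hypothetical data standing in for `movie_data$vote_average`, and the 6 to 7.5 band is taken from the paragraph above.

```r
# Hypothetical ratings standing in for movie_data$vote_average
ratings <- c(5.8, 6.1, 6.4, 6.6, 6.9, 7.0, 7.2, 7.4, 7.8, 8.3)

# TRUE for every rating inside the 6 to 7.5 band described in the text
in_band <- ratings >= 6 & ratings <= 7.5

# The mean of a logical vector is the share of TRUE values
share_in_band <- mean(in_band)

# Min, quartiles, median, mean, and max as a numeric companion to the histogram
rating_summary <- summary(ratings)

print(share_in_band)
print(rating_summary)
```

On the real data, replacing `ratings` with `movie_data$vote_average` should give the share of the dataset that sits in that band.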

Graph 2 — Vote Average vs Vote Count (With Means)

ggplot(movie_data, aes(x = vote_average, y = vote_count)) +
geom_point(alpha = 0.6) +
geom_hline(aes(yintercept = mean(vote_count)),
color = "red", linetype = "dashed") +
geom_vline(aes(xintercept = mean(vote_average)),
color = "blue", linetype = "dashed") +
labs(title = "Vote Average vs Vote Count",
x = "TMDB Rating (0–10)",
y = "Number of Votes")

This scatter plot shows how the number of votes relates to the average rating of each movie. The red dashed horizontal line marks the mean vote count (2056.24), and the blue dashed vertical line marks the mean rating (6.71). Movies with high ratings do not always have the most votes, and movies with many votes can still have below-average ratings. In other words, how many people vote on a movie does not track closely with how highly it is rated.
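The two dashed mean lines split the plot into four quadrants, and counting the movies in each one makes the observation concrete. This is a minimal sketch on a hypothetical `votes` data frame (not the TMDB data); the column names match the dataset, while `rating_side`, `count_side`, and `quadrants` are names introduced here for illustration.

```r
# Hypothetical sample with the same column names as movie_data
votes <- data.frame(
  vote_average = c(6.2, 6.8, 7.1, 7.9, 8.4, 6.5),
  vote_count   = c(300, 12000, 850, 400, 21000, 1500)
)

# The same means that the dashed lines in the plot represent
avg_rating <- mean(votes$vote_average)
avg_count  <- mean(votes$vote_count)

# Label each movie by which side of each mean it falls on
votes$rating_side <- ifelse(votes$vote_average >= avg_rating,
                            "above avg rating", "below avg rating")
votes$count_side  <- ifelse(votes$vote_count >= avg_count,
                            "above avg votes", "below avg votes")

# Cross-tabulate to count the movies in each quadrant
quadrants <- table(votes$rating_side, votes$count_side)
print(quadrants)
```

Movies in the "above avg rating / below avg votes" cell are the highly rated but less widely voted titles that the paragraph describes.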

Graph 3 — Popularity vs Vote Average

ggplot(movie_data, aes(x = popularity, y = vote_average)) +
geom_point(alpha = 0.6) +
geom_smooth() +
coord_cartesian(xlim = c(0, quantile(movie_data$popularity, 0.95))) +
labs(title = "Popularity vs Vote Average",
x = "TMDB Popularity Score",
y = "TMDB Rating")

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

This scatter plot compares TMDB popularity scores with average movie ratings. The x-axis is limited to the 95th percentile of popularity so that a few extremely popular titles do not compress the rest of the plot. There is a slight trend of higher-rated movies being more popular, but the relationship is weak. Some lower-rated movies still have high popularity scores, which suggests that factors other than quality, such as marketing and star power, can influence popularity.
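One way to put a number on "not very strong" is a rank correlation, which is less sensitive to the long right tail of popularity than Pearson's r. The sketch below uses base R's `cor()` on a small hypothetical data frame `pop`; choosing Spearman here is an assumption of this example, not something the analysis above used.

```r
# Hypothetical sample with the same column names as movie_data
pop <- data.frame(
  popularity   = c(12.5, 48.0, 7.3, 95.1, 20.4, 33.8),
  vote_average = c(6.4, 7.1, 6.9, 6.2, 7.8, 6.6)
)

# Spearman correlation works on ranks, so extreme popularity values
# cannot dominate the result the way they can with Pearson's r
rho <- cor(pop$popularity, pop$vote_average, method = "spearman")

print(rho)
```

A value near 0 indicates little monotonic relationship, and values near 1 or -1 indicate a strong one. On the real data, adding `use = "complete.obs"` would skip rows with missing values.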

Graph 4 — Popularity vs Vote Count

ggplot(movie_data, aes(x = popularity, y = vote_count)) +
geom_point(alpha = 0.6) +
geom_smooth() +
coord_cartesian(
xlim = c(0, quantile(movie_data$popularity, 0.95)),
ylim = c(0, NA)
) +
labs(title = "Popularity vs Vote Count",
x = "Popularity Score",
y = "Number of Votes")

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

This graph examines how a movie’s popularity relates to the number of votes it receives. There is a clear positive trend: movies with higher popularity tend to have more votes. The smoothing line shows this relationship.

Graph 5 — Number of Movies by Release Year

movie_data$date <- ymd(movie_data$release_date)
movie_data$release_year <- year(movie_data$date)

ggplot(movie_data, aes(x = release_year)) +
geom_bar(fill = "red", color = "black") +
labs(
title = "Number of Top-Rated Movies by Release Year",
x = "Release Year",
y = "Count of Movies"
)

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_count()`).

This bar chart displays how many of the top-rated movies in the dataset were released each year. The peaks show that far more of these films were released from the 2000s onward than in earlier decades. In other words, the highly rated movies in this dataset skew toward recent releases. Two movies have no release date and are left out of the chart.
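Grouping the years into decades gives a compact table to read alongside the bar chart. This is a rough sketch on a hypothetical `release_year` vector (including one missing value, as in the dataset); on the real data it would be `movie_data$release_year`.

```r
# Hypothetical release years; NA stands in for a movie with no release date
release_year <- c(1957, 1972, 1974, 1993, 1994, 1999, 2001, 2008, 2014, 2019, NA)

# Round each year down to the start of its decade, e.g. 1994 -> 1990
decade <- floor(release_year / 10) * 10

# Count movies per decade, keeping a separate bucket for missing dates
decade_counts <- table(decade, useNA = "ifany")

print(decade_counts)
```

Keeping the `NA` bucket visible makes it clear how many rows are excluded, which is what the `stat_count()` warning reports.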

Secondary Data Source (OMDb API)

I used the OMDb API, which offers a free tier, to fetch the Metascore and release year for the 10 highest-rated movies in the TMDB dataset.

top10 <- data.frame(
  title = c("The Shawshank Redemption", "The Godfather", "The Godfather Part II",
            "Schindler's List", "12 Angry Men", "Spirited Away", 
            "The Dark Knight", "The Green Mile", 
            "Dilwale Dulhania Le Jayenge", "GoodFellas"),
  vote_average = c(8.711, 8.685, 8.571, 8.566, 8.548, 8.534, 8.524, 8.502, 8.500, 8.500))


omdb_api_key <- "14f1e012"

metascores <- c()
years <- c()

# Reuse the titles already listed in top10 instead of typing them again
movies <- top10$title

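for (movie in movies) {
  print(paste("Fetching:", movie))  # see progress
  # Build the request URL; URLencode() escapes spaces in the title
  url <- paste0("http://www.omdbapi.com/?t=", URLencode(movie), "&apikey=", omdb_api_key)
  res <- httr::GET(url)
  # Parse the JSON body; setting the encoding avoids httr's encoding-guess message
  res_json <- jsonlite::fromJSON(httr::content(res, "text", encoding = "UTF-8"))
  # OMDb returns "N/A" when a movie has no Metascore; as.numeric() turns
  # that into NA and emits a coercion warning
  metascores <- c(metascores, as.numeric(res_json$Metascore))
  years <- c(years, as.numeric(res_json$Year))
  Sys.sleep(0.2)  # short pause between requests
}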

[1] "Fetching: The Shawshank Redemption"
[1] "Fetching: The Godfather"
[1] "Fetching: The Godfather Part II"
[1] "Fetching: Schindler's List"
[1] "Fetching: 12 Angry Men"
[1] "Fetching: Spirited Away"
[1] "Fetching: The Dark Knight"
[1] "Fetching: The Green Mile"
[1] "Fetching: Dilwale Dulhania Le Jayenge"

Warning: NAs introduced by coercion

[1] "Fetching: GoodFellas"

omdb_data <- data.frame(
  title = movies,
  metascore = metascores,
  year = years
)

top10_data <- merge(top10, omdb_data, by = "title")

ggplot(top10_data, aes(x = vote_average, y = metascore)) +
  geom_point(color = "steelblue", size = 3) +
  # Map label only in geom_text() so geom_smooth() does not drop it with a warning
  geom_text(aes(label = title), vjust = -1, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "TMDB Rating vs Metascore for Top 10 Movies",
    x = "TMDB Rating",
    y = "Metascore"
  ) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_text()`).

The top rated movies in TMDB were:

The Shawshank Redemption
The Godfather
The Godfather Part II
Schindler’s List
12 Angry Men
Spirited Away
The Dark Knight
The Green Mile
Dilwale Dulhania Le Jayenge
GoodFellas

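I combined these movies with their Metascores from the OMDb API and plotted the two ratings against each other. Dilwale Dulhania Le Jayenge has no Metascore in OMDb, so it is missing from the plot, which is why one row is removed in the warnings above. Most of the remaining movies have very high Metascores, but The Green Mile stands out with a much lower one. It is surprising to see a movie that audiences rate so highly score so much lower with critics.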
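The "NAs introduced by coercion" warning in the fetch loop appears when a Metascore field cannot be converted to a number. A small helper can make that case explicit. The sketch below is an assumption-laden illustration: `parse_metascore` is a hypothetical function name, the `raw` list is made-up input rather than real API responses, and it assumes that a missing score arrives either as the string "N/A" or as a missing field (`NULL`).

```r
# Hypothetical helper: convert a raw Metascore field to a number,
# returning NA without a warning when no score is available
parse_metascore <- function(x) {
  if (is.null(x) || length(x) == 0 || is.na(x) || x == "N/A") {
    return(NA_real_)
  }
  as.numeric(x)
}

# Made-up raw values: two scores, one "N/A", and one missing field
raw <- list("82", "N/A", NULL, "100")

# vapply() guarantees exactly one numeric value per input element
scores <- vapply(raw, parse_metascore, numeric(1))

# Metascores run from 0 to 100, so dividing by 10 puts them
# on the same 0 to 10 scale as the TMDB rating
scores_on_tmdb_scale <- scores / 10

print(scores)
print(scores_on_tmdb_scale)
```

Returning `NA` for a missing field also keeps the result the same length as the list of titles, which matters when the vectors are later combined into a data frame.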

Conclusion

In conclusion, this analysis highlights how different metrics, such as TMDB ratings, vote counts, popularity, and Metascores, each provide a different view of what makes a movie highly regarded. The TMDB ratings and vote counts reflect audience engagement, while the Metascores give more of a critic's perspective. Combining the primary data from TMDB with supplemental information from the OMDb API gave me a clearer picture of these top-rated movies and showed that audience opinions and critical reviews can differ considerably. Overall, this project deepened my appreciation for movies and the many factors that contribute to what makes them good.