data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
library(ggplot2)
library(dplyr)
Genre:
The genre column was unclear because each value is formatted as a list of dictionaries rather than a simple string. Instead of just listing genres like “Comedy” or “Drama,” it contains entries like
[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}].
The documentation clarified that some columns, including genre, are stored as stringified JSON objects.
A movie can have several genres, and a stringified JSON object lets all of them, along with their IDs, fit in a single cell. This is most likely why the column was encoded this way.
If I hadn’t read the documentation, I might have assumed the column was a plain string, which would have made it difficult to extract the individual genres for statistical analysis.
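As a rough illustration of what extracting genres from such a value could look like, here is a minimal base R sketch. The helper `extract_names()` is hypothetical (it is not part of the original analysis), and it assumes every name is wrapped in single quotes exactly as in the genre example above; names that contain an apostrophe would need a proper parser instead.

```r
# Hypothetical helper (not from the original analysis): pull the 'name'
# values out of one stringified list such as the genre example.
# Assumption: names are single-quoted and contain no apostrophes.
extract_names <- function(x) {
  # Find every "'name': '<value>'" pair in the string
  hits <- regmatches(x, gregexpr("'name': '[^']*'", x))[[1]]
  # Strip the key and the quotes, keeping only the value
  sub("^'name': '(.*)'$", "\\1", hits)
}

example <- "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"
genre_names <- extract_names(example)
genre_names            # "Animation" "Comedy" "Family"
extract_names("[]")    # character(0): no genres recorded
```

Applied row by row (for example with `lapply(data$genre, extract_names)`), this would give one character vector of genre names per movie.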
ID and IMDB_ID:
I found it confusing that the data set has both an IMDB_ID column and a plain ID column. The documentation explained that the ID column holds the TMDB identifier, while the IMDB_ID column holds the IMDB identifier.
It was most likely encoded this way so that each movie can be referenced in both TMDB and IMDB.
If I hadn’t read the documentation, I wouldn’t have known what the ID corresponds to and would most likely have assumed it was an ID internal to the database.
Production_Countries:
I found the production countries column unclear because it is formatted as a list of dictionaries, similar to the genre column. For example, instead of “USA” it contains
[{'iso_3166_1': 'US', 'name': 'United States of America'}].
The documentation clarified that columns like this one are not stored as simple strings.
It was most likely encoded this way to store each country’s name together with its ISO 3166-1 code, making it more readable and easier to access.
If I hadn’t read the documentation, I wouldn’t have known why it was stored that way and would most likely have assumed it was a normal string.
data$revenue <- as.numeric(data$revenue)
data$vote_average <- as.numeric(data$vote_average)
ggplot(data, aes(x = revenue)) +
  geom_histogram(bins = 50, fill = 'red', alpha = 0.6) +
  geom_histogram(data = data %>% filter(revenue == 0),
                 bins = 50, fill = 'blue', alpha = 0.6) +
  labs(title = 'Revenue Distribution (with 0 values highlighted)',
       x = 'Revenue', y = 'Count') +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(data, aes(x = revenue, y = vote_average)) +
  geom_point(aes(color = (revenue == 0)), alpha = 0.6) +
  scale_color_manual(values = c('red', 'blue'), name = 'Zero Revenue') +
  labs(title = 'Revenue vs. Movie Rating (Vote Average)',
       x = 'Revenue', y = 'Vote Average') +
  theme_minimal()
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).
The first graph shows how many rows have a revenue of 0, and the count is extremely high. The second graph plots revenue against vote average. It even shows that some movies have a rating of 10 yet report 0 revenue.
One risk of having many 0 values that may be wrong is that they pull down the average and total revenue across the data set.
Another risk is that they skew the distribution towards 0, making the data set difficult to analyze statistically.
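To make the first risk concrete, here is a minimal sketch on a made-up revenue vector (the numbers are illustrative and not taken from the data set). It assumes that a revenue of 0 means "not reported" and recodes those values as NA before averaging; whether that assumption holds for every row is not something the data confirms.

```r
# Toy revenue values (hypothetical, for illustration only)
toy_revenue <- c(0, 0, 0, 100, 200, 300)

# Mean with the zeros treated as real values
mean_with_zeros <- mean(toy_revenue)

# Assumption: a revenue of 0 means "not reported", so recode it as NA
toy_revenue_na <- replace(toy_revenue, toy_revenue == 0, NA)
mean_without_zeros <- mean(toy_revenue_na, na.rm = TRUE)

mean_with_zeros     # 100
mean_without_zeros  # 200
```

In this toy case the zeros cut the average in half, which is the kind of undervaluation described above.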
Genre:
explicit_missing <- sum(is.na(data$genre))
print(paste("Explicitly Missing Genres:", explicit_missing))
## [1] "Explicitly Missing Genres: 0"
implicit_missing <- sum(data$genre == "" | data$genre == "[]")
print(paste("Implicitly Missing Genres:", implicit_missing))
## [1] "Implicitly Missing Genres: 2442"
There are no explicitly missing rows, but there are 2,442 implicitly missing rows. As for empty groups, I was unable to find any.
Insight: While no genres are explicitly missing, 2,442 movies have no genre data recorded. This could lead to issues in genre-based recommendations and analysis.
Spoken_Languages:
Explicitly_missing_spoken_languages <- sum(is.na(data$spoken_languages))
Implicitly_missing_spoken_languages <- sum(data$spoken_languages == "" | data$spoken_languages == "[]")
print(paste("Explicitly Missing Spoken Languages:", Explicitly_missing_spoken_languages))
## [1] "Explicitly Missing Spoken Languages: 0"
print(paste("Implicitly Missing Spoken Languages:", Implicitly_missing_spoken_languages))
## [1] "Implicitly Missing Spoken Languages: 3835"
There are no explicitly missing rows, but there are 3,835 implicitly missing rows. As for empty groups, I was unable to find any.
Insight: Some films have no spoken language recorded, which suggests their documentation is incomplete. This could be because some lesser-known or older movies were not documented correctly.
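The explicit and implicit checks used above follow the same pattern for each column, so they could be wrapped in one small function. This is only a sketch: `count_missing()` is a hypothetical helper, and it assumes that an empty string or an empty list `"[]"` are the only ways a value can be implicitly missing.

```r
# Hypothetical helper: count explicit (NA) and implicit ("" or "[]")
# missing values in a character column.
count_missing <- function(x) {
  explicit <- sum(is.na(x))
  # %in% returns FALSE for NA, so NA values are not double counted
  implicit <- sum(x %in% c("", "[]"))
  c(explicit = explicit, implicit = implicit)
}

# Toy column (made-up values, for illustration only)
toy_column <- c("[{'id': 35, 'name': 'Comedy'}]", "[]", "", NA)
toy_counts <- count_missing(toy_column)
toy_counts   # explicit = 1, implicit = 2
```

On the real data this would be called as `count_missing(data$spoken_languages)`. Using `%in%` instead of `==` keeps the implicit count from turning into NA if a column does contain explicit NA values.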
Popularity:
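# Coerce popularity to numeric in case read.csv() parsed it as character
data$popularity <- as.numeric(data$popularity)
ggplot(data, aes(x = seq_len(nrow(data)), y = popularity)) +
  geom_point(color = "red", alpha = 0.6) +
  labs(title = "Scatter Plot of Movie Popularity", x = "Index", y = "Popularity") +
  theme_minimal()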
In the scatter plot for popularity, I would define anything above 200 as an outlier, as most popularity values are lower than 200.
Insight: As we can see, it’s very rare for a movie to have a popularity score over 200, with only a few movies reaching that level. These movies likely correspond to blockbuster films.
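The cutoff of 200 was chosen by eye from the plot. A minimal sketch of how that rule could be applied in code is shown below, using made-up popularity scores rather than the real column; the names `cutoff` and `is_outlier` are illustrative.

```r
# Toy popularity scores (hypothetical, for illustration only)
toy_popularity <- c(0.5, 1.2, 3.4, 8.9, 15.0, 22.1, 294.3, 547.5)

# Cutoff chosen by eye from the scatter plot
cutoff <- 200
is_outlier <- toy_popularity > cutoff

sum(is_outlier)              # 2 values exceed the cutoff
toy_popularity[is_outlier]   # the flagged values: 294.3 547.5
```

On the real data, `sum(data$popularity > 200, na.rm = TRUE)` would give the number of movies treated as outliers, assuming popularity has been converted to numeric first.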