data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)

library(tidyverse)
library(ggplot2)
library(dplyr)

Cleared up Columns

  1. Genre:

    • The genre column was unclear because it was formatted as a list of dictionaries instead of a simple string. Instead of just listing genres like “Comedy” or “Drama,” it contained entries like [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]. The documentation clarified that some columns, including genre, are stored as stringified JSON objects.

    • If a movie has multiple genres, I believe it’s easier to store them together as one stringified JSON object, which is most likely why the column was encoded this way.

    • If I hadn’t read the documentation, I might have assumed the column held a plain genre label, and I would have struggled to extract the individual genres for statistical analysis.

  2. ID and IMDB_ID:

    • I found it confusing that the data set has both an IMDB_ID and a plain ID. The documentation explained that the ID column is actually the TMDB ID.

    • The IMDB_ID column stores the IMDB identifier. The two were most likely kept side by side so that each movie can be referenced in both TMDB and IMDB.

    • If I hadn’t read the documentation, I wouldn’t have known what the ID corresponds to and most likely would have assumed it was an ID internal to the database.

  3. Production_Countries:

    • I found the production countries column to be unclear as it was formatted as a list of dictionaries, similar to the genre column. For example, instead of having “USA” it had [{'iso_3166_1': 'US', 'name': 'United States of America'}]. The documentation clarified that these values weren’t stored as simple strings.

    • It was most likely encoded this way to pair each country’s name with its ISO 3166-1 code, which makes the values unambiguous and easier to access.

    • If I hadn’t read the documentation, I wouldn’t have known why it was stored that way and most likely would have assumed it was a normal string.
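
The list-of-dictionaries format described for the genre column can be unpacked with a regular expression. The sketch below is a minimal base R illustration, assuming that genre names contain no apostrophes. The helper extract_names() and the hand-typed sample values are hypothetical and do not come from the data set's documentation.

```r
# Sample values in the same format as the genre column
# (typed by hand for illustration, not read from the CSV)
genres_raw <- c(
  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
  "[]"
)

# Hypothetical helper: return the 'name' values found in each string
extract_names <- function(x) {
  # Match the text that follows 'name': ' up to the next single quote
  hits <- gregexpr("(?<='name': ')[^']*", x, perl = TRUE)
  regmatches(x, hits)
}

genre_names <- extract_names(genres_raw)

# Collapse each movie's genres into one readable string
genre_labels <- vapply(genre_names, paste, character(1), collapse = ", ")
```

An empty list ("[]") yields a zero-length result, so movies without genres stay distinguishable from movies with genres. A full parser would be more robust if names can contain quotes.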

Element That Remains Unclear:

data$revenue <- as.numeric(data$revenue)
data$vote_average <- as.numeric(data$vote_average)

ggplot(data, aes(x = revenue)) + 
  geom_histogram(bins = 50, fill = 'red', alpha = 0.6) +
  geom_histogram(data = data %>% filter(revenue == 0), 
                  bins = 50, fill = 'blue', alpha = 0.6) +
  labs(title = 'Revenue Distribution (with 0 values highlighted)', 
       x = 'Revenue', y = 'Count') +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(data, aes(x = revenue, y = vote_average)) +
  geom_point(aes(color = (revenue == 0)), alpha = 0.6) +
  scale_color_manual(values = c('red', 'blue'), name = 'Zero Revenue') +
  labs(title = 'Revenue vs. Movie Rating (Vote Average)', 
       x = 'Revenue', y = 'Vote Average') +
  theme_minimal()
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).

The first graph shows the number of rows with a revenue of 0, and the count is extremely high. The second graph plots revenue against vote average. It shows that some movies have a rating of 10 yet a recorded revenue of 0, so it remains unclear whether a 0 means the movie earned nothing or the revenue was never recorded.
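
One way to keep the zero values from distorting later summaries is to treat them as missing. The sketch below uses a small made-up data frame rather than the real file, and recoding 0 as NA is an assumption, since it is unclear what a revenue of 0 actually means.

```r
# Made-up rows for illustration only
toy <- data.frame(
  revenue      = c(0, 0, 373554033, 0, 81452156),
  vote_average = c(10, 6.5, 7.7, 5.4, 6.1)
)

# Share of rows whose revenue is recorded as exactly 0
zero_share <- mean(toy$revenue == 0, na.rm = TRUE)

# Assumption: a revenue of 0 means "not recorded", so recode it as NA
toy$revenue_clean <- ifelse(toy$revenue == 0, NA_real_, toy$revenue)

# Summaries now skip the recoded rows
median_revenue <- median(toy$revenue_clean, na.rm = TRUE)
```

Keeping the original revenue column next to revenue_clean makes the recoding easy to undo if the zeros turn out to be genuine.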

Risks:

Categorical Columns:

Genre:

explicit_missing <- sum(is.na(data$genre))  
print(paste("Explicitly Missing Genres:", explicit_missing))
## [1] "Explicitly Missing Genres: 0"
implicit_missing <- sum(data$genre == "" | data$genre == "[]")  
print(paste("Implicitly Missing Genres:", implicit_missing))
## [1] "Implicitly Missing Genres: 2442"

Spoken Languages:

Explicitly_missing_spoken_languages <- sum(is.na(data$spoken_languages))

Implicitly_missing_spoken_languages <- sum(data$spoken_languages == "" | data$spoken_languages == "[]") 

print(paste("Explicitly Missing Spoken Languages:", Explicitly_missing_spoken_languages))
## [1] "Explicitly Missing Spoken Languages: 0"
print(paste("Implicitly Missing Spoken Languages:", Implicitly_missing_spoken_languages))
## [1] "Implicitly Missing Spoken Languages: 3835"
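
The explicit and implicit checks above can be folded into one step by converting the placeholder values to NA. This is a sketch on hand-typed sample values. The helper blank_to_na() is hypothetical, and treating "" and "[]" as missing is the same assumption the counts above make.

```r
# Hand-typed sample values for illustration, not read from the CSV
spoken_raw <- c("[{'name': 'English'}]", "[]", "", NA)

# Hypothetical helper: recode empty strings and empty lists as NA
blank_to_na <- function(x) {
  x[x %in% c("", "[]")] <- NA
  x
}

spoken_clean <- blank_to_na(spoken_raw)

# One count now covers both explicit and implicit missing values
total_missing <- sum(is.na(spoken_clean))
```

After this recoding, functions such as is.na() and na.rm = TRUE handle both kinds of missing value the same way.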

Continuous Column:

Popularity:

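# Coerce to numeric in case read.csv() parsed the column as text;
# entries that are not numbers become NA
data$popularity <- as.numeric(data$popularity)

ggplot(data, aes(x = seq_len(nrow(data)), y = popularity)) +
  geom_point(color = "red", alpha = 0.6) +
  labs(title = "Scatter Plot of Movie Popularity", x = "Index", y = "Popularity") +
  theme_minimal()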