Data Dive Week 5

data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'purrr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'stringr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)

Cleared up Columns

Genre:
- The genre column was unclear because it is formatted as a list of dictionaries rather than a simple string. Instead of just listing genres like “Comedy” or “Drama,” it contains entries like [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]. The documentation clarified that some columns, including genre, are stored as stringified JSON objects.
- A movie can have multiple genres, and a stringified JSON object keeps all of them in a single cell, which is most likely why the column was encoded this way.
- If I hadn’t read the documentation, I might have assumed the column was a plain string, making it difficult to extract the genres for statistical analysis.
ID and IMDB_ID:
- I found it confusing that the data set has both an IMDB_ID column and a plain ID column. The documentation explained that the ID column actually corresponds to the TMDB_ID.
- With the ID column holding the TMDB identifier and the IMDB_ID column holding the IMDB identifier, each movie can be referenced in both TMDB and IMDB, which is most likely why both were included.
- If I hadn’t read the documentation, I wouldn’t have known what the ID corresponds to and most likely would have assumed it was an ID internal to the database.
Production_Countries:
- I found the production countries column to be unclear, as it is formatted as a list of dictionaries similar to the genre column. For example, instead of having “USA” it has [{'iso_3166_1': 'US', 'name': 'United States of America'}]. The documentation clarified that these values aren’t stored as simple strings.
- It was most likely encoded this way to store each country’s name alongside its ISO 3166-1 code, making it more readable and easier to access.
- If I hadn’t read the documentation, I wouldn’t have known why it was stored that way and most likely would have assumed it was a normal string.
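
As a rough sketch of how the names could be pulled out of these stringified columns, the base R helper below extracts every 'name' value with a regular expression. The function name extract_names is my own, and the pattern assumes each name is wrapped in single quotes with no apostrophe inside it, so this is an illustration rather than a full parser.

```r
# Hypothetical helper: returns one character vector of names per input string
extract_names <- function(x) {
  # Find every "'name': '...'" pair in each string
  matches <- regmatches(x, gregexpr("'name': '[^']*'", x))
  # Keep only the text between the final pair of single quotes
  lapply(matches, function(m) sub("^'name': '([^']*)'$", "\\1", m))
}

example <- "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"
print(extract_names(example)[[1]])

# An empty list such as "[]" yields an empty character vector
print(length(extract_names("[]")[[1]]))
```

The same helper should also work on the production countries example above, since it only looks at the 'name' key.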

Element That Remains Unclear:

An element that remains unclear throughout the data set is the use of zeros in the revenue column. The documentation does not explain whether a 0 means that a movie truly earned no revenue or that the value is missing. This makes it difficult to determine the intended meaning of these values. For example, it doesn’t make sense for a movie with a large budget to have a revenue of exactly 0.
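
One way to look for the suspicious cases described above is to count the movies that report a budget but no revenue. The sketch below uses a small invented data frame (toy, with made-up titles and amounts) rather than the real file. In the real data, budget may also need converting with as.numeric() first.

```r
# Toy stand-in for the movie data; all values are invented for illustration
toy <- data.frame(
  title   = c("A", "B", "C", "D"),
  budget  = c(30000000, 0, 65000000, 0),
  revenue = c(0, 0, 120000000, 0)
)

# A positive budget with zero revenue suggests the revenue is missing, not truly zero
suspicious <- toy[toy$budget > 0 & toy$revenue == 0, ]
print(suspicious)
print(nrow(suspicious))
```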

data$revenue <- as.numeric(data$revenue)
data$vote_average <- as.numeric(data$vote_average)

ggplot(data, aes(x = revenue)) + 
  geom_histogram(bins = 50, fill = 'red', alpha = 0.6) +
  geom_histogram(data = data %>% filter(revenue == 0), 
                  bins = 50, fill = 'blue', alpha = 0.6) +
  labs(title = 'Revenue Distribution (with 0 values highlighted)', 
       x = 'Revenue', y = 'Count') +
  theme_minimal()

## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(data, aes(x = revenue, y = vote_average)) +
  geom_point(aes(color = (revenue == 0)), alpha = 0.6) +
  scale_color_manual(values = c('red', 'blue'), name = 'Zero Revenue') +
  labs(title = 'Revenue vs. Movie Rating (Vote Average)', 
       x = 'Revenue', y = 'Vote Average') +
  theme_minimal()

## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).

The first graph shows the number of rows that have a revenue of 0, which is an extremely high count. The second graph plots revenue against vote average. It even shows that some movies have a rating of 10 but a revenue of 0.

Risks:

One risk of there being many 0 values that may be wrong is that they understate the average and total revenue across the data set.
Another risk is that they skew the distribution towards 0, making it difficult to analyze the data set statistically.
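
To illustrate the first risk, the sketch below compares the mean revenue with the zeros kept against the mean with the zeros treated as missing. The five revenue values are invented for illustration and are not taken from the data set.

```r
# Invented revenue values; three of the five are recorded as 0
revenue <- c(0, 0, 100, 0, 200)

# Mean when every 0 is treated as a real value
mean_with_zeros <- mean(revenue)

# Mean when every 0 is treated as missing instead
revenue_na <- replace(revenue, revenue == 0, NA)
mean_without_zeros <- mean(revenue_na, na.rm = TRUE)

print(mean_with_zeros)     # 60
print(mean_without_zeros)  # 150
```

If the zeros really are missing values, the first mean understates the typical revenue by a wide margin.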

Categorical Columns:

Genre:

# Assuming the column is named `genres`: `data$genre` only works because `$` partially matches names on a data frame
explicit_missing <- sum(is.na(data$genres))
print(paste("Explicitly Missing Genres:", explicit_missing))

## [1] "Explicitly Missing Genres: 0"

implicit_missing <- sum(data$genres == "" | data$genres == "[]")
print(paste("Implicitly Missing Genres:", implicit_missing))

## [1] "Implicitly Missing Genres: 2442"

There are no explicitly missing rows, but there are 2,442 implicitly missing rows. As for empty groups, I was unable to find any.
Insight: While no genres are explicitly missing, 2,442 movies have no genre data recorded. This could lead to issues in genre-based recommendations and analysis.

Spoken_Languages:

explicit_missing_languages <- sum(is.na(data$spoken_languages))

implicit_missing_languages <- sum(data$spoken_languages == "" | data$spoken_languages == "[]")

print(paste("Explicitly Missing Spoken Languages:", explicit_missing_languages))

## [1] "Explicitly Missing Spoken Languages: 0"

print(paste("Implicitly Missing Spoken Languages:", implicit_missing_languages))

## [1] "Implicitly Missing Spoken Languages: 3835"

There are no explicitly missing rows, but there are 3,835 implicitly missing rows. As for empty groups, I was unable to find any.
Insight: Some films have no spoken language recorded, which suggests their metadata is incomplete. This could be because lesser-known or older movies were not documented as thoroughly.
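
The two checks above repeat the same logic, so they could be wrapped in one helper and applied to several columns at once. This is a minimal sketch: count_missing is a name I made up, and the toy data frame contains invented rows rather than rows from the real file.

```r
# Hypothetical helper: counts NA values and empty placeholders ("" or "[]") in one column
count_missing <- function(x) {
  c(
    explicit = sum(is.na(x)),
    implicit = sum(!is.na(x) & (x == "" | x == "[]"))
  )
}

# Invented rows for illustration
toy <- data.frame(
  genres = c("[]", "[{'id': 18, 'name': 'Drama'}]", NA, ""),
  spoken_languages = c("[]", "[]", "[{'iso_639_1': 'en', 'name': 'English'}]", NA),
  stringsAsFactors = FALSE
)

# One column of counts per data column
result <- sapply(toy[c("genres", "spoken_languages")], count_missing)
print(result)
```

Checking !is.na(x) first keeps an NA from turning the implicit count into NA.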

Continuous Column:

Popularity:

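# popularity may be read in as text, so convert it to numeric before plotting
data$popularity <- as.numeric(data$popularity)

ggplot(data, aes(x = seq_len(nrow(data)), y = popularity)) +
  geom_point(color = "red", alpha = 0.6) +
  labs(title = "Scatter Plot of Movie Popularity", x = "Index", y = "Popularity") +
  theme_minimal()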

In the scatter plot for popularity, I would define anything above 200 as an outlier, since nearly all popularity scores fall below 200.
Insight: As we can see, it’s very rare for a movie to have a popularity score over 200, with only a few movies reaching that level. These movies likely correspond to blockbuster films.
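
To put a number on the 200 cutoff, the outliers could be counted directly. The sketch below uses invented popularity scores, and the threshold variable simply encodes my cutoff of 200. For comparison it also computes the usual upper fence from the interquartile range, which is a different rule than the one I used.

```r
# Invented popularity scores for illustration
popularity <- c(0.6, 1.1, 2.9, 7.4, 21.9, 213.8, 547.5)

# My cutoff from the scatter plot
threshold <- 200
is_outlier <- popularity > threshold

print(sum(is_outlier))   # number of outliers: 2
print(mean(is_outlier))  # share of movies flagged as outliers

# Alternative rule for comparison: third quartile plus 1.5 times the IQR
upper_fence <- unname(quantile(popularity, 0.75)) + 1.5 * IQR(popularity)
print(upper_fence)
```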