data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
library(ggplot2)
library(jsonlite)
We aim to test whether a movie’s genre significantly affects its popularity.
I have long noticed that audience tastes differ by genre. Some genres, such as action, seem to dominate the box office, while others, such as documentaries or comedies, often appeal to niche audiences. This test aims to determine whether some genres are consistently more popular than others.
Using the ANOVA test we will:
Compare popularity scores across multiple genres to check for significant differences.
Consolidate smaller genres for a fair comparison.
Interpret the F-statistic and p-value to determine whether genre influences popularity.
Use post-hoc tests to identify which specific genres differ in popularity.
This will help us to understand if genre is a key factor in determining a film’s popularity.
Null Hypothesis (H₀): Mean popularity is the same across all genre groups.
Alternative Hypothesis (H₁): At least one genre group has a different mean popularity.
Here I converted popularity to numeric and dropped the rows where the conversion produced NA.
I also extracted the genre as a plain string, since it is originally stored as a stringified JSON list of objects. I keep only the main (first) genre of each movie.
Next I filtered out missing or invalid genres, because the CSV file contains malformed entries such as “Aniplex Carousel Productions”, which is a company rather than a genre.
data$popularity <- as.numeric(data$popularity)
## Warning: NAs introduced by coercion
data <- data[!is.na(data$popularity), ]
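# Return the first genre name from a stringified JSON list, or NA if none can be parsed
extract_genre <- function(genre_str) {
  if (is.na(genre_str) || genre_str == "[]" || genre_str == "" || genre_str == "{}") {
    return(NA)  # Handle missing or empty values
  }
  # The file uses single quotes; JSON requires double quotes
  genre_str <- gsub("'", "\"", genre_str)
  # Malformed strings fall through to NA instead of stopping the run
  genre_list <- tryCatch(fromJSON(genre_str), error = function(e) NA)
  if (is.data.frame(genre_list) && "name" %in% colnames(genre_list)) {
    return(genre_list$name[1])
  } else {
    return(NA)
  }
}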
data$genre <- sapply(data$genres, extract_genre)
valid_genres <- c("Action", "Adventure", "Animation", "Comedy", "Crime",
                  "Documentary", "Drama", "Family", "Fantasy", "Foreign",
                  "History", "Horror", "Music", "Mystery", "Romance",
                  "Science Fiction", "Thriller", "TV Movie", "War", "Western")
data <- data[data$genre %in% valid_genres, ]
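To make the parsing step concrete, here is a minimal sketch of what happens to a single value. The sample string is hypothetical; it only imitates the shape of the entries in the genres column and is not copied from the file.

```r
library(jsonlite)

# Hypothetical sample value, shaped like the entries in the genres column
sample_genres <- "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# Swap single quotes for double quotes so the string is valid JSON
json_str <- gsub("'", "\"", sample_genres)

# fromJSON() turns an array of objects into a data frame, one row per genre
parsed <- fromJSON(json_str)

# Keep only the first listed genre
first_genre <- parsed$name[1]
print(first_genre)
```

One caveat of the quote swap: a name that itself contains an apostrophe would produce invalid JSON. In extract_genre such rows fail to parse and come back as NA, so they are dropped by the later filter.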
I’ve always noticed that certain genres tend to share a style and an audience. For example, Action and Adventure films are often grouped together, as are Romance and Drama.
Following the guidelines, I used this idea to consolidate the data into fewer groups: similar genres are grouped together, and the remaining genres go into an “Other” category. The grouped genre is stored in a new column, genre_grouped.
This makes the ANOVA test more effective, since the comparison is not spread thin across many small categories.
consolidate_genre <- function(genre) {
  if (genre %in% c("Action", "Adventure")) {
    return("Action/Adventure")
  } else if (genre %in% c("Drama", "Romance")) {
    return("Drama/Romance")
  } else if (genre %in% c("Horror", "Thriller")) {
    return("Horror/Thriller")
  } else if (genre %in% c("Science Fiction", "Fantasy")) {
    return("Sci-Fi/Fantasy")
  } else if (genre %in% c("Comedy")) {
    return("Comedy")
  } else if (genre %in% c("Crime", "Documentary")) {
    return("Crime/Documentary")
  } else {
    return("Other")
  }
}
data$genre_grouped <- sapply(data$genre, consolidate_genre)
table(data$genre_grouped)
##
##  Action/Adventure            Comedy Crime/Documentary     Drama/Romance
##              6002              8820              5100             13157
##   Horror/Thriller             Other    Sci-Fi/Fantasy
##              4284              4304              1351
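The same grouping can also be written as a lookup table, which keeps the genre-to-group mapping in one place. This is only an alternative sketch, not the code used for the results above; genre_map and the short genres vector are names introduced here for illustration.

```r
# Map each primary genre to its group; anything not listed falls into "Other"
genre_map <- c(
  "Action" = "Action/Adventure", "Adventure" = "Action/Adventure",
  "Drama" = "Drama/Romance", "Romance" = "Drama/Romance",
  "Horror" = "Horror/Thriller", "Thriller" = "Horror/Thriller",
  "Science Fiction" = "Sci-Fi/Fantasy", "Fantasy" = "Sci-Fi/Fantasy",
  "Comedy" = "Comedy",
  "Crime" = "Crime/Documentary", "Documentary" = "Crime/Documentary"
)

# Hypothetical input vector standing in for data$genre
genres <- c("Action", "Comedy", "Western", "Fantasy", "Documentary")

# Indexing by name is vectorized; genres missing from the map come back as NA
grouped <- unname(genre_map[genres])
grouped[is.na(grouped)] <- "Other"
print(grouped)
```

Because the lookup is vectorized, it avoids calling a function once per row, which can matter on a data set with tens of thousands of movies.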
Here we run our ANOVA test to check whether a movie’s genre significantly impacts popularity.
# Perform ANOVA test
anova_result <- aov(popularity ~ genre_grouped, data = data)
# Display ANOVA summary
summary(anova_result)
##                  Df  Sum Sq Mean Sq F value Pr(>F)
## genre_grouped     6   35073    5845   158.5 <2e-16 ***
## Residuals     43011 1586149      37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Insights:
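Our F-value is 158.5, meaning the variation between genre groups is far larger than would be expected from the variation within groups if all group means were equal.
Our p-value is below 2e-16, which is extremely small.
Based on this we reject the null hypothesis: mean popularity differs across genre groups. With roughly 43,000 movies, even small differences reach significance, so the p-value alone does not show how large the effect of genre is.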
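To gauge the size of the effect, the sums of squares in the ANOVA table can be turned into eta squared, the share of total variance attributed to the grouping. This is an added calculation, not part of the original analysis; it uses only the two Sum Sq values printed above.

```r
# Sums of squares copied from the ANOVA summary above
ss_between <- 35073    # genre_grouped
ss_within <- 1586149   # Residuals

# Eta squared: share of total variance in popularity attributed to genre group
eta_sq <- ss_between / (ss_between + ss_within)
print(round(eta_sq, 4))
```

The result is roughly 0.02, so genre group accounts for about 2% of the variance in popularity. The differences between groups are real but modest.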
pairwise.t.test(data$popularity, data$genre_grouped, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$popularity and data$genre_grouped
##
##                   Action/Adventure Comedy  Crime/Documentary Drama/Romance
## Comedy            < 2e-16          -       -                 -
## Crime/Documentary < 2e-16          < 2e-16 -                 -
## Drama/Romance     < 2e-16          0.20755 < 2e-16           -
## Horror/Thriller   < 2e-16          7.8e-08 < 2e-16           3.1e-15
## Other             < 2e-16          1.00000 < 2e-16           0.03740
## Sci-Fi/Fantasy    0.04829          2.7e-15 < 2e-16           < 2e-16
##                   Horror/Thriller Other
## Comedy            -               -
## Crime/Documentary -               -
## Drama/Romance     -               -
## Horror/Thriller   -               -
## Other             0.00058         -
## Sci-Fi/Fantasy    0.00049         2.1e-11
##
## P value adjustment method: bonferroni
ggplot(data, aes(x = genre_grouped, y = popularity, fill = genre_grouped)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Movie Popularity Across Genres",
       x = "Genre",
       y = "Popularity Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Insights:
Action/Adventure differs significantly from every other group (all adjusted p-values are below 0.05, although the comparison with Sci-Fi/Fantasy is borderline at 0.048), which is consistent with it drawing larger audiences.
Two pairs show no significant difference: Comedy and Drama/Romance (p = 0.208) and Comedy and Other (p = 1.000), implying they may appeal to similar audiences. Other and Drama/Romance do differ at the 5% level, though only narrowly (p = 0.037).
The boxplot shows that the Other and Action/Adventure groups have the highest outliers, suggesting they contain some extremely popular movies.
Our results support the idea that genre is associated with popularity. Action/Adventure and Comedy appear to be the more popular groups, but the pairwise p-values are two-sided and do not by themselves show which group in a pair is higher; that has to be read from the group means or the boxplot.
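Since the pairwise p-values do not show direction, one option is to look at group means together with Tukey's HSD, which reports each pairwise difference with a confidence interval. The sketch below runs on simulated toy data, not the movie data set; the groups A, B, and C and their values are made up for illustration, and Tukey's HSD is an alternative to the Bonferroni-adjusted t tests used above.

```r
set.seed(1)

# Toy data standing in for the movie data set (values are simulated, not real)
toy <- data.frame(
  genre_grouped = factor(rep(c("A", "B", "C"), each = 50)),
  popularity = c(rnorm(50, mean = 5), rnorm(50, mean = 3), rnorm(50, mean = 3))
)

# Group means show which groups sit higher or lower
group_means <- aggregate(popularity ~ genre_grouped, data = toy, FUN = mean)
print(group_means)

# Tukey's HSD reports each pairwise difference (diff) with its interval (lwr, upr)
fit <- aov(popularity ~ genre_grouped, data = toy)
tukey <- TukeyHSD(fit)
print(tukey)
```

In the output, the sign of diff shows which group in each pair has the higher mean, and an interval that excludes zero corresponds to a significant difference.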
For the sake of this test I changed my response variable from popularity to vote_count, as vote_count and revenue showed the strongest linear relationship.
We aim to determine whether a movie’s revenue significantly predicts its vote count using a linear regression model.
The logic behind this choice is that higher-grossing movies tend to have wider releases, which leads to larger audiences and more votes.
Using linear regression we will:
Check for linearity to confirm that revenue and vote_count have a linear relationship.
Fit and evaluate a regression model to estimate how much vote count changes with revenue.
Interpret the coefficients to understand the relationship between the two.
This will help us understand whether higher revenue is consistently associated with more audience votes.
data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
data$revenue <- as.numeric(data$revenue)
data$vote_count <- as.numeric(data$vote_count)
data <- data[!is.na(data$revenue) & !is.na(data$vote_count), ]
ggplot(data, aes(x = revenue, y = vote_count)) +
  geom_point(color = "blue", alpha = 0.5) +
  labs(title = "Revenue vs. Vote Count",
       x = "Revenue ($)", y = "Vote Count") +
  theme_minimal()
model <- lm(vote_count ~ revenue, data = data)
summary(model)
##
## Call:
## lm(formula = vote_count ~ revenue, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5215.9   -38.4   -32.4   -16.3  9012.2
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.038e+01  1.365e+00   29.58   <2e-16 ***
## revenue     6.201e-06  2.091e-08  296.64   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.7 on 45458 degrees of freedom
## Multiple R-squared: 0.6594, Adjusted R-squared: 0.6594
## F-statistic: 8.8e+04 on 1 and 45458 DF, p-value: < 2.2e-16
Insights:
The intercept of 40.38 means the model predicts a baseline of about 40 votes for a movie with zero revenue.
The coefficient for revenue (0.00000620147) indicates that for every $1 increase in revenue, the expected vote count increases by approximately 0.000006 votes. The number looks tiny only because revenue is measured in single dollars: it corresponds to about 6.2 additional votes per $1 million of revenue.
The p-value is extremely low, indicating a statistically significant linear association between revenue and vote count.
The R-squared value shows that about 65.94% of the variability in vote count is explained by revenue, a fairly strong relationship for a single predictor (a correlation of about 0.81).
Revenue is therefore a useful predictor of vote count, but roughly a third of the variability remains unexplained, so other factors also play a role in audience engagement. Because the data are observational, higher revenue is associated with more votes rather than shown to cause them.
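The size of a regression coefficient depends on the units of the predictor, not on the strength of the relationship. The sketch below illustrates this with simulated data (not the movie data set): refitting the same model with revenue in millions of dollars multiplies the slope by one million and leaves R-squared unchanged. The column revenue_millions and all simulated values are assumptions made for the example.

```r
set.seed(42)

# Simulated stand-in for the movie data (not the real values)
n <- 200
revenue <- runif(n, min = 0, max = 5e8)
vote_count <- 40 + 6e-06 * revenue + rnorm(n, sd = 300)
toy <- data.frame(revenue, vote_count)

# Same model twice: revenue in dollars, then revenue in millions of dollars
fit_dollars <- lm(vote_count ~ revenue, data = toy)
toy$revenue_millions <- toy$revenue / 1e6
fit_millions <- lm(vote_count ~ revenue_millions, data = toy)

# The slope scales by 1e6, while R-squared is unchanged
print(coef(fit_dollars)["revenue"] * 1e6)
print(coef(fit_millions)["revenue_millions"])
print(summary(fit_dollars)$r.squared)
print(summary(fit_millions)$r.squared)
```

Reporting the slope per $1 million makes the result easier to read without changing the model's fit.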