data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'purrr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'stringr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(jsonlite)

## Warning: package 'jsonlite' was built under R version 4.3.3

## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten

Understanding Popularity Across Genres

Approach:

We aim to test whether a movie’s genre significantly affects its popularity.
I have long noticed that what audiences like varies with a film’s genre. Some genres, like action, seem to dominate the box office, while others, like documentaries or comedy, often appeal to niche audiences. This test aims to determine whether some genres are consistently more popular than others.
Using the ANOVA test we will:
- Compare popularity scores across multiple genres to check for significant differences.
- Consolidate smaller genres for a fair comparison.
- Interpret the F-statistic and p-value to determine whether genre influences popularity.
- Use post-hoc tests to identify which specific genres differ in popularity.
This will help us to understand if genre is a key factor in determining a film’s popularity.

Null Hypothesis:

Null Hypothesis (H₀): Mean popularity is the same across all genres.
Alternative Hypothesis (H₁): At least one genre has a different mean popularity score.

Data Preparation:

Here I converted popularity to numeric and dropped the rows where the conversion produced NA.
I also extracted the genre as a plain string, since the genres column stores it as a stringified, JSON-like list of objects. Only the main (first) genre of each movie is kept.
Next I filtered out missing or invalid genres, because the CSV file contains entries like “Aniplex Carousel Productions”, which is a company rather than a genre.

data$popularity <- as.numeric(data$popularity)

## Warning: NAs introduced by coercion

data <- data[!is.na(data$popularity), ]

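extract_genre <- function(genre_str) {
  # Treat missing or empty entries as NA
  if (is.na(genre_str) || genre_str %in% c("", "[]", "{}")) {
    return(NA_character_)
  }

  # The column uses single quotes; swap them for double quotes so it parses as JSON
  genre_str <- gsub("'", "\"", genre_str)

  # Strings that still fail to parse become NA instead of stopping the run
  genre_list <- tryCatch(fromJSON(genre_str), error = function(e) NA)

  # An array of objects comes back as a data frame; keep the first genre name
  if (is.data.frame(genre_list) && "name" %in% colnames(genre_list)) {
    return(genre_list$name[1])
  }
  NA_character_
}

# USE.NAMES = FALSE keeps the raw genre strings from being attached as names
data$genre <- sapply(data$genres, extract_genre, USE.NAMES = FALSE)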

valid_genres <- c("Action", "Adventure", "Animation", "Comedy", "Crime", 
                  "Documentary", "Drama", "Family", "Fantasy", "Foreign", 
                  "History", "Horror", "Music", "Mystery", "Romance", 
                  "Science Fiction", "Thriller", "TV Movie", "War", "Western")

data <- data[data$genre %in% valid_genres, ]
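
As a quick check of the extraction step, here is a minimal sketch that pulls the first genre name out of a few made-up strings written in the same single-quoted style as the genres column. The sample strings and the first_genre helper are my own illustration and are not part of the analysis above. The sketch uses a base-R regular expression instead of fromJSON, so it assumes genre names contain no apostrophes.

```r
# Hypothetical sample strings in the same single-quoted format as the genres column
sample_genres <- c(
  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
  "[]",
  "[{'id': 18, 'name': 'Drama'}]"
)

# Hypothetical helper: return the first 'name' value in each string, or NA if none
first_genre <- function(x) {
  # regexec() captures the text between the quotes of the first 'name': '...' pair
  hits <- regmatches(x, regexec("'name': '([^']*)'", x))
  vapply(
    hits,
    function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
    character(1)
  )
}

first_genre(sample_genres)
```

On these samples the helper returns "Animation", NA and "Drama", which matches the "first genre only" rule used above.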

Data Consolidation:

I’ve always noticed that certain genres tend to share a style and an audience. For example, Action and Adventure films are often grouped together, as are Romance and Drama.
Following the guidelines, I used this idea to consolidate the data into fewer groups: similar genres are merged, and the remaining ones go into an “Other” category. The result is stored in a new column, genre_grouped.
This should make our ANOVA test more effective, as the results won’t be diluted across many small categories.

consolidate_genre <- function(genre) {
  if (genre %in% c("Action", "Adventure")) {
    return("Action/Adventure")
  } else if (genre %in% c("Drama", "Romance")) {
    return("Drama/Romance")
  } else if (genre %in% c("Horror", "Thriller")) {
    return("Horror/Thriller")
  } else if (genre %in% c("Science Fiction", "Fantasy")) {
    return("Sci-Fi/Fantasy")
  } else if (genre %in% c("Comedy")) {
    return("Comedy")
  } else if (genre %in% c("Crime", "Documentary")) {
    return("Crime/Documentary")
  } else {
    return("Other")
  }
}

data$genre_grouped <- sapply(data$genre, consolidate_genre)

table(data$genre_grouped)

## 
##  Action/Adventure            Comedy Crime/Documentary     Drama/Romance 
##              6002              8820              5100             13157 
##   Horror/Thriller             Other    Sci-Fi/Fantasy 
##              4284              4304              1351

The genres are now consolidated: Drama/Romance is the largest group with 13,157 movies, and Sci-Fi/Fantasy is the smallest with 1,351.
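
The same grouping can also be written as a lookup table, which handles a whole column at once instead of one value per function call. This is only a sketch of an alternative: genre_map, group_genres and example_genres are names I made up for illustration, and the mapping simply mirrors the if/else chain above.

```r
# Lookup table: names are the original genres, values are the consolidated groups
genre_map <- c(
  "Action" = "Action/Adventure", "Adventure" = "Action/Adventure",
  "Drama" = "Drama/Romance", "Romance" = "Drama/Romance",
  "Horror" = "Horror/Thriller", "Thriller" = "Horror/Thriller",
  "Science Fiction" = "Sci-Fi/Fantasy", "Fantasy" = "Sci-Fi/Fantasy",
  "Comedy" = "Comedy",
  "Crime" = "Crime/Documentary", "Documentary" = "Crime/Documentary"
)

# Hypothetical helper: map a character vector of genres to their groups
group_genres <- function(genre) {
  grouped <- unname(genre_map[genre])   # NA for genres missing from the table
  grouped[is.na(grouped)] <- "Other"    # everything unmapped falls into "Other"
  grouped
}

example_genres <- c("Action", "Romance", "Western", "Comedy")
group_genres(example_genres)
```

Here "Western" is not in the table, so it lands in "Other", just as it would with consolidate_genre.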

ANOVA Test:

Here we run our ANOVA test to check whether a movie’s genre significantly impacts popularity.

# Perform ANOVA test
anova_result <- aov(popularity ~ genre_grouped, data = data)

# Display ANOVA summary
summary(anova_result)

##                  Df  Sum Sq Mean Sq F value Pr(>F)    
## genre_grouped     6   35073    5845   158.5 <2e-16 ***
## Residuals     43011 1586149      37                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insights:

Our F-value is 158.5, indicating that the variance between genre groups is much larger than the variance within them.
Our p-value is < 2e-16, which is extremely small.
Based on this we reject our null hypothesis: mean popularity differs across genre groups. Statistical significance alone does not tell us how large those differences are.
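
To get a feel for the size of the effect, the sums of squares printed in the ANOVA table can be turned into eta squared, the share of total variance attributed to the genre groups. The sketch below only reuses the numbers from the table above; the variable names are my own, and because the printed values are rounded the results are approximate.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_genre <- 35073
ss_resid <- 1586149
df_genre <- 6
df_resid <- 43011

# Eta squared: share of the total variance in popularity attributed to genre group
eta_sq <- ss_genre / (ss_genre + ss_resid)

# Consistency check: F is the between-group mean square over the within-group one
f_value <- (ss_genre / df_genre) / (ss_resid / df_resid)

round(eta_sq, 3)
round(f_value, 1)
```

Eta squared comes out at roughly 0.02, so genre group accounts for about 2% of the variance in popularity. The recomputed F of about 158.5 agrees with the table.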

Pairwise t-test and Visualization

Since the ANOVA shows that popularity differs across genre groups, we now need to find out which groups differ. We use pairwise t-tests with a Bonferroni correction for this.

pairwise.t.test(data$popularity, data$genre_grouped, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$popularity and data$genre_grouped 
## 
##                   Action/Adventure Comedy  Crime/Documentary Drama/Romance
## Comedy            < 2e-16          -       -                 -            
## Crime/Documentary < 2e-16          < 2e-16 -                 -            
## Drama/Romance     < 2e-16          0.20755 < 2e-16           -            
## Horror/Thriller   < 2e-16          7.8e-08 < 2e-16           3.1e-15      
## Other             < 2e-16          1.00000 < 2e-16           0.03740      
## Sci-Fi/Fantasy    0.04829          2.7e-15 < 2e-16           < 2e-16      
##                   Horror/Thriller Other  
## Comedy            -               -      
## Crime/Documentary -               -      
## Drama/Romance     -               -      
## Horror/Thriller   -               -      
## Other             0.00058         -      
## Sci-Fi/Fantasy    0.00049         2.1e-11
## 
## P value adjustment method: bonferroni

ggplot(data, aes(x = genre_grouped, y = popularity, fill = genre_grouped)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Movie Popularity Across Genres",
       x = "Genre",
       y = "Popularity Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Insights:

Action/Adventure differs significantly from every other group at the 0.05 level, although its difference from Sci-Fi/Fantasy is only marginal (p = 0.048). The p-values alone do not give the direction of a difference; that has to be read from the boxplot or the group means.
Two pairs show no significant difference: Comedy and Drama/Romance (p = 0.21), and Comedy and Other (p = 1.00), so these groups have similar mean popularity. Other and Drama/Romance do differ, though only narrowly (p = 0.037).
Based on the boxplot we can see that the Other and Action/Adventure groups have the highest outliers, suggesting they contain movies that are extremely popular.
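
Since p-values say nothing about direction, it helps to look at group means next to a post-hoc test. The sketch below does this on a tiny invented data set (the toy values and the group labels A, B, C are assumptions for illustration, not movie data). It also uses TukeyHSD, which is a different post-hoc method from the Bonferroni-adjusted pairwise t-tests above, because its output includes the signed difference for each pair.

```r
# Toy data, invented for illustration; not taken from movies_metadata.csv
toy <- data.frame(
  # TukeyHSD() works on factors, so the grouping column is stored as a factor
  genre_grouped = factor(rep(c("A", "B", "C"), each = 4)),
  popularity = c(10, 12, 11, 13, 5, 6, 4, 5, 5, 7, 6, 6)
)

# Mean popularity per group shows which groups sit higher or lower
group_means <- aggregate(popularity ~ genre_grouped, data = toy, FUN = mean)

# Tukey's HSD reports each pairwise difference with a confidence interval
toy_fit <- aov(popularity ~ genre_grouped, data = toy)
tukey <- TukeyHSD(toy_fit)

group_means
tukey$genre_grouped   # columns: diff, lwr, upr, p adj
```

In the toy data, group A has the highest mean (11.5 against 5 and 6), and the diff column of the Tukey table carries the sign of each comparison. On the real data, genre_grouped would first need to be converted with factor().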

Conclusion:

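The idea that film genre influences popularity is supported by our results: the ANOVA rejects equal mean popularity across genre groups, and Action/Adventure differs significantly from every other group. Comedy, Drama/Romance and Other are harder to tell apart, since Comedy does not differ significantly from either of the other two.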

Linear Regression Between Revenue and Vote Count

For this test I changed my response variable from popularity to vote_count, as vote_count and revenue showed the most nearly linear relationship.

Approach:

We aim to determine whether a movie’s revenue significantly influences its vote count using a linear regression model.
The logic behind this choice is that higher-grossing movies tend to have wider releases, which lead to larger audiences and more votes.
Using linear regression we will:
- Check for linearity to confirm that revenue and vote_count have a linear relationship.
- Fit and evaluate a regression model to estimate how much vote count changes with revenue.
- Interpret the coefficients to understand the relationship between the two.

This will help us understand whether higher revenue is consistently associated with more audience votes.

Checking Linearity:

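Here we use a scatter plot to check the linearity between revenue and vote count. Before creating the plot, we reload the dataset (so the genre filtering from the previous test no longer applies), convert both columns to numeric, and drop rows with missing values.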

data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)

data$revenue <- as.numeric(data$revenue)
data$vote_count <- as.numeric(data$vote_count)

data <- data[!is.na(data$revenue) & !is.na(data$vote_count), ]


ggplot(data, aes(x = revenue, y = vote_count)) +
  geom_point(color = "blue", alpha = 0.5) +
  labs(title = "Revenue vs. Vote Count",
       x = "Revenue ($)", y = "Vote Count") +
  theme_minimal()

The relationship between revenue and vote count is not perfectly linear, but the positive trend is strong enough to proceed with linear regression. Higher revenue tends to go with a higher vote count, so the model can still provide meaningful insights.
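
A numeric companion to the scatter plot is to compare Pearson's correlation, which measures linear association, with Spearman's, which only asks whether one variable rises with the other. The sketch below runs both on a handful of invented values (the toy data frame is an assumption for illustration, not the movie data). A large gap between the two would hint that the relationship is monotone but curved.

```r
# Made-up values for illustration only; not taken from movies_metadata.csv
toy <- data.frame(
  revenue    = c(0, 1e6, 5e6, 2e7, 1e8, 5e8),
  vote_count = c(3, 20, 60, 150, 700, 3000)
)

# Pearson: strength of the straight-line relationship
pearson <- cor(toy$revenue, toy$vote_count, method = "pearson")

# Spearman: rank-based, so it rewards any consistently increasing pattern
spearman <- cor(toy$revenue, toy$vote_count, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

On the real data the same two calls would use data$revenue and data$vote_count after the cleaning step above.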

Linear Regression Model and its Fits

Here we build our linear regression model to analyze the relationship between revenue and vote count.

model <- lm(vote_count ~ revenue, data = data)

summary(model)

## 
## Call:
## lm(formula = vote_count ~ revenue, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5215.9   -38.4   -32.4   -16.3  9012.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.038e+01  1.365e+00   29.58   <2e-16 ***
## revenue     6.201e-06  2.091e-08  296.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.7 on 45458 degrees of freedom
## Multiple R-squared:  0.6594, Adjusted R-squared:  0.6594 
## F-statistic: 8.8e+04 on 1 and 45458 DF,  p-value: < 2.2e-16

Insights:

The intercept of 40.38 means the model predicts a baseline of about 40 votes for a movie with zero revenue.
The coefficient for revenue (6.201e-06, about 0.0000062) indicates that each additional $1 of revenue is associated with about 0.000006 more votes. The number looks tiny only because revenue is measured in single dollars: it corresponds to roughly 6.2 more votes per additional $1 million.
The p-value is extremely low, indicating that revenue is a statistically significant predictor of vote count.
The R-squared value shows that about 65.94% of the variability in vote count is explained by revenue, which is a fairly strong linear relationship.
The remaining third of the variability suggests that other variables also play a role in boosting engagement and bringing in more votes.
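
To make the fitted line concrete, the printed coefficients can be plugged in by hand. The sketch below only reuses numbers from the model summary above; the variable names and the $100 million example are my own, and the results are approximate because the printed coefficients are rounded.

```r
# Coefficients and R-squared copied from the model summary above
intercept <- 40.38
slope_per_dollar <- 6.201e-06
r_squared <- 0.6594

# Rescale the slope so it reads as votes per $1 million of revenue
slope_per_million <- slope_per_dollar * 1e6

# Predicted vote count for a hypothetical movie with $100 million in revenue
predicted_votes <- intercept + slope_per_dollar * 1e8

# With a single predictor, the correlation is the square root of R-squared
r <- sqrt(r_squared)

round(c(slope_per_million = slope_per_million,
        predicted_votes = predicted_votes,
        r = r), 2)
```

This gives about 6.2 votes per $1 million, a prediction of roughly 660 votes for the $100 million example, and a correlation of about 0.81 between revenue and vote count.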

Conclusion:

Higher revenue is generally associated with more votes, but other factors also play a significant role in audience engagement.

Data Dive Week 8

2025-03-09
