The film industry is one of the largest and most influential entertainment industries in the world. Every year, film producers invest millions of dollars in production, marketing, and distribution to generate revenue and audience engagement. However, not every movie succeeds financially, and understanding the factors that lead to success is an important problem for film studios. This project analyzes a Hollywood movie dataset to investigate how variables such as budget, audience reception, critic reviews, genre, and release timing influence movie revenue and profitability.
The dataset used in this project comes from The Numbers website and contains information about Hollywood movies, such as critic scores, audience ratings, genres, budgets, and domestic and foreign gross revenue. It is rich in both quantitative and categorical variables, including Budget ($million), Worldwide Gross ($million), Average Audience, and Average Critics. This leaves room for visualization, multiple regression, and statistical analysis.
I started by cleaning the dataset to remove missing values from important financial variables such as budget and worldwide gross revenue. The data was also filtered to the most common genres in order to separate signal from noise and improve readability in the visualizations. Numeric variables were converted into usable formats for regression and plotting.
I chose this dataset because the entertainment industry combines business, consumer psychology, and media influence. As someone interested in analytics, I wanted to explore what makes movies financially successful and whether audience reactions or critic reviews are better predictors of box office performance. This topic is meaningful because movies influence global culture.
| Variable Name | Description |
|---|---|
| Film | The official title of the motion picture. |
| Primary Genre | The main classification of the movie (e.g., Action, Horror, Comedy). |
| Budget Million | The total production cost in millions of USD. |
| Worldwide Gross Million | The total global revenue generated by the film in millions of USD. |
| Average Critics | A composite score (0-100) averaging professional reviews from Rotten Tomatoes and Metacritic. |
| Average Audience | A composite score (0-100) representing general public sentiment. |
| Profit Million | The calculated net financial gain (Worldwide Gross minus Budget). |
| ROI Ratio | The calculated financial efficiency (Profit divided by Budget). |
| Year | The calendar year the movie was released. |
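The two calculated variables in the table follow directly from their definitions. As a minimal sketch, the base R snippet below applies both formulas to three made-up rows (the titles and numbers are illustrative only and do not come from the dataset); the column names mirror the cleaned names used later in the analysis.

```r
# Toy rows only: these titles and numbers are made up for illustration.
films <- data.frame(
  film = c("Film A", "Film B", "Film C"),
  budget_million = c(10, 100, 50),
  worldwide_gross_million = c(60, 250, 40)
)

# Profit Million = Worldwide Gross - Budget
films$profit_million <- films$worldwide_gross_million - films$budget_million

# ROI Ratio = Profit / Budget
films$roi_ratio <- films$profit_million / films$budget_million

print(films)
```

In this toy example Film B earns the largest profit, but Film A has the highest ROI because its budget is much smaller. That is the distinction between the two measures.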
# Load required packages
library(tidyverse)
library(janitor)
library(scales)
library(highcharter)
library(RColorBrewer)
library(plotly)
library(GGally)
library(broom)
# Clean variable names with clean_names() from the janitor package
# (see https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html).
# Assumes the raw data has already been read into `hollywood`.
hollywood_clean <- hollywood |>
  clean_names()
# Select the variables used in the analysis
hollywood_clean <- hollywood_clean |>
  select(
    film, primary_genre, script_type, average_critics, average_audience,
    budget_million, worldwide_gross_million, domestic_gross_million,
    foreign_gross_million, opening_weekend_million, year, oscar_winners
  )
# Remove rows with missing (NA) values and keep only positive budgets
hollywood_clean <- hollywood_clean |>
  filter(
    !is.na(primary_genre),
    !is.na(budget_million),
    budget_million > 0,
    !is.na(worldwide_gross_million),
    !is.na(average_audience),
    !is.na(average_critics)
  )
Step 7: Inclusion/Exclusion
hollywood_clean <- hollywood_clean |>
  mutate(
    # Force columns to numeric
    worldwide_gross_million = as.numeric(worldwide_gross_million),
    budget_million = as.numeric(budget_million)
  ) |>
  # Now that they are numbers, we can calculate profit
  mutate(profit_million = worldwide_gross_million - budget_million)

# Final dataset: keep the top genres to maintain clean visualizations.
# Genres are compared in lowercase so that differently capitalized labels
# (e.g. "Horror" and "horror") both match.
hollywood_final <- hollywood_clean |>
  filter(str_to_lower(primary_genre) %in%
           c("action", "comedy", "drama", "horror", "thriller"))
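Because `%in%` compares strings exactly, the genre filter only keeps rows whose labels match character for character, including capitalization. The base R sketch below uses hypothetical genre labels (not taken from the dataset) to show the difference between exact and case-insensitive matching.

```r
# Hypothetical genre labels, used only to illustrate the matching behaviour.
genres <- c("Action", "Horror", "horror", "Thriller", "Comedy")

# Exact matching: only the lowercase "horror" entry is kept.
exact <- genres[genres %in% c("horror", "thriller")]

# Case-insensitive matching: lowercase the data before comparing.
loose <- genres[tolower(genres) %in% c("horror", "thriller")]

print(exact)
print(loose)
```

Running `table(hollywood_clean$primary_genre)` before filtering is a quick way to see how the labels are actually spelled in the data.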
Step 8: Multiple Linear Regression
# Backward Elimination: Step 1 (The Full Model)
# We start by including both Critics and Audience scores to see which is the better predictor.
full_model <- lm(profit_million ~ budget_million + average_critics + average_audience,
                 data = hollywood_final)
summary(full_model)
# Multiple linear regression: predicting profit from budget and critic scores
reg_model <- lm(profit_million ~ budget_million + average_critics, data = hollywood_final)

# Model summary
summary(reg_model)
I performed backward elimination to find the most efficient model. The initial model included the audience score, but it was not statistically significant once critic scores were also in the model, so I removed it. My final equation, with Profit and Budget in millions of USD, is:

$$\widehat{\text{Profit}} = -102.4 + 1.82 \times \text{Budget} + 1.65 \times \text{Average Critics}$$
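To make the equation concrete, the sketch below plugs in a hypothetical film. The coefficients are copied from the reported equation; the function name `predict_profit` and the example inputs are assumptions made for illustration.

```r
# Coefficients copied from the reported final equation.
predict_profit <- function(budget_million, average_critics) {
  -102.4 + 1.82 * budget_million + 1.65 * average_critics
}

# Hypothetical film: $100 million budget, critic score of 70.
predict_profit(100, 70)  # about 195.1 (million USD)
```

With the fitted model object available, `predict(reg_model, newdata = data.frame(budget_million = 100, average_critics = 70))` should give the same kind of prediction without hard-coding the rounded coefficients.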
P-Values: In the final model, both remaining predictors have p-values <0.05, confirming their significance.
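The p-values can be read directly from the coefficient table of a fitted model. The sketch below does this on simulated data (not the Hollywood dataset) and then performs one backward-elimination step by dropping the predictor with the largest p-value. The variable names and simulation settings are assumptions; this is one simple way to carry out the step, not necessarily the exact procedure used in the project.

```r
set.seed(110)  # simulated data, not the Hollywood dataset
n <- 200
budget <- runif(n, 5, 200)
critics <- runif(n, 10, 95)
audience <- critics + rnorm(n, sd = 8)   # built to overlap with critics
profit <- -100 + 1.8 * budget + 1.6 * critics + rnorm(n, sd = 60)
sim <- data.frame(profit, budget, critics, audience)

full <- lm(profit ~ budget + critics + audience, data = sim)

# Coefficient table: the "Pr(>|t|)" column holds the p-values.
p_values <- coef(summary(full))[, "Pr(>|t|)"]
print(round(p_values, 4))

# One backward step: drop the predictor with the largest p-value
# (intercept excluded) and refit the model.
weakest <- names(which.max(p_values[-1]))
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))
print(formula(reduced))
```

Base R also provides `step()` and `drop1()` for automating this kind of search, although they use AIC or F-tests rather than the coefficient p-values used here.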
Adjusted R-squared: The final model explains 58% of the variance in profit. Interestingly, removing the non-significant audience variable actually improved the Adjusted R-squared slightly by reducing “noise” in the model.
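Adjusted R-squared penalizes every extra predictor, which is why dropping a weak variable can raise it even though plain R-squared can only fall or stay the same. The sketch below computes adjusted R-squared by hand as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. It uses simulated data (not the Hollywood dataset), and the helper name `adj_r2` is an assumption.

```r
set.seed(42)  # simulated data, not the Hollywood dataset
n <- 150
budget <- runif(n, 5, 200)
critics <- runif(n, 10, 95)
noise_var <- rnorm(n)                      # unrelated to profit by construction
profit <- -100 + 1.8 * budget + 1.6 * critics + rnorm(n, sd = 60)
sim <- data.frame(profit, budget, critics, noise_var)

adj_r2 <- function(model) {
  r2 <- summary(model)$r.squared
  n <- nobs(model)
  p <- length(coef(model)) - 1             # predictors, excluding the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

full <- lm(profit ~ budget + critics + noise_var, data = sim)
reduced <- lm(profit ~ budget + critics, data = sim)

# Hand-computed values, comparable with summary(model)$adj.r.squared.
print(c(full = adj_r2(full), reduced = adj_r2(reduced)))

# Plain R-squared never rises when a predictor is removed.
print(c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared))
```

As a rule of thumb, removing a single predictor increases adjusted R-squared exactly when the absolute value of that predictor's t-statistic is below 1.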
Diagnostic Insights: The pairs panel plot below reveals a strong positive correlation between Critics and Audience scores (collinearity). This explains why one of them had to be eliminated during the backward process: they essentially provide overlapping information to the model.
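Collinearity can also be checked numerically. The sketch below, on simulated scores (not the Hollywood dataset), computes the correlation between the two score variables and a variance inflation factor (VIF) by hand. VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The simulation settings are assumptions chosen so that the two scores overlap strongly.

```r
set.seed(7)  # simulated scores, not the Hollywood dataset
n <- 200
critics <- runif(n, 10, 95)
audience <- 0.8 * critics + 15 + rnorm(n, sd = 6)  # built to track critics
budget <- runif(n, 5, 200)
sim <- data.frame(budget, critics, audience)

# Pairwise correlation between the two score variables.
score_cor <- cor(sim$critics, sim$audience)

# VIF for critics: regress it on the other predictors, then 1 / (1 - R^2).
r2_critics <- summary(lm(critics ~ audience + budget, data = sim))$r.squared
vif_critics <- 1 / (1 - r2_critics)

print(round(c(correlation = score_cor, vif = vif_critics), 2))
```

A VIF close to 1 means a predictor is nearly independent of the others; larger values indicate that its coefficient is estimated less precisely because of the overlap.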
library(psych)
# Use the psych library to visualize relationships between the numeric variables
# Select the cleaned numeric columns by name
numeric_data <- hollywood_final |>
  select(budget_million, average_critics, average_audience, profit_million)

pairs.panels(numeric_data, gap = 0, pch = 21, lm = TRUE)
Step 9: First Main Visualization
# Highcharter plot
hchart(hollywood_final, "scatter",
       hcaes(x = average_critics, y = profit_million, group = primary_genre)) |>
  hc_title(text = "The Financial Weight of Critical Acclaim") |>
  hc_yAxis(title = list(text = "Profit ($Millions)"), labels = list(format = "{value}")) |>
  hc_caption(text = "Source: The Hollywood In$ider Dataset. Variables used: Average Critics and Calculated Profit.") |>
  hc_colors(c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E")) |>
  hc_add_theme(hc_theme_flat())
This interactive scatter plot visualizes the relationship between a film’s net profit and its average critic score, which was identified as a significant predictor in our final regression model. By observing the distribution, we can see a general upward trend where higher critical acclaim often correlates with increased profitability, particularly for the ‘Action’ and ‘Drama’ genres. However, the visualization also highlights ‘Horror’ films as notable outliers; these films frequently occupy a high-profit space despite receiving lower-than-average scores from critics. This suggests that while our regression model finds ‘Average Critics’ statistically significant, the genre itself acts as a powerful moderator of financial success, often allowing low-budget films to achieve high returns regardless of professional reviews.
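One way to test the idea that genre moderates the effect of critic scores is an interaction term, which lets each genre have its own slope. The project's regression does not include such a term, so the sketch below is only an illustration on simulated data, built so that critic scores matter for two genres and not for the third.

```r
set.seed(3)  # simulated data, not the Hollywood dataset
n <- 120
genre <- factor(rep(c("Action", "Drama", "Horror"), each = n / 3))
critics <- runif(n, 10, 95)

# Built so that critic score matters for Action and Drama but not for Horror.
slope <- unname(c(Action = 2, Drama = 1.5, Horror = 0)[as.character(genre)])
profit <- 20 + slope * critics + rnorm(n, sd = 25)
sim <- data.frame(profit, critics, genre)

# `critics * genre` expands to both main effects plus their interaction.
mod <- lm(profit ~ critics * genre, data = sim)

# Interaction rows show how each genre's slope differs from the baseline (Action).
print(round(coef(mod), 2))
```

A clearly negative `critics:genreHorror` coefficient in output like this would mean that critic scores help Horror films less than they help the baseline genre.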
Second Main Visualization: 3D Plot (New Skill)
# 3D visualization (something new)
plot_ly(hollywood_final,
        x = ~budget_million, y = ~average_critics, z = ~average_audience,
        color = ~primary_genre, colors = "Dark2",
        type = "scatter3d", mode = "markers") |>
  layout(title = "3D Success Matrix")
This 3D visualization provides a holistic view of the three primary metrics explored during our backward elimination process: Budget, Critic Scores, and Audience Scores. The plot is essential for understanding why ‘Average Audience’ was eventually removed from our statistical model. Films with high critic scores also tend to have high audience scores, so the points line up along the two score axes. This visual overlap suggests that the two variables provide largely redundant information to the model.
Tableau Visualization
Creating a clean .csv file
# Set the output location for the .csv
setwd("~/Desktop/Spring 26/Data110/Final project")
# row.names = FALSE avoids writing an extra unnamed index column
write.csv(hollywood_final, "Hollywood_Final_For_Tableau.csv", row.names = FALSE)
For the Tableau dashboard, I moved beyond the standard relationship of critics and profit to explore Budget Efficiency (ROI). The multiple linear regression model established a baseline expectation: each additional $1 million of budget is associated with approximately $1.82 million more profit, holding critic score constant. The Tableau visualization, in contrast, reveals the Efficiency Outliers. By mapping the ROI Ratio as a color gradient, we can see a striking pattern: ‘Horror’ and ‘Comedy’ films often occupy the ‘High-Efficiency Zone’, meaning they achieve deep green ROI colors despite having much smaller budgets than ‘Action’ blockbusters.
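The same genre-level efficiency comparison can be sketched in R. The code shown in this report does not compute an ROI column, so the calculation presumably happened in Tableau; the base R snippet below is an assumed equivalent that uses made-up rows (not the dataset) and summarizes ROI by genre with the median.

```r
# Toy rows only: genres and numbers are made up for illustration.
films <- data.frame(
  primary_genre = c("Horror", "Horror", "Action", "Action", "Comedy"),
  budget_million = c(5, 10, 150, 200, 20),
  worldwide_gross_million = c(50, 40, 450, 300, 80)
)

films$profit_million <- films$worldwide_gross_million - films$budget_million
films$roi_ratio <- films$profit_million / films$budget_million

# Median ROI per genre, ordered from most to least efficient.
roi <- tapply(films$roi_ratio, films$primary_genre, median)
roi_by_genre <- roi[order(roi, decreasing = TRUE)]
print(roi_by_genre)
```

The median is used here because ROI is heavily skewed: a single very cheap hit can dominate a genre's mean.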
ESSAY PART 10: OUTRO - CONCLUSIONS
Visualization Analysis
Visualization Representation: The visualizations in this project illustrate three primary dimensions of film success: financial investment, critical evaluation, and audience sentiment.
Surprises and Patterns: The most notable insight was the “Critic-Audience Deviance.” In the 3D plot, a cluster of “Action” films demonstrates very high audience scores and substantial profit, yet extremely low critic scores. This observation supports my background research indicating that certain genres are relatively unaffected by critical reception. Additionally, “Drama” films, while achieving higher average critic scores, exhibit much lower profit volatility, suggesting they represent a more stable but lower-yield investment.
Limitations and Future Work: One element I wish I could have included was a GIS map showing revenue by country (Foreign Gross). While the data includes a “Foreign Gross” column, it does not break it down by specific nation, so I was unable to create a heat map of global performance. Additionally, I attempted to create a “Scrolling Animation” over the years, but the dataset's year variable was missing for too many entries, leading to a “jumpy” visualization. In future iterations, I would use web scraping to fill in those missing dates, which would make it possible to illustrate how the “Super-Hero” era altered the budget-to-profit ratio over time.
Works Cited
Edwards, B. (2024). The Critic-Audience Gap: Why Rotten Tomatoes Scores are Diverging. Rotten Tomatoes Insights. https://www.rottentomatoes.com/insights/critic-audience-gap
Jkunst. (2023). Highcharter for R. https://jkunst.com/highcharter/
Nash, B. (2023). Movie Budget and Financial Performance Analysis. The Numbers. https://www.the-numbers.com/market/
Plotly Technologies Inc. (2024). Collaborative data science. https://plot.ly. (3D Scatter Plots in R; the primary source for the plot_ly() syntax.)