The global box office is a multibillion-dollar industry that shapes entertainment trends and economic landscapes. This report examines box office revenue data for movies in a dataset labeled 2010-2024; the release years actually present in the data run from 2010 to 2023. The dataset includes worldwide, domestic, and foreign revenues, along with the domestic and foreign percentage shares. Analyzing it lets us identify patterns in box office success, the dominance of franchise films, and the importance of international markets.
This analysis will provide:

- Descriptive statistics of box office revenue trends.
- Five unique visualizations to explore revenue distributions and correlations.
- Key insights into factors influencing box office performance.
# Load necessary libraries
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
# Load the dataset
df <- read_csv("/Users/jasoncherubini/Desktop/2010-2024 Movies Box Ofice Collection.csv")
## Rows: 2800 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Release Group, Domestic_percent, Foreign_percent
## dbl (2): Rank, year
## num (3): Worldwide, Domestic, Foreign
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert revenue columns to numeric (strip thousands separators)
# and percentage columns to 0-1 shares (strip the percent sign, divide by 100)
df <- df %>% mutate(
  Worldwide = as.numeric(gsub(",", "", Worldwide)),
  Domestic = as.numeric(gsub(",", "", Domestic)),
  Foreign = as.numeric(gsub(",", "", Foreign)),
  Domestic_percent = as.numeric(gsub("%", "", Domestic_percent)) / 100,
  Foreign_percent = as.numeric(gsub("%", "", Foreign_percent)) / 100
)
# Remove any NA values that may cause visualization issues
df <- na.omit(df)
# Display basic summary of the data
summary(df)
## Rank Release Group Worldwide Domestic
## Min. : 0.00 Length:2800 Min. :4.811e+06 Min. : 0
## 1st Qu.: 49.75 Class :character 1st Qu.:2.739e+07 1st Qu.: 0
## Median : 99.50 Mode :character Median :5.213e+07 Median : 14232540
## Mean : 99.50 Mean :1.353e+08 Mean : 45280768
## 3rd Qu.:149.25 3rd Qu.:1.363e+08 3rd Qu.: 53351402
## Max. :199.00 Max. :2.799e+09 Max. :936662225
## Domestic_percent Foreign Foreign_percent year
## Min. :0.000 Min. :0.000e+00 Min. :0.000 Min. :2010
## 1st Qu.:0.000 1st Qu.:1.797e+07 1st Qu.:0.502 1st Qu.:2013
## Median :0.270 Median :3.566e+07 Median :0.730 Median :2016
## Mean :0.287 Mean :8.997e+07 Mean :0.713 Mean :2016
## 3rd Qu.:0.498 3rd Qu.:8.793e+07 3rd Qu.:1.000 3rd Qu.:2020
## Max. :1.000 Max. :1.941e+09 Max. :1.000 Max. :2023
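The summary still reports 2800 rows, the same count `read_csv()` printed, so `na.omit()` dropped nothing in this run. That will not hold for every file, so it can be worth counting unparseable values before discarding them. The sketch below uses base R only; `parse_percent` is a hypothetical helper and `raw_percent` holds made-up strings, not values from the dataset.

```r
# Made-up percentage strings for illustration (not taken from the dataset)
raw_percent <- c("27%", "49.8%", "100%", "n/a")

# Hypothetical helper: strip the percent sign and rescale to a 0-1 share.
# Entries that cannot be parsed become NA instead of raising a warning.
parse_percent <- function(x) {
  suppressWarnings(as.numeric(gsub("%", "", x, fixed = TRUE))) / 100
}

shares <- parse_percent(raw_percent)

# Number of values that a later na.omit() would silently drop
n_missing <- sum(is.na(shares))

print(shares)
print(n_missing)
```

On the real data, the same check could be run as `colSums(is.na(df))` just before the `na.omit(df)` call.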
To provide an overview of the dataset, we examine key descriptive statistics, including summary measures and the top-grossing movies.
# Summary statistics
descriptive_stats <- df %>% summarise(
  Mean_Worldwide = mean(Worldwide, na.rm = TRUE),
  Median_Worldwide = median(Worldwide, na.rm = TRUE),
  SD_Worldwide = sd(Worldwide, na.rm = TRUE),
  Min_Worldwide = min(Worldwide, na.rm = TRUE),
  Max_Worldwide = max(Worldwide, na.rm = TRUE),
  Mean_Domestic = mean(Domestic, na.rm = TRUE),
  SD_Domestic = sd(Domestic, na.rm = TRUE),
  Mean_Foreign = mean(Foreign, na.rm = TRUE),
  SD_Foreign = sd(Foreign, na.rm = TRUE)
)
kable(descriptive_stats, caption = "Descriptive Statistics for Worldwide, Domestic, and Foreign Revenue") %>% kable_styling()
| Mean_Worldwide | Median_Worldwide | SD_Worldwide | Min_Worldwide | Max_Worldwide | Mean_Domestic | SD_Domestic | Mean_Foreign | SD_Foreign |
|---|---|---|---|---|---|---|---|---|
| 135252367 | 52132482 | 226425143 | 4810790 | 2799439100 | 45280768 | 85158784 | 89970983 | 152146157 |
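Nine- and ten-digit figures are hard to scan. As a rough sketch, base R's `format()` can add thousands separators, or the values can be rescaled to millions before display. The three numbers below are copied from the table above; the variable names are illustrative.

```r
# Values copied from the descriptive statistics table
stats <- c(
  Mean_Worldwide   = 135252367,
  Median_Worldwide = 52132482,
  Max_Worldwide    = 2799439100
)

# Thousands separators, no scientific notation, no padding
with_commas <- format(stats, big.mark = ",", scientific = FALSE, trim = TRUE)

# The same values in millions of dollars, rounded to one decimal place
in_millions <- round(stats / 1e6, 1)

print(with_commas)
print(in_millions)
```

`kable()` also accepts a `format.args` list that it passes on to `format()`, so `format.args = list(big.mark = ",")` is one way to get the same effect directly in the rendered table.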
# Top 10 Movies by Worldwide Revenue
top_movies <- df %>%
  arrange(desc(Worldwide)) %>%
  head(10) %>%
  select(Rank, `Release Group`, Worldwide)
kable(top_movies, caption = "Top 10 Movies by Worldwide Revenue") %>% kable_styling()
| Rank | Release Group | Worldwide |
|---|---|---|
| 0 | Avengers: Endgame | 2799439100 |
| 0 | Avatar: The Way of Water | 2320250281 |
| 0 | Star Wars: Episode VII 0 The Force Awakens | 2068223624 |
| 0 | Avengers: Infinity War | 2048359754 |
| 0 | Spider0Man: No Way Home | 1912233593 |
| 1 | Jurassic World | 1670400637 |
| 1 | The Lion King | 1656943394 |
| 0 | The Avengers | 1518812988 |
| 2 | Furious 7 | 1515047671 |
| 1 | Top Gun: Maverick | 1495696292 |
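The table above lists the ten highest-grossing films by worldwide revenue. Every one is a franchise entry, sequel, or remake. The Rank column appears to be a zero-based position within each release year rather than an overall rank, which is why several films share rank 0. Titles such as "Spider0Man: No Way Home" show a 0 where a hyphen would be expected, which suggests hyphens were replaced with 0 somewhere upstream of this analysis; the titles are shown as they appear in the data.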
ggplot(df, aes(x = Domestic, y = Worldwide)) +
  geom_point(color = "blue", alpha = 0.5) +
  labs(title = "Worldwide vs. Domestic Revenue", x = "Domestic Revenue ($)", y = "Worldwide Revenue ($)") +
  theme_minimal()
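Insight: The scatter plot shows a strong positive relationship between domestic and worldwide revenue, which is partly expected because worldwide revenue is essentially the sum of the domestic and foreign figures. The relationship is not one-to-one, however. The summary statistics show a median domestic share of 27% and a first-quartile domestic revenue of 0, so many films earn most or all of their recorded gross abroad, which points to the influence of global distribution and marketing strategies.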
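The strength of that relationship can be put into a number with `cor()`. The sketch below uses a small made-up data frame (`toy`, not the real dataset) and assumes worldwide revenue is the sum of domestic and foreign revenue. Because domestic revenue is then a component of the worldwide total, the domestic-versus-foreign correlation is the less mechanical comparison.

```r
# Made-up revenues in millions of dollars (illustrative, not from the dataset)
toy <- data.frame(
  Domestic = c(10, 0, 50, 200, 400),
  Foreign  = c(30, 25, 60, 300, 500)
)
# Assumption for this sketch: worldwide is exactly domestic plus foreign
toy$Worldwide <- toy$Domestic + toy$Foreign

# Linear association between domestic and worldwide revenue
pearson <- cor(toy$Domestic, toy$Worldwide, method = "pearson")

# Rank-based association, less sensitive to a few extreme blockbusters
spearman <- cor(toy$Domestic, toy$Worldwide, method = "spearman")

# Domestic against foreign avoids the part-of-the-whole overlap
dom_vs_foreign <- cor(toy$Domestic, toy$Foreign)

print(c(pearson = pearson, spearman = spearman, dom_vs_foreign = dom_vs_foreign))
```

With the report's data frame, the equivalent call would be `cor(df$Domestic, df$Worldwide)`.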
trend_data <- df %>%
  group_by(year) %>%
  summarise(Total_Revenue = sum(Worldwide, na.rm = TRUE))
# Use linewidth for line thickness; size for lines was deprecated in ggplot2 3.4.0
ggplot(trend_data, aes(x = year, y = Total_Revenue)) +
  geom_line(color = "red", linewidth = 1) +
  geom_point(color = "black") +
  labs(title = "Total Box Office Revenue Trends (2010-2024)", x = "Year", y = "Total Worldwide Revenue ($)") +
  theme_minimal()
Insight: Revenue trends fluctuate, with notable peaks likely corresponding to blockbuster releases and franchise films. The impact of external factors, such as the COVID-19 pandemic, can also be seen in revenue dips.
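Peaks and dips are easier to compare as year-over-year percentage changes than by eye. The sketch below assumes a table shaped like `trend_data` (one row per year with a `Total_Revenue` column). The totals are made-up round numbers arranged to contain one sharp drop, not figures from the dataset.

```r
# Made-up yearly totals (illustrative only, not from the dataset)
trend_toy <- data.frame(
  year = 2018:2021,
  Total_Revenue = c(100, 110, 55, 66)
)

# Percentage change against the previous year; the first year has no
# predecessor, so its change is NA
trend_toy$YoY_pct <- c(
  NA,
  diff(trend_toy$Total_Revenue) / head(trend_toy$Total_Revenue, -1) * 100
)

# Year with the steepest decline (which.min skips the NA)
steepest_drop_year <- trend_toy$year[which.min(trend_toy$YoY_pct)]

print(trend_toy)
print(steepest_drop_year)
```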
ggplot(df, aes(x = Worldwide)) +
  geom_density(fill = "blue", alpha = 0.4) +
  labs(title = "Density Distribution of Worldwide Revenue",
       x = "Worldwide Revenue ($)",
       y = "Density") +
  theme_minimal()
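Insight: The density plot shows how worldwide revenue is distributed across movies. The distribution is strongly right-skewed: the mean (about $135 million) is roughly 2.6 times the median (about $52 million), and the maximum reaches about $2.8 billion, so most films cluster at the low end while a handful of blockbusters form a long tail. Peaks in the curve mark the most common revenue ranges. Compared with a heatmap, a density curve avoids visual clutter and presents revenue as a smooth distribution.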
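The descriptive statistics put the mean worldwide gross well above the median, which is the signature of a right-skewed distribution. One common response, sketched here with made-up revenues rather than the real data, is to compare mean and median directly and then look at the values on a log10 scale, where a long right tail is compressed.

```r
# Made-up revenues in dollars: many modest films, one huge outlier (illustrative)
revenue_toy <- c(5e6, 2e7, 3e7, 5e7, 8e7, 1.5e8, 2.8e9)

# A ratio well above 1 signals right skew
mean_to_median <- mean(revenue_toy) / median(revenue_toy)

# On a log10 scale the outlier pulls the mean far less
log_revenue <- log10(revenue_toy)
log_gap <- mean(log_revenue) - median(log_revenue)

print(mean_to_median)
print(log_gap)
```

In ggplot2, adding `+ scale_x_log10()` to the density plot applies the same transformation to the axis without changing the data.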
top_10 <- df %>%
  arrange(desc(Worldwide)) %>%
  head(10)
ggplot(top_10, aes(x = reorder(`Release Group`, Worldwide), y = Worldwide)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Movies by Worldwide Revenue", x = "Movie Title", y = "Worldwide Revenue ($)") +
  theme_minimal()
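Insight: The top-grossing movies are franchise films, sequels, remakes, or superhero movies. Four of the ten are Marvel titles (Avengers: Endgame, Avengers: Infinity War, Spider-Man: No Way Home, and The Avengers), reinforcing the financial dominance of established intellectual properties.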
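One way to quantify how concentrated revenue is at the top is the share of the total captured by the N highest-grossing films. The sketch below is a minimal illustration: `top_n_share` is a hypothetical helper and the revenues are made-up numbers, not values from the dataset.

```r
# Made-up worldwide revenues in millions of dollars (illustrative only)
worldwide_toy <- c(900, 500, 300, 100, 80, 60, 30, 20, 6, 4)

# Hypothetical helper: fraction of total revenue earned by the top n entries
top_n_share <- function(x, n) {
  sorted <- sort(x, decreasing = TRUE)
  sum(head(sorted, n)) / sum(x)
}

share_top3 <- top_n_share(worldwide_toy, 3)
print(share_top3)
```

Applied to the report's data, `top_n_share(df$Worldwide, 10)` would give the share of all recorded revenue earned by the ten films in the chart.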
This report analyzed box office trends using five unique visualizations. The findings emphasize:

- The high correlation between domestic and global revenue.
- The dominance of franchise films in revenue generation.
- The importance of foreign markets in shaping success.
These insights are crucial for investors, movie studios, and distributors looking to maximize box office performance.
End of Report