# Load required libraries
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(plotly)
library(highcharter)
setwd("/Users/rebeccambaho/Downloads")
marvel <- read_csv("marvel_box_office.csv")
Project 2: Marvel Dataset
This project focuses on Marvel movies, which have become a significant cultural phenomenon in recent years, dominating the box office with numerous blockbuster hits. As a big fan, I was thrilled to discover a dataset covering every Marvel superhero movie since Blade. The data is sourced from The Numbers and Wikipedia. In this analysis, we’ll look at the ownership distribution of Marvel movies, examine the relationship between budget and box office across the years, and explore how Marvel movies perform relative to their budgets.
# Explore the data - Summary statistics
summary(marvel)
Movie Release Date Release Month Release Day
Length:68 Length:68 Length:68 Min. : 1.00
Class :character Class :character Class :character 1st Qu.: 4.00
Mode :character Mode :character Mode :character Median : 8.00
Mean :10.63
3rd Qu.:16.25
Max. :30.00
Release Year Ownership Domestic Box Office
Min. :1998 Length:68 Min. : 8050977
1st Qu.:2008 Class :character 1st Qu.:132461948
Median :2014 Mode :character Median :214728302
Mean :2013 Mean :255834383
3rd Qu.:2019 3rd Qu.:347191576
Max. :2024 Max. :858373000
Inflation Adjusted Domestic International Box Office
Min. : 11807352 Min. :2.107e+06
1st Qu.:184227607 1st Qu.:1.531e+08
Median :263540128 Median :3.237e+08
Mean :321874669 Mean :3.847e+08
3rd Qu.:433006184 3rd Qu.:4.984e+08
Max. :986754117 Max. :1.931e+09
Inflation Adjusted International Worldwide Box Office
Min. :3.089e+06 Min. :1.016e+07
1st Qu.:2.053e+08 1st Qu.:2.835e+08
Median :4.061e+08 Median :5.639e+08
Mean :4.767e+08 Mean :6.405e+08
3rd Qu.:6.421e+08 3rd Qu.:8.467e+08
Max. :2.219e+09 Max. :2.789e+09
Inflation Adjusted Worldwide Opening Weekend
Min. :1.490e+07 Min. : 4271451
1st Qu.:4.024e+08 1st Qu.: 54944072
Median :7.206e+08 Median : 85308521
Mean :7.986e+08 Mean : 95573759
3rd Qu.:9.946e+08 3rd Qu.:123435530
Max. :3.206e+09 Max. :357115007
Inflation Adjusted Opening Weekend Budget
Min. : 6264398 Min. : 33000000
1st Qu.: 72010860 1st Qu.:115750000
Median :106263045 Median :160000000
Mean :119597453 Mean :160217647
3rd Qu.:152621862 3rd Qu.:200000000
Max. :410526314 Max. :400000000
Inflation Adjusted Budget IMDb Score Meta Score Tomatometer
Min. : 51330083 Min. :3.800 Min. :26.00 Min. : 9.00
1st Qu.:148341582 1st Qu.:6.175 1st Qu.:47.75 1st Qu.:46.75
Median :201706716 Median :6.900 Median :61.50 Median :74.50
Mean :202661147 Mean :6.760 Mean :58.15 Mean :65.60
3rd Qu.:250000000 3rd Qu.:7.500 3rd Qu.:69.00 3rd Qu.:87.50
Max. :459825329 Max. :8.400 Max. :88.00 Max. :96.00
Rotten Tomato Audience Score Run Time In Minutes Phase
Min. :18.00 Min. : 92.0 Length:68
1st Qu.:63.75 1st Qu.:112.0 Class :character
Median :78.50 Median :124.0 Mode :character
Mean :73.28 Mean :123.7
3rd Qu.:87.00 3rd Qu.:134.0
Max. :98.00 Max. :181.0
Director
Length:68
Class :character
Mode :character
1. Pie chart:
# Compute the percentage of ownership for each category
marvel_percent <- marvel %>%
count(Ownership) %>%
mutate(prop = n / sum(n) * 100) %>%
arrange(desc(Ownership)) %>%
mutate(ypos = cumsum(prop) - 0.5 * prop,
total_movies = paste0("(", n, ")"))
# Plot pie chart to visualize ownership distribution
ggplot(marvel_percent, aes(x = "", y = prop, fill = Ownership)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
theme_void() +
theme(legend.position = "none",
plot.background = element_rect(fill = "#1E1E1E"),
panel.background = element_rect(fill = "#1E1E1E"),
axis.text = element_blank(),
axis.title = element_blank(),
plot.title = element_text(color = "white"),
plot.caption = element_text(color = "white")) +
geom_text(aes(x = 1.8, y = ypos, label = paste(Ownership, "\n", total_movies, "\n", round(prop, 2), "%")),
color = "white", size = 2.2) +
scale_fill_brewer(palette = "BrBG") +
labs(title = "Ownership Distribution of Marvel Movies",
caption = "Source: The Numbers + Wikipedia")
I was curious about the ownership distribution of the movies, so I decided to create a pie chart. The chart shows that Marvel Studios has the largest share, accounting for approximately 48% with 33 movies in its portfolio, including iconic titles like the Avengers movies. 20th Century Fox follows with 18 movies, and Sony holds the third-largest share with 11. 20th Century Fox is responsible for titles like X-Men and Deadpool, which happen to be among my favorites. Despite its simplicity, I’m really happy with how this visualization turned out.
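The label placement in the pie chart code can be checked by hand. The sketch below is a minimal base R illustration, assuming the three counts quoted above (33, 18, 11) and the 68 rows reported by `summary(marvel)`; the remaining 6 movies are lumped under a hypothetical "Other" label, since the text does not name their owners. Unlike the chart code, it does not sort by ownership first, so the positions are only illustrative.

```r
# Counts come from the paragraph above; "Other" is an assumed placeholder
# for the 68 - 33 - 18 - 11 = 6 movies whose owners are not listed in the text.
ownership <- data.frame(
  Ownership = c("Marvel Studios", "20th Century Fox", "Sony", "Other"),
  n = c(33, 18, 11, 6)
)

# Share of each owner as a percentage of all movies
ownership$prop <- ownership$n / sum(ownership$n) * 100

# Midpoint of each slice along the stacked axis:
# the cumulative end of the slice minus half of its own width
ownership$ypos <- cumsum(ownership$prop) - 0.5 * ownership$prop

print(ownership)
```

The midpoint trick is what keeps each label centred on its slice: `cumsum(prop)` marks where a slice ends, and subtracting half of `prop` steps back to its middle.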
2. Bubble chart:
# Define function to format numbers with suffixes
formatNumber <- function(number) {
ifelse(number == 0, "0", {
suffixes <- c("", "K", "M", "B", "T")
suffixIndex <- floor(log10(abs(number)) / 3)
suffix <- suffixes[suffixIndex + 1]
number <- number / 10^(3 * suffixIndex)
number <- formatC(number, format = "f", digits = 0, big.mark = ",", drop0trailing = TRUE)
paste0(number, suffix)
})
}
# Create a bubble chart to visualize Marvel movie budgets and performance
p <- ggplot(marvel, aes(x = `Release Year`, y = Budget, size = `Worldwide Box Office`,
text = paste("<b>", Movie, "</b><br>",
"<b>", `Release Year`, "</b><br>",
"<b>Budget:</b> <b>", formatNumber(Budget), "</b> ", "<br>",
"<b>Worldwide Box Office:</b> ", formatNumber(`Worldwide Box Office`)))) +
geom_point(aes(fill = `Worldwide Box Office`), alpha = 0.5, shape = 21, stroke = .4, color = "#003300") +
labs(title = "Marvel Movie Budgets vs Performance",
x = "Release Year",
y = "Budget",
size = "Worldwide Box Office",
caption = "Source: The Numbers + Wikipedia") +
scale_fill_gradient(low = "white", high = "#003300", name = "Worldwide Box Office",
labels = function(x) formatNumber(x)) + # Use formatNumber function for legend labels
scale_size_continuous(range = c(3, 15)) +
scale_y_continuous(labels = function(x) formatNumber(x)) +
theme_solarized() +
theme(plot.caption = element_text(hjust = 0, color = "black", face = "italic"))
# Convert ggplot object to interactive plot using plotly
ggplotly(p, tooltip = "text")
The bubble chart shows how well Marvel movies perform financially by comparing their budgets to their global earnings over the years. Each bubble stands for a movie, and its size and shade show how much money the movie made worldwide. Looking at the chart, it seems like movies with larger budgets tend to do better financially. However, to analyze this relationship more closely, I created the dual axes plot.
3. Dual axes plot:
(Main graph)
# Sort the data by Release Year
marvel <- marvel %>% arrange(`Release Year`)
# Define colors for the highchart
cols <- c("slateblue", "aliceblue")
# Create highchart with legend
hc <- highchart() %>%
hc_title(text = "Global Performance in Relation to Marvel Movie Budgets", style = list(color = "white")) %>%
hc_yAxis_multiples(
list(title = list(text = "Budget", style = list(color = cols[1]))),
list(title = list(text = "Worldwide Box Office", style = list(color = cols[2])), opposite = TRUE)
) %>%
hc_add_series(
data = marvel$Budget,
name = "Budget",
type = "column",
color = cols[1]
) %>%
hc_add_series(
data = marvel$`Worldwide Box Office`,
name = "Worldwide Box Office",
type = "spline",
lineWidth = 2.5,
color = cols[2],
yAxis = 1
) %>%
hc_xAxis(
categories = marvel$Movie,
labels = list(style = list(color = "white", fontFamily = "Josefin Slab"))
) %>%
hc_tooltip(
shared = TRUE,
formatter = JS("function() {
var formatNumber = function(number) {
var suffixes = ['', 'K', 'M', 'B', 'T'];
var suffixIndex = Math.floor(Math.log10(Math.abs(number)) / 3);
var suffix = suffixes[suffixIndex];
number = number / Math.pow(10, 3 * suffixIndex);
number = number.toFixed(2);
return '$' + number + suffix;
};
var tooltip = '<b>' + this.points[0].point.category + '</b><br/>';
$.each(this.points, function(i, point) {
tooltip += '<span style=\"color:' + point.color + '\"><b>\u25CF</b></span> <b>' + point.series.name + ': <b>' + formatNumber(point.y) + '</b></b><br/>';
});
return tooltip;
}")
) %>%
hc_credits(
text = "Source: The Numbers + Wikipedia",
href = "://the-numbers.com",
style = list(color = "white")
) %>%
hc_chart(
backgroundColor = "#1E1E1E", # Dark background color
style = list(fontFamily = "Josefin Slab", color = "white")
) %>%
hc_legend(align = "left", verticalAlign = "top", itemStyle = list(color = "white"))
# Print the highchart
hc
I created a dual-axis graph to analyze Marvel’s financial performance over the years. This graph illustrates the relationship between Marvel movie budgets and their corresponding worldwide box office earnings. At a glance, budget and worldwide box office revenue tend to rise together, though not uniformly. “Age of Ultron,” for instance, earned only $1.40 billion despite a sizable budget, which is a shame considering it’s a great movie. “Endgame,” by contrast, earned back its budget many times over, bringing in about $2.79 billion in revenue on a budget of $400 million.
Linear Regression Analysis:
# Set scipen option to prevent scientific notation
options(scipen = 999)
# Perform linear regression analysis
model <- lm(`Worldwide Box Office` ~ Budget, data = marvel)
# Summary of the regression model
summary(model)
Call:
lm(formula = `Worldwide Box Office` ~ Budget, data = marvel)
Residuals:
Min 1Q Median 3Q Max
-1015150794 -192653567 -38523949 111202077 1067929446
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -162633928.5840 100408807.4356 -1.62 0.11
Budget 5.0127 0.5703 8.79 0.00000000000103 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 343400000 on 66 degrees of freedom
Multiple R-squared: 0.5393, Adjusted R-squared: 0.5324
F-statistic: 77.27 on 1 and 66 DF, p-value: 0.000000000001026
The linear regression model equation can be expressed as:
Worldwide Box Office = -162,633,928.5840 + 5.0127 × Budget
This equation tells us that as the budget for a Marvel movie increases, its worldwide box office earnings tend to go up too. The slope suggests that each additional dollar of budget is associated with about $5.01 more in worldwide box office earnings, or roughly $5.01 million for every extra $1 million of budget. The relationship is statistically significant, with a very small p-value (approximately 0.000000000001026), so an association this strong is very unlikely to be due to chance alone. The intercept, by contrast, is not significantly different from zero (p = 0.11). The model’s coefficient of determination (R-squared) indicates that approximately 53.93% of the variability in worldwide box office earnings can be explained by the budget variable.
So, while spending more money on Marvel movies usually means higher earnings, there are other factors that also influence how successful a movie is at the box office.
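To make the fitted equation concrete, here is a small worked example in base R. It is only a sketch: the intercept, slope, and R-squared are copied from the `summary(model)` output above, while the $200 million budget is a hypothetical value chosen for illustration and does not refer to any specific movie.

```r
# Coefficients copied from the regression output above
intercept <- -162633928.5840
slope <- 5.0127

# Hypothetical budget used purely for illustration
budget <- 200e6

# Predicted worldwide box office from the fitted line
predicted <- intercept + slope * budget
print(format(round(predicted), big.mark = ","))

# In a one-predictor regression with a positive slope, the correlation
# between the two variables is the square root of R-squared
r <- sqrt(0.5393)
print(round(r, 3))
```

For this budget the line predicts roughly $840 million worldwide, and the implied correlation of about 0.73 matches the impression from the charts: a clear upward trend with a lot of scatter around it, as the large residual standard error also suggests.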
Summary Essay
A. I acquired this dataset from Kaggle. As mentioned earlier, it focuses on Marvel movies, providing information on their budgets and worldwide box office earnings over time. I’m a big Marvel fan, so this dataset really caught my eye. It includes movie titles, release years, budgets, worldwide box office earnings, and details about the studios behind the movies. There are more variables, but I might explore those in a future project. Most of the variables are numerical, although some are categorical, such as studio ownership. The person who uploaded the dataset to Kaggle mentioned obtaining it from sources like “The Numbers,” Wikipedia, and IMDb ratings. I didn’t need to do much cleaning, as the dataset was already well organized. Although the dataset doesn’t specify phases for non-MCU movies like X-Men and Deadpool, I decided to include them anyway so that the analysis covers all Marvel movies. The only cleaning task I performed was sorting the data by release year.
B. Marvel Studios has reshaped the franchise movie landscape. As reported by Harvard Business Review in 2019, its portfolio of 22 films had raked in an astounding $17 billion in global box office revenue, surpassing every other movie franchise in history. Notably, these films boast an impressive average approval rating of 84% on Rotten Tomatoes, far exceeding the 68% average for top-grossing franchises. Moreover, each movie secures an average of 64 nominations and awards, underscoring their critical acclaim and widespread recognition. The release of Avengers: Endgame in 2019 is a prime example of this triumph. The sheer magnitude of its success prompted online ticket platforms to overhaul their systems to manage the overwhelming influx of requests (Harrison, Carlsen, & Škerlavaj, 2019).
C. The visuals show us how well Marvel movies are doing financially. The pie chart tells us Marvel Studios makes most of these movies, with big studios like 20th Century Fox and Sony also in the mix. The bubble chart shows how the money spent on making the movies compares to how much they make worldwide. Generally, when the budget goes up, so do the earnings. The graph with two axes backs this up, showing us which movies did better or worse compared to their budgets over time. And the linear regression confirms that spending more money usually means making more money at the box office.
Originally, I had the year on the x-axis of my dual axes plot instead of the movie title. When I attempted to include the movie title in the tooltip function, it didn’t work, and without titles it was confusing to tell the columns apart. So, I adjusted the graph to display the movie titles on the x-axis instead of the years. Afterward, I tried adding the year to the tooltip, but that didn’t work either. With each series passed as a plain vector of numbers, the tooltip seems to have access only to the plotted values and the axis category. Overall, I’m really happy with how all my graphs came out.
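One possible workaround for the missing year, offered as a sketch and not as what the chart above does, is to fold the year into the category label itself. Because the tooltip header already prints the category, the year would then appear without any change to the formatter. The three rows below are illustrative stand-ins for the real `marvel` data frame, and `labels` is a hypothetical helper variable.

```r
# Illustrative rows only; in the project the data frame is `marvel`
movies <- data.frame(
  Movie = c("Blade", "Avengers: Age of Ultron", "Avengers: Endgame"),
  `Release Year` = c(1998, 2015, 2019),
  check.names = FALSE  # keep the space in the column name, as in the dataset
)

# Combine title and year into a single label per movie
labels <- paste0(movies$Movie, " (", movies$`Release Year`, ")")
print(labels)

# In the chart, these labels would replace the plain titles, e.g.
# hc_xAxis(categories = labels, ...)
```

The trade-off is that the year also shows up on the x-axis labels, which makes them longer; that may or may not be acceptable depending on how crowded the axis already is.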
References
Harrison, S., Carlsen, A., & Škerlavaj, M. (2019). Marvel’s Blockbuster Machine: How the studio balances continuity and renewal. Harvard Business Review. Retrieved from https://hbr.org/2019/07/marvels-blockbuster-machine