Project 2: Marvel Dataset

Author

Rebecca Jipdjio

Published

April 14, 2024

Source: WIRED

This project focuses on Marvel movies, which have become a significant cultural phenomenon in recent years, dominating the box office with numerous blockbuster hits. As a big fan, I was thrilled to discover a dataset covering every Marvel superhero movie since Blade. The data is sourced from The Numbers and Wikipedia. In this analysis, we’ll delve into the ownership distribution of Marvel movies and examine the relationship between budget and box office across the years. Additionally, we’ll explore how Marvel movies perform in relation to their budgets.

# Load required libraries

library(tidyverse)
library(ggplot2)
library(ggthemes)
library(plotly)
library(highcharter)
setwd("/Users/rebeccambaho/Downloads")
marvel<- read_csv("marvel_box_office.csv")
# Explore the data - Summary statistics

summary(marvel)
    Movie           Release Date       Release Month       Release Day   
 Length:68          Length:68          Length:68          Min.   : 1.00  
 Class :character   Class :character   Class :character   1st Qu.: 4.00  
 Mode  :character   Mode  :character   Mode  :character   Median : 8.00  
                                                          Mean   :10.63  
                                                          3rd Qu.:16.25  
                                                          Max.   :30.00  
  Release Year   Ownership         Domestic Box Office
 Min.   :1998   Length:68          Min.   :  8050977  
 1st Qu.:2008   Class :character   1st Qu.:132461948  
 Median :2014   Mode  :character   Median :214728302  
 Mean   :2013                      Mean   :255834383  
 3rd Qu.:2019                      3rd Qu.:347191576  
 Max.   :2024                      Max.   :858373000  
 Inflation Adjusted Domestic International Box Office
 Min.   : 11807352           Min.   :2.107e+06       
 1st Qu.:184227607           1st Qu.:1.531e+08       
 Median :263540128           Median :3.237e+08       
 Mean   :321874669           Mean   :3.847e+08       
 3rd Qu.:433006184           3rd Qu.:4.984e+08       
 Max.   :986754117           Max.   :1.931e+09       
 Inflation Adjusted International Worldwide Box Office
 Min.   :3.089e+06                Min.   :1.016e+07   
 1st Qu.:2.053e+08                1st Qu.:2.835e+08   
 Median :4.061e+08                Median :5.639e+08   
 Mean   :4.767e+08                Mean   :6.405e+08   
 3rd Qu.:6.421e+08                3rd Qu.:8.467e+08   
 Max.   :2.219e+09                Max.   :2.789e+09   
 Inflation Adjusted Worldwide Opening Weekend    
 Min.   :1.490e+07            Min.   :  4271451  
 1st Qu.:4.024e+08            1st Qu.: 54944072  
 Median :7.206e+08            Median : 85308521  
 Mean   :7.986e+08            Mean   : 95573759  
 3rd Qu.:9.946e+08            3rd Qu.:123435530  
 Max.   :3.206e+09            Max.   :357115007  
 Inflation Adjusted Opening Weekend     Budget         
 Min.   :  6264398                  Min.   : 33000000  
 1st Qu.: 72010860                  1st Qu.:115750000  
 Median :106263045                  Median :160000000  
 Mean   :119597453                  Mean   :160217647  
 3rd Qu.:152621862                  3rd Qu.:200000000  
 Max.   :410526314                  Max.   :400000000  
 Inflation Adjusted Budget   IMDb Score      Meta Score     Tomatometer   
 Min.   : 51330083         Min.   :3.800   Min.   :26.00   Min.   : 9.00  
 1st Qu.:148341582         1st Qu.:6.175   1st Qu.:47.75   1st Qu.:46.75  
 Median :201706716         Median :6.900   Median :61.50   Median :74.50  
 Mean   :202661147         Mean   :6.760   Mean   :58.15   Mean   :65.60  
 3rd Qu.:250000000         3rd Qu.:7.500   3rd Qu.:69.00   3rd Qu.:87.50  
 Max.   :459825329         Max.   :8.400   Max.   :88.00   Max.   :96.00  
 Rotten Tomato Audience Score Run Time In Minutes    Phase          
 Min.   :18.00                Min.   : 92.0       Length:68         
 1st Qu.:63.75                1st Qu.:112.0       Class :character  
 Median :78.50                Median :124.0       Mode  :character  
 Mean   :73.28                Mean   :123.7                         
 3rd Qu.:87.00                3rd Qu.:134.0                         
 Max.   :98.00                Max.   :181.0                         
   Director        
 Length:68         
 Class :character  
 Mode  :character  
                   
                   
                   

1. Pie chart:

# Compute the percentage of ownership for each category

marvel_percent <- marvel %>%
  count(Ownership) %>%
  mutate(prop = n / sum(n) * 100) %>%
  arrange(desc(Ownership)) %>%
  mutate(ypos = cumsum(prop) - 0.5 * prop,
         total_movies = paste0("(", n, ")"))
# Plot pie chart to visualize ownership distribution

ggplot(marvel_percent, aes(x = "", y = prop, fill = Ownership)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  theme_void() + 
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#1E1E1E"),
        panel.background = element_rect(fill = "#1E1E1E"),
        axis.text = element_blank(), 
        axis.title = element_blank(),
        plot.title = element_text(color = "white"),
        plot.caption = element_text(color = "white")) + 
  geom_text(aes(x = 1.8, y = ypos, label = paste(Ownership, "\n", total_movies, "\n", round(prop, 2), "%")), 
            color = "white", size = 2.2) +  
  scale_fill_brewer(palette = "BrBG") +  
  labs(title = "Ownership Distribution of Marvel Movies",
       caption = "Source: The Numbers + Wikipedia") 

I was curious about the ownership distribution of movies, so I decided to create a pie chart. From the chart, it’s evident that Marvel Studios has the largest share, accounting for approximately 48% with 33 movies in its portfolio, including iconic titles like the Avengers movies. Following Marvel Studios, 20th Century Fox owns 18 movies, while Sony holds the third-largest share with 11 movies. 20th Century Fox is responsible for titles like X-Men and Deadpool, which happen to be among my favorites. Despite its simplicity, I’m really happy with how this visualization turned out.
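As a quick sanity check on the percentages quoted above, the shares can be recomputed from the counts alone. This is a minimal sketch in base R: the three studio counts and the total of 68 movies come from the text and the summary output, while the `Other` bucket (68 - 33 - 18 - 11 = 6) is an assumption that lumps all remaining owners together.

```r
# Movie counts per owner as quoted in the text; "Other" is an assumed
# catch-all for the remaining titles (68 total minus the three named studios)
counts <- c("Marvel Studios" = 33, "20th Century Fox" = 18, "Sony" = 11)
counts <- c(counts, Other = 68 - sum(counts))

# Percentage share of each owner, rounded to two decimals like the pie labels
shares <- round(counts / sum(counts) * 100, 2)

# Label midpoints along the stacked bar: cumulative share minus half a slice.
# The order used here is illustrative; the chart code sorts owners first so
# the midpoints line up with the way ggplot2 stacks the slices.
ypos <- cumsum(shares) - 0.5 * shares

print(shares)
```

The midpoint calculation is the same idea as the `ypos` column in the chart code: each label sits halfway through its own slice.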

2. Bubble chart:

# Define function to format numbers with suffixes

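formatNumber <- function(number) {
  suffixes <- c("", "K", "M", "B", "T")
  # Power-of-1000 group for each value; zero is assigned group 0 to avoid log10(0) = -Inf
  suffixIndex <- ifelse(number == 0, 0, floor(log10(abs(number)) / 3))
  # Clamp to the available suffixes so very small or very large values stay in range
  suffixIndex <- pmin(pmax(suffixIndex, 0), length(suffixes) - 1)
  scaled <- formatC(number / 10^(3 * suffixIndex), format = "f", digits = 0, big.mark = ",")
  paste0(scaled, suffixes[suffixIndex + 1])
}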
# Create a bubble chart to visualize Marvel movie budgets and performance

p <- ggplot(marvel, aes(x = `Release Year`, y = Budget, size = `Worldwide Box Office`,
                        text = paste("<b>", Movie, "</b><br>",
                                     "<b>", `Release Year`, "</b><br>",
                                     "<b>Budget:</b> <b>", formatNumber(Budget), "</b> ", "<br>",
                                     "<b>Worldwide Box Office:</b> ", formatNumber(`Worldwide Box Office`)))) +
  geom_point(aes(fill = `Worldwide Box Office`), alpha = 0.5, shape = 21, stroke = .4, color = "#003300") + 
  labs(title = "Marvel Movie Budgets vs Performance",
       x = "Release Year",
       y = "Budget",
       size = "Worldwide Box Office",
       caption = "Source: The Numbers + Wikipedia") +
  scale_fill_gradient(low = "white", high = "#003300", name = "Worldwide Box Office",
                      labels = function(x) formatNumber(x)) +  # Use formatNumber function for legend labels
  scale_size_continuous(range = c(3, 15)) + 
  scale_y_continuous(labels = function(x) formatNumber(x)) + 
  theme_solarized() +
  theme(plot.caption = element_text(hjust = 0, color = "black", face = "italic"))
# Convert ggplot object to interactive plot using plotly

ggplotly(p, tooltip = "text")

The bubble chart shows how well Marvel movies perform financially by comparing their budgets to their global earnings over the years. Each bubble stands for a movie, and its size and shade show how much money the movie made worldwide. Looking at the chart, it seems like movies with larger budgets tend to do better financially. To analyze this relationship more closely, I created the dual-axis plot.

3. Dual axes plot:

(Main graph)

# Sort the data by Release Year

marvel <- marvel %>% arrange(`Release Year`) 
# Define colors for the highchart
cols <- c("slateblue", "aliceblue")

# Create highchart with legend
hc <- highchart() %>%
  hc_title(text = "Global Performance in Relation to Marvel Movie Budgets", style = list(color = "white")) %>%
  hc_yAxis_multiples(
    list(title = list(text = "Budget", style = list(color = cols[1]))),
    list(title = list(text = "Worldwide Box Office", style = list(color = cols[2])), opposite = TRUE)
  ) %>%
  hc_add_series(
    data = marvel$Budget,
    name = "Budget",
    type = "column",
    color = cols[1]
  ) %>%
  hc_add_series(
    data = marvel$`Worldwide Box Office`,
    name = "Worldwide Box Office",
    type = "spline",
    lineWidth = 2.5,
    color = cols[2],
    yAxis = 1
  ) %>%
  hc_xAxis(
    categories = marvel$Movie,
    labels = list(style = list(color = "white", fontFamily = "Josefin Slab"))
  ) %>%
  hc_tooltip(
    shared = TRUE,
    formatter = JS("function() {
      var formatNumber = function(number) {
        var suffixes = ['', 'K', 'M', 'B', 'T'];
        var suffixIndex = Math.floor(Math.log10(Math.abs(number)) / 3);
        var suffix = suffixes[suffixIndex];
        number = number / Math.pow(10, 3 * suffixIndex);
        number = number.toFixed(2);
        return '$' + number + suffix;
      };
      var tooltip = '<b>' + this.points[0].point.category + '</b><br/>';
      $.each(this.points, function(i, point) {
        tooltip += '<span style=\"color:' + point.color + '\"><b>\u25CF</b></span> <b>' + point.series.name + ': <b>' + formatNumber(point.y) + '</b></b><br/>';
      });
      return tooltip;
    }")
  ) %>%
  hc_credits(
    text = "Source: The Numbers + Wikipedia",
    href = "://the-numbers.com",
    style = list(color = "white")
  ) %>%
  hc_chart(
    backgroundColor = "#1E1E1E",  # Dark background color
    style = list(fontFamily = "Josefin Slab", color = "white")  
  ) %>%
  hc_legend(align = "left", verticalAlign = "top", itemStyle = list(color = "white"))

# Print the highchart
hc

I created a dual-axis graph to analyze Marvel’s financial performance over the years. The graph plots each movie’s budget (columns) against its worldwide box office earnings (line). At a glance, higher budgets generally go with higher worldwide revenue, although the payoff varies from movie to movie. For instance, “Age of Ultron,” despite having a sizable budget, earned only $1.40 billion, which is a shame considering it’s a great movie. By contrast, “Endgame” brought in roughly $2.8 billion on a budget of $400 million, earning back its budget several times over.

Linear Regression Analysis:

# Set scipen option to prevent scientific notation
options(scipen = 999)

# Perform linear regression analysis
model <- lm(`Worldwide Box Office` ~ Budget, data = marvel)

# Summary of the regression model
summary(model)

Call:
lm(formula = `Worldwide Box Office` ~ Budget, data = marvel)

Residuals:
        Min          1Q      Median          3Q         Max 
-1015150794  -192653567   -38523949   111202077  1067929446 

Coefficients:
                   Estimate      Std. Error t value         Pr(>|t|)    
(Intercept) -162633928.5840  100408807.4356   -1.62             0.11    
Budget               5.0127          0.5703    8.79 0.00000000000103 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 343400000 on 66 degrees of freedom
Multiple R-squared:  0.5393,    Adjusted R-squared:  0.5324 
F-statistic: 77.27 on 1 and 66 DF,  p-value: 0.000000000001026

The linear regression model equation can be expressed as:

Worldwide Box Office = -162,633,928.5840 + 5.0127 × Budget

This equation tells us that as the budget for a Marvel movie increases, its worldwide box office earnings tend to go up too. The slope suggests that for every additional $1 of budget, worldwide box office earnings are expected to increase by approximately $5.01 (equivalently, about $5.01 million for every extra $1 million spent). The relationship is statistically significant, with a very small p-value (approximately 0.000000000001026), meaning a slope this large would be very unlikely if budget and earnings were unrelated. The intercept, by contrast, is not significantly different from zero (p = 0.11). Additionally, the coefficient of determination (R-squared) indicates that approximately 53.93% of the variability in worldwide box office earnings can be explained by budget.
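To make the slope concrete, the fitted equation can be applied by hand. The sketch below uses only the coefficients printed in the regression summary; the $200 million budget is a hypothetical input chosen for illustration, and `predict_gross` is a helper name introduced here rather than part of the original analysis.

```r
# Coefficients copied from the summary(model) output
intercept <- -162633928.5840
slope     <- 5.0127

# Hypothetical helper: predicted worldwide box office (USD) for a given budget (USD)
predict_gross <- function(budget) intercept + slope * budget

# Illustrative budget of $200 million (an assumption, not a specific movie)
predicted <- predict_gross(200e6)
print(format(predicted, big.mark = ","))

# Each additional $1 million of budget adds slope * 1e6 dollars to the prediction
extra_per_million <- predict_gross(201e6) - predict_gross(200e6)
print(extra_per_million)

# With a single predictor, the correlation is the square root of R-squared
# (positive here because the slope is positive)
r <- sqrt(0.5393)
print(round(r, 2))
```

With the fitted `model` object still in the session, `predict(model, newdata = data.frame(Budget = 200e6))` should give essentially the same prediction, up to rounding of the printed coefficients.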

So, while spending more money on Marvel movies usually means higher earnings, there are other factors that also influence how successful a movie is at the box office.

Summary Essay

A. I acquired this dataset from Kaggle. As I mentioned earlier, it focuses on Marvel movies, providing information on their budgets and worldwide box office earnings over time. I’m a big Marvel fan, so this dataset really caught my eye. It includes data such as movie titles, release years, budgets, worldwide box office earnings, and details about the studios behind the movies. There are even more variables, but I might explore those in a future project. The data consists mostly of numerical values, although there are some categorical variables such as studio ownership. The individual who uploaded the dataset on Kaggle mentioned obtaining it from sources like “The Numbers,” Wikipedia, and IMDb ratings. I didn’t need to do much cleaning as the dataset was already well-organized. While the dataset didn’t specify phases for non-MCU movies like X-Men and Deadpool, I decided to include them anyway to make sure we cover all Marvel movies thoroughly. The only cleaning task I performed was sorting the data by release year.

B. Marvel Studios has reshaped the franchise movie landscape in the past decade. As reported by Harvard Business Review in 2019, its portfolio of 22 films had raked in an astounding $17 billion in global box office revenue, surpassing every other movie franchise in history. Notably, these films boast an impressive average approval rating of 84% on Rotten Tomatoes, far exceeding the 68% average for top-grossing franchises. Moreover, each movie secures an average of 64 nominations and awards, underscoring their critical acclaim and widespread recognition. The release of Avengers: Endgame in 2019 is a prime example of this triumph. The sheer magnitude of its success prompted online ticket platforms to overhaul their systems to manage the overwhelming influx of requests (Harvard Business Review).

C. The visuals show us how well Marvel movies are doing financially. The pie chart tells us Marvel Studios owns the largest share of these movies (just under half), with big studios like 20th Century Fox and Sony also in the mix. The bubble chart shows how the money spent on making the movies compares to how much they make worldwide. Generally, when the budget goes up, so do the earnings. The graph with two axes backs this up, showing us which movies did better or worse compared to their budgets over time. And the linear regression supports the same conclusion: higher budgets are associated with higher box office earnings, although budget explains only about half of the variation.

Originally, I had the year on the x-axis of my dual axes plot instead of the movie title. But when I attempted to include the movie title in the tooltip function, it didn’t work. I couldn’t keep it that way because it would have been confusing to distinguish between the columns. So, I adjusted the graph to display the movie titles instead of the years. Afterward, I tried adding the year to the tooltip, but it didn’t work. It seems that the dual axes tooltip can only show the variables used for the graph. Overall, I’m really happy with how all my graphs came out.
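One possible workaround for the tooltip limitation described above is to attach the year to each data point as a custom key, which Highcharts generally keeps on the point and exposes to the formatter. The following is only a sketch under stated assumptions: the `toy` data frame, its values, and the `make_points` helper are hypothetical stand-ins for the `marvel` data, and the chart part runs only if highcharter is installed.

```r
# Hypothetical three-row stand-in for the `marvel` data frame; values are illustrative only
toy <- data.frame(
  Movie  = c("Movie A", "Movie B", "Movie C"),
  Year   = c(2008, 2012, 2019),
  Budget = c(150e6, 220e6, 300e6),
  Gross  = c(500e6, 1500e6, 2500e6)
)

# Hypothetical helper: one list per movie, with `year` carried along as a custom key
make_points <- function(values, years) {
  unname(Map(function(v, yr) list(y = v, year = yr), values, years))
}
budget_points <- make_points(toy$Budget, toy$Year)
gross_points  <- make_points(toy$Gross, toy$Year)

# Build the chart only when highcharter is available
if (requireNamespace("highcharter", quietly = TRUE)) {
  library(highcharter)
  hc_demo <- highchart() %>%
    hc_yAxis_multiples(
      list(title = list(text = "Budget")),
      list(title = list(text = "Worldwide Box Office"), opposite = TRUE)
    ) %>%
    hc_add_series(data = budget_points, name = "Budget", type = "column") %>%
    hc_add_series(data = gross_points, name = "Worldwide Box Office",
                  type = "spline", yAxis = 1) %>%
    hc_xAxis(categories = toy$Movie) %>%
    hc_tooltip(shared = TRUE, formatter = JS("function() {
      // Read the custom `year` key from the first point at this x position
      var first = this.points[0].point;
      var tooltip = '<b>' + first.category + ' (' + first.options.year + ')</b><br/>';
      this.points.forEach(function(p) {
        tooltip += p.series.name + ': ' + p.y + '<br/>';
      });
      return tooltip;
    }"))
}
```

If this works as expected, the movie titles stay on the x-axis while the release year appears in the tooltip header. I have not verified it against the full chart above, so treat it as a starting point rather than a confirmed fix.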

References

Harrison, S., Carlsen, A., & Škerlavaj, M. (2019). Marvel’s Blockbuster Machine: How the studio balances continuity and renewal. Harvard Business Review. Retrieved from https://hbr.org/2019/07/marvels-blockbuster-machine