Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


MCU Marvel movies scatter plot: critic score vs profit as a percentage of budget
Source: Information is Beautiful (McCandless 2023a).


Objective

With its default settings, the graph “Which is The Best Performing Marvel Movie?” on Information Is Beautiful aims to grab attention with a tabloid-like headline, much like many other websites that cover popular culture for a general audience.

Although the website is geared towards graphic designers and data visualisation architects, its visuals are highly interactive to appeal to a general audience. Being interactive, the Marvel visualisation is made to be shared, with different combinations of axes. Some users have shared the interactive site on LinkedIn (Jackson 2023), while the Information is Beautiful founder has shared specific trends on Twitter (McCandless 2023b).

Movies within the Marvel Cinematic Universe (MCU) are popular in the general population, with a large depth and breadth of casual and serious fans. Many sites and YouTube channels present “inside knowledge”, with an “everybody else is wrong about this” tone, to that specific audience. The headline and the default settings of the Marvel visualisation lend themselves to this discussion: here’s the truth about Marvel movies, in a unique metric that nobody else is telling you. Many websites rank Marvel movies by critic score, by audience score, or by the largest difference between audience and critics (Demchak 2022). Other sites use the absolute gross of a film’s box office performance.

Ranking or comparing movies on two distinct variables at once is otherwise non-existent outside of this visualisation tool.

The default measurement along the x-axis of this Marvel visualisation is critic rating. The y-axis is worldwide gross as a percentage of production cost (total gross divided by estimated reported budget).
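As a quick check of that definition, the sketch below recomputes the metric for three films, using gross and budget figures (in millions of dollars) taken from the public data set previewed in the Code tab. The data frame and the `pct_budget_recovered` column name are illustrative assumptions, not names used by the original site.

```r
# Figures (in $ millions) copied from the public data set used in the Code tab
films <- data.frame(
  film   = c("Ant-Man", "Avengers: End Game", "Black Panther"),
  gross  = c(518, 2797, 1336),
  budget = c(130, 400, 200)
)

# % budget recovered = worldwide gross / budget, expressed as a percentage
# (pct_budget_recovered is a hypothetical column name)
films$pct_budget_recovered <- 100 * films$gross / films$budget

print(films)
```

Rounded to whole numbers, these values agree with the `X..budget.recovered` column in the data set (398%, 699% and 668%).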

The author of the text below the visualisation is either not a fan of MCU movies, or someone who used to be a fan but isn’t one any longer. The writer describes an upcoming movie as “looming”, with the refrain “gawd knows” to describe the number of “slugfests” that are in the development “pipe hole” for future release (McCandless 2023a). Using critic score as the X-axis, instead of audience score, could be seen as dismissive of the riff raff. Using the ambiguous “% budget recovered” diminishes the biggest and most popular titles and may increase the standing of less popular titles to similar stature.

Issues with visualisation

There are 3 main issues with the visualisation itself:

  • Deceptive comparative variables

Percent budget recovered, plotted on the y-axis, isn’t very intuitive, especially because it has no direct relationship with the variable on the x-axis (critic score). If percent budget recovered were plotted against the variable it is derived from (production cost), the comparison would be more meaningful. However, it can still be misunderstood, especially with large numbers. For instance, a movie that grosses $1.5 billion would be considered incredibly successful no matter the budget, but measured by percent budget recovered, a $150 million movie would make 10X its budget, a $250 million movie only 6X its budget, and a $400 million movie only 3.75X its budget. At these scales, using budget as the independent variable (x-axis) and profit as the dependent variable (y-axis) is easier to understand. Showing net profit (gross minus budget) instead of gross alone still accounts for budget in a movie’s success.
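The arithmetic in this example can be laid out as a small table. This is a minimal sketch using the three hypothetical budgets from the paragraph above; the variable and column names are illustrative.

```r
gross   <- 1500              # hypothetical worldwide gross, in $ millions
budgets <- c(150, 250, 400)  # three hypothetical budgets, in $ millions

comparison <- data.frame(
  budget     = budgets,
  multiple   = gross / budgets,  # "budget recovered", as a multiple of budget
  net_profit = gross - budgets   # gross minus production cost, in $ millions
)

print(comparison)
```

The multiple falls from 10 to 3.75 across the three budgets, while net profit only falls from $1,350 million to $1,100 million, a drop of less than 20%. That gap is why the ratio can make a very profitable film look like a weak performer.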

  • Deceptive intercepts. There are solid X and Y intercepts that aren’t 0.

There’s a solid axis line at 500% on the y-scale, and a second, dotted line at about 250% called “Breakeven”. Either 100% or the “Breakeven” line would be the more meaningful reference for practical purposes: anything below 100% means the movie grossed less than its budget, and anything above it means the movie grossed more than its budget; “breakeven” is the point at which a movie would generally make more money than its production and marketing costs combined. Meanwhile, the axis line on the x-scale is just as arbitrary at 72.5%, and there’s an arrow pointing beyond 100%, as if 100% weren’t an absolute limit.
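To illustrate why 100% and the breakeven line are the meaningful reference points, the sketch below sorts some made-up values into bands around them. The 250% breakeven threshold is only the approximate value read off the chart, and the sample values and band labels are assumptions for illustration.

```r
# Hypothetical "% budget recovered" values
pct <- c(90, 180, 398, 699)

# Bands: [0, 100) lost money on budget alone; [100, 250) covered the budget
# but not (approximately) marketing; [250, Inf) past the assumed breakeven
band <- cut(
  pct,
  breaks = c(0, 100, 250, Inf),
  labels = c("below budget", "budget recovered", "above breakeven"),
  right  = FALSE
)

print(data.frame(pct, band))
```

A reference line at 500% does not separate any of these bands, which is the sense in which it is arbitrary.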

  • Colour issues. The dots and the graph elements are a poor use of colour.

The background is black, which isn’t preferable for data visualisation design.

For the data points, there are 10 different categories, each with a different colour for non-ordered factors. This is too many for people to reasonably distinguish. The distinction would be poor for people with red-green colour blindness in half of the categories: 2 categories are red (and they’re not that different anyway), and 3 are green. Additionally, the light yellow is almost indistinguishable from the white Other category (called “Unique”).

In this instance, colours don’t correspond with existing colour-schemes for characters: among the three green categories, 2 of them don’t include any green in their costumes, and none of the greens represent Marvel’s most distinguishable green character, the Hulk.

There are existing groupings that would create fewer category levels, for instance the MCU’s “Phases”, which could show comparisons and differences over time in a single graph. Instead of using colours for “categories” or semi-franchises, colours would be better used to represent numeric differences, and facets used to represent a limited number of groupings, such as the Marvel “Phases”.

If there were categories based on mini-franchises within the MCU, this would be a tongue-in-cheek reduction:

  1. Teams led by guys named Chris (4 Avengers and 3 Guardians of the Galaxy)

  2. Blonde guys named Chris “solo” titles (4 Thor and 3 Captain America)

  3. Brits with an American accent (2 Dr Strange and 3 Spider-Man MCU)

  4. Other solo white guy trilogies (3 Iron Man and 3 Ant-Man)

  5. “Diversity” (movies without a white male lead: Captain Marvel, Eternals, Black Widow, 2 Black Panther, Shang-Chi)

  6. Yes, there was a Hulk solo movie in the MCU (1, to be coloured green)

References

Baglin J (16 February 2023) Data Visualisation: From Theory to Practice, RMIT, published online, accessed 26 July 2023. https://dark-star-161610.appspot.com/secured/_book/index.html

Demchak M (2 October 2022) ‘10 MCU Movies And Shows That Most Divided Critics And Audiences, According To Rotten Tomatoes: The divide between fan and critic opinions is forever growing, even between the highly acclaimed Marvel Cinematic Universe’, Screen Rant, accessed 24 July 2023. https://screenrant.com/10-mcu-movies-shows-that-divided-critics-and-audiences-the-most-according-to-rotten-tomatoes

Jackson C (21 July 2023) ‘With #barbenheimer landing this weekend I’ve got movies on the brain…’ [LinkedIn Post], Chaz Jackson, accessed 24 July 2023. https://www.linkedin.com/posts/chaz-jackson_which-is-the-best-performing-marvel-movie-activity-7088173435025010690-W39u/

McCandless D (15 May 2023a) Which is The Best Performing Marvel Movie?, Information is Beautiful website, accessed 21 July 2023. https://informationisbeautiful.net/visualizations/which-is-the-best-performing-marvel-movie/

McCandless D (8 March 2023b) ‘Possible trend: Using % of gross from international audiences …’ [Tweet], David McCandless, accessed 23 July 2023. https://twitter.com/mccandelish/status/1633212896917491712

Code

The following code was used to fix the issues identified in the original.

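# install pacman if it is missing, then make sure it is loaded
if (!require("pacman")) {
  install.packages("pacman")
  library(pacman)
}

# install if don't have; load all
# note: colorblindr may need to be installed from GitHub first, as it may not be available on CRAN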
p_load(ggplot2, # plot
       tidyr, dplyr, # data analysis
       colourpicker, colorspace, colorblindr, # colours
       magrittr, # for pipes
       httr, XML, # for grabbing data from Wikipedia
       stringr, glue, # for formatting text
       viridis) # for color palette
# Load as a data frame
  # data is saved on local computer, downloaded from locked Google Sheet (no API with it).
# Source: McCandless D, Evans T and Barton P 2023.
Marvel <- read.csv("What is The Best Performing Marvel Movie_ PUBLIC DATA - Marvel Movies.csv")

# check information
df <- Marvel
df %>% dim()
## [1] 30 19
df %>% colnames()
##  [1] "film"                           "category"                      
##  [3] "worldwide.gross...m."           "X..budget.recovered"           
##  [5] "critics...score"                "audience...score"              
##  [7] "audience.vs.critics...deviance" "budget"                        
##  [9] "domestic.gross...m."            "international.gross...m."      
## [11] "opening.weekend...m."           "second.weekend...m."           
## [13] "X1st.vs.2nd.weekend.drop.off"   "X..gross.from.opening.weekend" 
## [15] "X..gross.from.domestic"         "X..gross.from.international"   
## [17] "X..budget.opening.weekend"      "year"                          
## [19] "source"
df %>% head()
##                      film      category worldwide.gross...m.
## 1               Ant-Man         Ant-Man                  518
## 2      Ant-Man & The Wasp       Ant-Man                  623
## 3 Avengers: Age of Ultron      Avengers                 1395
## 4      Avengers: End Game      Avengers                 2797
## 5  Avengers: Infinity War      Avengers                 2048
## 6           Black Panther Black Panther                 1336
##   X..budget.recovered critics...score audience...score
## 1                398%             83%              85%
## 2                479%             87%              80%
## 3                382%             76%              82%
## 4                699%             94%              90%
## 5                683%             85%              91%
## 6                668%             96%              79%
##   audience.vs.critics...deviance budget domestic.gross...m.
## 1                            -2%    130                 180
## 2                             7%    130                 216
## 3                            -6%    365                 459
## 4                             4%    400                 858
## 5                            -6%    300                 678
## 6                            17%    200                 700
##   international.gross...m. opening.weekend...m. second.weekend...m.
## 1                      338                 57.0                  24
## 2                      406                 75.8                  29
## 3                      936                191.0                  77
## 4                     1939                357.0                 147
## 5                     1369                257.0                 114
## 6                      636                202.0                 111
##   X1st.vs.2nd.weekend.drop.off X..gross.from.opening.weekend
## 1                         -58%                          31.8
## 2                         -62%                          35.0
## 3                         -60%                          41.7
## 4                         -59%                          41.6
## 5                         -56%                          38.0
## 6                         -45%                          28.9
##   X..gross.from.domestic X..gross.from.international X..budget.opening.weekend
## 1                  34.7%                       65.3%                     43.8%
## 2                  34.7%                       65.2%                     58.3%
## 3                  32.9%                       67.1%                     52.3%
## 4                  30.7%                       69.3%                     89.3%
## 5                  33.1%                       66.8%                     85.7%
## 6                  52.4%                       47.6%                    101.0%
##   year                                                                source
## 1 2015                 https://www.the-numbers.com/movie/Ant-Man#tab=summary
## 2 2018    https://www.the-numbers.com/movie/Ant-Man-and-the-Wasp#tab=summary
## 3 2015  https://www.the-numbers.com/movie/Avengers-Age-of-Ultron#tab=summary
## 4 2019 https://www.the-numbers.com/movie/Avengers-Endgame-(2019)#tab=summary
## 5 2018   https://www.the-numbers.com/movie/Avengers-Infinity-War#tab=summary
## 6 2018           https://www.the-numbers.com/movie/Black-Panther#tab=summary
# rename simplified names
Marvel %<>% rename(Movie = film,
                   TotalGross = worldwide.gross...m.,
                   CriticsScore = critics...score,
                   Budget = budget,
                   Year = year) 

# only keep required columns
Marvel %<>% select(c("Movie", "TotalGross", "CriticsScore", "Budget", "Year"))

# create calculated columns to show in graph
  # Profit, which is gross - budget

Marvel %<>% mutate(Profit = TotalGross - Budget) 

# check that it looks OK
Marvel %>% head()
##                     Movie TotalGross CriticsScore Budget Year Profit
## 1               Ant-Man          518          83%    130 2015    388
## 2      Ant-Man & The Wasp        623          87%    130 2018    493
## 3 Avengers: Age of Ultron       1395          76%    365 2015   1030
## 4      Avengers: End Game       2797          94%    400 2019   2397
## 5  Avengers: Infinity War       2048          85%    300 2018   1748
## 6           Black Panther       1336          96%    200 2018   1136
# Critic Score is a character, with %
# Make it a number
Marvel$CriticsScore %<>% gsub("%", "", .) %>% as.numeric()
# get Marvel phases.
# Source: Wikipedia 2023
url <- "https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films" %>% GET()

# load data -- it's the second and third table within the Wikipedia page
  # 2 different periods of films
# only keep the first 2 columns (name and date) of the data frame
  # Source for how to: shnee 2017
Mphases1 <- readHTMLTable(doc = content(url, "text"))[2] %>% data.frame() %>% select(1:2)
Mphases2 <- readHTMLTable(doc = content(url, "text"))[3] %>% data.frame() %>% select(1:2)

# make the first two columns the same name
names(Mphases1) <- c("Movie", "Date")
names(Mphases2) <- c("Movie", "Date")

# combine the two groups
Mphases <- rbind(Mphases1, Mphases2)
# Add a phase column
Mphases %<>% mutate(Phase = ifelse(grepl("Phase", Movie), Movie, NA))

# fill down
Mphases %<>%  fill(Phase, .direction = "down")

# Remove the parenthetical text [] after "Phase" number
Mphases$Phase <- str_split(Mphases$Phase, "\\[", simplify = TRUE)[,1]
# Combine phase with original data set
Marvel <- left_join(Marvel, Mphases, by = "Movie")

# There are some phases that don't merge due to differences in the name of the movies
  # Wikipedia is full name
  # Data set is simplified
  # Order by year
  # Fill down missing values for Phase, then get the final one at the top
Marvel <- Marvel[order(Marvel$Year), ] %>%
    fill(Phase, .direction = "down") %>%
    fill(Phase, .direction = "up")

# Stack phases
Marvel$Phase %<>% factor(levels = unique(.), ordered = TRUE)
# filter out Avengers -- they are outliers in budget and profit -- 250 - 400 million production cost, and profit of about 2.4 billion for Endgame
# filter out Black Widow -- released during the pandemic, to video-on-demand, so wasn't given the chance for the same box office
Marvel %<>% filter(!grepl("Avengers", Movie) &
                     Movie != "Black Widow") 
# Get phases as a date range
  # So facet, and show change over time, to a practical audience
DatePhases <- Marvel %>% group_by(Phase) %>% 
                          summarise(Earliest = min(Year, na.rm = TRUE),
                                    Latest = max(Year, na.rm = TRUE),
                                    YearRange = glue("{Earliest} - {Latest}"))

Marvel <- left_join(Marvel, DatePhases, by = "Phase")
# rename Black Panther 2:
Marvel$Movie %<>% gsub("Black Panther 2", "Black Panther 2: Wakanda Forever", .)
# choose what to label
  # Some of the semi-franchises that aren't in the same cluster
  # Some of the other values that aren't in the same cluster

# Spider-Man No Way Home is an outlier for profitability. Put its label below the dot
bottomlabels <- c("Spider-Man: No Way Home") %>% paste(collapse = "|")
toplabels <- c("Iron Man", "Eternals", "Captain America", "Black Panther", "Spider-Man", "Shang") %>% paste(collapse = "|")
#Ant-Man 1 & 2 had low budget and low profit. Move it away from the bottom right corner of the graph
upandright <- c("Ant-Man")

# UpandLeft: Spider-Man and Captain America sequels are long names, separate by a colon. Adjust these to fit on 2 lines. Captain America: Civil War is high budget, so it's off to the right of the graph, so move it back onto the graph

# Thor movies -- will have to adjust their labels manually and slightly to not overlap nearby labels

Marvel %<>% mutate(TopLabel = ifelse(grepl(toplabels, Movie) &    # includes top label
                                        !grepl(":", Movie) & !grepl(bottomlabels, Movie), # but not bottom label or sub-title
                                        Movie, NA),
                  BottomLabel = ifelse(grepl(bottomlabels, Movie), gsub(":", ":\n", Movie), NA),
                  UpandLeft = ifelse(grepl(toplabels, Movie) & grepl(":", Movie) & !grepl(bottomlabels, Movie), gsub(":", ":\n", Movie), NA),
                  UpandRight = ifelse(grepl(upandright, Movie), gsub("& ", "&\n", Movie), NA),
                  # manually move Thor movies -- they're different from other values but close enough to overlap
                  Thor =  ifelse(Movie  %in% "Thor", "Thor", NA),
                  Thor2 = ifelse(Movie  %in% "Thor: Dark World", gsub(":", ":\n", Movie), NA),
                  Thor3 = ifelse(Movie  %in% "Thor: Ragnarok", gsub(":", ":\n", Movie), NA),
                  Thor4 = ifelse(Movie  %in% "Thor: Love & Thunder", Movie, NA))
#plot the graph
p1 <- ggplot(data = Marvel,
             aes(x = Budget,
                 y = Profit,
                 colour = CriticsScore)) +
  
      # graph as a scatter plot, in facet stacked, colored by Critic Score
      geom_point(size = 3) +
      facet_grid(YearRange ~ .) +
      scale_color_viridis(name = "Critic Score \non \nRotten Tomatoes\n(%)", 
                          option = "F") +   
  
      # Main graph labels
      labs(title = "The MCU: Profit vs Budget, across Phases (not including Avengers)",
           subtitle = "Increased production costs, profitability and critical favourability to 2016-2019.\nOnly greater costs since",
           x = "Production Cost (in millions of $)",
           y = "Net Profit after Production Cost (in millions of $)") +
      theme_minimal() +
      theme(legend.justification = "top",
            legend.title=element_text(size=8), # change legend
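            # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0; use size on older versions)
            strip.background = element_rect(fill = "lightgrey", linewidth = 4, color = "lightgrey")) +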
  
      # Individual movie labels
      # top label -- vjust is negative
      geom_text(aes(label = TopLabel),
                size = 3, colour = "black",
                vjust = -.6) +
      # bottom label -- vjust is positive
      geom_text(aes(label = BottomLabel),
                size = 3, colour = "black",
                vjust = .8) +
      # special label -- to the left and up
      geom_text(aes(label = UpandLeft),
                size = 3, colour = "black",
                vjust = .1,
                hjust = .8) +
      # up and right
      geom_text(aes(label = UpandRight),
                size = 3, colour = "black",
                hjust = -.1) +
      # manually label Thor movies to not overlap with others
      # Thor 1 -- to the right and slightly down, to not overlap Captain America
      geom_text(aes(label = Thor),
                size = 3, colour = "black",
                hjust = -.4,
                vjust = .2) +
      # Thor 2 -- a little to the right of Captain America: Winter Soldier
      geom_text(aes(label = Thor2),
                size = 3, colour = "black",
                vjust = -.2,
                hjust = .9) +
      # Thor 3 -- below
      geom_text(aes(label = Thor3),
                size = 3, colour = "black",
                vjust = .3,
                hjust = 0) +
      # Thor 4 -- below
      geom_text(aes(label = Thor4),
                size = 3, colour = "black",
                vjust = 1,
                hjust = .8)
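
The code above notes that some movies fail to join because the data set uses simplified titles while Wikipedia uses full titles, and it then fills the missing phases by year order. A minimal sketch of how those mismatches could be listed first, so the filled values can be checked by eye, is shown below. It uses base R and two small stand-in tables; the object names are hypothetical and the rows are only a sample.

```r
# Toy stand-ins for the data set and the Wikipedia table (a sample of titles)
data_titles <- data.frame(
  Movie = c("Ant-Man", "Avengers: End Game", "Thor: Dark World")
)
wiki_titles <- data.frame(
  Movie = c("Ant-Man", "Avengers: Endgame", "Thor: The Dark World"),
  Phase = c("Phase Two", "Phase Three", "Phase Two")
)

# Titles in the data set with no exact match in the Wikipedia table
unmatched <- setdiff(data_titles$Movie, wiki_titles$Movie)
print(unmatched)

# Base R equivalent of a left join: unmatched rows get NA for Phase
joined <- merge(data_titles, wiki_titles, by = "Movie", all.x = TRUE)
print(joined)
```

Listing the unmatched titles makes it easier to confirm that filling by release year assigns each of them to the intended phase.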

Data Reference

McCandless D, Evans T and Barton P (17 February 2023) What is The Best Performing Marvel Movie? PUBLIC DATA [Google Sheet], Information is Beautiful website, accessed 15 July 2023. https://docs.google.com/spreadsheets/d/1YSJ4ypkYLq6j1mIBJCgUHhHjJZQ0Rkfe1qW2WC5HLiw/edit?pli=1#gid=748627588

shnee (1 February 2017) ‘Importing wikipedia tables in R’, Stackoverflow website, accessed 27 July 2023. https://stackoverflow.com/questions/7407735/importing-wikipedia-tables-in-r

Wikipedia (circa 2023) List of Marvel Cinematic Universe films, Wikipedia website, accessed 27 July 2023. https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films

Reconstruction

The following plot fixes the main issues in the original.