In this mini analysis we work with the data used in the FiveThirtyEight story titled “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”. Your task is to fill in the blanks denoted by Walt Hickey.

Data and packages

We start with loading the packages we’ll use.

library(fivethirtyeight)
library(tidyverse)

The dataset contains information on 1794 movies released between 1970 and 2013. However, we’ll focus our analysis on movies released between 1990 and 2013.

bechdel90_13 <- bechdel %>% 
  filter(between(year, 1990, 2013))
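
As a quick check on the filter above: between() is inclusive at both ends, so movies from 1990 and from 2013 are both kept. The sketch below uses a small made-up vector of years (not rows from the dataset) to illustrate this.

```r
library(dplyr)

# Toy data for illustration only; these years are not rows from bechdel
toy <- tibble(year = c(1989L, 1990L, 2001L, 2013L, 2014L))

# between(x, left, right) is shorthand for x >= left & x <= right
kept <- toy %>%
  filter(between(year, 1990, 2013))

kept$year
```

Only 1989 and 2014 are dropped here; the boundary years stay in.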

print(bechdel90_13)
## # A tibble: 1,615 × 15
##     year imdb    title test  clean…¹ binary budget domgr…² intgr…³ code  budge…⁴
##    <int> <chr>   <chr> <chr> <ord>   <chr>   <int>   <dbl>   <dbl> <chr>   <int>
##  1  2013 tt1711… 21 &… nota… notalk  FAIL   1.3 e7  2.57e7  4.22e7 2013…  1.3 e7
##  2  2012 tt1343… Dred… ok-d… ok      PASS   4.5 e7  1.34e7  4.09e7 2012…  4.57e7
##  3  2013 tt2024… 12 Y… nota… notalk  FAIL   2   e7  5.31e7  1.59e8 2013…  2   e7
##  4  2013 tt1272… 2 Gu… nota… notalk  FAIL   6.1 e7  7.56e7  1.32e8 2013…  6.1 e7
##  5  2013 tt0453… 42    men   men     FAIL   4   e7  9.50e7  9.50e7 2013…  4   e7
##  6  2013 tt1335… 47 R… men   men     FAIL   2.25e8  3.84e7  1.46e8 2013…  2.25e8
##  7  2013 tt1606… A Go… nota… notalk  FAIL   9.2 e7  6.73e7  3.04e8 2013…  9.2 e7
##  8  2013 tt2194… Abou… ok-d… ok      PASS   1.2 e7  1.53e7  8.73e7 2013…  1.2 e7
##  9  2013 tt1814… Admi… ok    ok      PASS   1.3 e7  1.80e7  1.80e7 2013…  1.3 e7
## 10  2013 tt1815… Afte… nota… notalk  FAIL   1.3 e8  6.05e7  2.44e8 2013…  1.3 e8
## # … with 1,605 more rows, 4 more variables: domgross_2013 <dbl>,
## #   intgross_2013 <dbl>, period_code <int>, decade_code <int>, and abbreviated
## #   variable names ¹​clean_test, ²​domgross, ³​intgross, ⁴​budget_2013
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

We can also pull out just the movies released in 2005.

bechdel2005 <- bechdel %>% 
  filter(year == 2005)

print(bechdel2005)
## # A tibble: 100 × 15
##     year imdb    title test  clean…¹ binary budget domgr…² intgr…³ code  budge…⁴
##    <int> <chr>   <chr> <chr> <ord>   <chr>   <int>   <dbl>   <dbl> <chr>   <int>
##  1  2005 tt0402… AEon… ok    ok      PASS   5.5 e7  2.59e7  4.80e7 2005…  6.56e7
##  2  2005 tt0398… Assa… ok    ok      PASS   3   e7  2.00e7  3.60e7 2005…  3.58e7
##  3  2005 tt0372… Batm… nota… notalk  FAIL   1.5 e8  2.05e8  3.73e8 2005…  1.79e8
##  4  2005 tt0388… Beau… ok    ok      PASS   2.5 e7  3.64e7  3.84e7 2005…  2.98e7
##  5  2005 tt0374… Bewi… ok    ok      PASS   8   e7  6.33e7  1.31e8 2005…  9.54e7
##  6  2005 tt0383… Bloo… dubi… dubious FAIL   2.5 e7  2.41e6  3.61e6 2005…  2.98e7
##  7  2005 tt0357… Boog… men   men     FAIL   2   e7  4.68e7  6.72e7 2005…  2.39e7
##  8  2005 tt0439… Boyn… ok    ok      PASS   2.9 e6  3.13e6  3.13e6 2005…  3.46e6
##  9  2005 tt0393… Brick nota… notalk  FAIL   4.5 e5  2.08e6  4.09e6 2005…  5.37e5
## 10  2005 tt0388… Brok… nota… notalk  FAIL   1.39e7  8.30e7  1.74e8 2005…  1.66e7
## # … with 90 more rows, 4 more variables: domgross_2013 <dbl>,
## #   intgross_2013 <dbl>, period_code <int>, decade_code <int>, and abbreviated
## #   variable names ¹​clean_test, ²​domgross, ³​intgross, ⁴​budget_2013
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Summary of Abraham Lincoln: Vampire Hunter (2012)

This movie failed the Bechdel test. It was labeled as dubious, meaning it met some of the test’s criteria but was still biased toward men. The budget was $67,500,000 and the domestic gross was $37,519,139.

Summary:

The Bechdel test was created by the cartoonist Alison Bechdel in the 1980s. In the comic strip where it appears, a character says she will not watch a movie unless it passes ‘The Test’. To pass, a movie must meet three criteria: it has at least two named women characters, the women have a conversation with each other, and that conversation is about something other than a man.
In this data set we analyze movies released between 1990 and 2013 that pass or fail the Bechdel test, comparing their budgets and how much money they made. Movies that pass the Bechdel test have a lower median budget than movies that fail. When we compare the profit, movies that pass the Bechdel test (and so include at least two women) make more money than movies that include fewer than two women. Internationally, the profit is about the same.

Median Budgets

Dollars Earned for Dollars Spent

There are 1,615 movies in the dataset released between 1990 and 2013.

The financial variables we’ll focus on are the following:

  • budget_2013: Budget in 2013 inflation adjusted dollars
  • domgross_2013: Domestic gross (US) in 2013 inflation adjusted dollars
  • intgross_2013: Total International (i.e., worldwide) gross in 2013 inflation adjusted dollars

And we’ll also use the binary and clean_test variables for grouping.

Analysis

Let’s take a look at how median budget and gross vary by whether the movie passed the Bechdel test, which is stored in the binary variable.

bechdel90_13 %>%
  group_by(binary) %>%
  summarise(
    med_budget = median(budget_2013),
    med_domgross = median(domgross_2013, na.rm = TRUE),
    med_intgross = median(intgross_2013, na.rm = TRUE)
    )
## # A tibble: 2 × 4
##   binary med_budget med_domgross med_intgross
##   <chr>       <dbl>        <dbl>        <dbl>
## 1 FAIL    48385984.    57318606.    104475669
## 2 PASS    31070724     45330446.     80124349
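
To put the table above in perspective, the gap between the two groups can be computed directly from the printed medians. The figures below are copied from the output as displayed, so treat the results as approximate.

```r
# Median budgets copied from the summary table above (as printed)
med_budget_fail <- 48385984
med_budget_pass <- 31070724

# Absolute gap in 2013 dollars
budget_gap <- med_budget_fail - med_budget_pass
budget_gap

# Median budget of passing movies as a share of the failing movies' median
budget_share <- round(med_budget_pass / med_budget_fail, 2)
budget_share
```

On these figures the gap is about $17.3 million, and the median budget of movies that pass is roughly 64% of the median budget of movies that fail.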

Next, let’s take a look at how median budget and gross vary by a more detailed indicator of the Bechdel test result. This information is stored in the clean_test variable, which takes on the following values:

  • ok = passes test
  • dubious = passes test, but still biased
  • men = women only talk about men
  • notalk = women don’t talk to each other
  • nowomen = fewer than two women

bechdel90_13 %>%
  # group_by(clean_test) %>%  # grouping is commented out, so the result below is the overall median across all movies
  summarise(
    med_budget = median(budget_2013),
    med_domgross = median(domgross_2013, na.rm = TRUE),
    med_intgross = median(intgross_2013, na.rm = TRUE)
    )
## # A tibble: 1 × 3
##   med_budget med_domgross med_intgross
##        <int>        <dbl>        <dbl>
## 1   37878971     52270207     93523336
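
Because the grouping step in the chunk above is commented out, its output is a single row of overall medians. A minimal sketch of the per-category version is shown below on a small made-up table. The values are invented for illustration and are not from bechdel90_13, but the same group_by(clean_test) call should apply to the real data.

```r
library(dplyr)

# Toy data for illustration only; values are invented
toy <- tibble(
  clean_test  = c("ok", "ok", "ok", "men", "men", "notalk"),
  budget_2013 = c(10, 20, 60, 30, 50, 40)
)

# One row per detailed test result, each with its own median
by_test <- toy %>%
  group_by(clean_test) %>%
  summarise(med_budget = median(budget_2013))

by_test
```

The result has one row per value of clean_test rather than a single overall row.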

In order to evaluate how return on investment varies among movies that pass and fail the Bechdel test, we’ll first create a new variable called roi: the sum of international and domestic gross divided by the budget.

bechdel90_13 <- bechdel90_13 %>%
  mutate(roi = (intgross_2013 + domgross_2013) / budget_2013)
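
As a sketch of what this mutate() computes, the toy example below applies the same formula to two invented movies. One thing to be aware of: since intgross_2013 is described above as worldwide gross, adding domgross_2013 to it counts domestic revenue twice. The second column, roi_worldwide, is a hypothetical alternative that is not part of this analysis and is included only for comparison.

```r
library(dplyr)

# Toy data for illustration only; values are invented
toy <- tibble(
  title         = c("Movie A", "Movie B"),
  budget_2013   = c(10, 50),
  domgross_2013 = c(20, 25),
  intgross_2013 = c(30, 100)
)

toy_roi <- toy %>%
  mutate(
    # Formula used in this analysis
    roi = (intgross_2013 + domgross_2013) / budget_2013,
    # Hypothetical alternative: worldwide gross only
    roi_worldwide = intgross_2013 / budget_2013
  )

toy_roi
```

The two definitions rank movies similarly in many cases, but the first gives systematically larger values, which matters when reading the roi figures that follow.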

Let’s see which movies have the highest return on investment.

bechdel90_13 %>%
  arrange(desc(roi)) %>% 
  select(title, roi, year)
## # A tibble: 1,615 × 3
##    title                     roi  year
##    <chr>                   <dbl> <int>
##  1 Paranormal Activity      671.  2007
##  2 The Blair Witch Project  648.  1999
##  3 El Mariachi              583.  1992
##  4 Clerks.                  258.  1994
##  5 In the Company of Men    231.  1997
##  6 Napoleon Dynamite        227.  2004
##  7 Once                     190.  2006
##  8 The Devil Inside         155.  2012
##  9 Primer                   142.  2004
## 10 Fireproof                134.  2008
## # … with 1,605 more rows
## # ℹ Use `print(n = ...)` to see more rows
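
If only the top few rows are needed, slice_max() from dplyr is a compact alternative to arrange(desc()). A minimal sketch on invented data (the titles and roi values are placeholders, not from the dataset):

```r
library(dplyr)

# Toy data for illustration only; values are invented
toy <- tibble(
  title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  roi   = c(2.5, 12, 0.8, 5)
)

# Keep the two rows with the largest roi, ordered from highest to lowest
top2 <- toy %>%
  slice_max(roi, n = 2) %>%
  select(title, roi)

top2
```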

Below is a visualization of the return on investment by test result. However, it’s difficult to see the distributions due to a few extreme observations.

ggplot(data = bechdel90_13, 
       mapping = aes(x = clean_test, y = roi, color = binary)) +
  geom_boxplot() +
  labs(
    title = "Return on investment vs. Bechdel test result",
    x = "Detailed Bechdel result",
    y = "Return on investment",
    color = "Binary Bechdel result"
    )

What are those movies with very high returns on investment?

bechdel90_13 %>%
  filter(roi > 400) %>%
  select(title, budget_2013, domgross_2013, year)
## # A tibble: 3 × 4
##   title                   budget_2013 domgross_2013  year
##   <chr>                         <int>         <dbl> <int>
## 1 Paranormal Activity          505595     121251476  2007
## 2 The Blair Witch Project      839077     196538593  1999
## 3 El Mariachi                   11622       3388636  1992

Zooming in on the movies with roi of 15 or less provides a better view of how the medians across the categories compare:

ggplot(data = bechdel90_13, mapping = aes(x = clean_test, y = roi, color = binary)) +
  geom_boxplot() +
  labs(
    title = "Return on investment vs. Bechdel test result",
    subtitle = "Zoomed in to ROI of 15 or less",
    x = "Detailed Bechdel result",
    y = "Return on investment",
    color = "Binary Bechdel result"
    ) +
  coord_cartesian(ylim = c(0, 15))
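
A note on the design choice: coord_cartesian() zooms the view without dropping data, so the boxes are still computed from all movies. Setting limits with ylim() instead removes out-of-range values before the statistics are calculated, which can shift the medians. The sketch below uses invented values to show the difference; layer_data() is used only to read back the median each plot would draw.

```r
library(ggplot2)

# Toy data for illustration only; one extreme value mimics a very high roi
toy <- data.frame(group = "a", roi = c(1, 2, 3, 4, 100))

# Zoom only: all five values still feed the boxplot statistics
p_zoom <- ggplot(toy, aes(x = group, y = roi)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 15))

# Scale limits: the value 100 is removed (with a warning) before the stats
p_drop <- ggplot(toy, aes(x = group, y = roi)) +
  geom_boxplot() +
  ylim(0, 15)

# Median line of each box
med_zoom <- layer_data(p_zoom)$middle
med_drop <- suppressWarnings(layer_data(p_drop)$middle)
```

On this toy data the zoomed plot keeps the median at 3, while the ylim() version reports 2.5 because the extreme value is excluded. This is why coord_cartesian() is the safer choice when the goal is only to see the boxes more clearly.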