The Bechdel Test

This image was sourced from btch>flks.com

Introduction

The two data sets I used were both web scraped from various sources, including IMDB, FiveThirtyEight, BechdelTest.com, and The-Numbers.com. As far as I can tell from the information available with these data sets, those sites get most of their data from crowdsourcing. The data sets, which I merged, cover a variety of variables pertaining to movies from 1970 to 2013, such as how much money each movie made, what genres the movie falls into, and, importantly, whether or not the movie passed the Bechdel Test.

The variables I will be using in my project that need some explaining are “bechdelRating”, “domgross_2013”, “budget_2013”, and “intgross_2013”. bechdelRating refers to a system of rating movies based on their representation of women. A rating of 1 means there were at least two named women, a rating of 2 means there were at least two named women who had a conversation, and a rating of 3 means there were at least two named women who had a conversation about something other than a man. A rating of 0 means the movie meets none of the conditions. “domgross_2013” is the domestic gross in dollars (translated to 2013 value to account for inflation), “budget_2013” is the movie’s budget, also adjusted for inflation, and “intgross_2013” is the international gross, again adjusted for inflation.

My interest lies in what effect passing the Bechdel Test has on the amount of money a movie makes and on the IMDB rating users give, and whether certain genres tend to pass the Bechdel Test more than others by an interesting margin.
I chose a data set on this topic frankly because I am a woman. Particularly after the conversations surrounding the Barbie movie, with men calling it boring or uninteresting and the comedian at the Golden Globes boiling it down to a movie about a “Plastic doll with big boobs,” I thought a lot about how much of pop culture and art is created for men, by men. Now, I’m not a particular fan of the Barbie movie, and I don’t think I was the target age group, but nonetheless it pained me to hear groups of adult men complain that one piece of mega popular media wasn’t totally created with them in mind. I’m not sure I have ever seen a movie where there weren’t two named men who talked to each other about something other than a woman, perhaps besides the Barbie movie. I find it to be a curiously frustrating topic.

To clean my sets, I first arranged one set in ascending order by year to match the other and used select to drop the columns I wasn’t interested in. I then filtered the other set to the same years, merged the two data sets, set the NAs in the genre columns to appear as ‘None’, and mutated a new TRUE/FALSE column for each genre using ifelse statements, so that movies with several genres are counted under each of them.

Loading Libraries

Here I am loading my necessary libraries and reading in my data sets.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(nnet)
library(ggplot2)
library(dplyr)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

setwd("~/Documents/Data 110")
unclean <- read_csv("movies.csv")

Rows: 1794 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): imdb, title, test, clean_test, binary, code
dbl (10): index, year, budget, domgross, intgross, budget_2013$, domgross_20...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(unclean)

# A tibble: 6 × 16
  index  year imdb  title test  clean_test binary budget domgross intgross code 
  <dbl> <dbl> <chr> <chr> <chr> <chr>      <chr>   <dbl>    <dbl>    <dbl> <chr>
1     0  2013 tt17… 21 &… nota… notalk     FAIL   1.3 e7 25682380   4.22e7 2013…
2     1  2012 tt13… Dred… ok-d… ok         PASS   4.50e7 13414714   4.09e7 2012…
3     2  2013 tt20… 12 Y… nota… notalk     FAIL   2   e7 53107035   1.59e8 2013…
4     3  2013 tt12… 2 Gu… nota… notalk     FAIL   6.1 e7 75612460   1.32e8 2013…
5     4  2013 tt04… 42    men   men        FAIL   4   e7 95020213   9.50e7 2013…
6     5  2013 tt13… 47 R… men   men        FAIL   2.25e8 38362475   1.46e8 2013…
# ℹ 5 more variables: `budget_2013$` <dbl>, `domgross_2013$` <dbl>,
#   `intgross_2013$` <dbl>, `period code` <dbl>, `decade code` <dbl>

clean <- read_csv("Bechdel_IMDB_Merge0524.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 9718 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): title, genre1, genre2, genre3
dbl (7): year, imdbid, id, bechdelRating, imdbAverageRating, numVotes, runti...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(clean)

# A tibble: 6 × 11
  title               year imdbid    id bechdelRating imdbAverageRating numVotes
  <chr>              <dbl>  <dbl> <dbl>         <dbl>             <dbl>    <dbl>
1 Miss Jerry          1894      9  9779             0               5.4      212
2 Story of the Kell…  1906    574  1349             1               6        903
3 Cleopatra           1912   2101  2003             2               5.1      622
4 A Florida Enchant…  1914   3973  4457             2               5.8      300
5 Birth of a Nation…  1915   4972  1258             2               6.1    26403
6 Gretchen the Gree…  1916   6745  2008             3               6.4      537
# ℹ 4 more variables: runtimeMinutes <dbl>, genre1 <chr>, genre2 <chr>,
#   genre3 <chr>

Re-arranging My Data Sets

Here I am arranging one of my data sets in ascending order by year, to match the other, and then using select to remove columns I don’t need for my analysis.

unclean_sorted <- arrange(unclean, year)

unclean_fin <- unclean_sorted |>
  select(-imdb,
         -index,
         -test,
         -clean_test,
         -binary,
         -`period code`,
         -`decade code`,
         -budget,
         -domgross,
         -intgross,
         -code)

head(unclean_fin)

# A tibble: 6 × 5
   year title                   `budget_2013$` `domgross_2013$` `intgross_2013$`
  <dbl> <chr>                            <dbl>            <dbl>            <dbl>
1  1970 Beyond the Valley of t…        5997631         53978683         53978683
2  1971 Escape from the Planet…       14386286         70780525         70780525
3  1971 Shaft                        305063707        404702718        616827003
4  1971 Straw Dogs                   143862856         59412143         64760273
5  1971 The French Connection         12659931        236848653        236848653
6  1971 Willy Wonka &amp; the …       17263543         23018057         23018057

Cleaning Second Data Set

Here I am simply filtering the year to match the years included in the first data set, only 1970 and beyond.

clean_sorted <- clean |>
  filter(year >= 1970)

Merging Data Sets

Here I am merging my two data sets by title and year, and then filtering out the NA’s that showed up in my newly joined columns.

merged_set <- left_join(clean_sorted, unclean_fin, by = c("title", "year"))

merged_set <- merged_set |>
  filter(!is.na(`budget_2013$`), !is.na(`domgross_2013$`), !is.na(`intgross_2013$`))
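
A left join followed by dropping the rows with missing budget columns keeps only the movies found in both sets. As long as the budget columns have no missing values of their own, that is the same result `inner_join()` gives directly, and `anti_join()` lists the movies that found no match. Here is a minimal sketch on made-up data (the tibbles `ratings` and `money` and all their values are hypothetical, not taken from my data sets):

```r
library(dplyr)

# Hypothetical toy data standing in for the two movie sets
ratings <- tibble(title = c("A", "B", "C"),
                  year = c(1990, 1995, 2000),
                  bechdelRating = c(3, 1, 2))
money <- tibble(title = c("A", "C", "D"),
                year = c(1990, 2000, 2005),
                budget = c(10, 20, 30))

# Left join, then drop rows with no budget (the approach used above)
left_then_filter <- left_join(ratings, money, by = c("title", "year")) |>
  filter(!is.na(budget))

# Inner join keeps only rows present in both sets
inner <- inner_join(ratings, money, by = c("title", "year"))

# Anti join shows which rated movies found no match in the money set
unmatched <- anti_join(ratings, money, by = c("title", "year"))
```

Looking at `unmatched` is a quick way to spot titles that failed to join only because of small spelling differences between the two sources.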

Mutating New Column

Here I am mutating a new column called bechdel_test in the data set I just created, using an ifelse statement to translate the ratings to binary, i.e. if 3, 1, if anything else, 0.

merged_set <- merged_set |>
  mutate(
    # bechdelRating is numeric, so compare against the number 3
    bechdel_test = ifelse(bechdelRating == 3, 1, 0))
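
To make the pass/fail rule concrete, here is a small sketch on a made-up vector of ratings (the values are hypothetical). Only a rating of 3 counts as a pass, and because the new column is 0/1, its mean is the share of movies that pass:

```r
# Hypothetical ratings, one per movie
bechdelRating <- c(0, 1, 2, 3, 3)

# A movie passes only with the full rating of 3
bechdel_test <- ifelse(bechdelRating == 3, 1, 0)

# The mean of a 0/1 column is the proportion of 1s
pass_rate <- mean(bechdel_test)
```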

Preparing to Replace NA’s

Here I am converting the genre columns to character so that I may change how the value of NA appears, as I cannot simply delete it from the data set.

merged_set$genre1 <- as.character(merged_set$genre1)
merged_set$genre2 <- as.character(merged_set$genre2)
merged_set$genre3 <- as.character(merged_set$genre3)

head(merged_set)

# A tibble: 6 × 15
  title               year imdbid    id bechdelRating imdbAverageRating numVotes
  <chr>              <dbl>  <dbl> <dbl>         <dbl>             <dbl>    <dbl>
1 Beyond the Valley…  1970  65466   583             3               6.1    12073
2 Escape from the P…  1971  67065  2893             1               6.3    40300
3 Shaft               1971  67741  2751             1               6.6    21091
4 Straw Dogs          1971  67800  2620             1               7.4    64649
5 Willy Wonka &amp;…  1971  67992   304             2               7.8   228105
6 Pink Flamingos      1972  69089   708             3               6      27999
# ℹ 8 more variables: runtimeMinutes <dbl>, genre1 <chr>, genre2 <chr>,
#   genre3 <chr>, `budget_2013$` <dbl>, `domgross_2013$` <dbl>,
#   `intgross_2013$` <dbl>, bechdel_test <dbl>

Replacing NA’s

Here I am replacing the NA entries in the genre columns with ‘None,’ so that they will be easier to work with later on.

merged_set$genre1[is.na(merged_set$genre1)] = "None"

merged_set$genre2[is.na(merged_set$genre2)] = "None"

merged_set$genre3[is.na(merged_set$genre3)] = "None"

head(merged_set)

# A tibble: 6 × 15
  title               year imdbid    id bechdelRating imdbAverageRating numVotes
  <chr>              <dbl>  <dbl> <dbl>         <dbl>             <dbl>    <dbl>
1 Beyond the Valley…  1970  65466   583             3               6.1    12073
2 Escape from the P…  1971  67065  2893             1               6.3    40300
3 Shaft               1971  67741  2751             1               6.6    21091
4 Straw Dogs          1971  67800  2620             1               7.4    64649
5 Willy Wonka &amp;…  1971  67992   304             2               7.8   228105
6 Pink Flamingos      1972  69089   708             3               6      27999
# ℹ 8 more variables: runtimeMinutes <dbl>, genre1 <chr>, genre2 <chr>,
#   genre3 <chr>, `budget_2013$` <dbl>, `domgross_2013$` <dbl>,
#   `intgross_2013$` <dbl>, bechdel_test <dbl>
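
The same replacement can be written in one step for all three columns. This is only a sketch of an alternative, assuming the tidyr function `replace_na()` and dplyr's `across()`; the tibble `toy` and its values are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows with missing second and third genres
toy <- tibble(
  title  = c("A", "B"),
  genre1 = c("Drama", "Action"),
  genre2 = c(NA, "Comedy"),
  genre3 = c(NA_character_, NA_character_)
)

# Replace NA with "None" in every column whose name starts with "genre"
toy_filled <- toy |>
  mutate(across(starts_with("genre"), ~ replace_na(.x, "None")))
```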

Converting back

Here I am changing the class of the genre columns to ‘factor’, as I no longer need the entries to be classified as ‘character.’

merged_set$genre1 <- as.factor(merged_set$genre1)
merged_set$genre2 <- as.factor(merged_set$genre2)
merged_set$genre3 <- as.factor(merged_set$genre3)

Mutating a New Column for Each Genre

Here I am mutating new columns for each genre by using ifelse statements that result in a TRUE/FALSE output, depending on whether the movie is categorized in that genre. I wanted to do this because many movies have multiple genres and I wanted to count each movie separately per genre.

merged_set <- merged_set |>
  mutate(
    Fantasy = ifelse(genre1 == 'Fantasy' | genre2 == 'Fantasy' | genre3 == 'Fantasy', TRUE, FALSE),
    Action = ifelse(genre1 == 'Action' | genre2 == 'Action' | genre3 == 'Action', TRUE, FALSE), 
    Adventure = ifelse(genre1 == 'Adventure' | genre2 == 'Adventure' | genre3 == 'Adventure', TRUE, FALSE),
    Animation = ifelse(genre1 == 'Animation' | genre2 == 'Animation' | genre3 == 'Animation', TRUE, FALSE), 
    Biography = ifelse(genre1 == 'Biography' | genre2 == 'Biography' | genre3 == 'Biography', TRUE, FALSE), 
    Comedy = ifelse(genre1 == 'Comedy' | genre2 == 'Comedy' | genre3 == 'Comedy', TRUE, FALSE),
    Crime = ifelse(genre1 == 'Crime' | genre2 == 'Crime' | genre3 == 'Crime', TRUE, FALSE),
    Drama = ifelse(genre1 == 'Drama' | genre2 == 'Drama' | genre3 == 'Drama', TRUE, FALSE),
    Family = ifelse(genre1 == 'Family' | genre2 == 'Family' | genre3 == 'Family', TRUE, FALSE),
    History = ifelse(genre1 == 'History' | genre2 == 'History' | genre3 == 'History', TRUE, FALSE),
    Horror = ifelse(genre1 == 'Horror' | genre2 == 'Horror' | genre3 == 'Horror', TRUE, FALSE),
    Mystery = ifelse(genre1 == 'Mystery' | genre2 == 'Mystery' | genre3 == 'Mystery', TRUE, FALSE),
    Romance = ifelse(genre1 == 'Romance' | genre2 == 'Romance' | genre3 == 'Romance', TRUE, FALSE),
    `Sci-fi` = ifelse(genre1 == 'Sci-Fi' | genre2 == 'Sci-Fi' | genre3 == 'Sci-Fi', TRUE, FALSE),
    Thriller = ifelse(genre1 == 'Thriller' | genre2 == 'Thriller' | genre3 == 'Thriller', TRUE, FALSE)
    )

head(merged_set)

# A tibble: 6 × 30
  title               year imdbid    id bechdelRating imdbAverageRating numVotes
  <chr>              <dbl>  <dbl> <dbl>         <dbl>             <dbl>    <dbl>
1 Beyond the Valley…  1970  65466   583             3               6.1    12073
2 Escape from the P…  1971  67065  2893             1               6.3    40300
3 Shaft               1971  67741  2751             1               6.6    21091
4 Straw Dogs          1971  67800  2620             1               7.4    64649
5 Willy Wonka &amp;…  1971  67992   304             2               7.8   228105
6 Pink Flamingos      1972  69089   708             3               6      27999
# ℹ 23 more variables: runtimeMinutes <dbl>, genre1 <fct>, genre2 <fct>,
#   genre3 <fct>, `budget_2013$` <dbl>, `domgross_2013$` <dbl>,
#   `intgross_2013$` <dbl>, bechdel_test <dbl>, Fantasy <lgl>, Action <lgl>,
#   Adventure <lgl>, Animation <lgl>, Biography <lgl>, Comedy <lgl>,
#   Crime <lgl>, Drama <lgl>, Family <lgl>, History <lgl>, Horror <lgl>,
#   Mystery <lgl>, Romance <lgl>, `Sci-fi` <lgl>, Thriller <lgl>
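
Because every genre line repeats the genre name four times, it is easy for a spelling slip to go unnoticed. One way to avoid that is to build the flags in a loop from a single vector of names. This is only a sketch of the idea on made-up data (the tibble `toy` and the short `genres` vector are hypothetical), not the code I ran above:

```r
library(dplyr)

# Hypothetical toy rows; the genre values are made up
toy <- tibble(
  title  = c("A", "B", "C"),
  genre1 = c("Comedy", "Action", "Drama"),
  genre2 = c("Drama", "Comedy", NA),
  genre3 = c(NA, NA, NA_character_)
)

genres <- c("Action", "Comedy", "Drama")

# For each genre, flag movies that list it in any of the three columns.
# %in% returns FALSE for NA, so missing genres need no special handling.
for (g in genres) {
  toy[[g]] <- toy$genre1 %in% g | toy$genre2 %in% g | toy$genre3 %in% g
}
```

Each genre name is typed once, so the column name and the value being matched cannot drift apart.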

Linear Regression Model

Here I am creating a linear regression model that predicts the budget of a movie.

lm_model <- lm(`budget_2013$` ~ imdbAverageRating + runtimeMinutes + `domgross_2013$`, data = merged_set)

summary(lm_model)


Call:
lm(formula = `budget_2013$` ~ imdbAverageRating + runtimeMinutes + 
    `domgross_2013$`, data = merged_set)

Residuals:
       Min         1Q     Median         3Q        Max 
-344181335  -24552010   -9975204   17435855  227176887 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        4.592e+07  1.023e+07   4.487 7.79e-06 ***
imdbAverageRating -1.534e+07  1.482e+06 -10.349  < 2e-16 ***
runtimeMinutes     8.458e+05  6.550e+04  12.914  < 2e-16 ***
`domgross_2013$`   2.089e-01  1.002e-02  20.847  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45570000 on 1428 degrees of freedom
Multiple R-squared:  0.3284,    Adjusted R-squared:  0.327 
F-statistic: 232.7 on 3 and 1428 DF,  p-value: < 2.2e-16

Equation:

budget_2013 = – 15340000(imdbAverageRating) + 845800(runtimeMinutes) + 0.2089(domgross_2013) + 45920000
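
To show how the equation is used, here is a worked example for one made-up movie. The coefficients are copied from the summary above as printed (so they are rounded), and the rating, runtime, and domestic gross are hypothetical inputs chosen only for illustration:

```r
# Coefficients copied from the model summary above (rounded as printed)
intercept  <- 45920000
b_rating   <- -15340000
b_runtime  <- 845800
b_domgross <- 0.2089

# Hypothetical movie: these three input values are made up
rating   <- 7          # average IMDB rating
runtime  <- 120        # minutes
domgross <- 100000000  # domestic gross in 2013 dollars

predicted_budget <- intercept + b_rating * rating +
  b_runtime * runtime + b_domgross * domgross
predicted_budget  # about 60.9 million dollars
```

Calling predict() on lm_model with a newdata data frame holding the same three values should give nearly the same number, using the unrounded coefficients.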

Diagnostic Plot 1:

Here I am plotting the regression line of a movie’s budget and its average rating on IMDB.

# budget_2013$ is stored in dollars, so divide by 1e6 to match the axis label
ggplot(merged_set, aes(x = `budget_2013$` / 1e6, y = imdbAverageRating)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(x = "Budget in Millions (2013 Dollars)",
       y = "Average IMDB Rating",
       title = "Linear Regression: Budget of a Movie vs. Its Average Rating on IMDB") +
  theme_bw()

`geom_smooth()` using formula = 'y ~ x'

Diagnostic Plot 2:

Here I am plotting the residuals vs. fitted.

plot(lm_model, which=1)

Linear Regression Analysis:

Based on the p-values in the model, all of my predictor variables are very significantly associated with my dependent variable. The three predictors all show the same value, < 2e-16, because R prints any p-value smaller than about 2.2e-16 (the machine precision) that way, so they are not necessarily identical, just too small to display. The coefficient on average IMDB rating is negative, meaning that with runtime and domestic gross held fixed, higher rated movies tend to have lower budgets, which I find interesting as that does not seem intuitive. Runtime and domestic gross are both positively related to the budget, which I think does make intuitive sense: higher budgets allow for more scenes and props and actors, etc., and movies such as those tend to get heavily promoted as well, meaning more people spend money to go see them. The adjusted R-squared value is moderate; the model accounts for around 33% of the variance in movie budget. My guess is that there are other major factors I do not have access to that drive the budget of a movie, meaning my model cannot account for very much.

As for my diagnostic plots, the first visualizes the linear regression line of movie budget and average rating on IMDB. There are a few outliers with very high budgets, and because there is so little data out there, the confidence band widens as the budget increases, even as the very slightly negative relationship is illustrated. Most of the points sit at low budgets and are clustered in the 6-8 range in IMDB rating, which appears to be where the average rating for most movies falls. This plot shows me that, while the relationship between budget and IMDB rating is significant by the p-value, in reality it is very slight.

The second diagnostic plot, residuals vs. fitted, shows that at my lower predicted values my model’s errors are smaller, and as the predicted values increase the errors become bigger in both positive and negative directions, i.e. my model performs worse the higher my predicted values for budget are. This makes sense, as my model does not account for most of the variance in the data, and the higher movie budgets get, the fewer data points I have to work with in this set.
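
The `< 2e-16` entries in the coefficient table are a display floor rather than exact values. Here is a small sketch of that behavior using base R's format.pval (the input p-values are just for illustration, and that the summary table applies this same threshold is my understanding of how R prints coefficient tables):

```r
# Smallest relative spacing between doubles on this machine (about 2.2e-16)
eps <- .Machine$double.eps

# Anything below eps is shown as "< eps" instead of its actual value
tiny     <- format.pval(1e-40, eps = eps)
not_tiny <- format.pval(7.79e-06, eps = eps)

tiny      # printed with a leading "<"
not_tiny  # printed as an ordinary number
```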

More Cleaning

Here I am cleaning one of the data sets by itself for later plotting purposes, again using ifelse statements to create new columns for each genre, and then pivoting the data set longer, i.e. converting the genres into one column, meaning there are multiple entries for most movies. I also created new labels for the Bechdel ratings that are more descriptive for a viewer.

clean_2 <- clean
clean_fin <- clean |>
  mutate(
    Fantasy = ifelse(genre1 == 'Fantasy' | genre2 == 'Fantasy' | genre3 == 'Fantasy', TRUE, FALSE),
    Action = ifelse(genre1 == 'Action' | genre2 == 'Action' | genre3 == 'Action', TRUE, FALSE), 
    Adventure = ifelse(genre1 == 'Adventure' | genre2 == 'Adventure' | genre3 == 'Adventure', TRUE, FALSE),
    Animation = ifelse(genre1 == 'Animation' | genre2 == 'Animation' | genre3 == 'Animation', TRUE, FALSE), 
    Biography = ifelse(genre1 == 'Biography' | genre2 == 'Biography' | genre3 == 'Biography', TRUE, FALSE), 
    Comedy = ifelse(genre1 == 'Comedy' | genre2 == 'Comedy' | genre3 == 'Comedy', TRUE, FALSE),
    Crime = ifelse(genre1 == 'Crime' | genre2 == 'Crime' | genre3 == 'Crime', TRUE, FALSE),
    Drama = ifelse(genre1 == 'Drama' | genre2 == 'Drama' | genre3 == 'Drama', TRUE, FALSE),
    Family = ifelse(genre1 == 'Family' | genre2 == 'Family' | genre3 == 'Family', TRUE, FALSE),
    History = ifelse(genre1 == 'History' | genre2 == 'History' | genre3 == 'History', TRUE, FALSE),
    Horror = ifelse(genre1 == 'Horror' | genre2 == 'Horror' | genre3 == 'Horror', TRUE, FALSE),
    Mystery = ifelse(genre1 == 'Mystery' | genre2 == 'Mystery' | genre3 == 'Mystery', TRUE, FALSE),
    Romance = ifelse(genre1 == 'Romance' | genre2 == 'Romance' | genre3 == 'Romance', TRUE, FALSE),
    `Sci-fi` = ifelse(genre1 == 'Sci-Fi' | genre2 == 'Sci-Fi' | genre3 == 'Sci-Fi', TRUE, FALSE),
    Thriller = ifelse(genre1 == 'Thriller' | genre2 == 'Thriller' | genre3 == 'Thriller', TRUE, FALSE)
    )
head(clean_fin)

# A tibble: 6 × 26
  title               year imdbid    id bechdelRating imdbAverageRating numVotes
  <chr>              <dbl>  <dbl> <dbl>         <dbl>             <dbl>    <dbl>
1 Miss Jerry          1894      9  9779             0               5.4      212
2 Story of the Kell…  1906    574  1349             1               6        903
3 Cleopatra           1912   2101  2003             2               5.1      622
4 A Florida Enchant…  1914   3973  4457             2               5.8      300
5 Birth of a Nation…  1915   4972  1258             2               6.1    26403
6 Gretchen the Gree…  1916   6745  2008             3               6.4      537
# ℹ 19 more variables: runtimeMinutes <dbl>, genre1 <chr>, genre2 <chr>,
#   genre3 <chr>, Fantasy <lgl>, Action <lgl>, Adventure <lgl>,
#   Animation <lgl>, Biography <lgl>, Comedy <lgl>, Crime <lgl>, Drama <lgl>,
#   Family <lgl>, History <lgl>, Horror <lgl>, Mystery <lgl>, Romance <lgl>,
#   `Sci-fi` <lgl>, Thriller <lgl>

clean_fin_long <- clean_fin |>
  pivot_longer(cols = c("Action", "Comedy", "Drama", "Sci-fi", "Thriller", "Romance", "Fantasy", "Adventure", "Crime"),
               names_to = "genre",
               values_to = "has_genre") |>
  filter(has_genre == TRUE)

clean_fin_long$bechdelRating <- as.character(clean_fin_long$bechdelRating)
clean_fin_long$bechdelRating <- factor(clean_fin_long$bechdelRating, levels = c("3", "2", "1", "0"), labels = c("3, Pass", "2, Two named women talk", "1, Two named women", "0, Fail"))
head(clean_fin_long)

# A tibble: 6 × 19
  title               year imdbid    id bechdelRating imdbAverageRating numVotes
  <chr>              <dbl>  <dbl> <dbl> <fct>                     <dbl>    <dbl>
1 Miss Jerry          1894      9  9779 0, Fail                     5.4      212
2 Story of the Kell…  1906    574  1349 1, Two named…               6        903
3 Story of the Kell…  1906    574  1349 1, Two named…               6        903
4 Cleopatra           1912   2101  2003 2, Two named…               5.1      622
5 A Florida Enchant…  1914   3973  4457 2, Two named…               5.8      300
6 Birth of a Nation…  1915   4972  1258 2, Two named…               6.1    26403
# ℹ 12 more variables: runtimeMinutes <dbl>, genre1 <chr>, genre2 <chr>,
#   genre3 <chr>, Animation <lgl>, Biography <lgl>, Family <lgl>,
#   History <lgl>, Horror <lgl>, Mystery <lgl>, genre <chr>, has_genre <lgl>
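
To make the reshaping step easier to follow, here is a minimal sketch of the same pivot on a made-up table (the tibble `toy` and its values are hypothetical). Each movie ends up with one row per genre it belongs to, and genres it does not belong to are filtered out:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per movie, one TRUE/FALSE column per genre
toy <- tibble(
  title  = c("A", "B"),
  Action = c(TRUE, FALSE),
  Comedy = c(TRUE, TRUE),
  Drama  = c(FALSE, FALSE)
)

# Stack the genre columns into a single "genre" column,
# then keep only the genres each movie actually has
toy_long <- toy |>
  pivot_longer(cols = c("Action", "Comedy", "Drama"),
               names_to = "genre",
               values_to = "has_genre") |>
  filter(has_genre)
```

Movie A appears twice in `toy_long` (Action and Comedy) and movie B once (Comedy), which is why the long data sets above have repeated titles.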

More Cleaning

Here I am pivoting my merged data set longer, i.e. converting the genre columns into one column, with multiple entries per movie, and then again, changing the bechdel rating values to be more descriptive for viewers.

merged_set_2 <- merged_set |>
  pivot_longer(cols = c("Action", "Comedy", "Drama", "Sci-fi", "Thriller", "Romance", "Fantasy", "Adventure", "Crime"),
               names_to = "genre",
               values_to = "has_genre") |>
  filter(has_genre == TRUE)


merged_set_2$bechdelRating <- as.character(merged_set_2$bechdelRating)
merged_set_2$bechdelRating <- factor(merged_set_2$bechdelRating, levels = c("3", "2", "1", "0"), labels = c("3, Pass", "2, Two named women talk", "1, Two named women", "0, Fail"))

head(merged_set_2)

# A tibble: 6 × 23
  title               year imdbid    id bechdelRating imdbAverageRating numVotes
  <chr>              <dbl>  <dbl> <dbl> <fct>                     <dbl>    <dbl>
1 Beyond the Valley…  1970  65466   583 3, Pass                     6.1    12073
2 Beyond the Valley…  1970  65466   583 3, Pass                     6.1    12073
3 Escape from the P…  1971  67065  2893 1, Two named…               6.3    40300
4 Escape from the P…  1971  67065  2893 1, Two named…               6.3    40300
5 Shaft               1971  67741  2751 1, Two named…               6.6    21091
6 Shaft               1971  67741  2751 1, Two named…               6.6    21091
# ℹ 16 more variables: runtimeMinutes <dbl>, genre1 <fct>, genre2 <fct>,
#   genre3 <fct>, `budget_2013$` <dbl>, `domgross_2013$` <dbl>,
#   `intgross_2013$` <dbl>, bechdel_test <dbl>, Animation <lgl>,
#   Biography <lgl>, Family <lgl>, History <lgl>, Horror <lgl>, Mystery <lgl>,
#   genre <chr>, has_genre <lgl>

Calculating Proportion

Here I am calculating the proportion of movies for each bechdel rating in each genre.

genre_summary <- merged_set_2 |>
  group_by(genre, bechdelRating) |>
  summarise(count = n()) |>
  group_by(genre) |>
  mutate(proportion = count / sum(count))

`summarise()` has grouped output by 'genre'. You can override using the
`.groups` argument.

genre_summary

# A tibble: 36 × 4
# Groups:   genre [9]
   genre     bechdelRating           count proportion
   <chr>     <fct>                   <int>      <dbl>
 1 Action    3, Pass                   174     0.396 
 2 Action    2, Two named women talk    38     0.0866
 3 Action    1, Two named women        171     0.390 
 4 Action    0, Fail                    56     0.128 
 5 Adventure 3, Pass                   164     0.459 
 6 Adventure 2, Two named women talk    31     0.0868
 7 Adventure 1, Two named women        127     0.356 
 8 Adventure 0, Fail                    35     0.0980
 9 Comedy    3, Pass                   251     0.668 
10 Comedy    2, Two named women talk    45     0.120 
# ℹ 26 more rows
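
The proportion step can be checked by hand on a small made-up example (the tibble `toy_long` and its values are hypothetical). This sketch uses `count()` as a shortcut for grouping and summarising, which is an alternative to the code above rather than what I ran:

```r
library(dplyr)

# Hypothetical long-format data: one row per movie-genre pair
toy_long <- tibble(
  genre = c("Action", "Action", "Action", "Action", "Comedy", "Comedy"),
  bechdelRating = c("3, Pass", "3, Pass", "0, Fail", "1, Two named women",
                    "3, Pass", "0, Fail")
)

# Count movies per genre and rating, then divide by the genre total
toy_summary <- toy_long |>
  count(genre, bechdelRating, name = "count") |>
  group_by(genre) |>
  mutate(proportion = count / sum(count)) |>
  ungroup()
```

Within each genre the proportions add up to 1, so two of the four Action movies passing gives a proportion of 0.5.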

First Visualization

Here I am creating my first visualization, a stacked bar plot. I am comparing different genres by their proportion of movies for each rating of the Bechdel test. I used geom_text to add annotations that clearly show the percentage for each category, then added a dashed horizontal line at 50% on the y-axis to illustrate which genres managed at least 50% of movies passing the Bechdel test. I used scale_y_continuous to create a percentage scale, then used scale_fill_manual to color the different Bechdel ratings. Lastly, I changed the theme to minimal and set the x-axis text at a 45 degree angle.

ggplot(genre_summary, aes(x = genre, y = proportion, fill = bechdelRating)) +
  geom_col(position = "fill") +
  # label each segment with its percentage, taken from genre_summary
  # and centered within the segment, so each label is drawn only once
  geom_text(
    aes(label = scales::percent(proportion, accuracy = 0.1)),
    position = position_fill(vjust = 0.5),
    color = "black",
    size = 3,
    fontface = "bold"
  ) +
  geom_hline(yintercept = 0.5, color = "grey", linetype = "dashed", linewidth = 0.8, alpha = 0.9) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("3, Pass" = "green", "2, Two named women talk" = "yellow", "1, Two named women" = "pink", "0, Fail" = "red")) +
  labs(
    x = "Genre",
    y = "Proportion of Movies",
    fill = "Bechdel Test Rating",
    title = "Proportion of Bechdel Ratings by Genre"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Second Visualization

Here I am creating a time series heat map, representing each genre’s frequency of passing the bechdel test over the years. I used height to create the size of the tiles, and scale_fill_manual to color code the bechdel ratings. Finally, I changed the theme to line draw.

ggplot(merged_set_2, aes(x = year, y = genre, fill = factor(bechdelRating))) +
  geom_tile(height = 0.6) + 
  scale_fill_manual(values = c("green", "yellow", "pink", "red")) +
  labs(title = "Movies by Year and Genre Colored by Bechdel Test Rating",
       x = "Year",
       y = "Genre",
       subtitle = "Combined Data Set, 1970 - 2013",
       caption = "Source: FiveThirtyEight, The-Numbers.com, & BechdelTest.com",
       fill = "Bechdel Rating") +
  theme_linedraw()
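
The unnamed color vector above is matched to the factor levels by position, so the colors depend on the order of the levels. A minimal sketch of a more explicit alternative, assuming the rating labels are the same ones used in the other plots: naming each color ties it to its rating regardless of level order. The toy data frame and the name rating_colors are made up for illustration.

```r
library(ggplot2)

# Made-up toy data; default factor levels are alphabetical ("0, Fail" first)
toy <- data.frame(
  year = c(1970, 1971, 1972),
  genre = c("Action", "Action", "Drama"),
  bechdelRating = factor(c("0, Fail", "3, Pass", "1, Two named women"))
)

# Named vector (hypothetical name): each color is tied to a rating label
rating_colors <- c(
  "3, Pass" = "green",
  "2, Two named women talk" = "yellow",
  "1, Two named women" = "pink",
  "0, Fail" = "red"
)

p <- ggplot(toy, aes(x = year, y = genre, fill = bechdelRating)) +
  geom_tile(height = 0.6) +
  scale_fill_manual(values = rating_colors) +
  theme_linedraw()

# Inspect the fill color ggplot2 actually assigned to each tile
built <- layer_data(p)
built[, c("x", "fill")]
```

With an unnamed vector, the same toy data would color "0, Fail" green, because it is the first level.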

Third Visualization

Here I am making the same time series heat map, only this time using just one of my data sets, the one that has many more entries spanning many more years, because I thought it might reveal more trends. Every other part of the plot is the same as the first heat map.

ggplot(clean_fin_long, aes(x = year, y = genre, fill = factor(bechdelRating))) +
  geom_tile(height = 0.6) +  
  scale_fill_manual(values = c("green", "yellow", "pink", "red")) +
  labs(title = "Movies by Year and Genre Colored by Bechdel Test Rating",
       x = "Year",
       y = "Genre",
       subtitle = "One Data Set, 1894 - 2013",
       caption = "Source: IMDB & BechdelTest.com",
       fill = "Bechdel Rating") +
  theme_linedraw()

Creating New Set For Final Visualization

Here I am grouping by genre and Bechdel rating, and then summarising into a new column called avg_imdb, which is the average IMDB score for each Bechdel rating group within each genre.

avg_ratings <- merged_set_2 |>
  group_by(genre, bechdelRating) |>
  summarise(avg_imdb = mean(imdbAverageRating, na.rm = TRUE))

`summarise()` has grouped output by 'genre'. You can override using the
`.groups` argument.

head(avg_ratings)

# A tibble: 6 × 3
# Groups:   genre [2]
  genre     bechdelRating           avg_imdb
  <chr>     <fct>                      <dbl>
1 Action    3, Pass                     6.45
2 Action    2, Two named women talk     6.47
3 Action    1, Two named women          6.60
4 Action    0, Fail                     6.71
5 Adventure 3, Pass                     6.65
6 Adventure 2, Two named women talk     6.8
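
The message printed by summarise() above is informational, not an error: the result is still grouped by genre. A small sketch of how the grouping could be set explicitly, using a made-up data frame (toy and toy_avg are hypothetical names) with the same column names as the merged set. It also shows that na.rm = TRUE skips missing ratings.

```r
library(dplyr)

# Made-up toy data with one missing IMDB rating
toy <- data.frame(
  genre = c("Action", "Action", "Action", "Crime"),
  bechdelRating = c("3, Pass", "3, Pass", "0, Fail", "0, Fail"),
  imdbAverageRating = c(6.0, 7.0, NA, 6.5)
)

toy_avg <- toy |>
  group_by(genre, bechdelRating) |>
  # .groups = "drop" returns an ungrouped tibble and silences the message
  summarise(avg_imdb = mean(imdbAverageRating, na.rm = TRUE), .groups = "drop")

toy_avg
```

A group whose ratings are all missing ends up with NaN rather than a number, which is worth checking for before plotting.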

Fourth and Final Visualization

Here I am creating side-by-side bar plots that dive deeper into the different genres, comparing the average IMDB rating for movies within a genre based on Bechdel Test rating. I set the text aesthetic at the beginning so that I can customize the tooltip when I later put this plot into plotly. I am using geom_col with a black outline, and making the bars narrower so that the genre groups sit farther apart and are easier on the eyes to tell apart. I used the function coord_cartesian (found on the ggplot2 website, see Works Cited) after adjusting the limits of the y-axis the usual way failed and removed all my bars, even though the averages were within the range I set. This is most likely because bars are drawn starting from zero, which falls outside limits of 6 to 7.3, so scale limits drop them as out of range, while coord_cartesian only zooms the view without discarding any data. I wanted to narrow the y range because, as I previously mentioned, the average IMDB ratings for most movies fall in the same small range, so shrinking the displayed range amplifies the differences. I used scale_fill_manual to color code the Bechdel ratings, set the labs, changed the theme to dark, and then set the x-axis text at a 45 degree angle.

dodged_bar <- ggplot(avg_ratings, aes(
  x = genre, y = avg_imdb, fill = factor(bechdelRating),
  # Custom tooltip text, used later by ggplotly(tooltip = "text")
  text = paste("<b> Av. IMDB Rating: </b>", avg_imdb,
               "<br> <b> Bechdel Test Score: </b>", bechdelRating,
               "<br> <b> Genre: </b>", genre)
)) +
  geom_col(position = "dodge", color = "black", linewidth = 1, width = 0.7) +
  coord_cartesian(ylim = c(6, 7.3)) +
  scale_fill_manual(values = c("0, Fail" = "red", "1, Two named women" = "pink", "2, Two named women talk" = "yellow", "3, Pass" = "green")) +
  labs(title = "Average IMDb Rating by Genre and Bechdel Rating",
       x = "Genre",
       y = "Average IMDb Rating",
       fill = "Bechdel Rating",
       caption = "Source: FiveThirtyEight, BechdelTest.com, IMDB, The-Numbers.com") +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
dodged_bar
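
The difference between zooming with coord_cartesian and setting limits on the y scale can be checked on a small made-up data frame. This is only a sketch under the assumption that the default out-of-bounds handling is in use; toy, base_plot, zoomed, and clipped are hypothetical names.

```r
library(ggplot2)

# Made-up averages that all lie inside the range 6 to 7.3
toy <- data.frame(genre = c("Action", "Crime"), avg_imdb = c(6.5, 6.9))

base_plot <- ggplot(toy, aes(x = genre, y = avg_imdb)) +
  geom_col()

# Zoom only: the data behind each bar is left untouched
zoomed <- base_plot + coord_cartesian(ylim = c(6, 7.3))

# Scale limits: values outside 6 to 7.3 are treated as missing
clipped <- base_plot + scale_y_continuous(limits = c(6, 7.3))

# Each bar runs from ymin (its base) to ymax (the average)
zoomed_ymin <- layer_data(zoomed)$ymin
clipped_ymin <- layer_data(clipped)$ymin
zoomed_ymin
clipped_ymin
```

In the zoomed version the base of each bar stays at zero. In the clipped version the base is out of range, which is why ggplot2 drops the bars when the plot is drawn.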

Final Visualization in Plotly

Here I am simply putting the above visualization into plotly to make it interactive, making sure to set tooltip to the text value I defined earlier so that it shows the information I chose in the format I want. I thought this visualization was the best candidate for interactivity, since annotations like the ones on my first plot would have been hard to fit on this one, and there were a few significant pieces of information that I could only present using a tooltip.

ggplotly(dodged_bar, tooltip = "text")
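
Because avg_imdb is an unrounded mean, the tooltip may show many decimal places. A small sketch of one way to tidy that up: build the tooltip string in a helper that rounds the average first. The function make_tooltip and its digits argument are hypothetical and not part of the original project.

```r
# Hypothetical helper: builds the HTML tooltip string with a rounded average
make_tooltip <- function(avg_imdb, bechdelRating, genre, digits = 2) {
  paste0(
    "<b>Av. IMDB Rating: </b>", round(avg_imdb, digits),
    "<br><b>Bechdel Test Score: </b>", bechdelRating,
    "<br><b>Genre: </b>", genre
  )
}

# Works on single values or on whole columns, since paste0 is vectorised
example_tip <- make_tooltip(6.4512345, "3, Pass", "Action")
example_tip
```

The result could then be passed as the text aesthetic in place of the paste call, for example text = make_tooltip(avg_imdb, bechdelRating, genre).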

Concluding Essay

According to the article “What Is the Bechdel Test? Movies That Pass & Fail” by Lois Nevielle, the origins of the Bechdel Test date back to 1985, when a cartoonist named Alison Bechdel illustrated two women talking to each other about their feminist criteria for watching movies. Alison Bechdel attributes the idea behind the feminist rating system to her friend Liz Wallace, and while in this project I am using it specifically to categorize movies based on their baseline representation of women, it can be used to categorize a variety of media types. According to this article, a few incredibly popular and iconic movies that, surprisingly or not, don’t pass the Bechdel Test are “The entire Lord of The Rings trilogy, The Avengers, and Breakfast at Tiffany’s.” As mentioned in the article, while the Bechdel Test can be an interesting tool, it is far from all-encompassing, and it is clear that many movies can and have passed the Bechdel Test while still displaying sexist ideals. One example of this mentioned in the article is ‘Goodfellas.’ The Bechdel Test is really more of a bare minimum, and so it is all the more shocking to see how many movies can’t even pass these simple criteria.
My first visualization was a stacked bar plot, something I had not tried before, that depicts the proportion of movies in each genre that fall into the 4 Bechdel Test categories. The visualization shows the rather startling fact that for nearly half of the genres recorded in this data set, all well documented and mainstream, the proportion of movies that pass the Bechdel Test is under 50%. Granted, some of those genres are very close, but when you really consider just how bare minimum the conditions of the Bechdel Test are, it becomes sobering. Of the Action movies in this set, 38.9% have only two named women, with no conversation of any sort. Granted, how well this set predicts the population proportion is questionable, but it certainly feels intuitive that Action would do the worst, with Sci-Fi a close second. Surprisingly, Romance does the best at the Bechdel Test, with 71.4% passing. I would have thought that romance movies, which are more often than not about straight relationships, would struggle to meet the final criterion of the Bechdel Test, the conversation not about a man.
As for my second and third visualizations, the time series heat maps, the first draws from the merged set I created, and the second draws from my first data set before it was merged with my second, in order to get more data and therefore more perspective. In both of these heat maps, no trend necessarily pops out at the viewer, though in the second one can see that, for some of the genres, movies that don’t meet any of the Bechdel Test criteria seem to become a little less frequent as the years go on. It also seems to me that many more of the movies from the 1920s to the 1980s or so pass the Bechdel Test than one might have suspected. I suppose, as mentioned earlier in this essay, that this could be an example of the shortcomings of the Bechdel Test and its broad criteria, which often fail to get to the heart of gender roles in film.
As for my last visualization, the side-by-side bar plots, it investigates how IMDB users rate movies within the same genre differently based on their Bechdel rating. I found this to be very interesting. I see that for Action movies, the highest rated films are the ones that completely fail the test, and the lowest rated are the ones that completely pass. Oddly enough, the same is true for Romance, even though that genre has the highest representation of women. Crime seems to be the only category where movies that don’t pass the Bechdel Test in any way are rated the lowest, and Sci-Fi is the only category where movies that pass are rated the highest.
One thing I wish I had been able to do is figure out a neater way to add the annotations to my first plot instead of having to hard code them, which took forever.
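
One possible approach, sketched here on made-up data rather than the real merged set: compute each rating's share within its genre with dplyr, then let geom_text place the labels with position_fill so that no coordinates need to be typed by hand. The names toy and label_df are hypothetical, and the labels come out in whatever format scales::percent produces.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data: one row per movie and genre
toy <- data.frame(
  genre = c(rep("Action", 4), rep("Romance", 4)),
  bechdelRating = c("0, Fail", "1, Two named women", "3, Pass", "3, Pass",
                    "3, Pass", "3, Pass", "3, Pass", "0, Fail")
)

# Share of each rating within its genre, plus a formatted label
label_df <- toy |>
  count(genre, bechdelRating) |>
  group_by(genre) |>
  mutate(
    prop = n / sum(n),
    label = scales::percent(prop, accuracy = 0.1)
  ) |>
  ungroup()

stacked <- ggplot(label_df, aes(x = genre, y = prop, fill = bechdelRating)) +
  geom_col(position = "fill") +
  # vjust = 0.5 centers each label inside its own segment of the stack
  geom_text(aes(label = label),
            position = position_fill(vjust = 0.5),
            size = 3, fontface = "bold") +
  scale_y_continuous(labels = scales::percent)

stacked
```

Because the labels are computed from the data, they update automatically if the data set changes.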

Works Cited:

Cover Photo Source:

“The Bechdel Test and Women in Movies.” Bitch Flicks, 17 Nov. 2013, btchflcks.com/2013/09/the-bechdel-test-and-women-in-movies.html.

Data Source 1:

“Bechdeltest.Com API Documentation.” API Documentation, bechdeltest.com/api/v1/doc. Accessed 10 May 2025.

Source For New Function:

“Cartesian Coordinates - Coord_cartesian.” - Coord_cartesian • Ggplot2, ggplot2.tidyverse.org/reference/coord_cartesian.html. Accessed 10 May 2025.

Source For Y-axis Adjusting:

“Change Y-Axis to Percentage Points in GGPLOT2 Barplot in R.” GeeksforGeeks, GeeksforGeeks, 15 Feb. 2023, www.geeksforgeeks.org/change-y-axis-to-percentage-points-in-ggplot2-barplot-in-r/.

Data Source 2:

Hickey, Walt. “The Dollar-and-Cents Case against Hollywood’s Exclusion of Women.” FiveThirtyEight, FiveThirtyEight, 1 Apr. 2014, fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/.

Data Source 3:

“IMDb Non-Commercial Datasets.” IMDb, IMDb.com, developer.imdb.com/non-commercial-datasets/. Accessed 10 May 2025.

Background Research Source:

Nevielle, Lois. “What Is the Bechdel Test? Movies That Pass & Fail | Backstage.” Backstage.Com, 30 May 2023, www.backstage.com/magazine/article/what-is-the-bechdel-test-75534/.

Data Source 4:

“Where Data and the Movie Business Meet.” The Numbers, the-numbers.com/. Accessed 10 May 2025.