Week 3 HW

Week 3 - Bechdel

#loading the data and needed libraries into the markdown file
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyr)

bechdel_data_raw <- read_csv("C:/Users/Lauren/Documents/Stats Data/raw_bechdel.csv")

## Rows: 8839 Columns: 5

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): imdb_id, title
## dbl (3): year, id, rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bechdel_data_movies <- read_csv("C:/Users/Lauren/Documents/Stats Data/movies.csv")

## Rows: 1794 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): imdb, title, test, clean_test, binary, domgross, intgross, code, d...
## dbl  (7): year, budget, budget_2013, period_code, decade_code, metascore, im...
## num  (1): imdb_votes
## lgl  (2): response, error
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Before we start grouping this data set, we should establish a baseline for what counts as notable. We can do that by first counting how many films in the data set pass the Bechdel test and how many fail.

table_pass_fail <- table(bechdel_data_movies$binary)
table_pass_fail

## 
## FAIL PASS 
##  991  803

barplot(table_pass_fail)

So we can see that approximately 44.8% of movies in the data set (803 of 1,794) pass the Bechdel test. We will use this as a benchmark going forward in this notebook.
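The 44.8% figure can be reproduced from the counts printed above. This is a minimal sketch that rebuilds the counts by hand instead of rereading the CSV; the names `pass_fail_counts` and `pass_fail_pct` are introduced here for illustration only.

```r
# Counts copied from the table_pass_fail output above
pass_fail_counts <- c(FAIL = 991, PASS = 803)

# Share of each outcome, as a percentage rounded to one decimal place
pass_fail_pct <- round(100 * prop.table(pass_fail_counts), 1)
pass_fail_pct
```

On the real data, `prop.table(table_pass_fail)` should give the same shares without retyping the counts.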

Group By Director

gb_directors <- bechdel_data_movies |>
  group_by(director) |>
  summarise(bechdel_pass = sum(binary == "PASS"),
            bechdel_fail = sum(binary == "FAIL"),
            pass_percentage = bechdel_pass/(bechdel_pass + bechdel_fail)) #|>
  #arrange(desc(pass_percentage))
gb_directors

## # A tibble: 892 × 4
##    director        bechdel_pass bechdel_fail pass_percentage
##    <chr>                  <int>        <int>           <dbl>
##  1 Aaron Schneider            0            1             0  
##  2 Adam Brooks                1            0             1  
##  3 Adam Green                 0            1             0  
##  4 Adam McKay                 0            3             0  
##  5 Adam Shankman              4            0             1  
##  6 Adrian Lyne                1            1             0.5
##  7 Adrienne Shelly            1            0             1  
##  8 Aki Kaurismäki             0            1             0  
##  9 Akiva Schaffer             0            2             0  
## 10 Alan Metter                0            1             0  
## # ℹ 882 more rows

ggplot(gb_directors, mapping = aes (x = pass_percentage)) + geom_freqpoly()

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

From this frequency graph, we can see that a little over a fifth of the directors with movies in this data set have no movies that pass the Bechdel test! Additionally, almost a fifth of the directors have all of their films in this data set pass the Bechdel test. It makes sense that these two numbers are close, since we know that about 44.8% of movies in the data set pass. A few more directors have a majority of their movies in the data set pass the Bechdel test. I would love to sort the directors by gender, to see if women are more likely to have all their movies pass the Bechdel test.

The least common group appears to be directors whose films pass the Bechdel test sometimes, but not always; specifically, those whose films pass a little over 75% of the time. We can call them the ‘Bechdel-75P’. This group is probably small because a director needs several movies in the data set to reach the in-between percentages. For instance, a single movie is enough to reach 100% or 0%, but reaching exactly 75% with a binary outcome takes at least four. Having exactly four films in the data set with exactly one of them failing the Bechdel test is a specific enough combination that only a small share of directors land on that pass percentage.
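The "at least four films" argument can be checked by enumerating every possible pass fraction for small filmographies. The sketch below assumes a band of 75% to 80% and considers directors with one to five films; `combos` and `in_band` are illustrative names, not part of the analysis.

```r
# Every combination of films directed (1 to 5) and films passing (0 to 5)
combos <- expand.grid(n_films = 1:5, n_pass = 0:5)

# Keep only combinations where passes do not exceed films
combos <- combos[combos$n_pass <= combos$n_films, ]
combos$pass_percentage <- combos$n_pass / combos$n_films

# Combinations whose pass fraction falls between 75% and 80% inclusive
in_band <- combos[combos$pass_percentage >= 0.75 &
                    combos$pass_percentage <= 0.80, ]
in_band
```

No director with fewer than four films can land in the band. Note that four passes out of five films (80%) would also qualify, so a band this wide does not by itself guarantee exactly four films.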

gb_directors$is_75p <- ifelse(gb_directors$pass_percentage >= .75 & gb_directors$pass_percentage <= .80, "TRUE", "FALSE")
gb_directors_sorted <- gb_directors[order(gb_directors$is_75p, decreasing = TRUE),]
gb_directors_sorted

## # A tibble: 892 × 5
##    director         bechdel_pass bechdel_fail pass_percentage is_75p
##    <chr>                   <int>        <int>           <dbl> <chr> 
##  1 Alexander Payne             3            1            0.75 TRUE  
##  2 David O. Russell            3            1            0.75 TRUE  
##  3 David Yates                 3            1            0.75 TRUE  
##  4 Jan de Bont                 3            1            0.75 TRUE  
##  5 Wayne Wang                  3            1            0.75 TRUE  
##  6 Aaron Schneider             0            1            0    FALSE 
##  7 Adam Brooks                 1            0            1    FALSE 
##  8 Adam Green                  0            1            0    FALSE 
##  9 Adam McKay                  0            3            0    FALSE 
## 10 Adam Shankman               4            0            1    FALSE 
## # ℹ 882 more rows

As we expected, the elusive ‘Bechdel-75P’ directors all have exactly 4 movies in the data set. That is neither the most nor the fewest films per director, but it is a number that can give us exactly 75%.

Group By Year

Do newer films pass the Bechdel test more often than older films (or vice versa)?

gb_year <- bechdel_data_movies |>
  group_by(year) |>
  summarise(year_pass = sum(binary == "PASS"),
            year_fail = sum(binary == "FAIL"),
            year_pass_percentage = year_pass/(year_pass + year_fail)) 
gb_year

## # A tibble: 44 × 4
##     year year_pass year_fail year_pass_percentage
##    <dbl>     <int>     <int>                <dbl>
##  1  1970         1         0                1    
##  2  1971         0         5                0    
##  3  1972         1         2                0.333
##  4  1973         1         4                0.2  
##  5  1974         2         5                0.286
##  6  1975         0         5                0    
##  7  1976         3         5                0.375
##  8  1977         2         5                0.286
##  9  1978         2         6                0.25 
## 10  1979         2         3                0.4  
## # ℹ 34 more rows

ggplot(gb_year, mapping = aes (x = year, y= year_pass_percentage)) + geom_col() + geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

The data shows that 1970 is unique in that the only film in the data set from that year passes the Bechdel test. That said, the trend line shows a slight increase in the percentage of films passing the Bechdel test from the 1970s through 2013, the last year in the data set. Looking at the bar chart, we can also see a dip around 2009. It might be interesting to see what happened at that time, either in the world or in the data set, to cause this change.
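Because a year like 1970 contributes a 100% pass rate from a single film, an unweighted trend line gives it as much influence as a year with eight films. One possible check, sketched here on only the first ten rows of `gb_year` as printed above, is to compare an ordinary fit with one weighted by the number of films per year. The `n_films` column and the `fit_*` names are assumptions added for illustration.

```r
# First ten rows of gb_year, copied from the printed output above
gb_year_head <- data.frame(
  year      = 1970:1979,
  year_pass = c(1, 0, 1, 1, 2, 0, 3, 2, 2, 2),
  year_fail = c(0, 5, 2, 4, 5, 5, 5, 5, 6, 3)
)
gb_year_head$n_films <- gb_year_head$year_pass + gb_year_head$year_fail
gb_year_head$year_pass_percentage <- gb_year_head$year_pass / gb_year_head$n_films

# Unweighted fit: every year counts equally, as in geom_smooth(method = "lm")
fit_unweighted <- lm(year_pass_percentage ~ year, data = gb_year_head)

# Weighted fit: years with more films have more influence on the slope
fit_weighted <- lm(year_pass_percentage ~ year, data = gb_year_head,
                   weights = n_films)

# Compare the estimated change in pass fraction per year
coef(fit_unweighted)[["year"]]
coef(fit_weighted)[["year"]]
```

Running the same comparison on the full `gb_year` table would show whether the slight upward trend survives once sparse years are down-weighted.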

From this graph, it looks like the lowest non-zero percent pass year is 1981. We can call those films that do pass from that year the ‘81-pass’.

bechdel_data_movies_copy <- bechdel_data_movies
bechdel_data_movies_copy$is_81pass <- ifelse(bechdel_data_movies_copy$year == 1981 & bechdel_data_movies_copy$binary == "PASS", "TRUE", "FALSE")

copy_year_sorted <- bechdel_data_movies_copy[order(bechdel_data_movies_copy$is_81pass, decreasing = TRUE),]
copy_year_sorted

## # A tibble: 1,794 × 35
##     year imdb      title  test  clean_test binary budget domgross intgross code 
##    <dbl> <chr>     <chr>  <chr> <chr>      <chr>   <dbl> <chr>    <chr>    <chr>
##  1  1981 tt0082495 Hallo… ok    ok         PASS   2.5 e6 25533818 25533818 1981…
##  2  2013 tt1711425 21 &a… nota… notalk     FAIL   1.3 e7 25682380 42195766 2013…
##  3  2012 tt1343727 Dredd… ok-d… ok         PASS   4.50e7 13414714 40868994 2012…
##  4  2013 tt2024544 12 Ye… nota… notalk     FAIL   2   e7 53107035 1586070… 2013…
##  5  2013 tt1272878 2 Guns nota… notalk     FAIL   6.1 e7 75612460 1324930… 2013…
##  6  2013 tt0453562 42     men   men        FAIL   4   e7 95020213 95020213 2013…
##  7  2013 tt1335975 47 Ro… men   men        FAIL   2.25e8 38362475 1458038… 2013…
##  8  2013 tt1606378 A Goo… nota… notalk     FAIL   9.2 e7 67349198 3042491… 2013…
##  9  2013 tt2194499 About… ok-d… ok         PASS   1.20e7 15323921 87324746 2013…
## 10  2013 tt1814621 Admis… ok    ok         PASS   1.3 e7 18007317 18007317 2013…
## # ℹ 1,784 more rows
## # ℹ 25 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## #   intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## #   plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## #   writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## #   released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## #   type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, is_81pass <chr>

There is only one ‘81-pass’ movie, Halloween II. That’s pretty rare! We could test whether the movies made in 1981 simply belong to genres that were more likely to fail the Bechdel test to begin with, or whether the films of that year had a higher density of male stars/characters.

Group by Bechdel Test Result

I would like to know if movies that pass the Bechdel test are, in general, rated higher than those that do not (or vice versa).

gb_testresults <- bechdel_data_movies |>
  group_by(binary) |>
  summarise(mean_imdb = mean(imdb_rating, na.rm=TRUE),
            median_imdb = median(imdb_rating, na.rm=TRUE)) 
gb_testresults

## # A tibble: 2 × 3
##   binary mean_imdb median_imdb
##   <chr>      <dbl>       <dbl>
## 1 FAIL        6.89         7  
## 2 PASS        6.60         6.7

gb_long <- gb_testresults |>
  pivot_longer(cols = c(mean_imdb, median_imdb),
               names_to = "stat",
               values_to = "value") |>
  mutate(stat = recode(stat, mean_imdb = "Mean", median_imdb = "Median"))

ggplot(gb_long, aes(x = binary, y = value, fill = stat)) +
  geom_col(position = "dodge") +  # side-by-side bars; the default stacks the mean on top of the median
  scale_fill_brewer() +
  labs(x = "Bechdel result", y = "IMDB rating", title = "Mean and Median IMDB by Bechdel", fill = "") +
  theme_minimal()

This data shows us that movies that pass the Bechdel test are rated very slightly worse on IMDB than those that don’t. I’d love to investigate this further, with a look into genres and how they compare on both IMDB ratings and Bechdel test pass/fail rates.

This also allows us to check whether there are fewer highly rated movies that pass the Bechdel test. Let’s arbitrarily say that a highly rated movie has an IMDB rating of 8 or higher.

bechdel_data_movies_copy$is_highly_rated <- ifelse(bechdel_data_movies_copy$imdb_rating >= 8.0 & bechdel_data_movies_copy$binary == "PASS", "TRUE", " ")

copy_rating_sorted <- bechdel_data_movies_copy[order(bechdel_data_movies_copy$is_highly_rated, decreasing = TRUE),]
copy_rating_sorted

## # A tibble: 1,794 × 36
##     year imdb      title  test  clean_test binary budget domgross intgross code 
##    <dbl> <chr>     <chr>  <chr> <chr>      <chr>   <dbl> <chr>    <chr>    <chr>
##  1  2013 tt1392214 Priso… ok    ok         PASS   4.6 e7 61002302 1134023… 2013…
##  2  2011 tt1454029 The H… ok    ok         PASS   2.5 e7 1697055… 2131200… 2011…
##  3  2010 tt0947798 Black… ok-d… ok         PASS   1.3 e7 1069546… 3312667… 2010…
##  4  2010 tt0892769 How t… ok    ok         PASS   1.65e8 2175812… 4948709… 2010…
##  5  2009 tt0796366 Star … ok-d… ok         PASS   1.4 e8 2577300… 3856804… 2009…
##  6  2007 tt0808417 Perse… ok    ok         PASS   7.3 e6 4443403  22742498 2007…
##  7  2006 tt0405508 Rang … ok    ok         PASS   5.3 e6 2197694  29197694 2006…
##  8  2005 tt0379786 Seren… ok    ok         PASS   3.90e7 25514517 38514517 2005…
##  9  2005 tt0401792 Sin C… ok-d… ok         PASS   4   e7 74103820 1587538… 2005…
## 10  2005 tt0434409 V for… ok-d… ok         PASS   5   e7 70511035 1292315… 2005…
## # ℹ 1,784 more rows
## # ℹ 26 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## #   intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## #   plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## #   writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## #   released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## #   type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, is_81pass <chr>, …

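# passing movies with an IMDB rating of 8.0 or higher
highly_rated_count_pass <- sum(bechdel_data_movies_copy$binary == "PASS" & bechdel_data_movies_copy$is_highly_rated == "TRUE", na.rm = TRUE)

# note: is_highly_rated is never "TRUE" for a FAIL movie, so this counts every failing movie, not only the highly rated ones
highly_rated_count_fail <- sum(bechdel_data_movies_copy$binary == "FAIL" & bechdel_data_movies_copy$is_highly_rated == " ", na.rm = TRUE)


highly_rated_count_pass

## [1] 37

highly_rated_count_fail

## [1] 991

There are 37 movies rated 8.0 or higher that pass the Bechdel test (let’s call them “highly_rated_passers”). The 991, however, is not a count of highly rated movies that fail: is_highly_rated is "TRUE" only for passing movies, so every failing movie carries the blank flag, and 991 is simply the total number of failing movies in the data set. The two counts therefore cannot tell us whether highly rated movies are skewed against passing the Bechdel test. To answer that, we would need the number of failing movies with an IMDB rating of 8.0 or higher; if highly rated movies resembled the whole data set, about 44.8% of them would pass.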
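A fairer comparison flags high ratings independently of the Bechdel result and only then splits by pass/fail. The sketch below uses a small, entirely hypothetical data frame (`toy_movies`, with made-up ratings) to show the shape of that calculation. It is not a result from the real data.

```r
# Hypothetical toy data; these ratings are made up for illustration
toy_movies <- data.frame(
  binary      = c("PASS", "PASS", "FAIL", "FAIL", "FAIL", "PASS"),
  imdb_rating = c(8.1, 6.5, 8.4, 7.0, NA, 8.0)
)

# Flag high ratings on their own, without reference to the Bechdel result;
# movies with a missing rating are treated as not highly rated
toy_movies$is_highly_rated <- !is.na(toy_movies$imdb_rating) &
  toy_movies$imdb_rating >= 8.0

# Cross-tabulate the flag against the Bechdel result
highly_rated_table <- table(toy_movies$is_highly_rated, toy_movies$binary)
highly_rated_table

# Share of highly rated movies that pass
highly_rated_pass_share <- mean(
  toy_movies$binary[toy_movies$is_highly_rated] == "PASS"
)
highly_rated_pass_share
```

Applied to `bechdel_data_movies`, the resulting share could be compared directly with the 44.8% benchmark.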

DataFrame Combinations

rating_bechdel_unique <- unique(bechdel_data_movies[c('rated', 'binary')])
rating_bechdel_unique

## # A tibble: 22 × 2
##    rated binary
##    <chr> <chr> 
##  1 <NA>  FAIL  
##  2 <NA>  PASS  
##  3 R     FAIL  
##  4 PG-13 FAIL  
##  5 R     PASS  
##  6 PG-13 PASS  
##  7 PG    FAIL  
##  8 PG    PASS  
##  9 N/A   FAIL  
## 10 G     FAIL  
## # ℹ 12 more rows

We notice here that TV-PG and TV-14 don’t have both pass and fail options. TV-PG does not have any movies that pass the Bechdel test, and TV-14 doesn’t have any that fail it. Because this data set is about movies, it’s entirely possible that these data points were either included in error (given their TV rating) or are so few that there was no opportunity for both pass and fail to occur.

common_combos <- bechdel_data_movies |>
  count(rated, binary)
# sort ascending by count so the rarest combinations come first
sorted_combos <- common_combos[order(common_combos$n), ]
sorted_combos

## # A tibble: 22 × 3
##    rated     binary     n
##    <chr>     <chr>  <int>
##  1 TV-14     PASS       1
##  2 TV-PG     FAIL       1
##  3 X         PASS       1
##  4 NC-17     PASS       2
##  5 Unrated   PASS       2
##  6 Unrated   FAIL       3
##  7 X         FAIL       3
##  8 N/A       PASS       4
##  9 Not Rated PASS       4
## 10 NC-17     FAIL       5
## # ℹ 12 more rows

The least common combinations are tied between the two TV-rated movies and the X-rated movie that passes the Bechdel test, since each appears only once in the data set. We surmised above that the TV-rated films don’t really belong in the data set, because they carry a TV rating rather than an MPAA film rating. The X-rated movie that passes the Bechdel test is likely rare because the X rating was replaced by NC-17 in 1990, and the content that merits an X/NC-17 rating isn’t necessarily conducive to passing the Bechdel test.

Conversely, the most common combination is R-rated movies that fail the Bechdel test, with 394 instances in the data set. It may be that, as with X/NC-17 films, the content of R-rated films isn’t conducive to passing the Bechdel test. But it could also be that R-rated films are simply very common, since R-rated movies that pass the Bechdel test are also common in the data set (297 movies).
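One way to separate "R-rated films are common" from "R-rated films fail more often" is to compare the pass rate within the R rating against the overall benchmark. This is a minimal sketch using only the counts quoted in this notebook; the variable names are illustrative.

```r
# Counts quoted in the text above for R-rated movies
r_fail <- 394
r_pass <- 297

# Pass rate among R-rated movies only, as a percentage
r_pass_rate <- r_pass / (r_pass + r_fail)
round(100 * r_pass_rate, 1)

# Overall pass rate from the PASS/FAIL table at the top of the notebook
overall_pass_rate <- 803 / (803 + 991)
round(100 * overall_pass_rate, 1)
```

The R-rated pass rate comes out at about 43.0%, close to the overall 44.8%. That suggests the large number of R-rated failures mostly reflects how many R-rated films are in the data set, though a formal test would be needed to say more.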