THE BECHDEL TEST AND THE X-MANSION WITH TIDYMODELS AND TIDYTUESDAY

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast focuses on using bootstrap resampling with this week’s TidyTueday dataset from the Claremont Run Project about issues of the comic book series Uncanny X-Men

Read in the data

Out modeling goal is to use information about speech bubbles, thought bubbles, narrative statements, and character depictions from this week’s TidyTueday dataset to understand more about characteristics of individual comic book issues. Let’s focus on two modeling questions

We’re going to use three of the datasets from this week

suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(ggthemes))

character_visualization <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/character_visualization.csv")
## Parsed with column specification:
## cols(
##   issue = col_double(),
##   costume = col_character(),
##   character = col_character(),
##   speech = col_double(),
##   thought = col_double(),
##   narrative = col_double(),
##   depicted = col_double()
## )
xmen_bechdel <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/xmen_bechdel.csv")
## Parsed with column specification:
## cols(
##   issue = col_double(),
##   pass_bechdel = col_character(),
##   notes = col_character(),
##   reprint = col_logical()
## )
locations <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/locations.csv")
## Parsed with column specification:
## cols(
##   issue = col_double(),
##   location = col_character(),
##   context = col_character(),
##   notes = col_character()
## )
theme_set(theme_light())

The character_visualization dataset counts up each time one of the main 25 character speaks, thinks, is involved in narrative statements, or is depicted total

character_visualization
## # A tibble: 9,800 x 7
##    issue costume character                     speech thought narrative depicted
##    <dbl> <chr>   <chr>                          <dbl>   <dbl>     <dbl>    <dbl>
##  1    97 Costume Editor narration                   0       0         0        0
##  2    97 Costume Omnipresent narration              0       0         0        0
##  3    97 Costume Professor X = Charles Xavier~      0       0         0        0
##  4    97 Costume Wolverine = Logan                  7       0         0       10
##  5    97 Costume Cyclops = Scott Summers           24       3         0       23
##  6    97 Costume Marvel Girl/Phoenix = Jean G~      0       0         0        0
##  7    97 Costume Storm = Ororo Munroe              11       0         0        9
##  8    97 Costume Colossus = Peter (Piotr) Ras~      9       0         0       17
##  9    97 Costume Nightcrawler = Kurt Wagner        10       0         0       17
## 10    97 Costume Banshee = Sean Cassidy             0       0         0        5
## # ... with 9,790 more rows

Let’s aggregate this dataset to the issue level so we can nuild models using per-issue differences in speaking, thinking, narrative, and total depictions

per_issue <- character_visualization %>% 
               group_by(issue) %>% 
               summarise(across(speech:depicted, sum)) %>% 
               ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)

I’m not doing a ton of EDA here but there are lots of great examples out there to explore on Twitter.

Which issues have the X-Mansion as a location?

Let’s start with our first model. The X-Mansion is the most frequently used location, but it does not appear in every episode

x_mansion <- locations %>% 
               group_by(issue) %>% 
               summarise(mansion = "X-Mansion" %in% location)
## `summarise()` ungrouping output (override with `.groups` argument)
locations_joined <- per_issue %>% 
               inner_join(x_mansion)
## Joining, by = "issue"
locations_joined %>% 
               mutate(mansion = if_else(mansion, "X_Mansion", "No Mansion")) %>% 
               pivot_longer(speech:depicted, names_to = 'visualization') %>% 
               mutate(visualization = fct_inorder(visualization)) %>% 
               ggplot(aes(x = mansion, y = value, fill = visualization)) +
               geom_dotplot(
                              binaxis = 'y', stackdir = 'center',
                              binpositions = 'all',
                              show.legend = FALSE
               ) +
               facet_wrap(~visualization, scales = 'free_y') + 
               labs(
                              x= NULL, y= NULL,
                              title = 'Which issues contain the X-Mansion as a location?',
                              subtitle = 'Comparing the top 25 characters speech, thought, narrative portayal, and total depictions', 
                              caption = ' Data from the Claremont  Run Project'
               )
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Now let’s create bootstrap resamples and fit a logistic regression model to each resample. What are the boostrap confidence intervals on the model parameters?

library(tidymodels)

set.seed(123)

boots <- bootstraps(locations_joined, times = 1000, apparent = TRUE)

boot_models <- boots %>% mutate(
               model = map(
                              splits,
                              ~glm(mansion ~ speech + thought + narrative + depicted, family = 'binomial', data = analysis(.)
                                   )
               ),
               coef_info = map(model, tidy)
)

boot_coef <- boot_models %>% 
               unnest(coef_info)

int_pctl(boot_models, coef_info)
## # A tibble: 5 x 6
##   term          .lower .estimate    .upper .alpha .method   
##   <chr>          <dbl>     <dbl>     <dbl>  <dbl> <chr>     
## 1 (Intercept) -2.42     -1.29    -0.277      0.05 percentile
## 2 depicted     0.00193   0.0103   0.0196     0.05 percentile
## 3 narrative   -0.0106    0.00222  0.0143     0.05 percentile
## 4 speech      -0.0148   -0.00716  0.000617   0.05 percentile
## 5 thought     -0.0143   -0.00338  0.00645    0.05 percentile

How are the parameters distributed?

boot_coef %>% 
               filter(term != "(Intercept)") %>% 
               mutate(term = fct_inorder(term)) %>% 
               ggplot(aes(x = estimate, fill = term))+ 
               geom_vline(
                              xintercept = 0, color = 'gray50',
                              alpha = 0.6, lty = 2, size =1.5
               ) + 
               geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
               facet_wrap(~term, scales = 'free') + 
               labs(
                              title = 'Which issues contain the X-Mansion as a location?',
                              subtitle = 'Comparing the top 25 characters speech, thought, narrative portrayal, and total depictions',
                              caption = 'Data from the Claremont Run Project')

Apparently issues with lots of talking are more likely to occu elsewhere!

Now let’s do the Bechdel test

We can use the same approach from the previous section but replace the data about issue locations with the Bechdel test data

bechdel_joined <- per_issue %>% 
               inner_join(xmen_bechdel) %>% 
               mutate(pass_bechdel = if_else(pass_bechdel == 'yes', TRUE, FALSE)) 
## Joining, by = "issue"
bechdel_joined %>% 
               mutate(pass_bechdel = if_else(pass_bechdel, 'Pass Bechdel', 'Fails Bechdel')) %>% 
               pivot_longer(speech:depicted, names_to= 'visualization') %>% 
               mutate(visualization = fct_inorder(visualization)) %>% 
               ggplot(aes(x = pass_bechdel, y = value, fill = visualization)) +
               geom_dotplot(
                              binaxis = 'y', stackdir = 'center',
                              binpositions = 'all',
                              show.legend = FALSE
               ) + 
               facet_wrap(~visualization, scales = 'free_y') + 
               labs(
                              x = NULL, y = NULL,
                              title = 'Which Uncanny X-Men issues pass the Bechdel test?', 
                              subtitle = 'Comparing the top 25 characters speech, thought, narrative portrayal, and total depictions',
                              caption = 'Data from the Claremont Run Project'
               )
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

We cam again create bootstrap resamples, fit logistic regression models, and compute bootstrap confidence intervals

set.seed(123)

boots <- bootstraps(bechdel_joined, times = 1000, apparent = TRUE)

boot_models <- boots %>% 
               mutate(
                              model = map(
                                             splits, 
                                             ~glm(pass_bechdel ~ speech + thought + narrative + depicted, family = 'binomial', data = analysis(.))
                              ),
                              coef_info = map(model, tidy)
               )

boot_coefs <- boot_models %>% unnest(coef_info)

int_pctl(boot_models, coef_info)
## # A tibble: 5 x 6
##   term           .lower .estimate    .upper .alpha .method   
##   <chr>           <dbl>     <dbl>     <dbl>  <dbl> <chr>     
## 1 (Intercept) -1.18      -0.248    0.699      0.05 percentile
## 2 depicted    -0.0232    -0.0111  -0.000509   0.05 percentile
## 3 narrative   -0.00405    0.00966  0.0260     0.05 percentile
## 4 speech       0.00521    0.0151   0.0285     0.05 percentile
## 5 thought      0.000561   0.0155   0.0361     0.05 percentile

How are these parameters distributed?

boot_coefs %>%
               filter(term != '(Intercept)') %>% 
               mutate(term = fct_inorder(term)) %>% 
               ggplot(aes(x = estimate, fill = term)) +
               geom_vline(
                              xintercept = 0, color = 'gray50',
                              alpha = 0.6, lty = 2, size = 1.5
               ) + 
               geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
               facet_wrap(~term, scales = 'free') +
               labs(
                              title = 'Which Uncanny X-Men issues pass the Bechdel test?',
                              subtitle = 'Comparing the top 25 characters speech, thought, narrative portayal, and total depictions',
                              caption = 'Data from the Claremont Run project'
               )

I think it makes sense that issues with lots of speaking are more likely to pass the Bechdel test, which is about characters speaking to each other. Interesting that the issuers with lots of character depictions are less likely to pass!