THE BECHDEL TEST AND THE X-MANSION WITH TIDYMODELS AND TIDYTUESDAY
Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast focuses on using bootstrap resampling with this week’s TidyTueday dataset from the Claremont Run Project about issues of the comic book series Uncanny X-Men
Out modeling goal is to use information about speech bubbles, thought bubbles, narrative statements, and character depictions from this week’s TidyTueday dataset to understand more about characteristics of individual comic book issues. Let’s focus on two modeling questions
We’re going to use three of the datasets from this week
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(ggthemes))
character_visualization <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/character_visualization.csv")
## Parsed with column specification:
## cols(
## issue = col_double(),
## costume = col_character(),
## character = col_character(),
## speech = col_double(),
## thought = col_double(),
## narrative = col_double(),
## depicted = col_double()
## )
xmen_bechdel <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/xmen_bechdel.csv")
## Parsed with column specification:
## cols(
## issue = col_double(),
## pass_bechdel = col_character(),
## notes = col_character(),
## reprint = col_logical()
## )
locations <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/locations.csv")
## Parsed with column specification:
## cols(
## issue = col_double(),
## location = col_character(),
## context = col_character(),
## notes = col_character()
## )
theme_set(theme_light())
The character_visualization dataset counts up each time one of the main 25 character speaks, thinks, is involved in narrative statements, or is depicted total
character_visualization
## # A tibble: 9,800 x 7
## issue costume character speech thought narrative depicted
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 97 Costume Editor narration 0 0 0 0
## 2 97 Costume Omnipresent narration 0 0 0 0
## 3 97 Costume Professor X = Charles Xavier~ 0 0 0 0
## 4 97 Costume Wolverine = Logan 7 0 0 10
## 5 97 Costume Cyclops = Scott Summers 24 3 0 23
## 6 97 Costume Marvel Girl/Phoenix = Jean G~ 0 0 0 0
## 7 97 Costume Storm = Ororo Munroe 11 0 0 9
## 8 97 Costume Colossus = Peter (Piotr) Ras~ 9 0 0 17
## 9 97 Costume Nightcrawler = Kurt Wagner 10 0 0 17
## 10 97 Costume Banshee = Sean Cassidy 0 0 0 5
## # ... with 9,790 more rows
Let’s aggregate this dataset to the issue level so we can nuild models using per-issue differences in speaking, thinking, narrative, and total depictions
per_issue <- character_visualization %>%
group_by(issue) %>%
summarise(across(speech:depicted, sum)) %>%
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
I’m not doing a ton of EDA here but there are lots of great examples out there to explore on Twitter.
Let’s start with our first model. The X-Mansion is the most frequently used location, but it does not appear in every episode
x_mansion <- locations %>%
group_by(issue) %>%
summarise(mansion = "X-Mansion" %in% location)
## `summarise()` ungrouping output (override with `.groups` argument)
locations_joined <- per_issue %>%
inner_join(x_mansion)
## Joining, by = "issue"
locations_joined %>%
mutate(mansion = if_else(mansion, "X_Mansion", "No Mansion")) %>%
pivot_longer(speech:depicted, names_to = 'visualization') %>%
mutate(visualization = fct_inorder(visualization)) %>%
ggplot(aes(x = mansion, y = value, fill = visualization)) +
geom_dotplot(
binaxis = 'y', stackdir = 'center',
binpositions = 'all',
show.legend = FALSE
) +
facet_wrap(~visualization, scales = 'free_y') +
labs(
x= NULL, y= NULL,
title = 'Which issues contain the X-Mansion as a location?',
subtitle = 'Comparing the top 25 characters speech, thought, narrative portayal, and total depictions',
caption = ' Data from the Claremont Run Project'
)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
Now let’s create bootstrap resamples and fit a logistic regression model to each resample. What are the boostrap confidence intervals on the model parameters?
library(tidymodels)
set.seed(123)
boots <- bootstraps(locations_joined, times = 1000, apparent = TRUE)
boot_models <- boots %>% mutate(
model = map(
splits,
~glm(mansion ~ speech + thought + narrative + depicted, family = 'binomial', data = analysis(.)
)
),
coef_info = map(model, tidy)
)
boot_coef <- boot_models %>%
unnest(coef_info)
int_pctl(boot_models, coef_info)
## # A tibble: 5 x 6
## term .lower .estimate .upper .alpha .method
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) -2.42 -1.29 -0.277 0.05 percentile
## 2 depicted 0.00193 0.0103 0.0196 0.05 percentile
## 3 narrative -0.0106 0.00222 0.0143 0.05 percentile
## 4 speech -0.0148 -0.00716 0.000617 0.05 percentile
## 5 thought -0.0143 -0.00338 0.00645 0.05 percentile
How are the parameters distributed?
boot_coef %>%
filter(term != "(Intercept)") %>%
mutate(term = fct_inorder(term)) %>%
ggplot(aes(x = estimate, fill = term))+
geom_vline(
xintercept = 0, color = 'gray50',
alpha = 0.6, lty = 2, size =1.5
) +
geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
facet_wrap(~term, scales = 'free') +
labs(
title = 'Which issues contain the X-Mansion as a location?',
subtitle = 'Comparing the top 25 characters speech, thought, narrative portrayal, and total depictions',
caption = 'Data from the Claremont Run Project')
Apparently issues with lots of talking are more likely to occu elsewhere!
We can use the same approach from the previous section but replace the data about issue locations with the Bechdel test data
bechdel_joined <- per_issue %>%
inner_join(xmen_bechdel) %>%
mutate(pass_bechdel = if_else(pass_bechdel == 'yes', TRUE, FALSE))
## Joining, by = "issue"
bechdel_joined %>%
mutate(pass_bechdel = if_else(pass_bechdel, 'Pass Bechdel', 'Fails Bechdel')) %>%
pivot_longer(speech:depicted, names_to= 'visualization') %>%
mutate(visualization = fct_inorder(visualization)) %>%
ggplot(aes(x = pass_bechdel, y = value, fill = visualization)) +
geom_dotplot(
binaxis = 'y', stackdir = 'center',
binpositions = 'all',
show.legend = FALSE
) +
facet_wrap(~visualization, scales = 'free_y') +
labs(
x = NULL, y = NULL,
title = 'Which Uncanny X-Men issues pass the Bechdel test?',
subtitle = 'Comparing the top 25 characters speech, thought, narrative portrayal, and total depictions',
caption = 'Data from the Claremont Run Project'
)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
We cam again create bootstrap resamples, fit logistic regression models, and compute bootstrap confidence intervals
set.seed(123)
boots <- bootstraps(bechdel_joined, times = 1000, apparent = TRUE)
boot_models <- boots %>%
mutate(
model = map(
splits,
~glm(pass_bechdel ~ speech + thought + narrative + depicted, family = 'binomial', data = analysis(.))
),
coef_info = map(model, tidy)
)
boot_coefs <- boot_models %>% unnest(coef_info)
int_pctl(boot_models, coef_info)
## # A tibble: 5 x 6
## term .lower .estimate .upper .alpha .method
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) -1.18 -0.248 0.699 0.05 percentile
## 2 depicted -0.0232 -0.0111 -0.000509 0.05 percentile
## 3 narrative -0.00405 0.00966 0.0260 0.05 percentile
## 4 speech 0.00521 0.0151 0.0285 0.05 percentile
## 5 thought 0.000561 0.0155 0.0361 0.05 percentile
How are these parameters distributed?
boot_coefs %>%
filter(term != '(Intercept)') %>%
mutate(term = fct_inorder(term)) %>%
ggplot(aes(x = estimate, fill = term)) +
geom_vline(
xintercept = 0, color = 'gray50',
alpha = 0.6, lty = 2, size = 1.5
) +
geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
facet_wrap(~term, scales = 'free') +
labs(
title = 'Which Uncanny X-Men issues pass the Bechdel test?',
subtitle = 'Comparing the top 25 characters speech, thought, narrative portayal, and total depictions',
caption = 'Data from the Claremont Run project'
)
I think it makes sense that issues with lots of speaking are more likely to pass the Bechdel test, which is about characters speaking to each other. Interesting that the issuers with lots of character depictions are less likely to pass!