Week 11 Bechdel

# binary_model2 is the fitted model, so ggplot() fortifies it (see the warning below)
model2_graph <- ggplot(data = binary_model2, aes(x = profitability, y = binary_num)) +
  geom_jitter() +
  geom_smooth(method = 'lm', se = FALSE)

## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggplot2 package.
##   Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

model2_graph

## `geom_smooth()` using formula = 'y ~ x'

binary_model2$coefficients

##   (Intercept) profitability 
## -9.072093e-02 -5.284544e-10

# Logistic curve using the rounded coefficients printed above
sigmoid <- \(x) 1 / (1 + exp(-(-0.091 - 5.28e-10 * x)))

bechdel_data_movies |>
  ggplot(mapping = aes(x = profitability, y = binary_num)) +
  geom_jitter() +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Fitted probability of passing the Bechdel test by profitability")

## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).
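
The curve above hard-codes rounded coefficients. As a minimal sketch, the same kind of curve can be built straight from a fitted model's coefficients so that it stays in sync with the fit. The data here are simulated stand-ins, not the Bechdel data, and `profit_m` (profit in millions), `passes`, `fit`, and `sigmoid_fit` are hypothetical names introduced only for this illustration.

```r
# Simulated stand-in data (hypothetical; NOT the Bechdel data set)
set.seed(1)
n <- 500
profit_m <- rnorm(n, mean = 100, sd = 200)   # profit in millions of dollars
passes <- rbinom(n, 1, plogis(-0.09 - 0.0005 * profit_m))

# Logistic regression of the binary outcome on profit
fit <- glm(passes ~ profit_m, family = binomial)

# Build the logistic curve from the fitted coefficients instead of typing them in
b <- coef(fit)
sigmoid_fit <- \(x) plogis(b[[1]] + b[[2]] * x)

# The curve should agree with the model's own fitted probabilities
max_gap <- max(abs(sigmoid_fit(profit_m) - fitted(fit)))
max_gap
```

A function built this way can be passed to `geom_function(fun = ...)` in the same way as `sigmoid`.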


Your RMarkdown notebook for this data dive should contain the following:

Build a linear (or generalized linear) model as you like
- Use whatever response variable and explanatory variables you prefer: here the response is binary_num (whether a movie passes the Bechdel test) and the explanatory variable is profitability.

Use the tools from previous weeks to diagnose the model

# Per-observation log loss from the observed outcome (y) and fitted probability (phat)
y <- binary_model2$y
phat <- binary_model2$fitted.values
loss <- -(y * log(phat) + (1 - y) * log(1 - phat))

loss_model_df <- model.frame(binary_model2)
loss_model_df$loss <- loss


loss_graph <-
  ggplot(loss_model_df) +
  geom_boxplot(mapping = aes(x = as_factor(binary_num), y = loss)) +
  labs(title = 'Log loss by Bechdel test outcome')

loss_graph

plot(binary_model2, which = 4, id.n = 3)

loss_model_df["1558", ]

##      binary_num profitability     loss
## 1558          1    2206384238 1.507136
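
As a quick sanity check, the loss for observation 1558 can be reproduced by hand from the printed coefficients and that row's values. This is only a sketch using numbers copied from the output above; `b0`, `b1`, `eta`, `phat_1558`, and `loss_1558` are names introduced here for illustration.

```r
# Values copied from the printed output above
b0 <- -9.072093e-02        # intercept
b1 <- -5.284544e-10        # profitability coefficient (per dollar)
profit_1558 <- 2206384238  # profitability of observation 1558
y_1558 <- 1                # observation 1558 passes the Bechdel test

eta <- b0 + b1 * profit_1558       # linear predictor (log-odds of passing)
phat_1558 <- 1 / (1 + exp(-eta))   # fitted probability of passing

# Same log loss formula as used for the whole model above
loss_1558 <- -(y_1558 * log(phat_1558) + (1 - y_1558) * log(1 - phat_1558))

round(c(phat = phat_1558, loss = loss_1558), 3)
```

The loss comes out at about 1.507, matching the row printed above: the model gives this passing movie only about a 22% chance of passing, which is why its loss is high.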

res <- residuals(binary_model2, type = "deviance")

residual_df <- data.frame(residuals = res)

residuals_graph <- ggplot(residual_df, aes(x = residuals)) + geom_histogram()

residuals_graph

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Highlight any issues with the model:
According to the box plot, the model fits movies that fail the Bechdel test somewhat better than movies that pass it: the per-observation loss is lower for the failing group. Cook's distance flags observations 1446, 1558, and 1784 as the three most influential points. All three are passing movies, with profits of roughly 3.8B, 3.1B, and 2.2B (observation 1558, printed above, is the 2.2B one). Because these high-grossing passing movies sit far from the rest of the data, they pull strongly on the fitted curve. Without them, the negative relationship between profitability and passing would likely be even steeper, so the model is probably even less favorable to passing movies than it currently appears. The histogram of deviance residuals is bimodal, which is expected for a binary GLM because every observed outcome is either 0 or 1. The distribution also skews slightly left, which we read as the model slightly under-predicting movies passing the Bechdel test. This agrees with the other diagnostic plots.
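
One way to follow up on the influential points is to pull them out programmatically and refit without them. The sketch below does this on simulated stand-in data, not the Bechdel data: the data frame `d`, its columns `profit_m` (profit in millions) and `passes`, and the three extreme rows are all hypothetical and only loosely mimic the situation described above.

```r
# Simulated stand-in data (hypothetical; NOT the Bechdel data set)
set.seed(42)
n <- 300
d <- data.frame(
  # Three extreme, passing, high-profit rows appended at the end
  profit_m = c(rnorm(n - 3, mean = 50, sd = 100), 2200, 3100, 3800),
  passes   = c(rbinom(n - 3, 1, 0.45), 1, 1, 1)
)

fit_all <- glm(passes ~ profit_m, family = binomial, data = d)

# Cook's distance for every observation, then the three largest
cd <- cooks.distance(fit_all)
top3 <- order(cd, decreasing = TRUE)[1:3]
cbind(d[top3, ], cooks_d = cd[top3])

# Refit without those three rows and compare the coefficients
fit_trim <- glm(passes ~ profit_m, family = binomial, data = d[-top3, ])
rbind(all_rows = coef(fit_all), without_top3 = coef(fit_trim))
```

Comparing the two rows of coefficients shows how much the slope depends on the flagged observations. On the real data, the same comparison would show whether the profitability effect gets more negative once the three blockbusters are dropped.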

Interpret at least one of the coefficients
- The profitability coefficient (about -5.28e-10) is on the log-odds scale: for each additional dollar of profitability, the log-odds of a movie passing the Bechdel test decrease very slightly. Since this number is so small, and profitability is more often tracked in millions, a more useful reading is that each additional million dollars of profitability lowers the odds of a movie passing the Bechdel test by about 0.05%, which is still almost negligible. This is important to know because some people think that movies that pass the Bechdel test are less profitable, and in this model the association does point in that direction. But the effect is very small and may be outweighed by other factors, such as genre. A further question is whether the association holds up once such factors are included in the model.
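
The per-million reading can be checked with a little arithmetic on the printed coefficients. This is a sketch, with values copied from the coefficient output above; the variable names are introduced here for illustration. It reports both the change in odds and, for comparison, the change in probability for a movie going from $0 to $1M in profitability.

```r
# Coefficients copied from the printed model output
b0 <- -9.072093e-02   # intercept
b1 <- -5.284544e-10   # profitability coefficient, per dollar

# Rescale from "per dollar" to "per million dollars"
log_odds_per_million <- b1 * 1e6
odds_ratio_per_million <- exp(log_odds_per_million)
pct_change_in_odds <- (odds_ratio_per_million - 1) * 100

# Change in probability when profitability goes from $0 to $1M
p_at_0  <- plogis(b0)
p_at_1m <- plogis(b0 + log_odds_per_million)
pct_point_change <- (p_at_1m - p_at_0) * 100

c(pct_change_in_odds = pct_change_in_odds,
  pct_point_change = pct_point_change)
```

With these numbers the odds of passing fall by roughly 0.05% per additional million dollars, while the probability itself falls by roughly 0.013 percentage points near zero profitability. The two quantities are easy to confuse, so it helps to state which one is being reported.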

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

Tolley

2026-03-31