model2_graph <- ggplot(data = binary_model2, aes(x = profitability, y = binary_num)) +
  geom_jitter() +
  geom_smooth(method = 'lm', se = FALSE)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggplot2 package.
## Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
model2_graph
## `geom_smooth()` using formula = 'y ~ x'
binary_model2$coefficients
## (Intercept) profitability
## -9.072093e-02 -5.284544e-10
sigmoid <- \(x) 1 / (1 + exp(-(-0.091 - 0.000000000528 * x)))
bechdel_data_movies |>
  ggplot(mapping = aes(x = profitability, y = binary_num)) +
  geom_jitter() +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Fitted logistic curve: Bechdel result vs. profitability")
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).
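The hand-written sigmoid above is the inverse logit applied to the model's linear predictor. As a quick sanity check, here is a minimal sketch showing that it matches base R's `plogis()`. The coefficients are copied from the printed output above at full precision; `sigmoid_full` and `profit_grid` are names made up for this example, and the grid values are illustrative, not taken from the data.

```r
# Coefficients copied from the printed model output above.
b0 <- -9.072093e-02   # (Intercept)
b1 <- -5.284544e-10   # profitability

# Hand-written inverse logit, same form as the notebook's sigmoid().
sigmoid_full <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))

# Hypothetical profit values for illustration only (assumed to be in dollars).
profit_grid <- c(0, 1e8, 1e9, 3e9)

# plogis() is the logistic CDF in base R, so the two columns should agree.
data.frame(
  profit  = profit_grid,
  by_hand = sigmoid_full(profit_grid),
  plogis  = plogis(b0 + b1 * profit_grid)
)
```

In the notebook itself, pulling the values from `coef(binary_model2)` instead of typing rounded numbers would keep the curve in sync with the model if it is refit.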
Your RMarkdown notebook for this data dive should contain the following:
Build a linear (or generalized linear) model as you like
Use the tools from previous weeks to diagnose the model
y <- binary_model2$y
phat <- binary_model2$fitted.values
loss <- -(y * log(phat) + (1 - y) * log(1 - phat))
loss_model_df <- model.frame(binary_model2)
loss_model_df$loss <- loss
loss_graph <- ggplot(loss_model_df) +
  geom_boxplot(mapping = aes(x = as_factor(binary_num), y = loss)) +
  labs(title = 'Log loss by Bechdel result')
loss_graph
plot(binary_model2, which = 4, id.n = 3)
loss_model_df["1558", ]
## binary_num profitability loss
## 1558 1 2206384238 1.507136
res <- residuals(binary_model2, type = "deviance")
residual_df <- data.frame(residuals = res)
residuals_graph <- ggplot(residual_df, aes(x = residuals)) + geom_histogram()
residuals_graph
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Highlight any issues with the model:
According to the box plot, the model appears to fit movies that fail the Bechdel test slightly better than movies that pass it. The Cook's distance plot flags observations 1446, 1558, and 1784 as the three most influential points. All three are passing movies, with profits of about 3.8B, 3.1B, and 2.2B (observation 1558, printed above, is the 2.2B one). Because high-grossing movies that pass the Bechdel test have so much influence on the fit, the model would be skewed even further against passing movies without those three observations. The histogram of deviance residuals is bimodal, which is expected for a binary GLM, and it skews slightly left, which suggests the model is slightly under-predicting movies that pass the Bechdel test. This is consistent with the other diagnostic graphs above.
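To make the two diagnostics above easier to reproduce, here is a minimal sketch on simulated data (not the Bechdel data set). It pulls out the three largest Cook's distances directly, which are the same points `plot(fit, which = 4)` labels, and summarizes log loss by outcome class. The data frame `sim`, its columns, and the simulation parameters are all made up for illustration.

```r
set.seed(1)  # simulated data, NOT the Bechdel data set

n <- 200
sim <- data.frame(x = rexp(n, rate = 1))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.1 - 0.3 * sim$x))

fit <- glm(y ~ x, data = sim, family = binomial)

# Three largest Cook's distances, named by row label.
cooks <- cooks.distance(fit)
top3  <- sort(cooks, decreasing = TRUE)[1:3]
top3

# Look up the corresponding rows to see what makes them influential.
sim[names(top3), ]

# Per-observation log loss, then its mean within each outcome class.
phat <- fitted(fit)
loss <- -(sim$y * log(phat) + (1 - sim$y) * log(1 - phat))
tapply(loss, sim$y, mean)
```

Comparing the per-class means gives a numeric counterpart to the box plot, and looking up the flagged rows is how the profit figures for the influential movies can be checked one by one.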
Interpret at least one of the coefficients
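One way to read the coefficients is on the odds scale. The sketch below uses the values from the printed output above and assumes that `profitability` is measured in dollars and that `binary_num = 1` means the movie passes the test; both are assumptions based on the output shown, and the names `baseline_prob` and `or_per_100m` are made up for this example.

```r
# Coefficients copied from the printed model output above.
b0 <- -9.072093e-02   # (Intercept)
b1 <- -5.284544e-10   # profitability (assumed to be in dollars)

# Predicted probability of passing when profitability is 0.
baseline_prob <- plogis(b0)

# Multiplicative change in the odds of passing per extra 100 million in profit.
# The raw slope is per single dollar, so it is rescaled to a readable unit.
or_per_100m <- exp(b1 * 1e8)

c(baseline_prob = baseline_prob, or_per_100m = or_per_100m)
```

Under those assumptions, the baseline probability comes out at about 0.48, and each additional 100 million in profit multiplies the odds of passing by roughly 0.95, that is, about a 5% decrease in the odds. Whether that slope is statistically distinguishable from zero would need the standard errors from `summary(binary_model2)`, which are not shown here.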
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.