Week 9 Bechdel

#importing last week's model
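# Budget on x and profitability on y, matching lm(profitability ~ budget_2013).
# No log scale: geom_abline() would draw the fitted line in log10 units.
linearity_check <- ggplot(bechdel_data_movies,
                          aes(x = budget_2013, y = profitability, color = binary)) +
  geom_point() +
  labs(title = "Movie Profitability and Budget",
       x = "Budget (2013 USD)",
       y = "Profitability (2013 USD)")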

linreg <- lm(profitability ~ budget_2013, data = bechdel_data_movies)

checking_lin <- linearity_check + geom_abline(intercept = 63670000, slope = 3.117, color = "purple")
checking_lin

## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).

Your RMarkdown notebook for this data dive should contain the following:

Refer to the simple linear regression model you built last week. Include 1-3 more variables into your regression model.

movie_model <- lm(profitability ~ budget_2013 + imdb_votes + metascore, data = bechdel_data_movies)
summary(movie_model)

## 
## Call:
## lm(formula = profitability ~ budget_2013 + imdb_votes + metascore, 
##     data = bechdel_data_movies)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.323e+09 -1.131e+08 -2.590e+07  5.750e+07  2.716e+09 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.481e+08  2.895e+07  -5.115 3.57e-07 ***
## budget_2013  2.420e+00  1.442e-01  16.786  < 2e-16 ***
## imdb_votes   1.071e+03  6.348e+01  16.869  < 2e-16 ***
## metascore    1.918e+06  4.722e+05   4.062 5.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 279700000 on 1410 degrees of freedom
##   (380 observations deleted due to missingness)
## Multiple R-squared:  0.4252, Adjusted R-squared:  0.424 
## F-statistic: 347.6 on 3 and 1410 DF,  p-value: < 2.2e-16

-   For each new variable you try, explain why you should include it, or not. *E.g., are there any issues with multicollinearity?*

I think there would be multicollinearity between intgross_2013 and domgross_2013, because both measure how well a movie does financially: a movie that is popular domestically is more likely to be popular internationally as well. I also think budget, domgross, and intgross are collinear with their 2013 counterparts. I do not think there is collinearity between metascore and imdb_votes, because one counts how many votes a movie received and the other is a separate site's overall rating of the movie out of 100.
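
One way to put a number on the multicollinearity question is a variance inflation factor (VIF) for each predictor. The sketch below is a hand-rolled base R version and was not part of the original analysis; `vif_manual` is a hypothetical helper name, and it assumes the model has at least two predictors. Values close to 1 suggest little collinearity, and a common rule of thumb treats values above roughly 5 to 10 as a concern.

```r
# Hand-rolled variance inflation factors for a fitted lm (base R only).
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all of the other predictors in the model.
vif_manual <- function(model) {
  # Design matrix of the rows actually used in the fit, minus the intercept
  X <- model.matrix(model)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

  vapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

# Possible usage with the model fitted above:
# vif_manual(movie_model)
```

Because the function works from the model's own design matrix, it only uses the rows that survived the missing-value deletion reported in the summary.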

plots <- gg_resX(movie_model, plot.all = FALSE)

## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plots$budget_2013 +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$imdb_votes +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$metascore +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

- For each plot, point out any indications of issues with the model. Otherwise, explain how the plot supports the claim that an assumption is met.
- Try to measure the severity of any issues as well as the level of confidence you have in an assumption being met.
- For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

There appears to be some slight overestimation in the residuals vs. 2013 budget plot and some underestimation throughout the residuals vs. IMDb votes plot, and both plots show some fanning. This is evidence that the relationships between these two predictors and profitability are not linear, so the plots do not support the claim that the linearity assumption is met. The fanning also suggests that the residual variance is not constant. I am not confident the assumption is met for the 2013 budget and IMDb votes: the departure from linearity is slight, but it is there.

Also, as budget and votes increase, the residuals spread out, so the model's predictions become less reliable for high-budget and heavily voted movies. There are also a few outliers that are worth investigating further. The metascore plot, on the other hand, shows residuals scattered fairly evenly around zero, which supports a linear relationship between profitability and metascore, again with a few potential outliers in need of further investigation. This is helpful information for potential stakeholders who are looking at factors that relate to a film’s success. I am confident that there is a linear relationship between profitability and metascore.
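
To measure how severe the fanning is, one option is a Breusch-Pagan style check. The sketch below is a base R version of the studentized (Koenker) form: it regresses the squared residuals on the model's predictors and compares n times the R-squared of that auxiliary regression to a chi-squared distribution. It was not run as part of this analysis, `fanning_check` is a hypothetical helper name, and it assumes an unweighted `lm` fitted with the default handling of missing values.

```r
# Breusch-Pagan style check for non-constant residual variance (base R only).
# A small p-value suggests the residual spread changes with the predictors.
fanning_check <- function(model) {
  # Predictors used in the fit, without the intercept column
  X <- model.matrix(model)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

  # Auxiliary regression: squared residuals on the predictors
  e2 <- residuals(model)^2
  aux <- lm(e2 ~ X)

  stat <- length(e2) * summary(aux)$r.squared  # n * R^2
  k <- ncol(X)                                 # degrees of freedom
  c(statistic = stat,
    df = k,
    p.value = pchisq(stat, df = k, lower.tail = FALSE))
}

# Possible usage with the model fitted above:
# fanning_check(movie_model)
```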

ggcorr(select(bechdel_data_movies,
              budget_2013,
              imdb_votes,
              metascore), label = TRUE) +
  labs(title='Correlation Heatmap')

This correlation heatmap shows that the 2013 budget, IMDb votes, and metascore are only weakly related to each other (with metascore and 2013 budget being very weakly negatively correlated). This tells me that there is little multicollinearity risk among my chosen variables. This is important because when the predictors are not strongly related to each other, each coefficient can be read as that variable's own association with profitability rather than a blend of overlapping effects. I am confident that multicollinearity is not a serious risk with the chosen variables.

gg_reshist(movie_model)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

This histogram shows that the residuals are slightly lopsided, with most of them sitting very slightly below zero. Because a residual is the observed value minus the predicted value, this tells me that my model slightly overestimates profitability for many movies with the variables I’ve given it. I am only slightly confident that the residuals are normally distributed.
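
The lopsidedness could also be summarized numerically. The sketch below computes the sample skewness and excess kurtosis of the residuals in base R; both would be close to 0 for normally distributed residuals. It was not part of the original analysis, and `resid_shape` is a hypothetical helper name.

```r
# Sample skewness and excess kurtosis of a model's residuals (base R only).
resid_shape <- function(model) {
  e <- residuals(model)
  # Standardize using the population-style standard deviation (divide by n)
  z <- (e - mean(e)) / sqrt(mean((e - mean(e))^2))
  c(skewness = mean(z^3),
    excess_kurtosis = mean(z^4) - 3)
}

# Possible usage with the model fitted above:
# resid_shape(movie_model)
```

A positive skewness would point to a long right tail (a few movies the model underestimates badly), and a negative value to a long left tail.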

gg_qqplot(movie_model)

The QQ plot shows serious outliers that need to be investigated; they appear to be pulling the points away from the normal reference line. I would like to use Cook’s D to see whether these outliers have a lot of influence on the fit.
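
A minimal sketch of that Cook's D follow-up, using `cooks.distance()` from base R's stats package. The 4/n cutoff is only a common rule of thumb, not a formal test, and `flag_influential` is a hypothetical helper name.

```r
# Flag observations whose Cook's distance exceeds a cutoff (default 4/n).
flag_influential <- function(model, cutoff = NULL) {
  d <- cooks.distance(model)
  if (is.null(cutoff)) {
    cutoff <- 4 / length(d)  # rule-of-thumb threshold
  }
  # Return the flagged distances, most influential first;
  # names are the row labels of the observations used in the fit
  sort(d[d > cutoff], decreasing = TRUE)
}

# Possible usage with the model fitted above:
# head(flag_influential(movie_model))
```

The names on the returned vector can be used to look up the corresponding rows and see which movies drive the departure in the QQ plot.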

Tolley

2026-03-10