#loading libraries and data into the file
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
bechdel_data_movies <- read_csv("C:/Users/Lauren/Documents/Stats Data/movies.csv")
## Rows: 1794 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): imdb, title, test, clean_test, binary, domgross, intgross, code, d...
## dbl (7): year, budget, budget_2013, period_code, decade_code, metascore, im...
## num (1): imdb_votes
## lgl (2): response, error
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# convert the inflation-adjusted dollar columns to numeric; unparseable entries become NA
bechdel_data_movies$domgross_2013 <- as.numeric(bechdel_data_movies$domgross_2013)
## Warning: NAs introduced by coercion
bechdel_data_movies$intgross_2013 <- as.numeric(bechdel_data_movies$intgross_2013)
## Warning: NAs introduced by coercion
bechdel_data_movies$budget_2013 <- as.numeric(bechdel_data_movies$budget_2013)
# profitability = domestic gross + international gross - budget, all in 2013 USD
bechdel_data_movies <- bechdel_data_movies |>
  mutate(profitability = (domgross_2013 + intgross_2013) - budget_2013)
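One detail about the conversion step is worth spelling out: `as.numeric()` has no `na.rm` argument, so entries that cannot be parsed still become `NA` and have to be handled where a summary is computed (for example with `mean(..., na.rm = TRUE)`). The sketch below is a minimal illustration on made-up values; it assumes the missing entries are stored as the string "#N/A", which is a guess about the file rather than something confirmed here.

```r
# Hypothetical values for illustration only; not taken from movies.csv
gross_chr <- c("1000", "#N/A", "2500")

# Recode the assumed placeholder string to NA first, so the conversion is silent
gross_chr[gross_chr == "#N/A"] <- NA
gross_num <- as.numeric(gross_chr)

# Missing values are dealt with where the summary is computed
n_missing <- sum(is.na(gross_num))
mean_gross <- mean(gross_num, na.rm = TRUE)

print(n_missing)
print(mean_gross)
```

Counting the `NA` values this way also makes it easy to confirm how many rows will later be dropped from plots and models.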
The purpose of this week’s data dive is for you to get experience running ANOVA tests and building regression models.
Your RMarkdown notebook for this data dive should contain the following:
Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.
Select a categorical column of data (explanatory variable) that you expect might influence the response variable.
Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results (e.g., use box plots). Be clear about how the R output relates to your conclusions.
# checking that the assumptions under ANOVA are true
profitability_chart <- ggplot(bechdel_data_movies) +
  aes(x = binary, y = profitability, color = binary) +
  geom_boxplot() +
  labs(title = "Verifying Homogeneity", x = "Passes Bechdel Test", y = "Profitability in 2013 USD")
profitability_chart
## Warning: Removed 18 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplots show a comparable spread in both categories, so the equal-variance assumption looks reasonable and we will continue with the ANOVA testing.
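A numeric check could back up the visual comparison. The sketch below uses a small made-up data frame (the values and the name `toy` are hypothetical) and compares the group standard deviations. A common rule of thumb, not a formal test, treats the equal-variance assumption as reasonable when the larger standard deviation is less than about twice the smaller one.

```r
# Hypothetical data for illustration only; not the movie data
toy <- data.frame(
  binary = rep(c("FAIL", "PASS"), each = 5),
  profitability = c(10, 14, 9, 20, 12, 8, 11, 15, 7, 9)
)

# Standard deviation of the response within each group
group_sd <- tapply(toy$profitability, toy$binary, sd)

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- max(group_sd) / min(group_sd)

print(group_sd)
print(sd_ratio)
```

On the real data, the same idea could be applied with `tapply(bechdel_data_movies$profitability, bechdel_data_movies$binary, sd, na.rm = TRUE)`.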
anova_1_bechdel <- oneway.test(profitability ~ binary,
  data = bechdel_data_movies,
  var.equal = TRUE
)
anova_1_bechdel
##
## One-way analysis of means
##
## data: profitability and binary
## F = 13.903, num df = 1, denom df = 1774, p-value = 0.0001985
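With only two groups, a one-way ANOVA with `var.equal = TRUE` is equivalent to a pooled two-sample t-test: the F statistic equals the square of the t statistic, and the p-values match. A minimal sketch on made-up data (hypothetical values, not the movie data) shows the relationship:

```r
# Hypothetical data for illustration only; not the movie data
toy <- data.frame(
  binary = rep(c("FAIL", "PASS"), each = 5),
  profitability = c(10, 14, 9, 20, 12, 8, 11, 15, 7, 9)
)

# One-way ANOVA assuming equal variances, as in the analysis above
anova_toy <- oneway.test(profitability ~ binary, data = toy, var.equal = TRUE)

# Pooled-variance two-sample t-test on the same data
ttest_toy <- t.test(profitability ~ binary, data = toy, var.equal = TRUE)

f_stat <- unname(anova_toy$statistic)
t_stat <- unname(ttest_toy$statistic)

print(c(F = f_stat, t_squared = t_stat^2))
print(c(anova_p = anova_toy$p.value, ttest_p = ttest_toy$p.value))
```

Setting `var.equal = FALSE` in `oneway.test()` gives the Welch version, which does not assume equal variances and could serve as a robustness check.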
Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.
The null hypothesis is that mean profitability is the same for movies that pass and movies that fail the Bechdel test. The test gives F = 13.903 with a p-value of 0.0001985, well below 0.05, so we reject the null hypothesis: the variation between the two groups is large relative to the variation within each group, and a difference in means this large would be unlikely if the groups truly shared the same mean. This shows an association rather than a cause, since the data are observational and other factors (budget or genre, for example) could drive the gap. Because there are only two categories, no follow-up comparison is needed; it is sufficient to know that the two means differ.
This matters because profitability is an important factor when movies are pitched and created, and if passing (or failing) the Bechdel test helps predict profitability, then people making those decisions should know. Again, I would love to investigate whether the genres that pass or fail the Bechdel test influence, or are influenced by, budget and profitability.
Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.
#checking for linearity
# explanatory variable (budget) on x, response (profitability) on y, both on linear scales
linearity_check <- ggplot(bechdel_data_movies, aes(x = budget_2013, y = profitability, color = binary)) +
  geom_point() +
  labs(title = "Movie Profitability and Budget", x = "Budget (2013 USD)", y = "Profitability (2013 USD)")
linearity_check
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).
The relationship looks only roughly linear, but that is close enough to move ahead with a simple linear model.
linreg <- lm(profitability ~ budget_2013, data = bechdel_data_movies)
linreg
##
## Call:
## lm(formula = profitability ~ budget_2013, data = bechdel_data_movies)
##
## Coefficients:
## (Intercept) budget_2013
## 6.367e+07 3.117e+00
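To read these coefficients, it can help to plug in a budget. The sketch below uses the rounded coefficients printed above and a hypothetical budget of $100M; the names `intercept`, `slope`, and `budget` are illustrative only.

```r
# Rounded coefficients copied from the lm() output above
intercept <- 6.367e7
slope <- 3.117

# Hypothetical budget in 2013 USD, chosen only for illustration
budget <- 1e8

# Fitted line: predicted profitability = intercept + slope * budget
predicted_profit <- intercept + slope * budget
print(predicted_profit)
```

This comes to roughly $375M in predicted profitability. With the fitted model object available, `predict(linreg, newdata = data.frame(budget_2013 = 1e8))` would give the unrounded value.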
#adding linreg to linearity_check
checking_lin <- linearity_check + geom_abline(intercept = coef(linreg)[["(Intercept)"]], slope = coef(linreg)[["budget_2013"]], color = "purple")
checking_lin
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).
Build a linear regression model of the response using just this column, and evaluate its fit.
Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?
The intercept (about $63.7M) is the profitability the model predicts for a movie with a budget of zero, so it is an extrapolation rather than a spending threshold. The slope implies that each additional dollar of budget is associated with about $3.12 more profit on average, and that average includes movies that aren’t profitable at all. This may help explain how big-budget blockbuster movies get to be so common: in this data, larger budgets go with larger profits. A small budget means you risk less, but it is also associated with less profit, if any. Because the data are observational, this is an association and not a guarantee that spending more will cause a movie to earn more.
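Fit could be evaluated with `summary()`, which reports R-squared and coefficient standard errors, and `confint()`, which gives confidence intervals for the coefficients. The sketch below uses simulated data so that it runs on its own; all values and the names `sim` and `fit` are hypothetical and are not results from the movie data.

```r
# Simulated data for illustration only; not the movie data
set.seed(1)
sim <- data.frame(budget = runif(200, min = 1e6, max = 2e8))
sim$profit <- 5e7 + 3 * sim$budget + rnorm(200, sd = 1.5e8)

# Simple linear regression of the response on one explanatory variable
fit <- lm(profit ~ budget, data = sim)
fit_summary <- summary(fit)

# Share of the variance in the response explained by the model
r_squared <- fit_summary$r.squared

# 95% confidence interval for the slope
slope_ci <- confint(fit)["budget", ]

print(r_squared)
print(slope_ci)
```

On the real model, `summary(linreg)$r.squared` gives the share of the variance in profitability explained by budget, and `plot(linreg, which = 1)` draws residuals against fitted values to check for curvature or uneven spread.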
I would still like to explore how profitability and budget_2013 are related to the movie’s genre.
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.