library(dplyr)
library(ggplot2)
library(tidyr)
# load data
movies_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv"
movies_df <- read.csv(url(movies_url))
str(movies_df)
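# Convert the gross columns to numeric; non-numeric entries become NA.
# as.numeric is used rather than as.integer because values above about
# 2.1 billion exceed R's 32-bit integer range and would be coerced to NA.
movies_df$domgross <- as.numeric(movies_df$domgross)
movies_df$intgross <- as.numeric(movies_df$intgross)
movies_df$domgross_2013. <- as.numeric(movies_df$domgross_2013.)
movies_df$intgross_2013. <- as.numeric(movies_df$intgross_2013.)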
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Is there a significant difference in gross box office revenue between movies that do and do not pass the Bechdel test?
What are the cases, and how many are there?
Each case represents a movie released between 1970 and 2013. There are 1,794 observations in the data set.
Describe the method of data collection.
The data were collected by FiveThirtyEight analysts from two websites: BechdelTest.com, where viewers submit whether or not a movie passes the Bechdel test, and The-Numbers.com, which supplied the box office data for the movies in question.
What type of study is this (observational/experiment)?
This study is observational.
If you collected the data, state self-collected. If not, provide a citation/link.
Hickey, Walt. “The Dollars and Cents Case Against Hollywood’s Exclusion of Women.” FiveThirtyEight, 2014. https://github.com/fivethirtyeight/data/tree/master/bechdel
What is the response variable? Is it quantitative or qualitative?
The response variable is the domestic gross box office revenue, which is quantitative.
What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variable is whether or not a movie passes the Bechdel test (and, if it fails, why it fails). This is a categorical variable.
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g., scatter plots, boxplots). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
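If a two-level pass/fail comparison is wanted, the five test outcomes can be collapsed into a single indicator. The following is a minimal sketch, assuming that `ok` is the only passing level of `clean_test`. The helper `add_pass_flag` and the column `passes_bechdel` are hypothetical names that are not part of the original analysis, and the toy rows are made up purely to show the call.

```r
library(dplyr)

# Hypothetical helper: flag movies that pass the Bechdel test.
# Assumption: "ok" is the only passing level of clean_test.
add_pass_flag <- function(df) {
  df %>% mutate(passes_bechdel = clean_test == "ok")
}

# Toy rows (made up, not taken from the data set) to illustrate the call.
toy_df <- data.frame(clean_test = c("ok", "notalk", "men", "ok", "dubious"))
flagged_df <- add_pass_flag(toy_df)
table(flagged_df$passes_bechdel)
```

On the real data, `add_pass_flag(movies_df)` would add the same column to the full data frame.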
table(movies_df$clean_test)
## 
## dubious     men  notalk nowomen      ok 
##     142     194     514     141     803 
summary(movies_df$domgross_2013.)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 8.990e+02 2.055e+07 5.599e+07 9.517e+07 1.217e+08 1.772e+09        18 
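Because the research question compares revenue across test outcomes, per-group summaries may be more informative than the overall summary alone. This is a rough sketch of one way to compute them. The helper `summarise_gross_by_test` is a hypothetical name, and the toy values are made up only to demonstrate the call; they are not results from the data set.

```r
library(dplyr)

# Hypothetical helper: summary statistics of inflation-adjusted domestic
# gross within each level of clean_test, ignoring rows with missing gross.
summarise_gross_by_test <- function(df) {
  df %>%
    filter(!is.na(domgross_2013.)) %>%
    group_by(clean_test) %>%
    summarise(
      n = n(),
      median_gross = median(domgross_2013.),
      mean_gross = mean(domgross_2013.),
      sd_gross = sd(domgross_2013.),
      .groups = "drop"
    )
}

# Toy rows (made up) to illustrate the call; pass movies_df for the real data.
toy_df <- data.frame(
  clean_test = c("ok", "ok", "notalk", "notalk", "men"),
  domgross_2013. = c(10, 30, 20, NA, 50)
)
gross_summary <- summarise_gross_by_test(toy_df)
gross_summary
```

Given how strongly right-skewed box office revenue tends to be, the median is likely a more representative centre than the mean for each group.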
# Boxplot of inflation-adjusted domestic gross (rows with missing values dropped)
movies_df %>%
  drop_na(domgross_2013.) %>%
  ggplot(aes(x = "", y = domgross_2013.)) +
  geom_boxplot() +
  labs(title = "Domestic Gross Box Office (adjusted for inflation)", x = "", y = "") +
  coord_flip()
# Histogram of the same variable, using bins of 15 million dollars
movies_df %>%
  drop_na(domgross_2013.) %>%
  ggplot(aes(x = domgross_2013.)) +
  geom_histogram(binwidth = 15000000) +
  labs(title = "Domestic Gross Box Office (adjusted for inflation)", x = "Domestic Gross Box Office", y = "Frequency")
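The plots above show the response variable on its own. A visualization tied more directly to the research question would split the boxplot by test outcome. The following is a sketch of that idea, reusing the same plotting calls. The helper `plot_gross_by_test` is a hypothetical name, and the toy rows are made up only so the example runs without downloading the data.

```r
library(dplyr)
library(ggplot2)
library(tidyr)

# Hypothetical helper: one boxplot of inflation-adjusted domestic gross
# per level of clean_test, after dropping rows with missing gross.
plot_gross_by_test <- function(df) {
  df %>%
    drop_na(domgross_2013.) %>%
    ggplot(aes(x = clean_test, y = domgross_2013.)) +
    geom_boxplot() +
    labs(
      title = "Domestic Gross Box Office by Bechdel Test Result (adjusted for inflation)",
      x = "Bechdel test result",
      y = "Domestic Gross Box Office"
    ) +
    coord_flip()
}

# Toy rows (made up) to illustrate the call; pass movies_df for the real data.
toy_df <- data.frame(
  clean_test = c("ok", "ok", "ok", "notalk", "notalk", "notalk"),
  domgross_2013. = c(10, 30, 50, 20, 40, NA)
)
gross_by_test_plot <- plot_gross_by_test(toy_df)
print(gross_by_test_plot)
```

Calling `plot_gross_by_test(movies_df)` would draw the five groups side by side on a common revenue axis.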