Data Preparation

library(dplyr)
library(ggplot2)
library(tidyr)
# load data
movies_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv"
movies_df <- read.csv(url(movies_url))

str(movies_df)

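# Convert the gross revenue columns to integer.
# Entries that are not numeric, or that exceed .Machine$integer.max, become NA with a warning.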
movies_df$domgross <- as.integer(movies_df$domgross)
movies_df$intgross <- as.integer(movies_df$intgross)
movies_df$domgross_2013. <- as.integer(movies_df$domgross_2013.)
movies_df$intgross_2013. <- as.integer(movies_df$intgross_2013.)
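
Because `as.integer()` replaces values it cannot represent with `NA` and raises only a warning, it can help to count the missing values after the conversion. The sketch below shows one way to do that; `count_missing` is a hypothetical helper name introduced here for illustration, not part of the original analysis.

```r
# Hypothetical helper: number of NA values in each named column.
count_missing <- function(df, cols) {
  vapply(df[cols], function(x) sum(is.na(x)), integer(1))
}

# Usage, assuming movies_df from the code above is in the workspace.
if (exists("movies_df")) {
  print(count_missing(
    movies_df,
    c("domgross", "intgross", "domgross_2013.", "intgross_2013.")
  ))
}
```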

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is there a significant difference in gross box office revenue between movies that do and do not pass the Bechdel test?
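
One possible way to examine this question is a Wilcoxon rank-sum test comparing movies that pass with movies that fail. This is a sketch under stated assumptions, not the analysis carried out in this proposal: the helper name `compare_gross` is hypothetical, `clean_test == "ok"` is assumed to mean "passes", and a rank-based test is chosen only because box office revenue tends to be strongly skewed.

```r
# Hypothetical helper: rank-sum comparison of revenue for passing vs. failing movies.
# Assumes clean_test == "ok" marks a pass and any other value marks a fail.
compare_gross <- function(df, gross_col = "domgross_2013.", test_col = "clean_test") {
  gross <- df[[gross_col]]
  passes <- df[[test_col]] == "ok"
  keep <- !is.na(gross) & !is.na(passes)  # drop rows missing either value
  wilcox.test(gross[keep & passes], gross[keep & !passes])
}

# Usage, assuming movies_df from the data preparation step is in the workspace.
if (exists("movies_df")) {
  print(compare_gross(movies_df))
}
```

Because the study is observational, a significant result from such a test would indicate an association, not a causal effect of passing the test on revenue.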

Cases

What are the cases, and how many are there?

Each case represents a movie released in the years 1970-2013. There are 1,794 observations in the data set.
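
The case count and the range of release years can be checked directly against the data. The sketch below assumes the release year is stored in a column named `year`; `describe_cases` is a hypothetical helper name.

```r
# Hypothetical helper: number of cases and range of release years.
# Assumes the release year is stored in the column named by year_col.
describe_cases <- function(df, year_col = "year") {
  list(
    n_cases = nrow(df),
    year_range = range(df[[year_col]], na.rm = TRUE)
  )
}

# Usage, assuming movies_df from the data preparation step is in the workspace.
if (exists("movies_df")) {
  print(describe_cases(movies_df))
}
```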

Data collection

Describe the method of data collection.

FiveThirtyEight analysts collected the data from two websites: BechdelTest.com, where movie viewers submit information about movies and whether or not they pass the Bechdel test, and The-Numbers.com, which supplied box office data for the movies in question.

Type of study

What type of study is this (observational/experiment)?

This study is observational.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Hickey, Walt. “The Dollars and Cents Case Against Hollywood’s Exclusion of Women.” FiveThirtyEight, 2014. https://github.com/fivethirtyeight/data/tree/master/bechdel

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is the domestic gross box office revenue, which is quantitative. The summaries below use the inflation-adjusted column domgross_2013.

Independent Variable(s)

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variable is whether or not a movie passes the Bechdel test (and, if it fails, why it fails). This is a categorical variable.
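
To compare passing and failing movies directly, the detailed categories can be collapsed into a two-level factor. The sketch below assumes that the `clean_test` value `ok` means the movie passes and that every other value is a reason for failing; `add_pass_fail` and the new column `bechdel_pass` are hypothetical names.

```r
# Hypothetical helper: add a two-level pass/fail factor derived from clean_test.
# Assumes "ok" means the movie passes; all other categories count as a fail.
add_pass_fail <- function(df, test_col = "clean_test") {
  df$bechdel_pass <- factor(
    ifelse(df[[test_col]] == "ok", "pass", "fail"),
    levels = c("fail", "pass")
  )
  df
}

# Usage, assuming movies_df from the data preparation step is in the workspace.
if (exists("movies_df")) {
  print(table(add_pass_fail(movies_df)$bechdel_pass))
}
```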

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

table(movies_df$clean_test)
## 
## dubious     men  notalk nowomen      ok 
##     142     194     514     141     803
summary(movies_df$domgross_2013.)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 8.990e+02 2.055e+07 5.599e+07 9.517e+07 1.217e+08 1.772e+09        18
movies_df %>%
  drop_na(domgross_2013.) %>%
  ggplot(aes(x = "", y = domgross_2013.)) +
  geom_boxplot() +
  labs(title = "Domestic Gross Box Office (adjusted for inflation)", x = "", y = "") +
  coord_flip()

movies_df %>%
  drop_na(domgross_2013.) %>%
  ggplot(aes(x = domgross_2013.)) +
  geom_histogram(binwidth = 15000000) +
  labs(title = "Domestic Gross Box Office (adjusted for inflation)", x = "Domestic Gross Box Office", y = "Frequency")
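
Since the research question compares revenue across Bechdel test outcomes, a per-category summary and a grouped boxplot may be more directly relevant than the overall distribution. The following is a minimal sketch: `summarise_gross_by_test` is a hypothetical helper name, and the log-scaled axis is an assumption made to cope with the heavy right skew visible in the summary above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: count, median, and mean of a revenue column per test category.
summarise_gross_by_test <- function(df, gross_col = "domgross_2013.",
                                    test_col = "clean_test") {
  df %>%
    filter(!is.na(.data[[gross_col]])) %>%   # drop movies with missing revenue
    group_by(across(all_of(test_col))) %>%
    summarise(
      n = n(),
      median_gross = median(.data[[gross_col]]),
      mean_gross = mean(.data[[gross_col]]),
      .groups = "drop"
    )
}

# Usage, assuming movies_df from the data preparation step is in the workspace.
if (exists("movies_df")) {
  print(summarise_gross_by_test(movies_df))

  p <- movies_df %>%
    filter(!is.na(domgross_2013.)) %>%
    ggplot(aes(x = clean_test, y = domgross_2013.)) +
    geom_boxplot() +
    scale_y_log10() +   # assumption: log scale to handle the right skew
    labs(title = "Domestic Gross Box Office by Bechdel Test Result",
         x = "Bechdel test result", y = "Domestic gross (log scale)")
  print(p)
}
```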