First we bring in all the libraries we will be using. Then we load the data set we have downloaded.
#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(forcats)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(pwrss)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(effsize)
#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")
#remove rows with an NA in any column used in the analysis
movies_raw <- movies_raw |>
  drop_na(gross, score, rating, budget)
The next step for our data set is to clean it and format it so that we can begin to work through it.
#create a new table, splitting the released column into release date and release country
movies_ <- movies_raw |>
separate(released, into = c("release_new","country_released"), sep=" \\(") |>
mutate(country_released = str_remove(country_released, "\\)$")) |> #remove the closing parenthesis
mutate(release_date=mdy(release_new)) |> #then change the date to an easier format
rename(country_filmed=country) #rename column for ease of understanding
movies_
## # A tibble: 5,424 × 17
## name rating genre year release_new country_released score votes director
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 The Sh… R Drama 1980 June 13, 1… United States 8.4 9.27e5 Stanley…
## 2 The Bl… R Adve… 1980 July 2, 19… United States 5.8 6.5 e4 Randal …
## 3 Star W… PG Acti… 1980 June 20, 1… United States 8.7 1.20e6 Irvin K…
## 4 Airpla… PG Come… 1980 July 2, 19… United States 7.7 2.21e5 Jim Abr…
## 5 Caddys… R Come… 1980 July 25, 1… United States 7.3 1.08e5 Harold …
## 6 Friday… R Horr… 1980 May 9, 1980 United States 6.4 1.23e5 Sean S.…
## 7 The Bl… R Acti… 1980 June 20, 1… United States 7.9 1.88e5 John La…
## 8 Raging… R Biog… 1980 December 1… United States 8.2 3.30e5 Martin …
## 9 Superm… PG Acti… 1980 June 19, 1… United States 6.8 1.01e5 Richard…
## 10 The Lo… R Biog… 1980 May 16, 19… United States 7 1 e4 Walter …
## # ℹ 5,414 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## # budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## # release_date <date>
For my first hypothesis, I decided to see if movies with a higher budget tend to receive higher average scores.
My main variable is the score column in the data set. For group A and B, I decided that group A would be movies over the median budget for movies in the data set, with group B being movies under the median budget.
My null and alternative hypotheses are:
Ho - There is no significant difference in average score between high-budget and low-budget movies.
Ha - There is a significant difference between high-budget movie scores and low-budget movie scores.
For my alpha level I chose 0.01 because of how much money is involved with movies: the median budget for this data set is $21.65 million. Movie production companies cannot afford to give movies high budgets when they don't need them, which is what our hypothesis testing may help us find out, so a false positive here would be costly.
I set the power at 0.95, which means beta (the Type II error rate) is 0.05. This is because, in the context of our data set, movie companies can't afford to miss out on opportunities they could have had over other production studios, so the chance of missing a real difference should be small.
#Look at the median budget for all movies
movies_ |>
summarize(
median_budget = median(budget, na.rm=TRUE)
)
## # A tibble: 1 × 1
## median_budget
## <dbl>
## 1 21650000
#create the groups (a movie exactly at the median goes in the low-budget group so the groups do not overlap)
high_budget <- movies_ |>
filter(budget > 21650000)
low_budget <- movies_ |>
filter(budget <= 21650000)
#find the average scores
avg_highscore <- high_budget |>
summarize(avg_score = mean(score))
avg_lowscore <- low_budget |>
summarize(avg_score = mean(score))
avg_highscore
## # A tibble: 1 × 1
## avg_score
## <dbl>
## 1 6.40
avg_lowscore
## # A tibble: 1 × 1
## avg_score
## <dbl>
## 1 6.39
#find the observed difference
observed_diff <- (avg_highscore - avg_lowscore)
observed_diff
## avg_score
## 1 0.0105826
#plug in the median budget for Cohen's d (strict > on the high side so no movie is counted in both groups)
cohen.d(d = filter(movies_, budget > 21650000 ) |> pluck("score"),
f = filter(movies_, budget <= 21650000 ) |> pluck("score"))
##
## Cohen's d
##
## d estimate: 0.01099152 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.04224608 0.06422912
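As a rough cross-check on the effsize output, Cohen's d for two independent groups can be computed by hand as the difference in means divided by the pooled standard deviation. The sketch below is a minimal base R version. The helper name `cohens_d_pooled` and the toy vectors are made up for illustration, and it assumes the pooled-variance form, which I believe is the default in `effsize::cohen.d`.

```r
# Hypothetical helper (not part of effsize): Cohen's d with a pooled SD
cohens_d_pooled <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  # pooled variance weights each group's variance by its degrees of freedom
  pooled_var <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# toy vectors, not the movie data
group_a <- c(1, 2, 3)
group_b <- c(2, 3, 4)
d_toy <- cohens_d_pooled(group_a, group_b)
d_toy
```

With the movie data, the equivalent call would be `cohens_d_pooled(high_budget$score, low_budget$score)`.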
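We can see that the observed difference above is 0.0106, which is actually interesting as it is very close to the Cohen's d we also calculated (0.011). Since Cohen's d is the difference in means divided by the pooled standard deviation, the two being this close implies that the pooled standard deviation of the scores is near 1. Based on the Cohen's d, I am going to set my delta, or parameter difference, at 0.15 and round it to 0.2 to match the data set decimals.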
test <- pwrss.t.2means(mu1 = 0.2,
sd1 = 1,
kappa = 1,
power = .95, alpha = 0.01,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.95
## n1 = 893
## n2 = 893
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 1784
## Non-centrality parameter = 4.226
## Type I error rate = 0.01
## Type II error rate = 0.05
plot(test)
Using the pwrss test, we can see that we need at least 893 * 2 = 1786 movies in the data set (893 per group), which we easily have, so we can use 1800 movies as our sample.
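The sample size can be sanity-checked with base R's `power.t.test()`, using the same assumptions as the pwrss call (difference of 0.2, standard deviation of 1, alpha of 0.01, power of 0.95, two-sided). This is only a sketch of a cross-check, not a replacement for the pwrss result; the per-group n it returns should land close to 893, though I have not assumed the two functions agree to the last digit.

```r
# Base R cross-check of the sample size (stats::power.t.test ships with R)
check <- power.t.test(delta = 0.2,       # assumed difference in means
                      sd = 1,            # assumed common standard deviation
                      sig.level = 0.01,  # alpha
                      power = 0.95,      # 1 - beta
                      type = "two.sample",
                      alternative = "two.sided")

# round up, since a fraction of a movie cannot be sampled
n_per_group <- ceiling(check$n)
n_per_group

# total movies needed across both groups
2 * n_per_group
```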
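Now we need to create our bootstrapped samples. In the code below, each bootstrap resample is the same size as its budget group, and 1800 is the number of bootstrap iterations run for each group. We will then do AB Testing and Direct Simulation Testing to find our p-value.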
#create bootstrap
bootstrap <- function(x, func = mean, n_iter = 10^4) {
  # preallocated vector to be filled with the value from each iteration
  func_values <- numeric(n_iter)
  # we simulate sampling `n_iter` times
  for (i in seq_len(n_iter)) {
    # pull the sample (a vector), with replacement, same size as the original
    x_sample <- sample(x, size = length(x), replace = TRUE)
    # store this iteration's value
    func_values[i] <- func(x_sample)
  }
  return(func_values)
}
# Now we get the difference in the averages for our samples
avgs_high <- high_budget |>
pluck("score") |>
bootstrap(n_iter = 1800)
avgs_low <- low_budget |>
pluck("score") |>
bootstrap(n_iter = 1800)
diffs_in_avgs <- avgs_high - avgs_low
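Now we do our test to find the p-value.
# "demean" the bootstrapped differences to simulate mu = 0
diffs_in_avgs_d <- diffs_in_avgs - mean(diffs_in_avgs)
# observed_diff is a 1 x 1 data frame, so pull out the number before comparing;
# otherwise only the first simulated difference is compared
obs <- observed_diff$avg_score
# proportion of times the simulated difference is more extreme
paste("p-value ",
sum(abs(obs) < abs(diffs_in_avgs_d)) /
length(diffs_in_avgs_d))
Comparing against the data frame observed_diff directly printed "p-value 0.000555555555555556". That value is exactly 1/1800: because observed_diff is a 1 x 1 data frame, R compares it with only the first simulated difference, so the sum can only be 0 or 1. It is therefore not a valid p-value, and it cannot be used to reject the null hypothesis. The p-value has to be recomputed with the scalar comparison above. Until then, the evidence we do have (a negligible Cohen's d of 0.011, with a 95% confidence interval that includes 0) does not support a difference in average score between high-budget and low-budget movies.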
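One thing to watch in this step is that the observed difference must be a single number, not a data frame, when it is compared with the vector of bootstrapped differences. The sketch below wraps the demean-and-compare step in a small function that refuses anything else. The helper name `boot_p_value`, the seed, and the simulated scores are all assumptions for illustration; it is not the movie data and not the only way to run this test.

```r
# Hypothetical helper: two-sided p-value from bootstrapped differences
boot_p_value <- function(observed, boot_diffs) {
  # guard against passing a data frame or tibble instead of a single number
  stopifnot(is.numeric(observed), length(observed) == 1)
  # shift the bootstrap distribution so it is centered on 0 (the null)
  centered <- boot_diffs - mean(boot_diffs)
  # share of centered differences more extreme than the observed one
  mean(abs(centered) > abs(observed))
}

set.seed(1)
# simulated scores, not the movie data: both groups share the same true mean
scores_a <- rnorm(500, mean = 6.4, sd = 1)
scores_b <- rnorm(500, mean = 6.4, sd = 1)
observed <- mean(scores_a) - mean(scores_b)

# 2000 bootstrap resamples of each group, each the size of the original group
boot_diffs <- replicate(2000, {
  mean(sample(scores_a, replace = TRUE)) - mean(sample(scores_b, replace = TRUE))
})

p_val <- boot_p_value(observed, boot_diffs)
p_val
```

With the objects from this report, the call would be `boot_p_value(observed_diff$avg_score, diffs_in_avgs)`.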
For my second test I want to look at how a movie's rating can affect its gross revenue. One would think that an R rating would lead to a smaller gross revenue, as it keeps a lot of the population from being allowed to watch the movie. For my null and alternative hypotheses I decided on:
Ho - The R rating for a movie is independent of gross revenue
Ha - The R rating has a significant impact on the gross revenue of a movie
#create new columns to help with analyzing
#create a column that sorts movies into R rating or other
movies_$rating_group <- ifelse(movies_$rating == "R", "R", "Other")
#create a column that separates movies into high or low gross
median_gross <- median(movies_$gross)
movies_$high_gross <- ifelse(movies_$gross > median_gross, 'High Gross', 'Low Gross')
#create the contingency table for the chi-square test
ctr_table <- table(movies_$rating_group, movies_$high_gross)
ctr_table
##
## High Gross Low Gross
## Other 1731 1096
## R 981 1616
We can see from this contingency table that there looks to be a difference in revenue between the two rating groups, but now we need to actually perform the chi-square test.
chisq.test(ctr_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ctr_table
## X-squared = 296.96, df = 1, p-value < 2.2e-16
We can see that the p-value IS significant (p < 2.2e-16, far below our alpha of 0.01). This means we can reject the null hypothesis that the R rating is independent of gross revenue.
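With 5,424 movies, even a weak association can produce a very small p-value, so it can help to look at the row proportions and an effect size alongside the test. The sketch below rebuilds the 2 x 2 table from the counts printed above and computes the share of high-gross movies in each rating group, the expected counts under independence, and the phi coefficient. The phi coefficient is an addition for illustration and was not part of the original analysis.

```r
# rebuild the 2 x 2 table from the counts printed above
tab <- matrix(c(1731, 981, 1096, 1616), nrow = 2,
              dimnames = list(rating = c("Other", "R"),
                              gross = c("High Gross", "Low Gross")))

# share of each rating group that lands above / below the median gross
row_shares <- prop.table(tab, margin = 1)
row_shares

# counts expected if rating and gross level were independent
res <- chisq.test(tab)  # Yates' correction is applied by default for 2 x 2 tables
res$expected

# phi coefficient as an effect size, from the uncorrected statistic
chi_uncorrected <- chisq.test(tab, correct = FALSE)$statistic
phi <- sqrt(unname(chi_uncorrected) / sum(tab))
phi
```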
Hypothesis Test 1 - High Budget vs Low Budget
#create columns in dataset to more easily make the graphs
median_budget <- median(movies_$budget)
movies_$budget_level <- ifelse(movies_$budget > median_budget, "High Budget", "Low Budget")
ggplot(movies_, mapping = aes(x = budget_level, y = score)) +
geom_boxplot(outlier.color = "red") +
labs(title = "Distribution of Movie Scores by Budget Level",
x = "Budget Level",
y = "IMDB Movie Score")
Hypothesis Test 2 - R Rating vs Other Ratings with Gross Revenue
#Create a table based off the contingency table created earlier
gross_table <- data.frame(
rating = rep(c("Other", "R"), each = 2),
gross_level = rep(c("high gross", "low gross"), times = 2),
count = c(1731, 1096, 981, 1616)
)
ggplot(gross_table, aes(x = rating, y = count, fill = gross_level)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Count of High and Low Grossing Movies by Rating",
x = "Rating Category",
y = "Number of Movies",
fill = "Gross Level") +
theme_minimal()
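Rather than typing the counts in by hand, the plotting table could also be derived from the contingency table itself, which avoids transcription mistakes if the data change. The sketch below uses a stand-in table built from the counts printed earlier; the names `ctr_table_demo` and `gross_table_demo` are made up for illustration.

```r
# stand-in for ctr_table, rebuilt from the counts printed earlier
ctr_table_demo <- as.table(matrix(
  c(1731, 981, 1096, 1616), nrow = 2,
  dimnames = list(rating = c("Other", "R"),
                  gross_level = c("High Gross", "Low Gross"))
))

# long format: one row per rating / gross level combination
gross_table_demo <- as.data.frame(ctr_table_demo, responseName = "count")
gross_table_demo
```

The result has the same three columns as gross_table, so it can be passed to the same ggplot call. For the real table, the column names come through if the table is built with named arguments, for example `table(rating = movies_$rating_group, gross_level = movies_$high_gross)`.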