#install.packages("pwr")
#load the data libraries - remove or add as needed
library(tidyverse) #tools for data science; includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats as core libraries.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) #loaded to address viz issues, including currency issues
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen=999) #disable scientific notation since high values are used
library(pwr)
## Warning: package 'pwr' was built under R version 4.4.3
#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")
#cleanup of column names to avoid constant issues with special chars and inaccurate names
# Rename specific columns
colnames(movies)[which(colnames(movies) == "Release Group")] <- "MovieName"
colnames(movies)[which(colnames(movies) == "$Worldwide")] <- "WorldwideRevenue"
colnames(movies)[which(colnames(movies) == "$Domestic")] <- "DomesticRevenue"
colnames(movies)[which(colnames(movies) == "$Foreign")] <- "ForeignRevenue"
colnames(movies)[which(colnames(movies) == "Domestic %")] <- "DomesticPercentage"
colnames(movies)[which(colnames(movies) == "Foreign %")] <- "ForeignPercentage"
colnames(movies)[which(colnames(movies) == "Rank")] <- "RankForYear"
# View updated column names
#cat(colnames(movies), sep = ", ") #commented out after test
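# Aside (hypothetical alternative, not run): the same renaming in one dplyr call
# movies <- movies %>%
#   rename(MovieName = `Release Group`,
#          WorldwideRevenue = `$Worldwide`,
#          DomesticRevenue = `$Domestic`,
#          ForeignRevenue = `$Foreign`,
#          DomesticPercentage = `Domestic %`,
#          ForeignPercentage = `Foreign %`,
#          RankForYear = Rank)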
#create a version of the data set which excludes worldwide revenue outliers
# Calculate IQR for WorldwideRevenue
Q1 <- quantile(movies$WorldwideRevenue, 0.25) # First quartile
Q3 <- quantile(movies$WorldwideRevenue, 0.75) # Third quartile
IQR <- Q3 - Q1
# Define lower and upper bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Filter out outliers
movies_clean <- movies[movies$WorldwideRevenue >= lower_bound & movies$WorldwideRevenue <= upper_bound, ]
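# Aside (hypothetical alternative, not run): the same outlier filter with dplyr
# movies_clean <- movies %>%
#   filter(between(WorldwideRevenue, lower_bound, upper_bound))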
# View updated column names
cat(colnames(movies_clean), sep = ", ")
## RankForYear, MovieName, WorldwideRevenue, DomesticRevenue, DomesticPercentage, ForeignRevenue, ForeignPercentage, Year, Genres, Rating, Vote_Count, Original_Language, Production_Countries, Prime_Genre, Prime_Production_Country, Rating_scale, Rating_of_10
After having explored your dataset over the past few weeks, you should already have some questions. For this week, we will do AB Testing within the context of your data set. So, you’ll be using hypothesis testing to calculate some difference between two groups.
For each of the following, you’ll need at least one main variable, and some way to define “Group A” and “Group B” (e.g., if you have a categorical column, you could consider “category 1” and “[the rest]”). Your main variable should either be continuous or binary (i.e., “success” or “failure”).
Devise two different null hypotheses based on two different columns of data (i.e., each hypothesis gets a different column of data).
For Hypothesis 1: Determine whether you have enough data to perform a hypothesis test using the Neyman-Pearson framework (i.e., choose a test, alpha level, Type II error rate, etc., and reject or fail to reject). If you have enough data, show your sample size calculation, perform the test, and interpret the results. If there is not enough data, explain why. Make sure your alpha level, power level, and minimum effect size are intentional, i.e., explain why you chose them. A good start is to think about the practical purpose of your test and the risk of a false negative (or of a false positive) assuming your null hypothesis is true.
For Hypothesis 2: Perform a hypothesis test using Fisher's significance testing framework (i.e., p-value, recommendation, etc.). For this, you should only need to choose a test, interpret the p-value, and give some reasoning for why we should be confident in your data and your conclusions.
Build two visualizations that best illustrate your results, one for each null hypothesis.
I am interested in worldwide revenue for the genres of Action and Comedy. The column of interest is WorldwideRevenue; group A will be films with a Prime_Genre of Action and group B will be films with a Prime_Genre of Comedy.
Our interest in these genres is to see if there is a significant difference between them in relation to worldwide revenue. This might be of value to movie makers, exhibitors, or distributors when determining what type of movie to make, show, or distribute.
The null hypothesis - H0: “There is no significant difference in WorldwideRevenue between Action and Comedy films.”
The alternative hypothesis - H1: “There is a significant difference in WorldwideRevenue between Action and Comedy films.”
How many movies of each type do we have?
#check how many comedy and Action movies there are
# Filter the dataset to include only Action and Comedy films
movies_filtered <- subset(movies_clean, Prime_Genre %in% c("Action", "Comedy"))
# Count the number of Action and Comedy films
table(movies_filtered$Prime_Genre)
##
## Action Comedy
## 640 1012
There are 640 Action movies and 1,012 Comedies. This seems like a substantial amount, but we need a more scientific method to establish whether it is enough.
A power analysis is the solution, but I need more information to decide on its parameters. I believe looking at the distribution of the data and testing for normality will help.
# Visualize distribution of WorldwideRevenue for Action and Comedy
ggplot(movies_filtered,
aes(x = WorldwideRevenue,
fill = Prime_Genre)) +
geom_density(alpha = 0.5) +
#facet_wrap(~Prime_Genre) +
labs(title = "Revenue Distribution for Action and Comedy Movies",
x = "Worldwide Revenue", y = "Density")
These are not normal distributions; the densities of the two groups follow similar shapes but with some notable differences.
Next, test for normality, just to make sure.
# Test for normality for Action movies
shapiro.test(movies_filtered$WorldwideRevenue[movies_filtered$Prime_Genre == "Action"])
##
## Shapiro-Wilk normality test
##
## data: movies_filtered$WorldwideRevenue[movies_filtered$Prime_Genre == "Action"]
## W = 0.85703, p-value < 0.00000000000000022
# Test for normality for Comedy movies
shapiro.test(movies_filtered$WorldwideRevenue[movies_filtered$Prime_Genre == "Comedy"])
##
## Shapiro-Wilk normality test
##
## data: movies_filtered$WorldwideRevenue[movies_filtered$Prime_Genre == "Comedy"]
## W = 0.81499, p-value < 0.00000000000000022
The p-values for both groups are far below .05, meaning the data are not normally distributed, so we cannot use the t-test.
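For a visual complement to the Shapiro-Wilk tests, a QQ-plot sketch (optional; added for illustration using ggplot2's stat_qq, not part of the original analysis) makes the departure from normality easy to see:
# Optional visual check: QQ plots of revenue for each genre
ggplot(movies_filtered,
       aes(sample = WorldwideRevenue,
           color = Prime_Genre)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Prime_Genre) +
  labs(title = "QQ Plots of Worldwide Revenue by Genre",
       x = "Theoretical Quantiles", y = "Sample Quantiles")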
# Test for equality of variances
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
leveneTest(WorldwideRevenue ~ Prime_Genre, data = movies_filtered)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 13.33 0.0002694 ***
## 1650
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene's test also returns a very small p-value (0.00027), so the two groups do not have equal variances; combined with the non-normality, this supports using a non-parametric test rather than a standard t-test.
Alpha level: 5% seems reasonable; there is no obvious reason to change it. This is exploratory data and the impact of Type I and Type II errors is minimal. There are no extenuating circumstances that would suggest using anything other than the standard.
Power: The density curves follow similar trajectories, so I would like to set this higher, at .9. I feel comfortable with this knowing that I have over 600 rows for each group, so I expect the required sample size won't exceed what we have.
Effect size: I'm torn on this. The density plot shows a similar shape for both curves, but I think the differences will be clear yet moderate, so leaving it at a medium effect size of d = .5 is reasonable.
Two-sided vs. one-sided: While the Action movies do seem to have more high-revenue outliers, a large number of Comedies are distributed around the same point. I don't see any clear reason to adopt a one-sided test.
I'll use the pwr package in R to calculate the required sample size.
# Load the pwr package
#library(pwr)
# Calculate sample size for a two-sample test (medium effect size)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.9, type = "two.sample")
##
## Two-sample t test power calculation
##
## n = 85.03128
## d = 0.5
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
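As a cross-check, the pwr package also provides pwr.t2n.test for unequal group sizes; this sketch (added for illustration, output not shown) uses the actual counts of 640 Action and 1,012 Comedy films and should report power very close to 1 for d = 0.5:
# Cross-check: achieved power with the actual, unequal group sizes
pwr.t2n.test(n1 = 640, n2 = 1012, d = 0.5, sig.level = 0.05)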
Insight: The power calculation indicates roughly 86 movies per group are needed to detect a medium effect (d = 0.5) with 90% power at alpha = .05. With 640 Action and 1,012 Comedy films, we have far more than enough data to run the test.
If There is Enough Data, Run the Hypothesis Test; If Not, Explain Why
Since sufficient data is available:
# Perform the Mann-Whitney U test
wilcox.test(WorldwideRevenue ~ Prime_Genre, data = movies_filtered,
alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: WorldwideRevenue by Prime_Genre
## W = 375435, p-value = 0.00000004701
## alternative hypothesis: true location shift is not equal to 0
The test statistic is W = 375435.
The p-value is .00000004701.
Interpretation: The p-value is very small and far below the alpha level of .05, indicating there is a statistically significant difference in worldwide revenue between Comedy and Action movies, so we reject the null hypothesis.
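To put a rough magnitude on that difference, a quick follow-up sketch (not part of the original output; the printed values depend on the data) compares median revenue and counts per genre:
# Follow-up sketch: median worldwide revenue and film counts by genre
movies_filtered %>%
  group_by(Prime_Genre) %>%
  summarise(median_revenue = median(WorldwideRevenue),
            n = n())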
# Violin plot of WorldwideRevenue by genre
ggplot(movies_filtered,
aes(x = Prime_Genre,
y = WorldwideRevenue,
fill = Prime_Genre)) +
geom_violin(trim = TRUE,
alpha = 0.7) +
geom_boxplot(width = 0.2,
position = position_dodge(0.9),
alpha = 0.5,
outlier.color = "red") +
labs(title = "Distribution of Worldwide Revenue by Genre",
x = "Genre", y = "Worldwide Revenue") +
theme_minimal() +
scale_fill_manual(values = c("Action" = "skyblue",
"Comedy" = "orange")) +
theme(legend.position = "none")
Insight: I was experimenting with plots and found this combined box-and-violin chart. I think it does a great job drawing out the differences between the two genres. It's easy to see here that there are more Comedy movies at lower revenue values, while the Action movies have a wider spread. It's also very easy to compare the quartile ranges and see how different they are.
The null hypothesis (H₀): “There is no significant difference in Ratings between Westerns and Non-Westerns”
The alternative hypothesis (H₁): “There is a significant difference in Ratings between Westerns and Non-Westerns”
# Add IsWestern column to the dataset
movies_clean$IsWestern <- movies_clean$Prime_Genre == "Western"
# Check the first few rows to verify
#head(movies_clean)
table(movies_clean$Prime_Genre == "Western")
##
## FALSE TRUE
## 4265 7
Choose the appropriate test based on the nature of your variable and groups.
Check to see if the t-test assumptions are met:
shapiro.test(movies_clean$Rating_of_10[movies_clean$IsWestern == TRUE])
##
## Shapiro-Wilk normality test
##
## data: movies_clean$Rating_of_10[movies_clean$IsWestern == TRUE]
## W = 0.83145, p-value = 0.0826
shapiro.test(movies_clean$Rating_of_10[movies_clean$IsWestern == FALSE])
##
## Shapiro-Wilk normality test
##
## data: movies_clean$Rating_of_10[movies_clean$IsWestern == FALSE]
## W = 0.90056, p-value < 0.00000000000000022
The p-value for Westerns is greater than .05, which suggests Rating_of_10 for that group isn't too different from a normal distribution (though with only 7 Westerns, the Shapiro-Wilk test has little power to detect departures). The p-value for non-Westerns is extremely small, so we reject the Shapiro-Wilk null hypothesis: that group is significantly different from a normal distribution. This breaks the normality assumption for t-tests, so we can't use one. We'll use the Wilcoxon rank-sum test since the non-Western group is not normal.
# Perform Mann-Whitney U test (Wilcoxon rank-sum test) on movies_clean dataset
wilcox.test(Rating_of_10 ~ IsWestern,
data = movies_clean,
alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: Rating_of_10 by IsWestern
## W = 9787, p-value = 0.1149
## alternative hypothesis: true location shift is not equal to 0
Interpret the p-value:
The test statistic is W = 9787.
The p-value is .1149, which is higher than .05.
Interpretation: There isn't enough evidence to reject the null hypothesis. There might be a weak relationship that bears further investigation.
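One quick way to follow up (a sketch added for illustration, not part of the original output) is to compare rating summaries and group sizes directly, which also makes clear how small the Western group is:
# Follow-up sketch: rating summaries by group (TRUE = Western)
movies_clean %>%
  group_by(IsWestern) %>%
  summarise(median_rating = median(Rating_of_10, na.rm = TRUE),
            mean_rating = mean(Rating_of_10, na.rm = TRUE),
            n = n())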
Create a Visualization to Illustrate the Results
# Remove NA values from the dataset
movies_clean <- drop_na(movies_clean, c(IsWestern, Rating_of_10))
# Create the violin plot
ggplot(movies_clean, aes(x = IsWestern, y = Rating_of_10, fill = IsWestern)) +
geom_violin(trim = FALSE, alpha = 0.7) +
labs(title = "Violin Plot of Ratings: Westerns vs Non-Westerns",
x = "Is Western",
y = "Rating (out of 10)") +
theme_minimal() +
scale_fill_manual(values = c("#0073C2FF", "#EFC000FF")) # Optional: Custom colors
The null hypothesis is that there is no significant difference between ratings of Westerns and non-Westerns. While there does seem to be a higher concentration of ratings in the 6-7 range, this illustration supports the finding that there is not enough evidence to reject the null hypothesis. Both groups have the bulk of their data in the same general range. If the distributions had been centered at two different points on the Y scale, say one at 6 and the other at 9, we might reach a different conclusion.