Week 7 Data Dive - Hypothesis Testing

Initialization Step 1 - Load Libraries

#install.packages("pwr")

#load the data libraries - remove or add as needed
library(tidyverse)   #tools form data science, included ggplot2, dplyr, tidyr, readr, tibble, stringr, and forcats as core libraries.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales)      #loaded to address viz issues, including currency issues

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

options(scipen=999)  #disable scientific notation since high values are used
library(pwr)

## Warning: package 'pwr' was built under R version 4.4.3

Initialization Step 2 - Load Data Set

#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")

## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")

#cleanup of column names to avoid constant issues with special chars and innaccurate names
# Rename specific columns
colnames(movies)[which(colnames(movies) == "Release Group")] <- "MovieName"
colnames(movies)[which(colnames(movies) == "$Worldwide")] <- "WorldwideRevenue"
colnames(movies)[which(colnames(movies) == "$Domestic")] <- "DomesticRevenue"
colnames(movies)[which(colnames(movies) == "$Foreign")] <- "ForeignRevenue"
colnames(movies)[which(colnames(movies) == "Domestic %")] <- "DomesticPercentage"
colnames(movies)[which(colnames(movies) == "Foreign %")] <- "ForeignPercentage"
colnames(movies)[which(colnames(movies) == "Rank")] <- "RankForYear"

# View updated column names
#cat(colnames(movies), sep = ", ")             #commented out after test

#create a version of the data set which excludes worldwide revenue outliers
# Calculate IQR for WorldwideRevenue
Q1 <- quantile(movies$WorldwideRevenue, 0.25) # First quartile
Q3 <- quantile(movies$WorldwideRevenue, 0.75) # Third quartile
IQR <- Q3 - Q1

# Define lower and upper bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Filter out outliers
movies_clean <- movies[movies$WorldwideRevenue >= lower_bound & movies$WorldwideRevenue <= upper_bound, ]

# View updated column names
cat(colnames(movies_clean), sep = ", ")

## RankForYear, MovieName, WorldwideRevenue, DomesticRevenue, DomesticPercentage, ForeignRevenue, ForeignPercentage, Year, Genres, Rating, Vote_Count, Original_Language, Production_Countries, Prime_Genre, Prime_Production_Country, Rating_scale, Rating_of_10

Task Demonstrations

The assignment

After having explored your dataset over the past few weeks, you should already have some questions. For this week, we will do AB Testing within the context of your data set. So, you’ll be using hypothesis testing to calculate some difference between two groups.

For each of the following, you’ll need at least one main variable, and some way to define “Group A” and “Group B” (e.g., if you have a categorical column, you could consider “category 1” and “[the rest]”). Your main variable should either be continuous or binary (i.e., “success” or “failure”).

Devise two different null hypotheses based on two different columns data (i.e., each hypothesis gets a different column of data). For Hypothesis 1: Determine if you have enough data to perform a hypothesis test using the Neyman-Pearson framework (i.e., choose a test, alpha level, Type 2 Error, etc., and reject or fail-to reject …). If you have enough data, show your sample size calculation, perform the test, and interpret results. If there is not enough data, explain why. Make sure your alpha level, power level, and minimum effect size are intentional, i.e., explain why you chose them. A good start here is to think about the practical purpose of your test, and the risk of a False Negative (or of a False Positive …) assuming your null hypothesis is true. For Hypothesis 2: Perform a hypothesis test using Fisher’s Significance Testing framework (i.e., p-value, recommendation, etc.). For this, you should only need to choose a test, interpret the p-value, and give some reasoning for why we should be confident in your data and your conclusions. Build two visualizations that best illustrate your results, one for each null hypothesis.

Hypothesis 1 (Neyman-Pearson)

Steps for Hypothesis Test 1 (Neyman-Pearson Framework)

Choose the Column to Use for Hypothesis Test 1

I am interested in Worldwide revenue by the genres of Action and Comedy. The column of interest are WorldwideRevenue, group A will be the prime_genre of Action and group B will be the prime_genre of Comedy
Our interest in these genres is to see if there is a significant difference between them in relation to worldwide revenue. This might be of value to movie makers, exhibitors, or distributors when determining what type of movie to make, show, or distribute.

Develop Null Hypothesis 1

The null hypothesis - H₀: “There is no significant difference in WorldwideRevenue between Action and Comedy films.”
The alternative hypothesis - H₁: “There is a significant difference in WorldwideRevenue between Action and Comedy films.”

Check if There is Enough Data to Perform a Hypothesis Test That Uses the Neyman-Pearson Framework

How many movies of each type do we have?

#check how many comedy and Action movies there are
# Filter the dataset to include only Action and Comedy films
movies_filtered <- subset(movies_clean, Prime_Genre %in% c("Action", "Comedy"))

# Count the number of Action and Comedy films
table(movies_filtered$Prime_Genre)

## 
## Action Comedy 
##    640   1012

There are 640 Action Movies and 1012 Comedies, this seems like a substantial amount but we need a more scientific method to establish if this is enough.

A power test is the solution, but I need more information to decide what the parameters are. I believe looking at the distribution of the data and testing for normality will help.

# Visualize distribution of WorldwideRevenue for Action and Comedy
ggplot(movies_filtered, 
       aes(x = WorldwideRevenue, 
           fill = Prime_Genre)) +
  geom_density(alpha = 0.5) +
  #facet_wrap(~Prime_Genre) +
  labs(title = "Revenue Distribution for Action and Comedy Movies",
       x = "Worldwide Revenue", y = "Density")

These are not normal distributuions and they show the desnsity of the data is similar but with some significant differences.

Next, test for normality, just to make sure.

# Test for normality for Action movies
shapiro.test(movies_filtered$WorldwideRevenue[movies_filtered$Prime_Genre == "Action"])

## 
##  Shapiro-Wilk normality test
## 
## data:  movies_filtered$WorldwideRevenue[movies_filtered$Prime_Genre == "Action"]
## W = 0.85703, p-value < 0.00000000000000022

# Test for normality for Comedy movies
shapiro.test(movies_filtered$WorldwideRevenue[movies_filtered$Prime_Genre == "Comedy"])

## 
##  Shapiro-Wilk normality test
## 
## data:  movies_filtered$WorldwideRevenue[movies_filtered$Prime_Genre == "Comedy"]
## W = 0.81499, p-value < 0.00000000000000022

The P values for both groups are significantly less that .05 meaning the data is not normally distributed so we cannot use the t-test.

# Test for equality of variances
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

leveneTest(WorldwideRevenue ~ Prime_Genre, data = movies_filtered)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    1   13.33 0.0002694 ***
##       1650                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Decide the Parameters Intentionally:

alpha level: 5% seems reasonable no obvious reason to change. This is exploratory data and the impact of type I and II errors is minimal. There are no extenuating circumstances that would suggest using anything other than the standard.
Power: The density paths follow similar trajectories so I think I’d like to make this higher maybe .9. I feel comfortable with this also knowing that I have over 600 rows for each group, so I imagine the size needed won’t exceed what we have.
Effect size: I’m torn on this. The density map shows a similar slope to both lines but I think the dfferences will be clear but moderate so I think leaving it at .5 is ok.
Two vs one sided. While the action movies do seem to have more outliers with higher revenue comedies there seems to be a large number of comedies distributed around the same point. I don’t think there is any clear reason to adopt a one-sided test.

Perform a Sample Size Calculation:

Use tools like the pwr package in R to calculate the required sample size.

# Load the pwr package
#library(pwr)

# Calculate sample size for a two-sample test (medium effect size)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.9, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 85.03128
##               d = 0.5
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Compare with Your Dataset:

This shows that we need approximately 85 rows of data for each group. As the data includes 640 action and 1012 Comedy, the data set is large enough.

Insight

I get the value of this test now. When I was deciding which groups to use I stuck with categories with larger numbers as that’s where much of the data is. This test however would be useful if I wanted to make a comparison to a smaller group. This test would help me determine if the data set is too small to return valuable results.

Run the Test

If There is Enough Data, Run the Hypothesis Test; If Not, Explain Why
1. If sufficient data is available:
  - Conduct the hypothesis test using R (e.g., t-test, ANOVA).

# Perform the Mann-Whitney U test
wilcox.test(WorldwideRevenue ~ Prime_Genre, data = movies_filtered, 
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  WorldwideRevenue by Prime_Genre
## W = 375435, p-value = 0.00000004701
## alternative hypothesis: true location shift is not equal to 0

Report the results (e.g., test statistic, p-value) and interpret the findings.

Test statistic is 375435
The p-value is .00000004701
Interpretation: The p-value is very small and is far below the alpha level of .05 indicating there is a statistically significant difference in worldwide
Revenue between Comedy and Action movies, which in trun rejects the null hypothesis.

Create a Visualization to Illustrate the Results

Design a plot that shows the differences (or lack thereof) between the two groups for the chosen variable.

# Violin plot of WorldwideRevenue by genre
ggplot(movies_filtered, 
       aes(x = Prime_Genre, 
           y = WorldwideRevenue, 
           fill = Prime_Genre)) +
  geom_violin(trim = TRUE, 
              alpha = 0.7) +
  geom_boxplot(width = 0.2, 
               position = position_dodge(0.9), 
               alpha = 0.5, 
               outlier.color = "red") +
  labs(title = "Distribution of Worldwide Revenue by Genre",
       x = "Genre", y = "Worldwide Revenue") +
  theme_minimal() +
  scale_fill_manual(values = c("Action" = "skyblue", 
                               "Comedy" = "orange")) +
  theme(legend.position = "none")

Insight - I was experimenting with plots and found this combined box and violin chart. I think it does a great job drawing the differences between the two genres. It’s easy to see here that there are more Comedy movies at a lower revenue value while the Action Movies have a wider spread. It’s also very easy to compare the quartile ranges and see how different they are.

Steps for Hypothesis Test 2 (Fisher’s Significance Testing Framework)

Choose the Column to Use for Hypothesis Test 2

For this test I am going to look at the Rating_of_10 column and the groups Western and Not Western which I’ll determine by adding an IsWestern column that is TRUE for Westerns and FALSE otherwise.

Develop Null Hypothesis 2

The null hypothesis (H₀): “There is no significant difference in Ratings between Westerns and Non-Westerns”
The alternative hypothesis (H₁): “There is significant difference in Ratings between Westerns and Non-Westerns”

# Add IsWestern column to the dataset
movies_clean$IsWestern <- movies_clean$Prime_Genre == "Western"

# Check the first few rows to verify
#head(movies_clean)

table(movies_clean$Prime_Genre == "Western")

## 
## FALSE  TRUE 
##  4265     7

Select the Statistical Test

Choose the appropriate test based on the nature of your variable and groups.
Check to see if the t-test assumptions are met:
- Normaility

shapiro.test(movies_clean$Rating_of_10[movies_clean$IsWestern == TRUE])

## 
##  Shapiro-Wilk normality test
## 
## data:  movies_clean$Rating_of_10[movies_clean$IsWestern == TRUE]
## W = 0.83145, p-value = 0.0826

shapiro.test(movies_clean$Rating_of_10[movies_clean$IsWestern == FALSE])

## 
##  Shapiro-Wilk normality test
## 
## data:  movies_clean$Rating_of_10[movies_clean$IsWestern == FALSE]
## W = 0.90056, p-value < 0.00000000000000022

The p-value for Westerns is greater that .05 so this suggest the Rating_of_10 isn’t too different from a normal distribution. The p-value for Not Westerns is extremely small so we reject the null hypothesis as it is significantly different than a normal distribution. This breaks the normality rule for t-tests so we can’t use it. We’ll use Wilcoxon Ranks sum test as the not Westerns group is not normal.

Perform the Hypothesis Test and Interpret the p-Value

# Perform Mann-Whitney U test (Wilcoxon rank-sum test) on movies_clean dataset
wilcox.test(Rating_of_10 ~ IsWestern, 
            data = movies_clean, 
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Rating_of_10 by IsWestern
## W = 9787, p-value = 0.1149
## alternative hypothesis: true location shift is not equal to 0

Interpret the p-value:
- test statistic is 9787
- The p-value is .1149 which is higher than .05
- Interpretation: There isn’t enough evidence to reject the null hypothesis. There might be a weak relationship that bears further investigation.
Create a Visualization to Illustrate the Results `

# Remove NA values from the dataset
movies_clean <- drop_na(movies_clean, c(IsWestern, Rating_of_10))

# Create the violin plot
ggplot(movies_clean, aes(x = IsWestern, y = Rating_of_10, fill = IsWestern)) +
  geom_violin(trim = FALSE, alpha = 0.7) +
  labs(title = "Violin Plot of Ratings: Westerns vs Non-Westerns", 
       x = "Is Western", 
       y = "Rating (out of 10)") +
  theme_minimal() +
  scale_fill_manual(values = c("#0073C2FF", "#EFC000FF")) # Optional: Custom colors

Analysis:

The null hypothesis is that there is no significant difference between ratings of Western’s and non-Westerns. While there does seem to be a higher concentration of rating in the 6-7 range this illustration supports the finding that there is not enough information to reject the null hypothesis. Both groups have the bulk of their data in the same general range. If the distributions had been at two different points of the Y scale, one at 6 and the other at 9 we might reach a different conclusion.

H510_DJH_wk7_DataDive_v04

DJH

2025-04-06