This week, I will conduct hypothesis testing on the TMDB TV Show dataset, which contains detailed information on various TV shows, such as ratings, genres, episodes, and more. Over the past few weeks, I have explored this dataset, and now I will apply hypothesis testing to investigate two different questions related to TV shows’ characteristics and their impact on ratings.
Null Hypothesis (H0): There is no relationship between the average number of episodes per season and the average vote rating of a TV show.
Neyman-Pearson Framework
To test this hypothesis, I will use the Neyman-Pearson framework. This involves choosing an appropriate test, fixing the significance level (α) and the desired power in advance, and calculating the required sample size to check whether I have enough data.
Test Selection: I will use Pearson’s Correlation test to determine if there is a linear relationship between the number of episodes per season and the vote average.
Significance Level (α): I will choose α = 0.05, accepting a 5% chance of a Type I error (rejecting a null hypothesis that is actually true).
Power and Type II Error (β): I will aim for a power of 0.8 (β = 0.2) to limit the likelihood of a Type II error (failing to detect a true relationship). In other words, if the true correlation is at least 0.3 in magnitude (a medium effect by Cohen's conventions), the test should detect it 80% of the time.
Sample Size Calculation
Using the pwr package, which implements Cohen's power analysis, I will calculate the sample size required to achieve the desired power for this effect size.
# Load necessary libraries
if (!require(pwr)) {
  install.packages("pwr", dependencies = TRUE)
  library(pwr)
}
library(readr)
library(dplyr)
library(ggplot2)
# Sample size calculation for correlation test
effect_size <- 0.3 # Medium effect by Cohen's convention for r
power_level <- 0.8
alpha_level <- 0.05
# Calculate sample size
sample_size <- pwr.r.test(r = effect_size, power = power_level, sig.level = alpha_level)$n
sample_size
## [1] 84.07364
The power analysis calls for at least 85 observations (84.07 rounded up). The dataset contains far more than this (168,639 rows before filtering), so we can proceed with the hypothesis test.
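As a rough cross-check on this figure, the required sample size can be approximated by hand with the Fisher z-transformation. The sketch below uses only base R; the variable names are illustrative, and the formula is the standard large-sample approximation rather than the exact routine inside pwr.r.test(), so the two results can differ by a fraction of an observation.

```r
# Approximate sample size for detecting a correlation r with a two-sided test,
# using the Fisher z-transformation: n ~ ((z_alpha + z_power) / atanh(r))^2 + 3
effect_size <- 0.3
power_level <- 0.8
alpha_level <- 0.05

z_alpha <- qnorm(1 - alpha_level / 2)  # critical value for a two-sided test
z_power <- qnorm(power_level)          # normal quantile matching the desired power

n_approx   <- ((z_alpha + z_power) / atanh(effect_size))^2 + 3
n_required <- ceiling(n_approx)        # round up to a whole observation
n_required
```

Rounding up gives the same whole-number requirement as rounding up the pwr.r.test() result of 84.07.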
Hypothesis Test
# Load the dataset (replace with your data)
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl (2): adult, in_production
## date (2): first_air_date, last_air_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create a new column for average episodes per season
tv_data <- tv_data |>
  filter(!is.na(number_of_episodes), !is.na(number_of_seasons), number_of_seasons != 0) |>
  mutate(avg_episodes_per_season = number_of_episodes / number_of_seasons)
# Perform correlation test between avg_episodes_per_season and vote_average
# (cor.test() drops incomplete pairs itself; `use` is an argument of cor(), not cor.test())
correlation_test <- cor.test(tv_data$avg_episodes_per_season, tv_data$vote_average)
correlation_test
##
## Pearson's product-moment correlation
##
## data: tv_data$avg_episodes_per_season and tv_data$vote_average
## t = 34.104, df = 146209, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08375136 0.09392199
## sample estimates:
## cor
## 0.08883899
Interpretation of Results
Correlation Coefficient (r): The correlation coefficient (r) is approximately 0.0888, which indicates a very weak positive correlation between the average number of episodes per season and the vote average. This means that while there is a slight tendency for shows with more episodes per season to have higher ratings, the relationship is very weak, and the number of episodes per season is not a strong predictor of the vote average.
95% Confidence interval: The 95% confidence interval for the correlation coefficient is [0.0838, 0.0939]. This means we are 95% confident that the true correlation between the average number of episodes per season and the vote average lies between 0.0838 and 0.0939, further confirming that the relationship, while statistically significant, is weak.
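To make the gap between statistical and practical significance concrete, the reported numbers can be re-derived from r and the degrees of freedom alone. This is a minimal base-R sketch: the values are copied from the cor.test() output above, the variable names are illustrative, and the interval uses the usual Fisher z approximation.

```r
# Values copied from the cor.test() output
r  <- 0.08883899
df <- 146209
n  <- df + 2                      # number of complete pairs used by the test

# Share of variance in vote_average linearly explained by episodes per season
r_squared <- r^2

# t statistic for H0: rho = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(c(r_squared = r_squared, t = t_stat, lower = ci[1], upper = ci[2]), 4)
```

Here r_squared comes out below 0.01, so less than 1% of the variation in vote averages is linearly associated with episodes per season. The tiny p-value reflects the sample size of roughly 146,000 shows rather than a strong effect.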
Visualization
# Scatter plot: Average Episodes per Season vs Vote Average
ggplot(tv_data, aes(x = avg_episodes_per_season, y = vote_average)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Number of Episodes per Season vs. Vote Average",
       x = "Average Episodes per Season", y = "Vote Average") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: This scatter plot shows the relationship between the average number of episodes per season and the vote average. The red regression line slopes upward, matching the weak positive correlation from the test (r ≈ 0.089, p < 2.2e-16): more episodes per season are associated with marginally higher vote averages, but the scatter around the line shows how little of the variation in ratings this explains. Outliers, if any, can be highlighted for further investigation.
Null Hypothesis (H0): There is no difference in the mean vote rating between the Drama and Comedy genres.
Fisher’s Significance Testing Framework
To test this hypothesis, I will use Fisher’s significance testing framework, treating the p-value as a measure of the strength of evidence against the null hypothesis.
Test Selection: I will use a two-sample t-test (Welch's version, the default in R, which does not assume equal variances) to compare the mean vote averages of the Drama and Comedy genres.
Hypothesis Test
# Filter dataset for Drama and Comedy genres
# (`%in%` keeps only rows whose genres value is exactly "Drama" or "Comedy")
drama_vs_comedy <- tv_data |>
  filter(genres %in% c("Drama", "Comedy"))
# Perform t-test between Drama and Comedy vote averages
t_test_result <- t.test(vote_average ~ genres, data = drama_vs_comedy)
t_test_result
##
## Welch Two Sample t-test
##
## data: vote_average by genres
## t = 7.9335, df = 20840, p-value = 2.238e-15
## alternative hypothesis: true difference in means between group Comedy and group Drama is not equal to 0
## 95 percent confidence interval:
## 0.2784945 0.4612617
## sample estimates:
## mean in group Comedy mean in group Drama
## 3.625033 3.255154
Interpretation of Results
t-statistic: The t-statistic of 7.9335 indicates that the difference between the mean vote averages of the Comedy and Drama genres is large relative to its standard error. The corresponding p-value (2.238e-15) is far below any conventional threshold, so there is strong evidence that the means are different.
Confidence Interval: The 95% confidence interval for the difference in means between the Comedy and Drama genres ranges from 0.2785 to 0.4613. This means that we are 95% confident that the true difference in average vote ratings between Comedy and Drama lies between 0.2785 and 0.4613. Since this interval does not contain zero, it reinforces the conclusion that there is a significant difference between the two genres’ vote averages.
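The interval reported in the t.test() output can be cross-checked against the other numbers it prints. The sketch below, which uses only base R and illustrative variable names, recovers the standard error from the reported difference and t-statistic and then rebuilds the 95% interval. Small discrepancies are expected because the printed values are rounded.

```r
# Values copied from the Welch t-test output
mean_comedy <- 3.625033
mean_drama  <- 3.255154
t_stat      <- 7.9335
df          <- 20840

diff_means <- mean_comedy - mean_drama   # estimated difference (Comedy - Drama)
std_error  <- diff_means / t_stat        # since t = difference / standard error

# 95% confidence interval: difference +/- critical t value * standard error
t_crit <- qt(0.975, df)
ci <- diff_means + c(-1, 1) * t_crit * std_error
round(c(difference = diff_means, lower = ci[1], upper = ci[2]), 4)
```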
Mean Vote Average by Group:
Comedy: 3.625
Drama: 3.255
This suggests that, on average, Comedy shows have higher vote ratings than Drama shows by about 0.37 points, a small but statistically significant margin.
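One caveat about the filter used for this test: `genres %in% c("Drama", "Comedy")` keeps only rows whose genres value is exactly one of those two strings. If the genres column stores several genres in one comma-separated string (an assumption about this dataset, not something verified here), multi-genre shows are silently dropped. The toy data frame below is hypothetical and only illustrates the difference between exact matching and substring matching.

```r
# Hypothetical example rows; not taken from the TMDB dataset
toy <- data.frame(
  genres = c("Drama", "Comedy", "Drama, Crime", "Comedy, Family", "Documentary"),
  stringsAsFactors = FALSE
)

# Exact matching, as in the filter above: only single-genre rows qualify
exact_match <- toy$genres %in% c("Drama", "Comedy")

# Substring matching: any row that lists the genre qualifies
has_drama  <- grepl("Drama",  toy$genres, fixed = TRUE)
has_comedy <- grepl("Comedy", toy$genres, fixed = TRUE)

sum(exact_match)             # rows kept by exact matching
sum(has_drama | has_comedy)  # rows that mention either genre
```

With substring matching, a show listing both genres would fall into both groups, so the two samples would no longer be independent. Such rows would need to be excluded or handled separately before running a t-test.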
Visualization
# Boxplot: Drama vs Comedy Vote Average
ggplot(drama_vs_comedy, aes(x = genres, y = vote_average, fill = genres)) +
  geom_boxplot() +
  labs(title = "Vote Average Comparison: Drama vs Comedy",
       x = "Genres", y = "Vote Average") +
  theme_minimal()
Interpretation: The boxplot compares the distribution of vote averages for the Drama and Comedy genres. Note that the line inside each box marks the median, not the mean that the t-test compares. Any overlap between the boxes indicates that, although the difference in means is statistically significant (p = 2.238e-15), the ratings of individual shows in the two genres still cover much of the same range.
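Hypothesis 1: The p-value (< 2.2e-16) is far below α = 0.05, so we reject the null hypothesis and conclude that there is a relationship between the average number of episodes per season and vote average. With r ≈ 0.089, however, the relationship is well below the 0.3 effect size the test was designed to detect and is too weak to be of much practical use.
Hypothesis 2: The p-value from the t-test (2.238e-15) is significant, which suggests that ratings for Drama and Comedy shows differ, with Comedy rated about 0.37 points higher on average.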