R Notebook - Week 7

Introduction

This week, I will conduct hypothesis testing on the TMDB TV Show dataset, which contains detailed information on various TV shows, such as ratings, genres, episodes, and more. Over the past few weeks, I have explored this dataset, and now I will apply hypothesis testing to investigate two different questions related to TV shows’ characteristics and their impact on ratings.

Hypothesis

Hypothesis 1: Relationship Between the Number of Episodes and Vote Average

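Null Hypothesis (H0): There is no linear relationship between the average number of episodes per season and the average vote rating of a TV show (the true correlation is zero).

Alternative Hypothesis (H1): There is a linear relationship between the two variables (the true correlation is not zero).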

Neyman-Pearson Framework

To test this hypothesis, I will use the Neyman-Pearson framework. This involves choosing an appropriate test, defining the significance level (α), and calculating the sample size to determine if I have enough data.

Test Selection: I will use Pearson’s Correlation test to determine if there is a linear relationship between the number of episodes per season and the vote average.

Significance Level (α): I will use α = 0.05, accepting a 5% probability of a Type I error (rejecting a null hypothesis that is actually true).

Power and Type II Error (β): I will aim for a power of 0.8 (β = 0.2) to limit the probability of a Type II error (failing to detect a true relationship). This means I want an 80% chance of detecting a correlation of at least 0.3 in absolute value, a medium effect by Cohen's conventions, if such a correlation exists.

Sample Size Calculation

Using a power analysis based on Cohen's methods (pwr.r.test() from the pwr package), I will calculate the sample size required to achieve the desired power for this effect size.

# Load necessary libraries
if(!require(pwr)){
  install.packages("pwr", dependencies=TRUE)
  library(pwr)
}

## Loading required package: pwr

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Sample size calculation for correlation test
effect_size <- 0.3  # Medium effect by Cohen's conventions (0.1 small, 0.3 medium, 0.5 large)
power_level <- 0.8
alpha_level <- 0.05

# Calculate sample size
sample_size <- pwr.r.test(r = effect_size, power = power_level, sig.level = alpha_level)$n
sample_size

## [1] 84.07364

The required sample size rounds up to 85 observations. The dataset contains far more than this (168,639 rows before filtering), so we can proceed with the hypothesis test.
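
As a rough cross-check on the pwr result, the required sample size can also be approximated by hand with the Fisher z-transformation. The sketch below uses only base R and the textbook shortcut n ≈ ((z_alpha + z_power) / atanh(r))^2 + 3. This is an approximation, not necessarily what pwr.r.test() computes internally, so a small discrepancy from 84.07 is expected.

```r
# Approximate sample size for detecting a correlation r with a two-sided test,
# using the Fisher z-transformation (base R only).
effect_size <- 0.3
power_level <- 0.8
alpha_level <- 0.05

z_alpha <- qnorm(1 - alpha_level / 2)  # critical value for a two-sided test
z_power <- qnorm(power_level)          # normal quantile for the target power

n_approx <- ((z_alpha + z_power) / atanh(effect_size))^2 + 3
n_approx           # roughly 84.9, close to the value reported by pwr.r.test()
ceiling(n_approx)  # round up to a whole number of observations
```

Both approaches point to a minimum of about 85 complete observations.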

Hypothesis Test

# Load the dataset (replace with your data)
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Create a new column for average episodes per season
tv_data <- tv_data |> 
  filter(!is.na(number_of_episodes), !is.na(number_of_seasons), number_of_seasons != 0) |> 
  mutate(avg_episodes_per_season = number_of_episodes / number_of_seasons)

# Perform correlation test between avg_episodes_per_season and vote_average
# (cor.test() drops incomplete pairs itself; it has no `use` argument)
correlation_test <- cor.test(tv_data$avg_episodes_per_season, tv_data$vote_average)
correlation_test

## 
##  Pearson's product-moment correlation
## 
## data:  tv_data$avg_episodes_per_season and tv_data$vote_average
## t = 34.104, df = 146209, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08375136 0.09392199
## sample estimates:
##        cor 
## 0.08883899

Interpretation of Results

p-value: The p-value is extremely small (less than 2.2e-16), far below the chosen significance level of 0.05, so we reject the null hypothesis: the correlation between average episodes per season and vote average is statistically significant. With more than 146,000 complete pairs (df = 146209), however, even a very small correlation produces a tiny p-value. Significance here tells us the correlation is unlikely to be exactly zero, not that it is large.

Correlation Coefficient (r): The correlation coefficient (r) is approximately 0.0888, which indicates a very weak positive correlation between the average number of episodes per season and the vote average. This means that while there is a slight tendency for shows with more episodes per season to have higher ratings, the relationship is very weak, and the number of episodes per season is not a strong predictor of the vote average.

95% Confidence Interval: The 95% confidence interval for the correlation coefficient is [0.0838, 0.0939]. This means we are 95% confident that the true correlation between the average number of episodes per season and the vote average lies between 0.0838 and 0.0939, further confirming that the relationship, while statistically significant, is weak.
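
To make the link between the reported quantities concrete, the sketch below recomputes the t-statistic from r and the degrees of freedom using the standard relation t = r * sqrt(df / (1 - r^2)). It also reports r^2, the share of variance in vote average that is linearly associated with episodes per season. The numbers are copied from the cor.test() output above; the variable names are illustrative.

```r
# Values copied from the cor.test() output
r  <- 0.08883899
df <- 146209

# t-statistic implied by r and df
t_stat <- r * sqrt(df / (1 - r^2))
t_stat     # approximately 34.10, in line with the reported t = 34.104

# Proportion of variance explained by a linear relationship
r_squared <- r^2
r_squared  # below 0.01, i.e. under 1% of the variance
```

The large t-statistic comes almost entirely from the sample size: the same r with df = 100 would give a t-value below 1.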

Visualization

# Scatter plot: Average Episodes per Season vs Vote Average
ggplot(tv_data, aes(x = avg_episodes_per_season, y = vote_average)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Number of Episodes per Season vs. Vote Average", 
       x = "Average Episodes per Season", y = "Vote Average") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Interpretation: This scatter plot shows average episodes per season against vote average, with a fitted regression line in red. Given r ≈ 0.089 from the correlation test, the line should slope only slightly upward, consistent with a weak positive association. With roughly 146,000 points, heavy overplotting can hide where most of the data lies. Outliers, if any, can be flagged for further investigation.
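
With this many points, a binned plot may show the density of the data more clearly than individual blue dots. The following is a minimal sketch on simulated data (the data frame sim_data and its values are hypothetical stand-ins, not the TMDB data); the same layers could be applied to tv_data.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for tv_data (hypothetical values, not the TMDB data)
sim_data <- data.frame(
  avg_episodes_per_season = rexp(5000, rate = 1 / 12),
  vote_average = runif(5000, min = 0, max = 10)
)

# 2D bins coloured by count instead of overplotted points
p <- ggplot(sim_data, aes(x = avg_episodes_per_season, y = vote_average)) +
  geom_bin2d(bins = 50) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(title = "Episodes per Season vs. Vote Average (binned)",
       x = "Average Episodes per Season", y = "Vote Average") +
  theme_minimal()
p
```

A simpler alternative is to keep geom_point() and lower its alpha so that dense regions appear darker.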

Hypothesis 2: Vote Average Differences by Genre

Null Hypothesis (H0): There is no significant difference in the average vote rating between Drama and Comedy genres.

Fisher’s Significance Testing Framework

To test this hypothesis, I will use Fisher’s significance testing framework, focusing on the p-value to determine whether there is enough evidence to reject the null hypothesis.

Test Selection: I will use a two-sample t-test to compare the mean vote averages of the Drama and Comedy genres. R's t.test() defaults to Welch's version, which does not assume equal variances in the two groups.

Hypothesis Test

# Filter dataset for Drama and Comedy genres
# (%in% matches the whole string, so only shows labelled exactly "Drama" or "Comedy" are kept)
drama_vs_comedy <- tv_data |>
  filter(genres %in% c("Drama", "Comedy"))

# Perform t-test between Drama and Comedy vote averages
t_test_result <- t.test(vote_average ~ genres, data = drama_vs_comedy)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  vote_average by genres
## t = 7.9335, df = 20840, p-value = 2.238e-15
## alternative hypothesis: true difference in means between group Comedy and group Drama is not equal to 0
## 95 percent confidence interval:
##  0.2784945 0.4612617
## sample estimates:
## mean in group Comedy  mean in group Drama 
##             3.625033             3.255154
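
One caveat about the scope of this comparison: %in% compares the entire genres value, so a show tagged with several genres would not be included. The sketch below illustrates the difference on a toy vector. It assumes multi-genre shows are stored as comma-separated strings such as "Comedy, Drama", which is an assumption about the file's format worth checking with unique(tv_data$genres). The toy values are hypothetical.

```r
# Hypothetical genre strings, assuming a comma-separated format
genres <- c("Drama", "Comedy", "Comedy, Drama", "Crime, Drama", "Animation")

# Exact matching, as used in the filter above
exact_match <- genres %in% c("Drama", "Comedy")
exact_match     # TRUE TRUE FALSE FALSE FALSE

# Substring matching picks up multi-genre shows as well
contains_drama <- grepl("Drama", genres, fixed = TRUE)
contains_drama  # TRUE FALSE TRUE TRUE FALSE
```

Exact matching is a defensible choice here, because a show carrying both labels would otherwise fall into both groups and violate the independence assumption of the two-sample t-test. The results should then be read as applying to single-genre shows.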

Interpretation of Results

p-value: The p-value is extremely small (2.238e-15), far below the conventional significance level of 0.05. This allows us to reject the null hypothesis that the true difference in means between Comedy and Drama is zero. In other words, there is very strong evidence that the average vote ratings for Comedy and Drama TV shows differ.

t-statistic: The t-statistic of 7.9335 indicates that the observed difference in mean vote average is about eight times its standard error. A t-value further from zero means stronger evidence that the means differ.

Confidence Interval: The 95% confidence interval for the difference in means (Comedy minus Drama) ranges from 0.2785 to 0.4613. This means that we are 95% confident that the true difference in average vote ratings lies in this range. Since the interval does not contain zero, it reinforces the conclusion that there is a significant difference between the two genres’ vote averages.

Mean Vote Average by Group:
- Mean in group Comedy: The average vote rating for Comedy shows is 3.6250.
- Mean in group Drama: The average vote rating for Drama shows is 3.2552.

This suggests that, on average, Comedy shows have higher vote ratings than Drama shows by about 0.37 points, a small but statistically significant margin.
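
The reported quantities are tied together by simple arithmetic, which the sketch below retraces: the difference in means divided by the t-statistic gives the implied standard error, and the confidence interval is the difference plus or minus a t-quantile times that standard error. All inputs are copied from the t.test() output above; the variable names are illustrative.

```r
# Values copied from the Welch t-test output
mean_comedy <- 3.625033
mean_drama  <- 3.255154
t_stat      <- 7.9335
df          <- 20840

diff_means <- mean_comedy - mean_drama  # about 0.370
std_error  <- diff_means / t_stat       # standard error implied by t
margin     <- qt(0.975, df) * std_error # half-width of the 95% interval

ci <- c(diff_means - margin, diff_means + margin)
ci  # approximately 0.2785 to 0.4613, matching the reported interval
```

Because the recomputed interval agrees with the output, the difference of about 0.37 points and its interval can be quoted directly from the test result.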

Visualization

# Boxplot: Drama vs Comedy Vote Average
ggplot(drama_vs_comedy, aes(x = genres, y = vote_average, fill = genres)) +
  geom_boxplot() +
  labs(title = "Vote Average Comparison: Drama vs Comedy", 
       x = "Genres", y = "Vote Average") +
  theme_minimal()

Interpretation: The boxplot compares the distributions of vote average for the Drama and Comedy genres. A boxplot marks medians and quartiles rather than means, so it complements the t-test instead of displaying the tested quantity directly. If the boxes overlap heavily, that is consistent with a difference that is statistically significant yet small relative to the spread of ratings within each genre.

Insights:

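Hypothesis 1: The correlation test rejects the null hypothesis (p < 2.2e-16), so there is a statistically significant relationship between the average number of episodes per season and vote average. The relationship is very weak (r ≈ 0.089), so episodes per season has little practical value as a predictor of ratings.

Hypothesis 2: The t-test rejects the null hypothesis (p = 2.238e-15). Comedy shows average about 0.37 points higher than Drama shows (3.63 vs. 3.26), which suggests that audience ratings differ between the two genres, although the gap is modest.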

Next steps:

What other factors, such as networks or production companies, might influence vote averages?
How do outliers affect the overall trend, and what can be done to better understand their impact?
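
As a starting point for the outlier question, the toy example below shows how a single extreme point can change Pearson's correlation, and how the rank-based Spearman correlation is less affected. The vectors x and y are made up for illustration and are not drawn from the TMDB data.

```r
# Ten points with a perfect negative relationship, plus one extreme point
x <- c(1:10, 100)
y <- c(10:1, 100)

pearson_without <- cor(x[1:10], y[1:10])          # -1: perfectly negative
pearson_with    <- cor(x, y)                      # strongly positive (about 0.98)
spearman_with   <- cor(x, y, method = "spearman") # -0.5: still negative

c(pearson_without, pearson_with, spearman_with)
```

Comparing Pearson and Spearman estimates on tv_data, or rerunning the test after trimming extreme values of avg_episodes_per_season, would be one way to gauge how much outliers drive the result.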