Week 7 Data Dive

Importing libraries to run.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(boot)
library(binom)

## Warning: package 'binom' was built under R version 4.2.3

library(dplyr)
library(knitr)

## Warning: package 'knitr' was built under R version 4.2.3

library(pwrss)

## Warning: package 'pwrss' was built under R version 4.2.3

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(effsize)

## Warning: package 'effsize' was built under R version 4.2.3

Initially setting our directories and loading our data.

knitr::opts_knit$set(root.dir = "C:/Users/Prana/OneDrive/Documents/Topics in Info FA23(Grad)")
youtube <- read_delim("./Global Youtube Statistics.csv", delim = ",")

## Rows: 995 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Youtuber, category, Title, Country, Abbreviation, channel_type, cr...
## dbl (21): rank, subscribers, video views, uploads, video_views_rank, country...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Devising first null hypothesis and testing on it

\[ H_0: \text{Mean of subscribers for Youtube channels created in the year 2016 is equal to the mean of subscribers for Youtube channels created in the year 2017.} \]

Creating vectors that contain subscriber count of Youtube channels created in 2016 and 2017.

subs_2016 <- youtube|>
  filter(created_year == 2016) |>
  select("subscribers")
subs_2016

## # A tibble: 77 × 1
##    subscribers
##          <dbl>
##  1   106000000
##  2    89800000
##  3    54000000
##  4    45500000
##  5    43600000
##  6    43200000
##  7    41300000
##  8    38200000
##  9    36200000
## 10    34900000
## # ℹ 67 more rows

subs_2017 <- youtube|>
  filter(created_year == 2017) |>
  select("subscribers")
subs_2017

## # A tibble: 68 × 1
##    subscribers
##          <dbl>
##  1    42400000
##  2    39700000
##  3    37600000
##  4    37200000
##  5    34000000
##  6    31800000
##  7    29300000
##  8    27300000
##  9    27100000
## 10    23200000
## # ℹ 58 more rows

Finding the means and standard deviations of the vectors to calculate the sample size.

sd_2016<-sd(subs_2016[["subscribers"]])
mean_2016<-mean(subs_2016[["subscribers"]])
sd_2017<-sd(subs_2017[["subscribers"]])
mean_2017<-mean(subs_2017[["subscribers"]])

Finding minimum effect size using cohen.d

cohen.d(d = filter(youtube, created_year==2016) |> pluck("subscribers"),
        f = filter(youtube, created_year==2017) |> pluck("subscribers"))

## 
## Cohen's d
## 
## d estimate: 0.3638053 (small)
## 95 percent confidence interval:
##     lower     upper 
## 0.0321608 0.6954497

Since our calculated Cohen’s d is approximately 0.36, it falls into the “medium” range. Therefore, we can interpret this value as the minimum meaningful or practically significant effect size. Therefore our delta value is 0.36.

Calculating the sample size to be used for hypothesis testing.

# Define the known parameters
alpha <- 0.05   # Alpha level (significance level)
power <- 0.80  # Power level
delta <- 0.36  # Minimum effect size

# Define the known parameters from your data or previous knowledge
sigma_2016 <-  sd_2016 # Standard deviation of video views in 2016
sigma_2017 <-  sd_2017 # Standard deviation of video views in 2017
mu_2016 <- mean_2016 # Expected mean of video views in 2016
mu_2017 <- mean_2017 # Expected mean of video views in 2017

# Calculate the critical value for alpha (one-tailed test, right-tailed)
z_alpha <- qnorm(1 - alpha)
# Calculate the critical value for beta (one-tailed test, right-tailed)
z_beta <- qnorm(1 - power)

# Calculate the sample size
n <- ((2 * (z_alpha + z_beta)^2 * (sigma_2016^2 + sigma_2017^2)) / 
      ((mu_2016 - mu_2017 - delta)^2))

# Round up to the nearest whole number because sample size must be an integer
n <- ceiling(n)

# Print the calculated sample size
cat("Required sample size:", n, "\n")

## Required sample size: 19

We chose an alpha level of 0.05 because it’s a common significance level that strikes a balance between the risk of incorrectly rejecting the null hypothesis (Type I error) and the need to detect meaningful effects. A power level of 0.80 was selected to provide an 80% probability of correctly detecting true effects , ensuring a reasonable chance of finding real differences in subscribers between the years.

We also see that our minimum sample size is 19 which is well below our current sample size. Hence we have enough data to perform the Neyman-Pearson hypothesis test.

Now we shall run the Neyman-Pearson hypothesis test on the minimum sample size. (One-tailed test)

# Subset your data to get the required samples for the test (2016 and 2017)
sample_2016 <- youtube$subscribers[youtube$created_year == 2016][1:n]
sample_2017 <- youtube$subscribers[youtube$created_year == 2017][1:n]

# Impute missing values with the mean of the respective samples
sample_2016[is.na(sample_2016)] <- mean(sample_2016, na.rm = TRUE)
sample_2017[is.na(sample_2017)] <- mean(sample_2017, na.rm = TRUE)

# Calculate the test statistic (e.g., t-statistic for comparing means)
t_statistic <- (mean(sample_2016) - mean(sample_2017)) / sqrt((var(sample_2016)/n) + (var(sample_2017)/n))

# Make a decision based on the critical value and test statistic
critical_value <- qt(1 - alpha, df = n - 1)  # For one-tailed test (right-tailed)

if (!is.na(t_statistic) && !is.na(critical_value)) {
  if (t_statistic > critical_value) {
    cat("Reject the null hypothesis. There is a significant difference between the means.")
  } else {
    cat("Fail to reject the null hypothesis. There is no significant difference between the means.")
  }
} else {
  cat("Error: Test statistic or critical value contains missing values.")
}

## Reject the null hypothesis. There is a significant difference between the means.

# Print the test statistic, critical value, and calculated sample size
cat("\nTest Statistic:", t_statistic, "\n")

## 
## Test Statistic: 2.895515

cat("Critical Value:", critical_value, "\n")

## Critical Value: 1.734064

We see that the null hypothesis has been rejected. This means that we have an alternate hypothesis that states: \[ H_a: \text{Mean of subscribers for Youtube channels created in the year 2016 is not equal to the mean of subscribers for Youtube channels created in the year 2017.} \]

Interpretation of the result: Since our calculated test statistic (2.895515) exceeds the critical value (1.734064), we can conclude that there is strong evidence to reject the null hypothesis. This suggests that there is a significant difference in the mean number of subscribers between the years 2016 and 2017, favoring one of the years over the other. The direction of the difference (whether 2016 has more subscribers than 2017 or vice versa) can be determined by the sign of the test statistic (positive in this case)

Let us visualize this hypothesis test with a graph that illustrates the results with respect to the critical value, delta, power and alpha value.

critical_value <- 1.734064
delta <- 0.36

f_0 <- function(x) dnorm(x, mean = 0)
f_a <- function(x) dnorm(x, mean = delta)

ggplot() +
  stat_function(mapping = aes(fill = 'power'),
                fun = f_a, 
                xlim = c(critical_value, 4),
                geom = "area") +
    stat_function(mapping = aes(fill = 'alpha'),
                fun = f_0, 
                xlim = c(critical_value, 4),
                geom = "area") +
  geom_function(mapping = aes(color = 'Null Hypothesis'),
                xlim = c(-4, 4), fun = f_0) +
  geom_function(mapping = aes(color = 'Alternative Hypothesis'),
                xlim = c(-4, 4), fun = f_a) +
  geom_vline(mapping = aes(xintercept = critical_value,
                           color = "Critical Value")) +
  geom_vline(mapping = aes(xintercept = delta,
                           color = "Delta")) +
  geom_vline(mapping = aes(xintercept = 0),
             color = 'gray', linetype=2) +
  labs(title = "One-Tailed Test Illustration",
       subtitle = "(Mirror the right side for two-tailed tests.)",
       x = "Test Statistic",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-4, 4, 1)) +
  scale_fill_manual(values = c('lightblue', 'pink')) +
  scale_color_manual(values = c('darkred', 'darkorange', 'darkblue', 
                                'darkgreen')) +
  theme_minimal()

The above plot helps illustrate the concepts of power (the ability to detect a significant difference under the alternative hypothesis) and alpha (the significance level, which represents the probability of a Type I error) in the context of a one-tailed Neyman hypothesis test as done before. It shows how the test statistic compares to the critical value and the assumed means under both hypotheses.

We create a contingency table consisting of the mean of uploads and maximum uploads of India and USA so that we can apply to fisher’s test.

combined_contingency_table <- data.frame(Mean = c(mean_2016,mean_2017), Minimum=c(min(subs_2016),min(subs_2017)))
print(combined_contingency_table)

##       Mean  Minimum
## 1 23101299 12400000
## 2 18666176 12300000

Implementing Fisher’s Exact test

fisher.test(combined_contingency_table)

## Warning in fisher.test(combined_contingency_table): 'x' has been rounded to
## integer: Mean relative difference: 1.841839e-08

## 
##  Fisher's Exact Test for Count Data
## 
## data:  combined_contingency_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.226377 1.228930
## sample estimates:
## odds ratio 
##   1.227613

The p-value is a measure of the evidence against the null hypothesis. A very small p-value (in this case, less than 2.2e-16, which is essentially zero) suggests strong evidence against the null hypothesis. This aligns with our results from the Neymon-Pearson test.

In this context, an odds ratio of approximately 1.22761 suggests that there is a positive association between the variables, and the 95 percent confidence interval provides a range of plausible values for the true odds ratio.

2. Devising second null hypothesis and testing on it

\[ H_0: \text{Mean number of uploads by Youtubers in India is equal to the mean number of uploads by Youtubers in USA.} \]

Creating vectors that contain youtube channels from India and USA with their respective uploads.

total_india <- youtube|>
  filter(Country=="India") |>
  select(uploads)
total_india

## # A tibble: 168 × 1
##    uploads
##      <dbl>
##  1   20082
##  2  116536
##  3    8548
##  4   71270
##  5  129204
##  6     633
##  7    8502
##  8  112915
##  9      13
## 10    4741
## # ℹ 158 more rows

total_usa <- youtube|>
  filter(Country=="United States") |>
  select(uploads)
total_usa

## # A tibble: 313 × 1
##    uploads
##      <dbl>
##  1       1
##  2     741
##  3     966
##  4    1111
##  5     574
##  6   70127
##  7       3
##  8    2865
##  9     389
## 10   39113
## # ℹ 303 more rows

Finding the means and standard deviations of the vectors to calcuate the sample size.

sd_india<-sd(total_india[["uploads"]])
mean_india<-mean(total_india[["uploads"]])
sd_usa<-sd(total_usa[["uploads"]])
mean_usa<-mean(total_usa[["uploads"]])

Finding minimum effect size using cohen.d

cohen.d(d = filter(youtube, Country == "India") |> pluck("uploads"),
        f = filter(youtube, Country == "United States") |> pluck("uploads"))

## 
## Cohen's d
## 
## d estimate: 0.5107773 (medium)
## 95 percent confidence interval:
##     lower     upper 
## 0.3200833 0.7014713

Since our calculated Cohen’s d is approximately 0.51, it falls into the “medium” range. Therefore, we can interpret this value as the minimum meaningful or practically significant effect size. Therefore our delta value is 0.51.

Calculating the sample size to be used for hypothesis testing.

# Define the known parameters
alpha <- 0.05  # Alpha level (significance level)
power <- 0.80  # Power level
delta <- 0.51  # Minimum effect size

# Define the known parameters from your data or previous knowledge
sigma_india <-  sd_india # Standard deviation of video views in 2016
sigma_usa <-  sd_usa # Standard deviation of video views in 2017
mu_india <- mean_india # Expected mean of video views in 2016
mu_usa <- mean_usa # Expected mean of video views in 2017

# Calculate the critical values (Z scores) for alpha and beta
z_alpha <- qnorm(1 - alpha)  # Two-tailed test
z_beta <- qnorm(1 - power)

# Calculate the sample size
n <- ((2 * (z_alpha + z_beta)^2 * (sigma_india^2 + sigma_usa^2)) / 
      ((mu_india - mu_usa - delta)^2))

# Round up to the nearest whole number because sample size must be an integer
n <- ceiling(n)

# Print the calculated sample size
cat("Required sample size:", n, "\n")

## Required sample size: 14

We also see that our minimum sample size is 14 which is well below our current sample size. Hence we have enough data to perform the Neyman-Pearson hypothesis test.

Now we shall run the Neyman-Pearson hypothesis test on the minimum sample size. (One-tailed test)

# Subset your data to get the required samples for the test (2016 and 2017)
sample_india <- youtube$uploads[youtube$Country == "India"][1:n]
sample_usa <- youtube$uploads[youtube$Country == "United States"][1:n]

# Impute missing values with the mean of the respective samples
sample_india[is.na(sample_india)] <- mean(sample_india, na.rm = TRUE)
sample_usa[is.na(sample_usa)] <- mean(sample_usa, na.rm = TRUE)

# Calculate the test statistic (e.g., t-statistic for comparing means)
t_statistic <- (mean(sample_india) - mean(sample_usa)) / sqrt((var(sample_india)/n) + (var(sample_usa)/n))

# Make a decision based on the critical value and test statistic
critical_value <- qt(1 - alpha, df = n - 1)  # For one-tailed test (right-tailed)

if (!is.na(t_statistic) && !is.na(critical_value)) {
  if (t_statistic > critical_value) {
    cat("Reject the null hypothesis. There is a significant difference between the means.")
  } else {
    cat("Fail to reject the null hypothesis. There is no significant difference between the means.")
  }
} else {
  cat("Error: Test statistic or critical value contains missing values.")
}

## Reject the null hypothesis. There is a significant difference between the means.

# Print the test statistic, critical value, and calculated sample size
cat("\nTest Statistic:", t_statistic, "\n")

## 
## Test Statistic: 2.125735

cat("Critical Value:", critical_value, "\n")

## Critical Value: 1.770933

We see that the null hypothesis has been rejected. This means that we have an alternate hypothesis that states: \[ H_a: \text{Mean number of uploads by Youtubers in India is not equal to the mean number of uploads by Youtubers in USA.} \]

Interpretation of the result: Since our calculated test statistic (2.125735) exceeds the critical value (1.770933), we can conclude that there is strong evidence to reject the null hypothesis. This suggests that there is a significant difference in the mean number of uploads between India and USA, favoring India over the other. The direction of the difference (whether India has more uploads than USA or vice versa) can be determined by the sign of the test statistic (positive in this case).

Let us visualize this hypothesis test with a graph that illustrates the results with respect to the critical value, delta, power and alpha value.

critical_value <- 1.770933
delta <- 0.51

f_0 <- function(x) dnorm(x, mean = 0)
f_a <- function(x) dnorm(x, mean = delta)

ggplot() +
  stat_function(mapping = aes(fill = 'power'),
                fun = f_a, 
                xlim = c(critical_value, 4),
                geom = "area") +
    stat_function(mapping = aes(fill = 'alpha'),
                fun = f_0, 
                xlim = c(critical_value, 4),
                geom = "area") +
  geom_function(mapping = aes(color = 'Null Hypothesis'),
                xlim = c(-4, 4), fun = f_0) +
  geom_function(mapping = aes(color = 'Alternative Hypothesis'),
                xlim = c(-4, 4), fun = f_a) +
  geom_vline(mapping = aes(xintercept = critical_value,
                           color = "Critical Value")) +
  geom_vline(mapping = aes(xintercept = delta,
                           color = "Delta")) +
  geom_vline(mapping = aes(xintercept = 0),
             color = 'gray', linetype=2) +
  labs(title = "One-Tailed Test Illustration",
       subtitle = "(Mirror the right side for two-tailed tests.)",
       x = "Test Statistic",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-4, 4, 1)) +
  scale_fill_manual(values = c('lightblue', 'pink')) +
  scale_color_manual(values = c('darkred', 'darkorange', 'darkblue', 
                                'darkgreen')) +
  theme_minimal()

We create a contingency table consisting of the mean of subscribers and minimum subscribers of 2016 and 2017 so that we can apply to fisher’s test.

combined_contingency_table <- data.frame(Mean = c(mean_india,mean_usa), Maximum=c(max(total_india),max(total_usa)))
print(combined_contingency_table)

##        Mean Maximum
## 1 22191.738  301308
## 2  4152.105  200933

Implementing Fisher’s Exact test

fisher.test(combined_contingency_table)

## Warning in fisher.test(combined_contingency_table): 'x' has been rounded to
## integer: Mean relative difference: 1.394391e-05

## 
##  Fisher's Exact Test for Count Data
## 
## data:  combined_contingency_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  3.446287 3.686955
## sample estimates:
## odds ratio 
##   3.564287

Here we get the p-value <2.2e-16 as well. This is a measure of the evidence against the null hypothesis. A p-value less than your chosen significance level (in this case, less than 2.2e-16) suggests strong evidence against the null hypothesis. It indicates that there is a highly significant association or difference between the variables being tested.

The odds ratio, along with its confidence interval, provides information about the strength and direction of this association. In this context, an odds ratio of approximately 3.564287 suggests a strong positive association, and the 95 percent confidence interval helps quantify the uncertainty around this estimate.

Conclusion of this data dive:

The first null hypothesis stated that the mean number of subscribers for YouTube channels created in the year 2016 is equal to the mean of subscribers for YouTube channels created in the year 2017. To test this hypothesis, a Neyman-Pearson hypothesis test was conducted, incorporating a sample size calculation based on Cohen’s d effect size and desired power level. The results of this analysis provided compelling evidence to reject the null hypothesis. The difference in the mean number of subscribers between the two years was statistically significant, favoring one year over the other. Further visualization and Fisher’s exact tests supported the significance of this difference.

The second null hypothesis posited that the mean number of uploads by YouTubers in Japan is equal to the mean number of uploads by YouTubers in the United States. This hypothesis was rigorously tested, mirroring the approach taken in the first analysis. Once again, the Neyman-Pearson hypothesis test yielded substantial evidence to reject the null hypothesis. The mean number of uploads was found to be significantly different between the two countries, with India surpassing the United States. Visualization and Fisher’s exact tests confirmed the statistical significance of this disparity.

In conclusion, our analyses demonstrate that statistically significant differences exist in subscriber counts and upload rates across different years and countries. These findings underscore the dynamic nature of YouTube’s ecosystem and emphasize the need for content creators to adapt to changing audience behaviors and preferences.

Week 7 Data Dive

2023-10-03