Week 8 Data Dive

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(boot)
library(binom)

## Warning: package 'binom' was built under R version 4.2.3

library(dplyr)
library(knitr)

## Warning: package 'knitr' was built under R version 4.2.3

library(pwrss)

## Warning: package 'pwrss' was built under R version 4.2.3

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(effsize)

## Warning: package 'effsize' was built under R version 4.2.3

Initially setting our directories and loading our data.

knitr::opts_knit$set(root.dir = "C:/Users/Prana/OneDrive/Documents/Topics in Info FA23(Grad)")
youtube <- read_delim("./Global Youtube Statistics.csv", delim = ",")

## Rows: 995 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Youtuber, category, Title, Country, Abbreviation, channel_type, cr...
## dbl (21): rank, subscribers, video views, uploads, video_views_rank, country...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ANOVA TEST

Given we have a dataset that ranks the top Youtube Channels in the globe, the main value that dictates this rank system will be the number of subscribers. This value will be the most common aspect viewers will look forward to when they want to know who are the top Youtubers.Therefore, we shall be taking ‘subscribers’ as our response variable.

One explanatory variable we can take into account with respect to subscribers can be ‘channel_type’ as it influences the number of subscribers for a youtube channel in a significant way.

Since there are more than 10 categories for channel types, we are consolidating common categories to ease our process of doing the ANOVA test.

library(dplyr)

youtube <- youtube |>
  mutate(Consolidated_Type = case_when(
    channel_type %in% c("Entertainment", "Music", "Film", "Comedy") ~ "Entertainment",
    channel_type %in% c("Education") ~ "Education",
    channel_type %in% c("Tech") ~ "Technology",
    TRUE ~ "Other"
  ))

Let us visualize box plots for these consolidated channel types with respect to number of subscribers.

youtube |>
  ggplot() +
  geom_boxplot(mapping = aes(y = subscribers, x = Consolidated_Type)) +
  scale_y_log10(labels = \(x) paste(x / 100000, 'M')) +
  annotation_logticks(sides = 'l') +
  labs(x = "Channel Type",
       y = "Number of subscribers (in millions") +
  theme_minimal()

We can see that some of these channel types are very slightly than others, but we want to know if the differences are large enough to significantly challenge our hypothesis that they’re actually all basically the same. So, in this case, our null hypothesis test is

\[ H_0 : \text{average subscribers is equal across all channel types} \]

Before we go forward with the ANOVA Test, we must make sure a few assumptions hold:

Independence. Each data point must be independent of every other data point. It is visible that each data point among the channel types is independent of the other. So this assumption holds.
Normality. The distribution of within each group must be roughly normal. (E.g., imagine making a histogram for each group.)

youtube|>
  filter(Consolidated_Type=="Education")|>
  ggplot() +
  geom_histogram(mapping = aes(x = subscribers), color = 'white', fill = 'blue', bins = 30)+
  labs(
    title = "Distribution of Subscribers for Education Channels",
    x = "Subscribers",
    y = "Frequency"
  )

youtube|>
  filter(Consolidated_Type=="Entertainment")|>
  ggplot() +
  geom_histogram(mapping = aes(x = subscribers), color = 'white', fill = 'blue', bins = 30)+
  labs(
    title = "Distribution of Subscribers for Entertainment Channels",
    x = "Subscribers",
    y = "Frequency"
  )

youtube|>
  filter(Consolidated_Type=="Other")|>
  ggplot() +
  geom_histogram(mapping = aes(x = subscribers), color = 'white', fill = 'blue', bins = 30)+
  labs(
    title = "Distribution of Subscribers for Other Channels",
    x = "Subscribers",
    y = "Frequency"
  )

youtube|>
  filter(Consolidated_Type=="Technology")|>
  ggplot() +
  geom_histogram(mapping = aes(x = subscribers), color = 'white', fill = 'blue', bins = 30)+
  labs(
    title = "Distribution of Subscribers for Technology Channels",
    x = "Subscribers",
    y = "Frequency"
  )

We notice all graphs except “Technology” are right-tailed distributions. Hence, this assumption doesn’t entirely hold.

Homoscedasticity (Constant Variance). The variance of data within groups remains consistent for every group. (I.e., no group has a significantly higher standard deviation than any other group.)

# Filter the data by each group
education_group <- youtube[youtube$Consolidated_Type == "Education", ]
entertainment_group <- youtube[youtube$Consolidated_Type == "Entertainment", ]
other_group <- youtube[youtube$Consolidated_Type == "Other", ]
technology_group <- youtube[youtube$Consolidated_Type == "Technology", ]

# Calculate the variance within each group
variance_education <- var(education_group$subscribers)
variance_entertainment <- var(entertainment_group$subscribers)
variance_other <- var(other_group$subscribers)
variance_technology <- var(technology_group$subscribers)

# Print the variances
cat("Standard deviation within Education group:", sqrt(variance_education), "\n")

## Standard deviation within Education group: 23973775

cat("Standard deviation within Entertainment group:", sqrt(variance_entertainment), "\n")

## Standard deviation within Entertainment group: 18605778

cat("Standard deviation within Other group:", sqrt(variance_other), "\n")

## Standard deviation within Other group: 14043218

cat("Standard deviation within Technology group:", sqrt(variance_technology), "\n")

## Standard deviation within Technology group: 5408069

From this, we can see that there is actually some difference in standard deviations among the groups. “Education”, “Entertainment” and “Other” are not too far off. However, “Technology” is significantly different.

We see not all assumptions hold. But we shall go forward to explore the hypothesis.

Let us a visualize a f-distribution graph to compare the variation between group means to the variation within the groups.

n <- nrow(youtube)
k <- n_distinct(youtube$Consolidated_Type)

ggplot() +
  geom_function(xlim = c(0, 10), 
                fun = \(x) df(x, k - 1, n - k)) +
  geom_vline(xintercept = 1, color = 'orange') +
  labs(title = 'F Distribution for channel types',
       x = "F Values",
       y = "Probability Density") +
  theme_minimal()

From the graph we can see that the long tail is on the right side of the distribution. This represents larger F-statistics, indicating greater variation between groups compared to within groups. We will get more proof upon testing for ANOVA below.

m <- aov(subscribers ~ Consolidated_Type, data = youtube)
summary(m)

##                    Df    Sum Sq   Mean Sq F value Pr(>F)  
## Consolidated_Type   3 3.285e+15 1.095e+15   3.593 0.0133 *
## Residuals         991 3.020e+17 3.048e+14                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (0.0133) is less than the typical significance level of 0.05. This suggests that there is some evidence of a significant difference in the number of subscribers between the consolidated channel types. You can conclude that the consolidated channel type has a statistically significant effect on the number of subscribers. A high F value (3.593) with a small p-value suggests that there are significant group differences

This ANOVA suggests that there are significant differences in the number of subscribers among different consolidated channel types, and you can proceed with further post-hoc tests or specific analyses to understand which consolidated channel types differ from each other in terms of subscribers.Therefore, we reject our null hypothesis and come up with our alternative hypothesis:

\[ H_a : \text{average subscribers is not equal across all channel types} \]

To determine which one is most unlikely to be the same as the rest, we can use multiple pairwise t-tests to compare each group (rows) to each other group (columns).

pairwise.t.test(youtube$subscribers, youtube$Consolidated_Type, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  youtube$subscribers and youtube$Consolidated_Type 
## 
##               Education Entertainment Other
## Entertainment 1.000     -             -    
## Other         0.170     0.033         -    
## Technology    0.715     1.000         1.000
## 
## P value adjustment method: bonferroni

Interpretations from the pair-wise t-test:

in the “Education” row, the p-value for the comparison with “Entertainment” is 1.000, which indicates no significant difference.
In the “Other” row, the p-value for the comparison with “Entertainment” is 0.033. This p-value suggests a significant difference at a conventional significance level
The p-value for the comparison between “Education” and “Other” is 0.170. This indicates marginally significant difference in the number of subscribers between these groups.
The p-value for the comparison between “Technology” and “Education” is 0.715, indicating no significant difference in the number of subscribers between these groups.

LINEAR REGRESSION

Let us choose one other continuous (or ordered integer) column of data that might influence the response variable. In this case, we shall choose video views as the subscriber count of a channel highly its influences video views.

Since there are few Youtube channels with 0 video views (These channels belong to YouTube and don’t post anything), we shall be removing them so that it doesn’t hinder our observations.

youtube <- youtube |>
  filter(`video views` != 0)

Let us first visualize the graph between suscribers and video views.

youtube |>
  ggplot(mapping = aes(x = subscribers, y = `video views`)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

We can see that as subscribers increase, video views does increase. It seems to be roughly linear. But to evaluate its fit, it is better to create a linear regression model and assess its summary values.

Let us build a linear regression for subscribers and video views, and evaluate its fit.

# Build the linear regression model
model <- lm(subscribers ~ `video views`, data = youtube)
# Summarize the model
summary(model)

## 
## Call:
## lm(formula = subscribers ~ `video views`, data = youtube)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -41617692  -4370437  -1239051   2698367 126872192 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.193e+07  3.773e+05   31.62   <2e-16 ***
## `video views` 9.586e-04  2.098e-05   45.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9311000 on 985 degrees of freedom
## Multiple R-squared:  0.6794, Adjusted R-squared:  0.6791 
## F-statistic:  2087 on 1 and 985 DF,  p-value: < 2.2e-16

The multiple R-squared is approximately 0.6794, indicating that about 67.94% of the variance in the number of subscribers is explained by the linear relationship with video_views. This is a relatively good fit, suggesting that the model captures a substantial portion of the variation in subscribers. The F-statistic is significant (p-value: < 2.2e-16), indicating that the model as a whole is significant.the model appears to have a reasonably good fit, as evidenced by the significant F-statistic and the high R-squared value. It explains a significant portion of the variation in the number of subscribers.

Let us do an appropriate hypothesis test to see whether the model as a whole is significant. Hypothesis Test for the Significance of the Model

\[ H_0:\text{The model does not explain any variance in the response variable (subscribers).} \]

\[ H_a: \text{The model does explain some variance in the response variable.} \]

We perform and interpret the F-statistic test for the significance of the model. The result will indicate whether the model, as a whole, is a significant predictor of the number of subscribers.

# Build the linear regression model
model <- lm(subscribers ~`video views`, data = youtube)

# Perform the F-statistic test for the significance of the model
f_test <- summary(model)$fstatistic

# Extract the F-statistic and its associated p-value
f_statistic <- f_test[1]
p_value <- pf(f_statistic, f_test[2], f_test[3], lower.tail = FALSE)

# Display the F-statistic and p-value
cat("F-statistic:", f_statistic, "\n")

## F-statistic: 2087.439

cat("P-value:", format(p_value, scientific = TRUE), "\n")

## P-value: 1.481359e-245

# Determine the significance based on the p-value
if (p_value < 0.05) {
  cat("The model is significant (Reject the null hypothesis).\n")
} else {
  cat("The model is not significant (Fail to reject the null hypothesis).\n")
}

## The model is significant (Reject the null hypothesis).

Results:

The F-statistic is highly significant, indicating that the model, as a whole, is a significant predictor of the number of subscribers.
The extremely low p-value (practically zero) provides strong evidence to reject the null hypothesis, which states that the model does not explain any variance in the response variable. Thus, we reject the null hypothesis.

Let us use diagnostic plots to identify any issues with our model.

# Build the linear regression model
model <- lm(subscribers ~ `video views`, data = youtube)

# Create diagnostic plots
par(mfrow = c(2, 2))  # Set up a 2x2 grid for multiple plots

# Plot 1: Residuals vs. Fitted Values (Checking for Linearity)
plot(model, which = 1)

# Plot 2: Normal Q-Q Plot (Checking for Normality of Residuals)
plot(model, which = 2)

# Plot 3: Scale-Location Plot (Checking for Homoscedasticity)
plot(model, which = 3)

# Plot 4: Residuals vs. Leverage (Checking for Outliers)
plot(model, which = 5)

# Reset the graphics parameters
par(mfrow = c(1, 1))

Interpretation: 1. Residuals vs. Fitted: We see random scatter points with no clear pattern. This suggests linearity. Therefore, we don’t have any issues with showing linearity in this linear regression model.

Normal Q-Q Plot: We notice a slight deviation that results in a curve. This could be caused by outliers or influential points in our data which can be assessed better in the Residuals vs Leverage graph. In the end, we understand the residuals aren’t normally distributed which results in the slight curve. This can be considered as a small issue.
Scale-Location: We see that the the points are randomly scattered, with no clear fan shape. This means the spread of residuals is consistent and there is no issue with homoscedasticity.
Residuals vs. Leverage: We see there are no points outside the dashed line. This means that there are no influential points. However, since there is a point very close to reaching the outside of the dashed line, we could see this as a small hinderance towards the normality of residuals which was seen in ‘Normal Q-Q Points’.

Hence from the diagnostic plots, we don’t see any alarming issue with our model apart from the slight deviation of normality.

Interpreting coefficients of my model:

model$coefficients

##   (Intercept) `video views` 
##  1.193245e+07  9.586346e-04

In the context of our data:

The coefficient for video views is highly statistically significant (p-value < 2e-16), indicating a strong relationship between video views and subscribers.

The positive coefficient for video views suggests that as the number of video views increases, the number of subscribers is expected to increase as well.

Recommendations:

While the model provides insights into the relationship between video views and subscribers, its important to keep in mind that the intercept’s practical interpretation may not be meaningful, as it’s unlikely to have zero views. Focus on the slope coefficient for practical recommendations.
our goal is to grow the number of subscribers to your YouTube channel, strategies that lead to more video views could be effective.

Adding another variable to our regression model. We add “Consolidated_Type” from our ANOVA Test.

# Build the linear regression model with 'video views' and 'Consolidated_Type' as predictors
model <- lm(subscribers ~ `video views` + Consolidated_Type, data = youtube)

# View the summary of the model
summary(model)

## 
## Call:
## lm(formula = subscribers ~ `video views` + Consolidated_Type, 
##     data = youtube)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -41755621  -4353979  -1244261   2751330 126809089 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.116e+07  1.386e+06   8.051 2.36e-15 ***
## `video views`                  9.613e-04  2.127e-05  45.200  < 2e-16 ***
## Consolidated_TypeEntertainment 7.593e+05  1.398e+06   0.543   0.5871    
## Consolidated_TypeOther         6.273e+05  1.453e+06   0.432   0.6660    
## Consolidated_TypeTechnology    4.354e+06  2.641e+06   1.648   0.0996 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9312000 on 982 degrees of freedom
## Multiple R-squared:  0.6804, Adjusted R-squared:  0.679 
## F-statistic: 522.5 on 4 and 982 DF,  p-value: < 2.2e-16

Evaluation with respect to previous model:

The R-squared value for the model with Consolidated_Type is slightly higher (0.6804) compared to the initial model (0.6794), suggesting that the addition of channel type as a predictor explains a slightly larger proportion of the variation in subscribers.
In both models, the coefficient for video views is highly significant, indicating a strong relationship between video views and subscribers.
The addition of Consolidated_Type as a predictor provides additional information about the impact of channel types on subscribers.

Adding interaction term “created_month”

# Build the linear regression model with 'video views', 'uploads', and 'Consolidated_Type' as predictors
model <- lm(subscribers ~ `video views` * created_month + Consolidated_Type, data = youtube)
summary(model)

## 
## Call:
## lm(formula = subscribers ~ `video views` * created_month + Consolidated_Type, 
##     data = youtube)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -34986081  -4134220  -1202524   2750172 117917711 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     9.900e+06  1.986e+06   4.984 7.41e-07 ***
## `video views`                   1.104e-03  9.392e-05  11.751  < 2e-16 ***
## created_monthAug                1.647e+06  1.986e+06   0.829  0.40707    
## created_monthDec                1.739e+06  2.095e+06   0.830  0.40671    
## created_monthFeb               -1.764e+06  2.276e+06  -0.775  0.43836    
## created_monthJan                2.418e+06  1.899e+06   1.273  0.20319    
## created_monthJul                3.586e+06  2.038e+06   1.759  0.07882 .  
## created_monthJun                2.590e+06  2.075e+06   1.248  0.21223    
## created_monthMar                2.554e+05  1.831e+06   0.139  0.88912    
## created_monthMay                3.879e+05  1.975e+06   0.196  0.84431    
## created_monthnan                1.450e+06  7.781e+06   0.186  0.85225    
## created_monthNov               -5.933e+05  2.262e+06  -0.262  0.79316    
## created_monthOct                1.261e+06  2.268e+06   0.556  0.57850    
## created_monthSep                9.145e+05  1.821e+06   0.502  0.61565    
## Consolidated_TypeEntertainment  1.060e+06  1.407e+06   0.754  0.45133    
## Consolidated_TypeOther          1.059e+06  1.468e+06   0.721  0.47090    
## Consolidated_TypeTechnology     4.648e+06  2.661e+06   1.747  0.08099 .  
## `video views`:created_monthAug -2.279e-04  1.260e-04  -1.808  0.07086 .  
## `video views`:created_monthDec -1.733e-04  1.411e-04  -1.228  0.21975    
## `video views`:created_monthFeb  2.671e-04  1.648e-04   1.620  0.10557    
## `video views`:created_monthJan -2.890e-04  1.219e-04  -2.372  0.01789 *  
## `video views`:created_monthJul -4.398e-04  1.431e-04  -3.074  0.00217 ** 
## `video views`:created_monthJun -1.346e-04  1.274e-04  -1.057  0.29068    
## `video views`:created_monthMar -8.105e-05  1.018e-04  -0.796  0.42618    
## `video views`:created_monthMay -1.057e-04  1.181e-04  -0.895  0.37105    
## `video views`:created_monthnan -3.314e-04  8.938e-04  -0.371  0.71085    
## `video views`:created_monthNov -1.089e-04  1.619e-04  -0.673  0.50136    
## `video views`:created_monthOct -3.215e-04  1.835e-04  -1.752  0.08017 .  
## `video views`:created_monthSep -1.703e-04  1.031e-04  -1.652  0.09892 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9256000 on 958 degrees of freedom
## Multiple R-squared:  0.6919, Adjusted R-squared:  0.6829 
## F-statistic: 76.84 on 28 and 958 DF,  p-value: < 2.2e-16

Reason “created_month” was included is becuase seasonal effects or patterns can impact the number of subscribers to a channel. By adding interaction terms between video views and created_month, we allow the effect of video views on subscribers to vary by month.

Evaluations:

The multiple R-squared is 0.6919, indicating that the model explains about 69.19% of the variance in subscribers. The multiple R-squared is 0.6919, indicating that the model explains about 69.19% of the variance in subscribers.
Some of these interaction terms are statistically significant, such as video views:created_monthJan, which is significant at the 0.05 level (p-value: 0.01789), and video views:created_monthJul, which is significant at the 0.01 level (p-value: 0.00217).
Including interaction terms for created_month can be particularly useful when we have a prior reason to believe that seasonality or other temporal factors might affect our response variable “subscribers”, and you want our model to capture and account for these variations.

Conclusion of our data dive:

In this data dive, we conducted a comprehensive analysis of a dataset containing information about the top YouTube channels worldwide. Our primary objective was to understand the factors influencing the number of subscribers to these channels. We began with an ANOVA test to evaluate whether different channel types significantly impacted subscriber counts. Subsequently, we performed linear regression modeling to explore the relationship between subscribers and video views, with diagnostic plots confirming the model’s suitability. Further analysis included extending the regression model with “Consolidated_Type” as a predictor and adding interaction terms with “created_month” to account for seasonal effects. Our findings revealed that video views, channel type, and seasonality all play important roles in predicting subscriber counts, providing valuable insights for content creators and marketers seeking to optimize their YouTube channels.

Week 8 Data Dive

2023-10-23