library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.2.3
library(broom)
## Warning: package 'broom' was built under R version 4.2.3
library(lindia)
## Warning: package 'lindia' was built under R version 4.2.3
library(car)
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)

Initially setting our directories and loading our data.

knitr::opts_knit$set(root.dir = "C:/Users/Prana/OneDrive/Documents/Topics in Info FA23(Grad)")
youtube <- read_delim("./Global Youtube Statistics.csv", delim = ",")
## Rows: 995 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Youtuber, category, Title, Country, Abbreviation, channel_type, cr...
## dbl (21): rank, subscribers, video views, uploads, video_views_rank, country...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Since there are few Youtube channels with 0 uploads (These channels belong to YouTube and don't post anything), we shall be removing them so that it doesn't hinder our observations.
youtube <- youtube |>
  filter(`video views` != 0)

Before getting into logistic regression, let us first transform our response variable before making it a binary response variable. In this case, our response variable is ‘subscribers’ and explanatory variables is ‘video views’.

Creating graph to visualize the relation between subscribers and video views.

model <- lm(youtube$subscribers ~ youtube$`video views` )

rsquared <- summary(model)$r.squared

youtube |> 
  ggplot(mapping = aes(x = `video views`, 
                       y = subscribers)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "subscribers vs. videoviews",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We see that there the dashed lines suggest there is a positive linear relationship. It at least lets us know that for most channels, more video views dictates more subscriber. But we shall explore this more later.

Let us find the lambda value to understand how we can transform our response variable.

pT <- powerTransform(model, family="bcPower")
pT$lambda
##         Y1 
## -0.8238529

Since our lambda value of -0.82 is close to -1, we shall transform subscribers to 1/subscribers and create a visualization to showcase the relation between the newly transformed response variable and explanatory variable.

youtube <- youtube |>
  mutate(l_sub = 1/subscribers)  # calculate 1/subscribers

model <- lm(youtube$l_sub ~ youtube$`video views`)

rsquared <- summary(model)$r.squared

youtube |> 
  ggplot(mapping = aes(x = `video views`, 
                       y = l_sub)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  labs(title = "subscribers vs. videoviews",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

The graph indicates a negative linear relationship.

plots <- gg_diagnose(model, plot.all = FALSE)
plot_all(plots[c('res_fitted', 'qqplot')], max.per.page = 1)

LOGISTIC REGRESSION

Now we are going to create a logistic regression model using a binary response variable ‘subscribers_div’ with explanatory variable ‘video views’.

First let us create our binary response variable using our transformed variable ‘l_sub’.

library(dplyr)

youtube <- youtube |>
  mutate(subscribers_div = ifelse(l_sub > 5.650e-08, 1, 0))

Before we go into the creation of the model, let us create a visualization of our binary response variable and explanatory variable.

youtube |>
  ggplot(mapping = aes(x = `video views`, y = subscribers_div)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) + 
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

We see that there are more 0’s than 1’s here maybe due to inverse created by transforming the original response variable.

model <- glm(subscribers_div ~ `video views`, data = youtube,
             family = binomial(link = 'logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
model$coefficients
##   (Intercept) `video views` 
##  1.719022e+00 -1.895708e-10

Interpretation of coefficients:

Let us try another explanatory variable ‘uploads’ to judge the model

model <- glm(subscribers_div ~ `video views`+uploads, data = youtube,
             family = binomial(link = 'logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
model$coefficients
##   (Intercept) `video views`       uploads 
##  1.727278e+00 -1.987966e-10  8.039662e-06

The coefficient for uploads represents how a one-unit increase in “uploads” affects the log-odds of success in “subscribers_div.” The coefficient is positive, indicating that as “uploads” increase by one unit, the log-odds of success also increase.

BONUS (Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and interpret its meaning) Here we are using ‘video views’.

model_summary <- summary(model)

# Extract the standard error for the coefficient of "video views"
se_video_views <- model_summary$coefficients["`video views`", "Std. Error"]

# Coefficient for "video views"
coefficient <- -1.895708e-10

# Margin of error (MOE) for a 95% confidence interval
MOE <- 1.96 * se_video_views  # Assuming a 95% confidence level

# Calculate the confidence interval
lower_bound <- coefficient - MOE
upper_bound <- coefficient + MOE

# Print the confidence interval
cat("95% Confidence Interval: (", lower_bound, ", ", upper_bound, ")\n")
## 95% Confidence Interval: ( -2.199557e-10 ,  -1.591859e-10 )

Interpretation of our intervals:

TRANSFORMING EXPLANATORY VARIABLE (video views)

youtube|>
  ggplot() +
  geom_histogram(mapping = aes(x = `video views`), color = 'white', fill = 'blue', bins = 30)+
  labs(
    title = "Distribution of Subscribers for Other Channels",
    x = "views",
    y = "Frequency")

From the graph we see that the explanatory variable ‘video views’ is highly skewed towards the right. Therefore, we are going to apply the log transformation for this variable.

youtube |>
  ggplot(mapping = aes(x = log(`video views`), y = subscribers_div)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) + 
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

model <- glm(subscribers_div ~ log(`video views`), data = youtube, family = binomial(link = 'logit'))

model$coefficients
##        (Intercept) log(`video views`) 
##          27.873852          -1.226497

From our graph and result, we see that this transformation is needed because it compresses extreme values and explanatory variable does show some effect on the model unlike the previous interpretation where it had negligible effect.

Intepretation of coefficient:

Conclusion of Data Dive:

In this data dive, we explored the relationship between YouTube channel “subscribers” and “video views” using logistic regression. Initially, we transformed the response variable by taking its reciprocal, creating a binary variable “subscribers_div.” We observed that there was a positive linear relationship between “video views” and “subscribers_div” on the original scale. However, due to the skewed distribution of “video views,” we applied a logarithmic transformation.

After the transformation, the logistic regression model showed a meaningful relationship. The coefficient for log-transformed “video views” was approximately -1.2265, indicating that a one-unit increase in the log of “video views” resulted in a decrease of about 1.2265 in the log odds of “subscribers_div.” This suggested that as “video views” increased, the odds of a channel having a high number of subscribers decreased.

We also briefly explored the effect of the “uploads” variable and found a positive relationship. However, the practical impact of “video views” was statistically significant but relatively small. The confidence interval for this coefficient was extremely narrow due to the large dataset, highlighting its statistical significance, but the effect size remained modest.

In summary, the log-transformed “video views” proved to be a more informative explanatory variable, and it demonstrated a significant but subtle impact on the odds of having a high number of subscribers for YouTube channels. Further analysis could consider additional variables and model improvement for a more comprehensive understanding of subscriber acquisition.