Our consultation roll play needed probabilities to provided to those seeking to increase their engagement rate on YouTube. A Logistic Regression allows us to receive the odds that each variable will increase the engagement of a YouTube video. I kept the duplicates of video within my data set to account for the time frame of repeated interactions with videos. This provided useful insite into the “Engagement Decay” over time.
library(ggplot2)
1. Load data: I used Keith’s cleaned data and converted the model into one called youtube_model.csv for my specific purposes.
youtube_model_df <- read.csv("youtube_model.csv")
2. Variable selection: The logistic regression required Binary variables in order for the results to be easily inheritable. The variables “high_engagement”, “high_views”, “high_likes”, “high_dislikes”, “high_comments” had a string of “Yes” and “No” that needed to be converted.
cols_to_convert <- c("high_engagement", "high_views", "high_likes", "high_dislikes", "high_comments")
youtube_model_df[cols_to_convert] <- lapply(youtube_model_df[cols_to_convert], function(x) {
as.numeric(x == "Yes")
})
3. Logistic Regression: After the conversion the regression was run with high_engagement as the dependent variable and the rest were considered independent variables.
engagement_model <- glm(
high_engagement ~ high_views + high_likes + high_dislikes + high_comments,
data = youtube_model_df,
family = "binomial"
)
4. Summary Analysis: All of our selected variables were deemed highly significant from this model which was nice to see. Our estimate or log odds gave us some insite into the probability of increasing engagement.
summary(engagement_model)
##
## Call:
## glm(formula = high_engagement ~ high_views + high_likes + high_dislikes +
## high_comments, family = "binomial", data = youtube_model_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.38322 0.01518 -91.11 <2e-16 ***
## high_views -2.52515 0.06208 -40.67 <2e-16 ***
## high_likes 2.36164 0.05558 42.49 <2e-16 ***
## high_dislikes -0.91792 0.04582 -20.03 <2e-16 ***
## high_comments 1.59291 0.04770 33.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 46029 on 40925 degrees of freedom
## Residual deviance: 40006 on 40921 degrees of freedom
## AIC: 40016
##
## Number of Fisher Scoring iterations: 5
5. YouTube Engagement Drivers Using Log Odds: This graph provides a intuitive guide as to which variables directly effect the engagement odds of a video. High likes and comments increase the odds, while High dislikes and view decrease the odds. Interesting views decrease the odds of engagements which can be explained by our original formula of likes + dislike + comments / views. Views increase exponentially compared to the the other variables which causes a large shift in the odds. So yes high view do increase your odds of disproportional engagement.
model_summary <- summary(engagement_model)$coefficients[-1, ]
plot_data <- data.frame(
Metric = rownames(model_summary),
Impact = model_summary[, "Estimate"]
)
plot_data$Metric <- gsub("high_", "High ", plot_data$Metric)
plot_data$Metric <- gsub("yes", "", plot_data$Metric)
ggplot(plot_data, aes(x = reorder(Metric, Impact), y = Impact, fill = Impact > 0)) +
geom_bar(stat = "identity", width = 0.7, color = "black") +
coord_flip() + # This flips the chart sideways!
scale_fill_manual(values = c("TRUE" = "#2ecc71", "FALSE" = "#e74c3c"), guide = "none") +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40", linewidth = 0.8) +
labs(
title = "What Drives YouTube Engagement?",
subtitle = "Model Coefficients (Log-Odds Impact on High Engagement)",
x = "Video Metric",
y = "Direction & Strength of Impact"
) +
theme_minimal(base_size = 13) +
theme(
panel.grid.minor = element_blank(),
plot.title = element_text(face = "bold", size = 16),
axis.text = element_text(color = "black", face = "bold")
)
6. Second Regression Model For Analyzing Engagement Decay: I wanted to run a second logistic regression but include the continuous variable “days_since_published”. My hopes were this would provide me with a curve that would display the odds of engagement decay overtime for a YouTube video after it has been published.
# Update your model to include the continuous time variable
engagement_time_model <- glm(
high_engagement ~ high_views + high_likes + high_dislikes + high_comments + days_since_published,
data = youtube_model_df,
family = "binomial"
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(engagement_time_model)
##
## Call:
## glm(formula = high_engagement ~ high_views + high_likes + high_dislikes +
## high_comments + days_since_published, family = "binomial",
## data = youtube_model_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.935814 0.021697 -43.13 <2e-16 ***
## high_views -2.420017 0.063418 -38.16 <2e-16 ***
## high_likes 2.445976 0.056581 43.23 <2e-16 ***
## high_dislikes -0.913524 0.046647 -19.58 <2e-16 ***
## high_comments 1.641240 0.048677 33.72 <2e-16 ***
## days_since_published -0.081131 0.003093 -26.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 46029 on 40925 degrees of freedom
## Residual deviance: 39084 on 40920 degrees of freedom
## AIC: 39096
##
## Number of Fisher Scoring iterations: 8
7. Engagement Decay Over Time : This graph depicts the 60-day projected Engagement decline. A typical youtube video with begin having around 60% engament after it has first been published. There we will see the engaments drop consistantly as the week go by until we platue at the 2 month mark.
engagement_time_model <- glm(
high_engagement ~ high_views + high_likes + high_dislikes + high_comments + days_since_published,
data = youtube_model_df,
family = "binomial"
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
time_range_60 <- seq(1, 60, length.out = 150)
curve_data_60 <- data.frame(
days_since_published = time_range_60,
high_views = 1,
high_likes = 1,
high_dislikes = 0,
high_comments = 1
)
curve_data_60$Probability <- predict(engagement_time_model, newdata = curve_data_60, type = "response")
ggplot(curve_data_60, aes(x = days_since_published, y = Probability)) +
geom_line(color = "#e74c3c", linewidth = 1.5) + # Decay line
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
scale_x_continuous(breaks = c(1, 7, 14, 30, 45, 60),
labels = c("Day 1", "Week 1", "Week 2", "Month 1", "Day 45", "Month 2")) +
coord_cartesian(xlim = c(1, 60)) +
labs(
title = "The 60-Day Engagement Decay Effect",
subtitle = "Predicted Probability of High Engagement Over a Video's First Two Months",
x = "Time Since Published",
y = "Probability of High Engagement Rate"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
panel.grid.minor = element_blank(),
axis.text.x = element_text(face = "bold")
)
Results Summary: Our logistic regression on 40,000 US trending videos successfully mapped out the interaction dynamics that dictate video performance. Provides evidence that high likes and comments directly drive engagement. In terms of impact strength, hitting high likes threshold increases a videos engagement odds by 10x, while high comments increases those odds by 5x. Additionally, high views reduces the odds of high engagement by ~ 90% because the virality of a video introduces more passive viewers. Negative viewer sentiment plays a much smaller role in this performance, as hitting high dislikes only decreases our odds of engagement by a minor 0.4x.