One of the questions key questions in our “social media consultant” role play we wanted to answer via modeling was “do video attributes drive engagement? if so, which?” we decided a random forest was a good approach to find out. honestly, because I have a hard time feature engineering for other algos, but for random forest/decision trees its very easy since it isn’t scale dependent, it’s just about managing splits and purity. I deduped (this is panel data and the same video can appear over many days) and just did log-transform for the view counts to manage splitting.
library(tidyverse)
library(lubridate)
library(ranger)
library(caret)
library(vip)
library(pdp)
library(patchwork)
category_map <- c(
"1" = "Film & Animation", "2" = "Autos & Vehicles", "10" = "Music",
"15" = "Pets & Animals", "17" = "Sports", "19" = "Travel & Events",
"20" = "Gaming", "22" = "People & Blogs", "23" = "Comedy",
"24" = "Entertainment", "25" = "News & Politics", "26" = "Howto & Style",
"27" = "Education", "28" = "Science & Technology",
"29" = "Nonprofits & Activism", "43" = "Shows"
)
raw <- read_csv("data_set/USvideos.csv", show_col_types = FALSE)
cat("Raw rows:", nrow(raw), "| Unique videos:", n_distinct(raw$video_id), "\n")
## Raw rows: 40949 | Unique videos: 6351
The raw dataset has ~40,000 rows but only ~6,300 unique videos. I kept the most mature snapshot per video (highest view count) so engagement rates reflect an accumulated audience rather than a single early-trending moment.
df <- raw %>%
group_by(video_id) %>%
slice_max(views, n = 1, with_ties = FALSE) %>%
ungroup()
cat("Unique videos after dedup:", nrow(df), "\n")
## Unique videos after dedup: 6351
I took my best guess to derive the target variable and all predictors from scratch. Key decisions:
YY.DD.MM date formatdf <- df %>%
mutate(
engagement_rate = (likes + dislikes + comment_count) / views,
title_length = nchar(title),
title_word_count = str_count(title, "\\S+"),
has_caps = str_detect(title, "[A-Z]{2,}"),
has_question = str_detect(title, "\\?"),
has_exclaim = str_detect(title, "!"),
tag_count = if_else(tags == "[none]", 0L, str_count(tags, "\\|") + 1L),
publish_dt = ymd_hms(publish_time),
publish_hour = hour(publish_dt),
publish_dow = wday(publish_dt, label = FALSE),
trend_dt = as.Date(trending_date, format = "%y.%d.%m"),
days_to_trend = as.numeric(trend_dt - as.Date(publish_dt)),
comments_disabled = (comments_disabled == "True"),
ratings_disabled = (ratings_disabled == "True"),
description_length = nchar(coalesce(description, "")),
log_views = log1p(views),
category = factor(category_map[as.character(category_id)])
)
df_model <- df %>%
select(
engagement_rate,
title_length, title_word_count, has_caps, has_question, has_exclaim,
tag_count, publish_hour, publish_dow, days_to_trend,
comments_disabled, ratings_disabled, description_length,
log_views, category
) %>%
mutate(
has_caps = as.factor(has_caps),
has_question = as.factor(has_question),
has_exclaim = as.factor(has_exclaim),
comments_disabled = as.factor(comments_disabled),
ratings_disabled = as.factor(ratings_disabled),
publish_dow = as.factor(publish_dow)
) %>%
filter(
engagement_rate >= 0,
engagement_rate <= quantile(engagement_rate, 0.99, na.rm = TRUE),
days_to_trend >= 0,
!is.na(engagement_rate)
) %>%
drop_na()
cat("Final model rows:", nrow(df_model), "\n")
## Final model rows: 6280
summary(df_model$engagement_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01566 0.02882 0.03404 0.04646 0.13691
Typical 80/20 split, stratified on the target variable via
caret::createDataPartition to ensure the engagement rate
distribution is balanced across both sets.
set.seed(42)
train_idx <- createDataPartition(df_model$engagement_rate, p = 0.8, list = FALSE)
train <- df_model[ train_idx, ]
test <- df_model[-train_idx, ]
cat("Train rows:", nrow(train), "| Test rows:", nrow(test), "\n")
## Train rows: 5024 | Test rows: 1256
I use claude code for most all this code and it advised the
following: “use the ranger package — a fast implementation
of random forest with identical results to the base
randomForest package. For feature importance go with
permutation importance, which is more honest than
impurity-based importance: it measures how much model accuracy drops
when a feature is randomly shuffled, rather than how often a feature
appears in splits.”
rf_model <- ranger(
engagement_rate ~ .,
data = train,
num.trees = 500,
importance = "permutation",
seed = 42
)
cat("OOB R²:", round(rf_model$r.squared, 4), "\n")
## OOB R²: 0.269
preds <- predict(rf_model, data = test)$predictions
rmse <- sqrt(mean((test$engagement_rate - preds)^2))
mae <- mean(abs(test$engagement_rate - preds))
ss_res <- sum((test$engagement_rate - preds)^2)
ss_tot <- sum((test$engagement_rate - mean(test$engagement_rate))^2)
r2 <- 1 - ss_res / ss_tot
cat("Test R²:", round(r2, 4), "| RMSE:", round(rmse, 5), "| MAE:", round(mae, 5), "\n")
## Test R²: 0.2659 | RMSE: 0.02083 | MAE: 0.01531
The model achieved a test R² of 0.2659, meaning it explains roughly 26% of the variance in engagement rate!! So awesome since this is only using video metadata. While not a perfect predictor, this is a meaningful signal because we have no information about content quality, thumbnail design, or creator-audience relationships, or other factors that go into the UGC ecosystem and certainly drive a ton of variance. RMSE of 0.0208 is reasonable relative to the typical engagement rate range in the dataset.
tibble(actual = test$engagement_rate, predicted = preds) %>%
ggplot(aes(x = actual, y = predicted)) +
geom_point(alpha = 0.3, color = "#2196F3", size = 1.2) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Actual vs. Predicted Engagement Rate",
subtitle = paste0("Test R² = ", round(r2, 4)),
x = "Actual", y = "Predicted"
) +
theme_minimal(base_size = 13)
This chart tells us which features the model relies on most. It does not tell us direction, so I asked claude to help me visualize that with pdp below.
vip(rf_model, num_features = 15, aesthetics = list(fill = "#2196F3")) +
labs(
title = "What Drives YouTube Engagement Rate?",
subtitle = "Random Forest — permutation importance (US trending videos)",
x = "Feature", y = "Importance"
) +
theme_minimal(base_size = 13)
importance_df <- tibble(
feature = names(rf_model$variable.importance),
importance = rf_model$variable.importance
) %>% arrange(desc(importance))
knitr::kable(importance_df, digits = 6, caption = "Feature importance (permutation)")
| feature | importance |
|---|---|
| title_length | 0.000136 |
| category | 0.000118 |
| title_word_count | 0.000107 |
| description_length | 0.000061 |
| publish_hour | 0.000059 |
| log_views | 0.000051 |
| tag_count | 0.000042 |
| days_to_trend | 0.000033 |
| has_caps | 0.000021 |
| publish_dow | 0.000010 |
| has_exclaim | 0.000008 |
| has_question | 0.000003 |
| comments_disabled | 0.000000 |
| ratings_disabled | 0.000000 |
Variable importance tells us that a feature matters. Partial dependence plots (PDPs) tell us how by holding all other features constant and showing how the model’s prediction changes as one feature varies across its range.
pdp_plot <- function(feature, label) {
partial(rf_model, pred.var = feature, train = train, plot = FALSE) %>%
as_tibble() %>%
rename(x = 1, yhat = 2) %>%
ggplot(aes(x = x, y = yhat)) +
geom_line(color = "#2196F3", linewidth = 1.1) +
geom_rug(data = train, aes_string(x = feature, y = NULL),
alpha = 0.15, sides = "b") +
labs(title = label, x = feature, y = "Predicted engagement rate") +
theme_minimal(base_size = 12)
}
p_title_len <- pdp_plot("title_length", "Title Length")
p_word_count <- pdp_plot("title_word_count", "Title Word Count")
p_desc <- pdp_plot("description_length", "Description Length")
p_hour <- pdp_plot("publish_hour", "Publish Hour")
p_tags <- pdp_plot("tag_count", "Tag Count")
p_views <- pdp_plot("log_views", "Log Views (channel size proxy)")
(p_title_len | p_word_count | p_desc) /
(p_hour | p_tags | p_views) +
plot_annotation(
title = "How Each Feature Shapes Engagement Rate",
subtitle = "Partial dependence plots — all other features held at their mean",
theme = theme_minimal(base_size = 13)
)
PDPs for categorical features get cluttered, so we use a box plot of the actual data instead. Categories are sorted by median engagement rate.
df_model %>%
mutate(category = fct_reorder(category, engagement_rate, .fun = median)) %>%
ggplot(aes(x = category, y = engagement_rate)) +
geom_boxplot(fill = "#2196F3", alpha = 0.6, outlier.alpha = 0.2) +
coord_flip() +
labs(
title = "Engagement Rate by Category",
subtitle = "Median-sorted — actual values from dataset",
x = NULL, y = "Engagement Rate"
) +
theme_minimal(base_size = 12)
Combining publish hour and day of week into a heatmap reveals optimal posting windows.
df_model %>%
mutate(publish_dow = as.integer(as.character(publish_dow))) %>%
group_by(publish_hour, publish_dow) %>%
summarise(avg_engagement = mean(engagement_rate), .groups = "drop") %>%
mutate(
day_label = factor(publish_dow, levels = 1:7,
labels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"))
) %>%
ggplot(aes(x = publish_hour, y = day_label, fill = avg_engagement)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "#e3f2fd", high = "#1565C0", name = "Avg engagement") +
labs(
title = "Best Time to Post",
subtitle = "Average engagement rate by publish hour × day of week",
x = "Hour of day (UTC)", y = NULL
) +
theme_minimal(base_size = 12)
Using a random forest model on ~6,300 unique US YouTube trending videos, we achieved a test R² of 0.2659, meaning our model explains about 26% of the variance in engagement rate using only video metadata (no content quality, thumbnail appeal, or audience loyalty data.) The most important predictors were title-related features (title length and word count), followed by content category, description length, publish hour, and tag count. Partial dependence plots let us go beyond just knowing that these features matter and actually see how they affect engagement. The category breakdown confirmed that some niches consistently outperform others, which is a strong signal for any creator deciding what kind of content to make.
The script uses tidyverse for data loading and
manipulation, lubridate for parsing messy date formats,
ranger for training the random forest, caret
for the stratified train/test split, vip for variable
importance plots, pdp for partial dependence plots, and
patchwork for combining multiple plots. The raw CSV had
~40,000 rows but only ~6,300 unique videos, so we deduplicated by
keeping each video’s highest-view snapshot. All features were engineered
from scratch. title length, word count, punctuation flags, tag count,
publish hour and day of week, days from publish to trending, description
length, and a log-transformed view count as a channel-size proxy. Likes,
dislikes, and comment count were excluded entirely since they are direct
inputs to the target variable and including them would constitute data
leakage.