What Drives YouTube Engagement? A Random Forest Approach

Overview

One of the questions key questions in our “social media consultant” role play we wanted to answer via modeling was “do video attributes drive engagement? if so, which?” we decided a random forest was a good approach to find out. honestly, because I have a hard time feature engineering for other algos, but for random forest/decision trees its very easy since it isn’t scale dependent, it’s just about managing splits and purity. I deduped (this is panel data and the same video can appear over many days) and just did log-transform for the view counts to manage splitting.

Setup

library(tidyverse)
library(lubridate)
library(ranger)
library(caret)
library(vip)
library(pdp)
library(patchwork)

category_map <- c(
  "1"  = "Film & Animation", "2"  = "Autos & Vehicles", "10" = "Music",
  "15" = "Pets & Animals",   "17" = "Sports",            "19" = "Travel & Events",
  "20" = "Gaming",           "22" = "People & Blogs",    "23" = "Comedy",
  "24" = "Entertainment",    "25" = "News & Politics",   "26" = "Howto & Style",
  "27" = "Education",        "28" = "Science & Technology",
  "29" = "Nonprofits & Activism", "43" = "Shows"
)

1. Load Data

raw <- read_csv("data_set/USvideos.csv", show_col_types = FALSE)
cat("Raw rows:", nrow(raw), "| Unique videos:", n_distinct(raw$video_id), "\n")

## Raw rows: 40949 | Unique videos: 6351

2. Deduplicate

The raw dataset has ~40,000 rows but only ~6,300 unique videos. I kept the most mature snapshot per video (highest view count) so engagement rates reflect an accumulated audience rather than a single early-trending moment.

df <- raw %>%
  group_by(video_id) %>%
  slice_max(views, n = 1, with_ties = FALSE) %>%
  ungroup()

cat("Unique videos after dedup:", nrow(df), "\n")

## Unique videos after dedup: 6351

3. Feature Engineering

I took my best guess to derive the target variable and all predictors from scratch. Key decisions:

engagement_rate = (likes + dislikes + comments) / views — drop all three component columns to prevent data leakage
log_views the raw range for views (549 to 225M) is huge; log-transforming helps the model find clean split points
days_to_trend — how quickly the video hit trending, parsed from the raw YY.DD.MM date format
Title signals (length, word count, punctuation) extracted from the raw title string

df <- df %>%
  mutate(
    engagement_rate    = (likes + dislikes + comment_count) / views,
    title_length       = nchar(title),
    title_word_count   = str_count(title, "\\S+"),
    has_caps           = str_detect(title, "[A-Z]{2,}"),
    has_question       = str_detect(title, "\\?"),
    has_exclaim        = str_detect(title, "!"),
    tag_count          = if_else(tags == "[none]", 0L, str_count(tags, "\\|") + 1L),
    publish_dt         = ymd_hms(publish_time),
    publish_hour       = hour(publish_dt),
    publish_dow        = wday(publish_dt, label = FALSE),
    trend_dt           = as.Date(trending_date, format = "%y.%d.%m"),
    days_to_trend      = as.numeric(trend_dt - as.Date(publish_dt)),
    comments_disabled  = (comments_disabled == "True"),
    ratings_disabled   = (ratings_disabled  == "True"),
    description_length = nchar(coalesce(description, "")),
    log_views          = log1p(views),
    category           = factor(category_map[as.character(category_id)])
  )

df_model <- df %>%
  select(
    engagement_rate,
    title_length, title_word_count, has_caps, has_question, has_exclaim,
    tag_count, publish_hour, publish_dow, days_to_trend,
    comments_disabled, ratings_disabled, description_length,
    log_views, category
  ) %>%
  mutate(
    has_caps          = as.factor(has_caps),
    has_question      = as.factor(has_question),
    has_exclaim       = as.factor(has_exclaim),
    comments_disabled = as.factor(comments_disabled),
    ratings_disabled  = as.factor(ratings_disabled),
    publish_dow       = as.factor(publish_dow)
  ) %>%
  filter(
    engagement_rate >= 0,
    engagement_rate <= quantile(engagement_rate, 0.99, na.rm = TRUE),
    days_to_trend >= 0,
    !is.na(engagement_rate)
  ) %>%
  drop_na()

cat("Final model rows:", nrow(df_model), "\n")

## Final model rows: 6280

summary(df_model$engagement_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01566 0.02882 0.03404 0.04646 0.13691

4. Train / Test Split

Typical 80/20 split, stratified on the target variable via caret::createDataPartition to ensure the engagement rate distribution is balanced across both sets.

set.seed(42)
train_idx <- createDataPartition(df_model$engagement_rate, p = 0.8, list = FALSE)
train <- df_model[ train_idx, ]
test  <- df_model[-train_idx, ]

cat("Train rows:", nrow(train), "| Test rows:", nrow(test), "\n")

## Train rows: 5024 | Test rows: 1256

5. Train Random Forest

I use claude code for most all this code and it advised the following: “use the ranger package — a fast implementation of random forest with identical results to the base randomForest package. For feature importance go with permutation importance, which is more honest than impurity-based importance: it measures how much model accuracy drops when a feature is randomly shuffled, rather than how often a feature appears in splits.”

rf_model <- ranger(
  engagement_rate ~ .,
  data       = train,
  num.trees  = 500,
  importance = "permutation",
  seed       = 42
)

cat("OOB R²:", round(rf_model$r.squared, 4), "\n")

## OOB R²: 0.269

6. Model Evaluation

preds  <- predict(rf_model, data = test)$predictions
rmse   <- sqrt(mean((test$engagement_rate - preds)^2))
mae    <- mean(abs(test$engagement_rate - preds))
ss_res <- sum((test$engagement_rate - preds)^2)
ss_tot <- sum((test$engagement_rate - mean(test$engagement_rate))^2)
r2     <- 1 - ss_res / ss_tot

cat("Test R²:", round(r2, 4), "| RMSE:", round(rmse, 5), "| MAE:", round(mae, 5), "\n")

## Test R²: 0.2659 | RMSE: 0.02083 | MAE: 0.01531

The model achieved a test R² of 0.2659, meaning it explains roughly 26% of the variance in engagement rate!! So awesome since this is only using video metadata. While not a perfect predictor, this is a meaningful signal because we have no information about content quality, thumbnail design, or creator-audience relationships, or other factors that go into the UGC ecosystem and certainly drive a ton of variance. RMSE of 0.0208 is reasonable relative to the typical engagement rate range in the dataset.

Actual vs. Predicted

tibble(actual = test$engagement_rate, predicted = preds) %>%
  ggplot(aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3, color = "#2196F3", size = 1.2) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title    = "Actual vs. Predicted Engagement Rate",
    subtitle = paste0("Test R² = ", round(r2, 4)),
    x = "Actual", y = "Predicted"
  ) +
  theme_minimal(base_size = 13)

7. Variable Importance

This chart tells us which features the model relies on most. It does not tell us direction, so I asked claude to help me visualize that with pdp below.

vip(rf_model, num_features = 15, aesthetics = list(fill = "#2196F3")) +
  labs(
    title    = "What Drives YouTube Engagement Rate?",
    subtitle = "Random Forest — permutation importance (US trending videos)",
    x = "Feature", y = "Importance"
  ) +
  theme_minimal(base_size = 13)

importance_df <- tibble(
  feature    = names(rf_model$variable.importance),
  importance = rf_model$variable.importance
) %>% arrange(desc(importance))

knitr::kable(importance_df, digits = 6, caption = "Feature importance (permutation)")

Feature importance (permutation)
feature	importance
title_length	0.000136
category	0.000118
title_word_count	0.000107
description_length	0.000061
publish_hour	0.000059
log_views	0.000051
tag_count	0.000042
days_to_trend	0.000033
has_caps	0.000021
publish_dow	0.000010
has_exclaim	0.000008
has_question	0.000003
comments_disabled	0.000000
ratings_disabled	0.000000

8. Partial Dependence Plots

Variable importance tells us that a feature matters. Partial dependence plots (PDPs) tell us how by holding all other features constant and showing how the model’s prediction changes as one feature varies across its range.

pdp_plot <- function(feature, label) {
  partial(rf_model, pred.var = feature, train = train, plot = FALSE) %>%
    as_tibble() %>%
    rename(x = 1, yhat = 2) %>%
    ggplot(aes(x = x, y = yhat)) +
    geom_line(color = "#2196F3", linewidth = 1.1) +
    geom_rug(data = train, aes_string(x = feature, y = NULL),
             alpha = 0.15, sides = "b") +
    labs(title = label, x = feature, y = "Predicted engagement rate") +
    theme_minimal(base_size = 12)
}

p_title_len  <- pdp_plot("title_length",       "Title Length")
p_word_count <- pdp_plot("title_word_count",    "Title Word Count")
p_desc       <- pdp_plot("description_length",  "Description Length")
p_hour       <- pdp_plot("publish_hour",        "Publish Hour")
p_tags       <- pdp_plot("tag_count",           "Tag Count")
p_views      <- pdp_plot("log_views",           "Log Views (channel size proxy)")

(p_title_len | p_word_count | p_desc) /
(p_hour      | p_tags       | p_views) +
  plot_annotation(
    title    = "How Each Feature Shapes Engagement Rate",
    subtitle = "Partial dependence plots — all other features held at their mean",
    theme    = theme_minimal(base_size = 13)
  )

9. Engagement Rate by Category

PDPs for categorical features get cluttered, so we use a box plot of the actual data instead. Categories are sorted by median engagement rate.

df_model %>%
  mutate(category = fct_reorder(category, engagement_rate, .fun = median)) %>%
  ggplot(aes(x = category, y = engagement_rate)) +
  geom_boxplot(fill = "#2196F3", alpha = 0.6, outlier.alpha = 0.2) +
  coord_flip() +
  labs(
    title    = "Engagement Rate by Category",
    subtitle = "Median-sorted — actual values from dataset",
    x = NULL, y = "Engagement Rate"
  ) +
  theme_minimal(base_size = 12)

10. Best Time to Post

Combining publish hour and day of week into a heatmap reveals optimal posting windows.

df_model %>%
  mutate(publish_dow = as.integer(as.character(publish_dow))) %>%
  group_by(publish_hour, publish_dow) %>%
  summarise(avg_engagement = mean(engagement_rate), .groups = "drop") %>%
  mutate(
    day_label = factor(publish_dow, levels = 1:7,
                       labels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"))
  ) %>%
  ggplot(aes(x = publish_hour, y = day_label, fill = avg_engagement)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "#e3f2fd", high = "#1565C0", name = "Avg engagement") +
  labs(
    title    = "Best Time to Post",
    subtitle = "Average engagement rate by publish hour × day of week",
    x = "Hour of day (UTC)", y = NULL
  ) +
  theme_minimal(base_size = 12)

Results Summary

Using a random forest model on ~6,300 unique US YouTube trending videos, we achieved a test R² of 0.2659, meaning our model explains about 26% of the variance in engagement rate using only video metadata (no content quality, thumbnail appeal, or audience loyalty data.) The most important predictors were title-related features (title length and word count), followed by content category, description length, publish hour, and tag count. Partial dependence plots let us go beyond just knowing that these features matter and actually see how they affect engagement. The category breakdown confirmed that some niches consistently outperform others, which is a strong signal for any creator deciding what kind of content to make.

Script Summary

The script uses tidyverse for data loading and manipulation, lubridate for parsing messy date formats, ranger for training the random forest, caret for the stratified train/test split, vip for variable importance plots, pdp for partial dependence plots, and patchwork for combining multiple plots. The raw CSV had ~40,000 rows but only ~6,300 unique videos, so we deduplicated by keeping each video’s highest-view snapshot. All features were engineered from scratch. title length, word count, punctuation flags, tag count, publish hour and day of week, days from publish to trending, description length, and a log-transformed view count as a channel-size proxy. Likes, dislikes, and comment count were excluded entirely since they are direct inputs to the target variable and including them would constitute data leakage.