As social media consultants, our team’s objective was to see what type of YouTube content drives the highest engagement. This R markdown will show and covers Phase 1 & 2 (Exploratory Data Analysis) and Phase 3 (Data Validation) using the datasnaek YouTube trending dataset sourced from Kaggle, which contains real trending video statistics from the United States.
The dataset (USvideos.csv) is sourced from Mitchell JY’s
Trending YouTube Scraper via Kaggle
(datasnaek/youtube-new). It contains 40,949 observations
across 16 variables including views, likes, dislikes, comment count,
publish time, and trending date.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
youtube <- read_csv("USvideos.csv")
## Rows: 40949 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): video_id, trending_date, title, channel_title, tags, thumbnail_lin...
## dbl (5): category_id, views, likes, dislikes, comment_count
## lgl (3): comments_disabled, ratings_disabled, video_error_or_removed
## dttm (1): publish_time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(youtube)
## Rows: 40,949
## Columns: 16
## $ video_id <chr> "2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "p…
## $ trending_date <chr> "17.14.11", "17.14.11", "17.14.11", "17.14.11",…
## $ title <chr> "WE WANT TO TALK ABOUT OUR MARRIAGE", "The Trum…
## $ channel_title <chr> "CaseyNeistat", "LastWeekTonight", "Rudy Mancus…
## $ category_id <dbl> 22, 24, 23, 24, 24, 28, 24, 28, 1, 25, 17, 24, …
## $ publish_time <dttm> 2017-11-13 17:13:01, 2017-11-13 07:30:00, 2017…
## $ tags <chr> "SHANtell martin", "last week tonight trump pre…
## $ views <dbl> 748374, 2418783, 3191434, 343168, 2095731, 1191…
## $ likes <dbl> 57527, 97185, 146033, 10172, 132235, 9763, 1599…
## $ dislikes <dbl> 2966, 6146, 5339, 666, 1989, 511, 2445, 778, 11…
## $ comment_count <dbl> 15954, 12703, 8181, 2146, 17518, 1434, 1970, 34…
## $ thumbnail_link <chr> "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg…
## $ comments_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ ratings_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ video_error_or_removed <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ description <chr> "SHANTELL'S CHANNEL - https://www.youtube.com/s…
dim(youtube)
## [1] 40949 16
The dataset contains 40,949 rows and 16 columns. Key variables
include views, likes, dislikes,
comment_count, publish_time, and
trending_date. Notably, the derived variables required for
modeling (engagement_rate, like_per_view,
video_length_category, days_since_published)
are not present in the raw data and does have to be engineered.
colSums(is.na(youtube))
## video_id trending_date title
## 0 0 0
## channel_title category_id publish_time
## 0 0 0
## tags views likes
## 0 0 0
## dislikes comment_count thumbnail_link
## 0 0 0
## comments_disabled ratings_disabled video_error_or_removed
## 0 0 0
## description
## 578
Only the description column contains missing values (578
NAs). All key numeric columns like views, likes, dislikes, and
comment_count are complete with zero missing values.
youtube %>%
select(views, likes, dislikes, comment_count) %>%
summarise(across(everything(), list(mean = mean, sd = sd, var = var)))
## # A tibble: 1 × 12
## views_mean views_sd views_var likes_mean likes_sd likes_var dislikes_mean
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2360785. 7394114. 5.47e13 74267. 228885. 52388498047. 3711.
## # ℹ 5 more variables: dislikes_sd <dbl>, dislikes_var <dbl>,
## # comment_count_mean <dbl>, comment_count_sd <dbl>, comment_count_var <dbl>
All four numeric columns show substantial rea world variance. Views range from 549 to over 225 million with a standard deviation of 7.39 million, confirming this isa real data which is a stark contrast to the near zero variance found in the original synthetic dataset we were using.
youtube_clean <- youtube %>%
filter(video_error_or_removed == FALSE) %>%
mutate(
engagement_rate = (likes + dislikes + comment_count) / views,
like_per_view = likes / views,
days_since_published = as.numeric(
as.Date(trending_date, format = "%y.%d.%m") - as.Date(publish_time)
),
video_length_category = NA # not available in raw data
)
glimpse(youtube_clean)
## Rows: 40,926
## Columns: 20
## $ video_id <chr> "2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "p…
## $ trending_date <chr> "17.14.11", "17.14.11", "17.14.11", "17.14.11",…
## $ title <chr> "WE WANT TO TALK ABOUT OUR MARRIAGE", "The Trum…
## $ channel_title <chr> "CaseyNeistat", "LastWeekTonight", "Rudy Mancus…
## $ category_id <dbl> 22, 24, 23, 24, 24, 28, 24, 28, 1, 25, 17, 24, …
## $ publish_time <dttm> 2017-11-13 17:13:01, 2017-11-13 07:30:00, 2017…
## $ tags <chr> "SHANtell martin", "last week tonight trump pre…
## $ views <dbl> 748374, 2418783, 3191434, 343168, 2095731, 1191…
## $ likes <dbl> 57527, 97185, 146033, 10172, 132235, 9763, 1599…
## $ dislikes <dbl> 2966, 6146, 5339, 666, 1989, 511, 2445, 778, 11…
## $ comment_count <dbl> 15954, 12703, 8181, 2146, 17518, 1434, 1970, 34…
## $ thumbnail_link <chr> "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg…
## $ comments_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ ratings_disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ video_error_or_removed <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ description <chr> "SHANTELL'S CHANNEL - https://www.youtube.com/s…
## $ engagement_rate <dbl> 0.102150796, 0.047972059, 0.049994141, 0.037835…
## $ like_per_view <dbl> 0.076869319, 0.040179297, 0.045757800, 0.029641…
## $ days_since_published <dbl> 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1,…
## $ video_length_category <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
After filtering out 23 error rows, the cleaned dataset contains
40,926 observations and 20 variables. Three of the four required derived
variables were successfully engineered.
video_length_category does remains NA as video duration is
not included in this dataset and would need to have separate YouTube
Data API lookup.
ggplot(youtube_clean, aes(x = views)) +
geom_histogram(bins = 50, fill = "steelblue") +
scale_x_log10(labels = scales::comma) +
labs(title = "Distribution of Views (log scale)",
x = "Views", y = "Count")
ggplot(youtube_clean, aes(x = engagement_rate)) +
geom_histogram(bins = 50, fill = "darkgreen") +
labs(title = "Distribution of Engagement Rate",
x = "Engagement Rate", y = "Count")
ggplot(youtube_clean, aes(x = days_since_published)) +
geom_histogram(bins = 50, fill = "coral") +
labs(title = "Days Since Published When Trending",
x = "Days", y = "Count")
### Key Validation Findings
The views distributiondoes follow a near normal curve on a log scale, being around around 1 million views, ranging from ~10,000 to over 100 million. This is very consistent with real YouTube trending behavior.
The engagement rate distribution is righ -skewed, with most videos falling between 0.01 and 0.05, and a long tail of high engagement outliers. This is showing real genuine user behavior.
The days since published distribution shows the vast majority of videos trend within days of publishing, with a small number of outliers trending years later. This is consistent with the way viral patterns are
The datasnaek YouTube trending dataset passes all data validation checks. Variance is real, missing values are minimal and derived variables have been successfully engineered for three of the four required modeling inputs. The dataset is ready to support Phase 4 model development (Random Forest and Logistic Regression) by Lennin and Nico respectively.
video_length_category remains a limitation — the team
should discuss whether to drop this variable or supplement with a
YouTube Data API lookup for video duration.
Mitchell JY. (2019). Trending YouTube Video Statistics. Kaggle. https://www.kaggle.com/datasets/datasnaek/youtube-new