YouTube has become a crucial platform for content creators and businesses. With more than 2.7 billion monthly active users (Times, 2024), it offers endless opportunities to engage with diverse audiences worldwide. Understanding video performance is essential for optimizing growth potential, and maximizing revenue. This project aims to develop a regression model and classification model.To achieve the objective, we compare 3 model for each problem to evaluate the best performing models, and each model speciality will be explained throughout the modelling process. This project used a dataset derived from Kaggle named “YouTube Channel Performance Analytics” dataset (2016-2024).
The dataset has the following initial characteristics:
Questions:
Objective:
To begin, we load the required R packages that provide the specialized tools and functions needed to carry out our analysis efficiently.
library(readr)
library(dplyr)
library(ggplot2)
library(knitr)
library(janitor)
library(patchwork)
library(caret)
library(randomForest)
library(xgboost)
library(tidyr)
library(e1071)
library(ROCR)
First, we load our dataset from the csv file.
dataset <- as.data.frame(read_csv("archive/youtube_channel_real_performance_analytics.csv"))
We notice that our column headers contain spaces and special characters, which may cause issues.
colnames(dataset)
## [1] "ID" "Video Duration"
## [3] "Video Publish Time" "Days Since Publish"
## [5] "Day" "Month"
## [7] "Year" "Day of Week"
## [9] "Revenue per 1000 Views (USD)" "Monetized Playbacks (Estimate)"
## [11] "Playback-Based CPM (USD)" "CPM (USD)"
## [13] "Ad Impressions" "Estimated AdSense Revenue (USD)"
## [15] "DoubleClick Revenue (USD)" "YouTube Ads Revenue (USD)"
## [17] "Watch Page Ads Revenue (USD)" "YouTube Premium (USD)"
## [19] "Transaction Revenue (USD)" "Transactions"
## [21] "Revenue from Transactions (USD)" "Reactions"
## [23] "Chat Messages Count" "Reminders Set"
## [25] "Stream Hours" "Remix Views"
## [27] "Remix Count" "Subscribers from Posts"
## [29] "New Comments" "Shares"
## [31] "Like Rate (%)" "Dislikes"
## [33] "Likes" "Unsubscribes"
## [35] "New Subscribers" "Returned Items (USD)"
## [37] "Unconfirmed Commissions (USD)" "Approved Commissions (USD)"
## [39] "Orders" "Total Sales Volume (USD)"
## [41] "End Screen Click-Through Rate (%)" "End Screen Impressions"
## [43] "End Screen Clicks" "Teaser Click-Through Rate (%)"
## [45] "Teaser Impressions" "Teaser Clicks"
## [47] "Card Click-Through Rate (%)" "Card Impressions"
## [49] "Card Clicks" "Views per Playlist Start"
## [51] "Playlist Views" "Playlist Watch Time (hours)"
## [53] "Clip Watch Time (hours)" "Clip Views"
## [55] "YouTube Premium Watch Time (hours)" "YouTube Premium Views"
## [57] "Returning Viewers" "New Viewers"
## [59] "Average Views per User" "Unique Viewers"
## [61] "Watched (Not Skipped) (%)" "Feed Impressions"
## [63] "Average View Percentage (%)" "Average View Duration"
## [65] "Views" "Watch Time (hours)"
## [67] "Subscribers" "Estimated Revenue (USD)"
## [69] "Impressions" "Video Thumbnail CTR (%)"
To address this, we use the janitor package to clean them up.
dataset <- dataset %>% clean_names()
Below are the new column names of the dataset.
colnames(dataset)
## [1] "id"
## [2] "video_duration"
## [3] "video_publish_time"
## [4] "days_since_publish"
## [5] "day"
## [6] "month"
## [7] "year"
## [8] "day_of_week"
## [9] "revenue_per_1000_views_usd"
## [10] "monetized_playbacks_estimate"
## [11] "playback_based_cpm_usd"
## [12] "cpm_usd"
## [13] "ad_impressions"
## [14] "estimated_ad_sense_revenue_usd"
## [15] "double_click_revenue_usd"
## [16] "you_tube_ads_revenue_usd"
## [17] "watch_page_ads_revenue_usd"
## [18] "you_tube_premium_usd"
## [19] "transaction_revenue_usd"
## [20] "transactions"
## [21] "revenue_from_transactions_usd"
## [22] "reactions"
## [23] "chat_messages_count"
## [24] "reminders_set"
## [25] "stream_hours"
## [26] "remix_views"
## [27] "remix_count"
## [28] "subscribers_from_posts"
## [29] "new_comments"
## [30] "shares"
## [31] "like_rate_percent"
## [32] "dislikes"
## [33] "likes"
## [34] "unsubscribes"
## [35] "new_subscribers"
## [36] "returned_items_usd"
## [37] "unconfirmed_commissions_usd"
## [38] "approved_commissions_usd"
## [39] "orders"
## [40] "total_sales_volume_usd"
## [41] "end_screen_click_through_rate_percent"
## [42] "end_screen_impressions"
## [43] "end_screen_clicks"
## [44] "teaser_click_through_rate_percent"
## [45] "teaser_impressions"
## [46] "teaser_clicks"
## [47] "card_click_through_rate_percent"
## [48] "card_impressions"
## [49] "card_clicks"
## [50] "views_per_playlist_start"
## [51] "playlist_views"
## [52] "playlist_watch_time_hours"
## [53] "clip_watch_time_hours"
## [54] "clip_views"
## [55] "you_tube_premium_watch_time_hours"
## [56] "you_tube_premium_views"
## [57] "returning_viewers"
## [58] "new_viewers"
## [59] "average_views_per_user"
## [60] "unique_viewers"
## [61] "watched_not_skipped_percent"
## [62] "feed_impressions"
## [63] "average_view_percentage_percent"
## [64] "average_view_duration"
## [65] "views"
## [66] "watch_time_hours"
## [67] "subscribers"
## [68] "estimated_revenue_usd"
## [69] "impressions"
## [70] "video_thumbnail_ctr_percent"
Next, we analyze the structure of our dataset to gain insight into its organization and identify the necessary steps for the data cleaning process. This allows us to prepare the dataset effectively for answering the questions outlined earlier.
From the code below, we find out that our dataset consists of 364 rows and 70 columns.
dim(dataset)
## [1] 364 70
Identifier column as well as columns that contain only a single constant value do not contribute any predictive information to our analysis and shall be removed.
## Columns that contain only a single constant value:
## [1] "transaction_revenue_usd" "transactions"
## [3] "revenue_from_transactions_usd" "reactions"
## [5] "chat_messages_count" "reminders_set"
## [7] "stream_hours" "remix_views"
## [9] "remix_count" "subscribers_from_posts"
## [11] "returned_items_usd" "unconfirmed_commissions_usd"
## [13] "approved_commissions_usd" "orders"
## [15] "total_sales_volume_usd" "returning_viewers"
## [17] "new_viewers" "average_views_per_user"
## [19] "unique_viewers"
As a result, a total of 20 columns are removed from the dataset.
cat("Dataset dimension before removing columns:", dim(dataset), "\n")
## Dataset dimension before removing columns: 364 70
dataset <- dataset %>% select(-id) %>% select(where(~ n_distinct(.) > 1))
cat("Dataset dimension after removing columns:", dim(dataset), "\n")
## Dataset dimension after removing columns: 364 50
colnames(dataset)
## [1] "video_duration"
## [2] "video_publish_time"
## [3] "days_since_publish"
## [4] "day"
## [5] "month"
## [6] "year"
## [7] "day_of_week"
## [8] "revenue_per_1000_views_usd"
## [9] "monetized_playbacks_estimate"
## [10] "playback_based_cpm_usd"
## [11] "cpm_usd"
## [12] "ad_impressions"
## [13] "estimated_ad_sense_revenue_usd"
## [14] "double_click_revenue_usd"
## [15] "you_tube_ads_revenue_usd"
## [16] "watch_page_ads_revenue_usd"
## [17] "you_tube_premium_usd"
## [18] "new_comments"
## [19] "shares"
## [20] "like_rate_percent"
## [21] "dislikes"
## [22] "likes"
## [23] "unsubscribes"
## [24] "new_subscribers"
## [25] "end_screen_click_through_rate_percent"
## [26] "end_screen_impressions"
## [27] "end_screen_clicks"
## [28] "teaser_click_through_rate_percent"
## [29] "teaser_impressions"
## [30] "teaser_clicks"
## [31] "card_click_through_rate_percent"
## [32] "card_impressions"
## [33] "card_clicks"
## [34] "views_per_playlist_start"
## [35] "playlist_views"
## [36] "playlist_watch_time_hours"
## [37] "clip_watch_time_hours"
## [38] "clip_views"
## [39] "you_tube_premium_watch_time_hours"
## [40] "you_tube_premium_views"
## [41] "watched_not_skipped_percent"
## [42] "feed_impressions"
## [43] "average_view_percentage_percent"
## [44] "average_view_duration"
## [45] "views"
## [46] "watch_time_hours"
## [47] "subscribers"
## [48] "estimated_revenue_usd"
## [49] "impressions"
## [50] "video_thumbnail_ctr_percent"
Despite removing columns that do not contribute, we are still left with 50 features. Training and evaluating models with these many features require more computational resources and time. Also, not all features contribute meaningfully to the target variable or analysis and these redundant features could introduce noise that negatively impact our model performance.
Thus, data reduction is crucial to ensure that we retain only the critical information while minimizing noise, redundancy and computational burden.
Upon closer inspection, we find out that some of the columns are derived or aggregated features. These information reiterations cause multicollinearity issue and can lead to difficulty in isolating individual effect of each feature on the target variable. Thus, we remove them to ensure that our model remains interpretable and can generalize better to new data.
E.g.
“days_since_published” column is calculated from “video_publish_time” column.
“day”, “month” and “year” column are extracted from “video_publish_time” column.
“like_rate_percent” column is derived from dividing “likes” column from the total of both “like” and “dislikes” column.
“subscribers” are the subtraction of “Unsubscribes” from “new_subscribers” and etc.
cols_to_remove <- c("video_publish_time", "like_rate_percent", "end_screen_click_through_rate_percent", "teaser_click_through_rate_percent", "card_click_through_rate_percent", "average_view_percentage_percent", "subscribers", "video_thumbnail_ctr_percent", "average_view_percentage_percent")
cat("Dataset dimension before removing columns:", dim(dataset), "\n")
## Dataset dimension before removing columns: 364 42
dataset <- dataset %>%
select(-cols_to_remove)
cat("Dataset dimension after removing columns:", dim(dataset), "\n")
## Dataset dimension after removing columns: 364 34
colnames(dataset)
## [1] "video_duration" "days_since_publish"
## [3] "day" "month"
## [5] "year" "day_of_week"
## [7] "monetized_playbacks_estimate" "ad_impressions"
## [9] "new_comments" "shares"
## [11] "dislikes" "likes"
## [13] "unsubscribes" "new_subscribers"
## [15] "end_screen_impressions" "end_screen_clicks"
## [17] "teaser_impressions" "teaser_clicks"
## [19] "card_impressions" "card_clicks"
## [21] "views_per_playlist_start" "playlist_views"
## [23] "playlist_watch_time_hours" "clip_watch_time_hours"
## [25] "clip_views" "you_tube_premium_watch_time_hours"
## [27] "you_tube_premium_views" "watched_not_skipped_percent"
## [29] "feed_impressions" "average_view_duration"
## [31] "views" "watch_time_hours"
## [33] "estimated_revenue_usd" "impressions"
We then check if there is any missing values in our remaining dataset. From the result, none is present in our case.
data.frame(Features = names(dataset), MissingValues = colSums(is.na(dataset))) %>%
arrange(desc(MissingValues)) %>%
print(row.names = FALSE)
## Features MissingValues
## video_duration 0
## days_since_publish 0
## day 0
## month 0
## year 0
## day_of_week 0
## monetized_playbacks_estimate 0
## ad_impressions 0
## new_comments 0
## shares 0
## dislikes 0
## likes 0
## unsubscribes 0
## new_subscribers 0
## end_screen_impressions 0
## end_screen_clicks 0
## teaser_impressions 0
## teaser_clicks 0
## card_impressions 0
## card_clicks 0
## views_per_playlist_start 0
## playlist_views 0
## playlist_watch_time_hours 0
## clip_watch_time_hours 0
## clip_views 0
## you_tube_premium_watch_time_hours 0
## you_tube_premium_views 0
## watched_not_skipped_percent 0
## feed_impressions 0
## average_view_duration 0
## views 0
## watch_time_hours 0
## estimated_revenue_usd 0
## impressions 0
We also check if there is any duplicated records in our dataset. As shown in the result, there is no duplicated records.
dataset[duplicated(dataset),]
## [1] video_duration days_since_publish
## [3] day month
## [5] year day_of_week
## [7] monetized_playbacks_estimate ad_impressions
## [9] new_comments shares
## [11] dislikes likes
## [13] unsubscribes new_subscribers
## [15] end_screen_impressions end_screen_clicks
## [17] teaser_impressions teaser_clicks
## [19] card_impressions card_clicks
## [21] views_per_playlist_start playlist_views
## [23] playlist_watch_time_hours clip_watch_time_hours
## [25] clip_views you_tube_premium_watch_time_hours
## [27] you_tube_premium_views watched_not_skipped_percent
## [29] feed_impressions average_view_duration
## [31] views watch_time_hours
## [33] estimated_revenue_usd impressions
## <0 rows> (or 0-length row.names)
After partial cleaning of the dataset, below is a brief summary.
glimpse(dataset)
## Rows: 364
## Columns: 34
## $ video_duration <dbl> 201, 391, 133, 14, 45, 496, 9, 34, 1…
## $ days_since_publish <dbl> 0, 8, 4, 15, 2, 7, 28, 3, 3, 1, 5, 1…
## $ day <dbl> 2, 10, 14, 29, 1, 8, 5, 8, 11, 12, 1…
## $ month <dbl> 6, 6, 6, 6, 7, 7, 8, 8, 8, 8, 8, 9, …
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, …
## $ day_of_week <chr> "Thursday", "Friday", "Tuesday", "We…
## $ monetized_playbacks_estimate <dbl> 723, 727, 76, 18, 0, 491, 32, 404, 1…
## $ ad_impressions <dbl> 981, 861, 88, 35, 0, 673, 43, 597, 2…
## $ new_comments <dbl> 91, 35, 0, 12, 50, 27, 16, 33, 37, 1…
## $ shares <dbl> 12, 5, 4, 7, 7, 3, 14, 37, 21, 3, 11…
## $ dislikes <dbl> 30, 18, 20, 14, 180, 17, 8, 11, 16, …
## $ likes <dbl> 924, 322, 239, 220, 602, 290, 151, 4…
## $ unsubscribes <dbl> 3, 1, 0, 0, 3, 1, 0, 0, 0, 0, 2, 2, …
## $ new_subscribers <dbl> 54, 34, 8, 2, 31, 20, 4, 24, 14, 5, …
## $ end_screen_impressions <dbl> 46, 0, 0, 0, 4, 0, 0, 18, 0, 0, 15, …
## $ end_screen_clicks <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teaser_impressions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teaser_clicks <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ card_impressions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ card_clicks <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ views_per_playlist_start <dbl> 3.3333, 3.5000, 0.0000, 6.6667, 1.66…
## $ playlist_views <dbl> 10, 7, 11, 20, 10, 10, 10, 11, 11, 1…
## $ playlist_watch_time_hours <dbl> 0.2974, 0.5187, 0.1683, 0.0640, 0.06…
## $ clip_watch_time_hours <dbl> 0.1575, 0.0000, 0.0000, 0.0000, 0.00…
## $ clip_views <dbl> 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ you_tube_premium_watch_time_hours <dbl> 2.5358, 0.8911, 0.1838, 0.0711, 0.25…
## $ you_tube_premium_views <dbl> 152, 32, 28, 20, 39, 22, 20, 37, 27,…
## $ watched_not_skipped_percent <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ feed_impressions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ average_view_duration <dbl> 81, 156, 41, 14, 25, 182, 10, 35, 8,…
## $ views <dbl> 23531, 11478, 6153, 4398, 14659, 841…
## $ watch_time_hours <dbl> 533.1636, 500.5628, 70.7287, 17.6251…
## $ estimated_revenue_usd <dbl> 0.561, 0.648, 0.089, 0.017, 0.000, 0…
## $ impressions <dbl> 41118, 41627, 38713, 35245, 46218, 4…
As shown on above summary, our dataset consists of mostly numeric data, we have to transform and prepare the data in ways that align with our classification task, which is to classify if a video is going to have high or low subscriber growth potential based on observed factors.
We decided to work on the number of subscribers by first calculating the net growth using the differences between new_subscribers and unsubscribes. Then we will use the mean value of this newly derived features as threshold to separate the values into Low (0) and High (1) classes.
dataset <- dataset %>%
mutate(subscriber_growth = new_subscribers - unsubscribes)
threshold <- as.integer(mean(dataset$subscriber_growth, na.rm = TRUE))
cat("Threshold (mean) value:", threshold, "\n")
## Threshold (mean) value: 321
dataset <- dataset %>%
mutate(subscriber_growth = ifelse(subscriber_growth >= threshold, 1, 0))
cat("Values in the new column:", unique(dataset$subscriber_growth), "\n")
## Values in the new column: 0 1
cat("Dataset dimension after adding new column:", dim(dataset), "\n")
## Dataset dimension after adding new column: 364 35
colnames(dataset)
## [1] "video_duration" "days_since_publish"
## [3] "day" "month"
## [5] "year" "day_of_week"
## [7] "monetized_playbacks_estimate" "ad_impressions"
## [9] "new_comments" "shares"
## [11] "dislikes" "likes"
## [13] "unsubscribes" "new_subscribers"
## [15] "end_screen_impressions" "end_screen_clicks"
## [17] "teaser_impressions" "teaser_clicks"
## [19] "card_impressions" "card_clicks"
## [21] "views_per_playlist_start" "playlist_views"
## [23] "playlist_watch_time_hours" "clip_watch_time_hours"
## [25] "clip_views" "you_tube_premium_watch_time_hours"
## [27] "you_tube_premium_views" "watched_not_skipped_percent"
## [29] "feed_impressions" "average_view_duration"
## [31] "views" "watch_time_hours"
## [33] "estimated_revenue_usd" "impressions"
## [35] "subscriber_growth"
From our understanding of the dataset, We have identified “day_of_week”, “day”, “month”, “year” and “subscriber_growth” columns to be categorical data. We first check the unique values present in the columns.
unique(dataset$day_of_week)
## [1] "Thursday" "Friday" "Tuesday" "Wednesday" "Monday" "Sunday"
## [7] "Saturday"
unique(dataset$day)
## [1] 2 10 14 29 1 8 5 11 12 17 18 15 20 26 3 6 9 23 25 27 24 31 7 28 4
## [26] 22 30 16 13 19 21
unique(dataset$month)
## [1] 6 7 8 9 10 11 12 1 2 3 4 5
unique(dataset$year)
## [1] 2016 2017 2018 2019 2020 2021 2022 2023 2024
unique(dataset$subscriber_growth)
## [1] 0 1
Then we convert them into factors accordingly.
dataset$day_of_week <- factor(dataset$day_of_week,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
dataset$day <- factor(dataset$day)
dataset$month <- factor(dataset$month)
dataset$year <- factor(dataset$year)
dataset$subscriber_growth <- factor(dataset$subscriber_growth, levels = c(0, 1))
As shown below, the columns have been converted into factors.
unique(dataset$day_of_week)
## [1] Thursday Friday Tuesday Wednesday Monday Sunday Saturday
## Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
unique(dataset$day)
## [1] 2 10 14 29 1 8 5 11 12 17 18 15 20 26 3 6 9 23 25 27 24 31 7 28 4
## [26] 22 30 16 13 19 21
## 31 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 31
unique(dataset$month)
## [1] 6 7 8 9 10 11 12 1 2 3 4 5
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12
unique(dataset$year)
## [1] 2016 2017 2018 2019 2020 2021 2022 2023 2024
## Levels: 2016 2017 2018 2019 2020 2021 2022 2023 2024
unique(dataset$subscriber_growth)
## [1] 0 1
## Levels: 0 1
We use bar chart to show the distribution of each categorical columns.
There are 67.85% (247) of High Subscriber and 32.15% (117) of Low Subscriber
37.36% (136) of the dataset are published in 2017.
p1 <- ggplot(dataset, aes(x = day_of_week)) +
geom_bar() +
labs(title = "Day of Week Distribution", x = "day_of_week", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)
p2 <- ggplot(dataset, aes(x = day)) +
geom_bar() +
labs(title = "Day Distribution", x = "day", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)
p3 <- ggplot(dataset, aes(x = month)) +
geom_bar() +
labs(title = "Month Distribution", x = "month", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)
p4 <- ggplot(dataset, aes(x = year)) +
geom_bar() +
labs(title = "Year Distribution", x = "year", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)
p5 <- ggplot(dataset, aes(x = subscriber_growth)) +
geom_bar() +
labs(title = "Subscriber Growth Distribution", x = "subscriber_growth", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)
wrap_plots(p1, p2, p3, p4, p5)
And also bar chart to show the distribution of each categorical columns with respect to “subscriber_growth”.
p1 <- ggplot(dataset, aes(x = day_of_week, fill = subscriber_growth)) +
geom_bar(position = "dodge") +
labs(title = "Day of Week Distribution", x = "day_of_week", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.9), vjust = -0.5)
p2 <- ggplot(dataset, aes(x = day, fill = subscriber_growth)) +
geom_bar(position = "dodge") +
labs(title = "Day Distribution", x = "day", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.9), vjust = -0.5)
p3 <- ggplot(dataset, aes(x = month, fill = subscriber_growth)) +
geom_bar(position = "dodge") +
labs(title = "Month Distribution", x = "month", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.9), vjust = -0.5)
p4 <- ggplot(dataset, aes(x = month, fill = subscriber_growth)) +
geom_bar(position = "dodge") +
labs(title = "Month Distribution", x = "month", y = "Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.9), vjust = -0.5)
wrap_plots(p1, p2, p3, p4)
Below are the numerical data in our dataset.
numeric_columns_names <- dataset %>%
select(where(~ is.numeric(.))) %>%
colnames()
numeric_columns_names
## [1] "video_duration" "days_since_publish"
## [3] "monetized_playbacks_estimate" "ad_impressions"
## [5] "new_comments" "shares"
## [7] "dislikes" "likes"
## [9] "unsubscribes" "new_subscribers"
## [11] "end_screen_impressions" "end_screen_clicks"
## [13] "teaser_impressions" "teaser_clicks"
## [15] "card_impressions" "card_clicks"
## [17] "views_per_playlist_start" "playlist_views"
## [19] "playlist_watch_time_hours" "clip_watch_time_hours"
## [21] "clip_views" "you_tube_premium_watch_time_hours"
## [23] "you_tube_premium_views" "watched_not_skipped_percent"
## [25] "feed_impressions" "average_view_duration"
## [27] "views" "watch_time_hours"
## [29] "estimated_revenue_usd" "impressions"
We use histogram chart to look at the distribution of these columns.
From the histogram chart, we can see that majority of the numerical data are right-skewed distribution.
hist_plot_list <- list()
for (col in numeric_columns_names){
plot <- ggplot(dataset, aes_string(x=col)) + geom_histogram() + labs(y="Frequency")
hist_plot_list[[col]] <- plot
}
wrap_plots(hist_plot_list, ncol=5)
We use scatterplot chart to show the distribution of the numerical columns with respect to “estimated_revenue_usd” and “subscriber_growth”.
hist_plot_list <- list()
for (col in numeric_columns_names){
if (col == "estimated_revenue_usd") {
next
}
plot <- ggplot(dataset, aes_string(x=col, y="estimated_revenue_usd", color = "subscriber_growth")) + geom_point() +
geom_smooth(method = "loess", se = FALSE, color = "black") +
theme(legend.position = "top")
hist_plot_list[[col]] <- plot
}
wrap_plots(hist_plot_list, ncol=5)
We will convert factors into numeric values for machine learning algorithms that require numerical input.
glimpse(dataset[, sapply(dataset, is.factor)])
## Rows: 364
## Columns: 5
## $ day <fct> 2, 10, 14, 29, 1, 8, 5, 8, 11, 12, 17, 5, 11, 18, 1,…
## $ month <fct> 6, 6, 6, 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 10…
## $ year <fct> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
## $ day_of_week <fct> Thursday, Friday, Tuesday, Wednesday, Friday, Friday…
## $ subscriber_growth <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
# to avoid converting the target variable for classification into numeric
dataset <- dataset %>%
mutate_if(~is.factor(.) && !all(levels(.) %in% c(0, 1)), as.numeric)
After label encoding:
glimpse(dataset)
## Rows: 364
## Columns: 35
## $ video_duration <dbl> 201, 391, 133, 14, 45, 496, 9, 34, 1…
## $ days_since_publish <dbl> 0, 8, 4, 15, 2, 7, 28, 3, 3, 1, 5, 1…
## $ day <dbl> 2, 10, 14, 29, 1, 8, 5, 8, 11, 12, 1…
## $ month <dbl> 6, 6, 6, 6, 7, 7, 8, 8, 8, 8, 8, 9, …
## $ year <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day_of_week <dbl> 4, 5, 2, 3, 5, 5, 5, 1, 4, 5, 3, 1, …
## $ monetized_playbacks_estimate <dbl> 723, 727, 76, 18, 0, 491, 32, 404, 1…
## $ ad_impressions <dbl> 981, 861, 88, 35, 0, 673, 43, 597, 2…
## $ new_comments <dbl> 91, 35, 0, 12, 50, 27, 16, 33, 37, 1…
## $ shares <dbl> 12, 5, 4, 7, 7, 3, 14, 37, 21, 3, 11…
## $ dislikes <dbl> 30, 18, 20, 14, 180, 17, 8, 11, 16, …
## $ likes <dbl> 924, 322, 239, 220, 602, 290, 151, 4…
## $ unsubscribes <dbl> 3, 1, 0, 0, 3, 1, 0, 0, 0, 0, 2, 2, …
## $ new_subscribers <dbl> 54, 34, 8, 2, 31, 20, 4, 24, 14, 5, …
## $ end_screen_impressions <dbl> 46, 0, 0, 0, 4, 0, 0, 18, 0, 0, 15, …
## $ end_screen_clicks <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teaser_impressions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teaser_clicks <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ card_impressions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ card_clicks <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ views_per_playlist_start <dbl> 3.3333, 3.5000, 0.0000, 6.6667, 1.66…
## $ playlist_views <dbl> 10, 7, 11, 20, 10, 10, 10, 11, 11, 1…
## $ playlist_watch_time_hours <dbl> 0.2974, 0.5187, 0.1683, 0.0640, 0.06…
## $ clip_watch_time_hours <dbl> 0.1575, 0.0000, 0.0000, 0.0000, 0.00…
## $ clip_views <dbl> 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ you_tube_premium_watch_time_hours <dbl> 2.5358, 0.8911, 0.1838, 0.0711, 0.25…
## $ you_tube_premium_views <dbl> 152, 32, 28, 20, 39, 22, 20, 37, 27,…
## $ watched_not_skipped_percent <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ feed_impressions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ average_view_duration <dbl> 81, 156, 41, 14, 25, 182, 10, 35, 8,…
## $ views <dbl> 23531, 11478, 6153, 4398, 14659, 841…
## $ watch_time_hours <dbl> 533.1636, 500.5628, 70.7287, 17.6251…
## $ estimated_revenue_usd <dbl> 0.561, 0.648, 0.089, 0.017, 0.000, 0…
## $ impressions <dbl> 41118, 41627, 38713, 35245, 46218, 4…
## $ subscriber_growth <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Since subscriber_growth is a derived feature we prepared for classification task, we removed it from our preparation for regression task.
cols_to_remove <- c("subscriber_growth")
cat("Dataset dimension before removing columns:", dim(dataset), "\n")
## Dataset dimension before removing columns: 364 35
regression_dataset <- dataset %>%
select(-cols_to_remove)
cat("Dataset dimension after removing column:", dim(regression_dataset), "\n")
## Dataset dimension after removing column: 364 34
Standardization is done to ensure that the analysis will not be dominated by features with larger scale. In our code, the transformation is applied only to columns that are numeric. Since estimated_revenue_usd is the target variable in our regression problem, we make sure to exclude it from standardization as it could distort the values and interfere with regression model’s ability to predict accurately.
regression_dataset <- regression_dataset %>%
mutate(across(where(is.numeric), scale))
target_column <- "estimated_revenue_usd"
regression_dataset[[target_column]] <- dataset[[target_column]]
Correlation analysis between features and the target variable can help us identify strong predictors to focus on the most relevant features. This step can help to improve model efficiency and simplify the model by reducing the number of features to consider. It can also help mitigate the risk of overfitting as model is less likely to learn noise from the data.
We pick a threshold of 0.3 and keep features with an absolute correlation coefficient equal to or greater than that as it is our goal to keep features with moderate association with the target variable.
correlation_results <- cor(regression_dataset, use = "complete.obs")[, "estimated_revenue_usd"]
sorted_correlation_results <- sort(correlation_results, decreasing = TRUE)
sorted_correlation_results
## estimated_revenue_usd monetized_playbacks_estimate
## 1.00000000 0.94415536
## ad_impressions you_tube_premium_watch_time_hours
## 0.82553441 0.58818300
## you_tube_premium_views clip_views
## 0.57453255 0.47250389
## impressions watch_time_hours
## 0.46993994 0.43199850
## likes new_subscribers
## 0.42554355 0.41293203
## clip_watch_time_hours playlist_watch_time_hours
## 0.40509641 0.37645932
## shares views
## 0.36208904 0.35790110
## teaser_impressions end_screen_impressions
## 0.29957602 0.21852646
## card_clicks playlist_views
## 0.21722540 0.21078692
## average_view_duration teaser_clicks
## 0.20728645 0.20231039
## new_comments year
## 0.20167705 0.19833342
## days_since_publish card_impressions
## 0.17689012 0.16111830
## unsubscribes dislikes
## 0.15548499 0.14381436
## video_duration month
## 0.13576667 0.12153583
## end_screen_clicks day
## 0.08489775 0.03859435
## watched_not_skipped_percent feed_impressions
## 0.02559761 0.02559761
## views_per_playlist_start day_of_week
## -0.02274427 -0.06396180
correlation_df <- tibble(
Column = names(sorted_correlation_results),
Correlation = sorted_correlation_results
) %>%
filter(abs(Correlation) >= 0.3)
regression_dataset <- regression_dataset %>%
select(all_of(correlation_df$Column))
colnames(regression_dataset)
## [1] "estimated_revenue_usd" "monetized_playbacks_estimate"
## [3] "ad_impressions" "you_tube_premium_watch_time_hours"
## [5] "you_tube_premium_views" "clip_views"
## [7] "impressions" "watch_time_hours"
## [9] "likes" "new_subscribers"
## [11] "clip_watch_time_hours" "playlist_watch_time_hours"
## [13] "shares" "views"
In the end, apart from the target variable, we ended up with 13 features in our cleaned dataset for regression problem.
We remove new_subscribers and unsubscribes column from our classification dataset.
cols_to_remove <- c("new_subscribers", "unsubscribes")
cat("Dataset dimension before removing columns:", dim(dataset), "\n")
## Dataset dimension before removing columns: 364 35
classification_dataset <- dataset %>%
select(-cols_to_remove)
cat("Dataset dimension after removing column:", dim(classification_dataset), "\n")
## Dataset dimension after removing column: 364 33
Again, we go through the same process as with regression dataset by standardizing our data before performing feature selection.
classification_dataset <- classification_dataset %>%
mutate(across(where(is.numeric), scale))
target_column <- "subscriber_growth"
classification_dataset[[target_column]] <- dataset[[target_column]]
classification_dataset$subscriber_growth <- as.numeric(classification_dataset$subscriber_growth) - 1
correlation_results <- cor(classification_dataset, use = "complete.obs")[, "subscriber_growth"]
sorted_correlation_results <- sort(correlation_results, decreasing = TRUE)
sorted_correlation_results
## subscriber_growth likes
## 1.00000000 0.69957606
## views watch_time_hours
## 0.68007090 0.63970462
## impressions you_tube_premium_views
## 0.63931946 0.63524804
## shares you_tube_premium_watch_time_hours
## 0.56135916 0.54961989
## dislikes new_comments
## 0.52902280 0.51674931
## clip_views monetized_playbacks_estimate
## 0.30666774 0.28333213
## estimated_revenue_usd playlist_watch_time_hours
## 0.27956059 0.23455686
## playlist_views clip_watch_time_hours
## 0.22275060 0.22171100
## ad_impressions day_of_week
## 0.19906096 0.12537902
## days_since_publish end_screen_impressions
## 0.11972858 0.10526641
## month end_screen_clicks
## 0.10120653 0.07702524
## teaser_impressions feed_impressions
## 0.07012644 0.06738849
## watched_not_skipped_percent teaser_clicks
## 0.06738849 0.04659678
## card_clicks card_impressions
## 0.03601428 0.02697676
## average_view_duration year
## -0.01231093 -0.02826497
## day views_per_playlist_start
## -0.02937354 -0.03913703
## video_duration
## -0.04753051
correlation_df <- tibble(
Column = names(sorted_correlation_results),
Correlation = sorted_correlation_results
) %>%
filter(abs(Correlation) >= 0.3)
classification_dataset <- classification_dataset %>%
select(all_of(correlation_df$Column))
colnames(classification_dataset)
## [1] "subscriber_growth" "likes"
## [3] "views" "watch_time_hours"
## [5] "impressions" "you_tube_premium_views"
## [7] "shares" "you_tube_premium_watch_time_hours"
## [9] "dislikes" "new_comments"
## [11] "clip_views"
# convert back to factor
classification_dataset$subscriber_growth <- factor(classification_dataset$subscriber_growth, levels = c(0, 1))
In the end, apart from the target variable, we ended up with 10
features in our cleaned dataset for classification problem.
Accurately predicting revenue can help content creators and platforms
optimize strategies for video production and audience engagement. To
achieve this, we evaluated three predictive models: Linear Regression,
Random Forest Regression and XGBoost. Linear Regression offers
simplicity and interpretability, while Random Forest and XGBoost account
for non-linear relationships and feature interactions, making them
suitable for complex datasets.
Splitting the data into training and testing sets is important to check how well the model works on new data. The model is trained using one part of the data and tested on the other, which mimics real-world situations. This helps avoid overfitting, where the model performs well on the training data but fails with new data. It also gives a clear and fair measure of the model’s accuracy, helping us choose the best model for making predictions. We use a ratio of 80:20 on our train-test split ratio because our dataset size is on the smaller scale.
set.seed(130) # For reproducibility
train_index <- createDataPartition(regression_dataset$estimated_revenue_usd, p = 0.8, list = FALSE)
regression_train_data <- regression_dataset[train_index, ]
regression_test_data <- regression_dataset[-train_index, ]
cat("Train data dimension:", dim(regression_train_data))
## Train data dimension: 292 14
cat("Test data dimension:", dim(regression_test_data))
## Test data dimension: 72 14
model_lm <- lm(estimated_revenue_usd ~ ., data = regression_train_data)
predictions_lm <- predict(model_lm, newdata = regression_test_data)
model_rf <- randomForest(estimated_revenue_usd ~ ., data = regression_train_data, ntree = 500)
predictions_rf <- predict(model_rf, newdata = regression_test_data)
train_matrix <- model.matrix(estimated_revenue_usd ~ ., data = regression_train_data)[, -1]
test_matrix <- model.matrix(estimated_revenue_usd ~ ., data = regression_test_data)[, -1]
model_xgb <- xgboost(data = train_matrix, label = regression_train_data$estimated_revenue_usd, nrounds = 100, objective = "reg:squarederror", verbose = 0)
predictions_xgb <- predict(model_xgb, newdata = test_matrix)
Evaluation metrics are critical for assessing the performance of predictive models and ensuring they meet the desired accuracy. Metrics like Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) measure the average difference between predicted and actual values, with RMSE giving more weight to larger errors. R-squared (R²) indicates how well the model explains the variation in the target variable, with values closer to 1 showing better performance. By using these metrics, we can compare different models and choose the one that best balances accuracy and generalization.
# rmse
rmse <- function(actual, predicted) {
sqrt(mean((actual - predicted)^2))
}
# mae
mae <- function(actual, predicted) {
mean(abs(actual - predicted))
}
# r-squared
rsquared <- function(actual, predicted) {
cor(actual, predicted)^2
}
metrics <- data.frame(
Model = c("Linear Regression", "Random Forest", "XGBoost"),
RMSE = c(rmse(regression_test_data$estimated_revenue_usd, predictions_lm),
rmse(regression_test_data$estimated_revenue_usd, predictions_rf),
rmse(regression_test_data$estimated_revenue_usd, predictions_xgb)),
MAE = c(mae(regression_test_data$estimated_revenue_usd, predictions_lm),
mae(regression_test_data$estimated_revenue_usd, predictions_rf),
mae(regression_test_data$estimated_revenue_usd, predictions_xgb)),
R2 = c(rsquared(regression_test_data$estimated_revenue_usd, predictions_lm),
rsquared(regression_test_data$estimated_revenue_usd, predictions_rf),
rsquared(regression_test_data$estimated_revenue_usd, predictions_xgb))
)
metrics
## Model RMSE MAE R2
## 1 Linear Regression 3.431527 2.280876 0.8498157
## 2 Random Forest 4.102293 2.387962 0.8005550
## 3 XGBoost 3.216333 1.707398 0.8675043
Our goal is to choose the model with the lowest RMSE and MAE and highest R² for the best predictive performance. From the table and plot, we can see that XGBoost outperforms the other models based on all three metrics, possibly because it can handle non-linear relationships and interactions between features, which are prevalent in forecasting tasks like YouTube revenue.
However, the absolute values of the evaluation metrics suggest that there may still be room or improvement. This is further supported by the table below showing the actual value vs predicted value by model.
comparison <- data.frame(Actual = regression_test_data$estimated_revenue_usd,
Predicted = predictions_xgb)
head(comparison)
## Actual Predicted
## 1 0.000 0.3305748
## 2 0.478 0.3543185
## 3 6.996 12.7158175
## 4 0.386 0.3092237
## 5 6.795 8.9117165
## 6 0.113 0.2099476
Similar to regression task, we also perform train-test split on our classification dataset prepared in previous phase.
set.seed(50) # For reproducibility
train_index <- createDataPartition(classification_dataset$subscriber_growth, p = 0.8, list = FALSE)
classification_train_data <- classification_dataset[train_index, ]
classification_test_data <- classification_dataset[-train_index, ]
cat("Train data dimension:", dim(classification_train_data))
## Train data dimension: 292 11
cat("Test data dimension:", dim(classification_test_data))
## Test data dimension: 72 11
Logistic Regression is chosen for its simplicity and interpretability, providing clear insights into the influence of each feature on the probability of high or low subscriber growth.
logistic_model <- glm(subscriber_growth ~ ., data = classification_train_data, family = binomial)
predictions_logistic <- predict(logistic_model, newdata = classification_test_data, type = "response")
predictions_logistic <- ifelse(predictions_logistic > 0.5, 1, 0)
Random Forest is selected due to its ability to handle complex relationships between features and its robustness to overfitting, making it suitable for datasets with interactions and non-linear patterns.
rf_model <- randomForest(subscriber_growth ~ ., data = classification_train_data, importance = TRUE)
predictions_rf <- predict(rf_model, newdata = classification_test_data)
SVM is included for its effectiveness in high-dimensional spaces and its ability to create decision boundaries that are optimal for classifying non-linear data, often yielding strong performance in classification tasks.
svm_model <- svm(subscriber_growth ~ ., data = classification_train_data, probability = TRUE)
predictions_svm <- predict(svm_model, newdata = classification_test_data)
We choose accuracy, precision, recall and F1 score as the evaluation metrics for classification task. Together, these metrics provide a comprehensive understanding of the model’s effectiveness in correctly identifying and classifying YouTube videos based on their growth potential.
# Confusion Matrix for Logistic Regression
confusion_logistic <- confusionMatrix(factor(predictions_logistic), classification_test_data$subscriber_growth)
accuracy_logistic <- confusion_logistic$overall['Accuracy']
precision_logistic <- confusion_logistic$byClass['Pos Pred Value']
recall_logistic <- confusion_logistic$byClass['Sensitivity']
f1_logistic <- 2 * (precision_logistic * recall_logistic) / (precision_logistic + recall_logistic)
# Confusion Matrix for Random Forest
confusion_rf <- confusionMatrix(predictions_rf, classification_test_data$subscriber_growth)
accuracy_rf <- confusion_rf$overall['Accuracy']
precision_rf <- confusion_rf$byClass['Pos Pred Value']
recall_rf <- confusion_rf$byClass['Sensitivity']
f1_rf <- 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf)
# Confusion Matrix for SVM
confusion_svm <- confusionMatrix(predictions_svm, classification_test_data$subscriber_growth)
accuracy_svm <- confusion_svm$overall['Accuracy']
precision_svm <- confusion_svm$byClass['Pos Pred Value']
recall_svm <- confusion_svm$byClass['Sensitivity']
f1_svm <- 2 * (precision_svm * recall_svm) / (precision_svm + recall_svm)
# Create a data frame to hold the evaluation metrics
evaluation_results <- data.frame(
Model = c("Logistic Regression", "Random Forest", "SVM"),
Accuracy = c(accuracy_logistic, accuracy_rf, accuracy_svm),
Precision = c(precision_logistic, precision_rf, precision_svm),
Recall = c(recall_logistic, recall_rf, recall_svm),
F1_Score = c(f1_logistic, f1_rf, f1_svm)
)
evaluation_results
## Model Accuracy Precision Recall F1_Score
## 1 Logistic Regression 0.9166667 0.9574468 0.9183673 0.9375000
## 2 Random Forest 0.9305556 0.9583333 0.9387755 0.9484536
## 3 SVM 0.9444444 0.9591837 0.9591837 0.9591837
Overall, SVM model is the best choice for classifying YouTube videos
into high and low subscriber growth potential based on the provided
metrics, possibly due to its ability to handle complex, non-linear
relationships in the data and its focus on maximizing the margin between
classes. It achieves the highest accuracy, indicating its superior
overall classification performance, and the highest F1-score, which
balances precision and recall effectively. These metrics highlight SVM’s
ability to generalize well in this classification task.
In this project, we successfully addressed two key objectives related to YouTube video analytics. For forecasting Estimated Revenue (USD), XGBoost emerged as the best-performing model, thanks to its ability to capture non-linear relationships and feature interactions. However, the absolute evaluation metrics indicate potential for further improvement. For classifying videos into high and low subscriber growth potential, SVM proved to be the most effective model in handling complex, non-linear relationships and maximizing class margins. These results demonstrate the models’ robustness while highlighting areas for refinement in future work.
In future studies, the inclusion of a more comprehensive and diverse dataset would likely yield significant improvements in model performance.