WQD7004: YouTube Channel Performance Analytics

Introduction and Project Objectives

YouTube has become a crucial platform for content creators and businesses. With more than 2.7 billion monthly active users (Times, 2024), it offers endless opportunities to engage with diverse audiences worldwide. Understanding video performance is essential for optimizing growth potential, and maximizing revenue. This project aims to develop a regression model and classification model.To achieve the objective, we compare 3 model for each problem to evaluate the best performing models, and each model speciality will be explained throughout the modelling process. This project used a dataset derived from Kaggle named “YouTube Channel Performance Analytics” dataset (2016-2024).

The dataset has the following initial characteristics:

364 rows and 70 features.
The dataset is primarily numerical, with a few categorical features such as ‘Day of Week’, and the ‘Video Publish Time’.
The average video length is 11 minutes and generates approximately $8.85 in revenue, with some earning up to $103.

Questions:

How to predict Estimated Revenue (USD) of a video based on key factors that drive revenue?
How can a YouTube videos be classified based on their high and low subscriber growth potential?

Objective:

To develop a regression model to identify key factors that drive revenue and forecast earnings for future videos.
To classify the YouTube video into high and low subscriber growth potential.

Data Preparation & EDA

Import necessary packages

To begin, we load the required R packages that provide the specialized tools and functions needed to carry out our analysis efficiently.

library(readr)
library(dplyr)
library(ggplot2)
library(knitr)
library(janitor)
library(patchwork)
library(caret)
library(randomForest)
library(xgboost)
library(tidyr)
library(e1071)
library(ROCR)

Load data

First, we load our dataset from the csv file.

dataset <- as.data.frame(read_csv("archive/youtube_channel_real_performance_analytics.csv"))

Inspect data

We notice that our column headers contain spaces and special characters, which may cause issues.

colnames(dataset)

##  [1] "ID"                                 "Video Duration"                    
##  [3] "Video Publish Time"                 "Days Since Publish"                
##  [5] "Day"                                "Month"                             
##  [7] "Year"                               "Day of Week"                       
##  [9] "Revenue per 1000 Views (USD)"       "Monetized Playbacks (Estimate)"    
## [11] "Playback-Based CPM (USD)"           "CPM (USD)"                         
## [13] "Ad Impressions"                     "Estimated AdSense Revenue (USD)"   
## [15] "DoubleClick Revenue (USD)"          "YouTube Ads Revenue (USD)"         
## [17] "Watch Page Ads Revenue (USD)"       "YouTube Premium (USD)"             
## [19] "Transaction Revenue (USD)"          "Transactions"                      
## [21] "Revenue from Transactions (USD)"    "Reactions"                         
## [23] "Chat Messages Count"                "Reminders Set"                     
## [25] "Stream Hours"                       "Remix Views"                       
## [27] "Remix Count"                        "Subscribers from Posts"            
## [29] "New Comments"                       "Shares"                            
## [31] "Like Rate (%)"                      "Dislikes"                          
## [33] "Likes"                              "Unsubscribes"                      
## [35] "New Subscribers"                    "Returned Items (USD)"              
## [37] "Unconfirmed Commissions (USD)"      "Approved Commissions (USD)"        
## [39] "Orders"                             "Total Sales Volume (USD)"          
## [41] "End Screen Click-Through Rate (%)"  "End Screen Impressions"            
## [43] "End Screen Clicks"                  "Teaser Click-Through Rate (%)"     
## [45] "Teaser Impressions"                 "Teaser Clicks"                     
## [47] "Card Click-Through Rate (%)"        "Card Impressions"                  
## [49] "Card Clicks"                        "Views per Playlist Start"          
## [51] "Playlist Views"                     "Playlist Watch Time (hours)"       
## [53] "Clip Watch Time (hours)"            "Clip Views"                        
## [55] "YouTube Premium Watch Time (hours)" "YouTube Premium Views"             
## [57] "Returning Viewers"                  "New Viewers"                       
## [59] "Average Views per User"             "Unique Viewers"                    
## [61] "Watched (Not Skipped) (%)"          "Feed Impressions"                  
## [63] "Average View Percentage (%)"        "Average View Duration"             
## [65] "Views"                              "Watch Time (hours)"                
## [67] "Subscribers"                        "Estimated Revenue (USD)"           
## [69] "Impressions"                        "Video Thumbnail CTR (%)"

To address this, we use the janitor package to clean them up.

dataset <- dataset %>% clean_names()

Below are the new column names of the dataset.

colnames(dataset)

##  [1] "id"                                   
##  [2] "video_duration"                       
##  [3] "video_publish_time"                   
##  [4] "days_since_publish"                   
##  [5] "day"                                  
##  [6] "month"                                
##  [7] "year"                                 
##  [8] "day_of_week"                          
##  [9] "revenue_per_1000_views_usd"           
## [10] "monetized_playbacks_estimate"         
## [11] "playback_based_cpm_usd"               
## [12] "cpm_usd"                              
## [13] "ad_impressions"                       
## [14] "estimated_ad_sense_revenue_usd"       
## [15] "double_click_revenue_usd"             
## [16] "you_tube_ads_revenue_usd"             
## [17] "watch_page_ads_revenue_usd"           
## [18] "you_tube_premium_usd"                 
## [19] "transaction_revenue_usd"              
## [20] "transactions"                         
## [21] "revenue_from_transactions_usd"        
## [22] "reactions"                            
## [23] "chat_messages_count"                  
## [24] "reminders_set"                        
## [25] "stream_hours"                         
## [26] "remix_views"                          
## [27] "remix_count"                          
## [28] "subscribers_from_posts"               
## [29] "new_comments"                         
## [30] "shares"                               
## [31] "like_rate_percent"                    
## [32] "dislikes"                             
## [33] "likes"                                
## [34] "unsubscribes"                         
## [35] "new_subscribers"                      
## [36] "returned_items_usd"                   
## [37] "unconfirmed_commissions_usd"          
## [38] "approved_commissions_usd"             
## [39] "orders"                               
## [40] "total_sales_volume_usd"               
## [41] "end_screen_click_through_rate_percent"
## [42] "end_screen_impressions"               
## [43] "end_screen_clicks"                    
## [44] "teaser_click_through_rate_percent"    
## [45] "teaser_impressions"                   
## [46] "teaser_clicks"                        
## [47] "card_click_through_rate_percent"      
## [48] "card_impressions"                     
## [49] "card_clicks"                          
## [50] "views_per_playlist_start"             
## [51] "playlist_views"                       
## [52] "playlist_watch_time_hours"            
## [53] "clip_watch_time_hours"                
## [54] "clip_views"                           
## [55] "you_tube_premium_watch_time_hours"    
## [56] "you_tube_premium_views"               
## [57] "returning_viewers"                    
## [58] "new_viewers"                          
## [59] "average_views_per_user"               
## [60] "unique_viewers"                       
## [61] "watched_not_skipped_percent"          
## [62] "feed_impressions"                     
## [63] "average_view_percentage_percent"      
## [64] "average_view_duration"                
## [65] "views"                                
## [66] "watch_time_hours"                     
## [67] "subscribers"                          
## [68] "estimated_revenue_usd"                
## [69] "impressions"                          
## [70] "video_thumbnail_ctr_percent"

Next, we analyze the structure of our dataset to gain insight into its organization and identify the necessary steps for the data cleaning process. This allows us to prepare the dataset effectively for answering the questions outlined earlier.

From the code below, we find out that our dataset consists of 364 rows and 70 columns.

dim(dataset)

## [1] 364  70

Remove irrelevant features

Identifier column as well as columns that contain only a single constant value do not contribute any predictive information to our analysis and shall be removed.

## Columns that contain only a single constant value:

##  [1] "transaction_revenue_usd"       "transactions"                 
##  [3] "revenue_from_transactions_usd" "reactions"                    
##  [5] "chat_messages_count"           "reminders_set"                
##  [7] "stream_hours"                  "remix_views"                  
##  [9] "remix_count"                   "subscribers_from_posts"       
## [11] "returned_items_usd"            "unconfirmed_commissions_usd"  
## [13] "approved_commissions_usd"      "orders"                       
## [15] "total_sales_volume_usd"        "returning_viewers"            
## [17] "new_viewers"                   "average_views_per_user"       
## [19] "unique_viewers"

As a result, a total of 20 columns are removed from the dataset.

cat("Dataset dimension before removing columns:", dim(dataset), "\n")

## Dataset dimension before removing columns: 364 70

dataset <- dataset %>% select(-id) %>% select(where(~ n_distinct(.) > 1))
cat("Dataset dimension after removing columns:", dim(dataset), "\n")

## Dataset dimension after removing columns: 364 50

colnames(dataset)

##  [1] "video_duration"                       
##  [2] "video_publish_time"                   
##  [3] "days_since_publish"                   
##  [4] "day"                                  
##  [5] "month"                                
##  [6] "year"                                 
##  [7] "day_of_week"                          
##  [8] "revenue_per_1000_views_usd"           
##  [9] "monetized_playbacks_estimate"         
## [10] "playback_based_cpm_usd"               
## [11] "cpm_usd"                              
## [12] "ad_impressions"                       
## [13] "estimated_ad_sense_revenue_usd"       
## [14] "double_click_revenue_usd"             
## [15] "you_tube_ads_revenue_usd"             
## [16] "watch_page_ads_revenue_usd"           
## [17] "you_tube_premium_usd"                 
## [18] "new_comments"                         
## [19] "shares"                               
## [20] "like_rate_percent"                    
## [21] "dislikes"                             
## [22] "likes"                                
## [23] "unsubscribes"                         
## [24] "new_subscribers"                      
## [25] "end_screen_click_through_rate_percent"
## [26] "end_screen_impressions"               
## [27] "end_screen_clicks"                    
## [28] "teaser_click_through_rate_percent"    
## [29] "teaser_impressions"                   
## [30] "teaser_clicks"                        
## [31] "card_click_through_rate_percent"      
## [32] "card_impressions"                     
## [33] "card_clicks"                          
## [34] "views_per_playlist_start"             
## [35] "playlist_views"                       
## [36] "playlist_watch_time_hours"            
## [37] "clip_watch_time_hours"                
## [38] "clip_views"                           
## [39] "you_tube_premium_watch_time_hours"    
## [40] "you_tube_premium_views"               
## [41] "watched_not_skipped_percent"          
## [42] "feed_impressions"                     
## [43] "average_view_percentage_percent"      
## [44] "average_view_duration"                
## [45] "views"                                
## [46] "watch_time_hours"                     
## [47] "subscribers"                          
## [48] "estimated_revenue_usd"                
## [49] "impressions"                          
## [50] "video_thumbnail_ctr_percent"

Data reduction

Despite removing columns that do not contribute, we are still left with 50 features. Training and evaluating models with these many features require more computational resources and time. Also, not all features contribute meaningfully to the target variable or analysis and these redundant features could introduce noise that negatively impact our model performance.

Thus, data reduction is crucial to ensure that we retain only the critical information while minimizing noise, redundancy and computational burden.

Remove money-related features (except estimated_revenue_usd)

The decision is made based on several justifications.

For regression problem: Forecasting Estimated Revenue (USD) for YouTube Videos

Money-related features are highly correlated with the target variable (estimated_revenue_usd), including them will create circular logic, where model learns to replicate instead of forecasting
Excluding features tied directly to current revenue ensures the model can generalize better to future data, where revenue dynamics might differ

For classification problem: Classifying YouTube Videos into High and Low Subscriber Growth Potential

Monetary features may introduce noise into the model since they are not causally linked to subscriber behavior
Excluding revenue-related features ensures a simpler model that focuses on behavioral factors, making it easier to interpret the results

colnames_usd <- sapply(names(dataset), function(x) endsWith(x, "usd") && x != "estimated_revenue_usd")
names(dataset)[colnames_usd]

## [1] "revenue_per_1000_views_usd"     "playback_based_cpm_usd"        
## [3] "cpm_usd"                        "estimated_ad_sense_revenue_usd"
## [5] "double_click_revenue_usd"       "you_tube_ads_revenue_usd"      
## [7] "watch_page_ads_revenue_usd"     "you_tube_premium_usd"

cat("Dataset dimension before removing columns:", dim(dataset), "\n")

## Dataset dimension before removing columns: 364 50

dataset <- dataset %>% 
  select(-which(endsWith(names(dataset), "usd") & names(dataset) != "estimated_revenue_usd"))
cat("Dataset dimension after removing columns:", dim(dataset), "\n")

## Dataset dimension after removing columns: 364 42

colnames(dataset)

##  [1] "video_duration"                       
##  [2] "video_publish_time"                   
##  [3] "days_since_publish"                   
##  [4] "day"                                  
##  [5] "month"                                
##  [6] "year"                                 
##  [7] "day_of_week"                          
##  [8] "monetized_playbacks_estimate"         
##  [9] "ad_impressions"                       
## [10] "new_comments"                         
## [11] "shares"                               
## [12] "like_rate_percent"                    
## [13] "dislikes"                             
## [14] "likes"                                
## [15] "unsubscribes"                         
## [16] "new_subscribers"                      
## [17] "end_screen_click_through_rate_percent"
## [18] "end_screen_impressions"               
## [19] "end_screen_clicks"                    
## [20] "teaser_click_through_rate_percent"    
## [21] "teaser_impressions"                   
## [22] "teaser_clicks"                        
## [23] "card_click_through_rate_percent"      
## [24] "card_impressions"                     
## [25] "card_clicks"                          
## [26] "views_per_playlist_start"             
## [27] "playlist_views"                       
## [28] "playlist_watch_time_hours"            
## [29] "clip_watch_time_hours"                
## [30] "clip_views"                           
## [31] "you_tube_premium_watch_time_hours"    
## [32] "you_tube_premium_views"               
## [33] "watched_not_skipped_percent"          
## [34] "feed_impressions"                     
## [35] "average_view_percentage_percent"      
## [36] "average_view_duration"                
## [37] "views"                                
## [38] "watch_time_hours"                     
## [39] "subscribers"                          
## [40] "estimated_revenue_usd"                
## [41] "impressions"                          
## [42] "video_thumbnail_ctr_percent"

Remove derived or aggregated features

Upon closer inspection, we find out that some of the columns are derived or aggregated features. These information reiterations cause multicollinearity issue and can lead to difficulty in isolating individual effect of each feature on the target variable. Thus, we remove them to ensure that our model remains interpretable and can generalize better to new data.

E.g.

“days_since_published” column is calculated from “video_publish_time” column.

“day”, “month” and “year” column are extracted from “video_publish_time” column.

“like_rate_percent” column is derived from dividing “likes” column from the total of both “like” and “dislikes” column.

“subscribers” are the subtraction of “Unsubscribes” from “new_subscribers” and etc.

cols_to_remove <- c("video_publish_time", "like_rate_percent", "end_screen_click_through_rate_percent", "teaser_click_through_rate_percent", "card_click_through_rate_percent", "average_view_percentage_percent", "subscribers", "video_thumbnail_ctr_percent", "average_view_percentage_percent")
cat("Dataset dimension before removing columns:", dim(dataset), "\n")

## Dataset dimension before removing columns: 364 42

dataset <- dataset %>% 
  select(-cols_to_remove)
cat("Dataset dimension after removing columns:", dim(dataset), "\n")

## Dataset dimension after removing columns: 364 34

colnames(dataset)

##  [1] "video_duration"                    "days_since_publish"               
##  [3] "day"                               "month"                            
##  [5] "year"                              "day_of_week"                      
##  [7] "monetized_playbacks_estimate"      "ad_impressions"                   
##  [9] "new_comments"                      "shares"                           
## [11] "dislikes"                          "likes"                            
## [13] "unsubscribes"                      "new_subscribers"                  
## [15] "end_screen_impressions"            "end_screen_clicks"                
## [17] "teaser_impressions"                "teaser_clicks"                    
## [19] "card_impressions"                  "card_clicks"                      
## [21] "views_per_playlist_start"          "playlist_views"                   
## [23] "playlist_watch_time_hours"         "clip_watch_time_hours"            
## [25] "clip_views"                        "you_tube_premium_watch_time_hours"
## [27] "you_tube_premium_views"            "watched_not_skipped_percent"      
## [29] "feed_impressions"                  "average_view_duration"            
## [31] "views"                             "watch_time_hours"                 
## [33] "estimated_revenue_usd"             "impressions"

Handle missing values

We then check if there is any missing values in our remaining dataset. From the result, none is present in our case.

data.frame(Features = names(dataset), MissingValues = colSums(is.na(dataset))) %>%
  arrange(desc(MissingValues)) %>%
  print(row.names = FALSE)

##                           Features MissingValues
##                     video_duration             0
##                 days_since_publish             0
##                                day             0
##                              month             0
##                               year             0
##                        day_of_week             0
##       monetized_playbacks_estimate             0
##                     ad_impressions             0
##                       new_comments             0
##                             shares             0
##                           dislikes             0
##                              likes             0
##                       unsubscribes             0
##                    new_subscribers             0
##             end_screen_impressions             0
##                  end_screen_clicks             0
##                 teaser_impressions             0
##                      teaser_clicks             0
##                   card_impressions             0
##                        card_clicks             0
##           views_per_playlist_start             0
##                     playlist_views             0
##          playlist_watch_time_hours             0
##              clip_watch_time_hours             0
##                         clip_views             0
##  you_tube_premium_watch_time_hours             0
##             you_tube_premium_views             0
##        watched_not_skipped_percent             0
##                   feed_impressions             0
##              average_view_duration             0
##                              views             0
##                   watch_time_hours             0
##              estimated_revenue_usd             0
##                        impressions             0

Handle duplicated records

We also check if there is any duplicated records in our dataset. As shown in the result, there is no duplicated records.

dataset[duplicated(dataset),]

##  [1] video_duration                    days_since_publish               
##  [3] day                               month                            
##  [5] year                              day_of_week                      
##  [7] monetized_playbacks_estimate      ad_impressions                   
##  [9] new_comments                      shares                           
## [11] dislikes                          likes                            
## [13] unsubscribes                      new_subscribers                  
## [15] end_screen_impressions            end_screen_clicks                
## [17] teaser_impressions                teaser_clicks                    
## [19] card_impressions                  card_clicks                      
## [21] views_per_playlist_start          playlist_views                   
## [23] playlist_watch_time_hours         clip_watch_time_hours            
## [25] clip_views                        you_tube_premium_watch_time_hours
## [27] you_tube_premium_views            watched_not_skipped_percent      
## [29] feed_impressions                  average_view_duration            
## [31] views                             watch_time_hours                 
## [33] estimated_revenue_usd             impressions                      
## <0 rows> (or 0-length row.names)

Summary on dataset after columns removal

After partial cleaning of the dataset, below is a brief summary.

glimpse(dataset)

## Rows: 364
## Columns: 34
## $ video_duration                    <dbl> 201, 391, 133, 14, 45, 496, 9, 34, 1…
## $ days_since_publish                <dbl> 0, 8, 4, 15, 2, 7, 28, 3, 3, 1, 5, 1…
## $ day                               <dbl> 2, 10, 14, 29, 1, 8, 5, 8, 11, 12, 1…
## $ month                             <dbl> 6, 6, 6, 6, 7, 7, 8, 8, 8, 8, 8, 9, …
## $ year                              <dbl> 2016, 2016, 2016, 2016, 2016, 2016, …
## $ day_of_week                       <chr> "Thursday", "Friday", "Tuesday", "We…
## $ monetized_playbacks_estimate      <dbl> 723, 727, 76, 18, 0, 491, 32, 404, 1…
## $ ad_impressions                    <dbl> 981, 861, 88, 35, 0, 673, 43, 597, 2…
## $ new_comments                      <dbl> 91, 35, 0, 12, 50, 27, 16, 33, 37, 1…
## $ shares                            <dbl> 12, 5, 4, 7, 7, 3, 14, 37, 21, 3, 11…
## $ dislikes                          <dbl> 30, 18, 20, 14, 180, 17, 8, 11, 16, …
## $ likes                             <dbl> 924, 322, 239, 220, 602, 290, 151, 4…
## $ unsubscribes                      <dbl> 3, 1, 0, 0, 3, 1, 0, 0, 0, 0, 2, 2, …
## $ new_subscribers                   <dbl> 54, 34, 8, 2, 31, 20, 4, 24, 14, 5, …
## $ end_screen_impressions            <dbl> 46, 0, 0, 0, 4, 0, 0, 18, 0, 0, 15, …
## $ end_screen_clicks                 <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teaser_impressions                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teaser_clicks                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ card_impressions                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ card_clicks                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ views_per_playlist_start          <dbl> 3.3333, 3.5000, 0.0000, 6.6667, 1.66…
## $ playlist_views                    <dbl> 10, 7, 11, 20, 10, 10, 10, 11, 11, 1…
## $ playlist_watch_time_hours         <dbl> 0.2974, 0.5187, 0.1683, 0.0640, 0.06…
## $ clip_watch_time_hours             <dbl> 0.1575, 0.0000, 0.0000, 0.0000, 0.00…
## $ clip_views                        <dbl> 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ you_tube_premium_watch_time_hours <dbl> 2.5358, 0.8911, 0.1838, 0.0711, 0.25…
## $ you_tube_premium_views            <dbl> 152, 32, 28, 20, 39, 22, 20, 37, 27,…
## $ watched_not_skipped_percent       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ feed_impressions                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ average_view_duration             <dbl> 81, 156, 41, 14, 25, 182, 10, 35, 8,…
## $ views                             <dbl> 23531, 11478, 6153, 4398, 14659, 841…
## $ watch_time_hours                  <dbl> 533.1636, 500.5628, 70.7287, 17.6251…
## $ estimated_revenue_usd             <dbl> 0.561, 0.648, 0.089, 0.017, 0.000, 0…
## $ impressions                       <dbl> 41118, 41627, 38713, 35245, 46218, 4…

Create classification target column - “subscriber_growth”

As shown on above summary, our dataset consists of mostly numeric data, we have to transform and prepare the data in ways that align with our classification task, which is to classify if a video is going to have high or low subscriber growth potential based on observed factors.

We decided to work on the number of subscribers by first calculating the net growth using the differences between new_subscribers and unsubscribes. Then we will use the mean value of this newly derived features as threshold to separate the values into Low (0) and High (1) classes.

dataset <- dataset %>%
  mutate(subscriber_growth = new_subscribers - unsubscribes)
threshold <- as.integer(mean(dataset$subscriber_growth, na.rm = TRUE))
cat("Threshold (mean) value:", threshold, "\n")

## Threshold (mean) value: 321

dataset <- dataset %>%
  mutate(subscriber_growth = ifelse(subscriber_growth >= threshold, 1, 0))
cat("Values in the new column:", unique(dataset$subscriber_growth), "\n")

## Values in the new column: 0 1

cat("Dataset dimension after adding new column:", dim(dataset), "\n")

## Dataset dimension after adding new column: 364 35

colnames(dataset)

##  [1] "video_duration"                    "days_since_publish"               
##  [3] "day"                               "month"                            
##  [5] "year"                              "day_of_week"                      
##  [7] "monetized_playbacks_estimate"      "ad_impressions"                   
##  [9] "new_comments"                      "shares"                           
## [11] "dislikes"                          "likes"                            
## [13] "unsubscribes"                      "new_subscribers"                  
## [15] "end_screen_impressions"            "end_screen_clicks"                
## [17] "teaser_impressions"                "teaser_clicks"                    
## [19] "card_impressions"                  "card_clicks"                      
## [21] "views_per_playlist_start"          "playlist_views"                   
## [23] "playlist_watch_time_hours"         "clip_watch_time_hours"            
## [25] "clip_views"                        "you_tube_premium_watch_time_hours"
## [27] "you_tube_premium_views"            "watched_not_skipped_percent"      
## [29] "feed_impressions"                  "average_view_duration"            
## [31] "views"                             "watch_time_hours"                 
## [33] "estimated_revenue_usd"             "impressions"                      
## [35] "subscriber_growth"

Convert categorical data to factor

From our understanding of the dataset, We have identified “day_of_week”, “day”, “month”, “year” and “subscriber_growth” columns to be categorical data. We first check the unique values present in the columns.

unique(dataset$day_of_week)

## [1] "Thursday"  "Friday"    "Tuesday"   "Wednesday" "Monday"    "Sunday"   
## [7] "Saturday"

unique(dataset$day)

##  [1]  2 10 14 29  1  8  5 11 12 17 18 15 20 26  3  6  9 23 25 27 24 31  7 28  4
## [26] 22 30 16 13 19 21

unique(dataset$month)

##  [1]  6  7  8  9 10 11 12  1  2  3  4  5

unique(dataset$year)

## [1] 2016 2017 2018 2019 2020 2021 2022 2023 2024

unique(dataset$subscriber_growth)

## [1] 0 1

Then we convert them into factors accordingly.

dataset$day_of_week <- factor(dataset$day_of_week,
                              levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

dataset$day <- factor(dataset$day)

dataset$month <- factor(dataset$month)

dataset$year <- factor(dataset$year)

dataset$subscriber_growth <- factor(dataset$subscriber_growth, levels = c(0, 1))

As shown below, the columns have been converted into factors.

unique(dataset$day_of_week)

## [1] Thursday  Friday    Tuesday   Wednesday Monday    Sunday    Saturday 
## Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday

unique(dataset$day)

##  [1] 2  10 14 29 1  8  5  11 12 17 18 15 20 26 3  6  9  23 25 27 24 31 7  28 4 
## [26] 22 30 16 13 19 21
## 31 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 31

unique(dataset$month)

##  [1] 6  7  8  9  10 11 12 1  2  3  4  5 
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12

unique(dataset$year)

## [1] 2016 2017 2018 2019 2020 2021 2022 2023 2024
## Levels: 2016 2017 2018 2019 2020 2021 2022 2023 2024

unique(dataset$subscriber_growth)

## [1] 0 1
## Levels: 0 1

EDA on categorical columns

We use bar chart to show the distribution of each categorical columns.

There are 67.85% (247) of High Subscriber and 32.15% (117) of Low Subscriber

37.36% (136) of the dataset are published in 2017.

p1 <- ggplot(dataset, aes(x = day_of_week)) +
  geom_bar() +
  labs(title = "Day of Week Distribution", x = "day_of_week", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)

p2 <- ggplot(dataset, aes(x = day)) +
  geom_bar() +
  labs(title = "Day Distribution", x = "day", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)

p3 <- ggplot(dataset, aes(x = month)) +
  geom_bar() +
  labs(title = "Month Distribution", x = "month", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)

p4 <- ggplot(dataset, aes(x = year)) +
  geom_bar() +
  labs(title = "Year Distribution", x = "year", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)

p5 <- ggplot(dataset, aes(x = subscriber_growth)) +
  geom_bar() +
  labs(title = "Subscriber Growth Distribution", x = "subscriber_growth", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)

wrap_plots(p1, p2, p3, p4, p5)

EDA on categorical columns with respect to “subscriber_growth”

And also bar chart to show the distribution of each categorical columns with respect to “subscriber_growth”.

p1 <- ggplot(dataset, aes(x = day_of_week, fill = subscriber_growth)) +
  geom_bar(position = "dodge") +
  labs(title = "Day of Week Distribution", x = "day_of_week", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.9), vjust = -0.5)

p2 <- ggplot(dataset, aes(x = day, fill = subscriber_growth)) +
  geom_bar(position = "dodge") +
  labs(title = "Day Distribution", x = "day", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.9), vjust = -0.5)

p3 <- ggplot(dataset, aes(x = month, fill = subscriber_growth)) +
  geom_bar(position = "dodge") +
  labs(title = "Month Distribution", x = "month", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.9), vjust = -0.5)

p4 <- ggplot(dataset, aes(x = month, fill = subscriber_growth)) +
  geom_bar(position = "dodge") +
  labs(title = "Month Distribution", x = "month", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(stat = "count", aes(label = ..count..), position = position_dodge(width = 0.9), vjust = -0.5)

wrap_plots(p1, p2, p3, p4)

EDA on numerical columns

Below are the numerical data in our dataset.

numeric_columns_names <- dataset %>%
  select(where(~ is.numeric(.))) %>%
  colnames()

numeric_columns_names

##  [1] "video_duration"                    "days_since_publish"               
##  [3] "monetized_playbacks_estimate"      "ad_impressions"                   
##  [5] "new_comments"                      "shares"                           
##  [7] "dislikes"                          "likes"                            
##  [9] "unsubscribes"                      "new_subscribers"                  
## [11] "end_screen_impressions"            "end_screen_clicks"                
## [13] "teaser_impressions"                "teaser_clicks"                    
## [15] "card_impressions"                  "card_clicks"                      
## [17] "views_per_playlist_start"          "playlist_views"                   
## [19] "playlist_watch_time_hours"         "clip_watch_time_hours"            
## [21] "clip_views"                        "you_tube_premium_watch_time_hours"
## [23] "you_tube_premium_views"            "watched_not_skipped_percent"      
## [25] "feed_impressions"                  "average_view_duration"            
## [27] "views"                             "watch_time_hours"                 
## [29] "estimated_revenue_usd"             "impressions"

We use histogram chart to look at the distribution of these columns.

From the histogram chart, we can see that majority of the numerical data are right-skewed distribution.

hist_plot_list <- list()
for (col in numeric_columns_names){
  plot <- ggplot(dataset, aes_string(x=col)) + geom_histogram() + labs(y="Frequency")
  hist_plot_list[[col]] <- plot
}
wrap_plots(hist_plot_list, ncol=5)

EDA on numerical columns with respect to “estimated_revenue_usd” and “subscriber_growth”

We use scatterplot chart to show the distribution of the numerical columns with respect to “estimated_revenue_usd” and “subscriber_growth”.

hist_plot_list <- list()
for (col in numeric_columns_names){
  if (col == "estimated_revenue_usd") {
    next
  }
  
  plot <- ggplot(dataset, aes_string(x=col, y="estimated_revenue_usd", color = "subscriber_growth")) + geom_point() +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    theme(legend.position = "top")
  hist_plot_list[[col]] <- plot
}
wrap_plots(hist_plot_list, ncol=5)

Label encoding for categorical feature

We will convert factors into numeric values for machine learning algorithms that require numerical input.

glimpse(dataset[, sapply(dataset, is.factor)])

## Rows: 364
## Columns: 5
## $ day               <fct> 2, 10, 14, 29, 1, 8, 5, 8, 11, 12, 17, 5, 11, 18, 1,…
## $ month             <fct> 6, 6, 6, 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 10…
## $ year              <fct> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
## $ day_of_week       <fct> Thursday, Friday, Tuesday, Wednesday, Friday, Friday…
## $ subscriber_growth <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…

# to avoid converting the target variable for classification into numeric 
dataset <- dataset %>%
  mutate_if(~is.factor(.) && !all(levels(.) %in% c(0, 1)), as.numeric)

After label encoding:

glimpse(dataset)

## Rows: 364
## Columns: 35
## $ video_duration                    <dbl> 201, 391, 133, 14, 45, 496, 9, 34, 1…
## $ days_since_publish                <dbl> 0, 8, 4, 15, 2, 7, 28, 3, 3, 1, 5, 1…
## $ day                               <dbl> 2, 10, 14, 29, 1, 8, 5, 8, 11, 12, 1…
## $ month                             <dbl> 6, 6, 6, 6, 7, 7, 8, 8, 8, 8, 8, 9, …
## $ year                              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day_of_week                       <dbl> 4, 5, 2, 3, 5, 5, 5, 1, 4, 5, 3, 1, …
## $ monetized_playbacks_estimate      <dbl> 723, 727, 76, 18, 0, 491, 32, 404, 1…
## $ ad_impressions                    <dbl> 981, 861, 88, 35, 0, 673, 43, 597, 2…
## $ new_comments                      <dbl> 91, 35, 0, 12, 50, 27, 16, 33, 37, 1…
## $ shares                            <dbl> 12, 5, 4, 7, 7, 3, 14, 37, 21, 3, 11…
## $ dislikes                          <dbl> 30, 18, 20, 14, 180, 17, 8, 11, 16, …
## $ likes                             <dbl> 924, 322, 239, 220, 602, 290, 151, 4…
## $ unsubscribes                      <dbl> 3, 1, 0, 0, 3, 1, 0, 0, 0, 0, 2, 2, …
## $ new_subscribers                   <dbl> 54, 34, 8, 2, 31, 20, 4, 24, 14, 5, …
## $ end_screen_impressions            <dbl> 46, 0, 0, 0, 4, 0, 0, 18, 0, 0, 15, …
## $ end_screen_clicks                 <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teaser_impressions                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teaser_clicks                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ card_impressions                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ card_clicks                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ views_per_playlist_start          <dbl> 3.3333, 3.5000, 0.0000, 6.6667, 1.66…
## $ playlist_views                    <dbl> 10, 7, 11, 20, 10, 10, 10, 11, 11, 1…
## $ playlist_watch_time_hours         <dbl> 0.2974, 0.5187, 0.1683, 0.0640, 0.06…
## $ clip_watch_time_hours             <dbl> 0.1575, 0.0000, 0.0000, 0.0000, 0.00…
## $ clip_views                        <dbl> 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ you_tube_premium_watch_time_hours <dbl> 2.5358, 0.8911, 0.1838, 0.0711, 0.25…
## $ you_tube_premium_views            <dbl> 152, 32, 28, 20, 39, 22, 20, 37, 27,…
## $ watched_not_skipped_percent       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ feed_impressions                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ average_view_duration             <dbl> 81, 156, 41, 14, 25, 182, 10, 35, 8,…
## $ views                             <dbl> 23531, 11478, 6153, 4398, 14659, 841…
## $ watch_time_hours                  <dbl> 533.1636, 500.5628, 70.7287, 17.6251…
## $ estimated_revenue_usd             <dbl> 0.561, 0.648, 0.089, 0.017, 0.000, 0…
## $ impressions                       <dbl> 41118, 41627, 38713, 35245, 46218, 4…
## $ subscriber_growth                 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Dataset for Regression Problem

Remove subscriber_growth from regression dataset

Since subscriber_growth is a derived feature we prepared for classification task, we removed it from our preparation for regression task.

cols_to_remove <- c("subscriber_growth")
cat("Dataset dimension before removing columns:", dim(dataset), "\n")

## Dataset dimension before removing columns: 364 35

regression_dataset <- dataset %>% 
  select(-cols_to_remove)
cat("Dataset dimension after removing column:", dim(regression_dataset), "\n")

## Dataset dimension after removing column: 364 34

Standardization

Standardization is done to ensure that the analysis will not be dominated by features with larger scale. In our code, the transformation is applied only to columns that are numeric. Since estimated_revenue_usd is the target variable in our regression problem, we make sure to exclude it from standardization as it could distort the values and interfere with regression model’s ability to predict accurately.

regression_dataset <- regression_dataset %>%
  mutate(across(where(is.numeric), scale))
target_column <- "estimated_revenue_usd"
regression_dataset[[target_column]] <- dataset[[target_column]]

Feature selection

Correlation analysis between features and the target variable can help us identify strong predictors to focus on the most relevant features. This step can help to improve model efficiency and simplify the model by reducing the number of features to consider. It can also help mitigate the risk of overfitting as model is less likely to learn noise from the data.

We pick a threshold of 0.3 and keep features with an absolute correlation coefficient equal to or greater than that as it is our goal to keep features with moderate association with the target variable.

correlation_results <- cor(regression_dataset, use = "complete.obs")[, "estimated_revenue_usd"]
sorted_correlation_results <- sort(correlation_results, decreasing = TRUE)
sorted_correlation_results

##             estimated_revenue_usd      monetized_playbacks_estimate 
##                        1.00000000                        0.94415536 
##                    ad_impressions you_tube_premium_watch_time_hours 
##                        0.82553441                        0.58818300 
##            you_tube_premium_views                        clip_views 
##                        0.57453255                        0.47250389 
##                       impressions                  watch_time_hours 
##                        0.46993994                        0.43199850 
##                             likes                   new_subscribers 
##                        0.42554355                        0.41293203 
##             clip_watch_time_hours         playlist_watch_time_hours 
##                        0.40509641                        0.37645932 
##                            shares                             views 
##                        0.36208904                        0.35790110 
##                teaser_impressions            end_screen_impressions 
##                        0.29957602                        0.21852646 
##                       card_clicks                    playlist_views 
##                        0.21722540                        0.21078692 
##             average_view_duration                     teaser_clicks 
##                        0.20728645                        0.20231039 
##                      new_comments                              year 
##                        0.20167705                        0.19833342 
##                days_since_publish                  card_impressions 
##                        0.17689012                        0.16111830 
##                      unsubscribes                          dislikes 
##                        0.15548499                        0.14381436 
##                    video_duration                             month 
##                        0.13576667                        0.12153583 
##                 end_screen_clicks                               day 
##                        0.08489775                        0.03859435 
##       watched_not_skipped_percent                  feed_impressions 
##                        0.02559761                        0.02559761 
##          views_per_playlist_start                       day_of_week 
##                       -0.02274427                       -0.06396180

correlation_df <- tibble(
  Column = names(sorted_correlation_results),
  Correlation = sorted_correlation_results
) %>%
  filter(abs(Correlation) >= 0.3)
regression_dataset <- regression_dataset %>%
  select(all_of(correlation_df$Column))
colnames(regression_dataset)

##  [1] "estimated_revenue_usd"             "monetized_playbacks_estimate"     
##  [3] "ad_impressions"                    "you_tube_premium_watch_time_hours"
##  [5] "you_tube_premium_views"            "clip_views"                       
##  [7] "impressions"                       "watch_time_hours"                 
##  [9] "likes"                             "new_subscribers"                  
## [11] "clip_watch_time_hours"             "playlist_watch_time_hours"        
## [13] "shares"                            "views"

In the end, apart from the target variable, we ended up with 13 features in our cleaned dataset for regression problem.

Dataset for Classification Problem

Remove columns used to derived target variable

We remove new_subscribers and unsubscribes column from our classification dataset.

cols_to_remove <- c("new_subscribers", "unsubscribes")
cat("Dataset dimension before removing columns:", dim(dataset), "\n")

## Dataset dimension before removing columns: 364 35

classification_dataset <- dataset %>% 
  select(-cols_to_remove)
cat("Dataset dimension after removing column:", dim(classification_dataset), "\n")

## Dataset dimension after removing column: 364 33

Standardization

Again, we go through the same process as with regression dataset by standardizing our data before performing feature selection.

classification_dataset <- classification_dataset %>%
  mutate(across(where(is.numeric), scale))
target_column <- "subscriber_growth"
classification_dataset[[target_column]] <- dataset[[target_column]]

Feature selection

classification_dataset$subscriber_growth <- as.numeric(classification_dataset$subscriber_growth) - 1
correlation_results <- cor(classification_dataset, use = "complete.obs")[, "subscriber_growth"]
sorted_correlation_results <- sort(correlation_results, decreasing = TRUE)
sorted_correlation_results

##                 subscriber_growth                             likes 
##                        1.00000000                        0.69957606 
##                             views                  watch_time_hours 
##                        0.68007090                        0.63970462 
##                       impressions            you_tube_premium_views 
##                        0.63931946                        0.63524804 
##                            shares you_tube_premium_watch_time_hours 
##                        0.56135916                        0.54961989 
##                          dislikes                      new_comments 
##                        0.52902280                        0.51674931 
##                        clip_views      monetized_playbacks_estimate 
##                        0.30666774                        0.28333213 
##             estimated_revenue_usd         playlist_watch_time_hours 
##                        0.27956059                        0.23455686 
##                    playlist_views             clip_watch_time_hours 
##                        0.22275060                        0.22171100 
##                    ad_impressions                       day_of_week 
##                        0.19906096                        0.12537902 
##                days_since_publish            end_screen_impressions 
##                        0.11972858                        0.10526641 
##                             month                 end_screen_clicks 
##                        0.10120653                        0.07702524 
##                teaser_impressions                  feed_impressions 
##                        0.07012644                        0.06738849 
##       watched_not_skipped_percent                     teaser_clicks 
##                        0.06738849                        0.04659678 
##                       card_clicks                  card_impressions 
##                        0.03601428                        0.02697676 
##             average_view_duration                              year 
##                       -0.01231093                       -0.02826497 
##                               day          views_per_playlist_start 
##                       -0.02937354                       -0.03913703 
##                    video_duration 
##                       -0.04753051

correlation_df <- tibble(
  Column = names(sorted_correlation_results),
  Correlation = sorted_correlation_results
) %>% 
  filter(abs(Correlation) >= 0.3)
classification_dataset <- classification_dataset %>%
  select(all_of(correlation_df$Column))
colnames(classification_dataset)

##  [1] "subscriber_growth"                 "likes"                            
##  [3] "views"                             "watch_time_hours"                 
##  [5] "impressions"                       "you_tube_premium_views"           
##  [7] "shares"                            "you_tube_premium_watch_time_hours"
##  [9] "dislikes"                          "new_comments"                     
## [11] "clip_views"

# convert back to factor 
classification_dataset$subscriber_growth <- factor(classification_dataset$subscriber_growth, levels = c(0, 1))

In the end, apart from the target variable, we ended up with 10 features in our cleaned dataset for classification problem.

Data Analysis

Forecasting Estimated Revenue (USD) for YouTube Videos

Accurately predicting revenue can help content creators and platforms optimize strategies for video production and audience engagement. To achieve this, we evaluated three predictive models: Linear Regression, Random Forest Regression and XGBoost. Linear Regression offers simplicity and interpretability, while Random Forest and XGBoost account for non-linear relationships and feature interactions, making them suitable for complex datasets.

Train-test split

Splitting the data into training and testing sets is important to check how well the model works on new data. The model is trained using one part of the data and tested on the other, which mimics real-world situations. This helps avoid overfitting, where the model performs well on the training data but fails with new data. It also gives a clear and fair measure of the model’s accuracy, helping us choose the best model for making predictions. We use a ratio of 80:20 on our train-test split ratio because our dataset size is on the smaller scale.

set.seed(130)  # For reproducibility
train_index <- createDataPartition(regression_dataset$estimated_revenue_usd, p = 0.8, list = FALSE)
regression_train_data <- regression_dataset[train_index, ]
regression_test_data <- regression_dataset[-train_index, ]
cat("Train data dimension:", dim(regression_train_data))

## Train data dimension: 292 14

cat("Test data dimension:", dim(regression_test_data))

## Test data dimension: 72 14

Linear Regression

model_lm <- lm(estimated_revenue_usd ~ ., data = regression_train_data)
predictions_lm <- predict(model_lm, newdata = regression_test_data)

Random Forest Regression

model_rf <- randomForest(estimated_revenue_usd ~ ., data = regression_train_data, ntree = 500)
predictions_rf <- predict(model_rf, newdata = regression_test_data)

XGBoost

train_matrix <- model.matrix(estimated_revenue_usd ~ ., data = regression_train_data)[, -1]
test_matrix <- model.matrix(estimated_revenue_usd ~ ., data = regression_test_data)[, -1]

model_xgb <- xgboost(data = train_matrix, label = regression_train_data$estimated_revenue_usd, nrounds = 100, objective = "reg:squarederror", verbose = 0)
predictions_xgb <- predict(model_xgb, newdata = test_matrix)

Evaluation

Evaluation metrics are critical for assessing the performance of predictive models and ensuring they meet the desired accuracy. Metrics like Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) measure the average difference between predicted and actual values, with RMSE giving more weight to larger errors. R-squared (R²) indicates how well the model explains the variation in the target variable, with values closer to 1 showing better performance. By using these metrics, we can compare different models and choose the one that best balances accuracy and generalization.

# rmse
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# mae
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# r-squared
rsquared <- function(actual, predicted) {
  cor(actual, predicted)^2
}

metrics <- data.frame(
  Model = c("Linear Regression", "Random Forest", "XGBoost"),
  RMSE = c(rmse(regression_test_data$estimated_revenue_usd, predictions_lm),
           rmse(regression_test_data$estimated_revenue_usd, predictions_rf),
           rmse(regression_test_data$estimated_revenue_usd, predictions_xgb)),
  MAE = c(mae(regression_test_data$estimated_revenue_usd, predictions_lm),
          mae(regression_test_data$estimated_revenue_usd, predictions_rf),
          mae(regression_test_data$estimated_revenue_usd, predictions_xgb)),
  R2 = c(rsquared(regression_test_data$estimated_revenue_usd, predictions_lm),
         rsquared(regression_test_data$estimated_revenue_usd, predictions_rf),
         rsquared(regression_test_data$estimated_revenue_usd, predictions_xgb))
)

metrics

##               Model     RMSE      MAE        R2
## 1 Linear Regression 3.431527 2.280876 0.8498157
## 2     Random Forest 4.102293 2.387962 0.8005550
## 3           XGBoost 3.216333 1.707398 0.8675043

Our goal is to choose the model with the lowest RMSE and MAE and highest R² for the best predictive performance. From the table and plot, we can see that XGBoost outperforms the other models based on all three metrics, possibly because it can handle non-linear relationships and interactions between features, which are prevalent in forecasting tasks like YouTube revenue.

However, the absolute values of the evaluation metrics suggest that there may still be room or improvement. This is further supported by the table below showing the actual value vs predicted value by model.

comparison <- data.frame(Actual = regression_test_data$estimated_revenue_usd,
                         Predicted = predictions_xgb)
head(comparison)

##   Actual  Predicted
## 1  0.000  0.3305748
## 2  0.478  0.3543185
## 3  6.996 12.7158175
## 4  0.386  0.3092237
## 5  6.795  8.9117165
## 6  0.113  0.2099476

Classifying YouTube Videos into High and Low Subscriber Growth Potential

Train-test split

Similar to regression task, we also perform train-test split on our classification dataset prepared in previous phase.

set.seed(50)  # For reproducibility
train_index <- createDataPartition(classification_dataset$subscriber_growth, p = 0.8, list = FALSE)
classification_train_data <- classification_dataset[train_index, ]
classification_test_data <- classification_dataset[-train_index, ]
cat("Train data dimension:", dim(classification_train_data))

## Train data dimension: 292 11

cat("Test data dimension:", dim(classification_test_data))

## Test data dimension: 72 11

Logistic Regression

Logistic Regression is chosen for its simplicity and interpretability, providing clear insights into the influence of each feature on the probability of high or low subscriber growth.

logistic_model <- glm(subscriber_growth ~ ., data = classification_train_data, family = binomial)
predictions_logistic <- predict(logistic_model, newdata = classification_test_data, type = "response")
predictions_logistic <- ifelse(predictions_logistic > 0.5, 1, 0)

Random Forest

Random Forest is selected due to its ability to handle complex relationships between features and its robustness to overfitting, making it suitable for datasets with interactions and non-linear patterns.

rf_model <- randomForest(subscriber_growth ~ ., data = classification_train_data, importance = TRUE)
predictions_rf <- predict(rf_model, newdata = classification_test_data)

SVM

SVM is included for its effectiveness in high-dimensional spaces and its ability to create decision boundaries that are optimal for classifying non-linear data, often yielding strong performance in classification tasks.

svm_model <- svm(subscriber_growth ~ ., data = classification_train_data, probability = TRUE)
predictions_svm <- predict(svm_model, newdata = classification_test_data)

Evaluation

We choose accuracy, precision, recall and F1 score as the evaluation metrics for classification task. Together, these metrics provide a comprehensive understanding of the model’s effectiveness in correctly identifying and classifying YouTube videos based on their growth potential.

# Confusion Matrix for Logistic Regression
confusion_logistic <- confusionMatrix(factor(predictions_logistic), classification_test_data$subscriber_growth)
accuracy_logistic <- confusion_logistic$overall['Accuracy']
precision_logistic <- confusion_logistic$byClass['Pos Pred Value']
recall_logistic <- confusion_logistic$byClass['Sensitivity']
f1_logistic <- 2 * (precision_logistic * recall_logistic) / (precision_logistic + recall_logistic)

# Confusion Matrix for Random Forest
confusion_rf <- confusionMatrix(predictions_rf, classification_test_data$subscriber_growth)
accuracy_rf <- confusion_rf$overall['Accuracy']
precision_rf <- confusion_rf$byClass['Pos Pred Value']
recall_rf <- confusion_rf$byClass['Sensitivity']
f1_rf <- 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf)

# Confusion Matrix for SVM
confusion_svm <- confusionMatrix(predictions_svm, classification_test_data$subscriber_growth)
accuracy_svm <- confusion_svm$overall['Accuracy']
precision_svm <- confusion_svm$byClass['Pos Pred Value']
recall_svm <- confusion_svm$byClass['Sensitivity']
f1_svm <- 2 * (precision_svm * recall_svm) / (precision_svm + recall_svm)

# Create a data frame to hold the evaluation metrics
evaluation_results <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "SVM"),
  Accuracy = c(accuracy_logistic, accuracy_rf, accuracy_svm),
  Precision = c(precision_logistic, precision_rf, precision_svm),
  Recall = c(recall_logistic, recall_rf, recall_svm),
  F1_Score = c(f1_logistic, f1_rf, f1_svm)
)
evaluation_results

##                 Model  Accuracy Precision    Recall  F1_Score
## 1 Logistic Regression 0.9166667 0.9574468 0.9183673 0.9375000
## 2       Random Forest 0.9305556 0.9583333 0.9387755 0.9484536
## 3                 SVM 0.9444444 0.9591837 0.9591837 0.9591837

Overall, SVM model is the best choice for classifying YouTube videos into high and low subscriber growth potential based on the provided metrics, possibly due to its ability to handle complex, non-linear relationships in the data and its focus on maximizing the margin between classes. It achieves the highest accuracy, indicating its superior overall classification performance, and the highest F1-score, which balances precision and recall effectively. These metrics highlight SVM’s ability to generalize well in this classification task.

Conclusion

In this project, we successfully addressed two key objectives related to YouTube video analytics. For forecasting Estimated Revenue (USD), XGBoost emerged as the best-performing model, thanks to its ability to capture non-linear relationships and feature interactions. However, the absolute evaluation metrics indicate potential for further improvement. For classifying videos into high and low subscriber growth potential, SVM proved to be the most effective model in handling complex, non-linear relationships and maximizing class margins. These results demonstrate the models’ robustness while highlighting areas for refinement in future work.

In future studies, the inclusion of a more comprehensive and diverse dataset would likely yield significant improvements in model performance.

Model could benefit from greater generalizability
Help mitigate biases and provide a more accurate representation of real-world conditions
Could lead to the identification of new predictive factors or the refinement of existing features

WQD7004: YouTube Channel Performance Analytics

Group 4

Introduction and Project Objectives

Data Preparation & EDA

Import necessary packages

Load data

Inspect data

Remove irrelevant features

Data reduction

Remove money-related features (except estimated_revenue_usd)

Remove derived or aggregated features

Handle missing values

Handle duplicated records

Summary on dataset after columns removal

Create classification target column - “subscriber_growth”

Convert categorical data to factor

EDA on categorical columns

EDA on categorical columns with respect to “subscriber_growth”

EDA on numerical columns

EDA on numerical columns with respect to “estimated_revenue_usd” and “subscriber_growth”

Label encoding for categorical feature

Dataset for Regression Problem

Remove subscriber_growth from regression dataset

Standardization

Feature selection

Dataset for Classification Problem

Remove columns used to derived target variable

Standardization

Feature selection

Data Analysis

Forecasting Estimated Revenue (USD) for YouTube Videos

Train-test split

Linear Regression

Random Forest Regression

XGBoost

Evaluation

Classifying YouTube Videos into High and Low Subscriber Growth Potential

Train-test split

Logistic Regression

Random Forest

SVM

Evaluation

Conclusion