This study developed several models to predict the target (likes-to-views ratio) of a YouTube video using a data set from Kaggle. Aside from prediction, the study's purpose was to provide insight into which factors might affect the target ratio the most and which categories of videos tend to perform better or worse. The study concluded that, out of the different models developed, the random forest model with 1000 trees performed the best, with an r-squared of ~0.67 and an rmse of 0.024. It also identified enabling/disabling comments as an important variable that is worthy of further study. Finally, it showed that, when isolating the category factor, music and comedy videos tend to perform the best while sports and news/politics videos tend to perform the worst.
Note: For step-by-step detail on the model-building process, see the following link: https://github.com/xamanthalc/youtube_modeling_data_science (open link in new tab).
Source: The data was obtained from Kaggle, from a competition called “Predict Youtube Video Likes.” The data was compiled by Kaggle, which lends strong confidence in its validity, and it consists of metrics for over 90,000 YouTube videos.
General Facts: The data set has 92,275 rows and 20 variables, as shown below.
[1] 92275
[1] 20
Out of the 20 variables:
8 were character variables.
1 was a date variable.
3 were logical variables.
7 were numeric variables.
1 was a POSIXct variable.
Additionally, after skimming the data set, it was found that 2 variables have missing observations: description (1476 missing) and duration_seconds (2176 missing). Given that duration_seconds is a numeric variable likely to be important for modeling, and that its missing observations represent a very small part of the data (2.36%), the rows with missing duration_seconds were deleted. Overall, the data was in tidy format. The result is shown below:
| Data summary | |
|---|---|
| Name | youtube_data |
| Number of rows | 90099 |
| Number of columns | 20 |
| Column type frequency: | |
| character | 8 |
| Date | 1 |
| logical | 3 |
| numeric | 7 |
| POSIXct | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| video_id | 0 | 1.00 | 11 | 11 | 0 | 16530 | 0 |
| title | 0 | 1.00 | 1 | 100 | 0 | 16870 | 0 |
| channelId | 0 | 1.00 | 24 | 24 | 0 | 4453 | 0 |
| channelTitle | 0 | 1.00 | 2 | 47 | 0 | 4521 | 0 |
| tags | 0 | 1.00 | 2 | 500 | 0 | 12564 | 0 |
| thumbnail_link | 0 | 1.00 | 46 | 51 | 0 | 16530 | 0 |
| description | 1414 | 0.98 | 2 | 4998 | 0 | 17260 | 0 |
| id | 0 | 1.00 | 22 | 22 | 0 | 90099 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| trending_date | 0 | 1 | 2020-08-12 | 2021-11-30 | 2021-04-08 | 461 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| comments_disabled | 0 | 1 | 0.02 | FAL: 88722, TRU: 1377 |
| ratings_disabled | 0 | 1 | 0.00 | FAL: 89668, TRU: 431 |
| has_thumbnail | 0 | 1 | 0.89 | TRU: 80622, FAL: 9477 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| categoryId | 0 | 1 | 18.72 | 6.86 | 1 | 17.00 | 20.00 | 24.00 | 29.00 |
| view_count | 0 | 1 | 2787115.43 | 7408199.50 | 38510 | 532918.00 | 1102651.00 | 2490732.00 | 264407389.00 |
| likes | 0 | 1 | 152989.52 | 440810.11 | 0 | 21679.00 | 52148.00 | 131998.50 | 16021534.00 |
| dislikes | 0 | 1 | 3097.00 | 13401.05 | 0 | 369.00 | 851.00 | 2233.00 | 879354.00 |
| comment_count | 0 | 1 | 13856.33 | 97838.48 | 0 | 1719.00 | 3891.00 | 9353.00 | 6738537.00 |
| duration_seconds | 0 | 1 | 760.61 | 5816.82 | 3 | 184.00 | 446.00 | 854.00 | 485620.00 |
| target | 0 | 1 | 0.06 | 0.04 | 0 | 0.03 | 0.05 | 0.08 | 0.43 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| publishedAt | 0 | 1 | 2020-08-03 21:51:14 | 2021-11-29 18:50:25 | 2021-04-03 22:39:01 | 16257 |
Considerations - Derived Variables: From the skim above, it was noticed that the variable publishedAt is a POSIXct variable that records the date and time when a video was released. New variables called hour_released, released_date, time_release_trending, and day_week_released were created from publishedAt. The variable hour_released stores the hour at which a YouTube video was posted and is numeric. The variable released_date stores the date when the video was released and is a date variable. The variable time_release_trending stores the number of days between the release of a video and the date when it became trending. The variable day_week_released stores the day of the week when the video was released and is a character variable. The justification was that isolating date and hour as individual variables could surface interesting insights during the EDA and model-building process later on.
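A minimal sketch of how these variables could be derived with lubridate and dplyr (the exact code lives in the linked repository; object names here are assumptions):

```r
library(dplyr)
library(lubridate)

youtube_data <- youtube_data %>%
  mutate(
    hour_released         = hour(publishedAt),       # numeric hour of day (0-23)
    released_date         = as_date(publishedAt),    # calendar date of release
    time_release_trending = as.numeric(trending_date - released_date),  # days from release to trending
    day_week_released     = weekdays(released_date)  # character day of the week
  )
```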
Thus, after the modifications, the data set had 90,099 rows and 24 variables, out of which:
9 were character variables.
2 were date variables.
3 were logical variables.
9 were numeric variables.
1 was a POSIXct variable.
Codebook: A codebook for the data set is included in the project under the name “codebook_youtube_data_set” and can also be found in the following link: https://drive.google.com/file/d/1fePKrgvBqZPlPeJk1C_k9rQt5ggomMHV/view?usp=sharing.
The main research question for this study was: Can the target (ratio of likes to views) be successfully predicted based on a video’s metrics (duration, comments, etc.)?
Other questions:
What is the main factor affecting a video’s target?
What category of video should someone produce to optimize their target?
Reasoning: The target ratio was chosen as the outcome variable over the raw number of likes for standardization and to add complexity to the model. An outcome variable such as “number of likes” is heavily affected by the number of views (e.g., a video with 1 million views has a higher chance of accumulating likes than a video with 100 thousand views). Thus, a better way to measure viewers’ responsiveness to a video is the target ratio (target = likes / view_count), since it adjusts the number of likes for the total views. Moreover, having target as the outcome variable makes the predictors less obvious when modeling, thereby adding complexity to the process.
In the section below, an Exploratory Data Analysis (EDA) was conducted to explore the distributions and relationships of the variables in the data set and to use the insights as a starting point for the model-building process. Due to the large amount of data, this study could afford to use 40% of the entries for EDA (36,040 observations) and set aside 60% of the data for the predictive-modeling phase (54,059 observations) to avoid introducing bias.
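A sketch of how this EDA/modeling split could be drawn with rsample; the seed and object names are assumptions:

```r
library(rsample)

set.seed(123)  # assumed seed; the original value is not stated
eda_split  <- initial_split(youtube_data, prop = 0.4)
eda_data   <- training(eda_split)  # 40% for exploration (~36,040 rows)
model_data <- testing(eda_split)   # 60% held out for modeling (~54,059 rows)
```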
The following plots show the distributions of the numeric variables.
As shown above, the variables likes, view_count, comment_count, dislikes, and duration_seconds had strongly right-skewed distributions, driven by outliers far above the levels where most observations sit. This behavior could cause trouble during modeling, so these variables were log-transformed (base 10) to reduce skewness. The distributions after the transformation are shown below:
Additionally, the distribution of the number of days from video release to becoming trending is shown below. Despite its right-skewness, the variable was not transformed due to its small range (0-37).
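A minimal sketch of the log transformation applied above; the +1 offset is an assumption to guard against zeros (likes, dislikes, and comment_count all have minimums of 0):

```r
library(dplyr)

eda_data <- eda_data %>%
  mutate(across(
    c(likes, view_count, comment_count, dislikes, duration_seconds),
    ~ log10(.x + 1)  # base-10 log; +1 avoids log10(0) for zero-valued rows
  ))
```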
Some takeaways:
The most popular categories of YouTube videos in the data set are categoryId = 24 (entertainment), followed by categoryId = 10 (music).
The most popular hour to release a YouTube video in the data set is 16:00, followed by 17:00.
The median number of: views is ~1 million, likes is ~50 thousand, dislikes is 852, comments is ~4 thousand, duration is ~450 seconds (7.5 minutes), and target (likes/views) is 0.05.
Most videos become trending within 10 days of release.
Some takeaways:
The most popular date for YouTube video releases in the data set was October 30, 2020.
Friday is the most popular day to release videos (although releases are spread fairly evenly across all days of the week).
Most videos did not disable comments or ratings, and most had thumbnails.
Given that the outcome variable of the predictive model was target, the bivariate analysis focused on finding insights into the behavior of target relative to the other variables.
Note: The variable categoryId was tested as a qualitative variable for bivariate analysis.
Some takeaways:
Overall, the sooner a video becomes trending after its release date, the higher its target ratio.
Comment count and target ratio have a positive relationship.
Neither likes nor view_count appears in the plots because target = likes/view_count:
All else equal, the higher the number of likes, the higher the target ratio.
All else equal, the higher the view_count, the lower the target ratio.
Some takeaways:
Videos that allow ratings and comments tend to have a higher target.
The categories with the highest median target were 23 (comedy) and 10 (music).
Videos released on Sunday performed the worst according to target, and videos released on Friday performed the best.
From the EDA section, it was possible to determine that variables such as categoryId, comments_disabled, ratings_disabled, comment_count, time_release_trending, day_week_released, and duration_seconds seemed relevant for predicting the target of a video. To continue with the modeling phase, some variables were left out due to insufficient importance and with the purpose of having a more focused data set. Those variables were:
title: too specific to each entry and difficult to make into a metric. For further research, it could be interesting to investigate whether the length of a title, or its sentiment may impact target.
publishedAt: Irrelevant given that it is redundant with the columns released_date and hour_released.
channelId and channelTitle: too specific to each entry and difficult to make into a metric. For further research, it could be interesting to collect the number of subscribers and use that as a metric for modeling target.
tags: too specific to each entry and difficult to make into a metric. However, with NLP (Natural Language Processing), a study could examine whether the presence/absence of certain tags is likely to impact target.
thumbnail_link: not relevant.
description: too specific to each entry and difficult to make into a metric. For further research, it could be interesting to investigate whether the sentiment of the description may impact target.
id: not relevant given that video_id was already being used as primary key.
released_date and trending_date: to guarantee not dealing with a time series.
Thus, the data set for modeling ended up with 14 variables and 54,059 observations. Finally, the variables day_week_released, has_thumbnail, ratings_disabled, comments_disabled, and categoryId were turned into factors, as sketched below.
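A sketch of the column selection and factor conversion described above, assuming the modeling portion of the data is stored in a hypothetical `model_data`:

```r
library(dplyr)

model_data <- model_data %>%
  select(-title, -publishedAt, -channelId, -channelTitle, -tags,
         -thumbnail_link, -description, -id, -released_date, -trending_date) %>%
  mutate(across(
    c(day_week_released, has_thumbnail, ratings_disabled,
      comments_disabled, categoryId),
    as.factor
  ))
```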
After the above modifications, the first ten observations were:
# A tibble: 10 × 14
video_id categoryId view_count likes dislikes comment_count comments_disabl…
<chr> <fct> <int> <int> <int> <int> <fct>
1 AqqMA2m… 17 381755 3534 95 877 FALSE
2 CZJvBfo… 24 20224542 657687 6147 43081 FALSE
3 SKQ5r3A… 10 1842144 203285 1127 17019 FALSE
4 O3V4UBZ… 20 2319270 255979 30349 48695 FALSE
5 Cew-Bnj… 25 1546608 32579 1795 15246 FALSE
6 311ZcJo… 24 7482473 484530 11628 23978 FALSE
7 TPbQCrk… 24 522929 65095 536 2010 FALSE
8 9xnTz6s… 20 530910 14657 288 793 FALSE
9 EhAemz1… 27 2829884 173762 2239 17019 FALSE
10 A7JdZaC… 24 1851793 88507 2809 3020 FALSE
# … with 7 more variables: ratings_disabled <fct>, duration_seconds <dbl>,
# has_thumbnail <fct>, target <dbl>, hour_released <int>,
# day_week_released <fct>, time_release_trending <dbl>
The training data set used 80% of the data for predictive modeling (43,247 observations), and the remaining 20% was set aside for the testing phase (10,812 observations). Additionally, repeated cross-validation with 5 folds and 3 repeats was applied to the training data set for later operations. The resamples are shown below:
# 5-fold cross-validation repeated 3 times using stratification
# A tibble: 15 × 3
splits id id2
<list> <chr> <chr>
1 <split [34595/8652]> Repeat1 Fold1
2 <split [34596/8651]> Repeat1 Fold2
3 <split [34599/8648]> Repeat1 Fold3
4 <split [34599/8648]> Repeat1 Fold4
5 <split [34599/8648]> Repeat1 Fold5
6 <split [34595/8652]> Repeat2 Fold1
7 <split [34596/8651]> Repeat2 Fold2
8 <split [34599/8648]> Repeat2 Fold3
9 <split [34599/8648]> Repeat2 Fold4
10 <split [34599/8648]> Repeat2 Fold5
11 <split [34595/8652]> Repeat3 Fold1
12 <split [34596/8651]> Repeat3 Fold2
13 <split [34599/8648]> Repeat3 Fold3
14 <split [34599/8648]> Repeat3 Fold4
15 <split [34599/8648]> Repeat3 Fold5
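A sketch of how the training/testing split and the repeated cross-validation above could be produced with rsample; the seed is an assumption, and stratification on the outcome follows the header printed above:

```r
library(rsample)

set.seed(123)  # assumed seed
data_split <- initial_split(model_data, prop = 0.8, strata = target)
train_data <- training(data_split)  # 43,247 observations
test_data  <- testing(data_split)   # 10,812 observations

# 5-fold cross-validation repeated 3 times, stratified on the outcome
folds <- vfold_cv(train_data, v = 5, repeats = 3, strata = target)
```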
Four main types of models were used to fit the resamples and assess which performed best: linear models, lasso models, ridge models, and random forest models. For the first three types, each had two versions: without and with pairwise interactions. For the random forest, two versions were also created: one using the default values and one customizing the number of trees to 1000.
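A sketch of the model specifications with parsnip; the engines (lm, glmnet, ranger) and the placeholder penalty are assumptions, since only the tree count is stated in the text:

```r
library(parsnip)

lm_spec    <- linear_reg() %>% set_engine("lm")
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet")  # mixture = 1: lasso
ridge_spec <- linear_reg(penalty = 0.01, mixture = 0) %>% set_engine("glmnet")  # mixture = 0: ridge
rf_default <- rand_forest() %>% set_engine("ranger") %>% set_mode("regression")
rf_custom  <- rand_forest(trees = 1000) %>% set_engine("ranger") %>% set_mode("regression")
```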
The plots below show how the models ranked in terms of r-squared and Root Mean Squared Error (rmse). The random forests performed better across both metrics:
The tables below show that the random forest with customized values had a small edge over the random forest with default values:
R-squared:
# A tibble: 8 × 9
wflow_id .config preproc model .metric .estimator mean n std_err
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
1 random_fore… Preprocess… recipe rand_… rsq standard 0.662 15 0.00210
2 random_fore… Preprocess… recipe rand_… rsq standard 0.661 15 0.00209
3 lm_no Preprocess… recipe linea… rsq standard 0.371 15 0.00277
4 lm_yes Preprocess… recipe linea… rsq standard 0.458 15 0.00365
5 lasso_no Preprocess… recipe linea… rsq standard 0.357 15 0.00220
6 lasso_yes Preprocess… recipe linea… rsq standard 0.384 15 0.00993
7 ridge_no Preprocess… recipe linea… rsq standard 0.369 15 0.00256
8 ridge_yes Preprocess… recipe linea… rsq standard 0.441 15 0.00533
Root Mean Squared Error:
# A tibble: 8 × 9
wflow_id .config preproc model .metric .estimator mean n std_err
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
1 random_for… Preprocess… recipe rand_… rmse standard 0.0246 15 9.89e-5
2 random_for… Preprocess… recipe rand_… rmse standard 0.0246 15 9.50e-5
3 lm_no Preprocess… recipe linea… rmse standard 0.0319 15 1.37e-4
4 lm_yes Preprocess… recipe linea… rmse standard 0.0296 15 1.26e-4
5 lasso_no Preprocess… recipe linea… rmse standard 0.0325 15 1.44e-4
6 lasso_yes Preprocess… recipe linea… rmse standard 0.0318 15 2.17e-4
7 ridge_no Preprocess… recipe linea… rmse standard 0.0320 15 1.39e-4
8 ridge_yes Preprocess… recipe linea… rmse standard 0.0301 15 1.49e-4
As shown above, the random forest with customized values performed best, with an r-squared of 0.662, which is considered moderate. It also had the smallest rmse (0.0246), meaning its predictions deviated least, on average, from the observed targets. Even though it was the model that performed best, it was necessary to put the results in perspective:
As shown by the plot, at targets over 0.15 the model tends to under-predict. For example, some observations with target values of 0.4 are predicted at slightly over 0.1.
Aside from the 8 models created above, another 32 models resulting from tuning the linear models were developed. The linear model without penalty was tuned on the spline degrees of freedom, while the lasso and ridge models were tuned on the penalty. The results are shown below:
The grid of spline degrees of freedom used for tuning was:
# A tibble: 16 × 2
comment_count_df hour_release_df
<int> <int>
1 1 1
2 5 1
3 10 1
4 15 1
5 1 5
6 5 5
7 10 5
8 15 5
9 1 10
10 5 10
11 10 10
12 15 10
13 1 15
14 5 15
15 10 15
16 15 15
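A sketch of how this grid could be built and tuned; the interaction recipe is assumed to live in a hypothetical `base_rec`, while `lm_spec` and `folds` carry over from the earlier sketches:

```r
library(recipes)
library(workflows)
library(tune)
library(tidyr)

spline_rec <- base_rec %>%
  step_ns(comment_count, deg_free = tune("comment_count_df")) %>%
  step_ns(hour_released, deg_free = tune("hour_release_df"))

# Cross the candidate degrees of freedom into the 16-row grid shown above
spline_grid <- crossing(comment_count_df = c(1, 5, 10, 15),
                        hour_release_df  = c(1, 5, 10, 15))

spline_res <- tune_grid(
  workflow() %>% add_recipe(spline_rec) %>% add_model(lm_spec),
  resamples = folds,
  grid      = spline_grid
)
select_best(spline_res, metric = "rmse")
```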
The plot below shows that both rmse and r-squared improve when the degrees of freedom for hour_released and comment_count are higher.
This is further shown with the output of the select_best function:
# A tibble: 1 × 3
`comment_count df` `hour_release df` .config
<int> <int> <chr>
1 15 15 Preprocessor16_Model1
Thus, the final model was tuned using 15 degrees of freedom for both variables, and the metrics across folds are shown below:
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.0283 15 0.0000727 Preprocessor1_Model1
2 rsq standard 0.507 15 0.00213 Preprocessor1_Model1
The penalty values used for tuning the lasso model were:
# A tibble: 8 × 1
penalty
<dbl>
1 0.0000000001
2 0.00000000268
3 0.0000000720
4 0.00000193
5 0.0000518
6 0.00139
7 0.0373
8 1
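These eight values correspond to a regular grid over dials’ default penalty range (10^-10 to 1 on a log10 scale); a sketch of the lasso tuning, reusing the hypothetical `base_rec` and `folds` from before:

```r
library(dials)

penalty_grid <- grid_regular(penalty(), levels = 8)  # 8 log-spaced values from 1e-10 to 1

lasso_tune <- linear_reg(penalty = tune(), mixture = 1) %>%  # set mixture = 0 for the ridge version
  set_engine("glmnet")

lasso_res <- tune_grid(
  workflow() %>% add_recipe(base_rec) %>% add_model(lasso_tune),
  resamples = folds,
  grid      = penalty_grid
)
select_best(lasso_res, metric = "rmse")
```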
The plot below shows that rmse and r-squared both perform best at the same penalty value.
The select_best function identifies that value:
# A tibble: 1 × 2
penalty .config
<dbl> <chr>
1 0.0000000001 Preprocessor1_Model1
Thus, the final lasso model was tuned using 0.0000000001 as the penalty, and the metrics across folds are shown below:
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.0301 15 0.000334 Preprocessor1_Model1
2 rsq standard 0.444 15 0.0102 Preprocessor1_Model1
The penalty values used for tuning the ridge model were:
# A tibble: 8 × 1
penalty
<dbl>
1 0.0000000001
2 0.00000000268
3 0.0000000720
4 0.00000193
5 0.0000518
6 0.00139
7 0.0373
8 1
Again, the plot below shows that rmse and r-squared both perform best at the same penalty value.
The select_best function identifies that value:
# A tibble: 1 × 2
penalty .config
<dbl> <chr>
1 0.0000000001 Preprocessor1_Model1
Thus, the final ridge model was tuned using 0.0000000001 as the penalty, and the metrics across folds are shown below:
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.0302 15 0.000168 Preprocessor1_Model1
2 rsq standard 0.440 15 0.00565 Preprocessor1_Model1
The table below summarizes the results of the tuned models:
| | linear_splines | lasso | ridge |
|---|---|---|---|
| mean rsq | 0.5070 | 0.4440 | 0.4400 |
| mean rmse | 0.0283 | 0.0301 | 0.0302 |
The linear model with splines tuned at 15 degrees of freedom outperforms the tuned lasso and ridge models, with a mean r-squared across folds of 0.507 and a mean rmse of 0.0283. It is also better than the untuned linear model from the previous section. Yet the customized random forest remains the best model (mean r-squared = 0.662, mean rmse = 0.0246).
Note: This section did not tune the random forest models because the non-tuned section showed that further increasing the number of trees or the minimum node size did not have a sufficiently large impact on the model. Additionally, all the tuned models were created with a recipe that includes interactions, since the “Simple Model Selection” section showed that including interactions improved all models.
Having chosen the random forest with customized values as the best model, it was applied to the testing set. The table and plot below show a snapshot of the actual values vs. the predictions:
# A tibble: 10,812 × 2
target .pred
<dbl> <dbl>
1 0.0325 0.0537
2 0.110 0.0974
3 0.110 0.0744
4 0.0993 0.0810
5 0.0797 0.0797
6 0.0151 0.0254
7 0.0386 0.0537
8 0.0189 0.0742
9 0.101 0.0827
10 0.0314 0.0491
# … with 10,802 more rows
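A sketch of how these predictions and the metrics that follow could be generated with yardstick, assuming the customized random forest workflow is stored in a hypothetical `rf_workflow`:

```r
library(workflows)
library(yardstick)
library(dplyr)

rf_fit <- fit(rf_workflow, data = train_data)  # train on the 80% training split

test_preds <- predict(rf_fit, new_data = test_data) %>%
  bind_cols(test_data %>% select(target))       # pair predictions with actual targets

rsq(test_preds, truth = target, estimate = .pred)
rmse(test_preds, truth = target, estimate = .pred)
```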
The plot for the testing set very closely resembled the trend shown by the resamples. The metrics below confirm that this was indeed the case:
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.681
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.0240
The r-squared of the model on the testing set was 0.681, slightly better than its value in the resamples (0.662). The rmse on the testing set was 0.0240, also slightly better than its value in the resamples (0.0246). Because both metrics are similar in the resampling and testing phases, it was reasonable to conclude that the model was built properly.
Given that determining the main factor affecting the likes-to-views ratio is an inferential question, it was difficult to provide an accurate answer. However, by looking at the model coefficients, it was possible to obtain an approximation.
Ideally, one would look at the coefficients of the best-fitting model. However, since that model was a random forest, which does not produce coefficients, this was not possible. Instead, the linear model with interactions and tuned splines was used, since it performed best after the random forests. Its coefficients are shown below:
# A tibble: 435 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0692 0.00323 21.4 1.76e-101
2 dislikes -0.0198 0.000241 -82.2 0
3 duration_seconds -0.00564 0.000182 -31.0 4.23e-208
4 time_release_trending -0.00725 0.000156 -46.5 0
5 categoryId_X2 -0.000655 0.000230 -2.85 4.35e- 3
6 categoryId_X10 0.00383 0.000375 10.2 2.01e- 24
7 categoryId_X15 -0.000198 0.000226 -0.879 3.80e- 1
8 categoryId_X17 -0.00985 0.000309 -31.9 4.54e-220
9 categoryId_X19 0.000917 0.000302 3.03 2.42e- 3
10 categoryId_X20 -0.000804 0.000331 -2.43 1.51e- 2
# … with 425 more rows
The estimate is the coefficient assigned to a specific factor. Therefore, the estimates with the greatest magnitude represent the factors that carry the most weight when the target is predicted. It is important to highlight that the weight can be either negative or positive (e.g., a coefficient of -2 would be more influential than a coefficient of 1). Thus, the absolute values of the estimates were calculated and ordered; the result is shown below:
# A tibble: 321 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 comments_disabled_TRUE. 0.840 0.0308 -27.3 2.70e-162
2 comment_count_x_comments_disabled_TRU… 0.177 0.00654 -27.1 3.63e-160
3 comment_count_ns_14 0.150 0.00752 -19.9 1.18e- 87
4 comment_count_ns_01 0.133 0.00389 -34.3 9.33e-254
5 comment_count_ns_03 0.133 0.00429 -31.1 3.04e-210
6 comment_count_ns_05 0.129 0.00448 -28.9 1.57e-181
7 comment_count_ns_04 0.128 0.00442 -28.9 2.24e-181
8 comment_count_ns_02 0.127 0.00424 -29.9 6.76e-195
9 comment_count_ns_07 0.125 0.00464 -27.1 5.24e-160
10 comment_count_ns_06 0.125 0.00455 -27.5 6.48e-165
# … with 311 more rows
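A sketch of how this ranking could be computed with broom and dplyr, assuming the fitted linear workflow is stored in a hypothetical `lm_splines_fit`:

```r
library(broom)
library(dplyr)
library(workflows)

lm_splines_fit %>%
  extract_fit_parsnip() %>%             # pull the underlying parsnip/lm fit from the workflow
  tidy() %>%                            # one row per coefficient
  mutate(estimate = abs(estimate)) %>%  # rank coefficients by magnitude, ignoring sign
  arrange(desc(estimate))
```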
The variable comments_disabled_TRUE. was the factor with the most influence in the regression. The table below shows comments_disabled_TRUE. in the original regression (no absolute values):
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 comments_disabled_TRUE. -0.840 0.0308 -27.3 2.70e-162
Since the value was negative, it means that comments_disabled_TRUE. was the factor with the largest effect on the prediction and that videos with disabled comments are more likely to have a lower target ratio. Again, it was not possible to determine whether comments_disabled_TRUE. is, in effect, the most important factor, but based on the regression coefficients it is fair to say that, at the bare minimum, comments_disabled_TRUE. is a good candidate and is worth studying more closely.
Note: comment_count is correlated with view_count (0.54 across all data). Given that target is calculated from view_count, it is logical that comment_count (and, by extension, comments_disabled_TRUE.) would be a main driving factor.
Similarly, to gain insight into which categories might be the best/worst for optimizing the target ratio, the coefficients of the linear regression were inspected. However, given that the goal is to maximize target, the sign of the coefficient was taken into account this time.
The top 3 results are shown below:
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 categoryId_X23 0.00410 0.000250 16.4 4.59e-60
2 categoryId_X10 0.00383 0.000375 10.2 2.01e-24
3 categoryId_X22 0.00320 0.000267 12.0 4.97e-33
The last 3 results are shown below:
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 categoryId_X24 -0.000936 0.000348 -2.69 7.13e- 3
2 categoryId_X25 -0.00770 0.000216 -35.6 3.64e-274
3 categoryId_X17 -0.00985 0.000309 -31.9 4.54e-220
The results above hint that producing videos in categories 23 (comedy), 10 (music), and 22 (people and blogs) is likely to improve the target ratio. On the other hand, producing videos in categories 24 (entertainment), 25 (news and politics), and 17 (sports) is likely to have the opposite effect.
It was possible to predict the target ratio (likes over view count) at a moderate level by developing a random forest model that explained ~67% of the variance in target with a Root Mean Squared Error (rmse) of 0.024. A highly noticeable weakness of the model occurs at higher target values, where it tends to under-predict.
Additionally, the regression model with interactions and tuned splines suggested that comments_disabled_TRUE. might be one of the main factors (if not the main factor) driving the target ratio. Similarly, the model’s coefficients revealed that someone aiming to maximize the target ratio of their videos should probably produce music, comedy, or people-and-blogs videos. Neither of these statements can be asserted with certainty, but they provide a good starting point for further study.
The EDA showed that highly successful videos can reach target ratios of up to ~0.4. It also showed that the sooner a video becomes trending, the higher its target ratio tends to be. Regarding release day of the week, Fridays generally performed best (likely because people have more time to watch videos over the weekend), and Sundays performed worst (probably for the opposite reason).
To further improve the model, it is recommended to revisit the variables excluded from the prediction phase, with an emphasis on tags and channelId, since one might find that the presence of a specific tag tends to increase/decrease target, or that videos from specific channels tend to perform better.